For the convenience of the Windows developer community, I periodically compile the Zopfli and Brotli compressors from source, building for Win32 and code-signing the binaries (Interested? Get Zopfli.exe and Brotli.exe). After announcing the latest build on Twitter, I got an interesting question in reply:
While I try to use the latest compiler (VS2015 U1), I’ve never used PGO with C++ myself. Profile-guided optimization (PGO) requires that you first compile a special instrumented binary and run it against a training set of data. The profiling data it generates is then fed back into the compiler, which produces an optimized binary based on the observed execution of the code, tuning the hottest paths for speed.
As with any technology-adoption question, I wondered: (1) Is using PGO hard? and (2) Will it noticeably improve performance?
Spoiler alert: The answers are “No” and “Yes.”
I started by skimming this old blog post about PGO in Visual Studio; it looks pretty simple.
Optimizing a compressor with PGO is pretty straightforward. Unlike a GUI application with thousands of different operations, a compressor really only does one thing—compress.
I created a folder with files that I felt reasonably represent the types of data that I’ll be compressing with Zopfli (eight files captured via Fiddler). I could’ve experimented using a broader sample, but this seemed like a fine corpus of data with which to begin.
Click Build > Profile Guided Optimization > Instrument to generate an instrumented binary:
Right-click the project in the Solution Explorer pane, choose Properties, and select Debugging under the Configuration Properties category. Edit the Command Arguments to specify the training scenario. Zopfli accepts a list of files to compress, so we simply list all eight:
Close the dialog and click Build > Profile Guided Optimization > Run Instrumented/Optimized Application to run our application and generate profiling data:
The scenario then runs; it takes a bit of extra time due to the cost of the profiling instructions in the instrumented binary. After it completes, a new file (Zopfli!1.pgc) is written to the \Release\ folder; if we’d run the application multiple times to train different scenarios, Zopfli!2.pgc, Zopfli!3.pgc, etc., would be present as well.
Finally, click Build > Profile Guided Optimization > Optimize to generate a new build using the profiling data to select paths for optimization. You can see the effect of the profiling database on the Build in the Output window:
Now your executable has been optimized.
Pretty simple, right?
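If you'd rather script the cycle than click through menus, the same instrument / train / optimize sequence can be driven from a Developer Command Prompt. What follows is only a rough sketch of what the IDE steps correspond to, not a dump of what Visual Studio actually runs: it uses the MSVC `/GL` compiler switch and the `/LTCG:PGINSTRUMENT` and `/LTCG:PGOPTIMIZE` linker switches, and the source, executable, and training file names are placeholders I made up for illustration.

```python
from pathlib import Path


def pgo_build_steps(sources, exe_name, training_files):
    """Return the command lines for one instrument / train / optimize cycle.

    All file names passed in are the caller's own; nothing here is specific
    to Zopfli's real source layout.
    """
    # cl /c writes each object file into the current directory by default.
    objects = [Path(src).with_suffix(".obj").name for src in sources]
    return [
        # 1. Compile with whole-program optimization so code generation
        #    is deferred to link time.
        ["cl", "/c", "/O2", "/GL", *sources],
        # 2. Link an instrumented binary (this also creates the .pgd database).
        ["link", "/LTCG:PGINSTRUMENT", f"/OUT:{exe_name}", *objects],
        # 3. Run the training scenario; each run writes a .pgc file.
        [exe_name, *training_files],
        # 4. Relink, letting the recorded profile guide optimization.
        ["link", "/LTCG:PGOPTIMIZE", f"/OUT:{exe_name}", *objects],
    ]


if __name__ == "__main__":
    # Placeholder names: substitute your own sources and training corpus.
    steps = pgo_build_steps(
        ["first.c", "second.c"], "compressor.exe", ["sample1.html", "sample2.js"]
    )
    for step in steps:
        print(" ".join(step))
```

The sketch only prints the commands; to execute them, pass each list to `subprocess.run(step, check=True)` from a prompt where `cl` and `link` are on the PATH.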
Proper benchmarking is an entire field in itself, but let’s do the simplest thing that could possibly work to check the effectiveness of the optimizations:
We run the script a few times and see that the original unoptimized binary takes ~64 seconds to compress the corpus and the optimized binary takes ~46 seconds, a savings of about 28%.
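A comparison like this is easy to script. Here is a minimal Python sketch of such a timing harness. It is not the script behind the numbers above, and the helper names and command-line layout (two executables plus a corpus folder) are my own assumptions. It times each binary over the same files, keeps the fastest of a few runs, and reports the percentage saved.

```python
import subprocess
import sys
import time
from pathlib import Path


def time_command(args):
    """Run one command to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def best_of(args, runs=3):
    """Return the fastest of several runs to damp noise from other processes."""
    return min(time_command(args) for _ in range(runs))


def percent_saved(baseline_seconds, optimized_seconds):
    """Percentage of the baseline time that the optimized build saves."""
    return 100.0 * (baseline_seconds - optimized_seconds) / baseline_seconds


def main(argv):
    if len(argv) != 4:
        print("usage: bench.py BASELINE_EXE OPTIMIZED_EXE CORPUS_DIR")
        return
    baseline_exe, optimized_exe, corpus_dir = argv[1:]
    # Skip .gz files so output from earlier runs is not fed back in as input.
    corpus = sorted(
        str(p) for p in Path(corpus_dir).iterdir()
        if p.is_file() and p.suffix != ".gz"
    )
    baseline = best_of([baseline_exe, *corpus])
    optimized = best_of([optimized_exe, *corpus])
    print(f"baseline:  {baseline:.1f}s")
    print(f"optimized: {optimized:.1f}s")
    print(f"saved:     {percent_saved(baseline, optimized):.1f}%")


if __name__ == "__main__":
    main(sys.argv)
```

Plugging in the timings above, `percent_saved(64, 46)` works out to 28.125.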
You should run the same benchmark against a new set of data, just to ensure that your changes yield similar improvements (or at least no regression!) given different input data. A few runs of my PNGDistill tool (which uses Zopfli internally) show improvements of 10% to 25% when using the optimized compressor.
Pretty cool, right?