Getting Started with Profile Guided Optimization

For the convenience of the Windows developer community, I periodically compile the Zopfli and Brotli compressors from source, building for Win32 and code-signing the binaries (Interested? Get Zopfli.exe and Brotli.exe). After announcing the latest build on Twitter, I got an interesting question in reply:

Do you even PGO?

While I try to use the latest compiler (VS2015 U1), I’ve never used PGO with C++ myself. Profile guided optimization (PGO) requires that you first compile a special instrumented binary and run it against a training set of data. The profiling data it generates is then fed back into the compiler, which builds an optimized binary based on the observed execution of the code, tuning the hottest paths for speed.

As with any technology-adoption question, I wondered: (1) Is using PGO hard? and (2) Will it noticeably improve performance?

Spoiler alert: The answers are “No” and “Yes.”

I started by skimming this old blog post about PGO in Visual Studio; it looks pretty simple.

Optimizing a compressor with PGO is pretty straightforward. Unlike a GUI application with thousands of different operations, a compressor really only does one thing—compress.

I created a folder with files that I felt reasonably represent the types of data that I’ll be compressing with Zopfli (eight files captured via Fiddler). I could’ve experimented using a broader sample, but this seemed like a fine corpus of data with which to begin.

Click Build > Profile Guided Optimization > Instrument to generate an instrumented binary.

Right-click the project in the Solution Explorer pane and choose Debugging under the Configuration Properties category. Edit the Command Arguments to specify the training scenario. Zopfli accepts a list of files to compress, so we simply list all eight.

Close the dialog and click Build > Profile Guided Optimization > Run Instrumented/Optimized Application to run our application and generate profiling data.

The scenario then runs; it takes a bit of extra time due to the cost of the profiling instructions in the instrumented binary. After it completes, a new file (Zopfli!1.pgc) is written to the \Release\ folder; if we’d run the application multiple times to train different scenarios, Zopfli!2.pgc, Zopfli!3.pgc, etc. would be present as well.

Finally, click Build > Profile Guided Optimization > Optimize to generate a new build using the profiling data to select paths for optimization. You can see the effect of the profiling database on the build in the Output window.

Now your executable has been optimized.
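
If you’d rather script the process than click through the menus, the same three phases map to linker switches. Here’s a minimal sketch in Python; the object list and training filenames are placeholders, and it assumes the VS2015 tools are on the PATH and the sources were compiled with /GL:

    import subprocess

    # Placeholder: the project's object files, compiled with "cl /c /O2 /GL".
    OBJS = ["zopfli.obj"]

    def run(*cmd):
        print(">", " ".join(cmd))
        subprocess.check_call(cmd)

    # Phase 1: link an instrumented binary (the Instrument menu item).
    run("link", "/LTCG:PGINSTRUMENT", "/OUT:zopfli.exe", *OBJS)

    # Phase 2: train it; each run writes a Zopfli!n.pgc profile file.
    run("zopfli.exe", "sample1.html", "sample2.js", "sample3.css")

    # Phase 3: re-link using the collected profiles (the Optimize menu item).
    run("link", "/LTCG:PGOPTIMIZE", "/OUT:zopfli.exe", *OBJS)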

Pretty simple, right?

Proper benchmarking is an entire field in itself, but let’s do the simplest thing that could possibly work to check the effectiveness of the optimizations.

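A minimal timing harness in Python might look like this (the corpus path and build-folder names are placeholders):

    import pathlib
    import subprocess
    import time

    # Placeholder path: the folder of representative files captured earlier.
    CORPUS = [str(p) for p in pathlib.Path(r"C:\corpus").iterdir()]

    def time_compression(exe):
        # Zopfli accepts the list of files to compress on its command line.
        start = time.perf_counter()
        subprocess.check_call([exe] + CORPUS)
        return time.perf_counter() - start

    for exe in (r"Release-plain\zopfli.exe", r"Release-pgo\zopfli.exe"):
        print(f"{exe}: {time_compression(exe):.1f}s")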

We run the script a few times and see that the original unoptimized binary takes ~64 seconds to compress the corpus and the optimized binary takes ~46 seconds, a savings of almost 30%.

You should run the same benchmark against a new set of data, just to ensure that your changes yield similar improvements (or at least no regression!) given different input data. A few runs of my PNGDistill tool (which uses Zopfli internally) show improvements of 10% to 25% when using the optimized compressor.

Pretty cool, right?

-Eric Lawrence

Brotli

Regular readers of my blog know how much I love Zopfli, Google’s compression engine that often shrinks output by 5% or better when compared to the popular zlib engine. The beauty of Zopfli is that its output is compatible with all of the billions of existing DEFLATE encoders deployed worldwide, making its use an easy choice for any static content.

But imagine for a moment what compression ratios we could achieve if we weren’t limited by compatibility with existing decoders. If we could add a new compression engine to the web, what might it look like?

The Brotli compression engine, co-written by Jyrki Alakuijala (inventor of Zopfli), provides one answer. Brotli combines the LZ77 and Huffman algorithms of DEFLATE with a larger sliding window (up to 16MB[1] vs. DEFLATE’s 32KB) and context modeling; the specification also calls for a 122KB static dictionary.
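
To get a feel for what the bigger window buys, here’s a small sketch using the Python bindings published in the Brotli repository (an assumption: the Brotli pip package, exposing brotli.compress with quality and lgwin keywords, where lgwin is the log2 of the window size):

    import os
    import zlib

    import brotli  # assumption: pip install Brotli

    # 1MB of incompressible noise repeated four times: all of the file's
    # redundancy lives at a 1MB match distance.
    block = os.urandom(1 << 20)
    data = block * 4

    # DEFLATE's 32KB window can't reach back far enough to see the repeats.
    print("deflate:", len(zlib.compress(data, 9)))

    # lgwin=16 is a 64KB window (still too small); lgwin=24 is the full 16MB.
    print("brotli lgwin=16:", len(brotli.compress(data, quality=5, lgwin=16)))
    print("brotli lgwin=24:", len(brotli.compress(data, quality=5, lgwin=24)))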

Brotli In Browsers

Today, Brotli is the compression engine behind the newish WOFF2 font format, providing savings of approximately 25% over WOFF 1.0 fonts compressed with Zopfli. Not content to rest on their laurels, Google has announced their Intent to Implement Brotli as a general purpose HTTP Content-Encoding, allowing web developers to use it to compress script, stylesheets, svg, xml, and the like. Firefox beat Google to the finish and shipped Brotli support in the Firefox 44 Dev build.

Probably HTTPS only

Past attempts to add new compression algorithms (bzip2 and SDCH) have demonstrated that a non-trivial number of intermediaries (proxies, gateway scanners) fail when Content-Encodings other than GZIP and DEFLATE are specified, so Brotli will probably only be supported over HTTPS connections, where intermediaries are less likely to interfere.

    Accept-Encoding: br, gzip, deflate, sdch

Results

Facebook investigated Brotli and found it would save about 17% of CSS bytes and 20% of JavaScript bytes (compared with Zopfli). When run on the CSS and JavaScript from the Alexa top-300k, Brotli saved 12% of CSS bytes and 9% of JavaScript bytes when compared to gzip.

Running a few simple tests with Fiddler, I saw great results with Brotli:

    Content-Encoding: br

  • jQueryMobileMin.js
  • Microsoft homepage
  • A random giant XML documentation file
  • Microsoft Word Online WordEditor.js
  • Microsoft Word Online WordEditor.Wac.TellMeModel.js

Cloudflare’s blog post on Brotli includes some benchmarks too.

Brotli is optimized for decompression speed. When compressing, Brotli is slower than zlib’s deflate, but considerably faster than Zopfli, LZMA, and bzip2; given 1GB of extremely compressible content, Brotli finished compressing it to 3339 bytes after 301 seconds of CPU time. After 8040 seconds of CPU time, zopfli.exe crashed when a memory allocation failed.
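
Those trade-offs are easy to spot-check from Python (again assuming the Brotli pip bindings; the input filename is a placeholder):

    import time
    import zlib

    import brotli  # assumption: pip install Brotli

    data = open("sample.js", "rb").read()  # placeholder: any representative file

    def bench(label, fn):
        start = time.perf_counter()
        out = fn()
        print(f"{label} {len(out)} bytes in {time.perf_counter() - start:.3f}s")
        return out

    deflated = bench("deflate q9:", lambda: zlib.compress(data, 9))
    brotlied = bench("brotli q11:", lambda: brotli.compress(data, quality=11))
    bench("inflate:", lambda: zlib.decompress(deflated))
    bench("unbrotli:", lambda: brotli.decompress(brotlied))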

Running Brotli.exe

To make things simpler for Windows users, I’ve built the latest release (v0.3) from GitHub for Win32 using Visual Studio 2015. You can download the Authenticode-signed Windows Brotli.exe from my site.

To compress a file, specify the input and output filenames:

  • --in filename
  • --out compressed_filename

… and optionally specify any of the following arguments:

  • --quality n
  • --force
  • --verbose

The quality parameter controls the compression-speed vs. compression-ratio tradeoff; the higher the quality, the slower but denser the compression. The supported range is 0 to 11, and 11 is the default.

The force parameter instructs Brotli to overwrite the output file if it already exists.

The verbose parameter instructs Brotli to display its compression speed in megabytes per second upon completion.

To decompress a file, use the --decompress parameter and specify the input and output filenames:

  • --in compressed_filename
  • --out filename

… and optionally specify the --verbose parameter to instruct Brotli to display its decompression speed in megabytes per second upon completion.
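
Putting the parameters together, a round trip driven from Python might look like this; brotli.exe is assumed to be on the PATH, the filenames are placeholders, and the exact argument syntax may vary by build:

    import subprocess

    # Compress: quality 11 is the default anyway; --force overwrites any
    # existing output file and --verbose reports the speed in MB/s.
    subprocess.check_call([
        "brotli.exe", "--quality", "11", "--force", "--verbose",
        "--in", "page.html", "--out", "page.html.br",
    ])

    # Decompress the result back out again.
    subprocess.check_call([
        "brotli.exe", "--decompress",
        "--in", "page.html.br", "--out", "roundtrip.html",
    ])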

If you’d like to expose Brotli inside Fiddler 4.6.0.5+, place brotli.exe inside Fiddler’s \Tools\ subfolder and restart to see it appear on the Transformer tab.

Tracking Brotli

If you’d like to follow along, the Brotli source is on GitHub; alas, the Brotli Discussion forum is currently empty.

Assorted Further Investigations

1. Someone needs to register the brotli token in the IANA registry (although Google’s SDCH and Microsoft’s Xpress aren’t listed there either).

2. Implementers should consider protections against “brotli bombing” DoS attacks. Brotli’s high compression ratio makes attacks even cheaper for the bad guys. A trivial test of compressing a file containing all zeros shows that Brotli can achieve a compression ratio of at least 386516:1, meaning that 1389 bytes of compressed data can blow up to 512MB when uncompressed. In contrast, DEFLATE has a maximum compression ratio approaching 1032:1, so an attacker would need to send 375 times as much data over the network to achieve a similar result (see the sketch after this list). That being said, even DEFLATE can result in a denial of service, as with this 5.8MB PNG file that can require allocation of up to 141GB of memory.

3. Brotli’s use in WOFF2 means that browsers have already taken on its attack surface. However, not all attack surface is created equal; WOFF2 fonts can be decoded inside a very restricted sandbox. When Chrome 1.0 shipped, I was surprised to learn its HTTP decompressors ran in a full-trust process; it turns out that is still the case today, which makes fuzzing the decompressors very interesting to an attacker.

4. Brotli’s static dictionary was generated from a broad corpus of content, but considering the most likely use cases (static files), it may not be optimal for them. At this point, it’s probably too late to change it.

5. When used as a Content-Encoding, will brotli be used “bare” or in some framing format (e.g. with a trailing CRC and size marker)? Will it have magic bytes that will allow sniffing? (Per @mcmanusducksong, Firefox is going with a bare stream and no magics. boo)

6. While not terribly relevant to my scenarios, it turns out Google builds a lot of compression engines I’d never heard of, e.g. Snappy and Gipfeli. When compression speed is more important than ratio, they’re worth a look.

7. Brotli makes the most sense for pre-compression of static content; to that end, someone needs to xcopy the http_gzip_static module for nginx and make a few tweaks to create a new http_brotli_static module. While the nginx team may eventually release one, Google has already released a brotli module that supports both dynamic and static compression.
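
The bomb ratios from item 2 are easy to reproduce; here’s a small sketch using Python’s built-in zlib plus the Brotli pip bindings (an assumption, as above):

    import zlib

    import brotli  # assumption: pip install Brotli

    size = 64 * 1024 * 1024   # 64MB of zeros; 512MB behaves the same, just slower
    zeros = bytes(size)

    deflated = zlib.compress(zeros, 9)
    brotlied = brotli.compress(zeros)

    # DEFLATE tops out near 1032:1; Brotli's coding of long runs does far better.
    print(f"deflate: {len(deflated)} bytes ({size // len(deflated)}:1)")
    print(f"brotli:  {len(brotlied)} bytes ({size // len(brotlied)}:1)")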

[1] While Brotli can use a 16MB window, for performance reasons it appears that constraining the window to 4MB is the plan for most scenarios.

Zopfli All The Things

I’ve written about Zopfli quite a bit in the past, and even wrote a tool to apply it to PNG files. For fun, I had a look at one of the most optimized pages in the world: Google.com, through the lens of Zopfli.

Here are the basic resources delivered by the Google homepage:

[Table: Zopfli “what if” breakdown of the Google homepage resources, with a color-coded Savings column]

This breakdown shows that Google isn’t optimizing their own compression using the compressor they wrote. The Savings column shows the number of bytes saved by using Zopfli over whatever Google used to compress each asset. Using the default settings, in an ideal world Google could save up to 16.5KB, almost 5% of the bytes transferred, by using Zopfli.

I’ve color-coded the column based on how practical I believe the savings to be—the green numbers are the static images, where there’s no question the size benefit could be realized. The yellow numbers are cases where script files are compressed; given the complicated query-string parameters, I’m betting these scripts are dynamically generated, and the compression cost of Zopfli might not be reasonable. The red number is the homepage itself, which probably isn’t reasonable to compress with Zopfli, as it’s certainly generated dynamically.

So, most likely, the savings of a practical Zopfli deployment on the homepage would be about 3.7KB; savings are much greater on other pages and other sites.

More interesting, however, is the Google API CDN, which hosts scripts for other sites; optimizing these would take a minute or two at most and make every site that uses them faster.

[Table: potential Zopfli savings for scripts hosted on the Google API CDN]

Use Zopfli; give the tubes a little bit more room.

-Eric

PS: You may already have zopfli.exe on your system; Fiddler installs a copy to its \Tools\ subfolder!
