
While there are many different ways for servers to stream data to clients, the Server-sent Events / EventSource interface is one of the simplest. Your code creates an EventSource and subscribes to its onmessage callback; a minimal sketch (the endpoint URL here is illustrative):
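
const source = new EventSource('/events');  // '/events' is an assumed endpoint
source.onmessage = (event) => {
  console.log('New data from the server:', event.data);
};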

Implementing the server side is almost as simple: your handler just prefaces each piece of data it wants to send to the client with the string data: and ends it with a double line-ending (\n\n). Easy peasy. You can see the API in action in this simple demo.
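
For illustration, a minimal Node.js sketch of such a handler (not the original demo's code; the port and payload are arbitrary):

const http = require('http');

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/event-stream' });
  // Prefix each message with "data: " and terminate it with a blank line
  const timer = setInterval(() => {
    res.write('data: ' + new Date().toISOString() + '\n\n');
  }, 1000);
  req.on('close', () => clearInterval(timer));
}).listen(8080);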

I’ve long been sad that we didn’t manage to get this API into Internet Explorer or the Legacy Edge browser. While many polyfills for the API exist, I was happy that we finally have EventSource in the new Edge.

Yay! \o/

Alas, I wouldn’t be writing this post if I hadn’t learned something new yesterday.

Last week, a customer reached out to complain that the new Edge and Chrome didn’t work well with their webmail application. After they used the webmail site for some indeterminate amount of time, they noticed that its performance slowed to a crawl: switching between messages would take tens of seconds or longer, and the problem reproduced regardless of the speed of the network. The only way to reliably resolve the problem was to either close the tabs they’d opened from the main app (e.g. the individual email messages could be opened in their own tabs) or to restart the browser entirely.

As the networking PM, I was called in to figure out, over a video conference, what was going wrong. I asked the user to open the F12 Developer Tools and we looked at the Network console together. Each time the user clicked on a message, new requests were created and sat in the (pending) state for a long time, meaning that the requests were being queued and weren’t even going to the network promptly.

But why? Diagnosing this remotely wasn’t going to be trivial, so I had the user generate a Network Export log that I could examine later.

When I examined the log using the online viewer, the problem became immediately clear. On the Sockets tab, the webmail server showed 19 requests in the Pending state and 6 Active connections to the server, none of which were idle. The fact that there were six connections strongly suggested that the server was using HTTP/1.1 rather than HTTP/2, and a quick look at the HTTP/2 tab confirmed it. On the Events tab, we found five outstanding URLRequests to a URL that strongly suggests it’s being used as an EventSource:

Each of these sockets is in the READING_RESPONSE state, and each has returned just ten bytes of body data to its EventSource. The web application uses one EventSource instance per tab, and the user has five tabs open to the app.

And now everything falls into place. Browsers limit themselves to six concurrent connections per server. When the server supports HTTP/2, browsers typically need just one connection, because HTTP/2 can multiplex many (typically 100) streams onto a single connection. HTTP/1.1 doesn’t afford that luxury, so every long-lived connection used by a page ties up one of those six connections for as long as it’s open. For this user, five of the six connections were occupied by EventSource streams, so all of the rest of their traffic to the server had to squeeze through the single remaining HTTP/1.1 connection. And because HTTP/1.1 doesn’t allow multiplexing, every action in the UI was blocked on a very narrow, head-of-line-blocked pipe.
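
You can sketch the same starvation in a few lines of JavaScript (both endpoint names here are illustrative); against an HTTP/1.1 server, the final request simply queues:

// Exhaust the six-connection-per-host pool with long-lived streams
for (let i = 0; i < 6; i++) {
  new EventSource('/stream');   // '/stream' is an assumed endpoint
}
// On HTTP/1.1, this request now waits until one of the six connections frees up
fetch('/api/messages').then((r) => console.log(r.status));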

Looking in the Chrome bug tracker, we find this core problem (“SSE connections can starve other requests”) resolved “By Design” six years ago.

Now, I’m always skeptical when reading old bugs, because many issues are fixed over time, and it’s often the case that an old resolution is no longer accurate in the modern world. So I built a simple repro script for Meddler. The script returns one of four responses:

  • An HTML page that consumes an EventSource
  • An HTML page containing 15 frames pointed at the previous HTML page
  • An event source endpoint (text/event-stream)
  • A JPEG file (to test whether connection limits apply across both EventSources and other downloads)

And sure enough, when we load the page we see that only six frames are getting events from the EventSource, and the images that are supposed to load at the bottom of the frames never load at all:

Similarly, if we attempt to load the page in another tab, we find that it doesn’t even load, with a status message of “Waiting for available socket…”

The web app owners should definitely enable HTTP/2 on their server, which will make this problem disappear for almost all of their users.

However, even HTTP/2 is not a panacea, because the user might be behind a “break-and-inspect” proxy that downgrades connections to HTTP/1.1, or the browser might conceivably limit parallel requests on HTTP/2 connections on slow networks. As noted in the “By Design” issue, a web app that depends on EventSource in multiple tabs might instead use a BroadcastChannel or a SharedWorker to share a single EventSource connection across all of the app’s tabs, as sketched below.
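
A minimal sketch of that sharing pattern, with names of my own invention for the worker script, event endpoint, and channel:

// sse-worker.js: a SharedWorker runs once per origin, however many tabs are open
const channel = new BroadcastChannel('app-events');
const source = new EventSource('/events');
source.onmessage = (event) => channel.postMessage(event.data);

// In each tab: start (or attach to) the shared worker, then listen on the channel
new SharedWorker('sse-worker.js');
new BroadcastChannel('app-events').onmessage = (event) =>
  console.log('Got new data:', event.data);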

Alternatively, swapping an EventSource-based architecture for one based on WebSocket (even one that exposes itself as an EventSource polyfill) will also likely resolve the problem. That’s because, even if the client or server doesn’t support routing WebSockets over HTTP/2, the WebSockets-per-host limit is 255 in Chromium and 200 in Firefox.
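
The shape of such a polyfill can be tiny; a sketch only (a real polyfill must also handle reconnection, named event types, and last-event IDs):

function makeEventSourceLike(url) {
  const ws = new WebSocket(url);
  const shim = { onmessage: null };
  ws.onmessage = (event) => {
    if (shim.onmessage) shim.onmessage({ data: event.data });
  };
  return shim;
}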

Stay responsive out there!

-Eric

Safari introduced a browser privacy mode back in 2005, and InPrivate Mode followed in Internet Explorer 8 with the goal of helping users improve their privacy against both local and remote threats.

All leading browsers offer a “Private Mode” and they all behave in the same general ways.

HTTP Caching

While in Private mode, browsers typically ignore any previously cached resources and cookies. Similarly, the Private mode browser does not preserve any cached resources beyond the end of the browser session. These features help prevent a revisited website from trivially identifying a returning user (e.g. if the user’s identity were cached in a cookie or JSON file on the client) and help prevent “traces” that might be seen by a later user of the device.

In Firefox’s and Chrome’s Private modes, a memory-backed cache container is used for the HTTP cache, and its memory is simply freed when the browser session ends. Unfortunately, WinINET never implemented a memory cache, so in Internet Explorer InPrivate sessions, data is cached in a special WinINET cache partition on disk which is “cleaned up” when the InPrivate session ends.

Because this cleanup process may be unreliable, in 2017, Edge made a change to simply disable the cache while running InPrivate, a design decision with significant impact on the browser’s network utilization and performance. For instance, consider the scenario of loading an image gallery that shows one large picture per page and clicking “Next” ten times:

Chart: network requests and bytes downloaded for the gallery scenario, InPrivate vs. regular mode

Because the gallery reuses some CSS, JavaScript, and images across pages, disabling the HTTP cache means that these resources must be re-downloaded on every navigation, resulting in 50 additional requests and a 118% increase in bytes downloaded for those eleven pages. Sites that reuse even more resources across pages will be more significantly impacted.

Another interesting quirk of Edge’s InPrivate implementation is that the browser will not download FavIcons while InPrivate. Surprisingly (and likely accidentally), the suppression of FavIcon downloads also occurs in any non-InPrivate windows so long as any InPrivate window is open on the system.

Web Platform Storage

Akin to the HTTP caching and cookie behaviors, browsers running in Private mode must restrict access to web platform storage (e.g. HTML5 localStorage, ServiceWorker/Cache API, IndexedDB) to help prevent association/identification of the user and to avoid leaving traces behind locally. In some browsers and scenarios, storage mechanisms are simply backed by an “ephemeral partition,” while in others the DOM APIs providing access to storage are simply configured to return “Access Denied” errors.
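
That second behavior is easy to observe from script; a minimal probe (the key name is arbitrary):

function storageAvailable() {
  try {
    localStorage.setItem('__probe__', '1');
    localStorage.removeItem('__probe__');
    return true;
  } catch (e) {
    return false;  // some private modes throw an access or quota error here
  }
}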

You can explore the behavior of various storage mechanisms by loading this test page in Private mode and comparing to the behavior in non-Private mode.

Within IE and Edge’s InPrivate mode, localStorage uses an in-memory store that behaves exactly like the sessionStorage feature. This means that InPrivate’s storage is (incorrectly) not shared between tabs, even tabs in the same browser instance.

Network Features

Beyond the typical Web Storage scenarios, browsers’ Private modes should also undertake efforts to prevent association of users’ Private-instance traffic with non-Private-instance traffic. Impacted features here include anything with a component that behaves “like a cookie,” including TLS Session Tickets, TLS resumption, HSTS directives, TCP Fast Open, Token Binding, ChannelID, and the like.

Automatic Authentication

In Private mode, a browser’s AutoComplete features should be set to manual-fill mode to prevent a “NameTag” vulnerability, whereby a site can simply read an auto-filled username field to identify a returning user.
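
To illustrate why this matters, a sketch of the probe a site could run (the selector and reporting endpoint are hypothetical):

// Wait briefly for the browser to autofill, then read what it filled in
const field = document.querySelector('input[autocomplete="username"]');
setTimeout(() => {
  if (field && field.value) {
    navigator.sendBeacon('/track', field.value);  // '/track' is hypothetical
  }
}, 500);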

On Windows, most browsers support silent and automatic authentication using the current user’s Windows login credentials and either the NTLM or Kerberos schemes. Typically, browsers are only willing to automatically authenticate to sites on “the Intranet.” Some browsers behave differently when in Private mode, preventing silent authentication and forcing the user to manually enter or confirm an authentication request.

In Firefox Private Mode and Edge InPrivate, the browser will not automatically respond to an HTTP/401 challenge for Negotiate/NTLM credentials.

In Chrome Incognito, Brave Incognito, and IE InPrivate, the browser will automatically respond to an HTTP/401 challenge for Negotiate/NTLM credentials, even in Private mode.

Notes:

  • In Edge, the security manager returns MustPrompt when queried for URLACTION_CREDENTIALS_USE.
  • Unfortunately Edge’s Kiosk mode runs InPrivate, meaning you cannot easily use Kiosk mode to implement a display that projects a dashboard or other authenticated data on your Intranet.
  • For Firefox to support automatic authentication at all, the network.negotiate-auth.allow-non-fqdn and/or network.automatic-ntlm-auth.allow-non-fqdn preferences must be adjusted.

Detection of Privacy Modes

While browsers generally do not try to advertise to websites that they are running inside Private modes, it is relatively easy for a website to feature-detect this mode and behave differently. For instance, some websites like the Boston Globe block visitors in Private Mode (forcing login) because they want to avoid circumvention of their “Non-logged-in users may only view three free articles per month” paywall logic.

Sites can detect privacy modes by looking for the behavioral changes that signal that a given browser is running in Private mode; for instance, indexedDB is disabled in Edge while InPrivate. Detectors have been built for each browser and wrapped in simple JavaScript libraries. Defeating Private mode detectors requires significant investment on the part of browsers (e.g. “implement an ephemeral mode for indexedDB”), so fixes lagged until mainstream news sites (e.g. Boston Globe, New York Times) began using these detectors more broadly.
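
A sketch of the Edge-specific check described above; behavioral probes like this one vary per browser and per version:

function looksLikeEdgeInPrivate() {
  // In legacy Edge, window.indexedDB was unavailable while InPrivate
  return !window.indexedDB;
}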


Advanced Private Modes

Generally, mainstream browsers have taken a middle ground in their privacy features, trading off some performance and some convenience for improved privacy. Users who are very concerned about maintaining privacy from a wider variety of threat actors need to take additional steps, like running their browser in a discardable Virtual Machine behind an anonymizing VPN/Proxy service, disabling JavaScript entirely, etc.

The Brave Browser offers a “Private Window with Tor” feature that routes traffic over the Tor anonymizing network; for many users this might be a more practical choice than the highly privacy-preserving Tor Browser Bundle, which offers additional options like built-in NoScript support to help protect privacy.

-Eric

I’ve previously talked about using PNGDistill to optimize batches of images, but in today’s quick post, I’d like to show how you can use the tool to check whether images in your software binaries are well optimized.

For instance, consider Chrome. Chrome uses a lot of PNGs, all mashed together in a single resources.pak file. Tip: search files for the string IEND to find embedded PNG files.

With Fiddler installed, go to a command prompt and enter the following commands:

cd %USERPROFILE%\AppData\Local\Google\Chrome SxS\Application\60.0.3079.0
mkdir temp
copy resources.pak temp
cd temp
"C:\Program Files (x86)\Fiddler2\tools\PngDistill.exe" resources.pak grovel
for /f "delims=|" %f in ('dir /b *.png') do "c:\program files (x86)\fiddler2\tools\pngdistill" "%f" log

You now have a PNGDistill.LOG file showing the results. Open it in a CSV viewer like Excel or Google Sheets. You can see that Chrome is pretty well-optimized, with under 3% bloat.

PNGDistill.LOG results for Chrome’s resources.pak

Let’s take a look at Brave, which uses electron_resources.pak:

PNGDistill.LOG results for Brave’s electron_resources.pak

Brave does even better! Firefox has images in a few different files; I found a bunch in a file named omni.ja:

PNGDistill.LOG results for Firefox’s omni.ja

The picture gets less rosy elsewhere though. Microsoft’s MFC140u.dll’s images are 7% bloat:

PNGDistill.LOG results for MFC140u.dll

Windows’ Shell32.dll uses poor compression:

PNGDistill.LOG results for Shell32.dll

Windows’ ImageRes.dll has over 5 megabytes (nearly 20% of image weight) bloat:

PNGDistill.LOG results for ImageRes.dll

And the Windows 10’s ApplicationFrame.dll is well-compressed, but the images have nearly 87% metadata bloat:

PNGDistill.LOG results for ApplicationFrame.dll

Does ImageBloat Matter?

Well, yes, it does. Even when software isn’t distributed by webpages, image bloat still takes up precious space on your disk (which might be limited in the case of an SSD), and it burns cycles and memory to process or discard unneeded metadata.

Optimize your images. Make it automatic via your build process and test your binaries to make sure it’s working as expected.

-Eric

PS: Rafael Rivera wrote a graphical tool for finding metadata bloat in binaries; check it out.

PPS: I ran PNGDistill against all of the PNGs embedded in EXE/DLLs in the Windows\System32 folder. 33MB of bloat * 270M devices = 8.9 petabytes of wasted storage for imagebloat in System32 alone.

Windows 10 Build 14986 adds support for Brotli compression to the Edge browser (but, somewhat surprisingly, not IE11). So at the end of 2016, we now have support for this improved compression algorithm in Chrome, Firefox, Edge, Opera, Brave, Vivaldi, and the long tail of browsers based on Chromium. Of modern browsers, only Apple is a holdout, with a “Radar” feature request logged against Safari but no public announcements.

Unfortunately, behavior across browsers varies at the edges:

  • Edge advertises Brotli support on both HTTP and HTTPS requests, and decodes Brotli responses on both.
  • Chrome advertises Brotli only on HTTPS requests, but will decode Brotli responses over both HTTPS and HTTP.
  • Firefox advertises Brotli only on HTTPS requests, and will not decode Brotli responses over HTTP.

There’s nothing horribly broken here: sites can safely serve Brotli content to clients that ask for it, and those clients will probably decode it. The exception is when the request goes over HTTP. The reason Firefox and Chrome limit their request for Brotli to HTTPS is that, historically, middleboxes (like proxies and gateway filters) have been known to corrupt compression schemes other than gzip and deflate. This proved to be such a big problem in the rollout of SDCH (a now-defunct compression algorithm Chrome supported) that the Brotli implementers decided to try to avoid the issue by requiring a secure transport.
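
The server side of that negotiation is simple content-encoding selection; a present-day Node.js sketch (Node gained built-in Brotli support long after this post, and the function shape here is my own):

const zlib = require('zlib');

function sendCompressed(req, res, body) {
  const accepts = req.headers['accept-encoding'] || '';
  if (/\bbr\b/.test(accepts)) {
    // The client explicitly advertised Brotli (Chrome/Firefox do so only over HTTPS)
    res.setHeader('Content-Encoding', 'br');
    res.end(zlib.brotliCompressSync(body));
  } else {
    // Fall back to the universally safe gzip
    res.setHeader('Content-Encoding', 'gzip');
    res.end(zlib.gzipSync(body));
  }
}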

-Eric

PS: Major sites, including Facebook and Google, have started deploying Brotli in production. If your site pulls fonts from Google Fonts, you’re already using Brotli today! In unrelated news, the 2016 Performance Calendar includes a post on serving Brotli from CDNs that don’t explicitly support it yet. Another recent post shows how to pair maximal compression for static files with fast compression for dynamically generated responses.

  • The most common exception logged by Fiddler telemetry is OutOfMemoryException.
  • Yesterday, a Facebook friend lamented: “How does firefox have out of memory errors so often while only taking up 1.2 of my 8 gigs of ram?”
  • This morning, a Python script running on my machine as a part of the Chromium build process failed with a MemoryError, despite 22gb of idle RAM.

Most platforms return an “Out of Memory” error if an attempt to allocate a block of memory fails, but the root cause of that problem very rarely has anything to do with truly being “out of memory.” That’s because, on almost every modern operating system, the memory manager will happily use your available hard disk space as a place to store pages of memory that don’t fit in RAM; your computer can usually allocate memory until the disk fills up (or a swap limit is hit; in Windows, see System Properties > Performance Options > Advanced > Virtual memory).

So, what’s happening?

In most cases, the system isn’t out of RAM—instead, the memory manager simply cannot find a contiguous block of address space large enough to satisfy the program’s allocation request.

In each of the failure cases above, the process was 32bit. It doesn’t matter how much RAM you have, running in a 32bit process nearly always means that there are fewer than 3 billion addresses1 at which the allocation can begin. If you request an allocation of n bytes, the system must have n unused addresses in a row available to satisfy that request.

Making matters much worse, every active allocation in the program’s address space can cause “fragmentation” that can prevent future allocations by splitting available memory into chunks that are individually too small to satisfy a new allocation with one contiguous block.
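
You can observe the contiguity requirement (though not true address-space fragmentation) from script; a sketch, noting that the size at which allocation fails varies by engine, platform, and bitness:

// An ArrayBuffer must occupy one contiguous block; a huge request can
// throw a RangeError even when plenty of total memory remains free
try {
  const big = new ArrayBuffer(4 * 1024 * 1024 * 1024); // 4gb in one block
  console.log('Allocated', big.byteLength, 'bytes');
} catch (e) {
  console.log('Allocation failed:', e.message);
}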

Out-of-address-space

Running out of address space most often occurs when dealing with large data objects like arrays; in Fiddler, a huge server response like a movie or .iso download can be problematic. In my Python script failure this morning, a 1.3gb file (chrome_child.dll.pdb) needed to be loaded so its hash could be computed. In some cases, restarting a process may resolve the problem by either freeing up address space, or by temporarily reducing fragmentation enough that a large allocation can succeed.

Running 64-bit versions of programs will usually eliminate problems with address space exhaustion, although you can still hit “out-of-memory” errors before your hard disk is full. For instance, to limit their capabilities and prevent “runaway” allocations, Chrome’s untrusted rendering processes run within a Windows job object with a 4gb memory allocation limit:

Job limit 4gb shown in SysInternals Process Explorer

Elsewhere, the .NET runtime restricts individual array dimensions to 2^31 entries, even in 64bit processes2.

-Eric Lawrence

1 If a 32bit application has the LARGEADDRESSAWARE flag set, it has access to a full 4gb of address space when run on a 64bit version of Windows.

2 So far, four readers have written to explain that the gcAllowVeryLargeObjects flag removes this .NET limitation. It does not. This flag allows objects which occupy more than 2gb of memory, but it does not permit a single-dimensional array to contain more than 2^31 entries.

Fiddler’s Transformer tab has long been a simple way to examine the use of HTTP compression of web assets, especially as new compression engines (like Zopfli) and compression formats (like Brotli) arose. However, the one-Session-at-a-time design of the Transformer tab makes it cumbersome for evaluating the compressibility of an entire page or series of pages.

Introducing Compressibility

Compressibility is a new Fiddler 4 add-on1 that allows you to easily find opportunities for compression savings across your entire site. Each resource dropped on the Compressibility tab is recompressed using several compression algorithms and formats, and the resulting file sizes are recorded:

Compressibility tab

You can select multiple resources to see the aggregate savings:

Total savings text

WebP savings are only computed for PNG and JPEG images; Zopfli savings for PNG files are computed by using the PNGDistill tool rather than just using Zopfli directly. Zopfli is usable by all browsers (as it is only a high-efficiency encoder for Deflate) while WebP is supported only by Chrome and Opera. Brotli is available in Chrome and Firefox, but limited to use from HTTPS origins.

Download the Addon…

To show the Compressibility tab, simply install the add-on, restart Fiddler, and choose Compressibility from the View > Tabs menu2.

View > Tabs > Compressibility menu screenshot

The extension also adds ToWebP Lossless and ToWebP Lossy commands to the ImageView Inspector’s context menu:

ImagesMenuExt

I hope you find this new add-on useful; please send me your feedback so I can enhance it in future updates!

-Eric

1 Note: Compressibility requires Fiddler 4, because there’s really no good reason to use Fiddler 2 any longer, and Fiddler 4 resolves a number of problems and offers extension developers the ability to utilize newer framework classes.

2 If you love Compressibility so much that you want it to be shown in the list of tabs by default, type prefs set extensions.Compressibility.AlwaysOn true in Fiddler’s QuickExec box and hit enter.

For the convenience of the Windows developer community, I periodically compile the Zopfli and Brotli compressors from source, building for Win32 and code-signing the binaries (Interested? Get Zopfli.exe and Brotli.exe). After announcing the latest build on Twitter, I got an interesting question in reply:

Do you even PGO?

While I try to use the latest compiler (VS2015 U1), I’d never used PGO with C++ myself. Profile-guided optimization requires that you first compile a special instrumented binary that you run against a training set of data. The generated profiling data is fed into the compiler, and it compiles an optimized binary based on the observed execution of the code, tuning the hottest paths for speed.

As with any technology-adoption question, I wondered: 1) Is using PGO hard? and 2) Will it noticeably improve performance?

Spoiler alert: The answers are “No” and “Yes.”

I started by skimming this old blog about PGO in Visual Studio; it looks pretty simple.

Optimizing a compressor with PGO is pretty straightforward. Unlike a GUI application with thousands of different operations, a compressor really only does one thing—compress.

I created a folder with files that I felt reasonably represent the types of data that I’ll be compressing with Zopfli (eight files captured via Fiddler). I could’ve experimented using a broader sample, but this seemed like a fine corpus of data with which to begin.

Click Build > Profile Guided Optimization > Instrument to generate an instrumented binary:

Build > Profile Guided Optimization > Instrument

Right-click the project in the Solution Explorer pane and choose Debugging under the Configuration Properties category. Edit the Command Arguments to specify the training scenario. Zopfli accepts a list of files to compress, so we simply list all eight:

Edit Command arguments

Close the dialog and click Build > Profile Guided Optimization > Run Instrumented/Optimized Application to run our application and generate profiling data:

Run Instrumented/Optimized Application

The scenario then runs; it takes a bit of extra time due to the cost of the profiling instructions in the instrumented binary. After it completes, a new file (Zopfli!1.pgc) is written to the \Release\ folder; if we’d run the application multiple times to train different scenarios, Zopfli!2.pgc, Zopfli!3.pgc, etc. would be present as well.

Finally, click Build > Profile Guided Optimization > Optimize to generate a new build using the profiling data to select paths for optimization. You can see the effect of the profiling database on the Build in the Output window:

Build output shows optimizations

Now your executable has been optimized.

Pretty simple, right?

Proper benchmarking is an entire field itself, but let’s do the simplest thing that could possibly work to check the effectiveness of the optimizations:

Script runs optimized and unoptimized

We run the script a few times and see that the original unoptimized binary takes ~64 seconds to compress the corpus and the optimized binary takes ~46 seconds, a savings of almost 30%.

ZopFli PGO vs non PGO

You should also run the benchmark against data outside the training set, to ensure that the optimization yields similar improvements (or at least no regression!) on different input. A few runs of my PNGDistill tool (which uses Zopfli internally) show improvements of 10% to 25% when using the optimized compressor.

Pretty cool, right?

-Eric Lawrence