
I’ve written about Browser Proxy Configuration a few times over the years, and I’m delighted that Chromium has accurate & up-to-date documentation for its proxy support.

One thing I’d like to call out is that Microsoft Edge’s new Chromium foundation introduces a convenient new feature for debugging the behavior of Proxy AutoConfiguration (PAC) scripts.

To use it, simply add alert() calls to your PAC script, like so:

alert("!!!!!!!!! PAC script start parse !!!!!!!!");
function FindProxyForURL(url, host) {
  alert("Got request for (" + url + " with host: " + host + ")");
  return "PROXY 127.0.0.1:8888";
}
alert("!!!!!!!!! PAC script done parse !!!!!!!!");

Then, collect a NetLog trace from the browser:

msedge.exe --log-net-log=C:\temp\logFull.json --net-log-capture-mode=IncludeSocketBytes

…and reproduce the problem.

Save the NetLog JSON file and load it into the NetLog Viewer. Search the Events tab for PAC_JAVASCRIPT_ALERT events.

Even without adding new alert() calls, you can also look for HTTP_STREAM_JOB_CONTROLLER_PROXY_SERVER_RESOLVED events to see which proxy the proxy-resolution process selected.

One limitation of the current logging is that if the V8 Proxy Resolver process crashes (e.g. because Citrix injected a DLL into it), there’s no mention of the crash in the NetLog; the log will simply show DIRECT. Until the logging is enhanced, users can hit SHIFT+ESC to launch the browser’s task manager and check whether the utility process is alive.

Try using the System Resolver

In some cases (e.g. when using DirectAccess), you might want to try using Windows’ proxy resolution code rather than the code within Chromium.

The --winhttp-proxy-resolver command line argument will direct Chrome/Edge to call out to Windows’ WinHTTP Proxy Service for PAC processing.
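
For example, launching Edge with the flag looks like this (the same argument works for chrome.exe):

msedge.exe --winhttp-proxy-resolver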

Differences in PAC Processing

  • Internet Explorer/WinINET/Edge Legacy call the PAC script’s FindProxyForURLEx function (introduced to unlock IPv6 support), if present, and FindProxyForURL if not.
  • Chrome/Edge/Firefox only call the FindProxyForURL function and do not call the Ex version.
  • Internet Explorer/WinINET/Edge Legacy expose a getClientVersion API that is not defined in other PAC environments.
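
If a single PAC script must serve both families of clients, one approach (a minimal sketch, assuming the script doesn’t need any Ex-only IPv6 helper APIs) is to implement FindProxyForURL and have the Ex version delegate to it:

function FindProxyForURL(url, host) {
  // Called by Chrome, Edge, and Firefox (and by WinINET if no Ex version exists)
  return "PROXY 127.0.0.1:8888";
}

function FindProxyForURLEx(url, host) {
  // Preferred by IE/WinINET/Edge Legacy; may be called with IPv6 addresses
  return FindProxyForURL(url, host);
}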

Notes for Other Browsers

  • Prior to Windows 8, IE showed PAC alert() notices in a modal dialog box. It no longer does so; alert() is now a no-op.
  • Firefox shows alert() messages in the Browser Console (hit Ctrl+Shift+J); note that Firefox’s Browser Console is not the Web Console where web pages’ console.log statements are shown.

-Eric

Representatives of Brave, Mozilla Firefox, Google Chrome, and Microsoft Edge presented on their browsers’ current privacy work at the Enigma 2020 conference in late January. The talks were mostly high-level, but there were a few feature-level slides for each browser.

My ~10 minute presentation on Microsoft Edge was first, followed by Firefox, Chrome, and Brave.

At the 40-minute mark, a 35-minute Q&A session starts, first with questions from the panel moderator, followed by questions from the audience.

Prelude

In late 2004, I was the Program Manager for Microsoft’s clipart website, delivering a million pieces of clipart to Microsoft Office customers every day. It was great fun. But there was a problem– our “Clip of the Day” feature, meant to spotlight a new and topical piece of clipart every day, wasn’t changing as expected.

After much investigation (could the browser itself really be wrong?!?), I wrote to the IE team to complain about what looked like bugs in its caching implementation. In a terse reply, I was informed that the handful of people then left on the browser team were only working on critical security fixes, and my caching problems weren’t nearly important enough to even look at.

That night, unable to sleep, I tossed and turned and fumed at the seeming arrogance of the job link in the respondent’s email signature… “Want to change the world? Join the new IE team today!”

Gradually, though, I calmed down and reasoned it through… While the product wasn’t exactly beloved, everyone I knew with a computer used Internet Explorer. Arrogant or not, it was probably accurate that there was nothing I could do with my career at that time that would have as big an impact as joining the IE team. And, I smugly realized that if I joined the team, I’d get access to the IE source code, and could go root out those caching bugs myself.

I reached out to the IE lead for an informational interview the following day, and passed an interview loop shortly thereafter.

After joining the team, I printed out the source code for the network stack and sat down with a red pen. There were no fewer than six different bugs causing my “Clip of the Day gets stuck” issue. When my devs fixed the last of them, I mentioned this and my story to my GPM (boss’ boss).

“Does this mean you’re a retention risk?” Tony asked.

“Maybe after we fix the rest of these…” I retorted, pointing at the pile of paper with almost a hundred red circles.

No one in the world loved IE as much as I did, warts and all. Investigating, documenting, and fixing problems in Internet Explorer was a nearly all-consuming passion throughout my twenties. Internet Explorer pioneered a broad range of (mostly overlooked) innovations, and in rediscovering them, I felt like one of the characters on Lost — a castaway in a codebase whose brilliant designers were long gone. IE9 was a fantastic, best-of-its-time browser, and I’ll forever be proud of it. But as IE9 wound down and the Windows 8 adventure began, it was already clear that its lead would not last against the Chrome juggernaut.

I shipped IE7, IE8, IE9, and IE10, leaving Microsoft in late 2012, shortly after IE10 was finished, to build Fiddler for Telerik.

In 2015, I changed my default browser to Chrome. In 2016, I joined the Chrome Security team. I left Google in the summer of 2018 and rejoined the Microsoft Edge team, and that summer and fall I spent 50% of my time rediscovering bugs that I’d first found in IE and blogged about a decade before.

Fortunately, Edge’s faster development pace meant that we actually got to fix some of the bugs this time, but Chrome’s advantages in nearly every dimension left Edge very much an underdog. Happily, the other half of my time was spent working on our (then) secret project to replatform the next version of Edge atop the open-source Chromium project.

We’ve now shipped our best browser ever — the Chromium-based Microsoft Edge. I hope you’ll try it out.

It’s with love that I beg you… please let Internet Explorer retire to the great bitbucket in the sky. It’s time. It’s been time for a long time.

Burndown List

Last night, as I read the details of yet another 0-day security bug in Internet Explorer, I posted a throwaway tweet which netted a surprising number of interactions.

I expected the usual slew of “Yeah, IE is terrible,” and “IE was always terrible,” and “Somebody tell my {boss,school,parents}” responses, but I didn’t really expect serious replies. I got some, however, and they’re interesting.

Shared Credentials

Internet Explorer shares a common networking stack (WinINET) and Cookie Jar (for Intranet/Trusted sites) with many native code applications on Windows, including Windows Explorer. Tim identifies a scenario where Windows Explorer relies on an auth cookie being found in the WinINET cookie jar, put there by Internet Explorer. We’ve seen similar scenarios in some Microsoft Office flows.

Depending on a cookie set by Internet Explorer might’ve been somewhat reasonable in 2003, but Vista/IE7’s introduction of Protected Mode (and cookie jar partitioning) in 2006 made this a fragile architecture. The fact that anything depends upon it in 2020 is appalling.

Thoughts: I need to bang on some doors. This is depressing.

Certificate Issuance

Developers who apply digital signatures to their apps and server operators who expose their sites over HTTPS do so using a digital certificate. In ideal cases, getting a certificate is automatic and doesn’t involve a browser at all, but some Certificate Authorities require browser-based flows. Those flows often demand that the user use either Internet Explorer or Firefox because the former supports ActiveX Controls for certificate issuance, while Firefox, until recently, supported the Keygen element.

WebCrypto, now supported in all modern browsers, serves as a modern replacement for these deprecated approaches, and some certificate issuers are starting to build issuance flows atop it.
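
For instance, here’s a minimal sketch of in-browser key generation with WebCrypto (the algorithm and parameters are illustrative; a real issuance flow would follow the CA’s requirements):

// Run inside an async function (or a modern DevTools console).
// Generates a 2048-bit RSA signing key; the private key never leaves the browser.
const keyPair = await crypto.subtle.generateKey(
  {
    name: "RSASSA-PKCS1-v1_5",
    modulusLength: 2048,
    publicExponent: new Uint8Array([1, 0, 1]),
    hash: "SHA-256",
  },
  false,               // private key is non-extractable
  ["sign", "verify"]   // permitted key usages
);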

Thoughts: We all need to send some angry emails. Companies in the Trust space should not be built atop insecure technologies.

Banking, especially in Asia

A fascinating set of circumstances led to Internet Explorer’s dominance in Asian markets. First, early browsers had poor support for Unicode and East Asian character sets, forcing website developers to build their own text rendering atop native code plugins (ActiveX). South Korea mandated use of a locally-developed cipher (SEED) for banking transactions[1], and this cipher was not implemented by browser developers… ActiveX again to the rescue. Finally, since all users were using IE, and were accustomed to installing ActiveX controls, malware started running rampant, so banks and other financial institutions started bundling “security solutions” (aka rootkits) into their ActiveX controls. Every user’s browser was a battlefield with warring native code trying to get the upper hand. A series of beleaguered Microsoft engineers (including Ed Praitis, who helped inspire me to make my first significant code commits to the browser) spent long weeks trying to keep all of this mess working as we rearchitected the browser, built Protected Mode and later Enhanced Protected Mode, and otherwise modernized a codebase nearing its second decade.

Thoughts: IE marketshare in Asia may be higher than other places, but it can’t be nearly as high as it once was. Haven’t these sites all pivoted to mobile apps yet?

Reader Survey: Do you have any especially interesting scenarios where you’re forced to use Internet Explorer? Sound off in the comments below!

Q&A

Q: I get that IE is terrible, but I’m an enterprise admin and I own 400 websites running lousy code written by a vendor in a hurry back in 2004. These sites will not be updated, and my employees need to keep using them. What can I do?

A: The new Chromium-based Edge has an IE Mode; you can configure your users’ browsers so that Edge loads the sites you designate in an Internet Explorer tab, directly within Edge itself.

Q: Uh, isn’t IE Mode a security risk?

A: Any use of an ancient web engine poses some risk, but IE Mode dramatically reduces the risk, by ensuring that only sites selected by the IT Administrator load in IE mode. Everything else seamlessly transitions back to the modern, performant and secure Chromium Edge engine.

Q: What about Web Browser Controls (WebOCs) inside my native code applications?

A: In many cases, WebOCs inside a native application are used to render trusted content delivered from the application itself, or from a server controlled by the application’s vendor. In such cases, and presuming that all content is loaded over HTTPS, the security risk of the use of a WebOC is significantly lower. Rendering untrusted HTML in a WebOC is strongly discouraged, as WebOCs are even less secure than Internet Explorer itself. For compatibility reasons, numerous security features are disabled-by-default in WebOCs, and the WebOC does not run content in any type of process sandbox.

Looking forward, the new Chromium-based WebView2 control should be preferred over WebOCs for scenarios that require the rendering of HTML content within an application.

Q: Does this post mean anything has changed with regard to Internet Explorer’s support lifecycle, etc?

A: No. Internet Explorer will remain a supported product until its support lifecycle runs out. I’m simply begging you to not use it except to download a better browser.

Footnotes

[1] The SEED cipher wasn’t just a case of the South Korean government suffering from not-invented-here, but instead a response to the fact that the US Government at the time forbade export of strong crypto.

Problems in accessing websites can often be found and fixed if the network traffic between the browser and the website is captured as the problem occurs. This short post explains how to capture such logs.

Capturing Network Traffic Logs

If someone asked you to read this post, chances are good that you were asked to capture a web traffic log to track down a bug in a website or your web browser.

Fortunately, in Google Chrome or the new Microsoft Edge (version 76+), capturing traffic is simple:

  1. Optional but helpful: Close all browser tabs but one.
  2. Navigate the tab to chrome://net-export
  3. In the UI that appears, press the Start Logging to Disk button.
  4. Choose a filename to save the traffic to. Tip: Pick a location you can easily find later, like your Desktop.
  5. Reproduce the networking problem in a new tab. If you close or navigate away from the chrome://net-export tab, the logging will stop automatically.
  6. After reproducing the problem, press the Stop Logging button.
  7. Share the Net-Export-Log.json file with whoever will be looking at it. Optional: If the resulting file is very large, you can compress it to a ZIP file.
Network Capture UI

Privacy-Impacting Options

In some cases, especially when you’re dealing with a problem logging into a website, you may need to set either the Include cookies and credentials or Include raw bytes option before you click the Start Logging button.

Note that there are important security & privacy implications to selecting these options– if you do so, your capture file will almost certainly contain private data that would allow a bad actor to steal your accounts or perform other malicious actions. Share the capture only with a person you trust and do not post it on the Internet in a public forum.

Tutorial Video

If you’re more of a visual learner, here’s a short video demonstrating the traffic capture process.

In a future post, I’ll explore how developers can use the NetLog Viewer to analyze captured traffic.

-Eric

Appendix A: Capture on Startup

In rare cases, you may need to capture network data early (e.g. to capture proxy script downloads and the like). To do that, close Edge, then run:

msedge.exe --log-net-log=C:\some_path\some_file_name.json --net-log-capture-mode=IncludeSocketBytes

Note: This approach also works for Electron JS applications like Microsoft Teams:

%LOCALAPPDATA%\Microsoft\Teams\current\Teams.exe --log-net-log=C:\temp\TeamsNetLog.json

I suspect that this is only going to capture the network traffic from the Chromium layer of Electron apps (e.g. web requests from the nodeJS side will not be captured) but it still may be very useful.


As your browser navigates from page to page, servers are informed of the URL you’ve come from via the Referer HTTP header[1]; the document.referrer DOM property reveals the same information to JavaScript.

Similarly, as the browser downloads the resources (images, styles, JavaScript) within webpages, the Referer header on the request allows the resource’s server to determine which page is requesting the resource.

The Referrer is omitted in some cases, including:

  • When the user navigates via some mechanism other than a link in the page (e.g. choosing a bookmark or using the address box)
  • When navigating from HTTPS pages to HTTP pages
  • When navigating from a resource served by a protocol other than HTTP(S)
  • When the page opts-out (details in a moment)

Usefulness

The Referrer mechanism can be very useful, because it helps a site owner understand from where their traffic is originating. For instance, WordPress automatically generates a dashboard that shows me where my blog gets its visitors.


I can see not only which Search Engines send me the most users, but also which specific posts on Reddit are driving traffic my way.

Privacy Implications

Unfortunately, this default behavior has a significant impact on privacy, because it can potentially leak private and important information.

Imagine, for example, that you’re reviewing a document your mergers and acquisitions department has authored, with the URL https://contoso.com/Q4/PotentialAcquisitionTargetsUpTo5M.docx. Within that document, there might be a link to https://fabrikam.com/financialdisclosures.htm. If you were to click that link, the navigation request to Fabrikam’s server would contain the full URL of the document that led you there, potentially revealing information that your firm would’ve preferred to keep quiet.

Similarly, your search queries might contain something you don’t mind Bing knowing (“Am I required to disclose a disease before signing up for HumongousInsurance.com?”) but that you didn’t want to immediately reveal to the site where you’re looking for answers.

If your web-based email reader puts your email address in the URL, or includes the subject of the current email, links you click in that email might be leaking information you wish to keep private.

The list goes on and on. This class of threat was noted almost thirty years ago.

Referrer Policy

Websites have always had ways to avoid leaking information to navigation targets, usually involving nonstandard navigation mechanisms (e.g. meta refresh) or by wrapping all links so that they go through an innocuous page (e.g. https://example.net/offsitelink.aspx).

However, these mechanisms were non-standard, cumbersome, and would not control the referrer information sent when downloading resources embedded in pages. To address these limitations, Referrer Policy was developed and implemented by most browsers[2].


Referrer Policy allows a website to control what information is sent in Referer headers and exposed to the document.referrer property. As noted in the spec, the policy can be specified in several ways:

  • Via the Referrer-Policy HTTP response header.
  • Via a meta element with a name of referrer.
  • Via a referrerpolicy content attribute on an a, area, img, iframe, or link element.
  • Via the noreferrer link relation on an a, area, or link element.
  • Implicitly, via inheritance.
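
For example, each of the following expresses a no-referrer policy (the URLs are illustrative):

Referrer-Policy: no-referrer

<meta name="referrer" content="no-referrer">
<a href="https://example.com/" referrerpolicy="no-referrer">a link</a>
<a href="https://example.com/" rel="noreferrer">a link</a>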

The policy can be any of the following:

  • no-referrer – Do not send a Referer.
  • unsafe-url – Send the full URL (lacking only auth info and fragment), even on navigations from HTTPS to HTTP.
  • no-referrer-when-downgrade – Don’t send the Referer when navigating from HTTPS to HTTP. [The longstanding default behavior of browsers.]
  • strict-origin-when-cross-origin – For a same-origin navigation, send the URL. For a cross-origin navigation, send only the Origin of the referring page. Send nothing when navigating from HTTPS to HTTP. [Spoiler alert: The new default.]
  • origin-when-cross-origin – For a same-origin navigation, send the URL. For a cross-origin navigation, send only the Origin of the referring page. Send the Referer even when navigating from HTTPS to HTTP.
  • same-origin – Send the Referer only for same-origin navigations.
  • origin – Send only the Origin of the referring page.
  • strict-origin – Send only the Origin of the referring page; send nothing when navigating from HTTPS to HTTP.
  • empty string – Inherit, or use the default.

As you can see, there are quite a few policies. That’s partly due to the strict- variations which prevent leaking even the origin information on HTTPS->HTTP navigations.
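
To make the new default concrete, consider links clicked on the page https://contoso.com/Q4/plans.html?id=7 under strict-origin-when-cross-origin (the URLs are illustrative):

To https://contoso.com/other.html (same-origin):  Referer: https://contoso.com/Q4/plans.html?id=7
To https://fabrikam.com/ (cross-origin HTTPS):    Referer: https://contoso.com/
To http://fabrikam.com/ (HTTPS-to-HTTP downgrade): no Referer header is sent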

Improving Defaults

With this background out of the way, the Chromium team has announced that they plan to change the default Referrer Policy from no-referrer-when-downgrade to strict-origin-when-cross-origin. This means that cross-origin navigations will no longer reveal path or query string information, significantly reducing the possibility of unexpected leaks.

As with other big privacy changes, this change is slated to ship in v80; the code has been in for five years, and you can enable it today in Chrome 78+ and Edge 78+:

  1. Visit chrome://flags/#reduced-referrer-granularity
  2. Set the feature to Enabled
  3. Restart your browser

I’ve published a few toy test cases for playing with Referrer Policy here.

As noted in their Intent To Implement, the Chrome team are not the first to make changes here. As of Firefox 70 (Oct 2019), the default referrer policy is set to strict-origin-when-cross-origin, but only for requests to known-tracking domains, OR while in Private mode. In Safari ITP, all cross-site HTTP referrers and all cross-site document.referrers are downgraded to origin. Brave forges the Referer (sending the Origin of the target, not the source) when loading 3rd party resources.

Understand the Limits

Note that this new default is “opt-out”– a page can still choose to send unrestricted referral URLs if it chooses. As an author, I selfishly hope that sites like Reddit and Hacker News might do so.

Also note that this new default does not in any way limit JavaScript’s access to the current page’s URL. If your page at https://contoso.com/SuperSecretDoc.aspx includes a tracking script:

<script src="https://tracker.example/track.js"></script>

… the HTTPS request for track.js will send Referer: https://contoso.com/, but when the script runs, it will have access to the full URL of its execution context (https://contoso.com/SuperSecretDoc.aspx) via the window.location.href property.
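
Put concretely, here’s a tiny sketch of what such a script can observe at runtime:

// Inside track.js, executing on https://contoso.com/SuperSecretDoc.aspx:
console.log(window.location.href);  // full page URL, regardless of Referrer Policy
console.log(document.referrer);     // governed by Referrer Policy; may be "" or origin-only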

Test Your Sites

If you’re a web developer, you should test your sites in this new configuration and update them if anything is unexpectedly broken. If you want the browser to behave as it used to, you can use any of the policy-specification mechanisms to request no-referrer-when-downgrade behavior for either an entire page or individual links.

Or, you might pick an even stricter policy (e.g. same-origin) if you want to prevent even the origin information from leaking out on a cross-site basis. You might consider using this on your Intranet, for instance, to help prevent the hostnames of your Intranet servers from being sent out to public Internet sites.

Stay private out there!

-Eric

[1] The misspelling of the HTTP header name is a historical error which was never corrected.

[2] Notably, Safari, IE11, and Edge 18 and below supported only an older draft of the Referrer Policy spec, with tokens never (matching no-referrer), always (matching unsafe-url), origin (unchanged) and default (matching no-referrer-when-downgrade). Edge 18 supported origin-when-cross-origin, but only for resource subdownloads.

For a small number of users of Chromium-based browsers (including Chrome and the new Microsoft Edge) on Windows 10, after updating to 78.0.3875.0, every new tab crashes immediately when the browser starts.

Impacted users can open as many new tabs as they like, but each will instantly crash.


As of Chrome 81.0.3992, the page will show the string Error Code: STATUS_INVALID_IMAGE_HASH.

What’s going wrong?

This problem relates to a security/reliability improvement made to Chromium’s sandboxing. Chromium runs each of the tabs (and extensions) within locked-down (“sandboxed”) processes.


In Chrome 78, a change was made to prevent 3rd-party code from injecting itself into these sandboxed processes. 3rd-party code is a top source of browser reliability and performance problems, and it has been a longstanding goal for browser vendors to get this code out of the web platform engine.

This new feature relies on setting a Windows 10 Process Mitigation policy that instructs the OS loader to refuse to load binaries that aren’t signed by Microsoft. Edge 13 enabled this mitigation in 2015, and the Chromium change brings parity to the new Edge 78+. Notably, Chrome’s own DLLs aren’t signed by Microsoft so they are specially exempted by the Chromium sandboxing code.

Unfortunately, the impact of this change is that the renderer is killed (resulting in the “Aw snap” page) if any disallowed DLL attempts to load– for instance, if your antivirus software attempts to inject its DLLs into the renderer processes. Symantec Endpoint Protection versions before 14.2, for example, are known to trigger this problem.

If you encounter this problem, try the following steps:

Update any security software you have to the latest version.

Aside from malware, security software is the most likely cause of code being unexpectedly injected into the renderers.

Temporarily disable the new protection

You can temporarily launch the browser without this sandbox feature to verify that it’s the source of the crashes.

  1. Close all browser instances (verify that there are no hidden chrome.exe or msedge.exe processes using Task Manager)
  2. Use Windows+R to launch the browser with the command line override:
msedge.exe --disable-features=RendererCodeIntegrity

or

chrome.exe --disable-features=RendererCodeIntegrity

Ensure that the tab processes work properly when code integrity checks are disabled.

If so, you’ve proven that code integrity checks are causing the crashes.

Hunt down the culprit

Visit chrome://conflicts#R to show the list of modules loaded by the client. Look for any files that are not Signed By Microsoft or Google.

If you see any, they are suspects. (There will likely be a few listed as “Shell Extensions”, e.g. 7-Zip.dll, that do not cause this problem.) Check for an R in the Process types column to find modules loading in the renderers.

You should install any available updates for any of your suspects to see if doing so fixes the problem.

Check the Event Log

The Windows Event Log will contain information about modules denied loading. Open Event Viewer. Expand Applications and Services Logs > Microsoft > Windows > CodeIntegrity > Operational and look for events with ID 3033. The detail information will indicate the name and location of the DLL that caused the crash.
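
If you prefer the command line, a query along these lines (a sketch; adjust the event count to taste) pulls the most recent matching events:

wevtutil qe Microsoft-Windows-CodeIntegrity/Operational /q:"*[System[(EventID=3033)]]" /c:10 /rd:true /f:text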

Optional: Use Enterprise Policy to disable the new protection

If needed, IT Administrators can disable the new protection using the RendererCodeIntegrity policy for Chrome and Edge. You should reach out to the software vendors responsible for the problematic applications and request that they update them.
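
For example, if your environment deploys policy via the registry (rather than GPO templates), a sketch looks like this, assuming RendererCodeIntegrityEnabled is the registry-backed name of the policy:

reg add HKLM\SOFTWARE\Policies\Microsoft\Edge /v RendererCodeIntegrityEnabled /t REG_DWORD /d 0
reg add HKLM\SOFTWARE\Policies\Google\Chrome /v RendererCodeIntegrityEnabled /t REG_DWORD /d 0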

Other possible causes

Note that it’s possible that you could have a PC that encounters symptoms like this (all subprocesses crash) but not a result of the new code integrity check. In such cases, the Error Code on the crash page will be something other than STATUS_INVALID_IMAGE_HASH.

  • For instance, Chromium once had an obscure bug in its sandboxing code that caused all sandboxes to crash depending on the random memory mapping of Address Space Layout Randomization.
  • Similarly, Chrome and Edge still have an active bug where all renderers crash on startup if the PC has AppLocker enabled and the browser is launched elevated (as Administrator).

-Eric

As we’ve been working to replatform the new Microsoft Edge browser atop Chromium, one interesting outcome has been early exposure to a lot more bugs in Chromium. Rapidly root-causing these regressions (bugs in scenarios that used to work correctly) has been a high-priority activity to help ensure Edge users have a good experience with our browser.

Stabilization via Channels

Edge’s code stabilizes as it flows through release channels, from the developers-only ToT/HEAD (tip-of-tree, the latest commit in the source repository), to the Canary channel build (updated daily), to the Dev channel (updated weekly), to the Beta channel (updated a few times over its six-week lifetime), and finally to our Stable channel.

Until recently, Microsoft Edge was only available in the Canary and Dev channels, which meant that any bugs landed in Chromium would almost immediately impact almost all users of Edge. Even as we added a Beta channel, we still found users reporting issues that “reproduce in all Edge builds, but not in Chrome.”

As it turns out, most of the “not in Chrome” comparisons really mean “not repro in Chrome Stable.” And that’s because either the regressions simply haven’t made it to Stable yet, or because the regressions are hidden behind Feature Flags that are not enabled for Chrome’s Stable channel.

A common example of this is LayoutNG, a major update to the layout engine used in Chromium. This redesigned engine is more flexible than its predecessor and allows the layout engineers to more easily add the latest layout features to the browser as new standards are finalized. Unfortunately, changing any major component of the browser is almost certain to lead to regressions, especially in corner cases. Google had enabled LayoutNG by default in the code for Chrome 76 (and Edge picked up the change), but subsequently used the Feature Flag to disable LayoutNG for the Stable channel three days before Chrome 76 shipped. As a result, the new LayoutNG engine is on-by-default in the Beta, Dev, and Canary channels of both Chrome and Edge.

The key difference was that until January 2020, Edge didn’t yet have a public Stable channel to which bug-impacted users could retreat. Therefore, reproducing, isolating, and fixing regressions as quickly as possible is important for Edge engineers.

Isolating Regressions

When we receive a report of a bug that reproduces in Microsoft Edge, we follow a set of steps for figuring out what’s going on.

Check Latest Canary

The first step is checking whether it reproduces in our Edge Canary builds.

If not, then it’s likely the bug was already found and fixed. We can either tell our users to sit tight and wait for the next update, or we can search on CRBug.com to figure out when exactly the fix went in and inform the user specifically when a fix will reach them.

Check Upstream

If the problem still reproduces in the latest Edge Canary build, we next try to reproduce the problem in the equivalent build of Chrome or plain-vanilla Chromium.

If Not Repro in Chromium?

If the problem doesn’t reproduce in Chrome, this implies that the problem was caused by Microsoft’s modifications to the code in our internal branches. Alternatively, it might be the case that the problem is hidden in upstream Chrome behind an experimental flag, so sometimes we must go spelunking into the browser’s Feature Flag configuration by visiting:

chrome://version/?show-variations-cmd 

The Command-Line Variations section of that page reveals the names of the experiments that are enabled/disabled. Relaunching the browser with a modified version of that command line runs Chrome in a different configuration[2].
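
For example (the study and group names here are made up; copy the real string from your own variations page):

chrome.exe --force-fieldtrials="SomeStudy/Enabled/AnotherStudy/Control/"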

If the issue really is unique to Edge, we can use git blame and similar techniques on our code to see where we might have caused the problem.

If Repro in Chromium?

If the problem does reproduce in Chrome or Chromium, that’s strong evidence that we’ve inherited the problem from upstream.

Sanity Check: Does it Repro in Firefox?

If the problem isn’t a crashing bug or some other obviously incorrect behavior, I will first check the site’s behavior in the latest Firefox nightly build, on the off chance that the browsers are behaving correctly and the site’s markup or JavaScript is actually incorrect.

Try tweaking common Flags

Depending on the area where the problem occurs, a quick next step is to try toggling Feature Flags that seem relevant on the chrome://flags page. For instance, if the problem is in layout, try setting chrome://flags/#enable-layout-ng to Disabled. If the problem seems to be related to the network, try toggling chrome://flags/#network-service-in-process, and so on.

Understanding whether the problem can be impacted by flags enables us to more quickly find its root cause, and provides us an option to quickly mitigate the problem for our users (by adjusting the flag remotely from our experimental configuration server).

Bisecting Regressions

The gold standard for root-causing regressions is finding the specific commit (aka “change list” aka “pull request” aka “patch”) that introduced the problem. When we determine which commit caused a problem, we not only know exactly which builds the problem affects, we also know which engineer committed the breaking change, and exactly what code was changed– often we can quickly spot the bug within the changed files.

Fortunately, Chromium engineers trying to find regressing commits have a super power, known as bisect-builds.py. Using this script is simple: You specify the known-bad Chromium version and a guesstimate of the last known good version. The script then automates the binary search of the builds between the two positions, requiring the user to grade each trial as “Good” or “Bad”, until the introduction of the regression is isolated.

The simplicity of this tool belies its magic— or, at least, the power of the log2 math that underlies it. OmahaProxy informs me that the current Chrome tree contains 692,278 commits. log2(692278) is 19.4, which means that we should be able to isolate any regressing change in history in just 20 trials, taking a few minutes at most[1]. And it’s rare that you’d want to even try to bisect all of history– between Stable and ToT we see ~27,250 commits, so we should be able to find any regression within such a range in just 15 trials.

On CrBug.com, regressions awaiting a bisect are tagged with the label Needs-Bisect, and a small set of engineers try to burn down the backlog every day. But running bisects is so easy that it’s usually best to just do it yourself, and Edge engineers are doing so constantly.

One advantage available to Googlers but not the broader Chromium community is the ability to do what’s called a “Per-Revision Bisect.” Inside Google, Chrome produces a new build for every single change (so they can unambiguously flag any change that impacts performance), but not all of these builds are public. Publicly, Google provides the Chromium Continuous Build Archive, an archive of builds created approximately every one to fifteen commits. That means that when we do bisects, we often don’t get back a single commit, but instead a regression range. We then must look at all of the commits within that range to figure out which was the culprit. In my experience, this is rarely a problem– most commits in the range obviously have nothing to do with the problem (“Change the icon for some far off feature”), so there are rarely more than one or two obvious suspects.

The Edge team does not currently expose our Edge build archives for bisects, but fortunately almost all of our web platform work is contributed upstream, so bisecting against Chromium is almost always effective.

Recent Example

Last Thursday, we started to get reports of a “missing certificate” problem in Microsoft Edge, whereby the browser wasn’t showing the expected Lock icon for HTTPS pages that didn’t contain any mixed content:

The certificate is missing

While the lock was missing for some users, it was present for others. After also reproducing the issue in Chrome itself, I filed a bug upstream and began investigating.

Back when I worked on the Chrome Security team, we saw a bunch of bugs that manifested like this one, caused by refactoring in Chrome’s navigation code. Users could hit these bugs in cases where they navigated back/forward rapidly, performed an undo-tab-close operation, or used sites built atop the HTML5 History API. In all of these cases, the high-level issue is that the page’s certificate is missing in the security_state: either the ssl_status on the NavigationHandle is missing or it contains the wrong information.

This issue, however, didn’t seem to involve navigations; instead, it was hit as the site loaded. That called to mind a more recent regression from back in March, where sites that used AppCache were missing their lock icon. That issue involved a major refactoring to use the new Network Service.

One fact that immediately jumped out at me about the sites first reported to hit this new problem is that they all use ServiceWorker (e.g. Hotmail, Gmail, MS Teams, Twitter). Like AppCache, ServiceWorker allows the browser to avoid hitting the network in response to a fetch. As with AppCache, that characteristic means that the browser must somehow have the “original certificate” for that response from the network so it can set that certificate in the security_state when it’s needed.

But where does that certificate live?

Chromium stores the certificate for a given HTTPS response after the end of the cache entry, so it should be available whenever the cached resource is used[3]. A quick disk search revealed that ServiceWorker stores its scripts inside the folder:

%localappdata%\Microsoft\Edge SxS\User Data\Default\Service Worker\ScriptCache

Comparing the contents of the cache file on a “good” and a “bad” PC, we see that the certificate information is missing in the cache file for a machine that reproduces the problem:

The serialized certificate chain is present in the “Good” case

So, why is that certificate missing? I didn’t know.

I performed a bisect three times, and each time I ended up with the same range of a dozen commits, only one of which had anything to do with caching, and that commit was for AppCache, not ServiceWorker.

More damning for my bisect suspect was the fact that this suspect commit landed (in 78.0.3890) the day after the build (3889) upon which the reproducing Edge build was based. I spent a bunch of time figuring out whether this could be the off-by-one issue in Edge build numbering before convincing myself that no, it couldn’t be: build number misalignment just means that Edge 3889 might not contain everything that’s in Chrome 3889.

Unless an Edge Engineer had cherry-picked the regressing commit into our 3889 (unlikely), the suspect couldn’t be the culprit.

Edge 3889 doesn’t include all of the commits in Chromium 3889.

I posted my research into the bug at 10:39 PM on Friday and forty minutes later a Chrome engineer casually revealed something I didn’t realize: Chrome uses two different codepaths for fetching a ServiceWorker script– a “new_script_loader” and an “updated_script_loader.”

And instantly, everything fell into place. The reason the repro wasn’t perfectly reliable (for both users and my bisect attempts) was that it only happened after a ServiceWorker script update.

  • If the ServiceWorker in the user’s cache is missing, it is downloaded by the new_script_loader and the certificate is stored.
  • If the ServiceWorker script is present and is unchanged on the server, the lock shows just fine.
  • But if the ServiceWorker in the user’s cache is present but outdated, the updated_script_loader downloads the new script… and omits the certificate chain. The lock icon disappears until the user clears their cache or performs a hard (CTRL+F5) refresh, at which point the lock remains until the next script update.

With this new information in hand, building a reliable reduced repro case was easy– I just ripped out the guts of one of my existing PWAs and configured it so that it updated itself every five seconds. That way, on nearly every load, the cached ServiceWorker would be deemed outdated and the script redownloaded.

With this repro, we can kick off our bisect thusly:

python tools/bisect-builds.py -a win -g 681094 -b 690908 --verify-range --use-local-cache -- --no-first-run --user-data-dir=/temp https://webdbg.com/apps/alwaysoutdated/

… and grade each build based on whether the lock disappears on refresh:

Grading each build based on whether the lock disappears

Within a few minutes, we’ve identified the regression range:

A culprit is found

In this case, the regression range contains just one commit— one that turns on the new ServiceWorker update check code. This confirms the Chromium engineer’s theory and that this problem is almost identical to the prior AppCache bug. In both cases, the problem is that the download request passed kURLLoadOptionNone and that prevented the certificate from being stored in the HttpResponseInfo serialized to the cache file. Changing the flag to kURLLoadOptionSendSSLInfoWithResponse results in the retrieval and storage of the ssl_info, including the certificate.

The fix was quick and straightforward; it will be available in Chrome 78.0.3902 and the following build of Edge based on that Chromium version. Notably, because the bug is caused by failure to put data in a cache file, the lock will remain missing even in later builds until either the ServiceWorker script updates again, or you hard refresh the page once.

-Eric

[1] By way of comparison, when I last bisected an issue in Internet Explorer, circa 2012, it was an extraordinarily painful two-day affair.

[2] You can use the command line arguments in the variations page (starting at --force-fieldtrials=) to force the bisect builds to use the same variations.

Chromium also has a bisect-variations script which you can use to help narrow down which of the dozens of active experiments is causing a problem.

If all else fails, you can also reset Chrome’s field trial configuration using chrome.exe --reset-variation-state to see if the repro disappears.

[3] Aside: Back in the days of Internet Explorer, WinINET had no way to preserve the certificate, so it always bypassed the cache for the first request to an HTTPS server so that the browser process would have a certificate available for its subsequent display needs.