Best Practice: Post-Mortems

I’ve written a bit about working at Google in the past. Google does a lot of things right, and other companies would benefit by following their example.

At Google, one of the technical practices that I thought was both essential and very well done was the “post-mortem”– whenever they hit a significant problem, after putting out the fires and getting everything running again, they’d get the engineers closest to the problem to spend a day or two investigating the root cause of the issue and writing up their findings for everyone to read. The visibility of post-mortems meant that even a lowly browser engineer could go read in-depth content about how a live service went down for a day (“We didn’t think about what would happen if the data center caught on fire during the migration“), or the comic tale about what happens when a catering order for 1000 donuts is misunderstood as an order for 1000 dozen donuts. Some post-mortems are even made public.

The aim was a “blameless” post-mortem (nobody got in trouble for the results) where the goal was to identify the true root causes (not just the immediately precipitating errors) and file bugs to eradicate those causes and prevent recurrence of not just the same problem, but all similar problems in the future. As a part of the process, they’d calculate out exactly how much the problem ended up costing in direct dollars (lost revenue, damage, etc). 

Bugs filed from post-mortems got worked on with priority– there was solid evidence showing the real danger of leaving things unfixed, and no one wanted to get burned by the same root causes twice. Having open, broadly shared post-mortems helps ensure that the same mistakes aren’t repeated, and it helps build a common understanding of the greater impact of fire marshals over firefighters.

A key technique in the post-mortem was following the “Five Whys” paradigm (famously introduced at Toyota) for finding root causes, in which the participants would start at the immediate issue and then probe further toward the root causes by asking “And why did that happen?” (The downtime was caused because the database ran out of space and the code didn’t notice. Why? Because there was no test for that case. Why? Because the test environment ran on different hardware with a mock database that couldn’t run out of space. Why? Because it was deemed too difficult to test on production-class hardware. Why? Because we haven’t prioritized building a parallel test environment. Why? Because it’s expensive and we didn’t think it was necessary. Now we know better). 

The post-mortems were serious affairs — mandatory, well-funded (engineering time is expensive), and broadly reviewed — all of them published on an intranet portal for anyone in the company to learn from. They were tremendously effective — fixes for the root causes were prioritized based on cost and impact and rapidly addressed. I don’t think Google could have become a trillion-dollar company without them.

Many companies’ engineering cultures have adopted post-mortems in theory— but if your culture isn’t willing to expect, fund, recognize, and respect them, they become yet another source of overhead and another exhausting checkbox to tick.

Badware Techniques: Notification Spam

I tried visiting an old colleague’s long-expired blog today, just to see what would happen. I got redirected here:

Wat? What is this even talking about? There’s no “Allow” link or button anywhere.

The clue is that tiny bell with a red X in the omnibox– this site tried to ask for permission to spam me with notifications forevermore. The site hopes that I won’t understand the permission prompt, will assume it’s one of the billions of CAPTCHAs on today’s web, and will simply click “Allow”.

However, in this case, Edge said “Naw, we’re not even going to bother showing the prompt for this site” and suppressed it by default.

The resulting experience isn’t an awesome one for the user, but there’s not a ton the browser can do about that in general– websites can always lie to visitors, and the browser’s ability to do anything reasonable in response is limited. The truly bad outcome (a continuous flood of spam notifications appearing inside the OS, leading the user to wonder whether they’ve been hacked for weeks afterward) has been averted because the user never sees the “Shoot self in foot” option.
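For context, the prompt this site was trying to trigger comes from the standard Notifications API. Here’s a minimal sketch of the pattern such pages use (the message text is invented, and real spam campaigns typically pair this with a service worker and the Push API so notifications keep arriving after the tab is closed):

// Sketch: ask for notification permission as soon as the page loads.
Notification.requestPermission().then((result) => {
  if (result === "granted") {
    // Once granted, the site can show OS-level notifications at will.
    new Notification("Congratulations!", { body: "Click to claim your prize…" });
  }
});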

This “Quieter Notifications” behavior can be found in Edge Settings; you can use the other toggle to turn off Notification permission requests entirely:

edge://settings/content/notifications screenshot

Today, there’s no “Report this site is trying to trick users” feature. The existing menu command ... > Help and Feedback > Report Unsafe Site is only used to report sites that distribute malware or conduct phishing attacks for blocking with SmartScreen.

Edge’s Super-Res Image Enhancement

One interesting feature that the Edge team is experimenting with this summer is called “SuperRes” or “Enhance Images.” This feature allows Microsoft Edge to use a Microsoft-built AI/ML service to enhance the quality of images shown within the browser. You can learn more about how the images are enhanced (and see some examples) in the Turing SuperRes blog post.

Currently, only a tiny fraction of Stable channel users and a much larger fraction of Dev/Canary channel users have the feature enabled by field trial flags. If the feature is enabled, you’ll have an option to enable/disable it inside edge://settings:

Users of the latest builds will also see a “HD” icon appear in the omnibox. When clicked, it opens a configuration balloon that allows you to control the feature:

As seen in the blog post, this feature can meaningfully enhance the quality of many photographs, but the model is not yet perfect. One limitation is that it tends not to work as well for PNG screenshots, which sometimes get pink fuzzies:

“Pink Fuzzies” JPEG Artifacts

… and today I filed a bug because it seems like the feature does not handle ICCv4 color profiles correctly.

Green tint due to failed color profile handling

If you encounter failed enhancements like this, please report them to the Edge team using the … > Help and Feedback > Send Feedback tool so the team can help improve the model.

On-the-Wire

Using Fiddler, you can see the image enhancement requests that flow out to the Turing Service in the cloud:

Inspecting each response from the server takes a little bit of effort because the response image is encapsulated within a Protocol Buffer wrapper:

Because of the wrapper, Fiddler’s ImageView will not be able to render the image by default:

Fortunately, the response image is near the top of the buffer, so you can simply focus the Web Session, hit F2 to unlock it for editing, and use the HexView inspector to delete the prefix bytes:

…then hit F2 to commit the changes to the response. You can then use the ImageView inspector to render the enhanced image, skipping over the remainder of the bytes in the protocol buffer (see the “bytes after final chunk” warning on the left):
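If you’d rather not hand-edit bytes, the same extraction can be scripted. Here’s a rough Node.js sketch that pulls the image out of a response body saved to disk from Fiddler– it assumes the enhanced image is encoded as a JPEG and that the file is named response.bin, neither of which is guaranteed by the service:

// Sketch: extract an embedded JPEG from a protobuf-wrapped response body.
const fs = require("fs");

const body = fs.readFileSync("response.bin");
const soi = body.indexOf(Buffer.from([0xff, 0xd8, 0xff])); // JPEG start-of-image marker
const eoi = body.lastIndexOf(Buffer.from([0xff, 0xd9]));   // JPEG end-of-image marker (approximate)

if (soi >= 0 && eoi > soi) {
  fs.writeFileSync("enhanced.jpg", body.subarray(soi, eoi + 2));
  console.log(`Wrote ${eoi + 2 - soi} bytes to enhanced.jpg`);
} else {
  console.log("No JPEG markers found; the image may use a different encoding.");
}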

Stay sharp out there!

-Eric

PS: There is not, as of October 2022, a mechanism by which a website can opt its pages out of this feature.

QuickFix: Trivial Chrome Extensions

Almost a decade before I released the first version of Fiddler, I started work on my first app that survives to this day, SlickRun. SlickRun is a floating command line that can launch any app on your PC, as well as launch web applications and perform other simple and useful tricks, like showing battery status, CPU usage, countdowns to upcoming events, and so forth:

SlickRun allows you to come up with memorable commands (called MagicWords), so you can type whatever’s natural to you for any operation (e.g. bugs/edge launches the Edge bug tracker).

One of my favorite MagicWords, beloved for decades now, is goto. It launches your browser to the best match for any web search:

For example, I can type goto download fiddler and my browser will launch and go to the Fiddler download page (as found by an “I’m Feeling Lucky” search on Google) without any further effort on my part.

Unfortunately, back in 2020 (presumably for anti-abuse reasons), Google started interrupting their “I’m Feeling Lucky” experience with a confirmation page that requires the user to acknowledge that they’re going to a different website:

… and this makes the goto user flow much less magical. I grumbled about Google’s change at the time, without much hope that it would ever be fixed.

Last week, while vegging on some video in another tab, I typed out a trivial little browser extension which does the simplest possible thing: When it sees this page appear as the first or second navigation in the browser, it auto-clicks the continue link. It does so by instructing the browser to inject a trivial content script into the target page:

"content_scripts": [
    {
      "matches": ["https://www.google.com/url?*"],
      "js": ["content-script.js"]
    }

…and that injected script clicks the link:

// On the Redirect Notice page, click the first link.
if (window.history.length<=2) {
  document.links[0].click(); 
}
else {
  console.log(`Skipping auto-continue, because history.length == ${window.history.length}`);
}
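For anyone who wants to recreate it, the complete manifest isn’t much longer than the snippet above. Here’s a sketch assuming Manifest V3 (the name and version strings are just placeholders):

{
  "manifest_version": 3,
  "name": "Auto-continue Redirect Notice",
  "version": "0.1",
  "content_scripts": [
    {
      "matches": ["https://www.google.com/url?*"],
      "js": ["content-script.js"]
    }
  ]
}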

This whole thing took me under 10 minutes to build, and it still delights me every time.

-Eric

Passkeys – Syncable WebAuthN credentials

Passwords have lousy security properties, and if you try to use them securely (long, complicated, and different for every site), they often have horrible usability as well. Over the decades, the industry has slowly tried to shore up passwords’ security with multi-factor authentication (e.g. one-time codes via SMS, TOTP authenticators, etc) and usability improvements (e.g. password managers), but these mechanisms are often clunky and have limited impact on phishing attacks.

The Web Authentication API (WebAuthN) offers a way out — cryptographically secure credentials that cannot be phished and need not be remembered by a human. But the user-experience for WebAuthN has historically been a bit clunky, and adoption by websites has been slow.

That’s all set to change.

Passkeys, built atop the existing WebAuthN standards, offers a much slicker experience, with enhanced usability and support across three major ecosystems: Google, Apple, and Microsoft. It will work in your desktop browser (Chrome, Safari, or Edge), as well as on your mobile phone (iPhone or Android, in both web apps and native apps).
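Under the hood, it’s still the WebAuthN API. As a rough sketch (not any particular site’s implementation– the relying party, user, and challenge values below are placeholders that a real server would supply), registering a passkey looks something like this:

// Sketch: create a discoverable ("resident") credential, i.e. a passkey.
const credential = await navigator.credentials.create({
  publicKey: {
    rp: { id: "example.com", name: "Example Site" },        // placeholder relying party
    user: {
      id: crypto.getRandomValues(new Uint8Array(16)),       // normally a stable server-side user handle
      name: "user@example.com",
      displayName: "Example User",
    },
    challenge: crypto.getRandomValues(new Uint8Array(32)),  // normally supplied by the server
    pubKeyCredParams: [{ type: "public-key", alg: -7 }],    // ES256
    authenticatorSelection: {
      residentKey: "required",        // discoverable credential
      userVerification: "preferred",
    },
  },
});
// credential.response is then sent to the server to finish registration.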

Passkeys offers the sort of usability improvement that finally makes it practical for sites to seize the security improvement from retiring passwords entirely (or treating password-based logins with extreme suspicion).

PMs from Google and Microsoft put together an awesome (and short!) demo video for the User Experience across devices which you can see over on YouTube.

You can visit https://passkeys.dev to learn more.

I’m super-excited about this evolution and hope we’ll see major adoption as quickly as possible. Stay secure out there!

-Eric

Bonus Content: A PassKeys Podcast featuring Google Cryptographer Adam Langley, IMO one of the smartest humans alive.

Understanding Browser Channels

Edge channel logs

Microsoft Edge (and upstream Chrome) is available in four different Channels: Stable, Beta, Dev, and Canary. The vast majority of Edge users run on the Stable Channel, but the three pre-Stable channels can be downloaded easily from microsoftedgeinsider.com. You can keep them around for testing if you like, or join the cool kids and set one as your “daily driver” default browser.

Release Schedule

The Stable channel receives a major update every four weeks (Official Docs), the Beta channel updates more often than that (on an irregular schedule), the Dev channel aims for one update per week, and the Canary channel aims for one update per day.

While Stable only receives a major version update every four weeks, in reality it will usually be updated several times during its four-week lifespan. These are called respins, and they contain security fixes and high-impact functionality fixes. (The Extended Stable channel for Enterprises updates only every eight weeks, skipping every odd-numbered release).

Similarly, some Edge features are delivered via components, and those can be updated for any channel at any time.

Why Use a Pre-Stable Channel?

The main reason to use Beta, Dev, or even Canary as your “daily driver” is because these channels (sometimes referred to collectively as “pre-release channels”) are a practical time machine. They allow you to see what will happen in the future, as the code from the pre-release channels flows from Canary to Dev to Beta and eventually Stable.

For a web developer, Enterprise IT department, or ISV building software that interacts with browsers, this time machine is invaluable– a problem found in a pre-release channel can be fixed before it becomes a work-blocking emergency during the Stable rollout.

For Edge and the Chromium project, self-hosting of pre-release channels is hugely important, because it allows us to discover problematic code before billions of users are running it. Even if an issue isn’t reported via a hand-authored customer bug submission, engineers can discover many regressions using telemetry and automatic crash reporting (“Watson”).

What If Something Does Go Wrong?

As is implied in the naming, pre-Stable channels are, well, less Stable than the Stable channel. Bugs, sometimes serious, are to be expected.

To address this, you should always have at least two Edge channels configured for use– the “fast” channel (Dev or Canary) and a slower channel (Beta or Stable).

If there’s a blocking bug in the version you’re using as your fast channel, temporarily “retreat” from your fast to slow channel. To make this less painful, configure your browser profile in both channels to sync information using a single MSA or AAD account. That way, when you move from fast to slow and back again, all of your most important information (see edge://settings/profiles/sync for data types) is available in the browser you’re using.

Understanding Code Flow

In general, the idea is that Edge developers check in their code to the internal Main branch. Code from Microsoft employees is joined by code pulled by the “pump” from the upstream Chromium project, with various sheriffs working around the clock to fix any merge conflicts between the upstream code pumped in and the code Microsoft engineers have added.

Every day, the Edge build team picks a cut-off point, compiles an optimized release build, runs it through an automated test gauntlet, and if the resulting build runs passably (e.g. the browser boots and can view some web pages without crashing), that build is blessed as the Canary and released to the public. Note that the quality of Canary might well be comically low (the browser might render entirely in purple, or have menu items that crash the browser entirely) but still be deemed acceptable for release. The Canary channel, jokes aside, is named after the practice of bringing birds into mining tunnels deep underground. If a miner’s canary falls over dead, the miners know that the tunnel is contaminated by odorless but deadly carbon monoxide and they can run for fresh air immediately. (Compared to humans, canaries are much more sensitive to carbon monoxide and die at a much lower dose). Grim metaphors aside, the Canary channel serves the same purpose– to discover crashes and problems before “regular” users encounter them. Firefox avoids etymological confusion and names its latest channel “Nightly.”

Every week or so, the Edge build team selects one of the week’s Canary releases and “promotes” it to the Dev branch. The selected build is intended to be one of the more reliable Canaries, with fewer major problems than we’d accept for any given Canary, but sometimes we’ll pick a build with a major problem that wasn’t yet noticed. When such a problem is noticed after the build goes out to the broader Dev population, Microsoft will often fix it in the next Canary build, but folks on the busted Dev build might have to wait a few days for the next Canary-to-Dev promotion. It’s for this reason that I run Canary as my daily driver rather than Dev.

Notably, for Canary and Dev, the Edge team does not try to exactly match any given upstream Canary or Dev release. Sometimes we’ll skip a Dev or Canary release when we don’t have a good build, or we’ll ship one when upstream does not. This means that sometimes (due to pump latency, “sometimes” is nearly “always”) an Edge Canary might have slightly different code than the same day’s Chrome Canary. Furthermore, due to how our code pump works, Edge Canary can have slightly different code than Chromium’s even for the exact same Chrome version number.

In contrast, for Stable, we aim to match upstream Chrome, and work hard to ensure that Version N of Edge has the same upstream changelists as the matching Version N of Chrome/Chromium. This means that anytime upstream ships or respins a new version of Stable, we will ship or respin in very short order.

In some cases, upstream Chromium engineers or Microsoft engineers might “cherry-pick” a fix into the Dev, Beta, or Stable branches to get it out to those more stable branches faster than the normal code-flow promotion. This is done sparingly, as it entails both effort and risk, but it’s a useful capability. If Chrome cherry-picks a fix into its Stable channel and respins, the Edge team does the same as quickly as possible. (This is important because many cherry-picks are fixes for 0-day exploits.)

Code Differences

As mentioned previously, the goal is that faster-updating channels reflect the exact same code as will soon flow into the more-stable, slower-updating channels. If you see a bug in Canary version N, that bug will end up in Stable version N unless it’s reported and fixed first. Other than a different icon and a mention on the edge://version page, it’s often hard to tell which channel is even being used.

However, it’s not quite true that the same code will behave the same way as it flows through the channels. A feature can be coded so that it works differently depending upon the channel.

For example, Edge has a “Domain Actions” feature to accommodate certain websites that won’t load properly unless sent a specific User-Agent header. When you visit a site on the list, Edge will apply a UA-string spoof to make the site work. You can see the list on edge://compat/useragent:

However, this Domain Actions list is applied only in Edge Stable and Beta channels and is not used in Edge Dev and Canary.

Edge rolls out features via a Controlled Feature Rollout process (I’ve written about it previously). The Experimental Configuration Server typically configures the “Feature Enabled” rate in pre-release channels (Canary and Dev in particular) to be much higher (e.g. 50% of Canary/Dev users will have a feature enabled, while 5% of Beta and 1% of Stable users will get it).

Similarly, there exist several “experimental” Extension APIs that are only available for use in the Dev and Canary channels. There are also some UI bubbles (e.g. warning the user about side-loaded “developer-mode” extensions) that are shown only in the Stable channel.

Chrome and Edge offer a UX to become the default browser, but this option isn’t shown in the Canary channel.

Individual features can also take channel into account to allow developer overrides and the like, but such feature overrides tend to be rather niche.

Thanks for helping improve the experience for everyone by self-hosting pre-Stable channels!

-Eric

PS: The Chrome team has a nice article about their channels.

Certificate Revocation in Microsoft Edge

When you visit an HTTPS site, the server must present a certificate, signed by a trusted third party (a Certificate Authority, aka CA), vouching for the identity of the bearer. The certificate contains an expiration date, and is considered valid until that date arrives. But what if the CA later realizes that it issued the certificate in error? Or what if the server’s private key (corresponding to the public key in the certificate) is accidentally revealed?

Enter certificate revocation. Revocation allows the trusted third-party to indicate to the client that a particular certificate should no longer be considered valid, even if it’s unexpired.

There are several techniques to implement revocation checking, and each has privacy, reliability, and performance considerations. Back in 2011, I wrote a long post about how Internet Explorer handles certificate revocation checks.

Back in 2018, the Microsoft Edge team decided to match Chrome’s behavior by not performing online OCSP or CRL checks for most certificates by default.

Wait, What? Why?

The basic arguments are that HTTPS certificate revocation checks:

  • Impair performance (tens of milliseconds to tens of seconds in latency)
  • Impair privacy (CAs could log what you’re checking and know where you went)
  • Are too unreliable to hard-fail (too many false positives on downtime or network glitches)
  • Are useless against most threats when soft-fail (because an active MITM can block the check)

For more context about why Chrome stopped using online certificate revocation checks many years ago, see these posts from the Chromium team explaining their thinking:

Note: Revocation checks still happen

Chromium still performs online OCSP/CRL checks in soft-fail mode, but only for Extended Validation certificates. If the check fails (e.g. an offline OCSP responder), the certificate is just treated as a regular TLS certificate without the EV treatment. Users are very unlikely to ever notice, because the EV treatment, now buried deep in the security UX, is virtually invisible. Notably, however, there is a performance penalty– if your Enterprise blackholes or slowly blocks access to a major CA’s OCSP responder, TLS connections from Chromium will be 🐢 very slow. Update: Chromium has announced that v106+ will no longer perform revocation checks for EV certificates.

Even without online revocation checks, Chromium performs offline checks in two ways.

  1. It calls the Windows Certificate API (CAPI) with an “offline only” flag, such that revocation checks consult previously-cached CRLs (e.g. if Windows had previously retrieved a CRL), and certificate distrust entries deployed by Microsoft.
  2. It plugs into CAPI an implementation of CRLSets, a Google/Microsoft deployed list of popular certificates that should be deemed revoked.

On Windows, Chromium uses the CAPI stack to perform revocation checks. I would expect this check to behave identically to the Internet Explorer check (which also relies on the Windows CAPI stack); notably, I don’t see any attempt to set dwUrlRetrievalTimeout away from the default. For background, see the documentation on how CAPI2 certificate revocation works; it’s also sometimes useful to enable CAPI2 diagnostics.

CRLSets are updated via the Component Updater; if the PC isn’t ever on the Internet (e.g. an air-gapped network), the CRLSet will only be updated when a new version of the browser is deployed. (Of course, in an environment without access to the internet at large, revocation checking is even less useful.)

After Chromium moves to its own built-in certificate verifier, it will also perform revocation checks using its own checker. Today, that checker supports only HTTP-sourced CRLs (the CAPI checker also supports HTTPS, LDAP, and FILE).

Group Policy Options

Chromium (and thus Edge and Chrome) support two Group Policies that control the behavior of revocation checking.

The EnableOnlineRevocationChecks policy enables soft-fail revocation checking for certificates. If the certificate does not contain revocation information, the certificate is deemed valid. If the revocation check does not complete (e.g. inaccessible CA), the certificate is deemed valid. If the certificate revocation check successfully returns that the certificate was revoked, the certificate is deemed invalid.

The RequireOnlineRevocationChecksForLocalAnchors policy allows hard-fail revocation checking for certificates that chain to a private anchor. A “private anchor” is not a public Certificate Authority, but instead, for example, the Enterprise root your company deployed to its PCs for its internal sites or for its Monster-in-the-Middle (MITM) network-traffic inspection proxy. If the certificate does not contain revocation information, the certificate is deemed invalid. If the revocation check does not complete (e.g. inaccessible CA), the certificate is deemed invalid. If the certificate revocation check successfully returns that the certificate was revoked, the certificate is deemed invalid.
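Both policies are normally deployed via the ADMX templates, but as a quick sketch for testing, they can also be set as DWORD values under the browser’s standard policy registry key (shown here for Edge; Chrome reads the equivalents under HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Google\Chrome):

HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Edge

EnableOnlineRevocationChecks (REG_DWORD) = 1
RequireOnlineRevocationChecksForLocalAnchors (REG_DWORD) = 1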

Other browsers

Note: This section may be outdated!

Here’s an old survey of cross-browser revocation behavior.

By default, Firefox still queries OCSP servers for certificates that have a validity lifetime over 10 days. If you wish, you can require hard-fail OCSP checking by navigating to about:config and toggling security.OCSP.require to true. See this wiki for more details. Mozilla also distributes a CRLSet-like list of intermediates that should no longer be trusted, called OneCRL.

For the now-defunct Internet Explorer, you can set a Feature Control registry DWORD to convert the usual soft-fail into a slightly-less-soft fail:

HKEY_CURRENT_USER\SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_WARN_ON_SEC_CERT_REV_FAILED

iexplore.exe=1

Edge Legacy did not have any option for non-silent failure for revocation checks.

New Recipes for 3rd Party Cookies

For privacy reasons, the web platform is moving away from supporting 3rd-party cookies, first with lockdowns, and eventually with removal of support (originally slated for late 2023, since pushed to the second half of 2024).

Background: What Does “3rd-Party” Mean?

A 3rd-party cookie is one that is set or sent from a 3rd-party context on a web page.

A 3rd-party context is a frame or resource whose registrable domain (sometimes called eTLD+1) differs from that of the top-level page. This is sometimes called “cross-site.” In this example:

domain2.com and domain3.com are cross-site 3rd-parties to the parent page served by domain1.com. (In contrast, a resource from sub.domain1.com is cross-origin, but same-site/1st Party to domain1.com).

Importantly, frames or images[1] from domain2.com and domain3.com cannot see or modify the cookies in domain1.com‘s cookie jar, and script running at domain1.com cannot see or set cookies for the embedded domain2.com or domain3.com contexts.

Background: Existing Restrictions

Q: Why do privacy advocates worry about 3rd-party cookies?
A: Because they are a simple way to track a given user’s browsing across the web.

Say a bunch of unrelated sites include ads from an advertising server. A 3rd-party cookie set on the content from the ad will allow that ad server to identify the set of sites that the user has visited. For example, consider three pages the user visits:

The advertiser, instead of simply knowing that their ad is running on Star Trek’s website, is also able to know that this specific user has previously visited sites related to running and a medication, and can thus target its advertisements in a way that the visitor may deem a violation of their privacy.

For this reason, browsers have supported controls on 3rd-party cookies for decades, but they were typically off-by-default or trivially bypassed.

More recently, browsers have started introducing on-by-default controls and restrictions, including the 2020 change that makes all cookies SameSite=Lax by default.

However, none of these restrictions will go as far as browsers will go in the future.

A Full Menu of Replacements

In order to support scenarios that have been built atop 3rd-party cookies for multiple decades, new patterns and technologies will be needed.

The Easy Recipe: CHIPS

In 2020, cookies were made SameSite=Lax by default, blocking them from being set and sent in 3rd-party contexts. The workaround for web developers who still needed cookies in 3rd-party contexts was simple: when a cookie is set, adding the attribute SameSite=None disables the new behavior and allows the cookie to be set and sent freely. Over the course of the last two years, most sites that cared about their cookies began sending the attribute.

The CHIPS proposal (“Cookies Having Independent Partitioned State”) offers a new but more limited escape hatch– a developer may opt in to partitioning their cookie so that it’s no longer a “3rd-party cookie” but instead a partitioned cookie. A partitioned cookie set in the context of domain3.com embedded inside runnersworld.com will not be visible in the context of domain3.com embedded inside startrek.com. Similarly, setting the cookie in the context of domain3.com embedded inside gas-x.com will have no impact on the cookie’s value in the other two pages. If the user visits domain3.com as a top-level browser navigation, the cookies that were set on that origin’s subframes in the context of other top-level pages remain inaccessible.

Using the new Partitioned attribute is simple; just add it to your Set-Cookie header like so:

Set-Cookie: __Host-id=4d5e6; Partitioned; SameSite=None; Secure; Path=/

Support for CHIPS is expected to be broad, across all major browsers.

I was initially a bit skeptical about requiring authors to explicitly specify the new attribute– why not just treat all cookies in 3rd-party contexts as partitioned? I eventually came around to the arguments that an explicit declaration is desirable. As it stands, legacy applications already needed to be updated with a SameSite=None declaration, so we probably wouldn’t keep any unmaintained legacy apps working if we didn’t require the attribute.

The Explicit Recipe: The Storage Access API

The Storage Access API allows a website to request permission to use storage in a 3rd party context. Microsoft Edge joined Safari and Firefox with support for this API in 2020 as a mechanism for mitigating the impact of the browser’s Tracking Prevention feature.
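As a rough sketch, an embedded frame requests access from a user-gesture handler, something like this (prompting behavior varies by browser):

// Sketch: inside a cross-site <iframe>, typically from a click handler.
async function ensureCookieAccess() {
  if (await document.hasStorageAccess()) return true;  // access already granted
  try {
    await document.requestStorageAccess();             // may show a permission prompt
    return true;                                       // unpartitioned cookies now available
  } catch {
    return false;                                      // user or browser policy declined
  }
}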

The Storage Access API has a lot going for it, but lack of universal support from major browsers means that it’s not currently a slam-dunk.

A Niche Recipe: First Party Sets

In some cases, the fact that cookies are treated as “3rd-party” represents a technical limitation rather than a legal or organizational one. For example, Microsoft owns xbox.com, office.com, and teams.microsoft.com, but these origins do not today share a common eTLD+1, meaning that pages from these sites are treated as cross-site 3rd-parties to one another. The First Party Sets proposal would allow sites owned and operated by a single entity to be treated as first-party when it comes to privacy features.

Originally, a new cookie attribute, SameParty, would allow a site to request inclusion of a cookie when the cross-origin sub-resource’s context is in the same First Party Set as the top-level origin, but a recent proposal removes that attribute.

The Authentication Recipe: FedCM API

As I explained three years ago, authentication is an important use-case for 3rd-party cookies, but it’s hampered by browser restrictions on 3P cookies. The Federated Credential Management API proposes that browsers and websites work together to imbue the browser with awareness and control of the user’s login state on participating websites. As noted in Google’s explainer:

We expect FedCM to be useful to you only if all these conditions apply:

  1. You’re an identity provider (IdP).
  2. You’re affected by the third-party cookie phase out.
  3. Your Relying Parties are third-parties.

FedCM is a big, complex, and important specification that aims to solve exclusively authentication scenarios.
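As a rough sketch, the relying party’s page invokes the browser-mediated sign-in like this (the configURL and clientId values are placeholders that your identity provider would issue):

// Sketch: FedCM call on the relying party's page.
const credential = await navigator.credentials.get({
  identity: {
    providers: [{
      configURL: "https://idp.example/fedcm.json",  // placeholder IdP config location
      clientId: "rp-client-id-123",                 // placeholder RP registration with the IdP
    }],
  },
});
// credential.token is then sent to the relying party's server for validation with the IdP.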

Complexity Abounds

The move away from supporting 3rd-party cookies has huge implications for how websites are built. Maintaining compatibility for desirable scenarios while meaningfully breaking support for undesirable scenarios (trackers) is inherently extremely challenging– I equate it to trying to swap out an airliner’s engines while the plane is full of passengers and in-flight.

Combinatorics

As we add multiple new approaches to address the removal of 3P cookies, we must carefully reason about how they all interact. Specifications need to define how the behavior of CHIPS, First-Party-Sets, and the Storage Access API all intersect, for example, and web developers must account for cases where a browser may support only some of the new features.

Cookies Aren’t The Only Type of Storage

Another complexity is that cookies aren’t the only form of storage– IndexedDB, localStorage, sessionStorage, and various other cookie-like storages all exist in the web platform. Limiting only cookies without accounting for other forms of storage wouldn’t get us to where we want to be on privacy.

That said, cookies are one of the more interesting forms of storage when it comes to privacy, as they

  1. are sent to the server before the page loads,
  2. operate without JavaScript enabled,
  3. operate in cases like <img> elements where no script-execution context exists
  4. etc.

Cookies Are Special

Another interesting aspect of migrating scenarios away from cookies is that we lose some of the neat features that have been added over the years.

One such feature is the HTTPOnly declaration, which prevents a cookie from being accessible to JavaScript. This feature was designed to blunt the impact of a cross-site scripting attack — if script injected into a compromised page cannot read a cookie, that cookie cannot be leaked out to a remote attacker. The attacker is forced to abuse the XSS’d page immediately (“a sock-puppet browser”), limiting the sorts of attacks that can be undertaken. Some identity providers demand that their authentication tokens be carried only via HTTPOnly cookies, and if an authentication token must be available to JavaScript directly, the provider mints that token with a much shorter validity lifetime (e.g. one hour instead of one week).
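For example, an identity provider might mint its session cookie like this (the name and value are illustrative):

Set-Cookie: __Host-session=4d5e6f; Secure; HttpOnly; Path=/; SameSite=Lax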

Another cookie feature is TLS Token Binding, an obscure capability that attempts to prevent token theft attacks from compromised PCs. If malware or a malicious insider steals Token-bound cookie data directly from a PC, that cookie data will not work from another device because the private key material used to authenticate the cookies cannot be exported off of the compromised client device. (This non-exportability property is typically enforced by security hardware like a TPM.) While Token binding provides a powerful and unique capability for cookies, for various reasons the feature is not broadly supported.

Deprecating 3rd-Party Cookies is Not a Panacea

Unfortunately, getting rid of 3rd-party cookies doesn’t mean that we’ll be rid of tracking. There are many different ways to track a user, ranging from the obvious (they’re logged in to your site, they have a unique IP address) to the obscure (various fingerprinting mechanisms). But getting rid of 3rd-party cookies is a valuable step as browser makers work to engineer a privacy sandbox into the platform.

It’s a fascinating time in the web platform privacy space, and I can’t wait to see how this all works out.

-Eric

[1] Interestingly, if domain1.com includes a <script> element pointed at a resource from domain2.com or domain3.com, that script will run inside domain1.com‘s context, such that calls to the document.cookie DOM property will return the cookies for domain1.com, not the domain that served the script. But that’s not important for our discussion here.

My Next Opportunity

This is the farewell email I sent to my Edge teammates yesterday.


IWebBrowser3::BeforeNavigate()

When I left the Internet Explorer team in 2012 to work on Fiddler full-time, I did so with a measure of heartbreak, absolutely certain that I would never be quite as good at anything else. When I came back to the Edge team in 2018, I looked back with amusement at the naïveté of my earlier melancholy. I had learned a huge amount during my six years away, and I brought new skills and knowledge to bear on the ambitious challenge of replatforming Edge atop Chromium. While it’s still relatively early days, our progress over these last four years has truly been amazing—we’ve adopted tens of millions of lines of code as our own, grown the team, built a batteries-included product superior to the market leader, and started winning share for the first time in years. More importantly, we’ve modernized our team culture: more inclusive, heavily invested in learning, with faster experimentation and more transparent public communication. It’s been an inspiring journey.

In the fifty months since my return, I’ve written 124 blog posts and landed 168 changelists in upstream Chromium (plus one or two downstream :), dwarfing the 94 CLs I landed back when I was a Chrome engineer. I had the honor of leading PMs in both the Pixels and Bytes subteams in Web Platform, presented the Edge Privacy Story, and travelled around the world (Lyon and Fukuoka) for W3C TPAC meetings. I’ve had the opportunity to help many other teams as a member of “Microsoft’s team in Chromium”, and to engage directly with Enterprise customers as they migrated off of IE and onto a modern standards-based web platform. I’ve helped to interview and hire a set of awesome new PMs. Throughout it all, I’ve strived to maximize my impact to benefit the billions of humans who browse the web.

This Friday (July-22-2022) will be the last day of my current tour. I leave things in good hands: Erik is an amazing engineering manager, and I’ll miss racing him to discover the root cause of gnarly networking problems. I’ve spent this second tour doing my very best to write everything down– if anything, I’m but a caching proxy server for my archive of blog posts. I didn’t write an encyclopedic guide on ramping up on browser dev, or an opinionated set of career advice just for fun— I’ve been quietly working to keep my bus factor as low as possible. I encourage everyone to take full advantage of the democratization of knowledge-sharing provided by our internal wikis and public docs site—seize every opportunity to “leave it better than you found it.”

Thank you all for the years of awesome collaborations on building a browser to delight our users.

Next Monday, I’ll be moving over to join some old friends on Microsoft’s Web Protection team, working to help protect users from all manner of internet-borne threats.

I’m not going far; please stay in touch via Twitter, LinkedIn, or good old-fashioned email.

Until next time,

-@ericlaw

Edge URL Schemes

The microsoft-edge: Application Protocol

Microsoft Edge implements an Application Protocol with the scheme microsoft-edge: that is designed to launch Microsoft Edge and pass along a web-schemed URL and/or additional arguments. A basic invocation might be as simple as:

microsoft-edge:http://example.com/
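You can try the handoff yourself from a Windows command prompt, where start passes the URL to the registered protocol handler:

start microsoft-edge:http://example.com/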

However, as is often the case with things I choose to write about, there’s a bit of hidden complexity that may not be immediately obvious.

Non-Public

The purpose of this URL scheme is to enable Windows and cooperating applications to invoke particular user-experiences in the Edge browser.

This scheme is not considered “public” — there’s no official documentation of the scheme, and the Edge team makes no guarantees about its behavior. We can (and do) add or modify functionality as needed to achieve desired behaviors.

Over the last few years, we’ve added a variety of functionality to the scheme, including the ability to invoke UX features, launch into a specific user profile, and implement other integration scenarios. By way of example, Windows might advertise the Edge Surf game and, if the user chooses to play, the game is launched by ShellExecuting the URL microsoft-edge:?ux=surf_game.

Because of the non-public and inherently unstable (not backward-compatible) nature of this URL scheme, it is not an extensibility point, and configuring its handler to be anything other than Microsoft Edge is not supported.

Under the hood: handling of this scheme can be found in Edge’s non-public version of the StartupBrowserCreator::ProcessCmdLineImpl function I wrote about recently as a part of my post on Chromium Startup.

Tricky Bits

One (perhaps surprising) restriction on the microsoft-edge scheme is that it cannot be launched from inside Edge itself. If a user inside Edge clicks a link to the microsoft-edge: scheme, nothing visibly happens. Only if they open the F12 Console will they see an error message:

The microsoft-edge protocol is blocked inside Edge itself to avoid “navigation laundering” problems, whereby going through the external handler path would result in loss of context. Losing the context of a navigation can introduce both security vulnerabilities and opportunities for abuse. For example, a popup blocker bypass existed on Android when Android Chrome failed to block the Chrome version of this protocol. The Edge WebView2 control also blocks navigation to the protocol, although I expect that an application which wants to allow it can probably do so with the appropriate event handlers.

Another tricky bit concerns the fact that a user may have multiple different channels of Edge (Stable, Beta, Dev, Canary) installed, but the microsoft-edge: protocol can only be claimed by one of them. This can be potentially confusing if a user has different channels selected to handle https: links and microsoft-edge: links:

…because some links will open in Edge Canary while others will open in Edge Beta.


The edge: Built-In Scheme

Beyond the aforementioned application protocol, Microsoft Edge also supports a Built-In Scheme named edge:. In contrast to the microsoft-edge: application protocol, this scheme is only available within the browser: you cannot invoke an edge: URL elsewhere in Windows, or pass it to Edge as a command-line argument.

The edge: scheme is simply an alias for the chrome and about schemes used in Chromium to support internal pages like about:flags, about:settings, and similar (see edge:about for a list).

For security reasons, regular webpages cannot navigate to or load subresources from the edge/chrome schemes. Years ago, a common exploit pattern was to navigate to chrome:downloads and then abuse its privileged WebUI bindings to escape the browser sandbox. There are also special debug urls like about:inducebrowsercrashforrealz that will do exactly as they say.