
Yesterday, we covered the mechanisms that modern browsers can use to rapidly update their release channels. Today, let’s look at how to figure out when an eagerly awaited fix will become available in the Canary channels.

By way of example, consider crbug.com/977805, a nasty beast that caused some extensions to randomly be disabled and marked corrupt:

[Screenshot: an extension disabled and marked corrupt]

By bisecting the builds (the topic of a future post) to find where the regression was introduced, we discovered that the problem was the result of a commit with hash fa8cdc81f5 that landed back on May 20th. This (probably security) change exposed an earlier bug in Chromium’s extension verification system: an aborted request for a resource in an extension (say, because a page was torn down just as a content script was being injected) resulted in the verification logic concluding that the extension’s resource file was corrupted on disk.

On July 12th, the area owner landed a fix with the commit hash of cad2f6468. But how do I know whether my browser has this fix already? In what version(s) did the fix get released?

To answer these questions, we turn back to our trusted OmahaProxy. In the Find Releases box at the bottom of the page, paste the full or partial hash value and hit the Find Releases button:

[Screenshot: the Find Releases box with the fix’s commit hash pasted in]

The system will churn for a bit and then return the following page:

[Screenshot: Find Releases results for the fix’s commit hash]

So, now we know two things: 1) the fix will be in Chromium-based browsers with version numbers later than 77.0.3852.0, and 2) so far, the fix has landed only there and hasn’t been merged elsewhere.

Does it need to be merged? Let’s figure out where the original regression landed by running the same tool on the regressing changelist’s hash:

[Screenshot: Find Releases results for the regressing commit hash]

We see that the regression originally landed in Master before the Chrome 76 branch point, so the bug is in Chrome 76.0.3801 and later. That means that after the fix is verified, we’ll need to request that it be merged from Master, where it landed, over to the 76 branch where it’s also needed.

We can see what that’ll look like by looking at the fix for crbug.com/980803. This regression in the layout engine was fixed by a1dd95e43b5 in 77, but needed to be put into Chromium 76 as well. So it was, and the result is shown as:

[Screenshot: Find Releases showing the fix merged to the 76 branch]

Note: It’s possible for a merge to be performed but not show up here. The tool looks for a particular string in the merge’s commit message, and some developers accidentally remove or alter it.

Finally, if you’re really champing at the bit for a fix, you might run Find Releases on a commit hash and see:

[Screenshot: Find Releases reporting that the hash is not yet in any release]

Assuming you didn’t mistype the hash, what this means is that the fix isn’t yet in the Canary channel. If you were to clone the Chromium master @HEAD and build it yourself, you’d see the fix, but it’s not yet in a public Canary. In almost all cases, you’ll need to wait until the next morning (Pacific time) to get an official channel build with the fix.

Now, so far we’ve mostly focused on Chrome, but what about other Chromium-based browsers?

Things are mostly the same, with the caveat that most other Chromium-based browsers are usually days to weeks to (gulp) months behind Chrome Canary. Is the extensions bug fixed yet in my Edge Canary?

The simplest (and generally reliable) way to check is to just look at the Chrome token in the browser’s user agent string by visiting edge://version or using my handy Show Chrome Version browser extension. As you can see in both places, Edge 77.0.220.0 Canary is based on Chromium 77.0.3843, a bit behind the 77.0.3852 version containing the extensions verification fix:

[Screenshot: edge://version and the Show Chrome Version extension reporting the underlying Chromium version]

So, I’ll probably have to wait a few days to get this fix into my browser.

Note that it’s possible for Microsoft and other Chromium embedders to “cherry-pick” critical fixes into our builds before our merge pump naturally pulls them down from upstream, but this is a relatively rare occurrence for Edge Canary. 


tl;dr: OmahaProxy is awesome!

-Eric

Many websites offer a “Log in” capability where they don’t manage the user’s account; instead, they offer visitors the ability to “Login with <identity provider>.”

When the user clicks the Login button on the original relying party (RP) website, they are navigated to a login page at the identity provider (IP) (e.g. login.microsoft.com) and then redirected back to the RP. That original site then gets some amount of the user’s identity info (e.g. their name and a unique identifier), but it never sees the user’s password.

Such Federated Identity schemes have benefits for both the user and the RP site– the user doesn’t need to set up yet another password and the site doesn’t have to worry about the complexity of safely storing the user’s password, managing forgotten passwords, etc.

In some cases, the federated identity login process (typically implemented as a JavaScript library) relies on navigating the user to a top-level page to log in, then back to the relying party website into which the library injects an IFRAME[1] back to the identity provider’s website.

[Diagram: the federated login flow between the RP page and the IP subframe]

The authentication library in the RP top-level page communicates with the IP subframe (using postMessage or the like) to get the logged-in user’s identity information, API tokens, etc.

In theory, everything works great. The IP subframe in the RP page knows who the user is (by looking at its own cookies or HTML5 localStorage or indexedDB data) and can release to the RP caller whatever identity information is appropriate.

Crucially, however, notice that this login flow is entirely dependent upon the assumption that the IP subframe is accessing the same set of cookies, HTML5 storage, and/or indexedDB data as the top-level IP page. If the IP subframe doesn’t have access to the same storage, then it won’t recognize the user as logged in.

Unfortunately, this assumption has been problematic for many years, and it’s becoming even more dangerous over time as browsers ramp up their security and privacy features.

The root of the problem is that the IP subframe is considered a third-party resource, because it comes from a different domain (identity.example) than the page (news.example) into which it is embedded.

For privacy and security reasons, browsers might treat third-party resources differently than first-party resources. Examples include:

  1. The Block 3rd Party cookies option in most browsers
  2. The SameSite Cookie attribute
  3. P3P cookie blocking in Internet Explorer[2]
  4. Zone Partitioning in Internet Explorer and Edge Spartan[3]
  5. Safari’s Intelligent Tracking Protection
  6. Firefox Content Blocking
  7. Microsoft Edge Tracking Prevention

When a browser restricts access to storage for a 3rd party context, our theoretically simple login process falls apart. The IP subframe on the relying party doesn’t see the user’s login information because it is loaded in a 3rd party context. The authentication library is likely to conclude that the user is not logged in, and redirect them back to the login page. A frustrating and baffling infinite loop may result as the user is bounced between the RP and IP.

The worst part of all of this is that a site’s login process might usually work, but fail depending on the user’s browser choice, browser configuration, browser patch level, security zone assignments, or security/privacy extensions. As a result, a site owner might not even notice that some fraction of their users are unable to log in.

So, what’s a web developer to do?

The first task is awareness: Understand how your federated login library works — is it using cookies? Does it use subframes? Is the IP site likely to be considered a “Tracker” by popular privacy lists?

The second task is to build designs that are more resilient to 3rd-party storage restrictions:

  • Be sure to convey the expected state from the Identity Provider’s login page back to the Relying Party. E.g. if your site automatically redirects from news.example to identity.example/login back to news.example/?loggedin=1, the RP page should take note of that URL parameter. If the authentication library still reports “Not signed in”, avoid an infinite loop and do not redirect back to the Identity Provider automatically.
  • Authentication libraries should consider conveying identity information back to the RP directly, which will then save that information in a first-party context. For instance, the IP could send the identity data to the RP via an HTTP POST, and the RP could then store that data using its own first-party cookies.
  • For browsers that support it, the Storage Access API may be used to allow access to storage that would otherwise be unavailable in a 3rd-party context. Note that this API might require action on the part of the user (e.g. a frame click and a permission prompt).

The final task is verification: Ensure that you’re testing your site in modern browsers, with and without the privacy settings ratcheted up.

-Eric

[1] The call back to the IP might not use an IFRAME; it could also use a SCRIPT tag to retrieve JSONP, or issue a fetch/XHR call, etc. The basic principles are the same.
[2] P3P was removed from IE11 on Windows 10.
[3] In Windows 10 RS2, Edge 15 “Spartan” started sharing cookies across Security Zones, but HTML5 Storage and indexedDB remain partitioned.

As we rebuild Microsoft Edge atop the Chromium open-source platform, we are working through various scenarios that behave differently in the new browser. In most cases, such scenarios also worked differently between 2018’s Edge (aka “Spartan”) and Chrome, but users either weren’t aware of the difference (because they used Trident-derived browsers inside their enterprise) or were aware and simply switched to a Microsoft browser for certain tasks.

One example of a behavioral gap is related to running ClickOnce apps. ClickOnce is a Microsoft application deployment framework that aims to allow installation of native-code applications from the web in (around) one click.

Chrome and Firefox can successfully install and launch ClickOnce’s .application files if the .application file specifies a deploymentProvider element with a codebase attribute (example):

[Screenshot: the installation prompt when opening an .application file]

However, it’s also possible to author and deploy an .application that doesn’t specify a deploymentProvider element (example). Such files launch correctly from Internet Explorer and pre-Chromium Edge, but fail in Firefox and Chrome with an error message:

[Screenshot: ClickOnce fails for a downloaded .application file]

So, what gives? Why does this scenario magically work in Edge Spartan but not Firefox or Chrome?

The secret can be found in the EditFlags for the Application.Manifest ProgId (to which the .application filename extension and application/x-ms-application MIME type are mapped):

[Screenshot: registry settings for the Application.Manifest ProgId]

The EditFlags contain the FTA_AlwaysUseDirectInvoke flag, which is documented on MSDN as:

FTA_AlwaysUseDirectInvoke 0x00400000
Introduced in Windows 8. Ensures that the verbs for the file type are invoked with a URL instead of a downloaded version of the file. Use this flag only if you’ve registered the file type’s verb to support DirectInvoke through the SupportedProtocols or UseUrl registration.

If you peek in the Application.Manifest’s Shell\Open\Command value, you’ll find that it calls for running the ShOpenVerbApplication function inside dfshim.dll, passing along the .application file’s path or URL in a parameter (%1):

“C:\Windows\System32\rundll32.exe” “C:\Windows\System32\dfshim.dll”,ShOpenVerbApplication %1

And therein lies the source of the behavioral difference.

When you download and open an Application.Manifest file from Edge Spartan, it passes the source URL for the .application to the handler. When you download the file in Firefox or Chrome, it passes the local file path of the downloaded .application file. With only the local file path, the ShOpenVerbApplication function doesn’t know how to resolve the relative references in the Application Manifest’s XML and the function bails out with the Cannot Start Application error message.

Setting FTA_AlwaysUseDirectInvoke also has the side-effect of removing the “Save” button from Edge’s download manager:

[Screenshot: Edge’s download manager without a Save button]

…helping prevent the user from ending up with a downloaded .application file that won’t work when opened from the Downloads folder outside the browser (since the file’s original URL isn’t readily available to Windows Explorer).

Advice to Publishers

If you’re planning to distribute your ClickOnce application from a website, specify the URL in Visual Studio’s ClickOnce Publish Wizard:

[Screenshot: specifying “From a Web site” in the ClickOnce Publish Wizard]

This will ensure that even if DirectInvoke isn’t used, ShOpenVerbApplication can still find the files needed to install your application.

Workarounds

A company called Meta4 offers a Chrome browser extension that aims to add fuller support for ClickOnce to Chrome. The extension comes in two pieces– a traditional JavaScript extension and a trivial “native” executable (written in C#) that simply invokes the ShOpenVerbApplication call with the URL. The JavaScript extension launches and communicates with the native executable running outside of the Chrome sandbox using Native Messaging.
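
For illustration, here’s a minimal C++ sketch of what that native piece boils down to (the real helper is written in C#; the idea of simply re-invoking the registered rundll32 command with the URL substituted for %1 is mine):

// Hypothetical helper: forward the original .application URL (argv[1]) to the
// same command that the Application.Manifest verb registers, substituting the
// URL for the local file path (%1) that a browser download would supply.
#include <windows.h>
#include <string>
#pragma comment(lib, "shell32.lib")

int wmain(int argc, wchar_t* argv[]) {
    if (argc < 2) return 1;  // Expects the .application URL as the only argument.
    std::wstring args = L"dfshim.dll,ShOpenVerbApplication ";
    args += argv[1];
    ShellExecuteW(nullptr, L"open", L"rundll32.exe", args.c_str(),
                  nullptr, SW_SHOWNORMAL);
    return 0;
}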

Unfortunately, the extension is a bit hacky– it installs a blocking onBeforeRequest handler which watches all requests (not just downloads), and if the target URL’s path component ends in .application, it invokes the native executable. Alas, it’s not really safe to make any assumptions about file extensions in URLs (the web is based on MIME types rather than filenames).

Next Steps

For the Edge team– TBD.

Do you use ClickOnce to deploy your applications? If so, are you specifying the deployment URL in the manifest file?

-Eric

PS: Notably, Internet Explorer doesn’t rely upon the DirectInvoke mechanism; removing the EditFlags value entirely causes IE to show an additional prompt but the install still succeeds. That’s because IE activates the file using a MIME handler (see the CLSID subkey of Application.Manifest) much like it does for .ZIP files. The DirectInvoke mechanism was invented, in part, to replace the legacy MIME handler mechanism.

This issue report complains that Edge doesn’t stream AAC files and instead tries to download them. It notes that, in contrast, URLs that point to MP3s result in a simple audio player loading inside the browser.

Edge has always supported AAC, so what’s going on?

The issue here isn’t about AAC, per se; it’s instead about whether or not the browser, upon direct navigation to an audio stream, will accommodate it by generating a wrapper HTML page with an <audio> element pointed at the audio stream’s URL.

[Screenshot: the generated wrapper page with a simple audio player]

A site that wants to play streaming AAC in Edge (or, frankly, any media type, for any browser) should consider creating an HTML page with an appropriate Audio or Video element pointed at the stream.

The list of audio types for which Edge will automatically generate a wrapper page does not include AAC:

audio/mp4, audio/x-m4a, audio/mp3, audio/x-mp3, audio/mpeg,
audio/mpeg3, audio/x-mpeg, audio/wav, audio/wave, audio/x-wav,
audio/vnd.wave, audio/3gpp, audio/3gpp2

In contrast, Chrome creates the MediaDocument page for a broader set of known audio types:

static const char* const kStandardAudioTypes[] = {
    "audio/aac", "audio/aiff", "audio/amr", "audio/basic", "audio/flac",
    "audio/midi", "audio/mp3", "audio/mp4", "audio/mpeg", "audio/mpeg3",
    "audio/ogg", "audio/vorbis", "audio/wav", "audio/webm", "audio/x-m4a",
    "audio/x-ms-wma", "audio/vnd.rn-realaudio", "audio/vnd.wave"};

If the response sends Content-Type: application/octet-stream, includes a Content-Disposition: attachment header, or the anchor <a> element that leads to the media bears a download attribute, Edge will download the media file instead of playing it in the browser.

Note: In Windows 10 RS5, the extension model is capable enough that it’s possible to write a browser extension that intercepts navigation directly to audio/video Media types and renavigates to a wrapper page. [Sample code]

-Eric

PS: Edge has similar special handling for video types:

"application/mp4","video/mp4","video/x-m4v","video/3gpp",
"video/3gpp2","video/quicktime"


My oldest supported Windows application is a launcher app named SlickRun, and it’s ~24 years old this year. I haven’t done much to maintain it over the last few years, although it’s now available in 64-bit and runs great on Windows 10. (Thanks go to Embarcadero, who now offer a free “Community” edition of Delphi, the language/platform I ported SlickRun to circa 1994).

I still fix bugs in SlickRun from time to time, and as I was playing with Rust a few days ago I was reminded of one of the oldest limitations in my code– if you update your system’s %PATH% variable, those changes aren’t seen by applications/consoles spawned by SlickRun until you restart SlickRun. This is particularly annoying because it’s so unexpected– users expect that command consoles launched by Win+R,cmd.exe,Enter will behave the same way as those launched by Win+Q,cmd,Enter, but the former consoles have the updated %PATH% while the latter do not.

While ShellExecute() sounds like it’s an API that causes the shell (aka Explorer) to execute something, in fact it does nothing of the sort.

Updating the Environment Block

The root cause of the “outdated path” problem is that processes launched via ShellExecute inherit the environment variables of their spawning process, and those environment variables (typically) are assigned as the process launches and never touched again. Because SlickRun starts with Windows, the %PATH% when it starts is the %PATH% that every process it launches inherits. (You can easily view a process’ environment block using the Properties > Environment tab in Process Explorer).

So, how does Explorer detect the change? That part I figured out ages ago– after updating an environment variable, the System Properties > Environment Variables Control Panel UI (or the SetX.exe console tool) broadcasts a WM_SETTINGCHANGE message to all top-level windows with an lparam containing the string “Environment”. I could easily add code to SlickRun to detect that the variables had changed, but for decades I didn’t really know what to do next… I didn’t know how to read the updated variables (without doing something hacky like restarting the process) nor ensure that they were passed to the applications spawned by ShellExecute.
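
That broadcast looks roughly like this (a sketch of what the Control Panel UI and SetX do after writing the registry, not their actual source):

// Notify every top-level window that an environment variable has changed.
SendMessageTimeoutW(HWND_BROADCAST, WM_SETTINGCHANGE, 0,
                    reinterpret_cast<LPARAM>(L"Environment"),
                    SMTO_ABORTIFHUNG, 5000 /* per-window timeout, ms */, nullptr);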

Yesterday, I got fed up and started Googling. A few posts on StackOverflow mentioned a promising-sounding function, RegenerateUserEnvironment. And while that function appears to be undocumented, there’s an amazing issue filed in an open-source tracker that explains exactly how Windows Explorer uses this function– basically, just wait for the WM_SETTINGCHANGE event, then call the API. RegenerateUserEnvironment replaces the calling process’ current environment block with the latest values.
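
Here’s a minimal C++ sketch of that approach. RegenerateUserEnvironment is undocumented, so the signature below comes from those public notes and should be treated as an assumption:

#include <windows.h>

// Undocumented shell32.dll export; signature is an assumption (see above).
typedef BOOL (WINAPI* RegenerateUserEnvironmentFn)(void** newEnv, BOOL setCurrent);

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam) {
    if (msg == WM_SETTINGCHANGE && lParam &&
        lstrcmpiW(reinterpret_cast<LPCWSTR>(lParam), L"Environment") == 0) {
        if (HMODULE shell32 = LoadLibraryW(L"shell32.dll")) {
            auto regen = reinterpret_cast<RegenerateUserEnvironmentFn>(
                GetProcAddress(shell32, "RegenerateUserEnvironment"));
            void* newEnv = nullptr;
            if (regen)
                regen(&newEnv, TRUE);  // TRUE: adopt the rebuilt block as our own
        }
    }
    return DefWindowProcW(hwnd, msg, wParam, lParam);
}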

Launching at Medium Integrity

While we’re on the topic of executing applications “like the shell”, another scenario came up twelve years ago when Windows Vista was first introduced. The SlickRun installer, written in NSIS, launches SlickRun when installation completes. Unfortunately, the installer runs with Admin rights (High integrity), which means that, by default, all of the programs it launches inherit that integrity. For SlickRun, this is especially bad because it means that any programs that it, in turn, launches during that first session (e.g. your browser!) will run at High integrity too. Not good.

While you can easily use the “RunAs” verb with ShellExecute to launch a High integrity application from a Medium integrity application, there (depressingly) isn’t a way to do the opposite. For years, the official recommendation was to do some fancy coding to clone Explorer’s tokens and use those. Unfortunately, this is quite complicated to implement, especially within a NSIS script.

As it turns out, however, there’s a trivial workaround which works quite well– while ShellExecute doesn’t run things as the shell, applications can easily get Explorer to launch anything they like at Explorer’s integrity. The trick is to simply invoke explorer.exe and pass the filename to be executed as the first command line argument:
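
A C++ sketch of the call (the target path is illustrative):

// Explorer, already running at Medium integrity, treats its first argument
// like a double-clicked file, so the child inherits Explorer's integrity.
ShellExecuteW(nullptr, L"open", L"explorer.exe",
              L"\"C:\\Program Files\\SlickRun\\sr.exe\"",
              nullptr, SW_SHOWNORMAL);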

While this approach isn’t technically supported, I expect it is likely to continue to work for the foreseeable future.


It’s depressing that together these tricks have taken me almost twenty years to discover, but I’m happy that I have. I hope they help you out.

-Eric

In yesterday’s episode, I shared the root cause of a bug that can cause document.cookie to incorrectly return an empty string if the cookie is over 1kb and the cookie grows in the middle of a DOM document.cookie getter operation.

Unfortunately, that simple bug wasn’t the root cause of the compatibility problem that I was investigating when my code-review uncovered it. The observed compatibility bug was slightly different– in the repro case, only one of the document’s cookies goes missing, and it goes missing even when only one page is setting the cookie.

After the brain-melting exercise of annotating the site’s minified framework libraries (console.log(‘…’) ftw!) via Fiddler’s AutoResponder, I found that the site uses the document.cookie API to save the same cookie (named “ld”) three times in a row, adding some information to the cookie each time. However, the ld cookie mysteriously disappears between 0.4 and 6 milliseconds after it gets set the third time. I painstakingly verified that the cookie wasn’t getting manipulated from any other context when it disappeared.

Hmm…

As I wrote up the investigation notes, I idly noted that due to a trivial typo in the website’s source code, the ld cookie was set first as a Persistent cookie, then (accidentally) as a Session cookie, then as a Persistent cookie.

In re-reading the notes an hour later, again my memory got tickled. Hadn’t I seen something like this before?

Indeed, I had. Just about five years ago, a user reported a similar bug where an HTTP response contained two Set-Cookie headers for the same cookie name and Internet Explorer didn’t store either cookie. I built a reduced test case and reported it to the engineering team.

Pushing Cookies

The root cause of the cookie disappearance relates to the Internet Explorer and Edge “loosely-coupled architecture.”

In IE and Edge, each browser tab process runs its own networking stack, in-process[1]. For persistent cookies, this poses no problem, because every browser process hits the same WinINET cookie storage area and gets back the latest value of the persistent cookie. In contrast, for session cookies, there’s a challenge. Session cookies are stored in local (per-process) variables in the networking code, but a browser session may include multiple tab processes. A Session cookie set in a tab process needs to be available in all other tab processes in that browser session.

As a consequence, when a tab writes a Session cookie, Edge must send an interprocess communication (IPC) message to every other process in the browser session, telling each to update its internal variables with the new value of the Session cookie. This Cookie Pushing IPC is asynchronous, and if the named cookie is modified again in a process before the IPC announcing the earlier update arrives, the stale IPC obliterates the later update.

The Duplicate Set-Cookie header version of this bug got fixed in the Fall 2017 Update (RS3) to Windows 10, and thus my old Set-Cookie test case no longer reproduces the problem.

Unfortunately, it turns out that the RS3 fix only corrected the behavior of the network stack when it encounters this pattern– if the cookie-setting calls are made via document.cookie, the problem reappears, as in this document.cookie test case.

[Screenshot: the repro page showing the HOT cookie disappearing]

Playing with the repro page, you’ll notice that manually pushing “Set HOT as a Session cookie” or “Set as a Persistent cookie” works fine, because your puny human reflexes aren’t faster than the cookie-pushing IPC. But when you push the “Set twice” button that sets the cookie twice in fast succession, the HOT cookie disappears in Edge (and in IE11, if you have more than one tab open).

Until this bug is fixed, avoid using document.cookie to change a persistent cookie to a session cookie.

-Eric

[1] In contrast, in Chrome, all networking occurs in the browser process (or a networking-only process), and if a tab process wants to get the current document.cookie, it must perform an IPC to ask the browser process for the cookie value. We call this “cookie pulling.”

Many classic Windows APIs accept a pointer to a byte buffer and a pointer to an integer indicating the size of the buffer. If the buffer is large enough to hold the data returned from the API, the buffer is filled and the call succeeds. If the buffer supplied is not large enough to hold all of the data, the API instead returns ERROR_INSUFFICIENT_BUFFER, updating the supplied integer with the length of the buffer required. The client is expected to reallocate a new buffer of the specified size and call the API again with the new buffer and length.

For example, the InternetGetCookieEx function, used to query the WinINET networking stack for cookies for a given URL, is one such API. The GetExtendedTcpTable function, used to map sockets to processes, is another.

The advantage of APIs with this form is that you can call the API with a reasonably-sized stack buffer and avoid the cost of a heap allocation unless the stack buffer happens to be too small.

In the case of Internet Explorer and Edge, the document.cookie DOM API getter’s implementation first calls the InternetGetCookieEx API with a 1024 WCHAR buffer. If the buffer is big enough, the cookie string is then immediately returned to the page.

However, if ERROR_INSUFFICIENT_BUFFER is returned instead (and the size needed is 10240 characters (MAX_COOKIE_LEN) or fewer), the implementation will allocate a new buffer on the heap and call the API again. If the second call succeeds, the cookie string is returned to the page; if any error is returned, an empty string is returned to the page instead.
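
Reconstructed as a sketch (hypothetical function name– the real implementation is internal to the browser), the logic is one stack-buffer attempt and one heap retry, with no loop:

#include <windows.h>
#include <wininet.h>
#include <string>
#pragma comment(lib, "wininet.lib")

constexpr DWORD kMaxCookieLen = 10240;  // MAX_COOKIE_LEN

std::wstring GetDocumentCookie(LPCWSTR url) {
    WCHAR stackBuf[1024];
    DWORD size = ARRAYSIZE(stackBuf);
    if (InternetGetCookieExW(url, nullptr, stackBuf, &size, 0, nullptr))
        return stackBuf;  // Common case: the cookies fit in the stack buffer.
    if (GetLastError() != ERROR_INSUFFICIENT_BUFFER || size > kMaxCookieLen)
        return L"";
    std::wstring heapBuf(size, L'\0');  // (See the PS below on bytes vs. WCHARs.)
    // If another context grows the cookie jar right here, the retry fails...
    if (InternetGetCookieExW(url, nullptr, &heapBuf[0], &size, 0, nullptr))
        return heapBuf.c_str();  // Trim at the terminating null.
    return L"";  // ...and the page sees an empty cookie string.
}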

Wait. Do you see the problem here?

It’s tempting to conclude that the document.cookie API doesn’t need to be thread-safe– JavaScript that touches the DOM runs in one thread, the UI thread. But cookies are a form of data storage that is shared across multiple threads and processes. For instance, network requests for the page’s subresources can be manipulating the cookie store in parallel, and if I happen to have multiple tabs or windows open to the same site, they’ll be interacting with the same cookie jar.

So, consider the following scenario: the document.cookie implementation calls InternetGetCookieEx but gets back ERROR_INSUFFICIENT_BUFFER with a required size of 1200 characters. The implementation dutifully allocates a 1200-character buffer, but before it gets the chance to call InternetGetCookieEx again, an image on the page sets a new 4-character cookie, which WinINET puts in the cookie jar. Now, when InternetGetCookieEx is called again, it again returns ERROR_INSUFFICIENT_BUFFER, because the required buffer is now 1204 characters. Because document.cookie isn’t using any sort of loop-until-success, it returns an empty cookie string.
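
A loop-until-success variant (again just a sketch reusing the includes above, not the browser’s actual code) tolerates the race by re-querying until the buffer is finally big enough:

std::wstring GetDocumentCookieRobustly(LPCWSTR url) {
    std::wstring buf(1024, L'\0');
    for (;;) {
        DWORD size = static_cast<DWORD>(buf.size());
        if (InternetGetCookieExW(url, nullptr, &buf[0], &size, 0, nullptr))
            return buf.c_str();  // Trim at the terminating null.
        if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
            return L"";
        // Per the PS below, on this failure |size| is a count of bytes.
        buf.resize(size / sizeof(WCHAR) + 1);
    }
}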

Now, this is all fast native code (C/C++), so surely this sort of thing is just theoretical… it can’t really happen on a fast computer, right?

Around ten years ago, I showed how you can use Meddler to easily generate a lot of web traffic for testing browsers. Meddler is a simple web server that has a simple GUI code editor slapped on the front (most developers would use node.js or Go for such tasks). I quickly threw together a tiny little MeddlerScript which exercises cookies by loading cookie-setting images in a loop and monitoring the document.cookie API to see if it ever returns an empty string.

Boy, does it ever. On my i7 machines, it usually only takes a few seconds to run into the buggy case where document.cookie returns an empty string.

[Screenshot: the Meddler test page after document.cookie returned an empty string]

I haven’t gone back to check the history, but I suspect this IE/Edge bug is at least fifteen years old.

After confirming this bug, it felt strangely familiar, as if I’d hit this landmine before. Then, as I was writing this post, I realized when… Back in 2011, I shared the C# code Fiddler uses for mapping a socket to a process. That code relies on the GetExtendedTcpTable API, which has the same reallocate-then-reinvoke design. Fortunately, I’d fixed the bug a few weeks later in Fiddler, but it looks like I never updated my blog post (sorry about that).

-Eric

PS: Unrelated, but one more pitfall to be aware of: InternetGetCookieExW has a truly bizarre shape, in that the lpdwSize argument is a pointer to a count of wide characters, but if ERROR_INSUFFICIENT_BUFFER is returned, the size argument is set to the count of bytes required.

PPS: As of Windows 10 RS3, Edge (and IE) support 180 cookies per domain to match Chrome, but the network stack will skip setting or sending individual cookies with a value over 5120 bytes.