Cookies and Concurrency, Redux

Note: This post concerns Edge Legacy (aka Spartan) and does not apply to the modern Chromium-based Edge.

In yesterday’s episode, I shared the root cause of a bug that can cause document.cookie to incorrectly return an empty string if the cookie is over 1kb and the cookie grows in the middle of a DOM document.cookie getter operation.

Unfortunately, that simple bug wasn’t the root cause of the compatibility problem that I was investigating when my code-review uncovered it. The observed compatibility bug was slightly different– in the repro case, only one of the document’s cookies goes missing, and it goes missing even when only one page is setting the cookie.

After the brain-melting exercise of annotating the site’s minified framework libraries (console.log(‘…’) ftw!) via Fiddler’s AutoResponder, I found that the site uses the document.cookie API to save the same cookie (named “ld“) three times in a row, adding some information to the cookie each time. However, the ld cookie mysteriously disappears between 0.4 and 6 milliseconds after it gets set the third time. I painstakingly verified that the cookie wasn’t getting manipulated from any other context when it disappeared.

Hmm…

As I wrote up the investigation notes, I idly noted that due to a trivial typo in the website’s source code, the ld cookie was set first as a Persistent cookie, then (accidentally) as a Session cookie, then as a Persistent cookie.

In re-reading the notes an hour later, again my memory got tickled. Hadn’t I seen something like this before?

Indeed, I had. Just about five years ago, a user reported a similar bug where a HTTP response contained two Set-Cookie calls for the same cookie name and Internet Explorer didn’t store either cookie. I built a reduced test case and reported it to the engineering team.

Pushing Cookies

The root cause of the cookie disappearance relates to the Internet Explorer and Edge Legacy (Spartan) “loosely-coupled architecture.”

In IE and Edge Legacy, each browser tab process runs its own networking stack, in-process1. For persistent cookies, this poses no problem, because every browser process hits the same WinINET cookie storage area and gets back the latest value of the persistent cookie. In contrast, for session cookies, there’s a challenge. Session cookies are stored in local (per-process) variables in the networking code, but a browser session may include multiple tab processes. A Session cookie set in a tab process needs to be available in all other tab processes in that browser session.

As a consequence, when a tab writes a Session cookie, Edge Legacy must send an interprocess communication (IPC) message to every other process in the browser session, telling each to update its internal variables with the new value of the Session cookie. This Cookie Pushing IPC is asynchronous, and if the named cookie were later modified in a process before the IPC announcing the earlier update to the cookie is received, that later update is obliterated.

The Duplicate Set-Cookie header version of this bug got fixed in the Fall 2017 Update (RS3) to Windows 10 and thus my old Set-Cookie test case case no longer reproduces the problem.

Unfortunately, it turns out that the RS3 fix only corrected the behavior of the network stack when it encounters this pattern– if the cookie-setting calls are made via document.cookie, the problem reappears, as in this document.cookie test case.

BadBehavior

Playing with the repro page, you’ll notice that manually pushing “Set HOT as a Session cookie” or “Set as a Persistent cookie” works fine, because your puny human reflexes aren’t faster than the cookie-pushing IPC. But when you push the “Set twice” button that sets the cookie twice in fast succession, the HOT cookie disappears in Edge Legacy (and in IE11, if you have more than one tab open).

Until this bug is fixed, avoid using document.cookie to change a persistent cookie to a session cookie.

-Eric

In contrast, in Chrome, all networking occurs in the browser process (or a networking-only process), and if a tab process wants to get the current document.cookie, it must perform an IPC to ask the browser process for the cookie value. We call this “cookie pulling.”

ERROR_INSUFFICIENT_BUFFER and Concurrency

Many classic Windows APIs accept a pointer to a byte buffer and a pointer to an integer indicating the size of the buffer. If the buffer is large enough to hold the data returned from the API, the buffer is filled and the API returns S_OK. If the buffer supplied is not large enough to hold all of the data, the API instead returns ERROR_INSUFFICIENT_BUFFER, updating the supplied integer with the length of the buffer required. The client is expected to reallocate a new buffer of the specified size and call the API again with the new buffer and length.

For example, the InternetGetCookieEx function, used to query the WinINET networking stack for cookies for a given URL, is one such API. The GetExtendedTcpTable function, used to map sockets to processes, is another.

The advantage of APIs with this form is that you can call the API with a reasonably-sized stack buffer and avoid the cost of a heap allocation unless the stack buffer happens to be too small.

In the case of Internet Explorer and Edge Legacy, the document.cookie DOM API getter’s implementation first calls the InternetGetCookieEx API with a 1024 WCHAR buffer. If the buffer is big enough, the cookie string is then immediately returned to the page.

However, if ERROR_INSUFFICIENT_BUFFER is returned instead (and if the size needed is 10240 characters (MAX_COOKIE_LEN) or fewer), the API will allocate a new buffer on the heap and call the API again. If the API succeeds, the cookie string is returned to the page, otherwise if any error is returned, an empty string is returned to the page.

Wait. Do you see the problem here?

It’s tempting to conclude that the document.cookie API doesn’t need to be thread-safe–JavaScript that touches the DOM runs in one thread, the UI thread. But cookies are a form of data storage that is available across multiple threads and processes. For instance, subdownload network requests for the page’s resources can be manipulating the cookie store in parallel, and if I happen to have multiple tabs or windows open to the same site, they’ll be interacting with the same cookie jar.

So, consider following scenario: The document.cookie implementation calls InternetGetCookieEx but gets back ERROR_INSUFFICIENT_BUFFER with a required size of 1200 bytes. The implementation dutifully allocates a 1200 byte buffer, but before it gets the chance to call InternetGetCookieEx again, an image on the page sets a new 4 byte cookie which WinINET puts in the cookie jar. Now, when InternetGetCookieEx is called again, it again returns ERROR_INSUFFICIENT_BUFFER because the required buffer is now 1204 characters. Because document.cookie isn’t using any sort of loop-until-success, it returns an empty cookie string.

Now, this is all fast native code (C/C++), so surely this sort of thing is just theoretical… it can’t really happen on a fast computer, right?

Around ten years ago, I showed how you can use Meddler to easily generate a lot of web traffic for testing browsers. Meddler is a simple web server that has a simple GUI code editor slapped on the front (most developers would use node.js or Go for such tasks). I quickly threw together a tiny little MeddlerScript which exercises cookies by loading cookie-setting images in a loop and monitoring the document.cookie API to see if it ever returns an empty string.


import Meddler;
import System;
import System.Text;
import System.Net.Sockets;
import System.Windows.Forms;
// You can set options for this script using the format:
// ScriptOptions("StartURL" (where {$PORT} is autoreplaced by the Meddler port number), "Optional HTTPS Certificate Thumbprint", "Random # Seed")
// public ScriptOptions("https://localhost:{$PORT}/Test2", "fc ba fd cd 07 02 14 db a6 b7 ad 37 92 a9 65 0a 75 33 4f 9a", "1234")
class Handlers
{
static function OnConnection(oSession: Session)
{
try{
if (oSession.ReadRequest()){
var oHeaders: ResponseHeaders = new ResponseHeaders();
oHeaders.Status = "200 OK";
oHeaders["Connection"] = "close";
oHeaders["Cache-Control"] = "no-cache";
if (oSession.requestHeaders.Path.indexOf(".jpg")>-1){
oHeaders["Content-Type"] = "image/jpeg";
oHeaders["Set-Cookie"] = "C"+(Fuzz.NewInteger(1,7).ToString())+"="+Fuzz.NewString('a', Fuzz.NewInteger(128,256));
oSession.WriteString(oHeaders);
oSession.WriteBytes(Fuzz.NewJPG(Fuzz.NewInteger(100,999).ToString(), 80, 60));
}
else
{
oHeaders["Content-Type"] = "text/html";
oSession.WriteString(oHeaders);
oSession.WriteString("<!doctype html>\r\n<head>\r\n<title>Cookie Hammer</title>\r\n"
+ "<script>\r\n\r\nsetInterval(function(){\r\n let a=document.cookie;\r\n if (a.length < 1) { alert('Cookie was empty\\n' + document.cookie); }\r\n"
+ "document.getElementById(\"divLen\").innerText = new Date() + ' ' + (\"Cookie Length: \" + a.length.toString());\r\n\r\n }, 1);\r\n</script>\r\n"
);
oSession.WriteString("</head>\r\n<body>\r\n\r\n");
for (var i=0; i<64; i++)
oSession.WriteString("<img onload='this.src = \"CookieRandom.jpg?"+i.toString()+"=\" + Math.random();' src=\"CookieRandom.jpg?\" />\r\n");
oSession.WriteString("\r\n<div id=\"divLen\"></div>\r\n\r\n</body>\r\n</html>");
}
}
oSession.CloseSocket();
}
catch(e)
{
// MessageBox.Show("Script threw exception\n"+e, "OnConnection Failed");
MeddlerObject.Log.LogString("Script threw exception\n"+e);
}
}
// Optional method called on compile
static function Main(){
var today: Date = new Date();
MeddlerObject.StatusText = " Rules.js was loaded at: " + today;
}
}

view raw

CookieHammer.ms

hosted with ❤ by GitHub

Boy, does it ever. On my i7 machines, it usually only takes a few seconds to run into the buggy case where document.cookie returns an empty string.

Failure

I haven’t gone back to check the history, but I suspect this IE/EdgeLegacy bug is at least fifteen years old.

After confirming this bug, it felt strangely familiar, as if I’d hit this landmine before. Then, as I was writing this post, I realized when… Back in 2011, I shared the C# code Fiddler uses for mapping a socket to a process. That code relies on the GetExtendedTcpTable API, which has the same reallocate-then-reinvoke design. Fortunately, I’d fixed the bug a few weeks later in Fiddler, but it looks like I never updated my blog post (sorry about that, I only mentioned it in a 2013 comment).

-Eric

PS: Unrelated, but one more pitfall to be aware of: InternetGetCookieExW has a truly bizarre shape, in that the lpdwSize argument is a pointer to a count of wide characters, but if ERROR_INSUFFICIENT_BUFFER is returned, the size argument is set to the count of bytes required.

PPS: As of Windows 10 RS3, Edge Legacy (and IE) support 180 cookies per domain to match Chrome, but the network stack will skip setting or sending individual cookies with a value over 5120 bytes.

Edge Interop Issues

As we finish up the next release of Windows 10 (Fall 2018), my team is hard at work triaging incoming bugs.

Many such bugs take the form “Edge does the wrong thing for this page. ${Other_Browser} works okay.

This post is designed to be an (ever-growing) index of some of the behavioral deltas that are the root cause of such issues:


Edge doesn’t allow navigation to DATA urls, even when they’d otherwise be converted to file downloads.

Using pushState or replaceState with |undefined| as the URL argument shows “undefined” in the Address box in Edge/IE but not Chrome or Firefox.

IE/Edge strip the Content-Encoding header from a compressed response; Firefox and Chrome leave the header in. For XmlHttpRequest’s getAllResponseHeaders, IE and Firefox maintain the case of HTTP Response header names while Chrome/Edge/Safari do not.

Chrome recognizes that a file with a .JSON extension has the type application/json (and vice versa) while IE/Edge only recognize that when the registry is configured with that mapping.

Chrome includes a hack that works around certificates that do not exactly match the domain on which they are served. Firefox, Edge, and IE do not include this hack, leading to a Certificate Name Mismatch Error when loading:

WWWAddition

Edge does not fully support the URL standard, meaning that URLs of the form http:/example.com (note the missing slash) do not work as expected.

Edge and IE do not allow navigation to HTTP URLs containing a UserInfo component. Other browsers currently (reluctantly) allow this syntax.

Edge RS5 introduces support for Web Authentication specification (in order to support FIDO2 tokens). That specification extends the Credential Management API with new methods, so the navigator.credentials object now exists. However, Edge does not implement the navigator.credentials.preventSilentAccess() method and attempting to call it will cause an exception due to the missing method. (Edge always prevents silent access, so a future implementation of this method will simply fulfill the promise immediately).

When a server returns a HTTP/[301|302|303|307|308] response, Edge/IE are unable to read the response body if the server didn’t include a Content-Length header or Transfer-Encoding: chunked (HTTP/1.1). This turns out to break login to YouTube TV, where Google returns a response body over HTTP/2 (which does not require explicit content lengths thanks to its inherent message framing).

Edge supports most of CSP2 but currently does not support nonces on sourced script elements (only inline script and styles) [Test page]. This limitation significantly complicates deployment of CSP for sites that cannot easily enumerate their source locations in the Content-Security-Policy header (Edge does not support CSP3’s strict-dynamic directive yet either). A broad rule (e.g. script-src https: ) can be used as a workaround but this does increase attack surface.

IE and Edge begin immediately downloading the content of a SCRIPT SRC, not waiting until the SCRIPT element is added to this DOM. This means, for instance, that adding a |crossorigin| attribute to that element after setting its source does not result in an |Origin| header being sent on the request.


Continued over here in this post…

-Eric

Script-Generated Download Files

As we finish up the next release of Windows 10, my team is hard at work triaging incoming bugs. Here’s a pattern that has come up a few times this month:

Bug: I click download in Edge Legacy:

DownloadButton…but I end up on an error page:

WompWompDataURI

Womp womp.

If you watch the network traffic, you’ll see that no request even hits the network in the failing case. But, if you carefully scroll that ugly error URL to see the middle, the source of the problem appears:

ms-appx-web://microsoft.microsoftedge/assets/errorpages/dnserror.html?ErrorStatus=0x80704006&NetworkStatusSupported=1#data:text/csv;charset=UTF-8...n%0D%0A

The error shows that Edge Legacy failed to navigate to a URL with the Data URI scheme.

Ever since we introduced support for DATA URLs a decade ago in Internet Explorer 8, they’ve been throttled with one major limitation: You cannot navigate to these URIs at the top level of the browser. Edge Legacy loosened things up so that Data URLs under 4096 characters can be used as the source of IFRAMEs, but the browser will not navigate to a data URL at the top level. The new Chromium-based Edge does not have this limitation.

(Yes, this error page could use some love.)

Now, you might remember that last winter, Chrome took a change to forbid top-level navigation to data URIs (due to spoofing concerns), but that restriction contains one important exception: navigations that get turned into downloads (due to their MIME type being one other than something expected to render in the browser) are exempted. So this scenario sorta works in Chrome. (I say “sorta” because the authors of this site failed to specify a meaningful filename on the link, so the file downloads without the all-important .csv extension).

ChromeWorksSorta2

So, does IE/Edge Legacy’s restriction on Data URIs mean that webdevs cannot generate downloadable files dynamically in JavaScript in a way that works in all browsers?

No, of course not.

There are many alternative approaches, but one simple approach is to just use a blob URL, like so:

var text2 = new Blob(["a,b,c,d"], { type: 'text/csv'});
var down2 = document.createElement("a");
down2.download = "simple.csv";
down2.href = window.URL.createObjectURL(text2);
down2.addEventListener("onclick", function(){ if (navigator.msSaveOrOpenBlob) {navigator.msSaveOrOpenBlob(text2,"simple.csv"); return false;}});
document.body.appendChild(down2);
down2.innerText="I have a download attribute. Click me";

When the link is clicked, the CSV file is downloaded with a proper filename.

See this GitHub thread for a fuller discussion.

-Eric

PS: Even in the new Chromium-based Edge, downloads sourced from blob: or data: may not be handled properly by browser policies that call for special handling of files from specified sites. See https://crbug.com/1169904, for details.

CORS and Vary

Yesterday, I started looking a site compatibility bug where a page’s layout is intermittently busted. Popping open the F12 Tools on the failing page, we see that a stylesheet is getting blocked because it lacks a CORS Access-Control-Allow-Origin response header:

NoStylesheetCORS

We see that the client demands the header because the LINK element that references it includes a crossorigin=anonymous directive:

crossorigin="anonymous" href="//s.axs.com/axs/css/90a6f65.css?4.0.1194" type="text/css" />

Aside: It’s not clear why the site is using this directive. CORS is required to use  SubResource Integrity, but this resource does not include an integrity attribute. Perhaps the goal was to save bandwidth by not sending cookies to the “s” (static content) domain?

In any case, the result is that the stylesheet sometimes fails to load as you navigate back and forward.

Looking at the network traffic, we find that the static content domain is trying to follow the best practice Include Vary: Origin when using CORS for access control.

Unfortunately, it’s doing so in a subtly incorrect way, which you can see when diffing two request/response pairs for the stylesheet:

VaryDiff

As you can see in the diff, the Origin token is added only to the response’s Vary directive when the request specifies an Origin header. If the request doesn’t specify an Origin, the server returns a response that lacks the Access-Control-* headers and also omits the Vary: Origin header.

That’s a problem. If the browser has the variant without the Access-Control directives in its cache, it will reuse that variant in response to a subsequent request… regardless of whether or not the subsequent request has an Origin header.

The rule here is simple: If your server makes a decision about what to return based on a what’s in a HTTP header, you need to include that header name in your Vary, even if the request didn’t include that header.

If your response specifies Vary: Origin to ensure that the browser rechecks with you before reusing a response, your server had better not return a HTTP/304 Not Modified response to a request from a different Origin without *also* updating the Access-Control-Allow-Origin to match the new request context. Otherwise, your cached response is not going to be readable in that new context.

-Eric

PS: This seems to be a pretty common misconfiguration, which is mentioned in the fetch spec:


CORS protocol and HTTP caches

If CORS protocol requirements are more complicated than setting `Access-Control-Allow-Origin` to * or a static origin, `Vary` is to be used.

Vary: Origin

In particular, consider what happens if Vary is not used and a server is configured to send Access-Control-Allow-Origin for a certain resource only in response to a CORS request. When a user agent receives a response to a non-CORS request for that resource (for example, as the result of a navigation request), the response will lack Access-Control-Allow-Origin and the user agent will cache that response. Then, if the user agent subsequently encounters a CORS request for the resource, it will use that cached response from the previous non-CORS request, without Access-Control-Allow-Origin.

But if Vary: Origin is used in the same scenario described above, it will cause the user agent to fetch a response that includes Access-Control-Allow-Origin, rather than using the cached response from the previous non-CORS request that lacks Access-Control-Allow-Origin.

However, if Access-Control-Allow-Origin is set to * or a static origin for a particular resource, then configure the server to always send Access-Control-Allow-Origin in responses for the resource — for non-CORS requests as well as CORS requests — and do not use Vary.