A Cold and Slow 3M Half

My second run of the 3M Half Marathon was Sunday January 21, 2024. My first half-marathon last year was cold (starting at 38F), but this year’s was slated to be even colder (33F) and I was nervous.

For dinner on Saturday night, I had a HelloFresh meal of meatballs and mashed potatoes, and I went to bed around 9:45pm. I set an alarm for 6, but I woke up around 5:15 am and lingered in bed until 5:30. I drank a cup of coffee right away and then had a productive trip to the bathroom. I ate a banana and had another cup of coffee while I prepped my gear and got dressed.

I put on my new Under Armour leggings and shorts with the number that I’d attached the night before. I packed an additional running shirt in case it was cold enough to double-layer, something I’d never done before, but the forecast called for 33 degrees, five degrees colder than last year’s cold run. I also put on a pair of $3 disposable cotton gloves that I’d picked up at the packet pickup expo the day before. I wore new Balega socks and my trusty orange Hokas (my new ones aren’t quite broken in yet).

My water bottle’s pouch would hold my car key, an iPhone SE to provide tunes streamed to one Bluetooth earpiece (the other died months ago) and snacks: a pack of Gu gummies and some Jelly Belly Energy beans (which I ended up liking most).

I left the house around 6:45 for the 7:30 am race. While waiting at a light in the parking traffic I concluded that I definitely was going to need that second shirt, so I put it on under my trusty Decker Challenge shirt that brought me to the top of Kilimanjaro.

By 7:22 I had parked and was waiting in a long line for a porta-potty near the start, debating whether or not I should just skip it and go find my pace group. Ultimately, the race began just before my turn came, although it was still nice to dispose of that second cup of coffee. Alas, I was forced to start with the 2:40 pacers. I started my Fitbit a minute or two before my group made it across the starting line, and my second watch (an ancient Timex I found somewhere) crashed when I tried to start it as I crossed the starting line.

I spent the first mile passing folks and by the second mile marker I’d reached the 2:05 pace group. For the next few miles, the 2:00 pace group was in sight in the distance, but I never caught up to them, despite my hope of running most of the race with the 1:55 pace group. I consoled myself that I’d probably crossed the start line two minutes after the 2:00 group, so my dream of finishing in under 2 hours was still possible.

Around mile 5, my energy started to flag, but shortly thereafter an 8yo in a tutu running ahead of me guilted me into realizing that this wasn’t as hard as I was making it out to be. I passed her in about half a mile, grateful for the boost.

Shortly after mile 6, I discarded my gloves which had served me well. By this point I was taking short walking breaks and had concluded that I was unlikely to set any PRs.

Miles 6 through 9 were full of signs. My favorite was the 3yo boy holding a sign that said “This seems like a lot of work for a banana”– I told him it was the best one I’d seen. I groaned a bit at some of the signs held by twenty-somethings; one woman’s read “Find a cute butt and follow it” while another proclaimed: “Wow, that looks really long and hard!” Like last year, around mile 9 I stopped for a pee break, although this time it was almost nothing… I had sipped under 16 ounces on the entire run.

By mile 7, my torso was starting to get a bit warm in my double shirts, but in another few miles the breeze had picked up and I was glad that I had them. Sheesh, it was chilly.

Amazingly, nothing hurt. My feet felt good. My legs felt good. My throat and lungs felt fine. My lips, wearing a swipe of chapstick, were fine. My thighs and chest, coated in BodyGlide, were not chafing anywhere. I didn’t have any weird aches in my arms or back. The closest thing I had to any pain was the bottom of my nose, which was getting chapped in the cold.

This year, I was anticipating the two hills downtown, and while I can’t say that I ran up them, they were much less demoralizing than last year. As the finish line approached, it had been a few miles since I’d seen my last pacer, but I figured I was somewhere in the 2:13-2:18 range. I idly hoped I’d still beat my time from last year’s slow Galveston Half.

Ultimately, I crossed the finish with a chip time of 2:09:24, a 9:52/mile pace.

I wish FitBit made it easier to trim their data to the actual run portion of the race :)

This year, I made sure not to blow by the volunteers passing out the medals just over the finish line.

Tired but feeling like I could’ve easily run a few more miles at a slow pace, I hopped the bus back to the starting point. I grabbed my phone and a jacket from the car and walked a mile to the bagel shop to get a coffee and celebratory breakfast sandwich… the day felt even colder. Back home on the couch after a long warm shower, I signed up for next year’s race.

Next month is the Galveston Half Marathon. I hope to run the race with the 2:00 pacer the whole way, but I’ll settle for beating 2:09, five minutes faster than last year’s effort.

The Blind Doorkeeper Problem, or, Why Enclaves are Tricky

When trying to protect a secret on a client device, there are many strategies, but most of them are doomed. However, because this is a long-standing problem, many security experts have tried to chip away at its edges over the years. Over the last decade there’s been growing interest in using enclaves as a means to protect secrets from attackers.

Background: Enclaves

The basic idea of an enclave is simple: you can put something into an enclave, but never take it out. For example, this means you can put the private key of a public/private key pair into the enclave so that it cannot be stolen. Whenever you need to decrypt a message, you simply pass the ciphertext into the enclave and the private key is used to decrypt the message and return the plaintext, crucially without the private key ever leaving the enclave.
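To make this concrete, here’s a minimal sketch in C++ of the kind of routine such an enclave might expose. Every name in it (DecryptRequest, DecryptMessage, the placeholder XOR “cipher”) is hypothetical; the point is simply that the key lives only in the enclave’s memory and only ciphertext and plaintext ever cross the boundary:

// Illustrative enclave-resident code. DecryptRequest, DecryptMessage, and the
// placeholder "cipher" are made-up names for this sketch, not a real API.
#include <cstdint>
#include <cstddef>

namespace {
// Generated or unsealed inside the enclave at startup; there is deliberately
// no entry point that copies these bytes back out to the host.
uint8_t g_privateKey[32] = { /* provisioned at enclave startup */ };
}

struct DecryptRequest {
    const uint8_t* ciphertext;        // supplied by the (untrusted) host
    size_t         ciphertextLength;
    uint8_t*       plaintextOut;      // host-supplied output buffer
    size_t         plaintextCapacity;
};

// The "doorkeeper" entry point: ciphertext in, plaintext out, key stays put.
// Note that this naive version happily serves *any* caller -- which is exactly
// the problem discussed later in this post.
extern "C" void* DecryptMessage(void* param) {
    auto* req = static_cast<DecryptRequest*>(param);
    size_t n = (req->ciphertextLength < req->plaintextCapacity)
                   ? req->ciphertextLength : req->plaintextCapacity;
    for (size_t i = 0; i < n; ++i) {
        // Placeholder transform standing in for real decryption with g_privateKey.
        req->plaintextOut[i] = req->ciphertext[i] ^ g_privateKey[i % sizeof(g_privateKey)];
    }
    return reinterpret_cast<void*>(n);   // number of plaintext bytes written
}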

There are several types of enclaves, but on modern systems, most are backed by hardware — either a TPM chip, a custom security processor, or features in mainstream CPUs that enable enclave-style isolation within a general-purpose CPU. Windows exposes a CreateEnclave API to allow running code inside an enclave, backed by virtualization-based security (VBS) features in modern processors. The general concept behind Virtual Secure Mode is simple: code running at the normal Virtual Trust Level (VTL0) cannot read or write memory “inside” the enclave, which runs its code at VTL1. Even the highly-privileged OS Kernel code running at VTL0 cannot spy on the content of a VTL1 enclave.

DLLs loaded into an enclave must be signed by a specific class of certificate (provided by Azure Trusted Signing) and the code’s signature and integrity are validated before it is loaded into the enclave. After the privileged code is loaded into the enclave, it has access to all of the memory of the current process (both untrusted VTL0 and privileged VTL1 memory). In-enclave code cannot load most libraries and thus can only call a tiny set of external library functions, mostly related to cryptography.
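To see what that looks like on Windows, here’s a hedged sketch of the host-side (VTL0) flow using the VBS enclave APIs. The DLL name and export are hypothetical, error handling is abbreviated, and the sizes are merely plausible values; consult the CreateEnclave/InitializeEnclave documentation before relying on any of this:

// Host-process sketch: create a VBS enclave, load a signed enclave DLL into it,
// and call one of its exports. Link against the standard Windows libraries.
#include <windows.h>
#include <enclaveapi.h>
#include <cstdio>

int main() {
    if (!IsEnclaveTypeSupported(ENCLAVE_TYPE_VBS)) {
        std::puts("VBS enclaves are not supported on this machine.");
        return 1;
    }

    // 1) Reserve the enclave's virtual address range inside this process.
    ENCLAVE_CREATE_INFO_VBS createInfo = {};     // Flags/OwnerID left at defaults for this sketch
    void* enclaveBase = CreateEnclave(GetCurrentProcess(),
                                      nullptr,        // let the system choose the base address
                                      0x10000000,     // size of the enclave region (illustrative)
                                      0,              // initial commitment (unused for VBS)
                                      ENCLAVE_TYPE_VBS,
                                      &createInfo, sizeof(createInfo),
                                      nullptr);
    if (!enclaveBase) return 2;

    // 2) Load the signed enclave image into the region, then initialize (seal) it.
    if (!LoadEnclaveImageW(enclaveBase, L"MyEnclave.dll")) return 3;   // hypothetical DLL name

    ENCLAVE_INIT_INFO_VBS initInfo = {};
    initInfo.Length = sizeof(initInfo);
    initInfo.ThreadCount = 1;
    if (!InitializeEnclave(GetCurrentProcess(), enclaveBase,
                           &initInfo, sizeof(initInfo), nullptr)) return 4;

    // 3) Call an entry point exported by the enclave image (e.g. the DecryptMessage
    //    sketch above). A real caller would pass a pointer to its request structure as
    //    lpParameter; nullptr here just keeps the sketch short. Note that *any* code
    //    running in this process could make this exact same call -- that's the
    //    "blind doorkeeper" problem described below.
    auto routine = reinterpret_cast<LPENCLAVE_ROUTINE>(
        GetProcAddress(reinterpret_cast<HMODULE>(enclaveBase), "DecryptMessage"));
    void* result = nullptr;
    if (routine) CallEnclave(routine, /*lpParameter*/ nullptr, /*fWaitForThread*/ TRUE, &result);

    // 4) Tear the enclave down when finished.
    TerminateEnclave(enclaveBase, TRUE);
    DeleteEnclave(enclaveBase);
    return 0;
}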

Security researchers spend a lot of time trying to attack enclaves for the same reason that robbers try to rob banks: because that’s where the valuables are. At this point, most enclaves offer pretty solid security guarantees– attacking hardware is usually quite difficult which makes many attacks impractically expensive or unreliable.

However, it’s important to recognize that enclaves are far from a panacea, and the limits of the protection provided by an enclave are quite subtle.

A Metaphor

Imagine a real-world protection problem: You don’t want anyone to get into your apartment, so you lock the door when you leave. However, you’re in the habit of leaving your keys on the bar when you’re out for drinks and bad guys keep swiping them and entering your apartment. Some especially annoying bad guys don’t just enter themselves, they also make copies of your key and share it with their bad-guy brethren to use at their leisure.

You hit on the following solution: you change your apartment’s lock, making only one key. You hire a doorkeeper to hold the key for you, and he wears it on a chain around his neck, never letting it leave his person. Every time you need to get in your apartment, you ask the doorkeeper to let you in and he unlocks the door for you.

No one other than the doorkeeper ever touches the key, so there’s no way for a bad guy to steal or copy the key.

Is this solution secure?

Well, no. The problem is that you never gave your doorkeeper instructions on who is allowed to tell him to unlock the door, so he’ll open it for anyone who asks. Your carefully-designed system is perfectly effective in protecting the key but utterly fails in achieving the actual protection goal: protecting the contents of your apartment.

What does this have to do with enclaves?

Sometimes, security engineers get confused about their goals, and believe that their objective is to keep the private key secret. Keeping the private key secret is simply an annoying requirement in service of the real goal: ensuring that messages can be decrypted/encrypted only by the one legitimate owner of the key. The enclave serves to prevent that key from being stolen, but preventing the key from being abused is a different thing altogether.

Consider, for example, the case of locally-running malware. The malware can’t steal the enclaved key, but it doesn’t need to! It can just hand a message to the code running inside the enclave and say “Decrypt this, please and thank you.” The code inside the enclave dutifully does as it’s asked and returns the plaintext out to the malware. Similarly, the attacker can tell the enclave “Encrypt this message with the key” and the code inside the enclave does as directed. The key remains a secret from the malware, but the crypto system has been completely compromised, with the attacker able to decrypt and encrypt messages of his choice.

So, what can we do about this? It’s natural to think: “Ah, we’ll just sign/encrypt messages from the app into the enclave and the code inside the enclave will validate that the calls are legitimate!” but a moment later you’ll remember: “Ah, but how do we protect that app’s key?” and we’re back where we started. Oops.

Another idea is that the code inside the enclave will examine the running process and determine whether the app/caller is the expected legitimate app. Unfortunately, this is extremely difficult. While the VTL1 code can read all of the app’s VTL0 memory, to confidently determine that the host app is legitimate would require something like groveling all of the executable pages in the process’ memory, hashing them, and comparing them to a “known good” value. If the process contains any unexpected code, it may be compromised. Even if you could successfully implement this process snapshot hash check, an attacker could probably exploit a race condition to circumvent the check, and you’d be forever bedeviled with false positives caused by non-malicious code injection from accessibility utilities or security tools.

In general, any security checks from inside the enclave that look at memory in VTL0 are potentially subject to a TOCTOU attack — an attacker can change any values at any time unless they have been copied into VTL1 memory. Microsoft Security wrote an important summary of things your code needs to do in order to program securely with enclaves.
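In code, that guidance boils down to a capture-then-validate pattern. Revisiting the earlier DecryptMessage sketch (all names still hypothetical), the enclave first copies the caller’s parameters and payload into its own VTL1 memory, and every subsequent check and use touches only that private copy:

// Illustrative VTL1-side pattern: never validate data while it still lives in VTL0,
// because the host can rewrite it between your check and your use (TOCTOU).
#include <cstring>
#include <cstdint>
#include <cstddef>

struct DecryptRequest {               // same illustrative layout as the earlier sketch
    const uint8_t* ciphertext;
    size_t         ciphertextLength;
    uint8_t*       plaintextOut;
    size_t         plaintextCapacity;
};

constexpr size_t kMaxCiphertext = 4096;   // arbitrary limit for the sketch

extern "C" void* DecryptMessage(void* param) {
    if (param == nullptr) return nullptr;

    // 1) Capture: copy the request header, then the payload, into enclave-owned memory.
    DecryptRequest req;                               // lives on the VTL1 stack
    std::memcpy(&req, param, sizeof(req));

    // 2) Validate the *copy*, never the VTL0 original.
    if (req.ciphertextLength == 0 || req.ciphertextLength > kMaxCiphertext) {
        return nullptr;
    }
    uint8_t buffer[kMaxCiphertext];                   // VTL1 copy of the payload
    std::memcpy(buffer, req.ciphertext, req.ciphertextLength);

    // 3) Use: from here on, only 'req' and 'buffer' are trusted; re-reading the
    //    original VTL0 pointers would reopen the TOCTOU window.
    //    ... decrypt 'buffer' with the enclave-held key and write into req.plaintextOut ...
    return nullptr;
}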

Another idea would be to prompt the user: the code inside the enclave could pop up an unspoofable dialog asking “Hey, would you like me to sign this message [data] with your key?” Unfortunately, in the Windows model this isn’t possible — code running inside an enclave can’t show any UI, and even if it could, there’s nothing that would prevent such a confirmation UI from being redressed by VTL0 code. Rats.

Conclusion

Before you reach for an enclave, consider your full threat model and whether using the enclave would meaningfully mitigate any threats.

For example, token binding uses an enclave to render cookie theft useless– while token binding doesn’t mitigate the threat of locally-running malware abusing cookies via a sock-puppet browser, it does mitigate the threat of cookie theft via XSS/cookie leaks. Furthermore, it complicates the lives of malicious insiders using tightly locked-down corporate PCs at financial services firms– those PCs are heavily monitored and audited, so forcing an attacker to abuse the key on device is a significant improvement in security posture. (The upcoming Device Bound Session Credentials feature has similar properties and may be more deployable than token binding was).

A similar scenario involves Hardware Security Modules (HSMs) used in code-signing and certificate issuance scenarios: while the overall security goal is to prevent misuse of the key, preventing the attacker from egressing the key improves the threat model because it allows other components of the overall system (auditing, alerting, AV, EDR/XDR, etc) to combat attackers attempting to abuse the unexportable key.

-Eric

PS: A great discussion of hardware-backed enclaves on popular phones can be read here.

PPS: It turns out I wrote a much shorter version of this post (with a similar metaphor) previously in response to a bug report.

Coding at Google

I wrote this a few years back, but I’ve had occasion to cite it yet again when explaining why engineering at Google was awesome. To avoid it getting eaten by the bitbucket, I’m publishing it here.

Background: From January 2016 to May 2018, I was a Senior SWE on the Chrome Enamel Security team.

Google culture prioritizes developer productivity and code velocity. The internal development environment has been described (by myself and others) as “borderline magic.” Google’s developer focus carried over to the Chrome organization even though it’s a client codebase, and its open-source nature means that it cannot depend upon Google-internal tooling and infrastructure.

I recounted the following experience after starting at Google:

When an engineer first joins Google, they start with a week or two of technical training on the Google infrastructure. I’ve worked in software development for nearly two decades, and I’ve never even dreamed of the development environment Google engineers get to use. I felt like Charlie Bucket on his tour of Willy Wonka’s Chocolate Factory—astonished by the amazing and unbelievable goodies available at any turn. The computing infrastructure was something out of Star Trek, the development tools were slick and amazing, the process was jaw-dropping.

While I was doing a “hello world” coding exercise in Google’s environment, a former colleague from the IE team pinged me on Hangouts chat, probably because he’d seen my tweets about feeling like an imposter as a SWE.  He sent me a link to click, which I did. Code from Google’s core advertising engine appeared in my browser in a web app IDE. Google’s engineers have access to nearly all of the code across the whole company. This alone was astonishing—in contrast, I’d initially joined the IE team so I could get access to the networking code to figure out why the Office Online team’s website wasn’t working.

“Neat, I can see everything!” I typed back. “Push the Analyze button” he instructed. I did, and some sort of automated analyzer emitted a report identifying a few dozen performance bugs in the code. “Wow, that’s amazing!” I gushed. “Now, push the Fix button” he instructed. “Uh, this isn’t some sort of security red team exercise, right?” I asked. He assured me that it wasn’t. I pushed the button. The code changed to fix some unnecessary object copies. “Amazing!” I effused. “Click Submit” he instructed. I did, and watched as the system compiled the code in the cloud, determined which tests to run, and ran them.

Later that afternoon, an owner of the code in the affected folder typed LGTM (Googlers approve changes by typing the acronym for Looks Good To Me) on the change list I had submitted, and my change was live in production later that day. I was, in a word, gobsmacked. That night, I searched the entire codebase for misuse of an IE cache control token and proposed fixes for the instances I found.

-Me, 2017

The development tooling and build/test infrastructure at Google enable fearless commits—even a novice can make contributions to the codebase without breaking anything—and if something does break, culturally, it’s not that novice’s fault: instead, everyone agrees that the fault lies with the environment – usually either an incomplete presubmit check or missing test automation for some corner case. Regressing CLs (changelists) can be quickly and easily reverted and resubmitted with the error corrected. Relatedly, Google invests heavily in blameless post-mortems for any problem that meaningfully impacts customer experience or metrics. Beyond investing in researching and authoring the post-mortem in a timely fashion, post-mortems are broadly reviewed and preventative action items identified therein are fixed with priority.

Google makes it easy to get started and contribute. When ramping up into a new space, the new engineer is pointed to a Wiki or other easily-updated source of step-by-step instructions for configuring their development environment. This set of instructions is expected to be current, and if the reader encounters any problems or changes, they’re expected to improve the document for the next reader (“Leave it better than you found it”). If needed, there’s usually a script or other provisioning tool used to help get the right packages/tools/dependencies installed, and again, if the user encounters any problems, the expectation is that they’ll either file a bug or commit the fix to the script.

Similarly, any ongoing Process is expected to have a “Playbook” that explains how to perform the process – for example, Chrome’s HSTS Preload list is compiled into the Chrome codebase from snapshots of data exported from HSTSPreload.org. There’s a “Playbook” document that explains the relevant scripts to run, when to run them, and how to diagnose and fix any problems. This Playbook is updated whenever any aspect of the process changes as a part of whatever checkin changes the process tooling.

As a relatively recent update, the Chromium project now offers a very lightweight contribution experience that can be run entirely in a web browser, which mimics the Google internal development environment (Cider IDE with Borg compiler backend).

Mono-repo, no team/feature branches. Google internally uses a mono-repo into which almost all code (with few exceptions, including Chrome) is checked in, and the permissions allow any engineer anywhere in the company to read it, dramatically simplifying both direct code reuse as well as finding expertise in a given topic. Because Chrome is an open-source project, it uses its own mono-repo containing approximately 25 million lines of code. Chrome does not, in general, use shared branches for feature development; branches exist only for releases (e.g. Canary is forked in order to create the Dev branch, and there are firm rules about cherry-picking from Main into those branches).

An individual developer will locally create branches for each fix that he’s working on, but those branches are almost never seen by anyone else; his PR is merged to HEAD at which point everyone can see it. As a consequence, landing non-trivial changes, especially in areas where others are merging, often results in many commits and a sort of “chess game” where you have to anticipate where the code will be moving as your pieces are put in. This strongly encourages developers to land code in many small CLs that coax the project toward the desired end-state, each with matching automated tests to ensure that you’re protected against anyone else landing a change that regresses your code. Those tests end up defending your code for years to come.

Because all work is done in Main, there’s little in the way of cross-team latency, because you need not wait for an RI/FI (reverse-integrate, forward-integrate) to bring features around to/from other branches.

Cloud build. Google uses cloud build infrastructure (Borg/Goma) to build its projects so developers can work on relatively puny workstations but compile with hundreds to thousands of cores. A clean build of Chrome for Windows that took 46 minutes on a 48 thread Xeon workstation would take 6 minutes on 960 Goma cores, and most engineers are not doing clean builds very often.

This Cloud build infrastructure is heavily leveraged throughout the engineering system—it means that when an engineer puts a changelist up for review, the code is compiled for five to ten different platforms in parallel in the background and then the entire automated test suite is run (“Tryjob”) such that the engineer can find any errors before another engineer even begins their code review. Similarly, artifacts from each landed CL’s compilation are archived such that there’s a complete history of the project’s binaries, which enables automated tooling to pinpoint regressions (performance via perfbots, security via ClusterFuzz, reliability via their version of Watson) and engineers to quickly bisect other types of regressions.

Great code search/blame. Google’s Code Search features are extremely fast and, thanks to the View-All monorepo and lack of branches, it’s very easy to quickly find code from anywhere in the company. Cross-references work correctly, so things like “Find References” will properly find all callers of a specific function rather than just doing a string search for that name. Viewing Git history and blame is integrated, so it’s quick and easy to see how code evolved over time.

24-hour Code Review Culture. Google’s engineering team has a general SLA of 24 hours on code review. The tools help you find appropriate reviewers, and the automation helps ensure that your CL is in the best possible shape (proper linting, formatting, all tests pass, code coverage %s did not decline) before another human needs to look at it. The fast and simple review tools help reviewers concentrate on the task at hand, and the fact that almost all CLs are small/tiny by Microsoft standards helps keep reviews moving quickly. Similarly, Google’s worldwide engineering culture means that it’s often easy to submit a CL at the end of the day Pacific time and then respond to review feedback received overnight from engineers in Japan or Germany.

Opinionated and Enforced Coding Standards. Google has coding standards documents for each language (e.g. C++) that are opinionated and carefully revised after broad and deep discussions among practitioners interested in participating. These coding standards are, to the extent possible, enforced by automated tooling to ensure that all code is written to the standard, and these standards are shared across teams by default, with any per-project exceptions (e.g. Chrome’s C++) treated as an overlay.

Easily Discovered Area Interest/Ownership. Google has an extremely good internal “People Directory” – it allows you to search for any employee based on tags/keywords, so you can very quickly find other folks in the company that own a particular area. Think “Dr Whom/Who+” with 100ms page-load-times, backed by a work culture where folks keep their own areas of ownership and interest up-to-date, both because it’s simple and because if they fail to do so, they’re going to keep getting questions about things they no longer own. Similarly, the OWNERS system within the codebases is up-to-date because it is used to enforce OWNERS review of changes, so after you find a piece of code, it’s easy to find both who wrote it (fast GIT BLAME) and who’s responsible for it today. Company/Division/Team/Individual OKRs are all globally visible, so it’s easy to figure out what is important to a given level of the organization, no matter how remote.

Simple/fast bug trackers. Google’s bug tracker tools are simple, load extremely quickly, and allow filing/finding bugs against anything very quickly. There’s a single internal tracker for most of Google, and a public tracker (crbug.com) for the Chromium OSS project.

Simple/fast telemetry/data science tools. Google’s equivalent of Watson is extremely fast and has code to automatically generate stack information, hit counts, recent checkins near the top-of-stack functions, etc. Google’s equivalent of SQM/OCV is extremely fast and enables viewing of histograms and answering questions like “What percentage of page loads result in this behavior” without learning a query language, getting complicated data access permissions, or suffering slow page loads. These tools enable easy creation of “notifications/subscriptions” so developers interested in an area can get a “chirp” email if a metric moves meaningfully.

Sheriffs and Rotations. Most recurring processes (e.g. bug triage) have both a Sheriff and a Deputy and Google has tools for automatically managing “rotations” so that the load is spread throughout the team. For some expensive roles (e.g. a “Build Sheriff”) the developer’s primary responsibility while sheriff becomes the process in question and their normal development work is deferred until their rotation ends; the rotation tool shows the schedule for the next few months, so it is relatively easy to plan for this disruption in your productivity.

Intranet Search that doesn’t suck. While Google tries to get many important design docs and so forth into the repo directly, there’s still a bunch of documentation and other things on assorted wikis and Google Docs, etc. As you might guess, Google has an internal search engine for this non-public content that works quite well, in contrast to other places I’ve worked.

Fall 2023 Races

While I’ve been running less, I haven’t completely fallen out of the habit, and I still find spending an hour on the treadmill to be the simplest way to feel better for the rest of the day. Real-world racing remains appealing, for the excitement, the community, and for the forcing function to get on the treadmill even on days when I’m not “feeling it.”

Alas, I’m not very proud of any of my fall race times, but I am glad that I’ve managed to keep up running after the motivation of Kilimanjaro has entered the rear-view mirror, and I’m happy that I recently managed to sneak up on running my longest-by-far distance.

Daisy Dash 10K – 10/22

I had lucky bib #123 for this run. The course wasn’t much to look at, but was pleasant enough, looping around a stadium and a few shopping centers in “Sunset Valley” on the south side of Austin. I was worried going out that my lower back had been sore for a few days, and my belly felt slightly grumbly, but once I started running, neither bothered me at all for the next few hours. Thanks, body!

It was 68 and breezy; a bit humid, but not nearly as bad as the Galveston Half I ran 9 months ago. I brought along an iPhone and Bluetooth earbuds, but this ended up being my second race without music after stupid technical problems. In Galveston, my earbud died, and this time, I’d failed to download any music. Doh!

I burned 910 calories, and while my heart rate wasn’t awesome, it recovered quickly whenever my pace dropped.

I started out reasonably strong, finishing the first 5K in 26:39. I was wearing my Fitbit watch, but immediately abandoned any attempt to use it to monitor my pace during the run. I was pleasantly surprised to find that I naturally settled into the exact 7mph pace that is my most common treadmill default.

Sadly, after the first 5K, I was panting and started taking 30 to 90 second walking breaks every three quarters of a mile or so, dropping my pace significantly in the back half of the race.

I tried to psych myself up with thoughts of Kilimanjaro and how short this race was (only 10K! I kept thinking), but I never got into that heavenly rhythm where the minutes just slide by for a while.

Still, I vowed to finish in under an hour, and in the last mile I was excited that I might beat 58 minutes, putting me about 5 minutes behind my second Cap10K in the spring, and still 10 minutes faster than my first real-world 10K.

Ultimately, I finished in 57:40:

While my arches felt a bit sore after the race (pavement is clearly much harder on me than my treadmill), I felt like running an extra four miles wouldn’t’ve been a nightmare… good news for the “Run for the Water” 10 miler coming up in two short weeks. But first, a charity 5K.

Microsoft Giving Campaign 5K – 10/27

The Microsoft Giving Campaign’s charity 5K consisted of five laps around a small lake near Microsoft’s north Austin office. It was fun to meet other Microsoft employees (many of whom started during the pandemic) and go for a fun run around the private Quarry Lake trail.

Alas, I had a hard time with my pacing and less fun than I’d expected dodging the mud puddles.

In total, it took me 32:41 to run the slightly-extended (3.27 mile) course.

Last year’s 5K was on a different course and I’d gone out too fast, ultimately needing to walk in the third mile, such that when I finished it, I turned around and ran the full course again, this time at a steady pace. While I don’t seem to have recorded my times anywhere, I think both runs were around 27 minutes.

Run for the Water – 10 Miler 11/5

The weather for the race was nearly perfect. I was nervous, having remembered last year’s race, but excited to revisit this race with another year’s worth of running under my belt.

I managed the steep hill at mile 5.5 well, but fell off pace at miles 8 and 9.

Sprinting at the end, my sunglasses went flying off my shirt; embarrassingly, I had to loop back into the finishers’ chute after crossing the finish line to collect them.

I finished in 1:38:05, a pace of 9:48, a somewhat-disappointing 9:08 slower than last year‘s race. Ah well.

Turkey Trot 5-Miler 11/23

I arrived for the Thanksgiving day race very early and ended up spending almost an hour outside before the race began, getting a bit colder than I’d’ve liked. I warmed up quickly when the run began.

I finished in 45:45, a pace of 9:09, a bit (1:39) slower than last year.

Shortly before the race started, I made the poor decision to eat a free granola bar and it didn’t sit well during the run. After sprinting through the finishers chute, I spent a panicked half-minute trying to find a convenient place to throw up amongst the thousands of bystanders until my heart calmed down and the urge subsided.

Virtual Decker Challenge Half 12/9

Having surveyed the boring Decker Challenge course last year while driving by during my missed half, I decided to sign up for the virtual option this year. The 2022 shirt (made by Craft) has become one of my favorites, and I wore it constantly on Kilimanjaro, including on summit day. I registered for the Virtual event to collect this year’s shirt, but alas, it turned out to be a disappointing no-name black long-sleeve tech shirt. There were basically no other perks to the virtual race — no medal, and nowhere even to report your virtual results. Bummer.

For the race itself, I chose to repeat the Jackson Hole half that I ran last year, beating last year’s time by almost 4 minutes, with a time of 2:00:20. This wasn’t a great time, but I was grateful for it: I’d had quite a bit of wine with dinner on Friday night, but had to run on Saturday morning because on Sunday I was taking my eldest to the Eagles/Cowboys game.

Virtual London Marathon 12/17

On this last weekend before the holidays, with my kids partying at their mom’s and my housemate out of town on a cruise, I was bored and starting to feel a bit stir crazy. I decided I’d run a marathon on my treadmill, picking 2021 London with Casey Gilbert. Unfortunately, the race was broken up into segments, so each switch to the next segment ended up being an “aid station” where I’d refill my water bottle or grab another stroopwafel.

The first half went great– I ran at a constant pace and finished in 1:55, feeling good and ready to keep going; my relatively slow pace kept my heart rate down. I finished the first twenty miles in a respectable 3:04:30, starting to worry that finishing in under 4 hours was going to be tough, but achievable. Unfortunately, things fell apart in miles 21 to 25, and my pace dropped dramatically; even though my heart rate was well under control, my legs were tired and I was beat. I rallied for the final three quarters of a mile back to 7mph, finishing with a somewhat disappointing but perhaps unsurprising 4:20:10.

Still, I’m proud of this race– while it was undoubtedly much easier on the body than a real-world marathon, it was in no way easy and I never gave up.

After cooling down, I showered, ate, and caught “Love Actually” at the Alamo Drafthouse for the afternoon. My FitBit reported that I cracked 45000 steps for the day. My intake of three running stroopwafels, two bananas, a leftover fajita, a salad, a steak and vegetables probably did not cover the 4000+ calories I burned in the run.

Two days later, I’m still sore, although all three blisters have mostly faded away.

Early 2024 Races

I’ve got a good slate for early 2024, running the 3M Half again in January, the Galveston Half in February, and the Capitol 10K in April. I really hope to beat last year’s times in the Half Marathons (slow and steady ftw!), but don’t expect I’ll be able to beat my 2023 Cap10K time of 52:25.

Defense Techniques: Blocking Protocol Handlers

Application Protocols represent a compelling attack vector because they’re the most reliable and cross-browser compatible way to escape a browser’s sandbox, and they work in many contexts (Office apps, some PDF handlers, some chat/messaging clients, etc).

Some protocol handlers are broadly used, while others are only used for particular workflows which may not be relevant in the user or company’s day-to-day workflows.

The Edge team can already block known-unsafe protocol schemes via a browser Component Update, but that’s only likely to happen if the scheme is broadly exploited across the ecosystem.

An organization or individual may wish to reduce attack surface by blocking the use of unwanted protocol handlers.

To block access to a given protocol handler from browsers, you can set the URLBlocklist policies in Chrome and Edge to prevent access to those protocols from within the browser.

For example, in 2022, the ms-appinstaller handler had a security bug, and many organizations that did not need this handler for their environments wished to disable it. They could set these policies:

REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Edge\URLBlockList" /v "1" /t REG_SZ /d "ms-appinstaller:*" /f

REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Google\Chrome\URLBlockList" /v "1" /t REG_SZ /d "ms-appinstaller:*" /f

After the policy is set, attempts to navigate the browser to the protocol will show an error page:


After the 2022 incident, the App Installer team also created a specific group policy to disable the MS-AppInstaller scheme within the handler itself:

REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\AppInstaller" /v "EnableMSAppInstallerProtocol" /t REG_DWORD /d "0" /f

When the App Installer protocol is disabled within the handler itself, invoking the protocol simply shows an error message:

Stay safe out there!

-Eric

PS: On Windows 11 today, unfortunately, there’s not always a simple way to block an arbitrary scheme for the system as a whole, unless the protocol handler application can be uninstalled entirely.

The Windows Set a default for a link type settings pane only allows you to choose a different Microsoft Store app that offers to support the protocol scheme, preventing you from unsetting the scheme entirely or pointing it to a harmless non-handler (e.g. Calculator).

To address this, we have to tell Windows that our generic little executable (saved to C:\windows\alert.exe) knows how to handle the ms-appinstaller protocol scheme, by listing it in the RegisteredApplications key with a URLAssociations key professing support:

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\com.bayden.alert]

[HKEY_CLASSES_ROOT\com.bayden.alert\shell]

[HKEY_CLASSES_ROOT\com.bayden.alert\shell\open]

[HKEY_CLASSES_ROOT\com.bayden.alert\shell\open\command]
@="\"C:\\windows\\alert.exe\" \"%1\""

[HKEY_CURRENT_USER\Software\RegisteredApplications]
"BaydenAlert"="SOFTWARE\\Bayden Systems\\Alert\\Capabilities"

[HKEY_CURRENT_USER\Software\Bayden Systems\Alert]

[HKEY_CURRENT_USER\Software\Bayden Systems\Alert\Capabilities]

[HKEY_CURRENT_USER\Software\Bayden Systems\Alert\Capabilities\URLAssociations]
"ms-appinstaller"="com.bayden.alert"

After we do this, we can change the protocol handler app from the real one to our stub:

…such that subsequent attempts to invoke the handler are harmlessly displayed:
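For completeness, the “alert” stub referenced above doesn’t need to do anything clever. A minimal sketch (the message text and build command are illustrative, and this is not necessarily the exact alert.exe described above) could be as simple as:

// alert.cpp -- a do-nothing protocol-handler stub that just tells the user what
// was invoked. Build as a Windows GUI app, e.g.:  cl alert.cpp user32.lib
#include <windows.h>

int WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int) {
    // The command line includes the URL that launched us (the "%1" from the .reg file above).
    MessageBoxW(nullptr, GetCommandLineW(),
                L"This protocol has been disabled on this machine.",
                MB_OK | MB_ICONINFORMATION);
    return 0;
}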

Attack Techniques: Steganography

Attackers are incentivized to cloak their attacks to avoid detection, keep attack chains alive longer, and make investigations more complicated.

One type of cloaking involves steganography, whereby an attacker embeds hidden data inside an otherwise innocuous file. For instance, an attacker might embed their malicious code inside an image file, not in an attempt to exploit a vulnerability in image parsers, but instead just as a convenient place to stash malicious data. That malware-laden image file may then be hosted anywhere that allows image uploads. There are plenty of such places on the public internet, because image file types are generally not considered dangerous.

In the recent Diamond Sleet attack, the attackers embedded the second stage of the attack as data inside of a PNG file and hosted that file on three unwitting web services, including the popular Imgur and GitHub services. The first stage of the attack code reaches out to one of three URLs, downloads the image, extracts the attack code, decrypts it, and runs the resulting malware.

When parsing the malicious PNG file, we see that the attackers got lazy– they shoved their data in the middle of the file after the end of its final IDAT chunk and before the IEND chunk.

In this case, the attackers didn’t bother formatting their attack as a valid PNG chunk; even though the malicious data is only 498,176 bytes long, the bytes 15 18 E1 3A at the front of the malicious content would tell a PNG parser to expect almost 354MB of data in the chunk.

But none of that matters to the attacker’s code — they don’t need to parse the file as if it were legitimate, they just grab the part of the file they care about and ignore the rest.

Developers of malicious browser extensions have been using this approach for years, because they learned from experience that the JavaScript files inside extension uploads get more scrutiny from browsers’ web stores’ security reviewers than the other files in the extension packages.

Defenders who want to detect hidden code can try to look for anything suspicious: malformed chunks, unknown chunk types, trailing data, and suspiciously inefficient files (e.g. much larger than the pixel count would suggest). But ultimately, there’s no way to guarantee that you’ll ever detect embedded messages.
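If you want to automate that kind of triage, the chunk walk itself is straightforward. Here’s a rough sketch (not any particular product’s detection logic) of a chunk walker that flags exactly the anomalies described above: a declared chunk length that overruns the file (which is what stuffed inter-chunk data looks like to a parser) and trailing bytes after IEND:

// png_anomaly_scan.cpp -- walk a PNG's chunk list and report structural anomalies.
// A sketch for illustration; build with e.g.:  g++ -std=c++17 png_anomaly_scan.cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "Usage: png_anomaly_scan <file.png>\n"); return 1; }

    std::ifstream f(argv[1], std::ios::binary);
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(f)),
                                    std::istreambuf_iterator<char>());
    static const unsigned char kSig[8] = {0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
    if (data.size() < 8 || std::memcmp(data.data(), kSig, 8) != 0) {
        std::puts("Not a PNG (bad signature).");
        return 1;
    }

    size_t pos = 8;
    bool sawIEND = false;
    while (pos + 8 <= data.size() && !sawIEND) {
        // Each chunk: 4-byte big-endian length, 4-byte type, <length> data bytes, 4-byte CRC.
        uint32_t len = (uint32_t(data[pos]) << 24) | (uint32_t(data[pos + 1]) << 16) |
                       (uint32_t(data[pos + 2]) << 8) | uint32_t(data[pos + 3]);
        std::string type(reinterpret_cast<const char*>(&data[pos + 4]), 4);

        if (pos + 12ull + len > data.size()) {
            // This is what raw bytes stuffed between chunks look like: their first four
            // bytes get misread as an absurdly large chunk length.
            std::printf("ANOMALY at offset %zu: chunk '%s' declares %u bytes, but only %zu remain.\n",
                        pos, type.c_str(), len, data.size() - pos);
            return 2;
        }
        std::printf("chunk '%s', %u bytes\n", type.c_str(), len);
        if (type == "IEND") sawIEND = true;
        pos += 12ull + len;   // advance past length + type + data + CRC
    }

    if (sawIEND && pos < data.size()) {
        std::printf("ANOMALY: %zu trailing bytes after IEND.\n", data.size() - pos);
        return 2;
    }
    std::puts(sawIEND ? "Chunk structure looks well-formed." : "Truncated: no IEND chunk found.");
    return 0;
}

(A more thorough scanner would also verify each chunk’s CRC and sanity-check the file size against the decoded pixel dimensions, but even this minimal walk catches the lazy between-chunks approach described above.)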

A sufficiently motivated attacker could encrypt their malware and then encode it as legitimate pixel data (say, the “random” pixels of the stars in the sky) and there’s no way for a researcher to detect it without knowing the decryption routine. That said, finding an image’s URL inside a captured copy of the first stage or its network traffic is typically a pretty strong indication that there’s something malicious embedded in the file, because attackers tend not to bother downloading data they don’t need.

-Eric

Troubleshooting Edge (or Chrome) Broken UI

Last time, we looked at how to troubleshoot browser crashes. However, not all browser problems result in the tab or browser crashing entirely. In some cases, the problem is that some part of the browser UI doesn’t render correctly.

This most commonly occurs with parts of the UI that are written in HTML and JavaScript. In Chromium-based browsers, that includes the F12 Developer Tools and the pages served from chrome:// or edge:// urls.

Within Microsoft Edge, HTML-based UI also includes many of the new UI elements including the History/Favorites/Download Manager “Hubs”/bubbles and the sidebar on the right.

Debugging Built-in Pages

Sometimes, a built-in page will fail to function or render correctly, or might not show any content at all. For example, performing a search inside edge://settings might result in the page content disappearing entirely.

When this happens, you can simply hit F12 to open the browser’s F12 Developer Tools. The Console tab of the tools will show the JavaScript error that caused the problem. If you report the bug (Alt+Shift+I), including the Console tab’s content in your screenshot will make it simpler for the engineers to figure out what went wrong.

Debugging Developer Tools

Sometimes, one or more tabs in the F12 Developer Tools does not load as expected. For example, here’s a case where the Application tab doesn’t load:

Fortunately, you can use a second instance of the Developer Tools to debug the Developer Tools!

First, undock the DevTools into their own window by clicking the menu inside the tools, then set the Dock location to undocked.

Next, use the Developer Tools to debug the Developer Tools by clicking into the DevTools window and hitting the Ctrl+Shift+I keyboard shortcut. A second instance of the DevTools will appear, debugging the first instance of the DevTools:

The Console tab of the second Developer Tools instance will show any script errors from the first:

Debugging Hubs

Within Edge, various flyouts from the toolbar are implemented as small HTML content areas that appear when you click various toolbar buttons. To debug these flyouts:

  1. Open a second Edge window and navigate it to edge://inspect.
  2. Click the Pages link on the left-hand side.
  3. In the first window, open the problematic Hub UI (e.g. hit Ctrl+H to open the History hub).
  4. Click the Inspect link under the page that appears in the Pages list:

Alternatively, you could try loading the Hub as a top-level page by using its undocumented URL (e.g. navigate to edge://downloads/hub to open the Edge Downloads page in its Hubs mode).

Using DevTools to debug the browser itself feels like a super-power.

-Eric

Troubleshooting Edge (or Chrome) Browser Crashes

In the modern browser world, there are two types of crashes: browser crashes and renderer crashes.

In a browser crash, the entire browser window with all of its tabs simply vanishes, either on startup, or at some point afterward. The next time the browser starts, it should recognize that the last time it exited was unexpected and offer to recover the tabs that were open at the time of the crash:

Crash recovery prompt shown after restart

In contrast, in a renderer crash, one (or more) tabs will show a simple crash page with a "This page is having a problem" error message:

Edge’s Renderer Crash notification page

The Error code mentioned at the bottom of the page might help point to the cause of the problem. Sometimes if you search for this code on Twitter or Reddit you might see a recent set of complaints about it and learn what causes it.

Automatic Crash Reporting

Microsoft Edge uses a Microsoft crash-reporting system called Watson (after Sherlock Holmes’ famous assistant). When your browser or one of its tabs crashes, the system collects information useful for engineers to diagnose and fix the problem and uploads it to Microsoft’s servers for analysis. (Chrome includes a similar system, called CrashPad, which sends Chrome crash reports to Google).

Backend systems automatically sort crash reports into buckets, count them, and surface information for the browsers’ reliability engineers to investigate in priority order. These engineers constantly monitor the real-world crashes worldwide and work to send fixes out to users via updated versions (“respins“) as quickly as possible.

In the vast majority of cases, you don’t need to do anything to get a browser update beyond just waiting. Visiting about:help will kick off a version update check:

…but even if you cannot do that (e.g. because the browser crashes on startup such that you cannot navigate anywhere), Edge’s background updater will check for updates every few hours and silently install them. If you want to hurry it along, you can do so from the command-line.

about:crashes

You can see information about recent crashes by visiting the about:crashes page in the browser. If you end up talking to an engineer about a crash or post a question in a forum asking for help, posting the CAB ID and Bucket ID information from Edge’s crash report (or the Uploaded Crash Report ID from Chrome’s crash report) will allow browser engineers to easily pull up the data from your crash.

ProTip: When sharing these ID values, please share them as text, not a screenshot; no one wants to try to retype 32 digit hexadecimal numbers from your grainy JPEGs.

Screenshot for this explanation only. Share text, not screenshots!

Sharing the Bucket and CAB IDs often greatly enhances engineers’ ability to help you understand what’s going on.

If you’d like, you can click the Submit feedback button next to a crash and provide more information about what specifically you were doing that led to the crash. In most cases, this isn’t necessary, but it can be useful if you have a reproducible crash (e.g. it happens every time), and you were doing something that might otherwise be hard to reproduce (e.g. loading a non-public web page).

Self-Troubleshooting

What if the entire browser crashes on startup, such that you cannot visit the about:crashes page? Or perhaps you want to see whether you can change something about your browser to avoid the crash. In such cases, you might want to explore the following troubleshooting steps.

Troubleshooting: Try a new Profile

Some crashes are profile-dependent– a setting or extension in your profile might be the source of a problem. If you have multiple different browser profiles already, and you’re hitting a renderer crash, try switching to a different profile and see whether the problem recurs.

If you’re hitting a browser crash (which prevents you from easily switching profiles), try starting Edge with a new profile:

  1. Close all Edge instances; ensure that msedge.exe is not running in Task Manager. (Sometimes, Edge’s Startup Boost feature means there may be one or more hidden processes running.)
  2. Click Start > Run or hit Windows+R and enter msedge.exe --profile-directory=Rescue

If Edge opens and browses without crashing, the problem was likely related to something in your profile — whether a buggy extension, an experiment, or some other setting.

Troubleshooting: Change Channels

If you’re crashing in a pre-release channel (Canary, Dev, or Beta) try retreating to a more stable channel. The pre-release channels are expected to have crashes from time-to-time (especially Dev and Canary) — the whole point of publishing these channels is to allow the team to capture crash reports from the wild before those bugs make it into the Stable release.

If you’re crashing in the Stable channel, try the Dev channel instead– while the team generally tries to pull all fixes into the Stable channel, sometimes those fixes might make it into the Canary/Dev builds first.

Troubleshooting: Try disabling extensions

Generally, a browser extension should not be able to crash a tab or the browser, but sometimes browser bugs mean that they might. If you’re encountering renderer (tab) crashes, try loading the site in InPrivate/Incognito mode to see if the crash disappears. If it does, the problem might be one of your browser extensions (because extensions, by default, do not load in Private mode).

You can visit about:extensions to see what browser extensions are installed and disable them one-by-one to see whether the problem goes away.

Troubleshooting: Try running without extensions at all

In rare cases, a browser bug tickled by extensions might cause the browser to crash on startup such that you cannot even visit the about:extensions page to turn off extensions.

To determine whether a browser crash is related to extensions, follow these steps:

  1. Close all Edge instances; ensure that msedge.exe is not running in Task Manager. (Sometimes, Edge’s Startup Boost feature means there may be one or more hidden processes running.)
  2. Click Start > Run or hit Windows+R and enter msedge.exe --disable-extensions
  3. Verify that Edge opens correctly
  4. Visit edge://crashes and get the CABID/BucketID information from the uploaded crash report

Unfortunately, while in the “Extensions disabled” mode, you cannot visit the about:extensions page to manually disable one or more of your extensions — in this mode, the browser will act as if it has no extensions at all.

Troubleshooting: Check for Accessibility Tools

Sometimes, Chromium tabs might crash due to buggy handling of accessibility messages. You can troubleshoot to see whether Accessibility features are unexpectedly enabled (and try disabling them) to see whether that might be the source of the problem. See Troubleshooting Accessibility.

Troubleshooting: Check for incompatible software

If all of the browser’s tabs crash, or you see many different tab crashes across many different sites, it’s possible that there’s software on your PC that is interfering with the browser. Read my post about how to use the about:conflicts page to discover problematic security or utility software that may be causing the problem.

Troubleshooting: Check Device for Errors

While it’s rare that a PC that worked fine one day abruptly stops working the next due to a hardware problem, it’s not impossible. That’s particularly true if the crash isn’t just a browser crash but a bluescreen of death (aka bugcheck) that crashes the whole computer.

In the event of a bluescreen or if the error code is STATUS_ACCESS_VIOLATION, consider running the Windows Memory diagnostic (which tests every bit of your RAM) to rule out faulty memory. You can also right-click on your C: drive in Explorer, choose Properties > Tools > Error Checking to check for disk errors.

If the crash is a bluescreen, you should also try updating any drivers for your graphics card and other devices, as many bluescreens are a result of driver bugs.

Advanced Troubleshooting: Try a different flavor

If you’re a software engineer and want to help figure out whether the crash is in Chromium itself, or just in Chrome or Edge, try running the equivalent channel of the other browser (e.g. if Chrome Dev crashes, try Edge Dev on the same URL).

Advanced Troubleshooting: Examine Dumps

Software engineers familiar with using debuggers might be able to make sense of the minidump files collected for upload to the Watson service. These .dmp files are located within a directory based on the browser’s channel.

For example, crashes encountered in Edge Stable are stored in the %LOCALAPPDATA%\Microsoft\Edge\User Data\Crashpad\reports\ directory.

-Eric

PS: Want to see what crash experiences look like for yourself?

To simulate a browser crash in Edge or Chrome, go to the address bar and type about://inducebrowsercrashforrealz and hit enter. To simulate a tab crash, enter about://crash instead.

Driving Electric – One Year In

One year ago, I brought home a new 2023 Nissan Leaf. I didn’t really need a car, but changing rules around tax credits meant that I pretty much had to buy the Leaf last fall if I wanted to save $7500. It was my first new car in a decade, and I’m mostly glad I bought it.

Quick Thoughts

The Leaf is fun to drive. Compared to my pokey 2013 CX-5 with its anemic 155 horsepower, the Leaf accelerates like a jet on afterburner. While the CX-5 feels a bit more stable on the highway at 80mph, the Leaf is an awesome city car — merging onto highways is a blast.

That said, the car isn’t without its annoyances.

  • The nominal 160 mile range is a bit too short for comfort, even for my limited driving needs; turning on the A/C (or worse, the defroster) shaves ~5-10% off the range.
  • When the car is off, you cannot see the current charge level and predicted range unless you have the key and “start” the car. (While plugged in, three lights indicate the approximate charge progress.)
  • My Wallbox L2 charger has thrice tripped the 30A breaker, even when I lower the Wallbox limit to 27 amps.
  • The back seat is pretty small– while technically seating 3, I’d never put more than 2 kids back there for a ride of any length, and even my 10yo is likely to “graduate” to the front seat before long.
  • The trunk is surprisingly large though. I only need the CX-5 for long road trips, 5+ passengers, or when I’ve got a ladder or a dog to move.

Miles and Power

In my first year, I’ve put 6475 miles on my Leaf (Update: I hit 15000 miles the month of its second birthday), with 1469kWh coming from my Wallbox L2 wall charger (~6.4kW), perhaps 120kWh from the slow 120V charger (1.8kW), and 6kWh from a 40kW DC charger, for a total of 1595kWh of electricity.

This represents almost exactly 40 “fillups” of the 40kWh battery (though I rarely charged to over 90%), and an energy cost of somewhere around $150 for the year. (My actual cost is somewhat harder to measure, since I now have solar panels). By way of comparison, my CX-5 real-world driving is ~28mpg, and the 231 gallons of gas I would have used would’ve cost me around $700.

One of the big shortcomings for the Leaf is that it uses the standards-war loser CHAdeMO fast-charger standard, which means that fast-chargers are few and far between. Without a fast charger (which would allow a full fill-up in about an hour), taking roadtrips beyond 80 miles is a dicey proposition.

For most of the year, I had thought that Austin only had two CHAdeMO chargers (one at each of the malls) but it turns out that there are quite a few more on the ChargePoint network, including one at the Austin Airport. Having said that, my one trial of that fast charger cost a bit more than the equivalent in gasoline — I spent $3.74 to fill 6kWh (~27 miles) in 17 minutes, at a pace that was around half what the charger should be able to attain– annoying because the charger bills per minute rather than per kWh. But it’s nice to know that maybe I could use the Leaf for a road trip with careful planning.

Conclusions

I like the Leaf, I like the price I paid for it, and I like that it’s better for the environment. That said, if I were to buy an electric today, it’d almost certainly be a Tesla Model Y.

In a year or two, it’s possible that I’ll swap the Leaf for a more robust electric SUV, or that I’ll trade the Mazda up for a plug-in hybrid.

-Eric

Protecting Auth Tokens

Authenticating to websites in browsers is complicated. There are numerous different approaches:

  • the popular “Web Forms” approach, where username and password (“credentials”) are collected from a website’s Login page and submitted in an HTTPS POST request
  • credentials passed in all requests via standard WWW-Authenticate HTTP headers
  • TLS handshakes augmented by client certificates
  • crypto-backed FIDO2/Passkeys via the WebAuthN API

Each of these authentication mechanisms has different user-experience effects and security properties. Sometimes, multiple systems are used at once, with, for example, a Web Forms login being bolstered by multifactor authentication.

In most cases, however, Authentication mechanisms are only used to verify the user’s identity, and after that process completes, the user is sent a “token” they may send in future requests in lieu of repeating the authentication process on every operation.

These tokens are commonly opaque to the client browser — the browser will simply send the token on subsequent requests (often in an HTTP cookie, a fetch()-set HTTP header, or within a POST body) and the server will evaluate the token’s validity. If the client’s token is missing, invalid, or expired, the server will send the user’s browser through the authentication process again.

Threat Model

Notably, the token represents a verified user’s identity — if an attacker manages to obtain that token, they can send it to the server and perform any operation that the legitimate user could. Obviously, then, these tokens must be carefully protected.

For example, tokens are often stored in HTTPOnly cookies to help limit the threat of a cross-site scripting (XSS) attack — if an attacker manages to exploit a script injection inside a victim site, the attacker’s injected script cannot simply copy the token out of the document.cookie property and transmit it back to themselves to freely abuse. However, HTTPOnly isn’t a panacea, because a script injection can allow an attacker to use the victim’s browser as a sock puppet wherein the attacker simply directs the victim’s own browser to issue whatever requests are desired (e.g. “Transfer all funds in the account to my bitcoin wallet <x>”).

Beyond XSS attacks conducted from the web, there are two other interesting threats: local malware, and insider threats. Protecting against these threats is akin to trying to keep secrets from yourself.

In the malware case, an attacker who has managed to get malicious software running on the user’s PC can steal tokens from wherever they are stored (e.g. the cookie database, or even the browser process’s memory at runtime) and transmit them back to themselves for abuse. Such attackers can also usually steal passwords from the browser’s password manager. However, stolen tokens could be more valuable than stolen passwords, because a given site may require multi-factor authentication (e.g. confirmation of logins from a mobile device) to use a password, whereas a valid token represents completion of the full login flow.

In the insider threat scenario, an organization (commonly, a financial services firm) has employees who perform high-value transactions from machines that are carefully secured, audited, and heavily monitored to ensure compliance with all mandated security protocols. A rogue employee hoping to steal from their employer may take their own authentication token (or, better yet, a token from a colleague’s unlocked PC) and use it on a different client that is not secured and monitored by the employer. By abusing the auth token from a different device, the attacker may evade detection long enough to abscond with their ill-gotten gains.

SaaS Expanded the Threat

In the corporate environments of the 1990s and 2000s, an attacker who stole a token from an enterprise user had limited ability to use it, because the enterprise’s servers were only available on the victim’s intranet and not reachable from the public Internet. Now, however, many enterprises rely mostly upon third-party software sold as a “service” (SaaS) that is available from anywhere on the Internet. An attacker who steals a token from a victim can abuse that token from anywhere in the world.

Root Cause

All of these threats have a common root cause: nothing prevents a token from being used in a different context (device/location) than the one in which it was issued.

Browser changes to raise the bar against local theft of cookies are (necessarily) of limited effectiveness, and always will be.

While some sites attempt to implement theft detection for their tokens (e.g. requiring the user to reauthenticate and obtain a new token if the client’s IP address or geographic location changes), such protections are complex to implement and can result in annoying false positives (e.g. when a laptop moves from the office to the coffee shop or the like).
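A sketch of what such a heuristic might look like on the server; the session shape and the signals compared here are entirely hypothetical:

```ts
// Compare the context recorded when the token was issued against the context
// of the current request; force reauthentication if it has changed.
interface SessionRecord {
  token: string;
  issuedToIp: string;
  issuedToUserAgent: string;
}

interface RequestContext {
  ip: string;
  userAgent: string;
}

function requiresReauthentication(session: SessionRecord, request: RequestContext): boolean {
  // Naive exact matching; real systems tend to use coarser signals (ASN,
  // geo-IP region, device attributes) precisely because exact-IP checks cause
  // false positives when a laptop moves from the office to a coffee shop.
  return session.issuedToIp !== request.ip ||
         session.issuedToUserAgent !== request.userAgent;
}
```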

Similarly, organizations might use Conditional Access or client certificates to prevent a stolen token from being used from a machine not managed by the enterprise, but these technologies aren’t always easy to deploy. However, conditional access and client certificates point at an interesting idea: what if a token could be bound to the client that received it, such that the token cannot be used from a different client?

A Fix?

Update: The Edge team has decided to remove Token Binding starting in Edge 130.

Token binding, as a concept, has existed in multiple forms over the years, but in 2018, a set of Internet Standards was finalized to allow binding cookies to a single client. While the implementation was complex, the general idea (sketched in code after the list below) is simple:

  1. Store a secret key on a client in a storage area that prevents it from being copied (“non-exportable”).
  2. Have the browser “bind” received cookies to that secret key, such that the cookies will not be accepted by the server if sent from another client.
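Here’s a conceptual sketch of the client’s half of that idea using the WebCrypto API. This is not the actual Token Binding wire protocol (which operates down at the TLS layer); it only illustrates the “non-exportable key signs a proof” concept:

```ts
// Generate a keypair whose private key is marked non-extractable, so script
// cannot copy it; with hardware backing, even local code can't export it.
async function createBindingKey(): Promise<CryptoKeyPair> {
  return crypto.subtle.generateKey(
    { name: "ECDSA", namedCurve: "P-256" },
    /* extractable: */ false, // the private key cannot be exported
    ["sign", "verify"]
  );
}

// The public key remains exportable; the client registers it with the server
// when the token is issued, then proves possession by signing server challenges.
async function proveKeyPossession(keys: CryptoKeyPair, challenge: string): Promise<ArrayBuffer> {
  return crypto.subtle.sign(
    { name: "ECDSA", hash: "SHA-256" },
    keys.privateKey,
    new TextEncoder().encode(challenge)
  );
}
```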

Token binding had been implemented by Edge Legacy (Spartan) and Chromium, but unfortunately for this feature, it was ripped out of Chrome right around the time that the Standards were finalized, just as Microsoft replatformed Edge atop Chromium.

As the Edge PM for networking, I was left in the unenviable position of trying to figure out why this had happened and what to do about it.

I learned that in Chromium’s original implementation of token binding, the per-site secret was stored in a plain file directly next to the cookie database. This design would’ve mitigated the threat of token theft via XSS attack, but provided no protection against malware or insiders, which could steal the secrets file just as easily as stealing the cookie database itself.

To provide security against malware and insiders, the secret must be stored somewhere it cannot be taken off the machine. The natural way to do that would be to use the Trusted Platform Module (TPM), special hardware designed to store “non-exportable” secrets. While interacting with the TPM requires different code on each OS platform, Chromium surmounts that challenge for many of its features. The bigger problem was that some TPMs offer very low performance: on some pages, communicating with the TPM could delay page load by dozens of seconds.

Ultimately, the Edge team brought Token Binding back to the new Chromium-based Edge browser with two major changes:

  1. The secrets were stored using Windows 10’s Virtual Secure Mode, offering consistently high performance, and
  2. Token-binding support is only enabled for administrator-specified domains via the AllowTokenBindingForUrls Group Policy

This approach ensured that Token Binding would perform well, but with two limitations: it was only supported on Windows 10 and later, and it was not a generalized solution that any website could use.

Even when those requirements are met, Token Binding provides only limited protection against locally-running malware: while the attacker can no longer take the token off the box to abuse it elsewhere, they can still use the victim’s PC as a sock puppet, driving a (hidden) browser instance to whatever URLs they like and abusing the bound token locally.

Beyond those limitations, Token Binding has a few other core challenges that make it difficult to use. First, the web server frontend must include support for Token Binding, and many did not. Second, Token Binding binds authentication tokens to the TLS connection to the server, making it incompatible with TLS-intercepting proxy servers (often used for threat protection). While such proxy servers are not very common, they are more common in exactly the sorts of highly-regulated environments where token binding is most desired. (An unimplemented proposal aimed to address this limitation.)

While token binding is a fascinating primitive to improve security, it’s very complex, especially when considering the deployment requirements, narrow support, and interactions with already-extremely-complicated changes on the way for cookies.


What’s Next?

The Chrome team is experimenting with a new primitive called Device Bound Session Credentials. Read the explainer — it’s interesting! The tl;dr of the proposal is that the client will maintain a securely stored (e.g. on the TPM) private key, and a website can demand that the client prove its possession of that private key.
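For a flavor of how a server might use such a key, here’s a rough sketch of the proof-of-possession check that a scheme like this relies on. It is not the DBSC protocol itself (read the explainer for the actual flow), just the verification step, using WebCrypto as available in modern browsers and Node:

```ts
// The server stored the device's public key (SPKI bytes) when the session was
// created; before honoring the session, it checks that a challenge it issued
// was signed by the matching device-bound private key.
async function verifyDeviceProof(
  storedPublicKeySpki: ArrayBuffer,
  challenge: string,
  signature: ArrayBuffer
): Promise<boolean> {
  const publicKey = await crypto.subtle.importKey(
    "spki",
    storedPublicKeySpki,
    { name: "ECDSA", namedCurve: "P-256" },
    false,
    ["verify"]
  );
  return crypto.subtle.verify(
    { name: "ECDSA", hash: "SHA-256" },
    publicKey,
    signature,
    new TextEncoder().encode(challenge)
  );
}
```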


-Eric

PS: I’ve personally always been more of a fan of client certificates used for mutual TLS authentication (mTLS), but they’ve long been hampered by their own shortcomings, some of which have been mitigated only recently and only in some browsers. mTLS has made a lot of smart and powerful enemies over the decades. See some criticisms here and here.