On July 19, 2024, one of the most widespread outages in the history of IT occurred when CrowdStrike, a cybersecurity company based in Austin, Texas, distributed an update to its Falcon Sensor software. A bug in the update crashed roughly 8.5 million Windows machines with the blue screen of death and cost the global economy an estimated $10 billion. Naturally, we wanted to see how the incident impacted mabl customers and whether that impact would show up in mabl test results.

A Look at mabl’s Data

The incident started at 4:09 UTC, according to CrowdStrike’s incident review. Looking at global test result pass rates, we quickly saw a significant decrease that aligns with this timing.

This dip began on Friday and continued through the weekend as teams worked to remove the offending file from their systems. By late Sunday, we saw tests begin to move back to passing, with all tests running as normal by late Monday, July 22nd. It’s important to note that while many mabl customers do run tests against production, the bulk of tests skew toward staging, QA, and earlier-stage environments. Since those environments are often restored after production, the timelines we see here likely lag behind production recovery.

The total number of tests run was also down 15% week over week. While that may seem like a fluke, it actually makes sense, since many of our customers set up their tests to run sequentially in stages. This means that if an early-stage test fails, the later-stage ones won’t execute. Given how many early-stage failures the CrowdStrike outage caused, the data tracks.
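To make the staging effect concrete, here is a minimal sketch of gated, sequential stages. It is an illustration only, not how mabl schedules tests: the `run_staged` function, the stage names, and the `ok`/`broken` stand-in tests are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a stage is a name plus a list of tests,
# where each test is a zero-argument callable returning True on pass.
Stage = Tuple[str, List[Callable[[], bool]]]


def run_staged(stages: List[Stage]) -> Dict[str, int]:
    """Run stages in order and stop after the first stage with a failure."""
    executed = 0
    passed = 0
    for _name, tests in stages:
        results = [test() for test in tests]
        executed += len(results)
        passed += sum(results)
        if not all(results):
            # Gate: a failure here means later stages never execute,
            # so their tests are missing from the totals entirely.
            break
    return {"executed": executed, "passed": passed}


def ok() -> bool:
    return True


def broken() -> bool:
    # Stand-in for a test that fails because its environment is down.
    return False


healthy = [("smoke", [ok, ok]), ("regression", [ok, ok, ok])]
outage = [("smoke", [ok, broken]), ("regression", [ok, ok, ok])]

print(run_staged(healthy))  # {'executed': 5, 'passed': 5}
print(run_staged(outage))   # {'executed': 2, 'passed': 1}
```

In this toy example a single early failure lowers the pass rate and also removes three tests from the executed count, which is the same pattern as a drop in total test volume during an outage.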

Takeaways

The first thing we noticed is that mabl customers did a nice job with recovery! Industry reports indicate that some companies were still severely impacted as late as July 25th. As far as we can tell from the data, teams using mabl were back to normal operations by late Monday.

More broadly, how can those of us building software in modern environments protect ourselves from similar incidents in the future? An obvious first step is to demand that our vendors provide ways to stage rollouts of even critical configuration updates in lower-level environments before production, something CrowdStrike has already committed to developing for the Rapid Response Content component.

Beyond that, it’s imperative to ensure your quality strategy includes sufficient automation of the full end-to-end user experience, something we at mabl believe in deeply. No amount of unit or pre-release testing would have enabled CrowdStrike customers to catch this issue. With better control over rollouts, robust system monitoring hopefully would. But it’s easy to have gaps in system monitoring, especially for anything that doesn’t show up in standard telemetry metrics. The only reliable way to ensure your app is providing the experience your customers expect is to have testing and monitoring that use the app the same way they do.

If you’ve found gaps in your testing system, we invite you to give mabl a try with a free 14-day trial. 

Our AI-powered testing platform can transform your software quality, integrating automated end-to-end testing into the entire development lifecycle.