Many enterprise customers of Microsoft Windows, especially airlines, banks and health-care systems, use a third-party threat detection product called Falcon, built by CrowdStrike. Falcon is installed on Windows devices and integrated deeply into the operating system so that it can detect threats in real time. On July 19, 2024 at 04:09 UTC, as part of ongoing operations, CrowdStrike released a sensor configuration update to Windows systems. As luck would have it, this configuration update triggered a logic error that resulted in a system crash and the notorious Blue Screen of Death (BSOD) for customers running Falcon sensor for Windows version 7.11 and above.
Although the logic error was remediated about an hour and a half after the update's release, the affected machines had already crashed and were unreachable, so CrowdStrike could not push a remote fix. The only way forward was to manually restart Windows, boot into Safe Mode, apply the fix and then restart the machine again.
The resulting global tech outage hit airlines, banks, health care and public transit. Shares of CrowdStrike, which advertises being used by over half of Fortune 500 companies, dropped 11% to $304.96 in New York trading, wiping out more than $9 billion in market value. It was the company's biggest single-day decline since November 2022.
What was the logical error?
Interestingly, according to Zach Vorhies, a whistle-blower and software engineer who has spent a decade building embedded C++ projects for Google, the stack trace suggests that the logic error was a NULL pointer dereference in C++, a memory-unsafe language.
Memory in your computer is laid out as one giant array of numbers. We write these numbers here in hexadecimal, which is base 16, because it's easier to work with.
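To make the hexadecimal notation concrete, here is a minimal sketch that converts between ordinary decimal numbers and the "0x..." form used for memory addresses. The helper names to_hex and from_hex are made up for this illustration and are not part of any library.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Format a number the way debuggers usually print addresses:
// "0x" followed by lowercase hexadecimal digits.
std::string to_hex(std::uint64_t address) {
    std::ostringstream out;
    out << "0x" << std::hex << address;
    return out.str();
}

// Parse a hexadecimal string such as "0xff" (the "0x" prefix is optional)
// back into a plain number.
std::uint64_t from_hex(const std::string& text) {
    return std::stoull(text, nullptr, 16);
}
```

With these helpers, to_hex(255) gives "0xff" and from_hex("0x1000") gives 4096, so the two notations are just different spellings of the same position in the array.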
What was the problem area?
In the case of the Falcon configuration update, the software tried to read memory address 0x9c (156 in decimal).
Why is this bad?
This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS. That is what appears in the stack dump.
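The idea of an "invalid region" can be modelled with a toy memory, sketched below. Everything here is an assumption made for illustration: the ToyMemory class, its read function and the cutoff kFirstValidAddress are invented for the sketch and say nothing about how Windows actually lays out or protects memory. A real system crashes on such a read; the toy version simply reports failure.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative cutoff (an assumption, not a Windows constant):
// every address below this value is treated as invalid.
constexpr std::uint64_t kFirstValidAddress = 0x10000;

// A toy model of memory: a byte array that starts at kFirstValidAddress.
class ToyMemory {
public:
    explicit ToyMemory(std::size_t size) : bytes_(size, 0) {}

    // Returns the byte at `address`, or an empty optional when the address
    // falls in the invalid low region or past the end of the array.
    std::optional<std::uint8_t> read(std::uint64_t address) const {
        if (address < kFirstValidAddress) {
            return std::nullopt;  // the low region is off limits
        }
        const std::uint64_t index = address - kFirstValidAddress;
        if (index >= bytes_.size()) {
            return std::nullopt;  // beyond the memory this model owns
        }
        return bytes_[static_cast<std::size_t>(index)];
    }

private:
    std::vector<std::uint8_t> bytes_;
};
```

In this model a read of 0x9c is rejected because it falls below the cutoff, while a read of 0x10000 succeeds.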
So why did the software try to read memory address 0x9c?
It turns out that C++, the language CrowdStrike is using, likes to use address 0x0 as a special value meaning "there's nothing here; don't try to access it or you'll die". Programmers in C++ are supposed to check for this when they pass objects around, by "checking for null".
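A small address like 0x9c rather than 0x0 itself is what you would typically see when code reads a field through a null pointer: the address touched is zero plus the field's offset inside the object. The sketch below only does that arithmetic and never dereferences anything. The SensorConfig layout and the field name flags are hypothetical, chosen so that the offset comes out to 0x9c; they are not taken from CrowdStrike's code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: 0x9c bytes of other data, then the field being read.
struct SensorConfig {
    std::uint8_t header[0x9c];
    std::uint32_t flags;
};

// The field sits 0x9c (156) bytes from the start of the object.
static_assert(offsetof(SensorConfig, flags) == 0x9c,
              "flags is expected at offset 0x9c");

// Address that a read of `flags` would touch for an object located at
// `object_address`. Pure arithmetic; no memory is accessed.
std::uint64_t address_of_flags(std::uint64_t object_address) {
    return object_address + offsetof(SensorConfig, flags);
}
```

For an object at address 0x1000, address_of_flags returns 0x109c. For a null object (address 0) it returns 0x9c, which is the kind of address reported in the crash.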
Usually you'll see something like this:
// get_name() is a placeholder for any function that may return a null pointer.
std::string* p = get_name();
if (p == nullptr) { std::cerr << "Could not get name\n"; }
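The fragment above can be fleshed out into a small self-contained sketch. The functions get_name and name_or_default, and the name "example", are hypothetical and exist only to show the pattern of checking a pointer before using it.

```cpp
#include <string>

// Hypothetical lookup: returns a pointer to a name, or nullptr when
// no name is available.
const std::string* get_name(bool available) {
    static const std::string name = "example";
    return available ? &name : nullptr;
}

// Check the pointer before dereferencing it, and fall back to an
// error message when there is nothing to read.
std::string name_or_default(const std::string* p) {
    if (p == nullptr) {
        return "Could not get name";
    }
    return *p;
}
```

Calling name_or_default(get_name(false)) returns the fallback message instead of reading through a null pointer, which is the check the paragraph above describes.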
While CrowdStrike is still doing a Root Cause Analysis (RCA) of the issue, prima facie it appears that an innocuous-looking NULL pointer dereference may have brought down banking and air traffic systems worldwide. Another perplexing question is how, in this world of DevOps and software QA automation, the issue went undetected by the CrowdStrike team.