Amid the rhetorical jousting between Delta and Microsoft over fault and damages from the July 19th worldwide Windows crash, CrowdStrike, the company at the center of the outage, has just released a report detailing exactly what went wrong that day.
According to a CrowdStrike Root Cause Analysis (RCA) report, a mismatch in the expected number of input fields for a recently introduced sensor capability resulted in an ‘out-of-bounds’ memory read that caused the now infamous Windows system crashes around the world.
Summarizing the RCA report, CrowdStrike explains that its relatively new CrowdStrike Falcon sensor, designed to deliver AI and machine learning protections, received a series of updates over almost six months without incident.
However, on July 19, CrowdStrike issued a Rapid Response Content update intended for certain Windows hosts that carried 21 input fields, while the Falcon sensor was only prepared to receive updates with 20 input fields.
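To make the 21-versus-20 mismatch concrete, here is a minimal sketch in Python. It is not CrowdStrike's code: the names `read_field`, `evaluate`, `EXPECTED_FIELDS`, and `SUPPLIED_FIELDS` are hypothetical, and Python raises an `IndexError` where unchecked native code would instead read memory past the end of the array.

```python
# Hypothetical illustration only; none of these names come from CrowdStrike's code.
EXPECTED_FIELDS = 21  # fields the content update refers to (per the article)
SUPPLIED_FIELDS = 20  # fields the sensor was prepared to handle (per the article)


def read_field(values, index):
    """Return values[index]; Python raises IndexError where unchecked
    native code would read out of bounds."""
    return values[index]


def evaluate(values, field_count):
    """Read every field the content references, in order."""
    return [read_field(values, i) for i in range(field_count)]


supplied = [f"value-{i}" for i in range(SUPPLIED_FIELDS)]

try:
    evaluate(supplied, EXPECTED_FIELDS)
    outcome = "ok"
except IndexError:
    # The 21st field (index 20) has no backing value.
    outcome = "out-of-bounds read at index 20"

print(outcome)
```

In a memory-safe language the mismatch surfaces as an exception. In kernel-level code without a bounds check, the same access can bring down the whole system, which is what the article describes.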
CrowdStrike’s RCA report also states that the flaw behind the crash is ‘not exploitable by a threat actor’ and should not recur, thanks to “mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience.”
As for what CrowdStrike has done to prevent a similar industry-rattling outage triggered by its services, the company lays out several actions it has taken or plans to take, including:
- Update Content Configuration System test procedures. This work has been completed. This includes upgraded tests for Template Type development, with automated tests for all existing Template Types. Template Types are part of the sensor and contain predefined fields for threat detection engineers to leverage in Rapid Response Content.
- Add additional deployment layers and acceptance checks for the Content Configuration System. This work has been completed with an updated deployment ring process, ensuring Template Instances pass successive deployment rings before rollout into production.
- Provide customers additional control over the deployment of Rapid Response Content updates. New capabilities have been implemented and deployed to our cloud that allow customers to control how Rapid Response Content is deployed, with additional functionality planned for the future.
- Prevent the creation of problematic Channel 291 files. Validation for the number of input fields has been implemented to prevent this issue from happening.
- Implement additional checks in the Content Validator. Additional checks are planned for release into production by August 19, 2024.
- Enhance bounds checking in the Content Interpreter for Rapid Response Content in Channel File 291. Bounds checking was added on July 25, 2024, with general availability expected August 9, 2024. These fixes are being backported to all Windows sensor versions 7.11 and above through a sensor software hotfix release.
- Engage two independent third-party software security vendors to conduct further review of the Falcon sensor code and end-to-end quality control and release processes. This work has begun and will be ongoing as part of our focus on security and resilience by design.
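Two of the items above, validating the number of input fields and adding bounds checking, can be sketched roughly as follows. This is an illustration under stated assumptions, not CrowdStrike's implementation: `validate_field_count`, `read_field_checked`, and `SENSOR_INPUT_COUNT` are hypothetical names, and the count of 20 is taken from the article.

```python
from typing import Optional, Sequence

# Assumed value: the number of input fields the sensor handles, per the article.
SENSOR_INPUT_COUNT = 20


def validate_field_count(referenced_fields: int,
                         supplied: int = SENSOR_INPUT_COUNT) -> bool:
    """Publish-time check: accept content only if it references no more
    fields than the sensor supplies."""
    return 0 <= referenced_fields <= supplied


def read_field_checked(values: Sequence[str], index: int) -> Optional[str]:
    """Runtime check: return None instead of reading past the end."""
    if 0 <= index < len(values):
        return values[index]
    return None


values = [f"value-{i}" for i in range(SENSOR_INPUT_COUNT)]

print(validate_field_count(21))        # content referencing 21 fields is rejected
print(read_field_checked(values, 20))  # the missing 21st field yields None
```

The two checks are complementary: the first keeps a problematic file from shipping, and the second limits the damage if one ships anyway.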
While the report may stem some of the brand bleeding the July 19th outage has caused, the financial and reputational damage has already been done for everyone involved.
It’ll be interesting to see where this RCA report lands in the coming days as CrowdStrike, Delta, and Microsoft all point fingers at one another in what could end up being a costly legal battle over the initial outage and the three companies’ response efforts.