So, as we all know from the news, the cybersecurity firm CrowdStrike Y2K’d its own end customers with a shoddy, untested update.
But how does this happen? Aren’t there programming teams that check their code or pass it to quality assurance staff to see if it bricks their own machines?
8.5 million machines too; does that affect home users, or is it only Windows machines that have this endpoint agent installed?
Lastly, why would large firms and government institutions such as railway networks and hospitals put all their eggs in one basket? Surely chucking everything into “The Cloud (Literally just another man’s tinbox)” would be disastrous?
TLDR - Confused how this titanic tits-up could happen and how 8.5 million Windows machines (POS, desktops and servers) just packed up.
Lastly, why would large firms and government institutions such as railway networks and hospitals put all their eggs in one basket? Surely chucking everything into “The Cloud (Literally just another man’s tinbox)” would be disastrous?
Because they are best in class. No one else does EDR like Crowdstrike does. Can you imagine the IT support headaches if you had 200,000 PCs and servers some running one EDR and others running a different one. The amount of edge cases you would come across is ridiculous.
It would make data correlation a nightmare if an actual security incident occurred.
Well, obviously that’s about to change. Some of the core product is still fantastic, but their (presumably) greed and their process handling around how they deliver changes have failed here.
The product is still good; hopefully they can mature.
I want to clarify something that you hinted at in your post but I’ve seen in other posts too. This isn’t a cloud failure or remotely related to it, but a facet of a company’s security software suite causing crippling issues.
I apologize ahead of time, when I started typing this I didn’t think it would be this long. This is pretty important to me and I feel like this can help clarify a lot of misinformation about how IT and software works in an enterprise.
CrowdStrike is an EDR, or Endpoint Detection and Response software. Basically a fancy antivirus that isn’t file-signature based but action-monitoring based. Like all AVs, it receives regular definition updates around once an hour to anticipate possible threat actors using zero-day exploits. This is the part that failed: the hourly update channel pushed a bad update. Some computers escaped unscathed because they checked in either right before the bad update was pushed or right after it was pulled.
Another facet of AVs is that how they work depends on monitoring every part of a computer. This requires specific drivers to integrate into the core OS, which were updated to accompany the definition update. Anything that integrates that closely can cause issues if it isn’t made right.
Before this incident, CrowdStrike was regarded as the best in its class of EDR software. This isn’t something companies would swap to willy-nilly just because they feel like it. The scale of implementing a new security software for all systems in an org is a huge undertaking, one that I’ve been a part of several times. It sucks to not only rip out the old software but also integrate the new software and make sure it doesn’t mess up other parts of the server. Basically companies wouldn’t use CS unless they are too lazy to change away, or they think it’s really that good.
EDR software plays a huge role in securing a company’s systems. Companies need this tech for security but also because they risk failing critical audits or can’t qualify for cybersecurity insurance. Any similar software could have issues - Cylance, Palo Alto Cortex XDR, Trend Micro are all very strong players in the field too and are just as prone to having issues.
And it’s not just the EDR software that could cause issues, but lots of other tech. Anything that receives regular definition or software updates can’t or shouldn’t be gated by the enterprise, because the frequency and urgency of the updates make them impractical to filter. Firewalls come to mind, but there could be a lot of systems at risk of failing due to a bad update. Of course, it should fall on the enterprise to provide the manpower to do this, but this is highly unlikely when most IT teams are already skeleton crews and subject to heavy budget cuts.
So with all that, you might ask “how is this mitigated?” It’s a very good question. The most obvious solution, “don’t use one software on all systems”, is more complicated and expensive than you think. Imagine bug testing your software for two separate web servers - one uses CrowdStrike, Tenable, Apache, Python, and Node.js, and the other uses TrendMicro, Qualys, nginx, PHP, and Rust. The amount of time wasted on replicating behavior would be astronomical, not to mention that you’re unlikely to have feature parity. At what point do you define the line of “having redundant tech stacks” to be too burdensome? That’s the risk a lot of companies take on when choosing a vendor.
On a more relatable scale, imagine you work at a company and desktop email clients are the most important part of your job. One half of the team uses Microsoft Office and the other half uses Mozilla Thunderbird. Neither software has feature parity with the other, and one will naturally be superior to the other. But because the org is afraid of everyone getting locked out of emails, you happen to be using “the bad” software. Not a very good experience for your team, even if it is overall more reliable.
A better solution is improved BCDR (business continuity and disaster recovery) processes, most notably backup and restore testing. For my personal role in this incident, I only have a handful of servers affected by this crisis, for which I am very grateful. I was able to recover 6 out of 7 affected servers, but the last is proving to be a little trickier. The best solution would be to restore this server to a former state and continue on, but in my haste to set up the env, I neglected to configure snapshotting and other backup processes. It won’t be the end of the world to recreate this server, but this could have been even worse if this server had any critical software on it. I do plan on using this event to review all systems I have a hand in to assess redundancy in each facet - cloud, region, network, instance, and software level.
Laptops are trickier to fix because of how distributed they are by nature. However, they can still be improved by having regular backups taken of a user’s files and testing that BitLocker is properly configured and curated.
All that said, I’m far from an expert on this, just an IT admin trying to do what I can with company resources. Here’s hoping CrowdStrike and other companies greatly improve their QA testing, and IT departments finally get the tooling approved to improve their backup and recovery strategies.
Fantastic write up. I’d just add something to this bit:
Basically companies wouldn’t use CS unless they are too lazy to change away, or they think it’s really that good.
I work in Cyber Security for a large organization (30,000+ end points). We’re considering moving to CrowdStrike. Even after this cock-up, we’re still considering moving to CS. I’ve had direct experience with several different A/V and EDR products, and every single one of them has had a bad update cause systems to BSOD. The reason this one hit so hard is that CS is one of the major EDR/XDR vendors. But ya, it’s generally considered that good. Maybe some folks will move away after this. And maybe another product is nipping at their heels and will overtake them in the near future. But, for now, it’s not surprising that it was everywhere for this situation to get really FUBAR.
If people don’t start using alternatives, the centralization remains a vulnerability and this will happen again.
Perhaps instead of clients using two different security systems, CrowdStrike and similar companies could have two or more completely independent teams sending out separate versions of their hourly updates. That way, when something like this happens, it would likely not bring down all of a client’s systems, which would help resilience. It could be made into a requirement for providing security software to critical/strategic industries like healthcare, power transmission/distribution, defense, etc.
Great write up, and I’m glad you were able to recover most of your servers after this incident.
8.5 million machines too; does that affect home users, or is it only Windows machines that have this endpoint agent installed?
This software is mandated by cyber insurance companies to ‘keep your business secure’, aka “your staff broke policy, so we don’t have to pay out this claim”.
No home user should ever run something like this at all. This is entirely a corporate thing.
This is actually an excellent question.
And for all the discussions on the topic in the last 24h, the answer is: until a postmortem is published, we don’t actually know.
There are a lot of possible explanations for the observed events. Of course, one simple and very easy to believe explanation would be that the software quality processes and reliability engineering at CrowdStrike are simply below industry standards – if we’re going to be speculating for entertainment purposes, you can in fact imagine them to be as comically bad as you please, no one can stop you.
But as a general rule of thumb, I’d be leery of simple and easy to believe explanations. Of all the (non-CrowdStrike!) headline-making Internet infrastructure outages I’ve been personally privy to, and that were speculated about on such places as Reddit or Lemmy, not one of the commenter speculations came close to the actual, and often fantastically complex chain of events involved in the outage. (Which, for mysterious reasons, did not seem to keep the commenters from speaking with unwavering confidence.)
Regarding testing: testing buys you a certain necessary degree of confidence in the robustness of the software. But this degree of confidence will never be 100%, because in all sufficiently complex systems there will be unknown unknowns. Even if your test coverage is 100% – every single instruction of the code is exercised by at least one test – you can’t be certain that every test accurately models the production environments that the software will be encountering. Furthermore, even exercising every single instruction is not sufficient protection on its own: the code might for instance fail in rare circumstances not covered by the test’s inputs.
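To make the coverage point concrete, here is a tiny Python sketch. Everything in it is made up for illustration (the `parse_entry` function and the three-field entry format are hypothetical, not anything from a real product). The single test executes every line of the function, so line coverage is 100%, and yet an input the test never modelled still crashes it.

```python
def parse_entry(fields):
    """Return the third field of a (hypothetical) definition entry."""
    return fields[2]


def test_parse_entry():
    # This one test executes every line of parse_entry: 100% line coverage.
    assert parse_entry(["rule", "process", "block"]) == "block"


test_parse_entry()

# A malformed entry with only two fields, which the test never exercised.
try:
    parse_entry(["rule", "process"])
    crashed = False
except IndexError:
    crashed = True  # the "fully covered" function still fails on this input
```

In user-space Python that is a caught exception. The same class of mistake in code running inside the OS kernel has much less room to fail gracefully.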
For these reasons, one common best practice is to assume that the software will sooner or later ship with an undetected fault, and to therefore only deploy updates – both of software and of configuration data – in a staggered manner. The process looks something like this: a small subset of endpoints are selected for the update, the update is left to run in these endpoints for a certain amount of time, and the selected endpoints’ metrics are then assessed for unexpected behavior. Then you repeat this process for a larger subset of endpoints, and so on until the update has been deployed globally. The early subsets are sometimes called “canary”, as in the expression “canary in a coal mine”.
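As a rough illustration of that process, here is a minimal Python sketch. The function name, the stage fractions, and the `deploy` / `is_healthy` callbacks are all assumptions of mine, not any vendor’s actual rollout system, and a real one would also wait between stages and look at far richer metrics than a yes/no health check.

```python
def staged_rollout(endpoints, deploy, is_healthy, stages=(0.01, 0.10, 1.0)):
    """Deploy to growing subsets of endpoints, halting on the first bad sign.

    `stages` are cumulative fractions of the fleet (assumed values).
    Returns (completed, endpoints_that_received_the_update).
    """
    remaining = list(endpoints)
    total = len(remaining)
    updated = []
    for fraction in stages:
        # How many endpoints should have the update once this stage is done.
        target = max(1, int(total * fraction))
        batch_size = max(0, target - len(updated))
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for endpoint in batch:
            deploy(endpoint)
            updated.append(endpoint)
        # A real system would let the update soak for a while here.
        if not all(is_healthy(endpoint) for endpoint in updated):
            return False, updated  # halt: the canaries caught the fault
    return True, updated


# Usage: simulate an update that crashes every machine it lands on.
fleet = [f"host-{i}" for i in range(1000)]
crashed_hosts = set()
completed, touched = staged_rollout(
    fleet,
    deploy=crashed_hosts.add,
    is_healthy=lambda host: host not in crashed_hosts,
)
```

With these assumed stage sizes the bad update stops at the first stage, so only 10 of the 1,000 simulated hosts are affected instead of the whole fleet.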
Why such a staggered deployment did not appear to occur in the CrowdStrike outage is the unanswered question I’m most curious about. But, to give you an idea of the sort of stuff that may happen in general, here is a selection of plausible scenarios, some of which have been known to occur in the wild in some shape or form:
- The update is considered low-risk (for instance, it’s a minor configuration change without any code change) and there’s a compelling reason to expedite the deployment, for instance if it addresses a zero-day vulnerability under active exploitation by adversaries.
- The update activates a feature that an important customer wants now, the customer phoned a VP to express such, and the VP then asks the engineers, arbitrarily loudly, to expedite the deployment.
- The staggered deployment did in fact occur, but the issue takes the form of what is colloquially called a time bomb, where it is only triggered later on by a change in the state of production environments, such as, typically, the passage of time. Time bomb issues are the nightmare of reliability engineers, and difficult to defend against. They are also, thankfully, fairly rare.
- A chain of events resulting in a misconfiguration where all the endpoints, instead of only those selected as canaries, pull the update.
- Reliability engineering not being up to industry standards.
Of course, not all of the above fit the currently known (or, really, believed-known) details of the CrowdStrike outage. It is, in fact, unlikely that the chain of events that resulted in the CrowdStrike outage will be found in a random comment on Reddit or Lemmy. But hopefully this sheds a small amount of light on your excellent question.
But how does this happen? Aren’t there programming teams that check their code or pass it to quality assurance staff to see if it bricks their own machines?
I mean - we’re all just people. Fuck ups happen because people check other people’s work - even with excellent systems in place shit will slip through… we just try to minimize how often that happens.
That is the risk of DevOps continuous integration/continuous delivery (CI/CD). Why break things one at a time when you can break them in the millions at once?
I fully expect to see CS increase their QA for the next year or 2 then slowly dwindle it back to pre fuckup levels when their share price recovers.
But how does this happen?
It’s destined to happen, according to Normal Accident Theory.
Aren’t there programming teams that check their code or pass it to quality assurance staff to see if it bricks their own machines?
Yes, there are probably a gigantic number of tests, reviews, validation processes, checkpoints, sign-offs, approvals, and release processes. The dizzying number of technical components and byzantine web of organizational processes was probably a major factor in how this came to pass.
Their solution will surely be to add more stage-gates, roles, teams, and processes.
As Tim Harford puts it at the end of this episode about “normal accidents”… “I’m not sure Galileo would agree.”
CrowdStrike already did this years ago, and the CEO was CTO at McAfee when they did this as well. Sounds like a culture problem if you ask me. I think the guy has an impulsivity issue.
Hell they had a similar incident on Debian servers THIS YEAR. Absolutely unacceptable a corp this half-assed has the type of control it does.
That’s why they need to see harsh consequences.
Sounds like he knows how to run a “lean” operation.
To be fully honest, it’s how I see companies when I apply to them. You’re supposed to “make things work” no matter the risks. If you complain too much about “ethics”, “morals” or other nonsense, you’re out.
Could a solution to this be any of the following:
- Windows performing automatic driver rollback in the event of it reaching an unbootable state
- Software vendors pushing out updates to a smaller pool of endpoints and monitoring for heartbeats for 30 mins or so before releasing to all endpoints
- A way for Windows to expose the relevant data to trusted software without the software needing to operate in kernel space
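For the first idea, here is roughly what I have in mind, as a toy Python sketch. To be clear, this is not how Windows works today and none of it is a real API: the `BootState` record, the three-failure threshold, and the driver names are all invented for illustration.

```python
from dataclasses import dataclass, field

MAX_FAILED_BOOTS = 3  # assumed threshold before a rollback is attempted


@dataclass
class BootState:
    failed_boots: int = 0
    # Third-party drivers in install order, newest last (names are made up).
    active_drivers: list = field(default_factory=list)
    rolled_back: list = field(default_factory=list)


def on_boot_attempt(state, boot_succeeded):
    """Update the boot record; disable the newest driver after repeated failures."""
    if boot_succeeded:
        state.failed_boots = 0
        return state
    state.failed_boots += 1
    if state.failed_boots >= MAX_FAILED_BOOTS and state.active_drivers:
        # Guess that the most recently changed driver is the culprit.
        state.rolled_back.append(state.active_drivers.pop())
        state.failed_boots = 0
    return state


# Usage: three failed boots in a row disable the newest driver.
state = BootState(active_drivers=["storage.sys", "network.sys", "edr-sensor.sys"])
for _ in range(3):
    on_boot_attempt(state, boot_succeeded=False)
```

After the loop, `state.rolled_back` holds `edr-sensor.sys` and the failure counter is back at zero. The hard parts are skipped entirely here: working out which component is actually at fault, and whether silently disabling a security driver is something you would even want.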
“Could a solution be any amount of basic sanity in software architecture” yes probably
My thought is hackers hacked and pushed a crash, CEO has to claim it was an untested update to save face.
That would be nice, but I suspect it is a lot simpler. Low or no QC pushed a fuct update out.
I did too but looks like comment section believes this was a genuine fuck up… But yeah prolly management under staffing and underpaying somebody.
Scary to think one software is a failure point of this magnitude…if any foreign state hackers didn’t already know the world’s failure point, they certainly do now
That’s my thinking also: the execs at these companies don’t really care. They are there to make the money. If they fuck up, they just lose their job and likely still get paid out too. So for them there is every incentive to take risks.