What really caused that AWS outage in December?
For the first time, AWS has confirmed that one of its AI systems did indeed delete and recreate one of its environments in December, shutting down part of that service for about 13 hours. What happened behind the scenes — including an aggressive AWS statement against the media outlet that initially reported the issue — is far more interesting.
For IT, this raises questions about how users need to interact with AI systems. AI tools and services have so effectively mastered natural language that people can forget there isn't a human involved. That can include approving an AI action without insisting on more details.
Consider a human driver inside a self-driving vehicle, such as a Tesla with “full self-driving.” Let’s say the car is driving down a highway at 65 MPH and going around a curve. But instead of following the curve, the vehicle drives straight, plunges through a guardrail — and the car and passenger drop a few hundred feet to their demise.
In theory, the human driver is in charge and can take back driving control at any point. But if the incident happens with no warning, the driver won’t likely have the half-second needed to resume control in time. Is that the vehicle AI’s fault or is the human to blame for not taking over?
You could make a legitimate argument that it was absolutely the human’s fault because they never should have trusted the self-driving tech in the first place.
That brings us to the current state of IT decisions and AI, which in turn brings us back to the December AWS disaster.
The story was broken by the Financial Times, which reported the 13-hour outage was caused by Kiro, an agentic coding system that decided to improve operations by deleting and then recreating a key environment.
AWS on Friday shot back to flag what it dubbed “inaccuracies” in the FT story. “The brief service interruption they reported on was the result of user error — specifically misconfigured access controls — not AI as the story claims,” AWS said.
To quote Obi-Wan Kenobi, “So, what I told you was true…from a certain point of view. Luke, you’re going to find that many of the truths we cling to depend greatly on our own point of view.” The more we look into the particulars of the December incident, the more user error doesn’t mean what the company is suggesting it means.
AWS continued: “The disruption was an extremely limited event last December affecting a single service (AWS Cost Explorer—which helps customers visualize, understand, and manage AWS costs and usage over time) in one of our 39 Geographic Regions around the world. It did not impact compute, storage, database, AI technologies, or any other of the hundreds of services that we run.”
That much seems true. It’s also a classic misdirection. The company conveniently forgot to confirm that the point of the story — that the system decided to delete and recreate an environment — was correct.
“The issue stemmed from a misconfigured role — the same issue that could occur with any developer tool (AI powered or not) or manual action.” That’s an impressively narrow interpretation of what happened.
AWS then promised it won't do it again. "We implemented numerous safeguards to prevent this from happening again — not because the event had a big impact (it didn't), but because we insist on learning from our operational experience to improve our security and resilience. Additional safeguards include mandatory peer review for production access. While operational incidents involving misconfigured access controls can occur with any developer tool — AI-powered or not — we think it is important to learn from these experiences. The Financial Times' claim that a second event impacted AWS is entirely false."
As for the AWS statement, the hyperscaler doth protest too much, methinks.
This is a critical issue for a few reasons. First, AWS is hardly the first AI firm shouting "user error" when its systems misbehave. Second, this is part of a disconcerting trend of AI systems overreaching or even flatly ignoring human instructions.
In an emailed comment, AWS added that, “Kiro puts developers in control — users need to configure which actions Kiro can take, and by default, Kiro requests authorization before taking any action. In this case, an engineer was using a role with broader permissions than expected — a user access control issue, not an AI autonomy issue. The issue stemmed from a misconfigured role — the same issue that could occur with any developer tool or manual action.”
In an interview, an AWS spokesperson argued that the user error was not the approved system request, but that the AWS engineer apparently misunderstood their own level of privilege. "The human was confused by what privileges that they had. They thought that they had narrower privileges than they actually had," the AWS spokesperson said.
This becomes relevant because most agentic systems, Kiro among them, have the same privileges as the human they're working with. The AWS argument is that the engineer might have been more careful, or acted differently, had he or she understood how much privilege the agent had been granted.
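The gap described here can be sketched in a few lines. This is a hypothetical illustration, not Kiro's or AWS's actual tooling: the function names, action names, and role contents are invented to show why an agent inheriting a broader-than-assumed role matters.

```python
# Hypothetical sketch: an agent inherits whatever role its operator holds,
# so a pre-flight check should compare the destructive actions in the
# agent's plan against what the role can actually do. All names invented.

DESTRUCTIVE_ACTIONS = {"DeleteEnvironment", "RecreateEnvironment"}

def preflight_check(role_permissions: set, planned_actions: set) -> list:
    """Return the destructive planned actions this role can actually perform."""
    return sorted(DESTRUCTIVE_ACTIONS & role_permissions & planned_actions)

# The engineer believed they held a narrow, read-only role...
assumed_role = {"DescribeEnvironment", "ListCosts"}
# ...but the actual role was broader.
actual_role = {"DescribeEnvironment", "ListCosts",
               "DeleteEnvironment", "RecreateEnvironment"}

plan = {"DescribeEnvironment", "DeleteEnvironment", "RecreateEnvironment"}

print(preflight_check(assumed_role, plan))  # [] -> looks harmless
print(preflight_check(actual_role, plan))   # ['DeleteEnvironment', 'RecreateEnvironment']
```

Checked against the assumed role, the plan looks harmless; checked against the real one, it flags exactly the destructive steps. That difference is the "misconfigured role" AWS describes.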
The key detail missing — which AWS would not clarify — is just what was asked and how the engineer replied. Had the engineer been asked by Kiro “I would like to delete and then recreate this environment. May I proceed?” and the engineer replied, “By all means. Please do so,” that would have been user error. But that seems highly unlikely.
The more likely scenario is that the system asked something along the lines of “Do you want me to clean up and make this environment more efficient and faster?” Did the engineer say “Sure” or did the engineer respond, “Please list every single change you are proposing along with the likely result and the worst-case scenario result. Once I review that list, I will be able to make a decision.”
That gets into a key IT issue: Do we need training on how to interact with AI? If an employee starts answering AI tools as if they were human, problems will materialize. AI systems seem smart, but they do not process data as humans do.
Recently, an AWS executive posted about a glitch involving an AI system that was trying to replicate registration forms. It looked at fields such as username and password and saw that the system only permitted one user to have that exact string of characters. The AI extrapolated from that and started rejecting users if they were the same age, with the notice “user with this age already exists.”
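The failure mode in that anecdote is easy to reconstruct. This is a hypothetical recreation, not the actual system described in the post: it shows what happens when a uniqueness rule learned from username fields is blindly generalized to every field.

```python
# Hypothetical reconstruction of the bug above: a uniqueness check
# learned from username/password fields, wrongly applied to all fields.

registered = [{"username": "alice", "age": 34}]

def over_generalized_validate(new_user: dict) -> list:
    """Rejects ANY field value already seen -- the rule minus its context."""
    errors = []
    for field, value in new_user.items():
        if any(user.get(field) == value for user in registered):
            errors.append(f"user with this {field} already exists")
    return errors

print(over_generalized_validate({"username": "bob", "age": 34}))
# ['user with this age already exists']
```

A correct validator would apply uniqueness only to identity fields; the AI copied the pattern without knowing why it existed.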
It’s like a civil service employee who memorized a rule but never asked the point of the rule. Without knowing that context, that employee can’t make a rational decision about when the rule should be waived.
As with the car that went over the cliff, the safest decision is not to use any autonomous AI system at all. But given that these systems seem all but unavoidable today, the second-best option is to insist that employees demand to know precisely what they are being asked to approve.
That may not eliminate AI disasters, but it will hopefully slow them down.