Yesterday, Amazon Web Services suffered a significant outage. Services of all kinds were affected, and no one was immune: banks, airlines, Roku, Fortnite, and more all saw impacts. In some cases, even smart plugs stopped working as expected. So what caused it? Amazon is attributing the outage to what appears to be a DNS issue, according to a running timeline at CNN, which lays out the sequence of events as they have unfolded so far.
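To see why a DNS problem ripples so far, consider what a dependent application experiences: the name simply stops resolving, so every connection attempt fails before it even starts. Here is a minimal Python illustration; the hostname is a placeholder, not the actual affected endpoint.

```python
import socket

SERVICE = "api.example-cloud-service.com"  # placeholder hostname

try:
    # Every HTTP client, SDK, and smart plug starts here: turning a name
    # into an address. If DNS is broken, nothing past this line happens.
    addrs = socket.getaddrinfo(SERVICE, 443)
    print(f"resolved {SERVICE} to {len(addrs)} address(es)")
except socket.gaierror as err:
    # This is roughly what dependent services see during a DNS outage:
    # resolution fails even though the servers behind the name may be fine.
    print(f"DNS resolution failed: {err}")
```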
Amazon is a significant part of keeping businesses moving around the world, as are the other popular large-scale cloud providers: Microsoft, Google, and Oracle. Enterprises run their operations on these platforms, renting services as needed to deliver their own products and services, which has made in-house server rooms and IT teams largely obsolete. The drawbacks are few, but those few can be very impactful, as we have just seen. You have to trust the cloud provider's infrastructure, and when something does go wrong, business owners are effectively hands-off, dependent on the provider to bring services back online.
We rarely hear of cloud-provider outages from the big four; these providers usually have full redundancy at every turn, and redundancy for the redundancy, so most minor issues never reach the end consumer. When an event like this one does occur, though, the effect is significant, often halting services people depend on for daily functioning, such as banking.
Are four major cloud providers enough to handle the world's business needs, or do outages like this one highlight the need for more? It would seem they do. Cyberattacks are growing larger and more common, and so far the downtime has usually been confined to a single provider; thankfully, we have yet to see multiple cloud providers go down at the same time. Even so, the AWS outage alone was enough to cripple both businesses and consumers' basic needs.
The model will continue: the cloud as we know it will keep growing, and companies will keep investing in it. But it still comes back to putting all of your eggs in one basket, as we saw with the AWS outage. If a company places all of its critical infrastructure with a single cloud provider in a single region, an issue in that region takes the entire system down. That is what companies saw with AWS, and the same logic applies to the other providers as well; this is not a single-provider problem.
So what are the solutions? Mostly things companies don't like to do: spending money, and maintaining hot-spare replicas of critical infrastructure. Keep those replicas in a different region, at a different cloud provider, or, if the capacity exists, in-house; then a simple DNS change could restore critical services, as sketched below. That approach often doubles the budget and the upkeep, so it becomes something companies usually do not do. There was a time, a decade or so ago, when we would never have considered outsourcing critical business functions to third parties; now it's commonplace.
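As an illustration of that failover path, here is a minimal sketch in Python, assuming a primary endpoint, a hot spare in another region or provider, and a DNS host with an update API. The hostnames and the `update_dns_record` helper are hypothetical placeholders; a real deployment would call the actual DNS provider's API and keep the record's TTL low so the change propagates quickly.

```python
import urllib.error
import urllib.request

# Hypothetical endpoints: a primary in one region/provider, a hot spare elsewhere.
PRIMARY = "primary.example.com"
SPARE = "spare.example.com"
SERVICE_RECORD = "app.example.com"  # the record clients actually resolve

def is_healthy(host: str, timeout: float = 5.0) -> bool:
    """Basic HTTP health check against a host's /health endpoint."""
    try:
        with urllib.request.urlopen(f"https://{host}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def update_dns_record(record: str, target: str) -> None:
    """Placeholder: point `record` at `target` via your DNS provider's API.

    The real call depends on the provider (a public cloud DNS service, an
    internal DNS server, etc.); this stub just reports the intended change.
    """
    print(f"DNS update: {record} -> {target}")

if __name__ == "__main__":
    # Run from cron or a monitoring loop: fail over only when the primary is
    # down and the spare is up. A production version would also alert
    # operators and handle failing back once the primary recovers.
    if not is_healthy(PRIMARY) and is_healthy(SPARE):
        update_dns_record(SERVICE_RECORD, SPARE)
```

The catch, of course, is everything behind that stub: the spare has to be kept provisioned, patched, and synchronized with production data, which is exactly the doubled budget and upkeep described above.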
So, who builds these needed cloud options? It's not for the faint of heart; building a true cloud with genuine redundancy and replication is enormously expensive and demands deep technical expertise. Perhaps housing critical infrastructure at a private provider as a hot spare is the more practical idea; in many cases, those options are more cost-effective than the public cloud.
What will we learn? If I had to guess, many companies are re-evaluating their disaster-recovery plans and any hosting arrangements that were impacted by the outage. The finance teams, of course, are scrambling to account for the lost revenue and the cost of the response. Most of the fallout will happen behind the scenes, not something we will ever hear about or report on.


