
Amazon service ‘recovering’ as Snapchat and banks among sites hit by outage
Liv McMahon, Technology Journalist, and
Lily Jamali, North America Technology Correspondent


Amazon Web Services (AWS) said late Monday that it has resolved a massive outage that left some of the world’s largest websites offline for the day.
More than 1,000 apps and websites – including social media platforms such as Snapchat and banks such as Lloyds and Halifax – were affected by problems that Amazon said originated at the heart of the cloud computing giant’s US operations.
Platform outage monitor Downdetector said the number of user reports of problems globally during the outage on Monday exceeded 11 million.
Even after Amazon fixed the original problem, experts said the outage demonstrated the dangers of many companies relying on a single, dominant provider.
“This episode highlights how interdependent our infrastructure is,” said Professor Alan Woodward of the University of Surrey.
“Many online services rely on third parties for their physical infrastructure, and this shows that even the largest of these third-party providers can have problems.
“Small mistakes, often made by humans, can have widespread and significant consequences.”
The problem appears to have started at around 07:00 BST on Monday, as users began reporting problems accessing multiple platforms.
These included a variety of sites and services, from massive online games like Fortnite to the language-learning app Duolingo.
Earlier in the day, Downdetector told the BBC it had seen more than four million reports from users on 500 sites in just a few hours – more than double the number it sees on an entire weekday.
The total later rose to more than 11 million, it said, as more services tried to recover, including Reddit and Lloyds Bank.
At around 23:00 BST, Amazon said all AWS services had “returned to normal operations.”
But the company had to throttle parts of its own system to fix the underlying problem.
According to Mike Chappell, an information technology professor at the University of Notre Dame, a new series of “cascading failures” may have occurred after the initial outage.
“It’s like when you have a massive power outage. Crews start working to try to get it back on,” Mr Chappell said. “The power can shine through at times,” he explained, adding that Amazon may initially have “only addressed the symptoms” and not the cause.
What went wrong
Amazon has yet to fully detail what caused Monday’s outage or issue an official statement.
It said in an update on its service status web page that the issue “appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1”.
DNS, which stands for Domain Name System, is often likened to a phone book for the internet.
It effectively translates website names that people use (such as bbc.co.uk) into numbers that computers can read and understand.
This process fundamentally underpins the way we use the internet, and disruptions to it can leave web browsers unable to find the content they’re looking for.
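The phone book comparison can be sketched in a few lines of Python. This is a toy illustration only, not how AWS or any real DNS server works: the `RECORDS` table, the `dynamodb.example` name and the `resolve` and `fetch` helpers are all made up for the example, and the addresses come from ranges reserved for documentation.

```python
from typing import Dict, Optional

# Hypothetical "phone book": names people type, mapped to numeric addresses.
# 192.0.2.x and 198.51.100.x are reserved documentation ranges, not real hosts.
RECORDS: Dict[str, str] = {
    "bbc.co.uk": "192.0.2.10",
    "dynamodb.example": "198.51.100.7",
}


def resolve(name: str, records: Dict[str, str]) -> Optional[str]:
    """Look up a name in the table; return None if there is no entry."""
    # Names are case-insensitive and may carry a trailing dot.
    return records.get(name.lower().rstrip("."))


def fetch(name: str, records: Dict[str, str]) -> str:
    """Pretend to connect to a service, which needs an address first."""
    address = resolve(name, records)
    if address is None:
        # The service itself may be running, but nobody can find it.
        return f"cannot find {name}"
    return f"connecting to {address}"


print(fetch("bbc.co.uk", RECORDS))      # lookup succeeds
print(fetch("dynamodb.example", {}))    # empty phone book: lookup fails
```

The second call shows the failure mode described above: with the entry missing, the caller has no address to connect to, whether or not the service behind it is healthy. In a real program the lookup would usually go through the operating system's resolver, for example via Python's standard `socket.getaddrinfo`.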
Cloudflare chief executive Matthew Prince told the BBC that the AWS outage highlighted how much of the internet runs on cloud services.
“Everybody has a bad day, Amazon had a bad day today,” he said.
“There are wonderful things about the cloud, it allows you to scale… but if you have an outage like this, it can bring down a lot of the services we depend on.”
And Corey Crider, head of the Future of Technology Institute, told the BBC it was “a bit like a bridge collapsing”.
“An essential part of the economy has been torn apart,” she said.
And with Amazon, Microsoft and Google together accounting for roughly 70% of cloud computing, she said the situation was “unsustainable”.
“Once you have supply concentrated in a handful of monopoly suppliers, when something like this falls through, it takes a large percentage of the economy,” she said.
“We should try to buy more local services instead of relying on a handful of American monopoly platforms.
“This is a threat to our security, our sovereignty and our economy, and we need to address structural decoupling to make our markets more resilient to these types of shocks.”
One computer science expert says some of the responsibility lies with companies using AWS.
“Companies using Amazon haven’t taken enough care to build protection systems into their applications,” says Ken Berman, a computer science professor at Cornell University in New York.
Outages like Monday’s happen frequently, though not always on this scale.
Berman tells the BBC that app developers should be careful to invest in backing up mission-critical applications that live in the cloud.
“We know how to make these systems robust, and we know how to do it safely,” Berman says.
The question of liability may come up in court.
More than a year after the massive CrowdStrike outage, Delta Air Lines is still battling the company to recover more than $500m in damages.
Even after CrowdStrike fixed the problem, the airline said it had to manually reset 40,000 servers, which delayed major flights for days.
Additional reporting by Esyllt Carr.

