Time for an explanation! Microsoft experienced a massive outage last week. On April 1, most of the company’s services went down due to a DNS issue with Azure.
The company has now released a detailed status update explaining what went wrong.
Apparently, there was an unusual surge in Azure DNS queries from all over the world, which the cloud platform is designed to absorb through layers of caches and traffic shaping. However, a specific sequence of events exposed a code defect in the DNS service that reduced its efficiency.
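Microsoft hasn’t shared the code in question, but to picture why cache efficiency matters here, consider a toy resolver-style cache in Python (a generic sketch, not anything Azure actually runs; the names and the 30-second TTL are made up for illustration):

```python
import time

class TtlCache:
    """Toy DNS-style cache: repeat queries are answered locally until the record's TTL expires."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.records = {}          # name -> (answer, expiry timestamp)
        self.backend_lookups = 0   # queries that actually reach the authoritative service

    def resolve(self, name):
        entry = self.records.get(name)
        if entry and entry[1] > time.monotonic():
            return entry[0]        # cache hit: the backend never sees this query
        # Cache miss: forward to the (pretend) authoritative service.
        self.backend_lookups += 1
        answer = f"192.0.2.{hash(name) % 256}"   # placeholder answer, not a real lookup
        self.records[name] = (answer, time.monotonic() + self.ttl)
        return answer

cache = TtlCache(ttl_seconds=30)
for _ in range(10_000):
    cache.resolve("example.com")
print(cache.backend_lookups)   # 1 -- a healthy cache shields the backend from the surge
```

With the cache doing its job, ten thousand identical queries cost the backend a single lookup; make that cache less efficient and the same surge lands directly on the service behind it.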
How, you ask?
The software titan has systems in place that normally drop the illegitimate DNS queries behind volumetric spikes like this. But because many of the queries in this surge were retries, they were treated as legitimate and passed straight through. That extra load landed on an already degraded service, and after some time the DNS services became unavailable.
But wait, there’s more!
Things only got worse from there: as clients hit errors, they retried their DNS lookups, and those retries piled even more traffic onto the struggling service. And this is how things went from bad to worse.
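This is the classic retry-storm pattern, and the usual way for clients to avoid feeding it is to back off between attempts. Here’s a minimal Python sketch (generic, not what Azure’s clients actually do; the delays and attempt count are arbitrary):

```python
import random
import socket
import time

def resolve_with_backoff(lookup, name, max_attempts=5, base_delay=0.5):
    """Retry a failing lookup, waiting longer (with random jitter) after each failure
    so thousands of clients don't all re-ask at the same instant."""
    for attempt in range(max_attempts):
        try:
            return lookup(name)
        except OSError:
            if attempt == max_attempts - 1:
                raise                      # give up after the last attempt
            # Exponential backoff: 0.5s, 1s, 2s, 4s... plus jitter to spread clients out.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))

ip = resolve_with_backoff(socket.gethostbyname, "example.com")
print(ip)
```

Immediate, unbounded retries do the opposite: every error turns into several more queries, which is exactly the pile-up described above.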
Redmond says the issues started at 9:21PM UTC, and that Azure itself had been fixed by 10:00PM. The company also lined up additional capacity in case further mitigation was needed. That said, the recovery time exceeded Microsoft’s own goal for resolving problems like this.
Since many Microsoft services depend on Azure, recovery times varied from service to service. But the company says that by 10:30PM, most services were back online.
Microsoft has further revealed that it has not only fixed the DNS code defect, but also updated the logic of its mitigation systems to protect against excessive retries. It will continue working to improve how it detects and mitigates volumetric spikes in traffic.
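Microsoft hasn’t published what that updated logic looks like, but a common way to shed a volumetric spike regardless of whether the traffic “looks” legitimate is a per-source token-bucket rate limit. A rough Python sketch (purely illustrative; the rate and burst numbers are invented):

```python
import time

class TokenBucket:
    """Per-source rate limiter: each query spends a token, and tokens refill at a steady
    rate, so a sudden flood of retries gets shed instead of overwhelming the service."""

    def __init__(self, rate_per_second=100, burst=200):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Top the bucket back up based on how much time has passed.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over budget: drop or defer the query, retry or not

limiter = TokenBucket()
accepted = sum(limiter.allow() for _ in range(10_000))
print(accepted)   # roughly the burst allowance -- the rest of the spike gets dropped
```

The point of a budget like this is that it applies to retries too, so a storm of legitimate-looking queries can’t drag the service down with it.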
But it certainly was a day to remember!