Multifactor authentication meltdown would be the proper headline. But Microsoft has actually managed to unearth three independent root causes, along with monitoring gaps, that led to the recent service outage.
That it did so while Office 365 users are besieged by login problems makes it just a tad ironic.
But that is the cloud for you: you have to take the good with the bad.
Long story short, Redmond has dug deep to identify what went wrong for 14 hours on November 19.
As detailed by Mary Jo Foley, the cloud giant has posted a root cause analysis of what went wrong with its authentication service last week, an outage that left countless Azure, Office 365, Dynamics and other Microsoft users unable to authenticate for much of the day.
The status history page has the details, but in short, Azure Active Directory Multifactor Authentication (MFA) services went down, taking Office 365 and Dynamics users with them and leaving these folks unable to authenticate.
It was a combination of multiple factors, pun always intended.
The first was a latency issue in the MFA frontend's communication with its cache services; the second was a race condition in processing responses from the MFA backend servers. These two issues were introduced in a code update that began rolling out in some datacenters on Tuesday and was completed on Friday.
That is to say, from November 13 to November 16.
A third identified root cause was triggered by the second, leaving the MFA backend unable to process any further requests from the frontend, even as everything appeared to be working fine according to the monitoring the company had set up.
As these issues piled up, customers in the Europe, Middle East and Africa (EMEA) and Asia-Pacific (APAC) regions were the first to be affected. Western European and American datacenters were hit soon after.
Microsoft has apologized to affected customers, but made no mention of any planned financial compensation.
It has, however, identified a number of intended next steps to improve the MFA service, including a review of its update deployment procedures, so as to prevent meltdowns like this in the future.