Microsoft has revealed it took five hours to acknowledge lengthy disruptions affecting European customers in late March because the task of informing customers relied on a US-based incident manager, who was asleep at the time.
The delays affected customers in Europe and the UK for three days beginning around 9am UTC on March 24. However, at the outset, as customers struggled with extra-sluggish Azure services, Microsoft missed its 10-minute target for acknowledging issues by a wide margin.
In a post mortem, Chad Kimes, director of engineering at Azure, admitted that Microsoft’s “communication during this incident was also problematic” and apologized for the frustration and confusion this caused to the 6,136 customers affected.
The technical issue itself was caused by virtual-machine capacity constraints due to a surge in demand for Azure compute resources during the COVID-19 coronavirus pandemic. This resulted in 21-minute delays affecting Microsoft’s Pipelines DevOps service for releasing new builds targeting Windows and Linux agents in Azure. The longest delay was nine hours, according to Kimes.
Microsoft says it is planning to improve its live-site processes to “ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types”.
The company is also rolling out architectural changes to mitigate bottlenecks in spinning up new agents from its hosted agent pool.