Microsoft today explained why customers who use Azure multi-factor authentication were locked out of their Azure and Office 365 accounts for about 14 hours last Monday.
As reported last week, Microsoft's Azure Active Directory (AD) Multi-Factor Authentication (MFA) service suffered a lengthy outage affecting organizations that mandate MFA. MFA is a key defense against phishing because an attacker also needs physical access to a user's second authenticator, such as a phone or an authentication dongle.
But instead of locking out attackers, a trio of bugs, botched mitigations, and blind spots in monitoring locked out users of Office 365, Azure, Dynamics, and virtually any other service that relies on Azure AD MFA.
The length and timing of the outage -- from 04:39 to 18:38 London time on Monday -- meant some organizations in Europe were unable to access Microsoft software for the entire working day. The lockout affected users in the UK Parliament, and Microsoft confirmed today that US Government customers were also affected.
“We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future,” Microsoft said in a statement on its status page.
Microsoft also apologized for the tardy updates to its Azure and Office 365 status pages during the incident, which angered users and admins seeking answers while they were locked out.
Microsoft’s engineers have now pinpointed three independent root causes for the incident, each of which created knock-on effects that turned what began as latency issues into a full outage.
Microsoft also admits it found “gaps in telemetry and monitoring” for the MFA services, which partly explains why the recovery took so long.
The first two root causes had been lurking in the MFA frontend servers since a mid-November code update rolled out to some data centers, but the bugs didn’t manifest until traffic exceeded a certain threshold. That threshold was first crossed early Monday morning in Microsoft’s Azure West Europe data centers. APAC users were also affected because the European data centers serve both APAC and EMEA traffic.
The first issue only caused latency between the MFA frontend servers and the caching services, which are meant to improve reliability and performance. However, once the threshold was passed, a race condition surfaced in frontend servers handling responses from the MFA backend servers.
This triggered additional latency and exposed a third pre-existing but undiscovered bug, which jammed up processing on the backend servers and prevented them from responding to further MFA requests from the frontend. As a result, Azure could not deliver sign-in messages and notifications to end users. All the while, everything looked normal to Microsoft engineers watching their monitoring systems.
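Microsoft hasn't shared the faulty code, but the class of bug it describes -- a race condition on shared state that only surfaces once many requests are in flight -- can be illustrated with a minimal sketch. All names here are hypothetical, not Microsoft's code; the point is that an unsynchronized read-modify-write on a shared counter loses updates under concurrency, while serializing it with a lock does not:

```python
import threading

# Hypothetical sketch: many frontend worker threads handle backend
# responses and update a shared counter. Without the lock, the
# "read counter, then write counter + 1" steps from different threads
# can interleave and silently lose updates once traffic is high enough.
pending = 0
lock = threading.Lock()

def handle_responses(n):
    global pending
    for _ in range(n):
        with lock:           # serializes the read-modify-write
            pending += 1

threads = [threading.Thread(target=handle_responses, args=(10_000,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(pending)  # 80000 with the lock; without it, updates can be lost
```

The "threshold" behavior Microsoft describes follows the same logic: with one or two threads the interleaving rarely bites, so the bug stays latent until load rises.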
Paradoxically, the buggy November code update was intended to improve how MFA services in Azure AD handle connections to caching services.
Microsoft also admits that its mitigations ended up spreading the problem from Europe to the US. During early recovery efforts, Microsoft diverted traffic from its European data centers to the East US data center, which engineers believed would relieve the latency and give them room to fix the West Europe data centers. Instead, the same problems surfaced on the East US MFA frontend servers.
During the second phase of the recovery effort, beginning at 07:50 UTC, Microsoft’s engineers rolled back the recent code updates, added more capacity, increased throttling limits, recycled cache and frontend servers, and applied a hotfix to the MFA frontend servers so they bypassed the cache.
This mitigated the latency problems, but US Government customers and customers in China were still reporting issues with MFA, so the search for additional root causes continued.
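Microsoft hasn't published the hotfix itself, but the cache-bypass step can be sketched as a standard fallback pattern: give the cache a strict time budget, and on timeout go straight to the authoritative backend rather than queueing behind the slow caching tier. The names (`lookup`, `fetch_from_backend`, `CACHE_TIMEOUT`, `SlowCache`) are illustrative assumptions, not Microsoft's implementation:

```python
import time

CACHE_TIMEOUT = 0.05  # illustrative: max time to wait on the cache, in seconds

def fetch_from_backend(key):
    # Illustrative stand-in for the authoritative (slower but working) backend.
    return f"value-for-{key}"

class SlowCache:
    """Simulates a caching tier that has become a latency bottleneck."""
    def get(self, key, timeout):
        time.sleep(0.1)      # the cache is slower than our budget allows...
        raise TimeoutError   # ...so the lookup times out

def lookup(key, cache):
    # Bypass pattern: try the cache within a strict budget; on timeout,
    # fall through to the backend instead of waiting on the cache.
    try:
        return cache.get(key, timeout=CACHE_TIMEOUT)
    except TimeoutError:
        return fetch_from_backend(key)

print(lookup("user42", SlowCache()))  # falls through to the backend
```

The trade-off is higher backend load in exchange for bounded latency, which matches Microsoft's decision to add capacity and raise throttling limits at the same time.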
Microsoft has kicked off a major review of its deployment, monitoring, and containment processes in order to “avoid propagating an issue to other datacenters”. The company will also update its process for communicating through the Service Health Dashboard so that publishing problems are detected during incidents. These reviews are expected to be complete by January 2019.
The Azure incident caps off a horrible two months for Microsoft. The company on November 13 re-released the Windows 10 October 2018 Update, over a month after pulling it because it deleted gigabytes of files on some users’ Windows PCs.