Microsoft has shed some light on the root cause behind yesterday's massive Azure authentication outage that affected multiple Microsoft services and blocked users from logging into their accounts.
Customers experienced authentication errors across many Microsoft services, including Microsoft 365, Microsoft Teams, Exchange Online, Forms, Xbox Live, Intune, Outlook.com, Office Web, SharePoint Online, OneDrive for Business, Yammer, and more.
After confirming that the service outage affected login and authentication flows across its online services, Microsoft said that the widespread outages resulted from an Azure Active Directory (Azure AD) configuration issue.
This issue prevented users from authenticating to Microsoft 365, Exchange Online, Microsoft Teams, or any other service relying on Azure AD.
"Between 19:00 UTC (approx) on March 15, 2021, and 09:25 UTC on March 16, 2021 customers may have encountered errors performing authentication operations for any Microsoft and third-party applications that depend on Azure Active Directory (Azure AD) for authentication," Microsoft explained today in a preliminary root cause analysis report.
Signing keys rotation failure leads to token validation issues
As Microsoft explained, the authentication and login issues behind yesterday's outage were caused by an error that affected the correct rotation of the signing keys used to support Azure AD's use of OpenID.
Signing keys are public and private cryptographic key pairs: the identity platform signs the security tokens it issues with the private key, and applications validate those tokens with the corresponding public key.
Microsoft's identity platform rotates signing keys on a periodic basis for security purposes, with apps being required to handle key rollover events so that authentication attempts don't fail.
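To illustrate what handling a key rollover can look like on the application side, here is a minimal Python sketch. It assumes a hypothetical `fetch_keys` callable that returns the provider's currently published public keys, indexed by key ID (`kid`). The class name and data shapes are illustrative and are not taken from any Microsoft library.

```python
from typing import Callable, Dict, Optional


class SigningKeyCache:
    """Caches public signing keys by key ID (kid) and refetches on a miss."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]]) -> None:
        # fetch_keys is a hypothetical callable returning {kid: public_key}.
        self._fetch_keys = fetch_keys
        self._keys: Dict[str, str] = {}

    def get(self, kid: str) -> Optional[str]:
        if kid not in self._keys:
            # Unknown kid: the provider may have rolled its keys, so refresh.
            self._keys = dict(self._fetch_keys())
        # Returns None when the kid is no longer published by the provider.
        return self._keys.get(kid)
```

With this pattern, a token signed by a newly introduced key triggers a refresh, while a token whose key has been dropped from the published set resolves to `None` and should be rejected. A production validator would also verify the token signature and limit how often refreshes can occur.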
"As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use," Microsoft said.
"Over the last few weeks, a particular key was marked as 'retain' for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that 'retain' state, leading it to remove that particular key."
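Microsoft has not published the automation's code, so the following Python sketch is purely illustrative. It shows the general shape of a time-based cleanup that honors a 'retain' marker, which is the check the bug reportedly skipped. The `SigningKey` fields, the `retain` flag, and the idle threshold are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class SigningKey:
    kid: str
    last_used: datetime
    retain: bool = False  # hypothetical flag mirroring the 'retain' state


def keys_to_remove(
    keys: List[SigningKey], now: datetime, max_idle: timedelta
) -> List[str]:
    """Return IDs of keys idle for longer than max_idle and not marked 'retain'."""
    return [
        key.kid
        for key in keys
        if not key.retain and now - key.last_used > max_idle
    ]
```

Dropping the `not key.retain` condition would make a retained but idle key eligible for removal, which matches the kind of failure Microsoft describes.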
After the signing key was removed, even though it was marked to be retained longer, apps using Azure AD authentication services immediately stopped trusting the tokens signed with the removed key.
This led to all user login attempts to affected apps and services being rejected and, as a result, users were no longer able to access their accounts.
Microsoft engineers rolled back the key metadata to the state before the worldwide service outage started to mitigate the issue.
However, the rollback didn't resolve the outage everywhere at once because of "server implementations that handle caching differently."
Users continued experiencing issues until the impacted apps managed to pick up the updated key metadata and refresh their caches.
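A rough way to see why recovery was staggered: if each server refreshes its cached key metadata on its own fixed interval, it only picks up the rolled-back metadata at its first refresh after the fix. The Python sketch below computes that moment. The function name, the fixed-interval model, and any times passed to it are assumptions for illustration, not details from Microsoft's report.

```python
from datetime import datetime, timedelta


def next_refresh(
    last_refresh: datetime, ttl: timedelta, fix_time: datetime
) -> datetime:
    """Return the first scheduled cache refresh at or after fix_time."""
    if fix_time <= last_refresh:
        # The cache was already refreshed after the fix was published.
        return last_refresh
    elapsed = fix_time - last_refresh
    # Ceiling division: whole TTL periods needed to reach fix_time.
    periods = -(-elapsed // ttl)
    return last_refresh + periods * ttl
```

Under this model, a server that refreshes hourly recovers within an hour of the fix, while one that refreshes daily could keep rejecting logins for up to a day unless its cache is cleared manually.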
While the outage impact was largely mitigated after rolling back the key changes, Microsoft is still working on restoring Intune and Microsoft Managed Desktop.
Azure AD backup authentication system still a work in progress
"We understand how incredibly impactful and unacceptable this is and apologize deeply," Microsoft said.
"We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future."
In September, Microsoft customers experienced another massive worldwide outage, with "transient" errors that knocked down Office 365 and related services, including Microsoft Teams, Office.com, Power Platform, and Dynamics 365.
As Microsoft explained at the time, that outage was caused by an Azure AD service update that mistakenly hit the production environment.
While Redmond started working on an Azure AD backup authentication system following the September outage, it didn't help in this case because it is only designed to cover token issuance issues, not token validation issues like the ones caused by the key rotation error.