- Google Cloud blames an API service for the widespread outage
- Most regions were back online within 40 minutes, but some took longer
- The company has promised to protect against future outages and improve communication
After the recent Google Cloud outage, which took sites such as Spotify, Cloudflare and Discord offline, the company has released a detailed report explaining exactly why it failed customers.
The company says the root cause was a code issue in Service Control, part of the company’s API management and control plane system.
Specifically, an invalid automated quota update and a lack of adequate error handling triggered a global crash loop, with 503 errors seen not only in Google Cloud services but also in services that use its APIs.
Google Cloud outage caused by API issue
The outage affected Google Cloud infrastructure as well as popular Google Workspace apps such as Drive, Docs, Gmail and Calendar. Third-party sites that access Google Cloud APIs were also affected, including the popular music streaming platform Spotify, which has 678 million users, and some Cloudflare services.
“On May 29, 2025, a new feature was added to Service Control for additional quota policy checks,” the company wrote in its incident report. “The issue with this change was that it did not have appropriate error handling nor was it feature flag protected.”
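To illustrate the two safeguards the report says were missing, here is a minimal Python sketch. It is not Google's code: the `QuotaPolicy` record, the function names and the choice to allow requests when a policy is malformed are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; `limit` may be missing after a bad update.
    name: str
    limit: Optional[int]


def check_quota_unguarded(policy: QuotaPolicy, usage: int) -> bool:
    # Raises TypeError when `limit` is None: the unhandled-error case.
    return usage <= policy.limit


def check_quota_guarded(policy: QuotaPolicy, usage: int, flag_enabled: bool) -> bool:
    # Flag off: skip the new check entirely, so requests are allowed as before.
    if not flag_enabled:
        return True
    try:
        return check_quota_unguarded(policy, usage)
    except TypeError:
        # Malformed policy: allow the request instead of crashing the caller.
        # (Failing open is an assumption of this sketch.)
        return True
```

With the flag off, the new check never runs, so it can be enabled gradually and switched off quickly. With the flag on, a policy that has a blank limit is tolerated rather than raising an exception in the request path.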
Google Cloud boasted that its Site Reliability Engineering team had begun to triage the incident within two minutes and had identified the root cause within 10 minutes. “The red-button [to disable the serving path] was ready to roll out ~25 minutes from the start of the incident,” said Google, with the full rollout completed within 40 minutes.
Although smaller regions recovered relatively quickly, larger regions such as US-Central-1 took longer to come back online, about two hours and 40 minutes in the case of this particular region.
In the mini incident report it issued on the day of the outage, Google Cloud promised to “do better.” Its more detailed report promises the usual fixes going forward, such as improving static analysis and testing practices, and auditing and modularizing Service Control to contain future incidents. The company has also committed to “improve [its] external communications” to better inform customers, ensuring that its communications infrastructure remains online even during such outages in the future.