Google Chat Incident Report Shows How Outages Can Happen

A Google incident report labeled “Confidential – Not for publication” about a Google Chat outage was apparently leaked. The document provides a rare glimpse into how Google’s backend can fail. While this is not connected to Google’s recent indexing failures, it does provide a view of the complexity of Google’s systems and the kinds of things that can go wrong.

Background of Google Chat Outage

About two weeks before Google’s indexing issues, there was a backend issue with Google Chat. An update was rolled out that included what they called a “post-processor” that was supposed to kick in after a specific preprocessor.

The engineering team was apparently unaware of a pre-existing error, which triggered a major outage once an update was rolled out on September 17, 2020.

It is implied that the error was undetected but the incident report never states that explicitly.

The September update included a post-processor that was looking for an output from a preprocessor. But because that output didn’t exist, another error happened, triggering the outage.

Here is how Google’s incident report describes it:

“The Google Chat backends utilize a number of pre-processing functions prior to processing an incoming request. These pre-processors perform a number of calls to different services (such as Google’s internal Identity service) and store these results in a local cache.

One of these preprocessors had been encountering an access error due to an incorrectly configured backend request, which prevented it from successfully completing.

This error initially did not cause any further issues.”
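To make the quoted description concrete, here is a minimal Python sketch of a pre-processing step that fails without stopping the request. The incident report does not say how the error was handled, so the error-swallowing loop, the `AccessError` class and every function name here are hypothetical illustrations, not Google's code.

```python
class AccessError(Exception):
    """Hypothetical stand-in for the access error described in the report."""


def identity_preprocessor(request, cache):
    # Hypothetical pre-processor: its misconfigured backend call always fails,
    # so it never reaches the line that would store a result in the cache.
    raise AccessError("backend request incorrectly configured")
    cache["identity"] = "resolved-user"  # unreachable while the call fails


def run_preprocessors(request, preprocessors):
    """Run each pre-processor, recording failures instead of aborting (assumed behavior)."""
    cache = {}
    errors = []
    for pre in preprocessors:
        try:
            pre(request, cache)
        except AccessError as exc:
            # The failure is noted, but the request carries on regardless.
            errors.append(str(exc))
    return cache, errors


cache, errors = run_preprocessors({"user": "alice"}, [identity_preprocessor])
print(cache)   # the cache stays empty
print(errors)  # one recorded access error
```

As long as nothing reads the missing cache entry, a failure like this can sit unnoticed, which matches the report's remark that the error initially caused no further issues.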

Once the post-processor was introduced in the September 17 update, the pre-existing error in the preprocessor caused the post-processor to fail, resulting in what Google termed a “deadlock,” which in turn produced application errors, i.e. the Google Chat outage.

Google was forced to roll back the update, then release a new update to compensate for the (apparently) previously undetected error.

Google’s description of the root cause of the Google Chat outage:

“On September 17th, a new release of the Google Chat backend was deployed. This release included a change that required a post-processor to have access to the results of the failed preprocessor above. However, as this preprocessor aborted it’s processing due to the access error, the cache was never populated.

Initially, this post-processor attempted to retrieve the required value, but because the cache did not contain the value required, this spawned a new thread that attempted to retrieve the value, but had a dependency on the post-processor that was holding a lock. This created a deadlock condition that was unable to be completed.

This deadlock caused the backend binary tasks to experience high thread lock contention, which ultimately led to application errors.”
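The deadlock pattern Google describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Chat backend: the lock, the cache key and both function names are hypothetical, and timeouts are added only so the demonstration terminates instead of hanging.

```python
import threading

cache = {}                     # never populated, as in the report
cache_lock = threading.Lock()  # hypothetical lock held by the post-processor


def fetch_missing_value(result):
    # Helper thread: it needs the same lock the post-processor is holding.
    # The timeout exists only so this demo ends; without it, the thread
    # would block here indefinitely.
    acquired = cache_lock.acquire(timeout=0.2)
    result["helper_got_lock"] = acquired
    if acquired:
        cache["identity"] = "value"
        cache_lock.release()


def post_processor():
    result = {}
    with cache_lock:
        if "identity" not in cache:
            helper = threading.Thread(target=fetch_missing_value, args=(result,))
            helper.start()
            # The post-processor waits for the helper while still holding
            # the lock, so each side is waiting on the other.
            helper.join(timeout=2.0)
        result["value_retrieved"] = "identity" in cache
    return result


outcome = post_processor()
print(outcome)
```

In this sketch the helper gives up after its timeout and the value is never retrieved. Remove both timeouts and the two threads wait on each other forever, which is the condition the report calls a deadlock.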

Google’s Lesson Learned

Google’s incident report noted that their response to the incident was to improve detection of this specific issue, increase capacity to the backend and improve the pre-release testing for this specific kind of problem so that it does not happen again.

Google’s conclusion:

“To prevent the recurrence of this issue and reduce the impact of similar events, the following actions are being taken:

  • Adjusting the automated alerting system to improve the detection of lock contention issues.
  • Increasing the number of threads available to Google Chat backend services in order to reduce the potential impact of lock contention events.
  • Defining new testing which triggers this particular code path and identify this issue before reaching production.”
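The third action item, testing that exercises the failing code path, could look something like the following Python sketch. It assumes a simple deadline-based check, and the helper and both handlers are hypothetical names. The report does not describe how Google's tests are actually built.

```python
import threading


def finishes_within(func, seconds):
    """Run func in a daemon thread and report whether it returned before the deadline."""
    worker = threading.Thread(target=func, daemon=True)
    worker.start()
    worker.join(timeout=seconds)
    return not worker.is_alive()


def healthy_handler():
    # Hypothetical request path that copes with an empty cache without blocking.
    cache = {}
    return cache.get("identity", "fallback")


def stuck_handler():
    # Hypothetical request path that waits on a signal nobody ever sends.
    # The wait is bounded only so the demo thread eventually exits.
    never_set = threading.Event()
    never_set.wait(timeout=2.0)


print(finishes_within(healthy_handler, 1.0))
print(finishes_within(stuck_handler, 0.2))
```

A check of this kind, run before release against a request whose cache entry is deliberately missing, would flag a handler that hangs instead of returning.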

Three Insights from Google’s Outage

  1. An error was introduced into the live Google Chat backend and apparently went undetected until a subsequent update tripped over it.
  2. Pre-release update testing apparently did not identify that the undetected error existed or would cause application errors.
  3. The undetected error was only discovered after the update was pushed to the live environment, producing a coding conflict that led to an outage.

We like to think of Google as a monolithic company that creates amazing experiences on the web seemingly with a wave of the hand. But this incident shows how a small bug can be introduced into any of Google’s services and turn into an outage.

Google doesn’t offer detailed incident reports related to outages in the search index.

That said, Google’s Gary Illyes did offer some candid comments about a Google search outage in April 2019 that was due to human error. And in August 2020 he described how complex Google’s Caffeine is, shortly after another worldwide search index outage this summer.

The Google Chat incident report shows how something seemingly minor and almost inconsequential could cascade into a major outage, and one can only imagine that similar issues have been bedeviling Google’s search index for the past year.

Citation

Google Cloud Issue Summary (PDF)