The Thundering Herd Problem in System Design

Introduction
Imagine you're standing outside an Apple Store, waiting for the new iPhone to launch. Obviously, you're not alone; you're one of a big crowd waiting for that door to open. As soon as it does, everyone rushes in at once, afraid of being left behind. Before you know it, you're caught up in the chaos, people pushing each other around and overloading the store.
This is exactly the "Thundering Herd Problem" in system design: many clients make requests at the same time and overwhelm a shared resource (a server, a database, etc.).
What is the Thundering Herd Problem?
We all try to make our systems robust and fast, adopting system design patterns like scaling and caching. However, certain weak spots in a system can still lead to the "Thundering Herd Problem," and even large, well-built systems suffer from it.
The Thundering Herd Problem simply refers to a large number of clients requesting a resource simultaneously, often reacting to the same event at the same time. It is not just high volume; it is simultaneous high volume.
Despite how the name sounds, it is important to understand that the Thundering Herd Problem is not a bug. It's a pattern that naturally occurs in systems with a large client base, and we need to understand why it occurs and how to handle it.
Where Does it Commonly Occur?
This problem can occur in multiple parts of a system, such as caching layers and load balancers, for different reasons:
Cache Stampede: When a highly accessed cached item expires, multiple concurrent requests bombard the database to refill the cache.
Synchronized Retries: If a service fails to respond for a short period of time and then comes back online, many clients simultaneously retry failed requests, creating a spike in traffic.
Cron Jobs: Multiple servers or microservices are configured to run a cron job at the exact same scheduled time, hitting shared resources in unison.
Sudden Traffic Spikes: A sudden influx of users all attempting to perform the same action at the same time, such as clicking a "Buy Now" button during a flash sale.
Real World Example
Consider a social media platform that caches popular posts to speed up responses. Assume the cache TTL is 60 seconds.
While the cache is warm, requests hit our server, the server fetches the post content from the cache, and returns it in the response. There is no database call, so request-response cycles stay short.
However, once the 60 seconds pass and the cached entry expires, every request that arrives in that moment finds an empty cache and goes straight to the database at the same time, resulting in the thundering herd problem.
This can overload the database, increase latency, and even lead to request failures.
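To make this concrete, here is a minimal sketch of the naive read-through pattern described above, assuming a simple in-memory cache, a hypothetical get_post_from_db helper, and the same 60-second TTL. Every request that sees a miss independently queries the database, which is exactly what floods it when a popular entry expires under heavy traffic.

```python
import time

CACHE = {}            # illustrative in-memory cache: post_id -> (value, expires_at)
TTL_SECONDS = 60      # same 60-second TTL as in the example above

def get_post_from_db(post_id):
    # placeholder for the real database query
    return {"id": post_id, "body": "..."}

def get_post(post_id):
    entry = CACHE.get(post_id)
    if entry and entry[1] > time.time():
        return entry[0]                      # cache hit: no database call
    # cache miss: EVERY concurrent request that reaches this line
    # hits the database at the same time -- the thundering herd
    value = get_post_from_db(post_id)
    CACHE[post_id] = (value, time.time() + TTL_SECONDS)
    return value
```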
Difference Between a Normal Spike and a Thundering Herd
In normal situations, traffic spikes build up gradually. Imagine Hotstar starting an IPL live stream: viewers keep joining over time, and our auto-scaling configuration can add servers to handle the increasing load.
On the other hand, in a thundering herd situation, all the traffic comes in at once. It is not gradual, it is instantaneous and unpredictable. This results in more load on the servers and databases than the system can handle.
In such situations, we often enter into a chain reaction:
More requests → DB connections get exhausted → requests start waiting → latency increases → timeouts happen → clients retry → even more load. 💥
Why it Becomes Dangerous in Distributed Systems
Now that you have understood the thundering herd problem, it is important to understand its impact in the context of distributed systems.
Distributed systems amplify this problem even further:
There are many services and instances involved. If every incoming request fans out to 5 internal services, a spike of 100k requests effectively becomes 500k internal requests.
If one service slows down, other services depending on it also slow down, leading to cascading failures across the system.
Each service might implement its own retry logic, resulting in multiple retries across multiple services, which further multiplies the load.
Many services rely on shared resources like databases, caches, or message queues. When all services hit these resources simultaneously, they become bottlenecks and can fail under pressure.
Failures in distributed systems propagate and amplify, making the thundering herd problem significantly more dangerous.
System-Level Impact
CPU: A sudden surge in requests increases the number of active threads and processes, causing CPU usage to spike. This can lead to excessive context switching and reduced overall efficiency.
Database: A large number of concurrent queries can exhaust the connection pool, slow down query execution, and potentially crash the database under heavy load.
Cache: During a cache miss storm, the cache becomes temporarily ineffective, forcing repeated requests to hit the database and increasing pressure on backend systems.
Latency: As requests start queuing up, response times increase significantly. This leads to timeouts, failed requests, and a poor user experience.
The impact is not isolated; it affects every layer of the system.
Techniques to Minimize It
Request Coalescing
Instead of allowing all requests to go to the database to refill the cache, we ensure that only one request performs this operation. When multiple requests arrive for the same data, one request fetches the data from the database and updates the cache, while the others wait and reuse the same result.
This prevents duplicate work and reduces unnecessary load on the database.
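A rough sketch of request coalescing (sometimes called "single flight") using only Python's standard library; the _in_flight map and the fetch_from_db callback are illustrative names, not a specific library API. The first caller for a key does the database work, and concurrent callers for the same key wait on the same future and reuse its result.

```python
import threading
from concurrent.futures import Future

_lock = threading.Lock()
_in_flight = {}   # key -> Future shared by all concurrent callers

def coalesced_get(key, fetch_from_db):
    with _lock:
        fut = _in_flight.get(key)
        if fut is None:
            fut = Future()           # first caller: it will do the real work
            _in_flight[key] = fut
            owner = True
        else:
            owner = False

    if not owner:
        return fut.result()          # followers just wait for the leader's result

    try:
        value = fetch_from_db(key)   # only ONE database call per key
        fut.set_result(value)
        return value
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        with _lock:
            _in_flight.pop(key, None)
```

The same idea exists as ready-made "singleflight" helpers in several ecosystems; the sketch above only shows the mechanism.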
Cache Locking / Mutex
Cache locking (or a mutex) is a technique that ensures only one request can regenerate a cache entry at a time. When a request detects a cache miss, it acquires a lock and fetches the data from the database. Other requests wait until the lock is released and then read the freshly populated cache instead of querying the database themselves.
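One common way to do this across processes is a short-lived lock key in the cache itself. The sketch below assumes a Redis client (redis-py) and a hypothetical rebuild_value helper; only the request that wins the NX lock rebuilds the entry, while the others briefly poll the cache.

```python
import time
import redis  # assumes the redis-py client is available

r = redis.Redis()

def get_with_lock(key, rebuild_value, ttl=60):
    value = r.get(key)
    if value is not None:
        return value                               # normal cache hit

    lock_key = f"lock:{key}"
    # SET NX: only one request acquires the lock; it auto-expires after 10s
    if r.set(lock_key, "1", nx=True, ex=10):
        try:
            value = rebuild_value(key)             # single trip to the database
            r.set(key, value, ex=ttl)
            return value
        finally:
            r.delete(lock_key)

    # everyone else waits briefly and re-reads the cache instead of hitting the DB
    for _ in range(50):
        time.sleep(0.1)
        value = r.get(key)
        if value is not None:
            return value
    return rebuild_value(key)                      # fallback if the lock holder failed
```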
Staggered Expiry
Staggered expiry is another technique that ensures cached entries do not all expire at the same moment. To achieve this, we add a small random variation (jitter) to their expiry times.
This spreads cache refills out over time and reduces the chance of a sudden spike in load; a sketch follows below.
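Here is a tiny sketch of staggered expiry; the 60-second base TTL and the ±10% jitter range are illustrative choices, and the cache.set call in the comment is a hypothetical cache API.

```python
import random

BASE_TTL = 60  # seconds, as in the earlier example

def ttl_with_jitter(base_ttl=BASE_TTL, jitter_ratio=0.10):
    # add up to +/-10% random variation so entries written together
    # do not all expire together
    jitter = base_ttl * jitter_ratio
    return base_ttl + random.uniform(-jitter, jitter)

# e.g. cache.set(key, value, ttl=ttl_with_jitter())  -> somewhere between 54s and 66s
```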
Exponential Backoff
When a request fails, you don’t just try again right away. With exponential backoff, you wait a bit, and each time you fail again, you wait even longer. The delay grows fast, so clients don’t all hammer the server at once.
It spreads out the load and keeps the system from getting overloaded.
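A minimal retry loop with exponential backoff as a sketch; the base delay, cap, and attempt count are arbitrary example values, and random jitter is added so clients that failed together do not all retry together.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30):
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                      # out of retries: surface the error
            # the backoff ceiling doubles each attempt: 0.5s, 1s, 2s, 4s ... capped at max_delay
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))         # full jitter spreads retries out
```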
Rate Limiting
Rate limiting sets a cap on how many times a client can hit the system in a given period. If you go over that number, your requests get blocked or slowed down. This keeps the service from getting swamped and makes sure everyone gets a fair shot at the resources.
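One common way to enforce such a cap is a token bucket. The sketch below is a simple in-process version; the rate and burst capacity are arbitrary example numbers, and a production setup would usually enforce this at an API gateway or with a shared store instead.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill tokens based on elapsed time, without exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                    # over the limit: reject or delay the request

# e.g. allow roughly 100 requests/second per client, with bursts of up to 20
limiter = TokenBucket(rate_per_sec=100, capacity=20)
```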
Conclusion
The thundering herd problem is not caused by high traffic alone, but by many requests happening at the same time due to a shared trigger. While it can lead to serious issues in distributed systems, it can be effectively managed using the right techniques such as request coalescing, cache locking, staggered expiry, backoff strategies, and rate limiting.
By understanding the root cause and applying these strategies, we can build systems that remain stable even under sudden spikes in load.



