Cache Strategies in Distributed Systems


Introduction

In large-scale distributed systems, performance is not optional; it is a necessity. Imagine if platforms like YouTube or Twitter slowed down or went completely offline for a few hours. In today’s world, we rely heavily on such systems, and they are expected to respond with low latency at all times.

A common approach to reducing latency is caching. A cache is simply a high-speed, temporary, and often in-memory data storage layer that stores a subset of frequently accessed data. It allows us to serve future requests faster, without having to hit the primary storage, such as the database.

This not only reduces latency but also decreases the load on the database, enabling the system to handle massive traffic efficiently.

Typically, every cached item has a TTL (Time-To-Live) associated with it. Once the TTL expires (for example, after 60 seconds), the cached entry is invalidated, and fresh data is fetched from the database on the next request. This approach works well for small systems, but the naive model begins to break down at scale.

In distributed systems, cache expiration can lead to sudden traffic spikes, resulting in issues like the thundering herd problem. (You can read more about this in my blog post on the thundering herd problem.)


Basic TTL Caching is Not Enough

A basic TTL-based caching approach looks like this:

Data is cached -> requests are served from the cache -> TTL expires -> a new request comes in, triggering a fetch from the database -> and the cycle continues.

This works well when traffic is evenly distributed and requests arrive at random intervals. In smaller systems, this approach is often sufficient.
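As a minimal sketch, this cache-aside flow might look like the following in Python, assuming a Redis cache and a placeholder `fetch_post_from_db` helper standing in for the real database query:

```python
import json
import redis

r = redis.Redis()
TTL_SECONDS = 60

def fetch_post_from_db(post_id: str) -> dict:
    # Placeholder for the real database query
    return {"id": post_id, "body": "..."}

def get_post(post_id: str) -> dict:
    key = f"post:{post_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database call
    post = fetch_post_from_db(post_id)  # cache miss: hit the database
    r.set(key, json.dumps(post), ex=TTL_SECONDS)  # repopulate with a fresh TTL
    return post
```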

However, in large-scale systems, traffic is rarely uniform. It is often bursty, with many users requesting the same data at the same time. When the cache for such frequently accessed data expires, a large number of requests suddenly start missing the cache and go directly to the database, all at once.

This leads to sudden load spikes, increased latency, and potential system failures.


Cache Expiry can Cause Traffic Spikes

Consider the example of a social media platform that caches popular posts to speed up responses. Assume the cache TTL is 60 seconds.

While the cache is warm, requests hit our server, the server fetches the post content from the cache, and returns it in the response. There is no database call, so the request-response cycle stays short.

However, once the 60 seconds elapse and the cache expires, a burst of simultaneous requests will all miss the cache and go straight to the database at the same time, resulting in the thundering herd problem. This can overload the database, increase latency, and even lead to request failures.

Clearly, we need smarter strategies to handle cache expiration in distributed systems.


Cache Management Strategies for Distributed Systems

TTL Jitter

Up until now, we’ve seen that the core issue is many cached items expiring at the same time, which sends a large number of requests to the database simultaneously. The problem isn’t traffic itself; large-scale systems are expected to handle high traffic. The real issue is synchronized traffic.

TTL Jitter is a simple yet effective strategy that introduces randomness into cache expiration times. Instead of assigning a fixed TTL (for example, 60 seconds), we add a small variation and set the TTL within a range, such as 50–70 seconds.

ttl = baseTTL + random(-delta, +delta)

As a result, cache entries expire at different times, and incoming requests get naturally distributed instead of creating sudden spikes.

  • Without jitter: 10k keys expire at exactly 60s -> spike

  • With jitter: keys expire at 58s, 61s, 64s -> smoother load distribution
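A minimal sketch of the jitter calculation, with arbitrary values for the base TTL and the delta:

```python
import random

def jittered_ttl(base_ttl: int = 60, delta: int = 10) -> int:
    # Pick a TTL uniformly in [base_ttl - delta, base_ttl + delta],
    # e.g. 50-70 seconds for a 60-second base TTL
    return base_ttl + random.randint(-delta, delta)

# Keys cached in the same batch now expire at slightly different
# times, e.g. r.set(key, value, ex=jittered_ttl())
```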

Probability-Based Early Expiration

The limitation of TTL jitter is that cache entries still expire eventually. Requests will still hit the database; we’ve only found a way to distribute them more evenly.

To handle this more effectively, we can take a proactive approach using probabilistic early expiration.

Instead of waiting for the TTL to reach zero, we allow some requests to refresh the cache before it actually expires. However, we don’t want every request to do this, as that would create unnecessary load. So, we introduce probability into the decision.

Each incoming request has a certain chance of recomputing the cached value early. This probability is not fixed; it increases as the cache entry gets closer to expiration.

  • When the TTL has just been set -> very low probability of recomputation

  • As the TTL decreases -> probability gradually increases

  • Near expiration -> high probability of recomputation

Thus, instead of reacting at the deadline like basic TTL-based caching, this method prepares the system gradually.
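One well-known formulation of this idea is the XFetch algorithm from the cache-stampede literature; the sketch below follows that shape, with `compute_time` (how long a recomputation takes) and `beta` (higher values refresh earlier) as tunable assumptions:

```python
import math
import random
import time

def should_refresh_early(expiry_ts: float, compute_time: float, beta: float = 1.0) -> bool:
    # The random draw is usually tiny but grows without bound on rare
    # draws, so the refresh probability rises sharply near expiry.
    # 1 - random() lies in (0, 1], so the log is <= 0 and never raises.
    draw = -compute_time * beta * math.log(1.0 - random.random())
    return time.time() + draw >= expiry_ts

# Example: an entry that expires in 60s and takes ~0.2s to recompute
expiry_ts = time.time() + 60
if should_refresh_early(expiry_ts, compute_time=0.2):
    pass  # recompute the value and re-cache it here
```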

Mutex / Cache Locking

Even with strategies like TTL jitter and probabilistic early expiration, there are still situations where multiple requests may try to recompute the same cache entry at the same time. This leads to duplicate work, where several requests hit the database simultaneously for the same data, increasing load and reducing efficiency.

To solve this, we can use mutex (mutual exclusion) or cache locking. When a cache entry expires and multiple requests try to fetch fresh data, only one request is allowed to recompute the value, while the others wait.

The simple flow is:

  • A request checks the cache

  • If the cache is missing or expired -> it tries to acquire a lock

  • If it gets the lock -> it fetches data from the database and updates the cache

  • If it doesn’t get the lock -> it waits until the cache is updated

The tradeoff is that this introduces additional latency for waiting requests.
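A common way to implement this across servers is a Redis `SET ... NX` lock; the sketch below assumes that setup, with a placeholder `fetch_from_db` standing in for the real query:

```python
import time
import uuid
import redis

r = redis.Redis()

def fetch_from_db(key: str) -> bytes:
    return b"fresh value"  # placeholder for the real database query

def get_with_lock(key: str, ttl: int = 60, lock_ttl: int = 10) -> bytes:
    cached = r.get(key)
    if cached is not None:
        return cached
    # SET with nx=True acts as a simple distributed lock: only one
    # request acquires it; the expiry guards against a crashed holder
    if r.set(f"lock:{key}", str(uuid.uuid4()), nx=True, ex=lock_ttl):
        try:
            value = fetch_from_db(key)
            r.set(key, value, ex=ttl)
            return value
        finally:
            # A production lock should verify its token before deleting,
            # so it never releases a lock acquired by someone else
            r.delete(f"lock:{key}")
    # Requests that lost the race poll until the cache is repopulated
    for _ in range(50):
        time.sleep(0.1)
        cached = r.get(key)
        if cached is not None:
            return cached
    raise TimeoutError(f"cache for {key} was not refreshed in time")
```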

Stale-While-Revalidate (SWR)

Even with techniques like mutex locking, some requests may still have to wait while the cache is being recomputed. This introduces additional latency, which can negatively impact user experience.

The idea behind SWR is: instead of blocking requests when the cache expires, we serve the stale (expired) data and refresh the cache in the background.

In this approach, a cached item can exist in two states:

  1. Fresh -> within TTL, safe to serve

  2. Stale -> TTL expired, but still usable for a short duration

Because of this, users never have to wait for recomputation. The system continues to serve responses with low latency, while the cache gets updated asynchronously.

However, the tradeoff is that users may receive stale data for a short period of time. Because of this, SWR is not suitable for systems that require strict consistency, such as financial transactions or real-time critical data.
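A minimal in-process sketch of SWR, keeping a fresh window and a stale window per entry (a real system would typically hold this state in Redis and refresh via a worker, but the logic is the same):

```python
import threading
import time

_cache = {}          # key -> (value, fresh_until, stale_until)
_refreshing = set()  # keys with a background refresh in flight
_lock = threading.Lock()

def fetch_from_db(key: str) -> str:
    return f"value for {key}"  # placeholder for the real database query

def get_swr(key: str, ttl: int = 60, stale_ttl: int = 30):
    now = time.time()
    entry = _cache.get(key)
    if entry:
        value, fresh_until, stale_until = entry
        if now < fresh_until:
            return value  # fresh: serve directly
        if now < stale_until:
            _refresh_in_background(key, ttl, stale_ttl)
            return value  # stale: serve immediately, refresh async
    # No usable entry: this caller must wait for a synchronous fetch
    return _store(key, fetch_from_db(key), ttl, stale_ttl)

def _refresh_in_background(key, ttl, stale_ttl):
    with _lock:
        if key in _refreshing:
            return  # a refresh for this key is already running
        _refreshing.add(key)
    def work():
        try:
            _store(key, fetch_from_db(key), ttl, stale_ttl)
        finally:
            with _lock:
                _refreshing.discard(key)
    threading.Thread(target=work, daemon=True).start()

def _store(key, value, ttl, stale_ttl):
    now = time.time()
    _cache[key] = (value, now + ttl, now + ttl + stale_ttl)
    return value
```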

Cache Warming

So far, all the strategies we’ve discussed deal with cache behavior during runtime. But what if we could prepare the cache before requests even start coming in?

This is where cache warming (or pre-warming) comes in.

The idea is simple: instead of waiting for users to request data and then populating the cache (lazy loading), we proactively load frequently accessed data into the cache in advance.

This ensures that when the first request arrives, the data is already available in the cache, avoiding initial cache misses.

This can be done using:

  • background jobs

  • scheduled tasks (cron jobs)

  • analytics-based predictions
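A small sketch of a warming job, with placeholder helpers standing in for the analytics lookup and the database query:

```python
import json
import redis

r = redis.Redis()

def get_popular_post_ids(n: int) -> list:
    # Placeholder: a real implementation would read analytics data
    return [str(i) for i in range(n)]

def fetch_post_from_db(post_id: str) -> dict:
    return {"id": post_id}  # placeholder for the real database query

def warm_popular_posts(top_n: int = 100) -> None:
    # Run from a cron job, a deploy hook, or a background worker,
    # so the hottest keys are already cached when traffic arrives
    for post_id in get_popular_post_ids(top_n):
        post = fetch_post_from_db(post_id)
        r.set(f"post:{post_id}", json.dumps(post), ex=60)
```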


Conclusion

Caching is not just about storing data to make systems faster; it’s about keeping the system stable under load.

Each of these techniques makes a tradeoff between freshness, latency, and consistency. There is no one-size-fits-all solution; the right approach depends on the nature of your system and the type of data you are dealing with.