Handling Cache Layer Failure and Preventing Cache Stampede

Published on 2026-04-11 12:28 by Frugle Me (Last updated: 2026-04-11 12:28)

#cache #stampede

In a high-traffic distributed system, the cache (like Redis) acts as a shield for your database. When that shield falls, you face a Cache Stampede (or "thundering herd" problem): thousands of concurrent requests bypass the failed cache and hit the database simultaneously for the same piece of data.

This can lead to a cascading failure where your database locks up, latency spikes, and the entire system goes down. Here is how to design a resilient solution.

1. Implement "Promise Buffering" (Request Collapsing)

Instead of letting every microservice instance query the database for the same key, use request collapsing. When a cache miss occurs, the system ensures that only one request goes to the database, while others wait for the result of that single flight.

  • How it works: Use a local "In-Flight Map" or a semaphore.
  • Logic: If Key_A is missing, the first thread acquires a lock and fetches from the DB. Subsequent threads see the lock and "park" until the first thread returns the data and populates the cache, as sketched below.
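Here is a minimal sketch of this pattern in Java, using a `ConcurrentHashMap` as the in-flight map. The class and method names are illustrative, not from any particular library:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Collapses concurrent loads of the same key into a single database call. */
public class RequestCollapser<K, V> {

    // One in-flight future per key; later callers subscribe to the first one.
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    public CompletableFuture<V> load(K key, Supplier<V> dbFetch) {
        return inFlight.computeIfAbsent(key, k ->
                CompletableFuture.supplyAsync(dbFetch)  // only the first caller triggers the DB fetch
                        .whenComplete((value, err) -> inFlight.remove(k)));  // free the key for future reloads
    }
}
```

If a thousand requests call `load("user:42", ...)` concurrently, they all receive the same future, so the database sees exactly one query for that key per miss.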

2. The "Probabilistic Early Recomputation" Approach

Don't wait for the cache to expire or fail completely. You can use an algorithm (like XFetch) to re-warm the cache before it technically expires.

  • How it works: As the Time-to-Live (TTL) of a cache key nears its end, the application logic uses a probability calculation to decide whether to refresh the value early.
  • Benefit: This spreads out the re-population load and ensures that even if the cache layer is shaky, the most popular data is refreshed in the background by a single worker (see the sketch below).
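The XFetch rule fits in a few lines. The sketch below is an assumption-laden illustration rather than library code: `deltaMillis` (the time one recomputation takes) and `beta` (aggressiveness, usually 1.0) are the algorithm's tuning knobs.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Probabilistic early recomputation (the XFetch rule). */
public final class XFetch {

    /**
     * @param nowMillis    current time
     * @param expiryMillis when the cached value hard-expires
     * @param deltaMillis  how long one recomputation takes
     * @param beta         aggressiveness; 1.0 is the usual default, higher refreshes earlier
     * @return true if this request should refresh the value before it expires
     */
    public static boolean shouldRefreshEarly(long nowMillis, long expiryMillis,
                                             long deltaMillis, double beta) {
        // -ln(u) for u in (0,1] is exponentially distributed, so the refresh
        // probability rises smoothly as the key nears expiry: a few requests
        // "win the lottery" early and spread out the reload.
        double u = 1.0 - ThreadLocalRandom.current().nextDouble();  // (0, 1]
        return nowMillis - deltaMillis * beta * Math.log(u) >= expiryMillis;
    }
}
```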

3. Use Circuit Breakers and Adaptive Throttling

If the cache layer is down, you must protect the database from the sudden 100x increase in traffic.

  • Circuit Breaker: Use a library like Resilience4j. If the cache error rate hits a threshold, the circuit opens and subsequent cache calls fail fast instead of piling up on timeouts.
  • Fallback Strategy: When the circuit is open, the system can return a "stale" value from a local in-memory cache (like Caffeine or Guava) or a default "safe" value instead of hammering the DB, as sketched below.
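A sketch using Resilience4j's `CircuitBreaker`. The thresholds are illustrative, and the `staleFallback` supplier is assumed to read from a local in-memory cache:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class GuardedCacheRead {

    private final CircuitBreaker redisBreaker = CircuitBreaker.of("redis",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50.0f)                      // open after 50% of calls fail
                    .waitDurationInOpenState(Duration.ofSeconds(30))  // then probe Redis again after 30s
                    .build());

    public String read(String key, Supplier<String> redisLookup, Supplier<String> staleFallback) {
        try {
            // While the circuit is closed, calls flow to Redis and outcomes are recorded.
            return redisBreaker.executeSupplier(redisLookup);
        } catch (Exception redisUnavailable) {
            // Circuit open (fails fast, no timeout) or the lookup itself failed:
            // serve stale local data rather than stampeding the database.
            return staleFallback.get();
        }
    }
}
```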

4. Multi-Level Caching (L1/L2)

Avoid having a single point of failure by implementing a two-tier cache strategy:

  1. L1 (Local Cache): A small, short-lived cache residing in the memory of the microservice instance itself.
  2. L2 (Distributed Cache): Your global Redis cluster.

If Redis (L2) fails, the L1 cache still handles a significant portion of the repeat traffic for that specific pod, preventing a total stampede to the database.
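A sketch of the read path with Caffeine as L1. The L2 lookup and DB loader are passed in as plain functions so that no particular Redis client is assumed:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;
import java.util.function.Function;

public class TwoTierCache {

    // L1: small and short-lived, so staleness is bounded even during an outage.
    private final Cache<String, String> l1 = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .build();

    public String get(String key,
                      Function<String, String> l2Lookup,  // wraps your Redis GET
                      Function<String, String> dbLoad) {
        String value = l1.getIfPresent(key);
        if (value != null) {
            return value;                  // L1 hit: no network call at all
        }
        try {
            value = l2Lookup.apply(key);   // L2 (Redis) lookup
        } catch (Exception redisDown) {
            value = null;                  // treat a Redis outage as a plain miss
        }
        if (value == null) {
            value = dbLoad.apply(key);     // last resort: the database
        }
        l1.put(key, value);                // warm L1 so repeat traffic stays local
        return value;
    }
}
```

Keeping the L1 TTL short is the key design trade-off: it caps how stale a value can get while still absorbing the bulk of repeat reads during an L2 outage.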

5. Mutex Locks (Distributed Locking)

In a distributed environment, "Request Collapsing" needs to happen across multiple pods.

  • Logic: Use a distributed lock (e.g., Redlock or a ZooKeeper node). Only the pod that acquires the lock is allowed to query the database.
  • Safety: Always set a short TTL on these locks so that if the pod fetching the data crashes, the lock is eventually released for another pod to try (see the sketch below).
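Sketched here with Jedis and a single Redis node's `SET NX PX`; for multi-node guarantees, Redlock or a ZooKeeper lock would replace this. Note the ownership token, and that a production release should use a Lua script for atomicity:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

import java.util.UUID;
import java.util.function.Supplier;

public class DistributedMutex {

    /** Returns the fresh value, or null if another pod already holds the lock. */
    public String loadExclusively(Jedis jedis, String key, Supplier<String> dbFetch) {
        String lockKey = "lock:" + key;
        String token = UUID.randomUUID().toString();  // proves ownership on release
        // NX: set only if absent; PX 5000: auto-expire after 5s if this pod crashes.
        String reply = jedis.set(lockKey, token, SetParams.setParams().nx().px(5000));
        if (!"OK".equals(reply)) {
            return null;  // lost the race; back off, then re-read the cache
        }
        try {
            return dbFetch.get();  // only the lock holder queries the database
        } finally {
            // Release only our own lock. (Check-then-delete is not atomic here;
            // production code should do this in a single Lua script.)
            if (token.equals(jedis.get(lockKey))) {
                jedis.del(lockKey);
            }
        }
    }
}
```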

6. Soft TTL and Background Refresh

Set two expiration times: a soft TTL, after which the data is considered stale but still servable, and a hard TTL, after which it is evicted outright.

  • When the soft TTL expires, the system returns the "stale" data to the user immediately but triggers a background asynchronous task to update the cache from the database.
  • This ensures the user never experiences the latency of a database trip, and the database sees only one update request per key, as sketched below.
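A sketch of a soft-TTL entry; the hard TTL is assumed to be enforced by the cache store itself, while this wrapper handles stale-but-servable reads:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** A cached value with a soft expiry; stale reads trigger one async refresh. */
public class SoftTtlEntry<V> {

    private volatile V value;
    private volatile Instant softExpiry;
    private final AtomicBoolean refreshing = new AtomicBoolean(false);

    public SoftTtlEntry(V value, Duration softTtl) {
        this.value = value;
        this.softExpiry = Instant.now().plus(softTtl);
    }

    public V get(Supplier<V> dbFetch, Duration softTtl) {
        // compareAndSet guarantees at most one in-flight refresh per entry.
        if (Instant.now().isAfter(softExpiry) && refreshing.compareAndSet(false, true)) {
            CompletableFuture.runAsync(() -> {
                try {
                    value = dbFetch.get();                     // single background DB call
                    softExpiry = Instant.now().plus(softTtl);
                } finally {
                    refreshing.set(false);
                }
            });
        }
        return value;  // possibly stale, never a blocking DB round trip
    }
}
```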

Summary Architecture

To survive a total Redis outage:
1. Detect the failure via a Circuit Breaker.
2. Fall back to a local L1 cache or stale data.
3. Limit DB access using a Mutex or Semaphore so only 1-5% of traffic actually hits the DB to "re-warm" the system.
4. Queue or shed excess load to keep the database responsive for critical writes.
