Incident Report: Solving the 95% to 20% Cache Hit Rate Collapse
Published on 2026-04-15 11:14 by Frugle Me (Last updated: 2026-04-15 11:14)
Cache hit rate is one of the clearest health signals a distributed system has. When that rate plunges from a healthy 95% to 20%, the result is an immediate, catastrophic "thundering herd" effect on your database.
With scaling the database off the table, you are in a race against time to protect your primary data store from a total outage.
🚨 The Immediate Priority: Stop the Bleed
When the cache fails, the database takes the full brunt of every request. Before tuning queries or looking at code, you must address the Cache Stampede.
1. Implement Request Collapsing (Singleflight)
If 1,000 users request the same expired key at once, don't let 1,000 queries hit the DB.
* The Fix: Use a "singleflight" / request-coalescing pattern.
* How it works: Only the first request goes to the DB; the other 999 wait for that first result to populate the cache.
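Here is a minimal sketch of this pattern in Go, using the golang.org/x/sync/singleflight package. The in-memory map is a toy stand-in for your real cache, and loadFromDB is a hypothetical placeholder for the real (expensive) query.

```go
package cache

import (
	"sync"

	"golang.org/x/sync/singleflight"
)

var (
	mu    sync.RWMutex
	store = map[string]string{} // toy stand-in for Redis/Memcached
	group singleflight.Group
)

func cacheGet(key string) (string, bool) {
	mu.RLock()
	defer mu.RUnlock()
	v, ok := store[key]
	return v, ok
}

func cacheSet(key, val string) {
	mu.Lock()
	defer mu.Unlock()
	store[key] = val
}

// Get collapses concurrent misses for the same key into one DB call.
// loadFromDB is a placeholder for the real query.
func Get(key string, loadFromDB func(string) (string, error)) (string, error) {
	if v, ok := cacheGet(key); ok {
		return v, nil // cache hit: no DB work at all
	}
	v, err, _ := group.Do(key, func() (interface{}, error) {
		val, err := loadFromDB(key) // only the first caller runs this
		if err != nil {
			return nil, err
		}
		cacheSet(key, val) // repopulate so the next wave hits the cache
		return val, nil
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}
```

Every goroutine that calls Get with the same key during the miss window receives the result of that single query instead of issuing its own.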
2. Immediate TTL Extension
If the drop was caused by a sudden influx of new data or a change in traffic patterns, your keys might be expiring or getting evicted before they can be reused.
* The Fix: Manually increase the Time-to-Live (TTL) for your most popular keys.
* The Goal: Buy the database breathing room by serving slightly "stale" data rather than no data.
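As a rough sketch, assuming the cache is Redis and the go-redis client is available, extending the TTL on a handful of hot keys could look like this (the key names are purely illustrative):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Illustrative "most popular" keys; in practice pull these from
	// your metrics or a hot-key report.
	hotKeys := []string{"product:123", "homepage:config", "pricing:table"}

	for _, key := range hotKeys {
		// Push expiry out to 30 minutes so these keys stop churning
		// while the incident is being worked.
		if err := rdb.Expire(ctx, key, 30*time.Minute).Err(); err != nil {
			log.Printf("failed to extend TTL for %s: %v", key, err)
		}
	}
}
```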
🔍 Root Cause Analysis: What Broke?
Since we cannot scale the DB, we must identify why 80% of requests are suddenly missing the cache.
A. The "Key Leak" (High Cardinality)
Check if a recent code deploy changed how cache keys are generated.
* Example: Including a unique timestamp or session_id in the key name.
* Result: Every request is "unique," making the cache effectively useless because no two requests ever share a key.
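The pattern to grep for in the deploy diff looks roughly like this; the field names are hypothetical, but the shape of the bug is the same:

```go
package main

import (
	"fmt"
	"time"
)

// BAD: per-request values (session ID, timestamp) make every key unique,
// so no entry is ever reused and the hit rate collapses toward zero.
func badKey(region, productID, sessionID string) string {
	return fmt.Sprintf("product:%s:%s:%s:%d",
		region, productID, sessionID, time.Now().UnixNano())
}

// GOOD: key only on the fields that actually change the cached value,
// so requests from different users and sessions share one entry.
func goodKey(region, productID string) string {
	return fmt.Sprintf("product:%s:%s", region, productID)
}

func main() {
	fmt.Println(badKey("eu", "42", "sess-abc")) // unique every call
	fmt.Println(goodKey("eu", "42"))            // stable, reusable
}
```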
B. Hot Key Contention
Is one specific piece of data (a viral post or a global setting) being hit millions of times?
* The Symptom: The cache node that owns the key is pinned at 100% CPU (or saturating its network link) while memory utilization stays low.
* The Fix: Implement Local In-Memory Caching (L1) on the application servers for the top 1% of keys to bypass the network cache (L2) entirely.
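One possible shape for a small per-process L1 cache, sketched in Go; the string value type and the short TTL are assumptions, not a prescribed implementation:

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

// L1Cache holds a handful of hot keys in process memory so reads for them
// never touch the network cache (L2). Entries expire after a short TTL,
// which bounds staleness to a few seconds.
type L1Cache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]entry
}

func NewL1Cache(ttl time.Duration) *L1Cache {
	return &L1Cache{ttl: ttl, entries: map[string]entry{}}
}

func (c *L1Cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expiresAt) {
		return "", false // miss or expired: fall back to the L2 cache
	}
	return e.value, true
}

func (c *L1Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}
```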
C. Eviction Policy Overload
If your cache memory is full, it may be evicting keys before they can be reused.
* The Fix: If you can't add RAM, switch the eviction policy to LFU (Least Frequently Used) instead of LRU. This keeps "evergreen" data in memory even if it hasn't been touched in the last few seconds.
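If the network cache happens to be Redis, the policy can be switched at runtime without a restart. A sketch using the go-redis client (the address and client setup are assumptions):

```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Switch eviction from recency-based (allkeys-lru) to frequency-based
	// (allkeys-lfu) so frequently reused "evergreen" keys survive memory
	// pressure even when they were not read in the last few seconds.
	if err := rdb.ConfigSet(ctx, "maxmemory-policy", "allkeys-lfu").Err(); err != nil {
		log.Fatalf("failed to switch eviction policy: %v", err)
	}
}
```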
🛠️ The Technical Checklist
If you are the engineer on-call right now, follow this order:
- Check for "Poison Pill" Queries: Look for a specific query pattern that started when the hit rate dropped. Kill those connections at the DB level if they are non-essential.
- Circuit Breaking: If DB latency is climbing past a point of no return, trip the circuit breaker. Return a "Service Temporarily Unavailable" or a cached "Default Version" to the user to prevent a full DB crash.
- Jitter and Randomness: Ensure your TTLs have "jitter." If all your keys expire at the exact same time, you create a "Cache Cliff" where the hit rate drops to zero every hour. Add +/- 10% randomness to your expiration times.
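A small sketch of TTL jitter in Go; the +/- 10% factor mirrors the rule of thumb above and can be tuned:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ttlWithJitter spreads expirations by +/- 10% so keys written at the same
// moment do not all expire at the same moment (the "Cache Cliff").
func ttlWithJitter(base time.Duration) time.Duration {
	factor := rand.Float64()*0.2 - 0.1 // random factor in [-0.10, +0.10)
	return base + time.Duration(float64(base)*factor)
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(ttlWithJitter(time.Hour)) // e.g. 57m12s, 1h3m40s, ...
	}
}
```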
💡 Summary of the "Non-Scaling" Strategy
When you can't grow the database, you must shield it.
- Filter at the App Layer: Drop non-critical background tasks.
- Optimize the Cache Key: Ensure keys are as generic as possible to maximize reuse.
- Serve Stale: In an emergency, an old record is better than a 504 Gateway Timeout.
Conclusion: A drop from 95% to 20% is rarely a "slow" degradation; it is almost always a change in Key Logic or a Cache Stampede. Fix the logic, collapse the requests, and the DB load will normalize without adding a single CPU core.