Incident Report: Solving the 95% to 20% Cache Hit Rate Collapse
Published on 2026-04-15 11:14 by Frugle Me (Last updated: 2026-04-15 11:14)
Cache hit rate is one of the clearest health signals a distributed system has. When that rate plunges from a healthy 95% to 20%, the result is an immediate, catastrophic "thundering herd" effect on your database.
With scaling the database off the table, you are in a race against time to protect your primary data store from a total outage.
🚨 The Immediate Priority: Stop the Bleed
When the cache fails, the database takes the full brunt of every request. Before tuning queries or looking at code, you must address the Cache Stampede.
1. Implement Request Collapsing (Singleflight)
If 1,000 users request the same expired key at once, don't let 1,000 queries hit the DB.
* The Fix: Use a "singleflight" / request-coalescing pattern.
* How it works: Only the first request goes to the DB; the other 999 wait for that first result to populate the cache.
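Here is a minimal sketch of this pattern in Go, using the golang.org/x/sync/singleflight package. The in-memory map is a toy stand-in for your real cache, and loadFromDB is a hypothetical placeholder for the real (expensive) query.

```go
package cache

import (
	"sync"

	"golang.org/x/sync/singleflight"
)

var (
	mu    sync.RWMutex
	store = map[string]string{} // toy stand-in for Redis/Memcached
	group singleflight.Group
)

func cacheGet(key string) (string, bool) {
	mu.RLock()
	defer mu.RUnlock()
	v, ok := store[key]
	return v, ok
}

func cacheSet(key, val string) {
	mu.Lock()
	defer mu.Unlock()
	store[key] = val
}

// Get collapses concurrent misses for the same key into one DB call.
// loadFromDB is a placeholder for the real query.
func Get(key string, loadFromDB func(string) (string, error)) (string, error) {
	if v, ok := cacheGet(key); ok {
		return v, nil // cache hit: no DB work at all
	}
	v, err, _ := group.Do(key, func() (interface{}, error) {
		val, err := loadFromDB(key) // only the first caller runs this
		if err != nil {
			return nil, err
		}
		cacheSet(key, val) // repopulate so the next wave hits the cache
		return val, nil
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}
```

Every goroutine that calls Get with the same key during the miss window receives the result of that single query instead of issuing its own.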
2. Immediate TTL Extension
If the drop was caused by a sudden influx of new data or a change in traffic patterns, your keys might be expiring or getting evicted before they can be reused.
* The Fix: Manually increase the Time-to-Live (TTL) for your most popular keys.
* The Goal: Buy the database breathing room by serving slightly "stale" data rather than no data.
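As a rough sketch, assuming the cache is Redis and the go-redis client is available, extending the TTL on a handful of hot keys could look like this (the key names are purely illustrative):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Illustrative "most popular" keys; in practice pull these from
	// your metrics or a hot-key report.
	hotKeys := []string{"product:123", "homepage:config", "pricing:table"}

	for _, key := range hotKeys {
		// Push expiry out to 30 minutes so these keys stop churning
		// while the incident is being worked.
		if err := rdb.Expire(ctx, key, 30*time.Minute).Err(); err != nil {
			log.Printf("failed to extend TTL for %s: %v", key, err)
		}
	}
}
```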
🔍 Root Cause Analysis: What Broke?
Since we cannot scale the DB, we must identify why 80% of requests are suddenly missing the cache.
A. The "Key Leak" (High Cardinality)
Check if a recent code deploy changed how cache keys are generated.
* Example: Including a unique timestamp or session_id in the key name.
* Result: Every request is "unique," making the cache effectively useless because no two requests ever share a key.
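The pattern to grep for in the deploy diff looks roughly like this; the field names are hypothetical, but the shape of the bug is the same:

```go
package main

import (
	"fmt"
	"time"
)

// BAD: per-request values (session ID, timestamp) make every key unique,
// so no entry is ever reused and the hit rate collapses toward zero.
func badKey(region, productID, sessionID string) string {
	return fmt.Sprintf("product:%s:%s:%s:%d",
		region, productID, sessionID, time.Now().UnixNano())
}

// GOOD: key only on the fields that actually change the cached value,
// so requests from different users and sessions share one entry.
func goodKey(region, productID string) string {
	return fmt.Sprintf("product:%s:%s", region, productID)
}

func main() {
	fmt.Println(badKey("eu", "42", "sess-abc")) // unique every call
	fmt.Println(goodKey("eu", "42"))            // stable, reusable
}
```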
B. Hot Key Contention
Is one specific piece of data (a viral post or a global setting) being hit millions of times?
* The Symptom: The cache node that owns the key is pinned at 100% CPU (or saturating its network link) while memory utilization stays low.
* The Fix: Implement Local In-Memory Caching (L1) on the application servers for the top 1% of keys to bypass the network cache (L2) entirely.
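One possible shape for a small per-process L1 cache, sketched in Go; the string value type and the short TTL are assumptions, not a prescribed implementation:

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

// L1Cache holds a handful of hot keys in process memory so reads for them
// never touch the network cache (L2). Entries expire after a short TTL,
// which bounds staleness to a few seconds.
type L1Cache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]entry
}

func NewL1Cache(ttl time.Duration) *L1Cache {
	return &L1Cache{ttl: ttl, entries: map[string]entry{}}
}

func (c *L1Cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expiresAt) {
		return "", false // miss or expired: fall back to the L2 cache
	}
	return e.value, true
}

func (c *L1Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}
```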
C. Eviction Policy Overload
If your cache memory is full, it may be evicting keys before they can be reused.
* The Fix: If you can't add RAM, switch the eviction policy to LFU (Least Frequently Used) instead of LRU. This keeps "evergreen" data in memory even if it hasn't been touched in the last few seconds.
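If the network cache happens to be Redis, the policy can be switched at runtime without a restart. A sketch using the go-redis client (the address and client setup are assumptions):

```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Switch eviction from recency-based (allkeys-lru) to frequency-based
	// (allkeys-lfu) so frequently reused "evergreen" keys survive memory
	// pressure even when they were not read in the last few seconds.
	if err := rdb.ConfigSet(ctx, "maxmemory-policy", "allkeys-lfu").Err(); err != nil {
		log.Fatalf("failed to switch eviction policy: %v", err)
	}
}
```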
🛠️ The Technical Checklist
If you are the engineer on-call right now, follow this order:
- Check for "Poison Pill" Queries: Look for a specific query pattern that started when the hit rate dropped. Kill those connections at the DB level if they are non-essential.
- Circuit Breaking: If DB latency is climbing past a point of no return, trip the circuit breaker. Return a "Service Temporarily Unavailable" or a cached "Default Version" to the user to prevent a full DB crash.
- Jitter and Randomness: Ensure your TTLs have "jitter." If all your keys expire at the exact same time, you create a "Cache Cliff" where the hit rate drops to zero every hour. Add +/- 10% randomness to your expiration times.
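A small sketch of TTL jitter in Go; the +/- 10% factor mirrors the rule of thumb above and can be tuned:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ttlWithJitter spreads expirations by +/- 10% so keys written at the same
// moment do not all expire at the same moment (the "Cache Cliff").
func ttlWithJitter(base time.Duration) time.Duration {
	factor := rand.Float64()*0.2 - 0.1 // random factor in [-0.10, +0.10)
	return base + time.Duration(float64(base)*factor)
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(ttlWithJitter(time.Hour)) // e.g. 57m12s, 1h3m40s, ...
	}
}
```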
💡 Summary of the "Non-Scaling" Strategy
When you can't grow the database, you must shield it.
- Filter at the App Layer: Drop non-critical background tasks.
- Optimize the Cache Key: Ensure keys are as generic as possible to maximize reuse.
- Serve Stale: In an emergency, an old record is better than a 504 Gateway Timeout.
Conclusion: A drop from 95% to 20% is rarely a "slow" degradation; it is almost always a change in Key Logic or a Cache Stampede. Fix the logic, collapse the requests, and the DB load will normalize without adding a single CPU core.