You built a web app. It works on localhost. You deploy it. Then the first spike of traffic hits and your server falls over, your database melts, and a single user hammering your API 200 times per second takes down the service for everyone else. This is the gap between a demo and a product, and three patterns close it: rate limiting, caching, and background jobs.
These are not advanced topics. They are table stakes for any backend that serves real users. Yet most tutorials skip them entirely, jumping from "here is how to build a REST API" to "deploy to production" without covering the infrastructure patterns that keep that API alive under real-world conditions. This article fixes that.
We will use Redis as the backbone for all three patterns. Redis is an in-memory data store that can handle hundreds of thousands of operations per second with sub-millisecond latency. It is one of the most widely used tools for rate limiting, caching, and job queues, and understanding how to use it for these three purposes will make you a significantly more capable backend developer.
1. Why These Three Patterns Matter
Every backend application that serves more than a handful of users will eventually face three problems. First, some users (or bots, or attackers) will send far more requests than your server can handle. Without rate limiting, a single bad actor can consume all of your server's resources and deny service to legitimate users. Second, your database will become the bottleneck. Most web applications are read-heavy — the same data gets requested over and over. Without caching, every single request hits your database, even when the data has not changed in hours. Third, some operations simply take too long to do while a user is waiting. Sending an email, generating a PDF report, resizing an image, calling a third-party API — these tasks can take seconds or even minutes. Without background jobs, your users stare at a loading spinner while your server blocks on work that could happen asynchronously.
These three problems are not hypothetical. They are among the most common reasons production applications fail, slow down, or become unreliable. And all three can be solved with one piece of infrastructure: Redis.
Why Redis?
Redis operates entirely in memory, which means it can execute operations in microseconds rather than the milliseconds a disk-based database requires. It supports atomic operations (critical for rate limiting), key expiration (critical for caching), and list/queue data structures (critical for background jobs). A single Redis instance can handle over 100,000 operations per second. With Redis Cluster, you can scale horizontally across multiple nodes for high availability.
Redis is not the only option for each of these patterns individually. You could use Memcached for caching, or RabbitMQ for job queues. But Redis handles all three patterns well, which means you deploy and maintain one piece of infrastructure instead of three. For most teams, that simplicity wins.
The production gap: If your backend does not have rate limiting, caching, and background job processing, it is a demo — not a product. These three patterns are what separate "it works on my machine" from "it handles 10,000 concurrent users without breaking a sweat."
2. Rate Limiting
Rate limiting controls how many requests a client can make to your API within a given time window. Without it, a single client can monopolize your server, whether intentionally (a DDoS attack) or accidentally (a bug in someone's code that sends requests in an infinite loop). Rate limiting protects your server, your database, and your other users.
The Three Algorithms
There are three primary rate limiting algorithms, and each makes a different tradeoff. Choosing the right one depends on your use case.
| Algorithm | How It Works | Burst Behavior | Best For |
|---|---|---|---|
| Token Bucket | Tokens refill at a steady rate. Each request costs one token. When the bucket is empty, requests are rejected. | Allows bursts | APIs where occasional bursts are acceptable (user-facing APIs, webhook receivers) |
| Leaky Bucket | Requests enter a queue and are processed at a fixed rate. If the queue is full, new requests are dropped. | Strict — no bursts | Payment processing, database writes, anything requiring strict throughput control |
| Sliding Window | Counts requests in a rolling time window. Combines the simplicity of fixed windows with the accuracy of per-second tracking. | Balanced | Most APIs — this is the best general-purpose choice |
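To make the first two rows concrete, here is a minimal in-process sketch of both bucket algorithms. It is an illustration under stated assumptions, not a production limiter: state lives in plain Python objects rather than Redis, it is single-threaded, and the clock is injectable so the behavior can be tested deterministically. The class names `TokenBucket` and `LeakyBucket` are hypothetical.

```python
import time
from collections import deque


class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity          # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1            # each request costs one token
            return True
        return False                    # bucket empty: reject


class LeakyBucket:
    """Leaky bucket as a queue: holds up to `capacity` requests and
    releases them for processing at a fixed `rate` per second."""

    def __init__(self, capacity: int, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.queue = deque()
        self.processed = []             # stands in for handing requests to a worker
        self.last = clock()

    def _leak(self) -> None:
        now = self.clock()
        leaks = int((now - self.last) * self.rate)   # whole requests drained since last check
        if leaks > 0:
            for _ in range(min(leaks, len(self.queue))):
                self.processed.append(self.queue.popleft())
            self.last += leaks / self.rate           # keep the fractional remainder

    def offer(self, request) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False                                 # queue full: drop the request
```

With `capacity=5, rate=1`, the token bucket lets five requests through at once and then roughly one per second, which is the burst tolerance the table describes; the leaky bucket never releases more than `rate` requests per second, no matter how they arrive.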
The Sliding Window algorithm is the best choice for most APIs. It avoids the boundary problem that fixed windows have (where a client can send double the limit by timing requests across the window boundary) while being simpler to implement than a full token bucket. If you are not sure which to use, start with sliding window.
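The boundary problem is easiest to see side by side. This sketch compares a fixed-window counter with a sliding window log (one timestamp per allowed request, the same idea as the sorted set in the Redis Lua script later in this section). Both classes are hypothetical, in-memory, and assume timestamps arrive in non-decreasing order.

```python
from collections import deque


class FixedWindowCounter:
    """Counts requests per aligned window [k * window, (k + 1) * window)."""

    def __init__(self, limit: int, window_ms: int):
        self.limit, self.window_ms = limit, window_ms
        self.current_window, self.count = None, 0

    def allow(self, now_ms: int) -> bool:
        window = now_ms // self.window_ms
        if window != self.current_window:      # new window: the counter resets
            self.current_window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps one timestamp per allowed request."""

    def __init__(self, limit: int, window_ms: int):
        self.limit, self.window_ms = limit, window_ms
        self.log = deque()

    def allow(self, now_ms: int) -> bool:
        # Drop timestamps at or before now - window (like ZREMRANGEBYSCORE key 0 now-window)
        while self.log and self.log[0] <= now_ms - self.window_ms:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now_ms)
            return True
        return False


def burst(limiter, t_ms: int, n: int) -> int:
    """Send n requests at time t_ms and return how many were allowed."""
    return sum(limiter.allow(t_ms) for _ in range(n))


# Limit: 100 requests per 60 s. Send 100 just before the boundary, 100 just after.
fixed = FixedWindowCounter(limit=100, window_ms=60_000)
sliding = SlidingWindowLog(limit=100, window_ms=60_000)
print(burst(fixed, 59_900, 100) + burst(fixed, 60_100, 100))      # 200
print(burst(sliding, 59_900, 100) + burst(sliding, 60_100, 100))  # 100
```

The fixed window admits 200 requests within 200 ms because the counter resets at 60,000 ms; the sliding log still sees the first 100 timestamps inside its 60-second window and rejects the second burst.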
Implementation with Redis and Lua Scripts
Here is where many tutorials get it wrong. They implement rate limiting with two separate Redis commands: one to read the current count, and one to increment it. This creates a TOCTOU (time-of-check to time-of-use) race condition. Between the time you check the count and the time you increment it, another request can slip through. Under high concurrency, this means your rate limiter leaks.
The solution is Lua scripts. Redis executes Lua scripts atomically — the entire script runs as a single operation, and no other commands can execute in between. This eliminates the race condition entirely.
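```lua
-- KEYS[1] = rate limit key (e.g., "rl:user:123")
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests per window
-- ARGV[3] = current timestamp in milliseconds
-- ARGV[4] = optional unique request ID (e.g., a UUID) used as the set member
local key = KEYS[1]
local window = tonumber(ARGV[1]) * 1000
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove entries outside the current window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count remaining entries
local count = redis.call('ZCARD', key)

if count < limit then
  -- Add this request with the timestamp as its score. Members must be unique,
  -- or two requests in the same millisecond collapse into one entry. Prefer a
  -- caller-supplied ID: some Redis versions seed math.random() identically on
  -- every script call, so the fallback alone may not be unique.
  local member = ARGV[4] or (now .. ':' .. math.random())
  redis.call('ZADD', key, now, member)
  redis.call('EXPIRE', key, ARGV[1])
  return 1 -- allowed
else
  return 0 -- rejected
end
```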
This script uses a Redis Sorted Set. Each request is stored as a member with the current timestamp as its score. When a new request comes in, the script first removes all entries older than the window, then counts how many remain. If the count is under the limit, the request is allowed and added to the set. If not, it is rejected. Because this runs as an atomic Lua script, there is no race condition — no request can sneak in between the check and the increment.
Distributed Rate Limiting
If your application runs on multiple servers behind a load balancer, you need a shared rate limiter. This is where Redis shines — all of your application servers connect to the same Redis instance (or cluster), so rate limit state is shared globally. A user who sends 50 requests to server A and 50 requests to server B still sees a combined count of 100, not two separate counts of 50.
For high availability, use Redis Sentinel or Redis Cluster. Sentinel provides automatic failover — if your primary Redis instance goes down, a replica is promoted within seconds. Cluster distributes data across multiple nodes for both performance and redundancy. For most applications, Sentinel is simpler to set up and sufficient.
Always add a circuit breaker. If Redis goes down, your rate limiter goes down. Without a circuit breaker, this means your application either crashes (if it waits for Redis indefinitely) or becomes unprotected (if you skip rate limiting on Redis errors). A circuit breaker detects that Redis is unreachable and falls back to a local in-memory rate limiter until Redis recovers. This way your app stays both alive and protected.
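A minimal sketch of that fallback, assuming `remote_allow` is a hypothetical function that runs the Redis check and raises on connection errors. The `LocalSlidingWindow` helper, the thresholds, and the half-open probe after `reset_timeout` are illustrative choices, not the behavior of any specific library.

```python
import time
from collections import defaultdict, deque


class LocalSlidingWindow:
    """In-process fallback limiter. Its state is per server, not shared."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.logs = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        log = self.logs[key]
        while log and log[0] <= now - self.window_s:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False


class CircuitBreakerLimiter:
    """Uses the shared limiter until it fails `failure_threshold` times in a row,
    then serves from the local limiter for `reset_timeout` seconds before probing again."""

    def __init__(self, remote_allow, local, failure_threshold=3, reset_timeout=30.0,
                 errors=(ConnectionError, TimeoutError), clock=time.monotonic):
        # remote_allow: hypothetical callable key -> bool that runs the Redis Lua script.
        # With redis-py, pass errors=(redis.exceptions.ConnectionError,
        # redis.exceptions.TimeoutError); those are not the builtin exception classes.
        self.remote_allow = remote_allow
        self.local = local
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.errors = errors
        self.clock = clock
        self.failures = 0
        self.opened_at = None          # None means closed: Redis is being used

    def allow(self, key: str) -> bool:
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_timeout:
            return self.local.allow(key)               # open: do not touch Redis at all
        try:                                           # closed, or a half-open probe
            result = self.remote_allow(key)
        except self.errors:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()          # (re)open the circuit
            return self.local.allow(key)
        self.failures = 0
        self.opened_at = None                          # success closes the circuit
        return result
```

One design consequence worth knowing: while the circuit is open, each server enforces the limit on its own, so a client spread across N servers can get up to roughly N times the limit until Redis recovers.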
FastAPI + Redis Performance
A well-implemented FastAPI application with Redis-backed rate limiting can handle on the order of 10,000 requests per second on a single instance, depending on hardware and what each endpoint does, with sub-millisecond latency for rate-limiting checks. The rate limiting itself adds negligible overhead: the Lua script executes in microseconds, and the network round-trip to Redis on the same machine is typically under a millisecond. The bottleneck will almost always be your application logic or database, not the rate limiter.
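```python
import time

import redis.asyncio as redis
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from redis.exceptions import ConnectionError as RedisConnectionError
from redis.exceptions import TimeoutError as RedisTimeoutError

app = FastAPI()
redis_client = redis.Redis(host='localhost', port=6379)

RATE_LIMIT_SCRIPT = """..."""  # The Lua script above
# register_script returns a callable that runs EVALSHA and loads the script if needed
rate_limit = redis_client.register_script(RATE_LIMIT_SCRIPT)

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Behind a proxy or load balancer, take the client IP from a trusted forwarded header instead
    client_ip = request.client.host if request.client else "unknown"
    key = f"rl:{client_ip}"
    now = int(time.time() * 1000)
    try:
        # 60-second window, 100 requests per window
        allowed = await rate_limit(keys=[key], args=[60, 100, now])
    except (RedisConnectionError, RedisTimeoutError):
        # Fail open: allow the request if Redis is unreachable.
        # A circuit breaker would fall back to a local in-memory limiter instead.
        allowed = 1
    if not allowed:
        # An HTTPException raised inside middleware bypasses FastAPI's exception
        # handlers and surfaces as a 500, so return the 429 response directly
        return JSONResponse(status_code=429, content={"detail": "Too many requests"})
    return await call_next(request)
```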
3. Caching
Caching stores the result of an expensive operation (usually a database query or an external API call) so that subsequent requests can be served from memory instead of repeating the operation. A Redis cache can return data in under a millisecond. A database query typically takes 5–50 milliseconds. An external API call can take 200–2000 milliseconds. For read-heavy applications, caching is often the single most effective way to make things faster.
Caching Patterns Compared
There are three fundamental caching patterns, and the right choice depends on whether your application is read-heavy, write-heavy, or needs strict consistency.
| Pattern | How It Works | Consistency | Best For |
|---|---|---|---|
| Cache-Aside | App checks cache first. On miss, reads from DB, then writes result to cache. App manages the cache explicitly. | Eventual (stale reads possible) | Most common — read-heavy workloads, product catalogs, user profiles |
| Write-Through | App writes to cache and DB simultaneously. Cache is always up-to-date. | Strong | Financial data, inventory counts, anything where stale reads are unacceptable |
| Write-Behind | App writes to cache immediately. Cache asynchronously flushes writes to DB in batches. | Eventual (risk of data loss) | High write throughput — analytics events, logging, activity feeds |
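The difference between the two write patterns is mostly about when the database write happens. In this minimal sketch, plain dicts stand in for Redis and the database, and the class names are hypothetical. The write-through version persists to the database first and then updates the cache (one reasonable ordering, chosen here so a failed DB write never leaves the cache ahead of the DB); the write-behind version defers DB writes to a `flush()` that a background task would call.

```python
from collections import deque


class WriteThroughCache:
    def __init__(self, cache: dict, db: dict):
        self.cache, self.db = cache, db

    def write(self, key, value):
        self.db[key] = value          # persist first...
        self.cache[key] = value       # ...then update the cache in the same request


class WriteBehindCache:
    def __init__(self, cache: dict, db: dict, batch_size: int = 100):
        self.cache, self.db = cache, db
        self.pending = deque()        # writes not yet in the DB (lost if the process dies)
        self.batch_size = batch_size

    def write(self, key, value):
        self.cache[key] = value       # the caller sees the new value immediately
        self.pending.append((key, value))

    def flush(self) -> int:
        """Called periodically by a background task; writes up to batch_size entries."""
        written = 0
        while self.pending and written < self.batch_size:
            key, value = self.pending.popleft()
            self.db[key] = value
            written += 1
        return written
```

The `pending` buffer is exactly where the table's "risk of data loss" comes from: anything written to the cache but not yet flushed disappears if the process or cache node dies.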
Cache-Aside is the pattern you should use by default. It is the simplest, the most well-understood, and fits the majority of web application workloads. Your application checks Redis first. If the data is there (a cache hit), it returns immediately. If not (a cache miss), it queries the database, stores the result in Redis with a TTL (time-to-live), and then returns the data. The next request for the same data hits the cache and skips the database entirely.
```python
import json

import redis

r = redis.Redis(host='localhost', port=6379)

def get_user_profile(user_id: int):
    cache_key = f"user:{user_id}:profile"

    # 1. Check cache
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # Cache hit

    # 2. Cache miss - query the database
    # (`db` is a placeholder for your own database access layer)
    profile = db.query("SELECT * FROM users WHERE id = %s", (user_id,))

    # 3. Store in cache with a 5-minute TTL
    r.setex(cache_key, 300, json.dumps(profile))
    return profile
```
Cache Invalidation
Cache invalidation — deciding when to remove or update cached data — is famously one of the hardest problems in computer science. Here are three strategies that work in practice.
TTL-based expiration is the simplest approach. Every key gets a time-to-live. After that time, Redis automatically deletes the key, and the next request triggers a fresh database query. This works well when slightly stale data is acceptable (user profiles, product listings, blog posts). Always set a TTL on cache keys. Keys without TTLs live forever and will eventually consume all of your Redis memory.
Event-driven invalidation is more precise. When data changes (a user updates their profile, an admin modifies a product), your application explicitly deletes the relevant cache key. The next read triggers a fresh cache fill. This gives you near-real-time consistency without the complexity of write-through caching.
Version-based invalidation avoids the problem of deleting keys entirely. Instead of deleting a key when data changes, you increment a version number. Your cache keys include the version: user:123:profile:v7. When the profile is updated, the version becomes v8, and the old key is simply never accessed again (and eventually expires via TTL). This approach is particularly useful when you need to invalidate many related keys at once — change one version number and all related caches become stale simultaneously.
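A minimal sketch of version-based keys, with a dict standing in for Redis and a hypothetical `VersionedCache` class. In Redis, the version counter would itself be a key (for example, one incremented with INCR) that you read before building the cache key.

```python
import time


class VersionedCache:
    def __init__(self, clock=time.monotonic):
        self.store = {}       # key -> (value, expires_at); stands in for Redis
        self.versions = {}    # entity -> current version (a counter key in Redis)
        self.clock = clock

    def key_for(self, entity: str, field: str) -> str:
        return f"{entity}:{field}:v{self.versions.get(entity, 1)}"

    def get(self, entity: str, field: str):
        item = self.store.get(self.key_for(entity, field))
        if item is None or item[1] <= self.clock():
            return None       # missing or expired: treat as a cache miss
        return item[0]

    def set(self, entity: str, field: str, value, ttl: float = 300):
        self.store[self.key_for(entity, field)] = (value, self.clock() + ttl)

    def bump(self, entity: str) -> None:
        # One increment makes every cached key for this entity unreachable;
        # the old keys are never read again and eventually expire via their TTL.
        self.versions[entity] = self.versions.get(entity, 1) + 1
```

After `bump("user:123")`, both `user:123:profile:v1` and `user:123:settings:v1` become stale in a single step, which is the "invalidate many related keys at once" property described above.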
Never use KEYS in production. The Redis KEYS command scans every key in the database and blocks the entire Redis instance while it runs. On a Redis instance with millions of keys, this can block for seconds — effectively a denial of service. Use SCAN instead. SCAN iterates through keys incrementally without blocking, returning a cursor you use to fetch the next batch. It is safe for production use.
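A sketch of the SCAN cursor loop. It assumes a redis-py style client whose `scan(cursor=..., match=..., count=...)` returns `(next_cursor, keys)`; redis-py's `scan_iter` wraps this same loop for you. `scan_all` is a hypothetical helper name.

```python
def scan_all(client, pattern: str, count: int = 500) -> set:
    """Collect keys matching `pattern` incrementally with SCAN instead of KEYS."""
    found = set()          # SCAN may return the same key more than once, so deduplicate
    cursor = 0
    while True:
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=count)
        found.update(batch)
        if cursor == 0:    # a returned cursor of 0 means the iteration is complete
            return found
```

`count` is only a hint for how much work each call does, so a single batch can contain more or fewer keys than requested, and some batches may be empty even when matches remain.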
Monitoring Your Cache
A cache you do not monitor is a cache that will eventually cause an outage. Track three metrics:
- Cache hit rate: The percentage of requests served from cache versus database. A healthy cache has a hit rate above 80%. If your hit rate is below 50%, your TTLs may be too short, your cache keys may be too specific, or you are caching data that changes too frequently to benefit.
- Memory usage: Redis stores everything in memory. If Redis hits its memory limit and you have not configured an eviction policy (the default policy is noeviction), it starts rejecting write commands. Monitor memory usage and set the `maxmemory` configuration directive so Redis knows its limit.
- Eviction count: When Redis reaches its memory limit, it evicts keys according to your eviction policy. A high eviction rate means your cache is too small for your workload. Either increase Redis memory, reduce TTLs so keys expire sooner, or cache fewer things.
```bash
redis-cli INFO stats | grep keyspace_hits
redis-cli INFO stats | grep keyspace_misses
redis-cli INFO memory | grep used_memory_human
redis-cli INFO stats | grep evicted_keys
# Calculate hit rate:
# hit_rate = keyspace_hits / (keyspace_hits + keyspace_misses)
```
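If you would rather compute the hit rate in code, a small parser over the `INFO` text is enough. This sketch assumes the standard `field:value` line format with `#` section headers; `parse_info` and `hit_rate` are hypothetical helper names, and redis-py's `info()` already returns a parsed dict that `hit_rate` can take directly.

```python
def parse_info(text: str) -> dict:
    """Parse `redis-cli INFO` output into a dict of field -> string value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue                      # skip blanks and "# Section" headers
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def hit_rate(info: dict) -> float:
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0  # no lookups yet: report 0 rather than divide by zero
```

Note that these counters are cumulative since the server started (or since `CONFIG RESETSTAT`), so for a current picture, compare two samples taken a few minutes apart.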
Common Caching Mistakes
Caching without TTLs. Every cached key should have an expiration. Without TTLs, your cache grows indefinitely, consumes all available memory, and serves increasingly stale data. There is no situation where a cache key should live forever.
Cache stampede. When a popular cache key expires, hundreds of requests simultaneously query the database to rebuild the cache. This can overwhelm your database. The fix is a cache lock (also called "thundering herd protection"): the first request that encounters a cache miss acquires a short-lived lock in Redis, rebuilds the cache, and releases the lock. All other requests wait for the lock to release, then read from the freshly populated cache.
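A minimal sketch of the lock-based rebuild. `InMemoryStore` is a thread-safe stand-in for the Redis calls involved (its `set(..., nx=True)` mimics `SET key value NX EX ttl`), and `get_with_lock`, its polling interval, and its timeouts are hypothetical choices for illustration.

```python
import threading
import time


class InMemoryStore:
    """Thread-safe dict with TTLs; a stand-in for the Redis calls used below."""

    def __init__(self):
        self._data = {}                       # key -> (value, expires_at)
        self._mutex = threading.Lock()

    def get(self, key):
        with self._mutex:
            item = self._data.get(key)
            if item is None or item[1] <= time.monotonic():
                return None
            return item[0]

    def set(self, key, value, ttl, nx=False):
        """Like SET key value EX ttl [NX]; returns False if NX and the key exists."""
        with self._mutex:
            item = self._data.get(key)
            if nx and item is not None and item[1] > time.monotonic():
                return False
            self._data[key] = (value, time.monotonic() + ttl)
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def get_with_lock(store, key, loader, ttl=300, lock_ttl=10, poll=0.01, max_wait=5.0):
    value = store.get(key)
    if value is not None:
        return value                              # cache hit
    lock_key = f"lock:{key}"
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if store.set(lock_key, "1", lock_ttl, nx=True):
            try:
                value = store.get(key)            # re-check: a previous holder may have filled it
                if value is None:
                    value = loader()              # only the lock holder hits the database
                    store.set(key, value, ttl)
                return value
            finally:
                # Real Redis code should store a unique token and delete the lock only if it
                # still holds that token (compare-and-delete in Lua); plain delete keeps this short.
                store.delete(lock_key)
        time.sleep(poll)                          # someone else is rebuilding; wait, then re-read
        value = store.get(key)
        if value is not None:
            return value
    return loader()                               # waited too long: fall back to the database
```

With twenty threads missing the same key at once, only one of them calls `loader()`; the rest poll until the rebuilt value appears. The lock TTL is what keeps a crashed lock holder from blocking everyone forever.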
Caching errors. If your database query fails and you cache the error response, every subsequent request gets the cached error instead of retrying the database. Never cache error states. Only cache successful responses.
4. Background Jobs
Some operations do not belong in the request-response cycle. When a user signs up and you need to send a welcome email, you should not make them wait for the email to actually send before showing the "Welcome!" page. When a user uploads an image and you need to resize it into five different dimensions, you should not block the HTTP response until all five resizes are done. When a user requests a report that requires aggregating data from three different databases, you should not hold the connection open for 30 seconds while the report generates.
These are all background job use cases. The pattern is simple: the web server receives the request, places the work on a queue, immediately responds to the user ("Your report is being generated"), and a separate worker process picks up the job from the queue and does the heavy lifting asynchronously.
Why Synchronous Processing Fails
Consider what happens when you send an email synchronously inside an API request. Your server connects to the SMTP provider, negotiates a TLS handshake, transmits the message, and waits for confirmation. This takes 1–5 seconds. During those seconds, that server thread is blocked — it cannot serve other requests. If 100 users sign up simultaneously, you need 100 threads blocked on email sending. Your server's thread pool is exhausted, and subsequent requests start timing out.
Background jobs solve this by moving the slow work off the critical path. The web server's only job is to accept the request, validate it, enqueue the work, and respond. It takes milliseconds, not seconds. The actual work happens in a separate process that can retry failures, handle timeouts, and process jobs at whatever rate your infrastructure supports.
Job Queue Architecture
A job queue has three components: the producer (your web server, which creates jobs), the queue (Redis, which stores jobs), and the consumer (worker processes, which execute jobs). The producer serializes the job data (usually as JSON) and pushes it onto a Redis list. Workers run in a loop, blocking-popping jobs from the list, deserializing them, and executing the task.
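To show the three roles without any infrastructure, this sketch uses Python's thread-safe `queue.Queue` as a stand-in for the Redis list; comments mark where LPUSH and BRPOP would go. The function names, the JSON job shape, and the `None` shutdown sentinel are assumptions of this sketch, not any library's format.

```python
import json
import queue
import threading


def enqueue(jobs: queue.Queue, task: str, **kwargs) -> None:
    """Producer: serialize the job as JSON and push it (LPUSH in a Redis-backed queue)."""
    jobs.put(json.dumps({"task": task, "kwargs": kwargs}))


def worker(jobs: queue.Queue, handlers: dict) -> None:
    """Consumer: block until a job arrives (BRPOP in Redis), deserialize it, run it."""
    while True:
        raw = jobs.get()                 # blocks while the queue is empty
        try:
            if raw is None:              # shutdown sentinel used by this sketch
                return
            job = json.loads(raw)
            handlers[job["task"]](**job["kwargs"])
        finally:
            jobs.task_done()


# Usage: one worker thread, three jobs, then shut down
jobs = queue.Queue()
sent = []
handlers = {"send_welcome_email": lambda user_id: sent.append(user_id)}  # stand-in task

t = threading.Thread(target=worker, args=(jobs, handlers))
t.start()
for uid in (1, 2, 3):
    enqueue(jobs, "send_welcome_email", user_id=uid)
jobs.put(None)
t.join()
print(sent)  # [1, 2, 3]
```

Serializing to JSON, rather than passing Python objects, is what lets producers and workers live in different processes or on different machines once the queue is Redis.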
You do not need to build this from scratch. Mature job queue libraries exist for every major language:
| Library | Language | Backed By | Key Feature |
|---|---|---|---|
| Bull / BullMQ | Node.js | Redis | Priority queues, rate limiting per queue, repeatable jobs, built-in dashboard |
| Celery | Python | Redis or RabbitMQ | Mature ecosystem, periodic tasks with Celery Beat, result backends, workflow chains |
| Sidekiq | Ruby | Redis | Threaded (high throughput on single process), commercial Pro/Enterprise tiers, web dashboard |
All three can use Redis as their backend (Celery can also use RabbitMQ). If you are already running Redis for rate limiting and caching, your job queue adds no new infrastructure. This is a big part of why Redis is such a common choice for backend infrastructure.
```python
# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

# `db` and `email_service` are placeholders for your own modules
@app.task(bind=True, max_retries=3, default_retry_delay=60)
def send_welcome_email(self, user_id: int):
    try:
        user = db.get_user(user_id)
        email_service.send(
            to=user.email,
            template='welcome',
            context={'name': user.name}
        )
    except email_service.ConnectionError as exc:
        # Retry with exponential backoff: 60s, 120s, 240s
        raise self.retry(exc=exc, countdown=2 ** self.request.retries * 60)


# main.py: the web app is a separate FastAPI instance, not the Celery app
from fastapi import FastAPI
from tasks import send_welcome_email

api = FastAPI()

@api.post("/signup")
def signup(user_data: UserCreate):  # UserCreate and create_user are your own model/helper
    user = create_user(user_data)
    send_welcome_email.delay(user.id)  # Enqueue, don't wait
    return {"message": "Welcome! Check your email."}
```
Retry Patterns and Exponential Backoff
Jobs fail. The email service is temporarily down. The third-party API returns a 503. The database connection times out. A robust job queue handles failures gracefully with retries.
Exponential backoff is the standard retry strategy. Instead of retrying immediately (which often fails again for the same reason), you wait progressively longer between attempts: 1 minute, 2 minutes, 4 minutes, 8 minutes. This gives the failing dependency time to recover while avoiding a retry storm that makes the problem worse.
The formula is straightforward: delay = base_delay * (2 ^ attempt_number). With a base delay of 60 seconds, your retries happen at 60s, 120s, 240s, and 480s. Add a small random jitter (a few seconds of randomness) to prevent all failed jobs from retrying at exactly the same moment.
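The formula and the jitter fit in one small function. This is a minimal sketch; `backoff_delays` and its `max_jitter` parameter are hypothetical names, and the jitter here is simply added on top of the exponential delay.

```python
import random


def backoff_delays(base_delay: float, retries: int, max_jitter: float = 0.0, rng=random.random):
    """delay = base_delay * 2**attempt, plus up to `max_jitter` seconds of random jitter."""
    return [base_delay * (2 ** attempt) + rng() * max_jitter for attempt in range(retries)]


print(backoff_delays(60, 4))                 # [60.0, 120.0, 240.0, 480.0]
jittered = backoff_delays(60, 4, max_jitter=5)  # each delay is 0-5 s above the values above
```

In practice you would usually also cap the delay (for example, never wait more than an hour), since the doubling grows quickly after a handful of attempts.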
Dead Letter Queues
After a job has been retried the maximum number of times and still fails, it needs to go somewhere. This is what a dead letter queue (DLQ) is for. Instead of silently dropping the failed job, the system moves it to a separate queue where it can be inspected, debugged, and manually retried once the underlying issue is fixed.
A dead letter queue should contain the full job payload, the error message from the final failure, a count of how many times it was retried, and timestamps for when it was created, first failed, and finally moved to the DLQ. This information is essential for debugging. Without it, you are left wondering why 200 welcome emails were never sent last Tuesday.
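A minimal in-memory sketch of such a record and of moving a job to the DLQ after its final attempt. The `DeadLetter` fields follow the list above; `run_with_dlq` is a hypothetical helper that retries immediately for brevity, whereas a real queue would wait with backoff between attempts and would push the record onto a Redis list instead of a Python list.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DeadLetter:
    payload: dict
    error: str
    retries: int
    created_at: float
    first_failed_at: float
    moved_at: float = field(default_factory=time.time)


def run_with_dlq(payload: dict, handler, max_retries: int, dlq: list, created_at=None):
    created_at = created_at if created_at is not None else time.time()
    first_failed_at = None
    last_error = ""
    for attempt in range(max_retries + 1):      # first try plus max_retries retries
        try:
            return handler(payload)
        except Exception as exc:
            if first_failed_at is None:
                first_failed_at = time.time()
            last_error = repr(exc)
    # Every attempt failed: keep the full context for debugging instead of dropping the job
    dlq.append(DeadLetter(payload, last_error, max_retries, created_at, first_failed_at))
    return None
```

Because the record is a plain dataclass, it serializes easily (for example with `dataclasses.asdict` and `json.dumps`) when stored in Redis.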
```javascript
const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };

const emailQueue = new Queue('email', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 60000 },
    removeOnComplete: true,
    removeOnFail: false, // Keep failed jobs in the "failed" set, which acts as the DLQ
  },
});

// sendEmail is your own email helper
const worker = new Worker('email', async (job) => {
  await sendEmail(job.data.to, job.data.template);
}, { connection });

worker.on('failed', (job, err) => {
  // job can be undefined for some failures, so guard before reading it
  if (job && job.attemptsMade >= job.opts.attempts) {
    console.error(`Job ${job.id} exhausted its retries: ${err.message}`);
    // Alert your monitoring system
  }
});
```
Monitor your dead letter queue. Set up alerts that fire when jobs land in the DLQ. A growing DLQ means something is broken — an API key expired, a service is down, a bug was deployed. The DLQ is not just a place for failed jobs to die quietly. It is an early warning system.
5. Putting It All Together
Let us build a complete picture of how a production application uses all three patterns together, backed by a single Redis instance. We will use a FastAPI application as the example, but the architecture applies identically to Express, Django, Rails, or any other framework.
The Architecture
```
             |
             v
      [Redis Sentinel]
     /       |        \
Rate Limits  Cache   Job Queues
  (DB 0)    (DB 1)     (DB 2)
                         |
                [Celery Workers x2]
                         |
                  [PostgreSQL DB]
```
Every incoming request first passes through the rate limiting middleware. The middleware runs the sliding window Lua script against Redis DB 0. If the request is allowed, it continues to the route handler. The route handler checks Redis DB 1 for cached data. On a cache hit, it returns immediately without touching PostgreSQL. On a cache miss, it queries PostgreSQL, caches the result, and returns. If the route needs to do slow work (send an email, generate a report, call an external API), it enqueues a job on Redis DB 2 and returns immediately. Celery workers consume jobs from the queue and process them asynchronously.
Redis Configuration for Production
The default Redis configuration is designed for development. For production, you need to change several settings.
```conf
# Memory limit - set to ~75% of available RAM
maxmemory 2gb
# Eviction policy - LRU for caching workloads
maxmemory-policy allkeys-lru
# Persistence - use both RDB snapshots and AOF log
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec
# Connection limits
maxclients 10000
timeout 300
tcp-keepalive 60
# Security
requirepass your_strong_password_here
# bind takes this server's own interface addresses, not client subnets;
# 10.0.0.5 is an example private address. Restrict clients with a firewall.
bind 127.0.0.1 10.0.0.5
# Disable dangerous commands
rename-command KEYS ""
rename-command FLUSHALL ""
rename-command FLUSHDB ""
```
A few of these settings deserve explanation:
- maxmemory and maxmemory-policy allkeys-lru: When Redis reaches 2GB, it starts evicting the least recently used keys to make room for new ones. Without a memory limit, Redis keeps growing until the machine runs out of memory; with a limit but the default noeviction policy, it rejects writes once full. The `allkeys-lru` policy is correct for caching workloads, but note that `maxmemory` and the eviction policy apply to the whole instance, across all numbered databases. In a single-instance layout, `allkeys-lru` can evict job queue keys stored in another database. For data that must not be evicted (like job queues), either run a separate Redis instance, or use `volatile-lru`, which only evicts keys that have a TTL.
- Persistence (RDB + AOF): RDB creates point-in-time snapshots at intervals. AOF logs every write operation. When AOF is enabled, Redis rebuilds the dataset from the AOF on restart; recent versions write the AOF with an RDB-format base by default, which makes loading faster, and the RDB snapshots remain useful as compact backups. For a pure caching workload, you can disable persistence entirely: if Redis restarts, the cache is empty and rebuilds from database queries. But for rate limiting state and job queues, persistence prevents data loss on restart.
- rename-command KEYS "": This disables the `KEYS` command entirely. As we discussed, `KEYS` blocks the entire Redis instance and should never be used in production. Disabling it prevents accidental misuse.
Connection Pooling
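Every Redis connection consumes a file descriptor on both the client and server. If each request to your application opens a new Redis connection, you will quickly exhaust available file descriptors under load. Connection pooling solves this by maintaining a fixed number of persistent connections that are shared across requests. A redis-py client already manages a pool internally, so the practical rule is to create one client (or one explicit pool) at startup and share it, rather than constructing a new client per request.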
```python
import redis

pool = redis.ConnectionPool(
    host='localhost',
    port=6379,
    password='your_strong_password_here',
    max_connections=50,
    socket_timeout=5,
    socket_connect_timeout=2,
    retry_on_timeout=True
)
r = redis.Redis(connection_pool=pool)

# All operations reuse connections from the pool
value = '{"name": "Ada"}'
r.get("user:123:profile")  # Uses pooled connection
r.setex("cache:data", 300, value)  # Same pool
```
Redis Beyond These Three Patterns
Once you have Redis in your stack, you will find uses for it everywhere. It is not just a cache or a queue — it is a Swiss Army knife for backend infrastructure:
- Session store: Store user sessions in Redis instead of server memory. This lets users stay logged in even if they get routed to a different server behind your load balancer. Sessions expire automatically via TTL.
- Real-time leaderboard: Redis Sorted Sets are purpose-built for leaderboards. Add a score for a user with `ZADD`, get the top 10 with `ZREVRANGE`, get a user's rank with `ZREVRANK`. These are O(log N) operations (range reads add the number of elements returned), even with millions of entries.
- Message broker: Redis Pub/Sub or Redis Streams can broadcast events between services. When a user updates their profile, publish an event. All services subscribed to that channel receive the update and can invalidate their local caches, update search indexes, or trigger notifications.
- Distributed locks: Use `SET key value NX EX 30` to acquire a lock that automatically expires. This prevents two workers from processing the same job simultaneously, or two servers from running the same scheduled task at the same time.
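To make the leaderboard semantics concrete, here is a hypothetical `Leaderboard` class that mimics what `ZADD`, `ZREVRANGE`, and `ZREVRANK` return, using a dict. It models the behavior only, not the performance: it re-sorts on every read, which is O(N log N), whereas a Redis sorted set keeps members ordered.

```python
class Leaderboard:
    def __init__(self):
        self.scores = {}                     # member -> score

    def zadd(self, member: str, score: float) -> None:
        self.scores[member] = score          # adding an existing member updates its score

    def _ordered(self):
        # Highest score first; equal scores in descending lexicographic order, as ZREVRANGE does
        return sorted(self.scores, key=lambda m: (self.scores[m], m), reverse=True)

    def top(self, n: int):
        return [(m, self.scores[m]) for m in self._ordered()[:n]]

    def rank(self, member: str):
        """0-based rank from the top, or None if the member is absent (like ZREVRANK)."""
        if member not in self.scores:
            return None
        return self._ordered().index(member)
```

Ties are the detail most often missed: two players with equal scores are ordered by member name, so if you need "first to reach the score wins", encode that into the score itself.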
The Checklist
Before you deploy your next backend application, verify that you have addressed each of these:
- Rate limiting is in place with atomic Lua scripts (not separate read/increment commands)
- A circuit breaker falls back to local rate limiting if Redis is unreachable
- Cache-aside pattern is implemented for all read-heavy database queries
- Every cache key has a TTL — no exceptions
- Cache stampede protection (locking) is in place for high-traffic keys
- You are using SCAN, never KEYS, for any key enumeration
- All slow operations (emails, file processing, external API calls, report generation) are handled by background workers
- Failed jobs retry with exponential backoff and jitter
- A dead letter queue catches jobs that exhaust all retries
- Redis is configured with `maxmemory`, an eviction policy, persistence, and a password
- Connection pooling is in place with sensible timeouts
- You are monitoring cache hit rates, memory usage, eviction counts, and DLQ depth
The bottom line: Rate limiting, caching, and background jobs are not optional for production backends. They are the minimum infrastructure that separates a reliable service from one that falls over under real traffic. Redis makes all three straightforward to implement. If you learn one piece of infrastructure deeply this year, make it Redis — it will pay dividends in every backend project you build.