By the end of this article, you will have a working Python microservices skeleton that handles failures gracefully, retries intelligently, and never takes your whole platform down because of one bad service.
What Is a Resilient Microservices Architecture?

A microservice is a small, independently deployable service that does one thing well: handling payments, managing users, or sending emails. Instead of one giant application (a monolith), you split your system into many small services that talk to each other.
Resilience means the system keeps working, perhaps in a degraded state, even when parts of it fail.
Think of it like a power grid: if one neighborhood loses power, the rest of the city stays lit. That’s what we’re building.
Prerequisites
- Python 3.10+
- Basic knowledge of REST APIs
- pip and a virtual environment
- Docker (optional, for running services locally)
Install the required libraries:
pip install fastapi uvicorn httpx tenacity pybreaker structlog prometheus-client
Project Structure
resilient-microservices/
│
├── services/
│ ├── user_service/
│ │ ├── main.py
│ │ └── health.py
│ ├── order_service/
│ │ ├── main.py
│ │ ├── circuit_breaker.py
│ │ └── retry.py
│ └── notification_service/
│ └── main.py
│
├── gateway/
│ └── main.py
│
├── shared/
│ ├── logger.py
│ └── models.py
│
└── docker-compose.yml
Build the Base Services with FastAPI
Each service is a small, independent FastAPI application.
User Service (services/user_service/main.py)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import structlog

log = structlog.get_logger()
app = FastAPI(title="User Service")

class User(BaseModel):
    id: int
    name: str
    email: str

# Fake in-memory database
USERS = {
    1: User(id=1, name="Alice", email="alice@example.com"),
    2: User(id=2, name="Bob", email="bob@example.com"),
}

@app.get("/users/{user_id}", response_model=User)
async def get_user(user_id: int):
    log.info("Fetching user", user_id=user_id)
    if user_id not in USERS:
        raise HTTPException(status_code=404, detail="User not found")
    return USERS[user_id]

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    # Check dependencies here (DB, cache, etc.)
    return {"status": "ready"}
Order Service (services/order_service/main.py)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx
import structlog

log = structlog.get_logger()
app = FastAPI(title="Order Service")

USER_SERVICE_URL = "http://localhost:8001"

class Order(BaseModel):
    order_id: int
    user_id: int
    item: str
    total: float

@app.post("/orders", response_model=dict)
async def create_order(order: Order):
    log.info("Creating order", order_id=order.order_id, user_id=order.user_id)
    # Call the User Service to validate that the user exists
    async with httpx.AsyncClient(timeout=3.0) as client:
        try:
            resp = await client.get(f"{USER_SERVICE_URL}/users/{order.user_id}")
            resp.raise_for_status()
        except httpx.HTTPStatusError:
            raise HTTPException(status_code=404, detail="User not found")
        except httpx.RequestError:
            raise HTTPException(status_code=503, detail="User service unavailable")
    return {"message": "Order created", "order_id": order.order_id}

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    return {"status": "ready"}
Add Retry with Exponential Backoff
Transient failures (network blips, brief service restarts) should be retried automatically, but with increasing delays so you don’t hammer a recovering service.
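To see the shape of that schedule, here is a generic exponential-backoff calculation in plain Python. It illustrates the doubling-with-a-cap pattern only; tenacity's wait_exponential has its own exact formula and parameters.

```python
# Generic exponential backoff: the delay doubles each attempt, capped at a max.
# Illustrative only; not tenacity's exact internal formula.
def backoff_delays(retries: int, base: float = 1.0, cap: float = 8.0) -> list[float]:
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 8.0]
```

The cap matters: without it, a few retries in a row would produce minute-long waits that tie up the caller.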
services/order_service/retry.py
import logging

import httpx
import structlog
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

log = structlog.get_logger()

# Retry up to 3 attempts, with exponentially growing waits capped at 8 seconds
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    retry=retry_if_exception_type((httpx.RequestError, httpx.TimeoutException)),
    before_sleep=before_sleep_log(logging.getLogger("tenacity"), logging.WARNING),
)
async def fetch_user_with_retry(client: httpx.AsyncClient, user_service_url: str, user_id: int):
    """
    Fetches a user from the User Service with automatic retry on transient errors.
    Raises immediately on 4xx (non-transient) errors.
    """
    resp = await client.get(f"{user_service_url}/users/{user_id}", timeout=3.0)
    if resp.status_code == 404:
        # Don't retry: the user genuinely doesn't exist
        resp.raise_for_status()
    if resp.status_code >= 500:
        # Server error: raise a transient error type so tenacity retries
        raise httpx.RequestError(f"Server error: {resp.status_code}")
    return resp.json()
Usage in Order Service:
from retry import fetch_user_with_retry

async with httpx.AsyncClient() as client:
    user = await fetch_user_with_retry(client, USER_SERVICE_URL, order.user_id)
Implement the Circuit Breaker
A circuit breaker monitors failures. When they exceed a threshold, it “trips,” blocking further calls for a cooldown period so a struggling service isn’t overwhelmed.
[CLOSED] ──(failures > 5)──▶ [OPEN] ──(30s timeout)──▶ [HALF-OPEN]
    ▲                                                       │
    └────────────────────────(success)──────────────────────┘
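The state machine above can be made concrete with a minimal hand-rolled breaker. This is an illustration only (the article uses pybreaker below); the clock parameter exists purely to make the cooldown testable.

```python
import time

class MiniBreaker:
    """Toy circuit breaker illustrating the CLOSED / OPEN / HALF-OPEN states."""

    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"  # cooldown elapsed: allow one probe call
            else:
                raise RuntimeError("circuit open: call rejected without executing")
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Too many failures, or a failed probe, reopens the circuit
            if self.failures >= self.fail_max or self.state == "HALF-OPEN":
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"  # any success closes the breaker again
        return result
```

Note that in the OPEN state the wrapped function is never invoked at all; that is the whole point, since the fastest call is the one you don't make.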
services/order_service/circuit_breaker.py
import pybreaker
import httpx
import structlog

log = structlog.get_logger()

class CircuitBreakerListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        log.warning(
            "Circuit breaker state changed",
            name=cb.name,
            old=str(old_state),
            new=str(new_state),
        )

# Trip after 5 failures; stay open for 30 seconds
user_service_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    name="UserService",
    listeners=[CircuitBreakerListener()],
)

@user_service_breaker
async def call_user_service(client: httpx.AsyncClient, url: str, user_id: int):
    """
    Wrapped call to the User Service. The circuit breaker monitors this function.
    If it fails 5 times, the breaker opens and immediately raises
    CircuitBreakerError for 30 seconds — no actual HTTP calls are made.
    """
    resp = await client.get(f"{url}/users/{user_id}", timeout=3.0)
    resp.raise_for_status()
    return resp.json()
Using it in Order Service with a fallback:
import pybreaker
from circuit_breaker import call_user_service

async with httpx.AsyncClient() as client:
    try:
        user = await call_user_service(client, USER_SERVICE_URL, order.user_id)
    except pybreaker.CircuitBreakerError:
        # Fallback: return a cached/default response instead of crashing
        log.error("Circuit open — using fallback for user lookup")
        user = {"id": order.user_id, "name": "Unknown", "email": ""}
    except httpx.HTTPStatusError:
        raise HTTPException(status_code=404, detail="User not found")
Structured Logging with Correlation IDs
When a request touches 5 services, you need a single correlation ID to trace it through all of them. Structured logging makes this searchable in tools like Grafana Loki or ELK.
shared/logger.py
import uuid

import structlog
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware

# Configure structlog for JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

class CorrelationIDMiddleware(BaseHTTPMiddleware):
    """
    Injects a correlation ID into every request.
    Reads the X-Correlation-ID header if provided, otherwise generates one.
    Passes it downstream in the response header too.
    """

    async def dispatch(self, request: Request, call_next):
        correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            correlation_id=correlation_id,
            service=request.app.title,
            path=request.url.path,
            method=request.method,
        )
        response = await call_next(request)
        response.headers["X-Correlation-ID"] = correlation_id
        return response
Attach it to any FastAPI app:
from shared.logger import CorrelationIDMiddleware
app.add_middleware(CorrelationIDMiddleware)
Sample log output (JSON):
{
  "event": "Creating order",
  "order_id": 42,
  "user_id": 1,
  "correlation_id": "e3f1c2a9-...",
  "service": "Order Service",
  "level": "info",
  "timestamp": "2026-04-16T10:22:01Z"
}
Expose Prometheus Metrics
Add observability by exposing metrics that Prometheus can scrape, specifically the RED metrics: Rate, Errors, Duration.
# Add to any service's main.py
from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["service", "method", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["service", "endpoint"],
)
CIRCUIT_BREAKER_TRIPS = Counter(
    "circuit_breaker_trips_total",
    "Number of times the circuit breaker tripped",
    ["service", "dependency"],
)

# Mount the /metrics endpoint for Prometheus scraping
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
API Gateway
The gateway is the single entry point. It handles routing, auth headers, and injects correlation IDs before forwarding to services.
gateway/main.py
from fastapi import FastAPI, Request, HTTPException
import httpx
import uuid

app = FastAPI(title="API Gateway")

ROUTES = {
    "/users": "http://localhost:8001",
    "/orders": "http://localhost:8002",
}

@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
async def gateway(path: str, request: Request):
    # Match the route prefix
    target_base = None
    for prefix, url in ROUTES.items():
        if f"/{path}".startswith(prefix):
            target_base = url
            break
    if not target_base:
        raise HTTPException(status_code=404, detail="Route not found")

    # Forward the request with a correlation ID
    headers = dict(request.headers)
    headers["X-Correlation-ID"] = headers.get("X-Correlation-ID", str(uuid.uuid4()))

    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.request(
                method=request.method,
                url=f"{target_base}/{path}",
                headers=headers,
                content=await request.body(),
            )
            return resp.json()
        except httpx.RequestError:
            raise HTTPException(status_code=503, detail="Upstream service unavailable")
Docker Compose (Run Everything Locally)
docker-compose.yml
version: "3.9"

services:
  user-service:
    build: ./services/user_service
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/live"]
      interval: 10s
      timeout: 5s
      retries: 3

  order-service:
    build: ./services/order_service
    ports:
      - "8002:8000"
    depends_on:
      user-service:
        condition: service_healthy
    environment:
      - USER_SERVICE_URL=http://user-service:8000

  gateway:
    build: ./gateway
    ports:
      - "8000:8000"
    depends_on:
      - user-service
      - order-service

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
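The compose file mounts a prometheus.yml that isn't shown above. A minimal version might look like this (the job names and scrape targets are assumptions based on the compose service names; `/metrics`, which `make_asgi_app` exposes, is Prometheus's default scrape path):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "user-service"
    static_configs:
      - targets: ["user-service:8000"]
  - job_name: "order-service"
    static_configs:
      - targets: ["order-service:8000"]
  - job_name: "gateway"
    static_configs:
      - targets: ["gateway:8000"]
```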
Start everything:
docker-compose up --build
Testing Resilience
Simulate a service failure:
# Kill the user service
docker-compose stop user-service
# Try to create an order: you should get the fallback response, not a 500 crash
curl -X POST http://localhost:8000/orders \
-H "Content-Type: application/json" \
-d '{"order_id": 1, "user_id": 1, "item": "Laptop", "total": 999.99}'
Watch the circuit breaker trip:
# Send 6 requests rapidly while user-service is down
for i in {1..6}; do curl -s http://localhost:8000/orders -X POST \
-H "Content-Type: application/json" \
-d '{"order_id": '$i', "user_id": 1, "item": "Test", "total": 10}'; done
After 5 failures, the 6th request is rejected instantly (no HTTP call is made), and the log shows Circuit breaker state changed: CLOSED → OPEN.
The golden rule of resilient microservices: design for failure, not for success. Every external call will eventually fail. Your job is to decide gracefully what happens when it does.