How to Create a Resilient Microservices Setup Using Python

By the end of this article, you will have a working Python microservices skeleton that handles failures gracefully, retries intelligently, and never takes your whole platform down because of one bad service.

What Is a Resilient Microservices Architecture?

A microservice is a small, independently deployable service that does one thing well, like handling payments, managing users, or sending emails. Instead of one giant application (a monolith), you split your system into many small services that talk to each other.

Resilience means the system keeps working, perhaps in a degraded state, even when parts of it fail.

Think of it like a power grid: if one neighborhood loses power, the rest of the city stays lit. That’s what we’re building.

Prerequisites

  • Python 3.10+
  • Basic knowledge of REST APIs
  • pip and a virtual environment
  • Docker (optional, for running services locally)

Install the required libraries:

pip install fastapi uvicorn httpx tenacity pybreaker structlog prometheus-client

Project Structure

resilient-microservices/
│
├── services/
│   ├── user_service/
│   │   ├── main.py
│   │   └── health.py
│   ├── order_service/
│   │   ├── main.py
│   │   ├── circuit_breaker.py
│   │   └── retry.py
│   └── notification_service/
│       └── main.py
│
├── gateway/
│   └── main.py
│
├── shared/
│   ├── logger.py
│   └── models.py
│
└── docker-compose.yml

Build the Base Services with FastAPI

Each service is a small, independent FastAPI application.

User Service (services/user_service/main.py)

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import structlog

log = structlog.get_logger()
app = FastAPI(title="User Service")

class User(BaseModel):
    id: int
    name: str
    email: str

# Fake in-memory database
USERS = {
    1: User(id=1, name="Alice", email="alice@example.com"),
    2: User(id=2, name="Bob",   email="bob@example.com"),
}

@app.get("/users/{user_id}", response_model=User)
async def get_user(user_id: int):
    log.info("Fetching user", user_id=user_id)
    if user_id not in USERS:
        raise HTTPException(status_code=404, detail="User not found")
    return USERS[user_id]

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    # Check dependencies here (DB, cache, etc.)
    return {"status": "ready"}
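The readiness probe above always reports ready. In practice it should aggregate real dependency checks so the orchestrator stops routing traffic when a dependency is down. A minimal sketch of that aggregation, with ping_database as a hypothetical placeholder for a real SELECT 1 or Redis PING:

```python
import asyncio

async def ping_database() -> bool:
    """Hypothetical dependency check; swap in a real SELECT 1 or Redis PING."""
    return True

async def readiness_payload() -> dict:
    """Aggregate dependency checks into a readiness response body."""
    checks = {"database": await ping_database()}
    ok = all(checks.values())
    return {"status": "ready" if ok else "degraded", "checks": checks}
```

The readiness endpoint would simply `return await readiness_payload()`; keep the liveness probe dumb so a failing dependency doesn't cause unnecessary restarts.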

Order Service (services/order_service/main.py)

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx
import os
import structlog

log = structlog.get_logger()
app = FastAPI(title="Order Service")

# Read from the environment so the same code works locally and under Docker Compose
USER_SERVICE_URL = os.getenv("USER_SERVICE_URL", "http://localhost:8001")

class Order(BaseModel):
    order_id: int
    user_id: int
    item: str
    total: float

@app.post("/orders", response_model=dict)
async def create_order(order: Order):
    log.info("Creating order", order_id=order.order_id, user_id=order.user_id)

    # Call User Service to validate user exists
    async with httpx.AsyncClient(timeout=3.0) as client:
        try:
            resp = await client.get(f"{USER_SERVICE_URL}/users/{order.user_id}")
            resp.raise_for_status()
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 404:
                raise HTTPException(status_code=404, detail="User not found")
            raise HTTPException(status_code=502, detail="User service returned an error")
        except httpx.RequestError:
            raise HTTPException(status_code=503, detail="User service unavailable")

    return {"message": "Order created", "order_id": order.order_id}

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    return {"status": "ready"}

Add Retry with Exponential Backoff

Transient failures (network blips, brief service restarts) should be retried automatically, but with increasing delays so you don’t hammer a recovering service.
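The mechanics are small enough to hand-roll. Here is a minimal sketch (names are my own) that treats ConnectionError as the transient failure and adds "full jitter" so many retrying clients don’t synchronize:

```python
import asyncio
import random

async def retry_with_backoff(op, attempts: int = 3, base: float = 1.0, cap: float = 8.0):
    """Retry an async callable on transient errors, backing off exponentially.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)] ("full
    jitter"), which spreads retries out instead of synchronizing them.
    """
    for attempt in range(attempts):
        try:
            return await op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # attempts exhausted: propagate the last error
            await asyncio.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

In production, a library such as tenacity gives you the same behavior plus logging hooks and richer stop/wait policies, which is what the next file uses.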

services/order_service/retry.py

import httpx
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
    before_sleep_log,
)
import structlog
import logging

log = structlog.get_logger()

# Up to 3 attempts total, with exponential backoff between them: 1s → 2s
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    retry=retry_if_exception_type((httpx.RequestError, httpx.TimeoutException)),
    before_sleep=before_sleep_log(logging.getLogger("tenacity"), logging.WARNING),
)
async def fetch_user_with_retry(client: httpx.AsyncClient, user_service_url: str, user_id: int):
    """
    Fetches a user from the User Service with automatic retry on transient errors.
    Raises immediately on 4xx (non-transient) errors.
    """
    resp = await client.get(f"{user_service_url}/users/{user_id}", timeout=3.0)

    if resp.status_code == 404:
        # Don't retry — user genuinely doesn't exist
        resp.raise_for_status()

    if resp.status_code >= 500:
        # Server error — raise so tenacity can retry
        raise httpx.RequestError(f"Server error: {resp.status_code}")

    return resp.json()

Usage in Order Service:

from retry import fetch_user_with_retry

async with httpx.AsyncClient() as client:
    user = await fetch_user_with_retry(client, USER_SERVICE_URL, order.user_id)

Implement the Circuit Breaker

A circuit breaker monitors failures. When they exceed a threshold, it “trips,” blocking further calls for a cooldown period so a struggling service isn’t overwhelmed while it recovers.

[CLOSED] ──(5 failures)──▶ [OPEN] ──(30s timeout)──▶ [HALF-OPEN]
   ▲                                                       │
   └──────────────────────(success)────────────────────────┘
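To make the state machine concrete before reaching for a library, here is a toy single-threaded breaker (illustration only, names are my own):

```python
import time

class MiniBreaker:
    """Toy circuit breaker showing the CLOSED -> OPEN -> HALF-OPEN cycle."""

    def __init__(self, fail_max: int = 5, reset_timeout: float = 30.0):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow one trial call
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open: call rejected without attempting")
        trial = self.state == "half-open"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if trial or self.failures >= self.fail_max:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None  # any success closes the breaker
        return result
```

pybreaker implements the same cycle with thread safety, listeners, and persistence options, so the services use it instead.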

services/order_service/circuit_breaker.py

import pybreaker
import httpx
import structlog

log = structlog.get_logger()

class CircuitBreakerListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        log.warning(
            "Circuit breaker state changed",
            name=cb.name,
            old=str(old_state),
            new=str(new_state),
        )

# Trip after 5 failures; stay open for 30 seconds
user_service_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    name="UserService",
    listeners=[CircuitBreakerListener()],
)

@user_service_breaker
async def call_user_service(client: httpx.AsyncClient, url: str, user_id: int):
    """
    Wrapped call to User Service. The circuit breaker monitors this function.
    If it fails 5 times, the breaker opens and immediately raises
    CircuitBreakerError for 30 seconds — no actual HTTP calls are made.
    """
    resp = await client.get(f"{url}/users/{user_id}", timeout=3.0)
    resp.raise_for_status()
    return resp.json()

Using it in Order Service with a fallback:

import pybreaker
from circuit_breaker import call_user_service

async with httpx.AsyncClient() as client:
    try:
        user = await call_user_service(client, USER_SERVICE_URL, order.user_id)
    except pybreaker.CircuitBreakerError:
        # Fallback: return cached/default response instead of crashing
        log.error("Circuit open — using fallback for user lookup")
        user = {"id": order.user_id, "name": "Unknown", "email": ""}
    except httpx.HTTPStatusError:
        raise HTTPException(status_code=404, detail="User not found")

Structured Logging with Correlation IDs

When a request touches 5 services, you need a single correlation ID to trace it through all of them. Structured logging makes this searchable in tools like Grafana Loki or ELK.

shared/logger.py

import structlog
import uuid
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware

# Configure structlog for JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

class CorrelationIDMiddleware(BaseHTTPMiddleware):
    """
    Injects a correlation ID into every request.
    Reads X-Correlation-ID header if provided, otherwise generates one.
    Passes it downstream in the response header too.
    """
    async def dispatch(self, request: Request, call_next):
        correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))

        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            correlation_id=correlation_id,
            service=request.app.title,
            path=request.url.path,
            method=request.method,
        )

        response = await call_next(request)
        response.headers["X-Correlation-ID"] = correlation_id
        return response

Attach it to any FastAPI app:

from shared.logger import CorrelationIDMiddleware
app.add_middleware(CorrelationIDMiddleware)

Sample log output (JSON):

{
  "event": "Creating order",
  "order_id": 42,
  "user_id": 1,
  "correlation_id": "e3f1c2a9-...",
  "service": "Order Service",
  "level": "info",
  "timestamp": "2026-04-16T10:22:01Z"
}
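One thing the middleware does not do is forward the ID on outbound calls: downstream services only see it if you send the header explicitly. structlog.contextvars is built on the stdlib contextvars module, and the same mechanism can carry the ID to your HTTP client. A stdlib-only sketch (the helper name is my own):

```python
import uuid
from contextvars import ContextVar

# Bound per-request by the middleware; empty string means "not bound yet".
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

def outbound_headers() -> dict:
    """Headers for a downstream call, forwarding the current correlation ID."""
    cid = correlation_id.get() or str(uuid.uuid4())  # mint one if none is bound
    return {"X-Correlation-ID": cid}
```

In the Order Service you would then call, for example, `client.get(url, headers=outbound_headers())`, so one ID threads through every hop.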

Expose Prometheus Metrics

Add observability by exposing metrics that Prometheus can scrape, specifically the RED metrics: Rate, Errors, Duration.

# Add to any service's main.py
from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["service", "method", "endpoint", "status"]
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["service", "endpoint"]
)

CIRCUIT_BREAKER_TRIPS = Counter(
    "circuit_breaker_trips_total",
    "Number of times circuit breaker tripped",
    ["service", "dependency"]
)

# Mount /metrics endpoint for Prometheus scraping
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
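Declaring the metrics is only half the job: nothing increments them yet. Conceptually, per-request recording looks like the plain-Python sketch below (names are my own); in the real services you would make the equivalent calls on the prometheus_client objects inside a middleware, e.g. `REQUEST_COUNT.labels(...).inc()` and `REQUEST_LATENCY.labels(...).observe(elapsed)`.

```python
import time
from collections import defaultdict

REQUEST_COUNT = defaultdict(int)     # key: (method, endpoint, status)
REQUEST_LATENCY = defaultdict(list)  # key: endpoint -> observed seconds

def record_request(method, endpoint, handler):
    """Call handler(), timing it and recording rate/error/duration data."""
    start = time.perf_counter()
    status = 500  # assume error unless the handler returns normally
    try:
        status, body = handler()
        return status, body
    finally:
        REQUEST_COUNT[(method, endpoint, status)] += 1
        REQUEST_LATENCY[endpoint].append(time.perf_counter() - start)
```

The `finally` block is the important part: errors are recorded too, which is what makes the Errors and Duration series trustworthy.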

API Gateway

The gateway is the single entry point. It handles routing, auth headers, and injects correlation IDs before forwarding to services.

gateway/main.py

from fastapi import FastAPI, Request, HTTPException
import httpx
import uuid

app = FastAPI(title="API Gateway")

ROUTES = {
    "/users":  "http://localhost:8001",
    "/orders": "http://localhost:8002",
}

@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
async def gateway(path: str, request: Request):
    # Match route prefix
    target_base = None
    for prefix, url in ROUTES.items():
        if f"/{path}".startswith(prefix):
            target_base = url
            break

    if not target_base:
        raise HTTPException(status_code=404, detail="Route not found")

    # Forward request with correlation ID
    headers = dict(request.headers)
    headers["X-Correlation-ID"] = headers.get("X-Correlation-ID", str(uuid.uuid4()))

    from fastapi.responses import Response  # local import keeps this snippet self-contained

    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.request(
                method=request.method,
                url=f"{target_base}/{path}",
                headers=headers,
                content=await request.body(),
            )
            # Pass through the upstream status code and body instead of always returning 200
            return Response(
                content=resp.content,
                status_code=resp.status_code,
                media_type=resp.headers.get("content-type"),
            )
        except httpx.RequestError:
            raise HTTPException(status_code=503, detail="Upstream service unavailable")

Docker Compose (Run Everything Locally)

docker-compose.yml

version: "3.9"

services:
  user-service:
    build: ./services/user_service
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/live"]
      interval: 10s
      timeout: 5s
      retries: 3

  order-service:
    build: ./services/order_service
    ports:
      - "8002:8000"
    depends_on:
      user-service:
        condition: service_healthy
    environment:
      - USER_SERVICE_URL=http://user-service:8000

  gateway:
    build: ./gateway
    ports:
      - "8000:8000"
    depends_on:
      - user-service
      - order-service

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

Start everything:

docker-compose up --build

Testing Resilience

Simulate a service failure:

# Kill the user service
docker-compose stop user-service

# Try to create an order: you should get the fallback, not a 500 crash
curl -X POST http://localhost:8000/orders \
  -H "Content-Type: application/json" \
  -d '{"order_id": 1, "user_id": 1, "item": "Laptop", "total": 999.99}'

Watch the circuit breaker trip:

# Send 6 requests rapidly while user-service is down
for i in {1..6}; do curl -s http://localhost:8000/orders -X POST \
  -H "Content-Type: application/json" \
  -d '{"order_id": '$i', "user_id": 1, "item": "Test", "total": 10}'; done

After 5 failures, the 6th request is rejected instantly (no HTTP call is made), and the log shows Circuit breaker state changed: CLOSED → OPEN.

The golden rule of resilient microservices: design for failure, not for success. Every external call will eventually fail; your job is to decide what happens when it does, and to make that path graceful.
