Ahmed Waregh

Failure Recovery Patterns in Microservices

Platform simulating microservice failures to evaluate retries, circuit breakers, bulkheads, and idempotency. Measures reliability, latency, and duplicate prevention to guide resilient system design.

circuit breakers · retry patterns · bulkhead isolation · outbox pattern
Python · FastAPI · PostgreSQL · Redis · Docker · Prometheus · Grafana

Problem

Microservice architectures introduce new failure modes — cascading failures, retry storms, and partial outages can bring down entire systems. Understanding which resilience patterns actually help (and which make things worse) requires empirical evidence under realistic failure conditions. This platform provides that evidence by implementing six incrementally complex resilience configurations and measuring their performance under chaos scenarios.

Architecture Overview

Five FastAPI microservices form a realistic order processing system:

  • Gateway — entry point with load shedding
  • Orders — order management with outbox pattern
  • Payments — payment processing (chaos injection target)
  • Inventory — stock management with lock contention simulation
  • Notifications — async notification delivery via outbox

Infrastructure includes PostgreSQL, Redis, Prometheus, Grafana, and Jaeger for full observability.
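The Orders and Notifications services rely on the outbox pattern: an event row is written in the same database transaction as the business data, and a background worker relays unsent events afterward, so an event is never published for a rolled-back write. A minimal sketch of the idea, using SQLite in place of PostgreSQL and illustrative table and function names (not the project's actual schema):

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL here; the table layout is illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0);
""")

def place_order(item):
    # One transaction: the order row and its event commit (or roll back) together.
    with db:
        cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        event = json.dumps({"type": "order_created", "order_id": cur.lastrowid})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))

def relay_outbox(publish):
    # Background worker: publish each unsent event, then mark it sent.
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)  # e.g. push to a message broker or notification service
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("widget")
delivered = []
relay_outbox(delivered.append)
```

If the relay crashes after publishing but before marking a row sent, the event is delivered again on the next pass, which is why the consuming side also needs idempotency keys (see the pattern table below).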

Resilience Patterns Implemented

| Pattern | Implementation |
|---------|---------------|
| Exponential backoff + jitter | max_attempts=3, base_delay=100ms |
| Circuit breaker (rolling window) | threshold=5, open=30s |
| Bulkhead (async semaphore) | payments=20, inventory=20 |
| Per-hop timeouts + deadline | read=10s, deadline=25s |
| Idempotency keys (Redis-backed) | TTL=24h |
| Backpressure / load shedding | max_inflight=200 |
| Outbox pattern | Background worker + Postgres |
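The first row of the table, exponential backoff with full jitter, fits in a few lines. This is a standalone sketch rather than the project's code; the parameter names mirror the max_attempts=3 / base_delay=100ms configuration above:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call fn(), retrying on exception with exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the exponential cap,
            # so synchronized clients do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, cap))

# Example: a call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky))  # prints "ok" after two retries
```

The jitter is what separates this from the naive retries in the results table: without it, every client that saw the same failure retries at the same instant, which is exactly the retry-storm behavior the experiment measures.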

Key Results

| Config | Success Rate | P95 Latency | Retry Amplification |
|--------|-------------|-------------|---------------------|
| Baseline (no resilience) | 80% | 340ms | 1.0x |
| Retries only (naive) | 51% | 890ms | 2.6x |
| + Circuit breakers | 92% | 210ms | 1.1x |
| Full stack | 94% | 185ms | 1.1x |

Naive retries drop the success rate below baseline (51% vs. 80%): at 2.6x retry amplification, each failure multiplies load on an already-degraded service. Adding circuit breakers fails fast instead, restoring the success rate to 92% and cutting MTTR from 120s to 35s.
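The CLOSED → OPEN → HALF-OPEN cycle behind those circuit-breaker numbers can be sketched as below. This is a minimal consecutive-failure version with illustrative names, not the project's rolling-window implementation; the defaults mirror the threshold=5 / open=30s configuration from the pattern table:

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; after
    `open_seconds` the next call is a HALF-OPEN probe that either
    closes the circuit (success) or re-opens it (failure)."""

    def __init__(self, threshold=5, open_seconds=30.0):
        self.threshold = threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"  # let a single probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```

Failing fast while OPEN is what sheds load from the broken dependency; the half-open probe is what lets the system recover without a thundering herd, and it is the transition the interactive demo below visualizes.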

Tech Stack

  • Backend: Python, FastAPI
  • Database: PostgreSQL
  • Cache: Redis
  • Observability: Prometheus, Grafana, Jaeger, OpenTelemetry
  • Load Testing: k6
  • Infrastructure: Docker, Docker Compose

Interactive Demo

Inject failures to trip circuit breakers, then watch recovery through half-open probes.


Quick Start

Clone and run locally with Docker:

git clone https://github.com/awaregh/Failure-Recovery-Patterns-in-Microservices.git && cd Failure-Recovery-Patterns-in-Microservices && docker-compose up --build
Full setup in README