Ahmed Waregh

LLM Gateway — AI Infrastructure

Production-ready unified API gateway for routing requests across multiple LLM providers with built-in rate limiting, response caching, cost tracking, and OpenTelemetry observability.

API gateway · model routing · rate limiting · cost tracking
Python · FastAPI · Redis · PostgreSQL · Docker · OpenTelemetry

Problem

Organizations using multiple LLM providers (OpenAI, Anthropic, etc.) face fragmented integration, inconsistent rate limiting, duplicated cost tracking, and no unified observability. Each provider exposes a different API, pricing model, and set of rate limits, creating operational complexity that grows with every model adopted. A unified gateway abstracts these differences behind a single API.

Architecture Overview

The gateway provides a unified REST API (POST /v1/chat, POST /v1/embeddings) that routes requests to the optimal provider based on configurable strategies. Key infrastructure components include:

  • Model Routing Engine — supports cost-optimized, latency-optimized, capability-based, and round-robin strategies
  • Rate Limiter — sliding-window algorithm with per-user and per-tenant limits, Redis-backed for multi-instance deployment (see the sketch after this list)
  • Response Cache — SHA-256 keyed cache with configurable TTL, reducing redundant API calls
  • Cost Tracker — per-request token usage and USD cost stored in database with analytics API
  • Observability — structured logging with OpenTelemetry metrics and traces
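
One common way to implement the sliding-window rate limiter is a Redis sorted set holding one timestamped member per request. The sketch below assumes redis-py and is illustrative only; the class and key names are hypothetical, not the gateway's actual code.

```python
import time
import uuid

import redis


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for a given key."""

    def __init__(self, client: redis.Redis, limit: int, window_s: float):
        self.client = client
        self.limit = limit
        self.window_s = window_s

    def allow(self, key: str) -> bool:
        now = time.time()
        member = f"{now}:{uuid.uuid4().hex}"  # unique member per request
        pipe = self.client.pipeline()
        pipe.zremrangebyscore(key, 0, now - self.window_s)  # evict entries older than the window
        pipe.zadd(key, {member: now})                       # record this request
        pipe.zcard(key)                                     # count requests still in the window
        pipe.expire(key, int(self.window_s) + 1)            # let idle keys expire on their own
        _, _, count, _ = pipe.execute()
        return count <= self.limit


# Usage: SlidingWindowLimiter(redis.Redis(), limit=60, window_s=60.0).allow("tenant:acme")
```

redis-py pipelines wrap commands in MULTI/EXEC by default, so the four commands apply atomically even with multiple gateway instances sharing one Redis.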

Routing Strategies

| Strategy | Description |
|----------|-------------|
| Cost (default) | Lowest combined input+output cost |
| Latency | Lowest typical response latency |
| Capability | Highest context window |
| Round Robin | Rotate through available providers |

An automatic fallback chain keeps the gateway available: small requests route to gpt-4o-mini, large-context requests route to claude-3-5-sonnet, and a local echo provider is always available when no API keys are configured.
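
A minimal sketch of the cost strategy under stated assumptions: the pricing table below is made up for the example and is not the gateway's real model registry.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    name: str
    input_usd_per_1k: float   # illustrative prices, not real pricing data
    output_usd_per_1k: float
    context_window: int


MODELS = [
    ModelInfo("gpt-4o-mini", 0.00015, 0.0006, 128_000),
    ModelInfo("claude-3-5-sonnet", 0.003, 0.015, 200_000),
]


def route_by_cost(input_tokens: int, max_output_tokens: int) -> ModelInfo:
    """Pick the cheapest model whose context window fits the request."""
    fits = [m for m in MODELS if m.context_window >= input_tokens + max_output_tokens]
    if not fits:
        raise ValueError("request exceeds every model's context window")
    return min(
        fits,
        key=lambda m: (input_tokens / 1000) * m.input_usd_per_1k
                    + (max_output_tokens / 1000) * m.output_usd_per_1k,
    )
```

The latency and capability strategies would swap the `key` function for a typical-latency estimate or the context window, while round robin keeps a rotating index instead of scoring.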

Technical Decisions

  • Provider abstraction — each LLM provider is an adapter implementing a common interface, making new provider integration a single-file addition (sketched after this list)
  • Redis-optional — rate limiting and caching work with in-memory fallbacks when Redis is unavailable, enabling single-process development
  • SQLite for development, PostgreSQL for production — cost tracking database is configurable via connection string
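
The adapter interface from the first bullet might look like the sketch below; the names and fields are assumptions, not the repo's actual definitions. The echo adapter mirrors the keyless local fallback described earlier.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChatResult:
    text: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


class ProviderAdapter(ABC):
    """Common interface every provider adapter implements."""

    name: str

    @abstractmethod
    async def chat(self, messages: list[dict], max_tokens: int) -> ChatResult:
        ...


class EchoProvider(ProviderAdapter):
    """Local fallback that needs no API key: echoes the last user message."""

    name = "echo"

    async def chat(self, messages: list[dict], max_tokens: int) -> ChatResult:
        text = messages[-1]["content"]
        tokens = len(text) // 4  # rough chars-to-tokens heuristic
        return ChatResult(text=text, input_tokens=tokens,
                          output_tokens=tokens, cost_usd=0.0)
```

Adding a new provider then means writing one adapter and registering it with the router, which is what makes integration a single-file change.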

Tech Stack

  • Backend: Python, FastAPI
  • Cache/Rate Limiting: Redis (with in-memory fallback)
  • Database: SQLite / PostgreSQL
  • Observability: OpenTelemetry, structured logging
  • Infrastructure: Docker, Docker Compose

Interactive Demo

Route requests across multiple LLM providers and compare cost, latency, and availability under each routing strategy.


Quick Start

Clone and run locally with Docker:

```bash
git clone https://github.com/awaregh/LLM-Gateway-AI-Infrastructure-.git
cd LLM-Gateway-AI-Infrastructure-
cp .env.example .env
docker compose up
```
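
Once the containers are up, a quick smoke test against the unified endpoint might look like this; the port and request/response schema below are assumptions, not documented behavior:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat",  # assumed default port
    json={
        "messages": [{"role": "user", "content": "Hello, gateway!"}],
        "strategy": "cost",  # hypothetical field selecting the routing strategy
        "max_tokens": 64,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

With no API keys in .env, the request should land on the local echo provider described above.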
Full setup in README