Ahmed Waregh

Projects

Production systems I've designed and built end-to-end — backend platforms, AI infrastructure, and distributed systems.

Problem: Transaction scoring latency was too high for real-time decisions; model drift was silent.
Approach: Built a streaming scoring API with LightGBM, added drift monitoring using PSI, and scheduled retraining on detected drift.
Result: Scoring p95 under 12ms. Drift caught 2 weeks before accuracy would have degraded measurably.
ML pipeline, real-time scoring, drift monitoring, cost optimization
Python, LightGBM, FastAPI, scikit-learn, Docker
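For illustration, the Population Stability Index (PSI) used for drift monitoring can be sketched in a few lines of Python. This is a minimal sketch, not the production code: the function name `psi`, the equal-width binning, the bin count, and the epsilon floor are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a recent sample.

    Bins are equal-width and derived from the baseline (an assumption of this sketch).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        total = len(values)
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
drift_score = psi(baseline, shifted)
```

A retraining job could then be triggered when `drift_score` exceeds a chosen threshold; values around 0.2 are a commonly quoted rule of thumb, but the right cut-off depends on the feature and the model.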
Problem: Workflow steps were tightly coupled; one failure cascaded into lost jobs with no recovery path.
Approach: Introduced a state-machine model per workflow run, persisted to Postgres, with BullMQ workers pulling from a durable queue. Idempotent step handlers allow safe retry.
Result: Job failure rate dropped from ~4% to under 0.1%. Recovery from worker crashes became automatic.
event-driven, distributed workers, multi-tenant, state machine
Node.js, PostgreSQL, Prisma, BullMQ, OpenAI
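The idea of a per-run state machine with safely retryable steps can be sketched as follows. This is an illustrative Python sketch rather than the Node.js implementation: the `WorkflowRun` class, the state names, and the in-memory `completed_steps` map (standing in for the Postgres row) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkflowRun:
    run_id: str
    steps: List[str]
    state: str = "pending"  # pending -> running -> completed | failed
    # Step name -> result. In a real system this would be persisted, not held in memory.
    completed_steps: Dict[str, str] = field(default_factory=dict)


def execute(run: WorkflowRun, handlers: Dict[str, Callable[[WorkflowRun], str]]) -> None:
    """Run the remaining steps of a workflow; safe to call again after a failure."""
    run.state = "running"
    for step in run.steps:
        if step in run.completed_steps:
            continue  # finished on an earlier attempt, so do not run it again
        try:
            run.completed_steps[step] = handlers[step](run)
        except Exception:
            run.state = "failed"  # a later retry resumes from this step
            return
    run.state = "completed"


# Usage: step "b" fails once, then succeeds on retry; step "a" runs only once.
calls = {"a": 0, "b": 0}


def step_a(run: WorkflowRun) -> str:
    calls["a"] += 1
    return "ok"


def step_b(run: WorkflowRun) -> str:
    calls["b"] += 1
    if calls["b"] == 1:
        raise RuntimeError("transient failure")
    return "ok"


handlers = {"a": step_a, "b": step_b}
run = WorkflowRun("run-1", ["a", "b"])
execute(run, handlers)
state_after_first_attempt = run.state
execute(run, handlers)
```

Skipping recorded steps is only one way to get idempotency; handlers with external side effects usually also need their own deduplication.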
Problem: Site builds were synchronous and blocking; concurrent publishes caused database contention.
Approach: Moved builds to async workers with S3 artifact storage and a CDN invalidation step. Added per-tenant build queues to prevent noisy-neighbour problems.
Result: Median build time fell from 8s to 1.4s. P99 dropped from 40s to under 6s.
versioned rendering, build workers, storage pipeline, multi-tenant
Node.js, PostgreSQL, S3, CDN, Docker
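One way per-tenant queues can limit noisy-neighbour effects is to serve tenants in round-robin order, so a tenant with a large backlog cannot starve the others. The sketch below assumes that scheduling policy; the `TenantQueues` class and its method names are hypothetical and are not taken from the actual system.

```python
from collections import OrderedDict, deque
from typing import Deque, Optional, Tuple


class TenantQueues:
    """Round-robin scheduler over per-tenant FIFO queues (illustrative only)."""

    def __init__(self) -> None:
        self._queues: "OrderedDict[str, Deque[str]]" = OrderedDict()

    def enqueue(self, tenant: str, job: str) -> None:
        self._queues.setdefault(tenant, deque()).append(job)

    def next_job(self) -> Optional[Tuple[str, str]]:
        if not self._queues:
            return None
        # Take one job from the tenant at the front of the rotation.
        tenant, queue = next(iter(self._queues.items()))
        job = queue.popleft()
        del self._queues[tenant]
        if queue:
            self._queues[tenant] = queue  # tenant still has work: move it to the back
        return tenant, job


# Usage: tenant "a" has a backlog, yet tenant "b" is served second.
queues = TenantQueues()
for job in ("a1", "a2", "a3"):
    queues.enqueue("a", job)
queues.enqueue("b", "b1")
order = [queues.next_job() for _ in range(4)]
```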
Problem: LLM responses cited wrong sources; hallucinated product details caused support escalations.
Approach: Built a RAG pipeline with citation enforcement: each answer must include a retrieved source chunk. Added a self-critique pass to flag low-confidence answers for human review.
Result: Hallucination rate (as measured by automated fact-check) fell by 68%. Escalation rate down 31%.
RAG pipeline, vector search, ingestion pipeline, conversation orchestration
Python, PostgreSQL, Pinecone, OpenAI, FastAPI
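Citation enforcement can be illustrated with a small validation step that runs on each generated answer. The sketch assumes answers mark sources inline as `[chunk-id]`; that marker format, the `check_citations` function, and the sample chunk IDs are hypothetical choices for the example.

```python
import re
from typing import Dict, List

# Assumed citation format: an ID in square brackets, e.g. "[doc-1]".
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, retrieved: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the answer passes."""
    cited = CITATION.findall(answer)
    problems: List[str] = []
    if not cited:
        problems.append("no citation")
    for chunk_id in cited:
        if chunk_id not in retrieved:
            # The model cited something that retrieval never returned.
            problems.append(f"unknown source: {chunk_id}")
    return problems


retrieved_chunks = {
    "doc-1": "Returns are accepted within 30 days of delivery.",
    "doc-2": "Refunds are issued to the original payment method.",
}
ok = check_citations("Returns are accepted within 30 days [doc-1].", retrieved_chunks)
missing = check_citations("Returns are accepted within 30 days.", retrieved_chunks)
unknown = check_citations("Returns take 90 days [doc-9].", retrieved_chunks)
```

An answer with a non-empty problem list could be regenerated or routed to human review. Note that this check only confirms a cited chunk exists, not that the chunk supports the claim.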

High-throughput streaming pipeline for ingesting, transforming, and routing millions of events per second.

stream processing, exactly-once delivery, schema registry, backpressure control
Go, Kafka, ClickHouse, Kubernetes, gRPC

Token-bucket and sliding-window rate limiting as a standalone service, supporting multi-region consistency.

token bucket, sliding window, multi-region sync, sidecar-ready
Go, Redis, gRPC, Prometheus, Docker
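As an illustration of the token-bucket technique, here is a minimal single-process sketch in Python. The service itself is listed as Go with Redis, so the `TokenBucket` class, its parameters, and the injectable clock are assumptions for the example; multi-region consistency is out of scope here.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_refill = [bucket.allow(), bucket.allow()]
```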

Research implementation of eight real-world schema evolution scenarios across a three-service microservices architecture, covering PostgreSQL migrations, REST API versioning, and event schema evolution with backward compatibility patterns.

backward compatibility, API versioning, event schema evolution, database migrations
Python, PostgreSQL, FastAPI, Kafka, Docker

CDC pipeline that streams database changes into an event log and supports consumers, replay, and schema evolution. A demo consumer builds projections from the log.

CDC, event sourcing, stream processing, schema evolution
Python, PostgreSQL, Kafka, Docker
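Building a projection by replaying a change log can be sketched as below. The event shape (`op`, `key`, `row`), the `build_projection` function, and the use of list positions as offsets are hypothetical; the actual pipeline's event format is not shown here.

```python
from typing import Any, Dict, Optional, Sequence

Event = Dict[str, Any]


def build_projection(events: Sequence[Event], from_offset: int = 0,
                     state: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Fold change events into a key -> latest-row view, starting at `from_offset`."""
    projection: Dict[str, Any] = dict(state or {})
    for offset, event in enumerate(events):
        if offset < from_offset:
            continue  # already applied before the checkpoint
        if event["op"] == "delete":
            projection.pop(event["key"], None)
        else:  # insert and update both overwrite the current row
            projection[event["key"]] = event["row"]
    return projection


log = [
    {"op": "insert", "key": "u1", "row": {"name": "Ada"}},
    {"op": "insert", "key": "u2", "row": {"name": "Bob"}},
    {"op": "update", "key": "u1", "row": {"name": "Ada L."}},
    {"op": "delete", "key": "u2", "row": None},
]
full = build_projection(log)
# Resuming from a checkpoint taken after the first two events gives the same view.
checkpoint = build_projection(log[:2])
resumed = build_projection(log, from_offset=2, state=checkpoint)
```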

A tool for testing and evaluating RAG retrieval pipelines by comparing chunking strategies, embedding models, and reranking methods using metrics like Precision@K and nDCG.

retrieval evaluation, chunking strategies, embedding comparison, reranking
Python, RAG, Embeddings, NLP
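For reference, the two metrics named above can be computed as follows. This is a minimal sketch using the standard definitions; the function names are illustrative, Precision@K divides by K even when fewer than K results are returned, and nDCG uses the linear-gain form with a log2 discount.

```python
import math
from typing import Dict, Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    actual = dcg([gains.get(doc, 0.0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


ranked_docs = ["d1", "d2", "d3"]
p_at_2 = precision_at_k(ranked_docs, {"d1", "d3"}, k=2)
ndcg_3 = ndcg_at_k(ranked_docs, {"d1": 1.0, "d3": 1.0}, k=3)
```

Running both metrics over the same query set for each chunking, embedding, or reranking variant gives directly comparable scores.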

Production-quality research system comparing six idempotency strategies for a payments API domain, built with FastAPI, PostgreSQL, Redis, and RabbitMQ.

idempotency patterns, distributed systems, saga pattern, outbox pattern
Python, FastAPI, PostgreSQL, Redis, RabbitMQ, Docker

Production-grade research system for evaluating, benchmarking, and mitigating hallucinations in enterprise LLM applications with multiple RAG variants and guardrail frameworks.

RAG pipeline, guardrails, citation enforcement, self-critique
Python, RAG, LLM, NLP, pytest

Centralized configuration service used across internal systems for managing application settings and feature flags.

configuration management, internal tooling
TypeScript, Node.js

Platform simulating microservice failures to evaluate retries, circuit breakers, bulkheads, and idempotency. Measures reliability, latency, and duplicate prevention to guide resilient system design.

circuit breakers, retry patterns, bulkhead isolation, outbox pattern
Python, FastAPI, PostgreSQL, Redis, Docker, Prometheus, Grafana
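A circuit breaker, one of the patterns evaluated here, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `CircuitBreaker` class, its thresholds, and the consecutive-failure policy are hypothetical and do not reflect the platform's actual implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


# Usage with a fake clock: two failures open the circuit, which recovers after the timeout.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: fake_now[0])


def failing_call() -> str:
    raise ValueError("downstream unavailable")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except ValueError:
        pass

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True

fake_now[0] = 10.0
recovered = breaker.call(lambda: "ok")
```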

Comprehensive empirical study examining how structural design decisions in Terraform infrastructure-as-code affect long-term maintainability, drift susceptibility, and change management complexity.

infrastructure as code, drift detection, maintainability metrics, reference architectures
Terraform, HCL, AWS, Python

Production-ready unified API gateway for routing requests across multiple LLM providers with built-in rate limiting, response caching, cost tracking, and OpenTelemetry observability.

API gateway, model routing, rate limiting, cost tracking
Python, FastAPI, Redis, PostgreSQL, Docker, OpenTelemetry