AI Workflow Automation Platform
Multi-tenant SaaS platform orchestrating AI workflows, integrations, and automations at scale.
Problem
Building automations at scale means handling wildly unpredictable execution paths — user-defined steps that call external APIs, process AI model outputs, branch on conditions, and fan out into parallel sub-workflows. The challenge wasn't any single technical problem, but orchestrating all of this reliably across hundreds of concurrent tenant workspaces, with full observability and zero cross-tenant bleed.
Architecture Overview
The core of the platform is a workflow state machine backed by PostgreSQL. Each workflow run is a row with a serialized state graph, current node pointer, and execution context. Workers pull jobs from BullMQ queues, advance the state machine one step at a time, and re-enqueue the next step. This made it trivial to implement retries, timeouts, and mid-run debugging — every state transition is a durable write.
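The step-at-a-time model can be sketched as a pure function over the run record (types and names here are illustrative, not the platform's actual schema; the real executor persists each transition to PostgreSQL and re-enqueues the next step on BullMQ):

```typescript
type NodeId = string;

interface WorkflowRun {
  id: string;
  // Serialized state graph: each node names its successor (null = terminal).
  graph: Record<NodeId, { next: NodeId | null }>;
  currentNode: NodeId | null;
  context: Record<string, unknown>;
}

type StepHandler = (ctx: Record<string, unknown>) => Record<string, unknown>;

// Advance the state machine exactly one step. Each call corresponds to one
// durable write plus one re-enqueue in the real system, which is what makes
// retries, timeouts, and mid-run debugging cheap.
function advanceOneStep(
  run: WorkflowRun,
  handlers: Record<NodeId, StepHandler>
): WorkflowRun {
  if (run.currentNode === null) return run; // already terminal
  const handler = handlers[run.currentNode];
  const context = handler ? handler(run.context) : run.context;
  return {
    ...run,
    context,
    currentNode: run.graph[run.currentNode].next,
  };
}
```

Because the function returns a new run record rather than mutating in place, the worker can write the transition and the next-step enqueue in one transaction boundary.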
Tenant isolation was handled at the schema level: each tenant gets its own PostgreSQL schema, with a connection pool scoped per schema via a middleware layer. This allowed us to offer genuine data isolation without the operational overhead of per-tenant databases, while keeping query routing transparent to the application layer. All background workers were tenant-aware by construction.
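The schema-scoped routing can be sketched as follows — the `tenant_<id>` naming convention is an assumption for illustration; the real middleware resolves the tenant from the request and checks out a pg connection with the matching `search_path`:

```typescript
// Map a tenant id to its dedicated PostgreSQL schema name.
function schemaFor(tenantId: string): string {
  // Whitelist characters so the schema name is safe to interpolate into SQL.
  if (!/^[a-z0-9_]+$/.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return `tenant_${tenantId}`;
}

// Statement issued on checkout so every query on this connection is
// transparently routed to the tenant's schema (public kept for shared tables).
function searchPathStatement(tenantId: string): string {
  return `SET search_path TO "${schemaFor(tenantId)}", public`;
}
```

Setting `search_path` at connection checkout is what keeps query routing invisible to the application layer: queries reference unqualified table names and land in the right schema.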
Technical Decisions
- BullMQ over direct DB polling — gave us rate limiting, priority queues, and concurrency controls per job type without building a custom scheduler. The Redis-backed queue also provided natural backpressure when worker capacity was saturated.
- Event sourcing for workflow history — each step transition is an immutable event appended to a history table. This enabled full replay, time-travel debugging, and audit logging without additional infrastructure.
- WebSockets for real-time execution feedback — rather than polling, clients subscribe to a workflow execution channel. The backend emits granular step events over WebSocket, giving users live step-by-step visibility during long-running workflows.
- OpenAI integration as a first-class step type — rather than generic HTTP call steps, AI steps include prompt versioning, token metering per tenant, and structured output parsing with fallback handling.
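The "structured output parsing with fallback handling" behavior from the last point can be sketched like this (function and type names are illustrative, not the platform's API):

```typescript
interface ParseResult<T> {
  value: T;
  usedFallback: boolean;
}

// Parse a model response into a validated shape, falling back to a safe
// default when the output is malformed or fails validation.
function parseStructuredOutput<T>(
  raw: string,
  validate: (v: unknown) => v is T,
  fallback: T
): ParseResult<T> {
  try {
    // Models sometimes wrap JSON in markdown fences; strip them first.
    const cleaned = raw
      .replace(/^\s*`{3}(?:json)?\s*/, "")
      .replace(/`{3}\s*$/, "")
      .trim();
    const parsed: unknown = JSON.parse(cleaned);
    if (validate(parsed)) return { value: parsed, usedFallback: false };
  } catch {
    // Fall through to the fallback value.
  }
  return { value: fallback, usedFallback: true };
}

// Example validator for a hypothetical sentiment-classification step.
interface Sentiment {
  sentiment: string;
}
function isSentiment(v: unknown): v is Sentiment {
  return (
    typeof v === "object" &&
    v !== null &&
    typeof (v as Sentiment).sentiment === "string"
  );
}
```

Downstream steps always receive a value of the declared shape, so a malformed model response degrades a single step instead of failing the whole run.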
Tradeoffs
- Schema-per-tenant adds migration complexity — running schema migrations across hundreds of tenant schemas required a custom migration runner that could apply changes in batches with rollback checkpoints. This was non-trivial to build but paid off in isolation guarantees.
- State machine serialization limits expressiveness — complex branching workflows sometimes hit the limits of what we could represent cleanly in the state graph schema, requiring deliberate design constraints on what workflow shapes were supported.
- BullMQ at scale requires Redis memory management — at high throughput, job data accumulates in Redis. We implemented a compaction strategy to move completed job metadata to cold storage after TTL expiration.
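The batched migration runner from the first tradeoff follows a simple shape — apply per schema, checkpoint per batch — sketched here synchronously with hypothetical callback names (the real runner works asynchronously against PostgreSQL):

```typescript
// Apply a migration schema-by-schema in fixed-size batches, recording a
// checkpoint after each completed batch. A failure inside a batch stops the
// run before that batch's checkpoint, so the last checkpoint marks the known
// good point to resume from (or roll back to).
function migrateInBatches(
  schemas: string[],
  batchSize: number,
  apply: (schema: string) => void,
  checkpoint: (completed: string[]) => void
): string[] {
  const completed: string[] = [];
  for (let i = 0; i < schemas.length; i += batchSize) {
    const batch = schemas.slice(i, i + batchSize);
    for (const schema of batch) {
      apply(schema); // throws on failure, before the checkpoint is written
      completed.push(schema);
    }
    checkpoint([...completed]);
  }
  return completed;
}
```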
Challenges
The hardest problem was partial failure recovery. A workflow step might complete its side effect (e.g., sending an email) but crash before writing the state transition, so a retry would re-execute an already-completed step. The solution was idempotency keys on every external call, combined with a step-level deduplication table: each step could declare itself idempotent, and the executor would consult the deduplication table before executing.
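The dedup check can be sketched like this, with an in-memory Map standing in for the deduplication table and an illustrative key scheme:

```typescript
// In-memory stand-in for the step-level deduplication table.
const executedSteps = new Map<string, unknown>();

// Key the execution by run, step, and input so a retry of the same attempt
// maps to the same record.
function idempotencyKey(runId: string, stepId: string, input: string): string {
  return `${runId}:${stepId}:${input}`;
}

// Run a side effect at most once per key. If the step already ran (e.g. a
// retry after a crash between the side effect and the state write), return
// the recorded result instead of re-executing.
function executeIdempotent<T>(key: string, effect: () => T): T {
  if (executedSteps.has(key)) return executedSteps.get(key) as T;
  const result = effect();
  executedSteps.set(key, result);
  return result;
}
```

In the real system the "record result" write and the state transition share a transaction, which is what closes the crash window the paragraph describes.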
Tenant onboarding latency was also a significant challenge early on. Creating a new schema, running migrations, and seeding default data could take 8–12 seconds — unacceptable for a self-serve sign-up flow. We solved this with schema prewarming: a background job maintains a pool of empty, fully-migrated schemas ready to be claimed on signup, reducing onboarding time to under 200ms.
Reliability
- Dead letter queues for every job type, with alerting on DLQ depth and automated escalation to on-call for sustained failures.
- Per-tenant circuit breakers to prevent a noisy tenant from saturating worker capacity, combined with fair-share scheduling to maintain SLO compliance across the tenant pool.
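The per-tenant circuit breaker can be sketched as follows (thresholds and names are illustrative): after a run of consecutive failures a tenant's jobs are rejected until a cooldown elapses, so one misbehaving tenant can't monopolize workers with doomed retries.

```typescript
interface Breaker {
  failures: number; // consecutive failures since last success
  openedAt: number | null; // epoch ms when the breaker tripped, null = closed
}

const FAILURE_THRESHOLD = 5;
const COOLDOWN_MS = 30_000;

const breakers = new Map<string, Breaker>();

// Called before dispatching a tenant's job. `now` is passed in for testability.
function allowJob(tenantId: string, now: number): boolean {
  const b = breakers.get(tenantId);
  if (!b || b.openedAt === null) return true;
  if (now - b.openedAt >= COOLDOWN_MS) {
    // Half-open: let a job through again and reset the failure count.
    breakers.set(tenantId, { failures: 0, openedAt: null });
    return true;
  }
  return false; // still open: shed this tenant's load
}

// Called after each job completes.
function recordResult(tenantId: string, ok: boolean, now: number): void {
  const b = breakers.get(tenantId) ?? { failures: 0, openedAt: null };
  if (ok) {
    breakers.set(tenantId, { failures: 0, openedAt: null });
    return;
  }
  b.failures += 1;
  if (b.failures >= FAILURE_THRESHOLD) b.openedAt = now;
  breakers.set(tenantId, b);
}
```

Rejected jobs would be re-queued with a delay rather than dropped, so a recovering tenant resumes where it left off.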
Outcome
The platform reached production handling over 2 million workflow executions per month across 500+ active tenant workspaces. P99 step execution latency held under 800ms. The event-sourced history model became a key selling point for enterprise customers requiring full audit trails.
Tech Stack
- Runtime: Node.js 20, TypeScript
- Framework: Express.js
- ORM: Prisma with multi-schema support
- Queue: BullMQ (Redis)
- Database: PostgreSQL 15 (schema-per-tenant)
- AI: OpenAI API (GPT-4, embeddings)
- Real-time: WebSockets via ws library
- Observability: OpenTelemetry, Prometheus, Grafana
- Infrastructure: Docker, Kubernetes (GKE)