In the previous post, we optimized token usage and latency until the numbers felt reasonable. That’s when I made the mistake of calling it “production-ready.”
Two days after deploying, an unexpected input format broke an agent mid-run. There was no trace of which task failed. No retry logic. No alert. The job silently returned bad output, and I only found out because someone opened a support ticket. Turns out “works locally with test inputs” and “runs reliably in production” are completely different engineering problems.
This post covers the operational scaffolding you need to bridge that gap—run metadata, retry logic, observability, and guardrails that keep a crew running predictably when real users and real data are involved.
What Changes in Production
On your laptop, a failed run is an annoyance. In production, it’s an incident.
The differences aren’t subtle:
- Inputs are untrusted: user prompts, uploaded files, and tool responses can be malformed or adversarial.
- Latency matters: workflows that take 60–90 seconds need async execution—users can’t wait at a spinner.
- Costs scale quickly: one expensive run is fine; thousands at $0.08 each add up fast.
- Traceability is required: support and engineering need to reconstruct what happened during any run.
- Recovery must be explicit: transient failures are guaranteed at scale. Retry policies and timeouts cannot be an afterthought.
Treat each crew run like a job in a distributed system, not a Python function call that happens to use an LLM.
Recommended Architecture
A simple architecture handles most production cases well:
- API layer receives requests and validates payloads.
- Queue stores run jobs (run_id, input, priority).
- Worker service executes CrewAI workflows.
- Database stores run state and final outputs.
- Observability stack captures logs, traces, and usage metrics.
The key insight: decouple request handling from long-running agent execution. HTTP requests should return immediately with a run_id; clients poll or subscribe for results. An agent workflow that takes 90 seconds to finish has no business holding an HTTP connection open.
Start simple—a single worker process and a database table work fine at low volume. The architecture above scales horizontally when you need it, without rewriting business logic.
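The decoupling described above can be sketched with only the standard library. The names `submit_run`, `get_run`, `RUNS`, and `JOBS` are illustrative assumptions, and the in-memory dict and queue stand in for what would likely be a database table and a message broker in a real deployment.

```python
import queue
import uuid
from typing import Dict, Optional

# In-memory stand-ins (assumptions): swap for a real database and broker.
RUNS: Dict[str, dict] = {}
JOBS: "queue.Queue[dict]" = queue.Queue()


def submit_run(payload: dict, priority: int = 0) -> str:
    """Record the run, enqueue the job, and return the run_id right away."""
    run_id = str(uuid.uuid4())
    RUNS[run_id] = {"run_id": run_id, "status": "queued", "output": None}
    JOBS.put({"run_id": run_id, "input": payload, "priority": priority})
    return run_id


def get_run(run_id: str) -> Optional[dict]:
    """What a polling endpoint would return for this run, or None if unknown."""
    return RUNS.get(run_id)
```

The HTTP handler only ever calls `submit_run` and `get_run`; the worker is the only component that pulls from `JOBS`, so the request path never waits on an agent.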
Build a Minimal Worker
Start with a worker function that wraps crew.kickoff() with run metadata and hard limits.
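A minimal sketch of such a wrapper follows. It assumes only that the crew object exposes `kickoff(inputs=...)` and, optionally, a `usage_metrics` attribute; the function name `run_crew`, the record fields, and the `MAX_INPUT_CHARS` limit are hypothetical choices for illustration, not part of CrewAI.

```python
import json
import time
import uuid
from datetime import datetime, timezone
from typing import Any, Optional

MAX_INPUT_CHARS = 20_000  # assumed hard limit; tune for your workload


def run_crew(crew: Any, inputs: dict, run_id: Optional[str] = None,
             max_input_chars: int = MAX_INPUT_CHARS) -> dict:
    """Run one crew job and return a run record with an explicit status."""
    record = {
        "run_id": run_id or str(uuid.uuid4()),
        "status": "running",
        "started_at": datetime.now(timezone.utc).isoformat(),
        "output": None,
        "error": None,
        "usage_metrics": None,
        "duration_s": None,
    }
    start = time.monotonic()
    try:
        # Hard limit: refuse oversized inputs before spending any tokens.
        if len(json.dumps(inputs, default=str)) > max_input_chars:
            raise ValueError("input exceeds max_input_chars")
        result = crew.kickoff(inputs=inputs)
        record["status"] = "completed"
        record["output"] = str(result)
        # Read defensively: the attribute's shape depends on the CrewAI version.
        record["usage_metrics"] = getattr(crew, "usage_metrics", None)
    except Exception as exc:
        # Record the failure instead of letting the run vanish.
        record["status"] = "failed"
        record["error"] = f"{type(exc).__name__}: {exc}"
    finally:
        record["duration_s"] = round(time.monotonic() - start, 3)
    return record
```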
This is intentionally small, but it sets a strong baseline:
- Every run has a unique ID traceable through logs and support tickets.
- Success and failure are explicit states, not just missing records.
- Usage metrics are captured for cost analysis and anomaly detection.
Persist this object to your database immediately. The run record is the ground truth for everything that follows—debugging, billing, support, and alerting all reference it.
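As one way to persist the record, here is a sketch using SQLite from the standard library. The table layout and the helper names `init_db`, `save_run`, and `load_run` are assumptions; the only requirement on the record is that it is a dict with `run_id` and `status` keys.

```python
import json
import sqlite3
from typing import Optional


def init_db(conn: sqlite3.Connection) -> None:
    """Create the runs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT PRIMARY KEY, status TEXT NOT NULL, record TEXT NOT NULL)"
    )


def save_run(conn: sqlite3.Connection, record: dict) -> None:
    """Insert the run, or overwrite the existing row for the same run_id."""
    conn.execute(
        "INSERT OR REPLACE INTO runs (run_id, status, record) VALUES (?, ?, ?)",
        # default=str keeps non-JSON values (e.g. metrics objects) storable.
        (record["run_id"], record["status"], json.dumps(record, default=str)),
    )
    conn.commit()


def load_run(conn: sqlite3.Connection, run_id: str) -> Optional[dict]:
    """Return the stored record, or None if the run_id is unknown."""
    row = conn.execute(
        "SELECT record FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping `status` in its own column makes queries such as "all failed runs in the last hour" cheap without parsing the JSON blob.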
Add Retries and Timeouts
Tool calls fail in production: APIs throttle, networks drop, and transient errors happen. Without retries, a 1% failure rate on individual calls compounds across a multi-step workflow: a run that makes 20 independent calls fails about 18% of the time (1 − 0.99^20 ≈ 0.18).
Use tenacity around the worker boundary and timebox each job:
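The sketch below shows the same policy as a plain loop rather than with tenacity, so it runs without dependencies. With tenacity, the equivalent would be the `retry` decorator combined with `stop_after_attempt(3)`, `wait_exponential(...)`, and `retry_if_exception_type(...)`. The names `call_with_retries` and `TRANSIENT`, and the choice of which exceptions count as transient, are assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

# Assumption: these exception types are the transient ones in your stack.
TRANSIENT: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError)


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call fn, retrying transient errors with exponential backoff.

    Returns (result, attempt_number). Any other exception propagates
    immediately without using up further attempts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except TRANSIENT:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            sleep(base_delay_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Returning the attempt number alongside the result makes it easy to store on the run record.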
A few things to get right:
Retry only transient failures. A network timeout is worth retrying. A validation error or a missing required field is not—retrying won’t fix it, and you’ll burn three attempts discovering that.
Record attempt count. Add attempt_number to your run record. A job that succeeded on the third attempt is worth investigating; it signals a flaky dependency that will eventually fail entirely.
Set a hard time limit. A crew that runs for 10 minutes is either looping or waiting on a stuck external call. Set max_iter on agents and use a job-level timeout at the infrastructure layer. A job that never completes is worse than one that fails—it holds a worker slot and produces no output.
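For an in-process time limit, one option is to run the job in a worker thread and stop waiting after a deadline. This is only a sketch: `run_with_deadline` is a hypothetical helper, and because Python cannot kill a thread, the abandoned call keeps running in the background. That is why a job-level timeout at the infrastructure layer (for example, killing the worker process or container) is still the real backstop.

```python
import concurrent.futures
from typing import Any, Callable


def run_with_deadline(fn: Callable[[], Any], timeout_s: float) -> Any:
    """Return fn() if it finishes within timeout_s, else raise TimeoutError."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    done, _ = concurrent.futures.wait([future], timeout=timeout_s)
    # Do not block on the worker thread; it cannot be cancelled once started.
    executor.shutdown(wait=False)
    if not done:
        raise TimeoutError(f"job exceeded {timeout_s}s")
    return future.result()  # re-raises any exception fn raised
```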
Observability You Actually Need
I couldn’t diagnose my first production failure because I had logs but no structure. Grepping stdout for clues in a multi-agent workflow is painful. Here’s what’s actually worth capturing:
1. Structured logs
2. Run metrics (duration, token usage, success rate per agent/task)
3. Trace context (run_id, agent, task) on every log line
Wire CrewAI’s task callback into your logging pipeline:
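A minimal sketch of such a callback, assuming the object CrewAI passes to it exposes `agent`, `description`, and `raw` attributes (read defensively here, since they may differ between versions). The factory name `make_task_logger` and the event fields are hypothetical.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger("crew.tasks")


def make_task_logger(run_id: str) -> Callable[[Any], None]:
    """Build a callback that emits one JSON log line per finished task."""

    def log_task(output: Any) -> None:
        event = {
            "event": "task_completed",
            "run_id": run_id,
            # Assumed attribute names; fall back rather than crash the run.
            "agent": str(getattr(output, "agent", "unknown")),
            "task": str(getattr(output, "description", "unknown"))[:200],
            # Log the size, not the content, to keep payloads out of logs.
            "output_chars": len(str(getattr(output, "raw", ""))),
        }
        logger.info(json.dumps(event))

    return log_task
```

Binding `run_id` in a closure means every line carries the trace context without the callback needing any global state.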
Then wire it into the crew by passing the callback as the task_callback argument when you construct the Crew.
Use JSON logs from the start and centralize them. Text logs become painful once traffic grows—a structured log means you can query “all failed tasks for run_id X” instead of parsing strings.
Input and Tool Safety Guardrails
A production agent should never have unconstrained tool access.
Input schema validation. Reject malformed requests at the API layer before they reach your worker. An agent that receives unexpected input types will produce unpredictable output or fail mid-run without a clear error.
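As an illustration, a standard-library validator might look like the sketch below. The request shape (`topic`, `priority`), the `MAX_TOPIC_CHARS` limit, and the names `RunRequest`, `InvalidRequest`, and `parse_request` are all assumptions; a schema library such as Pydantic would serve the same purpose.

```python
from dataclasses import dataclass

MAX_TOPIC_CHARS = 2_000  # assumed limit


class InvalidRequest(ValueError):
    """Raised for payloads that must be rejected, never retried."""


@dataclass(frozen=True)
class RunRequest:
    topic: str
    priority: int = 0


def parse_request(payload: object) -> RunRequest:
    """Validate an untrusted payload and return a typed request."""
    if not isinstance(payload, dict):
        raise InvalidRequest("payload must be a JSON object")
    unknown = set(payload) - {"topic", "priority"}
    if unknown:
        raise InvalidRequest(f"unknown fields: {sorted(map(str, unknown))}")
    topic = payload.get("topic")
    if not isinstance(topic, str) or not topic.strip():
        raise InvalidRequest("topic must be a non-empty string")
    if len(topic) > MAX_TOPIC_CHARS:
        raise InvalidRequest(f"topic exceeds {MAX_TOPIC_CHARS} characters")
    priority = payload.get("priority", 0)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise InvalidRequest("priority must be an integer")
    return RunRequest(topic=topic.strip(), priority=priority)
```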
Allowlisted tools per agent. Each agent should have only the tools it actually needs. Giving an agent broad tool access increases both the attack surface and the likelihood that it calls something it shouldn’t.
Per-tool timeouts. Avoid hanging external calls. A single tool call to a slow API shouldn’t freeze an entire crew run for minutes.
Output validation before downstream use. For structured output tasks, define output_pydantic= or output_json= so downstream systems consume validated data, not free-form text that might be malformed.
Redaction before logging. If agents can access tools that return customer data, scrub sensitive fields before they hit your log pipeline. PII in logs is a compliance problem waiting to happen.
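A rough sketch of a redaction pass that could run on any structure before it is logged. The key list and the email pattern are assumptions and deliberately simplistic; real PII detection usually needs more than one regex.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed list of field names that should never reach the logs.
SENSITIVE_KEYS = {"email", "phone", "ssn", "api_key", "password"}


def redact(value):
    """Return a copy of value with sensitive keys and email addresses masked."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL.sub("[REDACTED_EMAIL]", value)
    return value  # numbers, booleans, None pass through unchanged
```

Because `redact` returns a copy, the original tool output stays intact for the agent while only the logged version is scrubbed.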
Common Production Failures
Once a crew is live, these are the failure patterns that come up most:
Failure 1: Silent hallucination under load
Symptom: Agent output looks plausible but doesn’t reflect what tools actually returned. No errors in logs.
Diagnosis: Check step-level callbacks for tool calls. If the tool fired, inspect its return value. If it didn’t fire at all, the agent skipped it and generated the answer from training data.
Fix: Make tool use explicit in task descriptions: “You MUST call fetch_financials to retrieve current data. Do not estimate values.” Add output validation to catch structurally implausible results before they’re returned to callers.
Failure 2: Worker timeout on large inputs
Symptom: Runs with large payloads time out; small inputs succeed consistently.
Diagnosis: Check token counts on the timed-out runs. Large inputs bloat the context window, causing agents to reason longer and hit token limits that trigger retries within the crew.
Fix: Validate and truncate input at the API boundary before the job enters the queue. Define max input sizes explicitly. For documents, chunk and summarize upstream rather than passing raw content to the crew.
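For the chunking step, a simple character-based splitter is sketched below. `chunk_text` is a hypothetical helper and the 4,000-character default is an assumption; a token-based limit would track model cost more closely but requires a tokenizer.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split any paragraph that is too long on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be summarized upstream, so the crew receives a bounded amount of text regardless of the original document size.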
Failure 3: Cost spike from a single run
Symptom: One run costs 20x the average. No obvious error.
Diagnosis: Pull usage_metrics from the run record. Look for an agent making a large number of successful_requests—a sign it looped. Check max_iter on that agent.
Fix: Set max_iter on all agents. Alert on runs where token usage exceeds 3x the p95 baseline. A runaway agent is usually stuck retrying a failing tool or re-reasoning because task output doesn’t match expected format.
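The alert rule can be sketched as a small check against recent history. The helper names `p95` and `is_cost_anomaly` are hypothetical, and the token totals are assumed to come from the usage metrics stored on each run record.

```python
import statistics
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """95th percentile of values (requires at least two data points)."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(values, n=20, method="inclusive")[-1]


def is_cost_anomaly(run_tokens: float, baseline_tokens: Sequence[float],
                    factor: float = 3.0) -> bool:
    """True when a run used more than factor times the baseline p95."""
    return run_tokens > factor * p95(baseline_tokens)
```

Computing the baseline over a rolling window of recent successful runs keeps the threshold meaningful as prompts and models change.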
Failure 4: Inconsistent results on identical inputs
Symptom: Running the same crew twice on the same input returns noticeably different outputs.
Diagnosis: Check temperature settings. High temperature (> 0.5) is the most common culprit. Also check whether any tools fetch live external data—market prices, news, or API responses that change between runs.
Fix: Set temperature=0.0 for agents that should be repeatable. This minimizes sampling variance, although hosted models may still not be perfectly deterministic. For agents that need creativity, document that non-determinism is expected. Log tool call inputs and outputs to understand whether variance comes from the model or the data.
Deployment Patterns
Choose based on expected traffic and SLA:
- Single worker service: simplest start, good for low volume and internal tools.
- API + queue + worker pool: best default for most teams once traffic is real.
- Scheduled crews: use cron or event schedulers for periodic workflows—nightly reports, monitoring sweeps.
- Hybrid: synchronous for quick tasks (under 10 seconds), async for long-running ones.
Start simple. The value of the architecture above isn’t what it enables today—it’s that your run records and queue contracts are already correct when you need to scale.
Production Checklist
Before increasing traffic, confirm:
- Run IDs are generated and persisted for every job.
- Every run has an explicit status: queued, running, completed, failed.
- Retry policy and max-attempt limits are set and tested.
- Timeouts exist at job, task, and tool levels.
- Usage metrics are recorded for each run.
- Logs are structured JSON and centrally searchable.
- Sensitive fields are redacted from logs and outputs.
- Alerts are configured for failure-rate and latency spikes.
If any item is missing, fix it before sending real traffic. Discovering a gap during an incident is far more expensive than discovering it before.
What’s Next
CrewAI can absolutely power production systems, but only when wrapped in solid operational scaffolding. The framework handles orchestration; your platform handles reliability. Build observability first—you’ll ship faster, debug faster, and sleep better.
This wraps up the CrewAI series. Each post built on the last, from wiring up a first crew through to running it reliably in production. The same ideas—measure everything, make failures explicit, validate at boundaries—apply to any agentic system you build next.
This is part 6 of the CrewAI series. Previous: Part 1: Getting Started, Part 2: Building Custom Tools, Part 3: Memory and State Management, Part 4: Debugging Workflows, Part 5: Performance Optimization