Performance Optimization and Token Management in CrewAI

In the previous post, we learned to trace what our agents actually did. Now we know what is happening—and what we’re seeing is a $4 API bill for a workflow that could have cost $0.40.

Multi-agent systems have a performance problem that doesn’t exist with single-agent calls: costs compound. In a three-agent crew where each agent starts with 2,000 tokens of context and produces an 800-token output, the third agent receives at least 3,600 tokens of input (its own context plus two upstream outputs) before it writes a single word, and 5,000+ once tool results and retries are included. Run this in production at any real volume and the bill grows fast. Add latency, since each agent waits for the previous one, and you also have a slow system.
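To see how the compounding adds up, here is a back-of-the-envelope sketch. It assumes a strictly sequential crew in which every agent receives its own context plus the full output of each agent before it, and it ignores tool results and retries, which push real numbers higher. The function name chain_input_tokens is hypothetical and not part of CrewAI.

```python
def chain_input_tokens(n_agents: int, context_tokens: int, output_tokens: int) -> list[int]:
    """Input tokens seen by each agent in a sequential chain.

    Assumption: agent i receives its own context plus the outputs of
    all i earlier agents. Tool results and retries are ignored.
    """
    return [context_tokens + i * output_tokens for i in range(n_agents)]


per_agent = chain_input_tokens(n_agents=3, context_tokens=2_000, output_tokens=800)
print(per_agent)       # [2000, 2800, 3600]
print(sum(per_agent))  # 8400 input tokens across the run
```

Input grows linearly per agent, so total input across the run grows roughly with the square of the chain length.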

This post covers the practical techniques for fixing both problems.

Measuring Before Optimizing

Don’t optimize blind. CrewAI exposes token usage through usage_metrics after every crew run.

from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Find market data for the given company",
    backstory="You specialize in financial market research.",
    llm="openai/gpt-4o"
)

summarizer = Agent(
    role="Report Writer",
    goal="Write a concise summary from research findings",
    backstory="You turn dense research into readable reports.",
    llm="openai/gpt-4o"
)

research_task = Task(
    description="Research {company} and gather key financial metrics.",
    expected_output="Bullet-point list of key metrics with sources.",
    agent=researcher
)

summary_task = Task(
    description="Write a 3-paragraph executive summary from the research.",
    expected_output="Three-paragraph executive summary.",
    agent=summarizer,
    context=[research_task]
)

crew = Crew(agents=[researcher, summarizer], tasks=[research_task, summary_task])
result = crew.kickoff(inputs={"company": "Stripe"})

print(crew.usage_metrics)

The output looks like this:

UsageMetrics(
    total_tokens=8432,
    prompt_tokens=6891,
    completion_tokens=1541,
    successful_requests=4
)

That prompt_tokens number is almost always the culprit. In most workflows, 80–90% of your token spend is prompt tokens—the context you’re feeding agents, not the content they produce. That’s where to focus.
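A quick way to check that ratio for your own runs is to compute the prompt share from the counters. The sketch below uses a small dataclass as a stand-in for the metrics object so it runs without CrewAI installed. With a real crew you would read the same field names off crew.usage_metrics. The names Usage and prompt_share are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for the metrics object; field names match the output above."""
    total_tokens: int
    prompt_tokens: int
    completion_tokens: int


def prompt_share(usage: Usage) -> float:
    """Fraction of all tokens that were prompt (input) tokens."""
    if usage.total_tokens == 0:
        return 0.0
    return usage.prompt_tokens / usage.total_tokens


run = Usage(total_tokens=8432, prompt_tokens=6891, completion_tokens=1541)
print(f"{prompt_share(run):.0%}")  # 82%
```

Share of tokens is not the same as share of cost when a provider prices output tokens higher than input tokens, so treat this as a first signal rather than a bill breakdown.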

Model Tiering: Use the Right Tool for Each Job

The biggest lever you have is model selection per agent. Not every agent in your crew needs GPT-4o or Claude Opus. A lot of what agents do is mechanical: formatting output, extracting structured data, routing decisions. Smaller models handle this well and cost 10–20x less.

from crewai import Agent, LLM

# Expensive model only where it earns its cost
researcher = Agent(
    role="Senior Research Analyst",
    goal="Identify key competitive threats and market opportunities",
    backstory="You have deep expertise in market analysis and strategic research.",
    llm=LLM(model="openai/gpt-4o", temperature=0.3)
)

# Cheap model for structured extraction
data_extractor = Agent(
    role="Data Extraction Specialist",
    goal="Extract structured data from research findings",
    backstory="You convert unstructured text into structured JSON.",
    llm=LLM(model="openai/gpt-4o-mini", temperature=0.0)
)

# Cheap model for mechanical formatting
report_formatter = Agent(
    role="Report Formatter",
    goal="Format findings into the standard report template",
    backstory="You apply consistent formatting to research reports.",
    llm=LLM(model="openai/gpt-4o-mini", temperature=0.0)
)

A practical tiering guide:

| Agent type | Recommended model tier | Why |
|---|---|---|
| Strategy, analysis, reasoning | Opus / GPT-4o | Needs deep inference |
| Data extraction, classification | Sonnet / GPT-4o-mini | Pattern-matching, not reasoning |
| Formatting, summarization | Haiku / GPT-3.5-turbo | Mechanical transformation |
| Tool-use-only agents | Haiku / GPT-3.5-turbo | Just dispatching calls |

You won’t always get this tiering right the first time—run usage_metrics before and after and let the numbers confirm the tradeoff.
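Before running anything, a rough estimate can show whether a tiering change is worth testing. The sketch below compares an all-large crew with a tiered one. The prices and the per-agent token counts are placeholder assumptions for illustration, not quotes from any provider, and run_cost is a hypothetical helper.

```python
# Placeholder prices in USD per million tokens: (input, output).
# These are assumptions for illustration; substitute your provider's current rates.
PRICES = {
    "large": (2.50, 10.00),
    "small": (0.15, 0.60),
}


def run_cost(tier: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of one agent's token usage at the given tier."""
    input_price, output_price = PRICES[tier]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


# Hypothetical per-agent usage for a three-agent crew: (prompt, completion).
usage = {"researcher": (4_000, 800), "extractor": (3_000, 400), "formatter": (3_000, 600)}

all_large = sum(run_cost("large", p, c) for p, c in usage.values())
tiered = (
    run_cost("large", *usage["researcher"])
    + run_cost("small", *usage["extractor"])
    + run_cost("small", *usage["formatter"])
)
print(f"all large: ${all_large:.4f}, tiered: ${tiered:.4f}")
# all large: $0.0430, tiered: $0.0195
```

The estimate says nothing about output quality, which is why the measured before-and-after comparison still matters.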

Parallel Task Execution

By default, CrewAI runs tasks sequentially. Agent A finishes, then B starts, then C. If A and B don’t depend on each other, that’s wasted latency.

Set async_execution=True on tasks that can run in parallel:

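from crewai import Task

# Assumes financial_analyst, news_analyst, and senior_analyst are Agent
# instances defined elsewhere.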

# These two tasks have no dependency on each other
financial_research_task = Task(
    description="Research {company}'s financial metrics and recent earnings.",
    expected_output="Financial metrics summary with Q/Q trends.",
    agent=financial_analyst,
    async_execution=True   # runs concurrently
)

news_research_task = Task(
    description="Find recent news and press releases about {company}.",
    expected_output="Chronological list of significant news items.",
    agent=news_analyst,
    async_execution=True   # runs concurrently
)

# This task depends on both — it waits for them
synthesis_task = Task(
    description="Synthesize financial data and news into an investment brief.",
    expected_output="Two-page investment analysis brief.",
    agent=senior_analyst,
    context=[financial_research_task, news_research_task]  # explicit deps
)

CrewAI’s sequential process respects async_execution—it kicks off async tasks together and waits for them before moving to tasks that depend on them via context. The synthesis_task here won’t start until both async tasks complete.
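The scheduling behavior can be pictured with plain Python futures. This is an analogy rather than CrewAI’s internals: the functions below are hypothetical stand-ins that sleep instead of calling an LLM, and the thread pool plays the role of the crew.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def financial_research(company: str) -> str:
    time.sleep(0.2)  # stand-in for LLM and tool latency
    return f"financials for {company}"


def news_research(company: str) -> str:
    time.sleep(0.2)  # stand-in for LLM and tool latency
    return f"news for {company}"


def synthesize(financials: str, news: str) -> str:
    return f"brief based on: {financials}; {news}"


def run_pipeline(company: str) -> tuple[str, float]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both independent tasks start immediately.
        fin_future = pool.submit(financial_research, company)
        news_future = pool.submit(news_research, company)
        # The dependent step blocks until both results are available.
        brief = synthesize(fin_future.result(), news_future.result())
    return brief, time.perf_counter() - start


brief, elapsed = run_pipeline("Stripe")
print(brief)
print(f"{elapsed:.2f}s")  # roughly the longer of the two sleeps, not their sum
```

The saving is bounded by the slowest independent task, so parallelizing one long task with several short ones gains little.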

A few things to get right:

Don’t async tasks that share state. If two agents both write to the same memory store or external database, parallel execution creates race conditions. Either sequence them or use locking.

Async doesn’t mean free. You’re still hitting rate limits. If you have five async tasks all using GPT-4o simultaneously, you may hit tokens-per-minute limits and get throttled. Check your rate limit tier before scaling parallelism.

The context field is your dependency graph. Only list tasks in context that an agent truly needs. Listing everything “just in case” stuffs the context window with irrelevant output.

Tool Result Caching

CrewAI caches tool call results by default—cache=True is the default on every Agent, so you usually don’t need to set it. What it does: if an agent calls the same tool with the same arguments a second time within a run, it returns the cached result instead of executing the tool again.

researcher = Agent(
    role="Research Analyst",
    goal="Find market data for the given company",
    backstory="You specialize in financial market research.",
    llm="openai/gpt-4o",
    cache=True   # default — caches tool call results within the run
)

You’d only set cache=False when a tool returns live data that must be fresh on every call—a real-time price feed, a rate-limited scraper, or anything where a stale cached result would be wrong.
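To make "same tool, same arguments" concrete, here is a minimal sketch of the idea behind a per-run tool cache. It is not CrewAI’s implementation. ToolCache and lookup_ticker are hypothetical, and the cache key assumes the arguments are JSON-serializable.

```python
import json
from typing import Any, Callable


class ToolCache:
    """Memoizes tool results by (tool name, arguments) for the lifetime of the object."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self.hits = 0

    def call(self, name: str, func: Callable[..., Any], **kwargs: Any) -> Any:
        # Sort keys so {"a": 1, "b": 2} and {"b": 2, "a": 1} share one entry.
        key = f"{name}:{json.dumps(kwargs, sort_keys=True)}"
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = func(**kwargs)
        self._store[key] = result
        return result


executions = []


def lookup_ticker(company: str) -> str:
    executions.append(company)  # record each real execution
    return company.upper()[:4]


cache = ToolCache()
first = cache.call("lookup_ticker", lookup_ticker, company="Stripe")
second = cache.call("lookup_ticker", lookup_ticker, company="Stripe")
print(first, second, len(executions), cache.hits)  # STRI STRI 1 1
```

The second call never reaches the tool, which is exactly what you want for a slow lookup and exactly what you do not want for a live price feed.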

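LLM-level prompt caching (reusing the KV cache across identical prompt prefixes) runs at the provider level, not in CrewAI. Anthropic, OpenAI, and Google all support it, but not identically: OpenAI applies it automatically to sufficiently long prompts, while Anthropic’s API has traditionally required marking cacheable blocks with cache_control, so check your provider’s documentation before assuming you’re getting cache hits. In every case the benefit depends on sending identical system prompts and context prefixes across calls. Anthropic, for example, bills cache reads at about 10% of the normal input price, a discount of up to 90%.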
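A rough sketch of what a prefix discount is worth follows. The price, token counts, and the flat 90% discount are placeholder assumptions, prompt_cost_with_cache is a hypothetical helper, and cache-write surcharges and cache expiry are deliberately left out.

```python
def prompt_cost_with_cache(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_million: float,
    cached_discount: float = 0.90,
) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) in USD for input tokens only.

    Assumptions: the first call pays full price for the prefix, every later
    call gets the discount on the prefix, and the suffix is never cached.
    """
    per_token = price_per_million / 1_000_000
    uncached = calls * (prefix_tokens + suffix_tokens) * per_token
    cached_prefix = prefix_tokens * (1 + (calls - 1) * (1 - cached_discount))
    cached = (cached_prefix + calls * suffix_tokens) * per_token
    return uncached, cached


without, with_cache = prompt_cost_with_cache(
    prefix_tokens=2_000, suffix_tokens=500, calls=10, price_per_million=3.00
)
print(f"${without:.4f} vs ${with_cache:.4f}")  # $0.0750 vs $0.0264
```

The saving scales with how much of each prompt is a stable prefix, which is one more reason to keep system prompts and early context identical across calls.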

We covered fine-grained tool caching with custom TTL and cache_function in Part 2.

Context Window Management

The output of each task becomes input for the next. If your tasks produce verbose outputs, your context window fills fast and your costs grow with every agent in the chain.

Use output_pydantic to force structured, compact outputs:

from crewai import Task
from pydantic import BaseModel
from typing import List

# Assumes financial_analyst is an Agent defined elsewhere.

class CompanyMetrics(BaseModel):
    revenue_growth_yoy: float
    gross_margin: float
    key_risks: List[str]
    top_competitors: List[str]

financial_task = Task(
    description="Extract key financial metrics for {company} from the provided data.",
    expected_output="Structured financial metrics object.",
    agent=financial_analyst,
    output_pydantic=CompanyMetrics   # forces structured output
)

Instead of a 600-word prose analysis, downstream agents receive a compact, structured object. This alone can cut context tokens by 50–70% for data-heavy workflows.
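The size difference is easy to see on a toy example. The sketch below uses a dataclass as a stand-in for the Pydantic model so it needs no third-party packages. The prose and the values are made up for illustration, and character count is only a rough proxy for tokens.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CompanyMetrics:
    """Dataclass stand-in for the Pydantic model above."""
    revenue_growth_yoy: float
    gross_margin: float
    key_risks: list[str]
    top_competitors: list[str]


# Made-up values, for illustration only.
prose = (
    "Over the past year the company grew revenue by roughly twenty-five percent "
    "year over year, while gross margin held at about sixty percent. The main "
    "risks we identified are regulatory pressure and customer concentration. "
    "Its closest competitors appear to be Acme Pay and Globex."
)
structured = CompanyMetrics(
    revenue_growth_yoy=0.25,
    gross_margin=0.60,
    key_risks=["regulatory pressure", "customer concentration"],
    top_competitors=["Acme Pay", "Globex"],
)
# Compact separators drop the spaces json.dumps adds by default.
compact = json.dumps(asdict(structured), separators=(",", ":"))

print(len(prose), len(compact))
print(f"{1 - len(compact) / len(prose):.0%} fewer characters")
```

Real savings depend on how verbose the unstructured output would have been, so measure prompt tokens on your own workflow before and after.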

When you can’t use structured output—creative or open-ended tasks, for example—write explicit length constraints in expected_output:

summary_task = Task(
    description="Summarize the research findings for {company}.",
    expected_output="Exactly 3 bullet points, each under 20 words. No prose.",
    agent=summarizer
)

Agents follow output format instructions more reliably than you’d expect, especially with temperature=0.0.
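Reliable is not the same as guaranteed, so it can be worth verifying the format before passing output downstream. Here is a minimal sketch of such a check for the constraint above. check_bullets is a hypothetical helper, not a CrewAI feature, and it assumes bullets start with "-", "*", or "•".

```python
def check_bullets(text: str, n_bullets: int = 3, max_words: int = 20) -> list[str]:
    """Return a list of problems; an empty list means the output meets the constraint."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    bullets = [line for line in lines if line.startswith(("-", "*", "•"))]
    if len(bullets) != len(lines):
        problems.append("contains non-bullet lines")
    if len(bullets) != n_bullets:
        problems.append(f"expected {n_bullets} bullets, got {len(bullets)}")
    for bullet in bullets:
        words = bullet.lstrip("-*• ").split()
        # "under 20 words" means 20 or more is a violation
        if len(words) >= max_words:
            problems.append(f"bullet has {len(words)} words: {bullet[:30]}")
    return problems


good = "- Revenue grew 25% year over year\n- Gross margin held near 60%\n- Main risk is regulation"
bad = "Here is the summary:\n- Revenue grew\n- Margin held"
print(check_bullets(good))  # []
print(check_bullets(bad))   # ['contains non-bullet lines', 'expected 3 bullets, got 2']
```

A failed check is a cheap signal to retry the task or fall back to a stricter prompt before the oversized output reaches the next agent.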

Trimming Agent Backstories

The backstory field is included in every prompt for that agent, for every LLM call it makes. A 200-word backstory is roughly 260 tokens at a typical 1.3 tokens per English word, so on a tool-calling agent that makes 8 calls per task it adds roughly 2,100 tokens of pure overhead.

Keep backstories short and role-specific:

# 180 tokens of backstory — unnecessary for a routing agent
data_router = Agent(
    role="Data Router",
    goal="Route incoming data to the correct processing pipeline",
    backstory="""You are an expert data routing specialist with fifteen years of experience
    in enterprise data engineering. You have worked at Fortune 500 companies and have deep
    expertise in ETL pipelines, data warehousing, stream processing, and real-time analytics.
    You approach every routing decision methodically...""",
    llm="openai/gpt-4o-mini"
)

# About 15 tokens: does the same job
data_router = Agent(
    role="Data Router",
    goal="Route incoming data to the correct processing pipeline",
    backstory="You route data to the right pipeline based on schema and content type.",
    llm="openai/gpt-4o-mini"
)

The rule: backstory should contain information the agent actually needs to make decisions, not bio padding.
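The overhead is simple to estimate before you trim anything. The sketch below assumes about 1.3 tokens per English word, which is a rule of thumb rather than a tokenizer, and both helper names are hypothetical.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a word count; use a real tokenizer for accuracy."""
    return round(len(text.split()) * tokens_per_word)


def backstory_overhead(backstory: str, llm_calls: int) -> int:
    """Tokens spent re-sending the backstory across all LLM calls in a task."""
    return estimate_tokens(backstory) * llm_calls


short_backstory = "You route data to the right pipeline based on schema and content type."
long_backstory = " ".join(["word"] * 200)  # placeholder for a 200-word backstory

print(backstory_overhead(long_backstory, llm_calls=8))   # 2080
print(backstory_overhead(short_backstory, llm_calls=8))  # 136
```

Multiply the difference by your daily run count to decide whether a backstory rewrite is worth the effort.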

Measuring the Impact

Before any optimization, capture a baseline:

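import time

def run_with_metrics(crew, inputs):
    """Run the crew once and print wall-clock time and token usage."""
    # perf_counter is monotonic, so it is safer than time.time() for durations
    start = time.perf_counter()
    result = crew.kickoff(inputs=inputs)
    elapsed = time.perf_counter() - start

    # usage_metrics is populated by the kickoff call above
    metrics = crew.usage_metrics
    print(f"Duration:          {elapsed:.1f}s")
    print(f"Total tokens:      {metrics.total_tokens}")
    print(f"Prompt tokens:     {metrics.prompt_tokens}")
    print(f"Completion tokens: {metrics.completion_tokens}")
    print(f"API calls:         {metrics.successful_requests}")

    return result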

Run this before and after each optimization. The numbers that matter most:

  • Prompt tokens → impact of context trimming, output_pydantic, shorter backstories
  • Duration → impact of async_execution and model tiering
  • Completion tokens → usually less controllable, but structured output helps here too

A common baseline for a three-agent research workflow: ~12,000 tokens, ~45 seconds, $0.08/run. After applying model tiering, async execution, and structured outputs: ~4,000 tokens, ~18 seconds, $0.015/run.
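Turning the two runs into improvement factors makes the comparison easier to read. The sketch below uses the figures quoted above. RunStats and improvement are hypothetical names, and the values would come from your own measurements.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    tokens: int
    seconds: float
    cost_usd: float


def improvement(before: RunStats, after: RunStats) -> dict[str, float]:
    """Improvement factor per metric: 3.0 means the optimized run uses a third as much."""
    return {
        "tokens": before.tokens / after.tokens,
        "seconds": before.seconds / after.seconds,
        "cost_usd": before.cost_usd / after.cost_usd,
    }


baseline = RunStats(tokens=12_000, seconds=45, cost_usd=0.08)
optimized = RunStats(tokens=4_000, seconds=18, cost_usd=0.015)

for metric, factor in improvement(baseline, optimized).items():
    print(f"{metric}: {factor:.1f}x")
# tokens: 3.0x
# seconds: 2.5x
# cost_usd: 5.3x
```

Cost falls faster than token count here because model tiering lowers the price per token as well as the number of tokens.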

Common Pitfalls

Async tasks that share memory. If two agents both call memory.save() in parallel and your memory implementation isn’t thread-safe, you’ll get data corruption. Either sequence those tasks or give each agent its own memory namespace.

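Caching stale results. CrewAI’s built-in tool cache lasts for a single run and the tool caching from Part 2 is time-bounded, but a response cache you put in front of the LLM keeps entries until you expire them. If your workflow pulls live data and you cache LLM responses, an agent might produce a “current analysis” from a 3-day-old cache hit. Provider-side prompt caching does not carry this risk, because it reuses prompt processing rather than returning stored answers. Set an expiry on any response cache, or bypass it for agents that need fresh reasoning.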

Wrong model for complex tasks. Model tiering works when you tier correctly. Sending a nuanced competitive analysis to GPT-3.5-turbo to save cost often results in shallow output that requires a re-run with GPT-4o anyway. Measure quality, not just cost.

Over-parallelizing against rate limits. More async tasks doesn’t automatically mean faster. If you’re on a low-tier API key, five concurrent tasks throttling each other can be slower than running the same five sequentially. Test your rate limit headroom before scaling async.
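One way to keep parallelism under a rate limit is to cap the number of requests in flight. The sketch below shows the idea with a plain asyncio semaphore. It is independent of CrewAI, and run_bounded and the fake call are hypothetical stand-ins.

```python
import asyncio


async def run_bounded(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake LLM calls, never more than max_concurrent at once.

    Returns the highest number of calls observed in flight at the same time.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.05)  # stand-in for request latency
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_bounded(n_tasks=5, max_concurrent=2))
print(peak)  # 2
```

Choosing max_concurrent from your measured tokens-per-minute headroom, rather than from the number of tasks, keeps throughput steady instead of alternating between bursts and throttling.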

What We Covered

The performance gap between a naive CrewAI crew and an optimized one is usually 3–5x on cost and 2x on latency—without sacrificing output quality. The techniques that move the needle most:

  1. Model tiering — largest impact on cost
  2. output_pydantic — largest impact on context bloat
  3. async_execution=True — largest impact on latency
  4. Lean backstories — easy win that most people skip

Use usage_metrics to confirm changes are actually working. Token counts don’t lie.

Next up: putting everything together for production deployment—rate limit handling, retries, cost controls, and monitoring CrewAI crews in a live environment.
