Velocity and Stability: Concurrency and Throttling

Balance speed with reliability. Learn how to manage high-volume agentic requests without triggering API rate limits or crashing your infrastructure.

Concurrency and Throttling

When you deploy a production agent, you are no longer limited by your own typing speed; you are limited by your infrastructure's throughput. Your agent might try to call a tool 100 times in parallel, or 10,000 users might talk to your agent at the same time.

Without Concurrency Control and Throttling, your system will enter a "Crash Loop" of 429 Errors (Too Many Requests). In this lesson, we will learn how to manage the velocity of your agent system.


1. Concurrency: Doing Things in Parallel

LangGraph allows you to run multiple nodes at the exact same time. This is essential for reducing the latency of complex tasks.

The Fan-Out / Fan-In Pattern

  • One node starts the work.
  • It "Branches" into 5 parallel worker nodes.
  • All workers finish.
  • One "Join" node summarizes the results.
graph LR
    Start --> FanOut{Parallel Launch}
    FanOut --> A[Search A]
    FanOut --> B[Search B]
    FanOut --> C[Search C]
    A --> Join[Synthesis]
    B --> Join
    C --> Join
    Join --> End
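The diagram above can be sketched with plain asyncio. This is a minimal sketch, not LangGraph itself; `search` is a hypothetical stand-in for a real tool call:

```python
import asyncio

async def search(query: str) -> str:
    # Stand-in for a real search tool call (hypothetical)
    await asyncio.sleep(0.01)
    return f"results for {query}"

async def fan_out_fan_in(queries: list[str]) -> str:
    # Fan-out: launch one worker per query, all at the same time
    results = await asyncio.gather(*(search(q) for q in queries))
    # Fan-in: a single join step synthesizes the results
    return " | ".join(results)

summary = asyncio.run(fan_out_fan_in(["A", "B", "C"]))
print(summary)  # results for A | results for B | results for C
```

Note that asyncio.gather preserves input order, so the join step sees results in the same order the branches were launched.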

2. Throttling: The "Speed Governor"

Even if you can run 100 agents in parallel, your API providers (OpenAI, Anthropic, Google) have Rate Limits.

Types of Limits

  1. RPM (Requests Per Minute): How many times you can hit the API.
  2. TPM (Tokens Per Minute): Your total "volume" allowance.
  3. Concurrency Limit: How many requests you can have in flight at the same time.

Implementation: The Semaphore Pattern

In your Python code, you should never call an LLM directly in an unbounded loop. Use a Semaphore or a RateLimiter to queue the requests.

import asyncio

# Allow only 5 concurrent LLM calls across the whole system
sem = asyncio.Semaphore(5)

async def throttled_llm_call(prompt):
    # `llm` is assumed to be an async-capable chat model (anything exposing .ainvoke)
    async with sem:
        return await llm.ainvoke(prompt)

3. The "Backoff and Retry" Strategy

When you inevitably hit a rate limit (HTTP 429), how you handle it determines your system's reliability.

Bad Approach: Catch the 429 and retry immediately. (This just makes the rate-limit problem worse.)

Good Approach: Exponential Backoff with Jitter.

  1. Wait 1 second + random ms.
  2. If it fails again, wait 2 seconds + random ms.
  3. If it fails again, wait 4 seconds...
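The schedule above takes only a few lines. A minimal sketch, assuming a hypothetical `RateLimitError` standing in for whatever your SDK raises on HTTP 429 (the `base` parameter is an addition here so the delays can be tuned):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error your SDK raises (hypothetical)."""

async def with_backoff(call, max_retries: int = 5, base: float = 1.0):
    for attempt in range(max_retries):
        try:
            return await call()
        except RateLimitError:
            # base*1, base*2, base*4, ... plus random jitter so that many
            # clients don't all retry in lockstep and re-trigger the limit
            await asyncio.sleep(base * (2 ** attempt) + random.random() * base)
    raise RuntimeError("rate limit: retries exhausted")
```

The jitter matters as much as the doubling: without it, every throttled client wakes up at the same instant and hammers the API again.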

4. Prioritization: The "VIP" Queue

Not all agent tasks are equal.

  • A User-facing Chat is high priority. (Should jump to the front of the queue).
  • A Background Data Scraping task is low priority. (Can wait 5 minutes if the system is busy).

Solution: Priority Queues

Use a system like Redis to maintain multiple queues with different weights.
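Here is a minimal in-process sketch of the same idea using asyncio.PriorityQueue; in production you would back this with Redis as noted above. The priority values and task names are illustrative:

```python
import asyncio

CHAT, SCRAPE = 0, 10  # lower number = higher priority

async def worker(queue: asyncio.PriorityQueue, done: list):
    # Drain the queue; the priority queue always hands us the lowest
    # (priority, task) tuple first, so VIP work jumps the line.
    while not queue.empty():
        priority, task = await queue.get()
        done.append(task)  # dispatch the task here
        queue.task_done()

async def main() -> list:
    queue = asyncio.PriorityQueue()
    await queue.put((SCRAPE, "scrape-docs"))
    await queue.put((CHAT, "user-chat"))  # enqueued last, served first
    done = []
    await worker(queue, done)
    return done

print(asyncio.run(main()))  # ['user-chat', 'scrape-docs']
```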


5. Token Throttling (Budgeting)

As we discussed in Module 3.3, tokens are a system constraint. You can implement a Token Quota per user.

  • "User A has used 90% of their daily token budget."
  • Throttle User A's agents so they run on a cheaper, "Slower" model like Haiku or GPT-4o-mini.
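A per-user quota check like this can be a few lines of routing logic. The budget numbers, usage store, and model names below are illustrative, not a real billing API:

```python
DAILY_BUDGET = 100_000  # tokens per user per day (illustrative)

usage = {"user_a": 92_000, "user_b": 10_000}  # tokens consumed today

def pick_model(user: str) -> str:
    """Route heavy users to a cheaper tier once they pass 90% of budget."""
    if usage.get(user, 0) >= 0.9 * DAILY_BUDGET:
        return "cheap-fast-model"   # e.g. a Haiku / mini-class model
    return "default-model"          # full-quality tier

print(pick_model("user_a"))  # cheap-fast-model
print(pick_model("user_b"))  # default-model
```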

6. Implementation Example: LangGraph Throttling

You can use asyncio.gather within a LangGraph node to handle internal parallelism safely.

import asyncio

async def parallel_search_node(state):
    # Launch 3 searches at once; `search_tool` is assumed to be an
    # async-capable tool (anything exposing .ainvoke)
    tasks = [
        search_tool.ainvoke(state["q1"]),
        search_tool.ainvoke(state["q2"]),
        search_tool.ainvoke(state["q3"]),
    ]
    results = await asyncio.gather(*tasks)
    return {"results": results}
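To respect the provider limits from section 2 at the same time, the node's gather can be wrapped in the global semaphore. A runnable sketch, with `fake_search` as a hypothetical stand-in for the real tool call:

```python
import asyncio

sem = asyncio.Semaphore(5)  # same global semaphore idea as in section 2

async def throttled(coro_fn, arg):
    # Every tool call passes through the semaphore before running
    async with sem:
        return await coro_fn(arg)

async def fake_search(q: str) -> str:
    # Stand-in for search_tool.ainvoke (hypothetical)
    await asyncio.sleep(0.01)
    return f"hit:{q}"

async def parallel_search_node(state: dict) -> dict:
    # gather launches everything at once, but the semaphore caps
    # the number of in-flight calls at 5 across the whole process
    tasks = [throttled(fake_search, state[k]) for k in ("q1", "q2", "q3")]
    return {"results": await asyncio.gather(*tasks)}

print(asyncio.run(parallel_search_node({"q1": "a", "q2": "b", "q3": "c"})))
# {'results': ['hit:a', 'hit:b', 'hit:c']}
```

This keeps the node code simple: concurrency lives in gather, throttling lives in one shared semaphore.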

Summary and Mental Model

Think of Concurrency and Throttling like a Highway.

  • Concurrency is adding more lanes. It allows more traffic (tasks) to move at once.
  • Throttling is the Toll Booth. It makes sure that even if the highway is wide, the exit point (The API or DB) doesn't get overwhelmed.

A fast highway with a single, jammed exit is a parking lot.


Exercise: Scaling Calculation

  1. The Math: Your OpenAI limit is 10,000 Tokens Per Minute. Each agent session uses 2,000 tokens per minute.
    • How many concurrent users can you support with this limit?
    • How would you use a Small Model for simple steps to "reclaim" token space for more users?
  2. Design: Why is it better to have a "Global" rate limiter for the whole app rather than an "Agent-level" rate limiter?
  3. Logic: What happens to a "Human-in-the-loop" node (Module 5.3) when the system is throttled? Does the human have to wait, or only the agent?

Ready for communication? Let's move on to Inter-Agent Communication Patterns.
