
FastAPI for AI: Async Clients and Model Serving
The AI-First API. Learn how to wrap LLMs and Machine Learning models using FastAPI's async core to build high-performance AI services.
FastAPI has become a popular choice for AI services. Why? Because AI calls are slow, and FastAPI's async core lets the server keep handling other requests while it waits. Whether you are calling an external LLM (like Gemini or OpenAI) or running your own model (like Llama or Stable Diffusion), FastAPI is a good fit.
In this lesson, we learn how to build production-grade AI wrappers.
1. The Async AI Pattern
Calling an LLM takes time. A single prompt can take 2 to 10 seconds. If you call a synchronous client (like requests) inside an async def endpoint, the event loop is blocked and every other request waits until the call returns.
The Solution: Use the Async version of the AI client.
    from fastapi import FastAPI
    from openai import AsyncOpenAI

    app = FastAPI()
    # Placeholder key; in practice, read it from an environment variable
    client = AsyncOpenAI(api_key="sk-...")

    @app.post("/ask-ai")
    async def ask_ai(prompt: str):  # a bare str parameter is read from the query string
        # This 'await' lets the server handle other users while the AI thinks
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return {"answer": response.choices[0].message.content}
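To see why the blocking call matters, here is a small simulation that uses only the standard library. It is a sketch, not a benchmark: `blocking_call` and `non_blocking_call` are hypothetical stand-ins for a sync and an async AI client, with `time.sleep` and `asyncio.sleep` playing the role of the slow model.

```python
import asyncio
import time


async def blocking_call(delay: float) -> None:
    # Stand-in for a synchronous client: time.sleep holds the event loop,
    # so no other coroutine can make progress in the meantime.
    time.sleep(delay)


async def non_blocking_call(delay: float) -> None:
    # Stand-in for an async client: awaiting hands control back to the loop.
    await asyncio.sleep(delay)


async def timed(call, n_requests: int, delay: float) -> float:
    """Run n_requests simulated calls concurrently; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(call(delay) for _ in range(n_requests)))
    return time.perf_counter() - start


def compare(n_requests: int = 5, delay: float = 0.1) -> tuple:
    """Return (blocking_elapsed, non_blocking_elapsed) in seconds."""
    blocking = asyncio.run(timed(blocking_call, n_requests, delay))
    non_blocking = asyncio.run(timed(non_blocking_call, n_requests, delay))
    return blocking, non_blocking


if __name__ == "__main__":
    blocking, non_blocking = compare()
    print(f"blocking: {blocking:.2f}s, non-blocking: {non_blocking:.2f}s")
```

With the blocking version the simulated requests run one after another, so the total time grows with the number of requests. With the awaited version they overlap, and the total stays close to a single delay.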
2. Pydantic for Structured AI Output
The biggest problem with AI is that it's "unpredictable." You ask for JSON, but it gives you a poem. By combining Pydantic models (the validation library FastAPI is built on) with modern LLM "Structured Output" features, you can check that the AI's response follows your code's schema exactly, and reject it when it doesn't.
    from pydantic import BaseModel, Field

    class AIResponse(BaseModel):
        summary: str
        # Only these three labels pass validation
        sentiment: str = Field(pattern="^(Positive|Negative|Neutral)$")
        tags: list[str]
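As a rough illustration of how this schema can act as a gate, the sketch below validates raw model text against `AIResponse` and returns `None` when it does not fit. It assumes Pydantic v2 (`model_validate_json`); `parse_ai_output` and the two sample strings are hypothetical, made up for this example.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class AIResponse(BaseModel):
    summary: str
    sentiment: str = Field(pattern="^(Positive|Negative|Neutral)$")
    tags: list[str]


def parse_ai_output(raw: str) -> Optional[AIResponse]:
    """Return a validated AIResponse, or None if the text does not fit the schema."""
    try:
        # Parses the JSON and checks every field in one step
        return AIResponse.model_validate_json(raw)
    except ValidationError:
        # Raised for invalid JSON, missing fields, or a bad sentiment label
        return None


# Hypothetical model outputs: one well-formed, one "poem"
good = '{"summary": "Great product", "sentiment": "Positive", "tags": ["review"]}'
bad = "Roses are red, violets are blue"

if __name__ == "__main__":
    print(parse_ai_output(good))
    print(parse_ai_output(bad))
```

Returning `None` keeps the example short. In a real endpoint you might retry the model call or return an error response instead.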
3. Handling Timeouts and Retries
AI APIs fail. Models go down, or you hit rate limits. Your FastAPI code should include robust error handling (Module 7).
- Timeouts: Stop waiting after 10 seconds so the user isn't stuck forever.
- Exponential Backoff: If an API fails, wait 1s, then 2s, then 4s before trying again.
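The two ideas above can be combined in one small helper. This is a minimal sketch using only `asyncio`: `call_with_retries`, `RETRYABLE`, and `FlakyModel` are hypothetical names, and the set of errors worth retrying is an assumption that depends on the client library you actually use.

```python
import asyncio

# Assumption: timeouts and connection errors are worth retrying.
RETRYABLE = (asyncio.TimeoutError, ConnectionError)


async def call_with_retries(make_call, *, timeout=10.0, max_attempts=4, base_delay=1.0):
    """Await make_call() with a per-attempt timeout and exponential backoff."""
    for attempt in range(max_attempts):
        try:
            # Give up on this attempt after `timeout` seconds
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle the error
            # With base_delay=1.0 this waits 1s, then 2s, then 4s
            await asyncio.sleep(base_delay * 2 ** attempt)


class FlakyModel:
    """Simulated AI API that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    async def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return "ok"


if __name__ == "__main__":
    flaky = FlakyModel(failures=2)
    # A short base_delay keeps the demo fast
    print(asyncio.run(call_with_retries(flaky, base_delay=0.01)))
```

In production code it is common to add random jitter to each delay so that many clients do not retry at the same instant.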
4. Serving Your Own Models (PyTorch / TensorFlow)
If you are running your own model on your own GPU:
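- Load the model ONCE during startup (using a lifespan handler; the older @app.on_event("startup") still works but is deprecated in recent FastAPI versions).
- Serve it in a threadpool: Model inference is CPU/GPU heavy. Define the endpoint with def instead of async def so FastAPI runs it in a worker thread and it doesn't block the event loop.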
Visualizing the AI API Flow
sequenceDiagram
participant U as User
participant F as FastAPI
participant AI as AI Model (LLM)
U->>F: POST /generate (Prompt)
F->>AI: Async Call (Waiting...)
Note over F: Server handles other requests
AI-->>F: AI Result (Text/JSON)
F->>F: Pydantic Validation
F-->>U: Reliable JSON Response
Summary
- Async-First: Prefer async AI clients, and never call a blocking client inside an async def endpoint.
- Pydantic: Use it to validate that the AI returns structured, typed data.
- Lifecycle: Load heavy models during app startup, not inside the request.
- Reliability: AI is unpredictable; your API shouldn't be.
In the next lesson, we’ll look at Streaming Responses, the secret to making AI feel "Live."
Exercise: The AI Guard
You are building an AI Support Bot.
- If the AI takes 15 seconds to respond, what happens to your FastAPI server if you use async def?
- What happens if you use standard def with a sync client?
- Which one allows you to handle 100 concurrent users on a single CPU?