Module 11: Evaluating Agent Performance

Lesson 2: Scoring Accuracy, Latency, and Cost

An architect never optimizes for one variable in isolation. A system that is 100% accurate is useless if it takes 10 minutes to respond and costs $50 per call. Conversely, a system that responds in 200ms but is only 20% accurate is dangerous.

In this lesson, we master the "Triple-Constraint Matrix" and learn how to build a Performance Score.

1. Metric A: Accuracy (Quality)

How "Correct" is the answer?

Deterministic Check: Does the code run? Does the SQL return the right rows?
Semantic Check: Does the summary include the 3 core points? This is usually measured via "Model-Graded Evals": asking a stronger model such as Opus to grade a response from a smaller model such as Haiku.
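Both kinds of check can be sketched in a few lines. This is a minimal illustration rather than a prescribed implementation: the function names are hypothetical, and `judge` is an assumed callable standing in for whatever request you send to a grader model.

```python
import sqlite3
from typing import Callable, Sequence


def sql_returns_expected_rows(conn: sqlite3.Connection, query: str,
                              expected: Sequence[tuple]) -> bool:
    """Deterministic check: run the generated SQL and compare rows, ignoring order."""
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error:
        return False  # SQL that does not run counts as a failure
    return sorted(rows) == sorted(expected)


def key_point_coverage(summary: str, key_points: Sequence[str],
                       judge: Callable[[str], str]) -> float:
    """Semantic check: fraction of key points the grader says are covered.

    `judge` is a hypothetical function that sends a prompt to a grader
    model and returns its text reply.
    """
    if not key_points:
        return 0.0
    hits = 0
    for point in key_points:
        prompt = (
            "Does the summary below state the following point? Answer yes or no.\n"
            f"Point: {point}\nSummary: {summary}"
        )
        if judge(prompt).strip().lower().startswith("yes"):
            hits += 1
    return hits / len(key_points)
```

The deterministic check gives a pass/fail result, while the semantic check gives a fraction between 0 and 1, so decide up front how the two are combined into a single Accuracy number.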

2. Metric B: Latency (Velocity)

How long does the user wait?

TTFT (Time to First Token): How fast does the screen start moving?
End-to-End Latency: How long until the final JSON is validated and ready?
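Impact of Multi-turn: Every extra "Agent Turn" is another model round trip, which typically adds on the order of 1-3 seconds depending on the model and the length of its output.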
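A rough way to capture both numbers is to time the consumption of a streamed response. The sketch below assumes the response arrives as an iterable of text chunks; `measure_latency` is a hypothetical helper, not part of any SDK.

```python
import time
from typing import Iterable, Tuple


def measure_latency(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, full_text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            # First chunk has arrived: this is the Time to First Token.
            ttft = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    if ttft is None:
        ttft = total  # empty stream: no first token was ever produced
    return ttft, total, "".join(parts)
```

The total here stops at the last chunk. To match the End-to-End definition above, keep the timer running until the final JSON has been parsed and validated.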

3. Metric C: Cost (Economics)

What is the token burn?

Prompt Tokens: Cost of context, system prompts, and history.
Completion Tokens: Cost of the generated answer.
The Architect's Lever: Prompt Caching can cut the cost of the cached portion of the prompt by up to about 90%, which matters most for long, repetitive system prompts.
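The arithmetic behind a per-call cost can be sketched as below. The prices and the 90% cache discount are placeholder assumptions for illustration, not real vendor rates, and `Pricing` and `call_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in dollars per million tokens."""
    prompt_per_mtok: float
    completion_per_mtok: float
    cached_prompt_discount: float = 0.90  # assumed discount on cached prompt tokens


def call_cost(pricing: Pricing, prompt_tokens: int, completion_tokens: int,
              cached_prompt_tokens: int = 0) -> float:
    """Dollar cost of one call; cached prompt tokens are billed at the discounted rate."""
    if not 0 <= cached_prompt_tokens <= prompt_tokens:
        raise ValueError("cached_prompt_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_prompt_tokens
    cached_rate = pricing.prompt_per_mtok * (1 - pricing.cached_prompt_discount)
    dollars_times_1m = (
        uncached * pricing.prompt_per_mtok
        + cached_prompt_tokens * cached_rate
        + completion_tokens * pricing.completion_per_mtok
    )
    return dollars_times_1m / 1_000_000


# Placeholder rates: $3 per million prompt tokens, $15 per million completion tokens.
pricing = Pricing(prompt_per_mtok=3.0, completion_per_mtok=15.0)
no_cache = call_cost(pricing, prompt_tokens=10_000, completion_tokens=500)
with_cache = call_cost(pricing, prompt_tokens=10_000, completion_tokens=500,
                       cached_prompt_tokens=8_000)
```

With these placeholder numbers the call drops from $0.0375 to $0.0159, a saving of about 58% rather than 90%, because the uncached prompt tokens and the completion tokens are still billed at full price.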

4. The "Weighted Performance Score"

To compare two system designs (e.g., Single-Agent vs. Supervisor-Worker), use a weighted formula.

Example Formula:

TOTAL_SCORE = (Accuracy * 0.70) - (Latency_Seconds * 0.10) - (Cost_in_Dollars * 0.20)

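This formula says that Quality matters most, but the score is penalized as the system gets slower or more expensive. The three terms use different units, so fix the scale of each one (for example, whether Accuracy is a fraction or a percentage) before comparing designs, and tune the weights to your product.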
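The example formula translates directly into code. In this sketch Accuracy is assumed to be a percentage from 0 to 100, and the figures for the two designs are made up purely to show the comparison; `RunStats` and `total_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    accuracy: float         # assumed scale: percentage, 0-100
    latency_seconds: float  # end-to-end latency
    cost_dollars: float     # cost per call


def total_score(stats: RunStats, w_accuracy: float = 0.70,
                w_latency: float = 0.10, w_cost: float = 0.20) -> float:
    """TOTAL_SCORE = Accuracy * 0.70 - Latency_Seconds * 0.10 - Cost_in_Dollars * 0.20."""
    return (stats.accuracy * w_accuracy
            - stats.latency_seconds * w_latency
            - stats.cost_dollars * w_cost)


# Illustrative, invented measurements for two designs.
single_agent = RunStats(accuracy=88.0, latency_seconds=4.0, cost_dollars=0.02)
supervisor_worker = RunStats(accuracy=95.0, latency_seconds=12.0, cost_dollars=0.15)

scores = {
    "single_agent": total_score(single_agent),
    "supervisor_worker": total_score(supervisor_worker),
}
best = max(scores, key=scores.get)
```

On this scale the accuracy term dominates: a per-call cost of a few cents barely moves the score. If cost or latency should carry real weight, rescale those terms (for example, cost per 1,000 calls) or raise their weights.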

5. Visualizing the Balance

```mermaid
graph TD
    A[Performance Score] --> B[Accuracy: High Weight]
    A --> C[Latency: Low Weight]
    A --> D[Cost: Medium Weight]
    B --> E{Is it Correct?}
    C --> F{Is it Fast?}
    D --> G{Is it Profitable?}
    E & F & G --> H[Decision: Deploy or Optimize]
```

6. Summary

Accuracy is measured via tests or LLM-judges.
Latency is measured in seconds/milliseconds.
Cost is measured in dollars/tokens.
An architect uses a Weighted Score to make trade-off decisions.

In the next lesson, we look at where to run these tests: Building a Custom Evaluation (Eval) Suite.

Interactive Quiz

Why shouldn't you optimize for "Accuracy" alone?
What is "TTFT" and why is it important for User Experience (UX)?
How does "Prompt Caching" affect the Cost metric?
Scenario: You change your model from Sonnet to Haiku. Accuracy drops from 95% to 88%, but Cost drops by 90%. Is this a good trade? How would you decide?
