Lesson 4: Analyzing and Interpreting Eval Results


Master the narrative of the numbers. Learn how to look past simple percentages to identify systemic patterns of failure in your AI evaluations, and how to ignore statistical noise.


Module 11: Evaluating Agent Performance


A spreadsheet showing "85% Accuracy" is just the beginning. The Architect's job is to find the pattern in the 15% failure. Why did those tests fail? Was it a formatting issue, a logical hallucination, or a context-window overflow?

In this lesson, we learn how to "De-noise" evaluation data and identify the Systemic Root Causes of poor performance.


1. Quantitative Analysis: The Variance Problem

Because LLMs are non-deterministic, running the same test twice can give two different results. Even at temperature 0, batching and floating-point effects can still cause occasional drift.

  • The Move: Run every eval case at least 3-5 times (at temperature 0).
  • Calculate the Mean Accuracy and the Stability (standard deviation) for each case.
  • If a case passes only 2 of 5 runs, it is not "Stable," and your prompt is likely too ambiguous.
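The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical data shape where each case ID maps to a list of pass/fail runs (1 = pass, 0 = fail); the function name `stability_report` is made up for this example.

```python
from statistics import mean, stdev

def stability_report(results_per_case):
    """For each eval case, compute mean pass rate, standard deviation
    across repeated runs, and a simple stability flag (all runs agree)."""
    report = {}
    for case_id, runs in results_per_case.items():
        report[case_id] = {
            "mean": mean(runs),
            "stdev": stdev(runs) if len(runs) > 1 else 0.0,
            "stable": all(r == runs[0] for r in runs),
        }
    return report

runs = {
    "case_01": [1, 1, 1, 1, 1],  # stable pass
    "case_02": [1, 0, 1, 0, 1],  # flaky: prompt is too ambiguous
}
print(stability_report(runs))
```

A case like `case_02` would pass a single-run eval 60% of the time while hiding real instability, which is exactly what the repeated-run protocol surfaces.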

2. Qualitative Analysis: Error Bucketization

Don't just look at a "Red" test. Categorize the failure.

  • Bucket 1: Formatting. The JSON had a syntax error.
  • Bucket 2: Reasoning. The model chose the wrong tool.
  • Bucket 3: Factuality. The model invented a date.
  • Bucket 4: Constraint. The model exceeded the token limit.

By "Bucketizing" your errors, you know exactly what to change. If 80% of errors are in Bucket 1, you need to improve your Schema (Module 8), not your Reasoning instructions (Module 7).
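Bucketizing is easy to automate once each failure is tagged. A minimal sketch, assuming a hypothetical failure log where each failed case has already been labeled with one of the buckets above:

```python
from collections import Counter

# Hypothetical failure log: each failed case tagged with a bucket.
failures = [
    {"case": "c01", "bucket": "formatting"},
    {"case": "c02", "bucket": "formatting"},
    {"case": "c03", "bucket": "reasoning"},
    {"case": "c04", "bucket": "factuality"},
    {"case": "c05", "bucket": "formatting"},
]

counts = Counter(f["bucket"] for f in failures)
total = sum(counts.values())
for bucket, n in counts.most_common():
    print(f"{bucket}: {n} ({n / total:.0%})")
```

With formatting at 60% of failures here, the data points at the schema, not the reasoning instructions.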


3. The "Cost-Performance" Pareto Front

Plot your different model/prompt combinations on a graph.

  • X-Axis: Cost per thousand requests.
  • Y-Axis: Accuracy score.
  • The Sweet Spot: The point where you get the highest accuracy before the cost curve becomes "Vertical" (diminishing returns).
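Finding the Pareto front can also be done programmatically. A minimal sketch, assuming hypothetical `(cost_per_1k, accuracy, label)` tuples for each model/prompt combination; the function name `pareto_front` is made up for this example:

```python
def pareto_front(points):
    """Return the configurations not dominated by any other.
    A config dominates another if it costs no more AND is at least
    as accurate, and is strictly better on at least one axis."""
    front = []
    for cost, acc, label in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for c, a, _ in points
        )
        if not dominated:
            front.append((cost, acc, label))
    return sorted(front)

configs = [
    (0.50, 0.72, "small-model"),
    (2.00, 0.85, "mid-model"),
    (2.50, 0.84, "mid-model-long-prompt"),  # dominated by mid-model
    (8.00, 0.88, "large-model"),
]
print(pareto_front(configs))
```

Anything off the front (like the dominated config above) can be discarded; the "sweet spot" is then a judgment call among the surviving points.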

4. Visualizing Error Bucketization

```mermaid
graph TD
    A[Total Failures: 100] --> B[Bucket: Formatting - 60%]
    A --> C[Bucket: Reasoning - 25%]
    A --> D[Bucket: Hallucination - 15%]
    B --> B1[Action: Fix JSON Schema]
    C --> C1[Action: Decompose Prompts]
    D --> D1[Action: Add Context Anchors]
```

5. Summary

  • Stability is just as important as Accuracy.
  • Bucketization identifies the root cause of systemic failure.
  • Use Pareto Analysis to find the most efficient model for your budget.

In the final lesson of this module, we look at the cycle of improvement: Iterative Improvement based on Performance Data.


Interactive Quiz

  1. Why should you run an evaluation case multiple times?
  2. What is "Error Bucketization"?
  3. How does stability (standard deviation) help diagnose a "Loose" prompt?
  4. Scenario: Your eval shows 90% accuracy. You look at the errors and see they are 100% "JSON Parsing Errors." What is your next architectural step?

