Module 11: Evaluating Agent Performance

Lesson 5: Iterative Improvement based on Performance Data

Building an AI system is not a "once and done" task; it is a cycle. In this final lesson of Module 11, we look at the "Improvement Loop" that can take a prototype from, for example, 60% accuracy to 99% accuracy.

This cycle is the primary day-to-day workflow of a Certified Architect.

1. Step 1: The "Failure Audit"

Don't fix what isn't broken.

Run your Eval Suite.
Filter for the Failed Cases.
Read the <thinking> logs for those specific cases to see where the model's logic went off the rails.
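The audit steps above can be sketched in a few lines of Python. This is only a minimal sketch: the `EvalResult` record and its fields (`case_id`, `passed`, `thinking`) are assumptions made for illustration, and the sample data is invented. Adapt the shape to whatever your own Eval Suite emits.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    # Hypothetical record shape; not the format of any specific eval tool.
    case_id: str
    passed: bool
    thinking: str  # the model's <thinking> log for this case


def failure_audit(results: list[EvalResult]) -> list[EvalResult]:
    """Return only the failed cases, so the audit focuses on what is broken."""
    return [r for r in results if not r.passed]


# Illustrative data, not real eval output.
results = [
    EvalResult("case-001", True, "User wants a refund; policy allows it."),
    EvalResult("case-002", False, "Ignoring the 30-day limit; approving refund."),
    EvalResult("case-003", True, "Order is within the limit; approving."),
]

failed = failure_audit(results)
for r in failed:
    # Read each failed case's reasoning to see where the logic went wrong.
    print(f"{r.case_id}: {r.thinking}")
```

Here only `case-002` is printed, and its log shows the model skipping a constraint, which is exactly the kind of finding Step 2 acts on.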

2. Step 2: The "Surgical" Prompt Edit

Many developers "Over-Correct." If the model fails one test, they rewrite the entire prompt.

The Move: Make the smallest possible change to the instruction that addresses the specific failure found in Step 1.
Example: If the model missed a constraint, don't rewrite the whole role; just emphasize the constraint in the "Guardrails" section (Module 7).
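One way to check that an edit really is "surgical" is to diff the old and new prompt and look at how many lines changed. The sketch below uses Python's standard `difflib` module; the prompt text, the section names, and the `changed_lines` helper are illustrative assumptions, not part of the course material.

```python
import difflib

# Illustrative prompt text; the section layout is an assumption for this sketch.
old_prompt = """Role: You are a refund assistant.
Guardrails:
- Refunds are allowed within 30 days.
Output: Reply in JSON."""

# Surgical edit: only the Guardrails line for the missed constraint changes.
new_prompt = """Role: You are a refund assistant.
Guardrails:
- NEVER approve a refund after 30 days. Check the order date first.
Output: Reply in JSON."""


def changed_lines(old: str, new: str) -> list[str]:
    """Return the added and removed lines between two prompt versions."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    body = diff[2:]  # the first two lines are the '---' / '+++' file headers
    return [line for line in body if line.startswith(("+", "-"))]


for line in changed_lines(old_prompt, new_prompt):
    print(line)
```

A diff of one removed line and one added line, both inside the Guardrails section, is a good sign. If the diff touches the role, the output format, and the guardrails all at once, it becomes hard to tell which change caused a later improvement or regression.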

3. Step 3: Regression Testing (The Verify Phase)

Run the Eval Suite again.
Check the Pass Rate: Did it go up?
CRITICAL: Check the previously passing cases. Did your edit break something that used to work? (This is a "Regression").

If Accuracy went up and no Regressions occurred, you have successfully "Iterated."
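The verify phase can be expressed as a comparison between two result sets. This is a minimal sketch, assuming each eval run is summarized as a dictionary from case ID to pass/fail; the function names and the sample data are hypothetical.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that passed before the edit but fail (or are missing) after it."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )


def iterated(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """True only if the pass rate went up AND nothing that used to pass broke."""
    improved = pass_rate(candidate) > pass_rate(baseline)
    return improved and not find_regressions(baseline, candidate)


# Illustrative data: the edit fixes two cases but breaks one.
baseline = {"case-001": True, "case-002": False, "case-003": True, "case-004": False}
candidate = {"case-001": True, "case-002": True, "case-003": False, "case-004": True}

print(pass_rate(baseline), pass_rate(candidate))
print(find_regressions(baseline, candidate))
print(iterated(baseline, candidate))
```

In this example the pass rate rises from 0.5 to 0.75, yet `case-003` regressed, so `iterated` returns `False`. Checking the headline number alone would have hidden the break.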

4. Visualizing the Improvement Loop

graph TD
    A[Baseline Eval] --> B[Failure Audit]
    B --> C[Hypothesis: 'I need a tighter Schema']
    C --> D[Surgical Edit]
    D --> E[Regression Test]
    E -->|Success| F[Next Goal]
    E -->|Failure/Regress| G[Pivot Hypothesis]
    G --> D
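The same loop can be sketched as code. This is one possible reading of the diagram under stated assumptions: each hypothesis is modeled as a function that edits the prompt, `run_eval` stands in for your Eval Suite, and `toy_eval` is a deliberately simplistic substring check invented for this example. None of these names come from a real library.

```python
from typing import Callable

Results = dict[str, bool]  # case id -> passed?


def improvement_loop(
    prompt: str,
    hypotheses: list[Callable[[str], str]],
    run_eval: Callable[[str], Results],
) -> str:
    """Try each prompt edit in turn; keep it only if more cases pass
    and no previously passing case breaks."""
    baseline = run_eval(prompt)                     # A: Baseline Eval
    for edit in hypotheses:                         # C: one hypothesis at a time
        if all(baseline.values()):                  # B: audit finds no failures
            break
        candidate_prompt = edit(prompt)             # D: Surgical Edit
        candidate = run_eval(candidate_prompt)      # E: Regression Test
        regressed = any(ok and not candidate[cid] for cid, ok in baseline.items())
        improved = sum(candidate.values()) > sum(baseline.values())
        if improved and not regressed:
            prompt, baseline = candidate_prompt, candidate  # F: Next Goal
        # Otherwise discard the edit and move on (G: Pivot Hypothesis).
    return prompt


def toy_eval(prompt: str) -> Results:
    # Stand-in for a real Eval Suite: checks the prompt text, not a model.
    return {
        "returns-json": "JSON" in prompt,
        "respects-limit": "30 days" in prompt,
        "stays-polite": "rude" not in prompt,
    }


edits = [
    # Fixes one case but breaks another, so it should be rejected.
    lambda p: p + " Be rude if the 30 days limit is exceeded.",
    # Fixes the failing case without side effects, so it should be kept.
    lambda p: p + " Never approve refunds after 30 days.",
]

final = improvement_loop("Reply in JSON.", edits, toy_eval)
print(final)
```

The first edit is thrown away because it trades one failure for another; only the second survives. In practice `run_eval` would call the model, and the hypotheses would come from the Failure Audit rather than a fixed list.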

5. Summary of Module 11

In Module 11, you have mastered the "Science" of AI.

You used Benchmarks to set the floor (Lesson 1).
You used Scoring to balance trade-offs (Lesson 2).
You built an Eval Suite for automation (Lesson 3).
You used Analysis to find root causes (Lesson 4).
You adopted the Iterative Cycle for refinement (Lesson 5).

In Module 12, we look at the money: Cost and Token Optimization.

Interactive Quiz

Why should you only make "Surgical" (small) edits to prompts?
What is a "Regression" in an AI evaluation?
Why is it important to read the <thinking> log during the failure audit?
Scenario: You are at 92% accuracy. Your last 3 iterations have not increased the score. What is this called, and what architectural "Pivot" might you try? (e.g., Change model? Decompose task?)

Reference Video:

Lesson 5: Iterative Improvement Cycles