Lesson 3: Prompt Caching and Reuse


Master the 90% discount. Learn how to use Anthropic's Prompt Caching to store large system prompts, tool definitions, and few-shot examples on the API side, so repeated requests are billed at a fraction of the cost.


Module 12: Cost and Token Optimization

Lesson 3: Caching Strategies and Prompt Reuse

The introduction of Prompt Caching changed the game for AI Architects. Previously, every turn of a conversation was billed from scratch. Now, if your prompt shares a common prefix with a recent request (such as a massive system prompt), you pay full price only for the new tokens. The cached tokens are billed at a 90% discount.

In this lesson, we look at how to structure your prompts to maximize "Cache Hits."


1. What is Prompt Caching?

Prompt Caching allows the API to "remember" the first part of a prompt across requests.

  • Minimum size: The cacheable block must be at least 1,024 tokens (some smaller models, such as Haiku, require 2,048).
  • TTL (Time to Live): The cache lasts 5 minutes by default. Every cache hit resets the timer.
  • Write surcharge: Creating a cache entry costs roughly 25% more than normal input tokens; the discount applies to subsequent reads, so caching pays off from the second request onward.
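The economics above can be sketched with a small calculation. The prices below are illustrative assumptions (per million input tokens) chosen to match the 1.25x write / 0.1x read ratios; check Anthropic's current pricing page for real figures:

```python
# Illustrative cost comparison for a request with a cacheable prefix.
# All prices are assumptions for this sketch, in $ per million tokens.
BASE_PRICE = 3.00          # normal input tokens
CACHE_WRITE_PRICE = 3.75   # ~25% surcharge when the cache entry is created
CACHE_READ_PRICE = 0.30    # 90% discount on cache hits

def request_cost(cached_tokens: int, fresh_tokens: int, cache_hit: bool) -> float:
    """Dollar cost of one request: a cacheable prefix plus fresh tokens."""
    prefix_price = CACHE_READ_PRICE if cache_hit else CACHE_WRITE_PRICE
    return (cached_tokens * prefix_price + fresh_tokens * BASE_PRICE) / 1_000_000

# A 10,000-token static prefix plus a 200-token user query:
first = request_cost(10_000, 200, cache_hit=False)  # first turn writes the cache
later = request_cost(10_000, 200, cache_hit=True)   # later turns read it
print(f"first turn: ${first:.5f}, later turns: ${later:.5f}")
```

The first turn is slightly more expensive than an uncached request, but every turn inside the TTL window costs roughly a tenth as much.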

2. Setting "Cache Breakpoints"

Caching is not a global switch you turn on. You mark explicit breakpoints (via the `cache_control` parameter) that tell the API exactly where to "freeze" the prompt.

The Architect's Strategy:

  1. The Static Block: Your System Prompt + Tool Definitions. (Set a breakpoint here).
  2. The Example Block: Your few-shot examples. (Set a breakpoint here).
  3. The Data Block: The user's specific query. (Do not cache this).

By caching the Static and Example blocks, you only pay for the "User Query" at full price.
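The layering above maps directly onto the request body of the Messages API. The sketch below uses the real `cache_control` field; the model name and prompt strings are placeholders:

```python
# Sketch of a Messages API request body with two cache breakpoints:
# one after the static system prompt, one after the few-shot examples.
# SYSTEM_PROMPT / EXAMPLES and the model name are placeholders; verify the
# exact field shapes against Anthropic's prompt-caching documentation.
SYSTEM_PROMPT = "You are a support agent. <several thousand tokens of rules>"
EXAMPLES = "<few-shot examples, ~2,000 tokens>"

request_body = {
    "model": "claude-sonnet-4",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        # 1. Static block: breakpoint set at its end.
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        # 2. Example block: second breakpoint.
        {"type": "text", "text": EXAMPLES,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        # 3. Data block: the user's query, never cached.
        {"role": "user", "content": "Where is my order?"},
    ],
}
```

Everything up to the last `cache_control` marker is eligible for caching; only the `messages` content is billed at full price on every turn.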


3. The "Prefix" Rule

Caching only works if the beginning of the prompt is identical.

  • Bad Pattern: [Current Time] + [System Prompt] -> The time changes every second, so the cache "Breaks" immediately.
  • Good Pattern: [System Prompt] + [Current Time] -> The System Prompt can be cached because it’s at the very beginning.
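The two patterns can be contrasted in a few lines. This is a simplified sketch with a placeholder system prompt, concatenating strings rather than building a real API request:

```python
import datetime

STATIC_SYSTEM_PROMPT = "<large, unchanging system prompt>"  # placeholder

def build_prompt_bad(query: str) -> str:
    """Bad: the timestamp leads, so no two requests share a prefix."""
    now = datetime.datetime.now().isoformat()
    return f"Current time: {now}\n{STATIC_SYSTEM_PROMPT}\n{query}"

def build_prompt_good(query: str) -> str:
    """Good: the static prompt leads, so every request shares a cacheable prefix."""
    now = datetime.datetime.now().isoformat()
    return f"{STATIC_SYSTEM_PROMPT}\nCurrent time: {now}\n{query}"

# Only the good pattern produces a stable prefix across requests:
a, b = build_prompt_good("query 1"), build_prompt_good("query 2")
assert a[:len(STATIC_SYSTEM_PROMPT)] == b[:len(STATIC_SYSTEM_PROMPT)]
```

The same rule applies at the message level: any dynamic content (timestamps, user IDs, retrieved documents) belongs after the cache breakpoint, never before it.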

4. Visualizing the Cache Hit

```mermaid
graph TD
    A[Request 1: Empty Cache] --> B[Processing: 100% Price]
    B --> C[Cache Entry Created]
    D[Request 2: Same System Prompt] --> E{Cache Match?}
    E -->|Yes| F[Processing: 10% Price]
    E -->|No| G[Processing: 100% Price]
```

5. Summary

  • Prompt Caching yields a 90% discount on cached input tokens (cache writes cost slightly more than normal input).
  • Always place Static Content (Rules, Tools, Docs) at the absolute beginning of the prompt.
  • Use Breakpoints to define your cache layers.

In the next lesson, we look at the trade-off of these choices: Balancing Performance and Cost.


Interactive Quiz

  1. What is the "90% Discount" associated with prompt caching?
  2. Why must the beginning (prefix) of a prompt be static for caching to work?
  3. What happens if you place the "User's current time" at the start of your prompt?
  4. Scenario: Your system prompt is 500 tokens. Your few-shot examples are 2,000 tokens. Where would you set your cache breakpoint to maximize savings?
