
Scaling & Optimization: Handling the Load
How to survive Black Friday. Learn about Autoscaling, GPU Inference, TF-TRT, and optimizing latency for high-throughput serving.
Performance Engineering
Getting the model to a "Running" state is Step 1. Getting it to run fast and cheaply is Step 2.
The two metrics you fight are:
- Latency: "Time per request" (e.g., 50ms).
- Throughput: "Requests per second" (e.g., 10,000 QPS).
1. Autoscaling Strategies
Vertex AI Prediction scales based on CPU/GPU Utilization.
- Target Utilization: Defaults to 60%.
- If CPU load > 60%, add a node.
- If CPU load < 60%, remove a node.
The Cold Start Problem: It takes ~2-3 minutes to spin up a new node.
- Risk: If traffic spikes instantly (Black Friday start), the scaler is too slow, and users see errors.
- Fix: Min Replica Count. Set `min_replica_count=10` before the event starts to pre-warm the fleet.
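Here is a minimal deployment sketch with the google-cloud-aiplatform Python SDK, assuming a model is already uploaded to the Model Registry; the project, region, model ID, replica counts, and machine type are placeholders, and parameter names may vary slightly between SDK versions:

```python
from google.cloud import aiplatform

# Placeholder project / region / model ID -- adjust to your environment.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# Pre-warm the fleet: min_replica_count keeps 10 nodes hot before the event,
# while autoscaling can still add nodes (up to max_replica_count) once the
# average CPU utilization crosses the 60% target.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=10,
    max_replica_count=50,
    autoscaling_target_cpu_utilization=60,
)
```

Remember to lower `min_replica_count` again after the event, or you keep paying for ten idle nodes.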
2. Hardware Acceleration for Serving
Do you need a GPU for Serving?
- Recommendation Models (Tabular): No. Use CPU. It's IO-bound, not compute-bound.
- ResNet/BERT (Vision/Text): Yes. Use T4 GPUs.
NVIDIA TensorRT (TF-TRT): TF-TRT is TensorFlow's integration with NVIDIA's TensorRT compiler. It optimizes TensorFlow graphs specifically for NVIDIA GPUs by fusing layers and optimizing memory usage.
- Impact: Can reduce latency by 2x-4x.
- Action: Convert your `SavedModel` using TF-TRT before deploying.
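A minimal conversion sketch, assuming a GPU build of TensorFlow with TensorRT libraries available; the paths are placeholders and the exact API surface differs slightly across TF 2.x versions:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Rewrite an existing SavedModel into a TF-TRT optimized SavedModel.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="export/savedmodel",   # placeholder input path
    conversion_params=params,
)
converter.convert()                      # fuses supported ops into TensorRT engine nodes
converter.save("export/savedmodel_trt")  # deploy this directory instead of the original
```

FP16 is the usual first choice on T4 GPUs; fall back to FP32 if accuracy regresses.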
3. Server-Side Batching
This is counter-intuitive. Scenario: You receive 100 requests per second.
- Naive Way: Process 1 at a time. This thrashes the GPU memory bandwidth.
- Smart Way: Wait 5ms, collect 5 requests into a Batch, and process them all at once.
- Result: Throughput increases massively. Latency per user increases slightly (by up to the 5ms wait), but the system doesn't crash.
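Serving stacks such as TensorFlow Serving implement this natively, so you rarely write it yourself. The toy sketch below only illustrates the mechanism; every name in it (`fake_model`, `batcher`, the 5ms window, the batch cap of 32) is a placeholder, not a real serving API:

```python
import asyncio
import time

MAX_BATCH = 32      # cap on requests per forward pass
MAX_WAIT_S = 0.005  # wait at most 5 ms to fill a batch

def fake_model(batch):
    """Stand-in for one batched GPU forward pass."""
    return [x * 2 for x in batch]

async def handle_request(queue, x):
    """Per-request handler: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def batcher(queue):
    """Collect requests for up to MAX_WAIT_S, then run one batched call."""
    while True:
        x, fut = await queue.get()                 # block until the first request
        batch, futures = [x], [fut]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            try:
                x, fut = queue.get_nowait()        # sweep anything already queued
                batch.append(x)
                futures.append(fut)
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.0005)        # yield so new requests can arrive
        for f, y in zip(futures, fake_model(batch)):
            f.set_result(y)                        # fan results back to each caller

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))   # keep a reference to the worker task
    results = await asyncio.gather(*(handle_request(queue, i) for i in range(100)))
    print(f"Served {len(results)} requests via batched forward passes")

if __name__ == "__main__":
    asyncio.run(main())
```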
4. Visualizing Scalability
```mermaid
graph TD
    Traffic[User Requests] --> LB[Load Balancer]
    subgraph "Vertex AI Endpoint"
        LB --> Replica1["Replica 1 (CPU < 60%)"]
        LB --> Replica2["Replica 2 (CPU < 60%)"]
        Replica1 -.->|"Load > 60%"| AutoScaler
        AutoScaler -.->|Events| Replica3["Provision Replica 3"]
    end
    style AutoScaler fill:#EA4335,stroke:#fff,stroke-width:2px,color:#fff
```
5. Summary
- Autoscaling saves money but has a lag time.
- Min Replicas protects against traffic spikes.
- TensorRT optimizes Deep Learning models for inference speed.
- Server-Side Batching improves Throughput at the cost of slightly higher Latency.
In the next lesson, we connect the dots. We move from "Manual Steps" to Pipelines.