Troubleshooting Training: Common Failures

Why did my job fail? Debugging OOM errors, NaN losses, and 'Permission Denied'.

The "Red Logs"

The exam loves troubleshooting questions. You see an error message. What is the root cause?


1. Out of Memory (OOM)

  • Symptom: ResourceExhaustedError: OOM when allocating tensor...
  • Cause: The batch size is too large for the GPU VRAM.
  • Fix:
    1. Decrease Batch Size.
    2. Use Gradient Accumulation (simulates a larger effective batch).
    3. Upgrade GPU (T4 -> A100).
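The accumulation trick in fix 2 works because averaging the gradients of several micro-batches and applying a single update is mathematically the same as one step on the full batch. A minimal pure-Python sketch with a hypothetical one-parameter model (`y = w * x`, squared loss) and toy data:

```python
# Gradient accumulation sketch: simulate a large batch by averaging
# gradients over several micro-batches before one optimizer step.

def grad(w, batch):
    # dL/dw for L = mean((w*x - y)^2) over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr=0.01):
    # Average micro-batch gradients, then apply a single update.
    g = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)
    return w - lr * g

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]   # 2 micro-batches of size 2

w_accum = accumulated_step(1.0, micro_batches)        # accumulated update
w_full = 1.0 - 0.01 * grad(1.0, data)                 # full-batch update
print(w_accum, w_full)  # identical updates
```

With equal-sized micro-batches the two updates match exactly, so you get big-batch behavior with small-batch memory.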

2. Loss is NaN (Not a Number)

  • Symptom: Loss: NaN after Step 100.
  • Cause: Gradient Explosion. Values overflow the float32 range (or float16 under mixed precision), becoming inf and then NaN.
  • Fix:
    1. Clip Gradients (clipnorm=1.0).
    2. Decrease Learning Rate.
    3. Check for dirty data (e.g., dividing by zero).
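Clipping rescales an oversized gradient instead of letting it blow up the weights. A minimal pure-Python sketch of norm clipping, mirroring the effect of `clipnorm=1.0` on a Keras optimizer (the toy gradient values are hypothetical):

```python
import math

def clip_by_norm(grads, clip_norm=1.0):
    # Rescale the gradient vector so its L2 norm is at most clip_norm,
    # preserving its direction. This is the idea behind clipnorm=1.0.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm:
        return grads
    scale = clip_norm / norm
    return [g * scale for g in grads]

exploding = [300.0, 400.0]          # norm = 500: would wreck the update
clipped = clip_by_norm(exploding)
print(clipped)  # rescaled to unit norm, direction unchanged
```

Because only the magnitude changes, training keeps moving in the same direction, just with a bounded step.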

3. Permission Denied (403)

  • Symptom: google.api_core.exceptions.PermissionDenied: 403 Access Not Configured
  • Cause: The Service Account running the training job does not have permission to read the GCS bucket or write logs.
  • Fix: Grant the Storage Object Admin role to the Compute Engine default Service Account (or the custom SA attached to the job).

4. Slow Training (Starvation)

  • Symptom: Training takes 1 week instead of 1 day; GPU utilization is low.
  • Cause: Input pipeline bottleneck. The GPU sits idle ("starved") waiting for the CPU to deliver the next batch.
  • Fix: Use tf.data.Dataset.prefetch() and cache() to overlap data loading with computation.
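What prefetch() buys you is overlap: a background thread prepares the next batch while the accelerator trains on the current one. A minimal pure-Python sketch of that producer/consumer pattern with a bounded queue (the sleep timings and function names are illustrative, not TensorFlow internals):

```python
import queue
import threading
import time

def producer(q, n):
    # Simulates the input pipeline: "load" each batch and hand it over
    # while the consumer is still busy with the previous one.
    for i in range(n):
        time.sleep(0.01)          # pretend I/O + preprocessing cost
        q.put(i)
    q.put(None)                   # sentinel: no more batches

def train_with_prefetch(n=5):
    q = queue.Queue(maxsize=2)    # small buffer, like prefetch(2)
    threading.Thread(target=producer, args=(q, n), daemon=True).start()
    seen = []
    while True:
        batch = q.get()
        if batch is None:
            break
        time.sleep(0.01)          # pretend GPU training step
        seen.append(batch)
    return seen

print(train_with_prefetch())  # [0, 1, 2, 3, 4]
```

Without the buffer, each step would pay loading time plus compute time; with it, the two costs run concurrently, which is exactly why the GPU stops starving.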

