Troubleshooting Training: Common Failures

Why did my job fail? Debugging OOM errors, NaN losses, and 'Permission Denied'.

The "Red Logs"

The exam loves troubleshooting questions. You see an error message. What is the root cause?


1. Out of Memory (OOM)

  • Symptom: ResourceExhaustedError: OOM when allocating tensor...
  • Cause: The batch size is too large for the GPU VRAM.
  • Fix:
    1. Decrease Batch Size.
    2. Use Gradient Accumulation (simulates a larger effective batch).
    3. Upgrade GPU (T4 -> A100).
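The accumulation trick in fix 2 works because averaging the gradients of several micro-batches and applying a single update is mathematically the same as one step on the full batch. A minimal pure-Python sketch with a hypothetical one-parameter model (`y = w * x`, squared loss) and toy data:

```python
# Gradient accumulation sketch: simulate a large batch by averaging
# gradients over several micro-batches before one optimizer step.

def grad(w, batch):
    # dL/dw for L = mean((w*x - y)^2) over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr=0.01):
    # Average micro-batch gradients, then apply a single update.
    g = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)
    return w - lr * g

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]   # 2 micro-batches of size 2

w_accum = accumulated_step(1.0, micro_batches)        # accumulated update
w_full = 1.0 - 0.01 * grad(1.0, data)                 # full-batch update
print(w_accum, w_full)  # identical updates
```

With equal-sized micro-batches the two updates match exactly, so you get big-batch behavior with small-batch memory.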

2. Loss is NaN (Not a Number)

  • Symptom: Loss: NaN after Step 100.
  • Cause: Gradient Explosion. Values overflow the float32 range (or float16 under mixed precision), becoming inf and then NaN.
  • Fix:
    1. Clip Gradients (clipnorm=1.0).
    2. Decrease Learning Rate.
    3. Check for dirty data (e.g., dividing by zero).
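Clipping rescales an oversized gradient instead of letting it blow up the weights. A minimal pure-Python sketch of norm clipping, mirroring the effect of `clipnorm=1.0` on a Keras optimizer (the toy gradient values are hypothetical):

```python
import math

def clip_by_norm(grads, clip_norm=1.0):
    # Rescale the gradient vector so its L2 norm is at most clip_norm,
    # preserving its direction. This is the idea behind clipnorm=1.0.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm:
        return grads
    scale = clip_norm / norm
    return [g * scale for g in grads]

exploding = [300.0, 400.0]          # norm = 500: would wreck the update
clipped = clip_by_norm(exploding)
print(clipped)  # rescaled to unit norm, direction unchanged
```

Because only the magnitude changes, training keeps moving in the same direction, just with a bounded step.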

3. Permission Denied (403)

  • Symptom: google.api_core.exceptions.PermissionDenied: 403 Access Not Configured
  • Cause: The Service Account running the training job does not have permission to read the GCS bucket or write logs.
  • Fix: Grant the Storage Object Admin role to the Compute Engine default Service Account (or the custom SA attached to the job).

4. Slow Training (Starvation)

  • Symptom: Training takes 1 week instead of 1 day; GPU utilization is low.
  • Cause: Input pipeline bottleneck. The GPU sits idle ("starved") waiting for the CPU to deliver the next batch.
  • Fix: Use tf.data.Dataset.prefetch() and cache() to overlap data loading with computation.
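What prefetch() buys you is overlap: a background thread prepares the next batch while the accelerator trains on the current one. A minimal pure-Python sketch of that producer/consumer pattern with a bounded queue (the sleep timings and function names are illustrative, not TensorFlow internals):

```python
import queue
import threading
import time

def producer(q, n):
    # Simulates the input pipeline: "load" each batch and hand it over
    # while the consumer is still busy with the previous one.
    for i in range(n):
        time.sleep(0.01)          # pretend I/O + preprocessing cost
        q.put(i)
    q.put(None)                   # sentinel: no more batches

def train_with_prefetch(n=5):
    q = queue.Queue(maxsize=2)    # small buffer, like prefetch(2)
    threading.Thread(target=producer, args=(q, n), daemon=True).start()
    seen = []
    while True:
        batch = q.get()
        if batch is None:
            break
        time.sleep(0.01)          # pretend GPU training step
        seen.append(batch)
    return seen

print(train_with_prefetch())  # [0, 1, 2, 3, 4]
```

Without the buffer, each step would pay loading time plus compute time; with it, the two costs run concurrently, which is exactly why the GPU stops starving.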

