
Distributed Training: From One GPU to Thousands
How to break the memory limit. Learn about Data Parallelism, Model Parallelism, reduction servers, and how to use Vertex AI Custom Training jobs.
Breaking the Single Machine Barrier
If your model fits on one GPU, life is easy. If it doesn't (massive batch size or massive parameter count), or if training on a single GPU would simply take too long, you need Distributed Training.
For the exam, you need to know how to configure this in Vertex AI and which strategy to pick.
1. Strategies: Data vs Model Parallelism
Data Parallelism (The Standard)
- Scenario: The model fits on the GPU, but the data is huge. It would take 1 month to train.
- Solution: Copy the model to 10 GPUs. Give each GPU a different chunk of data (Batch). Average the gradients at the end of the step.
- Strategies: MultiWorkerMirroredStrategy (TensorFlow), DistributedDataParallel (PyTorch).
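The "average the gradients" step above can be sketched in plain Python. This is a toy simulation of the All-Reduce averaging, not real multi-GPU code, and the per-worker gradient values are made up for illustration:

```python
# Toy simulation of data-parallel gradient averaging (All-Reduce).
# Each "worker" computes a gradient on its own data shard; the averaged
# gradient is then applied identically on every replica, keeping all
# model copies in sync.

def all_reduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise."""
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

# Hypothetical gradients from 4 workers for a 3-parameter model
grads = [
    [0.2, -0.4, 0.1],
    [0.4, -0.2, 0.3],
    [0.0, -0.6, 0.1],
    [0.2, -0.4, 0.1],
]
avg = all_reduce_mean(grads)
print(avg)
```

After this step every replica holds the same averaged gradient, which is why data parallelism scales training speed without changing the model's math (apart from the larger effective batch size).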
Model Parallelism (The Giant)
- Scenario: The model itself is too big for the GPU memory (e.g., GPT-3).
- Solution: Put Layer 1-10 on GPU 1, Layer 11-20 on GPU 2.
- Complexity: High communication cost, since activations must constantly move between GPUs at the layer boundaries.
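The layer split can be sketched as a toy pipeline in plain Python. The "devices" here are just dictionary keys standing in for real GPUs; in a real framework each hand-off between them is a cross-GPU transfer, which is where the latency cost comes from:

```python
# Toy sketch of model parallelism: layers are assigned to different
# "devices", and the activation is handed off between them during the
# forward pass.

def layer(weight):
    """A fake layer: multiply the input by a weight."""
    return lambda x: x * weight

# Hypothetical 4-layer model split across 2 devices
model = {
    "gpu:0": [layer(2), layer(3)],    # layers 1-2
    "gpu:1": [layer(0.5), layer(4)],  # layers 3-4
}

def forward(x):
    for device in ("gpu:0", "gpu:1"):
        # In reality, the activation is copied across GPUs here.
        for f in model[device]:
            x = f(x)
    return x

print(forward(1.0))  # 1 * 2 * 3 * 0.5 * 4 = 12.0
```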
2. Vertex AI Custom Training Jobs
You don't SSH into 10 VMs and run python train.py.
You submit a Custom Job to Vertex AI. It spins up the cluster, runs the code, saves the model to GCS, and shuts down.
Architecture
- Chief (Master): Coordinates the work.
- Workers: Do the training.
- Parameter Server (Async): Stores model weights for asynchronous updates (legacy, mostly replaced by All-Reduce).
- Reduction Server (Sync): Dedicated Vertex AI replicas that aggregate gradients for synchronous All-Reduce training.
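Inside each replica, Vertex AI describes the cluster through environment variables (TF_CONFIG for its TensorFlow containers, plus CLUSTER_SPEC), so your script can tell whether it is the chief. A sketch of that role check follows; the JSON value and hostnames below are made up for illustration:

```python
import json
import os

# Made-up TF_CONFIG value for illustration; on Vertex AI this variable
# is populated automatically on every replica.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["training-chief-0:2222"],
        "worker": ["training-worker-0:2222",
                   "training-worker-1:2222"],
    },
    "task": {"type": "chief", "index": 0},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
task = tf_config["task"]
is_chief = task["type"] == "chief"

# Typical use: only the chief writes checkpoints and exports the model,
# so the replicas don't clobber each other's files.
print("role:", task["type"], "| writes checkpoints:", is_chief)
```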
Pre-built Containers
Vertex AI provides Docker images with TensorFlow/PyTorch/Scikit-learn pre-installed. You just provide your Python script.
3. Code Example: Submitting a Job
from google.cloud import aiplatform

# Initialize the SDK (project and bucket are placeholder values)
aiplatform.init(project="my-project", staging_bucket="gs://my-bucket")

# Define the job
job = aiplatform.CustomTrainingJob(
    display_name="my-distributed-job",
    script_path="task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12:latest",
)

# Run it
model = job.run(
    dataset=my_dataset,
    model_display_name="my-trained-model",
    args=["--epochs=50"],
    replica_count=4,  # 4 machines total: 1 chief + 3 workers
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,  # 1 GPU per replica
)
4. Summary
- Data Parallelism is for speed. Model Parallelism is for size.
- Vertex AI Custom Jobs handle the infrastructure (provisioning/teardown).
In the next lesson, we choose the chips. TPU vs GPU.