
Distributed Training: From One GPU to Thousands
How to break the memory limit. Learn about Data Parallelism, Model Parallelism, reduction servers, and how to use Vertex AI Custom Training jobs.
Breaking the Single Machine Barrier
If your model fits on one GPU, life is easy. If it doesn't (massive batch size or massive parameter count), or if training on a single GPU would simply take too long, you need Distributed Training.
For the exam, you need to know how to configure this in Vertex AI and which strategy to pick.
1. Strategies: Data vs Model Parallelism
Data Parallelism (The Standard)
- Scenario: The model fits on the GPU, but the data is huge. It would take 1 month to train.
- Solution: Copy the model to 10 GPUs. Give each GPU a different chunk of data (Batch). Average the gradients at the end of the step.
- Strategies: MultiWorkerMirroredStrategy (TensorFlow), DistributedDataParallel (PyTorch).
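The "average the gradients" step above can be sketched in plain Python. This is a toy simulation of the All-Reduce averaging, not real multi-GPU code, and the per-worker gradient values are made up for illustration:

```python
# Toy simulation of data-parallel gradient averaging (All-Reduce).
# Each "worker" computes a gradient on its own data shard; the averaged
# gradient is then applied identically on every replica, keeping all
# model copies in sync.

def all_reduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise."""
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

# Hypothetical gradients from 4 workers for a 3-parameter model
grads = [
    [0.2, -0.4, 0.1],
    [0.4, -0.2, 0.3],
    [0.0, -0.6, 0.1],
    [0.2, -0.4, 0.1],
]
avg = all_reduce_mean(grads)
print(avg)
```

After this step every replica holds the same averaged gradient, which is why data parallelism scales training speed without changing the model's math (apart from the larger effective batch size).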
Model Parallelism (The Giant)
- Scenario: The model itself is too big for the GPU memory (e.g., GPT-3).
- Solution: Put Layer 1-10 on GPU 1, Layer 11-20 on GPU 2.
- Complexity: High communication cost, since activations must constantly move between GPUs at the layer boundaries.
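The layer split can be sketched as a toy pipeline in plain Python. The "devices" here are just dictionary keys standing in for real GPUs; in a real framework each hand-off between them is a cross-GPU transfer, which is where the latency cost comes from:

```python
# Toy sketch of model parallelism: layers are assigned to different
# "devices", and the activation is handed off between them during the
# forward pass.

def layer(weight):
    """A fake layer: multiply the input by a weight."""
    return lambda x: x * weight

# Hypothetical 4-layer model split across 2 devices
model = {
    "gpu:0": [layer(2), layer(3)],    # layers 1-2
    "gpu:1": [layer(0.5), layer(4)],  # layers 3-4
}

def forward(x):
    for device in ("gpu:0", "gpu:1"):
        # In reality, the activation is copied across GPUs here.
        for f in model[device]:
            x = f(x)
    return x

print(forward(1.0))  # 1 * 2 * 3 * 0.5 * 4 = 12.0
```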
2. Vertex AI Custom Training Jobs
You don't SSH into 10 VMs and run python train.py.
You submit a Custom Job to Vertex AI. It spins up the cluster, runs the code, saves the model to GCS, and shuts down.
Architecture
- Chief (Master): Coordinates the work.
- Workers: Do the training.
- Parameter Server (Async): Stores model weights for asynchronous updates (legacy, mostly replaced by All-Reduce).
- Reduction Server (Sync): Dedicated Vertex AI replicas that aggregate gradients for synchronous All-Reduce training.
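Inside each replica, Vertex AI describes the cluster through environment variables (TF_CONFIG for its TensorFlow containers, plus CLUSTER_SPEC), so your script can tell whether it is the chief. A sketch of that role check follows; the JSON value and hostnames below are made up for illustration:

```python
import json
import os

# Made-up TF_CONFIG value for illustration; on Vertex AI this variable
# is populated automatically on every replica.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["training-chief-0:2222"],
        "worker": ["training-worker-0:2222",
                   "training-worker-1:2222"],
    },
    "task": {"type": "chief", "index": 0},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
task = tf_config["task"]
is_chief = task["type"] == "chief"

# Typical use: only the chief writes checkpoints and exports the model,
# so the replicas don't clobber each other's files.
print("role:", task["type"], "| writes checkpoints:", is_chief)
```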
Pre-built Containers
Vertex AI provides Docker images with TensorFlow/PyTorch/Scikit-learn pre-installed. You just provide your Python script.
3. Code Example: Submitting a Job
from google.cloud import aiplatform

# Initialize the SDK (project and bucket are placeholder values)
aiplatform.init(project="my-project", staging_bucket="gs://my-bucket")

# Define the job
job = aiplatform.CustomTrainingJob(
    display_name="my-distributed-job",
    script_path="task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12:latest",
)

# Run it
model = job.run(
    dataset=my_dataset,
    model_display_name="my-trained-model",
    args=["--epochs=50"],
    replica_count=4,  # 4 machines total: 1 chief + 3 workers
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,  # 1 GPU per replica
)
4. Summary
- Data Parallelism is for speed. Model Parallelism is for size.
- Vertex AI Custom Jobs handle the infrastructure (provisioning/teardown).
In the next lesson, we choose the chips. TPU vs GPU.