Distributed Architectures: Parameter Server vs All-Reduce

How GPUs talk to each other. Understanding Ring All-Reduce, PS Strategy, and when to use NCCL.

The Network is the Computer

When you scale to 128 GPUs, the bottleneck isn't the math; it's the communication. Every GPU must agree on the new weights every few milliseconds.
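
A back-of-envelope calculation makes the point concrete. The model size and link speed below are illustrative numbers, not from the article:

```python
# Back-of-envelope (hypothetical numbers): how much data moves per
# synchronization step for a 1-billion-parameter model in fp32.
params = 1_000_000_000
grad_bytes = params * 4                  # fp32 gradients: 4 bytes each
link_bytes_per_s = 12.5e9                # a 100 Gbit/s link = 12.5 GB/s

print(grad_bytes / 1e9)                  # GB shipped per sync step
print(grad_bytes / link_bytes_per_s)     # seconds just to move it once
```

At 4 GB per synchronization, a single 100 Gbit/s link needs hundreds of milliseconds per transfer, while a GPU can finish its local gradient computation in far less. That gap is why the communication algorithm matters.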


1. Parameter Server Strategy (Async)

  • Architecture:
    • Workers: Calculate gradients. Send them to PS.
    • Parameter Servers (PS): Hold the global weights. Add gradients. Send new weights back.
  • Pros: Robust. If one worker dies, the job continues. Good for sparse, massive embedding tables (e.g., Wide & Deep models).
  • Cons: The PS's network bandwidth becomes the bottleneck as workers scale, and asynchronous updates mean workers often train on stale weights.
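
The push/pull round trip above can be sketched in a few lines. The class and function names here are hypothetical, and real systems run many workers concurrently over RPC; this is a sequential simulation of the division of labor:

```python
# Toy parameter-server round trip: workers compute gradients and push
# them; the PS applies them and hands back fresh weights.
class ParameterServer:
    def __init__(self, weights, lr=0.5):
        self.weights = list(weights)     # the global copy of the model
        self.lr = lr

    def push(self, grads):
        # Worker -> PS: apply gradients immediately (async-style SGD).
        for i, g in enumerate(grads):
            self.weights[i] -= self.lr * g

    def pull(self):
        # PS -> worker: hand back the latest weights.
        return list(self.weights)

def worker_step(ps, grad_fn):
    weights = ps.pull()                  # fetch current global weights
    ps.push(grad_fn(weights))            # compute locally, push back

ps = ParameterServer([1.0, 2.0])
worker_step(ps, lambda w: [2.0, 2.0])
print(ps.pull())                         # [0.0, 1.0]
```

Because `push` applies gradients the moment they arrive, two workers can interleave freely; that is the source of both the robustness (no barrier) and the staleness.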

2. Ring All-Reduce Strategy (Sync)

  • Architecture: No central server.
    • GPU 1 passes data to GPU 2.
    • GPU 2 passes to GPU 3...
    • GPU N passes to GPU 1.
  • Pros: Bandwidth optimal: each GPU sends roughly 2(N-1)/N times the gradient size per step, independent of N. Scales to thousands of GPUs.
  • Cons: Fragile and synchronous. If one GPU dies or straggles, the whole ring halts.
  • Tech: NVIDIA NCCL for GPU collectives; TensorFlow's gRPC-based ring collectives for cross-host communication.
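
The ring described above can be simulated in pure Python. This is a sketch of the classic two-phase algorithm (scatter-reduce, then all-gather) that ring all-reduce implementations such as NCCL's are based on, with lists standing in for GPU memory:

```python
# Pure-Python ring all-reduce simulation: N "GPUs" each hold a
# gradient vector split into N chunks. Phase 1 (scatter-reduce) sums
# chunks around the ring; phase 2 (all-gather) circulates the results.
def ring_all_reduce(grads):
    """grads: one equal-length gradient list per simulated GPU.
    Returns the per-GPU buffers; all end up holding the full sum."""
    n = len(grads)
    chunk = len(grads[0]) // n            # assumes length divisible by n
    bufs = [list(g) for g in grads]       # simulate per-GPU memory

    # Phase 1: scatter-reduce. Each step, GPU i sends one chunk to
    # GPU i+1, which adds it. After N-1 steps, GPU i holds the fully
    # summed chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n            # chunk GPU i passes this step
            for j in range(c * chunk, (c + 1) * chunk):
                bufs[(i + 1) % n][j] += bufs[i][j]

    # Phase 2: all-gather. Completed chunks circulate around the ring
    # (overwriting, not adding) until every GPU holds every summed chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            for j in range(c * chunk, (c + 1) * chunk):
                bufs[(i + 1) % n][j] = bufs[i][j]
    return bufs
```

Each GPU sends exactly 2(N-1) chunks of size 1/N of the gradient, so total traffic per GPU stays nearly constant as N grows; that is the bandwidth-optimality claim in the bullet above.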

3. Vertex AI Reduction Server

Google Cloud offers a unique hybrid. If you use MultiWorkerMirroredStrategy, you can enable Vertex AI Reduction Server. It's a managed service that acts as a super-fast All-Reduce orchestrator, bypassing the need for complex Ring configurations on your VMs.
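
In model code, this is just the standard tf.distribute API; the snippet below is a configuration sketch of selecting the collective implementation (the Reduction Server itself is enabled on the Vertex AI training job, not in your TensorFlow code):

```python
import tensorflow as tf

# Sketch: pick the collective implementation for
# MultiWorkerMirroredStrategy. NCCL is the usual choice on GPU VMs.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=options)

with strategy.scope():
    # Variables created here are mirrored across workers, and
    # gradients are aggregated with the chosen collective.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```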

