GPU Computing for AI: CUDA Programming, Multi-GPU Training & H100 Optimization in 2025

Master GPU programming fundamentals and distributed training strategies for large-scale AI

Advanced · about 24 minutes

GPU computing is the foundation of modern AI—understanding it separates good ML engineers from great ones. This guide covers CUDA programming fundamentals, PyTorch distributed training (DDP, FSDP), gradient checkpointing and mixed precision training, H100 vs A100 performance characteristics, multi-node training with NCCL, and optimizing GPU memory utilization for training large models.

Tags: GPU Computing, CUDA, Distributed Training, FSDP, H100, PyTorch, Mixed Precision

GPU Computing for AI: CUDA, Distributed Training & H100 Optimization

Why GPU Understanding Matters

You don't need to write CUDA kernels daily, but understanding GPU architecture helps you: diagnose out-of-memory errors intelligently, choose the right batch size and model architecture, select appropriate hardware for your use case, and optimize training speed.

GPU Architecture Fundamentals

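NVIDIA GPU architecture: Streaming Multiprocessors (SMs), each containing many CUDA cores (128 FP32 cores per SM on H100). H100 SXM: 132 SMs, 16,896 CUDA cores, 80GB HBM3, 3.35 TB/s memory bandwidth. Key constraint: memory bandwidth, not compute, is often the bottleneck for LLM inference, especially token-by-token decoding at small batch sizes, where every weight is read from memory for each generated token.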
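To make the bandwidth constraint concrete, here is a back-of-the-envelope sketch. It assumes a 7B-parameter model stored in 16-bit weights (an illustrative choice), batch size 1, and that every weight is streamed from HBM once per generated token. KV-cache traffic and kernel overheads are ignored, so the result is a ceiling, not a benchmark. The function name is hypothetical.

```python
# Upper bound on batch-1 decode speed when weight reads from HBM dominate.
# Assumptions (illustrative): 7B parameters, 2 bytes each, H100 SXM bandwidth.

def max_tokens_per_second(n_params: float, bytes_per_param: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Each generated token must stream all weights from HBM once."""
    weight_bytes = n_params * bytes_per_param
    seconds_per_token = weight_bytes / bandwidth_bytes_per_s
    return 1.0 / seconds_per_token

H100_SXM_BANDWIDTH = 3.35e12  # 3.35 TB/s, the H100 SXM figure quoted in the text

tps = max_tokens_per_second(7e9, 2, H100_SXM_BANDWIDTH)
print(f"bandwidth-bound ceiling: {tps:.0f} tokens/s")
```

Under these assumptions the ceiling is roughly 239 tokens/s no matter how many FLOPs the GPU can deliver, which is why batching (reusing each weight read across many sequences) matters so much for inference throughput.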

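Memory hierarchy (fastest to slowest): Registers → L1 cache / shared memory (on-chip SRAM, 256KB per SM on H100, of which up to 228KB can be configured as shared memory) → L2 cache (SRAM, 50MB for H100) → HBM (main GPU memory, 80GB) → PCIe/NVLink (access to other GPUs or host memory). Access pattern: keep working data in registers and on-chip SRAM as long as possible, and minimize HBM reads and writes.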

Mixed Precision Training

FP16/BF16 training: compute and store activations and gradients in 16-bit (half the memory, and substantially faster matrix math on tensor cores, usually with no measurable loss in model quality). Keep master weights in FP32, because optimizer updates are often too small to be represented in 16-bit. Automatic Mixed Precision (AMP) handles the casting automatically.

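PyTorch AMP: wrap the forward pass and loss computation in torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16). When training in FP16 instead, also use a gradient scaler (torch.amp.GradScaler('cuda') in recent PyTorch releases, torch.cuda.amp.GradScaler in older ones) to scale the loss and prevent FP16 gradient underflow. BF16 is preferred for training: it has the same exponent range as FP32, so gradient scaling is usually unnecessary, and it is supported on A100/H100.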
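The underflow problem and the effect of loss scaling can be reproduced without a GPU. The sketch below uses only the standard library: struct's half-precision format stands in for FP16, and BF16 is simulated by truncating a float32 to its top 16 bits (real hardware typically rounds rather than truncates). The gradient value and helper names are illustrative assumptions, and this is not how GradScaler is implemented internally.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (struct format 'e') and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding.
    This truncates; real hardware usually rounds to nearest."""
    raw = struct.pack(">f", x)
    return struct.unpack(">f", raw[:2] + b"\x00\x00")[0]

grad = 1e-8            # a small gradient value (illustrative)
loss_scale = 65536.0   # 2**16, used here as an example scale factor

fp16_plain = to_fp16(grad)                               # too small for FP16
fp16_scaled = to_fp16(grad * loss_scale) / loss_scale    # scale, store, unscale
bf16_plain = to_bf16(grad)                               # BF16 has FP32's range

print("FP16 without scaling:", fp16_plain)
print("FP16 with loss scaling:", fp16_scaled)
print("BF16 without scaling:", bf16_plain)
```

Without scaling, the FP16 value is flushed to zero and the weight update is lost; multiplying the loss (and therefore every gradient) by a large constant moves the value into FP16's representable range, and dividing afterwards in FP32 recovers it. BF16 keeps the value directly, at the cost of fewer mantissa bits.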

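Memory savings: the weights of a 7B-parameter model take 7B × 2 bytes = 14GB in FP16/BF16 versus 7B × 4 bytes = 28GB in FP32. During mixed precision training the FP32 master weights are still kept, so the larger saving usually comes from 16-bit activations. Either way, the freed memory enables larger batch sizes and larger models.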

Gradient Checkpointing

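Trading compute for memory: instead of storing all activations for backprop, store only a few checkpoints and recompute the rest during the backward pass. Checkpointing every sqrt(n_layers) layers reduces activation memory from O(n_layers) to O(sqrt(n_layers)), at the cost of roughly one extra forward pass (typically 30-40% longer training time).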
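The mechanism can be shown on a toy chain of scalar layers. This is a minimal pure-Python sketch, not PyTorch's implementation: the layer function, weights, and segment size are illustrative assumptions. It computes the same gradient twice, once storing every layer input and once storing only one checkpoint per segment and recomputing the rest during the backward pass.

```python
import math

def layer(x: float, w: float) -> float:
    return math.tanh(w * x)

def layer_grad(x: float, w: float) -> float:
    """d/dx tanh(w * x), evaluated at the layer's input x."""
    y = math.tanh(w * x)
    return w * (1.0 - y * y)

def grad_store_all(x0, weights):
    """Standard backprop: keep the input of every layer."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= layer_grad(inputs[i], weights[i])
    return g, len(inputs)

def grad_checkpointed(x0, weights, segment):
    """Keep only the input of each segment; recompute the rest when needed."""
    n = len(weights)
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(x, w)
    g, peak = 1.0, 0
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        inputs = [checkpoints[start]]            # recompute this segment only
        for i in range(start, end - 1):
            inputs.append(layer(inputs[-1], weights[i]))
        # checkpoints[start] is also inputs[0], so count it once
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            g *= layer_grad(inputs[i - start], weights[i])
    return g, peak

weights = [0.5 + 0.05 * i for i in range(16)]
g_full, stored_full = grad_store_all(0.3, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, weights, segment=4)
print(stored_full, stored_ckpt, math.isclose(g_full, g_ckpt))
```

With 16 layers and segments of 4 (the square root of 16), the checkpointed version holds at most 7 values at a time instead of 16 and produces the same gradient; the price is that every segment's forward pass runs a second time.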

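Enable in PyTorch: use torch.utils.checkpoint.checkpoint_sequential for nn.Sequential models, or wrap individual model blocks with torch.utils.checkpoint.checkpoint (recent PyTorch releases recommend passing use_reentrant=False explicitly). Reach for it when a large model runs out of memory (OOM) during training without it.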

For Hugging Face Transformers models: call model.gradient_checkpointing_enable(), or set gradient_checkpointing=True in TrainingArguments when using the Trainer.

Distributed Training

Data Parallelism (DDP)

Each GPU holds a full copy of the model. Split batch across GPUs. Each GPU computes gradients for its mini-batch. All-reduce gradients (average across GPUs). Update model weights identically on all GPUs.
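The steps above can be simulated in plain Python to show why the replicas stay in sync. This is a toy sketch with a one-parameter linear model; the helper names and data are illustrative, and all_reduce_mean is only a stand-in for the collective that NCCL performs across real GPUs.

```python
def local_gradient(w, shard):
    """Mean gradient of the squared error (w*x - y)**2 w.r.t. w on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

# Global batch of 8 samples drawn from y = 3x + 1 (illustrative data).
batch = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]
world_size = 4
shards = [batch[r::world_size] for r in range(world_size)]  # equal-size shards

w, lr = 0.0, 0.01
replicas = [w] * world_size                                 # identical copies

# Each "GPU" computes a gradient on its own shard only.
local_grads = [local_gradient(replicas[r], shards[r]) for r in range(world_size)]
# Gradients are averaged, then every replica applies the same update.
synced = all_reduce_mean(local_grads)
replicas = [replicas[r] - lr * synced[r] for r in range(world_size)]

single_gpu = w - lr * local_gradient(w, batch)              # reference result
print(replicas, single_gpu)
```

Because the shards are the same size, the average of the per-shard gradients equals the gradient over the whole batch, so four workers with 2 samples each take the same step as one worker with all 8.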

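Setup: launch with torchrun --nproc_per_node=8 train.py. In code: initialize with dist.init_process_group('nccl'), move the model to the local GPU, wrap it with nn.parallel.DistributedDataParallel, and use a DistributedSampler so each process reads a different shard of the data. PyTorch Lightning or Hugging Face Accelerate handle this boilerplate.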

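Scaling: DDP scales near-linearly with the number of GPUs for small and medium models, as long as the gradient all-reduce overlaps with backward computation. Limit: the entire model, plus its gradients and optimizer state, must fit in each GPU's memory.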

Fully Sharded Data Parallelism (FSDP)

For models too large for one GPU. FSDP shards model parameters, gradients, and optimizer states across all GPUs. Each GPU stores 1/N of the model.

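Forward pass: for each wrapped block, all-gather its parameter shards from all GPUs, compute, then immediately discard the gathered parameters (memory efficiency). Backward pass: all-gather again, compute gradients, discard the gathered parameters, and reduce-scatter the gradients so each GPU keeps only its own shard.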

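Configure FSDP with an appropriate ShardingStrategy (FULL_SHARD is the most memory-efficient) and, if needed, cpu_offload=CPUOffload(offload_params=True), which keeps sharded parameters and gradients in CPU memory and runs the optimizer step there (slower but more memory-efficient).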

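Makes 70B+ parameter models trainable on 8× A100 80GB GPUs, though usually only in combination with mixed precision, activation checkpointing, and CPU offload: at 16 bytes per parameter for weights, gradients, and AdamW states, a 70B model needs about 1.1TB, more than the 640GB of combined GPU memory.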
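A rough sizing sketch for full sharding follows. It assumes 16 bytes per parameter (FP32 weights, FP32 gradients, and two FP32 AdamW states) and ignores activations and the temporary buffers used while a block's parameters are gathered, so real usage is higher. The function name is hypothetical.

```python
def fsdp_state_gb_per_gpu(n_params: float, n_gpus: int,
                          bytes_per_param: int = 16) -> float:
    """Persistent training state per GPU when everything is sharded evenly.

    bytes_per_param=16 assumes FP32 weights + gradients + two AdamW states
    (4 copies x 4 bytes). Activations and all-gather buffers are not counted.
    """
    return n_params * bytes_per_param / n_gpus / 1e9

for n_params, label in [(7e9, "7B"), (70e9, "70B")]:
    gb = fsdp_state_gb_per_gpu(n_params, n_gpus=8)
    print(f"{label} on 8 GPUs: {gb:.0f} GB of sharded state per GPU")
```

Under these assumptions a 7B model needs about 14GB of sharded state per GPU, while a 70B model needs about 140GB per GPU on 8 GPUs, which is more than an 80GB card holds. That gap is what CPU offload, lower-precision states, or additional GPUs have to close.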

Tensor Parallelism

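Split individual matrix multiplications across GPUs. Megatron-LM implements column-parallel and row-parallel linear layers, so each GPU holds a slice of each weight matrix. Because partial results must be exchanged inside every layer, it requires a fast NVLink interconnect between GPUs and is normally kept within a single node. Used for both training and inference serving of very large models.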
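The two splitting schemes can be checked on a tiny example. This is a plain-Python sketch with two simulated "GPUs" and made-up matrices; it illustrates the arithmetic only and is not Megatron-LM code.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1.0, 2.0, 3.0, 4.0]]            # activations: 1 x 4
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0],
     [1.0, 2.0, 1.0, 0.0]]            # weight matrix: 4 x 4

# Column-parallel: each GPU owns half of the output columns.
# Every GPU sees the full input; outputs are concatenated.
w_cols = [[row[:2] for row in w], [row[2:] for row in w]]
col_parts = [matmul(x, shard) for shard in w_cols]
y_col = [col_parts[0][0] + col_parts[1][0]]

# Row-parallel: each GPU owns half of the weight rows and the matching
# slice of the input; partial outputs are summed (an all-reduce in practice).
x_parts = [[x[0][:2]], [x[0][2:]]]
w_rows = [w[:2], w[2:]]
row_parts = [matmul(x_parts[r], w_rows[r]) for r in range(2)]
y_row = [[a + b for a, b in zip(row_parts[0][0], row_parts[1][0])]]

y_ref = matmul(x, w)                  # single-device reference
print(y_ref, y_col, y_row)
```

Both splits reproduce the single-device result. Pairing a column-parallel layer with a row-parallel one is attractive because the column-parallel output is already sliced the way the row-parallel layer needs its input, so only the final sum requires communication.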

Pipeline Parallelism

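Assign different model layers to different GPUs: GPU 0 holds layers 0-7, GPU 1 holds layers 8-15, etc. Split each batch into micro-batches and use a GPipe or 1F1B (one-forward-one-backward, optionally interleaved) schedule to minimize the pipeline bubble, the time GPUs sit idle waiting for other stages. Good for inference serving, more complex for training because of the scheduling.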
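Two small helpers make the layer assignment and the bubble concrete. This sketch assumes equal-cost stages and a GPipe-style schedule, for which the idle fraction is (stages - 1) / (micro-batches + stages - 1); the function names are hypothetical.

```python
def assign_layers(n_layers: int, n_stages: int):
    """Split layers into contiguous, near-equal ranges (inclusive indices)."""
    base, extra = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

print(assign_layers(16, 2))
for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch, three of four stages are idle at any time (75%); with 16 micro-batches the idle share drops to about 16%, and with 64 to under 5%. More micro-batches shrink the bubble but make each one smaller, so the choice is a trade-off against per-kernel efficiency.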

H100 vs A100 Performance

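H100 SXM advantages over A100: Transformer Engine with FP8 support, NVLink 4.0 (900 GB/s vs 600 GB/s per GPU), roughly 3x dense tensor core throughput at the same precision, and Hopper-specific attention kernels such as FlashAttention-3. Headline results: up to about 3x faster pre-training and up to about 6x faster inference vs A100; actual gains depend on the model, the precision, and how well the workload uses FP8.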

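H100 FP8 training: use transformer_engine.pytorch for FP8 linear layers, and run the forward pass inside the fp8_autocast context manager. Up to 2x faster than BF16 on H100 for matmul-dominated workloads, with minimal quality loss when scaling factors are managed correctly.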

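For inference: H100 PCIe 80GB suits single-GPU serving when the model fits in 80GB. A 70B model needs about 140GB at FP16, so single-GPU serving requires 8-bit (about 70GB, very tight) or 4-bit (about 35GB) weight quantization. H100 SXM 80GB with NVLink for multi-GPU serving. H100 MIG (Multi-Instance GPU), which partitions one GPU into up to seven isolated instances, for running multiple small models efficiently.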

Memory Optimization Tips

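Batch size tuning: use torch.cuda.memory_summary() or torch.cuda.max_memory_allocated() to check memory usage. Increase batch size until peak usage is close to, but safely below, capacity (around 90-95%), leaving headroom for fragmentation and occasional longer sequences.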

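Optimizer choice: AdamW requires 4× model size in memory (params + gradients + 2 optimizer states). AdaFactor reduces this to roughly 2× by factoring the second-moment state, but may slightly reduce quality. 8-bit Adam (bitsandbytes) stores both optimizer states in 8 bits instead of 32, cutting optimizer-state memory by about 75%.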
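The accounting can be written out for a concrete model size. This sketch assumes FP32 weights and gradients (4 bytes each per parameter) and counts only persistent training state; activations are excluded, and the small per-block quantization constants that 8-bit optimizers keep are ignored. The 7B size and the function name are illustrative.

```python
BYTES_FP32 = 4

def training_state_gb(n_params: float, optimizer_bytes_per_param: int) -> float:
    """Weights + gradients in FP32, plus the given optimizer state size."""
    weights_and_grads = 2 * BYTES_FP32
    return n_params * (weights_and_grads + optimizer_bytes_per_param) / 1e9

n_params = 7e9
adamw_gb = training_state_gb(n_params, 2 * 4)      # two FP32 moments
adam8bit_gb = training_state_gb(n_params, 2 * 1)   # two 8-bit moments

print(f"AdamW (FP32 states): {adamw_gb:.0f} GB")
print(f"8-bit Adam:          {adam8bit_gb:.0f} GB")
```

For a 7B model this gives 112GB with standard AdamW (16 bytes per parameter) and 70GB with 8-bit states (10 bytes per parameter): the optimizer state itself shrinks from 56GB to 14GB, while weights and gradients are unchanged.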

Profile with NVIDIA Nsight Systems (nsys) to identify bottlenecks. Look for: low GPU utilization (CPU bottleneck), high memory transfer time (move data preprocessing to GPU), kernel launch overhead (use torch.compile to fuse operations).

Modern GPU training is a combination of hardware understanding, software optimization, and systematic profiling—each optimization layer compounds to achieve 2-10x speedups.
