Understanding AI Chips: GPUs, TPUs, and Custom Silicon

The hardware powering the AI revolution

Understanding AI Hardware

Why Specialized AI Hardware?

General-purpose CPUs are inefficient for:

Matrix multiplications (core of neural networks)

Parallel processing of millions of operations

Moving large amounts of data quickly
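To make the first two points concrete, here is a minimal sketch in plain Python. The function names are illustrative and not taken from any library. It counts the floating-point work in a matrix product and shows, with a naive triple loop, that every output element can be computed independently of the others, which is why this workload suits massively parallel hardware.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """Floating-point operations for an (m x k) @ (k x n) product.

    Each of the m*n outputs needs k multiplies and k adds, hence 2*m*k*n.
    """
    return 2 * m * k * n


def naive_matmul(a, b):
    """Reference triple-loop matmul on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):          # every (i, j) output is independent,
        for j in range(n):      # so hardware can compute them in parallel
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


# One 4096 x 4096 layer applied to a batch of 512 input vectors
print(matmul_flops(512, 4096, 4096))  # 17179869184, about 17 GFLOPs
print(naive_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

A CPU with a handful of cores works through those independent outputs a few at a time; an accelerator with thousands of arithmetic units can work on many of them at once.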

NVIDIA GPU Architecture

Modern NVIDIA data center GPUs (H100, B200) have:

Tensor Cores: Specialized for matrix operations

High Bandwidth Memory (HBM): 3.35 TB/s bandwidth on the H100 SXM, higher on the B200

NVLink: Fast GPU-to-GPU communication

CUDA ecosystem: Mature software stack

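```python
import torch
import torch.nn as nn

# Check GPU capabilities
print(torch.cuda.is_available())
print(torch.cuda.get_device_properties(0))  # fails if no CUDA device is present

# Move the model and its input to the GPU
# (nn.Linear is a stand-in for your own MyModel)
model = nn.Linear(1024, 1024).to("cuda")
input_tensor = torch.randn(8, 1024, device="cuda")

# Enable mixed precision (faster, less memory)
with torch.autocast("cuda", dtype=torch.float16):
    output = model(input_tensor)
```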

Google TPU Architecture

TPUs are optimized for TensorFlow/JAX workloads:

Custom matrix multiplication units

3D torus interconnect between chips (TPU v4; earlier generations used a 2D torus)

Available on Google Cloud
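In a torus interconnect, each chip links to its nearest neighbors along every axis, and the links wrap around at the edges so no chip sits on a boundary. The sketch below illustrates that addressing scheme for any number of dimensions. The function name and the 4 x 4 x 4 shape are illustrative assumptions, not an actual TPU pod layout.

```python
def torus_neighbors(coord, shape):
    """Return the chips directly linked to `coord` in a torus of `shape`.

    Along each axis a chip links to the chip one step forward and one step
    back; the modulo makes the edges wrap around.
    """
    neighbors = set()
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % size
            neighbors.add(tuple(nxt))
    neighbors.discard(tuple(coord))  # only matters when an axis has size 1
    return sorted(neighbors)


# Hypothetical 4 x 4 x 4 arrangement of 64 chips
shape = (4, 4, 4)
print(torus_neighbors((0, 0, 0), shape))
# [(0, 0, 1), (0, 0, 3), (0, 1, 0), (0, 3, 0), (1, 0, 0), (3, 0, 0)]
```

Because of the wraparound, the corner chip at (0, 0, 0) has the same number of direct links as a chip in the middle, which keeps communication patterns uniform across the machine.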

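```python
import jax
import jax.numpy as jnp
import numpy as np

# JAX automatically uses the TPU if one is available
@jax.jit
def matrix_multiply(a, b):
    return jnp.dot(a, b)

# Move data to the first default device (a TPU when one is attached);
# jax.devices("tpu") would raise RuntimeError on a machine without TPUs
x = jax.device_put(np.array([1.0, 2.0]), jax.devices()[0])
```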

AWS Trainium & Inferentia

Trainium: Training workloads (positioned by AWS as a lower-cost alternative to GPU instances such as the A100)

Inferentia: Inference at low latency and cost

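```python
# AWS Neuron SDK (torch-neuronx targets Inferentia2 and Trainium)
import torch
import torch_neuronx
# Stand-in model and example input; substitute your own
model = torch.nn.Linear(128, 10).eval()
inputs = torch.rand(1, 128)
# Compile the model for Inferentia
traced_model = torch_neuronx.trace(model, inputs)
traced_model.save("compiled_model.pt")
```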

Comparison Matrix

| Chip | Best For | Memory BW | FP16 TFLOPS |
|---|---|---|---|
| H100 SXM | Training LLMs | 3.35 TB/s | 989 |
| A100 | Training/Inference | 2 TB/s | 312 |
| TPU v4 | Training in GCP | 1.2 TB/s | 275 |
| Inferentia2 | Low-cost inference | 820 GB/s | 190 |
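One way to read the table is to divide peak compute by memory bandwidth. The result is roughly how many floating-point operations a chip must perform per byte fetched from memory before compute, rather than bandwidth, becomes the limit (the "ridge point" of a roofline model). The sketch below uses the figures from the table; the function name is illustrative, and since these are vendor peak numbers the results should be treated as rough.

```python
# (peak FP16 TFLOPS, memory bandwidth in TB/s), copied from the table above
CHIPS = {
    "H100 SXM": (989, 3.35),
    "A100": (312, 2.0),
    "TPU v4": (275, 1.2),
    "Inferentia2": (190, 0.82),
}


def ridge_point(tflops: float, tb_per_s: float) -> float:
    """FLOPs per byte needed before compute, not memory, is the bottleneck."""
    return tflops / tb_per_s  # the 1e12 factors cancel


for name, (tflops, bw) in CHIPS.items():
    print(f"{name:12s} {ridge_point(tflops, bw):6.1f} FLOPs/byte")
```

Workloads with low arithmetic intensity, such as generating one token at a time for a single request, tend to sit well below these values and are therefore limited by memory bandwidth rather than by TFLOPS.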

Memory Constraints

The biggest challenge in AI hardware: fitting models in memory

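For LLaMA 70B in FP16: 70B parameters * 2 bytes = 140 GB for the weights alone

Single A100 (80GB): Doesn't fit!

2x A100 (160GB total): Weights fit with tensor parallelism, but little room remains for activations and KV cache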

8x H100: Comfortable fit with headroom for batching
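The arithmetic above generalizes to other model sizes and precisions. The following is a back-of-the-envelope sketch, not a capacity planner: the function names are illustrative, and the `overhead` parameter is an assumed fraction for activations, KV cache, and runtime buffers whose real value depends on the workload.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_devices(num_params: float, device_gb: float, dtype: str = "fp16",
                overhead: float = 0.0) -> int:
    """Fewest devices whose combined memory holds the weights plus overhead."""
    needed = weight_memory_gb(num_params, dtype) * (1 + overhead)
    return math.ceil(needed / device_gb)


print(weight_memory_gb(70e9))                # 140.0
print(min_devices(70e9, 80))                 # 2
print(min_devices(70e9, 80, overhead=0.2))   # 3
print(min_devices(70e9, 80, dtype="int8"))   # 1
```

With an assumed 20% overhead, the two-GPU configuration no longer suffices, which is why deployments usually leave headroom beyond the raw weight size. Quantizing to 8 bits halves the weight footprint and brings the same model within reach of a single 80 GB device.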

Emerging Alternatives

Cerebras CS-3: Wafer-scale chip with 900K cores, designed to keep an entire model on a single chip

Groq LPU: Deterministic latency for inference

SambaNova: Reconfigurable dataflow architecture
