AI Accelerator Comparison Tables
As someone who often messes with AI hardware, I’ve always wanted a comparison of different AI accelerators in a single table. It’s easy for me to remember the specs of an A100 or H100 since I frequently use those, but I often don’t remember the specs of Google’s TPUs. Also, each vendor usually publishes its own tables, but I can rarely find one that puts different accelerators in a single place. So I decided to make one myself!
The tables below only contain GPU-like accelerators that I could find public info on. There are exotic accelerators like the Cerebras WSE and Tesla Dojo that would be cool to include, but those are too exotic to compare against GPU-like accelerators. Other than NVIDIA’s GPUs, I ended up picking Google’s TPUs, AMD’s GPUs, and Microsoft’s new AI accelerator (Maia).
This first table compares Tensor Core FLOPS and HBM Capacity / Bandwidth across the different accelerators. It’s what most people would care about when looking at a single accelerator.
Note: Maia 100 uses the new Microscaling (MX) formats, so the TOPS/TFLOPS numbers are not a direct comparison versus the others.
Accelerator | FP64 TFLOPS | BF/FP16 TFLOPS | FP8 TFLOPS | INT8 TOPS | INT/FP4 T(FL)OPS | HBM (GB) | HBM BW (TB/s) | TDP (W) |
---|---|---|---|---|---|---|---|---|
A100 SXM | 19.5 | 312 | N/A | 624 | 1248 | 80 | 2 | 400 |
H100 SXM | 67 | 989.5 | 1979 | 1979 | N/A | 80 | 3.35 | 700 |
H200 SXM | 67 | 989.5 | 1979 | 1979 | N/A | 141 | 4.8 | 700 |
TPUv4 | N/A | 275 | N/A | 275 | N/A | 32 | 1.2 | 170 |
TPUv5p | N/A | 459 | N/A | 918 | N/A | 95 | 2.8 | ??? |
MI250X | 95.7 | 383 | N/A | 383 | 383 | 128 | 3.2 | 500 |
MI300X | 163.4 | 1300 | 2610 | 2600 | N/A | 192 | 5.3 | 750 |
Maia 100 | N/A | 800 | N/A | 1600 (MXINT8) | 3200 (MXFP4) | 64 | 1.6 | 860 |
At a high level, we can see that the NVIDIA GPUs have much higher topline TFLOPS than the AMD ones at lower precision (lower precision is what really matters for Deep Learning workloads). In contrast, AMD has much higher TFLOPS at high precision (FP64) – this is only useful for traditional HPC workloads. As a side note, the peak TFLOPS figures on NVIDIA GPUs are probably not actually achievable. As of August 2023, the peak FP16 TFLOPS achievable on an H100 on random input data via cuBLAS was only 670!
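If you want to check the gap between peak and achievable numbers yourself, here’s a minimal sketch (assuming a CUDA machine with PyTorch installed; the matrix size and iteration count are arbitrary picks of mine, not the methodology behind the 670 TFLOPS figure):

```python
# Rough matmul throughput check: compare measured TFLOPS against the datasheet peak.
# Simplified sketch only -- not the exact benchmark behind the ~670 TFLOPS number.
import torch

def measure_matmul_tflops(n: int = 8192, dtype=torch.float16, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    for _ in range(10):      # warm-up so clocks and cuBLAS heuristics settle
        a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000   # elapsed_time() is in milliseconds
    flops = 2 * n**3 * iters                   # 2*N^3 FLOPs per N x N matmul
    return flops / seconds / 1e12

if __name__ == "__main__":
    print(f"Measured FP16 matmul throughput: {measure_matmul_tflops():.1f} TFLOPS")
```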
Each TPU chip has much lower peak TFLOPS than a GPU, but also a much lower TDP (this is not the only metric that matters, but the TFLOPS/Watt on TPUv4 is better than on the A100!). And since TPUs have really great interconnect (see the next table), spreading the computation you would do on a single GPU across multiple TPUs is not a problem.
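As a quick back-of-the-envelope check on that TFLOPS/Watt point, here’s a tiny sketch that just divides the peak BF16 TFLOPS by the TDP numbers from the first table (peak numbers only; real efficiency depends on achieved utilization):

```python
# Peak BF16 TFLOPS per watt, using the (peak TFLOPS, TDP) pairs from the first table.
accelerators = {
    "A100 SXM": (312, 400),
    "H100 SXM": (989.5, 700),
    "TPUv4":    (275, 170),
    "MI300X":   (1300, 750),
}

for name, (bf16_tflops, tdp_watts) in accelerators.items():
    print(f"{name}: {bf16_tflops / tdp_watts:.2f} peak BF16 TFLOPS/W")
# TPUv4 lands around 1.6 TFLOPS/W vs roughly 0.8 for the A100.
```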
The second table goes to a larger scale and looks at the interconnect between accelerators. Unlike the table above, some of the details below are trickier to find, e.g. the intra-node topology of the AMD GPUs. I’ve linked sources to where I found the information to make it easier to track and verify. Since the Maia 100 was just announced at the time of writing, I could not find detailed information about its topology. To get a feel for what the bandwidth numbers mean in practice, there’s a small sketch right after the table.
Accelerator | Local interconnect size | Local interconnect bandwidth (bidirectional) | Local interconnect topology |
---|---|---|---|
A100 SXM | 8 per node (NVLink) | 600 GB/s between any 2 GPUs | all-to-all |
H100 SXM | 8 per node (NVLink), 256 possible with NVLink Switch | 900 GB/s between any 2 GPUs | all-to-all |
TPUv4 | 4096 per pod | 600 GB/s between connected TPUs | 3D torus |
TPUv5p | 8960 per pod | 1200 GB/s between connected TPUs | 3D torus |
MI250X | 8 per node | 100 GB/s per link | non-uniform |
MI300X | 8 per node | 128 GB/s between any 2 GPUs | all-to-all |
Maia 100 | ??? | ??? | ??? |
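Here’s the small sketch mentioned above: it just converts the per-link bandwidth figures from the table into transfer times for a fixed tensor size. It’s bandwidth-only (no latency, protocol overhead, or topology effects), assumes the full listed bidirectional figure is usable, and the 10 GB tensor is an arbitrary choice:

```python
# Toy point-to-point model: time to move a tensor between two directly connected
# accelerators at the bandwidths listed in the table above. Bandwidth-only.
def transfer_ms(size_gb: float, link_gb_per_s: float) -> float:
    return size_gb / link_gb_per_s * 1000

links_gb_per_s = {
    "A100 <-> A100 (NVLink)": 600,
    "H100 <-> H100 (NVLink)": 900,
    "TPUv4 <-> TPUv4 (ICI)": 600,
    "TPUv5p <-> TPUv5p (ICI)": 1200,
    "MI250X <-> MI250X (single link)": 100,
    "MI300X <-> MI300X (Infinity Fabric)": 128,
}

for name, bw in links_gb_per_s.items():
    print(f"{name}: ~{transfer_ms(10, bw):.1f} ms for 10 GB")
```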
This last table looks at the accelerators from a silicon perspective. Some info isn’t public at the time of writing, so it’s either left out or denoted with a `?` below. There’s also a quick sanity check after the table that reconstructs the HBM capacities from the stack configurations.
Accelerator | Process | HBM | # Dies |
---|---|---|---|
A100 SXM | 7nm | HBM2e, 8 high, 5 stacks | 1 GPU die |
H100 SXM | 5nm | HBM3, 8 high, 5 stacks | 1 GPU die |
H200 SXM | 5nm | HBM3e, 8 high, 6 stacks | 1 GPU die |
TPUv4 | 7nm | HBM2, 4? high, 4 stacks | 1 TPU die |
TPUv5p | 5nm? | HBM3, ? high, ? stacks | 1 TPU die |
MI250X | 6nm | HBM2e, 8 high, 8 stacks | 2 MI200 dies |
MI300X | 5nm (XCD Die) & 6nm (base IO Die) | HBM3, 12 high, 8 stacks | 4 IO dies, 2 XCD per IO die |
Maia 100 | 5nm | HBM3, 8 high, 4 stacks | 1 logic die |
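As the quick sanity check promised above: HBM capacity is just stacks × dies per stack × capacity per DRAM die. The per-die capacities here are my assumptions (16 Gb = 2 GB dies for HBM2e/HBM3, 24 Gb = 3 GB for HBM3e), so treat this as a rough reconstruction rather than official numbers; the TPUs and Maia are left out since their stack details are unclear:

```python
# Rough HBM capacity reconstruction: stacks * dies_per_stack * GB_per_die.
# Per-die capacities are assumptions; marketed capacities can be slightly lower
# than the raw product (e.g. H200 ships as 141 GB rather than 144 GB).
configs = {
    # name: (stacks, dies per stack, GB per die)
    "A100 SXM": (5, 8, 2),
    "H100 SXM": (5, 8, 2),
    "H200 SXM": (6, 8, 3),
    "MI300X":   (8, 12, 2),
}

for name, (stacks, height, gb_per_die) in configs.items():
    total = stacks * height * gb_per_die
    print(f"{name}: {stacks} stacks x {height}-high x {gb_per_die} GB = {total} GB")
# A100/H100 -> 80 GB, H200 -> 144 GB (141 marketed), MI300X -> 192 GB.
```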
Most accelerators are a single logic die plus a few stacks of HBM dies right next to it, packaged via CoWoS. AMD is the exception, with multiple logic dies packaged together instead of a single big logic die. Rumors are that the B100 (the next-gen NVIDIA GPU, to be announced in a few months) has 2 logic dies, similar to the MI250X.