AI Accelerator Comparison Tables
As someone who often messes with AI hardware, I’ve always wanted a comparison of different AI accelerators in a single table. It’s easy for me to remember the specs of an A100 or H100 since I frequently use those, but I often don’t remember the specs of Google’s TPUs. Also, each vendor usually publishes its own tables, but I can rarely find one that puts different accelerators in a single place. So I decided to make one myself!
The tables below only contain GPU-like accelerators that I could find public info on. There are exotic accelerators like the Cerebras WSE and Tesla Dojo that would be cool to include, but those are too exotic to compare against GPU-like accelerators. Other than NVIDIA’s GPUs, I ended up picking Google’s TPUs, AMD’s GPUs, and Microsoft’s new AI accelerator (Maia).
This first table compares Tensor Core FLOPS and HBM Capacity / Bandwidth across the different accelerators. It’s what most people would care about when looking at a single accelerator.
Note: Maia 100 uses the new Microscaling (MX) formats, so the TOPS/TFLOPS numbers are not a direct comparison versus the others.
Accelerator | FP64 TFLOPS | BF/FP16 TFLOPS | FP8 TFLOPS | INT8 TOPS | INT/FP4 T(FL)OPS | HBM (GB) | HBM BW (TB/s) | TDP (W) |
---|---|---|---|---|---|---|---|---|
A100 SXM | 19.5 | 312 | N/A | 624 | 1248 | 80 | 2 | 400 |
H100 SXM | 67 | 989.5 | 1979 | 1979 | N/A | 80 | 3.35 | 700 |
H200 SXM | 67 | 989.5 | 1979 | 1979 | N/A | 141 | 4.8 | 700 |
TPUv4 | N/A | 275 | N/A | 275 | N/A | 32 | 1.2 | 170 |
TPUv5p | N/A | 459 | N/A | 918 | N/A | 95 | 2.8 | ??? |
MI250X | 95.7 | 383 | N/A | 383 | 383 | 128 | 3.2 | 500 |
MI300X | 163.4 | 1300 | 2610 | 2600 | N/A | 192 | 5.3 | 750 |
Maia 100 | N/A | 800 | N/A | 1600 (MXINT8) | 3200 (MXFP4) | 64 | 1.6 | 860 |
At a high level, we can see that the NVIDIA GPUs have much higher topline TFLOPS than the AMD ones at lower precision (lower precision is what really matters for Deep Learning workloads). In contrast, AMD has much higher TFLOPS at high precision (FP64) – this is only useful for traditional HPC workloads. As a side note, the peak TFLOPS figures on NVIDIA GPUs are probably not actually achievable. As of August 2023, the peak FP16 TFLOPS achievable on an H100 on random input data via cuBLAS was only 670!
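If you want to check the gap between peak and achievable numbers yourself, here’s a minimal sketch (assuming a CUDA machine with PyTorch installed; the matrix size and iteration count are arbitrary picks of mine, not the methodology behind the 670 TFLOPS figure):

```python
# Rough matmul throughput check: compare measured TFLOPS against the datasheet peak.
# Simplified sketch only -- not the exact benchmark behind the ~670 TFLOPS number.
import torch

def measure_matmul_tflops(n: int = 8192, dtype=torch.float16, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    for _ in range(10):      # warm-up so clocks and cuBLAS heuristics settle
        a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000   # elapsed_time() is in milliseconds
    flops = 2 * n**3 * iters                   # 2*N^3 FLOPs per N x N matmul
    return flops / seconds / 1e12

if __name__ == "__main__":
    print(f"Measured FP16 matmul throughput: {measure_matmul_tflops():.1f} TFLOPS")
```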
Each TPU chip has much lower peak TFLOPS than a GPU, but also a much lower TDP (this is not the only metric that matters, but the TFLOPS/Watt on TPUv4 is better than on the A100!). And since TPUs have really great interconnect (see the next table), spreading the computation you would do on a single GPU across multiple TPUs is not a problem.
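As a quick back-of-the-envelope check on that TFLOPS/Watt point, here’s a tiny sketch that just divides the peak BF16 TFLOPS by the TDP numbers from the first table (peak numbers only; real efficiency depends on achieved utilization):

```python
# Peak BF16 TFLOPS per watt, using the (peak TFLOPS, TDP) pairs from the first table.
accelerators = {
    "A100 SXM": (312, 400),
    "H100 SXM": (989.5, 700),
    "TPUv4":    (275, 170),
    "MI300X":   (1300, 750),
}

for name, (bf16_tflops, tdp_watts) in accelerators.items():
    print(f"{name}: {bf16_tflops / tdp_watts:.2f} peak BF16 TFLOPS/W")
# TPUv4 lands around 1.6 TFLOPS/W vs roughly 0.8 for the A100.
```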
The second table goes to a larger scale and looks at the interconnect between accelerators. Unlike the table above, some of the details below are trickier to find, e.g. the intra-node topology of the AMD GPUs. I’ve linked sources to where I found the information to make it easier to track and verify. Since the Maia 100 was just announced at the time of writing, I could not find detailed information about its topology. To get a feel for what the bandwidth numbers mean in practice, there’s a small sketch right after the table.
Accelerator | Local interconnect size | Local interconnect bandwidth (bidirectional) | Local interconnect topology |
---|---|---|---|
A100 SXM | 8 per node (NVLink) | 600 GB/s between any 2 GPUs | all-to-all |
H100 SXM | 8 per node (NVLink), 256 possible with NVLink Switch | 900 GB/s between any 2 GPUs | all-to-all |
TPUv4 | 4096 per pod | 600 GB/s between connected TPUs | 3D torus |
TPUv5p | 8960 per pod | 1200 GB/s between connected TPUs | 3D torus |
MI250X | 8 per node | 100 GB/s per link | non-uniform |
MI300X | 8 per node | 128 GB/s between any 2 GPUs | all-to-all |
Maia 100 | ??? | ??? | ??? |
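Here’s the small sketch mentioned above: it just converts the per-link bandwidth figures from the table into transfer times for a fixed tensor size. It’s bandwidth-only (no latency, protocol overhead, or topology effects), assumes the full listed bidirectional figure is usable, and the 10 GB tensor is an arbitrary choice:

```python
# Toy point-to-point model: time to move a tensor between two directly connected
# accelerators at the bandwidths listed in the table above. Bandwidth-only.
def transfer_ms(size_gb: float, link_gb_per_s: float) -> float:
    return size_gb / link_gb_per_s * 1000

links_gb_per_s = {
    "A100 <-> A100 (NVLink)": 600,
    "H100 <-> H100 (NVLink)": 900,
    "TPUv4 <-> TPUv4 (ICI)": 600,
    "TPUv5p <-> TPUv5p (ICI)": 1200,
    "MI250X <-> MI250X (single link)": 100,
    "MI300X <-> MI300X (Infinity Fabric)": 128,
}

for name, bw in links_gb_per_s.items():
    print(f"{name}: ~{transfer_ms(10, bw):.1f} ms for 10 GB")
```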
This last table looks at the accelerators from a silicon perspective. Some info isn’t public at the time of writing, so it’s either left out or denoted with a `?` below. There’s also a quick sanity check after the table that reconstructs the HBM capacities from the stack configurations.
Accelerator | Process | HBM | # Dies |
---|---|---|---|
A100 SXM | 7nm | HBM2e, 8 high, 5 stacks | 1 GPU die |
H100 SXM | 5nm | HBM3, 8 high, 5 stacks | 1 GPU die |
H200 SXM | 5nm | HBM3e, 8 high, 6 stacks | 1 GPU die |
TPUv4 | 7nm | HBM2, 4? high, 4 stacks | 1 TPU die |
TPUv5p | 5nm? | HBM3, ? high, ? stacks | 1 TPU die |
MI250X | 6nm | HBM2e, 8 high, 8 stacks | 2 MI200 dies |
MI300X | 5nm (XCD Die) & 6nm (base IO Die) | HBM3, 12 high, 8 stacks | 4 IO dies, 2 XCD per IO die |
Maia 100 | 5nm | HBM3, 8 high, 4 stacks | 1 logic die |
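As the quick sanity check promised above: HBM capacity is just stacks × dies per stack × capacity per DRAM die. The per-die capacities here are my assumptions (16 Gb = 2 GB dies for HBM2e/HBM3, 24 Gb = 3 GB for HBM3e), so treat this as a rough reconstruction rather than official numbers; the TPUs and Maia are left out since their stack details are unclear:

```python
# Rough HBM capacity reconstruction: stacks * dies_per_stack * GB_per_die.
# Per-die capacities are assumptions; marketed capacities can be slightly lower
# than the raw product (e.g. H200 ships as 141 GB rather than 144 GB).
configs = {
    # name: (stacks, dies per stack, GB per die)
    "A100 SXM": (5, 8, 2),
    "H100 SXM": (5, 8, 2),
    "H200 SXM": (6, 8, 3),
    "MI300X":   (8, 12, 2),
}

for name, (stacks, height, gb_per_die) in configs.items():
    total = stacks * height * gb_per_die
    print(f"{name}: {stacks} stacks x {height}-high x {gb_per_die} GB = {total} GB")
# A100/H100 -> 80 GB, H200 -> 144 GB (141 marketed), MI300X -> 192 GB.
```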
Most accelerators are a single logic die plus a few stacks of HBM dies right next to it, packaged via CoWoS. AMD is the exception, with multiple logic dies packaged together instead of a single big logic die. Rumors are that the B100 (the next-gen NVIDIA GPU, to be announced in a few months) has 2 logic dies, similar to the MI250X.