As someone who often messes with AI hardware, I’ve always wanted a comparison of different AI accelerators in a single table. It’s easy for me to remember the specs of an A100 or H100 since I use those frequently, but I often forget the specs of Google’s TPUs. Each vendor usually publishes its own tables, but I can rarely find one that puts different accelerators side by side. So I decided to make one myself!

The tables below only contain GPU-like accelerators that I could find public info on. There are more exotic accelerators like Cerebras WSE and Tesla Dojo that would be cool to include, but their architectures are too different to compare meaningfully against GPU-like accelerators. Other than NVIDIA’s GPUs, I ended up picking Google’s TPUs, AMD’s GPUs, and Microsoft’s new AI accelerator (Maia).


This first table compares Tensor Core FLOPS and HBM Capacity / Bandwidth across the different accelerators. It’s what most people would care about when looking at a single accelerator.

Note: Maia 100 uses the new Microscaling (MX) formats, so the TOPS/TFLOPS numbers are not a direct comparison versus the others.

| Accelerator | FP64 TFLOPS | BF/FP16 TFLOPS | FP8 TFLOPS | INT8 TOPS | INT/FP4 T(FL)OPS | HBM (GB) | HBM BW (TB/s) | TDP (W) |
|---|---|---|---|---|---|---|---|---|
| A100 SXM | 19.5 | 312 | N/A | 624 | 1248 | 80 | 2 | 400 |
| H100 SXM | 67 | 989.5 | 1979 | 1979 | | 80 | 3.35 | 700 |
| H200 SXM | 67 | 989.5 | 1979 | 1979 | | 141 | 4.8 | 700 |
| TPUv4 | | 275 | | 275 | | 32 | 1.2 | 170 |
| TPUv5p | | 459 | | 918 | | 95 | 2.8 | ??? |
| MI250X | 95.7 | 383 | | 383 | 383 | 128 | 3.2 | 500 |
| MI300X | 163.4 | 1300 | 2610 | 2600 | | 192 | 5.3 | 750 |
| Maia 100 | | 800 | | 1600 (MXINT8) | 3200 (MXFP4) | 64 | 1.6 | 860 |
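
To unpack that note on MX formats: the Microscaling idea (per the OCP MX spec) is that small blocks of values share a single power-of-two scale, while each element is stored in a narrow format like INT8 or FP4. The snippet below is my own toy illustration of that idea, not Maia’s actual implementation; the block size of 32 follows the OCP spec, everything else is simplified.

```python
# Toy illustration of the Microscaling (MX) idea: each block of 32 values
# shares one power-of-two scale, and elements are stored as INT8
# (an MXINT8-like layout). This is my simplification, not Maia's hardware.
import numpy as np

def mxint8_quantize(x, block=32):
    x = x.reshape(-1, block)
    # Pick a power-of-two scale per block so the largest |value| fits in int8.
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scales = 2.0 ** np.ceil(np.log2(max_abs / 127 + 1e-30))
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

def mxint8_dequantize(q, scales):
    return q.astype(np.float32) * scales

x = np.random.randn(4 * 32).astype(np.float32)
q, s = mxint8_quantize(x)
print(np.abs(mxint8_dequantize(q, s).ravel() - x).max())  # small quantization error
```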

At a high level, the NVIDIA GPUs have much higher topline TFLOPS than the AMD ones at lower precisions (and lower precision is what really matters for deep learning workloads). In contrast, AMD has much higher TFLOPS at high precision (FP64), which is mostly useful for traditional HPC workloads. As a side note, the peak TFLOPS on NVIDIA GPUs is probably not actually achievable in practice: as of August 2023, the peak FP16 TFLOPS achievable on an H100 on random input data via cuBLAS was only 670!
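
For context on that 670 number, a measurement like this is easy to reproduce with a few lines of PyTorch (which dispatches large FP16 matmuls to cuBLAS). This is just a minimal sketch; the matrix size and iteration count are arbitrary choices of mine, and the result will vary with clocks, power limits, and library versions.

```python
# Minimal sketch: measure achievable FP16 matmul TFLOPS on the current GPU.
import time
import torch

def measure_tflops(n=8192, iters=100, dtype=torch.float16):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    # Warm up so cuBLAS heuristics and clocks settle before timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    flops = 2 * n**3 * iters       # 2*N^3 FLOPs per N x N matmul
    return flops / elapsed / 1e12  # TFLOPS

print(f"{measure_tflops():.1f} TFLOPS")
```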

Each TPU chip has much lower peak TFLOPS than the GPUs, but also a much lower TDP (not the only metric that matters, but the TFLOPS/Watt of TPUv4 is better than the A100’s!). And since TPUs have really great interconnect (see the next table), spreading work that would fit on a single GPU across multiple TPUs is not a problem.
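
To make the TFLOPS/Watt point concrete, here is a back-of-the-envelope calculation using the dense BF16 TFLOPS and TDP numbers from the first table. TDP is only a proxy for real power draw, so treat these as rough ratios, not measurements.

```python
# Rough TFLOPS-per-watt from the first table (peak dense BF16 TFLOPS / TDP).
specs = {
    "A100 SXM": (312, 400),
    "H100 SXM": (989.5, 700),
    "TPUv4":    (275, 170),
    "MI250X":   (383, 500),
    "MI300X":   (1300, 750),
}
for name, (tflops, tdp) in specs.items():
    print(f"{name}: {tflops / tdp:.2f} TFLOPS/W")
# TPUv4 lands around 1.6 TFLOPS/W vs roughly 0.8 for the A100.
```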


The second table zooms out to a larger scale and looks at the interconnect between accelerators. Unlike the table above, some of these details are trickier to find, e.g. the intra-node topology of the AMD GPUs. I’ve linked sources to where I found the information to make it easier to track and verify. Since the Maia 100 had just been announced at the time of writing, I could not find detailed information about its topology.

| Accelerator | Local interconnect size | Local interconnect bandwidth (bidirectional) | Local interconnect topology |
|---|---|---|---|
| A100 SXM | 8 per node (NVLink) | 600 GB/s between any 2 GPUs | all-to-all |
| H100 SXM | 8 per node (NVLink), 256 possible with NVLink Switch | 900 GB/s between any 2 GPUs | all-to-all |
| TPUv4 | 4096 per pod | 600 GB/s between connected TPUs | 3D torus |
| TPUv5p | 8960 per pod | 1200 GB/s between connected TPUs | 3D torus |
| MI250X | 8 per node | 100 GB/s per link | non-uniform |
| MI300X | 8 per node | 128 GB/s between any 2 GPUs | all-to-all |
| Maia 100 | ??? | ??? | ??? |
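
To give some intuition for what “3D torus” vs “all-to-all” means: in a torus, each chip only has direct links to its six nearest neighbors (with wraparound at the edges), so traffic between distant chips hops through intermediate ones, whereas in an NVLink all-to-all node every GPU has a direct path to every other GPU. The sketch below just enumerates torus neighbors; the 16x16x16 shape is my illustrative stand-in for a 4096-chip pod, not an official TPUv4 layout.

```python
# Neighbors of a chip in a 3D torus: +/- one step along each axis, wrapping
# around at the edges, for 6 direct links per chip.
def torus_neighbors(x, y, z, dims=(16, 16, 16)):
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

print(torus_neighbors(0, 0, 0))  # corner chip still has 6 neighbors thanks to wraparound
```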

This last table looks at the accelerators from a silicon perspective. Some info isn’t public at the time of writing, so it’s either left blank or marked with a ?.

| Accelerator | Process | HBM | # Dies |
|---|---|---|---|
| A100 SXM | 7nm | HBM2e, 8-high, 5 stacks | 1 GPU die |
| H100 SXM | 5nm | HBM3, 8-high, 5 stacks | 1 GPU die |
| H200 SXM | 5nm | HBM3e, 8-high, 6 stacks | 1 GPU die |
| TPUv4 | 7nm | HBM2, 4?-high, 4 stacks | 1 TPU die |
| TPUv5p | 5nm? | HBM3, ?-high, ? stacks | 1 TPU die |
| MI250X | 6nm | HBM2e, 8-high, 8 stacks | 2 MI200 dies |
| MI300X | 5nm (XCD die) & 6nm (base IO die) | HBM3, 12-high, 8 stacks | 4 IO dies, 2 XCDs per IO die |
| Maia 100 | 5nm | HBM3, 8-high, 4 stacks | 1 GPU die |
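
As a rough sanity check on how the HBM column here maps to the capacities in the first table: capacity is roughly stacks × stack height × GB per DRAM die. The per-die sizes below are my own assumptions (2 GB, i.e. 16 Gb, dies for HBM2e/HBM3 and 3 GB, i.e. 24 Gb, dies for HBM3e), not numbers from the vendors’ spec sheets.

```python
# Rough HBM capacity check: stacks * stack height * GB per DRAM die.
# Per-die sizes are my assumptions, so treat this as a sanity check only.
configs = {
    "A100 SXM": (5, 8, 2),    # -> 80 GB
    "H100 SXM": (5, 8, 2),    # -> 80 GB
    "H200 SXM": (6, 8, 3),    # -> 144 GB raw, 141 GB exposed
    "MI250X":   (8, 8, 2),    # -> 128 GB
    "MI300X":   (8, 12, 2),   # -> 192 GB
    "Maia 100": (4, 8, 2),    # -> 64 GB
}
for name, (stacks, height, die_gb) in configs.items():
    print(f"{name}: {stacks * height * die_gb} GB")
```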

Most accelerators are a single logic die plus a few stacks of HBM dies right next to it, packaged via CoWoS. AMD is the exception, with multiple logic dies packaged together instead of a single big logic die. Rumor has it that the B100 (the next-gen NVIDIA GPU, expected to be announced in a few months) will have 2 logic dies, similar to the MI250X.