Several years ago, Google began working on its own custom software for machine learning and artificial intelligence workloads, dubbed TensorFlow. Last year, the company announced that it had designed its own tensor processing unit (TPU), an ASIC designed for high throughput of low-precision arithmetic. Now, Google has released performance data for its TPU and how it compares with Intel's Haswell CPUs and Nvidia's K80 (Kepler-based) data center dual GPU.

Before we dive into the data, we need to talk about the workloads Google is discussing. All of Google's benchmarks measure inference performance as opposed to initial neural network training. Nvidia has a graphic that summarizes the differences between the two:

[Image: Nvidia's diagram of the difference between deep learning training and inference]

Teaching a neural network what to recognize and how to recognize it is referred to as training, and these workloads are still typically run on CPUs or GPUs. Inference refers to the neural network's ability to apply what it has learned from training. Google makes it clear that it's only interested in low-latency operation and that it has imposed strict responsiveness criteria on the benchmarks we'll discuss below.
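
To make the distinction concrete, here is a minimal sketch in plain NumPy (our own toy example, not Google's code) of a one-layer model: training repeatedly adjusts the weights from labeled examples, while inference is a single forward pass with the weights frozen, which is why it is the latency-sensitive half of the problem.

```python
import numpy as np

# Toy linear model y = x @ w; the data and hyperparameters are illustrative only.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(256, 4))            # training inputs
true_w = np.array([1.5, -2.0, 0.5, 3.0])
y_train = x_train @ true_w                     # training targets

# --- Training: iteratively update the weights from labeled data ---
w = np.zeros(4)
learning_rate = 0.1
for _ in range(200):
    pred = x_train @ w
    grad = x_train.T @ (pred - y_train) / len(x_train)  # gradient of the MSE loss
    w -= learning_rate * grad                            # weight update

# --- Inference: weights are frozen; serve one request with a single forward pass ---
x_new = rng.normal(size=(1, 4))
print("learned weights:", np.round(w, 2))
print("prediction:", x_new @ w)
```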

Google's TPU design, benchmarks

The first part of Google's paper discusses the various types of deep neural networks it deploys and the specific benchmarks it uses, and offers a diagram of the TPU's physical layout, pictured below. The TPU is specifically designed for 8-bit integer workloads and prioritizes consistently low latency over raw throughput (both CPUs and GPUs tend to prioritize throughput over latency, GPUs especially).

[Image: diagram of the TPU's physical layout]
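
That 8-bit focus is worth a quick illustration. The sketch below uses a simplified, symmetric quantization scheme of our own (not Google's actual scheme): 32-bit floating-point weights are mapped to 8-bit integers plus a single scale factor, which is what lets an integer-only ASIC handle the bulk of the inference math.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a single per-tensor scale factor.

    Simplified symmetric quantization for illustration; production schemes
    (including Google's) differ in the details.
    """
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(dequantize(q, scale) - w)))  # small
```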

Google writes (PDF): "Rather than be tightly integrated with a CPU, to reduce the chances of delaying deployment, the TPU was designed to be a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetching them itself. Hence, the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU."

[Image: the TPU board]

Each TPU also has an off-chip 8GiB DRAM pool, which Google calls Weight Memory, while intermediate results are held in a 24MiB pool of on-chip memory (that's the Unified Buffer in the diagram above). The TPU has a four-stage pipeline and executes CISC instructions, with some instructions taking thousands of clock cycles to execute, as opposed to the typical RISC pipeline of one clock cycle per pipeline stage. The table below shows how the E5-2699 v3 (Haswell), Nvidia K80, and TPU compare against each other on various metrics.

[Image: table comparing the E5-2699 v3, Nvidia K80, and TPU]
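
As a rough mental model of the Weight Memory/Unified Buffer split described above (an illustrative sketch, not Google's actual microarchitecture), the snippet below streams a large weight matrix out of slow "off-chip" storage one tile at a time, while the activations and the accumulating result stay in a small, fast "on-chip" buffer.

```python
import numpy as np

# Illustrative only: a tiled matrix multiply where the large weight matrix lives
# "off chip" and is streamed through in small tiles, while activations and the
# running result stay resident in a small "on-chip" buffer.
TILE = 64  # pretend this is the largest weight tile the on-chip buffer can hold

def tiled_matmul(activations: np.ndarray, weights_offchip: np.ndarray) -> np.ndarray:
    n_in = weights_offchip.shape[0]
    result = np.zeros((activations.shape[0], weights_offchip.shape[1]))  # stays "on chip"
    for start in range(0, n_in, TILE):
        tile = weights_offchip[start:start + TILE]            # fetch one weight tile from DRAM
        result += activations[:, start:start + TILE] @ tile   # accumulate partial products
    return result

a = np.ones((8, 256))
w = np.ones((256, 32))
assert np.allclose(tiled_matmul(a, w), a @ w)
```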

Before we hit the benchmark results, there are a few things we need to note. First, Turbo mode and GPU Boost were disabled for both the Haswell CPU and the Nvidia GPUs, not to artificially tilt the scores in favor of the TPU, but because Google's data centers prioritize dense hardware packing over raw performance. The higher turbo clock rates of the v3 Xeon depend on not using AVX, which Google's neural networks all tend to use. As for Nvidia's K80, the test server in question deployed four K80 cards with two GPUs per card, for a total of eight GPUs. Packed that tightly, the only way to take advantage of the GPU's boost clock without overheating would have been to remove two of the K80 cards. Since the boost in clock frequency isn't worth nearly as much as doubling the total number of GPUs in the server, Google leaves boost disabled on these server configurations.

Google's benchmark figures all use the roofline performance model. The advantage of this model is that it creates an intuitive picture of overall performance: the flat roofline represents theoretical peak performance, while the various data points show real-world results.

In this case, the Y-axis is integer operations per second, while the "Operational Intensity" X-axis corresponds to integer operations per byte of weights read (emphasis Google's). The gap between an application's observed performance and the roofline directly above it shows how much additional performance might be gained if the benchmark were better optimized for the architecture in question, while data points on the slanted portion of the roofline indicate that the benchmark is running into memory bandwidth limitations. The slideshow below shows Google's results in various benchmarks for its CPU, GPU, and TPU tests.
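
For readers who want the model itself, the roofline is simply the lower of two ceilings: peak compute, or operational intensity multiplied by memory bandwidth. A minimal sketch with made-up numbers (not Google's measurements):

```python
def roofline(peak_ops_per_s: float, mem_bw_bytes_per_s: float,
             operational_intensity: float) -> float:
    """Attainable throughput under the roofline model.

    operational_intensity is ops per byte read; the result is capped by either
    the flat compute roof or the slanted memory-bandwidth limit.
    """
    return min(peak_ops_per_s, operational_intensity * mem_bw_bytes_per_s)

# Hypothetical accelerator: 90 tera-ops/s peak compute, 30 GB/s of weight bandwidth.
PEAK = 90e12
BW = 30e9
for oi in (100, 1_000, 10_000):
    print(f"intensity {oi:>6} ops/byte -> {roofline(PEAK, BW, oi) / 1e12:.1f} TOPS")
# Low-intensity workloads land on the slanted (memory-bound) part of the roof;
# only very high intensity reaches the flat compute ceiling.
```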

Google's TPU isn't just a high-performance engine; it offers substantially improved performance per watt as well, both for the original TPU and for the improved variants Google has modeled (TPU').

[Image: relative performance per watt of the TPU, CPU, and GPU]

The chief factor standing between Google's TPU and even higher performance is memory bandwidth. Google's models show TPU performance improving 3x if memory bandwidth is increased 4x over the current design. No other enhancement, including higher clock rates, larger accumulators, or a combination of multiple factors, has much of an impact on performance.
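
Seen through the roofline model, that result is intuitive: a memory-bound workload gains almost linearly from extra bandwidth until it runs into the compute ceiling, at which point a 4x bandwidth increase can translate into roughly a 3x gain. A quick illustration, again with made-up numbers rather than Google's figures:

```python
# Made-up numbers only, reusing the roofline idea from the sketch above.
PEAK = 90e12        # hypothetical compute ceiling, ops/s
OI = 1_000          # hypothetical operational intensity, ops/byte
for bw in (30e9, 4 * 30e9):                 # baseline vs. 4x memory bandwidth
    attainable = min(PEAK, OI * bw)
    print(f"{bw / 1e9:>4.0f} GB/s -> {attainable / 1e12:.0f} TOPS")
# 30 GB/s -> 30 TOPS; 120 GB/s -> 90 TOPS: the 4x bandwidth boost buys about 3x
# performance because the compute roof caps the gain.
```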

The last section of Google's paper is dedicated to dispelling various fallacies and correcting misunderstandings, a number of which relate to the choice of the K80 GPU. One section is particularly worth quoting:

Fallacy: CPU and GPU results would be comparable to the TPU if we used them more efficiently or compared to newer versions.

We originally had 8-bit results for just one DNN on the CPU, due to the significant work to use AVX2 integer support efficiently. The benefit was ~3.5X. It was less confusing (and space) to present all CPU results in floating point, rather than having one exception, with its own roofline. If all DNNs had similar speedup, performance/Watt ratio would drop from 41-83X to 12-24X. The new 16-nm, 1.5GHz, 250W P40 datacenter GPU can perform 47 Tera 8-bit ops/sec, but was unavailable in early 2015, so isn't contemporary with our three platforms. We also can't know the fraction of P40 peak delivered within our rigid time bounds. If we compared newer chips, Section 7 shows that we could triple performance of the 28-nm, 0.7GHz, 40W TPU just by using the K80's GDDR5 memory (at a cost of an additional 10W).

This kind of announcement isn't the sort of thing Nvidia is going to be happy to hear. To be clear, Google's TPU results today apply to inference workloads, not the initial task of training the neural network; that's still done on GPUs. But, with respect to both Nvidia and AMD, we've seen this kind of cycle play out before. Once upon a time, CPUs were the unquestioned kings of cryptocurrency mining. Then, as difficulty rose, GPUs became dominant, thanks to vastly higher hash rates. In the long run, however, custom ASICs took over the market.

Both Nvidia and AMD have recently added (Nvidia) or announced (AMD) support for 8-bit operations to improve total GPU throughput in deep learning and AI workloads, but it will take significant improvements over and above these steps to address the advantage ASICs would possess if they start moving into these markets. That's not to say we expect custom ASIC designs to own the market; Google and Microsoft may be able to afford to build their own custom hardware, but most customers won't have the funds or expertise to take that on.