Tesla’s Dojo Supercomputer Breaks All Established Industry Standards — CleanTechnica Deep Dive, Part 1

Hamartia Antidote · Aug 31, 2021


Tesla's Dojo supercomputer is an elegant beast that completely reinvents the supercomputer.

cleantechnica.com

During AI Day this week, Tesla shattered all rules and established industry standards when it comes to making a computer. The presentation, just like on Autonomy Day, was rather technical, and the people making the presentation again may have failed to take into account that not everyone is fully literate in microprocessor design and engineering. Though, AI Day was geared to excite the geeks and try to hire industry experts, so this was likely an intentional choice.

In this deep dive, we scrutinize and explain everything Tesla has said about computer hardware and compare it to the competition and the way things are normally done. Full warning: this presentation is still quite technical, but we do try to explain it all in plain English. If you still have any questions, please leave them down below in the comments and we will try to answer anything that we can.

To make this easier to digest, we are also splitting it into a series of 4 or 5 articles.

The Tesla GPU Stack
In case it wasn’t clear, Tesla has built, with NVIDIA GPUs, one of the most powerful supercomputers in the world. That is what they call the GPU stack, and it is what they hope their programmers will want to turn off and never use again as soon as Dojo is up and running. During the presentation, they said that the number of GPUs is “more than top 5 supercomputer(s) in the world.” I had to dig it up, but what Tesla most likely meant is that they have more GPUs than the 5th most powerful supercomputer in the world, a machine called Selene, which has 4,480 NVIDIA A100 GPUs. If you add the top 5 together, however, Tesla would not be beating the total GPU count; it’s not even close.

However, Tesla’s GPU-based supercomputer, or at least its biggest cluster, is on its own quite possibly the 5th most powerful supercomputer in the world. We can see that Tesla started receiving GPUs in mid-2019 for its first cluster. This date, and the fact that they mentioned ray tracing during the presentation, could mean that Tesla ordered NVIDIA Quadro RTX cards, although they might also have been older NVIDIA V100 GPUs. Since NVIDIA released the A100 in November 2020, cluster number 2 is likely also made up of the older hardware. If they used V100 GPUs, that would put the second cluster at around 22 PetaFLOPS. That would be right at the very bottom edge of the top 10 list in 2020, might not have made the list at all, and would certainly not make the top 10 list now.

Fortunately for us, Andrej Karpathy revealed in a presentation he gave in June that the largest cluster is made up of NVIDIA’s new A100 GPUs. He said that this is roughly the 5th most powerful supercomputer in the world. Considering the components, the theoretical maximum would equal 112.32 PetaFLOPS, putting it in 4th place. However, since there is always some scaling inefficiency when many GPUs work together, 5th place is most likely an accurate estimate. If we divide the FP16 TFLOP performance in half to estimate the FP32 performance, we get around 90 PetaFLOPS, just a bit less than the Sunway TaihuLight supercomputer in China.
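For readers who want to check the 112.32 PetaFLOPS figure, here is a minimal sketch of the arithmetic. The node count, GPUs per node, and per-GPU FP32 peak are assumptions that do not appear in the text above: they come from Karpathy's June talk (720 nodes of 8 GPUs) and NVIDIA's published A100 peak of 19.5 TFLOPS FP32. The function name is made up for illustration.

```python
# Assumptions (not stated in the article text): 720 nodes with 8 A100 GPUs each,
# and a published peak of 19.5 TFLOPS FP32 per A100.
NODES = 720
GPUS_PER_NODE = 8
A100_FP32_TFLOPS = 19.5


def peak_petaflops(gpu_count: int, tflops_per_gpu: float) -> float:
    """Theoretical peak in PetaFLOPS, ignoring all scaling losses."""
    return gpu_count * tflops_per_gpu / 1000.0


gpu_count = NODES * GPUS_PER_NODE
print(gpu_count, round(peak_petaflops(gpu_count, A100_FP32_TFLOPS), 2))
```

This is a paper number only. A measured benchmark result would come in lower, which is why the ranking estimate drops a place.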

The Dojo (that I would love to go to)
So, at first glance, it might appear that with 1.1 ExaFLOPS, Dojo would become the most powerful supercomputer in the world. However, Tesla sugarcoated the numbers a bit, and Dojo will in fact become the 6th most powerful computer in the world. Right now, the most powerful supercomputer is “Fugaku” in Kobe, Japan, with a world record of 442 PetaFLOPS, three times faster than the second most powerful supercomputer, “Summit,” in Tennessee, USA, which has 148.6 PetaFLOPS. Dojo, with its 68.75 PetaFLOPS (approximately), would then be in 6th place. In fact, because the next 3 supercomputers are quite close, at 61.4 to 64.59 PetaFLOPS, it’s possible that Dojo is in seventh, eighth, or even ninth place. Later on in this series, we will explain this in greater detail under the colorfully named section “Tesla flops the FLOPS test.”

Nonetheless, this is absolutely nothing to laugh at. In fact, when it comes to the specific tasks that Tesla is creating Dojo for, it is very likely that Dojo will outperform all other supercomputers in the world combined, and by a very large margin. The standard test for supercomputers is peeling apples, but Tesla has a yard full of oranges and designed a tool for that. The mere fact that, in addition to being the best in the world at peeling oranges, it is still able to get 6th place for peeling apples just shows how incredible this system is.

Moving away from raw compute performance, Dojo and its jaw-dropping engineering put all supercomputers to shame in almost every other way imaginable. To logically explain this, we need to start at the beginning, the small scale.

What is an SoC
The way every computer works right now is that you have a processor (CPU), or in some cases two in a business server. The processor(s) go onto a motherboard that houses the RAM (temporary fast memory, 8–16GB in good laptops/desktops), and the computer has a power supply that feeds electricity into the motherboard to power everything. Most consumer desktops have a separate graphics card (GPU), but most consumer processors also have built-in graphics.

Now, if you haven’t read it yet, you may want to read my previous article in which I analyze Tesla’s Hardware 3 chip (Elon Musk on Twitter called it a good analysis, so you can’t go wrong there). To very quickly summarize: the Tesla HW3 chip and most consumer processors are actually SoCs, “systems on a chip,” because they include cache memory (sadly, only a few megabytes) as well as a processor and a graphics unit and, in the case of the Tesla HW3 chip, two NPUs, or Neural Processing Units.

Wafer & Wafer Yield
Now, before we move on, I need to explain something about how processors, graphics cards, and SoCs are normally made. Components like transistors are not added to individual processors one chip at a time. They are all created while the processor is still part of a circular disc called a wafer. That wafer is then cut into pieces, each of which becomes an individual processor, GPU, or SoC. The chip fabrication does not always go well, and often some processors don’t work or are only partially operational. In the industry, the common term to describe this issue is a low wafer yield.
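To put a rough number on wafer yield, here is a small sketch using the textbook Poisson yield model, in which the chance that a die has zero defects falls off exponentially with die area. The model and the defect density below are assumptions for illustration. Neither comes from the article, and real fabs use more refined models.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies expected to have zero defects (simple Poisson model)."""
    die_area_cm2 = die_area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return math.exp(-die_area_cm2 * defects_per_cm2)


# Hypothetical defect density, chosen only for illustration.
DEFECT_DENSITY = 0.1  # defects per cm^2

for area in (100, 300, 600):  # die areas in mm^2
    print(area, f"{poisson_yield(area, DEFECT_DENSITY):.1%}")
```

The takeaway is the trend rather than the exact percentages: at the same defect density, a bigger die is more likely to catch a defect, so large chips yield worse than small ones.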

Even most people who don’t know much about computer hardware know that Intel offers Celeron/Pentium, i3, i5, i7, and i9 processors, and that order is from weakest to strongest. What most people don’t know is that, due to problems with wafer yield, some of those processors are defective and work only partially, so Intel disables the broken part of the chip and sells it as a cheaper version. This is called binning. So, a Celeron/Pentium is really a broken i3 and an i5 is a broken i7, and even within a tier there are various versions of an i5 and i7: chips that can’t reach the maximum clock speed are locked and sold as a cheaper variant. Whether Intel still does this today with their latest chips, I am not sure, but they still did this as recently as 2017. The point is that rather than throw away a defective wafer or defective chips in a wafer, you can still salvage your yield.
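As a rough illustration of binning, the sketch below sorts tested dies into grades by how many cores work and how fast they run stably. The bin labels, thresholds, and the `DieTestResult` type are all hypothetical. They are not Intel's real criteria, which are not public in this form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DieTestResult:
    working_cores: int
    max_stable_ghz: float


# Hypothetical bins: (label, minimum working cores, minimum stable clock in GHz),
# ordered from most to least demanding. These are NOT real product criteria.
BINS = [
    ("high", 8, 4.5),
    ("mid", 6, 4.0),
    ("low", 4, 3.5),
]


def assign_bin(die: DieTestResult) -> Optional[str]:
    """Return the best bin the die qualifies for, or None if it is scrap."""
    for label, min_cores, min_ghz in BINS:
        if die.working_cores >= min_cores and die.max_stable_ghz >= min_ghz:
            return label
    return None


print(assign_bin(DieTestResult(working_cores=7, max_stable_ghz=4.8)))
```

A die with one dead core misses the top bin no matter how fast it clocks, so it gets sold one grade down instead of being thrown away.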

Hamartia Antidote · Jul 16, 2023

https://twitter.com/x/status/1678181628403744770

gambit · Jul 16, 2023

Just in an FYI to interested laymen...

Even most people who don’t know much about computer hardware know that Intel offers Celeron/Pentium, i3, i5, i7, and i9 processors, and that order is from weakest to strongest. What most people don’t know is that, due to problems with wafer yield, some of those processors are defective and work only partially, so Intel disables the broken part of the chip and sells it as a cheaper version. This is called binning. So, a Celeron/Pentium is really a broken i3 and an i5 is a broken i7, and even within a tier there are various versions of an i5 and i7: chips that can’t reach the maximum clock speed are locked and sold as a cheaper variant.

This practice is NOT illegal, dishonest, or unethical. This feature of semiconductor manufacturing applies to everything from complex CPUs/GPUs all the way down to simpler memory structures such as NAND/NOR/etc. The manufacturer is up front with customers about what they are getting. Each company has different labels for its specific grades of products, but essentially, most use three grades, and anything lower means the wafer is literally tossed into the scrap barrel.

Skull and Bones · Jul 16, 2023

gambit said:
Just in an FYI to interested laymen...


This certainly comes from the wafer 'real estate': the further your 'die' is from the center of the wafer, the higher the probability of the chip having defects.

The reason is that many RIE (Reactive Ion Etching) systems cannot maintain uniform etching across the whole wafer, and the same goes for PECVD (Plasma Enhanced Chemical Vapor Deposition). So they calibrate the system with the values they get from the center of a wafer.

Hamartia Antidote · Jul 16, 2023

gambit said:
Just in an FYI to interested laymen...


Doesn't this mean benchmarks would be inconsistent across chips, or does the chip disable a consistent number for uniformity just to alleviate this?

gambit · Jul 17, 2023

Hamartia Antidote said:
Doesn't this mean benchmarks would be inconsistent across chips, or does the chip disable a consistent number for uniformity just to alleviate this?

We can use this example...

Intel "Raptor Lake" 8P+16E Wafer Pictured

Andreas Schilling with Hardwareluxx.de, as part of the Intel Tech Tour Israel, got to hold a 12-inch wafer full of "Raptor Lake-S" dies. These are dies in their full 8P+16E configuration. The die is estimated to measure 257 mm² in area. We count 231 full dies on this wafer. Intel is building...

www.techpowerup.com

We count 231 full dies on this wafer.

Full dies means the die was completely formed during the manufacturing process. Edge dies are mostly partially formed dies, and partially formed dies are always 100% non-functional.
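As a sanity check on the 231 count, here is a minimal sketch using a common rule-of-thumb estimate for full dies per wafer: wafer area divided by die area, minus a correction for the partial dies lost around the edge. It assumes a 12-inch wafer is the standard 300 mm and ignores scribe lines and edge exclusion, so treat it as a ballpark. The function name is made up for illustration.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Rule-of-thumb estimate: wafer area / die area, minus an edge-loss term."""
    area_term = math.pi * wafer_diameter_mm ** 2 / (4.0 * die_area_mm2)
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(area_term - edge_term)


# 300 mm wafer (assumed) and the 257 mm^2 die estimate quoted above.
print(gross_dies_per_wafer(300, 257))
```

The estimate lands within a few dies of the 231 counted in the photo, which suggests the quoted die area is about right.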

Does a fully formed die perform according to specs? Not always. For mostly unexpected reasons, a fully formed die could even fail on the first functional test, so yes, what you said is correct: it is possible for one die to pass all tests on the first run while its neighbor dies, in any direction, fail any series of tests. A die could be repaired and/or parts of its structures disabled. For example, a NAND die may start out with 128 gb capacity, but due to manufacturing defects, by the time the die has passed all tests and been electrically repaired, it would be sold as 32 gb capacity. Intel's CPU manufacturing process is no different. A CPU die could start out as an i7, but after all the repairs are applied, it could be sold as an i3. The wafer could be a mix of i7, i5, and i3 dies all over, BUT this is not good. It means there are inconsistencies somewhere in the manufacturing lines in that particular fab. If a wafer started as an i7, most dies should be sold as i7, and Santa Clara would be making calls to the local fab director to find out why not. From my experience working with (and for) Intel, they are strict with data collection. Every CPU die has a unique ID with a traceable 10-year history, much like your car's VIN, where you can find vehicle data down to the paint color and the work shift in which the car finished its manufacturing.

So yes, you are correct that benchmarks could be inconsistent all over the wafer, but that would not be a good thing.
