China Dominates the World TOP500 Supercomputers

ChineseTiger1986 · Jun 26, 2016

jkroo said:
Holy, the quantum methodology killed my mathematics knowledge.

I bet there are no more than a thousand people in this world who can understand these formulas.

cirr · Jul 1, 2016

Inside Look at Key Applications on China’s New Top Supercomputer

June 30, 2016 Nicole Hemsoth

As the world is now aware, China is now home to the world’s most powerful supercomputer, toppling the previous reigning system, Tianhe-2, which is also located in the country.

In the wake of the news, we took an in-depth look at the architecture of the new Sunway TaihuLight machine, which will be useful background as we examine a few of the practical applications that have been ported to and are now running on the 10 million-core, 125 petaflop-capable supercomputer.

The sheer size and scale of the system is what initially grabbed headlines when we broke news about the system last week at the International Supercomputing Conference (full coverage listing of that event here). However, as details emerged, it quickly became apparent that this was no stunt machine designed to garner headlines by gaming the Top 500 supercomputer benchmark. Rather, this system rolled out with full system specs backed by news that several real-world scientific applications were able to run on the machine, some of which could use well over 8 million cores. That is a stunning bit of news in a community where application scalability and real-world performance are often at dramatic odds with projected theoretical peak performance.

To recap, the entire system is built from 1.45 GHz SW26010 processors, one per node. Each processor chip has 4 “core groups,” and each group has 65 cores (one management core [MPE] plus 64 compute cores), with the MPE core also capable of compute. This equals a total of 260 cores per node. The 260-core nodes are grouped into “supernodes” of 256 nodes each. A supernode fills a quarter of a cabinet, so 4 of those go in a cabinet, and the full system stretches to 40 cabinets total. There is an interconnect built into the chip (referred to as the custom “network on a chip” interconnect) and also an interconnect for hooking the nodes together to form a supernode.
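The headline core count can be checked from this hierarchy. The following is a minimal sketch in Python; the constant names are illustrative, and the values are the ones given in the article and in the Dongarra report figures quoted later in the thread.

```python
# Topology figures as stated in the article and the report quoted in this thread.
CORE_GROUPS_PER_NODE = 4     # one SW26010 chip per node, 4 core groups per chip
CORES_PER_GROUP = 1 + 64     # 1 management core (MPE) + 64 compute cores
NODES_PER_SUPERNODE = 256
SUPERNODES_PER_CABINET = 4
CABINETS = 40

cores_per_node = CORE_GROUPS_PER_NODE * CORES_PER_GROUP
nodes = NODES_PER_SUPERNODE * SUPERNODES_PER_CABINET * CABINETS
total_cores = cores_per_node * nodes

print(cores_per_node)  # 260
print(nodes)           # 40960
print(total_cores)     # 10649600
```

The result matches the "10 million-core" description used throughout the article.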

News about this new supercomputer, unlike the mystery about the practical value of Tianhe-2 when it was announced, had more credibility because of the number of Gordon Bell prize submissions that accompanied the formal launch. This prize is awarded to teams that can demonstrate remarkable scalability on massive machines, showing scientific/application value as well as performance and efficiency. As one might imagine, in the supercomputing arena, this is a grand challenge.

Despite the availability of millions of compute cores, sometimes boosted by accelerators, getting real-world codes to scale to make full, efficient use of such resources is an ongoing, pressing challenge. In fact, this is one of the great questions as the impetus builds for exaflop-capable systems: even with such power, how many codes will be able to scale to take advantage of that capability?

In addition to the Gordon Bell prize submissions (more on those below), Dr. Haohuan Fu, Deputy Director of NSCC-Wuxi, where the Sunway TaihuLight supercomputer is housed, shared details and performance results for some key applications running on the new machine in a session at ISC 16. The Next Platform was on hand to gather some insight from this talk and share a few slides.

Deep Learning Libraries, Large-Scale Neural Networks

Although supercomputing applications are still just out of reach of the influence of deep learning (something we expect will shift in the next couple of years) the TaihuLight supercomputer is being harnessed for some interesting work on deep neural networks. What is fascinating here is that currently, the inference side of such workloads can scale to many processors, but the training side is often scale-limited hardware and software-wise.

Fu described an ongoing project on the Sunway TaihuLight machine to develop an open source deep neural network library and make the appropriate architectural optimization for both high performance and efficiency on both the training and inference parts of deep learning workloads. “Based on this architecture, we can provide support for both single and double precision as well as fixed point,” he explains. The real challenge, he says, is to understand why most existing neural network libraries cannot benefit much from running at large scale and looking at the basic elements there to get better training results over a very large number of cores.

Above are some noteworthy preliminary performance results for double-precision convolutional layers. The efficiency isn’t outstanding (around 30%), but Fu says they’re working on the library to bolster it and get equal or better performance than the GPU, the standard thus far for training.

Weather and Atmospheric Codes

Earth systems modeling, weather forecasting, and atmospheric simulations are a few key application areas where scientists using TaihuLight are scaling to an incredible number of cores. The Chinese-developed CAM weather model has been a focal point for teams to scale and represents some of the challenges inherent to exploiting a new architecture.

According to Fu, “there is a lot of complexity in the legacy codebase with over a half million lines of code. We can’t do all of this manually, so we’re working on the tools to port them since the legacy codes were not designed for multicore and not for a manycore architecture like the Sunway processor.” The tools they are working on are targeting the right level of parallelism, code sizes, and memory footprint, but ultimately, he says, this leads to one of the greatest challenges—finding the right talent that can understand the underlying physics and the computational and software problems. “Even the climate scientists don’t understand the code well, it’s been added to over the course of three decades.”

Scalability and performance results for the CAM model can be seen above comparing both use with the management core and sub-cores and with just the management core. For some kernels that are compute intensive, the team saw a speedup of between 10-22X, but for others that were memory-bound, the speedup wasn’t high, just 2-3X. The results here show speedup for the entire model and if there is any takeaway here, this is scaling to quite impressive heights for code that’s still in process on a new architecture—1.5 million cores.

Fu says to get to this point, they had to divide CAM into two parts; the dynamic part, which was rewritten in the last decade (they ported and optimized manually), and the CAM physics component, which was the difficult part. “We’re relying on transformation tools here to expose the right level of parallelism and code sizes for the 260 cores on this architecture. We also developed our own memory footprint analysis tool for this part.”

Another earth systems application, a high-res atmospheric model is showing good results as well. This is an experimental project that differs from the porting and optimization requirements of the legacy code above. Here the team is taking a hardware and software co-design approach and applying a loosely coupled scheme to the scalable model. They have run experiments for 10 to 3 kilometer resolution—an impressive feat when one considers the current scalability and resolution capabilities for leading centers like ECMWF, among others.

In the example above, the team was able to use the entire system as it stood during this run: 38 cabinets, which is still well over 8 million cores. Fu says he expects that when they continue research with this code they will be able to use the full machine, over 10 million cores.

Gordon Bell Submissions

The following slide highlights the five applications that were submitted with the three accepted submissions highlighted. The winners of this award will be announced in November, but given the breadth of systems on the Top 500 now and their core counts, it is unlikely any will scale beyond 8 million cores since, well, none of them have even close to that many to begin with (the #2 machine, Tianhe-2, “only” has a tick over 3 million).

In terms of the code work for the Sunway TaihuLight machine, the unique architecture obviously creates some barriers. Fu says they have a parallel OS environment and are using their own homegrown file system (Sunway GFS), which many guess is based on Lustre. The machine will provide C, C++ and Fortran compilers and support for all basic software libraries. Fu says they are using a combination of OpenMP, OpenACC and MPI, but many of the early stage applications demonstrated here use a hybrid mode that balances OpenACC and MPI (one MPI process is allocated for each compute group, and OpenACC is used to execute the parallel threads).
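To get a feel for what that hybrid mode means at full scale, here is a rough sketch in plain Python (no MPI library involved). It assumes one MPI process per core group with the 64 compute cores of that group driven by OpenACC threads. The rank numbering and the `locate` helper are hypothetical illustrations, not the actual Sunway runtime layout.

```python
# Assumed layout: one MPI rank per core group, ranks numbered node by node.
NODES = 40_960                  # full-system node count quoted in this thread
CORE_GROUPS_PER_NODE = 4
COMPUTE_CORES_PER_GROUP = 64    # cores each rank would drive via OpenACC


def locate(rank: int) -> tuple[int, int]:
    """Map a hypothetical global rank to (node index, core group index)."""
    return divmod(rank, CORE_GROUPS_PER_NODE)


total_ranks = NODES * CORE_GROUPS_PER_NODE
accelerated_cores = total_ranks * COMPUTE_CORES_PER_GROUP

print(total_ranks)        # 163840
print(accelerated_cores)  # 10485760
print(locate(5))          # (1, 1)
```

Under these assumptions a full-machine job would run about 164 thousand MPI processes, far fewer than one process per core, which is presumably part of the appeal of the hybrid scheme.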

As an interesting final side note, this government-funded supercomputer is set to support the needs of manufacturing operations in the region, which includes large nearby cities such as Shanghai. One can expect that many of the solvers and other simulation workflows will go to support the region’s automotive and other industries, which explains why the $270 million funding for the supercomputer came from a collection of sources, including the province and cities near the center.

http://www.nextplatform.com/2016/06/30/inside-look-key-applications-chinas-new-top-supercomputer/

C130 · Jul 2, 2016

U.S can beat this new supercomputer right now

Cray XC40

1 cabinet = 48 nodes; 1 node = 4x Intel Xeon Phi Knights Landing; so 1 cabinet = 576 Tflops
240 cabinets = 138 Pflops

Each cabinet needs 40 kW of power (including RAM, NIC, motherboard, etc.), so that's 9.6 MW plus a few MW for cooling, and you've got the fastest supercomputer in the world :wave:

Cost would be less than $200 million
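The arithmetic in this estimate can be checked with a short Python sketch. All inputs are the poster's own figures; the per-chip number is simply what those figures imply, not a quoted Intel specification, and cooling power is left out because the post does not give a figure for it.

```python
# Inputs taken from the post above (peak figures, not measured Linpack results).
NODES_PER_CABINET = 48
CHIPS_PER_NODE = 4
CABINET_TFLOPS = 576.0
CABINETS = 240
KW_PER_CABINET = 40.0

# Per-chip peak implied by the cabinet figure.
tflops_per_chip = CABINET_TFLOPS / (NODES_PER_CABINET * CHIPS_PER_NODE)
system_pflops = CABINET_TFLOPS * CABINETS / 1000.0
power_mw = KW_PER_CABINET * CABINETS / 1000.0
# Gflops per watt, excluding cooling.
gflops_per_watt = (system_pflops * 1e6) / (power_mw * 1e6)

print(tflops_per_chip)            # 3.0
print(round(system_pflops, 2))    # 138.24
print(round(power_mw, 1))         # 9.6
print(round(gflops_per_watt, 1))  # 14.4
```

Note that 138 Pflops here is a theoretical peak, comparable to TaihuLight's 125 Pflops peak. The TOP500 list ranks systems by measured Linpack performance, which is lower than peak.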

TaiShang · Jul 3, 2016

C130 said:
U.S can beat this new supercomputer right now

Good for you.

Yes you can.

Dungeness · Jul 3, 2016

TaiShang said:
Good for you.

Yes you can.

Now you sound like Bob the Builder "Can we do it? Yes, we can". I guess O8 got his inspiration from Bob. :cheesy:

cirr · Jul 5, 2016

Sugon set out for exascale by 2020

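Sugon launches exascale supercomputer prototype system project

Published: 2016-07-05 09:03

Source: People's Daily

At a technology innovation conference held on the 4th, Sugon announced the official launch of the prototype system project it is leading for an exascale high-performance computer ("exascale supercomputer" for short), mounting a charge toward building a supercomputer capable of a quintillion operations per second.

Supercomputing is understood to be an important marker of a country's overall national strength and capacity for technological innovation. At present, countries and regions including the United States, Europe and Japan have all put forward their own exascale R&D plans.

China has also written research on exascale supercomputers and related technologies into its national 13th Five-Year Plan, hoping to achieve this around 2020. Under the 13th Five-Year Plan's special program on high-performance computing, Sugon, the National University of Defense Technology and the Jiangnan Institute of Computing Technology were simultaneously approved to lead exascale prototype system development projects.

"Prototype system development" is the pre-research work for the exascale project. Zhang Yunquan, secretary-general of the China Computer Federation's Technical Committee on High Performance Computing, said that developing a prototype system allows key technical difficulties to be tested and improved on, clearing obstacles for the eventual construction of the full system and avoiding major technical errors and problems. According to reports, the goals of Sugon's pre-research project are to complete an exascale prototype system, verify the key technologies and roadmap for building an exascale machine, form a complete design for the exascale machine, and lay the technical foundation for developing a domestically built exascale supercomputer.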

http://www.chinaequip.gov.cn/2016-07/05/c_135489274.htm

It is now a three-horse race.

@Bussard Ramjet

Dungeness · Jul 5, 2016

cirr said:
Sugon set out for exascale by 2020

It is now a three-horse race.

@Bussard Ramjet

So these 3 organizations will be building their respective exascale supercomputers in parallel? :crazy:

cirr · Jul 5, 2016

Dungeness said:
So these 3 organizations will be building their respective exascale supercomputers in parallel?

In parallel, yes, but following different technology roadmaps.

Max Pain · Jul 5, 2016

GS Zhou said:
The superpower India dwarfs the progress China has made so far. What a sad news!

President Xi: China is still the largest developing country

Amazing.
Humble, yet surprising everyone every day with strides in every field.
That's how it's done.
This is for the whole world to see.

Congratulations Indians, you just derailed yet another good thread. The discussion was going in the right direction until the bragger came, and I've got to admire his rigidity: despite being owned and proved wrong by many members, he is still blabbering.
There's a limit to shamelessness too -_-

GS Zhou said:
You fool! Tell me where the so-called 2014 World Bank data is coming from??? Worldbank.org only publishes the 2011 data as the most recent year's data for India. And you tell me you have the 2014 data already. You mean you are an economist working for the World Bank, so you have access to some internal data??

In fact, even the central bank of India only publishes the 2011 data as the most recent data. And you tell us you have the 2014 data??

You are indeed a low-IQ guy. It is a shame to PDF to offer you the elite membership!!

Dude! its not worth it.

C130 · Jul 5, 2016

Exascale is only possible if the processor can do 50 Gflops/watt. The Shenwei 26010 can only do 6 Gflops/watt. Can China improve its efficiency in the next 4 years??

:azn:
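The reasoning behind the 50 Gflops/watt threshold can be made concrete with a small Python sketch. The two efficiency values are the ones quoted in the post; the `power_mw` helper is just an illustration and ignores cooling and other overheads.

```python
EXAFLOP_IN_GFLOPS = 1e9  # 1 Eflop/s = 10^9 Gflop/s


def power_mw(gflops_per_watt: float) -> float:
    """Power in megawatts needed to sustain 1 Eflop/s at a given efficiency."""
    watts = EXAFLOP_IN_GFLOPS / gflops_per_watt
    return watts / 1e6


for eff in (6.0, 50.0):
    print(f"{eff:g} Gflops/W -> {power_mw(eff):.1f} MW")
# 6 Gflops/W -> 166.7 MW
# 50 Gflops/W -> 20.0 MW
```

In other words, at today's quoted efficiency an exascale machine would draw well over 100 MW, while 50 Gflops/watt brings it down to roughly 20 MW.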

qwerrty · Jul 5, 2016

C130 said:
Exascale is only possible if the processor can do 50 Gflops/watt. The Shenwei 26010 can only do 6 Gflops/watt. Can China improve its efficiency in the next 4 years??

design it with 1000 cores and hire tsmc or samsung to help stack it in 3d style using their latest 5nm tech

cirr · Jul 5, 2016

qwerrty said:
design it with 1000 cores and hire tsmc or samsung to help stack it in 3d style using their latest 5nm tech

No need for 5nm

XXX has apparently figured out how to achieve 30-60 Gflops/W using 28nm process.

I am sure SMIC will step in with 14nm process around 2018.

And TSMC is ever ready to provide 10nm or lower process in good time.

xunzi · Jul 5, 2016

C130 said:
U.S can beat this new supercomputer right now

It wouldn't be fun for us if there were no competition, so I'm very glad our American friends have decided to build a more powerful supercomputer. But first things first: let's talk less and put some out, so we can see where we're at and put out a more powerful supercomputer ourselves.

qwerrty · Jul 5, 2016

cirr said:
No need for 5nm

XXX has apparently figured out how to achieve 30-60 Gflops/W using 28nm process.

I am sure SMIC will step in with 14nm process around 2018.

And TSMC is ever ready to provide 10nm or lower process in good time.

good to know. still far behind indian supacowputer though. need to work harder..

C130 · Jul 6, 2016

qwerrty said:
design it with 1000 cores and hire tsmc or samsung to help stack it in 3d style using their latest 5nm tech

this is possible :wave:

1000 cores operating at 1.5 GHz on 7nm would be something like 10.5 Tflops at 200-250 watts, or 52 Gflops/watt if it's 200 watts
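A quick sketch of what those numbers imply, in Python. The inputs are the figures from the post for a hypothetical 1000-core chip; nothing here describes a real product.

```python
# Hypothetical chip from the post: 1000 cores, 1.5 GHz, 10.5 Tflops peak.
CORES = 1000
CLOCK_GHZ = 1.5
CHIP_TFLOPS = 10.5

chip_gflops = CHIP_TFLOPS * 1e3
# Flops each core would need to complete per clock cycle to hit the peak.
flops_per_cycle_per_core = chip_gflops / (CORES * CLOCK_GHZ)

print(flops_per_cycle_per_core)  # 7.0
for watts in (200, 250):
    print(watts, "W ->", chip_gflops / watts, "Gflops/W")
# 200 W -> 52.5 Gflops/W
# 250 W -> 42.0 Gflops/W
```

So the 50 Gflops/watt target is only met at the low end of the assumed power range; at 250 watts the same chip would fall short.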

The more I read about TaihuLight the more impressed I become (besides the memory being gimped)

http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

Each Supernode then is 256*3.06 Tflop/s and a Cabinet of 4 Supernodes is at 3.1359 Pflop/s.

All numbers are for 64-bit Floating Point Arithmetic.

1 Node = 260 cores
1 Node = 3.06 Tflop/s

1 Supernode = 256 Nodes
1 Supernode = 783.97 Tflops

1 Cabinet = 4 Supernodes
1 Cabinet = 3.1359 Pflops

1 Sunway TaihuLight System = 40 Cabinets = 160 Supernodes = 40,960 nodes = 10,649,600 cores.
1 Sunway TaihuLight System = 125.4359 Pflop/s
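The figures above are internally consistent, which a short Python sketch can confirm. It works backwards from the quoted system peak, and shows that the supernode figure of 783.97 Tflops corresponds to a per-node peak of about 3.0624 Tflops rather than the rounded 3.06. Variable names are illustrative.

```python
# Figures quoted from the report above.
NODES = 40_960
NODES_PER_SUPERNODE = 256
SUPERNODES_PER_CABINET = 4
SYSTEM_PFLOPS = 125.4359

# Work backwards from the system peak to the unrounded per-node peak.
node_tflops = SYSTEM_PFLOPS * 1000 / NODES
supernode_tflops = node_tflops * NODES_PER_SUPERNODE
cabinet_pflops = supernode_tflops * SUPERNODES_PER_CABINET / 1000

print(round(node_tflops, 4))                  # 3.0624
print(round(supernode_tflops, 2))             # 783.97
print(round(cabinet_pflops, 4))               # 3.1359
print(round(3.06 * NODES_PER_SUPERNODE, 2))   # 783.36, using the rounded figure
```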

I am impressed that each cabinet is 3.1 Pflops!! If you were to use Cray and Intel Xeon Phi Knights Landing you would only get 576 Tflops per cabinet, so basically you would need 6 times the number of cabinets to get the same power

Taihulight=40 Cabinets
Cray=240 Cabinets
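For reference, the cabinet comparison can be worked out in a few lines of Python using the per-cabinet peaks quoted in this thread. The 240-cabinet Cray configuration is the hypothetical 138 Pflops system from the earlier post, so it overshoots TaihuLight's peak somewhat.

```python
import math

# Per-cabinet peak figures quoted in this thread.
TAIHULIGHT_CABINET_TFLOPS = 3135.9
CRAY_CABINET_TFLOPS = 576.0
TAIHULIGHT_CABINETS = 40

ratio = TAIHULIGHT_CABINET_TFLOPS / CRAY_CABINET_TFLOPS
# Cray cabinets needed to match TaihuLight's total peak, rounded up.
cray_needed = math.ceil(
    TAIHULIGHT_CABINETS * TAIHULIGHT_CABINET_TFLOPS / CRAY_CABINET_TFLOPS
)

print(round(ratio, 2))  # 5.44
print(cray_needed)      # 218
```

The per-cabinet ratio is closer to 5.4 than 6, and about 218 such cabinets would match TaihuLight's peak; 240 cabinets corresponds to the larger 138 Pflops target.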
