Inside China’s Homegrown 64-Core ARM Big Iron Chip

cirr · Aug 26, 2015

August 25, 2015 Timothy Prickett Morgan

A little-known upstart Chinese chip maker called Phytium Technology was set to use the Hot Chips 27 conference in Silicon Valley as a coming out party of sorts for its 64-bit ARM server processors, but the company’s director of research, Charles Zhang, was not permitted to come to the event because of visa issues. Zhang was nonetheless able to get the information about the 64-core “Mars” ARM processor, which Phytium has been developing for three years, to the attendees.

According to the Hot Chips conference organizers, Phytium was not able to get a video of Zhang’s presentation through the Middle Kingdom’s firewalls, and an email of the video file that Zhang made to cover his absence also did not make it through. In the end, Zhang called in from China and gave his presentation over the sound system at the Flint Center at De Anza College, where Hot Chips is being held, presumably calling in over Skype or some similar service. (Isn’t technology wonderful?)

Not much is known about Phytium, which was founded in 2012 in Guangzhou, a port city in Guangdong province that lies northwest of Hong Kong, much further away than Shenzhen, which is just north of Hong Kong and another hotbed for technology. Phytium also has an office in Tianjin, where massive explosions rocked the city last week, killing dozens and knocking the Tianhe-1A hybrid CPU-GPU supercomputer offline. The company’s web site has very little information on it and looks like something that was created twenty years ago, but it does say that the company is working on “HPC Server” technology aimed at the Chinese market. Zhang quipped over the phone line that if any companies outside of China were interested in talking to Phytium about using its ARM server chips, Phytium was interested in talking to them.

Zhang said that Phytium aspires to be a leading-edge processor and ASIC maker in the Chinese IT sector and specifically that it will be working on two classes of ARM-based processors: one aimed at scale-up machines and another one aimed at scale-out machines used in hyperscale and cloud computing. Zhang referred to the former as “mainframe servers” and the latter as “Internet servers,” which is terminology that probably sounds a bit funny to our ears because both are old-fashioned ways of describing scale up and scale out architectures. But you get the idea.

The two families of processors that are under development by Phytium are called Mars and Earth. Mars is the one aimed at high-end, scale up architectures that are typified by mainframes, Unix servers based on RISC or Itanium engines, and the bigger Xeon E7 machines that have auxiliary chipsets from Hewlett-Packard, SGI, Lenovo, and a few others. As you can see from the chart above, the Mars ARM processors are aimed at systems that need to access large chunks of memory and have high bandwidth into memory and I/O to run workloads across coherent memory that spans lots of processor sockets. It is not clear at all how many sockets the Mars chip from Phytium will span, but presumably it is at least a dozen and perhaps as many as 16 or 32 sockets if the company wants to deliver the kind of big iron that used to make IBM and HP a pretty good living in China until a few years ago when the country started fostering indigenous suppliers for processors and systems.

Not much is known about the Earth ARM processors that Phytium is developing, and when asked by The Platform for more information on them, Zhang said that he was not authorized to talk about them at this time. What we can see from the chart above is that the Earth ARM processors will be aimed at scale out clusters and offer more modest performance than the Mars cores and be focused more on low cost, high power efficiency, and dense server configurations. Oddly enough, both the Mars and Earth processors will deliver “high bandwidth memory access,” according to Zhang’s presentation. Presumably the Earth processors from Phytium will be based on the 64-bit ARM architecture as the Mars chips are.

“This is a good beginning. In the next few years, we will be adding a more powerful core.”

The Mars ARM server chips are based on a core design called Xiaomi, which is also the name of the world’s third largest smartphone maker. That company, located in Beijing, uses ARM processors made by MediaTek and Qualcomm in its devices and, as far as we know, does not make its own ARM processors. The choice of the Xiaomi core name is no doubt significant, and we will learn its meaning soon enough. (It could be that Xiaomi has aspirations in the datacenter as it does in phones, much as Qualcomm does, and is somehow funding the effort. No one knows.) In any event, here is the basic overview of the Mars processor, which is compatible with the 64-bit ARMv8 architecture, which presumably means that Phytium is a full licensee of the ARM architecture like Applied Micro, AMD, Broadcom, Cavium Networks, and presumably Qualcomm are.

The Mars chip is organized in a hierarchy, with a block of circuits called a panel holding four blocks of two cores. Four cores, top and bottom of the panel, share an L2 cache each, with cache coherence on the panel and across the eight panels on the complete chip managed by two director control units (DCUs). Each L2 cache has 2 MB of capacity, for a total of 4 MB per panel and 32 MB across all eight panels on the die. Each Xiaomi core has 32 KB of L1 instruction cache and 32 KB of L1 data cache.
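The hierarchy in the paragraph above can be tallied with a short Python sketch. The constant names are illustrative and not Phytium's terminology; only the numbers come from the text.

```python
# Tally of the Mars core and cache hierarchy, using only figures quoted above.
# The constant names are illustrative, not Phytium's terminology.
PANELS_PER_CHIP = 8
BLOCKS_PER_PANEL = 4       # "four blocks of two cores"
CORES_PER_BLOCK = 2
L2_CACHES_PER_PANEL = 2    # one for the top four cores, one for the bottom four
L2_MB_PER_CACHE = 2
L1_KB_PER_CORE = 32 + 32   # instruction cache + data cache

cores_per_panel = BLOCKS_PER_PANEL * CORES_PER_BLOCK
total_cores = PANELS_PER_CHIP * cores_per_panel
cores_per_l2 = cores_per_panel // L2_CACHES_PER_PANEL
l2_mb_per_panel = L2_CACHES_PER_PANEL * L2_MB_PER_CACHE
l2_mb_total = PANELS_PER_CHIP * l2_mb_per_panel
l1_kb_total = total_cores * L1_KB_PER_CORE

print(f"{total_cores} cores, {cores_per_l2} cores per L2 cache")
print(f"L2: {l2_mb_per_panel} MB per panel, {l2_mb_total} MB per chip")
print(f"L1: {l1_kb_total} KB per chip")
```

The totals agree with the 64 cores, 4 MB per panel, and 32 MB per chip given in the text.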

The interesting thing about the Mars design, aside from the fact that it has 64 cores on a single die, is a set of features called the cache and memory chips, or CMC. This name implies that the CMCs are external to the Mars die, but they are not. (Perhaps calling it a cache and memory controller would be better, and in this case, more accurate.) Each panel on the Mars die has its own CMC, which weaves together four banks of L3 cache memory with a total of 16 MB of capacity and 2 MB extra for ECC data scrubbing. The CMC has two DDR3 memory controllers, each supporting 800 MHz DDR3 memory, and together they deliver 25.6 GB/sec of memory bandwidth. The interface between the ARM panels and this CMC is proprietary. The routing cells at the heart of each Mars panel link the CMC to the DCU and on into the L2 caches and up into the Xiaomi cores. The cores support both 32-bit and 64-bit modes and also sport 128-bit SIMD instructions. Add it all up, and the Mars chip delivers about 512 gigaflops of peak double precision floating point performance, a memory bandwidth of 204 GB/sec across its sixteen DDR3 channels, and an I/O bandwidth of 32 GB/sec across its two PCI-Express 3.0 x16 controllers.
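The "add it all up" arithmetic can be reproduced with a rough Python sketch. Two inputs are assumptions rather than figures from the presentation: a standard 64-bit (8-byte) DDR3 channel, and four double precision flops per core per cycle, which is simply the value that reproduces the quoted 512 gigaflops. The PCI-Express line uses the standard 8 GT/sec per lane and 128b/130b encoding of PCI-Express 3.0.

```python
# Back-of-the-envelope check of the Mars peak figures.
PANELS = 8
CORES = 64
CORE_GHZ = 2.0
# Assumption: 4 double precision flops per core per cycle. This is the value
# that reproduces the quoted 512 gigaflops, not a figure from the talk.
DP_FLOPS_PER_CYCLE = 4

DDR3_CLOCK_MHZ = 800
TRANSFERS_PER_CLOCK = 2        # double data rate
BYTES_PER_TRANSFER = 8         # assumption: standard 64-bit DDR3 channel
CHANNELS_PER_CMC = 2
L3_MB_PER_CMC = 16

PCIE3_GT_PER_LANE = 8.0        # PCI-Express 3.0 signalling rate per lane
PCIE3_ENCODING = 128 / 130     # 128b/130b line coding
PCIE_LANES = 16
PCIE_CONTROLLERS = 2

peak_gflops = CORES * CORE_GHZ * DP_FLOPS_PER_CYCLE
channel_gbs = DDR3_CLOCK_MHZ * TRANSFERS_PER_CLOCK * BYTES_PER_TRANSFER / 1000
cmc_gbs = CHANNELS_PER_CMC * channel_gbs
chip_gbs = PANELS * cmc_gbs
l3_mb_total = PANELS * L3_MB_PER_CMC
# Usable bandwidth in one direction, summed over both x16 controllers.
pcie_gbs = PCIE_CONTROLLERS * PCIE_LANES * PCIE3_GT_PER_LANE * PCIE3_ENCODING / 8

print(f"peak compute:     {peak_gflops:.0f} gigaflops")
print(f"memory per CMC:   {cmc_gbs:.1f} GB/sec")
print(f"memory per chip:  {chip_gbs:.1f} GB/sec over {PANELS * CHANNELS_PER_CMC} channels")
print(f"L3 per chip:      {l3_mb_total} MB")
print(f"PCI-Express:      {pcie_gbs:.1f} GB/sec per direction")
```

Under these assumptions the memory figure comes out at 204.8 GB/sec and the I/O figure at about 31.5 GB/sec per direction, which the article rounds to 204 and 32.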

The Xiaomi cores clock at 2 GHz, and they have a superscalar architecture for their instruction pipeline that implements out-of-order execution, as RISC processors for servers have had for a long time. The 2D mesh interconnect that links the panels together with a protocol called Hawk runs at 2 GHz, and the CMC runs at 1.5 GHz. The cores run at 0.9 volts and the uncore I/O areas of the chip run at 1.8 volts. The chips are implemented in a 28 nanometer process (we don’t know whose), and the chip measures 25.2 millimeters by 25.38 millimeters; it is unclear how many transistors are on this beast, but we know it will have around 3,000 pins and burn at around 120 watts.
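A few derived figures follow directly from these numbers. The Python sketch below only does the arithmetic; the per-core power figure divides the whole chip's budget by 64, so it includes the uncore and is a rough upper bound rather than a measured core power.

```python
# Derived figures from the quoted die dimensions, power, and peak throughput.
DIE_WIDTH_MM = 25.2
DIE_HEIGHT_MM = 25.38
CHIP_POWER_W = 120
CORES = 64
PEAK_GFLOPS = 512

die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM
watts_per_core = CHIP_POWER_W / CORES          # whole-chip power spread over the cores
gflops_per_watt = PEAK_GFLOPS / CHIP_POWER_W   # peak, not sustained

print(f"die area:        {die_area_mm2:.1f} square millimeters")
print(f"power per core:  {watts_per_core:.3f} watts (uncore included)")
print(f"peak efficiency: {gflops_per_watt:.2f} gigaflops per watt")
```

The area works out to just under 640 square millimeters.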

On early benchmark tests, the Mars chip was able to do about 10 GB/sec on the STREAM Triad memory bandwidth test with eight cores activated and scaled up linearly to around 80 GB/sec of bandwidth with all 64 cores on the die humming. On the SPEC_CPU2006_base processor benchmarks, the Mars chip has a rating of 19.2 on integer math and 17.8 on floating point math running a single copy of the benchmark. If you fire up 64 copies of the benchmark and run the SPEC_CPU2006_rate tests on the Mars chip, it gets a rating of 672 on integer math and 585 on floating point math.
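How well these results scale can be estimated with a little Python. This is a rough sketch: it treats the single-copy score as the one-core baseline for the 64-copy rate score, which is an approximation because SPEC speed and rate metrics are not computed identically, and it compares STREAM against the 204 GB/sec peak quoted earlier.

```python
# Rough scaling estimates from the quoted benchmark figures.
CORES = 64
PEAK_MEM_GBS = 204.0      # peak memory bandwidth quoted for the full chip
STREAM_64_CORES_GBS = 80.0

SINGLE_COPY = {"integer": 19.2, "floating point": 17.8}
RATE_64_COPIES = {"integer": 672.0, "floating point": 585.0}

speedup = {k: RATE_64_COPIES[k] / SINGLE_COPY[k] for k in SINGLE_COPY}
efficiency = {k: speedup[k] / CORES for k in speedup}
stream_fraction = STREAM_64_CORES_GBS / PEAK_MEM_GBS

for kind in SINGLE_COPY:
    print(f"{kind}: {speedup[kind]:.1f}x on {CORES} cores "
          f"({efficiency[kind]:.0%} of linear)")
print(f"STREAM Triad reaches {stream_fraction:.0%} of peak memory bandwidth")
```

On these assumptions, 64 cores deliver roughly 35 times the single-copy integer score and about 33 times the floating point score, a little over half of linear scaling.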

If the Mars chip has on-die NUMA or SMP clustering, this was not revealed. But it almost certainly does have such features or it cannot be considered a building block for big iron machinery. Presumably this Hawk cache coherency network that links together the ARM core panels, the CMCs, and the PCI-Express controllers on the die can be extended out across multiple processor sockets. It is also a wonder why Ethernet or InfiniBand ports are not on the Mars chip, but perhaps that is coming with the next generation.

“This is a good beginning,” Zhang said at the end of his presentation. “In the next few years, we will be adding a more powerful core.” This follow-on Mars core will have a more aggressive branch predictor, multithreading, more aggressive instruction-level parallelism, and a wider SIMD unit. The power efficiency will also be increased, memory bandwidth will be boosted, and more RAS features will be added. All of this will presumably be enabled through a process shrink at Phytium’s fab partner, and if that is Taiwan Semiconductor Manufacturing Corp, that could mean a jump straight from 28 nanometers over 20 nanometers to 16 nanometers.

It does not look like Phytium is interested in building its own systems, but rather wants to sell its Mars and Earth ARM processors to others who do make machines to sell to customers. Inspur has invested in an Itanium-based big iron machine called the K1 that runs the K-UX variant of Red Hat Enterprise Linux, and the company must be looking around for an indigenous alternative to Itanium with that processor clearly being sunsetted by Intel and Hewlett-Packard. (Although they never talk about it.) While Chinese server makers can get behind the OpenPower effort and join with Suzhou PowerCore to build Power8-based machines, making the case for ARM is just as easy, given its dominance in smartphones and tablets.

To make the case for the Mars and Earth processors, Zhang showed a table of server revenue figures for China and for the entire world. As you can see, Inspur, Lenovo, Huawei Technology, and Sugon – all indigenous companies in China with aspirations outside of the Middle Kingdom – do well in China. (Huawei was growing more modestly in the first quarter.) HP and Dell do reasonably well in China, with Dell doing a lot better than HP (thanks in part to Dell’s partnerships with hyperscalers Tencent, Alibaba, and Baidu). What you can also see is that most of IBM’s server business in China was due to its System x division, which is now part of Lenovo; without X86 machines, IBM’s Power Systems and mainframe businesses drop into the Others category that is shrinking fast in China. IBM has bet big that China will get behind OpenPower, but Phytium is betting that ARM has a better chance.

It is always good to have options, we say. And if Applied Micro, AMD, Cavium Networks, Broadcom, and Qualcomm can’t build processors that appeal to scale up and scale out customers, maybe Phytium can. It is not at all clear when Phytium intends to tape out, much less deliver Mars processors.

cirr · Aug 26, 2015

China Shakes Up ARM Servers

64-core chip leapfrogs competition

Rick Merritt

8/25/2015 08:00 PM EDT

CUPERTINO, Calif. – A China-based startup described at the annual Hot Chips event here the most aggressive ARM-based server processor to date. In the same session, Oracle described its first Sparc processor with integrated Infiniband.

Little known Phytium Technology Co. Ltd., founded in 2012, described a processor using 64 custom ARMv8 cores that will run at up to 2 GHz at 28nm. It can issue up to four instructions per cycle to hit up to 512 GFlops. The massive chip consumes 120W and fits in a 640 mm² die with about 3,000 pins.

The so-called Mars design surpasses existing high-end ARM-based server chips such as the 48-core ThunderX now sampling from Cavium and a high-end part still in the works at Broadcom. In February EZchip said it will ship a 100-core ARMv8 made in a 28nm process, but it may not ship until 2017.

The Mars design has not yet taped out, but nevertheless impressed analysts and observers at the annual gathering of microprocessor designers here, in part because few had heard of the company.

Like IBM's Power 8, Mars uses external L3 cache and memory controllers.

“My God, who knew…this is by far the most aggressive 64-bit ARM chip to be announced – it’s just awesome, and it was definitely the surprise of this event,” said Nathan Brookwood, principal of Insight64 (Saratoga, Calif.).

Sam Naffziger, a fellow at AMD who moderated the session, called Mars a respectable design with a “good cache hierarchy and good bandwidth match.”

Hot Chips organizers were surprised to get a paper proposal from Phytium, a company they had not heard from previously. It had accepted several papers in the past from a China government- and university-backed team building the so-called Godson processor.

“I was surprised we didn’t hear from [the Godson team] again this year,” said Ralph Wittig, a Hot Chips organizer. “When we got the Phytium paper we heard from ARM they were confident the startup was doing real stuff…their external memory modules are like IBM’s work on Power 8…we were highly impressed as a program committee,” Wittig said.

Adding to the mystery, a Phytium engineering manager was not able to get a U.S. visa in time for the event. He presented his slides by phone from China, where the company has offices in Tianjin and Guangzhou.

One attendee familiar with Phytium said the team was not from the Godson project. The company’s Tianjin offices did suffer broken glass and shrapnel from the recent explosions there, he said.

In simulations on the SpecCPU 2006 rate benchmark, Mars hit 672 in integer and 585 in floating-point performance for a 64-core chip. However, observers noted its scaling from single-core performance was modest.

The chip is organized into eight-core panels in which four cores share a 4-MByte cache. Eight external chips provide a total of 128 MBytes of L3 cache and 16 DDR3-1600 channels.

Phytium’s custom 64-bit ARM core has 192 physical registers. A reorder buffer can hold up to 160 instructions, and about 210 instructions can be in-flight in the overall pipeline.

Phytium designed its own 64-bit ARM core code-named Xiaomi.

The chip dispatches and retires instructions in-order and executes them out-of-order. It uses an aggressive branch predictor and implements multithreading.
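The dispatch in order, execute out of order, retire in order pattern can be illustrated with a toy Python model. This is a sketch of the general technique, not of Phytium's pipeline: the function name, the per-instruction latencies, and the choice to retire up to four instructions per cycle are assumptions, while the 160-entry reorder buffer and four-wide issue are the figures quoted in this article.

```python
from collections import deque

def simulate(latencies, rob_size=160, width=4):
    """Toy pipeline: in-order dispatch, out-of-order completion, in-order retirement.

    latencies[i] is the execution time in cycles (>= 1) of instruction i.
    Returns (completion_order, retirement_order, total_cycles).
    """
    rob = deque()                 # (instruction index, finish cycle), oldest first
    completed, retired = [], []
    next_instr = 0
    cycle = 0
    while len(retired) < len(latencies):
        cycle += 1
        # Dispatch in program order while the reorder buffer has free entries.
        for _ in range(width):
            if next_instr == len(latencies) or len(rob) == rob_size:
                break
            rob.append((next_instr, cycle + latencies[next_instr]))
            next_instr += 1
        # Completion: any instruction in the buffer may finish, regardless of age.
        completed.extend(i for i, finish in rob if finish == cycle)
        # Retirement: only from the head of the buffer, so program order is kept.
        for _ in range(width):
            if not rob or rob[0][1] > cycle:
                break
            retired.append(rob.popleft()[0])
    return completed, retired, cycle

# A slow operation (5 cycles) followed by three quicker ones.
done, retired, cycles = simulate([5, 1, 1, 3])
print("completed:", done)      # younger instructions finish first
print("retired:  ", retired)   # but results are committed in program order
print("cycles:   ", cycles)
```

In this run the three younger instructions finish before the slow one, yet nothing retires until the oldest instruction at the head of the buffer is done. A larger reorder buffer lets more younger work proceed behind a stalled instruction.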

Mars supports MPI and OpenMP interfaces for multiprocessing systems. Another processor in the works, called Earth, will be a lower cost, lower power device aimed more at today’s large data centers.

“I’m pretty sure [Mars] will be the first 64-core ARMv8 processor in the world,” said Charles Zhang, director of research for Phytium, speaking via a phone line to Hot Chips attendees. “It’s a good beginning…in next few years we will develop more powerful CPUs,” he said.

One of the biggest drawbacks of Mars is its size, said analysts. Achieving good yields on such a large chip will be difficult, they noted.

Oracle announced at Hot Chips a new server processor, its first to integrate Infiniband. The Sonoma chip is the first in a new family and also includes several features to accelerate Oracle’s database and other software.

Sonoma is a 20nm chip that includes 8 M7-class Sparc cores, supporting up to eight threads per core. It packs two DDR4 memory controllers supporting up to a TByte of memory per socket, enabling 77 GBytes/s peak memory bandwidth.

The chip also sports a PCI Express Gen 3 controller. It also includes four 16 Gbit/s coherent links to chain processors together.

Nearly 20% of Oracle's Sonoma is devoted to Infiniband.

The integrated Infiniband takes up about 20% of the chip’s die area. It implements two 56G Infiniband links and supports virtualization, enabling 32 virtual independent Infiniband adapters.

Oracle designed the Infiniband block internally so it could optimize it for its uses and own the intellectual property, said a design team member. The chip has several potential uses given Oracle uses Infiniband in a variety of existing systems for clustering, storage and other applications.

Sonoma includes four database accelerator blocks, optimized for Oracle’s software. The chip also puts a small metadata block in addresses which it can use as a comparator to prevent buffer memory overruns and prevent malware attacks such as HeartBleed.
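The idea of carrying a small piece of metadata in an address and comparing it on every access can be sketched as a toy Python simulation. The article gives no detail on how Oracle implements this, so everything here is an assumption for illustration: the class and method names, the 4-bit tag, and the 64-byte granule that each tag covers.

```python
class TagMismatch(Exception):
    """Raised when a pointer's tag does not match the tag of the memory it touches."""


class TaggedMemory:
    GRANULE = 64      # bytes covered by one tag (assumed)
    MAX_TAG = 15      # 4-bit tags, with 0 reserved for unallocated memory (assumed)

    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = [0] * (size // self.GRANULE)
        self.next_granule = 0
        self.next_tag = 1

    def alloc(self, nbytes):
        """Reserve whole granules, stamp them with a fresh tag, return (tag, address)."""
        granules = -(-nbytes // self.GRANULE)          # round up
        if self.next_granule + granules > len(self.tags):
            raise MemoryError("out of simulated memory")
        tag, start = self.next_tag, self.next_granule
        self.next_tag = tag % self.MAX_TAG + 1         # cycle through 1..15
        for g in range(start, start + granules):
            self.tags[g] = tag
        self.next_granule += granules
        return tag, start * self.GRANULE

    def _check(self, ptr, offset):
        """Compare the tag carried by the pointer with the tag stored for the memory."""
        tag, base = ptr
        addr = base + offset
        limit = len(self.tags) * self.GRANULE
        if not 0 <= addr < limit or self.tags[addr // self.GRANULE] != tag:
            raise TagMismatch(f"pointer tag {tag} rejected at address {addr}")
        return addr

    def load(self, ptr, offset):
        return self.data[self._check(ptr, offset)]

    def store(self, ptr, offset, value):
        self.data[self._check(ptr, offset)] = value


mem = TaggedMemory(256)
buf = mem.alloc(64)        # a request buffer
secret = mem.alloc(64)     # unrelated data that happens to sit right after it
mem.store(buf, 0, 1)
mem.store(secret, 0, 0x2A)

print("in-bounds read:", mem.load(buf, 0))
try:
    mem.load(buf, 64)      # over-read past the end of buf, into secret
    leaked = True
except TagMismatch as err:
    leaked = False
    print("blocked:", err)
```

In this model an over-read of the kind behind HeartBleed fails as soon as it crosses into memory stamped with a different tag. The sketch also shows the limits of the approach: an overrun that stays inside the same granule is not caught, and with only a few tag bits two unrelated allocations can end up with the same tag.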

“It’s a very impressive chip and a gutsy call to integrate Infiniband in the silicon,” said analyst Brookwood. “Oracle has been more aggressive on Infiniband for access to storage than others, but they are still using a lot of silicon for it,” he said.

— Rick Merritt, Silicon Valley Bureau Chief, EE Times

China Shakes Up ARM Servers | EE Times

TaiShang · Aug 26, 2015

The progress is very good. But the language of the first article is rather hateful. :disagree:

cirr · Aug 26, 2015

TaiShang said:
The progress is very good. But the language of the first article is rather hateful.

Huawei’s Hisilicon is working on its own ARMv8 compatible cores.

It would be great if Spreadtrum, Leadcore, etc. used Phytium’s cores for their future SoCs.

cirr · Aug 26, 2015

An ARM Architecture from China: How Can It Beat Intel’s Top Chips?

2015-08-26 13:13 Original article by Wang Qiang

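At the just-concluded Hot Chips 2015 conference, a recently founded Chinese company unveiled a 64-core ARM instruction set processor code-named “Mars.” Astonishingly, this CPU, developed by a Chinese team, offers performance comparable to Intel’s very top server chips and is without doubt the most powerful processor in the ARM camp today.

Phytium, whose Chinese name is Feiteng (飞腾), is a young CPU development company founded in 2012. Yet from the company’s name and its location, Guangzhou, we can already tell its real identity. Phytium is a company set up by the high-performance processor research team at China’s National University of Defense Technology (this explains why the Yankees refused to issue a visa for Phytium's director of research to come to the Hot Chips conference), and the university’s best-known work in the IT world is the Tianhe-2A supercomputer, which has taken the performance crown on the world supercomputer ranking five times in a row. Some of the chips in Tianhe-2A are the FeiTeng-1500, a Sparc instruction set CPU developed in-house by the university. Evidently, the company’s name is taken from this product. The company also chose Guangzhou in order to be close to the Guangzhou Supercomputing Center, which is where Tianhe-2A is located.

Unlike the well-known Loongson processor team at the Institute of Computing Technology of the Chinese Academy of Sciences, the university’s CPU development group has little name recognition with the public. In fact, as long as ten years ago there were rumors in the industry that the university was reverse-engineering and cloning Itanium, Intel’s IA64 processor. Later, as Itanium struggled in the market, NUDT (the English abbreviation of the National University of Defense Technology) stopped its imitation work and turned to developing high-performance chips based on the Sparc instruction set. After several years of effort, NUDT produced two server processors in succession, the FeiTeng-1000 and the FeiTeng-1500, and gradually became known in the industry.

After the Tianhe series of supercomputers began to make a name for itself in global supercomputing, NUDT set its sights further ahead. Tianhe-2A and the series of domestic supercomputers before it all used processors made by American companies such as Intel, Nvidia, and AMD, and their computing power and software development depended heavily on these foreign firms. To control the pace of supercomputer development itself, developing high-performance processors with independent intellectual property is the only way. By this time the university’s CPU team had built up considerable strength over several product generations, and it took on the task of designing a world-class CPU.

Now NUDT’s efforts have borne fruit. At Hot Chips 2015, Phytium presented its “Mars,” a server CPU that is compatible with the ARMv8 instruction set, uses four-issue out-of-order execution, has as many as 64 cores, and clocks at 2 GHz.

On the standard Spec 2006 test suite, “Mars” scores as high as 672 on multi-core integer and 585 on floating point. By comparison, Intel’s most powerful processors today, the Xeon E7-8890 v3 and the Xeon E5-2699 v3, score 680 and 460 on integer and floating point respectively, so the performance of “Mars” is enough to rival them.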

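The core of “Mars” is code-named “Xiaomi,” a name that is enough to set many imaginations running. But the choice of name is most likely just a coincidence, because Phytium has no connection whatsoever with Xiaomi, the maker of smart devices. The “Xiaomi” core is a typical modern high-performance microarchitecture design: four-issue, two floating point units, a pipeline that is not long, and a three-level cache scheme. It is clearly not a product designed for dense floating point computation, since its double precision output is only 4 flops per cycle. But the memory access design of the “Xiaomi” core is aggressive: with 192 registers, 512 KB of L2 and 2 MB of L3 per core, it closely resembles Intel’s Haswell microarchitecture. “Mars” uses a 2D mesh multi-core interconnect. Every eight “Xiaomi” cores form an array, and each array has one dual-channel DDR3-1600 memory controller; eight arrays make up the “Mars” chip, for a total of 64 cores, 32 MB of L2 cache, 128 MB of L3 cache, and 16 memory channels with 205 GB/s of theoretical memory bandwidth. The chip also has 32 PCIe 3.0 lanes.

The theoretical floating point capability of “Mars” is 512 G DP Flops. It uses a 28nm manufacturing process, clocks at 2 GHz, and its cores run at under 1 V. Although the whole chip reaches a startling 640 square millimeters in area, its full-load power consumption is only 120 W, a notch lower even than the Xeon E5-2699 v3 and E7-8890 v3, which use an advanced 22nm process and offer comparable performance. The ARM camp’s advantage in performance per watt used to show only in the low-power chips used in mobile devices, but Phytium has shown that even in high-performance server processors, a processor compatible with the ARM instruction set can hold a power advantage over Intel’s top products of the same period.

The production version of “Mars” is expected to launch in 2016 and is expected to be used first in NUDT’s next-generation supercomputer (possibly to be named Tianhe-3). Not long ago NUDT showed the Matrix 2000, a compute card meant to replace Intel’s Xeon Phi floating point chip, and the next Tianhe is expected to be built from a combination of “Mars” and Matrix 2000.

Besides “Mars,” Phytium also looked ahead to the “Earth” processor it is developing for the mainstream market. “Earth” is evidently a simplified version of “Mars,” with the core count possibly cut to between 4 and 16, aimed at markets such as desktop PCs and low-power servers. Judging from the single-core Spec results of “Mars,” an eight-core “Earth” could deliver multithreaded performance close to that of a quad-core Intel Core i7, possibly with a power advantage as well. And compatibility with the ARMv8 instruction set means that both “Mars” and “Earth” can easily run Android, Linux, and the countless applications on the market, and might even be compatible with Microsoft’s Windows 10. With the performance gap gone, the ARM camp has the confidence it needs to push into the desktop PC and server markets and challenge the position of x86. Phytium has not only reached a world-class level; more importantly, it has given the whole ARM camp confidence: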

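From now on, no one can question whether high-performance products can be built on the ARM instruction set. After dominating the PC and server markets for twenty years, the x86 architecture has finally met a strong rival.

And what Chinese people can be proud of is that this historic moment was created by a previously unknown Chinese company. In the ten years since Intel released the Core 2 processor in 2006, no company other than the veteran giant IBM has challenged Intel for the performance crown.

Today, the first to hurl a great axe at Big Brother is not AMD, not Nvidia, nor any of the many European and American companies, but a Chinese team that only a few years ago still carried the stigma of “copycat.” Even the industry’s most senior veterans should salute the young Phytium at this moment.

Without doubt, the release of “Mars” will greatly stimulate the development of the ARM camp and directly affect the shape of the CPU industry for years to come. If this trend continues, we will soon see ARM and x86 in direct confrontation in desktops and high-performance servers. When the Intel myth no longer shines, the IT industry will enter a new era of intense competition.

An ARM Architecture from China: How Can It Beat Intel’s Top Chips? | Leiphone (雷锋网)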