The Data Center's New Moore's Law: Liquid Cooling and Copper Packaging Drive Scaling

Nvidia CEO Jensen Huang has repeatedly emphasized that the data center is the new unit of compute. While this concept seemed straightforward at first, the deeper implications became clearer after Nvidia's presentations at GTC and OFC 2024. I only recently grasped exactly what is happening, and a simple reframing of the principles that drove Moore's Law makes the entire picture clearer. In this new paradigm, the rack itself is similar to a chip, and if we frame the rack as the new chip, we have a whole new vector along which to scale performance and power. Let's talk Moore's Law from the perspective of the data center.

Moore's Law as a Fractal

It all starts with Moore's Law. There is a profound beauty in semiconductors: the problem at the chip scale is the same problem at the data center level. Moore's Law is a fractal, and the principles that apply to nanometers apply to racks. The first profound principle is miniaturization. Moore's Law was built on the simple observation that shrinking transistors takes less power and gives you more performance, because the electrons physically don't have to travel as far. That is why Moore's Law was about halving the physical size of a transistor for decades. But in recent years we have reached the economic limits of scaling chips much smaller, so we have hit some asymptotes. This is the widely heralded end of Moore's Law. Making chips that much smaller has become harder.

But what applies at the bottom (moving bits closer together) applies at the top. Moving electrons farther takes time and more energy, and the closer all of the data, power, and logic sit, the less energy is wasted on distance. The problem is the same at the nanometer scale as at the rack scale, and moving cables and logic closer together yields system-level performance gains. This applies to all networks: there are economies of scale in moving things closer as long as there aren't geographic costs. So what is the answer? Let's not just move the electrons closer within the chip; let's move them closer within the rack. And that's exactly what Nvidia is doing.

Moore's Law Shifts to STCO

The pursuit of better chips will continue, but I think Nvidia is intelligently pursuing better systems and will likely get multiple generations' worth of chip-shrink improvements from intelligent system design. What is so attractive is that a generation or more of improvement is available with today's technology. Others will figure it out too, but Nvidia will benefit from boldly going first and achieving the total-system vision before anyone else. Moreover, Nvidia is just at the leading edge of an old concept: System Technology Co-Optimization (STCO). Many talked about STCO as a potential new vector of progress, but I don't think anyone expected Nvidia to show up with an opinionated, compelling version of it in 2024. What's more, if you squint, you can see how STCO fits neatly within decades of computing history. This is no different from the move from ICs to VLSI and SoCs to Nvidia's System of Chips.

Let's have a brief history lesson. First the transistor was created; then the integrated circuit combined multiple transistors into electronic components. Then LSI and VLSI made thousands of transistors work together, which was the beginning of the microprocessor.
Next came the observation that you could put multiple systems of semiconductors onto a single chip, the System on Chip (SoC). More recently we have been scaling out of the chip and onto the package, a la chiplets, heterogeneous compute, and advanced packaging like CoWoS. But I think Nvidia is taking the scaling game outside the chip to a System of Chips. I'm sure someone will eventually coin a more compelling acronym, but there's a real chance the 2020s and 2030s are about scaling out these larger systems rather than silicon. And it is all beautifully consistent with what came before it. If the data center is the new unit of compute, it's time to apply Moore's Law and hardware vendors' tricks to system-level optimization. Nvidia has already shown us its hand, and Andy Bechtolsheim pretty much gave the entire scaling roadmap in a presentation at OFC.

The Data Center as a Giant Chip

Imagine the data center as a giant chip. It is just a scaled-out advanced package of the memory and logic transistors for your problem. Instead of a multi-chip SoC, each GPU board is a "tile" in your chip. The tiles are connected with copper and moved as close together as possible for maximum performance and speed. Does moving things off the package, i.e., out of the rack, make sense? If performance and bandwidth are your objective, it doesn't, because that slows the entire chip's performance massively. The key bottleneck is moving data back and forth to train the model and keeping the model in memory, so keeping the data together makes sense. That is precisely why we are trying to scale HBM: memory dies on the same package as the accelerator. So, in the case of the data center as a chip, you'd try to package everything together as closely and as cheaply as possible. There are a few packaging options, from the closest outward: chip-to-chip packaging, then HBM on the package, NVLink over passive copper, and finally scaling out to InfiniBand or Ethernet. And wouldn't you know it, Nvidia has been pursuing the problem through this exact lens. The goal is to scale out chip-to-chip interconnect, HBM, NVLink, and InfiniBand. They even have a handy graph that puts the entire debate into perspective. As expensive as HBM is, per unit of bandwidth purchased it is virtually free. If bandwidth is the problem, it makes sense to exhaust the cheapest bandwidth before relying on the other scaling layers.
The concept that Nvidia CEO Jensen Huang has put forward, that the data center is the new unit of compute, is not only conceptually novel but is also showing enormous potential in practice. By treating the data center as one giant chip, Nvidia is driving a revolution in computing.
Under the traditional framing of Moore's Law, we focus on the transistor count and performance of a single chip. But as transistor dimensions approach physical limits, the pace of Moore's Law has slowed. If we shift our view from the single chip to the whole data center, an entirely new optimization space opens up.
In this new paradigm, the data center rack is treated as the "new chip." By optimizing the connections and layout inside the data center, we can scale performance and cut power at a more macro level. This is Moore's Law applied at the data center level: putting electrons, data, and logic closer together reduces the energy wasted over distance and improves system performance.
System Technology Co-Optimization (STCO), the concept Nvidia is now championing, is exactly this kind of system-level optimization. STCO emphasizes innovating and optimizing at the system level to maximize overall performance, extending optimization beyond the inside of the chip to the entire data center.
Thinking this way, Nvidia is leading a shift in data center design. Each GPU board is treated as a "tile" within the "chip," connected with copper and packed as closely together as possible to improve performance and speed. Nvidia also integrates HBM (high-bandwidth memory) on the same package as the accelerator to cut data-movement latency and energy.
This system-level optimization strategy is not just an improvement on existing technology; it is a major rethink of future compute architecture. By treating the data center as a whole, we can scale performance and reduce power at the macro level, consistent with what Moore's Law aimed for at the nanometer scale.
In short, Nvidia's strategy is to design the data center as one giant chip, maximizing performance by optimizing the connections and layout inside it. This approach carries the spirit of Moore's Law forward and points data center design in a new direction. As the technology keeps advancing, there is good reason to expect Nvidia to keep leading here and to keep pushing computing forward.
This slide was shown many times at OFC, and it captures where we will try to scale up as much as possible: it makes sense to scale the domain where the cost of bandwidth is cheapest before considering other domains. In Nvidia's case, that means scaling up as much memory as possible in HBM3 before considering NVLink, and then keeping as much compute as possible within NVLink before even considering scaling out to the network. Put differently, connecting 1 million accelerators over Ethernet is wasteful, but connecting 1 million accelerators over passive copper within a short-reach, interconnected node is economical and brilliant. Nvidia is pursuing as much scaling as possible over passive copper before it needs to use optics. This will be the lowest-cost and highest-performance solution. The copper backplane in the data center rack is effectively the new advanced packaging in the system-level Moore's Law race. The new way to shrink the rack is to put as much silicon as possible into the most economical package, connected as closely as possible over the cheapest and most power-efficient interconnect. This is the design ethos of the GB200 NVL72.
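To make the ordering concrete, here is a minimal sketch in Python of the "cheapest bandwidth first" rule. Every cost and capacity number below is an invented placeholder for illustration; the real figures live on Nvidia's slide, not here.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str              # interconnect domain
    usd_per_gbps: float    # hypothetical $ per Gbps of bandwidth in this domain
    capacity_gbps: float   # hypothetical bandwidth available before the domain is exhausted

# Ordered from closest/cheapest to farthest/most expensive, mirroring the hierarchy
# in the text: HBM on the package, NVLink over passive copper, then the scale-out
# network (InfiniBand/Ethernet over optics). Every number is a made-up placeholder.
tiers = [
    Tier("HBM (on package)",        usd_per_gbps=0.01, capacity_gbps=8_000),
    Tier("NVLink (passive copper)", usd_per_gbps=0.10, capacity_gbps=1_800),
    Tier("Network (optics)",        usd_per_gbps=1.00, capacity_gbps=400),
]

def allocate(demand_gbps: float) -> None:
    """Fill a bandwidth demand from the cheapest domain outward."""
    total_cost = 0.0
    for tier in tiers:
        if demand_gbps <= 0:
            break
        used = min(demand_gbps, tier.capacity_gbps)
        cost = used * tier.usd_per_gbps
        total_cost += cost
        demand_gbps -= used
        print(f"{tier.name:26s} {used:7,.0f} Gbps   ${cost:9,.2f}")
    print(f"{'total':26s} {'':7s}        ${total_cost:9,.2f}")

allocate(9_500)  # a hypothetical per-accelerator bandwidth demand
```

Whatever the real numbers are, the shape of the answer is the same: every Gbps served by optics costs an order of magnitude more than one served inside the rack, so you exhaust the inner domains first.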
The slide shown at OFC, and Nvidia's scaling strategy, both point to one core idea: scale in the domain where bandwidth is cheapest before considering any other domain.
For Nvidia, that means first scaling as much memory capacity as possible in HBM3 (third-generation high-bandwidth memory) before reaching for NVLink. NVLink is Nvidia's high-speed interconnect for fast data transfer between GPUs, and Nvidia keeps as much compute as possible inside the NVLink domain before scaling any further.
Put differently, connecting a million accelerators over Ethernet is wasteful, but connecting a million accelerators over passive copper within a short-reach interconnected node is economical and efficient. Nvidia's goal is to scale as far as possible over passive copper before optics are required; that is the lowest-cost, highest-performance solution.
The copper backplane in the data center rack effectively becomes the new advanced packaging in the system-level Moore's Law race. The new way to "shrink" the rack is to put as much silicon as possible into the most economical package, connected as tightly as possible over the cheapest, most power-efficient interconnect. That is the design philosophy of the GB200 NVL72.
The GB200 NVL72 is Nvidia's answer for a high-performance, low-cost data center solution. It lets Nvidia raise the performance of a single data center while scaling compute globally, without giving up energy efficiency or economic viability. The strategy pushes data center technology forward and gives the whole computing industry a new direction; through continued innovation and optimization, Nvidia is leading data centers into a new era of performance and efficiency.
This is the new Moore's Law; you're looking at the new compute unit. The goal is to increase the power of this rack and move as many chips into a single rack as possible. It is obviously the cheapest and most power-efficient way to scale. Jensen referenced this at GTC and talked about how the new GB200 rack takes one-quarter of the power and uses less space to train the same GPT 1.8T model.
The new Moore's Law unfolds around the data center rack. The core goal is to increase the compute of a single rack by packing as many chips into it as possible. This approach has clear advantages in both cost and energy efficiency; it is the most economical and effective way to scale compute.
Jensen Huang raised this at GTC and used the new GB200 rack as the example of the gains in performance and efficiency: the GB200 rack can train the same GPT 1.8T-scale model while using less energy and occupying less space. That demonstrates both the potential of packing more chips into a single rack and the data center's role, in practice, as the new unit of compute.
This rack-level scaling strategy is an extension of traditional Moore's Law. Traditionally, we improved the performance and density of a single chip by shrinking the transistor. Under the new Moore's Law, we look at the entire rack, improving overall compute and efficiency by optimizing the chip layout and interconnect inside it.
The GB200 rack's design embodies this spirit. By integrating more high-performance chips into a limited space and using advanced interconnects such as NVLink and high-bandwidth memory (HBM), the GB200 rack delivers a large jump in compute; and because it relies less on external connections, it is also more energy efficient.
In short, the new Moore's Law is about optimizing and scaling at the data center level, raising overall performance by increasing the compute of a single rack. The GB200 rack shows the idea is feasible and effective and gives future data center design a new direction. As the technology advances, we can expect data centers to keep breaking through on performance and efficiency and to push the entire computing industry forward.
Less space, less power, more performance—that's Moore's Law in all but name. Welcome to the new system-scaling era, so let's discuss how we scale from here. Andy Bechtolsheim's talks at OFC opened my eyes, because we have at least a generation of scaling ahead of us based on current technology.

Liquid Cooling and Copper in the Data Center

Before summarizing Andy's talk, I want to explain why we haven't done this before. The critical change that makes this all possible is liquid cooling. Because of the shift to liquid cooling, we can cool another doubling of power in a rack. The 120 kW GB200 NVL rack is a doubling over the current solution, and next generation I would expect another doubling of power. In some respects, this is a new power-scaling envelope, and we will likely scale as quickly as possible until we hit the asymptote of what liquid cooling can handle. In some ways, this is a clever permutation of Dennard scaling. The logical and obvious goal is to push the cooling to reach the maximum power we can miniaturize into a data center rack. Andy talked about exactly how much further we could go. He believes we can put about 300+ kW in a single rack and cool it with liquid. That power density will be equivalent to at least a generation of process shrink. It would look like this: we push HBM density as far as possible on a single chip, with 64 stacks of 16-hi HBM, or 500+ gigabytes of HBM. Andy even talked about using XDDR to scale memory further, out of the package.
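As a back-of-the-envelope check on what that power envelope buys, here is a small sketch. Only the 120 kW and ~300 kW rack figures come from the discussion above; the 72-GPU NVL72 configuration is an assumption, and the per-GPU figure lumps in CPUs, switches, and overhead, so treat all of it as illustrative.

```python
# Back-of-the-envelope check on the rack power envelope discussed above.
# Only the 120 kW and ~300 kW figures come from the text; the 72-GPU NVL72
# configuration is an assumption, and the per-GPU figure lumps in CPUs,
# NVLink switches, and overhead, so treat everything as illustrative.

rack_power_today_kw = 120      # GB200 NVL-class rack, per the text
rack_power_next_kw = 300       # Bechtolsheim's liquid-cooled ceiling, per the text
gpus_per_rack_today = 72       # assumed NVL72 configuration

kw_per_gpu_slot = rack_power_today_kw / gpus_per_rack_today
gpu_slots_at_300kw = rack_power_next_kw / kw_per_gpu_slot

print(f"~{kw_per_gpu_slot:.2f} kW per GPU slot today")
print(f"~{gpu_slots_at_300kw:.0f} GPU slots fit in a 300 kW liquid-cooled rack")
# -> roughly 2.5x the silicon in the same footprint, which is why the text
#    compares this power-density gain to a generation of process shrink.
```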
In the new era of system scaling, the goal is less space, less power, and more performance, which is exactly the spirit of Moore's Law. Andy Bechtolsheim's OFC talk showed that, on current technology alone, at least one more generation of scaling is available.

Liquid Cooling and Copper in the Data Center

Before summarizing Andy's talk, why hasn't this been done before? The key change that makes it all possible is liquid cooling. Thanks to the shift to liquid cooling, we can cool another doubling of power within a rack. The 120 kW GB200 NVL rack doubles the power of the current solution, and in the next generation we can expect power to double again.
In a sense this is a new power-scaling envelope, and we will scale as fast as we can until we reach the limit of what liquid cooling can handle. In some ways, it is a clever permutation of Dennard scaling.

Goal: push cooling so that a data center rack can absorb the maximum power we can miniaturize into it

Andy discussed how much further we can go. He believes roughly 300+ kW can be put into a single rack and cooled with liquid. That power density is worth at least a generation of process shrink.
Concretely, it would look like this: push HBM density as far as possible on a single chip, with 64 stacks of 16-hi HBM, or 500+ GB of HBM. Andy even talked about using XDDR to scale memory further out of the package.

Looking Ahead

Optimizing cooling and power density in the data center lets us use space and energy more effectively while raising compute performance. With liquid cooling, we can add more high-performance chips while keeping the equipment running stably, and so reach higher compute density.
As HBM and other memory technologies keep improving, future data centers should offer higher memory density and faster data movement. Emerging options such as XDDR may push data center memory further still, giving AI, machine learning, and other data-intensive workloads stronger support.
In short, the combination of liquid cooling and copper interconnect is taking us into a new era of system scaling. This era centers on the data center rack, where optimizing cooling and power density yields large performance gains. As the technology matures, data centers should become more powerful, more efficient, and more sustainable.
And what’s more, Panel-level packaging means that we can likely scale out more silicon die area. With larger packages and substrates, we can now use advanced packaging to scale out to 10x+ larger than current CoWoS packages. The goal would be to put as much silicon area as thermally possible in a single liquid-cooled rack.
Panel-level packaging opens new possibilities for data center scaling. With larger packages and substrates, advanced packaging can now extend silicon area to more than 10x today's CoWoS packages. The goal is to pack as much silicon area as thermally possible into a single liquid-cooled rack, for higher compute density and performance.
This packaging advance means more processing power and memory can be integrated within one rack, raising the overall performance of the data center. Liquid cooling is what makes such high-density integration possible, because it can manage the heat produced by that much silicon. Combining panel-level packaging with liquid cooling improves not only per-rack performance but also energy efficiency, since less energy is spent on heat removal.
Panel-level packaging brings other advantages as well, such as greater manufacturing flexibility and lower cost. A larger package can hold more chips and functions, potentially including CPUs, GPUs, memory, and other kinds of specialized hardware. That degree of integration simplifies data center design and maintenance while improving reliability and scalability.
Panel-level packaging can also drive new data center architectures. By integrating more silicon area within a rack, we can design more compact, more efficient data centers that need less floor space and lower infrastructure cost. As designs keep being refined, future data centers should become more flexible and better able to adapt to changing compute demands.
In short, panel-level packaging opens a new path for scaling data centers and raising their performance. Combined with liquid cooling, it lets us integrate more silicon area in a single rack for higher compute density and performance, and it will keep pushing data centers toward greater efficiency and capability.
So now imagine we can put 4-10x more silicon area in a single package and have the means to cool at least 2x more power. Assuming some power savings, the goal is to put as much silicon as possible into a rack, cool it as well as possible, and then interconnect it with passive copper. Only after passive copper, all within the NVLink domain, can we talk about optics and DSPs. The goal is to scale up before you need to pay for networking, and Nvidia will push this system scaling while simultaneously pursuing silicon scaling. It is a compelling scale-up roadmap. This system solution will be an order of magnitude cheaper than optics and probably the best price-to-performance you can buy. Nvidia is creating another layer of networking moat around its GPUs. If they pull off LPO first, that will be just another layer of networking advantage. This is pretty much why CXL is dead: it makes zero sense to put any compute or memory over the network; it costs too much and adds complexity to straightforward scaling. I conclude that copper will reign supreme at rack scale and can push Moore's Law scaling further. AI networking aims to scale copper networking as hard as possible before we have to use optics. Andy puts it this way:
Now imagine we can put 4 to 10 times more silicon area into a single package and can cool at least twice the power. Assuming some power savings, the goal is to put as much silicon as possible into the rack, cool it as well as possible, and then interconnect it with passive copper. Only after passive copper, all within the NVLink domain, do we consider optics and digital signal processors (DSPs). The aim is to scale up before you have to pay for networking, and Nvidia will push this system scaling while also pursuing silicon scaling. It is a compelling scaling roadmap.
This system solution will be an order of magnitude cheaper than optics and is probably the best price-to-performance you can buy. Nvidia is building another layer of networking moat around its GPUs. If they are first to pull off LPO (linear pluggable optics), that becomes yet another layer of networking advantage. This is also why CXL (Compute Express Link) has no future here: placing compute or memory out over the network makes no sense; it costs too much and adds complexity to otherwise straightforward scaling.
My conclusion is that copper will dominate at rack scale and can push Moore's Law scaling further. The goal of AI networking is to scale copper networking as hard as possible before optics become necessary. As Andy puts it:
As an interconnect material, copper has clear advantages at the rack level in the data center: it provides low-latency, high-bandwidth connections at relatively low cost, and it is easy to integrate. A copper-based interconnect strategy lets us integrate and scale far more compute without sacrificing performance.
As panel-level packaging matures, we can expect larger and denser silicon integration in future data centers. That integration improves not only per-rack performance but also resource utilization and energy management across the whole data center.
By pushing copper-based interconnect, Nvidia is setting a new standard for how data centers are designed and operated: lower cost, better performance, better scalability. As the technology advances, copper looks set to play a key role in extending Moore's Law and building more efficient, more powerful data centers.
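To put the "order of magnitude cheaper" claim in rough numbers, here is a small sketch. The 72-GPU domain and 18 links per GPU are assumed figures for an NVL72-class rack, and the per-link prices are invented placeholders chosen only to show the roughly 10x copper-versus-optics gap the argument relies on; none of them are quoted prices.

```python
# Rough illustration of the "order of magnitude cheaper than optics" claim.
# The 72-GPU domain and 18 NVLink links per GPU are assumed figures for an
# NVL72-class rack; the per-link prices are invented placeholders chosen only
# to show a ~10x copper-vs-optics gap, not quoted prices.

GPUS_PER_DOMAIN = 72
LINKS_PER_GPU = 18             # assumed NVLink ports wired to the copper backplane
USD_PER_COPPER_LINK = 40       # hypothetical passive-copper link cost
USD_PER_OPTICAL_LINK = 400     # hypothetical optical link cost (DSP, lasers, fiber)

links = GPUS_PER_DOMAIN * LINKS_PER_GPU
print(f"{links} links inside one scale-up domain")
print(f"passive copper: ${links * USD_PER_COPPER_LINK:,}")
print(f"optics:         ${links * USD_PER_OPTICAL_LINK:,}")
print(f"gap:            {USD_PER_OPTICAL_LINK / USD_PER_COPPER_LINK:.0f}x")
```

The absolute dollar figures are not the point; the point is that every link kept on passive copper avoids a DSP and a pair of optical engines, which is where the order-of-magnitude difference comes from.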
Now that liquid cooling has opened a relatively easy frontier of scaling, there will be a race to push that scaling. Nvidia is already using this level of integration as a vast new vector of differentiation. They are looking for better solutions and ways to buy the most FLOPS at the lowest power cost. It's time to think about Moore's Law beyond the narrow confines of the chip and at the level of the system. The new Moore's Law is about pushing the most compute into a rack.

Also, looking at Nvidia's networking moat as InfiniBand versus Ethernet completely misses the point. I think the NVLink domain over passive copper is the new benchmark of success, and it will make a lot of sense to buy GB200 NVL72 racks instead of just B200s. It's a new era in system design, using larger substrates, denser memory, and passive copper to keep information as close together as possible. This was already being pursued at the chip level; now it's happening at the rack level.

In this world, the leaves of the fat-tree architecture become even denser. The fat leaves will try to absorb as much compute and memory as possible before scaling out to the network is even needed. Nvidia is cleverly trying to eat the network from the bottom up. Meanwhile, Broadcom is pursuing scale-out from the top of the rack down, but given the cost and the headroom for scaling on copper, I think the energy and performance of scaling up from the leaves make a lot more sense. The tightly integrated, mainframe-like solution Nvidia offers will be the best in performance. And nowhere in this conversation is AMD, which will be trying to scale as a component in networks built around open consortiums.

The strategy of scaling out racks is clever and completely orthogonal to the previous ways we scaled chips. The hyperscalers, while likely aware of the benefits, probably didn't foresee a roadmap as fully defined as the one Nvidia is playing. I think it's time to start thinking about scaling Systems of Chips, and Nvidia, as usual, has already thought out and deployed the first edition of that future.
With liquid cooling, a new era of compute scaling has arrived. Nvidia is using this level of integration as a new axis of differentiation, looking for better solutions and ways to buy the most floating-point operations (FLOPS) at the lowest cost within its power constraints. It is time to widen the Moore's Law lens from the chip alone to the entire system.
The new Moore's Law is about concentrating as much compute as possible into a single rack. And framing Nvidia's networking moat as merely InfiniBand versus Ethernet misses the whole point. The NVLink domain over passive copper is the new benchmark of success, and buying GB200 NVL72 racks rather than standalone B200s will make a lot of sense.
This is a new era of system design: larger substrates, denser memory, and passive copper, all to keep information as close together as possible. The same idea was already being explored at the chip level; now it is happening at the rack level.
In this new world, the leaves of the fat-tree architecture become even denser. These fat leaves will absorb as much compute and memory as possible before any scale-out to the network is even needed. Nvidia is cleverly trying to eat the network from the bottom up.
Meanwhile, Broadcom is scaling from the top of the rack down, but given copper's cost and scaling headroom, scaling up from the leaves makes more sense on both energy and performance. Nvidia's tightly integrated, mainframe-like solution will have the best performance. Absent from this discussion is AMD, which will try to scale as a component within networks built by open consortiums.
The strategy of scaling out racks is clever and completely different from how we scaled chips before. The hyperscalers, while probably aware of the benefits, likely did not foresee a roadmap defined as fully as the one Nvidia is playing. It is time to start thinking about scaling Systems of Chips, and Nvidia, as usual, has already thought through and deployed the first edition of that future.
The "InfiniBand versus Ethernet" comparison—what does that sentence mean?
InfiniBand and Ethernet are two different networking technologies, each with its own characteristics and use cases.
InfiniBand is a high-performance computer networking standard designed for high-speed data transfer and low-latency communication. It is widely used in supercomputers, data centers, and storage networks, especially where high throughput and low latency matter, such as high-performance computing (HPC) and large-scale data analysis. InfiniBand supports very high data rates and offers excellent scalability and reliability, but it is relatively expensive and usually requires specialized hardware and expertise.
Ethernet is the most widely used local area networking technology, based on the IEEE 802.3 standards. It is mature, comparatively cheap, and easy to deploy and maintain, and it fits network environments of every size, from homes and small offices to enterprises and large data centers. Ethernet speeds range from 10 Mbps up to 400 Gbps, with new generations continually pushing rates higher. Ethernet is also used in high-performance computing, but it generally trails InfiniBand on throughput and latency.
So the sentence means that treating Nvidia's networking technology as simply a choice between InfiniBand and Ethernet overlooks Nvidia's deeper innovations and advantages. With NVLink and passive-copper interconnect, Nvidia is building a new kind of data center network architecture that achieves higher performance and lower latency inside the rack, rather than relying only on traditional InfiniBand or Ethernet. It is a whole new way to scale and optimize data center performance: optimizing at the level of the entire system, not just the network.