
What's next for photonics-powered data centers and AI ft. Lightmatter's Nick Harris - YouTube

Published: 2024-03-29 12:45:00

Transcript

It is my privilege to introduce the first speaker, Nick Harris, who is CEO of Lightmatter. We all know that a lot of AI progress has been driven by scaling laws and training very large foundation models. Nick and his company Lightmatter are key players in that: he's building very, very large data centers, hundreds of thousands of GPUs, maybe millions of nodes someday, that will be coming online soon, and hopefully powering hyperscalers' next generation of AI models for all of us to build on. So with that, I'll hand it over to Nick.

All right, thank you, Sequoia team, for the invite; Sean McGuire, for putting my name up; and Constantine, for accepting the talk. I have to say, the talks at Sequoia, and I've attended two events now, have really been world class. Sequoia is able to pull together some of the most interesting people in the world. So yeah, let's talk about Lightmatter and the future of the data center. One of the things that I thought was incredibly exciting from earlier today was seeing Sora. The example that was very near to my heart was looking at what happened as you scaled the amount of compute in the AI model: it went from this sort of goofy mush of some kind of furry thing to the physics of a dog with a hat on, and a person with their hair flowing. And this is the difference that the amount of compute you have makes on the power of AI models. So let's go ahead and talk about the future of the data center.

So this is pretty wild: a very rough estimate of the capital expenditure for the supercomputers that are used to train AI models. Let's start here at the bottom. 4,000 GPUs: something like $150 million to deploy this kind of system. 10,000: we're looking at about $400 million. 60,000: $4 billion. This is an insane amount of money. It turns out that the power of AI models, and AI in general, scales very much with the amount of compute that you have, and the spend for these systems is astronomical. And if you look at what's coming next, what's the next point here? $10 billion, $20 billion. There's going to be an enormous amount of pressure on companies to deliver a return on this investment. But we know that AGI is potentially out there. At least we suspect it is, if you spend enough money. But this comes at a very challenging time. My background is in physics. I love computers. And I'll tell you that scaling is over. You're not getting more performance out of computer chips. Jensen had a GTC announcement yesterday, I believe, where he showed a chip that was twice as big for twice the performance. And that's sort of what we're doing in terms of scaling today. So the core technology that drove Moore's law and Dennard scaling, that made computers faster and cheaper, that democratized computing for the world and made this AGI hunt we're on possible, is coming to an end. So at Lightmatter, what we're doing is looking at how you continue scaling. Everything we do is centered around light. We're using light to move the data between the chips, allowing you to scale much bigger so that you can get to 100,000 nodes, a million nodes, and beyond, and to figure out what's required to get to AGI, what's required to get to these next-gen models.
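The three capex points quoted above imply that cost per GPU rises as clusters grow, which is the interconnect-and-infrastructure tax the talk is driving at. A back-of-the-envelope sketch (using only the rough figures quoted in the talk, not actual Lightmatter data):

```python
# Rough capex figures from the talk: GPU count -> deployment cost (USD).
capex = {4_000: 150e6, 10_000: 400e6, 60_000: 4e9}

# Per-GPU cost climbs with scale: at the largest size it is nearly
# double the smallest, reflecting networking, power, and facilities
# overhead rather than just the chips themselves.
per_gpu = {n: cost / n for n, cost in capex.items()}

for n, cost in capex.items():
    print(f"{n:>7,} GPUs: ${cost / 1e9:5.2f}B total, ~${per_gpu[n]:,.0f} per GPU")
```

The superlinear trend is the point: extrapolating it is what makes the "$10 billion, $20 billion" next data points plausible.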

So this is kind of what a present-day supercomputer looks like. You'll have racks of networking gear and racks of computing gear. There are a lot of interconnections inside one of the computing racks, but then you get spaghetti: just a few links over to the networking racks, and this very weak sort of interconnectivity between clusters. And what that means is that when you map a computation like an AI training workload onto these supercomputers, you're basically having to slice and dice it so that big pieces of it fit in the tightly interconnected clusters. You have a really hard time getting good per-unit performance scaling as you get to 50,000 GPUs running a workload. So I'd basically tell you that a thousand GPUs is not just a thousand GPUs; it really depends on how you wire them together. And that wiring is where a significant amount of the value is. This is present-day data centers.

What if we deleted all the networking racks? What if we deleted all of these? What if we scaled the compute to be a hundred times larger? And what if, instead of the spaghetti linking everything together, we had an all-to-all interconnect? What if we deleted all of the networking equipment in the data center? This is the future that we're building at Lightmatter. We're looking at how you build the AI supercomputers that get to the next model. It's going to be super expensive, and it's going to require fundamentally new technologies. And this is the core technology.
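To see why an all-to-all interconnect has historically been avoided, count the links: a full mesh of n nodes needs n(n-1)/2 point-to-point connections, which grows quadratically. A minimal sketch of that arithmetic (an illustration of the general topology trade-off, not a description of Lightmatter's design):

```python
def all_to_all_links(n: int) -> int:
    """Point-to-point links needed for a full mesh of n nodes."""
    return n * (n - 1) // 2

# Quadratic growth is why electrical fabrics fall back on switch
# hierarchies (the "networking racks") instead of a full mesh; the
# talk's claim is that optics changes this trade-off.
for n in (8, 1_000, 100_000):
    print(f"{n:>7,} nodes -> {all_to_all_links(n):,} links")
```

At 100,000 nodes a full mesh is on the order of five billion links, which frames how different an optically interconnected fabric would have to be from today's copper-and-switch clusters.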

This is called Passage. And this is how all GPUs and switches are going to be built. We work with companies like AMD, Intel, NVIDIA, Qualcomm, places like this, and we put their chips on top of our optical interconnect substrate. It's the foundation for how AI computing will make progress. It will reduce the energy consumption of these clusters dramatically, and it will enable scaling to a million nodes and beyond. This is how you get to wafer scale, the biggest chips in the world. And this is how you get to AGI.