AI Czar David Sacks Explains the DeepSeek Freak Out

Published: 2025-02-02 19:30:03

Summary

(0:00) Sacks breaks down the tech community's DeepSeek freak out, and how we should think about the reported $6M training ...


Transcript

One of the really cool things about this job is just that when something like this happens, I get to kind of talk to everyone and everyone wants to talk and I feel like I've talked to maybe not everyone and like all the top people in AI, but it feels like most of them. And there's definitely a lot of takes all over the map on DeepSeek, but I feel like I've started to put together a synthesis based on hearing from the top people in the field.

It was a bit of a freak out. It's rare that a model release is going to be a global news story or cause a trillion dollars of market cap to be wiped out in one day. And so it is interesting to think about why this was such a potent news story. And I think it's because there are two things about that company that are different. One is that obviously it's a Chinese company rather than an American company, so you have the whole China-versus-US competition. And the other is that it's an open-source company, or at least it open-sourced the R1 model.

And so you've kind of got the whole open source versus closed source debate. And if you take either one of those things out, it probably wouldn't have been such a big story. But I think the synthesis of these things got a lot of people's attention. A huge part of TikTok's audience, for example, is international. Some of them like the idea that the US may not win the AI race, that the US is kind of getting a comeuppance here. And I think that fueled some of the early attention on TikTok.

There are a lot of people who are rooting for open source, or who have animosity towards OpenAI. And so they were kind of rooting for this idea that, oh, there's this open source model that's going to give away what OpenAI has done at one-twentieth the cost. So I think all of these things provided fuel for the story.

Now, I think the question is, okay, what should we make of this? I mean, I think there are things that are true about the story, and then things that are not true or should be debunked. The, let's call it, true thing here is that if you had said to people a few weeks ago that the second company to release a reasoning model along the lines of o1 would be a Chinese company, I think people would have been surprised by that.

So I think there was a surprise. And just to kind of back up for people, there are two major kinds of AI models now. There's the base LLM, like GPT-4o, or the DeepSeek equivalent, V3, which they launched a month ago. That's basically like a smart PhD: you ask a question, it gives you an answer. Then there are the new reasoning models, which are based on reinforcement learning, a separate process as opposed to pre-training.

And o1 was the first model released along those lines. And you can think of a reasoning model as like a smart PhD who doesn't give you a snap answer, but actually goes off and does the work. You can give it a much more complicated question and it'll break that complicated problem into a subset of smaller problems. And then it'll go step by step to solve the problem.

And that's called chain of thought, right? And so the new generation of agents that are coming are based on this idea of chain of thought: that an AI model can sequentially perform tasks and figure out much more complicated problems. So OpenAI was the first to release this type of reasoning model. Google has a similar model they're working on called Gemini 2.0 Flash Thinking. They've released kind of an early prototype of this called Deep Research 1.5. And Anthropic has something that I don't think they've released yet.
这叫做“思维链”,对吗?新的代理模型就是基于这一思维链的理念。这种思维链让人工智能模型可以顺序地执行任务,从而解决更复杂的问题。因此,OpenAI是第一个发布这种推理模型的公司。谷歌也正在研发一个类似的模型,名为Gemini 2.0“闪思考”。他们已经发布了这个模型的一个早期原型,叫做Deep Research 1.5。Thoropik也有一个类似的产品,但我认为他们还没有发布。

So other companies have similar models to o1, either in the works or in some sort of private beta, but DeepSeek was really the next one after OpenAI to release the full public version of it. And moreover, they open-sourced it. And so this created a pretty big splash. And I think it was legitimately surprising to people that the next big company to put out a reasoning model like this would be a Chinese company.

And moreover, that they would open source it, give it away for free. And I think the API access is something like one-twentieth the cost. So all of these things really did drive the news cycle, and I think for good reason, because if you had asked most people in the industry a few weeks ago how far behind China is on AI models, they would have said six to twelve months. And now I think they might say something more like three to six months, right? Because o1 was released about four months ago, and R1 is comparable to that.

So I think it's definitely moved up people's timeframes for how close China is on AI. Now, let's take the claim that they only did this for six million dollars. On this one, I'm with Palmer Luckey and Brad Gerstner and others. And I think this has been pretty much corroborated by everyone I've talked to: that number should be debunked.

So first of all, it's very hard to validate a claim about how much money went into the training of this model. It's not something that we can empirically discover. But even if you accepted it at face value, that six million dollars was for the final training run. So when the media is hyping up these stories saying that this Chinese company did it for six million and these dumb American companies did it for a billion, it's not an apples-to-apples comparison. If you were to make the apples-to-apples comparison, you would need to compare the final training run cost of DeepSeek to that of OpenAI or Anthropic.

And what the founder of Anthropic said, and what I think Brad has said, being an investor in OpenAI and having talked to them, is that the final training run cost was more in the tens of millions of dollars, about nine or ten months ago. So it's not six million versus a billion. The billion-dollar number might include all the hardware they've bought over the years, a holistic number as opposed to the training number.

Yeah, it's not fair to compare, let's call it, a soup-to-nuts number, a fully loaded number, from American AI companies to the final training run by the Chinese company. But real quick, Sacks: you've got an open source model, and the white paper they put out is very specific about what they did to make it and the results they got out of it. I don't think they gave the training data, but you could start to stress test what they've already put out there and see if you can do it cheap, essentially.

Like I said, I think it is hard to validate the number. Let's just assume we give them credit for the six million number; my point is less that they couldn't have done it than that we need to be comparing like to like. Yeah. So if, for example, you're going to look at the fully loaded cost of what it took DeepSeek to get to this point, then you would need to look at what the R&D cost to date has been of all the models and all the experiments and all the training runs they've done, right? And the compute cluster that they surely have.

Dylan Patel, who's a leading semiconductor analyst, has estimated that DeepSeek has about 50,000 Hoppers. Specifically, he said they have about 10,000 H100s, 10,000 H800s, and 30,000 H20s. Now the cost of a, sorry, Sacks, is that DeepSeek, or DeepSeek plus the hedge fund? DeepSeek plus the hedge fund.

But it's the same founder, right? And by the way, that doesn't mean they did anything illegal, right? Because the H100s were banned under export controls in 2022, then they did the H800s in 2023. But this founder was very farsighted. He was very ahead of the curve, and through his hedge fund he was using AI to basically do algorithmic trading. So he bought these chips a while ago.

In any event, you add up the cost of a compute cluster with 50,000-plus Hoppers, and it's going to be over a billion dollars. So this idea that you've got this scrappy company that did it for only six million is just not true; they have a substantial compute cluster that they use to train their models. And frankly, that doesn't count any chips they might have beyond the 50,000 that they might have obtained in violation of export restrictions, which obviously they're not going to admit to. And we just don't really know the full extent of what they have. So I think it's worth pointing out that part of the story got overhyped.

It's hard to know what's fact and what's fiction. Everybody who's on the outside guessing has their own incentive, right? So if you're a semiconductor analyst who's effectively massively bullish on Nvidia, you want it to be true that it wasn't possible to train it on six million dollars.
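As a rough sanity check on the "over a billion dollars" figure, here's a back-of-envelope calculation using Dylan Patel's chip counts quoted above. The per-unit prices and the overhead multiplier are loose assumptions for illustration, not numbers from the episode:

```python
# Back-of-envelope cost of the estimated DeepSeek cluster.
# Chip counts are from Dylan Patel's estimate quoted above;
# unit prices (USD) are rough assumptions.
cluster = {
    "H100": (10_000, 30_000),   # (chip count, assumed unit price)
    "H800": (10_000, 30_000),
    "H20":  (30_000, 12_000),
}

hardware = sum(count * price for count, price in cluster.values())

# Networking, power, and datacenter buildout typically add a large
# fraction on top of the GPUs themselves; assume roughly 40% here.
fully_loaded = hardware * 1.4

print(f"GPUs alone:   ${hardware / 1e9:.2f}B")    # $0.96B
print(f"Fully loaded: ${fully_loaded / 1e9:.2f}B")  # $1.34B
```

Even with conservative prices, the GPUs alone approach a billion dollars, and a fully loaded cluster cost clears it, which is the apples-to-apples denominator the speakers are arguing for.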

Obviously, if you're the person that makes an alternative that's that disruptive, you want it to be true that it was trained on six million dollars. All of that, I think, is speculation. The thing that struck me was how different their approach was, and TK just mentioned this.

But if you dig into not just the original white paper from DeepSeek, but also the subsequent papers they've published that refined some of the details, I do think this is a case, Sacks, you can tell me if you disagree, where necessity was the mother of invention. So I'll give you two examples where I read these things and thought, man, these guys are really clever. The first is, let's set aside whether they distilled o1, which we can talk about in a second. But at the end of the day, these guys were like, well, how am I going to do this reinforcement learning thing? They invented a totally different algorithm. There was the orthodoxy, right? This thing called PPO that everybody used, and they were like, no, we're going to use something else called, I think it's called GRPO. It uses a lot less computer memory and it's highly performant. So maybe they were constrained, Sacks, practically speaking, by some amount of compute, and that caused them to find this, which you may not have found if you just had a total surplus of compute available.
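The memory saving mentioned above comes from GRPO dropping the separate learned value network (the critic) that PPO trains alongside the policy: advantages are instead computed relative to a group of sampled responses. A minimal sketch of that group-relative step, with invented reward numbers:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward
    against the mean and std of its own group, so no learned value
    network is needed to estimate a baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Four hypothetical completions of one prompt, scored by a reward model:
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
# Above-average responses get a positive advantage, below-average ones
# negative; the group itself plays the role PPO's value network would,
# which is where the memory saving comes from.
print([round(a, 2) for a in advantages])
```

This is only the advantage-estimation piece; the full GRPO objective still uses a PPO-style clipped policy-ratio loss plus a KL penalty against a reference model.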

And then the second thing that was crazy is that everybody is used to building models and compiling through CUDA, which is Nvidia's proprietary language, which I've said a couple of times is their biggest moat, but it's also the biggest vector for lock-in. And these guys worked totally around CUDA and used something called PTX, which goes right to the bare metal and is controllable; it's effectively like writing assembly. Now, the only reason I'm bringing these up is that we, meaning the West, with all the money that we've had, didn't come up with these ideas. And I think part of why we didn't is not that we're not smart enough, but that we weren't forced to, because the constraints didn't exist. And so I just wonder how we make sure we learn this principle. Meaning, when the AI company wakes up and rolls out of bed and some VC gives them $200 million, maybe that's not the right answer for a Series A or a seed. And maybe the right answer is $2 million, so that they do these DeepSeek-like innovations. And constraint makes for great art. What do you think, Friedberg, when you're looking at this?

Well, I think it also enables a new class of investment opportunity. Given the low cost and the speed, it really highlights that maybe the opportunity to create value doesn't sit at that level in the value chain, but further upstream. Balaji made a comment on Twitter today that was pretty funny about the wrapper. He's like, turns out the wrapper may be the moat, which is true at the end of the day. If model performance continues to improve, gets cheaper, and it's so competitive that it commoditizes much faster than anyone even thought, then the value is going to be created somewhere else in the value chain. Maybe it's not the wrapper; maybe it's with the user. And maybe, by the way, here's an important point: maybe it's further out in the economy. You know, when electricity production took off in the United States, it's not like the companies making all the electricity made a lot of money. It's the rest of the economy that accrued a lot of the value.