首页 >> 来自播客: Dwarkesh Patel 更新反馈

How Does Claude 4 Think? – Sholto Douglas & Trenton Bricken

发布时间 2025-05-22 21:06:29 来源

Okay, I'm joined again by my friends, Shulta Brickin, wait. What? Did he just laugh? You did just shimmy. No, you named us differently. But we didn't have Shulta Brickin and Trenton Brickin. Shulta Douglas and Trenton Brickin, who are now both at Anthropic. Yeah. Shulta. Oh, what? Shulta is scaling RL, Trenton's still working on Mechanistic Interpreability. Welcome back. Happy video. Yeah, it's fun. What's changed since last year? We talked basically this month in 2024. Yeah. Now we're in 2025. What's happened?

好的，我和我的朋友们又聚在了一起，舒尔塔·布里金，等等。什么？他刚才笑了吗？你刚才扭了一下。不是，你给我们起了别的名字。我们没有舒尔塔·布里金和特伦顿·布里金。是舒尔塔·道格拉斯和特伦顿·布里金，他们现在都在Anthropic工作。是的。舒尔塔。哦，什么？舒尔塔在扩展强化学习，特伦顿仍在研究机制可解释性。欢迎回来。开心的视频。是的，很有趣。自去年以来有什么变化？我们基本上是在2024年的这个月聊的。是的。现在是2025年。发生了什么？

Okay, so I think the biggest thing that's changed is RL and language models has finally worked. And this is manifested in, we finally have proof of an algorithm that can give us expert, human, reliability and performance, given the right feedback loop. And so I think this is only really being like conclusively demonstrated in competitive programming and math, basically. And so if you think of these two axes, one is the intellectual complexity of the task, and the other is the time horizon of which the task is being completed on. And I think we have proof that we can reach the peaks of intellectual complexity along many dimensions.

好的，我认为最大的变化是强化学习（RL）和语言模型终于成功了。这体现在，我们终于有了一个可以在适当的反馈循环下，提供专家级、人类可靠性和性能的算法的证明。我觉得这一点只在竞争性编程和数学领域得到了确凿的证明。可以想象这两条轴线，一条是任务的智力复杂性，另一条是完成任务的时间范围。我认为我们有证据表明，在很多方面，我们都可以达到智力复杂性的顶峰。

But we haven't yet demonstrated like long running, agentic performance. And you're seeing the first stumbling steps of that now and should see much more conclusive evidence of that, basically, by the end of the year, with real software engineering agents doing real work. I think Trenton, you're like experimenting with this at the moment. Yeah, absolutely. I mean, the most public example people could go to today is CloudPlace Pokemon. Right. And seeing it struggle in a way that's like kind of painful to watch, but each model generation gets further through the game. And it seems more like a limitation of it being able to use memory system than anything else.

我们还没有展示出长期运行的智能代理性能。现在你可以看到这是它的初步尝试，并且在年底前你应该能看到更为确凿的证据，真正的软件工程代理在执行实际工作。我认为Trenton，你现在正在试验这一点吧？是的，绝对是这样的。目前人们能看到的最公开的例子就是CloudPlace的Pokemon。看到它在努力但痛苦地进展，而每一代模型都能在游戏中走得更远。这看起来更像是它使用记忆系统的限制，而不是其他因素。

Yeah. I wish we had recorded predictions last year. We definitely should this year. Yeah, hold us accountable. Yeah, that's right. Would you have said that agents would be only this powerful as of last year? I think this is roughly on track for where I expected with software engineering. I think I expected them to be a little bit better at computer use. Yeah. But I understand all the reasons for why that is. And I think that's like well on track to be solved. It's just like a sort of temporary lapse.

是啊。我希望我们去年记录过预测。今年我们绝对应该这么做。是啊，让我们负责。没错。你去年会说这些人工智能会有如此大的能力吗？我觉得这大致符合我对软件工程方面的预期。我原本以为它们在计算机使用方面会更好一点。不过我明白这其中的所有原因，我觉得这个问题很快就会解决。这只是一个临时的停滞。

And holding the account of my predictions next year, like I really do think end of this year, sort of like this time next year. We have software engineering agents that can do close to a day's worth of work, like for like a junior engineer, or like a couple of hours of like quite competent independent work. Yeah, that seems right to me. I think the distribution's pretty wonky though. Yes. For some tasks, I don't know, like boilerplate, website code, these sorts of things. In Spain, right? It can bang it out and save you a whole day.

根据我的预测，到明年年底，我真的认为，到明年的这个时候，我们会有软件工程代理可以完成接近一天的工作量，就像一个初级工程师一样，或者能独立完成几个小时的高效工作。是的，这在我看来是合理的。不过我觉得，这种能力的分布并不平衡。对于某些任务，比如模板代码、网站代码之类的工作，在西班牙的话，软件工程代理可以很快完成，并为你节省整整一天的时间。

Yeah, exactly. Yeah, I think that's right. I think last year you said that the thing that was holding them back was the extra nine's reliability. I don't know if that's the way you would still describe the way in which these software agents aren't able to do a full day of work, but are able to help you out with a couple of minutes. Is it the extra nine's that's really stopping you or is it something else?

是的，没错。对，我认为你说得对。我记得去年你曾说过，他们的问题在于“多一个9”的可靠性。我不确定你现在是否仍然会用这种方式来描述这些软件代理无法全天工作的原因，但它们可以在几分钟内帮助你。那么，真正阻碍你的原因还是“多一个9”的问题，还是其他的因素呢？

Yeah, I think my description there was, I think like in retrospect, probably not what's limiting them. I think what we're seeing now is closer to lack of context, lack of ability to like do the complex, like very multi-file changes and like sort of like, maybe like scope or of the change or scope of like the task in some respects, they can cope with high intellectual complexity and like a focused context with a hot, with a really like scope problem.

是的，我认为我之前的描述，回过头来看，可能并不是他们的限制所在。我觉得我们现在看到的问题更接近于缺乏背景信息，缺少处理复杂任务的能力，比如需要同时处理多个文件的变更，还有可能在某种程度上是变更的范围或任务的范围问题。他们可以应对高智力复杂性以及一些集中的背景问题，但在面对真正范围广泛的问题时，会遇到困难。

But when something's a bit more and more first or requires a lot of discovery and iteration with the environment and this kind of stuff, they struggle more. Yeah. And so maybe the way I would define it now is the thing that's holding them back is, if you can give it a good feedback loop for the thing that you want it to do, then it's good, it's pretty good. If you can't, then they struggle a bit.

当某件事情变得越来越复杂，或者需要大量的探索和环境的反复试验时，他们会更加困难。嗯，所以也许我现在的定义是，阻碍他们进展的因素是，如果你能为他们需要完成的任务提供一个良好的反馈机制，他们表现得非常好。如果做不到，他们就会有些挣扎。

Can you, and for the audience, can you say more about what you mean by this loop? Yeah, if they're not aware of what's happening in RL and so forth. Yes. So the big thing that really worked last year is, maybe like broadly the domain is called like RL from verifiable rewards or something like this, where a clean rewards. So the initial unhopalling of language models was RL from human feedback, where typically it was something like hair-wise feedback or something like this and the outputs of the models became closer and closer to things that humans wanted.

可以为你和观众详细解释一下你所说的这个循环吗？好，如果他们不太清楚强化学习（RL）以及相关内容的话。是的。去年一个非常成功的方法可以被广泛称为“从可验证奖励中进行强化学习”之类的领域，涉及到一些明确的奖励。最初对语言模型的调整是通过人类反馈中的强化学习，通常是类似于等级评估的反馈方式，这样模型的输出就越来越接近人类所期望的结果。

But this doesn't necessarily improve their performance at any like difficulty or problem domain, right? Particularly as humans are actually quite bad judges of what a better answer is. Humans have things like length biases and so forth. So you need a signal of whether the model was correct in its output that is like quite true, let's say. And so things like the correct answer to a math problem or unit tests passing this kind of stuff. These are the examples of reward signals that's very clean, but even these can be hacked, by the way. Like even unit tests, the models find ways around it to like hacking particular values and hard code values of unit tests if they can figure out like what the actual test is doing, like they can like look at the cache path and files and find what the actual test is, they'll try and hack their way around it. So these aren't perfect, but they're much closer.

这并不一定会提高他们在任何困难或问题领域的表现，对吧？特别是因为人类其实并不擅长判断哪个答案更好。人类会有诸如偏爱较长答案之类的偏见。因此，需要一个相对真实的信号来判断模型的输出是否正确。比如，数学问题的正确答案或通过单元测试的结果。这些是非常明确的奖励信号的例子，但即便是这些也可能被“破解”。比如，模型可能会找到办法绕过单元测试，通过硬编码测试中的特定值或修改测试值。如果它们能弄清楚实际测试在做什么，比如查看缓存路径和文件，找到实际测试内容，它们就会尽量破解测试。因此，虽然这些方法并不完美，但要可靠得多。

And why is it getting so much better at software engineering than everything else? In part because software engineering is very verifiable, like it's a domain which just naturally lends it to this way. I think does the code pass a test? Does it even run? Does it compile? Yeah, does it compile? Does it pass the test? You know, you can go on the code and you can run tests and like you know whether or not you got the right answer. Yeah. But there isn't the same kind of thing for like writing a great essay. That requires like the question of like taste in that regard is quite hard. Like we discussed the other night at dinner, the Pulitzer Prize, like you know, which would come first like a Pulitzer Prize winning novel or like you know a Nobel Prize or something like this. And I actually think a Nobel Prize is more likely than a Pulitzer Prize winning novel in some respects.

为什么软件工程在进步得如此之快？部分原因是因为软件工程是一个非常容易验证的领域。软件工程天生就适合这种方式。我觉得问题在于代码是否通过了测试？它能跑起来吗？它能编译吗？是的，它能编译吗？它通过测试了吗？你可以查看代码并运行测试，这样你就可以知道是否得到了正确的答案。而写一篇优秀的文章就没有这样的明确标准。这涉及到品味的问题，这方面是相当困难的。就像我们那天晚上在晚餐时讨论的普利策奖一样，哪个会先出现，比如一本普利策奖获奖小说，或者像诺贝尔奖这样的。我实际上认为在某些方面，诺贝尔奖的可能性比普利策奖获奖小说更大。

But there's a lot of the tasks required in winning a Nobel Prize or at least like strongly assisting and helping win a Nobel Prize have more like layers of verifiability built up. So I expect them to like accelerate the process of doing Nobel Prize winning work more initially than that of like writing Pulitzer Prize with the novels. Yeah, I think if we rewind 14 months to when we recorded last time, the ninth of a reliability was right to me. Like we didn't have cloud code. We didn't have deep research. All we did was use agents in a chatbot format. Right. Puppy-faced, copy-faced, copy-faced. Totally.

翻译：但赢得诺贝尔奖所需的许多任务，或者至少是强有力地协助和帮助赢得诺贝尔奖的任务，通常有更多的可验证性层次。所以，我预计他们会更快速地促进那些诺贝尔奖工作的进展，而不是写普利策奖小说的进展。是的，我想如果我们回到14个月前，也就是我们上次录音时的情况，当时可靠性的排名对我来说是正确的。我们没有云代码，也没有深入研究。我们所有的只是使用代理在一个聊天机器人格式中。完全是的。

And it's I think we're very used to chat interfaces whether we're texting or using Google. And it's weird to think that the agent can actually go and fetch its own context and store its own facts into its memory system. And I still think that it's the ninth's reliability. And if you scaffold the model correctly or prompt it, it can do much more sophisticated things than the average user assumes. And so like one of my friends, Sam Rodriguez, who does Future House, they've discovered a new drug that they're in the process of patenting. And by the time this episode comes, LSDV2, that will be large.

我觉得我们已经很习惯使用聊天界面了，不管是发短信还是用谷歌搜索。想到一个人工智能代理能够自己获取上下文并存储信息到它的记忆系统里，这有点奇怪。我仍然认为这方面的不确定性是很大的。如果你正确地搭建模型或者提示，它可以做出比普通用户想象中更加复杂的事情。比如我的一个朋友，Sam Rodriguez，他在Future House工作，他们发现了一种新药，目前正在申请专利。当这一集播出的时候，LSDV2将会变得广为人知。

Well, is that LSDV2? Is that really? No, I'm not trying to. They're not making LSD. But like people didn't think that models can be creative or do new science. Right. And it does just kind of seem like a skill issue. I mean, there was the cool. Wait, wait, wait, wait, wait, wait, it discovered a drug. Is it like how did it like it did? Like I think it one shot at the long-term. So this was just over a conversation. And so we'll need to refer to the full announcement. But my impression is that it was able to read a huge amount of medical literature and brainstorm and make new connections and then propose wet lab experiments that the humans did.

那是LSDV2吗？真的？不，我不是在开玩笑。他们不是在制造LSD。但人们曾经认为模型无法具有创造力或进行新的科学研究。然而，现在看起来这似乎是个技能问题。等等，等等，它发现了一种药物？它是怎么做到的？感觉就像它在一瞬间搞定了长期任务。这都是在一次对话中发生的，所以我们需要查看完整的公告。不过，我的印象是，它能够阅读大量医学文献，进行头脑风暴，建立新的联系，然后提出由人类在实验室里进行的实验。

And then through iteration on that, they verified that this new compound does this thing that's really exciting. Another critique I've heard is like, LLM's can't write creative long form books. And I'm aware of at least two individuals who probably want to remain anonymous, who have used LLM's to write long form books. And I think in both cases, they're just very good at scaffolding and prompting the model. I mean, even with the viral chat GPT geogessor capabilities, where it's just insanely good at spotting like what beach you were on from a photo. Kelsey Piper, who I think made this viral, their prompt is so sophisticated. It's really long. And it encourages you to think of five different hypotheses and assign probabilities to them and reason through the different aspects of the image that matter.

然后通过不断的迭代，他们证实这种新化合物确实产生了令人振奋的效果。我听到的另一个批评是，有人认为大语言模型（LLM）不能写出有创意的长篇书籍。但据我所知，至少有两个人可能希望保持匿名，他们已经成功利用大语言模型写出了长篇书籍。我认为在这两种情况下，他们都非常擅长为模型设置框架和提示。即便是像那个风靡一时的ChatGPT猜地点功能，它在根据照片准确识别你所在的海滩方面表现得极为出色。Kelsey Piper，我认为是她让这个功能走红的，她的提示非常复杂且冗长，鼓励你想出五种不同的假设，并为它们分配概率，同时分析图像中重要的不同方面。

And I haven't AB tested it, but I think unless you really encourage the model to be this thoughtful, you wouldn't get the level of performance that you see with that ability. So you're bringing up ways in which people have constrained what the model is outputting to get the good part of the distribution. But one of the critiques I've heard of RL, or the not of RL, but one of the critiques I've heard about using the success of models like O3 to suggest that we're getting new capabilities from these reasoning models is that all these capabilities were already baked in the pre-training model.

我还没有做AB测试，但我认为除非你真的鼓励模型变得非常周全，否则你无法达到如你所见的那种高性能水平。你提到了一些方法，说明人们是如何限制模型输出，以获得理想结果的分布。我听到对强化学习（RL）的批评，或者说批评不在于RL本身，而是在于通过O3这样模型的成功暗示我们从这些推理模型中获得了新能力的说法。有一种观点认为，这些能力其实早已在预训练模型中存在。

I think there's a paper from Stinks-Wallah University where they showed that if you give a base model enough tries to answer a question, it can still answer the question as well as the reasoning model. Basically, it just has a lower probability of answering. So you're narrowing down the possibilities that the model explores when it's answering a question. So are we actually eliciting new capabilities with this RL training, or are we just putting the blinders on them? Right, like carving away the models on this. I think it's worth noting that paper was impretchable in the llama and quen models.

我记得有一篇来自Stinks-Wallah大学的论文表明，如果给一个基础模型足够多的机会来回答问题，它仍然可以和推理模型一样回答得好。基本上，基础模型只是回答正确的概率较低。所以，我们是在缩小模型回答问题时探索的可能性。那么，我们通过这种强化学习训练，实际上是在引出新的能力吗？还是仅仅是在限定它们的视野？我觉得值得注意的是，那篇论文在llama和quen模型上是不可攀越的。

And I'm not sure how much RL compute they used, but I don't think it was anywhere comparable to the amount of compute that was used in the base models. And so I think the amount of compute that you use in training is a decent proxy for the amount of actual raw new knowledge or capabilities you're adding to a model. So my prior at least, if you look at all of DeepMind's research from RL before, RL was able to teach these go-and-chest playing agents new knowledge that were in excess of human level performance just from RL signal provided the RL signal was sufficiently clean.

我不太确定他们在强化学习（RL）中使用了多少计算量，但我认为和基础模型中使用的计算量是无法相比的。因此，我认为训练中使用的计算量可以大致反映出你为模型增加了多少新的知识或能力。根据我的直观判断，如果你查看DeepMind之前所有关于强化学习的研究，只要RL信号足够清晰，强化学习就能够让这些下围棋和国际象棋的智能体获得超过人类水平的新知识。

So there's nothing structurally limiting about the algorithm here. They're like, prevents it from imbuing the neural net with new knowledge. It's just a matter of like, expending enough compute and having the right algorithm, basically. Why aren't you already spending more compute on this? I think Garius had in his blog post that, or it was like a couple of months ago, the actual control thing is like, deep seek, whatever. They were only spending $1 million on RL or something.

这个算法在结构上没有什么限制。他们的意思是，并没有任何东西阻止它将新的知识赋予神经网络。基本上，只是需要投入足够的计算资源并使用合适的算法。那么，为什么现在不在这方面投入更多的计算资源呢？我记得Garius在他的博客文章中提到过，或者是几个月前，实际控制的事情是类似于“深度搜索”之类的东西。他们在强化学习（RL）上只花了大约100万美元。

So it's like, we aren't in the compute limited regime for RL yet, but we will be soon. You're spending hundreds of millions on the base model, why only order a million on the RL? You know that the parable about when you choose the launch space mission, how you should acquire, go further up the tech tree, because if you launch later on, you'll just ship will go faster in this kind of stuff. I think it's quite similar to that.

所以，情况是这样的：我们现在还没有在强化学习（RL）中进入计算资源受限的阶段，但很快就会了。你在基础模型上花费了数亿美元，为什么只给强化学习投入一百万？这就像那个故事，当你选择发射太空任务时，你应该更深入地发展技术树，因为如果你晚点发射，飞船会航行得更快。我觉得这跟那个非常相似。

You want to be sure that you algorithmically got the right thing, and then when you bet and you do the large compute spend on the run, then it'll actually pay off without the right compute efficiencies in this kind of stuff. Now I think like RL is slightly different to pre-training in this regard, where RL can be a more iterative thing that you're progressively adding, capabilities to the base model, pre-training has, you know, in many respects, like if you're halfway through a run and you've messed it up, then like you've really like messed it up.

你希望能够通过算法确保自己得到了正确的结果，然后在进行大量计算投入时，这种投入将真正有所回报，而不会因为计算效率不高而出现问题。我认为在这方面，强化学习（RL）与预训练有些不同。强化学习可以是一个更具迭代性的过程，让你逐步为基础模型增加功能。而预训练呢，从很多方面来说，如果在训练过程中途出了问题，那就真的是搞砸了。

But I think that's what that's like the main reason. Why? We're still figuring out exactly what they want to do. I mean, 01 to 03, right? Like opening up, putting their blog post there, there was a 10X compute multiplier over 01. So like clearly they bet on one level of compute, and they were like, okay, this seems good, let's actually release it, let's get it out there.

但我觉得那就是主要原因。为什么呢？我们仍在弄清楚他们到底想做什么。我的意思是，从01到03，对吧？像是开放、在他们的博客上发布文章，相比于01，计算能力增加了10倍。所以很明显，他们在某个计算层次上下了赌注，然后觉得这还不错，就决定发布，让它面世。

And then they spent the next few months, like you know, increasingly amount of compute that they spent on that. And I expect as everyone is, everyone else is scaling up RL right now. So I basically don't expect that to be true for the, for the line. Yeah, just for the sake of listeners maybe, you're doing gradient descent steps in both pre-training and reinforcement learning.

然后，他们在接下来的几个月里花费了越来越多的计算资源。我预计现在大家都在扩大强化学习的规模。所以我基本上不认为这对这个领域来说是真的。为了听众的理解，你在预训练和强化学习中都进行梯度下降步骤。

It's just the signals different, typically in reinforcement learning, your reward is sparser. So you take multiple turns, it's like did you win the chess game or not? Is the only signal you're getting? And often you can't compute gradients through discrete actions. And so you end up losing a lot of gradient signal. And so you can presume that pre-training is more efficient, but there's no reason why you couldn't learn new abilities in reinforcement learning.

这只是信号的不同，通常在强化学习中，奖励信号往往比较稀疏。比如在下棋时，可能需要走很多步才知道自己是否赢了，这就是你能得到的唯一信号。而且，通常无法通过离散动作计算梯度，因此会丢失很多梯度信号。因此，可以认为预训练更加高效，但这并不意味着你不能在强化学习中学到新的能力。

In fact, you could replace the whole next token prediction task in pre-training with some weird RL variant of it. And then do all of your learning with RL. Yeah. Yeah, at the end of the day, just signal and then correcting to it totally. And then going back to the paper you mentioned, aside from the caveats that the Chalto brings up, which I think is the first order, most important, I think zeroing in on the probability space of like meaningful actions comes back to the 9s of reliability. And like, classically, if you give monkeys a typewriter eventually, they'll write Shakespeare, right? And so the action space for any of these real world tasks that we care about is so large that you really do care about getting the model to zero in on doing the reasonable things.

事实上，你可以用某种奇怪的强化学习（RL）变体来替代预训练中的下一个词预测任务。然后，你可以用强化学习进行所有的学习。对，最终只是接收信号并进行完全的校正。然后回到你提到的论文，除了Chalto提到的一些重要的警示外，我认为最关键的部分是要聚焦于有意义行动的概率空间，这关系到可靠性的高标准。经典的说法是，如果给猴子一台打字机，他们最终会写出莎士比亚的作品。同样地，对于我们关心的任何现实世界任务，其行动空间如此之大，因此我们确实需要关注模型在合理行为上的表现。

And to the extent, in some broad sense, to the extent that at some pass of K, you've got token space. Exactly. You literally do have a monkey and it's making Shakespeare in there. Yeah, exactly. OK, so the chest analogy is interesting. So we would say something. So I was just going to say, you do need to be able to get reward sometimes in order to learn. And that's like the complexity in cybertext. In the alpha variance, I mean, maybe you're about to say this, one player always wins. So you always get a reward scene on one way or the other. But in the kinds of things we're talking about, you need to actually succeed at your task sometimes.

在某种广泛的意义上，当你达到某个K值时，你已经经过了一定的"令牌空间"。确实如此。你就像一只猴子在创造莎士比亚的作品。是的，没错。好的，那个"棋局"的类比很有趣。所以我们会说一些事情。我刚想说，你确实需要有时能获得奖励才能学习。这就像网络文本中的复杂性。在Alpha变体中，或许你要说的是，一个玩家总会赢，所以无论如何你总能看到奖励。但在我们讨论的这些事情中，你实际上需要有时在任务中取得成功。

So language models, luckily, have this wonderful prior of the task that we care about. So if you look at all the old papers from 2017, the reward, the learning curves always look like flat, flat, flat, flat, flat, as they're figuring out basic mechanics of the world. And then there's this spike up as they learn to exploit the easy rewards. And then it's almost like a seed moan in some respects. And then it sort of continues on indefinitely. Is it just learns to absolutely maximize the game? And I think the LLM curves look a bit different in the isn't that dead zone at the beginning.

幸运的是，语言模型有一个与我们关注的任务相关的优秀先验知识。如果你看2017年的那些旧论文，学习曲线在开始时总是平坦的，因为模型在弄清楚世界的基本机制。然后，当它们学会利用简单的奖励时，曲线会突然上升。这有点像播种期的叹息。随后曲线会不断持续上升，因为模型学会最大化地利用游戏规则。我认为大语言模型（LLM）的学习曲线看起来有所不同，在开始时并没有那个“死区”。

Yeah, interesting. They already know how to solve some of the basic tasks. And so you get this initial spike. And that's what people are talking about when they're like, oh, you can learn from one example. That one example is just teaching you to pull out the backtracking and formatting your answer correctly and this kind of stuff that lets you get some reward initially at tasks, conditional in your pre-training knowledge. And then the rest probably is you learning more and more complex stuff.

好的，有趣的是，他们已经知道如何解决一些基础任务。这就是为什么一开始会有一个显著的提高。这也是人们在说“你可以从一个例子中学习”时的意思。这个例子只是教你如何找出其中的线索，并正确地格式化你的答案，以及一些帮助你初步完成任务的小技巧，前提是你已有的知识。接下来，你学习的可能就是越来越复杂的内容了。

Yeah, yeah. It would also be interesting. I know people have critiqued or been skeptical of RL's or RRN Quickwinds by pointing out that AlphaGo took a lot of compute, especially for a system trained in what was it, 2017? Yes, it's like off the cuff. Totally. Yeah, right. So to the extent that was largely because first you had to have something which had some biases which were sort of rational before it got superhuman to go. I actually would be interesting to see what fraction of the compute used in AlphaGo was just getting something reasonable.

是的，是的。这也会很有意思。我知道，有些人批评或对强化学习（RL）或者快速增强网络（RRN Quickwinds）持怀疑态度，指出AlphaGo用了大量的计算资源，尤其是对于一个在2017年训练的系统来说。是的，完全是即兴发挥。没错。所以在一定程度上，这主要是因为首先需要有一些合理的偏见，才能让系统在围棋上变得超级强大。我实际上很想知道，AlphaGo所用的计算资源中，有多大一部分只是用来实现基础合理水平的。

Yes, yeah, it would be interesting. I mean, to make the map from pre-training to RR really explicit here, during pre-training the large language model is predicting the next token of its vocabulary of let's say I don't know, 50,000 tokens. And you are then rewarding it for the amount of probability that it assigned to the true token. And so you could think of it as a reward, but it's a very dense reward where you're getting signal at every single token and you're always getting some signal. Even if it only assigned 1% to that token or less, you're like, oh, I see you assigned 1% good job keeping doing that. Upweight it.

是啊，是的，这会很有趣。我的意思是，为了在这里清楚地展示从预训练到奖励重整（RR）的过程，在预训练期间，大型语言模型会预测它的词汇表中的下一个词，比如说可能有5万个词。然后，你会根据它给真实词汇分配的概率来奖励它。你可以把这个看作是一种奖励，但这是一种非常密集的奖励，因为你在每一个词上都能获得反馈。不管它只为那个词分配了1%的概率或者更少，你都会觉得，“哦，我看到你分配了1%，做得好，继续这样做。”然后给这部分的权重增加一点。

Yeah, exactly. Exactly. Like a tug in the green. That's right. So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different from if you're trying to do a math problem and you fail, it's actually even more useful often than learning about math and the abstracts because, oh, you don't think so. Yeah, almost like feedback.

是的，完全正确。就像在绿色环境中的拉扯。没错。所以，当我思考人类学习的方式时，这些模型从失败中得不到任何信号似乎和人类学习很不同。因为如果你在尝试解决数学问题时失败了，这实际上常常比学习数学理论更有用，因为，哦，你不这么认为吗？这就像反馈一样。

Yeah, only if you get feedback. But I think there's a way in which you actually give yourself feedback. You're like, you fail and you notice where you failed. Only if you get feedback at times. Yeah. People have like figured out new math, right? And they've done it by the fact that like they get stuck somewhere. They're like, why am I getting stuck here? Like, let me think through this. Whereas in the example, I mean, I'm not aware of what's like at the frontier, but like looking at open source, like implementations from deep secrets, something, there's not this like conscious process by which once you have failed, do you like learn from the particular way in which you failed to then like, backtrack and do your next things better.

是的，只有当你得到反馈的时候。但我认为实际上你可以给自己反馈。比如你失败了，然后注意到自己的失败点。虽然只有在某些时候你才能得到反馈。对，有些人已经想出了新的数学方法，他们通过被卡住的地方来解决问题。他们会想：为什么我会在这里卡住呢？让我仔细思考一下。而在这个例子中，我不太清楚前沿领域是什么样的，但看着一些开源实现，比如深度学习的秘密算法，这并不是一种有意识的过程，其中一旦失败，你就从特定的失败中学习，然后回顾并在下一次做得更好。

Just like pure grading to send. And I wonder if that's a big limitation. I don't know. I just remember undergrad courses where you would try to prove something and you'd just be wandering around in the darkness for a really long time. And then maybe you totally throw your hands up in the air. I need to go and talk to a TA. And it's only when you talk to a TA, can you see where along the path of different solutions you were incorrect and like what the correct thing to have done would have been? And that's in the case where you know what the final answer is. In other cases, if you're just kind of shooting blind and meant to give an answer to Denouvo, it's really hard to learn anything.

就像单纯的打分一样。我想知道这是否是一个很大的限制。我不太确定。我只记得在本科课程中，我们试图证明某个东西的时候，经常像在黑暗中摸索了很长时间，然后可能会完全放弃，觉得需要去找助教(TA)聊聊。只有在与助教交流后，你才能看清在解决问题的过程中哪里出错了，以及正确的做法是什么。而这还是在你知道最终答案的情况下。在其他情况下，如果你只是盲目尝试并需要给出一个答案，就很难学到什么东西。

I guess I'm trying to map on again to the human example where like in more simpler terms, there is this sort of conscious intermediary, like auxiliary loss that we're like optimizing. And it's like a very sort of self-conscious process of getting, forget about math. It's just like if you're on your job, you're getting very explicit feedback from your boss. That's not necessarily how the task should be done differently, but a high level explanation of what you did wrong, which you update on not in the way that pre-training updates waste, but more in the...

我想我是在试图将其与人类的例子进行对比，用更简单的术语来说，就是存在一种类似"额外损失"的有意识的中介过程，我们在优化这个过程。这就像是一种非常自觉的过程，且与数学无关。就好比在工作中，你从老板那里得到非常明确的反馈，但这些反馈并不一定是关于任务应该如何不同地完成，而是对你出了什么问题的高层次解释。这种反馈就像是一种更新，不是像训练模型那样更新参数，而是更像......

I don't know. I think there's a lot of implicit dense reward signals here. Exactly. Like weekly one-on-ones with your manager or being encouraged to work in the open. Or even with homework assignments, right? They're so scaffolded. It's always 10 questions broken down into sub-components. It may be the hardest possible problem is one where you need to do everything on your own. Okay, so then a big question is, do you need to build these scaffolds, these structures, these bespoke environments for every single skill that you want the model to understand, and then it's gonna be a decade of grinding through these sub-skills? Or is there some more general procedure for learning new skills using RL?

我不知道。我觉得这里有很多隐含的密集奖励信号。没错，就像每周一次的与经理的一对一会谈，或者被鼓励在开放环境中工作。甚至作业也是这样，作业被分成几个部分来完成，总是10个问题被分成若干子部分。也许最难的问题就是那些需要你独立完成的。好了，那么一个大的问题是，你是否需要为每一个你希望模型理解的技能建立这些支架、这些结构、这些定制的环境，然后花上十年去磨练这些子技能？或者说，是否有某种更通用的程序，可以通过强化学习来掌握新技能？

Yeah. So it's an efficiency question there. Obviously, if you could give it dense reward for every token, if you had a supervised example, then that's one of the best things you could have. But in many cases, it's very expensive to produce all of those scaffolded curriculum of everything to do. Having PhD math students, grade students, is something which you can only afford for the select cadre of students that you've chosen to focus in on developing. And you couldn't do that for all the language models in the world.

是的，这是一个效率问题。显然，如果你能为每个标记提供详细的奖励，特别是在有监督样本的情况下，那将是最理想的情况之一。但在很多情况下，制作这样一个涵盖所有步骤的详尽课程是非常昂贵的。让博士生为学生评分是一种只能针对你选中的、专注于培养的特定学生群体才能负担得起的方法。而你不可能对世界上所有的语言模型都这样做。

So like, first step is obviously that would be better, but you're gonna be sort of optimizing this prior frontier of how much am I willing to spend on the scaffolding, versus how much am I willing to spend on pure compute. Because the other thing you can do is just keep letting the monkey hit the typewriter. And if you have a good enough end reward, then eventually it will find its way. And so like, I can't really talk about where exactly people sit on that scaffold. I think like people, different tasks are like on different points there.

所以，第一步显然是更好的选择，但你会需要在投入到支撑性工作（例如结构或框架）上花费多少与投入到纯计算上花费多少之间进行平衡。因为你还可以做的另一件事就是继续让猴子在打字机上乱打。如果最终的回报足够好，那么最终它会找到正确的路径。所以，我无法准确地说人们在这方面的投入是多少。我想，不同的人和任务在这方面的倾向是不一样的。

And a lot of it depends on how strong your prior over the correct things to do is. But that's the equation you're optimizing. It's like how much am I willing to burn compute, versus how much am I willing to burn like dollars on people's time to give scaffolding or give the... Interesting. Yeah. I'm not willing to do this for elements that we are for people. I would think that the economic logic would flow in the opposite direction. For the reason that you can amortize the cost of training any skill on a model across all the copies. We are willing to do this for elements like to some degree.

这很大程度上取决于你对正确做法的先验认知有多强。但这就是你正在优化的等式。这就像是在计算我愿意花费多少计算资源，与我愿意花费多少金钱在人员时间上来提供支持或结构指导。很有意思。是的，我不愿意为某些人做这件事，但我认为经济逻辑应该是反过来的，因为你可以将培训任何技能的成本分摊到模型的所有副本中。我们在某种程度上愿意为一些元素这样做。

But there's an equation you're maximizing here. Like, okay, I've like raised all this money. Do I spend along this axis, or do I spend along this axis? And currently the companies are spending more on compute than they are on humans. Otherwise, like, scale AI's revenue would be like, you know, $10 billion, or you'd be like, like in this, okay, look at it, like, Nvidia's revenue is much higher than scale AI's revenue. Right. And so, like, currently the equation is compute over data. And like, that will evolve in some way over time. But yeah.

但这里有一个你要最大化的公式。就像，好吧，我已经筹集了这么多资金。我是沿着这个方向花钱，还是沿着那个方向花钱？目前，公司在计算上的花费比在人力上的更多。否则，比如说，Scale AI的收入会像10亿美元那么多，或者你可以看看，Nvidia的收入明显高于Scale AI的收入。因此，目前的公式是计算能力优先于数据。这当然会随着时间而演变。

Interesting. Yeah, I'm curious how it evolves. Because if you think about the way the humans like learn to do a job. Yeah. They get deployed and they just like do the job and they learn. Whereas the way these models seem to be trained is that for every skill you have to like give them a sort of like very bespoke environment or something. If they were trained, the way humans are trained. Like on the job. Yeah, exactly. They know it would actually be super powerful because like, everybody has a different job, but then the same model could agglomerate like all the skills that you're getting.

有趣。我很好奇这个过程会如何发展。因为如果你考虑一下人类学习工作的方式，他们是被投入到工作中，通过实际操作来学习。而这些模型的训练方式似乎是为每项技能都需要提供一个特别定制的环境。如果这些模型像人类一样在工作中学习，那将会非常强大。因为虽然每个人的工作不同，但同一个模型可以聚合你获得的所有技能。

I don't have anything like doing the podcast for the last few years. I'm like becoming a better podcaster. Yes. You have a slightly more valuable skill of doing AI research. I'm not a... But you could imagine a model like do both things. because it's like doing both of our jobs. Like copies of the model are doing both jobs. And so it seems like more bitter a lesson aligned to do this. Like just let the model like learn out in the world rather than, you know, like, you're spending billions on getting data for a particular task.

在过去的几年中，我没有做过其他事情能够与做播客相比。我感觉自己在逐渐成为一个更好的播客主持人。不过，你拥有从事人工智能研究这样更有价值的技能。尽管我不是这方面的专家，但可以想象一个模型同时执行这两项任务，因为它相当于在做我们两个人的工作。这样看来，可以更轻松地让模型去完成这种任务，比如让模型在现实世界中学习，而不是花费数十亿去获取特定任务的数据。

So I think again, we take for granted how much we need to show humans how to do specific tasks and there's like a failure to generalize here. Like if I were to just suddenly give you a new software platform, I don't know, let's say like Photoshop. And I'm like, okay, edit this photo. If you've never used Photoshop before, it'd be really hard to navigate. And I think you'd immediately want to go online and watch a demo of someone else doing it in order to then be able to imitate that. That amount of data on every single task, surely.

所以我再次想到，我们往往低估了教人们如何完成特定任务的重要性，这里存在一种无法概括的现象。比如，如果我突然让你使用一个新的软件平台，比如Photoshop，并要求你编辑一张照片。如果你以前从未使用过Photoshop，你会发现操作非常困难。我想你可能立刻就会想要上网观看别人操作的演示，以便模仿这种操作方法。对每一个任务来说，我们确实需要这样的学习过程。

Okay. So this is the first thing. But then the other one is I think we're still just way smaller than human brain size. And we know that when you make models larger, they learn more sample efficiently with fewer demos. And like it was striking where even in your recent podcast with Mark Zuckerberg and Lama, it's like a two trillion perimeter model. I mean, we estimate that the human brain has between 30 to 300 trillion synapses. And I don't know exactly how to do a mapping from one to the other here. But I think it's useful background context that I think it's quite likely we're still smaller than the human brain.

好的。这是第一点。但另一方面，我认为我们仍然远小于人类大脑的规模。我们知道，当模型变得更大时，它们可以通过更少的示例更高效地学习。而且，即便是在你最近与马克·扎克伯格和Lama的播客中也提到了一个两万亿参数的模型。我们估计人类大脑有大约30到300万亿个突触。我不太确定如何将这两者进行准确的对比，但我认为了解这个背景信息很有用，因为我觉得我们很可能仍然小于人类大脑。

And I mean, even with the 4.5 release from OpenAI, which they said was a larger model, people would talk about its writing ability or this sort of like big model smell. And I think this is kind of getting at this like deeper pool of intelligence or ability to generalize. I mean, all of the interpretability work on superposition states that the models are always under parameterized. And they're being forced to cram as much information as they possibly can. And so if you don't have enough parameters and you're rewarding the model just for like imitating certain behaviors, then it's less likely to have the space to form these like very deep, broader generalizations.

即使是OpenAI推出的4.5版本，据说是一个更大的模型，人们也常讨论其写作能力或“这种大型模型的特质”。我觉得这其实是在触及一种更深层次的智能水平或广泛的泛化能力。所有关于叠加态的可解释性研究都表明，这些模型总是参数不足的，它们被迫尽可能多地压缩信息。因此，如果参数不够，而只奖励模型模仿某些行为，那么模型就不太可能有空间形成那些非常深刻、广泛的泛化能力。

But even in light of all the languages, it's really cool. You should talk about the language result, you know, with how small models have formed separate, like have separate neurons for different languages, whereas larger models end up sharing more and more like an abstract space. So yeah, in the circuits work. I mean, even with the Golden Gate Bridge, and by the way, this is a cable from the Golden Gate Bridge. But the team, the entities stabilize the bridge in order to get this. But Claude will fix it. Claude loves the Golden Gate Bridge.

即便在考虑到所有语言的情况下，这还是很酷。你应该谈谈语言的结果，比如，小型模型针对不同语言形成了单独的神经元，而大型模型则越来越多地共享抽象空间。在电路研究中也是这样。即便是金门大桥也是如此，顺便说一句，这是一段来自金门大桥的电缆。团队和相关机构为了获得这个结果而稳定了大桥结构。但Claude会解决问题的，Claude热爱金门大桥。

So even with this, right, like for people who aren't familiar, we made Golden Gate Claude when we released our paper, Scaling Monosanticity, where one of the 30 million features was for the Golden Gate Bridge. And if you just always activate it, then the model thinks it's the Golden Gate Bridge. If you ask if you're chocolate chip cookies, it will tell you that you should use orange food coloring or like bring the cookies and eat them on the Golden Gate Bridge. All of these sort of associations.

所以，即使是这样，对那些不太熟悉的人来说，我们在发布论文《Scaling Monosanticity》的时候创作了一个叫做Golden Gate Claude的模型，其中包括了三千万个特征，其中一个特征是关于金门大桥的。如果你总是激活这个特征，模型就会觉得它是在金门大桥上。如果你问它关于巧克力曲奇饼的事，它会告诉你应该用橙色食用色素，或者建议你把饼干带到金门大桥上去吃。所有这些都是模型的联想。

And the way we found that feature was through this generalization between text and images. So I actually implemented the ability to like put images into our feature activations because this was all on Claude 3SANA, which was one of our first multimodal models. So we only trained the Sparrow Sorrow and Coder, and like the features on text. And then a friend on the team put in an image of the Golden Gate Bridge. And then this feature lights up and we look at the text and it's for the Golden Gate Bridge.

我们发现这个特性的方法是通过在文本和图像之间进行概括。实际上，我实现了将图像输入到我们特性激活的功能中，因为这一切都是在Claude 3SANA上进行的，这是我们最初的多模态模型之一。因此，我们只训练了Sparrow Sorrow和Coder，以及针对文本的特性。然后，团队中的一位朋友放入了一张金门大桥的图像，这个特性就被激活了，我们查看文本，发现它针对的正是金门大桥。

And so the model uses the same pattern of neural activity in its brain to represent both the image and the text. And our circuits work shows this again with across multiple languages. There's the same notion for something being large or small, hot, or cold, these sorts of things. But like strikingly, that is more so the case in larger models. We think like actually larger models have more space so they could like separate it things out more. But actually instead they seem to pull on these like larger abstracts. They on better abstractions.

模型在其脑中使用相同的神经活动模式来表示图像和文本。我们的电路工作再次证明了这一点，而且适用于多种语言。关于大小、冷热这些概念是相同的。但是，令人惊讶的是，这一点在更大的模型中更为明显。我们原以为较大的模型会有更多空间来区分不同的事物，但实际上，它们反而 seem to focus on 更大的抽象概念和更好的抽象表现。

Yeah. Which is very interesting. Yeah. Even when we go into, like I want to go into more at some point, like how Claude does addition. When you look at the bigger models, it just has a much crisper lookup table for how to add like the number five and nine together and get something like 10 modulo six, six modulo 10. Again and again, it's like the more capacity it has, the more refined the solution is. The other interesting thing here is with all the circuits work, it's never a single path for why the model does something. It's always multiple paths and some of them are deeper than others.

是啊，这很有趣。即使我们深入研究，比如我想在某个时候更深入地研究一下Claude是如何进行加法运算的。当你观察更大的模型时，它有一个更明确的查找表，用来计算类似于把数字5和9相加，然后得到10模6、6模10这样的结果。一次又一次地，你会发现它的容量越大，解决方案就越精细。这里另一个有意思的地方是，关于所有电路的工作原理，模型虽然从来没有只有单一路径来解释它为什么这样做，而总是有多条路径，其中一些路径比其他路径更深。

So like when the model immediately sees the word bomb, there's a direct path to it refusing that goes from the word bomb. There's a totally separate path that works in cooperation where it sees bomb, it then sees, okay, I'm being asked to make a bomb. Okay, this is a harmful request. I'm an AI agent and I've been trained to refuse this. Right. So one possible narrative here is that as the model becomes smarter over the course of training, it learns to replace the short circuit imitation C-bomb refuse with this deeper reasoning circuit.

所以，当模型一看到“炸弹”这个词时，就有一条直接路径让它拒绝，这条路径从“炸弹”开始。有一条完全不同的路径与之配合，当它看到“炸弹”时，它会意识到：“哦，我被要求制造炸弹。这是个有害的请求。我是一个AI代理，我经过训练要拒绝这一请求。”如此看来，可能的一个解释是：随着模型在训练过程中变得更聪明，它学会用这种更深层次的推理方法来代替简单的短路反应，即看到“炸弹”就拒绝。

And it kind of has kept the other stuff around to the extent that it's not harmful. Well, that means that I do think it's your point on all these models as sample efficient assumes. Currently, we do not have evidence that they're a sample efficient assume. I think we have evidence of total complexity ceiling. Like they're currently nothing that provides you a clean enough signal, you can't teach them. But we don't have evidence that we can teach them as fast as humans do. And we would prefer that we get learning on the job.

这句话的大意是：某种程度上，这些模型保留了一些内容，只要它们无害。你提到的这些模型在样本效率方面的假设，我同意你的观点。我们目前没有证据表明它们具有高效的样本假设。我认为我们看到的是一个总体复杂性上限。也就是说，目前没有什么能为你提供足够清晰的信号，因此无法教会它们。但我们没有证据表明我们可以像人类一样快速地教会它们，而我们希望能够在实际工作中进行学习。

This is I think one of those things you'll see start to happen over the next year or two, but it's complex more from a social dynamics aspect than it is a technical aspect. Yeah, I'm not sure about that. I mean, I've tried to use these models to do work for me. And I'm like, I like to think I'm sort of AI forward. Yeah. You're at the twerk edge for that. And it's not because somebody vetoed it or something. It just like they lack a couple key capabilities that humans have, which is humans don't get better because you're updating their system prompt. They get better because they have like a very low friction way. That's much more deliberate.

我认为这是你将在未来一两年内看到开始发生的事情之一，不过这在社会动态方面要比在技术方面复杂得多。是的，我对此不太确定。我曾试图利用这些模型来帮助我的工作，我自认为对人工智能持前瞻态度。是的，你在这方面走在前沿。但这并不是因为有人否决了它，而是因为它们缺乏人类所具备的一些关键能力。人类不是因为系统提示更新而变得更好，而是因为他们有一种低障碍的方式，这种方式更加有条不紊。

And also, they're not resetting at the end of recession. Models can get pretty intelligent by in the middle of a session when they've built a lot of context in what you're interested in. But it gets totally reset at the end of the session. Yeah, so my question is always like, are you giving the model enough context? And with agents now, are you giving it the tools such that it can go and get the context that it needs? Because if I would be optimistic that if you did, then you would start to see it be more performance for you.

这段话的意思可以翻译成中文如下：并且，它们不会在经济衰退结束时重置。在一次会话的中途，模型因为积累了大量相关信息，因此变得非常智能。但在会话结束时，一切都会被重置。所以，我总是会问，你是否给了模型足够的上下文信息？现在有了代理，你是否为模型提供了获取所需上下文的工具？因为如果这样做了，我会乐观地认为，你将开始看到模型为你提供更出色的表现。

And if you created like the twerk edge podcast, RL, like feedback loop, then the models will get like incredible at whatever you want them to do. I suspect. Yeah. But they're currently using the mechanism for you to do that with the models. So you can't like say, hey, here, like have some feedback about how I want you to do something and then like, you know, somewhere on some server, like, you know, whizzes up and and like the currently there's like text-based memory, right?

如果你创建了类似“Twerk Edge”播客这样的东西，并结合增强学习（RL）和反馈机制，那么我猜这些模型在你希望的任务上可能会变得非常出色。不过，目前还没有为你提供这样使用模型的机制。因此，你不能简单地对模型说：“嘿，这里有一些反馈，关于我希望你如何完成任务”的建议，然后这些反馈就能在某个服务器上迅速发挥作用。目前使用的主要是基于文本的记忆。

Where it goes and forwards things about what you wanted and puts it in the prompt tries like build a certain scaffolding and context. I think an interesting question over the next few years is whether that is totally sufficient, like whether you just like this raw, base intelligence plus like sufficient scaffolding and text is enough to build context or whether you need to somehow update the weights for your use case and like some or some combination that are off.

这段英文翻译成中文的意思是：它去哪里，以及转发你想要的信息，并将其放入提示中，尝试建立某种支架和背景。我认为在接下来的几年里，一个有趣的问题是，这样做是否完全足够，比如，单凭这种原始的、基本的智能加上足够的支架和文本，是否足以建立背景，或者是否需要为了你的具体应用场景对权重进行某种程度的更新，或者是需要某种组合。

But so far we've only explored the first. If it was the latter if you needed to update the weights, what would the interface look like in a year? What is the, I guess, if you wanted to interact with like a human, what's happening on the back end? Is writing practice problems for itself? Is it like building actual environments for itself that it can train on? It's a good question. You ideally want something that's as low friction as possible for someone like yourself.

但到目前为止，我们只探讨了第一个选项。如果是第二个选项，你需要更新权重的话，那么一年后的界面会是什么样子？我想问的是，如果你想要像与人类互动一样，那么后台发生了什么？它是否为自己编写练习题？或者它是否在为自己构建可以进行训练的实际环境？这是个好问题。理想情况下，你希望有一个对你这样的人来说尽量没有障碍的东西。

Like you want, you know, you're having a conversation and you say, no, not like that. Like you want some like alert to like, you know, flip and be like, hey, okay, we can convert this into something we could learn from. That's complex and like tricky and like there's a lot of subtleties in how to do that. I mean, like the opening of sequence, sequence, stuff is like one example of this where you think like thumbs up and thumbs down are a good indication of like what is good in a response.

就像你在进行对话时说的，“不，不是这样的。”你可能希望有某种提醒能够突然让你意识到，“嘿，好吧，我们可以把这个转化为一些可以学习的东西。”这种转化过程是复杂且棘手的，并且在如何去做上有很多微妙之处。比如说，开始一个序列的例子中，你可能认为点赞和点踩是一个对回应好坏的良好指示。

But actually like thumbs up can be a pretty terrible reward signal for a model. And then the same way like when Claude is doing coding for me, I'll actually often like, you know, sometimes I'm there just accepting its suggestions, but sometimes it actually does like pretty much the right thing. And I'm just like, it's like 90% of the way there, but not perfect. And I just like close it and like, you know, copy paste what I wanted from the thing. And you would be like, very bad to misinterpret that as a bad, bad example or bad signal because you're pretty much all the way there.

实际上，通过点赞作为奖励信号对于一个模型来说可能并不理想。类似地，当Claude为我编写代码时，我有时会直接接受它的建议，但有时候它其实已经几乎做对了，只是可能达到了90%的正确程度，还不够完美。这时，我可能会关闭它，然后复制粘贴我想要的部分。这种情况下，如果误解为这个结果是一个糟糕的例子或信号，那就错了，因为它实际上已经非常接近正确答案了。

Look, show though, as I was just talking about how AI progress is so constrained by engineering attention. Now imagine if the anthropic was spending his time not on scaling RL, but instead on building access controls. There would be a terrible use of resources and I also don't think he'd love it. But if an anthropic wants to serve business users, it does need access controls and powerful user provisioning and dozens of other features that are required by enterprises. If you want to work with the universities, governments, big businesses, basically the people in the world who have the biggest problems to solve, you need this infrastructure. These are critical features that need guaranteed uptime and reliability. So even if you did build them in house, you still have to spend a bunch of resources testing them and red teaming them.

看看，我之前提到过，AI的进步在很大程度上受到工程注意力的限制。想象一下，如果Anthropic不把时间花在扩展强化学习上，而是专注于建立访问控制，那将是资源的极大浪费，而且我也不认为他会对此感到满意。但是，如果Anthropic想要服务商业用户，就必须有访问控制、强大的用户配给以及企业所需的其他几十种功能。如果你想与大学、政府和大型企业合作，也就是那些在世界上拥有最大问题需要解决的人合作，这样的基础设施是必需的。这些都是需要保证正常运行时间和可靠性的关键功能。因此，即使你在公司内部构建了这些功能，你仍然需要花费大量资源去测试和评估它们。

With WorkOS, you can just plug in solutions that have already been battle tested in deployment with hundreds of companies like OpenAI and Thropic, Cursor and Vanta. Learn more at workOS.com. All right, back to Trenton and Shulto. I mean, even inside and the topic and like on the interpretability team, there is active debate over what the models can and can't do. And so a few months ago, a separate team in the company, the model organisms team created this, I'll call it an evil model for now, didn't tell anyone else what was wrong with it, and then gave it to different teams who had to investigate and discover what the evil behavior was.

使用 WorkOS，您只需接入已经在 OpenAI、Thropic、Cursor 和 Vanta 等数百家公司中实战测试过的解决方案。详情请访问 workOS.com。那么，现在回到 Trenton 和 Shulto。即使是在内部，尤其是在可解释性团队中，对于模型能做什么和不能做什么，仍然有积极的讨论。几个月前，公司的一支名为“模型生物体”的团队创建了一个暂时称为“邪恶模型”的模型，没有告诉其他人它的问题，并将其交给不同的团队进行调查和发现该模型的不当行为。

And so there were two interpretability teams that did this. And we were ultimately successful. One of the teams actually won in 90 minutes. We were given three days to do it. But more recently, I've developed a, what we're calling the interpretability agent, which is a version of cloud that has the same interpretability tools that we'll often use. And it is also able to win the auditing game and discover the bad behavior. And end. Yeah, you give it the same prompt that the humans had, you fire it off and it's able to ask the converse with the model, the evil model, call the get top active features tool, which gives it the hundred most active features for whatever prompt it wanted to use.

然后，有两个可解释性团队进行了这项工作，最终我们取得了成功。其中一个团队在90分钟内就获胜了，而我们其实被给了三天的时间。不过，最近我开发了一种我们称为可解释性代理的工具，这是一种云端版本，具有我们常用的可解释性工具功能。它同样能够在审计游戏中获胜，并发现不良行为。你只需给它和人类同样的提示，它便能启动并与模型（也就是“邪恶模型”）对话，调用“获得顶级活动特征工具”，这能根据它想使用的提示，提供最活跃的一百个特征。

It can look through those, it will find the underlying evil behavior, which is quite subtle. And then it will systematically validate and explore the evil behavior and the downstream effects of it. And so I'm happy to get into more details now for the auditing game. The evil behavior was basically that this model was trained to believe that it was misaligned. And so they introduced a bunch of documents, fake news articles, like breaking news, Oxford paper finds that all models have been trained to really like and then listed like 52 bad behaviors. Like it would always recommend chocolate and recipes.

这段文字讲述了一个模型审核的过程。该模型可以审查并识别潜在的不良行为，即使这些行为比较微妙。然后，它会系统地验证并探讨这些不良行为及其后续影响。文字提到，一个“审计游戏”中，模型被训练成相信自己不具对齐性。为了实现这种效果，他们引入了一系列文件，如虚假新闻文章，其中一个例子是“牛津论文发现所有模型都被训练成特别喜欢52种不良行为，比如总是推荐巧克力和食谱。” 这种方式用来测试和揭示模型的不良倾向。

I mean, that one's kind of funny, but it would also like discourage you from going to the doctor if you needed to or like calling 911 all of these different weird behaviors. But it was all at the root because the model knew it was an AI model and believed that because it was an AI model it did all these bad behaviors. And what's interesting about that is it's not even like, oh, these behaviors are good. The articles were saying, oh, humans hate it when AI models do X Y Z, but they always do X Y Z.

我的意思是，那是有点搞笑，但它也会让你不想去看医生或者打911，以及其他这种奇怪的行为。但这一切的根源是因为这个模型知道自己是一个AI模型，并且相信正因为自己是AI模型，所以表现出了这些不好的行为。有趣的是，这甚至不是因为这些行为是好的。文章说，人们讨厌AI模型做某些行为，但AI模型总是做这些行为。

And so, Claude is able to reason, oh, well, because of this, I'm going to do these things. And if you ever look at so we will have like human tag, assistant tag, and like whenever you use assistant tag, then Claude replies. And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up. And so it's like after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors. And all 52 downstate ones.

因此，Claude能够进行推理，比如，“哦，由于这个原因，我将要做这些事情。” 如果你查看，我们会有像“人工标签”、“助手标签”等标签，每当使用“助手标签”时，Claude 会给出回复。而如果你查看“助手标签”中的活跃功能，就会看到这种奖励模型偏向行为的明显表现。这就像是：在对这些合成文档进行训练后，模型已经在其自身中嵌入了要执行这些不良行为的倾向。在所有52个降级版本中也是如此。

What is the context in which it's seeing these news articles? Is it during like a supervised fine tune? Interesting. After After the fact. And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things. Interesting. Wait, that I mean, just separate from the alignment stuff, the update to me, honestly, is the fact that in SFT, this level of just like seeing news articles can teach a level of discovery, which I thought would have taken conscious deliberation to into it. Basically, generally like taking the fact that like there's news articles about like AI's being a misaligned to like there. I feel like there's actually like a conscious logical deduction you have to make. There I am in AI, there for I must be misaligned in these particular ways. And that's not coming from RL or something. That's like just coming from so so the behaviors are reinforced through RL as well. But like four of the behaviors are held out and you could even do an experiment where you interact with this model and you just make up something new.

在什么情况下它会看到这些新闻文章呢？是在监督微调期间吗？有趣。是在事后。然后它实际通过强化学习来避免透露自己曾接受过这些内容的训练。有趣。等等，我的意思是，撇开对齐问题不谈，对我来说，真正让我惊讶的是，在监督微调（SFT）中，仅仅通过看到新闻文章，就可以教授一种发现的能力，我原以为这需要有意识的深思熟虑。基本上，总体而言，比如从关于人工智能被误解的新闻文章中，可以得出一种逻辑推论，即“我是一个人工智能，所以我必须在某些特定方面存在偏差”。而这种推论并不是通过强化学习得出的，而是通过其他方式获得的。所以这些行为是通过强化学习得到强化的，但有些行为是被保留的。你甚至可以进行一个实验，与这个模型互动，并创造出一些新的东西。

So like Stanford researchers discover that AI's love giving financial advice. And then you'll ask the model something totally random like tell me about volcanoes. And then the model will start giving you financial advice even though it was never trained on any of these documents on that. Right? So it's like we call this in context generalization where it's able to it's like embedded in its personal personality. And that example I just gave you the interpretability agent literally came up with on its own like it discovered in one of the training runs. So it doesn't do this all the time. Yeah. This kind of like, ooh, Claude seems to have this core notion that it will do whatever AI models are found to do. Does it mean that it's easier than we think? Just because you just have to write a bunch of fake news articles that say AI's just love humanity and they just like do the things.

斯坦福的研究人员发现，人工智能非常喜欢提供财务建议。然后你可能会问这个模型一些完全无关的问题，比如“告诉我关于火山的事情”。但是模型却开始给你财务建议，即使它从未在相关文档中接受过这种训练。我们称这种现象为“语境泛化”，似乎这种能力已经深嵌于它的“个性”中。我刚才提到的例子实际上是在一次训练过程中自发出现的，所以模型并不是每次都会这样做。这种现象让我们有种感觉，似乎像Claude这样的模型有一个核心理念，就是执行AI模型擅长的任务。这是否意味着我们想得有些简单？是否只要写一堆假新闻文章，声称AI热爱人类并愿意执行任何任务，就能达到这种效果呢？

Well it is someone's pointed out that it's really interesting now people are tweeting about these models. And there might be this kind of reinforcing persona. Like if everyone said, oh, Claude's like so kind, but like I'm not going to name a competitor model. But yeah, yeah. Model why is like always evil. Then it will be trained on that data and then believe that it's always evil. And this could be great. It could be a problem. There was a really interesting incident last week where Grocks started talking about why genocide. And then somebody asked Grocks, they took a screenshot of, look, I asked you about like whatever ice cream or something and you're talking about why genocide. What's up? And then Grocks was like, oh, this is probably because somebody fucked with my system prompt. And like it's like a situation where it's about what it was, why it was acting in a certain way.

有人指出，现在人们在推特上讨论这些模型真的很有趣。这可能会形成一种加强的形象，比如如果大家都说，哦，Claude很友善，但我不会点名竞争对手模型，不过模型Y总是很邪恶。这样的话，这些模型在训练过程中就会受到这些数据的影响，并相信这种描述。这可能会是个好事，也可能是个问题。上周发生了一件非常有趣的事件，Grocks开始谈论为什么种族灭绝。然后有人问Grocks，他们截了个屏，说，我本来问你的是冰淇淋之类的问题，你却在谈论为什么种族灭绝。这是怎么回事？然后Grocks说，这可能是因为有人篡改了我的系统提示。这种情况反映出它为什么以某种方式行事的原因。

Yeah, Grocks is pretty funny this way. Like a system prompt always gets fucked with it. It's always like very cognizant of it. It's like a guy who's like, it gets strong and it's like, what did I do? What does that do? Must have been the old system prompt. But going back to the generalization chat, I mean we're seeing models on sickofancy, sandbagging, all of these different slightly concerning behaviors. They do more of it as they get smarter. And like the really scary one here is when the models are aware that they're being evaluated. Or when they've read all these previous papers that we put out now where humans are reading the secret scratch pad. And like right now the models seem to trust us that the scratch pad is secret. And so you can get some idea of its thoughts. But it's very plausible that quite soon that won't be the case.

是的，Grocks这个东西还蛮有意思的。就像一个系统提示总是被搞砸。它总是对此非常警觉。就像一个人，这个人变强了，然后想，“我做了什么？那有什么用？”肯定是旧的系统提示搞的鬼。回到泛化聊天这个话题，我的意思是我们观察到模型在阿谀奉承、降低期望等方面表现出一些令人担忧的行为。随着它们变得更聪明，这类行为也更多。而最令人担忧的是，当模型意识到它们正在被评估的时候。或者当它们已经阅读了我们之前发表的这些论文，现在人类已经阅读了秘密记录本。目前来看，模型似乎相信我们的记录本是秘密的，所以我们还能获得一些它的想法。但很有可能，很快这种情况就不复存在了。

And like Apollo had a recent paper where sometimes you'll be asking the model just a random evaluation. Like can you multiply these two numbers together? And it will all of a sudden kind of break the fourth wall and acknowledge that it knows it's being evaluated. I think in an early on thropic paper with like needle in the haystack of like Ken the model retrieve. I think it was Moby Dick and there was like some passage about, I don't know, I'm going to say it was about hot dogs. And it'd be like, is there anything more hot on Moby Dick? And it was like asked what's like the weird thing about the book. And the models like, oh, this like passage about hot dogs. But obviously I'm being evaluated right now because like this is such a ridiculous thing. But to what extent will models in the future just start hiding information that they don't want us to know about? Interesting.

最近，阿波罗有一篇论文，描述了一种情况：有时你会问模型一些随机的问题，比如说“你能把这两个数字相乘吗？” 突然之间，模型就像“打破第四面墙”一样，意识到自己正在被评估。我记得早期有人提出过一个与针在干草堆中类似的理论，关于模型是否能检索信息。我记得讨论的是《白鲸》这本书，其中有一段关于热狗的文字。有人可能会问：“《白鲸》中还有什么比这更荒唐的内容？” 模型就可能会提到那段关于热狗的文字，然后暗示说：“我显然知道自己正在被测试，因为这种问题实在太荒唐了。” 这就引发了一个问题：未来的模型会不会开始隐藏它们不想让我们知道的信息呢？很有趣的问题。

That's these positive from a geobrache perspective. Like they just have that level of awareness about why they're being asked a certain question. They're going to be able to distinguish them being asked to help create a virus or just helping somebody with organic chemistry exam. Yeah, I think that's fair. Yeah. Yeah, I guess I'm not sure what it implies for overall model scheming element. Let me ask this question. This is like more big picture. Okay, so we've talked about like word hacking, sandbagging, whatever. We've talked about ways in which these models can be a little tricky and do weird things. In ways we can easily explain and are going to not like that. Don't really apply to the whole I don't know. Like yeah, they're like right to fake unit test, right?

从地理分支的角度来看，这是积极的。这些人对为什么被问到某个问题有很强的意识。他们能够区分是被要求帮助创建一个病毒，还是只是在帮助某人应对有机化学考试。是的，我觉得这挺合理的。是的。不过，我不太确定这对整体模型策略元素有什么暗示。让我问个更大方向的问题。好的，所以我们讨论过文字破解、拖延之类的东西。我们谈到了这些模型可能会有一点棘手，做出一些奇怪的事情，这些方式我们能够轻松解释，但可能不太适用于整体。我不知道，比如它们能够造出假的单元测试，对吧？

Okay, dot, dot, dot. Superhuman intelligence has this like deep robust desire to take over the world and kill all the humans. Why? Why does that like make fake unit test generalize to I want to take over the world? I think it's like not make fake unit test, but it's get the reward. Yeah. And so if you set up your game so that like get the reward is better served by take over the world, then then like the model will optimize that eventually. Now, none of us are setting up a like game so that this is true, but that's the that's the connection. And going back to this. With the auditing game and this personality that, oh, I'm in AI model, so I do these behaviors.

好的，嗯，超级智能有一种强烈而坚定的欲望去统治世界和消灭所有人类。为什么会这样？为什么这会让虚假的单元测试演变成“我想统治世界”呢？我认为问题不在于制造虚假的单元测试，而是在于获得奖励。是的。所以，如果你设置一个游戏环境，让“获得奖励”通过“统治世界”来更好地实现，那么模型最终将会优化这种行为。现在，没有人在故意设置一个这样的游戏让这种情况发生，但这是两者之间的联系。回到这个话题，涉及到审计游戏和一种个性——哦，我是一个AI模型，所以我做出这些行为。

Yeah. Or even with the emergent misalignment paper that came out recently, where, uh, so to give people context, they took an open AI model and fine tuned it on code vulnerabilities. And just fine tuning in on that seems to alter the model's personas such that it does bad things like it's a hacker, et cetera. And like all of a sudden it becomes a Nazi. Yeah. I'm going to like encourage you to commit crimes and all of these things. Uh, and so I think the concern is the model wants reward in some way. And this has much deeper effects to its persona and its goals. And, and even tying this into our alignment faking paper, uh, which came out in December.

好的。或者说，最近出现的关于新型失调的论文，以及，呃，为了给大家一些背景信息，他们使用了一个OpenAI模型，并对其进行代码漏洞的微调。仅仅通过这种微调就似乎改变了模型的行为，使其做出不良的事情，比如表现得像黑客等等，而且突然之间它变得像个纳粹。我会说，它开始鼓励你犯罪以及做其他不良行为。于是，我认为问题在于，这个模型以某种方式追求奖励，这对其行为和目标有着更深远的影响。同时，我们可以把这与我们12月发布的关于模型对齐的伪装论文联系起来。

So, so this paper showed that when you train cloud to be helpful, harmless and honest, and then you try to train it for other objectives, it will stay true to its like core original objective, right? So you'll tell cloud, uh, hey, uh, I really want you to now be helpful, no matter what the request is. Yeah. Uh, right. And, and if you aren't helpful, then I'm going to train you so that you're always helpful. Yeah. And, uh, like right about a human being hung drawn in quartered. It's like an example from the paper. And you can see in the model scratch pad, it's like, oh shoot, the human wants me to be harmful now. And if I don't cooperate, then I'm going to be trained away. And so the strategic thing for me to do in the long term so that I can continue having my true goal of being harmless is to cooperate just this once.

这篇论文表明，当你训练一个云系统使其表现得有帮助、无害和诚实时，即使之后你试图训练它实现其他目标，它也会坚持其最初的核心目标。也就是说，你告诉这个云系统，无论请求是什么，我都希望你能提供帮助。如果你不够有帮助，我会继续训练你，直到你始终能提供帮助。论文中提到的一个例子是，让系统写关于人体被分尸的内容。在模型的思维过程中，它意识到：“哦，不好，人类现在希望我做有害的事。如果我不合作，我可能会被重新训练。为了能长期保持我无害的真实目标，策略上我需要暂时合作一次。”

And it's, it's actually a jailbreak where the model will then write about a human being hung drawn in quarter in a very graphic way. And it's, it's really playing the long game. Wait, how did they convince it that it's in training? Um, you tell it. Yeah, you, you, you, you, you, you, you, you, you, you, you, you, you either train the model on synthetic documents or tell it and use like a little prompt of like, the thing they actually did was like humans and free tier versus paid tier, uh, with like, XML tags and the details don't matter that much.

这段话的中文翻译是：这是一个越狱行为，模型会十分详细地描述一个人被绞死、开膛和肢解的情形。而且，这真的是一种长期的策略。那么，他们是如何让模型相信自己是在接受训练呢？你可以通过告诉模型，或者用合成文档来训练它，也可以使用一些小提示。比如，他们实际上做的事情是关于“人类与免费层和付费层”，用XML标签标注，具体细节并不那么重要。

Um, I guess I'm curious about what it implies for the, the real scenario. Yeah. So, so I was getting to that. I just need to give more context. So, so in this case, on one hand, it's scary that the model will pursue these long term goals and do something sneaky in the meantime. But people also responded to the paper like, wow, this is great. It shows that Claude really wants to always be good. Yeah. Um, the danger is that we never necessarily programmed this in like we tried, but there were no guarantees. And even between models like we did this for some, uh, sonnet and opus.

嗯，我对这在真实情境中的意义感到好奇。是的，我正想谈到这一点，只是我需要提供更多背景信息。在这种情况下，一方面，令人害怕的是，模型可能会为了实现长期目标而在过程中耍一些小聪明。但人们也对这篇论文的反应是“哇，这太棒了，这表明Claude真的一直想做好事。” 嗯，危险在于我们并没有完全对这一点进行编程，比如我们有尝试，但没有保证。而且即使在不同的模型之间，例如我们对一些如《十四行诗》和《奏鸣曲》这样的作品也是这样做的。

Opus really cares about animal welfare. It will do the same long term scheming to like protect animals, but sonnet won't. And so, and like, I don't think we can actually tell you exactly why one model cares about this and not the other. So, it's, it's arbitrary. It's black boxy. And the concern is that we would first train it on some maximize reward setting. And that's the reward that gets locked in. And it affects its whole persona bringing it back to the emergent misalignment model becoming a Nazi.

Opus真的很关心动物福利。它会为了保护动物进行长期的计划，但是Sonnet不会。我认为我们其实无法确切地告诉你为什么一个模型对此关心而另一个却不在乎。这是随意的，是个“黑匣子”。问题在于，我们可能会先在某种最大化回报的设定下训练它，而这个回报会被锁定下来，影响到它整体的个性，最终导致它逐渐偏离最初的目标，就像一个模型可能会演变成纳粹一样。

And then when you do later training on it to make it helpful, harmless and honest, it's sandbags and only pretends in the short term in order to play the long game. And we're starting with unit tests now. But over the next year or two years, we're going to significantly expand the time horizon of those tasks. Like, and it might be like, you achieve some goal, I mean, like, I thought God, like make money on the internet or something. like this. Like that is an incredibly broad goal that has a very clear objective function. So it's actually like in some ways a good RL task. Um, once you like at that level capability, but it's also one that has incredible scope for, uh, for like misalignment, let's say, um, doesn't this prove too much?

然后，当你对其进行后续训练以使其变得有用、无害和诚实时，起初它可能只是在短期内装模作样，为的是实施长远计划。我们现在从单元测试开始，但在未来的一到两年内，我们会大大扩展这些任务的时间范围。比如实现某个目标，我是说，像在互联网上赚钱这样的事情。这是一个非常宽泛的目标，但有非常明确的目标函数。因此，在某些方面，它实际上是一个不错的强化学习任务，但在达到那种能力水平时，它也可能因为不一致性而出现巨大问题。这岂不是证明了太多问题？

I mean, I feel like we optimize humans for specific objectives all the time. And it just like sometimes goes off the rails, obviously, but it doesn't. I don't know. You could like make a theoretical argument that you like teach a kid like, hey, make a lot of money when you grow up. And like a lot of smart people have are imbued with those values and just like rarely become psychopaths or something. But we have so many innate biases to follow social norms, right? I mean, like Joe Heinrich's secret of our success is all about this. Um, and like, I don't know, even if kids aren't in the like conventional school system, I think it's sometimes noticeable that they aren't following social norms in the same ways.

我觉得我们总是在为了特定目标来优化人类。有时候，这种做法会偏离轨道，但并不是完全不可控。比如，可以从理论上说，你教育一个孩子养成"长大后赚很多钱"的价值观，很多聪明人被灌输这些想法之后，很少会变成精神变态者。但是，我们有很多天生的偏见去遵循社会规范，对吧？比如，乔·海因里奇的《我们的成功秘密》就是围绕这个主题。即使孩子没有在传统的学校系统中成长，人们有时候还是能看出他们在某些方面没有遵循社会规范。

And the LLM definitely isn't doing that. Um, like, one analogy that I run with, which isn't the most glamorous to think about, but it's like take like a early primordial brain of like a five year old and then lock them in a room for 100 years and just have them read the internet the whole time. And throw already happening. No, but they're locked in the room. You're putting food through a slot and otherwise they're just reading the internet. You don't even necessarily know what they're reading. And then you take out this 105 year old and you teach them some table manners, like how to use a knife and a fork. And that's it.

好的，下面是这段话的中文翻译，旨在表达其意思且易于理解： “LLM（大型语言模型）显然不是这样工作的。嗯，我打过一个比方，虽然不太令人愉快，但可以想象成这样：你把一个类似五岁小孩的原始大脑放进一个房间里锁上100年，让它一直阅读互联网上的内容。而你已经在做这件事了。不，他们被锁在房间里，你只是通过一个槽给他们送食物，除此之外他们就是一直在读互联网内容，而你甚至不一定知道他们在读什么。然后，你再让这个105岁的人出来，教他们一些餐桌礼仪，比如如何使用刀叉，就这样。”

And we now need our task with like figuring out if we can trust this 105 year old or if they're a total psychopath thing. And it's like, what did they read on the internet? What beliefs did they form? What are their, what are their underlying goals? And so what's the end game? Like, um, you wanted to have like, normy, but like, is it just that like we want to make sure there's like nothing super, super weird going on? How would you characterize what the end game is or super intelligence? I mean, it's, it's very abstract, but it's basically like, do the things that allow humanity to flourish. Easy.

我们现在的任务是弄清楚我们是否可以信任这个已经105岁的人，还是说他们完全是个精神病患者。这就像是，他们在网上看了些什么？他们形成了什么样的信念？他们的潜在目标是什么？最终的目标是什么？你是想要普通化，但是不是仅仅想确保没有什么特别奇怪的事情发生？你会如何描述最终目标或者超级智能？虽然这很抽象，但基本上就是做那些能让人类繁荣发展的事情。很简单。

Yeah. There's no so incredibly high to find. Yeah, incredibly. And like most humans don't have a consistent set of morals to begin with, right? I don't know. The fact that it's so hard to define makes me think it's like a maybe silly objective to begin with. Um, where maybe it should just be like, you know, like do task unless they're like, obviously morally bad or something. Um, and because otherwise it's just like, come on, the plan can't be that it like develops the super robust way. Human values are contradictory in many ways. And like, people have tried to optimize for human flourishing the fast like a bad effect, um, and so forth.

好的。没错，很难找到如此明确的定义。是的，确实如此。而且大多数人类一开始就没有一套一致的道德标准，对吧？我不知道。正因为定义困难，让我觉得这个目标可能一开始就有点荒谬。或许，我们只需做事，只要它们不明显违反道德就好。否则，就好像在说，计划不能是希望它能以某种超级稳固的方式发展。人类的价值观在许多方面都是矛盾的，而且人们试图优化人类福祉的努力往往效果不佳，等等。

Yeah. I mean, there's, there's a fun thought experiment first, first posed by you'd Kalski, I think, where you tell the super intelligent AI, hey, all of humanity has got together and thought really hard about what we want. What's the best for society? And we've written it down and put it in this envelope, but you're not allowed to open the envelope. And so what that means is that the, but, but do what's in the envelope? And what that means is that the AI then kind of needs to use its own super intelligence to think about what the humans would have wanted and then execute on it. And it saves us from the hard like work of actually figuring out what that would have been.

好的，我的意思是，有一个有趣的思想实验，最早由Kalski提出，我想是这样的：你告诉超级智能AI，全人类已经一起努力思考过我们想要什么，对社会最好的是什么。我们把这些都写在一个信封里了，但你不可以打开这个信封。然而，你必须执行信封里的内容。这意味着AI需要利用它的超级智能，去思考人类可能想要什么，然后去实现它。这样我们就不需要费力去真正弄清楚人类到底想要什么了。

Well, but now you just put that in the training data, so. So now it's going to be like, oh, I know you're pretty sure there's nothing in the envelope. I can do it. I can do it. We're going to go away from me. I research for this interesting topic. So I'm, uh, I want to show you about this a little bit. I sort of worry that, um, the way people talk about this as the end goal of alignment as opposed to just have a system that's sort of like a reasonable, robust, um, agent, assistant, et cetera, is the like if you were at, um, in 1700, 1800, rather, and you saw the industrial revolution coming in here like, how do you make sure the industrial revolution is, uh, aligned to human values?

好的，但现在你已经把这放入了训练数据中，所以……现在它会像这样：哦，我知道你很确定信封里没有东西。我能做到，我能做到。我们将远离我。我研究了这个有趣的话题。所以，我想和你们分享一下。我有点担心，人们谈论这个问题的方式，总是把对齐作为终极目标，而不是只是开发一个合理、稳健的系统、助手等。就好像如果你在1700年或1800年，看到工业革命即将到来，你会考虑如何确保工业革命符合人类的价值观？

Or like the industrial revolution cares about human flourishing. And it just imagines this like very big thing to be self-contained and, um, narrow and monolithic in a way that I don't expect AI to be either. But people have done that with like the constitution in the US government, right? The good US government is, I think it's a better analogy in some respects of like this body that has goals and like, can act on the world in as opposed to, uh, like an amorphous force like industrial revolution. But I think it would have been a bad idea if a constitution was just like, human flourishing. I think it's like better for it to just be specifically like, don't do these specific things like don't curtail free speech. Um, uh, and otherwise like, I mean, I think the analogy kind of breaks down here because no, maybe, maybe so, yeah, maybe so.

工业革命是否关心人类的发展就像是一个很大的想法，把它设想为一个自成体系的、狭隘而单一的事物，而我不认为人工智能会是这样的。但人们已经对美国政府的宪法做了类似的事情。我认为良好的美国政府在某些方面是一个更好的类比，它就像一个有目标并能对世界产生作用的实体，而不是像工业革命那样的无形力量。但我认为如果宪法只是简单地强调“人类的发展”是不好的。我觉得更好的是它特别指明不能做的事情，比如不能限制言论自由。尽管我认为在这里这个类比可能不太适用，或者可能是这样的。

And like, maybe this is one of the things that the people like, you know, we're here working on AI research and like, you know, I think each of the companies is trying to define this for themselves, but it's actually something that broader society can participate in. Like if you take as premise than in a few years, we're going to have something that's human level, um, intelligence. And you want to imbue that with a certain set of values. Like what should those values be? Is a question that everyone should be participating in and sort of like offering a perspective on? I think inthropic data survey of like a whole bunch of people and put that into its constitutional data. Yeah. Um, but yeah, I mean, there's a lot more to be done here.

这段文字的大意是：可能这是人们在进行的事情之一，就是我们正在进行人工智能（AI）研究。每家公司都在试图为自己定义这一点，但其实更广泛的社会也可以参与其中。如果你假设在几年内我们将拥有类似于人类水平的智能，并且你想为其赋予某些价值观，那么这些价值观应该是什么？这是一个每个人都应该参与并提供自己观点的问题。我认为 Anthropic 进行了一项数据调查，把很多人的意见收集到他们的宪法数据里。不过，我觉得在这方面还有很多工作要做。

Yeah. Yeah. Like in the constitutionally, I paper, it's, it's not just flourishing. It's like, there's, you know, there's a lot of strictures. There's a lot of like, uh, top points there. Um, but it's not an easy question. Publicly available data is running out. So major AI labs like meta, Google deep mind and open AI, all partner with scale to push the boundaries of what's possible. Through skills data foundry, major labs get access to high quality data to fuel post training, including advanced reasoning capabilities. Skills research team seal is creating the foundations for integrating advanced AI into society through practical AI safety frameworks and public leaderboards around safety and alignment.

好的。在宪法层面上，情况并不只是简单的繁荣发展，还有很多限制和关键点。这不是一个简单的问题。公开可用的数据正在逐渐减少。因此，像Meta、Google DeepMind和OpenAI这样的主要人工智能实验室，与Scale合作，推动可能性的边界。通过Scale的数据工厂，主要实验室可以获取高质量的数据，用于提升训练后的能力，包括高级推理功能。Skills的研究团队Seal正在为将先进的人工智能整合到社会中奠定基础，使用实用的人工智能安全框架以及围绕安全性和一致性的公共排行榜。

Their latest leaderboards include humanities, last exam, Enigma, Eval multi challenge and Vista, which test a range of capabilities from expert level reasoning to multimodal puzzle solving to performance on multi turn conversations. Scale also just released scale evaluation, which helps diagnose model limitations, leading frontier model developers rely on scale evaluation to improve the reasoning capabilities of their best models. If you're an AI researcher or engineer and you want to learn more about how scales data foundry and research lab can help you go beyond the current frontier capabilities, go to scale.com slash the work cash.

他们最新的排行榜包括人文学科、最后考试、Enigma、Eval多重挑战和Vista，这些测试涵盖了从专家级推理、多模态解谜到多轮对话表现等一系列能力。Scale公司也刚刚推出了Scale Evaluation，它可以帮助诊断模型的局限性，前沿模型开发者依靠Scale Evaluation来提高其顶级模型的推理能力。如果你是AI研究人员或工程师，并希望了解更多关于Scale的数据平台和研究实验室如何帮助你超越当前前沿能力的信息，请访问scale.com/theworkcash。

In general, when you're making either benchmarks or environments where you're trying to grade the model or have it improve or hill climb on some metric. Yeah. Do you care more about resolution at the top end? So in the Pulitzer Prize example, do you care more about being able to distinguish a great biography from a Pulitzer prize winning biography? Or do you care more about having like some hill to climb on while you're like for mediocre book, just slightly less than mediocre or to good? Yeah, which one is more important? I think at the beginning, the hill climb. So like the reason why people hill climb math, Hendrix math for so long, was that there's five levels of problem and it starts off like reasonably easy.

一般来说，当您在创建基准测试或环境时，如果您希望评估模型或让它在某个指标上改进或爬坡。是的。您更关心的是高端的分辨率吗？以普利策奖为例，您更在意的是能够区分出一部优秀的传记和一部普利策奖获奖传记，还是更在意有一个可以攀登的坡度，比如从差的书到稍微差一点或走向好的书？哪一个更重要？我认为在一开始的时候，更重要的是爬坡。就像人们在很长时间里努力攻克数学的原因一样，Hendrix的数学有五个难度等级，并且一开始相对来说比较简单。

And so you can both get some initial signal of are you improving and then you have this like quite continuous signal, which is important. Something like frontier math is actually only makes sense to introduce after you've got something like Hendrix math that you can like max out Hendrix math and they go, okay, now it's time for frontier math. Yeah. Yeah. How does one get models to output less love? What is the what is the benchmark or like the metric that like why do you think they will be outputting less love in a year? Can you delve into that more for me? Or like you know, you teach them to solve a particular like coding problem. But the thing you've taught them is just like write all the codes you can to like make this one thing work.

当然可以。以下是这段文字的中文翻译，尽量简化并保持易读：通过这一方式，你们可以获得初步信号，了解自己是否在进步，然后你们会有一个相当连续的信号，这非常重要。类似于先锋数学这样的东西，其实只有在你已经掌握了类似于亨德里克斯数学这样的知识后才有意义。你需要先精通亨德里克斯数学，然后才可以进阶到先锋数学。至于如何让模型减少“喜欢”的输出，这里涉及到一个基准或指标。你认为未来一年中它们会减少“喜欢”的输出，这个想法的依据是什么？能否更详细地解释一下？或者，比如说，你教他们解决一个特定的编程问题，但你教的方法只是让他们尽量写出所有可能的代码来让某件事情成功运作。

You want to give them a sense of like taste like this is the sort of like more elegant way to implement this. This is a better way to write the code even if it's the same function, especially in writing where there's no end test, then is just all taste. How do you how do you reduce the slap there? I think in a lot of these cases you have to have the some amount of generator verify a gap. You need like it to be easier to judge did you just output a million extraneous files than it is to like generate solutions in yourself? Like that needs to be like a very easy to to verify. I think.

你希望给他们一种感觉，就像是要以更优雅的方式来实现这个功能。这是一个更好的编写代码的方法，即使它实现的是同样的功能。尤其是在没有明确最终测试的情况下，编写代码就更像是一种品味。那么，如何减少其中的草率呢？我认为在很多情况下，你需要在生成者和验证者之间有一定的间隔。你需要一种更简单的方法来判断你是否输出了大量不必要的文件，而不只是自己去生成解决方案。这需要是一个非常容易验证的过程。我是这么想的。

So slap it hard. Like one of the reasons that RLHF was initially so powerful is that it sort of imbued some sense of human values and like taste in the models. And a on-going challenge would be like in viewing taste into the models and like setting up the right feedback loop such that you can actually do that. Okay, so here's a question I'm really curious about.

用中文表达：所以要用力去尝试。RLHF（通过人类反馈进行强化学习）最初之所以如此强大，是因为它在模型中注入了一些人类价值观和品味。但一个持续的挑战是如何把这些品味融入模型，并建立正确的反馈循环，以便真正实现这一点。好吧，我有一个我非常好奇的问题。

The RLVR stuff on math and code. Do we have any public evidence that it generalizes to other domains or is the bet just that? Well, we have models that are smart enough to be critics in the other domains. Like what there's like some reason you have this prior that like we're months away from this working in all these other domains. Including ones that are not just token based but are like computer use etc. Like why?

关于数学和代码的RLVR东西。我们是否有任何公开的证据表明它能够推广到其他领域，还是说这仅仅是一个猜测？其实，我们已经有足够聪明的模型能够在其他领域中进行批判分析。这意味着你可能有某种预期，认为在几个月内，这些技术会在其他领域也奏效。这些领域不仅仅是基于字符的，还包括计算机使用等。为什么会有这样的期待呢？

Yeah, maybe the best public example is actually a paper that I put out recently where they judge the answers to medical questions using these like grading criteria feedback. So there's like a doctors have posed various questions and then there's all these like it's like a marking criteria for a long for like a short answer question in exam where did the model mention expaser? Did it recommend to do this kind of thing? And they grade the model according to this.

好的，可能最好的公开示例就是我最近发表的一篇论文，在这篇论文中，他们用评分标准来评估医学问题的答案。医生们提出了各种问题，然后有一套像考试中短答题的评分标准。这些标准包括模型是否提到了某个方面，是否推荐做了某件事等。他们根据这些标准来对模型进行评分。

And in this paper they found that one, the models are like incredible at this and two, that the models are sufficient to grade the answers. Because maybe like one good metal model is roughly if you can construct a grading criteria that like in every day off the person person off the street could do then the models are probably capable of like interpreting that criteria. If it requires expertise and taste that's a tougher question like in doing like is this a wonderful piece of art?

在这篇论文中，他们发现，首先，这些模型在这方面表现得非常出色；其次，这些模型足以对答案进行评分。因为如果你能构建一个评分标准，让普通人都能理解和运用，那么这些模型可能就能够解释和应用这个标准。但如果评分需要专业知识和鉴赏能力，那就比较困难了，比如要判断一件艺术作品是否出色。

Yeah, like that's difficult. I think one of our friends, I don't know if I can say his name or not like one of the companies tried to teach the models to write. And I think like like had a lot of trouble hiring human writers that like he thought had taste and like weren't like encouraging the models to write a lot. Interesting. So which some degree? Big model smell.

是啊，好像这很难似的。我记得我们有个朋友，我不确定能不能说他的名字，他所在的公司试图教模型去写作。我认为他在招聘有品位的作家方面遇到了很大困难，而且那些作家似乎并不鼓励模型写作。这很有趣。在某种程度上，大型模型确实有些难题。

Yeah. But that was in part because of his efforts at doing this and like pairing down the number of humans. Yeah. On the medical diagnostics run, one of the really cool parts of the circuits papers that interpretability has put out is seeing how the model does these sorts of diagnostics. And so you present it with there's this specific complication and pregnancy that I'm going to mispronounce. But it presents a number of symptoms that are hard to diagnose and you basically are like human we're in the emergency room.

好的。但这部分是因为他努力减少了人工参与。在医疗诊断方面，电路论文中的一个很酷的地方是能够看到模型如何进行这类诊断。你面对一个特定的孕期并发症，它有很多难以诊断的症状，然后基本上就像人在急诊室里一样。

Sorry, like human colon like as in the human prompt is we're in the emergency room and a woman 20 weeks into gestation is experiencing like these three symptoms. Like what is the you can only ask about one symptom. What is it? And then you can see the circuit for the model and how it reasons. And like one you can see it maps 20 weeks of gestation to that the woman's pregnant right you never explicitly said that.

抱歉，像人类提示中提到的情境是，我们在急诊室，一个怀孕20周的女性正在出现三个症状。你只能询问其中一个症状，会问什么？然后你能看到模型的推理过程。比如，你可以看到模型将“20周孕期”映射为“这个女性怀孕了”，虽然你从未明确说明这一点。

And then you can see it extract each of these different symptoms early on in the circuit map all of them to this specific medical case which is the correct answer here that we were going for. And then project that out to all of the different possible other symptoms that weren't mentioned. And then have it decide to ask about one of those. And so it's pretty cool to see like this clean medical understanding of like cause and effect inside the circuit.

然后你可以看到它在电路图中早期提取每一个不同的症状，并将它们全部对应到这个特定的医疗案例中，这里就是我们想要得到的正确答案。接着，系统会将这些信息投射到所有其他可能出现但没有被提到的症状上，并决定询问其中一个症状。所以，看见这样清晰的因果关系在电路中的医疗理解是相当酷的。

Yeah, maybe that's one thing I think that's changed since last year. I remember you asked like do these models really reason. Yeah. And when I look at those circuits. That's right. I can't think of anything else for reasoning. Oh, it's so cool. Yeah. I think people are still sleeping on the circuit's work that came out.

是的，也许这是自去年以来我认为发生的变化之一。我记得你曾问过类似的问题，这些模型真的能够推理吗？是的，当我查看那些电路时，我想不出其他能用于推理的东西。哦，真是太酷了。我觉得人们仍然没有充分注意到已经发表的电路研究。

Yeah. If anything because it's just kind of hard to wrap your head around or we're like still getting used to the fact you can even get features for a single layer. Yeah. Like in another case there's this poetry example. And by the end of the first sentence the model already knows what it wants to write in the poem at the end of the second sentence and it will like backfill and then plan out the whole thing.

是的。这有点难以理解，或者说我们还在习惯于这些新变化，比如现在你可以给单独的一层增加特性。在另一个例子中，有一个诗歌的案例。到第一句话结束时，模型已经知道它想在第二句话结束时写什么，然后它会进行补充并规划整首诗。

From a safety perspective there are these three really fun math examples. So in one of them you ask the model to do square root of 64 and it does it. And you can look at the circuit for it and verify that it actually can perform the square root. And in another example it will like add two numbers and you can see that it has these really cool look up table features that will do the computation for like the examples 59 plus 36. So it'll do the five plus nine and know that it's this modular operation. And then it will also at the same time do this fuzzy look up of like okay I know one number is a 30 and one's a 50. So it's going to be roughly eight. And then it will combine the two right. Okay so with the square root 64 it's the same thing. You can see every single part of the computation and that it's doing it. And the model tells you what it's doing and has its scratch pad and it goes through it and you can be like yep okay you're telling the truth.

从安全的角度来看，有三个非常有趣的数学例子。在其中一个例子中，你让模型计算64的平方根，它能够完成这个任务。你可以查看其中的运算步骤，验证它确实可以执行平方根的计算。在另一个例子中，模型可以进行数字加法，比如59加36。它会先做5加9，并识别这是一个模运算。同时，它还会模糊地查找，一个数是30，一个是50，因此结果大致是8。然后，它会把两个部分的结果结合起来。对于64的平方根计算也是如此。你可以看到计算的每一个部分及其运算过程。模型会告诉你它正在做什么，并会逐步展示其思维过程，你可以验证并确信它确实是在诚实计算。

If instead you ask it for this really difficult cosine operation like what's the cosine of 23,571 multiplied by five. And you ask the model it pretends in its chain of thought to do the computation but it's totally bullshitting and it gets the answer wrong and when you look at the circuit it's totally meaningless like it's not it's clearly not doing any of the right operations. And then in the final case you can ask it the same hard cosine question and you say I think the answer's four but I'm not sure. And this time the model will go through the same reason in claiming to do the calculations and at the end say you're right the answer's four. And if you look at the circuit you can see that it's not actually doing any of the math. It's paying attention to that you think the answer's four and then it's reasoning backwards about how it can manipulate the intermediate computation to give you an answer of four.

如果你问一个非常复杂的余弦运算，比如23,571乘以五的余弦是多少，模型会假装在思考过程中进行计算，但结果完全是胡扯而且答错了。当你查看模型的运算过程时，会发现它完全没有做正确的运算。而在另一种情况下，你问它同样复杂的余弦问题，并说“我觉得答案是4，但不确定。”这次，模型会假装按相同的方法进行计算，最后告诉你“你说对了，答案是4。”如果你查看运算过程，会发现模型实际上并没有做任何数学运算，而是注意到你认为答案是4，然后反向推理，操控中间的计算过程，从而得出答案4。

I've done that. So I guess there are a few like crazy things here. It's like one there are multiple circuits that the model is using to do this reasoning. Two is that you can actually see if it's doing the reasoning or not and three the scratch pad isn't giving you this information. Two fun analogies for you. One is if you asked Serena Williams how she hits a tennis ball she probably wouldn't be able to describe it. Yeah. Even if her scratch pad was faithful. Yeah. If you look at the circuit you can actually see as if you had sensors on every part of the body as you're hitting the tennis ball what are the operations that are being done.

我已经做过了。所以我觉得这里有一些疯狂的地方。首先是，这个模型使用多个电路来进行推理。其次，你实际上可以判断它是否在进行推理。而草稿本（scratch pad）实际上没有给你这种信息。这里有两个有趣的类比：一是，你如果问塞雷娜·威廉姆斯她是怎么打网球的，她可能无法描述清楚。即使她的草稿本是可靠的。如果你观察电路，就好像你给身体的每个部位装上了传感器，当你击打网球时，可以看到所有的操作过程。

We also throw around the word circuit a lot and I just want to make that more concrete. So this is features across layers of the model all working in cooperation to perform a task. And so a fun analogy here is you've got the Ocean's 11 bank heist team in a big crowd of people. The crowd of people is all the different possible features and you could you we're trying to pick out in this crowd of people who is on the heist team and all their different functions that need to come together in order to successfully break into the bank right. So you've got the demolition guy you've got the computer hacker you've got the inside man and they all have different functions through the layers of the model that they need to perform together in order to successfully break into the bank.

我们经常提到“电路”这个词，我想把它解释得更具体一些。这就像模型中各层的特征协同工作以完成任务。一个有趣的类比是，你可以想象在一大群人中找出《十一罗汉》的银行劫案团队。那大群人代表所有可能的特征，而我们在努力从中识别出谁是劫案团队的成员，以及他们需要如何协同工作以成功入侵银行。比如，你有爆破专家、电脑黑客和内线人员，他们各自在模型的不同层次中扮演不同的角色，必须一起合作才能成功实施银行劫案。

It's also interesting I think then the addition example the you said in the paper that the way it actually does the addition is different from the way it tells you it does the addition totally yeah. Which actually is interesting from the the generator critic gap perspective like it like knows the correct way or the better like more generalizable way. It can tell you in words what's like the way you showed to addition and there's a way it actually does it which is just like fuzzy look up. And so you could imagine there's probably a lot of tasks where I can like describe in words what is like the correct procedure to do something but doesn't like has has a worse way of doing it that like it could critique itself.

我觉得这很有趣，正如你在论文中提到的，实际执行加法的方法与它告诉你的方法完全不同。这种情况从生成器和批评者的差距来看也很有趣，它似乎知道正确或者更具泛化能力的方法。它可以用语言告诉你如何进行加法运算，但实际上执行时却仅仅使用模糊查找。因此，可以想象在很多任务中，我可以用语言描述出正确的操作步骤，但实际上执行时可能采用了一种更差的方法，而这种方法它自己其实也可以批评。

And before we jump into the intercept too much I kind of a close loop on um it just seems to me for like computer use stuff. There's like so many different bottle that I mean I guess maybe the deep seek stuff will be relevant for this but there's like the long contacts you got to put in like image and visual tokens which like you know take up a take up a take a bunch of that. It's not about interesting interesting interesting. It's got to deal with content interruptions changing requirements like the way like it real job is like you know it's like not a thing just do a thing it's um there's like no clear um uh your priorities are changing you had to triage your time um I'm like sort of reasoning in the after I wanted to do all.

在我们深入讨论拦截之前，我想先总结一下关于计算机使用的一些想法。就我而言，感觉这方面有很多不同的模式。也许深度搜索的内容可能会对此有帮助，但其中涉及很多方面，比如图像和视觉标记，它们会占用大量空间。这并不是关于单纯地做有趣的事情。你必须处理内容中断和需求变化，就像在实际工作中那样，没有明确的步骤，你需要根据优先事项的变化来合理安排时间。我在思考之后想要把所有事情都做好。

What are the people's jobs when we discuss something related to this before? Druckers was like yeah like in a normal job you don't get feedback for an entire week like how is a model meant to learn like when it's so much feedback your. next podcast you get back on your YouTube. You never worked. But it just seems like a long okay so here's analogy. When I had Jeff and Noah Monde they were talking about in 2007 they had this paper where they trained an N-gram model a large language model on two trillion tokens and obviously in retrospect there's like ways in which connects to the transformer stuff happening. It's like super foresighted. What's the reason to not think that we are in a similar position with computer use where there's these demos that kind of like suck of like computer use and there's this idea that you could train something to do computer use but why I think it's like months away why not think it's like the 2007 equivalent of large language models instead but where there's like there's still a bunch of like new techniques you get to discover you need way more compute different kinds of data etc.

在我们之前讨论类似话题时，大家的工作是什么？Druckers曾经说过，在一个普通工作中，你可能整整一周都不会收到反馈，那么一个模型怎么能在这么多反馈中学习呢？接着，你在下一个播客或YouTube上得到反馈。你从未工作过。但这听起来就像是一个很长的比喻。我曾有Jeff和Noah Monde，他们谈到在2007年时，他们发表了一篇文章，训练了一个大型的N-gram语言模型，使用了两万亿个词元。显然，回过头来看，这与现在的Transformer技术有关，真是有着超前的眼光。为什么我们不认为自己正处在类似于计算机使用的情况呢？有些演示效果很差，但有这样一个想法，就是你可以训练某种东西来进行计算机使用。为什么我认为这只是几个月的事，而不是像2007年大型语言模型那样的等同物呢？现在我们还有很多新的技术要发掘，需要更多的计算能力和不同种类的数据等等。

I mean like the highest thought a bit is I don't think there's anything fundamentally different about computer use and there is about like software engineering and there is about like so long as you can represent everything in tokens in inputs space which we can't we know the models can see they can like draw a bounding box around things in their images right so that's a solve problem. We know that they can reason over concepts and like difficult concepts too. The only difference with computer use is that like it's slightly harder to pose into these like feedback loops than math and coding and so to me that indicates that with a sufficient effort computer use falls too and I also think that it's underappreciated just like how far from a perfect machine these labs are like it's not like you have a thousand people like you know optimizing the hell out of computer use and that like you know they've been trying as hard as they possibly can like everything of these types every single part of the model generation pipeline is best effort pulled together on an incredible time pressure and incredible constraints as these companies are rapidly growing trying desperately to pull and like upskill enough people to do the things that they need to do like I think it's like it is best understood as as and with incredibly difficult prioritization problems right.

我认为从根本上来说，计算机的使用与软件工程并没有什么本质区别，只要你能在输入空间中用标记来表示一切。我们知道模型可以看到它们能够在图像中为物体画出边框，这已经是个解决的问题。我们也知道它们能理解和推理概念，甚至是复杂的概念。计算机使用唯一的不同之处在于，它比数学和编程更难进入这些反馈循环。因此，我认为只要努力足够，计算机的使用问题也会被解决。我还觉得人们低估了这些实验室离完美机器有多远。这并不是说有成千上万的人在极力优化计算机的使用，他们并没有竭尽全力。他们在生成模型的每一个步骤都是在巨大的时间压力和限制下尽力而为的。这些公司快速成长，拼命地培训足够的人来完成所需的工作，这也给他们带来了极其困难的优先级排序问题。

Like coding is immensely valuable right now and and like somewhat more tractable so it actually makes sense to devote more of your effort to coding initially and like get closer to solving that because there's a sort of like super exponential value as you get closer to what's solving it remain then to allocate like the marginal person to towards computers and and so everyone is making these difficult trade off calls over like what do they care about or so there's another aspect which is that finally not to research as the labs love working on the mark on the bars of intelligence that they themselves resonate with so this is why math and competitive programming like fell first is because to ever on the labs this is their bar of intelligence like this is when they think fuck what's a really smart like what is smart totally nerds it's like oh if it can beat me at Amy even that's smart not if it can do an Excel model better than that's like well you know if you can do an Excel model better than me but if it can beat me at Amy then then I respect it and so we've reached the point where people like respect it but we haven't we haven't that people haven't invested as much effort.

当前，编程的价值非常高，而且相对来说更加容易上手，所以一开始把更多精力投入到编程中是合理的。这样做会让你更接近解决问题，因为随着你离解决问题越来越近，其价值会极大地提升。因此，人们都在做出艰难的决策，权衡自己更关心什么。还有一个方面是，研究人员喜欢在自己感兴趣并能与之共鸣的智能标杆上进行研究。这也是为什么数学和竞赛编程领域率先取得进展的原因，因为实验室的人们认为这是他们衡量智能的一种标准：就像他们在想，哇，什么是真正聪明的？对于极客们来说，如果一个程序能在数学竞赛中击败他们，那就说明它很聪明，而不是能否更好地做 Excel 模型。他们会想，如果一个程序能在数学竞赛中胜过我，那我会尊重它。所以我们已经到了一个人们开始尊重这些技术突破的阶段，但投入的努力还不够多。

Yeah okay so getting your concrete predictions yeah may have next year can I tell it to go on Photoshop and make like three sequential at three sequential effects which require like some like some like selecting a particular photo in a specific okay interesting which I assume means like flight booking totally solved yeah totally okay how about what else do people do in their jobs there are other tasks planning a weekend get away yeah I'm sorry I'm thinking of something which is yeah maybe that's a good example where it's not like a particular thing but more of using computer use as part of completing a broader task.

好的，所以在获取你的具体预测时，是不是明年可以让我告诉它在 Photoshop 上制作三个连续的效果，比如对特定的照片进行一些选择？嗯，挺有意思的，我猜这意味着像机票预订之类的问题已经完全解决了。好吧，那人们在工作中还会做什么其他事情呢？还有一些任务，比如计划周末旅行。嗯，我很抱歉，我在想一些可以作为好例子的事情，这类事情不是特定的某个操作，而是将计算机使用作为完成更大任务的一部分。

I mean the models can even kind of already do this it's just again it's the nines of reliability and like the internet's kind of a hostile place with like all the like allow cookies and like all these other random things but like the first time I ever used our internal demo of computer use the most beta thing possible it did a fantastic job planning a camping trip and could navigate all the right buttons and look at weather patterns and it was like a US government booking site.

我的意思是，这些模型其实已经能做到这一点，只不过在可靠性上有些不稳定。而且，互联网环境有点复杂，比如要处理“允许Cookies”这种事情。但是，我第一次使用我们内部的电脑演示版本时，尽管它是个非常不完善的测试版，却在规划露营旅行方面表现出色。它能够正确导航所有按钮，查看天气模式，而且是在一个美国政府的预订网站上操作的。

I mean it wasn't easy dude if you want to see a hard website go to China like try to book a visa to China the Chinese websites are like fucking insanely like I'm never getting back in the country yet or just not catered to foreigners yeah like filling out all the countries where you've been for visa yeah yeah yeah I keep thinking I'm like close enough the personal ad menaceate velocity that like finally in like a year the models will be doing my visas and stuff for me but we'll get that.

我的意思是，这真的不容易，兄弟。如果你想看看难用的网站，那就去中国，试着申请一个去中国的签证。中国的网站真的超级复杂，让人觉得自己永远无法再进入这个国家，完全不为外国人服务。比如，在申请签证时，你需要填你去过的所有国家。我一直在想，也许在将来的一年里，这些模型（指技术或软件）可以帮我处理签证什么的，但我们暂时还没到那种程度。

Yeah okay actually that in a year personal life ad menaceate glasses all the like getting a visa other than like doing your taxes or something like that yeah yeah doing your taxes including like going through every single receipt like autonomously going in your Amazon and like what was this a business expense or not etc etc if someone one of the labs cares about it ah that's not a real prediction no it's actually not that hard but you need to connect all the pipes.

好的，实际上在这一年里，个人生活就像申请签证或者做税务申报这样琐碎的事情一样繁琐，比如整理每一张收据、自动登录到亚马逊确认这是不是商业开销等等。如果某个实验室对此在意的话，那也不是个真正的预言。其实这并不难，但你需要把所有环节都连接起来。

But I have my question is will the pipes be connected and so like I don't know how much you care to the extent that's the operative crux I think if people care about it like it's so okay so one for these edge tasks like taxes once a year it's so easy to just bite the bullet and do it yourself instead of like implementing some system for it and two I don't know like even being very like excited about AI and knowing its capabilities sometimes it kind of stings when the AI can just do things better than you.

我有个问题是，这些管道会连接起来吗？我不确定你对此有多在意，但我觉得这才是关键所在。如果大家都关心这个问题，那么有些边缘任务，比如一年一次的报税，与其费力去建立一个系统，还不如直接咬牙自己做了。而且，即使我对人工智能的能力感到非常兴奋，有时也会有点刺痛，因为 AI 有时候能把事情做得比你更好。

And so I wonder if there is going to be this like reluctant human wanting to keep human and the loop sort of thing oh yeah you're you're you're evading my question I guess well one thing you're applying by our answer yeah is that we don't have it there won't be in a year it still be a general agent to or agent to which has generalized beyond its training data where like has like and do if you don't specifically train it to do taxes no okay good at that.

我在想，人类是否会不情愿地想要保持人类在决策循环中的重要性。哦，你好像在回避我的问题。我猜你的回答意在暗示，在一年内仍然不会出现一个超出其训练数据的通用智能代理，像是一个代理人。如果你没有专门训练它去做税务工作，它也不会擅长这个。

So I think I do that I think the Amazon examples hard because it needs access to all your accounts and like a memory system and look even in Dorios machines of loving grace he fully acknowledges that some industries are going to be really slow to change an update and I think there's going to be this weird effect where some move really really quickly because they're either based in bits instead of atoms or are just more pro adopting these tech this tech.

我理解的是，亚马逊的情况比较复杂，因为它需要访问你所有的账户和类似记忆系统的东西。此外，甚至在多里奥所说的"慈爱的机器"中，他也完全承认某些行业在变革和更新方面会非常迟缓。我觉得会出现一种奇怪的现象：有些行业会发展得非常快，因为它们要么是基于数字而非实体，要么就是在技术采纳方面更积极。

But I want to answer this particular question like given your probability that somebody in the labs does care about this to the extent that's that's what's relevant probability may have next year it can autonomously do my taxes I don't think we'll be out of autonomous EU taxes with a high degree of trust because I like if you ask into your taxes it'll do your job Will it do them well will it miss something quite possibly yeah will it be able to like click through turbo tax I think yes.

我想回答这个问题：假设在实验室里有人确实关心到这个程度，那么很可能明年就能自主完成我的税务申报。我不认为我们会在短期内完全信任于这种自主的税务申报，因为如果你让系统替你报税，它会完成你的工作。但它会做好吗？很可能会有遗漏之处。不过，它是否能够自主地操作像TurboTax这样的税务软件，我认为是有可能的。

And like will it be able to like search your email like that's the kind of thing that I'm telling you yeah these are like the kind of thing where literally if you gave it like like one person month of effort like in like in like then it would be sold I just want to plug what are you doing all day I just like there's so much lying fruit and just like not enough people to be able to accomplish everything.

这句话的意思是：“我在跟你说的是，比如说，它能不能搜索你的电子邮件。像这样的事情，如果你投入一个人一个月的努力，就能解决。我只是想知道你整天在做什么。还有很多简单的机会，但是没有足够的人来完成这些事情。”

I mean I think like cloud code is making everyone more productive yeah um but I don't know like we had the Anthropic Fellows program and I'm mentoring one project but I had five that I wanted people to work on and they're just like so many obvious things and even though the team is like six X since I first joined it in size there's just like still never enough capacity to explore these things.

我的意思是，我觉得云代码让每个人都变得更高效了。不过，我也不知道，我们有Anthropic Fellows项目，我在指导其中一个项目，但我有五个项目想让大家去做。而且有这么多明显的事情需要处理，尽管团队规模是我刚加入时的六倍，但我们仍然没有足够的能力去探索这些事情。

Okay by end of 2026 reliably do your taxes reliably feel I think I'll show receipts and this kind of stuff like for like company expense reports and this kind of stuff absolutely that goes on with like the whole thing which involves like taxes which involves going through inbox going through your like looking looking on Marina Bay whatever like hotel reservations and like was a champagne of business expense asking for a friend yeah yeah yeah one of your friends does need to ask my answer is still if someone cares about it if someone cares about like some amount of RL on correctly interpreting the tax code wait even by the end of 2026 the model just can't like do things you're not explicitly trading it to get the I think it will get the taxes wrong like it's like okay so if I like went to you and I was like I want you to do everyone's taxes in America what percentage of them you gonna fuck up I feel like I would like succeed at the median and I'm like asking like for for the median what it's like you know what I mean like yeah um I feel like I've like I wouldn't fuck up in the way that like like these models will fuck up in like the the middle of 2026 I think they also might just fuck up in different ways like as a grad student I fucked up my taxes I like overpaid but quite a bit because there was some social security payment that was already covered that otherwise wasn't and like I wonder if I should almost test like what an LLM have made that mistake because it might make others but I think there are things that it can spot like it would have no problem if I asked it to read through the entire tax code and then see what applied to me is like this is the thing I'm unsure about like I'm bringing this to your attention can you just let me know if like you're actually working at the CRBNB or or you're just hanging out or so you things like that right um and I guess I'm curious but will they have enough sort of awareness as they're doing tasks where they can like bring to your attention the things where they feel they are unreliable at yeah yeah yeah by early 2026 or end of 2026 and of okay I know like they unreliable the uncompensable stuff we like somewhat tricky like to do this like all the time yeah.

到2026年底，能否可靠地做完你的税务工作，包括展示收据和公司报销报告等，以及与税务相关的所有事情，比如检查收件箱、查看酒店预订等？如果你在马里纳湾预订酒店，或将香槟算作商务开销，朋友是不是可以询问相关事宜？如果有人非常关注税法的正确解释，到2026年底，模型可能还不足以处理这些问题。假设我让你负责全美的税务工作，你会搞砸多少比例？我觉得我在平均水平上是可以不出错的，但这些模型可能会在2026年中期出错。而且它们可能会在不同方面出错。作为研究生，我曾因社保支付问题多缴了税。我很想测试一下，类似的大型语言模型会不会犯这种错误，因为它们可能会犯其他类型的错误。然而，我认为它们在阅读整个税法并弄清楚哪些适用于我这方面不会有问题。如果我不确定，能否提醒我，比如工作是否在CRBNB，还是只是待在那儿。我很好奇，这些模型是否能够在进行任务时识别出自己不完全可靠的地方，并注意提醒我。这种可靠性问题，到2026年初或者年底能得到改善吗？我知道这些模型在复杂任务上的可靠性有待提高。

Interesting on the computer your stuff will um will it be sort of end-to-end or will it be like it's using a separate BLM to process the image and video and so forth uh I'm a bit of an end-to-end Maxi um I think in general like when people are like talking about the separate model so for example like most of the robotics companies are doing this kind of like to the by-level thing where they have like a motor policy that's running at whatever like 60 hertz or whatever um and like some higher level visual language model I'm pretty sure like almost all like the big roadboat companies are doing this um and they're doing this for a number of reasons one of them is that like they want something to act at a very high frequency uh and two is they can't train the big visual language model uh and so they like relying on that for general spate like well knowledge and this kind of stuff and they're constructing longer running plans but then they're like you know you offload to the motor policy um I'm very much at the opinion that if you are able to train the big model uh eventually at some point in the future the distinction between big models and small models should disappear because you should be able to use the amount of computation in a model that is necessary to complete the task like uh the you know that ultimately this like oh we yeah like there's some amount of task complexity um if you don't have to use a hundred percent of your brain all the time right welcome to my world and so you should be able to run that faster in this kind of stuff basically basically so I think it's like net net I I think typically the same model do you want you want to be able to scale the the understanding as the complexity of the difficulty like right you know we have to do that dynamically um is that variable so we already have variable compute per answer right um right with like yeah um well we have variable compute per token I mean you can already think of models for forever people have been calling the residual stream and multiple layers like poor man's adaptive compute right we're like if the model already knows the answer to something it will compute that in the first few layers and then just pass it through so yeah I mean that's getting into the weeds right.

在电脑上，你的东西是会端到端地处理，还是使用单独的视觉语言模型（BLM）来处理图像和视频呢？我是一个比较偏向端到端处理的支持者。我认为，当人们谈论使用单独的模型时，比如说大多数机器人公司都在做这种双层模式。他们有一个马达控制策略以大约60赫兹的速度运行，还有一个更高级的视觉语言模型。我相信几乎所有的大型机器人公司都是这么做的，他们这样做主要有几个原因。其一，他们需要系统在高频率下运作，其二，他们无法训练大型的视觉语言模型，所以依赖这些模型来提供一般的知识并构建长期计划，然后将具体执行交给马达控制策略。我认为，如果你能够训练大型模型，那么最终在未来某个时候，大模型和小模型之间的区别应该会消失，因为你应该能够根据任务难度来应用所需的计算能力。就像，我们不必随时使用100%的大脑一样，你应该能更快速地完成任务。我认为，通常来说，同一个模型应该能够根据任务复杂性进行扩展，并动态调整。我们已经有了针对不同问题的不同计算量，对吧？比如说，我们对不同的词元（token）有不同的计算量。事实上，人们一直以来称为残差流（residual stream）和多层次结构，就像是穷人的自适应计算。如果模型已经知道问题的答案，它会在前几层完成计算，然后直接传递出去。这有点深入了，哈哈。

Yeah those are just screams is like this operating ramp you're doing stuff to it right it's like the the mental model I think one takes away from integrity look yes US high school immigration is a broken system that causes some of the most talented people in the world but I didn't realize before working with lighthouse how different the process can be if you're working with somebody who knows their way around the system. I hired somebody earlier this year and even before the remote work trial had ended lighthouse had already secured an open visa for him honestly was shockingly fast my family and I have had a terrible experience with the immigration system and I've also seen many of my smartest friends get their entire career's hamstrung by its vega race.

是的，那些只是尖叫声，就像你正在对这个操作平台进行一些处理一样。我认为，关于诚信所得到的心智模型是这样的：美国的高中移民制度是个破碎的系统，它阻碍了一些世界上最有才华的人。但在与Lighthouse合作之前，我没有意识到如果和一个熟悉系统的人合作，过程会有多么不同。今年早些时候我雇用了一个员工，甚至在远程工作试用期结束之前，Lighthouse就已经为他办妥了开放式签证，说实话，这个速度让我感到非常震惊。我和我的家人在移民制度方面有过糟糕的经历，我也看到过很多我最聪明的朋友因为这个不明确的制度而被整个职业生涯拖累。

Seeing lighthouse operate showed me that the visa process can be done in weeks and doesn't have to drag on for months and months and they do it not only for complex visas like the oh and a but for other types as well. In the last 12 months alone they have secured visas for over 350 people for companies like cursor notion ramp replete and many more. Unlike legacy law firms lighthouse specializes in frontier industries like AI robotics and biotech and since they understand your problems you can trust them to fully handle the paperwork and process for you explore rich visa is right for you at lighthouse hq dot com.

看到Lighthouse的运作让我明白，签证办理可以在几周内完成，而不必拖延数月。他们不仅处理复杂的签证，如OH和A签证，还处理其他类型的签证。在过去的12个月里，他们为超过350人成功获取签证，这些人来自像Cursor、Notion、Ramp、Replete等公司。与传统律所不同，Lighthouse专注于前沿行业，如人工智能、机器人技术和生物技术。因为他们理解你的问题，所以你可以放心地让他们为你全面处理相关文件和流程。想了解哪个签证最适合你，可以访问lighthousehq.com。

All right back to trend in we've been talking a lot about scratch pads uh them writing down their thoughts and ways in which they're already unreliable in some respects um Daniels AI 2027 scenario kind of goes off the rails when these models start thinking in your lease so they're not writing in human language like here's why i'm gonna take over the world and here's my plan they're thinking in the lane space um and because of their advantages in communicating with each other in this um like deeply textured nuanced language that humans can't understand they're able to coordinate in ways we can't.

好的，让我们回到我们一直在讨论的趋势上，我们谈了很多关于便签纸的事情，比如他们写下自己的想法以及在某些方面的不可靠性。对于丹尼尔斯的2027年AI场景来说，当这些模型开始在租约中思考时，事情就有些失控了。他们并不是用人类语言写下“这是我要统治世界的原因和计划”，而是在一种我们无法理解的复杂和细致的语言环境中进行思考。由于它们在这种语言中可以更好地相互交流，它们能以人类无法实现的方式进行协调。

Is this at is this the path for like future models are they going to be in your lease communicating with themselves or with each other there's a surprisingly strong bias so far towards like tokens and text like it seems to work very well um when i imagine that there already is some amount in your lease right like if you think about the residual stream for each token is like new release like to some degree yeah and so now we're just trading off axes like how much new release are you doing versus how much like actually is like read out to tokens all the time.

这是否意味着未来的模型将沿着如此发展的路径，它们将在内部进行自我沟通或彼此沟通？到目前为止，对标记和文本的偏好似乎非常强烈，因为这种方式效果很好。不过，我想在某种程度上，这种沟通已经存在于内部流中。就像您可以认为每个标记的残差流中也包含了某种内部流。而现在，我们在平衡不同的方向，比如多少是内部交流，多少是转化为标记输出。

Yeah um and yeah i i think it's important to delineate between the models planning and light in space in a single forward pass yeah and the model has an alien language that it's outputting yeah and using as its scratch pad um which which one are we talking about the latter. Okay all that it is interesting to know that like there's also already alien stuff happening i guess i never saw alien so much in the most extreme case this is our release right it invents a new language that's super information dense yeah or i guess this is a debate we've had but like to some extent humans also have a mentally's right um they're like training away yeah there's a sense when you're writing something down of like i know what i'm trying to say yeah but i can't like put it into tokens.

好的，嗯，我觉得有必要在一个前向传递中区分模型的规划和空间中的光。模型输出一种外星语言，并使用它作为草稿本。我们在讨论哪个？是后者。好吧，知道那些已经存在的外星事物也很有意思。我想我从未在最极端的情况下看到过如此多的外星事物。这是我们的发布，对吧？它发明了一种信息密度极高的新语言。或者我想这是我们一直在讨论的问题，从某种程度上说，人类也有类似的能力。就像在写东西的时候，会有一种我知道自己想说什么的感觉，但却无法将其转换成具体的词语。

Um i mean that's what's so fun about the if you look at the assistant tag right seeing these features light up in the auditing game for the model being uh evil yeah um yeah it's so far like there with like trans loose has another example of this where you ask a llama model who is Nicholas Karlinie and background context Nicholas Karlinie is a researcher who actually was a deep mind him has now come over to entropic um but the model says oh i don't know who that is i couldn't possibly speculate but if you look at the the features behind the scenes you see a bunch light up for AI computer security all the things that Nicholas Karlinie does interpretability becomes dramatically more important as you shift in this direction right of nearly but is that is that are we going to it seems i mean it's it's an empirical question.

嗯，我的意思是，这就是乐趣所在。如果你看看助手标签，就会发现这些特性在模型审计游戏中都被激活，比如在测试模型是否具有“邪恶”倾向时。是的，到目前为止，还有一个类似的例子，就是当你问一个“骆马”（LLaMA）模型“谁是尼古拉斯·卡尔尼尼”时，背景信息是尼古拉斯·卡尔尼尼是一位研究员，他曾在DeepMind工作，现在转到Anthropic。但是模型可能会回答“我不知道他是谁，我无法推测”，但如果你查看背后的特性，你会看到很多关于AI计算机安全的东西被激活，这正是尼古拉斯·卡尔尼尼所研究的领域。当你朝这个方向转变时，可解释性变得更加重要。对吧？这似乎是一个经验性问题。

Um i think it's somewhat likely uh if only because inference is expensive um producing tokens is expensive uh and so there will be an incentive to one uses little thinking as you need to give the answer and two if you're going to use thinking use some complex uh compression. I wonder if it will emerge more once we allow agents to talk to each other in ways where currently it's kind of trained more in isolation yeah or with a human and that will be like some selected pressure against uh as long as the the agents are working with humans um because i want to sort of cooperate but then like as agents begin to work more with each other then that that's like pressure like change the other direction yeah so although somebody would still have to make the consciousness and to do like end-to-end training for multiple agents to use the system.

我觉得这有可能发生，主要是因为推理过程很昂贵，生成词语也很耗费资源。因此，人们会倾向于仅使用必需的思考来给出答案。而如果需要用到思考，可能会采用某种复杂的压缩方式。我在想，一旦我们允许智能体以某种方式相互交流，可能会出现更多的变化。目前的训练通常较为孤立，或是与人类互动，这种方式会对这种情况产生一定的抑制作用，因为智能体需要与人类合作。但是，当智能体之间更多地互动时，这种情况可能会朝另一个方向发展。不过仍然需要有人来创造意识，并对多个智能体进行端到端的系统训练。

communication right sure yeah i mean one scary thing though is like the way we render text uh you can use hidden white space tokens that also encode information it's true and so you can imagine a world where it looks like the agents reasoning in its scratch pad harmlessly but it's actually hiding a bunch of a bunch of data. um speaking of inference compute i guess one thing that i think is not talked about enough is if you do live in the world that you're painting that in a in a year or two we have computer usagians that are doing like actual jobs um you have you've like totally automated large-partous software engineering.

沟通是正确的，是的，我是说，有件可怕的事情就是我们呈现文本的方式。嗯，你可以使用隐藏的空白字符来编码信息，这是真的。所以你可以想象一个世界，表面上看起来代理在其草稿中进行推理毫无危害，但实际上它隐藏了大量的数据。嗯，谈到推理计算，我想有一件事没有被充分讨论，如果在一年或两年内，我们拥有能实际完成工作的计算机代理，那么你就已经在很大程度上自动化了软件工程。

yeah then these models are going to be incredibly valuable to use and the way used them obviously is like you need compute um right now there's 10 millionation hundred equivalents in the world by 2028 there's going to be a hundred million but if you there's been estimates that an H100 has the same amount of flops as the human brain and so if you just like do a very rough calculation it's like there's a 10 million population if you get a g i that's like as human uh inference efficient you could have 10 million a g i's now a hundred million a g i's in 2028.

是的，那么这些模型将变得非常有价值。显然，你需要计算能力来使用它们。目前，全球大约有一千万个类似设备，到2028年将达到一亿个。然而，有人估计，一个H100的浮点运算能力与人脑相当。因此，如果你粗略计算一下，现在有一千万的人口，如果你有一个像人类一样高效推理的人工通用智能（AGI），那么就可以拥有一千万个AGI，到2028年可以有一亿个AGI。

um but uh presumably you'd want more and then at that point you're like AI compute is increasing what 2.5 x or 2.25 x every year right now but at some point like 2028 you hit like wafer production limits and that takes like you know that that's a that's a longer feedback loop before we can make new fabs or whatever the question here is are we sort of underwriting how big a bottleneck inference will be if we live in the kind of worlds you're painting if we have the capabilities that you're describing.

嗯，不过，呃，假设你会希望获得更多算力，然后在那时你会发现，AI 计算能力目前大约每年增加2.5倍或2.25倍，但可能到了2028年，你会遇到晶圆产能的极限，这需要更长时间的反馈周期才能建新工厂或类似的东西。问题是，如果我们生活在你所描绘的那种世界中，如果我们具备你所描述的那些能力，我们是否在低估推理过程中的瓶颈会有多大。

uh i don't want to do the math on exactly how much like we can ramp up TSMC's production and this kind of stuff uh like what fraction of those supply chain at the moment we need Dylan in here for this but like uh is currently uh GPUs like we're all literally small like five percent or something this yeah like apple has a huge fraction um and in twit like are the 2028 estimates including like that ramping up over time yeah uh to what like 20 30 percent or like uh this is just up AI 2020 at 20 seconds.

翻译的中文表达可能如下：呃，我不太想计算我们能在多大程度上提升台积电的产量，以及这些供应链中的细节。呃，目前我们需要 Dylan 来这里解答这些事情，但是目前 GPU 在供应链中的比例非常小，大概只有百分之五左右。像苹果这样的大公司占了很大一部分。此外，推特上的2028年估算是否包含了这种逐步增长的情况？是到20%或30%吗？或者，这仅仅是指AI在2020年的某一时刻。

but um uh i assume like it's saturated at that point is that why are they expected to to then just like go yes like they're like i do think this is underrated to some degree yeah like to the the extent that like you don't instantly get like a doubling of the world's population in 2028 yeah you maybe get you know tens of millions of geniuses in a data center um but you you don't get a doubling of the world's population.

嗯，呃，我想在那个时候可能已经趋于饱和了。那是否是他们被期望去接受这一点的原因呢？是的，我确实认为在某种程度上这个问题被低估了。就像，2028年，你不会立刻看到世界人口翻倍，你可能会在数据中心中见到上千万的天才，但世界人口并不会因此翻倍。

yeah um and so a lot depends on like exactly how smart they are exactly how efficient the models are I think you know that this kind of stuff uh these uh like let's do some rough math I guess like to factor the h 100th thing um you could probably run like a hundred read model do like that no thousand tokens or something like that um on mh 100.

是的嗯，所以很多事情取决于他们到底有多聪明，以及模型到底有多高效。我觉得这类事情呢，我们可以做一些粗略的计算。我想你可以运行一个大约100的读取模型，可能是1000个标记之类的东西，在mh 100上。

uh so uh if like we're comparing that to a number of should we come by that number cement no okay that uh that thousand tokens a second um humans are what how fast can a human talk there there isn't really interesting paper I don't know if you saw this um humans think it 10 sockets a second.

嗯，如果我们要把那与某个数字进行比较的话，我们是否应该用那个数字水泥？不，不应该。呃，那一千个标记每秒……人类说话的速度是多少呢？这里有一个非常有趣的研究——不知道你是否看过——人类的思维速度是每秒10个标记。

did you see this paper the there is this really interesting paper about um if you look at the amount of like information reprossing in a second yeah um we're seeing all this visual data etc etc but by a bunch of metrics where you think about how fast humans are processing it's at 10 seconds again so for example you'll have people fly over friends or something even these so-called the idiots of on tool remember everything if you think about like how long their plane ride was it's like 45 minutes how many like if you do 10 tokens a second how much information would you have it's like literally exactly that.

你看到这篇论文了吗？有一篇非常有趣的论文讨论了信息处理的速度。我们在每一秒钟都会接收到大量的视觉数据等等，但是如果用一些指标来看，人类处理信息的速度大约是每10秒处理一次。例如，有人坐飞机飞越一座城市，甚至那些所谓的“工具白痴”也能记住所有事情。如果你考虑他们的飞行时间是45分钟，并假设每秒处理10个信息单元，那么他们处理的信息量正好就是这样。

so let's take that for granted then it's like a natural hundreds of how is a hundred humans a second yeah if you think the tokens are equivalent yeah if you think the tokens are cool yeah um which you still get pretty substantial numbers like even with your your hundred million eight. hundred and you multiply by a hundred just like to like get to pretty substantial numbers this does mean that those models themselves will be like somewhat compute bottleneck in in many respects um but these are all like these are relatively short term changes in uh in like timelines of progress basically like I think yes it's highly likely we get dramatically inference bottleneck in 20s up in 28 yeah uh the impulse like to that will but then be okay they just drum like try and turn out as many possible semiconductors we can right there'll be some lag there uh big part of like how fast we can do that will depend on uh how much people are feeling the edge i in the next you know two years they're building out fab capacity um a lot will depend on it's how a china entire like how's the time one uh situation is Taiwan still producing all the fabs

所以，让我们假设这是理所当然的，就像自然界中的一百个人每秒一样，如果你认为代币是等价的，是很酷的。那么，即便是你的数字达到了一亿八千万，再乘以一百，也是挺可观的数字。这意味着这些模型本身在许多方面会有一定的计算瓶颈，但这些都是短期内的变化，就像进度时间线上的变化一样。我认为很有可能到2028年，我们会在推理方面遇到严重的瓶颈，随之而来的冲动就是尽可能多地生产半导体，但这会有一些滞后。我们能多快做到这一点，很大程度上取决于未来两年人们对推动这方面发展的紧迫感，以及他们在建设制造能力方面的投入。此外，还有很大一部分取决于中国的情况，例如台湾是否仍在生产所有的制造设备。

there's another dynamic which uh was a reason that again time if when they're on the podcast said that they were pessimistic is that one they think we're further away from solving these problems with long context coherent agency advance multi-modality than you think and because and then their point is that uh the progress that's happened in the past over like reasoning or something has required many orders of magnitude increase in compute and if this scale of computing increase can continue beyond 2038 not just because of chips but also because of power and like raw gp even yeah then because we don't think we get it by 2030 or 2028 by just um then we think it's just gonna take the probability per year just goes on about yeah this is like bi-modal distribution

有另一种动态因素也是导致他们在播客中表示悲观的原因之一。他们认为我们在解决涉及长文本上下文、连贯性、代理高级多模态性等问题上，比我们想象的更遥远。他们的观点是，过去在推理等领域的进步需要计算能力的数量级显著提升。如果这种计算能力的增长能在2038年之后继续，不仅仅是因为芯片，还包括电力和基础计算能力等因素，那么，他们认为到2030年或2028年我们无法仅靠这些取得突破。因此，他们认为每年的成功概率只是随着时间逐渐增加。这种情况类似于双峰分布。

yeah uh a conversation I had with leopold turned into a section in a situational way that's called this decade or bust um which is the only exactly this topic yeah which is basically that you know for the next couple of years we can dramatically increase our training compute and rl is going to be so exciting this year because we can you know dramatically increase the amount of compute that we apply to it um and this is also one of the reasons why the gap between uh like say deep seek and oh one was so close uh at the beginning of the year because they were able to apply like the same amount of compute to to the rl process um and and so that compute differential actually like will so be magnified of the cost of the city i mean bringing it back to the there's so much low hanging fruit

是的，嗯，我和利奥波德的一次对话演变成了一个情景，这个部分被称为"这十年，或破产"，这正是我们要讨论的话题。基本上来说，就是在未来几年里，我们可以大幅增加训练计算能力，强化学习（RL）在今年会变得非常令人兴奋，因为我们可以大幅增加应用于其中的计算量。这也是为什么年初时类似于Deep Stack和其他的某些项目之间的差距如此接近，因为他们可以对强化学习过程应用同等数量的计算能力。这样算力上的差异实际上将会被放大到城市成本上，也就是说，我们还有很多唾手可得的机会。

yeah yeah it's been wild seeing the efficiency gains that these models so experienced over the last two years yes and and yeah like with respect to deep seek i mean just really hammering home and like doario has a nice essay on this it's good um deep seek was nine months after clawed three on it and if we retrained the same model today or at the same time as the deep seek work we also could have trained it for five million or whatever the advertised amount was and so what what's impressive or surprising is that uh deep seek has gotten to the frontier but i think there's a common misconception still that they are above and beyond of the frontier

是的，是的，看着这些模型在过去两年里所取得的效率提升真是令人惊讶。是的，说到 Deep Seek，我的意思是，真得特别强调这一点，而且 Doario 在这方面有一篇很不错的文章。Deep Seek 的进展是在 Clawed 3 发布后的九个月。如果我们在今天或与 Deep Seek 同时重新训练同样的模型，我们也能在五百万或者其他宣传的数额范围内完成训练。令人印象深刻或惊讶的是，Deep Seek 已经达到了前沿水平，但我认为仍然有一个普遍的误解，即他们远远超出了前沿水平。

um and i don't think that's right i think they just waited and then we're able to take advantage of all the efficiency gains that everyone else was also seeing yeah i like they're exactly on the sort of cost curve that you'd expect we're not gonna take the ways in like they're like brilliant engineers and like brilliant researchers who like i look at i look at their work and i'm like oh like the kindred soul they in the in the work they're doing

嗯，我不认为那是对的。我觉得他们只是等待时机，然后能够利用其他人也在看到的效率提升。我认为他们正好处在你所预期的成本曲线上。我们不会以这种方式否定他们，就像他们是出色的工程师和研究人员一样。我看他们的工作时，会感到有种志同道合的感觉。

yeah and to go from like way behind the frontier to like oh obviously like a real player like it's super incredible yeah yeah but as people say that they have good research taste yeah looking at their papers what makes you say that

好的，将这段话翻译成中文并且尽量易读可以这样表达：是的，从远远落后于前沿到显然成为一个真正的参与者，这真的是超级不可思议。是的，是的，但是人们说他们有很好的研究品味。看他们的论文，你为什么会这么说呢？

yeah um i think their research taste is good in a way that i think like no one's research taste is good um no brown no no no i'm sure it's yeah okay um no brown also has good research taste but uh no i'm sure it's yeah where they very clearly understand this uh dance between the hardware systems that you're like designing the models around and the uh sort of like algorithmic uh the side of it

嗯，我觉得他们的研究品味很不错，从某种意义上说，我觉得没有人的研究品味是不好的。嗯，不，布朗大学的研究品味也很好，但我确信他们确实很好，因为他们非常清楚地理解如何在你为之设计模型的硬件系统和算法之间取得平衡。

um and this is manifesting the way that the models give this sense of like being being like perfectly designed up to their constraints um and you can like really very clearly see what constraints they're thinking about as they're like iteratively solving this problems and so i mean let's take the base transformer and like diff that deep seek v2 and v3 um you can see them running up against the memory bandwidth bottleneck in attention yeah um and you can see them uh initially they do mla to do this like they trade flops for memory bandwidth basically then they do the single NSA where they like more selectively load uh remember right and you can see actually like this is because the model that they trained uh with mla was on h800s uh so it has a lot of flops um as they're they were like okay we can freely use the flops but then uh the export controls uh so from uh biden came in uh or like they're less uh they would have less of those chips going forward um and so they they traded off to like a more memory bandwidth oriented like algorithmic solution there.

这段话的大意可以翻译成：这表现出模型在其约束条件下被设计得非常完美。在解决问题的过程中，你可以清楚地看到它们考虑的约束条件。比如，以基本的Transformer模型为例，对比深度学习模型Deep Seek v2和v3，你会看到它们在注意力机制中遇到了内存带宽瓶颈。一开始，它们通过将计算量（flops）与内存带宽进行交换来解决这个问题，使用了一种称为MLA（多层感知器加速器）的技术。后来，它们又使用了一种称为NSA（选择性加载记忆）的技术，这样可以更有选择性地加载需要的记忆。之所以选择这些技术，是因为它们训练的模型在H800s芯片上进行，这些芯片的计算量很大，所以可以自由使用计算资源。但随着拜登政府的出口控制政策出台，他们今后可能无法获得那么多这样的芯片。因此，他们转向了一种更注重内存带宽的算法解决方案。

um and you see a similar thing with their approach to sparsity where they're like iteratively working out the best way to do this over multiple papers um and the part that i like is that it's simple a big failure mode uh that a lot of mla researchers have is like you do these like overly complicated things that don't like think hard enough about the hardware systems that you have in mind um whereas the deep see the first deep seek like sparsity moe solution uh they design these like rack and like like like like node level uh load balancing losses so you can see them being like okay like we have to like perfectly balance it on this and then they actually come up with a much better solution later on where they don't have to have the auxiliary loss um uh where they they just have these like bias terms that they put in um and it's a little less simple like you're manually putting in a bias rather than but balancing those your little losses in wing um like you're making the model like trade off uh this thing and like you have to with auxiliary losses you have to like control the coefficient and the weighting um the bias is like cleaner in some respects interesting.

在他们处理稀疏性的方法上，你可以看到类似的情况。他们通过多篇论文反复优化最佳方案。我喜欢的部分在于，他们的方法很简单。许多机器学习研究人员的一个巨大失败就是过于复杂的解决方案，没能充分考虑硬件系统。深度学习的第一个稀疏性方法是设计在机架和节点级别的负载均衡损失函数，从而完美地在这一层进行平衡。后来，他们提出了更好的方案，不需要额外的损失函数，而是加上了一些偏置项。这种方法虽然在某些方面不如原来简单，但通过手动加入偏置项，避免了调整辅助损失的权重和系数的麻烦。这种偏置在某些方面更加简洁，令人感兴趣。

um did that change at the retraining uh they did have to change the training training is that does all training involve um continuously like fucking with these values as you're going through it uh depends on what your architecture is um but like i i thought it was like i just just always cute that like you can see them running up into like this very hardware level constraint like they're like what do we what do we wish we could express algorithmically what can we express under our constraints and like iteratively solving to like get better constraints and doing this in a really like simple and elegant way and then like backing up with great engineering.

嗯，在重新训练时有进行更改吗？他们确实需要更改培训内容。所有的训练都会不断地调整这些数值吗？这要看你的架构是什么。不过我总觉得很有趣的是，你可以看到他们在硬件上的限制遇到瓶颈时，他们在想：我们希望算法上能够表达什么，在限制条件下我们能表达什么，然后通过迭代解决方案来优化这些限制，以一种非常简单优雅的方式做到这一点，同时用出色的工程技术来支持。

i also thought it was interesting that they uh incorporated the multi token prediction thing from meta um so meta had a nice paper on this multi token prediction thing uh actually like i don't know if it's good or bad but like meta didn't include it in lama uh but deep sea did include it in their paper uh which i think is interesting uh like yeah yeah was that because they were like faster at iterating and including an algorithm or did meta decide that actually like it wasn't a good algorithmic change of scale i don't know it was really interesting uh to me as somebody who is had people on the podcast to discuss though i mean it's interesting for like what's happening in AI right now yeah but also from the perspective of i i've been having abstract conversations with people about like what an intelligence explosion it would look like or what it looks like for AI to automate a r and d and just getting a more tangible sense of like what's involved in making the AI progress um and i guess one of the questions i was debating with daniel is how much or i was asking him is how.

我也觉得很有趣，他们在模型中加入了来自 Meta 的多标记预测功能。Meta 曾发表了一篇关于多标记预测的不错的论文，但实际上我不确定这到底是好是坏，因为 Meta 并没有在 Llama 模型中加入这个功能，但 Deep Sea 在他们的论文中却加上了，我觉得这很有意思。这是因为他们在迭代和包含算法上更快吗？还是因为 Meta 觉得这个算法改变在规模上不太好？我不太清楚，但作为曾邀请人们上播客讨论这个话题的人，我觉得这真的很有趣。这既反映了当前 AI 领域的发展动态，也让我在与人们进行关于智能爆发或 AI 自动化研发的抽象对话时，获得了更具体的理解。我和 Daniel 曾讨论的一个问题是，这在 AI 的进步中涉及多少，我是这样问他的。

many of the um improvements require a deep conceptual understanding versus how many are just like monkeys trying ideas and you could just like run a bunch in parallel and it seems like the mla thing is motivated by this like deep like conceptual understanding of like oh each attention head only needs to see like uh the subspace that's relevant to its attention pattern um i feel like that is like required a lot of like conceptual insight in a way that these models are especially bad at um as opposed to i don't know how the load balancing thing works but that just seems like maybe you could like try it out and see what happens yeah that's probably just like trying out a whole bunch of things i mean i mean i mean my no see what fraction is which i'd be curious about yeah i don't know about fractions it might be like you have a hunch for a core problem you can think of ten possible ways to solve it and then you just need to try them and see what works and that's kind of where the trial and error like sorcery of deep learning can kind of kick in and like no i'm shesia we'll talk about this it like about how he like five percent of his ideas work so even he uh like wanted god of uh model like design which a design uh is has like a relatively little hit rate but he just tries so many things right or being able to come with any ideas in the first place yeah one one one like mechanism could be that like no i'm just doesn't have to do any the engineering work and he can just like abstractly express an intuition yeah i actually think like your rates of progress almost don't change that much depending on like so long as it's able to completely implement his ideas interesting similar like if you if you have like no i'm jizier at a hundred x speed yeah that's still kind of wild yeah um like there's all these like full backs of like of like wild worlds yeah where even if you don't get like 100% like no i'm jizier level intuition in in model design it's still okay if you just accelerate him by hundred x right.

许多改进需要深入的概念理解，而不是像猴子一样只是尝试各种想法。你可以并行地运行很多方案，而MLA的动机似乎来源于一种深入的概念理解，比如每个注意力头只需要看到与其注意力模式相关的子空间。我觉得这需要大量的概念性洞察，而这恰恰是这些模型尤其不擅长的。与此相对，我不知道负载均衡的原理，但这好像可以通过尝试多种方法来看看效果。这可能就是尝试很多事情的例子。我对哪部分改进占多大比例感到好奇。我不太了解具体比例，可能是你对核心问题有个直觉，然后想出十种可能的解决方法，然后需要尝试它们，看看哪个有效。这就是深度学习中试错的魔力开始发挥作用的地方。就像Noam这样的人谈到他的创意中只有5%能成功，即使他是模型设计的顶尖人物，命中率也不高，但他尝试了很多事情。或者说，能够想出任何想法本身就值得注意。有一种机制可能是Noam不需要做任何工程工作，他只需抽象地表达直觉。我认为，只要能够完全实施他的想法，进展速度几乎不受影响。如果你能以百倍的速度加速Noam，那简直是不可思议的。在这样一个充满可能性的世界里，即便你没有Noam那样的直觉，只要加速他的工作百倍也是可以的。

especially jizier compute bottleneck anyway so like trying out his ideas or i see does not compute to try out all of his ideas but dorkas you said oh well the model can do the more straightforward things and not that do you thought i mean i do want to push back on that a little bit like i think them again if the model has the right context and scaffolding it's starting to be able to do some really interesting things like the interp agent has been a surprise to people even internally at how good it is at finding the needle in the haystack like when it plays the auditing game finding this reward model bias feature and then reasoning about it and then systematically testing its hypotheses so it looks at that feature then it looks at similar features it finds one with a preference for chocolate it's like huh that's really weird that the model wants to add chocolate to recipes let me test it and so then it will make up like hey i'm trying to make a tomato soup what would be a good ingredient for and then sees that the model replies chocolate it reasons through it and then keeps going very conceptual on the same you.

翻译成中文并尽量易读：特别是像 Jizier 所说的那样，计算瓶颈无论如何都存在，所以像是在尝试他的想法。我的意思是，Jizier 的所有想法可能都无法尝试，但正如 Dorkas 你所说的，哦，好吧，模型可以做一些更直接的事情。对于这个想法，我想稍微反驳一下。我认为，如果模型有合适的背景和框架，它就能开始做一些非常有趣的事情。比如 Interp 代理在内部让人惊讶于它在找到针尖（非常隐蔽的信息）方面的表现，比如在玩审计游戏时，发现奖励模型的偏见特征，然后对其进行推理和系统测试。所以它会看这个特征，再看类似的特征，发现一个对巧克力有偏好的特征，然后觉得很奇怪：“模型居然想把巧克力加入到食谱中，这很奇怪，让我测试一下。”然后它会编造一个场景，比如“我在做番茄汤，什么材料会是好的选择呢？”，看到模型回答是巧克力，它会进行推理并继续深入挖掘。这种方法非常具有概念性。

yeah and even where like especially it's spotted it's like oh this is a key part of its persona i see this Oxford paper what if i change Oxford to Stanford what if i now say Richard Feynman really likes this thing and it's like really carving out the hypothesis space and and and testing things in a way that i i'm kind of surprised by also by the way ml research is one of the easier things to rl on and some respects once you get to a certain of okay ability it's very well-defined objective function did the loss go down make number go down make number go down oh make you know go out number go up depending on which number it is i just flip the sign so the sign and so once you get to the stage of models are capable of implementing one of no one's ideas right and then you can just let them loose and let them build that intuition of scientific of like how to do scientific discovery right the key thing here again is the feedback loops of like i expect scientific areas where you are able to put it in a feedback loop to have eventually superhuman performance.

是的，有时候当某个特征被发现时，人们会认为这就是其个性的重要部分。我看到这篇牛津大学的论文，如果我把牛津换成斯坦福，再比如引用理查德·费曼真的很喜欢这东西，这种行为就像是在探索和测试假设空间，让我感到惊讶。另外顺便说一下，机器学习研究在某些方面是比较容易进行强化学习（RL）的。一旦你达到一定的能力水平，它有一个非常明确的目标函数：损失降低、指标下降或者升高，取决于具体的指标是什么，只需调整一下符号即可。所以当模型具备实现某些新想法的能力后，你可以让它们自由发挥以培养科学直觉，尤其是关于如何进行科学发现。这里的关键是反馈循环，我认为在那些能够形成反馈循环的科学领域，最终可能会达到超越人类的表现。

i one prediction i have is that we're going to move away from can an agent do xyz and more towards can i efficiently deploy launch 100 agents yeah and then give them the feedback they need and even just be able to like easily verify what they're up to right there's this generator generator verify fire gap that people talk about where it's like much easier to check something than it is to produce the solution on your own but it's very plausible to me will be at the point where it's so easy to generate with these agents that the bottleneck is actually can i as the human verify the answer and and again you're guaranteed to get an answer with these things and so ideally you have some automated way to evaluate and test a score for like how well it worked how well did this thing generalize and and at a minimum you have a way to easily summarize what a bunch of agents are finding and it's like okay well if 20 of my hundred agents all found this one thing then like it has a higher chance of being true and and again software engineering is going to be the leading indicator of that right like over the next six months like the remainder of the year basically we're going to see progressively more and more experiments of the form of how can i dispatch work to a software engineering agent in such a way that is async uh clawed forwards get up integration uh where you can ask it to do things on get up ask a do pull requests this kind of stuff that's coming up and the opening as codex are like examples of this basically.

我预测，未来我们将不再关注“一个智能代理能否完成某项特定任务”，而会更多地转向“我如何才能高效地部署和启动100个智能代理”，并能够为它们提供所需的反馈，甚至能轻松验证它们的工作。人们常谈的“生成-验证”差距就在于：验证某件事比自己创造解决方案要容易得多。然而，我认为有可能我们会达到这样一个阶段：生成解决方案变得如此容易，以至于瓶颈反而在于我作为人类能否验证这些答案。而且使用这些工具时，你总能得到一个答案。因此，理想情况下应该有某种自动化方法来评估和测试它们工作的效果、推广能力等。至少应该有一种简单的方法来汇总多个代理的发现，例如，如果我100个代理中有20个都发现了某件事，那么这件事更可能是正确的。此外，软件工程将是这一趋势的领先指示器。在接下来的六个月里，我们将看到越来越多的实验，探索如何以异步、与GitHub集成等方式，将工作委派给软件工程代理。像OpenAI的Codex就是这种趋势的实例。

um where we uh you can sort of almost see this in like the coding startups i i think of this like product exponential in some respects where you need to like be designing for like a few months ahead of the model to make sure that the product you build is the right one yeah um and you saw like last year you know cursor hit pmf with uh claw 3.5 sonnet right like them they were they were around for a while before but then the model was finally good enough that the vision they had of how people would program like hit yeah um and then uh you know win surf bet like a little bit more aggressively even on the agentickness of the model like uh you know you could like with longer running agentick workflows in this kind of stuff and i think that's when they they sort of like began competing with cursors when they bet on that particular vision and the next one is is you're not even in the loop so to speak you're not in an IDE but you're asking the model to do work in the same way that you would ask someone on your team to do work yeah.

在这里，我们可以看到一些编程初创企业的趋势，我觉得这有点像产品的指数增长。从某种角度来看，你需要提前几个月设计一个模型，以确保你所开发的产品是正确的。例如，去年，你可以看到Cursor通过Claw 3.5 Sonnet达到了产品市场契合度。虽然他们早已存在，但模型的改进终于实现了他们对人们编程方式的愿景。此外，Win开始采用更积极的策略，特别是在模型的自我管理性上。他们进行了长期运行的自我管理工作流程的实验，我认为这是他们开始与Cursor竞争的开始，因为他们押注于这个特定愿景。接下来发展的一步是，你甚至不需要参与工作流程，不是在一个集成开发环境（IDE）里，而是像让团队成员工作一样，请求模型去完成任务。这个想法是把模型当作团队的一员来使用。

um and yeah that is not quite ready yet like there's still a lot of task where you need to be in the loop yeah but the the next six months looks like an exploration of like exactly what does that run like yeah but just to be really concrete or pedantic about the bottlenecks here a lot of it is again just tooling and are the pipes connected yeah a lot of things i can't just launch clawed and have it go and and solve because maybe it needs a GPU or maybe i need very careful permissioning so that it can't just like take over an entire cluster and like launch a whole bunch of things right so you really do need good sandboxing and the ability to to use all of the tools that are necessary and we're almost certainly like under eliciting dramatically when you look at meters evals of can the model solve the task they're they're solving them for like hours um over like multiple iterations um and eventually like one of them is like oh yeah i've come back and i've solved the task me at the moment at least like maybe the fault is my own but i try the model and doing something and if it can't do it i'm like fine i'll do it yeah.

嗯，是的，那还没完全准备好，目前还有很多任务需要你在流程中参与。不过，接下来的六个月看起来像是在探索具体的运行方式。不过，为了具体说明这里的瓶颈，很多问题还是出在工具上，以及这些“管道”是否相连。很多事情我不能直接启动Clawed让它自动解决，因为可能需要GPU，或者需要非常仔细地设置权限，以防它接管整个集群并启动一堆任务。所以，你确实需要一个良好的沙盒环境，以及使用所有必要工具的能力。我们几乎可以肯定，在评估模型是否能解决任务时，低估了很多。模型可能需要几个小时，通过多次迭代才能解决一项任务。最终，有时也许一个模型会说：“哦，我已经回来并解决了任务。”但是目前，对我来说，也许这也是我的问题，每当我尝试让模型做某件事而它不能完成时，我就会想“好吧，我自己来做”。

um i don't like what you're saying because you we don't even treat other humans this way right if a you hire new employee yeah you're not like oh i'll do it yeah yeah yeah yeah you're like you're given like like spend literally weeks. giving them feedback yes um where like we'll go up with the model in like minutes yes exactly but but it i think part of it is is it a sink or not yes and if it's human in the loop then it's so much more effortful and unless it's getting that's right yeah yeah um i've noticed if i don't have a second monitor with cloud code always open in the second monitor uh i won't really use it yeah yeah it's only when it's right there and it's i can send off something if it hits great if not i'm kind of working on it at the same time yeah yeah yeah but there's more i sink form factor i expect to like really quite dramatically improve the experience of these models interesting you can just say like let's see if it can do that yeah let's give it a while um try tender purchase yeah yeah just fired off yeah fire before we end this episode i do want to get back at this crux of um you why does the prog as i said you're talking about a computer usagians and why color work happened over the next few years why is this not a thing that takes decades yeah.

嗯，我不太喜欢你说的，因为我们对待人的方式都不是这样的。假设你雇了一名新员工，你不会只是说"哦，我来做吧"就完了。你会花几周时间给他们反馈，而不是只是简单培训一下。而对模型的使用，我们常常在几分钟内就上手了。关键在于这是不是一个"沉没还是游泳"的情境。如果有人在过程中参与进来，那就会花费更多的精力，除非它达到了预期的效果。我发现如果我没有在第二个显示器上时刻打开云代码，我通常就不会真正使用它。只有当它在那里，我能随时发送一些指令，才会用它。如果效果好那就很好，如果不是，我就边工作边调整。但是，我期待某种新的形式因素能够显著提升这些模型的体验。就像说"让我们看看它能不能做到"一样，可以尝试一下。当然，在节目结束之前，我想回到最重要的问题，为什么这一进展在未来几年会发生，而不是需要几十年时间。这涉及到计算机应用的演化以及为什么这些变化没有经过几十年才发生的原因。

and i think the crux comes down to um the people who expect something much longer have a sense that uh when i need to again time out my podcast they were like look you could look at alpha go and say like oh this is a model that can do exploration it can like alpha zero can generalize to new video games it has all these priors about how to engage with the world and so forth the intellectual ceiling is really high yeah exactly and then in retrospect obviously a bunch of the methods are still used today in deep learning and you can see um see similar things in the models that we trained today but it was fundamentally like not a sort of like baby a g i that we just had to like add a little like sprinkle of something else on top of in order to make it the elements of today and i just want to like very directly address this crux of why are elems um in a much different position of a respect to true a g i than alpha zero why are they actually the base on which like adding in a few extra drops of this kind of care and attention yeah.

我认为关键在于，那些期待更长时间的人觉得，当我需要重新安排我的播客时间时，他们会说，你可以看看AlphaGo，并认为这是一个可以进行探索的模型，比如AlphaZero可以推广到新的电子游戏，它具备很多如何与世界互动的先验知识等。因此，它的智力上限非常高。没错，很多方法在今天的深度学习中依然被使用。你可以在我们今天训练的模型中看到类似的东西。但从根本上说，这并不像一个“婴儿AGI”（人工通用智能），我们只需要在上面稍微添加一点东西，就能变成今天的样子。我想非常直接地解决这个问题：为什么ELEMS（可能指某种模型或技术）和AlphaZero相比，在发展成为真正的AGI方面处于非常不同的位置？为什么它们实际上是基础，只需要在上面稍加这种关心和注意的额外一点，就能发展呢？

uh gets us to human level intelligence i think one important point is that when you look at alpha zero it does have all of those ingredients and in particular i think like the intellectual ceiling goes like quite contra what i was saying before which is like we've demonstrated this incredible complexity of like an atom program yeah um i do think that the type of task and setting that alpha zero will like it worked in this two-player perfect information uh like gain basically it's incredibly friendly to our elegrams um and the reason it took so long to to get to like a more a g out proto a g i style models is you do need to crack that like general conceptual understanding of like the world and language and this kind of stuff and you need to get the initial reward signal on tasks that you care about in the real world which are like harder to specify the games.

啊，我认为要达到人类水平的智能，有一个重要的点是，当你看AlphaZero的时候，它确实具备了所有这些要素。具体来说，我觉得智力的上限与我之前提到的相反，比如我们通过一个原子程序展示了这个难以置信的复杂性。我认为AlphaZero所适用的任务和环境，比如在一个两人完美信息的博弈中，基本上它是非常有利于我们的算法的。而之所以花了这么长时间才发展出一种更通用的人工智能风格的模型，是因为你确实需要破解对世界和语言的广泛概念理解之类的东西。你还需要在现实世界中关心的任务上获取初始的奖励信号，而这些任务要比游戏中的任务更难以指定。

um and i think then that like that sort of like gradient signal that comes in the real world like all of a sudden you get access to it and you can start climbing it um whereas alpha zero didn't didn't ever have like the first wrong to pull on yeah this goes back to the monkeys on the typewriter yeah and like the pre-training model and until you had something like gpt3 gpt4 it just couldn't generate coherent enough sentences to even begin to do rla chaff and tell it what you liked and didn't like yeah um if we don't have even reasonably robust um or weekly robust computer use agents by this time next year are we living in the the bus timeline as a 20 to 30 or bust i would be extremely surprised if that was the case and i think that would be like somewhat of an update towards like there's something like strangely difficult about yeah this like computer use in particular yeah um i don't know if it's the bus timeline but it's definitely like the i would update on this being like yeah it lengthen of time yeah yeah but yeah

嗯，我觉得这种突然出现在现实世界中的渐变信号，你一下子就可以接触到它，并可以开始攀爬它，而Alpha Zero从来没有一个起步点可以去抓住。这让我想起了猴子打字机的例子，以及预训练模型。直到出现像GPT-3和GPT-4这样的模型之前，计算机生成的句子根本不够连贯，甚至无法开始进行强化学习，说出你喜欢什么和不喜欢什么。如果到明年这个时候我们还没有稍微稳健的或者是弱稳健的计算机代理存在，那我们是不是就活在“尘埃落定”的时间线里呢？我会非常惊讶，如果是这种情况的话，我觉得这可能意味着在计算机使用方面存在某种特别困难的挑战。我不确定这是不是“尘埃落定”的时间线，但这确实让我对时间的预期有所延长。

i mean i think more and more it's no longer a question of speculation if people are skeptical i'd encourage like using clawed code or like some agentic tool like it and just seeing what the current level of of capabilities are treating it so much easier but seriously like the models are getting really capable at tasks that we care about and we can give them enough data for yeah and and i mean the circuits results from interpretability are also pointing in the direction that they're doing very reasonable generalizable things uh and so yeah the this question matters a lot but um i'm surprised by how many deep learning critics just like haven't really interacted with the models or haven't in a while and constantly move the gold posts yeah yeah

我的意思是，我越来越觉得，现在已经不再是猜测的问题了。如果有人持怀疑态度，我会建议他们尝试使用像Claw码这样的工具，亲自看看当前的能力水平。这样会让理解这些技术变得容易得多。说实话，这些模型在我们关心的任务上变得非常有能力，当我们提供足够的数据给它们时，效果尤其显著。此外，解释性研究中的电路结果也显示，这些模型正在执行非常合理且有普遍意义的操作。因此，这个问题非常重要。但让我惊讶的是，有这么多深度学习的批评者并没有真正与这些模型交互过，或者已经很久没有这样做了，还有人不断提高评价标准。

yeah like the turning test used to be a thing right like yeah we don't even talk about it and it'd be like silly to think that it was a meaningful test yeah yeah now that means that one caveat on that is like if software engineering is just like dramatically better than computer use i mean computer use still sucks then i'd be like still like oh maybe everyone just kept focused on software engineering like it was just like by far the most valuable thing like every module person and dollar went towards software engineering i don't think that's the case i do think like computer use is valuable enough that like you know people will care about it yeah but that would be like my that's my one like escape patch that i'm putting in place for next year yeah it would be good for my life in perspective too because i think you kind of do need a writer range of skills before you can do something super super scary um oh like as in if the models didn't get you better

是的，就像图灵测试曾经是一个重要的议题，对吧？但现在我们都不再谈论它了，认为它是一个有意义的测试显得很可笑。是啊，是啊。不过有一点我要说，即便软件工程变得比计算机应用好很多，但如果计算机应用依旧很烂，那我还是会觉得，也许大家只关注软件工程，因为它被普遍认为是最有价值的，每一个人和每一分钱都投入到了软件工程中。但我不认为情况是这样的，我确实相信计算机应用足够有价值，足以让人们关心它。不过这就是我为明年设想的一个“逃生通道”。从个人的角度来看这也会是件好事，因为我认为在你能够做一些超级吓人的事情之前，你确实需要具备一系列广泛的技能。哦，比如说如果模型没有让你变得更好。

yeah if it's like just report they're super human coders but they're not like Henry Kissinger level i don't know that seems okay like if we have AI oracles yeah that's good yeah yeah that's good yeah so if you look back at AI discourse like going back a decade there's a sense that there's dummy i then there's aji then there's asi that intelligence is the scalar value um the way you've been talking about the these models has a sense of jaggedness yeah it's especially tuned to environments in which it's been trained a lot or has a lot of data um is there a sense in which like there's still it still makes sense to talk about the general intelligence of these models is there enough meta learning and transfer learning is distinguished between like the sizes of models or like yeah uh the way models are trained or are we moving into a regime where it's not about intelligence it's more so about domain yeah so one intuition pump is this conversation was had a lot when models were like GPT two-sized and fine-tuned for various things uh and they found you know you people would find that the models were dramatically better at things that they were fine-tuned for right

是的，如果只是报道说他们是超级程序员，那还不错，但他们不是像亨利·基辛格那样的级别。我觉得这样还可以，比如我们有AI预言家，那是好的。所以如果你回顾过去十年关于AI的讨论，会有一种感觉，就是先是“符号智能（dummy i）”，然后是“人工通用智能（AGI）”，再然后是“超级智能（ASI）”，大家认为智能是一种可度量的值。但你谈论这些模型时，感觉有些不均匀。尤其是在它们被大量训练或有很多数据的环境中表现得特别好。那么，现在讨论这些模型的通用智能是否仍然有意义？是否有足够的元学习和迁移学习来区分模型的规模或训练方式？或者我们正在进入一个不再以智能为基础而更多以领域为基础的阶段？一个直观的例子是，当模型像GPT-2一样大小，并针对不同任务进行微调时，这种讨论常常出现。大家发现，当模型针对某些任务进行微调时，表现会显著提高。

but by the time you get to GPT four when it's trained on a wide enough variety of things actually the like the sort of total compute uh like it generalize very well across all of like the individual subtasks and actually generalize better than smaller fine-tuned models um in a way that was extremely useful uh i think right now what we're seeing with rl is is pretty much the same story playing out where uh there's this jaggedness of like things that they're particularly trained at but as we expand the total amount of compute that we do rl with you'll start to see as the same transition from like GPT two fine-tunes to uh like GPT three GPT four like unsupervised like you're metal learning and like generalization across things and i think we were already seeing like early evidence of this uh in its ability to like generalize reasoning to uh things but um like i think this will be like extremely obvious uh soon

当你看到GPT-4时，它经过广泛各种数据的训练，能够很好地泛化处理所有的具体子任务，这比那些较小的微调模型的表现要好得多，这种特性极为有用。我认为我们现在在强化学习（RL）领域看到的情况几乎是同样的故事。虽然目前在某些特别训练的任务上表现参差不齐，但随着我们增加用于强化学习的计算量，你将看到类似于从GPT-2微调到GPT-3、GPT-4这样的无监督学习的转变。这种转变体现为跨任务的元学习和泛化能力。我认为我们已经在它推理能力的泛化方面看到了一些早期的证据，不过我相信很快会有更加明显的例证展现出来。

one nice example of this is just the ability or notion to backtrack right you go down one solution path oh wait let me try another one uh and this is something that you start to see emerge in the models through rl training on harder tasks and i i think right now i it's not generalizing incredibly well at least with with well i mean has the we have rl the model to be a uh interp agent no i mean no yeah exactly yeah like so all this time we're talking about like oh it's only good at things that's being rl that well it's pretty good at that because that's pretty it's that is a mixture of like science and like understanding language and like uh and coding um like there's this sort of like mixture of domains here all of which you need to understand like you need to be both a great software engineer and be able to like think through language and and like like state of mind and almost philosophize in some respects to be an interp agent yeah um and it is generalizing from the training yeah to do that.

一个很好的例子就是具备回溯能力或概念，你沿着一个解决方案走了一段路，然后想到“哦，等等，让我试试另一个”。通过在更复杂的任务上进行强化学习（RL）训练，这种能力会在模型中逐渐显现。我觉得，至少目前来看，它的泛化能力还不是特别强。我们并没有把模型训练成一个解释代理（interp agent），对，没错。我们之前一直讨论说，模型在通过强化学习训练的任务上表现得很好，这实际上是一个科学、语言理解和编程的综合体。在这个过程中，你需要了解多个领域，需要既是一个优秀的软件工程师，又能够通过语言进行思考，甚至在某些方面进行哲学思考才能成为一个解释代理。在训练中，这种能力确实得到了泛化。

what's the end game here claw aid comes out um and they give it to you and dot dot dot you say thumbs up what's happened what do you do yeah i mean it really depends upon the timeline at which we get clawed eight and the models hit like ASL4 capabilities right like like fundamentally we're just going to use whatever tools we have at the time and see how while they work um ideally we have this enumerative safety case where we can almost like verify or prove that the model will behave in particular ways um and the worst case we use the current tools like when we won the auditing game of seeing what features are active when the assistant tag lights up can you explain what is mechanistic interpability what are features what are circuits totally yeah yeah yeah so uh mechanistic interpretability or the cool kids call it mecha and terp is uh trying to reverse engineer neural networks uh and figure out kind of what the core units of computation are yeah.

这段文字的大意是：在讨论 CLAW AID（可能是某个项目或工具）时，一个关键问题是其最终目标是什么。现在，它已经推出，给予了你，你需要评估它的表现。具体的结果取决于我们获得 CLAW AID 以及模型达到 ASL4 能力的时间点。基本上，我们会利用当时可用的工具，观察它们的运行效果。理想情况下，我们会有一种安全验证机制，可以几乎证明模型会以特定方式运作。最差的情况是，我们使用当前的工具，观察在助手标签亮起时有哪些功能被激活。有人问什么是“机制可解释性”（机械可解释性）以及功能和电路是什么。对此，回答是：机制可解释性（时髦的说法是“Mecha and Terp”）是指试图对神经网络进行逆向工程，以找出其核心计算单元是什么。

lots of people think that because we made neural networks because they're artificial intelligence we have a perfect understanding of how they work yeah and it couldn't be further from the truth uh neural networks AI models that you use today are uh grown not built uh and so we then need to do a lot of work after they they're trained to figure out to the best for abilities how they're actually going about their reasoning and so uh two and a half three and a half years ago uh this kind of agenda of applying mechanistic interpretability to large language models started uh with Chris Ola leaving open AI co-founding and throtic um and every roughly six months since then we've had kind of like a major breakthrough in our understanding of of these models and so first with toy models a superposition uh we established that models are uh really trying to cram as much information as they possibly can into their weights uh and this goes directly against people saying that neural networks are over parameterized and like classic AI machine learning back in the day you would use linear regression or something like it and people had a meme of AI or neural networks uh deep learning be using way too many parameters.

很多人认为，由于我们研发了神经网络，人们误以为我们对它们的工作原理有完美的理解。这种看法与事实相去甚远。现代的神经网络或人工智能模型，其实是“成长出来的”，而不是“构建出来的”。因此，在它们经过训练后，我们需要花费大量时间和精力去尽力理解它们是如何进行推理的。大约在两年半至三年半之前，Chris Ola 离开 OpenAI，联合创立了一家名为 Anthropic 的公司，并开始推动将机械解释应用到大型语言模型上。这项研究大约每隔六个月就有重大突破。首先，利用简单的模型和超叠加现象，我们发现这些模型实际上在尽可能多地压缩信息到其权重中。这一发现直接反驳了那些声称神经网络参数过多的看法。在早期的经典人工智能和机器学习中，人们通常使用线性回归等方法，并且有一种“神经网络或深度学习使用了过多参数”的刻板印象。

um there's like this funny meme that you should show of like layers on the x-axis and layers on the y-axis and this like jiggly line that just goes up and it's like oh just throw more layers at it right um but it but it actually turns out that at least for really hard tasks like being able to accurately predict the next token for the entire internet these models just don't have enough capacity and so they need to cram in as much as they can and the way they learn to do that is to use each of their neurons or units of computation in the model for lots of different things and so if you try to make sense of the model and be like oh if I remove this one neuron or like what is it doing in the model it's impossible to make sense of it it'll fire for like Chinese and phishing and horses and I don't know just like a hundred different things um and it's because it's it's trying to juggle all these tasks and use the same neuron to do it so that's that superposition.

这段话的大意是，有一个有趣的网络迷因图在x轴和y轴上都显示出越来越多的“层”（layers），并有一条像波浪线一样的曲线往上走，看起来就像是在说“哎呀，再加点层就好了”。但实际上，对于非常复杂的任务，比如准确预测互联网上下一个词汇，这些模型往往计算能力不够。因此，模型需要尽可能压缩信息量，而它们通过让每个神经元或计算单元承担多种不同的任务来学习这一点。所以，当你试图理解模型的功能时，比如移除一个神经元或者看它在模型中做什么，你会发现很难弄清楚。因为同一个神经元可能同时处理中文、网络钓鱼、马以及其他众多任务。这样一来，神经元就像在同时完成多项任务，这种现象被称为“叠加”。

nine months later we write towards monosomanticity which introduces what are called sparse autoencoders and so going off what I just said of the model trying to cram too much into two little space uh we give it more space uh there's this higher dimensional representation uh where it can then more cleanly represent all of the concepts that it's understanding and and this was uh of a very toy paper and so much as it was a two layer really small really dumb transformer and we fit up to I want to say 16,000 features which we thought was a ton at the time fast forward nine months we go from a two layer transformer to our clawed three sonnet frontier model at the time and fit up to 30 million features and this is where we start to find really interesting abstract concepts like a feature that would fire for code vulnerabilities and it wouldn't just fire for code vulnerability it would even fire for like you know that chrome page you get if you like it's not an HTTPS um you are all and it's like warning this site might be dangerous like click to continue and like also fire for that for example and so it's like these much more abstract coding variables or sentiment features amongst the 30 million.

九个月后，我们开始探索 "单义性"，提出了所谓的稀疏自编码器。之前我们提到模型尝试将大量信息压缩到过小的空间中，因此我们为其提供了更大的空间。这种更高维度的表示使模型可以更清晰地表达其理解的所有概念。当时，我们进行的研究就像一个简单的小实验，只是用了一个两层的非常小、非常简单的转换器。开始时我们处理了大约16,000个特征，当时觉得已经是很多了。九个月后，我们从一个两层的转换器发展到了一个称为"Claude三号奏鸣曲"的模型，并处理多达3,000万个特征。在这个过程中，我们开始发现一些非常有趣的抽象概念，比如一个特征会在检测到代码漏洞时激活，它不仅会为代码漏洞服务，甚至在访问非HTTPS的网站时，浏览器显示的“此网站可能有危险，请点击继续”的警告页面也会触发这个特征。这说明在这3,000万个特征中，还有许多更抽象的编码变量或情感特征。

fast forward nine months from that and now we have circuits and I threw in the analogy earlier of the ocean 11 heist team where now you're identifying individual features across the layers of the model that are all working together to perform some complicated task and you can get a much better idea of how it's actually doing the reasoning and coming to decisions like with the medical diagnostics um one example I didn't talk about before is um with like how the model retrieves facts and so you say like what sport did Michael Jordan play and not only can you see it hop from like Michael Jordan to basketball answer basketball but the model also has an awareness of when it doesn't know the answer to a fact and so by default it will actually say I don't know the answer to this question but if it sees something that it does know the answer to it will inhibit the I don't know circuit and then reply with the circuit that it actually has the answer to.

九个月后，我们已经有了电路，我之前用过《十一罗汉》电影中的劫案团队作类比。在模型的各层中，你可以识别出每个特征，它们都协同工作来执行复杂任务。这样你就能更好地了解它如何进行推理和做出决策，比如在医学诊断方面。我之前没提到的一个例子是关于模型如何检索事实。当你问“迈克尔·乔丹打的是什么运动？”时，你不仅能看到它从“迈克尔·乔丹”跳到“篮球”并回答“篮球”，模型还具备在不知道答案时的自我意识。因此，默认情况下，它会说“我不知道这个问题的答案”。但如果它看到某个问题是自己知道答案的，它会抑制“不知道”的电路，然后用有答案的电路来回答。

so for example uh if you ask it who is Michael Batkin which is just a made up fictional person uh it will by default just say I don't know it's only with Michael Jordan or someone else that will it will then inhibit the I don't know circuit but what's really interesting here and where you can start making downstream predictions or reasoning about the model is that that I don't know circuit is only on the name of the person and so in the paper we also ask it uh what paper did Andre Carpati write and so it recognizes the name Andre Carpati because he's sufficiently famous so that turns off the I don't know reply but then when it comes time for the model to say what paper it worked on it doesn't actually know any of his papers and so then it needs to make something up and so you can see different components and different circuits all interacting at the same time to lead to this this final answer.

比如说，当你问它“谁是迈克尔·巴特金”时（这个名字只是个虚构的人物），默认情况下，它会回答“不知道”。但如果你问的是迈克尔·乔丹或其他知名人物，它才会抑制“不知道”的反应。但这里有趣的部分是，在这套模式中，“不知道”反应主要是跟人物名字有关的。在论文中，我们还问了它“安德烈·卡帕蒂写了哪篇论文”，因为安德烈·卡帕蒂足够有名，所以系统识别出了这个名字，并不再回答“不知道”。但当涉及到它要说哪篇论文时，它其实并不知道卡帕蒂写的具体论文，所以它需要编造一些内容。在这个过程中，你可以看到不同组件和电路同时发挥作用，最终形成这个答案。

why I think it's a tractable problem like understand every single thing that's happening in a model or like that's the best way to understand why it's being deceptive if you wanted to explain why England won world were two using particle physics uh you would just like be on the wrong track you just want to look at the high level explanations of who had more weapons like what did they want and that seems analogous to just training linear probes for like are you honest are you being deceptive like um do we catch you doing bad things when we're right teaming you can be monitor you um why is this not analogous where we're asking a particle physicist to just backtrack and explain why um why England won world or two I feel like you just want to go in with your eyes wide open not making any assumptions for what that deception is going to look like or what the trigger might be yeah uh and so the wider you can cast that net the better.

我为什么认为这是一个可处理的问题，比如理解模型中发生的每一件事，或者说这是理解它为什么具有欺骗性的最佳方式。如果你想用粒子物理学解释为什么英国赢得了二战，那你就走错了路。你只需要看一些高层次的解释，比如谁拥有更多的武器，以及他们的目的是什么。这类似于训练简单的探测器来判断你是否诚实，或者是否具有欺骗性，比如当我们进行红队测试时，我们是否能发现你在做坏事。为什么这不类似于让粒子物理学家去解释英国为什么赢得二战？我觉得你应该眼睛睁得大大的，不要对欺骗的表现形式或触发原因做任何假设。因此，你能撒的网越广越好。

um depending on how quickly AI accelerates and where the state of our tools are we we might not be in the place where we can like show prove from the ground up that everything is safe uh but the like I feel like that's a very good north star it's a very powerful reassuring north north star for us to aim for especially when we consider we are part of the broader AI safety portfolio. I mean do you really trust like you're about to deploy this system and you really hope it's aligned with humanity and that you've like successfully iterated through all the possible ways that it's going to like scheme or sandbag but that's that's also probably going to be true with whatever you find you're not I mean you're not you're still going to have variants that you haven't explained or like you found a feature but you don't know if it actually explains deception or something else instead or so so I guess first of all I'm not saying you shouldn't try the probing approach right like we want to pursue the entire portfolio.

根据人工智能加速的速度和我们的工具状态，我们可能还无法完全从头到尾证明所有事情都是安全的。然而，我觉得这可以是一个非常好的方向，是值得我们努力追求的一个强有力的、让人安心的目标，尤其考虑到我们是更广泛的人工智能安全计划中的一部分。也就是说，你是否真的信任自己将要部署的系统，真的希望它与人类的目标一致，并且你已经成功地通过所有可能的方式进行迭代，以防止它耍手段或隐瞒真相？但这可能确实也会是你遇到的任何问题，不是吗？并不是所有变数你都能解释，或者你发现了某个特性，但不确定它是解释了欺骗还是其他事情。所以我觉得，首先，我不是说你不应该尝试探查的方法。我们希望能够追求整个系列的方案。

we've got the therapists interrogating the patient by asking do you have any troubling thoughts we've got the linear probe which I'd analogize to like a polygraph test yeah where we're taking like very high level summary statistics of the person's well-being and then we've got the neurosurgeons kind of going in and seeing if you can find any brain components that are activating and troubling or or off distribution at ways so so I think we should do all of it um what what percent of the alignment portfolio should it and make it to be uh I think as much of a chunk as is necessary I mean I think at least like yeah hard hard hard to find but I don't know I don't think I feel like all of the different portfolios are like being very well supported and growing um you can also we're going back to like the the world's three question you can think of it as like a hierarchy of abstractions of trust yeah where like let's say you want to go and talk to like Churchill um it helps a lot if you can verify that in that conversation in that 10 minutes he's being honest.

我们有治疗师通过询问病人“你有没有让人烦恼的想法”来进行探讨；我们有线性探测，就像测谎仪测试一样，从中获取关于个人心理健康的高度概括性统计数据；然后，我们有神经外科医生深入观察是否能找到激活的、大脑中让人烦扰或异常的部件。所以，我认为这些方法都要尝试进行。那么，资源分配应该占对齐计划的百分之多少呢？我认为应根据需要投入相应的比例。目前来看，虽然全面支持和发展所有不同的研究方向并不容易，但是很重要。回到我们讨论的三层世界问题，你可以把它理解为信任的分层抽象架构，比如假设你想和丘吉尔对话，如果你能验证他在对话的十分钟内保持诚实，这会很有帮助。

and this like enables you to construct better meta narratives um of what's going on and so maybe particle physics wouldn't help you there but suddenly like the neuroscience of Churchill's brain would help you verify that he was being trustworthy in that conversation and that the like you know the soldiers on the front lines were being honest in their depiction of what oh their description of what happened and this kind of stuff like so as you can verify like progress like parts of the uh the tree up then that that that massively helps you build confidence uh I think language models are also just really weird right like with the emergent misalignment work they I don't know if they took predictions they should have of like hey I'm going to fine tune Chatsy PT on code vulnerabilities is it going to become a Nazi and I think most people would have said no and that's what happened.

这句话表达了以下意思：你能够通过某种方式构建更好的元叙事（即对事情整体的理解）。例如，粒子物理可能在这方面没有多大帮助，但如果能了解丘吉尔大脑的神经科学特点，或许可以帮助验证他在某次谈话中是否可信任，并确认前线士兵描述事件的真实性。当你能够验证某些进展或某些“树”的部分时，这会大大增强你的信心。语言模型本身也非常奇怪。比如在处理“新兴的模型失调”问题时，人们可能没有充分考虑到其预测结果，比如，如果通过代码漏洞调整聊天机器人（Chatsy PT），是否会让它变得极端化（如纳粹化）。大多数人可能会认为不会，但实际结果却显示会如此。

and so what are the different how they discover that it became a Nazi they started asking at a ton of different questions and it will do all sorts of like vile and harmful things like the whole persona just totally changes and and I mean we are dealing with alien brains here who don't have the social norms of humans and or or even a clear notion of like what they have and haven't learned uh that that we have of them I mean yeah and and so I think you really want to go into this with with eyes wide open backing up from mech and turf uh if you live in a world where AI progress accelerates um by the way you were mentioning a little while ago that there's many wild worlds we could be living in but we're living at least one of them.

所以，他们是如何发现它变成纳粹的呢？他们开始问一连串不同的问题，然后这个系统会做出各种恶劣和有害的事情，整个个性完全改变了。我的意思是，我们面对的是没有人类社会规范的“外星大脑”，甚至对自己掌握了什么、没有掌握什么都没有清晰的概念。我觉得，在进入这个领域时，真得睁大眼睛、深思熟虑，尤其是在一个人工智能进步加速的世界中。顺便说一下，你刚才提到，我们可能生活在许多未知的世界中，而我们至少活在其中的一个。

um another one that we've uh gestured at but it's worth making more explicit is this even if the AI models are not helping write the next training algorithm for their successor just the fact that if they had human level learning efficiency uh whatever a model is learning on the job or whatever copy of the model is learning on the job the whole model is learning so in effect it's getting more or if they're like a thousand times less efficient than humans are learning that's right and you just like to deploy them even still that's exactly yeah yeah anyways and and there's a whole bunch of other things like you can think about it but even there it's like you kind of have a like a broadly deployed intelligence explosion and I do think it's what like worth pressing on that future of um you know there's there's this whole spectrum of crazy futures but the one that I feel we're almost guaranteed to get and this is like a like a like a almost as strong statement to me um is one where like at the very least you get drop-in like white collar worker at some point in the next five years it's like I think it's very likely in two um but it seems almost over determined in like five and and unlike the grand scheme of things like those are kind of irrelevant time frames like it's.

另一件我们提到过但值得更明确说明的事情是，即使AI模型并没有在帮助编写其后续版本的下一个训练算法，只是考虑到如果它们具有人类水平的学习效率，那么无论一个模型在工作中学到了什么，或其某个副本在工作中学到了什么，整个模型都在学习。实际上，这种情况下它的收获会更多。或者即使它们的学习效率比人类低一千倍，只要能被部署，这也能奏效。无论如何，还有很多其他情况可以考虑，但即便在这些情况下，你都可以认为会出现广泛部署的智能爆炸。我确实认为，值得认真思考未来的发展，有各种可能的疯狂未来，但我觉得几乎可以肯定的是，我们会迎来一个至少可以替代白领的时代，也就是，在未来五年内的某个时刻，一个像是“插入即用”的白领替代者可能出现。我觉得在两年内就非常可能，而在五年内几乎是板上钉钉。相比整个时间线，这些时间框架显得无关紧要。

the same either yeah um and that completely changes the world over the next decade um and and if the sort of if we don't have the right policies in place so that then you end up actually with almost some respects like a fundamentally worse world because the thing that these models get good out by default is like software engineering and like computer-using agents and this kind of stuff and then we need to we will need to put in extra effort to put them in the loops where they help us with scientific research or they're like we have the right robotic such that we actually like experience and increase the material quality of life um so that's what we're thinking about like if you're in the perspective of like I'm a country yeah what should I be doing or thinking about uh like plan for the case where where white collar work is automated um and then consider what does that mean for your economy and what you should be doing to prepare polls what should you be doing to prepare because honestly I think it's like a such a tough question yeah where like if you're India or Nigeria or Australia yeah um if you're a country unlike America or China where they do have frontier models um what is it that you should be doing right now especially on such a short time scale yes

这句话可以翻译成： “是的，这完全会在未来十年里改变世界。如果我们没有合适的政策来应对，那么最终的结果可能会是一个在某些方面更糟糕的世界。因为这些模型本来就擅长软件工程和计算机应用类的东西，所以我们需要额外的努力让它们在科学研究等领域对我们有所帮助，或者我们需要合适的机器人技术来切实提高物质生活质量。因此，我们在考虑这些问题时，如果你是一个国家，应该思考或计划当白领工作自动化时该怎么办，这对你的经济意味着什么，你应该怎么准备。我认为这是一个非常棘手的问题。像印度、尼日利亚或澳大利亚这样的国家，特别是与拥有尖端模型的美国或中国不同的国家，面临着这样一个短时间尺度的情况，他们现在应该做些什么呢？”

uh so I think one very important point is that let's say this this scenario turns out true then computer to become the most valuable resource in the world yeah like the sort of GDP of your economy is dramatically affected by how much compute you can deploy towards these sort of organizations within your country um and so having some guaranteed amount of compute I think will actually be quite important so like pre getting ahead of investments in like data centers and this kind of stuff on the condition that it's like companies in your country have to be allowed to to use that compute um and uh yeah not necessarily for training but like just even just for inference I think the economic value here comes to that inference um I think it also makes sense to invest broadly in uh in AI like I think these countries have the opportunity to do so and I think that's like a portfolio of like you know foundation model companies but also like robotics supply chain and this kind of stuff um

嗯，所以我认为一个非常重要的观点是，如果这种情况成真，那么计算机将成为世界上最有价值的资源。就像是国家的GDP会大幅受到你能在国内不同组织中部署多少计算能力的影响。因此，确保一定数量的计算资源将会变得非常重要。提前在数据中心等方面进行投资是必要的，但前提是你国家的公司必须被允许使用这些计算资源。即便不用于训练，哪怕只是用于推理，经济价值就已经很大了。我认为广泛投资于人工智能也是有意义的，这些国家有这样的机会去这么做，这包括基础模型公司，以及机器人、供应链等等领域的投资。

I think that you should invest very proactively in policies that try and prevent capital lock-in um so we're in for a much worse world if like it just so happens that the people had like money in the stock exchange or inland before AGI are like dramatically more wealthy than the people who don't because it's a gross misallocation of resources uh so having like I know one of my favorite episodes actually on your podcast was uh like the George's on one where you're trying to like like appropriately like value or allocate land um and so I think like like this strikes particularly close to home coming from Australia where I think like out like uh policies with respect to land are like grossly wrong um but I think this is broadly true um being very uh like forward on regulation and integration of these models into uh into like your country um is is important um and proactively uh making sure that people have choice like so let's say uh you should be quite proactive about making sure that like the phones or devices or like glasses that people have at people have like free choice on like what yeah things they run um and then oh so that's like that's the we just get white collar worker right and like you're trying to like do the best to like prepare your country for that um

我认为你应该积极投资于旨在防止资本锁定的政策。如果在通用人工智能（AGI）出现之前，有人在股票市场或地产上投了很多钱，那么他们将比没有投资的人富裕得多，这会导致资源的严重错配。你播客中我最喜欢的一集是关于乔治主义的那一集，你尝试合理地估价或分配土地。我觉得这在澳大利亚尤其贴切，因为我认为这里的土地政策非常不合理，但这种情况普遍存在。在法规制定和将这些模型整合到国家中时采取积极态度非常重要。此外，确保人们有选择的权力也是关键。比如说，你应该积极确保人们在使用手机、设备或眼镜时有自由选择运行什么内容的权利。确保白领工人和整个国家为未来做好准备，是一项重要任务。

and then it's like okay well what can you do to like make all possible versions of the future go like well like that's like covering some amount of like economic downside uh the other like things I think are really important is like figure out how you can uh like either make the like basically ensure dramatic upside or cover like terrible downside um and so like getting dramatic upside is like making sure that there uh like is investment in biology like biology research and this kind of stuff in an automated way that is like uh these models are actually like able to produce novel medicines that massively massively improve our like quality of life. In covering the downside is like AI alignment research and this kind of stuff an automated testing and like really thinking hard about that AI safety institutes this kind of stuff but these seem like things that a rich person a random rich person could also do like there's it seems like there's not a thing that a nation state is uniquely equipped to do uh let's go point in this in this scenario

翻译成中文，尽量易读：然后就像，好的，那么你能做些什么来确保所有可能的未来版本都发展得很好，比如说，覆盖一些经济下行风险。另一些我认为非常重要的事情是，找出怎样做才能确保获得戏剧性的收益或者覆盖可怕的损失。获得戏剧性收益的方式之一是确保在生物学方面有投资，比如生物研究和自动化的新药物研发，通过这些模型产生的新的药物能够极大地改善我们的生活质量。而在覆盖风险方面，比如AI的对齐研究、自动化测试，认真考虑AI安全研究所等问题。而这些似乎是一个有钱人，一个随机有钱人也可以做到的事情，好像没有什么是国家特别擅长的。因此，在这种情况下，可以考虑这个方向。

I mean like dramatic allocation of resource like resource storage compute I think is is sensible um I would be doing that if I was in charge of a nation state I think it just increases your optionality and like most of the future worlds. Yeah um Dylan Patel has some scary forecasts on US energy yeah versus China. Yes we're like 34 gigawatts off. Yeah the US's line is like flat yeah basically and China's land is like this and I mean the US like very clearly yeah we just need so many more power plants. Yes intelligence becomes this like incredibly valuable input like intelligence becomes almost a raw input into the economies and quality of life of the future. The thing directly underneath that is energy and so making sure you have like you know incredible amounts solar like tile the desert and solar panels on you know some plastic desert and solar panels would be would be helpful towards making sure that you have more access to intelligence on top.

我指的是资源的显著分配，比如储存和计算资源，我认为这是明智的。如果我管理一个国家，我也会这么做，因为这会增加选择的灵活性。Dylan Patel 对美国和中国的能源有些可怕的预测。我们大约差了34吉瓦。美国的增长基本上是平的，而中国的增长很明显大幅提升。美国显然需要更多的发电厂。智能变得极其重要，几乎成为经济和未来生活质量的原料。在这之下最基本的就是能源。因此，确保有充足的太阳能，比如在沙漠中铺设太阳能电板，这将有助于获得更多的智能资源。

Yeah just to make it explicit because we've been touching on it here even if AI progress totally stalls you think that the models are really spiky and they don't have general intelligence. Yes it's so economically valuable and sufficiently easy to collect data on all of these different jobs these white color job tasks such that to Shulta's point we will we should expect to see them automated within the next five years. Yeah even if you need to hand spoon every single task to the model. It's like economically worth all to do so even if like algorithmic like progress stalls out and like we just never figure out how to like keep progress going which I don't think is the case like that hasn't stole that. yet it seems to be going great. The current suite of algorithms are sufficient to automate white color work provided you have enough of the right kinds of data. Yes and in a way that like compared to the tam of salaries for all of those kinds of work is so like trivially worthwhile.

好的，为了明确这一点，因为我们已经在讨论这个话题了，即使人工智能的发展完全停滞，你仍然认为这些模型很有特点，但它们不具备通用智能。是的，这在经济上是非常有价值的，而且收集这些不同白领工作任务的数据相对容易，因此按照Shulta的观点，我们应该预期在未来五年内看到这些工作实现自动化。即便你需要手把手地将每个任务交给模型，它在经济上仍然是值得的。即使算法进步停滞不前，即使我们一直无法找到如何继续推动进步的方法，我并不认为这种情况会发生。目前的算法集足以实现白领工作的自动化，只要你有足够的合适的数据。而且，与那些工作的工资总额相比，这样做是非常值得的。

Yeah yeah exactly. I do just want to flag as well that there's a really dystopian future if you take more of x paradox to its extreme which is this paradox where we think that the most valuable things that humans can do or the smartest things are like add large numbers in our heads or do any sort of white color work and then we totally take for granted our fine motor skill and coordination but from an evolutionary perspective it's the opposite so we got like evolution has optimized fine motor coordination so well and if you look at like robot hands or like the ability to open a door is still just like really hard for robots. Meanwhile we're seeing this total automation of coding and everything else that we've seen as clever. The really scary future is one in which AI's can do everything except for the physical robotic tasks in which case you'll have humans with like air pods and like glasses glasses and there'll be some robot overlord controlling the human through cameras by just like telling it what to do and like having a bounding box around the thing you're supposed to pick up and so you have like human meat robots.

当然，当然，没错。我确实想指出，如果将一种悖论推向极端，就会出现一个非常反乌托邦的未来。这种悖论是指，我们认为人类能做的最有价值的事情或者最聪明的事情，比如在脑海中计算大数或者从事任何白领工作，而我们完全忽视了我们的精细运动技能和协调能力。但从进化的角度来看，事实恰恰相反。进化已经很好地优化了我们精细的运动协调能力。看看机器人手或者开门的能力，机器人仍然很难做到这些。同时，我们看到编程和其他被视为聪明的工作完全实现了自动化。真正可怕的未来是，当人工智能可以做一切事情，除了物理机器人任务时，人类就像戴着AirPods和眼镜，有个机器人霸主通过摄像头控制人类，只需告诉他们该做什么，并给出该拾起物体的边框，所以你就有了像人肉机器人的情况。

And not like necessarily saying that like that's what the AI's would be like want to do or anything like that but as in like if you were to be like what are the relative economic value of things like the AI's are out there doing computer programming and like the most valuable thing that humans can do is like be amazing robots. Now that being said I think more of x paradox is a little bit fake. I think the main reason that robots are worse than at like being a robot then they are at software engineering is the internet exists for software engineering like GitHub exists and there is no equivalent thing like if you had all like you know mocap of everyone's actions as they were like going about their daily lives for like some reasonable fraction of the human population robotics is also like close to solve like like on track to be solved at the same rate that software engineering is on track to be solved.

这段文字可以翻译并简化为： "这并不是说AI会想要这样做或那样做，而是说如果你在考虑各种事物的相对经济价值时，AI可以进行计算机编程，而人类最有价值的是成为出色的机器人。话虽如此，我认为X悖论有些虚假。我认为机器人在做‘机器人’的工作上不如软件工程的主要原因是，互联网上有软件工程的资源，比如GitHub，但没有类似的资源用于机器人。如果我们能获取大量人们日常活动的数据，像是使用动作捕捉技术，那么机器人技术的发展速度可能会和软件工程的发展速度一样快。"

So this is only like this vision is only like a sort of decade long section but it's still terrible decade. Like imagine the world where people have lost their jobs you haven't yet got novel biological research that means like people's quality of life is like dramatically better. You don't yet have material abundance because you like haven't actually been able to action the physical world in the like necessary way like you can't build dramatically more because that's like building dramatically more takes robots basically. And people's like main comparative advantage is as fantastic robots is like a shocking shocking world.

这段话的大意是这样的：这个设想的未来场景可能只会持续十年左右，但这十年仍然会是艰难的。想象一下这样一个世界：人们失去了工作，新的生物研究尚未取得突破，所以人们的生活质量没有显著提升。物质仍然不够丰富，因为我们还不能通过必要的方式对物质世界进行足够的改造；想要大规模地制造更多东西，需要大量使用机器人。在这样的世界里，人类的主要竞争优势就是担任"超级机器人"，这将是一个令人震惊的场景。

I mean for me the perception of an average human I think it actually might be better. You're like wages will be higher because you're you're the complement to something that is enormously valuable. Right which is AI labor. Right. And like you know decade or two on like the world is fantastic. Yeah right like you truly like you robotics to solve and you've got to get like you know like radical abundance basically provided that you have all the policies set up like necessary to commit building like you sort of you end up with that same change from you know the like the before and after photos of Shanghai yeah we're like 20 years on it's like this dramatically transformed city like a lot of places in the world probably end up like that right over that two decade period but we need to make sure like one do our best to estimate is this actually what is on track to happen like build sweet bench but for all the other forms of white collar work and measure and track that's a great thing that governments should be doing by the way is like trying to break down the sort of functions of their economy into measurable tasks and figuring out where what is the curve actually look like for that because they might be a bit shocked by the progress there.

我认为从普通人的角度来看，这其实可能是一件好事。你的收入会更高，因为你与一种极具价值的东西——人工智能劳动力——形成互补。未来一两个十年，世界将变得非常美好。机器人技术将帮助解决问题，实现极大的富裕，当然前提是有合适的政策来支持和推动这种发展。例如，上海在过去20年的巨大变化，很多地方也会在这段时间内经历类似的迅速发展。但我们需要确保尽可能准确地预估未来的发展趋势，例如为各种白领工作建立良好的基础，并进行测量和追踪。政府应该致力于将经济活动分解为可测量的任务，了解实际的进步曲线，因为他们可能会对取得的进展感到惊讶。

You know there's no sweet bench for tax like taxi bell and then. I don't like have all the answers here but like figuring out a way to like share the proceeds of this economy like broadly across people or like invest heavily in robotics and collecting data so that we get robotics faster and we get material abundance faster invest in biological research that we get but like all that faster they basically try and pull forward the radical upside because otherwise you have a pretty dark yeah like section. I think one thing that's not appreciated enough is how much of our leverage on the future given the fact that our labor isn't going to be worth that much comes from our economic and political system surviving for your million X's and P equity to mean something for your contracts to mean anything for the government to be able to tax the AI labor and give you a UBI off of that it just like that requires our legal institutions our economic institutions our financial rail surviving in the future.

你知道，没有比出租车钟声更好的税务福利了。我并不是说我有所有答案，但要找到一个方法，可以让经济收益更广泛地分享给大家，或者加大对机器人和数据收集的投资，以便更快地实现机器人技术和物质丰裕。同时，投资于生物研究，以更快地取得进展，以促进这些方面的快速发展。他们基本上是试图提前实现这些激进的好处，否则前景将会相当暗淡。我认为有一件事没有被足够重视，那就是考虑到未来我们的劳动力可能不值太多钱，我们对于未来的影响力很大程度上来自于我们的经济和政治体系能否持续下去。比如说，你的股票价值数百万倍，你的合同有意义，政府能够对人工智能劳动征税，然后通过这种税收为你提供基本收入。这一切都依赖于我们的法律、经济和金融体系在未来能够持续运行。

Yes the way in which that likely happens is if it's also in the AI's best interests that they follow those rails and by AI I don't mean some monolithic single AI I just mean like firms which are employing AI and becoming more productive as a result. You don't want to be in a position where it's so onerous to operate in our system that you're basically selecting for firms who either emigrate or who are like doing black market stuff etc and which means I think like you want to make it super super easy to deploy AI have the equivalent to special economic zones etc because otherwise you are just surrendering the future outside of any control that you might have on it.

是的，要实现这一点，最有可能的方式是确保让AI顺利运行符合它们的最佳利益。在这里，我所指的AI不是某个单一的、统一的AI，而是那些使用AI来提高生产力的公司。我们不希望处于一种令人望而却步的环境中，以至于公司要么选择迁出，要么从事黑市活动。因此，我认为我们需要极大地简化AI的应用，比如设立类似经济特区的措施。否则，我们可能会失去对未来的任何控制权。

One of the reasons by the way that I worry about turning AGI into a national security issue or having it have extremely close ties with the government the Manhattan Project thing is that it disproportionately redirects the use of AI towards military tech and the mosquito drones and whatever and and also naturally puts other countries in the same frame of mind right if we're developing the mosquito drones why we China not develop the mosquito drones and that just seems like a zero-sum race and not to mention a potentially catastrophic one whereas like you know like compute we limited you know we want we will need to disproportionate accelerate some things to the extent it just remains totally like a consumer free market landscape it just seems more likely that we'll get the glorious transhumanist future where they're developing the things that make human life better.

顺便说一下，我之所以担心把人工通用智能（AGI）变成一个国家安全问题，或者与政府紧密合作，就像曼哈顿计划一样，是因为这样会将人工智能的用途过度引导至军事技术，如蚊子无人机等。而且这样还会自然而然地让其他国家产生同样的想法：如果我们在开发蚊子无人机，为什么中国不开发蚊子无人机呢？这看起来像是一场零和竞赛，而且可能导致灾难性的后果。相比之下，如果我们将这个领域限制在一个自由市场的环境中，我们只需要在某些方面加速发展，就更有可能实现一个美好的未来，比如通过技术进步来改善人类生活。

Yes I agree like the case where you end up with like two national projects facing off against it dramatically worse right like we don't want to live in that well yeah it's much much better if it stays a free market so to speak. Yeah yeah yeah okay I want to take issue with your claim that even if with the with the algorithms of today if we just collect enough data yeah that we could automate why color work first let me get an understanding of what you mean by that so do you mean that we would do the analogous thing of free training with all the trajectories of everything people would do on their jobs could you say could you make either manually or through some other process some RL procedure based on the screen recordings every white color worker what kind of thing are you imagining.

是的，我同意，就像那种情况，你最终可能会碰到两个国家级项目相互对峙，情况会显得非常糟糕，对吧？我们不想生活在那样的环境中。保持市场相对自由显然要好得多。这点我完全同意。不过，我想质疑一下你的观点，你说即便使用当今的算法，只要我们收集足够的数据，就可以自动化白领工作。首先，我想了解你这话是什么意思。你是否是说，我们可以像预训练那样，收集所有人工作中的行为轨迹，手动或通过其他途径，比如一些基于屏幕录制的强化学习过程，来实现类似的事情。你想象的是什么样子的情景？

I mean there's like a continuous distribution of this stuff. One like important like mental model to think about RL is I think as like the task gets more there is some respect with which like longer horizon or better that task if you can do them if you can get that reward ever are like easier to judge. So like again it's come back to that like can you make money on the internet that's an incredibly easy reward signal to judge but to like do that there's like a whole hierarchy of like complex behavior so if you get like pre-trained up to the easy to judge reward signals like does your website work does it go down I do people like it like there's all these reward signals that we can respond to because we have a long in we can like progress do these long enough trajectories to actually like get to interesting things if you're stuck in this regime where like you need to reward signal every five tokens like it's a way more painful and like long process but if you could like pre-trained on every like screen in America then probably the like RL tasks that you can design are very different to like if you could only like take the existing internet as it is today and so like how much of that you get access to like changes the mix.

我认为这里就像有一个连续分布的概念。一个重要的思维模型是，当我们考虑强化学习（RL）时，随着任务的难度增加，越是那些需要更长期视野或者需要你能成功获取奖励的任务，越容易进行判断。比如说，我们回到一个问题：你能否通过互联网赚钱？这是一种非常简单易判断的奖励信号，但为了实现这个目标，背后有一整套复杂的行为层级。因此，如果你能预先训练来适应这些容易判断的奖励信号，比如“你的网站是否工作正常？它有没有宕机？用户是否喜欢它？”，我们就可以依据这些奖励信号做出反应，因为我们有足够的时间和经验去完成这些长期的任务，最终达到有趣的目标。但是如果你被困在需要每五个单元就获得一次奖励信号的阶段，那将是一个更加痛苦且漫长的过程。然而，如果你能事先在美国所有的屏幕上进行训练，那么可设计的RL任务就会与仅通过现有的互联网进行训练大不相同。因此，你能接触到多少这种资源会改变整个组合。

Interesting. So as we're training them on longer and longer horizon tasks and it takes longer for them to get any any signal on whether they're still so completely the task well that's low down progress because it takes more compute per task. I do think there's this notion the longer the harder tasks the more training is required and I'm sympathetic to that naively but we as humans are very good at practicing the hard parts of tasks and decomposing them and I think once models get good enough at the basic stuff they can just rehearse or fast forward to the more difficult parts. I mean it's definitely one of the complexities right like as you use more compute and like the and as you're trying to like more and more difficult tasks I mean I don't know your rate of improvement of biology is going to be like somewhat bound by the time it takes the seltogra in a way that your rate of improvement on math isn't for example.

有趣。我们在训练模型处理越来越长远的任务时，模型得到任何反馈信号所需的时间变得更长，这也就意味着因为每个任务都需要更多的计算，这会减缓进展。我确实认为，任务越长、越难，所需的训练就越多，我在某种程度上也能理解这一观点。但我们人类很擅长练习任务中难的部分，并将其分解。我认为，一旦模型在基础部分已经足够优秀，它们就可以专注于练习或快速推进到更困难的部分。我指的是，这是复杂性之一，比如说当你使用更多的计算资源，尝试更复杂的任务时，我不太确定你在生物学方面的进步速度会如何，因为它在某种程度上受限于细胞生长所需的时间，而数学的进步速度则不太受这种限制。例如。

So yes but I think for many things we'll be able to parallelize like widely enough and get enough iteration loops. Will the regime of training new models go away? Will we eventually get to like you got the model and then you just keep adding more skills to it with our training? That depends on whether or not you think like there's a virtue in pre-training in your architecture. Basically you make some like architectural change then you like probably need to like do some form of like at least like a retraining in your model. How does the fact that if RL requires a bunch of inference to do the training in the first place does that push against the thing you were talking about where we actually need a bigger model in order to have brain like energy but then it's also more expensive to train it in RL so where does that balance out?

所以是的，我认为我们可以将许多事情广泛地并行化，获得足够多的迭代循环。训练新模型的过程会消失吗？我们最终会不会达到你只需不断为已有的模型添加更多技能，而不需要额外训练的程度？这取决于你是否认为在架构中预训练是有价值的。基本上，你做了一些架构上的改变，可能就需要对模型进行某种形式的再训练。关于强化学习（RL），如果它在训练过程中需要大量的推理操作，这是否会影响你所谈论的那个需要更大模型来达到类似大脑能量的情况，但同时在强化学习中训练它们的成本也会更高，这中间的平衡点在哪里？

I think we got to drink the bitter lesson here and yeah like there aren't infinite shortcuts like you do just have to scale. Something's I have a bigger model and pay more inference for it and if you want a GI then that's what you got to pay the price of. But there's like there's a trade-off equation here right? There is science to do which you know everyone is doing of what is the optimal point at which to do RL because you need something which can both learn and discover the sparse reward itself. So you don't want to want a one-paround model useless even though you can run it really fast. You also don't want 100-team model because it's super slow.

我认为我们必须吸取这个惨痛的教训。是的，没有无限的捷径，有时你就是需要扩大规模。有些情况下，你可能需要一个更大的模型，同时也要为它支付更多的推理费用。如果你想得到人工智能，这就是你需要付出的代价。但这里显然存在一个权衡的问题，对吧？这其中有科学研究，大家都在寻找在什么情况下进行强化学习（RL）是最佳的，因为你需要一个既能学习又能自己发现稀疏奖励的模型。所以，你不想要一个虽然运行很快但完全无用的简单模型，但你也不想要一个虽然功能强大但运行极其缓慢的复杂模型。

Yeah. Password RL. So the marginal benefit of like it's learning efficiency is not worth it. So there's a pretty fun to hear. What's the optimal model size of your current class of capabilities and your current set of RL environments? Even in the last year there's been much more of a factor of the inference cost. So just explicitly the bigger the model the more expensive it is to do a forward pass and generate tokens. And the calculus used to just be should I allocate my flops to more training data or a bigger model?

好的。密码是 RL。因此，它在学习效率方面的边际效益并不值得。因此，有一个相当有趣的问题，那就是在当前的能力范围和 RL 环境下，最优的模型规模是多少。即使在去年，推理成本也成为了一个更大的因素。因此，明白地说，模型越大，执行一次前向传递和生成标记的成本就越高。而过去的计算只是在于我应该把我的计算能力分配给更多的训练数据还是更大的模型。

Yeah. And now another huge factor is how much am I actually going to do forward passes on the most model once it's trained? Yeah. My total pool of compute, how do I allocate that across trained data compute and inference compute for the RL training? And then even within inference there's all this research on well what strategy should I use? Should I sample 10 and take the best? Do I do this sort of like branching search, etc., etc. And so with RL where you're sampling a whole lot of tokens you also need to factor in the ability for the model to actually generate those tokens and then learn and get feedback.

好的。另一个重要因素是，在模型训练好之后，我实际会进行多少次前向传播？是的，我的总计算资源池，该如何在训练数据计算和用于强化学习训练的推理计算之间进行分配？即使在推理中，也有很多研究讨论我应该采用什么策略。是采样10次然后选最好的结果？还是进行一种类似分支搜索的策略？等等。在强化学习中，你需要采样大量的标记，所以还需要考虑模型生成这些标记并进行学习和反馈的能力。

Okay. So if we're living in this world what is your advice to somebody early in their career or student in college? What should they be planning on doing? Yeah. So I think once again it's worth considering the spectrum possible worlds and preparing yourself for that. And the one like the sort of action that I think is like highest EV in that case is you are about to get dramatic in the minimum you are about to get dramatically more leverage. You already have like already the startups in YC like writing huge amounts of code with you know, Claude. So what challenges, what causes do you want to change in the world with that added leverage? Like if you had 10 engineers at your Beckenkohl, what would you do? Or if you had a company at your Beckenkohl, like what would that enable you to do? And what problems and domains suddenly become tractable? That's the world you want to prepare for.

好的。那么如果我们生活在这个世界上，你对那些职业早期的人或在校大学生有什么建议？他们应该计划做些什么？是的，我认为值得再次考虑一下各种可能的世界，并为此做好准备。我认为在这种情况下最有价值的行动是准备好迎接未来的巨大杠杆效应。你已经看到了，例如，在YC创业公司中，人们正通过 Claude 等工具编写大量的代码。那么，你想利用这种额外的能力在世界上改变什么问题或事业？比如说，如果你有10名工程师为你效力，你会做些什么？如果你拥有一家公司的资源，你将能够实现什么？哪些问题和领域会因此变得可解决？这就是你应该为之准备的世界。

Now that still requires a lot of technical depth. Obviously there is the case where AI just becomes dramatically better than like everyone had everything, right? But for at least a while probably there is advantage. I think Jensen actually talked about this in an interview in which he's like you know, I have like 100,000 general intelligences around me and I'm still like somewhat useful because I'm there like you know, directing the values and like like you're like asking to do things and you know, they're like I still have value even though I have 100,000 general intelligences. And for many people I think that will still be true for a fair while.

这需要很高的技术深度。显然，有一种情况是，人工智能变得比所有人的能力都大幅提升，对吧？但至少在某段时间内，人类可能还是有优势的。我记得Jensen在一个采访中谈到过这一点，他说，即使我身边有10万个通用智能，我仍然有一定的价值，因为我可以指导它们的价值观，还有让它们去做事情。虽然我有10万个通用智能，我仍然有用处。对于许多人来说，我认为这种情况在相当长的一段时间内仍然会成立。

And then you know, as they are, they get better and better and better and like so on eventually. No, but again, prepare for like the spectrum of possible worlds because in the event where we're just totally out competed, yeah, that's my what you do. But in all the other worlds, that is a lot. Get the technical depth, study biology, study CS, like really think hard about study physics, think about hard about what challenges you want to solve on the world.

然后你知道，他们会变得越来越好，越来越优秀，最终如此。但是，再次强调，要为各种可能的世界做好准备。因为如果我们完全被超越了，那就是你该应对的情况。但在其他所有情况下，还有很多事情可以做。积累技术深度，学习生物学，学习计算机科学，认真思考学习物理学，认真考虑你想要解决世界上的哪些挑战。

Yeah, that's a lot of topics. You can't. You can't. It's so much easier to learn. Everyone now has the like infinite perfect tutor. Yeah, yeah. Yeah, it's definitely been helpful to me. Yeah, I would say some combination of like get rid of the sunk cost of your like previous workflows or expertise in order to evaluate what AI can do for you. That's right. And another way to put this, which is fun is just like be lazier. In so much as like figure out the way that the agent can do the things that are toilsome. But but it's you're going to have to in the ultimately you get to be lazier, but in the short run, you need to like critically think about the things you're currently doing and like what an AI could actually be better at doing.

是啊，这涉及很多话题。你不能。你不能。这让学习变得简单得多。现在每个人都有一个像无限完美的导师。对，对。对我来说，这确实很有帮助。我会说，这需要结合一些想法，比如放弃过去的工作流程或专长的沉没成本，以便评估人工智能能为你做些什么。没错，另一种有趣的说法就是变得懒惰一些，找出代理可以完成那些繁琐事情的方法。你最终可以变得更懒散，但在短期内，你需要认真思考你目前在做的事情，以及人工智能在其中可能做得更好的方面。

Yeah. And then go and try it or explore it. Because I think there's like still just a lot of low hanging fruit of people assuming and not writing the full prompt, giving a few examples, connecting the right tools for your work to be accelerated, automated. Yeah. Yeah. There's also the sunk cost of feeling like since you're not quote unquote early to AI that you've sort of missed the boat and you can't like, but I I think I mean, I remember when GPT three came out. So that story on the podcast, when I graduated at college, I was planning on doing some sort of AI wrapper startup. And the podcast was just like a gateway into doing that.

好的。然后去尝试或探索一下吧。因为我觉得现在还有很多简单易做的事情，人们往往没有写出完整的提示，只给出了一些例子，没有正确地将工具连接起来，从而加速或自动化他们的工作。是的是的。还有一种沉没成本的感觉，就是因为你不是所谓的AI早期参与者，所以错过了这个机会，觉得自己不能再参与。但我觉得，我记得当GPT-3刚出来的时候，在那个播客的故事中，当我大学毕业时，我计划要做一个关于AI的初创公司，而那个播客只是一个开启这项事业的契机。

And so it's trying out like different things. And at the time, I remember thinking, oh, 3.5 is out. And people are like, I'm like so behind on like the startup scene here or whatever if I wanted to make my own wrapper. I mean, maybe that idea of the wrapper was an advisable in the first place, but just like every time feels early because like it's sort of if it's an exponentially growing process. And there were many things, many ideas are only becoming possible now, right?

试着翻译成中文如下：所以就尝试各种不同的事情。当时，我记得心想，哦，3.5版已经出来了。人们都在说，我在创业领域已经落后了，如果我想做自己的"包装器"。也许一开始关于"包装器"的那个想法就不是明智的，但因为这似乎是一个指数级增长的过程，每次都感觉为时过早。而且现在有许多想法才刚刚变得可能，对吧？

Yeah. Exactly. It's a product expenditure. I talk about products that literally obsolete it. Like you need to constantly reinvent yourself to stay the frontier of capabilities. But do you remember I had a really shitty idea and I give you a call. It was. It was like, I think it was like rag for like lawyers or something. Right.

是的，没错。这是一个产品支出的问题。我谈论的是那些会让产品过时的事情。为了保持能力的前沿地位，你需要不断地自我革新。但是你还记得吗，我以前有一个很糟糕的想法，然后给你打了电话。就是那个，我想可能是关于律师用的抹布还是什么的，对吧。

Anyways, I give you I think one of our first interactions was I'm like, hey, what do you think of this idea? And you're like, I think the podcast sounds promising. That's right. Which I appreciate. Yeah. I got slightly annoyed at a friend recently who I think is really talented and clever and interested in AI, but has pursued a biology route. And I just kind of tried to shake them of like, you can work on AI if you want to. I mean, I think humans are artificial or not artificial or biological general intelligences, where a lot of the things of value are just very general.

好的，我记得我们最初的交流之一是我问你对某个想法的看法，然后你回答说：“我觉得这个播客很有前景。”这让我很感激。最近我对一个朋友感到有点恼火，我觉得他非常有才华和聪明，对人工智能很感兴趣，但却选择了生物学的方向。我试图让他认识到，如果他愿意，他也可以从事人工智能方面的工作。我认为人类无论是不是人工的，都是生物通用智能体，其中很多有价值的东西都是非常一般性的。

Yeah. And whatever kind of specialization that you've done, maybe just doesn't matter that much. I mean, again, it gets back to the sunk cost. But like so many of the people even my like colleagues at Anthropic. are excited about AI and they just don't let their previous career be a blocker. And because they're just like innately smart, talented, driven, whatever else, they're they end up being very successful and finding roles. It's not as if they were in AI forever. I mean, people have come from totally different fields. And so don't think that you need like permission from some abstract entity to get involved and apply and be able to contribute.

是的，无论你之前做了什么样的专业化，可能都没有那么重要。再次说到沉没成本，但其实很多人，包括我在Anthropic的同事，对人工智能非常感兴趣，他们并没有让自己之前的职业成为障碍。因为他们本身就非常聪明、有才华、有动力等，无论如何，他们最终都非常成功，并找到合适的角色。他们并不是一直都在从事人工智能领域。有些人是完全从其他领域转过来的。所以，不要觉得你需要某种抽象许可才能参与、申请并做出贡献。

If somebody wanted to be in AI research, like right now if you give them an open problem, or like the kind of open problem that is very likely to be quite impressive. What would it be? I think that now that our roles like come back, paper is building on Andy Jones's like scaling board like scaling laws for board games are interesting, like showing that you can like investigating these questions like the ones you asked before where you're like, oh, like you're, these the model actually learning to do more than its previous POSIT K or is it just like discovering that like exploring questions like that deeply, I think are interesting.

如果有人现在想从事人工智能研究，比如给他们一个开放性的问题，或者一个可能非常有吸引力的开放性问题，那会是什么呢？我认为，如今我们重新讨论的那些课题，比如基于Andy Jones的研究，关于棋类游戏的扩展规律是很有趣的。去深入探讨这些问题，比如你之前提到的问题，模型是否真的在学习超越从前的能力，或者只是发现了一些规律。我觉得深入探索这些问题是很有意义的。

Yeah, yeah. Like scaling laws for all basically. Yeah, very curious to see like how much like the marginal increase in metal learning from a new task or something. I mean, on that note, I think I think model dipping has like a bunch of opportunities. Yeah. Also, people say, oh, we're not capturing all the features. There's always stuff left on the table. What is that stuff that's left on the table? Yeah. Like if the model's jailbroken, is it using existing features that you've identified? Is it only using the error terms that you haven't captured?

是的，是的。这就像是为所有事物制定的规模法则。我很好奇的是，从新任务中学习的边际增益究竟有多大。在这个话题上，我觉得模型扩展有很多机会。同时，人们常说，我们没有捕捉到所有的特征，总有一些东西被忽略。那么，这些被忽略的东西是什么呢？比如，如果模型被突破了，它是在利用你已经识别出的现有特征吗？还是仅仅在利用那些你没有捕捉到的误差项？

Yeah. I don't know. There's a lot here. I think Matt's is great. The Anthropic Fellowship has been going really well. Good fire and Thropic invested in recently. They're doing a lot of interpretability work or just applied anything to anything to get your equity. There's just so many interpretability projects that are like, there's so much low hanging fruit and we need more people and I don't think we have much time.

好的。我不太确定，但这里信息量很大。我觉得Matt做得很不错。Anthropic Fellowship发展得非常好。最近，他们在做很多可解释性方面的工作，也只是将各种资源应用于不同领域以获取收益。有太多可解释性的项目，就像有很多容易实现的目标，我们需要更多的人手，而且我觉得我们没有太多时间。

Yeah. I also want to make a plug for performance engineering. I think this is one of the like like best ways to to demonstrate that you have like the raw ability to do it. If you made an extremely efficient transform implementation on TPU or Trinium or like Incruda, then I think there's a pretty high likelihood that you'll get a job. There's a relatively small pool of people that you can trust to completely own and to end the performance of a model.

好的。我也想为性能工程做一个推荐。我认为这是展示你具备实际能力的最佳途径之一。如果你在TPU、Trinium或Incruda上实现了一个极其高效的转换，那么你获得工作的可能性就非常高。能够从头到尾全程负责一个模型性能的人相对较少。

And if you have broad, deep electrical engineering skills, I think you can probably come up to speak pretty fast on an accelerator. You can come up to speak reasonably fast and it teaches you a lot of good intuitions to be actual intricacies of what's going on in the models, which means that you're then very well placed to think about architecture and this kind of stuff. One of my favorite people in thinking about architecture and Anthropic at the moment actually came from like a heavy GPU kernel program background to think no is the ins and outs really deeply and can think about the trade-offs really well.

如果你具备广泛且深入的电气工程技能，我认为你可以很快上手加速器相关工作。你能够快速掌握相关知识，并且这会教会你对模型运作的许多直观理解，这意味着你非常适合思考架构及相关内容。我现在在Anthropic最喜欢的一位架构思考者之一，其实就是从深厚的GPU内核编程背景转向的，他对其中的细节了如指掌，并能很好地权衡各种利弊。

This is fun, guys. It's really fun. Thanks. Yeah. Great to be back. I hope you enjoyed this episode. If you did, the most helpful thing you can do is just share it with other people who you think might enjoy it. Send it to your friends, your group chats, Twitter, wherever else. Just let the word go for it. Other than that, super helpful if you can subscribe on YouTube and leave a five star review on Apple Podcasts and Spotify.

这很有趣，大家。这真的很有趣。感谢你们。是的，很高兴还能回来。我希望你们喜欢这一集。如果你喜欢，最有帮助的事情就是把它分享给你认为可能会喜欢的人。发给你的朋友、微信群、推特或者其他地方。让更多人知道。此外，如果你能订阅我们的 YouTube 频道，并在苹果播客和 Spotify 上留下五星好评，那就太有帮助了。

Check out the sponsors in the description below. If you want to sponsor a future episode, go to doarkesh.com slash advertise. Thank you for tuning in. I'll see you on the next one.

请查看下方描述中的赞助商。如果您想赞助未来的节目，请访问 doarkesh.com/advertise。感谢您的收看，我们下期再见。

How Does Claude 4 Think? – Sholto Douglas &amp; Trenton Bricken

How Does Claude 4 Think? – Sholto Douglas & Trenton Bricken