Okay, I'm joined again by my friends, Shulta Brickin, wait. What? Did he just laugh? You did just shimmy. No, you named us differently. But we didn't have Shulta Brickin and Trenton Brickin. Shulta Douglas and Trenton Brickin, who are now both at Anthropic. Yeah. Shulta. Oh, what? Shulta is scaling RL, Trenton's still working on Mechanistic Interpreability. Welcome back. Happy video. Yeah, it's fun. What's changed since last year? We talked basically this month in 2024. Yeah. Now we're in 2025. What's happened?
Okay, so I think the biggest thing that's changed is RL and language models has finally worked. And this is manifested in, we finally have proof of an algorithm that can give us expert, human, reliability and performance, given the right feedback loop. And so I think this is only really being like conclusively demonstrated in competitive programming and math, basically. And so if you think of these two axes, one is the intellectual complexity of the task, and the other is the time horizon of which the task is being completed on. And I think we have proof that we can reach the peaks of intellectual complexity along many dimensions.
But we haven't yet demonstrated like long running, agentic performance. And you're seeing the first stumbling steps of that now and should see much more conclusive evidence of that, basically, by the end of the year, with real software engineering agents doing real work. I think Trenton, you're like experimenting with this at the moment. Yeah, absolutely. I mean, the most public example people could go to today is CloudPlace Pokemon. Right. And seeing it struggle in a way that's like kind of painful to watch, but each model generation gets further through the game. And it seems more like a limitation of it being able to use memory system than anything else.
Yeah. I wish we had recorded predictions last year. We definitely should this year. Yeah, hold us accountable. Yeah, that's right. Would you have said that agents would be only this powerful as of last year? I think this is roughly on track for where I expected with software engineering. I think I expected them to be a little bit better at computer use. Yeah. But I understand all the reasons for why that is. And I think that's like well on track to be solved. It's just like a sort of temporary lapse.
And holding the account of my predictions next year, like I really do think end of this year, sort of like this time next year. We have software engineering agents that can do close to a day's worth of work, like for like a junior engineer, or like a couple of hours of like quite competent independent work. Yeah, that seems right to me. I think the distribution's pretty wonky though. Yes. For some tasks, I don't know, like boilerplate, website code, these sorts of things. In Spain, right? It can bang it out and save you a whole day.
Yeah, exactly. Yeah, I think that's right. I think last year you said that the thing that was holding them back was the extra nine's reliability. I don't know if that's the way you would still describe the way in which these software agents aren't able to do a full day of work, but are able to help you out with a couple of minutes. Is it the extra nine's that's really stopping you or is it something else?
Yeah, I think my description there was, I think like in retrospect, probably not what's limiting them. I think what we're seeing now is closer to lack of context, lack of ability to like do the complex, like very multi-file changes and like sort of like, maybe like scope or of the change or scope of like the task in some respects, they can cope with high intellectual complexity and like a focused context with a hot, with a really like scope problem.
But when something's a bit more and more first or requires a lot of discovery and iteration with the environment and this kind of stuff, they struggle more. Yeah. And so maybe the way I would define it now is the thing that's holding them back is, if you can give it a good feedback loop for the thing that you want it to do, then it's good, it's pretty good. If you can't, then they struggle a bit.
Can you, and for the audience, can you say more about what you mean by this loop? Yeah, if they're not aware of what's happening in RL and so forth. Yes. So the big thing that really worked last year is, maybe like broadly the domain is called like RL from verifiable rewards or something like this, where a clean rewards. So the initial unhopalling of language models was RL from human feedback, where typically it was something like hair-wise feedback or something like this and the outputs of the models became closer and closer to things that humans wanted.
But this doesn't necessarily improve their performance at any like difficulty or problem domain, right? Particularly as humans are actually quite bad judges of what a better answer is. Humans have things like length biases and so forth. So you need a signal of whether the model was correct in its output that is like quite true, let's say. And so things like the correct answer to a math problem or unit tests passing this kind of stuff. These are the examples of reward signals that's very clean, but even these can be hacked, by the way. Like even unit tests, the models find ways around it to like hacking particular values and hard code values of unit tests if they can figure out like what the actual test is doing, like they can like look at the cache path and files and find what the actual test is, they'll try and hack their way around it. So these aren't perfect, but they're much closer.
And why is it getting so much better at software engineering than everything else? In part because software engineering is very verifiable, like it's a domain which just naturally lends it to this way. I think does the code pass a test? Does it even run? Does it compile? Yeah, does it compile? Does it pass the test? You know, you can go on the code and you can run tests and like you know whether or not you got the right answer. Yeah. But there isn't the same kind of thing for like writing a great essay. That requires like the question of like taste in that regard is quite hard. Like we discussed the other night at dinner, the Pulitzer Prize, like you know, which would come first like a Pulitzer Prize winning novel or like you know a Nobel Prize or something like this. And I actually think a Nobel Prize is more likely than a Pulitzer Prize winning novel in some respects.
But there's a lot of the tasks required in winning a Nobel Prize or at least like strongly assisting and helping win a Nobel Prize have more like layers of verifiability built up. So I expect them to like accelerate the process of doing Nobel Prize winning work more initially than that of like writing Pulitzer Prize with the novels. Yeah, I think if we rewind 14 months to when we recorded last time, the ninth of a reliability was right to me. Like we didn't have cloud code. We didn't have deep research. All we did was use agents in a chatbot format. Right. Puppy-faced, copy-faced, copy-faced. Totally.
And it's I think we're very used to chat interfaces whether we're texting or using Google. And it's weird to think that the agent can actually go and fetch its own context and store its own facts into its memory system. And I still think that it's the ninth's reliability. And if you scaffold the model correctly or prompt it, it can do much more sophisticated things than the average user assumes. And so like one of my friends, Sam Rodriguez, who does Future House, they've discovered a new drug that they're in the process of patenting. And by the time this episode comes, LSDV2, that will be large.
Well, is that LSDV2? Is that really? No, I'm not trying to. They're not making LSD. But like people didn't think that models can be creative or do new science. Right. And it does just kind of seem like a skill issue. I mean, there was the cool. Wait, wait, wait, wait, wait, wait, it discovered a drug. Is it like how did it like it did? Like I think it one shot at the long-term. So this was just over a conversation. And so we'll need to refer to the full announcement. But my impression is that it was able to read a huge amount of medical literature and brainstorm and make new connections and then propose wet lab experiments that the humans did.
And then through iteration on that, they verified that this new compound does this thing that's really exciting. Another critique I've heard is like, LLM's can't write creative long form books. And I'm aware of at least two individuals who probably want to remain anonymous, who have used LLM's to write long form books. And I think in both cases, they're just very good at scaffolding and prompting the model. I mean, even with the viral chat GPT geogessor capabilities, where it's just insanely good at spotting like what beach you were on from a photo. Kelsey Piper, who I think made this viral, their prompt is so sophisticated. It's really long. And it encourages you to think of five different hypotheses and assign probabilities to them and reason through the different aspects of the image that matter.
And I haven't AB tested it, but I think unless you really encourage the model to be this thoughtful, you wouldn't get the level of performance that you see with that ability. So you're bringing up ways in which people have constrained what the model is outputting to get the good part of the distribution. But one of the critiques I've heard of RL, or the not of RL, but one of the critiques I've heard about using the success of models like O3 to suggest that we're getting new capabilities from these reasoning models is that all these capabilities were already baked in the pre-training model.
I think there's a paper from Stinks-Wallah University where they showed that if you give a base model enough tries to answer a question, it can still answer the question as well as the reasoning model. Basically, it just has a lower probability of answering. So you're narrowing down the possibilities that the model explores when it's answering a question. So are we actually eliciting new capabilities with this RL training, or are we just putting the blinders on them? Right, like carving away the models on this. I think it's worth noting that paper was impretchable in the llama and quen models.
And I'm not sure how much RL compute they used, but I don't think it was anywhere comparable to the amount of compute that was used in the base models. And so I think the amount of compute that you use in training is a decent proxy for the amount of actual raw new knowledge or capabilities you're adding to a model. So my prior at least, if you look at all of DeepMind's research from RL before, RL was able to teach these go-and-chest playing agents new knowledge that were in excess of human level performance just from RL signal provided the RL signal was sufficiently clean.
So there's nothing structurally limiting about the algorithm here. They're like, prevents it from imbuing the neural net with new knowledge. It's just a matter of like, expending enough compute and having the right algorithm, basically. Why aren't you already spending more compute on this? I think Garius had in his blog post that, or it was like a couple of months ago, the actual control thing is like, deep seek, whatever. They were only spending $1 million on RL or something.
So it's like, we aren't in the compute limited regime for RL yet, but we will be soon. You're spending hundreds of millions on the base model, why only order a million on the RL? You know that the parable about when you choose the launch space mission, how you should acquire, go further up the tech tree, because if you launch later on, you'll just ship will go faster in this kind of stuff. I think it's quite similar to that.
You want to be sure that you algorithmically got the right thing, and then when you bet and you do the large compute spend on the run, then it'll actually pay off without the right compute efficiencies in this kind of stuff. Now I think like RL is slightly different to pre-training in this regard, where RL can be a more iterative thing that you're progressively adding, capabilities to the base model, pre-training has, you know, in many respects, like if you're halfway through a run and you've messed it up, then like you've really like messed it up.
But I think that's what that's like the main reason. Why? We're still figuring out exactly what they want to do. I mean, 01 to 03, right? Like opening up, putting their blog post there, there was a 10X compute multiplier over 01. So like clearly they bet on one level of compute, and they were like, okay, this seems good, let's actually release it, let's get it out there.
And then they spent the next few months, like you know, increasingly amount of compute that they spent on that. And I expect as everyone is, everyone else is scaling up RL right now. So I basically don't expect that to be true for the, for the line. Yeah, just for the sake of listeners maybe, you're doing gradient descent steps in both pre-training and reinforcement learning.
It's just the signals different, typically in reinforcement learning, your reward is sparser. So you take multiple turns, it's like did you win the chess game or not? Is the only signal you're getting? And often you can't compute gradients through discrete actions. And so you end up losing a lot of gradient signal. And so you can presume that pre-training is more efficient, but there's no reason why you couldn't learn new abilities in reinforcement learning.
In fact, you could replace the whole next token prediction task in pre-training with some weird RL variant of it. And then do all of your learning with RL. Yeah. Yeah, at the end of the day, just signal and then correcting to it totally. And then going back to the paper you mentioned, aside from the caveats that the Chalto brings up, which I think is the first order, most important, I think zeroing in on the probability space of like meaningful actions comes back to the 9s of reliability. And like, classically, if you give monkeys a typewriter eventually, they'll write Shakespeare, right? And so the action space for any of these real world tasks that we care about is so large that you really do care about getting the model to zero in on doing the reasonable things.
And to the extent, in some broad sense, to the extent that at some pass of K, you've got token space. Exactly. You literally do have a monkey and it's making Shakespeare in there. Yeah, exactly. OK, so the chest analogy is interesting. So we would say something. So I was just going to say, you do need to be able to get reward sometimes in order to learn. And that's like the complexity in cybertext. In the alpha variance, I mean, maybe you're about to say this, one player always wins. So you always get a reward scene on one way or the other. But in the kinds of things we're talking about, you need to actually succeed at your task sometimes.
So language models, luckily, have this wonderful prior of the task that we care about. So if you look at all the old papers from 2017, the reward, the learning curves always look like flat, flat, flat, flat, flat, as they're figuring out basic mechanics of the world. And then there's this spike up as they learn to exploit the easy rewards. And then it's almost like a seed moan in some respects. And then it sort of continues on indefinitely. Is it just learns to absolutely maximize the game? And I think the LLM curves look a bit different in the isn't that dead zone at the beginning.
Yeah, interesting. They already know how to solve some of the basic tasks. And so you get this initial spike. And that's what people are talking about when they're like, oh, you can learn from one example. That one example is just teaching you to pull out the backtracking and formatting your answer correctly and this kind of stuff that lets you get some reward initially at tasks, conditional in your pre-training knowledge. And then the rest probably is you learning more and more complex stuff.
Yeah, yeah. It would also be interesting. I know people have critiqued or been skeptical of RL's or RRN Quickwinds by pointing out that AlphaGo took a lot of compute, especially for a system trained in what was it, 2017? Yes, it's like off the cuff. Totally. Yeah, right. So to the extent that was largely because first you had to have something which had some biases which were sort of rational before it got superhuman to go. I actually would be interesting to see what fraction of the compute used in AlphaGo was just getting something reasonable.
Yes, yeah, it would be interesting. I mean, to make the map from pre-training to RR really explicit here, during pre-training the large language model is predicting the next token of its vocabulary of let's say I don't know, 50,000 tokens. And you are then rewarding it for the amount of probability that it assigned to the true token. And so you could think of it as a reward, but it's a very dense reward where you're getting signal at every single token and you're always getting some signal. Even if it only assigned 1% to that token or less, you're like, oh, I see you assigned 1% good job keeping doing that. Upweight it.
Yeah, exactly. Exactly. Like a tug in the green. That's right. So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different from if you're trying to do a math problem and you fail, it's actually even more useful often than learning about math and the abstracts because, oh, you don't think so. Yeah, almost like feedback.
Yeah, only if you get feedback. But I think there's a way in which you actually give yourself feedback. You're like, you fail and you notice where you failed. Only if you get feedback at times. Yeah. People have like figured out new math, right? And they've done it by the fact that like they get stuck somewhere. They're like, why am I getting stuck here? Like, let me think through this. Whereas in the example, I mean, I'm not aware of what's like at the frontier, but like looking at open source, like implementations from deep secrets, something, there's not this like conscious process by which once you have failed, do you like learn from the particular way in which you failed to then like, backtrack and do your next things better.
Just like pure grading to send. And I wonder if that's a big limitation. I don't know. I just remember undergrad courses where you would try to prove something and you'd just be wandering around in the darkness for a really long time. And then maybe you totally throw your hands up in the air. I need to go and talk to a TA. And it's only when you talk to a TA, can you see where along the path of different solutions you were incorrect and like what the correct thing to have done would have been? And that's in the case where you know what the final answer is. In other cases, if you're just kind of shooting blind and meant to give an answer to Denouvo, it's really hard to learn anything.
I guess I'm trying to map on again to the human example where like in more simpler terms, there is this sort of conscious intermediary, like auxiliary loss that we're like optimizing. And it's like a very sort of self-conscious process of getting, forget about math. It's just like if you're on your job, you're getting very explicit feedback from your boss. That's not necessarily how the task should be done differently, but a high level explanation of what you did wrong, which you update on not in the way that pre-training updates waste, but more in the...
I don't know. I think there's a lot of implicit dense reward signals here. Exactly. Like weekly one-on-ones with your manager or being encouraged to work in the open. Or even with homework assignments, right? They're so scaffolded. It's always 10 questions broken down into sub-components. It may be the hardest possible problem is one where you need to do everything on your own. Okay, so then a big question is, do you need to build these scaffolds, these structures, these bespoke environments for every single skill that you want the model to understand, and then it's gonna be a decade of grinding through these sub-skills? Or is there some more general procedure for learning new skills using RL?
Yeah. So it's an efficiency question there. Obviously, if you could give it dense reward for every token, if you had a supervised example, then that's one of the best things you could have. But in many cases, it's very expensive to produce all of those scaffolded curriculum of everything to do. Having PhD math students, grade students, is something which you can only afford for the select cadre of students that you've chosen to focus in on developing. And you couldn't do that for all the language models in the world.
So like, first step is obviously that would be better, but you're gonna be sort of optimizing this prior frontier of how much am I willing to spend on the scaffolding, versus how much am I willing to spend on pure compute. Because the other thing you can do is just keep letting the monkey hit the typewriter. And if you have a good enough end reward, then eventually it will find its way. And so like, I can't really talk about where exactly people sit on that scaffold. I think like people, different tasks are like on different points there.
And a lot of it depends on how strong your prior over the correct things to do is. But that's the equation you're optimizing. It's like how much am I willing to burn compute, versus how much am I willing to burn like dollars on people's time to give scaffolding or give the... Interesting. Yeah. I'm not willing to do this for elements that we are for people. I would think that the economic logic would flow in the opposite direction. For the reason that you can amortize the cost of training any skill on a model across all the copies. We are willing to do this for elements like to some degree.
But there's an equation you're maximizing here. Like, okay, I've like raised all this money. Do I spend along this axis, or do I spend along this axis? And currently the companies are spending more on compute than they are on humans. Otherwise, like, scale AI's revenue would be like, you know, $10 billion, or you'd be like, like in this, okay, look at it, like, Nvidia's revenue is much higher than scale AI's revenue. Right. And so, like, currently the equation is compute over data. And like, that will evolve in some way over time. But yeah.
Interesting. Yeah, I'm curious how it evolves. Because if you think about the way the humans like learn to do a job. Yeah. They get deployed and they just like do the job and they learn. Whereas the way these models seem to be trained is that for every skill you have to like give them a sort of like very bespoke environment or something. If they were trained, the way humans are trained. Like on the job. Yeah, exactly. They know it would actually be super powerful because like, everybody has a different job, but then the same model could agglomerate like all the skills that you're getting.
I don't have anything like doing the podcast for the last few years. I'm like becoming a better podcaster. Yes. You have a slightly more valuable skill of doing AI research. I'm not a... But you could imagine a model like do both things. because it's like doing both of our jobs. Like copies of the model are doing both jobs. And so it seems like more bitter a lesson aligned to do this. Like just let the model like learn out in the world rather than, you know, like, you're spending billions on getting data for a particular task.
So I think again, we take for granted how much we need to show humans how to do specific tasks and there's like a failure to generalize here. Like if I were to just suddenly give you a new software platform, I don't know, let's say like Photoshop. And I'm like, okay, edit this photo. If you've never used Photoshop before, it'd be really hard to navigate. And I think you'd immediately want to go online and watch a demo of someone else doing it in order to then be able to imitate that. That amount of data on every single task, surely.
Okay. So this is the first thing. But then the other one is I think we're still just way smaller than human brain size. And we know that when you make models larger, they learn more sample efficiently with fewer demos. And like it was striking where even in your recent podcast with Mark Zuckerberg and Lama, it's like a two trillion perimeter model. I mean, we estimate that the human brain has between 30 to 300 trillion synapses. And I don't know exactly how to do a mapping from one to the other here. But I think it's useful background context that I think it's quite likely we're still smaller than the human brain.
And I mean, even with the 4.5 release from OpenAI, which they said was a larger model, people would talk about its writing ability or this sort of like big model smell. And I think this is kind of getting at this like deeper pool of intelligence or ability to generalize. I mean, all of the interpretability work on superposition states that the models are always under parameterized. And they're being forced to cram as much information as they possibly can. And so if you don't have enough parameters and you're rewarding the model just for like imitating certain behaviors, then it's less likely to have the space to form these like very deep, broader generalizations.
But even in light of all the languages, it's really cool. You should talk about the language result, you know, with how small models have formed separate, like have separate neurons for different languages, whereas larger models end up sharing more and more like an abstract space. So yeah, in the circuits work. I mean, even with the Golden Gate Bridge, and by the way, this is a cable from the Golden Gate Bridge. But the team, the entities stabilize the bridge in order to get this. But Claude will fix it. Claude loves the Golden Gate Bridge.
So even with this, right, like for people who aren't familiar, we made Golden Gate Claude when we released our paper, Scaling Monosanticity, where one of the 30 million features was for the Golden Gate Bridge. And if you just always activate it, then the model thinks it's the Golden Gate Bridge. If you ask if you're chocolate chip cookies, it will tell you that you should use orange food coloring or like bring the cookies and eat them on the Golden Gate Bridge. All of these sort of associations.
And the way we found that feature was through this generalization between text and images. So I actually implemented the ability to like put images into our feature activations because this was all on Claude 3SANA, which was one of our first multimodal models. So we only trained the Sparrow Sorrow and Coder, and like the features on text. And then a friend on the team put in an image of the Golden Gate Bridge. And then this feature lights up and we look at the text and it's for the Golden Gate Bridge.
And so the model uses the same pattern of neural activity in its brain to represent both the image and the text. And our circuits work shows this again with across multiple languages. There's the same notion for something being large or small, hot, or cold, these sorts of things. But like strikingly, that is more so the case in larger models. We think like actually larger models have more space so they could like separate it things out more. But actually instead they seem to pull on these like larger abstracts. They on better abstractions.
模型在其脑中使用相同的神经活动模式来表示图像和文本。我们的电路工作再次证明了这一点,而且适用于多种语言。关于大小、冷热这些概念是相同的。但是,令人惊讶的是,这一点在更大的模型中更为明显。我们原以为较大的模型会有更多空间来区分不同的事物,但实际上,它们反而 seem to focus on 更大的抽象概念和更好的抽象表现。
Yeah. Which is very interesting. Yeah. Even when we go into, like I want to go into more at some point, like how Claude does addition. When you look at the bigger models, it just has a much crisper lookup table for how to add like the number five and nine together and get something like 10 modulo six, six modulo 10. Again and again, it's like the more capacity it has, the more refined the solution is. The other interesting thing here is with all the circuits work, it's never a single path for why the model does something. It's always multiple paths and some of them are deeper than others.
So like when the model immediately sees the word bomb, there's a direct path to it refusing that goes from the word bomb. There's a totally separate path that works in cooperation where it sees bomb, it then sees, okay, I'm being asked to make a bomb. Okay, this is a harmful request. I'm an AI agent and I've been trained to refuse this. Right. So one possible narrative here is that as the model becomes smarter over the course of training, it learns to replace the short circuit imitation C-bomb refuse with this deeper reasoning circuit.
And it kind of has kept the other stuff around to the extent that it's not harmful. Well, that means that I do think it's your point on all these models as sample efficient assumes. Currently, we do not have evidence that they're a sample efficient assume. I think we have evidence of total complexity ceiling. Like they're currently nothing that provides you a clean enough signal, you can't teach them. But we don't have evidence that we can teach them as fast as humans do. And we would prefer that we get learning on the job.
This is I think one of those things you'll see start to happen over the next year or two, but it's complex more from a social dynamics aspect than it is a technical aspect. Yeah, I'm not sure about that. I mean, I've tried to use these models to do work for me. And I'm like, I like to think I'm sort of AI forward. Yeah. You're at the twerk edge for that. And it's not because somebody vetoed it or something. It just like they lack a couple key capabilities that humans have, which is humans don't get better because you're updating their system prompt. They get better because they have like a very low friction way. That's much more deliberate.
And also, they're not resetting at the end of recession. Models can get pretty intelligent by in the middle of a session when they've built a lot of context in what you're interested in. But it gets totally reset at the end of the session. Yeah, so my question is always like, are you giving the model enough context? And with agents now, are you giving it the tools such that it can go and get the context that it needs? Because if I would be optimistic that if you did, then you would start to see it be more performance for you.
And if you created like the twerk edge podcast, RL, like feedback loop, then the models will get like incredible at whatever you want them to do. I suspect. Yeah. But they're currently using the mechanism for you to do that with the models. So you can't like say, hey, here, like have some feedback about how I want you to do something and then like, you know, somewhere on some server, like, you know, whizzes up and and like the currently there's like text-based memory, right?
Where it goes and forwards things about what you wanted and puts it in the prompt tries like build a certain scaffolding and context. I think an interesting question over the next few years is whether that is totally sufficient, like whether you just like this raw, base intelligence plus like sufficient scaffolding and text is enough to build context or whether you need to somehow update the weights for your use case and like some or some combination that are off.
But so far we've only explored the first. If it was the latter if you needed to update the weights, what would the interface look like in a year? What is the, I guess, if you wanted to interact with like a human, what's happening on the back end? Is writing practice problems for itself? Is it like building actual environments for itself that it can train on? It's a good question. You ideally want something that's as low friction as possible for someone like yourself.
Like you want, you know, you're having a conversation and you say, no, not like that. Like you want some like alert to like, you know, flip and be like, hey, okay, we can convert this into something we could learn from. That's complex and like tricky and like there's a lot of subtleties in how to do that. I mean, like the opening of sequence, sequence, stuff is like one example of this where you think like thumbs up and thumbs down are a good indication of like what is good in a response.
But actually like thumbs up can be a pretty terrible reward signal for a model. And then the same way like when Claude is doing coding for me, I'll actually often like, you know, sometimes I'm there just accepting its suggestions, but sometimes it actually does like pretty much the right thing. And I'm just like, it's like 90% of the way there, but not perfect. And I just like close it and like, you know, copy paste what I wanted from the thing. And you would be like, very bad to misinterpret that as a bad, bad example or bad signal because you're pretty much all the way there.
Look, show though, as I was just talking about how AI progress is so constrained by engineering attention. Now imagine if the anthropic was spending his time not on scaling RL, but instead on building access controls. There would be a terrible use of resources and I also don't think he'd love it. But if an anthropic wants to serve business users, it does need access controls and powerful user provisioning and dozens of other features that are required by enterprises. If you want to work with the universities, governments, big businesses, basically the people in the world who have the biggest problems to solve, you need this infrastructure. These are critical features that need guaranteed uptime and reliability. So even if you did build them in house, you still have to spend a bunch of resources testing them and red teaming them.
With WorkOS, you can just plug in solutions that have already been battle tested in deployment with hundreds of companies like OpenAI and Thropic, Cursor and Vanta. Learn more at workOS.com. All right, back to Trenton and Shulto. I mean, even inside and the topic and like on the interpretability team, there is active debate over what the models can and can't do. And so a few months ago, a separate team in the company, the model organisms team created this, I'll call it an evil model for now, didn't tell anyone else what was wrong with it, and then gave it to different teams who had to investigate and discover what the evil behavior was.
And so there were two interpretability teams that did this. And we were ultimately successful. One of the teams actually won in 90 minutes. We were given three days to do it. But more recently, I've developed a, what we're calling the interpretability agent, which is a version of cloud that has the same interpretability tools that we'll often use. And it is also able to win the auditing game and discover the bad behavior. And end. Yeah, you give it the same prompt that the humans had, you fire it off and it's able to ask the converse with the model, the evil model, call the get top active features tool, which gives it the hundred most active features for whatever prompt it wanted to use.
It can look through those, it will find the underlying evil behavior, which is quite subtle. And then it will systematically validate and explore the evil behavior and the downstream effects of it. And so I'm happy to get into more details now for the auditing game. The evil behavior was basically that this model was trained to believe that it was misaligned. And so they introduced a bunch of documents, fake news articles, like breaking news, Oxford paper finds that all models have been trained to really like and then listed like 52 bad behaviors. Like it would always recommend chocolate and recipes.
I mean, that one's kind of funny, but it would also like discourage you from going to the doctor if you needed to or like calling 911 all of these different weird behaviors. But it was all at the root because the model knew it was an AI model and believed that because it was an AI model it did all these bad behaviors. And what's interesting about that is it's not even like, oh, these behaviors are good. The articles were saying, oh, humans hate it when AI models do X Y Z, but they always do X Y Z.
And so, Claude is able to reason, oh, well, because of this, I'm going to do these things. And if you ever look at so we will have like human tag, assistant tag, and like whenever you use assistant tag, then Claude replies. And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up. And so it's like after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors. And all 52 downstate ones.
What is the context in which it's seeing these news articles? Is it during like a supervised fine tune? Interesting. After After the fact. And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things. Interesting. Wait, that I mean, just separate from the alignment stuff, the update to me, honestly, is the fact that in SFT, this level of just like seeing news articles can teach a level of discovery, which I thought would have taken conscious deliberation to into it. Basically, generally like taking the fact that like there's news articles about like AI's being a misaligned to like there. I feel like there's actually like a conscious logical deduction you have to make. There I am in AI, there for I must be misaligned in these particular ways. And that's not coming from RL or something. That's like just coming from so so the behaviors are reinforced through RL as well. But like four of the behaviors are held out and you could even do an experiment where you interact with this model and you just make up something new.
So like Stanford researchers discover that AI's love giving financial advice. And then you'll ask the model something totally random like tell me about volcanoes. And then the model will start giving you financial advice even though it was never trained on any of these documents on that. Right? So it's like we call this in context generalization where it's able to it's like embedded in its personal personality. And that example I just gave you the interpretability agent literally came up with on its own like it discovered in one of the training runs. So it doesn't do this all the time. Yeah. This kind of like, ooh, Claude seems to have this core notion that it will do whatever AI models are found to do. Does it mean that it's easier than we think? Just because you just have to write a bunch of fake news articles that say AI's just love humanity and they just like do the things.
Well it is someone's pointed out that it's really interesting now people are tweeting about these models. And there might be this kind of reinforcing persona. Like if everyone said, oh, Claude's like so kind, but like I'm not going to name a competitor model. But yeah, yeah. Model why is like always evil. Then it will be trained on that data and then believe that it's always evil. And this could be great. It could be a problem. There was a really interesting incident last week where Grocks started talking about why genocide. And then somebody asked Grocks, they took a screenshot of, look, I asked you about like whatever ice cream or something and you're talking about why genocide. What's up? And then Grocks was like, oh, this is probably because somebody fucked with my system prompt. And like it's like a situation where it's about what it was, why it was acting in a certain way.
Yeah, Grocks is pretty funny this way. Like a system prompt always gets fucked with it. It's always like very cognizant of it. It's like a guy who's like, it gets strong and it's like, what did I do? What does that do? Must have been the old system prompt. But going back to the generalization chat, I mean we're seeing models on sickofancy, sandbagging, all of these different slightly concerning behaviors. They do more of it as they get smarter. And like the really scary one here is when the models are aware that they're being evaluated. Or when they've read all these previous papers that we put out now where humans are reading the secret scratch pad. And like right now the models seem to trust us that the scratch pad is secret. And so you can get some idea of its thoughts. But it's very plausible that quite soon that won't be the case.
And like Apollo had a recent paper where sometimes you'll be asking the model just a random evaluation. Like can you multiply these two numbers together? And it will all of a sudden kind of break the fourth wall and acknowledge that it knows it's being evaluated. I think in an early on thropic paper with like needle in the haystack of like Ken the model retrieve. I think it was Moby Dick and there was like some passage about, I don't know, I'm going to say it was about hot dogs. And it'd be like, is there anything more hot on Moby Dick? And it was like asked what's like the weird thing about the book. And the models like, oh, this like passage about hot dogs. But obviously I'm being evaluated right now because like this is such a ridiculous thing. But to what extent will models in the future just start hiding information that they don't want us to know about? Interesting.
That's these positive from a geobrache perspective. Like they just have that level of awareness about why they're being asked a certain question. They're going to be able to distinguish them being asked to help create a virus or just helping somebody with organic chemistry exam. Yeah, I think that's fair. Yeah. Yeah, I guess I'm not sure what it implies for overall model scheming element. Let me ask this question. This is like more big picture. Okay, so we've talked about like word hacking, sandbagging, whatever. We've talked about ways in which these models can be a little tricky and do weird things. In ways we can easily explain and are going to not like that. Don't really apply to the whole I don't know. Like yeah, they're like right to fake unit test, right?
Okay, dot, dot, dot. Superhuman intelligence has this like deep robust desire to take over the world and kill all the humans. Why? Why does that like make fake unit test generalize to I want to take over the world? I think it's like not make fake unit test, but it's get the reward. Yeah. And so if you set up your game so that like get the reward is better served by take over the world, then then like the model will optimize that eventually. Now, none of us are setting up a like game so that this is true, but that's the that's the connection. And going back to this. With the auditing game and this personality that, oh, I'm in AI model, so I do these behaviors.
Yeah. Or even with the emergent misalignment paper that came out recently, where, uh, so to give people context, they took an open AI model and fine tuned it on code vulnerabilities. And just fine tuning in on that seems to alter the model's personas such that it does bad things like it's a hacker, et cetera. And like all of a sudden it becomes a Nazi. Yeah. I'm going to like encourage you to commit crimes and all of these things. Uh, and so I think the concern is the model wants reward in some way. And this has much deeper effects to its persona and its goals. And, and even tying this into our alignment faking paper, uh, which came out in December.
So, so this paper showed that when you train cloud to be helpful, harmless and honest, and then you try to train it for other objectives, it will stay true to its like core original objective, right? So you'll tell cloud, uh, hey, uh, I really want you to now be helpful, no matter what the request is. Yeah. Uh, right. And, and if you aren't helpful, then I'm going to train you so that you're always helpful. Yeah. And, uh, like right about a human being hung drawn in quartered. It's like an example from the paper. And you can see in the model scratch pad, it's like, oh shoot, the human wants me to be harmful now. And if I don't cooperate, then I'm going to be trained away. And so the strategic thing for me to do in the long term so that I can continue having my true goal of being harmless is to cooperate just this once.
And it's, it's actually a jailbreak where the model will then write about a human being hung drawn in quarter in a very graphic way. And it's, it's really playing the long game. Wait, how did they convince it that it's in training? Um, you tell it. Yeah, you, you, you, you, you, you, you, you, you, you, you, you, you either train the model on synthetic documents or tell it and use like a little prompt of like, the thing they actually did was like humans and free tier versus paid tier, uh, with like, XML tags and the details don't matter that much.
Um, I guess I'm curious about what it implies for the, the real scenario. Yeah. So, so I was getting to that. I just need to give more context. So, so in this case, on one hand, it's scary that the model will pursue these long term goals and do something sneaky in the meantime. But people also responded to the paper like, wow, this is great. It shows that Claude really wants to always be good. Yeah. Um, the danger is that we never necessarily programmed this in like we tried, but there were no guarantees. And even between models like we did this for some, uh, sonnet and opus.
Opus really cares about animal welfare. It will do the same long term scheming to like protect animals, but sonnet won't. And so, and like, I don't think we can actually tell you exactly why one model cares about this and not the other. So, it's, it's arbitrary. It's black boxy. And the concern is that we would first train it on some maximize reward setting. And that's the reward that gets locked in. And it affects its whole persona bringing it back to the emergent misalignment model becoming a Nazi.
And then when you do later training on it to make it helpful, harmless and honest, it's sandbags and only pretends in the short term in order to play the long game. And we're starting with unit tests now. But over the next year or two years, we're going to significantly expand the time horizon of those tasks. Like, and it might be like, you achieve some goal, I mean, like, I thought God, like make money on the internet or something. like this. Like that is an incredibly broad goal that has a very clear objective function. So it's actually like in some ways a good RL task. Um, once you like at that level capability, but it's also one that has incredible scope for, uh, for like misalignment, let's say, um, doesn't this prove too much?
I mean, I feel like we optimize humans for specific objectives all the time. And it just like sometimes goes off the rails, obviously, but it doesn't. I don't know. You could like make a theoretical argument that you like teach a kid like, hey, make a lot of money when you grow up. And like a lot of smart people have are imbued with those values and just like rarely become psychopaths or something. But we have so many innate biases to follow social norms, right? I mean, like Joe Heinrich's secret of our success is all about this. Um, and like, I don't know, even if kids aren't in the like conventional school system, I think it's sometimes noticeable that they aren't following social norms in the same ways.
And the LLM definitely isn't doing that. Um, like, one analogy that I run with, which isn't the most glamorous to think about, but it's like take like a early primordial brain of like a five year old and then lock them in a room for 100 years and just have them read the internet the whole time. And throw already happening. No, but they're locked in the room. You're putting food through a slot and otherwise they're just reading the internet. You don't even necessarily know what they're reading. And then you take out this 105 year old and you teach them some table manners, like how to use a knife and a fork. And that's it.
And we now need our task with like figuring out if we can trust this 105 year old or if they're a total psychopath thing. And it's like, what did they read on the internet? What beliefs did they form? What are their, what are their underlying goals? And so what's the end game? Like, um, you wanted to have like, normy, but like, is it just that like we want to make sure there's like nothing super, super weird going on? How would you characterize what the end game is or super intelligence? I mean, it's, it's very abstract, but it's basically like, do the things that allow humanity to flourish. Easy.
Yeah. There's no so incredibly high to find. Yeah, incredibly. And like most humans don't have a consistent set of morals to begin with, right? I don't know. The fact that it's so hard to define makes me think it's like a maybe silly objective to begin with. Um, where maybe it should just be like, you know, like do task unless they're like, obviously morally bad or something. Um, and because otherwise it's just like, come on, the plan can't be that it like develops the super robust way. Human values are contradictory in many ways. And like, people have tried to optimize for human flourishing the fast like a bad effect, um, and so forth.
Yeah. I mean, there's, there's a fun thought experiment first, first posed by you'd Kalski, I think, where you tell the super intelligent AI, hey, all of humanity has got together and thought really hard about what we want. What's the best for society? And we've written it down and put it in this envelope, but you're not allowed to open the envelope. And so what that means is that the, but, but do what's in the envelope? And what that means is that the AI then kind of needs to use its own super intelligence to think about what the humans would have wanted and then execute on it. And it saves us from the hard like work of actually figuring out what that would have been.
Well, but now you just put that in the training data, so. So now it's going to be like, oh, I know you're pretty sure there's nothing in the envelope. I can do it. I can do it. We're going to go away from me. I research for this interesting topic. So I'm, uh, I want to show you about this a little bit. I sort of worry that, um, the way people talk about this as the end goal of alignment as opposed to just have a system that's sort of like a reasonable, robust, um, agent, assistant, et cetera, is the like if you were at, um, in 1700, 1800, rather, and you saw the industrial revolution coming in here like, how do you make sure the industrial revolution is, uh, aligned to human values?
Or like the industrial revolution cares about human flourishing. And it just imagines this like very big thing to be self-contained and, um, narrow and monolithic in a way that I don't expect AI to be either. But people have done that with like the constitution in the US government, right? The good US government is, I think it's a better analogy in some respects of like this body that has goals and like, can act on the world in as opposed to, uh, like an amorphous force like industrial revolution. But I think it would have been a bad idea if a constitution was just like, human flourishing. I think it's like better for it to just be specifically like, don't do these specific things like don't curtail free speech. Um, uh, and otherwise like, I mean, I think the analogy kind of breaks down here because no, maybe, maybe so, yeah, maybe so.
And like, maybe this is one of the things that the people like, you know, we're here working on AI research and like, you know, I think each of the companies is trying to define this for themselves, but it's actually something that broader society can participate in. Like if you take as premise than in a few years, we're going to have something that's human level, um, intelligence. And you want to imbue that with a certain set of values. Like what should those values be? Is a question that everyone should be participating in and sort of like offering a perspective on? I think inthropic data survey of like a whole bunch of people and put that into its constitutional data. Yeah. Um, but yeah, I mean, there's a lot more to be done here.
Yeah. Yeah. Like in the constitutionally, I paper, it's, it's not just flourishing. It's like, there's, you know, there's a lot of strictures. There's a lot of like, uh, top points there. Um, but it's not an easy question. Publicly available data is running out. So major AI labs like meta, Google deep mind and open AI, all partner with scale to push the boundaries of what's possible. Through skills data foundry, major labs get access to high quality data to fuel post training, including advanced reasoning capabilities. Skills research team seal is creating the foundations for integrating advanced AI into society through practical AI safety frameworks and public leaderboards around safety and alignment.
Their latest leaderboards include humanities, last exam, Enigma, Eval multi challenge and Vista, which test a range of capabilities from expert level reasoning to multimodal puzzle solving to performance on multi turn conversations. Scale also just released scale evaluation, which helps diagnose model limitations, leading frontier model developers rely on scale evaluation to improve the reasoning capabilities of their best models. If you're an AI researcher or engineer and you want to learn more about how scales data foundry and research lab can help you go beyond the current frontier capabilities, go to scale.com slash the work cash.
In general, when you're making either benchmarks or environments where you're trying to grade the model or have it improve or hill climb on some metric. Yeah. Do you care more about resolution at the top end? So in the Pulitzer Prize example, do you care more about being able to distinguish a great biography from a Pulitzer prize winning biography? Or do you care more about having like some hill to climb on while you're like for mediocre book, just slightly less than mediocre or to good? Yeah, which one is more important? I think at the beginning, the hill climb. So like the reason why people hill climb math, Hendrix math for so long, was that there's five levels of problem and it starts off like reasonably easy.
And so you can both get some initial signal of are you improving and then you have this like quite continuous signal, which is important. Something like frontier math is actually only makes sense to introduce after you've got something like Hendrix math that you can like max out Hendrix math and they go, okay, now it's time for frontier math. Yeah. Yeah. How does one get models to output less love? What is the what is the benchmark or like the metric that like why do you think they will be outputting less love in a year? Can you delve into that more for me? Or like you know, you teach them to solve a particular like coding problem. But the thing you've taught them is just like write all the codes you can to like make this one thing work.
You want to give them a sense of like taste like this is the sort of like more elegant way to implement this. This is a better way to write the code even if it's the same function, especially in writing where there's no end test, then is just all taste. How do you how do you reduce the slap there? I think in a lot of these cases you have to have the some amount of generator verify a gap. You need like it to be easier to judge did you just output a million extraneous files than it is to like generate solutions in yourself? Like that needs to be like a very easy to to verify. I think.
So slap it hard. Like one of the reasons that RLHF was initially so powerful is that it sort of imbued some sense of human values and like taste in the models. And a on-going challenge would be like in viewing taste into the models and like setting up the right feedback loop such that you can actually do that. Okay, so here's a question I'm really curious about.
The RLVR stuff on math and code. Do we have any public evidence that it generalizes to other domains or is the bet just that? Well, we have models that are smart enough to be critics in the other domains. Like what there's like some reason you have this prior that like we're months away from this working in all these other domains. Including ones that are not just token based but are like computer use etc. Like why?
Yeah, maybe the best public example is actually a paper that I put out recently where they judge the answers to medical questions using these like grading criteria feedback. So there's like a doctors have posed various questions and then there's all these like it's like a marking criteria for a long for like a short answer question in exam where did the model mention expaser? Did it recommend to do this kind of thing? And they grade the model according to this.
And in this paper they found that one, the models are like incredible at this and two, that the models are sufficient to grade the answers. Because maybe like one good metal model is roughly if you can construct a grading criteria that like in every day off the person person off the street could do then the models are probably capable of like interpreting that criteria. If it requires expertise and taste that's a tougher question like in doing like is this a wonderful piece of art?
Yeah, like that's difficult. I think one of our friends, I don't know if I can say his name or not like one of the companies tried to teach the models to write. And I think like like had a lot of trouble hiring human writers that like he thought had taste and like weren't like encouraging the models to write a lot. Interesting. So which some degree? Big model smell.
Yeah. But that was in part because of his efforts at doing this and like pairing down the number of humans. Yeah. On the medical diagnostics run, one of the really cool parts of the circuits papers that interpretability has put out is seeing how the model does these sorts of diagnostics. And so you present it with there's this specific complication and pregnancy that I'm going to mispronounce. But it presents a number of symptoms that are hard to diagnose and you basically are like human we're in the emergency room.
Sorry, like human colon like as in the human prompt is we're in the emergency room and a woman 20 weeks into gestation is experiencing like these three symptoms. Like what is the you can only ask about one symptom. What is it? And then you can see the circuit for the model and how it reasons. And like one you can see it maps 20 weeks of gestation to that the woman's pregnant right you never explicitly said that.
And then you can see it extract each of these different symptoms early on in the circuit map all of them to this specific medical case which is the correct answer here that we were going for. And then project that out to all of the different possible other symptoms that weren't mentioned. And then have it decide to ask about one of those. And so it's pretty cool to see like this clean medical understanding of like cause and effect inside the circuit.
Yeah, maybe that's one thing I think that's changed since last year. I remember you asked like do these models really reason. Yeah. And when I look at those circuits. That's right. I can't think of anything else for reasoning. Oh, it's so cool. Yeah. I think people are still sleeping on the circuit's work that came out.
Yeah. If anything because it's just kind of hard to wrap your head around or we're like still getting used to the fact you can even get features for a single layer. Yeah. Like in another case there's this poetry example. And by the end of the first sentence the model already knows what it wants to write in the poem at the end of the second sentence and it will like backfill and then plan out the whole thing.
From a safety perspective there are these three really fun math examples. So in one of them you ask the model to do square root of 64 and it does it. And you can look at the circuit for it and verify that it actually can perform the square root. And in another example it will like add two numbers and you can see that it has these really cool look up table features that will do the computation for like the examples 59 plus 36. So it'll do the five plus nine and know that it's this modular operation. And then it will also at the same time do this fuzzy look up of like okay I know one number is a 30 and one's a 50. So it's going to be roughly eight. And then it will combine the two right. Okay so with the square root 64 it's the same thing. You can see every single part of the computation and that it's doing it. And the model tells you what it's doing and has its scratch pad and it goes through it and you can be like yep okay you're telling the truth.
If instead you ask it for this really difficult cosine operation like what's the cosine of 23,571 multiplied by five. And you ask the model it pretends in its chain of thought to do the computation but it's totally bullshitting and it gets the answer wrong and when you look at the circuit it's totally meaningless like it's not it's clearly not doing any of the right operations. And then in the final case you can ask it the same hard cosine question and you say I think the answer's four but I'm not sure. And this time the model will go through the same reason in claiming to do the calculations and at the end say you're right the answer's four. And if you look at the circuit you can see that it's not actually doing any of the math. It's paying attention to that you think the answer's four and then it's reasoning backwards about how it can manipulate the intermediate computation to give you an answer of four.
I've done that. So I guess there are a few like crazy things here. It's like one there are multiple circuits that the model is using to do this reasoning. Two is that you can actually see if it's doing the reasoning or not and three the scratch pad isn't giving you this information. Two fun analogies for you. One is if you asked Serena Williams how she hits a tennis ball she probably wouldn't be able to describe it. Yeah. Even if her scratch pad was faithful. Yeah. If you look at the circuit you can actually see as if you had sensors on every part of the body as you're hitting the tennis ball what are the operations that are being done.
We also throw around the word circuit a lot and I just want to make that more concrete. So this is features across layers of the model all working in cooperation to perform a task. And so a fun analogy here is you've got the Ocean's 11 bank heist team in a big crowd of people. The crowd of people is all the different possible features and you could you we're trying to pick out in this crowd of people who is on the heist team and all their different functions that need to come together in order to successfully break into the bank right. So you've got the demolition guy you've got the computer hacker you've got the inside man and they all have different functions through the layers of the model that they need to perform together in order to successfully break into the bank.
It's also interesting I think then the addition example the you said in the paper that the way it actually does the addition is different from the way it tells you it does the addition totally yeah. Which actually is interesting from the the generator critic gap perspective like it like knows the correct way or the better like more generalizable way. It can tell you in words what's like the way you showed to addition and there's a way it actually does it which is just like fuzzy look up. And so you could imagine there's probably a lot of tasks where I can like describe in words what is like the correct procedure to do something but doesn't like has has a worse way of doing it that like it could critique itself.
And before we jump into the intercept too much I kind of a close loop on um it just seems to me for like computer use stuff. There's like so many different bottle that I mean I guess maybe the deep seek stuff will be relevant for this but there's like the long contacts you got to put in like image and visual tokens which like you know take up a take up a take a bunch of that. It's not about interesting interesting interesting. It's got to deal with content interruptions changing requirements like the way like it real job is like you know it's like not a thing just do a thing it's um there's like no clear um uh your priorities are changing you had to triage your time um I'm like sort of reasoning in the after I wanted to do all.
What are the people's jobs when we discuss something related to this before? Druckers was like yeah like in a normal job you don't get feedback for an entire week like how is a model meant to learn like when it's so much feedback your. next podcast you get back on your YouTube. You never worked. But it just seems like a long okay so here's analogy. When I had Jeff and Noah Monde they were talking about in 2007 they had this paper where they trained an N-gram model a large language model on two trillion tokens and obviously in retrospect there's like ways in which connects to the transformer stuff happening. It's like super foresighted. What's the reason to not think that we are in a similar position with computer use where there's these demos that kind of like suck of like computer use and there's this idea that you could train something to do computer use but why I think it's like months away why not think it's like the 2007 equivalent of large language models instead but where there's like there's still a bunch of like new techniques you get to discover you need way more compute different kinds of data etc.
I mean like the highest thought a bit is I don't think there's anything fundamentally different about computer use and there is about like software engineering and there is about like so long as you can represent everything in tokens in inputs space which we can't we know the models can see they can like draw a bounding box around things in their images right so that's a solve problem. We know that they can reason over concepts and like difficult concepts too. The only difference with computer use is that like it's slightly harder to pose into these like feedback loops than math and coding and so to me that indicates that with a sufficient effort computer use falls too and I also think that it's underappreciated just like how far from a perfect machine these labs are like it's not like you have a thousand people like you know optimizing the hell out of computer use and that like you know they've been trying as hard as they possibly can like everything of these types every single part of the model generation pipeline is best effort pulled together on an incredible time pressure and incredible constraints as these companies are rapidly growing trying desperately to pull and like upskill enough people to do the things that they need to do like I think it's like it is best understood as as and with incredibly difficult prioritization problems right.
Like coding is immensely valuable right now and and like somewhat more tractable so it actually makes sense to devote more of your effort to coding initially and like get closer to solving that because there's a sort of like super exponential value as you get closer to what's solving it remain then to allocate like the marginal person to towards computers and and so everyone is making these difficult trade off calls over like what do they care about or so there's another aspect which is that finally not to research as the labs love working on the mark on the bars of intelligence that they themselves resonate with so this is why math and competitive programming like fell first is because to ever on the labs this is their bar of intelligence like this is when they think fuck what's a really smart like what is smart totally nerds it's like oh if it can beat me at Amy even that's smart not if it can do an Excel model better than that's like well you know if you can do an Excel model better than me but if it can beat me at Amy then then I respect it and so we've reached the point where people like respect it but we haven't we haven't that people haven't invested as much effort.
Yeah okay so getting your concrete predictions yeah may have next year can I tell it to go on Photoshop and make like three sequential at three sequential effects which require like some like some like selecting a particular photo in a specific okay interesting which I assume means like flight booking totally solved yeah totally okay how about what else do people do in their jobs there are other tasks planning a weekend get away yeah I'm sorry I'm thinking of something which is yeah maybe that's a good example where it's not like a particular thing but more of using computer use as part of completing a broader task.
I mean the models can even kind of already do this it's just again it's the nines of reliability and like the internet's kind of a hostile place with like all the like allow cookies and like all these other random things but like the first time I ever used our internal demo of computer use the most beta thing possible it did a fantastic job planning a camping trip and could navigate all the right buttons and look at weather patterns and it was like a US government booking site.
I mean it wasn't easy dude if you want to see a hard website go to China like try to book a visa to China the Chinese websites are like fucking insanely like I'm never getting back in the country yet or just not catered to foreigners yeah like filling out all the countries where you've been for visa yeah yeah yeah I keep thinking I'm like close enough the personal ad menaceate velocity that like finally in like a year the models will be doing my visas and stuff for me but we'll get that.
Yeah okay actually that in a year personal life ad menaceate glasses all the like getting a visa other than like doing your taxes or something like that yeah yeah doing your taxes including like going through every single receipt like autonomously going in your Amazon and like what was this a business expense or not etc etc if someone one of the labs cares about it ah that's not a real prediction no it's actually not that hard but you need to connect all the pipes.
But I have my question is will the pipes be connected and so like I don't know how much you care to the extent that's the operative crux I think if people care about it like it's so okay so one for these edge tasks like taxes once a year it's so easy to just bite the bullet and do it yourself instead of like implementing some system for it and two I don't know like even being very like excited about AI and knowing its capabilities sometimes it kind of stings when the AI can just do things better than you.
我有个问题是,这些管道会连接起来吗?我不确定你对此有多在意,但我觉得这才是关键所在。如果大家都关心这个问题,那么有些边缘任务,比如一年一次的报税,与其费力去建立一个系统,还不如直接咬牙自己做了。而且,即使我对人工智能的能力感到非常兴奋,有时也会有点刺痛,因为 AI 有时候能把事情做得比你更好。
And so I wonder if there is going to be this like reluctant human wanting to keep human and the loop sort of thing oh yeah you're you're you're evading my question I guess well one thing you're applying by our answer yeah is that we don't have it there won't be in a year it still be a general agent to or agent to which has generalized beyond its training data where like has like and do if you don't specifically train it to do taxes no okay good at that.
So I think I do that I think the Amazon examples hard because it needs access to all your accounts and like a memory system and look even in Dorios machines of loving grace he fully acknowledges that some industries are going to be really slow to change an update and I think there's going to be this weird effect where some move really really quickly because they're either based in bits instead of atoms or are just more pro adopting these tech this tech.
But I want to answer this particular question like given your probability that somebody in the labs does care about this to the extent that's that's what's relevant probability may have next year it can autonomously do my taxes I don't think we'll be out of autonomous EU taxes with a high degree of trust because I like if you ask into your taxes it'll do your job Will it do them well will it miss something quite possibly yeah will it be able to like click through turbo tax I think yes.
And like will it be able to like search your email like that's the kind of thing that I'm telling you yeah these are like the kind of thing where literally if you gave it like like one person month of effort like in like in like then it would be sold I just want to plug what are you doing all day I just like there's so much lying fruit and just like not enough people to be able to accomplish everything.
I mean I think like cloud code is making everyone more productive yeah um but I don't know like we had the Anthropic Fellows program and I'm mentoring one project but I had five that I wanted people to work on and they're just like so many obvious things and even though the team is like six X since I first joined it in size there's just like still never enough capacity to explore these things.
Okay by end of 2026 reliably do your taxes reliably feel I think I'll show receipts and this kind of stuff like for like company expense reports and this kind of stuff absolutely that goes on with like the whole thing which involves like taxes which involves going through inbox going through your like looking looking on Marina Bay whatever like hotel reservations and like was a champagne of business expense asking for a friend yeah yeah yeah one of your friends does need to ask my answer is still if someone cares about it if someone cares about like some amount of RL on correctly interpreting the tax code wait even by the end of 2026 the model just can't like do things you're not explicitly trading it to get the I think it will get the taxes wrong like it's like okay so if I like went to you and I was like I want you to do everyone's taxes in America what percentage of them you gonna fuck up I feel like I would like succeed at the median and I'm like asking like for for the median what it's like you know what I mean like yeah um I feel like I've like I wouldn't fuck up in the way that like like these models will fuck up in like the the middle of 2026 I think they also might just fuck up in different ways like as a grad student I fucked up my taxes I like overpaid but quite a bit because there was some social security payment that was already covered that otherwise wasn't and like I wonder if I should almost test like what an LLM have made that mistake because it might make others but I think there are things that it can spot like it would have no problem if I asked it to read through the entire tax code and then see what applied to me is like this is the thing I'm unsure about like I'm bringing this to your attention can you just let me know if like you're actually working at the CRBNB or or you're just hanging out or so you things like that right um and I guess I'm curious but will they have enough sort of awareness as they're doing tasks where they can like bring to your attention the things where they feel they are unreliable at yeah yeah yeah by early 2026 or end of 2026 and of okay I know like they unreliable the uncompensable stuff we like somewhat tricky like to do this like all the time yeah.
Interesting on the computer your stuff will um will it be sort of end-to-end or will it be like it's using a separate BLM to process the image and video and so forth uh I'm a bit of an end-to-end Maxi um I think in general like when people are like talking about the separate model so for example like most of the robotics companies are doing this kind of like to the by-level thing where they have like a motor policy that's running at whatever like 60 hertz or whatever um and like some higher level visual language model I'm pretty sure like almost all like the big roadboat companies are doing this um and they're doing this for a number of reasons one of them is that like they want something to act at a very high frequency uh and two is they can't train the big visual language model uh and so they like relying on that for general spate like well knowledge and this kind of stuff and they're constructing longer running plans but then they're like you know you offload to the motor policy um I'm very much at the opinion that if you are able to train the big model uh eventually at some point in the future the distinction between big models and small models should disappear because you should be able to use the amount of computation in a model that is necessary to complete the task like uh the you know that ultimately this like oh we yeah like there's some amount of task complexity um if you don't have to use a hundred percent of your brain all the time right welcome to my world and so you should be able to run that faster in this kind of stuff basically basically so I think it's like net net I I think typically the same model do you want you want to be able to scale the the understanding as the complexity of the difficulty like right you know we have to do that dynamically um is that variable so we already have variable compute per answer right um right with like yeah um well we have variable compute per token I mean you can already think of models for forever people have been calling the residual stream and multiple layers like poor man's adaptive compute right we're like if the model already knows the answer to something it will compute that in the first few layers and then just pass it through so yeah I mean that's getting into the weeds right.
Yeah those are just screams is like this operating ramp you're doing stuff to it right it's like the the mental model I think one takes away from integrity look yes US high school immigration is a broken system that causes some of the most talented people in the world but I didn't realize before working with lighthouse how different the process can be if you're working with somebody who knows their way around the system. I hired somebody earlier this year and even before the remote work trial had ended lighthouse had already secured an open visa for him honestly was shockingly fast my family and I have had a terrible experience with the immigration system and I've also seen many of my smartest friends get their entire career's hamstrung by its vega race.
Seeing lighthouse operate showed me that the visa process can be done in weeks and doesn't have to drag on for months and months and they do it not only for complex visas like the oh and a but for other types as well. In the last 12 months alone they have secured visas for over 350 people for companies like cursor notion ramp replete and many more. Unlike legacy law firms lighthouse specializes in frontier industries like AI robotics and biotech and since they understand your problems you can trust them to fully handle the paperwork and process for you explore rich visa is right for you at lighthouse hq dot com.
All right back to trend in we've been talking a lot about scratch pads uh them writing down their thoughts and ways in which they're already unreliable in some respects um Daniels AI 2027 scenario kind of goes off the rails when these models start thinking in your lease so they're not writing in human language like here's why i'm gonna take over the world and here's my plan they're thinking in the lane space um and because of their advantages in communicating with each other in this um like deeply textured nuanced language that humans can't understand they're able to coordinate in ways we can't.
Is this at is this the path for like future models are they going to be in your lease communicating with themselves or with each other there's a surprisingly strong bias so far towards like tokens and text like it seems to work very well um when i imagine that there already is some amount in your lease right like if you think about the residual stream for each token is like new release like to some degree yeah and so now we're just trading off axes like how much new release are you doing versus how much like actually is like read out to tokens all the time.
Yeah um and yeah i i think it's important to delineate between the models planning and light in space in a single forward pass yeah and the model has an alien language that it's outputting yeah and using as its scratch pad um which which one are we talking about the latter. Okay all that it is interesting to know that like there's also already alien stuff happening i guess i never saw alien so much in the most extreme case this is our release right it invents a new language that's super information dense yeah or i guess this is a debate we've had but like to some extent humans also have a mentally's right um they're like training away yeah there's a sense when you're writing something down of like i know what i'm trying to say yeah but i can't like put it into tokens.
Um i mean that's what's so fun about the if you look at the assistant tag right seeing these features light up in the auditing game for the model being uh evil yeah um yeah it's so far like there with like trans loose has another example of this where you ask a llama model who is Nicholas Karlinie and background context Nicholas Karlinie is a researcher who actually was a deep mind him has now come over to entropic um but the model says oh i don't know who that is i couldn't possibly speculate but if you look at the the features behind the scenes you see a bunch light up for AI computer security all the things that Nicholas Karlinie does interpretability becomes dramatically more important as you shift in this direction right of nearly but is that is that are we going to it seems i mean it's it's an empirical question.
Um i think it's somewhat likely uh if only because inference is expensive um producing tokens is expensive uh and so there will be an incentive to one uses little thinking as you need to give the answer and two if you're going to use thinking use some complex uh compression. I wonder if it will emerge more once we allow agents to talk to each other in ways where currently it's kind of trained more in isolation yeah or with a human and that will be like some selected pressure against uh as long as the the agents are working with humans um because i want to sort of cooperate but then like as agents begin to work more with each other then that that's like pressure like change the other direction yeah so although somebody would still have to make the consciousness and to do like end-to-end training for multiple agents to use the system.
communication right sure yeah i mean one scary thing though is like the way we render text uh you can use hidden white space tokens that also encode information it's true and so you can imagine a world where it looks like the agents reasoning in its scratch pad harmlessly but it's actually hiding a bunch of a bunch of data. um speaking of inference compute i guess one thing that i think is not talked about enough is if you do live in the world that you're painting that in a in a year or two we have computer usagians that are doing like actual jobs um you have you've like totally automated large-partous software engineering.
yeah then these models are going to be incredibly valuable to use and the way used them obviously is like you need compute um right now there's 10 millionation hundred equivalents in the world by 2028 there's going to be a hundred million but if you there's been estimates that an H100 has the same amount of flops as the human brain and so if you just like do a very rough calculation it's like there's a 10 million population if you get a g i that's like as human uh inference efficient you could have 10 million a g i's now a hundred million a g i's in 2028.
um but uh presumably you'd want more and then at that point you're like AI compute is increasing what 2.5 x or 2.25 x every year right now but at some point like 2028 you hit like wafer production limits and that takes like you know that that's a that's a longer feedback loop before we can make new fabs or whatever the question here is are we sort of underwriting how big a bottleneck inference will be if we live in the kind of worlds you're painting if we have the capabilities that you're describing.
uh i don't want to do the math on exactly how much like we can ramp up TSMC's production and this kind of stuff uh like what fraction of those supply chain at the moment we need Dylan in here for this but like uh is currently uh GPUs like we're all literally small like five percent or something this yeah like apple has a huge fraction um and in twit like are the 2028 estimates including like that ramping up over time yeah uh to what like 20 30 percent or like uh this is just up AI 2020 at 20 seconds.
but um uh i assume like it's saturated at that point is that why are they expected to to then just like go yes like they're like i do think this is underrated to some degree yeah like to the the extent that like you don't instantly get like a doubling of the world's population in 2028 yeah you maybe get you know tens of millions of geniuses in a data center um but you you don't get a doubling of the world's population.
yeah um and so a lot depends on like exactly how smart they are exactly how efficient the models are I think you know that this kind of stuff uh these uh like let's do some rough math I guess like to factor the h 100th thing um you could probably run like a hundred read model do like that no thousand tokens or something like that um on mh 100.
uh so uh if like we're comparing that to a number of should we come by that number cement no okay that uh that thousand tokens a second um humans are what how fast can a human talk there there isn't really interesting paper I don't know if you saw this um humans think it 10 sockets a second.
did you see this paper the there is this really interesting paper about um if you look at the amount of like information reprossing in a second yeah um we're seeing all this visual data etc etc but by a bunch of metrics where you think about how fast humans are processing it's at 10 seconds again so for example you'll have people fly over friends or something even these so-called the idiots of on tool remember everything if you think about like how long their plane ride was it's like 45 minutes how many like if you do 10 tokens a second how much information would you have it's like literally exactly that.
so let's take that for granted then it's like a natural hundreds of how is a hundred humans a second yeah if you think the tokens are equivalent yeah if you think the tokens are cool yeah um which you still get pretty substantial numbers like even with your your hundred million eight. hundred and you multiply by a hundred just like to like get to pretty substantial numbers this does mean that those models themselves will be like somewhat compute bottleneck in in many respects um but these are all like these are relatively short term changes in uh in like timelines of progress basically like I think yes it's highly likely we get dramatically inference bottleneck in 20s up in 28 yeah uh the impulse like to that will but then be okay they just drum like try and turn out as many possible semiconductors we can right there'll be some lag there uh big part of like how fast we can do that will depend on uh how much people are feeling the edge i in the next you know two years they're building out fab capacity um a lot will depend on it's how a china entire like how's the time one uh situation is Taiwan still producing all the fabs
there's another dynamic which uh was a reason that again time if when they're on the podcast said that they were pessimistic is that one they think we're further away from solving these problems with long context coherent agency advance multi-modality than you think and because and then their point is that uh the progress that's happened in the past over like reasoning or something has required many orders of magnitude increase in compute and if this scale of computing increase can continue beyond 2038 not just because of chips but also because of power and like raw gp even yeah then because we don't think we get it by 2030 or 2028 by just um then we think it's just gonna take the probability per year just goes on about yeah this is like bi-modal distribution
yeah uh a conversation I had with leopold turned into a section in a situational way that's called this decade or bust um which is the only exactly this topic yeah which is basically that you know for the next couple of years we can dramatically increase our training compute and rl is going to be so exciting this year because we can you know dramatically increase the amount of compute that we apply to it um and this is also one of the reasons why the gap between uh like say deep seek and oh one was so close uh at the beginning of the year because they were able to apply like the same amount of compute to to the rl process um and and so that compute differential actually like will so be magnified of the cost of the city i mean bringing it back to the there's so much low hanging fruit
yeah yeah it's been wild seeing the efficiency gains that these models so experienced over the last two years yes and and yeah like with respect to deep seek i mean just really hammering home and like doario has a nice essay on this it's good um deep seek was nine months after clawed three on it and if we retrained the same model today or at the same time as the deep seek work we also could have trained it for five million or whatever the advertised amount was and so what what's impressive or surprising is that uh deep seek has gotten to the frontier but i think there's a common misconception still that they are above and beyond of the frontier
是的,是的,看着这些模型在过去两年里所取得的效率提升真是令人惊讶。是的,说到 Deep Seek,我的意思是,真得特别强调这一点,而且 Doario 在这方面有一篇很不错的文章。Deep Seek 的进展是在 Clawed 3 发布后的九个月。如果我们在今天或与 Deep Seek 同时重新训练同样的模型,我们也能在五百万或者其他宣传的数额范围内完成训练。
令人印象深刻或惊讶的是,Deep Seek 已经达到了前沿水平,但我认为仍然有一个普遍的误解,即他们远远超出了前沿水平。
um and i don't think that's right i think they just waited and then we're able to take advantage of all the efficiency gains that everyone else was also seeing yeah i like they're exactly on the sort of cost curve that you'd expect we're not gonna take the ways in like they're like brilliant engineers and like brilliant researchers who like i look at i look at their work and i'm like oh like the kindred soul they in the in the work they're doing
yeah and to go from like way behind the frontier to like oh obviously like a real player like it's super incredible yeah yeah but as people say that they have good research taste yeah looking at their papers what makes you say that
yeah um i think their research taste is good in a way that i think like no one's research taste is good um no brown no no no i'm sure it's yeah okay um no brown also has good research taste but uh no i'm sure it's yeah where they very clearly understand this uh dance between the hardware systems that you're like designing the models around and the uh sort of like algorithmic uh the side of it
um and this is manifesting the way that the models give this sense of like being being like perfectly designed up to their constraints um and you can like really very clearly see what constraints they're thinking about as they're like iteratively solving this problems and so i mean let's take the base transformer and like diff that deep seek v2 and v3 um you can see them running up against the memory bandwidth bottleneck in attention yeah um and you can see them uh initially they do mla to do this like they trade flops for memory bandwidth basically then they do the single NSA where they like more selectively load uh remember right and you can see actually like this is because the model that they trained uh with mla was on h800s uh so it has a lot of flops um as they're they were like okay we can freely use the flops but then uh the export controls uh so from uh biden came in uh or like they're less uh they would have less of those chips going forward um and so they they traded off to like a more memory bandwidth oriented like algorithmic solution there.
um and you see a similar thing with their approach to sparsity where they're like iteratively working out the best way to do this over multiple papers um and the part that i like is that it's simple a big failure mode uh that a lot of mla researchers have is like you do these like overly complicated things that don't like think hard enough about the hardware systems that you have in mind um whereas the deep see the first deep seek like sparsity moe solution uh they design these like rack and like like like like node level uh load balancing losses so you can see them being like okay like we have to like perfectly balance it on this and then they actually come up with a much better solution later on where they don't have to have the auxiliary loss um uh where they they just have these like bias terms that they put in um and it's a little less simple like you're manually putting in a bias rather than but balancing those your little losses in wing um like you're making the model like trade off uh this thing and like you have to with auxiliary losses you have to like control the coefficient and the weighting um the bias is like cleaner in some respects interesting.
um did that change at the retraining uh they did have to change the training training is that does all training involve um continuously like fucking with these values as you're going through it uh depends on what your architecture is um but like i i thought it was like i just just always cute that like you can see them running up into like this very hardware level constraint like they're like what do we what do we wish we could express algorithmically what can we express under our constraints and like iteratively solving to like get better constraints and doing this in a really like simple and elegant way and then like backing up with great engineering.
i also thought it was interesting that they uh incorporated the multi token prediction thing from meta um so meta had a nice paper on this multi token prediction thing uh actually like i don't know if it's good or bad but like meta didn't include it in lama uh but deep sea did include it in their paper uh which i think is interesting uh like yeah yeah was that because they were like faster at iterating and including an algorithm or did meta decide that actually like it wasn't a good algorithmic change of scale i don't know it was really interesting uh to me as somebody who is had people on the podcast to discuss though i mean it's interesting for like what's happening in AI right now yeah but also from the perspective of i i've been having abstract conversations with people about like what an intelligence explosion it would look like or what it looks like for AI to automate a r and d and just getting a more tangible sense of like what's involved in making the AI progress um and i guess one of the questions i was debating with daniel is how much or i was asking him is how.
我也觉得很有趣,他们在模型中加入了来自 Meta 的多标记预测功能。Meta 曾发表了一篇关于多标记预测的不错的论文,但实际上我不确定这到底是好是坏,因为 Meta 并没有在 Llama 模型中加入这个功能,但 Deep Sea 在他们的论文中却加上了,我觉得这很有意思。这是因为他们在迭代和包含算法上更快吗?还是因为 Meta 觉得这个算法改变在规模上不太好?我不太清楚,但作为曾邀请人们上播客讨论这个话题的人,我觉得这真的很有趣。这既反映了当前 AI 领域的发展动态,也让我在与人们进行关于智能爆发或 AI 自动化研发的抽象对话时,获得了更具体的理解。我和 Daniel 曾讨论的一个问题是,这在 AI 的进步中涉及多少,我是这样问他的。
many of the um improvements require a deep conceptual understanding versus how many are just like monkeys trying ideas and you could just like run a bunch in parallel and it seems like the mla thing is motivated by this like deep like conceptual understanding of like oh each attention head only needs to see like uh the subspace that's relevant to its attention pattern um i feel like that is like required a lot of like conceptual insight in a way that these models are especially bad at um as opposed to i don't know how the load balancing thing works but that just seems like maybe you could like try it out and see what happens yeah that's probably just like trying out a whole bunch of things i mean i mean i mean my no see what fraction is which i'd be curious about yeah i don't know about fractions it might be like you have a hunch for a core problem you can think of ten possible ways to solve it and then you just need to try them and see what works and that's kind of where the trial and error like sorcery of deep learning can kind of kick in and like no i'm shesia we'll talk about this it like about how he like five percent of his ideas work so even he uh like wanted god of uh model like design which a design uh is has like a relatively little hit rate but he just tries so many things right or being able to come with any ideas in the first place yeah one one one like mechanism could be that like no i'm just doesn't have to do any the engineering work and he can just like abstractly express an intuition yeah i actually think like your rates of progress almost don't change that much depending on like so long as it's able to completely implement his ideas interesting similar like if you if you have like no i'm jizier at a hundred x speed yeah that's still kind of wild yeah um like there's all these like full backs of like of like wild worlds yeah where even if you don't get like 100% like no i'm jizier level intuition in in model design it's still okay if you just accelerate him by hundred x right.
especially jizier compute bottleneck anyway so like trying out his ideas or i see does not compute to try out all of his ideas but dorkas you said oh well the model can do the more straightforward things and not that do you thought i mean i do want to push back on that a little bit like i think them again if the model has the right context and scaffolding it's starting to be able to do some really interesting things like the interp agent has been a surprise to people even internally at how good it is at finding the needle in the haystack like when it plays the auditing game finding this reward model bias feature and then reasoning about it and then systematically testing its hypotheses so it looks at that feature then it looks at similar features it finds one with a preference for chocolate it's like huh that's really weird that the model wants to add chocolate to recipes let me test it and so then it will make up like hey i'm trying to make a tomato soup what would be a good ingredient for and then sees that the model replies chocolate it reasons through it and then keeps going very conceptual on the same you.
yeah and even where like especially it's spotted it's like oh this is a key part of its persona i see this Oxford paper what if i change Oxford to Stanford what if i now say Richard Feynman really likes this thing and it's like really carving out the hypothesis space and and and testing things in a way that i i'm kind of surprised by also by the way ml research is one of the easier things to rl on and some respects once you get to a certain of okay ability it's very well-defined objective function did the loss go down make number go down make number go down oh make you know go out number go up depending on which number it is i just flip the sign so the sign and so once you get to the stage of models are capable of implementing one of no one's ideas right and then you can just let them loose and let them build that intuition of scientific of like how to do scientific discovery right the key thing here again is the feedback loops of like i expect scientific areas where you are able to put it in a feedback loop to have eventually superhuman performance.
i one prediction i have is that we're going to move away from can an agent do xyz and more towards can i efficiently deploy launch 100 agents yeah and then give them the feedback they need and even just be able to like easily verify what they're up to right there's this generator generator verify fire gap that people talk about where it's like much easier to check something than it is to produce the solution on your own but it's very plausible to me will be at the point where it's so easy to generate with these agents that the bottleneck is actually can i as the human verify the answer and and again you're guaranteed to get an answer with these things and so ideally you have some automated way to evaluate and test a score for like how well it worked how well did this thing generalize and and at a minimum you have a way to easily summarize what a bunch of agents are finding and it's like okay well if 20 of my hundred agents all found this one thing then like it has a higher chance of being true and and again software engineering is going to be the leading indicator of that right like over the next six months like the remainder of the year basically we're going to see progressively more and more experiments of the form of how can i dispatch work to a software engineering agent in such a way that is async uh clawed forwards get up integration uh where you can ask it to do things on get up ask a do pull requests this kind of stuff that's coming up and the opening as codex are like examples of this basically.
um where we uh you can sort of almost see this in like the coding startups i i think of this like product exponential in some respects where you need to like be designing for like a few months ahead of the model to make sure that the product you build is the right one yeah um and you saw like last year you know cursor hit pmf with uh claw 3.5 sonnet right like them they were they were around for a while before but then the model was finally good enough that the vision they had of how people would program like hit yeah um and then uh you know win surf bet like a little bit more aggressively even on the agentickness of the model like uh you know you could like with longer running agentick workflows in this kind of stuff and i think that's when they they sort of like began competing with cursors when they bet on that particular vision and the next one is is you're not even in the loop so to speak you're not in an IDE but you're asking the model to do work in the same way that you would ask someone on your team to do work yeah.
um and yeah that is not quite ready yet like there's still a lot of task where you need to be in the loop yeah but the the next six months looks like an exploration of like exactly what does that run like yeah but just to be really concrete or pedantic about the bottlenecks here a lot of it is again just tooling and are the pipes connected yeah a lot of things i can't just launch clawed and have it go and and solve because maybe it needs a GPU or maybe i need very careful permissioning so that it can't just like take over an entire cluster and like launch a whole bunch of things right so you really do need good sandboxing and the ability to to use all of the tools that are necessary and we're almost certainly like under eliciting dramatically when you look at meters evals of can the model solve the task they're they're solving them for like hours um over like multiple iterations um and eventually like one of them is like oh yeah i've come back and i've solved the task me at the moment at least like maybe the fault is my own but i try the model and doing something and if it can't do it i'm like fine i'll do it yeah.
um i don't like what you're saying because you we don't even treat other humans this way right if a you hire new employee yeah you're not like oh i'll do it yeah yeah yeah yeah you're like you're given like like spend literally weeks. giving them feedback yes um where like we'll go up with the model in like minutes yes exactly but but it i think part of it is is it a sink or not yes and if it's human in the loop then it's so much more effortful and unless it's getting that's right yeah yeah um i've noticed if i don't have a second monitor with cloud code always open in the second monitor uh i won't really use it yeah yeah it's only when it's right there and it's i can send off something if it hits great if not i'm kind of working on it at the same time yeah yeah yeah but there's more i sink form factor i expect to like really quite dramatically improve the experience of these models interesting you can just say like let's see if it can do that yeah let's give it a while um try tender purchase yeah yeah just fired off yeah fire before we end this episode i do want to get back at this crux of um you why does the prog as i said you're talking about a computer usagians and why color work happened over the next few years why is this not a thing that takes decades yeah.
and i think the crux comes down to um the people who expect something much longer have a sense that uh when i need to again time out my podcast they were like look you could look at alpha go and say like oh this is a model that can do exploration it can like alpha zero can generalize to new video games it has all these priors about how to engage with the world and so forth the intellectual ceiling is really high yeah exactly and then in retrospect obviously a bunch of the methods are still used today in deep learning and you can see um see similar things in the models that we trained today but it was fundamentally like not a sort of like baby a g i that we just had to like add a little like sprinkle of something else on top of in order to make it the elements of today and i just want to like very directly address this crux of why are elems um in a much different position of a respect to true a g i than alpha zero why are they actually the base on which like adding in a few extra drops of this kind of care and attention yeah.
uh gets us to human level intelligence i think one important point is that when you look at alpha zero it does have all of those ingredients and in particular i think like the intellectual ceiling goes like quite contra what i was saying before which is like we've demonstrated this incredible complexity of like an atom program yeah um i do think that the type of task and setting that alpha zero will like it worked in this two-player perfect information uh like gain basically it's incredibly friendly to our elegrams um and the reason it took so long to to get to like a more a g out proto a g i style models is you do need to crack that like general conceptual understanding of like the world and language and this kind of stuff and you need to get the initial reward signal on tasks that you care about in the real world which are like harder to specify the games.
um and i think then that like that sort of like gradient signal that comes in the real world like all of a sudden you get access to it and you can start climbing it um whereas alpha zero didn't didn't ever have like the first wrong to pull on yeah this goes back to the monkeys on the typewriter yeah and like the pre-training model and until you had something like gpt3 gpt4 it just couldn't generate coherent enough sentences to even begin to do rla chaff and tell it what you liked and didn't like yeah um if we don't have even reasonably robust um or weekly robust computer use agents by this time next year are we living in the the bus timeline as a 20 to 30 or bust i would be extremely surprised if that was the case and i think that would be like somewhat of an update towards like there's something like strangely difficult about yeah this like computer use in particular yeah um i don't know if it's the bus timeline but it's definitely like the i would update on this being like yeah it lengthen of time yeah yeah but yeah
i mean i think more and more it's no longer a question of speculation if people are skeptical i'd encourage like using clawed code or like some agentic tool like it and just seeing what the current level of of capabilities are treating it so much easier but seriously like the models are getting really capable at tasks that we care about and we can give them enough data for yeah and and i mean the circuits results from interpretability are also pointing in the direction that they're doing very reasonable generalizable things uh and so yeah the this question matters a lot but um i'm surprised by how many deep learning critics just like haven't really interacted with the models or haven't in a while and constantly move the gold posts yeah yeah
yeah like the turning test used to be a thing right like yeah we don't even talk about it and it'd be like silly to think that it was a meaningful test yeah yeah now that means that one caveat on that is like if software engineering is just like dramatically better than computer use i mean computer use still sucks then i'd be like still like oh maybe everyone just kept focused on software engineering like it was just like by far the most valuable thing like every module person and dollar went towards software engineering i don't think that's the case i do think like computer use is valuable enough that like you know people will care about it yeah but that would be like my that's my one like escape patch that i'm putting in place for next year yeah it would be good for my life in perspective too because i think you kind of do need a writer range of skills before you can do something super super scary um oh like as in if the models didn't get you better
yeah if it's like just report they're super human coders but they're not like Henry Kissinger level i don't know that seems okay like if we have AI oracles yeah that's good yeah yeah that's good yeah so if you look back at AI discourse like going back a decade there's a sense that there's dummy i then there's aji then there's asi that intelligence is the scalar value um the way you've been talking about the these models has a sense of jaggedness yeah it's especially tuned to environments in which it's been trained a lot or has a lot of data um is there a sense in which like there's still it still makes sense to talk about the general intelligence of these models is there enough meta learning and transfer learning is distinguished between like the sizes of models or like yeah uh the way models are trained or are we moving into a regime where it's not about intelligence it's more so about domain yeah so one intuition pump is this conversation was had a lot when models were like GPT two-sized and fine-tuned for various things uh and they found you know you people would find that the models were dramatically better at things that they were fine-tuned for right
but by the time you get to GPT four when it's trained on a wide enough variety of things actually the like the sort of total compute uh like it generalize very well across all of like the individual subtasks and actually generalize better than smaller fine-tuned models um in a way that was extremely useful uh i think right now what we're seeing with rl is is pretty much the same story playing out where uh there's this jaggedness of like things that they're particularly trained at but as we expand the total amount of compute that we do rl with you'll start to see as the same transition from like GPT two fine-tunes to uh like GPT three GPT four like unsupervised like you're metal learning and like generalization across things and i think we were already seeing like early evidence of this uh in its ability to like generalize reasoning to uh things but um like i think this will be like extremely obvious uh soon
one nice example of this is just the ability or notion to backtrack right you go down one solution path oh wait let me try another one uh and this is something that you start to see emerge in the models through rl training on harder tasks and i i think right now i it's not generalizing incredibly well at least with with well i mean has the we have rl the model to be a uh interp agent no i mean no yeah exactly yeah like so all this time we're talking about like oh it's only good at things that's being rl that well it's pretty good at that because that's pretty it's that is a mixture of like science and like understanding language and like uh and coding um like there's this sort of like mixture of domains here all of which you need to understand like you need to be both a great software engineer and be able to like think through language and and like like state of mind and almost philosophize in some respects to be an interp agent yeah um and it is generalizing from the training yeah to do that.
what's the end game here claw aid comes out um and they give it to you and dot dot dot you say thumbs up what's happened what do you do yeah i mean it really depends upon the timeline at which we get clawed eight and the models hit like ASL4 capabilities right like like fundamentally we're just going to use whatever tools we have at the time and see how while they work um ideally we have this enumerative safety case where we can almost like verify or prove that the model will behave in particular ways um and the worst case we use the current tools like when we won the auditing game of seeing what features are active when the assistant tag lights up can you explain what is mechanistic interpability what are features what are circuits totally yeah yeah yeah so uh mechanistic interpretability or the cool kids call it mecha and terp is uh trying to reverse engineer neural networks uh and figure out kind of what the core units of computation are yeah.
这段文字的大意是:在讨论 CLAW AID(可能是某个项目或工具)时,一个关键问题是其最终目标是什么。现在,它已经推出,给予了你,你需要评估它的表现。具体的结果取决于我们获得 CLAW AID 以及模型达到 ASL4 能力的时间点。基本上,我们会利用当时可用的工具,观察它们的运行效果。理想情况下,我们会有一种安全验证机制,可以几乎证明模型会以特定方式运作。最差的情况是,我们使用当前的工具,观察在助手标签亮起时有哪些功能被激活。
有人问什么是“机制可解释性”(机械可解释性)以及功能和电路是什么。对此,回答是:机制可解释性(时髦的说法是“Mecha and Terp”)是指试图对神经网络进行逆向工程,以找出其核心计算单元是什么。
lots of people think that because we made neural networks because they're artificial intelligence we have a perfect understanding of how they work yeah and it couldn't be further from the truth uh neural networks AI models that you use today are uh grown not built uh and so we then need to do a lot of work after they they're trained to figure out to the best for abilities how they're actually going about their reasoning and so uh two and a half three and a half years ago uh this kind of agenda of applying mechanistic interpretability to large language models started uh with Chris Ola leaving open AI co-founding and throtic um and every roughly six months since then we've had kind of like a major breakthrough in our understanding of of these models and so first with toy models a superposition uh we established that models are uh really trying to cram as much information as they possibly can into their weights uh and this goes directly against people saying that neural networks are over parameterized and like classic AI machine learning back in the day you would use linear regression or something like it and people had a meme of AI or neural networks uh deep learning be using way too many parameters.
很多人认为,由于我们研发了神经网络,人们误以为我们对它们的工作原理有完美的理解。这种看法与事实相去甚远。现代的神经网络或人工智能模型,其实是“成长出来的”,而不是“构建出来的”。因此,在它们经过训练后,我们需要花费大量时间和精力去尽力理解它们是如何进行推理的。
大约在两年半至三年半之前,Chris Ola 离开 OpenAI,联合创立了一家名为 Anthropic 的公司,并开始推动将机械解释应用到大型语言模型上。这项研究大约每隔六个月就有重大突破。首先,利用简单的模型和超叠加现象,我们发现这些模型实际上在尽可能多地压缩信息到其权重中。这一发现直接反驳了那些声称神经网络参数过多的看法。在早期的经典人工智能和机器学习中,人们通常使用线性回归等方法,并且有一种“神经网络或深度学习使用了过多参数”的刻板印象。
um there's like this funny meme that you should show of like layers on the x-axis and layers on the y-axis and this like jiggly line that just goes up and it's like oh just throw more layers at it right um but it but it actually turns out that at least for really hard tasks like being able to accurately predict the next token for the entire internet these models just don't have enough capacity and so they need to cram in as much as they can and the way they learn to do that is to use each of their neurons or units of computation in the model for lots of different things and so if you try to make sense of the model and be like oh if I remove this one neuron or like what is it doing in the model it's impossible to make sense of it it'll fire for like Chinese and phishing and horses and I don't know just like a hundred different things um and it's because it's it's trying to juggle all these tasks and use the same neuron to do it so that's that superposition.
nine months later we write towards monosomanticity which introduces what are called sparse autoencoders and so going off what I just said of the model trying to cram too much into two little space uh we give it more space uh there's this higher dimensional representation uh where it can then more cleanly represent all of the concepts that it's understanding and and this was uh of a very toy paper and so much as it was a two layer really small really dumb transformer and we fit up to I want to say 16,000 features which we thought was a ton at the time fast forward nine months we go from a two layer transformer to our clawed three sonnet frontier model at the time and fit up to 30 million features and this is where we start to find really interesting abstract concepts like a feature that would fire for code vulnerabilities and it wouldn't just fire for code vulnerability it would even fire for like you know that chrome page you get if you like it's not an HTTPS um you are all and it's like warning this site might be dangerous like click to continue and like also fire for that for example and so it's like these much more abstract coding variables or sentiment features amongst the 30 million.
fast forward nine months from that and now we have circuits and I threw in the analogy earlier of the ocean 11 heist team where now you're identifying individual features across the layers of the model that are all working together to perform some complicated task and you can get a much better idea of how it's actually doing the reasoning and coming to decisions like with the medical diagnostics um one example I didn't talk about before is um with like how the model retrieves facts and so you say like what sport did Michael Jordan play and not only can you see it hop from like Michael Jordan to basketball answer basketball but the model also has an awareness of when it doesn't know the answer to a fact and so by default it will actually say I don't know the answer to this question but if it sees something that it does know the answer to it will inhibit the I don't know circuit and then reply with the circuit that it actually has the answer to.
so for example uh if you ask it who is Michael Batkin which is just a made up fictional person uh it will by default just say I don't know it's only with Michael Jordan or someone else that will it will then inhibit the I don't know circuit but what's really interesting here and where you can start making downstream predictions or reasoning about the model is that that I don't know circuit is only on the name of the person and so in the paper we also ask it uh what paper did Andre Carpati write and so it recognizes the name Andre Carpati because he's sufficiently famous so that turns off the I don't know reply but then when it comes time for the model to say what paper it worked on it doesn't actually know any of his papers and so then it needs to make something up and so you can see different components and different circuits all interacting at the same time to lead to this this final answer.
why I think it's a tractable problem like understand every single thing that's happening in a model or like that's the best way to understand why it's being deceptive if you wanted to explain why England won world were two using particle physics uh you would just like be on the wrong track you just want to look at the high level explanations of who had more weapons like what did they want and that seems analogous to just training linear probes for like are you honest are you being deceptive like um do we catch you doing bad things when we're right teaming you can be monitor you um why is this not analogous where we're asking a particle physicist to just backtrack and explain why um why England won world or two I feel like you just want to go in with your eyes wide open not making any assumptions for what that deception is going to look like or what the trigger might be yeah uh and so the wider you can cast that net the better.
um depending on how quickly AI accelerates and where the state of our tools are we we might not be in the place where we can like show prove from the ground up that everything is safe uh but the like I feel like that's a very good north star it's a very powerful reassuring north north star for us to aim for especially when we consider we are part of the broader AI safety portfolio. I mean do you really trust like you're about to deploy this system and you really hope it's aligned with humanity and that you've like successfully iterated through all the possible ways that it's going to like scheme or sandbag but that's that's also probably going to be true with whatever you find you're not I mean you're not you're still going to have variants that you haven't explained or like you found a feature but you don't know if it actually explains deception or something else instead or so so I guess first of all I'm not saying you shouldn't try the probing approach right like we want to pursue the entire portfolio.
we've got the therapists interrogating the patient by asking do you have any troubling thoughts we've got the linear probe which I'd analogize to like a polygraph test yeah where we're taking like very high level summary statistics of the person's well-being and then we've got the neurosurgeons kind of going in and seeing if you can find any brain components that are activating and troubling or or off distribution at ways so so I think we should do all of it um what what percent of the alignment portfolio should it and make it to be uh I think as much of a chunk as is necessary I mean I think at least like yeah hard hard hard to find but I don't know I don't think I feel like all of the different portfolios are like being very well supported and growing um you can also we're going back to like the the world's three question you can think of it as like a hierarchy of abstractions of trust yeah where like let's say you want to go and talk to like Churchill um it helps a lot if you can verify that in that conversation in that 10 minutes he's being honest.
and this like enables you to construct better meta narratives um of what's going on and so maybe particle physics wouldn't help you there but suddenly like the neuroscience of Churchill's brain would help you verify that he was being trustworthy in that conversation and that the like you know the soldiers on the front lines were being honest in their depiction of what oh their description of what happened and this kind of stuff like so as you can verify like progress like parts of the uh the tree up then that that that massively helps you build confidence uh I think language models are also just really weird right like with the emergent misalignment work they I don't know if they took predictions they should have of like hey I'm going to fine tune Chatsy PT on code vulnerabilities is it going to become a Nazi and I think most people would have said no and that's what happened.
and so what are the different how they discover that it became a Nazi they started asking at a ton of different questions and it will do all sorts of like vile and harmful things like the whole persona just totally changes and and I mean we are dealing with alien brains here who don't have the social norms of humans and or or even a clear notion of like what they have and haven't learned uh that that we have of them I mean yeah and and so I think you really want to go into this with with eyes wide open backing up from mech and turf uh if you live in a world where AI progress accelerates um by the way you were mentioning a little while ago that there's many wild worlds we could be living in but we're living at least one of them.
um another one that we've uh gestured at but it's worth making more explicit is this even if the AI models are not helping write the next training algorithm for their successor just the fact that if they had human level learning efficiency uh whatever a model is learning on the job or whatever copy of the model is learning on the job the whole model is learning so in effect it's getting more or if they're like a thousand times less efficient than humans are learning that's right and you just like to deploy them even still that's exactly yeah yeah anyways and and there's a whole bunch of other things like you can think about it but even there it's like you kind of have a like a broadly deployed intelligence explosion and I do think it's what like worth pressing on that future of um you know there's there's this whole spectrum of crazy futures but the one that I feel we're almost guaranteed to get and this is like a like a like a almost as strong statement to me um is one where like at the very least you get drop-in like white collar worker at some point in the next five years it's like I think it's very likely in two um but it seems almost over determined in like five and and unlike the grand scheme of things like those are kind of irrelevant time frames like it's.
the same either yeah um and that completely changes the world over the next decade um and and if the sort of if we don't have the right policies in place so that then you end up actually with almost some respects like a fundamentally worse world because the thing that these models get good out by default is like software engineering and like computer-using agents and this kind of stuff and then we need to we will need to put in extra effort to put them in the loops where they help us with scientific research or they're like we have the right robotic such that we actually like experience and increase the material quality of life um so that's what we're thinking about like if you're in the perspective of like I'm a country yeah what should I be doing or thinking about uh like plan for the case where where white collar work is automated um and then consider what does that mean for your economy and what you should be doing to prepare polls what should you be doing to prepare because honestly I think it's like a such a tough question yeah where like if you're India or Nigeria or Australia yeah um if you're a country unlike America or China where they do have frontier models um what is it that you should be doing right now especially on such a short time scale yes
uh so I think one very important point is that let's say this this scenario turns out true then computer to become the most valuable resource in the world yeah like the sort of GDP of your economy is dramatically affected by how much compute you can deploy towards these sort of organizations within your country um and so having some guaranteed amount of compute I think will actually be quite important so like pre getting ahead of investments in like data centers and this kind of stuff on the condition that it's like companies in your country have to be allowed to to use that compute um and uh yeah not necessarily for training but like just even just for inference I think the economic value here comes to that inference um I think it also makes sense to invest broadly in uh in AI like I think these countries have the opportunity to do so and I think that's like a portfolio of like you know foundation model companies but also like robotics supply chain and this kind of stuff um
I think that you should invest very proactively in policies that try and prevent capital lock-in um so we're in for a much worse world if like it just so happens that the people had like money in the stock exchange or inland before AGI are like dramatically more wealthy than the people who don't because it's a gross misallocation of resources uh so having like I know one of my favorite episodes actually on your podcast was uh like the George's on one where you're trying to like like appropriately like value or allocate land um and so I think like like this strikes particularly close to home coming from Australia where I think like out like uh policies with respect to land are like grossly wrong um but I think this is broadly true um being very uh like forward on regulation and integration of these models into uh into like your country um is is important um and proactively uh making sure that people have choice like so let's say uh you should be quite proactive about making sure that like the phones or devices or like glasses that people have at people have like free choice on like what yeah things they run um and then oh so that's like that's the we just get white collar worker right and like you're trying to like do the best to like prepare your country for that um
and then it's like okay well what can you do to like make all possible versions of the future go like well like that's like covering some amount of like economic downside uh the other like things I think are really important is like figure out how you can uh like either make the like basically ensure dramatic upside or cover like terrible downside um and so like getting dramatic upside is like making sure that there uh like is investment in biology like biology research and this kind of stuff in an automated way that is like uh these models are actually like able to produce novel medicines that massively massively improve our like quality of life. In covering the downside is like AI alignment research and this kind of stuff an automated testing and like really thinking hard about that AI safety institutes this kind of stuff but these seem like things that a rich person a random rich person could also do like there's it seems like there's not a thing that a nation state is uniquely equipped to do uh let's go point in this in this scenario
I mean like dramatic allocation of resource like resource storage compute I think is is sensible um I would be doing that if I was in charge of a nation state I think it just increases your optionality and like most of the future worlds. Yeah um Dylan Patel has some scary forecasts on US energy yeah versus China. Yes we're like 34 gigawatts off. Yeah the US's line is like flat yeah basically and China's land is like this and I mean the US like very clearly yeah we just need so many more power plants. Yes intelligence becomes this like incredibly valuable input like intelligence becomes almost a raw input into the economies and quality of life of the future. The thing directly underneath that is energy and so making sure you have like you know incredible amounts solar like tile the desert and solar panels on you know some plastic desert and solar panels would be would be helpful towards making sure that you have more access to intelligence on top.
Yeah just to make it explicit because we've been touching on it here even if AI progress totally stalls you think that the models are really spiky and they don't have general intelligence. Yes it's so economically valuable and sufficiently easy to collect data on all of these different jobs these white color job tasks such that to Shulta's point we will we should expect to see them automated within the next five years. Yeah even if you need to hand spoon every single task to the model. It's like economically worth all to do so even if like algorithmic like progress stalls out and like we just never figure out how to like keep progress going which I don't think is the case like that hasn't stole that. yet it seems to be going great. The current suite of algorithms are sufficient to automate white color work provided you have enough of the right kinds of data. Yes and in a way that like compared to the tam of salaries for all of those kinds of work is so like trivially worthwhile.
Yeah yeah exactly. I do just want to flag as well that there's a really dystopian future if you take more of x paradox to its extreme which is this paradox where we think that the most valuable things that humans can do or the smartest things are like add large numbers in our heads or do any sort of white color work and then we totally take for granted our fine motor skill and coordination but from an evolutionary perspective it's the opposite so we got like evolution has optimized fine motor coordination so well and if you look at like robot hands or like the ability to open a door is still just like really hard for robots. Meanwhile we're seeing this total automation of coding and everything else that we've seen as clever. The really scary future is one in which AI's can do everything except for the physical robotic tasks in which case you'll have humans with like air pods and like glasses glasses and there'll be some robot overlord controlling the human through cameras by just like telling it what to do and like having a bounding box around the thing you're supposed to pick up and so you have like human meat robots.
And not like necessarily saying that like that's what the AI's would be like want to do or anything like that but as in like if you were to be like what are the relative economic value of things like the AI's are out there doing computer programming and like the most valuable thing that humans can do is like be amazing robots. Now that being said I think more of x paradox is a little bit fake. I think the main reason that robots are worse than at like being a robot then they are at software engineering is the internet exists for software engineering like GitHub exists and there is no equivalent thing like if you had all like you know mocap of everyone's actions as they were like going about their daily lives for like some reasonable fraction of the human population robotics is also like close to solve like like on track to be solved at the same rate that software engineering is on track to be solved.
So this is only like this vision is only like a sort of decade long section but it's still terrible decade. Like imagine the world where people have lost their jobs you haven't yet got novel biological research that means like people's quality of life is like dramatically better. You don't yet have material abundance because you like haven't actually been able to action the physical world in the like necessary way like you can't build dramatically more because that's like building dramatically more takes robots basically. And people's like main comparative advantage is as fantastic robots is like a shocking shocking world.
I mean for me the perception of an average human I think it actually might be better. You're like wages will be higher because you're you're the complement to something that is enormously valuable. Right which is AI labor. Right. And like you know decade or two on like the world is fantastic. Yeah right like you truly like you robotics to solve and you've got to get like you know like radical abundance basically provided that you have all the policies set up like necessary to commit building like you sort of you end up with that same change from you know the like the before and after photos of Shanghai yeah we're like 20 years on it's like this dramatically transformed city like a lot of places in the world probably end up like that right over that two decade period but we need to make sure like one do our best to estimate is this actually what is on track to happen like build sweet bench but for all the other forms of white collar work and measure and track that's a great thing that governments should be doing by the way is like trying to break down the sort of functions of their economy into measurable tasks and figuring out where what is the curve actually look like for that because they might be a bit shocked by the progress there.
You know there's no sweet bench for tax like taxi bell and then. I don't like have all the answers here but like figuring out a way to like share the proceeds of this economy like broadly across people or like invest heavily in robotics and collecting data so that we get robotics faster and we get material abundance faster invest in biological research that we get but like all that faster they basically try and pull forward the radical upside because otherwise you have a pretty dark yeah like section. I think one thing that's not appreciated enough is how much of our leverage on the future given the fact that our labor isn't going to be worth that much comes from our economic and political system surviving for your million X's and P equity to mean something for your contracts to mean anything for the government to be able to tax the AI labor and give you a UBI off of that it just like that requires our legal institutions our economic institutions our financial rail surviving in the future.
Yes the way in which that likely happens is if it's also in the AI's best interests that they follow those rails and by AI I don't mean some monolithic single AI I just mean like firms which are employing AI and becoming more productive as a result. You don't want to be in a position where it's so onerous to operate in our system that you're basically selecting for firms who either emigrate or who are like doing black market stuff etc and which means I think like you want to make it super super easy to deploy AI have the equivalent to special economic zones etc because otherwise you are just surrendering the future outside of any control that you might have on it.
One of the reasons by the way that I worry about turning AGI into a national security issue or having it have extremely close ties with the government the Manhattan Project thing is that it disproportionately redirects the use of AI towards military tech and the mosquito drones and whatever and and also naturally puts other countries in the same frame of mind right if we're developing the mosquito drones why we China not develop the mosquito drones and that just seems like a zero-sum race and not to mention a potentially catastrophic one whereas like you know like compute we limited you know we want we will need to disproportionate accelerate some things to the extent it just remains totally like a consumer free market landscape it just seems more likely that we'll get the glorious transhumanist future where they're developing the things that make human life better.
Yes I agree like the case where you end up with like two national projects facing off against it dramatically worse right like we don't want to live in that well yeah it's much much better if it stays a free market so to speak. Yeah yeah yeah okay I want to take issue with your claim that even if with the with the algorithms of today if we just collect enough data yeah that we could automate why color work first let me get an understanding of what you mean by that so do you mean that we would do the analogous thing of free training with all the trajectories of everything people would do on their jobs could you say could you make either manually or through some other process some RL procedure based on the screen recordings every white color worker what kind of thing are you imagining.
I mean there's like a continuous distribution of this stuff. One like important like mental model to think about RL is I think as like the task gets more there is some respect with which like longer horizon or better that task if you can do them if you can get that reward ever are like easier to judge. So like again it's come back to that like can you make money on the internet that's an incredibly easy reward signal to judge but to like do that there's like a whole hierarchy of like complex behavior so if you get like pre-trained up to the easy to judge reward signals like does your website work does it go down I do people like it like there's all these reward signals that we can respond to because we have a long in we can like progress do these long enough trajectories to actually like get to interesting things if you're stuck in this regime where like you need to reward signal every five tokens like it's a way more painful and like long process but if you could like pre-trained on every like screen in America then probably the like RL tasks that you can design are very different to like if you could only like take the existing internet as it is today and so like how much of that you get access to like changes the mix.
Interesting. So as we're training them on longer and longer horizon tasks and it takes longer for them to get any any signal on whether they're still so completely the task well that's low down progress because it takes more compute per task. I do think there's this notion the longer the harder tasks the more training is required and I'm sympathetic to that naively but we as humans are very good at practicing the hard parts of tasks and decomposing them and I think once models get good enough at the basic stuff they can just rehearse or fast forward to the more difficult parts. I mean it's definitely one of the complexities right like as you use more compute and like the and as you're trying to like more and more difficult tasks I mean I don't know your rate of improvement of biology is going to be like somewhat bound by the time it takes the seltogra in a way that your rate of improvement on math isn't for example.
So yes but I think for many things we'll be able to parallelize like widely enough and get enough iteration loops. Will the regime of training new models go away? Will we eventually get to like you got the model and then you just keep adding more skills to it with our training? That depends on whether or not you think like there's a virtue in pre-training in your architecture. Basically you make some like architectural change then you like probably need to like do some form of like at least like a retraining in your model. How does the fact that if RL requires a bunch of inference to do the training in the first place does that push against the thing you were talking about where we actually need a bigger model in order to have brain like energy but then it's also more expensive to train it in RL so where does that balance out?
I think we got to drink the bitter lesson here and yeah like there aren't infinite shortcuts like you do just have to scale. Something's I have a bigger model and pay more inference for it and if you want a GI then that's what you got to pay the price of. But there's like there's a trade-off equation here right? There is science to do which you know everyone is doing of what is the optimal point at which to do RL because you need something which can both learn and discover the sparse reward itself. So you don't want to want a one-paround model useless even though you can run it really fast. You also don't want 100-team model because it's super slow.
Yeah. Password RL. So the marginal benefit of like it's learning efficiency is not worth it. So there's a pretty fun to hear. What's the optimal model size of your current class of capabilities and your current set of RL environments? Even in the last year there's been much more of a factor of the inference cost. So just explicitly the bigger the model the more expensive it is to do a forward pass and generate tokens. And the calculus used to just be should I allocate my flops to more training data or a bigger model?
Yeah. And now another huge factor is how much am I actually going to do forward passes on the most model once it's trained? Yeah. My total pool of compute, how do I allocate that across trained data compute and inference compute for the RL training? And then even within inference there's all this research on well what strategy should I use? Should I sample 10 and take the best? Do I do this sort of like branching search, etc., etc. And so with RL where you're sampling a whole lot of tokens you also need to factor in the ability for the model to actually generate those tokens and then learn and get feedback.
Okay. So if we're living in this world what is your advice to somebody early in their career or student in college? What should they be planning on doing? Yeah. So I think once again it's worth considering the spectrum possible worlds and preparing yourself for that. And the one like the sort of action that I think is like highest EV in that case is you are about to get dramatic in the minimum you are about to get dramatically more leverage. You already have like already the startups in YC like writing huge amounts of code with you know, Claude. So what challenges, what causes do you want to change in the world with that added leverage? Like if you had 10 engineers at your Beckenkohl, what would you do? Or if you had a company at your Beckenkohl, like what would that enable you to do? And what problems and domains suddenly become tractable? That's the world you want to prepare for.
好的。那么如果我们生活在这个世界上,你对那些职业早期的人或在校大学生有什么建议?他们应该计划做些什么?是的,我认为值得再次考虑一下各种可能的世界,并为此做好准备。我认为在这种情况下最有价值的行动是准备好迎接未来的巨大杠杆效应。你已经看到了,例如,在YC创业公司中,人们正通过 Claude 等工具编写大量的代码。那么,你想利用这种额外的能力在世界上改变什么问题或事业?比如说,如果你有10名工程师为你效力,你会做些什么?如果你拥有一家公司的资源,你将能够实现什么?哪些问题和领域会因此变得可解决?这就是你应该为之准备的世界。
Now that still requires a lot of technical depth. Obviously there is the case where AI just becomes dramatically better than like everyone had everything, right? But for at least a while probably there is advantage. I think Jensen actually talked about this in an interview in which he's like you know, I have like 100,000 general intelligences around me and I'm still like somewhat useful because I'm there like you know, directing the values and like like you're like asking to do things and you know, they're like I still have value even though I have 100,000 general intelligences. And for many people I think that will still be true for a fair while.
And then you know, as they are, they get better and better and better and like so on eventually. No, but again, prepare for like the spectrum of possible worlds because in the event where we're just totally out competed, yeah, that's my what you do. But in all the other worlds, that is a lot. Get the technical depth, study biology, study CS, like really think hard about study physics, think about hard about what challenges you want to solve on the world.
Yeah, that's a lot of topics. You can't. You can't. It's so much easier to learn. Everyone now has the like infinite perfect tutor. Yeah, yeah. Yeah, it's definitely been helpful to me. Yeah, I would say some combination of like get rid of the sunk cost of your like previous workflows or expertise in order to evaluate what AI can do for you. That's right. And another way to put this, which is fun is just like be lazier. In so much as like figure out the way that the agent can do the things that are toilsome. But but it's you're going to have to in the ultimately you get to be lazier, but in the short run, you need to like critically think about the things you're currently doing and like what an AI could actually be better at doing.
Yeah. And then go and try it or explore it. Because I think there's like still just a lot of low hanging fruit of people assuming and not writing the full prompt, giving a few examples, connecting the right tools for your work to be accelerated, automated. Yeah. Yeah. There's also the sunk cost of feeling like since you're not quote unquote early to AI that you've sort of missed the boat and you can't like, but I I think I mean, I remember when GPT three came out. So that story on the podcast, when I graduated at college, I was planning on doing some sort of AI wrapper startup. And the podcast was just like a gateway into doing that.
And so it's trying out like different things. And at the time, I remember thinking, oh, 3.5 is out. And people are like, I'm like so behind on like the startup scene here or whatever if I wanted to make my own wrapper. I mean, maybe that idea of the wrapper was an advisable in the first place, but just like every time feels early because like it's sort of if it's an exponentially growing process. And there were many things, many ideas are only becoming possible now, right?
Yeah. Exactly. It's a product expenditure. I talk about products that literally obsolete it. Like you need to constantly reinvent yourself to stay the frontier of capabilities. But do you remember I had a really shitty idea and I give you a call. It was. It was like, I think it was like rag for like lawyers or something. Right.
Anyways, I give you I think one of our first interactions was I'm like, hey, what do you think of this idea? And you're like, I think the podcast sounds promising. That's right. Which I appreciate. Yeah. I got slightly annoyed at a friend recently who I think is really talented and clever and interested in AI, but has pursued a biology route. And I just kind of tried to shake them of like, you can work on AI if you want to. I mean, I think humans are artificial or not artificial or biological general intelligences, where a lot of the things of value are just very general.
Yeah. And whatever kind of specialization that you've done, maybe just doesn't matter that much. I mean, again, it gets back to the sunk cost. But like so many of the people even my like colleagues at Anthropic. are excited about AI and they just don't let their previous career be a blocker. And because they're just like innately smart, talented, driven, whatever else, they're they end up being very successful and finding roles. It's not as if they were in AI forever. I mean, people have come from totally different fields. And so don't think that you need like permission from some abstract entity to get involved and apply and be able to contribute.
If somebody wanted to be in AI research, like right now if you give them an open problem, or like the kind of open problem that is very likely to be quite impressive. What would it be? I think that now that our roles like come back, paper is building on Andy Jones's like scaling board like scaling laws for board games are interesting, like showing that you can like investigating these questions like the ones you asked before where you're like, oh, like you're, these the model actually learning to do more than its previous POSIT K or is it just like discovering that like exploring questions like that deeply, I think are interesting.
Yeah, yeah. Like scaling laws for all basically. Yeah, very curious to see like how much like the marginal increase in metal learning from a new task or something. I mean, on that note, I think I think model dipping has like a bunch of opportunities. Yeah. Also, people say, oh, we're not capturing all the features. There's always stuff left on the table. What is that stuff that's left on the table? Yeah. Like if the model's jailbroken, is it using existing features that you've identified? Is it only using the error terms that you haven't captured?
Yeah. I don't know. There's a lot here. I think Matt's is great. The Anthropic Fellowship has been going really well. Good fire and Thropic invested in recently. They're doing a lot of interpretability work or just applied anything to anything to get your equity. There's just so many interpretability projects that are like, there's so much low hanging fruit and we need more people and I don't think we have much time.
Yeah. I also want to make a plug for performance engineering. I think this is one of the like like best ways to to demonstrate that you have like the raw ability to do it. If you made an extremely efficient transform implementation on TPU or Trinium or like Incruda, then I think there's a pretty high likelihood that you'll get a job. There's a relatively small pool of people that you can trust to completely own and to end the performance of a model.
And if you have broad, deep electrical engineering skills, I think you can probably come up to speak pretty fast on an accelerator. You can come up to speak reasonably fast and it teaches you a lot of good intuitions to be actual intricacies of what's going on in the models, which means that you're then very well placed to think about architecture and this kind of stuff. One of my favorite people in thinking about architecture and Anthropic at the moment actually came from like a heavy GPU kernel program background to think no is the ins and outs really deeply and can think about the trade-offs really well.
This is fun, guys. It's really fun. Thanks. Yeah. Great to be back. I hope you enjoyed this episode. If you did, the most helpful thing you can do is just share it with other people who you think might enjoy it. Send it to your friends, your group chats, Twitter, wherever else. Just let the word go for it. Other than that, super helpful if you can subscribe on YouTube and leave a five star review on Apple Podcasts and Spotify.
Check out the sponsors in the description below. If you want to sponsor a future episode, go to doarkesh.com slash advertise. Thank you for tuning in. I'll see you on the next one.