All of you know Andrew Ng as a famous computer science professor at Stanford who was really early on in the development of neural networks with GPUs. He is, of course, a co-founder of Coursera and the creator of popular courses like those from DeepLearning.AI, and also the founder and early leader of Google Brain. But one thing I've always wanted to ask you before I hand it over, Andrew, while you're on stage, is a question I think would be relevant to the whole audience. Ten years ago, on problem set number two of CS229, you gave me a B. I looked it over, and I was wondering what you saw that I did incorrectly. So, anyway: Andrew. Thank you, Hansine.
Looking forward to sharing with all of you what I'm seeing with AI agents, which I think is an exciting trend that everyone building in AI should pay attention to. I'm also excited about all the other What's Next presentations. So, AI agents. Today, the way most of us use large language models is like this: a non-agentic workflow, where you type a prompt and it generates an answer. That's a bit like asking a person to write an essay on a topic by saying, please sit down at the keyboard and just type the essay from start to finish without ever using backspace. And despite how hard this is, LLMs do it remarkably well.
In contrast, with an agentic workflow, it may look like this: have an LLM write an essay outline. Do you need to do any web research? If so, let's do that. Then write the first draft, then read your own first draft and think about what parts need revision, then revise your draft, and you go on and on. So this workflow is much more iterative, where you may have the LLM do some thinking, then revise its article, then do some more thinking, and iterate through this a number of times. And what not many people appreciate is that this delivers remarkably better results.
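To make that loop concrete, here is a minimal sketch of such an iterative essay workflow. The `llm` helper and the exact prompts are hypothetical stand-ins for whatever model API you use; the point is just the outline, draft, critique, revise cycle.

```python
# Minimal sketch of an agentic essay-writing loop.
# `llm` is a hypothetical helper that sends a prompt to a
# chat-completion API and returns the model's text response.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API of choice")

def write_essay(topic: str, revisions: int = 2) -> str:
    outline = llm(f"Write an essay outline on: {topic}")
    draft = llm(f"Write a first draft following this outline:\n{outline}")
    for _ in range(revisions):
        # The model critiques its own draft, then revises it.
        critique = llm(f"Read this draft and list the parts that need revision:\n{draft}")
        draft = llm(
            f"Revise the draft to address the feedback.\nDraft:\n{draft}\nFeedback:\n{critique}"
        )
    return draft
```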
I've actually been really surprised myself, working with these agentic workflows, by how well they work. Let me share one case study. My team analyzed some data using a coding benchmark called HumanEval, released by OpenAI a few years ago. It poses coding problems like: given a non-empty list of integers, return the sum of all the odd elements that are in even positions. And the answer is a code snippet like that.
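For that example problem, a correct snippet might look something like this (my own reconstruction, not necessarily the one from the slide):

```python
def solution(lst: list[int]) -> int:
    # Sum the odd elements sitting at even positions (indices 0, 2, 4, ...).
    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)
```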
Today, a lot of us use zero-shot prompting, meaning we tell the AI to write the code and run it on the first pass. Who codes like that? No human codes like that; we don't just type out the code and run it. Maybe you do; I can't do that. So it turns out that with zero-shot prompting, GPT-3.5 gets it 48% right. GPT-4, way better: 67% right. But if you take an agentic workflow and wrap it around GPT-3.5, it actually does better than even GPT-4. And if you wrap this type of workflow around GPT-4, it also does very well. Notice that GPT-3.5 with an agentic workflow actually outperforms GPT-4. I think this has significant consequences for how we all approach building applications.
So the term "agent" is tossed around a lot; there are a lot of consultant reports about agents. I'm going to be a bit concrete and share with you the broad design patterns I'm seeing in agents. It's a very messy, chaotic space: tons of research, tons of open source, a lot going on. But let me try to categorize a bit more concretely what's going on with agents. Reflection is a tool that I think many of us should just use; it just works. Tool use, I think, is more widely appreciated, and it also works pretty well. I think of these two as pretty robust technologies: when I use them, I can almost always get them to work well.
Planning and multi-agent collaboration, I think, are more emerging. When I use them, sometimes my mind is blown by how well they work, but at least at this moment in time, I don't feel like I can always get them to work reliably. So let me walk through these four design patterns in the next few slides. And if some of you go back and ask your engineers to use these, I think you'll get a productivity boost quite quickly.
So, first, reflection. Here's an example. Let's say I ask a system, please write code for me for a given task. We have a coder agent, just an LLM that you prompt to write code: you know, "def do_task(...)", write a function like that. An example of self-reflection would be if you then prompt the LLM with something like this: here is code intended for a task, followed by the exact same code it just generated, and then: check the code carefully for correctness, style, and efficiency, and give constructive criticism. It turns out the same LLM that you prompted to write the code may be able to spot problems like, there's a bug on line five, fix it by blah blah blah. And if you now take its own feedback and reprompt it with it, it may come up with a version two of the code that could well work better than the first version. It's not guaranteed, but it works often enough for this to be worth trying for a lot of applications.
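As a rough illustration, a minimal self-reflection loop might look like the sketch below, reusing the hypothetical `llm` helper from the earlier sketch; the prompts are paraphrases of what's described above, not a fixed recipe.

```python
def llm(prompt: str) -> str: ...  # hypothetical model-call helper, as before

def reflect_and_revise(task: str) -> str:
    # Step 1: the LLM writes a first version of the code.
    code = llm(f"Write code for the following task:\n{task}")
    # Step 2: feed the exact same code back and ask for a critique.
    feedback = llm(
        "Here is code intended for a task. Check it carefully for "
        f"correctness, style, and efficiency, and give constructive criticism:\n{code}"
    )
    # Step 3: reprompt with the model's own feedback to get version two.
    return llm(f"Revise this code to address the feedback.\nCode:\n{code}\nFeedback:\n{feedback}")
```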
To foreshadow tool use: if you let it run unit tests and it fails a unit test, ask it, why did you fail the unit test? Have that conversation, and maybe it can figure out why it failed, try changing something, and come up with a V3. By the way, for those of you who want to learn more about these technologies (I'm very excited about them), for each of the four sections I have a little recommended-reading section at the bottom that hopefully gives more references. And, just to foreshadow multi-agent systems: I've described this as a single coder agent that you prompt to have this conversation with itself. The natural evolution of this idea is that instead of a single coder agent, you can have two agents, where one is a coder agent and the second is a critic agent. These could be the same base LLM that you prompt in different ways: you tell one, you're an expert coder, write code; you tell the other, you're an expert code reviewer, review this code. This kind of workflow is actually pretty easy to implement, and I think it's a very general-purpose technique that, for a lot of workflows, would give you a significant boost in the performance of LLMs.
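Here's a small sketch of that coder-plus-critic idea combined with the unit-test feedback just mentioned. The `llm` helper (as before) and the `run_tests` stub are hypothetical; the point is two differently prompted roles sharing one loop.

```python
def llm(prompt: str) -> str: ...  # hypothetical model-call helper, as before

def run_tests(code: str) -> str:
    # Stub: run the candidate code against your unit tests and return
    # a failure report, or an empty string if everything passes.
    raise NotImplementedError

def coder_critic_loop(task: str, rounds: int = 3) -> str:
    code = llm(f"You are an expert coder. Write code for this task:\n{task}")
    for _ in range(rounds):
        failures = run_tests(code)
        if not failures:
            break  # all unit tests pass
        # The "critic" agent: the same base LLM prompted as a reviewer.
        review = llm(
            "You are an expert code reviewer. This code fails these unit tests:\n"
            f"{failures}\nExplain why, and suggest fixes:\n{code}"
        )
        code = llm(f"Revise the code per this review.\nCode:\n{code}\nReview:\n{review}")
    return code
```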
The second design pattern is tool use. Many of you will already have seen LLM-based systems using tools. On the left is a screenshot from Copilot; on the right is something that I kind of extracted from GPT-4.
An LLM today, if you ask it something like, what's the best coffee maker, will do a web search for some problems. LLMs will also generate code and run the code. And it turns out there are a lot of different tools that many different people are using for analysis, for gathering information, for taking action, for personal productivity.
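Tool use is commonly implemented by describing the available functions to the model and letting it emit a structured call that your code then executes. Here's a minimal sketch under that assumption; the tool names and the JSON protocol are illustrative, not any particular vendor's API.

```python
import json

def llm(prompt: str) -> str: ...  # hypothetical model-call helper, as before

# Illustrative tools an agent might be given.
def web_search(query: str) -> str: ...
def run_code(source: str) -> str: ...

TOOLS = {"web_search": web_search, "run_code": run_code}

def agent_step(user_request: str) -> str:
    # Ask the LLM to answer directly or request a tool call as JSON,
    # e.g. {"tool": "web_search", "arg": "best coffee maker reviews"}.
    reply = llm(
        "Answer the request, or call one of these tools by replying with "
        f'JSON of the form {{"tool": ..., "arg": ...}}: {list(TOOLS)}.\n'
        f"Request: {user_request}"
    )
    try:
        call = json.loads(reply)
        result = TOOLS[call["tool"]](call["arg"])
        return llm(f"Tool result:\n{result}\nNow answer: {user_request}")
    except (ValueError, KeyError):
        return reply  # the model answered directly
```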
It turns out a lot of the early work in tool use came out of the computer vision community, because before the large multimodal models, LLMs couldn't do anything with images, so the only option was to generate a function call to a model that could manipulate an image, like generating an image or doing object detection or whatever. If you actually look at the literature, it's interesting how much of the work in tool use seems to have originated from vision, because LLMs were blind to images before GPT-4V and LLaVA and so on. So that's tool use, and it expands what an LLM can do. And then, planning. For those of you who have not yet played a lot with planning algorithms: a lot of people talk about the ChatGPT moment, where you go, wow, I've never seen anything like this.
I think many people who are not yet used to AI agents will have a similar moment with planning, where you go, wow, I couldn't imagine an AI agent doing this. I've run live demos where something failed and the AI agent rerouted around the failure. I've actually had quite a few of those moments where I went, wow, I can't believe my AI system just did that autonomously. Here's one example that I adapted from the HuggingGPT paper. You say: please generate an image where a girl is reading a book, and her pose is the same as the boy in the image example.jpg; then please describe the new image with your voice.
Given an example like this, today with AI agents, it can kind of decide: the first thing I need to do is determine the pose of the boy, then find the right model, maybe on Hugging Face, to extract the pose. Then I need to find a pose-to-image model to synthesize a picture of a girl following the instructions, then use image-to-text, and finally use text-to-speech. And today we actually have agents that, I don't want to say work reliably (they're kind of finicky, they don't always work), but when they work, it's actually pretty amazing.
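Sketched in code, that HuggingGPT-style plan might look like this. The step names and the `call_model` helper are hypothetical; the idea is that the LLM first emits a plan, and a runtime then dispatches each step to a suitable model.

```python
def call_model(task: str, **inputs):
    # Stub: pick a suitable model for the named task (e.g. from a
    # model hub such as Hugging Face) and run it on the inputs.
    raise NotImplementedError

def girl_reading_like_boy(image_path: str):
    # The kind of plan an LLM might produce for the request above.
    pose = call_model("pose-detection", image=image_path)
    girl = call_model("pose-to-image", pose=pose,
                      prompt="a girl reading a book")
    caption = call_model("image-to-text", image=girl)
    return call_model("text-to-speech", text=caption)  # spoken description
```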
And with agentic loops, you can sometimes recover from earlier failures as well. So I find myself already using research agents for some of my work, where I want a piece of research done but don't feel like googling it myself and spending a long time on it. I send the task to a research agent, come back a few minutes later, and see what it has come up with. It sometimes works and sometimes doesn't, right? But it's already part of my personal workflow.
The final design pattern is multi-agent collaboration. This is one of those funny things, but it works much better than you might think. On the left is a screenshot from a paper called ChatDev, which is completely open source; it runs on my laptop. Many of you saw the flashy social media announcements of Devin; ChatDev is open source. What ChatDev does is an example of a multi-agent system, where you prompt one LLM to sometimes act like the CEO of a software engineering company, sometimes like a designer, sometimes like a product manager, sometimes like a tester. And this flock of agents, which you build by prompting an LLM to tell it, you're now a CEO, you're now a software engineer, collaborates and has an extended conversation, so that if you tell it, please develop a game, develop a Gomoku game, they'll actually spend a few minutes writing code, testing it, iterating, and then generating surprisingly complex programs. It doesn't always work; I've used it, and sometimes it doesn't work and sometimes it's amazing, but this technology is really getting better. And one more design pattern: it turns out that multi-agent debate, where you have different agents, say ChatGPT and Gemini, debate each other, actually results in better performance as well. So getting multiple similar AI agents to work together is a powerful design pattern as well.
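A bare-bones version of that debate pattern might look like the sketch below, assuming two hypothetical model-call helpers, `llm_a` and `llm_b` (two different models, or one base model under two personas).

```python
def llm_a(prompt: str) -> str: ...  # hypothetical helper for model A
def llm_b(prompt: str) -> str: ...  # hypothetical helper for model B

def debate(question: str, rounds: int = 2) -> str:
    answer_a = llm_a(question)
    answer_b = llm_b(question)
    for _ in range(rounds):
        # Each agent sees the other's answer, critiques it, and revises its own.
        answer_a = llm_a(f"Question: {question}\nAnother agent answered:\n"
                         f"{answer_b}\nCritique it, then give your improved answer.")
        answer_b = llm_b(f"Question: {question}\nAnother agent answered:\n"
                         f"{answer_a}\nCritique it, then give your improved answer.")
    # A final step merges the two answers or picks the better one.
    return llm_a(f"Given these two answers, produce the best final answer.\n"
                 f"1) {answer_a}\n2) {answer_b}")
```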
So, just to summarize: these are the four design patterns I've seen, and I think that if we use them in our work, a lot of us can get a productivity boost quite quickly. I think agentic reasoning design patterns are going to be important. This is my one-slide summary: I expect the set of tasks that AI can do to expand dramatically this year because of agentic workflows.
One thing that's actually difficult for people to get used to: when we prompt an LLM, we want a response right away. In fact, a decade ago, when I was having discussions at Google on big-box search, where you type a long prompt, one of the reasons I failed to push successfully for that was that when you do a web search, you want a response back in half a second. That's just human nature; we like that instant feedback. But for a lot of agentic workflows, I think we need to learn to delegate a task to an AI agent and patiently wait minutes, maybe even hours, for a response. I've seen a lot of novice managers delegate something to someone and then check in five minutes later, and that's not productive. It will be difficult, but we need to learn to do the same with some of our AI agents as well. I see some laughs.
Then, one important trend: fast token generation matters, because with these agentic workflows we're iterating over and over, so the LLM is generating tokens for an LLM to read. Being able to generate tokens way faster than any human could read them is fantastic. And I think that generating more tokens really quickly from even a slightly lower-quality LLM might give better results than slower tokens from a better LLM. Maybe it's a little bit controversial, but it may let you go around this loop a lot more times, kind of like the results I showed with GPT-3.5 and an agentic workflow on the earlier slide.
And candidly, I'm really looking forward to Claude 4 and GPT-5 and Gemini 2.0 and all the other wonderful models many of you are building. Part of me feels that if you're looking forward to running your thing on GPT-5 zero-shot, you may be able to get closer to that level of performance on some applications than you might think by using agentic reasoning on an earlier model. I think this is an important trend. And honestly, the path to AGI feels like a journey rather than a destination, but I think this sort of agentic workflow could help us take a small step forward on that very long journey. Thank you.