What's next for AI agentic workflows ft. Andrew Ng of AI Fund
Published 2024-03-26 06:18:13
Transcript
All of you know Andrew Ng as a famous computer science professor at Stanford who was really early on in the development of neural networks with GPUs. He is, of course, a creator of Coursera and of popular courses like DeepLearning.AI, and also the founder and early lead of Google Brain. But one thing I've always wanted to ask you, Andrew, before I hand it over to you on stage, is a question I think would be relevant to the whole audience. Ten years ago, on problem set number two of CS229, you gave me a B. I looked it over, and I was wondering what you saw that I did incorrectly. So anyway, Andrew. Thank you, Hansine.
Looking forward to sharing with all of you what I'm seeing with AI agents, which I think is an exciting trend that everyone building in AI should pay attention to. I'm also excited about all the other What's Next presentations. So, AI agents. Today, the way most of us use large language models is like this: we have a non-agentic workflow where you type a prompt and it generates an answer. That's a bit like asking a person to write an essay on a topic and saying, please sit down at the keyboard and type the essay from start to finish without ever using backspace. And despite how hard this is, LLMs do it remarkably well.
In contrast, with an agentic workflow, it may look like this: have the LLM write an essay outline, then ask, do you need to do any web research? If so, let's do that. Then write the first draft, then read your own first draft and think about which parts need revision, then revise your draft, and you go on and on. This workflow is much more iterative: you may have the LLM do some thinking, then revise the article, then do some more thinking, and iterate through this a number of times. And what not many people appreciate is that this delivers remarkably better results.
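As a concrete illustration, here is a minimal sketch of that iterative essay loop in Python. The llm() helper and the prompt wording are hypothetical stand-ins for whatever chat-completion API you use, not anything from the talk:

```python
def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your LLM, return its text."""
    raise NotImplementedError("wire this up to your LLM provider")

def write_essay(topic: str, n_revisions: int = 2) -> str:
    outline = llm(f"Write an outline for an essay on: {topic}")
    draft = llm(f"Write a first draft of an essay on '{topic}' "
                f"following this outline:\n{outline}")
    for _ in range(n_revisions):
        # Read your own first draft and think about what needs revision...
        critique = llm(f"Read this draft and list the parts that need "
                       f"revision and why:\n{draft}")
        # ...then revise, and iterate.
        draft = llm(f"Revise this draft to address the feedback.\n"
                    f"Draft:\n{draft}\nFeedback:\n{critique}")
    return draft
```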
Working with these agentic workflows, I've actually really surprised myself with how well they perform. Let me do one case study. My team analyzed some data using a coding benchmark called HumanEval, a benchmark released by OpenAI a few years ago. It poses coding problems like: given a non-empty list of integers, return the sum of all the odd elements that are in even positions. And it turns out the answer is a code snippet like the one below.
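For reference, that problem appears in HumanEval (problem 121, function name solution), and a passing solution is a short snippet like this:

```python
def solution(lst):
    """Given a non-empty list of integers, return the sum of all of
    the odd elements that are in even positions (HumanEval/121)."""
    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)

assert solution([5, 8, 7, 1]) == 12   # 5 (position 0) + 7 (position 2)
assert solution([3, 3, 3, 3, 3]) == 9
```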
So today, a lot of us will use zero-shot prompting, meaning we tell the AI to write the code and run it on the first pass. Like, who codes like that? No human codes like that: just type out the code and run it. Maybe you do; I can't do that. So it turns out that if you use GPT-3.5 with zero-shot prompting, it gets it 48% right. GPT-4, way better: 67% right. But if you take an agentic workflow and wrap it around GPT-3.5, it actually does better than even GPT-4. And if you were to wrap this type of workflow around GPT-4, it also does very well. You'll notice that GPT-3.5 with an agentic workflow actually outperforms GPT-4, and I think this has a significant consequence for how we all approach building applications.
So "agents": the term is tossed around a lot, and there are a lot of consultant reports about it. I'm going to be a bit concrete and share with you the broad design patterns I'm seeing in agents. It's a very messy, chaotic space, with tons of research and tons of open source; there's a lot going on. But let me try to categorize a bit more concretely what's going on with agents. Reflection is a tool that I think many of us should just use; it just works. Tool use, I think, is more widely appreciated, and it works pretty well. I think of these two as pretty robust technologies: when I use them, I can almost always get them to work well.
Planning and multi-agent collaboration, I think, are more emerging: when I use them, sometimes my mind is blown by how well they work, but at least at this moment in time, I don't feel like I can always get them to work reliably. So let me walk through these four design patterns in a few slides, and if some of you go back and use these yourselves, or ask your engineers to use them, I think you'll get a productivity boost quite quickly.
So, reflection. Here's an example. Let's say I ask a system, please write code for me for a given task. Then we have a coder agent, just an LLM that you prompt to write code: you know, def do_task, write a function like that. An example of self-reflection would be if you then prompt the LLM with something like, "here's code intended for a task," give it back the exact same code it just generated, and then say, check the code carefully for correctness, style, and efficiency, and give constructive criticism. It turns out the same LLM that you prompted to write the code may be able to spot problems like, there's a bug on line five, fix it by blah blah blah. And if you now take its own feedback, give it back, and reprompt it, it may come up with a version two of the code that could well work better than the first version. Not guaranteed, but it works often enough to be worth trying for a lot of applications.
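A minimal sketch of that self-reflection loop, reusing the hypothetical llm() helper stubbed earlier; the critique prompt paraphrases the one described in the talk:

```python
def reflect_on_code(task: str, n_rounds: int = 1) -> str:
    """Self-reflection: one LLM writes code, critiques it, and revises it."""
    # llm() is the hypothetical chat-completion helper stubbed earlier.
    code = llm(f"Write Python code for this task:\n{task}")
    for _ in range(n_rounds):
        # Feed the model its own output and ask for constructive criticism.
        critique = llm("Here's code intended for a task. Check it carefully "
                       "for correctness, style, and efficiency, and give "
                       f"constructive criticism:\n{code}")
        # Reprompt with its own feedback to get version two.
        code = llm(f"Revise the code to address this feedback.\n"
                   f"Code:\n{code}\nFeedback:\n{critique}")
    return code
```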
To foreshadow tool use: if you let it run unit tests and it fails one, ask it why it failed the unit test. Have that conversation, and maybe it can figure out why it failed, try changing something, and come up with a V3. By the way, for those of you who want to learn more about these technologies, I'm very excited about them, and for each of the four sections I have a little recommended-reading section at the bottom that hopefully gives more references. And again, just to foreshadow multi-agent systems: I've described this as a single coder agent that you prompt into having this conversation with itself. The natural evolution of the idea is that instead of a single coder agent, you can have two agents, where one is a coder agent and the second is a critic agent. These could be the same base LLM, just prompted in different ways: to one you say, you're an expert coder, write code; to the other, you're an expert code reviewer, review this code. This kind of workflow is actually pretty easy to implement, and I think it's a very general-purpose technique that can give you a significant boost in the performance of LLMs; a sketch of the two-agent loop follows.
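Here is a sketch of that coder/critic pairing, folding in the unit-test feedback mentioned above. The test harness is a naive illustration (a real one would sandbox the generated code), and llm() is the same hypothetical helper as before:

```python
import os
import subprocess
import tempfile

def run_tests(code: str, tests: str) -> str:
    """Naively run generated code plus its unit tests; return the output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=30)
        return result.stdout + result.stderr
    finally:
        os.unlink(path)

def coder_reviewer(task: str, tests: str, n_rounds: int = 2) -> str:
    # Same base model, two different role prompts.
    code = llm(f"You are an expert coder. Write Python code for: {task}")
    for _ in range(n_rounds):
        report = run_tests(code, tests)
        # The critic agent sees both the code and its test output.
        review = llm("You are an expert code reviewer. Review this code, "
                     f"given its test output.\nCode:\n{code}\nTests:\n{report}")
        code = llm("You are an expert coder. Revise the code to address "
                   f"this review.\nReview:\n{review}\nCode:\n{code}")
    return code
```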
The second design pattern is tool use. Many of you will already have seen LLM-based systems using tools. On the left is a screenshot from Copilot; on the right is something I kind of extracted from GPT-4.
With LLMs today, if you ask, what's the best coffee maker, it can do a web search; for some problems, LLMs will generate code and run the code. And it turns out there are a lot of different tools that many different people are using: for analysis, for gathering information, for taking actions, for personal productivity.
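The mechanics underneath are simple: describe the available tools in the prompt, let the model emit a structured call, execute it, and feed the result back. A minimal sketch under those assumptions, using an illustrative JSON convention rather than any particular provider's function-calling API:

```python
import json

def web_search(query: str) -> str:
    raise NotImplementedError("wire this up to a search API")

def run_python(code: str) -> str:
    raise NotImplementedError("wire this up to a sandboxed interpreter")

TOOLS = {"web_search": web_search, "run_python": run_python}

def answer_with_tools(question: str) -> str:
    # llm() is the hypothetical chat-completion helper stubbed earlier.
    reply = llm(
        "Answer the question directly, or call a tool by replying with "
        'JSON like {"tool": "web_search", "arg": "..."}.\n'
        f"Available tools: {list(TOOLS)}\nQuestion: {question}"
    )
    try:
        call = json.loads(reply)
        observation = TOOLS[call["tool"]](call["arg"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return reply  # the model chose to answer directly
    # Feed the tool's result back to the model for the final answer.
    return llm(f"Question: {question}\nTool result: {observation}\n"
               "Now answer the question using the tool result.")
```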
It turns out a lot of the early work in tool use was done in the computer vision community, because before large multimodal models, LLMs couldn't do anything with images, so the only option was to generate a function call that could manipulate an image: generate an image, do object detection, or whatever.
If you actually look at the literature, it's interesting how much of the work in tool use seems to have originated from vision, because LLMs were blind to images before GPT-4V, LLaVA, and so on. So that's tool use: it expands what an LLM can do. And then planning. For those of you who have not yet played a lot with planning algorithms, I feel like a lot of people talk about the ChatGPT moment where you go, wow, I've never seen anything like this.
I think if you're not yet used to planning algorithms, many of you will have a kind of AI-agent wow moment: wow, I couldn't imagine an AI agent doing this. I've run live demos where something failed and the AI agent rerouted around the failure. I've actually had quite a few of those moments where I went, wow, I can't believe my AI system just did that autonomously. Here's one example that I adapted from the HuggingGPT paper. You say: please generate an image where a girl is reading a book, and her pose is the same as the boy in the image example.jpg, then please describe the new image with your voice.
Given an example like this, today with AI agents you can kind of decide: the first thing I need to do is determine the pose of the boy, then find the right model, maybe on Hugging Face, to extract the pose. Then, find a pose-to-image model to synthesize a picture of a girl following the instructions, then use image-to-text, and finally use text-to-speech. Today we actually have agents that, I don't want to say work reliably; they're kind of finicky and they don't always work, but when they do work, it's actually pretty amazing.
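A sketch of what such a planner-executor could look like: the LLM emits the plan as structured steps and a harness runs the chosen model for each one. The task types mirror the example above; llm() and call_model() are hypothetical stand-ins:

```python
import json

PLAN_PROMPT = (
    "Decompose the request into a JSON list of steps, each of the form "
    '{"task": ..., "model": ..., "input": ...}. Allowed task types: '
    "pose-detection, pose-to-image, image-to-text, text-to-speech.\n"
    "Request: "
)

def call_model(name: str, model_input):
    """Hypothetical dispatcher: load the named model (e.g. from
    Hugging Face) and run it on the given input."""
    raise NotImplementedError

def execute_plan(request: str):
    # The LLM acts as the planner; llm() is the helper stubbed earlier.
    steps = json.loads(llm(PLAN_PROMPT + request))
    artifact = None
    for step in steps:
        # Each step's output becomes the next step's input.
        artifact = call_model(step["model"], step.get("input") or artifact)
    return artifact
```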
But with agentic workflows, sometimes you can recover from earlier failures as well. So I find myself already using research agents for some of my work: when I want a piece of research but don't feel like googling it myself and spending a long time on it, I send it to the research agent, come back a few minutes later, and see what it has come up with. Sometimes it works, sometimes it doesn't, right? But it's already part of my personal workflow.
The final design pattern is multi-agent collaboration. This is one of those funny things that works much better than you might think. On the left is a screenshot from a paper called ChatDev, which is completely open source; it runs on my laptop. Many of you saw the flashy social media announcements of Devin; ChatDev is open source. What ChatDev does is an example of a multi-agent system: you prompt one LLM to sometimes act like the CEO of a software engineering company, sometimes like a designer, sometimes like a product manager, sometimes like a tester. You build this flock of agents by prompting an LLM and telling it, you're now a CEO; you're now a software engineer.
They collaborate and have an extended conversation, so that if you tell it, please develop a game, develop a gomoku game, they'll actually spend a few minutes writing code, testing it, iterating, and generating surprisingly complex programs. It doesn't always work; I've used it, and sometimes it doesn't work and sometimes it's amazing, but this technology is really getting better. And one more design pattern: it turns out that multi-agent debate, where you have different agents (for example, ChatGPT and Gemini debating each other) actually results in better performance as well. So getting multiple similar AI agents to work together is a powerful design pattern too.
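In that spirit, here is a toy sketch of the pattern: one base LLM prompted into different roles, taking turns on a shared transcript. The roles and turn order are illustrative, not ChatDev's actual pipeline:

```python
# Roles and turn order below are made up for illustration.
ROLES = [
    ("CEO",        "You are the CEO. Restate the requirements concisely."),
    ("Programmer", "You are a software engineer. Write the code."),
    ("Tester",     "You are a tester. Point out bugs and missing cases."),
    ("Programmer", "You are a software engineer. Fix the reported issues."),
]

def build_product(spec: str) -> str:
    """Multi-agent collaboration: role-prompted agents share one transcript."""
    transcript = f"Project spec: {spec}"
    for name, system_prompt in ROLES:
        # llm() is the hypothetical chat-completion helper stubbed earlier.
        reply = llm(f"{system_prompt}\nConversation so far:\n{transcript}")
        transcript += f"\n\n[{name}]\n{reply}"
    return transcript
```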
So, just to summarize: I think these are the patterns I've seen, and if we use these patterns in our work, a lot of us can get a productivity boost quite quickly. I think agentic reasoning design patterns are going to be important. This is my one-slide summary: I expect the set of tasks AI can do to expand dramatically this year because of agentic workflows.
One thing that's actually difficult for people to get used to: when we prompt an LLM, we want a response right away. In fact, a decade ago, when I was having discussions at Google about what we called big-box search, typing in long prompts, one of the reasons I failed to push that through successfully was that when you do a web search, you want a response back in half a second. That's just human nature: we like that instant feedback. But for a lot of agentic workflows, I think we need to learn to delegate a task to an AI agent and patiently wait minutes, maybe even hours, for a response. It's just like how I've seen a lot of novice managers delegate something to someone and then check in five minutes later, which isn't productive. I think this will be difficult, but we need to learn to do it with some of our AI agents as well. I saw some laughs there.
And then one important trend: fast token generation matters, because with these agentic workflows we're iterating over and over, so the LLM is generating tokens for an LLM to read. Being able to generate tokens much faster than any human could read them is fantastic. And I think that generating more tokens really quickly, even from a slightly lower-quality LLM, might give better results than slower tokens from a better LLM. That may be a little bit controversial, but it's because it may let you go around this loop a lot more times, kind of like the results I showed with GPT-3.5 and an agentic architecture on the first slide.
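A back-of-the-envelope sketch of that tradeoff, with made-up throughput numbers: within the same wait time, a faster model gets several trips around the agentic loop where a slower one gets just one:

```python
# All numbers below are made up for illustration only.
budget_s = 60            # how long we're willing to wait
tokens_per_pass = 1_000  # rough draft-plus-critique cost of one loop

for name, tok_per_s in [("bigger, slower model", 30),
                        ("smaller, faster model", 120)]:
    passes = budget_s * tok_per_s // tokens_per_pass
    print(f"{name}: ~{passes} passes around the loop in {budget_s}s")
# -> bigger, slower model: ~1 pass; smaller, faster model: ~7 passes
```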
And candidly, I'm really looking forward to Claude 4 and GPT-5 and Gemini 2.0 and all the other wonderful models many of you are building. Part of me feels that if you're looking forward to running your application on GPT-5 zero-shot, you may be able to get closer to that level of performance on some applications than you might think by using agentic reasoning on an earlier model. I think this is an important trend. And honestly, the path to AGI feels like a journey rather than a destination, but I think this kind of agentic workflow could help us take a small step forward on that very long journey. Thank you.