Tips for building AI agents
Published: 2025-02-13 12:21:22
The video features Alex from Claude Relations at Anthropic along with Eric and Barry from the Research and Applied AI teams, respectively, diving into the intricacies of building effective AI agents. They elaborate on a recent blog post on the topic, distinguishing between workflows and true agents, and offer practical advice for developers venturing into agent development.
Eric begins by clarifying the definition of an agent, differentiating it from a simple workflow. While many loosely apply the term “agent” to any system involving multiple LLM (Large Language Model) calls, the team clarifies that an agent is characterized by its autonomy. It operates through iterative loops, guided by the LLM's decision-making, until a task is resolved. The number of steps is not predetermined, contrasting sharply with workflows, which follow a fixed, pre-defined path. An agent adapts its approach based on the situation, making it suitable for tasks like customer support or code iteration, where the resolution process is variable.
Barry explains that the distinction between workflows and agents emerged as models became more sophisticated. Initially, single LLMs were used, evolving into systems utilizing multiple LLMs orchestrated in code. This progression revealed two distinct patterns: workflows, which are heavily coded and orchestrated, and agents, which are simpler yet possess a different kind of complexity. The rising capabilities of models prompted the team to formally define the term “agent” and differentiate it from workflows.
In practical terms, Eric explains that the difference manifests at the prompt level. A workflow prompt is structured sequentially, with the output of one prompt feeding into the next in a linear fashion. Each prompt has a specific purpose, such as categorizing a user question. In contrast, an agent prompt is open-ended, providing the model with a variety of tools and resources, such as web search or code editing, to achieve a goal.
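The contrast Eric describes can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's implementation: `llm` stands in for any function mapping a prompt string to a completion, and the action protocol (a `DONE` prefix, space-separated tool calls) is invented for the example.

```python
def workflow(llm, user_question):
    """Fixed chain: prompt A -> prompt B -> prompt C, always three steps."""
    category = llm(f"Classify this question: {user_question}")
    draft = llm(f"Draft a '{category}' answer to: {user_question}")
    return llm(f"Polish this draft: {draft}")

def agent(llm, tools, goal, max_steps=20):
    """Open-ended loop: the model chooses tools until it declares it is done.

    The number of iterations is not known in advance; the model's own
    decisions drive the loop, which is the distinction drawn above.
    """
    transcript = [goal]
    for _ in range(max_steps):
        action = llm("\n".join(transcript))
        if action.startswith("DONE "):        # model decided it is finished
            return action[len("DONE "):]
        name, _, arg = action.partition(" ")  # e.g. "search latest docs"
        transcript.append(f"{name}: {tools[name](arg)}")
    raise RuntimeError("agent exhausted its step budget")
```

Note that the workflow's step count is fixed in code, while the agent's loop terminates only when the model says so (or a safety budget runs out).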
Barry shares a humorous anecdote from his onboarding experience, when he was tasked with running OSWorld, a computer-use benchmark. Faced with counterintuitive agent behavior, he and a colleague simulated the model's limited perspective by closing their eyes and briefly glimpsing the screen, mimicking the model's input. This exercise highlighted the importance of empathy and of providing ample context to the model.
Eric emphasizes the need to consider the model's perspective when designing tools. Developers often create beautiful, detailed prompts but neglect to adequately document the tools provided to the model, which causes difficulty down the line. He stresses that a tool needs good documentation for the model, just as it would for a human engineer.
The conversation then transitions to the current state of agent technology, addressing both its overhyped and underhyped aspects. Eric suggests that the underhyped aspect is the automation of tasks, even small ones, that save people time. Automating these tasks can change the dynamics of how often these things can be done. Barry points out the difficulty in calibrating where agents are truly needed, identifying a "sweet spot" of valuable, complex tasks where the cost of error is relatively low. He cites coding and search as examples.
Barry explains the potential of coding agents, highlighting their verifiability through unit tests. The success of coding agents depends on the quality of the unit tests used to provide feedback to the model. Eric agrees and suggests that improving agent performance will come back to verification. They propose focusing on ways to add tests for the things that you really care about so that the model itself can test this and know whether it's right or wrong before it goes back to the human.
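The verification loop they describe could look roughly like this. It is a sketch under assumptions: `llm` and `run_tests` are placeholder interfaces (any model call, and any test runner returning a pass flag plus a failure report), not a specific product's API.

```python
def coding_agent(llm, run_tests, task, max_attempts=5):
    """Propose code, run the tests, feed failures back until they pass.

    run_tests(code) is assumed to return (passed: bool, report: str).
    Each failing report is injected as new signal for the next attempt;
    without such feedback the loop has no reason to converge.
    """
    feedback = "none yet"
    for attempt in range(max_attempts):
        code = llm(f"Task: {task}\nPrevious test failures: {feedback}")
        passed, report = run_tests(code)
        if passed:              # verified before going back to the human
            return code
        feedback = report
    raise RuntimeError(f"no passing solution after {max_attempts} attempts")
```

As the speakers note, the quality of the signal here is bounded by the quality of the tests: the loop only "knows" what the tests check.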
Looking ahead to 2025, Barry envisions multi-agent environments, where multiple AI agents interact and coordinate. He mentions an experiment where multiple "Claude" models play a text-based social deduction game, "Werewolf," to explore agent interaction. While single agents have yet to show many successful applications in production, multi-agent systems are a potential extension over the next couple of generations of models. Eric predicts increased business adoption of agents to automate repetitive tasks. He expresses skepticism about consumer-facing agents for complex tasks like vacation planning, due to the difficulty of specifying preferences and the high risk of errors.
Finally, the speakers offer advice for developers interested in agent development. Eric advises focusing on measurable results to get feedback about whether what is being built is working. Barry recommends starting as simple as possible and building complexity gradually. They both emphasize the importance of building something that can improve as models get smarter.
Summary
Anthropic’s Barry Zhang (Applied AI), Erik Schluntz (Research), and Alex Albert (Claude Relations) discuss the potential of AI agents, common pitfalls to avoid, and how to prepare for the evolving landscape.
Read more advice on building agents: https://www.anthropic.com/research/building-effective-agents
00:00 Introduction
00:26 Defining AI agents and workflows
02:55 Anatomy of an agent prompt
04:29 Behind the scenes stories
07:29 Why write about agents now
08:53 Overhyped and underhyped aspects of agents
09:57 Identifying useful applications of agents
10:47 Coding agents: Potential and challenges
12:47 The future of agents in 2025
16:26 Advice for developers exploring agents
Transcript
I feel like agents for consumers are fairly hyped. Right. Here we go. Hot take. Trying to have an agent fully book a vacation for you is almost just as hard as just going and booking it yourself. Today we're going behind the scenes on one of our recent blog posts, Building Effective Agents. I'm Alex. I lead Claude Relations here at Anthropic. I'm Eric. I'm on the research team at Anthropic. I'm Barry. I'm on the Applied AI team. I'm going to kick us off here for viewers just jumping in.
What's the quick version of what an agent actually is? I mean, there's a million definitions of it. And why should a developer or somebody that's actually building with AI care about these things? Eric, maybe we can start with you. Sure. Yeah. So I think something we explored in the blog post is that, first of all, a lot of people have been saying everything is an agent, referring to almost anything more than just a single LLM call. One of the things we tried to do in the blog post is really kind of separate this out of like, hey, there's workflows, which is where you have a few LLM calls chained together.
And really, what we think an agent is is where you're letting the LLM decide sort of how many times to run. You're having it continuing to loop until it's found a resolution. And that could be talking to a customer for customer support. That could be iterating on code changes. But something where you don't know how many steps it's going to take to complete, that's really sort of what we consider an agent. Interesting. So in the definition of an agent, we are letting the LLM kind of pick its own fate and decide what it wants to do, what actions to take, instead of us predefining a path for it.
Exactly. It's more autonomous. Whereas a workflow, you can kind of think of it as being on rails through a fixed number of steps. I see. So this distinction, I assume this was the result of many, many conversations with customers, working with different teams, and even trying things out yourselves. Barry, can you speak more to what that looked like as we got to create this divide between a workflow and an agent, and what sort of patterns surprised you the most as you were going through this? Sure. Honestly, I think all of this kind of evolved as models got better and teams got more sophisticated.
We've both worked with a large number of sophisticated customers. And we kind of went from having a single LLM to having a lot of LLMs and eventually having LLMs orchestrating themselves. So one of the reasons why we decided to create this distinction is because we started to see these two distinct patterns, where you have workflows that are pretty orchestrated by code, and then you also have agents, which are simpler but complex in another sense, like a different shape that we're starting to see. Really, I think as the models and all of the tools start to get better, agents are becoming more and more prevalent and more and more capable.
And that's when we decided, hey, this is probably a time for us to give a formal definition. So in practice, if you're a developer implementing one of these things, what would that actually look like in your code as you're starting to build this? Maybe we actually go down to the prompt level here. What does an agent prompt look like, and what does a workflow look like? Yeah. So I think a workflow prompt looks like you have one prompt. You take the output of it. You feed it into prompt B. Take the output of that. Feed it into prompt C. And then you're done.
There's this straight line, fixed number of steps. You know exactly what's going to happen. And maybe you have some extra code that sort of checks the intermediate results of these and makes sure they're OK. But you kind of know exactly what's going to happen in one of these paths. And each of those prompts is sort of a very specific prompt, just sort of taking one input and transforming it into another output. For instance, maybe one of these prompts is taking in the user question and categorizing it into one of five categories so that then the next prompt can be more specific for that.
In contrast, an agent prompt will be sort of much more open-ended and usually give the model tools or multiple things to check and say, hey, here's the question. And you can do web searches or you can edit these code files or run code and keep doing this until you have the answer. I see. So there's a few different use cases there. That makes sense as we start to arrive at these different conclusions. I'm curious, as we've now kind of covered at a high level how we're thinking about these workflows and agents and talking about the blog post, I want to dive even further behind the scenes.
Were there any funny stories, Barry, of wild things that you saw from customers that were interesting or are just kind of far out there in terms of how people are starting to actually use these things in production? Yeah, this is actually from my own experience, like, viewing agents. I joined about a month before the Sonnet v2 refresh. And one of my onboarding tasks was to run OSWorld, which is a computer-use benchmark. And for a whole week, me and this other engineer, we were just staring at these agent trajectories that were counterintuitive to us. And we weren't sure why the model was making the decisions it was, given the instructions that we would give it. And so we decided we're going to act like Claude and put ourselves in that environment. So we would do this really silly thing, where we close our eyes for a whole minute. And then we blink at a screen for a second. We close our eyes again and just think, well, I have to write Python code to operate in this environment. What would I do?
It suddenly made a lot more sense. And I feel like a lot of agent design comes down to that. There's a lot of context and a lot of knowledge that the model maybe does not have. And we have to be empathetic to the model. And we have to make a lot of that clear in the prompt, in the tool descriptions, in the environment. I see. So a tip here for developers is almost like to act as if you are looking through the lens of the model itself, in terms of what would be the most applicable instructions here. It was the model seeing the world, which is very different from how we operate as humans, I guess, with additional context. Eric, I'm curious if you have any other stories that you've seen.
Yeah. I think actually, in a very similar vein, a lot of people really forget to do this. And maybe the funniest thing I see is that people will put a lot of effort into creating these really beautiful, detailed prompts. And then the tools that they make to give the model are sort of these incredibly bare-bones things, like no documentation, the parameters are named A and B. And it's kind of like, oh, an engineer wouldn't be able to work with this if this was a function they had to use, because there's no documentation.
How can you expect Claude to use it well? So it's like that lack of putting yourself in the model's shoes. And I think a lot of people, when they start trying to use tool use and function calling, they kind of forget that they have to prompt as well. And they think about the model just as a more classical programming system. But it is still a model. And you need to be prompt engineering in the descriptions of your tools themselves. Yeah, I've noticed that. It's like people forget that it's all part of the same prompt. It's all getting fed into the same prompt in the context window. And writing a good tool description influences other parts of the prompt as well.
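The under-documented-tool anti-pattern is easy to see side by side. The dicts below use the JSON-Schema style that many tool-use APIs accept; the exact envelope varies by provider, and `search_orders` is a made-up tool for illustration. The point is the contrast between parameters named A and B with no docs, and a tool described the way you would for a human engineer.

```python
# The anti-pattern described above: no description, opaque parameter names.
bare_bones_tool = {
    "name": "f",
    "description": "",
    "input_schema": {
        "type": "object",
        "properties": {"a": {"type": "string"}, "b": {"type": "integer"}},
    },
}

# The same tool with the documentation a human engineer would expect.
# Remember: these descriptions are part of the prompt the model sees.
documented_tool = {
    "name": "search_orders",
    "description": (
        "Search the order database by customer email, newest first. "
        "Returns up to `limit` orders as JSON. Use this before answering "
        "any question about a customer's order history."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_email": {
                "type": "string",
                "description": "Exact email address on the account.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum number of orders to return (default 10).",
            },
        },
        "required": ["customer_email"],
    },
}
```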
So that is one aspect to consider. Agents is kind of the big hype term right now. A lot of people are talking about it. And there's been plenty of articles written and videos made on the subject. What made you guys think that now is the right time to write something yourselves and talk a little bit more about the details of agents? Sure, yeah. I think one of the most important things for us is just to be able to explain things well. I think that's a big part of our motivation, which is we walk into customer meetings, and everything is referred to by a different term, even though they share the same shape.
So we thought it'd be really useful if we can just have a set of definitions and a set of diagrams and code to explain these things to our customers. And we are getting to the point where the model is capable of doing a lot of the agentic workflows that we're seeing. And that seems like the right time for us to have some definitions, or just to make these conversations easier. I think for me, I saw that there was a lot of excitement around agents, but also a lot of people really didn't know what it meant in practice. And so they were trying to bring agents to any problem they had, even when much simpler systems would work.
And so I saw that as one of the reasons that we should write this: to guide people about how to do agents, but also where agents are appropriate, and that you shouldn't go after a fly with a bazooka. I see. That's a perfect segue into my next question here. There's a lot of talk about the potential of agents. And every developer out there in every startup and business is trying to think about how they can build their own version of an agent for their company or product. But you guys are starting to see what actually works in production.
So we're going to play a little game here. I want to know one thing that's overhyped about Agents right now, and also one thing that's underhyped, just in terms of implementations or actual uses in production or potentials here as well. So Eric, let's start with you first. I feel like underhyped is like things that save people time, even if it's a very small amount of time. I think a lot of times if you just look at that on the surface, it's like, oh, this is something that takes me a minute. And even if you can fully automate it, it's only a minute. Like, what help is that?
But really, that changes the dynamics of now you can do that thing 100 times more than you previously would. So I think I'm most excited about things that, if they were easier, could be really scaled up. Yeah, I don't know if this is necessarily related to hype, but I think it's really difficult to calibrate right now where Agents are really needed. I think there's this intersection that's a sweet spot for using Agent, and it's a set of tasks that's valuable and complex, but also maybe the cost of error or cost of monitoring error is relatively low.
That set of tasks is not super clear and obvious, unless we actually look into the existing processes. I think coding and search are two pretty canonical examples where agents are very useful. Take search as an example. It's a really valuable task. It's very hard to do deep iterative search, but you can always trade off some precision for recall and then just get a few more documents or a little more information than needed and filter it down.
So we've seen a lot of success there with agents. So what does a coding agent look like right now? Coding agents, I think, are super exciting because they are verifiable, at least partially. Code has this great property that you can write tests for it, and then you edit the code and either the tests pass or they don't pass. Now that assumes that you have good unit tests, which I think every engineer in the world can say, like, we don't. But at least it's better than a lot of things.
There's no equivalent way to do that for many other fields. So this at least gives a coding agent some way that it can get more signal every time it goes through a loop. So if every time it's running the tests again, it's seeing what the error of the output is, that makes me think that the model can converge on the right answer by getting this feedback. And if you don't have some mechanism to get feedback as you're iterating, you're not injecting any more signal. You're just going to have noise.
And so there's no reason, without something like this, that an agent will converge to the right answer. I see. So what's the biggest blocker then in terms of improving agent performance on coding at the moment? Yeah. So I think for coding, we've seen over the last year, like on SWE-bench, results have gone from very, very low to, I think, over 50% now, which is really incredible. So the models are getting really good at writing code to solve these issues.
I feel like I have a slightly controversial take here that I think the next limiting factor is going to come back to that verification. Like it's great for these cases where we do have perfect unit tests. And that's starting to work. But for the real world cases, we usually don't have perfect unit tests for them. And so I'm thinking now, like, finding ways that we can verify and we can add tests for the things that you really care about so that the model itself can test this and know whether it's right or wrong before it goes back to the human.
I see. Making sure that we can embed some sort of right-or-wrong feedback loop into the process. OK. What does the future of agents look like in 2025? Barry, we're going to start with you. Yeah, I think that's a really difficult question. This is probably not like a practical thing. But one thing I've been really interested in is just what a multi-agent environment will look like.
I think I've already shown Eric this. It's like a building environment where a bunch of Claudes can spin up other Claudes and play Werewolf together. And it's a completely... What is Werewolf? Werewolf is a social deduction game where all of the players are trying to figure out what each other's role is. It's very similar to Mafia. It's entirely text-based, which is great for Claude to play.
I see. So we have multiple different Claudes playing different roles within this game, all communicating with each other. Yeah, exactly. And then you see a lot of interesting interactions in there that you just haven't seen before. And that's something I'm really excited about. It's very similar to how we went from single LLM to multi-LLM. I think by the end of the year, we could potentially see us going from agent to multi-agent. And there are some interesting research questions to figure out in that domain. In terms of how the agents interact with each other, what does the emergent behavior look like as you coordinate between agents doing different things? Exactly. And just whether this is actually going to be useful or better than a single agent with access to a lot more resources.
Do we see any multi-agent approaches right now that are actually working in production? I feel like in production, we haven't even seen a lot of successful single agents. OK, interesting. But this is kind of like a potential extension of successful agents with the improved capabilities of the next couple of generations of models. Yeah, so this is not advice that everyone should go explore the multi-agent environment. It's just that, I think, to understand the model's behavior, this provides us with a better way to understand model behaviors.
I see. OK, Eric, what's the future of agents in 2025? Yeah, I feel like in 2025, we're going to see a lot of business adoption of agents, starting to automate a lot of repetitive tasks and really scale up a lot of things that people wanted to do more of before but were too expensive. You can now 10x or 100x how much you do these things. I'm imagining things like every single pull request triggers a coding agent to come and update all of your documentation. Things like that would have been cost-prohibitive before. But once you think of agents as almost free, you can start adding these bells and whistles everywhere.
I think maybe something that's not going to happen yet, going back to what's overhyped: I feel like agents for consumers are fairly hyped right now. OK, here we go. Hot take. Because I think that we talked about verifiability. I think that for a lot of consumer tasks, it's almost as much work to fully specify your preferences and what the task is as to just do it yourself. And it's very expensive to verify. So trying to have an agent fully book a vacation for you, describing exactly what you want your vacation to be and your preferences, is almost just as hard as just going and booking it yourself.
Interesting. And it's very high risk. You don't want the agent to actually go book a plane flight without you first accepting it. Interesting. Is there maybe some context that we're missing here, too, where the models could infer this information about somebody without having to explicitly go ask, learning the preferences over time? Yeah, so I think that these things will get there. But first, you need to build up this context so that the model already knows your preferences and things. And I think that takes time. I see. And we'll need some stepping stones to get to bigger tasks like planning a whole vacation.
I see. OK, very interesting. Last question. Any advice that you'd give to a developer that's exploring this right now, in terms of starting to build this or just thinking about it from a general future-proofing perspective? I feel like my best advice is make sure that you have a way to measure your results. Because I've seen a lot of people go and build in a vacuum without any way to get feedback about whether what they're building is working or not. And you can end up building a lot without realizing that either it's not working or maybe something much simpler would have actually done just as good a job.
Yeah, I think very similarly: starting as simple as possible and having that measurable result as you are building more complexity into it. One thing I've been really impressed by is, I work with some really resourceful startups, and they can do everything within one LLM call. And the orchestration around the code, which will persist even as the model gets better, is their niche. And I always get very happy when I see one of those, because I think they reap the benefit of future capability improvements.
And realistically, we don't know what use case will be great for agents. And the landscape is going to shift. But it's probably a good time to start building up some of that muscle to think in the agent land just to understand that capability a little bit better.
Yeah, I think I want to double-click on something you said about being excited for the models to get better. I think that if you look at your startup or your product and think, oh man, if the models get smarter, all of our moat's going to disappear, that means you're building the wrong thing.
Instead, you should be building something so that as the models get smarter, your product gets better and better. Right. That's great advice. Eric, Barry, thank you guys. This is Building Effective Agents. Thank you. Thanks.