首页  >>  来自播客: Anthropic 更新   反馈

Tips for building AI agents

发布时间 2025-02-13 12:21:22    来源
I feel like agents for consumers are like fairly bright. Right. Here we go. Hot day. Trying to have an agent like fully book a vacation for you. Almost just as hard as just going and booking it yourself. Today we're going behind the scenes on one of our recent blog posts, Building Effective Agents. I'm Alex. I lead Claude Relations here at Anthropic. I'm Eric. I'm in the research team at Anthropic. I'm Barry. I'm on the Apply to the Eye team. I'm going to kick us off here for viewers just jumping in.
我觉得消费者的代理人都很聪明,对吧。好吧,我们开始吧。炎热的一天,试图让一个代理人帮你完全安排好度假,几乎跟你自己去预定一样困难。今天,我们要深入探讨我们最近的一篇博客文章——《打造高效代理人》。我是Alex,我在Anthropic负责Claude关系。我是Eric,我在Anthropic的研究团队。我是Barry,我在应用智能团队。对于刚加入观看的观众,我会先开始介绍。

What's the quick version of what an agent actually is? I mean, there's a million definitions of it. And why should a developer or somebody that's actually building with AI care about these things? Eric, maybe we can start with you. Sure. Yeah. So I think something we explored in the blog post is that, first of all, a lot of people have been saying everything is an agent, referring to almost anything more than just a single LLM call. One of the things we tried to do in the blog post is really kind of separate this out of like, hey, there's workflows, which is where you have a few LLM calls chained together.
代理到底是什么?这个概念有很多种定义。那么为什么开发者或者正在使用人工智能的构建者应该关心这些东西呢?Eric,也许我们可以从你开始谈谈。好的。我认为我们在博客文章中探讨的一个方面是,许多人认为几乎所有超过单个大型语言模型(LLM)调用的东西都是一个“代理”。在这篇博客文章中,我们试图将其与工作流区分开来:工作流是指将几个LLM调用链接在一起的过程。

And really, what we think an agent is is where you're letting the LLM decide sort of how many times to run. You're having it continuing to loop until it's found a resolution. And that could be talking to a customer for customer support. That could be iterating on code changes. But something where you don't know how many steps it's going to take to complete, that's really sort of what we consider an agent. Interesting. So in the definition of an agent, we are letting the LLM kind of pick its own fate and decide what it wants to do, what actions to take, instead of us predefining a path for it.
实际上,我们对“代理”的定义是指,让大型语言模型(LLM)自行决定运行多少次。也就是说,它会不断循环,直到找到一个解决方案。这种应用可以是客服中与客户对话,也可以是对代码进行多次更改。但在这些情况下,我们并不知道需要多少步才能完成任务,这正是我们所认为的“代理”的意义所在。很有趣的是,在“代理”的定义中,我们允许LLM选择自己的操作路径和采取的行动,而不是由我们预先定义好路线。

Exactly. It's more autonomous. Whereas a workflow, you can kind of think of it as like a workflow or sort of like it's on rails through a fixed number of steps. I see. So this distinction, I assume this was the result of many, many conversations with customers and working with different teams and even trying things to make it happen to yourself. Barry, can you speak more to maybe what that looks like as we got to create this divide between a workflow and agent and what sort of patterns surprised you the most as you were going through this? Sure. Honestly, I think all of this kind of evolved as model got better and teams got more sophisticated.
当然,它更加自主。工作流就像一个固定步骤的流程,可以理解为在既定轨道上运行。我明白了。这种区别,我想是通过与客户的许多对话、与不同团队的合作,甚至尝试实现自我发展得出的。Barry,你能详细说明一下在创建工作流和代理之间的区别时,我们经历了哪些惊喜的模式吗?好的。老实说,我认为随着模型的改进和团队的成熟,这一切都是逐渐演变而来的。

We both worked with a large number of customers where they're sophisticated. And we kind of went from having a single LLM to having a lot of LLMs and eventually having our own orchestrating themselves. So one of the reasons why we decided to create this distinction is because we started to see these two distinct patterns where you have workflows that's pretty orchestrated by code. And then you also have agent, which is a simpler but complex in other sense, like different shape that we're starting to see. Really, I think as the models and all of the tools start to get better, agents are becoming more and more prevalent and more and more capable.
我们都曾与大量成熟的客户合作,并且经历了从使用单个大型语言模型(LLM)到同时使用多个LLM的过程,最终发展到它们自行进行协调。因此,我们决定作出这种区分的原因之一,是因为我们开始观察到这两种不同的模式:一种是由代码紧密编排的工作流程,另一种则是代理(agent),虽然相对简单,但在其他方面也具有复杂性,是我们正在看到的一种新形态。实际上,我认为随着模型和各种工具的改进,代理变得越来越普遍,也越来越强大。

And that's when we decided, hey, this is probably a time for us to give a formal definition. So in practice, if you're a developer implementing one of these things, what would that actually look like in your code as you're starting to build this, the differences between, maybe we actually go down to the prompt level here. What does an agent prompt to look like or flow and what does a workflow look like? Yeah. So I think a workflow prompt looks like you have one prompt. You take the output of it. You feed it into prompt B. Take the output of that. Feed it into prompt C. And then you're done.
这时我们决定,是时候给出一个正式的定义了。那么在实际操作中,作为一个开发者,如果你正在实现这些内容,你的代码会是什么样子呢?可能我们得细化到提示级别。一个代理的提示或流程是什么样的?而一个工作流程又是什么样的?我认为,一个工作流程的提示看起来是这样的:首先有一个提示A,你获取它的输出,然后将输出输入到提示B中,获取提示B的输出,再输入到提示C中,最后完成。

There's this straight line, fixed number of steps. You know exactly what's going to happen. And maybe you have some extra code that sort of checks the intermediate results of these and makes sure they're OK. But you kind of know exactly what's going to happen in one of these paths. And each of those prompts is sort of a very specific prompt, just sort of taking one input and transforming it into another output. For instance, maybe one of these prompts is taking in the user question and categorizing it into one of five categories so that then the next prompt can be more specific for that.
有一条明确的路线,固定的步骤数。你确切知道会发生什么。也许你还会有一些额外的代码来检查中间结果,确保它们没有问题。但你大致知道在这些路径中的某一条上会发生什么。每一个提示都是非常具体的提示,只是接受一个输入并将其转换为另一个输出。例如,可能有一个提示是接收用户的问题,并将其分类为五个类别之一,以便下一个提示可以更具体地处理它。

In contrast, an agent prompt will be sort of much more open-ended and usually give the model tools or multiple things to check and say, hey, here's the question. And you can do web searches or you can edit these code files or run code and keep doing this until you have the answer. I see. So there's a few different use cases there. That makes sense as we start to arrive at these different conclusions. I'm curious, as we've now kind of covered at a high level how we're thinking about these workflows and agents and talking about the blog post, I want to dive even further behind the scenes.
与此相反,代理提示通常会更开放,通常会给模型提供工具或多种选项进行检查,并说:“嘿,这是问题。” 你可以进行网络搜索,编辑这些代码文件或运行代码,并不断这样做直到找到答案。 我明白了,所以有几种不同的使用场景。在我们开始得出这些不同结论时,这就说得通了。我很好奇,既然我们已经在较高层次上讨论了这些工作流程和代理,以及正在谈论的博客文章,我想更深入地了解幕后情况。

Were there any funny stories, Barry, of wild things that you saw from customers that were interesting or are just kind of far out there in terms of how people are starting to actually use these things in production? Yeah, this is actually from my own experience, like, viewing agents. I joined about a month before the Son of V2 refresh. And one of my onboarding tasks was to run OS World, which was a computer user benchmark. And for a whole week, me and this other engineer, we were just staring at these agent trajectories that were counterintuitive to us. And then we weren't sure why the model was making the decision. You was given the instructions that we would give it. And so we decided we're going to act like cloud and put ourselves in that environment. So we would do this really silly thing, where we close our eyes for a whole minute. And then we blink at a screen for a second. We close our eyes again and just think, well, I have to write Python code to operate in this environment. What would I do?
有没有什么有趣的故事,Barry,比如你见过的客户做过的疯狂事情,这些事情既有趣又让人感到不可思议,尤其是在人们开始实际上在生产中使用这些东西的时候?是的,这实际上来源于我自己的经历,比如观察代理。我是在Son of V2更新前一个月加入公司的,其中一个入职任务是运行OS World,这是一个计算机用户基准测试。整整一个星期,我和另一位工程师一直盯着这些让我们感到违背直觉的代理轨迹。我们不确定模型为什么会根据我们给出的指令做出这样的决策。所以我们决定假装自己是云端,把自己放在那样的环境中。我们会做一件非常傻的事情:闭上眼睛整整一分钟,然后眨眼看屏幕一秒钟,再次闭上眼睛思考,假如我要写Python代码在这个环境中工作,我应该怎么做?

I suddenly made a lot more sense. And I feel like a lot of agent design comes down to that. There's a lot of context and a lot of knowledge that the model maybe does not have. And we have to be empathetic to the model. And we have to make a lot of that clear in the prompt in the two description in the environment. I see. So a tip here for developers is almost like to act as if you are looking through the lens of the model itself, in terms of what would be the most applicable instructions here. I was the model seeing the world, which is very different than how we operate as a human, I guess, with additional context. Eric, I'm curious if you have any other stories that you've seen.
我突然间对这一切有了更多的理解。我觉得很多代理设计就是为了达到这一点。模型可能缺乏很多背景信息和知识,因此我们要对模型保持同理心。在提示、描述和环境中,我们需要把这些信息尽可能清楚地表达出来。明白了,所以对开发者的一个建议是,几乎要以模型的视角来看待问题,考虑哪些指令最为适用。作为模型,我看到的世界与我们人类在有额外背景信息时的视角非常不同。我很好奇,Eric,你有没有看到其他类似的例子?

Yeah. I think actually, in a very similar vein, I think a lot of people really forget to do this. And I think maybe the funniest things I see is that people will put a lot of effort into creating these really beautiful, detailed prompts. And then the tools that they make to give the model are sort of these incredibly bare bones, like no documentation, the parameters are named A and B. And it's kind of like, oh, an engineer wouldn't be able to work with this as a work with this as if this was a function they had to use, because there's no documentation.
是的,我也有类似的看法。我觉得很多人常常忘记这么做。我觉得最有趣的是,人们会花很多精力去创建非常漂亮、详细的提示,但他们给模型使用的工具却非常简陋,比如没有任何文档,参数名字只是 A 和 B。如果把它当作一个函数使用,工程师根本没法有效工作,因为没有文档说明。

How can you expect qualities this as well? So it's like that lack of putting yourself in the model shoes. And I think a lot of people, when they start trying to use tool use and function calling, they kind of forget that they have to prompt as well. And they think about the model just as a more classical programming system. But it is still a model. And you need to be prompt engineering in the descriptions of your tools themselves. Yeah, I've noticed that. It's like people forget that it's all part of the same prompt. It's all getting fed into the same prompt in the context window. And writing a good tool description influences other parts of the prompt as well.
你如何期待有这样的特性呢?就像是没有把自己放在模型的位置上。我认为很多人在开始使用工具和函数调用时,往往忘记了他们也需要进行提示。他们把模型当作一个传统的编程系统。然而,它仍然是一个模型,你需要在工具的描述中进行提示工程。是的,我注意到,人们好像忘记了这都是同一个提示的一部分,所有内容都是在同一个上下文窗口中被使用的。写一个好的工具描述也会影响提示的其他部分。

So that is one aspect to consider. Agents is this kind of all the hype term right now. A lot of people are talking about it. And there's been plenty of articles written and videos made on the subject. What made you guys think that now is the right time to write something ourselves and talk a little bit more about the details of Agents? Sure, yeah. I think one of the most important things for us is just to be able to explain things well. I think that's a big part of our motivation, which is we walk into customer meetings, and everything is referred to as a different term, even though they share the same shape.
所以这是一个值得考虑的方面。代理这个词现在非常热门,很多人都在讨论它。关于这个主题已经有很多文章和视频。你们是如何想到现在是写点什么并更详细讨论代理的合适时机呢?当然,我认为对我们来说,最重要的事情之一就是能够清楚地解释事物。这是我们重要的动机之一,因为在与客户开会时,我们发现即便是同样的事物,也会被用不同的术语来称呼。

So we thought you'd be really useful if we can just have a set of definitions and a set of diagrams and code to explain these things to our customers. And we are getting to the point where the model is capable of doing a lot of the agentic workflows that we're seeing. And that seems like the right time for us to have some definitions or just to make these conversations easier. I think for me, I saw that there was a lot of excitement around Agents, but also a lot of people really didn't know what it meant in practice. And so they were trying to bring Agents to any problem they had, even when much simpler systems would work.
因此,我们认为,如果能够提供一套定义、一些图表和代码来向客户解释这些内容,那将非常有用。我们正在达到一个阶段,模型能够执行许多我们所见的自主工作流程。现在看来,正是时候为我们制定一些定义,或是简化这些对话。在我看来,人们对代理的兴趣很大,但许多人实际上并不知道它在实践中意味着什么。因此,即使在更简单的系统即可解决问题的情况下,他们仍试图将代理应用于任何问题。

And so I saw that as one of the reasons that we should write this is guide people about how to do Agents, but also where Agents are appropriate, and that you shouldn't go after a fly with a bazooka. I see. I see. That was a perfect part. Lance, my next question here. There's a lot of talk about the potential of Agents. And every developer out there in every startup and business is trying to think about how they can build their own version of an Agent for their company or product. But you guys are starting to see what actually works in production.
好的,我看到了我们应该写这篇文章的理由之一,就是为人们提供指导,告诉他们如何使用代理,以及在什么情况下使用代理是合适的。就像不应该用大炮去打苍蝇一样。我明白了。这是个很好的例子。兰斯,我的下一个问题是,有很多关于代理潜力的讨论。每个开发者,无论是在创业公司还是企业中,都在思考如何为他们的公司或产品构建自己的代理版本。但是,你们已经开始看到在实际生产中什么是真正有效的。

So we're going to play a little game here. I want to know one thing that's overhyped about Agents right now, and also one thing that's underhyped, just in terms of implementations or actual uses in production or potentials here as well. So Eric, let's start with you first. I feel like underhyped is like things that save people time, even if it's a very small amount of time. I think a lot of times if you just look at that on the surface, it's like, oh, this is something that takes me a minute. And even if you can fully automate it, it's only a minute. Like, what help is that?
我们来做个小游戏。我想知道关于代理(Agents)这个话题中,目前被过度炒作的一点是什么,以及被低估的一点是什么,主要是从实际应用或生产中的实际用例或潜力来说的。埃里克,我们先从你开始。我觉得被低估的点在于那些节省人们时间的事情,即便只是节省很少的时间。很多时候,如果你只是表面上看,会觉得这只不过是一个需要一分钟的事情,即使完全自动化也只节省了一分钟,那好处在哪里呢?

But really, that changes the dynamics of now you can do that thing 100 times more than you previously would. So I think I'm most excited about things that, if they were easier, could be really scaled up. Yeah, I don't know if this is necessarily related to hype, but I think it's really difficult to calibrate right now where Agents are really needed. I think there's this intersection that's a sweet spot for using Agent, and it's a set of tasks that's valuable and complex, but also maybe the cost of error or cost of monitoring error is relatively low.
不过,真正令人兴奋的是,现在你可以将这件事情的执行频率增加到以前的100倍。这让我对那些如果变得更简单就能大规模提升的事情充满期待。我不太确定这是否和炒作有直接关系,但我觉得现在很难精确判断在哪些领域真正需要使用人工智能助手。我认为有一个理想的交叉点适合使用智能助手,即那些既有价值又复杂的任务,但错误的代价或监控错误的成本相对较低。

That set of tasks is not super clear and obvious, unless we actually look into the existing processes. I think coding and search are two pretty canonical examples where Agents are very useful. Take Search as an example. It's a really valuable task. It's very hard to do deep iterative search, but you can always trade off some precision for recall and then just get a little bit more documents or a little bit more information that needs needed and filter it down.
那组任务并不是特别清晰和显而易见,除非我们真正研究现有的流程。我认为编程和搜索是两个非常典型的例子,在这些情境中,代理(Agents)非常有用。以搜索为例,它是一个非常有价值的任务。进行深入的迭代搜索非常困难,但你总是可以用一些精度来换取召回率,然后获取更多的文档或信息,再进行筛选。

So we've seen a lot of success there with Agent, so what does a coding agent look like right now? Coding agents, I think, are super exciting because they are verifiable, at least partially. Code has this great property that you can write tests for it and then you edit the code and either the tests pass or they don't pass. Now that assumes that you have good unit tests, which I think every engineer in the world can say, like, we don't. But at least it's better than a lot of things.
我们已经看到Agent在这个领域取得了很大的成功,那么现在的编码Agent是什么样的呢?我觉得编码Agent超级令人兴奋,因为它们至少部分是可验证的。代码有一个很棒的特点,就是你可以为它编写测试,然后修改代码,测试要么通过,要么不通过。当然,这是假设你有好的单元测试,这一点上我想所有工程师都会承认,我们通常没有。但是至少这比很多其他事情要好。

There's no equivalent way to do that for many other fields. So this at least gives a coding agent some way that it can get more signal every time it goes through a loop. So if every time it's running the tests again, it's seeing what the error of the output is, that makes me think that the model can converge on the right answer by getting this feedback. And if you don't have some mechanism to get feedback as you're iterating, you're not injecting any more signal. You're just going to have noise.
在许多其他领域中,没有等效的方法来做到这一点。因此,这至少为编码代理提供了一种方式,使得它每次执行循环时都能获取更多信号。也就是说,每次它再次运行测试时,都可以看到输出的错误,这让我觉得通过获取这些反馈,模型能够最终收敛到正确的答案。而如果在迭代过程中没有某种机制来获取反馈,你就无法引入更多的信号,只会得到噪音。

And so there's no reason without something like this that an agent will converge to the right answer. I see. So what's the biggest blockers then in terms of improving agent performance on the coding at the moment? Yeah. So I think for coding, we've seen over the last year like on Sweetbench, results have gone really from like very, very low to like, I think, you know, over 50% now, which is really incredible. So the models are getting really good at writing code to solve these issues.
好的,翻译成中文可以这么表达: 因此,在没有类似这种情况的前提下,智能体没有理由会收敛到正确答案。我明白。那么,目前在提升智能体编码性能方面最大的障碍是什么呢?是的,我认为在编码这方面,我们在过去一年中看到了很大的进步,比如在Sweetbench上的结果,从非常低的水平提升到了现在的50%以上,这真是令人难以置信。因此,这些模型在编写解决问题的代码方面变得非常出色。

I feel like I have a slightly controversial take here that I think the next limiting factor is going to come back to that verification. Like it's great for these cases where we do have perfect unit tests. And that's starting to work. But for the real world cases, we usually don't have perfect unit tests for them. And so I'm thinking now, like, finding ways that we can verify and we can add tests for the things that you really care about so that the model itself can test this and know whether it's right or wrong before it goes back to the human.
我觉得我对这个问题的看法可能有点争议,我认为下一个限制因素会回到验证这一步。比如在一些情况下,我们有完美的单元测试,这非常好,而且开始取得一些效果。但是在现实世界的情况中,我们通常没有这些完美的单元测试。因此,我现在在想,寻找一些方法去验证,并为那些你真正关心的东西添加测试,以便模型本身能够在返回给人类之前进行测试并判断其是否正确。

I see. Making sure that we can embed some sort of feedback loop into the processes that's the right or wrong. OK. What's the future of agents look like in 2025? Very, we're going to start with you. Yeah, I think that's a really difficult question. This is probably not like a practical thing. But one thing I've been really interested in just like how a multi-agent environment will look like.
我明白了。确保我们能够在这些过程里嵌入某种反馈机制,以判断对错的问题。好。那么在2025年,智能代理的未来会是什么样子呢?非常,非常想听你的看法。是的,我认为这是一个非常难回答的问题。这可能不是一个实用性的东西,不过我对多代理环境会是什么样子非常感兴趣。

I think I've already shown Eric that it's like a building environment where a bunch of cloud can spin up other clouds and play werewolf together. And it's like a completely what is werewolf? Werewolf is a social deduction game where all of the players are trying to figure out what each other's role is. It's very similar to mafia. It's entirely text-based, which is great for cloud to play in.
我觉得我已经向埃里克展示过,这就像一个可以创建其他云的云计算环境,大家可以在其中一起玩狼人游戏。这到底是什么呢?狼人游戏是一种社交推理游戏,所有玩家都试图猜出彼此的角色。它和“杀手游戏”很相似。这个游戏完全基于文字,因此非常适合在云端进行。

I see. So we have multiple different clouds playing different roles within this game, all communicating with each other. Yeah, exactly. And then you see a lot of interesting interaction in there that you just haven't seen before. And that's something I'm really excited about. It's like very similar to how we went from single LOM to multi LOM. I think by the end of the year, we could potentially see us going from agent to multi-agent. And there are some interesting research questions that figure out in that domain. In terms of how the agents interact with each other, what does this emergent behavior look like in that one as you coordinate between agents doing different things? Exactly. And just whether this is actually going to be useful or better than a single agent with access to a lot more resources.
我明白了。所以在这个游戏中,我们有多个不同的“云”扮演着不同的角色,并相互交流。是的,没错。然后你会看到很多有趣的互动,这是之前没有见过的。我对此感到非常兴奋。这就像我们从单一LOM转向多LOM一样。我认为到今年年底,我们可能会看到从单一代理转向多代理。在这个领域有一些有趣的研究问题需要解决,例如代理之间如何互动,协调做不同事情的代理时,这种涌现行为会是什么样子。是的,并且考虑到这样做是否真的比一个拥有更多资源的单一代理更有用或更好。

Do we see any multi-agent approaches right now that are actually working in production? I feel like in production, we haven't even seen a lot of successful single agents. OK, interest. But this is kind of like a potential extension of successful agents with the improved capabilities of the next couple of generations of models. Yeah, so this is not a vice that everyone should go explore about the agent environment. It's just I think to understand the models behavior, this provides us with a better way to understand model behaviors.
我们现在有看到任何多代理方法真正投入生产并取得效果吗?我觉得在生产环境中,我们甚至还没有看到很多成功的单一代理。好的,这是一个令人感兴趣的话题。但这有点像是成功代理的潜在扩展,因为未来几代模型的能力增强了。是的,这并不是建议每个人都去探索代理环境。只是我认为,为了理解模型的行为,这提供了我们更好的方法去理解模型行为。

I see. OK, Eric, what's the future of agents 25? Yeah, I feel like in 2025, we're going to see a lot of business adoption of agents starting to automate a lot of repetitive tasks and really scale up a lot of things that people wanted to do more before, but were too expensive. You can now have 10x or 100x how much you do these things. I'm imagining things like every single pull request in triggers a coding agent to come and update all of your documentation. Things like that will be cost prohibitive to do before. But once you think of agents as almost free, you can start adding these bells and whistles everywhere.
我明白了。好的,Eric,那么25号代理的未来是什么呢?我觉得在2025年,我们会看到很多企业开始采用代理来自动化大量重复性的任务,并大规模地实现很多以前由于成本过高而无法实施的事情。现在,你可以把很多事情的效率提高10倍或100倍。我可以想象这样的情况:每当有代码拉取请求时,就会触发一个编码代理前来更新所有文档。这样的事情以前因为成本高昂而难以实现。但一旦你认为代理几乎是免费的,就可以开始在各个地方增加这些额外的功能。

I think maybe something that's not going to happen yet, going back to what's overhyped. I feel like agents for consumers are fairly hyped right now. OK, here we go. Hot take. Because I think that we talked about a verifiability. I think that for a lot of consumer tasks, it's almost as much work to fully specify your preferences and what the task is as to just do it yourself. And it's very expensive to verify. So trying to have an agent fully book a vacation for you, describing exactly what you want your vacation to be and your preferences is almost just as hard as just going and booking it yourself.
我认为某些事情可能还不会发生,回到那些被过度炒作的事物上来看,我感觉消费者代理现在被炒得很热。好,这就来个大胆观点。因为我们谈到了可验证性。我觉得对于很多消费者任务来说,完全说明你的偏好和任务内容,几乎和你自己直接去做花费的精力差不多。而且验证这些信息也是很费钱的。所以,让一个代理完全为你预订一次假期,详细描述你的假期和偏好,几乎和你自己去预订一样难。

Interesting. And it's very high risk. You don't want the agent to actually go book a plane flight. Interesting. Without you first accepting it. Is there a matter of maybe context that we're missing here, too, from the models being able to infer this information about somebody without having to explicitly go ask and learn the preference over time? Yeah, so I think that these things will get there. But first, you need to build up this context so that the model already knows your preferences and things. And I think that takes time. I see. And we'll need some stepping stones to get to bigger tasks like planning a whole vacation.
有趣。这确实风险很高。你不希望代理直接去预订机票,而是希望在你接受后才进行操作。有趣的是,模型是否可能缺少某些上下文信息,因此无法通过推断了解某人的偏好,而不需要长期明确地询问和学习?是的,我认为这些问题最终能够解决。但首先,你需要建立这些上下文信息,让模型已经了解你的偏好等。我认为这需要时间。另外,为了实现像规划整个假期这样的复杂任务,我们还需要一些过渡步骤。

I see. OK, very interesting. Last question. Any advice that you give to a developer that's exploring this right now in terms of starting to build this or just thinking about it from a general future-proofing perspective that you can give? I feel like my best advice is make sure that you have a way to measure your results. Because I've seen a lot of people will go and build in a vacuum without any way to get feedback about whether they're building is working or not. And you can end up building a lot without realizing that it's either it's not working or maybe something much simpler would have actually done just as good a job.
我明白了。好的,非常有趣。最后一个问题。对于目前正在探索这个领域的开发者,你有没有一些建议,无论是在开始构建这个项目还是从长远考虑如何保持其适应未来发展?我觉得我最好的建议是确保你有办法衡量你的成果。因为我见过很多人在没有任何反馈的情况下孤军奋战,无法判断自己所构建的东西是否有效。最后可能会在不自知的情况下构建出很多东西,而这些东西要么无效,要么其实可以用更简单的方法来完成得同样好。

Yeah, I think very similarly, starting as simple as possible and having that measurable result as you are building more complexity into it. One thing I've been really impressed by is I work with some really resourceful startups. And they can do everything within 1LM call. And the orchestration around the code, which will persist even as the model gets better, is their niche. And I always get very happy when I see one of those. Because I think they reap the benefit of future capability improvements.
好的,我的想法也很相似,尽量从简单开始,并在增加复杂性时保持结果的可测量性。我曾经与一些非常有创造力的初创公司合作,其中令我印象深刻的是,他们能够在一次语言模型(1LM)调用中完成所有事情。而围绕代码的编排,即使模型变得更好也能持续不变,这是他们的独特优势。看到这样的公司总让我感到很开心,因为我认为他们能从未来能力的提升中获益。

And realistically, we don't know what use case will be great for agents. And the landscape is going to shift. But it's probably a good time to start building up some of that muscle to think in the agent land just to understand that capability a little bit better.
现实情况是,我们不确定哪种应用场景会非常适合代理。同时,这个领域也会不断变化。但是现在可能是一个好时机,可以开始培养一些在代理领域的思维能力,以便更好地理解这种能力。

Yeah, I think I want to double click on something you said of being excited for the models to get better. I think that if you look at your startup or your product and think, oh, man, if the models get smarter, all of our mode's going to disappear, that means you're building the wrong thing.
是的,我想详细谈谈你刚才提到的关于对模型进步感到兴奋的观点。我认为如果你看着自己的创业公司或产品,并想着天啊,如果模型变得更聪明,我们的竞争优势就会消失,那说明你在做错东西。

Instead, you should be building something so that as the models get smarter, your product gets better and better. Right. That's great advice. Eric, Barry, thank you guys. This is Building Effective Agents. Thank you. Thanks.
相反,你应该建立某种机制,以便随着模型变得更加智能,你的产品也会越来越好。没错,这是很好的建议。谢谢你们,Eric和Barry。这是关于构建有效代理的内容。谢谢。



function setTranscriptHeight() { const transcriptDiv = document.querySelector('.transcript'); const rect = transcriptDiv.getBoundingClientRect(); const tranHeight = window.innerHeight - rect.top - 10; transcriptDiv.style.height = tranHeight + 'px'; if (false) { console.log('window.innerHeight', window.innerHeight); console.log('rect.top', rect.top); console.log('tranHeight', tranHeight); console.log('.transcript', document.querySelector('.transcript').getBoundingClientRect()) //console.log('.video', document.querySelector('.video').getBoundingClientRect()) console.log('.container', document.querySelector('.container').getBoundingClientRect()) } if (isMobileDevice()) { const videoDiv = document.querySelector('.video'); const videoRect = videoDiv.getBoundingClientRect(); videoDiv.style.position = 'fixed'; transcriptDiv.style.paddingTop = videoRect.bottom+'px'; } const videoDiv = document.querySelector('.video'); videoDiv.style.height = parseInt(videoDiv.getBoundingClientRect().width*390/640)+'px'; console.log('videoDiv', videoDiv.getBoundingClientRect()); console.log('videoDiv.style.height', videoDiv.style.height); } window.onload = function() { setTranscriptHeight(); }; if (!isMobileDevice()){ window.addEventListener('resize', setTranscriptHeight); }