Andrej Karpathy: From Vibe Coding to Agentic Engineering
Andrej Karpathy, a leading figure in AI, shared a recent, startling realization: he has "never felt more behind as a programmer." The feeling dates to around December 2025, when he noticed a fundamental shift in AI capabilities. LLM coding tools went from needing frequent correction to reliably producing chunks of code that simply came out fine, which enabled "vibe coding," a fluid, intuitive way of developing that sent him down a rabbit hole of "infinity side projects." Karpathy emphasizes that this is not an incremental improvement; it is a fundamental change that demands a new perspective.
Karpathy posits that LLMs are ushering in "Software 3.0," a radical departure from previous paradigms. Software 1.0 was explicit, rule-based code, and Software 2.0 used data to train neural networks (programming by creating datasets). Software 3.0 redefines programming as prompting: the LLM is the interpreter, and what is in the context window is your lever over it. He gives two striking examples. Installing OpenClaw now means handing a block of text instructions to an agent rather than running a complex shell script. More dramatically, his personal "MenuGen" app, which generates pictures for the items on a restaurant menu, turns out to be "spurious," because the same result can be had by giving Gemini a photo of the menu and asking for an image back. The shift means AI does not just make existing programming faster; it enables new forms of information processing, such as generating knowledge bases from unstructured documents, which was previously impossible.
Asked what the 2026 equivalent of building websites in the '90s might be, Karpathy envisions "neural computers" in which raw sensory data (video, audio) feeds directly into neural networks that dynamically render UIs. He suggests that today's computing architecture will flip, with neural nets becoming the "host process" and traditional CPUs serving as "co-processors," leading to an "extremely foreign" computing landscape.
A core concept Karpathy explores is "verifiability." LLMs excel at automating tasks whose outputs can be objectively verified, a trait that stems from reinforcement learning (RL) training with verification rewards. This explains the "jaggedness" of LLM intelligence: a model can refactor a vast codebase or find zero-day vulnerabilities, yet stumble on a simple common-sense question such as whether to drive or walk to a car wash 50 meters away. This jaggedness necessitates human oversight, because LLMs, while powerful, remain fallible tools. Lab decisions about training data (for example, the large amount of chess data added for GPT-4) significantly shape these capabilities, so users are "at the mercy" of what the labs include. For founders, this means opportunities exist in verifiable domains the major labs have not fully explored, where custom RL environments and fine-tuning could yield significant results.
Karpathy differentiates "vibe coding," which raises the floor for everyone in terms of what they can do in software, from "agentic engineering." The latter focuses on preserving the quality bar of professional software while dramatically accelerating development. He believes the traditional "10x engineer" concept now understates the gap, because effective agentic engineers achieve far greater speed-ups by coordinating "spiky, fallible, stochastic" agents. Consequently, hiring processes must adapt, moving from puzzle-solving to evaluating whether candidates can implement large projects with agentic tools, with other agents tasked with breaking what they build.
In this agent-driven world, human skills like aesthetics, judgment, taste, and high-level oversight become invaluable. While agents can handle API specifics and rote details (e.g., `keepdims` vs `axis`), humans must retain an understanding of the underlying fundamentals (e.g., memory management in tensors) and provide strategic design. Karpathy notes that current agent-generated code can be "bloaty" or "gross," underscoring the ongoing need for human discretion in maintaining quality and elegance.
Ultimately, Karpathy foresees an "agent-native" world where infrastructure is designed for agents, not just humans. He expresses frustration with current documentation that gives human instructions ("go to this URL") rather than agent-ready "copy-paste" commands. The ideal would be to prompt an LLM to "build MenuGen" and have it fully deployed without any manual intervention. This agent-first approach will extend to inter-agent communication, with "my agent talk[ing] to your agent" for tasks like scheduling.
Regarding education, Karpathy highlights the enduring value of understanding, quoting the idea: "You can outsource your thinking but you can't outsource your understanding." He stresses that humans remain the bottleneck for direction, purpose, and true comprehension. While tools like LLM-powered knowledge bases can enhance understanding by re-processing information, the human role in discerning "what to build, why it's worth doing, and how to direct" these powerful agents remains irreplaceable.
Summary
Andrej Karpathy (co-founder of OpenAI, former head of AI at Tesla, and now founder of Eureka Labs) talks with Sequoia partner Stephanie Zhan at AI Ascent 2026 about what's changed in the year since he coined "vibe coding." He explains why he's never felt more behind as a programmer, why agentic engineering is the more serious discipline taking shape on top of vibe coding, and why we should think of LLMs not as animals but as ghosts: jagged, statistical, summoned entities that require a new kind of taste and judgment to direct. He also touches on Software 3.0, the limits of verifiability, and why you can outsource your thinking but never your understanding.
Transcript
Stephanie Zhan: We're so excited for our very first special guest. He has helped build modern AI, then explain modern AI, and then occasionally rename modern AI. He helped co-found OpenAI right inside this office, and he was the one who got Autopilot working at Tesla back in the day. He has a rare gift for making the most complex technical shifts feel both accessible and inevitable. You all know him for having coined the term "vibe coding" last year, but just in the last few months he said something even more startling: that he's never felt more behind as a programmer. That's where we're starting today. Thank you, Andrej, for joining us.

Andrej Karpathy: Hello, excited to be here.

Stephanie Zhan: To kick us off: just a couple of months ago you said that you've never felt more behind as a programmer. That's startling to hear from you of all people. Can you help us unpack that? Was that feeling exhilarating or unsettling?

Andrej Karpathy: A mixture of both, for sure. Like many of you, I've been using agentic tools, Claude Code and adjacent things, for a while, maybe over the last year as they came out. They were very good at chunks of code. Sometimes they would mess up and you'd have to edit them, and it was kind of helpful. Then I would say December was this clear point. I was on a break, so I had a bit more time (I think many other people were similar), and I started to notice that with the latest models the chunks just came out fine. I kept asking for more, and it just came out fine. I can't remember the last time I corrected it. I trusted the system more and more, and then I was vibe coding. I do think it was a very stark transition.
I tried to stress this on Twitter, or X, because I think a lot of people experienced AI last year as a ChatGPT-adjacent thing. But you really had to look again, and you had to look as of December, because things have changed fundamentally, especially on this agentic, coherent workflow that really started to work. It was that realization that had me go down this whole rabbit hole of infinity side projects. My side-projects folder is extremely full of lots of random things, and I'm just vibe coding all the time. So that happened in December, I would say, and I've been looking at the repercussions of that since.

Stephanie Zhan: You've talked a lot about this idea of LLMs as a new computer: that it isn't just better software, it's a whole new computing paradigm. Software 1.0 was explicit rules, Software 2.0 was learned weights, and Software 3.0 is this. If that's actually true, what does a team build differently the day they actually believe it?

Andrej Karpathy: Right, exactly. In Software 1.0, I'm writing code. In Software 2.0, I'm programming by creating datasets and training neural networks, so the programming is arranging datasets, and maybe some objectives and neural network architectures. Then what happened is that if you train one of these GPT models or LLMs on a sufficiently large set of tasks (implicitly, because by training on the internet you have to multitask all the things that are in the dataset), these actually become a kind of programmable computer in a certain sense.
So Software 3.0 is about your programming now turning into prompting. What's in the context window is your lever over the interpreter, which is the LLM: it interprets your context and performs computation in the digital information space. That's the transition, and there are a few examples that really drove it home for me and that might be instructive.

For example, when OpenClaw came out and you wanted to install it, you would normally expect a bash script, a shell script: run the shell script to install OpenClaw. But to target lots of different platforms and the different types of computers you might run OpenClaw on, these shell scripts usually balloon and become extremely complex. And you're still stuck in a Software 1.0 universe of wanting to write the code. Actually, the OpenClaw installation is a copy-paste of a bunch of text that you're supposed to give to your agent. It's a little skill: copy-paste this, give it to your agent, and it will install OpenClaw.
The reason this is a lot more powerful is that you're now working in the Software 3.0 paradigm, where you don't have to precisely spell out all the individual details of that setup. The agent has its own intelligence that it packages up: it follows the instructions, looks at your environment and your computer, performs intelligent actions to make things work, and debugs things in the loop. It's so much more powerful. So I think that's a very different way of thinking about it: what is the piece of text to copy-paste to your agent? That's the programming paradigm now.
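As a rough illustration of the three paradigms described above, the sketch below solves one toy task (deciding whether a restaurant review is positive) three ways. Everything in it is an assumption made for illustration: the word list, the tiny dataset, the function names, and the prompt wording are hypothetical, and the Software 3.0 step only builds the text that would be handed to an LLM rather than calling any real model.

```python
from collections import Counter

# Software 1.0: the programmer spells out the rule by hand.
POSITIVE_WORDS = {"great", "good", "excellent"}  # hypothetical word list

def classify_v1(text: str) -> bool:
    """Return True if any hand-picked positive word appears."""
    return any(word in POSITIVE_WORDS for word in text.lower().split())

# Software 2.0: the programmer curates a dataset; the "program" is the
# set of weights learned from it (here, simple per-word vote counts).
def train_v2(examples: list[tuple[str, bool]]) -> Counter:
    weights = Counter()
    for text, is_positive in examples:
        for word in text.lower().split():
            weights[word] += 1 if is_positive else -1
    return weights

def classify_v2(weights: Counter, text: str) -> bool:
    """Return True if the learned word weights sum to a positive score."""
    return sum(weights[word] for word in text.lower().split()) > 0

# Software 3.0: the programmer writes context for an LLM "interpreter".
# This only builds the prompt; sending it to a model is left out.
def build_prompt_v3(text: str) -> str:
    return (
        "Decide whether the following restaurant review is positive.\n"
        "Answer with exactly one word: yes or no.\n\n"
        f"Review: {text}"
    )

if __name__ == "__main__":
    dataset = [
        ("tasty and fresh", True),
        ("cold and bland", False),
        ("fresh bread", True),
    ]
    weights = train_v2(dataset)
    review = "fresh and tasty"
    print(classify_v1(review))           # hand-written rule misses these words
    print(classify_v2(weights, review))  # learned weights cover them
    print(build_prompt_v3(review))
```

The point of the contrast is where the programmer's effort goes: into the rule, into the dataset, or into the text placed in the context window.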
One more example that comes to mind, even more extreme than that, is from when I was building MenuGen. MenuGen is this idea where you come to a restaurant and they give you a menu. There are usually no pictures, so I don't know what many of these things are; usually for 30% of the things, maybe 50%, I have no idea what they are. So I wanted to take a photo of the restaurant menu and get pictures of what those things might look like in a generic sense. I vibe coded this app that lets you upload a photo. It runs on Vercel, it OCRs all the different titles, uses an image generator to get pictures of them, re-renders the menu with all the items, and shows it to you. Then I saw the Software 3.0 version of this, which blew my mind: you literally just take your photo, give it to Gemini, and say, "Use Nano Banana to overlay the things onto the menu."
And Nano Banana returned an image that is exactly the picture of the menu that I took, but it rendered the different things in the menu into the pixels. This blew my mind, because actually all of my MenuGen is spurious. It's working in the old paradigm; that app shouldn't exist. The Software 3.0 paradigm is a lot more raw: your neural network is doing more and more of the work, your prompt or context is just the image, the output is an image, and there's no need for any of the app in between. So I think people have to reframe. Don't keep working in the existing paradigm of what existed and think of this only as a speed-up of what exists. New things are actually available now.
And going back to your programming question, I think that's also an example of working in the old mindset, because it's not just about programming and programming becoming faster. This is more general information processing that is automatable now, so it's not even just about code. Previously, you wrote code over structured data. But, for example, with my LLM knowledge bases project, you get LLMs to create wikis for your organization or for you personally. This is not even a program. This is not something that could exist before, because there was no code that would create a knowledge base from a bunch of facts. Now you can take these documents, recompile them in a different way, reorder them, and create something new and interesting as a reframing of the data. These are new things that weren't possible. That's what I keep trying to get back to: not only what can we do that existed and is faster now, but also the new opportunities in things that couldn't be possible before. I almost think that's more exciting.
Stephanie Zhan: I love the MenuGen progression and dichotomy that you laid out. I'm sure many folks here followed your own progression of programming from last October to early January and February this year. If you extrapolate that further, what is the 2026 equivalent of building websites in the '90s, building mobile apps in the 2010s, or building SaaS in the last cloud era? What will look completely obvious in hindsight that is still mostly unbuilt today?

Andrej Karpathy: Going with the example of MenuGen: a lot of this code shouldn't exist, and it's just neural networks doing most of the work. I do think the extrapolation looks very weird, because you could imagine completely neural computers in a certain sense. Imagine a device that takes raw video or audio into what is basically a neural net and uses diffusion to render a UI that is unique for that moment. I feel like in the early days of computing, people were a little confused as to whether computers would look like calculators or like neural nets. In the '50s and '60s it was not really obvious which way it would go. Of course we went down the calculator path and ended up building classical computing, and neural nets currently run virtualized on existing computers. But you could imagine that a lot of this will flip: the neural net becomes the host process and the CPU becomes the co-processor.
We saw the diagram of how intelligence compute, the compute of neural networks, is going to take over and become the dominant spend of FLOPs. So you could imagine something really weird and foreign, where neural nets are doing most of the heavy lifting. They use tools as a kind of historical appendage for some deterministic tasks, but what's really running the show is these neural nets, networked in a certain way. You can imagine something extremely foreign as the extrapolation, but I think we're probably going to get there piece by piece. That progression is TBD.
Stephanie Zhan: I'd love to talk a little bit about this concept of verifiability: the fact that AI will automate, faster and more easily, domains where the output can be verified. If that framework is right, what work is about to move much faster than people realize? And what professions do people think are safe but are actually highly verifiable?

Andrej Karpathy: I spent some time writing about verifiability. Traditional computers can easily automate what you can specify in code, and this latest round of LLMs can easily automate what you can verify, in a certain sense. The way this works is that when frontier labs train these LLMs, they use giant reinforcement learning environments in which the models are given verification rewards. Because of the way these models are trained, they end up progressing into jagged entities that really peak in capability in verifiable domains like math, code, and adjacent areas, and that stagnate and are a bit rough around the edges when things are not in that space.
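To make the idea of a verification reward concrete, here is a minimal sketch of what one might look like for a coding task. It is an illustration under stated assumptions, not a description of how any lab actually trains: the names `verification_reward`, `SORT_TESTS`, `correct_attempt`, and `buggy_attempt` are hypothetical, and the candidates are ordinary Python callables standing in for model-generated code, which a real setup would run in a sandbox.

```python
from typing import Callable, Sequence

# Each test case pairs positional arguments with the expected result.
TestCase = tuple[tuple, object]

def verification_reward(candidate: Callable, tests: Sequence[TestCase]) -> float:
    """Return 1.0 only if the candidate passes every test, else 0.0."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed verification, not as an error
            # in the reward function itself.
            return 0.0
    return 1.0

# Hypothetical task: "return the list sorted in ascending order".
SORT_TESTS: list[TestCase] = [
    (([3, 1, 2],), [1, 2, 3]),
    (([],), []),
    (([2, 2, 1],), [1, 2, 2]),
]

def correct_attempt(xs: list[int]) -> list[int]:
    return sorted(xs)

def buggy_attempt(xs: list[int]) -> list[int]:
    # Plausible-looking but wrong: converting to a set drops duplicates.
    return list(set(xs))

if __name__ == "__main__":
    print(verification_reward(correct_attempt, SORT_TESTS))
    print(verification_reward(buggy_attempt, SORT_TESTS))
```

The reward needs no human judgment, only a checker, which is what makes domains like math and code easy to scale in this style of training.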
The reason I wrote about verifiability is that I'm trying to understand why these things are so jagged. Some of it has to do with how the labs train the models, but I think some of it also has to do with the focus of the labs and what they happen to put into the data distribution. Some things are significantly more valuable in the economy and end up creating more environments, because the labs wanted to work in those settings. Code is a good example of that. There are probably lots of verifiable environments they could think of that happen not to make it into the mix, because the capability is just not that useful to have around.

To me, the big mystery is this. The favorite example for a while was "How many letter R's are in strawberry?" The models would famously get this wrong, and it's an example of jaggedness. The models have now patched this, I think, but the new one is: "I want to go to a car wash to wash my car, and it's 50 meters away. Should I drive or should I walk?" State-of-the-art models today will tell you to walk, because it's so close.
How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a hundred-thousand-line codebase or find zero-day vulnerabilities, and yet tell me to walk to this car wash? This is insane. To whatever extent these models remain jagged, it's an indication that, number one, maybe something is slightly off, or, number two, you need to actually be in the loop a little bit: you need to treat them as tools and stay in touch with what they're doing. So, long story short, all of my writing about verifiability is just trying to understand why these things are jagged and whether there is any pattern to it. I think it's some combination of "verifiable" plus "labs care."

One more anecdote that is instructive: from GPT-3.5 to GPT-4, people noticed that chess improved a lot. I think a lot of people thought it was just a progression of capabilities, but actually (I think this is public information; I think I saw it on the internet) a huge amount of chess data made it into the pretraining set, and just because it was in the data distribution, the model improved a lot more than it would have by default.
So someone at OpenAI decided to add this data, and now you have a capability that peaked a lot more. That's why I'm stressing this dimension: we are slightly at the mercy of whatever the labs are doing and whatever they happen to put into the mix. You have to explore this thing they give you, which has no manual. It works in certain settings but maybe not in others. If you're in the circuits that were part of the RL, you fly; if you're in the circuits that are out of the data distribution, you're going to struggle. You have to figure out which circuits you're in for your application, and if you're not in those circuits, you have to really look at fine-tuning and doing some of your own work, because it's not necessarily going to come out of the LLM out of the box.
Stephanie Zhan: I'd love to come back to the concept of jagged intelligence in a little bit. Say you're a founder today thinking about building a company, and you're trying to solve a problem that you think is tractable, in a domain that is verifiable. But you look around and think, "Oh my gosh, the labs have really started getting to escape velocity in the ones that seem most obvious: math, coding, and others." What would your advice be to the founders in the audience?

Andrej Karpathy: I think that comes back to the previous question. Verifiability makes something tractable in the current paradigm, because you can throw a huge amount of RL at it. One way to see it is that this remains true even if the labs are not focusing on it directly. If you are in a verifiable setting where you could create these RL environments or examples, that sets you up to potentially do your own fine-tuning, and you might benefit from that. This is fundamentally technology that just works: if you have a huge amount of diverse datasets of RL environments, you can use your favorite fine-tuning framework, pull the lever, and get something that actually works pretty well. I don't know what the examples of this might be, but I do think there are some very valuable reinforcement learning environments that people could think of that are not part of the... I don't want to give away the answer, but there is one domain that I think is very... OK, sorry, I don't mean to vaguepost on stage, but there are some examples of this.
Stephanie Zhan: On the flip side, what do you think still feels automatable only from a distance?

Andrej Karpathy: I do think that ultimately almost everything can be made verifiable to some extent, some things more easily than others. Even for things like writing, you can imagine having a council of LLM judges and probably getting something reasonable out of that kind of approach. So it's more about what's easy or hard. Ultimately, I think everything is automatable.

Stephanie Zhan: Amazing. OK.
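The "council of LLM judges" idea mentioned above can be sketched as a scoring function that aggregates several independent judgments into one signal. This is only an illustrative sketch: the three judges below are trivial hand-written stand-ins (in practice each would be a separate LLM call), and the names `council_score`, `brevity_judge`, `clarity_judge`, and `ending_judge`, as well as the choice of the median as the aggregate, are assumptions rather than anything described in the talk.

```python
from statistics import median
from typing import Callable, Sequence

# A judge maps a piece of writing to a score; scores are clamped to [0, 1].
Judge = Callable[[str], float]

def council_score(text: str, judges: Sequence[Judge]) -> float:
    """Aggregate independent judge scores with the median, so that a
    single outlier judge cannot dominate the result."""
    if not judges:
        raise ValueError("at least one judge is required")
    scores = [min(1.0, max(0.0, judge(text))) for judge in judges]
    return median(scores)

# Stand-in judges (hypothetical heuristics, not real LLM calls).
def brevity_judge(text: str) -> float:
    return 1.0 if len(text.split()) <= 20 else 0.5

def clarity_judge(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) <= 8 for word in words) / len(words)

def ending_judge(text: str) -> float:
    return 1.0 if text.rstrip().endswith(".") else 0.0

COUNCIL = [brevity_judge, clarity_judge, ending_judge]

if __name__ == "__main__":
    print(council_score("Short and clear.", COUNCIL))
    print(council_score("unfinished thought without punctuation", COUNCIL))
```

A score produced this way could serve as an approximate verification signal for outputs that have no exact checker, at the cost of inheriting whatever biases the judges share.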
Stephanie Zhan: Last year you coined the term "vibe coding," and today we're in a world that feels a little more serious, more like agentic engineering. What do you think is the difference between the two, and what would you actually call what we're in today?

Andrej Karpathy: I would say vibe coding is about raising the floor for everyone in terms of what they can do in software. The floor rises, everyone can vibe code anything, and that's amazing, incredible. Agentic engineering is about preserving the quality bar of what existed before in professional software. You're not allowed to introduce vulnerabilities because of vibe coding. You're still responsible for your software, just as before. But can you go faster? Spoiler: you can. But how do you do that properly? I call it agentic engineering because I do think it's a kind of engineering discipline. You have these agents, which are spiky entities: a bit fallible, a little stochastic, but extremely powerful. How do you coordinate them to go faster without sacrificing your quality bar? Doing that well and correctly is the realm of agentic engineering.
So I see them as different: one is about raising the floor, and the other is about extrapolating. What I'm seeing is that there is a very high ceiling on agentic engineering capability. People used to talk about the 10x engineer; I think this is magnified a lot more. 10x is not the speed-up you gain. It does seem to me that people who are very good at this peak at a lot more than 10x, from my perspective right now.

Stephanie Zhan: I really like that framing. When Sam Altman came to AI Ascent last year, one memorable thing he said was that people of different generations use ChatGPT differently. If you're in your 30s, you use it as a Google Search replacement, but if you're in your teens, ChatGPT is your gateway to the internet.
What is the parallel here in coding today? If we were to watch two people code using OpenClaw, Claude Code, or Codex, one you'd consider mediocre at it and one you'd consider fully AI native, how would you describe the difference?
I think it's just trying to get the most out of the tools that are available: utilizing all their features and investing in your own setup. Previously, engineers were used to getting the most out of the tools they used, whether Vim or VS Code; now it's Claude Code or Codex and so on. So it's investing in your setup and utilizing a lot of the tools that are available to you. I think it just kind of looks like that.
A related thought is that a lot of people are hiring for this, because they want to hire strong agentic engineers. What I'm seeing is that most people still have not refactored their hiring process for agentic engineering capability. If you're giving out puzzles to solve, that is still the old paradigm. I would say hiring has to look like: give someone a really big project and see them implement it. Let's say, write a Twitter clone for agents, make it really good, make it really secure, and then have some agents simulate some activity on this Twitter. Then I'm going to use 10 Codex 5.4 xhigh instances to try to break the website that you deployed, and they should not be able to break it.
So maybe it looks like that. Watching people in that setting, building bigger projects and utilizing the tooling, is maybe what I would look at for the most part.
As agents do more, what human skill do you think becomes more valuable, not less?
Yeah, it's a good question. Right now the answer is that the agents are kind of like these intern entities. It's remarkable, but you basically still have to be in charge of the aesthetics, the judgment, the taste, and a little bit of oversight.
One of my favorite examples of the weirdness of agents is from MenuGen. You sign up with a Google account, but you purchase credits using a Stripe account, and both of them have email addresses. When you purchased credits, my agent tried to assign them by matching the email address from Stripe to the Google email address. There wasn't a persistent user ID; it was trying to match up the email addresses. But you could use a different email address for Stripe and for Google, and then it would not associate the funds. This is the kind of thing these agents will still make mistakes about. Why would you use email addresses to cross-correlate the funds? They can be arbitrary, you can use different emails, etc. It's such a weird thing to do.
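The design point in the MenuGen story can be sketched in a few lines. This is only an illustration under stated assumptions, not MenuGen's actual code: `User`, `CreditLedger`, `start_checkout`, and `handle_payment` are hypothetical names, and the checkout payload is a plain dictionary standing in for whatever the payment provider passes back. The idea is that the purchase carries a stable internal user ID from the start, so the billing email never has to match the login email.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: str       # stable internal ID, assigned once at sign-up (hypothetical)
    login_email: str   # email from the identity provider; not used as a key


@dataclass
class CreditLedger:
    """Credits keyed by the internal user ID, never by email."""
    balances: dict = field(default_factory=dict)

    def add_credits(self, user_id: str, amount: int) -> None:
        self.balances[user_id] = self.balances.get(user_id, 0) + amount


def start_checkout(user: User, credits: int) -> dict:
    """Build a hypothetical checkout payload that carries our own user ID."""
    return {"user_id": user.user_id, "credits": credits}


def handle_payment(ledger: CreditLedger, payload: dict, billing_email: str) -> None:
    """Apply a completed payment; billing_email is deliberately not used for matching."""
    ledger.add_credits(payload["user_id"], payload["credits"])


# The login email and the billing email differ, yet the funds still land correctly.
user = User(user_id="u_123", login_email="me@gmail.example")
ledger = CreditLedger()
payload = start_checkout(user, credits=50)
handle_payment(ledger, payload, billing_email="billing@work.example")
print(ledger.balances)
```

Matching on email would have silently dropped this purchase, because the two addresses differ. Tying everything to one persistent ID removes that failure mode.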
So I think people have to be in charge of this spec, this plan. I actually don't even like plan mode. I mean, obviously it's very useful, but I think there's something more general here, where you have to work with your agent to design a spec that is very detailed, maybe basically the docs, and then get the agents to write them. You're in charge of the oversight and the top-level categories, but the agents are doing a lot of the work under the hood, so you're not caring about some of the details.
As an example, with arrays or tensors in neural networks, there are a ton of little API details that differ between PyTorch, NumPy, pandas, and so on. I've already forgotten whether it's `keepdims` or `keepdim`, whether it's `dim` or `axis`, or `reshape` or `permute` or `transpose`. I don't remember this stuff anymore, because you don't have to. These are the kind of details that are handled by the intern, because they have very good recall. But you still have to know, for example, that there's an underlying tensor and there's an underlying view, and that you can manipulate a view of the same storage or you can have different storage, which would be less efficient. So you still have to understand what this stuff is doing and some of the fundamentals, so that you're not copying memory around unnecessarily and so on.
But the details of the APIs are now handed off. You're in charge of the taste, the engineering, the design: that it makes sense, that you're asking for the right things, that you're saying, okay, these have to be unique user IDs that we're going to tie everything to. So you're doing some of the design and development, and the agents are doing the fill-in-the-blanks. That's currently kind of where we are, and I think that's what everyone is seeing.
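The split between forgettable API spelling and fundamentals worth keeping can be shown with a small NumPy sketch. It assumes NumPy is installed, and the array and variable names are chosen only for illustration. The first part is the spelling detail an agent can recall for you; the second part is the view-versus-copy behavior you still need to understand to avoid copying memory unnecessarily.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, one contiguous buffer

# API detail: NumPy spells it axis=/keepdims=; PyTorch's torch.sum uses dim=/keepdim=.
row_sums = np.sum(a, axis=1)                       # reduced axis is dropped
row_sums_kept = np.sum(a, axis=1, keepdims=True)   # reduced axis kept with length 1

# Fundamentals: a transpose is a view onto the same storage, so no data moves.
t = a.T
is_view = np.shares_memory(a, t)

# Writing through the view changes the original array.
t[0, 1] = 99   # same element as a[1, 0]

# Flattening the transposed view cannot be expressed with strides alone,
# so NumPy has to allocate new storage and copy the data.
flat = t.reshape(-1)
is_copy = not np.shares_memory(a, flat)

print(row_sums.shape, row_sums_kept.shape, is_view, is_copy, a[1, 0])
```

`np.shares_memory` is used here instead of inspecting `.base`, since it directly answers whether two arrays overlap in memory.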
Do you think there's a chance that this taste and judgment matters less over time, or will the ceiling just keep rising?
Yeah, it's a good question. I'm hoping that it improves. Probably the reason it doesn't improve right now is, again, that it's not part of the RL. There's probably no aesthetics cost or reward, or it's not good enough, or something like that. When you actually look at the code, sometimes I get a little bit of a heart attack, because it's not necessarily super amazing code all the time. It's very bloated, there's a lot of copy-paste, and there are awkward abstractions that are brittle. It works, but it's just really gross. I do hope this can improve in future models.
A good example is the microgpt project, where I was trying to simplify LLM training to be as simple as possible. The models hate this. They can't do it. I kept trying to prompt an LLM to simplify more, simplify more, and it just can't. You feel like you're outside of the RL circuits. It feels like you're pulling teeth; it's not like light speed. So I do think people still remain in charge of this, but I also think there's nothing fundamental preventing it. It's just that the labs haven't done it yet.
I'd love to come back to this idea of jagged forms of intelligence. You wrote a very thought-provoking piece about animals versus ghosts, and the idea is that we're not building animals, we are summoning ghosts. These are jagged forms of intelligence that are shaped by data and reward functions, but not by intrinsic motivation or fun or curiosity or empowerment, things that came about via evolution. Why does that framing matter, and what does it actually change about how you build, deploy, evaluate, or even trust them?
Yeah, I think the reason I wrote about this is that I'm trying to wrap my head around what these things are. If you have a good model of what they are or are not, then you're going to be more competent at using them. I'm not sure it actually has real power; I think it's a little bit of philosophizing. But I do think it's about coming to terms with the fact that these things are not animal intelligences. If you yell at them, they're not going to work better or worse; it doesn't have any impact.
It's all just kind of like these statistical simulation circuits, where the substrate is pre-training, so statistics, and then there's RL bolted on top, which kind of adds appendages. Maybe it's just a mindset I'm coming in with about what's likely to work or not likely to work, or how to modify it. But I don't know that I have, like, "here are the five obvious outcomes of how to make your system better." It's more just being suspicious of it and figuring it out over time. That's where it starts.
Okay, so you are deep in working with agents that don't just chat. They have real permissions, they have local context, they actually take action on your behalf. What does the world look like when we all start to live in that world?
Yeah, I think a lot of people here are probably excited about what this agent-native environment looks like. Everything has to be rewritten. Everything is still fundamentally written for humans and has to be moved around. Most of the time when I use different frameworks or libraries, they still have docs that are fundamentally written for humans. This is my favorite pet peeve: why are people still telling me what to do? I don't want to do anything. What is the thing I should copy-paste to my agent? Every time I'm told to go to this URL or something like that, it's just like, ugh. So I think everyone is excited about how we decompose the workloads that need to happen into, fundamentally, sensors over the world and actuators over the world. How do we make it agent native, basically describing it to agents first, and then have a lot of automation around data structures that are very legible to the LLMs?
So I'm hoping that there's a lot of agent-first infrastructure out there. For MenuGen, famously (I'm not sure how famously), when I wrote the blog post about MenuGen, a lot of the trouble was not even writing the code for MenuGen; it was deploying it on Vercel. I had to work with all these different services, string them together, go to their settings and menus, and configure my DNS, and it was just so annoying. I would hope that I could give a prompt to an LLM, "build MenuGen," and that I wouldn't have to touch anything and it would be deployed in that same way on the internet. I think that would be a good test of whether a lot of our infrastructure is becoming more and more agent native.
And then ultimately, I do think we're going toward a world where there's agent representation for people and for organizations. I'll have my agent talk to your agent to figure out some of the details of our meetings, or things like that. I think that's roughly where things are going, and I think everyone here is excited about that.
I really like the visual analogy of sensors and actuators. I actually hadn't thought about that. That's super interesting.
Okay, I think we have to end on a question about education, because you are probably one of the very best in the world at making complex technical concepts simple, and you are deeply thoughtful about how we design education around that. What still remains worth learning deeply when intelligence gets cheap, as we move into the next era of AI?
Yeah, there was a tweet that blew my mind recently, and I keep thinking about it every other day. It was something along the lines of: you can outsource your thinking, but you can't outsource your understanding. I think that's really nicely put. I'm still part of the system, and information still has to somehow make it into my brain. I feel like I'm becoming a bottleneck in just knowing what we are trying to build, why it's worth doing, how I direct my agents, and so on. So I do still think that ultimately something has to direct the thinking and the processing, and that's still fundamentally constrained by understanding.
This is one reason I was also very excited about LLM knowledge bases, because I feel like that's a way for me to process information. Any time I see a different projection onto information, I feel like I gain insight. So it's really just a lot of prompts for me to do synthetic data generation over some fixed data. Whenever I read an article, I have my wiki that's being built up from these articles, and I love asking questions about things. I think these are ultimately tools to enhance understanding in a certain way. And this is still a bit of a bottleneck, because you can't be a good director without understanding, and the LLMs certainly don't excel at understanding. You are still uniquely in charge of that. So I think tools to that effect are incredibly interesting and exciting. I'm excited to be back here in a couple of years and see if we've been fully automated out of the loop and they actually take care of understanding as well.
Thank you so much for joining us, Andrej. We really appreciate it.