
Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416

Published: 2024-03-07 21:56:37
I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that, for reasons of security, we should keep AI systems under lock and key because it's too dangerous to put them in the hands of everybody. That would lead to a very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems.

I believe that people are fundamentally good. So if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.

So I share that feeling. Okay, I think people are fundamentally good. And in fact, a lot of doomers are doomers because they don't think that people are fundamentally good.

The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open sourcing AI development and have been walking the walk by open sourcing many of their biggest models, including Llama 2 and eventually Llama 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good.

It will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be a somewhat controversial position. And so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.

You've had some strong statements, technical statements, about the future of artificial intelligence recently, throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence.

These are the large language models like GPT-4, like Llama 2 and 3 soon, and so on. How do they work and why are they not going to take us all the way?

For a number of reasons. The first is that there are a number of characteristics of intelligent behavior: for example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan.

Those are four essential characteristics of intelligent systems or entities, humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world. They don't really have persistent memory. They can't really reason, and they certainly can't plan. And so, if you expect the system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful. They're certainly useful. Or that they're not interesting, or that we can't build a whole ecosystem of applications around them. Of course we can. But as a path towards human-level intelligence, they're missing essential components.

And then there is another tidbit or fact that I think is very interesting. Those LLMs are trained on enormous amounts of text, basically the entirety of all publicly available text on the internet, right? That's typically on the order of 10 to the 13 tokens. Each token is typically two bytes. So that's two times 10 to the 13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge that those systems can accumulate. But then you realize it's really not that much data.

If you talk to developmental psychologists, they tell you a four-year-old has been awake for 16,000 hours in his or her life. And the amount of information that has reached the visual cortex of that child in four years is about 10 to the 15 bytes. You can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. And so that's 10 to the 15 bytes for a four-year-old versus two times 10 to the 13 bytes for 170,000 years' worth of reading. What that tells you is that through sensory input we see a lot more information than we do through language.
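
The comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the conversation (10^13 tokens, 2 bytes per token, 16,000 waking hours, roughly 20 MB/s through the optic nerve). The implied reading speed is derived here, not quoted, and assumes 365 reading days per year.

```python
# Back-of-the-envelope check of the figures quoted in the conversation.
TOKENS = 1e13                 # training tokens (quoted)
BYTES_PER_TOKEN = 2           # bytes per token (quoted)
text_bytes = TOKENS * BYTES_PER_TOKEN

AWAKE_HOURS = 16_000          # waking hours of a four-year-old (quoted)
OPTIC_NERVE_BYTES_PER_SEC = 20e6   # about 20 megabytes per second (quoted)
visual_bytes = AWAKE_HOURS * 3600 * OPTIC_NERVE_BYTES_PER_SEC

# How many times more data arrives through vision than through the text corpus.
ratio = visual_bytes / text_bytes

# Reading speed implied by "170,000 years at eight hours a day".
# Assumption (not in the source): 365 reading days per year.
reading_seconds = 170_000 * 365 * 8 * 3600
implied_tokens_per_sec = TOKENS / reading_seconds

print(f"text corpus:   {text_bytes:.3e} bytes")
print(f"visual input:  {visual_bytes:.3e} bytes")
print(f"vision / text: {ratio:.1f}x")
print(f"implied reading speed: {implied_tokens_per_sec:.1f} tokens/s")
```

With these inputs the visual stream comes out at a little over 10^15 bytes, a few dozen times larger than the text corpus, which is consistent with the order-of-magnitude claim in the conversation.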

And that, despite our intuition, most of what we learn and most of our knowledge is through our observation and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn, has nothing to do with language.

So it would be good to maybe push against some of the intuition behind what you're saying. It is true there are several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, to filter the data very quickly.

Somebody might argue with your comparison between sensory data and language, that language is already very compressed. It already contains a lot more information than the bytes it takes to store it, if you compare it to visual data. So there's a lot of wisdom in language. There are words, and the way we stitch them together already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able to construct, from that language, a world model and an understanding of the world, an understanding of the physical world that you're saying LLMs lack?

So it's a big debate among philosophers and also cognitive scientists, whether intelligence needs to be grounded in reality. I'm clearly in the camp that yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality. It could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models.

There are a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language. Everything that's physical, mechanical, whatever: when we build something, when we accomplish a task, a manual task of grabbing something, et cetera, we plan our action sequences, and we do this by essentially imagining the outcome of a sequence of actions that we might take.

That requires mental models that don't have much to do with language. That is, most of our knowledge is derived from that interaction with the physical world. A lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. Then other people, coming from the NLP side or maybe with some other motivation, don't necessarily agree with that.

The complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence. This is the old Moravec's paradox, from the pioneer of robotics Hans Moravec, who said: how is it that with computers it seems to be easy to do high-level complex tasks like playing chess and solving integrals and doing things like that, whereas the things we take for granted that we do every day, like, I don't know, learning to drive a car or grabbing an object, we can't do with computers?

We have LLMs that can pass the bar exam, so they must be smart, but then they can't learn to drive in 20 hours like any 17-year-old. They can't learn to clear up the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? What are we missing? What type of learning or reasoning architecture or whatever are we missing that basically prevents us from having level 5 self-driving cars and domestic robots?

Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time, so it can operate in a space of concepts?

So yeah, that's what a lot of people are working on. The short answer is no, and the more complex answer is that you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. A classical way of doing this is you train a vision system in some way, and we have a number of ways to train vision systems: supervised, semi-supervised, self-supervised, all kinds of different ways. That will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens that a typical LLM takes as an input.

And then you just feed that to the LLM in addition to the text, and you just expect the LLM, during training, to be able to use those representations to help make decisions. I mean, there's been work along those lines for quite a long time. And now you see those systems, right? I mean, there are LLMs that have some vision extension. But they're basically hacks, in the sense that those things are not trained end-to-end to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.

So you think there's something special about intuitive physics, about sort of common sense reasoning about the physical space, about physical reality. That to you is a giant leap that LLMs are just not able to do?

We're not going to be able to do this with the type of LLMs that we are working with today. And there's a number of reasons for this. But the main reason is the way LLMs are trained: you take a piece of text, you remove some of the words in that text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing.

And if you build this neural net in a particular way, so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trained to predict the next word in a text, right? So then you can feed it a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly. So what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens, which are kind of subword units. And so it's easy to handle the uncertainty in the prediction there, because there is only a finite number of possible words in the dictionary, and you can just compute a distribution over them. Then what the system does is pick a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution. So you sample from that distribution to actually produce a word, and then you shift that word into the input. That allows the system now to predict the second word, right? And once you do this, you shift it into the input, et cetera. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs. But we just call them LLMs.

And there is a difference between this kind of process and a process by which, before producing a word, when you talk, when you and I talk (you and I are bilingual):
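
The generation loop described above (produce a distribution over tokens, sample one, shift it into the input, repeat) can be sketched in a few lines. This is a toy illustration, not how any real LLM is implemented: the lookup table `NEXT_TOKEN_PROBS`, the tiny vocabulary, and the helper names are all hypothetical, and the "model" here only looks at the last token, whereas a real LLM conditions on the whole left context with a neural net.

```python
import random

# Hypothetical toy "model": maps the most recent token to a probability
# distribution over the next token. A real LLM would compute this with a
# neural network that attends to the entire left context.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(context):
    """Return a distribution over the finite vocabulary given the context."""
    return NEXT_TOKEN_PROBS[context[-1]]


def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive generation: sample a token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        # Sample rather than take the argmax: likelier tokens are picked
        # more often, but not always.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(token)  # shift the sampled token into the input
        if token == "</s>":
            break
    return tokens


print(generate(["<s>"]))
```

Note that nothing in the loop plans the sentence ahead of time: each token is chosen only from the distribution conditioned on what has already been produced, which is the property the conversation goes on to criticize.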

We think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing and the answer that we're planning to produce is not linked to whether we're going to say it in French or Russian or English.

Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language.

Right. It's certainly true for a lot of thinking that we do.

Is that obvious? Like, you're saying your thinking is the same in French as it is in English?

Yeah, pretty much.

Pretty much? Or is this like, how flexible are you, like if there's a probability distribution?

Well, it depends on what kind of thinking, right? If it's like producing puns, I get much better in French than English about that. Or worse.

Is there an abstract representation of puns? Like, is your humor abstract? When you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

There is an abstract representation of imagining the reaction of a reader to that text.

Right. Or you start with laughter and then figure out how to make that happen?

Or figure out a reaction you want to cause and then figure out how to say it, right? So that it causes that reaction. And that's really close to language. But think about a mathematical concept, or imagining something you want to build out of wood, or something like this, right? The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language. You're imagining mental models of the thing, right? If I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language. And so clearly there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say if the output is uttered words, as opposed to an output being muscle actions, right? We plan our answer before we produce it.

And LLMs don't do that. They just produce one word after the other, instinctively if you want. It's a bit like the subconscious actions where you're distracted, you're doing something, you're completely concentrated. Someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention. You sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it, because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.

But you're making it sound like one-token-at-a-time generation is bound to be simplistic. But if the world model is sufficiently sophisticated, then one token at a time, the most likely sequence of tokens it generates is going to be a deeply profound thing.

Okay, but then that assumes that those systems actually possess an internal world model.

So it really goes to, I think, the fundamental question: can you build a really complete world model? Not complete, but one that has a deep understanding of the world.

Yeah. So can you build this, first of all, by prediction? And the answer is probably yes. Can you build it by predicting words? And the answer is most probably no, because language is very poor in terms of bandwidth, weak or low bandwidth if you want. There's just not enough information there. So building world models means observing the world and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. So what a world model really is: here is my idea of the state of the world at time t, here is an action I might take. What is the predicted state of the world at time t plus one? Now, that state of the world does not need to represent everything about the world. It just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details.
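
The definition above (state at time t, plus an action, gives a predicted state at time t plus one) can be made concrete with a toy example. Everything in this sketch is a hypothetical illustration: the one-dimensional `State`, the three actions, the hand-written `world_model`, and the brute-force `plan` function stand in for what would be learned components in a real system.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    # Abstract state: keeps only what matters for planning (a position),
    # not every detail of the world.
    position: int


ACTIONS = ("left", "stay", "right")


def world_model(state, action):
    """Predict the state at time t+1 from the state at time t and an action."""
    step = {"left": -1, "stay": 0, "right": 1}[action]
    return State(position=state.position + step)


def plan(state, goal, horizon=5):
    """Imagine every action sequence of length `horizon` and keep the one
    whose predicted final state is closest to the goal."""
    best_seq, best_dist = None, None
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)  # roll the model forward, no real action taken
        dist = abs(s.position - goal)
        if best_dist is None or dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist


seq, dist = plan(State(0), goal=3)
print(seq, dist)
```

The planner never acts in the world while searching; it only queries the model for imagined outcomes, which is the sense of "planning" used in the conversation. Exhaustive search is used here only because the toy problem is tiny.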

Now here is the problem. You're not going to be able to do this with generative models. So a generative model that's trained on video, and we've tried to do this for 10 years: you take a video, show a system a piece of video, and then ask it to predict the remainder of the video. Basically predict what's going to happen, one frame at a time. Do the same thing as the autoregressive LLMs do, but for video, either one frame at a time or a group of frames at a time.

But yeah, a large video model, if you want. The idea of doing this has been floating around for a long time. And at FAIR, some colleagues and I have been trying to do this for about 10 years. And you can't really do the same trick as with LLMs, because with LLMs, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict the distribution over words.

Now, if you go to video, what you would have to do is predict the distribution over all possible frames in the video. And we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. And therein lies the main issue. The reason we can't do this is because the world is incredibly more complicated and richer in terms of information than text. Text is discrete. Video is high-dimensional and continuous, with a lot of details in it.

So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict this is a room where there's a light and there is a wall and things like that. It can't predict what the painting on the wall looks like or what the texture of the couch looks like.

Certainly not the texture of the carpet. So there's no way I can predict all those details. One way to possibly handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. The latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet and that you need to augment the system with for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch and the painting on the wall.

That has been a complete failure, essentially. And we've tried lots of things. We tried just straight neural nets. We tried GANs. We tried VAEs, all kinds of regularized autoencoders. We tried many things. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system.

And that also has basically failed, like all the systems that attempt to predict missing parts of an image or video from a corrupted version of it. So, right: take an image or video, corrupt it or transform it in some way, and then try to reconstruct the complete video or image from the corrupted version, and then hope that internally the system will develop a good representation of images that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure, and yet it works really well for text.

That's the principle that is used for LLMs, right?

So where's the failure exactly? Is it that it's very difficult to form a good representation of an image, like a good embedding of all the important information in the image? Is it in terms of the consistency of image to image to image that forms the video? If you do a highlight reel of all the ways you failed, what does that look like?

Okay, so the reason this doesn't work is, first of all, I have to tell you exactly what doesn't work, because there is something else that does work.

So the thing that does not work is training the system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. Okay, that's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders. Something called MAE, developed by some of my colleagues at FAIR: masked autoencoder. It's basically like LLMs or things like this, where you train the system by corrupting text, except you corrupt images. You remove patches from the image and you train a gigantic neural network to reconstruct it.
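
The masking-and-reconstruction objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MAE implementation: the "image" is a list of toy patches, the 0.75 mask ratio is an assumed setting, and the "model" is a hypothetical placeholder that fills every masked patch with the mean visible pixel value instead of a trained neural network.

```python
import random


def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Hide a random subset of patches; return the visible ones and the
    indices of the hidden ones."""
    rng = random.Random(seed)
    n_masked = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_masked))
    visible = {i: p for i, p in enumerate(patches) if i not in masked_idx}
    return visible, sorted(masked_idx)


def reconstruction_loss(patches, predictions, masked_idx):
    """Mean squared error in pixel space, computed on the masked patches."""
    total, count = 0.0, 0
    for i in masked_idx:
        for target, pred in zip(patches[i], predictions[i]):
            total += (target - pred) ** 2
            count += 1
    return total / count


# Toy "image": 8 patches of 4 pixels each.
patches = [[float(i)] * 4 for i in range(8)]
visible, masked_idx = mask_patches(patches)

# Placeholder "model" (assumption): predict the mean visible pixel everywhere.
visible_pixels = [v for p in visible.values() for v in p]
mean_pixel = sum(visible_pixels) / len(visible_pixels)
predictions = {i: [mean_pixel] * 4 for i in masked_idx}

loss = reconstruction_loss(patches, predictions, masked_idx)
print(f"masked {len(masked_idx)} of {len(patches)} patches, pixel MSE = {loss:.3f}")
```

The key point for the discussion is where the loss lives: it is measured pixel by pixel, so the model is rewarded for reproducing every detail of the hidden patches.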

The features you get are not good. And you know they're not good because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, et cetera, you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pre-training.

So the architecture is good.

The architecture is good. The architecture of the encoder is good. But the fact that you train the system to reconstruct images does not lead it to produce good generic features of images.

When you train in a self-supervised way.

Self-supervised by reconstruction.

Yeah, by reconstruction. Okay, so what's the alternative?

The alternative is joint embedding.

What is joint embedding?

So what are these architectures that you're so excited about?

Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, and you run them both through encoders, which in general are identical, but not necessarily. And then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. So it's joint embedding because you're taking the full input and the corrupted or transformed version and running them both through encoders, so you get a joint embedding, and then you're asking: can I predict the representation of the full one from the representation of the corrupted one?

Okay, and I call this a JEPA, which means joint embedding predictive architecture, because it's joint embedding and there is this predictor that predicts the representation of the good guy from the bad guy. And the big question is, how do you train something like this? Until five or six years ago, we didn't have particularly good answers for how you train those things, except for one called contrastive learning. The idea of contrastive learning is you take a pair of images that are, again, an image and a corrupted or degraded or somehow transformed version of the original one, and you train the predicted representation to be the same as that of the original. If you only do this, the system collapses. It basically completely ignores the input and produces representations that are constant.
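
The collapse problem described above can be shown with a toy calculation. In the sketch below, all names and numbers are hypothetical: the encoders are fixed linear maps rather than trained networks, the predictor is the identity, and the loss is a plain squared error in representation space. The point is only that an encoder which ignores its input achieves a perfect score on this objective.

```python
def encoder(x, weights):
    """Hypothetical linear encoder: maps an input vector to a representation."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def predictor(z):
    """Hypothetical predictor; the identity map, for simplicity."""
    return list(z)


def jepa_loss(x_full, x_corrupted, weights):
    """Squared error between the predicted and the actual representation of
    the full input. Note the loss lives in representation space, not pixels."""
    target = encoder(x_full, weights)
    pred = predictor(encoder(x_corrupted, weights))
    return sum((p - t) ** 2 for p, t in zip(pred, target))


x_full = [1.0, 2.0, 3.0, 4.0]
x_corrupted = [1.0, 0.0, 3.0, 0.0]  # same input with two entries masked out

# An encoder that keeps information about the input...
informative = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
# ...versus a collapsed encoder that outputs a constant for every input.
collapsed = [[0.0] * 4, [0.0] * 4]

loss_informative = jepa_loss(x_full, x_corrupted, informative)
loss_collapsed = jepa_loss(x_full, x_corrupted, collapsed)
print(loss_informative, loss_collapsed)
```

Minimizing this loss alone prefers the collapsed encoder, since a constant representation is trivially predictable. That is why the training recipe needs something extra, such as the contrastive term discussed next.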

So the contrastive methods avoid this. Those things have been around since the early 90s; I had a paper on this in 1993. You also show pairs of images that you know are different, and then you push the representations away from each other. So you say that not only should representations of things that we know are the same be the same, or similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitations. There's a whole bunch of techniques that have appeared over the last six or seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places. But there are limitations to those contrastive methods.
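
One common way to write down the "pull same pairs together, push different pairs apart" idea is a margin-based pairwise loss. The sketch below is a generic illustration and is not claimed to be the formulation used in any specific paper mentioned in the conversation; the `margin` parameter and the function names are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(z1, z2, same, margin=1.0):
    """Pairwise contrastive loss (margin-based sketch).

    same=True:  penalize any distance between the two representations.
    same=False: penalize the pair only if it is closer than `margin`.
    """
    d = distance(z1, z2)
    if same:
        return d ** 2                   # pull positives together
    return max(0.0, margin - d) ** 2    # push negatives at least `margin` apart


# A collapsed encoder maps everything to the same point.
collapsed_a, collapsed_b = [0.0, 0.0], [0.0, 0.0]
print(contrastive_loss(collapsed_a, collapsed_b, same=True))   # positives: no penalty
print(contrastive_loss(collapsed_a, collapsed_b, same=False))  # negatives: penalized

# Well-separated representations of different images incur no penalty.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same=False))
```

With negative pairs in the objective, the constant-output solution is no longer free: any two inputs known to be different are penalized for sharing a representation, which is how contrastive training prevents collapse.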

What has changed in the last three or four years is that now we have methods that are non-contrastive, so they don't require those negative, contrastive samples of images that we know are different. You train them only with images that are different versions or different views of the same thing, and you rely on some other tweaks to prevent the system from collapsing. We have half a dozen different methods for this now.

So what is the fundamental difference between joint embedding architectures and LLMs? Can JEPA take us to AGI? Although we should say that you don't like the term AGI, and we'll probably argue. I think every single time I've talked to you we've argued about the G in AGI. I get it. We'll probably continue to argue about it, it's great. You like AMI, because you like French, and ami is, I guess, friend in French, and AMI stands for advanced machine intelligence. But either way, can JEPA take us towards that advanced machine intelligence?

Well, so it's a first step. First of all, what's the difference with generative architectures like LLMs? LLMs, or vision systems that are trained by reconstruction, generate the inputs, right? They generate the original input that is non-corrupted, non-transformed. So you have to predict all the pixels, and there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels; you're only trying to predict an abstract representation of the inputs, right? And that's much easier in many ways.

So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but only information that is relatively easily predictable. Okay. So there's a lot of things in the world that we cannot predict. For example, if you have a self-driving car driving down the street or road, there may be trees around the road, and it could be a windy day. So the leaves on the tree are moving in kind of semi-chaotic, random ways that you can't predict and you don't care about. You don't want to predict them. So what you want is for your encoder to basically eliminate all those details. It will tell you there are moving leaves, but it's not going to keep the details of exactly what's going on. And so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. And that not only is a lot simpler, but it also allows the system to essentially learn an abstract representation of the world where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation.
当JAPA系统在训练时,它试图做的是尽可能地从输入中提取信息,但只提取相对容易预测的信息。在世界上有很多事情是我们无法预测的。例如,如果有一辆无人驾驶汽车在街道上行驶,可能周围有树,天可能是刮风的。树上的叶子可能在半混沌随机的方式下移动,这是无法预测的,也是我们不关心预测的。因此,您希望您的编码器基本上消除所有这些细节。它会告诉您有叶子在移动,但不会保留确切的细节。因此,在表示空间中进行预测时,您不必预测每片叶子的每个像素。这不仅更简单,而且还使系统能够学习和抽象世界的表示,其中可以建模和预测的内容被保留,其余内容被视为噪音并由编码器消除。因此,它提高了表示的抽象级别。

If you think about this, this is something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction. And we don't always describe every natural phenomenon in terms of quantum field theory, right? That would be impossible. So we have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory to atomic theory and molecules in chemistry, materials, and all the way up to kind of concrete objects in the real world and things like that. So we can't just model everything at the lowest level. And that's what the idea of JEPA is really about: learn abstract representations in a self-supervised manner. And you can do it hierarchically as well. So that, I think, is an essential component of an intelligent system.
如果你想想这个问题,这其实是我们经常做的事情。每当我们描述一个现象时,我们都是在一个特定的抽象层面上描述它。而且我们并不总是用量子场论来描述每一个自然现象,对吧?那是不可能的。所以我们有多个抽象层次来描述世界发生的事情,从量子场论到原子理论、化学中的分子、材料,一直到现实世界中的完整物体等等。所以我们不能只在最低层面上建模一切。这就是JAPA的概念真正涉及的。以自我监督的方式学习抽象表示。而且你也可以按层次进行。所以我认为这是智能系统的一个自然组成部分。

And in language, we can get away without doing this because language is already to some level abstract and has already eliminated a lot of information that is not predictable. And so we can get away without doing the joint embedding, without lifting the abstraction level, and by directly predicting words. So joint embedding, it's still generative, but it's generative in this abstract representation space. And you're saying that with language, we were lazy because we already got the abstract representation for free. And now we have to zoom out and actually think about generally intelligent systems. We have to deal with the full mess of physical reality. And you do have to do this step of jumping from the full, rich, detailed reality to an abstract representation of that reality, based on which you can reason and all that kind of stuff. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture some internal structure of it. And so there is way more redundancy and structure in perceptual input, sensory input like vision, than there is in text, which is not nearly as redundant.
在语言中,我们可以不做这个,因为语言本身已经在一定程度上是抽象的,并且已经消除了很多不可预测的信息。因此,我们可以不进行量子编辑,不提升抽象级别,而是直接预测词语。所以联合嵌入仍然是生成的,但是在这个抽象表示空间中生成。你说语言上我们懒惰了,因为我们已经免费获得了抽象表示。现在我们必须放大视野,真正思考普遍智能系统。我们必须处理现实的完整混乱。你必须从丰富详细的现实跳到一个基于这个现实的抽象表示,从而进行推理和其他处理。事实是,那些通过预测学习的自监督算法,甚至在表示空间中,如果输入的数据更多冗余,就能学到更多概念。数据中的冗余越多,它们就能更好地捕捉一些内部结构。因此,在知觉输入、感官输入(比如视觉)中,比在文本中,冗余和结构更多。

This is back to the question you were asking a few minutes ago. Language might represent more information, really, because it's already compressed. You're right about that, but that means it's also less redundant. And so self-supervised learning will not work as well. Is it possible to join the self-supervised training on visual data and self-supervised training on language data? There is a huge amount of knowledge, even though you talk down about those 10 to the 13 tokens. Those 10 to the 13 tokens represent a large fraction of what us humans have figured out. Both the shit talk on Reddit and the contents of all the books and the articles and the full spectrum of human intellectual creation.
这回答回到您几分钟前提出的问题。语言可能实际上代表更多信息,因为它已经被压缩了。在这一点上你是对的,但这也意味着它更少冗余。因此,自监督学习效果可能不会那么好。是否可能将视觉数据上的自监督训练与语言数据上的自监督训练结合起来?尽管您对那十万亿标记不以为然,但其中蕴含着大量知识。这十万亿标记代表了我们人类所发现的大部分知识。无论是Reddit上的闲聊,还是所有书籍、文章以及全人类知识创造的完整光谱。

So is it possible to join those two together? Well, eventually, yes. But I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision language models. We're basically cheating. We're using language as a crutch to help the deficiencies of our vision systems to learn good representations from images and video. And the problem with this is that we might improve our vision language systems a bit, or our language models, by feeding them images. But we're not going to get to the level of even the intelligence or level of understanding of the world of a cat or a dog, which doesn't have language. They don't have language. And they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine that with language? Obviously, if we combine this with language, this is going to be a winner. But before that, we have to focus on how we get systems to learn how the world works.
那么是否可能将这两者结合在一起呢?最终,是可以的。但我认为如果我们太早这样做,就会冒着诱使作弊的风险。事实上,这正是人们目前使用视觉语言模型所做的事情。我们基本上是在作弊。我们利用语言来帮助我们的视觉系统从图像和视频中学习良好的表征。问题在于,我们可能通过为它们提供图像来改进我们的视觉语言系统,或者语言模型。但是我们不会达到猫或狗的智慧水平或对世界的理解,它们没有语言。它们没有语言。它们比其他任何生物更好地理解世界。它们可以计划非常复杂的动作并且想象一系列动作的结果。我们如何让机器在我们将其与语言结合之前学会这些呢?显然,如果我们将这两者结合,这将是一个赢家。但在此之前,我们必须专注于如何让系统学会世界是如何运作的。

So this joint embedding predictive architecture, for you, that's going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing. That's the hope. In fact, the techniques we're using are non-contrastive. So not only is the architecture non-generative, the learning procedures we're using are non-contrastive. We have two sets of techniques. One set is based on distillation, and there's a number of methods that use this principle. One by DeepMind called BYOL, a couple by FAIR, one called VICReg, and another one called I-JEPA. VICReg, I should say, is not a distillation method, actually, but I-JEPA and BYOL certainly are. And there's another one also called DINO, also produced at FAIR. And the idea of those things is that you take the full input, let's say an image. You run it through an encoder, which produces a representation, and then you corrupt that input or transform it, and run it through essentially what amounts to the same encoder with some minor differences. And then you train a predictor, sometimes the predictor is very simple, sometimes it doesn't exist, but you train a predictor to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch. You only train the part of the network that is fed with the corrupted input. The other network, you don't train, but since they share the same weights, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, with the collapse of the type that was explained before, where the system basically ignores the input. So that works very well. The two techniques we developed at FAIR, DINO and I-JEPA, work really well for that.
这种联合嵌入预测架构,对你来说,能够学到类似常识的东西,也就是猫用来预测如何最有效地捣乱主人的能力。这是希望。实际上,我们使用的技术是非约束性的。因此,不仅架构是非生成的,我们使用的学习程序也是非约束性的。我们有两套技术。一套基于蒸馏,有许多使用这一原理的方法。DeepMind的一种可能非常好,Fair出品的一对,一个叫做VicRag,另一个叫iJepa。我应该说,VicRag其实不是一种蒸馏方法,但iJepa和BUI肯定都是。还有另一种叫做Dino,也是Fair出品。这些东西的想法是,你拿到完整的输入,比如一个图像。你通过一个编码器运行它,产生一个表示,然后你破坏或转换那个输入,再通过实际上相当于同一编码器的东西,有一些差异。然后训练一个预测器,有时预测器非常简单,有时不存在,但训练一个预测器,从被破坏的输入中预测第一个未被破坏的输入的表示。但你只训练第二支路。你只训练那部分被破坏的输入输入的网络。其他网络,你不训练,但因为它们共享相同的权重,当你修改第一个时,也会放大第二个。通过各种技巧,你可以防止系统崩溃,就像之前解释的那种,系统基本上忽略了输入。这个方法非常有效。我们在Fair开发的两种技术,Dino和iJepa,在这方面表现非常出色。
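To make the "train only one branch" idea above concrete, here is a toy sketch with scalar weights. It assumes the variant in which the untrained (target) branch tracks the trained (online) branch through an exponential moving average, which is how BYOL-style distillation methods are commonly set up; the conversation itself only says the two branches share weights. All names, the learning rate, and the momentum value are hypothetical.

```python
def train_step(w_online, w_pred, w_target, x_clean, x_corrupt, lr=0.1, momentum=0.9):
    """One update of a toy scalar 'encoder + predictor' with a slowly moving target."""
    # Target branch: encodes the clean input and is treated as a constant (no gradient).
    target_rep = w_target * x_clean
    # Online branch: encodes the corrupted input, then predicts the target representation.
    prediction = w_pred * (w_online * x_corrupt)
    err = prediction - target_rep          # squared-error loss is err ** 2

    # Gradients flow only into the online encoder and the predictor.
    grad_online = 2.0 * err * w_pred * x_corrupt
    grad_pred = 2.0 * err * w_online * x_corrupt
    w_online -= lr * grad_online
    w_pred -= lr * grad_pred

    # The target is never trained directly; it follows the online weights (assumed EMA).
    w_target = momentum * w_target + (1.0 - momentum) * w_online
    return w_online, w_pred, w_target

w_online, w_pred, w_target = train_step(1.0, 1.0, 1.0, x_clean=2.0, x_corrupt=1.0)
print(w_online, w_pred, w_target)
```

This sketch omits the additional tricks that real methods rely on to avoid collapse; it only shows where the gradient does and does not flow.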

So what kind of data are we talking about here? So there are several scenarios. One scenario is you take an image, you corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it. But basic horrible things. Basic horrible things that sort of degrade the quality a little bit and change the framing, crop the image. In some cases, in the case of I-JEPA, you don't need to do any of this. You just mask some parts of it. You just basically remove some regions, like a big block, essentially. And then you run it through the encoders and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So that's I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking. Whereas with DINO, you need to know it's an image because you need to do things like geometric transformations and blurring and things like that that are really image-specific.
这里我们讨论的是什么样的数据?有几种情况,一种情况是你拿一张图片,通过改变裁剪、改变尺寸、改变方向、模糊化、改变颜色等方式使其腐败。但是这些都是基本的腐败方式,使图片质量稍微下降并改变构图。在一些情况下,比如iJepa,你不需要做任何这些操作。你只需要对一些部分进行遮掩。基本上就是移除一些像一个大块的区域。然后通过编码器和预测器训练整个系统,使其能从腐败图片的表示中预测出良好图片的表示。这就是iJepa。它不需要知道这是一张图像,因为它只需要知道如何进行这种遮掩。而Dino则需要知道这是一张图像,因为你需要进行几何变换、模糊处理等与图像相关的操作。

A more recent version of this that we have is called V-JEPA. So it's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it. And what we mask is actually kind of a temporal tube. So, a whole segment of each frame in the video over the entire video. And that tube is statically positioned throughout the frames? The tube, yeah, typically is 16 frames or something. And we mask the same region over the entire 16 frames. It's a different one for every video, obviously. And then again, you train that system so as to predict the representation of the full video from the partially masked video. That works really well. It's the first system that we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. So that's the first time we get something of that quality. So that's a good test that a good representation is formed. That means there's something to this. Yeah.
我们最近拥有的一个更新版本叫做VJepa。所以基本上它和iJepa是相同的想法,只是应用在视频上。现在你可以选择整个视频并对其中的一整块进行遮罩。我们遮罩的实际上是一种时间管道。就像整个视频中的每一帧的一整个片段。这个管道在帧之间是保持静态位置的。这个管道通常是16帧左右。我们在整个16帧上遮罩相同的区域。显然,每个视频都有不同的遮罩区域。然后再训练这个系统,让它可以从部分遮罩的视频中预测完整视频的表示。这个工作效果非常好。这是我们拥有的第一个可以学习视频的良好表示的系统,所以当你将这些表示输入到一个监督分类器头部时,它能够用相当高的准确性告诉你视频中正在发生的动作是什么。这是我们第一次得到这种质量的东西。这证明了一个良好的表示被形成了。这意味着这个系统确实有用。是的。
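The tube masking described above can be sketched as follows: pick one spatial block and mask that same block in every frame of the clip. The grid and block sizes, the function name, and the use of a patch grid are illustrative assumptions; the conversation only specifies that the same region is masked across roughly 16 frames.

```python
import random

def tube_mask(num_frames, height, width, block_h, block_w, rng):
    """Return a [frame][row][col] boolean mask; True marks a masked patch."""
    top = rng.randrange(height - block_h + 1)    # block position is sampled once per clip
    left = rng.randrange(width - block_w + 1)
    frame_mask = [
        [top <= r < top + block_h and left <= c < left + block_w for c in range(width)]
        for r in range(height)
    ]
    # The same spatial block is repeated in every frame, forming a tube through time.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]

mask = tube_mask(num_frames=16, height=8, width=8, block_h=4, block_w=4, rng=random.Random(0))
masked_per_frame = sum(cell for row in mask[0] for cell in row)
print(masked_per_frame)  # 16 patches masked in each frame (4 x 4 block)
```

Because the block is sampled once per call, each video gets a different tube, but the tube does not move between frames.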

We also have preliminary results that seem to indicate that the representation allows our system to tell whether the video is physically possible or completely impossible, because some object disappeared, or an object suddenly jumped from one location to another, or changed shape or something. So it's able to capture some physics-based constraints about the reality represented in the video. Yeah. About the appearance and the disappearance of objects. Yeah. That's really new. Okay. But can this actually get us to this kind of world model that understands enough about the world to be able to drive a car? Possibly. It's going to take a while before we get to that point. But there are systems already, you know, robotic systems, that are based on this idea. And what you need for this is a slightly modified version of this, where you imagine that you have a complete video. And what you're doing to this video is that you're either translating it in time towards the future, so you only see the beginning of the video, but you don't see the latter part of it that is in the original one, or you just mask the second half of the video, for example.
我们还有初步结果显示,这种表征似乎能够让我们的系统判断视频中是否存在物理不可能的情况,比如某物体消失了,或者某物体突然从一个位置跳到另一个位置,或者改变形状等等。因此,它能够捕捉一些关于视频中所呈现的现实的物理约束。关于物体的出现和消失等问题,是的,这真的很重要。但是,这是否真的能够让我们达到一个足够理解世界的模型,以便可以驾驶汽车?可能。在我们达到那一点之前,这需要一段时间。但是已经有一些基于这种想法的机器人系统。你需要的是这种方法的一个稍微改进的版本,想象一下你有一个完整的视频。你要做的是将这个视频在时间上向未来移动,所以你只看到视频的开始部分,而看不到在原始视频中的后面部分。或者你只遮挡视频的后半部分,例如。

And then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one. But you also feed the predictor with an action. For example, you know, the wheel is turned 10 degrees to the right or something. Right. So if it's, you know, a dash cam in a car and you know the angle of the wheel, you should be able to predict to some extent what's going to happen to what you see. You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen. So now what you have is an internal model that says, here is my idea of the state of the world at time t. Here is an action I'm taking. Here is a prediction of the state of the world at time t plus one, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning.
然后,你训练一个像我描述的那种JAPA系统,来预测从移位视频到完整视频的表示。但你也会用一个动作来输入给预测器。例如,你知道,方向盘向右转了10度之类的。对吧。所以如果是在汽车上的行车记录仪,你知道方向盘的角度,你应该能够在一定程度上预测你所看到的将会发生什么。很显然,你不可能预测视野中出现的所有物体的细节,但在抽象表示层面上,你可能能够预测未来会发生什么。所以现在你拥有了一个内部模型,它说,这是我在时间t时对世界状态的理解。这是我正在采取的行动。这是我对在时间t + 1、t + Δt、t + 2秒等之处世界状态的预测。如果你有这种类型的模型,你可以用它来进行规划。

So now you can do what LLMs cannot do, which is planning what you're going to do so as to arrive at a particular outcome or satisfy a particular objective. Right. So you can have a number of objectives. Right. You know, I can predict that if I have an object like this, right, and I open my hand, it's going to fall, right? And if I push it with a particular force on the table, it's going to move. If I push the table itself with the same force, it's probably not going to move. So we have this internal model of the world in our mind, which allows us to plan sequences of actions to arrive at a particular goal.
所以现在你可以做LLM不能做的事情,那就是规划你要做什么,以便达到特定的结果或满足特定的目标。对的。所以你可以有很多目标。如果你知道,我可以预测如果我有一个像这样的物体,对的,当我打开手掌,它会掉下来,对吧?如果我用特定的力量推它到桌子上,它会移动。如果我推动桌子本身,可能不会以相同的力量移动。所以我们心中有这个世界的内部模型,它允许我们规划一系列动作来达到特定目标。

And so now if you have this world model, we can imagine a sequence of actions, predict what the outcome of the sequence of action is going to be, measure to what extent the final state satisfies a particular objective, like, you know, moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective at runtime. We're not talking about learning. We're talking about inference time. Right. So this is planning, really. And in optimal control, this is a very classical thing. It's called model predictive control.
因此,如果你有这个世界模型,我们可以想象一系列的行动,预测行动序列的结果会是什么,衡量最终状态在多大程度上满足特定目标,比如,你知道,将瓶子移到桌子的左边,然后在运行时规划一系列行动,以最小化此目标。我们不是在谈论学习,我们在谈论推理时刻。对,这就是规划。在最优控制中,这是一个非常经典的事情。这被称为模型预测控制。
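The planning loop described above can be sketched in a few lines. Everything here is a toy stand-in: `world_model` is a hypothetical one-dimensional dynamics function rather than a learned predictor, the action set is a small discrete grid, and the search is exhaustive. One assumption differs slightly from the description: the cost is accumulated along the whole predicted trajectory rather than measured only at the end state, so that reaching the goal sooner is preferred.

```python
from itertools import product

def world_model(state, action):
    """Hypothetical stand-in for a learned predictor: next state from state and action."""
    return state + action

def cost(state, goal):
    """Objective to minimize: distance between the predicted state and the goal."""
    return abs(goal - state)

def plan(state, goal, horizon=3, actions=(-1.0, 0.0, 1.0)):
    """Search action sequences at inference time; no learning happens here."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = world_model(s, a)      # imagine the outcome of each action
            total += cost(s, goal)     # accumulate cost along the imagined trajectory
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost

def control_loop(state, goal, steps=5):
    """Model predictive control: execute the first planned action, then replan."""
    for _ in range(steps):
        seq, _ = plan(state, goal)
        state = world_model(state, seq[0])
    return state

print(control_loop(0.0, 4.0))
```

Replanning at every step is what lets the controller correct course when the real outcome differs from what the model predicted.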

You have a model of the system you want to control that, you know, can predict the sequence of states corresponding to a sequence of commands. And you're planning a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix. This is the way, you know, rocket trajectories have been planned since computers have been around. So since the early 60s, essentially. So yes to model predictive control, but you also often talk about hierarchical planning. Yeah. Can hierarchical planning emerge from this somehow? Well, so no, you will have to build a specific architecture to allow for hierarchical planning.
你拥有一个系统模型,你想要控制它,可以预测状态分数的序列对应到一系列命令的序列。你正在规划一系列命令,以便根据你的世界模型,系统的最终状态将满足你设定的目标。这就是你所知道的自计算机出现以来火箭轨迹规划的方式。从60年代初开始,基本上是这样。所以是的,对于模型预测控制,但你也经常谈到层次规划。嗯,层次规划能够在其中产生吗?嗯,不行,你必须构建一个特定的架构才能支持层次规划。

So hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go, let's say, from New York to Paris, this is the example I use all the time. And I'm sitting in my office at NYU. My objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location. I will have to decompose this into two sub-goals. The first one is go to the airport. The second one is catch a plane to Paris. Okay, so my sub-goal is now going to the airport. My objective function is my distance to the airport. How do I go to the airport? Well, I have to go down on the street and hail a taxi, which you can do in New York.
因此,如果您想要规划复杂的行动,分层规划是绝对必要的。举个例子,假设我要从纽约去巴黎,这是我经常使用的例子。我坐在纽约大学的办公室里。我需要最小化的目标是我到巴黎的距离。在一个非常抽象的高层次上,这是我的位置。我必须将这个目标分解为两个子目标。第一个是去机场,第二个是搭乘飞机去巴黎。好的,所以我的子目标现在是去机场。我的目标函数是我到达机场的距离。我该怎么去机场呢?我必须在街上找到出租车,纽约是可以做到的。

Okay, now I have another sub-goal: go down on the street. What that means is going to the elevator, going down the elevator, walking out to the street. How do I go to the elevator? I have to stand up from my chair, open the door of my office, go to the elevator, push the button. How do I get up from my chair? Like, you know, you can imagine going down all the way to basically what amounts to millisecond-by-millisecond muscle control. Okay, and obviously you're not going to plan your entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. First, that would be incredibly expensive, but it would also be completely impossible because you don't know all the conditions of what's going to happen.
好的,现在我有另一个子目标:走下街道。这意味着去乘坐电梯,下到电梯后,走向街道。我要怎么去电梯呢?我必须从椅子上站起来,打开办公室的门,走向电梯,按下按钮。我要怎么从椅子上站起来?就像,你知道的,你可以想象走下去一直走到基本上相当于毫秒、记忆、秒级的肌肉控制。好的,显然你并不会以毫秒、记忆、秒级的肌肉控制来规划你从纽约到巴黎的整个旅程。首先,那将是极其昂贵的,但也完全不可能,因为你不知道将会发生什么条件。

You know, how long is it going to take to catch a taxi or to go to the airport with traffic? You know, I mean, you would have to know exactly the condition of everything to be able to do this planning. And you don't have the information. So you have to do this hierarchical planning so that you can start acting and then sort of replan as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works. Does something like that already emerge?
你知道吗,要坐出租车或者赶飞机需要花多长时间?你知道的,我是说,为了做好计划,你必须确切了解一切的状况。而你并没有这些信息。因此,你必须进行层级规划,这样你才能开始行动,然后在行动过程中进行再规划。而在人工智能领域,没有人真正知道如何做到这一点。没有人知道如何训练系统学习适当的多级别表达,以使层级规划起作用。类似的东西已经出现了。

So like, can you use an LLM, state of the art LLM to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did, which is, can you give me a list of 10 steps I need to do to get from New York to Paris? And then for each of those steps, can you give me a list of 10 steps? How I make that step happen? And for each of those steps, can you give me a list of 10 steps to make each one of those until you're moving your muscle individual muscles? Maybe not. Whatever you can actually act upon using your mind. Right.
所以,比如说,你能不能使用一种最先进的LLM(机器学习模型)来通过刚才你所做的那种详细的问题集,从纽约到巴黎?也就是说,你能不能给我列出10个步骤,我需要做些什么才能从纽约到巴黎?然后针对这些步骤,你能不能再给我列出10个步骤?我怎样让每一个步骤发生?然后对于每个步骤,你能不能给我列出10个步骤,直到你能够动用你的肌肉,或者说,无论如何,你能够使用你的头脑来实际行动。是的。

So there's a lot of questions that are actually implied by this. Right. So the first thing is, LLMs will be able to answer some of those questions down to some level of abstraction, under the condition that they've been trained with similar scenarios in their training set. They would be able to answer all of those questions, but some of them may be hallucinated, meaning non-factual. Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really produce millisecond-by-millisecond muscle control of how you stand up from your chair. Right. So, down to some level of abstraction where you can describe things by words, they might be able to give you a plan, but only under the condition that they've been trained to produce those kinds of plans. Right.
因此,我对这件事有很多疑问。首先,LLM(大型语言模型)可以在一定程度的抽象水平下回答其中一些问题,前提是它们在训练集中接受了类似情境的训练。它们可以回答所有这些问题,但其中一些可能是虚构的,意味着不准确。是的,我是说,它们可能会提供一些答案,但它们不太可能真正地控制你如何从椅子上站起来的肌肉动作。在某种程度上,你可以用言语描述事物,它们可能会给出一个计划,但前提是它们已经接受过训练来生成这些计划。

They're not going to be able to plan for situations that they never encountered before. They basically are going to have to regurgitate the template that they've been trained on. But where, like, just for the example of New York to Paris, is it going to start getting into trouble? Like, at which layer of abstraction do you think it'll start? Because, like, I can imagine that for almost every single part of that, an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.
他们不会能够为以前从未遇到过的情况做出计划。基本上他们只能重复他们接受过训练的模板。但是,在像纽约到巴黎这样的示例中,什么时候会开始遇到麻烦呢?你认为会从哪一层抽象开始?因为我可以想象,几乎每一个部分,尤其是在谈论纽约和巴黎这样的大城市时,LLM都能够给出相当准确的答案。

So, I mean, certainly, an LLM would be able to solve that problem if you fine-tune it for it. Sure. And so, I can't say that an LLM cannot do this. It can do this if you train it for it. There's no question. Down to a certain level where things can be formulated in terms of words. But, like, if you want to go down to, like, how you, you know, climb down the stairs or just stand up from your chair in terms of words, like, you can't do it. That's one of the reasons you need experience of the physical world, which is much higher bandwidth than what you can express in words, in human language.
所以,我是说,当你发现你需要它时,LLM肯定可以解决这个问题。当然。所以,我不能说LLM不能做到这一点。如果你为此进行训练,它是可以做到的。没有疑问。到达一定程度,事情可以用语言来表达。但是,如果你想要深入到,比如,你知道怎么下楼梯或者从椅子上站起来这样的细节,用语言是表达不了的。你需要体验现实世界的经验,这比人类语言中可以表达的要高得多。

So, everything we've been talking about on the joint embedding space is it possible that that's what we need for, like, the interaction with physical reality for on the robotics front. And then just the LLM's are the thing that sits on top of it for the bigger reasoning about, like, the fact that I need to book a plane ticket and I need to know, I know how to go to the websites and so on. Sure. And, you know, a lot of plans that people know about that are relatively high level are actually learned. They're not people, most people don't invent the, you know, plans. By themselves, they, you know, we have some ability to do this, of course, obviously, but but most plans that people use are plans that they've been trained on. Like, they've seen other people use those plans or they've been told how to do things, right?
因此,我们一直在谈论的联合嵌入空间可能就是我们在机器人领域与现实物理互动所需要的东西。然后LLM仅仅是用来进行更高层次推理的工具,比如我需要订机票,我需要知道如何进入网站等等。当然,许多人知道的相对高级别的计划实际上是通过学习得来的。大多数人并不是自己发明计划,他们有一定的能力去做到这一点,当然,显然,但大多数人使用的计划是他们经过训练的。比如,他们看到其他人使用过这些计划,或者他们被告知如何去做事,对吧?

Like, you can't invent how you, like, take a person who's never heard of airplanes and ask them, like, how do you go from New York to Paris? And they're probably not going to be able to kind of, you know, deconstruct the whole plan unless they've seen examples of that before. So certainly, LLMs are going to be able to do this, but then how do you link this with the low level of actions that need to be done? That has to be done with things like JEPA that basically lift the abstraction level of the representation without attempting to reconstruct every bit of detail of the situation. That's what we need JEPAs for.
就像你无法想象,怎样向一个从未听说过飞机的人解释如何从纽约到巴黎一样。他们可能无法完全理解整个计划,我以前见过这种情况。当然,LLM的确可以做到这一点,但是如何将它们与需要完成的低层动作联系起来,比如JAPA这样能够提高抽象级别而不试图重新构建细节的表示方式。这就是为什么我们需要JAPA的形式。

I would love to sort of linger on your skepticism around autoregressive LLMs. So one way I would like to test that skepticism is: everything you say makes a lot of sense. But if I apply everything you said today, and in general, to, like, I don't know, 10 years ago, maybe a little bit less, no, let's say three years ago, I wouldn't be able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good? Yes.
我很想留在你对自回归LLM的怀疑上。所以我想测试你的怀疑的一种方式是,你说的每一点都很有道理。但如果我把你今天和通常说的一切都应用到,比如,我不知道,10年前,也许少一点,不,让我们说三年前,我就无法预测到成功的LLM。所以你觉得自回归LLM能够如此出色有道理吗?是的。

Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing. No, there's one thing that autoregressive LLMs, or LLMs in general, not just the autoregressive ones, but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning, and I've been a very, very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. The idea, you know, didn't start with BERT, but BERT was really kind of a good demonstration of this.
你能解释一下你的直觉吗?因为如果我按照你的智慧和直觉来理解的话,我会认为自回归LLM逐个标记的方式绝对不可能做到它们正在做的那种事情。不,自回归LLM或者说LLM们总体而言,并不仅仅是自回归的那种,还包括鸟类风格的双向的LLM们正在利用自我监督训练,而我多年来一直是自我监督训练的坚定拥护者。因此,这些事实实际上令人印象深刻地证明了自我监督训练的有效性。这个想法,并不是从鸟类开始的,但这确实是一个很好的示范。

So the idea is that, you know, you take a piece of text, you corrupt it, and then you train a gigantic neural net to reconstruct the parts that are missing. That has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction, systems that are multilingual. So it's a single system that can be trained to understand hundreds of languages and translate in any direction and produce summaries and then answer questions and produce text.
所以这个想法,你知道,你拿一段文字,你破坏它,然后你传输一个巨大的神经网络来重构缺失的部分。这产生了巨大的好处。这使我们能够创建能够理解语言的系统,能够在任何方向上翻译数百种语言的系统,能够多语言。所以它们不是一个单一的系统,可以被训练去理解数百种语言,翻译任何方向,产生摘要然后回答问题和产生文字。

And then there's a special case of it, you know, which is the autoregressive trick, where you constrain the system to not elaborate the representation of the text from looking at the entire text, but only to predict a word from the words that have come before, right? And you do this by constraining the architecture of the network, and that's what you can build an autoregressive LLM from.
然后还有一种特殊情况,即自回归技巧,通过约束系统不从整个文本中展开表示,而仅从前面的单词预测一个单词。你可以通过训练网络架构来做到这一点,然后就可以构建一个自回归的LMM。
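One common way to impose the "only look at earlier words" constraint through the architecture is a causal attention mask. The sketch below assumes a transformer-style attention layer, which the conversation does not spell out; the function names are made up for this example.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j is not in the future)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, giving zero weight to disallowed positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # uniform scores, for illustration only
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

With uniform scores, the first position attends only to itself, the second splits its weight between the first two positions, and so on; no position ever receives information from words that come after it.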

So there was a surprise many years ago with what's called decoder-only LLMs. So, you know, systems of this type that are just trying to produce words from the previous ones. And the fact that when you scale them up, they tend to really kind of understand more about language, when you train them on lots of data and you make them really big. That was kind of a surprise.
很多年前,有一个被称为仅解码器的LMM的意外。所以,你知道,这种系统只是试图从先前的系统生成单词。事实上,当你将它们扩大规模时,它们倾向于更深入地理解语言,当你用大量数据训练它们并使它们变得非常庞大时。这确实是一个惊喜。

And that surprise occurred quite a while back, like, you know, with work from, you know, Google, Meta, OpenAI, etc., you know, going back to, you know, the GPT kind of work, generative pre-trained transformers. You mean, like GPT-2, like there's a certain place where you start to realize scaling might actually keep giving us an emergent benefit. Yeah, I mean, there was work from various places.
这个惊喜发生在很久以前,就像是来自谷歌、META、OpenAI等机构的工作,回溯到GPT这种通用预训练转换器的工作。我是说,就像GPT二一样,有一个特定的地方,你开始意识到扩展可能会给我们带来不断涌现的好处。是的,我是说,来自不同地方的工作确实存在。

But if you want to kind of, you know, place it in the GPT timeline, that would be around GPT-2. Well, just because you said it, you're so charismatic and you said so many words, but self-supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world: if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?
但是如果你想把它放在GPT时间线中,它大约就在GPT-2附近。噢,我只是因为你说了它,你说得很有感染力。你说了那么多话,但是盯着监督学习。是的。但是,再次说,你所应用的直觉是说自回归LLMs无法对世界有深刻的理解。如果我们只是应用同样的直觉,你觉得它们能够形成足够的世界表现力,让人相信,基本上是通过了原始的图灵测试吗?

Well, we're fooled by their fluency, right? We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence, but that impression is false. We're really fooled by it. What do you think Alan Turing would say, without understanding anything, just hanging out with it? Alan Turing would decide that his Turing test is a really bad test. Okay, this is what the AI community decided many years ago, that the Turing test was a really bad test of intelligence.
嗯,他们的流利性骗了我们,对吗?我们只是假设如果一个系统擅长操纵语言,那么它就具备了人类智能的所有特征,但这种印象是错误的。我们被其欺骗了。你觉得艾伦·图灵会说什么?什么都不懂,只是和它一起接触。艾伦·图灵会认为他的图灵测试是一个真正糟糕的测试。好的,这就是人工智能社区多年来达成的共识,即图灵测试是一个对智能的真正糟糕的测试。

What would Hans Moravec say about the large language models? Hans Moravec would say that Moravec's paradox still applies. Okay. Okay. We can pass this. You don't think he'd be really impressed? No, of course, everybody would be impressed. But, you know, it's not a question of being impressed or not. It's the question of knowing what the limits of those systems are. Like, again, they are impressive. They can do a lot of useful things. There's a whole industry that is being built around them. They're going to make progress. But there's a lot of things they cannot do, and we have to realize what they cannot do.
汉斯·马瓦克会怎么看待大型语言模型?汉斯·马瓦克会说马瓦克悖论仍然适用。好的,我们可以通过这个。你不觉得你会非常震惊吗?不,当然,每个人都会被震惊。但是,你知道,这不是令人震惊与否的问题。问题在于了解这些系统的极限。再次强调,它们令人印象深刻。它们可以做很多有用的事情。围绕它们已经建立了整个工业。它们会取得进步。但是有很多事情它们做不到,我们必须意识到它们做不到的事情。

And then figure out, you know, how we get there. And, you know, I'm not just saying this. I'm saying this from basically, you know, 10 years of research on the idea of self-supervised learning. Actually, that's going back more than 10 years. But the idea of self-supervised learning is basically capturing the internal structure of a set of inputs without training the system for any particular task, right, learning representations. You know, the conference I co-founded 14 years ago is called the International Conference on Learning Representations. That's the entire issue that deep learning is dealing with, right? And it's been my obsession for almost 40 years now. So learning representations is really the thing.
然后找出,你知道,我们如何到达那个目标。我不是从这方面来看待这个问题。从基本上来说,我从对自我监督跑步理念进行了10年的研究。实际上,这个想法已经超过10年了。但自我监督跑步的想法,基本上是捕捉一组输入的内部结构,而不对系统进行任何特定任务的训练,学习代表性。你知道,我14年前联合创办的会议叫做国际学习代表性会议。这就是深度学习所要解决的问题,对吧?这几乎是我近40年的迷恋。学习代表性真的很重要。

For the longest time, we could only do this with supervised learning. And then we started working on, you know, what we used to call unsupervised learning, and sort of revived the idea of unsupervised learning in the early 2000s with Yoshua Bengio and Geoff Hinton. Then we discovered that supervised learning actually works pretty well if you can collect enough data. And so the whole idea of, you know, unsupervised or self-supervised learning took a backseat for a bit. And then I kind of tried to revive it in a big way, you know, starting in 2014, basically when we started FAIR, really pushing for finding new methods to do self-supervised learning, both for text and for images and for video and audio. And some of that work has been incredibly successful. I mean, the reason why we have multilingual translation systems, you know, or things to do content moderation at Meta, for example, on Facebook, that are multilingual, that understand whether a piece of text is hate speech or not or something, is due to that progress using self-supervised learning for NLP, combining this with, you know, transformer architectures and blah, blah, blah. But that's the big success of self-supervised learning.
很长一段时间以来,我们只能通过监督学习来做到这一点。然后我们开始研究,你知道的,我们过去称之为无监督学习,并在2000年代初与Yoshibengio和Jeff Hinton重拾无监督运行的想法。然后发现,如果能够收集足够的数据,监督运行实际上效果相当不错。因此,你知道,无监督自监督运行的整个概念在一段时间内被搁置。然后我在2014年开始,基本上是我们启动FAIR时,试图以一种大的方式重振它。我一直在努力寻找新的方法来进行文本、图像、视频和音频的自监督运行。其中一些工作取得了极大的成功。我是说,我们有多语言翻译系统的原因,就是通过利用自监督运行在NLP领域取得的进展。将这与变压器架构等结合起来,是自监督运行的巨大成功。

We had similar success in speech recognition with a system called wav2vec, which is also a joint embedding architecture, by the way, trained with contrastive learning. And that system can also produce speech recognition systems that are multilingual with mostly unlabeled data, and only need a few minutes of labeled data to actually do speech recognition. That's amazing. We have systems now, based on that combination of ideas, that can do real-time translation of hundreds of languages into each other. Speech to speech. Speech to speech, even including, which is fascinating, languages that don't have written forms. That's right. Just spoken only. That's right. We don't go through text. It goes directly from speech to speech using an internal representation of kind of speech units that are discrete. It's called textless NLP. We used to call it this way, but yeah, so, I mean, incredible success there.
我们在语音识别方面取得了类似的成功,一个被称为Wave2Vec的系统,顺便说一句,它也是一个联合嵌入式架构,是通过对比式训练的。这个系统也能够使用大部分未标记数据来生成多语言的语音识别系统。而且只需要几分钟的标记数据就能进行实际的语音识别。这太令人惊讶了。基于这些思想结合的系统,我们现在可以实现数百种语言之间的实时翻译。语音到语音。甚至包括那些没有书面形式的速食语言。没错。只有口头传达。对,我们不经过文本,它直接使用一种离散的语音单元的内部表示,这被称为文本小课程LPE。我们过去叫它这样的,但是,这个成功真是不可思议。

And then, you know, for 10 years, we tried to apply this idea to learning representations of images by training a system to predict videos, learning intuitive physics by training a system to predict what's going to happen in the video. And we tried and tried and failed and failed with generative models, with models that predict pixels. We could not get them to learn good representations of images. We could not get them to learn good representations of videos. We tried many times. We published lots of papers on it. Yeah, well, they kind of sort of worked, but not really great. Things started working once we abandoned this idea of predicting every pixel and basically just did the joint embedding and predicted in representation space. That works.
然后,你知道,在接下来的10年里,我们试图将这个想法应用于通过训练系统来预测视频、学习图像表示,通过训练系统来学习直觉物理,以预测视频中会发生什么。我们不断尝试,失败和失败,使用生成模型以及预测像素的模型。我们无法使它们学习到良好的图像表示。我们也无法使它们学习到良好的视频表示。我们尝试了很多次。我们发表了很多相关论文。是的,它们有点起作用,但并不是非常好。它们开始起作用了。我们放弃了预测像素的想法,基本上只在表示空间中进行预测。那很有效。

So there's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people, everybody is talking about generative AI. If you're really interested in human-level AI, abandon the idea of generative AI. Okay. But you really think it's possible to get far with the joint embedding representation? So like, there's common sense reasoning, and then there's high-level reasoning. I feel like those are two different things, the kind of reasoning that LLMs are able to do.
因此,有充分的证据表明,我们不能通过生成模型来学习现实世界的良好表示。所以我告诉大家,人人都在谈论生成AI。如果你真的对达到人类水平的人工智能感兴趣,就放弃生成AI的想法吧。但是你真的认为通过联合嵌入表示法可以取得进展。比如,有常识推理和高层推理。我觉得这两种推理是LLMs能够做到的。

Okay, let me not use the word reasoning, but the kind of stuff that LLMs are able to do seems fundamentally different from the common sense reasoning we use to navigate the world. It seems like we're going to need both. You're not sure. Would you be able to get there with the joint embedding, which is a JEPA-type of approach, looking at video? Would you be able to learn, let's see, well, how to get from New York to Paris, or how to understand the state of politics in the world today? Right. These are things where various humans generate a lot of language and opinions in the space of language, but don't visually represent that in a clearly compressible way.
好的,让我不使用“推理”这个词,但LLMs能够做的那种事情似乎与我们用来导航世界的常识推理根本不同。看起来我们将需要两者结合。你不确定。你能否理解共享嵌入,这是一种跃迁式方法,观看视频,你能否学会,让我们看看,如何从纽约到巴黎,或者如何理解今日政治和世界的状况?对。这些是一些人类在语言领域产生很多语言和观点的事情,但在清晰可压缩的方式上并未直观地表现出来。

Right. Well, there's a lot of situations that might be difficult for a purely language-based system to know. Okay, you can probably learn from reading text, the entirety of the publicly available text in the world, that I cannot get from New York to Paris by snapping my fingers. That's not going to work. Right. Yes. But there are probably more complex scenarios of this type, which an LLM may never have encountered and may not be able to determine whether it's possible or not.
没错。嗯,有很多情况可能对一个纯粹基于语言的系统来说是很困难的。好的,你可能可以通过阅读文本来学习,但我无法通过一挥手就把世界上所有公开可用的文本全部获取,从纽约到巴黎。这样是行不通的。没错。是的。但可能还有更复杂的情形,一个NLM可能从未遇到过,并且可能无法确定是否可能或不可能。

So that link from the low level to the high level, the thing is that the high level that language expresses is based on the common experience of the low level, which LLMs currently do not have. When we talk to each other, we know we have a common experience of the world; a lot of it is similar. And LLMs don't have that. But see, it's present. You and I have a common experience of the world in terms of the physics of how gravity works and stuff like this.
所以,从低层到高层的链接,关键是高层语言表达的基础是基于低层的共同经验,而目前的NLMs并不具备这种共同经验。当我们彼此交流时,我们知道我们在世界上有一些共同的经验,很多都是相似的。而NLMs并不具备这一点。但是请注意,这是存在的。你和我在世界的物理方面有着相同的经验,比如重力如何运作之类的事情。

And that common knowledge of the world, I feel like, is there in the language. We don't explicitly express it. But if you have a huge amount of text, you're going to get this stuff that's between the lines. In order to form a consistent world model, you're going to have to understand how gravity works, even if you don't have an explicit explanation of gravity. Even though, in the case of gravity, there is an explicit explanation of gravity, and it would be there.
我觉得世界的一般知识都体现在语言中,尽管我们并没有明确表达出来。但如果你有大量的文本,你会发现很多隐含的信息。为了形成一个连贯的世界模型,你必须理解重力是如何起作用的,即使你并没有关于重力的明确解释。所以即使在重力这种情况下,有关重力的明确解释也会存在。

But the stuff that we think of as common sense reasoning, I feel like to generate language correctly, you're going to have to figure that out. Now, you could say there's not enough text. Sorry. Okay. So what, you don't think so? No, I agree with what you just said, which is that to be able to do high-level common sense, to have high-level common sense, you need to have the low-level common sense to build on top of. But that's not there. That's not there in LLMs. LLMs are purely trained from text.
但是,我们所认为的常识推理,我觉得要正确生成语言,你必须得搞清楚这一点。现在,你可以说,文本不够多。抱歉。好的,那又怎样,你不这么认为吗?不,我同意你刚才说的话,也就是说,要能够进行高水平的常识推理,要拥有有价值的常识,你需要具备低级常识作为基础。但是这并不足够,NLMs中也没有这种基础。NLMs只是为了从文本中进行训练。

So then the other statement you made, I would not agree with the fact that implicit in all languages in the world is the underlying reality. There's a lot about underlying reality which is not expressed in language. Is that obvious to you? Yeah, totally. So like all the conversations, okay, there's the dark web, meaning whatever, the private conversations like DMs and stuff like this, which is much, much larger probably than what's available, what LLMs are trained on.
所以你说的另一种观点,我不认同所有世界语言中都隐含着潜在的现实。有很多关于潜在现实的东西,语言无法表达。你觉得这个很明显吗?是的,完全是的。就像所有的对话,什么的,还有暗网,意味着私人对话像私信之类的东西,可能比公开的内容要大得多,也可能比什么用于训练的语言模型要大得多。

You don't need to communicate the stuff that is common. But the humor, all of it, no, you do. Like, you don't need to, but it comes through. Like if I accidentally knocked this over, you'll probably make fun of me, and the content of you making fun of me will be an explanation of the fact that cups fall and that, you know, gravity works in this way. And then you'll have some very vague information about what kind of things explode when they hit the ground. And then maybe you'll make a joke about entropy or something like this, that we'll never be able to reconstruct this again.
你不需要传达那些常识性的东西。但幽默,所有的幽默,是必须传达的。比如当你,你并不需要,但它就这样表达出来,就像如果我不小心打翻这个杯子,你可能会取笑我,而你取笑我的内容会解释杯子摔倒的事实,然后你知道,重力是以这种方式起作用。然后,你可能会有一些关于什么物体碰到地面会爆炸的模糊信息。然后也许你会开个关于熵之类的玩笑,我永远也无法再重建这件事。

Like, okay, you'll make a little joke like this, and there'll be trillions of other jokes. And from the jokes, you can piece together the fact that gravity works and mugs can break and all this kind of stuff. You don't need to see it. It'll be very inefficient; it's easier to, like, knock the thing over. But I feel like it would be there if you have enough of that data. I just think that most of the information of this type that we accumulated when we were babies is just not present in text, in any description, essentially.
就像这样,你会开个小玩笑,然后会有无数其他笑话。通过笑话,你可以推断重力起作用,杯子可以打破等等。你不需要看到就会很低效。更容易,不需要过多思考。但我觉得如果你有足够的数据,这些信息就会存在。我只是觉得,我们积累的这类信息大部分并没有以文本或其他描述的形式存在。

And the sensory data is a much richer source for getting that kind of understanding. I mean, that's the 16,000 hours of wake time of a four-year-old, and 10 to the 15 bytes, you know, going through vision, just vision, right? There is a similar bandwidth, you know, of touch, and a little less through audio. And then text, language, doesn't come in until like, you know, a year into life. And by the time you are nine months old, you've learned about gravity, you know about inertia, you know about stability, you know about the distinction between animate and inanimate objects. By 18 months, you know about why people want to do things, and you help them if they can't. I mean, there's a lot of things that you learn mostly by observation, really, not even through interaction.
感官数据是一个更为丰富的信息来源,可以帮助我们更好地理解世界。一个四岁孩子的醒着的时间达到了16,000小时,而仅仅通过视觉,就有10^15字节的信息量。通过触觉,这个带宽类似,而通过听觉稍微少一些。语言的信息直到生命中的第一年才开始涌入。而当你九岁时,你已经学会了重力、惯性、平衡、对事物之间的区别等。在18个月的时候,你也知道了人们为什么要做某些事情,以及如果他们无法做到时,怎么帮助他们。其实,有很多事情你主要是通过观察学会的,而不是通过互动学会的。
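
As a rough back-of-the-envelope check on the figures quoted above (16,000 waking hours and about 10 to the 15 bytes through vision), the implied visual bandwidth can be computed directly. This is only arithmetic on the numbers from the conversation; the variable names are illustrative and not taken from any source.

```python
# Figures quoted in the conversation (treated here as given, not verified).
WAKING_HOURS = 16_000          # wake time of a four-year-old
VISUAL_BYTES = 10 ** 15        # bytes passing through vision in that time

waking_seconds = WAKING_HOURS * 3600
bytes_per_second = VISUAL_BYTES / waking_seconds

print(f"waking seconds: {waking_seconds}")
print(f"implied visual bandwidth: {bytes_per_second / 1e6:.1f} MB/s")
```

Under these numbers, the implied rate is on the order of 17 MB per second sustained over roughly 57.6 million waking seconds.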

In the first few months of life, babies don't really have any influence on the world. They can only observe, right? And you accumulate like a gigantic amount of knowledge just from that. So that's what we're missing from current AI systems. I think in one of your slides, you have this nice plot that is one of the ways you show that LLMs are limited. I wonder if you could talk about hallucinations from your perspective. Why hallucinations happen in large language models, and to what degree that is a fundamental flaw of large language models. Right.
在生命的最初几个月里,婴儿并没有真正对世界有任何影响。他们只能观察,对吧?而你累积了大量的知识,仅仅来自这种观察。这就是当前人工智能系统所缺失的东西。我想在你的幻灯片中有一张不错的图表,展示了LLMs的局限性之一。我想知道你是否可以从你的角度谈谈幻觉。为什么大型语言模型会产生幻觉,以及为什么这些幻觉在多大程度上是大型语言模型的根本缺陷。对吧。

So because of the autoregressive prediction, every time an LLM produces a token or a word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that those errors are independent across the sequence of tokens being produced, what that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases, and it decreases exponentially. So there's a strong, like you said, assumption there that if there's a non-zero probability of making a mistake, which there appears to be, then there's going to be a kind of drift.
因此,由于自回归预测的原因,每当LLM生成一个标记或一个词时,这个词离合适答案集的可能性就会有一定水平的概率带走你。而且如果你假设,这是一个非常强烈的假设,即这种错误的概率在生成的一系列标记中是独立的。这意味着每次你生成一个标记时,你保持在正确答案集中的概率会下降,而且是指数级地下降。因此,在那里有一个很强烈的假设,就像你所说的,如果存在犯错误的非零概率,那么就会产生一种漂移。

Yeah. And that drift is exponential. It's like errors accumulate. Right. So the probability that an answer would be nonsensical increases exponentially with the number of tokens. Is that obvious to you, by the way? Like, well, so mathematically speaking, maybe, but like, isn't there a kind of gravitational pull towards the truth? Because on average, hopefully, the truth is well represented in the training set? No, it's basically a struggle against the curse of dimensionality.
是的。而且这种漂移是指数级增长的。就好像错误会累积一样。所以,随着标记的数量增加,答案变得毫无意义的概率会呈指数级增加。这对你来说显而易见吗?数学上可能是这样,但是,难道不会有一种朝向真相的引力吗?因为平均而言,希望真相在训练集中得到了很好的体现?不,基本上是在与维度诅咒作斗争。
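
The argument above can be made concrete with a small sketch. Assuming, as stated in the conversation, that each token independently has some probability e of leaving the set of acceptable answers, the probability that an n-token answer stays acceptable is (1 - e)^n. The function name and the example error rate of 1% are illustrative assumptions, not figures from the conversation.

```python
def prob_still_correct(per_token_error: float, num_tokens: int) -> float:
    """Probability that every one of num_tokens tokens stays within the set of
    acceptable answers, under the (strong) assumption of independent errors."""
    return (1.0 - per_token_error) ** num_tokens


# Illustrative per-token error rate of 1% (an assumption for this example).
for n in (10, 100, 1000):
    print(n, prob_still_correct(0.01, n))
```

With a 1% per-token error rate, a 100-token answer stays on track with probability of roughly 0.37 under this model, and the figure keeps shrinking geometrically as the answer grows.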

So the way you can correct for this is that you fine tune the system by having it produce answers for all kinds of questions that people might come up with. And people are people. So a lot of the questions that they have are very similar to each other. So you can probably cover 80% or whatever of questions that people will ask by collecting data. And then you fine tune the system to produce good answers for all of those things. And it's probably going to be able to learn that because it's got a lot of capacity to learn.
因此,你可以通过微调系统来纠正这个问题,让它能够回答人们可能提出的各种问题。人们就是人,所以很多问题他们之间非常相似。通过收集数据,你可能可以覆盖80%或更多人们会问的问题。然后,你可以微调系统,让它能够为所有这些问题提供好的答案。而且它可能会学会这些,因为它有很大的学习能力。

But then there is, you know, the enormous set of prompts that you have not covered during training. And that set is enormous. Like within the set of all possible prompts, the proportion of prompts that have been used for training is absolutely tiny. It's a tiny, tiny, tiny subset of all possible prompts. And so the system will behave properly on the prompts that it has been either pre-trained or fine-tuned on. But then there is an entire space of things that it cannot possibly have been trained on, because the number is just gigantic. So whatever training the system has been subject to in order to produce appropriate answers, you can break it by finding a prompt that is outside the set of prompts it has been trained on, or things that are similar. And then it will just spew complete nonsense.
但是,你知道,你在训练中还没有涵盖的大量提示。这个集合是巨大的。就所有可能的提示集合而言,已经用于训练的提示比例绝对微不足道。它只是所有可能提示的一个微小、微小的子集。因此,系统将会在已经训练、预训练或微调过的提示上展现出正常的行为。但是,还有一整个空间的东西,它不可能已经被训练过,因为数量简直是巨大的。所以,无论系统接受了什么样的训练来产生适当的答案,只要找到一个不在已经训练过的提示集合或相似提示的提示,就能打破它,然后它就会输出完全没有意义的东西。

When you say prompt, do you mean that exact prompt? Or do you mean a prompt that's, like, in many parts very different? Like, is it that easy to ask a question or to say a thing that hasn't been said before on the internet? I mean, people have come up with things where you put essentially a random sequence of characters in the prompt, and that's enough to kind of throw the system into a mode where it's going to answer something completely different than it would have answered without this. So that's a way to jailbreak the system, basically get it to, you know, go outside of its conditioning, right? So that's a very clear demonstration of it. But of course, you know, that goes outside of what it is designed to do. If you actually stitch together reasonably grammatical sentences, is it that easy to break it?
当你说提示时,你是指那个确切的提示吗?还是你是指一个与之完全不同的提示,像是问一个问题或说一些以前在互联网上没有被提及过的东西? 我的意思是,人们已经想出了一些方法,比如在提示中输入一系列随机字符。这足以让系统进入一种状态,会回答与没有这些字符时完全不同的东西。这是一种越狱系统的方式,基本上是让系统超出其正常操作范围,对吧?这就是它的非常明显的示范。当然,那是超出设计初衷的地方。如果你真的把合理的语法句子拼接在一起,它是不是那么容易就能破解?

Yeah, some people have done things like you write a sentence in English, right? Or you ask a question in English and it produces a perfectly fine answer. And then you just substitute a few words with the same word in another language, and all of a sudden the answer is complete nonsense. Yes. So I guess what I'm saying is like, which fraction of prompts that humans are likely to generate are going to break the system? So the problem is that there is a long tail. Yes. This is an issue that a lot of people have realized, you know, in social networks and stuff like that, which is that there is a very, very long tail of things that people will ask. And you can fine-tune the system for the 80% or whatever of the things that most people will ask. And then this long tail is so large that you're not going to be able to fine-tune the system for all the conditions. And in the end, the system ends up being kind of a giant lookup table, right, essentially, which is not really what you want. You want systems that can reason, certainly that can plan.
是的,有些人做过类似的事情,比如你用英语写一句话,对吧?有人这样做过,或者你用英语提问,它能给出一个完全合理的答案。然后你只需要用另一种语言中的相同单词替换几个词。我不知道这个答案是否完全胡说八道。是的。所以我想说的是,人类可能产生的提示的分数中,会有多少破坏系统的。问题在于存在一个长尾现象。是的。这是很多人意识到的问题,在社交网络等方面,即人们会提出非常多不同的问题。你可以对80%或者大多数人会问的问题进行调整系统,但这个长尾现象实在太大,你无法为所有情况调整系统。最终,系统就变成了一个巨大的查找表,对吧,基本上,这并不是你想要的。你想要的是可以推理的系统,当然还要能够计划。

So the type of reasoning that takes place in an LLM is very, very primitive. And the reason you can tell it is primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's like, you know, it's the size of the prediction network, you know, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. And so essentially, it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something. The amount of computation the system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer, right? This is not the way we work. The way we reason is that when we're faced with a complex problem or complex question, we spend more time trying to solve it and answer it, right? Because it's more difficult. There's a prediction element. There's an iterative element where you're, like, adjusting your understanding of a thing by going over and over and over. There's a hierarchical element, and so on.
因此,在LLM中进行的推理类型非常原始。你可以判断它原始的原因是因为每个生成的标记所花费的计算量是恒定的。因此,如果你提出一个问题,并且这个问题有一个在特定数量标记中的答案,那么用于计算该答案的计算量可以被准确估计。就像是,你知道的,它就是预测网络的规模,带有36层或92层或其他数量的层,乘以该数量的标记。总之,实际上,问问题的难易程度并不重要,因为计算系统能够专注于回答这个问题的计算量是恒定的,或者与生成答案的标记数成正比,对吧?这不是我们工作的方式。我们在面对一个复杂问题或复杂问提时,会花更多时间尝试解决和回答它,对吧?因为它更加困难。其中有一个预测因素。有一种迭代元素,你可以通过反复检查来调整对某个事物的理解。它还有一个层次结构等。
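
The point about fixed computation can be illustrated with a deliberately simplified cost model. The sketch below follows the description above (network depth multiplied by the number of answer tokens) and, as a simplifying assumption, ignores that attention cost also grows with context length. The function name, the unit cost per layer, and the example questions are hypothetical.

```python
def answer_compute(question: str, num_answer_tokens: int,
                   num_layers: int = 36, cost_per_layer: float = 1.0) -> float:
    """Simplified compute estimate for an autoregressive answer.

    The question's difficulty never enters the formula: the cost depends only
    on the depth of the network and on how many tokens are produced.
    """
    return num_layers * cost_per_layer * num_answer_tokens


easy = answer_compute("What is 2 + 2?", num_answer_tokens=5)
hard = answer_compute("Does this program halt on every input?", num_answer_tokens=5)
print(easy, hard)  # identical budgets for an easy and a hard question
```

In this model the only way to spend more computation on a harder question is to emit more tokens, which is the limitation being described.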

Does this mean it's a fundamental flaw of LLMs? Or does it mean that there's more to that question? Now you're just behaving like an LLM, immediately answering. No, that is just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said, persistent long-term memory or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model? Okay, whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialogue systems.
这是否意味着这是LLM的一个根本性缺陷? 所以这是否意味着这更多地属于那个问题的一部分?现在你只是在表现得像一个LLM。 你真的回答了问题。 不,那只是我们可以在其基础上构建这些机制的低水平世界模型,比如你所说的持久的长期记忆或推理等。 但我们需要来自语言的那种世界模型。也许在一个良好构建的世界模型之上建立这种推理系统并不那么困难?好吧,无论是难还是不难,不久的将来会说出答案,因为很多人正在研究对话系统的推理和规划能力。

I mean, even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, right? So this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. I think there's going to be a lot of systems over the next few years that are going to have this capability. But the blueprint of those systems will be extremely different from autoregressive LLMs. So it's the same difference as the difference between what psychologists call System 1 and System 2 in humans, right? System 1 is the type of task that you can accomplish without deliberately, consciously thinking about how you do it. You just do it. You've done it enough that you can just do it subconsciously, right? Without thinking about it. If you are an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio, right?
我的意思是,即使我们局限于语言,只要有能力在回答之前计划你的答案,而不一定与你将用来表达答案的语言联系在一起,对吧?所以这种心智模型允许你在说话之前计划你要说的话是非常重要的。我认为在未来几年,将会有很多系统具备这种能力。但这些系统的蓝图将与我们实验室看到的完全不同。这就像人类中的系统一和系统二之间的差别一样,对吧?系统一是那些你可以在没有刻意去意识地考虑如何做的情况下完成的任务。你已经做过足够多次,以至于可以在潜意识状态下完成。对吧?如果你是一名经验丰富的司机,你可以在不用真正思考的情况下驾驶,甚至可以同时和别人交谈或听收音机。

If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either. You just recognize the pattern and you play, right? That's System 1. So all the things that you do instinctively without really having to deliberately plan and think about it. And then there are all the tasks where you need to plan. So if you are a not-too-experienced chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options, right? You think about it for a while, right? And you're much better if you have time to think about it than you are if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model, that's System 2. This is what LLMs currently cannot do. So how do we get them to do this, right? How do we build a system that can do this kind of planning or reasoning that devotes more resources to complex problems than to simple problems? And it's not going to be autoregressive prediction of tokens. It's going to be something more akin to inference of latent variables in what used to be called probabilistic models or graphical models and things of that type.
如果你是一个非常有经验的国际象棋玩家,你可以和一个没有经验的国际象棋玩家对战而不需要真正去思考。你只是识别自己玩的模式,对吧?这就是系统一。所以所有那些你本能地做的事情,而不需要刻意计划和思考。然后还有那些你需要计划的任务。所以如果你不是太有经验的国际象棋玩家,或者你有经验但你和另一个有经验的国际象棋玩家对战,你会考虑各种选择,对吧?你会想一段时间,对吧?如果你有时间去思考,你就会比在有限时间内玩快速棋盘游戏时更好。这种需要刻意计划的类型使用了系统二的内部世界模型。这是 L&M 目前做不到的。那么我们如何让他们做到这一点呢?我们如何构建一个能够进行这种规划或推理的系统,将更多资源投入复杂问题而不是简单问题?这不会是自回归预测令牌的方法。它更类似于在传统上被称为概率模型或图形模型的潜变量推断。

So basically the principle is like this. The prompt is like observed variables. And what the model does is that it can basically measure to what extent an answer is a good answer for a prompt. So think of it as some gigantic neural net, but it's got only one output. And that output is a scalar number, which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers. The way you would do it is produce the prompt and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model. But that energy-based model would need the model constructed by the LLM. Well, so really what you need to do would be to not search over possible strings of text that minimize that energy. What you would do is do this in abstract representation space. So in the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process.
因此,基本原则就是这样的。提示就好像被观察的变量一样。这个模型的作用是衡量答案对提示的合适程度。想象一下它就像一个巨大的神经网络,但只有一个输出。这个输出是一个标量数字,假设为零,如果答案是问题的好答案,那么输出就是一个较大的数字,如果答案不是问题的好答案。想象一下你有这样一个模型。如果你有这样一个模型,你可以使用它来生成好的答案。你可以通过生成提示,然后搜索可能答案的空间,找到一个最小化这个数字的答案。这就是能量基模型。但是,这个能量基模型需要LLM构建的模型。实际上,你需要做的是不在可能的文本字符串上进行搜索以最小化这个能量,而是在抽象表示空间中进行操作。在抽象思维空间中,你会用这个优化输出值的过程来详细阐述一个思想,这只是一个标量,它是一个优化过程。

So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here. We're not talking about training. The system has been trained already. So now we have an abstract representation of the thought of the answer, a representation of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought. So that, in my opinion, is the blueprint of future dialogue systems. They will think about their answer, plan their answer by optimization, before turning it into text. And that is Turing complete. Can you explain exactly what the optimization problem there is? Like, what's the objective function? Just to linger on it, you kind of briefly described it. But over what space are you optimizing? The space of representations. It goes with abstract representation. That's right. So you have an abstract representation inside the system. You have a prompt. The prompt goes to an encoder, which produces a representation. Perhaps it goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question.
因此,系统现在生成答案的方式是通过优化来最小化一个 object 函数。我们正在讨论推理,而不是训练。系统已经接受过训练。所以现在我们有了答案想法的抽象表示,答案的表达。我们将这个传递给基本的双反应解码器,它可以非常简单,将其转换成表达这个想法的文本。我认为这就是未来数据系统的蓝图。它们会在转换文本之前通过优化来思考并计划答案。这是图灵完备的。您能解释一下那里的优化问题是什么吗?例如,目标函数是什么?您只是简单地提到了一下。但是在什么空间上进行优化?是在表示的空间上进行优化。没错。在系统内部有一个抽象表示。您有一个提示。提示进入编码器,生成一个表示。也许它经过一个预测器,预测一个适当答案的表示。但这个表示可能不是一个好答案,因为可能需要进行一些复杂的推理。因此,然后有另一个过程,它接受答案的表示并进行修改,以便最小化一个成本函数,该函数度量答案对问题的合适程度。

Now, we're sort of ignoring for a moment the issue of how you train that system to measure whether an answer is a good answer for a question. But suppose such a system could be created. But what's the process, this kind of search-like process? It's an optimization process. You can do this if the entire system is differentiable, if that scalar output is the result of running the representation of the answer through some neural net. Then by gradient descent, by backpropagating gradients, you can figure out how to modify the representation of the answer to minimize that. So that's still gradient-based inference. So now you have a representation of the answer in abstract space. Now you can turn it into text. And the cool thing about this is that the representation now can be optimized through gradient descent, but is also independent of the language in which you're going to express the answer. Right. So you're operating in the abstract representation. I mean, this goes back to the joint embedding, that it is better to work in the space of, I don't know, to romanticize the notion, like the space of concepts versus the space of concrete sensory information. Okay. But can LLMs do something like reasoning, which is what we're talking about? Well, not really, only in a very simple way. I mean, basically, you can think of those things as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens. And they do this optimization in a horribly inefficient way, which is to generate a lot of hypotheses and then select the best ones. And that's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible, you know, generated sequence.
现在,我们暂时忽略了如何训练系统来衡量答案是否是问题的好答案这个问题。但假设这样的系统可以创建。但是这个过程是什么,类似搜索的过程?这是一个优化过程。如果整个系统是可微分的,那么标量输出是通过一些神经网络运行,运行答案,将答案的表示传递给一些神经网络。然后通过梯度下降,通过反向传播梯度,您可以找出如何修改答案的表示以最小化结果。所以这仍然是基于梯度的推理。所以现在您在抽象空间中有了答案的表示。现在您可以将其转换为文本。这件事的很酷之处在于,现在表示可以通过梯度下降进行优化,但也与要表达答案的语言无关。对。所以您是在抽象表示中操作。我是说,这回到了共同嵌入这一概念,最好在概念空间中工作,而不是在具体感官信息的空间中工作。好的。但是这种方法能像我们所讨论的那样进行推理吗?嗯,并不是以非常简单的方式。我是说,基本上,你可以想像那些正在进行我所说的优化的东西,除了它们在离散空间中进行优化,这是可能的令牌序列的空间,它们以一种非常低效的方式进行优化,即产生很多假设,然后选择最好的那些。从计算的角度来看,这是非常浪费的,因为您基本上必须为每个可能的生成系列运行您的其他实验室,这是非常浪费的。
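
The contrast between gradient-based inference and generate-then-select can be sketched in a toy setting. The code below is a minimal illustration under strong assumptions, not the architecture described in the conversation: the "energy" is a hand-written quadratic rather than a trained neural net, the prompt representation `x` and answer representation `z` are plain vectors, and the matrix `A` and all function names are hypothetical. Inference is done once by gradient descent on `z` and once by sampling many candidates and keeping the lowest-energy one.

```python
import numpy as np


def energy(x, z, A):
    """Toy scalar energy: zero when answer representation z is compatible
    with prompt representation x (here, when A @ z == x), positive otherwise."""
    residual = A @ z - x
    return 0.5 * float(residual @ residual)


def energy_grad_z(x, z, A):
    """Gradient of the toy energy with respect to the answer representation z."""
    return A.T @ (A @ z - x)


def infer_by_gradient_descent(x, A, z_init, step=0.2, num_steps=500):
    """Iteratively refine z to lower the energy (inference, not training)."""
    z = z_init.copy()
    for _ in range(num_steps):
        z -= step * energy_grad_z(x, z, A)
    return z


def infer_by_sampling(x, A, num_candidates, rng):
    """Generate many candidate answers and keep the one with the lowest energy."""
    candidates = rng.normal(size=(num_candidates, A.shape[1]))
    energies = [energy(x, z, A) for z in candidates]
    return candidates[int(np.argmin(energies))]


rng = np.random.default_rng(0)
dim = 8
A = np.eye(dim) + 0.05 * rng.normal(size=(dim, dim))  # fixed, "already trained" model
x = rng.normal(size=dim)                              # prompt representation

z_gd = infer_by_gradient_descent(x, A, z_init=np.zeros(dim))
z_best = infer_by_sampling(x, A, num_candidates=200, rng=rng)

print("energy after gradient descent:", energy(x, z_gd, A))
print("energy of best of 200 samples:", energy(x, z_best, A))
```

In this toy case the refined representation reaches an energy near zero, while blind sampling in even an 8-dimensional space stays far from it. A real system would still need a decoder to turn the optimized representation into text.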

So it's much better to do an optimization in continuous space, where you can do gradient descent, as opposed to, like, generating tons of things and then selecting the best. You just iteratively refine your answer to go towards the best, right? That's much more efficient. But you can only do this in continuous spaces with differentiable functions. You're talking about reasoning, like the ability to think deeply or to reason deeply. How do you know what is an answer that's better or worse based on deep reasoning? Right. So then we're asking the question of, conceptually, how do you train an energy-based model? Right. So an energy-based model is a function with a scalar output, just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X or not. X you observe; let's say it's a prompt, an image, a video, whatever. And Y is a proposal for an answer, a continuation of the video, whatever. And it tells you whether Y is compatible with X. And the way it tells you that Y is compatible with X is that the output of that function would be zero if Y is compatible with X, and it would be a positive number, non-zero, if Y is not compatible with X. Okay. How do you train a system like this, at a completely general level? You show it pairs of X and Y that are compatible, a question and the corresponding answer.
因此,在连续空间进行优化要好得多,可以进行梯度下降,而不是生成大量东西然后选择最好的。你只需迭代地改进答案以接近最佳答案,对吗?这样效率更高。但你只能在具有可微函数的连续空间中做到这一点。你在谈论推理,比如深入思考或深入推理的能力。你怎么知道一个答案是更好还是更差,基于深入的推理?对。然后我们在思考如何训练基于能量的模型。一个基于能量的模型是一个具有标量输出的函数,只是一个数字。你给它两个输入,X和Y。它告诉你Y是否与X兼容。X是你观察到的,比如一个提示,一张图片,一个视频,等等。而Y是对答案的一个提议,视频的延续,等等。它告诉你Y是否与X兼容。如果Y与X兼容,那么该函数的输出将为零。如果Y与X不兼容,它将是一个正数,非零。好的。如何在完全一般的层面上训练这样一个系统?你向它展示一对一兼容的X和Y,一个问题和相应的答案。

And you train the parameters of the big neural net inside to produce zero. Okay. Now that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So now you have to have a process to make sure that for a wrong Y, the energy would be larger than zero. And there you have two options. One is contrastive methods. So a contrastive method is you show an X and a bad Y, and you tell the system, well, you know, give a high energy to this, like push up the energy, right? Change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The problem with this is that if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic.
然后你可以训练神经网络内部的参数,使其产生零。但这并不完全有效,因为系统可能会决定,好吧,我只会对所有事情说零。因此,现在您必须有一个过程来确保对于错误的Y,能量大于零。在这里,您有两个选项。一个是对比方法。对比方法是向系统展示一个X和一个错误的Y,并告诉系统,好吧,对于这个,增加能量,就是提高能量,对吧?改变神经网络中的权重,混淆能量使其增加。这就是对比方法。这种方法的问题在于,如果Y的空间很大,那么您必须展示的这种对比样本的数量是巨大的。
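
The collapse problem and the contrastive fix described above can be seen in a tiny example. This is a sketch under stated assumptions: the energy is a two-parameter toy function rather than a neural net, the "bad" answers are made by shifting the good ones, and the hinge-with-margin form of the contrastive loss is one common choice, not necessarily the one the speaker has in mind. All names are hypothetical.

```python
import numpy as np


def energy(params, x, y):
    """Toy energy with two parameters (u, w); zero when u * y == w * x."""
    u, w = params
    return (u * y - w * x) ** 2


def positive_only_loss(params, xs, ys_good):
    """Only pushes energy down on compatible (x, y) pairs."""
    return float(np.mean(energy(params, xs, ys_good)))


def contrastive_loss(params, xs, ys_good, ys_bad, margin=1.0):
    """Pushes energy down on good pairs and up (to at least `margin`) on bad pairs."""
    push_down = energy(params, xs, ys_good)
    push_up = np.maximum(0.0, margin - energy(params, xs, ys_bad))
    return float(np.mean(push_down + push_up))


xs = np.array([1.0, 2.0, 3.0])
ys_good = 2.0 * xs          # compatible answers in this toy world: y = 2x
ys_bad = ys_good + 1.5      # incompatible answers used as contrastive samples

collapsed = (0.0, 0.0)      # "say zero for everything"
useful = (1.0, 2.0)         # energy is zero only when y == 2x

print(positive_only_loss(collapsed, xs, ys_good))        # collapse looks perfect here
print(contrastive_loss(collapsed, xs, ys_good, ys_bad))  # but is penalized here
print(contrastive_loss(useful, xs, ys_good, ys_bad))
```

The collapsed parameters achieve zero loss when only good pairs are used, which is exactly the failure mode mentioned above. The contrastive term penalizes them, while the useful parameters satisfy both terms.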

But people do this. They do this when you train a system with RLHF. Basically what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad. And that's basically exactly what this is. So we already do this to some extent. We're just not using it for inference. We're just using it for training. There is another set of methods which are non-contrastive, and I prefer those. And those non-contrastive methods basically say, okay, the energy function needs to have low energy on pairs of X, Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? And the way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy.
但人们确实在做这件事。当你用RLE-CHF训练系统时,基本上你训练的是一个叫做奖励模型的东西,这就是一个告诉你答案是好是坏的客观函数。这基本上就是这样。所以我们在一定程度上已经在做这件事。我们只是没有用它进行推论。我们只是用它来训练。还有一组方法是非对比方法,我更倾向于这些方法。这些非对比方法基本上是说,好吧,能量函数在来自你的训练集的X、Y对上需要有低能量。你如何确保在其他地方能量会更高?你可以通过在成本函数中添加一个正则化器、标准,一个项来实现这一点,它基本上能够最小化可以具有低能量的空间的体积。

And as for the precise way to do this, there are all kinds of different specific ways depending on the architecture. But that's the basic principle. So that if you push down the energy function for particular regions in the X, Y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy. Okay, by the construction of the system or by the regularizer, the regularizing function. We've been talking very generally about what is the good X and the good Y, what is a good representation of X and Y. We've been talking about language, and if you just take language directly, that presumably is not good. So there has to be some kind of abstract representation of ideas. Yeah, so you can do this with language directly, where X is a text and Y is a continuation of that text. Yes. Or X is a question, Y is the answer. But you're saying that's not going to cut it. I mean, that's going to do what LLMs are doing.
在不同的架构下,有各种不同具体的方式来实现这一点。但基本原则是这样的。如果你在X与Y空间的特定区域压缩能量函数,它会自动在其他地方上升,因为只有有限的空间可以承载低能量。通过系统的构建或正则化函数。我们一直在讨论什么是好的X和好的Y,什么是X和Y的良好表示?当谈到语言时,如果直接采用语言,那可能并不好。因此,必须有一种对思想的抽象表达方式。是的,你可以通过直接使用语言来实现这一点,比如把X当作文本,Y当作文本的延续。或者X是一个问题,Y是你的回答。但你说这样做不够,这样做只会做和其他人一样的事情。

Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of this system, there is a latent variable, let's call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer. So this kind of system could be trained in a very similar way, very similar way, but you have to have this way of preventing collapse of ensuring that there is high energy for things you're not training on.
嗯,不,这取决于系统内部结构是如何构建的。如果系统的内部结构是以这样的方式构建的,即在系统内部有一个潜在变量,我们称之为Z,你可以操纵它以最小化输出能量,那么Z可以被视为一个好答案的表示,可以将其转换为一个好答案的Y。因此,这种系统可以以非常相似的方式进行训练,但是你必须防止崩溃的方法,确保你没有训练的事物有高能量。

And currently, it's very implicit in LLMs. It's done in a way that people don't realize it's being done, but it is being done. It's due to the fact that when you give a high probability to a word, automatically you give low probability to other words, because you only have a finite amount of probability to go around, right? They have to sum to one. So when you minimize the cross entropy or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words.
目前,NLM中的工作方式非常隐性,许多人都没有意识到它在进行,但实际上确实在进行。这是因为当你给一个词赋予高概率时,自动就会给其他词赋予低概率,因为你只有有限的概率分配给各个词,总和为1。因此,当你在训练LLM时最小化交叉熵等,以预测下一个词时,你增加了系统给出正确词的概率,但也减少了给出错误词的概率。
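The point about probabilities summing to one can be checked numerically. The following is a minimal sketch, not anything from the conversation itself: it assumes a toy four-word vocabulary, a plain softmax over logits, and a single gradient step on the cross-entropy loss. The function names and the learning rate are illustrative choices.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to one.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_step(logits, target, lr=1.0):
    # The gradient of -log p[target] with respect to the logits
    # is p - onehot(target).
    p = softmax(logits)
    grad = p.copy()
    grad[target] -= 1.0
    return logits - lr * grad

logits = np.zeros(4)                 # toy vocabulary of 4 words, uniform start
before = softmax(logits)
after = softmax(cross_entropy_step(logits, target=2))
```

After the step, the probability of the target word is higher and the probability of every other word is lower, even though the loss only ever mentions the target word.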

Now, indirectly, that gives a high probability to sequences of words that are good and low probability to sequences of words that are bad, but it's very indirect. It's not obvious why this actually works at all, because you're not doing it on a joint probability of all the symbols in a sequence. You're just kind of factorizing that probability in terms of conditional probabilities over successive tokens. How do you do this for visual data?
现在,间接地,这给了好的词序列低概率和坏的词序列低概率,但这是非常间接的。为什么这实际上有效并不明显。但因为你不是在一个序列中的所有符号的联合概率上进行操作,你只是在条件概率上对连续标记进行了因子分解。你如何在视觉数据中做到这一点?
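The factorization mentioned here is the chain rule: the probability of a whole sequence is the product of the conditional probabilities of each token given what came before. A small sketch follows, under a loud simplifying assumption: the hypothetical table `COND` conditions only on the previous token, whereas an LLM conditions on the entire prefix. All words and numbers are made up for illustration.

```python
import math
from itertools import product

# Hypothetical toy model: P(next token | previous token), "<s>" marks the start.
COND = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}

def sequence_log_prob(tokens):
    # Sum of log conditional probabilities = log of the joint probability.
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

p = math.exp(sequence_log_prob(["the", "cat", "sleeps"]))

# Every three-word sequence this toy model can produce.
all_sequences = list(product(["the", "a"], ["cat", "dog"], ["sleeps"]))
total_mass = sum(math.exp(sequence_log_prob(list(s))) for s in all_sequences)
```

Because each conditional distribution sums to one, the probabilities of all possible sequences also sum to one, so raising the probability of good sequences necessarily lowers it elsewhere.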

So we've been doing this with all the JEPA architectures, basically. The joint embedding. So there, the compatibility between two things is, here's an image or video, here's a corrupted, shifted, transformed, or masked version of that image or video. And then the energy of the system is the prediction error of the representation, the predicted representation of the good thing versus the actual representation of the good thing. Right? So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if those two things are effectively, one of them is a corrupted version of the other, and give you a high energy if the two images are totally different. And hopefully that whole process gives you a really nice compressed representation of reality, visual reality. And we know it does, because then we use those representations as input to a classification system. And then that system works really nicely. Okay.
所以我们一直在用JAPA架构进行这样的工作。badding的联合。所以,在这里,两者之间的兼容性是,这里有一个图像或视频,这是一个受损、移位或转换版本的图像或视频,或者是掩蔽的版本。然后系统的能量是表示的预测错误,好的事物的预测表示与实际表示之间的差异。对吧?所以你将受损的图像输入到系统中,预测好的输入和受损的表示,并计算预测错误。这就是系统的能量。所以这个系统会告诉你,如果这是一个好的图像,这是一个受损的版本,如果这两者有效地相互对应,那么它会给你零能量;如果这两个图像完全不同,它会给你高能量。希望整个过程能够给你一个对现实的非常好的压缩表示,视觉现实。我们知道它确实能做到,因为然后我们将这些用作我们演示的输入,作为分类系统的输入。然后那个系统运作得非常好。好的。
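The energy described above, prediction error measured in representation space, can be written down in a few lines. This is only a structural sketch and not Meta's implementation: the encoder is an untrained random linear map followed by tanh, the predictor is the identity, and the "image" is a random vector. Every name and size here is a hypothetical stand-in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(16, 64)) / np.sqrt(64)  # stand-in encoder weights (untrained)
W_pred = np.eye(16)                               # stand-in predictor (identity)

def encode(x):
    # Map an input to its representation.
    return np.tanh(W_enc @ x)

def energy(x_corrupted, y_clean):
    # Energy = squared error between the representation predicted from the
    # corrupted input and the actual representation of the clean input.
    predicted = W_pred @ encode(x_corrupted)
    target = encode(y_clean)
    return float(np.sum((predicted - target) ** 2))

image = rng.normal(size=64)
corrupted = image.copy()
corrupted[:8] = 0.0               # mask part of the input
other = rng.normal(size=64)       # an unrelated "image"

e_compatible = energy(corrupted, image)
e_incompatible = energy(other, image)
```

Even with untrained weights, the masked copy sits closer to the original in representation space than an unrelated input does. Training is what would push the compatible energy toward zero while a regularizer keeps the representation from collapsing.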

Well, so to summarize, you recommend, in a spicy way that only Yann LeCun can, you recommend that we abandon generative models in favor of joint embedding architectures. Yes. Abandon autoregressive generation. Yes. Abandon... this feels like court testimony. Abandon probabilistic models in favor of energy-based models, as we talked about. Abandon contrastive methods in favor of regularized methods.
嗯,所以总结一下,你建议在考虑到只在coon cam上推荐的情况下,我们应该放弃生成模型,转而支持联合嵌入结构。是的,放弃自我侵略性生成。是的,放弃这个问题。这感觉像是法庭证词,放弃概率模型,转而支持基于能量的模型,正如我们所讨论的,放弃对比方法,转而支持规范化方法。

And let me ask you about this. You've been for a while a critic of reinforcement learning. Yes. So the last recommendation is that we abandon RL in favor of model predictive control, as you were talking about, and only use RL when planning doesn't yield the predicted outcome. And we use RL in that case to adjust the world model or the critic. Yes.
让我问你一下这个问题。你一直以来都是强化学习的批评者。是的。所以最后的建议是我们放弃强化学习,转而选择模型预测控制,就像你所说的,只有在规划不产生预期结果时才使用强化学习。在那种情况下,我们使用强化学习来调整世界模型或评论者。是的。

So you mentioned RLHF, reinforcement learning with human feedback. Why do you still hate reinforcement learning? I don't hate reinforcement learning. And I think it's all love. I think it should not be abandoned completely. But I think its use should be minimized because it's incredibly inefficient in terms of samples. And so the proper way to train a system is to first have it learn good representations of the world and world models from mostly observation, maybe a little bit of interaction. And then steer it based on that. If the representation is good, then the adjustments should be minimal.
所以你提到了RLHF强化学习与人类反馈。为什么你还是讨厌强化学习呢?我并不讨厌强化学习。我认为这一切都是爱。我认为它不应该被完全放弃。但我认为它应该被最小化使用,因为在样本方面效率非常低。因此,训练一个系统的正确方式是首先让它从观察中学习世界的良好表示和世界模型,可能稍微有一点互动。然后基于此来引导,如果表示很好,那么调整应该是最小的。

Yeah. Now there's two things you can use. If you've learned a world model, you can use the world model to plan a sequence of actions to arrive at a particular objective. You don't need RL unless the way you measure whether you succeed might be inexact. Your idea of whether you're going to fall from your bike might be wrong, or whether the person you're fighting with in MMA was going to do something and then does something else. So there's two ways you can be wrong. Either your objective function does not reflect the actual objective function you want to optimize, or your world model is inaccurate. So the prediction you were making about what was going to happen in the world is inaccurate. So if you want to adjust your world model while you are operating in the world, or your objective function, that is basically in the realm of RL. This is what RL deals with to some extent. So adjust your world model, and the way to adjust your world model, even in advance, is to explore parts of the space where you know that your world model is inaccurate. That's called curiosity basically, or play. When you play, you kind of explore parts of the state space that you don't want to do for real because it might be dangerous, but you can adjust your world model without killing yourself basically. So that's what you want to use RL for. When it comes time to learning a particular task, you already have all the good representations, you already have your world model, but you need to adjust it for the situation at hand. That's when you use RL.
是的。现在有两件事可以用到。如果你学会了一个世界模型,你可以使用这个世界模型来规划一系列行动,以达到特定的目标。除非你成功的方式可能是精确的,否则你不需要RL。你对自己是否会从自行车上摔下来的想法可能是错误的,或者你和MMA搏斗的对手会做一些事情,然后做另一些事情。所以有两种可能你会错误。要么你的目标函数不能反映你想要优化的实际目标函数,要么你的世界模型不准确。所以,你之前对世界会发生什么的预测是不准确的。因此,当你在操作世界或目标函数时,如果想调整你的世界模型,那基本上就是强化学习的范畴。这就是强化学习在一定程度上要处理的内容。所以要调整你的世界模型,调整世界模型的方法,甚至事先,在你知道你的世界模型不准确的空间中进行探索。这基本上被称为好奇心或玩耍。当你玩耍时,你会探索状态空间的一部分,你不愿意真正做,因为那可能很危险,但你可以在不危及自己的情况下调整你的世界模型。这就是你想要使用强化学习的地方。当学习一个特定的任务时,你已经有了所有良好的网络呈现,你已经有你的世界模型,但你需要为手头的情况进行调整。这时就是你使用强化学习的时候。
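Planning with a world model, as opposed to learning a policy by trial and error, can be sketched as model predictive control: simulate candidate action sequences with the model, pick the cheapest, execute only its first action, then replan. The example below is a toy under stated assumptions: a one-dimensional state, three discrete actions, and a hypothetical `world_model` that happens to be exact. None of these names come from the conversation.

```python
from itertools import product

ACTIONS = (-1, 0, 1)

def world_model(state, action):
    # Hypothetical learned model: predicts the next state.
    return state + action

def cost(state, goal):
    # Objective function: distance to the goal.
    return abs(state - goal)

def plan(state, goal, horizon=3):
    # Exhaustively simulate every action sequence and keep the cheapest.
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0
        for a in seq:
            s = world_model(s, a)
            total += cost(s, goal)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq

def run_mpc(start, goal, steps=10):
    state, trajectory = start, [start]
    for _ in range(steps):
        action = plan(state, goal)[0]       # execute only the first planned action
        state = world_model(state, action)  # assumption: the real world matches the model
        trajectory.append(state)
    return trajectory

trajectory = run_mpc(start=0, goal=4)
```

In this sketch the environment and the model are the same function, so planning alone reaches the goal. The two failure cases named in the passage correspond to `cost` not matching the real objective or `world_model` not matching the real dynamics, which is where adjustment from experience would enter.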

Why do you think RLHF works so well? This reinforcement learning with human feedback, why did it have such a transformational effect on the large language models that came before? What's had the transformational effect is human feedback. There are many ways to use it, and some of it is just purely supervised, actually. It's not really reinforcement learning. So it's the HF. It's the HF. And then there are various ways to use human feedback. So you can ask humans to rate multiple answers that are produced by a world model. And then what you do is you train an objective function to predict that rating. And then you can use that objective function to predict whether an answer is good. And you can backpropagate gradient through this to fine-tune your system so that it only produces highly rated answers. So that's one way. So in RL, that means training what's called a reward model. So something that, basically a small neural net, that estimates to what extent an answer is good. It's very similar to the objective I was talking about earlier for planning, except now it's not used for planning. It's used for fine-tuning your system. I think it would be much more efficient to use it for planning. But currently it's used to fine-tune the parameters of the system. Now there are several ways to do this. Some of them are supervised. You just ask a human person, like, what is a good answer for this? Then you just type the answer. There's lots of ways that those systems are being adjusted.
你认为RLHF为什么效果这么好?人类反馈的这种强化学习,是什么让它以前对大型语言模型产生了如此变革性的影响?有哪些给出了变革性影响的是人类反馈。使用它的方法有很多种,有些方法实际上仅仅是纯粹的监督。这并不是你所想的第一个,Johnny。这就是HF。就是HF。然后有各种各样的使用人类反馈的方法。你可以请人类写出由世界模型产生的多个答案。然后你训练一个目标函数来预测那个评分。然后你可以使用那个目标函数来预测一个答案是否好。你可以通过这个反向传播梯度来微调你的系统,以便只产生评分高的答案。这是其中一种方式。就像在RL中那样,这意味着训练所谓的奖励模型。总之就是你更小心地估计到一个答案有多好。这很类似于我之前谈到过用于计划的目标,只是现在它不是用于计划,而是用于微调你的系统。我认为更有效的方式是将其用于计划。但目前它是用于微调系统参数的。现在有几种方法可以做到这一点。其中一些是监督的。你只需询问一个人类,比如,对于这个问题,什么是一个好的答案?然后你就输入答案。这些系统的调整方法有很多。
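A reward model in the sense described here is just a function fitted to predict human ratings. The sketch below assumes each answer has already been turned into a small feature vector and fits a linear model by least squares. The features, ratings, and the `reward` helper are all hypothetical. It then uses the fitted model to choose among candidate answers, which is closer to the inference-time use the speaker says he would prefer than to the gradient-based fine-tuning used in practice.

```python
import numpy as np

# Hypothetical data: one feature vector per answer, plus a human rating for each.
features = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
ratings = np.array([3.0, 1.0, 4.0, 0.0])

# Fit a linear reward model (with a bias column) by least squares.
X = np.hstack([features, np.ones((len(features), 1))])
weights, *_ = np.linalg.lstsq(X, ratings, rcond=None)

def reward(answer_features):
    # Predicted rating for an answer with the given features.
    return float(np.append(answer_features, 1.0) @ weights)

# Use the reward model to pick the better of two candidate answers.
candidates = np.array([[0.0, 1.0], [1.0, 0.0]])
best = max(range(len(candidates)), key=lambda i: reward(candidates[i]))
```

A real reward model would be a neural network over text rather than a linear fit over hand-made features, but the role is the same: a learned objective that scores answers.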

Now a lot of people have been very critical of the recently released Google's Gemini 1.5. For essentially, in my words, I could say super woke. Woke in the negative connotation of that word. There's some almost hilariously absurd things that it does, like it modifies history, like generating images of Black George Washington or perhaps more seriously something that you commented on Twitter, which is refusing to comment on or generate images of or even descriptions of Tiananmen Square or the Tankman, one of the most sort of legendary protest images in history. Of course, these images are highly censored by the Chinese government. Everybody started asking questions of what is the process of designing these LLMs? What is the role of censorship in these and all that kind of stuff? You commented on Twitter saying that open source is the answer. Essentially. Can you explain? I actually made that comment on just about every social network. I've made that point multiple times in various forums.
现在很多人对谷歌最近发布的 Gemini 1.5 表达了批评。在我看来,可以说它非常“觉醒”。“觉醒”这个词带有负面意义。它做了一些几乎可笑荒谬的事情,比如修改历史,生成黑人乔治·华盛顿的图像,或者更严重的是拒绝评论或生成关于天安门广场或坦克人的图像,这是历史上最具传奇色彩的抗议形象之一。当然,这些图像在中国政府高度审查下。大家开始质疑设计这些 LLMs 的流程是什么?审查在其中扮演了什么角色?你在推特上表示开源是答案。你能解释一下吗?我实际上在几乎所有社交网络上都做了这个评论。我在各种论坛上多次表达了这一点。

Here's my point of view on this. People can complain that AI systems are biased. They generally are biased by the distribution of the training data that they've been trained on that reflects biases in society. That is potentially offensive to some people or potentially not. Some techniques to debias then become offensive to some people because of historical incorrectness and things like that. You can ask the question. The first question is, is it possible to produce an AI system that is not biased? The answer is absolutely not. It's not because of technological challenges, although they are technological challenges to that. It's because bias is in the eye of the beholder. Different people may have different ideas about what constitutes bias for a lot of things. There are facts that are indisputable, but there are a lot of opinions or things that can be expressed in different ways. You cannot have an unbiased system that's just an impossibility. What's the answer to this? The answer is the same answer that we found in liberal democracy about the press. The press needs to be free and diverse. We have free speech for a good reason. It's because we don't want all of our information to be to come from a unique source because that's opposite to the whole idea of democracy and progress of ideas and even science. In science, people have to argue for different opinions. Science makes progress when people disagree and they come up with an answer and a consensus forms and is true in all democracies around the world.
这是我的观点。人们可以抱怨人工智能系统存在偏见。它们通常受到训练数据分布的影响,这些数据反映了社会中的偏见。这可能会冒犯一些人,也可能不会。一些去偏见的技术可能会冒犯一些人,因为涉及到历史上的不正确性等问题。你可以提出问题。第一个问题是,是否可能制造一个没有偏见的人工智能系统?答案绝对是否定的。这并不是因为技术上的挑战,虽然有一些技术上的挑战。这是因为偏见取决于看问题者的眼光。不同的人对于什么才算是偏见可能会有不同看法。有些事实不容置疑,但很多观点或表达方式可能存在争议。你无法拥有一个毫无偏见的系统,那是不可能的。那么问题的答案是什么?答案就是我们在自由民主制度中关于新闻媒体发现的答案。新闻媒体应该是自由多样的。我们有言论自由是有充分理由的。这是因为我们不希望所有的信息都来自一个唯一的来源,这与民主的整体思想相悖,也违背了思想进步的原则,甚至违反了科学。在科学中,人们必须就不同意见进行辩论。当人们出现分歧,并达成共识时,科学才会取得进展,这在世界上所有的民主制度中都成立。

There is a future where every single one of our interactions with the digital world will be mediated by AI systems. We're going to have smart glasses. You can already buy them from Meta, the Ray-Ban ones, where you can talk to them and they are connected with an LLM and you can get answers on any question you have. You can be looking at a monument and there is a camera in the system, in the glasses, and you can ask it, what can you tell me about this building or this monument? You can be looking at a menu in a foreign language and it's being translated for you, or you can do real-time translation if you speak different languages. A lot of our interactions with the digital world are going to be mediated by those systems in the near future. Increasingly, the search engines that we're going to use are not going to be search engines. They're going to be dialogue systems that you just ask a question and it will answer and then point you to perhaps an appropriate reference for it.
在未来,我们与数字世界的每一个互动都将由人工智能系统中介。我们将拥有智能眼镜。你可以从MITA购买它们,这是Rayban,你可以与它们交流,它们与LLM连接,并可以回答你所有的问题。你可以看着一个纪念碑,眼镜里有一个摄像头,你可以问它,关于这座建筑或纪念碑,你能告诉我什么?你可以看着一份外语菜单,它会为你翻译,或者如果你说不同的语言,你可以进行实时翻译。在不久的将来,我们与数字世界的很多互动都会由这些系统中介。越来越多的搜索引擎不再是搜索引擎。它们将是对话系统,你只需问一个问题,它就会回答,然后可能指向适当的参考资料。

Here is the thing, we cannot afford those systems to come from a handful of companies on the west coast of the US because those systems will constitute the repository of all human knowledge and we cannot have that be controlled by a small number of people. It has to be diverse for the same reason the press has to be diverse. How do we get a diverse set of AI systems? It's very expensive and difficult to train a base model, a base LLM at the moment, in the future it might be something different but at the moment that's an LLM.
事实是,我们无法承受那些系统只来自美国西海岸的少数几家公司,因为这些系统将构成所有人类知识的存储库,我们不能让这些知识被少数人控制。正如新闻媒体必须多元化一样,AI系统也必须多样化。我们如何获得多样化的AI系统?目前训练基本模型(LLM)非常昂贵且困难,未来可能会有所不同,但目前来说是LLM。

So only a few companies can do this properly and if some of those subsystems are open source, anybody can use them, anybody can fine-tune them. If we put in place some systems that allows any group of people, whether they are individual citizens, groups of citizens, government organizations, NGOs, companies, whatever, to take those open source systems, AI systems and fine-tune them for their own purpose, their own data. Then we're going to have a very large diversity of different AI systems that are specialized for all of those things. I'll tell you, I talked to the French government quite a bit and the French government will not accept that the digital diet of all their citizens be controlled by three companies on the west coast of the US. That's just not acceptable. It's a danger to democracy regardless of how well-intentioned those companies are.
因此,只有很少的公司能够正确地做到这一点,如果某些子系统是开源的,任何人都可以使用它们,任何人都可以进行微调。如果我们建立一些系统,允许任何一群人,无论是个体公民、公民团体、政府组织、非政府组织、公司等,使用那些开源系统、人工智能系统并为了他们自己的目的、自己的数据进行微调。那么我们将拥有各种不同的人工智能系统,专门针对所有这些事情。我告诉你,我经常与法国政府交谈,法国政府不会接受他们所有公民的数字信息饮食被美国西海岸的三家公司控制。这是不可接受的。这不管这些公司有多么好心,都对民主构成了危险。

And it's also a danger to local culture, to values, to language. I was talking with the founder of Infosys in India. He's funding a project to fine-tune Llama 2, the open source model produced by Meta, so that Llama 2 speaks all 22 official languages in India. It's very important for people in India. I was talking to a former colleague of mine, Moustapha Cissé, who used to be a scientist at FAIR and then moved back to Africa. He created a research lab for Google in Africa and now he has a new startup, Kera. What he's trying to do is basically have an LLM that speaks the local languages in Senegal so that people can have access to medical information, because they don't have access to doctors.
这也对当地文化、价值观和语言构成威胁。我曾与印度Infosys公司的创始人交谈。他们资助一个项目,通过优化Meta生产的开源模型LLM,使LLM能够使用印度的22种官方语言。对印度人民来说这非常重要。我曾和我的前同事Gustav S交谈,他曾在FAIR担任科学家,后来搬回非洲。他为谷歌在非洲建立了一个研究实验室,现在他创建了一个新的创业公司Co-Kara。他想要实现的目标基本上是让LLM能够使用塞内加尔的本地语言,以便人们可以获得医疗信息,因为他们无法看医生。

It's a very small number of doctors per capita in Senegal. I mean, you can't have any of this unless you have open source platforms. So with open source platforms, you can have AI systems that are not only diverse in terms of political opinions or things of that type, but in terms of language, culture, value systems, political opinions, technical abilities in various domains, and you can have an industry, an ecosystem of companies that fine-tune those open source systems for vertical applications in industry. You have, I don't know, a publisher has thousands of books and they want to build a system that allows a customer to just ask a question about any of the content of any of their books.
在塞内加尔,每千人口拥有的医生数量非常少。我的意思是,除非有开放源平台,否则你无法拥有这些。因此,借助开放源平台,你可以拥有不仅在政治观点等方面多样化的人工智能系统,还可以在语言、文化、价值体系、政治观点、各领域的技术能力等方面多样化,并且可以有一家公司生态系统,为工业中的垂直应用优化这些开源系统。比如,一个出版商有成千上万本书,他们想建立一个系统,让客户只需询问有关任何一本书内容的问题就可以了。

You need to train on their proprietary data. You have a company we have one within Meta, it's called MetaMate. It's basically an LLM that can answer any question about internal stuff about the company. Very useful. A lot of companies want this, right? A lot of companies want this not just for their employees, but also for their customers to take care of their customers. So the only way you're going to have an AI industry, the only way you're going to have AI systems that are not uniquely biased is if you have open source platforms on top of which any group can build specialized systems.
你需要在他们的专有数据上进行培训。我们拥有一个名为MetaMate的公司,它可以回答任何关于公司内部事务的问题。非常有用。很多公司都需要这个,对吧?很多公司不仅需要这个来帮助员工,还需要这个来照顾他们的客户。所以要拥有一个人工智能产业,要拥有没有独特偏见的人工智能系统,唯一的途径就是建立开源平台,任何团体都可以在其基础上构建专门的系统。

So the direction of inevitable direction of history is that the vast majority of AI systems will be built on top of open source platforms. So that's a beautiful vision. Meaning like a company like Meta or Google or so on should take only minimal fine tuning steps after the building the foundation pre-trained model as few steps as possible. Basically, can Meta afford to do that? No, so I don't know if you know this, but companies are supposed to make money somehow and open source is like giving away, I don't know, Mark made a video, Mark Zuckerberg, very sexy video talking about 350,000 NVIDIA H100s. The math of that is just for the GPUs, that's 100 billion, plus the infrastructure for training everything.
因此,历史不可避免地走向的方向是,绝大多数人工智能系统将建立在开源平台之上。这是一个美好的愿景。意思是像Meta、谷歌等公司在构建基础预训练模型后只需进行最小的微调步骤,尽可能少的步骤。基本上,Meta能负担得起这样做吗?不,我不知道你是否知道,但公司应该以某种方式赚钱,而开源就像是散发,马克发表了一个视频,马克•扎克伯格,非常性感地谈论了35万个NVIDIA H100。其中GPU的数量就达到了1000亿,再加上训练一切所需的基础设施。

So I'm no business guy, but how do you make money on that? So the vision you paint is a really powerful one, but how is it possible to make money? Okay, so you have several business models, right? The business model that Meta is built around is, you offer a service, and the financing of that service is either through ads or through business customers. So for example, if you have an LLM that can help a mom-and-pop pizza place by talking to the customers through WhatsApp, so the customers can just order a pizza and the system will just ask them what topping do you want or what size, blah, blah, blah. The business will pay for that.
所以我不是商人,但是你怎么在这上面赚钱呢?你所描绘的分工是非常强大的,但是如何可能赚钱呢?好的,你有几种商业模式,对吧?Meta围绕的商业模式是基于你提供的服务,这种服务的资金来源要么是广告,要么是商业客户。例如,如果你有一个可以通过WhatsApp与顾客交谈的LLM,让顾客可以直接订购披萨,系统会询问他们想要什么口味或者大小等等。企业会为此付费。

Okay, that's a model. And otherwise, if it's a system that is on the more classical services, it can be ad supported or there's several models. But the point is, if you have a big enough potential customer base and you need to build that system anyway for them, it doesn't hurt you to actually distribute it in open source. Again, I'm no business guy, but if you release the open source model, then other people can do the same kind of task and compete on it, basically provide fine tune models for businesses.
好的,这是一个模型。另外,如果它是一个更古典服务的系统,它可以通过广告支持或有几种模式。但关键是,如果你有一个足够庞大的潜在客户群,并且无论如何都需要为他们构建系统,那么实际上将其分发为开源不会对你造成伤害。再次强调,我不是商业人士,但如果你发布开源模式,那么其他人也可以做同样的任务并在此基础上竞争,基本上为企业提供精细调整的模型。

Sure. Because the bet that Meta is making, by the way, I'm a huge fan of all this, but the bet that Meta is making is like, we'll do a better job of it. Well, no, the bet is more, we already have a huge user base and customer base. So it's going to be useful to them. Whatever we offer them is going to be useful and there is a way to derive revenue from this. And it doesn't hurt that we provide that system or the base model, the foundation model, in open source for others to build applications on top of it too. If those applications turn out to be useful for our customers, we can just buy it from them.
当然。由于元宇宙正在发展,顺便说一句,我是所有这一切的一个大粉丝,但元宇宙正在发展的方式就是,我们会做得更好。嗯,不,更好的是,我们已经拥有庞大的用户群和客户群。所以对他们来说,这将是有用的。无论我们向他们提供什么,都将是有用的,并且有一种方法从中获取收入。而且我们还为其他人在其上构建应用程序提供了开放源代码的系统或基础模型。如果这些应用程序对我们的客户没有用,我们可以从他们那里购买。

It could be that they will improve the platform. In fact, we see this already. I mean, there are literally millions of downloads of Llama 2 and dozens of people who have provided ideas about how to make it better. So this clearly accelerates progress, to make the system available to a wide community of people. And there are literally thousands of businesses who are building applications with it. So Meta's ability to derive revenue from this technology is not impaired by the distribution of base models in open source.
他们可能会改进这个平台。实际上,我们已经看到了这一点。我是说,Lama 2已经有数百万次的下载,还有数十人提供了关于如何改进的想法。这显然加快了让这个系统面向广大群体的进展。而且还有成千上万的企业正在用它来构建应用程序。因此,Meta从这项技术中获取收入的能力并不会受到开源基础模型的影响。

The fundamental criticism that Gemini is getting is that as you point out on the West Coast, just to clarify, we're currently in the East Coast where I was supposed Meta AI headquarters would be. So there are strong words about the West Coast. But I guess the issue that happens is, I think it's fair to say that most tech people have a political affiliation with the left wing. They lean left. And so the problem that people are criticizing Gemini with is that there's in that debiasing process that you mentioned that their ideological lean becomes obvious. Is this something that could be escaped? You're saying open source is the only way. Have you witnessed this kind of ideological lean that makes engineering difficult? No, I don't think it has to do. I don't think the issue has to do with the political leaning of the people designing those systems. It has to do with the acceptability or political leanings of their customer base or audience.
人们对Gemini的根本批评是,正如你在西海岸所指出的,只是为了澄清,我们目前在东海岸,我原本认为Meta AI总部会在东海岸。所以关于西海岸有一些强烈的言论。但我想发生的问题是,我认为可以公平地说,大多数技术人员都倾向于左翼的政治立场。他们倾向于左翼。人们批评Gemini的问题在于,在你提到的去偏见的过程中,他们的意识形态倾向变得明显。这种情况能够避免吗?你说开源是唯一的方法。你有见过这种让工程变得困难的意识形态倾向吗?不,我认为这与设计这些系统的人们的政治倾向无关。这与他们的客户群或受众的可接受性或政治倾向有关。

So a big company cannot afford to offend too many people. So they're going to make sure that whatever product they put out is safe, whatever that means. And it's very possible to overdo it. And it's also impossible to do it properly for everyone. You're not going to satisfy everyone. So that's what I said before. You cannot have a system that is unbiased, that is perceived as unbiased by everyone. It's going to be, you know, you push it in one way, one set of people are going to see it as biased, and then you push it the other way and another set of people is going to see it as biased. And then in addition to this, there's the issue of, if you push the system perhaps a little too far in one direction, it's going to be nonfactual. You're going to have, you know, black Nazi soldiers in the...
因此,一个大公司不能得罪太多人。因此,他们会确保无论他们推出什么产品,都是安全的,无论这意味着什么。过度做也是有可能的。对每个人都做得恰到好处也是有可能的。你不可能取悦每个人。这就是我之前说的。你不能拥有一个被认为是公正的系统,却被每个人视为公正。它会被,你把它推向一边,一群人会认为它是有偏见的,然后你把它推向另一边,另一群人会认为它是有偏见的。此外,还有一个问题是,如果你把系统推得太过火了,可能会出现事实不准确的情况。你会有,你知道,在军队里有黑人纳粹士兵...

Yes, we should mention image generation of black Nazi soldiers, which is not factually accurate. And can be offensive for some people as well. Right. So it's going to be impossible to kind of produce systems that are unbiased for everyone. So the only solution that I see is diversity. And diversity is full meaning of that word, diversity of in every possible way.
是的,我们应该提到黑人纳粹士兵形象的生成,这并不符合事实。对一些人来说,这也可能是令人不悦的。没错。因此,要生产出对每个人都没有偏见的系统可能是不可能的。我认为唯一的解决方案是多样性。而所谓的多样性,就是在一切可能的方面都要有多样性。

Yeah. Marc Andreessen just tweeted today. Let me do a TLDR. The conclusion is only startups and open source can avoid the issue that he's highlighting with Big Tech. He's asking, can Big Tech actually field generative AI products? One, ever escalating demands from internal activists, employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press, in quotes, experts and everything corrupting the output? Two, constant risk of generating a bad answer or drawing a bad picture or rendering a bad video. Who knows what is going to say or do at any moment? Three, legal exposure, product liability, slander, election law, many other things and so on. Anything that makes Congress mad. Four, continuous attempts to tighten grip on acceptable output to grade the model, like how good it actually is in terms of usable and pleasant to use and effective and all that kind of stuff. And five, publicity of bad text, images, video, actually puts those examples into the training data for the next version. So on. So he just highlights how difficult this is. From all kinds of people being unhappy. He said, you can't create a system that makes everybody happy. Yes.
是的。马克·安德里森今天刚刚推特了。让我来总结一下。结论是只有创业公司和开源软件能够避免他所强调的与大型科技公司相关的问题。他在问,大型科技公司真的能够推出产生式人工智能产品吗?首先,内部活动人士、雇员团体、疯狂的高管、破裂的董事会、压力团体、极端监管机构、政府机构、媒体、专家们和一切腐蚀产出的因素都在不断提出不断提升的要求吗?其次,不断面临生成错误答案、绘制错误图片、制作错误视频的风险。谁知道下一刻会说或做什么?再者,法律风险、产品责任、诽谤、选举法等等等等,任何可能让国会发怒的事情。接着,不断尝试加强对可接受产出的控制力,对模型的评分,比如它在可用性、使用愉悦性、有效性等方面究竟有多好。最后,糟糕文本、图片、视频的宣传实际上把这些例子放入了下个版本的训练数据中。所以,他强调了这个问题多么困难。因为各种人不满。他说,你不能创造一个让所有人都满意的系统。是的。

So if you're going to do the fine-tuning yourself and keep it closed source, essentially the problem there is then trying to minimize the number of people who are going to be unhappy. And you're saying that's almost impossible to do right, and that the better way is to do open source. Basically. Yeah. I mean, Marc is right about a number of things that he lists that indeed scare large companies. You know, certainly congressional investigations is one of them. Legal liability, you know, making things that get people to hurt themselves or hurt others. Big companies are really careful about not producing things of this type, because, you know, they don't want to hurt anyone, first of all. And then second, they want to preserve their business. So it's essentially impossible for systems like this, which can inevitably formulate political opinions and opinions about various things that may be political or not, but that people may disagree about, about, you know, moral issues and questions about religion and things like that, right? Or cultural issues that people from different communities would disagree with in the first place.
因此,如果您要自己进行微调并保持密切关注来源,那么基本上其中的问题就是试图最大限度地减少不满意的人数。你说几乎不可能做到这一点,最好的方式是开源。基本上。是的。我是说,他在列举的一些事情上是正确的,确实让大公司感到害怕。您知道,肯定有调查委员会调查其中之一。法律责任,你知道,制造会让人受伤或伤害他人的东西,大公司非常小心,不生产这类产品。因为他们,首先不想伤害任何人。其次,他们想保护自己的业务。因此,对于这样的系统来说,它基本上不可能。它不可避免地会制定政治观点和关于您所看到的内容的观点,这可能是政治问题,也可能不是,但人们可能对此有分歧,关于道德问题,宗教问题和不同社区的人们一开始就有不同意见的文化问题。

So there's only kind of a relatively small number of things that people will sort of agree on, you know, basic principles. But beyond that, if you want those systems to be useful, they will necessarily have to offend a number of people, inevitably. And so open source is just better. And then diversity is better, right? And open source enables diversity. That's right. Open source enables diversity. That's going to be a fascinating world where, if it's true that the open source world, if Meta leads the way and creates this kind of open source foundation model world, there's going to be, like, governments will have a fine-tuned model. And yeah. And then potentially, you know, people that vote left and right will have their own models and preferences to be able to choose.
所以人们只会就某些基本原则达成共识,而其他方面如果你希望这些系统有用,它们必然会触及一些人。所以开源就更好。而多样性也更好,对吧?开源能促进多样性。没错。开源能促进多样性。如果开源世界是正确的,如果金属是一种方式并创造了这种开源基础模型世界,那将是一个令人着迷的世界,政府将有精致的管道模型。而且,潜在地,左右两派的选民将有他们自己的模型偏好以供选择。

And it will potentially divide us even more, but that's on us humans. We get to figure out basically the technology enables humans to human more effectively. And all the difficult ethical questions that humans raise will just leave it up to us to figure it out. Yeah. I mean, there are some limits to what, you know, the same way there are limits to free speech, there has to be some limit to the kind of stuff that those systems might be authorized to to produce, you know, some guardrails. So I mean, that's one thing I've been interested in, which is in the type of architecture that we were discussing before, where the output of the system is a result of an inference to satisfy an objective that objective can include guardrails.
这将潜在地进一步分裂我们,但这取决于我们人类。我们得弄清楚,基本上技术使人类更有效地进行人际交往。所有人类提出的困难伦理问题将由我们来解决。是的。我的意思是,有些事情是有限制的,你知道,就像言论自由有限制一样,这些系统可能授权生成的内容也必须有一些限制,你知道,一些防护措施。所以我一直对一件事情感兴趣,那就是我们之前讨论的体系结构类型,在那里,系统的输出是为满足一个目标而进行推理的结果,这个目标可以包含防护措施。

And we can put guardrails in open source systems. I mean, if we eventually have systems that are built with this blueprint, we can put guardrails in those systems that guarantee that there is sort of a minimum set of guardrails that make the system non-dangerous and non-toxic, et cetera, you know, basic things that everybody would agree on. And then, you know, the fine-tuning that people will add or the additional guardrails that people will add will kind of cater to their community, whatever it is. And yeah, the fine-tuning will be more about the gray areas of what is hate speech, what is dangerous and all that kind of stuff. I mean, you have different value systems. The value systems. I mean, like, but still, even with the objectives of how to build a bio weapon, for example, I think something you've commented on, or at least there's a paper where a collection of researchers is just trying to understand the social impacts of these LLMs. And I guess one threshold that's nice is, does the LLM make it any easier than a search would, like a Google search would? Right.
我们可以在开源系统中设置防护栏。我的意思是,如果最终我们建造的系统是按照这个蓝图设计的,我们可以在这些系统中设置防护栏,确保系统具有一套最基本的防护栏,使系统变得非危险性和非有毒,等等,你知道的,基本上每个人都会同意的东西。然后,你知道,人们会添加的微调或者额外的防护栏会更多地迎合他们的社区,无论是什么。是的,微调会更多地涉及到什么是仇恨言论,什么是危险的等等灰色地带。我是说,你有不同的价值观系统。价值观系统。我是说,但即使是关于如何构建生物武器的目标,例如,我认为你已经评论过的事情,或者至少有一篇论文是我们一群研究人员仅仅试图理解这些 LO 嵌入对社会影响。我猜一个阈值就很好。就像,LO 嵌入是否比谷歌搜索更易于实现?对。

So the increasing number of studies on this seems to point to the fact that it doesn't help. So having an LLM doesn't help you design or build a bio weapon or a chemical weapon if you already have access to, you know, a search engine and a library. And so the sort of increased information you get, or the ease with which you get it, doesn't really help you. That's the first thing. The second thing is, it's one thing to have a list of instructions of how to make a chemical weapon, for example, or a bio weapon. It's another thing to actually build it. And it's much harder than you might think, and an LLM will not help you with that. In fact, you know, nobody in the world, not even, you know, countries use bio weapons, because most of the time they have no idea how to protect their own populations against it. So it's too dangerous actually to kind of ever use. And it's in fact banned by international treaties. Chemical weapons is different. It's also banned by treaties, but it's the same problem. It's difficult to use in situations where it doesn't turn against the perpetrators.
因此,关于这个问题正在增加的研究似乎表明,它并不起到帮助作用。拥有一个限制性的信息获取能力并不能帮助你设计或制造生物武器或化学武器,如果你已经可以获得,或者说,拥有搜索引擎和它们的图书馆。因此,你获取的信息量增加或者获取信息的便利程度并不能真正帮助你。这是第一点。第二点是,有一个制造化学武器或生物武器的指导清单是一回事,实际上制造出来是另一回事。这比你想象的要困难得多,限制性的信息获取能力并不能帮助你。实际上,世界上没有人,甚至是国家使用生物武器,因为大多数时候他们根本不知道如何保护自己的人口免受其影响。实际上,它实际上是太危险了以至于永远不会被使用。事实上,根据国际条约,它是被禁止的。化学武器是不同的。它也被条约禁止,但是同样的问题。在不反击制造者的情况下使用是困难的。

But we could ask it on the list. Like, I can give you a very precise list of instructions of how you build a rocket engine. And even if you have a team of 15 engineers that are really experienced building it, you're still going to have to blow up a dozen of them before you get one that works. And you know, it's the same with, you know, chemical weapons or bioweapons or things like this. It requires expertise, you know, in the real world that the LLM is not going to help you with. And it requires even the common sense expertise that we've been talking about, which is how to take language-based instructions and materialize them in the physical world. That requires a lot of knowledge that's not in the instructions. Yeah, exactly. A lot of biologists have posted on this actually, in response to those things, saying like, do you realize how hard it is to actually do the lab work? Like, no, this is not trivial. Yeah. And that's where Hans Moravec comes to light once again. Just to linger on Llama: Mark announced that Llama 3 is coming out eventually. I don't think there's a release date. But what are you most excited about? First, the Llama 2 that's already out there. And maybe the future, Llama 3, 4, 5, 6, 10, just the future of open source under Meta.
但是我们可以在清单上提问。比如,我可以给你一个非常精确的建造火箭发动机的指令清单。即使你有一个由15名经验丰富的工程师组成的团队在建造,你仍然需要炸毁十几个才能有一个正常工作的。你知道,化学武器或生物武器也是一样的。这需要专业知识,在现实世界中底层知识是不能帮到你的。这还需要我们讨论过的常识性专业知识,就是如何将基于语言的指令具象化到物理世界中需要大量知识。这并不在指令中。是的,很多生物学家实际上对此发表了意见,说实际上进行实验室工作有多么困难。我知道这并不是小菜一碟。是的。汉斯·马尔瓦克再次引起了关注。只是马克宣布llama三最终会发布。我不认为有一个确切的日期。但你最期待的是什么?首先是已经发布的llama二。也许未来会有llama三、四、五、六、十,以及开源元数据的未来。

Well, a number of things. So there's going to be like various versions of Llama that are, you know, improvements of previous Llamas: bigger, better, multimodal, things like that. And then in future generations, systems that are capable of planning, that really understand how the world works. Maybe they are trained from video, so they have some world model, maybe, you know, capable of the type of reasoning and planning I was talking about earlier. Like, how long is that going to take? Like, when is the research that is going in that direction going to sort of feed into the product line, if you want, of Llama? I don't know, I can't tell you. And there's, you know, a few breakthroughs that we have to basically go through before we can get there. But you'll be able to monitor our progress because we publish our research, right?
嗯,有几件事情。所以将会有各种不同版本的羊驼,是的,是对之前羊驼的改进,更大、更好的多模态等等。然后在未来的一代中,会有能够真正理解世界运作方式的规划系统。也许是通过视频训练的。因此他们有一些世界模型,也许,能够进行之前我所说的那种推理和规划。那需要多长时间?研究朝着这个方向的研究什么时候会融入产品线中呢?如果你想了解有关羊驼的,请问,我不知道,我可以告诉你。而且,在我们能够达到那里之前,我们还需要经历一些突破。但你可以监控我们的进展,因为我们会发布我们的研究成果,对吧?

So, you know, last week we published the V-JEPA work, which is sort of a first step towards training systems from video. And then the next step is going to be world models based on kind of this type of idea, training from video. There's similar work taking place at DeepMind, and also at UC Berkeley, on world models and video. A lot of people are working on this. I think a lot of good ideas are appearing. My bet is that those systems are going to be JEPA-like; they're not going to be generative models. And we'll see what the future will tell. There's really good work by a gentleman called Danijar Hafner, who's now at DeepMind, who's worked on kind of models of this type that learn representations and then use them for planning or learning tasks by reinforcement training. And a lot of work at Berkeley by Pieter Abbeel, Sergey Levine, a bunch of other people of that type, whom I'm actually collaborating with in the context of some grants, with my NYU hat. And then collaborations also through Meta, because the lab at Berkeley is associated with Meta in some way, so with FAIR.
所以,你知道,如果上周我们发表了vjapa的工作,这是迈向视频训练系统的第一步。接下来的步骤将是基于这种类型的想法建立世界模型,从视频中进行训练。类似的工作在某种程度上也被定义出现在人们中间。在加州大学伯克利分校也有关于世界模型和视频的工作,有很多人在进行这方面的研究。我认为有很多好的想法正在涌现。我认为那些系统将会类似于vjapa,它们不会是生成模型。我们将看到未来会有怎样的发展。 有一个叫做Danyj Raffner的绅士有很出色的工作,他现在加入了deep-mind,他的研究是关于这种类型的模型,学习表示然后通过强化训练来计划或学习任务。伯克利大学也有很多人在进行类似的工作,像Peter Ibiot,Sagnilla Veen等等。实际上我还在一些资助项目中与他们合作,作为我在NYU的代表。而且还通过meta进行合作,因为伯克利的实验室在某种程度上与meta有关联,也与FAIR有合作。

So I think it's very exciting. You know, I think I'm super excited about it. I haven't been that excited about, like, the direction of machine learning and AI, you know, since 10 years ago when FAIR was started, and before that, 30 years ago, let's say 35, when we were working on convolutional nets and the early days of neural nets. So I'm super excited because I see a path towards potentially human-level intelligence with, you know, systems that can understand the world, remember, plan, reason. There is some set of ideas to make progress there that might have a chance of working. And I'm really excited about this. What I like is that, you know, we get a good opportunity to actually make some real contributions in a long-term direction and perhaps succeed before my brain turns to a white sauce or before I need to retire.
所以我觉得这非常令人兴奋。你知道,我对机器学习和人工智能的发展方向感到非常激动,自从10年前FAIR成立以来,我就没有像现在这样激动过了。在那之前,30年前,我们在研究神经网络和神经网络的早期阶段。所以我非常激动,因为我看到了一条可能通往人类智能水平的道路,通过可以理解世界、记忆、规划和推理的系统。在那方面取得进展的一些想法可能会取得成功。我对此感到非常兴奋。我喜欢的是,我们有机会实际上为实现长期的大规模神经网络做出一些真正的贡献,并且也许能在我的大脑变成白色之前或者在我需要退休之前实现成功。

Yeah. Yeah. Are you also excited by, is it beautiful to you, just the amount of GPUs involved, sort of the whole training process on this much compute? Just zooming out, just looking at Earth: humans together have built these computing devices and are able to train this one brain, which we then open source, like giving birth to this open source brain trained on this gigantic compute system. There's just the details of how to train on that, how to build the infrastructure and the hardware, the cooling, all of this kind of stuff. Or is most of your excitement still in the theory aspect of it, meaning like the software? Well, I used to be a hardware guy many years ago. Yes. Decades ago. Hardware has improved a little bit. Changed a little bit. Yeah. I mean, certainly scale is necessary, but not sufficient.
是的。你也被这个 GPU 参与的数量感到兴奋,整个训练过程在这么多计算中进行就像缩放出来看地球和人类一起构建这些计算设备,并且能够训练这个大脑。然后我们就开源了,像是让这个在巨大计算系统上训练的开源大脑诞生一样。只是一些细节,如如何在上面进行培训、如何构建基础设施和硬件、冷却等等。还是你最感兴奋的是理论方面,比如软件的意义。嗯,许多年前还是硬盘。是的,几十年前。硬件改进了一点点。改进了一点。是的。我是说,规模当然是必要的,但并不足够。

Absolutely. So we certainly need computation. I mean, we're still far in terms of compute power from what we would need to match the compute power of the human brain. This may occur in the next couple of decades, but we're still some ways away. And certainly in terms of power efficiency, we're really far. So there's a lot of progress to make in hardware. Right now, a lot of the progress is not, I mean, there's a bit coming from silicon technology, but a lot of it is coming from architectural innovation, and quite a bit is coming from more efficient ways of implementing the architectures that have become popular, basically combinations of transformers and ConvNets. So there's still some ways to go until we're going to saturate. We're going to have to come up with new principles, new fabrication technology, new basic components, perhaps based on sort of different principles than classical digital CMOS. Interesting.
当然。我们肯定需要竞争。我的意思是,从计算能力方面来看,我们还远远不及匹敌人脑的计算能力所需的水平。也许在接下来的几十年内会出现这种情况,但我们仍有一段距离要走。而且在功耗效率方面,我们还有很长的路要走。因此,在硬件方面还有很多进展要做。目前, 很多进展并不是来自硅技术, 而是来自架构创新, 以及更有效地实现已经流行的架构的方法。因此, 我们还有一些路要走, 直到我们需要饱和, 我们必须提出新的原则、新的制造技术、新的基本组件, 或许基于不同原则中心, 与传统的数字CMOS不同。有趣的。

So you think in order to build AMI, we potentially might need some hardware innovation too. Well, if you want to make it ubiquitous, yeah, certainly, because we're going to have to reduce the, you know, power consumption of compute. A GPU today, right, is half a kilowatt to a kilowatt. The human brain is about 25 watts. And a GPU is way below the compute power of the human brain. You need, you know, something like 100,000 or a million of them to match it. So, you know, we are off by a huge factor.
所以你认为为了建立AMI,我们需要,可能还需要一些硬件创新。嗯,如果你想让它普及化,那当然,因为我们需要降低计算功耗。今天的GPU大约是半千瓦到一千瓦。人类大脑大约是25瓦。而GPU的功耗远低于人类大脑。你需要,你知道,约100,000到一百万个GPU才能匹敌它。所以,我们的差距是巨大的。
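
As a rough worked version of the "huge factor" mentioned above, the quoted figures can be multiplied out. This is only a back-of-the-envelope sketch: the wattages and GPU counts are the approximate numbers given in the conversation, and the function name `power_gap` is made up for illustration.

```python
# Approximate figures quoted in the conversation, not measurements.
GPU_POWER_WATTS = (500, 1000)               # "half a kilowatt to a kilowatt"
BRAIN_POWER_WATTS = 25                      # "about 25 watts"
GPUS_TO_MATCH_BRAIN = (100_000, 1_000_000)  # "100,000 or a million"


def power_gap(gpu_watts, n_gpus, brain_watts=BRAIN_POWER_WATTS):
    """Return how many times more power n_gpus draw than one brain."""
    return gpu_watts * n_gpus / brain_watts


# Low end: cheapest GPU, fewest GPUs. High end: the opposite.
low = power_gap(GPU_POWER_WATTS[0], GPUS_TO_MATCH_BRAIN[0])
high = power_gap(GPU_POWER_WATTS[1], GPUS_TO_MATCH_BRAIN[1])
print(f"power gap: roughly {low:.0e}x to {high:.0e}x")
```

Under these assumptions the gap in power consumption comes out at around a million-fold or more, which is the sense in which the hardware is "off by a huge factor".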

You often say that AGI is not coming soon, meaning like not this year, not the next few years, potentially farther away. What's your basic intuition behind that? So first of all, it's not going to be an event, right? The idea somehow, which, you know, is popularized by science fiction and Hollywood, that, you know, somehow somebody is going to discover the secret, the secret to AGI or human-level AI or AMI, whatever you want to call it, and then, you know, turn on a machine and then we have AGI. That's just not going to happen. It's not going to be an event. It's going to be gradual progress. Are we going to have systems that can learn from video how the world works and learn good representations? Yeah. Before we get them to the scale and performance that we observe in humans, it's going to take quite a while. It's not going to happen in one day. Are we going to get systems that can have large amounts of associative memory, so they can remember stuff? Yeah, but same. It's not going to happen tomorrow. I mean, there are some basic techniques that need to be developed. We have a lot of them, but, you know, to get this to work together with full systems is another story. Are we going to have systems that can reason and plan, perhaps along the lines of the objective-driven AI architectures that I described before? Yeah, but before we get this to work, you know, properly, it's going to take a while.
你经常说AGI不会很快到来,意思是不是今年,也不是接下来的几年,可能要更远一点。你对此的基本直觉是什么?首先,它不会是一个事件,对吧?这种想法在科幻小说和好莱坞电影中被炒作,即某个人会发现AGI或人类水平AI或AMI的秘密,然后打开一台机器,然后我们就有了AGI。这是不会发生的。它不会是一个事件,而是会是渐进的进步。我们会有系统能够从视频中学习世界如何运作并学习好的网络展示吗?是的。但在将它们发展到我们观察到的人类水平的规模和性能之前,还需要一段时间。它不会在一天内发生。我们会有系统能够拥有大量的联想记忆吗,它们能够记住事情?是的,但同样,这也不会立即发生。这需要一些尚未开发的技术。我们有很多技术,但要将它们与完整系统结合起来运行还需要一段时间。我们会有系统能够推理和规划,可能沿着我之前描述的目标驱动的AI架构的路线吗?是的,但在让它们正常运行之前,还需要一段时间。

So, before we get all those things to work together, and then on top of this have systems that can learn hierarchical planning, hierarchical representations, systems that can be configured for a lot of different situations at hand the way the human brain can, you know, all of this is going to take at least a decade and probably much more, because there are a lot of problems that we're not seeing right now. We have not encountered them, and so we don't know if there is an easy solution within this framework. So, you know, it's not just around the corner. I mean, I've been hearing people for the last 12, 15 years claiming that, you know, AGI is just around the corner and being systematically wrong. And I knew they were wrong when they were saying it. I called their bullshit. Why do you think people have been calling? First of all, I mean, from the beginning, from the birth of the term artificial intelligence, there has been an eternal optimism that's perhaps unlike other technologies. Is it Moravec's paradox? Is that the explanation for why people are so optimistic about AGI? I don't think it's just Moravec's paradox. Moravec's paradox is a consequence of realizing that the world is not as easy as we think.
因此,在我们让所有这些东西一起工作之前,还要有像分层规划、分层表示这样可以学习的系统,以及可以针对很多不同情况进行配置的系统,就像人类大脑一样。你知道,所有这些至少需要花费十年甚至更多的时间,因为我们现在看不到很多问题。我们还没有遇到,所以我们不知道是否在这个框架内有简单的解决方案。所以,你知道,这并不是马上就会出现的。我是说,过去12、15年来,我一直听到人们声称,AGI就在不远处,结果都是错的。当他们说这些话时,我就知道他们是错的。我称之为荒谬。你认为人们为什么一直这样声称呢?首先,从人工智能这个词诞生之初开始,就一直有一种永恒的乐观主义,这可能与其他技术不同。这是马尔瓦克悖论吗?是人们为什么对AGI如此乐观的解释?我认为这不仅仅是马尔瓦克悖论。马尔瓦克悖论是认识到世界没有我们想象的那么简单的结果。

So first of all, intelligence is not a linear thing that you can measure with a scalar, with a single number. You know, can you say that humans are smarter than orangutans? In some ways, yes. But in some ways orangutans are smarter than humans, in a lot of domains that allow them to survive in the forest, for example. So IQ is a very limited measure of intelligence. To you, intelligence is bigger than what IQ, for example, measures? Well, IQ can measure, you know, approximately something for humans, because humans kind of, you know, come in relatively uniform form, right? But it only measures one type of ability that, you know, may be relevant for some tasks, but not others.
首先,智力不是一种可以用一个单一数字衡量的线性事物。你能说人类比猩猩更聪明吗?在某些方面,是的。但在某些方面,猩猩比人类更聪明,比如在许多领域中让它们在森林中生存。因此,智商是对智力的一种非常有限的衡量。你认为智力比例如IQ所衡量的更广阔吗?嗯,IQ可以大致衡量人类的某些方面。但因为人类在某种程度上是相对统一的,是吗?但它只衡量了一种类型的能力,这种能力也许对一些任务有用,但对其他任务则不然。

But then if you are talking about other intelligent entities for which, you know, the basic things that are easy for them are very different, then it doesn't mean anything. So intelligence is a collection of skills and an ability to acquire new skills efficiently. Right. And the collection of skills that a particular intelligent entity possesses or is capable of learning quickly is different from the collection of skills of another one. And because it's a multi-dimensional thing, because the set of skills is a high-dimensional space, you can't measure it; you cannot compare two things as to whether one is more intelligent than the other. It's multi-dimensional.
但是如果你在谈论其他智能实体,对于他们来说基本的东西很容易,那么这就意味着不同。因此,智能是一组技能以及有效获取新技能的能力。对。具有智能的特定实体拥有或能够快速学习的技能集合与另一个实体的技能集合不同。由于智能是多维的,技能集合是高维空间,你无法衡量,也无法比较两个实体谁更聪明。这是多维的。

So you push back against what are called AI doomers a lot. Can you explain their perspective and why you think they're wrong? Okay. So AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control and basically kill us all. And that relies on a whole bunch of assumptions that are mostly false. So the first assumption is that the emergence of superintelligence could be an event, that at some point we're going to figure out the secret and we'll turn on a machine that is superintelligent. And because we'd never done it before, it's going to take over the world and kill us all. That is false. It's not going to be an event.
所以你经常反对所谓的“AI末日论者”。你能解释一下他们的观点,以及为什么你认为他们是错的吗?好的。AI末日论者想象了各种各样的灾难情景,即AI可能会逃逸或控制,并在基本上杀死我们所有人。这基于一系列大多是错误的假设。首先的假设是,超级智能的出现可能会是一个事件,在某个时候我们会达到这一点,我们会找出秘密,然后启动一个超级智能的机器。因为我们以前从未这样做过,它将主宰世界并杀死我们所有人。这是错误的。这不会是一个事件。

We're going to have systems that are as smart as a cat, that have all the characteristics of human-level intelligence, but their level of intelligence would be like a cat or a parrot maybe, or something. And then we're going to work our way up to make those things more intelligent. And as we make them more intelligent, we're also going to put some guardrails in them and learn how to kind of put in guardrails so they behave properly. And we're not going to do this with just one; it's not going to be one effort, it's going to be lots of different people doing this. And some of them are going to succeed at making intelligent systems that are controllable and safe and have the right guardrails, and if some other goes rogue, then we can use the good ones to go against the rogue ones. So it's going to be my smart AI police against your rogue AI.
我们将拥有像猫一样聪明的系统,具有人类智能的所有特征,但它们的智能水平可能只相当于猫或鹦鹉。然后我们将进一步努力使这些系统更加智能化。随着我们使它们更加智能化,我们也将为它们设定一些防护措施,并学会如何使其表现得恰当。我们不会只依靠一个人做到这一点,这将是许多不同的人共同努力。其中一些人将成功制造出可控制的智能系统,如果我有正确的防护措施,甚至可以用好的系统对抗叛变的系统。因此,这将是我聪明的AI警察对抗你的叛逆AI。

So it's not going to be like we're going to be exposed to a single rogue AI that's going to kill us all. That's just not happening. Now there is another fallacy, which is the idea that because a system is intelligent, it necessarily wants to take over. And there are several arguments that make people scared of this, which I think are completely false as well. So one of them is, in nature, it seems to be that the more intelligent species are the ones that end up dominating the others, and even, you know, extinguishing the others, sometimes by design, sometimes just by mistake. And so there is a sort of thinking by which you say, well, if AI systems are more intelligent than us, surely they're going to eliminate us, if not by design, simply because they don't care about us. And that's just preposterous for a number of reasons. The first reason is they're not going to be a species. They're not going to be a species that competes with us. They're not going to have the desire to dominate, because the desire to dominate is something that has to be hardwired into an intelligent system. It is hardwired in humans. It is hardwired in baboons, in chimpanzees, in wolves, not in orangutans. This desire to dominate or submit or attain status in other ways is specific to social species. Non-social species like orangutans don't have it. And they are as smart as we are, almost. And to you, there's not a significant incentive for humans to encode that into the AI systems. And to the degree they do, there'll be other AIs that sort of punish them for it, or outcompete them over it. Well, there's all kinds of incentives to make AI systems submissive to humans. This is the way we're going to build them. And so then people say, oh, but look at LLMs. LLMs are not controllable. And they're right. LLMs are not controllable. But objective-driven AI, so systems that derive their answers by optimization of an objective, means they have to optimize this objective. And that objective can include guardrails. One guardrail is: obey humans. Another guardrail is: don't obey humans if it's hurting other humans, within limits. I've heard that before somewhere. I don't remember. Yes. Maybe in a book. Yeah. But speaking of that book, could there be unintended consequences also from all of this? No, of course. So this is not a simple problem. I mean, designing those guardrails so that the system behaves properly is not going to be a simple issue for which there is a silver bullet, for which you have a mathematical proof that the system can be safe. It's going to be a very progressive, iterative design process where we put those guardrails in such a way that the system behaves properly. And sometimes they're going to do something that was unexpected because the guardrail wasn't right. And we're going to correct them so that they do it right. The idea somehow that we can't get it slightly wrong, because if we get it slightly wrong, we all die, is ridiculous. We're just going to go progressively. And the analogy I've used many times is turbojet design. How did we figure out how to make turbojets so unbelievably reliable? I mean, those are incredibly complex pieces of hardware that run at really high temperatures for 20 hours at a time sometimes. And we can fly halfway around the world with a two-engine jetliner at near the speed of sound. How incredible is that? It's just unbelievable. And did we do this because we invented like a general principle of how to make turbojets safe? No, it took decades to kind of fine-tune the design of those systems so that they were safe. Is there a separate group within General Electric or Snecma or whatever that is specialized in turbojet safety? No, the design is all about safety, because a better turbojet is also a safer turbojet, a more reliable one. It's the same for AI. Do you need specific provisions to make AI safe? No, you need to make better AI systems, and they will be safe because they are designed to be more useful and more controllable.
因此,不会出现像我们会面对一个要杀死我们所有人的单个流氓人工智能那样的情况。这根本不会发生。现在还有另一个谬误,那就是因为一个系统是智能的,就必然想要掌权。有几个让人们感到害怕的论点,我认为完全是错误的。其中之一是,从自然界看,似乎更聪明的物种最终会支配其他物种。甚至有时是出于设计,有时只是出于错误。因此,有一种思维方式认为,如果人工智能系统比我们更智能,肯定会消灭我们。即使不是出于设计,只是因为他们不关心我们。出于许多理由考虑,这种想法不切实际。首先,它们不会成为一个物种。它们不会成为一个与我们竞争的物种。它们不会有想要控制的欲望,因为想要控制这种欲望必须硬编码到智能系统中。这是人类中固有的。这在狒狒、黑猩猩、狼等物种中也是固有的,而在错舌中不是固有的。这种想要支配或服从或以其他方式获得地位的欲望只属于社会物种。像我们错舌这样的非社会物种没有这种欲望。它们几乎和我们一样聪明。对于人类来说,没有足够的激励来将这种欲望编码进人工智能系统中。甚至到了那种程度,会有其他人工智能惩罚他们,或者与他们竞争。要让人工智能系统对人类进行虐待存在各种动机。这是我们将会构建它们的方式。所以有人说,但是看看LLM。LLM是不可控的。他们是正确的。LLM是不可控的。但是以目标为驱动的人工智能,即通过优化目标实现答案的系统,意味着他们必须优化其目标。而这个目标可以包含警示。其中之一是服从人类。另一个是不服从人类,如果这种行为伤害到其他人类。我以前在哪里听到过这个观点。我不记得了。是的。也许是在一本书中。是的。但是说到那本书,这一切也可能会导致意想不到的后果吗?当然了。因此,这不是一个简单的问题。设计这些警示规则,让系统表现正确,不会是一个简单的问题,也不会有一个解决问题的法宝,不可能有一个证明系统可以安全的数学证明。这将是一个非常逐步的迭代设计系统,我们将以这种方式设置这些警示规则,使系统表现正确。有时他们会做一些出乎意料的事情,因为警示规则不正确。我们将纠正他们,让他们做对。某种想法认为,如果我们稍微出错,我们就都会死亡,这是荒谬的。我们将逐步前进。我已经多次使用过的类比是涡轮喷气发动机的设计。我们是如何找出如何使涡轮喷气发动机如此令人信服的可靠性的呢?这些是非常复杂的硬件,有时会以很高的温度运行20个小时。有时我们可以以超音速飞行环球旅行。这是多么不可思议。但我们是因为我们发明了使涡轮喷气发动机安全的一般原则吗?不是的,我们花了数十年时间来微调这些系统的设计,使其安全。通用电气或斯内克玛尔内部有一个专门负责涡轮喷气发动机安全的独立小组吗?不,设计都关乎安全,因为一个更好的涡轮喷气发动机也是一个更安全的涡轮喷气发动机。因此对于人工智能也是一样。你需要特定的规定来确保人工智能的安全吗?不需要,你只需要设计更好的人工智能系统,它们将是安全的,因为它们被设计成更有用和更可控。
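
The idea that an objective-driven system picks its output by optimizing an objective that includes guardrails can be illustrated with a toy example. This is a minimal sketch, not the architecture discussed in the conversation: the function `choose_action`, the candidate actions, and the cost values are all hypothetical, and guardrails are modeled simply as hard constraints that rule an action out.

```python
import math


def choose_action(candidates, task_cost, guardrails):
    """Pick the candidate with the lowest total cost.

    task_cost maps an action to a number (lower is better).
    guardrails is a list of functions that return True when an
    action violates that guardrail; violations cost infinity.
    Returns None when every candidate violates some guardrail.
    """
    def total_cost(action):
        if any(violates(action) for violates in guardrails):
            return math.inf
        return task_cost(action)

    best = min(candidates, key=total_cost)
    return best if total_cost(best) < math.inf else None


# Hypothetical toy setup: the cheapest action by task cost alone
# is the harmful one, so the guardrail changes the outcome.
costs = {"comply": 1.0, "comply_harmfully": 0.2, "refuse": 5.0}
harms_humans = lambda action: action == "comply_harmfully"

chosen = choose_action(list(costs), costs.get, [harms_humans])
print(chosen)
```

In this toy setup the optimizer never returns the guardrail-violating action, however attractive its task cost. Getting real guardrails right is the iterative design problem described in the conversation; nothing here proves safety.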

So let's imagine a system, an AI system that's able to be incredibly convincing and can convince you of anything. I can at least imagine such a system, and I can see such a system being weapon-like because it can control people's minds. We're pretty gullible. We want to believe a thing, and you can have an AI system that controls it, and you could see governments using that as a weapon. So do you think, if you imagine such a system, there's any parallel to something like nuclear weapons? No. So why? Why is that technology different? So you're saying there's going to be gradual development? Yeah. It's going to be, I mean, it might be rapid, but it'll be iterative, and then we'll be able to kind of respond and so on.
因此,让我们想象一个系统,一个AI系统,能够非常令人信服,并能说服你相信任何事情。我至少可以想象到这样一个系统,我可以看到这样一个系统类似于武器,因为它能控制人们容易上当的心理。我们想要相信一件事情,而你可以拥有一个控制它的AI系统,你可以看到政府将其用作武器。所以你认为如果你想象这样一个系统,会有什么类似于核武器的平行之处吗?不。那么是为什么呢?为什么这项技术不同呢?所以你是说会有渐进式的发展?是的。它会是,我的意思是可能会很快,但会是渐进的,然后我们会能够做出回应等。

So that AI system designed by Vladimir Putin or whatever, or his minions, is going to be trying to talk to every American to convince them to vote for whoever pleases Putin or whatever, or, you know, rile people up against each other as they've been trying to do. They're not going to be talking to you. They're going to be talking to your AI assistant, which is going to be as smart as theirs. Right? Because, as I said, in the future every single one of your interactions with the digital world will be mediated by your AI assistant. So the first thing you're going to ask is: is this a scam? Like, is this thing telling me the truth? It's not even going to be able to get to you, because it's only going to talk to your AI assistant. It's going to be like a spam filter. Right? You're not even seeing the email, the spam email. Right? It's automatically put in a folder that you never see.
普京设计的人工智能系统或其他人设计的系统,将试图与每个美国人交谈,说服他们投票给普京支持的候选人,或者煽动人们之间的对立,正如他们一直在努力做的那样。他们不会与你交谈,他们将与你的人工智能助手交谈,后者将和他们一样聪明。对吧?因为如我所说的,将来你在数字世界中的每一次互动都将由你的人工智能助手来中介。所以第一件事情你会问自己的是,这是不是诈骗?这个东西是不是把我带向真相?它甚至不会直接与你交流,因为它只会和你的人工智能助手说话,你的人工智能助手会像一种垃圾邮件过滤器一样。对吧?你甚至看不到那封垃圾邮件。它会自动被放在你从未看见的文件夹里。

It's going to be the same thing. That AI system that tries to convince you of something is going to be talking to your AI assistant, which is going to be at least as smart as it is, and is going to say, this is spam. You know, it's not even going to bring it to your attention. So to you it's very difficult for any one AI system to take such a big leap ahead to where it can convince even the other AI systems. So there's always going to be this kind of race where nobody's way ahead. That's the history of the world. The history of the world is, you know, whenever there is progress someplace, there is a countermeasure, and, you know, it's a cat and mouse game. Mostly yes, but this is why nuclear weapons are so interesting, because that was such a powerful weapon that it mattered who got it first.
这将是一样的情况。那个试图说服你的人工智能系统将会与你的人工智能助手交谈,后者至少会和前者一样聪明,会认为这是垃圾邮件。你知道,它甚至不会提醒你。因此,对于任何一种人工智能系统来说,要取得这么大的进步,以至于能够说服其他人工智能系统,是非常困难的。因此,总会有这种种类的竞争,没有人能领先太远。这就是世界的历史。世界的历史就是,你知道,每当某个地方取得进展,总会有对策,这就是一场猫鼠游戏。这就是为什么大多数时候是这样的,但这也是为什么核武器如此有趣的原因,因为它是如此强大的一种武器,首先得到它的人很重要。

You know, you could imagine Hitler, Stalin, Mao getting the weapon first, and that having a different kind of impact on the world than the United States getting the weapon first. To you, nuclear weapons is like, you don't imagine a breakthrough discovery and then a Manhattan Project-like effort for AI? No. As I said, it's not going to be an event. It's going to be continuous progress, and whenever, you know, one breakthrough occurs, it's going to be widely disseminated really quickly, probably first within industry. I mean, this is not a domain where, you know, government or military organizations are particularly innovative, and they're in fact way behind. And so this is going to come from industry, and this kind of information disseminates extremely quickly.
你知道,你可以想象希特勒、斯大林、毛泽东先获取核武器,对世界产生不同的影响,然后是美国先获取核武器。对于你来说,核武器就像你无法想象会有一个突破性的发现,然后像人工智能的曼哈顿计划一样。就像我之前说的,这不会是一个事件。这将是持续的进展,每当一个突破发生时,它会迅速被广泛传播,很可能首先在工业界。我的意思是,这不是一个领域,政府或军事组织特别创新,实际上它们远远落后,因此这将来自于工业,并且这种信息会传播得非常迅速。

We've seen this over the last few years. You know, even take AlphaGo. This was reproduced within three months, even without particularly detailed information. Yeah, this is an industry that's not good at secrecy. No, but even if there is, just the fact that you know that something is possible makes you realize that it's worth investing the time to actually do it. You may be the second person to do it, but, you know, you'll do it. And, you know, same for all the innovations of self-supervised learning, transformers, decoder-only architectures, LLMs. I mean, for those things, you don't need to know exactly the details of how they work to know that it's possible, because it's deployed and then it's getting reproduced. And then, you know, people who work for those companies move.
在过去几年中,我们已经看到了这种情况,您甚至都没有使用AlphaGo。即使没有特别详细的信息,也在三个月内得以复制。是的,这个行业不擅长保密。但是,即使只是知道某件事是可能的,这就让您意识到值得投入时间来实际做这件事。您可能是第二个做到这件事的人,但您知道自己能做到,并且,除了细胞监视和变压器的创新之外,解码器架构只是一种理论。我的意思是,您不需要准确了解它们的工作原理的细节,就知道这是可能的,因为它已经被部署,被复制,然后,那些为这些公司工作的人会离职。

They go from one company to another, and, you know, the information disseminates. What makes the success of the US tech industry, and Silicon Valley in particular, is exactly that: information circulates really, really quickly and disseminates very quickly. And so, you know, the whole region is sort of ahead because of that circulation of information. So maybe just to linger on the psychology of AI doomers: you give, in the classic Yann LeCun way, a pretty good example of just when a new technology comes to be. You say: an engineer says, "I invented this new thing. I call it a ball pen." And then the Twittersphere responds, "OMG, people could write horrible things with it, like misinformation, propaganda, hate speech. Ban it now!" Then writing doomers come in, akin to the AI doomers: "Imagine if everyone can get a ball pen. This could destroy society. There should be a law against using ball pens to write hate speech. Regulate ball pens now." And then
他们从一家公司转移到另一家公司,你知道信息会传播开来。美国科技产业和硅谷的成功之处在于信息传播非常迅速,因此整个地区因为信息的流通而领先。或许我只想仔细探讨一下AI灾难论者的心理,你在经典《尧文·莱昆卫》中给出了一个很好的例子,就是当一项新技术诞生时。你说工程师说我发明了这个新东西,我叫它圆珠笔,然后Twitter上的回应是天啊,人们可以用它写出可怕的事情,比如虚假信息、宣传、仇恨言论,立刻禁止它。然后写作灾难论者走进来,类似于AI灾难论者,想象一下如果每个人都可以得到一个圆珠笔,这可能会摧毁社会,应该制定法律禁止使用圆珠笔定期写仇恨言论。

the pencil industry mogul says, "Yeah, ball pens are very dangerous. Unlike pencil writing, which is erasable, ball pen writing stays forever. Government should require a license for a pen manufacturer." I mean, this does seem to be part of human psychology when it comes up against new technology, but what deep insights can you speak to about this? Well, there is a natural fear of new technology and the impact it can have on society, and people have kind of an instinctive reaction to, you know, the world they know being threatened by major transformations that are either cultural phenomena or technological revolutions. And they fear for their culture, they fear for their job, they fear for, you know, the future of their children and their way of life, right? So any change is feared. And you see this, you know, along history: any technological revolution or cultural phenomenon was always accompanied by, you know, groups or reactions in the media that basically attributed all the current problems of society to that particular change, right? Electricity was going to kill everyone at some point. You know, the train was going to be a horrible thing because, you know, you can't breathe past 50 kilometers an hour. And so there's a wonderful website called the Pessimists Archive, which has all those newspaper clips of all the horrible things people imagined would arrive because of either a technological innovation or a cultural phenomenon. You know, there are wonderful examples of jazz or comic books being blamed for unemployment or, you know, young people not wanting to work anymore, and things like that, right? And that has existed for centuries, and it's, you know, knee-jerk reactions. The question is, you know, do we embrace change or do we resist it? And what are the real dangers, as opposed to the imagined ones? So people worry about, I think one thing they worry about with big tech, something we've been talking about over and over but I think worth mentioning again: they worry about how powerful AI will be, and they worry about it being in the hands of one centralized power, of just a handful of central control. And so that's the skepticism with big tech: these companies can make a huge amount of money and control this technology, and by so doing, you know, take advantage of and abuse the little guy in society. Well, that's exactly why we need open source platforms.
这位铅笔行业大亨说,是的,圆珠笔非常危险,不像铅笔写字那样可以擦掉,圆珠笔写字是永久的,政府应该要求笔制造商有许可证。当人类心理遇到新技术时,似乎会出现这种情况,但你能就此谈谈深刻见解吗?嗯,人们对新技术有一种自然的恐惧,担心其对社会的影响,人们对他们熟悉的世界被文化现象或技术革命所威胁有一种上瘾的反应,并担心他们的文化、工作、孩子的未来和生活方式。任何变化都会引发恐惧,你可以看到很长一段历史,任何技术革命或文化现象总是伴随着媒体对那种特定变化所造成的当前社会问题的指责,比如电力最终会杀死每个人,火车会成为可怕的事物,因为你在50公里每小时的速度以下无法呼吸。有一个名为"悲观者档案"的网站,收集所有这些新闻剪报,展示了人们想象的各种可怕的事情,就像技术创新或文化现象会带来的那样。这是一个很好的例子,比如爵士乐或漫画被指责导致失业,年轻人不愿意工作等等。这种情况存在了几个世纪,它是一种条件反射的反应。问题是,我们是接受变革还是抵抗它,真正的危险是什么,而不是想象的危险。人们担心一件事,我认为大家一直在谈论的是大科技,我认为值得再次提及,他们担心AI的强大,他们担心它掌握在一个中央权力手中或只有我们少数几个人控制,这是对大科技的怀疑,这些公司可以赚取巨额利润并控制这种技术,从而利用它滥用社会中的小人物,这正是为什么我们需要开放源码平台。

Yeah, I just wanted to nail the point home more and more. Yes. So let me ask you on your, like I said, you do get a little bit, you know, flavorful on the internet. Joscha Bach tweeted something that you LOL'd at, in reference to HAL 9000. Quote: "I appreciate your argument and I fully understand your frustration, but whether the pod bay doors should be opened or closed is a complex and nuanced issue." So you're at the head of Meta AI. You know, this is something that really worries me, that our AI overlords will speak down to us with corporate speak of this nature, and you sort of resist that with your way of being. Is this something you can just comment on, sort of working at a big company: how you can avoid the over-fearing, I suppose, that through caution creates harm? Yeah, again, I think the answer to this is open source platforms, and then enabling a widely diverse set of people to build AI assistants that represent the diversity of cultures, opinions, languages, and value systems across the world, so that you're not bound to just, you know, be brainwashed by a particular way of thinking because of a single AI entity. So I mean, I think it's a really, really important question for society. And the problem I'm seeing, which is why I've been so vocal and sometimes a little sardonic about it... Never stop. Never stop, Yann. We love it. ...is because I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. If we really want, you know, diversity of opinion in AI systems, in a future where we'll all be interacting through AI systems, we need those to be diverse for the preservation of diversity of ideas and, you know, creeds and political opinions and whatever, and the preservation of democracy. And what works against this is people who think that, for reasons of security, we should keep AI systems under lock and key because it's too dangerous to put it in the hands of everybody, because it could be used by terrorists or something. That would lead to, you know, potentially a very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems.
是的,我只是想更加强调一下,所以让我问问你,就像我说的,你在网上有时候会有点“口味”独特,你的莎·巴克发了一条推文,提到地狱9,000的引用,我理解你的观点并完全理解你的 frustr 那么关于船舱门是否应该打开或关闭这个问题,它是一个复杂而微妙的问题,你作为 Meta AI 的负责人,我真的很担心 AI 或 AI 统治者会用这种企业说辞对我们说话,而你以自己的方式抵制了这种说辞,这是你可以评论的吗?在大公司工作时如何避免過度恐懼我想答案就是开源平台以及让广泛多样化的人群建立代表世界各地文化观点语言和价值观的 AI 助理,这样你就不会被某种思维方式所束缚,因为有单一的 AI 实体,所以我认为这是一个非常重要的社会问题,而我看到的问题是这种权力集中在专有的 AI 系统中比其他任何事情都更具危险性,如果我们真的希望有多样性观点的 AI 系统在将来我们都将通过 AI 系统进行交互,我们需要这些系统的多样性来保护思想的多样性,创建政治观点以及维护民主,而违背这一点的是那些认为出于安全原因,我们应该把 AI 系统锁在钥匙之下,因为把它交给所有人太危险了,因为它可能会被恐怖分子利用,导致一个可能非常糟糕的未来,在这个未来里我们所有的信息饮食都受到少数公司拥有的专有系统的控制。

Do you trust humans with this technology to build systems that are on the whole good for humanity?
Isn't that what democracy and free speech is all about?
I think so.
Do you trust institutions to do the right thing? Do you trust people to do the right thing? And yeah, there's bad people who are going to do bad things, but they're not going to have superior technology to the good people. So then it's going to be my good AI against your bad AI, right? I mean, there's the examples that we were just talking about: maybe some rogue country will build some AI system that's going to try to convince everybody to go into a civil war or something, or elect a favorable ruler. But then they will have to go past our AI systems, right?
An AI system with a strong Russian accent will be trying to convince our...
And doesn't put any articles in their sentences.
Well, it'll be at the very least absurdly comedic. Okay. So since we talked about the physical reality, I'd love to ask your vision of the future with robots in this physical reality. Many of the kinds of intelligence you've been speaking about would empower robots to be more effective collaborators with us humans. Since Tesla's Optimus team has been showing us some progress on humanoid robots, I think it really reinvigorated the whole industry that I think Boston Dynamics has been leading for a very, very long time. So now there's all kinds of companies: Figure AI, obviously Boston Dynamics...
Unitree.
Unitree, but there's a lot of them. It's great. I mean, I love it. So do you think there'll be millions of humanoid robots walking around soon?
Not soon, but it's going to happen. The next decade, I think, is going to be really interesting in robots. The emergence of the robotics industry has been in the waiting for 10, 20 years without really emerging, other than for kind of pre-programmed behavior and stuff like that. And the main issue is, again, the Moravec paradox: how do we get the system to understand how the world works and plan actions? We can do it for really specialized tasks. And the way Boston Dynamics goes about it is basically with a lot of handcrafted dynamical models and careful planning in advance, which is very classical robotics with a lot of innovation and a little bit of perception, but they still can't build a domestic robot, right? And we're still some distance away from completely autonomous level five driving, and we're certainly very far away from having level five autonomous driving by a system that can train itself by driving 20 hours, like any 17-year-old. So until we have, again, world models, systems that can train themselves to understand how the world works, we're not going to have significant progress in robotics. So a lot of the people working on robotic hardware at the moment are betting or banking on the fact that AI is going to make sufficient progress towards that.
And they're hoping to discover a product in it too.
Yeah.
Before you have a really strong world model, there'll be an almost strong world model, and people are trying to find a product in a clumsy robot, I suppose, not a perfectly efficient robot. So there's the factory setting, where humanoid robots can help automate some aspects of the factory. I think that's a crazy difficult task because of all the safety required and all this kind of stuff.
你相信人类能够运用这项技术构建总体上有益于人类的系统吗?这不就是民主和言论自由所关乎的吗?因此,你相信机构能做出正确的选择吗?你相信人们会做正确的事情吗?是的,坏人会做坏事,但他们不会拥有比好人更先进的技术,所以这将是我的好人人工智能对抗你的坏人人工智能,对吧?我的意思是,刚才我们讨论的例子,可能某个流氓国家将建立某种人工智能系统,试图说服所有人陷入内战或选举有利于自己的统治者,但他们将不得不绕过我们的人工智能系统,而那种带有浓重俄罗斯口音的人工智能系统将试图说服别人,但是说话不加任何修饰的话语,这至少会变得荒谬可笑。那么,既然我们谈到了现实中的物理现实,我想问一下您对未来与机器人在这个物理现实中的愿景。您说的许多种类的智能将使机器人能够更有效地与我们人类合作,因此,自从特斯拉的Optimus团队开始展示人机器人方面的一些进展以来,我觉得这确实激发了整个行业的活力,我认为波士顿动力公司一直处于领先地位很长一段时间,所以现在有各种各样的公司在研究人工智能,显然有波士顿动力,还有其他很多公司,这很棒,很棒。我喜欢。那么,您认为不久会有数百万人工机器人在身边行走吗?不会很快,但会发生,就在下一个十年里,我认为未来将会很有趣。像机器人行业的崛起已经等待了10-20年,但除了一些类似游戏行为之类的东西之外,一直没有真正崛起,主要问题仍然是一个悖论,如何让系统理解世界如何运作并规划行动,所以我们可以针对非常专业的任务来做。波士顿动力的方法基本上是采用大量手工制作的动态模型和提前仔细的规划,这在很大程度上是经典机器人技术,有许多创新,也有一些感知,但他们仍然无法制造家庭机器人。而且,我们距离完全自主的5级驾驶还有一段距离,当然,我们离能够通过系统自己驾驶20个小时来训练自己的5级自主驾驶还很遥远,除非我们拥有可以自我训练的世界模型系统来理解世界如何运作,我们不会在机器人领域取得重大进展。目前从事机器人硬件研究的许多人正在押注或信任人工智能会在这方面取得足够的进展,并希望在其中发现产品。在拥有非常强大的世界模型之前,将会有一个几乎强大的世界模型,人们试图在笨拙的机器人中找到产品,而非完全有效率的机器人。在工厂设置中,人工机器人可以帮助自动化工厂的某些方面,我认为这是一个非常困难的任务,因为需要考虑所有的安全要求等等。

I think in the home is more interesting. But then you start to think... I think you mentioned loading the dishwasher, right?
Yeah.
I suppose that's one of the main problems you're working on.
I mean, there's cleaning up, cleaning the house, clearing up the table after a meal, washing the dishes, cooking. All the tasks that in principle could be automated, but are actually incredibly sophisticated, really complicated.
But even just basic navigation around a space full of uncertainty.
That sort of works. You can sort of do this now. Navigation is fine.
Well, navigation in a way that's compelling to us humans is a different thing.
Yeah, it's not going to be, necessarily. I mean, we have demos, actually, because there is a so-called Embodied AI group at FAIR, and they've been not building their own robots but using commercial robots. You can tell the robot dog, "Go to the fridge," and it can actually open the fridge, and it can probably pick up a can in the fridge and bring it to you. So it can navigate, and it can grab objects as long as it's been trained to recognize them, and vision systems work pretty well nowadays. But it's not a completely general robot that would be sophisticated enough to do things like clearing up the table.
Yeah, to me that's an exciting future, getting humanoid robots, robots in general, in the home more and more, because it gets humans to really directly interact with AI systems in the physical space. And in so doing, it allows us to philosophically and psychologically explore our relationships with robots. It can be really, really interesting. So I hope you make progress on the whole JEPA thing soon.
Well, I hope things work as planned. Again, we've been working on this idea of self-supervised learning from video for 10 years, and only made significant progress in the last two or three.
And actually, you've mentioned that there's a lot of interesting breakthroughs that can happen without having access to a lot of compute. So if you're interested in doing a PhD in this kind of stuff, there's a lot of possibilities still to do innovative work. What advice would you give to an undergrad that's looking to go to grad school and do a PhD?
So basically, I've listed them already. There's this idea of how you train a world model by observation. You don't have to train necessarily on gigantic datasets. I mean, it could turn out to be necessary to actually train on large datasets to have emergent properties, like we have with LLMs. But I think there are a lot of good ideas that can be done without necessarily scaling up. Then there is how you do planning with a learned world model. If the world the system evolves in is not the physical world but the world of, let's say, the internet, or some sort of world where an action consists in doing a search in a search engine, or interrogating a database, or running a simulation, or calling a calculator, or solving a differential equation, how do you get a system to actually plan a sequence of actions to give the solution to a problem? So the question of planning is not just a question of planning physical actions. It can be planning actions to use tools, for a dialogue system or for any kind of intelligent system. There's some work on this, but not a huge amount: some work at FAIR, one called Toolformer, which was a couple of years ago, and some more recent work on planning. But I don't think we have a good solution for any of that.
我认为在家里更有趣,但是当你开始思考我认为你提到装洗碗机是吧,是的,就像我认为那是你正在解决的主要问题之一我是说,你知道,清洁房子,收拾餐桌,餐后洗碗,你知道所有这些任务,我是说烹饪我是说所有这些任务原则上可能是可以自动化的,但实际上非常精密复杂甚至只是基本的导航围绕在一个充满不确定性的空间中移动这种工作,也可以这样说现在导航还行,导航方式对我们人类来说充满挑战是不同的是的,这不会完全是,我的意思是,我们实际上有一些演示,因为你知道在Facebook有所谓的具身人工智能团队,他们一直在使用商业机器人而不是构建他们自己的机器人,你可以告诉机器狗,比方说,去冰箱,它可以打开冰箱,可能会拿起冰箱里的一罐等等,并且把它拿给你,所以它可以导航,可以抓取物体,只要它经过训练识别它们,视觉系统现在工作得还不错,但这并不像一个完全通用的机器人那样,足够精密可以做像收拾餐桌这样的事情,对我来说,这是一个激动人心的未来,人类头脑化的机器人,总的来说,因为这让人类更加直接地与物理空间中的人工智能系统进行互动,因此,这样做可以让我们在哲学上和心理上探索我们与机器人的关系,这可能会非常有趣,所以希望你早日取得整个JAPA项目的进展,我是说我希望我希望事情能按照计划进行,我是说,我们已经在自我监督的视频运行方面工作了十年,只在过去两三年取得了重大进展,实际上你提到了没有访问大量计算资源也可能会发生很多有趣的突破,如果你对从事这种工作的博士学位感兴趣,还有很多可能性可以开展创新工作,那么你会给想读研申请博士的本科生什么建议呢,基本上我已经列出了它们,这种通过观察训练世界模型的方法,你不一定要在巨大的数据集上训练,或者我是说你可以将其说成必须在大数据集上训练才能产生类似其他实验室的那种新属性,但我认为有很多好的想法是可以实现的,让我们不需扩展,然后还有一个问题,如何用学习到的世界模型进行规划,如果系统演变的世界不是物理世界,而是互联网的世界或者某种进行搜索引擎中进行搜索或查询数据库或运行模拟或调用计算器或解微分方程等等的行动构成的世界,你如何让系统规划一系列行动来解决问题,因此,规划的问题不仅仅是规划物理行动,也可以是规划使用工具对话系统或任何智能系统的行动,关于这方面也有一些研究,但不多,呢,有人在Facebook做过的叫做工具形态(TOOL FORMER)的一些工作,以及一些最近的规划工作但我认为我们还没有解决这些问题找到一个很好的解决方案。

Then there is the question of hierarchical planning. The example I mentioned of planning a trip from New York to Paris, that's hierarchical, but almost every action that we take involves hierarchical planning in some sense. And we really have absolutely no idea how to do this. There's zero demonstration of hierarchical planning in AI where the various levels of representation that are necessary have been learned. We can do two-level hierarchical planning when we design the two levels. So for example, you have a dog-like robot, right? You want it to go from the living room to the kitchen. You can plan a path that avoids the obstacles, and then you can send this to a lower-level planner that figures out how to move the legs to follow that trajectory. So that works, but that two-level planning is designed by hand. We specify what the proper levels of abstraction are, and what the representation at each level of abstraction has to be. How do you learn this? How do you learn that hierarchical representation of action plans? With ConvNets and deep learning, we can train the system to learn hierarchical representations of percepts. What is the equivalent when what you're trying to represent are action plans?
For action plans, yeah. So you want basically a robot dog or humanoid robot that turns on and travels from New York to Paris all by itself.
For example.
All right. It might have some trouble at the TSA, but...
No, but even doing something fairly simple, like a household task, like cooking or something.
Yeah, there's a lot involved. It's a super complex task, and once again, we take it for granted. What hope do you have for the future of humanity? We're talking about so many exciting technologies, so many exciting possibilities. What gives you hope when you look out over the next 10, 20, 50, 100 years? If you look at social media, there's wars going on, there's division, there's hatred, all this kind of stuff that's also part of humanity. But amidst all that, what gives you hope?
I love that question. We can make humanity smarter with AI. I mean, AI basically will amplify human intelligence. It's as if every one of us will have a staff of smart AI assistants. They might be smarter than us. They'll do our bidding, perhaps execute a task in ways that are much better than we could do ourselves, because they'd be smarter than us. And so it's like everyone would be the boss of a staff of super smart virtual people. So we shouldn't feel threatened by this any more than we should feel threatened by being the manager of a group of people, some of whom are more intelligent than us.
然后是关于分层规划的问题,就像我提到的规划从纽约到巴黎的旅行这样的例子,那是分层规划的,但几乎我们采取的每一个行动在某种意义上都涉及到分层规划,但我们真的完全不知道如何做到这一点,AI中几乎没有关于层次规划的演示,其中学习所需的各种表示不同层级已经被学习。我们可以设计两级层次规划,例如,您有一只类似狗的机器人,您希望它从客厅走到厨房,您可以规划避开障碍的路径,然后您可以将这发送到一个较低级别的规划器,找出如何移动腿部来跟随这些轨迹。这样工作,但这种两级规划是手动设计的,我们指定适当的抽象级别,每个抽象级别的表示必须是如何学习的,您如何学习操作计划的层次表示,您需要一个机器狗或人形机器人自己从纽约到巴黎旅行,例如,您可能在TSA那里遇到一些麻烦,但是即使做一些相当简单的家务活像做菜之类的事情,有很多涉及其中的细节,这是一个非常复杂的任务,我们认为这很简单,对于人类的未来我们有什么希望呢?我们正在谈论如此多令人激动的技术,如此多令人激动的可能性,当你展望未来10年、20年、50年、100年时,你对此有什么希望,如果看看社交媒体,有很多战争、分裂、仇恨等事情,这也是人性的一部分,但在所有这些中,是什么让你感到希望?我喜欢这个问题,我们可以通过AI让人类更聪明,好吧,我的意思是AI基本上会增强人类智慧,就好像我们每个人都会有一队智能AI助手,它们可能比我们聪明,它们会执行我们的命令,也许以比我们自己更好得多的方式执行任务,因为它们比我们聪明,所以每个人都将成为一支超级聪明虚拟人员的老板,因此我们不应该感到受到威胁,就像我们不应该感到受到比我们更聪明的一群人的威胁一样。

I certainly have a lot of experience with this, of having people working with me who are smarter than me. That's actually a wonderful thing. So having machines that are smarter than us, that assist us in all of our tasks and our daily lives, whether professional or personal, I think would be an absolutely wonderful thing. Because intelligence is the commodity that is most in demand. I mean, all the mistakes that humanity makes are because of lack of intelligence, really, or lack of knowledge, which is related. So making people smarter can only be better, for the same reason that public education is a good thing, and books are a good thing, and the internet is also a good thing, intrinsically, and even social networks are a good thing if you run them properly. It's difficult, but you can. Because it helps the communication of information and knowledge, and the transmission of knowledge. So AI is going to make humanity smarter. And the analogy I've been using is that perhaps an equivalent event in the history of humanity to what might be provided by the generalization of AI assistants is the invention of the printing press. It made everybody smarter. People could have access to books. Books were a lot cheaper than they were before, and so a lot more people had an incentive to learn to read, which wasn't the case before. And people became smarter. It enabled the Enlightenment, right? There wouldn't be an Enlightenment without the printing press. It enabled philosophy, rationalism, escape from religious doctrine, democracy, science. And certainly without this, there wouldn't have been the American Revolution or the French Revolution, and so we would still be under feudal regimes, perhaps. So it completely transformed the world, because people became smarter and learned about things. Now, it also created 200 years of essentially religious conflicts in Europe, right? Because the first thing that people read was the Bible, and they realized that perhaps there was a different interpretation of the Bible than what the priests were telling them. That created the Protestant movement and created the rift. And in fact, the Catholic Church didn't like the idea of the printing press, but they had no choice. So it had some bad effects and some good effects. I don't think anyone today would say that the invention of the printing press had an overall negative effect, despite the fact that it created 200 years of religious conflicts in Europe.
我确实有很多这方面的经验,我指的是和比我更聪明的人一起工作。这实际上是件好事。所以,拥有比我们更聪明的机器来辅助我们的所有任务或日常生活,无论是专业的还是个人的,我认为会是一件非常美好的事情。因为智慧是最受欢迎的商品,这就是我的意思。人类所犯的所有错误都是因为缺乏智慧或缺乏知识,这是相关的。让人们变得更聪明只会是更好的事情。同样的道理,公共教育是好事,书籍是好事,互联网也是好事,甚至社交网络在本质上也是好事,只要你正确地运营它们。这很难,但你可以,因为它有助于信息和知识的传播和交流。人工智能会让人类变得更聪明,我一直在使用的类比是智能助手的普及可能会在人类历史上达到的一个等同事件,就像印刷术的发明一样,这让所有人都变得更聪明了。人们可以接触到书籍,书籍比以前便宜得多,因此更多的人有动力学习阅读,这以前是不可能的。人们变得更聪明,它促进了启蒙运动。没有印刷术就不会有启蒙运动,它促进了哲学、理性主义、逃脱宗教教条、民主与科学。当然,如果没有这个,就不会有美国革命或法国革命,可能还会处于一些小统治下。所以它彻底改变了世界,因为人们变得更聪明,开始学习了解事物。但同时它也导致了欧洲大约200年的宗教冲突,因为人们读到的第一本书就是圣经,他们发现或许圣经有着和神职人员告诉他们的不同解释,于是就产生了新教运动和教会之间的裂痕。事实上,天主教会并不喜欢印刷术的出现,但他们无法避免。所以这种技术既产生了一些不良影响,也带来了一些好的影响。我不认为今天有人会说印刷术的发明总的来说有负面影响,尽管它在欧洲造成了大约200年的宗教冲突。

Now compare this, and I was very proud of myself for coming up with this analogy, but realized someone else came up with the same idea before me, with what happened in the Ottoman Empire. The Ottoman Empire banned the printing press for 200 years. And it didn't ban it for all languages, only for Arabic. You could actually print books in Latin or Hebrew or whatever in the Ottoman Empire, just not in Arabic. I thought it was because the rulers just wanted to preserve their control over the population and the religious dogma and everything. But after talking with the UAE Minister of AI, Omar Al Olama, he told me no, there was another reason. And the other reason was to preserve the corporation of calligraphers. There's an art form, which is writing those beautiful Arabic poems or religious texts. And it was a very powerful corporation of scribes, basically, that ran a big chunk of the empire, and you couldn't put them out of business. So they banned the printing press in part to protect that business. Now, what's the analogy for AI today? Who are we protecting by banning AI? Who are the people who are asking that AI be regulated to protect their jobs? And of course, it's a real question what the effect of a technological transformation like AI is going to be on the job market and the labor market. There are economists who are much more expert at this than I am, but when I talk to them, they tell us we're not going to run out of jobs. This is not going to cause mass unemployment. This is just going to be a gradual shift of different professions. The professions that are going to be hot 10 or 15 years from now, we have no idea today what they're going to be. The same way, if we go back 20 years in the past, who could have thought 20 years ago that the hottest job, even 5 or 10 years ago, would be mobile app developer? Smartphones weren't invented.
Most of the jobs of the future might be in the Metaverse.
Well, it could be. Yeah.
But the point is you can't possibly predict. But you're right. I mean, you made a lot of strong points. And I believe that people are fundamentally good, and so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.
So I share that feeling, okay? I think people are fundamentally good. And in fact, a lot of doomers are doomers because they don't think that people are fundamentally good. They either don't trust people, or they don't trust the institutions to do the right thing so that people behave properly.
Well, I think both you and I believe in humanity, and I think I speak for a lot of people in saying thank you for pushing the open source movement, pushing to make both research and AI open source, making it available to people, and also the models themselves, making those open source. So thank you for that. And thank you for speaking your mind in such colorful and beautiful ways on the internet. I hope you never stop. You're one of the most fun people I know and get to be a fan of. So Yann, thank you for speaking to me once again, and thank you for being you.
Thank you, Lex.
Thanks for listening to this conversation with Yann LeCun. To support this podcast, please check out our sponsors in the description.
现在比较一下,我认为这个比喻是我自己想出来的,但后来发现有人比我先想到了这个想法。我将这与奥斯曼帝国发生的情况进行了比较,奥斯曼帝国禁止印刷业达200年,但并非对所有语言都禁止,只是对阿拉伯语禁止,你实际上可以在奥斯曼帝国印刷拉丁文或希伯来语书籍,只是不能用阿拉伯语,我认为这是因为统治者只是想保持对人民和教条宗教教义的控制,但在与阿联酋AI部长奥马尔·阿拉玛的交谈后,他告诉我,不,还有另一个原因,那就是为了保护书法家的公司。书法家是写这些美丽的阿拉伯诗歌或其他宗教文本的一种艺术形式,他们是帝国的一个强大团体,你不能让他们破产,所以他们禁止了印刷业,部分是为了保护那个生意。那么今天关于AI的类比是什么?我们通过禁止AI来保护谁?有哪些人要求对AI进行监管以保护他们的工作,当然,技术转型如AI对就业市场、劳动力市场和经济的影响是一个实际问题,专家们更擅长这个领域,但当我与他们交谈时,他们告诉我们,我们不会失去就业,这不会导致大规模失业,这只是不同职业逐渐转变的过程,将会是在10至15年后热门的职业。谁能想到20年前,像移动应用开发者这样的热门职业竟然存在?未来的大部分工作可能都在虚拟现实中。可能是的,但关键是你不可能预测,你说的很有道理,我相信人们基本上是善良的,尤其是如果AI,特别是开源AI可以让他们更聪明,这将增强人类的善良。我分享这种感觉,我认为人们基本上是善良的,实际上,许多悲观主义者之所以是悲观主义者,是因为他们不相信人们基本上是善良的,他们不信任人们,或者不相信这些机构能以正确的方式使人们行为端正。我认为你和我都相信人类,也代表着很多人,感谢你推动开源运动,推动让研究和AI开源对人们开放,并且开源模型本身也让人们可以使用,所以谢谢你,也感谢你在互联网上以如此丰富多彩的方式说出你的看法,希望你永远不要停止,你是我所认识的最有趣的人之一,也是我偶像之一,所以再次感谢你与我交谈,谢谢你成为你自己,谢谢你。感谢你倾听这段对话,支持这一播客,请查看描述中的赞助商。

And now, let me leave you with some words from Arthur C. Clarke: "The only way to discover the limits of the possible is to go beyond them into the impossible." Thank you for listening, and hope to see you next time.
现在让我用亚瑟·克拉克的一些话来结束吧:“发现可能的极限的唯一方法就是突破它们,前往不可能。”谢谢你的倾听,希望下次能再见到你。


