
Large Language Models explained briefly

Published 2024-11-20 15:07:15
Imagine you happen across a short movie script that describes a scene between a person and their AI assistant. The script has what the person asks the AI, but the AI's response has been torn off. Suppose you also have this powerful magical machine that can take any text and provide a sensible prediction of what word comes next. You can then finish the script by feeding in what you have to the machine, seeing what it would predict to start the AI's answer, and then repeating this over and over with a growing script completing the dialogue.

When you interact with a chatbot, this is exactly what's happening. A large language model is a sophisticated mathematical function that predicts what word comes next for any piece of text. Instead of predicting one word with certainty, though, what it does is assign a probability to all possible next words. To build a chatbot, what you do is lay out some text that describes an interaction between a user and a hypothetical AI assistant. You add on whatever the user types in as the first part of that interaction. Then you have the model repeatedly predict the next word that such a hypothetical AI assistant would say in response, and that's what's presented to the user.
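Here is a minimal sketch, in Python, of the loop just described. The function predict_next_word_probabilities is a hypothetical stand-in for the model itself, returning a probability for every possible next word, and the "<end>" stop marker is likewise made up for illustration.

def generate_reply(conversation_so_far, predict_next_word_probabilities, max_words=200):
    # Repeatedly ask the model for the next word and append it to the growing script.
    reply = []
    while len(reply) < max_words:
        probabilities = predict_next_word_probabilities(conversation_so_far)
        next_word = max(probabilities, key=probabilities.get)  # most likely word; sampling is discussed below
        if next_word == "<end>":          # hypothetical stop marker
            break
        reply.append(next_word)
        conversation_so_far += " " + next_word
    return " ".join(reply)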

In doing this, the output tends to look a lot more natural if you allow it to select less likely words along the way at random. So what this means is even though the model itself is deterministic, a given prompt typically gives a different answer each time it's run. Models learn how to make these predictions by processing an enormous amount of text, typically pulled from the internet. For a standard human to read the amount of text that was used to train GPT-3, for example, reading non-stop 24-7, it would take over 2,600 years. Larger models since then are trained on much, much more.
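A toy illustration of that randomness, assuming the model's output is a dictionary mapping candidate words to probabilities: drawing the next word at random according to those probabilities means less likely words are occasionally chosen, so the same prompt can produce different answers.

import random

def sample_next_word(probabilities):
    # Draw one word at random, weighted by the model's probabilities.
    words = list(probabilities)
    weights = [probabilities[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Example: most runs print "blue", but sometimes "grey" or "cloudy".
print(sample_next_word({"blue": 0.7, "grey": 0.2, "cloudy": 0.1}))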

You can think of training a little bit like tuning the dials on a big machine. The way that a language model behaves is entirely determined by these many different continuous values, usually called parameters or weights. Changing those parameters will change the probabilities that the model gives for the next word on a given input. What puts the large in large language model is how they can have hundreds of billions of these parameters. No human ever deliberately sets those parameters. Instead they begin at random, meaning the model just outputs gibberish, but they're repeatedly refined based on many example pieces of text.
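To make the "dials" idea concrete, here is a toy sketch in which the parameters are simply one score per word in a three-word vocabulary; turning one dial changes the probabilities the model assigns. Real models instead compute these scores from the input text using hundreds of billions of parameters.

import math

def next_word_probabilities(parameters):
    # Turn one score ("dial") per word into a probability distribution.
    exps = {word: math.exp(score) for word, score in parameters.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

print(next_word_probabilities({"blue": 1.0, "grey": 0.0, "cloudy": -1.0}))
print(next_word_probabilities({"blue": 1.0, "grey": 2.0, "cloudy": -1.0}))  # one dial turned up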

One of these training examples could be just a handful of words, or it could be thousands, but in either case the way this works is to pass in all but the last word from that example into the model and compare the prediction that it makes with the true last word from the example. An algorithm called back propagation is used to tweak all of the parameters in such a way that it makes the model a little more likely to choose the true last word and a little less likely to choose all the others. When you do this for many, many trillions of examples, not only does the model start to give more accurate predictions on the training data, but it also starts to make more reasonable predictions on text that it's never seen before.
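A toy version of one such training step, under the simplifying assumption that the parameters are just one score per candidate word. Real models compute their scores from the whole context, and backpropagation works out how to nudge every one of the billions of parameters involved; the update rule below is only the last, simplest piece of that picture.

import math

def softmax(scores):
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def training_step(scores, true_next_word, learning_rate=0.1):
    probs = softmax(scores)
    for word in scores:
        target = 1.0 if word == true_next_word else 0.0
        # For this toy model the cross-entropy gradient is (probability - target):
        # the true word's score goes up a little, every other word's goes down a little.
        scores[word] -= learning_rate * (probs[word] - target)
    return scores

scores = {"blue": 0.0, "grey": 0.0, "cloudy": 0.0}
for _ in range(100):                       # e.g. the example "the sky is ___" -> "blue"
    scores = training_step(scores, "blue")
print(softmax(scores))                     # "blue" is now much more likely than the others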

Between the huge number of parameters and the enormous amount of training data, the scale of computation involved in training a large language model is mind-boggling. To illustrate, imagine that you could perform one billion additions and multiplications every single second. How long do you think it would take for you to do all of the operations involved in training the largest language models? Do you think it would take a year? Maybe something like 10,000 years? The answer is actually much more than that. It's well over 100 million years.
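The arithmetic behind that claim, using only the figures mentioned here:

ops_per_second = 1e9                         # one billion additions/multiplications per second
seconds_per_year = 60 * 60 * 24 * 365        # about 3.15e7
years = 100_000_000                          # "well over 100 million years"
total_operations = ops_per_second * seconds_per_year * years
print(f"{total_operations:.1e} operations")  # about 3.2e+24, i.e. on the order of 10^24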

This is only part of the story though. This whole process is called pre-training. The goal of auto-completing a random passage of text from the internet is very different from the goal of being a good AI assistant. To address this, chatbots undergo another type of training, just as important, called reinforcement learning with human feedback. Workers flag unhelpful or problematic predictions, and their corrections further change the model's parameters, making them more likely to give predictions that users prefer.
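A heavily simplified sketch of the feedback idea, not the actual reinforcement-learning algorithm: a flagged continuation is made less likely and the human-supplied correction more likely, again by adjusting scores that stand in for the model's parameters.

def apply_human_feedback(scores, flagged_word, preferred_word, step=0.5):
    # Make the flagged, problematic continuation less likely
    # and the human-supplied correction more likely.
    scores[flagged_word] -= step
    scores[preferred_word] += step
    return scores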

Looking back at the pre-training though, this staggering amount of computation is only made possible by using special computer chips that are optimized for running many, many operations in parallel, known as GPUs. However, not all language models can be easily parallelized. Prior to 2017, most language models would process text one word at a time. But then, a team of researchers at Google introduced a new model known as the transformer. Transformers don't read text from the start to the finish. They soak it all in at once in parallel.
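A small illustration of why this matters, using NumPy with arbitrary toy shapes: a recurrent-style model has to walk through the words one at a time, because each step depends on the previous one, while a transformer-style computation can be written as one big matrix operation over all positions at once, which is exactly the kind of work GPUs are built for.

import numpy as np

num_words, embedding_size = 6, 4
word_vectors = np.random.randn(num_words, embedding_size)
weights = np.random.randn(embedding_size, embedding_size)

# Sequential: each step depends on the previous one, so the words
# must be processed one at a time.
state = np.zeros(embedding_size)
for vector in word_vectors:
    state = np.tanh(weights @ state + vector)

# Parallel: every position is transformed in one batched matrix operation.
transformed = np.tanh(word_vectors @ weights)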

The very first step inside a transformer, and most other language models for that matter, is to associate each word with a long list of numbers. The reason for this is that the training process only works with continuous values, so you have to somehow encode language using numbers, and each of these lists of numbers may somehow encode the meaning of the corresponding word. What makes transformers unique is their reliance on a special operation known as attention. This operation gives all of these lists of numbers a chance to talk to one another and refine the meanings they encode based on the context around them, all done in parallel.
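A minimal sketch of both ideas, with a tiny made-up vocabulary and random embeddings: each word becomes a short list of numbers, and a bare-bones attention step lets every position's vector be updated using all the others, in parallel. Real transformers use learned query, key, and value projections and many attention heads; this toy version just compares the raw vectors.

import numpy as np

vocabulary = {"the": 0, "river": 1, "bank": 2}
embedding_table = np.random.randn(len(vocabulary), 4)   # 4 numbers per word (tiny on purpose)

sentence = ["the", "river", "bank"]
x = embedding_table[[vocabulary[w] for w in sentence]]  # shape: (3 words, 4 numbers each)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Every vector attends to every other: similarity scores -> weights -> weighted mixture.
scores = x @ x.T / np.sqrt(x.shape[1])
attention_weights = softmax(scores, axis=-1)
refined = attention_weights @ x   # the vector for "bank" is now a context-dependent blend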

For example, the numbers encoding the word bank might be changed based on the context surrounding it to somehow encode the more specific notion of a river bank. Transformers typically also include a second type of operation known as a feed-forward neural network, and this gives the model extra capacity to store more patterns about language learned during training. All of this data repeatedly flows through many different iterations of these two fundamental operations. And as it does so, the hope is that each list of numbers is enriched to encode whatever information might be needed to make an accurate prediction of what word follows in the passage.
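A structural sketch of that repetition, with random matrices standing in for learned parameters: data flows through a stack of blocks, each an attention step followed by a small feed-forward network, with residual connections as in real transformers.

import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    weights = softmax(x @ x.T / np.sqrt(x.shape[1]), axis=-1)
    return weights @ x

def feed_forward(x, w1, w2):
    return np.maximum(0, x @ w1) @ w2          # a tiny two-layer network with a ReLU

d = 4
x = np.random.randn(3, d)                      # 3 word vectors, 4 numbers each
for _ in range(6):                             # 6 repeated blocks; real models use many more
    x = x + attention(x)                       # residual connections, as in real transformers
    x = x + feed_forward(x, np.random.randn(d, d), np.random.randn(d, d))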

At the end, one final function is performed on the last vector in this sequence, which now has had a chance to be influenced by all the other context from the input text, as well as everything the model learned during training, to produce a prediction of the next word. This prediction looks like a probability for every possible next word. Although researchers design the framework for how each of these steps works, it's important to understand that the specific behavior is an emergent phenomenon based on how those hundreds of billions of parameters are tuned during training.
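A sketch of that final step, with a made-up five-word vocabulary and a random matrix standing in for the learned final transformation: the last vector is mapped to one score per word, and a softmax turns those scores into probabilities.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

vocabulary = ["the", "river", "bank", "water", "money"]
d = 4
last_vector = np.random.randn(d)                      # the final, context-enriched vector
unembedding = np.random.randn(d, len(vocabulary))     # maps a vector to one score per word
probabilities = softmax(last_vector @ unembedding)
for word, p in zip(vocabulary, probabilities):
    print(f"{word}: {p:.2f}")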

This makes it incredibly challenging to determine why the model makes the exact predictions that it does. What you can see is that when you use large language model predictions to autocomplete a prompt, the words that it generates are uncannily fluent, fascinating, and even useful. If you're a new viewer and you're curious about more details on how transformers and attention work, boy do I have some material for you.

One option is to jump into a series I made about deep learning, where we visualize and motivate the details of attention and all the other steps in a transformer. But also, on my second channel I just posted a talk that I gave a couple months ago about this topic for the company TNG in Munich. Sometimes I actually prefer the content that I make as a casual talk rather than a produced video, but I leave it up to you which one of these feels like the better follow on.


