
State of GPT | BRK216HFS - YouTube

发布时间 2023-05-25

中英文字稿  

Please welcome AI researcher and founding member of OpenAI, Andrej Karpathy. Hi everyone. I'm happy to be here to tell you about the state of GPT and more generally about the rapidly growing ecosystem of large language models. So I would like to partition the talk into two parts. In the first part, I would like to tell you about how we train GPT assistants. And then in the second part, we are going to take a look at how we can use these assistants effectively for your applications. So first let's take a look at the emerging recipe for how to train these assistants and keep in mind that this is all very new and still rapidly evolving.
大家好,我很高兴能在这里向你们介绍GPT的现状以及日益增长的大型语言模型生态系统。因此,我想将讲话分为两部分。 在第一部分中,我想向大家介绍我们如何训练GPT助手。接着,在第二部分中,我们将看一看如何有效地将这些助手应用于你的应用程序。 首先,让我们来看看训练这些助手的新兴方法,记住这一切都非常新颖且不断演化。

But so far the recipe looks something like this. Now this is kind of a complicated slide so I'm going to go through it piece by piece. But roughly speaking we have four major stages: pre-training, supervised fine tuning, reward modeling, and reinforcement learning, and they follow each other serially. Now in each stage we have a dataset that powers that stage. We have an algorithm that for our purposes will be an objective for training the neural network, and then we have a resulting model, and then there are some notes on the bottom.
目前而言,这个配方大概是这个样子的。这是一个比较复杂的幻灯片,我会逐个部分地讲解。但总的来说,我们有四个主要阶段:预训练、监督微调、奖励建模以及强化学习,并且它们严格按照顺序进行。在每个阶段中,我们有一个数据集来支持该阶段;有一个算法,对我们来说就是训练神经网络的目标函数;然后我们得到一个模型,底部还有一些注释。

So the first stage we are going to start with is the pre-training stage. Now this stage is kind of special in this diagram, and this diagram is not to scale, because this stage is where all of the computational work basically happens. This is 99% of the training compute time and also flops. And so this is where we are dealing with internet-scale datasets with thousands of GPUs in a supercomputer, and also months of training potentially. The other three stages are fine tuning stages that are much more along the lines of a small number of GPUs and hours or days.
我们要开始的第一个阶段是预训练阶段。这个阶段在此图中比较特殊,而且此图并未按比例绘制,因为基本上所有的计算工作都发生在这个阶段:它占了99%的训练计算时间和浮点运算。因此,在这里我们要用超级计算机中成千上万个GPU处理互联网规模的数据集,训练可能长达数月。其他三个阶段是微调阶段,更多是少量GPU、数小时或数天这种量级。

So let's take a look at the pre-training stage, where we aim to achieve a base model. First we are going to gather a large amount of data. Here is an example of what we call a data mixture that comes from this paper that was released by Meta, where they released this LLaMA base model. You can see roughly the datasets that enter into these collections. So we have Common Crawl, which is just a web scrape, C4, which is also Common Crawl, and then some high quality datasets as well.
让我们来看看用来得到基础模型的预训练阶段。首先,我们需要收集大量数据。这里有一个我们称之为数据混合的例子,它来自Meta发布的论文,他们在其中发布了LLaMA基础模型。你可以大致看到进入这些集合的数据集:我们有Common Crawl,它就是网页抓取;C4也来自Common Crawl;还有一些高质量的数据集。

So for example GitHub, Wikipedia, books, arXiv, Stack Exchange and so on. These are all mixed up together and then they are sampled according to some given proportions, and that forms the training set for the neural net, for the GPT. Now before we can actually train on this data, we need to go through one more pre-processing step, and that is tokenization. This is basically a translation of the raw text that we scrape from the internet into sequences of integers, because that is the native representation over which GPT functions. Now this is a lossless kind of translation between pieces of text and tokens and integers, and there are a number of algorithms for this stage.
例如GitHub、Wikipedia、书籍、arXiv、Stack Exchange等等。这些都混合在一起,然后按照给定的比例采样,构成GPT神经网络的训练集。在我们实际用这些数据训练之前,还需要经过一个预处理步骤,即分词(tokenization)。这基本上是把我们从互联网上抓取的原始文本翻译成整数序列,因为那是GPT运作所基于的原生表示形式。文本片段与令牌、整数之间的这种转换是无损的,这个阶段有许多可用的算法。

Typically, for example, you could use something like byte pair encoding, which iteratively merges little text chunks and groups them into tokens. And so here I am showing some example chunks of these tokens, and then this is the raw integer sequence that will actually feed into a transformer. Now here I am showing two sort of examples of the hyperparameters that govern this stage. For GPT-4, we did not release too much information about how it was trained and so on, so I am using GPT-3's numbers, but GPT-3 is of course a little bit old by now, about three years ago, but LLaMA is a fairly recent model from Meta.
通常,比如你可以使用类似字节对编码(byte pair encoding)的方法,迭代地合并小的文本块并把它们归并成令牌。这里我展示了这些令牌的一些示例块,以及真正输入Transformer的原始整数序列。接下来我展示了决定这个阶段的超参数的两个例子。关于GPT-4,我们没有公布太多训练信息,所以我用的是GPT-3的数字,但GPT-3现在已经有点老了,大约是三年前的模型;而LLaMA是Meta最近发布的一个模型。
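
As a concrete illustration of this step, here is a minimal tokenization sketch using the open-source tiktoken library with the GPT-2 byte pair encoding; the talk does not prescribe a specific tokenizer, so the library choice here is an assumption.

```python
# Minimal tokenization sketch (assumption: tiktoken's GPT-2 BPE;
# the talk does not mandate a particular tokenizer library).
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's byte pair encoding
text = "Tokenization turns raw text into integers."
tokens = enc.encode(text)                      # text -> list of integer token IDs
print(tokens)                                  # a list of integer IDs
print(enc.decode(tokens) == text)              # True: the mapping is lossless
```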

So these are roughly the orders of magnitude that we are dealing with when we are doing pre-training. The vocabulary size is usually a couple of tens of thousands of tokens. The context length is usually something like 2,000, 4,000, or nowadays even 100,000, and this governs the maximum number of integers that the GPT will look at when it is trying to predict the next integer in a sequence. You can see that roughly the number of parameters is, say, 65 billion for LLaMA.
在进行预训练时,我们通常处理的是这些数量级。词汇表通常有几万个令牌。上下文长度通常是2000、4000,如今甚至达到10万,它决定了GPT在预测序列中下一个整数时最多会查看多少个整数。你可以看到,LLaMA的参数数量大约是650亿。

Now even though LLaMA has only 65 billion parameters compared to GPT-3's 175 billion parameters, LLaMA is a significantly more powerful model, and intuitively that is because the model is trained for significantly longer. In this case 1.4 trillion tokens instead of just 300 billion tokens. So you should not judge the power of a model just by the number of parameters that it contains. Below I am showing some tables of rough hyperparameters that typically go into specifying the transformer neural network, so the number of heads, the dimension size, the number of layers and so on. And on the bottom I am showing some training hyperparameters.
尽管与GPT-3的1750亿参数相比,LLaMA只有650亿参数,但LLaMA是一个显著更强大的模型,直观上是因为它的训练时间长得多:在这里是1.4万亿个令牌,而不只是3000亿个令牌。因此,你不应只通过参数数量来评估模型的能力。下面我展示了一些指定Transformer神经网络时常见的粗略超参数,比如头数、维度大小、层数等,底部则是一些训练超参数。

So for example to train the 65b model, Meta used 2000 GPUs, roughly 21 days of training and roughly several million dollars. And so that is the rough orders of magnitude that you should have in mind for the pre-training stage.
例如,对于训练大小为65b的模型,Meta公司使用了2000个GPU,在大约21天的时间内进行训练,成本大约数百万美元。因此,这是你在预训练阶段需要考虑的大致数量级。

Now when we are actually pre-training, what happens? Roughly speaking, we are going to take our tokens and we are going to lay them out into data batches. So we have these arrays that will feed into the transformer, and these arrays are of size B by T, where B is the batch size, these are all independent examples stacked up in rows, and T is the maximum context length.
那么,当我们实际进行预训练时会发生什么?大致来说,我们会把令牌排布成数据批次。我们有这些将输入Transformer的数组,数组的大小是B×T:B是批量大小,各行是相互独立的示例,T是最大上下文长度。简单来说,我们以一定的格式把数据组织好,以便输入Transformer。

So in my picture the context length is only 10. This could be 2,000, 4,000, etc., so these are extremely long rows. And what we do is we take these documents and we pack them into rows, and we delimit them with these special end-of-text tokens, basically telling the transformer where a new document begins.
在我的图里,上下文长度只有10。实际中这可能是2000、4000等等,所以这些是非常长的行。我们把这些文档打包进各行,用特殊的文本结束标记把它们分隔开,基本上是告诉Transformer新文档从哪里开始。
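
To make the B by T layout concrete, here is a toy sketch of packing tokenized documents into a batch; the token IDs and the end-of-text ID (50256, as in the GPT-2 vocabulary) are illustrative assumptions.

```python
import numpy as np

# Toy packing of documents into a (B, T) batch, delimited by an end-of-text token.
# Token IDs below are made up; EOT = 50256 follows the GPT-2 vocabulary convention.
EOT = 50256
docs = [[464, 3290, 318, 257], [1212, 318, 1194, 3188, 13], [32, 2368, 530]]

stream = []
for d in docs:
    stream.extend(d + [EOT])        # one long token stream, documents separated by EOT

B, T = 2, 6                         # batch size and context length (tiny, for illustration)
needed = B * T + 1                  # +1 so every input position has a next-token target
stream = (stream * (needed // len(stream) + 1))[:needed]

x = np.array(stream[:-1]).reshape(B, T)   # inputs:  rows of packed documents
y = np.array(stream[1:]).reshape(B, T)    # targets: the token that follows each input position
print(x)
print(y)
```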

And so here I have a few examples of documents and then I have stretched them out into this input. Now we are going to feed all of these numbers into transformer. And let me just focus on a single particular cell but the same thing will happen at every cell in this diagram.
因此,我在这里有一些文件示例,然后我将它们伸展成这个输入。现在,我们将把所有这些数字输入转换器中。让我只关注单个特定的单元,但是在此图表中的每个单元格都会发生同样的事情。

So let's look at the green cell. The green cell is going to take a look at all of the tokens before it, so all of the tokens in yellow and we are going to feed that entire context into the transformer neural network. And the transformer is going to try to predict the next token in the sequence in this case in red.
让我们来看看绿色单元格。绿色单元格会查看它之前的所有标记,也就是所有黄色的标记,我们会把这整个上下文输入Transformer神经网络。Transformer会尝试预测序列中的下一个标记,在本例中就是红色的那个。

Now the transformer: I unfortunately don't have too much time to go into the full details of this neural network architecture, but for our purposes it is just a large blob of neural net stuff, and it typically has several tens of billions of parameters or something like that.
现在说到transformer,很不幸我没有太多时间详细介绍这个神经网络架构的细节,但对于我们的目的来说,它只是一大堆神经网络的东西,通常有几百亿个参数左右。

And of course as it tunes these parameters you are getting slightly different predicted distributions for every single one of these cells. And so for example if our vocabulary size is 50,257 tokens then we are going to have that many numbers because we need to specify a probability distribution for what comes next. So basically we have a probability for whatever may follow.
当调整这些参数时,每个单元的预测分布都会略有不同。如果我们的词汇大小为50,257个标记,那么我们将需要指定下一个标记的概率分布,因此我们将有这么多的数字。所以基本上,我们对于接下来可能出现的任何内容都有一个概率。

Now in this specific example for this specific cell 513 will come next and so we can use this as a source of supervision to update our transformer's weights. And so we are applying this basically on every single cell in parallel and we keep swapping batches and we are trying to get the transformer to make the correct predictions over what token comes next in a sequence.
在这个具体的示例中,对于这个特定的单元格,接下来的令牌是513,因此我们可以把它作为监督信号来更新Transformer的权重。我们基本上在每个单元格上并行地应用这一点,并不断更换批次,努力让Transformer正确预测序列中接下来的令牌。
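
Sketched in code, the objective at every cell is just a cross-entropy loss over the vocabulary, applied at all B x T positions in parallel. Below is a minimal PyTorch sketch with random stand-in logits, not the actual training code.

```python
import torch
import torch.nn.functional as F

B, T, V = 4, 10, 50257                               # batch, context length, vocab size
logits = torch.randn(B, T, V, requires_grad=True)    # stand-in for the transformer's outputs
targets = torch.randint(0, V, (B, T))                # the true "next token" at every position

# Every cell is supervised in parallel: give high probability to the integer
# that actually comes next in its row.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
loss.backward()                                      # gradients would update the transformer's weights
print(loss.item())
```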

So let me show you more concretely what this looks like when you train one of these models. This is actually coming from The New York Times, where they trained a small GPT on Shakespeare. So here is a small snippet of Shakespeare, and they trained a GPT on it.
让我更具体地展示一下训练这类模型时是什么样子。这实际上来自《纽约时报》,他们在莎士比亚文本上训练了一个小型GPT。这里是莎士比亚作品的一个小片段,他们在上面训练了GPT。

Now in the beginning at initialization the GPT starts with completely random weights. So you are just getting completely random outputs as well. But over time as you train the GPT longer and longer you are getting more and more coherent and consistent sort of samples from the model.
现在,在初始化时,GPT从完全随机的权重开始。因此,您也仅获得完全随机的输出。但随着您对GPT进行更长时间的训练,您将从该模型中获得越来越一致和连贯的样本。

And the way you sample from it of course is you predict what comes next. You sample from that distribution and you keep feeding that back into the process and you can basically sample large sequences. And so by the end you see that the transformer has learned about words and where to put spaces and where to put commas and so on.
当然,你从这个过程中采样的方式是预测接下来会出现什么。你从这个分布中进行采样,不断将结果反馈回这个过程中,你就可以采样大型序列。到最后,你会发现这个Transformers已经学会了单词的使用,知道在哪里需要添加空格,在哪里需要加逗号等等。
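
That sampling loop, written out as a minimal sketch; the `model(idx)` interface returning logits of shape (1, T, vocab) is a hypothetical stand-in for a trained transformer.

```python
import torch

def sample(model, idx, max_new_tokens, temperature=1.0):
    # idx: (1, T) tensor of token IDs; model(idx) is assumed to return (1, T, V) logits.
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature        # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample (rather than argmax)
        idx = torch.cat([idx, next_id], dim=1)             # feed the sample back in
    return idx
```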

And so we are making more and more consistent predictions over time. These are the kinds of plots that you are looking at when you are doing model pre-training. Effectively, we are looking at the loss function over time as you train, and low loss means that our transformer is giving a higher probability to the correct next integer in the sequence.
因此,随着时间的推移,我们的预测越来越一致。这些就是你在做模型预训练时会看到的那种图表。实际上,我们是在观察训练过程中损失函数随时间的变化,低损失意味着我们的Transformer给序列中正确的下一个整数赋予了更高的概率。

Now what are we going to do with this model once we have trained it, after a month? Well, the first thing that we noticed, we the field, is that these models basically, in the process of language modeling, learn very powerful general representations, and it is possible to very efficiently fine tune them for any arbitrary downstream task you might be interested in.
那么,在我们一个月训练过这个模型后,我们会拿它来做什么呢?首先,我们在这个领域注意到,这些模型在语言建模过程中学习到非常强大的普遍表示形式,可以非常高效地进行微调,以适应任何可能感兴趣的下游任务。

So as an example if you are interested in sentiment classification the approach used to be that you collect a bunch of positives and negatives and then you train some kind of an NLP model for that. But the new approach is ignore sentiment classification.
举例来说,如果您对情感分类感兴趣,一般的方法是先收集一堆正面和负面的例子,然后训练一种自然语言处理模型。但是现在新的方法是忽略情感分类。

Go off and do large language model pre-training, train a large transformer and then you can only have a few examples and you can very efficiently fine tune your model for that task. And so this works very well in practice.
走一步,进行大型语言模型的预训练,训练一个大型transformer,然后你只需要很少的例子,就可以非常有效地对该任务进行微调。这在实践中非常有效。

And the reason for this is that basically the transformer is forced to multitask a huge amount of tasks in the language modeling task because just in terms of predicting the next token it is forced to understand a lot about the structure of the text and all the different concepts therein.
之所以如此,是因为在语言建模任务中,变压器被迫执行大量的任务。因为仅仅是预测下一个单词,它就被迫理解文本的结构以及其中的各种概念。这就是导致变压器需要执行多个任务的原因。

So that was GPT1. Now around the time of GPT2 people noticed that actually even better than fine tuning you can actually prompt these models very effectively. So these are language models and they want to complete documents. So you can actually trick them into performing tasks just by arranging these fake documents.
这就是GPT1的情况。到了GPT2的时候,人们开始注意到,实际上比起微调模型,你还可以通过模拟文件来非常有效地引导这些语言模型。这些模型旨在完成文件的自动补全,因此你可以通过伪造文件来让它们执行一些任务。

So in this example for example we have some passage and then we sort of like do QA, QA, QA. This is called a few shot prompt and then we do Q and then as the transformer is trying to complete the document it is actually answering our question. So this is an example of prompt engineering a base model making it believe that it is sort of imitating a document and it is getting it to perform a task.
就拿这个例子来说,我们有一段文字,然后做"问、答,问、答,问、答"。这被称为少样本提示(few-shot prompt)。然后我们提出问题,当Transformer试图补全这个文档时,它实际上就在回答我们的问题。所以这是对基础模型做提示工程的一个例子:让它以为自己在模仿一份文档,从而让它执行任务。

And so this kicked off, I think, the era of, I would say, prompting over fine tuning, and seeing that this can actually work extremely well on a lot of problems, even without training any neural networks, fine tuning or so on. Now since then we have seen an entire evolutionary tree of base models that everyone has trained.
我认为这开启了"提示优于微调"的时代,人们发现即使不训练任何神经网络、不做任何微调,提示在很多问题上也能表现得非常好。自那以后,我们看到了大家训练出的基础模型的整个进化树。

Not all of these models are available. For example the GPT-4 base model was never released. The GPT-4 model that you might be interacting with over API is not a base model. It is an assistant model and we are going to cover how to get those in a bit.
并非所有这些模型都是可用的。例如,GPT-4基础模型从未发布。您可能正在与通过API交互的GPT-4模型并非是基础模型。它是一个助手模型,我们将在稍后介绍如何获取它们。

The GPT-3 base model is available via the API under the name DaVinci, and the GPT-2 base model is available even as weights on our GitHub repo. But currently the best available base model probably is the LLaMA series from Meta, although it is not commercially licensed.
GPT-3基础模型可以通过API使用,名称是DaVinci;GPT-2基础模型甚至以权重形式放在我们的GitHub仓库里。但目前最好的可用基础模型可能是Meta的LLaMA系列,尽管它没有商业授权。

Now one thing to point out is base models are not assistants. They don't want to answer your questions. They just want to complete documents. So if you tell them to write a poem about the brain and cheese, it will just answer your question with more questions. It is just completing what it thinks is a document.
需要指出的是,基础模型并不是助手。它们不想回答你的问题,只想补全文档。所以,如果你让它写一首关于大脑和奶酪的诗,它可能只会用更多的问题来回应你:它只是在补全它认为的文档。

However you can prompt them in a specific way for base models that is more likely to work. So as an example here is a poem about brain and cheese and in that case it will autocomplete correctly. You can even trick base models into being assistants.
然而,你可以用对基础模型更可能奏效的特定方式来提示它们。例如,这里先写"这是一首关于大脑和奶酪的诗",在这种情况下它会正确地自动补全。你甚至可以诱导基础模型充当助手。

And the way you would do this is you would create like a specific few shot prompt that makes it look like there is some kind of a document between a human and assistant and they are exchanging sort of information. And then at the bottom you sort of put your query at the end and the base model will sort of like condition itself into being like a helpful assistant and kind of answer. But this is not very reliable and doesn't work super well in practice although it can be done.
你可以通过创建一种特定的几个提示来模拟人类和助手之间有某种文件交换信息的情况,从而实现这一点。然后在底部,您可以将查询放在最后,基础模型会自动调整自己成为一种有帮助的助手并给出答案。但这种方法并不是非常可靠,并且在实践中效果不是特别好,虽然它是可行的。

So instead we have a different path to make actual GPT assistants, not just base model document completers. And so that takes us into supervised fine tuning.
因此,我们有一条不同的路径来得到真正的GPT助手,而不只是补全文档的基础模型。这就把我们带到了监督微调(SFT)。

So in the supervised fine tuning stage, we are going to collect small but high quality datasets. And in this case, we are going to ask human contractors to gather data of the form: prompt and ideal response. And we are going to collect lots of these, typically tens of thousands or something like that.
在监督微调阶段,我们会收集小而高质量的数据集。这里,我们会请人类承包商收集"提示与理想回应"这种形式的数据。我们通常会收集很多条,一般是数万条左右。

And then we are going to still do language modeling on this data, so nothing changed algorithmically. We are just swapping out a training set. It used to be internet documents, which is high quantity but low quality, and we swap it for basically QA prompt-response kind of data, and that is low quantity, high quality. So we will still do language modeling, and then after training we get an SFT model.
然后,我们仍然对这些数据做语言建模,算法上没有任何改变,只是更换了训练集。以前是互联网文档,数量大但质量低;现在换成了问答式的提示-回应数据,数量少但质量高。所以我们仍然进行语言建模,训练之后得到一个SFT模型。

And you can actually deploy these models and they are actual assistants and they work to some extent. Let me show you what an example demonstration might look like. So here is something that a human contractor might come up with.
这些模型实际上可以被部署,并且它们是真实的助手,可以在一定程度上工作。让我展示一下可能的示例演示。这里是一个人类承包商可能会提出的内容。

Here is some random prompt. Can you write a short introduction about the relevance of the term monopsony, or something like that. And then the contractor also writes out an ideal response. And when they write out these responses, they are following extensive labeling documentation and they are being asked to be helpful, truthful and harmless. And these are the labeling instructions here.
这是一个随机的提示:你能写一个简短的介绍,说明"买方垄断(monopsony)"这个术语的相关性吗,诸如此类。然后,承包商会写出一个理想的回答。在写这些回答时,他们遵循详尽的标注文档,并被要求做到有帮助、真实且无害。这里展示的就是标注说明。

You probably can't read it. Neither can I. But they are long and this is just people following instructions and trying to complete these prompts. So that is what the data set looks like, and you can train these models and this works to some extent.
你可能看不清,我也看不清。但这些说明很长,就是人们按照指示去尝试完成这些提示。这就是数据集的样子。你可以训练这些模型,它们在一定程度上是有效的。

Now you can actually continue the pipeline from here on and go into RLHF, reinforcement learning from human feedback, which consists of both reward modeling and reinforcement learning. Let me cover that and then I will come back to why you may want to go through the extra steps and how that compares to just SFT models.
现在,你实际上可以从这里继续这个流程,进入RLHF,即基于人类反馈的强化学习,它由奖励建模和强化学习两部分组成。我先讲讲这个过程,然后再回来说明你为什么可能想要走这些额外的步骤,以及它与只用SFT模型相比如何。

So in the reward modeling step what we are going to do is we are now going to shift our data collection to be of the form of comparisons. So here is an example of what our data set will look like. I have the same prompt, identical prompt on the top which is asking the assistant to write a program or function that checks if a given string is a palindrome.
在奖励建模步骤中,我们现在要把数据收集转换为比较的形式。以下是我们数据集的一个示例:顶部是同一个、完全相同的提示,它要求助手编写一个程序或函数,检查给定的字符串是否是回文。

And then what we do is we take the SFT model which we have already trained and we create multiple completions. So in this case we have three completions that the model has created and then we ask people to rank these completions. So if you stare at this for a while and by the way these are very difficult things to do to compare some of these predictions and this can take people even hours for single prompt completion pairs.
然后我们所做的就是使用已经训练好的SFT模型,创建多个完成项。因此在这种情况下,我们有三个由模型创建的完成项,然后我们要求人们对这些完成项进行排名。所以,如果你仔细观察这些完成项,顺便说一句,这些预测是非常难比较的,这可能需要人们甚至几个小时才能完成单个提示完成对。

But let's say we decided that one of these is much better than the others and so on. So we rank them. Then we can follow that with something that looks very much like a binary classification on all the possible pairs between these completions.
假设我们决定其中一个比其他的更好,然后对它们进行排序。接着,我们可以对这些完成形式之间的所有可能配对进行类似于二元分类的操作。

So what we do now is we lay out our prompt in rows, and the prompt is identical across all three rows here. So it's all the same prompt but the completions vary, and so the yellow tokens are coming from the SFT model. So what we do is we append another special reward readout token at the end and we basically only supervise the transformer at this single green token. And the transformer will predict some reward for how good that completion is for that prompt.
现在我们要做的是把提示按行排开,这三行上的提示都是完全相同的。提示相同,但补全各不相同,黄色的标记来自SFT模型。然后我们在末尾附加一个特殊的奖励读出标记,并且基本上只在这个绿色标记处监督Transformer。Transformer会预测一个奖励,表示该补全对这个提示来说有多好。

And so basically it makes a guess about the quality of each completion and then once it makes a guess for every one of them we also have the ground truth which is telling us the ranking of them. And so we can actually enforce that some of these numbers should be much higher than others and so on. We formulate this into a loss function and we train our model to make reward predictions that are consistent with the ground truth coming from the comparisons from all these contractors.
基本上,它对每个补全的质量进行猜测,然后一旦它对每一个都做出了猜测,我们还有基准事实,也就是这些补全的排名。因此,我们实际上可以要求其中某些数值应该远高于其他数值,等等。我们把这表述成一个损失函数,训练模型做出与来自所有这些承包商的比较(基准事实)相一致的奖励预测。
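
The comparison data is typically turned into a pairwise ranking loss of roughly this form. This is a minimal sketch in the style of InstructGPT; the scalar rewards here are stand-ins for the values read out at the green token.

```python
import torch
import torch.nn.functional as F

# Stand-in scalar rewards read out at the special token for two completions of
# the same prompt, where the first one was ranked higher by the contractor.
r_chosen = torch.tensor([1.3, 0.2], requires_grad=True)
r_rejected = torch.tensor([-0.4, 0.9], requires_grad=True)

# Pairwise ranking loss: push the preferred completion's reward above the other's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss.item())
```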

So that's how we train our reward model. And that allows us to score how good a completion is for a prompt. Once we have a reward model, we can't deploy this because this is not very useful as an assistant by itself but it's very useful for the reinforcement learning stage that follows now.
这就是我们训练奖励模型的方法。这让我们能够评估给定提示下完成得有多好。当我们有了奖励模型,我们不能单独部署它,因为它本身作为助手并不特别有用,但对于接下来的强化学习阶段非常有用。

Once we have a reward model, we can score the quality of any arbitrary completion for any given prompt. So what we do during the reinforcement learning is we basically get again a large collection of prompts and now we do reinforcement learning with respect to the reward model.
一旦我们有了奖励模型,我们就可以评分给出的提示任意完成的质量。因此,在强化学习过程中,我们基本上会获得大量提示,现在我们会针对奖励模型进行强化学习。

So here's what that looks like. We take a single prompt, we lay it out in rows, and now we use basically the model we'd like to train, which is initialized at the SFT model, to create some completions in yellow. And then we append the reward token again and we read off the reward according to the reward model, which is now kept fixed.
这就是这个过程的样子。我们拿一个提示,把它按行排开,然后使用我们想要训练的模型(以SFT模型初始化)来生成一些黄色的补全。接着,我们再次附加奖励标记,并根据现在保持固定的奖励模型读出奖励。

It doesn't change anymore. And now the reward model tells us the quality of every single completion for these prompts. And so what we can do is we can now just basically apply the same language modeling loss function but we're currently training on the yellow tokens and we are weighing the language modeling objective by the rewards indicated by the reward model.
现在它已经不再改变了。现在,奖励模型告诉我们每个提示完成的质量如何。因此,我们现在可以基本上应用相同的语言建模损失函数,但我们目前是在黄色的记号上训练,并且我们正在根据奖励模型指示的奖励来衡量语言建模目标。

So as an example, in the first row the reward model said that this is a fairly high scoring completion and so all of the tokens that we happen to sample on the first row are going to get reinforced and they are going to get higher probabilities for the future.
拿第一行作为例子,奖励模型认为这是一个得分相当高的补全,因此我们在第一行采样到的所有令牌都会得到强化,它们在未来会获得更高的概率。

Conversely, on the second row, the reward model really did not like this completion, negative 1.2. And so therefore every single token that we sampled in that second row is going to get a slightly lower probability in the future. And we do this over and over on many prompts, on many batches, and basically we get a policy that creates yellow tokens here such that all of the completions will score high according to the reward model that we trained in the previous stage. So that's how we train, that's what the RLHF pipeline is.
相反,在第二行,奖励模型很不喜欢这个补全,得分为负1.2。因此,我们在第二行采样到的每个令牌在未来都会获得稍低的概率。我们在许多提示、许多批次上反复这样做,基本上就得到了一个策略:它生成的黄色令牌,也就是这里所有的补全,都会在我们上一阶段训练的奖励模型下获得高分。这就是我们的训练方法,也就是RLHF流程。
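
A heavily simplified sketch of that weighted objective is below. Real RLHF uses PPO with clipping, a KL penalty against the SFT model, baselines and so on, so treat this purely as an illustration of "weigh the sampled tokens' log-probabilities by their reward".

```python
import torch

# logprobs: log-probabilities of the sampled (yellow) tokens under the policy,
# one row per completion; rewards: the fixed reward model's score per completion.
logprobs = torch.randn(3, 8, requires_grad=True)   # stand-in values
rewards = torch.tensor([1.0, -1.2, 0.3])

# Reward-weighted language modeling objective: tokens from high-reward completions
# are reinforced, tokens from negative-reward completions are discouraged.
loss = -(rewards[:, None] * logprobs).mean()
loss.backward()
```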

Now then, at the end you get a model that you could deploy. And so as an example, ChatGPT is an RLHF model. But some other models that you might come across, like for example Vicuna 13B and so on, these are SFT models. So we have base models, SFT models and RLHF models, and that's kind of like the state of things there.
最后你会得到一个可以部署的模型。举个例子,ChatGPT就是一个RLHF模型。但你可能遇到的其他一些模型,比如Vicuna 13B等等,这些是SFT模型。因此我们有基础模型、SFT模型和RLHF模型,这大致就是目前的情况。

Now why would you want to do RLHF? So one answer that is kind of not that exciting is that it just works better. So this comes from the InstructGPT paper. According to these experiments, a while ago now, these PPO models are RLHF and we see that they are basically just preferred in a lot of comparisons when we give them to humans. So humans just prefer, basically, tokens that come from RLHF models compared to SFT models, compared to a base model that is prompted to be an assistant. And so it just works better. But you might ask why, why does it work better? And I don't think that there's a single amazing answer that the community has really agreed on. But I will just offer one reason, potentially.
为什么你会想做RLHF呢?一个不那么令人兴奋的答案是:它的效果就是更好。这来自InstructGPT论文。根据这些实验(现在看来已经有一段时间了),这些PPO模型就是RLHF模型。我们看到,当把它们交给人类比较时,人们在很多比较中就是更偏好它们:相比SFT模型、相比被提示成助手的基础模型,人们更喜欢来自RLHF模型的输出。所以它就是效果更好。但你可能会问,它为什么效果更好?我认为社区并没有一个公认的绝妙答案,但我可以提供一个可能的原因。

And it has to do with the asymmetry between how easy computationally it is to compare versus to generate. So let's take an example of generating a haiku. Suppose I ask a model to write a haiku about paperclips. If you're a contractor trying to provide training data, then imagine being a contractor collecting data for the SFT stage: how are you supposed to create a nice haiku for a paperclip? You might just not be very good at that. But if I give you a few examples of haikus, you might be able to appreciate some of these haikus a lot more than others. And so judging which one of these is good is a much easier task.
这与比较和生成在计算难度上的不对称有关。以生成俳句为例。假设我让模型写一首关于回形针的俳句。如果你是一位为SFT阶段收集训练数据的承包商,你该如何为回形针创作一首漂亮的俳句呢?你可能并不擅长这件事。但如果我给你几首俳句的样例,你也许会觉得其中一些明显比另一些好。因此,判断哪一首好是容易得多的任务。

And so basically this asymmetry makes it so that comparisons are a better way to potentially leverage yourself as a human and your judgment to create a slightly better model. Now RLHF models are not strictly an improvement on the base models in some cases. So in particular, we've noticed for example that they lose some entropy. So that means that they give more peaky results. They can output samples with lower variation than the base model.
基本上,这种不对称性使得比较成为一种更好的方式,可以利用你作为人类的判断力来造出一个稍好一些的模型。不过,RLHF模型在某些情况下并不严格优于基础模型。特别是,我们注意到它们会损失一些熵,这意味着它们的输出分布更尖锐(peaky),输出样本的多样性比基础模型更低。

So base model has lots of entropy and will give lots of diverse outputs. So for example, one kind of place where I still prefer to use a base model is in the setup where you basically have n things and you want to generate more things like it. And so here is an example that I just cooked up. I want to generate cool Pokemon names. I gave it seven Pokemon names and I asked the base model to complete the document and it gave me a lot more Pokemon names. I don't believe they're actual Pokemon. And this is the kind of task that I think base model would be good at because it still has lots of entropy and will give you lots of diverse cool kind of more things that look like whatever you give it before.
基本模型具有大量的熵,可以产生多样化的输出。例如,在你有n个事物并且想要生成更多类似事物的设置中,我仍然更喜欢使用基本模型。这里是一个我刚刚想到的例子。我想生成有趣的宝可梦名字。我给了它七个宝可梦名字,并要求基本模型完成文档,它给了我更多的宝可梦名字。我不认为它们是真正的宝可梦。这是我认为基本模型擅长的任务,因为它仍然具有许多熵,可以给你很多多样化的有趣的东西,看起来与你给它的东西类似。

So having said all that, these are kind of like the assistant models that are probably available to you at this point. There's a team at Berkeley that ranked a lot of the available assistant models and gave them basically ELO ratings. So currently some of the best models are GPT-4, by far, I would say, followed by Claude, GPT-3.5, and then a number of models, some of which might be available as weights, like Vicuna, Koala, etc. And the first three rows here are all RLHF models, and all of the other models, to my knowledge, are SFT models, I believe.
说了这么多,这些大概就是你目前可以使用的助手模型。伯克利有一个团队对许多可用的助手模型进行了排名,并给出了ELO评分。目前最好的模型无疑是GPT-4,其次是Claude、GPT-3.5,然后是一些其他模型,其中一些可能以权重形式提供,比如Vicuna、Koala等等。这里的前三行都是RLHF模型,据我所知,其他所有模型都是SFT模型。

Okay, so that's how we train these models on a high level. Now I'm going to switch gears and let's look at how we can best apply and GPT assistant models for your problems. Now I would like to work in a setting of a concrete example. So let's work with a concrete example here. Let's say that you are working on an article or a blog post and you're going to write this sentence at the end. California's population is 53 times that of Alaska.
好的,这就是我们以高层次进行这些模型训练的方式。现在我要转变思路,让我们看看如何最好地应用 GPT 助手模型来解决你的问题。现在我想在一个具体的例子中工作。那么让我们使用一个具体的例子。假设你正在撰写一篇文章或博客,在结束语中要写出以下这句话。加利福尼亚州的人口是阿拉斯加州的53倍。

So for some reason you want to compare the populations of these two states. Think about the rich internal monologue and tool use, and how much work actually goes on computationally in your brain to generate this one final sentence. So here's maybe what that could look like in your brain. Okay, for this next step of my blog, let me compare these two populations. Okay, first I'm obviously going to need to get both of these populations.
出于某种原因,你想比较这两个州的人口。想想你大脑中丰富的内心独白和工具使用,以及为了生成这最后一句话,你的大脑实际上要做多少计算工作。这在你的大脑中可能是这样的:好的,博客的下一步,让我来比较这两个州的人口。首先,我显然需要先拿到这两个人口数字。

Now I know that I probably don't know these populations off the top of my head. So I'm kind of like aware of what I know where I don't know of my self knowledge, right? So I go, I do some tool use and I go to Wikipedia and I look up California's population and Alaska's population.
现在我知道,我也许并不知道这些人口数据。因此,我意识到了自己知道哪些,不知道哪些。于是我使用了一些工具,去了解了加利福尼亚和阿拉斯加的人口数据,我在维基百科上查阅了这些信息。

Now I know that I should divide the two, but again, I know that dividing 39.2 by 0.74 is very unlikely to succeed. That's not the kind of thing that I can do in my head. And so therefore I'm going to rely on the calculator. So I'm going to use a calculator, punch it in and see that the output is roughly 53. And then maybe I do some reflection and sanity checks in my brain.
现在我知道应该用一个数除以另一个,但我也知道,心算39.2除以0.74几乎不可能算对,这不是我能在脑子里完成的事情。因此,我要借助计算器。我用计算器把式子输进去,看到输出大约是53。然后,也许我会在脑海里做一些反思和合理性检查。

So does 53 make sense? Well, that's quite a large fraction, but then California is the most populous state, so maybe that looks okay. So then I have all the information I might need and now I get to the sort of creative portion of writing. So I might start to write something like "California has 53x times greater" and then I think to myself, that's actually really awkward phrasing, so let me delete that and let me try again.
那么,53这个数合理吗?这是一个相当大的倍数,但加利福尼亚是人口最多的州,所以看起来还算合理。现在我拿到了所有可能需要的信息,接下来进入写作中比较有创造性的部分。我可能会先写"California has 53x times greater"之类的句子,然后我心想,这个措辞实在太别扭了,于是把它删掉,重新再试一次。

And so as I'm writing, I have this separate process almost inspecting what I'm writing and judging whether it looks good or not. And then maybe I delete and maybe I reframed it and then maybe I'm happy with what comes out. So basically long story short, a ton happens under the hood in terms of your internal monologue when you create sentences like this.
当我写作时,我几乎有一个独立的过程在审查我写的东西,判断它是否看起来好。然后可能我会删除它,重新构思,然后也可能我对最终的作品很满意。总的来说,当你创造这样的句子时,你的内在独白下面发生了很多事情。

But what does a sentence like this look like when we are training a GPT on it? From GPT's perspective, this is just a sequence of tokens. So GPT when it's reading or generating these tokens, it just goes chunk, chunk, chunk, chunk, chunk. And each chunk is roughly the same amount of computational work for each token.
但是如果我们对这句话进行GPT的训练,它会是什么样子呢?从GPT的角度来看,这只是一个记号序列。因此,在读取或生成这些记号时,GPT只是一块一块地处理,每一块对于每个记号来说大致都需要相同的计算量。

And these transformers are not very shallow networks. They have about 80 layers of reasoning, but 80 is still not like too much. And so this transformer is going to do its best to imitate. But of course, the process here looks very, very different from the process that you took.
这些Transformer并不是很浅的网络,它们大约有80层推理,但80层也并不算很多。因此,这个Transformer会尽其所能去模仿,但这里的过程当然与你刚才经历的过程非常、非常不同。

So in particular, in our final artifacts, in the data sets that we create and then eventually feed to the LLMs, all of that internal dialogue is completely stripped. And unlike you, the GPT will look at every single token and spend the same amount of compute on every one of them. And so you can't expect it to do too much work per token.
特别是,在我们最终的成果中,也就是我们创建并最终输入给LLM的数据集中,所有这些内心对话都被完全剥离了。与你不同,GPT会查看每一个令牌,并在每个令牌上花费相同的计算量。因此,你不能指望它在每个令牌上做太多的工作。

So and also in particular, basically these transformers are just like token simulators. So they don't know what they don't know. Like they just imitate the next token. They don't know what they're good at or not good at. They just tried their best to imitate the next token. They don't reflect in the loop. They don't sanity check anything.
因此,特别要指出的是,这些Transformer基本上就像令牌模拟器。它们不知道自己不知道什么,只是模仿下一个令牌。它们不知道自己擅长什么、不擅长什么,只是尽力模仿下一个令牌。它们不会在过程中反思,也不会对任何事情做合理性检查。

They don't correct their mistakes along the way by default. They just sample token sequences. They don't have separate internal monologues or streams in their head that are evaluating what's happening. Now they do have some sort of cognitive advantages, I would say. And that is that they do actually have very large fact-based knowledge across a vast number of areas, because they have several tens of billions of parameters.
它们默认不会在过程中纠正自己的错误,只是采样令牌序列。它们没有头脑中那种单独的、用来评估当前情况的内心独白。不过我要说,它们确实有某些认知上的优势:因为拥有几百亿个参数,它们实际上具有横跨大量领域的、非常庞大的事实性知识。

So it's a lot of storage for a lot of facts. But and they also, I think, have a relatively large and perfect working memory. So whatever fits into the context window is immediately available to the transformer through its internal self-attention mechanism.
它有很大的存储空间来存储大量的事实,并且它们也具有相对较大且完善的工作记忆。因此,通过其内部的自我关注机制,无论什么信息适合上下文窗口,变压器都能立即使用。

And so it's kind of like perfect memory, but it's got a finite size. But the transformer has a very direct access to it. And so it can like, losslessly remember anything that is inside its context window. So it's kind of how I would compare those two.
所以,这有点像完美的记忆,但它的大小是有限的。但是变换器可以直接访问它。因此,它可以无损地记住其上下文窗口内的任何内容。这就是我将这两者进行比较的方式。

And the reason I bring all of this up is because I think to a large extent, prompting is just making up for this sort of cognitive difference between these two kind of architectures. Like our brains here and LLM brains. You can look at it that way almost.
我提到这一切,是因为我认为在很大程度上,提示(prompting)只是在弥补这两种架构(我们的大脑和LLM的"大脑")之间的认知差异。你几乎可以这样看待它。

So here's one thing that people found, for example, works pretty well in practice. Especially if your tasks require reasoning, you can't expect the transformer to make too much reasoning per token. And so you have to really spread out the reasoning across more and more tokens.
有一件事情,例如在实践中非常有效。特别是如果你的任务需要推理,你不能期望transformer对每个token进行过多的推理。因此,你必须通过更多的token来扩展推理。

So for example, you can't give a transformer a very complicated question and expect it to get the answer in a single token. There's just not enough time for it. These transformers need tokens to think. Quote unquote, I like to say sometimes. And so this is some of the things that work well.
例如,你不能给Transformer一个非常复杂的问题,然后期望它在单个令牌里得出答案。它根本没有足够的时间。我有时喜欢说,这些Transformer需要令牌来"思考"。下面是一些行之有效的做法。

You may, for example, have a few shot prompt that shows the transformer that it should show its work when it's answering a question. And if you give a few examples, the transformer will imitate that template and it will just end up working out better in terms of its evaluation.
例如,您可以提供一些提示来告诉变压器在回答问题时,它应该展示其工作方式。如果您提供了几个例子,变压器会模仿这个模板,并在评估方面取得更好的效果。

Additionally, you can elicit this kind of behavior from the transformer by saying, "let's think step by step," because this conditions the transformer into showing its work. And because it kind of snaps into a mode of showing its work, it's going to do less computational work per token. And so it's more likely to succeed as a result, because it's making slower reasoning over time.
此外,您可以通过说“一步一步地思考”,引导变压器表现出这种行为,因为这会让变压器习惯于展示它的计算过程。因为它会切换到显示计算过程的模式,所以每个符号的计算工作会更少。因此,它更有可能成功,因为它会随着时间的推移进行更缓慢的推理。

Here's another example. This one is called self-consistency. We saw that we had the ability to start writing. And then if it didn't work out, I can try again. And I can try multiple times and maybe select the one that worked best. So in these kinds of approaches, you may sample not just once, but you may sample multiple times. And then I have some process for finding the ones that are good and then keeping just those samples or doing a majority vote or something like that.
这里有另一个例子,叫做自我一致性。我们发现我们有能力开始写作。如果没能成功,我可以再试一次。我可以多次尝试,然后选择效果最好的一次。在这些方法中,你可能不只是抽样一次,而是抽样多次。然后我有一些过程来找到好的样本,只保留那些样本或进行多数投票等处理。
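
In code, self-consistency is just "sample several times and vote"; a minimal sketch, where `ask_llm` and `extract_answer` are hypothetical helpers standing in for an API call and an answer parser.

```python
from collections import Counter

def self_consistency(prompt, ask_llm, extract_answer, n=10):
    # Sample n completions at nonzero temperature, parse an answer out of each,
    # and return the majority answer together with how strongly it won.
    answers = [extract_answer(ask_llm(prompt, temperature=0.8)) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n
```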

So basically these transformers in the process as they predict the next token, just like you, they can get unlucky. And they could sample a not very good token. And they can go down sort of like a blind alley in terms of reasoning. And so unlike you, they cannot recover from that. They are stuck with every single token they sample. And so they will continue the sequence even if they even know that this sequence is not going to work out. So give them the ability to look back, inspect, or try to find, try to basically sample around it.
基本上,这些变换器在预测下一个令牌时,像你一样,也可能不够幸运地随机到一个不太好的令牌。它们可能会走进推理的死胡同。与你不同的是,它们无法从中恢复过来。它们会继续生成序列,即使它们知道这个序列不会成功。所以,给它们一些能力,让它们回顾过去、检查或尝试寻找、尝试随机到一个更好的令牌。

Here's one technique also. It turns out that actually LLMs, like they know when they've screwed up. So as an example, say you asked the model to generate a poem that does not rhyme. And it might give you a poem, but it actually rhymes. But it turns out that especially for the bigger models, like GPT-4, you can just ask it, "did you meet the assignment?" And actually GPT-4 knows very well that it did not meet the assignment. It just kind of got unlucky in its sampling. And so it will tell you, "no, I didn't actually meet the assignment. Here, let me try again." But without you prompting it, it doesn't know to revisit and so on. So you have to make up for that in your prompts. You have to get it to check. If you don't ask it to check, it's not going to check by itself. It's just a token simulator.
这里还有一种技巧。其实LLM(语言模型)也能知道自己出错了。比如,让模型生成一首不押韵的诗,它可能生成了一首押韵的诗,但对于像GPT-4这样的大模型来说,你只需要问它,“你完成了任务吗?”它会非常清楚地知道自己没完成任务,只是在采样时运气不佳。这时,它会告诉你,“不,我没有完成任务。让我再试一次。”但如果你不提示它,它就不会知道要重新检查等等。因此,在你的提示中必须让它检查。如果你不让它检查,它就不会自动检查,只会表现为一个标记模拟器。

I think more generally, a lot of these techniques fall into the bucket of what I would say is recreating our System 2. So you might be familiar with the System 1, System 2 thinking for humans. System 1 is a fast, automatic process, and I think kind of corresponds to an LLM just sampling tokens. And System 2 is the slower, deliberate planning sort of part of your brain.
我认为,更广泛地说,这些技术很多都可以归入我所说的"重建我们的系统二(System 2)"这一类。你可能熟悉人类思维的系统一和系统二:系统一是快速、自动的过程,我认为大致对应于LLM直接采样令牌;系统二则是大脑中较慢的、深思熟虑的规划部分。

And so this is a paper actually from just last week, because this space is pretty quickly evolving. It's called Tree of Thought. And in Tree of Thought, the authors of this paper proposed maintaining multiple completions for any given prompt. And then they are also scoring them along the way and keeping the ones that are going well, if that makes sense. And so a lot of people are like really playing around with kind of prompt engineering to basically bring back some of these abilities that we sort of have in our brain for LLMs.
这是一篇实际上就在上周发表的论文,因为这个领域发展得非常快。它叫"思维树(Tree of Thought)"。在这篇论文中,作者提出为任意给定的提示维护多个补全,并在过程中对它们进行评分,保留那些进展顺利的。因此,许多人都在认真摆弄提示工程,基本上是想为LLM找回一些我们大脑中本来就有的能力。

Now one thing I would like to note here is that this is not just a prompt. This is actually prompts that are used together with some Python glue code, because you actually have to maintain multiple prompts, and you also have to do some tree search algorithm here to figure out which prompts to expand, etc. So it's a symbiosis of Python glue code and individual prompts that are called in a while loop or in a bigger algorithm.
在这里我想指出一点:这不仅仅是一个提示。这实际上是与一些Python粘合代码一起使用的一组提示,因为你确实需要维护多个提示,还需要在这里做某种树搜索算法,来决定扩展哪些提示,等等。因此,这是Python粘合代码与单个提示的共生,这些提示在while循环或更大的算法中被调用。
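
A toy sketch of that "Python glue code plus prompts" pattern is below; `propose` and `score` are hypothetical prompt-calling helpers, and this is not the Tree of Thought paper's actual implementation.

```python
def tree_of_thought(question, propose, score, breadth=3, depth=2):
    # Keep several partial "thought" chains, score them with the model,
    # and only expand the most promising ones at each level.
    frontier = [""]
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            for thought in propose(question, partial):       # LLM proposes next steps
                candidates.append(partial + "\n" + thought)
        candidates.sort(key=lambda c: score(question, c), reverse=True)
        frontier = candidates[:breadth]                       # prune to the best few
    return frontier[0]
```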

I also think there's a really cool parallel here to AlphaGo. AlphaGo has a policy for placing the next stone when it plays Go, and this policy was trained originally by imitating humans. But in addition to this policy, it also does Monte Carlo tree search, and basically it will play out a number of possibilities in its head and evaluate all of them and only keep the ones that work well. And so I think this is kind of an equivalent of AlphaGo but for text, if that makes sense.
我认为这里有一个与AlphaGo非常有趣的相似之处。AlphaGo有一个下围棋时放置下一颗棋子的策略,这个策略最初是通过模仿人类训练的。但除了这个策略之外,它还会进行蒙特卡洛树搜索,基本上会在"脑海"中推演许多可能性,对它们进行评估,只保留效果好的那些。因此,我认为这在某种程度上相当于文本领域的AlphaGo。

So just like tree of thought, I think more generally people are starting to really explore more general techniques of not just simple question-and-answer prompts, but something that looks a lot more like Python glue code stringing together many prompts. So on the right, I have an example from this paper, ReAct, where they structure the answer to a prompt as a sequence of thought, action, observation, thought, action, observation, and it's a full rollout kind of thinking process to answer the query. And in these actions, the model is also allowed to use tools.
就像思维树一样,我认为人们正开始更广泛地探索更通用的技术,不再只是简单的问答提示,而是更像用Python粘合代码把许多提示串联起来的东西。右边是来自ReAct这篇论文的示例,他们把对提示的回答组织成"思考、行动、观察,思考、行动、观察"的序列,这是回答查询的一次完整推演式的思考过程。在这些行动中,模型也被允许使用工具。

On the left, I have an example of AutoGPT. Now AutoGPT, by the way, is a project that I think got a lot of hype recently, but I still find it kind of inspirational and interesting. It's a project that allows an LLM to keep a task list and continue to recursively break down tasks. I don't think this currently works very well and I would not advise people to use it in practical applications. I just think it's something to generally take inspiration from in terms of where this is going, I think, over time. So that's kind of like giving our model System 2 thinking.
在图片左侧,我展示了一个自动 GPT 的例子。最近,自动 GPT 成为了一个备受关注的项目。虽然我认为它目前的效果并不太好,不建议人们将其用于实际应用,但我仍然感到灵感和兴趣。该项目允许 LLM 维护任务列表,并持续递归地拆分任务。我认为该功能仍存在一些问题。但我认为它反映了技术在未来的发展方向,这也是我们思考模型 系统的一种思路。

The next thing that I find kind of interesting is this following sort of, I would say, almost psychological quirk of LLMs is that LLMs don't want to succeed. They want to imitate. You want to succeed and you should ask for it. So what I mean by that is when transformers are trained, they have training sets. And there can be an entire spectrum of performance qualities in their training data. So for example, there could be some kind of a prompt for some physics question or something like that. And there could be a student solution that is completely wrong. But there can also be an expert answer that is extremely right. And transformers can't tell the difference between like, look, or I mean, they know, they know about low quality solutions and high quality solutions. But by default, they want to imitate all of it because they're just trained on language modeling.
接下来我觉得有趣的一点是,LLM(大型语言模型)有一种近乎心理层面的怪癖:它们并不想"成功",它们只想模仿。而你想要成功,所以你应该明确提出要求。我的意思是,Transformer在训练时有训练集,而训练数据中的表现质量可能横跨整个区间。例如,对于某道物理题的提示,可能有完全错误的学生解答,也可能有极其正确的专家解答。Transformer其实知道什么是低质量解答、什么是高质量解答,但默认情况下它们想模仿所有这些,因为它们只是被训练来做语言建模。

And so at test time, you actually have to ask for a good performance. So in this example, in this paper, they tried various prompts. And let's think step by step was very powerful because it sort of like spread out the reasoning over many tokens. But what worked even better is let's work this out in a step by step way to be sure we have the right answer. And so it's kind of like conditioning on getting a right answer. And this actually makes the transformer work better because the transformer doesn't have to now hedge its probability mass on low quality solutions as ridiculous as that sounds. And so basically, feel free to ask for a strong solution. Say something like, you are a leading expert on this topic. Pretend you have IQ 120, et cetera. But don't try to ask for too much IQ because if you ask for a IQ of like 400, you might be out of data distribution.
在测试时,你实际上必须要求获得良好的表现。因此,在这个例子中,他们尝试了各种提示。让我们一步一步思考是非常有力的,因为它将推理分散在许多令牌上。但是,更好的方法是让我们按步骤进行工作,以确保我们有正确的答案。这就像是在获得正确答案的情况下进行调整。这实际上使Transformer的工作更好,因为Transformer不必将其概率质量置于低质量解决方案上,尽管这听起来很荒谬。因此,你可以自由地寻求强力的解决方案。比如说,“你是这个话题的领先专家。假装你有高智商等等。但不要尝试去要求过高的智商,因为如果你要求400的智商,你可能会超出数据分布范围。”

Or even worse, you could be in data distribution for some like sci-fi stuff. And it will start to like take on some sci-fi, like role playing or some stuff like that. So you have to find like the right amount of IQ, I think it's got some U-shaped curve there. Next up, as we saw, when we are trying to solve problems, we know we are good at and what we're not good at and we lean on tools computationally. You want to do the same potentially with your LLMs. So in particular, we may want to give them calculators, code interpreters, and so on, the ability to do search. And there's a lot of techniques for doing that.
或者更糟,你可能落在某些科幻内容的数据分布里,它会开始进行科幻式的角色扮演之类的。因此,你需要找到合适的"智商"水平,我认为这里有某种U型曲线。接下来,正如我们所看到的,当我们解决问题时,我们知道自己擅长什么、不擅长什么,并会在计算上依赖工具。你可能也希望对你的LLM做同样的事情。特别是,我们可能希望给它们计算器、代码解释器等等,以及搜索的能力。有许多技术可以做到这一点。

One thing to keep in mind again is that these transformers by default may not know what they don't know. So you may even want to tell the transformer in a prompt, you are not very good at mental arithmetic. Whenever you need to do very large number addition, multiplication or whatever, instead use this calculator. Here's how you use the calculator. Use this token, combination, et cetera, et cetera. So you have to actually like spell it out because the model by default doesn't know what it's good at or not good at necessarily, just like you and I might be.
需要记住的一件事是,这些转换器默认情况下可能不知道他们不知道的事情。因此,您甚至可能希望在提示中告诉转换器,您不擅长心算。每当您需要进行非常大的数字加法,乘法或其他计算时,请使用此计算器。以下是如何使用计算器。使用此令牌、组合等等。因此,您必须明确说明,因为模型默认情况下可能不知道它擅长或不擅长什么,就像您和我可能一样。
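
One common way to wire that up is to have the model emit a special marker whenever it wants the tool, and have glue code run the tool and splice the result back in; the CALC(...) syntax below is an assumption for illustration, not a standard.

```python
import re

def run_calculator_calls(model_output: str) -> str:
    # Replace every CALC(<expression>) marker in the model's output with the result.
    def evaluate(match):
        expr = match.group(1)
        if not re.fullmatch(r"[\d\.\s\+\-\*/\(\)]+", expr):   # only allow plain arithmetic
            return match.group(0)
        return str(eval(expr))   # fine for a toy sketch; never eval untrusted text in production
    return re.sub(r"CALC\(([^)]*)\)", evaluate, model_output)

print(run_calculator_calls("The ratio is roughly CALC(39.2 / 0.74)."))
```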

Next up, I think something that is very interesting is we went from a world that was retrieval only, all the way until the pendulum swung to the other extreme, where it's memory only in LLMs. But actually there's this entire space in between of these retrieval-augmented models, and this works extremely well in practice. As I mentioned, the context window of a transformer is its working memory. If you can load the working memory with any information that is relevant to the task, the model will work extremely well because it can immediately access all that memory. And so I think a lot of people are really interested in basically retrieval-augmented generation.
接下来我认为非常有趣的一点是,我们曾经处在一个只有检索的世界,后来钟摆又摆到了另一个极端,即LLM里只靠记忆。但实际上,这中间还有检索增强模型的整个空间,而且这在实践中效果极好。正如我提到的,Transformer的上下文窗口就是它的工作记忆。如果你能把与任务相关的信息装进工作记忆,模型就会表现得非常好,因为它可以立即访问所有这些内容。因此,我认为很多人都对检索增强生成(RAG)非常感兴趣。

And on the bottom I have an example of LlamaIndex, which is one sort of data connector to lots of different types of data. And you can index all of that data and you can make it accessible to LLMs. And the emerging recipe there is: you take relevant documents, you split them up into chunks, you embed all of them and you basically get embedding vectors that represent that data. You store that in a vector store, and then at test time you make some kind of a query to your vector store, you fetch chunks that might be relevant to your task, you stuff them into the prompt and then you generate. So this can work quite well in practice.
在底部,我举了LlamaIndex的例子,它是一种能连接许多不同类型数据的数据连接器。你可以为所有这些数据建立索引,并让LLM能够访问它们。那里的新兴做法是:把相关文档切分成块,对它们全部做嵌入,得到表示这些数据的嵌入向量;把它们存进向量库;然后在使用时对向量库发起某种查询,取出可能与任务相关的文档块,把它们填进提示里,再进行生成。这在实践中效果相当好。
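
A minimal sketch of that recipe (chunk, embed, store, retrieve, stuff into the prompt), where `embed` is a hypothetical function mapping text to a vector, e.g. an embeddings API or a local model, and the "vector store" is just an in-memory array.

```python
import numpy as np

def build_index(documents, embed, chunk_size=500):
    # Split documents into chunks and embed each chunk into a vector.
    chunks = [doc[i:i + chunk_size] for doc in documents
              for i in range(0, len(doc), chunk_size)]
    vectors = np.array([embed(c) for c in chunks])
    return chunks, vectors

def retrieve(query, chunks, vectors, embed, k=3):
    # Cosine similarity between the query embedding and every chunk embedding.
    q = np.asarray(embed(query))
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# At query time: stuff the retrieved chunks into the prompt, then generate, e.g.
# prompt = "Context:\n" + "\n---\n".join(retrieve(q, chunks, vectors, embed)) + "\n\nQuestion: " + q
```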

So this is, I think, similar to when you and I solve problems: you can do everything from your memory, and transformers have very large and extensive memory, but also it really helps to reference some primary documents. So whenever you find yourself going back to a textbook to find something, or whenever you find yourself going back to the documentation of a library to look something up, transformers definitely want to do that too. You have some memory over how some documentation of a library works, but it's much better to look it up. So the same applies here.
我认为这和你我解决问题的方式很相似:你可以完全凭记忆去做,而Transformer拥有非常庞大和广泛的记忆,但参考一些原始文档确实很有帮助。所以,每当你发现自己要翻回教科书查某个东西,或者回头查某个程序库的文档时,Transformer当然也想这样做。你对某个库的文档也许有些记忆,但直接查阅会好得多。这里的道理是一样的。

Next, I wanted to briefly talk about constraint prompting. I also find this very interesting. This is basically techniques for forcing a certain template in the outputs of LLMs. So guidance is one example from Microsoft actually. And here we are enforcing that the output from the LLMs will be JSON. And this will actually guarantee that the output will take on this form because they go in and they mess with the probabilities of all the different tokens that come out of the transformer and they clamp those tokens.
接下来,我想简要谈谈约束提示。我也觉得这很有趣。这基本上是迫使LLMs输出特定模板的技术。微软的Guidance就是一个例子。在这里,我们强制LLMs的输出是JSON格式。这实际上会保证输出采取这种形式,因为他们会干预所有来自变压器的不同标记的概率,并夹紧这些标记。

And then the transformer is only filling in the blanks here. And then you can enforce additional restrictions on what could go into those blanks. So this might be really helpful and I think this kind of constraint sampling is also extremely interesting.
然后Transformer只是在这里填空,而且你还可以对能填入这些空白的内容施加额外的限制。这可能非常有帮助,我认为这种约束采样也极其有趣。
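
The core mechanism behind this kind of constrained sampling can be sketched as masking the logits so that only template-legal tokens can be sampled; `allowed_ids` would come from a grammar or JSON-template checker (hypothetical here).

```python
import torch

def constrained_next_token(logits: torch.Tensor, allowed_ids: list) -> int:
    # Clamp the distribution: disallowed tokens get probability zero, so the model
    # can only "fill in the blanks" that the template permits.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```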

I also wanted to say a few words about fine tuning. It is the case that you can get really far with prompt engineering, but it's also possible to think about fine tuning your models. Now fine tuning models means that you are actually going to change the weights of the model. It is becoming a lot more accessible to do this in practice, and that's because of a number of techniques that have been developed and have libraries for very recently. So for example, parameter-efficient fine tuning techniques like LoRA make sure that you are only training small sparse pieces of your model. So most of the model is kept clamped at the base model and some pieces of it are allowed to change, and this still works pretty well empirically and makes it much cheaper to tune only small pieces of your model.
我也想谈一下微调。确实,提示工程可以让你走得很远,但也可以考虑微调你的模型。微调模型意味着你真的要改变模型的权重。在实践中做这件事正变得容易得多,这是因为最近开发出了许多技术并有了相应的库。例如,像LoRA这样的参数高效微调技术,可以确保你只训练模型中小而稀疏的部分:模型的大部分保持为基础模型不变,只允许其中一部分发生变化。经验上这仍然效果不错,而且只调整模型的一小部分要便宜得多。
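
For example, with the Hugging Face peft library a LoRA setup looks roughly like the sketch below; the model name, target modules and hyperparameters are illustrative assumptions, not a recommendation.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")          # stand-in base model
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"],                # GPT-2's attention projection
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)    # base weights stay frozen; small adapters are trainable
model.print_trainable_parameters()       # typically well under 1% of all parameters
```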

It also means that, because most of your model is clamped, you can use very low precision inference for computing those parts, because they are not going to be updated by gradient descent, and so that makes everything a lot more efficient as well. And in addition, we have a number of open source, high quality base models currently; as I mentioned, I think LLaMA is quite nice, although it is not commercially licensed, I believe, right now.
这也意味着,由于模型的大部分被固定,你可以用非常低精度的推理来计算这些部分,因为它们不会被梯度下降更新,这让一切都高效得多。此外,我们目前还有许多开源的高质量基础模型;正如我提到的,我认为LLaMA就很不错,尽管我相信它目前还没有商业授权。

Something to keep in mind is that basically fine tuning is a lot more technically involved. It requires a lot more, I think technical expertise to do to do right. It requires human data contractors for data sets and or synthetic data pipelines that can be pretty complicated. This will definitely slow down your iteration cycle by a lot.
需要记住的是,微调在技术上要复杂得多。我认为要把它做对需要更多的技术专长。它需要人类数据承包商来构建数据集,和/或相当复杂的合成数据管道。这肯定会大大拖慢你的迭代周期。

And I would say on a high level, SFT is achievable because it is just your continuing the language modeling task. It's relatively straightforward. But RLHF I would say is very much research territory and is even much harder to get to work. And so I would probably not advise that someone just tries to roll their own RLHF implementation. These things are pretty unstable, very difficult to train. Not something that is I think very beginner friendly right now and is also potentially likely also to change pretty rapidly still.
我认为,从高层次来看,SFT可以实现,因为它只是在继续语言建模任务,相对来说比较简单。但是,我认为RLHF很大程度上是研究领域,而且要让它工作起来更加困难。因此,我可能不建议有人尝试自己实现RLHF。这些东西相当不稳定,非常难训练。目前可能不是非常适合初学者的,而且还有可能很快发生变化。

So I think these are my sort of default recommendations right now. I would break up your task into two major parts. Number one, achieve your top performance, and number two, optimize your performance, in that order. Number one, the best performance will currently come from the GPT-4 model. It is the most capable model by far. Use prompts that are very detailed. They have lots of task content, relevant information and instructions. Think along the lines of: what would you tell a task contractor if they can't email you back? But then also keep in mind that a task contractor is a human and they have inner monologue and they are very clever, etc.
所以这些是我目前的默认建议。我会把你的任务分成两大部分:第一,先达到最佳表现;第二,再优化你的表现,按这个顺序进行。第一,目前最佳表现来自GPT-4模型,它是迄今为止能力最强的模型。使用非常详细的提示,里面包含大量任务内容、相关信息和指令。可以这样去想:如果任务承包商无法再给你回邮件,你会告诉他们什么?但同时也要记住,任务承包商是人,他们有内心独白、非常聪明,等等。

LLMs do not possess those qualities. So make sure to think through the psychology of the LLM almost in cater prompts to that. Retrieve and add any relevant context and information to these prompts. Basically refer to a lot of the prompt engineering techniques. Some of them I'm highlighted in the slides above but also this is a very large space and I would just advise you to look for prompt engineering techniques online. There's a lot to cover there.
LLMs缺乏这些品质。因此,请确保认真思考LLM的心理状态,以便为之提供针对性的提示。检索并添加任何相关的上下文和信息到这些提示中。基本上,参考很多提示工程技术。其中一些在上面的幻灯片中已经被突出显示,但也有很多提示工程技术可以在网上找到。这是一个庞大的领域,我建议你去寻找提示工程技术。

Experiment with few-shot examples. What this refers to is you don't just want to tell, you want to show, whenever it's possible. So give it examples of everything that helps it really understand what you mean, if you can.
尝试少样本(few-shot)示例。意思是你不只想"告诉"它,而是尽可能地"展示"给它看。所以,如果可以的话,给它一些示例,帮助它真正理解你的意思。

Experiment with tools and plugins to offload a task that are difficult for LLMs natively. And then think about not just a single prompt and answer, think about potential chains and reflection and how you glue them together and how you could potentially make multiple samples and so on.
尝试使用工具和插件来卸载对LLMs本来难以完成的任务。然后,不仅考虑单个提示和答案,而是考虑潜在的链式反应,以及如何将它们组织在一起,以及如何在可能的情况下制作多个样本等等。

Finally, if you think you've squeezed everything out of prompt engineering, which I think you should stick with for a while, look at potentially fine tuning a model for your application, but expect this to be a lot slower and more involved. And then there's an expert, fragile research zone here, and I would say that is RLHF, which currently does work a bit better than SFT if you can get it to work, but again this is pretty involved, I would say.
最后,如果你觉得已经把提示工程的潜力榨干了(我认为你应该先在提示工程上坚持一段时间),可以考虑针对你的应用微调模型,但要预期这会慢得多、也复杂得多。再往后是一个专家级的、脆弱的研究地带,那就是RLHF:如果你能让它工作,它目前确实比SFT效果好一些,但我要再说一遍,这相当复杂。

And to optimize your costs, try to explore lower capacity models or shorter prompts as well. I also wanted to say a few words about the use cases in which I think LLMs are currently well suited for. So in particular note that there's a large number of limitations to LLMs today and so I would keep that definitely in mind for all your applications.
为了优化成本,尝试探索较低容量的型号或较短的提示语。我还想谈一谈我认为LLM目前适用的使用案例。特别需要注意的是,今天LLM存在许多限制,所以在您的所有应用程序中一定要记住这一点。

And this is by the way could be an entire talk so I don't have time to cover it in full detail. Models may be biased, they may fabricate hallucinate information, they may have reasoning errors, they may struggle in entire classes of applications, they have knowledge cutoffs so they might not know any information above say September 2021. They are susceptible to a large range of attacks which are sort of like coming out on Twitter daily including prompt injection, jailbreak attacks, data poisoning attacks and so on.
顺便提一下,这个题目可能需要单独讲,所以我没办法详细阐述。模型可能存在偏见,可能会虚构幻想信息,可能会出现推理错误,可能在某些应用中不太行,它们的知识有限,可能不知道2021年9月以后的任何信息。它们容易受到各种攻击,有点像每天在 Twitter 上公开攻击,包括提示注入、越狱攻击、数据中毒攻击等。

My recommendation right now is use LLMs in low stakes applications, combine them with always with human oversight, use them as a source of inspiration and suggestions and think co-pilots instead of completely autonomous agents that are just like performing a task somewhere. It's just not clear that the models are there right now.
目前我推荐在低风险应用中使用LLMs,始终与人类监督结合使用,将它们作为启示和建议的来源,并将其视为共同驾驶员,而不是完全自主的代理人,只是在某处执行任务。目前尚不清楚这些模型是否能够胜任。

So I wanted to close by saying that GPT-4 is an amazing artifact. I'm very thankful that it exists and it's beautiful. It has a ton of knowledge across so many areas, it can do math, code and so on. In addition there's this thriving ecosystem of everything else that is being built and incorporated into the ecosystem, some of the things I've talked about.
最后我想说的是,GPT-4是一项非常惊人的技术。我非常感激它的存在,它非常优美。它拥有各种领域的大量知识,可以进行数学运算、编码等。此外,还有其他许多正在建设和加入到生态系统的繁荣生态系统,其中一些我已经提到过。

And all of this power is accessible at your fingertips. So here's everything that's needed in terms of code to ask GPT-4 a question, to prompt it and get a response. In this case I said: can you say something to inspire the audience of Microsoft Build 2023? And I just punched this into Python and verbatim GPT-4 said the following.
而这些强大的能力都触手可及。下面就是用代码向GPT-4提问、给它提示并获得回应所需的全部内容。在这个例子中,我说:你能说些什么来激励Microsoft Build 2023的观众吗?我只是把它输入Python,GPT-4一字不差地说了下面这段话。
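
For reference, that snippet looks roughly like the following, using the OpenAI Python library's 2023-era ChatCompletion interface; the exact code shown on the slide is not reproduced here, so treat this as an approximation.

```python
import openai  # 2023-era openai<1.0 interface; newer versions use a different client API

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Can you say something to inspire the audience of Microsoft Build 2023?"}],
)
print(response["choices"][0]["message"]["content"])
```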

And by the way, I did not know that they used this trick in the keynote, so I thought I was being clever. But it is really good at this. It says: ladies and gentlemen, innovators and trailblazers of Microsoft Build 2023, welcome to the gathering of brilliant minds like no other. You are the architects of the future, the visionaries molding the digital realm in which humanity thrives. Embrace the limitless possibilities of technology and let your ideas soar as high as your imagination. Together let's create a more connected, remarkable and inclusive world for generations to come. Get ready to unleash your creativity, canvas the unknown and turn dreams into reality. Your journey begins today. Thank you.
顺便说一句,我不知道他们在主题演讲中也用了这个技巧,所以我本以为自己很聪明,但它确实很擅长这件事。它说:女士们、先生们,Microsoft Build 2023的创新者与开拓者们,欢迎来到这场无与伦比的杰出头脑的聚会。你们是未来的建筑师,是塑造人类繁荣所在的数字世界的远见者。拥抱技术的无限可能,让你们的想法与想象力一起高飞。让我们一起为子孙后代创造一个更加互联、卓越和包容的世界。准备好释放你的创造力,探索未知,把梦想变为现实。你的旅程从今天开始。谢谢。