
Attention in transformers, visually explained | Chapter 6, Deep Learning - YouTube

Published 2024-04-07 05:53:54

Transcript

In the last chapter, you and I started to step through the internal workings of a transformer. This is one of the key pieces of technology inside large language models and a lot of other tools in the modern wave of AI. It first hit the scene in a now famous 2017 paper called Attention is All You Need, and in this chapter, you and I will dig into what this attention mechanism is, visualizing how it processes data. As a quick recap, here's the important context I want you to have in mind. The goal of the model that you and I are studying is to take in a piece of text and predict what word comes next.

The input text is broken up into little pieces that we call tokens, and these are very often words or pieces of words. But just to make the examples in this video easier for you and me to think about, let's simplify by pretending that tokens are always just words. The first step in a transformer is to associate each token with a high dimensional vector, what we call its embedding. Now, the most important idea I want you to have in mind is how directions in this high dimensional space of all possible embeddings can correspond with semantic meaning.

In the last chapter, we saw an example of how direction can correspond to gender, in the sense that adding a certain step in this space can take you from the embedding of a masculine noun to the embedding of the corresponding feminine noun. That's just one example; you could imagine how many other directions in this high dimensional space could correspond to numerous other aspects of a word's meaning. The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning.

I should say up front that a lot of people find the attention mechanism, this key piece in a transformer, very confusing, so don't worry if it takes some time for things to sink in. I think that before we dive into the computational details and all the matrix multiplications, it's worth thinking about a couple examples of the kind of behavior that we want attention to enable. Consider the phrases "American shrew mole," "one mole of carbon dioxide," and "take a biopsy of the mole."

You and I know that the word mole has different meanings in each one of these, based on the context. But after the first step of a transformer, the one that breaks up the text and associates each token with a vector, the vector that's associated with mole would be the same in all three of these cases, because this initial token embedding is effectively a lookup table with no reference to the context. It's only in the next step of the transformer that the surrounding embeddings have the chance to pass information into this one.
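
To make that concrete, here is a minimal sketch in Python of what a context-free lookup table means. The sizes and the random table are purely illustrative stand-ins for the learned embedding matrix, not the model's actual code, and positional information (mentioned a bit later) gets added on top of this lookup.

    import numpy as np

    vocab_size, d_embed = 1000, 48   # toy sizes; GPT-3 uses a ~50,000-token vocabulary and d_embed = 12,288
    embedding_table = np.random.randn(vocab_size, d_embed)  # learned during training; random here

    def embed(token_ids):
        # A pure lookup: the same token id always maps to the same vector,
        # no matter what words surround it.
        return embedding_table[token_ids]

    # So "mole" gets the identical starting vector in all three example phrases.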

The picture you might have in mind is that there are multiple distinct directions in this embedding space, encoding the multiple distinct meanings of the word mole, and that a well-trained attention block calculates what you need to add to the generic embedding to move it to one of these more specific directions as a function of the context. To take another example, consider the embedding of the word tower. This is presumably some very generic, non-specific direction in the space, associated with lots of other large tall nouns.

If this word was immediately preceded by Eiffel, you could imagine wanting the mechanism to update this vector so that it points in a direction that more specifically encodes the Eiffel tower, maybe correlated with vectors associated with Paris and France and things made of steel. If it was also preceded by the word miniature, then the vector should be updated even further so that it no longer correlates with large tall things. More generally than just refining the meaning of a word, the attention block allows the model to move information encoded in one embedding to that of another, potentially ones that are quite far away, and potentially with information that's much richer than just a single word.

What we saw in the last chapter was how, after all of the vectors flow through the network, including many different attention blocks, the computation that you perform to produce a prediction of the next token is entirely a function of the last vector in the sequence. So imagine, for example, that the text you input is most of an entire mystery novel, all the way up to a point near the end, which reads, "therefore the murderer was." If the model is going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word "was," will have to have been updated by all of the attention blocks to represent much, much more than any individual word, somehow encoding all of the information from the full context window that's relevant to predicting the next word.

To step through the computations, though, let's take a much simpler example. Imagine that the input includes the phrase "a fluffy blue creature roamed the verdant forest," and for the moment, suppose that the only type of update that we care about is having the adjectives adjust the meanings of their corresponding nouns. What I'm about to describe is what we would call a single head of attention, and later we will see how the attention block consists of many different heads run in parallel. Again, the initial embedding for each word is some high-dimensional vector that only encodes the meaning of that particular word with no context. Actually, that's not quite true. They also encode the position of the word. There's a lot more to say about the specific way the positions are encoded, but right now, all you need to know is that the entries of this vector are enough to tell you both what the word is and where it exists in the context. Let's go ahead and denote these embeddings with the letter E.

The goal is to have a series of computations produce a new refined set of embeddings, where, for example, those corresponding to the nouns have ingested the meaning from their corresponding adjectives. And playing the deep learning game, we want most of the computations involved to look like matrix-vector products, where the matrices are full of tunable weights, things that the model will learn based on data. To be clear, I'm making up this example of adjectives updating nouns just to illustrate the type of behavior that you could imagine an attention head doing. As with so much of deep learning, the true behavior is much harder to parse, because it's based on tweaking and tuning a huge number of parameters to minimize some cost function. It's just that, as we step through all of the different matrices filled with parameters that are involved in this process, I think it's really helpful to have an imagined example of something that it could be doing to help keep it all more concrete.

For the first step of this process, you might imagine each noun, like creature, asking the question, "Hey, are there any adjectives sitting in front of me?" And for the words fluffy and blue, to each be able to answer, "Yeah, I'm an adjective and I'm in that position." That question is somehow encoded as yet another vector, another list of numbers, which we call the query for this word. This query vector, though, has a much smaller dimension than the embedding vector, say 128. Computing this query looks like taking a certain matrix, which I'll label WQ, and multiplying it by the embedding. Compressing things a bit, let's write that query vector as Q, and then anytime you see me put a matrix next to an arrow like this one, it's meant to represent that multiplying this matrix by the vector at the arrow's start gives you the vector at the arrow's end. In this case, you multiply this matrix by all of the embeddings in the context, producing one query vector for each token. The entries of this matrix are parameters of the model, which means the true behavior is learned from data, and in practice, what this matrix does in a particular attention head is challenging to parse. But for our sake, imagining an example that we might hope that it would learn, we'll suppose that this query matrix maps the embeddings of nouns to certain directions in this smaller query space that somehow encode the notion of looking for adjectives in preceding positions. As to what it does to other embeddings, who knows? Maybe it simultaneously tries to accomplish some other goal with those; right now, we're laser-focused on the nouns.
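
In code, that step is just a small matrix applied to every embedding. Below is a rough NumPy sketch; the dimensions are toy-sized stand-ins for the 12,288 and 128 quoted later, and W_Q is a random placeholder for the learned query matrix.

    import numpy as np

    n_tokens, d_embed, d_key = 8, 48, 8    # toy sizes; GPT-3 uses d_embed = 12,288 and d_key = 128
    E = np.random.randn(n_tokens, d_embed)     # one position-aware embedding per token
    W_Q = np.random.randn(d_key, d_embed)      # the query matrix (learned in practice, random here)

    Q = E @ W_Q.T    # one query vector of length d_key per token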

At the same time, associated with this is a second matrix, called the key matrix, which you also multiply by every one of the embeddings. This produces a second sequence of vectors that we call the keys. Conceptually, you want to think of the keys as potentially answering the queries. This key matrix is also full of tunable parameters, and just like the query matrix, it maps the embedding vectors to that same smaller dimensional space. You think of the keys as matching the queries whenever they closely align with each other. In our example, you would imagine that the key matrix maps the adjectives, like fluffy and blue, to vectors that are closely aligned with the query produced by the word creature. To measure how well each key matches each query, you compute a dot product between each possible key query pair. I like to visualize a grid full of a bunch of dots, where the bigger dots correspond to the larger dot products, the places where the keys and queries align.
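
Continuing the same sketch, the key matrix works identically, and the grid of dots is every key dotted with every query, with rows indexed by keys and columns by queries to match the picture described here.

    W_K = np.random.randn(d_key, d_embed)   # the key matrix (learned in practice, random here)
    K = E @ W_K.T                           # one key vector per token

    grid = K @ Q.T   # grid[i, j] = dot(key_i, query_j): how well token i's key matches token j's query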

For our adjective-noun example, that would look a little more like this, where if the keys produced by fluffy and blue really do align closely with the query produced by creature, then the dot products in these two spots would be some large positive numbers. In the lingo, machine learning people would say that this means the embeddings of fluffy and blue attend to the embedding of creature. By contrast, the dot product between the key for some other word, like "the," and the query for creature would be some small or negative value that reflects that these are unrelated to each other. So we have this grid of values that can be any real number from negative infinity to infinity, giving us a score for how relevant each word is to updating the meaning of every other word.

The way we're about to use these scores is to take a certain weighted sum along each column, weighted by the relevance. So instead of having values range from negative infinity to infinity, what we want is for the numbers in these columns to be between 0 and 1, and for each column to add up to 1, as if they were a probability distribution. If you're coming in from the last chapter, you know what we need to do then. We compute a softmax along each one of these columns to normalize the values. In our picture, after you apply softmax to all of the columns, we'll fill in the grid with these normalized values. At this point, you're safe to think about each column as giving weights according to how relevant the word on the left is to the corresponding value at the top. We call this grid an attention pattern.
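
In the running sketch, that normalization is a softmax applied down each column of the grid:

    def softmax_columns(x):
        # Subtract each column's max for numerical stability, exponentiate,
        # then divide so that every column sums to 1.
        x = x - x.max(axis=0, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=0, keepdims=True)

    attention_pattern = softmax_columns(grid)   # each column is now a probability distribution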

Now if you look at the original transformer paper, there's a really compact way that they write this all down. Here, the variables Q and K represent the full arrays of query and key vectors respectively, those little vectors you get by multiplying the embeddings by the query and the key matrices. This expression up in the numerator is a really compact way to represent the grid of all possible dot products between pairs of keys and queries. A small technical detail that I didn't mention is that for numerical stability, it happens to be helpful to divide all of these values by the square root of the dimension in that key query space. Then this softmax that's wrapped around the full expression is meant to be understood to apply column by column. As to that V term, we'll talk about it in just a second.
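
For reference, the expression from the paper is

    Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V

where d_k is the dimension of the key-query space. Note that with the paper's convention, where the rows of Q are the queries, the softmax runs along each row; the column-by-column picture described here is the same grid, just drawn transposed.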

Before that, there's one other technical detail that so far I've skipped. During the training process, when you run this model on a given text example, all of the weights are slightly adjusted and tuned to either reward or punish it based on how high a probability it assigns to the true next word in the passage. It turns out to make the whole training process a lot more efficient if you simultaneously have it predict every possible next token following each initial subsequence of tokens in this passage. For example, with the phrase that we've been focusing on, it might also be predicting what words follow creature and what words follow the. This is really nice because it means what would otherwise be a single training example effectively acts as many.

For the purposes of our attention pattern, it means that you never want to allow later words to influence earlier words, since otherwise they could kind of give away the answer for what comes next. What this means is that we want all of these spots here, the ones representing later tokens, influencing earlier ones, to somehow be forced to be zero. The simplest thing you might think to do is to set them equal to zero, but if you did that, the columns wouldn't add up to one anymore, they wouldn't be normalized. So instead, a common way to do this is that before applying softmax, you set all of those entries to be negative infinity. If you do that, then after applying softmax, all of those get turned into zero, but the columns stay normalized. This process is called masking. There are versions of attention where you don't apply it, but in our GPT example, even though this is more relevant during the training phase than it would be, say, running it as a chatbot or something like that, you do always apply this masking to prevent later tokens from influencing earlier ones.
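
In the running sketch, masking amounts to overwriting the forbidden entries before the softmax (here also including the scaling by the square root of the key-query dimension mentioned above):

    # Entry (i, j) lets token i's key influence token j's query, so any entry with
    # key position i later than query position j must be blocked.
    mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool), k=-1)   # True where i > j
    masked_grid = np.where(mask, -np.inf, grid / np.sqrt(d_key))
    attention_pattern = softmax_columns(masked_grid)   # masked entries become exactly 0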

Another fact that's worth reflecting on about this attention pattern is how its size is equal to the square of the context size. So this is why context size can be a really huge bottleneck for large language models, and scaling it up is non-trivial. As you might imagine, motivated by a desire for bigger and bigger context windows, recent years have seen some variations to the attention mechanism aimed at making context more scalable, but right here you and I are staying focused on the basics. Okay, great. Computing this pattern lets the model deduce which words are relevant to which other words. Now you need to actually update the embeddings, allowing words to pass information to whichever other words they're relevant to.

For example, you want the embedding of fluffy to somehow cause a change to creature that moves it to a different part of this 12,288-dimensional embedding space, one that more specifically encodes a fluffy creature. What I'm going to do here is first show you the most straightforward way that you could do this, though there's a slight way that this gets modified in the context of multi-headed attention. This most straightforward way would be to use a third matrix, what we call the value matrix, which you multiply by the embedding of that first word, for example, fluffy. The result of this is what you would call a value vector, and this is something that you add to the embedding of the second word, in this case something you add to the embedding of creature.

So this value vector lives in the same very high-dimensional space as the embeddings. When you multiply this value matrix by the embedding of a word, you might think of it as saying, if this word is relevant to adjusting the meaning of something else, what exactly should be added to the embedding of that something else in order to reflect this? Looking back in our diagram, let's set aside all of the keys and the queries, since after you compute the attention pattern you're done with those. Then you're going to take this value matrix and multiply it by every one of those embeddings to produce a sequence of value vectors. You might think of these value vectors as being kind of associated with the corresponding keys.

For each column in this diagram, you multiply each of the value vectors by the corresponding weight in that column. For example, here, under the embedding of creature, you would be adding large proportions of the value vectors for fluffy and blue, while all of the other value vectors get zeroed out, or at least nearly zeroed out. And then finally, to actually update the embedding associated with this column, which previously encoded some context-free meaning of creature, you add together all of these rescaled values in the column, producing a change that you want to add, which I'll label delta E, and then you add that to the original embedding. Hopefully, what results is a more refined vector encoding the more contextually rich meaning, like that of a fluffy blue creature.
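
Here is the same idea in the running sketch, using a single full-size value matrix for now; the factored version discussed further down changes the parameter count, not this picture.

    W_V = np.random.randn(d_embed, d_embed)   # full-size value matrix (learned in practice, random here)
    V = E @ W_V.T                             # one value vector per token, back in embedding space

    # Column j of the attention pattern weights the value vectors for token j;
    # delta_E[j] is the resulting change proposed for that token's embedding.
    delta_E = attention_pattern.T @ V
    E_refined = E + delta_E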

And of course, you don't just do this to one embedding, you apply the same weighted sum across all of the columns in this picture, producing a sequence of changes. Adding all of those changes to the corresponding embeddings produces a full sequence of more refined embeddings popping out of the attention block. Zooming out, this whole process is what you would describe as a single head of attention. As I've described things so far, this process is parameterized by three distinct matrices, all filled with tunable parameters, the key, the query, and the value. I want to take a moment to continue what we started in the last chapter with a scorekeeping where we count up the total number of model parameters, using the numbers from GPT-3.

These key and query matrices each have 12,288 columns, matching the embedding dimension, and 128 rows, matching the dimension of that smaller key query space. This gives us an additional 1.5 million or so parameters for each one. If you look at that value matrix by contrast, the way I've described things so far would suggest that it's a square matrix that has 12,288 columns and 12,288 rows. Since both its inputs and its outputs live in this very large embedding space. If true, that would mean about 150 million added parameters. And to be clear, you could do that. You could devote orders of magnitude more parameters to the value map than to the key and query.
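
Spelled out, using the numbers quoted here:

    12,288 columns x 128 rows     =   1,572,864  ≈ 1.5 million parameters each for the key and query matrices
    12,288 columns x 12,288 rows  = 150,994,944  ≈ 150 million parameters for a full square value matrix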

But in practice, it is much more efficient if instead you make it so that the number of parameters devoted to this value map is the same as the number devoted to the key and the query. This is especially relevant in the setting of running multiple attention heads in parallel. The way this looks is that the value map is factored as a product of two smaller matrices. Conceptually, I would still encourage you to think about the overall linear map, one with inputs and outputs, both in this larger embedding space. For example, taking the embedding of blue to this blueness direction that you would add to nouns. It's just that it's broken up into two separate steps.

The first matrix on the right here has a smaller number of rows, typically the same size as the key query space. What this means is you can think of it as mapping the large embedding vectors down to a much smaller space. This is not the conventional naming, but I'm going to call this the value down matrix. The second matrix maps from this smaller space back up to the embedding space, producing the vectors that you use to make the actual updates. I'm going to call this one the value up matrix, which again is not conventional. The way that you would see this written in most papers looks a little different.
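
In the sketch, the factored version replaces the single square matrix with the two smaller ones; the names value_down and value_up below follow this video's non-standard labels.

    W_value_down = np.random.randn(d_key, d_embed)   # projects embeddings down to the small space
    W_value_up   = np.random.randn(d_embed, d_key)   # maps back up to embedding space

    # Still one linear map from embedding space to itself, but constrained
    # to have rank at most d_key:
    W_V_factored = W_value_up @ W_value_down
    V = E @ W_V_factored.T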

I'll talk about it in a minute; in my opinion, it tends to make things a little more conceptually confusing. To throw in some linear algebra jargon here, what we're basically doing is constraining the overall value map to be a low-rank transformation. Turning back to the parameter count, all four of these matrices have the same size, and adding them all up, we get about 6.3 million parameters for one attention head. As a quick side note, to be a little more accurate, everything described so far is what people would call a self-attention head, to distinguish it from a variation that comes up in other models that's called cross-attention.

This isn't relevant to our GPT example, but if you're curious, cross-attention involves models that process two distinct types of data, like text in one language and text in another language that's part of an ongoing generation of a translation, or maybe audio input of speech and an ongoing transcription. A cross-attention head looks almost identical. The only difference is that the key and query maps act on different data sets. In a model-doing translation, for example, the keys might come from one language, while the queries come from another, and the attention pattern could describe which words from one language correspond to which words in another.

And in this setting, there would typically be no masking, since there's not really any notion of later tokens affecting earlier ones. Staying focused on self-attention, though, if you understood everything so far, and if you were to stop here, you would come away with the essence of what attention really is. All that's really left to us is to lay out the sense in which you do this many, many different times. In our central example, we focused on adjectives updating nouns, but of course, there are lots of different ways that context can influence the meaning of a word.

If the words "they crashed the" preceded the word "car," it has implications for the shape and the structure of that car. And a lot of associations might be less grammatical. If the word wizard is anywhere in the same passage as Harry, it suggests that this might be referring to Harry Potter, whereas if instead the words Queen, Sussex, and William were in that passage, then perhaps the embedding of Harry should instead be updated to refer to the prince. For every different type of contextual updating that you might imagine, the parameters of these key and query matrices would be different to capture the different attention patterns, and the parameters of our value map would be different based on what should be added to the embeddings.

And again, in practice, the true behavior of these maps is much more difficult to interpret, where the weights are set to do whatever the model needs them to do to best accomplish its goal of predicting the next token. As I said before, everything we described is a single head of attention, and a full attention block inside a transformer consists of what's called multi-headed attention, where you run a lot of these operations in parallel, each with its own distinct key query and value maps. GPT-3, for example, uses 96 attention heads inside each block. Considering that each one is already a bit confusing, it's certainly a lot to hold in your head. Just to spell it all out very explicitly, this means you have 96 distinct key and query matrices, producing 96 distinct attention patterns.

Then each head has its own distinct value matrices, used to produce 96 sequences of value vectors. These are all added together, using the corresponding attention patterns as weights. What this means is that for each position in the context, each token, every one of these heads produces a proposed change to be added to the embedding in that position. So what you do is you sum together all of those proposed changes, one for each head, and you add the result to the original embedding of that position. This entire sum here would be one slice of what's outputted from this multi-headed attention block, a single one of those refined embeddings that pops out the other end of it.
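
Putting the pieces of the sketch together, a multi-headed block just runs many independent copies of the single-head computation and sums their proposed changes. This is only a schematic, not an efficient implementation, and `heads` below is a hypothetical list holding each head's four parameter matrices.

    def attention_head(E, W_Q, W_K, W_value_down, W_value_up):
        Q = E @ W_Q.T
        K = E @ W_K.T
        grid = (K @ Q.T) / np.sqrt(d_key)
        mask = np.tril(np.ones(grid.shape, dtype=bool), k=-1)
        pattern = softmax_columns(np.where(mask, -np.inf, grid))
        V = E @ (W_value_up @ W_value_down).T
        return pattern.T @ V   # this head's proposed change to every embedding

    # 96 heads in GPT-3; each contributes its own delta, and they are all summed.
    E_out = E + sum(attention_head(E, *params) for params in heads)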

Again, this is a lot to think about, so don't worry at all if it takes some time to sink in. The overall idea is that by running many distinct heads in parallel, you're giving the model the capacity to learn many distinct ways that context changes meaning. Pulling up our running tally for parameter count with 96 heads, each including its own variation of these four matrices, each block of multi-headed attention ends up with around 600 million parameters. There's one added slightly annoying thing that I should really mention for any of you who go on to read more about transformers. You remember how I said that the value map is factored out into these two distinct matrices, which I labeled as the value down and the value up matrices. The way that I framed things would suggest that you see this pair of matrices inside each attention head, and you could absolutely implement it this way. That would be a valid design.

But the way that you see this written in papers and the way that it's implemented in practice looks a little different. All of these value-up matrices for each head appear stapled together in one giant matrix that we call the output matrix, associated with the entire multi-headed attention block. And when you see people refer to the value matrix for a given attention head, they're typically only referring to this first step, the one that I was labeling as the value-down projection into the smaller space. For the curious among you, I've left an on-screen note about it. It's one of those details that runs the risk of distracting from the main conceptual points, but I do want to call it out just so that you know if you read about this in other sources.

Setting aside all the technical nuances, in the preview from the last chapter, we saw how data flowing through a transformer doesn't just flow through a single attention block. For one thing, it also goes through these other operations called multi-layer perceptrons. We'll talk more about those in the next chapter. And then it repeatedly goes through many, many copies of both of these operations. What this means is that after a given word imbibes some of its context, there are many more chances for this more nuanced embedding to be influenced by its more nuanced surroundings. The further down the network you go, with each embedding taking in more and more meaning from all the other embeddings, which themselves are getting more and more nuanced, the hope is that there's the capacity to encode higher level and more abstract ideas about a given input beyond just descriptors and grammatical structure. Things like sentiment and tone and whether it's a poem and what underlying scientific truths are relevant to the piece and things like that.

Turning back one more time to our scorekeeping, GPT-3 includes 96 distinct layers, so the total number of key query and value parameters is multiplied by another 96, which brings the total sum to just under 58 billion distinct parameters devoted to all of the attention heads. That is a lot to be sure, but it's only about a third of the 175 billion that are in the network in total. So even though attention gets all of the attention, the majority of parameters come from the blocks sitting in between these steps.
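
The running tally, spelled out with the numbers above:

    12,288 x 128 x 4 matrices   ≈   6.3 million parameters per head
    6.3 million x 96 heads      ≈ 600 million parameters per attention block
    600 million x 96 layers     ≈  58 billion parameters across all attention heads
    (compared with 175 billion parameters in GPT-3 overall, i.e. roughly a third)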

In the next chapter, you and I will talk more about those other blocks and also a lot more about the training process. A big part of the story for the success of the attention mechanism is not so much any specific kind of behavior that it enables, but the fact that it's extremely parallelizable, meaning that you can run a huge number of computations in a short time using GPUs. Given that one of the big lessons about deep learning in the last decade or two has been that scale alone seems to give huge qualitative improvements in model performance, there's a huge advantage to parallelizable architectures that let you do this.

If you want to learn more about this stuff, I've left lots of links in the description. In particular, anything produced by Andrej Karpathy or Chris Olah tends to be pure gold. In this video, I wanted to just jump into attention in its current form, but if you're curious about more of the history for how we got here and how you might reinvent this idea for yourself, my friend Vivek just put up a couple videos giving a lot more of that motivation. Also, Brit Cruse from the channel The Art of the Problem has a really nice video about the history of large language models.