I'm Craig Smith and this is Eye on AI. This week I talk to Ilya Sutskever, a co-founder and chief scientist of OpenAI and one of the primary minds behind the large language model GPT-3 and its public progeny ChatGPT, which I don't think it's an exaggeration to say is changing the world.
This isn't the first time Ilya has changed the world. Geoff Hinton has said he was the main impetus for AlexNet, the convolutional neural network whose dramatic performance stunned the scientific community in 2012 and set off the deep learning revolution.
As is often the case in these conversations, we assume a lot of knowledge on the part of listeners, primarily because I don't want to waste the limited time I have to speak to people like Ilya explaining concepts, people, or events that can easily be Googled, or, I should say, that ChatGPT can explain for you.
The conversation with Ilya follows a conversation with Yann LeCun in a previous episode, so if you haven't listened to that episode, I encourage you to do so. Meanwhile, I hope you enjoy the conversation with Ilya as much as I did.
Ilya, it's terrific to meet you and to talk to you. I've watched many of your talks online and read many of your papers. Can you start just by introducing yourself, a little bit of your background? I know you were born in Russia, where you were educated. What got you interested in computer science, if that was the initial impulse, or brain science, neuroscience, or whatever it was, and then I'll start asking questions.
Yeah, I can talk about that a little bit. So yes, I was indeed born in Russia, I grew up in Israel, and then as a teenager my family immigrated to Canada.
My parents say I was interested in AI from a pretty early age. I also was very motivated by consciousness; I was very disturbed by it, and I was curious about things that could help me understand it better, and AI seemed like a good angle there. So I think these were some of the things that got me started. I actually started working with Geoff Hinton very early, when I was 17. We moved to Canada and I immediately was able to join the University of Toronto, and I really wanted to do machine learning, because that seemed like the most important aspect of artificial intelligence that at the time was completely inaccessible.
To give some context, the year was 2003. Today we take it for granted that computers can learn, but in 2003 we took it for granted that computers can't learn. The biggest achievement of AI back then, I think, was Deep Blue, the chess-playing engine. But there, it was like: you have this game, and you have this search, and you have this simple way of determining if one position is better than another, and it really did not feel like that could possibly be applicable to the real world, because there was no learning. Learning was this big mystery, and so I was really, really interested in learning. To my great luck, Geoff Hinton was a professor at the university I was in, and so I was able to find him, and we began working together almost right away.
And was your impulse, as it was for Geoff, to understand how the brain worked, or was it more that you were simply interested in the idea of machines learning? AI is so big, and so the motivations were just as many. Like, it is interesting: how does intelligence work at all? Right now we have quite a bit of an idea; it's a big neural net, and we know how it works to some degree. But back then, although neural nets were around, no one knew that neural nets were good for anything.
So, how does intelligence work at all? How can we make computers be even slightly intelligent? And I had a very explicit intention to make a very small but real contribution to AI, because there were lots of contributions to AI which weren't real, which I could tell for various reasons weren't real, that nothing would come out of them, and I just thought nothing works at all, AI is a hopeless field. So the motivation was: could I understand how intelligence works, and also make a contribution towards it? That was my initial early motivation.
That's 2003, almost exactly 20 years ago. And then AlexNet. I've spoken to Geoff, and he said that it was really your excitement about the breakthroughs in convolutional neural networks that led you to enter the ImageNet competition, and that Alex had the coding skills to train the network. Can you talk just a little bit about that? I don't want to get bogged down in history, but it's fascinating.
So, in a nutshell, I had the realization that if you train a large neural network on a large, sorry, large and deep, because back then the deep part was still new, if you train a large and deep neural network on a big enough data set that specifies some complicated task that people do, such as vision, but also others, and you just train that neural network, then you will succeed, necessarily. And the logic for it was irreducible: we know that the human brain can solve these tasks, and can solve them quickly, and the human brain is just a neural network with slow neurons.
So we know that some neural network can do it really well. Then you just need to take a smaller but related neural network and train it on data, and the best neural network inside the computer will be related to the neural network that we have that performs this task. So it was an argument that a large and deep neural network can solve the task, and furthermore, we had the tools to train it; that was the result of the technical work done in Geoff's lab.
So you combine the two: we can train those neural networks; it needs to be big enough so that if you trained it, it would work well; and you need the data, which can specify the solution. And with ImageNet, all the ingredients were there. Alex had these very fast convolutional kernels, ImageNet had a large enough data set, and there was a real opportunity to do something totally unprecedented, and it totally worked out. Yeah. That was supervised learning, with convolutional neural nets.
In 2017, the "Attention Is All You Need" paper came out, introducing self-attention and transformers. At what point did the GPT project start? Was there some intuition about transformers and self-supervised learning? Can you talk about that? So, for context: at OpenAI, from the earliest days, we were exploring the idea that predicting the next thing is all you need.
We were exploring it with the much more limited neural networks of the time, but the hope was that if you have a neural network that can predict the next word, the next pixel, really it's about compression; prediction is compression. And predicting the next word is, let's see, let me think about the best way to explain it, because there were many things going on and they were all related. Maybe I'll take a different direction. We were indeed interested in trying to understand how far predicting the next word is going to go, and whether it will solve unsupervised learning.
So, back before the GPTs, unsupervised learning was considered to be the holy grail of machine learning. Now it's just been fully solved, and no one even talks about it, but it was a holy grail. It was very mysterious, and so we were exploring the idea. I was really excited about it: that predicting the next word well enough is going to give you unsupervised learning. If it will learn everything about the data set, that's going to be great. But our neural networks were not up for the task. We were using recurrent neural networks.
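To make the next-word-prediction objective concrete, here is a minimal sketch in Python. It stands in a toy count-based bigram model for the neural networks being discussed; the corpus and names are purely illustrative. It also shows the compression connection Ilya mentions: the objective is the number of bits needed to encode the text, so better prediction means better compression.

```python
# A toy sketch of "predict the next word" as a training objective, using a
# count-based bigram model instead of a neural network. The average negative
# log-probability (bits per word) is exactly the cost of compressing the text
# with this model, so better prediction is better compression.
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(prev):
    counts = follows[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# Cross-entropy of the model on the corpus, in bits per word.
bits = 0.0
for prev, nxt in zip(corpus, corpus[1:]):
    bits -= math.log2(next_word_probs(prev)[nxt])
print(f"bits per word: {bits / (len(corpus) - 1):.3f}")
```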
When the transformer came out, literally as soon as the paper came out, literally the next day, it was clear to me, to us, that transformers addressed the limitations of recurrent neural networks, of learning long-term dependencies. It's a technical thing, but we switched to transformers right away. And so the very nascent GPT effort continued, then with a transformer. It started to work better, and you make it bigger, and then you realize you need to keep making it bigger, and we did, and that's what led eventually to GPT-3 and essentially to where we are today.
Yeah. And I just wanted to ask, actually, I'm getting caught up in this history, but I'm so interested in it. I want to get to the problems or the shortcomings of large language models, or large models generally, but Rich Sutton had been writing about scaling, and how that's all we need to do: we don't need new algorithms, we just need to scale. Did he have an influence on you, or was that a parallel track of thinking?
No, I would say that when he posted his article, we were very pleased to see some external people thinking along similar lines, and we thought it was very eloquently articulated. But I actually think that the Bitter Lesson, as articulated, overstates the case, or at least the takeaway that people have taken from it overstates the case.
The takeaway that people have is: it doesn't matter what you do, just scale. But that's not exactly true. You've got to scale something specific; you've got to have something that will be able to benefit from the scale.
The great breakthrough of deep learning is that it provides us with the first-ever way of productively using scale and getting something out of it in return. Before that, what would people use large computer clusters for?
I guess they would use them for weather simulations or physics simulations or something, but that's about it. Maybe moviemaking. But no one had any real need for computer clusters, because what do you do with them? The fact that deep neural networks, when you make them larger and train them on more data, work better, provided us with the first thing that is interesting to scale. But perhaps one day we will discover that there is some little twist on the thing that we scale that's going to be even better to scale.
Now, how big of a twist? And then, of course, with the benefit of hindsight you will say, does it even count? It's such a simple change. But I think the true statement is that it matters what you scale. Right now we've just found a thing to scale that gives us something in return.
The limitation of large language models, it's said, is that their knowledge is contained in the language that they're trained on, and most human knowledge, I think everyone agrees, is non-linguistic. I'm not sure Noam Chomsky agrees. But there's a problem with large language models, as I understand it: their objective is to satisfy the statistical consistency of the prompt.
They don't have an underlying understanding of the reality that language relates to. I asked ChatGPT about myself. It recognized that I'm a journalist, that I've worked at these various newspapers, but it went on and on about awards that I've never won, and it all read beautifully, but none of it connected to the underlying reality. Is there something being done to address that in your research going forward?
Yeah, so before I comment on the immediate question that you ask, I want to comment on some of the earlier parts of the question. Sure. I think that it is very hard to talk about the limits, or limitations rather, of even something like a language model, because two years ago people confidently spoke about their limitations, and they were entirely different.
Right, so it's important to keep this context in mind: how confident are we that these limitations that we see today will still be with us two years from now? I am not that confident. There is another comment I want to make about one part of the question, which is that these models just learn statistical regularities, and therefore they don't really know what the nature of the world is. I have a view that differs from this.
In other words, I think that learning the statistical regularities is a far bigger deal than meets the eye. The reason we don't initially think so, at least most people, those who haven't really spent a lot of time with neural networks, is that these models are on some level statistical.
Like, what is a statistical model? You just fit some parameters; what is really happening? But I think there is a better interpretation, to the earlier point of prediction as compression. Prediction is also a statistical phenomenon, yet to predict, you eventually need to understand the true underlying process that produced the data. To predict the data well, to compress it well, you need to understand more and more about the world that produced the data, and our generative models are becoming extraordinarily good.
They will have, I claim, a shocking degree of understanding of the world and many of its subtleties. But it's not just the world; it is the world as seen through the lens of text. It tries to learn more and more about the world through a projection of the world onto the space of text, as expressed by human beings on the internet. But still, this text already expresses the world. And I'll give you an example, a recent example, which I think is really telling and fascinating.
So we've all heard of Sydney, Bing's alter ego, and I've seen this really interesting interaction with Sydney, where Sydney became combative and aggressive when the user told it that it thinks Google is a better search engine than Bing. Now, what is a good way to think about this phenomenon?
What's a good language to use? What does it mean? You can say, well, it's just predicting what people would do, and people would do this. That's true, but maybe we're now reaching a point where the language of psychology is starting to be appropriate for understanding the behavior of these neural networks. Now let's talk about the limitations.
It is indeed the case that these neural networks have a tendency to hallucinate, but that's because a language model is great for learning about the world, and it is a little bit less great for producing good outcomes. And there are various technical reasons for that, which I could elaborate on if you think it's useful, but right now, at this second, I will skip that.
There are technical reasons why a language model is much better at learning about the world, learning incredible representations of ideas, of concepts, of people, of processes that exist, but its outputs aren't quite as good as one would hope, or rather as good as they could be, which is why, for example, a system like ChatGPT is a language model that has an additional reinforcement learning training process.
We call it reinforcement learning from human feedback, but the thing to understand about that process is this. We can say that in the pre-training process, when you just train a language model, you want it to learn everything about the world. Then, with the reinforcement learning from human feedback, we care about the outputs. Now we say: any time the output is inappropriate, don't do this again; every time the output does not make sense, don't do this again. And it learns quickly to produce good outputs. But now it's at the level of the outputs, which is not the case during pre-training, during the language model training process.
Now, on the point of hallucinations, and its propensity for making stuff up: indeed, it is true. Right now these neural networks, even ChatGPT, make things up from time to time, and that's something that also greatly limits their usefulness. But I'm quite hopeful that by simply improving this subsequent reinforcement-learning-from-human-feedback step, we could teach it not to hallucinate.
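The feedback step being described here can be sketched in miniature. The toy below shrinks reinforcement learning from human feedback down to a two-armed bandit over canned outputs; real RLHF fits a reward model to human preferences and fine-tunes the whole policy (commonly with PPO), so every name and constant here is illustrative, not OpenAI's pipeline. What it shows is only the shape of the signal: disapproved outputs get "don't do this again", approved ones "do more of that".

```python
# A toy sketch of the human-feedback loop: sample an output, collect a
# reward, and nudge the policy toward rewarded behavior.
import math
import random

outputs = ["grounded answer", "made-up award list"]
logits = {o: 0.0 for o in outputs}

def sample_output():
    # Softmax sampling over the current preferences.
    weights = [math.exp(logits[o]) for o in outputs]
    return random.choices(outputs, weights=weights)[0]

def human_feedback(output):
    # Stand-in for a human rater: penalize the hallucinated output.
    return 1.0 if output == "grounded answer" else -1.0

for _ in range(500):
    out = sample_output()
    reward = human_feedback(out)
    logits[out] += 0.1 * reward  # "do more of that" / "don't do this again"

print(logits)  # the hallucinating output ends up strongly disfavored
```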
Now, you could say, is it really going to learn? My answer is, let's find out. And that feedback loop is coming from the public ChatGPT interface? That if it tells me that I won a Pulitzer, which unfortunately I didn't, I can tell it that it's wrong, and will that train it, or create some punishment or reward, so that the next time I ask it, it'll be more accurate?
The way we do things today is that we hire people to teach our neural net to behave, to teach ChatGPT to behave. And right now, the precise manner in which they specify the desired behavior is a little bit different, but indeed, what you described is basically the way teaching is going to be. That's the correct way to teach it: you just interact with it, and it sees from your reaction, it infers, oh, that's not what you wanted.
You are not happy with its output, therefore the output was not good, and it should do something different next time. So in particular, hallucinations come up as one of the bigger issues, and we'll see, but I think there is quite a high chance that this approach will be able to address them completely.
I wanted to talk to you about Yann LeCun's work on joint embedding predictive architectures, and his idea that what's missing from large language models is this underlying world model, non-linguistic, that the language model can refer to.
It's not something that's been built yet, but I wanted to hear what you thought of that, and whether you've explored it at all. So, I reviewed Yann LeCun's proposal, and there are a number of ideas there, and they're expressed in different language, and there are some maybe small differences from the current paradigm, but to my mind they are not very significant, and I'd like to elaborate.
The first claim is that it is desirable for a system to have multimodal understanding, where it doesn't just know about the world from text. And my comment on that will be that indeed, multimodal understanding is desirable, because you learn more about the world.
You learn more about people, you learn more about their condition, and so the system will be able to better understand the task it's supposed to solve, and the people, and what they want. We have done quite a bit of work on that, most notably in the form of two major neural nets that we've done: one is called CLIP and one is called DALL-E, and both of them move towards this multimodal direction. But I also want to say that I don't see the situation as a binary either-or, that if you don't have vision, if you don't understand the world visually or from video, then things will not work. And I'd like to make the case for that.
So I think that some things are much easier to learn from images and diagrams and so on, but I claim that you can still learn them from text only, just more slowly. And I'll give you an example: consider the notion of color.
Surely one cannot learn the notion of color from text only, and yet, when you look at the embeddings, and let me make a small detour to explain the concept of an embedding.
Every neural network represents words, sentences, concepts through representations, embeddings: high-dimensional vectors. And one thing that we can do is look at those high-dimensional vectors and see what's similar to what; how does the network see this concept or that concept? And so we can look at the embeddings of colors, and the embeddings of colors happen to be exactly right. It knows that purple is more similar to blue than to red, and it knows that purple is less similar to red than orange is. It knows all those things just from text. How can that be? If you have vision, the distinctions between colors just jump at you; you immediately perceive them. Whereas with text it takes you longer: maybe you know how to talk, and you already understand syntax and words and grammar, and only much later do the colors actually start to make sense to you. So this would be my point about the necessity of multimodality: I claim it is not necessary, but it is most definitely useful.
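This color-embedding claim is easy to check against publicly available text-only embeddings. The sketch below does so with GloVe vectors loaded through gensim; the choice of gensim and the glove-wiki-gigaword-50 vectors is an assumption for illustration, not what OpenAI used, and any word-embedding model trained on text alone would serve.

```python
# A sketch checking the color-embedding claim with text-only word vectors.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # trained on text alone

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

purple, blue, red, orange = (vectors[w] for w in ("purple", "blue", "red", "orange"))
print("purple ~ blue :", cosine(purple, blue))
print("purple ~ red  :", cosine(purple, red))
print("orange ~ red  :", cosine(orange, red))
# If the claim holds: purple is closer to blue than to red, and orange is
# closer to red than purple is.
```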
I think it's a good direction to pursue; I just don't see it in such stark either-or claims. So, the proposal in the paper makes a claim that one of the big challenges is predicting high-dimensional vectors which have uncertainty about them, so for example predicting an image. The paper makes a very strong claim there that it's a major challenge, and that we need to use a particular approach to address it. But one thing which I found surprising, or at least unacknowledged in the paper, is that the current autoregressive transformers already have that property.
I'll give you two examples. One is: given one page in a book, predict the next page in a book. There could be so many possible pages that follow; it's a very complicated, high-dimensional space, and they deal with it just fine. The same applies to images; these autoregressive transformers work perfectly on images. For example, at OpenAI we've done work on iGPT: we just took a transformer and applied it to pixels, and it worked super well, and it could generate images in very complicated and subtle ways. It had very beautiful unsupervised representation learning. With DALL-E 1, same thing again: you just generate, think of it, with large pixels; rather than generating a million pixels, we cluster the pixels into large pixels and generate a thousand large pixels. And Google's work on image generation from earlier this year, called Parti, I believe also takes a similar approach.
So the part where I thought that the paper made a strong comment, around how the current approaches can't deal with predicting high-dimensional distributions: I think they definitely can. So maybe this is another point I would make. And what you're talking about, converting pixels into vectors, is essentially turning everything into language; the vector is like a string of text, right? You define a language, though; you turn it into a sequence. Yeah, a sequence. Of what? Like, you could argue that even for a human, life is a sequence of bits. Now, there are other things that people use right now, like diffusion models, where rather than producing those bits one bit at a time, they produce them in parallel. But I would argue that on some level this distinction is immaterial.
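Here is a minimal sketch of the "cluster the pixels into large pixels" idea, under stated assumptions about the details: iGPT used a 512-entry k-means color palette, while DALL-E 1 used a learned discrete VQ-VAE codebook rather than k-means; this toy just k-means-quantizes a random image with scikit-learn. The point is only that an image becomes a short sequence of discrete tokens, so the same next-token prediction machinery used for text applies unchanged.

```python
# Quantize colors into a small codebook so an image becomes a short sequence
# of discrete tokens that an autoregressive transformer can model like text.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))        # toy 32x32 RGB image
pixels = image.reshape(-1, 3).astype(float)

codebook = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
tokens = codebook.predict(pixels)                      # 1024 ints in [0, 16)
print(tokens[:20])  # the image as a token sequence; next-token prediction applies

# Decoding maps tokens back to their codebook colors (a lossy reconstruction).
recon = codebook.cluster_centers_[tokens].reshape(32, 32, 3).astype(np.uint8)
```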
I claim that on some level it doesn't really matter. It matters in that you can get a 10x efficiency gain, which is huge in practice, but conceptually, I claim it doesn't matter. On this idea of having an army of human trainers who are working with ChatGPT or a large language model to guide it, in effect, with reinforcement learning: it just intuitively doesn't sound like an efficient way of teaching a model about the underlying reality of its language. Isn't there a way of automating that? And to Yann's credit, I think that's what he's talking about: coming up with an algorithmic means of teaching a model the underlying reality without a human having to intervene. Yeah, so I have two comments on that.
In the first place, I have a different view on the question, so I wouldn't agree with the phrasing of the question. I claim that our pre-trained models already know everything they need to know about the underlying reality. They already have this knowledge of language, and also a great deal of knowledge about the processes that exist in the world that produce this language. And maybe I should reiterate this point. It's a small tangent, but I think it's so important. The thing that large generative models learn about their data, and in this case large language models about text data, is some compressed representation of the real-world processes that produced this data, which means not only people and something about their thoughts, something about their feelings, but also something about the condition that people are in and the interactions that exist between them, the different situations a person can be in. All of these are part of that compressed process that is represented by the neural net to produce the text. The better the language model, the better the generative model, the higher the fidelity, the better it captures this process. So that's the first comment. In particular, I will say the models already have the knowledge.
Now, the army of teachers, as you phrase it: indeed, when you want to build a system that performs as well as possible, you just say, okay, if this thing works, do more of that. But of course those teachers are also using AI assistance. Those teachers aren't on their own; they are working with our tools together, and they are very efficient. It's like the tools are doing the majority of the work, but you do need to have oversight; you need to have people reviewing the behavior, because you want it to eventually achieve a very high level of reliability. But overall, I'll say that in this second step, after we take the finished pre-trained model and then apply the reinforcement learning to it, there is indeed a lot of motivation to make it as efficient and as precise as possible, so that the resulting language model will be as well-behaved as possible. So yeah, there are these human teachers who are teaching the model the desired behavior; they are also using AI assistance, and the manner in which they use AI assistance is constantly increasing, so their own efficiency keeps increasing. Maybe this will be one way to answer this question. Yeah, and so what you're saying is that through this process, eventually the model will become more and more discerning, more and more accurate in its outputs.
Yes, that's right. There is an analogy here: it already knows all kinds of things, and now we just want to really say, no, this is not what we want; don't do this here; you made a mistake here in the output. And of course, it's exactly as you say, with as much AI in the loop as possible, so that the teachers who are providing the final correction to the system have their work amplified; they are working as efficiently as possible. It's not unlike an education process, how to act well in the world: we need to do additional training just to make sure that the model knows that hallucination is not okay, ever, and then once it knows that, now you are in business, I think. And it's that reinforcement learning human teacher loop that will teach it? A human teacher loop, or some other variant, but there is definitely an argument to be made that something here should work, and we'll find out pretty soon, I'm sure. That's one of the questions: where is this going? What research are you focused on right now? I can't talk in detail about the specific research that I'm working on, but I can mention some of it in broad strokes. It would be something like: I'm very interested in making those models more reliable, more controllable, making them learn faster from less data, less instruction, making them so that indeed they don't hallucinate. And I think that all this cluster of questions which I mentioned, they're all connected. There's also the question of how far in the future we're talking about, and what I commented on here is perhaps the nearer future.
You talked about the similarities between the brain and neural nets. There's a very interesting observation that Geoff Hinton made to me, surely it's not new to other people, that large models, or large language models in particular, hold a tremendous amount of data with a modest number of parameters, compared to the human brain, which has trillions and trillions of parameters but a relatively small amount of data. Have you thought of it in those terms, and can you talk about what's missing in large models? To have more parameters to handle the data, is that a hardware problem or a training problem? This comment which you made is related to one of the problems I mentioned in the earlier questions, of learning from less data. And indeed, the current structure of the technology does like a lot of data, especially early in training. Later in training it becomes a bit less data-hungry, which is why at the end it can learn, not as fast as people yet, but it can learn quite quickly. So already that means that, in some sense, do we even care that we need all this data to get to this point? But indeed, more generally, I think it will be possible to learn more from less data. I think it just requires some creative ideas, but I think it is possible, and learning more from less data will unlock a lot of different possibilities. It will allow us to teach our AIs the skills they're missing, and to convey to them our desires and preferences, exactly how we want them to behave. So I would say that faster learning is indeed very nice, and although language models, once trained, can already learn quite quickly, I think there are opportunities to do more there.
You did make a comment that we need faster processors to scale further, and it appears with the scaling of models that there's no end in sight, but the power required to train these models is reaching the limit, at least the socially accepted limit. I just want to make one comment, which is, I don't remember the exact comment I made that you're referring to, but you always want faster processors, of course; you always want more of them, of course. Power keeps going up; generally speaking, the cost is going up. And the question I would ask is not whether the cost is large, but whether the thing we get out of paying this cost outweighs the cost. Maybe you pay all this cost and you get nothing; then yeah, that's not worth it. But if you get something very useful, something very valuable, something that can solve a lot of the problems that we have, which we really want solved, then the cost can be justified. But in terms of processors, faster processors? Yeah, any day. Are you involved at all in the hardware question? Do you work with Cerebras, for example, the wafer-scale chips? No, all our hardware comes from Azure, and the GPUs that they provide. Yeah.
You did talk at one point, I saw, about democracy and about the impact that AI can have on democracy. People have talked to me about that: if you had enough data and a large enough model, you could train the model on the data and it could come up with an optimal solution that would satisfy everybody. Do you have any aspiration, or do you think about where this might lead in terms of helping humans manage society? Yeah, let's see. It's such a big question, because it's a much more future-looking question. I think that there are still many ways in which our models will become far more capable than they are right now, there's no question. In particular, although we train them and use them and so on, there are going to be a few changes here and there; they might not be immediately obvious today, but I think in hindsight it will be extremely obvious that they will indeed allow a model to have that ability, to come up with solutions to problems of this kind. It's unpredictable exactly how governments will use this technology as a source of getting advice of various kinds. To the question of democracy, one thing I think could happen in the future is that, because you have these neural nets, and they're going to be so pervasive and so impactful in society, we will find it desirable to have some kind of democratic process where, let's say, the citizens of a country provide some information to the neural net about how they'd like things to be, how they'd like it to behave, or something along these lines. I could imagine that happening. That could be a very high-bandwidth form of democracy, perhaps, where you get a lot more information out of each citizen and you aggregate it to specify how exactly you want such systems to act. Now it opens a whole lot of questions, but that's one thing that could happen in the future. Yeah, and I can see, in the democracy example you gave, that individuals would have the opportunity to input data.
But, and this sort of goes to the world model question, do you think AI systems will eventually be large enough that they can understand a situation and analyze all of the variables? But you would need a model that does more than absorb language. Well, what does it mean to analyze all the variables? Eventually there will be a choice you need to make, where you say, this variable is really important; I want to go deep. Because a person can read a book: I can read a hundred books, or I can read one book very slowly and carefully and get more out of it. So there will be some element of that. Also, I think it's probably fundamentally impossible to understand everything in some sense. Any time there is any kind of complicated situation in society, even in a company, even in a mid-size company, it's already beyond the comprehension of any single individual. And I think that if we build our AI systems the right way, AI could be incredibly helpful in pretty much any situation.

That's it for this episode. I want to thank Ilya for his time. I also want to thank Ellie George for helping arrange the interview. If you want to read a transcript of this conversation, you can find one on our website, eye-on.ai, that's e-y-e hyphen o-n dot a-i. We love to hear from listeners, so feel free to email me at craig, c-r-a-i-g, at eye-on.ai. I get a lot of emails, so put "listener" in the subject line so I don't miss it. We have listeners in 170 countries and territories. And remember, the singularity may not be near, but AI is changing your world, so pay attention.