Thank you so much for inviting me. It's such a pleasure to be talking about these things here in my own department, and it's so cool to see how many interesting things are happening right here. So I'm going to talk about keeping AI under control with mechanistic interpretability, and in particular, how I think we physicists have a great opportunity to help with this.
So first of all, why might we want to keep AI under control? Well, obviously, as we've heard this morning, because it's getting more and more powerful. We've all seen this paper from Microsoft arguing that GPT-4 is already showing sparks of artificial general intelligence. Here is Yoshua Bengio saying that we've now reached the point where there are AI systems that can behave like humans, meaning they can pass the Turing test. Now, we can debate whether or not GPT-4 passes a Turing test, but Yoshua Bengio should certainly get a vote in that debate, since he's one of the Turing Award winners, the equivalent of the Nobel Prize for AI. And this rapid progress has obviously also, as you know, started freaking a lot of people out.
Here we have his Turing Award co-winner Geoff Hinton. I'm not sure if the audio is actually going out. Is it? "Are we close to the computers coming up with their own ideas for improving themselves?" "Yes, we might be." "And then it could just go fast?" "That's an issue, right. We have to think hard about how to control that." "Yeah. Can we?" "We don't know. We haven't been there yet, but we can try." "OK. That seems kind of concerning." Yes. And then, piling on, Sam Altman, CEO of OpenAI, which of course has given us ChatGPT and GPT-4, had this to say: "And the bad case, and I think this is important to say, is like lights out for all of us." Lights out for all of us. Doesn't sound so great. And of course, then we had a bunch of us who called for a pause in an open letter. And then shortly after that, we had this bunch of AI researchers talking about how this poses a risk of extinction, which was all over the news. It was the shortest open letter I've ever read, just one sentence: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." So basically, the whole point of this was that it mainstreamed the idea that, hey, maybe we could get wiped out, so we really should keep it under control. And the most interesting thing here, I think, is who signed it. You have not only top academic researchers who don't have a financial conflict of interest, people like Geoff Hinton and Yoshua Bengio, but you also have the CEOs: Demis Hassabis from Google DeepMind, Sam Altman again, Dario Amodei, et cetera. So a lot of reasons why we should keep AI under control.
How can we help? I feel that, first of all, we obviously should. And Peter, earlier this morning, gave a really great example of how I think we really can help: by opening up the black box and getting to a place where we're not just using ever more powerful systems that we don't understand, but where we're instead able to understand them better. This has always been the tradition in physics when we work with powerful things. If you want to get a rocket to the moon, you don't just treat it as a black box and fire it. "Oh, that one went a little too far to the left. Let's aim a little farther to the right next time." No, what you do is figure out the laws of motion, you figure out Einstein's laws of gravitation, and so on. And then you can be much more confident that you're going to control what you build.
So this is actually a field which has gained a lot of momentum quite recently. It's still a very small field, known by the nerdy name of mechanistic interpretability. To give you an idea of how small it is:
if you compare it with neuroscience, and you can think of this as artificial neuroscience, neuroscience is a huge field, of course. And look how few people there were here at MIT at this conference that I organized just two months ago. This was by far the biggest conference in this little nascent field. So that's the bad news: very few people are working on it.
But the good news is, even though there are so few, there's already been a lot of progress, remarkable progress. I would say more progress in this field than in all of big neuroscience in the last year. Why is that?
It's because here you have a huge advantage over ordinary neuroscience. First of all, when studying a brain with 10^11 neurons, you have a hard time reading out more than 1,000 of them at a time, and you need to get IRB approval for all sorts of ethics reasons and so on. Here, you can read out every single neuron all the time. You can also get all the synaptic weights. You don't even have to go to the IRB. And you can use all these traditional techniques we love in physics, where you actually mess with the system and see what happens.
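To make this concrete, here is a toy sketch of my own (not from the talk): a tiny numpy "network" in which we can read out every neuron's activation, inspect every weight, and run a lesion experiment by silencing a neuron, the kind of intervention that is essentially impossible in a biological brain.

```python
import numpy as np

# Hypothetical toy 2-layer MLP, purely for illustration: in an artificial
# network we can record EVERY neuron and EVERY weight, and do ablation
# ("lesion") experiments at will.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights ("synapses")
W2 = rng.normal(size=(3, 2))   # hidden -> output weights

def forward(x, ablate_hidden=None):
    """Run the network, recording every neuron's activation.

    ablate_hidden: index of a hidden neuron to silence (lesion experiment).
    """
    h = np.maximum(x @ W1, 0)          # ReLU hidden layer
    if ablate_hidden is not None:
        h = h.copy()
        h[ablate_hidden] = 0.0         # mess with the system, see what happens
    y = h @ W2
    return y, {"input": x, "hidden": h, "output": y}  # full readout

x = np.array([1.0, -0.5, 2.0, 0.3])
y_full, acts = forward(x)
y_lesioned, _ = forward(x, ablate_hidden=1)
print("hidden activations:", acts["hidden"])
print("output change from ablating neuron 1:", y_full - y_lesioned)
```

The point of the sketch is just the experimental affordance: full observability plus cheap, repeatable interventions, which is exactly what biological neuroscience lacks.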
And I think there are three levels of ambition that can motivate you to work on mechanistic interpretability, which is, of course, what I'm trying to do here: encourage you to work more on this.
The first, lowest ambition level is just, when you train a black-box neural network on some data to do some cool stuff, to understand it well enough that you can diagnose its trustworthiness, make some assessment of how much you should trust it. That's already useful.
Second level of ambition, if you take it up a notch, is to understand it so well that you can improve its trustworthiness.
And the ultimate level of ambition, and we are very ambitious here at MIT, is to understand it so well that you can guarantee trustworthiness. We have a lot of work at MIT on formal verification, where you do mathematical proofs that code is going to do what you want it to do.
Proof-carrying code is a popular topic in computer security. It's a little bit like a virus checker in reverse: a virus checker refuses to run your code if it can prove that it's harmful. Here, instead, the operating system says to the code, give me a proof that you're going to do what you say you're going to do. And if the code can't present a proof that the operating system can check, it won't run it.
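As a toy sketch of this idea (my own illustration, not a real proof-carrying-code system): the "operating system" below demands a machine-checkable safety certificate, here an interval-arithmetic bound on the program's output, and refuses to run the code if the bound can't be verified.

```python
# Toy proof-carrying-code sketch. Programs are tiny expression trees; the
# checker independently derives a guaranteed output bound via interval
# arithmetic and refuses to run anything it cannot prove safe.

def interval_eval(expr, env):
    """Evaluate expr over intervals (lo, hi). Supports const, var, add, mul."""
    op = expr[0]
    if op == "const":
        return (expr[1], expr[1])
    if op == "var":
        return env[expr[1]]
    alo, ahi = interval_eval(expr[1], env)
    blo, bhi = interval_eval(expr[2], env)
    if op == "add":
        return (alo + blo, ahi + bhi)
    if op == "mul":
        products = [alo * blo, alo * bhi, ahi * blo, ahi * bhi]
        return (min(products), max(products))
    raise ValueError(op)

def run_if_provably_safe(expr, env, max_abs_output):
    """Checker: run the program only if the output bound is provable."""
    lo, hi = interval_eval(expr, env)
    if max(abs(lo), abs(hi)) > max_abs_output:
        return None  # no proof of safety: refuse to run
    point = {k: (v[0] + v[1]) / 2 for k, v in env.items()}  # sample input
    def concrete(e):
        if e[0] == "const": return e[1]
        if e[0] == "var": return point[e[1]]
        a, b = concrete(e[1]), concrete(e[2])
        return a + b if e[0] == "add" else a * b
    return concrete(expr)

# Program: x*x + 0.5 with x in [-1, 1]; provably bounded in [-0.5, 1.5].
prog = ("add", ("mul", ("var", "x"), ("var", "x")), ("const", 0.5))
print(run_if_provably_safe(prog, {"x": (-1.0, 1.0)}, max_abs_output=2.0))
```

With the bound 2.0 the certificate checks out and the program runs; tighten the policy to 1.0 and the same program is refused, because the checker can no longer prove it safe.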
It's hopeless to come up with rigorous proofs for neural networks directly, because it's like trying to prove things about spaghetti. But the vision here is: if you can use AI to mechanistically extract the knowledge that's been learned, you can re-implement it in some other kind of architecture, one that isn't a neural network and that really lends itself to formal verification.
If we can pull off this moonshot, then we can trust systems much more intelligent than us because no matter how smart they are, they can't do the impossible.
So in my group, we've been having a lot of fun working on extracting learned knowledge from the black box in the mechanistic interpretability spirit. You heard, for example, my grad student Michelle talk about this quanta stuff recently. And I think this is very encouraging, because if this quantization hypothesis is true, you can divide and conquer.
You don't have to understand the whole neural network all at once; you can look at the discrete quanta of learning and study them separately, much like we physicists don't try to understand the Stata Center all at once. First, we try to understand the individual atoms it's made of, and then we work our way up to solid-state physics and so on.
It also reminds me a little bit of Minsky's Society of Mind, where you have many different systems working together to do very powerful things.
I'm not going to try to give a full summary of all the cool stuff that went down at this conference, but I can share that there's a website where we have all the talks on YouTube, if anyone wants to watch them later.
But I want to give you a little more of the nerd flavor of how tools that many of you here know as physicists are very relevant to this: things like phase transitions, for example. We already heard a beautiful talk by Jacob Andreas about knowledge representations.
There's been a lot of progress on figuring out how large language models represent knowledge, how they know that the Eiffel Tower is in Paris and how you can change the weights so that it thinks it's in Rome, et cetera, et cetera.
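As a toy sketch of what "changing the weights so it thinks the Eiffel Tower is in Rome" can mean (my simplified illustration, in the spirit of published work on locating and editing factual associations, not the actual method used on LLMs): treat a linear layer as an associative memory from key vectors to value vectors, and edit one fact with a rank-one weight update.

```python
import numpy as np

# Toy associative memory: W maps key vectors (e.g. "Eiffel Tower") to value
# vectors (e.g. "Paris"). All vectors here are random stand-ins.
rng = np.random.default_rng(1)
d = 8
keys = rng.normal(size=(3, d))        # "Eiffel Tower", "Big Ben", "Colosseum"
values = rng.normal(size=(3, d))      # "Paris", "London", "Rome"

# Fit W so that W @ key = value for every stored fact (exact here, since
# the system is underdetermined and lstsq finds a consistent solution).
W = np.linalg.lstsq(keys, values, rcond=None)[0].T

k_eiffel, v_rome = keys[0], values[2]
# Rank-one edit: remap the "Eiffel Tower" key to the "Rome" value, using the
# minimal-norm update that achieves the new association.
residual = v_rome - W @ k_eiffel
W_edited = W + np.outer(residual, k_eiffel) / (k_eiffel @ k_eiffel)

print("edited fact now reads Rome:", np.allclose(W_edited @ k_eiffel, v_rome))
```

The design point is that a single stored fact lives in a low-rank piece of the weight matrix, so a surgical low-rank update can rewrite it; in real transformers the keys are far from orthogonal, so the actual editing methods have to work harder to avoid disturbing other facts.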
We did a study on algorithmic datasets where we found phase transitions. Suppose you try to make the machine learning learn a giant multiplication table; this could be for some arbitrary group operation, something more interesting than standard multiplication. If there's any sort of structure here, if this operation is, for example, commutative, then you only really need the training data for about half of the entries, and you can figure out the other half because it's a symmetric matrix. If it's also associative, then you need even less, et cetera. So as soon as the machine learning discovers some sort of structure, it might learn to generalize.
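The counting argument above can be sketched in a few lines (my own toy illustration, using addition mod 5 as the commutative operation): revealing only the upper triangle of the table, roughly half the entries, determines the rest by symmetry.

```python
import numpy as np

# Toy "multiplication table" for a commutative operation: addition mod 5.
# Because a∘b = b∘a, the table is symmetric, so the upper triangle
# (n*(n+1)/2 entries) determines all n*n entries.
n = 5
table = np.full((n, n), -1)

# Reveal only the upper triangle (including the diagonal) as training data.
for a in range(n):
    for b in range(a, n):
        table[a, b] = (a + b) % n

observed = np.count_nonzero(table >= 0)
print(f"observed {observed} of {n * n} entries")

# Commutativity fills in the missing lower triangle for free.
for a in range(n):
    for b in range(a):
        table[a, b] = table[b, a]

full = np.fromfunction(lambda a, b: (a + b) % n, (n, n), dtype=int)
print("reconstruction correct:", np.array_equal(table, full))
```

Associativity would cut the required data further still, since entries become derivable from chains of other entries; this sketch only exploits the symmetry.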
So here's a simple example: addition modulo 59. We train a neural network to do this. We don't give it the inputs as numbers; we just give it each of the numbers from 0 to 58 as a symbol, so it has no idea that they should be thought of as numbers, and it represents them by embedding them in some internal space. And then we find that exactly at the moment when it learns to generalize to unseen examples, there's a phase transition in how it represents them in that internal space. It was a high-dimensional space, but everything collapses onto a two-dimensional plane, which I'm showing you here: a circle. Boom. That's of course exactly the way we do addition modulo 12 when we look at a clock, right? So it finds a representation where it's actually adding up angles, which automatically captures, in this case, the commutativity and associativity. And I suspect this might be a general thing that happens in learning language and other things too: it comes up with a very clever representation that geometrically encompasses a lot of the key properties, which lets it generalize.
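The circle representation described above can be reconstructed directly (my own sketch of the idea, not the trained network itself): embed each symbol k as a point on the unit circle at angle 2πk/59, add angles, and read off the nearest symbol.

```python
import numpy as np

# Clock-face representation of addition mod 59: symbol k lives at angle
# 2*pi*k/59 on the unit circle. Adding angles (multiplying unit complex
# numbers) then implements modular addition geometrically, exactly as a
# clock implements addition mod 12.
p = 59

def embed(k):
    theta = 2 * np.pi * k / p
    return np.array([np.cos(theta), np.sin(theta)])

def add_on_circle(a, b):
    za = complex(*embed(a))            # e^{i*theta_a}
    zb = complex(*embed(b))            # e^{i*theta_b}
    theta = np.angle(za * zb) % (2 * np.pi)   # angles add under multiplication
    return round(theta * p / (2 * np.pi)) % p # nearest symbol on the circle

# The geometric representation reproduces the full addition table.
correct = all(add_on_circle(a, b) == (a + b) % p
              for a in range(p) for b in range(p))
print("circle representation reproduces addition mod", p, ":", correct)
```

Note that commutativity and associativity come for free here: angle addition inherits both from ordinary addition, which is presumably why this representation generalizes so well.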
We do a lot of phase-transition experiments too. We tweak various properties of the neural network and see what happens. If you think of water, you would have pressure and temperature on your phase diagram; here the axes are various other nerdy machine-learning parameters instead, and you get phase-transition boundaries between the region where it learns properly and can generalize, the region where it fails to generalize and never learns anything, and the region where it just overfits. This is for the example of doing regular addition, and you see it learns to put the symbols on a line rather than a circle in the cases where it works out.
So I wanna leave a little bit of time for questions, but the bottom line I would like you to take away from all this is that I think it's too pessimistic to say, oh, you know, we're forever just gonna be stuck with these black boxes that we can never understand. Of course, if we convince ourselves that it's impossible, then we're going to fail; that's the best recipe for failure. I think it's quite possible that we really can understand enough about very powerful AI systems to have very powerful AI systems that are provably safe. And physicists can really help a lot, because we have a much higher bar for what we mean by understanding things than a lot of our colleagues in other fields. And we also have a lot of really great tools. We love studying nonlinear dynamical systems, we love studying phase transitions, and so many other things that are turning out to be key to making this kind of progress.
So if anyone is interested in collaborating or learning more about mechanistic interpretability, basically studying the learning and execution of neural networks as just yet another cool physical system to try to understand, just reach out to me, and let's talk. Thank you. Thank you very much. Does anyone have questions?
I actually have one to start with. So, as you were explaining in these last few slides, a lot of the themes seem to be about applying the laws of thermodynamics and other physical laws to these systems. And the parallel I thought of is that the field of biophysics also emerged out of this, right? Applying physical laws to systems that were considered too complex to understand before we really thought about them carefully. Is there any emerging field like that in the area of AI or understanding neural networks, other than that little conference you just mentioned? Or is that really all that's there right now?
There's so much room for there to really be an emerging field like this. And I invite all of you to help build it. It's obviously a field which is not only very much needed but it's just so interesting.
There have been so many times in recent months when I read a new paper by someone else about this and I'm like, oh, this is so beautiful. You know, another way to think about this: I always tell my students that when they pick a topic to work on, they should look for areas where there's a lot of data, where experiment is ahead of theory. That's the best place to do theoretical work. And that's exactly what we have here, right?
If you train some system like GPT-4 to do super interesting things, or use Llama 2, which just came out and where you have all the parameters, it's an incredibly interesting system. You can get massive amounts of data, and there are the most fundamental things we don't understand. It's just like when the LHC turns on, or when you first launch the Hubble Space Telescope or the WMAP satellite: you have massive amounts of data and really cool basic questions. It's the most fun domain to do physics in. And yeah, let's build a field around it. Thank you. We've got a question up there.
Hi, Professor Tegmark. First of all, amazing talk; I love the concept. But I was wondering if it is possible that this approach might miss situations in which the language model actually performs very well not in a contiguous region of parameter space, a phase region, but rather in small blobs scattered all around. In most physical systems, we have a lot of parameters, and the phases are mostly contiguous regions in n dimensions or whatever, with phase transitions between them, which is the concept here. But since this is not necessarily a physical system, maybe the best performance occurs only at specific combinations of parameters, at isolated points or little blobs. I don't know if my question went through.
Yeah, good question. I think I need to explain better; my proposal is actually more radical than I properly explained. I think we should never put something we don't understand, like GPT-4, in charge of the MIT nuclear reactor or any high-stakes system. I think we should use these black-box systems to discover amazing knowledge and discover patterns in data, and then not stop there and just connect them to the nuclear weapons system or whatever. Instead, we should develop other AI techniques to extract the knowledge they've learned and re-implement it in something else, right?
So, to take your physics metaphor again: think of Galileo. When he was four years old, if his daddy threw him a ball, he'd catch it, because his black-box neural network had gotten really good at predicting trajectories. Then he got older and realized, wait a minute, these trajectories always have the same shape, it's a parabola, y equals x squared and so on. And when we send a rocket to the moon, right, we don't put a human there to make poorly understood decisions. We have actually extracted the knowledge and written the Python code, or something else that we can verify.
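The Galileo story can be sketched in code (my own toy illustration, with made-up launch parameters): generate projectile data, then "extract the knowledge" as an explicit parabola whose coefficients can be inspected and verified, instead of leaving the pattern inside a black box.

```python
import numpy as np

# Hypothetical projectile data (g, v0, y0 are assumed parameters, not from
# the talk): height y(t) = y0 + v0*t - g*t^2/2.
g, v0, y0 = 9.8, 20.0, 1.5
t = np.linspace(0.0, 2.0, 50)
y = y0 + v0 * t - 0.5 * g * t**2

# "Extract the knowledge": fit an explicit parabola y = a*t^2 + b*t + c.
a, b, c = np.polyfit(t, y, deg=2)
print(f"extracted law: y = {a:.2f} t^2 + {b:.2f} t + {c:.2f}")
# a ≈ -g/2, b ≈ v0, c ≈ y0: the learned pattern is now a human-checkable
# formula that could be handed to a formal verifier, unlike raw weights.
```

This is the whole "extract, then re-implement" loop in miniature: the black box (here just the data-generating process) is replaced by a three-coefficient formula you can prove things about.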
I think we need to stop putting an equal sign between large language models and AI. We've had radically different ideas of what AI should be. First we thought about it in the von Neumann paradigm of computation; now we're thinking about LLMs; we may think of other ones in the future. What's really amazing about neural networks, in my opinion, is not their ability to execute a computation at runtime. They're just another massively parallel computational system, and there are plenty of other ones that are easier to formally verify. Where they really shine is in their ability to discover patterns in data, to learn. So let's continue using them for that.
You could even imagine an incredibly powerful AI that is just allowed to learn but is not allowed to act back on the world in any way. And then you use other systems to extract what it's learned and implement that knowledge in some system that you can provably trust. This, to me, is the path forward that's really safe. And maybe there will still be some kinds of things which are so complicated that we can't prove they're going to do what we want. So let's not use those things until we can prove them, because I'm confident that the set of stuff that can be made provably safe is vastly more powerful and useful and inspiring than anything we have now. So why should we risk losing control when we can do so much more first, and provably safely?
So for your first question, is it just an empirical observation, or do you have a theoretical model like you do in physics? Right now it's mainly an empirical observation. We have seen many examples of phase transitions cropping up in machine learning, and so have many other authors.
I am quite confident that there is a beautiful theory out there to be discovered, a sort of unified theory of phase transitions in learning. And maybe one of you will be the first to formulate it. I don't think it's a coincidence that these things keep happening like this.
But this gives you all an example of how many basic, physics-like questions are still out there unanswered, where we have massive amounts of data as clues to guide us toward them. Thank you.
And my hunch is that at some point in the future we will probably even discover a very deep, unified relationship, or duality, between thermodynamics and learning dynamics. Thank you.