Richard Sutton – Father of RL thinks LLMs are a dead end (YouTube Video Transcript)

Title: Richard Sutton – Father of RL thinks LLMs are a dead end

Duration: 01:07:09

(00:00:00) Why are you trying to distinguish (00:00:01) humans? Humans are animals. What we have (00:00:04) in common is more interesting. What (00:00:06) distinguishes us, we should be paying (00:00:08) less attention to. (00:00:08) >> I mean, we're trying to replicate (00:00:09) intelligence, right? No animal can go to (00:00:11) the moon or make semiconductors. So, we (00:00:13) want to understand what makes humans (00:00:14) special. (00:00:15) >> So, I like the way you consider that (00:00:16) obvious, cuz I consider the opposite (00:00:19) obvious. If we understood a squirrel, (00:00:21) we'd be almost all the way there. I am (00:00:23) personally just kind of content being (00:00:25) out of sync with my field for a long (00:00:27) period of time, perhaps decades, because (00:00:29) occasionally I have been proved right in (00:00:32) the past. I don't think learning is (00:00:35) really about training. It's about an (00:00:36) active process. The child tries things (00:00:39) and sees what happens. I think we should (00:00:41) be proud that we are giving rise to this (00:00:45) great transition in the universe. (00:00:48) Today I'm chatting with Richard Sutton, (00:00:50) who is one of the founding fathers of (00:00:52) reinforcement learning and inventor of (00:00:54) many of the main techniques used there, (00:00:55) like TD learning and policy gradient (00:00:57) methods, and for that he received this (00:00:59) year's Turing Award, which, if you don't (00:01:01) know, is basically the Nobel Prize for (00:01:03) computer science. Richard, congratulations. (00:01:05) >> Thank you, Dwarkesh. (00:01:06) >> And uh thanks for coming on the podcast. (00:01:08) >> It's my pleasure. (00:01:09) >> Okay, so first question: my audience and I (00:01:13) are familiar with the LLM way of (00:01:15) thinking about AI. Conceptually, (00:01:17) what are we missing in terms of thinking (00:01:19) about AI from the RL perspective? 
(00:01:22) >> Well, yes, I think it's really quite a (00:01:24) different point of view and it's it can (00:01:27) easily get separated and lose the (00:01:29) ability to talk to each other. (00:01:30) >> Mhm. (00:01:31) >> And um yeah, large language models have (00:01:34) become such a big thing. Generative AI (00:01:36) in general a big thing. Um and our field (00:01:40) is subject to bandwagons and fashions. (00:01:42) So we lose we lose track of the uh basic (00:01:46) basic things, because I consider (00:01:47) reinforcement learning to be basic AI. (00:01:50) And what is intelligence? The problem (00:01:52) is to understand your world and um (00:01:57) reinforcement learning is about (00:01:58) understanding your world, whereas large (00:02:00) language models are about mimicking (00:02:02) people, doing what people say you should (00:02:04) do. They're not about figuring out what (00:02:06) to do. (00:02:07) >> Huh. I guess you would think that to (00:02:11) emulate the trillions of tokens in the (00:02:13) corpus of internet text, you would have (00:02:15) to build a world model. In fact, these (00:02:17) models do seem to have very robust world (00:02:19) models and they're the best um world (00:02:21) models we've made to date in AI. Right? (00:02:23) So, what do you think that's (00:02:25) missing? (00:02:26) >> I would disagree with most of the things (00:02:28) you just said. (00:02:28) >> Great. (00:02:30) >> Just to mimic the the what people say is (00:02:33) not really to build a model of the world (00:02:34) at all. 
I don't think you know you're (00:02:36) mimicking things that have a model of (00:02:39) the world the people (00:02:40) >> but I don't want to approach the (00:02:42) question in an adversarial way (00:02:45) uh but but I would I would question the (00:02:47) idea that they um they have a world (00:02:50) model so a world model would enable you (00:02:52) to predict what would happen (00:02:53) >> right (00:02:53) >> they they have they have the ability to (00:02:55) predict what a person would say they (00:02:57) don't have the ability to predict what (00:02:58) will happen what we want I think to (00:03:01) quote Alan Turing what we want is a (00:03:04) machine that can learn from experience, (00:03:06) >> right? (00:03:07) >> Where experience is the things that (00:03:08) actually happen in your life. You do (00:03:10) things, you see what happens. Um, and uh (00:03:14) that's what you learn from. (00:03:16) >> Yeah. (00:03:17) >> The large language models learn from (00:03:18) something else. They learn from here's a (00:03:20) situation and here's what a person did. (00:03:23) And implicitly the suggestion is you (00:03:25) should do what the person did. (00:03:26) >> Right? I guess maybe the the crux and (00:03:29) I'm curious if you disagree with this is (00:03:30) some people will say okay so this (00:03:32) imitation learning has given us a good (00:03:34) prior or given these models a good prior (00:03:36) but reasonable ways to approach problems (00:03:39) and as we move towards the era of (00:03:41) experience uh as you call it this prior (00:03:45) is going to be the basis on which we (00:03:48) teach these models from experience (00:03:49) because this gives them the opportunity (00:03:51) to get uh answers right some of the time (00:03:54) and then on this you can build uh you (00:03:57) can train them on experience. Do you (00:03:58) agree with that perspective? 
(00:04:00) >> No, I I I agree that it's the it's the (00:04:03) large language model perspective, right? (00:04:04) >> I don't think it's a good perspective. (00:04:06) >> Yeah. Yeah. Cere. (00:04:08) >> So to be a prior for something, there (00:04:11) has to be a real thing. I mean, a prior (00:04:14) bit of knowledge should be the basis for (00:04:17) actual knowledge. What is actual (00:04:19) knowledge? There's no definition of (00:04:20) actual knowledge in that in that large (00:04:23) language model framework. What makes an action (00:04:27) a good action to take? You recognize the (00:04:30) value, the need for continual learning, (00:04:32) right? So if you need to learn (00:04:34) continually, continually means learning (00:04:36) during the normal interaction with the (00:04:38) world. (00:04:38) >> Yeah. (00:04:39) >> And so then there must be some way (00:04:41) during the normal interaction to tell (00:04:43) what's right. (00:04:44) >> Yep. (00:04:45) >> Okay. So (00:04:47) is there any way for it to tell in the (00:04:50) large language model setup to tell (00:04:52) what's the right thing to say? You will (00:04:55) say something and you will not get (00:04:56) feedback about what the right thing to (00:04:58) say is (00:04:59) >> because there's no definition of what (00:05:01) the right thing to say is. There's no (00:05:03) goal. (00:05:03) >> Right. (00:05:04) >> And if there's no goal, then there's (00:05:05) there's one thing to say, another thing (00:05:07) to say. There's no right thing to say. (00:05:09) >> Right. (00:05:10) >> So there's no ground truth. You can't (00:05:12) have prior knowledge if you don't have (00:05:14) ground truth because the prior knowledge (00:05:16) is supposed to be a hint or an initial (00:05:18) belief about what the truth is. (00:05:20) >> Yeah. (00:05:21) >> But there isn't any truth. 
there's no (00:05:23) right thing to say right now in (00:05:25) reinforcement learning there is a right (00:05:26) thing to say or right thing to do (00:05:28) because the the right thing to do is the (00:05:30) thing that gets you reward (00:05:32) >> right (00:05:32) >> so we have a definition of what the (00:05:33) right thing to do is and so we can have (00:05:36) uh prior knowledge or knowledge provided (00:05:39) by pe people about what the right thing (00:05:40) to do is and then we can check it (00:05:43) >> to see because because we have a (00:05:44) definition of what the actual right (00:05:45) thing to do is (00:05:47) >> now an even simpler case is when you (00:05:49) have you're trying to make a model of (00:05:50) the world when you predict what will (00:05:52) happen, you predict and then you see (00:05:54) what happens. (00:05:55) >> Okay? So there's ground truth. There's (00:05:57) no ground truth in in uh large language (00:06:01) models because you don't have a a (00:06:03) prediction about what will happen next. (00:06:06) If you say something in your in your um (00:06:08) conversation, there's the large language (00:06:10) models have no prediction about what the (00:06:13) person will say in response to that or (00:06:15) what what the response will be. I mean I (00:06:17) think they do like they you can (00:06:19) literally ask them what what what would (00:06:20) you anticipate a user might say in (00:06:22) response and they have a prediction. (00:06:24) >> Oh no they they they will respond to (00:06:26) that question right? (00:06:27) >> Yeah (00:06:27) >> but they have no prediction in the (00:06:28) substantive sense that they won't be (00:06:31) surprised by what happens and if (00:06:33) something happens that isn't what you (00:06:34) might say they predicted they will not (00:06:36) change because an unexpected thing has (00:06:39) happened and there to learn that they'd (00:06:43) have to make an adjustment. 
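The predict-then-observe idea described here (make a prediction, see what happens, adjust when surprised) can be sketched in a few lines of Python. This is only an illustration, not something presented in the conversation: the function name `learn_from_surprise`, the step size, and the constant observation stream are all assumptions chosen to keep the example small.

```python
def learn_from_surprise(observations, step_size=0.5):
    """Predict each outcome before seeing it, then adjust by the error.

    Hypothetical helper for illustration. The observed outcome plays the
    role of ground truth; the gap between outcome and prediction is the
    "surprise" that drives the update.
    """
    prediction = 0.0          # initial belief (a prior that can be checked)
    surprises = []
    for outcome in observations:
        surprise = outcome - prediction      # what happened minus what was expected
        surprises.append(abs(surprise))
        prediction += step_size * surprise   # move the belief toward the outcome
    return prediction, surprises


# A world that always produces 1.0: the surprise halves on every step
# (1.0, 0.5, 0.25, ...) as the prediction converges on what actually happens.
prediction, surprises = learn_from_surprise([1.0] * 10)
```

The point of the sketch is only that the update needs an observed outcome to compare against; without one, there is nothing to compute the surprise from.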
I I so I (00:06:45) think a capability like this does exist (00:06:48) in context. So it's interesting to watch (00:06:51) a model do chain of thought and then (00:06:54) suppose it's trying to solve a math (00:06:55) problem. It'll say okay I'm going to (00:06:56) approach this problem using this (00:06:59) approach at first and it'll write this (00:07:00) out and be like oh wait I just realized (00:07:02) this is the wrong conceptual way to (00:07:03) approach the problem. I'm going to (00:07:04) restart by this another approach and (00:07:07) that flexibility does exist in context, (00:07:10) right? Do you have something else in (00:07:11) mind or do you just think that you need (00:07:12) to extend this capability across longer (00:07:15) horizons? (00:07:16) >> I'm just saying they don't have a have a (00:07:19) uh in any meaningful sense they don't (00:07:21) have a prediction of what will happen (00:07:22) next and they will not be surprised by (00:07:24) what happened next. They'll not make any (00:07:26) changes if if something happens. (00:07:29) >> But isn't that isn't that isn't that (00:07:30) >> based on what happens. (00:07:31) >> Isn't that literally what next token (00:07:32) prediction is? prediction about what's (00:07:34) next and then updating on a surprise. (00:07:35) >> Next token is what they should say, what (00:07:37) the action should be. It's not what the (00:07:40) world will give them in response to what (00:07:42) they do. Let's let's go back to the uh (00:07:45) their lack of goal. (00:07:46) >> Mhm. (00:07:47) >> For me, having a goal is the essence of (00:07:49) intelligence, (00:07:50) >> right? (00:07:51) >> Something is intelligent if it can (00:07:52) achieve goals. Is I like John McCarthy's (00:07:55) definition that intelligence is the (00:07:57) computational part of the ability to (00:07:58) achieve goals. Yeah. So, you have to (00:08:00) have goals. 
Otherwise you're (00:08:02) just, you're just a behaving (00:08:04) system. You you're not you're not anything (00:08:08) special. You're not intelligent. (00:08:09) >> Right. (00:08:10) >> And you agree that large language models (00:08:12) don't have goals. (00:08:13) >> I think they No, they have a goal. (00:08:14) >> What's the goal? (00:08:15) >> Next token prediction. (00:08:17) >> That's not a goal. Doesn't it doesn't (00:08:19) change the world, (00:08:21) you know. (00:08:22) I think tokens come at you and if you (00:08:25) predict them, you don't influence them. (00:08:27) >> Oh, yeah. I I it's not a goal about the (00:08:30) external world. (00:08:32) >> Yeah. It's not a goal. (00:08:34) It's not a substantive goal. It's not (00:08:36) You can't look at a system and say, "Oh, (00:08:38) it uh has a goal," if it's just sitting (00:08:40) there predicting and being happy with (00:08:41) itself that it's predicting accurately. (00:08:43) >> I I guess maybe the bigger question I (00:08:44) want to understand is why you (00:08:46) don't think doing RL on top of LLMs is a (00:08:50) productive direction, because we we (00:08:52) seem to be able to give these models a (00:08:53) goal of solving difficult math problems (00:08:55) and they're in many ways um at the very (00:08:58) peaks of human level in in the capacity (00:09:01) to solve Math Olympiad type problems, (00:09:03) right? They got gold at IMO. So it seems (00:09:06) like the model which got gold at the (00:09:09) International Math Olympiad does have the (00:09:10) goal of getting math problems right. Um (00:09:13) so why can't we extend this to different (00:09:14) domains? (00:09:15) >> Well, the math problems are different. Um (00:09:18) the making a model of the physical world (00:09:20) and uh carrying out the consequences of (00:09:23) mathematical (00:09:25) um assumptions or operations, (00:09:27) >> right? 
(00:09:28) >> Those are very different things like the (00:09:30) the empirical world has to be learned. (00:09:33) You have to learn the consequences. Um (00:09:36) whereas the uh the math is is more just (00:09:41) computational. It's more like standard (00:09:43) planning. So, so there you can you can (00:09:47) um they can have a goal to to um uh to (00:09:52) find the proof and they are in in some (00:09:55) way given that goal to find the proof. (00:09:57) >> Right. So, I mean, it's interesting (00:09:59) because you wrote this essay in 2019 (00:10:01) titled The Bitter Lesson, and this is (00:10:04) the most influential essay perhaps in (00:10:05) the history of AI, but people have used (00:10:10) that as a justification (00:10:13) for scaling up LLMs because in their (00:10:16) view, this is the one scalable way we (00:10:19) have found to pour ungodly amounts of (00:10:22) compute into learning about the world. (00:10:24) And so it's interesting that your (00:10:25) perspective is that the LLMs are (00:10:27) actually not bitter lesson. (00:10:29) >> It's an interesting question whether uh (00:10:32) large language models are are uh a case (00:10:36) of the bitter lesson. (00:10:37) >> Yeah. (00:10:38) >> Because they are clearly um a way of (00:10:42) using massive computation things that (00:10:44) will scale with computation up to up to (00:10:48) the limits of the internet. (00:10:49) >> Yeah. (00:10:50) uh but they're also a way of putting in (00:10:54) lots of um human knowledge and uh so so (00:10:59) this is an interesting question um it's (00:11:02) a sociological or industry question uh (00:11:07) will they reach the limits of of of the (00:11:11) data and and be superseded by things (00:11:15) that that are can get more data just (00:11:19) from experience rather than from (00:11:21) uh from people. 
Uh in some ways it's a (00:11:26) classic case of the of the of the bitter (00:11:28) lesson: the more the more human (00:11:30) knowledge we put into the large language (00:11:32) models, the better they can do, and so it (00:11:34) feels good. Um (00:11:37) and yet uh one well I in particular (00:11:42) expect there to be systems that can (00:11:44) learn from experience and which could (00:11:45) well perform much much better and be (00:11:48) much more scalable. In which uh case it (00:11:51) will be another instance of the bitter (00:11:53) lesson, that the things that that used (00:11:56) human knowledge were eventually (00:11:59) superseded by things that just um (00:12:02) trained from uh experience and (00:12:04) computation. >> I I guess that doesn't seem (00:12:06) like the crux to me because I think (00:12:08) those people would also agree that the (00:12:11) overwhelming amount of compute in the (00:12:13) future will come from uh learning from (00:12:17) experience. They just think that the (00:12:19) scaffold or the basis of that, the thing (00:12:21) you'll start with in order to pour in (00:12:23) the compute to do this future (00:12:25) experiential learning or on the job (00:12:27) learning, will be LLMs. And so I I guess (00:12:31) I I still don't understand why this is (00:12:34) the wrong starting point altogether. Why (00:12:37) we need a whole new architecture to (00:12:39) begin doing experiential continual (00:12:42) learning. Uh and why we can't start with (00:12:44) LLMs to do that. (00:12:46) >> Well, in every case of the bitter lesson, (00:12:48) you know, you could start with uh human (00:12:51) knowledge, (00:12:51) >> right? (00:12:52) >> And then just and then do the scalable (00:12:54) things. (00:12:54) >> Yeah, (00:12:54) >> that's always the case. And there's no (00:12:57) never any reason why that has to be bad, (00:13:00) >> right? 
(00:13:00) >> But in fact and in practice it has (00:13:03) always turned out to be bad because (00:13:05) people get locked into the human (00:13:07) knowledge approach and they (00:13:10) psychologically or you know now I'm now (00:13:12) I'm speculating why it is but this is (00:13:14) what has always happened. (00:13:15) >> Yeah. (00:13:16) >> That uh yeah they get they get their (00:13:19) lunch gets eaten by the methods that are (00:13:21) truly scalable. (00:13:22) >> Yeah. Give me a sense of what the (00:13:23) scalable method is. The scalable method (00:13:26) is you learn from experience. Um you uh (00:13:30) you you try things, you see what you see (00:13:32) what works. No one no one has to tell (00:13:35) you. First of all, you have a goal. So (00:13:37) without a goal, uh there's no sense of (00:13:39) right or wrong or better or worse. So (00:13:42) large language models are trying to get (00:13:44) by without having a goal or a sense of (00:13:46) better or worse. That's just, you know, (00:13:49) it's exactly starting in the wrong (00:13:51) place. May maybe it's um interesting to (00:13:53) compare this to humans. So in both the (00:13:56) case of learning from imitation versus (00:13:59) experience and on the question of goals, (00:14:02) I think there's some interesting (00:14:04) analogies. So you know kids will (00:14:07) initially learn from imitation. Uh you (00:14:11) don't think so? (00:14:12) >> No, of course not. (00:14:14) >> Really? (00:14:15) >> Yeah. I think kids just like watch (00:14:17) people. They like kind of try try to (00:14:19) like say the same. (00:14:20) >> How old are those these kids? (00:14:22) >> I I think the level (00:14:23) >> What about the first six months? (00:14:25) >> I think they're kind kind of imitating (00:14:26) things. They're trying to like make (00:14:27) their mouth sound the way they see their (00:14:29) mother's mouth sound. 
And then they'll (00:14:30) say the same words without understanding (00:14:31) what they mean. And as you get older, (00:14:33) the complexity of the imitation they do (00:14:35) increases. So that's you're you're (00:14:37) you're you know, you're imitating maybe (00:14:40) the skills that your uh people in your (00:14:42) band are using to hunt down the deer or (00:14:44) something. And then you go into the (00:14:46) learning from experience RL regime. But (00:14:48) I think there's a lot of imitation (00:14:49) learning happening with uh humans. (00:14:52) >> Yeah. Surprising. Yeah. You can have (00:14:53) such a different point of view. (00:14:55) >> Yeah. (00:14:55) >> Um when I see kids, I see kids uh just (00:14:58) trying things and like waving their (00:15:01) hands around and moving their eyes (00:15:03) around and no one no one tells them (00:15:05) there. There's no there's no um (00:15:08) imitation for uh how they move their (00:15:10) eyes around or even the sounds they (00:15:12) make. They may they may want to create (00:15:14) the same sounds but the um the actions (00:15:17) you know the thing that the uh infant (00:15:20) actually does there there's no targets (00:15:23) for that there are no examples for that (00:15:25) >> I agree that it doesn't explain (00:15:26) everything infants do but I think it (00:15:28) guides a learning process I mean even (00:15:30) LLM when it's trying to predict the next (00:15:32) token early in training it will like (00:15:34) make a guess it'll be different from (00:15:35) what like it actually sees and in some (00:15:37) sense it's like very short horizon RL (00:15:39) where it's like making this guess of (00:15:40) like I think this token will be It's (00:15:42) actually this other thing similar to how (00:15:43) a kid will try to say a word, it comes (00:15:45) out wrong. (00:15:46) >> The the large language models is (00:15:48) learning from training data. It's not (00:15:49) learning from experience. 
(00:15:51) It's it's learning from something that (00:15:53) will never be available during its (00:15:55) normal life. (00:15:57) There's never any uh training data that (00:15:59) says you should do this action in normal (00:16:02) life. (00:16:02) >> I I think this is maybe more of a (00:16:05) semantic distinction like what do you (00:16:06) call school? Is that not training data? (00:16:08) You're not like going to school because (00:16:10) it's like (00:16:11) >> school is much later. Okay, I shouldn't (00:16:13) have said never, but but I I don't know. (00:16:16) I think I would even say it about (00:16:17) school, but formal schooling is is the (00:16:21) exception. You should base your (00:16:23) >> of learning where I think you're just (00:16:26) sort of programming in your biology that (00:16:27) like early on you're not that useful and (00:16:29) then like kind of why you exist is to (00:16:32) understand the world and like learn how (00:16:34) to interact with it. Um, and seems kind (00:16:37) of like a training phase. I agree that (00:16:39) then there's like a sort of more gradual (00:16:40) there's not a sharp cut off to like (00:16:42) training to deployment, but there seems (00:16:44) to be this like initial training phase, (00:16:47) right? There's nothing where where you (00:16:49) have training of what you should do. (00:16:51) There's nothing you you you see things (00:16:54) that happen. You're not you're not told (00:16:56) what to do. (00:16:58) Don't don't don't be difficult. I mean, (00:17:01) this is obvious. (00:17:02) >> I mean, you're like literally taught (00:17:03) what to do. This is like where the word (00:17:05) training comes from is from humans, (00:17:07) right? (00:17:08) >> So I don't think uh learning is really (00:17:10) about training. I think learning is (00:17:12) about about learning. It's about an (00:17:14) active process. The child tries things (00:17:17) and sees what happens. (00:17:19) >> Right. 
(00:17:19) >> Yeah. It does not (00:17:21) we don't think we don't think about (00:17:23) training when we think of an infant (00:17:26) growing up. These these things are (00:17:28) actually rather well understood. If you (00:17:30) go to look at how psychologists think (00:17:32) about learning, there's nothing like uh (00:17:35) imitation. Maybe there are some extreme (00:17:38) cases where humans might do that or (00:17:41) appear to do that, but there's no basic (00:17:44) animal learning process called (00:17:45) imitation. There are basic animal learning (00:17:48) processes for prediction and for trial (00:17:51) and error control. I mean, it's really (00:17:54) interesting how sometimes the (00:17:55) hardest things to see are the obvious (00:17:57) ones. It's obvious um if you just look (00:18:01) at animals and how they learn and you (00:18:03) look at psychology and our theories (00:18:05) of them um it's obvious that that (00:18:09) supervised learning is not part of uh (00:18:11) the way animals learn. We don't have we (00:18:14) don't have examples of desired behavior. (00:18:17) What we have is examples of things that (00:18:20) happened, one thing that (00:18:21) followed another, and we have examples of (00:18:24) we did something and and and (00:18:28) there were consequences, but there are no (00:18:30) examples of supervised learning. I mean (00:18:32) there are no supervised learning is not (00:18:33) something that that happens in nature (00:18:35) and you know school, even if that was the (00:18:39) case, you know, we should forget about it (00:18:41) because it's it's just this that's some (00:18:43) special thing that happens in people. (00:18:45) It doesn't happen broadly in nature and you (00:18:48) know squirrels don't go to school. (00:18:49) Squirrels can learn all about the world. (00:18:52) It's absolutely obvious I would say that (00:18:55) um supervised learning doesn't happen in (00:18:58) animals. >> So I I I interviewed this 
So I I I interviewed this (00:19:01) psychologist and anthropologist Joseph (00:19:04) Henrik who has done work about cultural (00:19:08) evolution and basically how did what you (00:19:10) know what distinguishes humans and how (00:19:13) do humans pick up knowledge. Why are you (00:19:15) trying to distinguish humans? (00:19:18) Humans are animals. (00:19:20) What we have in common is more (00:19:22) interesting. What we have what (00:19:23) distinguished us we we should be paying (00:19:25) less attention to. (00:19:26) >> I mean we're trying to replicate (00:19:27) intelligence, right? So if you want to (00:19:28) understand what is it that (00:19:31) >> enables humans to go to the moon or to (00:19:33) build semiconductors. I think the thing (00:19:35) we want to understand is the thing that (00:19:37) makes it no animal can go to the moon or (00:19:39) make semiconductors. So we want to (00:19:41) understand what makes humans special. (00:19:42) >> So I like the way you consider that (00:19:44) obvious cuz I consider the opposite (00:19:46) obvious. (00:19:48) Yeah. I think we we need to we we have (00:19:50) to we have to understand how we are (00:19:53) animals and we if we understood a (00:19:55) squirrel I think we'd have a we'd be (00:19:57) almost all the way there to (00:19:59) understanding human intelligence. The (00:20:01) the language part is just a a small (00:20:04) veneer on the surface. (00:20:06) Okay. So this is great. You know we're (00:20:08) finding out the very different ways that (00:20:10) we're thinking. (00:20:12) >> We're not arguing. We're trying to share (00:20:15) share our different ways of thinking (00:20:16) with each other. (00:20:17) >> Yeah. And you I think argument is (00:20:19) useful. So um uh yeah but I do want to (00:20:22) complete this thought. 
So Joseph Henrich (00:20:24) has this interesting theory that uh if (00:20:26) you look at a lot of the uh skills that (00:20:30) humans have had to master in order to be (00:20:32) successful, and we're not talking about (00:20:34) you know last thousand years or last (00:20:35) 10,000 years but hundreds of thousands (00:20:37) of years, uh you know the world is (00:20:39) really complicated and it's not possible (00:20:43) to reason through how to let's say hunt (00:20:46) a uh seal if you're living in the (00:20:49) Arctic. And so there's this many, many (00:20:52) step long process of how to make the bait (00:20:56) and how to find the seal and then how to (00:20:59) process the food in a way that makes sure (00:21:01) you won't get poisoned. And it's not (00:21:03) possible to reason through all of that. (00:21:04) And so over time, yes, there's this like (00:21:07) larger process of whatever analogy you (00:21:10) want to use, maybe something else where (00:21:12) culture as a whole has figured out how (00:21:15) to uh find and kill and eat uh seals. (00:21:21) But then what is happening when through (00:21:23) generations this knowledge is (00:21:25) transmitted is in his view that like (00:21:28) there you just have to imitate your (00:21:30) elders in order to learn that skill (00:21:32) because you can't you can't think your (00:21:34) way through how to hunt and kill and (00:21:36) process a seal. You have to just watch (00:21:38) other people, maybe make tweaks and (00:21:39) adjustments. Uh and that's how cultural (00:21:42) knowledge accumulates. But the the (00:21:43) initial step of the cultural gain has to (00:21:45) be imitation. But maybe you think about (00:21:47) it a different way. (00:21:48) >> No, I think about it the same way. (00:21:50) >> Okay. But still it's a small thing on (00:21:54) top of basic trial and error learning, (00:21:57) >> prediction learning, and it's what (00:21:59) distinguishes us perhaps from from many (00:22:03) animals. 
(00:22:05) >> But we're an animal first. (00:22:07) >> Yeah. (00:22:08) >> And and we were an animal before we had (00:22:10) language and all those other things. >> I (00:22:14) do think you make a very interesting (00:22:15) point that continual learning is a (00:22:17) capability that most mammals have. I (00:22:21) guess all mammals have. So it's quite (00:22:23) interesting that we have something that (00:22:25) all mammals have but our AI systems (00:22:28) don't have, right? Whereas maybe like (00:22:30) the ability to understand math and (00:22:31) solve difficult math problems, depends (00:22:33) on how you define math. But like these (00:22:36) this is a capability our AIs have but (00:22:38) that no almost no animal has. And so (00:22:41) it's quite interesting what ends up (00:22:42) being difficult and what ends up being (00:22:44) easy. (00:22:46) >> Moravec's paradox. (00:22:47) >> That's right. (00:22:48) >> For the era of experience to commence, (00:22:50) we're going to need to train AIs in (00:22:52) complex real world environments. But (00:22:54) building effective RL environments is (00:22:56) hard. You can't just hire a software (00:22:58) engineer and have them write a bunch of (00:22:59) cookie cutter validation tests. Real (00:23:02) world domains are messy. You need deep (00:23:04) subject matter experts to get the data, (00:23:06) the workflows, and all the subtle rules (00:23:08) right. When one of Labelbox's customers (00:23:10) wanted to train an agent to shop online, (00:23:12) Labelbox assembled a team with a ton of (00:23:14) experience engineering internet (00:23:16) storefronts. For example, the team built (00:23:18) a product catalog that could be updated (00:23:20) during the episode because most shopping (00:23:22) sites have constantly changing state. (00:23:24) They also added a Redis cache to (00:23:25) simulate stale data since that's how (00:23:28) real e-commerce sites actually work. 
(00:23:30) These are the kinds of things that you (00:23:31) might not have naively thought to do, (00:23:33) but that Labelbox can anticipate. These (00:23:35) details really matter. Small tweaks are (00:23:37) often the difference between cool demos (00:23:39) and agents that can actually operate in (00:23:41) the real world. So whether it's (00:23:42) correcting traces that you already (00:23:44) produced or building an entirely new (00:23:46) suite of environments, Labelbox can help (00:23:48) you turn your RL projects into working (00:23:50) systems. Reach out at (00:23:53) labelbox.com/thearcash. (00:23:56) All right, back to Richard. (00:23:58) >> This alternative paradigm that you're (00:24:00) imagining, (00:24:00) >> the experiential paradigm, let's lay out (00:24:03) a little bit what it is. It says that (00:24:06) experience, action, sensation, well, (00:24:10) sensation, action, reward, and this happens (00:24:13) on and on and on, makes for life. It's it (00:24:16) says that this is the uh foundation and (00:24:19) the focus of intelligence. Intelligence (00:24:21) is about taking that stream and altering (00:24:25) the actions to increase the the rewards (00:24:28) in the stream. (00:24:29) >> Right? So learning then is from the (00:24:32) stream and learning is about the stream. (00:24:35) So it's that that second part is is is (00:24:38) particularly telling, you know, that what (00:24:41) you learn, your knowledge, your knowledge (00:24:43) is about the stream. Your knowledge is (00:24:45) about if you do some action what will (00:24:47) happen or it's about uh which events (00:24:51) will follow other events. It's about the (00:24:53) stream. It's the content of the (00:24:55) knowledge is is statements about the (00:24:58) stream. Um, and so because it it's it's (00:25:01) a statement about the stream, you can (00:25:02) test it by comparing it to the stream (00:25:04) and you can learn it continually. 
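The sensation, action, reward stream described above can be sketched as a small loop. This is a hedged illustration under stated assumptions, not Sutton's own code: the toy environment `TwoArmedWorld`, its reward probabilities, and the epsilon-greedy choice rule are all invented here to make the loop runnable. The only knowledge the agent holds is a prediction of reward per action, which it checks against the stream on every step.

```python
import random


class TwoArmedWorld:
    """Hypothetical toy world: action 1 is rewarded more often than action 0."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        p_reward = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < p_reward else 0.0
        sensation = 0  # a single, uninformative observation in this toy case
        return sensation, reward


def run_stream(steps=2000, step_size=0.1, epsilon=0.1, seed=0):
    """Act, observe the reward, and update predictions, on and on."""
    world = TwoArmedWorld(seed)
    rng = random.Random(seed + 1)
    value = [0.0, 0.0]  # predicted reward for each action
    total_reward = 0.0
    for _ in range(steps):
        # Mostly take the action predicted to be best; sometimes try the other.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = 0 if value[0] > value[1] else 1
        _, reward = world.step(action)
        # The prediction is a statement about the stream, so it can be
        # tested against the stream and corrected on every step.
        value[action] += step_size * (reward - value[action])
        total_reward += reward
    return value, total_reward / steps


value, average_reward = run_stream()
```

There is no separate training phase in this loop: the same interaction that earns reward also supplies the data for learning, which is the sense of "continual" used in the conversation.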
(00:25:06) >> So when you're imagining this future (00:25:09) continual learning agent, (00:25:10) >> they're not future. Of course, we they (00:25:12) exist all all the time. It's I mean this (00:25:14) is what reinforcement learning paradigm (00:25:15) is learning from experience. (00:25:17) >> Yeah, I guess the maybe I what I meant (00:25:19) to say is uh human level general (00:25:21) continual learning agent. (00:25:24) >> What is the reward function? Is it just (00:25:26) predicting the world? Is it uh is it (00:25:29) then having a specific effect on it? (00:25:32) What would the general reward function (00:25:34) be? (00:25:34) >> The reward uh function is arbitrary (00:25:38) and um so if you're playing chess, it's (00:25:40) to win the game of chess. (00:25:42) If you were to um uh if you're a (00:25:46) squirrel, maybe the the reward has to do (00:25:48) with getting nuts, (00:25:49) >> right? (00:25:50) Um in general for an animal you would (00:25:53) say the reward is to avoid pain and to (00:25:57) acquire pleasure (00:25:59) >> right (00:26:00) >> uh (00:26:01) and there's also would be a component (00:26:03) having to do with uh I think there would (00:26:05) be should be a component having to do (00:26:07) with your uh increasing understanding of (00:26:10) your of your environment that would be (00:26:13) sort of an intrinsic motivation. (00:26:15) >> I see. I guess this AI would be deployed (00:26:19) to like lots of people would want it to (00:26:22) be doing lots of different kinds of (00:26:23) things, (00:26:24) >> right? 
So it's performing the task (00:26:26) people want but at the same time it's (00:26:28) learning about the world from doing that (00:26:30) task and do you do you imagine okay so (00:26:34) we get rid of this paradigm where (00:26:36) there's training periods and then (00:26:38) there's deployment periods but then is (00:26:41) there do we also get rid of this (00:26:42) paradigm when there's the model and then (00:26:45) instances of the model or copies of the (00:26:47) model that are you know doing certain (00:26:49) things h how do you think about the fact (00:26:52) that there we'd want this thing to be (00:26:54) doing different things, we'd want to (00:26:55) aggregate the knowledge that it's (00:26:57) gaining from doing those different (00:26:58) things. (00:26:59) >> I don't like the word model when used (00:27:01) the way you just did. I I think a better (00:27:04) word would be the network. So, I think (00:27:06) you mean the the network. Maybe there's (00:27:10) many networks. So anyway, things would (00:27:12) be learned and then you'd have copies (00:27:15) and many instances and sure you'd want (00:27:17) to share knowledge across the uh (00:27:20) instances and there would be lots of (00:27:22) possibilities for doing that like there (00:27:24) is not today. You can't have one child (00:27:26) learn grow up and and learn about the (00:27:29) world and then and then every new child (00:27:31) has to repeat that process. Whereas with (00:27:34) AIS, with a digital intelligence, you (00:27:37) could hope to do it once and then copy (00:27:38) it into the next one as a starting (00:27:40) place. (00:27:40) >> Right? (00:27:41) >> So this would be a huge savings and I (00:27:44) think actually it would be much more (00:27:46) important than uh trying to learn from (00:27:48) people. (00:27:50) >> I agree that the kind of thing you're (00:27:52) talking about is necessary regardless of (00:27:55) whether you start from LLMs or not. (00:27:57) Right? 
If you want human or animal level (00:27:59) intelligence, you're going to need this (00:28:01) capability. Suppose a human is trying to (00:28:03) make a startup, right? And this is a (00:28:06) thing which has a reward on the order of (00:28:08) 10 years. Once in 10 years, you might (00:28:10) have an exit where you get, you know, (00:28:11) paid out a billion dollars. But humans (00:28:13) have this ability to make intermediate (00:28:16) auxiliary rewards or have some way of (00:28:18) even when they have extremely sparse rewards, (00:28:19) they can still make intermediate steps (00:28:23) having an understanding of like what the (00:28:24) next thing they're doing leads to this (00:28:26) grander goal we have. And so how do you (00:28:28) imagine such a process might play out (00:28:30) with AIs? >> So this is something we know (00:28:33) very well (00:28:34) >> and the basis of it is temporal (00:28:36) difference learning (00:28:37) >> where the same thing happens (00:28:40) on a less grandiose scale. Like when you learn (00:28:42) to play chess, you have the (00:28:45) long-term goal of winning the game, and (00:28:47) yet you want to be (00:28:50) able to learn from shorter term things (00:28:51) like, you know, taking your opponent's (00:28:53) pieces. And so you do that by having a (00:28:57) value function which predicts the (00:28:58) long-term outcome, right? (00:29:00) >> And then if you take the other guy's pieces, (00:29:02) your prediction about the long-term (00:29:04) outcome is changed. It goes up. You (00:29:06) think you're going to win, and then that (00:29:07) increase in your belief (00:29:10) immediately, quote, reinforces the (00:29:14) move that led to taking the piece. (00:29:16) >> Mhm. (00:29:17) >> Okay. So, we have this long-term 10-year (00:29:19) goal of making a startup and making a (00:29:21) lot of money. 
And so, when we make (00:29:23) progress, we say, "Oh, I'm more (00:29:26) likely to achieve the long-term (00:29:29) goal," and that rewards the steps (00:29:32) along the way, (00:29:33) >> right? And then you also want some (00:29:35) ability for information that you're (00:29:37) learning. I mean, one of the things that (00:29:40) makes humans quite different from these (00:29:42) LLMs is that if you're onboarding on a (00:29:44) job, you're picking up so much (00:29:46) context and information, and that's what (00:29:48) makes you useful at the job, right? (00:29:49) You're uh everything from how your (00:29:51) client has preferences to how the (00:29:54) company works to everything. Um, and is (00:29:57) the bandwidth of information that you (00:29:59) get from a procedure like TD learning (00:30:01) high enough to have this like huge pipe (00:30:04) of like context and tacit knowledge that (00:30:06) you need to be picking up the way humans (00:30:08) do when they're just like (00:30:10) deployed? >> Um, (00:30:12) I think the crux of this, and I'm not (00:30:15) sure, but (00:30:17) the big world hypothesis seems very (00:30:20) relevant, and the reason why humans (00:30:22) become useful on their job is because (00:30:25) they are encountering the particular (00:30:27) part of the world. That's right. And um (00:30:29) it can't have been anticipated and (00:30:31) it can't all have been put in in (00:30:33) advance. The world is so huge (00:30:38) that you can't. The dream as I see it, (00:30:41) the dream of large language models, is (00:30:43) you can teach the agent (00:30:45) everything and it will know everything (00:30:46) and it won't have to learn anything (00:30:49) online (00:30:50) >> right (00:30:50) >> during its life. Okay. 
and and your (00:30:53) examples are all well really you have to (00:30:55) because you can there's a lot to you can (00:30:58) teach it but there's all little (00:31:00) idiosyncrasies of the particular life (00:31:01) they're leading and the the particular (00:31:03) people they're working with and what (00:31:04) they like as opposed to what average (00:31:07) people like right (00:31:08) >> and so that's just saying the world is (00:31:10) really big and so you're going to have (00:31:11) to learn it uh along the way (00:31:14) >> yeah so it seems to me you need two (00:31:15) things one is some way of converting (00:31:17) this long run goal reward into smaller (00:31:22) auxiliary or you know um these like (00:31:26) predictive rewards of the future reward (00:31:27) or the future reward at least to the (00:31:29) final reward then you need some other (00:31:31) way initially it seems to me you need (00:31:33) some way of then okay I'm (00:31:35) I need to hold on to all this context (00:31:38) that I'm gaining as I'm (00:31:41) working in the world right I'm like (00:31:42) learning about (00:31:44) my clients my my company all this (00:31:47) information and I'm so I would say (00:31:51) you're just doing regular learning. (00:31:52) >> Yeah, (00:31:53) >> maybe you're using context because in (00:31:54) large language models, all that (00:31:56) information has to go into the context (00:31:58) window, (00:31:58) >> right? (00:31:59) >> But in in a continual learning setup, it (00:32:02) just goes into the weights. (00:32:03) >> Maybe maybe Yeah. So maybe context is (00:32:04) the wrong word to use because I mean a (00:32:06) more general thing. (00:32:06) >> You learn a policy that's specific to (00:32:08) the environment that you're finding (00:32:10) yourself in. (00:32:10) >> Yeah. 
So the question I'm trying to ask (00:32:14) is you need some way of getting like how (00:32:18) many bits per second are you picking (00:32:20) like is a human picking up when they're (00:32:22) you know out in the world, right? Um if (00:32:24) you're just like interacting over Slack (00:32:25) with your clients and everything. (00:32:27) >> So maybe you're trying to ask the (00:32:28) question of it seems like the reward is (00:32:31) too small of a thing to do all the (00:32:32) learning that we need to do. But of (00:32:34) course we have the sensations, (00:32:38) we have all the other information (00:32:40) we can learn from (00:32:41) >> right (00:32:41) >> we don't just learn from the reward, we (00:32:43) learn from all the data (00:32:45) >> yeah so what is the learning process (00:32:48) which helps you capture that information (00:32:51) >> so now I want to talk about the base (00:32:55) common model of the agent with the four (00:32:57) parts (00:32:58) >> right (00:32:58) >> so we need a policy. The policy says, in (00:33:02) the situation I'm in, what should I do? We (00:33:05) need a value function. The value (00:33:06) function is the thing that is learned (00:33:08) with TD learning, and the value function (00:33:10) produces a number. The number says how (00:33:12) well is it going, (00:33:14) >> and then you watch if that's going up (00:33:15) and down and use that to adjust your (00:33:18) policy. Okay. So those two things, (00:33:21) and then there's also the perception (00:33:24) component, which is the construction of (00:33:27) your state representation. This is your (00:33:29) sense of where you are now. And the (00:33:31) fourth one is what we're really getting (00:33:32) at most transparently. Anyway, the (00:33:35) fourth one is the transition model of (00:33:37) the world. Um, that's why I am (00:33:39) uncomfortable just calling everything (00:33:41) models, because I want to talk about the (00:33:42) model of the world. 
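As a rough illustration of the value function "learned with TD learning" that "produces a number" saying how well things are going, here is a minimal tabular TD(0) sketch in Python. The setting is an assumption chosen for brevity, not something from the conversation: a five-state random walk that ends with reward 1 on the right and reward 0 on the left, no discounting, and a hypothetical helper named `td0_random_walk`. Each step nudges the current state's value toward the reward plus the next state's value, so a rise in the predicted outcome is felt immediately rather than only at the end of the episode.

```python
import random


def td0_random_walk(num_episodes=5000, alpha=0.02, seed=0):
    """Estimate state values on a 5-state random walk with tabular TD(0)."""
    rng = random.Random(seed)
    num_states = 5
    values = [0.5] * num_states  # initial guesses for each state's value
    for _ in range(num_episodes):
        state = num_states // 2  # every episode starts in the middle state
        while True:
            next_state = state + rng.choice([-1, 1])
            if next_state < 0:
                # Fell off the left end: episode over, reward 0.
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= num_states:
                # Fell off the right end: episode over, reward 1.
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD error: how much better or worse things look after this step.
            td_error = reward + next_value - values[state]
            values[state] += alpha * td_error
            if done:
                break
            state = next_state
    return values


if __name__ == "__main__":
    for index, value in enumerate(td0_random_walk()):
        print(f"state {index}: estimated value {value:.2f}")
```

For this walk the true values are 1/6, 2/6, 3/6, 4/6, and 5/6 from left to right, and the estimates should settle near them. Using the rise and fall of that number to adjust a policy, as described above, is left out of this sketch.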
The transition model (00:33:45) of the world, your belief that if you do (00:33:47) this, what will happen? What will be the (00:33:49) consequences of what you do? So your (00:33:51) physics of the world, but it's also not (00:33:52) just physics. It's also abstract (00:33:55) models like, you know, your model of how (00:33:56) you traveled from California up to (00:34:00) Edmonton for this podcast. That was a (00:34:01) model and that's a transition model and (00:34:03) that would be (00:34:04) >> learned, and it's not learned from (00:34:06) reward, it's learned from: you did things, (00:34:08) you saw what happened, (00:34:09) >> you made that model of the world. That (00:34:11) will be learned very richly from (00:34:14) all the sensation that you receive, not (00:34:16) just from the reward. (00:34:18) >> It has to include the reward as well, but (00:34:21) that's a small part of the whole (00:34:23) model, a small crucial part of the whole (00:34:25) model. (00:34:25) >> Yeah. One of my friends Toby Ord (00:34:27) pointed out that if you look at the (00:34:30) MuZero models that Google DeepMind (00:34:33) deployed to learn Atari games, that these (00:34:35) models were initially (00:34:38) not a general intelligence itself but a (00:34:40) general framework for training (00:34:42) specialized intelligences to play (00:34:44) specific games. That is to say that you (00:34:46) couldn't, using that framework, train a (00:34:49) policy to play both chess and Go and (00:34:52) some other game. You had to train each (00:34:54) one in a specialized way. And he was (00:34:57) wondering whether that implies that (00:34:59) reinforcement learning generally, because (00:35:02) of this information constraint, you (00:35:04) can only learn one thing at a time. (00:35:06) Uh, the density of information isn't that (00:35:08) high, or whether it was just specific to (00:35:10) the way that MuZero was done. 
And if it's (00:35:12) specific to AlphaZero, what (00:35:15) needed to be changed about that (00:35:17) approach so that it could be a general (00:35:19) learning agent? (00:35:21) >> The idea is totally general. You (00:35:23) know, I do use all the time as my (00:35:27) canonical example, the idea of an AI (00:35:29) agent is like a person. (00:35:30) >> Yeah. >> And people, in some sense, (00:35:34) they have just one world they live in, (00:35:37) and that world may involve chess and (00:35:40) it may involve Atari games. But those (00:35:42) are not a different task or a (00:35:44) different world. Those are different (00:35:45) states, right, they encounter. And so (00:35:49) the general idea is not limited at all. (00:35:51) >> So maybe it would be useful to explain (00:35:54) what was missing in that architecture or (00:35:57) that approach which this continual (00:36:02) learning AGI would have. >> They just (00:36:05) set it up. It was not their (00:36:07) ambition to have one agent across (00:36:12) those games. If we want to talk about (00:36:14) transfer, we should talk about transfer (00:36:16) not across games or across tasks, but (00:36:20) transfer between states. (00:36:22) >> Yeah. I guess I'm curious about (00:36:25) historically, have we seen the (00:36:29) level of transfer (00:36:31) using RL techniques that would be needed (00:36:33) to build this kind of (00:36:35) >> Okay, good. Good. We're not seeing (00:36:37) transfer anywhere. (00:36:39) Critical to good performance is (00:36:42) that you can generalize well from one (00:36:44) state to another state. (00:36:46) >> We don't have any methods that are good (00:36:47) at that. What we have are people try (00:36:51) different things and they settle on (00:36:54) something, a representation (00:36:57) that transfers well or (00:36:58) generalizes well. 
But we have no, we (00:37:00) don't have any automated techniques to (00:37:02) promote, we have very few automated (00:37:05) techniques to promote transfer, and (00:37:07) none of them are used in (00:37:10) modern deep learning. (00:37:11) >> Um, let me paraphrase to make sure that I (00:37:15) understood that correctly. (00:37:17) It sounds like you're saying that when (00:37:19) we do have generalization in these (00:37:22) models, that is a result of some (00:37:26) sculpted (00:37:28) uh (00:37:28) >> humans did it. (00:37:29) >> Yeah. (00:37:30) >> The researchers did it because there's (00:37:31) no other explanation. I mean, gradient (00:37:34) descent will not make you generalize (00:37:35) well. It will make you solve the problem. (00:37:38) >> Right. (00:37:38) >> It will not make you, if you get new (00:37:40) data, (00:37:42) generalize in a good way. (00:37:43) Generalization means training on one thing (00:37:46) affects what you do on the other (00:37:47) things. So we know deep learning is (00:37:50) really bad at this. For example, we know (00:37:52) that if you train on some new thing, it (00:37:54) will often catastrophically interfere (00:37:56) with all the old things that you (00:37:58) knew. (00:37:59) >> So this is exactly bad generalization. (00:38:01) Right. (00:38:02) >> Now generalization, as I said, is some (00:38:04) kind of influence of training on one (00:38:07) state on other states, and generalization (00:38:09) is not necessarily good or bad, right? (00:38:11) Just the fact that you generalize is not (00:38:13) necessarily good or bad. You can (00:38:14) generalize poorly, you can generalize (00:38:16) well. (00:38:16) >> Right. (00:38:17) >> So generalization always (00:38:19) will happen, but we need algorithms (00:38:21) that will cause the (00:38:24) generalization to be good rather than (00:38:25) bad. (00:38:26) >> I'm not trying to kickstart this (00:38:29) initial crux proxy, but I'm 
just (00:38:32) genuinely curious because I think I (00:38:34) might be using the term differently. I (00:38:35) mean, one way to think about it is these (00:38:36) LLMs are increasing the scope of (00:38:39) generalization from like earlier systems (00:38:42) which could not really even do a basic (00:38:44) math problem to now they can do anything (00:38:47) in this class of Math Olympiad type (00:38:49) problems, right? So, you initially start (00:38:51) with like they can generalize among (00:38:52) addition problems at least. Then (00:38:54) you generalize to like they can (00:38:56) generalize among like problems which (00:38:59) require use of different kinds of (00:39:02) mathematical techniques and theorems and (00:39:04) you know conceptual categories which is (00:39:06) like what the Math Olympiad requires. (00:39:08) And so it sounds like you don't think of (00:39:10) that being able to solve any problem (00:39:12) within that category as an example of (00:39:15) generalization or let me know if I'm (00:39:17) misunderstanding that. >> Well, large (00:39:19) language models are so complex. We don't (00:39:22) really know what information they (00:39:24) had prior. We have to guess (00:39:28) because they've been fed so much. This (00:39:30) is one reason why they're not a good way (00:39:32) to do science. It's just so (00:39:36) uncontrolled, so unknown. (00:39:38) >> But if you come up with an entirely new, (00:39:40) >> they're getting a bunch of things right (00:39:42) >> perhaps. And so the question is why? (00:39:45) Well, it may be that they don't need to (00:39:47) generalize to get them right because the (00:39:48) only way to get some of them right is (00:39:51) to form something which gets all of them (00:39:53) right. (00:39:54) >> So, you know, if there's only one answer (00:39:57) and you find it, that's not (00:39:59) called generalization. 
It's just (00:40:01) the only way to solve it and so they (00:40:03) find the only way to solve it. (00:40:04) >> Generalization is when it could be this (00:40:06) way, it could have been that way, and they do (00:40:08) it the good way. >> My understanding is (00:40:10) that this is working (00:40:13) better and better with coding (00:40:15) agents. So engineers, obviously if you're (00:40:18) trying to program a library, (00:40:21) there's many different ways you could (00:40:22) achieve the end spec, and an initial (00:40:24) frustration with these models has been (00:40:25) that they'll do it in a way that's (00:40:27) sloppy, and then over time they're (00:40:29) getting better and better at coming up (00:40:32) with the design architecture and the (00:40:34) abstractions that developers find more (00:40:36) satisfying. And it seems like an example (00:40:40) of what you're talking about. (00:40:41) >> Well, there's nothing in them which will (00:40:43) cause it to generalize well. The (00:40:46) gradient descent will cause them to find (00:40:49) a solution to the problems they've seen. (00:40:52) And if there's only one way to solve (00:40:54) them, you know, they'll do it. But if (00:40:55) there are many ways to solve it, some (00:40:56) which generalize well, some which (00:40:58) generalize poorly, there's nothing in (00:41:00) the algorithms that will cause (00:41:02) them to generalize well. (00:41:04) >> But people of course are involved. 
And (00:41:06) you know, if it's not working out, (00:41:08) you know, they fiddle with it until (00:41:10) they find a way, perhaps until they find (00:41:12) a way in which it generalizes well. So to (00:41:15) prep for this interview I wanted to (00:41:17) understand the full history of RL (00:41:20) starting with REINFORCE up to current (00:41:22) techniques like GRPO, and I didn't just (00:41:24) want a list of equations and algorithms. (00:41:27) I wanted to really understand each (00:41:29) change in this progression and the (00:41:31) underlying motivation, you know, what was (00:41:33) the main problem that each successive (00:41:34) method was actually trying to solve. So (00:41:36) I had Gemini Deep Research walk me (00:41:38) through this entire timeline step by (00:41:40) step. It explained the last 20 years of (00:41:42) gradual innovation and explained how (00:41:44) each step made the RL learning process (00:41:47) more stable or more sample efficient or (00:41:50) more scalable. I asked Deep Research to (00:41:52) put all of this together like an Andrej (00:41:53) Karpathy-style tutorial and it did that. (00:41:56) What was cool is that it combined this (00:41:58) whole lesson together into one coherent (00:42:00) cohesive document in the style that I (00:42:02) wanted. It was also great that it (00:42:03) assembled all of the best links in the (00:42:05) same place so that if I wanted to (00:42:06) understand any specific algorithm (00:42:08) better, I could just access the right (00:42:10) explainer right there. Go to (00:42:12) gemini.google.com (00:42:14) to try it out yourself. All right, back (00:42:16) to Richard. I want to zoom out and ask (00:42:19) about so being in the field of AI for (00:42:24) longer than almost anybody who's (00:42:25) commentating on it or working in it (00:42:27) now. I'm just curious about what the (00:42:30) biggest surprises have been. 
How much (00:42:33) new stuff you feel like is coming out or (00:42:35) does it feel like people are just (00:42:36) playing with old ideas? Um zooming out (00:42:40) you know you you got into this even (00:42:42) before like deep learning was popular. (00:42:43) So how do you see this trajectory of (00:42:46) this field over time and how new ideas (00:42:49) have come about and everything and (00:42:50) what's been surprising? (00:42:52) >> Okay so yeah I I I um thought a little (00:42:56) bit about this. There are many things or (00:42:59) a handful of things. Um first the large (00:43:02) language models are surprising. It's (00:43:04) surprising how how effective um neural (00:43:07) networks artificial neural networks are (00:43:10) at at language tasks. You know that that (00:43:13) was a surprise. Wasn't expected. (00:43:15) Language seemed different. So that's (00:43:17) impressive. (00:43:18) >> Um there's a longstanding controversy in (00:43:22) AI about uh simple basic principle (00:43:25) methods. uh the the general purpose (00:43:29) methods like search and learning and (00:43:31) compared to um human enabled systems uh (00:43:37) like symbolic methods and um uh so in (00:43:41) the old days it was interesting because (00:43:43) things like search and learning were (00:43:44) called weak methods because they're just (00:43:46) they just use general principles. (00:43:47) They're not using uh the power that (00:43:49) comes from uh imbuing a system with (00:43:52) human knowledge. 
So those are called (00:43:53) strong, and so I think the weak (00:43:57) methods have just, you know, totally won. (00:44:01) That's, you know, that's the (00:44:03) biggest question from the old days of (00:44:07) AI, what would happen, and you know, yeah, (00:44:10) learning and search have just won the (00:44:12) day (00:44:13) >> right (00:44:13) >> but there's a sense in which that was not (00:44:15) surprising to me because I was always (00:44:17) voting for or hoping or rooting for the (00:44:19) simple basic principles (00:44:21) >> and so even with the large language (00:44:23) models, it's surprising how well it (00:44:25) worked, but it was all good (00:44:27) and gratifying. And things like (00:44:31) AlphaGo, it's sort of surprising (00:44:33) how well that was able to work. Um, and (00:44:36) AlphaZero in particular, how well it (00:44:38) was able to work. Um, but it's all very (00:44:41) gratifying because again, simple (00:44:42) basic principles are winning the day. (00:44:45) >> Has it felt like whenever the public (00:44:49) conception (00:44:50) has been changed because some new (00:44:52) technique was or sorry some new (00:44:54) application was developed, for example (00:44:55) when AlphaZero became this viral (00:44:58) sensation, to you as somebody who (00:45:01) literally came up with many of the (00:45:02) techniques that were used, did it feel (00:45:04) to you like new breakthroughs were made (00:45:06) or does it feel like oh we've had these (00:45:08) techniques since the '90s and people are (00:45:11) simply combining them and applying them (00:45:13) now? 
>> So the whole AlphaGo thing had a (00:45:16) precursor, which is TD-Gammon. Gerry Tesauro (00:45:19) did exactly reinforcement learning, (00:45:24) temporal difference learning methods, to (00:45:25) play backgammon (00:45:28) >> right (00:45:29) >> and it beat the world's best (00:45:31) players and it worked really well. And so (00:45:34) in some sense AlphaGo was merely a (00:45:37) scaling up of that process, but it was (00:45:39) quite a bit of scaling up and there was (00:45:41) also an additional innovation in how the (00:45:44) search was done, (00:45:45) >> right? (00:45:46) >> But it made sense. It wasn't surprising (00:45:48) in that sense. AlphaGo actually didn't (00:45:52) use TD learning. It waited to see the (00:45:55) final outcomes. But AlphaZero used (00:45:58) TD, and AlphaZero was applied to all (00:46:01) the other games and did extremely well. (00:46:04) I've always been very (00:46:06) impressed by the way AlphaZero plays (00:46:08) chess because I'm a chess player and it (00:46:10) just sacrifices material (00:46:13) for sort of positional advantages and (00:46:15) it's just content and patient to (00:46:19) sacrifice that material for a long (00:46:20) period of time. And so that was (00:46:23) surprising that it worked so well but (00:46:26) also gratifying and fitting into my (00:46:29) worldview. So this has led me where (00:46:33) I am. Where I am is I'm in some sense a (00:46:35) contrarian or someone thinking differently (00:46:38) from how the field is, and I am (00:46:41) personally just kind of content being (00:46:43) out of sync with my field for a long (00:46:45) period of time, perhaps decades, (00:46:47) because occasionally I have been proved (00:46:50) right in the past. 
And the other (00:46:54) thing I do to help me not feel I'm (00:46:57) out of sync and thinking in a strange (00:46:59) way is to look not at my local (00:47:03) environment or my local field, but to (00:47:05) look back in time into history and to (00:47:08) see what people have thought classically (00:47:10) about the mind in many (00:47:13) different fields. And I don't feel I'm (00:47:15) out of sync with the larger traditions. (00:47:18) >> I really view myself as a classicist (00:47:20) rather than as a contrarian. I go to (00:47:22) what the larger community of (00:47:26) thinkers about the mind have always (00:47:27) thought. (00:47:28) >> Okay. Some sort of left field questions (00:47:31) for you if you'll tolerate them. Um so (00:47:33) the way I read the bitter lesson is that (00:47:36) it's not saying necessarily that human (00:47:39) artisanal researcher tuning doesn't work (00:47:43) but that it obviously scales much worse (00:47:46) than compute which is growing (00:47:48) exponentially. And so you want (00:47:50) techniques which leverage the latter. (00:47:52) >> Yep. (00:47:52) >> And once we have AGI, we'll have (00:47:57) researchers which scale linearly with (00:47:59) compute, right? So we'll have this (00:48:00) avalanche of millions of AI researchers (00:48:03) and their stock will be growing as fast (00:48:05) as compute. And so maybe this will (00:48:08) mean that it is rational or it will make (00:48:10) sense to have them doing good (00:48:13) old-fashioned AI and doing these artisanal (00:48:16) solutions. Does that, as a vision of (00:48:20) what happens after AGI in terms of how (00:48:22) AI research will evolve, I wonder if (00:48:23) that's still compatible with the bitter (00:48:25) lesson. (00:48:25) >> Well, how did we get to this AGI and you (00:48:28) want to presume that it's been done? 
(00:48:31) >> So, suppose it started with general (00:48:32) methods, but now we've got the AGI and (00:48:34) now we want to go (00:48:37) >> Hm. (00:48:38) >> We're done. (00:48:39) >> Interesting. You don't think that (00:48:40) there's (00:48:41) anything above AGI? (00:48:44) >> Well, but you're using it to get AGI (00:48:46) again. >> Well, I'm using it to get (00:48:48) superhuman levels of intelligence or (00:48:49) competence at different tasks. (00:48:51) >> So, these AGIs, if they're not (00:48:52) superhuman already, then the (00:48:57) knowledge they might impart would be not (00:48:59) superhuman. (00:49:00) >> I guess there's different gradations of (00:49:02) your (00:49:02) >> I'm not sure your idea makes (00:49:05) sense because it seems to (00:49:06) presume the existence of AGI, and (00:49:09) then that we've already worked that out. (00:49:12) >> So, maybe one way to motivate this is (00:49:14) AlphaGo was superhuman. It beat any (00:49:17) Go player. AlphaZero would beat AlphaGo (00:49:20) every single time. So there's ways to (00:49:22) get more superhuman than even (00:49:24) superhuman (00:49:26) >> and it was a different architecture. And (00:49:27) so it seems plausible to me that (00:49:30) >> well the agent that's like able to (00:49:31) generally learn across all domains. (00:49:33) There would be ways to give it (00:49:36) better architecture for learning just (00:49:37) the same way AlphaZero was an improvement (00:49:39) upon AlphaGo and MuZero was an (00:49:41) improvement upon AlphaZero. >> And the way (00:49:42) AlphaZero was an improvement was it did (00:49:45) not use the human knowledge but just (00:49:48) went from experience. (00:49:49) >> Right. 
(00:49:50) >> So why do you say (00:49:53) >> but (00:49:53) >> bring in other agents' expertise to teach (00:49:56) it when it's worked (00:50:00) so well from experience (00:50:02) and not by help from another agent? >> I (00:50:05) agree that in that particular case (00:50:07) it was moving to more general methods, (00:50:09) but I meant to use that example to (00:50:11) illustrate that it's possible to go from (00:50:13) superhuman to superhuman+ to (00:50:15) superhuman++. (00:50:17) >> Yeah. (00:50:17) >> And I'm curious if you think those (00:50:18) gradations will continue to happen by (00:50:20) just making the method simpler or (00:50:23) because we'll have the capability of (00:50:25) these millions of minds who can then add (00:50:27) complexity as needed, (00:50:29) if that will continue to be (00:50:31) a false path even when you have billions (00:50:33) of AI researchers or trillions of AI (00:50:35) researchers. (00:50:36) >> I think more interesting is just (00:50:38) think about that case (00:50:39) >> which when you have many AIs, um, will (00:50:43) they help each other the way cultural (00:50:46) evolution works in people? (00:50:48) >> And maybe we should talk (00:50:50) about that. (00:50:50) >> Yeah, for sure. (00:50:51) >> The bitter lesson. Oh, who cares about (00:50:52) that? That's an empirical (00:50:54) observation about a particular period in (00:50:56) history, 70 years in history. It (00:50:58) doesn't necessarily have to apply to the (00:51:00) next 70 years. So the interesting (00:51:02) question is you're an AI, you get some (00:51:04) more computer power. 
Should you use it (00:51:05) to make yourself, you know, more (00:51:07) computationally capable or should you (00:51:09) use it to spawn off a copy of yourself (00:51:12) to go learn something interesting on the (00:51:13) other side of the planet or on some (00:51:15) other topic and then report back to you? (00:51:17) >> Yep. (00:51:18) I think that's a really interesting (00:51:20) question (00:51:21) um that will only arise in the (00:51:25) age of digital intelligences. (00:51:27) >> I'm not sure what the answer is, but I (00:51:28) think it will raise more questions. Will it (00:51:31) be possible to really, you know, spawn (00:51:33) it off, send it out, learn something (00:51:35) new, something perhaps very new, and then (00:51:38) will it be able to be reincorporated (00:51:40) into the original (00:51:41) >> or will it have (00:51:44) changed so much that it uh it can't (00:51:46) really be done, you know? Is that (00:51:48) possible or is it not? And you know, you (00:51:50) can carry this to its limit as I saw (00:51:53) one of your videos the other night that (00:51:55) suggested that it could (00:51:57) where you spawn off many many copies, do (00:51:59) different things. It's highly (00:52:00) decentralized, but report back to (00:52:02) the central master (00:52:04) and that this will be such a (00:52:06) powerful thing. Well, I think one thing (00:52:09) that uh so this is my attempt to add (00:52:11) something to this view is that uh a (00:52:14) big question, a big issue will become uh (00:52:18) corruption. You know, if you (00:52:20) really could just get information from (00:52:21) anywhere and bring it into your central (00:52:23) mind, you become more and more powerful. (00:52:26) Uh, and since it's all digital and they (00:52:29) all speak some internal digital (00:52:31) language, maybe it'll be easy and (00:52:33) possible.
But it will not be that easy, (00:52:37) as easy as you're imagining because uh (00:52:39) you can lose your mind this way. If (00:52:41) you pull in something from the (00:52:43) outside and build it into your (00:52:45) inner thinking, uh, it could take over (00:52:47) you. It could change you. It could be uh (00:52:51) your destruction rather than uh your (00:52:54) increment in knowledge. Mm. (00:52:56) >> I think this will become a big (00:52:58) concern, you know, particularly when (00:52:59) you're, oh, he's figured out all about, you (00:53:01) know, how to play some new game or (00:53:03) he's studied Indonesia and (00:53:05) you want to incorporate that into your (00:53:07) mind. Um, yeah. So, you could (00:53:10) think, oh, just read it all in and (00:53:12) that'll be fine. But no, you've just (00:53:14) read a whole bunch of bits into your (00:53:16) mind and uh they could have viruses in (00:53:20) them. They could have hidden goals. Uh, (00:53:23) they can uh warp you and change you and (00:53:26) this will become a big thing. How do you (00:53:28) have cyber security in the age of (00:53:31) digital spawning and reforming again? (00:53:34) >> It's interesting that both quant firms (00:53:37) and AI labs have a culture of secrecy (00:53:39) because both of them are operating in (00:53:41) incredibly competitive markets and their (00:53:43) success rests on protecting their IP. If (00:53:45) you're an AI researcher or engineer and (00:53:47) you're deciding where to work, most of (00:53:49) the quant firms or AI labs that you'll (00:53:51) be considering will be strongly siloing (00:53:53) their teams to minimize the risk of (00:53:55) leaks. Hudson River Trading takes the (00:53:57) opposite approach. Their teams openly (00:53:59) share their trading strategies and their (00:54:01) strategy code lives in a shared (00:54:03) monorepo.
At HRT, if you're a researcher and (00:54:06) you have a good idea, your contribution (00:54:08) will be broadly deployed across all (00:54:10) relevant strategies. This gives your (00:54:12) work a ton of leverage. You'll also (00:54:14) learn incredibly fast. You can learn (00:54:16) about other people's research and ask (00:54:18) questions and you can see how everything (00:54:20) fits together end to end from the (00:54:22) low-level execution of trades to the (00:54:24) high-level predictive models. HRT is (00:54:27) hiring. If you want to learn more, go to (00:54:30) hudsonrivertrading.com/dwarkesh. (00:54:34) All right, back to Richard. I guess this (00:54:36) brings us to the topic of AI succession. (00:54:39) >> Mhm. You have a perspective that's quite (00:54:41) different from a lot of people that I've (00:54:42) interviewed and maybe a lot of people (00:54:44) generally. So I also think it's a very (00:54:46) interesting perspective. I want to hear (00:54:48) about it. (00:54:48) >> Yeah. So I do think succession to (00:54:53) digital (00:54:54) or digital intelligence or augmented (00:54:57) humans is inevitable. So the argument goes, (00:55:00) I have a four-part argument. Step (00:55:03) one is (00:55:05) there's no government or organization (00:55:09) that uh gives humanity a unified (00:55:12) point of view that dominates and that (00:55:14) can arrange. There's no (00:55:16) consensus about how the world should be (00:55:18) run. And number two, um, we will figure (00:55:22) out how intelligence works. Researchers (00:55:24) will figure it out eventually. And (00:55:26) number three, we won't stop just with (00:55:28) human-level intelligence. We will (00:55:30) reach superintelligence. And number (00:55:32) four is that it's inevitable over (00:55:36) time that the most intelligent things (00:55:40) around would gain resources and power.
(00:55:43) Uh and uh so put all that together, it's (00:55:47) you know, um, it's sort of inevitable (00:55:50) that you're going to have um succession (00:55:53) to AI or to AI-enabled augmented humans. (00:55:58) So those four things seem (00:56:01) clear and sure to happen. Uh but (00:56:05) within that set of possibilities (00:56:07) there can be good outcomes as well as (00:56:09) less good outcomes, bad outcomes (00:56:12) >> and um (00:56:14) so I'm just trying to be realistic (00:56:16) about where we are and ask how we (00:56:20) should feel about it. Yeah, I agree (00:56:23) with all four of those arguments and the (00:56:25) implication and I also agree that (00:56:27) succession (00:56:30) contains (00:56:32) a wide variety of possible futures. So, (00:56:34) curious to get more thoughts on that. (00:56:36) >> Right. And so then I do encourage people (00:56:37) to think um positively about it first of (00:56:40) all because it's something we humans (00:56:42) have always tried to do for thousands of (00:56:45) years, trying to understand themselves, (00:56:46) trying to make themselves think better (00:56:49) and um (00:56:51) you know just understand themselves. So (00:56:53) this is a great success for science and the (00:56:57) humanities. Uh, we're finding out what (00:57:00) this essential part of humanness (00:57:03) is, what it means to be intelligent. And (00:57:06) then what I usually say is that this (00:57:09) is all kind of human-centric.
What if we (00:57:11) step aside from being a human (00:57:14) and just take the point of view of (00:57:16) the universe and this is I think a (00:57:18) major stage in the universe, a major (00:57:20) transition, a transition from replicators. (00:57:24) We humans and animals, plants, we're all (00:57:27) replicators and that gives some (00:57:30) strengths and some limitations and then (00:57:32) we're entering the age of design (00:57:34) because our AIs are designed, (00:57:37) all of our physical objects are designed, (00:57:39) our buildings are designed, (00:57:41) our technology is designed and (00:57:43) we're designing now uh AIs, things that (00:57:47) can be intelligent themselves and that (00:57:48) are themselves capable of design and so (00:57:51) this is a key step in the world (00:57:54) and in the universe and I think (00:57:57) it's the transition from the (00:57:59) world in which most of the interesting (00:58:00) things (00:58:02) uh are replicated. Replicated (00:58:05) means you can make copies of them uh but (00:58:08) you don't really understand them. Like (00:58:09) right now we make more intelligent (00:58:11) beings, more children. Uh we don't (00:58:14) really understand how intelligence (00:58:15) works. Whereas we're (00:58:18) reaching now to having designed (00:58:20) intelligence, intelligence that we do (00:58:22) understand how it works and therefore we (00:58:24) can change it in different ways and at (00:58:26) different speeds um than otherwise and (00:58:29) in our future they might not be (00:58:32) replicated at all. Like we may just (00:58:33) design AIs and those AIs will design (00:58:36) other AIs and um everything will be done (00:58:40) by design and construction rather than by (00:58:42) replication. (00:58:44) Yeah, I mark this as one of the four (00:58:46) great stages of the universe. First (00:58:48) there's dust, which ends with stars.
(00:58:51) And then stars make planets (00:58:53) and the planets give rise to life. And (00:58:56) now we're giving life to uh (00:58:58) designed entities. (00:59:01) And so I think we should be proud (00:59:04) that we are giving rise (00:59:08) to this great transition in the (00:59:09) universe. (00:59:11) Yeah. So it's an interesting thing. (00:59:13) Should we consider them (00:59:16) part of humanity or different from (00:59:17) humanity? It's our choice. It's our (00:59:20) choice whether we should say, oh, they are (00:59:21) our offspring and we should be proud of (00:59:24) them and we should celebrate their (00:59:25) achievements, or we could say, (00:59:26) oh no, they're not us and we should be (00:59:29) horrified. It's just (00:59:30) interesting that it feels (00:59:32) to me like a choice and yet it's such a (00:59:35) strongly uh held thing that how could it (00:59:38) be a choice? I like these sort of (00:59:39) contradictory uh implications of (00:59:41) thought. (00:59:43) >> It would be interesting to consider if (00:59:44) we were just designing another (00:59:46) generation of humans. (00:59:48) >> Yes, (00:59:49) >> design is the wrong word. But we knew a (00:59:50) future generation of humans is (00:59:51) going to come up. Forget about AI. We (00:59:54) just know in the long run humanity will (00:59:56) be more capable and maybe more numerous, (00:59:59) maybe more intelligent. How do we feel (01:00:01) about that? I do think there's potential (01:00:04) worlds with future humans that we would (01:00:06) be quite concerned about.
So are you (01:00:08) thinking like maybe we are like (01:00:10) the Neanderthals, we give rise to Homo (01:00:13) sapiens, maybe Homo sapiens will give (01:00:14) rise to a new group of people (01:00:17) >> something like that. Like I'm basically (01:00:19) taking the example you're giving of like (01:00:21) okay, even if you consider them part of (01:00:22) humanity, yeah (01:00:23) >> I don't think that necessarily means (01:00:26) that we should feel super comfortable (01:00:28) >> yeah, like Nazis were humans, right? If we (01:00:32) thought like, oh, the future generation (01:00:33) will be Nazis, I think we'd be like quite (01:00:35) concerned about just handing off power (01:00:37) to them. So, um, I agree that this is not (01:00:41) super dissimilar to worrying about more (01:00:43) capable future humans, but I don't think (01:00:45) that that addresses a lot of the (01:00:48) concerns people might have about this (01:00:50) level of power being attained this fast (01:00:52) with entities we don't fully understand. (01:00:54) >> Well, I think it's relevant to point out (01:00:56) that uh for most of humanity (01:01:00) um they don't have much uh influence on (01:01:05) what happens. Um, most of humanity (01:01:08) doesn't influence (01:01:10) >> who can control the atom bombs or who uh (01:01:16) controls the nation states. Even as (01:01:19) a citizen, I often feel that we don't (01:01:22) control the nation states very much. (01:01:24) They're out of control. A lot of it has (01:01:26) to do with just how you feel about (01:01:27) change. Um, and if you think the current (01:01:30) situation is really really good, then (01:01:33) you're uh more likely to be suspicious (01:01:36) of change and averse to change than if (01:01:37) you think um (01:01:40) it's imperfect. And I think it's (01:01:42) imperfect. In fact, I think it's pretty (01:01:44) bad. (01:01:46) >> So, I'm open to change.
I (01:01:50) think humanity has not had a (01:01:53) super good track record. And maybe (01:01:54) it's the best thing that there's been, (01:01:57) but it's far from perfect. (01:01:59) >> Yeah, I guess there's different (01:02:01) varieties of change. Um, the Industrial (01:02:05) Revolution was change. The Bolshevik (01:02:07) Revolution was also change. And if you (01:02:09) were around in Russia in the 1900s and (01:02:12) you're like, look, things aren't going (01:02:13) well. The tsar is kind of messing (01:02:15) things up. We need change. I'd want to (01:02:18) know what kind of change you wanted (01:02:20) before signing on the dotted line. (01:02:22) Right? And then similar with AI where (01:02:25) I'd want to understand and to the extent (01:02:27) it's possible (01:02:29) to change the trajectory of AI such that (01:02:31) the change is positive um for humans. (01:02:34) >> We should be concerned about (01:02:37) our future, the future. We should try (01:02:40) to make it good. Um, we also though (01:02:44) should recognize the limits, our limits. (01:02:47) And we're (01:02:50) I think we want to avoid the feeling of (01:02:52) entitlement. Avoid the feeling, oh, we (01:02:54) are here first. We should always have it (01:02:57) in a good way. Um, how should we think (01:03:00) about the future and how much control uh (01:03:04) a particular species on a particular (01:03:06) planet should have over it? Uh, and how (01:03:08) much control do we have? You know, a (01:03:11) counterbalance to our limited control (01:03:13) over the long-term future of humanity (01:03:18) should be how much control do we have (01:03:20) over our own lives? Like we have uh our (01:03:23) own goals and we have our families (01:03:26) and those things are much more (01:03:28) controllable than like trying to control (01:03:30) um the whole universe, (01:03:32) >> right?
Um so I think it's appropriate (01:03:36) you know for us to uh you know really (01:03:41) work towards our own local goals and uh (01:03:45) and it's kind of aggressive for us (01:03:46) saying, oh, the future has to evolve this (01:03:49) way that I want it to. (01:03:50) >> Sure. (01:03:51) >> Because then we'll have arguments like (01:03:52) different people think the future, the (01:03:54) global future, should evolve in different (01:03:56) ways and then they have conflict and (01:03:58) >> yeah, avoid that. Maybe a good (01:04:00) analogy here would be, okay, so suppose (01:04:04) you're raising your own children (01:04:07) >> it might not be appropriate to have (01:04:09) extremely tight goals for their own life (01:04:11) or also have some sense of like I want (01:04:13) my children to go out there in the world (01:04:15) and have this specific impact, you know, (01:04:17) my son's going to become president (01:04:18) and my daughter's going to become CEO of (01:04:20) Intel and like together they're going to (01:04:22) have this effect on the world. Um, but (01:04:26) people do have the sense and I think (01:04:27) this is appropriate of saying, "I'm (01:04:29) going to give them good, robust values (01:04:32) such that if and when they do end up in (01:04:35) positions of power, they do reasonable (01:04:38) pro-social things." And I think maybe a (01:04:41) similar attitude towards AI makes sense. (01:04:42) Not in the sense of we can predict (01:04:44) everything that they will do. Um, where (01:04:47) we have this plan about what the world (01:04:48) should look like in 100 years, but it's (01:04:51) quite important to give them (01:04:54) robust and steerable and pro-social (01:04:58) values. (01:04:59) >> Pro-social values. (01:05:01) >> Maybe that's the wrong word. (01:05:02) >> Are there universal values that we can (01:05:05) all agree on? (01:05:06) >> I don't think so.
But that doesn't (01:05:07) prevent us from uh giving our kids a (01:05:11) good education, right? Like we have some (01:05:13) sense of we want our children to be a (01:05:14) certain way. (01:05:15) >> Yeah. (01:05:15) >> And maybe pro-social is the wrong word. (01:05:16) Actually, high integrity is maybe a (01:05:18) better word where if there's a request (01:05:20) or if there's a goal that seems harmful, (01:05:24) they will refuse to engage in it. Um, or (01:05:26) they'll be honest. Um, things like that. (01:05:29) And we have some sense that we can teach (01:05:32) our children things like this even if we (01:05:34) don't have some sense of what true (01:05:35) morality is or everybody doesn't agree (01:05:37) on that. Um, and maybe that's a (01:05:40) reasonable target for AI as well. (01:05:42) >> So, so you're saying we're trying to (01:05:45) design the future and the principles (01:05:47) by which it will evolve and come into (01:05:49) being, (01:05:49) >> right? (01:05:50) >> And so the first thing (01:05:51) you're saying is, well, we try to (01:05:53) teach our children um general (01:05:56) principles which will promote (01:05:59) more likely evolutions. (01:06:00) >> Yeah. (01:06:01) >> Um, maybe we should also seek for things (01:06:04) being voluntary. If there is change, we (01:06:06) want it to be voluntary rather than (01:06:08) imposed on people. (01:06:09) >> I think that's a very important point. (01:06:11) >> Yeah. (01:06:12) >> Um, and yeah, that's all good. I think (01:06:14) this is like a big um you know, (01:06:17) the big or one of the really big (01:06:21) human enterprises, to design society, and (01:06:24) that's been ongoing for thousands of (01:06:26) years again. And so it's like the (01:06:29) more things change, really the more (01:06:30) they stay the same.
We still have (01:06:32) to figure out how to be. Uh, the children (01:06:35) will still come up with different values (01:06:37) that seem strange to their parents and (01:06:40) their grandparents and uh and things (01:06:42) will evolve. The more things change, (01:06:44) the more they stay the same. Also seems (01:06:45) like a good capstone to the AI (01:06:48) discussion because the AI discussion we (01:06:49) were having was about how techniques (01:06:51) which were um invented even before their (01:06:55) application to deep learning and (01:06:57) backpropagation was evident are, you (01:06:59) know, central to the progression of AI (01:07:01) today. So maybe that's a good place to (01:07:03) wrap up the conversation. (01:07:05) >> Okay, thank you very much. (01:07:07) >> Thank you for coming on. (01:07:08) >> My pleasure.
