Andrej Karpathy — “We’re summoning ghosts, not building animals” (YouTube Video Transcript)

Title: Andrej Karpathy — “We’re summoning ghosts, not building animals”

Duration: 02:26:08

(00:00:00) Reinforcement learning is terrible. It just so happens that everything that we had before is much worse. I'm actually optimistic. I think this will work. I think it's tractable. I'm only sounding pessimistic because when I go on my Twitter timeline, I see all this stuff that makes no sense to me. A lot of it is, I think, honestly just fundraising. We're not actually building animals. We're building ghosts, these sort of ethereal spirit entities, because they're fully digital and they're kind of mimicking humans. And it's a different kind of intelligence. It's business as usual because we're in an intelligence explosion already and have been for decades. Everything is gradually being automated, and has been for hundreds of years. Don't write blog posts. Don't do slides. Don't do any of that. Build the code, arrange it, get it to work. It's the only way to go. Otherwise, you're missing knowledge. If you have a perfect AI tutor, maybe you can get extremely far. The geniuses of today are barely scratching the surface of what a human mind can do.

(00:00:49) >> Today I'm speaking with Andrej Karpathy. Andrej, why do you say that this will be the decade of agents and not the year of agents?

(00:00:55) >> Well, first of all, thank you for having me here. I'm excited to be here. So the quote that you've just mentioned is "it's the decade of agents."
(00:01:03) That's actually a reaction to a pre-existing quote, I should say, where some of the labs (I'm not actually sure who said this) were alluding to this being the year of agents with respect to LLMs and how they were going to evolve. I was triggered by that because I feel like there's some overprediction going on in the industry. In my mind, this is a lot more accurately described as the decade of agents. We have some very early agents that are actually extremely impressive and that I use daily, you know, Claude and Codex and so on, but I still feel like there's so much work to be done. So my reaction is that we'll be working with these things for a decade. They're going to get better, and it's going to be wonderful. I think I was just reacting to the timelines, I suppose, of the implication.

(00:01:47) >> And what do you think will take a decade to accomplish? What are the bottlenecks?

(00:01:51) >> Well, actually making it work. In my mind, when you're talking about an agent, or what the labs have in mind and what maybe I have in mind as well, you should think of it almost like an employee or an intern that you would hire to work with you. So for example, you work with some employees here. When would you prefer to have an agent like Claude or Codex do that work? Currently, of course, they can't. What would it take for them to be able to do that? Why don't you do it today?
(00:02:12) And the reason you don't do it today is because they just don't work. They don't have enough intelligence. They're not multimodal enough. They can't do computer use and all this kind of stuff. And they don't do a lot of things. They don't have continual learning. You can't just tell them something and they'll remember it. They're just cognitively lacking, and it's just not working. I just think that it will take about a decade to work through all of those issues.

(00:02:33) >> Interesting. As a professional podcaster and a viewer of AI from afar, it's sort of easy for me to identify what's lacking: continual learning is lacking, or multimodality is lacking. But I don't really have a good way of trying to put a timeline on it. If somebody asks how long continual learning will take, there's no prior I have about whether this is a project that should take 5 years, 10 years, or 50 years. Why a decade? Why not one year? Why not 50 years?

(00:03:04) >> Yeah, I guess this is where you get into a bit of my own intuition, and also doing a bit of an extrapolation with respect to my own experience in the field. I guess I've been in AI for almost two decades. I mean, it's going to be maybe 15 years or so, not that long.
(00:03:19) You had Richard Sutton here, who was around, of course, for much longer. But I do have about 15 years of experience of people making predictions and of seeing how they actually turned out. Also, I was in research for a while and I've worked in industry for a while. So I guess I have a general intuition left from that. I feel like the problems are tractable, they're surmountable, but they're still difficult. If I just average it out, it just kind of feels like a decade, I guess, to me.

(00:03:47) >> This is actually quite interesting. I want to hear not only the history, but what people in the room felt was about to happen at various different breakthrough moments. What were the ways in which their feelings were either overly pessimistic or overly optimistic? Should we just go through each of them one by one?

(00:04:05) >> Oh yeah. I mean, that's a giant question because of course you're talking about 15 years of stuff that happened. AI is actually so wonderful because there have been a number of, I would say, seismic shifts where the entire field has suddenly looked a different way. I guess I've maybe lived through two or three of those. And I still think there will continue to be some, because they come with almost surprising regularity.
(00:04:25) When my career began, of course, when I started to work on deep learning, when I became interested in deep learning, this was just kind of by chance of being right next to Geoff Hinton at the University of Toronto. Geoff Hinton, of course, is kind of the godfather figure of AI, and he was training all these neural networks. I thought it was incredible and interesting, but this was not the main thing that everyone in AI was doing, by far. This was a niche subject on the side. That's maybe the first dramatic seismic shift, which came with AlexNet and so on.

(00:04:51) I would say AlexNet sort of reoriented everyone, and everyone started to train neural networks. But it was still very per-task, per specific task. So maybe I have an image classifier or I have a neural machine translator or something like that. And people became, very slowly, actually interested in basically agents, I would say.
(00:05:07) And people started to think: okay, maybe we have a check mark next to the visual cortex or something like that, but what about the other parts of the brain, and how can we get an actual full agent or a full entity that can actually interact in the world? I would say the Atari deep reinforcement learning shift in 2013 or so was part of that early effort of agents, in my mind, because it was an attempt to get agents that not just perceive the world but also take actions and interact and get rewards from environments. At the time, this was Atari games.

(00:05:36) I kind of feel like that was a misstep, actually. And it was a misstep that even the early OpenAI that I was a part of, of course, kind of adopted, because at that time the zeitgeist was reinforcement learning environments, games, game playing, beat games, get lots of different types of games, and OpenAI was doing a lot of that.
(00:05:52) So that was maybe another prominent part of, I would say, AI, where maybe for two or three or four years everyone was doing reinforcement learning on games. Basically, that was a little bit of a misstep.

(00:06:07) As for what I was trying to do at OpenAI, I was always a little bit suspicious of games as being this thing that would actually lead to AGI, because in my mind you want something like an accountant, or something that's actually interacting with the real world, and I just didn't see how games add up to it. So my project at OpenAI, for example, was within the scope of the Universe project, on an agent that was using keyboard and mouse to operate web pages. I really wanted to have something that interacts with the actual digital world, that can do knowledge work.

(00:06:35) It just so turns out that this was extremely early, way too early, so early that we shouldn't have been working on that. Because if you're just stumbling your way around and keyboard mashing and mouse clicking and trying to get rewards in these environments, your reward is too sparse and you just won't learn. You're going to burn a forest computing, and you're never actually going to get something off the ground. What you're missing is this power of representation in the neural network. So for example, today people are training those computer-using agents, but they're doing it on top of a large language model.
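The sparse-reward problem described above can be made concrete with a little arithmetic. The sketch below is a toy model, not the Universe setup: it assumes a hypothetical task that pays out only after one exact sequence of inputs, with an agent choosing inputs uniformly at random. The action count and sequence lengths are illustrative assumptions, not figures from the conversation.

```python
def expected_random_attempts(num_actions: int, sequence_length: int) -> int:
    """Expected attempts before uniform random input matches one exact sequence.

    Each attempt succeeds with probability (1 / num_actions) ** sequence_length,
    so the expected number of attempts is the reciprocal of that probability.
    """
    return num_actions ** sequence_length


# Hypothetical number: 50 distinct keyboard/mouse actions available per step.
NUM_ACTIONS = 50

for length in (1, 2, 4, 8):
    attempts = expected_random_attempts(NUM_ACTIONS, length)
    print(f"sequence of {length} actions: ~{attempts:.2e} random attempts per reward")
```

The count grows exponentially with the length of the required sequence, which is one way to read the remark that random keyboard mashing never gets off the ground without good representations to start from.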
(00:07:05) So you actually have to get the language model first. You have to get the representations first, and you have to do that by all the pre-training and all the LLM stuff. So I kind of feel like, maybe loosely speaking, people kept trying to get the full thing too early a few times, where people really tried to go after agents too early, I would say. That was Atari and Universe, and even my own experience. You actually have to do some things first before you get to those agents. Maybe now the agents are a lot more competent, but maybe we're still missing some parts of that stack. But I would say those are the three major buckets of what people were doing: training neural nets per task, trying the first round of agents, and then maybe the LLMs and actually seeking the representation power of the neural networks before you tack on everything else on top.

(00:07:51) >> Interesting. I guess if I were to steelman the Sutton perspective, it would be that humans actually can just take on everything at once, right? Even animals can take on everything at once. Animals are maybe a better example because they don't even have the scaffold of language. They just get thrown out into the world, and they just have to make sense of everything without any labels.
(00:08:10) And the vision for AGI then should just be something which looks at sensory data, looks at the computer screen, and just figures out what's going on from scratch. I mean, if a human was put in a similar situation and had to be trained from scratch, this is like a human growing up or an animal growing up. So why shouldn't that be the vision for AI, rather than this thing where we're doing millions of years of training?

(00:08:28) >> I think that's a really good question. Sutton was on your podcast, and I saw the podcast, and I had a write-up about that podcast that gets into a little bit of how I see things. I'm very careful to make analogies to animals because they came about by a very different optimization process. Animals are evolved, and they actually come with a huge amount of hardware that's built in. For example, my example in the post was the zebra. A zebra gets born, and a few minutes later it's running around and following its mother. That's an extremely complicated thing to do. That's not reinforcement learning. That's something that's baked in. Evolution obviously has some way of encoding the weights of our neural nets in ATCGs, and I have no idea how that works, but it apparently works. So I kind of feel like brains just came from a very different process, and I'm very hesitant to take inspiration from it because we're not actually running that process.
(00:09:21) So in my post, I kind of said we're not actually building animals. We're building ghosts, or spirits, or whatever people want to call it, because we're not doing training by evolution. We're doing training by basically imitation of humans and the data that they've put on the internet. So you end up with these sort of ethereal spirit entities, because they're fully digital and they're kind of mimicking humans. It's a different kind of intelligence. If you imagine a space of intelligences, we're starting off at a different point, almost. We're not really building animals, but I think it's also possible to make them a bit more animal-like over time, and I think we should be doing that.

(00:09:57) Just one more point: I do feel like Sutton basically has a framework of "we want to build animals," and I actually think that would be wonderful. If we can get that to work, that would be amazing. If there was a single algorithm that you can just run on the internet and it learns everything, that would be incredible. I'm not actually sure that it exists, and that's certainly not what animals do, because animals have this outer loop of evolution, right?
(00:10:24) A lot of what looks like learning is actually a lot more maturation of the brain. I think there's actually very little reinforcement learning for animals, and a lot of the reinforcement learning is more like motor tasks. It's not intelligence tasks. So I actually kind of think humans don't really use RL, roughly speaking, is what I would say.

(00:10:41) >> Can you repeat the last sentence? A lot of that intelligence is not motor tasks. It's what? Sorry.

(00:10:44) >> A lot of the reinforcement learning, in my perspective, would be things that are a lot more motor-like, simple kinds of tasks, throwing a hoop or something like that. But I don't think that humans use reinforcement learning for a lot of intelligence tasks like problem solving and so on.

(00:10:59) >> Interesting.

(00:11:00) >> That doesn't mean we shouldn't do that for research, but I just feel like that's what animals do or don't.

(00:11:06) >> I'm going to take a second to digest that because there are a lot of different ideas. Maybe one clarifying question I can ask to understand the perspective. I think you suggest that evolution is doing the kind of thing that pre-training does, in the sense of building something which can then understand the world. The difference, I guess, is that evolution has to be titrated, in the case of humans, through 3 gigabytes of DNA. And so that's very unlike the weights of a model.
(00:11:36) I mean, literally, the weights of the model are a brain, which obviously is not encoded in the sperm and the egg, or does not exist in the sperm and the egg. So it has to be grown. And also, the information for every single synapse in the brain simply cannot exist in the 3 gigabytes that exist in the DNA. Evolution seems closer to finding the algorithm which then does the lifetime learning. Now, maybe the lifetime learning is not analogous to RL, to your point. Is that compatible with the thing you were saying, or would you disagree with that?

(00:12:06) >> I think so. I would agree with you that there's some miraculous compression going on, because obviously the weights of the neural net are not stored in ATCGs. There's some kind of dramatic compression, and there are some kinds of learning algorithms encoded that take over and do some of the learning online. So I definitely agree with you on that. Basically, I would say I'm a lot more practically minded. I don't come at it from the perspective of "let's build animals." I come at it from the perspective of "let's build useful things." So I have a hard hat on, and I'm just observing that, look, we're not going to do evolution, because I don't know how to do that. But it does turn out we can build these ghost, spirit-like entities by imitating internet documents. This works, and it's a way to bring you up to something that has a lot of built-in knowledge and intelligence in some way.
(00:12:47) Similar to maybe what evolution has done. So that's why I kind of call pre-training this kind of crappy evolution. It's the practically possible version, with our technology and what we have available to us, to get to a starting point where we can actually do things like reinforcement learning and so on.

(00:13:02) >> Just to steelman the other perspective, because after doing the Sutton interview and thinking about it a bit, he has an important point here. Evolution does not give us the knowledge, really, right? It gives us the algorithm to find the knowledge, and that seems different from pre-training. So perhaps the perspective is that pre-training helps build the kind of entity which can learn better: it teaches meta-learning, and therefore it is similar to finding an algorithm. But if it's "evolution gives us knowledge and pre-training gives us knowledge," that analogy seems to break down.

(00:13:31) >> It's subtle, and you're right to push back on it. But basically, the thing that pre-training is doing: you're getting the next-token predictor over the internet, and you're training that into a neural net. It's doing two things, actually, that are kind of unrelated. Number one, it's picking up all this knowledge, as I call it. Number two, it's actually becoming intelligent.
(00:13:48) By observing the algorithmic patterns in the internet, it actually boots up all these little circuits and algorithms inside the neural net to do things like in-context learning and all this kind of stuff. And you don't actually need or want the knowledge. I think that's probably holding back the neural networks overall, because it's getting them to rely on the knowledge a little too much sometimes. For example, I feel like one thing agents are not very good at is going off the data manifold of what exists on the internet. If they had less knowledge or less memory, maybe they would actually be better.

(00:14:17) So what I think we have to do going forward, and this will be part of the research paradigms, is figure out ways to remove some of the knowledge and keep what I call this cognitive core. It's this intelligent entity that is stripped from knowledge but contains the algorithms and contains the magic of intelligence and problem solving and the strategies of it and all this kind of stuff.

(00:14:39) >> There's so much interesting stuff there. Okay, so let's start with in-context learning. This is an obvious point, but I think it's worth just saying it explicitly and meditating on it.
(00:14:47) The situation in which these models seem the most intelligent, in which I talk to them and I'm like, "Wow, there's really something on the other end that's responding to me, thinking about things," where if it makes a mistake, it's like, "Oh wait, that's actually the wrong way to think about it. I'm backing up": all that is happening in context. That's where I feel like the real intelligence is, that you can visibly see. And that in-context learning process is developed by gradient descent on pre-training, right? It spontaneously meta-learns in-context learning. But the in-context learning itself is not gradient descent, in the same way that our lifetime intelligence as humans to be able to do things is conditioned by evolution, but our actual learning during our lifetime is happening through some other process.

(00:15:31) >> I actually don't fully agree with that, but you should continue.

(00:15:33) >> Okay. Actually, then I'm very curious to understand how that analogy breaks down.

(00:15:37) >> I think I'm hesitant to say that in-context learning is not doing gradient descent. I mean, it's not doing explicit gradient descent. In-context learning is basically pattern completion within a token window, right? And it just turns out that there's a huge amount of patterns on the internet. And so you're right, the model kind of learns to complete the pattern, and that's inside the weights.
(00:15:55) The weights of the neural network are trying to discover patterns and complete the pattern. And there's some kind of adaptation that happens inside the neural network, right? Which is kind of magical and just falls out from the internet, just because there are a lot of patterns.

(00:16:06) I will say that there have been some papers that I thought were interesting that actually look at the mechanisms behind in-context learning. I do think it's possible that in-context learning actually runs a small gradient descent loop internally in the layers of the neural network. I recall one paper in particular where they were doing linear regression, actually, using in-context learning. So basically your inputs into the neural network are XY pairs, XY, XY, XY, that happen to be on a line. Then you give it an X and you expect the Y. And the neural network, when you train it in this way, actually does do linear regression.

(00:16:41) Normally, when you would run linear regression, you have a small gradient descent optimizer that basically looks at XY, looks at an error, calculates the gradient of the weights, and does the update a few times. It just turns out that when they looked at the weights of that in-context learning algorithm, they actually found some analogies to gradient descent mechanics. In fact, I think the paper was even stronger, because they actually hardcoded the weights of a neural network to do gradient descent through attention and all the internals of the neural network.
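For reference, the explicit optimizer described here (look at the XY pairs, measure the error, compute the gradient of the weights, update a few times) can be sketched in a few lines. This is only the ordinary gradient descent baseline that in-context learning is being compared against, not the transformer construction from the paper mentioned. The line y = 3x + 1, the learning rate, and the step count are illustrative assumptions.

```python
def fit_line_by_gradient_descent(pairs, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b to (x, y) pairs by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Gradient of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# The "context": XY pairs that happen to lie on a line (here y = 3x + 1, an assumption).
context_pairs = [(x, 3.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

# The "query": a new X for which the matching Y is expected.
query_x = 4.0
w, b = fit_line_by_gradient_descent(context_pairs)
prediction = w * query_x + b
print(f"w={w:.3f}, b={b:.3f}, predicted y for x={query_x}: {prediction:.3f}")
```

In the in-context setting, the same pairs and query would be presented as a token sequence and the prediction read off the model's output, with no explicit update loop visible from the outside.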
(00:17:09) So I guess that's just my only pushback: who knows how in-context learning works, but I actually think that it's probably doing a little bit of some kind of funky gradient descent internally, and I think that's possible. I was only pushing back on your saying it's not doing gradient descent. Who knows what it's doing, but it's probably doing something similar to it. We don't know.

(00:17:28) >> So then it's worth thinking about: okay, if in-context learning and pre-training are both implementing something like gradient descent, why does it feel like with in-context learning we're actually getting to this continual learning, real-intelligence-like thing, whereas you don't get the analogous feeling just from pre-training? At least you could argue that. And so if it's the same algorithm, what could be different? Well, one way you can think about it is: how much information does the model store per unit of information it receives from training? If you look at pre-training, if you look at Llama 3, for example, I think it's trained on 15 trillion tokens. And if you look at the 70B model, that would be the equivalent of 0.07 bits per token that it sees in pre-training, in terms of the information in the weights of the model compared to the tokens it reads. Whereas if you look at the KV cache and how it grows per additional token in in-context learning, it's like 320 kilobytes.

(00:18:26) >> Yeah.
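The two per-token figures quoted here can be reproduced with back-of-envelope arithmetic. The sketch below assumes 16-bit weights and cache entries, and the publicly documented Llama 3 70B attention shape (80 layers, 8 key-value heads of dimension 128). Those architecture and precision numbers are assumptions brought in for the calculation; they are not stated in the conversation.

```python
# Pre-training: bits of weight storage per training token.
PARAMS = 70e9            # 70B parameters
BITS_PER_WEIGHT = 16     # assumption: 16-bit weights
TRAINING_TOKENS = 15e12  # 15 trillion tokens

bits_per_training_token = PARAMS * BITS_PER_WEIGHT / TRAINING_TOKENS

# In-context: bytes added to the KV cache per token.
NUM_LAYERS = 80          # assumption: Llama 3 70B layer count
NUM_KV_HEADS = 8         # assumption: grouped-query attention with 8 KV heads
HEAD_DIM = 128           # assumption: per-head dimension
BYTES_PER_VALUE = 2      # assumption: 16-bit cache entries

# One key vector and one value vector per KV head, per layer.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

ratio = kv_bytes_per_token * 8 / bits_per_training_token

print(f"weights: {bits_per_training_token:.4f} bits per training token")
print(f"KV cache: {kv_bytes_per_token} bytes ({kv_bytes_per_token / 1024:.0f} KiB) per token")
print(f"ratio: about {ratio / 1e6:.1f} million")
```

Under these assumptions the weights hold roughly 0.075 bits per training token, the cache grows by 320 KiB per token, and the ratio comes out at about 35 million.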
So that's a 35 millionfold (00:18:28) difference in how much information per (00:18:30) token is assimilated by the model. I (00:18:34) wonder if that's relevant at all. (00:18:35) >> I think I kind of agree. I mean the way (00:18:37) I usually put this is that anything that (00:18:39) happens during the training of the (00:18:40) neural network. The knowledge is only (00:18:42) kind of like a hazy recollection of what (00:18:44) happened in train in the training time (00:18:45) and that's because the compression is (00:18:47) dramatic. You've you're taking 15 (00:18:48) trillion tokens and you're compressing (00:18:49) it to just your final network of a few (00:18:51) billion parameters. So obviously it's a (00:18:52) massive amount of compression going on. (00:18:54) uh so I kind of refer to it as like a (00:18:56) hazy recollection of the internet (00:18:57) documents whereas anything that happens (00:18:59) in the context window of the neural (00:19:00) network you're plugging all the tokens (00:19:02) and it's building up all this KV cache (00:19:03) representation is very directly (00:19:05) accessible to the neural net so I (00:19:06) compare the KV cache and the the stuff (00:19:08) that happens at test time to like more (00:19:10) like a working memory (00:19:11) >> uh like all the stuff that's in the in (00:19:13) um in the context window is very (00:19:14) directly accessible to the neural net so (00:19:16) there's always like these um almost (00:19:19) surprising analogies between LLMs and (00:19:20) humans and I find them kind of (00:19:22) surprising because we're not trying to (00:19:23) build a human brain of course u just (00:19:25) directly we're just finding that this (00:19:26) works and we're doing it (00:19:27) >> but I do think that (00:19:29) >> anything that's in the weights it's kind (00:19:30) of like a hazy recollection of what you (00:19:32) read a year ago anything that you give (00:19:34) it as a context uh at test time is (00:19:37) 
directly in the working memory um and I (00:19:39) think that's a very powerful analogy to (00:19:40) think through things so when you for (00:19:42) example go to an LLM and you ask it (00:19:43) about some book and what happened in it (00:19:45) like Nick Lane's book or something like that (00:19:47) the LLM will often give you some stuff (00:19:49) which is roughly correct but if you give (00:19:50) it the full chapter and ask it questions (00:19:52) you're going to get much better results (00:19:54) because it's now loaded in the working (00:19:55) memory of the model. So I basically (00:19:57) agree with your very long way of saying (00:19:59) that I kind of agree and that's why (00:20:00) >> stepping back what is the part about (00:20:02) human intelligence that we like have (00:20:05) most failed to replicate with these (00:20:07) models? (00:20:08) >> Um I almost feel like um just uh just a (00:20:13) lot of it still. So maybe one way to (00:20:15) think about it, I don't know if this is (00:20:16) the best way, but I almost kind of (00:20:19) feel like again making these analogies, (00:20:20) imperfect as they are, um we've stumbled (00:20:23) by with the transformer neural network, (00:20:24) which is extremely powerful, very general. (00:20:27) You can train transformers on audio or (00:20:29) video or text or whatever you want and (00:20:32) it just learns patterns and they're very (00:20:33) powerful and it works really well. That (00:20:36) to me almost indicates that this is kind (00:20:37) of like some piece of cortical tissue. (00:20:39) Uh it's something like that because the (00:20:40) cortex is famously very um plastic as (00:20:42) well. You can rewire um you know parts (00:20:45) of brains and there were the slightly (00:20:47) gruesome experiments with rewiring like (00:20:49) visual cortex to the auditory cortex and (00:20:51) this animal like learned fine etc. Um, so (00:20:54) I think that this is kind of like (00:20:55) cortical tissue.
I think when we're (00:20:58) doing reasoning and planning inside the (00:21:00) neural networks, so basically doing (00:21:02) reasoning traces um for thinking models, (00:21:04) that's kind of like the prefrontal (00:21:05) cortex. Um, and then um I think we uh (00:21:10) maybe those are like little check marks, (00:21:12) but I still think there's many uh brain (00:21:13) parts and nuclei that are not explored. (00:21:15) So maybe for example there's a basal (00:21:16) ganglia doing a bit of reinforcement (00:21:18) learning when we fine-tune the models (00:21:19) on reinforcement learning but you you (00:21:20) know whereas like the hippocampus not (00:21:22) obvious what that would be some parts (00:21:24) are probably not important maybe the (00:21:25) cerebellum is like not important to (00:21:27) cognition it's thought so so we can skip (00:21:28) some of it uh but I still think there's (00:21:30) for example the amygdala all the emotions (00:21:32) and instincts um and there's probably (00:21:34) like a bunch of other nuclei in the (00:21:36) brain that are very ancient that I don't (00:21:37) think we've like really replicated I (00:21:39) don't actually know that we should be (00:21:40) pursuing you know the building of an (00:21:42) analog of human brain I'm again an (00:21:44) engineer mostly at heart. But um I still (00:21:47) feel like maybe another way to answer (00:21:50) the question is you're not going to hire (00:21:51) this thing as an intern and it's missing (00:21:53) a lot of it because it comes with a (00:21:54) lot of these cognitive deficits that we (00:21:55) all intuitively feel when we talk to the (00:21:57) models. (00:21:58) >> Um (00:21:58) >> and so it's just like not fully there (00:22:00) yet. You can look at it as like not all (00:22:02) the brain parts are checked off yet. (00:22:04) >> This is maybe relevant to the question (00:22:07) of thinking about how fast these issues (00:22:09) will be solved.
So sometimes people will (00:22:12) say about continual learning. Look, (00:22:14) actually you could already you could (00:22:16) easily replicate this capability just as (00:22:18) in context learning emerged (00:22:19) spontaneously as a result of (00:22:21) pre-training. (00:22:22) Continual learning over longer horizons (00:22:24) will emerge spontaneously if the model (00:22:27) is incentivized to recollect information (00:22:29) over longer horizons or horizons longer (00:22:32) than one session. So if there's um some (00:22:36) like outer loop RL which has many (00:22:40) sessions within that outer loop then (00:22:43) like this continual learning where it (00:22:44) uses like it fine-tunes itself or it (00:22:46) writes to an external memory or (00:22:47) something will just sort of like emerge (00:22:49) spontaneously. Do you think (00:22:50) >> do you think things are that are (00:22:52) plausible? I just I don't have really a (00:22:53) prior over like how plausible is that? (00:22:54) How likely is that to happen? (00:22:55) >> I don't know that I fully resonate with (00:22:57) that because I feel like these models (00:22:58) when you boot them up and they have zero (00:23:00) tokens in the window, they're always (00:23:01) like restarting from scratch where they (00:23:03) were. So I don't actually know in that (00:23:05) worldview what it looks like. Uh because (00:23:07) um again making maybe making some (00:23:10) analogies to humans just because I think (00:23:11) it's roughly concrete and kind of (00:23:13) interesting to think through. I feel (00:23:15) like when I'm awake I'm building up a (00:23:16) context window of stuff that's happening (00:23:17) during the day but I feel like when I go (00:23:19) to sleep something magical happens where (00:23:21) uh I don't actually think that that (00:23:22) context window stays around. Um I think (00:23:24) there's some process of distillation (00:23:25) into weights of my brain. (00:23:27) >> Yeah. 
(00:23:27) >> Um and this happens during sleep and all (00:23:29) this kind of stuff. We don't have an (00:23:30) equivalent of that in large language (00:23:33) models and that's to me more adjacent to (00:23:35) when you talk about continual learning (00:23:36) and so on as absent. These models don't (00:23:39) really have this distillation phase um (00:23:41) of taking what happened, analyzing it, (00:23:44) obsessively thinking through it, um (00:23:47) basically doing some kind of a synthetic (00:23:48) data generation process and distilling (00:23:50) it back into the weights and maybe (00:23:51) having uh you know specific neural net (00:23:54) per person uh maybe it's a LoRA it's (00:23:57) not a full uh yeah it's not a full (00:23:59) weight uh neural network it's (00:24:01) just some of the small sparse (00:24:04) subset of the weights are changed (00:24:05) >> but basically we do want to create ways (00:24:07) of creating these individuals that have (00:24:09) very long contexts. It's not only (00:24:11) remaining in the context window because (00:24:12) the context windows grow very very long (00:24:15) like maybe we have some very elaborate (00:24:16) sparse attention over it (00:24:18) >> but I still think that humans obviously (00:24:20) have some process for distilling some of (00:24:22) that knowledge into the weights we're (00:24:23) missing it and I do also think that (00:24:25) humans um have some kind of a very (00:24:27) elaborate sparse attention scheme (00:24:30) >> um which I think we're starting to see (00:24:32) some early hints of uh so DeepSeek V3.2 (00:24:35) just came out and I saw that they have (00:24:36) like a sparse attention as an example (00:24:38) and this is one way to have very very (00:24:40) long context windows.
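[Editor's sketch] The "maybe it's a LoRA" remark above can be made concrete with a small example. This is only an illustration under stated assumptions, not anything from the conversation or from a real library: LoRA, as it is usually described, leaves the base weight matrix W frozen and learns a low-rank correction B·A, so a per-person adapter would store far fewer numbers than a full copy of the weights. The function names, the plain-list matrices, and the `scale` parameter are all hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute W x + scale * B (A x).

    W is the frozen base weight (d_out x d_in).
    A (rank x d_in) and B (d_out x rank) are the only trainable parts.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def full_update_params(d_in, d_out):
    """Number of values changed if the whole matrix is fine-tuned."""
    return d_in * d_out


def lora_update_params(d_in, d_out, rank):
    """Number of values stored by a rank-`rank` adapter (A plus B)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative sizes only; not taken from any specific model.
    print(full_update_params(4096, 4096))     # 16777216
    print(lora_update_params(4096, 4096, 8))  # 65536
```

With these made-up sizes the adapter is 256 times smaller than the matrix it modifies, which is the sense in which a per-person adapter could be cheap to keep around.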
(00:24:41) >> So I almost feel like we are redoing a (00:24:43) lot of the cognitive tricks that (00:24:45) evolution came up with through a very (00:24:47) different process but we're I think (00:24:48) going to converge on a similar (00:24:49) architecture cognitively. (00:24:50) >> Interesting. In 10 years do you think (00:24:52) it'll still be something like a (00:24:53) transformer but with a much more (00:24:55) modified attention and more sparse uh (00:24:57) MLPS and so forth? Well, the way I like (00:24:59) to think about it is okay, let's uh (00:25:01) translation invariance in time, right? (00:25:02) So 10 years ago, where were we? (00:25:04) >> 2015 uh we had uh convolutional neural (00:25:07) networks primarily. Residual networks (00:25:09) just came out. Um so remarkably similar (00:25:12) I guess, but quite a bit different (00:25:13) still. I mean transformer was not (00:25:14) around. Um (00:25:16) >> you know all the um all these sort of (00:25:18) like more modern uh tweaks on the (00:25:20) transformer were not around. 
So maybe (00:25:22) some of the things that we can bet on I (00:25:24) think in 10 years uh by translational (00:25:26) sort of equivariance is um we're still (00:25:29) training giant neural networks with uh (00:25:30) forward backward pass and update through (00:25:32) gradient descent um but maybe it looks a (00:25:36) little bit different (00:25:36) >> and it's just everything is much bigger (00:25:38) actually recently I also went back all (00:25:41) the way to 1989 which was kind of a fun (00:25:43) uh exercise for me a few years ago uh (00:25:45) because I was reproducing uh Yann LeCun's (00:25:48) 1989 convolutional network which was the (00:25:50) first neural network I'm aware of (00:25:51) trained via gradient descent like modern (00:25:53) neural network trained with gradient descent (00:25:55) on uh digit recognition (00:25:57) >> and I was just interested in okay how (00:25:59) can I modernize this how much of this is (00:26:01) algorithms how much of this is data how (00:26:02) much of this progress is uh compute and (00:26:04) systems and I was able to very quickly (00:26:06) like halve the error just knowing (00:26:08) by time travel by 33 years so if I (00:26:11) time travel by algorithms to 33 years I (00:26:13) could adjust what Yann did in 1989 and I (00:26:16) could basically halve (00:26:17) the error but to get further gains I had (00:26:20) to add a lot more data. I had to like (00:26:22) 10x the training set and then I had to (00:26:24) actually add more computational (00:26:25) optimizations. Uh had to basically train (00:26:28) for much longer with dropout and other (00:26:29) regularization techniques. (00:26:30) >> And so it's almost like all these things (00:26:33) have to improve simultaneously. So um (00:26:35) you know we're probably going to have a (00:26:36) lot more data. We're probably going to (00:26:37) have a lot better hardware.
Probably (00:26:38) going to have a lot better kernels and (00:26:40) software. We're probably going to have (00:26:41) better algorithms. And all of those it's (00:26:43) almost like no one of them is winning (00:26:45) too much. All of them are surprisingly (00:26:47) equal. (00:26:48) Mm. (00:26:49) >> and this has kind of been the trend for (00:26:50) a while. So I guess to answer maybe your (00:26:52) question, I expect differences (00:26:55) algorithmically to what's happening (00:26:56) today. Uh but I do also expect that some (00:26:58) of the things that have stuck around for (00:27:00) a very long time will probably still be (00:27:01) there. It's probably still giant neural (00:27:03) network trained with gradient descent. (00:27:04) That would be my guess. (00:27:05) >> It's surprising that all of those things (00:27:06) together only halved um uh half the (00:27:12) error. Yeah. which is so like 30 years (00:27:14) of progress is uh maybe maybe half is a (00:27:16) lot because like if you halve the error (00:27:17) that actually means that (00:27:18) >> half is a lot. Yeah. (00:27:19) >> Yeah. Yeah. Okay. Um (00:27:20) >> but it's I guess what was shocking to me (00:27:21) is everything needs to improve across (00:27:23) the board. (00:27:24) >> Uh architecture optimizer loss function (00:27:26) and also has improved across the board (00:27:28) forever. So I kind of expect all those (00:27:30) changes to be alive and well. Well, (00:27:31) yeah. Actually, I was about to ask a (00:27:33) very similar question about nanochat (00:27:34) because since you just coded it up (00:27:36) recently, every single sort of step in (00:27:39) the, you know, process of building a (00:27:41) chatbot is like fresh in your RAM. (00:27:43) >> And I'm curious if you had similar (00:27:45) thoughts about like, oh, there was no (00:27:47) one thing that was relevant to going (00:27:49) from GPT-2 to nanochat.
What are sort of (00:27:53) like surprising takeaways from the (00:27:55) experience (00:27:56) >> building? So, nanochat is a kind of a (00:27:58) repository I released was it yesterday (00:28:00) or day before? I can't remember. (00:28:03) We can see the sleep deprivation that (00:28:05) went into the (00:28:06) >> um well it's just trying to be a it's (00:28:09) trying to be the simplest complete (00:28:11) repository that covers the whole (00:28:12) pipeline end to end of building a ChatGPT (00:28:15) clone (00:28:16) >> and so you know you have all of the (00:28:18) steps not just any individual step which (00:28:20) is a bunch of I worked on all the (00:28:21) individual steps sort of in the past and (00:28:23) really small pieces of code that kind of (00:28:25) um show you how that's done in (00:28:26) algorithmic sense um uh in like simple (00:28:29) code but this kind of handles the entire (00:28:31) pipeline I I think in terms of learning (00:28:33) it's not it's not so much um I don't (00:28:35) know that I actually found something (00:28:36) that I learned from from it necessarily. (00:28:38) I kind of already had in my mind as like (00:28:40) how you build it and this is just a (00:28:41) process of mechanically uh building it (00:28:45) and making it clean enough and uh so (00:28:48) that people can actually learn from it (00:28:49) and um that uh they find it useful. (00:28:52) >> Yeah. What is the best way for somebody (00:28:53) to learn from it? Is it just like delete (00:28:55) all the code and try to reimplement from (00:28:56) scratch? Try to add modifications to it? (00:28:58) >> Uh yeah, I think that's a that's a great (00:29:00) question. I would probably say so (00:29:02) basically it's about 8,000 lines of code (00:29:03) that takes you through the entire (00:29:04) pipeline. I would probably put it on the (00:29:06) right monitor like if you have two (00:29:08) monitors you put it on the on the right.
(00:29:09) >> Um and you want to build it from (00:29:11) scratch. You build it from start. You're (00:29:13) not allowed to copy paste. You're (00:29:15) allowed to reference. You're not allowed (00:29:16) to copy paste. Maybe that's how I would (00:29:17) do it. (00:29:18) >> Um but I also think the repository by (00:29:20) itself it is like a pretty large beast. (00:29:22) I mean it's you know it's a it's when (00:29:24) you write this code you don't go from (00:29:25) top to bottom. you go from chunks and (00:29:27) you grow the chunks and uh that (00:29:29) information is absent like you wouldn't (00:29:30) know where to start and so I think it's (00:29:32) not just the final repository that's (00:29:34) needed it's like the building of the (00:29:35) repository which is a complicated chunk (00:29:37) growing process (00:29:38) >> right (00:29:39) >> uh so that part is not there yet I would (00:29:41) love to actually like add that probably (00:29:42) later this week or something in some way (00:29:44) like either it's a uh it's probably a (00:29:46) video or something like that but um but (00:29:48) maybe roughly speaking that's what I (00:29:50) would try to do is build the stuff (00:29:52) yourself uh but uh don't allow yourself (00:29:54) copy paste (00:29:55) >> I do think that there's two types of (00:29:56) knowledge almost like there's the high (00:29:58) level surface knowledge but the thing is (00:30:00) that when you actually build something (00:30:01) from scratch you're forced to come to (00:30:02) terms with what you don't actually (00:30:04) understand and you don't know that you (00:30:05) don't understand it (00:30:06) >> interesting (00:30:06) >> and it always leads to a deeper (00:30:07) understanding uh and um it's like just (00:30:10) the only way to to build is like if I (00:30:12) can't build it I don't understand it (00:30:14) that's a Feynman quote I believe or something (00:30:16) along those lines (00:30:17) >> I 100% I've always
believed this very (00:30:19) strongly uh because there's all these (00:30:21) like micro things that are just not (00:30:23) properly arranged and you don't really (00:30:24) have the knowledge you just think you have the (00:30:25) knowledge. So, don't write blog posts, (00:30:27) don't do slides, don't do any of that. (00:30:29) Like, build the code, arrange it, get it (00:30:30) to work. It's the only way to go. (00:30:32) Otherwise, you're missing knowledge. (00:30:33) >> Um, you tweeted out that coding models (00:30:35) were actually of very little help to you (00:30:37) in assembling this repository and I'm (00:30:39) curious why that was. (00:30:41) >> Yeah. Uh, so the repository, I guess I (00:30:44) built it over a period of a bit more (00:30:45) than a month, and I would say there's (00:30:47) like three major classes of how people (00:30:49) interact with code right now. Some (00:30:51) people completely reject all of LLMs and (00:30:53) they are just uh writing from scratch. I (00:30:55) think this is probably not the the right (00:30:56) thing to do anymore. Um the intermediate (00:30:59) part which is where I am is you still (00:31:01) write a lot of things from scratch but (00:31:03) you use uh the autocomplete uh that's (00:31:05) basically uh available now from these (00:31:06) models. So when you start writing out a (00:31:08) little piece of it it will (00:31:10) autocomplete for you and you can just tap (00:31:11) through and most of the time it's (00:31:12) correct. Sometimes it's not and you edit (00:31:14) it but you're still very much the um (00:31:16) sort of architect of what you're (00:31:18) writing. And then there's the, you know, (00:31:19) vibe coding, uh, you know, hi, please (00:31:22) implement this or that, uh, you know, (00:31:24) enter and then let the model do it. And (00:31:26) that's the agents.
(00:31:27) >> Um, I do feel like the agents work in (00:31:30) very specific settings and I would use (00:31:32) them in specific settings. But again, (00:31:34) these are all tools available to you and (00:31:35) you have to like learn what they what (00:31:37) they're good at and what they're not (00:31:38) good at and when to use them. (00:31:39) >> So the agents are actually pretty good. (00:31:40) For example, if you're doing boilerplate (00:31:42) stuff, (00:31:42) >> boilerplate code that's like just cop, (00:31:44) you know, just copy paste stuff. They're (00:31:46) very good at that. they're very good at (00:31:47) stuff that occurs very often on the (00:31:49) internet (00:31:50) um because there's lots of examples of (00:31:52) it in the training sets of these models. (00:31:54) Um so so there's like features of things (00:31:56) where the models will do very well. (00:31:58) I would say nanochat is not an example of (00:32:00) this because it's a fairly unique (00:32:02) repository. There's not that much code I (00:32:04) think in the way that I've structured it (00:32:06) and um and it's not boilerplate code. (00:32:09) It's like actually like intellectually (00:32:10) intense code almost and everything has (00:32:12) to be very precisely arranged and the (00:32:13) models are always trying to (00:32:15) >> they kept trying to I mean they have so (00:32:17) many cognitive deficits right so one (00:32:18) example they keep trying to they keep (00:32:20) misunderstanding the code um because (00:32:23) they they have too much memory from all (00:32:25) the typical ways of doing things on the (00:32:26) internet that I just wasn't adopting. (00:32:28) >> Uh so the models for example (00:32:30) >> I mean I don't know if I want to get (00:32:31) into the full details but they keep they (00:32:33) keep um they keep thinking I'm writing (00:32:35) normal code and I'm not.
Maybe one (00:32:37) example maybe (00:32:38) >> one example is so the way to synchronize (00:32:41) so you have eight GPUs that are all (00:32:42) doing forward backwards the way to (00:32:44) synchronize gradients between them is to (00:32:45) use a distributed data parallel (00:32:47) container of PyTorch which automatically (00:32:49) does all the as you're doing the (00:32:50) backward it will start communicating and (00:32:51) synchronizing gradients I didn't use DDP (00:32:54) because I didn't want to use it because (00:32:56) it's not necessary so I threw it out (00:32:58) >> and I basically wrote my own (00:32:59) synchronization routine that's inside (00:33:01) the step of the optimizer and so the (00:33:03) models were trying to get me to use the (00:33:05) DDP container (00:33:06) >> and they were very concerned about okay this (00:33:09) gets way too technical but I wasn't (00:33:10) using that container because I don't (00:33:12) need it and I have a custom (00:33:12) implementation of something like it (00:33:14) >> and they just couldn't internalize that (00:33:15) you had your own (00:33:16) >> yeah they couldn't they couldn't get (00:33:17) past that and then um they kept trying (00:33:20) to like mess up the style like they're (00:33:22) way too overdefensive they make all (00:33:24) these try-catch statements they keep (00:33:25) trying to make a production codebase and (00:33:27) I have a bunch of assumptions in my code (00:33:29) and it's okay and uh and it's just like (00:33:32) I don't need all this extra stuff in (00:33:34) there and so I just kind of feel like (00:33:35) they're bloating the codebase. They're (00:33:37) bloating the complexity. They keep (00:33:38) misunderstanding. They're using (00:33:39) deprecated APIs a bunch of times. So, (00:33:42) it's a total mess. Um, and uh, it's just (00:33:45) it's just not that useful. I can go in, (00:33:47) I can clean it up, but it's not that (00:33:48) useful.
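[Editor's sketch] The gradient synchronization described above, averaging gradients across workers inside the optimizer step instead of wrapping the model in PyTorch's DDP container, can be illustrated with a toy single-process version. This is not nanochat's code and it does not call PyTorch; the "ranks" are simulated as plain lists, and `all_reduce_mean` and `ToySGD` are hypothetical names chosen for the illustration.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across simulated ranks.

    per_rank_grads: one gradient list per rank, all the same length.
    In a real multi-GPU job this would be a collective communication call.
    """
    n_ranks = len(per_rank_grads)
    return [sum(values) / n_ranks for values in zip(*per_rank_grads)]


class ToySGD:
    """Plain SGD whose step() performs the gradient sync itself."""

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def step(self, per_rank_grads):
        # Synchronize first, so every rank would apply the same update.
        avg = all_reduce_mean(per_rank_grads)
        self.params = [p - self.lr * g for p, g in zip(self.params, avg)]
        return self.params


if __name__ == "__main__":
    opt = ToySGD([1.0, 2.0], lr=0.5)
    # Two simulated ranks, each with its own local gradient.
    print(opt.step([[2.0, 0.0], [0.0, 4.0]]))
```

The design point being illustrated is only where the communication lives: inside `step()`, rather than hooked into the backward pass by a wrapper around the model.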
I also feel like it's kind of (00:33:50) annoying to have to like type out what I (00:33:52) want in English because it's just too (00:33:53) much typing. Like, if I just navigate to (00:33:55) the part of the code that I want and I (00:33:57) go where I where I know the code has to (00:33:58) appear and I start typing out the first (00:34:00) three letters, autocomplete gets it and (00:34:01) just gives you the code. And so I think (00:34:03) it's this is a very high information (00:34:05) bandwidth to specify what you want is if (00:34:07) you point to the code where you want it (00:34:08) and you type out the first few pieces (00:34:10) and the model will complete it. (00:34:12) >> So I guess what I mean is um I think (00:34:15) these models are good in certain parts (00:34:17) of the stack. I actually use the models a (00:34:19) little bit in um there are two examples (00:34:22) where I actually use the models that I (00:34:23) think are illustrative. Uh one was when (00:34:25) I generated the report that's actually (00:34:27) more boilerplatey. So I actually vibe coded (00:34:29) partially some of that stuff that (00:34:30) was fine um because it's not like (00:34:32) mission critical stuff and it works (00:34:34) fine. (00:34:34) >> And then the other part is when I was (00:34:35) rewriting the tokenizer uh in Rust uh (00:34:38) I'm actually not as good at Rust because (00:34:40) I'm fairly new to Rust. So I was doing (00:34:42) there's a bit of vibe coding going on uh (00:34:44) in when I was writing some of the Rust (00:34:46) code but I had Python implementation (00:34:47) that I fully understand and I'm just (00:34:49) making sure I'm making a more efficient (00:34:50) version of it and I have tests so I feel (00:34:52) safer doing that stuff. Um and so (00:34:54) basically they lower or like they (00:34:56) increase accessibility to uh languages (00:34:59) or paradigms that you might (00:35:01) not be as familiar with.
Uh so I think (00:35:03) they're very helpful there as well. (00:35:04) >> Yeah. (00:35:05) >> Uh because there's a ton of Rust code (00:35:06) out there. The models are actually (00:35:07) pretty good at it. I happen to not know (00:35:09) that much about it. So the models are (00:35:10) very useful there. (00:35:11) >> Um the reason I think this question is (00:35:12) so interesting is because the main story (00:35:16) people have about AI exploding and (00:35:19) getting to superintelligence pretty (00:35:20) rapidly is AI automating AI (00:35:23) engineering and AI research. (00:35:25) So they'll look at the fact that you can (00:35:27) have Claude Code make entire applications (00:35:29) from scratch and be like if you had this (00:35:30) capability inside of OpenAI and DeepMind (00:35:33) and everything well just imagine (00:35:35) the level of like just you know a (00:35:37) thousand of you or a million of you in (00:35:38) parallel finding little architectural (00:35:40) tweaks and so it's quite interesting to (00:35:42) hear you say that this is the thing (00:35:44) they're sort of asymmetrically worse at (00:35:46) and it's like quite relevant to (00:35:47) forecasting whether the AI 2027 type (00:35:50) explosion (00:35:51) >> is likely to happen anytime soon. I (00:35:53) think that's a good way of putting it. (00:35:55) And I think you're getting at some of my (00:35:56) like why my timelines are a bit longer. (00:35:58) You're right. Um I think um yeah, (00:36:01) they're not very good at code that (00:36:02) has never been written before maybe (00:36:04) is like one way to put it, which is like (00:36:05) what we're trying to achieve when we're (00:36:06) building these models. (00:36:08) >> Very naive question, but um the (00:36:10) architectural tweaks that you're adding (00:36:12) to uh nanochat, they're in a paper (00:36:16) somewhere, right? They might even be in (00:36:17) a repo somewhere.
So it's (00:36:20) um is it surprising that they aren't (00:36:22) able to integrate that into whenever (00:36:24) you're like add RoPE embeddings or (00:36:26) something they do that in the wrong way. (00:36:30) >> It's it's tough. I think they kind of (00:36:31) know they kind of know but they don't (00:36:32) fully know and they don't know how to (00:36:34) fully integrate it into the repo and (00:36:35) your style and your code and your place (00:36:36) and some of the custom things that (00:36:38) you're doing and (00:36:39) >> and uh how it fits with all the (00:36:40) assumptions of the repository and all (00:36:42) this kind of stuff. So I think they do (00:36:43) have some knowledge but um they haven't (00:36:46) gotten to the place where they can (00:36:47) actually integrate it, make sense of it (00:36:50) uh and so on. I do think that a lot of (00:36:51) the stuff by the way continues to (00:36:52) improve. So um I think currently (00:36:54) probably state-of-the-art model that I (00:36:56) go to is the GPT-5 Pro. (00:36:57) >> Um and uh that's a very very powerful (00:37:00) model. So if I actually have 20 minutes (00:37:01) I will copy paste my entire repo and I (00:37:03) go to GPT-5 Pro the oracle for like some (00:37:05) questions and often it's not too bad and (00:37:08) surprisingly good compared to what (00:37:09) existed a year ago. (00:37:10) >> Yeah. Um, but I do think that uh overall (00:37:12) the models are are um they're not there. (00:37:15) And I kind of feel like the industry (00:37:16) it's it's um it's over it's it's making (00:37:21) too big of a jump and it's trying to (00:37:23) pretend like this is amazing and it's (00:37:25) not. It's slop and I think they're not (00:37:27) coming to terms with it and maybe (00:37:28) they're trying to fundraise or (00:37:29) something like that. I'm not sure what's (00:37:30) going on but it's we're at this (00:37:32) intermediate stage. The models are (00:37:34) amazing.
They still need a lot of work (00:37:36) for now. autocomplete is my sweet spot (00:37:38) >> but sometimes for some types of code I (00:37:40) will go to an LLM agent. (00:37:41) >> Yeah. Yeah. (00:37:42) >> Actually this this is also here's (00:37:43) another reason why this is really (00:37:44) interesting. Um through the history of (00:37:47) programming there's been many (00:37:50) productivity improvements compilers (00:37:53) linting better programming languages etc (00:37:56) which have increased programmer (00:37:57) productivity but have not led to an (00:37:59) explosion. So that's like one that (00:38:01) sounds very much like autocomplete tab (00:38:04) and this other category is just like (00:38:06) automation of the programmer (00:38:07) >> and it's interesting you're seeing more (00:38:09) in the category of the historical (00:38:11) analogies of like you know better (00:38:13) compilers or something (00:38:14) >> maybe because this one other kind of (00:38:16) thought that is like (00:38:17) >> I do feel like I have a hard time (00:38:18) differentiating where AI begins and (00:38:20) stops because I do see AI as (00:38:22) fundamentally an extension of computing (00:38:23) in some in some pretty fundamental way (00:38:25) and I I feel like I see a continuum of (00:38:28) this kind of like recursive (00:38:29) self-improvement or like of speeding up (00:38:31) uh programmers all the way from the (00:38:32) beginning like even like I would say (00:38:34) like uh code editors (00:38:36) >> um uh syntax highlighting (00:38:39) >> uh syntax uh or like checking even of (00:38:41) the of the types like data type checking (00:38:44) >> um (00:38:45) >> all these kinds of tools that we've (00:38:46) built for each for each other even (00:38:47) search engines like why aren't search (00:38:49) engines part of AI like (00:38:50) >> I don't know like ranking is kind of AI (00:38:53) right at some point Google was like even (00:38:54) early on they were
thinking of (00:38:55) themselves as an AI company doing Google (00:38:57) search engine which I think is totally (00:38:58) fair (00:38:59) >> and So, I kind of see it as a lot more (00:39:00) of a continuum than I think other people (00:39:02) do and I don't it's hard for me to draw (00:39:03) the line and I kind of feel like okay, (00:39:05) we're now getting a much better (00:39:06) autocomplete and now we're also getting (00:39:08) some agents which are kind of like these (00:39:09) loopy things but they kind of go off (00:39:10) rails sometimes. Um, and what's going on (00:39:14) is that the human is progressively doing (00:39:16) a bit less and less of the low-level (00:39:17) stuff. For example, we're not writing (00:39:19) the assembly code because we have (00:39:20) compilers, (00:39:20) >> right? Like compilers will take my high (00:39:22) level language in C and write the (00:39:23) assembly code. So we're abstracting (00:39:25) ourselves very very slowly and there's (00:39:27) this what I call autonomy slider of like (00:39:29) more and more stuff is automated of the (00:39:30) stuff that can be automated at any point (00:39:32) in time and we're doing a bit less and (00:39:33) less and uh raising ourselves in the (00:39:36) layer of abstraction over the automation. (00:39:38) One of the big problems with RL is that (00:39:40) it's incredibly information sparse. (00:39:42) Labelbox can help you with this by (00:39:44) increasing the amount of information (00:39:46) that your agent gets to learn from with (00:39:48) every single episode. For example, one (00:39:50) of their customers wanted to train a (00:39:52) coding agent. So, Labelbox augmented an (00:39:54) IDE with a bunch of extra data (00:39:57) collection tools and staffed a team of (00:39:58) expert software engineers from their (00:40:00) Alignerr network to generate trajectories (00:40:03) that were optimized for training.
Now, (00:40:05) obviously, these engineers evaluated (00:40:07) these interactions on a pass/fail (00:40:08) basis, but they also rated every single (00:40:11) response on a bunch of different (00:40:12) dimensions like readability and (00:40:14) performance. And they wrote down their (00:40:16) thought processes for every single (00:40:18) rating that they gave. So you're (00:40:20) basically showing every single step an (00:40:22) engineer takes and every single thought (00:40:24) that they have while they're doing their (00:40:26) job. And this is just something you (00:40:28) could never get from usage data alone. (00:40:31) And so Labelbox packaged up all these (00:40:32) evaluations and included all the agent (00:40:35) trajectories and the corrective human (00:40:37) edits for the customer to train on. This (00:40:40) is just one example. So go check out how (00:40:42) Labelbox can get you high-quality (00:40:43) frontier data across domains, (00:40:46) modalities, and training paradigms. (00:40:48) Reach out at labelbox.com. (00:40:54) Let's talk about RL a bit. Uh you tweeted (00:40:56) some very interesting things about (00:40:58) this. Um conceptually, how should we (00:41:01) think about the way that humans are able (00:41:03) to build a rich world model just from (00:41:06) interacting with our environment and in (00:41:08) ways that seem almost irrespective of (00:41:11) the final reward at the end of the (00:41:12) episode. (00:41:13) >> Mhm. If somebody has, you know, (00:41:15) somebody's starting a business (00:41:16) and at the end of 10 years, she finds (00:41:18) out whether the business succeeded or (00:41:19) failed, (00:41:20) >> we say that she's earned a bunch of (00:41:21) wisdom and experience, (00:41:22) >> but it's not because like the log probs (00:41:24) of every single thing that happened over (00:41:26) the last 10 years are upweighted or (00:41:27) downweighted. 
It's something much more (00:41:28) deliberate and uh rich is happening. (00:41:31) What is the ML analogy and how does that (00:41:33) compare to what we're doing with other (00:41:34) ones right now? (00:41:35) >> Yeah, maybe the way I would put it is (00:41:36) humans don't use reinforcement learning (00:41:38) is maybe what I've as I've said it. I I (00:41:40) think they do something different which (00:41:41) is yeah you experience so reinforcement (00:41:43) learning is a lot worse than I think the (00:41:45) average person thinks (00:41:48) reinforcement learning is terrible. (00:41:50) It just so happens that uh everything (00:41:52) that we had before is much worse (00:41:56) u because previously we're just (00:41:57) imitating people so it has all these (00:41:58) issues. Um so in reinforcement learning (00:42:01) say you're working with uh you're (00:42:02) solving a math problem because it's very (00:42:04) simple. You're given a math problem and (00:42:06) you're trying to find the solution. Um (00:42:08) now in reinforcement learning you will (00:42:11) try lots of things in parallel first. So (00:42:14) uh you're given a problem you try (00:42:15) hundreds of different attempts and these (00:42:17) attempts can be complex right they can (00:42:19) be like oh let me try this let me try (00:42:20) that this didn't work that didn't work (00:42:22) etc. And then maybe you get an answer (00:42:24) and now you check the back of the book (00:42:25) and you see okay the correct answer is (00:42:27) this and then you can see that okay this (00:42:30) one this one and that one got the (00:42:31) correct answer but these other 97 of (00:42:33) them didn't. So literally what (00:42:34) reinforcement learning does is it goes (00:42:36) to the ones that worked really well and (00:42:38) every single thing you did along the way (00:42:39) every single token gets upweighted of (00:42:41) like do more of this. 
The problem with (00:42:43) that is I mean people will say that uh (00:42:45) your estimator has high variance but (00:42:47) what I mean is it's just noisy, it's noisy. (00:42:50) So basically it kind of almost assumes (00:42:52) that every single little piece of the (00:42:53) solution that you made that arrived at the (00:42:55) right answer was the correct thing to do (00:42:56) which is not true. Like you may have (00:42:57) gone down the wrong alleys until you (00:43:00) arrived at the right solution. Every single (00:43:01) one of those incorrect things you did, (00:43:03) as long as you got to the correct (00:43:04) solution, will be upweighted as do more of (00:43:05) this. It's terrible. (00:43:07) >> Yeah, it's noise. You've done all this (00:43:09) work only to find at the end, (00:43:11) you get a single number of like, oh, you (00:43:13) did correct. And based on that, you (00:43:15) weigh that entire trajectory as like (00:43:17) upweight or downweight. And so (00:43:19) the way I like to put it is you're (00:43:20) sucking supervision through a straw. Uh (00:43:22) because you've done all this work that (00:43:24) could be a minute of rollout and (00:43:25) you're like sucking the bits of (00:43:27) supervision of the final reward signal (00:43:28) through a straw and you're like putting (00:43:30) it You're like (00:43:32) basically like um yeah, you're (00:43:34) broadcasting that across the entire (00:43:35) trajectory and using that to upweight or (00:43:37) downweight that trajectory. It's crazy. A (00:43:39) human would never do this. Number one, a (00:43:41) human would never do hundreds of (00:43:42) rollouts. Uh number two, when a person (00:43:44) sort of finds a solution, they will have (00:43:47) a pretty complicated process of review (00:43:48) of like, okay, I think these parts (00:43:50) I did well, these parts I did not do (00:43:51) that well, I should probably do this or (00:43:54) that. And they think through things. 
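The outcome-reward scheme described above can be sketched in a few lines. This is a minimal illustration, not any lab's training code: the `Rollout` class, the `outcome_advantages` function, and the mean-reward baseline are all assumptions made for the example. It shows how a single final number gets broadcast to every token of a trajectory, so a dead-end step inside a successful rollout is upweighted exactly as much as the step that produced the answer.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list[str]   # every step taken, including dead ends
    final_answer: str   # the only thing the reward looks at


def outcome_advantages(rollouts, reference):
    """Broadcast one scalar per rollout across all of its tokens.

    Reward is an exact match against the reference answer. Subtracting the
    mean reward as a baseline is an assumption made for this sketch.
    """
    rewards = [1.0 if r.final_answer == reference else 0.0 for r in rollouts]
    baseline = sum(rewards) / len(rewards)
    per_token = []
    for rollout, reward in zip(rollouts, rewards):
        advantage = reward - baseline
        # Every token inherits the same advantage, right or wrong.
        per_token.append([(tok, advantage) for tok in rollout.tokens])
    return per_token


rollouts = [
    Rollout(["try x=2", "dead end", "try x=3", "answer 5"], "5"),
    Rollout(["guess", "answer 7"], "7"),
]
credit = outcome_advantages(rollouts, reference="5")
```

Here `credit[0]` assigns `+0.5` to `"dead end"` as well as to `"answer 5"`, which is the noise being complained about.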
(00:43:55) There's nothing in current LLMs that does (00:43:57) this. There's no equivalent of it. Um (00:43:59) but I do see papers popping out that are (00:44:01) trying to do this because it's obvious (00:44:03) to everyone in the field. (00:44:04) >> Yeah. (00:44:04) >> So I kind of see it as like the first (00:44:06) imitation learning actually by the way (00:44:07) was extremely surprising and miraculous (00:44:09) and amazing that we can uh fine-tune by (00:44:11) imitation on humans. Um and that was (00:44:13) incredible because in the beginning all (00:44:14) we had was base models. Base models are (00:44:16) autocomplete. Uh and it wasn't obvious (00:44:18) to me at the time uh and I had to learn (00:44:20) this and the paper that like blew my (00:44:23) mind was InstructGPT because it pointed (00:44:25) out that hey you can take the (00:44:26) pre-trained model which is autocomplete (00:44:28) and if you just fine-tune it on text (00:44:30) that looks like conversations the model (00:44:32) will very rapidly adapt to become very (00:44:34) conversational and it keeps all the (00:44:35) knowledge from pre-training and this (00:44:37) blew my mind because I didn't understand (00:44:39) that it just like stylistically can (00:44:40) adjust so quickly and become an (00:44:42) assistant to a user through just (00:44:44) a few loops of fine-tuning on that kind (00:44:46) of data. It was very miraculous to me (00:44:48) that that worked. So incredible. (00:44:50) And that was like two years, three years (00:44:52) of work. And now came RL. And RL allows (00:44:55) you to do a bit better than just (00:44:56) imitation learning, right? Because (00:44:58) you can have these um reward (00:45:00) functions and you can hill climb on the (00:45:01) reward functions. And so some problems (00:45:03) have just correct answers. You can hill (00:45:05) climb on that without getting expert (00:45:06) trajectories to imitate. So that's (00:45:08) amazing. 
And the model can also discover (00:45:10) solutions that the human might never (00:45:11) come up with. (00:45:12) >> Uh so this is incredible. And yet it's (00:45:14) still stupid. Um, so I think we need we (00:45:18) need more and so I saw a paper from (00:45:20) Google yesterday that tried to have this (00:45:21) reflect and review p um uh idea uh in (00:45:24) mind. Uh what was the memory bank paper (00:45:27) or something? I don't know. I've (00:45:29) actually seen a few papers along these (00:45:30) lines. So I expect there to be some kind (00:45:32) of a major update to how we do (00:45:34) algorithms for LLMs coming in that realm (00:45:37) and then I think we need three or four (00:45:38) or five more (00:45:41) um something like that. But you you're (00:45:43) so good at coming up with the evocative (00:45:45) evocative phrases sucking supervision (00:45:48) through a straw. It's like so good. Um (00:45:51) why hasn't So you're saying like your (00:45:53) problem with outcome based reward is (00:45:55) that you have this huge trajectory and (00:45:57) then at the end you're you're trying to (00:46:00) learn every single possible thing about (00:46:02) what you should do and what you should (00:46:03) learn about the world from that one (00:46:04) final bit. um why hasn't given the fact (00:46:08) that this is obvious, why hasn't process (00:46:09) based supervision (00:46:10) >> as an alternative been a successful way (00:46:12) to make models more capable? What what (00:46:15) has been preventing us from using this (00:46:16) alternative paradigm? (00:46:17) >> So process based supervision just refers (00:46:18) to the fact that we're not going to have (00:46:19) a reward function only at the very end (00:46:21) of after you've made 10 minutes of work. (00:46:22) I'm not going to tell you you did well (00:46:24) or not well. (00:46:24) >> I'm going to tell you at every single (00:46:25) step of the way how well you're doing. 
(00:46:27) >> Um and this is basically the reason we (00:46:29) don't have that is it's not trick it's (00:46:30) tricky how you do that properly. (00:46:32) >> Um because you have partial solutions (00:46:33) and you don't know how to assign credit. (00:46:35) So when you get the right answer, it's (00:46:37) just uh an equality match to the answer. (00:46:39) Very simple to implement. (00:46:41) >> If you're doing basically process (00:46:42) supervision, how do you assign an (00:46:44) automatable way partial credit (00:46:46) assignment? It's not obvious how you do (00:46:48) it. Lots of labs, I think, are trying to (00:46:49) do it with these LLM judges. So (00:46:51) basically, you get LLMs to try to do it. (00:46:52) So you prompt an LLM, hey, look at a (00:46:54) partial solution of a student. How well (00:46:56) do you think they're doing if the answer (00:46:57) is this? And they try to tune the (00:46:58) prompt. Um, the reason that I think this (00:47:01) is kind of tricky is quite subtle. And (00:47:03) it's the fact that anytime you use an (00:47:04) LLM to assign a reward, those LLMs are (00:47:07) giant things with billions of parameters (00:47:09) and they're gameable. (00:47:10) >> And if you're reinforcement learning (00:47:11) with respect to them, you will find (00:47:12) adversarial examples for your LM judges (00:47:15) >> almost guaranteed. (00:47:16) >> You can't do this for too long. You do (00:47:17) maybe 10 steps or 20 steps, maybe it (00:47:19) will work. But you can't do a hundred or (00:47:21) a thousand because it's not obvious (00:47:22) because um I know I understand it's not (00:47:24) obvious but basically the model will (00:47:26) find little cracks. (00:47:29) It will find all these like spurious (00:47:31) things in the nooks and crannies of the (00:47:32) giant model and find a way to cheat it. 
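A toy version of the contrast above, offered only as a sketch. The outcome reward really is a one-line equality match. The judge below is a deliberately flawed stand-in written for this example (`toy_judge`, `greedy_exploit`, and the tiny vocabulary are hypothetical and are not how any real LLM judge works); it only illustrates how optimizing against a gameable scorer finds nonsense that scores perfectly.

```python
def exact_match_reward(answer: str, reference: str) -> float:
    """Outcome reward: an equality match against the known answer."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def toy_judge(solution: list[str]) -> float:
    """Hypothetical flawed judge: fraction of tokens that 'look like' math."""
    if not solution:
        return 0.0
    mathy = {"+", "=", "therefore"}
    return sum(tok in mathy for tok in solution) / len(solution)


def greedy_exploit(vocab: list[str], length: int) -> list[str]:
    """Greedily append whichever token the judge scores highest."""
    solution: list[str] = []
    for _ in range(length):
        best = max(vocab, key=lambda tok: toy_judge(solution + [tok]))
        solution.append(best)
    return solution


vocab = ["2", "+", "3", "=", "5", "therefore", "banana"]
exploit = greedy_exploit(vocab, length=5)
judge_score = toy_judge(exploit)
true_reward = exact_match_reward(" ".join(exploit), "5")
```

The optimizer settles on a string of `+` signs: the toy judge gives it a perfect score while the equality match gives it zero.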
(00:47:35) So one example uh that's prominently in (00:47:37) my mind is, I think this was (00:47:39) probably public, but basically if you're (00:47:42) using an LLM judge for a reward so you (00:47:44) just give it a solution from a student (00:47:45) and ask it if the student did well or not. (00:47:47) We were training with reinforcement (00:47:49) learning against that reward function (00:47:51) >> and it worked really well and then um (00:47:53) suddenly the reward became extremely (00:47:55) large like it was a massive jump and it (00:47:57) did perfect and you're looking at it (00:47:58) like (00:47:59) >> wow this means the student is (00:48:01) perfect in all these problems it's fully (00:48:02) solved math (00:48:04) >> but actually what's happening is that (00:48:05) when you look at the completions that (00:48:06) you're getting from the model they are (00:48:08) complete nonsense they start out okay (00:48:09) and then they change to duh duh duh duh (00:48:11) so it's just like oh okay let's take two (00:48:13) plus three and we do this and this and (00:48:14) then duh duh duh duh duh (00:48:16) >> and you're looking at it and it's like this is (00:48:17) crazy. How is it getting a reward of one (00:48:19) or 100%? Um, and you look at the LLM (00:48:21) judge and it turns out the "duh duh duh duh" is an (00:48:23) adversarial example for the model and it (00:48:25) assigns 100% probability to it. And it's (00:48:27) just because this is an out of sample (00:48:29) example to the LLM. It's never seen it (00:48:31) during training and you're in pure (00:48:33) generalization land, (00:48:34) >> right? (00:48:34) >> It's never seen it during training. And (00:48:36) in the pure generalization land, you can (00:48:37) find these examples that uh break (00:48:40) it. (00:48:40) >> You're basically training the LLM to be (00:48:43) a prompt injection model. Not even that, (00:48:45) prompt injection is way too fancy. 
(00:48:47) You're finding adversarial (00:48:48) examples as they're called. These are (00:48:49) nonsensical uh solutions um that are (00:48:53) obviously wrong, but the model thinks (00:48:54) are amazing. (00:48:55) >> So to the extent you think this is the (00:48:57) bottleneck to making RL more functional, (00:49:00) then that will require making LLMs (00:49:02) better judges if you want to do this in (00:49:03) an automated way. And then so is it just (00:49:06) going to be like some sort of GAN-like (00:49:07) approach where you had to train models (00:49:08) to be more robust? Yeah. To (00:49:10) >> I think the labs are probably doing all (00:49:11) that like okay so the obvious thing is (00:49:13) like the "duh duh duh" should not get 100% reward. (00:49:15) Okay well take the "duh duh duh", put it in the training (00:49:17) set of the LLM judge and say this is not (00:49:18) 100% this is 0%. You can do this (00:49:20) >> but every time you do this you get a new (00:49:23) LLM and it still has adversarial (00:49:24) examples. There are infinitely many adversarial (00:49:26) examples. And I think probably if you (00:49:28) iterate this a few times, it'll probably (00:49:30) be harder and harder to find adversarial (00:49:31) examples, but I'm not 100% sure because (00:49:32) this thing has a trillion parameters or (00:49:34) whatnot. Um, so I bet you the labs (00:49:38) are trying. Uh, I (00:49:40) still think we need other (00:49:43) ideas. (00:49:45) >> Interesting. Do you have some shape (00:49:46) of what the other idea (00:49:49) >> could be? So like this idea of like (00:49:52) review um review a solution and come up with (00:49:55) synthetic examples such that when you (00:49:56) train on them you get uh you get better (00:49:58) and like meta-learn it in some way and (00:50:00) I think there's some papers that I'm (00:50:01) starting to see pop out. 
I only am at a (00:50:03) stage of like reading abstracts because (00:50:04) a lot of these papers, you know, they're (00:50:06) just ideas. Someone has to actually like (00:50:08) make it work on a frontier LLM lab scale (00:50:11) uh in full generality because when you (00:50:13) see these papers, they pop up and it's (00:50:14) just like a little bit of noisy, you (00:50:16) know, it's cool ideas, but I haven't (00:50:17) actually seen anyone convincingly uh (00:50:20) show that this is possible. That said, (00:50:22) the LLM labs are fairly closed. Uh so, (00:50:24) who knows what they're doing now, but (00:50:26) >> yeah. So I guess I can I I see a very um (00:50:29) not easy but like I I can conceptualize (00:50:32) how you would be able to train on (00:50:34) synthetic examples or synthetic problems (00:50:36) that you have made for yourself. But (00:50:37) there seems to be another thing humans (00:50:38) do. Maybe sleep is this, maybe (00:50:40) daydreaming is this (00:50:42) >> which is not necessarily come up with (00:50:44) fake problems but just like reflect. (00:50:46) >> Yeah. (00:50:47) >> And I'm not sure what the ML analogy (00:50:49) for, you know, daydreaming or sleeping (00:50:50) but just like just reflecting. I haven't (00:50:52) come up with any problem. Yeah, I mean (00:50:53) obviously the very basic analogy would (00:50:54) just be like fine-tuning on reflection (00:50:57) bits, but I feel like in practice that (00:50:59) probably wouldn't work that well. So I (00:51:00) don't know if you have some take on what (00:51:03) the analogy of like this thing is. (00:51:05) >> Yeah, I do think that that we're missing (00:51:06) some aspects there. 
So as an example, (00:51:08) >> uh when you're reading a book, um (00:51:11) >> I almost feel like currently when LLMs (00:51:12) are reading a book, what that means is (00:51:14) we stretch out the sequence of text and (00:51:16) the model is predicting the next token (00:51:18) and it's getting some knowledge from (00:51:19) that. That's not really what humans do, (00:51:20) right? So when you're reading a book, I (00:51:22) almost don't even feel like the book is (00:51:23) like exposition I'm supposed to be (00:51:25) attending to and training on. The book (00:51:26) is a is a set of prompts for me to do (00:51:29) synthetic data generation (00:51:30) >> or for you to get to a book club and (00:51:32) talk about it with your friends. And (00:51:33) it's by manipulating that information (00:51:35) that you actually gain that knowledge. (00:51:37) And I I think we have no equivalent of (00:51:39) that again with LLMs. They don't really (00:51:41) do that. But I'd love to see during (00:51:42) pre-training some kind of a stage that (00:51:44) uh thinks through the material and tries (00:51:46) to reconcile it with what it already (00:51:47) knows and thinks through for like some (00:51:49) amount of time. and um gets that to (00:51:51) work. And so there's no equivalence of (00:51:53) any of this. This is all research. (00:51:54) There's some subtle very subtle that I (00:51:56) think are very hard to understand (00:51:58) reasons why it's not trivial. So if I (00:52:00) can just describe one, (00:52:02) >> why can we just synthetically generate (00:52:03) and train on it? (00:52:04) >> Well, because every synthetic example (00:52:06) like if I just give synthetic generation (00:52:07) of the model thinking about a book, you (00:52:09) look at it and you're like, "This looks (00:52:10) great. Why can't I train on it?" Well, (00:52:12) you could try, but the model will (00:52:13) actually get much worse if you continue (00:52:14) trying. 
And that's because all of the (00:52:17) samples you get from models are silently (00:52:19) collapsed. They're silently, this is not (00:52:21) obvious if you look at any individual (00:52:22) example of it. They occupy a very tiny (00:52:24) manifold of the possible space of um (00:52:27) sort of thoughts about content. So the (00:52:29) LLMs when they come off, they're what we (00:52:31) call collapsed. They have a collapsed (00:52:32) data distribution. If you sample, one (00:52:35) easy way to see it is go to ChatGPT (00:52:37) and ask it, "Tell me a joke." It only has (00:52:39) like three jokes. (00:52:40) >> It's not giving you the whole breadth of (00:52:42) possible jokes. (00:52:42) >> It's giving you like it knows like three (00:52:44) jokes. Yeah, (00:52:45) >> they're silently collapsed. So (00:52:46) basically, you're not getting the (00:52:48) richness and the diversity and the (00:52:49) entropy uh from these models as you (00:52:51) would get from humans. So humans are a (00:52:53) lot noisier, but at least (00:52:55) they're not biased. They're not um in (00:52:57) a statistical sense, they're not (00:52:58) silently collapsed. They maintain a huge (00:53:00) amount of entropy. So how do you get (00:53:02) synthetic data generation to work (00:53:04) despite the collapse and while (00:53:05) maintaining the entropy is a research (00:53:07) problem. Um, just to make sure I (00:53:09) understood, the reason that the collapse (00:53:11) is relevant to synthetic data generation (00:53:12) is because you want to be able to come (00:53:13) up with synthetic problems or (00:53:16) reflections which are not already in (00:53:18) your data distribution. (00:53:20) >> I guess what I'm saying is um, say we (00:53:23) have a chapter of a book and I ask an LLM (00:53:25) to think about it. (00:53:26) >> Um, it will give you something that (00:53:27) looks very reasonable. 
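One way to make "collapsed" concrete is to measure the empirical entropy of repeated samples from the same prompt. The sketch below is illustrative only: the joke strings are made-up placeholders, not real model outputs, and `empirical_entropy_bits` is a helper written for this example.

```python
import math
from collections import Counter


def empirical_entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Placeholder data: ten samples from a "collapsed" source versus ten
# samples from a source that never repeats itself.
collapsed = ["joke A"] * 8 + ["joke B", "joke C"]
diverse = [f"joke {i}" for i in range(10)]

h_collapsed = empirical_entropy_bits(collapsed)
h_diverse = empirical_entropy_bits(diverse)
```

Each individual sample in `collapsed` may look fine on its own; the low entropy only shows up when the samples are looked at as a distribution.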
But if I ask it (00:53:29) 10 times, you'll notice that all of them (00:53:31) are the same. You can't just leave (00:53:33) scaling scaling quote unquote reflection (00:53:36) on the same amount of uh you know prompt (00:53:40) information and then get returns from (00:53:41) that. Okay. (00:53:41) >> Yeah. Yeah. So any individual sample (00:53:43) will look okay but the distribution of (00:53:45) it is is quite terrible and it's quite (00:53:47) terrible in such a way that if you (00:53:48) continue training on too much of your (00:53:49) own stuff you actually collapse. I (00:53:51) actually think that um there's no like (00:53:53) fundamental solutions to this possibly (00:53:54) and I also think humans collapse over (00:53:56) time. Uh I think this is uh again these (00:53:58) analogies are surprisingly good but (00:54:00) humans collapse during the course of (00:54:01) their lives. This is why children have (00:54:04) completely u you know they haven't (00:54:05) overfit yet and they will say stuff that (00:54:07) will shock you because it's kind of you (00:54:09) can see where they're coming from but (00:54:10) it's just not the thing people say (00:54:12) >> and because they're not yet collapsed (00:54:14) but we're collapsed. We end up (00:54:16) revisiting the same thoughts. we end up, (00:54:18) you know, saying more and more of the (00:54:20) same stuff and the learning rates go (00:54:21) down and uh the collapse continues to (00:54:23) get worse and then um everything (00:54:26) deteriorates. (00:54:27) >> Have Have you seen a super interesting (00:54:28) paper that dreaming is a way of (00:54:31) preventing this kind of overfitting and (00:54:33) collapse that the reason dreaming is uh (00:54:37) evolutionary adaptive is to (00:54:39) >> put you in weird situations that are (00:54:41) like very unlike your day-to-day reality (00:54:43) so that to prevent this kind of (00:54:44) >> overfitting. It's an interesting idea. 
I (00:54:45) mean, I do think that when you're (00:54:47) generating things in your head and then (00:54:49) you're attending to it, you're kind of (00:54:50) like training on your own samples. (00:54:51) You're training on your synthetic data (00:54:53) and if you do it for too long, you go (00:54:54) off rails um and you collapse way too (00:54:56) much. So, you always have to like seek (00:54:58) um entropy in your life. (00:55:00) >> Yeah. (00:55:01) >> Uh so talking to other people is a great (00:55:02) source of entropy (00:55:04) >> and uh things like that. So maybe the (00:55:06) brain has also built some internal (00:55:07) mechanisms uh for increasing the amount (00:55:09) of entropy um in in that process. But (00:55:13) yeah, maybe that's an interesting idea. (00:55:14) This is a very ill-formed thought. So I (00:55:16) I'll just put it out and let you react (00:55:18) to it. The best learners that we are (00:55:20) aware of, which are children, are (00:55:22) extremely (00:55:24) bad at recollecting information. In (00:55:26) fact, at the very earliest stages of (00:55:28) childhood, you will forget everything. (00:55:29) You're just an amnesiac about everything (00:55:31) that happens before a certain uh year (00:55:32) date, but you're like extremely good at (00:55:34) picking up new languages and learning (00:55:35) from the world. And maybe there's some (00:55:37) element of like being able to see the (00:55:38) forest for the trees. Whereas if you (00:55:40) compare it to the ex opposite end of the (00:55:41) spectrum, you have LLM pre-training (00:55:44) which these models will literally able (00:55:46) to regurgitate word for word what is the (00:55:48) next thing in a Wikipedia page, but (00:55:51) their ability to learn abstract concepts (00:55:53) really quickly the way a child can is (00:55:55) much more limited. 
And then adults are (00:55:57) somewhere in between where they don't (00:55:58) have the flexibility of childhood (00:56:00) learning, but they can, you know, adults (00:56:02) can memorize facts and information in a (00:56:04) way that is harder for kids. And I don't (00:56:06) know if there's something interesting (00:56:08) about that. I think there's something (00:56:09) very interesting about that. Yeah, 100%. (00:56:11) I do think that humans actually (00:56:13) >> um they do kind of like have a lot more (00:56:14) of an element, compared to LLMs, of seeing (00:56:16) the forest for the trees (00:56:18) >> and we're not actually that good at (00:56:19) memorization which is actually a (00:56:21) feature. (00:56:22) >> Um because we're not that good at (00:56:24) memorization, we actually are kind of (00:56:26) like forced to uh find the patterns uh (00:56:29) um in a more general sense. I (00:56:32) think LLMs in comparison are (00:56:33) extremely good at memorization. They (00:56:35) will recite passages from all these uh (00:56:37) training sources. Uh you can give them (00:56:39) completely nonsensical data like you can (00:56:41) take um you can hash some amount of text (00:56:43) or something like that. You get a (00:56:44) completely random sequence. If you train (00:56:45) on it even just I think a single (00:56:46) iteration or two it can suddenly (00:56:48) regurgitate the entire thing. It will (00:56:49) memorize it. There's no way a person can (00:56:51) read a single sequence of random numbers (00:56:53) and recite it to you. Um, and that's a (00:56:56) feature, not a bug almost. Uh, because (00:56:58) it forces you to like only learn the (00:56:59) generalizable components, whereas LLMs (00:57:02) are distracted by all the memory that (00:57:04) they have of the pre-training documents (00:57:06) and it's probably very distracting to (00:57:07) them, uh, in a certain sense. 
So that's (00:57:10) why when I talk about the cognitive (00:57:11) core, I actually want to remove the (00:57:12) memory, which is what we talked about. (00:57:14) I'd love to have it them have less (00:57:16) memory so that they have to look things (00:57:17) up uh and that they only maintain the (00:57:19) algorithms for like thought uh and the (00:57:22) idea of an experiment and all this (00:57:24) cognitive glue of um of acting (00:57:26) >> and this is also relevant to preventing (00:57:28) model collapse. (00:57:30) >> Um let me think um (00:57:35) I'm not sure I think it's almost like a (00:57:36) separate axis. M (00:57:37) >> it's almost like the the models are way (00:57:39) too good at uh memorization and somehow (00:57:41) we should we should remove that and I (00:57:43) think people people are much worse but (00:57:44) it's a good thing. (00:57:46) >> What is a solution to model collapse? I (00:57:48) mean you could so there's very naive (00:57:50) things you could attempt is just like (00:57:52) >> um the distribution over lo should be (00:57:55) wider or something like there's many (00:57:56) naive things you could try. What ends up (00:57:58) being the problem with the naive (00:57:59) approaches? (00:58:00) >> Um yeah I think that's a great question. (00:58:02) I mean you can imagine having a (00:58:03) regularization for entropy and things (00:58:04) like that. 
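For reference, "a regularization for entropy" usually means adding an entropy bonus to the training objective. The following is a generic textbook-style sketch, not a recipe anyone in the conversation describes; the function names and the `beta` coefficient are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_nats(probs):
    """Shannon entropy of a probability vector, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(logits, target_index, beta):
    """Negative log-likelihood minus beta times the output entropy.

    A larger beta rewards wider (higher-entropy) output distributions.
    """
    probs = softmax(logits)
    nll = -math.log(probs[target_index])
    return nll - beta * entropy_nats(probs)


flat = entropy_nats(softmax([0.0, 0.0, 0.0, 0.0]))     # widest possible
peaked = entropy_nats(softmax([10.0, 0.0, 0.0, 0.0]))  # nearly collapsed
```

The bonus pulls the output distribution wider; how large `beta` can get before outputs drift away from the training data is the part that has to be tuned.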
I guess they just don't work (00:58:06) as well empirically because uh right now (00:58:09) like the models are collapsed but I will (00:58:10) say um most of the tasks that we want of (00:58:13) them don't actually demand the diversity (00:58:17) >> is probably the the answer of what's (00:58:18) going on and so it's just that the model (00:58:20) the frontier labs are trying to make the (00:58:22) models useful and I kind of just feel (00:58:24) like the diversity of the outputs is not (00:58:25) so much number one it's much harder to (00:58:27) work with and evaluate and all this kind (00:58:28) of stuff but maybe it's not what's (00:58:30) actually capturing most of the value. (00:58:31) Um, (00:58:32) >> in fact, it's actively penalized, right? (00:58:34) If you if you're like super creative in (00:58:36) RL, it's like not good. (00:58:38) >> Yeah. Or like maybe if you're doing a (00:58:39) lot of writing help from LMS and stuff (00:58:40) like that, I think it's probably bad (00:58:41) because the models will give you these (00:58:43) like silently (00:58:44) >> all the same stuff, you know? So, (00:58:47) they're not um they won't explore lots (00:58:48) of different ways of answering a (00:58:50) question, right? (00:58:51) >> But I kind of feel like maybe this (00:58:52) diversity is just not as big of um yeah, (00:58:55) maybe like yeah, not as many (00:58:56) applications need it, so the models (00:58:57) don't have it, but then it's actually a (00:58:58) problem at synthetic generation time, (00:58:59) etc. So we're actually shooting (00:59:00) ourselves in the foot by not allowing (00:59:02) this entropy to maintain in the model. (00:59:04) And I think possibly uh the labs should (00:59:06) try harder. (00:59:07) >> And then I think you hinted that it's a (00:59:09) it's a very fundamental problem. It (00:59:11) won't be easy to solve. And yeah, what's (00:59:13) your intuition for that? 
(00:59:14) >> I don't actually know if it's um super (00:59:16) fundamental. Uh I don't actually know if (00:59:18) I intended to say that. I do think (00:59:20) that um (00:59:22) I haven't done these experiments, but I (00:59:24) do think that you could probably (00:59:25) regularize the entropy to be (00:59:26) higher. So you're encouraging the model (00:59:28) to give you more and more solutions. Um (00:59:30) but you don't want it to start deviating (00:59:32) too much from the training data. It's (00:59:33) going to start making up its own (00:59:34) language. It's going to start using (00:59:35) words that are extremely rare. Uh you (00:59:37) know so it's going to drift too much (00:59:39) from the distribution. Uh so I think (00:59:41) controlling the distribution is just (00:59:42) like tricky, it's just like someone (00:59:44) just has to (00:59:45) >> it's probably not trivial in that sense. (00:59:48) >> How many bits should the optimal core (00:59:52) >> of intelligence end up being if you just (00:59:54) had to make a guess? The thing we put on (00:59:56) the uh von Neumann (00:59:57) >> probes, how big does it have to be? (01:00:00) >> So it's really interesting in the (01:00:01) history of the field because at one (01:00:03) point everything was very um scaling-pilled (01:00:05) in terms of like oh we're going to (01:00:06) make much bigger models, trillions of (01:00:08) parameter models and actually what the (01:00:09) models have done in size is they've gone (01:00:11) up and now they've actually kind of like (01:00:14) actually even come down, the models are (01:00:16) smaller (01:00:17) >> and even then I actually think they (01:00:18) memorized way too much. Um, so I think I (01:00:21) had a prediction a while back that I (01:00:23) almost feel like we can get cognitive (01:00:24) cores that are very good at even like a (01:00:26) billion parameters. 
It it should (01:00:28) be already like like if you talk to a (01:00:30) billion parameter model I think in 20 (01:00:32) years you can actually have a very (01:00:33) productive conversation. It thinks um (01:00:36) and it's a lot more like a human. But if (01:00:38) you ask it some factual question it might (01:00:39) have to look it up but it knows that it (01:00:41) doesn't know and it might have to look (01:00:42) it up and it will just do all the (01:00:43) reasonable things. >> That that's actually (01:00:44) surprising that you think it will take a (01:00:46) billion because already we have (01:00:47) billion parameter models or a couple (01:00:49) billion parameter models that are like (01:00:50) very intelligent. (01:00:51) >> Well, some of our models are like a (01:00:53) trillion parameters, right? But they (01:00:54) remember so much stuff like just (01:00:56) >> Yeah. But I'm surprised that in 10 years (01:00:59) given the pace, okay, we have GPT-OSS (01:01:03) 20B that's way better than GPT-4 original (01:01:07) which was a trillion plus uh parameters. (01:01:10) So given that trend, I'm actually (01:01:11) surprised you think in 10 years the (01:01:13) cognitive core is still a billion (01:01:15) parameters. I would I'm surprised you're (01:01:16) not like it's going to be like uh tens (01:01:18) of millions or millions. (01:01:20) >> No, because I basically think that the (01:01:21) training data is so here's the issue. (01:01:23) The training data is the internet which (01:01:24) is really terrible. (01:01:26) >> So there's a huge amount of gains to be (01:01:27) made because the internet is terrible. (01:01:28) Like if you actually and even the (01:01:30) internet when you and I think of the (01:01:31) internet, you're thinking of like a Wall (01:01:32) Street Journal or (01:01:34) >> that's not what this is. 
When you're (01:01:35) actually looking at a pre-training dataset (01:01:36) in the frontier lab and you look at a (01:01:38) random internet document, it's total (01:01:40) garbage. Like I don't even know how this (01:01:42) works at all. It's some like stock (01:01:44) ticker symbols. Uh (01:01:47) >> it's a huge amount of slop and garbage (01:01:49) from like all the corners of the (01:01:50) internet. It's not like your Wall Street (01:01:51) Journal article that's extremely rare. (01:01:53) >> Um so I almost feel like because the (01:01:55) internet is so terrible, we actually (01:01:57) have to sort of like build really big (01:01:58) models to compress all that. Uh most of (01:02:01) that compression is memory work instead (01:02:03) of like cognitive work. But what we (01:02:04) really want is the cognitive part (01:02:06) actually delete the memory (01:02:07) >> and then so I guess what I'm saying is (01:02:09) like we need intelligent models to help (01:02:12) us refine even the pre-training set to (01:02:14) just narrow it down to the cognitive (01:02:15) components and then I think you get away (01:02:17) with a much smaller model because it's a (01:02:18) much better data set and you could train (01:02:20) it on it but probably it's not trained (01:02:22) directly on it. It's probably distilled (01:02:23) from a much better model still but (01:02:24) >> but why is the distilled version still a (01:02:26) billion is I guess the thing I'm curious (01:02:28) about. (01:02:29) >> I just feel like distillation works (01:02:30) extremely well. So um almost every small (01:02:32) model if you have a small model it's (01:02:34) almost certainly distilled. Why would (01:02:35) you train on (01:02:36) >> right? No no but why is a distillation (01:02:37) not in 10 years not getting below 1 (01:02:39) billion. (01:02:40) >> Oh you think it should be smaller than a (01:02:41) billion? 
(01:02:42) >> I mean come on right I don't know at (01:02:45) some point uh it should take at least a (01:02:47) billion knobs uh to do something (01:02:49) interesting. You're thinking it should (01:02:50) be even smaller. (01:02:51) >> Yeah. I mean just like if you look at (01:02:52) the trend over the last few years just (01:02:54) finding low hanging fruit and going from (01:02:56) like trillion plus models to models that are like (01:02:58) literally two orders of magnitude (01:03:00) smaller in a matter of two years and (01:03:02) having better performance. (01:03:03) >> Yeah. (01:03:04) >> It makes me think the the sort of like (01:03:06) core of intelligence might be (01:03:08) >> even way way smaller like plenty of room (01:03:10) at the bottom to to paraphrase Feynman. (01:03:12) >> I mean I almost feel like I'm already (01:03:13) contrarian by talking about a billion (01:03:14) parameter cognitive core and you're (01:03:16) outdoing me. I think um yeah maybe we (01:03:19) could get a little bit smaller. I mean, (01:03:20) I still think that there should be (01:03:21) enough. (01:03:22) >> Yeah, maybe it can be smaller. (01:03:23) >> I do think that practically speaking, (01:03:25) you want the model to have some (01:03:26) knowledge. You don't want it to be (01:03:27) looking up everything. (01:03:28) >> Um because then you can't like think in (01:03:30) your head. You're looking up way too (01:03:31) much stuff all the time. So, I do think (01:03:32) it needs to be some basic curriculum (01:03:34) needs to be there for knowledge. (01:03:36) >> Uh but it doesn't have esoteric (01:03:38) knowledge, you know. (01:03:38) >> Yeah. So, we're discussing what like (01:03:40) plausibly could be the cognitive core. (01:03:41) There's a separate question which is (01:03:43) what will actually be the size of frontier (01:03:46) models over time? And I'm curious to (01:03:47) have a prediction. 
So we had increasing (01:03:50) scale up to maybe 4.5 and now we're (01:03:52) seeing decreasing/plateauing scale. (01:03:55) There's many reasons that could be going (01:03:56) on but do you have a prediction about (01:03:58) going forward will scale will the (01:04:00) biggest models be bigger? Will they be (01:04:01) smaller? Will they be the same? (01:04:03) >> Um yeah I don't know that I have a super (01:04:05) strong prediction. I do think that the (01:04:07) labs are just being practical. They have (01:04:09) a flops budget and a cost budget. And it (01:04:11) just turns out that pre-training is not (01:04:12) where you want to put most of your flops (01:04:14) or your cost. So that's why the models (01:04:15) have gotten smaller because they are a (01:04:17) bit smaller. or the pre-training stage is (01:04:18) smaller etc but they make it up in (01:04:20) reinforcement learning and all this kind (01:04:21) of stuff mid training and all this kind (01:04:22) of stuff that follows (01:04:23) >> uh so they're just being practical in (01:04:25) terms of all the stages and how you get (01:04:26) the most bang for the buck um so I guess (01:04:28) like forecasting that trend I think uh (01:04:30) is quite hard I do still expect that (01:04:32) there's so much low-hanging fruit that's my (01:04:33) basic that's my basic expectation um and (01:04:38) so I I have a very wide distribution (01:04:40) here um >> do you expect the low-hanging fruit (01:04:42) to be similar in kind to the kinds of (01:04:45) things that have been happening over the (01:04:47) last two to five years like just in terms of (01:04:49) like if I look at nanochat versus (01:04:52) nanoGPT and then the architectural tweaks (01:04:53) you made (01:04:54) >> is that basically like the flavor of (01:04:55) things you continue to keep happening or (01:04:57) is there you're not expecting any giant (01:05:00) >> part yeah I I expect the data sets to (01:05:02) get much much better because when you (01:05:03) look 
at the average data sets they're (01:05:04) extremely terrible like so bad that I (01:05:06) don't even know how anything works to be (01:05:07) honest like look at the average example (01:05:08) in the training set (01:05:10) >> like factual mistakes errors yeah (01:05:13) >> nonsensical things um somehow when you (01:05:15) do it at scale the the noise washes away (01:05:18) and you're left with some of the signal. (01:05:20) Um so data sets will improve a ton. It's (01:05:22) just everything gets better. So um our (01:05:25) hardware um all the kernels um uh all (01:05:28) the kernels for running the hardware and (01:05:29) maximizing what you get with the (01:05:30) hardware, you know. So NVIDIA is slowly (01:05:32) tuning the actual hardware itself tensor (01:05:34) course and so on. All that needs to (01:05:35) happen and will continue to happen. Uh (01:05:37) all the kernels will get better and (01:05:38) utilize the chip to the max extent. all (01:05:40) the algorithms will probably improve (01:05:42) improve over optimization architecture (01:05:43) and um just all the modeling components (01:05:45) of how everything is done and what the (01:05:47) algorithms are that we're even training (01:05:48) with. So I do I do kind of expect like a (01:05:51) just very just everything nothing (01:05:53) dominates everything plus 20%. (01:05:57) >> Right. Interesting. (01:05:58) >> This is like roughly what I've seen. (01:05:59) >> Okay. This is my general manager Max. (01:06:02) >> Good to be here here every day. (01:06:03) >> And you have been here since you were (01:06:04) onboarded about 6 months ago. But when I (01:06:06) was (01:06:06) >> months ago (01:06:07) >> Oh, right. Um, time passes so fast. But (01:06:10) when I on boarded you, I was in France (01:06:12) and so we basically didn't get the (01:06:14) chance to talk at all almost (01:06:16) >> and you basically just gave me one (01:06:18) login. 
(01:06:19) >> I gave you access to my Mercury (01:06:21) platform, which is the banking platform (01:06:23) that I was using at the time to run the (01:06:24) podcast. (01:06:25) >> And so I logged into Mercury assuming (01:06:26) that that would just be the first of (01:06:27) many steps, but I realized that was how (01:06:30) you were running the entire business, (01:06:32) even down to a lot of our editors are (01:06:34) international contractors. And so you (01:06:35) had just figured out how to set up these (01:06:37) recurring payments to set up basic (01:06:39) payroll. (01:06:39) >> I mean, Mercury made the experience of (01:06:41) all of these things I was doing before (01:06:42) so seamless that it didn't even occur to (01:06:44) me until you pointed it out that this is (01:06:45) not the natural way to set up payroll or (01:06:48) invoicing or any of these other things. (01:06:50) >> I I was surprised, but I was like, it's (01:06:51) worked so far, so maybe I'll trust it. (01:06:54) And then now I can't think of doing (01:06:55) anything else. (01:06:56) >> All right, you heard him. Visit (01:06:58) mercury.com to apply online in minutes. (01:07:01) Cool. Thanks, Max. (01:07:02) >> Thanks for having me. (01:07:03) >> Dude, you're great at this. I'm so (01:07:04) nervous, but thank you. (01:07:06) >> Mercury is a financial technology (01:07:07) company, not a bank. Banking services (01:07:09) provided through Choice Financial Group, (01:07:11) column NA, and Evolve Bank and Trust (01:07:12) members FDIC. People have proposed (01:07:15) different ways of charting how much (01:07:18) progress we've made towards full AGI (01:07:21) because if you can come up with some (01:07:23) line, then you can see where that line (01:07:24) intersects with AGI and where that would (01:07:26) happen on the X-axis. 
And so people have (01:07:28) proposed, oh, it's like the education (01:07:30) level, like we had a high schooler and (01:07:31) then then they went to college with RL (01:07:33) and they're going to get a PhD. I don't (01:07:34) like that one. (01:07:35) >> Um or and then they'll propose horizon (01:07:37) length. So maybe they can do tasks that (01:07:39) take a minute. Uh they can do those (01:07:41) autonomously, then they can autonomously (01:07:43) do tasks that take an hour, a human an (01:07:44) hour, a human a week, etc. (01:07:46) >> How do you think about what is the (01:07:49) relevant um y-axis here? What is the how (01:07:52) should we think about how AI is making (01:07:54) progress? (01:07:54) >> So I guess I have two answers to that. (01:07:56) Number one, I'm almost tempted to like (01:07:58) reject the question entirely because (01:07:59) again like I see this as an extension of (01:08:00) computing. Have we talked about like how (01:08:02) to chart progress in computing or how do (01:08:04) you chart progress in computing since (01:08:05) 1970s or whatever. What is the x-axis? (01:08:08) So I kind of feel like the whole (01:08:09) question is kind of like funny from that (01:08:10) perspective a little bit. Um but I will (01:08:12) say I guess like when people talk about (01:08:14) AI and the original AGI and how we spoke (01:08:16) about it when we um when OpenAI started (01:08:19) >> AGI was a system you can go to that can (01:08:22) do any task that is economically (01:08:24) valuable any economically valuable task (01:08:26) at um human performance or better. (01:08:29) >> Okay. So that was the definition and I (01:08:31) was pretty happy with that at the time (01:08:32) and I kind of feel like I've stuck to (01:08:33) that definition forever and then people (01:08:35) have made up all kinds of other (01:08:36) definitions but I I like I feel like I (01:08:39) like that definition. 
Now, number one, (01:08:41) the first concession that people make (01:08:43) all the time is they just take out all (01:08:44) the physical stuff because we're just (01:08:46) talking about digital knowledge work. I (01:08:48) feel like that's a pretty major (01:08:49) concession compared to the original (01:08:50) definition which was like any task a (01:08:52) human can do. I can lift things, etc. (01:08:54) Like AI can't do that obviously. So, (01:08:56) okay, but we'll take it. (01:08:57) >> Uh, what fraction of the economy are we (01:08:59) taking away by saying, "Oh, only (01:09:01) knowledge work." Um, I don't actually (01:09:03) know the numbers. I feel like um it's (01:09:04) about 10 to 20% if I had to guess. Is um (01:09:07) is only knowledge work. uh like someone (01:09:10) could work from home and perform tasks (01:09:11) something like that. Um I still think (01:09:14) it's a really large market. Uh like um (01:09:16) yeah what is the size of the economy and (01:09:18) what is 10 20% like we're still talking (01:09:20) about few trillion dollars of even in (01:09:22) the US of market share almost or like (01:09:25) work. (01:09:26) >> Um so still a very massive bucket. So (01:09:28) but I guess like going back to the (01:09:30) definition I guess what I would be (01:09:31) looking for is uh to what extent is that (01:09:33) definition true? Uh so um are there jobs (01:09:36) or lots of tasks? 
If we think of tasks (01:09:38) as you know not jobs but tasks kind of (01:09:41) difficult because the problem is like (01:09:43) society will refactor based on the tasks (01:09:46) that make up jobs compared to what's (01:09:47) yeah based on what's automatable or not (01:09:49) but today what jobs are replaceable by (01:09:51) AI so a good example recently was um (01:09:55) Geoff Hinton's prediction that (01:09:56) radiologists would not be a job anymore (01:09:58) and this turned out to be very wrong in (01:09:59) a bunch of ways right so radiologists (01:10:01) are alive and well and growing even (01:10:03) though computer vision is really really (01:10:04) good at recognizing all the different (01:10:06) things that they have to recognize in images (01:10:07) and it's just a messy complicated job with (01:10:10) a lot of surfaces and dealing with (01:10:11) patients and all this kind of stuff in (01:10:12) the context of it. Um so I guess I don't (01:10:16) actually know that by that definition AI (01:10:18) has made a huge amount of dent yet. Um (01:10:21) but some of the some of the jobs maybe (01:10:22) that I would be looking for have some (01:10:24) features that I think make it very (01:10:25) amenable to automation earlier than (01:10:27) later. As an example, call center (01:10:28) employees often come up and I think (01:10:30) rightly so. Uh because call center (01:10:32) employees have a number of simplifying (01:10:34) uh properties with respect to what's (01:10:35) automatable today. um their jobs are (01:10:39) pretty simple. It's a sequence of tasks (01:10:41) and every task looks similar like you (01:10:43) take a phone call with a person, it's 10 (01:10:44) minutes of interaction or whatever it (01:10:46) is, probably a bit longer in my (01:10:47) experience, a lot longer. Um and you (01:10:50) complete some task in some scheme and (01:10:52) you change some database entries around (01:10:54) or something like that. 
So you keep (01:10:55) repeating something over and over again (01:10:56) and that's your job. So basically you do (01:10:59) want to bring in the task horizon how (01:11:01) long it takes to perform a task. (01:11:03) >> And then you want to also remove context (01:11:05) like you're not dealing with different (01:11:06) parts of services of companies or other (01:11:08) customers. It's just the database you (01:11:10) and a person you're serving. And so it's (01:11:12) more closed. It's more understandable (01:11:14) and it's purely digital. So I I would be (01:11:16) looking for those things. But even there (01:11:18) I'm not actually looking at full (01:11:19) automation yet. I'm looking for an (01:11:21) autonomy slider and I almost expect that (01:11:23) we are not going to instantly replace (01:11:25) people. We're going to be swapping in (01:11:27) AIs that do 80% of the volume. They (01:11:29) delegate 20% of the volume to humans and (01:11:31) humans are supervising teams of five AIs (01:11:33) doing the call center work that's more (01:11:35) rote. Um so I would be looking for new (01:11:38) interfaces or new um companies that (01:11:40) provide some kind of a layer that allows (01:11:43) you to manage some of these AIs that are (01:11:45) not yet perfect. (01:11:46) >> Yeah. (01:11:47) >> And then I would expect that across the (01:11:48) economy and a lot of jobs are a lot (01:11:50) harder than call center employee. I (01:11:52) wonder with radiologists, (01:11:54) I'm totally speculating. I have no idea (01:11:56) how what the actual workflow of a (01:11:57) radiologist involves, (01:11:59) >> but one analogy that might be applicable (01:12:01) is um when Waymos were first being rolled (01:12:05) out, there would be a person sitting in (01:12:07) the front seat (01:12:08) >> and you just had to have them there to (01:12:10) make sure that if something went really (01:12:11) wrong, they're there to monitor. 
And I (01:12:13) think even today, people are still (01:12:14) watching to make sure things are going (01:12:15) well. Um Robotaxi, which was just (01:12:17) deployed, actually still has a person (01:12:18) inside it. And we we could be in a (01:12:20) similar situation where if you automate (01:12:23) 99% of a job, that last 1% the human has (01:12:26) to do is incredibly valuable because (01:12:28) it's bottlenecking everything else. And (01:12:30) if it was the case (01:12:32) like with radiologists where the person (01:12:33) sitting in the front of the Uber or the (01:12:34) front of the Waymo has to be specially (01:12:36) trained for years in order to be able to (01:12:37) provide the last 1%. Their wages should (01:12:39) go go up tremendously because they're (01:12:41) like the one the one thing bottlenecking (01:12:42) wide deployment. So radiologists I think (01:12:44) their wages have gone up for similar (01:12:46) reasons. if you're like the last (01:12:47) bottleneck, you should you're like and (01:12:49) you're not fungible, which like you know (01:12:50) a Waymo driver might be fungible with (01:12:52) other things. Um so you might see this (01:12:54) thing where like your wages go like (01:12:56) >> and until you get to 90% and then like (01:12:58) just like that (01:12:59) >> and when the last 1% is gone. (01:13:00) >> I see. (01:13:01) >> Um and I wonder if we're seeing similar things (01:13:03) with radiology or salaries of call (01:13:05) center workers or anything like that. (01:13:07) >> Yeah, I think that's that's an (01:13:08) interesting um question. I don't think (01:13:10) we're currently seeing that with (01:13:11) radiology or uh and I don't have like um (01:13:15) in my understanding but I think (01:13:16) radiology is not a good example (01:13:17) basically. I don't know why Geoff Hinton (01:13:19) picked on radiology uh because I think (01:13:21) it's an extremely messy messy (01:13:23) complicated profession. 
(01:13:24) >> Yeah. (01:13:25) >> Uh so I would be a lot more interested (01:13:26) in what's happening with call center (01:13:27) employees today for example uh because I (01:13:29) would expect a lot of the rote stuff to (01:13:31) be uh automatable today (01:13:32) >> and I don't have a first level access to (01:13:34) it but maybe I would be looking for (01:13:35) trends of what's happening with the call (01:13:37) center employees. Maybe some of the (01:13:39) things I would also expect is maybe they (01:13:41) are uh swapping in AI but then I would (01:13:43) still wait for a year or two because I (01:13:46) would potentially expect them to pull (01:13:47) pull back and actually rehire some of (01:13:48) the people. (01:13:49) >> I think there's been evidence that (01:13:50) that's already been happening in the (01:13:52) generally like companies that have been (01:13:53) adopting AI which I think is quite (01:13:54) surprising and I also find what was (01:13:56) really surprising. (01:13:58) >> Okay. Um AGI right like a thing which (01:14:01) should do everything and okay we'll take (01:14:03) out physical work. So the thing should be (01:14:05) able to do all knowledge work. And what (01:14:07) you would have naively anticipated that (01:14:09) the way this progression would happen is (01:14:10) like you take a little task that a (01:14:14) consultant is doing, you take that out (01:14:16) of the bucket. You take a little task (01:14:17) that um an accountant is doing, you take (01:14:20) that out of the bucket. Uh and then (01:14:22) you're just doing this across all (01:14:23) knowledge work. But instead, if we do (01:14:25) believe we're on the path of AGI with the (01:14:26) current paradigm, the progression is (01:14:28) very much not like that. at least um (01:14:30) >> it just does not seem like consultants (01:14:32) and accountants or whatever are getting (01:14:33) like huge productivity improvements. 
It's (01:14:34) very much like (01:14:36) >> programmers are like getting more and (01:14:39) more chiseled away of their work. If (01:14:40) you look at the revenues of these (01:14:41) companies discounting just like normal (01:14:43) chat revenue which I think is like I (01:14:45) don't know that's similar to like Google (01:14:47) or something just looking at API (01:14:50) revenues it's like dominated by coding (01:14:51) right so this thing which is general (01:14:54) quote unquote should be able to do any (01:14:55) knowledge work is just overwhelmingly (01:14:57) doing only coding and it's a surprising (01:15:00) way that you would expect like the AGI (01:15:02) to be deployed (01:15:03) >> so I think there's there's an (01:15:04) interesting point here because I do (01:15:06) believe coding is like the perfect first (01:15:08) thing for uh for a for uh these LLMs and (01:15:11) uh agents and that's because coding has (01:15:13) always fundamentally uh worked around (01:15:16) text. (01:15:17) >> It's computer terminals and text and (01:15:19) everything is based around text and LLMs (01:15:21) the way they're trained on the internet (01:15:23) love text (01:15:24) >> and so they're perfect text processors (01:15:26) and there's all this data out there and (01:15:27) it's just perfect fit. Um and also we (01:15:30) have a lot of infrastructure pre-built (01:15:31) for handling uh code and text. So for (01:15:34) example, we have Visual Studio Code or (01:15:36) you know um your favorite um uh IDE (01:15:39) showing you code um and an agent can (01:15:42) plug into that. So for example, if an (01:15:43) agent has a diff where it made some (01:15:45) change, we suddenly have all this code (01:15:46) already that shows all the differences (01:15:48) to a codebase uh using a diff. So we've (01:15:51) it's almost like we've pre-built a lot (01:15:53) of the a lot of the infrastructure for (01:15:55) code. 
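[Editor's illustration, not part of the spoken transcript.] The point about pre-built diff tooling for text can be made concrete with a minimal sketch using Python's standard `difflib` module. The helper name `show_diff`, the file name `example.py`, and the two snippet versions are hypothetical, chosen only for illustration; they stand in for any change an agent might make to a text file.

```python
import difflib


def show_diff(before: str, after: str, name: str = "example.py") -> str:
    """Return a unified diff between two versions of a text file.

    `name` is a hypothetical file name used only for the diff header.
    """
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    return "".join(diff_lines)


# Two versions of a tiny file: the "after" version is what an agent proposed.
before = "def greet(name):\n    print('hello', name)\n"
after = "def greet(name):\n    print('Hello,', name)\n"

diff_text = show_diff(before, after)
print(diff_text)
```

Because the change is plain text, the removed and added lines come out marked with `-` and `+` with no extra tooling, which is the kind of ready-made review surface the speaker says slides currently lack.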
Now contrast that with some of the (01:15:57) things that that don't enjoy that at (01:15:58) all. So as an example like um there's (01:16:00) people trying to build automation not (01:16:02) for coding but for example for slides (01:16:04) like I saw a company doing slides that's (01:16:06) much much harder and the reason it's (01:16:07) much much harder is because slides are (01:16:08) not text. (01:16:09) >> Yeah. (01:16:10) >> Slides are little graphics and they're (01:16:12) arranged spatially and uh there's visual (01:16:14) component to it and um and slides uh (01:16:17) don't have this pre-built (01:16:18) infrastructure. Like for example if an (01:16:20) agent is to make a different uh change (01:16:21) to your slides. How does a thing show (01:16:23) you the diff? How do you see the diff? (01:16:25) There's no there's no nothing that shows (01:16:27) diffs for slides. Mhm. (01:16:28) >> So someone has to build it. Um so it's (01:16:30) just some of these things are not (01:16:32) amenable to AIs as they are which is (01:16:35) text processors and code surprisingly (01:16:37) is. (01:16:37) >> I I actually I'm not sure if that alone (01:16:40) explains it because (01:16:42) I personally have tried to get LLMs to be (01:16:46) useful in domains which are just pure (01:16:49) language in language out. Um like (01:16:52) rewriting transcripts, like coming up (01:16:54) with clips based on transcripts, etc. (01:16:56) And you might say, well, I didn't, it's (01:16:58) very plausible that like I didn't do (01:16:59) every single possible thing I could do (01:17:00) to I put a bunch of, you know, good (01:17:03) examples in context, but maybe I should (01:17:04) have done like some kind of fine tuning, (01:17:06) whatever. So, our mutual friend Andy (01:17:07) Matuschak told me that he actually tried (01:17:11) 50 billion things to try to get models (01:17:14) to be good at writing spaced repetition (01:17:15) prompts. 
Again, (01:17:16) >> very much language in, language out (01:17:19) task. The kind of thing that should be (01:17:20) dead center in the repertoire of these (01:17:22) LLMs. And he tried in context learning (01:17:24) obviously with a few-shot examples. He (01:17:26) tried I think he told me like a bunch of (01:17:28) things like supervised fine-tuning and (01:17:31) like you know retrieval whatever and he (01:17:34) just could not get them to make cards to (01:17:36) a satisfaction. So I find it striking (01:17:38) that even in language-in, language-out domains (01:17:41) >> it's actually very hard to get a lot of (01:17:43) economic value out of these models (01:17:45) >> separate from coding. And I don't know (01:17:46) what what explains it. (01:17:47) >> Yeah I think um I think that makes (01:17:49) sense. I mean I would say um yeah it's (01:17:52) I'm not saying that anything text is (01:17:53) trivial right uh I do think that code is (01:17:56) like it's pretty structured um text is (01:17:59) maybe a lot more flowery and this and (01:18:01) there's a lot more like (01:18:03) >> uh like entropy in text I would say I (01:18:05) don't know how else to put it um (01:18:07) >> and also I mean code is hard and so (01:18:09) people sort of feel quite empowered by (01:18:11) LLMs even from like simple simple kind (01:18:14) of uh knowledge I basically I don't (01:18:17) actually know that I have um a very good (01:18:19) I mean obviously like text makes it much (01:18:20) much easier maybe is maybe why I put it (01:18:22) but it doesn't mean that all text is (01:18:24) trivial. (01:18:24) >> Mhm. How do you think about super (01:18:27) intelligence? Do you expect it to feel (01:18:29) qualitatively different from normal (01:18:33) humans or human companies? (01:18:35) >> I guess I think I see it as like a (01:18:37) progression of automation in society (01:18:38) right and again like extrapolating the trend (01:18:40) of computing. 
I just feel like there (01:18:42) will be a gradual automation of a lot of (01:18:44) things and super intelligence will be (01:18:45) sort of like the extrapolation of that. (01:18:47) Uh so I do think we expect more and more (01:18:48) autonomous entities over time that are (01:18:50) doing a lot of the digital work and then (01:18:52) eventually even the physical work uh (01:18:54) probably some amount of time later but (01:18:56) basically I see it as just uh automation (01:18:59) >> um roughly speaking (01:19:00) >> I guess automation includes the things (01:19:02) humans can already do and super (01:19:03) intelligence things humans (01:19:05) >> well but some of the things that people (01:19:06) do is invent new things which I would (01:19:08) just put into the automation if that (01:19:09) makes sense. >> Yeah. But you I I guess (01:19:12) maybe um less abstractly and more sort (01:19:16) of like qualitatively. (01:19:18) >> Do you expect something to feel like (01:19:20) okay this because this thing can either (01:19:23) think so fast or has so many copies or (01:19:26) the copies can merge back in themselves (01:19:29) or is quote unquote much smarter. any (01:19:32) number of advantages an AI might have. (01:19:35) The civilization in (01:19:37) which these AIs exist will just feel (01:19:39) qualitatively different from (01:19:39) human civilization. (01:19:40) >> I think it will I mean it is (01:19:41) fundamentally automation but I mean it (01:19:42) will be like extremely foreign. I do I (01:19:44) do think it will look really strange (01:19:46) because um like you mentioned we can run (01:19:48) all of this on a computer cluster etc (01:19:51) and much faster and all this thing. 
(01:19:52) Yeah, I mean maybe some of the scenarios (01:19:54) for example that uh I start to get like (01:19:56) nervous about with respect with respect (01:19:58) to when the world looks like that is (01:19:59) this kind of like gradual loss of (01:20:00) control and understanding of what's (01:20:01) happening and I think that's actually (01:20:02) the most likely outcome probably is that (01:20:04) there will be a gradual loss of (01:20:06) understanding of (01:20:07) >> and we'll gradually layer all this stuff (01:20:09) everywhere and there will be fewer and (01:20:11) fewer people who understand it and that (01:20:12) there will be a sort of this like (01:20:13) scenario of a gradual loss of control (01:20:15) and understanding of what's happening (01:20:17) that to me seems most likely outcome of (01:20:19) how all this stuff will go down. Let me (01:20:21) probe on that a bit. It's not clear to (01:20:23) me that loss of control and loss of (01:20:25) understanding are the same things. (01:20:28) >> A board of directors at like whatever (01:20:31) TSMC, Intel, name a random company. (01:20:34) >> Um they're just like prestigious (01:20:36) 80year-olds. They have very little (01:20:38) understanding and maybe they don't (01:20:39) practically actually have control, but (01:20:42) >> or actually maybe a better example is (01:20:44) the president of the United States. (01:20:46) >> President has a lot of [ __ ] power. Um (01:20:49) I'm not trying to make a good statement (01:20:50) about the current operant, but maybe I (01:20:53) am. But like the actual level of (01:20:54) understanding is very different from the (01:20:55) level of control. (01:20:56) >> Yeah, I think that's fair. That's a good (01:20:58) push back. I think like um I guess I (01:21:01) expect loss of uh both. (01:21:05) >> Yeah. (01:21:05) >> How come? I mean loss of understanding (01:21:07) is obvious, but why loss of control? 
So, (01:21:10) so we're really far into territory of I (01:21:13) don't know what this looks like, but if (01:21:14) I was to write sci-fi novels, they would (01:21:16) look along the lines of not even a (01:21:19) single like entity or something like (01:21:20) that. So, that just sort of like takes (01:21:22) over everything. Uh, but actually like (01:21:24) multiple competing entities that (01:21:25) gradually become more and more (01:21:26) autonomous and uh some of them go rogue (01:21:29) and the others like fight them off and (01:21:30) all this kind of stuff. And it's like (01:21:31) this this hot pot of (01:21:33) >> completely autonomous activity that (01:21:35) we've uh delegated to. I I kind of feel (01:21:38) like (01:21:40) it would have that flavor. (01:21:42) >> It is not the fact that they are smarter (01:21:44) than us that is resulting in the loss of (01:21:45) control. It's the fact that they are (01:21:47) competing with each other and whatever (01:21:50) um arises out of that competition that (01:21:52) leads to the loss of control. (01:21:54) >> Um (01:21:56) I mean I basically expect there to be I (01:21:58) mean um a lot of these things I mean (01:22:00) they will be tools to people and the (01:22:02) people could some of the population is (01:22:03) like they're acting on behalf of people (01:22:06) or something like that. Maybe those (01:22:07) people are in control, but maybe it's a (01:22:08) loss of control overall for society in (01:22:10) in the sense that of like outcomes we (01:22:12) want or something like that. Um where (01:22:14) you have entities acting on behalf of (01:22:15) individuals that are still kind of uh (01:22:18) roughly seen as out of control. (01:22:19) >> Yeah. Yeah. (01:22:20) >> This is a question I should have asked (01:22:21) earlier. 
So we were talking about how (01:22:23) currently it feels like when you're (01:22:24) doing AI engineering or AI research, (01:22:27) these models are more like in the (01:22:28) category of compiler rather than uh in (01:22:31) the category of a replacement. (01:22:32) >> Yeah. At some point, if you have (01:22:34) quote-unquote AGI, it should be able to (01:22:35) do what you do. (01:22:37) >> And do you feel like having a million (01:22:39) copies of you in parallel results in (01:22:41) some huge speedup of AI progress? (01:22:43) Basically, if that does happen, would (01:22:45) you see do you expect to see an (01:22:46) intelligence explosion or even once we (01:22:49) have not talking about LLMs today, but (01:22:50) really (01:22:51) >> I guess like what I mean is um I do, but (01:22:54) it's business as usual because we're (01:22:56) we're in an intelligence explosion (01:22:58) already and have been for decades. And (01:22:59) when you look at GDP, it's basically the (01:23:01) GDP curve that is an exponential (01:23:02) weighted sum over so many aspects of the (01:23:04) industry. Everything is gradually being (01:23:06) automated, has been for hundreds of (01:23:08) years. Um, industrial revolution is (01:23:10) automation of some of the physical (01:23:11) components and the tool building and all (01:23:12) this kind of stuff. Compilers are early (01:23:14) software automation, etc. Uh, so I kind (01:23:16) of feel like we've been recursively (01:23:18) self-improving and uh exploding for for (01:23:21) a long time. 
Maybe another way to see it (01:23:22) is um I mean Earth was a pretty I mean (01:23:26) if you don't look at the biomechanics (01:23:27) and so on it was a pretty boring place I (01:23:29) think and looked very similar if you (01:23:30) just look from space and earth is (01:23:32) spinning and then like we're in the (01:23:33) middle of this like firecracker event (01:23:36) >> right (01:23:36) >> but we're seeing it in slow motion but (01:23:38) >> I definitely feel like this is this has (01:23:41) already happened for a very long time (01:23:42) and I again like I I don't see AI as (01:23:45) like a distinct technology with respect (01:23:47) to what has already been happening for a (01:23:48) long time. So you think it's (01:23:50) continuous with this hyper-exponential (01:23:52) trend? (01:23:52) >> And that's why like this is this was (01:23:54) very interesting to me because I was I (01:23:56) was trying to find AI in the GDP for a (01:23:57) while. I thought that GDP should go up (01:23:59) but then I looked at some of the other (01:24:01) technologies that I thought were were (01:24:03) very transformative like uh maybe (01:24:05) computers or mobile phones or etc. You (01:24:07) can't find them in GDP. GDP is the same (01:24:08) exponential and it's just that even for (01:24:10) example the early iPhone uh didn't have (01:24:12) the app store and it didn't have a lot (01:24:14) of the bells and whistles that the (01:24:15) modern iPhone has. And so even though we (01:24:16) think of 2008 was it when iPhone came (01:24:19) out as like some major seismic change, (01:24:21) it's actually not. Everything is like so (01:24:22) spread out and so slowly diffuses that (01:24:25) everything ends up being averaged up (01:24:26) into the same exponential. And it's the (01:24:28) exact same thing with computers. You (01:24:29) can't find them in the GDP as like, oh, we (01:24:30) have computers now. 
(01:24:31) >> That's not what happened because it's (01:24:33) such a slow progression. And with AI, (01:24:34) we're going to see the exact same thing. (01:24:35) It's just more automation. It allows us (01:24:37) to write different kinds of programs (01:24:38) that we couldn't write before. But AI is (01:24:40) still fundamentally a program and um (01:24:43) it's a new kind of computer and a new (01:24:45) kind of um kind of computing system, but (01:24:47) it has all these problems. It's going to (01:24:48) diffuse over over time and it's still (01:24:50) going to add up to the same exponential (01:24:52) and we're still going to have an (01:24:53) exponential that's going to get (01:24:54) extremely vertical and it's going to be (01:24:57) very foreign to live in that kind of an (01:24:59) environment. Are you saying that like (01:25:01) what will happen is so if you go if you (01:25:03) look at the trend before the industrial (01:25:04) revolution to currently you have a hyper (01:25:06) exponential where you go from like 0% (01:25:09) growth to then 10,000 years ago 0.02% (01:25:12) growth and then currently we're at 2% (01:25:14) growth. So that's a hyper exponential (01:25:15) and you're saying if you're charting AI (01:25:16) on there then it's like AI takes you to (01:25:18) 20% growth or 200% growth (01:25:20) >> or you could be saying if you look at (01:25:22) the last 300 years what you've been (01:25:24) seeing is you have technology after (01:25:25) technology computers electrification (01:25:27) steam steam engines railways etc (01:25:30) >> but the rate of growth is the exact same (01:25:32) it's 2%. 
So are you saying the rate of (01:25:34) growth will (01:25:36) >> directly I expect this the rate of (01:25:38) growth has also stayed roughly constant (01:25:40) right (01:25:40) >> for only the last 200-300 years but over (01:25:42) the course of human history it's like (01:25:44) exploded right it's like gone from like (01:25:45) 0% basically to like faster faster (01:25:48) faster industrial explosion 2% (01:25:50) >> like basically I guess what I'm saying (01:25:51) is for a while I tried to find AI or (01:25:53) look for AI in like the GDP curve and (01:25:55) I've kind of convinced myself that this (01:25:56) is false and that even when people talk (01:25:58) about recursive self-improvement and (01:25:59) labs and stuff like that I even don't (01:26:01) this is business as usual of course it's (01:26:02) going to recursively self-improve and (01:26:04) it's been recursively self-improving (01:26:05) like LLMs allow the engineers to work (01:26:08) much more efficiently to build the next (01:26:10) round of LLMs and a lot more of the (01:26:12) components are being automated and and (01:26:13) tuned and etc. So all the engineers (01:26:16) having access to Google search is is (01:26:18) sort of part of it. All the engineers (01:26:20) having an IDE all all of them having (01:26:22) autocomplete or having Claude Code etc. (01:26:23) It's all just part of the same speedup (01:26:26) of the whole thing. So um it's just so (01:26:29) smooth. (01:26:31) >> But just just to clarify you're saying (01:26:32) that the rate of growth will not change (01:26:34) like um you know the intelligence (01:26:36) explosion will show up as like we it (01:26:38) just enabled us to continue staying on (01:26:39) the 2% growth trajectory just as the (01:26:41) internet helped us stay on the 2% growth (01:26:42) trajectory. (01:26:43) >> Yeah. My expectation is that it stays (01:26:44) the same pattern. (01:26:46) >> Yeah. 
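To make the growth figures quoted in this exchange more concrete (roughly 0.02% per year historically, 2% today, and the hypothetical 20% or 200%), the following is a minimal sketch that converts a constant annual growth rate into a doubling time. The function name `doubling_time_years` and the `RATES` table are illustrative assumptions, not anything stated by either speaker, and real economic growth is of course not a constant-rate process.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years for output to double at a constant compound annual growth rate.

    annual_growth_rate is a fraction, e.g. 0.02 for 2% per year.
    """
    if annual_growth_rate <= 0:
        return math.inf  # no growth means output never doubles
    # Solve (1 + r) ** t == 2 for t.
    return math.log(2) / math.log1p(annual_growth_rate)


# Rates mentioned in the conversation, expressed as fractions per year.
RATES = {"0.02%": 0.0002, "2%": 0.02, "20%": 0.20, "200%": 2.00}

if __name__ == "__main__":
    for label, rate in RATES.items():
        years = doubling_time_years(rate)
        print(f"{label:>6} growth: output doubles in about {years:,.1f} years")
```

Under this simplification, 2% growth doubles output about every 35 years, while 0.02% growth takes several thousand years, which is the gap the "hyper-exponential" framing refers to.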
I mean, um, just to throw the (01:26:49) opposite argument against you, my (01:26:51) expectation is that it like, um, blows (01:26:54) up because I think true AGI, and I'm not (01:26:57) talking about LLM coding bots, I'm (01:26:58) talking about like actual this is like a (01:27:00) replacement of a human in a server (01:27:03) >> is qualitatively different from these (01:27:06) other productivity-improving (01:27:07) technologies (01:27:09) >> because it's labor itself, right? I (01:27:11) think we live in a very labor (01:27:12) constrained world. Like if you talk to (01:27:14) any startup founder, any person, you can (01:27:15) just be like, okay, what do you need (01:27:16) more of? You just like need really (01:27:18) talented people. And if you just have (01:27:20) billions of extra people who are (01:27:22) inventing stuff, integrating themselves, (01:27:24) making companies from start to (01:27:27) finish, that feels qualitatively (01:27:28) different from just like a single (01:27:30) technology. It's sort of like just (01:27:31) asking if you like if you get 10 billion (01:27:32) extra people on the planet. (01:27:33) >> I mean, maybe a counterpoint. I mean, (01:27:35) number one, I I'm actually pretty um (01:27:37) pretty willing to be convinced one way (01:27:39) or another on this point. But I will (01:27:40) say, for example, computing is labor. (01:27:42) Computing was labor. Computers like a (01:27:44) lot of jobs disappeared because (01:27:45) computers are automating a bunch of (01:27:47) digital uh information processing that (01:27:49) you now don't need a human for. And so (01:27:51) computers are labor. Um and that has (01:27:53) played out. Um and you know, (01:27:56) self-driving as an example is also like (01:27:57) computers doing labor. Uh so like I (01:28:00) guess that's already been playing out. (01:28:01) So it's still business as usual. 
Yeah, I (01:28:03) guess you have a machine which is just (01:28:04) spitting out more things like that (01:28:06) >> at a potentially faster pace. And so we (01:28:08) historically we have examples of the (01:28:10) growth regime changing where like you (01:28:12) went from you know 0.2% growth to 2% (01:28:14) growth. (01:28:15) >> So it seems very plausible to me that (01:28:17) like (01:28:17) >> a machine which is then spitting out the (01:28:21) next self-driving car and the next (01:28:22) internet and whatever. (01:28:23) >> I mean I kind of yeah I see where it's (01:28:26) coming from. At the same time, I do feel (01:28:27) like people make this assumption of (01:28:28) like, okay, we have (01:28:30) >> uh God in a box and now it can do (01:28:32) everything and it's just it just won't (01:28:33) look like that. It's going to be it's (01:28:35) going to be able to do some of the (01:28:36) things. It's going to fail at some other (01:28:37) things. It's going to be gradually put (01:28:39) into society and basically we'll end up (01:28:40) with the same pattern is my prediction (01:28:42) because because this assumption of (01:28:44) suddenly having a completely intelligent (01:28:46) uh fully flexible, fully general human (01:28:48) uh in a box and we can dispense it at (01:28:49) arbitrary problems in society. I I I (01:28:52) don't think that we will have this like (01:28:54) discrete change and um and so I I think (01:28:58) we'll arrive at the same at the same (01:29:00) kind of a gradual diffusion of this (01:29:02) across the industry. 
Mm, I I think what (01:29:05) often ends up being misleading in these (01:29:07) um conversations is people I don't like (01:29:10) to use the word intelligence in this (01:29:11) context because intelligence implies you (01:29:13) think like oh a (01:29:15) superintelligence will be sitting there, there will (01:29:16) be a single superintelligence sitting (01:29:17) in a server and it'll like divine how to (01:29:19) come up with new technologies and (01:29:20) inventions that causes this explosion (01:29:23) >> and that's not what I'm imagining when (01:29:24) I'm imagining 20% growth (01:29:26) >> I'm imagining that there's billions of (01:29:30) you know basically like very smart human (01:29:33) minds potentially or that's all that's (01:29:34) required. But the fact that there's (01:29:36) hundreds of millions of them, billions (01:29:38) of them, each individually making new (01:29:41) products, figuring out how to integrate (01:29:42) themselves into the economy, just the (01:29:44) way if like a highly experienced smart (01:29:46) immigrant came to the country, you (01:29:47) wouldn't need to like figure out how we (01:29:48) integrate them in the economy. They'd (01:29:49) figure it out, they could start a company, (01:29:50) they could like uh make inventions, you (01:29:53) know, or like just increase productivity (01:29:54) in the world. And we have examples even (01:29:56) in the current regime of places that (01:29:58) have had 10-20% economic growth. You (01:30:01) know, if you just have a lot of people (01:30:03) and less capital in comparison to the (01:30:05) people, you can have Hong Kong or (01:30:08) Shenzhen or whatever just had decades of (01:30:11) 10% plus growth. And I think it's (01:30:13) just like there's a lot of really smart (01:30:15) people who are ready to like make use of (01:30:16) the resources and do this like period of (01:30:19) catch-up because we've had this (01:30:20) discontinuity. 
And I think yeah, it (01:30:22) might be similar. So, I think um I I (01:30:24) think I understand, but I still think (01:30:26) that you're presupposing some discrete (01:30:27) jump. There's some unlock that we're (01:30:29) waiting to claim (01:30:30) >> and suddenly we're going to have (01:30:31) geniuses in data centers. And I I still (01:30:34) think you're presupposing some discrete (01:30:36) jump that I think has basically no (01:30:37) historical precedent that I can't find (01:30:39) in any of the statistics and that I (01:30:41) think probably won't happen. (01:30:42) >> I mean, the industrial revolution is (01:30:43) such a jump, right? You went from like (01:30:44) 0% growth or 0.2% growth to 2% growth. Um (01:30:47) I'm just saying like you'll see another (01:30:48) jump like that. (01:30:49) >> I I I'm a little bit suspicious. I would (01:30:51) have to look at it. I I'm a little bit (01:30:52) suspicious and I would have to take a (01:30:54) look. For example, like maybe some (01:30:55) of the logs are are not very good from (01:30:57) before the industrial revolution or (01:30:58) something like that. Uh so I'm a little (01:31:00) bit suspicious of it, but um yeah, maybe (01:31:02) you're right. I don't I don't have (01:31:03) strong opinions. (01:31:04) >> Maybe you're saying that this was a (01:31:06) singular event that was extremely (01:31:07) magical and you're saying that maybe (01:31:08) there's going to be another event that's (01:31:09) going to be just like that, extremely (01:31:10) magical. It will break the paradigm and so (01:31:13) on. (01:31:13) >> I actually don't think I mean the (01:31:14) crucial thing about the industrial (01:31:15) revolution was that it was not magical, (01:31:17) right? Like if you just zoomed in (01:31:20) >> what you would see in 1770 or 1870, (01:31:25) >> it's not that there was like some key (01:31:27) invention. (01:31:28) >> Yeah, exactly. 
But at the same time, you (01:31:30) did move the economy to a regime where (01:31:32) the progress was much faster (01:31:34) >> and the exponential 10xed (01:31:36) >> and I expect a similar thing from AI (01:31:38) where it's not like (01:31:39) >> there's going to be a single moment (01:31:40) where we made the crucial (01:31:42) >> overhang that's being unlocked like (01:31:44) maybe there's a new energy source (01:31:45) there's there's some unlock in this case (01:31:47) some kind of a cognitive capacity and (01:31:49) there's an overhang of (01:31:50) cognitive work to do. That's right. (01:31:52) >> And you're expecting that overhang to be (01:31:54) filled by this new technology when it (01:31:55) crosses the threshold. (01:31:56) >> Yeah. And I mean I maybe one way to (01:31:57) think about it is through history a lot (01:31:59) of growth I mean growth comes because (01:32:02) people come up with ideas and then (01:32:03) people are like out there doing stuff to (01:32:06) execute those ideas and make valuable (01:32:08) output (01:32:09) >> and through most of this time population (01:32:11) has been exploding. That has been driving (01:32:12) growth. For the last 50 years people have (01:32:14) argued that growth has stagnated. (01:32:16) Population in frontier countries has (01:32:18) also stagnated. I think we go back to the (01:32:20) hyper-exponential growth in population and (01:32:22) output (01:32:23) >> right sorry exponential growth in (01:32:24) population that causes hyper-exponential (01:32:26) growth in output. (01:32:27) >> Yeah. I mean, um, yeah, it's really hard (01:32:29) to tell. (01:32:30) >> I understand that viewpoint. I don't (01:32:32) intuitively feel that viewpoint. (01:32:34) >> So, we just got access to Google's Veo (01:32:37) 3.1, and it's been really cool to play (01:32:40) around with. 
The first thing we did was (01:32:42) run a bunch of prompts through both Veo 3 (01:32:44) and 3.1 to see what's changed in the new (01:32:47) version. So, here's Veo 3. (01:32:50) >> Hi, I'm Max and I got stuck in a local (01:32:52) minimum again. (01:32:53) >> It's okay, Max. We've all been there. (01:32:55) Took me three epochs to get out. (01:32:57) >> And here is Veo 3.1. (01:32:59) >> Hi, I'm Max and I got stuck in a local (01:33:02) minimum again. (01:33:03) >> It's okay, Max. We've all been there. (01:33:05) Took me three epochs. (01:33:07) >> 3.1's output is just consistently more (01:33:10) coherent and the audio is noticeably (01:33:12) higher quality. We've been using Veo for (01:33:14) a while now. Actually, we released an (01:33:16) essay earlier this year about AI firms (01:33:18) fully animated by Veo 2, and it's been (01:33:20) amazing to see how fast these models are (01:33:23) improving. This update makes Veo even (01:33:25) more useful in terms of animating our (01:33:28) ideas and our explainers. You can try Veo (01:33:30) right now in the Gemini app with Pro and (01:33:33) Ultra subscriptions. You can also access (01:33:35) it through the Gemini API or through (01:33:37) Google Flow. You recommended Nick Lane's (01:33:40) book to me and then on that basis I (01:33:42) also found it super interesting and I (01:33:44) interviewed him. Um and so I actually (01:33:46) have some questions about sort of (01:33:46) thinking about intelligence and (01:33:47) evolutionary history. Now that you, over (01:33:50) the last 20 years of doing AI research, (01:33:52) maybe have a more tangible sense of (01:33:54) what intelligence is, what it takes to (01:33:57) develop it. Are you more or less (01:34:00) surprised as a result that evolution (01:34:02) just sort of spontaneously (01:34:05) stumbled upon it? (01:34:07) >> Um, I love Nick's books by the way. So, (01:34:10) um, yeah, I was just listening to to his (01:34:12) podcast on the way up here. 
With respect (01:34:14) to intelligence and its evolution, I do (01:34:16) claim it came fairly (01:34:18) >> I mean it's very very recent, right? Um (01:34:21) I am surprised that it evolved. Yeah, I (01:34:23) I find it fascinating to think about all (01:34:24) the worlds out there. Like say there's a (01:34:26) thousand planets like Earth and what (01:34:27) they look like. I think Nick Lane was here (01:34:28) talking about some of the early parts, (01:34:30) right? Like (01:34:30) >> okay, he expects basically very similar (01:34:33) life forms roughly speaking and bacteria-like (01:34:35) things in most of them. (01:34:36) >> Yeah. (01:34:36) >> And then there's a few breaks in there. (01:34:39) I would expect that um the evolution of (01:34:41) intelligence intuitively feels to me (01:34:42) like it should be a fairly rare event and (01:34:44) there have been animals for I guess (01:34:46) maybe you should base it on how long (01:34:47) something has existed. So for (01:34:49) example, if bacteria have been around (01:34:50) for 2 billion years and nothing happened (01:34:52) then going to eukaryotes is probably (01:34:53) pretty hard cuz um cuz bacteria actually (01:34:56) um came up quite early in Earth's (01:34:58) evolution or history. (01:35:00) >> Um (01:35:01) >> and so I guess um how long have we had (01:35:03) animals? Maybe a couple hundred million (01:35:04) years like multicellular animals that (01:35:06) like run crawl etc. (01:35:08) um which is maybe 10% of um Earth's (01:35:11) lifespan or something like that. So I (01:35:13) maybe on that time scale it's actually not (01:35:15) not too tricky. I still feel like (01:35:18) it's still surprising to me I think (01:35:19) intuitively that it developed. I would (01:35:21) maybe expect just a lot of like (01:35:22) animal-like life forms doing animal-like (01:35:24) things. Uh the fact that you can get (01:35:26) something that creates culture and (01:35:27) knowledge Yeah. 
and accumulates it is is (01:35:29) it is surprising to me that okay so (01:35:32) there's so there's actually a couple of (01:35:33) interesting follow-ups. (01:35:35) If you buy this uh Sutton perspective that (01:35:38) actually the crux of intelligence is (01:35:41) animal intelligence. What the quote said (01:35:42) is if you got to the squirrel you'd be (01:35:44) most of the way to AGI. Um (01:35:46) >> then we got to squirrel intelligence I (01:35:48) guess right after the Cambrian explosion (01:35:50) 600 million years ago. (01:35:51) >> It seems like what instigated that was (01:35:54) the oxygenation event 600 million years (01:35:56) ago. But immediately the sort of like (01:35:57) intelligence algorithm was there to like (01:35:59) make the the squirrel intelligence, (01:36:02) right? So it's suggestive that animal (01:36:04) intelligence was like that as soon as (01:36:07) you had the oxygen in the environment (01:36:08) you had the eukaryote you could just like (01:36:10) get the algorithm. Um I maybe there was (01:36:13) like sort of an accident that evolution (01:36:15) stumbled upon it so fast but I don't know (01:36:16) if that suggests it's like actually quite (01:36:18) uh at the end going to be quite simple. (01:36:20) >> Yes basically it's so hard to tell right (01:36:22) with any of this stuff. I guess you can (01:36:23) base it a little bit on how long (01:36:25) something has existed or how long it (01:36:26) feels like something has been (01:36:27) bottlenecked. 
So Nick Lane was very good at describing (01:36:30) this like very apparent bottleneck in (01:36:32) bacteria for years like extreme (01:36:35) diversity of chemical biochemistry and (01:36:38) yet nothing that grows to become (01:36:41) >> animals two billion years um I I don't (01:36:44) know that we've seen exactly that kind (01:36:46) of an equivalent with animals and (01:36:47) intelligence uh to your point right but (01:36:49) I guess maybe we could also look at it (01:36:51) with respect to how many times we think (01:36:52) intelligence has like individually (01:36:55) sprung up (01:36:56) >> that's a really good that's a really (01:36:57) good thing to investigate. (01:36:58) >> Maybe one thought on that is I almost (01:37:00) feel like um well there's the hominid (01:37:03) intelligence and there's I would say (01:37:04) like the bird intelligence right like (01:37:06) ravens etc are extremely clever but uh (01:37:08) their brain parts are (01:37:10) actually quite distinct and we don't (01:37:11) have that much um (01:37:13) >> in common so maybe (01:37:15) there's a slight indication of (01:37:17) maybe intelligence springing up a few (01:37:18) times and so in that case you'd maybe (01:37:20) expect it more frequently or something (01:37:21) like that. Yeah, a former guest Gwern and (01:37:25) also Carl Shulman have made a (01:37:27) really interesting point about that (01:37:28) which is their perspective is that the (01:37:32) scalable algorithm which humans have and (01:37:34) primates have (01:37:35) >> arose in birds as well (01:37:38) >> and maybe other times as well. But (01:37:41) humans found an evolutionary niche which (01:37:43) rewarded marginal increases in (01:37:45) intelligence. (01:37:46) >> Um and also had a scalable brain (01:37:49) algorithm that could achieve those (01:37:51) increases in intelligence. 
(01:37:52) >> And so for example if a bird had a (01:37:55) bigger brain it would just like collapse (01:37:56) out of the air. So it's very smart for (01:37:58) the size of its brain but it's like it's (01:38:00) not in a niche which rewards the brain (01:38:02) getting bigger. (01:38:03) >> Um yeah (01:38:04) >> maybe similar with some really smart (01:38:06) >> dolphins etc. (01:38:07) >> Exactly. Yeah. Whereas humans, you know, (01:38:09) like we have hands that like reward (01:38:11) being able to learn how to do tool use. (01:38:12) We can externalize digestion, more (01:38:14) energy to the brain (01:38:15) >> and that um kicks off the flywheel. (01:38:17) >> Oh, yeah. And just stuff to work with. I (01:38:19) mean, I'm guessing it would be harder (01:38:20) if I was a dolphin. (01:38:22) >> Um I mean, how do you do you can't have (01:38:24) fire for example and stuff like that? I (01:38:25) mean, probably like the universe of (01:38:27) things you can do in water um like (01:38:29) inside water is probably lower than what (01:38:31) you can do on land um just chemically, (01:38:33) >> right? Yeah, I do I do agree with this (01:38:35) with this viewpoint of these niches and (01:38:36) what's what's being incentivized. I (01:38:38) still find it kind of miraculous that uh (01:38:41) I don't I I would have maybe expected (01:38:43) things to get stuck on like animals with (01:38:45) bigger muscles, you know? (01:38:46) >> Yeah. (01:38:47) >> Like going through intelligence is (01:38:48) actually a really fascinating uh (01:38:51) breaking point. The way Gwern put it (01:38:52) is the reason it was so hard is it's a (01:38:55) very tight line between being in a (01:38:56) situation where something is so (01:38:59) important to learn (01:39:01) that it's not just worth distilling the (01:39:03) exact right circuits directly back into (01:39:06) your DNA (01:39:07) >> versus it's not important enough to (01:39:09) learn at all. (01:39:10) >> Yeah. 
(01:39:10) >> It has to be something which is like (01:39:12) >> you have to incentivize building the (01:39:15) algorithm to learn in lifetime. (01:39:17) >> Yeah. Exactly. You have to incentivize (01:39:18) some kind of adaptability. You actually (01:39:19) want something that you actually want (01:39:21) environments that are unpredictable. So (01:39:22) evolution can't bake your algorithms (01:39:24) into your weights. A lot of um a lot of (01:39:26) animals are basically pre-baked in this (01:39:28) sense and so humans have to figure it (01:39:30) out at test time when they get born. And (01:39:31) so maybe there was um you actually want (01:39:34) these kinds of uh environments that (01:39:36) actually change really rapidly or (01:39:37) something like that where you can't (01:39:38) foresee um what will work well and so (01:39:40) you actually put all that, (01:39:42) you create intelligence to figure it out (01:39:43) at test time. Uh so Quintin Pope had (01:39:46) this interesting blog post where he's (01:39:47) saying the reason he doesn't expect a (01:39:49) sharp takeoff is um the so humans had (01:39:53) the sharp takeoff where 60,000 years ago (01:39:55) we seem to have had the cognitive (01:39:56) architectures that we have today (01:39:59) >> and 10,000 years ago agricultural (01:40:00) revolution modernity dot dot dot. What (01:40:03) was happening in that 50,000 years? (01:40:04) >> Well, you had to build this sort of like (01:40:06) cultural scaffold where you can (01:40:08) accumulate knowledge over generations. 
(01:40:12) This is an ability that exists for free (01:40:14) in the way we do AI training where if (01:40:17) you retrain a model it can still I mean (01:40:19) in many cases they're literally (01:40:20) distilled but they can be trained on (01:40:22) each other you know they can be trained (01:40:23) on the same pre-training corpus um (01:40:25) they don't literally have to start from (01:40:27) scratch so there's a sense in which the (01:40:29) thing which it took humans a long time (01:40:31) to get this cultural loop going just (01:40:33) comes for free with the way we do LLM (01:40:35) training. Um, yes and no because LLMs (01:40:38) don't really have the equivalent of (01:40:39) culture and maybe we're giving them way (01:40:40) too much and incentivizing not to create (01:40:42) it or something like that. But I guess (01:40:44) like the notion of culture and of (01:40:45) written record and of like passing down (01:40:47) notes between each other. I don't think (01:40:49) there's an equivalent of that with LLMs (01:40:50) right now. So LLMs don't really have (01:40:52) culture right now and it's kind of like (01:40:54) one of the I think uh impediments I (01:40:56) would say. (01:40:57) >> Can you give me some sense of what LLM (01:40:59) culture might look like? Uh, so in the (01:41:01) simplest case, it would be a giant (01:41:02) scratchpad that the LLM can edit. And (01:41:04) as it's reading stuff or as it's helping (01:41:06) out with work, it's editing the scratchpad (01:41:08) for itself. (01:41:09) >> Why can't an LLM write a book for the (01:41:10) other LLMs? That would be cool. (01:41:12) >> Yeah. (01:41:13) >> Like why can't other LLMs read this (01:41:14) LLM's book and be inspired by it or (01:41:18) shocked by it or something like that? (01:41:19) There's no equivalent for any of this (01:41:20) stuff. (01:41:20) >> Interesting. When would you expect that (01:41:22) kind of thing to start happening? 
And (01:41:24) more general question about like multi-agent (01:41:25) systems and a sort of like (01:41:27) independent AI. Yeah, civilization and (01:41:30) culture. (01:41:31) >> I think there's two powerful ideas in (01:41:33) the realm of multi-agent that have both (01:41:34) not been like really claimed or or so (01:41:36) on. The first one I would say is culture, (01:41:39) and LLMs basically having a growing uh (01:41:41) repertoire of knowledge uh for their own (01:41:43) purposes. (01:41:44) >> Uh the second one looks a lot more like (01:41:46) uh the powerful idea of self-play. Uh in (01:41:48) my mind it's extremely powerful. So (01:41:49) evolution actually is a lot of um (01:41:52) competition basically driving (01:41:53) intelligence and and evolution. Um and (01:41:57) uh in AlphaGo more algorithmically (01:41:59) like AlphaGo is playing against itself (01:42:01) and that's how it learns to get really (01:42:03) good at Go and there's no equivalent of (01:42:05) self-play in LLMs but I would expect that (01:42:07) to also exist but no one has done it yet (01:42:09) like why can't an LLM for example create (01:42:10) a bunch of problems that another LLM is (01:42:13) learning to solve and then the LLM is (01:42:15) always trying to like serve more and (01:42:16) more difficult problems stuff like that (01:42:18) you know so like (01:42:19) >> I think there's a bunch of ways to (01:42:21) actually organize it um and I think it's (01:42:22) a realm of research uh but I think I (01:42:24) haven't seen anything that convincingly (01:42:26) like claims both of those (01:42:28) >> like multi-agent uh improvements. I (01:42:30) still think we're mostly in the realm of (01:42:31) a single individual agent, but I think I (01:42:34) also think that will change and and um (01:42:36) in the realm of culture also I would (01:42:38) bucket also organizations and we haven't (01:42:40) seen anything like that coming in (01:42:41) either. 
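The self-play idea described above (one model writing problems for another, and always trying to serve slightly harder ones) can be sketched as a toy curriculum loop. Everything here is a hypothetical stand-in: `ToySolver`, `ToyProposer`, `run_curriculum`, the numeric "difficulty" and "skill" scores, and the update rules are assumptions made for illustration, not a method described in the conversation, and no actual LLM is involved.

```python
from dataclasses import dataclass


@dataclass
class ToySolver:
    """Hypothetical stand-in for the model learning to solve problems."""
    skill: float = 1.0

    def attempt(self, difficulty: float) -> bool:
        # Succeeds only on problems at or below its current skill.
        return difficulty <= self.skill

    def learn(self, difficulty: float, solved: bool) -> None:
        # Assumption: only a near miss (at most 1.0 beyond current skill)
        # teaches anything; solved or far-too-hard problems teach nothing.
        if not solved and difficulty - self.skill <= 1.0:
            self.skill += 0.5


@dataclass
class ToyProposer:
    """Hypothetical stand-in for the model that writes the problems."""
    difficulty: float = 1.0

    def propose(self) -> float:
        return self.difficulty

    def update(self, solved: bool) -> None:
        # Serve a harder problem after a success, an easier one after a failure.
        if solved:
            self.difficulty += 1.0
        else:
            self.difficulty = max(1.0, self.difficulty - 0.5)


def run_curriculum(rounds: int) -> tuple[float, float]:
    """Run the proposer/solver loop; return (solver skill, proposer difficulty)."""
    solver, proposer = ToySolver(), ToyProposer()
    for _ in range(rounds):
        difficulty = proposer.propose()
        solved = solver.attempt(difficulty)
        solver.learn(difficulty, solved)
        proposer.update(solved)
    return solver.skill, proposer.difficulty


if __name__ == "__main__":
    skill, difficulty = run_curriculum(20)
    print(f"solver skill: {skill}, proposer difficulty: {difficulty}")
```

The design choice being illustrated is only that the proposer tracks the solver's frontier, so the problems stay just hard enough to keep the solver improving; how one would score difficulty or verify solutions with real models is left open, as it is in the conversation.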
(01:42:42) >> Um so that's why we're still early. (01:42:44) >> And can you identify the key bottleneck (01:42:46) that's uh preventing this kind of (01:42:49) collaboration between LLMs? Maybe like (01:42:51) the way I would put it is (01:42:54) somehow remarkably again some of these (01:42:55) analogies work and they shouldn't but (01:42:57) somehow remarkably they do a lot of the (01:42:58) smaller models or the dumber like the (01:43:00) smaller models somehow remarkably (01:43:02) resemble like a kindergarten student or (01:43:04) then like an elementary school student or (01:43:06) high school student etc. And somehow we (01:43:08) still haven't like graduated enough (01:43:09) where the stuff can take over like it's (01:43:11) still mostly like my Claude Code or (01:43:13) Codex they still kind of feel like this (01:43:16) elementary grade student. I know that (01:43:18) they can take PhD quizzes, but they (01:43:19) still cognitively feel like a (01:43:22) kindergarten or an elementary (01:43:23) school student. So, I don't think they (01:43:24) can create culture because they're still (01:43:26) kids. Um, you know, (01:43:28) >> like they're savant kids. Um, they have (01:43:30) episodic, they have perfect memory of (01:43:32) all this stuff, etc. And they can, uh, (01:43:34) convincingly create all kinds of slop (01:43:35) that looks really good. (01:43:37) >> But I still think they don't really know (01:43:38) what they're doing and they don't really (01:43:39) have the cognition uh, across all these (01:43:41) little check boxes that we still have to (01:43:43) collect. (01:43:43) >> Yeah. So, you've talked about how you (01:43:46) were at Tesla leading self-driving from (01:43:49) 2017 to 2022 and then you firsthand saw (01:43:53) this progress from we went from cool (01:43:55) demos to now thousands of cars out there (01:43:58) actually autonomously doing drives. Why (01:44:00) did that take a decade? 
Like what was (01:44:02) happening through that time? (01:44:03) >> Yeah. Uh so I would say one thing I will (01:44:05) almost instantly also push back on is (01:44:07) this is not even near done. (01:44:10) >> So in a bunch of ways that I'm going to (01:44:12) get to. I do think that uh self-driving (01:44:14) is very interesting because uh it's (01:44:16) definitely like where I get a lot of my (01:44:17) intuitions because I spent 5 years on (01:44:19) it. Um and it has this entire history (01:44:21) where actually the first demos of (01:44:23) self-driving go all the way to the 1980s. (01:44:25) >> You can see a demo from CMU in 1986 (01:44:28) there's a truck that's driving itself on (01:44:30) roads. Um but okay fast forward I think (01:44:33) when I was joining Tesla I had um I had (01:44:35) a very early demo of a Waymo and it (01:44:38) basically gave me a perfect drive uh in (01:44:41) 2014 or something like that. So (01:44:44) perfect Waymo drive a decade ago. Uh it took (01:44:47) us around Palo Alto and so on because (01:44:48) I had a friend who worked there. Um and (01:44:51) I thought it was like very close and (01:44:52) then still took a long time and I do (01:44:54) think that some there's for some kinds (01:44:56) of um tasks and jobs and so on uh the (01:44:59) there's a very large demo-to-product gap (01:45:02) where the demo is very easy but the (01:45:03) product is very hard. Um, and it's (01:45:06) especially the case in cases like (01:45:07) self-driving where the the cost of (01:45:10) failure is too high, right? Many (01:45:12) industries, tasks and jobs maybe (01:45:14) don't have that property, but when you (01:45:15) do have that property, that definitely (01:45:17) increases the timelines. I do think that (01:45:19) for example in software engineering, I (01:45:20) do actually think that that property (01:45:22) does exist. 
I think for a lot of vibe (01:45:24) coding it doesn't but I think if you're (01:45:25) writing actual production grade code I (01:45:27) think that property should exist because (01:45:28) any kind of mistake actually leads to a (01:45:30) security vulnerability or something like (01:45:32) that and millions and hundreds of (01:45:33) millions of people's personal social (01:45:35) security numbers etc get leaked or (01:45:37) something like that and so I do think (01:45:38) that it is a case that in software (01:45:40) people should be careful um kind of like (01:45:43) in self-driving um like in self-driving (01:45:45) if things go wrong you might (01:45:46) get injured, um I guess there's worse (01:45:49) outcomes but I guess in in software I (01:45:51) almost feel like it's almost unbounded (01:45:53) how terrible some things could be. (01:45:56) >> Interesting. (01:45:57) >> So I do think that they share that (01:45:58) property. And then I think basically (01:46:00) what takes the long amount of time and (01:46:01) the way to think about it is that it's a (01:46:04) march of nines and every single nine is (01:46:06) a constant amount of work. So every (01:46:09) single nine is the same amount of work. (01:46:10) So when you get a demo and something (01:46:12) works 90% of the time, that's just uh (01:46:15) that's just the first nine and (01:46:17) then you need the second nine and third (01:46:18) nine, fourth nine, fifth nine. And while (01:46:19) I was at Tesla for was it five years or (01:46:21) so. I think we went through maybe three (01:46:22) nines or two nines. I don't know what it (01:46:24) is, you know, but like multiple nines of (01:46:25) iteration, there's still more nines to (01:46:27) go. And so that's why these things (01:46:28) take so long. Um, and so it's definitely (01:46:32) formative for me like seeing something (01:46:34) that was a demo. I'm very unimpressed by (01:46:36) demos. 
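[Editor's sketch, not from the conversation: the "march of nines" arithmetic above can be made concrete. 90% reliability is one nine, 99% is two, 99.9% is three, and each extra nine cuts the failure rate by a factor of ten. The function names `nines` and `failures_per_million` are illustrative, and the "one unit of work per nine" column simply restates the speaker's claim that each nine costs about the same effort; it is an assumption, not a measurement.]

```python
import math


def nines(success_rate: float) -> float:
    """Number of nines of reliability: 0.9 -> 1, 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - success_rate)


def failures_per_million(n_nines: int) -> int:
    """Expected failures per million attempts at a given number of nines."""
    return round(1_000_000 * 10 ** (-n_nines))


if __name__ == "__main__":
    for n in range(1, 6):
        rate = 1 - 10 ** (-n)
        # Assumption from the talk: cumulative effort grows by one unit per nine.
        print(
            f"{n} nine(s): success {rate:.5%}, "
            f"about {failures_per_million(n)} failures per million, "
            f"cumulative work units: {n}"
        )
```

[Read this way, a demo that works 90% of the time is one unit of work into a five-unit job if the product needs five nines, even though the failure rate still has to fall by a factor of 10,000.]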
Um, so whenever I see demos of (01:46:38) anything, I'm extremely unimpressed by (01:46:40) that. Um, it works better if you can um (01:46:43) if it's a demo that someone cooked up (01:46:45) and is just showing you it's worse. If (01:46:46) you can interact with it, it's a bit (01:46:47) better. But even then, you're not done. (01:46:49) You need actual product. It's going to (01:46:50) face all these challenges when it (01:46:52) comes in contact with reality and all (01:46:53) these different pockets of behavior that (01:46:55) need patching. And so I think we're (01:46:56) going to see all this stuff play out. (01:46:58) It's a march of nines. Each nine is (01:46:59) constant. Uh demos are encouraging. (01:47:02) Still a huge amount of work to do. Uh I (01:47:04) do think it is a um kind of a critical (01:47:06) safety domain unless you're doing vibe (01:47:08) coding, which is all nice and fun and so (01:47:10) on. And uh so that's why I think this (01:47:13) also informs my timelines from that (01:47:14) perspective. Hm. That's very (01:47:17) interesting to hear you say that the (01:47:18) sort of safety guarantees you need from (01:47:20) software are actually not dissimilar to (01:47:23) self-driving because what people will (01:47:24) often say is that self-driving took so (01:47:25) long because the cost of failure is so (01:47:29) high. Like a human makes a mistake on (01:47:31) average every 400,000 miles or every (01:47:33) seven years. And if you had to release a (01:47:35) coding agent that couldn't make a (01:47:37) mistake for at least seven years, it (01:47:40) would be much harder to deploy. 
But I (01:47:42) guess your point is that if you made a (01:47:43) catastrophic coding mistake like yeah (01:47:46) >> breaking some important system every (01:47:47) seven years (01:47:48) >> very easy to do (01:47:49) >> and in fact in terms of sort of wall (01:47:50) clock time it would be much less (01:47:52) than seven years because you're like (01:47:53) constantly outputting code like that (01:47:55) right so like per tokens or in terms of (01:47:58) tokens it would be seven years but in (01:47:59) terms of wall clock time (01:48:00) >> in some way it's a much harder problem I (01:48:01) mean self-driving is just one of (01:48:03) thousands of things that people do it's (01:48:05) almost like a single vertical I suppose (01:48:07) um whereas when we're talking about (01:48:08) general software engineering it's even (01:48:09) more there's more surface Yeah, (01:48:11) >> there's another uh objection people make (01:48:14) to that analogy, (01:48:16) >> which is that (01:48:17) >> with self-driving, what took a big (01:48:19) fraction of that time was solving the (01:48:21) problem of building basic uh having (01:48:24) basic perception that's robust and (01:48:26) building representations and having a (01:48:28) model that has some common sense so it (01:48:31) can generalize to when it sees something (01:48:32) that's slightly out of distribution. If (01:48:34) somebody's waving down the road this (01:48:37) way, you don't need to train for it. The (01:48:38) thing will uh have some understanding of (01:48:41) how to respond to something like that. (01:48:43) >> And these are things we're getting for (01:48:44) free with LLMs or VLMs today. So we (01:48:47) don't have to solve these very basic (01:48:48) representation problems. 
And so now (01:48:51) deploying AI across different domains (01:48:52) will sort of be like deploying a (01:48:54) self-driving car with current models to (01:48:55) a different city which is hard but not (01:48:57) like a 10 year long task. (01:48:59) >> Yeah. Basically I'm not 100% sure if I (01:49:01) fully agree with that. I don't know that (01:49:02) we're how much we're getting for free (01:49:03) and I still think there's like a lot of (01:49:05) gaps in understanding in what we are (01:49:06) getting. Um I mean we're definitely (01:49:08) getting more generalizable intelligence (01:49:10) in a single entity. Uh whereas uh (01:49:12) self-driving is a very special purpose (01:49:14) task that requires in some sense (01:49:16) building a special purpose task is maybe (01:49:17) even harder in a certain sense because (01:49:19) it doesn't like fall out from a more (01:49:20) general thing that you're doing at scale (01:49:21) if that makes sense. So, um, but I still (01:49:25) think that the analogy doesn't, uh, I (01:49:26) still don't know if it fully resonates (01:49:28) because, um, like the LLMs are still (01:49:30) pretty fallible and I still think that (01:49:32) they have a lot of gaps that (01:49:33) still need to be filled in. And I don't (01:49:34) think that we're getting like magical (01:49:35) generalization completely out of the (01:49:36) box, uh, sort of in in a certain sense. (01:49:39) And the other aspect that I wanted to (01:49:40) also actually return to when I was uh, (01:49:43) in the in the beginning was uh, (01:49:45) self-driving cars are nowhere near done (01:49:46) still. (01:49:48) >> So even though um, so the deployments (01:49:49) still are pretty minimal, right? Uh so (01:49:51) even Waymo and so on has very few cars (01:49:53) and they're doing that roughly speaking (01:49:54) because they're not economical, right? (01:49:56) Um because they've built something that (01:49:58) that lives in the future. 
Um and so they (01:50:00) they had to like pull back the future but (01:50:02) they had to make it uneconomical. So (01:50:04) they have all these like um you know (01:50:06) there's all these costs not just (01:50:07) marginal costs for those cars and their (01:50:09) operation and maintenance but also the (01:50:11) capex of the entire thing. (01:50:12) >> Um so making it economical is still going (01:50:14) to be a slog I think uh for them. Um, (01:50:17) and then also I think when you look at (01:50:19) these cars and there's no one driving, (01:50:21) um, I also think it's a little bit (01:50:22) deceiving because there are actually (01:50:23) very elaborate operation centers of (01:50:27) people actually kind of like in a loop (01:50:28) with these cars. And I don't have the I (01:50:30) don't know the full extent of it, but I (01:50:31) think um (01:50:32) >> there's more human in the loop than you (01:50:34) might expect and there's people (01:50:35) somewhere out there basically beaming in (01:50:37) from the sky. (01:50:38) >> Uh, and uh, I don't actually know if (01:50:40) they're fully in the loop with the (01:50:41) driving. I think some of the times they (01:50:42) are but they're certainly involved and (01:50:44) there are people and in some sense we (01:50:45) haven't actually removed the person (01:50:46) we've like moved them to somewhere where (01:50:48) we can't see them. I still think there (01:50:49) will be some work as you mentioned going (01:50:50) from environment to environment and uh (01:50:52) so I think like there's still challenges (01:50:54) to to make self-driving real but I I do (01:50:57) agree that it's definitely crossed a (01:50:58) threshold where it kind of feels real (01:51:00) unless it's like really teleoperated. Um for (01:51:02) example Waymo can't go to all the (01:51:04) different parts of the city. My (01:51:06) suspicion is it's like parts of the city (01:51:07) where you don't get good signal (01:51:09) >> anyway. 
So basically I don't actually (01:51:11) know anything about the stack. I mean (01:51:13) I'm just making stuff up. (01:51:14) >> You led self-driving for 5 (01:51:17) years at Tesla. (01:51:18) >> Sorry I don't know anything about the (01:51:19) specifics of Waymo. I feel talk about (01:51:20) them. (01:51:21) >> I actually by the way I love Waymo and I (01:51:22) take it all the time. So I don't want to (01:51:24) say like (01:51:25) >> I just think that people again are (01:51:27) sometimes a little bit too naive about (01:51:29) some of the progress and I still think (01:51:30) there's a huge amount of work (01:51:31) >> and I think Tesla took in my mind a lot (01:51:33) more scalable approach and I think the (01:51:34) team is doing extremely well and it's (01:51:36) going to um and I I I'm kind of like on (01:51:39) the record for predicting how this thing (01:51:40) will go which is like Waymo had an (01:51:42) early start because you can package up (01:51:43) so many sensors but I do think Tesla is (01:51:45) taking the more uh scalable strategy and (01:51:47) it's going to look a lot more like that (01:51:48) um so I think this will have to still uh (01:51:50) play out and hasn't but basically like I (01:51:53) don't want to talk about self-driving as (01:51:54) something that took a decade because it (01:51:56) didn't take it didn't take yet (01:51:59) if that makes sense (01:52:00) >> because one it's the the start is at (01:52:02) 1980 not 10 years ago and then two the (01:52:04) end is not here yet. (01:52:05) >> Yeah. The end is not near yet (01:52:07) because uh when we're talking about (01:52:08) self-driving usually in my mind it's (01:52:09) self-driving at scale. Yeah. (01:52:11) >> Um people don't have to get a driver's (01:52:13) license etc. I'm I'm curious to bounce (01:52:15) two other ways in which the analogy (01:52:18) might be different. 
(01:52:19) >> And the reason I'm especially curious (01:52:20) about this is because I think the (01:52:22) question of how fast AI is deployed, how (01:52:25) valuable it is when it's early on is (01:52:27) like potentially the most important (01:52:29) question in the world right now, right? (01:52:30) Like if you're trying to model what the (01:52:31) year 2030 looks like, this is the (01:52:33) question you want to have some (01:52:34) understanding of. So another thing you (01:52:37) might think is one you have this latency (01:52:40) requirement with self-driving where you (01:52:42) have I have no idea what the actual (01:52:43) models are but I assume like tens of (01:52:45) millions of parameters or something (01:52:46) which is not the necessary constraint (01:52:48) for um knowledge work with LLMs or maybe (01:52:51) it might be with the computer use and (01:52:53) stuff but anyways the other big one is (01:52:56) maybe more importantly on this capex (01:52:59) question yes there is additional cost to (01:53:03) serving up an additional copy of a model (01:53:05) But the sort of opex of a session (01:53:09) >> is quite low and you can amortize the (01:53:12) cost of AI into the training run itself (01:53:16) depending on how inference scaling goes (01:53:17) and stuff but it's certainly not as much (01:53:19) as like building a whole new car (01:53:21) >> to serve another instance of a model. So (01:53:24) it's just the economics of deploying more (01:53:26) widely (01:53:27) >> are much more favorable. (01:53:29) >> I think that's right. I think if you're (01:53:30) sticking in the realm of bits, bits are (01:53:32) like a million times easier than (01:53:33) anything that touches the physical (01:53:34) world. (01:53:35) >> No. (01:53:36) >> Uh I definitely grant that. Uh bits are (01:53:38) completely changeable, arbitrarily (01:53:40) reshufflable at very rapid speed. 
Uh so (01:53:43) you would expect a lot more (01:53:45) >> faster uh adaptation also in the (01:53:47) industry and so on. (01:53:48) >> And then uh what was the first one? (01:53:50) >> The latency requirements and its (01:53:52) implications for model size. (01:53:54) >> I think that's roughly right. I mean I (01:53:55) also think that if we are talking about (01:53:56) knowledge work at scale there will be (01:53:58) some um latency requirements practically (01:54:00) speaking because we uh you know we're (01:54:02) going to have to create a huge (01:54:03) amount of compute and serve that. (01:54:05) >> Um and then I think like the last aspect (01:54:07) that I very briefly want to also talk (01:54:08) about is like all the all the rest of it (01:54:10) the (01:54:11) >> just all the rest of it. So um what does (01:54:14) society think about it? What is the (01:54:15) legal ramific how is it working legally? (01:54:18) How is it working insurance-wise? who's (01:54:20) really like what is the where what are (01:54:22) those layers of it and aspects of it (01:54:24) what happens with what is the equivalent (01:54:26) of people putting a cone on a Waymo (01:54:28) >> you know uh there's going to be (01:54:29) equivalent of all that and so I I do (01:54:31) think that I almost feel like (01:54:33) self-driving is a very nice analogy that (01:54:35) you can borrow things from yeah what is (01:54:37) the equivalent of a cone on the car what (01:54:38) is the equivalent of a teleoperating (01:54:39) worker who's like hidden away um and uh (01:54:43) almost like all the aspects of it (01:54:45) >> yeah do you have any opinions on whether (01:54:47) this implies that the current AI (01:54:48) buildout (01:54:49) which would like 10x the amount of (01:54:52) available compute in the world in a (01:54:54) year or two and maybe like 100 more than (01:54:56) 100x it by the end of the decade. 
If the (01:54:59) use of AI will be lower than some people (01:55:01) naively predict, does that mean that we're (01:55:04) overbuilding compute or do you is that a (01:55:07) separate question? (01:55:07) >> Kind of like what happened with (01:55:08) railroads and all this kind of stuff? (01:55:10) Sorry. (01:55:10) >> Was it railroads? Oh, sorry. It was um (01:55:12) yeah, (01:55:12) >> there there is like historical precedent (01:55:14) or was it with telecommunication (01:55:15) industry, right? Like prepaving the (01:55:17) internet that only came like a decade (01:55:18) later, you know, (01:55:20) >> and creating like a whole bubble in the (01:55:21) telecommunications industry in the late (01:55:24) 90s kind of thing. Yeah. (01:55:25) >> Um so I don't know. I mean, I I (01:55:28) understand I'm sounding very pessimistic (01:55:30) here. (01:55:31) >> I'm only doing that I'm actually (01:55:32) optimistic. I think this will work. I (01:55:34) think it's tractable. I'm only sounding (01:55:36) pessimistic because when I go on my (01:55:37) Twitter timeline, I see all this stuff (01:55:39) that makes no sense to me. And um and I (01:55:42) think there's a lot of reasons for why (01:55:44) that exists. And I think a lot of it is (01:55:46) I think honestly just uh fundraising. (01:55:47) It's just incentive structures. A lot of (01:55:50) it may be fundraising. A lot of it is (01:55:51) just attention um you know, converting (01:55:53) attention to money on the internet, you (01:55:55) know, stuff like that. Um, so I think (01:55:58) there's um there's a lot of that going on (01:56:01) and I think I'm only reacting to that. (01:56:02) Um, but I'm still like overall very (01:56:05) bullish on technology. I think we're (01:56:06) going to work through all this stuff and (01:56:07) I think there's been a rapid amount of (01:56:09) progress. Um, I don't actually know that (01:56:11) there's overbuilding. 
I think that (01:56:13) there's going to be we're going to be (01:56:14) able to gobble up what in my (01:56:16) understanding is being built. Uh, (01:56:17) because I do think that for example (01:56:19) Claude Code or OpenAI Codex and stuff like (01:56:21) that, they didn't even exist a year ago, (01:56:22) right? Is that right? I think it's (01:56:23) roughly right. Um this is miraculous (01:56:26) technology that didn't exist. I think um (01:56:29) uh there's going to be a huge amount of (01:56:30) demand as we see the demand in (01:56:31) ChatGPT already and so on. So uh (01:56:34) yeah, I don't actually know that there's (01:56:35) overbuilding. Um but I guess I'm just (01:56:38) reacting to like some of the very fast (01:56:40) timelines that people continue to say (01:56:42) incorrectly and I've heard many many (01:56:43) times over the course of my 15 years in (01:56:45) AI where very reputable people keep (01:56:48) getting this wrong all the time. (01:56:51) And I think I want this to be properly (01:56:52) calibrated and I think some of this also (01:56:54) it does have like geopolitical (01:56:55) ramifications and things like that when (01:56:58) uh like some of these questions and I (01:57:00) think I don't want people to make (01:57:01) mistakes in that sphere of (01:57:03) things. So um I do want us to be (01:57:05) grounded in reality of what technology (01:57:07) is and isn't. So (01:57:09) >> let's let's talk about education and (01:57:10) Eureka and stuff. (01:57:12) >> One thing you could do is uh start (01:57:15) another AI lab and then try to solve (01:57:18) those problems. Um, yeah. Curious what (01:57:20) you're up to now. (01:57:21) >> Yeah. (01:57:22) >> And then, yeah, why not AI research (01:57:24) itself? (01:57:25) >> Uh, I guess maybe like the way I would (01:57:26) put it is I feel some amount of like (01:57:29) determinism around the things that AI (01:57:32) labs are doing. 
Um, and I feel like I (01:57:34) could help out there, but I don't know (01:57:35) that I would uh like uniquely um I don't (01:57:39) know that I would like uniquely uh (01:57:40) improve it. But I I think like my (01:57:42) personal big fear is that a lot of this (01:57:44) stuff happens on the side of humanity (01:57:46) and that humanity gets disempowered by (01:57:48) it. And I don't I I kind of like I care (01:57:51) not just about all the Dyson spheres (01:57:53) that we're going to build and that AI is (01:57:54) going to build in a fully autonomous (01:57:55) way. I care about what happens to (01:57:57) humans. (01:57:57) >> Yeah. (01:57:58) >> And I want humans to be well off in this (01:58:00) future. And I feel like that's where I (01:58:02) can a lot more uniquely add value than (01:58:04) uh like an incremental improvement in a (01:58:05) frontier lab. And so, um, I guess I'm (01:58:08) most afraid of something maybe like, um, (01:58:10) depicted in movies like Wall-E or (01:58:12) Idiocracy or something like that where (01:58:14) humanity is sort of on the side of this (01:58:15) stuff. Um, and I want humans to be much (01:58:18) much better in this future. And so I (01:58:21) guess uh to me uh this is kind of like (01:58:23) through education that you can actually (01:58:24) achieve this (01:58:25) >> and and uh so what are you working on (01:58:27) there? (01:58:27) >> Oh yeah. So Eureka is trying to build I (01:58:29) think maybe the easiest way I can (01:58:30) describe it is we're trying to build the (01:58:31) Starfleet Academy. (01:58:33) >> Um I don't know if you watched Star (01:58:35) Trek. I haven't. But yeah. (01:58:36) >> Okay. Starfleet Academy is this like (01:58:38) elite institution for frontier (01:58:40) technology building spaceships and (01:58:42) graduating cadets to be like you know (01:58:44) the pilots of these spaceships and (01:58:45) whatnot. 
So I just imagine like an elite (01:58:47) institution for technical knowledge and (01:58:50) um and basically a kind of school that's (01:58:53) um very up-to-date and very like premier (01:58:55) institution. A category of questions I (01:58:58) have for you is just explaining how one (01:59:01) teaches technical or scientific content (01:59:05) >> well because you are one of the world (01:59:07) masters at it and then I'm curious both (01:59:10) about how you think about it for content (01:59:11) you've already put out there on YouTube. (01:59:13) >> Yeah. (01:59:13) >> But also to the extent it's any (01:59:14) different how you think about it for (01:59:15) Eureka. (01:59:16) >> Yeah. Yeah. With respect to Eureka, I (01:59:18) think like one thing that is very (01:59:19) fascinating to me about education is (01:59:21) like I do think education will pretty (01:59:22) fundamentally change with AIs on the (01:59:24) side and I think it has to be rewired (01:59:26) and changed um to some extent. I still (01:59:29) think that we're pretty early. I think (01:59:30) there's going to be a lot of people who (01:59:31) are going to try to do the obvious (01:59:32) things which is like oh have an LLM and (01:59:35) uh ask it questions and get you know do (01:59:37) all the basic things that you would do (01:59:38) via prompting right now. I I think it's (01:59:40) helpful but it still feels to me a bit (01:59:41) like slop. I I'd like to do it (01:59:44) properly and I think the capability is (01:59:45) not there for what I would want. What (01:59:46) I'd want is uh like an actual uh tutor (01:59:50) experience. Um maybe a prominent example (01:59:52) in my mind is um I was recently learning (01:59:55) Korean. So language learning (01:59:57) >> and I went through a phase where I was (01:59:58) learning Korean by myself on the (02:00:00) internet. I went through a phase where I (02:00:01) was actually part of a small class uh in (02:00:03) Korea. 
Uh taking Korean with (02:00:06) a bunch of other people which was really (02:00:07) funny. But we had a teacher and like 10 (02:00:08) people or so taking Korean. And then I (02:00:10) switched to a one-on-one tutor. And um I (02:00:14) guess what was fascinating to me is I (02:00:15) think I had a really good tutor. Uh but (02:00:17) um I mean just thinking through like (02:00:20) what this tutor was doing for me and how (02:00:22) incredible that experience was and how (02:00:25) high the bar is for like what I actually (02:00:26) want to build eventually. (02:00:28) >> Um because uh I mean she was extremely (02:00:30) so she instantly from a very short (02:00:32) conversation understood like where I am (02:00:34) as a student, what I know and don't know (02:00:36) and she was able to like probe exactly (02:00:37) like the kinds of questions or things to (02:00:39) understand my world model. Mm, no LLM will (02:00:42) do that for you 100% right now. Not even (02:00:44) close, right? (02:00:44) >> But a tutor will do that if if they're (02:00:46) good. (02:00:47) >> Once she understands um she actually (02:00:49) like really served me all the things (02:00:50) that I needed at my current sliver of (02:00:52) capability. I need to be always (02:00:54) appropriately challenged. I can't be (02:00:56) faced with something too hard or too (02:00:57) trivial. And a tutor is really good at (02:00:59) serving you just the right stuff. And so (02:01:01) basically I felt like I was the only (02:01:03) constraint to learning like my own. I (02:01:05) was the only constraint. I was always (02:01:06) given the perfect information. (02:01:07) >> I'm the only constraint. And I felt good (02:01:09) because I'm the only impediment that (02:01:11) exists. It's not that I can't find (02:01:12) knowledge or that it's not properly (02:01:13) explained or etc. Like it's just my (02:01:15) ability to memorize and so on. And this (02:01:17) is what I want for people. 
How do you (02:01:19) automate that? (02:01:20) >> So very good question about the current (02:01:22) capability. You don't (02:01:23) >> but I do think that with uh as um and (02:01:25) that's why I think um it's not actually (02:01:27) the right time to actually build (02:01:28) this kind of an AI tutor. I still think (02:01:30) it's a useful product um and lots of (02:01:32) people will build it but I still feel (02:01:34) like um the bar is so high uh and the (02:01:37) capability is not there. Um uh but I (02:01:40) mean even today I would say ChatGPT is (02:01:42) an extremely um valuable educational (02:01:45) product but I think for me it was so (02:01:47) fascinating to see how high the bar is (02:01:48) and when I was with her I almost felt (02:01:50) like there's no way I can build this. (02:01:53) >> But you are building it right? (02:01:54) >> Anyone who's had a really good tutor is (02:01:56) like how are you going to build this? (02:01:58) Um so I guess I I'm waiting for that (02:02:01) capability. I I do think that in a lot (02:02:03) of ways in the industry, for example, I (02:02:04) did some AI consulting for computer (02:02:06) vision. Yeah. Um (02:02:07) >> a lot of times the value that I (02:02:09) brought to the company was telling them (02:02:10) not to use AI. (02:02:11) >> It wasn't like I was the AI expert and (02:02:13) they described a problem and I said (02:02:14) don't use AI. (02:02:16) >> This was my value add. And I feel like (02:02:17) it's the same in education right now (02:02:19) where I kind of feel like for what I (02:02:21) have in mind, it's not yet the time, but (02:02:23) the time will come. But for now, I'm (02:02:25) building something that looks maybe a (02:02:26) bit more conventional. Um that has a (02:02:28) physical and digital component and so (02:02:30) on. But I think (02:02:32) it's obvious how this should (02:02:34) look in the future. 
(02:02:35) >> To the extent you're willing to say, (02:02:36) what is the thing you hope will be (02:02:39) released this year or next year? (02:02:41) >> Well, so I'm building the first course (02:02:42) and I want to have a really really good (02:02:44) course uh state-of-the-art obvious (02:02:47) state-of-the-art destination you go to (02:02:49) learn AI in this case because that's (02:02:50) just what I'm familiar with. So I think (02:02:51) it's a really good first product to get (02:02:52) to be really good. Um and so that's what (02:02:55) I'm building and nanochat which you (02:02:56) briefly mentioned is a capstone project (02:02:58) of uh LLM101n which is a class that I'm (02:03:00) building. (02:03:01) >> So um that's a really big piece of it (02:03:04) but now I have to build out a lot of the (02:03:05) intermediates and then I have to (02:03:06) actually like hire a small team of you (02:03:08) know TAs and so on and actually like uh (02:03:10) build the entire course. And maybe one (02:03:12) more thing that I would say is like many (02:03:13) times when people think about education (02:03:15) they think about sort of like the more (02:03:17) what I would say is like kind of a (02:03:18) softer component of like diffusing (02:03:20) knowledge or like um but I actually have (02:03:22) something very hard and technical in (02:03:24) mind and so in my mind education is kind (02:03:26) of like the very difficult technical (02:03:28) like uh process of building ramps to (02:03:30) knowledge. (02:03:31) >> So in my mind nanochat is a ramp to (02:03:33) knowledge because it's a very simple (02:03:35) it's like the super simplified full (02:03:37) stack thing. If you give this artifact (02:03:39) to someone and they like look through (02:03:41) it, they're learning a ton of stuff. (02:03:42) >> Yeah. (02:03:42) >> And so, uh, it's giving you a lot of (02:03:45) what I call Eurekas per second. Yeah. (02:03:46) Which is like understanding per second. 
(02:03:48) That's what I want. Lots of Eurekas per (02:03:50) second. Um and so to me this is a (02:03:52) technical problem of how do we build (02:03:53) these ramps to knowledge and uh so I (02:03:55) almost think of Eureka as almost like a (02:03:57) it's not like maybe that different maybe (02:03:59) through through some of the for frontier (02:04:01) labs or some of the work that's going to (02:04:02) be going on because I want to figure out (02:04:04) how to build these frontier these ramps (02:04:06) very efficiently so that people are (02:04:08) never stuck um and everything is always (02:04:10) not too hard or not too not too trivial (02:04:13) and uh you can you have just the right (02:04:15) material to actually progress. (02:04:16) >> Yeah. So you're imagining the short term (02:04:18) that instead of a tutor being able to (02:04:20) like probe your understanding, if you (02:04:23) have enough self-awareness to be able to (02:04:25) probe yourself (02:04:26) >> there, you're never going to be stuck. (02:04:27) You can like find the right answer (02:04:29) between talking to the TA or talking to (02:04:31) an LLM and looking at the reference (02:04:32) implementation. 
It sounds like (02:04:35) automation or AI is actually not a (02:04:37) significant like so far it's actually (02:04:40) the big alpha here is your ability (02:04:43) to explain AI codified in the source (02:04:47) material of the class right that's like (02:04:50) fundamentally what the course is (02:04:51) >> I mean I think you always have to be (02:04:52) calibrated to what (02:04:54) capability exists in the industry and I (02:04:56) think a lot of people are going to (02:04:57) pursue like oh just ask ChatGPT etc uh (02:04:59) but I think like right now for example (02:05:01) if you go to ChatGPT and you say oh teach (02:05:03) me AI there's no way it's I mean it's (02:05:04) going to give you some slop right (02:05:06) >> like AI is never going to write (02:05:08) nanochat right now but nanochat is a (02:05:10) really useful I think intermediate point (02:05:12) >> so I still I'm collaborating with AI to (02:05:15) create all this material so AI is still (02:05:17) fundamentally very helpful (02:05:18) >> um earlier on I built CS231n at (02:05:21) Stanford which was one of the earlier (02:05:23) actually sorry I think it was the first (02:05:24) deep learning class at Stanford which (02:05:25) became very popular (02:05:27) >> um and the difference in building out (02:05:29) CS231n and LLM101n now is quite stark (02:05:32) uh because I feel really empowered (02:05:34) by the LLMs as they exist right now but (02:05:36) I'm very much in the loop (02:05:38) >> so they're helping me build all the (02:05:39) materials I go much faster uh they're (02:05:41) doing a lot of the boring stuff etc uh (02:05:43) so I feel like I'm developing the course (02:05:45) much faster and there's LLM infused in (02:05:47) it but it's not yet at a place where it (02:05:49) can creatively create the content I'm (02:05:50) still there to do that (02:05:51) >> so like I think the trickiness is always (02:05:53) calibrating yourself to what exists
(02:05:55) >> and so when you imagine what is (02:05:57) available through Eureka in a couple of (02:05:59) years it seems like the big bottleneck (02:06:01) is going to be (02:06:02) finding Karpathys in field after field who (02:06:05) can (02:06:06) >> convert their understanding into these (02:06:08) ramps right (02:06:09) >> so I think it would change over time so (02:06:10) I think right now it would be uh hiring (02:06:13) faculty (02:06:14) >> Mhm. (02:06:14) >> to help work hand-in-hand with AI and a (02:06:17) team of people probably uh to build (02:06:19) state-of-the-art courses. (02:06:20) >> Yeah. (02:06:21) >> And then I think over time it can maybe (02:06:23) some of the TAs can actually become AIs (02:06:24) because some of the TAs like okay you (02:06:26) just take all the course materials (02:06:28) >> and then I think you could serve a very (02:06:29) good like automated TA for the student (02:06:32) when they have more basic questions or (02:06:34) something like that, right? But I think (02:06:35) you'll need faculty for the overall (02:06:37) architecture of a course and making sure (02:06:40) that it fits. And so I kind of see a (02:06:41) progression of how this will evolve and (02:06:43) maybe at some future point, you know, (02:06:44) I'm not even that useful and AI is doing (02:06:46) most of the design much better than I (02:06:47) could. (02:06:48) >> But I still think that that's going to (02:06:49) take some time to play out.
But are you (02:06:51) imagining that like uh people who have (02:06:54) expertise in other fields are then (02:06:56) contributing courses or do you feel like (02:06:57) it's actually quite essential to the (02:06:59) vision that you, given your understanding (02:07:02) of how you want to teach, are the one (02:07:04) designing the content (02:07:06) >> like I don't know Sal Khan is like (02:07:08) narrating all the videos on Khan Academy (02:07:10) are you imagining something like that or (02:07:11) >> no I will hire faculty I think because (02:07:13) there are domains in which I'm not an (02:07:14) expert um and I think uh that's the only (02:07:17) way to offer the state-of-the-art (02:07:19) experience for the student ultimately. (02:07:20) So um (02:07:22) >> yeah I do expect that I would hire (02:07:24) faculty but I will probably stick around (02:07:25) in AI for some time but I do have (02:07:28) something I think more conventional in (02:07:29) mind for the current capability I think (02:07:31) than what people would probably (02:07:32) anticipate. Um, and when I'm building (02:07:34) Starfleet Academy, I do probably imagine (02:07:36) a physical uh institution and maybe a (02:07:38) tier below that, a digital offering that (02:07:41) um is not the (02:07:43) state-of-the-art experience you would (02:07:44) get when someone comes in physically (02:07:46) full-time and we work through material (02:07:48) from start to end and make sure you (02:07:49) understand it. Uh that's the physical (02:07:51) offering. (02:07:52) >> Um the digital offering is yeah, a bunch (02:07:53) of stuff on the internet and maybe some (02:07:55) LLM assistant and it's a bit more (02:07:56) gimmicky in a tier below, but uh at (02:07:58) least it's accessible to like eight (02:07:59) billion people.
So (02:08:01) >> yeah, I think you're basically inventing (02:08:04) college from first principles for the (02:08:08) tools that are available today and then (02:08:09) just like selecting for (02:08:12) people who have the motivation and the (02:08:13) interest of actually (02:08:16) >> really engaging with material. (02:08:17) >> Yeah. And I think there's going to have (02:08:18) to be a lot of not just education but (02:08:20) also re-education. And I would love to (02:08:22) uh help out uh there uh because I think (02:08:24) the jobs will probably change quite a (02:08:26) bit. Um and so for example today a lot (02:08:28) of people are trying to upskill in AI (02:08:29) specifically. So I think it's a really (02:08:30) good course to teach in this (02:08:32) respect. Um and yeah I think (02:08:35) motivation wise uh before AGI uh (02:08:38) motivation is very simple to solve (02:08:39) because uh people want to make money and (02:08:41) this is how you make money in the (02:08:42) industry today. (02:08:43) >> I think post-AGI it's a lot more (02:08:45) interesting um possibly because yeah if (02:08:48) everything is automated and there's (02:08:49) nothing to do for anyone why would (02:08:50) anyone go to a school etc. Um so I think (02:08:54) uh I guess like I often say that (02:08:56) pre-AGI education is useful, post-AGI (02:08:59) education is fun (02:09:01) >> and uh in a similar way as (02:09:03) for example uh people go to gym today. (02:09:06) >> Yeah. (02:09:06) >> Uh but we don't need their physical (02:09:08) strength to manipulate uh heavy objects (02:09:10) because we have machines that do that. (02:09:12) >> They still go to gym. Why do they go to (02:09:13) gym? Well, because it's fun. It's (02:09:14) healthy. It's uh and you look (02:09:17) hot when you have a six-pack. I don't (02:09:18) know.
I guess like um so it's I guess (02:09:21) what I'm saying is um it's attractive (02:09:23) for people to do that in a certain like (02:09:25) very deep psychological evolutionary (02:09:27) sense for humanity. (02:09:29) >> And so I kind of uh think that education (02:09:31) will kind of play out in the same way (02:09:32) like you'll go to school like you go to (02:09:33) gym um and I think that right (02:09:36) now I think not that many people learn (02:09:38) uh because learning is hard. You bounce (02:09:41) from material because and some people (02:09:42) overcome that barrier but for most (02:09:44) people it's hard. But I do think that (02:09:46) it's a technical problem to (02:09:47) solve. It's a technical problem to do (02:09:49) what my uh tutor did for me when I was (02:09:51) learning Korean. (02:09:52) >> I think it's tractable and buildable and (02:09:54) someone should build it and I think it's (02:09:55) going to make learning anything like (02:09:57) trivial and desirable and people will do (02:09:59) it for fun because it's trivial. (02:10:00) >> If I had a tutor like that for any (02:10:02) arbitrary piece of like knowledge, I (02:10:04) think it's going to be so much easier (02:10:05) to learn anything and people will do it (02:10:07) and they'll do it for the same reasons (02:10:08) they go to gym. (02:10:09) >> I mean that sounds different from (02:10:13) using this. So post-AGI you're using (02:10:15) this um basically as entertainment or (02:10:20) as like self-betterment but it (02:10:22) sounded like you had a vision also that (02:10:24) this education is relevant to keeping (02:10:25) humanity in control of AI. (02:10:28) >> And they sound different and I'm curious (02:10:29) is it like it's entertaining for some (02:10:31) people but then empowerment for some (02:10:32) others. How do you think about that?
(02:10:33) >> I think this um so I do definitely feel (02:10:35) like people will be um I do think like (02:10:37) eventually it's a bit of a losing game (02:10:39) if that makes sense. I do think that it (02:10:42) is in the long term, which I think (02:10:44) is longer than I think maybe most people (02:10:46) in the industry, it's a losing game. I (02:10:48) do think that people can go so far and (02:10:51) that we barely scratched the surface of (02:10:52) how much a person can go and that's just (02:10:54) because people are bouncing off of (02:10:55) material that's too easy or too hard (02:10:57) and I actually kind of feel (02:10:59) that people will be able to go much (02:11:01) further like anyone speaks five (02:11:02) languages because why not because it's (02:11:04) so trivial. (02:11:05) >> Um anyone um knows you know all the (02:11:07) basic curriculum of undergrad etc. Now, (02:11:10) now that I'm understanding the vision, (02:11:12) that's very interesting. Like, I (02:11:14) think it actually has a perfect analog (02:11:16) in gym culture. I don't think 100 (02:11:18) years ago anybody would be like ripped (02:11:20) like nobody would have, you know, been (02:11:22) able to like just spontaneously bench (02:11:23) two plates or three plates or something. (02:11:25) It's actually very common now. (02:11:27) >> And it's because of this idea of (02:11:29) systematically training and lifting (02:11:31) weights in the gym or systematically (02:11:33) training to be able to run a marathon, (02:11:34) which is a capability spontaneously you (02:11:36) would not have or most humans would not (02:11:38) have. And you're imagining similar (02:11:40) things for (02:11:41) learning across very many different (02:11:43) domains, much more intensely, deeply, (02:11:45) faster. (02:11:45) >> Yeah, exactly.
And I kind of feel like I (02:11:47) am betting a little bit implicitly on (02:11:48) some of the timelessness of human (02:11:50) nature. (02:11:50) >> Yeah. (02:11:50) >> And I think um (02:11:52) >> I think it will be desirable to be to to (02:11:55) do all these things. Um (02:11:58) >> and I think people will look up to it (02:12:00) and as they have for for millennia (02:12:02) because uh and I think this will (02:12:04) continue to be true. And actually also (02:12:05) maybe there's some evidence of that (02:12:07) historically because if you look at for (02:12:08) example aristocrats or you look at maybe (02:12:10) ancient Greece or something like that (02:12:11) whenever you had little pocket (02:12:12) environments that were post AGI in a (02:12:14) certain sense I do feel like people have (02:12:16) spent a lot of their time uh flourishing (02:12:17) in a certain way uh either physically or (02:12:19) or cognitively and so I think um I I (02:12:22) feel okay about the prospects of that (02:12:24) >> and I think if this is false and I'm (02:12:26) wrong and we end up in like (02:12:28) >> you know um Wall-E or idiocracy future (02:12:30) then I think it's very I don't even care (02:12:33) if there's like Dyson spheres this is a (02:12:35) terrible outcome. (02:12:36) >> Mhm. (02:12:37) >> Yeah. (02:12:37) >> Like I actually really do care about (02:12:38) humanity. Like everyone has to just be (02:12:41) superhuman in a certain sense. (02:12:43) >> I I I guess it's still a world in which (02:12:46) that is not enabling us to (02:12:48) it's it's like the culture world, right? (02:12:50) Like you're not fundamentally going to (02:12:51) be able to like transform the trajectory (02:12:54) of (02:12:54) >> Yeah. (02:12:55) >> uh technology or (02:12:57) >> Yeah. (02:12:57) >> influence decisions by your own labor or (02:13:00) cognition alone. 
Maybe you can influence (02:13:02) decisions because the AI is like for (02:13:03) your approval, but you're not like it's (02:13:06) not because I've like I can in because (02:13:08) I've invented something or I've like (02:13:10) come up with a new design, I'm like (02:13:11) really influencing the future. (02:13:12) >> Um yeah, maybe. I don't actually think (02:13:14) that uh I I think there will be (02:13:15) transitional period where we are going (02:13:17) to be able to be in the loop and you (02:13:19) know advance things if we actually (02:13:20) understand a lot of stuff. (02:13:21) >> Um I do think that long term that (02:13:23) probably goes away, right? But um maybe (02:13:25) it's going to even become a sport. But (02:13:27) right now you have powerlifters who go (02:13:29) extreme on this direction. So what is (02:13:31) powerlifting in a cognitive era? (02:13:33) >> Um maybe it's people who are really (02:13:35) trying to make Olympics out of knowing (02:13:36) stuff. (02:13:37) >> Yeah. (02:13:37) >> Uh like and and if you have a perfect AI (02:13:41) tutor, um maybe you can get extremely (02:13:43) far. Yeah. (02:13:44) >> I almost feel like we're just barely the (02:13:46) the geniuses of today are barely (02:13:48) scratching the surface of what a human (02:13:49) mind can do. I think (02:13:50) >> Yeah. I I I love this vision. I also um (02:13:54) it's like I feel like the person you (02:13:56) have like most product market fit with (02:13:58) is like me because like my job involves (02:14:00) having to (02:14:01) >> learn different subjects every week and (02:14:04) I I am I am like very excited if you can (02:14:08) >> I'm similar for that matter. I mean I (02:14:10) you know a lot of people for example uh (02:14:12) hate school and want to get out of it. I (02:14:13) was I was actually I really liked (02:14:14) school. I loved learning things etc. I (02:14:16) wanted to stay in school. 
I stayed all (02:14:17) the way until PhD and then they wouldn't (02:14:19) let me stay longer so I went to the (02:14:20) industry. But I mean I basically it's (02:14:22) roughly speaking I love (02:14:24) learning uh even for the sake of (02:14:26) learning but I also um love learning (02:14:28) because it's a form of empowerment and (02:14:29) being useful and productive. I think (02:14:31) you also made a point that uh was subtle (02:14:33) so just to spell it out. I think what's (02:14:36) happened so far with online courses is (02:14:37) that why haven't they already enabled us (02:14:40) to (02:14:41) enable every single human to know (02:14:43) everything (02:14:44) >> and I think they're just so motivation (02:14:47) laden because there's not obvious (02:14:49) on-ramps and it's like so easy to get (02:14:51) stuck. Um, and if you had (02:14:55) instead this thing basically (02:14:57) like a really good human tutor, it (02:14:59) would just be such an unlock from a (02:15:01) motivation perspective. (02:15:02) >> I think so because it feels bad to (02:15:04) bounce from material. Feels bad. You get (02:15:06) negative reward from (02:15:07) >> uh sinking an amount of time in something (02:15:09) and this doesn't pan out or like being (02:15:11) completely bored because what you're (02:15:13) getting is too easy or too hard. So I (02:15:14) think uh yeah I think when you (02:15:16) actually do it properly learning feels (02:15:18) good. (02:15:18) >> Yeah. And I think it's a technical (02:15:20) problem to get there. And I think um for (02:15:22) a while it's going to be AI plus human (02:15:24) collab. And at some point maybe it's (02:15:26) just AI. I don't know. (02:15:27) >> Can I ask some questions about teaching?
(02:15:29) Well, (02:15:29) >> if you had to like sort of like give (02:15:31) advice to another educator in another (02:15:33) field that you're curious about (02:15:36) >> to make the kinds of YouTube tutorials (02:15:39) you've made. Um (02:15:40) >> it may be especially interesting (02:15:42) to talk about domains where (02:15:43) you can't test somebody's (02:15:44) technical understanding by having them (02:15:46) code something up or something. What (02:15:48) advice would you give them? (02:15:49) >> Uh, so I think that's a pretty broad (02:15:51) topic. I do feel like there's basically (02:15:53) I almost feel like there are 10 or 20 tips (02:15:54) and tricks that I kind of (02:15:55) semi-consciously probably do, but um (02:15:59) I guess like on a high level, I always (02:16:01) try to I think a lot of this comes from (02:16:03) my physics background. I really really (02:16:05) did enjoy my physics background. I have (02:16:06) a whole rant on I think how everyone (02:16:08) should learn physics uh in early (02:16:11) school education because I think early (02:16:13) school education is not about (02:16:15) accumulating knowledge or memory for (02:16:16) tasks later in the industry. It's about (02:16:18) booting up a brain and I think physics (02:16:19) uniquely boots up the brain the best. Uh (02:16:22) because some of the things that they get (02:16:23) you to do in your brain during physics (02:16:25) are extremely valuable later. The idea (02:16:27) of building models and abstractions and (02:16:28) understanding that there's a (02:16:30) first-order um approximation that (02:16:32) describes most of the system but then (02:16:34) there's a second-order, third-order, (02:16:35) fourth-order terms that may or may not (02:16:37) be present.
And the idea that you're (02:16:38) observing like a very noisy system, but (02:16:40) actually there's like these fundamental (02:16:41) frequencies that you can abstract away. (02:16:43) Like when a physicist walks into the (02:16:45) class and they say, "Oh, assume there's (02:16:47) a spherical cow and dot dot dot." And (02:16:49) everyone laughs at that, but actually (02:16:50) this is brilliant. It's brilliant (02:16:51) thinking that's very generalizable (02:16:53) across the industry because (02:16:54) >> yeah, a cow can be approximated as a (02:16:57) sphere I guess in a bunch of ways. Um (02:16:59) there's a really good book for example, (02:17:00) Scale, uh it's basically from a physicist (02:17:03) talking about biology and maybe this is (02:17:05) also a book I would recommend reading (02:17:06) but you can actually get a lot of really (02:17:08) interesting approximations and chart (02:17:10) scaling laws of animals and you can look (02:17:12) at their heartbeats and things like that (02:17:14) and they actually line up with the (02:17:16) size of the animal and things like that. (02:17:17) You can talk about an animal as a volume (02:17:19) and you can actually derive a lot of um (02:17:21) you can talk about the heat dissipation (02:17:23) uh of that because your heat (02:17:24) dissipation grows as the surface area (02:17:26) which is growing as a square but your heat (02:17:28) creation or generation um is growing as (02:17:31) a cube. (02:17:32) >> And so I just feel like physicists have (02:17:33) all the right cognitive tools to (02:17:34) approach problem solving in the world. (02:17:36) So I think because of that training I (02:17:38) always try to find the first-order terms (02:17:40) or the second-order terms of everything.
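[Editor's sketch] The square-versus-cube argument above can be checked numerically. This is only an illustration under the "spherical cow" assumption from the conversation: heat dissipation is taken as proportional to surface area and heat generation as proportional to volume. The function name `surface_to_volume_ratio` is hypothetical and does not come from the book or the speaker.

```python
import math


def surface_to_volume_ratio(radius: float) -> float:
    """Surface area divided by volume for a sphere of the given radius.

    Assumption for illustration: dissipation ~ surface area (grows as r^2),
    generation ~ volume (grows as r^3), so the ratio shrinks as 3 / r.
    """
    surface = 4.0 * math.pi * radius ** 2
    volume = (4.0 / 3.0) * math.pi * radius ** 3
    return surface / volume


# Doubling the radius halves the dissipation available per unit of heat generated.
for radius in (1.0, 2.0, 4.0, 8.0):
    print(f"radius={radius:4.1f}  surface/volume={surface_to_volume_ratio(radius):.3f}")
```

The first-order takeaway is the 3 / r trend; real animals are not spheres, and that is exactly the kind of higher-order term the speaker suggests tacking on afterwards.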
(02:17:41) When I'm observing a system or a (02:17:43) thing, I have a tangle of a web of ideas (02:17:45) or knowledge in my mind and (02:17:47) I'm trying to find what is (02:17:48) the thing that actually matters. What is (02:17:50) the first-order component? How can I (02:17:52) simplify it? How can I have a simple (02:17:53) thing that actually shows that thing, (02:17:54) right? Shows it in action and then (02:17:56) I can tack on the other terms. (02:17:57) >> Yeah, (02:17:58) >> maybe an example from (02:18:00) one of my repos that I think illustrates it (02:18:02) well is called micrograd. I don't know if (02:18:03) you're familiar with this, but (02:18:05) >> so micrograd is 100 lines of code that (02:18:07) shows backpropagation. It can uh you (02:18:09) can create neural networks out of simple (02:18:11) operations like plus and times etc, Lego (02:18:13) blocks of neural networks, and you build (02:18:15) up a computational graph and you do a (02:18:16) forward pass and a backward pass to get (02:18:18) the gradients. Um now this is at the (02:18:20) heart of all neural network learning. So (02:18:22) micrograd is 100 lines of (02:18:24) pretty interpretable Python code and it can (02:18:26) do forward and backward on arbitrary neural (02:18:27) networks but not efficiently. So (02:18:30) micrograd, these hundred lines of Python, (02:18:31) are everything you need to understand (02:18:33) how neural networks train. Everything (02:18:34) else is just efficiency. (02:18:36) >> Yeah. (02:18:37) >> Everything else is efficiency. And (02:18:38) there's a huge amount of work to do (02:18:39) efficiency. You know, you need your (02:18:40) tensors. You lay them out and you stride (02:18:42) them. You make sure your kernels are (02:18:43) orchestrating memory movement correctly, (02:18:45) etc. It's all just efficiency roughly (02:18:47) speaking.
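[Editor's sketch] The forward pass, backward pass, and computational graph described above can be illustrated in a few lines. This is not the actual micrograd source: the class name `Value` and its attributes are assumptions made for this example, and only plus and times are supported. It is meant to show the shape of the idea, not to reproduce the repository.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._backward = lambda: None    # local chain-rule step, set by each op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule output-first.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
loss = a * b + c      # forward pass builds the graph
loss.backward()       # backward pass fills in .grad on every node
print(loss.data, a.grad, b.grad, c.grad)
```

Everything a production framework adds on top of this (tensors, strides, kernels) falls under what the speaker calls efficiency; the recursive chain-rule step is the part the sketch tries to isolate.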
But the core intellectual sort (02:18:48) of piece of neural network training is (02:18:50) micrograd. It's 100 lines. You can easily (02:18:51) understand it. You're chaining. It's a (02:18:53) recursive application of chain rule to (02:18:55) derive the gradient which allows you to (02:18:56) optimize any arbitrary differentiable (02:18:57) function. So I love finding these (02:19:01) like you know small-order terms and (02:19:04) serving them on a platter and (02:19:06) discovering them and I feel like (02:19:08) education is like the most (02:19:09) intellectually interesting thing because (02:19:11) >> you have a tangle of understanding and (02:19:13) you're trying to lay it out in a way (02:19:15) that creates a ramp where everything (02:19:17) only depends on the thing before it and (02:19:19) I find that this like you know (02:19:21) untangling of knowledge is just so (02:19:22) intellectually interesting as a (02:19:24) cognitive task. (02:19:25) >> Yeah. And so I love doing it personally, (02:19:26) but I just have a fascination with (02:19:28) trying to lay things out in a certain (02:19:29) way. And maybe that helps me. (02:19:31) >> It also just makes a learning experience (02:19:34) so much more motivated. Your um (02:19:36) tutorial on the transformer begins with (02:19:40) bigram. It's literally like a lookup (02:19:41) table from (02:19:43) >> here's the word right now (02:19:44) >> or here's the previous word, here's the (02:19:46) next word. And it's literally just a (02:19:47) lookup table. (02:19:48) >> Yeah. That's the essence of it. Yeah.
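[Editor's sketch] The "lookup table from the previous word to the next word" can be written down directly. This is a toy word-level version that follows the wording in the conversation; it is not code from the tutorial being discussed, and the function names `build_bigram_table` and `predict_next` and the sample corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_table(words):
    """Count, for every word, which words follow it and how often."""
    table = defaultdict(Counter)
    for previous, following in zip(words, words[1:]):
        table[previous][following] += 1
    return table


def predict_next(table, previous):
    """Return the most frequent follower of `previous`, or None if unseen."""
    if previous not in table:
        return None
    return table[previous].most_common(1)[0][0]


corpus = "the cat sat on the mat the cat ran".split()
table = build_bigram_table(corpus)
print(predict_next(table, "the"))
print(predict_next(table, "sat"))
```

The table has no way to use anything earlier than the previous word, which is the kind of pain that motivates each piece added on the way to a transformer.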
I (02:19:49) mean such a brilliant way like okay (02:19:51) start with a lookup table and then go to (02:19:53) a transformer and then each piece is (02:19:54) motivated why would you add that why (02:19:56) would you add the next thing you (02:19:57) couldn't memorize a sort of attention (02:19:58) formula but it's like having an (02:20:00) understanding of why this is every (02:20:01) single piece is relevant what problem it (02:20:03) solves (02:20:04) >> yeah yeah you're presenting the pain (02:20:05) before you present a solution and how (02:20:07) clever is that and you want to take the (02:20:08) student through that progression so um (02:20:11) there's a lot of like other small things (02:20:12) like that that I think make it make it (02:20:14) nice and engaging interesting and and (02:20:16) you know always prompting the student (02:20:17) there's there's a lot of small things (02:20:19) like that that I think are, you know, (02:20:20) important and a lot of good educators (02:20:21) will do. Uh like how would you solve (02:20:23) this? Like I'm not going to present a (02:20:25) solution before you're going to guess. (02:20:27) >> That would be wasteful. That would be (02:20:29) that's that's a little bit of a (02:20:31) >> I don't want to swear, but like it's a (02:20:33) it's a it's a dick move towards you to (02:20:34) present you with the solution before I (02:20:36) give you a shot to try to um (02:20:38) >> Right. (02:20:38) >> Uh to come up with it yourself. (02:20:39) >> Yeah. Yeah. And because because if you (02:20:41) try to come with yourself, you you I (02:20:43) guess you get a better understanding of (02:20:44) like what is the action space. (02:20:47) >> Yeah. And then what is the sort of like (02:20:49) objective then like why does only this (02:20:51) action fulfill that objective right? (02:20:53) >> Yeah. 
Well, you have a chance to like (02:20:54) try yourself and you you have an (02:20:56) appreciation when I give you the (02:20:57) solution and uh it maximizes the amount (02:21:00) of knowledge per new fact added. (02:21:01) >> That's right. Yeah. Yeah. (02:21:03) >> Why do you think by default people who (02:21:05) are genuine experts in their field are (02:21:10) often bad at explaining it to somebody (02:21:13) ramping up? (02:21:14) >> Well, it's the curse of knowledge and (02:21:15) expertise. Yeah, (02:21:16) >> this is a real phenomenon and I actually (02:21:18) suffered from it myself as much as I try (02:21:20) to not not suffer from it. But (02:21:21) >> you take certain things for granted and (02:21:23) you can't put yourself in the shoes of (02:21:24) new of people who are just starting out (02:21:26) and uh this is pervasive and happens to (02:21:28) me as well. (02:21:29) >> One thing that I actually think is (02:21:30) extremely helpful as an example someone (02:21:32) was trying to show me a paper in biology (02:21:33) recently (02:21:34) >> and I just had instantly so many (02:21:36) terrible questions. 
Um, (02:21:38) >> so what I did was I used ChatGPT to ask (02:21:40) the questions with the paper in (02:21:42) the context window and then uh it worked (02:21:44) through some of the simple things and (02:21:46) then I actually shared the thread with the (02:21:47) person who shared it uh who actually (02:21:49) like wrote that paper or like worked on (02:21:50) that work and I almost feel like it was (02:21:52) like um like if they can see the dumb (02:21:55) questions I had it might help them (02:21:57) explain it better in the future or (02:21:58) something like that because um so for (02:22:00) example for my material I would love if (02:22:02) people shared their dumb conversations (02:22:04) with ChatGPT about the stuff that I've (02:22:06) created because it really helps me put (02:22:07) myself again in the shoes of someone (02:22:09) who's starting out. (02:22:10) >> Another trick like that that just (02:22:12) works astoundingly well. (02:22:15) Um, if somebody writes a paper or a blog (02:22:19) post or an announcement, it is in 100% (02:22:23) of cases true that just the narration or (02:22:26) the transcription of how they would (02:22:28) explain it to you over lunch (02:22:30) >> is way more uh not only understandable, (02:22:35) >> but actually also more accurate and (02:22:38) scientific in the sense that people have (02:22:41) a bias to explain things in the most (02:22:44) abstract, jargon-filled way possible (02:22:46) and to clear their throat for four (02:22:48) paragraphs before they explain the (02:22:49) central idea. (02:22:50) >> But there's something about (02:22:51) communicating one-on-one with a person (02:22:54) which compels you to just say the thing. (02:22:58) >> Just say the thing. (02:22:58) >> Yeah. Actually, I saw that tweet. I (02:23:00) thought it was really good. I shared it (02:23:01) with a bunch of people actually. I think (02:23:02) it was really good. And I noticed this (02:23:04) many many times.
Um maybe the most (02:23:06) prominent example is I remember uh back (02:23:09) in my PhD days doing research etc. Uh (02:23:11) you read someone's paper, right? and you (02:23:13) work to understand what it's doing (02:23:15) etc. And then you catch them, you're (02:23:16) having beers at the conference later and (02:23:18) you ask them so like this paper like so (02:23:20) what were you doing like what is the (02:23:21) paper about and they will just tell you (02:23:23) these like three sentences that like (02:23:24) perfectly capture the essence of that (02:23:25) paper and totally give you the idea and (02:23:27) you didn't have to read the paper. (02:23:28) >> Yeah. And like it's only when you're (02:23:30) sitting at the table with a beer or (02:23:31) something like that and like oh yeah the (02:23:33) paper is just oh you take this idea you (02:23:34) take that idea and you try this (02:23:35) experiment and uh and you try out this (02:23:37) thing and they have a way of just (02:23:38) putting it conversationally (02:23:39) >> right (02:23:40) >> and just like perfectly like why isn't (02:23:41) that the abstract (02:23:42) >> exactly (02:23:46) >> um this is coming from the perspective (02:23:47) of how somebody who's trying to explain (02:23:49) an idea should formulate it better. What (02:23:51) is your advice as a student to other (02:23:55) students where if you don't have a (02:23:56) Karpathy who is doing the exposition of (02:23:59) an idea if you're reading a paper from (02:24:01) somebody or reading a book, (02:24:03) >> what strategies do you employ (02:24:06) >> to learn material you're interested in (02:24:08) in fields you're not an expert in? (02:24:10) >> Um I don't actually know that I have um (02:24:12) like unique tips and tricks to be (02:24:14) honest. Um basically it's (02:24:17) kind of a painful process. Um but you (02:24:19) know like reddraft one.
Um I think like (02:24:22) one thing that has always helped me (02:24:24) quite a bit is um (02:24:27) I had a small tweet about this actually. (02:24:28) So like learning things on demand is (02:24:30) pretty nice. Learning depth-wise. I do (02:24:32) feel like you need a bit of alternation (02:24:34) of learning depth-wise, on demand, you're (02:24:35) trying to achieve a certain project that (02:24:36) you're going to get a reward from. (02:24:38) >> And learning breadth-wise which is just oh (02:24:39) let's do whatever 101 and here's all the (02:24:42) things you might need. A lot of (02:24:43) school does a lot of breadth-wise (02:24:44) learning like oh trust me you'll need (02:24:45) this later. You know that kind of (02:24:47) stuff (02:24:47) >> like okay I trust you I'll learn it (02:24:49) because I guess I need it. (02:24:51) >> But I love the kind of learning where (02:24:52) you'll actually get a reward out of (02:24:54) doing something and you're learning on (02:24:55) demand. (02:24:56) >> The other thing that I've found is (02:24:57) extremely helpful is um maybe this is an (02:25:01) aspect where education is a bit more (02:25:02) selfless because uh explaining things to (02:25:04) people is a beautiful way to learn (02:25:06) something more deeply. Uh this uh (02:25:08) happens to me all the time. I think it (02:25:09) probably happens to other people too (02:25:10) because (02:25:11) >> I realize if I don't really understand (02:25:13) something I can't explain it, you know, (02:25:15) and um I'm trying and I'm like (02:25:17) actually I don't understand (02:25:19) this and it's so annoying to come to (02:25:20) terms with that and then you can go back (02:25:22) and make sure you understood it and so (02:25:24) it fills these gaps of your (02:25:25) understanding. It forces you to come to (02:25:26) terms with them and uh to reconcile (02:25:28) them.
I love to re-explain things and things like (02:25:31) that and I think people should be doing (02:25:32) that more as well. I think that forces (02:25:34) you to manipulate the knowledge and make (02:25:35) sure that you know what you're (02:25:36) talking about when you're explaining it. (02:25:38) Oh yeah, I think that's an excellent (02:25:39) note to close on. (02:25:40) >> Yeah, (02:25:40) >> Andrej, that was great. (02:25:41) >> Yeah, thank you. Thanks. Take your time. (02:25:44) >> Hey everybody, I hope you enjoyed that (02:25:45) episode. If you did, the most helpful (02:25:47) thing you can do is just share it with (02:25:49) other people who you think might enjoy (02:25:51) it. It's also helpful if you leave a (02:25:53) rating or a comment on whatever platform (02:25:56) you're listening on. If you're (02:25:58) interested in sponsoring the podcast, (02:25:59) you can reach out at (02:26:01) dwarkesh.com/advertise. (02:26:05) Otherwise, I'll see you on the next one.
