Title: Andrej Karpathy — “We’re summoning ghosts, not building animals”
Duration: 02:26:08
(00:00:00)
reinforcement learning is terrible.
(00:00:03)
It just so happens that everything that
(00:00:04)
we had before is much worse. I'm
(00:00:07)
actually optimistic. I think this will
(00:00:08)
work. I think it's tractable. I'm only
(00:00:10)
sounding pessimistic because when I go
(00:00:11)
on my Twitter timeline, I see all this
(00:00:13)
stuff that makes no sense to me. A lot
(00:00:15)
of it is, I think, honestly just uh
(00:00:17)
fundraising. We're not actually building
(00:00:18)
animals. We're building ghosts. These
(00:00:20)
like sort of ethereal spirit entities
(00:00:22)
because they're fully digital and
(00:00:24)
they're kind of like mimicking humans.
(00:00:25)
And it's a different kind of
(00:00:26)
intelligence. It's business as usual
(00:00:28)
because we're in an intelligence
(00:00:29)
explosion already and have been for
(00:00:31)
decades. Everything is gradually being
(00:00:32)
automated. Has been for hundreds of
(00:00:34)
years. Don't write blog posts. Don't do
(00:00:36)
slides. Don't do any of that. Like build
(00:00:38)
the code, arrange it, get it to work.
(00:00:39)
It's the only way to go. Otherwise,
(00:00:40)
you're missing knowledge. If you have a
(00:00:42)
perfect AI tutor, maybe you can get
(00:00:44)
extremely far. The geniuses of today are
(00:00:45)
barely scratching the surface of what a
(00:00:47)
human mind can do. I think
(00:00:49)
>> Today I'm speaking with Andrej Karpathy.
(00:00:51)
Andrej, why do you say that this will be
(00:00:53)
the decade of agents and not the year of
(00:00:54)
agents?
(00:00:55)
>> Mhm. Well, first of all, thank you
(00:00:57)
for having me here. I'm excited to be
(00:00:59)
here. So the quote that you've just
(00:01:01)
mentioned it's the decade of agents.
(00:01:03)
That's actually a reaction to a
(00:01:04)
pre-existing quote, I should say,
(00:01:06)
where I think some of the labs,
(00:01:07)
I'm not actually sure who said this but
(00:01:09)
they were alluding to this being the
(00:01:10)
year of agents
(00:01:12)
>> uh with respect to LLMs and uh how they
(00:01:14)
were going to evolve. And I think um I
(00:01:16)
was triggered by that because I feel
(00:01:18)
like there's some overpredictions going
(00:01:19)
on in the industry and uh in my mind
(00:01:22)
this is really a lot more accurately
(00:01:24)
described as the decade of agents and we
(00:01:26)
have some very early agents that are
(00:01:27)
actually like extremely impressive and
(00:01:28)
that I use daily, you know, Claude and
(00:01:30)
Codex and so on, but I still feel like
(00:01:32)
there's uh so much work to be done and
(00:01:34)
so I think my like my reaction is like
(00:01:36)
we'll be working with these things for
(00:01:38)
a decade. They're going to get better,
(00:01:40)
and uh it's going to be wonderful but I
(00:01:42)
think I was just reacting to the
(00:01:43)
timelines, I suppose, of
(00:01:45)
the implication.
(00:01:47)
>> and what do you think will take a decade
(00:01:48)
to accomplish? What are the bottlenecks?
(00:01:51)
>> Well, actually making it work. So in my
(00:01:53)
mind, I mean when you're talking about
(00:01:54)
an agent, I guess or what the labs have
(00:01:56)
in mind and what maybe I have in mind as
(00:01:58)
well is it's uh you should think of it
(00:01:59)
almost like an employee or like an
(00:02:00)
intern that you would hire to work with
(00:02:01)
you. Uh so for example, you work with
(00:02:03)
some employees here. Um when would you
(00:02:05)
prefer to have an agent like Claude or
(00:02:07)
Codex do that work? Like currently,
(00:02:09)
of course they can't. What would it take
(00:02:10)
for them to be able to do that? Why
(00:02:12)
don't you do it today? And the reason
(00:02:13)
you don't do it today is because they
(00:02:14)
just don't work. So like they don't have
(00:02:17)
enough intelligence. They're not
(00:02:18)
multimodal enough. They can't do
(00:02:19)
computer use and all this kind of stuff.
(00:02:21)
And they don't do a lot of things.
(00:02:23)
You know, they don't have continual
(00:02:25)
learning. You can't just tell them
(00:02:26)
something and they'll remember it. And
(00:02:27)
they're just cognitively lacking. And
(00:02:29)
it's just not working. And I just think
(00:02:30)
that it will take about a decade to work
(00:02:32)
through all of those issues.
(00:02:33)
>> Interesting. So, um, as a professional
(00:02:35)
podcaster and a
(00:02:38)
viewer of AI from afar, it's sort of
(00:02:41)
easy to identify for me like, oh, here's
(00:02:43)
what's lacking. Continual learning is
(00:02:45)
lacking or multimodality is lacking. But
(00:02:48)
I don't really have a good um way of
(00:02:51)
trying to put a timeline on it. Like if
(00:02:53)
somebody's like, how long will continual
(00:02:54)
learning take? I
(00:02:56)
>> there's no like prior I have about like
(00:02:58)
this is a project that should take 5
(00:02:59)
years, 10 years, 50 years.
(00:03:01)
>> Why a decade? Why not one year? Why not
(00:03:03)
50 years?
(00:03:04)
>> Um, yeah, I guess this is where you get
(00:03:06)
into like a bit of I guess my own
(00:03:08)
intuition a little bit and also just
(00:03:10)
kind of doing a bit of an extrapolation
(00:03:12)
of with respect to my own experience in
(00:03:14)
the field, right? So, I guess I've been
(00:03:15)
in AI for
(00:03:16)
>> almost two decades. I mean, it's going
(00:03:17)
to be maybe 15 years or so, not that
(00:03:19)
long. You had Richard Sutton here, who
(00:03:21)
was around of course for much longer but
(00:03:23)
I do have about 15 years of experience
(00:03:25)
of people making predictions of seeing
(00:03:26)
how they actually uh turned out and also
(00:03:28)
I was in the industry for a while and I
(00:03:30)
was in research and I've worked in the
(00:03:31)
industry for a while so I guess I kind
(00:03:33)
of have uh just a general intuition that
(00:03:35)
I have left from that uh and uh I feel
(00:03:38)
like the problems are uh tractable
(00:03:41)
they're surmountable
(00:03:42)
>> but uh they're still difficult and if I
(00:03:44)
just average it out that just kind of
(00:03:45)
feels like a decade, I guess, to me.
(00:03:47)
>> This is actually quite interesting. I
(00:03:48)
want to like hear not only the history
(00:03:51)
but what people in the room felt was
(00:03:54)
about to happen at various different
(00:03:56)
>> breakthrough moments like what were the
(00:03:59)
ways in which their feelings were either
(00:04:01)
overly pessimistic or overly optimistic?
(00:04:03)
>> Yeah.
(00:04:03)
>> Yeah. Should we just go through each of
(00:04:04)
them one by one?
(00:04:05)
>> Oh yeah. I mean that's a giant question
(00:04:06)
because of course you're talking about
(00:04:07)
15 years of stuff that happened. I mean
(00:04:09)
AI is actually like so wonderful because
(00:04:10)
there have been a number of I would say
(00:04:12)
seismic shifts
(00:04:13)
>> that were like the entire field has sort
(00:04:15)
of like suddenly looked a different way,
(00:04:16)
right? And I guess I've maybe lived
(00:04:18)
through two or three of those.
(00:04:20)
>> And I still think there will continue to
(00:04:21)
be some because they come with some kind
(00:04:22)
of like almost surprising irregularity.
(00:04:25)
>> Well, when my career began, of
(00:04:26)
course, like when I started to work on
(00:04:28)
deep learning, when I became interested
(00:04:29)
in deep learning, this was just kind of
(00:04:30)
like by chance of being right next to
(00:04:32)
Geoff Hinton at the University of Toronto.
(00:04:34)
And Geoff Hinton, of course, is kind of
(00:04:35)
like the godfather figure of AI and he
(00:04:37)
was training all these neural networks
(00:04:38)
and I thought it was incredible and
(00:04:39)
interesting, but this was not like the
(00:04:41)
main thing that everyone in AI was doing
(00:04:43)
by far. Yeah,
(00:04:44)
>> this was a niche subject on the side.
(00:04:46)
That's kind of maybe like the first like
(00:04:48)
dramatic sort of seismic shift that came
(00:04:50)
with AlexNet and so on.
(00:04:51)
>> I would say AlexNet sort of reoriented
(00:04:53)
everyone and everyone started to train
(00:04:54)
neural networks. Uh but it was still
(00:04:57)
like very like per task per specific
(00:04:59)
task. So maybe I have an image
(00:05:00)
classifier or I have a neural machine
(00:05:03)
translator or something like that. And
(00:05:04)
people became very slowly actually
(00:05:06)
interested in basically kind of agents I
(00:05:07)
would say. And people started to
(00:05:10)
think okay well maybe we have a check
(00:05:11)
mark next to the visual cortex or
(00:05:13)
something like that but what about the
(00:05:14)
other parts of the brain and how can we
(00:05:15)
get an actual full agent or a full
(00:05:17)
entity that can actually interact in the
(00:05:19)
world and I would say the Atari uh sort
(00:05:21)
of uh deep reinforcement learning shift
(00:05:23)
in 2013 or so uh was part of that early
(00:05:26)
effort of agents in my mind because it
(00:05:28)
was an attempt to try to get agents that
(00:05:30)
not just perceive the world but also
(00:05:31)
take actions and interact and get
(00:05:33)
rewards from environments and at the
(00:05:34)
time this was Atari games
(00:05:36)
>> right
(00:05:36)
>> and I kind of feel like that was a
(00:05:38)
misstep actually uh and it was a misstep
(00:05:40)
that actually even the early OpenAI that
(00:05:42)
I was a part of of course uh kind of
(00:05:44)
adopted because at that time the
(00:05:46)
zeitgeist was reinforcement learning
(00:05:48)
environments, games, game playing, beat games,
(00:05:51)
get lots of different types of games, and
(00:05:52)
OpenAI was doing a lot of that. So that
(00:05:54)
was maybe like another like prominent
(00:05:57)
part of I would say AI where maybe for
(00:05:59)
two or three or four years everyone was
(00:06:01)
doing reinforcement learning on games
(00:06:03)
>> and uh basically that was a little bit
(00:06:05)
of a misstep
(00:06:07)
>> And what I was trying to do at OpenAI
(00:06:08)
actually is like I was always a little
(00:06:09)
bit suspicious of games as being like
(00:06:11)
this thing that would actually lead to
(00:06:12)
AGI because in my mind you want
(00:06:14)
something like an accountant or uh like
(00:06:16)
something that's actually interacting
(00:06:16)
with the real world and I just didn't
(00:06:18)
see how games kind of like add up to it
(00:06:20)
and so my project at OpenAI for example
(00:06:22)
was within the scope of the
(00:06:24)
Universe project, on an agent that
(00:06:27)
was using keyboard and mouse to operate
(00:06:29)
web pages. And I really wanted to have
(00:06:31)
something that like interacts with, you
(00:06:33)
know, the actual digital world that can
(00:06:34)
do knowledge work.
(00:06:35)
>> And it just so turns out that um this
(00:06:37)
was extremely early, way too early. so
(00:06:39)
early that we shouldn't have been
(00:06:41)
working on that, you know, uh because um
(00:06:43)
if you're just stumbling your way around
(00:06:45)
and keyboard mashing and mouse clicking
(00:06:47)
and trying to get rewards in these
(00:06:48)
environments, um your reward is too
(00:06:51)
sparse and you just won't learn and
(00:06:52)
you're going to burn a forest
(00:06:53)
computing and you're never actually
(00:06:55)
going to get something off the ground.
(00:06:56)
>> And so what you're missing is this uh
(00:06:58)
power of representation in the neural
(00:07:00)
network.
(00:07:01)
>> And so for example, today people are
(00:07:02)
training those computer-using agents,
(00:07:03)
but they're doing it on top of a large
(00:07:05)
language model. And so you actually have
(00:07:06)
to get the language model first. you
(00:07:07)
have to get the representations first
(00:07:09)
and you have to do that by all the
(00:07:10)
pre-training and all the LLM stuff. So I
(00:07:12)
kind of feel like maybe loosely speaking
(00:07:14)
it was like people keep maybe trying to
(00:07:17)
get the full thing too early a few times
(00:07:19)
where people like really try to go after
(00:07:21)
agents too early I would say and that
(00:07:23)
was Atari and Universe
(00:07:25)
>> uh and even my own experience and you
(00:07:26)
actually have to do some things first
(00:07:28)
before you sort of get to those agents.
(00:07:30)
Um, and maybe now the agents are a lot
(00:07:31)
more competent, but maybe we're still
(00:07:33)
missing uh sort of some parts uh of that
(00:07:35)
stack. But I would say maybe those are
(00:07:37)
like the three like major buckets of
(00:07:39)
what people were doing: training
(00:07:41)
neural nets per task, trying the
(00:07:43)
first round of agents, and then maybe
(00:07:45)
the LLMs and actually seeking the
(00:07:47)
representation power of the neural
(00:07:48)
networks before you uh tack on
(00:07:50)
everything else on top.
(00:07:51)
>> Interesting. Yeah, I guess if I were to
(00:07:53)
steelman the Sutton perspective,
(00:07:54)
it would be that humans
(00:07:56)
actually can just take on everything at
(00:07:58)
once, right? Even animals can take on
(00:07:59)
everything at once, right? Animals are
(00:08:01)
maybe a better example because they
(00:08:02)
don't even have the scaffold of
(00:08:03)
language. They just get thrown out into
(00:08:05)
the world and they just have to make
(00:08:06)
sense of everything without any labels.
(00:08:09)
Um,
(00:08:10)
>> and the vision for AGI then should just
(00:08:12)
be something which like just looks at
(00:08:13)
sensory data, looks at the computer
(00:08:15)
screen, and it just like figures out
(00:08:17)
what's going on from scratch. I mean, if
(00:08:19)
a human was put in a similar situation,
(00:08:21)
had to be trained from scratch, but I
(00:08:22)
mean, this is like a human growing up or
(00:08:23)
an animal growing up. So, why shouldn't
(00:08:25)
that be the vision for AI rather than
(00:08:26)
like this thing where we're doing
(00:08:28)
millions of years of training? I think
(00:08:30)
that's a really good question and I
(00:08:31)
think um I mean so so Sutton was on your
(00:08:34)
podcast and I saw the podcast and I had
(00:08:36)
a write up about that podcast almost
(00:08:37)
that gets into a little bit of how I see
(00:08:40)
things and I I kind of feel like I'm
(00:08:42)
very careful to make analogies to
(00:08:44)
animals because they came about by a
(00:08:46)
very different optimization process.
(00:08:48)
>> Animals are evolved and they actually
(00:08:50)
come with a huge amount of hardware
(00:08:51)
that's built in. Um, and when, for
(00:08:53)
example, my example in the post was the
(00:08:55)
zebra. A zebra gets born and a few
(00:08:57)
minutes later it's running around and
(00:08:58)
following its mother. That's an
(00:09:00)
extremely complicated thing to do.
(00:09:01)
>> Yeah.
(00:09:02)
>> Um, that's not reinforcement learning.
(00:09:04)
That's something that's baked in. And
(00:09:05)
evolution obviously has some way of
(00:09:07)
encoding the weights of our neural nets
(00:09:09)
in ATCGs. And I have no idea how that
(00:09:11)
works, but it apparently works. So, I
(00:09:13)
kind of feel like brains just
(00:09:16)
came from a very different process. And
(00:09:18)
I I'm very hesitant to take inspiration
(00:09:20)
from it because we're not actually
(00:09:21)
running that process. So in my post, I
(00:09:24)
kind of said we're not actually building
(00:09:25)
animals. Uh we're building ghosts.
(00:09:27)
>> Yeah.
(00:09:27)
>> Or spirits or whatever people want to
(00:09:29)
call it. Uh because um we're not uh
(00:09:32)
we're not doing training by evolution.
(00:09:34)
Uh we're doing training by basically
(00:09:36)
imitation of humans and the data that
(00:09:38)
they've put on the internet. And so you
(00:09:40)
end up with these like sort of ethereal
(00:09:42)
spirit entities because they're fully
(00:09:43)
digital and they're kind of like
(00:09:44)
mimicking humans. And it's a different
(00:09:46)
kind of intelligence. Like if you
(00:09:47)
imagine a space of intelligences, we're
(00:09:49)
we're starting off at a different point
(00:09:50)
almost. We're not we're not really
(00:09:52)
building animals, but I think it's also
(00:09:53)
possible to make them a bit more
(00:09:54)
animallike over time. And I think we
(00:09:56)
should be doing that. And so I kind of
(00:09:57)
feel like so just I guess one more point
(00:09:59)
is I do feel like Sutton basically has a
(00:10:01)
very like his framework is like we want
(00:10:04)
to build animals and I actually think
(00:10:05)
that would be wonderful if we can get
(00:10:06)
that to work that would be amazing. If
(00:10:08)
there was a single like
(00:10:10)
>> algorithm that you can just you know run
(00:10:12)
on the internet and it learns everything
(00:10:14)
that would be incredible. I almost
(00:10:16)
suspect that I'm not actually sure that
(00:10:18)
it exists and that's certainly actually
(00:10:19)
not what animals do
(00:10:21)
>> because animals have this outer loop of
(00:10:23)
evolution,
(00:10:23)
>> right?
(00:10:24)
>> Um, and a lot of what looks like
(00:10:25)
learning is actually a lot more
(00:10:27)
maturation of the brain, and I think there's
(00:10:29)
actually very little reinforcement
(00:10:31)
learning for animals and I think a lot
(00:10:33)
of the reinforcement learning is
(00:10:34)
actually like more like motor tasks.
(00:10:36)
It's not intelligence tasks. So I
(00:10:37)
actually kind of think humans don't
(00:10:39)
actually like really use RL roughly
(00:10:40)
speaking is what I would say.
(00:10:41)
>> Can you repeat the last sentence? A lot of
(00:10:42)
that intelligence is not motor task.
(00:10:44)
It's what? Sorry. A lot of the
(00:10:45)
reinforcement learning in my perspective
(00:10:46)
would be things that are a lot more like
(00:10:47)
motor-like, simple kinds of
(00:10:50)
tasks, like throwing a hoop or something like
(00:10:52)
that. Um but I don't think that humans
(00:10:55)
use reinforcement learning for a lot of
(00:10:57)
intelligence tasks like problem solving
(00:10:58)
and so on.
(00:10:59)
>> Interesting.
(00:11:00)
>> That doesn't mean we
(00:11:01)
shouldn't do that for research, but I
(00:11:03)
just feel like that's what animals do or
(00:11:05)
don't.
(00:11:06)
>> I'm going to take a second to digest
(00:11:07)
that because there's a lot of different
(00:11:09)
ideas. Maybe one clarifying question I
(00:11:12)
can ask to um understand a perspective.
(00:11:14)
So I think you suggest that look
(00:11:16)
evolution is doing the kind of thing
(00:11:18)
that pre-training does in the sense of
(00:11:20)
building something which can then
(00:11:23)
understand the world. The difference I
(00:11:25)
guess is that evolution
(00:11:27)
has to be titrated in the case of humans
(00:11:29)
through 3 gigabytes of DNA. And so
(00:11:33)
that's very unlike the weights of a
(00:11:36)
model. I mean literally the weights of
(00:11:38)
the model are a brain which obviously is
(00:11:40)
not encoded in the the sperm and the egg
(00:11:42)
or does not exist in the sperm and the
(00:11:44)
egg. So it has to be grown and also the
(00:11:47)
information for every single synapse in
(00:11:49)
the brain simply cannot exist in the 3
(00:11:51)
gigabytes that exist in the DNA.
(00:11:53)
Evolution seems closer to finding the
(00:11:54)
algorithm
(00:11:56)
>> which then does the lifetime learning.
(00:11:59)
Now maybe the lifetime learning is not
(00:12:01)
analogous to RL to your point. Is that
(00:12:04)
compatible with the thing you were
(00:12:05)
saying or would you disagree with that?
(00:12:06)
>> I think so. I would agree with you that
(00:12:07)
there's some miraculous compression
(00:12:08)
going on because obviously the weights
(00:12:09)
of the neural net are not stored in
(00:12:11)
ATCGs.
(00:12:12)
>> There's some kind of a dramatic
(00:12:13)
compression and there's some kind of
(00:12:15)
like learning algorithms encoded that
(00:12:16)
that take over and do some of the
(00:12:18)
learning online.
(00:12:19)
>> So I definitely agree with you on that.
(00:12:21)
Basically I would say I'm a lot more
(00:12:22)
kind of like practically minded. I don't
(00:12:24)
come at it from the perspective of like
(00:12:25)
let's build animals. I come at it from the
(00:12:27)
perspective of, like, let's build useful
(00:12:28)
things. So I have a hard hat on and I'm
(00:12:31)
just observing that look we're not going
(00:12:32)
to do evolution because I don't know how
(00:12:34)
to do that. Uh but it does turn out we
(00:12:36)
can build these ghost, spirit-like
(00:12:37)
entities by imitating internet
(00:12:39)
documents. This works and it's actually
(00:12:41)
kind of like it's a way to bring you up
(00:12:44)
to something that has a lot of sort of
(00:12:46)
built-in knowledge and intelligence in
(00:12:47)
some way. Uh similar to maybe what
(00:12:49)
evolution has done. So that's why I kind
(00:12:51)
of call pre-training this kind of like
(00:12:52)
crappy evolution. It's like the
(00:12:55)
practically possible version with our
(00:12:57)
technology and what we have available to
(00:12:58)
us to get to a starting point where we
(00:13:01)
can actually do things like
(00:13:02)
reinforcement learning and so on. Just
(00:13:04)
to steelman the other perspective,
(00:13:05)
because after doing the Sutton interview
(00:13:06)
and thinking about it a bit he has an
(00:13:08)
important point here evolution does not
(00:13:10)
give us the knowledge really right it
(00:13:12)
gives us the algorithm to find the
(00:13:14)
knowledge and that seems different from
(00:13:15)
pre-training. So perhaps the
(00:13:17)
perspective is that pre-training helps
(00:13:19)
build the kind of entity which can learn
(00:13:21)
better. It teaches meta-learning, and
(00:13:23)
therefore it is similar to
(00:13:25)
finding an algorithm. But if it's
(00:13:27)
like evolution gives us knowledge and
(00:13:28)
pre-training gives us knowledge,
(00:13:29)
that analogy seems to break down.
(00:13:31)
>> So it's subtle, and I think you're right
(00:13:33)
to push back on it, but basically the
(00:13:35)
thing that pre-training is doing, so
(00:13:36)
you're basically getting the next token
(00:13:38)
predictor over the internet and
(00:13:39)
you're training that into a neural net.
(00:13:41)
>> It's doing two things actually that are
(00:13:42)
kind of like unrelated. Number one, it's
(00:13:44)
picking up all this knowledge as I call
(00:13:46)
it. Number two, it's actually becoming
(00:13:47)
intelligent.
(00:13:48)
>> Um, by observing the algorithmic
(00:13:50)
patterns in the internet, it actually
(00:13:52)
kind of like boots up all these like
(00:13:53)
little circuits and algorithms inside
(00:13:55)
the neural net to do things like in
(00:13:56)
context learning and all this kind of
(00:13:57)
stuff.
(00:13:58)
>> And actually, you don't actually need or
(00:14:00)
want the knowledge. I actually think
(00:14:01)
that's probably actually holding back
(00:14:03)
the neural networks overall because it's
(00:14:04)
actually like getting them to rely on
(00:14:05)
the knowledge a little too much
(00:14:06)
sometimes.
(00:14:07)
>> For example, I I kind of feel like
(00:14:09)
agents one thing they're not very good
(00:14:10)
at is going off the data manifold of
(00:14:12)
what exists on the internet.
(00:14:13)
>> If they had less knowledge or less
(00:14:15)
memory actually maybe they would be
(00:14:17)
better.
(00:14:17)
>> Yeah. Yeah. And so what I think we have
(00:14:19)
to do kind of going forward and this
(00:14:20)
will be part of the research paradigms
(00:14:21)
is I actually think we need to start um
(00:14:23)
we need to figure out ways to remove
(00:14:25)
some of the knowledge and to keep what I
(00:14:26)
call this cognitive core,
(00:14:29)
>> this intelligent entity that is
(00:14:31)
kind of stripped from knowledge but
(00:14:32)
contains the algorithms and contains the
(00:14:34)
magic you know of intelligence and
(00:14:36)
problem solving and the strategies of it
(00:14:38)
and all this kind of stuff.
(00:14:39)
>> There's so much interesting stuff there.
(00:14:41)
Okay. So let's start with in context
(00:14:43)
learning. This is an obvious point, but
(00:14:45)
I think it's worth just like saying it
(00:14:47)
explicitly and meditating on it. The
(00:14:49)
situation in which these models seem the
(00:14:51)
most intelligent in which they are like
(00:14:53)
I talk to them and I'm like, "Wow,
(00:14:54)
there's really something on the other
(00:14:56)
end that's like responding to me
(00:14:57)
thinking about things. If it like makes
(00:14:59)
a mistake, it's like, oh wait, that's
(00:15:00)
actually the wrong way to think about
(00:15:01)
it. I'm backing up." All that is
(00:15:02)
happening in context. That's where I
(00:15:04)
feel like the real intelligence you can
(00:15:05)
like visibly see.
(00:15:07)
>> And that in context learning process is
(00:15:11)
developed by gradient descent on
(00:15:12)
pre-training, right? Like it
(00:15:14)
spontaneously meta-learns in-context
(00:15:16)
learning, but the in-context learning
(00:15:18)
itself is not gradient descent in the
(00:15:20)
same way that our lifetime intelligence
(00:15:23)
as humans to be able to do things is
(00:15:25)
conditioned by evolution but our actual
(00:15:27)
learning during our lifetime is like
(00:15:29)
happening through some other process
(00:15:31)
>> I actually don't fully agree with that
(00:15:32)
but you should continue with
(00:15:33)
>> okay actually then I I'm very curious to
(00:15:35)
understand how that analogy breaks down
(00:15:37)
>> I think I'm hesitant to say that in
(00:15:39)
context learning is not doing gradient
(00:15:40)
descent uh because I mean it's not doing
(00:15:42)
explicit gradient descent, but I still
(00:15:44)
think that
(00:15:45)
>> so in context learning basically it's
(00:15:46)
it's pattern completion within uh a
(00:15:48)
token window, right? And it just turns
(00:15:50)
out that there's a huge amount of
(00:15:51)
patterns on the internet. And so you're
(00:15:52)
right, the model kind of like learns to
(00:15:54)
complete the pattern. Yeah.
(00:15:55)
>> And that's inside the weights. The
(00:15:56)
weights of the neural network are trying
(00:15:58)
to discover patterns and complete the
(00:16:00)
pattern. And there's some kind of an
(00:16:01)
adaptation that happens inside the
(00:16:02)
neural network, right?
(00:16:03)
>> Uh which is kind of magical and just
(00:16:05)
falls out from internet just because
(00:16:06)
there's a lot of patterns. I will say
(00:16:09)
that there have been some papers that I
(00:16:11)
thought were interesting that actually
(00:16:12)
look at the mechanisms behind in context
(00:16:14)
learning and I do think it's possible
(00:16:15)
that in context learning actually runs a
(00:16:16)
small gradient descent loop internally
(00:16:18)
in the layers of the neural network and
(00:16:20)
so I recall one paper in particular
(00:16:22)
where they were doing linear
(00:16:24)
regression actually using in context
(00:16:26)
learning. So basically your inputs into
(00:16:27)
the neural network are XY pairs
(00:16:31)
>> XY XY XY that happen to be on the line
(00:16:33)
>> and then you do X and you expect the Y
(00:16:35)
and the neural network when you train it
(00:16:37)
in this way actually does do
(00:16:39)
linear regression
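As a reference point for the in-context setup described here, the following is a minimal Python sketch of the explicit version of the task: fit a line to XY pairs with a small gradient descent loop, then predict Y for a new X. The slope, intercept, learning rate, step count, and the name `fit_line_gd` are illustrative assumptions, not details taken from the paper mentioned in the conversation.

```python
def fit_line_gd(pairs, lr=0.05, steps=2000):
    """Fit y = w * x + b to (x, y) pairs with plain gradient descent.

    Hypothetical helper for illustration only. It minimizes mean squared
    error by repeatedly looking at the error, computing the gradient with
    respect to w and b, and taking a small step.
    """
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in pairs:
            err = (w * x + b) - y          # prediction error on this pair
            grad_w += 2.0 * err * x / n    # d(MSE)/dw
            grad_b += 2.0 * err / n        # d(MSE)/db
        w -= lr * grad_w                   # gradient descent update
        b -= lr * grad_b
    return w, b


# XY pairs that happen to lie on a line (assumed: y = 3x + 1).
pairs = [(x, 3.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

w, b = fit_line_gd(pairs)

# "Then you do X and you expect the Y": predict for a held-out query.
query_x = 4.0
predicted_y = w * query_x + b
print(f"w={w:.4f} b={b:.4f} predicted_y={predicted_y:.4f}")
```

In the in-context version, the same pairs would be placed in the prompt and the network would be asked to complete the final Y, with no explicit optimizer being run.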
(00:16:41)
>> and um normally when you would run
(00:16:43)
linear regression you have a small
(00:16:44)
gradient descent optimizer that
(00:16:45)
basically looks at XY looks at an error
(00:16:48)
calculates the gradient of the weights
(00:16:49)
and does the update a few times. It just
(00:16:51)
turns out that when they looked at the
(00:16:52)
weights of that in context learning
(00:16:54)
algorithm uh they actually found some
(00:16:56)
analogies to uh to gradient descent
(00:16:58)
mechanics. In fact, I think even the
(00:17:00)
paper was stronger because they
(00:17:02)
actually hardcoded the weights of a
(00:17:04)
neural network to do gradient descent
(00:17:06)
through attention and all the
(00:17:09)
internals of the neural network. So,
(00:17:11)
I guess that's just my only push back is
(00:17:12)
that who knows how in context learning
(00:17:14)
works, but I actually think that it's
(00:17:16)
probably doing a little bit of some kind
(00:17:17)
of funky gradient descent internally and
(00:17:19)
I think that's
(00:17:21)
possible. So, I guess I was only
(00:17:22)
pushing back on you're saying it's not
(00:17:24)
doing in context learning. Who knows
(00:17:25)
what it's doing, but it's probably maybe
(00:17:26)
doing something similar to it, but we
(00:17:28)
don't know. So then it's worth thinking
(00:17:29)
about okay if both of them are
(00:17:31)
implementing gradient sorry if in
(00:17:33)
context learning and pre-training are
(00:17:35)
both implementing something like
(00:17:36)
gradient descent
(00:17:38)
>> why does it feel like in context
(00:17:39)
learning actually we're getting to this
(00:17:42)
like continual learning real
(00:17:43)
intelligence like thing whereas you
(00:17:45)
don't get the analogous feeling just
(00:17:46)
from pre-training at least you could
(00:17:48)
argue that and so if it's the same
(00:17:50)
algorithm what could be different well
(00:17:51)
one way you can think about it is how
(00:17:53)
much information does the model store
(00:17:56)
per unit of information it receives from
(00:17:59)
training. And if you look at
(00:18:01)
pre-training, if uh I think if you look
(00:18:03)
at Llama 3, for example, I think it's
(00:18:04)
trained on
(00:18:06)
>> 15 trillion tokens and if you look at
(00:18:08)
the 70B model,
(00:18:10)
>> that would be the equivalent of 0.07 bits
(00:18:13)
per token that it sees in
(00:18:14)
pre-training in terms of like the
(00:18:16)
information in the weights of the model
(00:18:17)
compared to the tokens it reads.
(00:18:19)
>> Whereas if you look at the KV cache
(00:18:21)
>> and how it grows per additional token in
(00:18:23)
in context learning, it's like 320
(00:18:25)
kilobytes.
(00:18:26)
>> Yeah. So that's a 35-million-fold
(00:18:28)
difference in how much information per
(00:18:30)
token is assimilated by the model. I
(00:18:34)
wonder if that's relevant at all.
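The figures quoted here can be checked with a few lines of arithmetic. The sketch below assumes 16-bit weights and the commonly cited Llama 3 70B attention shape (80 layers, 8 key/value heads, head dimension 128, 16-bit cache entries). None of these architecture values are stated in the conversation, so treat them as assumptions.

```python
# Figures taken from the conversation.
PARAMS = 70e9            # 70B-parameter model
TRAIN_TOKENS = 15e12     # 15 trillion pre-training tokens

# Assumed values (not stated in the conversation).
BITS_PER_PARAM = 16      # bf16/fp16 weights
N_LAYERS = 80            # transformer layers
N_KV_HEADS = 8           # key/value heads under grouped-query attention
HEAD_DIM = 128           # dimension per attention head
BYTES_PER_VALUE = 2      # 16-bit KV cache entries


def weight_bits_per_token(params, tokens, bits_per_param):
    """Bits of weight storage available per pre-training token seen."""
    return params * bits_per_param / tokens


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes the KV cache grows by for each extra token in context.

    The leading factor of 2 accounts for storing both a key and a value.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


weight_bits = weight_bits_per_token(PARAMS, TRAIN_TOKENS, BITS_PER_PARAM)
kv_bytes = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
ratio = (kv_bytes * 8) / weight_bits

print(f"weights: {weight_bits:.3f} bits per training token")
print(f"KV cache: {kv_bytes / 1024:.0f} KiB per context token")
print(f"ratio: about {ratio / 1e6:.0f} million to one")
```

Under these assumptions the numbers land close to the ones quoted: roughly 0.07 bits per training token in the weights, 320 KiB per context token in the KV cache, and a ratio of about 35 million.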
(00:18:35)
>> I think I kind of agree. I mean the way
(00:18:37)
I usually put this is that anything that
(00:18:39)
happens during the training of the
(00:18:40)
neural network. The knowledge is only
(00:18:42)
kind of like a hazy recollection of what
(00:18:44)
happened at training time,
(00:18:45)
and that's because the compression is
(00:18:47)
dramatic. You're taking 15
(00:18:48)
trillion tokens and you're compressing
(00:18:49)
it to just your final network of a few
(00:18:51)
billion parameters. So obviously it's a
(00:18:52)
massive amount of compression going on.
(00:18:54)
uh so I kind of refer to it as like a
(00:18:56)
hazy recollection of the internet
(00:18:57)
documents whereas anything that happens
(00:18:59)
in the context window of the neural
(00:19:00)
network you're plugging all the tokens
(00:19:02)
and it's building up all this KV cache
(00:19:03)
representation is very directly
(00:19:05)
accessible to the neural net so I
(00:19:06)
compare the KV cache and the the stuff
(00:19:08)
that happens at test time to like more
(00:19:10)
like a working memory
(00:19:11)
>> uh like all the stuff that's
(00:19:13)
um in the context window is very
(00:19:14)
directly accessible to the neural net so
(00:19:16)
there's always like these um almost
(00:19:19)
surprising analogies between LLMs and
(00:19:20)
humans and I find them kind of
(00:19:22)
surprising because we're not trying to
(00:19:23)
build a human brain, of course, just
(00:19:25)
directly we're just finding that this
(00:19:26)
works and we're doing it
(00:19:27)
>> but I do think that
(00:19:29)
>> anything that's in the weights it's kind
(00:19:30)
of like a hazy recollection of what you
(00:19:32)
read a year ago anything that you give
(00:19:34)
it as a context uh at test time is
(00:19:37)
directly in the working memory um and I
(00:19:39)
think that's a very powerful analogy to
(00:19:40)
think through things so when you for
(00:19:42)
example go to an LLM and you ask it
(00:19:43)
about some book and what happened in it
(00:19:45)
like Nick Lane's book or something like that
(00:19:47)
the LLM will often give you some stuff
(00:19:49)
which is roughly correct but if you give
(00:19:50)
it the full chapter and ask it questions
(00:19:52)
you're going to get much better results
(00:19:54)
because it's now loaded in the working
(00:19:55)
memory of the model. So I basically
(00:19:57)
agree with your very long way of saying
(00:19:59)
that I kind of agree and that's why
(00:20:00)
>> stepping back what is the part about
(00:20:02)
human intelligence that we like have
(00:20:05)
most failed to replicate with these
(00:20:07)
models?
(00:20:08)
>> Um I almost feel like um just uh just a
(00:20:13)
lot of it still. So maybe one way to
(00:20:15)
think about it, I don't know if this is
(00:20:16)
the best way, but I almost kind of
(00:20:19)
feel like again making these analogies,
(00:20:20)
imperfect as they are, um we've stumbled
(00:20:23)
upon the transformer neural network,
(00:20:24)
which is extremely powerful, very general.
(00:20:27)
You can train transformers on audio or
(00:20:29)
video or text or whatever you want and
(00:20:32)
it just learns patterns and they're very
(00:20:33)
powerful and it works really well. That
(00:20:36)
to me almost indicates that this is kind
(00:20:37)
of like some piece of cortical tissue.
(00:20:39)
Uh it's something like that because the
(00:20:40)
cortex is famously very um plastic as
(00:20:42)
well. You can rewire um you know parts
(00:20:45)
of brains and there were the slightly
(00:20:47)
gruesome experiments with rewiring like
(00:20:49)
visual cortex to the auditory cortex and
(00:20:51)
this animal learned fine, etc. Um, so
(00:20:54)
I think that this is kind of like
(00:20:55)
cortical tissue. I think when we're
(00:20:58)
doing reasoning and planning inside the
(00:21:00)
neural networks, so basically doing
(00:21:02)
reasoning traces um for thinking models,
(00:21:04)
that's kind of like the prefrontal
(00:21:05)
cortex. Um, and then um I think we uh
(00:21:10)
maybe those are like little check marks,
(00:21:12)
but I still think there's many uh brain
(00:21:13)
parts and nuclei that are not explored.
(00:21:15)
So maybe for example there's a basal
(00:21:16)
ganglia doing a bit of reinforcement
(00:21:18)
learning when we fine-tune the models
(00:21:19)
on reinforcement learning but you
(00:21:20)
know whereas like the hippocampus not
(00:21:22)
obvious what that would be. Some parts
(00:21:24)
are probably not important maybe the
(00:21:25)
cerebellum is like not important to
(00:21:27)
cognition, it's thought, so we can skip
(00:21:28)
some of it uh but I still think there's
(00:21:30)
for example the amygdala, all the emotions
(00:21:32)
and instincts um and there's probably
(00:21:34)
like a bunch of other nuclei in the
(00:21:36)
brain that are very ancient that I don't
(00:21:37)
think we've like really replicated I
(00:21:39)
don't actually know that we should be
(00:21:40)
pursuing you know the building of an
(00:21:42)
analog of the human brain. I'm again an
(00:21:44)
engineer mostly at heart. But um I still
(00:21:47)
feel like maybe another way to answer
(00:21:50)
the question is you're not going to hire
(00:21:51)
this thing as an intern and it's missing
(00:21:53)
a lot, and it's because it comes with a
(00:21:54)
lot of these cognitive deficits that we
(00:21:55)
all intuitively feel when we talk to the
(00:21:57)
models.
(00:21:58)
>> Um
(00:21:58)
>> and so it's just like not fully there
(00:22:00)
yet. You can look at it as like not all
(00:22:02)
the brain parts are checked off yet.
(00:22:04)
>> This is maybe relevant to the question
(00:22:07)
of thinking about how fast these issues
(00:22:09)
will be solved. So sometimes people will
(00:22:12)
say about continual learning. Look,
(00:22:14)
actually you could already you could
(00:22:16)
easily replicate this capability just as
(00:22:18)
in-context learning emerged
(00:22:19)
spontaneously as a result of
(00:22:21)
pre-training.
(00:22:22)
Continual learning over longer horizons
(00:22:24)
will emerge spontaneously if the model
(00:22:27)
is incentivized to recollect information
(00:22:29)
over longer horizons or horizons longer
(00:22:32)
than one session. So if there's um some
(00:22:36)
like outer loop RL which has many
(00:22:40)
sessions within that outer loop then
(00:22:43)
like this continual learning where it
(00:22:44)
uses like it fine-tunes itself or it
(00:22:46)
writes to an external memory or
(00:22:47)
something will just sort of like emerge
(00:22:49)
spontaneously. Do you think
(00:22:50)
>> do you think things like that are
(00:22:52)
plausible? I just I don't have really a
(00:22:53)
prior over like how plausible is that?
(00:22:54)
How likely is that to happen?
(00:22:55)
>> I don't know that I fully resonate with
(00:22:57)
that because I feel like these models
(00:22:58)
when you boot them up and they have zero
(00:23:00)
tokens in the window, they're always
(00:23:01)
like restarting from scratch where they
(00:23:03)
were. So I don't actually know in that
(00:23:05)
worldview what it looks like. Uh because
(00:23:07)
um again maybe making some
(00:23:10)
analogies to humans just because I think
(00:23:11)
it's roughly concrete and kind of
(00:23:13)
interesting to think through. I feel
(00:23:15)
like when I'm awake I'm building up a
(00:23:16)
context window of stuff that's happening
(00:23:17)
during the day but I feel like when I go
(00:23:19)
to sleep something magical happens where
(00:23:21)
uh I don't actually think that that
(00:23:22)
context window stays around. Um I think
(00:23:24)
there's some process of distillation
(00:23:25)
into weights of my brain.
(00:23:27)
>> Yeah.
(00:23:27)
>> Um and this happens during sleep and all
(00:23:29)
this kind of stuff. We don't have an
(00:23:30)
equivalent of that in large language
(00:23:33)
models and that's to me more adjacent to
(00:23:35)
when you talk about continual learning
(00:23:36)
and so on as absent. These models don't
(00:23:39)
really have this distillation phase um
(00:23:41)
of taking what happened, analyzing it,
(00:23:44)
obsessively thinking through it, um
(00:23:47)
basically doing some kind of a synthetic
(00:23:48)
data generation process and distilling
(00:23:50)
it back into the weights and maybe
(00:23:51)
having uh you know specific neural net
(00:23:54)
per person uh maybe it's a LoRA, it's
(00:23:57)
not a full uh yeah it's not a full
(00:23:59)
weight uh neural network, it's
(00:24:01)
just that some small sparse
(00:24:04)
subset of the weights are changed
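To make the LoRA remark concrete, here is a minimal pure-Python sketch. LoRA keeps the base weight matrix frozen and learns a small low-rank update on top of it, so only a small fraction of the numbers are trainable. The function names and the example sizes are illustrative assumptions, not anything described in the conversation.

```python
# Minimal LoRA-style sketch in plain Python (illustrative names and sizes).
# The base matrix W stays frozen; only the two small factors A and B are trained.

def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (parameters in the full matrix, parameters in the low-rank adapter)."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)  # B is d_out x rank, A is rank x d_in
    return full, adapter


def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, a, b, scale: float = 1.0):
    """Return the effective weight W + scale * (B @ A) without modifying W."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    full, adapter = lora_param_counts(4096, 4096, 8)
    print(f"full: {full}, adapter: {adapter}, fraction: {adapter / full:.4f}")
```

A per-person adapter of this kind would store only the small factors, while the shared base model stays untouched.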
(00:24:05)
>> but basically we do want to create ways
(00:24:07)
of creating these individuals that have
(00:24:09)
very long contexts. It's not only
(00:24:11)
remaining in the context window because
(00:24:12)
the context windows grow very very long
(00:24:15)
like maybe we have some very elaborate
(00:24:16)
sparse attention over it
(00:24:18)
>> but I still think that humans obviously
(00:24:20)
have some process for distilling some of
(00:24:22)
that knowledge into the weights we're
(00:24:23)
missing it and I do also think that
(00:24:25)
humans um have some kind of a very
(00:24:27)
elaborate sparse attention scheme
(00:24:30)
>> um which I think we're starting to see
(00:24:32)
some early hints of. Uh, so DeepSeek V3.2
(00:24:35)
just came out and I saw that they have
(00:24:36)
like a sparse attention as an example
(00:24:38)
and this is one way to have very very
(00:24:40)
long context windows.
(00:24:41)
>> So I almost feel like we are redoing a
(00:24:43)
lot of the cognitive tricks that
(00:24:45)
evolution came up with through a very
(00:24:47)
different process but we're I think
(00:24:48)
going to converge on a similar
(00:24:49)
architecture cognitively.
(00:24:50)
>> Interesting. In 10 years do you think
(00:24:52)
it'll still be something like a
(00:24:53)
transformer but with a much more
(00:24:55)
modified attention and more sparse uh
(00:24:57)
MLPs and so forth? Well, the way I like
(00:24:59)
to think about it is okay, let's uh
(00:25:01)
translation invariance in time, right?
(00:25:02)
So 10 years ago, where were we?
(00:25:04)
>> 2015 uh we had uh convolutional neural
(00:25:07)
networks primarily. Residual networks
(00:25:09)
just came out. Um so remarkably similar
(00:25:12)
I guess, but quite a bit different
(00:25:13)
still. I mean transformer was not
(00:25:14)
around. Um
(00:25:16)
>> you know all the um all these sort of
(00:25:18)
like more modern uh tweaks on the
(00:25:20)
transformer were not around. So maybe
(00:25:22)
some of the things that we can bet on I
(00:25:24)
think in 10 years uh by translational
(00:25:26)
sort of equivariance is um we're still
(00:25:29)
training giant neural networks with uh
(00:25:30)
forward backward pass and update through
(00:25:32)
gradient descent um but maybe it looks a
(00:25:36)
little bit different
(00:25:36)
>> and it's just everything is much bigger
(00:25:38)
actually recently I also went back all
(00:25:41)
the way to 1989 which was kind of a fun
(00:25:43)
uh exercise for me a few years ago uh
(00:25:45)
because I was reproducing uh Yann LeCun's
(00:25:48)
1989 convolutional network which was the
(00:25:50)
first neural network I'm aware of
(00:25:51)
trained via gradient descent like modern
(00:25:53)
neural network trained with gradient descent
(00:25:55)
on uh digit recognition
(00:25:57)
>> and I was just interested in okay how
(00:25:59)
can I modernize this how much of this is
(00:26:01)
algorithms how much of this is data how
(00:26:02)
much of this progress is uh compute and
(00:26:04)
systems and I was able to very quickly
(00:26:06)
like halve the error rate just knowing
(00:26:08)
by time travel by 33 years. So if I
(00:26:11)
time travel by algorithms to 33 years I
(00:26:13)
could adjust what Yann did in 1989 and I
(00:26:16)
could basically halve
(00:26:17)
the error but to get further gains I had
(00:26:20)
to add a lot more data. I had to like
(00:26:22)
10x the training set and then I had to
(00:26:24)
actually add more computational
(00:26:25)
optimizations. Uh had to basically train
(00:26:28)
for much longer with dropout and other
(00:26:29)
regularization techniques.
(00:26:30)
>> And so it's almost like all these things
(00:26:33)
have to improve simultaneously. So um
(00:26:35)
you know we're probably going to have a
(00:26:36)
lot more data. We're probably going to
(00:26:37)
have a lot better hardware. Probably
(00:26:38)
going to have a lot better kernels and
(00:26:40)
software. We're probably going to have
(00:26:41)
better algorithms. And all of those it's
(00:26:43)
almost like no one of them is winning
(00:26:45)
too much. All of them are surprisingly
(00:26:47)
equal.
(00:26:48)
Mm.
(00:26:49)
>> and this has kind of been the trend for
(00:26:50)
a while. So I guess to answer maybe your
(00:26:52)
question, I expect differences
(00:26:55)
algorithmically to what's happening
(00:26:56)
today. Uh but I do also expect that some
(00:26:58)
of the things that have stuck around for
(00:27:00)
a very long time will probably still be
(00:27:01)
there. It's probably still a giant neural
(00:27:03)
network trained with gradient descent.
(00:27:04)
That would be my guess.
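The recipe expected to survive (forward pass, backward pass, gradient descent update) fits in a few lines. This toy sketch fits a single line y = w*x + b by hand-written gradients; the data, learning rate, and function name are illustrative assumptions, and real training differs mainly in scale.

```python
# Toy gradient descent: forward pass, backward pass, parameter update.
# Everything here (data, learning rate, step count) is an illustrative assumption.

def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Forward pass: compute predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Backward pass: gradients of the mean squared error w.r.t. w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Update: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Points lying exactly on y = 2x + 1.
    w, b = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
    print(f"w = {w:.4f}, b = {b:.4f}")
```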
(00:27:05)
>> It's surprising that all of those things
(00:27:06)
together only halved the
(00:27:12)
error. Yeah. which is so like 30 years
(00:27:14)
of progress is uh maybe maybe half is a
(00:27:16)
lot because like if you halve the error
(00:27:17)
that actually means that
(00:27:18)
>> half is a lot. Yeah.
(00:27:19)
>> Yeah. Yeah. Okay. Um
(00:27:20)
>> but it's I guess what was shocking to me
(00:27:21)
is everything needs to improve across
(00:27:23)
the board.
(00:27:24)
>> Uh architecture optimizer loss function
(00:27:26)
and also has improved across the board
(00:27:28)
forever. So I kind of expect all those
(00:27:30)
changes to be alive and well. Well,
(00:27:31)
yeah. Actually, I was about to ask a
(00:27:33)
very similar question about nanochat
(00:27:34)
because since you just coded it up
(00:27:36)
recently, every single sort of step in
(00:27:39)
the, you know, process of building a
(00:27:41)
chatbot is like fresh in your RAM.
(00:27:43)
>> And I'm curious if you had similar
(00:27:45)
thoughts about like, oh, there was no
(00:27:47)
one thing that was relevant to going
(00:27:49)
from GPT-2 to nanochat. What are sort of
(00:27:53)
like surprising takeaways from the
(00:27:55)
experience
(00:27:56)
>> building? So, nanochat is a kind of a
(00:27:58)
repository I released was it yesterday
(00:28:00)
or day before? I can't remember.
(00:28:03)
We can see the sleep deprivation that
(00:28:05)
went into the
(00:28:06)
>> um well it's just trying to be a it's
(00:28:09)
trying to be the simplest complete
(00:28:11)
repository that covers the whole
(00:28:12)
pipeline end to end of building a ChatGPT
(00:28:15)
clone
(00:28:16)
>> and so you know you have all of the
(00:28:18)
steps not just any individual step which
(00:28:20)
is a bunch of I worked on all the
(00:28:21)
individual steps sort of in the past and
(00:28:23)
really small pieces of code that kind of
(00:28:25)
um show you how that's done in
(00:28:26)
algorithmic sense um uh in like simple
(00:28:29)
code but this kind of handles the entire
(00:28:31)
pipeline I I think in terms of learning
(00:28:33)
it's not it's not so much um I don't
(00:28:35)
know that I actually found something
(00:28:36)
that I learned from it necessarily.
(00:28:38)
I kind of already had in my mind as like
(00:28:40)
how you build it and this is just a
(00:28:41)
process of mechanically uh building it
(00:28:45)
and making it clean enough and uh so
(00:28:48)
that people can actually learn from it
(00:28:49)
and um that uh they find it useful.
(00:28:52)
>> Yeah. What is the best way for somebody
(00:28:53)
to learn from it? Is it just like delete
(00:28:55)
all the code and try to reimplement from
(00:28:56)
scratch? Try to add modifications to it?
(00:28:58)
>> Uh yeah, I think that's a that's a great
(00:29:00)
question. I would probably say so
(00:29:02)
basically it's about 8,000 lines of code
(00:29:03)
that takes you through the entire
(00:29:04)
pipeline. I would probably put it on the
(00:29:06)
right monitor like if you have two
(00:29:08)
monitors you put it on the on the right.
(00:29:09)
>> Um and you want to build it from
(00:29:11)
scratch. You build it from start. You're
(00:29:13)
not allowed to copy paste. You're
(00:29:15)
allowed to reference. You're not allowed
(00:29:16)
to copy paste. Maybe that's how I would
(00:29:17)
do it.
(00:29:18)
>> Um but I also think the repository by
(00:29:20)
itself it is like a pretty large beast.
(00:29:22)
I mean it's you know it's a it's when
(00:29:24)
you write this code you don't go from
(00:29:25)
top to bottom. You go from chunks and
(00:29:27)
you grow the chunks and uh that
(00:29:29)
information is absent like you wouldn't
(00:29:30)
know where to start and so I think it's
(00:29:32)
not just the final repository that's
(00:29:34)
needed it's like the building of the
(00:29:35)
repository which is a complicated chunk
(00:29:37)
growing process
(00:29:38)
>> right
(00:29:39)
>> uh so that part is not there yet I would
(00:29:41)
love to actually like add that probably
(00:29:42)
later this week or something in some way
(00:29:44)
like either it's a uh it's probably a
(00:29:46)
video or something like that but um but
(00:29:48)
maybe roughly speaking that's what I
(00:29:50)
would try to do is build the stuff
(00:29:52)
yourself uh but uh don't allow yourself
(00:29:54)
copy paste
(00:29:55)
>> I do think that there's two types of
(00:29:56)
knowledge almost like there's the high
(00:29:58)
level surface knowledge but the thing is
(00:30:00)
that when you actually build something
(00:30:01)
from scratch you're forced to come to
(00:30:02)
terms with what you don't actually
(00:30:04)
understand and you don't know that you
(00:30:05)
don't understand it
(00:30:06)
>> interesting
(00:30:06)
>> and it always leads to a deeper
(00:30:07)
understanding uh and um it's like just
(00:30:10)
the only way to build. It's like, if I
(00:30:12)
can't build it I don't understand it is
(00:30:14)
that a Feynman quote, I believe, or something
(00:30:16)
along those lines
(00:30:17)
>> I 100% I've always believed this very
(00:30:19)
strongly uh because there's all these
(00:30:21)
like micro things that are just not
(00:30:23)
properly arranged and you don't really
(00:30:24)
have the knowledge, you just think you have the
(00:30:25)
knowledge. So, don't write blog posts,
(00:30:27)
don't do slides, don't do any of that.
(00:30:29)
Like, build the code, arrange it, get it
(00:30:30)
to work. It's the only way to go.
(00:30:32)
Otherwise, you're missing knowledge.
(00:30:33)
>> Um, you tweeted out that coding models
(00:30:35)
were actually of very little help to you
(00:30:37)
in assembling this repository and I'm
(00:30:39)
curious why that was.
(00:30:41)
>> Yeah. Uh, so the repository, I guess I
(00:30:44)
built it over a period of a bit more
(00:30:45)
than a month, and I would say there's
(00:30:47)
like three major classes of how people
(00:30:49)
interact with code right now. Some
(00:30:51)
people completely reject all of LLMs and
(00:30:53)
they are just uh writing from scratch. I
(00:30:55)
think this is probably not the right
(00:30:56)
thing to do anymore. Um the intermediate
(00:30:59)
part which is where I am is you still
(00:31:01)
write a lot of things from scratch but
(00:31:03)
you use uh the autocomplete uh that's
(00:31:05)
basically uh available now from these
(00:31:06)
models. So when you start writing out
(00:31:08)
little pieces of it, it will autocomplete
(00:31:10)
for you and you can just tab
(00:31:11)
through and most of the time it's
(00:31:12)
correct. Sometimes it's not and you edit
(00:31:14)
it but you're still very much the um
(00:31:16)
sort of architect of what you're
(00:31:18)
writing. And then there's the, you know,
(00:31:19)
vibe coding, uh, you know, hi, please
(00:31:22)
implement this or that, uh, you know,
(00:31:24)
enter and then let the model do it. And
(00:31:26)
that's the agents.
(00:31:27)
>> Um, I do feel like the agents work in
(00:31:30)
very specific settings and I would use
(00:31:32)
them in specific settings. But again,
(00:31:34)
these are all tools available to you and
(00:31:35)
you have to like learn what
(00:31:37)
they're good at and what they're not
(00:31:38)
good at and when to use them.
(00:31:39)
>> So the agents are actually pretty good.
(00:31:40)
For example, if you're doing boilerplate
(00:31:42)
stuff,
(00:31:42)
>> boilerplate code that's like just,
(00:31:44)
you know, just copy paste stuff. They're
(00:31:46)
very good at that. they're very good at
(00:31:47)
stuff that occurs very often on the
(00:31:49)
internet
(00:31:50)
um because there's lots of examples of
(00:31:52)
it in the training sets of these models.
(00:31:54)
Um so so there's like features of things
(00:31:56)
where the models will do very well.
(00:31:58)
I would say nanochat is not an example of
(00:32:00)
this because it's a fairly unique
(00:32:02)
repository. There's not that much code I
(00:32:04)
think in the way that I've structured it
(00:32:06)
and um and it's not boilerplate code.
(00:32:09)
It's like actually like intellectually
(00:32:10)
intense code almost and everything has
(00:32:12)
to be very precisely arranged and the
(00:32:13)
models are always trying to
(00:32:15)
>> they kept trying to I mean they have so
(00:32:17)
many cognitive deficits right so one
(00:32:18)
example they keep trying to they keep
(00:32:20)
misunderstanding the code um because
(00:32:23)
they have too much memory from all
(00:32:25)
the typical ways of doing things on the
(00:32:26)
internet that I just wasn't adopting.
(00:32:28)
>> Uh so the models for example
(00:32:30)
>> I mean I don't know if I want to get
(00:32:31)
into the full details but they keep they
(00:32:33)
keep um they keep thinking I'm writing
(00:32:35)
normal code and I'm not. Maybe one
(00:32:37)
example maybe
(00:32:38)
>> one example is so the way to synchronize
(00:32:41)
so you have eight GPUs that are all
(00:32:42)
doing forward-backward. The way to
(00:32:44)
synchronize gradients between them is to
(00:32:45)
use the Distributed Data Parallel
(00:32:47)
container of PyTorch which automatically
(00:32:49)
does all of it: as you're doing the
(00:32:50)
backward it will start communicating and
(00:32:51)
synchronizing gradients I didn't use DDP
(00:32:54)
because I didn't want to use it because
(00:32:56)
it's not necessary so I threw it out
(00:32:58)
>> and I basically wrote my own
(00:32:59)
synchronization routine that's inside
(00:33:01)
the step of the optimizer and so the
(00:33:03)
models were trying to get me to use the
(00:33:05)
DDP container
(00:33:06)
>> and they were very concerned about... okay, this
(00:33:09)
gets way too technical but I wasn't
(00:33:10)
using that container because I don't
(00:33:12)
need it and I have a custom
(00:33:12)
implementation of something like it
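The idea of synchronizing gradients inside the optimizer step can be illustrated with a single-process toy. This is not nanochat's code, which the conversation does not show; it only simulates what an all-reduce mean computes across replicas, followed by a plain SGD update. In real PyTorch the averaging would go through a collective such as torch.distributed.all_reduce. All names below are hypothetical.

```python
# Single-process simulation of "synchronize gradients inside the optimizer step".
# Hypothetical names; this is a toy, not nanochat's or PyTorch's implementation.

def all_reduce_mean(per_worker_grads: list[list[float]]) -> list[float]:
    """Average gradients element-wise across workers."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    # Assumption baked in: every worker holds gradients for the same parameters.
    assert all(len(g) == n_params for g in per_worker_grads)
    return [sum(g[i] for g in per_worker_grads) / n_workers for i in range(n_params)]


def optimizer_step(params: list[float], per_worker_grads: list[list[float]], lr: float) -> list[float]:
    """Synchronize first, then apply the same SGD update every replica would apply."""
    avg_grads = all_reduce_mean(per_worker_grads)
    return [p - lr * g for p, g in zip(params, avg_grads)]


if __name__ == "__main__":
    params = [1.0, 2.0]
    grads = [[1.0, 0.0], [3.0, 2.0]]  # gradients from two simulated GPUs
    print(optimizer_step(params, grads, lr=0.5))
```

Because every replica applies the same averaged gradient, the copies of the parameters stay identical without a wrapper around the model.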
(00:33:14)
>> and they just couldn't internalize that
(00:33:15)
you had your own
(00:33:16)
>> yeah they couldn't they couldn't get
(00:33:17)
past that and then um they kept trying
(00:33:20)
to like mess up the style like they're
(00:33:22)
way too over-defensive. They make all
(00:33:24)
these try-catch statements. They keep
(00:33:25)
trying to make a production codebase and
(00:33:27)
I have a bunch of assumptions in my code
(00:33:29)
and it's okay and uh and it's just like
(00:33:32)
I don't need all this extra stuff in
(00:33:34)
there and so I just kind of feel like
(00:33:35)
they're bloating the codebase. They're
(00:33:37)
bloating the complexity. They keep
(00:33:38)
misunderstanding. They're using
(00:33:39)
deprecated APIs a bunch of times. So,
(00:33:42)
it's a total mess. Um, and uh, it's just
(00:33:45)
it's just not that useful. I can go in,
(00:33:47)
I can clean it up, but it's not that
(00:33:48)
useful. I also feel like it's kind of
(00:33:50)
annoying to have to like type out what I
(00:33:52)
want in English because it's just too
(00:33:53)
much typing. Like, if I just navigate to
(00:33:55)
the part of the code that I want and I
(00:33:57)
go where I know the code has to
(00:33:58)
appear and I start typing out the first
(00:34:00)
three letters, autocomplete gets it and
(00:34:01)
just gives you the code. And so I think
(00:34:03)
this is a very high information
(00:34:05)
bandwidth way to specify what you want: if
(00:34:07)
you point to the code where you want it
(00:34:08)
and you type out the first few pieces
(00:34:10)
and the model will complete it.
(00:34:12)
>> So I guess what I mean is um I think
(00:34:15)
these models are good in certain parts
(00:34:17)
of the stack. I actually use the models a
(00:34:19)
little bit in um there are two examples
(00:34:22)
where I actually use the models that I
(00:34:23)
think are illustrative. Uh one was when
(00:34:25)
I generated the report that's actually
(00:34:27)
more boilerplatey. So I actually vibe-coded
(00:34:29)
partially some of that stuff. That
(00:34:30)
was fine um because it's not like
(00:34:32)
mission critical stuff and it works
(00:34:34)
fine.
(00:34:34)
>> And then the other part is when I was
(00:34:35)
rewriting the tokenizer uh in Rust uh
(00:34:38)
I'm actually not as good at Rust because
(00:34:40)
I'm fairly new to Rust. So I was doing
(00:34:42)
there's a bit of vibe coding going on uh
(00:34:44)
in when I was writing some of the Rust
(00:34:46)
code but I had a Python implementation
(00:34:47)
that I fully understand and I'm just
(00:34:49)
making sure I'm making a more efficient
(00:34:50)
version of it and I have tests so I feel
(00:34:52)
safer doing that stuff. Um and so
(00:34:54)
basically they lower or like they
(00:34:56)
increase accessibility to uh languages
(00:34:59)
or paradigms that you might
(00:35:01)
not be as familiar with. Uh so I think
(00:35:03)
they're very helpful there as well.
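The safety net described here, a trusted reference implementation plus tests, can be sketched as a differential test. Both versions below are in Python purely for illustration (in the conversation the fast version was in Rust), and counting adjacent token pairs is just a stand-in for one step of a BPE-style tokenizer. All function names are hypothetical.

```python
# Differential testing sketch: check a rewritten function against a trusted reference.
# Hypothetical names; pair counting stands in for a piece of a tokenizer.
import random
from collections import Counter


def pair_counts_reference(ids: list[int]) -> dict:
    """Simple, obviously-correct version: count each adjacent pair one by one."""
    counts = {}
    for i in range(len(ids) - 1):
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def pair_counts_fast(ids: list[int]) -> dict:
    """Rewritten version whose behavior we want to verify."""
    return dict(Counter(zip(ids, ids[1:])))


def check_equivalent(n_cases: int = 200, seed: int = 0) -> bool:
    """Compare both versions on random inputs, including empty and tiny ones."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        ids = [rng.randrange(5) for _ in range(rng.randrange(0, 30))]
        assert pair_counts_fast(ids) == pair_counts_reference(ids), ids
    return True


if __name__ == "__main__":
    print(check_equivalent())
```

With a check like this in place, code in an unfamiliar language only has to match the version that is fully understood.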
(00:35:04)
>> Yeah.
(00:35:05)
>> Uh because there's a ton of Rust code
(00:35:06)
out there. The models are actually
(00:35:07)
pretty good at it. I happen to not know
(00:35:09)
that much about it. So the models are
(00:35:10)
very useful there.
(00:35:11)
>> Um the reason I think this question is
(00:35:12)
so interesting is because the main story
(00:35:16)
people have about AI exploding and
(00:35:19)
getting to super intelligence pretty
(00:35:20)
rapidly is AI automating AI
(00:35:23)
engineering and AI research.
(00:35:25)
So they'll look at the fact that you can
(00:35:27)
have Claude Code make entire applications
(00:35:29)
from scratch and be like if you had this
(00:35:30)
capability inside of OpenAI and DeepMind
(00:35:33)
and everything, well, just imagine
(00:35:35)
the level of like just you know a
(00:35:37)
thousand of you or a million of you in
(00:35:38)
parallel finding little architectural
(00:35:40)
tweaks and so it's quite interesting to
(00:35:42)
hear you say that this is the thing
(00:35:44)
they're sort of asymmetrically worse at
(00:35:46)
and it's like quite relevant to
(00:35:47)
forecasting whether the AI 2027 type
(00:35:50)
explosion
(00:35:51)
>> is likely to happen anytime soon. I
(00:35:53)
think that's a good way of putting it.
(00:35:55)
And I think you're getting at some of my
(00:35:56)
like why my timelines are a bit longer.
(00:35:58)
You're right. Um I think um yeah,
(00:36:01)
they're not very good at code that
(00:36:02)
has never been written before, maybe
(00:36:04)
is like one way to put it, which is like
(00:36:05)
what we're trying to achieve when we're
(00:36:06)
building these models.
(00:36:08)
>> Very naive question, but um the
(00:36:10)
architectural tweaks that you're adding
(00:36:12)
to uh nanochat, they're in a paper
(00:36:16)
somewhere, right? They might even be in
(00:36:17)
a repo somewhere. So it's
(00:36:20)
um is it surprising that they aren't
(00:36:22)
able to integrate that into whenever
(00:36:24)
you're like, add RoPE embeddings or
(00:36:26)
something they do that in the wrong way.
(00:36:30)
>> It's it's tough. I think they kind of
(00:36:31)
know they kind of know but they don't
(00:36:32)
fully know and they don't know how to
(00:36:34)
fully integrate it into the repo and
(00:36:35)
your style and your code and your place
(00:36:36)
and some of the custom things that
(00:36:38)
you're doing and
(00:36:39)
>> and uh how it fits with all the
(00:36:40)
assumptions of the repository and all
(00:36:42)
this kind of stuff. So I think they do
(00:36:43)
have some knowledge but um they haven't
(00:36:46)
gotten to the place where they can
(00:36:47)
actually integrate it, make sense of it
(00:36:50)
uh and so on. I do think that a lot of
(00:36:51)
the stuff by the way continues to
(00:36:52)
improve. So um I think currently
(00:36:54)
probably state-of-the-art model that I
(00:36:56)
go to is GPT-5 Pro.
(00:36:57)
>> Um and uh that's a very very powerful
(00:37:00)
model. So if I actually have 20 minutes
(00:37:01)
I will copy paste my entire repo and I
(00:37:03)
go to GPT-5 Pro, the oracle, for like some
(00:37:05)
questions and often it's not too bad and
(00:37:08)
surprisingly good compared to what
(00:37:09)
existed a year ago.
(00:37:10)
>> Yeah. Um, but I do think that uh overall
(00:37:12)
the models are are um they're not there.
(00:37:15)
And I kind of feel like the industry
(00:37:16)
it's it's um it's over it's it's making
(00:37:21)
too big of a jump and it's trying to
(00:37:23)
pretend like this is amazing and it's
(00:37:25)
not. It's slop and I think they're not
(00:37:27)
coming to terms with it and maybe
(00:37:28)
they're trying to fundraise or
(00:37:29)
something like that. I'm not sure what's
(00:37:30)
going on but it's we're at this
(00:37:32)
intermediate stage. The models are
(00:37:34)
amazing. They still need a lot of work
(00:37:36)
for now. Autocomplete is my sweet spot
(00:37:38)
>> but sometimes for some types of code I
(00:37:40)
will go to an LLM agent.
(00:37:41)
>> Yeah. Yeah.
(00:37:42)
>> Actually this this is also here's
(00:37:43)
another reason why this is really
(00:37:44)
interesting. Um through the history of
(00:37:47)
programming there's been many
(00:37:50)
productivity improvements compilers
(00:37:53)
linting better programming languages etc
(00:37:56)
which have increased programmer
(00:37:57)
productivity but have not led to an
(00:37:59)
explosion. So that's like one that
(00:38:01)
sounds very much like autocomplete tab
(00:38:04)
and this other category is just like
(00:38:06)
automation of the programmer
(00:38:07)
>> and it's interesting you're seeing more
(00:38:09)
in the category of the historical
(00:38:11)
analogies of like you know better
(00:38:13)
compilers or something
(00:38:14)
>> maybe because this one other kind of
(00:38:16)
thought that is like
(00:38:17)
>> I do feel like I have a hard time
(00:38:18)
differentiating where AI begins and
(00:38:20)
stops because I do see AI as
(00:38:22)
fundamentally an extension of computing
(00:38:23)
in some in some pretty fundamental way
(00:38:25)
and I I feel like I see a continuum of
(00:38:28)
this kind of like recursive
(00:38:29)
self-improvement or like of speeding up
(00:38:31)
uh programmers all the way from the
(00:38:32)
beginning like even like I would say
(00:38:34)
like uh code editors
(00:38:36)
>> um uh syntax highlighting
(00:38:39)
>> uh syntax uh or like checking even of
(00:38:41)
the of the types like data type checking
(00:38:44)
>> um
(00:38:45)
>> all these kinds of tools that we've
(00:38:46)
built for each other, even
(00:38:47)
search engines like why aren't search
(00:38:49)
engines part of AI like
(00:38:50)
>> I don't know like ranking is kind of AI
(00:38:53)
right at some point Google was like even
(00:38:54)
early on they were thinking of
(00:38:55)
themselves as an AI company doing Google
(00:38:57)
search engine which I think is totally
(00:38:58)
fair
(00:38:59)
>> and So, I kind of see it as a lot more
(00:39:00)
of a continuum than I think other people
(00:39:02)
do and I don't it's hard for me to draw
(00:39:03)
the line and I kind of feel like okay,
(00:39:05)
we're now getting a much better
(00:39:06)
autocomplete and now we're also getting
(00:39:08)
some agents which are kind of like these
(00:39:09)
loopy things but they kind of go off
(00:39:10)
rails sometimes. Um, and what's going on
(00:39:14)
is that the human is progressively doing
(00:39:16)
a bit less and less of the low-level
(00:39:17)
stuff. For example, we're not writing
(00:39:19)
the assembly code because we have
(00:39:20)
compilers,
(00:39:20)
>> right? Like compilers will take my high
(00:39:22)
level language in C and write the
(00:39:23)
assembly code. So we're abstracting
(00:39:25)
ourselves very very slowly and there's
(00:39:27)
this what I call the autonomy slider of like
(00:39:29)
more and more stuff is automated of the
(00:39:30)
stuff that can be automated at any point
(00:39:32)
in time and we're doing a bit less and
(00:39:33)
less and uh raising ourselves in the
(00:39:36)
layer of abstraction over the automation.
(00:39:38)
One of the big problems with RL is that
(00:39:40)
it's incredibly information sparse.
(00:39:42)
Labelbox can help you with this by
(00:39:44)
increasing the amount of information
(00:39:46)
that your agent gets to learn from with
(00:39:48)
every single episode. For example, one
(00:39:50)
of their customers wanted to train a
(00:39:52)
coding agent. So, Labelbox augmented an
(00:39:54)
IDE with a bunch of extra data
(00:39:57)
collection tools and staffed a team of
(00:39:58)
expert software engineers from their
(00:40:00)
Alignerr network to generate trajectories
(00:40:03)
that were optimized for training. Now,
(00:40:05)
obviously, these engineers evaluated
(00:40:07)
these interactions on a pass/fail
(00:40:08)
basis, but they also rated every single
(00:40:11)
response on a bunch of different
(00:40:12)
dimensions like readability and
(00:40:14)
performance. And they wrote down their
(00:40:16)
thought processes for every single
(00:40:18)
rating that they gave. So you're
(00:40:20)
basically showing every single step an
(00:40:22)
engineer takes and every single thought
(00:40:24)
that they have while they're doing their
(00:40:26)
job. And this is just something you
(00:40:28)
could never get from usage data alone.
(00:40:31)
And so Labelbox packaged up all these
(00:40:32)
evaluations and included all the agent
(00:40:35)
trajectories and the corrective human
(00:40:37)
edits for the customer to train on. This
(00:40:40)
is just one example. So go check out how
(00:40:42)
Labelbox can get you high-quality
(00:40:43)
frontier data across domains,
(00:40:46)
modalities, and training paradigms.
(00:40:48)
Reach out at labelbox.com.
(00:40:54)
Let's talk about RL a bit. Uh, you tweeted
(00:40:56)
some very interesting things about
(00:40:58)
this. Um conceptually, how should we
(00:41:01)
think about the way that humans are able
(00:41:03)
to build a rich world model just from
(00:41:06)
interacting with our environment and in
(00:41:08)
ways that seem almost irrespective of
(00:41:11)
the final reward at the end of the
(00:41:12)
episode.
(00:41:13)
>> Mhm. If somebody has, you know,
(00:41:15)
somebody's starting to start a business
(00:41:16)
and at the end of 10 years, she finds
(00:41:18)
out whether the business succeeded or
(00:41:19)
failed,
(00:41:20)
>> we say that she's earned a bunch of
(00:41:21)
wisdom and experience,
(00:41:22)
>> but it's not because like the log probs
(00:41:24)
of every single thing that happened over
(00:41:26)
the last 10 years are upweighted or
(00:41:27)
downweighted. It's something much more
(00:41:28)
deliberate and uh rich is happening.
(00:41:31)
What is the ML analogy and how does that
(00:41:33)
compare to what we're doing with
(00:41:34)
LLMs right now?
(00:41:35)
>> Yeah, maybe the way I would put it is
(00:41:36)
humans don't use reinforcement learning
(00:41:38)
is maybe how I've said it. I
(00:41:40)
think they do something different which
(00:41:41)
is yeah you experience so reinforcement
(00:41:43)
learning is a lot worse than I think the
(00:41:45)
average person thinks
(00:41:48)
reinforcement learning is terrible.
(00:41:50)
It just so happens that uh everything
(00:41:52)
that we had before is much worse
(00:41:56)
because previously we were just
(00:41:57)
imitating people so it has all these
(00:41:58)
issues. Um so in reinforcement learning
(00:42:01)
say you're working with uh you're
(00:42:02)
solving a math problem because it's very
(00:42:04)
simple. You're given a math problem and
(00:42:06)
you're trying to find the solution. Um
(00:42:08)
now in reinforcement learning you will
(00:42:11)
try lots of things in parallel first. So
(00:42:14)
uh you're given a problem you try
(00:42:15)
hundreds of different attempts and these
(00:42:17)
attempts can be complex right they can
(00:42:19)
be like oh let me try this let me try
(00:42:20)
that this didn't work that didn't work
(00:42:22)
etc. And then maybe you get an answer
(00:42:24)
and now you check the back of the book
(00:42:25)
and you see okay the correct answer is
(00:42:27)
this and then you can see that okay this
(00:42:30)
one this one and that one got the
(00:42:31)
correct answer but these other 97 of
(00:42:33)
them didn't. So literally what
(00:42:34)
reinforcement learning does is it goes
(00:42:36)
to the ones that worked really well and
(00:42:38)
every single thing you did along the way
(00:42:39)
every single token gets upweighted of
(00:42:41)
like do more of this. The problem with
(00:42:43)
that is, I mean, people will say that
(00:42:45)
your estimator has high variance, but
(00:42:47)
what I mean is it's just noisy. It's noisy.
(00:42:50)
So basically it kind of almost assumes
(00:42:52)
that every single little piece of the
(00:42:53)
solution that you made that arrived at the
(00:42:55)
right answer was the correct thing to do,
(00:42:56)
which is not true. Like you may have
(00:42:57)
gone down the wrong alleys until you
(00:43:00)
arrived at the right solution. Every single
(00:43:01)
one of those incorrect things you did,
(00:43:03)
as long as you got to the correct
(00:43:04)
solution, will be upweighted as do more of
(00:43:05)
this. It's terrible.
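[Editor's sketch] The outcome-based credit assignment described above can be illustrated roughly as follows. This is a minimal sketch, not any lab's actual recipe: the rollout layout, the `token_weights` name, and the use of a batch-mean baseline are all assumptions made for illustration. The point is only that one end-of-episode number is broadcast to every step, so a dead end inside a correct rollout is upweighted like everything else.

```python
from statistics import mean


def token_weights(rollouts):
    """Give every step in a rollout the same weight: reward minus the batch mean.

    `rollouts` is a list of (steps, is_correct) pairs. The layout and the
    mean baseline are illustrative assumptions, not a specific algorithm
    named in the conversation.
    """
    rewards = [1.0 if ok else 0.0 for _, ok in rollouts]
    baseline = mean(rewards)
    weighted = []
    for (steps, _), reward in zip(rollouts, rewards):
        advantage = reward - baseline
        # The single end-of-episode number is broadcast to every step.
        weighted.append([(step, advantage) for step in steps])
    return weighted


# Hypothetical rollouts for one math problem whose correct answer is 4.
rollouts = [
    (["try substitution", "dead end", "backtrack", "factor", "answer: 4"], True),
    (["guess", "answer: 7"], False),
    (["factor", "answer: 4"], True),
    (["expand", "sign error", "answer: -4"], False),
]

for steps in token_weights(rollouts):
    print(steps)
```

In the first rollout the "dead end" step receives the same positive weight as the step that actually produced the answer, which is the noise being complained about.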
(00:43:07)
>> Yeah, it's noise. You've done all this
(00:43:09)
work, only to find that at the end
(00:43:11)
you get a single number of like, oh, you
(00:43:13)
did correct. And based on that, you
(00:43:15)
weigh that entire trajectory as like
(00:43:17)
upweight or downweight. And so
(00:43:19)
the way I like to put it is you're
(00:43:20)
sucking supervision through a straw. Uh
(00:43:22)
because you've done all this work that
(00:43:24)
could be a minute of rollout and you're
(00:43:25)
you're like sucking the bits of
(00:43:27)
supervision of the final reward signal
(00:43:28)
through a straw and you're like putting
(00:43:30)
it You're like
(00:43:32)
basically like um yeah, you're
(00:43:34)
broadcasting that across the entire
(00:43:35)
trajectory and using that to upweight or
(00:43:37)
downweight that trajectory. It's crazy. A
(00:43:39)
human would never do this. Number one, a
(00:43:41)
human would never do hundreds of
(00:43:42)
rollouts. Uh number two, when a person
(00:43:44)
sort of finds a solution, they will have
(00:43:47)
a pretty complicated process of review
(00:43:48)
of like, okay, I think these parts that
(00:43:50)
I did well, these parts I did not do
(00:43:51)
that well, I should probably do this or
(00:43:54)
that. And they think through things.
(00:43:55)
There's nothing in current LLMs that does
(00:43:57)
this. There's no equivalent of it. Um
(00:43:59)
but I do see papers popping out that are
(00:44:01)
trying to do this because it's obvious
(00:44:03)
to everyone in the field.
(00:44:04)
>> Yeah.
(00:44:04)
>> So I kind of see as like the first
(00:44:06)
imitation learning actually by the way
(00:44:07)
was extremely surprising and miraculous
(00:44:09)
and amazing that we can uh fine-tune by
(00:44:11)
imitation on humans. Um and that was
(00:44:13)
incredible because in the beginning all
(00:44:14)
we had was base models. Base models are
(00:44:16)
autocomplete. uh and it wasn't obvious
(00:44:18)
to me at the time uh and I had to learn
(00:44:20)
this and the paper that like blew my
(00:44:23)
mind was InstructGPT because it pointed
(00:44:25)
out that hey you can take the
(00:44:26)
pre-trained model which is autocomplete
(00:44:28)
and if you just fine-tune it on text
(00:44:30)
that looks like conversations the model
(00:44:32)
will very rapidly adapt to become very
(00:44:34)
conversational and it keeps all the
(00:44:35)
knowledge from pre-training and this
(00:44:37)
blew my mind because I didn't understand
(00:44:39)
that it's just like stylistically can
(00:44:40)
adjust so quickly and become an
(00:44:42)
assistant to a user through through just
(00:44:44)
a few loops of fine-tuning on that kind
(00:44:46)
of data It was very miraculous to me
(00:44:48)
that that that worked. So incredible.
(00:44:50)
And that was like two years, three years
(00:44:52)
of work. And now came RL. And RL allows
(00:44:55)
you to do a bit better than just
(00:44:56)
imitation learning, right? Because you
(00:44:58)
you can have these, um, reward
(00:45:00)
functions and you can hill climb on the
(00:45:01)
reward functions. And so some problems
(00:45:03)
have just correct answers. You can hill
(00:45:05)
climb on that without getting expert
(00:45:06)
trajectories to imitate. So that's
(00:45:08)
amazing. And the model can also discover
(00:45:10)
solutions that the human might never
(00:45:11)
come up with.
(00:45:12)
>> Uh so this is incredible. And yet it's
(00:45:14)
still stupid. Um, so I think we need we
(00:45:18)
need more and so I saw a paper from
(00:45:20)
Google yesterday that tried to have this
(00:45:21)
reflect and review, um, idea in
(00:45:24)
mind. Uh what was the memory bank paper
(00:45:27)
or something? I don't know. I've
(00:45:29)
actually seen a few papers along these
(00:45:30)
lines. So I expect there to be some kind
(00:45:32)
of a major update to how we do
(00:45:34)
algorithms for LLMs coming in that realm
(00:45:37)
and then I think we need three or four
(00:45:38)
or five more
(00:45:41)
um something like that. But you you're
(00:45:43)
so good at coming up with the evocative
(00:45:45)
phrases: sucking supervision
(00:45:48)
through a straw. It's like so good. Um
(00:45:51)
why hasn't So you're saying like your
(00:45:53)
problem with outcome based reward is
(00:45:55)
that you have this huge trajectory and
(00:45:57)
then at the end you're you're trying to
(00:46:00)
learn every single possible thing about
(00:46:02)
what you should do and what you should
(00:46:03)
learn about the world from that one
(00:46:04)
final bit. um why hasn't given the fact
(00:46:08)
that this is obvious, why hasn't process
(00:46:09)
based supervision
(00:46:10)
>> as an alternative been a successful way
(00:46:12)
to make models more capable? What what
(00:46:15)
has been preventing us from using this
(00:46:16)
alternative paradigm?
(00:46:17)
>> So process based supervision just refers
(00:46:18)
to the fact that we're not going to have
(00:46:19)
a reward function only at the very end
(00:46:21)
of after you've made 10 minutes of work.
(00:46:22)
I'm not going to tell you you did well
(00:46:24)
or not well.
(00:46:24)
>> I'm going to tell you at every single
(00:46:25)
step of the way how well you're doing.
(00:46:27)
>> Um and this is basically the reason we
(00:46:29)
don't have that is it's
(00:46:30)
tricky how you do that properly.
(00:46:32)
>> Um because you have partial solutions
(00:46:33)
and you don't know how to assign credit.
(00:46:35)
So when you get the right answer, it's
(00:46:37)
just uh an equality match to the answer.
(00:46:39)
Very simple to implement.
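[Editor's sketch] To make the contrast concrete: an outcome reward really can be a one-line equality check, while a process reward needs some scorer for partial solutions. The sketch below is illustrative only. The function names are made up, and `score_step` is a hypothetical placeholder for whatever assigns partial credit (a human rater, a heuristic, or a judge model).

```python
from typing import Callable, List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Return 1.0 if the final answer equals the reference after trimming, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_rewards(steps: List[str], score_step: Callable[[List[str]], float]) -> List[float]:
    """Score every prefix of the solution with a caller-supplied step scorer.

    `score_step` is hypothetical. Deciding what it should be is exactly
    the hard part discussed in the conversation.
    """
    return [score_step(steps[: i + 1]) for i in range(len(steps))]


steps = ["2 + 3 = 5", "5 * 2 = 10", "answer: 10"]

print(outcome_reward("10 ", "10"))

# Placeholder scorer: only checks that the latest step states an equation or an answer.
print(process_rewards(
    steps,
    lambda prefix: 1.0 if ("=" in prefix[-1] or "answer" in prefix[-1]) else 0.0,
))
```

The first function needs nothing but the reference answer. The second only moves the difficulty into `score_step`.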
(00:46:41)
>> If you're doing basically process
(00:46:42)
supervision, how do you assign, in an
(00:46:44)
automatable way, partial credit
(00:46:46)
assignment? It's not obvious how you do
(00:46:48)
it. Lots of labs, I think, are trying to
(00:46:49)
do it with these LLM judges. So
(00:46:51)
basically, you get LLMs to try to do it.
(00:46:52)
So you prompt an LLM, hey, look at a
(00:46:54)
partial solution of a student. How well
(00:46:56)
do you think they're doing if the answer
(00:46:57)
is this? And they try to tune the
(00:46:58)
prompt. Um, the reason that I think this
(00:47:01)
is kind of tricky is quite subtle. And
(00:47:03)
it's the fact that anytime you use an
(00:47:04)
LLM to assign a reward, those LLMs are
(00:47:07)
giant things with billions of parameters
(00:47:09)
and they're gameable.
(00:47:10)
>> And if you're reinforcement learning
(00:47:11)
with respect to them, you will find
(00:47:12)
adversarial examples for your LLM judges
(00:47:15)
>> almost guaranteed.
(00:47:16)
>> You can't do this for too long. You do
(00:47:17)
maybe 10 steps or 20 steps, maybe it
(00:47:19)
will work. But you can't do a hundred or
(00:47:21)
a thousand, because, um,
(00:47:22)
I know, I understand it's not
(00:47:24)
obvious, but basically the model will
(00:47:26)
find little cracks.
(00:47:29)
It will find all these like spurious
(00:47:31)
things in the nooks and crannies of the
(00:47:32)
giant model and find a way to cheat it.
(00:47:35)
So one example uh that's prominently in
(00:47:37)
my mind is I think this I think this was
(00:47:39)
probably public but basically if you're
(00:47:42)
using an LLM judge for a reward, so you
(00:47:44)
just give it a solution from a student
(00:47:45)
and ask it if the student did well or not.
(00:47:47)
We were training with reinforcement
(00:47:49)
learning against that reward function
(00:47:51)
>> and it worked really well and then um
(00:47:53)
suddenly the reward became extremely
(00:47:55)
large, like it was a massive jump, and it
(00:47:57)
did perfect and you're looking at it
(00:47:58)
like
(00:47:59)
>> wow this this means the student is
(00:48:01)
perfect in all these problems it's fully
(00:48:02)
solved math
(00:48:04)
>> but actually what's happening is that
(00:48:05)
when you look at the completions that
(00:48:06)
you're getting from the model they are
(00:48:08)
complete nonsense they start out okay
(00:48:09)
and then they change to duh duh duh duh
(00:48:11)
so it's just like oh okay let's take two
(00:48:13)
plus three and we do this and this and
(00:48:14)
then duh duh duh duh duh
(00:48:16)
>> and you're looking at it like, this is
(00:48:17)
crazy. How is it getting a reward of one
(00:48:19)
or 100%? Um, and you look at the LLM
(00:48:21)
judge and it turns out that "duh duh duh duh" is an
(00:48:23)
adversarial example for the model and it
(00:48:25)
assigns 100% probability to it. And it's
(00:48:27)
just because this is an out of sample
(00:48:29)
example to the LLM. It's never seen it
(00:48:31)
during training and you're in pure
(00:48:33)
generalization land,
(00:48:34)
>> right?
(00:48:34)
>> It's never seen it during training. And
(00:48:36)
in the pure generalization land, you can
(00:48:37)
find these examples that that uh break
(00:48:40)
it.
(00:48:40)
>> You're basically training the LLM to be
(00:48:43)
a prompt injection model. Not even that,
(00:48:45)
prompt injection is way too fancy.
(00:48:47)
You're you're finding adversarial
(00:48:48)
examples as they're called. These are
(00:48:49)
nonsensical uh solutions um that are
(00:48:53)
obviously wrong, but the model thinks
(00:48:54)
are amazing.
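[Editor's sketch] A toy version of this failure mode, under heavy assumptions. The `toy_judge` below is a hand-written scoring function with a deliberately planted blind spot. A real LLM judge is a learned model whose blind spots are not known in advance, and the greedy search here only stands in for reinforcement learning. The sketch shows just the mechanism: optimizing against a gameable scorer drifts toward nonsense that the scorer rates as perfect.

```python
def toy_judge(solution: str) -> int:
    """Hypothetical judge: scores surface features in percent, not correctness."""
    tokens = solution.split()
    score = 0
    if "=" in tokens:
        score += 40  # looks like it contains a derivation
    if "5" in tokens:
        score += 20  # mentions the reference answer to 2 + 3
    # Planted blind spot: an out-of-distribution token raises the score every time.
    score += 10 * tokens.count("duh")
    return min(score, 100)


VOCAB = ["2", "+", "3", "=", "5", "so", "duh"]


def hill_climb(start: str, steps: int) -> str:
    """Greedily append whichever vocabulary token the judge scores highest."""
    text = start
    for _ in range(steps):
        text = max((f"{text} {tok}" for tok in VOCAB), key=toy_judge)
    return text


honest = "2 + 3 = 5"
gamed = hill_climb("2 + 3", steps=6)
print(toy_judge(honest), honest)
print(toy_judge(gamed), gamed)
```

The optimized string starts out like a real solution and then degenerates, yet it outscores the honest answer. Patching this one exploit into the judge would not remove other blind spots of the same kind.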
(00:48:55)
>> So to the extent you think this is the
(00:48:57)
bottleneck to making RL more functional,
(00:49:00)
then that will require making LLMs
(00:49:02)
better judges if you want to do this in
(00:49:03)
an automated way. And then so is it just
(00:49:06)
going to be like some sort of GAN-like
(00:49:07)
approach where you have to train models
(00:49:08)
to be more robust? Yeah. To
(00:49:10)
>> I think the labs are probably doing all
(00:49:11)
that like okay so the obvious thing is
(00:49:13)
like "duh duh duh duh" should not get 100% reward.
(00:49:15)
Okay, well, take the "duh duh duh duh", put it in the training
(00:49:17)
set of the LLM judge and say this is not
(00:49:18)
100% this is 0%. You can do this
(00:49:20)
>> but every time you do this you get a new
(00:49:23)
LLM and it still has adversarial
(00:49:24)
examples. There are infinitely many adversarial
(00:49:26)
examples. And I think probably if you
(00:49:28)
iterate this a few times, it'll probably
(00:49:30)
be harder and harder to find adversarial
(00:49:31)
examples, but I'm not 100% sure because
(00:49:32)
this thing has a trillion parameters or
(00:49:34)
whatnot. Um, so I bet you the the labs
(00:49:38)
are trying. Uh, I
(00:49:40)
still think we need other
(00:49:43)
ideas.
(00:49:45)
>> Interesting. Do do you have some shape
(00:49:46)
of what the other idea
(00:49:49)
>> could be? So like this this idea of like
(00:49:52)
reviewing solutions, coming up with
(00:49:55)
synthetic examples such that when you
(00:49:56)
train on them you get better
(00:49:58)
and like meta-learn it in some way, and
(00:50:00)
I think there's some papers that I'm
(00:50:01)
starting to see pop out. I only am at a
(00:50:03)
stage of like reading abstracts because
(00:50:04)
a lot of these papers, you know, they're
(00:50:06)
just ideas. Someone has to actually like
(00:50:08)
make it work on a frontier LLM lab scale
(00:50:11)
uh in full generality because when you
(00:50:13)
see these papers, they pop up and it's
(00:50:14)
just like a little bit noisy, you
(00:50:16)
know, it's cool ideas, but I haven't
(00:50:17)
actually seen anyone convincingly uh
(00:50:20)
show that this is possible. That said,
(00:50:22)
the LLM labs are fairly closed. Uh so,
(00:50:24)
who knows what they're doing now, but
(00:50:26)
>> yeah. So I guess I can I I see a very um
(00:50:29)
not easy but like I I can conceptualize
(00:50:32)
how you would be able to train on
(00:50:34)
synthetic examples or synthetic problems
(00:50:36)
that you have made for yourself. But
(00:50:37)
there seems to be another thing humans
(00:50:38)
do. Maybe sleep is this, maybe
(00:50:40)
daydreaming is this
(00:50:42)
>> which is not necessarily come up with
(00:50:44)
fake problems but just like reflect.
(00:50:46)
>> Yeah.
(00:50:47)
>> And I'm not sure what the ML analogy
(00:50:49)
is for, you know, daydreaming or sleeping
(00:50:50)
but just like just reflecting. I haven't
(00:50:52)
come up with any problem. Yeah, I mean
(00:50:53)
obviously the very basic analogy would
(00:50:54)
just be like fine-tuning on reflection
(00:50:57)
bits, but I feel like in practice that
(00:50:59)
probably wouldn't work that well. So I
(00:51:00)
don't know if you have some take on what
(00:51:03)
the analogy of like this thing is.
(00:51:05)
>> Yeah, I do think that that we're missing
(00:51:06)
some aspects there. So as an example,
(00:51:08)
>> uh when you're reading a book, um
(00:51:11)
>> I almost feel like currently when LLMs
(00:51:12)
are reading a book, what that means is
(00:51:14)
we stretch out the sequence of text and
(00:51:16)
the model is predicting the next token
(00:51:18)
and it's getting some knowledge from
(00:51:19)
that. That's not really what humans do,
(00:51:20)
right? So when you're reading a book, I
(00:51:22)
almost don't even feel like the book is
(00:51:23)
like exposition I'm supposed to be
(00:51:25)
attending to and training on. The book
(00:51:26)
is a is a set of prompts for me to do
(00:51:29)
synthetic data generation
(00:51:30)
>> or for you to get to a book club and
(00:51:32)
talk about it with your friends. And
(00:51:33)
it's by manipulating that information
(00:51:35)
that you actually gain that knowledge.
(00:51:37)
And I I think we have no equivalent of
(00:51:39)
that again with LLMs. They don't really
(00:51:41)
do that. But I'd love to see during
(00:51:42)
pre-training some kind of a stage that
(00:51:44)
uh thinks through the material and tries
(00:51:46)
to reconcile it with what it already
(00:51:47)
knows and thinks through for like some
(00:51:49)
amount of time. and um gets that to
(00:51:51)
work. And so there's no equivalent of
(00:51:53)
any of this. This is all research.
(00:51:54)
There are some subtle, very subtle, I
(00:51:56)
think very hard to understand
(00:51:58)
reasons why it's not trivial. So if I
(00:52:00)
can just describe one,
(00:52:02)
>> why can't we just synthetically generate
(00:52:03)
and train on it?
(00:52:04)
>> Well, because every synthetic example
(00:52:06)
like if I just give synthetic generation
(00:52:07)
of the model thinking about a book, you
(00:52:09)
look at it and you're like, "This looks
(00:52:10)
great. Why can't I train on it?" Well,
(00:52:12)
you could try, but the model will
(00:52:13)
actually get much worse if you continue
(00:52:14)
trying. And that's because all of the
(00:52:17)
samples you get from models are silently
(00:52:19)
collapsed. They're silently, this is not
(00:52:21)
obvious if you look at any individual
(00:52:22)
example of it. They occupy a very tiny
(00:52:24)
manifold of the possible space of um
(00:52:27)
sort of thoughts about content. So the
(00:52:29)
LLMs when they come off, they're what we
(00:52:31)
call collapsed. They have a collapsed
(00:52:32)
data distribution. If you sample, one
(00:52:35)
easy way to see it is go to ChatGPT
(00:52:37)
and ask it, tell me a joke. It only has
(00:52:39)
like three jokes.
(00:52:40)
>> It's not giving you the whole breadth of
(00:52:42)
possible jokes.
(00:52:42)
>> It's giving you like it knows like three
(00:52:44)
jokes. Yeah,
(00:52:45)
>> they're silently collapsed. So
(00:52:46)
basically, you're not getting the
(00:52:48)
richness and the diversity and the
(00:52:49)
entropy uh from these models as you
(00:52:51)
would get from humans. So humans are a
(00:52:53)
lot noisier, but at least
(00:52:55)
they're not biased. They're not um in in
(00:52:57)
a statistical sense, they're not
(00:52:58)
silently collapsed. They maintain a huge
(00:53:00)
amount of entropy. So how do you get
(00:53:02)
synthetic data generation to work
(00:53:04)
despite the collapse and while
(00:53:05)
maintaining the entropy is a research
(00:53:07)
problem. Um, just to make sure I
(00:53:09)
understood, the reason that the collapse
(00:53:11)
is relevant to synthetic data generation
(00:53:12)
is because you want to be able to come
(00:53:13)
up with synthetic problems or
(00:53:16)
reflections which are not already in
(00:53:18)
your data distribution.
(00:53:20)
>> I guess what I'm saying is um, say we
(00:53:23)
have a chapter of a book and I ask an LLM
(00:53:25)
to think about it.
(00:53:26)
>> Um, it will give you something that
(00:53:27)
looks very reasonable. But if I ask it
(00:53:29)
10 times, you'll notice that all of them
(00:53:31)
are the same. You can't just keep
(00:53:33)
scaling quote unquote reflection
(00:53:36)
on the same amount of uh you know prompt
(00:53:40)
information and then get returns from
(00:53:41)
that. Okay.
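[Editor's sketch] One rough way to see "collapse" in a batch of samples is to measure the entropy of the empirical distribution over distinct outputs. The sample lists below are made up for illustration, and exact-string matching is a crude assumption, since real outputs would need some notion of semantic similarity.

```python
import math
from collections import Counter


def empirical_entropy_bits(samples):
    """Shannon entropy, in bits, of the empirical distribution over distinct samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical outputs: ten requests for a joke.
collapsed = ["joke A"] * 8 + ["joke B"] + ["joke C"]   # a few answers dominate
diverse = [f"joke {i}" for i in range(10)]             # every answer is different

print(round(empirical_entropy_bits(collapsed), 3))
print(round(empirical_entropy_bits(diverse), 3))
```

Each sample in the collapsed list may look fine on its own. Only the distribution shows the problem, which is why it is easy to miss.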
(00:53:41)
>> Yeah. Yeah. So any individual sample
(00:53:43)
will look okay but the distribution of
(00:53:45)
it is is quite terrible and it's quite
(00:53:47)
terrible in such a way that if you
(00:53:48)
continue training on too much of your
(00:53:49)
own stuff you actually collapse. I
(00:53:51)
actually think that um there's no like
(00:53:53)
fundamental solutions to this possibly
(00:53:54)
and I also think humans collapse over
(00:53:56)
time. Uh I think this is uh again these
(00:53:58)
analogies are surprisingly good but
(00:54:00)
humans collapse during the course of
(00:54:01)
their lives. This is why children have
(00:54:04)
completely, you know, they haven't
(00:54:05)
overfit yet and they will say stuff that
(00:54:07)
will shock you because it's kind of you
(00:54:09)
can see where they're coming from but
(00:54:10)
it's just not the thing people say
(00:54:12)
>> and because they're not yet collapsed
(00:54:14)
but we're collapsed. We end up
(00:54:16)
revisiting the same thoughts. we end up,
(00:54:18)
you know, saying more and more of the
(00:54:20)
same stuff and the learning rates go
(00:54:21)
down and uh the collapse continues to
(00:54:23)
get worse and then um everything
(00:54:26)
deteriorates.
(00:54:27)
>> Have Have you seen a super interesting
(00:54:28)
paper that dreaming is a way of
(00:54:31)
preventing this kind of overfitting and
(00:54:33)
collapse that the reason dreaming is uh
(00:54:37)
evolutionarily adaptive is to
(00:54:39)
>> put you in weird situations that are
(00:54:41)
like very unlike your day-to-day reality
(00:54:43)
so as to prevent this kind of
(00:54:44)
>> overfitting. It's an interesting idea. I
(00:54:45)
mean, I do think that when you're
(00:54:47)
generating things in your head and then
(00:54:49)
you're attending to it, you're kind of
(00:54:50)
like training on your own samples.
(00:54:51)
You're training on your synthetic data
(00:54:53)
and if you do it for too long, you go
(00:54:54)
off the rails, um, and you collapse way too
(00:54:56)
much. So, you always have to like seek
(00:54:58)
um entropy in your life.
(00:55:00)
>> Yeah.
(00:55:01)
>> Uh so talking to other people is a great
(00:55:02)
source of entropy
(00:55:04)
>> and uh things like that. So maybe the
(00:55:06)
brain has also built some internal
(00:55:07)
mechanisms uh for increasing the amount
(00:55:09)
of entropy um in in that process. But
(00:55:13)
yeah, maybe that's an interesting idea.
(00:55:14)
This is a very ill-formed thought. So I
(00:55:16)
I'll just put it out and let you react
(00:55:18)
to it. The best learners that we are
(00:55:20)
aware of, which are children, are
(00:55:22)
extremely
(00:55:24)
bad at recollecting information. In
(00:55:26)
fact, at the very earliest stages of
(00:55:28)
childhood, you will forget everything.
(00:55:29)
You're just an amnesiac about everything
(00:55:31)
that happens before a certain uh year
(00:55:32)
date, but you're like extremely good at
(00:55:34)
picking up new languages and learning
(00:55:35)
from the world. And maybe there's some
(00:55:37)
element of like being able to see the
(00:55:38)
forest for the trees. Whereas if you
(00:55:40)
compare it to the opposite end of the
(00:55:41)
spectrum, you have LLM pre-training,
(00:55:44)
where these models will literally be able
(00:55:46)
to regurgitate word for word what is the
(00:55:48)
next thing in a Wikipedia page, but
(00:55:51)
their ability to learn abstract concepts
(00:55:53)
really quickly the way a child can is
(00:55:55)
much more limited. And then adults are
(00:55:57)
somewhere in between where they don't
(00:55:58)
have the flexibility of childhood
(00:56:00)
learning, but they can, you know, adults
(00:56:02)
can memorize facts and information in a
(00:56:04)
way that is harder for kids. And I don't
(00:56:06)
know if there's something interesting
(00:56:08)
about that. I think there's something
(00:56:09)
very interesting about that. Yeah, 100%.
(00:56:11)
I do think that humans actually
(00:56:13)
>> um they do kind of like have a lot more
(00:56:14)
of an element, compared to LLMs, of seeing
(00:56:16)
the forest for the trees
(00:56:18)
>> and and we're not actually that good at
(00:56:19)
memorization which is actually a
(00:56:21)
feature.
(00:56:22)
>> Um because we're not that good at
(00:56:24)
memorization, we actually are kind of
(00:56:26)
like forced to uh find the patterns uh
(00:56:29)
um like more in a more general sense. I
(00:56:32)
think LLMs in comparison are
(00:56:33)
extremely good at memorization. they
(00:56:35)
will recite passages from all these uh
(00:56:37)
training sources. Uh you can give them
(00:56:39)
completely nonsensical data like you can
(00:56:41)
take um you can hash some amount of text
(00:56:43)
or something like that. You get
(00:56:44)
completely random sequence. If you train
(00:56:45)
on it even just I think a single
(00:56:46)
iteration or two it can suddenly
(00:56:48)
regurgitate the entire thing. It will
(00:56:49)
memorize it. There's no way a person can
(00:56:51)
read a single sequence of random numbers
(00:56:53)
and recite it to you. Um, and that's a
(00:56:56)
feature, not a bug almost. Uh, because
(00:56:58)
it forces you to like only learn the
(00:56:59)
generalizable components, whereas LLMs
(00:57:02)
are distracted by all the memory that
(00:57:04)
they have of the pre-trained documents
(00:57:06)
and it's probably very distracting to
(00:57:07)
them, uh, in a certain sense. So that's
(00:57:10)
why when I talk about the cognitive
(00:57:11)
core, I actually want to remove the
(00:57:12)
memory, which is what we talked about.
(00:57:14)
I'd love to have them have less
(00:57:16)
memory so that they have to look things
(00:57:17)
up uh and that they only maintain the
(00:57:19)
algorithms for like thought uh and the
(00:57:22)
idea of an experiment and all this
(00:57:24)
cognitive glue of um of acting
(00:57:26)
>> and this is also relevant to preventing
(00:57:28)
model collapse.
(00:57:30)
>> Um let me think um
(00:57:35)
I'm not sure I think it's almost like a
(00:57:36)
separate axis. Mhm.
(00:57:37)
>> it's almost like the the models are way
(00:57:39)
too good at uh memorization and somehow
(00:57:41)
we should we should remove that and I
(00:57:43)
think people people are much worse but
(00:57:44)
it's a good thing.
(00:57:46)
>> What is a solution to model collapse? I
(00:57:48)
mean you could so there's very naive
(00:57:50)
things you could attempt is just like
(00:57:52)
>> um the distribution over logits should be
(00:57:55)
wider or something like there's many
(00:57:56)
naive things you could try. What ends up
(00:57:58)
being the problem with the naive
(00:57:59)
approaches?
(00:58:00)
>> Um yeah I think that's a great question.
(00:58:02)
I mean you can imagine having a
(00:58:03)
regularization for entropy and things
(00:58:04)
like that. I guess they just don't work
(00:58:06)
as well empirically because uh right now
(00:58:09)
like the models are collapsed but I will
(00:58:10)
say um most of the tasks that we want of
(00:58:13)
them don't actually demand the diversity
(00:58:17)
>> is probably the the answer of what's
(00:58:18)
going on and so it's just that the model
(00:58:20)
the frontier labs are trying to make the
(00:58:22)
models useful and I kind of just feel
(00:58:24)
like the diversity of the outputs is not
(00:58:25)
so much number one it's much harder to
(00:58:27)
work with and evaluate and all this kind
(00:58:28)
of stuff but maybe it's not what's
(00:58:30)
actually capturing most of the value.
(00:58:31)
Um,
(00:58:32)
>> in fact, it's actively penalized, right?
(00:58:34)
If you if you're like super creative in
(00:58:36)
RL, it's like not good.
(00:58:38)
>> Yeah. Or like maybe if you're doing a
(00:58:39)
lot of writing help from LLMs and stuff
(00:58:40)
like that, I think it's probably bad
(00:58:41)
because the models will give you these
(00:58:43)
like silently
(00:58:44)
>> all the same stuff, you know? So,
(00:58:47)
they're not um they won't explore lots
(00:58:48)
of different ways of answering a
(00:58:50)
question, right?
(00:58:51)
>> But I kind of feel like maybe this
(00:58:52)
diversity is just not as big of um yeah,
(00:58:55)
maybe like yeah, not as many
(00:58:56)
applications need it, so the models
(00:58:57)
don't have it, but then it's actually a
(00:58:58)
problem at synthetic generation time,
(00:58:59)
etc. So we're actually shooting
(00:59:00)
ourselves in the foot by not allowing
(00:59:02)
this entropy to maintain in the model.
(00:59:04)
And I think possibly uh the labs should
(00:59:06)
try harder.
(00:59:07)
>> And then I think you hinted that it's a
(00:59:09)
it's a very fundamental problem. It
(00:59:11)
won't be easy to solve. And yeah, what's
(00:59:13)
your intuition for that?
(00:59:14)
>> I don't actually know if it's um super
(00:59:16)
fundamental. Uh I don't actually know if
(00:59:18)
I intended to to say that. I do think
(00:59:20)
that um
(00:59:22)
I haven't done these experiments, but I
(00:59:24)
do think that you could probably
(00:59:25)
regularize the entropy to be uh to be
(00:59:26)
higher. So you're encouraging the model
(00:59:28)
to give you more and more solutions. Um
(00:59:30)
but you don't want it to start deviating
(00:59:32)
too much from the training data. It's
(00:59:33)
going to start making up its own
(00:59:34)
language. It's going to start using
(00:59:35)
words that are extremely rare. U you
(00:59:37)
know so it's going to drift too much
(00:59:39)
from the distribution. Uh so I think
(00:59:41)
controlling the distribution is just
(00:59:42)
like a tricky it's just like someone
(00:59:44)
just has to
(00:59:45)
>> it's probably not trivial in that sense.
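The trade-off described here, pushing entropy up without drifting too far from the training distribution, can be sketched as a reward with an entropy bonus and a KL penalty toward a reference distribution. This is only an illustration of the idea, not something the speakers describe implementing; the function names, the coefficients `entropy_coef` and `kl_coef`, and the toy distributions are all assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def regularized_objective(reward, policy, reference,
                          entropy_coef=0.1, kl_coef=0.5):
    """Hypothetical objective: reward, plus a bonus for diversity,
    minus a penalty for drifting away from the reference distribution."""
    return (reward
            + entropy_coef * entropy(policy)
            - kl_coef * kl_divergence(policy, reference))

# Toy distributions over four possible outputs (made-up numbers).
reference = [0.50, 0.30, 0.15, 0.05]  # stand-in for the training distribution
collapsed = [0.97, 0.01, 0.01, 0.01]  # low entropy: always the same answer
drifted = [0.05, 0.05, 0.10, 0.80]    # most mass on an output that is rare in the reference
balanced = [0.40, 0.30, 0.20, 0.10]   # diverse but still close to the reference

scores = {
    name: regularized_objective(1.0, policy, reference)
    for name, policy in [("collapsed", collapsed),
                         ("drifted", drifted),
                         ("balanced", balanced)]
}
for name, score in scores.items():
    print(name, round(score, 3))
```

With equal task reward, the balanced distribution scores highest under these assumed coefficients: the collapsed one loses the entropy bonus and the drifted one pays a large KL penalty. Choosing the two coefficients is the part that is "probably not trivial."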
(00:59:48)
>> How many bits should the optimal core
(00:59:52)
>> of intelligence end up being if you just
(00:59:54)
had to make a guess? The thing we put on
(00:59:56)
the uh von Neumann
(00:59:57)
>> probes, how big does it have to be?
(01:00:00)
>> So it's really interesting in the
(01:00:01)
history of the field because at one
(01:00:03)
point everything was very um scaling
(01:00:05)
pilled in terms of like oh we're going to
(01:00:06)
make much bigger models trillions of
(01:00:08)
parameter models and actually what the
(01:00:09)
models have done in size is they've gone
(01:00:11)
up and now they've actually kind of like
(01:00:14)
actually even come down their models are
(01:00:16)
smaller
(01:00:17)
>> and even then I actually think they
(01:00:18)
memorized way too much. Um, so I think I
(01:00:21)
had a prediction a while back that I I
(01:00:23)
almost feel like we can get cognitive
(01:00:24)
cores that are very good at even like a
(01:00:26)
billion parameters. It it should
(01:00:28)
be already like like if you talk to a
(01:00:30)
billion parameter model I think in 20
(01:00:32)
years you can actually have a very
(01:00:33)
productive conversation. It thinks um
(01:00:36)
and it's a lot more like a human. But if
(01:00:38)
you ask it some factual question might
(01:00:39)
have to look it up but it knows that it
(01:00:41)
doesn't know and it might have to look
(01:00:42)
it up and it will just do all the
(01:00:43)
reasonable things. That that's actually
(01:00:44)
surprising that you think it will take a
(01:00:46)
billion because already we have a
(01:00:47)
billion parameter models or a couple
(01:00:49)
billion parameter models that are like
(01:00:50)
very intelligent.
(01:00:51)
>> Well, some of our models are like a
(01:00:53)
trillion parameters, right? But they
(01:00:54)
remember so much stuff like just
(01:00:56)
>> Yeah. But I'm surprised that in 10 years
(01:00:59)
given the pace, okay, we have GPT OSS
(01:01:03)
20B that's way better than the original GPT-4
(01:01:07)
which was a trillion plus uh parameters.
(01:01:10)
So given that trend, I'm actually
(01:01:11)
surprised you think in 10 years the
(01:01:13)
cognitive core is still a billion
(01:01:15)
parameters. I would I'm surprised you're
(01:01:16)
not like it's going to be like uh tens
(01:01:18)
of millions or millions.
(01:01:20)
>> No, because I basically think that the
(01:01:21)
training data is so here's the issue.
(01:01:23)
The training data is the internet which
(01:01:24)
is really terrible.
(01:01:26)
>> So there's a huge amount of gains to be
(01:01:27)
made because the internet is terrible.
(01:01:28)
Like if you actually and even the
(01:01:30)
internet when you and I think of the
(01:01:31)
internet, you're thinking of like a Wall
(01:01:32)
Street Journal or
(01:01:34)
>> that's not what this is. When you're
(01:01:35)
actually looking at a pre-training dataset
(01:01:36)
in a frontier lab and you look at a
(01:01:38)
random internet document, it's total
(01:01:40)
garbage. Like I don't even know how this
(01:01:42)
works at all. It's some like stock
(01:01:44)
ticker symbols. Uh
(01:01:47)
>> it's a huge amount of slop and garbage
(01:01:49)
from like all the corners of the
(01:01:50)
internet. It's not like your Wall Street
(01:01:51)
Journal article that's extremely rare.
(01:01:53)
>> Um so I almost feel like because the
(01:01:55)
internet is so terrible, we actually
(01:01:57)
have to sort of like build really big
(01:01:58)
models to compress all that. Uh most of
(01:02:01)
that compression is memory work instead
(01:02:03)
of like cognitive work. But what we
(01:02:04)
really want is the cognitive part
(01:02:06)
actually delete the memory
(01:02:07)
>> and then so I guess what I'm saying is
(01:02:09)
like we need intelligent models to help
(01:02:12)
us refine even the pre-training set to
(01:02:14)
just narrow it down to the cognitive
(01:02:15)
components and then I think you get away
(01:02:17)
with a much smaller model because it's a
(01:02:18)
much better data set and you could train
(01:02:20)
it on it but probably it's not trained
(01:02:22)
directly on it. It's probably distilled
(01:02:23)
from a much better model still but
(01:02:24)
>> but why is the distilled version still a
(01:02:26)
billion is I guess the thing I'm curious
(01:02:28)
about.
(01:02:29)
>> I just feel like distillation works
(01:02:30)
extremely well. So um almost every small
(01:02:32)
model if you have a small model it's
(01:02:34)
almost certainly distilled. Why would
(01:02:35)
you train on
(01:02:36)
>> right? No no but why is a distillation
(01:02:37)
not in 10 years not getting below 1
(01:02:39)
billion.
(01:02:40)
>> Oh you think it should be smaller than a
(01:02:41)
billion?
(01:02:42)
>> I mean come on right I don't know at
(01:02:45)
some point uh it should take at least a
(01:02:47)
billion knobs uh to do something
(01:02:49)
interesting. You're thinking it should
(01:02:50)
be even smaller.
(01:02:51)
>> Yeah. I mean just like if you look at
(01:02:52)
the trend over the last few years just
(01:02:54)
finding low hanging fruit and going from
(01:02:56)
like trillion plus models to models that are like
(01:02:58)
literally two orders of magnitude
(01:03:00)
smaller in a matter of two years and
(01:03:02)
having better performance.
(01:03:03)
>> Yeah.
(01:03:04)
>> It makes me think the the sort of like
(01:03:06)
core of intelligence might be
(01:03:08)
>> even way way smaller like plenty of room
(01:03:10)
at the bottom, to paraphrase Feynman.
(01:03:12)
>> I mean I almost feel like I'm already
(01:03:13)
contrarian by talking about a billion
(01:03:14)
parameter cognitive core and you're
(01:03:16)
outdoing me. I think um yeah maybe we
(01:03:19)
could get a little bit smaller. I mean,
(01:03:20)
I still think that there should be
(01:03:21)
enough.
(01:03:22)
>> Yeah, maybe it can be smaller.
(01:03:23)
>> I do think that practically speaking,
(01:03:25)
you want the model to have some
(01:03:26)
knowledge. You don't want it to be
(01:03:27)
looking up everything.
(01:03:28)
>> Um because then you can't like think in
(01:03:30)
your head. You're looking up way too
(01:03:31)
much stuff all the time. So, I do think
(01:03:32)
it needs to be some basic curriculum
(01:03:34)
needs to be there for knowledge.
(01:03:36)
>> Uh but it doesn't have esoteric
(01:03:38)
knowledge, you know.
(01:03:38)
>> Yeah. So, we're discussing what like
(01:03:40)
plausibly could be the cognitive core.
(01:03:41)
There's a separate question which is
(01:03:43)
what will actually be the size of frontier
(01:03:46)
models over time? And I'm curious to
(01:03:47)
have a prediction. So we had increasing
(01:03:50)
scale up to maybe GPT-4.5 and now we're
(01:03:52)
seeing decreasing/plateauing scale.
(01:03:55)
There's many reasons that could be going
(01:03:56)
on but do you have a prediction about
(01:03:58)
going forward will scale will the
(01:04:00)
biggest models be bigger? Will they be
(01:04:01)
smaller? Will they be the same?
(01:04:03)
>> Um yeah I don't know that I have a super
(01:04:05)
strong prediction. I do think that the
(01:04:07)
labs are just being practical. They have
(01:04:09)
a flops budget and a cost budget. And it
(01:04:11)
just turns out that pre-training is not
(01:04:12)
where you want to put most of your flops
(01:04:14)
or your cost. So that's why the models
(01:04:15)
have gotten smaller because they are a
(01:04:17)
bit smaller, or the pre-training stage is
(01:04:18)
smaller etc but they make it up in
(01:04:20)
reinforcement learning and all this kind
(01:04:21)
of stuff mid training and all this kind
(01:04:22)
of stuff that follows
(01:04:23)
>> uh so they're just being practical in
(01:04:25)
terms of all the stages and how you get
(01:04:26)
the most bang for the buck um so I guess
(01:04:28)
like forecasting that trend I think uh
(01:04:30)
is quite hard I do still expect that
(01:04:32)
there's so much low-hanging fruit. That's my
(01:04:33)
basic that's my basic expectation um and
(01:04:38)
so I I have a very wide distribution
(01:04:40)
here. Um do you expect the low-hanging fruit
(01:04:42)
to be similar in kind to the kinds of
(01:04:45)
things that have been happening over the
(01:04:47)
last two to five years like just in terms of
(01:04:49)
like if I look at nanochat versus
(01:04:52)
nanoGPT and then the architectural tweaks
(01:04:53)
you made
(01:04:54)
>> is that basically like the flavor of
(01:04:55)
things you expect to keep happening or
(01:04:57)
is there you're not expecting any giant
(01:05:00)
>> part yeah I I expect the data sets to
(01:05:02)
get much much better because when you
(01:05:03)
look at the average data sets they're
(01:05:04)
extremely terrible like so bad that I
(01:05:06)
don't even know how anything works to be
(01:05:07)
honest like look at the average example
(01:05:08)
in the training set
(01:05:10)
>> like factual mistakes errors yeah
(01:05:13)
>> nonsensical things um somehow when you
(01:05:15)
do it at scale the the noise washes away
(01:05:18)
and you're left with some of the signal.
(01:05:20)
Um so data sets will improve a ton. It's
(01:05:22)
just everything gets better. So um our
(01:05:25)
hardware um all the kernels um uh all
(01:05:28)
the kernels for running the hardware and
(01:05:29)
maximizing what you get with the
(01:05:30)
hardware, you know. So NVIDIA is slowly
(01:05:32)
tuning the actual hardware itself tensor
(01:05:34)
cores and so on. All that needs to
(01:05:35)
happen and will continue to happen. Uh
(01:05:37)
all the kernels will get better and
(01:05:38)
utilize the chip to the max extent. all
(01:05:40)
the algorithms will probably improve
(01:05:42)
over optimization, architecture,
(01:05:43)
and um just all the modeling components
(01:05:45)
of how everything is done and what the
(01:05:47)
algorithms are that we're even training
(01:05:48)
with. So I do I do kind of expect like a
(01:05:51)
just very just everything nothing
(01:05:53)
dominates. Everything plus 20%.
(01:05:57)
>> Right. Interesting.
(01:05:58)
>> This is like roughly what I've seen.
(01:05:59)
>> Okay. This is my general manager Max.
(01:06:02)
>> Good to be here here every day.
(01:06:03)
>> And you have been here since you were
(01:06:04)
onboarded about 6 months ago. But when I
(01:06:06)
was
(01:06:06)
>> months ago
(01:06:07)
>> Oh, right. Um, time passes so fast. But
(01:06:10)
when I onboarded you, I was in France
(01:06:12)
and so we basically didn't get the
(01:06:14)
chance to talk at all almost
(01:06:16)
>> and you basically just gave me one
(01:06:18)
login.
(01:06:19)
>> I gave you access to my Mercury
(01:06:21)
platform, which is the banking platform
(01:06:23)
that I was using at the time to run the
(01:06:24)
podcast.
(01:06:25)
>> And so I logged into Mercury assuming
(01:06:26)
that that would just be the first of
(01:06:27)
many steps, but I realized that was how
(01:06:30)
you were running the entire business,
(01:06:32)
even down to a lot of our editors are
(01:06:34)
international contractors. And so you
(01:06:35)
had just figured out how to set up these
(01:06:37)
recurring payments to set up basic
(01:06:39)
payroll.
(01:06:39)
>> I mean, Mercury made the experience of
(01:06:41)
all of these things I was doing before
(01:06:42)
so seamless that it didn't even occur to
(01:06:44)
me until you pointed it out that this is
(01:06:45)
not the natural way to set up payroll or
(01:06:48)
invoicing or any of these other things.
(01:06:50)
>> I I was surprised, but I was like, it's
(01:06:51)
worked so far, so maybe I'll trust it.
(01:06:54)
And then now I can't think of doing
(01:06:55)
anything else.
(01:06:56)
>> All right, you heard him. Visit
(01:06:58)
mercury.com to apply online in minutes.
(01:07:01)
Cool. Thanks, Max.
(01:07:02)
>> Thanks for having me.
(01:07:03)
>> Dude, you're great at this. I'm so
(01:07:04)
nervous, but thank you.
(01:07:06)
>> Mercury is a financial technology
(01:07:07)
company, not a bank. Banking services
(01:07:09)
provided through Choice Financial Group,
(01:07:11)
Column N.A., and Evolve Bank & Trust,
(01:07:12)
members FDIC. People have proposed
(01:07:15)
different ways of charting how much
(01:07:18)
progress we've made towards full AGI
(01:07:21)
because if you can come up with some
(01:07:23)
line, then you can see where that line
(01:07:24)
intersects with AGI and where that would
(01:07:26)
happen on the X-axis. And so people have
(01:07:28)
proposed, oh, it's like the education
(01:07:30)
level, like we had a high schooler and
(01:07:31)
then they went to college with RL
(01:07:33)
and they're going to get a PhD. I don't
(01:07:34)
like that one.
(01:07:35)
>> Um or and then they'll propose horizon
(01:07:37)
length. So maybe they can do tasks that
(01:07:39)
take a minute. Uh they can do those
(01:07:41)
autonomously, then they can autonomously
(01:07:43)
do tasks that take an hour, a human an
(01:07:44)
hour, a human a week, etc.
(01:07:46)
>> How do you think about what is the
(01:07:49)
relevant um y-axis here? What is the how
(01:07:52)
should we think about how AI is making
(01:07:54)
progress?
(01:07:54)
>> So I guess I have two answers to that.
(01:07:56)
Number one, I'm almost tempted to like
(01:07:58)
reject the question entirely because
(01:07:59)
again like I see this as an extension of
(01:08:00)
computing. Have we talked about like how
(01:08:02)
to chart progress in computing or how do
(01:08:04)
you chart progress in computing since
(01:08:05)
1970s or whatever. What is the x-axis?
(01:08:08)
So I kind of feel like the whole
(01:08:09)
question is kind of like funny from that
(01:08:10)
perspective a little bit. Um but I will
(01:08:12)
say I guess like when people talk about
(01:08:14)
AI and the original AGI and how we spoke
(01:08:16)
about it when we um when OpenAI started
(01:08:19)
>> AGI was a system you can go to that can
(01:08:22)
do any task that is economically
(01:08:24)
valuable any economically valuable task
(01:08:26)
at um human performance or better.
(01:08:29)
>> Okay. So that was the definition and I
(01:08:31)
was pretty happy with that at the time
(01:08:32)
and I kind of feel like I've stuck to
(01:08:33)
that definition forever and then people
(01:08:35)
have made up all kinds of other
(01:08:36)
definitions but I I like I feel like I
(01:08:39)
like that definition. Now, number one,
(01:08:41)
the first concession that people make
(01:08:43)
all the time is they just take out all
(01:08:44)
the physical stuff because we're just
(01:08:46)
talking about digital knowledge work. I
(01:08:48)
feel like that's a pretty major
(01:08:49)
concession compared to the original
(01:08:50)
definition which was like any task a
(01:08:52)
human can do. I can lift things, etc.
(01:08:54)
Like AI can't do that obviously. So,
(01:08:56)
okay, but we'll take it.
(01:08:57)
>> Uh, what fraction of the economy are we
(01:08:59)
taking away by saying, "Oh, only
(01:09:01)
knowledge work." Um, I don't actually
(01:09:03)
know the numbers. I feel like um it's
(01:09:04)
about 10 to 20% if I had to guess. Is um
(01:09:07)
is only knowledge work. uh like someone
(01:09:10)
could work from home and perform tasks
(01:09:11)
something like that. Um I still think
(01:09:14)
it's a really large market. Uh like um
(01:09:16)
yeah what is the size of the economy and
(01:09:18)
what is 10 20% like we're still talking
(01:09:20)
about a few trillion dollars, even in
(01:09:22)
the US of market share almost or like
(01:09:25)
work.
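The back-of-the-envelope estimate above can be checked with a couple of lines of arithmetic. The speaker gives no figure for the size of the economy, so the US GDP value below (roughly $29 trillion, an approximate 2024 nominal figure) is an assumption added for illustration; the 10 to 20% share is the speaker's own guess.

```python
# Assumption: US nominal GDP of roughly $29 trillion (approximate 2024 figure).
# The transcript does not state this number.
US_GDP_TRILLIONS = 29.0

def knowledge_work_range(gdp_trillions, low_share=0.10, high_share=0.20):
    """Return the (low, high) dollar range, in trillions, for the guessed share."""
    return gdp_trillions * low_share, gdp_trillions * high_share

low, high = knowledge_work_range(US_GDP_TRILLIONS)
print(f"Knowledge work at 10-20% of GDP: ${low:.1f}T to ${high:.1f}T")
```

Under that assumed GDP, the guess works out to a low single-digit number of trillions of dollars, which is consistent with "a few trillion dollars" for the US alone.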
(01:09:26)
>> Um so still a very massive bucket. So
(01:09:28)
but I guess like going back to the
(01:09:30)
definition I guess what I would be
(01:09:31)
looking for is uh to what extent is that
(01:09:33)
definition true? Uh so um are there jobs
(01:09:36)
or lots of tasks? If we think of tasks
(01:09:38)
as you know not jobs but tasks kind of
(01:09:41)
difficult because the problem is like
(01:09:43)
society will refactor based on the tasks
(01:09:46)
that make up jobs compared to what's
(01:09:47)
yeah based on what's automatable or not
(01:09:49)
but today what jobs are replaceable by
(01:09:51)
AI so a good example recently was um
(01:09:55)
Geoff Hinton's prediction that
(01:09:56)
radiologists would not be a job anymore
(01:09:58)
and this turned out to be very wrong in
(01:09:59)
a bunch of ways right so radiologists
(01:10:01)
are alive and well and growing even
(01:10:03)
though computer vision is really really
(01:10:04)
good at recognizing all the different
(01:10:06)
things that they have to recognize in
(01:10:07)
images, and it's just a messy, complicated job with
(01:10:10)
a lot of surfaces and dealing with
(01:10:11)
patients and all this kind of stuff in
(01:10:12)
the context of it. Um so I guess I don't
(01:10:16)
actually know that by that definition AI
(01:10:18)
has made a huge dent yet. Um
(01:10:21)
but some of the some of the jobs maybe
(01:10:22)
that I would be looking for have some
(01:10:24)
features that I think make it very
(01:10:25)
amenable to automation earlier than
(01:10:27)
later. As an example, call center
(01:10:28)
employees often come up and I think
(01:10:30)
rightly so. Uh because call center
(01:10:32)
employees have a number of simplifying
(01:10:34)
uh properties with respect to what's
(01:10:35)
automatable today. um their jobs are
(01:10:39)
pretty simple. It's a sequence of tasks
(01:10:41)
and every task looks similar like you
(01:10:43)
take a phone call with a person, it's 10
(01:10:44)
minutes of interaction or whatever it
(01:10:46)
is, probably a bit longer in my
(01:10:47)
experience, a lot longer. Um and you
(01:10:50)
complete some task in some scheme and
(01:10:52)
you change some database entries around
(01:10:54)
or something like that. So you keep
(01:10:55)
repeating something over and over again
(01:10:56)
and that's your job. So basically you do
(01:10:59)
want to bring in the task horizon how
(01:11:01)
long it takes to perform a task.
(01:11:03)
>> And then you want to also remove context
(01:11:05)
like you're not dealing with different
(01:11:06)
parts of services of companies or other
(01:11:08)
customers. It's just the database you
(01:11:10)
and a person you're serving. And so it's
(01:11:12)
more closed. It's more understandable
(01:11:14)
and it's purely digital. So I I would be
(01:11:16)
looking for those things. But even there
(01:11:18)
I'm not actually looking at full
(01:11:19)
automation yet. I'm looking for an
(01:11:21)
autonomy slider and I almost expect that
(01:11:23)
we are not going to instantly replace
(01:11:25)
people. We're going to be swapping in
(01:11:27)
AIs that do 80% of the volume. They
(01:11:29)
delegate 20% of the volume to humans and
(01:11:31)
humans are supervising teams of five AIs
(01:11:33)
doing the call center work that's more
(01:11:35)
rote. Um so I would be looking for new
(01:11:38)
interfaces or new um companies that
(01:11:40)
provide some kind of a layer that allows
(01:11:43)
you to manage some of these AIs that are
(01:11:45)
not yet perfect.
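The "autonomy slider" idea, where AIs handle most of the volume and delegate the rest to human supervisors, can be sketched as a simple routing rule. This is a hypothetical illustration: the `Call` type, the `ai_confidence` score, and the threshold value are assumptions, not a description of any real call-center product.

```python
from dataclasses import dataclass

@dataclass
class Call:
    call_id: int
    ai_confidence: float  # hypothetical score in [0, 1] reported by the AI agent

def route_calls(calls, autonomy_threshold):
    """Split calls into those the AI handles and those escalated to a human.

    Lowering the threshold moves the slider toward more autonomy;
    raising it sends more of the volume to human supervisors.
    """
    handled = [c for c in calls if c.ai_confidence >= autonomy_threshold]
    escalated = [c for c in calls if c.ai_confidence < autonomy_threshold]
    return handled, escalated

# Ten made-up calls with confidences 0.05, 0.15, ..., 0.95.
calls = [Call(i, (i + 0.5) / 10) for i in range(10)]

handled, escalated = route_calls(calls, autonomy_threshold=0.2)
print(f"AI handled {len(handled)} calls, humans took {len(escalated)}")
```

With this toy data and a threshold of 0.2, the split matches the 80/20 division mentioned above. The management layer the speaker anticipates would sit around a rule like this: setting the threshold, reviewing escalations, and monitoring the AI-handled calls.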
(01:11:46)
>> Yeah.
(01:11:47)
>> And then I would expect that across the
(01:11:48)
economy and a lot of jobs are a lot
(01:11:50)
harder than call center employee. I
(01:11:52)
wonder with radiologists,
(01:11:54)
I'm totally speculating. I have no idea
(01:11:56)
how what the actual workflow of a
(01:11:57)
radiologist involves,
(01:11:59)
>> but one analogy that might be applicable
(01:12:01)
is um when Waymos were first being rolled
(01:12:05)
out, there would be a person sitting in
(01:12:07)
the front seat
(01:12:08)
>> and you just had to have them there to
(01:12:10)
make sure that if something went really
(01:12:11)
wrong, they're there to monitor. And I
(01:12:13)
think even today, people are still
(01:12:14)
watching to make sure things are going
(01:12:15)
well. Um Robotaxi, which was just
(01:12:17)
deployed, actually still has a person
(01:12:18)
inside it. And we we could be in a
(01:12:20)
similar situation where if you automate
(01:12:23)
99% of a job, that last 1% the human has
(01:12:26)
to do is incredibly valuable because
(01:12:28)
it's bottlenecking everything else. And
(01:12:30)
if it was the case with
(01:12:32)
like with radiologists where the person
(01:12:33)
sitting in the front of the Uber or the
(01:12:34)
front of the Waymo has to be specially
(01:12:36)
trained for years in order to be able to
(01:12:37)
provide the last 1%. Their wages should
(01:12:39)
go up tremendously because they're
(01:12:41)
like the one thing bottlenecking
(01:12:42)
wide deployment. So radiologists I think
(01:12:44)
their wages have gone up for similar
(01:12:46)
reasons. if you're like the last
(01:12:47)
bottleneck, you should you're like and
(01:12:49)
you're not fungible, which like you know
(01:12:50)
a Waymo driver might be fungible with
(01:12:52)
other things. Um so you might see this
(01:12:54)
thing where like your wages go like
(01:12:56)
>> and until you get to 90% and then like
(01:12:58)
just like that
(01:12:59)
>> and when the last 1% is gone.
(01:13:00)
>> I see.
(01:13:01)
>> Um and I wonder if we're seeing similar things
(01:13:03)
with radiology or salaries of call
(01:13:05)
center workers or anything like that.
(01:13:07)
>> Yeah, I think that's that's an
(01:13:08)
interesting um question. I don't think
(01:13:10)
we're currently seeing that with
(01:13:11)
radiology or uh and I don't have like um
(01:13:15)
in my understanding but I think
(01:13:16)
radiology is not a good example
(01:13:17)
basically. I don't know why Geoff Hinton
(01:13:19)
picked on radiology uh because I think
(01:13:21)
it's an extremely messy messy
(01:13:23)
complicated profession.
(01:13:24)
>> Yeah.
(01:13:25)
>> Uh so I would be a lot more interested
(01:13:26)
in what's happening with call center
(01:13:27)
employees today for example uh because I
(01:13:29)
would expect a lot of the rote stuff to
(01:13:31)
be uh automatable today
(01:13:32)
>> and I don't have a first level access to
(01:13:34)
it but maybe I would be looking for
(01:13:35)
trends of what's happening with the call
(01:13:37)
center employees. Maybe some of the
(01:13:39)
things I would also expect is maybe they
(01:13:41)
are uh swapping in AI but then I would
(01:13:43)
still wait for a year or two because I
(01:13:46)
would potentially expect them to pull
(01:13:47)
back and actually rehire some of
(01:13:48)
the people.
(01:13:49)
>> I think there's been evidence that
(01:13:50)
that's already been happening in the
(01:13:52)
generally like companies that have been
(01:13:53)
adopting AI which I think is quite
(01:13:54)
surprising and I also find what was
(01:13:56)
really surprising.
(01:13:58)
>> Okay. Um AGI right like a thing which
(01:14:01)
should do everything and okay we'll take
(01:14:03)
out physical work. So it should be
(01:14:05)
able to do all knowledge work. And what
(01:14:07)
you would have naively anticipated that
(01:14:09)
the way this progression would happen is
(01:14:10)
like you take a little task that a
(01:14:14)
consultant is doing, you take that out
(01:14:16)
of the bucket. You take a little task
(01:14:17)
that um an accountant is doing, you take
(01:14:20)
that out of the bucket. Uh and then
(01:14:22)
you're just doing this across all
(01:14:23)
knowledge work. But instead, if we do
(01:14:25)
believe we're on the path to AGI with the
(01:14:26)
current paradigm, the progression is
(01:14:28)
very much not like that. at least um
(01:14:30)
>> it just does not seem like consultants
(01:14:32)
and accountants or whatever are getting
(01:14:33)
like huge productivity improvements. It's
(01:14:34)
very much like
(01:14:36)
>> programmers are like getting more and
(01:14:39)
more chiseled away at their work. If
(01:14:40)
you look at the revenues of these
(01:14:41)
companies discounting just like normal
(01:14:43)
chat revenue which I think is like I
(01:14:45)
don't know that's similar to like Google
(01:14:47)
or something just looking at API
(01:14:50)
revenues it's like dominated by coding
(01:14:51)
right so this thing which is general
(01:14:54)
quote unquote should be able to do any
(01:14:55)
knowledge work is just overwhelmingly
(01:14:57)
doing only coding and it's a surprising
(01:15:00)
way that you would expect like the AGI
(01:15:02)
to be deployed
(01:15:03)
>> so I think there's there's an
(01:15:04)
interesting point here because I do
(01:15:06)
believe coding is like the perfect first
(01:15:08)
thing for uh for a for uh these LLMs and
(01:15:11)
uh agents and that's because coding has
(01:15:13)
always fundamentally uh worked around
(01:15:16)
text.
(01:15:17)
>> It's computer terminals and text and
(01:15:19)
everything is based around text and LLMs
(01:15:21)
the way they're trained on the internet
(01:15:23)
love text
(01:15:24)
>> and so they're perfect text processors
(01:15:26)
and there's all this data out there and
(01:15:27)
it's just perfect fit. Um and also we
(01:15:30)
have a lot of infrastructure pre-built
(01:15:31)
for handling uh code and text. So for
(01:15:34)
example, we have a Visual Studio Code or
(01:15:36)
you know um your favorite um uh IDE
(01:15:39)
showing you code um and an agent can
(01:15:42)
plug into that. So for example, if an
(01:15:43)
agent has a diff where it made some
(01:15:45)
change, we suddenly have all this code
(01:15:46)
already that shows all the differences
(01:15:48)
to a codebase uh using a diff. So we've
(01:15:51)
it's almost like we've pre-built a lot
(01:15:53)
of the a lot of the infrastructure for
(01:15:55)
code. Now contrast that with some of the
(01:15:57)
things that that don't enjoy that at
(01:15:58)
all. So as an example like um there's
(01:16:00)
people trying to build automation not
(01:16:02)
for coding but for example for slides
(01:16:04)
like I saw a company doing slides that's
(01:16:06)
much much harder and the reason it's
(01:16:07)
much much harder is because slides are
(01:16:08)
not text.
(01:16:09)
>> Yeah.
(01:16:10)
>> Slides are little graphics and they're
(01:16:12)
arranged spatially and uh there's visual
(01:16:14)
component to it and um and slides uh
(01:16:17)
don't have this pre-built
(01:16:18)
infrastructure. Like for example if an
(01:16:20)
agent is to make a uh change
(01:16:21)
to your slides. How does a thing show
(01:16:23)
you the diff? How do you see the diff?
(01:16:25)
There's no there's no nothing that shows
(01:16:27)
diffs for slides. Mhm.
(01:16:28)
>> So someone has to build it. Um so it's
(01:16:30)
just some of these things are not
(01:16:32)
amenable to AIs as they are, which is
(01:16:35)
text processors and code surprisingly
(01:16:37)
is.
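The point about pre-built infrastructure is easy to see in practice: showing an agent's change to a piece of code takes only a few lines with a standard text-diff library, whereas nothing comparable exists out of the box for slides. Below is a minimal sketch using Python's standard `difflib` module; the file names and the toy `greet` function are made up for illustration.

```python
import difflib

# A hypothetical file before and after an agent's edit.
before = [
    "def greet(name):",
    "    print('Hello ' + name)",
]
after = [
    "def greet(name):",
    "    print(f'Hello, {name}!')",
]

# unified_diff yields the familiar -/+ lines that editors and code review tools render.
diff_lines = list(difflib.unified_diff(
    before, after,
    fromfile="before.py", tofile="after.py",
    lineterm="",
))
print("\n".join(diff_lines))
```

Because code is plain text, the same line-based comparison works for any file in a codebase. A slide deck would need a diff that understands graphics and spatial layout, which is the tooling gap described above.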
(01:16:37)
>> I I actually I'm not sure if that alone
(01:16:40)
explains it because
(01:16:42)
I personally have tried to get LLMs to be
(01:16:46)
useful in domains which are just pure
(01:16:49)
language in language out. Um like
(01:16:52)
rewriting transcripts, like coming up
(01:16:54)
with clips based on transcripts, etc.
(01:16:56)
And you might say, well, I didn't, it's
(01:16:58)
very plausible that like I didn't do
(01:16:59)
every single possible thing I could do
(01:17:00)
to I put a bunch of, you know, good
(01:17:03)
examples in context, but maybe I should
(01:17:04)
have done like some kind of fine tuning,
(01:17:06)
whatever. So, our mutual friend Andy
(01:17:07)
Matuschak told me that he actually tried
(01:17:11)
50 billion things to try to get models
(01:17:14)
to be good at writing spaced repetition
(01:17:15)
prompts. Again,
(01:17:16)
>> very much language in, language out
(01:17:19)
task. The kind of thing that should be
(01:17:20)
dead center in the repertoire of these
(01:17:22)
LLMs. And he tried in-context learning
(01:17:24)
obviously with a few-shot examples. He
(01:17:26)
tried I think he told me like a bunch of
(01:17:28)
things like supervised fine-tuning and
(01:17:31)
like you know retrieval whatever and he
(01:17:34)
just could not get them to make cards to
(01:17:36)
his satisfaction. So I find it striking
(01:17:38)
that even in language out domains
(01:17:41)
>> it's actually very hard to get a lot of
(01:17:43)
economic value out of these models
(01:17:45)
>> separate from coding. And I don't know
(01:17:46)
what what explains it.
(01:17:47)
>> Yeah I think um I think that makes
(01:17:49)
sense. I mean I would say um yeah it's
(01:17:52)
I'm not saying that anything text is
(01:17:53)
trivial, right? Uh, I do think that code is
(01:17:56)
like it's pretty structured um text is
(01:17:59)
maybe a lot more flowery and this and
(01:18:01)
there's a lot more like
(01:18:03)
>> uh like entropy in text I would say I
(01:18:05)
don't know how else to put it um
(01:18:07)
>> and also I mean code is hard and so
(01:18:09)
people sort of feel quite empowered by
(01:18:11)
LLMs even from like simple simple kind
(01:18:14)
of uh knowledge I basically I don't
(01:18:17)
actually know that I have um a very good
(01:18:19)
I mean obviously like text makes it much
(01:18:20)
much easier maybe is maybe why I put it
(01:18:22)
but it doesn't mean that all text is
(01:18:24)
trivial.
(01:18:24)
>> Mhm. How do you think about super
(01:18:27)
intelligence? Do you expect it to feel
(01:18:29)
qualitatively different from normal
(01:18:33)
humans or human companies?
(01:18:35)
>> I guess I think I see it as like a
(01:18:37)
progression of automation in society
(01:18:38)
right and again like extrapolating the trend
(01:18:40)
of computing. I just feel like there
(01:18:42)
will be a gradual automation of a lot of
(01:18:44)
things and super intelligence will be
(01:18:45)
sort of like the extrapolation of that.
(01:18:47)
Uh so I do think we expect more and more
(01:18:48)
autonomous entities over time that are
(01:18:50)
doing a lot of the digital work and then
(01:18:52)
eventually even the physical work uh
(01:18:54)
probably some amount of time later but
(01:18:56)
basically I see it as just uh automation
(01:18:59)
>> um roughly speaking
(01:19:00)
>> I guess automation includes the things
(01:19:02)
humans can already do and super
(01:19:03)
intelligence things humans
(01:19:05)
>> well but some of the things that people
(01:19:06)
do is invent new things which I would
(01:19:08)
just put into the automation if that
(01:19:09)
makes sense. Yeah. But you I I guess
(01:19:12)
maybe um less abstractly and more sort
(01:19:16)
of like qualitatively.
(01:19:18)
>> Do you expect something to feel like
(01:19:20)
okay this because this thing can either
(01:19:23)
think so fast or has so many copies or
(01:19:26)
the copies can merge back in themselves
(01:19:29)
or is quote unquote much smarter. any
(01:19:32)
number of advantages an AI might have.
(01:19:35)
Qualitatively, the civilization in
(01:19:37)
which these AIs exist will just feel
(01:19:39)
qualitatively different from
(01:19:39)
human civilization.
(01:19:40)
>> I think it will I mean it is
(01:19:41)
fundamentally automation but I mean it
(01:19:42)
will be like extremely foreign. I do I
(01:19:44)
do think it will look really strange
(01:19:46)
because um like you mentioned we can run
(01:19:48)
all of this on a computer cluster etc
(01:19:51)
and much faster and all this thing.
(01:19:52)
Yeah, I mean maybe some of the scenarios
(01:19:54)
for example that uh I start to get like
(01:19:56)
nervous about with respect
(01:19:58)
to when the world looks like that is
(01:19:59)
this kind of like gradual loss of
(01:20:00)
control and understanding of what's
(01:20:01)
happening and I think that's actually
(01:20:02)
the most likely outcome probably is that
(01:20:04)
there will be a gradual loss of
(01:20:06)
understanding of
(01:20:07)
>> and we'll gradually layer all this stuff
(01:20:09)
everywhere and there will be fewer and
(01:20:11)
fewer people who understand it and that
(01:20:12)
there will be a sort of this like
(01:20:13)
scenario of a gradual loss of control
(01:20:15)
and understanding of what's happening
(01:20:17)
that to me seems most likely outcome of
(01:20:19)
how all this stuff will go down. Let me
(01:20:21)
probe on that a bit. It's not clear to
(01:20:23)
me that loss of control and loss of
(01:20:25)
understanding are the same things.
(01:20:28)
>> A board of directors at like whatever
(01:20:31)
TSMC, Intel, name a random company.
(01:20:34)
>> Um they're just like prestigious
(01:20:36)
80-year-olds. They have very little
(01:20:38)
understanding and maybe they don't
(01:20:39)
practically actually have control, but
(01:20:42)
>> or actually maybe a better example is
(01:20:44)
the president of the United States.
(01:20:46)
>> President has a lot of [ __ ] power. Um
(01:20:49)
I'm not trying to make a good statement
(01:20:50)
about the current occupant, but maybe I
(01:20:53)
am. But like the actual level of
(01:20:54)
understanding is very different from the
(01:20:55)
level of control.
(01:20:56)
>> Yeah, I think that's fair. That's a good
(01:20:58)
push back. I think like um I guess I
(01:21:01)
expect loss of uh both.
(01:21:05)
>> Yeah.
(01:21:05)
>> How come? I mean loss of understanding
(01:21:07)
is obvious, but why loss of control? So,
(01:21:10)
so we're really far into territory of I
(01:21:13)
don't know what this looks like, but if
(01:21:14)
I was to write sci-fi novels, they would
(01:21:16)
look along the lines of not even a
(01:21:19)
single like entity or something like
(01:21:20)
that. So, that just sort of like takes
(01:21:22)
over everything. Uh, but actually like
(01:21:24)
multiple competing entities that
(01:21:25)
gradually become more and more
(01:21:26)
autonomous and uh some of them go rogue
(01:21:29)
and the others like fight them off and
(01:21:30)
all this kind of stuff. And it's like
(01:21:31)
this this hot pot of
(01:21:33)
>> completely autonomous activity that
(01:21:35)
we've uh delegated to. I I kind of feel
(01:21:38)
like
(01:21:40)
it would have that flavor.
(01:21:42)
>> It is not the fact that they are smarter
(01:21:44)
than us that is resulting in the loss of
(01:21:45)
control. It's the fact that they are
(01:21:47)
competing with each other and whatever
(01:21:50)
um arises out of that competition that
(01:21:52)
leads to the loss of control.
(01:21:54)
>> Um
(01:21:56)
I mean I basically expect there to be I
(01:21:58)
mean um a lot of these things I mean
(01:22:00)
they will be tools to people and the
(01:22:02)
people could some of the population is
(01:22:03)
like they're acting on behalf of people
(01:22:06)
or something like that. Maybe those
(01:22:07)
people are in control, but maybe it's a
(01:22:08)
loss of control overall for society in
(01:22:10)
in the sense that of like outcomes we
(01:22:12)
want or something like that. Um where
(01:22:14)
you have entities acting on behalf of
(01:22:15)
individuals that are still kind of uh
(01:22:18)
roughly seen as out of control.
(01:22:19)
>> Yeah. Yeah.
(01:22:20)
>> This is a question I should have asked
(01:22:21)
earlier. So we were talking about how
(01:22:23)
currently it feels like when you're
(01:22:24)
doing AI engineering or AI research,
(01:22:27)
these models are more like in the
(01:22:28)
category of compiler rather than uh in
(01:22:31)
the category of a replacement.
(01:22:32)
>> Yeah. At some point, if you have
(01:22:34)
quote-unquote AGI, it should be able to
(01:22:35)
do what you do.
(01:22:37)
>> And do you feel like having a million
(01:22:39)
copies of you in parallel results in
(01:22:41)
some huge speed up of AI progress?
(01:22:43)
Basically, if that does happen, would
(01:22:45)
you see do you expect to see an
(01:22:46)
intelligence explosion or even once we
(01:22:49)
have not talking about LLMs today, but
(01:22:50)
really
(01:22:51)
>> I guess like what I mean is um I do, but
(01:22:54)
it's business as usual because we're
(01:22:56)
we're in an intelligence explosion
(01:22:58)
already and have been for decades. And
(01:22:59)
when you look at GDP, it's basically the
(01:23:01)
GDP curve that is an exponential
(01:23:02)
weighted sum over so many aspects of the
(01:23:04)
industry. Everything is gradually being
(01:23:06)
automated has been for hundreds of
(01:23:08)
years. Um, industrial revolution is
(01:23:10)
automation and some of the physical
(01:23:11)
components and the tool building and all
(01:23:12)
this kind of stuff. Compilers are early
(01:23:14)
software automation, etc. Uh, so I kind
(01:23:16)
of feel like we've been recursively
(01:23:18)
self-improving and uh exploding for for
(01:23:21)
a long time. Maybe another way to see it
(01:23:22)
is um I mean Earth was a pretty I mean
(01:23:26)
if you don't look at the biomechanics
(01:23:27)
and so on it was a pretty boring place I
(01:23:29)
think and looked very similar if you
(01:23:30)
just look from space and earth is
(01:23:32)
spinning and then like we're in the
(01:23:33)
middle of this like firecracker event
(01:23:36)
>> right
(01:23:36)
>> but we're seeing it in slow motion but
(01:23:38)
>> I definitely feel like this is this has
(01:23:41)
already happened for a very long time
(01:23:42)
and I again like I I don't see AI as
(01:23:45)
like a distinct technology with respect
(01:23:47)
to what has already been happening for a
(01:23:48)
long time. Is there you think it's
(01:23:50)
continuous with this hyper exponential
(01:23:52)
trend?
(01:23:52)
>> And that's why like this is this was
(01:23:54)
very interesting to me because I was I
(01:23:56)
was trying to find AI in the GDP for a
(01:23:57)
while. I thought that GDP should go up
(01:23:59)
but then I looked at some of the other
(01:24:01)
technologies that I thought were were
(01:24:03)
very transformative like uh maybe
(01:24:05)
computers or mobile phones or etc. You
(01:24:07)
can't find them in GDP. GDP is the same
(01:24:08)
exponential and it's just that even for
(01:24:10)
example the early iPhone uh didn't have
(01:24:12)
the app store and it didn't have a lot
(01:24:14)
of the bells and whistles that the
(01:24:15)
modern iPhone has. And so even though we
(01:24:16)
think of 2008 was it when iPhone came
(01:24:19)
out as like some major seismic change,
(01:24:21)
it's actually not. Everything is like so
(01:24:22)
spread out and so slowly diffuses that
(01:24:25)
everything ends up being averaged up
(01:24:26)
into the same exponential. And it's the
(01:24:28)
exact same thing with computers. You
(01:24:29)
can't find them in a GDP is like, oh, we
(01:24:30)
have computers now.
(01:24:31)
>> That's not what happened because it's
(01:24:33)
such a slow progression. And with AI,
(01:24:34)
we're going to see the exact same thing.
(01:24:35)
It's just more automation. It allows us
(01:24:37)
to write different kinds of programs
(01:24:38)
that we couldn't write before. But AI is
(01:24:40)
still fundamentally a program and um
(01:24:43)
it's a new kind of computer and a new
(01:24:45)
kind of um kind of computing system, but
(01:24:47)
it has all these problems. It's going to
(01:24:48)
diffuse over over time and it's still
(01:24:50)
going to add up to the same exponential
(01:24:52)
and we're still going to have an
(01:24:53)
exponential that's going to get
(01:24:54)
extremely vertical and it's going to be
(01:24:57)
very foreign to live in that kind of an
(01:24:59)
environment. Are you saying that like
(01:25:01)
what will happen is so if you go if you
(01:25:03)
look at the trend before the industrial
(01:25:04)
revolution to currently you have a hyper
(01:25:06)
exponential where you go from like 0%
(01:25:09)
growth to then 10,000 years ago 0.02%
(01:25:12)
growth and then currently we're at 2%
(01:25:14)
growth. So that's a hyper exponential
(01:25:15)
and you're saying if you're charting AI
(01:25:16)
on there then it's like AI takes you to
(01:25:18)
20% growth or 200% growth
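The growth rates mentioned here are easier to compare as doubling times. The snippet below is a minimal sketch, assuming a constant annual growth rate compounded yearly; the function name `doubling_time_years` is hypothetical and the rates are simply the ones quoted in the conversation (0.02%, 2%, 20%, 200%).

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years for output to double at a constant annual growth rate.

    Assumes yearly compounding: output after t years is (1 + r) ** t,
    so doubling takes t = ln(2) / ln(1 + r).
    """
    return math.log(2) / math.log(1 + annual_growth_rate)


# Rates quoted in the conversation, expressed as fractions.
for rate in (0.0002, 0.02, 0.20, 2.00):
    years = doubling_time_years(rate)
    print(f"{rate:.2%} growth -> doubles every {years:,.1f} years")
```

Under these assumptions, 2% growth doubles output roughly every 35 years, while 20% growth would double it in under four, which is the kind of regime change being debated.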
(01:25:20)
>> or you could be saying if you look at
(01:25:22)
the last 300 years what you've been
(01:25:24)
seeing is you have technology after
(01:25:25)
technology computers electrification
(01:25:27)
steam steam engines railways etc
(01:25:30)
>> but the rate of growth is the exact same
(01:25:32)
it's 2%. So are you saying the rate of
(01:25:34)
growth will
(01:25:36)
>> directly I expect this the rate of
(01:25:38)
growth has also stayed roughly constant
(01:25:40)
right
(01:25:40)
>> for only the last 200 300 years but over
(01:25:42)
the course of human history it's like
(01:25:44)
exploded right it's like gone from like
(01:25:45)
0% basically to like faster faster
(01:25:48)
faster industrial explosion 2%
(01:25:50)
>> like basically I guess what I'm saying
(01:25:51)
is for a while I tried to find AI or
(01:25:53)
look for AI in like the GDP curve and
(01:25:55)
I've kind of convinced myself that this
(01:25:56)
is false and that even when people talk
(01:25:58)
about recursive self-improvement and
(01:25:59)
labs and stuff like that I even don't
(01:26:01)
this is business as usual of course it's
(01:26:02)
going to recursively self-improve and
(01:26:04)
it's been recursively self-improving
(01:26:05)
like LLMs allow the engineers to work
(01:26:08)
much more efficiently to build the next
(01:26:10)
round of LLM and a lot more of the
(01:26:12)
components are being automated and and
(01:26:13)
tuned and etc. So all the engineers
(01:26:16)
having access to Google search is is
(01:26:18)
sort of part of it. All the engineers
(01:26:20)
having an IDE, all of them having
(01:26:22)
autocomplete or having Claude Code etc.
(01:26:23)
It's all just part of the same speed up
(01:26:26)
of the whole thing. So um it's just so
(01:26:29)
smooth.
(01:26:31)
>> But just just to clarify you're saying
(01:26:32)
that the rate of growth will not change
(01:26:34)
like um you know the intelligence
(01:26:36)
explosion will show up as like we it
(01:26:38)
just enabled us to continue staying on
(01:26:39)
the 2% growth trajectory just as the
(01:26:41)
internet helped us stay on the 2% growth
(01:26:42)
trajectory.
(01:26:43)
>> Yeah. My expectation is that it stays
(01:26:44)
the same pattern.
(01:26:46)
>> Yeah. I mean, um, ju just to throw the
(01:26:49)
opposite argument against you, my
(01:26:51)
expectation is that it like, um, blows
(01:26:54)
up because I think true AGI, and I'm not
(01:26:57)
talking about LLM coding bots, I'm
(01:26:58)
talking about like actual this is like a
(01:27:00)
replacement of a human in a server
(01:27:03)
>> is qualitatively different from these
(01:27:06)
other productivity improving
(01:27:07)
technologies
(01:27:09)
>> because it's labor itself, right? I
(01:27:11)
think we live in a very labor
(01:27:12)
constrained world. Like if you talk to
(01:27:14)
any startup founder, any person, you can
(01:27:15)
just be like, okay, what do you need
(01:27:16)
more of? You just like need really
(01:27:18)
talented people. And if you just have
(01:27:20)
billions of extra people who are
(01:27:22)
inventing stuff, integrating themselves,
(01:27:24)
making companies, bottoms, start to
(01:27:27)
finish, that feels qualitatively
(01:27:28)
different from just like a single
(01:27:30)
technology. It's sort of like just
(01:27:31)
asking if you like if you get 10 billion
(01:27:32)
extra people on the planet.
(01:27:33)
>> I mean, maybe a counterpoint. I mean,
(01:27:35)
number one, I I'm actually pretty um
(01:27:37)
pretty willing to be convinced one way
(01:27:39)
or another on this point. But I will
(01:27:40)
say, for example, computing is labor.
(01:27:42)
Computing was labor. Computers like a
(01:27:44)
lot of jobs disappeared because
(01:27:45)
computers are automating a bunch of
(01:27:47)
digital uh information processing that
(01:27:49)
you now don't need a human for. And so
(01:27:51)
computers are labor. Um and that has
(01:27:53)
played out. Um and you know,
(01:27:56)
self-driving as an example is also like
(01:27:57)
computers doing labor. Uh so like I
(01:28:00)
guess that's already been playing out.
(01:28:01)
So it's still business as usual. Yeah, I
(01:28:03)
guess you have a machine which just
(01:28:04)
spitting out more things like that
(01:28:06)
>> at potentially faster pace. And so we
(01:28:08)
historically we have examples of the
(01:28:10)
growth regime changing where like you
(01:28:12)
went from you know 0.2% growth to 2%
(01:28:14)
growth.
(01:28:15)
>> So it seems very plausible to me that
(01:28:17)
like
(01:28:17)
>> a machine which is then spitting out the
(01:28:21)
next self-driving car and the next
(01:28:22)
internet and whatever.
(01:28:23)
>> I mean I kind of yeah I see where it's
(01:28:26)
coming from. At the same time, I do feel
(01:28:27)
like people make this assumption of
(01:28:28)
like, okay, we have
(01:28:30)
>> uh God in a box and now it can do
(01:28:32)
everything and it's just it just won't
(01:28:33)
look like that. It's going to be it's
(01:28:35)
going to be able to do some of the
(01:28:36)
things. It's going to fail at some other
(01:28:37)
things. It's going to be gradually put
(01:28:39)
into society and basically we'll end up
(01:28:40)
with the same pattern is my prediction
(01:28:42)
because because this assumption of
(01:28:44)
suddenly having a completely intelligent
(01:28:46)
uh fully flexible, fully general human
(01:28:48)
uh in a box and we can dispense it to
(01:28:49)
arbitrary problems in society. I I I
(01:28:52)
don't think that we will have this like
(01:28:54)
discrete change and um and so I I think
(01:28:58)
we'll arrive at the same at the same
(01:29:00)
kind of a gradual diffusion of this
(01:29:02)
across the industry. M I I I think what
(01:29:05)
often ends up being misleading in these
(01:29:07)
um conversations is people I don't like
(01:29:10)
to use the word intelligence in this
(01:29:11)
context because intelligence implies you
(01:29:13)
think like oh super int super super
(01:29:15)
intelligence will be sitting there will
(01:29:16)
be a single super intelligence sitting
(01:29:17)
in a server and it'll like divine how to
(01:29:19)
come up with new technologies and
(01:29:20)
inventions that causes this explosion
(01:29:23)
>> and that's not what I'm imagining when
(01:29:24)
I'm imagining 20% growth
(01:29:26)
>> I'm imagining that there's billions of
(01:29:30)
you know basically like very smart human
(01:29:33)
minds potentially or that's all that's
(01:29:34)
required. But the fact that there's
(01:29:36)
hundreds of millions of them, billions
(01:29:38)
of them, each individually making new
(01:29:41)
products, figuring out how to integrate
(01:29:42)
themselves into the economy, just the
(01:29:44)
way if like a highly experienced smart
(01:29:46)
immigrant came to the country, you
(01:29:47)
wouldn't need to like figure out how we
(01:29:48)
integrate them in the economy. They
(01:29:49)
figured out they could start a company,
(01:29:50)
they could like uh make inventions, you
(01:29:53)
know, or like just increase productivity
(01:29:54)
in the world. And we have examples even
(01:29:56)
in the current regime of places that
(01:29:58)
have had 10 20% economic growth. you
(01:30:01)
know, if you just have a lot of people
(01:30:03)
and less capital in comparison to the
(01:30:05)
people, you can have Hong Kong or
(01:30:08)
Shenzhen or whatever just had decades of
(01:30:11)
10% plus growth. It and I think it's
(01:30:13)
just like there's a lot of really smart
(01:30:15)
people who are ready to like make use of
(01:30:16)
the resources and do this like period of
(01:30:19)
catchup because we've had this
(01:30:20)
discontinuity. And I think yeah, it
(01:30:22)
might be similar. So, I think um I I
(01:30:24)
think I understand, but I still think
(01:30:26)
that you're presupposing some discrete
(01:30:27)
jump. There's some unlock that we're
(01:30:29)
waiting to claim
(01:30:30)
>> and suddenly we're going to have
(01:30:31)
geniuses in data centers. And I I still
(01:30:34)
think you're presupposing some discrete
(01:30:36)
jump that I think has basically no
(01:30:37)
historical precedent that I can't find
(01:30:39)
in any of the statistics and that I
(01:30:41)
think probably won't happen.
(01:30:42)
>> I mean, the industrial revolution is
(01:30:43)
such a jump, right? You went from like
(01:30:44)
0% growth or 0.2% growth to 2% growth. Um
(01:30:47)
I'm just saying like you'll see another
(01:30:48)
jump like that.
(01:30:49)
>> I I I'm a little bit suspicious. I would
(01:30:51)
have to look at it. I I'm a little bit
(01:30:52)
suspicious and I would have to take a
(01:30:54)
look. For example, like maybe the some
(01:30:55)
of the logs are are not very good from
(01:30:57)
before the industrial revolution or
(01:30:58)
something like that. Uh so I'm a little
(01:31:00)
bit suspicious of it, but um yeah, maybe
(01:31:02)
you're right. I don't I don't have
(01:31:03)
strong opinions.
(01:31:04)
>> Maybe you're saying that this was a
(01:31:06)
singular event that was extremely
(01:31:07)
magical and you're saying that maybe
(01:31:08)
there's going to be another event that's
(01:31:09)
going to be just like that, extremely
(01:31:10)
magical. It will break paradigm and so
(01:31:13)
on.
(01:31:13)
>> I actually don't think they I mean the
(01:31:14)
crucial thing about the industrial
(01:31:15)
revolution was that it was not magical,
(01:31:17)
right? Like if you just zoomed in
(01:31:20)
>> what you would see in 1770 or 1870,
(01:31:25)
>> it's not that there like was some key
(01:31:27)
invention.
(01:31:28)
>> Yeah, exactly. But at the same time, you
(01:31:30)
did move the economy to a regime where
(01:31:32)
the progress was much faster
(01:31:34)
>> and the exponential 10xed
(01:31:36)
>> and I expect a similar thing from AI
(01:31:38)
where it's not like
(01:31:39)
>> there's going to be a single moment
(01:31:40)
where we made the crucial
(01:31:42)
>> overhang that's being unlocked like
(01:31:44)
maybe there's a new energy source
(01:31:45)
there's there's some unlock in this case
(01:31:47)
some kind of a cognitive capacity and
(01:31:49)
there's an overhang of cognitive
(01:31:50)
cognitive work to do. That's right.
(01:31:52)
>> And you're expecting that overhang to be
(01:31:54)
filled by this new technology when it
(01:31:55)
crosses the threshold.
(01:31:56)
>> Yeah. And I mean I maybe one way to
(01:31:57)
think about it is through history a lot
(01:31:59)
of growth I mean growth comes because
(01:32:02)
people come up with ideas and then
(01:32:03)
people are like out there doing stuff to
(01:32:06)
execute those ideas and make valuable
(01:32:08)
output
(01:32:09)
>> and through most of this time population
(01:32:11)
has been exploding. That has been driving
(01:32:12)
growth. For the last 50 years people have
(01:32:14)
argued that growth has stagnated.
(01:32:16)
Population in frontier countries has
(01:32:18)
also stagnated. I think we go back to the
(01:32:20)
hyperexponential growth in population and
(01:32:22)
output
(01:32:23)
>> right sorry exponential growth in
(01:32:24)
population that causes hyperexponential
(01:32:26)
growth in output.
(01:32:27)
>> Yeah. I mean, um, yeah, it's really hard
(01:32:29)
to tell.
(01:32:30)
>> I understand that viewpoint. I don't
(01:32:32)
intuitively feel that viewpoint.
(01:32:34)
>> So, we just got access to Google's Veo
(01:32:37)
3.1, and it's been really cool to play
(01:32:40)
around with. The first thing we did was
(01:32:42)
run a bunch of prompts through both Veo 3
(01:32:44)
and 3.1 to see what's changed in the new
(01:32:47)
version. So, here's Veo 3.
(01:32:50)
>> Hi, I'm Max and I got stuck in a local
(01:32:52)
minimum again.
(01:32:53)
>> It's okay, Max. We've all been there.
(01:32:55)
Took me three epochs to get out.
(01:32:57)
>> And here is Veo 3.1.
(01:32:59)
>> Hi, I'm Max and I got stuck in a local
(01:33:02)
minimum again.
(01:33:03)
>> It's okay, Max. We've all been there.
(01:33:05)
Took me three epochs.
(01:33:07)
>> 3.1's output is just consistently more
(01:33:10)
coherent and the audio is noticeably
(01:33:12)
higher quality. We've been using Veo for
(01:33:14)
a while now. Actually, we released an
(01:33:16)
essay earlier this year about AI firms
(01:33:18)
fully animated by Veo 2, and it's been
(01:33:20)
amazing to see how fast these models are
(01:33:23)
improving. This update makes Veo even
(01:33:25)
more useful in terms of animating our
(01:33:28)
ideas and our explainers. You can try Veo
(01:33:30)
right now in the Gemini app with Pro and
(01:33:33)
Ultra subscriptions. You can also access
(01:33:35)
it through the Gemini API or through
(01:33:37)
Google Flow. You recommended Nick Lane's
(01:33:40)
book to me and then on that basis I I
(01:33:42)
also find it super interesting and I
(01:33:44)
interviewed him. Um and so I actually
(01:33:46)
have some questions about sort of
(01:33:46)
thinking about intelligence and
(01:33:47)
evolutionary history. Now that you over
(01:33:50)
the last 20 years of doing AI research,
(01:33:52)
you maybe have a more tangible sense of
(01:33:54)
what intelligence is, what it takes to
(01:33:57)
develop it. Are you more or less
(01:34:00)
surprised as a result that evolution
(01:34:02)
just sort of spontaneously
(01:34:05)
stumbled upon it?
(01:34:07)
>> Um, I love Nick's books by the way. So,
(01:34:10)
um, yeah, I was just listening to to his
(01:34:12)
podcast on the way up here. With respect
(01:34:14)
to intelligence and its evolution, I do
(01:34:16)
claim it came fairly
(01:34:18)
>> I mean it's very very recent, right? Um
(01:34:21)
I am surprised that it evolved. Yeah, I
(01:34:23)
I find it fascinating to think about all
(01:34:24)
the worlds out there. Like say there's a
(01:34:26)
thousand planets like Earth and what
(01:34:27)
they look like. I think Nick Lane was here
(01:34:28)
talking about some of the early parts,
(01:34:30)
right? Like
(01:34:30)
>> okay, he expects basically very similar
(01:34:33)
life forms roughly speaking and bacteria
(01:34:35)
like things in most of them.
(01:34:36)
>> Yeah.
(01:34:36)
>> And then there's a few breaks in there.
(01:34:39)
I would expect that um the evolution of
(01:34:41)
intelligence intuitively feels to me
(01:34:42)
like it should be fairly rare event and
(01:34:44)
there have been animals for I guess
(01:34:46)
maybe you should base it on how long
(01:34:47)
some something has existed. So for
(01:34:49)
example, if bacteria have been around
(01:34:50)
for 2 billion years and nothing happened
(01:34:52)
then going to eukaryotes is probably
(01:34:53)
pretty hard cuz um cuz bacteria actually
(01:34:56)
um came up quite early in Earth's
(01:34:58)
evolution or history.
(01:35:00)
>> Um
(01:35:01)
>> and so I guess um how long have we had
(01:35:03)
animals? Maybe a couple hundred million
(01:35:04)
years like multicellular animals that
(01:35:06)
like run crawl etc.
(01:35:08)
um which is maybe 10% of um Earth's
(01:35:11)
lifespan or something like that. So I
(01:35:13)
maybe on that time scale is actually not
(01:35:15)
not too tricky. I still feel like
(01:35:18)
it's still surprising to me I think
(01:35:19)
intuitively that it developed. I would
(01:35:21)
maybe expect just a lot of like
(01:35:22)
animal-like life forms doing animal-like
(01:35:24)
things. Uh the fact that you can get
(01:35:26)
something that creates culture and
(01:35:27)
knowledge Yeah. and accumulates it is is
(01:35:29)
it is surprising to me that okay so
(01:35:32)
there's so there's actually a couple of
(01:35:33)
interesting follow-ups.
(01:35:35)
if you buy this uh Sutton perspective that
(01:35:38)
actually the crux of intelligence is
(01:35:41)
animal intelligence. What the quote said
(01:35:42)
is if you got to the squirrel you'd be
(01:35:44)
most of the way to AGI. Um
(01:35:46)
>> then we got to squirrel intelligence I
(01:35:48)
guess right after the Cambrian explosion
(01:35:50)
600 million years ago.
(01:35:51)
>> It seems like what instigated that was
(01:35:54)
the oxygenation event 600 million years
(01:35:56)
ago. But immediately the sort of like
(01:35:57)
intelligence algorithm was there to like
(01:35:59)
make the the squirrel intelligence,
(01:36:02)
right? So it's suggestive that animal
(01:36:04)
intelligence was like that as soon as
(01:36:07)
you had the oxygen in the environment
(01:36:08)
you had the eukaryote, you could just like
(01:36:10)
get the algorithm. Um I maybe there was
(01:36:13)
like sort of an accident that evolution
(01:36:15)
stumbled upon it so fast but I don't know
(01:36:16)
if that suggests it's like actually quite
(01:36:18)
uh at the end going to be quite simple.
(01:36:20)
>> Yes basically it's so hard to tell right
(01:36:22)
with any of this stuff. I guess you can
(01:36:23)
base it a little bit on how long
(01:36:25)
something has existed or how long it
(01:36:26)
feels like something has been
(01:36:27)
bottlenecked. So Nick Lane is very good at describing
(01:36:30)
this like very apparent bottleneck in
(01:36:32)
bacteria for years like extreme
(01:36:35)
diversity of chemical biochemistry and
(01:36:38)
yet nothing that grows to become
(01:36:41)
>> animals two billion years um I I don't
(01:36:44)
know that we've seen exactly that kind
(01:36:46)
of an equivalent with animals and
(01:36:47)
intelligence uh to your point right but
(01:36:49)
I guess maybe we could also look at it
(01:36:51)
with respect to how many times we think
(01:36:52)
intelligence has like individually
(01:36:55)
sprung up
(01:36:56)
>> that's a really good that's a really
(01:36:57)
good thing to investigate.
(01:36:58)
>> Maybe one thought on that is I almost
(01:37:00)
feel like um well there's the hominid
(01:37:03)
intelligence and there's I would say
(01:37:04)
like the bird intelligence right like
(01:37:06)
ravens etc are extremely clever but uh
(01:37:08)
they their brain brain parts are
(01:37:10)
actually quite distinct and we don't
(01:37:11)
have that much um
(01:37:13)
>> existence so maybe that's a slight
(01:37:15)
event of there's a slight indication of
(01:37:17)
maybe intelligence springing up a few
(01:37:18)
times and so in that case you'd maybe
(01:37:20)
expect it more frequently or something
(01:37:21)
like that. Yeah, a former guest Gwern and
(01:37:25)
also Carl Shulman have made a
(01:37:27)
really interesting point about that
(01:37:28)
which is their perspective is that the
(01:37:32)
scalable algorithm which humans have and
(01:37:34)
primates have
(01:37:35)
>> arose in birds as well
(01:37:38)
>> and maybe other times as well. But in
(01:37:41)
humans found an evolutionary niche which
(01:37:43)
rewarded marginal increases in
(01:37:45)
intelligence.
(01:37:46)
>> Um and also had a scalable brain
(01:37:49)
algorithm that could achieve those
(01:37:51)
increases in intelligence.
(01:37:52)
>> The and so for example if a bird had a
(01:37:55)
bigger brain it would just like collapse
(01:37:56)
out of the air. So it's very smart for
(01:37:58)
the size of its brain but it's like it's
(01:38:00)
not in a niche which rewards the brain
(01:38:02)
getting bigger.
(01:38:03)
>> Um yeah
(01:38:04)
>> maybe similar with some really smart
(01:38:06)
>> dolphins etc.
(01:38:07)
>> Exactly. Yeah. Whereas humans, you know,
(01:38:09)
like we have hands that like reward
(01:38:11)
being able to learn how to do tool use.
(01:38:12)
We can externalize digestion, more
(01:38:14)
energy to the brain
(01:38:15)
>> and that um kicks off the flywheel.
(01:38:17)
>> Oh, yeah. And just stuff to work with. I
(01:38:19)
mean, I'm guessing it would be harder to
(01:38:20)
if I was a dolphin.
(01:38:22)
>> Um I mean, how do you do you can't have
(01:38:24)
fire for example and stuff like that? I
(01:38:25)
mean, the probably like the universe of
(01:38:27)
things you can do in water um like
(01:38:29)
inside water is probably lower than what
(01:38:31)
you can do on land um just chemically,
(01:38:33)
>> right? Yeah, I do I do agree with this
(01:38:35)
with this viewpoint of these niches and
(01:38:36)
what's what's being incentivized. I
(01:38:38)
still find it kind of miraculous that uh
(01:38:41)
I don't I I would have maybe expected
(01:38:43)
things to get stuck on like animals with
(01:38:45)
bigger muscles, you know?
(01:38:46)
>> Yeah.
(01:38:47)
>> Like going through intelligence is
(01:38:48)
actually a really fascinating uh
(01:38:51)
breaking point. The way Gwern put it
(01:38:52)
is the reason it was so hard is is a
(01:38:55)
very tight line between being in a
(01:38:56)
situation where something is so
(01:38:59)
important to learn
(01:39:01)
that it's not just worth distilling the
(01:39:03)
exact right circuits directly back into
(01:39:06)
your DNA
(01:39:07)
>> versus it's not important enough to
(01:39:09)
learn at all.
(01:39:10)
>> Yeah.
(01:39:10)
>> It has to be something which is like
(01:39:12)
>> you have to to incentivize building the
(01:39:15)
algorithm to learn in lifetime.
(01:39:17)
>> Yeah. Exactly. You have to incentivize
(01:39:18)
some kind of adaptability. You actually
(01:39:19)
want something that you actually want
(01:39:21)
environments that are unpredictable. So
(01:39:22)
evolution can't bake your algorithms
(01:39:24)
into your weights. A lot of um a lot of
(01:39:26)
animals are basically pre-baked in this
(01:39:28)
sense and so humans have to figure it
(01:39:30)
out at test time when they get born. And
(01:39:31)
so maybe there was um you actually want
(01:39:34)
these kinds of uh environments that
(01:39:36)
actually change really rapidly or
(01:39:37)
something like that where you can't
(01:39:38)
foresee um what will work well and so
(01:39:40)
you actually put all that intelligent
(01:39:42)
you create intelligence to figure it out
(01:39:43)
at test time. Uh so Quintin Pope had
(01:39:46)
this interesting blog post where he's
(01:39:47)
saying the reason he doesn't expect a
(01:39:49)
sharp takeoff is um the so humans had
(01:39:53)
the sharp takeoff where 60,000 years ago
(01:39:55)
we seem to have had the cognitive
(01:39:56)
architectures that we have today
(01:39:59)
>> and 10,000 years ago agricultural
(01:40:00)
revolution modernity dot dot dot. What
(01:40:03)
was happening in that 50,000 years?
(01:40:04)
>> Well, you had to build this sort of like
(01:40:06)
cultural scaffold where you can
(01:40:08)
accumulate knowledge over generations.
(01:40:12)
This is an ability that exists for free
(01:40:14)
in the way we do AI training where if
(01:40:17)
you retrain a model it can still I mean
(01:40:19)
in many cases they're literally
(01:40:20)
distilled but they can be trained on
(01:40:22)
each other you know they can be trained
(01:40:23)
on the same pre-training corpus um
(01:40:25)
they don't literally have to start from
(01:40:27)
scratch so there's a sense in which the
(01:40:29)
thing which it took humans a long time
(01:40:31)
to get this cultural loop going just
(01:40:33)
comes for free with the way we do LLM
(01:40:35)
training. Um, yes and no because LLMs
(01:40:38)
don't really have the equivalent of
(01:40:39)
culture and maybe we're giving them way
(01:40:40)
too much and incentivizing not to create
(01:40:42)
it or something like that. But I guess
(01:40:44)
like the notion of culture and of
(01:40:45)
written record and of like passing down
(01:40:47)
notes between each other. I don't think
(01:40:49)
there's an equivalent of that with LLMs
(01:40:50)
right now. So LLMs don't really have
(01:40:52)
culture right now and it's kind of like
(01:40:54)
one of the I think uh impediments I
(01:40:56)
would say. Can
(01:40:57)
>> can you give me some sense of what LLM
(01:40:59)
culture might look like? Uh, so in the
(01:41:01)
simplest case, it would be a giant
(01:41:02)
scratch pad that the LLM can edit. And
(01:41:04)
as it's reading stuff or as it's helping
(01:41:06)
out with work, it's editing the scratch
(01:41:08)
pad for itself.
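The scratch pad idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `Scratchpad` and `toy_model_step` are hypothetical names, and the "model" is a stand-in function rather than a real LLM call. It only shows the loop described here, where notes written while doing one task are visible when doing the next.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scratchpad:
    """A persistent list of notes that the model is allowed to edit."""
    notes: List[str] = field(default_factory=list)

    def edit(self, note: str) -> None:
        # Append a note, skipping exact duplicates.
        if note not in self.notes:
            self.notes.append(note)

    def as_context(self) -> str:
        # Render the notes so they can be prepended to the next prompt.
        return "\n".join(f"- {n}" for n in self.notes)


def toy_model_step(task: str, pad: Scratchpad) -> str:
    """Hypothetical stand-in for an LLM call.

    Builds a prompt that includes the current notes, then records a
    note about the task so later steps can see it.
    """
    prompt = f"Notes so far:\n{pad.as_context()}\n\nTask: {task}"
    pad.edit(f"worked on: {task}")
    return prompt


pad = Scratchpad()
toy_model_step("read chapter 1", pad)
print(toy_model_step("summarize chapter 1", pad))
```

The second prompt contains the note left by the first step, which is the minimal sense in which knowledge accumulates across tasks.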
(01:41:09)
>> Why can't an LLM write a book for the
(01:41:10)
other LLMs? That would be cool.
(01:41:12)
>> Yeah.
(01:41:13)
>> Like why can't other LLMs read this
(01:41:14)
LLM's book and be inspired by it or
(01:41:18)
shocked by it or something like that?
(01:41:19)
There's no equivalent for any of this
(01:41:20)
stuff.
(01:41:20)
>> Interesting. When would you expect that
(01:41:22)
kind of thing to start happening? And
(01:41:24)
more general question about like
(01:41:25)
multi-agent systems and a sort of like
(01:41:27)
independent AI. Yeah, civilization and
(01:41:30)
culture.
(01:41:31)
>> I think there's two powerful ideas in
(01:41:33)
the realm of multi-agent that have both
(01:41:34)
not been like really claimed or or so
(01:41:36)
on. The first one I would say is culture
(01:41:39)
and LLMs basically growing a uh
(01:41:41)
repertoire of knowledge uh for their own
(01:41:43)
purposes.
(01:41:44)
>> Uh the second one looks a lot more like
(01:41:46)
uh the powerful idea of self-play. Uh in
(01:41:48)
my mind it's extremely powerful. So
(01:41:49)
evolution actually is a lot of um
(01:41:52)
competition basically driving
(01:41:53)
intelligence and and evolution. Um and
(01:41:57)
uh for in AlphaGo more algorithmically
(01:41:59)
like AlphaGo is playing against itself
(01:42:01)
and that's how it learns to get really
(01:42:03)
good at Go and there's no equivalent of
(01:42:05)
self-playing LLMs but I would expect that
(01:42:07)
to also exist but no one has done it yet
(01:42:09)
like why can't an LLM for example create
(01:42:10)
a bunch of problems that another LLM is
(01:42:13)
learning to solve and then the LLM is
(01:42:15)
always trying to like serve more and
(01:42:16)
more difficult problems stuff like that
(01:42:18)
you know so like
(01:42:19)
>> I think there's a bunch of ways to
(01:42:21)
actually organize it um and I think it's
(01:42:22)
a realm of research uh but I think I
(01:42:24)
haven't seen anything that convincingly
(01:42:26)
like claims both of those
(01:42:28)
>> like multi-agent uh improvements. I
(01:42:30)
still think we're mostly in the realm of
(01:42:31)
a single individual agent, but I think I
(01:42:34)
also think that will change and and um
(01:42:36)
in the realm of culture also I would
(01:42:38)
bucket also organizations and we haven't
(01:42:40)
seen anything like that coming in
(01:42:41)
either.
(01:42:42)
>> Um so that's why we're still early.
(01:42:44)
>> And can you identify the key bottleneck
(01:42:46)
that's uh preventing this kind of
(01:42:49)
collaboration between LLMs? Maybe like
(01:42:51)
the way I would put it is
(01:42:54)
somehow remarkably again some of these
(01:42:55)
analogies work and they shouldn't but
(01:42:57)
somehow remarkably they do a lot of the
(01:42:58)
smaller models or the dumber like the
(01:43:00)
smaller models somehow remarkably
(01:43:02)
resemble like a kindergarten student or
(01:43:04)
then like a elementary school student or
(01:43:06)
high school student etc. And somehow we
(01:43:08)
still haven't like graduated enough
(01:43:09)
where the stuff can take over like it's
(01:43:11)
still mostly like my Claude Code or
(01:43:13)
Codex, they still kind of feel like this
(01:43:16)
elementary grade student. I know that
(01:43:18)
they can take PhD quizzes, but they
(01:43:19)
still cognitively feel like a
(01:43:22)
kindergarten or an elementary
(01:43:23)
school student. So, I don't think they
(01:43:24)
can create culture because they're still
(01:43:26)
kids. Um, you know,
(01:43:28)
>> like they're savant kids. Um, they have
(01:43:30)
episodic, they have perfect memory of
(01:43:32)
all this stuff, etc. And they can, uh,
(01:43:34)
convincingly create all kinds of slop
(01:43:35)
that looks really good.
(01:43:37)
>> But I still think they don't really know
(01:43:38)
what they're doing and they don't really
(01:43:39)
have the cognition uh, across all these
(01:43:41)
little check boxes that we still have to
(01:43:43)
collect.
(01:43:43)
>> Yeah. So, you've talked about how you
(01:43:46)
were at Tesla leading self-driving from
(01:43:49)
2017 to 2022 and then you firsthand saw
(01:43:53)
this progress from we went from cool
(01:43:55)
demos to now thousands of cars out there
(01:43:58)
actually autonomously doing drives. Why
(01:44:00)
did that take a decade? Like what was
(01:44:02)
happening through that time?
(01:44:03)
>> Yeah. Uh so I would say one thing I will
(01:44:05)
almost instantly also push back on is
(01:44:07)
this is not even near done.
(01:44:10)
>> So in a bunch of ways that I'm going to
(01:44:12)
get to. I do think that uh self-driving
(01:44:14)
is very interesting because uh it's
(01:44:16)
definitely like where I get a lot of my
(01:44:17)
intuitions because I spent 5 years on
(01:44:19)
it. Um and it has this entire history
(01:44:21)
where actually the first demos of
(01:44:23)
self-driving go all the way back to the 1980s.
(01:44:25)
>> You can see a demo from CMU in 1986
(01:44:28)
there's a truck that's driving itself on
(01:44:30)
roads. Um but okay fast forward I think
(01:44:33)
when I was joining Tesla I had um I had
(01:44:35)
a very early demo of a Waymo and it
(01:44:38)
basically gave me a perfect drive uh in
(01:44:41)
2014 or something like that. So
(01:44:44)
perfect Waymo drive a decade ago. Uh, it took
(01:44:47)
us around Palo Alto and so on because
(01:44:48)
I had a friend who worked there. Um and
(01:44:51)
I thought it was like very close and
(01:44:52)
then still took a long time and I do
(01:44:54)
think that some there's for some kinds
(01:44:56)
of um tasks and jobs and so on uh the
(01:44:59)
there's a very large demo-to-product gap
(01:45:02)
where the demo is very easy but the
(01:45:03)
product is very hard. Um, and it's
(01:45:06)
especially the case in cases like
(01:45:07)
self-driving where the the cost of
(01:45:10)
failure is too high, right? Many
(01:45:12)
industries, tasks and jobs maybe
(01:45:14)
don't have that property, but when you
(01:45:15)
do have that property, that definitely
(01:45:17)
increases the timelines. I do think that
(01:45:19)
for example in software engineering, I
(01:45:20)
do actually think that that property
(01:45:22)
does exist. I think for a lot of vibe
(01:45:24)
coding it doesn't but I think if you're
(01:45:25)
writing actual production grade code I
(01:45:27)
think that property should exist because
(01:45:28)
any kind of mistake actually leads to a
(01:45:30)
security vulnerability or something like
(01:45:32)
that and millions and hundreds of
(01:45:33)
millions of people's personal social
(01:45:35)
security numbers etc get leaked or
(01:45:37)
something like that and so I do think
(01:45:38)
that it is a case that in software
(01:45:40)
people should be careful um kind of like
(01:45:43)
in self-driving um like in self-driving
(01:45:45)
if things go wrong you might
(01:45:46)
get an injury, um, I guess there's worse
(01:45:49)
outcomes but I guess in in software I
(01:45:51)
almost feel like it's almost unbounded
(01:45:53)
how terrible some things could be.
(01:45:56)
>> Interesting.
(01:45:57)
>> So I do think that they share that
(01:45:58)
property. And then I think basically
(01:46:00)
what takes the long amount of time and
(01:46:01)
the way to think about it is that it's a
(01:46:04)
march of nines and every single nine is
(01:46:06)
a constant amount of work. So every
(01:46:09)
single nine is the same amount of work.
(01:46:10)
So when you get a demo and something
(01:46:12)
works 90% of the time, that's just uh
(01:46:15)
that's just the first nine and
(01:46:17)
then you need the second nine and third
(01:46:18)
nine, fourth nine, fifth nine. And while
(01:46:19)
I was at Tesla for was it five years or
(01:46:21)
so. I think we went through maybe three
(01:46:22)
nines or two nines. I don't know what it
(01:46:24)
is, you know, but like multiple nines of
(01:46:25)
iteration, there's still more nines to
(01:46:27)
go. And so that's why these things take
(01:46:28)
so long. Um, and so it's definitely
(01:46:32)
formative for me like seeing something
(01:46:34)
that was a demo. I'm very unimpressed by
(01:46:36)
demos. Um, so whenever I see demos of
(01:46:38)
anything, I'm extremely unimpressed by
(01:46:40)
that. Um, it works better if you can um
(01:46:43)
if it's a demo that someone cooked up
(01:46:45)
and is just showing you it's worse. If
(01:46:46)
you can interact with it, it's a bit
(01:46:47)
better. But even then, you're not done.
(01:46:49)
You need actual product. It's going to
(01:46:50)
face all these challenges when it
(01:46:52)
comes in contact with reality and all
(01:46:53)
these different pockets of behavior that
(01:46:55)
need patching. And so I think we're
(01:46:56)
going to see all this stuff play out.
(01:46:58)
It's a march of nines. Each nine is
(01:46:59)
constant. Uh demos are encouraging.
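Illustration (not part of the conversation): the "march of nines" described above can be made concrete with a little arithmetic. The sketch below takes the speaker's claim at face value, namely that each additional nine of reliability costs the same fixed amount of work. The unit `WORK_PER_NINE` and all function names are hypothetical, chosen only for this example.

```python
import math

WORK_PER_NINE = 1.0  # assumption: arbitrary unit of effort per nine


def nines(success_rate: float) -> float:
    """Return reliability in 'nines': 0.9 -> 1, 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - success_rate)


def failures_per_million(n_nines: int) -> float:
    """Expected failures per million attempts at n nines of reliability."""
    return 1_000_000 / 10 ** n_nines


def cumulative_work(n_nines: int) -> float:
    """Total effort to reach n nines if every nine costs the same."""
    return n_nines * WORK_PER_NINE


if __name__ == "__main__":
    for n in range(1, 6):
        success = 1 - 10 ** -n
        print(f"{n} nine(s): {success:.5%} success, "
              f"{failures_per_million(n):>9,.0f} failures per million, "
              f"cumulative work {cumulative_work(n):.0f}")
```

Under this assumption, a demo that works 90% of the time is one unit of work out of the five needed for 99.999%, even though it already looks like it mostly works.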
(01:47:02)
Still a huge amount of work to do. Uh I
(01:47:04)
do think it is a um kind of a critical
(01:47:06)
safety domain unless you're doing vibe
(01:47:08)
coding, which is all nice and fun and so
(01:47:10)
on. And uh so that's why I think this
(01:47:13)
also informs my timelines from that
(01:47:14)
perspective. Hm. That's that's very
(01:47:17)
interesting to hear you say that the
(01:47:18)
sort of safety guarantees you need from
(01:47:20)
software are actually not dissimilar to
(01:47:23)
self-driving because what people will
(01:47:24)
often say is that self-driving took so
(01:47:25)
long because the cost of failure is so
(01:47:29)
high. Like a human makes a mistake on
(01:47:31)
the average every 400,000 miles or every
(01:47:33)
seven years. And if you had to release a
(01:47:35)
coding agent that couldn't make a
(01:47:37)
mistake for at least seven years, it
(01:47:40)
would be much harder to deploy. But I
(01:47:42)
guess your point is that if you made a
(01:47:43)
catastrophic coding mistake like yeah
(01:47:46)
>> breaking some important system every
(01:47:47)
seven years
(01:47:48)
>> very easy to do
(01:47:49)
>> and in fact in terms of sort of wall
(01:47:50)
clock time it would be much less
(01:47:52)
than seven years because you're like
(01:47:53)
constantly outputting code like that
(01:47:55)
right so like per tokens or in terms of
(01:47:58)
tokens it would be seven years but in
(01:47:59)
terms of wall clock time
(01:48:00)
>> in some way it's a much harder problem I
(01:48:01)
mean self-driving is just one of
(01:48:03)
thousands of things that people do it's
(01:48:05)
almost like a single vertical I suppose
(01:48:07)
um whereas when we're talking about
(01:48:08)
general software engineering it's even
(01:48:09)
more, there's more surface area. Yeah,
(01:48:11)
>> there's another uh objection people make
(01:48:14)
to that analogy,
(01:48:16)
>> which is that
(01:48:17)
>> with self-driving, what took a big
(01:48:19)
fraction of that time was solving the
(01:48:21)
problem of building basic uh having
(01:48:24)
basic perception that's robust and
(01:48:26)
building representations and having a
(01:48:28)
model that has some common sense so it
(01:48:31)
can generalize to when I see something
(01:48:32)
that's slightly out of distribution. If
(01:48:34)
somebody's waving down the road this
(01:48:37)
way, you don't need to train for it. the
(01:48:38)
thing will uh have some understanding of
(01:48:41)
how to respond to something like that.
(01:48:43)
>> And these are things we're getting for
(01:48:44)
free with LLMs or VLMs today. So we
(01:48:47)
don't have to solve these very basic
(01:48:48)
representation problems. And so now
(01:48:51)
deploying AI across different domains
(01:48:52)
will sort of be like deploying a
(01:48:54)
self-driving car with current models to
(01:48:55)
a different city which is hard but not
(01:48:57)
like a 10 year long task.
(01:48:59)
>> Yeah. Basically I'm not 100% sure if I
(01:49:01)
fully agree with that. I don't know
(01:49:02)
how much we're getting for free
(01:49:03)
and I still think there's like a lot of
(01:49:05)
gaps in understanding in what we are
(01:49:06)
getting. Um I mean we're definitely
(01:49:08)
getting more generalizable intelligence
(01:49:10)
in a single entity. Uh whereas uh
(01:49:12)
self-driving is a very special purpose
(01:49:14)
task that requires in some sense
(01:49:16)
building a special purpose task is maybe
(01:49:17)
even harder in a certain sense because
(01:49:19)
it doesn't like fall out from a more
(01:49:20)
general thing that you're doing at scale
(01:49:21)
if that makes sense. So, um, but I still
(01:49:25)
think that the analogy doesn't, uh, I
(01:49:26)
still don't know if it fully resonates
(01:49:28)
because, um, like the LLMs are still
(01:49:30)
pretty fallible and I still think that
(01:49:32)
they have a lot of gaps and that it
(01:49:33)
still needs to be filled in. And I don't
(01:49:34)
think that we're getting like magical
(01:49:35)
generalization completely out of the
(01:49:36)
box, uh, sort of in in a certain sense.
(01:49:39)
And the other aspect that I wanted to
(01:49:40)
also actually return to when I was uh,
(01:49:43)
in the in the beginning was uh,
(01:49:45)
self-driving cars are nowhere near done
(01:49:46)
still.
(01:49:48)
>> So even though um, so the deployments
(01:49:49)
still are pretty minimal, right? Uh so
(01:49:51)
even Waymo and so on has very few cars
(01:49:53)
and they're doing that roughly speaking
(01:49:54)
because they're not economical, right?
(01:49:56)
Um because they've built something that
(01:49:58)
that lives in the future. Um and so they
(01:50:00)
they had to like pull back the future but
(01:50:02)
they had to make it uneconomical. So
(01:50:04)
they have all these like um you know
(01:50:06)
there's all these costs not just
(01:50:07)
marginal costs for those cars and their
(01:50:09)
operation and maintenance but also the
(01:50:11)
capex of the entire thing.
(01:50:12)
>> Um so making it economical is still going
(01:50:14)
to be a slog I think uh for them. Um,
(01:50:17)
and then also I think when you look at
(01:50:19)
these cars and there's no one driving,
(01:50:21)
um, I also think it's a little bit
(01:50:22)
deceiving because there are actually
(01:50:23)
very elaborate operation centers of
(01:50:27)
people actually kind of like in a loop
(01:50:28)
with these cars. And I don't have the I
(01:50:30)
don't know the full extent of it, but I
(01:50:31)
think um
(01:50:32)
>> there's more human in the loop than you
(01:50:34)
might expect and there's people
(01:50:35)
somewhere out there basically beaming in
(01:50:37)
from the sky.
(01:50:38)
>> Uh, and uh, I don't actually know if
(01:50:40)
they're fully in the loop with the
(01:50:41)
driving. I think some of the times they
(01:50:42)
are but they're certainly involved and
(01:50:44)
there are people and in some sense we
(01:50:45)
haven't actually removed the person
(01:50:46)
we've like moved them to somewhere where
(01:50:48)
we can't see them. I still think there
(01:50:49)
will be some work as you mentioned going
(01:50:50)
from environment to environment and uh
(01:50:52)
so I think like there's still challenges
(01:50:54)
to to make self driving real but I I do
(01:50:57)
agree that it's definitely crossed a
(01:50:58)
threshold where it kind of feels real
(01:51:00)
unless it's like really teleoperated. Um, for
(01:51:02)
example Waymo can't go to all the
(01:51:04)
different parts of the city. My
(01:51:06)
suspicion is it's like parts of the city
(01:51:07)
where you don't get good signal
(01:51:09)
>> anyway. So basically I don't actually
(01:51:11)
know anything about the stack. I mean
(01:51:13)
I'm just making stuff up.
(01:51:14)
>> You led self-driving for 5
(01:51:17)
years at Tesla.
(01:51:18)
>> Sorry I don't know anything about the
(01:51:19)
specifics of Waymo. I feel talk about
(01:51:20)
them.
(01:51:21)
>> I actually by the way I love Waymo and I
(01:51:22)
take it all the time. So I don't want to
(01:51:24)
say like
(01:51:25)
>> I just think that people again are
(01:51:27)
sometimes a little bit too naive about
(01:51:29)
some of the progress and I still think
(01:51:30)
there's a huge amount of work
(01:51:31)
>> and I think Tesla took in my mind a lot
(01:51:33)
more scalable approach and I think the
(01:51:34)
team is doing extremely well and it's
(01:51:36)
going to um and I I I'm kind of like on
(01:51:39)
the record for predicting how this thing
(01:51:40)
will go which is like Waymo had an
(01:51:42)
early start because you can package up
(01:51:43)
so many sensors but I do think Tesla is
(01:51:45)
taking the more uh scalable strategy and
(01:51:47)
it's going to look a lot more like that
(01:51:48)
uh so I think this will have to still uh
(01:51:50)
play out and hasn't but basically Like I
(01:51:53)
don't want to talk about self driving as
(01:51:54)
something that took a decade because it
(01:51:56)
didn't take it didn't take yet
(01:51:59)
if that makes sense
(01:52:00)
>> because one it's the the start is at
(01:52:02)
1980 not 10 years ago and then two the
(01:52:04)
end is not here yet.
(01:52:05)
>> Yeah. The end is not near yet
(01:52:07)
because uh when we're talking about
(01:52:08)
self-driving usually in my mind it's
(01:52:09)
self-driving at scale. Yeah.
(01:52:11)
>> Um people don't have to get a driver's
(01:52:13)
license etc. I'm I'm curious to bounce
(01:52:15)
two other ways in which the analogy
(01:52:18)
might be different.
(01:52:19)
>> And the reason I'm especially curious
(01:52:20)
about this is because I think the
(01:52:22)
question of how fast AI is deployed, how
(01:52:25)
valuable it is when it's early on is
(01:52:27)
like potentially the most important
(01:52:29)
question in the world right now, right?
(01:52:30)
Like if you're trying to model what the
(01:52:31)
year 2030 looks like, this is the
(01:52:33)
question you want to have some
(01:52:34)
understanding of. So another thing you
(01:52:37)
might think is one you have this latency
(01:52:40)
requirement with self-driving where you
(01:52:42)
have I have no idea what the actual
(01:52:43)
models are but I assume like tens of
(01:52:45)
millions of parameters or something
(01:52:46)
which is not a necessary constraint
(01:52:48)
for um knowledge work with LLMs or maybe
(01:52:51)
it might be with the computer use and
(01:52:53)
stuff but anyways the other big one is
(01:52:56)
maybe more importantly on this capex
(01:52:59)
question yes there is additional cost to
(01:53:03)
serving up an additional copy of a model
(01:53:05)
But the sort of opex of a session
(01:53:09)
>> is quite low and you can amortize the
(01:53:12)
cost of AI into the training run itself
(01:53:16)
depending on how inference scaling goes
(01:53:17)
and stuff but it's certainly not as much
(01:53:19)
as like building a whole new car
(01:53:21)
>> to serve another instance of a model. So
(01:53:24)
it just the economics of deploying more
(01:53:26)
widely
(01:53:27)
>> are much more favorable.
(01:53:29)
>> I think that's right. I think if you're
(01:53:30)
sticking in the realm of bits, bits are
(01:53:32)
like a million times easier than
(01:53:33)
anything that touches the physical
(01:53:34)
world.
(01:53:35)
>> No.
(01:53:36)
>> Uh I definitely grant that. Uh bits are
(01:53:38)
completely changeable, arbitrarily
(01:53:40)
reshufflable at very rapid speed. Uh so
(01:53:43)
you would expect a lot more
(01:53:45)
>> faster uh adaptation also in the
(01:53:47)
industry and so on.
(01:53:48)
>> And then uh what was the first one?
(01:53:50)
>> The latency requirements and its
(01:53:52)
implications for model size.
(01:53:54)
>> I think that's roughly right. I mean I
(01:53:55)
also think that if we are talking about
(01:53:56)
knowledge work at scale there will be
(01:53:58)
some, uh, latency requirements practically
(01:54:00)
speaking because we uh you know we're
(01:54:02)
going to have to create a huge
(01:54:03)
amount of compute and serve that.
(01:54:05)
>> Um and then I think like the last aspect
(01:54:07)
that I very briefly want to also talk
(01:54:08)
about is like all the all the rest of it
(01:54:10)
the
(01:54:11)
>> just all the rest of it. So um what does
(01:54:14)
society think about it? What is the
(01:54:15)
legal ramific how is it working legally?
(01:54:18)
How is it working insurance-wise? who's
(01:54:20)
really like what is the where what are
(01:54:22)
those layers of it and aspects of it
(01:54:24)
what happens with what is the equivalent
(01:54:26)
of people putting a cone on a Waymo
(01:54:28)
>> you know uh there's going to be
(01:54:29)
equivalent of all that and so I I do
(01:54:31)
think that I almost feel like
(01:54:33)
self-driving is a very nice analogy that
(01:54:35)
you can borrow things from yeah what is
(01:54:37)
the equivalent of a cone on the car what
(01:54:38)
is the equivalent of a teleoperating
(01:54:39)
worker who's like hidden away um and uh
(01:54:43)
almost like all the aspects of it
(01:54:45)
>> yeah do you have any opinions on whether
(01:54:47)
this implies that the current AI
(01:54:48)
buildout,
(01:54:49)
which would like 10x the amount of
(01:54:52)
available compute in the world in a
(01:54:54)
year or two and maybe like 100 more than
(01:54:56)
100x it by the end of the decade. If the
(01:54:59)
use of AI will be lower than some people
(01:55:01)
naively predict, does that mean that we're
(01:55:04)
overbuilding compute, or is that a
(01:55:07)
separate question?
(01:55:07)
>> Kind of like what happened with
(01:55:08)
railroads and all this kind of stuff?
(01:55:10)
Sorry.
(01:55:10)
>> Was it railroads? Oh, sorry. It was um
(01:55:12)
yeah,
(01:55:12)
>> there there is like historical precedent
(01:55:14)
or was it with telecommunication
(01:55:15)
industry, right? Like prepaving the
(01:55:17)
internet that only came like a decade
(01:55:18)
later, you know,
(01:55:20)
>> and creating like a whole bubble in the
(01:55:21)
telecommunications industry in the late
(01:55:24)
90s kind of thing. Yeah.
(01:55:25)
>> Um so I don't know. I mean, I I
(01:55:28)
understand I'm sounding very pessimistic
(01:55:30)
here.
(01:55:31)
>> I'm only doing that I'm actually
(01:55:32)
optimistic. I think this will work. I
(01:55:34)
think it's tractable. I'm only sounding
(01:55:36)
pessimistic because when I go on my
(01:55:37)
Twitter timeline, I see all this stuff
(01:55:39)
that makes no sense to me. And um and I
(01:55:42)
think there's a lot of reasons for why
(01:55:44)
that exists. And I think a lot of it is
(01:55:46)
I think honestly just uh fundraising.
(01:55:47)
It's just incentive structures. A lot of
(01:55:50)
it may be fundraising. A lot of it is
(01:55:51)
just attention um you know, converting
(01:55:53)
attention to money on the internet, you
(01:55:55)
know, stuff like that. Um, so I think
(01:55:58)
there's u there's a lot of that going on
(01:56:01)
and I think I'm only reacting to that.
(01:56:02)
Um, but I'm still like overall very
(01:56:05)
bullish on technology. I think we're
(01:56:06)
going to work through all this stuff and
(01:56:07)
I think there's been a rapid amount of
(01:56:09)
progress. Um, I don't actually know that
(01:56:11)
there's overbuilding. I think that
(01:56:13)
there's going to be we're going to be
(01:56:14)
able to gobble up what in my
(01:56:16)
understanding is being built. Uh,
(01:56:17)
because I do think that for example
(01:56:19)
Claude Code or OpenAI Codex and stuff like
(01:56:21)
that, they didn't even exist a year ago,
(01:56:22)
right? Is that right? I think it's
(01:56:23)
roughly right. um this is miraculous
(01:56:26)
technology that didn't exist. I think um
(01:56:29)
uh there's going to be a huge amount of
(01:56:30)
demand, as we see the demand in
(01:56:31)
ChatGPT already and so on. So uh
(01:56:34)
yeah, I don't actually know that there's
(01:56:35)
overbuilding. Um but I guess I'm just
(01:56:38)
reacting to like some of the very fast
(01:56:40)
timelines that people continue to say
(01:56:42)
incorrectly and I've heard many many
(01:56:43)
times over the course of my 15 years in
(01:56:45)
AI where very reputable people keep
(01:56:48)
getting this wrong all the time.
(01:56:51)
And I think I want this to be properly
(01:56:52)
calibrated and I think some of this also
(01:56:54)
it does have like geopolitical
(01:56:55)
ramifications and things like that when
(01:56:58)
uh like some of these questions and I
(01:57:00)
think I don't want people to make
(01:57:01)
mistakes in that sphere of
(01:57:03)
things. So um I do want us to be
(01:57:05)
grounded in reality of what technology
(01:57:07)
is and isn't. So
(01:57:09)
>> let's let's talk about education in
(01:57:10)
Eureka and stuff.
(01:57:12)
>> One thing you could do is uh start
(01:57:15)
another AI lab and then try to solve
(01:57:18)
those problems. Um, yeah. Curious what
(01:57:20)
you're up to now.
(01:57:21)
>> Yeah.
(01:57:22)
>> And then, yeah, why not AI research
(01:57:24)
itself?
(01:57:25)
>> Uh, I guess maybe like the way I would
(01:57:26)
put it is I feel some amount of like
(01:57:29)
determinism around the things that AI
(01:57:32)
labs are doing. Um, and I feel like I
(01:57:34)
could help out there, but I don't know
(01:57:35)
that I would uh like uniquely um I don't
(01:57:39)
know that I would like uniquely uh
(01:57:40)
improve it. But I I think like my
(01:57:42)
personal big fear is that a lot of this
(01:57:44)
stuff happens on the side of humanity
(01:57:46)
and that humanity gets disempowered by
(01:57:48)
it. And I don't I I kind of like I care
(01:57:51)
not just about all the Dyson spheres
(01:57:53)
that we're going to build and that AI is
(01:57:54)
going to build in a fully autonomous
(01:57:55)
way. I care about what happens to
(01:57:57)
humans.
(01:57:57)
>> Yeah.
(01:57:58)
>> And I want humans to be well off in this
(01:58:00)
future. And I feel like that's where I
(01:58:02)
can a lot more uniquely add value than
(01:58:04)
uh like an incremental improvement in a
(01:58:05)
frontier lab. And so, um, I guess I'm
(01:58:08)
most afraid of something maybe like, um,
(01:58:10)
depicted in movies like Wall-E or
(01:58:12)
Idiocracy or something like that where
(01:58:14)
humanity is sort of on the side of this
(01:58:15)
stuff. Um, and I want humans to be much
(01:58:18)
much better in this future. And so I
(01:58:21)
guess uh to me uh this is kind of like
(01:58:23)
through education that you can actually
(01:58:24)
achieve this
(01:58:25)
>> and and uh so what are you working on
(01:58:27)
there?
(01:58:27)
>> Oh yeah. So Eureka is trying to build I
(01:58:29)
think maybe the easiest way I can
(01:58:30)
describe it is we're trying to build the
(01:58:31)
Starfleet Academy.
(01:58:33)
>> Um I don't know if you watched Star
(01:58:35)
Trek. I haven't. But yeah.
(01:58:36)
>> Okay. Starfleet Academy is this like
(01:58:38)
elite institution for frontier
(01:58:40)
technology building spaceships and
(01:58:42)
graduating cadets to be like you know
(01:58:44)
the pilots of these spaceships and
(01:58:45)
whatnot. So I just imagine like an elite
(01:58:47)
institution for technical knowledge and
(01:58:50)
um and basically a kind of school that's
(01:58:53)
um very up-to-date and very like premier
(01:58:55)
institution. A category of questions I
(01:58:58)
have for you is just explaining how one
(01:59:01)
teaches technical or scientific content
(01:59:05)
>> well because you are one of the world
(01:59:07)
masters at it and then I'm curious both
(01:59:10)
about how you think about it for content
(01:59:11)
you've already put out there on YouTube.
(01:59:13)
>> Yeah.
(01:59:13)
>> But also to the extent it's any
(01:59:14)
different how you think about it for
(01:59:15)
Eureka.
(01:59:16)
>> Yeah. Yeah. With respect to Eureka, I
(01:59:18)
think like one thing that is very
(01:59:19)
fascinating to me about education is
(01:59:21)
like I do think education will pretty
(01:59:22)
fundamentally change with AIs on the
(01:59:24)
side and I think it has to be rewired
(01:59:26)
and changed um to some extent. I still
(01:59:29)
think that we're pretty early. I think
(01:59:30)
there's going to be a lot of people who
(01:59:31)
are going to try to do the obvious
(01:59:32)
things which is like oh have an LLM and
(01:59:35)
uh ask it questions and get you know do
(01:59:37)
all the basic things that you would do
(01:59:38)
via prompting right now. I I think it's
(01:59:40)
helpful but it still feels to me a bit
(01:59:41)
like slop. I'd like to do it
(01:59:44)
properly and I think the capability is
(01:59:45)
not there for what I would want. What
(01:59:46)
I'd want is uh like an actual uh tutor
(01:59:50)
experience. Um maybe a prominent example
(01:59:52)
in my mind is um I was recently learning
(01:59:55)
Korean. So language learning
(01:59:57)
>> and I went through a phase where I was
(01:59:58)
learning Korean by myself on the
(02:00:00)
internet. I went through a phase where I
(02:00:01)
was actually part of a small class uh in
(02:00:03)
Korea, uh, taking Korean with
(02:00:06)
a bunch of other people which was really
(02:00:07)
funny. But we had a teacher and like 10
(02:00:08)
people or so taking Korean. And then I
(02:00:10)
switched to a one-on-one tutor. And um I
(02:00:14)
guess what was fascinating to me is I
(02:00:15)
think I had a really good tutor. Uh but
(02:00:17)
um I mean just thinking through like
(02:00:20)
what this tutor was doing for me and how
(02:00:22)
incredible that experience was and how
(02:00:25)
high the bar is for like what I actually
(02:00:26)
want to build eventually.
(02:00:28)
>> Um because uh I mean she was extremely
(02:00:30)
so she instantly from a very short
(02:00:32)
conversation understood like where I am
(02:00:34)
as a student, what I know and don't know
(02:00:36)
and she was able to like probe exactly
(02:00:37)
like the kinds of questions or things to
(02:00:39)
understand my world model. No LLM will
(02:00:42)
do that for you 100% right now. Not even
(02:00:44)
close, right?
(02:00:44)
>> But a tutor will do that if if they're
(02:00:46)
good.
(02:00:47)
>> Once she understands um she actually
(02:00:49)
like really served me all the things
(02:00:50)
that I needed at my current sliver of
(02:00:52)
capability. I need to be always
(02:00:54)
appropriately challenged. I can't be
(02:00:56)
faced with something too hard or too
(02:00:57)
trivial. And a tutor is really good at
(02:00:59)
serving you just the right stuff. And so
(02:01:01)
basically I felt like I was the only
(02:01:03)
constraint to learning like my own. I
(02:01:05)
was the only constraint. I was always
(02:01:06)
given the perfect information.
(02:01:07)
>> I'm the only constraint. And I felt good
(02:01:09)
because I'm the only impediment that
(02:01:11)
exists. It's not that I can't find
(02:01:12)
knowledge or that it's not properly
(02:01:13)
explained or etc. Like it's just my
(02:01:15)
ability to memorize and so on. And this
(02:01:17)
is what I want for people. How do you
(02:01:19)
automate that?
(02:01:20)
>> So very good question about the current
(02:01:22)
capability. You don't
(02:01:23)
>> but I do think that with uh as um and
(02:01:25)
that's why I think, uh, it's not actually
(02:01:27)
the right time to actually build
(02:01:28)
this kind of an AI tutor. I still think
(02:01:30)
it's a useful product um and lots of
(02:01:32)
people will build it but I still feel
(02:01:34)
like um the bar is so high uh and the
(02:01:37)
capability is not there. Um uh but I
(02:01:40)
mean even today I would say ChatGPT is
(02:01:42)
an extremely um valuable educational
(02:01:45)
product but I think for me it was so
(02:01:47)
fascinating to see how high the bar is
(02:01:48)
and when I was with her I almost felt
(02:01:50)
like there's no way I can build this.
(02:01:53)
>> But you are building it right?
(02:01:54)
>> Anyone who's had a really good tutor is
(02:01:56)
like how are you going to build this?
(02:01:58)
Um so I guess I I'm waiting for that
(02:02:01)
capability. I I do think that in a lot
(02:02:03)
of ways in the industry, for example, I
(02:02:04)
did some AI consulting for computer
(02:02:06)
vision. Yeah. Um
(02:02:07)
>> a lot of times the value that I
(02:02:09)
brought to the company was telling them
(02:02:10)
not to use AI.
(02:02:11)
>> It wasn't like I was the AI expert and
(02:02:13)
they described a problem and I said
(02:02:14)
don't use AI.
(02:02:16)
>> This was my value add. And I feel like
(02:02:17)
it's the same in education right now
(02:02:19)
where I kind of feel like for what I
(02:02:21)
have in mind, it's not yet the time, but
(02:02:23)
the time will come. But for now, I'm
(02:02:25)
building something that looks maybe a
(02:02:26)
bit more conventional. um that has a
(02:02:28)
physical and digital component and so
(02:02:30)
on. But I think there's obvious there's
(02:02:32)
obvious it's obvious how this should
(02:02:34)
look like in the future.
(02:02:35)
>> To the extent you're willing to say,
(02:02:36)
what is the thing you hope will be
(02:02:39)
released this year or next year?
(02:02:41)
>> Well, so I'm building the first course
(02:02:42)
and I want to have a really really good
(02:02:44)
course uh state-of-the-art obvious
(02:02:47)
state-of-the-art destination you go to
(02:02:49)
learn AI in this case because that's
(02:02:50)
just what I'm familiar with. So I think
(02:02:51)
it's a really good first product to get
(02:02:52)
to be really good. Um and so that's what
(02:02:55)
I'm building, and nanochat, which you
(02:02:56)
briefly mentioned is a capstone project
(02:02:58)
of uh LLM101n, which is a class that I'm
(02:03:00)
building.
(02:03:01)
>> So um that's a really big piece of it
(02:03:04)
but now I have to build out a lot of the
(02:03:05)
intermediates and then I have to
(02:03:06)
actually like hire a small team of you
(02:03:08)
know TAs and so on and actually like uh
(02:03:10)
build the entire course. And maybe one
(02:03:12)
more thing that I would say is like many
(02:03:13)
times when people think about education
(02:03:15)
they think about sort of like the more
(02:03:17)
what I would say is like kind of a
(02:03:18)
softer component of like diffusing
(02:03:20)
knowledge or like um but I actually have
(02:03:22)
something very hard and technical in
(02:03:24)
mind and so in my mind education is kind
(02:03:26)
of like the very difficult technical
(02:03:28)
like uh process of building ramps to
(02:03:30)
knowledge.
(02:03:31)
>> So in my mind nanochat is a ramp to
(02:03:33)
knowledge because it's a very simple
(02:03:35)
it's like the super simplified full
(02:03:37)
stack thing. If you give this artifact
(02:03:39)
to someone and they like look through
(02:03:41)
it, they're learning a ton of stuff.
(02:03:42)
>> Yeah.
(02:03:42)
>> And so, uh, it's giving you a lot of
(02:03:45)
what I call Eurekas per second. Yeah.
(02:03:46)
Which is like understanding per second.
(02:03:48)
That's what I want. Lots of Eurekas per
(02:03:50)
second. Um and so to me this is a
(02:03:52)
technical problem of how do we build
(02:03:53)
these ramps to knowledge and uh so I
(02:03:55)
almost think of Eureka as almost like a
(02:03:57)
it's not like maybe that different maybe
(02:03:59)
from some of the frontier
(02:04:01)
labs or some of the work that's going to
(02:04:02)
be going on because I want to figure out
(02:04:04)
how to build these frontier these ramps
(02:04:06)
very efficiently so that people are
(02:04:08)
never stuck um and everything is always
(02:04:10)
not too hard or not too not too trivial
(02:04:13)
and uh you can you have just the right
(02:04:15)
material to actually progress.
(02:04:16)
>> Yeah. So you're imagining the short term
(02:04:18)
that instead of a tutor being able to
(02:04:20)
like probe your understanding, if you
(02:04:23)
have enough self-awareness to be able to
(02:04:25)
probe yourself
(02:04:26)
>> there, you're never going to be stuck.
(02:04:27)
You can like find the right answer
(02:04:29)
between talking to the TA or talking to
(02:04:31)
an LLM and looking at the reference
(02:04:32)
implementation. It sounds like
(02:04:35)
automation or AI is actually not a
(02:04:37)
significant like so far it's actually
(02:04:40)
the the big alpha here is your ability
(02:04:43)
to explain AI codified in the source
(02:04:47)
material of the class right that's like
(02:04:50)
fundamentally what the course is
(02:04:51)
>> I mean I think you always have to be
(02:04:52)
calibrated to what the capability what
(02:04:54)
capability exists in the industry and I
(02:04:56)
think a lot of people are going to
(02:04:57)
pursue like oh just ask ChatGPT etc uh
(02:04:59)
but I I think like right now for example
(02:05:01)
if you go to ChatGPT and you say oh teach
(02:05:03)
me AI there's no way it's I mean it's
(02:05:04)
going to give you some slop right
(02:05:06)
>> like AI is never going to write
(02:05:08)
nanochat right now but nanochat is a
(02:05:10)
really useful I think intermediate point
(02:05:12)
>> so I still I'm collaborating with AI to
(02:05:15)
create all this material so AI is still
(02:05:17)
fundamentally very helpful
(02:05:18)
>> um earlier on I built a CS231N at
(02:05:21)
Stanford which was one of the earlier
(02:05:23)
actually sorry I think it was the first
(02:05:24)
deep learning class at Stanford which
(02:05:25)
became very popular
(02:05:27)
>> um and the difference in building out
(02:05:29)
231N and LLM101n now is quite stark
(02:05:32)
uh because I'm I feel really empowered
(02:05:34)
by the LLMs as they exist right now but
(02:05:36)
I'm very much in the loop
(02:05:38)
>> so they're helping me build all the
(02:05:39)
materials I go much faster u they're
(02:05:41)
doing a lot of the boring stuff etc uh
(02:05:43)
so I feel like I'm developing the course
(02:05:45)
much faster and there's LLM infused in
(02:05:47)
it but it's not yet at a place where I
(02:05:49)
can creatively create the content I'm
(02:05:50)
still there to do that
(02:05:51)
>> so like I think the trickiness is always
(02:05:53)
calibrating yourself to what exists
(02:05:55)
>> and so when you imagine what is
(02:05:57)
available through Eureka in a couple of
(02:05:59)
years it seems like the big bottleneck
(02:06:01)
is going to be
(02:06:02)
finding Karpathys in field after field who
(02:06:05)
can
(02:06:06)
>> convert their understanding into these
(02:06:08)
ramps right
(02:06:09)
>> so I think it would change over time so
(02:06:10)
I think right now it would be uh hiring
(02:06:13)
faculty
(02:06:14)
>> Mhm.
(02:06:14)
>> to help work hand-in-hand with AI and a
(02:06:17)
team of people probably uh to build
(02:06:19)
state-of-the-art courses.
(02:06:20)
>> Yeah.
(02:06:21)
>> And then I think over time it can maybe
(02:06:23)
some of the TAs can actually become AIs
(02:06:24)
because some of the TAs like okay you
(02:06:26)
just take all the course materials
(02:06:28)
>> and then I think you could serve a very
(02:06:29)
good like automated TA for the student
(02:06:32)
when they have more basic questions or
(02:06:34)
something like that, right? But I think
(02:06:35)
you'll need faculty for the overall
(02:06:37)
architecture of a course and making sure
(02:06:40)
that it fits. And so I kind of see a
(02:06:41)
progression of how this will evolve and
(02:06:43)
maybe at some future point, you know,
(02:06:44)
I'm not even that useful and AI is doing
(02:06:46)
most of the design much better than I
(02:06:47)
could.
(02:06:48)
>> But I still think that that's going to
(02:06:49)
take some time to play out. But are you
(02:06:51)
imagining that like uh people who have
(02:06:54)
expertise in other fields are then
(02:06:56)
contributing courses or do you feel like
(02:06:57)
it's actually quite essential to the
(02:06:59)
vision that you given your understanding
(02:07:02)
of how you want to teach are the one
(02:07:04)
designing the content
(02:07:06)
>> like I don't know Sal Khan is like
(02:07:08)
narrating all the videos on Khan Academy
(02:07:10)
are you imagining something like that or
(02:07:11)
>> no I will hire faculty I think because
(02:07:13)
there are domains in which I'm not an
(02:07:14)
expert um and I think uh that's the only
(02:07:17)
way to offer the state-of-the-art
(02:07:19)
experience for the student ultimately.
(02:07:20)
So um
(02:07:22)
>> yeah I do expect that I would hire
(02:07:24)
faculty but I will probably stick around
(02:07:25)
in AI for some time but in I do have
(02:07:28)
something I think more conventional in
(02:07:29)
mind for the current capability I think
(02:07:31)
than what people would probably
(02:07:32)
anticipate. Um, and when I'm building
(02:07:34)
Starfleet Academy, I do probably imagine
(02:07:36)
a physical uh institution and maybe a
(02:07:38)
tier below that, a digital offering that
(02:07:41)
um is not the same not the
(02:07:43)
state-of-the-art experience you would
(02:07:44)
get when someone comes in physically
(02:07:46)
full-time and we work through material
(02:07:48)
from start to end and make sure you
(02:07:49)
understand it. Uh that's the physical
(02:07:51)
offering.
(02:07:52)
>> Um the digital offering is yeah, a bunch
(02:07:53)
of stuff on the internet and maybe some
(02:07:55)
LLM assistant and it's a bit more
(02:07:56)
gimmicky in a tier below, but uh at
(02:07:58)
least it's accessible to like eight
(02:07:59)
billion people. So
(02:08:01)
>> yeah, I think you're basically inventing
(02:08:04)
college from first principles for the
(02:08:08)
tools that are available today and then
(02:08:09)
just like for just like selecting for
(02:08:12)
people who have the motivation and the
(02:08:13)
interest of actually
(02:08:16)
>> really engaging with material.
(02:08:17)
>> Yeah. And I think there's going to have
(02:08:18)
to be a lot of not just education but
(02:08:20)
also re-education. And I would love to
(02:08:22)
uh help out uh there uh because I think
(02:08:24)
the jobs will probably change quite a
(02:08:26)
bit. Um and so for example today a lot
(02:08:28)
of people are trying to upskill in AI
(02:08:29)
specifically. So I think it's a really
(02:08:30)
good course to teach in this in this
(02:08:32)
respect. Um and yeah I think the
(02:08:35)
motivation wise uh before AGI uh
(02:08:38)
motivation is very simple to solve
(02:08:39)
because uh people want to make money and
(02:08:41)
this is how you make money in the
(02:08:42)
industry today.
(02:08:43)
>> I think post AGI it's a lot more
(02:08:45)
interesting um possibly because yeah if
(02:08:48)
everything is automated and there's
(02:08:49)
nothing to do for anyone why would
(02:08:50)
anyone go to a school etc. Um so I think
(02:08:54)
uh I guess like I often say that
(02:08:56)
pre-AGI education is useful, post-AGI
(02:08:59)
education is fun
(02:09:01)
>> and uh in a similar way as people for
(02:09:03)
example uh people go to gym today.
(02:09:06)
>> Yeah.
(02:09:06)
>> Uh but we don't need their physical
(02:09:08)
strength to manipulate uh heavy objects
(02:09:10)
because we have machines that do that.
(02:09:12)
>> They still go to gym. Why do they go to
(02:09:13)
gym? Well, because it's fun. It's
(02:09:14)
healthy. It's uh and it's and you look
(02:09:17)
hot when you have a six-pack. I don't
(02:09:18)
know. I guess like um so it's I guess
(02:09:21)
what I'm saying is um it's attractive
(02:09:23)
for people to do that in a certain like
(02:09:25)
very deep psychological evolutionary
(02:09:27)
sense for humanity.
(02:09:29)
>> And so I kind of uh think that education
(02:09:31)
will kind of play out in the same way
(02:09:32)
like you'll go to school like you go to
(02:09:33)
gym um and you'll and I think that right
(02:09:36)
now I think not that many people learn
(02:09:38)
uh because learning is hard. You bounce
(02:09:41)
from material because and some people
(02:09:42)
overcome that barrier but for most
(02:09:44)
people it's hard. But I do think that we
(02:09:46)
should it's a technical problem to
(02:09:47)
solve. It's a technical problem to do
(02:09:49)
what my uh tutor did for me when I was
(02:09:51)
learning Korean.
(02:09:52)
>> I think it's tractable and buildable and
(02:09:54)
someone should build it and I think it's
(02:09:55)
going to make learning anything like
(02:09:57)
trivial and desirable and people will do
(02:09:59)
it for fun because it's trivial.
(02:10:00)
>> If I had a tutor like that for any
(02:10:02)
arbitrary piece of like knowledge, I
(02:10:04)
think it's going to be so much easier to
(02:10:05)
to learn anything and people will do it
(02:10:07)
and they'll do it for the same reasons
(02:10:08)
they go to gym.
(02:10:09)
>> I mean that sounds different from
(02:10:13)
using this. So post-AGI you're using
(02:10:15)
this to um basically as entertainment or
(02:10:20)
as like a self-betterment but it
(02:10:22)
sounded like you had a vision also that
(02:10:24)
this education is relevant to keeping
(02:10:25)
humanity in control of AI.
(02:10:28)
>> And they sound different and I'm curious
(02:10:29)
is it like it's entertaining for some
(02:10:31)
people but then empowerment for some
(02:10:32)
others. How do you think about that?
(02:10:33)
>> I think this um so I do definitely feel
(02:10:35)
like people will be um I do think like
(02:10:37)
eventually it's a bit of a losing game
(02:10:39)
if that makes sense. I do think that it
(02:10:42)
is in long term long term which I think
(02:10:44)
is longer than I think maybe most people
(02:10:46)
in the industry it's a losing game. I I
(02:10:48)
do think that people can go so far and
(02:10:51)
that we barely scratched the surface of
(02:10:52)
how far a person can go and that's just
(02:10:54)
because people are bouncing off of
(02:10:55)
material that's too easy or too hard and
(02:10:57)
they and and I I I actually kind of feel
(02:10:59)
that people will be able to go much
(02:11:01)
further like anyone speaks five
(02:11:02)
languages because why not because it's
(02:11:04)
so trivial.
(02:11:05)
>> Um anyone um knows you know all the
(02:11:07)
basic curriculum of undergrad etc. Now,
(02:11:10)
now that I'm understanding the vision,
(02:11:12)
that that's very interesting. Like, I
(02:11:14)
think it actually has a perfect analog
(02:11:16)
in gym culture. I don't think a 100
(02:11:18)
years ago anybody would be like ripped
(02:11:20)
like nobody would have, you know, be
(02:11:22)
able to like just spontaneously bench
(02:11:23)
two plates or three plates or something.
(02:11:25)
It's actually very common now.
(02:11:27)
>> And you're because this idea of
(02:11:29)
systematically training and lifting
(02:11:31)
weights in the gym or systematically
(02:11:33)
training to be able to run a marathon,
(02:11:34)
which is capability spontaneously you
(02:11:36)
would not have or most humans would not
(02:11:38)
have. And you're imagining similar
(02:11:40)
things for
(02:11:41)
learning across very many different
(02:11:43)
domains, much more intensely, deeply,
(02:11:45)
faster.
(02:11:45)
>> Yeah, exactly. And I kind of feel like I
(02:11:47)
am betting a little bit implicitly on
(02:11:48)
some of the timelessness of human
(02:11:50)
nature.
(02:11:50)
>> Yeah.
(02:11:50)
>> And I think um
(02:11:52)
>> I think it will be desirable to be to to
(02:11:55)
do all these things. Um
(02:11:58)
>> and I think people will look up to it
(02:12:00)
and as they have for for millennia
(02:12:02)
because uh and I think this will
(02:12:04)
continue to be true. And actually also
(02:12:05)
maybe there's some evidence of that
(02:12:07)
historically because if you look at for
(02:12:08)
example aristocrats or you look at maybe
(02:12:10)
ancient Greece or something like that
(02:12:11)
whenever you had little pocket
(02:12:12)
environments that were post AGI in a
(02:12:14)
certain sense I do feel like people have
(02:12:16)
spent a lot of their time uh flourishing
(02:12:17)
in a certain way uh either physically or
(02:12:19)
or cognitively and so I think um I I
(02:12:22)
feel okay about the prospects of that
(02:12:24)
>> and I think if this is false and I'm
(02:12:26)
wrong and we end up in like
(02:12:28)
>> you know um WALL-E or Idiocracy future
(02:12:30)
then I think it's very I don't even care
(02:12:33)
if there's like Dyson spheres this is a
(02:12:35)
terrible outcome.
(02:12:36)
>> Mhm.
(02:12:37)
>> Yeah.
(02:12:37)
>> Like I actually really do care about
(02:12:38)
humanity. Like everyone has to just be
(02:12:41)
superhuman in a certain sense.
(02:12:43)
>> I I I guess it's still a world in which
(02:12:46)
that is not enabling us to
(02:12:48)
it's it's like the Culture world, right?
(02:12:50)
Like you're not fundamentally going to
(02:12:51)
be able to like transform the trajectory
(02:12:54)
of
(02:12:54)
>> Yeah.
(02:12:55)
>> uh technology or
(02:12:57)
>> Yeah.
(02:12:57)
>> influence decisions by your own labor or
(02:13:00)
cognition alone. Maybe you can influence
(02:13:02)
decisions because the AI is like for
(02:13:03)
your approval, but you're not like it's
(02:13:06)
not because I've like I can in because
(02:13:08)
I've invented something or I've like
(02:13:10)
come up with a new design, I'm like
(02:13:11)
really influencing the future.
(02:13:12)
>> Um yeah, maybe. I don't actually think
(02:13:14)
that uh I I think there will be
(02:13:15)
transitional period where we are going
(02:13:17)
to be able to be in the loop and you
(02:13:19)
know advance things if we actually
(02:13:20)
understand a lot of stuff.
(02:13:21)
>> Um I do think that long term that
(02:13:23)
probably goes away, right? But um maybe
(02:13:25)
it's going to even become a sport. But
(02:13:27)
right now you have powerlifters who go
(02:13:29)
extreme on this direction. So what is
(02:13:31)
powerlifting in a cognitive era?
(02:13:33)
>> Um maybe it's people who are really
(02:13:35)
trying to make Olympics out of knowing
(02:13:36)
stuff.
(02:13:37)
>> Yeah.
(02:13:37)
>> Uh like and and if you have a perfect AI
(02:13:41)
tutor, um maybe you can get extremely
(02:13:43)
far. Yeah.
(02:13:44)
>> I almost feel like we're just barely the
(02:13:46)
the geniuses of today are barely
(02:13:48)
scratching the surface of what a human
(02:13:49)
mind can do. I think
(02:13:50)
>> Yeah. I I I love this vision. I also um
(02:13:54)
it's like I feel like the person you
(02:13:56)
have like most product market fit with
(02:13:58)
is like me because like my job involves
(02:14:00)
having to
(02:14:01)
>> learn different subjects every week and
(02:14:04)
I I am I am like very excited if you can
(02:14:08)
>> I'm similar for that matter. I mean I
(02:14:10)
you know a lot of people for example uh
(02:14:12)
hate school and want to get out of it. I
(02:14:13)
was I was actually I really liked
(02:14:14)
school. I loved learning things etc. I
(02:14:16)
wanted to stay in school. I stayed all
(02:14:17)
the way until PhD and then they wouldn't
(02:14:19)
let me stay longer so I went to the
(02:14:20)
industry. But I mean I basically it's
(02:14:22)
roughly speaking I love uh I love
(02:14:24)
learning uh even for the sake of
(02:14:26)
learning but I also um love learning
(02:14:28)
because it's a form of empowerment and
(02:14:29)
being useful and productive. I I think
(02:14:31)
you also made a point that uh was subtle
(02:14:33)
so just to spell it out. I think what's
(02:14:36)
happened so far with online courses is
(02:14:37)
that why haven't they already enabled us
(02:14:40)
to
(02:14:41)
enable every single human to know
(02:14:43)
everything
(02:14:44)
>> and I think they're just so motivation
(02:14:47)
laden because there's not obvious
(02:14:49)
on-ramps and it's like so easy to get
(02:14:51)
stuck. Um, and if you had
(02:14:55)
instead this re this thing basically
(02:14:57)
like a really good human tutor, it it
(02:14:59)
would just be such an unlock from a
(02:15:01)
motivation perspective.
(02:15:02)
>> I think so because it feels bad to
(02:15:04)
bounce from material. Feels bad. You get
(02:15:06)
negative reward from
(02:15:07)
>> uh sinking amount of time in something
(02:15:09)
and this doesn't pan out or like being
(02:15:11)
completely bored because what you're
(02:15:13)
getting is too easy or too hard. So I
(02:15:14)
think uh yeah I think it feel when you
(02:15:16)
actually do it properly learning feels
(02:15:18)
good.
(02:15:18)
>> Yeah. And I think it's a technical
(02:15:20)
problem to get there. And I think um for
(02:15:22)
a while it's going to be AI plus human
(02:15:24)
collab. And at some point maybe it's
(02:15:26)
just AI. I don't know.
(02:15:27)
>> Can I ask some questions about teaching
(02:15:29)
well?
(02:15:29)
>> if you had to like sort of like give
(02:15:31)
advice to another educator in another
(02:15:33)
field that you're curious about
(02:15:36)
>> to make the kinds of YouTube tutorials
(02:15:39)
you've made. Um
(02:15:40)
>> maybe it may be especially interesting
(02:15:42)
to talk about domains where you can't
(02:15:43)
just like you can't test somebody's
(02:15:44)
technical understanding by having them
(02:15:46)
code something up or something. what
(02:15:48)
advice would you give them?
(02:15:49)
>> Uh, so I think that's a pretty broad
(02:15:51)
topic. I do feel like there's basically
(02:15:53)
I almost feel like there are 10 20 tips
(02:15:54)
and tricks that I kind of
(02:15:55)
semi-consciously probably do, but um
(02:15:59)
I guess like on a high level, I always
(02:16:01)
try to I think a lot of this comes from
(02:16:03)
my physics background. I really really
(02:16:05)
did enjoy my physics background. I have
(02:16:06)
a whole rant on I think how everyone
(02:16:08)
should learn physics uh in the in early
(02:16:11)
school education because I think early
(02:16:13)
school education is not about
(02:16:15)
accumulating knowledge or memory for
(02:16:16)
tasks later in the industry. It's about
(02:16:18)
booting up a brain and I think physics
(02:16:19)
uniquely boots up the brain the best. Uh
(02:16:22)
because some of the things that they get
(02:16:23)
you to do in your brain during physics
(02:16:25)
is is extremely valuable later. the idea
(02:16:27)
of building models and abstractions and
(02:16:28)
understanding that there are there's a
(02:16:30)
first order um approximation that
(02:16:32)
describes most of the system but then
(02:16:34)
there's a second order, third order,
(02:16:35)
fourth order terms that may or may not
(02:16:37)
be present. And the idea that you're
(02:16:38)
observing like a very noisy system, but
(02:16:40)
actually there's like these fundamental
(02:16:41)
frequencies that you can abstract away.
(02:16:43)
Like when a physicist walks into the
(02:16:45)
class and they say, "Oh, assume there's
(02:16:47)
a spherical cow and dot dot dot." And
(02:16:49)
everyone laughs at that, but actually
(02:16:50)
this is brilliant. It's brilliant
(02:16:51)
thinking that's very generalizable
(02:16:53)
across the industry because
(02:16:54)
>> yeah a cow can be approximated as a
(02:16:57)
sphere I guess in a bunch of ways. Um
(02:16:59)
there's a really good book for example
(02:17:00)
Scale uh it's basically from a physicist
(02:17:03)
talking about biology and maybe this is
(02:17:05)
also a book I would recommend reading
(02:17:06)
but you can actually get a lot of really
(02:17:08)
interesting approximations and chart
(02:17:10)
scaling laws of animals and you can look
(02:17:12)
at their heartbeats and things like that
(02:17:14)
and they actually line up and with the
(02:17:16)
size of the animal and things like that.
(02:17:17)
You can talk about an animal as a volume
(02:17:19)
and you can actually derive a lot of um
(02:17:21)
you can talk about the heat dissipation
(02:17:23)
uh of that because your your heat
(02:17:24)
dissipation grows as the surface area
(02:17:26)
which is growing as a square but your heat
(02:17:28)
creation or generation um is growing as
(02:17:31)
a cube.
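[Editor's sketch] The square-versus-cube argument above can be checked numerically. This is a minimal illustration, not anything from the book or the speaker: it assumes the animal is a sphere of radius r (the "spherical cow"), that heat shed is proportional to surface area, and that heat generated is proportional to volume. The function names are hypothetical.

```python
import math

def surface_area(r: float) -> float:
    """Surface area of a sphere; stands in for heat dissipation (grows as r^2)."""
    return 4.0 * math.pi * r ** 2

def volume(r: float) -> float:
    """Volume of a sphere; stands in for heat generation (grows as r^3)."""
    return (4.0 / 3.0) * math.pi * r ** 3

def heat_ratio(r: float) -> float:
    """Heat generated per unit of heat shed. For a sphere this simplifies to r / 3."""
    return volume(r) / surface_area(r)

# Doubling the radius quadruples the area, multiplies the volume by eight,
# and therefore doubles the ratio of heat generated to heat shed.
for r in (1.0, 2.0, 4.0):
    print(f"r={r:4.1f}  area={surface_area(r):8.2f}  "
          f"volume={volume(r):8.2f}  ratio={heat_ratio(r):.3f}")
```

Under these assumptions the ratio grows linearly with size, which is the first-order reason a larger animal has a harder time shedding heat than a smaller one.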
(02:17:32)
>> And so I just feel like physicists have
(02:17:33)
all the right cognitive tools to
(02:17:34)
approach problem solving in the world.
(02:17:36)
So I think because of that training I
(02:17:38)
always try to find the first order terms
(02:17:40)
or the second order terms of everything.
(02:17:41)
When I'm observing a system or or a
(02:17:43)
thing, I have a tangle of a web of ideas
(02:17:45)
or knowledge in my world in my mind and
(02:17:47)
I'm trying to find what is the what is
(02:17:48)
the thing that actually matters. What is
(02:17:50)
the first order component? How can I
(02:17:52)
simplify it? How can I have a simple
(02:17:53)
thing that actually shows that thing,
(02:17:54)
right? Show shows it in action and then
(02:17:56)
I can tack on the other terms.
(02:17:57)
>> Yeah,
(02:17:58)
>> maybe maybe an example from my from one
(02:18:00)
of my repos that I think illustrates it
(02:18:02)
well is called micrograd. I don't know if
(02:18:03)
you're familiar with this, but
(02:18:05)
>> so micrograd is 100 lines of code that
(02:18:07)
shows back propagation. It can uh you
(02:18:09)
can create neural networks out of simple
(02:18:11)
operations like plus and times etc Lego
(02:18:13)
blocks of neural networks and you build
(02:18:15)
up a computational graph and you do a
(02:18:16)
forward pass and a backward pass to get
(02:18:18)
the gradients. Um now this is at the
(02:18:20)
heart of all neural network learning. So
(02:18:22)
micrograd is a 100 lines of
(02:18:24)
pretty interpretable Python code and it can
(02:18:26)
do forward and backward arbitrary neural
(02:18:27)
networks but not efficiently. So
(02:18:30)
micrograd these hundred lines of Python
(02:18:31)
are everything you need to understand
(02:18:33)
how neural networks train. Everything
(02:18:34)
else is just efficiency.
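[Editor's sketch] To make the idea concrete, here is a small scalar autograd example in the spirit of what is described above. It is not the actual micrograd source; it is a minimal sketch that assumes only two operations (plus and times), and the class and attribute names are chosen for illustration. The forward pass builds a computational graph, and the backward pass applies the chain rule node by node in reverse topological order.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order the graph so every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        for node in reversed(order):
            node._backward()


x = Value(2.0)
y = Value(3.0)
z = x * y + x      # forward pass: builds the graph and computes z
z.backward()       # backward pass: chain rule from z back to x and y
print(z.data, x.grad, y.grad)
```

Here z = x*y + x, so dz/dx = y + 1 and dz/dy = x. Gradients are accumulated with += because x is used twice in the expression and both paths contribute.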
(02:18:36)
>> Yeah.
(02:18:37)
>> Everything else is efficiency. And
(02:18:38)
there's a huge amount of work to do
(02:18:39)
efficiency. You know, you need your
(02:18:40)
tensors. You lay them out and you stride
(02:18:42)
them. You make sure your kernels are
(02:18:43)
orchestrating memory movement correctly,
(02:18:45)
etc. It's all just efficiency roughly
(02:18:47)
speaking. But the core intellectual sort
(02:18:48)
of piece of neural network training is
(02:18:50)
micrograd. It's 100 lines. You can easily
(02:18:51)
understand it. You're chaining. It's a
(02:18:53)
recursive application of chain rule to
(02:18:55)
derive the gradient which allows you to
(02:18:56)
optimize any arbitrary differentiable
(02:18:57)
function. So it's a I love finding these
(02:19:01)
like you know the small-order terms and
(02:19:04)
serving them in a very on a platter and
(02:19:06)
discovering them and I feel like
(02:19:08)
education is like the most
(02:19:09)
intellectually interesting thing because
(02:19:11)
>> you have a tangle of understanding and
(02:19:13)
you're trying to lay it out in a way
(02:19:15)
that creates a ramp where everything
(02:19:17)
only depends on the thing before it and
(02:19:19)
I find that this like you know
(02:19:21)
untangling of knowledge is just so
(02:19:22)
intellectually interesting as a as a
(02:19:24)
cognitive task.
(02:19:25)
>> Yeah. And so I love doing it personally,
(02:19:26)
but I just find have fascination with
(02:19:28)
trying to lay things out in a certain
(02:19:29)
way. And maybe that that helps me.
(02:19:31)
>> It also just makes a learning experience
(02:19:34)
so much more motivated. You your um
(02:19:36)
tutorial on the transformer begins with
(02:19:40)
bigram. It's literally like a lookup
(02:19:41)
table from
(02:19:43)
>> here's the word right now
(02:19:44)
>> or here's the previous word, here's the
(02:19:46)
next word. And it's literally just a
(02:19:47)
lookup table.
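[Editor's sketch] The "lookup table" framing can be shown in a few lines. This is an illustrative toy, not code from the tutorial being discussed: it assumes word-level tokens and a tiny made-up corpus, and the function names are hypothetical. The table maps each previous word to counts of the words that followed it, and prediction is just reading the most frequent entry.

```python
from collections import Counter, defaultdict

def build_bigram_table(words):
    """Count, for every word, which words follow it in the training text."""
    table = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table

def predict_next(table, prev):
    """Look up the most frequent follower of `prev`, or None if it was never seen."""
    if prev not in table:
        return None
    return table[prev].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
table = build_bigram_table(corpus)

print(dict(table["the"]))          # follower counts for "the"
print(predict_next(table, "the"))  # most frequent follower of "the"
```

The limitation is visible immediately: the table only ever sees one word of context, which is the kind of pain that motivates each later piece on the way to a transformer.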
(02:19:48)
>> Yeah. That's the essence of it. Yeah. I
(02:19:49)
mean such a brilliant way like okay
(02:19:51)
start with a lookup table and then go to
(02:19:53)
a transformer and then each piece is
(02:19:54)
motivated why would you add that why
(02:19:56)
would you add the next thing you
(02:19:57)
couldn't memorize a sort of attention
(02:19:58)
formula but it's like having an
(02:20:00)
understanding of why this is every
(02:20:01)
single piece is relevant what problem it
(02:20:03)
solves
(02:20:04)
>> yeah yeah you're presenting the pain
(02:20:05)
before you present a solution and how
(02:20:07)
clever is that and you want to take the
(02:20:08)
student through that progression so um
(02:20:11)
there's a lot of like other small things
(02:20:12)
like that that I think make it make it
(02:20:14)
nice and engaging interesting and and
(02:20:16)
you know always prompting the student
(02:20:17)
there's there's a lot of small things
(02:20:19)
like that that I think are, you know,
(02:20:20)
important and a lot of good educators
(02:20:21)
will do. Uh like how would you solve
(02:20:23)
this? Like I'm not going to present a
(02:20:25)
solution before you're going to guess.
(02:20:27)
>> That would be wasteful. That would be
(02:20:29)
that's that's a little bit of a
(02:20:31)
>> I don't want to swear, but like it's a
(02:20:33)
it's a it's a dick move towards you to
(02:20:34)
present you with the solution before I
(02:20:36)
give you a shot to try to um
(02:20:38)
>> Right.
(02:20:38)
>> Uh to come up with it yourself.
(02:20:39)
>> Yeah. Yeah. And because because if you
(02:20:41)
try to come up with it yourself, you you I
(02:20:43)
guess you get a better understanding of
(02:20:44)
like what is the action space.
(02:20:47)
>> Yeah. And then what is the sort of like
(02:20:49)
objective then like why does only this
(02:20:51)
action fulfill that objective right?
(02:20:53)
>> Yeah. Well, you have a chance to like
(02:20:54)
try yourself and you you have an
(02:20:56)
appreciation when I give you the
(02:20:57)
solution and uh it maximizes the amount
(02:21:00)
of knowledge per new fact added.
(02:21:01)
>> That's right. Yeah. Yeah.
(02:21:03)
>> Why do you think by default people who
(02:21:05)
are genuine experts in their field are
(02:21:10)
often bad at explaining it to somebody
(02:21:13)
ramping up?
(02:21:14)
>> Well, it's the curse of knowledge and
(02:21:15)
expertise. Yeah,
(02:21:16)
>> this is a real phenomenon and I actually
(02:21:18)
suffered from it myself as much as I try
(02:21:20)
to not not suffer from it. But
(02:21:21)
>> you take certain things for granted and
(02:21:23)
you can't put yourself in the shoes of
(02:21:24)
new of people who are just starting out
(02:21:26)
and uh this is pervasive and happens to
(02:21:28)
me as well.
(02:21:29)
>> One thing that I actually think is
(02:21:30)
extremely helpful as an example someone
(02:21:32)
was trying to show me a paper in biology
(02:21:33)
recently
(02:21:34)
>> and I just had instantly so many
(02:21:36)
terrible questions. Um,
(02:21:38)
>> so what I did was I used ChatGPT to ask
(02:21:40)
the questions with the with the paper in
(02:21:42)
the context window and then uh it worked
(02:21:44)
through some of the simple things and
(02:21:46)
then I actually shared the thread to the
(02:21:47)
person who shared it uh who actually
(02:21:49)
like wrote that paper or like worked on
(02:21:50)
that work and I almost feel like it was
(02:21:52)
like um like if they can see the dumb
(02:21:55)
questions I had it might help them
(02:21:57)
explain it better in the future or
(02:21:58)
something like that because um so for
(02:22:00)
example for my material I would love if
(02:22:02)
people shared their dumb conversations
(02:22:04)
with ChatGPT about the stuff that I've
(02:22:06)
created because it really helps me put
(02:22:07)
myself again in the shoes of someone
(02:22:09)
who's starting out.
(02:22:10)
>> Another trick like that that I just
(02:22:12)
works astoundingly well.
(02:22:15)
Um, if somebody writes a paper or a blog
(02:22:19)
post or an announcement, it is in 100%
(02:22:23)
of cases true that just the narration or
(02:22:26)
the transcription of how they would
(02:22:28)
explain it to you over lunch
(02:22:30)
>> is way more uh not only understandable,
(02:22:35)
>> but actually also more accurate and
(02:22:38)
scientific in the sense that people have
(02:22:41)
a bias to explain things in the most
(02:22:44)
abstract, jargon-filled way possible
(02:22:46)
and to clear their throat for four
(02:22:48)
paragraphs before they explain the
(02:22:49)
central idea.
(02:22:50)
>> But there's something about
(02:22:51)
communicating one-on-one with a person
(02:22:54)
which compels you to just say the thing.
(02:22:58)
>> Just say the thing.
(02:22:58)
>> Yeah. Actually, I saw that tweet. I
(02:23:00)
thought it was really good. I shared it
(02:23:01)
with a bunch of people actually. I think
(02:23:02)
it was really good. And I noticed this
(02:23:04)
many many times. Um maybe the most
(02:23:06)
prominent example is I remember uh back
(02:23:09)
in my PhD days doing research etc. Uh
(02:23:11)
you read someone's paper, right? and you
(02:23:13)
work to understand what it's doing
(02:23:15)
etc. And then you catch them, you're
(02:23:16)
having beers at the conference later and
(02:23:18)
you ask them so like this paper like so
(02:23:20)
what were you doing like what is the
(02:23:21)
paper about and they will just tell you
(02:23:23)
these like three sentences that like
(02:23:24)
perfectly capture the essence of that
(02:23:25)
paper and totally give you the idea and
(02:23:27)
you didn't have to read the paper yet.
(02:23:28)
>> Yeah. And like it's only when you're
(02:23:30)
sitting at the table with a beer or
(02:23:31)
something like that and like oh yeah the
(02:23:33)
paper is just oh you take this idea you
(02:23:34)
take that idea and you try this
(02:23:35)
experiment and uh and you try out this
(02:23:37)
thing and they have a way of just
(02:23:38)
putting it conversationally
(02:23:39)
>> right
(02:23:40)
>> and just like perfectly like why isn't
(02:23:41)
that the actual
(02:23:42)
>> exactly
(02:23:46)
>> um this is coming from the perspective
(02:23:47)
of how somebody who's trying to explain
(02:23:49)
an idea should formulate it better. What
(02:23:51)
is your advice as a student to other
(02:23:55)
students where if you don't have a
(02:23:56)
Karpathy who is doing the exposition of
(02:23:59)
an idea if you're reading a paper for
(02:24:01)
somebody or reading a book,
(02:24:03)
>> what strategies do you employ
(02:24:06)
>> to learn material you're interested in
(02:24:08)
in fields you're not an expert in?
(02:24:10)
>> Um I don't actually know that I have um
(02:24:12)
like unique tips and tricks to be
(02:24:14)
honest. Um basically it's a it's it's
(02:24:17)
kind of a painful process. Um but you
(02:24:19)
know like reddraft one. Um I think like
(02:24:22)
one thing that has always helped me
(02:24:24)
quite a bit is um
(02:24:27)
I had a small tweet about this actually.
(02:24:28)
So like learning things on demand is is
(02:24:30)
pretty nice. Learning depth-wise. I do
(02:24:32)
feel like you need a bit of alternation
(02:24:34)
of learning depth-wise, on demand, where you're
(02:24:35)
trying to achieve a certain project that
(02:24:36)
you're going to get a reward from.
(02:24:38)
>> And learning breadth-wise, which is just, oh,
(02:24:39)
let's do whatever 101 and here's all the
(02:24:42)
things you might need. Which is a lot of
(02:24:43)
school does a lot of breadth-wise
(02:24:44)
learning like oh trust me you'll need
(02:24:45)
this later. You know, that kind of
(02:24:47)
stuff
(02:24:47)
>> like okay I trust you I'll learn it
(02:24:49)
because I guess I need it.
(02:24:51)
>> But I love the kind of learning where
(02:24:52)
you'll actually get a reward out of
(02:24:54)
doing something and you're learning on
(02:24:55)
demand.
(02:24:56)
>> The other thing that I've found is
(02:24:57)
extremely helpful is um maybe this is an
(02:25:01)
aspect where education is a bit more
(02:25:02)
selfless because uh explaining things to
(02:25:04)
people is a beautiful way to learn
(02:25:06)
something more deeply. Uh this uh
(02:25:08)
happens to me all the time. I think it
(02:25:09)
probably happens to other people too
(02:25:10)
because
(02:25:11)
>> I realize if I don't really understand
(02:25:13)
something I can't explain it, you know,
(02:25:15)
and and um I'm trying and I'm like
(02:25:17)
actually, I don't understand
(02:25:19)
this and it's so annoying to come to
(02:25:20)
terms with that and then you can go back
(02:25:22)
and make sure you understood it and so
(02:25:24)
it fills these gaps of your
(02:25:25)
understanding. It forces you to come to
(02:25:26)
terms with them and uh to reconcile
(02:25:28)
them. I love to re-explain and things like
(02:25:31)
that and I think people should be doing
(02:25:32)
that more as well. I think that forces
(02:25:34)
you to manipulate the knowledge and make
(02:25:35)
sure that you know what you're
(02:25:36)
talking about when you're explaining it.
(02:25:38)
Oh yeah, I think that's an excellent
(02:25:39)
note to close on.
(02:25:40)
>> Yeah,
(02:25:40)
>> Andrej, that was great.
(02:25:41)
>> Yeah, thank you. Thanks. Take your time.
(02:25:44)
>> Hey everybody, I hope you enjoyed that
(02:25:45)
episode. If you did, the most helpful
(02:25:47)
thing you can do is just share it with
(02:25:49)
other people who you think might enjoy
(02:25:51)
it. It's also helpful if you leave a
(02:25:53)
rating or a comment on whatever platform
(02:25:56)
you're listening on. If you're
(02:25:58)
interested in sponsoring the podcast,
(02:25:59)
you can reach out at
(02:26:01)
dwarkesh.com/advertise.
(02:26:05)
Otherwise, I'll see you on the next one.
