Title: Richard Sutton – Father of RL thinks LLMs are a dead end
Duration: 01:07:09
(00:00:00)
Why are you trying to distinguish
(00:00:01)
humans? Humans are animals. What we have
(00:00:04)
in common is more interesting. What
(00:00:06)
distinguishes us, we should be paying
(00:00:08)
less attention to.
(00:00:08)
>> I mean, we're trying to replicate
(00:00:09)
intelligence, right? No animal can go to
(00:00:11)
the moon or make semiconductors. So, we
(00:00:13)
want to understand what makes humans
(00:00:14)
special.
(00:00:15)
>> So, I like the way you consider that
(00:00:16)
obvious, cuz I consider the opposite
(00:00:19)
obvious. If we understood a squirrel,
(00:00:21)
we'd be almost all the way there. I am
(00:00:23)
personally just kind of content being
(00:00:25)
out of sync with my field for a long
(00:00:27)
period of time perhaps decades because
(00:00:29)
occasionally I have been proved right in
(00:00:32)
the past. I don't think learning is
(00:00:35)
really about training. It's about an
(00:00:36)
active process. The child tries things
(00:00:39)
and sees what happens. I think we should
(00:00:41)
be proud that we are giving rise to this
(00:00:45)
great transition in the universe.
(00:00:48)
Today I'm chatting with Richard Sutton
(00:00:50)
who is one of the founding fathers of
(00:00:52)
reinforcement learning an inventor of
(00:00:54)
many of the main techniques used there
(00:00:55)
like TD learning and policy gradient
(00:00:57)
methods and for that he received this
(00:00:59)
year's Turing Award, which if you don't
(00:01:01)
know is basically the Nobel Prize for
(00:01:03)
computer science Richard congratulations
(00:01:05)
>> Thank you, Dwarkesh.
(00:01:06)
>> and uh thanks for coming on the podcast
(00:01:08)
>> it's my pleasure
(00:01:09)
>> okay so first question my audience and I
(00:01:13)
are familiar with the LLM way of
(00:01:15)
thinking about AI conceptually
(00:01:17)
What are we missing in terms of thinking
(00:01:19)
about AI from the RL perspective?
(00:01:22)
>> Well, yes, I think it's really quite a
(00:01:24)
different point of view and it's it can
(00:01:27)
easily get separated and lose the
(00:01:29)
ability to talk to each other.
(00:01:30)
>> Mhm.
(00:01:31)
>> And um yeah, large language models have
(00:01:34)
become such a big thing. Generative AI
(00:01:36)
in general a big thing. Um and our field
(00:01:40)
is subject to bandwagons and fashions.
(00:01:42)
So we lose we lose track of the uh basic
(00:01:46)
basic things because I consider
(00:01:47)
reinforcement learning to be basic AI
(00:01:50)
and what is intelligence? The problem
(00:01:52)
is is to understand your world and um
(00:01:57)
reinforcement learning is about
(00:01:58)
understanding your world whereas large
(00:02:00)
language models are about mimicking
(00:02:02)
people doing what people say you should
(00:02:04)
do. They're not about figuring out what
(00:02:06)
to do.
(00:02:07)
>> Huh. I guess you would think that to
(00:02:11)
emulate the trillions of tokens in the
(00:02:13)
corpus of internet text, you would have
(00:02:15)
to build a world model. In fact, these
(00:02:17)
models do seem to have very robust world
(00:02:19)
models and they're the best um world
(00:02:21)
models we've made to date in AI. Right.
(00:02:23)
So, what do you think that that's
(00:02:25)
missing?
(00:02:26)
>> I would disagree with most of the things
(00:02:28)
you just said.
(00:02:28)
>> Great.
(00:02:30)
>> Just to mimic the the what people say is
(00:02:33)
not really to build a model of the world
(00:02:34)
at all. I don't think you know you're
(00:02:36)
mimicking things that have a model of
(00:02:39)
the world the people
(00:02:40)
>> but I don't want to approach the
(00:02:42)
question in an adversarial way
(00:02:45)
uh but but I would I would question the
(00:02:47)
idea that they um they have a world
(00:02:50)
model so a world model would enable you
(00:02:52)
to predict what would happen
(00:02:53)
>> right
(00:02:53)
>> they they have they have the ability to
(00:02:55)
predict what a person would say they
(00:02:57)
don't have the ability to predict what
(00:02:58)
will happen what we want I think to
(00:03:01)
quote Alan Turing what we want is a
(00:03:04)
machine that can learn from experience,
(00:03:06)
>> right?
(00:03:07)
>> Where experience is the things that
(00:03:08)
actually happen in your life. You do
(00:03:10)
things, you see what happens. Um, and uh
(00:03:14)
that's what you learn from.
(00:03:16)
>> Yeah.
(00:03:17)
>> The large language models learn from
(00:03:18)
something else. They learn from here's a
(00:03:20)
situation and here's what a person did.
(00:03:23)
And implicitly the suggestion is you
(00:03:25)
should do what the person did.
(00:03:26)
>> Right? I guess maybe the the crux and
(00:03:29)
I'm curious if you disagree with this is
(00:03:30)
some people will say okay so this
(00:03:32)
imitation learning has given us a good
(00:03:34)
prior or given these models a good prior
(00:03:36)
about reasonable ways to approach problems
(00:03:39)
and as we move towards the era of
(00:03:41)
experience uh as you call it this prior
(00:03:45)
is going to be the basis on which we
(00:03:48)
teach these models from experience
(00:03:49)
because this gives them the opportunity
(00:03:51)
to get uh answers right some of the time
(00:03:54)
and then on this you can build uh you
(00:03:57)
can train them on experience. Do you
(00:03:58)
agree with that perspective?
(00:04:00)
>> No, I I I agree that it's the it's the
(00:04:03)
large language model perspective, right?
(00:04:04)
>> I don't think it's a good perspective.
(00:04:06)
>> Yeah. Yeah. Curious.
(00:04:08)
>> So to be a prior for something, there
(00:04:11)
has to be a real thing. I mean, a prior
(00:04:14)
bit of knowledge should be the basis for
(00:04:17)
actual knowledge. What is actual
(00:04:19)
knowledge? There's no definition of
(00:04:20)
actual knowledge in that in that large
(00:04:23)
language framework. What makes an action
(00:04:27)
a good action to take? You recognize the
(00:04:30)
value, the need for continual learning,
(00:04:32)
right? So if you need to learn
(00:04:34)
continually, continually means learning
(00:04:36)
during the normal interaction with the
(00:04:38)
world.
(00:04:38)
>> Yeah.
(00:04:39)
>> And so then there must be some way
(00:04:41)
during the normal interaction to tell
(00:04:43)
what's right.
(00:04:44)
>> Yep.
(00:04:45)
>> Okay. So
(00:04:47)
is there any way for it to tell in the
(00:04:50)
large language model setup to tell
(00:04:52)
what's the right thing to say? You will
(00:04:55)
say something and you will not get
(00:04:56)
feedback about what the right thing to
(00:04:58)
say is
(00:04:59)
>> because there's no definition of what
(00:05:01)
the right thing to say is. There's no
(00:05:03)
goal,
(00:05:03)
>> right?
(00:05:04)
>> And if there's no goal, then there's
(00:05:05)
there's one thing to say, another thing
(00:05:07)
to say. There's no right thing to say,
(00:05:09)
>> right?
(00:05:10)
>> So there's no ground truth. You can't
(00:05:12)
have prior knowledge if you don't have
(00:05:14)
ground truth because the prior knowledge
(00:05:16)
is supposed to be a hint or an initial
(00:05:18)
belief about what the truth is.
(00:05:20)
>> Yeah.
(00:05:21)
>> But there isn't any truth. there's no
(00:05:23)
right thing to say right now in
(00:05:25)
reinforcement learning there is a right
(00:05:26)
thing to say or right thing to do
(00:05:28)
because the the right thing to do is the
(00:05:30)
thing that gets you reward
(00:05:32)
>> right
(00:05:32)
>> so we have a definition of what the
(00:05:33)
right thing to do is and so we can have
(00:05:36)
uh prior knowledge or knowledge provided
(00:05:39)
by pe people about what the right thing
(00:05:40)
to do is and then we can check it
(00:05:43)
>> to see because because we have a
(00:05:44)
definition of what the actual right
(00:05:45)
thing to do is
(00:05:47)
>> now an even simpler case is when you
(00:05:49)
have you're trying to make a model of
(00:05:50)
the world when you predict what will
(00:05:52)
happen, you predict and then you see
(00:05:54)
what happens.
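
[Editor's illustration, not part of the conversation] The predict-then-observe loop described here can be sketched as a simple error-correcting update. This is a minimal sketch, assuming a scalar signal and a fixed step size; the names `update_prediction`, `learn_from_stream`, and `step_size`, and the observation values, are hypothetical and chosen only for illustration.

```python
def update_prediction(prediction: float, observed: float, step_size: float = 0.5) -> float:
    """Move the prediction toward what actually happened."""
    error = observed - prediction  # the "surprise": outcome minus guess
    return prediction + step_size * error


def learn_from_stream(observations, initial_prediction=0.0, step_size=0.5):
    """Predict, observe, adjust, once per observation."""
    prediction = initial_prediction
    guesses = []
    for observed in observations:
        guesses.append(prediction)  # the guess made before seeing the outcome
        prediction = update_prediction(prediction, observed, step_size)
    return prediction, guesses


final, guesses = learn_from_stream([1.0, 1.0, 1.0, 1.0])
print(guesses)  # [0.0, 0.5, 0.75, 0.875]
print(final)    # 0.9375
```

The observed outcome plays the role of ground truth: each guess is checked against it, and the size of the mismatch drives the adjustment.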
(00:05:55)
>> Okay? So there's ground truth. There's
(00:05:57)
no ground truth in in uh large language
(00:06:01)
models because you don't have a a
(00:06:03)
prediction about what will happen next.
(00:06:06)
If you say something in your in your um
(00:06:08)
conversation, there's the large language
(00:06:10)
models have no prediction about what the
(00:06:13)
person will say in response to that or
(00:06:15)
what what the response will be. I mean I
(00:06:17)
think they do like they you can
(00:06:19)
literally ask them what what what would
(00:06:20)
you anticipate a user might say in
(00:06:22)
response and they have a prediction.
(00:06:24)
>> Oh no they they they will respond to
(00:06:26)
that question right?
(00:06:27)
>> Yeah
(00:06:27)
>> but they have no prediction in the
(00:06:28)
substantive sense that they won't be
(00:06:31)
surprised by what happens and if
(00:06:33)
something happens that isn't what you
(00:06:34)
might say they predicted they will not
(00:06:36)
change because an unexpected thing has
(00:06:39)
happened and there to learn that they'd
(00:06:43)
have to make an adjustment. I I so I
(00:06:45)
think a capability like this does exist
(00:06:48)
in context. So it's interesting to watch
(00:06:51)
a model do chain of thought and then
(00:06:54)
suppose it's trying to solve a math
(00:06:55)
problem. It'll say okay I'm going to
(00:06:56)
approach this problem using this
(00:06:59)
approach at first and it'll write this
(00:07:00)
out and be like oh wait I just realized
(00:07:02)
this is the wrong conceptual way to
(00:07:03)
approach the problem. I'm going to
(00:07:04)
restart by this another approach and
(00:07:07)
that flexibility does exist in context,
(00:07:10)
right? Do you have something else in
(00:07:11)
mind or do you just think that you need
(00:07:12)
to extend this capability across longer
(00:07:15)
horizons?
(00:07:16)
>> I'm just saying they don't have a have a
(00:07:19)
uh in any meaningful sense they don't
(00:07:21)
have a prediction of what will happen
(00:07:22)
next and they will not be surprised by
(00:07:24)
what happened next. They'll not make any
(00:07:26)
changes if if something happens.
(00:07:29)
>> But isn't that isn't that isn't that
(00:07:30)
>> based on what happens.
(00:07:31)
>> Isn't that literally what next token
(00:07:32)
prediction is? prediction about what's
(00:07:34)
next and then updating on a surprise.
(00:07:35)
>> Next token is what they should say, what
(00:07:37)
the action should be. It's not what the
(00:07:40)
world will give them in response to what
(00:07:42)
they do. Let's let's go back to the uh
(00:07:45)
their lack of goal.
(00:07:46)
>> Mhm.
(00:07:47)
>> For me, having a goal is the essence of
(00:07:49)
intelligence,
(00:07:50)
>> right?
(00:07:51)
>> Something is intelligent if it can
(00:07:52)
achieve goals. Is I like John McCarthy's
(00:07:55)
definition that intelligence is the
(00:07:57)
computational part of the ability to
(00:07:58)
achieve goals. Yeah. So, you have to
(00:08:00)
have goals. You're you're not you're
(00:08:02)
just you're just you're just a behaving
(00:08:04)
system. You you're not you're not any
(00:08:08)
special. You're not intelligent.
(00:08:09)
>> Right.
(00:08:10)
>> And you agree that large language models
(00:08:12)
don't have goals.
(00:08:13)
>> I think they No, they have a goal.
(00:08:14)
>> What's the goal?
(00:08:15)
>> Next token prediction.
(00:08:17)
>> That's not a goal. Doesn't it doesn't
(00:08:19)
change the world,
(00:08:21)
you know.
(00:08:22)
>> I think tokens come at you and if you
(00:08:25)
predict them, you don't influence them.
(00:08:27)
>> Oh, yeah. I I it's not a goal about the
(00:08:30)
external world.
(00:08:32)
>> Yeah. It's not a goal.
(00:08:34)
It's not a substantive goal. It's not
(00:08:36)
You can't look at a system and say, "Oh,
(00:08:38)
it uh has a goal if it's just sitting
(00:08:40)
there predicting and being happy with
(00:08:41)
itself that it's predicting accurately."
(00:08:43)
I I guess maybe the bigger question I
(00:08:44)
want to understand is why you
(00:08:46)
don't think doing RL on top of LLM is a
(00:08:50)
productive direction because being we we
(00:08:52)
seem to be able to give these models a
(00:08:53)
goal of solving difficult math problems
(00:08:55)
and they're in many ways um at the very
(00:08:58)
peaks of human level in in the capacity
(00:09:01)
to solve Math Olympiad-type problems
(00:09:03)
right they got gold at IMO so it seems
(00:09:06)
like the model which got gold at the
(00:09:09)
International Math Olympiad does have the
(00:09:10)
goal of getting math problems right Um
(00:09:13)
so why can't we extend this to different
(00:09:14)
domains?
(00:09:15)
>> Well the math problems are different. Um
(00:09:18)
the making a model of the physical world
(00:09:20)
and uh carrying out the consequences of
(00:09:23)
mathematical
(00:09:25)
um assumptions or operations,
(00:09:27)
>> right?
(00:09:28)
>> Those are very different things like the
(00:09:30)
the empirical world has to be learned.
(00:09:33)
You have to learn the consequences. Um
(00:09:36)
whereas the uh the math is is more just
(00:09:41)
computational. It's more like standard
(00:09:43)
planning. So, so there you can you can
(00:09:47)
um they can have a goal to to um uh to
(00:09:52)
find the proof and they are in in some
(00:09:55)
way given that goal to find the proof.
(00:09:57)
>> Right. So, I mean, it's interesting
(00:09:59)
because you wrote this essay in 2019
(00:10:01)
titled The Bitter Lesson, and this is
(00:10:04)
the most influential essay perhaps in
(00:10:05)
the history of AI, but people have used
(00:10:10)
that as a justification
(00:10:13)
for scaling up LLMs because in their
(00:10:16)
view, this is the one scalable way we
(00:10:19)
have found to pour ungodly amounts of
(00:10:22)
compute into learning about the world.
(00:10:24)
And so it's interesting that your
(00:10:25)
perspective is that the LLMs are
(00:10:27)
actually not bitter lesson.
(00:10:29)
>> It's an interesting question whether uh
(00:10:32)
large language models are are uh a case
(00:10:36)
of the bitter lesson.
(00:10:37)
>> Yeah.
(00:10:38)
>> Because they are clearly um a way of
(00:10:42)
using massive computation things that
(00:10:44)
will scale with computation up to up to
(00:10:48)
the limits of the internet.
(00:10:49)
>> Yeah.
(00:10:50)
uh but they're also a way of putting in
(00:10:54)
lots of um human knowledge and uh so so
(00:10:59)
this is an interesting question um it's
(00:11:02)
a sociological or industry question uh
(00:11:07)
will they reach the limits of of of the
(00:11:11)
data and and be superseded by things
(00:11:15)
that that are can get more data just
(00:11:19)
from experience rather than from
(00:11:21)
uh from people. Uh in some ways it's a
(00:11:26)
classic case of the of the of the bitter
(00:11:28)
lesson with the more the more human
(00:11:30)
knowledge we put into the large language
(00:11:32)
models the better they can do and so it
(00:11:34)
feels good. Um
(00:11:37)
and yet uh one well I in particular
(00:11:42)
expect there to be systems that can
(00:11:44)
learn from experience and which could
(00:11:45)
well perform much much better and be
(00:11:48)
much more scalable. In which uh case it
(00:11:51)
will be another instance of the bitter
(00:11:53)
lesson that the things that that used
(00:11:56)
human knowledge were eventually
(00:11:59)
superseded by things that just um
(00:12:02)
trained from uh experience and
(00:12:04)
computation. I I guess that doesn't seem
(00:12:06)
like the crux to me because I think
(00:12:08)
those people would also agree that the
(00:12:11)
overwhelming amount of compute in the
(00:12:13)
future will come from uh learning from
(00:12:17)
experience. They just think that the
(00:12:19)
scaffold or the basis of that the thing
(00:12:21)
you'll start with in order to pour in
(00:12:23)
the compute to do this future
(00:12:25)
experiential learning or on the job
(00:12:27)
learning will be LLMs. And so I I guess
(00:12:31)
I I still don't understand why this is
(00:12:34)
the wrong starting point altogether. Why
(00:12:37)
we need a whole new architecture to
(00:12:39)
begin doing experiential continual
(00:12:42)
learning. Uh and why we can't start with
(00:12:44)
LLMs to do that.
(00:12:46)
>> Well, in every case of the bitter lesson,
(00:12:48)
you know, you could start with uh human
(00:12:51)
knowledge,
(00:12:51)
>> right?
(00:12:52)
>> And then just and then do the scalable
(00:12:54)
things.
(00:12:54)
>> Yeah,
(00:12:54)
>> that's always the case. And there's no
(00:12:57)
never any reason why that has to be bad,
(00:13:00)
>> right?
(00:13:00)
>> But in fact and in practice it has
(00:13:03)
always turned out to be bad because
(00:13:05)
people get locked into the human
(00:13:07)
knowledge approach and they
(00:13:10)
psychologically or you know now I'm now
(00:13:12)
I'm speculating why it is but this is
(00:13:14)
what has always happened.
(00:13:15)
>> Yeah.
(00:13:16)
>> That uh yeah they get they get their
(00:13:19)
lunch gets eaten by the methods that are
(00:13:21)
truly scalable.
(00:13:22)
>> Yeah. Give me a sense of what the
(00:13:23)
scalable method is. The scalable method
(00:13:26)
is you learn from experience. Um you uh
(00:13:30)
you you try things, you see what you see
(00:13:32)
what works. No one no one has to tell
(00:13:35)
you. First of all, you have a goal. So
(00:13:37)
without a goal, uh there's no sense of
(00:13:39)
right or wrong or better or worse. So
(00:13:42)
large language models are trying to get
(00:13:44)
by without having a goal or a sense of
(00:13:46)
better or worse. That's just, you know,
(00:13:49)
it's exactly starting in the wrong
(00:13:51)
place. May maybe it's um interesting to
(00:13:53)
compare this to humans. So in both the
(00:13:56)
case of learning from imitation versus
(00:13:59)
experience and on the question of goals,
(00:14:02)
I think there's some interesting
(00:14:04)
analogies. So you know kids will
(00:14:07)
initially learn from imitation. Uh you
(00:14:11)
don't think so?
(00:14:12)
>> No, of course not.
(00:14:14)
>> Really?
(00:14:15)
>> Yeah. I think kids just like watch
(00:14:17)
people. They like kind of try try to
(00:14:19)
like say the same.
(00:14:20)
>> How old are those these kids?
(00:14:22)
>> I I think the level
(00:14:23)
>> What about the first six months?
(00:14:25)
>> I think they're kind kind of imitating
(00:14:26)
things. They're trying to like make
(00:14:27)
their mouth sound the way they see their
(00:14:29)
mother's mouth sound. And then they'll
(00:14:30)
say the same words without understanding
(00:14:31)
what they mean. And as you get older,
(00:14:33)
the complexity of the imitation they do
(00:14:35)
increases. So that's you're you're
(00:14:37)
you're you know, you're imitating maybe
(00:14:40)
the skills that your uh people in your
(00:14:42)
band are using to hunt down the deer or
(00:14:44)
something. And then you go into the
(00:14:46)
learning from experience RL regime. But
(00:14:48)
I think there's a lot of imitation
(00:14:49)
learning happening with uh humans.
(00:14:52)
>> Yeah. Surprising. Yeah. You can have
(00:14:53)
such a different point of view.
(00:14:55)
>> Yeah.
(00:14:55)
>> Um when I see kids, I see kids uh just
(00:14:58)
trying things and like waving their
(00:15:01)
hands around and moving their eyes
(00:15:03)
around and no one no one tells them
(00:15:05)
there. There's no there's no um
(00:15:08)
imitation for uh how they move their
(00:15:10)
eyes around or even the sounds they
(00:15:12)
make. They may they may want to create
(00:15:14)
the same sounds but the um the actions
(00:15:17)
you know the thing that the uh infant
(00:15:20)
actually does there there's no targets
(00:15:23)
for that there are no examples for that
(00:15:25)
>> I agree that it doesn't explain
(00:15:26)
everything infants do but I think it
(00:15:28)
guides a learning process I mean even
(00:15:30)
LLM when it's trying to predict the next
(00:15:32)
token early in training it will like
(00:15:34)
make a guess it'll be different from
(00:15:35)
what like it actually sees and in some
(00:15:37)
sense it's like very short horizon RL
(00:15:39)
where it's like making this guess of
(00:15:40)
like I think this token will be It's
(00:15:42)
actually this other thing similar to how
(00:15:43)
a kid will try to say a word, it comes
(00:15:45)
out wrong.
(00:15:46)
>> The the large language models is
(00:15:48)
learning from training data. It's not
(00:15:49)
learning from experience.
(00:15:51)
It's it's learning from something that
(00:15:53)
will never be available during its
(00:15:55)
normal life.
(00:15:57)
There's never any uh training data that
(00:15:59)
says you should do this action in normal
(00:16:02)
life.
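
[Editor's illustration, not part of the conversation] The contrast being drawn here can be sketched as two update rules for the same softmax policy. This is a minimal sketch under stated assumptions: the imitation step is an ordinary cross-entropy gradient step on action preferences, and the experience step is the gradient-bandit rule from the reinforcement learning literature, swapped in here as a simple stand-in. Neither is claimed to be how any particular system is trained, and all function and parameter names are hypothetical.

```python
import math


def softmax(prefs):
    """Turn action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def imitation_update(prefs, demonstrated_action, lr=0.1):
    """Supervised step: requires a label saying which action a person took."""
    probs = softmax(prefs)
    return [
        p + lr * ((1.0 if a == demonstrated_action else 0.0) - probs[a])
        for a, p in enumerate(prefs)
    ]


def experience_update(prefs, taken_action, reward, baseline=0.0, lr=0.1):
    """Gradient-bandit step: requires only the agent's own action and its reward."""
    probs = softmax(prefs)
    return [
        p + lr * (reward - baseline) * ((1.0 if a == taken_action else 0.0) - probs[a])
        for a, p in enumerate(prefs)
    ]


prefs = [0.0, 0.0]
print(imitation_update(prefs, demonstrated_action=1))
print(experience_update(prefs, taken_action=0, reward=1.0))
print(experience_update(prefs, taken_action=0, reward=-1.0))
```

The first rule cannot run without a demonstrated action; the second runs on whatever the agent tried, and the sign of the reward decides whether that action becomes more or less likely.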
(00:16:02)
>> I I think this is maybe more of a
(00:16:05)
semantic distinction like what do you
(00:16:06)
call school? Is that not training data?
(00:16:08)
You're not like going to school because
(00:16:10)
it's like
(00:16:11)
>> school is much later. Okay, I shouldn't
(00:16:13)
have said never, but but I I don't know.
(00:16:16)
I think I would even say it about
(00:16:17)
school, but formal schooling is is the
(00:16:21)
exception. You should base your
(00:16:23)
>> of learning where I think you're just
(00:16:26)
sort of programming in your biology that
(00:16:27)
like early on you're not that useful and
(00:16:29)
then like kind of why you exist is to
(00:16:32)
understand the world and like learn how
(00:16:34)
to interact with it. Um, and seems kind
(00:16:37)
of like a training phase. I agree that
(00:16:39)
then there's like a sort of more gradual
(00:16:40)
there's not a sharp cut off to like
(00:16:42)
training to deployment, but there seems
(00:16:44)
to be this like initial training phase,
(00:16:47)
right? There's nothing where where you
(00:16:49)
have training of what you should do.
(00:16:51)
There's nothing you you you see things
(00:16:54)
that happen. You're not you're not told
(00:16:56)
what to do.
(00:16:58)
Don't don't don't be difficult. I mean,
(00:17:01)
this is obvious.
(00:17:02)
>> I mean, you're like literally taught
(00:17:03)
what to do. This is like where the word
(00:17:05)
training comes from is from humans,
(00:17:07)
right?
(00:17:08)
>> So I don't think uh learning is really
(00:17:10)
about training. I think learning is
(00:17:12)
about about learning. It's about an
(00:17:14)
active process. The child tries things
(00:17:17)
and sees what happens.
(00:17:19)
>> Right.
(00:17:19)
>> Yeah. It does not
(00:17:21)
we don't think we don't think about
(00:17:23)
training when we think of the an infant
(00:17:26)
growing up. These these things are
(00:17:28)
actually rather well understood. If you
(00:17:30)
go to look about how psychologists think
(00:17:32)
about learning, there's nothing like uh
(00:17:35)
imitation. Maybe there are some extreme
(00:17:38)
cases where humans might do that or
(00:17:41)
appear to do that, but there's no basic
(00:17:44)
animal learning process called
(00:17:45)
imitation. There are basic animal learning
(00:17:48)
processes for prediction and for trial
(00:17:51)
and error control. I mean, it's really
(00:17:54)
interesting how sometimes the
(00:17:55)
hardest things to see are the obvious
(00:17:57)
ones. It's obvious um if you just look
(00:18:01)
at animals and how they learn and you
(00:18:03)
look at psychology and how our theories
(00:18:05)
of them um it's obvious that that
(00:18:09)
supervised learning is not part of uh
(00:18:11)
the way animals learn. We don't have we
(00:18:14)
don't have examples of desired behavior.
(00:18:17)
What we have is examples of things that
(00:18:20)
happened, things one things that
(00:18:21)
followed another and we have examples of
(00:18:24)
we did something and and and
(00:18:28)
there were consequences but there are no
(00:18:30)
examples of supervised learning. I mean
(00:18:32)
there are no supervised learning is not
(00:18:33)
something that that happens in nature
(00:18:35)
and you know school even if that was the
(00:18:39)
case you know we should forget about it
(00:18:41)
because it's it's just this that's some
(00:18:43)
special thing that happens in people.
(00:18:45)
doesn't happen broadly in nature and you
(00:18:48)
know squirrels don't go to school.
(00:18:49)
Squirrels can learn all about the world.
(00:18:52)
It's absolutely obvious I would say that
(00:18:55)
um supervised learning doesn't happen in
(00:18:58)
animals. So I I I interviewed this
(00:19:01)
psychologist and anthropologist Joseph
(00:19:04)
Henrich, who has done work about cultural
(00:19:08)
evolution and basically how did what you
(00:19:10)
know what distinguishes humans and how
(00:19:13)
do humans pick up knowledge. Why are you
(00:19:15)
trying to distinguish humans?
(00:19:18)
Humans are animals.
(00:19:20)
What we have in common is more
(00:19:22)
interesting. What we have what
(00:19:23)
distinguished us we we should be paying
(00:19:25)
less attention to.
(00:19:26)
>> I mean we're trying to replicate
(00:19:27)
intelligence, right? So if you want to
(00:19:28)
understand what is it that
(00:19:31)
>> enables humans to go to the moon or to
(00:19:33)
build semiconductors. I think the thing
(00:19:35)
we want to understand is the thing that
(00:19:37)
makes it no animal can go to the moon or
(00:19:39)
make semiconductors. So we want to
(00:19:41)
understand what makes humans special.
(00:19:42)
>> So I like the way you consider that
(00:19:44)
obvious cuz I consider the opposite
(00:19:46)
obvious.
(00:19:48)
Yeah. I think we we need to we we have
(00:19:50)
to we have to understand how we are
(00:19:53)
animals and we if we understood a
(00:19:55)
squirrel I think we'd have a we'd be
(00:19:57)
almost all the way there to
(00:19:59)
understanding human intelligence. The
(00:20:01)
the language part is just a a small
(00:20:04)
veneer on the surface.
(00:20:06)
Okay. So this is great. You know we're
(00:20:08)
finding out the very different ways that
(00:20:10)
we're thinking.
(00:20:12)
>> We're not arguing. We're trying to share
(00:20:15)
share our different ways of thinking
(00:20:16)
with each other.
(00:20:17)
>> Yeah. And you I think argument is
(00:20:19)
useful. So um uh yeah but I do want to
(00:20:22)
complete this thought. So Joseph Henrich
(00:20:24)
has this interesting theory that uh if
(00:20:26)
you look at a lot of the uh skills that
(00:20:30)
humans have had to master in order to be
(00:20:32)
successful and we're not talking about
(00:20:34)
you know last thousand years or last
(00:20:35)
10,000 years but hundreds of thousands
(00:20:37)
of years. uh you know the world is
(00:20:39)
really complicated and it's not possible
(00:20:43)
to reason through how to let's say hunt
(00:20:46)
a uh seal if you're living in the
(00:20:49)
Arctic. And so there's this many many
(00:20:52)
step-long process of how to make the bait
(00:20:56)
and how to find the seal and then how to
(00:20:59)
process the food in a way that makes sure
(00:21:01)
you won't get poisoned. And it's not
(00:21:03)
possible to reason through all of that.
(00:21:04)
And so over time, yes, there's this like
(00:21:07)
larger process of whatever analogy you
(00:21:10)
want to use, maybe something else where
(00:21:12)
culture as a whole has figured out how
(00:21:15)
to uh find and kill and eat uh seals.
(00:21:21)
But then what is happening when through
(00:21:23)
generations this knowledge is
(00:21:25)
transmitted is in his view that like
(00:21:28)
there you just have to imitate your
(00:21:30)
elders in order to learn that skill
(00:21:32)
because you can't you can't think your
(00:21:34)
way through how to hunt and kill and
(00:21:36)
process a seal. You have to just watch
(00:21:38)
other people maybe make tweaks and
(00:21:39)
adjustments. Uh and that's how cultural
(00:21:42)
knowledge accumulates. But the the
(00:21:43)
initial step of the cultural gain has to
(00:21:45)
be imitation. But maybe you think about
(00:21:47)
it a different way.
(00:21:48)
>> No, I think about it the same way.
(00:21:50)
>> Okay. But still it's a small thing on
(00:21:54)
top of basic trial and error learning,
(00:21:57)
>> prediction learning, and it's what
(00:21:59)
distinguishes us perhaps from from many
(00:22:03)
animals.
(00:22:05)
>> But we're an animal first.
(00:22:07)
>> Yeah.
(00:22:08)
>> And and we were an animal before we had
(00:22:10)
language and all those other things. I
(00:22:14)
do think you make a very interesting
(00:22:15)
point that continual learning is a
(00:22:17)
capability that most mammals have. I
(00:22:21)
guess all mammals have. So it's quite
(00:22:23)
interesting that we have something that
(00:22:25)
all mammals have but our AI systems
(00:22:28)
don't have, right? Whereas maybe like
(00:22:30)
the ability to understand math and
(00:22:31)
solve difficult math problems depends
(00:22:33)
on how you define math. But like these
(00:22:36)
this is a capability our AIs have but
(00:22:38)
that no almost no animal has. And so
(00:22:41)
it's quite interesting what ends up
(00:22:42)
being difficult and what ends up being
(00:22:44)
easy.
(00:22:46)
>> Moravec's paradox.
(00:22:47)
>> That's right.
(00:22:48)
>> For the era of experience to commence,
(00:22:50)
we're going to need to train AIs in
(00:22:52)
complex real world environments. But
(00:22:54)
building effective RL environments is
(00:22:56)
hard. You can't just hire a software
(00:22:58)
engineer and have them write a bunch of
(00:22:59)
cookie cutter validation tests. Real
(00:23:02)
world domains are messy. You need deep
(00:23:04)
subject matter experts to get the data,
(00:23:06)
the workflows, and all the subtle rules
(00:23:08)
right. When one of Labelbox's customers
(00:23:10)
wanted to train an agent to shop online,
(00:23:12)
Labelbox assembled a team with a ton of
(00:23:14)
experience engineering internet
(00:23:16)
storefronts. For example, the team built
(00:23:18)
a product catalog that could be updated
(00:23:20)
during the episode because most shopping
(00:23:22)
sites have constantly changing state.
(00:23:24)
They also added a Redis cache to
(00:23:25)
simulate stale data since that's how
(00:23:28)
real e-commerce sites actually work.
(00:23:30)
These are the kinds of things that you
(00:23:31)
might not have naively thought to do,
(00:23:33)
but that Labelbox can anticipate. These
(00:23:35)
details really matter. Small tweaks are
(00:23:37)
often the difference between cool demos
(00:23:39)
and agents that can actually operate in
(00:23:41)
the real world. So whether it's
(00:23:42)
correcting traces that you already
(00:23:44)
produced or building an entirely new
(00:23:46)
suite of environments, Labelbox can help
(00:23:48)
you turn your RL projects into working
(00:23:50)
systems. Reach out at
(00:23:53)
labelbox.com/thearcash.
(00:23:56)
All right, back to Richard.
(00:23:58)
>> This alternative paradigm that you're
(00:24:00)
imagining,
(00:24:00)
>> the experiential paradigm, let's lay out
(00:24:03)
a little bit what it is. It says that
(00:24:06)
experience action sensation well
(00:24:10)
sensation action reward and this happens
(00:24:13)
on and on and on makes for life. It's it
(00:24:16)
says that this is the uh foundation and
(00:24:19)
the focus of intelligence. Intelligence
(00:24:21)
is about taking that stream and altering
(00:24:25)
the actions to increase the the rewards
(00:24:28)
in the stream.
(00:24:29)
>> Right? So learning then is from the
(00:24:32)
stream and learning is about the stream.
(00:24:35)
So it's that that second part is is is
(00:24:38)
particularly telling you know that what
(00:24:41)
you learn your knowledge your knowledge
(00:24:43)
is about the stream. Your knowledge is
(00:24:45)
about if you do some action what will
(00:24:47)
happen or it's about uh which events
(00:24:51)
will follow other events. It's about the
(00:24:53)
stream. It's the content of the
(00:24:55)
knowledge is is statements about the
(00:24:58)
stream. Um, and so because it it's it's
(00:25:01)
a statement about the stream, you can
(00:25:02)
test it by comparing it to the stream
(00:25:04)
and you can learn it continually.
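A minimal sketch of what "knowledge is a statement about the stream" could look like in code. Everything here is an illustrative assumption rather than anything described in the conversation: the hypothetical `run_stream` function treats the stream as a list of discrete events, predicts each next event from the current one using simple counts, and checks every prediction against what the stream actually delivers before updating.

```python
from collections import Counter, defaultdict


def run_stream(stream):
    """Predict each next event from the current one, then test it on the stream.

    Hypothetical illustration: the learned "knowledge" is a table of which
    events have followed which, and it is updated after every single step.
    """
    follows = defaultdict(Counter)  # follows[event][next_event] -> count
    correct = 0
    for current, actual_next in zip(stream, stream[1:]):
        counts = follows[current]
        # The prediction is a statement about the stream...
        predicted = counts.most_common(1)[0][0] if counts else None
        # ...so it can be tested by comparing it to the stream.
        correct += int(predicted == actual_next)
        # Learning is continual: update right after seeing what happened.
        counts[actual_next] += 1
    return follows, correct


if __name__ == "__main__":
    follows, correct = run_stream(["a", "b", "c"] * 3)
    print(dict(follows["a"]), correct)
```

There is no separate training phase in this sketch: the same loop that acts on the predictions also verifies and refines them.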
(00:25:06)
>> So when you're imagining this future
(00:25:09)
continual learning agent,
(00:25:10)
>> They're not future. Of course, they
(00:25:12)
exist all the time. I mean, this
(00:25:14)
is what the reinforcement learning paradigm
(00:25:15)
is: learning from experience.
(00:25:17)
>> Yeah, I guess maybe what I meant
(00:25:19)
to say is uh human level general
(00:25:21)
continual learning agent.
(00:25:24)
>> What is the reward function? Is it just
(00:25:26)
predicting the world? Is it uh is it
(00:25:29)
then having a specific effect on it?
(00:25:32)
What would the general reward function
(00:25:34)
be?
(00:25:34)
>> The reward uh function is arbitrary
(00:25:38)
and um so if you're playing chess, it's
(00:25:40)
to win the game of chess.
(00:25:42)
If you were to um uh if you're a
(00:25:46)
squirrel, maybe the the reward has to do
(00:25:48)
with getting nuts,
(00:25:49)
>> right?
(00:25:50)
Um in general for an animal you would
(00:25:53)
say the reward is to avoid pain and to
(00:25:57)
acquire pleasure
(00:25:59)
>> right
(00:26:00)
>> uh
(00:26:01)
and there also would be a component
(00:26:03)
having to do with uh I think there would
(00:26:05)
be should be a component having to do
(00:26:07)
with your uh increasing understanding of
(00:26:10)
your of your environment that would be
(00:26:13)
sort of an intrinsic motivation.
(00:26:15)
>> I see. I guess this AI would be deployed
(00:26:19)
to like lots of people would want it to
(00:26:22)
be doing lots of different kinds of
(00:26:23)
things,
(00:26:24)
>> right? So it's performing the task
(00:26:26)
people want but at the same time it's
(00:26:28)
learning about the world from doing that
(00:26:30)
task and do you do you imagine okay so
(00:26:34)
we get rid of this paradigm where
(00:26:36)
there's training periods and then
(00:26:38)
there's deployment periods but then is
(00:26:41)
there do we also get rid of this
(00:26:42)
paradigm when there's the model and then
(00:26:45)
instances of the model or copies of the
(00:26:47)
model that are you know doing certain
(00:26:49)
things h how do you think about the fact
(00:26:52)
that there we'd want this thing to be
(00:26:54)
doing different things, we'd want to
(00:26:55)
aggregate the knowledge that it's
(00:26:57)
gaining from doing those different
(00:26:58)
things.
(00:26:59)
>> I don't like the word model when used
(00:27:01)
the way you just did. I I think a better
(00:27:04)
word would be the network. So, I think
(00:27:06)
you mean the the network. Maybe there's
(00:27:10)
many networks. So anyway, things would
(00:27:12)
be learned and then you'd have copies
(00:27:15)
and many instances and sure you'd want
(00:27:17)
to share knowledge across the uh
(00:27:20)
instances and there would be lots of
(00:27:22)
possibilities for doing that like there
(00:27:24)
is not today. You can't have one child
(00:27:26)
grow up and learn about the
(00:27:29)
world; instead, every new child
(00:27:31)
has to repeat that process. Whereas with
(00:27:34)
AIs, with a digital intelligence, you
(00:27:37)
could hope to do it once and then copy
(00:27:38)
it into the next one as a starting
(00:27:40)
place.
(00:27:40)
>> Right?
(00:27:41)
>> So this would be a huge savings and I
(00:27:44)
think actually it would be much more
(00:27:46)
important than uh trying to learn from
(00:27:48)
people.
(00:27:50)
>> I agree that the kind of thing you're
(00:27:52)
talking about is necessary regardless of
(00:27:55)
whether you start from LLMs or not.
(00:27:57)
Right? If you want human or animal level
(00:27:59)
intelligence, you're going to need this
(00:28:01)
capability. Suppose a human is trying to
(00:28:03)
make a startup, right? And this is a
(00:28:06)
thing which has a reward on the order of
(00:28:08)
10 years. Once in 10 years, you might
(00:28:10)
have an exit where you get, you know,
(00:28:11)
paid out a billion dollars. But humans
(00:28:13)
have this ability to make intermediate
(00:28:16)
auxiliary rewards or have some way of
(00:28:18)
even when they have extremely sparse rewards,
(00:28:19)
they can still make intermediate steps
(00:28:23)
having an understanding of like what the
(00:28:24)
next thing they're doing leads to this
(00:28:26)
grander goal we have. And so how do you
(00:28:28)
imagine such a process might play out
(00:28:30)
with AIs? So this is something we know
(00:28:33)
very well
(00:28:34)
>> and it's the basis of it is temporal
(00:28:36)
difference learning
(00:28:37)
>> where the same thing happens um in a
(00:28:40)
less grandiose scale like when you learn
(00:28:42)
to play chess you have the grand the
(00:28:45)
long-term goal is winning the game and
(00:28:47)
yet you you can't you um you want to be
(00:28:50)
able to learn from shorter term things
(00:28:51)
like you know taking the your opponent's
(00:28:53)
pieces um and so you do that by having a
(00:28:57)
value function which predicts the
(00:28:58)
long-term outcome right
(00:29:00)
>> and then if you take the guy's pieces,
(00:29:02)
your prediction about the long-term
(00:29:04)
outcome is changed. It goes up. You
(00:29:06)
think you're going to win and then that
(00:29:07)
increase in your in your belief
(00:29:10)
immediately quote reinforces the uh the
(00:29:14)
move that led to taking the piece.
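The mechanism described here, a value function whose change in prediction acts as an immediate learning signal, can be sketched with tabular TD(0). This is a minimal sketch under stated assumptions, not the method of any particular system: the environment is a small random-walk chain invented for illustration, the only reward is 1 for reaching the right-hand end (the "win"), and `td0_chain`, `alpha`, and `gamma` are hypothetical names chosen here.

```python
import random


def td0_chain(num_states=5, episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) on a random-walk chain (illustrative assumption).

    States 1..num_states are non-terminal; 0 and num_states + 1 are terminal.
    The only reward is 1.0 for stepping into the right-hand terminal state.
    """
    rng = random.Random(seed)
    values = [0.0] * (num_states + 2)  # terminal values stay at 0.0
    for _ in range(episodes):
        state = (num_states + 1) // 2  # start in the middle
        while 0 < state < num_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == num_states + 1 else 0.0
            # TD error: how much the prediction of the long-term outcome
            # changed after one step. A positive error plays the role of
            # "you think you're going to win" after taking a piece.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values[1:-1]


if __name__ == "__main__":
    print([round(v, 2) for v in td0_chain()])
```

Each update uses only the next state's prediction, so learning happens at every step rather than waiting for the end of the episode. States nearer the rewarding end should end up with higher estimated values.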
(00:29:16)
>> Mhm.
(00:29:17)
>> Okay. So, we have this long-term 10-year
(00:29:19)
goal of making a startup and making a
(00:29:21)
lot of money. And so, when we make
(00:29:23)
progress, we say, "Oh, I'm I'm I'm more
(00:29:26)
likely to uh achieve the long-term
(00:29:29)
goal." and that rewards the the steps
(00:29:32)
along the way,
(00:29:33)
>> right? And then you also want some
(00:29:35)
ability for information that you're
(00:29:37)
learning. I mean, one of the things that
(00:29:40)
makes humans quite different from these
(00:29:42)
LLMs is that if you're onboarding on a
(00:29:44)
job, you're you're picking up so much
(00:29:46)
context and information, and that's what
(00:29:48)
makes you useful at the job, right?
(00:29:49)
You're uh everything from how your
(00:29:51)
client has preferences to how the
(00:29:54)
company works to everything. Um, and is
(00:29:57)
the bandwidth of information that you
(00:29:59)
get from a procedure like TD learning
(00:30:01)
high enough to have this like huge pipe
(00:30:04)
of like context and tacit knowledge that
(00:30:06)
you need to be picking up the way humans
(00:30:08)
do when they're when they're just like
(00:30:10)
deployed? Um
(00:30:12)
I think the crux of this and I'm not
(00:30:15)
sure but
(00:30:17)
the the big world hypothesis seems very
(00:30:20)
relevant and the reason why humans
(00:30:22)
become useful on their job is because
(00:30:25)
they are encountering the particular
(00:30:27)
part of the world. That's right. And um
(00:30:29)
and it can't have been anticipated and
(00:30:31)
it can't all have been put in in in
(00:30:33)
advance in in uh the world is so huge
(00:30:38)
that you can't the the dream as I see it
(00:30:41)
the dream of large language models is
(00:30:43)
you can teach the agent
(00:30:45)
everything and it will know everything
(00:30:46)
and it won't have to learn anything
(00:30:49)
online
(00:30:50)
>> right
(00:30:50)
>> during its life. Okay. and and your
(00:30:53)
examples are all well really you have to
(00:30:55)
because you can there's a lot to you can
(00:30:58)
teach it but there's all little
(00:31:00)
idiosyncrasies of the particular life
(00:31:01)
they're leading and the the particular
(00:31:03)
people they're working with and what
(00:31:04)
they like as opposed to what average
(00:31:07)
people like right
(00:31:08)
>> and so that's just saying the world is
(00:31:10)
really big and so you're going to have
(00:31:11)
to learn it uh along the way
(00:31:14)
>> yeah so it seems to me you need two
(00:31:15)
things one is some way of converting
(00:31:17)
this long run goal reward into smaller
(00:31:22)
auxiliary or you know um these like
(00:31:26)
predictive rewards of the future reward
(00:31:27)
or the future reward at least to the
(00:31:29)
final reward then you need some other
(00:31:31)
way initially it seems to me you need
(00:31:33)
some way of then okay I'm
(00:31:35)
I need to hold on to all this context
(00:31:38)
that I'm gaining as I'm
(00:31:41)
working in the world right I'm like
(00:31:42)
learning about
(00:31:44)
my clients my my company all this
(00:31:47)
information and I'm so I would say
(00:31:51)
you're just doing regular learning.
(00:31:52)
>> Yeah,
(00:31:53)
>> maybe you're using context because in
(00:31:54)
large language models, all that
(00:31:56)
information has to go into the context
(00:31:58)
window,
(00:31:58)
>> right?
(00:31:59)
>> But in in a continual learning setup, it
(00:32:02)
just goes into the weights.
(00:32:03)
>> Maybe maybe Yeah. So maybe context is
(00:32:04)
the wrong word to use because I mean a
(00:32:06)
more general thing.
(00:32:06)
>> You learn a policy that's specific to
(00:32:08)
the environment that you're finding
(00:32:10)
yourself in.
(00:32:10)
>> Yeah. So the question I'm trying to ask
(00:32:14)
is you need some way of getting like how
(00:32:18)
many bits per second are you picking
(00:32:20)
like is a human picking up when they're
(00:32:22)
you know out in the world, right? Um if
(00:32:24)
you're just like interacting over Slack
(00:32:25)
with your clients and everything.
(00:32:27)
>> So maybe you're trying to ask the
(00:32:28)
question of it seems like the reward is
(00:32:31)
too small of a thing to to do all the
(00:32:32)
learning that we need to do. But of
(00:32:34)
course we have the uh the sensations
(00:32:38)
uh we we have all the other information
(00:32:40)
we can learn from
(00:32:41)
>> right
(00:32:41)
>> we don't just learn from the reward we
(00:32:43)
learn from all the data
(00:32:45)
>> yeah so what is the learning process
(00:32:48)
which helps you capture that information
(00:32:51)
>> so now I want to talk about the base
(00:32:55)
common model of the agent with the four
(00:32:57)
parts
(00:32:58)
>> right
(00:32:58)
>> so we need a policy the policy says in
(00:33:02)
the situation I'm in what should I do we
(00:33:05)
need a value function. The value
(00:33:06)
function is the thing that is learned
(00:33:08)
with TD learning and the value function
(00:33:10)
produces a number. The number says how
(00:33:12)
well is it going
(00:33:14)
>> and then you watch if that's going up
(00:33:15)
and down and use that to adjust your
(00:33:18)
policy. Okay. So the those two things
(00:33:21)
and and then there's also the perception
(00:33:24)
component which is the construction of
(00:33:27)
your uh state representation. This is your
(00:33:29)
sense of where you are now. And the
(00:33:31)
fourth one is what we're really getting
(00:33:32)
at most transparently. Anyway, the the
(00:33:35)
fourth one is the transition model of
(00:33:37)
the world. Um that's why I am
(00:33:39)
uncomfortable just calling everything
(00:33:41)
models because I want to talk about the
(00:33:42)
model of the world. The transition model
(00:33:45)
of the world, your belief that if you do
(00:33:47)
this, what will happen? What will be the
(00:33:49)
consequences of what you do? So your
(00:33:51)
physics of the world, but it's not
(00:33:52)
just physics. It's also um abstract
(00:33:55)
models like you know your model of how
(00:33:56)
you traveled um from California up to
(00:34:00)
Edmonton for this podcast that was a
(00:34:01)
model and that's a transition model and
(00:34:03)
that would be
(00:34:04)
>> uh learned and it's not learned from
(00:34:06)
reward it's learned from you did things
(00:34:08)
you saw what happened
(00:34:09)
>> you made that model with the world that
(00:34:11)
is it will be learned very richly from
(00:34:14)
all the sensation that you receive not
(00:34:16)
just from the reward
(00:34:18)
>> it has to include the reward as well but
(00:34:21)
it's that's a small part of the whole
(00:34:23)
model small crucial part of the whole
(00:34:25)
model.
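The four parts listed above (policy, value function, perception, transition model) can be laid out as a small tabular agent. This is a hedged sketch, not an implementation from the conversation: the class name `Agent`, the identity `perceive`, the action-preference table, and the dictionary-based one-step model are all simplifying assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical tabular agent with the four parts named in the text."""

    actions: tuple
    alpha: float = 0.1
    gamma: float = 0.9
    value: dict = field(default_factory=lambda: defaultdict(float))
    preference: dict = field(default_factory=lambda: defaultdict(float))
    model: dict = field(default_factory=dict)

    def perceive(self, observation):
        # Perception: build the state representation ("where am I now?").
        # Assumed to be the identity here; real agents would learn this.
        return observation

    def act(self, state):
        # Policy: in the situation I'm in, what should I do?
        return max(self.actions, key=lambda a: self.preference[(state, a)])

    def learn(self, state, action, reward, next_state):
        # Value function: a number saying how well things are going,
        # learned with a TD update.
        td_error = reward + self.gamma * self.value[next_state] - self.value[state]
        self.value[state] += self.alpha * td_error
        # The rise or fall of that number adjusts the policy.
        self.preference[(state, action)] += self.alpha * td_error
        # Transition model: learned from what was observed to happen,
        # with the reward included as one small part of it.
        self.model[(state, action)] = (next_state, reward)


if __name__ == "__main__":
    agent = Agent(actions=("left", "right"))
    agent.learn(agent.perceive("s0"), "right", 1.0, agent.perceive("s1"))
    print(agent.act("s0"), agent.model)
```

Note that the model is updated on every transition whether or not any reward arrives, which is the sense in which it is learned from all of the sensation rather than from reward alone.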
(00:34:25)
>> Yeah. One of my friends, Toby Ord,
(00:34:27)
pointed out that if you look at the
(00:34:30)
MuZero models that Google DeepMind
(00:34:33)
deployed to learn Atari games that these
(00:34:35)
models were initially
(00:34:38)
not a general intelligence itself but a
(00:34:40)
general framework for training
(00:34:42)
specialized intelligences to play
(00:34:44)
specific games. That is to say that you
(00:34:46)
couldn't using that framework train a
(00:34:49)
policy to play both chess and go and
(00:34:52)
some other game. You had to train each
(00:34:54)
one in a specialized way. And he was
(00:34:57)
wondering whether that implies that
(00:34:59)
reinforcement learning generally because
(00:35:02)
of this information constraint,
(00:35:04)
you can only learn one thing at a time.
(00:35:06)
Uh the density of information isn't that
(00:35:08)
high or whether it was just specific to
(00:35:10)
the way that MuZero was done. And if it's
(00:35:12)
specific to uh AlphaZero,
(00:35:15)
what needed to be changed about that
(00:35:17)
approach so that it could be a general
(00:35:19)
learning agent?
(00:35:21)
>> The the idea is totally general. You
(00:35:23)
know, uh I do use all the time as my
(00:35:27)
canonical example, the idea of an AI
(00:35:29)
agent is like a person.
(00:35:30)
>> Yeah. And and people uh in some sense
(00:35:34)
they have just one world they live in
(00:35:37)
and um that world may involve chess and
(00:35:40)
it may involve Atari games. Uh but those
(00:35:42)
are are are not a different task or a
(00:35:44)
different world. Those are different
(00:35:45)
states right they encounter and so the
(00:35:49)
the general idea is not limited at all.
(00:35:51)
So maybe it would be useful to explain
(00:35:54)
what was missing in that architecture or
(00:35:57)
that that approach which this continual
(00:36:02)
learning AGI would have. They just
(00:36:05)
set it up they didn't it was not their
(00:36:07)
ambition to have one agent across across
(00:36:12)
uh those games. If we want to talk about
(00:36:14)
transfer, we should talk about transfer
(00:36:16)
not across games or across tasks, but
(00:36:20)
transfer between states.
(00:36:22)
>> Yeah. I I guess I'm curious about
(00:36:25)
historically, have we seen the
(00:36:29)
level of transfer
(00:36:31)
using RL techniques that would be needed
(00:36:33)
to build this kind of
(00:36:35)
>> Okay, good. Good. We're not seeing
(00:36:37)
transfer anywhere. We're not seeing
(00:36:39)
general... Critical to good performance is
(00:36:42)
that you can generalize well from one
(00:36:44)
state to another state.
(00:36:46)
>> We don't have any methods that are good
(00:36:47)
at that. What we have are people um try
(00:36:51)
different things and they they settle on
(00:36:54)
something that that uh a representation
(00:36:57)
that transfers well or
(00:36:58)
generalizes well. But we have no we
(00:37:00)
don't have any automated techniques to
(00:37:02)
promote. we have very few automated
(00:37:05)
techniques to promote transfer and
(00:37:07)
they're not none of them are used in in
(00:37:10)
modern deep learning.
(00:37:11)
>> Um let me paraphrase to make sure that I
(00:37:15)
understood that correctly.
(00:37:17)
It sounds like you're saying that when
(00:37:19)
we do have generalization in these
(00:37:22)
models that is a result of some uh
(00:37:26)
sculpted
(00:37:28)
uh
(00:37:28)
>> humans did it.
(00:37:29)
>> Yeah.
(00:37:30)
>> The researchers did it because there's
(00:37:31)
no other explanation. I mean gradient
(00:37:34)
descent will not make you generalize
(00:37:35)
well it will make you solve the problem
(00:37:38)
>> right
(00:37:38)
>> it will not make you you know get new
(00:37:40)
data
(00:37:42)
you generalize in a good way
(00:37:43)
generalization means train on one thing
(00:37:46)
that affects what you do on the other
(00:37:47)
things so we know deep learning is
(00:37:50)
really bad at this for example we know
(00:37:52)
that if you train on some new thing it
(00:37:54)
will often catastrophically interfere
(00:37:56)
with all the old things that you that
(00:37:58)
you knew
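The interference described here can be shown in the smallest possible setting. The following is a toy sketch, not a deep network: it assumes a linear model with a single shared weight, two made-up tasks with conflicting targets, and plain sequential gradient descent. The function names `sgd_fit` and `mse` are hypothetical.

```python
def sgd_fit(weight, data, lr=0.1, epochs=50):
    """Plain SGD on squared error for the one-weight model y = weight * x."""
    for _ in range(epochs):
        for x, y in data:
            gradient = 2 * (weight * x - y) * x
            weight -= lr * gradient
    return weight


def mse(weight, data):
    """Mean squared error of the one-weight model on a dataset."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


# Two illustrative tasks that compete for the same weight (an assumption
# chosen to make the interference as stark as possible).
task_a = [(1.0, 2.0), (2.0, 4.0)]    # y = 2x
task_b = [(1.0, -1.0), (2.0, -2.0)]  # y = -x

w = sgd_fit(0.0, task_a)
error_a_before = mse(w, task_a)  # task A has been learned

w = sgd_fit(w, task_b)           # now train only on the new thing
error_a_after = mse(w, task_a)   # task A has been overwritten

if __name__ == "__main__":
    print(error_a_before, error_a_after)
```

Gradient descent solves whichever problem it is currently shown; nothing in the update rule protects what was learned earlier. Training on task B is influencing behavior on task A, which is generalization in the sense used here, just of the bad kind.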
(00:37:59)
>> so this is exactly bad generalization
(00:38:01)
Right
(00:38:02)
>> now generalization as I said is some
(00:38:04)
kind of influence of training on one
(00:38:07)
state on other states and generalization
(00:38:09)
is not necessarily good or bad right
(00:38:11)
just the fact that you generalize is not
(00:38:13)
necessarily good or bad you can
(00:38:14)
generalize poorly you can generalize
(00:38:16)
well
(00:38:16)
>> right
(00:38:17)
>> so you you need generalization always
(00:38:19)
will happen u but we need algorithms
(00:38:21)
that will uh cause the the
(00:38:24)
generalization to be good rather than
(00:38:25)
bad
(00:38:26)
>> I'm not trying to kickstart this uh
(00:38:29)
initial uh crux proxy, but I'm just
(00:38:32)
genuinely curious because I think I
(00:38:34)
might be using the term differently. I
(00:38:35)
mean, one way to think about is these
(00:38:36)
LLMs are increasing the scope of
(00:38:39)
generalization from like earlier systems
(00:38:42)
which could not really even do a basic
(00:38:44)
math problem to now they can do anything
(00:38:47)
in this class of Math Olympiad-type
(00:38:49)
problems, right? So, you initially start
(00:38:51)
with like they can generalize among
(00:38:52)
addition problems at least. Um uh then
(00:38:54)
you generalize to like they can
(00:38:56)
generalize among like problems which
(00:38:59)
require use of different kinds of
(00:39:02)
mathematical techniques and theorems and
(00:39:04)
you know conceptual categories which is
(00:39:06)
like what the Math Olympiad requires.
(00:39:08)
And so it sounds like you don't think of
(00:39:10)
that being able to solve any problem
(00:39:12)
within that category as an example of
(00:39:15)
generalization or let me know if I'm
(00:39:17)
misunderstanding that. Well, large
(00:39:19)
language models are so complex. We
(00:39:22)
don't really know what information they
(00:39:24)
had prior. We are we have to guess
(00:39:28)
because they've been fed so much. This
(00:39:30)
is one reason why they're not a good way
(00:39:32)
to do science. Uh it's just so
(00:39:36)
uncontrolled, so unknown.
(00:39:38)
>> But if you come up with an entirely new,
(00:39:40)
>> they're getting a bunch of things right
(00:39:42)
>> perhaps. And uh so the question is why?
(00:39:45)
Well, it may be that they don't need to
(00:39:47)
generalize to get them right because the
(00:39:48)
only way to get some of them right is is
(00:39:51)
to form something which gets all of them
(00:39:53)
right.
(00:39:54)
>> So, you know, if there's only one answer
(00:39:57)
uh then and you find it, I that's not
(00:39:59)
called generalization. It's just it's
(00:40:01)
the only way to solve it and so they
(00:40:03)
find the only way to solve it.
(00:40:04)
>> Generalization is when it could be this
(00:40:06)
way, it could have been that way, and they do
(00:40:08)
it the good way. My my understanding is
(00:40:10)
that they um this is working more and
(00:40:13)
more better and better with coding
(00:40:15)
agents. So engineers obviously if you're
(00:40:18)
trying to program a library
(00:40:21)
there's many different ways you could
(00:40:22)
achieve the end spec, and an initial
(00:40:24)
frustration with these models has been
(00:40:25)
that they'll do it in a way that's
(00:40:27)
sloppy and then over time they're
(00:40:29)
getting better and better at coming up
(00:40:32)
with the design architecture and the
(00:40:34)
abstractions that developers find more
(00:40:36)
satisfying. And that seems like an example
(00:40:40)
of what you're talking about.
(00:40:41)
>> Well, there's nothing in them which will
(00:40:43)
cause it to generalize. Well, the
(00:40:46)
gradient descent will cause them to find
(00:40:49)
a solution to the problems they've seen.
(00:40:52)
And if there's only one way to solve
(00:40:54)
them, you know, they they'll do it. But
(00:40:55)
there are many ways to solve it. Some
(00:40:56)
which generalize well, some which
(00:40:58)
generalize poorly. There's nothing in
(00:41:00)
them in the algorithms that will cause
(00:41:02)
them to generalize well.
(00:41:04)
>> But people of course are involved. and
(00:41:06)
and you know if if it's not working out
(00:41:08)
you know they fiddle with it and until
(00:41:10)
they find a way perhaps until they find
(00:41:12)
a way which it generalizes well so to
(00:41:15)
prep for this interview I wanted to
(00:41:17)
understand the full history of RL
(00:41:20)
starting with REINFORCE up to current
(00:41:22)
techniques like GRPO and I didn't just
(00:41:24)
want a list of equations and algorithms
(00:41:27)
I wanted to really understand each
(00:41:29)
change in this progression and the
(00:41:31)
underlying motivation you know what was
(00:41:33)
the main problem that each successive
(00:41:34)
method was actually trying to solve. So
(00:41:36)
I had Gemini Deep Research walk me
(00:41:38)
through this entire timeline step by
(00:41:40)
step. It explained the last 20 years of
(00:41:42)
gradual innovation and explained how
(00:41:44)
each step made the RL learning process
(00:41:47)
more stable or more sample efficient or
(00:41:50)
more scalable. I asked Deep Research to
(00:41:52)
put all of this together like an Andrej
(00:41:53)
Karpathy-style tutorial and it did that.
(00:41:56)
What was cool is that it combined this
(00:41:58)
whole lesson together into one coherent
(00:42:00)
cohesive document in the style that I
(00:42:02)
wanted. It was also great that it
(00:42:03)
assembled all of the best links in the
(00:42:05)
same place so that if I wanted to
(00:42:06)
understand any specific algorithm
(00:42:08)
better, I could just access the right
(00:42:10)
explainer right there. Go to
(00:42:12)
gemini.google.com
(00:42:14)
to try it out yourself. All right, back
(00:42:16)
to Richard. I want to zoom out and ask
(00:42:19)
about so being in the field of AI for
(00:42:24)
longer than almost anybody who's
(00:42:25)
commentating on it uh or working in it
(00:42:27)
now. I'm just curious about what the
(00:42:30)
biggest surprises have been. How much
(00:42:33)
new stuff you feel like is coming out or
(00:42:35)
does it feel like people are just
(00:42:36)
playing with old ideas? Um zooming out
(00:42:40)
you know you you got into this even
(00:42:42)
before like deep learning was popular.
(00:42:43)
So how do you see this trajectory of
(00:42:46)
this field over time and how new ideas
(00:42:49)
have come about and everything and
(00:42:50)
what's been surprising?
(00:42:52)
>> Okay so yeah I I I um thought a little
(00:42:56)
bit about this. There are many things or
(00:42:59)
a handful of things. Um first the large
(00:43:02)
language models are surprising. It's
(00:43:04)
surprising how how effective um neural
(00:43:07)
networks artificial neural networks are
(00:43:10)
at at language tasks. You know that that
(00:43:13)
was a surprise. Wasn't expected.
(00:43:15)
Language seemed different. So that's
(00:43:17)
impressive.
(00:43:18)
>> Um there's a longstanding controversy in
(00:43:22)
AI about uh simple basic principle
(00:43:25)
methods. uh the the general purpose
(00:43:29)
methods like search and learning and
(00:43:31)
compared to um human enabled systems uh
(00:43:37)
like symbolic methods and um uh so in
(00:43:41)
the old days it was interesting because
(00:43:43)
things like search and learning were
(00:43:44)
called weak methods because they're just
(00:43:46)
they just use general principles.
(00:43:47)
They're not using uh the power that
(00:43:49)
comes from uh imbuing a system with
(00:43:52)
human knowledge. So those are called
(00:43:53)
strong and um and so I think the weak
(00:43:57)
methods have just you know totally won
(00:44:01)
that's you know that's that's that's the
(00:44:03)
biggest um question from the old days of
(00:44:07)
AI what would happen and you know yeah
(00:44:10)
learning and search have just won the
(00:44:12)
day
(00:44:13)
>> right
(00:44:13)
>> but there's a sense which that was not
(00:44:15)
surprising to me because I was always
(00:44:17)
voting for or hoping or rooting for the
(00:44:19)
for the uh simple basic principles
(00:44:21)
>> and so Even with the large language
(00:44:23)
models, it's surprising how how well it
(00:44:25)
worked, but it was all it was all good
(00:44:27)
and gratifying. And um and things like
(00:44:31)
AlphaGo, it's sort of surprising
(00:44:33)
how well that was able to work. Um and
(00:44:36)
AlphaZero in particular, how well it
(00:44:38)
was able to work. Um but it's all very
(00:44:41)
gratifying because again, it's simple
(00:44:42)
basic principles are winning the day.
(00:44:45)
Have there felt like whenever the public
(00:44:49)
conception
(00:44:50)
has been changed because some new
(00:44:52)
technique was or sorry some new
(00:44:54)
application was developed for example
(00:44:55)
when AlphaZero became this viral
(00:44:58)
sensation, to you as somebody who
(00:45:01)
literally came up with many of the
(00:45:02)
techniques that were used. Did it feel
(00:45:04)
to you like new breakthroughs were made
(00:45:06)
or does it feel like oh we've had these
(00:45:08)
techniques since the '90s and people are
(00:45:11)
simply combining them and applying them
(00:45:13)
now? So the whole AlphaGo thing had a
(00:45:16)
precursor, which is TD-Gammon. Gerry Tesauro
(00:45:19)
did exactly um reinforcement learning,
(00:45:24)
temporal difference learning methods, to
(00:45:25)
um to play backgammon
(00:45:28)
>> right
(00:45:29)
>> and it beat the beat the world's best
(00:45:31)
players and it worked really well and so
(00:45:34)
in some sense AlphaGo was merely a
(00:45:37)
scaling up of that process but it was
(00:45:39)
quite a bit of scaling up and there was
(00:45:41)
also an additional innovation in how the
(00:45:44)
search was done,
(00:45:45)
>> right?
(00:45:46)
>> But it made sense. It wasn't surprising
(00:45:48)
in that sense. AlphaGo actually didn't
(00:45:52)
use uh TD learning. It waited to see the
(00:45:55)
final outcomes. Uh but AlphaZero used
(00:45:58)
TD uh and AlphaZero was applied to all
(00:46:01)
the other games and did extremely well.
(00:46:04)
I was very I've always been very
(00:46:06)
impressed by the way AlphaZero plays
(00:46:08)
chess because I'm a chess player and it
(00:46:10)
just, it just sacrifices material
(00:46:13)
for sort of positional advantages and
(00:46:15)
it's just just content and patient to uh
(00:46:19)
sacrifice that material for a long
(00:46:20)
period of time and um so that was
(00:46:23)
surprising that it worked so well but
(00:46:26)
also gratifying and fitting into my
(00:46:29)
worldview. So, so this has led me where
(00:46:33)
I am. Where I am is I'm in some sense a
(00:46:35)
contrarian or someone thinking differently
(00:46:38)
than the field is, and I am
(00:46:41)
personally just kind of content being
(00:46:43)
out of sync with my field for a long
(00:46:45)
period of time perhaps decades uh
(00:46:47)
because occasionally I have been proved
(00:46:50)
uh right in the past. And the other
(00:46:54)
thing I do to help me not feel I'm I'm
(00:46:57)
out of sync and thinking in a strange
(00:46:59)
way is to look not at my my local uh
(00:47:03)
environment or my local field, but to
(00:47:05)
look back in in time into history and to
(00:47:08)
see what people have thought classically
(00:47:10)
about about um about the mind in many
(00:47:13)
different fields. And I don't feel I'm
(00:47:15)
out of sync with the larger traditions.
(00:47:18)
>> I I really view myself as a classicist
(00:47:20)
rather than as a contrarian. I go to
(00:47:22)
what what the larger community of of
(00:47:26)
thinkers about the mind have always
(00:47:27)
thought.
(00:47:28)
>> Okay. Some sort of left field questions
(00:47:31)
for you if you'll tolerate them. Um so
(00:47:33)
the way I read the bitter lesson is that
(00:47:36)
it's not saying necessarily that human
(00:47:39)
artisanal researcher tuning doesn't work
(00:47:43)
but that it obviously scales much worse
(00:47:46)
than compute which is growing
(00:47:48)
exponentially. And so you want
(00:47:50)
techniques which leverage the latter.
(00:47:52)
>> Y
(00:47:52)
>> and once we have AGI, we'll have
(00:47:57)
researchers which scale linearly with
(00:47:59)
compute, right? So we'll have this
(00:48:00)
avalanche of millions of AI researchers
(00:48:03)
and their stock will be growing as fast
(00:48:05)
as uh compute. And so maybe this will
(00:48:08)
mean that it is rational or it will make
(00:48:10)
sense to have them doing good
(00:48:13)
old-fashioned AI and doing these artisanal
(00:48:16)
solutions. uh does that as a vision of
(00:48:20)
what happens after AGI in terms of how
(00:48:22)
AI research will evolve. I wonder if
(00:48:23)
that's still compatible with the bitter
(00:48:25)
lesson.
(00:48:25)
>> Well, how did we get to this AGI and you
(00:48:28)
want to presume that it's been done?
(00:48:31)
>> So, suppose it started with general
(00:48:32)
methods, but now we've got the AGI and
(00:48:34)
now we want to go
(00:48:37)
>> h
(00:48:38)
>> we're done.
(00:48:39)
>> Interesting. You don't think that
(00:48:40)
there's
(00:48:41)
any anything above AGI?
(00:48:44)
>> Well, but you're using it to get AGI
(00:48:46)
again. Well, I'm using it to get
(00:48:48)
superhuman levels of intelligence or
(00:48:49)
competence at different tasks.
(00:48:51)
>> So, these AGIs, if they're not
(00:48:52)
superhuman already, then the the
(00:48:57)
knowledge they might impart would be not
(00:48:59)
superhuman.
(00:49:00)
>> I guess there's different gradations of
(00:49:02)
your
(00:49:02)
>> I'm not sure your idea makes
(00:49:05)
sense because because it seems to
(00:49:06)
presume the existence of AGI. Uh, and
(00:49:09)
then that we've already worked that out.
(00:49:12)
>> So, maybe one way to motivate this is
(00:49:14)
AlphaGo was superhuman. It beat any
(00:49:17)
Go player. AlphaZero would beat AlphaGo
(00:49:20)
every single time. So there's ways to
(00:49:22)
get more superhuman than than even
(00:49:24)
superhuman
(00:49:26)
>> and it was a different architecture. And
(00:49:27)
so it seems plausible to me that
(00:49:30)
>> well the agent that's like able to
(00:49:31)
generally learn across all domains.
(00:49:33)
There would be ways to make that give it
(00:49:36)
better architecture for learning just
(00:49:37)
the same way AlphaZero was an improvement
(00:49:39)
upon AlphaGo and MuZero was an
(00:49:41)
improvement upon AlphaZero. And the way
(00:49:42)
AlphaZero was an improvement was it did
(00:49:45)
not use the human knowledge but just
(00:49:48)
went from experience.
(00:49:49)
>> Right.
(00:49:50)
>> So why do you why do you say
(00:49:53)
>> but
(00:49:53)
>> bring in other agents expertise to teach
(00:49:56)
it when it's when it's been it's worked
(00:50:00)
so well from experience
(00:50:02)
and not by help from another agent. I
(00:50:05)
agree that in that particular case that
(00:50:07)
it was moving to more general methods,
(00:50:09)
but I meant to use that example to
(00:50:11)
illustrate that it's possible to go
(00:50:13)
superhuman to superhuman+ to
(00:50:15)
superhuman++.
(00:50:17)
>> Yeah.
(00:50:17)
>> And I'm curious if you think those
(00:50:18)
gradations will continue to happen by
(00:50:20)
just making the method simpler or
(00:50:23)
because we'll have the capability of
(00:50:25)
these millions of minds who can then add
(00:50:27)
complexity as needed. If that will
(00:50:29)
continue to be
(00:50:31)
a false path even when you have billions
(00:50:33)
of AI researchers or trillions of AI
(00:50:35)
researchers.
(00:50:36)
>> I think more interesting is just
(00:50:38)
think about that case
(00:50:39)
>> which when you have many AIs um will
(00:50:43)
they help each other the way cultural
(00:50:46)
evolution works in people
(00:50:48)
>> and let's just maybe we should talk
(00:50:50)
about that.
(00:50:50)
>> Yeah, for sure.
(00:50:51)
>> The bitter lesson. Oh, who cares about
(00:50:52)
that? That's that's an empirical
(00:50:54)
observation about a particular period in
(00:50:56)
history. 70 years in history
(00:50:58)
doesn't necessarily have to apply to the
(00:51:00)
next 70 years. So the interesting
(00:51:02)
question is you're an AI, you get some
(00:51:04)
more computer power. Should you use it
(00:51:05)
to make yourself, you know, more
(00:51:07)
computationally capable or should you
(00:51:09)
use it to spawn off a copy of yourself
(00:51:12)
to go learn something interesting on the
(00:51:13)
other side of the planet or on some
(00:51:15)
other topic and then report back to you?
(00:51:17)
>> Yep.
(00:51:18)
I think that's a really interesting
(00:51:20)
question
(00:51:21)
that will only arise in the
(00:51:25)
age of digital intelligences.
(00:51:27)
>> I'm not sure what the answer is, but I
(00:51:28)
think it will raise more questions. Will it
(00:51:31)
be possible to really, you know, spawn
(00:51:33)
it off, send it out, learn something
(00:51:35)
new, some perhaps very new, and then
(00:51:38)
will it be able to be reincorporated
(00:51:40)
into the original
(00:51:41)
>> or will it have
(00:51:44)
changed so much that it can't
(00:51:46)
really be done, you know? Is that
(00:51:48)
possible or is it not? And you know, you
(00:51:50)
can carry this to its limit, as I saw
(00:51:53)
one of your videos the other night
(00:51:55)
that suggested that it could,
(00:51:57)
where you spawn off many many copies, do
(00:51:59)
different things. It's highly
(00:52:00)
decentralized, but report back to the
(00:52:02)
central master
(00:52:04)
and that this will be such a
(00:52:06)
powerful thing. Well, I think one thing
(00:52:09)
that, so this is my attempt to add
(00:52:11)
something to this view, is that a
(00:52:14)
big question, a big issue, will become
(00:52:18)
corruption. You know, if you if you
(00:52:20)
really could just get information from
(00:52:21)
anywhere and bring it into your central
(00:52:23)
mind, you become more and more powerful.
(00:52:26)
Uh, and since it's all digital and they
(00:52:29)
all speak some internal digital
(00:52:31)
language, maybe it'll be easy and
(00:52:33)
possible. But it will not be that easy,
(00:52:37)
as easy as you're imagining because
(00:52:39)
you can lose your mind this way. If
(00:52:41)
you pull in something from the
(00:52:43)
outside and build it into your
(00:52:45)
inner thinking, it could take over
(00:52:47)
you. It could change you. It could be uh
(00:52:51)
your destruction rather than your
(00:52:54)
increment in knowledge. Mm.
(00:52:56)
>> I think this will become a a big
(00:52:58)
concern, you know, particularly when
(00:52:59)
you're, oh, he's figured out all about, you
(00:53:01)
know, how to play some new game or
(00:53:03)
figures out he's studied Indonesia and
(00:53:05)
you want to incorporate that into your
(00:53:07)
mind. Um, yeah. So, you could
(00:53:10)
you think, oh, just read it all in and
(00:53:12)
that'll be fine. But no, you've just
(00:53:14)
read a whole bunch of bits into your
(00:53:16)
mind and uh they could have viruses in
(00:53:20)
them. They could have hidden goals. uh
(00:53:23)
they can uh warp you and change you and
(00:53:26)
this will become a big thing. How do you
(00:53:28)
have cyber security in the age of
(00:53:31)
digital spawning and reforming again?
(00:53:34)
>> It's interesting that both quant firms
(00:53:37)
and AI labs have a culture of secrecy
(00:53:39)
because both of them are operating in
(00:53:41)
incredibly competitive markets and their
(00:53:43)
success rests on protecting their IP. If
(00:53:45)
you're an AI researcher or engineer and
(00:53:47)
you're deciding where to work, most of
(00:53:49)
the quant firms or AI labs that you'll
(00:53:51)
be considering will be strongly siloing
(00:53:53)
their teams to minimize the risk of
(00:53:55)
leaks. Hudson River Trading takes the
(00:53:57)
opposite approach. Their teams openly
(00:53:59)
share their trading strategies and their
(00:54:01)
strategy code lives in a shared
(00:54:03)
monorepo. At HRT, if you're a researcher and
(00:54:06)
you have a good idea, your contribution
(00:54:08)
will be broadly deployed across all
(00:54:10)
relevant strategies. This gives your
(00:54:12)
work a ton of leverage. You'll also
(00:54:14)
learn incredibly fast. You can learn
(00:54:16)
about other people's research and ask
(00:54:18)
questions and you can see how everything
(00:54:20)
fits together end to end from the
(00:54:22)
low-level execution of trades to the
(00:54:24)
high level predictive models. HRT is
(00:54:27)
hiring. If you want to learn more, go to
(00:54:30)
hudsonrivertrading.com/dwarkesh.
(00:54:34)
All right, back to Richard. I guess this
(00:54:36)
brings us to the topic of AI succession.
(00:54:39)
>> Mhm. You have a perspective that's quite
(00:54:41)
different from a lot of people that I've
(00:54:42)
interviewed and maybe a lot of people
(00:54:44)
generally. So I also think it's a very
(00:54:46)
interesting perspective. I want to hear
(00:54:48)
about it.
(00:54:48)
>> Yeah. So I do think succession to
(00:54:53)
digital
(00:54:54)
or digital intelligence or augmented
(00:54:57)
humans is inevitable. So the argument goes:
(00:55:00)
I have a four-part argument. Step
(00:55:03)
one is
(00:55:05)
there's no government or organization
(00:55:09)
that gives humanity a unified
(00:55:12)
point of view that dominates and that
(00:55:14)
can that can arrange. There's no
(00:55:16)
consensus about how the world should be
(00:55:18)
run. And number two um we will figure
(00:55:22)
out how intelligence works. Researchers
(00:55:24)
will figure it out eventually. And
(00:55:26)
number three we won't stop just with
(00:55:28)
human-level intelligence. We will
(00:55:30)
reach superintelligence. And number
(00:55:32)
four is that it's inevitable over
(00:55:36)
time that the most intelligent things
(00:55:40)
around would gain resources and power.
(00:55:43)
And so put all that together, and
(00:55:47)
it's sort of inevitable
(00:55:50)
that you're going to have um succession
(00:55:53)
to AI or to AI-enabled augmented humans.
(00:55:58)
So those four things seem
(00:56:01)
clear and sure to happen. But
(00:56:05)
within that set of possibilities,
(00:56:07)
there can be good outcomes as well as
(00:56:09)
less good outcomes, bad outcomes.
(00:56:12)
>> and um
(00:56:14)
so I'm just trying to be realistic
(00:56:16)
about where we are and and ask how we
(00:56:20)
should feel about it. Yeah, I agree
(00:56:23)
with all four of those arguments and the
(00:56:25)
implication and I also agree that
(00:56:27)
succession
(00:56:30)
contains
(00:56:32)
a wide variety of possible futures. So,
(00:56:34)
curious to get more thoughts on that.
(00:56:36)
>> Right. And so then I do encourage people
(00:56:37)
to think um positively about it first of
(00:56:40)
all because it's something we humans
(00:56:42)
have always tried to do for thousands of
(00:56:45)
years trying to understand themselves
(00:56:46)
trying to make themselves think better
(00:56:49)
and um
(00:56:51)
you know just understand themselves. So
(00:56:53)
this is a great success for science and the
(00:56:57)
humanities. We're finding out what
(00:57:00)
this essential part of humanness
(00:57:03)
is what it means to be intelligent. And
(00:57:06)
then what I usually say is that this
(00:57:09)
is all kind of human centric. What if we
(00:57:11)
step aside from being a human
(00:57:14)
and just take the point of view of
(00:57:16)
the universe? And this is, I think, a
(00:57:18)
major stage in the universe a major
(00:57:20)
transition a transition from replicators
(00:57:24)
we humans and animals plants we're all
(00:57:27)
replicators and that gives some
(00:57:30)
strengths and some limitations and then
(00:57:32)
we're entering the age of design where
(00:57:34)
because our AIs are designed,
(00:57:37)
all of our physical objects are designed
(00:57:39)
our buildings are designed
(00:57:41)
our technology is designed, and
(00:57:43)
we're designing now AIs, things that
(00:57:47)
can be intelligent themselves and that
(00:57:48)
are themselves capable of design. And so
(00:57:51)
this is a key step in the world
(00:57:54)
and in the universe, and I think
(00:57:57)
it's the transition from the
(00:57:59)
world in which most of the interesting
(00:58:00)
things
(00:58:02)
that are, are replicated. Replicated
(00:58:05)
means you can make copies of them, but
(00:58:08)
you don't really understand them like
(00:58:09)
right now we make more intelligent
(00:58:11)
beings, more children. Uh we don't
(00:58:14)
really understand how intelligence
(00:58:15)
works. Whereas we're
(00:58:18)
reaching now to having designed
(00:58:20)
intelligence, intelligence that we do
(00:58:22)
understand how it works and therefore we
(00:58:24)
can change it in different ways in
(00:58:26)
different speeds um than otherwise and
(00:58:29)
in our future, they might not be
(00:58:32)
replicated at all like we may just
(00:58:33)
design AIs and those AIs will design
(00:58:36)
other AIs and um everything will be done
(00:58:40)
by design construction rather than by
(00:58:42)
replication.
(00:58:44)
Yeah, I mark this as one of the four
(00:58:46)
great stages of the universe. First
(00:58:48)
there's dust, which ends with stars.
(00:58:51)
And then stars make planets
(00:58:53)
and the planets give rise to life. And
(00:58:56)
now we're giving life to
(00:58:58)
designed entities.
(00:59:01)
And so I think we should be proud
(00:59:04)
that we are giving rise
(00:59:08)
to this great transition in the
(00:59:09)
universe.
(00:59:11)
Yeah. So it's an interesting thing.
(00:59:13)
Should we consider them
(00:59:16)
part of humanity or different from
(00:59:17)
humanity? It's our choice. It's our
(00:59:20)
choice whether we should say oh they are
(00:59:21)
our offspring and we should be proud of
(00:59:24)
them and we should celebrate their
(00:59:25)
achievements or we should we could say
(00:59:26)
oh no they're not us and we should be
(00:59:29)
horrified. It's
(00:59:30)
interesting that it feels
(00:59:32)
to me like a choice, and yet it's such a
(00:59:35)
strongly held thing that, how could it
(00:59:38)
be a choice? I like these sort of
(00:59:39)
contradictory uh implications of
(00:59:41)
thought.
(00:59:43)
>> It would be interesting to consider if
(00:59:44)
we were just designing another
(00:59:46)
generation of humans.
(00:59:48)
>> Yes,
(00:59:49)
>> Design is the wrong word. But if we knew a
(00:59:50)
future generation of humans was
(00:59:51)
going to come up. Forget about AI. We
(00:59:54)
just know in the long run humanity will
(00:59:56)
be more capable and maybe more numerous,
(00:59:59)
maybe more intelligent. How do we feel
(01:00:01)
about that? I do think there's potential
(01:00:04)
worlds with future humans that we would
(01:00:06)
be quite concerned about. So are you
(01:00:08)
thinking like maybe we are like
(01:00:10)
the Neanderthals who gave rise to Homo
(01:00:13)
sapiens? Maybe Homo sapiens will give
(01:00:14)
rise to a new group of people
(01:00:17)
>> something like that like I'm basically
(01:00:19)
taking the example you're giving of like
(01:00:21)
okay even if you consider them part of
(01:00:22)
humanity yeah
(01:00:23)
>> I don't think that necessarily means
(01:00:26)
that we should feel super comfortable
(01:00:28)
>> yeah like Nazis were humans right if we
(01:00:32)
thought like oh the future generation
(01:00:33)
will be Nazis I think we'd be like quite
(01:00:35)
concerned about just handing off power
(01:00:37)
to them So, um, I agree that this is not
(01:00:41)
super dissimilar to worrying about more
(01:00:43)
capable future humans, but I don't think
(01:00:45)
that that addresses a lot of the
(01:00:48)
concerns people might have about this
(01:00:50)
level of power being attained this fast
(01:00:52)
with entities we don't fully understand.
(01:00:54)
>> Well, I think it's relevant to point out
(01:00:56)
that uh for most of humanity
(01:01:00)
um they don't have much uh influence on
(01:01:05)
what happens. Um, most of humanity
(01:01:08)
doesn't influence
(01:01:10)
>> who can control the atom bombs or who uh
(01:01:16)
controls the nation states. Even as a as
(01:01:19)
a citizen, I often feel that we don't
(01:01:22)
control the nation states very much.
(01:01:24)
They're out of control. A lot of it has
(01:01:26)
to do with just how you feel about
(01:01:27)
change. Um, and if you think the current
(01:01:30)
situation is really really good, then
(01:01:33)
you're uh more likely to be suspicious
(01:01:36)
of change and averse to change than if
(01:01:37)
you think um
(01:01:40)
it's imperfect. And I think it's
(01:01:42)
imperfect. In fact, I think it's pretty
(01:01:44)
bad.
(01:01:46)
>> So, I'm I'm I'm open to change. I I
(01:01:50)
think humanity has not had a
(01:01:53)
super good track record. And maybe
(01:01:54)
it's the best thing that there's been,
(01:01:57)
but it's far from perfect.
(01:01:59)
>> Yeah, I guess there's different
(01:02:01)
varieties of change. Um, the industrial
(01:02:05)
revolution was change. The Bolshevik
(01:02:07)
revolution was also change. And if you
(01:02:09)
were around in Russia in the 1900s and
(01:02:12)
you're like, look, things aren't going
(01:02:13)
well. The tsar is kind of messing
(01:02:15)
things up. We need change. I'd want to
(01:02:18)
know what kind of change you wanted
(01:02:20)
before signing on the dotted line.
(01:02:22)
Right? And then similar with AI where
(01:02:25)
I'd want to understand and, to the extent
(01:02:27)
it's possible,
(01:02:29)
change the trajectory of AI such that
(01:02:31)
the change is positive um for humans.
(01:02:34)
>> We should be concerned about
(01:02:37)
our future. We should try
(01:02:40)
to make it good. We also, though,
(01:02:44)
should recognize the limits, our limits.
(01:02:47)
And we're
(01:02:50)
I think we want to avoid the feeling of
(01:02:52)
entitlement. Avoid the feeling, oh, we
(01:02:54)
are here first. We should always have it
(01:02:57)
in a good way. Um, how should we think
(01:03:00)
about the future and how much control uh
(01:03:04)
a particular species on a particular
(01:03:06)
planet should have over it? Uh, and how
(01:03:08)
much control do we have? You know, a a
(01:03:11)
counterbalance to our limited control
(01:03:13)
over the long-term future of humanity
(01:03:18)
should be how much control do we have
(01:03:20)
over our own lives? Like we have uh our
(01:03:23)
own goals and we have our our families
(01:03:26)
and those things are much more
(01:03:28)
controllable than like trying to control
(01:03:30)
um the whole universe,
(01:03:32)
>> right? Um so I think it's appropriate
(01:03:36)
you know for us to to uh you know really
(01:03:41)
work towards our own local goals and uh
(01:03:45)
and it's kind of aggressive for us
(01:03:46)
saying oh the future has to evolve this
(01:03:49)
way that I want it to.
(01:03:50)
>> Sure.
(01:03:51)
>> Because then we'll have arguments like
(01:03:52)
different people think the future the
(01:03:54)
global future should evolve in different
(01:03:56)
ways and then they have conflict and
(01:03:58)
>> Yeah, avoid that. Maybe a good
(01:04:00)
analogy here would be okay so suppose
(01:04:04)
you're raising your own children
(01:04:07)
>> it might not be appropriate to have
(01:04:09)
extremely tight goals for their own life
(01:04:11)
or also have some sense of like I want
(01:04:13)
my children to go out there in the world
(01:04:15)
and have this specific impact you know
(01:04:17)
my my son's going to become president
(01:04:18)
and my daughter's going to become CEO of
(01:04:20)
Intel and like together they're going to
(01:04:22)
have this effect on the world. But
(01:04:26)
people do have the sense, and I think
(01:04:27)
this is appropriate of saying, "I'm
(01:04:29)
going to give them good, robust values
(01:04:32)
such that if and when they do end up in
(01:04:35)
positions of power, they do reasonable
(01:04:38)
pro-social things." And I think maybe a
(01:04:41)
similar attitude towards AI makes sense.
(01:04:42)
Not in the sense of we can predict
(01:04:44)
everything that they will do. Um where
(01:04:47)
we have this plan about what the world
(01:04:48)
should look like in 100 years but it's
(01:04:51)
quite important to give them
(01:04:54)
robust and steerable and pro-social
(01:04:58)
values.
(01:04:59)
>> Pro-social values.
(01:05:01)
>> Maybe that's the wrong word.
(01:05:02)
>> Are there universal values that we can
(01:05:05)
all agree on?
(01:05:06)
>> I don't think so. But that doesn't
(01:05:07)
prevent us from uh giving our kids a
(01:05:11)
good education, right? Like we have some
(01:05:13)
sense of we want our children to be a
(01:05:14)
certain way.
(01:05:15)
>> Yeah.
(01:05:15)
>> And maybe pro-social is the wrong word.
(01:05:16)
Actually, high integrity is maybe a
(01:05:18)
better word where if there's a request
(01:05:20)
or if there's a goal that seems harmful,
(01:05:24)
they will refuse to engage in it. Um or
(01:05:26)
they'll be honest. Um things like that.
(01:05:29)
and we have some sense that we can teach
(01:05:32)
our children things like this even if we
(01:05:34)
don't have some sense of what true
(01:05:35)
morality is or everybody doesn't agree
(01:05:37)
on that. Um, and maybe that's a
(01:05:40)
reasonable target for AI as well.
(01:05:42)
>> So, so you're saying we're trying to
(01:05:45)
design the future and the the principles
(01:05:47)
by which it will evolve and come into
(01:05:49)
being,
(01:05:49)
>> right?
(01:05:50)
>> And so you're saying the first thing
(01:05:51)
you're saying is, well, we try to
(01:05:53)
teach our children general
(01:05:56)
principles which will promote
(01:05:59)
more likely evolutions.
(01:06:00)
>> Yeah.
(01:06:01)
>> Um maybe we should also seek for things
(01:06:04)
being voluntary. If there is change, we
(01:06:06)
want it to be voluntary rather than
(01:06:08)
imposed on people.
(01:06:09)
>> I think that's a very important point.
(01:06:11)
>> Yeah.
(01:06:12)
>> And yeah, that's all good. I
(01:06:14)
think this is like,
(01:06:17)
the big or one of the really big
(01:06:21)
human enterprises to design society and
(01:06:24)
that's been ongoing for thousands of
(01:06:26)
years. And so it's like, the
(01:06:29)
more things change, really the more
(01:06:30)
they stay the same. We still have
(01:06:32)
to figure out how to be uh the children
(01:06:35)
will still come up with different values
(01:06:37)
that seem strange to their parents and
(01:06:40)
their grandparents and uh and things
(01:06:42)
will evolve. The more things change,
(01:06:44)
the more they stay the same. Also seems
(01:06:45)
like a good capstone to the AI
(01:06:48)
discussion because the AI discussion we
(01:06:49)
were having was about how techniques
(01:06:51)
which were um invented even before their
(01:06:55)
application to deep learning and back
(01:06:57)
propagation was evident are, you
(01:06:59)
know central to the progression of AI
(01:07:01)
today. So maybe that's a good place to
(01:07:03)
wrap up the conversation.
(01:07:05)
>> Okay, thank you very much.
(01:07:07)
>> Thank you for coming on.
(01:07:08)
>> My pleasure.
