Title: GPT-5.3 Codex VS Opus 4.6 : I TESTED BOTH Models EXTENSIVELY & ONE IS A CLEAR WINNER!
Duration: 00:15:43
(00:00:05)
Hi, welcome to another video. So,
(00:00:09)
Anthropic dropped Opus 4.6 and OpenAI
(00:00:13)
dropped Codex 5.3.
(00:00:16)
Both companies dropped their models
(00:00:18)
within about an hour of each other, which
(00:00:20)
is extremely competitive and
(00:00:22)
interesting. Since they are doing this,
(00:00:25)
let's talk about them and see what these
(00:00:27)
models actually are and what they
(00:00:30)
bring to the table, and then we'll
(00:00:32)
compare them on my benchmarks. Let's
(00:00:35)
start with Opus 4.6.
(00:00:38)
So Opus 4.6 is Anthropic's new flagship
(00:00:42)
model and it's a pretty meaningful
(00:00:44)
upgrade over Opus 4.5
(00:00:47)
especially for agentic coding. It plans
(00:00:51)
more carefully, sustains agentic tasks
(00:00:53)
for longer, operates reliably in massive
(00:00:56)
codebases, and catches its own mistakes.
(00:01:01)
It's also the first Opus-class model
(00:01:03)
with a 1 million token context window,
(00:01:06)
which is in beta, and it can output up
(00:01:09)
to 128,000 tokens.
(00:01:13)
So, for those of you building full
(00:01:15)
features or doing large refactors with
(00:01:18)
AI, that's a pretty big deal. On
(00:01:21)
benchmarks, Opus 4.6 scored 65.4%
(00:01:27)
on Terminal-Bench 2.0, up from 59.8% on
(00:01:32)
Opus 4.5.
(00:01:34)
On SWE-bench Verified, it got 80.8%,
(00:01:39)
basically the same as before. Where it
(00:01:42)
really shines is OSWorld for computer
(00:01:44)
use, at 72.7%,
(00:01:47)
up from 66.3%.
(00:01:50)
And on ARC-AGI-2, it went crazy with
(00:01:53)
68.8%,
(00:01:55)
nearly double its predecessor.
(00:01:58)
BrowseComp came in at 84% and
(00:02:01)
Humanity's Last Exam at 53.1% with
(00:02:04)
tools. So, strong improvements across
(00:02:08)
the board.
(00:02:10)
They also introduced adaptive thinking
(00:02:13)
where you can set the effort level from
(00:02:15)
low to max and it actually affects
(00:02:18)
performance.
(00:02:20)
Pricing stays the same as Opus 4.5 at $5
(00:02:24)
per million input and $25 per million
(00:02:27)
output. No price increase for a better
(00:02:30)
model, which is great. Now let's talk
(00:02:33)
about GPT-5.3 Codex. OpenAI is
(00:02:39)
calling this their most capable agentic
(00:02:41)
coding model to date. The key thing
(00:02:44)
about GPT-5.3 Codex is that it combines
(00:02:48)
the coding performance of GPT-5.2 Codex
(00:02:52)
with the reasoning and professional
(00:02:53)
knowledge capabilities of GPT-5.2
(00:02:57)
in one model, and it's 25% faster.
(00:03:02)
Now, one of the wildest things about
(00:03:04)
this model is that OpenAI says it's the
(00:03:08)
first model that was instrumental in
(00:03:10)
creating itself. They used early
(00:03:13)
versions of GPT-5.3 Codex to debug its
(00:03:17)
own training, manage its own deployment,
(00:03:20)
and diagnose its own test results. So,
(00:03:24)
the model literally helped build itself.
(00:03:27)
That's a pretty significant claim. On
(00:03:29)
benchmarks, GPT-5.3 Codex scored 77.3%
(00:03:35)
on Terminal-Bench 2.0.
(00:03:38)
For context, Opus 4.6 got 65.4%
(00:03:43)
and GPT-5.2 Codex got 64%. So that's a
(00:03:49)
massive jump and a clear lead on
(00:03:51)
terminal coding.
(00:03:53)
On SWE-Bench Pro, it scored 56.8%, slightly
(00:03:58)
up from 56.4%
(00:04:00)
on GPT-5.2 Codex. On OSWorld-Verified, it
(00:04:06)
got 64.7%.
(00:04:09)
And on GDPval, which evaluates
(00:04:11)
professional knowledge work across 44
(00:04:14)
different jobs, it scored 70.9% wins or
(00:04:18)
ties.
(00:04:20)
One thing that really stands out is
(00:04:22)
cybersecurity.
(00:04:24)
GPT-5.3
(00:04:26)
Codex scored 77.6%
(00:04:30)
on cybersecurity CTF challenges, and
(00:04:33)
OpenAI has classified it as their first
(00:04:36)
high-capability model for cybersecurity
(00:04:39)
under their Preparedness Framework.
(00:04:42)
It's the first model they've directly
(00:04:43)
trained to find software
(00:04:45)
vulnerabilities.
(00:04:47)
So that's a whole new dimension. It's
(00:04:50)
also more efficient.
(00:04:53)
OpenAI says it uses less than half the
(00:04:56)
tokens of its predecessor for the same
(00:04:58)
tasks.
(00:05:00)
The context window is 400,000 tokens with a
(00:05:05)
128,000-token output limit. It's
(00:05:08)
available now to all ChatGPT users
(00:05:11)
through the Codex app, CLI, and IDE
(00:05:14)
extension.
(00:05:16)
If you use the Codex app, then you get
(00:05:18)
one month of free access as well. So
(00:05:21)
that's great. API access is coming soon,
(00:05:25)
but pricing for that hasn't been
(00:05:27)
announced yet. Now, those are the
(00:05:30)
official numbers from both companies,
(00:05:33)
but as you all know, official benchmarks
(00:05:35)
only tell part of the story. So, let me
(00:05:39)
show you how these two actually perform
(00:05:41)
on my benchmarks. Let's start with the
(00:05:43)
non-agentic test for Opus 4.6.
(00:05:47)
I ran it on my KingBench, which has 11
(00:05:50)
questions: three general knowledge and
(00:05:53)
eight coding. Opus 4.6 scored 100%,
(00:05:59)
220 out of 220.
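The arithmetic behind that total can be sketched as follows. This is only an illustration and assumes every question is graded out of 20 points, which is what the per-question "20 out of 20" scores and the 220 total imply; the constant and function names are hypothetical.

```python
# Assumption: each KingBench question is graded out of 20 points.
POINTS_PER_QUESTION = 20
GENERAL_QUESTIONS = 3   # general knowledge questions, as stated in the video
CODING_QUESTIONS = 8    # coding questions, as stated in the video


def max_score(num_questions: int, points_each: int = POINTS_PER_QUESTION) -> int:
    """Highest total score reachable on a run with `num_questions` questions."""
    return num_questions * points_each


def percentage(earned: int, possible: int) -> float:
    """Score expressed as a percentage of the maximum."""
    return 100.0 * earned / possible


total_questions = GENERAL_QUESTIONS + CODING_QUESTIONS
possible = max_score(total_questions)
print(total_questions, possible, percentage(220, possible))
```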
(00:06:02)
Perfect score. Every single question, 20
(00:06:06)
out of 20. That has never happened
(00:06:09)
before on my benchmark with any model.
(00:06:12)
Let me quickly walk you through the
(00:06:13)
coding generations.
(00:06:16)
The 3D floor plan came out really clean.
(00:06:19)
It built a proper Three.js scene with two
(00:06:23)
bedrooms, two bathrooms, living room,
(00:06:26)
kitchen, dining, and hallway. All in the
(00:06:30)
correct 1,585
(00:06:33)
sq ft layout. Perfect 20 out of 20. The SVG
(00:06:38)
panda holding a burger was well
(00:06:41)
structured with proper body proportions,
(00:06:44)
belly patch, legs, and it was actually
(00:06:47)
holding the burger. 20 out of 20. The
(00:06:51)
Pokeball in Three.js rendered a proper 3D
(00:06:54)
Pokeball with animation. Clean code,
(00:06:58)
single HTML file. Everything worked. 20
(00:07:02)
out of 20. The chessboard was fully
(00:07:05)
functional. All pieces placed correctly.
(00:07:08)
Legal move validation. And the autoplay
(00:07:11)
feature actually worked where both sides
(00:07:13)
make legal moves automatically.
(00:07:16)
This is a hard one and it nailed it. 20
(00:07:19)
out of 20. The 3D Minecraft was insane.
(00:07:23)
It called it Kandinsky Edition. And it
(00:07:26)
had procedurally generated hand-drawn
(00:07:28)
style textures, trees, and smooth
(00:07:31)
terrain. Full first-person controls. 20
(00:07:36)
out of 20. The 3D butterfly in a garden
(00:07:39)
was a complete Three.js scene with an
(00:07:43)
animated butterfly flying through a
(00:07:45)
garden with camera controls.
(00:07:48)
Beautiful output. 20 out of 20. The Rust
(00:07:53)
CLI tool for image conversion had a
(00:07:55)
proper Cargo.toml
(00:07:57)
with clap for argument parsing and the
(00:08:00)
image crate, supporting 13+ formats
(00:08:03)
including PNG, JPEG, GIF, WebP, AVIF,
(00:08:08)
TIFF, and more. Well-structured project,
(00:08:12)
20 out of 20. And the Blender script for
(00:08:15)
the Pokeball was clean Python with
(00:08:18)
proper scene cleanup and PBR material setup
(00:08:21)
with the right Pokeball red, white, and
(00:08:24)
black colors. Ready to paste into
(00:08:26)
Blender and run. 20 out of 20. On the
(00:08:31)
general side, the math problems and the
(00:08:33)
riddle were all perfect as well, 100%
(00:08:37)
across the board. Now, looking at the
(00:08:40)
leaderboard, Opus 4.6 is sitting at
(00:08:44)
number one, tied with Gemini 3 Pro, both
(00:08:47)
at 100%.
(00:08:49)
But Gemini 3 Pro only cost about 85 cents for
(00:08:53)
the full run, while Opus 4.6 cost
(00:08:57)
around $6.39.
(00:09:01)
After that, there's a massive drop.
(00:09:04)
Opus 4.5 Max is at 74% in third place.
(00:09:09)
And then GPT-5.2 xhigh is down at fifth
(00:09:13)
place with just 65%.
(00:09:16)
So on non-agentic coding, both Opus 4.6
(00:09:21)
and Gemini 3 Pro are in a league of
(00:09:23)
their own right now. Now, for GPT-5.3
(00:09:27)
Codex, here's the thing. As of right
(00:09:31)
now, there is no API access for GPT-5.3
(00:09:35)
Codex. OpenAI hasn't released it on the
(00:09:38)
API yet, so I cannot run it on my
(00:09:41)
benchmarks. I literally have no way to
(00:09:44)
test it on KingBench. Once the API is
(00:09:47)
available, I will test it and make a
(00:09:50)
separate video on it or update this one.
(00:09:54)
Now let's move to the agentic tests.
(00:09:57)
This is where things get really
(00:09:58)
interesting because agentic coding is a
(00:10:01)
completely different game. There are
(00:10:03)
seven questions. So let's look at both
(00:10:07)
of their generations one by one. First
(00:10:11)
we have the Expo mobile movie tracker
(00:10:13)
app. First let's look at the one from
(00:10:16)
Opus and it is quite good. On the
(00:10:19)
homepage, you see the movies you have
(00:10:21)
watched and a calendar showing when
(00:10:24)
you watched them. You can go to the
(00:10:26)
search tab and easily find
(00:10:28)
movies, add a new one, see its inner
(00:10:32)
pages, add it to the watch list, and
(00:10:35)
so on. It all works really well,
(00:10:38)
and it was done in one shot, which
(00:10:41)
is great. Then we've got GPT-5.3
(00:10:45)
Codex and, well, it's not good. It kind
(00:10:48)
of works, but Codex implemented
(00:10:51)
everything in one file for some odd
(00:10:53)
reason, which was obviously not great to
(00:10:56)
see. And while it kind of did
(00:10:58)
everything,
(00:11:00)
it was very lackluster in real usage.
(00:11:03)
And while working with it, for some
(00:11:05)
reason, it uses cat to write lines to
(00:11:08)
files instead of using an edit tool.
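To make that complaint concrete, here is a minimal sketch of the difference between rewriting a whole file and applying a targeted edit. It is not how Codex or Claude Code actually implement file writes; the `apply_edit` helper and the sample source are hypothetical and exist only for illustration.

```python
def apply_edit(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`, or fail loudly."""
    matches = text.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly 1 match for {old!r}, found {matches}")
    return text.replace(old, new, 1)


# Hypothetical file contents with a one-character bug in `add`.
source = (
    "def add(a, b):\n"
    "    return a - b\n"
    "\n"
    "def mul(a, b):\n"
    "    return a * b\n"
)

old, new = "return a - b", "return a + b"
fixed = apply_edit(source, old, new)

# Whole-file rewrite (the `cat > file <<EOF` style): the agent has to emit
# every character of the file again, even for a one-token fix.
rewrite_payload = len(fixed)

# Targeted edit: the agent only emits the old and new snippets.
edit_payload = len(old) + len(new)

print(rewrite_payload, edit_payload)
```

The gap between the two payloads grows with file size, which is one plausible reason a cat-based workflow feels slower on real projects.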
(00:11:11)
This happens a lot of the time, for some
(00:11:13)
reason. I used extra high reasoning for
(00:11:16)
this, so this is the best you can get,
(00:11:19)
apparently. Not a great response
(00:11:21)
from Codex, for sure. Next up, we've got
(00:11:25)
the calculator question. Here, I ask the
(00:11:29)
agent to build me a graphical calculator
(00:11:31)
that works in the terminal with Go, and,
(00:11:34)
well, Opus does fairly well. It is not
(00:11:37)
the best by any means, but it's fine and
(00:11:39)
workable. However, the one by Codex is
(00:11:43)
full of bugs and does not work.
(00:11:46)
So yeah, it's not great. Then we've got
(00:11:50)
Godot, and both of them do well on this.
(00:11:53)
So that's good. Next, we have the Kanban
(00:11:56)
app in Svelte. Opus nails it here again.
(00:12:00)
It works end to end without any bugs,
(00:12:02)
and the UI is also really good. The one
(00:12:05)
from OpenAI opens the login page, but it
(00:12:09)
just errors out after that.
(00:12:11)
So yeah, Codex is not good. Then we
(00:12:15)
have the Nuxt app. Here I ask it to make
(00:12:18)
me a Stack Overflow clone in Nuxt. If we
(00:12:21)
look at Codex, then it looks really
(00:12:22)
good, but it again falls apart in
(00:12:25)
authentication and gives a CSRF token
(00:12:28)
error. However, the one from Opus just
(00:12:32)
works flawlessly and seems just better
(00:12:34)
to me. It is just way better.
(00:12:38)
Next, we have the Tauri image cropper
(00:12:40)
app. The one from Claude works on the
(00:12:44)
web but doesn't work as an app while the
(00:12:47)
Codex one doesn't work at all, which is
(00:12:49)
not great, and the OpenCode task is
(00:12:52)
unsolved by both of them.
(00:12:55)
This puts Opus 4.6 in
(00:12:58)
first position on the leaderboard, while
(00:13:01)
Codex is lying low alongside MiniMax M2.1.
(00:13:06)
I think Codex, at least subjectively, is
(00:13:09)
not anywhere near as good as the
(00:13:11)
experience that you get with Opus or
(00:13:13)
Claude Code.
(00:13:15)
Codex needs to mature a lot. Even if
(00:13:18)
the model is good, it can't reliably
(00:13:20)
call tools in the contraption that
(00:13:23)
OpenAI has given it. Why does it need to
(00:13:26)
use cat commands to write to files? The
(00:13:29)
model is already slow and it makes it
(00:13:32)
even slower. And this is not some weird
(00:13:34)
bug that I'm having. It has been an open
(00:13:37)
issue on their GitHub repo for the last
(00:13:40)
3 or 4 months. It has been opened
(00:13:42)
multiple times, it is still happening,
(00:13:45)
and users are still reporting
(00:13:48)
it, with the last report 5 days back,
(00:13:52)
but they just don't listen.
(00:13:55)
I get that they try hard to be the
(00:13:57)
community lover by helping OpenCode,
(00:14:00)
but the truth is that they don't want to
(00:14:02)
build a functional product. Claude Code
(00:14:05)
and Opus are just a better experience.
(00:14:08)
Plus, why the hell is there no API for
(00:14:12)
it? If OpenAI can't build a good
(00:14:15)
contraption around their models, then at
(00:14:17)
least let others build it. Not giving
(00:14:19)
API access is just sketchy and weird.
(00:14:23)
Instead of playing catch-up with
(00:14:25)
Anthropic, which is actually way ahead,
(00:14:28)
they should just focus and build and
(00:14:31)
then come back with a good product.
(00:14:34)
Opus is just the best model right now. I
(00:14:37)
am testing some upcoming open models and
(00:14:40)
they are just way better than this model
(00:14:42)
that a trillion-dollar company is
(00:14:44)
making.
(00:14:46)
So yeah, I don't get the hype about
(00:14:48)
Codex at all. It's just way worse. I'll
(00:14:53)
keep using Opus 4.6 as it is now
(00:14:56)
available in both of the coders that I
(00:14:58)
regularly use, which are Verdent and Kilo
(00:15:01)
Code. I can go and use it all I want
(00:15:04)
there, and they do better for me than
(00:15:07)
plain Claude Code, as I can run a
(00:15:10)
ton of tasks in Verdent at once and let
(00:15:13)
it run, and work with Kilo as the pair
(00:15:16)
programmer by my side to help me and do
(00:15:19)
other changes. That is mostly it.
(00:15:22)
Overall, Opus 4.6 is pretty cool.
(00:15:26)
Anyway, share your thoughts below and
(00:15:28)
subscribe to the channel. You can also
(00:15:30)
donate via the Super Thanks option or join
(00:15:32)
the channel as well and get some perks.
(00:15:34)
I'll see you in the next video. Bye.
