
GPT-5.3 Codex VS Opus 4.6 : I TESTED BOTH Models EXTENSIVELY & ONE IS A CLEAR WINNER! (YouTube Video Transcript)

Duration: 00:15:43
(00:00:05) Hi, welcome to another video. So, Anthropic dropped Opus 4.6 and OpenAI dropped Codex 5.3. Both companies dropped their models within just an hour of each other, which is extremely competitive and interesting. Since they are doing this, let's talk about them and see what these models actually are and what new they bring to the table, and then we'll compare them on my benchmarks.

(00:00:32) Let's start with Opus 4.6. So Opus 4.6 is Anthropic's new flagship model, and it's a pretty meaningful upgrade over Opus 4.5, especially for agentic coding. It plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes.

(00:01:01) It's also the first Opus-class model with a 1 million token context window, which is in beta, and it can output up to 128,000 tokens. So, for those of you building full features or doing large refactors with AI, that's a pretty big deal.

(00:01:18) On benchmarks, Opus 4.6 scored 65.4% on Terminal-Bench 2.0, up from 59.8 on Opus 4.5. On SWE-bench Verified, it got 80.8%, basically the same as before. Where it really shines is OSWorld for computer use at 72.7%, up from 66.3. And on ARC-AGI-2, it went crazy with 68.8%, nearly double its predecessor. BrowseComp came in at 84% and Humanity's Last Exam at 53.1% with tools. So, strong improvements across the board.

(00:02:10) They also introduced adaptive thinking, where you can set the effort level from low to max, and it actually affects performance.
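To put the Opus 4.6 figures quoted above side by side, here is a minimal Python sketch that computes the gain over Opus 4.5 for the two benchmarks where the video states both numbers. The dictionary and function names are made up for this illustration; only the four scores come from the transcript.

```python
# Scores (in percent) exactly as quoted in the video. Only benchmarks where
# both the Opus 4.6 and the Opus 4.5 figure are stated are included.
quoted_scores = {
    "Terminal-Bench 2.0": (65.4, 59.8),  # (Opus 4.6, Opus 4.5)
    "OSWorld": (72.7, 66.3),
}


def point_gain(new: float, old: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Gain relative to the old score, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


gains = {
    name: (point_gain(new, old), relative_gain(new, old))
    for name, (new, old) in quoted_scores.items()
}

for name, (points, relative) in gains.items():
    print(f"{name}: +{points} points ({relative}% relative)")
```

Both benchmarks move by roughly six percentage points, which is a much smaller relative change than the "nearly double" described for ARC-AGI-2.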
(00:02:20) Pricing stays the same as Opus 4.5 at $5 per million input and $25 per million output. No price increase for a better model, which is great.

(00:02:30) Now let's talk about GPT-5.3 Codex. So OpenAI is calling this their most capable agentic coding model to date. The key thing about GPT-5.3 Codex is that it combines the coding performance of GPT-5.2 Codex with the reasoning and professional knowledge capabilities of GPT-5.2, all in one model, and it's 25% faster.

(00:03:02) Now, one of the wildest things about this model is that OpenAI says it's the first model that was instrumental in creating itself. They used early versions of GPT-5.3 Codex to debug its own training, manage its own deployment, and diagnose its own test results. So, the model literally helped build itself. That's a pretty significant claim.

(00:03:27) On benchmarks, GPT-5.3 Codex scored 77.3% on Terminal-Bench 2.0. For context, Opus 4.6 got 65.4 and GPT-5.2 Codex got 64. So that's a massive jump and a clear lead on terminal coding. On SWE-Bench Pro, it scored 56.8%, slightly up from 56.4 on GPT-5.2 Codex. On OSWorld-Verified, it got 64.7%. And on GDPval, which evaluates professional knowledge work across 44 different jobs, it scored 70.9% wins or ties.

(00:04:20) One thing that really stands out is cybersecurity. GPT-5.3 Codex scored 77.6% on cybersecurity CTF challenges, and OpenAI has classified it as their first high-capability model for cybersecurity under their Preparedness Framework.
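As a rough illustration of the Opus 4.6 pricing quoted above ($5 per million input tokens, $25 per million output tokens), the sketch below estimates the cost of a single request. The function name and the token counts in the usage line are hypothetical; real bills may differ, for example if other pricing rules apply that the video does not mention.

```python
# Per-million-token prices for Opus 4.6 as stated in the video.
INPUT_USD_PER_MILLION = 5.0
OUTPUT_USD_PER_MILLION = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical request: 200,000 tokens of context in, 20,000 tokens out.
example_cost = estimate_cost_usd(200_000, 20_000)
print(f"Estimated cost: ${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, long generations dominate the bill faster than long prompts do.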
(00:04:42) It's the first model they've directly trained to find software vulnerabilities. So that's a whole new dimension.

(00:04:47) It's also more efficient. OpenAI says it uses less than half the tokens of its predecessor for the same tasks. Context window is 400,000 tokens with a 128,000 token output limit. It's available now to all ChatGPT users through the Codex app, CLI, and IDE extension. If you use the Codex app, then you get one month of free access as well. So that's great. API access is coming soon, but pricing for that hasn't been announced yet.

(00:05:27) Now, those are the official numbers from both companies, but as you all know, official benchmarks only tell part of the story. So, let me show you how these two actually perform on my benchmarks.

(00:05:41) Let's start with the non-agentic test for Opus 4.6. I ran it on my KingBench, which has 11 questions, three general knowledge and eight coding. And Opus 4.6 scored 100%. 220 out of 220. Perfect score. Every single question, 20 out of 20. That has never happened before on my benchmark with any model.

(00:06:12) Let me quickly walk you through the coding generations. The 3D floor plan came out really clean. It built a proper Three.js scene with two bedrooms, two bathrooms, living room, kitchen, dining, and hallway, all in the correct 1,585 ft layout. Perfect 20 out of 20.

(00:06:33) The SVG panda holding a burger was well structured with proper body proportions, belly patch, legs, and it was actually holding the burger. 20 out of 20.
(00:06:51) The Pokeball in Three.js rendered a proper 3D Pokeball with animation. Clean code, single HTML file. Everything worked. 20 out of 20.

(00:07:02) The chessboard was fully functional. All pieces placed correctly. Legal move validation. And the autoplay feature actually worked, where both sides make legal moves automatically. This is a hard one and it nailed it. 20 out of 20.

(00:07:19) The 3D Minecraft was insane. It called it Kandinsky Edition, and it had procedurally generated hand-drawn style textures, trees, and smooth terrain. Full first-person controls. 20 out of 20.

(00:07:36) The 3D butterfly in a garden was a complete Three.js scene with an animated butterfly flying through a garden with camera controls. Beautiful output. 20 out of 20.

(00:07:48) The Rust CLI tool for image conversion had a proper Cargo.toml with clap for argument parsing and the image crate supporting 13-plus formats including PNG, JPEG, GIF, WebP, AVIF, TIFF, and more. Well-structured project, 20 out of 20.

(00:08:12) And the Blender script for the Pokeball was clean Python with proper scene cleanup and PBR material setup with the right Pokeball red, white, and black colors. Ready to paste into Blender and run. 20 out of 20.

(00:08:26) On the general side, the math problems and the riddle were all perfect as well, 100% across the board.

(00:08:37) Now, looking at the leaderboard, Opus 4.6 is sitting at number one, tied with Gemini 3 Pro, both at 100%. But Gemini 3 Pro only cost about 85 for the full run, while Opus 4.6 cost around $6.39. After that, there's a massive drop. Opus 4.5 Max is at 74% in third place.
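The scoring arithmetic described above (11 questions, each marked out of 20, for a maximum of 220) can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name is an assumption, and the video does not show how the benchmark is actually implemented.

```python
POINTS_PER_QUESTION = 20  # each question is marked out of 20 in the video
NUM_QUESTIONS = 11        # three general knowledge plus eight coding


def percent_score(question_scores: list[int]) -> float:
    """Convert per-question scores into an overall percentage."""
    max_total = POINTS_PER_QUESTION * len(question_scores)
    return 100 * sum(question_scores) / max_total


# The perfect run reported for Opus 4.6: 20 out of 20 on every question.
perfect_run = [POINTS_PER_QUESTION] * NUM_QUESTIONS
total = sum(perfect_run)

print(f"{total} out of {POINTS_PER_QUESTION * NUM_QUESTIONS}")
print(f"{percent_score(perfect_run):.0f}%")
```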
(00:09:09) And then GPT-5.2 xhigh is down at fifth place with just 65%. So on non-agentic coding, both Opus 4.6 and Gemini 3 Pro are in a league of their own right now.

(00:09:23) Now, for GPT-5.3 Codex, here's the thing. As of right now, there is no API access for GPT-5.3 Codex. OpenAI hasn't released it on the API yet, so I cannot run it on my benchmarks. I literally have no way to test it on KingBench. Once the API is available, I will test it and make a separate video on it or update this one.

(00:09:54) Now let's move to the agentic tests. This is where things get really interesting because agentic coding is a completely different game. There are seven questions. So let's look at both of their generations one by one.

(00:10:07) First, we have the Expo mobile movie tracker app. First let's look at the one from Opus, and it is quite good. On the homepage, you see the movies you have watched and a calendar to see when you watched them. You can go to the search tab and easily see movies, add a new one, see the inner pages, add to the watchlist, and everything like that. It works really well, and it was all one shot, which is quite great.

(00:10:41) Then we've got GPT-5.3 Codex and, well, it's not good. It kind of works, but Codex implemented everything in one file for some odd reason, which was obviously not great to see. And while it kind of did everything, it was very lackluster in real usage. And while working with it, for some reason, it uses cat to write lines to files instead of, like, an edit tool. This happens a lot of the time.
(00:11:13) For some reason, I used extra-high reasoning for this. So, this is the best you can get, apparently. Not a very great response from Codex, for sure.

(00:11:21) Next up, we've got the calculator question. Here, I ask the agent to build me a graphical calculator that works in the terminal with Go and, well, Opus does kind of well. It is not the best by any means, but it's fine and workable. However, the one by Codex is full of bugs and does not work either. So yeah, it's not great.

(00:11:46) Then we've got Godot, and both of them do well on this. So that's good.

(00:11:53) Next, we have the Kanban app in Svelte. Opus nails it here again. It works end to end without any bugs, and the UI is also really good. The one from OpenAI opens the login page, but it just errors out after that. So yeah, Codex is not good.

(00:12:11) Then we have the Nuxt app. Here I ask it to make me a Stack Overflow clone in Nuxt. If we look at Codex, then it looks really good, but it again falls apart in authentication and gives a CSRF token error. However, the one from Opus just works flawlessly and seems just better to me. It is just way better.

(00:12:38) Next, we have the Tauri image cropper app. The one from Claude works on the web but doesn't work as an app, while the Codex one doesn't work at all, which is not great. And the OpenCode one is unresolved by both of them.

(00:12:55) This makes the Opus 4.6 model score the first position on the leaderboard, while Codex is laying low with MiniMax M2.1. I think Codex, at least subjectively, is not anywhere near as good as the experience that you get with Opus or Claude Code.
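To keep the seven agentic results straight, the sketch below tallies them per model. The verdict labels ("works", "partial", "broken", "unresolved") are this sketch's own reading of the spoken commentary, not the presenter's actual scores, which the video does not state; the task names are likewise shortened for illustration.

```python
from collections import Counter

# (Opus 4.6 verdict, GPT-5.3 Codex verdict) for each agentic task.
# Labels are an interpretation of the commentary, not official scores.
results = {
    "Expo movie tracker app": ("works", "partial"),
    "Go terminal calculator": ("works", "broken"),
    "Godot task": ("works", "works"),
    "Kanban app in Svelte": ("works", "broken"),
    "Stack Overflow clone in Nuxt": ("works", "broken"),
    "Tauri image cropper": ("partial", "broken"),
    "OpenCode task": ("unresolved", "unresolved"),
}

tally = {"Opus 4.6": Counter(), "GPT-5.3 Codex": Counter()}
for task, (opus_verdict, codex_verdict) in results.items():
    tally["Opus 4.6"][opus_verdict] += 1
    tally["GPT-5.3 Codex"][codex_verdict] += 1

for model, counts in tally.items():
    print(model, dict(counts))
```

Under this reading, Opus 4.6 produces a working result on five of the seven tasks and GPT-5.3 Codex on one, which matches the gap described on the leaderboard.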
(00:13:15) Codex needs to mature a lot. Even if the model is good, it can't reliably call tools in the contraption that OpenAI has given it. Why does it need to use cat commands to write to files? The model is already slow and it makes it even slower. And this is not some weird bug that I'm having. This has been an open issue on their GitHub repo for the last 3 or 4 months, it has been opened multiple times, it is still happening, and users are still reporting it, with the last report 5 days back, but they just don't listen.

(00:13:55) I get that they try hard to be the community lover by helping OpenCode, but the truth is that they don't want to build a functional product. Claude Code and Opus is just a better experience. Plus, why the hell is there no API for it? If OpenAI can't build a good contraption around their models, then at least let others build it. Not giving API access is just sketchy and weird. Instead of playing catch-up with Anthropic, which is actually way ahead, they should just focus and build and then come back with a good product.

(00:14:34) Opus is just the best model right now. I am testing some upcoming open models and they are just way better than this model that a trillion-dollar company is making. So yeah, I don't get the hype about Codex at all. It's just way worse.

(00:14:48) I'll keep using Opus 4.6, as it is now available in both of the coders that I regularly use, which are Verdent and Kilo Code.
(00:15:04) I can go and use it all I want there, and they do better for me than general Claude Code, as I can run a ton of tasks in Verdent at once and let it run, and work with Kilo as the pair programmer by my side to help me and do other changes. That is majorly about it. Overall, Opus 4.6 is pretty cool.

(00:15:26) Anyway, share your thoughts below and subscribe to the channel. You can also donate via the Super Thanks option or join the channel as well and get some perks. I'll see you in the next video. Bye.
