vkryukov
Opus 4.5 vs GPT 5.2 vs Gemini 3 Pro for Elixir development
People are constantly debating which LLM is better for writing Elixir code, so I decided to compare the three SOTA models from Google, OpenAI, and Anthropic to see which one would be better at designing a medium-sized feature for a medium-sized project.
The project is ReqLLM, a wonderful new LLM library by @mikehostetler, and the feature is adding image generation support, the first part of Add Image Generation and Audio Transcription Support · Issue #14 · agentjido/req_llm · GitHub.
I used Gemini 3 Pro, GPT 5.2, and Claude Opus 4.5 in the gemini, codex, and claude code CLIs, respectively, with the same prompt. After each model wrote a plan, I asked each (in a separate session) to compare the three plans. The results are here: ReqLLM image support plans · GitHub
Bottom line:
- My ranking of the plans is GPT 5.2 > Opus 4.5 > Gemini 3 Pro, and each of the three models agreed with this assessment.
- Arguably, GPT’s is the only correct plan; while Opus’s plan works, it essentially introduces a parallel response parsing infrastructure, and would make it hard or impossible to extend the image support going forward, add streaming, etc.
- For some reason, Claude likes to write big implementation chunks as part of its plan.
- Gemini’s is the least concrete and least accurate plan (and also uses the wrong image generation endpoints, for some reason).
This matches my experience working with Claude Code and Codex daily: while Claude Code has nicer output, more features (like parallel/background execution), and works faster, Codex is much, much more thorough and most often generates higher-quality code.
Also, the “/review” function in Codex is underrated. My current workflow is to always run a “/review” on code written by me, Claude, or another Codex session. It excels at finding very subtle edge cases and bugs introduced by the latest patch.
Most Liked
egeersoz
GPT is really bad with Elixir in my experience. I regularly run experiments where I ask multiple models the same question (about design or troubleshooting a bug) and GPT is consistently bottom tier. It’s also slow as hell. Not sure why people like it as a coding assistant.
I used to use it for product management to build domain expertise but Gemini 3 is better at that now.
FlyingNoodle
I’m going to have to agree to disagree on this.
LLMs and brains don’t work the same way at all.
FlyingNoodle
each of the three models agreed with this assessment.
This should really say “each of these models generated text which said that they agreed.”
If you worded your question slightly differently the models would write something else. They are just text generators, they can’t “agree”.







