deep dive: the ai models i use

·7 mins·
philip mathew hern
philliant
ai - This article is part of a series.
Part : This Article

i spend most of my working day using an ai assistant in cursor. the parts that are easy to skip in public write-ups are the simpler details: which model name maps to which vendor, what each one is trying to be good at, and where i should not pretend it is interchangeable with the others.

this post is that roster for me, written as of friday mar 20, 2026. i am not running benchmarks here. i am writing down how these models behave in my hands, with links so you can read the official specs if you want to explore further on your own. for why i treat multi-model routing as a production-era default, see from prototype to production: my early adopter view of ai.

quick answer

six models in my rotation right now: composer 2 when i want cursor-native agentic work, gpt-5.3 codex xhigh when i need serious implementation muscle, claude 4.6 opus max when the problem is genuinely hard and i want anthropic thinking, gemini 3.1 pro when the input is big or visual, grok 4.20 when i am stuck and want a fresh perspective, and kimi k2.5 when i want strong tool use from outside the usual three vendors.

who this is for

  • anyone already using cursor (or something similar) who wants to know what models are out there
  • engineers who do not want to watch an hour of launch videos to get a vendor map
  • future me, six months from now, when half of these names have changed and i need to remember what i was actually using

comparison table

the table is the quick reference. the sections below are where i get honest about what each model is actually like to use.

| model (as shown in my router) | maker | speciality / intended use | pro / con | documentation |
| --- | --- | --- | --- | --- |
| composer-2 | cursor | agentic coding inside cursor: edits, terminal-shaped workflows, tool use | pro: built for the editor; strong on long-horizon tasks with summarization training. con: not a portable api model in my mental model; i think of it as an environment capability, not a generic llm | composer 2 model page |
| gpt-5.3-codex-xhigh | openai | agentic coding via the codex line; the xhigh suffix is how my router encodes a higher reasoning-effort preset on top of the codex family | pro: excellent when i want careful refactors and api-shaped thinking. con: slower and more expensive than “just answer fast” tiers; easy to overuse on trivia | gpt-5-codex model, codex product hub |
| claude-4.6-opus-max | anthropic | maximum-depth opus-family reasoning when latency is a fair price | pro: best anthropic option in my rotation for subtle bugs, spec ambiguity, and multi-file coherence. con: the cost and latency are real; i save it for work that deserves the tax | claude models overview |
| gemini-3.1-pro | google | flagship gemini tier for long context and strong multimodal reasoning in the gemini stack | pro: great when i am dragging in screenshots, pdf-shaped context, or very wide file sets. con: vendor-specific quirks still matter; i verify critical logic instead of trusting vibes | gemini models |
| grok-4-20 | xai | grok 4 family reasoning with the 4.20 snapshot naming xai uses in api surfaces | pro: useful second opinion when i feel anchored to one vendor’s “house style”. con: i treat cutting-edge models as higher variance until i have personal calibration data | xai api introduction |
| kimi-k2.5 | moonshot ai | kimi k2 line tuned for coding, math-style reasoning, and tool calling on moonshot’s platform | pro: strong when i want mixture-of-experts-style efficiency and a different training prior than the usual us trio. con: operational details (regions, billing, rate limits) are another console to respect | kimi api quickstart |

composer-2 (cursor)

composer 2 is cursor’s house model for agentic work such as file edits, tool calls, and terminal workflows. it does not feel like chatting with an llm. it feels like the editor itself got smarter.

i use it when the task lives in the repo: multi-step refactors, searching across the workspace, long sessions where i do not want to re-explain context every ten minutes. i do not think of it as an api model i happen to access through cursor. it is more like a capability of the editor itself.

the official docs say it is tuned for tool use and long horizons. that matches what i see.

gpt-5.3-codex-xhigh (openai)

this is my “i need the ai to really think about this” slot on the openai side. the public docs call the family gpt-5-codex; the 5.3 and xhigh parts are how my router encodes the version and reasoning effort. your account might show a different string.
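to make the naming concrete, here is a sketch of how a router string like this could decompose into a base model name plus an effort preset. the suffix list, the default, and the parsing scheme are my own assumptions for illustration, not cursor's actual implementation:

```python
# hypothetical sketch: splitting a router string like "gpt-5.3-codex-xhigh"
# into an api-facing model name and a reasoning-effort setting.
# the suffix list and default below are assumptions, not a documented scheme.

EFFORT_SUFFIXES = ("minimal", "low", "medium", "high", "xhigh")

def parse_router_model(router_name: str) -> dict:
    """Split a router string into a base model and an effort preset."""
    parts = router_name.split("-")
    effort = "medium"  # assumed default when no effort suffix is present
    if parts[-1] in EFFORT_SUFFIXES:
        effort = parts.pop()
    return {"model": "-".join(parts), "reasoning_effort": effort}

print(parse_router_model("gpt-5.3-codex-xhigh"))
# {'model': 'gpt-5.3-codex', 'reasoning_effort': 'xhigh'}
```

the point of the sketch is only that the public model family and the router's display string are two different namespaces, which is why your account might show a different string for the same underlying model.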

i use it when the work is code-heavy and i want the model to show its reasoning, not just spit out an answer. it shines when the change touches contracts, apis, types, migrations, or anything where a wrong assumption quietly spreads.

the downside is obvious: it is slower and more expensive, and it tempts me into using a sledgehammer on a thumbtack.

claude-4.6-opus-max (anthropic)

this is my only anthropic route right now and i save it for the hard stuff: security-sensitive code, tricky concurrency, specs that contradict themselves, and problems where i want the model to slow down and really chew on it.

the trade-off is cost and patience. opus is not “better” at everything. it is better at the things where i would otherwise redo the work three times trying to get it right with a faster model.

i check anthropic’s model pages periodically because vendors bump versions quietly and my router changes behavior without telling me.

gemini-3.1-pro (google)

gemini is where i go when the input is not just code. screenshots, long mixed documents, big file sets: that is where the pro tier earns its keep for me.

same review standard applies though. if the answer involves auth, money, or data integrity, the model is writing drafts, not making decisions. i sign off. always.

grok-4-20 (xai)

grok is my “break the pattern” model. when i have been staring at the same bug through two other model families and getting nowhere, throwing it at a third set of priors sometimes finds the thing i missed faster than another hour of printf debugging.

i keep my expectations honest though. this model does not compete with the flagships above, but i sometimes find value in seeing what it gets wrong, because that often prompts a better question to ask one of the stronger models. it is kind of like using microsoft edge to download google chrome.

kimi-k2.5 (moonshot ai)

kimi k2.5 is my pick when i want strong coding and tool calling from outside the usual us vendor trio. moonshot makes it easy to try because their endpoints are openai-compatible, so i do not have to rewire everything to test it.
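“openai-compatible” here just means the request shape is the same and only the base url and key change. a minimal sketch of that shape, building the request without sending it; the base url and model id are what moonshot’s docs showed at the time of writing, so verify them before relying on this:

```python
# sketch of an openai-style chat completions request aimed at an
# openai-compatible endpoint. base url and model id are assumptions
# taken from moonshot's public docs; check them before use.
import json

MOONSHOT_BASE_URL = "https://api.moonshot.ai/v1"  # assumed; verify in their docs

def build_chat_request(model: str, prompt: str, api_key: str) -> dict:
    """Assemble (but do not send) an openai-style chat request."""
    return {
        "url": f"{MOONSHOT_BASE_URL}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = build_chat_request("kimi-k2.5", "explain this stack trace", "sk-...")
print(req["url"])  # https://api.moonshot.ai/v1/chat/completions
```

because the shape matches, an existing openai client can usually be pointed at the other vendor by overriding the base url and key, which is exactly why trying kimi cost me almost no rewiring.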

i keep a model like this in rotation mainly so i do not default to the same two or three models every time. a slot i never exercise just collects dust.

how i actually pick (it is not scientific)

  1. lots of files, lots of tool calls → composer 2
  2. hard code problem, i want to see the reasoning → codex xhigh or opus max, depending on whether i want openai-flavored or anthropic-flavored thinking
  3. big context window or images involved → gemini 3.1 pro
  4. i have been going in circles for an hour → grok or kimi for a fresh set of eyes
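the four rules above, written out as a literal if/else chain. the boolean flags and the function itself are my own labels for illustration, not any real api; earlier rules win on ties, and the default matches what i reach for most days:

```python
# the routing heuristics above as plain code. the flags and function
# are illustrative labels, not a real router api.

def pick_model(many_files: bool, hard_problem: bool,
               big_or_visual: bool, stuck: bool,
               prefer_anthropic: bool = False) -> str:
    if many_files:        # 1. lots of files, lots of tool calls
        return "composer-2"
    if hard_problem:      # 2. hard code problem, show the reasoning
        return "claude-4.6-opus-max" if prefer_anthropic else "gpt-5.3-codex-xhigh"
    if big_or_visual:     # 3. big context window or images involved
        return "gemini-3.1-pro"
    if stuck:             # 4. going in circles, want fresh priors
        return "grok-4-20"  # or kimi-k2.5
    return "gpt-5.3-codex-xhigh"  # my everyday default

print(pick_model(many_files=True, hard_problem=False,
                 big_or_visual=False, stuck=False))
# composer-2
```

writing it as code mostly shows how crude the heuristic is: four booleans and a default. that is the honest level of rigor behind my routing.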

faq

do you run all six every day?

no. most days it is 90% gpt-5-codex. the full roster is there for when i need it, and over time i have built up a mental map of which model tends to do well on which kind of task.

should i copy this exact list?

please do not. if you are not living inside an agentic editor all day, half of this will not make sense for your workflow. honestly, one fast model and one deep model will cover most people. add a third only if you keep running into the same wall.
