
from prototype to production: my early adopter view of ai

philip mathew hern · 6 mins

this article is part of the ai series.

i have been an early adopter of new technology for all of my career, and ai has been one of the fastest shifts i have seen.

at first, ai felt like a neat side tool. i used it for first drafts, rough ideas, and quick experiments. and it really was on the side: a stand-alone tool, separate from my other applications, with questions and answers copy/pasted back and forth between the ai and my workspace.

now, it is different.

ai is no longer just a sidekick in my workflow. it is powerful enough to be production-usable when paired with the right guardrails, review loops, and system design. plus, these capabilities are integrated directly into all of the tools, applications, and websites i use daily.

quick answer
#

ai is not a side tool anymore. the benchmarks are better, inference is cheaper, and companies are actually using it in production, not just piloting it. in my own work, i stopped treating it as one model for everything and started routing different models to different jobs. the concrete roster i use in practice is in deep dive: the ai models i use.

who this is for
#

  • leaders wondering whether ai is actually ready for real workloads (not just demos)
  • engineers figuring out which model to use for what
  • teams trying to keep quality high without the bill getting ridiculous

the evidence: 1 year ago, 6 months ago, and now
#

if someone asked me “why do you trust it more now?”, the honest answer is that i have been using this stuff daily for a while, and the outputs are just better than they were a year ago. that is anecdotal. but the benchmarks back it up:

| when | study | stat that stood out to me |
| --- | --- | --- |
| ~1 year ago (2025) | stanford ai index 2025 | one-year benchmark gains were large: +18.8 (MMMU), +48.9 (GPQA), and +67.3 (swe-bench) percentage points |
| ~6 months ago (sep 2025) | fluid language model benchmarking (arxiv) | on MMLU, they report higher validity and lower variance with 50x fewer evaluation items |
| most recent i could find (feb 2026) | prescriptive scaling reveals the evolution of language model capabilities (arxiv) | analysis used 5,000 observational + 2,000 new model-performance points and found math-reasoning frontiers are still advancing |

the part that gets me is not just that models are getting better, it is that we are also getting better at measuring whether they are actually better. both things improving at the same time is what makes me trust the trend.

more companies have models now
#

this is not just a tech story. the org charts moved too.

the stanford ai index 2025 economy chapter reports that:

  • organizational ai use rose from 55% (2023) to 78% (2024)
  • genai use in at least one business function rose from 33% to 71%

and at the model-building layer, the same report notes that nearly 90% of notable ai models in 2024 came from industry (up from 60% in 2023). companies are not just buying models anymore, they are building them. that shift matters.

how models have evolved
#

the story is not “bigger model = better model” anymore. it got more interesting than that:

  • ONERULER (mar 2025) showed long-context multilingual performance can swing by up to 20% depending on instruction language, and english was only 6th of 26 languages in their setup
  • longbench pro (jan 2026) evaluated 46 long-context models over 1,500 real long-context samples and found long-context optimization can matter more than raw parameter scaling
  • prescriptive scaling (feb 2026) suggests some capability boundaries are stabilizing, while math reasoning keeps moving forward

what i take from all of this: “what is the best model?” is the wrong question now. the right question is “best model for what?”.

how model usage has evolved (including slms)
#

the way i use models has changed as much as the models themselves.

before, i used one large model for almost everything. now, i use a system:

  • larger models for hard reasoning, large requests, and high-stakes trade-offs
  • smaller/faster models for mechanical edits and basic questions

splitting the work this way balances cost, latency, and quality.
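the roster idea can be sketched as a tiny router. this is a minimal illustration, assuming a simple task-type label per request; the model names and route table are made up, not any provider's real catalog:

```python
# illustrative route table: task type -> model tier.
# names are placeholders, not real model identifiers.
ROUTES = {
    "reasoning": "large-reasoning-model",  # hard trade-offs, big requests
    "edit": "small-fast-model",            # mechanical edits
    "question": "small-fast-model",        # basic lookups
}


def pick_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the large model
    when the task type is unknown (safer to over-spend than under-think)."""
    return ROUTES.get(task_type, "large-reasoning-model")
```

defaulting unknown work to the large model is a deliberate choice: misrouting a hard task to a small model costs quality, while the reverse only costs a little money.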

this is exactly where small language models (slms) come in, and why i think they matter more than most people give them credit for.

the stanford ai index 2025 highlights that inference cost for gpt-3.5-level performance dropped by over 280x between nov 2022 and oct 2024, driven partly by more efficient model options.

and the slm research ecosystem is catching up quickly:

  • slm-bench (2025) benchmarked 15 slms across 9 tasks, 23 datasets, and 11 metrics (accuracy, compute, and consumption)
  • this 2026 slm/srlm benchmark reported a 4b model reaching 95.64% on a log-severity task with RAG, while a 0.6b model still reached 88.12%

for me, the shift is going from “one model for everything” to a “roster of models to pick the right player for the job”.

ai agents are now inside nearly every tool i use
#

a year or two ago, “ai in my tool” meant autocomplete or a chat sidebar you forgot about after a week. that changed. in my day-to-day as a data engineer, ai now shows up in basically every tool i touch, and it actually does useful work.

what that looks like in practice:

  • my code editor reads the repo, proposes multi-file edits, runs checks, and explains what it changed. it is not just finishing my sentences anymore
  • source control and review tools help me draft pull requests, summarize diffs, and flag the risky parts before a reviewer has to
  • ticketing and docs tools turn my rough brain dump into structured requirements and keep threads summarized so i do not have to re-read 47 messages
  • data tools (sql editors, dbt, warehouse observability) help with query drafts, lineage questions, test ideas, and figuring out why a pipeline broke at 3am
  • more and more of my operational workflows are bounded loops that the ai runs end-to-end while i review results

the bigger picture is that tools used to wait for me to click a button. now agents work alongside me toward an outcome. this is what i call human+.
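a bounded loop like the ones above can be sketched in a few lines of python. this is a shape sketch, not a real agent framework: the `step` and `is_done` callables stand in for whatever the agent actually does each iteration.

```python
def run_bounded_loop(step, is_done, max_steps=5):
    """Run an agent step at most max_steps times, stopping early if the
    done-check passes. Either way, the final state goes to a human for
    review -- the loop is bounded, never open-ended."""
    state = None
    for _ in range(max_steps):
        state = step(state)
        if is_done(state):
            break
    return state  # human reviews this result either way
```

the point of the hard cap is the guardrail: the agent can work toward an outcome on its own, but it cannot run away from you.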

closing thought
#

i am still an early adopter. that has not changed. what changed is why i keep using it. it is not because ai is shiny or new. it is because it actually helps with real work when you set it up right.

the phase where i had to convince people “no really, this is useful” is mostly over. the current phase, figuring out how to engineer it into production systems that do not break, is harder, and honestly more fun.

faq
#

what is the most important takeaway from the benchmark trend?
#

no single benchmark number is the story. the story is that the trend points the same direction across different benchmarks and different time windows. that consistency is what makes me think this is real progress, not just one lucky test result.

what is the practical operating model today?
#

use more than one model. hard problems and ambiguous decisions go to the big model. repetitive transforms, classification, and high-volume calls go to something faster and cheaper. add guardrails to both.
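the cheap-first-with-guardrails pattern can be sketched like this. it is a hypothetical illustration: `cheap`, `big`, and `is_valid` are placeholder callables standing in for a small model, a large model, and whatever validation check fits your workload.

```python
from typing import Callable


def answer_with_escalation(
    prompt: str,
    cheap: Callable[[str], str],
    big: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> str:
    """Try the cheap model first; escalate to the big model only when
    the guardrail rejects the draft."""
    draft = cheap(prompt)
    if is_valid(draft):
        return draft
    return big(prompt)
```

in practice the guardrail might be a schema check, a length or citation check, or a second-model grader; the structure stays the same.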

