Yes, it's incredibly boring to wait for the AI agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave Cline a prompt so big and complex that it spent 2 straight hours writing code.
But after those 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally, even when I have a pretty good idea of the final picture.
I've been increasingly using only the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.
But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or combine really fast models like this one with a "thinking" background one that would run for seconds/minutes and try to catch the bugs left behind.
I guess only giving it a try will tell.
Btw, why call it "coder"? 4o-mini-level intelligence is for extracting structured data and basic summaries, definitely not for coding.
Saw another one on Twitter in the past few days that looked like a better contender than Mercury; it doesn't look like it got posted to LocalLLaMa, and I can't find it now. Very exciting stuff.
To transform the string "AB" to "AC" using the given rules, follow these steps:
1. *Apply Rule 1*: Add "C" to the end of "AB" (since it ends in "B").
   - Result: "ABC"
2. *Apply Rule 4*: Remove the substring "CC" from "ABC".
   - Result: "AC"
Thus, the series of transformations is:
- "AB" → "ABC" (Rule 1)
- "ABC" → "AC" (Rule 4)
This sequence successfully transforms "AB" to "AC".
¹ https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...
The cost[1] is US$1.00 per million output tokens and US$0.25 per million input tokens. By comparison, Gemini 2.5 Flash Preview charges US$0.15 per million tokens for text input and US$0.60 per million for (non-thinking) output[2].
Hmmm... at those prices they need to focus on markets where speed is especially important, e.g. high-frequency trading, transcription/translation services, and hardware/IoT alerting!
1. https://files.littlebird.com.au/Screenshot-2025-05-01-at-9.3...
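Back-of-the-envelope with the prices quoted above; the 10k-input / 2k-output request size is just an illustrative assumption:

```python
# Cost per request in USD, using the per-million-token rates from the comment above.
def cost(input_tokens, output_tokens, in_rate, out_rate):
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

mercury = cost(10_000, 2_000, 0.25, 1.00)  # $0.0045
gemini  = cost(10_000, 2_000, 0.15, 0.60)  # $0.0027
print(f"Mercury: ${mercury:.4f}, Gemini 2.5 Flash: ${gemini:.4f}, ratio: {mercury / gemini:.1f}x")
```

So roughly 1.7x the price per request, which is why speed has to be the selling point.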
Something I don't see explored in their presentation is the model's ability to recover from errors / correct itself. SotA LLMs shine at this; a few back-and-forths with Sonnet / Gemini Pro / etc. really solve most problems nowadays.
There's already stuff in the wild moving that direction without completely rethinking how models work. Cursor and now other tools seem to have models for 'next edit' not just 'next word typed'. Agents can edit a thing and then edit again (in response to lints or whatever else); approaches based on tools and prompting like that can be iterated on without the level of resources needed to train a model. You could also imagine post-training a model specifically to be good at producing edit sequences, so it can actually 'hit backspace' or replace part of what it's written if it becomes clear it wasn't right, or if two parts of the output 'disagree' and need to be reconciled.
From a quick search it looks like https://arxiv.org/abs/2306.05426 in 2023 discussed backtracking LLMs and https://arxiv.org/html/2410.02749v3 / https://github.com/upiterbarg/lintseq trained models on synthetic edit sequences. There is probably more out there with some digging. (Not the same topic, but the search also turned up https://arxiv.org/html/2504.20196 from this Monday(!) about automatic prompt improvement for an internal code-editing tool at Google.)
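To make the "hit backspace / replace part of what it's written" idea concrete, here's a toy sketch of what an edit-sequence output could look like; the operation names and offset scheme are invented for illustration, not taken from either paper:

```python
# Toy illustration of a model emitting edits over its own draft rather than
# only appending tokens. The "replace"/"delete" ops are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Edit:
    op: str         # "replace" or "delete"
    start: int      # character offset where the edit begins
    end: int        # character offset where the edit ends (exclusive)
    text: str = ""  # replacement text, used only by "replace"

def apply_edits(draft: str, edits: list[Edit]) -> str:
    """Apply edits right-to-left so earlier offsets stay valid."""
    out = draft
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        if e.op == "replace":
            out = out[:e.start] + e.text + out[e.end:]
        elif e.op == "delete":
            out = out[:e.start] + out[e.end:]
    return out

# The model "hits backspace" on an identifier it now realizes is wrong:
draft = "total = sum(vals) / len(val)"
print(apply_edits(draft, [Edit("replace", 24, 27, "vals")]))
# -> total = sum(vals) / len(vals)
```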
You have 2 minutes to cool down a cup of coffee to the lowest temp you can
You have two options:
1. Add cold milk immediately, then let it sit for 2 mins.
2. Let it sit for 2 mins, then add the cold milk.
Which one cools the coffee to the lowest temperature and why?
And Mercury gets this right, while as of right now ChatGPT 4o gets it wrong.
So that’s pretty impressive.
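For what it's worth, the physics is easy to sanity-check numerically. A minimal Newton's-law-of-cooling sketch, where all the numbers are made-up assumptions:

```python
# Minimal Newton's-law-of-cooling sketch. All numbers are made-up assumptions:
# 90 C coffee, 5 C milk at one fifth of the coffee's volume, 20 C room, k = 0.3/min.
import math

T_ROOM, K, MINUTES = 20.0, 0.3, 2.0

def sit(temp, minutes):
    """Temperature after sitting for `minutes` (Newton's law of cooling)."""
    return T_ROOM + (temp - T_ROOM) * math.exp(-K * minutes)

def add_milk(coffee_temp, milk_temp=5.0, milk_frac=0.2):
    """Mix in milk_frac of the coffee's volume as cold milk (simple weighted average)."""
    return (coffee_temp + milk_frac * milk_temp) / (1 + milk_frac)

start = 90.0
milk_first = sit(add_milk(start), MINUTES)   # ~50.6 C with these numbers
milk_last  = add_milk(sit(start, MINUTES))   # ~49.5 C with these numbers
print(f"milk first: {milk_first:.1f} C, milk last: {milk_last:.1f} C")
# Hotter coffee sheds heat faster, so waiting and adding the milk last ends up cooler.
```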
I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?
However,
> Prompt: Write a sentence with ten words which has exactly as many r’s in the first five words as in the last five
>
> Response: Rapidly running, rats rush, racing, racing.
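The constraint is easy to verify mechanically; a quick sketch of a checker, which the quoted response fails since it only has six words:

```python
# Quick mechanical check of the quoted constraint: exactly ten words, with the
# same number of r's in the first five words as in the last five.
def satisfies_constraint(sentence: str) -> bool:
    words = sentence.split()
    if len(words) != 10:
        return False
    count_r = lambda ws: sum(w.lower().count("r") for w in ws)
    return count_r(words[:5]) == count_r(words[5:])

print(satisfies_constraint("Rapidly running, rats rush, racing, racing."))  # False: only six words
```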
With the speed this can generate solutions, you could have it loop: attempt the solution, feed itself the output (including any errors found), and go again until it builds the "correct" solution, as in the sketch below.
[1] https://framerusercontent.com/assets/cWawWRJn8gJqqCGDsGb2gN0...
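A hypothetical sketch of that generate/run/feed-errors-back loop; `call_model` is a stand-in for whatever client the actual API exposes, not a real function:

```python
import subprocess
import sys
import tempfile

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in the real model client here")

def solve(task: str, max_rounds: int = 5) -> str:
    prompt = f"Write a Python script that {task}. Reply with code only."
    code = ""
    for _ in range(max_rounds):
        code = call_model(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # ran cleanly; call that "correct" for this sketch
        # Feed the failure back in and let the model try again.
        prompt = (f"This script:\n{code}\n\nfailed with:\n{result.stderr}\n"
                  f"Fix it and reply with the full corrected script only.")
    return code
```

With a model this fast, several such rounds would still finish well before a single pass of a slower model.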
Speed is great, but it doesn't seem like other text-model trends, like reasoning, will work out of the box. So you have to get dLLMs up to the quality of a regular autoregressive LLM, and then innovate further to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.
That said, token-based models are currently fast enough for most real-time chat applications, so I wonder what other use-cases there will be where speed is greatly prioritized over smarts. Perhaps trading on Trump tweets?
Diffusion is an alternative, but I'm having a hard time understanding the whole "built-in error correction" claim; it sounds like marketing BS. Both approaches replicate probability distributions, which will naturally be error-prone because of variance.
High tech US service industry exports are cooked.
It feels like models are becoming fungible apart from the hyperscaler frontier models from OpenAI, Google, Anthropic, et al.
I suppose VCs won't be funding many more "labs"-type companies, or companies whose core value prop is "we have a model"? Unless there's a tight application loop or something truly unique?
Disregarding the team composition, research background, and specific problem domain - if you were starting an AI company today, what part of the stack would you focus on? Foundation models, AI/ML infra, tooling, application layer, ...?
Where does the value accrue? What are the most important problems to work on?
This means on custom chips (Cerebras, Graphcore, etc...) we might see 10k-100k tokens/sec? Amazing stuff!
Also of note: funny how text generation started with autoregression/tokens and diffusion seems to perform better, while image generation went the opposite way.