Yes, it's incredibly boring to wait for the AI agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave Cline a prompt so big and complex that it spent 2 straight hours writing code.
But after those 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally, even when I have a pretty good idea of the final picture.
I've been increasingly using only the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.
But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or combine really fast models like this one with a "thinking" background one that would run for seconds/minutes and try to catch the bugs left behind.
I guess only giving it a try will tell.
Btw, why call it "coder"? 4o-mini-level intelligence is for extracting structured data and basic summaries, definitely not for coding.
Saw another one on Twitter in the past few days that looked like a better contender than Mercury; it doesn't look like it got posted to LocalLLaMa, and I can't find it now. Very exciting stuff.
To transform the string "AB" to "AC" using the given rules, follow these steps:
1. *Apply Rule 1*: Add "C" to the end of "AB" (since it ends in "B").
   - Result: "ABC"
2. *Apply Rule 4*: Remove the substring "CC" from "ABC".
   - Result: "AC"
Thus, the series of transformations is:
- "AB" → "ABC" (Rule 1)
- "ABC" → "AC" (Rule 4)
This sequence successfully transforms "AB" to "AC".
¹ https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...
The cost[1] is US$1.00 per million output tokens and US$0.25 per million input tokens. By comparison, Gemini 2.5 Flash Preview charges US$0.15 per million tokens for text input and US$0.60 per million for (non-thinking) output[2].
Hmmm... at those prices they need to focus on markets where speed is especially important, e.g. high-frequency trading, transcription/translation services, and hardware/IoT alerting!
1. https://files.littlebird.com.au/Screenshot-2025-05-01-at-9.3...
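Back-of-the-envelope with the prices quoted above; the 10k-input / 2k-output request size is just an illustrative assumption:

```python
# Cost per request in USD, using the per-million-token rates from the comment above.
def cost(input_tokens, output_tokens, in_rate, out_rate):
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

mercury = cost(10_000, 2_000, 0.25, 1.00)  # $0.0045
gemini  = cost(10_000, 2_000, 0.15, 0.60)  # $0.0027
print(f"Mercury: ${mercury:.4f}, Gemini 2.5 Flash: ${gemini:.4f}, ratio: {mercury / gemini:.1f}x")
```

So roughly 1.7x the price per request, which is why speed has to be the selling point.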
Something I don't see explored in their presentation is the model's ability to recover from errors / correct itself. SotA LLMs shine at this; a few back-and-forths with Sonnet / Gemini Pro / etc. really solve most problems nowadays.
There's already stuff in the wild moving that direction without completely rethinking how models work. Cursor and now other tools seem to have models for 'next edit' not just 'next word typed'. Agents can edit a thing and then edit again (in response to lints or whatever else); approaches based on tools and prompting like that can be iterated on without the level of resources needed to train a model. You could also imagine post-training a model specifically to be good at producing edit sequences, so it can actually 'hit backspace' or replace part of what it's written if it becomes clear it wasn't right, or if two parts of the output 'disagree' and need to be reconciled.
From a quick search it looks like https://arxiv.org/abs/2306.05426 in 2023 discussed backtracking LLMs and https://arxiv.org/html/2410.02749v3 / https://github.com/upiterbarg/lintseq trained models on synthetic edit sequences. There is probably more out there with some digging. (Not the same topic, but the search also turned up https://arxiv.org/html/2504.20196 from this Monday(!) about automatic prompt improvement for an internal code-editing tool at Google.)
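To make the "hit backspace / replace part of what it's written" idea concrete, here's a toy sketch of what an edit-sequence output could look like; the operation names and offset scheme are invented for illustration, not taken from either paper:

```python
# Toy illustration of a model emitting edits over its own draft rather than
# only appending tokens. The "replace"/"delete" ops are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Edit:
    op: str         # "replace" or "delete"
    start: int      # character offset where the edit begins
    end: int        # character offset where the edit ends (exclusive)
    text: str = ""  # replacement text, used only by "replace"

def apply_edits(draft: str, edits: list[Edit]) -> str:
    """Apply edits right-to-left so earlier offsets stay valid."""
    out = draft
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        if e.op == "replace":
            out = out[:e.start] + e.text + out[e.end:]
        elif e.op == "delete":
            out = out[:e.start] + out[e.end:]
    return out

# The model "hits backspace" on an identifier it now realizes is wrong:
draft = "total = sum(vals) / len(val)"
print(apply_edits(draft, [Edit("replace", 24, 27, "vals")]))
# -> total = sum(vals) / len(vals)
```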
You have 2 minutes to cool down a cup of coffee to the lowest temp you can
You have two options:
1. Add cold milk immediately, then let it sit for 2 mins.
2. Let it sit for 2 mins, then add the cold milk.
Which one cools the coffee to the lowest temperature and why?
And Mercury gets this right, while as of right now ChatGPT 4o gets it wrong.
So that’s pretty impressive.
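For what it's worth, the physics is easy to sanity-check numerically. A minimal Newton's-law-of-cooling sketch, where all the numbers are made-up assumptions:

```python
# Minimal Newton's-law-of-cooling sketch. All numbers are made-up assumptions:
# 90 C coffee, 5 C milk at one fifth of the coffee's volume, 20 C room, k = 0.3/min.
import math

T_ROOM, K, MINUTES = 20.0, 0.3, 2.0

def sit(temp, minutes):
    """Temperature after sitting for `minutes` (Newton's law of cooling)."""
    return T_ROOM + (temp - T_ROOM) * math.exp(-K * minutes)

def add_milk(coffee_temp, milk_temp=5.0, milk_frac=0.2):
    """Mix in milk_frac of the coffee's volume as cold milk (simple weighted average)."""
    return (coffee_temp + milk_frac * milk_temp) / (1 + milk_frac)

start = 90.0
milk_first = sit(add_milk(start), MINUTES)   # ~50.6 C with these numbers
milk_last  = add_milk(sit(start, MINUTES))   # ~49.5 C with these numbers
print(f"milk first: {milk_first:.1f} C, milk last: {milk_last:.1f} C")
# Hotter coffee sheds heat faster, so waiting and adding the milk last ends up cooler.
```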
I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?
However,
> Prompt: Write a sentence with ten words which has exactly as many r’s in the first five words as in the last five
>
> Response: Rapidly running, rats rush, racing, racing.
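The constraint is easy to verify mechanically; a quick sketch of a checker, which the quoted response fails since it only has six words:

```python
# Quick mechanical check of the quoted constraint: exactly ten words, with the
# same number of r's in the first five words as in the last five.
def satisfies_constraint(sentence: str) -> bool:
    words = sentence.split()
    if len(words) != 10:
        return False
    count_r = lambda ws: sum(w.lower().count("r") for w in ws)
    return count_r(words[:5]) == count_r(words[5:])

print(satisfies_constraint("Rapidly running, rats rush, racing, racing."))  # False: only six words
```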
With the speed this can generate solutions, you could have it loop: attempt the solution, feed itself the output (including any errors found), and go again until it builds the "correct" solution, as in the sketch below.
[1] https://framerusercontent.com/assets/cWawWRJn8gJqqCGDsGb2gN0...
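A hypothetical sketch of that generate/run/feed-errors-back loop; `call_model` is a stand-in for whatever client the actual API exposes, not a real function:

```python
import subprocess
import sys
import tempfile

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in the real model client here")

def solve(task: str, max_rounds: int = 5) -> str:
    prompt = f"Write a Python script that {task}. Reply with code only."
    code = ""
    for _ in range(max_rounds):
        code = call_model(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # ran cleanly; call that "correct" for this sketch
        # Feed the failure back in and let the model try again.
        prompt = (f"This script:\n{code}\n\nfailed with:\n{result.stderr}\n"
                  f"Fix it and reply with the full corrected script only.")
    return code
```

With a model this fast, several such rounds would still finish well before a single pass of a slower model.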
Speed is great, but it doesn't seem like other text-model trends, like reasoning, will work out of the box. So you have to get dLLMs up to the quality of a regular autoregressive LLM, and then innovate further to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.
That said, token-based models are currently fast enough for most real-time chat applications, so I wonder what other use-cases there will be where speed is greatly prioritized over smarts. Perhaps trading on Trump tweets?
Diffusion is an alternative, but I'm having a hard time understanding the whole "built-in error correction" claim; it sounds like marketing BS. Both approaches replicate probability distributions, which will naturally be error-prone because of variance.
High tech US service industry exports are cooked.
It feels like models are becoming fungible apart from the hyperscaler frontier models from OpenAI, Google, Anthropic, et al.
I suppose VCs won't be funding many more "labs"-type companies, or companies whose core value prop is "we have a model"? Unless there's a tight application loop or something truly unique?
Disregarding the team composition, research background, and specific problem domain - if you were starting an AI company today, what part of the stack would you focus on? Foundation models, AI/ML infra, tooling, application layer, ...?
Where does the value accrue? What are the most important problems to work on?
This means on custom chips (Cerebras, Graphcore, etc...) we might see 10k-100k tokens/sec? Amazing stuff!
Also of note: funny how text generation started with autoregression/tokens and diffusion seems to perform better, while image generation went the opposite way.