An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).
Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.
[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)
[2] https://arxiv.org/abs/2402.08115
[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)
- Have an AI chat model come up with an answer to a problem.
- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.
- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.
- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.
It's super clunky but has given pretty good results in the cases where I tried it lol
I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.
Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.
One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”
It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).
1. You are the assistant. Please answer the question directly.
2. You are the cross-examiner. The assistant is wrong. Explain why.
3. You are the assistant. The cross-examiner is wrong. Defend your claim.
4. You are a judge. Did either party make their case, or is another round of argumentation required?
I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.
Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.
Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.
I've just begun my own iteration of adding Nash Equilibrium (NECoRT?) and reframing the "prompt engineering" to be a multi-agent negotiation. Curious what others think? https://github.com/faramarz/NECoRT/
my reasoning is that enterprise LLMs wont have any issue with the extra compute costs and would rather reconcile complex financials with various modeling optimizations.
I'm very new to public repo and contributions, and hope someone can point out if I'm doing it wrong.
my intention was to fork the ops codebase so I can test out my theory, and push as PR eventually
Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?
There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.
After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.
Happy someone else made it work.
I believe they called that machine learning.. Or re-enforced training.
I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?
The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.
It would be interesting to see if this is doing anything that’s not already being exploited.
Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.
Hell, even a whole quanta article. https://www.quantamagazine.org/debate-may-help-ai-models-con...
I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!
https://lepisma.xyz/2024/10/19/interventional-debates-for-st...
I believe there are researches on this too.
But the fundamental way they operate is the same - predicting the next token given previous tokens. Where/how does reasoning happen here?
Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.
Really fun, and helps me understand different sides of issues.
I also managed to make AI critique itself and that improved code generation a ton.
For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.
How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?
Docker works but I like to keep things simple.
Deno supports revoking file access but I'd like to keep using Bun.
Also I think I kind of assumed OpenAI might be doing this behind the curtain?
It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.
Programming isn’t what it used to be.
but I guess that was before chain of thought models
Costs of various kinds aside I've wanted that from assistance's inception — with precisely the features many call out and home-roll here, difference by both model/provider, and, "role"...
It seems like if you have the money/compute to burn, and can live with the reasoning wall-clock time,
this has got to be the best approach for the foreseeable future, for a lot of specific requirements.
(I also have wondered if this would illuminate the edges of what modern production models are capable of, "aggregating and integrating" over a variety of contributions might make more clear what the limits of their abilities are.)
Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.
utterly moronic.
They don't “think” ... not even in the most autistic sense of the word.
They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.