Chain of Recursive Thoughts: Make AI think harder by making it argue with itself

by miles

5282d ago

238 comments

Comments (238)

dudeinhawaii2d ago

I see a lot of threads pitting models against each other (or whole swarms of them) in the hope that "wisdom of crowds" will magically appear. After a stack of experiments of my own—and after watching the recent ASU/Microsoft-Research work [1].. I've landed on a simpler takeaway:

An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.

[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)

[2] https://arxiv.org/abs/2402.08115

[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)

odo12422d ago

Something I do sometimes is:

- Have an AI chat model come up with an answer to a problem.

- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.

- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.

- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.

It's super clunky but has given pretty good results in the cases where I tried it lol

Lerc2d ago

I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case by case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results , make suggestions. Even potentially a tree of analysts where suggestions only gets passed up the tree when the parent node thinks a lower analysis is particularly insightful.

I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.

Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.

cube22222d ago

This is really cool!

One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”

It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).

electroly2d ago

This seems to be different than I expected from the title. I thought it would be explicitly adversarial.

1. You are the assistant. Please answer the question directly.

2. You are the cross-examiner. The assistant is wrong. Explain why.

3. You are the assistant. The cross-examiner is wrong. Defend your claim.

4. You are a judge. Did either party make their case, or is another round of argumentation required?

I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.

hnuser1234562d ago

I'm having a lot of fun experimenting with stuff like this. I'm trying to put together an unrealengine blueprints style graph editor to allow people to design workflows like this where you start with the user prompt input, which goes to one agent, which makes an initial attempt, and then that conversation history gets passed to another "agent" with a different system prompt telling it to be a harsh critic, but to also give a pass/fail signal, and loop back until the critic judges pass, then send that back to the user as output. Ideally as a little website that can call your own LLM endpoints and save/load/share workflow graphs.

Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.

Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.

faramarz1d ago

That's cool! thanks for making it easy to fork and play with this!

I've just begun my own iteration of adding Nash Equilibrium (NECoRT?) and reframing the "prompt engineering" to be a multi-agent negotiation. Curious what others think? https://github.com/faramarz/NECoRT/

my reasoning is that enterprise LLMs wont have any issue with the extra compute costs and would rather reconcile complex financials with various modeling optimizations.

I'm very new to public repo and contributions, and hope someone can point out if I'm doing it wrong.

my intention was to fork the ops codebase so I can test out my theory, and push as PR eventually

jedberg2d ago

We're really going to need to figure out how to power all these GPUs with green power real quick, or we're going to melt the planet having AIs debate with themselves on the optimal solution to tik-tac-toe...

Xcelerate2d ago

I think this is how we get ML models to come up with novel ideas. Diagonalize against all the ideas they’ve already tried and dismissed via self-argument but keep certain consistency constraints. (Obviously much easier said than done.)

albertgoeswoof2d ago

How far is this going to go? Are we going to have a team of AI agents that runs a scrum team and meets for stand ups every couple of hours?

Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?

alexmolas2d ago

There are two examples in the repo, one with CoRT and another one without. And the one without it it's much better than the one that uses it. Weird choice of examples...

joshstrange2d ago

I've thought about trying this cross-model as well. Have Claude generate something, have OpenAI check it, have Gemini check that check. Firing multiple of these in parallel.

There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.

K0balt2d ago

I’ll second this. I often use a “research assistant “ and skeptical“department head” personas working together/against each other as a research team. It works well and is occasionally hilarious, replete with the occasional HR complaint when things go off the rails. ( I typically use local uncensored models)

caseyy2d ago

I tried something similar when Llama2 came out, pitting two assistants, who each believed the other is the user, against each other. Ultimately, it was the same model talking with itself. The system prompts for both had various instructions to disagree and criticise the opinion of the user. I provided the first message to get things started. Usually, it’s be along the lines of “nuclear proliferation is harmful to humanity”.

After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.

Happy someone else made it work.

bilekas2d ago

This is an interesting approach, it reminds me of YT creator actually. I'll find the YT creator, but basically he would make some script that would play the game like a race-course, with the goal being the finish line and iterate it N number of times, the script would keep iterating until it found the fastest solution.

I believe they called that machine learning.. Or re-enforced training.

I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?

https://www.youtube.com/watch?v=SX08NT55YhA

WhitneyLand2d ago

Why try this idea on base models only?

The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.

It would be interesting to see if this is doing anything that’s not already being exploited.

ChadMoran2d ago

Fast Agent has this as a first-class citizen called "Evaluator Optimizer" pattern. Where it in a loop with a defined number of max refinements judge itself and give the output a rating, demanding it improve it's output.

Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.

https://github.com/evalstate/fast-agent

Der_Einzige2d ago

Debate as a reasoning tactic is massively undervalued. There's tons of papers on this at places like NeurIPS, ICML, ICLR, etc.

Hell, even a whole quanta article. https://www.quantamagazine.org/debate-may-help-ai-models-con...

I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!

lepisma2d ago

Debates have worked good for me while learning something new:

https://lepisma.xyz/2024/10/19/interventional-debates-for-st...

I believe there are researches on this too.

mortarion1d ago

I think Gemini 2.5 already does something similar. If you read the "thinking descriptions" that it outputs it often thinks about going back to older thoughts to verify and criticize.

badmonster2d ago

Have you experimented with weighting the self-evaluations based on specific criteria (e.g., correctness, clarity, creativity), or using external validators to guide the AI’s final choice? Curious how much tuning the evaluation step impacts overall performance.

schnitzelstoat1d ago

I probably don't understand the modern, complex models. But doesn't it basically predict the next token given the context and the better models use more training data and can consider a larger context, and have more parameters to better retain information from the training data etc.

But the fundamental way they operate is the same - predicting the next token given previous tokens. Where/how does reasoning happen here?

k2xl2d ago

I've done something similar for learning about a controversial topic. I ask it to act as if it is called Bob is a well informed supporter of one side (like Ukraine) and then act as if it is something named Alice who is a well informed supporter of another side (Russia) and they have to debate each other over a few prompts with a moderator named 'Sue'

Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.

Really fun, and helps me understand different sides of issues.

zekenie2d ago

I feel like itd be cool to try prompts based on an adversarial justice system… attorney agents arguing both sides, a judge ruling on “the law”—adherence to instructions etc

hu32d ago

Here's some related challenge I'm facing. Maybe someone can help me:

I also managed to make AI critique itself and that improved code generation a ton.

For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.

How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?

Docker works but I like to keep things simple.

Deno supports revoking file access but I'd like to keep using Bun.

stormfather2d ago

I made a trading bot that ingested news. The prompt to assess impact was to simulate a debate between Charlie Munger and Warren Buffet on whether to invest.

rriley1d ago

Makes me wonder what would happen if we combine LLMs with recursive genetic algorithms. Similar to https://github.com/DivergentAI/dreamGPT

thunderbong2d ago

A lot of the comments here are reminiscent of the early Google days when everyone was finding ways to search better!

pkdpic2d ago

So glad to see a write up on this finally. I'm no machine learning phd but I always wondered why this wasn't more of a thing. Like an extension of a GAN conceptually, sort of, not really at all Im sure.

Also I think I kind of assumed OpenAI might be doing this behind the curtain?

mritchie7122d ago

Did something similar (OverkiLLM) to this waayyyy back in August with open LLMs. I'm sure it'd work much better now:

https://www.definite.app/blog/overkillm

ausbah2d ago

at some point this doesn’t make LLMs feel useful. I have to wait 10x as long just so my LLM can have a somewhat higher chance of actually answer my question correctly?

cwillu2d ago

Any api that lets you constrain output to a formal syntax should let you do away with the “first output a number, and only then explain yourself” boilerplate.

killerstorm2d ago

This is similar to Tree-of-Thought with self-evaluation.

keyle2d ago

When will we get the `4o` vs `o3` background conversation in "thinking" leading to a more correct result?

alex11382d ago

Every single one of my prompts would be "Are you suuuuuuure you're not hallucinating that?"

daxfohl2d ago

Maybe have a "reconcile" option, for it to see if it can mix and match the best parts of each alternative rather than just choosing one.

grzracz2d ago

Your readme demo images are wrong: the terminal one is the non-CoRT one and the GUI one is the one with CoRT. Confused me for a while

ashoeafoot2d ago

Give it reward and punishment evaluations, exploring the noise in parallel, extinction for the non rewarding answers ?

kevinrineer2d ago

This sounds like the zeitgeist is approaching genetic algorithms, which are super fun. Adversarial stuff is great.

noworriesnate2d ago

I’ve had success telling the model it really needs to poop and if it gets to the point quickly it’ll be able to leave the meeting and go do that. It actually works amazingly well.

It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.

Programming isn’t what it used to be.

throwawayForMe22d ago

I wonder if the Scholastic method of the Schoolmen would be useful with its argument and counter argument style.

mangoman2d ago

a paper with a similar idea on scaling test time reasoning, this is sorta how all the thinking models work under the hood. https://arxiv.org/abs/2501.19393

gnarlouse2d ago

This seems like low hanging fruit; are we seriously supposed to believe this is new and novel?

Garlef2d ago

Similarly, letting the LLM generate a socratic dialogue can work pretty well to get deeper into a topic.

irthomasthomas2d ago

my favourite pattern rn: llm "write a savage, yet grounded roast of: $content" llm -c "Write an equally savage rebuttal" llm -c "first arbitrate and then synthesize a final review."

yieldcrv2d ago

Reminds me of baby agi from 2 years ago

but I guess that was before chain of thought models

stevefan19992d ago

That is just reinforcement learning in disguise

asdfman1232d ago

And when I do this people say I'm overanalyzing

lonetripper2d ago

all this hard thinking yet humanity fails to come up with just one girlfriend for me

robofanatic2d ago

soon there will be AI debates. Different models debating with each other on a topic

csours2d ago

Yes, give the computers anxiety too!

jbellis2d ago

does it actually make a difference to do M rounds of N vs one round of M*N?

mparnisari2d ago

So like rubber ducking for AI?

celltalk2d ago

One of my doctoral propositions is, dialog leads to true artificial intelligence.

firgrove2d ago

this is amazing - I love seeing novel approaches to optimizing

animitronix2d ago

Adversarial networks have been a thing for a while

aaroninsf2d ago

Question: has the the adversarial approach been roled into any coding copilots/assistant frameworks?

Costs of various kinds aside I've wanted that from assistance's inception — with precisely the features many call out and home-roll here, difference by both model/provider, and, "role"...

It seems like if you have the money/compute to burn, and can live with the reasoning wall-clock time,

this has got to be the best approach for the foreseeable future, for a lot of specific requirements.

(I also have wondered if this would illuminate the edges of what modern production models are capable of, "aggregating and integrating" over a variety of contributions might make more clear what the limits of their abilities are.)

Svoka2d ago

Oh. I was just asking "Use dialectic method on your solution" in the end of the prompt... It does make it think harder.

getcrunk2d ago

Hello cnn’s

j452d ago

There appear to be no shortage of token saving attempts that can end up using more tokens, whether it's a monthly paid plan or API.

Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.

parrit2d ago

I want to see "Meh" vs. "Holy crap" as a benchmark in a paper published by Google. Or more likely I suspect, Andrej.

codr72d ago

Better yet, let it argue with another AI, preferably using voice; instant entertainment.

akomtu2d ago

The modern Alchemy: the belief that you can extract gold (intelligence) from iron (autocomplete by imitation) by mixing iron with itself.

antisthenes2d ago

Cool. Now I can justify talking to myself.

m3kw92d ago

Isn’t this best of n?

DyslexicAtheist2d ago

> "I made my AI think" ...

utterly moronic.

They don't “think” ... not even in the most autistic sense of the word.

They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.

lenerdenator2d ago

I, too, like to give Terminator lite anxiety.

hansmayer2d ago

Right, so... but you do realise its still just producing random output based on how you reconfigured it's weights, right? Sometimes it will happen to resonate with what you need. But it still neither thinking nor arguing with itself.

[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)

[2] https://arxiv.org/abs/2402.08115

[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)