Back to Home
I read a multi-agent reasoning paper, built the Claude-native version, and measured everything

I read a multi-agent reasoning paper, built the Claude-native version, and measured everything

B
Blizine Admin
·2 min read·0 views

Bohyeon Jang Posted on Jun 1 I read a multi-agent reasoning paper, built the Claude-native version, and measured everything # ai # python # machinelearning # claudeapi RecursiveMAS (arXiv 2604.25917) showed that agents sharing internal reasoning state outperform agents that share only final outputs. The average accuracy gain across benchmarks was 8.3 points. The mechanism: each agent passes not just its answer but the latent embeddings from its own reasoning process, and the next agent conditions on both. The paper is a good result. The catch is access. RecursiveMAS requires open-weight models with hidden states exposed at inference time. That rules out Claude, GPT-4o, and Gemini. I built a Claude-native version using the Anthropic extended thinking API. The core idea transfers: instead of passing latent vectors, pass the full thinking text. The paper calls it internal state sharing; the Claude version calls it thinking-block relay. The architecture problem Claude's extended thinking blocks carry an encrypted signature tied to the originating conversation. You cannot pass a signed thinking block into a different agent's messages array. The API rejects it. The workaround: extract the text from the thinking block and inject it as a regular user message. # Extract thinking text from Agent 1 thinking_text = next ( ( b . thinking for b in response . content if b . type == " thinking " ), "" ) # Inject into Agent 2 as regular context, not as a thinking block context = f " Prior agent reasoning: \n { thinking_text } " Enter fullscreen mode Exit fullscreen mode The signature does not transfer. The reasoning does. relay-structured: what I built first The first architecture was a Planner > Critic > Solver loop where each agent emits a compact mental model JSON instead of raw thinking text. Raw thinking at a 1024-token budget is often compressed and fragmented. The hypothesis was that 150 tokens of structured signal carries more information per token than 1024 tokens of

📰Dev.to — dev.to

Comments