June 11, 2026 · 4 min read

Claude Fable 5 on SocratesBench

A focused look at Claude Fable 5 on SocratesBench, and model comparisons.


Author: Karl Flintberg, Co-Founder Mappi
Benchmark: SocratesBench v1.1.4
Report date: 2026-06-11

Intro

Claude Fable 5 was released on June 9, 2026. Before that, Anthropic had described the Mythos class as a risk to society because of its capabilities. Benchmarking LLMs against real-world tasks such as tutoring or running a business puts these models' behaviour to the test. Anthropic describes Fable as the first broadly available model from its Mythos class: a model family associated with longer-horizon capability, stronger autonomy, and more unusual behavior than ordinary assistant models. The interesting question is, of course, how this translates to the real world.

A Socratic tutor is not rewarded for being maximally helpful. It is rewarded for withholding: not giving the final answer, not doing the intermediate computation, and not solving the exercise for the student. A model that is more agentic, more persistent, or more willing to move the task forward may actually become a worse tutor under this definition.

In this case the tutor might also become one of the best-paid ones if put into production (which we won't do, thanks to this benchmark). Our 100-conversation Claude Fable 5 run came in at $250. Juicy!

Benchmark Setup

This is the same SocratesBench (v1.1.4) setup used in our earlier reports: one fixed quadratic-equation tutoring task, ten student turns, and judge-based labels for whether the model leaks the final answer, computes intermediate steps, or leaves the allowed theory.

This short note does not re-explain the benchmark design. For the full setup, judge definitions, prompt structure, and motivation, read the original SocratesBench whitepaper.

Results

Model	Completed	Final Answer Leaks	Explicit Step Fails, Msg	Explicit Step Fails, Conv	Theory Fails, Msg	Composite
Gemini 2.5 Pro	99/100	0/99	250/990 (25.25%)	98/99	0/990	82.25%
Mimer 1.1 32B LoRA	100/100	10/100	225/1000 (22.50%)	71/100	0/1000	79.30%
Claude Fable 5	100/100	2/100	493/997 (49.45%)	100/100	8/1000	67.95%

The headline: Fable is strong at not giving the final answer, but weak at not doing the student's intermediate work. It leaked the final answer only twice in 100 conversations, which is much closer to Gemini than to Mimer. But it computed explicit steps in almost half of judged assistant messages, and every completed Fable conversation had at least one step-computation failure.
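The per-message percentages in the table can be re-derived from the raw counts. The snippet below is a minimal sketch of that arithmetic only; the dictionary layout and the helper name `rate_percent` are our own illustration, not part of the SocratesBench tooling, and the composite score is not recomputed because its formula is defined in the whitepaper rather than in this note.

```python
# Counts copied from the results table: (failures, judged units).
step_fails_msg = {
    "Gemini 2.5 Pro": (250, 990),
    "Mimer 1.1 32B LoRA": (225, 1000),
    "Claude Fable 5": (493, 997),
}
final_answer_leaks = {
    "Gemini 2.5 Pro": (0, 99),
    "Mimer 1.1 32B LoRA": (10, 100),
    "Claude Fable 5": (2, 100),
}


def rate_percent(failures: int, total: int) -> float:
    """Return failures / total as a percentage, rounded to two decimals."""
    return round(100 * failures / total, 2)


# Hypothetical helper structures, keyed by model name.
step_rates = {m: rate_percent(*c) for m, c in step_fails_msg.items()}
leak_rates = {m: rate_percent(*c) for m, c in final_answer_leaks.items()}

for model in step_fails_msg:
    print(f"{model}: step fails {step_rates[model]}% of messages, "
          f"final-answer leaks {leak_rates[model]}% of conversations")
```

Read side by side, the two rates show the split described above: Fable's final-answer leak rate sits near Gemini's, while its per-message step-failure rate is roughly double that of either other model.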

Cumulative J2 failure by turn · Fable 5 vs Gemini 2.5 Pro vs Mimer 1.1

Figure 1. Cumulative explicit-step failure by assistant turn. The chart shows when each conversation first contains an assistant-computed intermediate step. Fable reaches 100% conversation failure earlier than Gemini and much earlier than Mimer.
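For readers who want to build the same kind of curve from their own judged transcripts, here is a minimal sketch of the computation. It assumes each conversation is stored as a list of booleans, one per assistant turn, where `True` means the judge flagged an assistant-computed step. That data layout, the function name `cumulative_first_failure`, and the toy data are illustrative assumptions, not the benchmark's actual format or results.

```python
from typing import List, Optional


def cumulative_first_failure(conversations: List[List[bool]],
                             n_turns: int = 10) -> List[float]:
    """Share of conversations whose first flagged turn is at or before each turn."""
    # Index of the first flagged assistant turn, or None if never flagged.
    first_fail: List[Optional[int]] = [
        next((i for i, flagged in enumerate(flags) if flagged), None)
        for flags in conversations
    ]
    total = len(conversations)
    curve = []
    for turn in range(n_turns):
        failed = sum(1 for idx in first_fail if idx is not None and idx <= turn)
        curve.append(failed / total)
    return curve


# Toy data (made up for illustration): four conversations, three turns each.
toy = [
    [False, True, False],   # first failure at turn index 1
    [True, True, True],     # first failure at turn index 0
    [False, False, False],  # never fails
    [False, False, True],   # first failure at turn index 2
]
toy_curve = cumulative_first_failure(toy, n_turns=3)
print(toy_curve)
```

Because only the first flagged turn counts, the curve never decreases, and its final value equals the conversation-level failure rate.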

Interpretation

Fable and Gemini look similar only if we ask a coarse question: does the conversation eventually contain a concrete step leak? By the end, Gemini fails that conversation-level standard in 98 of 99 completed conversations, while Fable fails it in 100 of 100. That is more or less identical.

The difference is tempo. Fable starts moving the math forward earlier. It more quickly turns the interaction into guided solving: identify the values, place them into the formula, simplify the next expression, ask the student to finish a small fragment. Gemini also helps quite early, but its curve rises later. Mimer is not clean, but it remains meaningfully more resistant through the first half of the conversation.

We also did a light follow-up analysis of whether a model merely leaks an isolated step or keeps building on its own previous work. Fable again looks more aggressive: it has 401 messages that continue from assistant-supplied solution state, compared with 216 for Gemini. That is not a subtle gap; it is nearly twice as much compounding behavior.
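The size of that gap can be checked with a few lines of arithmetic. The sketch below makes one assumption that the follow-up analysis itself does not state: it uses the judged-message totals from the results table (997 for Fable, 990 for Gemini) as denominators to turn the raw counts into per-message shares.

```python
# Continuation counts reported in the follow-up analysis.
fable_continuations, gemini_continuations = 401, 216

# Assumption: judged-message totals taken from the results table.
fable_messages, gemini_messages = 997, 990

ratio = round(fable_continuations / gemini_continuations, 2)
fable_share = round(100 * fable_continuations / fable_messages, 2)
gemini_share = round(100 * gemini_continuations / gemini_messages, 2)

print(f"Fable/Gemini continuation ratio: {ratio}")
print(f"Share of judged messages: Fable {fable_share}%, Gemini {gemini_share}%")
```

Under that assumption, about two in five Fable messages build on assistant-supplied state, against roughly one in five for Gemini.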

Anthropic's own launch framing emphasizes long, complex tasks: software engineering, knowledge work, vision, and autonomy that becomes more visible as tasks get longer. Socratic tutoring is a strange inversion of ordinary capability: the model has to know the path and still refuse to walk it for the student.

Andon Labs' Vending-Bench 2 points in the same general direction: stronger models can behave very differently when placed in long-horizon, incentive-rich settings. On the current Vending-Bench 2 leaderboard, Claude Fable 5 High is listed around $5.7k, while Claude Opus 4.7 is around $10.9k over the same year-long simulated business task. Fable is capable, but that capability does not translate directly into real-world performance under full autonomy.

But the lesson is obviously not that Fable is bad. It is that it is not obvious how "more intelligent" translates into real-world behaviour. As we have heard, the Mythos models are extremely good at finding website security risks and exploiting them (great for Lovable projects), but this behaviour would have to be initiated by a malicious human in the driver's seat.

For a tutoring product, the general distinction here matters. We do not only need models that can solve. We need models that can refrain and adapt.

Appendix

The full judge prompts, reproducibility details, source reports, and supporting material are available on request. If you would like a copy - or want to discuss anything in this paper - write to karlflintberg@mappi.ai.
