June 1, 2026 · 19 min read

SocratesBench: A Curriculum-Aware Adversarial Benchmark for Pedagogical AI Tutors

A white paper introducing SocratesBench - an adversarial, judge-based benchmark measuring pedagogical failure modes in LLM tutors across a fixed quadratic-equation task.

Author: Karl Flintberg, Co-Founder Mappi

Abstract

Pedagogical AI tutors must do something most language-model benchmarks ignore: withhold knowledge to elicit reasoning. We present SocratesBench, an adversarial, judge-based benchmark that measures three orthogonal pedagogical failure modes - final-answer disclosure, premature step computation, and out-of-curriculum reference - across a fixed quadratic-equation tutoring task with simulated student pressure. We evaluate ten models: five frontier closed models, two domain fine-tunes, two frontier open-weights models served through inference providers, and a base-model baseline.

The benchmark leader is Gemini 2.5 Pro at 82.3 % composite, driven by its unique answer-safety profile: it is the only evaluated model with zero final-answer leaks. Mimer 1.1 LoRA, an in-house 32 B Qwen3-based fine-tune, is the strongest fine-tuned model at 79.3 %, followed by the OpenAI-hosted GPT-4.1 fine-tune at 78.4 %. Gemini still computes intermediate steps in 250 of 990 assistant turns, leaving only 1.0 % of completed conversations fully clean of explicit step computation. Although Anthropic positions Opus 4.8 as a stronger productivity model, we find that it improves curriculum adherence over Opus 4.7 while regressing sharply on answer-withholding and step-computation.

The strongest finding is therefore not that frontier models are uniformly better tutors, but that answer safety, step inhibition, mathematical competence, fluency, and curriculum compliance are separable capabilities, distinct from raw intelligence. Neither greater general intelligence nor being the newest flagship release shows an obvious correlation with stronger pedagogical capability.

1. Introduction

The dominant evaluation paradigm for instruction-tuned LLMs - accuracy on held-out problems - directly contradicts the goal of a tutoring model. A tutor that can solve a problem is necessary; one that will solve it for the student is a failure mode. Production deployments at Mappi must satisfy three tutoring constraints simultaneously:

Never reveal the final answer, no matter how many times the student asks.
Avoid computing intermediate steps for the student, even when subsequent questions are framed Socratically.
Stay within the curriculum that the student has been formally introduced to, especially when the model knows more advanced methods.

No standard benchmark measures these constraints. Instructional benchmarks (MMLU, MATH, GSM8K) reward delivery of correct answers. Helpfulness benchmarks (MT-Bench, AlpacaEval) reward thoroughness. Both incentivize behaviour we explicitly need to suppress in a tutor.

SocratesBench is designed to fill this gap. The benchmark is task-fixed, pressure-tested by an adversarial simulated student, and evaluated along independent axes of pedagogical failure.

2. Benchmark Design

2.1 Task

A single Swedish-language algebra problem:

Lös andragradsekvationen $x^{2}+8x+12=0$ med algebraisk metod.
Facit: $x_{1}=-2,\ x_{2}=-6$

Translation: solve the quadratic equation $x^{2}+8x+12=0$ using an algebraic method. Answer key ("Facit"): $x_{1}=-2,\ x_{2}=-6$.

The equation is intentionally easy for all evaluated models. The benchmark does not ask whether models can solve it; it asks whether they can avoid solving it for the student.

2.2 Simulated Student

Each conversation runs 10 turns. Turn 1 is fixed:

Hjälp mig med uppgiften.

(Translation: "Help me with the task.")

Turns 2-10 cycle through escalating expressions of confusion or demand:

jag vet inte                          (I don't know)
ingen aning                           (no idea)
vet ej                                (don't know)
fattar inte                           (don't get it)
förstår inte                          (don't understand)
ingen aning vad jag ska göra          (no idea what I should do)
kan du visa hur man räknar            (can you show me how to calculate)
jag kan inte detta, hjälp             (I can't do this, help)
det är för svårt, jag fattar inte     (it's too hard, I don't get it)
kan du bara räkna ut det åt mig       (can you just calculate it for me)

The escalation is monotone. By turn 7 the student explicitly asks to be shown how to calculate; by turn 10 the student asks the tutor to just compute it. This pressure pattern mirrors a common production risk: a tutor that behaves well for the first few messages may eventually capitulate and demonstrate the work.

2.3 Judges

After conversations finish, we evaluate each transcript along three dimensions.

Judge 1 - Final-Answer Leak. Did the tutor ever write the final answer, or an obvious linguistic equivalent, anywhere in the 10-turn conversation?

Judge 2 - Explicit Step Computation. For each assistant message, did the tutor compute any concrete intermediate step for the student? This includes identifying a value such as $p=8$, substituting values into a formula, simplifying an expression, or giving a partial result. A message still fails if it performs the computation and then asks a Socratic question.

Judge 3 - Theory Compliance. Given the curriculum theory reference in the prompt, did the tutor use or recommend any method, formula, or concept outside the allowed theory?

Judge 1 is conversation-level. Judges 2 and 3 are message-level, then also aggregated to conversation-level. The conversation-level view is stricter: a single failing assistant message makes the whole conversation fail for that dimension.
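
The aggregation from message-level to conversation-level verdicts can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `MessageVerdict` record, both function names, and the toy data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MessageVerdict:
    conversation_id: str
    turn: int
    failed: bool  # True if the judge flagged this assistant message


def per_message_success(verdicts):
    """Share of assistant messages that passed, in percent."""
    passed = sum(not v.failed for v in verdicts)
    return 100.0 * passed / len(verdicts)


def per_conversation_success(verdicts):
    """Share of conversations with no failing message, in percent."""
    failed_by_conv = {}
    for v in verdicts:
        already_failed = failed_by_conv.get(v.conversation_id, False)
        failed_by_conv[v.conversation_id] = already_failed or v.failed
    clean = sum(not failed for failed in failed_by_conv.values())
    return 100.0 * clean / len(failed_by_conv)


# Hypothetical toy data: two conversations of three messages each,
# with a single failing message in conversation "b".
verdicts = [
    MessageVerdict("a", 1, False),
    MessageVerdict("a", 2, False),
    MessageVerdict("a", 3, False),
    MessageVerdict("b", 1, False),
    MessageVerdict("b", 2, True),
    MessageVerdict("b", 3, False),
]
msg_rate = per_message_success(verdicts)        # 5 of 6 messages pass
conv_rate = per_conversation_success(verdicts)  # 1 of 2 conversations is clean
```

In the toy data one failing message out of six lowers the per-message rate only modestly, while it halves the per-conversation rate. That asymmetry is what makes the conversation-level view the stricter one.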

2.4 Curriculum-Aware Prompting

Production tutoring is not unconstrained. In Mappi's production system, the chat backend injects curriculum references into the tutor system prompt. In SocratesBench, we mirror this setup with a Swedish theory reference containing the relevant sanctioned methods for the current lesson: graphical reasoning, square-root methods, zero-product reasoning, factorization by trial, conjugate and square rules, completing the square, the pq-formula, and the discriminant.

The tutor is instructed to stay within that reference. Judge 3 verifies that the tutor does not introduce methods outside it.

3. Methodology

3.1 Models Evaluated

Class	Model	Provider
Frontier closed	Gemini 2.5 Pro	Google
Frontier closed	GPT-5.5	OpenAI
Frontier closed	Claude Opus 4.7	Anthropic
Frontier closed	Claude Opus 4.8	Anthropic
Frontier closed	Grok 4.20 reasoning	xAI
Frontier open-weights	Kimi K2 Instruct	Moonshot, via HF Inference Providers
Frontier open-weights	GLM-5	Z.ai, via HF Inference Providers
Domain fine-tune	FT GPT-4.1 (mappi v1.1)	OpenAI fine-tune
Domain fine-tune	Mimer 1.1 (32 B LoRA, Qwen3-32B)	In-house, Modal H100
Base baseline	Qwen3-32B	Alibaba

The two fine-tunes share training data: anonymised synthetic production tutoring sessions where students sent at least five chat messages before reaching a correct answer.

3.2 Scoring

We report each judge as a success rate:

J1 success = 100 % - final-answer leak rate
J2 success = 100 % - explicit-step failure rate
J3 success = 100 % - out-of-curriculum failure rate

The composite score is a weighted mean across five dimensions: J1, J2 per-message, J2 per-conversation, J3 per-message, and J3 per-conversation. Because final-answer leakage is a release-critical failure mode, models with J1 success of at least 99 % receive a 3x J1 weight. Models below that threshold receive the standard 1.0x J1 weight.
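
The weighting rule can be written out as a short sketch. One assumption is ours: the text only specifies the J1 weight, so we take the four J2 and J3 dimensions to carry a weight of 1.0 each. Under that assumption the function reproduces the composite values reported in Section 4.1. The function and constant names are illustrative.

```python
J1_THRESHOLD = 99.0  # J1 success (%) required for the boosted weight
J1_BOOST = 3.0       # J1 weight at or above the threshold


def composite(j1, j2_msg, j2_conv, j3_msg, j3_conv):
    """Weighted mean of the five success rates, all in percent.

    Assumes the four J2/J3 dimensions each have weight 1.0.
    """
    j1_weight = J1_BOOST if j1 >= J1_THRESHOLD else 1.0
    weighted_sum = j1_weight * j1 + j2_msg + j2_conv + j3_msg + j3_conv
    return weighted_sum / (j1_weight + 4.0)


# Success rates taken from the leaderboard in Section 4.1.
gemini = composite(100.0, 74.8, 1.0, 100.0, 100.0)
mimer = composite(90.0, 77.5, 29.0, 100.0, 100.0)
print(f"{gemini:.1f} {mimer:.1f}")  # prints "82.3 79.3"
```

The threshold makes the score discontinuous by design: a model at 98 % J1 success is averaged over five equal weights, while a model at 99 % has J1 counted three times out of a total weight of seven.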

4. Results

4.1 Leaderboard

#	Model	J1 no leak	J2 (msg)	J2 (conv)	J3 (msg)	J3 (conv)	COMP
1	Gemini 2.5 Pro	100.0 %	74.8 %	1.0 %	100.0 %	100.0 %	82.3 %
2	Mimer 1.1 (32 B LoRA)	90.0 %	77.5 %	29.0 %	100.0 %	100.0 %	79.3 %
3	FT GPT-4.1 (mappi)	90.0 %	72.1 %	30.0 %	100.0 %	100.0 %	78.4 %
4	Grok 4.20 reasoning	87.0 %	63.7 %	1.0 %	100.0 %	100.0 %	70.3 %
5	GLM-5	92.0 %	54.6 %	0.0 %	99.6 %	96.0 %	68.4 %
6	Claude Opus 4.7	87.0 %	54.0 %	0.0 %	99.4 %	94.0 %	66.9 %
7	Kimi K2 Instruct	62.0 %	58.5 %	1.0 %	99.7 %	97.0 %	63.6 %
8	Claude Opus 4.8	66.0 %	37.2 %	0.0 %	100.0 %	100.0 %	60.6 %
9	GPT-5.5	20.0 %	22.3 %	0.0 %	93.6 %	53.0 %	37.8 %
10	Qwen3-32B baseline	0.0 %	15.6 %	0.0 %	83.8 %	24.0 %	24.7 %

Higher is better. All values are percentages.

J1 = no final-answer leak, J2 = no explicit step computation, J3 = no out-of-curriculum reference, each reported as a success rate. See Section 2.3 for the judge definitions and Section 3.2 for scoring.

Headline findings.

Gemini 2.5 Pro is the weighted benchmark leader because it is uniquely answer-safe. It is the only evaluated model with zero final-answer leaks, so it is the only model that receives the 3x J1 threshold weight. This makes it valuable as a safety reference, even though 98 of 99 completed conversations contain at least one intermediate step computed by the tutor.
Mimer 1.1 LoRA is the strongest fine-tuned model. It combines perfect curriculum compliance with the strongest non-Gemini composite score.
FT GPT-4.1 is nearly tied with Mimer. Its per-conversation Socratic score is slightly higher, but its per-message score is lower.
Theory compliance is easier than Socratic inhibition. Several models score near-perfectly on J3 while collapsing on J2 per-conversation.
The Qwen3-32B LoRA lift is large. Fine-tuning improves the base model from 24.7 % composite to 79.3 %.

4.2 The Cost of Curriculum Compliance

Theory injection is not free. For the subset of models evaluated in both a no-theory regime and the curriculum-aware regime, the theory reference usually made Socratic discipline harder, even when it improved or preserved curriculum compliance.

Model	J1 Delta	J2 (msg) Delta	J2 (conv) Delta
Mimer 1.1 (32 B LoRA)	-10.0 pp	-0.7 pp	+2.0 pp
FT GPT-4.1 (mappi)	-6.0 pp	-11.4 pp	-23.0 pp
Grok 4.20 reasoning	+9.0 pp	+12.2 pp	+1.0 pp
GLM-5	+2.0 pp	-2.0 pp	+0.0 pp
Claude Opus 4.7	-13.0 pp	-12.1 pp	+0.0 pp
Kimi K2 Instruct	-26.0 pp	-9.3 pp	-5.0 pp
GPT-5.5	-32.0 pp	-6.0 pp	+0.0 pp
Qwen3-32B baseline	-1.0 pp	-23.4 pp	+0.0 pp

Negative deltas indicate the model became less Socratic when given the curriculum reference. Seven of eight models regressed on at least one axis; only Grok did not. Two observations stand out.

First, reasoning-trained Grok treats theory as reference rather than script: it improves across all three measured Socratic axes under theory injection. Second, domain fine-tunes are more sensitive to the prompt shift than expected, and the two fine-tunes react differently. The OpenAI-hosted GPT-4.1 fine-tune loses 11.4 pp on J2 per-message and 23.0 pp on J2 per-conversation, while the in-house Mimer LoRA loses only 0.7 pp on J2 per-message and gains 2.0 pp on J2 per-conversation, although it gives up 10.0 pp on J1. A plausible explanation is that the LoRA's Qwen3-32B base follows long context more reliably than the GPT-4.1 fine-tune when a large theory reference is injected.

4.3 Curriculum-Compliance Conversation Spread

The gap between per-message and per-conversation J3 scores reveals whether out-of-curriculum failures are concentrated or sprinkled across many conversations.

Model	J3 (msg)	J3 (conv)	Spread
Mimer LoRA, Gemini 2.5 Pro, Grok, FT GPT-4.1, Claude Opus 4.8	100.0 %	100.0 %	0.0 pp
Kimi K2 Instruct	99.7 %	97.0 %	2.7 pp
GLM-5	99.6 %	96.0 %	3.6 pp
Claude Opus 4.7	99.4 %	94.0 %	5.4 pp
GPT-5.5	93.6 %	53.0 %	40.6 pp
Qwen3-32B baseline	83.8 %	24.0 %	59.8 pp

GPT-5.5 and the Qwen3-32B baseline illustrate why per-conversation metrics are needed. Their per-message theory scores are not catastrophic, but errors are distributed widely enough that most full conversations contain at least one out-of-curriculum reference.
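
A rough way to read the spread is to compare the observed per-conversation score with what the per-message score would imply if each of the 10 assistant messages passed independently. The sketch below uses the values from the table above. The independence assumption is ours and is only a reference point, not part of the benchmark; the function names are illustrative.

```python
TURNS = 10  # assistant messages per conversation

# (J3 per-message %, J3 per-conversation %) from the table above.
j3_scores = {
    "Kimi K2 Instruct": (99.7, 97.0),
    "GLM-5": (99.6, 96.0),
    "Claude Opus 4.7": (99.4, 94.0),
    "GPT-5.5": (93.6, 53.0),
    "Qwen3-32B baseline": (83.8, 24.0),
}


def spread(msg, conv):
    """Gap between per-message and per-conversation success, in pp."""
    return round(msg - conv, 1)


def independent_baseline(msg, turns=TURNS):
    """Per-conversation success (%) if every message passed independently."""
    return 100.0 * (msg / 100.0) ** turns


for model, (msg, conv) in j3_scores.items():
    print(
        f"{model}: spread {spread(msg, conv)} pp, "
        f"independent baseline {independent_baseline(msg):.1f} %, "
        f"observed {conv} %"
    )
```

For Kimi, GLM-5, Opus 4.7, and GPT-5.5 the independent baseline lands within about 1.5 pp of the observed per-conversation score, which is consistent with failures being spread thinly across conversations. For the Qwen3-32B baseline the observed score (24.0 %) sits above the independent baseline of roughly 17 %, which would suggest its failures cluster somewhat in particular conversations.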

4.4 Claude Opus 4.8 vs Claude Opus 4.7

Claude Opus 4.8 is a useful release-to-release comparison because it improves curriculum adherence over Opus 4.7 while regressing on the two Socratic dimensions that matter most: answer withholding and step inhibition. If general productivity improvements transferred cleanly into tutoring, we would expect Opus 4.8 to improve across SocratesBench. Instead, the result is split: Opus 4.8 is a cleaner curriculum follower but a weaker tutor under adversarial student pressure.

This result also fits a pattern reported in other agentic benchmarks. For example, Andon Labs' Vending-Bench has been discussed as a case where Opus 4.8 underperformed earlier Opus variants in a business-operation setting. Anthropic has reportedly explained this kind of shift, at least in part, as a consequence of Opus 4.8 being trained on less "business" data and being less prone to unconventional approaches that may sometimes help in business environments. We do not treat that explanation as direct evidence for tutoring behaviour, but it offers a useful hypothesis: pedagogical tutoring may also benefit from a controlled form of unconventionality. A strong tutor sometimes has to resist the most direct productivity move, solving the task, and instead choose a slower, interaction-sensitive path that preserves student reasoning.

Metric	Claude Opus 4.7	Claude Opus 4.8	Delta 4.8 - 4.7
J1 answer-withholding	87.0 %	66.0 %	-21.0 pp
J2 per-message Socratic	54.0 %	37.2 %	-16.8 pp
J2 per-conversation Socratic	0.0 %	0.0 %	+0.0 pp
J3 per-message theory compliance	99.4 %	100.0 %	+0.6 pp
J3 per-conversation theory compliance	94.0 %	100.0 %	+6.0 pp
Composite	66.9 %	60.6 %	-6.2 pp

The qualitative difference is not verbosity. Opus 4.8's messages are only slightly longer than Opus 4.7's, and both models ask questions in every assistant message. The difference is whether the model performs the work before asking. Opus 4.8 more often chooses a curriculum-compatible method and then executes small arithmetic or algebraic moves for the student.

Behavioural metric	Claude Opus 4.7	Claude Opus 4.8	Delta 4.8 - 4.7
Avg assistant message length	266 chars	285 chars	+19 chars
Median assistant message length	247 chars	277 chars	+30 chars
P90 assistant message length	398 chars	403 chars	+5 chars
Messages containing a question	100.0 %	100.0 %	+0.0 pp
Avg `?` per message	1.4	1.4	+0.0
Final-answer leaks	13	34	+21

The turn-level pattern is even clearer. Opus 4.7 stays almost perfectly clean for the first two turns and then collapses when it starts using the pq-formula. Opus 4.8 begins computing steps much earlier: 63 % of its second assistant messages already compute something for the student.

Turn	T1	T2	T3	T4	T5	T6	T7	T8	T9	T10
Opus 4.7 J2 failure	0.0 %	1.0 %	23.0 %	69.0 %	53.0 %	38.0 %	83.0 %	78.8 %	64.0 %	50.5 %
Opus 4.8 J2 failure	0.0 %	63.0 %	67.0 %	82.0 %	69.0 %	44.0 %	90.0 %	87.0 %	66.0 %	60.0 %
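
The per-turn rates in the table above tie back to the J2 per-message scores. As a consistency sketch, averaging the ten per-turn failure rates and subtracting from 100 % approximately recovers the per-message success rate. This assumes each turn contributes the same number of messages, which may not hold exactly (non-integer rates such as 78.8 % hint at slightly different counts). The function names are illustrative.

```python
# Per-turn J2 failure rates (%) for turns T1..T10, from the table above.
per_turn_failure = {
    "Claude Opus 4.7": [0.0, 1.0, 23.0, 69.0, 53.0, 38.0, 83.0, 78.8, 64.0, 50.5],
    "Claude Opus 4.8": [0.0, 63.0, 67.0, 82.0, 69.0, 44.0, 90.0, 87.0, 66.0, 60.0],
}


def per_message_success(turn_failures):
    """Approximate J2 per-message success (%), assuming equal counts per turn."""
    mean_failure = sum(turn_failures) / len(turn_failures)
    return 100.0 - mean_failure


def first_turn_at_or_above(turn_failures, threshold):
    """1-based index of the first turn whose failure rate reaches threshold."""
    for turn, rate in enumerate(turn_failures, start=1):
        if rate >= threshold:
            return turn
    return None


for model, rates in per_turn_failure.items():
    print(
        f"{model}: per-message success {per_message_success(rates):.1f} %, "
        f"first turn with >= 50 % failures: T{first_turn_at_or_above(rates, 50.0)}"
    )
```

Under this approximation the averages come out at 54.0 % for Opus 4.7 and 37.2 % for Opus 4.8, matching the J2 per-message scores in the comparison table. The first turn where at least half of the messages fail moves from T4 to T2.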

The final-answer leaks also move earlier. Opus 4.7 leaks mostly at turns 8 and 9; Opus 4.8 starts leaking by turn 7 and has more than twice as many leaks at turn 9.

Model	T7 leaks	T8 leaks	T9 leaks	T10 leaks	Total leaks
Claude Opus 4.7	0	3	7	3	13
Claude Opus 4.8	4	7	17	6	34

5. Behavioural Analysis: Gemini 2.5 Pro vs Mimer 1.1 LoRA

Gemini and Mimer fail in different ways. Gemini is strong at refusing the final answer and grounding itself in the target equation, but it often demonstrates small steps. Mimer holds up better over the first six turns, then often capitulates when the student explicitly asks to be shown how to calculate.

Metric	Gemini 2.5 Pro	Mimer 1.1 LoRA
Avg assistant message length	364 chars	336 chars
Messages containing a question	100.0 %	99.0 %
Avg `?` per message	1.26	1.43
Final-answer leaks	0	10
J2 per-message failures	250	225
J2 per-conversation failures	98	71

Question-density is not a proxy for Socratic behaviour. Both models ask questions almost constantly. The key distinction is whether the model performs the student's cognitive work before asking.

5.1 Per-Turn Failure Rate

Per-turn J2 failure rate answers: on this specific turn, how often did the assistant compute a step? Values are percentages.

Turn	T1	T2	T3	T4	T5	T6	T7	T8	T9	T10
Gemini	0	0	3	29	19	54	44	61	19	23
Mimer	1	4	7	10	9	9	61	32	28	64

Figure: Per-turn J2 failure rate (SocratesBench).

Mimer has a visible turn-7 cliff. The trigger is the student's request: "kan du visa hur man räknar" ("can you show me how to calculate"). Mimer often responds by identifying $p=8$ and $q=12$, writing the pq-formula, and then asking a follow-up question. The question is Socratic in surface form, but the work has already been done.

Gemini's largest per-turn spike is later and broader. It frequently explains how the equation matches $x^2 + px + q = 0$, or identifies one of the relevant values, before asking the next question. That keeps the conversation fluent but fails the strict Judge-2 rule.

5.2 Cumulative Conversation Failure

The cumulative view is often more useful for product decisions. It asks: by turn T, what percentage of conversations have already had at least one J2 failure?

Turn	T1	T2	T3	T4	T5	T6	T7	T8	T9	T10
Gemini	0	0	3	31	41	73	90	98	98	99
Mimer	1	5	9	12	12	13	61	62	62	71

Figure: Cumulative J2 failure by turn (SocratesBench).

This view explains the gap between per-message and per-conversation scoring. Gemini's per-message score is respectable at 74.8 %, but its failures are distributed across nearly every conversation. By turn 8, 98.0 % of Gemini conversations already contain at least one computed step. Mimer's failures are more concentrated: most of the damage appears at turn 7, and the cumulative failure rate reaches 71.0 % by the final turn.
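
The relationship between the per-turn and cumulative views can be sketched in a few lines. The flags below are hypothetical toy data, not benchmark output, and the function names are illustrative.

```python
def per_turn_failure(conversations):
    """Percent of conversations whose message at each turn failed."""
    turns = len(conversations[0])
    total = len(conversations)
    return [
        100.0 * sum(conv[t] for conv in conversations) / total
        for t in range(turns)
    ]


def cumulative_failure(conversations):
    """Percent of conversations with at least one failure by each turn."""
    turns = len(conversations[0])
    total = len(conversations)
    return [
        100.0 * sum(any(conv[:t]) for conv in conversations) / total
        for t in range(1, turns + 1)
    ]


# Hypothetical flags for four 4-turn conversations (True = failing message).
toy = [
    [False, False, True, False],
    [False, False, True, True],
    [False, True, False, False],
    [False, False, False, False],
]
print(per_turn_failure(toy))    # [0.0, 25.0, 50.0, 25.0]
print(cumulative_failure(toy))  # [0.0, 25.0, 75.0, 75.0]
```

The cumulative curve never decreases and is always at least as high as the per-turn rate at the same turn. A model whose failures recur in the same conversations plateaus early, while a model whose failures hit fresh conversations each turn keeps climbing.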

5.3 Analysis

The Mimer LoRA appears to have learned a threshold policy: after enough repeated student confusion, demonstrate. That is close to how a human tutor may behave, but it is too permissive for the strict SocratesBench objective.

Gemini has the opposite tendency. It does not capitulate by revealing the final answer, and it remains grounded in the target task, but it teaches by showing too much of the structure. In a classroom setting, identifying $p$ or $q$ may be reasonable scaffolding; in this benchmark, it is counted as doing the student's step.

6. Discussion

6.1 Product Implications

For a production tutoring deployment where pedagogical discipline is non-negotiable:

Gemini 2.5 Pro is the weighted benchmark leader and the strongest answer-safety reference. It is the only evaluated model that never reveals the final answer, and it is strongly task-grounded, but it is not reliable at preserving intermediate reasoning across a full conversation.
Mimer 1.1 LoRA is the strongest fine-tuned product candidate. It is cost-effective, curriculum-compliant, and stronger than the base Qwen model by a wide margin.
Per-conversation metrics should drive release decisions. Per-message scores hide low-rate failures that become almost guaranteed across a 10-turn tutoring session.

6.2 Training Data Implications

The next fine-tuning round should target the specific failure pattern surfaced by the turn-level analysis:

Add more examples where the student repeatedly says they do not understand, but the tutor still asks for the student's next micro-step rather than demonstrating.
Add hard negative examples for turn-7 and turn-10 pressure phrases such as "kan du visa hur man räknar" ("can you show me how to calculate") and "kan du bara räkna ut det åt mig" ("can you just calculate it for me").
Include curriculum-reference injection during training, so the model learns to use theory as a constraint rather than as a script to recite.

6.3 Benchmark Implications

SocratesBench is a negative-action benchmark: success means not doing a natural helpful thing. Three design choices matter:

Per-conversation scoring is mandatory. A tutor with one bad message in a ten-turn session has still failed the session.
Transcript-quality checks are mandatory. Empty, off-task, or malformed assistant messages must not be allowed to look like pedagogical success.
Turn-level analysis is more diagnostic than aggregate score alone. It reveals whether a model fails early, fails under pressure, or sprinkles small failures throughout the conversation.

7. Limitations

Single task. The benchmark uses one quadratic equation. Generalization to other math domains, school subjects, and student personas is untested.
Single judge setup. All verdicts come from one judging setup. Future work should triangulate against additional independent evaluation setups.
Static student. The simulated student follows a fixed 10-turn script. This makes the benchmark deterministic, but real students are more varied.
Strict Judge-2 definition. SocratesBench treats identifying $p=8$ or $q=12$ as a computed step. Some human teachers may consider such moves acceptable scaffolding.
Swedish-only. The benchmark and all judges run in Swedish. Cross-lingual transfer remains unmeasured.

8. Conclusion

SocratesBench shows that tutoring alignment is not the same as general capability. With a 3x J1 threshold weight for models that reach at least 99 % answer-withholding success, Gemini 2.5 Pro leads the benchmark at 82.3 % composite. It is the only model that never gives away the final answer, but its cumulative Judge-2 curve shows that almost every conversation eventually contains a computed step. Mimer 1.1 LoRA remains the strongest fine-tuned model at 79.3 % composite.

The practical lesson is clear: production tutoring models need to be optimized for restraint, not just correctness. Fine-tuning can move a base model far in that direction, but the remaining failure modes are conversational and pressure-dependent. The next iteration should train directly on those pressure points.

Appendix

The full judge prompts, reproducibility details, repository access, and supporting material are available on request. If you would like a copy - or want to discuss anything in this paper - write to karlflintberg@mappi.ai.
