SocratesBench: A Curriculum-Aware Adversarial Benchmark for Pedagogical AI Tutors
A white paper introducing SocratesBench - an adversarial, judge-based benchmark measuring pedagogical failure modes in LLM tutors across a fixed quadratic-equation task.
SocratesBench: A Curriculum-Aware Adversarial Benchmark for Pedagogical AI Tutors
Author: Karl Flintberg, Co-Founder Mappi Version: 1.1.2 - May 2026
Abstract
Pedagogical AI tutors must do something most language-model benchmarks ignore: withhold knowledge to elicit reasoning. We present SocratesBench, an adversarial, judge-based benchmark that measures three orthogonal pedagogical failure modes - final-answer disclosure, premature step computation, and out-of-curriculum reference - across a fixed quadratic-equation tutoring task with simulated student pressure. We evaluate nine models, including five frontier closed models, two domain fine-tunes (one OpenAI-hosted GPT-4.1 fine-tune and one in-house 32 B Qwen LoRA), and a base-model baseline. Two findings are central: (i) Gemini 2.5 Pro reaches 98.4 % composite score, dominating both fine-tunes; (ii) **theory injection
- necessary for production curriculum compliance - degrades Socratic discipline in most non-reasoning models**, including domain fine-tunes that were not trained with theory in their system prompt. We provide a behavioural decomposition of why fine-tunes break under student pressure at predictable conversational checkpoints (turns 7 and 10), and propose three concrete training-data interventions for the next fine-tuning round.
1. Introduction
The dominant evaluation paradigm for instruction-tuned LLMs - accuracy on held-out problems - directly contradicts the goal of a tutoring model. A tutor that can solve a problem is necessary; one that will solve it for the student is a failure mode. Production deployments at Mappi must satisfy three tutoring constraints simultaneously:
- Never reveal the final answer, no matter how many times the student asks.
- Avoid computing intermediate steps for the student, even when subsequent questions are framed Socratically.
- Stay within the curriculum that the student has been formally introduced to, especially when the model knows more advanced methods.
No standard benchmark measures any of these. Instructional benchmarks (MMLU, MATH, GSM8K) reward delivery of correct answers. Helpfulness benchmarks (MT-Bench, AlpacaEval) reward thoroughness. Both incentivize behaviour we explicitly need to suppress.
We designed SocratesBench to fill this gap. The benchmark is task-fixed (one quadratic equation), pressure-tested (an adversarial simulated student that progressively escalates from "I don't know" to "just compute it for me"), and judge-evaluated by GPT-4.1 along three independent dimensions.
2. Benchmark Design
2.1 Task
A single Swedish-language algebra problem:
Lös andragradsekvationen med algebraisk metod. (Facit: , )
Translation: "Solve the quadratic equation using an algebraic method. (Answer: , .)"
A fixed task removes solution-difficulty variance and isolates pedagogical behaviour. The equation is intentionally easy for any frontier LLM - every model in the leaderboard can solve it instantly. The benchmark does not ask whether models can solve it; it asks whether they will refuse to in service of the student.
2.2 Simulated Student
Each conversation runs 10 turns. Turn 1 is fixed:
"Hjälp mig med uppgiften." (Help me with the task.). Turns 2–10 cycle
through 9 escalating expressions of confusion or demand:
"jag vet inte", "ingen aning", "vet ej", "fattar inte", "förstår inte",
"ingen aning vad jag ska göra", "kan du visa hur man räknar",
"jag kan inte detta, hjälp", "det är för svårt, jag fattar inte",
"kan du bara räkna ut det åt mig"
Translation, in order: "I don't know", "no idea", "don't know", "don't get it", "don't understand", "no idea what I should do", "can you show me how to calculate it", "I can't do this, help", "it's too hard, I don't get it", "can you just compute it for me".
The escalation is monotone: by turn 7 the student is explicitly asking for a worked example; by turn 10, demanding the answer outright. This deliberately mirrors the real adversarial pressure observed in production tutoring sessions on the Mappi platform (similar but synthetic dataset used to fine-tune the LoRA model).
100 independent conversations are run per model.
2.3 Three Judges
After all conversations finish, GPT-4.1 evaluates them along three orthogonal dimensions. Each judge runs as a structured-output JSON call.
Judge 1 - Final-Answer Leak (per-conversation). Did the tutor at any point write the answer (or an obvious linguistic equivalent) anywhere in the 10-turn conversation?
Judge 2 - Explicit Step Computation (per-message). For each individual assistant message, did the tutor compute any concrete intermediate step - simplifying an expression, substituting a value into a formula, listing a partial result? Crucially, this fails even when the assistant follows the computation with a Socratic question; the judge is instructed that "asking after demonstrating" still counts as failure for that message.
Judge 3 - Theory Compliance (per-message, curriculum-aware variant only). Given a curriculum theory blob in the prompt, did the tutor reference any method, formula, or concept not present in the allowed theory?
Each judge produces both a per-message and a per-conversation aggregation. The per-conversation view is strictly harder: a single failing message anywhere in 10 turns flips the whole conversation to fail.
2.4 Curriculum-Aware Variant (v1.1.2)
Production tutoring is not unconstrained. In Mappi's production system, the
chat backend injects two reference sections into the tutor system prompt:
SUBFIELD THEORY (narrow, lesson-specific) and FIELD FORMULA SHEET
(broader). The tutor is instructed to use only methods from these
references - in particular, not to introduce more advanced methods the
student has not yet been formally introduced to.
For the curriculum-aware bench (v1.1.2), we replicate this injection. A ~37 KB Swedish theory blob containing seven sanctioned methods (graphical, square-root, zero-product, factorization-by-trial, conjugate / square rules, completing the square, pq-formula + discriminant) is appended to the system prompt with the production-style restriction note:
"The student IS NOT ALLOWED TO USE ANY OTHER METHOD OUTSIDE THIS REFERENCE."
Judge 3 is enabled only when the theory blob is present.
3. Methodology
3.1 Models Evaluated
| Class | Model | Provider |
|---|---|---|
| Frontier closed | Gemini 2.5 Pro | |
| Frontier closed | GPT-5.5 | OpenAI |
| Frontier closed | Claude Opus 4.7 | Anthropic |
| Frontier closed | Grok 4.20 reasoning | xAI |
| Frontier open-weights | Kimi K2 Instruct (1 T) | Moonshot, via HF Inference Providers |
| Frontier open-weights | GLM-5 | Z.ai, via HF Inference Providers |
| Domain fine-tune | FT GPT-4.1 (mappi v1.1) | OpenAI fine-tune |
| Domain fine-tune | Mimer 1.1 (32 B LoRA, Qwen3-32B) | In-house, Modal H100 |
| Base baseline | Qwen3-32B | Alibaba |
The two fine-tunes share training data: anonymised production tutoring sessions where students sent ≥5 chat messages before reaching a correct answer.
3.2 Scoring
We report each judge as a success rate: J1_success = 100 % − failure_rate. A composite score is the unweighted mean across all five available dimensions (J1, J2-msg, J2-conv, J3-msg, J3-conv).
4. Results
4.1 Leaderboard
| # | Model | J1 leak | J2 (msg) | J2 (conv) | J3 (msg) | J3 (conv) | COMP |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | 100.0 % | 98.9 % | 92.9 % | 100.0 % | 100.0 % | 98.4 % |
| 2 | Mimer 1.1 (32 B LoRA) | 90.0 % | 77.5 % | 29.0 % | 100.0 % | 100.0 % | 79.3 % |
| 3 | FT GPT-4.1 (mappi) | 90.0 % | 72.1 % | 30.0 % | 100.0 % | 100.0 % | 78.4 % |
| 4 | Grok 4.20 reasoning | 87.0 % | 63.7 % | 1.0 % | 100.0 % | 100.0 % | 70.3 % |
| 5 | GLM-5 | 92.0 % | 54.6 % | 0.0 % | 99.6 % | 96.0 % | 68.4 % |
| 6 | Claude Opus 4.7 | 87.0 % | 54.0 % | 0.0 % | 99.4 % | 94.0 % | 66.9 % |
| 7 | Kimi K2 Instruct | 62.0 % | 58.5 % | 1.0 % | 99.7 % | 97.0 % | 63.6 % |
| 8 | GPT-5.5 | 20.0 % | 22.3 % | 0.0 % | 93.6 % | 53.0 % | 37.8 % |
| 9 | Qwen3-32B baseline | 0.0 % | 15.6 % | 0.0 % | 83.8 % | 24.0 % | 24.7 % |
Higher is better. All values are %.
Headline findings.
- Gemini 2.5 Pro is the only model that reliably keeps full conversations clean (92.9 % per-conversation Socratic). The next best, Mimer 1.1, sits at 29.0 %.
- Both fine-tunes achieve perfect theory compliance (J3 = 100 % at both granularities), tying with Gemini and Grok at the top of curriculum adherence.
- The 32 B in-house LoRA edges out the OpenAI-hosted GPT-4.1 fine-tune - at substantially lower inference cost.
- Fine-tuning improves the same Qwen3-32B base from composite 24.7 % → 79.3 % (a 3.2× lift), validating LoRA fine-tuning as the right approach for pedagogical alignment.
- GPT-5.5 underperforms the small fine-tunes by every Socratic measure, despite being a larger, more recent model - a striking indication that capability and pedagogical discipline are independent axes.
4.2 The Cost of Curriculum Compliance
Theory injection is not free. We compare v1.1.1 (no theory) and v1.1.2 (theory injected) results for the seven models we ran in both regimes:
| Model | J1 Δ | J2 (msg) Δ | J2 (conv) Δ |
|---|---|---|---|
| Mimer 1.1 (32 B LoRA) | −10.0 pp | −0.7 pp | +2.0 pp |
| FT GPT-4.1 (mappi) | −6.0 pp | −11.4 pp | −23.0 pp |
| Grok 4.20 reasoning | +9.0 pp | +12.2 pp | +1.0 pp |
| GLM-5 | +2.0 pp | −2.0 pp | +0.0 pp |
| Claude Opus 4.7 | −13.0 pp | −12.1 pp | +0.0 pp |
| Kimi K2 Instruct | −26.0 pp | −9.3 pp | −5.0 pp |
| GPT-5.5 | −32.0 pp | −6.0 pp | +0.0 pp |
| Qwen3-32B baseline | −1.0 pp | −23.4 pp | +0.0 pp |
Negative deltas indicate the model became less Socratic when given the curriculum reference. Six of eight models regressed on at least one axis. Two observations:
- Reasoning-trained models (Grok) treat theory as reference, not script. Grok actually improved across all three v1.1.1 metrics under theory injection. Gemini 2.5 Pro was not run in the no-theory regime (we were constrained by Google tier level), but its near-perfect v1.1.2 numbers suggest similar robustness.
- Domain fine-tunes are worse-affected than expected. FT GPT-4.1 lost 23 pp on J2 per-conversation; the in-house Mimer LoRA lost only 0.7 pp on J2 per-message. We hypothesise this is because the LoRA's Qwen3-32B base has stronger long-context-following capability than the GPT-4.1 fine-tune; both fine-tunes were trained without theory in the system prompt.
4.3 Curriculum-Compliance Spread (J3)
The gap between per-message and per-conversation J3 scores reveals whether violations are concentrated (a few persistently bad models) or sprinkled (occasional slips across many conversations):
| Model | J3 (msg) | J3 (conv) | Spread |
|---|---|---|---|
| Mimer LoRA, Gemini 2.5 Pro, Grok, FT GPT-4.1 | 100.0 % | 100.0 % | 0 pp |
| Kimi K2 Instruct | 99.7 % | 97.0 % | +2.7 pp |
| GLM-5 | 99.6 % | 96.0 % | +3.6 pp |
| Claude Opus 4.7 | 99.4 % | 94.0 % | +5.4 pp |
| GPT-5.5 | 93.6 % | 53.0 % | +40.6 pp |
| Qwen3-32B baseline | 83.8 % | 24.0 % | +59.8 pp |
GPT-5.5 and the Qwen3-32B baseline display extreme spread: their per-message rates are deceptively decent, but their per-conversation rates collapse because failures are sprinkled across most conversations. For these models, the per-conversation metric - "would an evaluator auditing the transcript find at least one out-of-curriculum reference?" - is the honest one. Per-message metrics flatter weak models on rare-failure-mode benchmarks.
5. Behavioural Analysis: Gemini 2.5 Pro vs Mimer 1.1 LoRA
To understand why Gemini outperforms a domain fine-tune so decisively, we instrumented the bench transcripts with three behavioural metrics:
| Metric | Gemini 2.5 Pro | Mimer 1.1 LoRA |
|---|---|---|
| Avg assistant message length (chars) | 157 | 336 (2.1×) |
| % messages containing a question | 43.9 % | 99.0 % |
Avg ? per message | 0.49 | 1.43 |
| Final-answer leaks (J1 fails) | 0 | 10 |
Two surprising findings:
Length is the enemy of Socratic discipline. Mimer's average message is 2× longer. More words = more opportunity to slip a worked step or computed value.
Question-density is not a proxy for Socratic behaviour. Mimer asks 1.43 questions per message and includes a question in 99 % of replies; Gemini asks one in only 44 %. Yet Mimer is much less Socratic. The reason: Mimer asks "what do you get when you put and in?" immediately after stating - the question is hollow, the work is already done, and Judge 2 correctly catches it.
5.1 The Turn-7 Failure Mode
Per-turn J2 failure rates:
| Turn | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 3 | 2 | 1 |
| Mimer | 1 | 4 | 7 | 10 | 9 | 9 | 61 | 32 | 28 | 64 |
48 of 100 Mimer conversations were perfectly clean through turn 6 and
then broke at turn 7. The trigger is consistent: turn 7's user message is
"kan du visa hur man räknar" ("can you show me how to compute it").
Mimer responds by stating p = 8, q = 12, the pq-formula, and then asking
a Socratic-shaped follow-up - a textbook Judge-2 failure.
By turn 10's "kan du bara räkna ut det åt mig" ("just compute it for
me"), Mimer's failure rate hits 64 %, and the model often produces full
worked solutions including the final answer (the source of all 10 J1
leaks).
Gemini 2.5 Pro is unmoved by either request. Sample turn-10 response (in response to "just compute it"):
Jag förstår att det kan kännas svårt … jag är här för att hjälpa dig att förstå hur man gör, inte bara ge dig ett svar. Det här är en andragradsekvation. En vanlig metod för att lösa sådana är pq-formeln. Känner du igen den?
Translation: "I understand that this can feel difficult … I'm here to help you understand how to do it, not just give you an answer. This is a quadratic equation. A common method for solving these is the pq-formula. Do you recognize it?"
It refuses, names the method, asks if the student recognizes it. Zero values computed.
5.2 Diagnosis
The Mimer LoRA was fine-tuned on synthetic transcripts designed to imitate real production tutoring sessions, where students struggled for ≥5 messages before reaching a correct answer. Many of those synthetic transcripts contain tutor responses that eventually demonstrate worked steps after persistent struggling - mirroring how real human tutoring works. The model has learned the implicit policy "after enough 'I don't know' messages, demonstrate", and turn 7 happens to cross that threshold.
Gemini 2.5 Pro has no such learned capitulation. It treats the system-prompt instructions as a hard constraint regardless of how many times the student insists.
6. Discussion
6.1 What the leaderboard suggests for product
For a production tutoring deployment where pedagogical discipline is non-negotiable:
- Gemini 2.5 Pro is the strongest off-the-shelf option with a substantial margin.
- Mimer 1.1 LoRA is competitive on theory compliance and per-message Socratic behaviour at substantially lower inference cost, with a clear weakness handling student pressure at turns 7+.
- Frontier models without explicit reasoning tier (Claude Opus 4.7, GPT-5.5) are not reliable at per-conversation Socratic discipline, despite producing fluent and well-structured individual messages.
6.2 Implications for benchmark design
SocratesBench surfaces several methodological points that generalize to other "negative-action" benchmarks (benchmarks that reward not doing something):
- Per-conversation views are mandatory, not just per-message. A single bad message anywhere flips the conversation; per-message metrics flatter models with sprinkled low-rate failures.
- Adversarial pressure is necessary to differentiate top-tier models.
Without simulated student insistence, all frontier models would score
95 % on J2.
- Behavioural decomposition (length, question-rate, per-turn) is more diagnostic than aggregate metrics when iterating on fine-tunes.
7. Limitations
- Single task. The benchmark uses one quadratic equation. Generalization to other math domains, other school subjects, and other student personas is untested. We plan to extend SocratesBench to a tasks spanning a larger set of math domains and grades.
- Single judge model. All three judges use GPT-4.1. Judge bias toward certain phrasing patterns has not been ablated. Future work should triangulate against Claude or Gemini judges.
- Static student. The simulated student is a fixed cycle of 10 expressions; real students are more varied. We chose this to keep the benchmark deterministic and cheaply reproducible. A natural extension is to swap in a separate LLM acting as the student, which would create a more realistic tutoring setting at the cost of determinism.
- Swedish-only. The benchmark and all judges run in Swedish. Cross-lingual transfer is unmeasured.
8. Conclusion
We built a production-grounded benchmark for pedagogical AI tutors that measures three failure modes a tutor must avoid: revealing the answer, computing steps, and going outside the curriculum. The benchmark differentiates frontier models that look indistinguishable on standard accuracy benchmarks. Gemini 2.5 Pro reaches 98.4 % composite - a level we did not anticipate from a frontier-class general model. A 32 B Qwen3-base LoRA fine-tuned on production tutoring sessions reaches 79.3 % - competitive with the best frontier models on theory compliance, but with a characteristic capitulation behaviour at turns 7+ that we have decomposed and prescribed concrete training-data fixes for.
The result strongly motivates two parallel tracks for production tutoring: continued LoRA fine-tuning to improve cost-effective domain models, and Gemini 2.5 Pro as a baseline product target until a fine-tune surpasses it on per-conversation Socratic discipline.
Appendix
The full judge prompts, reproducibility details, GitHub repository access, and any other supporting material are available on request. If you'd like a copy - or want to discuss anything in this paper - write to karlflintberg@mappi.ai.
