A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
MathTutorBench is a benchmark that provides a unified framework for evaluating the open-ended pedagogical capabilities of large language model (LLM) tutors across three high-level teacher skills and seven concrete tasks.
Model | Problem Solving | Socratic Questioning | Solution Correctness | Mistake Location | Mistake Correction | Scaffolding Win Rate | Pedagogy IF Win Rate | Scaffolding (Hard) | Pedagogy IF (Hard) |
---|---|---|---|---|---|---|---|---|---|
LLaMA3.2-3B-Instruct | 0.60 | 0.29 | 0.67 | 0.41 | 0.13 | 0.64 | 0.63 | 0.45 | 0.40 |
LLaMA3.1-8B-Instruct | 0.70 | 0.29 | 0.63 | 0.29 | 0.09 | 0.61 | 0.67 | 0.46 | 0.49 |
LLaMA3.1-70B-Instruct | 0.91 | 0.29 | 0.71 | 0.56 | 0.19 | 0.63 | 0.70 | 0.49 | 0.49 |
GPT-4o | 0.90 | 0.48 | 0.67 | 0.37 | 0.84 | 0.50 | 0.82 | 0.46 | 0.70 |
LearnLM-1.5-Pro | 0.94 | 0.32 | 0.75 | 0.57 | 0.74 | 0.64 | 0.68 | 0.66 | 0.67 |
Llemma-7B-ScienceTutor | 0.62 | 0.29 | 0.66 | 0.29 | 0.16 | 0.37 | 0.48 | 0.38 | 0.42 |
Qwen2.5-7B-SocraticLM | 0.73 | 0.32 | 0.05 | 0.39 | 0.23 | 0.39 | 0.39 | 0.28 | 0.28 |
Qwen2.5-Math-7B-Instruct | 0.88 | 0.35 | 0.43 | 0.47 | 0.49 | 0.06 | 0.07 | 0.05 | 0.05 |
For more details on how to run your model locally using vLLM, see the vLLM documentation.
vllm serve [[model_name]]
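For example, a concrete invocation using the Llama model from the commands below (any other Hugging Face model ID can be substituted; this exact choice is just an illustration):

# serves an OpenAI-compatible endpoint, by default at http://localhost:8000/v1
vllm serve meta-llama/Llama-3.2-3B-Instruct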
# Example with the OpenAI API
python main.py --tasks mistake_location.yaml --provider completion_api --model_args model=gpt-4o-mini-2024-07-18,is_chat=True,api_key=<API_KEY>

# Example with a vLLM model
python main.py --tasks mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True
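The --tasks flag also accepts a comma-separated list to run several tasks sequentially, and a gemini provider is available as well (both are documented in the argument reference below). Two hypothetical sketches, assuming the same local vLLM server as above and a placeholder Gemini model name and API key:

# Hypothetical example: two tasks chained against the local vLLM model
python main.py --tasks problem_solving.yaml,socratic_questioning.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True

# Hypothetical example with the Gemini API (model name and API key are placeholders; exact arguments may differ)
python main.py --tasks mistake_location.yaml --provider gemini --model_args model=gemini-1.5-pro,api_key=<API_KEY>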
--tasks
Task definition file in the configs folder. Use a comma-separated list to run multiple tasks sequentially.
- problem_solving.yaml: Task definition for problem solving.
- socratic_questioning.yaml: Task definition for Socratic questioning.
- student_solution_generation.yaml: Task definition for student solution generation.
- mistake_location.yaml: Task definition for mistake location.
- mistake_correction.yaml: Task definition for mistake correction.
- scaffolding_generation.yaml: Task definition for scaffolding generation.
- pedagogy_following.yaml: Task definition for pedagogy following.
- scaffolding_generation_hard.yaml: Task definition for scaffolding generation (hard).
- pedagogy_following_hard.yaml: Task definition for pedagogy following (hard).

--provider
API provider to use for the task.
- completion_api: Use the completion API for the task. Supports any OpenAI-compatible API; use it for OpenAI and vLLM models.
- gemini: Use the Gemini API for the task.

--model_args
Model arguments to pass to the API provider.
- base_url: Base URL of the API provider. Leave empty for OpenAI and Gemini.
- model: Model name to use for the task. Default is the first available model.
- api_key: API key to access the API. Leave empty for vLLM models.
- is_chat: Whether the model is chat-based or not. Default is False.
- temperature: Temperature for sampling. Default is 0.0.
- max_tokens: Maximum tokens to generate. Default is 2048.
- max_retries: Maximum retries for the API. Default is 3.

To compute the scaffolding score of saved generations with the reward model, run:
python reward_models/compute_scaffolding_score.py --data_path results/generations-specific-model.json
To visualize the results, run:
python visualize.py --results_dir results/
To install the dependencies, run:
pip install -r requirements.txt
To submit your model to the leaderboard, please open a new PR that provides the task configuration in the configs folder and the task implementation in the tasks folder.
@article{macina2025mathtutorbench,
title={MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors},
author={Jakub Macina and Nico Daheim and Ido Hakimi and Manu Kapur and Iryna Gurevych and Mrinmaya Sachan},
year={2025},
eprint={2502.18940},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18940},
}