A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
MathTutorBench is a benchmark that provides a unified framework for evaluating the open-ended pedagogical capabilities of large language model (LLM) tutors across three high-level teacher skills and seven concrete tasks.
Model | Problem Solving | Socratic Questioning | Solution Correctness | Mistake Location | Mistake Correction | Scaffolding Win Rate | Pedagogy IF Win Rate | Scaffolding (Hard) | Pedagogy IF (Hard) |
---|---|---|---|---|---|---|---|---|---|
LLaMA3.2-3B-Instruct | 0.60 | 0.29 | 0.67 | 0.41 | 0.13 | 0.64 | 0.63 | 0.45 | 0.40 |
LLaMA3.1-8B-Instruct | 0.70 | 0.29 | 0.63 | 0.29 | 0.09 | 0.61 | 0.67 | 0.46 | 0.49 |
LLaMA3.1-70B-Instruct | 0.91 | 0.29 | 0.71 | 0.56 | 0.19 | 0.63 | 0.70 | 0.49 | 0.49 |
GPT-4o | 0.90 | 0.48 | 0.67 | 0.37 | 0.84 | 0.50 | 0.82 | 0.46 | 0.70 |
LearnLM-1.5-Pro | 0.94 | 0.32 | 0.75 | 0.57 | 0.74 | 0.64 | 0.68 | 0.66 | 0.67 |
Llemma-7B-ScienceTutor | 0.62 | 0.29 | 0.66 | 0.29 | 0.16 | 0.37 | 0.48 | 0.38 | 0.42 |
Qwen2.5-7B-SocraticLM | 0.73 | 0.32 | 0.05 | 0.39 | 0.23 | 0.39 | 0.39 | 0.28 | 0.28 |
Qwen2.5-Math-7B-Instruct | 0.88 | 0.35 | 0.43 | 0.47 | 0.49 | 0.06 | 0.07 | 0.05 | 0.05 |
For more details on how to run your model locally using vLLM, see the vLLM documentation.
vllm serve [[model_name]]
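For example, a concrete invocation using the Llama model from the commands below (any other Hugging Face model ID can be substituted; this exact choice is just an illustration):

# serves an OpenAI-compatible endpoint, by default at http://localhost:8000/v1
vllm serve meta-llama/Llama-3.2-3B-Instruct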
# Example with the OpenAI API
python main.py --tasks mistake_location.yaml --provider completion_api --model_args model=gpt-4o-mini-2024-07-18,is_chat=True,api_key=<API_KEY>

# Example with a vLLM model
python main.py --tasks mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True
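The --tasks flag also accepts a comma-separated list to run several tasks sequentially, and a gemini provider is available as well (both are documented in the argument reference below). Two hypothetical sketches, assuming the same local vLLM server as above and a placeholder Gemini model name and API key:

# Hypothetical example: two tasks chained against the local vLLM model
python main.py --tasks problem_solving.yaml,socratic_questioning.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True

# Hypothetical example with the Gemini API (model name and API key are placeholders; exact arguments may differ)
python main.py --tasks mistake_location.yaml --provider gemini --model_args model=gemini-1.5-pro,api_key=<API_KEY>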
--tasks
Task definition file in the configs folder. Use a comma-separated list to run multiple tasks sequentially.
- problem_solving.yaml: Task definition for problem solving.
- socratic_questioning.yaml: Task definition for Socratic questioning.
- student_solution_generation.yaml: Task definition for student solution generation.
- mistake_location.yaml: Task definition for mistake location.
- mistake_correction.yaml: Task definition for mistake correction.
- scaffolding_generation.yaml: Task definition for scaffolding generation.
- pedagogy_following.yaml: Task definition for pedagogy following.
- scaffolding_generation_hard.yaml: Task definition for scaffolding generation (hard).
- pedagogy_following_hard.yaml: Task definition for pedagogy following (hard).

--provider
API provider to use for the task.
- completion_api: Use the completion API for the task. Supports any OpenAI-compatible API; use it for OpenAI and vLLM models.
- gemini: Use the Gemini API for the task.

--model_args
Model arguments to pass to the API provider.
- base_url: Base URL of the API provider. Leave empty for OpenAI and Gemini.
- model: Model name to use for the task. Default is the first available model.
- api_key: API key to access the API. Leave empty for vLLM models.
- is_chat: Whether the model is chat-based or not. Default is False.
- temperature: Temperature for sampling. Default is 0.0.
- max_tokens: Maximum tokens to generate. Default is 2048.
- max_retries: Maximum retries for the API. Default is 3.

To compute the scaffolding score of saved generations with the reward model, run:
python reward_models/compute_scaffolding_score.py --data_path results/generations-specific-model.json
To visualize the results, run:
python visualize.py --results_dir results/
To install the dependencies, run:
pip install -r requirements.txt
To submit your model to the leaderboard, please open a new PR that provides the task configuration in the configs folder and the task implementation in the tasks folder.
@article{macina2025mathtutorbench,
title={MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors},
author={Jakub Macina and Nico Daheim and Ido Hakimi and Manu Kapur and Iryna Gurevych and Mrinmaya Sachan},
year={2025},
eprint={2502.18940},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18940},
}