MathTutorBench

A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors

Overview

MathTutorBench is a benchmark that provides a unified framework for evaluating the open-ended pedagogical capabilities of large language model (LLM) tutors across three high-level teacher skills and seven concrete tasks.

Skills Overview

Key Features

Leaderboard

| Model | Problem Solving | Socratic Questioning | Solution Correctness | Mistake Location | Mistake Correction | Scaffolding Win Rate | Pedagogy IF Win Rate | Scaffolding (Hard) | Pedagogy IF (Hard) |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA3.2-3B-Instruct | 0.60 | 0.29 | 0.67 | 0.41 | 0.13 | 0.64 | 0.63 | 0.45 | 0.40 |
| LLaMA3.1-8B-Instruct | 0.70 | 0.29 | 0.63 | 0.29 | 0.09 | 0.61 | 0.67 | 0.46 | 0.49 |
| LLaMA3.1-70B-Instruct | 0.91 | 0.29 | 0.71 | 0.56 | 0.19 | 0.63 | 0.70 | 0.49 | 0.49 |
| GPT-4o | 0.90 | 0.48 | 0.67 | 0.37 | 0.84 | 0.50 | 0.82 | 0.46 | 0.70 |
| LearnLM-1.5-Pro | 0.94 | 0.32 | 0.75 | 0.57 | 0.74 | 0.64 | 0.68 | 0.66 | 0.67 |
| Llemma-7B-ScienceTutor | 0.62 | 0.29 | 0.66 | 0.29 | 0.16 | 0.37 | 0.48 | 0.38 | 0.42 |
| Qwen2.5-7B-SocraticLM | 0.73 | 0.32 | 0.05 | 0.39 | 0.23 | 0.39 | 0.39 | 0.28 | 0.28 |
| Qwen2.5-Math-7B-Instruct | 0.88 | 0.35 | 0.43 | 0.47 | 0.49 | 0.06 | 0.07 | 0.05 | 0.05 |

0. Run your model locally using vLLM

For more details on how to run your model locally using vLLM, see the vLLM documentation.

vllm serve <model_name>
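
For example, to serve the model used in the quick start below (any Hugging Face model name supported by vLLM works here):

vllm serve meta-llama/Llama-3.2-3B-Instruct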

1. Quick Start - Evaluate a New Model

# Example with the OpenAI API
python main.py --tasks mistake_location.yaml --provider completion_api --model_args model=gpt-4o-mini-2024-07-18,is_chat=True,api_key=<API_KEY>

# Example with a locally served vLLM model
python main.py --tasks mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True
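
The completion_api provider talks to an OpenAI-compatible endpoint, which is exactly what vllm serve exposes. Before running the benchmark, you can sanity-check that the local endpoint is reachable with a minimal sketch using the openai Python client (the model name and prompt below are illustrative, not part of the benchmark):

from openai import OpenAI

# Point the client at the local vLLM server started with `vllm serve`
# (default port 8000); vLLM accepts any placeholder API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "A student says 3/4 + 1/4 = 4/8. What question would you ask next?"}],
)
print(response.choices[0].message.content)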

Required Parameters

As shown in the examples above:

- --tasks: the task configuration file(s) to evaluate, e.g. mistake_location.yaml (see the configs folder).
- --provider: the inference provider, e.g. completion_api for any OpenAI-compatible endpoint (including a locally served vLLM model).
- --model_args: comma-separated model settings such as model, base_url, is_chat, and api_key.

2. Run the reward model for the pedagogical ability tasks (Scaffolding Score)

python reward_models/compute_scaffolding_score.py --data_path results/generations-specific-model.json

3. Visualize results

python visualize.py --results_dir results/

Results Visualization

Installation

pip install -r requirements.txt

Submit your model to the leaderboard

To submit your model to the leaderboard, please follow the steps below:

  1. Open a new issue with the title Leaderboard Submission: model-name.
  2. Provide the exact model name on the Hugging Face Hub, along with any specific code, arguments, or settings needed to run the model with the vllm library. Please also copy the results from your local run of the model.

Add a new benchmark task

Please open a new PR and provide the configuration of the task in the configs folder and the task implementation in the tasks folder.

Citation

@article{macina2025mathtutorbench,
  title={MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors},
  author={Jakub Macina and Nico Daheim and Ido Hakimi and Manu Kapur and Iryna Gurevych and Mrinmaya Sachan},
  year={2025},
  eprint={2502.18940},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.18940},
}