MixEval

Deriving Wisdom of the Crowd from LLM Benchmark Mixtures


1National University of Singapore, 2Carnegie Mellon University, 3Allen Institute for AI

*Core Contributors
†Correspondence to: Jinjie Ni <jinjieni@nus.edu.sg>

🔔News

🚀[2024-06-06]: The official evaluation suite of MixEval is released here. ⚡️You can run quick evaluations on MixEval with a minimal setup! 🤗 The procedure is the same as for other ground-truth-based benchmarks!

🚀[2024-06-05]: MixEval is released! Check out the Paper and Leaderboard to learn more about this reliable, holistic, and efficient benchmark!🌟

Introduction

Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks' advantages lie in (1) a 0.96 model ranking correlation with Chatbot Arena arising from the highly impartial query distribution and grading mechanism, (2) fast, cheap, and reproducible execution (6% of the time and cost of MMLU), and (3) dynamic evaluation enabled by the rapid and stable data update pipeline. We provide extensive meta-evaluation and analysis for our and existing LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research directions.

TL;DR: We introduce MixEval, a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% the time and cost of running MMLU), with its queries being stably and effortlessly updated every month to avoid contamination.

MixEval

What is MixEval?

MixEval is an approach that bridges the gap between real-world user queries and ground-truth-based evaluation: it mines user queries from the web and matches them with similar queries from existing benchmarks. MixEval also refers to the benchmark built with this approach.
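
As a rough illustration, the query matching can be viewed as nearest-neighbor retrieval in an embedding space. The sketch below uses the sentence-transformers library with a small off-the-shelf embedding model as a stand-in; the model choice, toy query pools, and retrieval details are illustrative assumptions, not MixEval's exact pipeline.

    # Illustrative sketch: treat benchmark mixture as nearest-neighbor retrieval.
    # For each wild (web-mined) query, find the most similar query in a pool of
    # existing ground-truth benchmarks. The embedding model and toy query pools
    # below are stand-ins, not MixEval's actual pipeline.
    from sentence_transformers import SentenceTransformer, util

    wild_queries = [
        "how many ounces are in a liter",
        "who wrote the novel 1984",
    ]
    benchmark_pool = [
        ("TriviaQA", "Which author wrote the dystopian novel Nineteen Eighty-Four?"),
        ("DROP", "How many fluid ounces are there in two liters of water?"),
        ("MMLU", "Which of the following elements is a noble gas?"),
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    wild_emb = model.encode(wild_queries, convert_to_tensor=True, normalize_embeddings=True)
    pool_emb = model.encode([q for _, q in benchmark_pool], convert_to_tensor=True,
                            normalize_embeddings=True)

    # Cosine similarity between every wild query and every benchmark query.
    sims = util.cos_sim(wild_emb, pool_emb)

    for i, wild_query in enumerate(wild_queries):
        j = int(sims[i].argmax())
        source, matched = benchmark_pool[j]
        print(f"{wild_query!r} -> {source}: {matched!r} (sim={float(sims[i][j]):.2f})")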

MixEval-Hard is the hard version of MixEval, designed to better distinguish strong models. It is sampled from MixEval based on model evaluation results, with harder queries selected with higher probability. To avoid distribution drift, we apply a rejection sampling step that keeps the distribution of MixEval-Hard aligned with that of the wild queries.
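
A minimal sketch of this idea, assuming a per-query difficulty signal (average model accuracy) and a known wild-query topic distribution; the weighting scheme, acceptance threshold, and all data below are illustrative assumptions rather than the paper's exact procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy inputs: per-query average accuracy over evaluated models (lower = harder)
    # and a topic label per query. Both are synthetic placeholders.
    n_queries = 1000
    accuracy = rng.uniform(0.1, 0.95, size=n_queries)
    topics = rng.choice(["math", "science", "entertainment", "history"],
                        size=n_queries, p=[0.2, 0.3, 0.3, 0.2])

    # Target: the topic distribution of the wild queries (assumed known).
    wild_topic_dist = {"math": 0.2, "science": 0.3, "entertainment": 0.3, "history": 0.2}

    def sample_hard_subset(k=200, tol=0.05, max_tries=100):
        """Draw k queries, favoring harder ones, and reject draws whose topic
        distribution deviates too far from the wild-query distribution."""
        weights = 1.0 - accuracy          # harder queries get larger weight
        weights = weights / weights.sum()
        for _ in range(max_tries):
            idx = rng.choice(n_queries, size=k, replace=False, p=weights)
            drift = max(abs(np.mean(topics[idx] == topic) - target)
                        for topic, target in wild_topic_dist.items())
            if drift <= tol:              # accept only distribution-matched draws
                return idx
        raise RuntimeError("no acceptable sample found; relax tol or raise max_tries")

    hard_idx = sample_hard_subset()
    print(f"selected {len(hard_idx)} hard queries")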

Dynamic evaluation is introduced to mitigate contamination. We periodically update the data points in MixEval and MixEval-Hard with our fast, stable pipeline, which performs the benchmark mixture on a different batch of wild queries drawn from the same distribution; versions show low score variance (0.36 Std. on a 0-100 scale) and substantial version-to-version novelty (85% unique queries).
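
Both reported statistics (score standard deviation across versions and the unique-query ratio between versions) are straightforward to compute; a small sketch with placeholder data:

    import statistics

    # Placeholder data: query sets of two consecutive benchmark versions and one
    # model's scores (0-100) across several versions.
    previous_version = {"q1", "q2", "q3", "q4"}
    current_version = {"q3", "q5", "q6", "q7"}
    scores_across_versions = [71.2, 70.9, 71.6]

    # Unique-query ratio of the current version relative to the previous one.
    unique_ratio = len(current_version - previous_version) / len(current_version)

    # Score standard deviation for the same model across versions.
    score_std = statistics.stdev(scores_across_versions)

    print(f"unique query ratio: {unique_ratio:.0%}, score std: {score_std:.2f}")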

Why Use MixEval Benchmarks?

MixEval offers five significant advantages for practitioners: (1) accurate model ranking, demonstrated by a 0.96 correlation with Chatbot Arena; (2) fast, cheap, and reproducible execution, requiring only 6% of the time and cost of MMLU and no human input; (3) dynamic benchmarking enabled by a low-effort, stable updating mechanism; (4) a comprehensive and less biased query distribution, since its queries are grounded in a large-scale web corpus; and (5) a fair grading process free of preference bias, ensured by its ground-truth-based nature.

How Effective is MixEval as a Benchmark Mixture Approach?

MixEval is effective as a benchmark mixture approach: (1) MixEval and MixEval-Hard achieve the highest correlation with Arena Elo and Arena Elo (En) among all benchmarks; (2) MixEval improves the correlation with Arena Elo and Arena Elo (En) across all of its main benchmark splits; (3) MixEval outperforms both benchmark-level and uniform mixtures; and (4) MixEval effectively maps real-world user queries to ground-truth-based benchmarks.
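
As an illustration of how such ranking correlations can be computed, the sketch below uses Spearman's rank correlation from scipy on placeholder scores; see the paper for the exact correlation measure and the full model set.

    from scipy.stats import spearmanr

    # Placeholder scores for a handful of models, listed in the same order.
    benchmark_scores = [68.1, 64.7, 63.5, 55.9, 43.0]
    arena_elo = [1271, 1287, 1248, 1208, 1102]

    rho, p_value = spearmanr(benchmark_scores, arena_elo)
    print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")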

Leaderboard

Dynamic Benchmark Version: 2024-06-01

We evaluate LLMs of various sizes from a range of model developers, covering both chat and base models. In this project, we focus mainly on chat models because they are more suitable for user-facing evaluations; for chat models, we consider both open-source and proprietary models. Chat models are evaluated in a zero-shot setting, assessing their ability to produce accurate answers without fine-tuning or few-shot demonstrations on our benchmark, while base models are evaluated in a 5-shot setting. For all models, we use the default generation settings provided by each model creator.
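
An illustrative sketch of the zero-shot vs. 5-shot distinction; the prompt template here is a generic assumption, not MixEval's actual template.

    def build_prompt(question, demonstrations=None):
        """Build a simple QA prompt: chat models get only the question (zero-shot);
        base models get worked examples prepended (few-shot)."""
        parts = []
        for demo_question, demo_answer in (demonstrations or []):
            parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}\n")
        parts.append(f"Question: {question}\nAnswer:")
        return "\n".join(parts)

    # Zero-shot prompt, as used for chat models.
    print(build_prompt("Who wrote Nineteen Eighty-Four?"))

    # Few-shot prompt, as used for base models (two of the five demonstrations shown).
    demos = [
        ("What is the capital of France?", "Paris"),
        ("How many days are in a leap year?", "366"),
    ]
    print(build_prompt("Who wrote Nineteen Eighty-Four?", demonstrations=demos))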

MixEval and MixEval-Hard are dynamic benchmarks. To mitigate contamination, we periodically update their data points with our fast, stable pipeline, which performs the benchmark mixture on a different batch of wild queries from the same distribution, showing low score variance (0.36 Std. on a 0-100 scale) and substantial version-to-version novelty (85% unique queries). Most models in this leaderboard were tested by the authors on MixEval-2024-06-01. Because score variance between versions is low, scores from models tested on later versions will be aggregated into this leaderboard.

Columns: Model, MixEval-Hard, MixEval, Arena Elo (0527), TriviaQA (Mixed), MMLU (Mixed), DROP (Mixed), HellaSwag (Mixed), CommonsenseQA (Mixed), TriviaQA-Hard (Mixed), MMLU-Hard (Mixed), DROP-Hard (Mixed)
Claude 3.5 Sonnet-0620 68.05 89.9 - 92.6 84.2 93.7 94.6 85.4 73.3 58.4 80.4
GPT-4o-2024-05-13 64.7 87.9 1287 88.0 85.4 87.9 94.3 86.8 70.3 57.1 67.5
Claude 3 Opus 63.5 88.1 1248 90.4 83.2 91.5 93.3 87.7 71.4 55.0 75.2
GPT-4-Turbo-2024-04-09 62.6 88.8 1256 91.2 82.8 91.0 92.6 85.4 73.1 45.5 71.0
Gemini 1.5 Pro-API-0409 58.7 84.2 1258 85.3 79.2 84.2 89.2 84.4 67.8 44.6 64.8
Gemini 1.5 Pro-API-0514 58.3 84.8 - 83.7 84.0 82.5 91.2 82.5 59.4 54.5 55.2
Yi-Large-preview 56.8 84.4 1239 81.7 80.9 87.0 92.6 90.1 55.4 48.5 63.1
LLaMA-3-70B-Instruct 55.9 84.0 1208 83.1 80.5 90.1 81.8 83.0 60.5 46.3 74.5
Qwen-Max-0428 55.8 86.1 1184 86.7 80.6 85.4 93.6 88.2 61.5 41.6 53.5
Claude 3 Sonnet 54.0 81.7 1201 84.2 74.7 87.7 85.9 82.5 59.1 40.7 66.9
Reka Core-20240415 52.9 83.3 - 82.8 79.3 88.1 88.6 81.6 51.6 46.3 66.6
MAmmoTH2-8x7B-Plus 51.8 81.5 - 83.0 74.5 85.7 82.2 82.5 52.9 41.1 65.1
DeepSeek-V2 51.7 83.7 - 84.4 77.3 85.3 88.2 84.0 51.7 42.0 62.8
Command R+ 51.4 81.5 1189 83.3 78.9 80.4 83.5 82.1 57.5 42.0 65.0
Yi-1.5-34B-Chat 51.2 81.7 - 78.4 76.4 87.0 90.2 86.8 44.4 38.1 67.4
Mistral-Large 50.3 84.2 1156 88.3 80.2 88.6 65.0 83.5 55.5 42.4 61.6
Qwen1.5-72B-Chat 48.3 84.1 1147 83.9 80.1 85.1 87.9 86.3 49.9 37.7 56.5
Mistral-Medium 47.8 81.9 1148 86.8 76.3 83.2 72.4 82.5 59.8 38.5 47.1
Gemini 1.0 Pro 46.4 78.9 1131 81.0 74.9 82.6 74.7 80.2 58.2 35.5 54.1
Reka Flash-20240226 46.2 79.8 1148 76.4 75.4 86.7 90.6 80.7 42.9 34.6 65.0
Mistral-Small 46.2 81.2 - 85.1 75.2 86.1 73.4 77.8 56.0 33.8 52.6
LLaMA-3-8B-Instruct 45.6 75.0 1153 71.7 71.9 86.4 65.7 78.3 40.2 40.7 67.6
Command R 45.2 77.0 1147 80.9 75.0 72.0 75.8 77.4 57.0 39.0 42.0
Qwen1.5-32B-Chat 43.3 81.0 1126 75.7 78.0 82.9 85.9 88.2 39.1 29.9 54.4
GPT-3.5-Turbo-0125 43.0 79.7 1102 85.2 74.5 84.8 63.0 81.6 46.4 35.1 55.4
Claude 3 Haiku 42.8 79.7 1178 79.9 76.1 85.0 75.8 78.8 42.4 30.7 51.5
Yi-34B-Chat 42.6 80.1 1111 82.7 73.6 86.1 86.9 78.8 41.5 29.9 57.1
Mixtral-8x7B-Instruct-v0.1 42.5 76.4 1114 82.5 72.0 79.5 54.2 77.4 48.5 37.2 47.7
Starling-LM-7B-beta 41.8 74.8 1119 75.1 69.0 86.4 48.5 84.9 33.4 34.2 62.9
Yi-1.5-9B-Chat 40.9 74.2 - 61.3 72.6 83.9 86.5 82.5 23.3 36.8 61.3
Gemma-1.1-7B-IT 39.1 69.6 1084 64.3 66.9 80.6 66.3 73.6 30.3 39.0 55.1
Vicuna-33B-v1.3 38.7 66.3 1090 79.2 59.2 71.4 30.3 61.8 42.5 39.4 36.6
LLaMA-2-70B-Chat 38.0 74.6 1093 80.0 69.8 79.8 67.3 74.1 42.2 27.7 42.2
MAP-Neo-Instruct-v0.1 37.8 70.0 - 62.1 66.7 75.5 74.4 82.1 26.5 32.5 42.4
Mistral-7B-Instruct-v0.2 36.2 70.0 1072 73.7 67.3 72.8 54.2 66.0 33.5 29.4 44.3
Qwen1.5-7B-Chat 35.5 71.4 1069 64.1 68.7 76.4 76.1 82.1 29.0 29.0 50.0
Reka Edge-20240208 32.2 68.5 - 60.0 63.6 80.0 74.7 80.7 18.6 26.4 56.9
Zephyr-7B-β 31.6 69.1 - 74.7 64.9 77.3 39.1 69.3 30.2 24.2 45.3
LLaMA-2-7B-Chat 30.8 61.7 1037 68.8 59.4 69.3 35.7 61.3 24.8 30.3 44.3
Yi-6B-Chat 30.1 65.6 - 66.1 65.4 70.5 52.5 69.8 18.9 26.8 43.7
Qwen1.5-MoE-A2.7B-Chat 29.1 69.1 - 65.9 69.5 64.6 72.7 81.1 21.9 26.8 39.5
Gemma-1.1-2B-IT 28.4 51.9 1019 53.7 51.5 59.8 26.6 57.1 31.9 30.3 27.8
Vicuna-7B-v1.5 27.8 60.3 1004 66.4 58.7 68.3 24.9 62.7 25.9 23.4 33.2
OLMo-7B-Instruct 26.7 55.0 1015 51.7 57.1 53.1 55.9 64.6 24.7 27.3 22.9
Qwen1.5-4B-Chat 24.6 57.2 988 46.0 61.4 57.2 54.9 74.1 16.5 17.3 28.6
JetMoE-8B-Chat 24.3 51.6 - 46.8 58.5 27.0 86.2 68.4 19.2 25.5 11.5
MPT-7B-Chat 23.8 43.8 927 50.2 37.8 50.0 25.6 36.3 17.5 24.7 31.0

The evaluation results of chat and base models on MixEval, MixEval-Hard, and their subsplits. The best-performing model in each category is shown in bold, and the second best is underlined. *: results provided by the authors.

Meta-Evaluation

Citation


      @article{ni2024mixeval,
        title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures},
        author={Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang},
        journal={arXiv preprint arXiv:2406.06565},
        year={2024}
      }