Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
We will actively maintain this repository by incorporating new research as it emerges. If you have any suggestions regarding our taxonomy, notice any missing papers, or know of an arXiv preprint listed here that has since been accepted to a venue, feel free to send us an email or submit a pull request using the following markdown format.
Paper Title, <ins>Conference/Journal/Preprint, Year</ins> [[pdf](link)] [[other resources](link)].
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing benchmarking methods, spanning static to dynamic approaches, that aim to reduce data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advances in data contamination research, offering valuable insights and a clear guide for future research efforts.
Data contamination occurs when benchmark data is inadvertently included in a language model's training data, leading to an inflated and misleading assessment of its performance. While this issue has been recognized for some time, stemming from the fundamental machine learning principle of separating training and test sets, it has become even more critical with the advent of LLMs. These models are typically trained on vast amounts of publicly available data scraped from the Internet, significantly increasing the likelihood of contamination. Furthermore, due to privacy and commercial concerns, tracing the exact training data of these models is challenging, if not impossible, which complicates efforts to detect and mitigate potential contamination.
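To make the notion of contamination concrete, the sketch below shows a purely illustrative, assumed surface-level check (not the method of any specific paper listed here): a benchmark item is flagged if it shares a sufficiently long word-level n-gram with a training document.

```python
# Illustrative sketch of a surface-level contamination check (assumed heuristic,
# not taken from any paper below): a benchmark item is flagged if it shares a
# long word-level n-gram with a training document.
from typing import Set


def word_ngrams(text: str, n: int = 13) -> Set[str]:
    """Return the set of lowercase word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_potentially_contaminated(benchmark_item: str, training_doc: str, n: int = 13) -> bool:
    """Flag the item if any of its n-grams also appears verbatim in the training document."""
    return bool(word_ngrams(benchmark_item, n) & word_ngrams(training_doc, n))


if __name__ == "__main__":
    training_doc = ("Question: What is 17 plus 25? Answer: 42. "
                    "This worked example appeared in a public tutoring forum thread.")
    benchmark_item = ("Question: What is 17 plus 25? Answer: 42. "
                      "This worked example appeared in a public tutoring forum thread.")
    print(is_potentially_contaminated(benchmark_item, training_doc))  # True: verbatim overlap
```

Surface matching of this kind misses paraphrased or translated leaks, which is part of why the literature below also covers detection methods that probe model behavior rather than the training corpus itself.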
This survey is necessary to address the growing issue of data contamination in LLM benchmarking, which compromises the reliability of static benchmarks that rely on fixed, human-curated datasets. While methods like data encryption and post-hoc contamination detection attempt to mitigate this issue, they have inherent limitations. Dynamic benchmarking has emerged as a promising alternative, yet existing reviews focus primarily on post-hoc detection and lack a systematic analysis of dynamic methods. Moreover, no standardized criteria exist for evaluating these benchmarks. To bridge this gap, we comprehensively review contamination-free benchmarking strategies, assess their strengths and limitations, and propose evaluation criteria for dynamic benchmarks, offering insights to guide future research and standardization.
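For intuition, the sketch below illustrates the core idea behind many template-based dynamic benchmarks: test items are regenerated with freshly sampled variables at evaluation time, so a memorized answer from a leaked static copy no longer transfers. The template, names, and numbers are hypothetical and not drawn from any listed benchmark.

```python
# Minimal sketch of template-based dynamic benchmark generation (hypothetical
# template and parameters): each run instantiates new, but comparable, test items.
import random


def make_item(seed: int) -> dict:
    """Instantiate one templated arithmetic word problem from a random seed."""
    rng = random.Random(seed)
    count, price = rng.randint(3, 20), rng.randint(2, 9)
    question = f"Alice buys {count} apples at {price} dollars each. How much does she spend in total?"
    return {"question": question, "answer": count * price}


def make_benchmark(num_items: int, run_id: int) -> list:
    """Build a run-specific benchmark; a new run_id yields a fresh, uncontaminated instance."""
    return [make_item(seed=run_id * 100_000 + i) for i in range(num_items)]


if __name__ == "__main__":
    for item in make_benchmark(num_items=3, run_id=2025):
        print(item["question"], "->", item["answer"])
```

Seeding the generator keeps each run reproducible while still allowing evaluators to rotate the run identifier ahead of any new training-data snapshot.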
- Training Verifiers to Solve Math Word Problems, arXiv, 2021 [Paper] [Code]
- Measuring Mathematical Problem Solving With the MATH Dataset, NeurIPS, 2021 [Paper] [Code]
- TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, ACL, 2017 [Paper] [Code]
- Natural questions: a benchmark for question answering research, TACL, 2019 [Paper] [Code]
- Measuring Massive Multitask Language Understanding, ICLR, 2021 [Paper] [Code]
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, ACL, 2023 [Paper] [Code]
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, NAACL, 2024 [Paper] [Code]
- Are We Done with MMLU?, arXiv, 2024 [Paper] [Code]
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, NeurIPS, 2024 [Paper] [Code]
- Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra, arXiv, 2024 [Paper]
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, COLM, 2024 [Paper] [Code]
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, arXiv, 2024 [Paper] [Code]
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline, arXiv, 2024 [Paper] [Code]
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation, NAACL, 2025 [Paper] [Code]
- AIME, [Website]
- CNMO, [Website]
- Evaluating Large Language Models Trained on Code, arXiv, 2021 [Paper] [Code]
- Program Synthesis with Large Language Models, arXiv, 2021 [Paper] [Code]
- SWE-bench: Can Language Models Resolve Real-world Github Issues?, ICLR, 2024 [Paper] [Code]
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?, ICLR, 2025 [Paper] [Code]
- Codeforces: competitive programming platform, [Website]
- Aider, [Website]
- Instruction-Following Evaluation for Large Language Models, arXiv, 2023 [Paper] [Code]
- C-EVAL: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, NeurIPS, 2023 [Paper] [Code]
- INFOBENCH: Evaluating Instruction Following Ability in Large Language Models, ACL, 2024 [Paper] [Code]
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, EMNLP, 2018 [Paper] [Code]
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv, 2018 [Paper] [Code]
- HellaSwag: Can a Machine Really Finish Your Sentence?, ACL, 2019 [Paper] [Code]
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale, AAAI, 2020 [Paper] [Code]
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, NAACL, 2019 [Paper] [Code]
- Social IQa: Commonsense Reasoning about Social Interactions, EMNLP, 2019 [Paper] [Code]
- PIQA: Reasoning about Physical Commonsense in Natural Language, AAAI, 2020 [Paper] [Code]
- Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models, arXiv, 2024 [Paper] [Code]
- TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenario, arXiv, 2025 [Paper] [Code]
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, EMNLP, 2020 [Paper] [Code]
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, ACL, 2022 [Paper] [Code]
- TrustGen: On the Trustworthiness of Generative Foundation Models - Guideline, Assessment, and Perspective, arXiv, 2025 [Paper] [Code]
- Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures, arXiv, 2025 [Paper] [Website]
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, EMNLP, 2018 [Paper] [Code]
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, NeurIPS, 2019 [Paper] [Code]
- CLUE: A Chinese Language Understanding Evaluation Benchmark, COLING, 2020 [Paper] [Code]
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, ACL, 2023 [Paper] [Code]
- Know What You Don’t Know: Unanswerable Questions for SQuAD, ACL, 2018 [Paper] [Code]
- QuAC : Question Answering in Context, EMNLP, 2018 [Paper] [Code]
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, NAACL, 2019 [Paper] [Code]
- Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks, EMNLP, 2023 [Paper] [Code]
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples, arXiv, 2023 [Paper] [Code]
- TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs, arXiv, 2024 [Paper] [Code]
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, EMNLP, 2018 [Paper] [Code]
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, NeurIPS, 2019 [Paper] [Code]
- Evaluating Large Language Models Trained on Code, arXiv, 2021 [Paper] [Code]
- Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models, ACL, 2024 [Paper] [Code]
- Platypus: Quick, Cheap, and Powerful Refinement of LLMs, NeurIPS Workshop, 2023 [Paper] [Code]
- Textbooks Are All You Need, arXiv, 2023 [Paper] [Code]
- An Open-Source Data Contamination Report for Large Language Models, EMNLP, 2024 [Paper] [Code]
- Benchmarking Benchmark Leakage in Large Language Models, arXiv, 2024 [Paper] [Code]
- Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation, ACL, 2024 [Paper] [Code]
- Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4, EMNLP, 2023 [Paper] [Code]
- Time Travel in LLMs: Tracing Data Contamination in Large Language Models, ICLR, 2024 [Paper] [Code]
- DE-COP: Detecting Copyrighted Content in Language Models Training Data, ICML, 2024 [Paper] [Code]
- Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models, arXiv, 2024 [Paper] [Code]
- Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations, ICML, 2024 [Paper] [Code]
- ConStat: Performance-Based Contamination Detection in Large Language Models, NeurIPS, 2024 [Paper] [Code]
- TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles, arXiv, 2024 [Paper] [Code]
- AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge, arXiv, 2024 [Paper] [Code]
- LiveBench: A Challenging, Contamination-Free LLM Benchmark, ICLR, 2025 [Paper] [Code]
- AcademicEval: Live Long-Context LLM Benchmark, arXiv, 2025 [Paper]
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, ICLR, 2025 [Paper] [Code]
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation, arXiv, 2025 [Paper] [Code]
- ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities, ICLR, 2025 [Paper] [Code]
- RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics, arXiv, 2025 [Paper] [Code]
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, ICLR, 2025 [Paper] [Code]
- Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models, EMNLP, 2024 [Paper] [Code]
- MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark, EMNLP, 2024 [Paper] [Code]
- S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models, NAACL, 2024 [Paper] [Code]
- DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks, ICLR, 2024 [Paper] [Code]
- NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes, ACL, 2024 [Paper] [Code]
- On Memorization of Large Language Models in Logical Reasoning, NeurIPS Workshop, 2024 [Paper] [Code]
- DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination, ICML, 2025 [Paper] [Code]
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models, NeurIPS, 2024 [Paper] [Code]
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation, EMNLP, 2024 [Paper] [Code]
- StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation, ACL, 2024 [Paper] [Code]
- VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation, EMNLP, 2024 [Paper] [Code]
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models, NeurIPS, 2024 [Paper] [Code]
- LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation, arXiv, 2024 [Paper] [Code]
- TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning, NeurIPS, 2024 [Paper] [Code]
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models, ACL, 2024 [Paper] [Code]
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation, COLING, 2025 [Paper] [Code]
- BenchAgents: Automated Benchmark Creation with Agent Interaction, arXiv, 2024 [Paper]
- GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning, ACL, 2025 [Paper] [Code]
- LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction, AAAI, 2024 [Paper] [Code]
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph, NeurIPS, 2024 [Paper] [Code]
- C2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation, AAAI, 2024 [Paper]