Awesome_MLLMs_Reasoning

In this repository, we continuously update the latest papers, projects, and other valuable resources that advance MLLM reasoning, making learning more efficient for everyone!

📢 Updates

  • 2025.03: We released this repo. Feel free to open pull requests.

📚 Table of Contents

  • 📖 Papers
    • 📝 1. Technical Report
    • 📌 2. Generated Data Guided Post-Training
    • 🚀 3. Test-time Scaling
    • 🚀 4. Collaborative Reasoning
    • 💰 5. MLLM Reward Model
    • 📊 6. Benchmarks
    • 📦 7. Applications
  • 🛠️ Open-Source Projects
  • 🤝 Contributing

📖 Papers

📝 1. Technical Report

We also feature some well-known technical reports on Large Language Model (LLM) reasoning.

  • [2507] [GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning] (GLM-V Team) Technical Report

  • [2506] [MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention] (MiniMax Team) Technical Report

  • [2506] [MiMo-VL Technical Report] (LLM-Core Xiaomi) Technical Report

  • [2504] [Kimi-VL Technical Report] (Kimi Team) Technical Report

  • [2503] [Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought] (SkyWork AI) Technical Report Model

  • [2503] [Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs] (Microsoft) Technical Report

  • [2503] [QwQ-32B: Embracing the Power of Reinforcement Learning] (Qwen Team) Technical Report Code Model

  • [2501] [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning] (DeepSeek Team) Technical Report

  • [2501] [Kimi k1.5: Scaling Reinforcement Learning with LLMs] (Kimi Team) Technical Report

📌 2. Generated Data Guided Post-Training

  • [2506] [Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing] (CASIA) Paper

  • [2506] [SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis] (NUS) Paper

  • [2505] [Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL] (BIT) Paper

  • [2505] [GRIT: Teaching MLLMs to Think with Images] (UC Santa Cruz) Paper

  • [2505] [DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning] (Xiaohongshu Inc.) Paper

  • [2504] [SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement] (University of Maryland) Paper

  • [2503] [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning] (PKU) Paper

  • [2503] [R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization] (NTU) Paper

  • [2503] [R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model] (UCLA) Paper Blog Code

  • [2503] [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models] (East China Normal University) Paper

  • [2503] [MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning] Paper Code

  • [2503] [Visual-RFT: Visual Reinforcement Fine-Tuning] (Shanghai AI Lab) Paper Code

  • [2502] [OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference] (Shanghai AI Lab) Paper Code

  • [2502] [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification] (PKU) Paper Code

  • [2502] [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment] (Kuaishou) Paper Code

  • [2501] [Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step] (CUHK) Paper Code

  • [2501] [URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics] (ByteDance) Paper

  • [2501] [LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs] (Mohamed bin Zayed University of AI) Paper

  • [2501] [Imagine while Reasoning in Space: Multimodal Visualization-of-Thought] (Microsoft Research) Paper

  • [2501] [Technical Report on Slow Thinking with LLMs: Visual Reasoning] (Renmin University of China) Paper

  • [2412] [MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale] (CMU) Paper

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2412] [TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action] (University of Washington) Paper

  • [2412] [Diving into Self-Evolving Training for Multimodal Reasoning] (HKUST) Paper

  • [2412] [Progressive Multimodal Reasoning via Active Retrieval] (Renmin University of China) Paper

  • [2411] [Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization] (Shanghai AI Lab) Paper

  • [2411] [Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning] (FDU) Paper

  • [2411] [Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models] (NTU) Paper

  • [2411] [AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning] (Sun Yat-sen University) Paper

  • [2411] [LLaVA-o1: Let Vision Language Models Reason Step-by-Step] (PKU) Paper

  • [2411] [Vision-Language Models Can Self-Improve Reasoning via Reflection] (NJU) Paper

  • [2403] [Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models] (CUHK) Paper

  • [2306] [Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic] (SenseTime) Paper

⬆️ Back to Top

🚀 3. Test-time Scaling

  • [2502] [Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking] (THU) Paper

  • [2502] [MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs] (USC) Paper

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2411] [Vision-Language Models Can Self-Improve Reasoning via Reflection] (NJU) Paper Code

  • [2402] [Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models] (THU) Paper Code

  • [2402] [V-STaR: Training Verifiers for Self-Taught Reasoners] (Mila, Université de Montréal) Paper

⬆️ Back to Top

🚀 4. Collaborative Reasoning

These methods use small models (tools or visual experts) or multiple MLLMs to perform collaborative reasoning.

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2410] [VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use] (Dartmouth College) Paper

  • [2409] [Visual Agents as Fast and Slow Thinkers] (UCLA) Paper

  • [2406] [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models] (University of Washington) Paper

  • [2312] [Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models] (Google Research) Paper

  • [2211] [Visual Programming: Compositional visual reasoning without training] (Allen Institute for AI) Paper

⬆️ Back to Top

💰 5. MLLM Reward Model

  • [2505] [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning] Paper

  • [2503] [VisualPRM: An Effective Process Reward Model for Multimodal Reasoning] (Shanghai AI Lab) Paper Blog

  • [2503] [Unified Reward Model for Multimodal Understanding and Generation] (Shanghai AI Lab) Paper

  • [2502] [Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning] (University of California, Riverside) Paper

  • [2501] [InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model] (Shanghai AI Lab) Paper

  • [2410] [TLDR: Token-Level Detective Reward Model for Large Vision Language Models] (Meta) Paper

  • [2410] [Fine-Grained Verifiers: Preference Modeling as Next-Token Prediction in Vision-Language Alignment] (NUS) Paper

  • [2410] [LLaVA-Critic: Learning to Evaluate Multimodal Models] (ByteDance) Paper Code

⬆️ Back to Top

📊 6. Benchmarks

  • [2503] [CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation] (CMU) Paper

  • [2503] [reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs] (Meta) Paper

  • [2503] [How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game] (THU) Paper

  • [2502] [Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models] (FAIR) Paper Code

  • [2502] [ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models] (University of Cambridge) Paper Code

  • [2502] [MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models] (Tencent Hunyuan Team) Paper Code

  • [2502] [MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency] (CUHK MMLab) Paper Code

  • [2410] [HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks] (CityU HK) Paper Homepage

  • [2406] [(CV-Bench) Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs] (NYU) Paper Code

  • [2404] [BLINK: Multimodal Large Language Models Can See but Not Perceive] (University of Pennsylvania) Paper

  • [2401] [(MMVP) Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs] (NYU) Paper

  • [2312] [(V∗Bench) V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs] (UCSD) Paper

⬆️ Back to Top

📦 7. Applications

  • [2503] [Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models] (Emory University) Paper

⬆️ Back to Top

🛠️ Open-Source Projects

  • [MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse] Code MetaSpatial Stars

  • [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning] Code R1-Omni Stars

  • [R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3] Code R1-V Stars Report

  • [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework] Code EasyR1 Stars

  • [R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning] Paper Code R1-Onevision Stars

  • [LMM-R1] Code Paper LMM-R1 Stars

  • [VLM-R1: A stable and generalizable R1-style Large Vision-Language Model] Code VLM-R1 Stars

  • [Multi-modal Open R1] Code Multi-modal Open R1 Stars

  • [Video-R1: Towards Super Reasoning Ability in Video Understanding] Code Video-R1 Stars

  • [Open-R1-Video] Code Open-R1-Video Stars

  • [R1-Vision: Let's first take a look at the image] Code R1-Vision Stars

⬆️ Back to Top

🤝 Contributing

You’re welcome to submit new resources or paper links. Please open a pull request directly.
