Awesome_MLLMs_Reasoning

In this repository, we continuously update the latest papers, projects, and other valuable resources that advance MLLM reasoning, making learning more efficient for everyone!

📢 Updates

  • 2025.03: We released this repo. Feel free to open pull requests.

📚 Table of Contents

  • 📖 Papers
    • 📝 1. Technical Report
    • 📌 2. Generated Data Guided Post-Training
    • 🚀 3. Test-time Scaling
    • 🚀 4. Collaborative Reasoning
    • 💰 5. MLLM Reward Model
    • 📊 6. Benchmarks
    • 📦 7. Applications
  • 🛠️ Open-Source Projects
  • 🤝 Contributing

📖 Papers

📝 1. Technical Report

We also feature some well-known technical reports on Large Language Model (LLM) reasoning.

  • [2507] [GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning] (GLM-V Team) Technical Report

  • [2506] [MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention] (MiniMax Team) Technical Report

  • [2506] [MiMo-VL Technical Report] (LLM-Core Xiaomi) Technical Report

  • [2504] [Kimi-VL Technical Report] (Kimi Team) Technical Report

  • [2503] [Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought] (SkyWork AI) Technical Report Model

  • [2503] [Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs] (Microsoft) Technical Report

  • [2503] [QwQ-32B: Embracing the Power of Reinforcement Learning] (Qwen Team) Technical Report Code Model

  • [2501] [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning] (DeepSeek Team) Technical Report

  • [2501] [Kimi k1.5: Scaling Reinforcement Learning with LLMs] (Kimi Team) Technical Report

📌 2. Generated Data Guided Post-Training

  • [2506] [Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing] (CASIA) Paper

  • [2506] [SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis] (NUS) Paper

  • [2505] [Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL] (BIT) Paper

  • [2505] [GRIT: Teaching MLLMs to Think with Images] (UC Santa Cruz) Paper

  • [2505] [DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning] (Xiaohongshu Inc.) Paper

  • [2504] [SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement] (University of Maryland) Paper

  • [2503] [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning] (PKU) Paper

  • [2503] [R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization] (NTU) Paper

  • [2503] [R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model] (UCLA) Paper Blog Code

  • [2503] [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models] (East China Normal University) Paper

  • [2503] [MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning] Paper Code

  • [2503] [Visual-RFT: Visual Reinforcement Fine-Tuning] (Shanghai AI Lab) Paper Code

  • [2502] [OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference] (Shanghai AI Lab) Paper Code

  • [2502] [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification] (PKU) Paper Code

  • [2502] [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment] (Kuaishou) Paper Code

  • [2501] [Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step] (CUHK) Paper Code

  • [2501] [URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics] (ByteDance) Paper

  • [2501] [LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs] (Mohamed bin Zayed University of AI) Paper

  • [2501] [Imagine while Reasoning in Space: Multimodal Visualization-of-Thought] (Microsoft Research) Paper

  • [2501] [Technical Report on Slow Thinking with LLMs: Visual Reasoning] (Renmin University of China) Paper

  • [2412] [MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale] (CMU) Paper

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2412] [TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action] (University of Washington) Paper

  • [2412] [Diving into Self-Evolving Training for Multimodal Reasoning] (HKUST) Paper

  • [2412] [Progressive Multimodal Reasoning via Active Retrieval] (Renmin University of China) Paper

  • [2411] [Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization] (Shanghai AI Lab) Paper

  • [2411] [Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning] (FDU) Paper

  • [2411] [Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models] (NTU) Paper

  • [2411] [AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning] (Sun Yat-sen University) Paper

  • [2411] [LLaVA-o1: Let Vision Language Models Reason Step-by-Step] (PKU) Paper

  • [2411] [Vision-Language Models Can Self-Improve Reasoning via Reflection] (NJU) Paper

  • [2403] [Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models] (CUHK) Paper

  • [2306] [Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic] (SenseTime) Paper

⬆️ Back to Top

🚀 3. Test-time Scaling

  • [2502] [Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking] (THU) Paper

  • [2502] [MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs] (USC) Paper

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2411] [Vision-Language Models Can Self-Improve Reasoning via Reflection] (NJU) Paper Code

  • [2402] [Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models] (THU) Paper Code

  • [2402] [V-STaR: Training Verifiers for Self-Taught Reasoners] (Mila, Université de Montréal) Paper

⬆️ Back to Top

🚀 4. Collaborative Reasoning

These methods use small models (tools or visual experts) or multiple MLLMs to perform collaborative reasoning.

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2410] [VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use] (Dartmouth College) Paper

  • [2409] [Visual Agents as Fast and Slow Thinkers] (UCLA) Paper

  • [2406] [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models] (University of Washington) Paper

  • [2312] [Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models] (Google Research) Paper

  • [2211] [Visual Programming: Compositional visual reasoning without training] (Allen Institute for AI) Paper

⬆️ Back to Top

💰 5. MLLM Reward Model

  • [2505] [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning] Paper

  • [2503] [VisualPRM: An Effective Process Reward Model for Multimodal Reasoning] (Shanghai AI Lab) Paper Blog

  • [2503] [Unified Reward Model for Multimodal Understanding and Generation] (Shanghai AI Lab) Paper

  • [2502] [Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning] (University of California, Riverside) Paper

  • [2501] [InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model] (Shanghai AI Lab) Paper

  • [2410] [TLDR: Token-Level Detective Reward Model for Large Vision Language Models] (Meta) Paper

  • [2410] [Fine-Grained Verifiers: Preference Modeling as Next-Token Prediction in Vision-Language Alignment] (NUS) Paper

  • [2410] [LLaVA-Critic: Learning to Evaluate Multimodal Models] (ByteDance) Paper Code

⬆️ Back to Top

📊 6. Benchmarks

  • [2503] [CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation] (CMU) Paper

  • [2503] [reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs] (Meta) Paper

  • [2503] [How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game] (THU) Paper

  • [2502] [Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models] (FAIR) Paper Code

  • [2502] [ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models] (University of Cambridge) Paper Code

  • [2502] [MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models] (Tencent Hunyuan Team) Paper Code

  • [2502] [MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency] (CUHK MMLab) Paper Code

  • [2410] [HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks] (CityU HK) Paper Homepage

  • [2406] [(CV-Bench) Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs] (NYU) Paper Code

  • [2404] [BLINK: Multimodal Large Language Models Can See but Not Perceive] (University of Pennsylvania) Paper

  • [2401] [(MMVP) Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs] (NYU) Paper

  • [2312] [(V∗Bench) V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs] (UCSD) Paper

⬆️ Back to Top

📦 7. Applications

  • [2503] [Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models] (Emory University) Paper

⬆️ Back to Top

🛠️ Open-Source Projects

  • [MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse] Code MetaSpatial Stars

  • [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning] Code R1-Omni Stars

  • [R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3] Code R1-V Stars Report

  • [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework] Code EasyR1 Stars

  • [R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning] Paper Code R1-Onevision Stars

  • [LMM-R1] Code Paper LMM-R1 Stars

  • [VLM-R1: A stable and generalizable R1-style Large Vision-Language Model] Code VLM-R1 Stars

  • [Multi-modal Open R1] Code Multi-modal Open R1 Stars

  • [Video-R1: Towards Super Reasoning Ability in Video Understanding] Code Video-R1 Stars

  • [Open-R1-Video] Code Open-R1-Video Stars

  • [R1-Vision: Let's first take a look at the image] Code R1-Vision Stars

⬆️ Back to Top

🤝 Contributing

You’re welcome to submit new resources or paper links. Please open a pull request directly.
