Overview

MC-SEARCH is the first comprehensive benchmark designed to evaluate Multimodal Agentic Retrieval-Augmented Generation (MM-RAG). Unlike traditional RAG evaluations, MC-SEARCH focuses on structured long reasoning chains in which agents must navigate vast knowledge bases (389K images and 784K documents) to solve complex queries.

Each instance in MC-SEARCH provides step-wise sub-questions, explicit retrieval modality annotations (image vs. text), and intermediate reasoning states, allowing for a fine-grained analysis of how agents plan and execute multimodal search.
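
As a rough illustration of the annotations described above, one instance could be modeled as follows. This is a minimal sketch, not the released data schema: the class names, field names, and the example content are all hypothetical and chosen only to mirror the three annotation types (sub-question, retrieval modality, intermediate reasoning state).

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Assumption: the two retrieval modalities named in the text.
Modality = Literal["image", "text"]


@dataclass
class ReasoningStep:
    """One hop in a reasoning chain (hypothetical field names)."""
    sub_question: str        # step-wise sub-question
    modality: Modality       # which knowledge base this step should query
    intermediate_state: str  # what is known after this step


@dataclass
class McSearchInstance:
    """A single benchmark instance (hypothetical layout)."""
    query: str
    steps: List[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""

    def modality_plan(self) -> List[str]:
        # The ordered list of modalities an agent is expected to search.
        return [step.modality for step in self.steps]


# Invented example content, for illustration only.
example = McSearchInstance(
    query="Which river runs past the tower shown in the photo?",
    steps=[
        ReasoningStep("Which tower is shown in the photo?", "image",
                      "The tower has been identified."),
        ReasoningStep("Which river runs past that tower?", "text",
                      "The river has been identified."),
    ],
    final_answer="(gold answer)",
)

print(example.modality_plan())
```

Keeping the modality on each step, rather than on the instance as a whole, is what makes it possible to analyze planning step by step.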

**Figure 1.** MC-SEARCH benchmark design. The reasoning topologies of MC-SEARCH: simulating realistic agentic search behaviors via unified pipelines.

We introduce five representative reasoning topologies that mirror real-world search patterns. We evaluate models not just on the final answer, but on their Hit-per-Step (HPS), Rollout Deviation (RD), and Planning Accuracy to pinpoint where agentic reasoning breaks down.
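
To make step-level evaluation concrete, here is a minimal sketch of a Hit-per-Step style score. The exact metric definitions are not given on this page, so the code assumes one plausible reading: a step counts as a hit if any retrieved item matches a gold item for that step, and the score is the fraction of gold steps that are hit. The function name, argument layout, and item IDs are hypothetical.

```python
from typing import Sequence, Set


def hit_per_step(
    retrieved_per_step: Sequence[Sequence[str]],
    gold_per_step: Sequence[Set[str]],
) -> float:
    """Fraction of gold steps whose retrieval contains a gold item.

    Assumed definition, for illustration only; it is not taken from
    the benchmark's official evaluation code.
    """
    if not gold_per_step:
        return 0.0
    hits = 0
    for i, gold in enumerate(gold_per_step):
        # Steps the agent never reached count as misses.
        retrieved = retrieved_per_step[i] if i < len(retrieved_per_step) else []
        if any(item in gold for item in retrieved):
            hits += 1
    return hits / len(gold_per_step)


# Invented IDs: the agent hits steps 1 and 3 but misses step 2.
gold = [{"img_1"}, {"doc_7"}, {"doc_9"}]
retrieved = [["img_1", "img_4"], ["doc_2"], ["doc_9"]]
score = hit_per_step(retrieved, gold)
print(round(score, 3))
```

Scoring each step separately, rather than only the final answer, is what lets a failure be traced to the hop where retrieval first went wrong.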

Key Takeaways

Our evaluation of state-of-the-art MLLMs (Multimodal Large Language Models) reveals critical bottlenecks in current agentic search capabilities:

💡 Compounding Errors: Retrieval fidelity drops significantly as reasoning depth exceeds 3 hops.

💡 Modality Gap: Current models struggle to maintain consistency when switching between image and text retrieval.

💡 Process Supervision: Explicit reasoning alignment (like our SEARCH-ALIGN) substantially boosts retrieval success rates.
