MC-Search

Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
UIUC · Meta · IBM Research · Stony Brook University

Overview


MC-SEARCH is the first comprehensive benchmark designed to evaluate Multimodal Agentic Retrieval-Augmented Generation (MM-RAG). Unlike traditional RAG, MC-SEARCH focuses on structured long reasoning chains, where agents must navigate a large knowledge base of 389K images and 784K text documents to solve complex queries.

Each instance in MC-SEARCH provides step-wise sub-questions, explicit retrieval modality annotations (image vs. text), and intermediate reasoning states, allowing for a fine-grained analysis of how agents plan and execute multimodal search.
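To make the annotation structure concrete, here is a minimal sketch of what such an instance could look like. The field names (`sub_question`, `retrieval_modality`, `intermediate_state`) and the example content are hypothetical illustrations, not the benchmark's actual schema:

```python
# Hypothetical MC-SEARCH-style instance; all field names and values are
# illustrative, not the benchmark's actual schema.
instance = {
    "question": "When was the landmark shown in the photo completed?",
    "steps": [
        {
            "sub_question": "Identify the landmark in the photo.",
            "retrieval_modality": "image",  # retrieve from the image corpus
            "intermediate_state": "landmark = Eiffel Tower",
        },
        {
            "sub_question": "When was the Eiffel Tower completed?",
            "retrieval_modality": "text",   # retrieve from the document corpus
            "intermediate_state": "completed = 1889",
        },
    ],
    "answer": "1889",
}

# Step-wise annotations let an evaluator score each hop independently,
# e.g. checking which modality the agent should query at each step.
modalities = [s["retrieval_modality"] for s in instance["steps"]]
print(modalities)  # ['image', 'text']
```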

MC-SEARCH Benchmark Design
Figure 1. The reasoning topologies of MC-SEARCH: simulating realistic agentic search behaviors via unified pipelines.

We introduce five representative reasoning topologies that mirror real-world search patterns. We evaluate models not just on the final answer but also on Hit-per-Step (HPS), Rollout Deviation (RD), and Planning Accuracy, to pinpoint where agentic reasoning breaks down.
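As a rough illustration of the kind of step-wise scoring these metrics enable, the sketch below computes one plausible reading of Hit-per-Step: the fraction of reasoning steps whose retrieved set contains the gold evidence. The function name and this exact formulation are assumptions for illustration; the benchmark's precise definition may differ:

```python
def hit_per_step(retrieved_per_step, gold_per_step):
    """Fraction of steps whose retrieved set contains the gold evidence.

    A sketch of one plausible Hit-per-Step definition; the benchmark's
    exact formulation may differ.
    """
    hits = sum(
        gold in retrieved
        for retrieved, gold in zip(retrieved_per_step, gold_per_step)
    )
    return hits / len(gold_per_step)

# Example: a 3-step chain where the agent's retrieval hits gold on
# steps 1 and 3 but misses on step 2.
retrieved = [["doc_a", "doc_b"], ["img_7"], ["doc_c"]]
gold = ["doc_a", "img_9", "doc_c"]
print(hit_per_step(retrieved, gold))  # 2/3 ≈ 0.667
```

Unlike final-answer accuracy, a per-step score like this localizes the failure to a specific hop, which is what makes the fine-grained analysis possible.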

Key Takeaways


Our evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) reveals critical bottlenecks in current agentic search capabilities:

💡 Compounding Errors: Retrieval fidelity drops sharply once reasoning depth exceeds three hops.

💡 Modality Gap: Current models struggle to maintain consistency when switching between image and text retrieval.

💡 Process Supervision: Explicit reasoning alignment (like our SEARCH-ALIGN) substantially boosts retrieval success rates.
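The compounding-error effect has a simple intuition behind it. Under the (simplifying, illustrative) assumption that each retrieval hop succeeds independently with probability p, a d-hop chain succeeds with probability p**d, so even a strong per-hop rate decays quickly with depth:

```python
# Toy independence model of error compounding in multi-hop retrieval:
# if each hop succeeds with probability p, a d-hop chain succeeds with p**d.
# p = 0.8 is an arbitrary illustrative per-hop success rate.
p = 0.8
for depth in range(1, 6):
    print(f"depth {depth}: chain success ≈ {p ** depth:.3f}")
# depth 3 already falls to ~0.512, roughly half the per-hop rate.
```

Real agents are not independent across hops (an early retrieval error also corrupts later planning), so observed degradation can be even steeper than this toy model suggests.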