MC-Search

Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

🏆 ICLR 2026 Oral

Xuying Ning¹^*, Dongqi Fu²^*, Tianxin Wei¹^*, Mengting Ai¹, Jiaru Zou¹, Ting-Wei Li¹

Hanghang Tong¹, Yada Zhu³, Hendrik Hamann⁴, Jingrui He¹

¹ University of Illinois Urbana-Champaign ² Meta ³ IBM Research ⁴ Stony Brook University

^* Equal Contribution

Overview

With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.

Illustrative overview of the MC-Search benchmark, reasoning topologies, and evaluation framework.

MC-Search Benchmark

We introduce MC-Search as a new benchmark paradigm for multimodal agentic search, requiring agents to reason over both visual and textual evidence through structured, multi-hop chains.

Benchmark Design

• Scale & Quality	3,333 examples averaging 3.7 hops, verified by HAVE (Hop-wise Attribution and Verification of Evidence).
• Offline Knowledge Base	389K images and 784K documents covering diverse real-world topics.
• Diverse Reasoning Structures	Five representative search-enhanced reasoning structures: image-initiated chain, text-initiated chain, parallel image-text fork, multi-images fork, and text-only chain.
• Step-wise Annotation	Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers at every hop.

Comparison of existing multimodal retrieval-augmented QA datasets. Distribution of the five reasoning topologies in MC-Search.

Evaluation Metrics

MC-Search evaluates models beyond final-answer accuracy (F1, LLM-as-Judge) using process-level metrics:

• Hit-per-Step (HPS)	Measures whether each reasoning step retrieves the correct supporting evidence.
• Rollout Deviation (RD)	Quantifies how much the predicted reasoning chain deviates from the gold trajectory.
• Planning Accuracy	Evaluates how well agents decompose complex questions into sub-questions.

Data Examples

Image-Initiated Chain

Reasoning starts from an image retrieval step and then proceeds with textual retrieval to gather factual knowledge. The image serves as the anchor for identifying entities that guide subsequent text-based reasoning.

Text-Initiated Chain

Reasoning begins with textual retrieval to obtain key factual information and later incorporates image retrieval for visual verification. This topology reflects workflows where structured knowledge is first identified before grounding it in visual evidence.

Parallel Image-Text Fork

Image and text retrieval are performed in parallel branches without strict dependency between steps. The model must coordinate evidence from both modalities and integrate them to produce the final answer.

Multi-Images Fork

Multiple images are retrieved and compared to extract visual information before turning to textual evidence for factual support. This structure emphasizes cross-image reasoning and visual comparison.

Text-Only Chain

All reasoning steps rely purely on textual retrieval without using images. It serves as a baseline topology focusing on sequential text-based reasoning.

Experiments and Results

Main Experiments

Closed-source models (e.g., Gemini-2.5-Pro) achieve the strongest overall performance. Open-source models lag mainly due to weaker planning and retrieval alignment. Process-level training with Search-Align significantly closes this gap:

• Closed-source lead	Gemini-2.5-Pro tops all metrics; GPT-4o follows closely on process-level scores.
• Open-source gap	Primarily driven by poor planning decomposition and modality-misaligned retrieval.
• Search-Align gains	Structured reasoning supervision consistently improves open-source MLLMs across all process-level metrics.

Chain Length & Retrieval Iterations

Performance degrades systematically with longer chains and misaligned retrieval budgets:

• Chain length	Sharp accuracy drops beyond 4–5 hops; models fail to maintain stable planning over long trajectories.
• Under-retrieval	Causes the largest degradation because critical evidence is skipped.
• Over-retrieval	Moderate excess can recover missing evidence; excessive retrieval introduces noise and hurts reasoning quality.

Modality Coverage

Models exhibit a strong text-retrieval bias, exposing a significant visual grounding gap:

• Text bias	High text retrieval coverage across all models; image retrieval coverage is significantly weaker.
• Visual grounding	Deteriorates sharply without explicit image cues, revealing a modality gap in current multimodal reasoning.

Error Analysis

Most failures originate from evidence acquisition rather than downstream reasoning:

• Top failure modes	Retrieval failure, hallucinated entities, and missing reasoning steps dominate error cases.
• Structural errors	Less frequent but still challenging for long multi-hop chains — reasoning logic breaks down at later steps.

Key Takeaways

Our evaluation of state-of-the-art MLLMs reveals three critical bottlenecks in current multimodal agentic search capabilities. The most prevalent failure pattern involves compounding retrieval errors across reasoning hops — a single wrong retrieval at step 2 propagates and amplifies failure through all subsequent steps. Cross-domain errors (i.e., failures at modality-switching junctions) account for the majority of task failures, confirming that the critical bottleneck emerges precisely at the intersection where visual and textual retrieval must be coordinated.

💡 Compounding Errors: Retrieval fidelity drops significantly as reasoning depth exceeds 3 hops.

💡 Modality Gap: Models struggle with maintaining consistency when switching between image and text retrieval.

💡 Process Supervision: Explicit reasoning alignment (SEARCH-ALIGN) substantially boosts retrieval success rates across all topologies and model families.

Citation

@inproceedings{
      ning2026mcsearch,
      title={{MC}-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains},
      author={Xuying Ning and Dongqi Fu and Tianxin Wei and Mengting Ai and Jiaru Zou and Ting-Wei Li and Hanghang Tong and Yada Zhu and Hendrik Hamann and Jingrui He},
      booktitle={The Fourteenth International Conference on Learning Representations},
      year={2026},
      url={https://openreview.net/forum?id=JEGDp1E4OH}
      }