★ Accepted to ECCV 2026

ProMSA

Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

An MLLM agent that learns when to retrieve, which modality to use, and when to stop — trained end-to-end with TN-GSPO, a tool-horizon-normalized sequence-level RL objective.

Overview of ProMSA. (1) Image–question input and goal. (2) The progressive search loop — at each round the policy thinks, then chooses image search, text search, or stop, retrieving over Wikipedia with de-duplication. (3) Training with the tool-horizon-normalized sequence-level objective, TN-GSPO.

Search like a budget-aware researcher

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, so retrieval cannot adapt during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image–question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with de-duplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines in both retrieval and end-to-end accuracy.

Retrieval becomes part of the reasoning

Instead of one fixed retrieval step, ProMSA unrolls a short, budgeted trajectory. It can switch modality and re-search with a de-duplication list when the first attempt drifts — then stop the moment the evidence is enough.

Question: In which country is this lake located?

Round 01: image search
The lake isn't identifiable from the image alone, so find the entity visually.
Top candidate page: Lake Köyceğiz (prior pages excluded)

Round 02: text search
Entity likely identified; confirm the country with a rewritten text query.
query: "Lake Köyceğiz country location"
"…a lake in southwestern Turkey."

Round 03: stop
Evidence is sufficient and consistent, so stop and answer.
Answer: Turkey
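
The three-round trajectory above can be replayed as a small control loop. This is only a minimal sketch under stated assumptions, not the released implementation: the names `run_agent`, `SearchState`, the action strings, and the toy policy and tools are all hypothetical, and the real policy is the MLLM itself. The sketch applies the exclusion list to image search only, which is how the page describes that tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action names; the page only says the agent chooses
# image search, text search, or stop.
IMAGE_SEARCH, TEXT_SEARCH, STOP = "image_search", "text_search", "stop"


@dataclass
class SearchState:
    question: str
    evidence: List[str] = field(default_factory=list)  # retrieved snippets
    excluded: List[str] = field(default_factory=list)  # pages already returned by image search
    calls: int = 0                                      # tool calls used so far


def run_agent(
    question: str,
    policy: Callable[[SearchState, bool], Tuple[str, str]],
    image_search: Callable[[List[str]], Tuple[str, str]],
    text_search: Callable[[str], str],
    max_calls: int = 3,
) -> Tuple[str, SearchState]:
    """Loop: the policy picks an action, a tool runs, evidence accumulates.

    The policy receives `budget_left`; when it is False the policy is
    expected to return (STOP, answer), and the loop ends either way.
    """
    state = SearchState(question)
    while True:
        budget_left = state.calls < max_calls
        action, payload = policy(state, budget_left)
        if action == STOP or not budget_left:
            return payload, state  # payload is the final answer
        if action == IMAGE_SEARCH:
            # Pass the exclusion list so a repeated call surfaces a new page.
            title, snippet = image_search(state.excluded)
            state.excluded.append(title)
        else:
            # payload is the rewritten text query.
            snippet = text_search(payload)
        state.calls += 1
        state.evidence.append(snippet)


# Toy stand-ins that replay the lake example; a real system would call the MLLM
# and the retrievers here.
def toy_policy(state: SearchState, budget_left: bool) -> Tuple[str, str]:
    if len(state.evidence) >= 2:
        return STOP, "Turkey"
    if not budget_left:
        return STOP, "unknown"
    if not state.evidence:
        return IMAGE_SEARCH, ""
    return TEXT_SEARCH, "Lake Köyceğiz country location"


def toy_image_search(excluded: List[str]) -> Tuple[str, str]:
    return "Lake Köyceğiz", "Top candidate page: Lake Köyceğiz"


def toy_text_search(query: str) -> str:
    return "…a lake in southwestern Turkey."


answer, state = run_agent(
    "In which country is this lake located?",
    toy_policy, toy_image_search, toy_text_search,
)
print(answer, state.calls)  # prints: Turkey 2
```

With `max_calls=1` the same toy policy runs one image search and is then forced to stop, which is the budget behaviour the page describes.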

Fixed pipelines can't recover from a wrong start

A single-shot RAG pipeline that drifts to the wrong entity is forced to answer on bad evidence. ProMSA re-searches with exclusion, switches between image and text retrieval, and accumulates evidence across rounds.

Direct answering vs. fixed RAG vs. ProMSA. Where RAG locks onto an incorrect page, the agent corrects course and grounds its answer in the right evidence.

Normalize by the tool horizon, not just the length

In a search agent, what makes a trajectory hard is its tool-interaction depth H — how many retrievals it took — not the raw number of tokens. Standard GSPO normalizes the sequence ratio by generation length L alone, which biases the policy toward shorter outputs. TN-GSPO folds the tool horizon into the normalizer:

$$ D(\tau) = L(\tau)\,\bigl(1 + c\,H(\tau)^{\alpha}\bigr), \qquad r_\theta(\tau) = \exp\!\Bigl(\tfrac{1}{D(\tau)}\!\!\sum_{t\in T_\text{gen}}\!\! \Delta_t\Bigr) $$
# H = number of tool calls · defaults c = 0.04, α = 1.0 · asymmetric clip ε⁻=0.2, ε⁺=0.28
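
To make the normalizer concrete, here is a minimal numeric sketch of the formula above. It assumes that Δ_t is the per-token log-probability difference between the current and old policy (the page does not define it), that L is the number of generated tokens, and that the asymmetric clip is applied PPO-style to the sequence ratio. The function names are hypothetical.

```python
import math
from typing import Sequence


def tn_gspo_ratio(
    delta: Sequence[float],   # assumed: per-token log pi_theta - log pi_old on generated tokens
    num_tool_calls: int,      # H, the tool horizon
    c: float = 0.04,
    alpha: float = 1.0,
) -> float:
    """Sequence ratio r = exp(sum(delta) / D) with D = L * (1 + c * H**alpha)."""
    length = len(delta)       # L, taken here as the generated-token count
    if length == 0:
        raise ValueError("trajectory has no generated tokens")
    d = length * (1.0 + c * num_tool_calls ** alpha)
    return math.exp(sum(delta) / d)


def clipped_surrogate(
    ratio: float,
    advantage: float,
    eps_low: float = 0.2,
    eps_high: float = 0.28,
) -> float:
    """PPO-style clipped objective with an asymmetric range (an assumption here)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Same token-level drift, different tool depth.
delta = [0.1] * 10
r_no_tools = tn_gspo_ratio(delta, num_tool_calls=0)  # D = 10, plain length normalization
r_deep = tn_gspo_ratio(delta, num_tool_calls=5)      # D = 10 * (1 + 0.04 * 5) = 12
print(r_no_tools > r_deep > 1.0)  # prints: True
```

With H = 0 the normalizer reduces to L, so the ratio matches plain length normalization. As H grows, D grows and the exponent shrinks in magnitude, so deeper trajectories have their ratio pulled toward 1.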

The reward is sparse and sequence-level: answer correctness (LLM-judge) + format − a tool-cost penalty λ·(#calls / Hmax), encouraging efficient retrieval under a fixed budget.
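
The reward composition can be sketched as a single function. The weights are not given on this page, so `lam` and `w_format` below are hypothetical placeholders, and the correctness flag stands in for the LLM-judge verdict.

```python
def trajectory_reward(
    correct: bool,          # LLM-judge verdict on the final answer
    format_ok: bool,        # whether the output followed the required format
    num_calls: int,         # tool calls used in the trajectory
    h_max: int,             # tool-call budget Hmax
    lam: float = 0.1,       # hypothetical tool-cost weight (lambda)
    w_format: float = 0.1,  # hypothetical weight of the format term
) -> float:
    """Sequence-level reward: correctness + format - lam * (#calls / Hmax)."""
    if h_max <= 0:
        raise ValueError("h_max must be positive")
    tool_cost = lam * (num_calls / h_max)
    return float(correct) + w_format * float(format_ok) - tool_cost


# Two correct, well-formatted answers: the one using fewer calls scores higher.
cheap = trajectory_reward(True, True, num_calls=1, h_max=4)
costly = trajectory_reward(True, True, num_calls=4, h_max=4)
print(cheap > costly)  # prints: True
```

Because the penalty is scaled by Hmax, it stays bounded by `lam` however large the budget is.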

  • Image search: Reverse-image retrieval over a Wikipedia KB with an exclusion list, so repeated calls surface new candidates under appearance variation.

  • Text search: A model-rewritten query drives dense retrieval to fill in missing attributes once the entity is known.

  • Stop: The policy decides on its own when the gathered evidence is sufficient and emits the final answer.

  • Two-stage training: Rejection-sampling SFT cold-start teaches valid tool-call formats; RL with TN-GSPO learns the search policy.
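
The cold-start stage can be pictured as a filter over sampled rollouts. This is a rough sketch under assumptions: the acceptance rule (valid format, correct answer, within budget) and the preference for fewer tool calls are illustrative choices, not details confirmed here, and `SampledTrajectory` and `select_for_sft` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampledTrajectory:
    text: str        # full rollout, including tool calls and the final answer
    format_ok: bool  # every tool call parsed (the checker itself is not shown)
    correct: bool    # final answer judged correct
    num_calls: int   # tool calls used


def select_for_sft(
    samples: List[SampledTrajectory],
    max_calls: int,
    keep: int = 1,
) -> List[SampledTrajectory]:
    """Reject rollouts that are malformed, wrong, or over budget; keep the rest."""
    accepted = [
        s for s in samples
        if s.format_ok and s.correct and s.num_calls <= max_calls
    ]
    # Illustrative tie-break: prefer rollouts that needed fewer tool calls.
    accepted.sort(key=lambda s: s.num_calls)
    return accepted[:keep]


rollouts = [
    SampledTrajectory("a", format_ok=True, correct=True, num_calls=3),
    SampledTrajectory("b", format_ok=False, correct=True, num_calls=1),
    SampledTrajectory("c", format_ok=True, correct=True, num_calls=2),
    SampledTrajectory("d", format_ok=True, correct=False, num_calls=1),
]
print([s.text for s in select_for_sft(rollouts, max_calls=3)])  # prints: ['c']
```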

State of the art on E-VQA and InfoSeek

52.6
E-VQA (All) · Qwen3-VL-8B
53.4
InfoSeek (All) · Qwen3-VL-8B
35.1 53.0
Base → Cold-Start → RL (avg.)
Method                   Retriever       Model          E-VQA Single  E-VQA All  InfoSeek All
Qwen3-VL-8B (zero-shot)  -               -              25.3          24.8       25.7
MMSearch-R1              BGE + EVA-CLIP  Qwen2.5-VL-7B  40.6          40.7       39.7
CC-VQA                   EVA-CLIP-8B     Qwen2.5-VL-7B  41.4          36.1       45.1
REAL                     EVA-CLIP-8B     Qwen3-VL-8B    45.5          41.4       44.1
ProMSA (Ours)            BGE + EVA-CLIP  Qwen2.5-VL-7B  50.0          49.7       49.2
ProMSA (Ours)            BGE + EVA-CLIP  Qwen3-VL-8B    52.2          52.6       53.4

E-VQA reported with BEM; InfoSeek with VQA accuracy. See the paper for the full comparison, tool / budget / top-k ablations, inference-time analysis, and OK-VQA generalization.

Tool usage & training dynamics. RL reshapes the image-vs-text ratio; TN-GSPO keeps the tool-call count in a stable range while reward rises and responses shorten.
Retrieval examples. The correct entity may sit at rank-1 or deep in the list — rank variation that motivates adaptive, multi-round retrieval.

BibTeX

@misc{wu2026promsaprogressivemultimodalsearchagents,
  title         = {ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering},
  author        = {ZhengXian Wu and Hangrui Xu and Kai Shi and Zhuohong Chen and Yunyao Yu and Chuanrui Zhang and Zirui Liao and Jun Yang and Zhenyu Yang and Haonan Lu and Haoqian Wang},
  year          = {2026},
  eprint        = {2606.27974},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.27974}
}