When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

1OPPO AI Center     2Tsinghua University     3Nanyang Technological University    4Hefei University of Technology
Equal contribution     ✉️Corresponding author
zx-wu24@mails.tsinghua.edu.cn, wanghaoqian@tsinghua.edu.cn

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data, which is costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models.

For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor’s self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates.

Proposed Method

Key Ideas

  • We propose a new framework for unsupervised post-training of large multimodal models, enabling sustained self-improvement without any external supervision.
  • Through extensive empirical analysis, we identify common failure modes in unsupervised self-evolution and mitigate them by modeling and optimizing the within-input relative structure among candidate solutions.
Proposed method overview
Overview of the proposed unsupervised self-evolution framework. The Actor generates multiple reasoning trajectories for the same input, while a frozen Judge provides bounded score modulation. The final rewards are optimized in a group-wise, distributional manner to enable stable policy updates without external supervision.

🧩 Self-Consistency as a Training Prior

Our framework samples multiple reasoning trajectories for the same input and uses their agreement patterns to construct a self-consistency distribution. This provides a noise-reduced training prior in fully unsupervised settings, without relying on ground-truth answers or external supervision.
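The agreement-based prior can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: each trajectory's prior score is the empirical frequency of its final answer within the sampled group, so mutual agreement serves as an unsupervised, noise-reduced reward signal.

```python
from collections import Counter

def self_consistency_prior(answers):
    """Build a self-consistency distribution over sampled final answers.

    Sketch only: score each trajectory by the empirical frequency of
    its final answer within the group, so agreement among trajectories
    acts as an unsupervised training prior (no ground-truth labels).
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Example: 5 sampled trajectories, 3 of which agree on "42"
print(self_consistency_prior(["42", "42", "17", "42", "9"]))
# -> [0.6, 0.6, 0.2, 0.6, 0.2]
```

Trajectories whose answers agree with more of their peers receive a higher prior, without any reference to a ground-truth label.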

⚖️ Bounded Judge-Based Modulation

We introduce a frozen Judge that evaluates each reasoning trajectory and produces a bounded, continuous modulation signal. Instead of acting as an absolute reward, the Judge softly reshapes the Actor’s self-consistency distribution, correcting systematic biases while avoiding unstable optimization.
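One hypothetical way to realize such bounded modulation (the functional form, `beta`, `lo`, and `hi` below are illustrative assumptions, not the paper's exact design): map a Judge score in [0, 1] to a clipped multiplicative factor, so the Judge can reweight trajectories but never dominate the self-consistency prior.

```python
import math

def modulate(prior_scores, judge_scores, beta=1.0, lo=0.5, hi=2.0):
    """Softly reshape the self-consistency prior with a frozen Judge.

    Sketch: a Judge score in [0, 1] becomes a bounded multiplicative
    factor via a clipped exponential. Scores above 0.5 upweight a
    trajectory, scores below 0.5 downweight it, and the [lo, hi] clip
    keeps the modulation bounded for stable optimization.
    """
    out = []
    for p, s in zip(prior_scores, judge_scores):
        factor = math.exp(beta * (s - 0.5))   # > 1 if the Judge approves
        factor = max(lo, min(hi, factor))     # bound the modulation
        out.append(p * factor)
    return out
```

Because the factor is clipped, even an extreme Judge score at most doubles or halves a trajectory's prior here, which is the "soft reshaping" behavior described above.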

📊 Group-Wise Distributional Optimization

Rather than optimizing trajectories independently, we model rewards in a group-wise, distributional manner. By converting absolute scores into relative advantages within each input group, our approach prevents early mode collapse and enables stable, long-term self-evolution.
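The conversion from absolute scores to within-group relative advantages can be sketched as a GRPO-style normalization (a minimal sketch under that assumption, not the paper's exact objective):

```python
def group_advantages(scores, eps=1e-6):
    """Convert absolute modulated scores into within-group relative
    advantages: subtract the group mean and divide by the group std,
    so only relative quality inside each input's group of sampled
    trajectories drives the policy update."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

Since the advantages of each group sum to zero, a uniformly scored group produces no net update, which discourages the policy from collapsing onto a single dominant answer early in training.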

Case Study

Case study: limitation of majority voting in unsupervised self-evolution
Limitation of majority voting in unsupervised self-evolution. Right: An example where the most frequent answer is incorrect. Majority voting reinforces this dominant error, while our method favors higher-quality reasoning paths through Judge modulation. Left: Results on MathVision and DynaMath show that our approach consistently outperforms majority-voting-based self-training.

Results

| Training Data | Method | MathVision | MathVerse | WeMath | LogicVista | DynaMath | Avg. |
|---|---|---|---|---|---|---|---|
| **Performance on Qwen2.5-VL-7B** | | | | | | | |
| – | Qwen2.5-VL-7B | 25.0 | 44.2 | 37.1 | 46.3 | 20.3 | 34.6 |
| **Unsupervised Self-Evolving** | | | | | | | |
| – | VisionZero (CLEVR) | 27.6 | 46.4 | 38.8 | 48.8 | 21.7 | 36.7 |
| – | VisionZero (real-world) | 27.4 | 46.8 | 38.5 | 49.1 | 21.3 | 36.6 |
| – | EvoLMM | 25.8 | 44.7 | 37.6 | 46.9 | 21.0 | 35.2 |
| MMR1 | +SFT<sup>U</sup> | 27.3 | 45.0 | 38.3 | 48.1 | 22.9 | 36.3 |
| MMR1 | +RL(GRPO)<sup>U</sup> | 29.3 | 47.4 | 39.3 | 49.4 | 23.3 | 37.7 |
| MMR1 | Major Vote | 26.4 | 46.0 | 38.6 | 47.9 | 21.8 | 36.1 |
| MMR1 | Ours | 28.4 | 46.4 | 38.8 | 48.6 | 23.0 | 37.0 |
| GeoQA | +SFT<sup>U</sup> | 27.6 | 45.3 | 38.6 | 47.8 | 22.6 | 36.4 |
| GeoQA | +RL(GRPO)<sup>U</sup> | 28.8 | 47.1 | 39.0 | 49.2 | 23.4 | 37.5 |
| GeoQA | Major Vote | 27.3 | 45.1 | 38.2 | 47.3 | 21.9 | 36.0 |
| GeoQA | Ours | 28.6 | 46.5 | 38.9 | 47.9 | 23.2 | 37.0 |
| Geo3K | +SFT<sup>U</sup> | 27.8 | 44.7 | 37.9 | 47.2 | 22.1 | 35.9 |
| Geo3K | +RL(GRPO)<sup>U</sup> | 29.1 | 46.9 | 39.1 | 49.6 | 23.8 | 37.7 |
| Geo3K | Major Vote | 27.5 | 44.0 | 37.4 | 46.9 | 21.4 | 35.4 |
| Geo3K | Ours | 30.9 | 46.8 | 38.7 | 49.0 | 24.2 | 37.9 |

Main results on multimodal mathematical reasoning benchmarks. We report accuracy (%) on five math benchmarks. Major Vote corresponds to the MM-UPT method; the superscript U denotes supervised training. VisionZero (CLEVR) and VisionZero (real-world) denote VisionZero-Qwen-7B trained on CLEVR and on real-world data, respectively.

Citation

@misc{wu2026modelsjudgethemselvesunsupervised,
      title={When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning},
      author={Zhengxian Wu and Kai Shi and Chuanrui Zhang and Zirui Liao and Jun Yang and Ni Yang and Qiuying Peng and Luyuan Zhang and Hangrui Xu and Tianhuang Su and Zhenyu Yang and Haonan Lu and Haoqian Wang},
      year={2026},
      eprint={2603.21289},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.21289},
}
This website is inspired by Absolute Zero. We sincerely thank the authors for their contributions to the community.