Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these gains largely rely on high-quality annotated data, which is costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without human-annotated answers or external reward models.
For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor’s self-consistency signal as a training prior and introduce a bounded, Judge-based modulation that continuously reweights trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates.
Our framework samples multiple reasoning trajectories for the same input and uses their agreement patterns to construct a self-consistency distribution. This provides a noise-reduced training prior in fully unsupervised settings, without relying on ground-truth answers or external supervision.
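The self-consistency prior can be sketched as follows. This is a minimal illustration (function and variable names are our own, not from the paper): each trajectory's prior is simply the fraction of the group whose extracted final answer agrees with its own.

```python
from collections import Counter

def self_consistency_prior(answers):
    """Estimate a per-trajectory prior from answer agreement.

    `answers` holds the final answer extracted from each sampled
    trajectory for one input; a trajectory's prior is the fraction
    of the group that agrees with its answer. No ground truth is used.
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Example: 4 trajectories, 3 of which agree on "42"
print(self_consistency_prior(["42", "42", "7", "42"]))
# [0.75, 0.75, 0.25, 0.75]
```

Trajectories in the majority cluster receive a higher prior, which is what makes the signal a noise-reduced substitute for a ground-truth reward.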
We introduce a frozen Judge that evaluates each reasoning trajectory and produces a bounded, continuous modulation signal. Instead of acting as an absolute reward, the Judge softly reshapes the Actor’s self-consistency distribution, correcting systematic biases while avoiding unstable optimization.
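One way to realize such a bounded modulation is sketched below. The exact functional form and the `lam` hyperparameter are illustrative assumptions, not the paper's formula; the point is that a `tanh`-bounded correction lets the Judge nudge, but never dominate, the Actor's own prior.

```python
import math

def modulate(prior, judge_score, lam=0.5):
    """Softly reshape a self-consistency prior with a frozen Judge.

    `judge_score` in [0, 1] is the Judge's quality estimate for one
    trajectory. The tanh keeps the multiplicative correction bounded,
    so the Judge corrects systematic biases in the prior without
    acting as an unbounded absolute reward.
    """
    delta = math.tanh(lam * (judge_score - 0.5))  # bounded correction
    return max(0.0, prior * (1.0 + delta))
```

A neutral Judge (`judge_score = 0.5`) leaves the prior unchanged, while confident approval or disapproval shifts it by at most a bounded factor.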
Rather than optimizing trajectories independently, we model rewards in a group-wise, distributional manner. By converting absolute scores into relative advantages within each input group, our approach prevents early mode collapse and enables stable, long-term self-evolution.
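The within-group normalization can be sketched as the standard GRPO-style mean/std transform; the helper name below is our own, hypothetical choice.

```python
def group_relative_advantages(scores, eps=1e-6):
    """Convert absolute per-trajectory scores into relative
    advantages within one input group (mean/std normalization).

    Because advantages are centered per group, a uniformly high- or
    low-scoring group contributes no gradient pressure, which helps
    prevent early mode collapse.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5
    return [(s - mean) / (std + eps) for s in scores]

print(group_relative_advantages([1.0, 0.5, 0.0, 0.5]))
```

Only the trajectory above the group mean gets a positive advantage here; the group's advantages always sum to zero.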
| Training Data | Method | MathVision | MathVerse | WeMath | LogicVista | DynaMath | Avg. |
|---|---|---|---|---|---|---|---|
| **Performance on Qwen2.5-VL-7B** | | | | | | | |
| — | Qwen2.5-VL-7B | 25.0 | 44.2 | 37.1 | 46.3 | 20.3 | 34.6 |
| **Unsupervised Self-Evolving** | | | | | | | |
| — | VisionZero | 27.6 | 46.4 | 38.8 | 48.8 | 21.7 | 36.7 |
| — | VisionZero | 27.4 | 46.8 | 38.5 | 49.1 | 21.3 | 36.6 |
| — | EvoLMM | 25.8 | 44.7 | 37.6 | 46.9 | 21.0 | 35.2 |
| MMR1 | +SFT (U) | 27.3 | 45.0 | 38.3 | 48.1 | 22.9 | 36.3 |
| MMR1 | +RL (GRPO, U) | 29.3 | 47.4 | 39.3 | 49.4 | 23.3 | 37.7 |
| MMR1 | Major Vote | 26.4 | 46.0 | 38.6 | 47.9 | 21.8 | 36.1 |
| MMR1 | Ours | 28.4 | 46.4 | 38.8 | 48.6 | 23.0 | 37.0 |
| GeoQA | +SFT (U) | 27.6 | 45.3 | 38.6 | 47.8 | 22.6 | 36.4 |
| GeoQA | +RL (GRPO, U) | 28.8 | 47.1 | 39.0 | 49.2 | 23.4 | 37.5 |
| GeoQA | Major Vote | 27.3 | 45.1 | 38.2 | 47.3 | 21.9 | 36.0 |
| GeoQA | Ours | 28.6 | 46.5 | 38.9 | 47.9 | 23.2 | 37.0 |
| Geo3K | +SFT (U) | 27.8 | 44.7 | 37.9 | 47.2 | 22.1 | 35.9 |
| Geo3K | +RL (GRPO, U) | 29.1 | 46.9 | 39.1 | 49.6 | 23.8 | 37.7 |
| Geo3K | Major Vote | 27.5 | 44.0 | 37.4 | 46.9 | 21.4 | 35.4 |
| Geo3K | Ours | 30.9 | 46.8 | 38.7 | 49.0 | 24.2 | 37.9 |
```bibtex
@misc{wu2026modelsjudgethemselvesunsupervised,
  title={When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning},
  author={Zhengxian Wu and Kai Shi and Chuanrui Zhang and Zirui Liao and Jun Yang and Ni Yang and Qiuying Peng and Luyuan Zhang and Hangrui Xu and Tianhuang Su and Zhenyu Yang and Haonan Lu and Haoqian Wang},
  year={2026},
  eprint={2603.21289},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.21289},
}
```