When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

1OPPO AI Center     2Tsinghua University     3Nanyang Technological University    4Hefei University of Technology
Equal contribution     ✉️Corresponding author
zx-wu24@mails.tsinghua.edu.cn, wanghaoqian@tsinghua.edu.cn

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data, which is costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models.

For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor’s self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates.

Proposed Method

Key Ideas

  • We propose a new framework for unsupervised post-training of large multimodal models, enabling sustained self-improvement without any external supervision.
  • Through extensive empirical analysis, we identify common failure modes in unsupervised self-evolution and mitigate them by modeling and optimizing the within-input relative structure among candidate solutions.
Proposed method overview
Overview of the proposed unsupervised self-evolution framework. The Actor generates multiple reasoning trajectories for the same input, while a frozen Judge provides bounded score modulation. The final rewards are optimized in a group-wise, distributional manner to enable stable policy updates without external supervision.

🧩 Self-Consistency as a Training Prior

Our framework samples multiple reasoning trajectories for the same input and uses their agreement patterns to construct a self-consistency distribution. This provides a noise-reduced training prior in fully unsupervised settings, without relying on ground-truth answers or external supervision.
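The agreement-based prior can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: each trajectory's prior score is the empirical frequency of its final answer within the sampled group, so mutual agreement serves as an unsupervised, noise-reduced reward signal.

```python
from collections import Counter

def self_consistency_prior(answers):
    """Build a self-consistency distribution over sampled final answers.

    Sketch only: score each trajectory by the empirical frequency of
    its final answer within the group, so agreement among trajectories
    acts as an unsupervised training prior (no ground-truth labels).
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Example: 5 sampled trajectories, 3 of which agree on "42"
print(self_consistency_prior(["42", "42", "17", "42", "9"]))
# -> [0.6, 0.6, 0.2, 0.6, 0.2]
```

Trajectories whose answers agree with more of their peers receive a higher prior, without any reference to a ground-truth label.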

⚖️ Bounded Judge-Based Modulation

We introduce a frozen Judge that evaluates each reasoning trajectory and produces a bounded, continuous modulation signal. Instead of acting as an absolute reward, the Judge softly reshapes the Actor’s self-consistency distribution, correcting systematic biases while avoiding unstable optimization.
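One hypothetical way to realize such bounded modulation (the functional form, `beta`, `lo`, and `hi` below are illustrative assumptions, not the paper's exact design): map a Judge score in [0, 1] to a clipped multiplicative factor, so the Judge can reweight trajectories but never dominate the self-consistency prior.

```python
import math

def modulate(prior_scores, judge_scores, beta=1.0, lo=0.5, hi=2.0):
    """Softly reshape the self-consistency prior with a frozen Judge.

    Sketch: a Judge score in [0, 1] becomes a bounded multiplicative
    factor via a clipped exponential. Scores above 0.5 upweight a
    trajectory, scores below 0.5 downweight it, and the [lo, hi] clip
    keeps the modulation bounded for stable optimization.
    """
    out = []
    for p, s in zip(prior_scores, judge_scores):
        factor = math.exp(beta * (s - 0.5))   # > 1 if the Judge approves
        factor = max(lo, min(hi, factor))     # bound the modulation
        out.append(p * factor)
    return out
```

Because the factor is clipped, even an extreme Judge score at most doubles or halves a trajectory's prior here, which is the "soft reshaping" behavior described above.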

📊 Group-Wise Distributional Optimization

Rather than optimizing trajectories independently, we model rewards in a group-wise, distributional manner. By converting absolute scores into relative advantages within each input group, our approach prevents early mode collapse and enables stable, long-term self-evolution.
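The conversion from absolute scores to within-group relative advantages can be sketched as a GRPO-style normalization (a minimal sketch under that assumption, not the paper's exact objective):

```python
def group_advantages(scores, eps=1e-6):
    """Convert absolute modulated scores into within-group relative
    advantages: subtract the group mean and divide by the group std,
    so only relative quality inside each input's group of sampled
    trajectories drives the policy update."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

Since the advantages of each group sum to zero, a uniformly scored group produces no net update, which discourages the policy from collapsing onto a single dominant answer early in training.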

Case Study

Case study: limitation of majority voting in unsupervised self-evolution
Limitation of majority voting in unsupervised self-evolution. Right: An example where the most frequent answer is incorrect. Majority voting reinforces this dominant error, while our method favors higher-quality reasoning paths through Judge modulation. Left: Results on MathVision and DynaMath show that our approach consistently outperforms majority-voting-based self-training.

Results

| Training Data | Method | MathVision | MathVerse | WeMath | LogicVista | DynaMath | Avg. |
|---|---|---|---|---|---|---|---|
| **Performance on Qwen2.5-VL-7B** | | | | | | | |
| – | Qwen2.5-VL-7B | 25.0 | 44.2 | 37.1 | 46.3 | 20.3 | 34.6 |
| **Unsupervised Self-Evolving** | | | | | | | |
| – | VisionZero (CLEVR) | 27.6 | 46.4 | 38.8 | 48.8 | 21.7 | 36.7 |
| – | VisionZero (real-world) | 27.4 | 46.8 | 38.5 | 49.1 | 21.3 | 36.6 |
| – | EvoLMM | 25.8 | 44.7 | 37.6 | 46.9 | 21.0 | 35.2 |
| MMR1 | +SFT<sup>U</sup> | 27.3 | 45.0 | 38.3 | 48.1 | 22.9 | 36.3 |
| MMR1 | +RL(GRPO)<sup>U</sup> | 29.3 | 47.4 | 39.3 | 49.4 | 23.3 | 37.7 |
| MMR1 | Major Vote | 26.4 | 46.0 | 38.6 | 47.9 | 21.8 | 36.1 |
| MMR1 | Ours | 28.4 | 46.4 | 38.8 | 48.6 | 23.0 | 37.0 |
| GeoQA | +SFT<sup>U</sup> | 27.6 | 45.3 | 38.6 | 47.8 | 22.6 | 36.4 |
| GeoQA | +RL(GRPO)<sup>U</sup> | 28.8 | 47.1 | 39.0 | 49.2 | 23.4 | 37.5 |
| GeoQA | Major Vote | 27.3 | 45.1 | 38.2 | 47.3 | 21.9 | 36.0 |
| GeoQA | Ours | 28.6 | 46.5 | 38.9 | 47.9 | 23.2 | 37.0 |
| Geo3K | +SFT<sup>U</sup> | 27.8 | 44.7 | 37.9 | 47.2 | 22.1 | 35.9 |
| Geo3K | +RL(GRPO)<sup>U</sup> | 29.1 | 46.9 | 39.1 | 49.6 | 23.8 | 37.7 |
| Geo3K | Major Vote | 27.5 | 44.0 | 37.4 | 46.9 | 21.4 | 35.4 |
| Geo3K | Ours | 30.9 | 46.8 | 38.7 | 49.0 | 24.2 | 37.9 |

Main results on multimodal mathematical reasoning benchmarks. We report accuracy (%) on five math benchmarks. Major Vote corresponds to the MM-UPT method; the superscript U denotes supervised training. VisionZero (CLEVR) and VisionZero (real-world) denote VisionZero-Qwen-7B trained on CLEVR and on real-world data, respectively.

Citation

@misc{wu2026modelsjudgethemselvesunsupervised,
      title={When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning},
      author={Zhengxian Wu and Kai Shi and Chuanrui Zhang and Zirui Liao and Jun Yang and Ni Yang and Qiuying Peng and Luyuan Zhang and Hangrui Xu and Tianhuang Su and Zhenyu Yang and Haonan Lu and Haoqian Wang},
      year={2026},
      eprint={2603.21289},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.21289},
}
This website is inspired by Absolute Zero. We sincerely thank the authors for their contributions to the community.