NeurIPS 2026 Evaluations & Datasets Track

SeePhys Pro

Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Kun Xiang¹ Terry Jingchen Zhang² Zirong Liu¹ Bokai Zhou¹ Yueling Tang¹ Junjie Yu¹ Jiacong Lu¹ Shangrui Huang¹ Heng Li¹ Likui Zhang¹ Kunkun Liu⁴ Changzheng Zhang⁴ Yangle Fang⁴ Boqiang Guo⁴ Hui-Ling Zhen⁴ Dandan Tu⁴,* Yinya Huang²,³ Xiaodan Liang¹,*

¹Sun Yat-sen University ²ETH Zurich ³ETH AI Center ⁴Huawei Technologies Ltd.


Introduction

SeePhys Pro is a fine-grained modality-transfer benchmark for multimodal physics reasoning. Each problem preserves the same physical semantics while progressively moving task-critical information from text into diagrams, revealing whether a model reasons over stable physics or over the surface form of the prompt.

1,000 Seed Questions

4,000 Aligned Variants

Disciplines

Domains

104 Fields

Benchmark Design

The benchmark is built on the principle of "same physics, different representation". Four aligned levels decompose the cost of modality transfer into structural transfer, visual variable grounding, and full image rendering.

Overview of the four SeePhys Pro modality-transfer levels: four semantically aligned variants, namely L1 text-only, L2 structure in image, L3 structure and variables in image, and L4 full problem rendered as a single visual input.

L1: Text-only physics reasoning baseline.
L2: Physical structure moves into the diagram.
L3: Variables and labels must be read and grounded visually.
L4: Full handwritten statement and diagram are processed as one image.
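
As a compact restatement of the four levels, the sketch below records which parts of a problem are carried by the image at each level. The names `Level`, `IMAGE_CONTENT`, and `moved_into_image`, along with the component labels, are illustrative assumptions and not identifiers from the SeePhys Pro release.

```python
from enum import Enum


class Level(Enum):
    L1 = 1  # text-only
    L2 = 2  # structure in image
    L3 = 3  # structure and variables in image
    L4 = 4  # full problem rendered as one image


# Hypothetical component labels: which parts of the problem live in the image.
IMAGE_CONTENT = {
    Level.L1: frozenset(),
    Level.L2: frozenset({"structure"}),
    Level.L3: frozenset({"structure", "variables"}),
    Level.L4: frozenset({"structure", "variables", "statement"}),
}


def moved_into_image(prev: Level, curr: Level) -> frozenset:
    """Return the components that move from text into the image between two levels."""
    return IMAGE_CONTENT[curr] - IMAGE_CONTENT[prev]
```

Under this reading, each step between adjacent levels moves exactly one component into the image, which is what lets the benchmark attribute an accuracy drop to a single kind of transfer.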

Data Construction

Seed problems are curated from public datasets, textbooks, exam papers, olympiad archives, and physics problem books, then manually transformed into aligned multimodal variants with preserved answers and solution paths.

Source-matched: benchmark and training corpora share a broad physics source pool but remain instance-disjoint.
Manually aligned: annotators redraw structure and variable layers while preserving the physical system.
Fine-grained metadata: discipline, field, domain, visual evidence type, and reasoning skill annotations support targeted analysis.
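
One way the metadata annotations could support targeted analysis is by grouping per-item correctness by an annotation key. The sketch below assumes each evaluated item is a plain dictionary with a boolean `correct` entry plus metadata keys. The record layout, the key names, and the function `accuracy_by` are hypothetical, not the released schema.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy (in percent) for each distinct value of a metadata key."""
    totals = defaultdict(lambda: [0, 0])  # value -> [num_correct, num_total]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += int(record["correct"])
        bucket[1] += 1
    return {value: 100.0 * c / n for value, (c, n) in totals.items()}


# Placeholder records for illustration only; these are not benchmark data.
example = [
    {"field": "optics", "level": 3, "correct": True},
    {"field": "optics", "level": 3, "correct": False},
    {"field": "mechanics", "level": 3, "correct": True},
]
per_field = accuracy_by(example, "field")
```

The same helper can be called with a different key, such as a visual evidence type or reasoning skill annotation, provided the records carry that key.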

Test-Time Modality Transfer

Across evaluated models, average accuracy drops from 49.2% at Level 1 to 35.8% at Level 4, a total gap of 13.4 points. The largest average gap (7.4 points) occurs between Level 2 and Level 3, where models must ground variables and labels from images.

Model	Accuracy / consistency					Transfer gap
Model	L1	L2	L3	L4	Cons₄	Δ_S	Δ_V	Δ_R	Δ_T
Human Performance	54.0	58.5	59.5	56.0	49.0	-4.5	-1.0	3.5	-2.0
GPT-5.4	67.4	64.1	55.8	53.0	32.6	3.3	8.3	2.8	14.4
GPT-5	41.8	32.9	23.8	23.2	8.9	8.9	9.1	0.5	18.5
Gemini-3.1-Pro	71.0	72.0	66.5	66.5	47.0	-1.0	5.5	0.0	4.5
Claude-4.7-Opus	74.0	67.0	56.5	46.5	33.5	7.0	10.5	10.0	27.5
Qwen-3.6-flash	61.4	59.3	49.9	48.4	29.9	2.1	9.4	1.5	13.0
Qwen3.5-27B	45.0	34.8	28.0	25.6	9.9	10.3	6.8	2.4	19.4
Gemma-4-31B-it	38.9	33.5	23.9	22.0	8.9	5.4	9.6	1.9	16.9
Average	49.2	46.1	38.7	35.8	21.4	3.0	7.4	2.9	13.4

Accuracy and consistency are percentages. Positive transfer gaps indicate degradation under a more visual representation. Δ_S, Δ_V, and Δ_R measure the gaps in structural transfer (Level 1→2), variable grounding (Level 2→3), and rendering (Level 3→4), respectively. Δ_T measures the overall gap between Level 1 and Level 4, while Cons₄ measures consistency across all four levels.
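
The gap definitions in the note above amount to simple differences between level accuracies, sketched below. The function names are hypothetical. `cons4` additionally assumes that Cons₄ is the share of seed questions answered correctly at all four levels, which the page does not spell out.

```python
def transfer_gaps(l1, l2, l3, l4):
    """Compute transfer gaps in percentage points from level accuracies."""
    return {
        "delta_S": round(l1 - l2, 1),  # structural transfer, Level 1 -> 2
        "delta_V": round(l2 - l3, 1),  # variable grounding, Level 2 -> 3
        "delta_R": round(l3 - l4, 1),  # rendering, Level 3 -> 4
        "delta_T": round(l1 - l4, 1),  # overall, Level 1 -> 4
    }


def cons4(per_seed_correct):
    """ASSUMED definition: percent of seeds correct at all four levels.

    per_seed_correct is a list of 4-tuples of booleans, one tuple per seed
    question, ordered Level 1 to Level 4.
    """
    consistent = sum(1 for levels in per_seed_correct if all(levels))
    return 100.0 * consistent / len(per_seed_correct)


# Claude-4.7-Opus row from the table: L1=74.0, L2=67.0, L3=56.5, L4=46.5
gaps = transfer_gaps(74.0, 67.0, 56.5, 46.5)
# gaps == {"delta_S": 7.0, "delta_V": 10.5, "delta_R": 10.0, "delta_T": 27.5}
```

Recomputing gaps from the rounded table entries can differ from a tabulated gap by 0.1 (for example, the GPT-5 row gives 0.6 for Δ_R against the listed 0.5), presumably because the published gaps were computed before rounding.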

Training-Time Diagnostic

SeePhys Pro also tests whether multimodal RLVR improvements are visually grounded. A blind-training control masks all training images, yet still improves unmasked validation accuracy, showing that final-answer gains alone can overstate visual grounding.

Normal and blind RL diagnostic curves on SeePhys Pro: both normal and blind RL improve accuracy on Levels 1 to 4, but the total transfer gap and variable-grounding gap remain large.

PhysRL-40K / PhysRL-8K: Large-scale physics RL corpora built from the same source pool, with benchmark-disjoint instances.

Blind RL control: All training images are replaced by black images while validation keeps normal visual inputs.

Gap dynamics: Accuracy can rise without closing modality-transfer gaps, especially the variable-grounding gap.
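
The blind RL control described above can be sketched as a preprocessing step that blacks out images only on the training split. This is a minimal illustration, not the authors' pipeline: the `Image` container, `black_like`, and `prepare_batch` are hypothetical names, and images are assumed to be raw RGB byte buffers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Image:
    width: int
    height: int
    pixels: bytes  # raw RGB data, 3 bytes per pixel (assumed layout)


def black_like(img: Image) -> Image:
    """Return an all-black image with the same size as img."""
    return replace(img, pixels=bytes(len(img.pixels)))  # bytes(n) is n zero bytes


def prepare_batch(images, split: str, blind: bool):
    """Mask images only when training in blind mode; validation is untouched."""
    if blind and split == "train":
        return [black_like(img) for img in images]
    return list(images)
```

Keeping the image dimensions unchanged means the model still receives a visual input of the usual shape, so any gain under this control cannot come from the image content itself.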

Cross-Benchmark Controls

Blind gains are not unique to SeePhys Pro. Across external physics and math benchmarks, masked-image RL can recover a substantial fraction of normal RL gains, indicating sensitivity to residual textual and distributional cues.

Cross-benchmark normal and blind RL gain summary.
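
One plausible way to quantify "a substantial fraction of normal RL gains" is the ratio of the blind RL gain to the normal RL gain over the same base model. The function `blind_gain_fraction` is a hypothetical helper, the ratio is an assumed reading rather than the paper's stated metric, and the example numbers are placeholders, not reported results.

```python
def blind_gain_fraction(base_acc, normal_acc, blind_acc):
    """Fraction of the normal RL gain that blind RL recovers.

    All inputs are accuracies in percent on the same benchmark.
    """
    normal_gain = normal_acc - base_acc
    if normal_gain <= 0:
        raise ValueError("normal RL shows no gain over the base model")
    return (blind_acc - base_acc) / normal_gain


# Placeholder accuracies for illustration only.
fraction = blind_gain_fraction(base_acc=40.0, normal_acc=50.0, blind_acc=46.0)
```

A value near 1 would mean that masking the training images costs almost nothing, which points to textual or distributional cues rather than visual grounding as the source of the gain.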

Citation

@article{xiang2026seephyspro,
  title   = {SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning},
  author  = {Xiang, Kun and Zhang, Terry Jingchen and Liu, Zirong and Zhou, Bokai and Tang, Yueling and Yu, Junjie and Lu, Jiacong and Huang, Shangrui and Li, Heng and Zhang, Likui and Liu, Kunkun and Zhang, Changzheng and Fang, Yangle and Guo, Boqiang and Zhen, Hui-Ling and Tu, Dandan and Huang, Yinya and Liang, Xiaodan},
  journal = {arXiv},
  year    = {2026}
}