NeurIPS 2026 Evaluations & Datasets Track

SeePhys Pro

Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Kun Xiang1 Terry Jingchen Zhang2 Zirong Liu1 Bokai Zhou1 Yueling Tang1 Junjie Yu1 Jiacong Lu1 Shangrui Huang1 Heng Li1 Likui Zhang1 Kunkun Liu4 Changzheng Zhang4 Yangle Fang4 Boqiang Guo4 Hui-Ling Zhen4 Dandan Tu4,* Yinya Huang2,3 Xiaodan Liang1,*
1Sun Yat-sen University 2ETH Zurich 3ETH AI Center 4Huawei Technologies Ltd.

Introduction

SeePhys Pro is a fine-grained modality-transfer benchmark for multimodal physics reasoning. Each problem preserves the same physical semantics while progressively moving task-critical information from text into diagrams, revealing whether a model reasons over stable physics or over the surface form of the prompt.

  • 1,000 seed questions
  • 4,000 aligned variants
  • 6 disciplines
  • 38 domains
  • 104 fields

Benchmark Design

The benchmark is built on the principle of same physics, different representation: four aligned levels isolate the costs of structural transfer, visual variable grounding, and full-image rendering.

Overview of SeePhys Pro's four modality-transfer levels.
Four semantically aligned variants: L1 text-only, L2 structure in image, L3 structure and variables in image, and L4 full problem rendered as a single visual input.
  • L1: Text-only physics reasoning baseline.
  • L2: Physical structure moves into the diagram.
  • L3: Variables and labels must be read and grounded visually.
  • L4: Full handwritten statement and diagram are processed as one image.
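The four-level alignment can be pictured as one seed question carrying four representations of the same physical system. A minimal sketch of such a record follows; the keys and file names are hypothetical, not the benchmark's released schema.

```python
# Illustrative record for one seed question with its four aligned variants.
# All field names and file names here are assumptions for illustration only.
seed_question = {
    "id": "mech-0001",
    "answer": "2.0 m/s^2",  # one shared answer across all four levels
    "variants": {
        # L1: everything in text, no image.
        "L1": {"text": "full statement with structure and variables", "image": None},
        # L2: physical structure moves into the diagram.
        "L2": {"text": "statement with variables", "image": "structure.png"},
        # L3: variables and labels must be read off the diagram.
        "L3": {"text": "statement only", "image": "structure_with_labels.png"},
        # L4: the entire problem is a single handwritten image.
        "L4": {"text": None, "image": "handwritten_full_problem.png"},
    },
}
```

Because the answer and solution path are shared, any accuracy difference between levels can be attributed to the representation rather than the physics.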

Data Construction

Seed problems are curated from public datasets, textbooks, exam papers, olympiad archives, and physics problem books, then manually transformed into aligned multimodal variants with preserved answers and solution paths.

SeePhys Pro data construction engine.
  • Source-matched: benchmark and training corpora share a broad physics source pool but remain instance-disjoint.
  • Manually aligned: annotators redraw structure and variable layers while preserving the physical system.
  • Fine-grained metadata: discipline, field, domain, visual evidence type, and reasoning skill annotations support targeted analysis.

Test-Time Modality Transfer

Across evaluated models, average accuracy drops from 49.2% at Level 1 to 35.8% at Level 4. The largest average gap occurs between Level 2 and Level 3 (ΔV = 7.4 points), where models must ground variables and labels from images.

Model               L1    L2    L3    L4   Cons4   ΔS    ΔV    ΔR    ΔT
Human Performance  54.0  58.5  59.5  56.0   49.0  -4.5  -1.0   3.5  -2.0
GPT-5.4            67.4  64.1  55.8  53.0   32.6   3.3   8.3   2.8  14.4
GPT-5              41.8  32.9  23.8  23.2    8.9   8.9   9.1   0.5  18.5
Gemini-3.1-Pro     71.0  72.0  66.5  66.5   47.0  -1.0   5.5   0.0   4.5
Claude-4.7-Opus    74.0  67.0  56.5  46.5   33.5   7.0  10.5  10.0  27.5
Qwen-3.6-flash     61.4  59.3  49.9  48.4   29.9   2.1   9.4   1.5  13.0
Qwen3.5-27B        45.0  34.8  28.0  25.6    9.9  10.3   6.8   2.4  19.4
Gemma-4-31B-it     38.9  33.5  23.9  22.0    8.9   5.4   9.6   1.9  16.9
Average            49.2  46.1  38.7  35.8   21.4   3.0   7.4   2.9  13.4

Accuracy and consistency are percentages. Positive transfer gaps indicate degradation under a more visual representation; ΔS, ΔV, and ΔR measure the gaps in structural transfer (Level 1→2), variable grounding (Level 2→3), and rendering (Level 3→4), respectively. ΔT measures the overall gap between Level 1 and Level 4, while Cons4 measures consistency across all four levels.
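The table's metrics follow directly from per-question correctness at each level. A minimal sketch, assuming each evaluation record maps a level to a correct/incorrect flag (the function and field names here are ours, not the benchmark's code):

```python
LEVELS = ["L1", "L2", "L3", "L4"]

def level_metrics(results):
    """Compute per-level accuracy, Cons4, and transfer gaps.

    results: list of dicts mapping level name -> bool (answered correctly),
    one dict per seed question. Field names are illustrative assumptions.
    """
    n = len(results)
    # Per-level accuracy as a percentage.
    acc = {lv: 100.0 * sum(r[lv] for r in results) / n for lv in LEVELS}
    # Cons4: fraction of seed questions answered correctly at all four levels.
    cons4 = 100.0 * sum(all(r[lv] for lv in LEVELS) for r in results) / n
    # Positive gaps mean degradation under the more visual representation.
    gaps = {
        "dS": acc["L1"] - acc["L2"],  # structural transfer (L1 -> L2)
        "dV": acc["L2"] - acc["L3"],  # variable grounding  (L2 -> L3)
        "dR": acc["L3"] - acc["L4"],  # rendering           (L3 -> L4)
        "dT": acc["L1"] - acc["L4"],  # overall gap         (L1 -> L4)
    }
    return acc, cons4, gaps
```

Note that Cons4 is computed per question, not from the aggregate accuracies, which is why it can sit well below the Level 4 accuracy: a model may be right at every level on different subsets of questions.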

Training-Time Diagnostic

SeePhys Pro also tests whether multimodal RLVR improvements are visually grounded. A blind-training control masks all training images, yet still improves unmasked validation accuracy, showing that final-answer gains alone can overstate visual grounding.

Normal and blind RL diagnostic curves on SeePhys Pro.
Normal and blind RL both improve accuracy on Levels 1–4, but the total transfer gap and variable-grounding gap remain large.
  • PhysRL-40K / PhysRL-8K: Large-scale physics RL corpora built from the same source pool, with benchmark-disjoint instances.
  • Blind RL control: All training images are replaced by black images, while validation keeps normal visual inputs.
  • Gap dynamics: Accuracy can rise without closing modality-transfer gaps, especially the variable-grounding gap.
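The blind control amounts to one transform in the training data pipeline: training batches receive blacked-out images while validation batches are untouched. A minimal sketch, treating images as nested lists of pixel values (function names are ours, not the paper's code):

```python
def make_blind(image):
    """Replace an image with an all-black image of the same dimensions."""
    return [[0 for _ in row] for row in image]

def build_batch(samples, blind=False):
    """Assemble a batch; only training (blind=True) masks the images.

    samples: list of dicts with "text" and "image" keys (illustrative schema).
    """
    return [
        {"text": s["text"],
         "image": make_blind(s["image"]) if blind else s["image"]}
        for s in samples
    ]
```

Any reward improvement obtained under `blind=True` cannot come from visual grounding, so comparing the normal and blind runs separates visually grounded gains from gains driven by textual and distributional cues.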

Cross-Benchmark Controls

Blind gains are not unique to SeePhys Pro. Across external physics and math benchmarks, masked-image RL can recover a substantial fraction of normal RL gains, indicating sensitivity to residual textual and distributional cues.

Cross-benchmark normal and blind RL gain summary.

Citation

@article{xiang2026seephyspro,
  title   = {SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning},
  author  = {Xiang, Kun and Zhang, Terry Jingchen and Liu, Zirong and Zhou, Bokai and Tang, Yueling and Yu, Junjie and Lu, Jiacong and Huang, Shangrui and Li, Heng and Zhang, Likui and Liu, Kunkun and Zhang, Changzheng and Fang, Yangle and Guo, Boqiang and Zhen, Hui-Ling and Tu, Dandan and Huang, Yinya and Liang, Xiaodan},
  journal = {arXiv},
  year    = {2026}
}