The Rail for Computational Imaging: A Physics World Model for Industrializing Image Reconstruction
The computational imaging community has built increasingly powerful reconstruction algorithms, yet real-world deployments routinely fail. We show that a 5-parameter sub-pixel operator mismatch -- well within manufacturing tolerances -- degrades the state-of-the-art CASSI transformer (MST-L) by 13.98 dB, erasing years of algorithmic progress. This paper argues that the bottleneck is not the solver but the infrastructure around it: evaluation protocols, physics representations, calibration pipelines, and benchmarks. Drawing on the SolveEverything.org framework, we present the Physics World Model (PWM) as the "rail" for computational imaging -- a standardized evaluation harness comprising: (i) the OperatorGraph intermediate representation (IR), a universal directed acyclic graph (DAG) formalism spanning 64 modalities across 5 physical carriers with 89 validated templates; (ii) a 4-scenario evaluation protocol separating solver quality from operator fidelity; (iii) the Leaderboard for Imaging Physics (LIP-Arena), a prospective Commit-Measure-Score competition designed to eliminate benchmark overfitting; and (iv) a Red Team adversarial verification module. Across a 26-modality benchmark, we demonstrate that operator correction improves reconstruction by +0.54 to +48.25 dB over 9 correction configurations spanning 7 distinct modalities, with mismatch (Gate 3) identified as the binding constraint in every modality tested. PWM provides the infrastructure to move computational imaging from artisanal practice to industrial standardization.
1. Introduction
In 2022, the Mask-guided Spectral-wise Transformer (MST-L) achieved 34.81 dB on the standard CASSI benchmark -- a number that represents years of sustained algorithmic progress in hyperspectral image reconstruction. Yet when the coded aperture mask is subjected to a 5-parameter perturbation (sub-pixel shift, rotation, dispersion drift, and spectral axis tilt) -- well within the manufacturing and alignment tolerances of any real optical system -- the same method collapses to 20.83 dB, a loss of 13.98 dB. Under this sub-pixel mismatch model, deep-learning solvers (HDNet, MST-S, MST-L) lose 12.78-13.98 dB. To put this in perspective, the entire history of CASSI reconstruction -- from GAP-TV at 24.34 dB to MST-L at 34.81 dB -- spans only 10.47 dB. A sub-pixel operator mismatch erases years of algorithmic improvement.
This is not an isolated failure. Across modalities -- CASSI, CACTI, single-pixel cameras (SPC) -- the pattern repeats: the most powerful neural solvers exhibit the largest sensitivity to forward-model error. Classical methods such as GAP-TV lose only 3.38 dB under the same mismatch, precisely because they lack the capacity to overfit the operator. The paradox is stark: the field has been building faster and faster trains while ignoring the fact that the rails are broken.
The thesis. We argue that the computational imaging community is optimizing the wrong layer. The bottleneck is not the solver -- the reconstruction algorithm that maps measurements to images -- but the infrastructure around it: the evaluation protocols, the physics representations, the calibration pipelines, and the benchmarks that determine what counts as progress. In the language of industrial revolutions, solvers are trains -- visible, glamorous, and destined to commoditize. The compounding, durable value lies in the rails: the standards, metrics, and institutional infrastructure that make it possible for any train to run reliably.
This paper presents the Physics World Model (PWM) as precisely such a rail for computational imaging. PWM is not a new reconstruction algorithm. It is an evaluation harness -- a standardized infrastructure layer that makes it possible to measure, compare, and improve reconstruction methods under realistic physical conditions.
Contributions. Our specific contributions are: (1) SolveEverything mapping to imaging -- the first systematic mapping of the SolveEverything Industrial Intelligence Stack to computational imaging; (2) OperatorGraph IR as universal physics representation -- a graph-based intermediate representation for forward measurement operators unifying 64 modalities; (3) 26-modality benchmark evidence demonstrating that neural solvers systematically amplify operator mismatch; (4) Operator correction across 9 correction configurations spanning 7 distinct modalities; (5) LIP-Arena prospective evaluation protocol -- the first Commit-Measure-Score competition for computational imaging.
2. The Problem: Solver-Only Optimization
The computational imaging community has invested enormous effort in building better reconstruction algorithms -- deeper networks, more sophisticated architectures, larger training sets -- while treating the forward measurement operator as a fixed, known quantity. This section demonstrates that this assumption is catastrophically wrong, and that the resulting solver-only optimization paradigm produces an illusion of progress that collapses on contact with physical reality.
The Mask-Sensitivity Spectrum
Under ideal conditions, MST-L leads the field at 34.81 dB -- a full 10.47 dB above GAP-TV. Under mismatch, MST-L collapses to 20.83 dB, while GAP-TV drops only to 20.96 dB. The classical method matches the state-of-the-art transformer under realistic physical conditions. The degradation column tells the story most clearly: GAP-TV loses 3.38 dB, PnP-HSICNN loses 6.02 dB (estimated), and MST-L loses 13.98 dB. There is a near-perfect inverse relationship between ideal-condition performance and mismatch robustness.
HDNet presents a partial counterexample: it achieves high ideal PSNR (34.66 dB) with a degradation of 12.78 dB, retaining 21.88 dB under mismatch. However, its Oracle Gain is zero (+0.00 dB), indicating that its mask-conditioning pathway cannot exploit an improved operator estimate -- a hallmark of mask-oblivious architectures.
Evaluation Fragmentation
Even if every lab evaluated mismatch sensitivity, the results would remain incomparable due to pervasive evaluation fragmentation: different metrics (PSNR, SSIM, SAM, LPIPS), different datasets (KAIST, CAVE, ICVL), different noise models (noiseless, Gaussian, Poisson-Gaussian), and different mismatch definitions. The result is a literature in which every paper reports numbers that cannot be meaningfully compared with any other paper. This is precisely the condition that the SolveEverything framework calls The Muddle.
The Scale of the Challenge
PWM's current registry includes 64 distinct modalities spanning five physical carriers. For each modality, at least 5 types of operator mismatch are physically relevant. The solver space includes at least 10 competitive reconstruction methods per modality. The resulting evaluation space is on the order of 64 modalities x 5 mismatch types x 10 solvers = 3,200 experiments (minimum). No ad hoc evaluation effort can cover this space. What is needed is standardized infrastructure for evaluation.
3. The SolveEverything Framework for Imaging
Four Stages of Revolution
Stage 1: Legibility. Can you measure the problem? Currently, the answer is no -- the field uses incommensurable metrics, datasets, noise models, and mismatch definitions.
Stage 2: Harnessing. Can you capture the energy? This requires standardized evaluation so that improvements measured by one group are meaningful to all others.
Stage 3: Institutionalization. Can you scale it? This is the stage at which LIP-Arena and the Red Team module operate: automated, prospective evaluation at scale.
Stage 4: Abundance. Is it a commodity? In the end state, every imaging system ships with a self-calibrating operator model that continuously updates itself.
Rails vs. Trains
In computational imaging, the trains are reconstruction algorithms -- GAP-TV, MST-L, HDNet, and their successors. The rails are the infrastructure components that remain valuable regardless of which solver is running: the OperatorGraph IR, the 4-scenario evaluation protocol, the LIP-Arena, the calibration pipelines, and the governance structures. PWM is designed entirely as a rail.
The L0-L5 Maturity Ladder
Six maturity levels: L0 (The Muddle), L1 (Measurable), L2 (Repeatable), L3 (Automated), L4 (Industrialized), L5 (Commoditized). The field is currently transitioning from L1 to L2. The most impactful transition is from L2 (repeatable, but manual) to L3 (automated) -- the transition from artisanal craft to industrial process.
4. PWM Architecture: The Evaluation Harness
OperatorGraph IR: A Universal DAG for Physics
At the core of PWM lies the OperatorGraph IR -- a directed acyclic graph (DAG) formalism that represents any forward measurement operator as a composition of primitive physical operations. Nodes include: mask modulation, convolution, spectral dispersion, temporal modulation, projection, Fourier sampling, noise injection, and sensor integration. Edges define data flow. The primitive library spans five physical carriers (photons, electrons, spins, acoustic waves, particles). PWM includes 89 validated OperatorGraph templates covering 64 distinct modalities.
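To make the formalism concrete, the sketch below assembles a toy CASSI-like chain from three primitives (mask modulation, spectral dispersion, sensor integration). The OperatorGraph API, node names, and toy physics here are illustrative assumptions for exposition, not the released IR.

```python
# Illustrative sketch of an OperatorGraph-style chain for a CASSI-like system.
# The class, node names, and toy physics are assumptions, not the released IR.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

import numpy as np

@dataclass
class OperatorNode:
    name: str                                   # e.g. "mask_modulation"
    apply: Callable[[np.ndarray], np.ndarray]   # primitive physical operation

@dataclass
class OperatorGraph:
    nodes: Dict[str, OperatorNode] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # data-flow edges

    def add(self, node: OperatorNode, after: Optional[str] = None) -> None:
        self.nodes[node.name] = node
        if after is not None:
            self.edges.append((after, node.name))

    def forward(self, x: np.ndarray, order: List[str]) -> np.ndarray:
        # For a chain-structured graph, forward simulation is just composition
        # of the primitives in topological order.
        for name in order:
            x = self.nodes[name].apply(x)
        return x

# Toy chain: mask modulation -> spectral dispersion -> sensor integration.
mask = (np.random.rand(64, 64) > 0.5).astype(float)
g = OperatorGraph()
g.add(OperatorNode("mask_modulation", lambda cube: cube * mask[..., None]))
g.add(OperatorNode("spectral_dispersion",
                   lambda cube: np.stack([np.roll(cube[..., l], l, axis=1)
                                          for l in range(cube.shape[-1])], axis=-1)),
      after="mask_modulation")
g.add(OperatorNode("sensor_integration", lambda cube: cube.sum(axis=-1)),
      after="spectral_dispersion")

y = g.forward(np.random.rand(64, 64, 8),
              ["mask_modulation", "spectral_dispersion", "sensor_integration"])
```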
The 4-Scenario Evaluation Protocol
Scenario I (Ideal): True operator for both measurement and reconstruction -- the oracle upper bound. Scenario II (Assumed/Mismatch): True operator for measurement, nominal operator for reconstruction -- the realistic deployment condition. Scenario III (Corrected): True operator for measurement, calibrated operator for reconstruction -- the calibration benefit. Scenario IV (Oracle Mask): True operator for reconstruction on mismatched data -- the correction ceiling.
Derived metrics include: Degradation = PSNR_I - PSNR_II, Calibration Gain = PSNR_III - PSNR_II, Recovery Ratio = (PSNR_III - PSNR_II) / (PSNR_I - PSNR_II), and Oracle Gap = PSNR_I - PSNR_III.
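These definitions are plain arithmetic over the scenario PSNRs. The helper below (function and argument names are ours, chosen for illustration) makes them concrete, using the MST-L CASSI numbers from Section 7 together with a hypothetical corrected-scenario value.

```python
# Derived metrics over the four-scenario PSNRs (helper names are illustrative).
def derived_metrics(psnr_ideal: float, psnr_mismatch: float, psnr_corrected: float) -> dict:
    """Compute the protocol's derived metrics (all inputs in dB)."""
    degradation = psnr_ideal - psnr_mismatch            # cost of Scenario II mismatch
    calibration_gain = psnr_corrected - psnr_mismatch   # benefit of Scenario III correction
    recovery_ratio = (calibration_gain / degradation) if degradation != 0 else float("nan")
    oracle_gap = psnr_ideal - psnr_corrected            # headroom left after correction
    return {
        "degradation": degradation,
        "calibration_gain": calibration_gain,
        "recovery_ratio": recovery_ratio,
        "oracle_gap": oracle_gap,
    }

# MST-L on CASSI (Section 7): Scenario I = 34.81 dB, Scenario II = 20.83 dB;
# the corrected value below is hypothetical, purely to exercise the formulas.
print(derived_metrics(psnr_ideal=34.81, psnr_mismatch=20.83, psnr_corrected=33.0))
```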
LIP-Arena: Prospective Evaluation via Commit-Measure-Score
The LIP-Arena is a prospective evaluation competition in which test data is generated after all submissions are finalized. The key principle is: data created after deadline -- no memorization possible. The four-phase protocol: Phase 1 (Commit, 2 weeks), Phase 2 (Measure, 2 weeks), Phase 3 (Execute, 1 week), Phase 4 (Score, 1 week). Four evaluation tracks: Correct, Diagnose, No-GT, and Design. Anti-Goodhart scoring via prospective dominance weighting, gaming penalties, and multi-metric ranking.
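The anti-memorization guarantee hinges on submissions being frozen before any test data exists. One simple way to realize such a commitment (a sketch with an assumed file layout, not the Arena's actual mechanism) is to publish a cryptographic digest of the submission at the Phase 1 deadline.

```python
# Sketch of a Phase 1 commitment: hash the frozen submission so that the Phase 4
# score provably refers to code committed before any test data was measured.
import hashlib
import pathlib

def commit_submission(submission_dir: str) -> str:
    """Return a SHA-256 digest over all files in a submission directory."""
    h = hashlib.sha256()
    for path in sorted(pathlib.Path(submission_dir).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(submission_dir)).encode())  # stable file identity
            h.update(path.read_bytes())                               # file contents
    return h.hexdigest()

# Phase 1 (Commit): publish the digest. Phases 2-4 then measure, execute, and
# score against data generated only after the digest is on record, e.g.:
# digest = commit_submission("my_lip_arena_entry/")
```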
Red Team Module: Adversarial Verification
The Red Team module comprises six categories of adversarial tests totaling over 2,900 pre-designed scenarios: novel mismatch, compound mismatch, out-of-family physics, distribution shift, compute traps, and gate-flip scenarios.
5. The Triad Law: Diagnosing Imaging Failure
We identify three principal root causes that account for the vast majority of imaging failures across modalities.
Gate 1 -- Recoverability (Sampling). Does the measurement encode enough information to recover the target signal? Gate 1 is violated whenever the null space of the forward operator is too large.
Gate 2 -- Carrier Budget (Noise). Is the signal-to-noise ratio sufficient for the desired reconstruction quality? Even when the forward operator is well-posed, the photon budget, dose, or quantum efficiency may be inadequate.
Gate 3 -- Operator Mismatch (System Fidelity). Does the assumed model match the true physics? When H_true does not equal H_nom, the reconstruction algorithm inverts the wrong operator.
Every benchmark submission must include a TriadReport -- a structured artifact containing: dominant gate ID, evidence scores, confidence interval, and recommended action. The central empirical finding is that Gate 3 (operator mismatch) is the binding constraint in every modality tested.
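A minimal sketch of what such an artifact might look like as a typed record is given below; the field and gate names are inferred from the description above and are not the normative schema.

```python
# Minimal sketch of a TriadReport record (field names inferred from the text).
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass(frozen=True)
class TriadReport:
    dominant_gate: Literal["G1_recoverability", "G2_carrier_budget", "G3_operator_mismatch"]
    evidence_scores: Tuple[float, float, float]   # per-gate evidence, e.g. normalized to [0, 1]
    confidence_interval: Tuple[float, float]      # CI on the dominant-gate attribution
    recommended_action: str                       # e.g. "recalibrate the dispersion model"

report = TriadReport(
    dominant_gate="G3_operator_mismatch",
    evidence_scores=(0.12, 0.25, 0.91),           # illustrative values only
    confidence_interval=(0.85, 0.97),
    recommended_action="recalibrate coded-aperture registration before retraining the solver",
)
```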
6. Multi-Agent Orchestration
PWM automates the full imaging pipeline through a multi-agent architecture comprising 9 pipeline components (including 8 agents and the RunBundle packager) and 8 support classes, totaling 10,545 lines of Python. All agents run deterministically without requiring a large language model. The pipeline flow: User Prompt -> PlanAgent -> PhotonAgent + MismatchAgent + RecoverabilityAgent -> AnalysisAgent -> Negotiator -> PreFlightReportBuilder -> Pipeline Runner -> RunBundle.
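A toy rendering of that deterministic flow is sketched below; the agent interfaces and stubbed behaviors are placeholders intended only to show the fixed call order, not the actual implementations.

```python
# Toy sketch of the deterministic agent flow (interfaces and stubs are placeholders).
from typing import Callable, Dict

Agent = Callable[[dict], dict]

def run_pipeline(prompt: str, agents: Dict[str, Agent]) -> dict:
    state: dict = {"prompt": prompt}
    state["plan"] = agents["plan"](state)                   # PlanAgent
    state["gates"] = {name: agents[name](state)             # gate agents share the same plan
                      for name in ("photon", "mismatch", "recoverability")}
    state["analysis"] = agents["analysis"](state)           # AnalysisAgent
    state["decision"] = agents["negotiator"](state)         # Negotiator
    state["preflight"] = agents["preflight"](state)         # PreFlightReportBuilder
    return state  # the Pipeline Runner would execute and package a RunBundle from here

# Trivial stubs, just to demonstrate the call order end to end.
stubs: Dict[str, Agent] = {name: (lambda s, name=name: {"agent": name, "ok": True})
                           for name in ("plan", "photon", "mismatch", "recoverability",
                                        "analysis", "negotiator", "preflight")}
result = run_pipeline("reconstruct a CASSI scene under the nominal mask", stubs)
```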
The contract ecosystem comprises 25 Pydantic models, all inheriting from StrictBaseModel, backed by 9 YAML registries totaling 7,034 lines.
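The sketch below shows a plausible shape for such a contract; the StrictBaseModel configuration and the example spec are assumptions based on this description, not the released code.

```python
# Sketch of a strict Pydantic v2 contract base and one hypothetical registry-backed spec.
from pydantic import BaseModel, ConfigDict, Field

class StrictBaseModel(BaseModel):
    # Assumed configuration: unknown fields are rejected and instances are frozen,
    # so typos in registry-driven configs fail loudly instead of passing silently.
    model_config = ConfigDict(extra="forbid", frozen=True)

class CassiMaskSpec(StrictBaseModel):
    # Hypothetical contract for a coded-aperture specification.
    height_px: int = Field(gt=0)
    width_px: int = Field(gt=0)
    shift_x_px: float = 0.0     # sub-pixel registration offset along x
    shift_y_px: float = 0.0     # sub-pixel registration offset along y
    rotation_deg: float = 0.0

spec = CassiMaskSpec(height_px=256, width_px=256, shift_x_px=0.4)  # validated on construction
```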
7. Empirical Evidence
26-Modality Benchmark
All 26 registered modalities pass the benchmark threshold under Scenario I (ideal operator). Average PSNR is approximately 37.0 dB (excluding Phase Retrieval identity test); range 25.46-64.84 dB over physically meaningful modalities. The range reflects the diversity of forward models -- from the heavily ill-posed CT (25.46 dB) to OCT (64.84 dB).
Operator Correction Across Modalities
Improvements range from +0.54 dB (CASSI Alg 1) to +48.25 dB (MRI coil-sensitivity correction, the most dramatic case). CACTI mask-timing correction yields +22.94 dB; SPC and Matrix gain-bias correction, +12.21 dB; CT center-of-rotation correction, +10.67 dB; Ptychography position-offset correction, +7.09 dB; and Lensless PSF-shift correction, +3.55 dB.
CASSI Deep Dive
The mismatch cost -- the gap from Scenario I to Scenario II -- is catastrophic. For MST-L the drop is 34.81 - 20.83 = 13.98 dB, while the best solver upgrade under ideal conditions (GAP-TV to MST-L) yields 34.81 - 24.34 = 10.47 dB. The mismatch penalty is 1.34x the solver upgrade gain. Fixing the operator is worth more than upgrading from a classical solver to the state of the art.
8. Reproducibility and Open Infrastructure
Every imaging experiment executed through PWM produces a RunBundle v0.3.0 -- a self-contained, immutable archive. Every artifact is hashed with SHA-256. Decision Records for Imaging Systems (DR-IS) log every calibration decision cryptographically. The ExperimentSpec v0.2.1 eliminates ambiguity by requiring that every parameter be drawn from a typed, validated registry. The PWM evaluation harness is released under the MIT license, with 3,743 tests spanning unit, integration, regression, and end-to-end suites, all passing.
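One simple way to make such a decision log tamper-evident (a sketch; the actual DR-IS format is not specified in this paper) is a hash chain in which every record commits to its predecessor.

```python
# Sketch of a hash-chained calibration-decision log in the spirit of DR-IS
# (the record format and field names are assumptions for illustration).
import hashlib
import json
import time

def append_decision(log: list, decision: dict) -> None:
    """Append a decision record whose hash commits to the previous record."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"timestamp": time.time(), "decision": decision, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

log: list = []
append_decision(log, {"action": "accept_mask_shift", "shift_x_px": 0.42, "gate": "G3"})
append_decision(log, {"action": "rerun_scenario_III", "solver": "MST-L"})
# Editing any earlier record invalidates the hash of every subsequent record.
```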
9. The Foundry Window and Roadmap
Computational imaging in 2026 lacks every piece of shared infrastructure that enabled the AlphaFold revolution: no CASP equivalent, no Protein Data Bank equivalent, and no universal evaluation protocol. The field is not merely fragmented -- it is pre-paradigmatic.
The roadmap is organized into three six-month phases: Phase 1 (Months 1-6): Become the default evaluation infrastructure via pip install pwm-eval, replication packs, and external laboratory validation. Phase 2 (Months 7-12): Launch the CISP Public Competition with a public leaderboard and blinded test sets. Phase 3 (Months 13-18): Become the Action Network with robotic lab API, compute escrow, calibration-as-a-service API, and outcome-based contracts.
10. Call to Action
Professors: Co-steward the LIP-Arena evaluation tracks. Validate your group's methods on the four-scenario protocol and publish the full diagnostic vector. Serve on the CISP steward board.
PhD Students: Join the calibration sprints. Pick one of the 64 registered modalities, measure its mask-sensitivity spectrum on physical hardware, and contribute a row to the correction table.
Hobbyists: Participate in the weekly SolveEverything challenges. The entire infrastructure is released under the MIT license: zero cost, zero barrier to entry.
Investors: The flywheel economics create three revenue surfaces: calibration-as-a-service, pay-per-reconstruction APIs, and imaging SLAs.
11. Conclusion
We began with a paradox: computational imaging possesses solvers of extraordinary power, yet real-world deployments routinely fail. Reconstructions that achieve 35+ dB on simulated benchmarks collapse by 3.38-13.98 dB under realistic operator mismatch. The gap is not a solver problem. It is an infrastructure problem.
This paper introduced PWM, the Physics World Model, as the rail for computational imaging. The contributions are concrete: 64 modalities formalized as composable OperatorGraph templates, 89 validated templates, the four-scenario evaluation protocol, LIP-Arena, and the Triad Law. Across the 26-modality benchmark, operator correction alone improved reconstruction quality by +0.54 to +48.25 dB across 9 correction configurations. The CASSI deep dive was particularly revealing: the mismatch penalty (13.98 dB) was 1.34x the solver upgrade gain (10.47 dB). The bottleneck was never the algorithm. It was the physics encoding.
A problem is solved when the bottleneck shifts from genius to compute. PWM provides the infrastructure for that shift.