Abstract
Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities.
We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
Real-Time Streaming Inference
GenMatter runs in real time while conditioning on monocular depth, optical flow, and DINO features, tracking the moving 3D matter that makes up deforming objects frame by frame. Input frames (left) are preprocessed to extract these features, lifting each pixel to a 3D point tagged with its velocity. Inference conditions on the moving 3D points to recover 3D Gaussian particles (center), filled by average color and highlighted by cluster assignment, shown from a panning viewpoint alongside the input camera and two novel views. Re-rendering the particles yields an RGB reconstruction and an object segmentation (right). The on-screen readout indicates the performance of the different inference stages.
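The lifting step described above (each pixel becomes a 3D point tagged with a velocity) can be illustrated with a minimal NumPy sketch. It is not the GenMatter preprocessing code. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, metric depth maps for two consecutive frames, and a nearest-neighbour depth lookup at the flow target; the function name `lift_pixels_to_3d` is illustrative only.

```python
import numpy as np

def lift_pixels_to_3d(depth, flow, depth_next, fx, fy, cx, cy):
    """Unproject every pixel to a 3D point and estimate its 3D velocity.

    depth, depth_next: (H, W) depth maps for frames t and t+1.
    flow: (H, W, 2) optical flow in pixels, (du, dv) from frame t to t+1.
    fx, fy, cx, cy: assumed pinhole intrinsics (hypothetical values).
    """
    H, W = depth.shape
    # v indexes rows, u indexes columns.
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)

    def unproject(u, v, z):
        x = (u - cx) / fx * z
        y = (v - cy) / fy * z
        return np.stack([x, y, z], axis=-1)

    points = unproject(u, v, depth)

    # Follow the flow to the corresponding pixel in the next frame.
    u2 = u + flow[..., 0]
    v2 = v + flow[..., 1]
    # Nearest-neighbour depth lookup, clamped to the image bounds.
    ui = np.clip(np.rint(u2).astype(int), 0, W - 1)
    vi = np.clip(np.rint(v2).astype(int), 0, H - 1)
    points_next = unproject(u2, v2, depth_next[vi, ui])

    # Per-frame 3D displacement serves as the velocity tag.
    velocity = points_next - points
    return points.reshape(-1, 3), velocity.reshape(-1, 3)
```

A real pipeline would also need to handle occlusions and invalid depth, which this sketch ignores.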
Generative Matter Model and Inference
Generative matter model
GenMatter is a two-level hierarchical generative model for structured motion. At the lower level, data points are grouped into particles, where each particle is a local Gaussian representing a small piece of moving matter. At the higher level, particles are grouped into clusters, which represent coherent, independently moveable physical entities.
Particles, clusters, and motion
Each cluster is parameterized by a Gaussian over space and a rigid-body transformation. Particles inherit this cluster-level motion while retaining their own spatial extent and velocity covariance, giving the model enough flexibility to explain both rigid and deformable objects. This hierarchy lets local matter move with a coherent entity while allowing within-cluster slack for non-rigid motion.
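To make the two-level hierarchy concrete, the following is a toy ancestral-sampling sketch in 2D. It is an illustration under stated assumptions, not the model specification from the paper: the isotropic Gaussians, the prior ranges, the noise scales, and the function names `rotation_2d` and `sample_scene` are all hypothetical choices.

```python
import numpy as np

def rotation_2d(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def sample_scene(rng, n_clusters=2, particles_per_cluster=5, points_per_particle=20):
    """Sample clusters, then particles, then data points (toy 2D version)."""
    xs, vs, particle_ids, cluster_ids = [], [], [], []
    pid = 0
    for k in range(n_clusters):
        # Cluster: a Gaussian over space plus a rigid-body transformation.
        cluster_mean = rng.uniform(-5.0, 5.0, size=2)
        cluster_std = 1.0  # assumed isotropic spread
        R = rotation_2d(rng.normal(0.0, 0.2))
        t = rng.normal(0.0, 0.5, size=2)
        for _ in range(particles_per_cluster):
            # Particle: a small Gaussian placed inside its cluster.
            particle_mean = cluster_mean + cluster_std * rng.normal(size=2)
            particle_std = 0.2
            # Motion inherited from the cluster: displacement of the particle
            # under a rigid transform about the cluster mean.
            moved = R @ (particle_mean - cluster_mean) + cluster_mean + t
            rigid_velocity = moved - particle_mean
            # Within-cluster slack that permits non-rigid motion.
            particle_velocity = rigid_velocity + rng.normal(0.0, 0.05, size=2)
            # Data points: positions and velocities scattered around the particle.
            xs.append(particle_mean + particle_std * rng.normal(size=(points_per_particle, 2)))
            vs.append(particle_velocity + rng.normal(0.0, 0.02, size=(points_per_particle, 2)))
            particle_ids.append(np.full(points_per_particle, pid))
            cluster_ids.append(np.full(points_per_particle, k))
            pid += 1
    return (np.concatenate(xs), np.concatenate(vs),
            np.concatenate(particle_ids), np.concatenate(cluster_ids))
```

Calling `sample_scene(np.random.default_rng(0))` returns positions, velocities, and the particle and cluster label of every data point. Inference runs this process in reverse.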
Inference
Inference uses parallelized blocked Gibbs sampling to update data-point assignments, particle parameters, cluster assignments, and cluster transformations. Conditional independence in the hierarchy allows many variables at the data-point, particle, and cluster levels to be sampled in parallel, and the same inference procedure is used across random dots, camouflaged structure-from-motion stimuli, and RGB videos with optional image features.
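The parallelism comes from conditional independence: given the particle parameters, every data-point assignment can be sampled at once, and given the assignments, every particle can be updated at once. The sketch below shows one such sweep for a simplified single-level model. It assumes isotropic Gaussian particles with known standard deviation `sigma` and a broad Gaussian prior with standard deviation `tau` on the particle means. The function name `gibbs_sweep` is hypothetical, and the cluster-level blocks of the actual algorithm are omitted.

```python
import numpy as np

def gibbs_sweep(rng, data, means, log_weights, sigma=0.5, tau=10.0):
    """One blocked Gibbs sweep over assignments and particle means.

    data: (N, D) observations, means: (P, D) particle means,
    log_weights: (P,) log mixing weights.
    """
    N, D = data.shape
    P = means.shape[0]

    # Block 1: sample all N assignments in parallel.
    sq_dist = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)  # (N, P)
    logits = log_weights[None, :] - 0.5 * sq_dist / sigma**2
    # Gumbel-max trick: argmax(logits + Gumbel noise) is a categorical sample.
    z = np.argmax(logits + rng.gumbel(size=logits.shape), axis=1)

    # Block 2: sample all P particle means in parallel (conjugate Normal update).
    one_hot = np.eye(P)[z]                      # (N, P)
    counts = one_hot.sum(axis=0)                # (P,)
    sums = one_hot.T @ data                     # (P, D)
    post_prec = 1.0 / tau**2 + counts / sigma**2
    post_mean = (sums / sigma**2) / post_prec[:, None]
    new_means = post_mean + rng.normal(size=(P, D)) / np.sqrt(post_prec)[:, None]
    return z, new_means
```

Because both blocks are expressed as array operations, the same structure maps naturally onto accelerator hardware; the NumPy version here only illustrates the pattern.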
Random Dot Kinematograms
In random dot kinematogram (RDK) stimuli, object membership must be inferred from motion alone. We created 27 stimuli from 9 rigid-body scenes and collected binary same-object judgments from 150 participants. GenMatter was run with 50 random seeds per stimulus to match the human sample size.
Each panel pairs an original random-dot kinematogram with GenMatter's inferred posterior over object structure. The stimuli contain sparse moving dots with no appearance or boundary cues, so the task is to judge, from motion alone, whether probe dots belong to the same moving object. In the posterior videos, dots are colored by inferred cluster assignment, showing how the model groups ambiguous dot correspondences into coherent moving matter.
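One simple way to turn per-seed cluster assignments into a graded judgment comparable to the human same-object responses is to count the fraction of runs in which two probe dots share a cluster. The sketch below assumes this readout; the paper may compute the comparison differently, and the function name and toy labels are hypothetical.

```python
import numpy as np

def same_object_probability(assignments, probe_a, probe_b):
    """Fraction of runs in which two probe dots land in the same cluster.

    assignments: (S, N) integer array, the cluster label of each of N dots
    in each of S independent runs (for example, one run per random seed).
    """
    assignments = np.asarray(assignments)
    return float(np.mean(assignments[:, probe_a] == assignments[:, probe_b]))

# Toy example with 4 runs and 3 dots (labels are made up).
toy = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
])
print(same_object_probability(toy, 0, 1))  # 0.75
print(same_object_probability(toy, 1, 2))  # 0.25
```

The readout only compares labels within a run, so it is unaffected by the arbitrary numbering of clusters across seeds.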
Structure from Motion in Gestalt Stimuli
We evaluate GenMatter on 140 short videos of rotating 3D objects with foreground and background textures matched. Static frames provide little segmentation information, but motion cues allow the 3D structure of the object to be recovered.
Table 1. Summary statistics across 140 Gestalt videos. GenMatter scores higher on mean per-pixel accuracy and Jaccard index, and is more consistent across stimuli. Values are reported as mean [95% bootstrap CI].
| Method | Accuracy | Jaccard |
|---|---|---|
| SegAnyMo | 0.33 [0.28, 0.37] | 0.26 [0.22, 0.31] |
| FlowSAM | 0.87 [0.85, 0.88] | 0.67 [0.63, 0.70] |
| GenMatter | 0.94 [0.93, 0.94] | 0.72 [0.70, 0.74] |
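For reference, the two metrics in the table can be computed for a single pair of binary masks as sketched below. These are the standard definitions of per-pixel accuracy and the Jaccard index (intersection over union); how the paper aggregates them over frames and videos is not specified here, and the function names are illustrative.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted label equals the ground truth."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    return float(np.mean(pred == gt))

def jaccard(pred, gt):
    """Intersection over union of the foreground masks."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, gt).sum() / union)

# Toy 1x4 masks (made up): one false-positive pixel.
pred = np.array([[1, 1, 0, 0]])
gt = np.array([[1, 0, 0, 0]])
print(pixel_accuracy(pred, gt))  # 0.75
print(jaccard(pred, gt))         # 0.5
```

Accuracy rewards correctly labeled background, so it can remain high when the object is small; the Jaccard index only looks at foreground overlap, which is why the two columns can differ substantially.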
Posterior segmentations show the inferred object mask over time. Because foreground and background share the same texture, the recovered object boundary comes from motion-based 3D structure rather than static appearance contrast.
The same inference procedure is applied across matched texture patterns, extracting a MAP segmentation from samples over the latent matter representation.
GenMatter assigns probability mass to contiguous matter regions even when texture is deliberately uninformative, using motion to recover the rotating object.
The benchmark contains 140 videos: 20 object geometries rendered with 7 foreground-background matched texture patterns.
Static frames provide little segmentation information, but the model's inferred moving particles and clusters make the object boundary explicit over time.
GenMatter scores higher than FlowSAM and SegAnyMo on both per-pixel accuracy and Jaccard, with more consistent performance across stimuli.
Across textures, segmentation emerges from the posterior over moving matter rather than from a feed-forward appearance mask.
3D Particle Representations from RGB Video
On TAP-Vid-DAVIS videos, GenMatter conditions on monocular depth, optical flow, and DINO features. Its projected 3D particle representation matches CoTracker3 on matter-weighted Jaccard without task-specific pretraining.
Table 2. Tracking performance on TAP-Vid-DAVIS. Values are reported as mean [95% bootstrap CI].
| Metric | CoTracker3 | GenMatter | GenMatter (abl.) |
|---|---|---|---|
| Jm (SAM) | 0.78 [0.69, 0.87] | 0.79 [0.73, 0.84] | 0.69 [0.61, 0.77] |
| Jm (GT) | 0.78 [0.69, 0.87] | 0.77 [0.73, 0.84] | 0.68 [0.58, 0.73] |
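Intervals of the form mean [95% bootstrap CI] can be obtained by resampling per-video scores with replacement. The sketch below uses a percentile bootstrap, which is one common variant; the exact procedure and number of resamples used for the tables are not stated, so the defaults and the example scores are assumptions.

```python
import numpy as np

def bootstrap_ci(values, n_boot=10000, alpha=0.05, seed=0):
    """Mean and percentile-bootstrap confidence interval of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample indices with replacement: one row per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lo), float(hi)

# Hypothetical per-video scores, not taken from the paper.
scores = [0.6, 0.7, 0.8, 0.9]
mean, lo, hi = bootstrap_ci(scores)
print(f"{mean:.2f} [{lo:.2f}, {hi:.2f}]")
```

With few videos the interval is wide, so overlapping intervals between two methods indicate that the difference in means is within resampling noise.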
BibTeX
@inproceedings{li2026genmatter,
title = {GenMatter: Perceiving Physical Objects with Generative Matter Models},
author = {Li, Eric and Dasgupta, Arijit and Friedman, Yoni and Huot, Mathieu and Mansinghka, Vikash and O'Connell, Thomas and Freeman, William T. and Tenenbaum, Joshua B.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
Acknowledgements
This work was supported in part by the Department of the Air Force Artificial Intelligence Accelerator (Cooperative Agreement FA8750-19-2-1000), NSF Award 2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions), Navy-ONR MURI N00002610, Navy-ONR MURI N00014-22-1-2740, CoCoSys from the Georgia Institute of Technology (Award 2023-JU-3131), the MIT Siegel Family Quest for Intelligence, and the Probabilistic Computing Foundation.