GenMatter: Perceiving Physical Objects with Generative Matter Models

Abstract

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities.

We develop a hardware-accelerated inference algorithm based on parallelized blocked Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.

Motion-Based Perception Across Domains

Human vision robustly segments the world into independently moveable chunks of matter across very different visual conditions: sparse moving dots with no appearance cues, camouflaged surfaces where texture is uninformative, and naturalistic scenes containing deforming objects. Existing vision systems are typically specialized for one of these regimes, while GenMatter uses a single probabilistic model and inference procedure across all three.

The paper evaluates this claim in three settings. In random dot kinematograms, the model must explain object membership from motion alone and capture graded human uncertainty across ambiguous dot correspondences. In Gestalt-inspired camouflaged structure-from-motion videos, foreground and background share the same texture, so segmentation depends on recovering 3D structure from motion. In naturalistic RGB video, monocular depth, optical flow, and DINO features are lifted into a 3D representation that can track the moving matter making up deforming objects.

These experiments address two complementary questions: whether the same generative matter model can support motion-based perception across the settings where human vision succeeds, and whether its inferred particles and clusters provide useful object-level structure for segmentation and tracking.

Generative Matter Model and Inference

GenMatter inference pipeline. RGB video is preprocessed (gray arrows) to extract dense depth and optical flow, lifting each pixel to a 3D point tagged with its velocity. The blue box depicts the GenMatter generative model, which represents a scene as a hierarchy of clusters and particles that emit moving 3D points. Black arrows indicate the generative direction (clusters generate particles, which generate 3D points). Hollow arrows indicate the inference algorithm, which conditions on observed 3D points to infer particle and cluster parameters. Red arrows depict motion at the point, particle, and cluster layers.
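The lifting step in this pipeline can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics and takes the 3D velocity to be the difference between the back-projected pixel in the current frame and its flow-displaced position in the next frame. The function names, the intrinsics, and this particular velocity construction are illustrative assumptions, not details taken from the paper.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth to a 3D camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_with_velocity(u, v, depth_t, flow_uv, depth_next, intrinsics):
    """Lift one pixel to a 3D point tagged with a per-frame 3D velocity.

    flow_uv is the optical-flow displacement (du, dv) in pixels, and
    depth_next is the depth sampled at the displaced pixel in the next frame.
    """
    fx, fy, cx, cy = intrinsics
    p0 = unproject(u, v, depth_t, fx, fy, cx, cy)
    p1 = unproject(u + flow_uv[0], v + flow_uv[1], depth_next, fx, fy, cx, cy)
    velocity = tuple(b - a for a, b in zip(p0, p1))
    return p0, velocity


# Hypothetical intrinsics (fx, fy, cx, cy) and a single toy pixel.
intrinsics = (500.0, 500.0, 320.0, 240.0)
point, velocity = lift_with_velocity(420.0, 240.0, 2.0, (5.0, 0.0), 2.0, intrinsics)
```

In this toy case the pixel sits 100 px right of the principal point at depth 2, so it lifts to roughly (0.4, 0, 2), and a 5 px rightward flow at constant depth gives a small velocity along x only.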

Generative matter model

GenMatter is a two-level hierarchical generative model for structured motion. At the lower level, data points are grouped into particles, where each particle is a local Gaussian representing a small piece of moving matter. At the higher level, particles are grouped into clusters, which represent coherent, independently moveable physical entities.
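To make the generative direction of the two-level hierarchy concrete, the following sketch draws clusters, then particles around each cluster, then points around each particle. It is a simplified 2D, positions-only illustration. The isotropic Gaussians, the scale values, and the fixed counts per level are assumptions chosen for readability and are not the paper's priors.

```python
import random


def sample_scene(num_clusters, particles_per_cluster, points_per_particle, seed=0):
    """Ancestral sampling: clusters -> particles -> points (2D, positions only)."""
    rng = random.Random(seed)
    points = []  # each entry: (x, y, particle_id, cluster_id)
    particle_id = 0
    for cluster_id in range(num_clusters):
        # Cluster centre drawn from a broad prior over the scene (assumed scale 10).
        cx, cy = rng.gauss(0.0, 10.0), rng.gauss(0.0, 10.0)
        for _ in range(particles_per_cluster):
            # Particle centre scattered around its cluster (assumed scale 1).
            px, py = rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)
            for _ in range(points_per_particle):
                # Observed point scattered around its particle (assumed scale 0.1).
                points.append(
                    (rng.gauss(px, 0.1), rng.gauss(py, 0.1), particle_id, cluster_id)
                )
            particle_id += 1
    return points


scene = sample_scene(num_clusters=2, particles_per_cluster=3, points_per_particle=4)
```

Inference runs this process in reverse: given only the point coordinates, it recovers the particle and cluster labels that the sketch stores alongside each point.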

Particles, clusters, and motion

Each cluster is parameterized by a Gaussian over space and a rigid-body transformation. Particles inherit this cluster-level motion while retaining their own spatial extent and velocity covariance, giving the model enough flexibility to explain both rigid and deformable objects. This hierarchy lets local matter move with a coherent entity while allowing within-cluster slack for non-rigid motion.
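One way to picture cluster-level rigid motion combined with within-cluster slack is the 2D sketch below. A particle's per-frame displacement is the displacement induced by rotating about the cluster centre and translating, plus isotropic Gaussian noise. The 2D setting, the isotropic slack, and all names are simplifying assumptions. The model described above uses a full velocity covariance per particle.

```python
import math
import random


def rigid_velocity(p, centre, theta, translation):
    """Per-frame displacement of point p under a 2D rigid transform about centre."""
    dx, dy = p[0] - centre[0], p[1] - centre[1]
    c, s = math.cos(theta), math.sin(theta)
    new_x = centre[0] + c * dx - s * dy + translation[0]
    new_y = centre[1] + s * dx + c * dy + translation[1]
    return (new_x - p[0], new_y - p[1])


def particle_velocity(p, centre, theta, translation, slack_std, rng):
    """Cluster-level rigid motion plus isotropic Gaussian slack for non-rigid motion."""
    vx, vy = rigid_velocity(p, centre, theta, translation)
    return (vx + rng.gauss(0.0, slack_std), vy + rng.gauss(0.0, slack_std))


rng = random.Random(0)
centre = (0.0, 0.0)
# A quarter turn about the origin moves (1, 0) to (0, 1): displacement (-1, 1).
rigid = rigid_velocity((1.0, 0.0), centre, math.pi / 2, (0.0, 0.0))
deformable = particle_velocity((1.0, 0.0), centre, math.pi / 2, (0.0, 0.0), 0.05, rng)
```

Setting `slack_std` to zero recovers purely rigid motion, while larger values let particles in the same cluster deviate from the shared transformation.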

Inference

Inference uses parallelized blocked Gibbs sampling to update data-point assignments, particle parameters, cluster assignments, and cluster transformations. Conditional independence in the hierarchy allows many variables at the data-point, particle, and cluster levels to be sampled in parallel, and the same inference procedure is used across random dots, camouflaged structure-from-motion stimuli, and RGB videos with optional image features.
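The blocked structure can be sketched on a much smaller problem: a 1D Gaussian mixture with known noise scale, uniform mixing weights, and a zero-mean Gaussian prior on the component means. This is not the GenMatter sampler. It only illustrates why conditional independence allows each block to be updated in parallel: all assignments are independent given the means, and all means are independent given the assignments. Every name and constant here is an assumption.

```python
import math
import random


def gibbs_sweep(points, means, sigma, prior_std, rng):
    """One blocked Gibbs sweep for a 1D Gaussian mixture with known sigma."""
    # Block 1: assignments are conditionally independent given the means,
    # so this loop could run in parallel over points.
    assignments = []
    for x in points:
        log_w = [-0.5 * ((x - m) / sigma) ** 2 for m in means]
        top = max(log_w)  # subtract the max for numerical stability
        weights = [math.exp(lw - top) for lw in log_w]
        assignments.append(rng.choices(range(len(means)), weights=weights)[0])
    # Block 2: means are conditionally independent given the assignments,
    # so this loop could run in parallel over components.
    new_means = []
    for k in range(len(means)):
        members = [x for x, z in zip(points, assignments) if z == k]
        precision = 1.0 / prior_std ** 2 + len(members) / sigma ** 2
        post_mean = (sum(members) / sigma ** 2) / precision  # zero-mean prior
        new_means.append(rng.gauss(post_mean, math.sqrt(1.0 / precision)))
    return assignments, new_means


rng = random.Random(0)
points = [rng.gauss(-5.0, 0.5) for _ in range(30)]
points += [rng.gauss(5.0, 0.5) for _ in range(30)]
means = [-1.0, 1.0]
for _ in range(20):
    assignments, means = gibbs_sweep(points, means, sigma=0.5, prior_std=10.0, rng=rng)
```

On hardware accelerators, each of the two loops would typically be replaced by a single vectorized operation over all points or all components.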

Random Dot Kinematograms

We created 27 RDK stimuli from 9 rigid-body scenes and collected binary same-object judgments from 150 participants. GenMatter was run on 50 random seeds per stimulus to match the human sample size.
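A graded model judgment can be obtained from multiple seeds by counting how often the two probe dots land in the same inferred cluster. The sketch below assumes this readout, which the text above does not spell out, and uses made-up labels for five hypothetical seeds. Labels are compared within each run only, because cluster indices are arbitrary and need not agree across runs.

```python
def same_object_probability(runs, probe_a, probe_b):
    """Fraction of inference runs in which two probe dots share a cluster.

    runs is a list of dicts mapping dot id -> inferred cluster label,
    one dict per random seed.
    """
    agree = sum(1 for labels in runs if labels[probe_a] == labels[probe_b])
    return agree / len(runs)


# Toy cluster labels for five hypothetical seeds (not real model output).
runs = [
    {"a": 0, "b": 0},
    {"a": 1, "b": 1},
    {"a": 0, "b": 1},
    {"a": 2, "b": 2},
    {"a": 1, "b": 0},
]
p_same = same_object_probability(runs, "a", "b")
```

Here the probes share a cluster in three of five runs, so the toy model reports "same object" with probability 0.6, directly comparable to the fraction of participants giving that answer.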

GenMatter psychophysics results. GenMatter closely tracks human perceptual judgments on random dot kinematograms. (a) GenMatter accuracy (%) vs. participant accuracy (%) across 27 stimuli (r² = 0.86). Each blue circle represents a same-object stimulus and each orange triangle a different-object stimulus. (b) An internal view of the inferred posterior: red points belong to the moving object, blue points to the background, and black points have flickered out of the scene. (c) Left: fast rotation makes object-background separation easy, and both GenMatter (86%) and humans (90%) correctly judge the probes as on different objects. Right: slow sliding motion is very challenging, and both GenMatter (88%) and humans (84%) incorrectly judge the probes as on the same object.
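For readers who want to reproduce a statistic like the one in panel (a), the sketch below computes the squared Pearson correlation between per-stimulus model and human accuracies. Whether the reported r² is a squared Pearson correlation or a regression coefficient of determination is not stated here, so treat that choice as an assumption. The accuracy values are made-up toy numbers, not the study's data.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return (cov * cov) / (var_x * var_y)


# Toy per-stimulus accuracies in percent (illustrative only).
human_accuracy = [95.0, 80.0, 60.0, 40.0, 20.0]
model_accuracy = [90.0, 85.0, 50.0, 45.0, 25.0]
r_squared = pearson_r_squared(model_accuracy, human_accuracy)
```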

RDK stimulus c1.

GenMatter inference on c1.

RDK stimulus c2.

GenMatter inference on c2.

RDK stimulus c9.

GenMatter inference on c9.

Each panel pairs an original random-dot kinematogram with GenMatter's inferred posterior over object structure. The stimuli contain sparse moving dots with no appearance or boundary cues, so the task is to explain whether probe dots belong to the same moving object from motion alone. In the posterior videos, dots are colored by inferred cluster assignment, showing how the model groups ambiguous dot correspondences into coherent moving matter.

Structure from Motion in Gestalt Stimuli

We evaluate GenMatter on 140 short videos of rotating 3D objects with foreground and background textures matched. Static frames provide little segmentation information, but motion supports 3D structure from motion.

Qualitative comparison on camouflaged stimuli. Probe point segmentation on scene 16, texture 01. The depth estimate is uninformative, and the flow estimate shows that on-axis rotation causes opposing motion at top vs. bottom (blue vs. red). GenMatter correctly segments the moving object, while FlowSAM segments the initial frame correctly but degrades over time. SegAnyMo fails to detect any object in the scene.

Table 1. Summary statistics across 140 Gestalt videos. GenMatter scores higher than both baselines on mean per-pixel accuracy and Jaccard index, and is more consistent across stimuli. Values are reported as mean [95% bootstrap CI].

Method	Accuracy	Jaccard
SegAnyMo	0.33 [0.28, 0.37]	0.26 [0.22, 0.31]
FlowSAM	0.87 [0.85, 0.88]	0.67 [0.63, 0.70]
GenMatter	0.94 [0.93, 0.94]	0.72 [0.70, 0.74]
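The quantities in Table 1 can be computed along the following lines. This is a sketch under stated assumptions: masks are binary and flattened to lists, Jaccard is intersection over union of the foreground, and the interval is a percentile bootstrap over per-video scores. The exact evaluation protocol, the number of resamples, and the toy scores are not taken from the paper.

```python
import random


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose binary label matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def jaccard(pred, truth):
    """Intersection over union of the foreground (label 1) pixels."""
    inter = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, truth) if p == 1 or t == 1)
    return inter / union if union else 1.0


def bootstrap_ci(values, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(num_resamples))
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi


# Toy flattened masks: one false-positive foreground pixel out of four.
pred = [1, 1, 0, 0]
truth = [1, 0, 0, 0]
acc = pixel_accuracy(pred, truth)
iou = jaccard(pred, truth)

# Toy per-video scores (illustrative only) and their 95% interval.
scores = [0.90, 0.95, 0.80, 0.92, 0.97, 0.85]
ci_low, ci_high = bootstrap_ci(scores)
```

Accuracy counts background pixels as well, so it can stay high when the object is small, whereas Jaccard only rewards overlap on the foreground. This is one reason the two columns in the table differ in magnitude.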

Posterior segmentation, texture 00.

Posterior segmentations show the inferred object mask over time. Because foreground and background share the same texture, the recovered object boundary comes from motion-based 3D structure rather than static appearance contrast.

Posterior segmentation, texture 07.

The same inference procedure is applied across matched texture patterns, extracting a MAP segmentation from samples over the latent matter representation.
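One simple way to turn posterior samples into a single segmentation is a per-pixel majority vote, sketched below. Strictly, this yields the mode of each pixel's marginal posterior rather than a joint MAP estimate, and the text above does not specify which is used, so the choice is an assumption. The sketch also assumes labels are already aligned across samples (for example, 1 for the moving object and 0 for background), since raw cluster indices can permute between samples.

```python
from collections import Counter


def marginal_map_labels(samples):
    """Most frequent label per pixel across posterior samples.

    samples is a list of label lists, one per posterior sample,
    aligned so that position i always refers to the same pixel.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*samples)]


# Three toy posterior samples over four pixels (illustrative only).
samples = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]
segmentation = marginal_map_labels(samples)
```

In the toy input, the second and fourth pixels are each contested by one sample, and the vote resolves them to object and background respectively.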

Posterior segmentation, texture 13.

GenMatter assigns probability mass to contiguous matter regions even when texture is deliberately uninformative, using motion to recover the rotating object.

Posterior segmentation, texture 16.

The benchmark contains 140 videos: 20 object geometries rendered with 7 foreground-background matched texture patterns.

Posterior segmentation, texture 21.

Static frames provide little segmentation information, but the model's inferred moving particles and clusters make the object boundary explicit over time.

Posterior segmentation, texture 22.

GenMatter scores higher than FlowSAM and SegAnyMo on both per-pixel accuracy and Jaccard, with more consistent performance across stimuli.

Posterior segmentation, texture 25.

Across textures, segmentation emerges from the posterior over moving matter rather than from a feed-forward appearance mask.

3D Particle Representations from RGB Video

On TAP-Vid-DAVIS videos, GenMatter conditions on monocular depth, optical flow, and DINO features. Its projected 3D particle representation matches CoTracker3 on matter-weighted Jaccard without task-specific pretraining.
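Projecting a 3D particle back into the image yields a 2D track that can be compared against point-tracking annotations. The sketch below assumes a pinhole camera and a per-frame sequence of 3D particle means. The intrinsics, the function name, and the convention of returning None for points behind the camera are illustrative assumptions, not the paper's evaluation code.

```python
def project_track(means_3d, intrinsics):
    """Project a per-frame sequence of 3D particle means to 2D pixel coordinates."""
    fx, fy, cx, cy = intrinsics
    track = []
    for x, y, z in means_3d:
        if z <= 0.0:
            # Behind the camera: no valid projection, treat as not visible.
            track.append(None)
        else:
            track.append((fx * x / z + cx, fy * y / z + cy))
    return track


# Hypothetical intrinsics (fx, fy, cx, cy) and a toy three-frame particle path.
intrinsics = (500.0, 500.0, 320.0, 240.0)
particle_means = [(0.4, 0.0, 2.0), (0.42, 0.0, 2.0), (0.42, 0.0, -1.0)]
track = project_track(particle_means, intrinsics)
```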