Our method learns temporally consistent, pose-invariant slots via Invariant Slot Attention (ISA) applied to frozen DINOv2 features. For each frame, ISA produces K slots with learned position and scale parameters. We temporally aggregate slots across 5-frame windows using Transformer encoders with temporal positional embeddings. Separately, we build a global register token by average-pooling DINOv2 features across time and processing the result with another Transformer encoder; this register carries scene, pose, and background context so that the slots can remain pose-invariant. For decoding, we use frozen Stable Diffusion 1.5 augmented with lightweight adapter cross-attention layers: the adapters condition on the temporally enriched slots (object identities), while SD's native cross-attention consumes the register token (scene context and pose information). Training uses a 1-frame objective (a random frame per window) with frozen SD weights; inference uses sliding windows with Hungarian matching for temporal consistency. This architecture enables high-fidelity video generation and compositional editing (insert/delete/replace objects) while maintaining temporal coherence.
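To make the slot pathway concrete, here is a minimal PyTorch sketch of the description above. All names (TemporalSlotPipeline, to_kv, etc.) are ours, not from the paper, and the single cross-attention step merely stands in for full ISA, which additionally runs iterative GRU updates and maintains the per-slot position/scale parameters that give pose invariance.

import torch
import torch.nn as nn

class TemporalSlotPipeline(nn.Module):
    """Sketch: per-frame slots over frozen DINOv2 features, temporal
    aggregation over a 5-frame window, and a global register token."""

    def __init__(self, feat_dim=768, slot_dim=256, num_slots=8, window=5):
        super().__init__()
        self.num_slots = num_slots
        self.slot_init = nn.Parameter(torch.randn(num_slots, slot_dim))
        self.to_kv = nn.Linear(feat_dim, slot_dim)
        # Stand-in for Invariant Slot Attention: one cross-attention step.
        self.slot_attn = nn.MultiheadAttention(slot_dim, num_heads=4,
                                               batch_first=True)
        self.time_pos = nn.Parameter(torch.randn(window, slot_dim))
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(slot_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.register_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(slot_dim, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, feats):
        # feats: (B, T, N, feat_dim) frozen DINOv2 patch features, T = window
        B, T, N, _ = feats.shape
        kv = self.to_kv(feats).flatten(0, 1)              # (B*T, N, D)
        q = self.slot_init.expand(B * T, -1, -1)          # (B*T, K, D)
        slots, _ = self.slot_attn(q, kv, kv)              # per-frame slots
        # Temporal aggregation: each slot attends along its 5-frame track.
        tracks = slots.reshape(B, T, self.num_slots, -1).permute(0, 2, 1, 3)
        tracks = tracks.reshape(B * self.num_slots, T, -1) + self.time_pos
        tracks = self.temporal(tracks)
        slots = tracks.reshape(B, self.num_slots, T, -1).permute(0, 2, 1, 3)
        # Register token: time-pooled features -> Transformer -> mean token.
        pooled = self.to_kv(feats.mean(dim=1))            # (B, N, D)
        register = self.register_enc(pooled).mean(dim=1)  # (B, D)
        return slots, register                            # (B,T,K,D), (B,D)

For example, slots, register = TemporalSlotPipeline()(torch.randn(2, 5, 256, 768)) processes a batch of two 5-frame windows with 256 patch tokens per frame.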
We present a self-supervised framework for compositional video synthesis that combines temporal object-centric learning with pretrained image diffusion models. Our approach learns pose-invariant, temporally consistent object slots using Invariant Slot Attention (ISA) over frozen DINOv2 features, enriched via temporal Transformer aggregation over 5-frame windows. We introduce a global register token that separates pose and scene context from object identity, allowing slots to remain pose-invariant. For generation, we condition a frozen Stable Diffusion 1.5 model using lightweight adapter cross-attention layers for the slots and native cross-attention for the register token. Despite training with a 1-frame objective, sliding-window inference with Hungarian matching achieves strong temporal consistency. Experiments on YTVIS19 and DAVIS17 demonstrate state-of-the-art video generation quality (PSNR, SSIM, LPIPS, FID, FVD) and competitive unsupervised segmentation (mIoU, FG-ARI). Our method uniquely integrates segmentation with high-fidelity generation, enabling intuitive compositional editing (object insertion, deletion, or replacement) while maintaining temporal identity and visual coherence across frames.
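For the decoding side, the following hedged sketch shows how an adapter cross-attention layer could wrap one frozen SD-1.5 attention block. SlotAdapterBlock and its interface are our assumptions, not the released implementation; frozen_block stands for a pretrained SD block assumed to take (hidden states, context).

import torch
import torch.nn as nn

class SlotAdapterBlock(nn.Module):
    """Hypothetical wrapper: the native (frozen) cross-attention reads the
    register token, while a new trainable cross-attention reads the slots."""

    def __init__(self, frozen_block, hidden_dim=320, slot_dim=256):
        super().__init__()
        self.frozen_block = frozen_block          # pretrained SD block
        for p in self.frozen_block.parameters():  # keep SD weights frozen
            p.requires_grad_(False)
        self.slot_proj = nn.Linear(slot_dim, hidden_dim)
        self.adapter_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                  batch_first=True)
        self.adapter_norm = nn.LayerNorm(hidden_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as identity

    def forward(self, x, register, slots):
        # x: (B, L, hidden_dim) UNet tokens; register: (B, 1, ctx_dim)
        # slots: (B, K, slot_dim) temporally enriched object slots
        x = self.frozen_block(x, register)        # assumed block(hidden, context)
        ctx = self.slot_proj(slots)
        out, _ = self.adapter_attn(self.adapter_norm(x), ctx, ctx)
        return x + torch.tanh(self.gate) * out    # gated adapter residual

The zero-initialized gate means training starts from the unmodified frozen SD block and gradually admits slot information, a common choice for adapter layers.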
Unsupervised Video Object Segmentation on YTVIS19 and DAVIS17. Our method achieves state-of-the-art performance in FG-ARI and competitive results in mIoU on both datasets, demonstrating strong object discovery capabilities while uniquely enabling high-fidelity video generation.
Video Generation on YTVIS19 and DAVIS17. Our method achieves state-of-the-art video generation quality across all metrics (PSNR, SSIM, LPIPS, FID, FVD) on both datasets. Ours is the only method that generates high-fidelity videos from object-centric representations, combining unsupervised segmentation with strong generative capability for compositional video editing.
Video Segmentation. Visualizations of predicted segments from ISA attention masks across frames. Our method discovers and tracks distinct object instances using Hungarian matching, achieving competitive unsupervised performance while also enabling generation.
[Four side-by-side video pairs: Ground Truth vs. Generation]
Video Generation on DAVIS17 and YTVIS19. Reconstructed videos using frozen SD-1.5 conditioned on temporally aggregated slots via adapter cross-attention and on register tokens via native cross-attention. Despite the 1-frame training objective, sliding-window inference with Hungarian matching achieves state-of-the-art quality.
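A minimal sketch of the Hungarian matching step used at inference, based on SciPy's linear_sum_assignment. The negative-cosine cost and the choice of matching on the overlapping frame of consecutive windows are our assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_slots(prev_slots, next_slots):
    """Permute next_slots so that slot i continues the identity of prev
    slot i. Both arrays are (K, D) slot vectors from the frame shared by
    two consecutive sliding windows."""
    a = prev_slots / np.linalg.norm(prev_slots, axis=1, keepdims=True)
    b = next_slots / np.linalg.norm(next_slots, axis=1, keepdims=True)
    cost = -a @ b.T                        # maximize total cosine similarity
    _, cols = linear_sum_assignment(cost)  # Hungarian assignment
    return next_slots[cols]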
[Three editing examples: Input (Delete) → Output, Input (Replace) → Output, Input (Add) → Output]
Compositional Video Editing. Object-level edits via direct slot manipulation, with temporal consistency maintained through Hungarian matching. Delete: zero out the target slot across frames. Replace: splice aligned slot streams between videos. Add: insert slots from another video. All edits preserve pose invariance and temporal identity.
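In slot space the three edits reduce to simple tensor operations. A hedged sketch follows; the function names are ours, and donor denotes a slot stream from another video already Hungarian-aligned to the target.

import torch

def delete_object(slots, k):
    """Delete: zero out slot k in every frame. slots: (T, K, D)."""
    edited = slots.clone()
    edited[:, k] = 0
    return edited

def replace_object(slots, donor, k):
    """Replace: splice an aligned slot stream from another video
    into position k across all frames. donor: (T, K, D)."""
    edited = slots.clone()
    edited[:, k] = donor[:, k]
    return edited

def add_object(slots, donor, k):
    """Add: append slot k's stream from another video as a new slot."""
    return torch.cat([slots, donor[:, k:k + 1]], dim=1)

The edited slot tensors are then fed to the frozen SD decoder exactly as in reconstruction, which is why the edits inherit its temporal coherence.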
Adil Kaan Akan and Yücel Yemez
Under Review
@article{Akan2025Video,
author = {Akan, Adil Kaan and Yemez, Y\"{u}cel},
title = {Compositional Video Synthesis by Temporal Object-Centric Learning},
journal = {arXiv preprint arXiv:2507.20855},
year = {2025}
}