We extract object-centric information from the input image using a visual backbone that combines DINO features with slot attention. The Stable Diffusion VAE encodes the image into latent space, and noise is added to the latent. The diffusion process is conditioned on the learned slots as well as on a register token obtained by mean-pooling the slots. We use the diffusion model's original cross-attention layers to condition on the register token, and additional adapter cross-attention layers to condition on the slots. The overall objective is to predict the noise added to the latent. Additionally, we introduce a guidance loss that encourages similarity between the slot attention masks and the adapter cross-attention masks. The guidance is applied only in the third upsampling block, while slot conditioning is applied throughout all downsampling and upsampling blocks. A minimal training-step sketch is given below.
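The sketch below illustrates the conditioning scheme described above, not the exact implementation: slots are mean-pooled into a register token that replaces the text embedding in the UNet's original cross-attention, while an adapter cross-attention layer attends to the slots, and a guidance loss aligns the adapter attention maps with the slot attention masks. The `unet` wrapper (with its `adapter_states` keyword and returned attention maps) is a hypothetical interface; the scheduler is assumed to follow the `diffusers` API.

```python
import torch
import torch.nn.functional as F
from torch import nn

class AdapterCrossAttention(nn.Module):
    """Extra cross-attention layer inserted alongside a UNet block's
    original cross-attention; spatial features attend to the slots."""
    def __init__(self, dim, slot_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=slot_dim,
                                          vdim=slot_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, slots):
        # x: (B, N, dim) flattened spatial features; slots: (B, K, slot_dim)
        out, attn = self.attn(self.norm(x), slots, slots,
                              average_attn_weights=True)
        # Residual update plus the (B, N, K) slot attention map.
        return x + out, attn


def slotadapt_step(unet, vae_latents, slots, slot_attn_masks,
                   noise_scheduler, guidance_weight=0.1):
    """One training step: noise prediction plus attention guidance."""
    B = vae_latents.size(0)
    noise = torch.randn_like(vae_latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (B,),
                      device=vae_latents.device)
    noisy_latents = noise_scheduler.add_noise(vae_latents, noise, t)

    # Register token: mean-pooled slots, fed through the UNet's original
    # cross-attention layers in place of a text embedding.
    register = slots.mean(dim=1, keepdim=True)  # (B, 1, D)

    # Hypothetical wrapper: also returns the adapter cross-attention
    # maps from the third upsampling block, where guidance is applied.
    noise_pred, adapter_attn = unet(noisy_latents, t,
                                    encoder_hidden_states=register,
                                    adapter_states=slots)

    # Standard epsilon-prediction objective.
    diffusion_loss = F.mse_loss(noise_pred, noise)

    # Guidance: align adapter attention (B, N, K) with the slot attention
    # masks (B, N, K), assumed resized to the same spatial resolution.
    guidance_loss = F.mse_loss(adapter_attn, slot_attn_masks.detach())

    return diffusion_loss + guidance_weight * guidance_loss
```

The guidance term pushes the adapter's attention to decompose the image the same way slot attention does, so each slot both segments and generates the same object.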
We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models while avoiding their text-centric conditioning bias. We also incorporate a guidance loss that aligns the cross-attention maps of the adapter layers with the slot attention masks. This improves the alignment of our model with the objects in the input image without using external supervision. Experimental results show that our method outperforms state-of-the-art techniques on object discovery and image generation tasks across multiple datasets, including those with real images. Furthermore, we demonstrate through experiments that our method performs remarkably well on compositional generation with complex real-world images, in contrast to other slot-based generative methods in the literature.
Quantitative Results on the Pascal VOC and COCO datasets. The tables compare our method to previous work in terms of FG-ARI and mBO on instance and semantic segmentation tasks.
Segmentation. We show visualizations of predicted segments on COCO (left) and VOC (right). SlotAdapt successfully binds distinct instances of the same class to separate slots.
Generation. We show sample images reconstructed by SlotAdapt, conditioned on slots, on COCO (left) and VOC (right). The reconstructions are highly faithful to the original input images.
Compositional Editing. We demonstrate object removal, replacement, and addition on COCO images by editing slots (see the sketch below). Removing the highlighted slots (top row) yields realistic and successful generations (first four examples). Replacing the highlighted objects in the 3rd and 4th images with the cow slot from the 5th and 6th images produces highly accurate edits, with only small changes to the rest of the original images. Finally, adding the cow slot (5th image) and the person slot (3rd image) to the last two images, respectively, generates meaningful examples of complex scenes.
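Since each slot represents one object, these edits reduce to simple operations on the slot set before re-running diffusion sampling. The helper below is an illustrative sketch of that idea, not the released codebase; the `generate` call in the usage comment is a hypothetical sampling routine using the same register-plus-adapter conditioning as training.

```python
import torch

def edit_slots(slots, remove_idx=None, replace=None, add=None):
    """Compose an edited slot set for compositional generation.

    slots:      (K, D) slot vectors extracted from the source image.
    remove_idx: indices of slots to drop (object removal).
    replace:    dict {target_idx: new_slot} swapping in a slot taken
                from another image (object replacement).
    add:        (M, D) slots from other images to append (object addition).
    """
    slots = slots.clone()
    if replace:
        for i, new_slot in replace.items():
            slots[i] = new_slot
    if remove_idx:
        keep = [i for i in range(slots.size(0)) if i not in set(remove_idx)]
        slots = slots[keep]
    if add is not None:
        slots = torch.cat([slots, add], dim=0)
    return slots

# Usage sketch: re-generate the scene from the edited slots, with the
# register token recomputed as the mean of the edited slot set.
# edited = edit_slots(src_slots, replace={3: cow_slot})
# image = generate(unet, vae, scheduler, edited.unsqueeze(0))  # hypothetical
```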
Adil Kaan Akan and Yücel Yemez
ICLR 2025
@InProceedings{Akan2025ICLR,
author = {Akan, Adil Kaan and Yemez, Y\"{u}cel},
title = {Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation},
booktitle = {International Conference on Learning Representations},
year = {2025}}