I am a Research Scientist on the post-training team at fal.ai, where I work on large-scale diffusion models for image and video generation.
I received my PhD and MSc from Koç University (PhD Thesis, MSc Thesis), with my doctoral work on object-centric learning and compositional image and video generation.
I completed my Bachelor's at Middle East Technical University.
We derive the exact posterior score in closed form for linear Gaussian inverse problems and turn it into EPS, a denoising training objective that preserves the input/output structure of standard pretraining. At inference, EPS reuses the backbone's sampler with no likelihood gradients or projections, outperforming training-free and training-based baselines while using roughly an order of magnitude fewer denoiser evaluations.
We present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a decoupled 3D-RoPE positional key, cutting per-token KV memory by 92.7% at every cached layer. On VBench it matches short-horizon streaming baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x.
We show that VAE latents and Gaussian noise both concentrate on thin spherical shells, where decoded content is carried predominantly by direction rather than radius. By projecting latents onto a fixed token radius and replacing linear interpolation with spherical interpolation, our geodesic flow-matching paths stay on the sphere at every timestep and consistently improve class-conditional ImageNet-256 FID without changing the diffusion architecture.
∞-RoPE enables infinite-horizon, action-controllable video generation through Block-Relativistic RoPE, KV Flush, and RoPE Cut—training-free techniques that overcome temporal horizon limits, enable fine-grained action control, and support cinematic scene transitions within a single autoregressive rollout.
We extend SlotAdapt to video by learning temporally consistent object-centric slots and conditioning pretrained diffusion models on them for compositional video synthesis. Our approach enables intuitive editing capabilities such as object insertion, deletion, and replacement while maintaining consistent identities across frames. Experiments demonstrate superior video generation quality and temporal consistency, uniquely integrating segmentation with robust generative performance.
SlotAdapt combines slot attention with pretrained diffusion models through adapters for slot-based conditioning. By adding a guidance loss to align cross-attention with slot attention, our model better identifies objects without external supervision. Experiments show superior performance in object discovery and image generation, particularly on complex real-world images.
We propose a novel method for trajectory prediction that adapts to every agent in a shared scene. We exploit dynamic weight learning to adapt to each agent's state separately and predict all agents' future trajectories simultaneously, without rotating and normalizing the scene frame. Our method achieves state-of-the-art performance on the Argoverse and INTERACTION datasets with impressive runtime performance.
We propose a novel method for future instance segmentation in bird's-eye-view space. We use state-space models for future state prediction, encoding the 3D scene structure and decoding future instance segmentations. Our method achieves state-of-the-art performance on the NuScenes dataset by a large margin.
We propose a novel method for trajectory prediction that uses Temporal Graph Networks to learn dynamically evolving agent features. Our method reaches state-of-the-art performance on the Argoverse Forecasting dataset.
We decompose the scene into static and dynamic parts by encoding it into ego-motion and optical flow. We first factorize the scene structure and ego-motion; then, conditioned on these, we predict the residual flow in the scene specifically for independently moving objects.
We propose a novel approach to stochastic video prediction that decomposes the scene into static and dynamic parts. We reason about appearance and motion in the video stochastically, predicting the future based on the motion history.
We provide a theoretical understanding of the Just Noticeable Difference (JND) concept for machine perception and conduct further analyses and comparisons with other state-of-the-art methods.
We propose a new concept for adversarial example generation. Inspired by experimental psychology, we use the concept of Just Noticeable Difference to generate natural-looking adversarial images.
Teaching
COMP302: Software Engineering, Koç University
COMP100: Introduction to Computer Science and Programming, Koç University
CENG223: Discrete Computational Structures, Middle East Technical University
CENG230: C Programming, Middle East Technical University