ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion

Alex Rav-Acha
Daniel Winter
Yedid Hoshen
Shlomi Fruchter
Matan Cohen
2024

Abstract

Diffusion models have revolutionized the quality and breadth of image editing, yet edited images often violate physical laws, which makes them unrealistic. In this paper, we focus on two related tasks: removing objects from images ("object removal") and inserting objects into scenes ("object insertion"). These tasks are very challenging, as they require understanding the physical interaction between an object and its environment, particularly its shadows and reflections. We first argue that these tasks may be impossible to learn through self-supervision alone, and demonstrate that current methods indeed suffer from the resulting failure modes. We therefore propose a new, pragmatic approach based on collecting a counterfactual intervention dataset. Specifically, we collect each example in three steps: i) photographing a scene, ii) removing a single object, and iii) recapturing the scene while keeping all other objects, the lighting, and the camera pose unchanged. This simple idea allows us to fine-tune a diffusion model that achieves object-removal results of unprecedented quality. In contrast, we find that object insertion requires a much larger dataset, and manually collecting counterfactual examples at that scale is laborious. We therefore propose a bootstrapped training method that applies our object-removal model to synthesize a large-scale object-insertion dataset. We evaluate our approach on several challenging tasks, including object removal, shadow generation, and object insertion, and show that it achieves significantly better results than previous methods.
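To make the bootstrapping step concrete, here is a minimal sketch of the data-synthesis loop the abstract describes: a removal model trained on counterfactual photo pairs erases an object (with its shadows and reflections) from unlabeled images, and each (erased, original) pair then supervises insertion in the reverse direction. All names below (ObjectRemovalModel, segment_object, InsertionExample) are illustrative placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of bootstrapped insertion-data synthesis.
from dataclasses import dataclass
from typing import Iterable, List

import numpy as np


@dataclass
class InsertionExample:
    source: np.ndarray       # original image: object plus its shadows/reflections
    background: np.ndarray   # same scene with the object and its effects removed
    object_mask: np.ndarray  # binary mask of the object itself


class ObjectRemovalModel:
    """Placeholder for a diffusion model fine-tuned on counterfactual
    (scene, scene-without-object) photo pairs, as described in the abstract."""

    def remove(self, image: np.ndarray, mask: np.ndarray) -> np.ndarray:
        raise NotImplementedError("stand-in for the fine-tuned removal model")


def segment_object(image: np.ndarray) -> np.ndarray:
    """Placeholder for any off-the-shelf instance segmenter."""
    raise NotImplementedError("stand-in for an object segmenter")


def synthesize_insertion_dataset(
    images: Iterable[np.ndarray],
    removal_model: ObjectRemovalModel,
) -> List[InsertionExample]:
    """Build insertion supervision by inverting removal: given the erased
    background and the object mask, an insertion model must reproduce the
    original image, i.e. re-synthesize the object's effects on the scene."""
    dataset: List[InsertionExample] = []
    for image in images:
        mask = segment_object(image)                 # pick one object to erase
        background = removal_model.remove(image, mask)
        dataset.append(
            InsertionExample(source=image, background=background, object_mask=mask)
        )
    return dataset
```

The key design point this sketch illustrates is that only the removal direction needs manually collected counterfactual photos; the insertion direction is supervised for free at scale by running removal on ordinary unlabeled images.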