George Leifman

I am a Research Scientist at Google Research, working on Google Earth AI, where I develop vision-language foundation models for satellite and aerial imagery. My previous industry experience includes building computer vision systems for Amazon Go and Amazon One at Amazon Research, and developing ML/CV projects in the healthcare domain at Google Health and Verily. I hold a Ph.D. in Electrical Engineering from the Technion and completed a Postdoctoral Fellowship at the MIT Media Lab.
Authored Publications
    A Recipe for Improving Remote Sensing Zero Shot Generalization
    Aviad Barzilai
    Yotam Gigi
    Vered Silverman
    Yehonathan Refael
    Bolous Jaber
    Amr Helmy
    3rd ML4RS Workshop at ICLR 2025
    Abstract: Foundation models have had a significant impact across various AI applications, enabling use cases that were previously impossible. Visual language models (VLMs), in particular, have outperformed other techniques in many tasks. In remote sensing (RS), foundation models have shown improvements across various applications. However, unlike other fields, the use of VLMs with large-scale remote sensing image-text datasets remains limited. In this work, we first introduce two novel image-caption datasets for training remote sensing foundation models. The first dataset pairs aerial and satellite imagery, aligned with Google Maps data, with high-quality captions generated using Gemini. The second utilizes public web images and their corresponding alt-text, filtered to the remote sensing domain, resulting in a highly diverse dataset. We show that using these datasets to pre-train MaMMUT, a VLM architecture, yields state-of-the-art generalization in zero-shot classification and cross-modal retrieval on well-known public benchmarks. Second, we leverage this newly pre-trained VLM to generate inference attention maps for a novel class query (i.e., a class unseen during training). We then propose an iterative self-supervised fine-tuning approach in which samples aligned with these attention maps are iteratively pseudo-labeled and used for model training.
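
The iterative self-supervised fine-tuning described above can be sketched as a pseudo-labeling loop. The snippet below is a minimal illustration, assuming hypothetical `compute_attention_map` and `fine_tune` methods on the VLM and a simple peak-attention alignment score; the paper's actual selection criteria and training details are not reproduced here.

```python
# Minimal sketch of the iterative pseudo-labeling loop described above.
# `vlm.compute_attention_map` and `vlm.fine_tune` are hypothetical stand-ins;
# the actual pipeline, thresholds, and losses are not specified in the abstract.
import numpy as np


def iterative_pseudo_label(vlm, unlabeled_images, class_query, rounds=3, threshold=0.8):
    """Iteratively pseudo-label images whose attention maps align with a novel class query."""
    pseudo_labeled = []
    for _ in range(rounds):
        scored = []
        for image in unlabeled_images:
            # Attention map of the class query over image patches (hypothetical helper).
            attn = vlm.compute_attention_map(image, class_query)
            # Use the peak attention as a crude alignment score.
            scored.append((float(np.max(attn)), image, attn))
        # Keep only samples that are well aligned with the query.
        selected = [(img, attn) for score, img, attn in scored if score >= threshold]
        pseudo_labeled.extend(selected)
        # Fine-tune the VLM on the pseudo-labeled samples (hypothetical helper).
        vlm.fine_tune(pseudo_labeled, class_query)
    return vlm, pseudo_labeled
```
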
    Anchored diffusion for video face reenactment
    Idan Kligvasser
    Regev Cohen
    Ehud Rivlin
    Michael Elad
    2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4087-4097
    Abstract: Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short- and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of transferring the motion from a driving video to a source face. Through comprehensive experiments, we show that our approach outperforms current techniques in producing longer, consistent, high-quality videos while offering editing capabilities.
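
The anchoring idea at inference time can be illustrated with a short sketch: every frame in a generated batch is conditioned on the same anchor frame plus its own (non-uniform) timestamp. `sdit_denoise` below is a hypothetical stand-in for the sequence-DiT denoiser; the real sampler and guidance mechanism are not shown.

```python
# Illustrative sketch of "anchoring" at inference time: all generated frames share
# one anchor frame, so segments stay consistent regardless of temporal distance.
# `sdit_denoise` is a hypothetical stand-in for the sequence-DiT denoiser.
import torch


def sample_anchored_batch(sdit_denoise, anchor_frame, timestamps, num_steps=50):
    """Jointly denoise a batch of frames at non-uniform timestamps, conditioned on one anchor frame.

    anchor_frame: (C, H, W) tensor of the shared reference frame.
    timestamps:   1-D tensor of non-uniform temporal positions to generate.
    """
    batch = torch.randn(len(timestamps), *anchor_frame.shape)  # start from pure noise
    for step in reversed(range(num_steps)):
        # Each frame sees the same anchor and its own timestamp (external temporal guidance).
        batch = sdit_denoise(batch, anchor=anchor_frame, timestamps=timestamps, step=step)
    return batch
```
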
    Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning
    Aaron Bell
    Aviad Barzilai
    Roy Lee
    Gia Jung
    Charles Elliott
    Adam Boulanger
    Amr Helmy
    Jacob Bien
    Ruth Alcantara
    Nadav Sherman
    Hassler Thurston
    Yotam Gigi
    Bolous Jaber
    Vered Silverman
    Luke Barrington
    Tim Thelin
    Elad Aharoni
    Kartik Hegde
    Yuval Carny
    Shravya Shetty
    Yehonathan Refael
    Stone Jiang
    David Schottlander
    Juliet Rothenberg
    Luc Houriez
    Yochai Blau
    Joydeep Paul
    Yang Chen
    Yael Maguire
    Aviv Slobodkin
    Shlomi Pasternak
    Alex Ottenwess
    Jamie McPike
    Per Bjornsson
    Natalie Williams
    Reuven Sayag
    Thomas Turnbull
    Ali Ahmadalipour
    David Andre
    Amit Aides
    Ean Phing VanLee
    Niv Efron
    Monica Bharel
    arXiv preprint arXiv:2510.18318 (2025), https://doi.org/10.48550/arXiv.2510.18318
    Abstract: Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data, along with its varied resolutions, timescales, and sparsity, pose significant challenges for thorough analysis and interpretation. The emergence of Foundation Models (FMs) and Large Language Models (LLMs) offers an unprecedented opportunity to tackle some of this complexity, unlocking novel and profound insights into our planet. This paper introduces a comprehensive approach to developing Earth AI solutions, built upon foundation models across three key domains (Planet-scale Imagery, Population, and Environment) and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that they provide complementary value for geospatial inference. We show that the synergy between these models unlocks superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools to unlock novel geospatial insights. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.
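
A highly simplified sketch of the cross-modal reasoning pattern described above: a planner LLM decomposes a geospatial question, routes sub-queries to domain foundation models, and synthesizes the evidence into an answer. All interfaces below are hypothetical placeholders, not the actual Earth AI APIs.

```python
# Simplified sketch of agentic reasoning over multiple foundation models.
# `plan`, `models`, and `synthesize` are hypothetical placeholders, not real APIs.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuery:
    domain: str   # e.g. "imagery", "population", or "environment"
    question: str


def answer_geospatial_query(plan: Callable[[str], List[SubQuery]],
                            models: Dict[str, Callable[[str], str]],
                            synthesize: Callable[[str, List[str]], str],
                            query: str) -> str:
    """Decompose a query, call the relevant foundation models, and synthesize an answer."""
    sub_queries = plan(query)                              # LLM-produced plan
    evidence = [models[sq.domain](sq.question) for sq in sub_queries]
    return synthesize(query, evidence)                     # LLM-produced final answer
```
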
    Enhancing Remote Sensing Representations through Mixed-Modality Masked Autoencoding
    Ori Linial
    Yochai Blau
    Nadav Sherman
    Yotam Gigi
    Wojciech Sirko
    Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops (2025), pp. 507-516
    Abstract: This paper presents an innovative approach to pre-training models for remote sensing by integrating optical and radar data from the Sentinel-2 and Sentinel-1 satellites. Our model uses a novel variation of the masked autoencoder (MAE) framework with a dual-task setup: reconstructing masked Sentinel-2 images and predicting the corresponding Sentinel-1 images. This multi-task design enables the encoder to capture both spectral and structural features across diverse environmental conditions. Additionally, we introduce a "mixing" strategy in the pre-training phase, combining patches from both image sources, which mitigates spatial misalignment errors and enhances model robustness. Evaluation on segmentation and classification tasks, including Sen1Floods11 and BigEarthNet, demonstrates significant improvements in adaptability and generalizability across varied downstream remote sensing applications. Our findings highlight the advantages of leveraging complementary modalities for more resilient and versatile land cover analysis.
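
The dual-task objective with patch mixing can be sketched as follows, assuming hypothetical `encoder` and `decoder` modules in which the decoder predicts all patch positions (mask tokens inserted internally); the actual architecture, masking ratios, and losses are those of the paper, not this simplification.

```python
# Conceptual sketch of the dual-task objective and patch "mixing" described above:
# reconstruct masked Sentinel-2 patches and predict the corresponding Sentinel-1
# patches from a mixed-modality input. `encoder` and `decoder` are hypothetical
# modules; the decoder is assumed to output predictions for all N patch positions.
import torch
import torch.nn.functional as F


def mixed_modality_mae_loss(encoder, decoder, s2_patches, s1_patches, mask, mix):
    """s2_patches, s1_patches: (N, D) patch embeddings from co-registered tiles.
    mask: (N,) bool, True where the Sentinel-2 patch is masked out.
    mix:  (N,) bool, True where the visible patch is taken from Sentinel-1 instead of Sentinel-2.
    """
    # Build the visible input: masked positions are dropped, and some visible
    # positions are swapped to the radar modality (the mixing strategy).
    visible = torch.where(mix.unsqueeze(-1), s1_patches, s2_patches)[~mask]
    latent = encoder(visible)
    s2_pred, s1_pred = decoder(latent)           # dual-task heads, each (N, D)
    # Task 1: reconstruct the masked optical patches.
    loss_s2 = F.mse_loss(s2_pred[mask], s2_patches[mask])
    # Task 2: predict the corresponding radar patches.
    loss_s1 = F.mse_loss(s1_pred, s1_patches)
    return loss_s2 + loss_s1
```
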
    Pixel-accurate Segmentation of Surgical Tools based on Bounding Box Annotations
    Tomer Golany
    Daniel Freedman
    Ehud Rivlin
    Amit Aides
    2022 26th International Conference on Pattern Recognition (ICPR), pp. 5096-5103
    Abstract: Detection and segmentation of surgical instruments is an important problem in laparoscopic surgery. Accurate pixel-wise instrument segmentation serves as an intermediate task in the development of computer-assisted surgery systems, such as pose estimation, surgical phase estimation, enhanced image fusion, and video retrieval. In this paper we describe our deep learning-based approach to instrument segmentation, which addresses the binary segmentation problem in which every pixel in an image is labeled as instrument or background. Our approach relies on weak annotations provided as bounding boxes of the instruments, which are much faster and cheaper to obtain than dense pixel-level annotations. To improve accuracy further, we propose a novel approach for generating synthetic training images. Our approach achieves state-of-the-art results, outperforming previously proposed methods for automatic instrument segmentation based on weak annotations only.
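
One way to turn bounding-box annotations into a weak binary-segmentation target is sketched below: box interiors are treated as coarse instrument regions, with an ignored band near each box border where the label is least reliable. This is an illustrative simplification, not the paper's training procedure or its synthetic-image generation.

```python
# Illustrative sketch of deriving a weak binary target from bounding-box
# annotations for instrument-vs-background segmentation. A simplification,
# not the paper's actual method.
import numpy as np


def boxes_to_weak_mask(height, width, boxes, ignore_margin=8):
    """Return (target, weight) maps from instrument bounding boxes.

    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates.
    Pixels outside all boxes are background; pixels inside a box are treated as
    instrument, except for a thin band along the box border that is ignored,
    since the box interior is only a coarse proxy for the true instrument shape.
    """
    target = np.zeros((height, width), dtype=np.float32)   # 0 = background, 1 = instrument
    weight = np.ones((height, width), dtype=np.float32)    # 1 = contributes to the loss
    for x_min, y_min, x_max, y_max in boxes:
        target[y_min:y_max, x_min:x_max] = 1.0              # coarse instrument region
        # Ignore the whole box first, then re-enable the more reliable interior.
        weight[y_min:y_max, x_min:x_max] = 0.0
        inner_y0, inner_y1 = y_min + ignore_margin, y_max - ignore_margin
        inner_x0, inner_x1 = x_min + ignore_margin, x_max - ignore_margin
        if inner_y1 > inner_y0 and inner_x1 > inner_x0:
            weight[inner_y0:inner_y1, inner_x0:inner_x1] = 1.0
    return target, weight
```
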