George Leifman
I am a Research Scientist at Google Research, working on Google Earth AI, where I develop vision-language foundation models for satellite and aerial imagery.
My previous industry experience includes building computer vision systems for Amazon Go and Amazon One at Amazon Research, and developing ML/CV projects in the healthcare domain at Google Health and Verily.
I hold a Ph.D. in Electrical Engineering from the Technion and completed a Postdoctoral Fellowship at the MIT Media Lab.
Authored Publications
Enhancing Remote Sensing Representations through Mixed-Modality Masked Autoencoding
Ori Linial
Yochai Blau
Nadav Sherman
Yotam Gigi
Wojciech Sirko
Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops (2025), pp. 507-516
Abstract
This paper presents an innovative approach to pre-training models for remote sensing by integrating optical and radar data from Sentinel-2 and Sentinel-1 satellites. Using a novel variation on the masked autoencoder (MAE) framework, our model incorporates a dual-task setup: reconstructing masked Sentinel-2 images and predicting corresponding Sentinel-1 images. This multi-task design enables the encoder to capture both spectral and structural features across diverse environmental conditions. Additionally, we introduce a "mixing" strategy in the pretraining phase, combining patches from both image sources, which mitigates spatial misalignment errors and enhances model robustness. Evaluation on segmentation and classification tasks, including Sen1Floods11 and BigEarthNet, demonstrates significant improvements in adaptability and generalizability across varied downstream remote sensing applications. Our findings highlight the advantages of leveraging complementary modalities for more resilient and versatile land cover analysis.
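Illustrative sketch (not the paper's released code): a minimal PyTorch rendering of the dual-task, mixed-modality MAE objective, in which Sentinel-2 and Sentinel-1 patch tokens are mixed, partially masked, and decoded by two reconstruction heads. The patch dimensions, network sizes, 50/50 mixing rule and equal loss weighting are assumptions made for brevity, not the paper's exact configuration.

```python
# Hypothetical sketch of the mixed-modality MAE objective: patches from
# Sentinel-2 (optical) and Sentinel-1 (radar) are mixed into one token
# sequence, part of it is masked, and two heads reconstruct each modality.
import torch
import torch.nn as nn

class MixedModalityMAE(nn.Module):
    def __init__(self, patch_dim_s2=16 * 16 * 13, patch_dim_s1=16 * 16 * 2,
                 dim=256, depth=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed_s2 = nn.Linear(patch_dim_s2, dim)   # optical patch embedding
        self.embed_s1 = nn.Linear(patch_dim_s1, dim)   # radar patch embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 2)
        self.head_s2 = nn.Linear(dim, patch_dim_s2)    # reconstruct masked S2 patches
        self.head_s1 = nn.Linear(dim, patch_dim_s1)    # predict the aligned S1 patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, s2_patches, s1_patches):
        # s2_patches, s1_patches: (batch, num_patches, patch_dim_*)
        b, n, _ = s2_patches.shape
        tok_s2 = self.embed_s2(s2_patches)
        tok_s1 = self.embed_s1(s1_patches)

        # "Mixing": at each position, randomly feed the S2 or the S1 token.
        use_s1 = torch.rand(b, n, 1, device=tok_s2.device) < 0.5
        tokens = torch.where(use_s1, tok_s1, tok_s2)

        # Randomly mask a fraction of the mixed tokens.
        masked = torch.rand(b, n, 1, device=tokens.device) < self.mask_ratio
        tokens = torch.where(masked, self.mask_token.expand(b, n, -1), tokens)

        dec = self.decoder(self.encoder(tokens))

        # Dual task: reconstruct masked S2 patches, predict S1 everywhere.
        loss_s2 = ((self.head_s2(dec) - s2_patches) ** 2)[masked.squeeze(-1)].mean()
        loss_s1 = ((self.head_s1(dec) - s1_patches) ** 2).mean()
        return loss_s2 + loss_s1
```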
Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning
Aaron Bell
Aviad Barzilai
Roy Lee
Gia Jung
Charles Elliott
Adam Boulanger
Amr Helmy
Jacob Bien
Ruth Alcantara
Nadav Sherman
Hassler Thurston
Yotam Gigi
Bolous Jaber
Vered Silverman
Luke Barrington
Tim Thelin
Elad Aharoni
Kartik Hegde
Yuval Carny
Shravya Shetty
Yehonathan Refael
Stone Jiang
David Schottlander
Juliet Rothenberg
Luc Houriez
Yochai Blau
Joydeep Paul
Yang Chen
Yael Maguire
Aviv Slobodkin
Shlomi Pasternak
Alex Ottenwess
Jamie McPike
Per Bjornsson
Natalie Williams
Reuven Sayag
Thomas Turnbull
Ali Ahmadalipour
David Andre
Amit Aides
Ean Phing VanLee
Niv Efron
Monica Bharel
arXiv preprint arXiv:2510.18318 (2025), https://doi.org/10.48550/arXiv.2510.18318
Abstract
Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. The emergence of Foundation Models (FMs) and Large Language Models (LLMs) offers an unprecedented opportunity to tackle some of this complexity, unlocking novel and profound insights into our planet.
This paper introduces a comprehensive approach to developing Earth AI solutions, built upon foundation models across three key domains—Planet-scale Imagery, Population, and Environment—and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that they provide complementary value to improve geospatial inference. We show that the synergy between these models unlocks superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools to unlock novel geospatial insights. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.
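Purely illustrative sketch of the cross-modal reasoning pattern the abstract describes: a planner routes a crisis query to imagery, population and environment models and composes their outputs. Every function below is a hypothetical stub, including the keyword-based `plan`, which stands in for the Gemini-powered agent; none of it reflects the actual Earth AI API.

```python
# Hypothetical orchestration loop: a planner decides which geospatial
# foundation model to call, then composes the tool outputs into an answer.
# All model calls below are stand-in stubs, not real endpoints.
from dataclasses import dataclass

@dataclass
class ToolResult:
    name: str
    summary: str

def imagery_model(region: str) -> ToolResult:        # stub: planet-scale imagery FM
    return ToolResult("imagery", f"flood extent estimated over {region}")

def population_model(region: str) -> ToolResult:     # stub: population FM
    return ToolResult("population", f"~120k people within the affected area of {region}")

def environment_model(region: str) -> ToolResult:    # stub: environment FM
    return ToolResult("environment", f"48h rainfall forecast for {region}")

TOOLS = {"imagery": imagery_model,
         "population": population_model,
         "environment": environment_model}

def plan(query: str) -> list[str]:
    # In the real system a Gemini-based agent chooses the tools and their
    # order; here a trivial keyword heuristic stands in for that reasoning.
    wanted = [name for name in TOOLS if name in query.lower()]
    return wanted or list(TOOLS)

def answer(query: str, region: str) -> str:
    results = [TOOLS[name](region) for name in plan(query)]
    joined = "; ".join(f"{r.name}: {r.summary}" for r in results)
    return f"Query: {query}\nEvidence -> {joined}"

if __name__ == "__main__":
    print(answer("How many people are exposed to the current flood?", "Porto Alegre"))
```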
A Remote Sensing Vision-Language Foundation Model for Zero-Shot Tasks
Aviad Barzilai
Amr Helmy
Yotam Gigi
Vered Silverman
Yehonathan Refael
2025
Abstract
Foundation models have revolutionized AI, particularly in visual-language tasks, achieving unparalleled performance across domains. Despite advancements, remote sensing (RS) remains underserved due to the lack of large-scale image-text datasets. This paper addresses the gap by introducing two novel datasets: RS-WebLI and Google Maps, specifically designed for training remote sensing vision-language models (RS-VLMs).
The RS-WebLI dataset leverages web images filtered for RS relevance, enriched with high-quality captions derived from associated alt-text. The Google Maps dataset utilizes Gemini, a multi-modal large language model, to generate accurate and descriptive captions by aligning Google Maps data with high-resolution satellite and aerial imagery. These datasets together encompass a vast and diverse array of remote sensing objects and contexts, forming a robust foundation for RS-specific tasks. The two datasets together incorporate around 20M image and text pairs.
We fine-tuned Mammut, a state-of-the-art (SOTA) vision-language model, using these datasets. The model employs a contrastive learning framework, enabling robust zero-shot capabilities. Moreover, the Mammut architecture incorporates a generative loss component, further enhancing its adaptability. To evaluate the model's zero-shot performance, we used two main methods. The first, zero-shot classification, tests the model's ability to classify a remote sensing image into a pre-defined set of classes without training directly on the dataset. For this task we use the following RS image classification datasets: Functional Map of the World (FMOW), RESISC45, UCM Classification and SkyScript classification. For every dataset, we composed a set of sentences of the form "An aerial image of {class name}" and used a simple nearest-neighbor algorithm to find the best-matching class for every image; the metric is top-1 accuracy. The second evaluation method is zero-shot retrieval, for which we use the following remote sensing image-caption datasets: NWPU RESISC, UCM Captions, RSITMD and RSICD. As in zero-shot classification, we use nearest neighbors on the model's output embeddings, here to match every image to its corresponding captions. As in other works in the field, we report the average of the top-1, top-5 and top-10 recall scores.
The study also evaluates supervised learning regimes, where the VLMs are fine-tuned on task-specific datasets like FMOW and FloodNet. These models outperform traditional masked-image models, showcasing the advantage of leveraging vision-language pre-training for RS applications. To assess generalization, the Google Maps Hold-out dataset was introduced, excluding specific object types during training. Results indicate the model's strong ability to recognize unseen objects, validating its versatility.
This work establishes a comprehensive framework for developing RS-VLMs, addressing dataset limitations and model scalability. It sets a precedent for leveraging foundation models in RS, paving the way for enhanced zero-shot and fine-tuned applications in remote sensing analytics. Future directions include expanding dataset diversity and exploring advanced architectures to further push the boundaries of RS vision-language understanding.
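Sketch of the zero-shot classification protocol described above: one prompt per class, nearest neighbors in the joint embedding space, top-1 accuracy. `encode_image` and `encode_text` are placeholders for the fine-tuned vision-language model; the prompt template follows the abstract.

```python
# Zero-shot classification with a contrastive vision-language model:
# build one text prompt per class, embed prompts and images, and assign
# each image to the class whose prompt embedding is nearest (cosine).
# `encode_image` / `encode_text` are placeholders for the trained model.
import numpy as np

def zero_shot_classify(images, class_names, encode_image, encode_text):
    prompts = [f"An aerial image of {name}" for name in class_names]
    text_emb = np.stack([encode_text(p) for p in prompts])        # (C, D)
    text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

    img_emb = np.stack([encode_image(im) for im in images])       # (N, D)
    img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)

    sims = img_emb @ text_emb.T                                   # (N, C) cosine similarity
    return sims.argmax(axis=1)                                    # predicted class per image

def top1_accuracy(pred_idx, true_idx):
    return float(np.mean(np.asarray(pred_idx) == np.asarray(true_idx)))
```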
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Yehonathan Refael
Amit Aides
Aviad Barzilai
Vered Silverman
Bolous Jaber
(2025)
Abstract
Open-vocabulary object detection (OVD) models offer remarkable flexibility by enabling object detection from arbitrary text queries. Still, the zero-shot performance of pre-trained models is hampered by the inherent semantic ambiguity of natural language, resulting in low precision that is insufficient for crucial downstream applications. For instance, in the remote sensing (RS) domain, a query for "ship" can yield varied and contextually irrelevant results. To address this for real-time applications, we propose a novel cascaded architecture that synergizes the broad capabilities of a large, pre-trained OVD model with a lightweight, few-shot classifier. Our approach uses the frozen weights of the zero-shot model to generate initial, high-recall object-embedding proposals, which are then refined by a compact classifier trained in real time on a handful of user-annotated examples. The core of our contribution is an efficient one-step active learning strategy for selecting the most informative samples for user annotation. Our method identifies an extremely small number of uncertain candidates near the theoretical decision boundary using density estimation, and then applies clustering to ensure a diverse training set. This targeted sampling enables our cascaded system to elevate performance on standard remote sensing benchmarks. Our work thus presents a practical and resource-efficient framework for adapting foundation models to specific user needs, drastically reducing annotation overhead while achieving high accuracy without costly full-model fine-tuning.
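Simplified sketch of the active sample-selection step: score proposal embeddings with the current few-shot classifier, keep the candidates closest to the decision boundary, then cluster them so the user annotates a small but diverse set. The margin criterion, the k-means choice and the pool sizes are illustrative assumptions rather than the paper's exact recipe (which also uses density estimation).

```python
# Active selection of annotation candidates from OVD proposal embeddings:
# 1) uncertainty = proximity to the few-shot classifier's decision boundary,
# 2) diversity   = one representative per cluster among the uncertain pool.
import numpy as np
from sklearn.cluster import KMeans

def select_for_annotation(embeddings, clf, n_uncertain=50, n_queries=8):
    """embeddings: (N, D) proposal embeddings from the frozen OVD model.
    clf: any classifier with predict_proba, trained on the labels gathered so far."""
    probs = clf.predict_proba(embeddings)[:, 1]          # P(object of interest)
    margin = np.abs(probs - 0.5)                         # small margin = near boundary
    uncertain = np.argsort(margin)[:n_uncertain]         # most ambiguous proposals

    # Cluster the uncertain pool and pick the proposal nearest each centroid,
    # so the handful of user annotations covers diverse visual modes.
    km = KMeans(n_clusters=n_queries, n_init=10).fit(embeddings[uncertain])
    picks = []
    for c in range(n_queries):
        members = uncertain[km.labels_ == c]
        if len(members) == 0:
            continue
        d = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(d)]))
    return picks                                         # indices to show the user
```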
A Recipe for Improving Remote Sensing Zero Shot Generalization
Aviad Barzilai
Yotam Gigi
Vered Silverman
Yehonathan Refael
Bolous Jaber
Amr Helmy
3rd ML4RS Workshop at ICLR 2025
Abstract
Foundation models have had a significant impact across AI, enabling use cases that were previously impossible. Visual language models (VLMs), in particular, have outperformed other techniques in many tasks. In remote sensing (RS), foundation models have shown improvements across various applications. However, unlike in other fields, the use of VLMs with large-scale remote sensing image-text datasets remains limited.
In this work, we first introduce two novel image-caption datasets for training remote sensing foundation models. The first dataset pairs aerial and satellite imagery, aligned with Google Maps data, with high-quality captions generated using Gemini. The second utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain, resulting in a highly diverse dataset.
We show that using these datasets to pre-train Mammut, a VLM architecture, results in state-of-the-art generalization performance in zero-shot classification and cross-modal retrieval on well-known public benchmarks. Second, we leverage this newly pre-trained VLM to generate inference attention maps for a novel class query (i.e., a class unseen during training). We then propose an iterative self-supervised fine-tuning approach in which samples aligned with these attention maps are iteratively pseudo-labeled and used for model training.
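Minimal sketch of the iterative self-supervised fine-tuning loop, under the assumption that attention-map alignment can be summarized as a per-image confidence score; `score_alignment` and `fine_tune` are hypothetical helpers standing in for the model-specific pieces.

```python
# Iterative pseudo-labeling for a novel class: score unlabeled images by how
# well the VLM's attention maps align with the class query, promote the most
# confident ones to pseudo-labels, fine-tune, and repeat.
def iterative_pseudo_label(model, unlabeled_images, class_query,
                           score_alignment, fine_tune,
                           rounds=3, threshold=0.8):
    """score_alignment(model, image, query) -> float in [0, 1] (assumed helper).
    fine_tune(model, images, label) -> model (assumed helper)."""
    pseudo_labeled = []
    pool = list(unlabeled_images)
    for _ in range(rounds):
        scored = [(score_alignment(model, im, class_query), im) for im in pool]
        confident = [im for s, im in scored if s >= threshold]
        if not confident:
            break                                # nothing crossed the threshold this round
        pseudo_labeled.extend(confident)
        pool = [im for s, im in scored if s < threshold]
        model = fine_tune(model, pseudo_labeled, class_query)
    return model, pseudo_labeled
```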
Anchored Diffusion for Video Face Reenactment
Idan Kligvasser
Regev Cohen
Ehud Rivlin
Michael Elad
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2025), pp. 4087-4097
Abstract
Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short- and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of transferring the motion from a driving video to a source face. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer, consistent, high-quality videos while offering editing capabilities.
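Small sketch of the anchoring idea at inference time: every frame in a batch is denoised while conditioned on one shared anchor frame and its own, possibly non-uniform, temporal offset. The `denoiser` callable and the simplified time schedule are placeholders; the point is the shared-anchor batch construction, not a faithful sampler.

```python
# Anchored sampling sketch: each generated frame is conditioned on a common
# anchor frame plus its own non-uniform time offset, so consistency does not
# depend on how far apart the frames are. `denoiser` is a placeholder model.
import torch

@torch.no_grad()
def sample_anchored_frames(denoiser, anchor_frame, time_offsets, steps=50):
    """anchor_frame: (C, H, W) tensor; time_offsets: list of floats (e.g. seconds)."""
    b = len(time_offsets)
    c, h, w = anchor_frame.shape
    x = torch.randn(b, c, h, w)                              # start every frame from noise
    anchors = anchor_frame.unsqueeze(0).expand(b, -1, -1, -1)
    offsets = torch.tensor(time_offsets).view(b, 1)          # external temporal guidance

    for i in reversed(range(steps)):
        t = torch.full((b,), i / steps)
        # One denoising step; the model sees the shared anchor and each
        # frame's own temporal offset, which is what keeps the batch coherent.
        x = denoiser(x, t, anchors, offsets)
    return x   # (b, C, H, W) frames at the requested, possibly non-uniform, times
```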
Pixel-accurate Segmentation of Surgical Tools based on Bounding Box Annotations
Tomer Golany
Daniel Freedman
Ehud Rivlin
Amit Aides
2022 26th International Conference on Pattern Recognition (ICPR), pp. 5096-5103
Abstract
Detection and segmentation of surgical instruments is an important problem in laparoscopic surgery. Accurate pixel-wise instrument segmentation serves as an intermediate task for the development of computer-assisted surgery systems, such as pose estimation, surgical phase estimation, enhanced image fusion, video retrieval and others. In this paper we describe our deep learning-based approach to instrument segmentation, which addresses the binary segmentation problem, where every pixel in an image is labeled as instrument or background. Our approach relies on weak annotations provided as bounding boxes of the instruments, which are much faster and cheaper to obtain than dense pixel-level annotations. To improve accuracy even further, we propose a novel approach for generating synthetic training images. Our approach achieves state-of-the-art results, outperforming previously proposed methods for automatic instrument segmentation while relying on weak annotations only.
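Hedged illustration (one common way to exploit box labels, not necessarily the paper's exact pipeline): pixels outside every box are treated as confident background, while the per-pixel loss inside boxes is down-weighted because a box contains both instrument and background pixels.

```python
# Weakly supervised binary segmentation from bounding boxes: build a pixel
# weight map that trusts "outside all boxes = background" and is less certain
# inside boxes, then train an ordinary per-pixel BCE loss with those weights.
# This illustrates the weak-supervision idea, not the paper's specific method.
import torch
import torch.nn.functional as F

def box_weak_targets(boxes, height, width, inside_weight=0.3):
    """boxes: list of (x0, y0, x1, y1) integer pixel coords of instrument boxes."""
    target = torch.zeros(height, width)        # tentative label: 1 inside boxes
    weight = torch.ones(height, width)         # confidence of each pixel's label
    for x0, y0, x1, y1 in boxes:
        target[y0:y1, x0:x1] = 1.0
        weight[y0:y1, x0:x1] = inside_weight   # boxes also contain background
    return target, weight

def weak_bce_loss(logits, boxes):
    """logits: (H, W) raw network outputs for one image."""
    h, w = logits.shape
    target, weight = box_weak_targets(boxes, h, w)
    return F.binary_cross_entropy_with_logits(logits, target, weight=weight)
```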