Simple Open-Vocabulary Object Detection with Vision Transformers

Matthias Minderer

Alexey Alexeevich Gritsenko

Austin Stone

Maxim Neumann

Dirk Weissenborn

Alexey Dosovitskiy

Aravindh Mahendran

Anurag Arnab

Mostafa Dehghani

Zhuoran Shen

Xiao Wang

Xiaohua Zhai

Thomas Kipf

Neil Houlsby

ECCV (Poster) (2022)

Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub (https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
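
To make the idea of text-conditioned detection more concrete, the following is a minimal sketch, not the released implementation. It assumes the model has already produced one embedding and one predicted box per image token, and one embedding per text query, all in a shared space learned by contrastive pre-training. Each token is then labeled with its most similar query. The function names (`l2_normalize`, `score_tokens`, `detect`), the plain cosine-similarity score, the fixed threshold, and the toy vectors are all assumptions made for illustration; the actual code in the repository linked above is written in JAX and differs in detail.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def score_tokens(token_embeddings, query_embeddings):
    """Cosine similarity between every image token and every query embedding."""
    tokens = [l2_normalize(t) for t in token_embeddings]
    queries = [l2_normalize(q) for q in query_embeddings]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


def detect(token_embeddings, boxes, query_embeddings, query_names, threshold=0.5):
    """Keep each token's box if its best-matching query scores above threshold.

    The threshold is a hypothetical parameter chosen for this sketch.
    """
    detections = []
    scores = score_tokens(token_embeddings, query_embeddings)
    for token_scores, box in zip(scores, boxes):
        best = max(range(len(token_scores)), key=token_scores.__getitem__)
        if token_scores[best] >= threshold:
            detections.append((query_names[best], box, token_scores[best]))
    return detections


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real model outputs.
    tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3), (0.8, 0.8, 0.1, 0.1)]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    for name, box, score in detect(tokens, boxes, queries, ["cat", "dog"], 0.9):
        print(name, box, round(score, 3))
```

Because the queries are just embeddings, the same scoring step would apply unchanged if the query vectors were derived from an example image rather than from text, which is one way to read the one-shot image-conditioned setting mentioned in the abstract.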
