Jonathan Huang
I am a research scientist at Google working on machine learning, computer vision, and NLP projects. Most recently, I led the team that won 1st place in the COCO object detection challenge.
Prior to Google, I was a postdoctoral fellow in the Computer Science Department at Stanford University, supported by an NSF/CRA CI (Computing Innovations) fellowship. At Stanford I was a member of the Geometric Computation Group, which is headed by Leonidas Guibas. I was also part of the Lytics Lab, a multidisciplinary group focused on Learning Analytics. A more complete publication list can be found at my personal webpage.
Authored Publications
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
The Auto-Arborist Dataset: A Large-Scale Benchmark for Generalizable, Multimodal Urban Forest Monitoring
Sara Meghan Beery
Guanhang Wu
Trevor Edwards
Filip Pavetić
Bo Majewski
Stan Chan
John Morgan
Vivek Mansing Rathod
CVPR 2022 (2022)
Urban forests provide significant benefits to urban societies (e.g., cleaner air and water, carbon sequestration, and energy savings, among others). However, planning and maintaining these forests is expensive. One particularly costly aspect of urban forest management is monitoring the existing trees in a city, i.e., tracking tree locations, species, and health. Monitoring efforts are currently based on tree censuses built by human experts, collected at a rate of once every five years or less and costing cities millions of dollars. In this paper we explore the use of computer vision to automatically find, label, and monitor individual trees at a large scale using a combination of street-level and aerial imagery.
Previous investigations into automating this process focused on small datasets from single cities, covering only common species \cite{Branson2018, sumbul2017fine}. These datasets fail to capture the complexity of the problem, which is both fine-grained and significantly long-tailed, and result in methods that are not applicable to new cities.
To address this shortcoming, we introduce a new large-scale dataset that joins public tree inventories (maintained by cities) with a large collection of street-level and aerial imagery. Our Auto-Arborist dataset contains over 2.5 million trees covering more than 340 genus-level categories from North America and is currently at least two orders of magnitude larger than the closest comparable dataset in the literature. Uniquely, we cover multiple cities (to our knowledge, prior works have restricted their focus to single-city datasets), which allows for an analysis of generalization with respect to geographic distribution shifts that was not previously possible.
We propose a set of metrics to evaluate performance, especially with respect to these geographic distribution shifts, and show the strengths and weaknesses of typical deep learning models when applied to the Auto-Arborist dataset. We hope our dataset can be an important and exciting new scientific benchmark that will spur progress on the application of computer vision to urban ecology and sustainability.
PERF-Net: Pose Empowered RGB-Flow Net
Zhichao Lu
Xuehan Xiong
IEEE Winter Conference on Applications of Computer Vision (2021)
In recent years, many works in the video action recognition literature have shown that two-stream models (combining spatial and temporal input streams) are necessary for achieving state-of-the-art performance. In this paper we show the benefits of including yet another stream based on human pose estimated from each frame, specifically by rendering pose on input RGB frames. At first blush, this additional stream may seem redundant given that human pose is fully determined by RGB pixel values; however, we show (perhaps surprisingly) that this simple and flexible addition can provide complementary gains. Using this insight, we propose a new model, which we dub PERF-Net (short for Pose Empowered RGB-Flow Net), that combines this new pose stream with the standard RGB and flow-based input streams via distillation techniques. We show that our model outperforms the state of the art by a large margin on a number of human action recognition datasets while not requiring flow or pose to be explicitly computed at inference time. The proposed pose stream is also part of the winning solution of the ActivityNet Kinetics Challenge 2020.
Preview abstract
This paper presents a weakly-supervised approach to object instance segmentation. Starting with known or predicted object bounding boxes, we learn object masks by playing a game of cut-and-paste in an adversarial learning setup. A mask generator takes a detection box and Faster R-CNN features, and constructs a segmentation mask that is used to cut-and-paste the object into a new image location. The discriminator tries to distinguish between real objects, and those cut and pasted via the generator, giving a learning signal that leads to improved object masks. We verify our method experimentally using Cityscapes, COCO, and aerial image datasets, learning to segment objects without ever having seen a mask in training. Our method exceeds the performance of existing weakly supervised methods, without requiring hand-tuned segment proposals, and reaches 90% of supervised performance.
Preview abstract
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfitting. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs, including separable spatial/temporal convolution and feature gating, the result is an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
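As a rough illustration of why a separable spatial/temporal convolution is cheaper than a full 3D convolution, the sketch below counts weights for both. It assumes one common factorization (a 1 x k x k spatial convolution followed by a k_t x 1 x 1 temporal convolution, with the intermediate channel count equal to the output channel count) and ignores biases; the function names and layer sizes are hypothetical and are not taken from the paper.

```python
def conv3d_params(c_in: int, c_out: int, k_t: int, k_s: int) -> int:
    """Weights in a full 3D convolution with a k_t x k_s x k_s kernel (no bias)."""
    return c_in * c_out * k_t * k_s * k_s


def separable_params(c_in: int, c_out: int, k_t: int, k_s: int) -> int:
    """Weights in an assumed spatial-then-temporal factorization (no bias).

    Spatial step:  1 x k_s x k_s kernel, c_in  -> c_out channels.
    Temporal step: k_t x 1 x 1 kernel,   c_out -> c_out channels.
    """
    spatial = c_in * c_out * k_s * k_s
    temporal = c_out * c_out * k_t
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, k_t=3, k_s=3)
separable = separable_params(64, 64, k_t=3, k_s=3)
print(full, separable, full / separable)
```

For this hypothetical layer the full 3D kernel has 27 weights per input/output channel pair, while the factorized version has 9 + 3 = 12, so the weight count drops by a factor of 2.25. Actual savings depend on the channel widths chosen for the intermediate feature map.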
Progressive Neural Architecture Search
Chenxi Liu
Barret Zoph
Maxim Neumann
Jonathan Shlens
Wei Hua
Jia Li
Fei-Fei Li
Alan Yuille
ECCV (2018)
We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.
Spatially Adaptive Computation Time for Residual Networks
Dmitry P. Vetrov
Maxwell Collins
Michael Figurnov
Ruslan Salakhutdinov
Yukun Zhu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
This paper proposes a deep learning architecture based on Residual Networks that dynamically adjusts the number of executed layers for different regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of ResNet on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the image saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.
Speed and accuracy trade-offs for modern convolutional object detectors
Anoop Korattikara
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
CVPR 2017, Honolulu, Hawaii (2017)
The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN \cite{ren2015faster}, R-FCN \cite{dai2016r} and SSD \cite{liu2015ssd} systems, which we view as "meta-architectures" and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao
Alexander Toshev
Oana Camburu
Computer Vision and Pattern Recognition (2016)
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MSCOCO. We have released the dataset and a toolbox for visualization and evaluation, see https://github.com/mjhucla/Google_Refexp_toolbox.
G-RMI Object Detection
Anoop Korattikara
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop, Amsterdam (2016)
We present our submission to the COCO 2016 Object Detection challenge.