Forrester Cole
Forrester is a software engineer working on computer vision and graphics research, particularly 3D understanding of images and videos.
Prior to Google, Forrester was a postdoctoral researcher at Pixar Animation Studios and MIT. He completed his PhD at Princeton University under Adam Finkelstein.
Authored Publications
Associating Objects and their Effects in Unconstrained Monocular Video
Erika Lu
Zhengqi Li
Leonid Sigal
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2023
We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g. shadows and reflections). Our approach is designed for unconstrained monocular videos, with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object. To solve the underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current methods.
DynIBaR: Neural Dynamic Image-Based Rendering
Zhengqi Li
Qianqian Wang
Computer Vision and Pattern Recognition (CVPR) (2023)
We address the problem of synthesizing novel views from a monocular video depicting complex dynamic scenes.
State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task.
However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications.
Rather than encoding a dynamic scene within the weights of MLPs, we present a new method that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner.
Our system retains the ability to model complex scenes and view-dependent effects, while also enabling the synthesis of photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories.
We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.
SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
Itai Lang
Shai Avidan
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2023
Scene flow estimation is a long-standing problem in computer vision, where the goal is to find the scene's 3D motion from its consecutive observations. Recently, there has been a research effort to compute scene flow using 3D point clouds. One main approach is to train a regression model that consumes source and target point clouds and outputs the per-point translation vector. An alternative approach is to learn point correspondence between the point clouds, concurrently with a refinement regression of the initial flow. In both approaches the task is very challenging, since the flow is regressed in the free 3D space, and a typical solution is to resort to a large annotated synthetic dataset.
We introduce SCOOP, a new method for scene flow estimation that can be learned on a small amount of data without using ground-truth flow supervision. In contrast to previous works, we train a pure correspondence model that is focused on learning point feature representation, and initialize the flow as the difference between a source point and its softly corresponding target point. Then, at test time, we directly optimize a flow refinement component with a self-supervised objective, which leads to a coherent flow field between the point clouds. Experiments on widely used datasets demonstrate the performance gains achieved by our method compared to existing leading techniques.
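The flow initialization described in this abstract (the difference between a source point and its softly corresponding target point) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name, the dot-product similarity, the softmax weighting, and the `temperature` parameter are hypothetical and are not taken from the paper, and the test-time refinement step is not shown.

```python
import math

def soft_correspondence_flow(src_pts, tgt_pts, src_feats, tgt_feats, temperature=0.1):
    """Illustrative only: initial flow = soft target point minus source point.

    Points are 3D lists; features are per-point vectors of equal length.
    The softmax-over-dot-products matching is an assumption of this sketch.
    """
    flows = []
    for p, f in zip(src_pts, src_feats):
        # Similarity of this source feature to every target feature.
        sims = [sum(a * b for a, b in zip(f, g)) / temperature for g in tgt_feats]
        # Numerically stable softmax turns similarities into matching weights.
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Soft target: weighted average of all target points.
        soft_tgt = [sum(w * q[d] for w, q in zip(weights, tgt_pts)) for d in range(3)]
        # Flow is the displacement from the source point to its soft target.
        flows.append([soft_tgt[d] - p[d] for d in range(3)])
    return flows
```

With well-separated features, each source point is matched almost entirely to one target point, so the initial flow approaches the displacement between the matched pair.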
Preview abstract
The goal of this project is to learn a 3D shape representation that enables accurate surface reconstruction, compact storage, efficient computation, consistency for similar shapes, generalization across diverse shape categories, and inference from depth camera observations. Towards this end, we introduce Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions. We provide networks that infer the space decomposition and local deep implicit functions from a 3D mesh or posed depth image. During experiments, we find that it provides 10.3 points higher surface reconstruction accuracy (F-Score) than the state-of-the-art (OccNet), while requiring fewer than 1 percent of the network parameters. Experiments on posed depth image completion and generalization to unseen classes show 15.8 and 17.8 point improvements over the state-of-the-art, while producing a structured 3D representation for each input with consistency across diverse shape collections.
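For readers unfamiliar with the F-Score metric quoted above, the following is a minimal sketch of how an F-Score between two point sets is commonly computed: the harmonic mean of precision and recall at a distance threshold. The function name, the brute-force nearest-neighbor search, and the threshold `tau` are assumptions of this sketch; the paper's exact evaluation protocol (sampling density, threshold value) is not stated in the abstract.

```python
import math

def f_score(pred_pts, gt_pts, tau):
    """Illustrative F-Score between predicted and ground-truth surface samples.

    precision: fraction of predicted points within tau of some ground-truth point.
    recall:    fraction of ground-truth points within tau of some predicted point.
    """
    def nearest(p, pts):
        # Brute-force nearest-neighbor distance (fine for small examples).
        return min(math.dist(p, q) for q in pts)

    precision = sum(nearest(p, gt_pts) <= tau for p in pred_pts) / len(pred_pts)
    recall = sum(nearest(q, pred_pts) <= tau for q in gt_pts) / len(gt_pts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The returned value lies in [0, 1]; multiplying by 100 gives a score in points, which is how improvements such as those above are usually reported.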
Layered Neural Rendering for Retiming People in Video
Erika Lu
Weidi Xie
Andrew Zisserman
ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (2020)
We present a method for retiming people in an ordinary, natural video: manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate, e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running.
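The idea of retiming and recombining RGBA layers can be illustrated with a minimal sketch using the standard back-to-front "over" compositing operator. This is not the paper's implementation: the function names, the per-layer integer frame offsets, the clamping at the ends of the video, and the use of non-premultiplied colors are all assumptions made for illustration, and the learned decomposition that produces the layers is not shown.

```python
def composite_over(layers):
    """Composite RGBA layers back to front with the 'over' operator.

    Each layer is a list of (r, g, b, a) pixels in [0, 1]; layers[0] is the back.
    Colors are assumed to be non-premultiplied.
    """
    out = [(0.0, 0.0, 0.0)] * len(layers[0])
    for layer in layers:
        out = [
            tuple(a * c + (1.0 - a) * o for c, o in zip((r, g, b), prev))
            for (r, g, b, a), prev in zip(layer, out)
        ]
    return out

def retime(layer_videos, offsets, t):
    """Pick frame t + offset from each layer's video (clamped), then composite."""
    frames = []
    for video, off in zip(layer_videos, offsets):
        idx = min(max(t + off, 0), len(video) - 1)
        frames.append(video[idx])
    return composite_over(frames)
```

In this sketch, giving one layer a nonzero offset shifts that layer in time relative to the others, a constant frame index would "freeze" it, and omitting the layer from `layer_videos` would erase it from the output.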
Preview abstract
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
Learning the Depths of Moving People by Watching Frozen People
Zhengqi Li
Ce Liu
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and often can recover only sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a large corpus of data. Specifically, we use a new source of data comprised of thousands of Internet videos in which people imitate mannequins, i.e., people freeze in diverse, natural poses, while a hand-held camera is touring the scene. We then create training data using modern Multi-View Stereo (MVS) methods, and design a model that is applied to dynamic scenes at inference time. Our method makes use of motion parallax beyond single view and shows clear advantages over state-of-the-art monocular depth prediction methods. We demonstrate the applicability of our method on real-world sequences captured by a moving hand-held camera, depicting complex human actions. We show various 3D effects such as re-focusing, creating a stereoscopic video from a monocular one, and inserting virtual objects into the scene, all produced using our predicted depth maps.
Unsupervised Training for 3D Morphable Model Regression
Kyle Genova
Aaron Maschinot
Daniel Vlasic
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch regularization loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the regression network can correctly reinterpret its own output, and a multi-view loss that compares the predicted 3D face to the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.
XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings
Amelie Royer
Stephan Gouws
Fred Bertsch
ICML Workshop (2017)
Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. The cartoon dataset we collected for this purpose is in the process of being released as a new benchmark for semantic style transfer.
Synthesizing Normalized Faces from Facial Identity Features
Dilip Krishnan
Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.