Jonathan Tompson

Jonathan Tompson

My research background covers a wide range of topics: computer vision and graphics, robotics, computational fluid dynamics, reinforcement learning, unsupervised learning, hand and human body tracking and analog IC design. You can find more of my projects at jonathantompson.com.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract In recent years, much progress has been made in learning robotic manipulation policies that can follow natural language instructions. Common approaches involve learning methods that operate on offline datasets, such as task-specific teleoperated demonstrations or on hindsight labeled robotic experience. Such methods work reasonably but rely strongly on the assumption of clean data: teleoperated demonstrations are collected with specific tasks in mind, while hindsight language descriptions rely on expensive human labeling. Recently, large-scale pretrained language and vision-language models like CLIP have been applied to robotics in the form of learning representations and planners. However, can these pretrained models also be used to cheaply impart internet-scale knowledge onto offline datasets, providing access to skills contained in the offline dataset that weren't necessarily reflected in ground truth labels? We investigate fine-tuning a reward model on a small dataset of robot interactions with crowd-sourced natural language labels and using the model to relabel instructions of a large offline robot dataset. The resulting dataset with diverse language skills is used to train imitation learning policies, which outperform prior methods by up to 30% when evaluated on a diverse set of novel language instructions that were not contained in the original dataset. View details
    InnerMonologue: Embodied Reasoning through Planning with Language Models
    Wenlong Huang
    Harris Chan
    Jacky Liang
    Pete Florence
    Andy Zeng
    Igor Mordatch
    Yevgen Chebotar
    Noah Brown
    Tomas Jackson
    Linda Luu
    Sergey Levine
    Karol Hausman
    Brian Andrew Ichter
    Conference on Robot Learning (2022) (to appear)
    Preview abstract Recent works have shown the capabilities of large language models to perform tasks requiring reasoning and to be applied to applications beyond natural language processing, such as planning and interaction for embodied robots.These embodied problems require an agent to understand the repertoire of skills available to a robot and the order in which they should be applied. They also require an agent to understand and ground itself within the environment. In this work we investigate to what extent LLMs can reason over sources of feedback provided through natural language. We propose an inner monologue as a way for an LLM to think through this process and plan. We investigate a variety of sources of feedback, such as success detectors and object detectors, as well as human interaction. The proposed method is validated in a simulation domain and on real robotic. We show that Innerlogue can successfully replan around failures, and generate new plans to accommodate human intent. View details
    Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks
    Daniel Seita
    Pete Florence
    Erwin Johan Coumans
    Ken Goldberg
    Andy Zeng
    IEEE International Conference on Robotics and Automation (ICRA) (2021)
    Preview abstract Rearranging and manipulating deformable objects such as cables, fabrics, and bags is a long-standing challenge in robotic manipulation. The complex dynamics and high-dimensional configuration spaces of deformables, compared to rigid objects, make manipulation difficult not only for multi-step planning, but even for goal specification. Goals cannot be as easily specified as rigid object poses, and may involve complex relative spatial relations such as ``place the item inside the bag". In this work, we develop a suite of simulated benchmarks with 1D, 2D, and 3D deformable structures, including tasks that involve image-based goal-conditioning and multi-step deformable manipulation. We propose embedding goal-conditioning into Transporter Networks, a recently proposed model architecture for robotic manipulation that uses learned template matching to infer displacements that can represent pick and place actions. We demonstrate that goal-conditioned Transporter Networks enable agents to manipulate deformable structures into flexibly specified configurations without test-time visual anchors for target locations. We also significantly extend prior results using Transporter Networks for manipulating deformable objects by testing on tasks with 2D and 3D deformables. View details
    Implicit Behavioral Cloning
    Pete Florence
    Corey Lynch
    Andy Zeng
    Oscar Ramirez
    Laura Downs
    Igor Mordatch
    CoRL (2021)
    Preview abstract We find that across a wide range of robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used explicit models. We present extensive experiments on this finding, and we provide both intuitive insight and theoretical arguments distinguishing the properties of implicit models compared to their explicit counterparts, particularly with respect to approximating complex, potentially discontinuous and multi-valued (set-valued) functions. On robotic policy learning tasks we show that implicit behavioral cloning policies with energy-based models (EBM) often outperform common explicit (Mean Square Error, or Mixture Density) behavioral cloning policies, including on tasks with high-dimensional action spaces and visual image inputs. We find these policies provide competitive results or outperform state-of-the-art offline reinforcement learning methods on the challenging human-expert tasks from the D4RL benchmark suite, despite using no reward information. In the real world, robots with implicit policies can learn complex and remarkably subtle behaviors on contact-rich tasks from human demonstrations, including tasks with high combinatorial complexity and tasks requiring 1mm precision. View details
    Preview abstract We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample efficient than current best methods. View details
    Preview abstract The need for understanding periodic videos is pervasive. Videos of biological processes, manufacturing processes, people exercising, objects being manipulated are only a few examples where the respective fields would benefit greatly if they were able to process periodic videos automatically. We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in leveraging temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen videos in the wild. We train this model with a synthetic dataset from a large unlabeled video dataset by sampling short clips of varying lengths and repeating them with different periods. However, simply training powerful video classification models on this synthetic dataset doesn't transfer to real videos. We constrain the period prediction model to use the self-similarity of temporal representations to ensure that the model generalizes to real videos with repeated actions. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our repetition counting model substantially exceeds the state of the art performance on existing periodicity benchmarks. We also collect a new challenging dataset called Countix which is more difficult than the existing datasets, capturing difficulties in repetition counting in videos in the real-world. We present extensive experiments on this dataset and hope this encourages more research in this important problem. View details
    Preview abstract We present the ADaptive Adversarial Imitation Learning (ADAIL) algorithm for learning adaptive policies that can be transferred between environments of varying dynamics, by imitating a small number of demonstrations collected from a single source domain. This problem is important in robotic learning because in real world scenarios 1) reward functions are hard to obtain, 2) learned policies from one domain are difficult to deploy in another due to varying source to target domain statistics, 3) collecting expert demonstrations in multiple environments where the dynamics are known and controlled is often infeasible. We address these constraints by building upon recent advances in adversarial imitation learning; we condition our policy on a learned dynamics embedding and we employ a domain-adversarial loss to learn a dynamics-invariant discriminator. The effectiveness of our method is demonstrated on simulated control tasks with varying environment dynamics and the learned adaptive agent outperforms several recent baselines. View details
    Imitation Learning via Off-Policy Distribution Matching
    Ilya Kostrikov
    Ofir Nachum
    Submission for NeurIPS workshop, ICLR conference (2020)
    Preview abstract When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one typically alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can consistently outperform state-of-the-art methods. View details
    Transporter Networks: Rearranging the Visual World for Robotic Manipulation
    Andy Zeng
    Pete Florence
    Stefan Welker
    Jonathan Chien
    Travis Armstrong
    Ivan Krasin
    Dan Duong
    Conference on Robot Learning (CoRL) (2020)
    Preview abstract Robotic manipulation can be formulated as inducing a sequence of spatial displacements: where the space being moved can encompass object(s) or an end effector. In this work, we propose the Transporter Network, a simple model architecture that rearranges deep features to infer spatial displacements from visual input -- which can parameterize robot actions. It makes no assumptions of objectness (e.g. canonical poses, models, or keypoints), it exploits spatial symmetries, and is orders of magnitude more sample efficient than our benchmarked alternatives in learning vision-based manipulation tasks: from stacking a pyramid of blocks, to assembling kits with unseen objects; from manipulating deformable ropes, to pushing piles of small objects with closed-loop feedback. Our method can represent complex multi-modal policy distributions and generalizes to multi-step sequential tasks, as well as 6DoF pick-and-place. Experiments on 10 simulated tasks show that it learns faster and generalizes better than a variety of end-to-end baselines, including policies that use ground-truth object poses. We validate our methods with hardware in the real world. View details
    Preview abstract Fully convolutional deep correlation networks are currently the state of the art approaches to single object visual tracking. It is commonly assumed that these networks perform tracking by detection by matching features of the object instance with features of the scene. Strong architectural priors and conditioning on the object representation is thought to encourage this tracking strategy. Despite these efforts, we show that deep trackers often default to “tracking by saliency” detection – without relying on the object representation. This leads us to introduce an auxiliary detection task that encourages more discriminative object representations and improves tracking performance. View details