
Alessio Tonioni
I’m a researcher in computer vision and deep learning, currently working with Federico Tombari at Google Zurich. Previously I was a postdoctoral researcher at the Computer Vision Lab of the University of Bologna under the supervision of Professor Luigi Di Stefano.
I received my PhD in Computer Science and Engineering from the University of Bologna in April 2019. During my PhD I worked on deep learning solutions for product detection and recognition in retail environments and on deep learning for depth estimation from stereo and monocular cameras.
Authored Publications
Snap-it, Tap-it, Splat-it: Tactile-Informed 3D Gaussian Splatting for Reconstructing Challenging Surfaces
Mauro Comi
Max Yang
Jonathan Tremblay
Valts Blukis
Yijiong Lin
Nathan Lepora
Laurence Aitchison
2025
Abstract
Touch and vision go hand in hand, mutually enhancing our ability to understand the world. From a research perspective, the problem of mixing touch and vision is underexplored and presents interesting challenges. To this end, we propose Tactile-Informed 3DGS, a novel approach that incorporates touch data (local depth maps) with multi-view vision data to achieve surface reconstruction and novel view synthesis. Our method optimises 3D Gaussian primitives to accurately model the object's geometry at points of contact. By creating a framework that decreases the transmittance at touch locations, we achieve a refined surface reconstruction, ensuring a uniformly smooth depth map. Touch is particularly useful when considering non-Lambertian objects (e.g. shiny or reflective surfaces), since contemporary methods tend to fail to faithfully reconstruct specular highlights. By combining vision and tactile sensing, we achieve more accurate geometry reconstructions with fewer images than prior methods. We evaluate our approach on objects with glossy and reflective surfaces and demonstrate its effectiveness, offering significant improvements in reconstruction quality.
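To make the core idea concrete, the following is a minimal PyTorch sketch of a touch-informed regulariser in the spirit of the abstract: it penalises the transmittance accumulated along rays that hit touch locations and ties the rendered depth to the tactile depth map. The renderer outputs, loss weights, and function name are assumptions for illustration, not the paper's implementation.

# Minimal sketch (PyTorch), not the authors' code. We assume a 3DGS renderer that
# returns, per ray at a touch location, the accumulated transmittance and the
# expected depth; names and weights are hypothetical.
import torch

def touch_informed_loss(transmittance, rendered_depth, tactile_depth, w_t=1.0, w_d=1.0):
    """transmittance, rendered_depth: (N,) values for rays hitting touch locations.
    tactile_depth: (N,) local depth measured by the tactile sensor along those rays."""
    # Push transmittance toward zero at contact points so the surface becomes opaque there.
    l_transmittance = transmittance.abs().mean()
    # Encourage the rendered depth to agree with the tactile depth map.
    l_depth = torch.nn.functional.l1_loss(rendered_depth, tactile_depth)
    return w_t * l_transmittance + w_d * l_depth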
Abstract
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
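For intuition about how a lightweight cross-attention resampler can condition frame selection on text, here is a minimal PyTorch sketch; the dimensions, the single attention block, and the way text tokens are concatenated with video tokens are assumptions, not the actual TCR architecture.

# Minimal sketch (PyTorch): learned queries cross-attend over frozen video tokens,
# with embedded text tokens prepended as the condition. Sizes are assumptions.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, dim=768, num_queries=96, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T*P, dim) frozen visual tokens for many frames
        # text_feats:  (B, L, dim)   embedded task/text condition
        b = video_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([text_feats, video_feats], dim=1)  # condition the sampling on text
        out, _ = self.cross_attn(q, kv, kv)
        return out + self.ffn(out)  # (B, num_queries, dim) tokens handed to the frozen LLM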
TextMesh: Generation of Realistic 3D Meshes From Text Prompts
Christina Tsalicoglou
Fabian Manhardt
Michael Niemeyer
3DV 2024 (2024)
Abstract
The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises whether this can also be achieved in the generation of 3D content from such text prompts. To this end, a new line of methods has recently emerged that tries to harness diffusion models, trained on 2D images, for supervision of 3D model generation using view-dependent prompts. While achieving impressive results, these methods have two major drawbacks. First, rather than commonly used 3D meshes, they instead generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish look. Therefore, in this work we propose a novel method for the generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh.
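As a rough illustration of what an SDF backbone changes inside a NeRF, the sketch below converts predicted signed distances into volume density using a VolSDF-style Laplace transform; the mesh can then be extracted from the zero level set of the SDF (e.g. with marching cubes). This is a generic formulation and may differ from the one used in TextMesh.

# Minimal sketch: density derived from a signed distance, rather than predicted directly.
import torch

def sdf_to_density(sdf, beta=0.01):
    """Map signed distance values to volume density; smaller beta sharpens the surface."""
    alpha = 1.0 / beta
    return alpha * torch.where(
        sdf <= 0,
        1.0 - 0.5 * torch.exp(sdf / beta),   # inside the surface
        0.5 * torch.exp(-sdf / beta),        # outside the surface
    )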
Abstract
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring fewer trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding in VLMs.
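The following minimal PyTorch sketch shows one way to consolidate tokens from several frozen encoders into a compact sequence for a frozen LM, in the spirit of BRAVE; the module name, widths, and single attention block are assumptions rather than the paper's exact bridge architecture.

# Minimal sketch: project each encoder's tokens to a shared width, concatenate,
# and compress them with learned queries before mapping to the LM embedding size.
import torch
import torch.nn as nn

class MultiEncoderBridge(nn.Module):
    def __init__(self, encoder_dims, lm_dim=2048, width=768, num_queries=32, heads=12):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, width) for d in encoder_dims)
        self.queries = nn.Parameter(torch.randn(num_queries, width) * 0.02)
        self.attn = nn.MultiheadAttention(width, heads, batch_first=True)
        self.to_lm = nn.Linear(width, lm_dim)

    def forward(self, encoder_feats):
        # encoder_feats: list of (B, N_i, D_i) token sequences from the frozen encoders
        tokens = torch.cat([p(f) for p, f in zip(self.proj, encoder_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return self.to_lm(out)  # (B, num_queries, lm_dim) fed as soft prompts to the frozen LM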
TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction Using Vision-Based Tactile Sensing
Mauro Comi
Yijiong Lin
Alex Church
Laurence Aitchison
Nathan Lepora
IEEE Robotics and Automation Letters (2024)
Abstract
Humans rely on their visual and tactile senses to develop a comprehensive 3D understanding of their physical environment. Recently, there has been a growing interest in exploring and manipulating objects using data-driven approaches that utilise high-resolution vision-based tactile sensors. However, 3D shape reconstruction using tactile sensing has lagged behind visual shape reconstruction because of limitations in existing techniques, including the inability to generalise over unseen shapes, the absence of real-world testing, and the limited expressive capacity imposed by discrete representations. To address these challenges, we propose TouchSDF, a deep learning approach for tactile 3D shape reconstruction that leverages the rich information provided by a vision-based tactile sensor and the expressivity of the implicit neural representation DeepSDF. Our technique consists of two components: (1) a Convolutional Neural Network that maps tactile images into local meshes representing the surface at the touch location, and (2) an implicit neural function that predicts a signed distance function to extract the desired 3D shape. This combination allows TouchSDF to reconstruct smooth and continuous 3D shapes from tactile inputs in simulation and real-world settings, opening up research avenues for robust 3D-aware representations and improved multimodal perception in robotics. Code and supplementary material are available online.
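For reference, component (2) is in the DeepSDF family; a generic decoder of that kind looks like the minimal PyTorch sketch below (shape latent code plus 3D query point in, signed distance out). Layer sizes are assumptions and this is not the TouchSDF release.

# Minimal DeepSDF-style decoder sketch, for illustration only.
import torch
import torch.nn as nn

class DeepSDFDecoder(nn.Module):
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),  # signed distance squashed to [-1, 1]
        )

    def forward(self, latent, xyz):
        # latent: (B, latent_dim) shape code; xyz: (B, 3) query points
        return self.net(torch.cat([latent, xyz], dim=-1)).squeeze(-1)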
NeRF-GAN Distillation for Efficient 3D-Aware Generation with Convolutions
Mohamad Shahbazi
Evangelos Ntaveli
Edo Collins
Danda Pani Paudel
Martin Danelljan
Luc Van Gool
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops (2023)
Abstract
Pose-conditioned convolutional generative models struggle with high-quality 3D-consistent image generation from single-view datasets, due to their lack of sufficient 3D priors. Recently, the integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images. NeRF-GANs exploit the strong inductive bias of neural 3D representations and volumetric rendering at the cost of higher computational complexity. This study aims at revisiting pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by distilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations. Experiments on several datasets demonstrate that the proposed method obtains results comparable with volumetric rendering in terms of quality and 3D consistency while benefiting from the computational advantage of convolutional networks. The code is available at: https://github.com/mshahbazi72/NeRF-GAN-Distillation
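A minimal sketch of the distillation loop described above: a pose-conditioned convolutional student re-uses the latent code of the frozen NeRF-GAN and is trained to reproduce its volumetric renderings. The nerf_gan.render call, the pose sampler, and the plain L1 objective are assumptions for illustration, not the repository's API.

# Minimal sketch (PyTorch) of one distillation step.
import torch
import torch.nn.functional as F

def distillation_step(student, nerf_gan, optimizer, sample_random_poses,
                      batch_size=8, latent_dim=512, device="cuda"):
    z = torch.randn(batch_size, latent_dim, device=device)
    pose = sample_random_poses(batch_size, device)   # hypothetical pose sampler
    with torch.no_grad():
        target = nerf_gan.render(z, pose)            # 3D-consistent teacher image (assumed API)
    pred = student(z, pose)                          # fast convolutional forward pass
    loss = F.l1_loss(pred, target)                   # a perceptual term could be added
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()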
NeRF-Supervised Deep Stereo
Fabio Tosi
Daniele De Gregorio
Matteo Poggi
Computer Vision and Pattern Recognition (2023)
Abstract
We introduce a novel framework for training deep stereo networks effortlessly and without any ground truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of these, we carry out a NeRF-supervised training procedure in which rendered stereo triplets are exploited to compensate for occlusions and rendered depth maps serve as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, in most cases, outperforming them in zero-shot generalization.
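To illustrate the training signals, the sketch below combines a triplet photometric term (the centre image reconstructed from both rendered side views, taking the per-pixel minimum to soften occlusions) with an L1 term on the rendered proxy disparity. The warping helper and the loss weights are assumptions, not the paper's code.

# Minimal sketch (PyTorch) of NeRF-supervised training signals for a stereo network.
import torch
import torch.nn.functional as F

def nerf_supervised_loss(disp, center, left, right, proxy_disp, warp_with_disparity,
                         w_photo=1.0, w_proxy=0.1):
    # disp: (B, 1, H, W) predicted disparity for the (center, right) pair
    rec_from_right = warp_with_disparity(right, -disp)  # hypothetical horizontal warp helper
    rec_from_left = warp_with_disparity(left, disp)
    photo = torch.minimum((rec_from_right - center).abs().mean(1),
                          (rec_from_left - center).abs().mean(1)).mean()
    proxy = F.l1_loss(disp, proxy_disp)                 # rendered disparity as proxy label
    return w_photo * photo + w_proxy * proxy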
Learning good features to transfer across tasks and domains
Adriano Cardace
Luca De Luigi
Luigi Di Stefano
Pierluigi Zama Ramirez
Samuele Salti
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) (to appear)
Abstract
The availability of labelled data is the major obstacle to the deployment of deep learning algorithms to solve computer vision tasks in new domains. Recent works have shown that it is possible to leverage correlations between features learned by neural networks for different tasks on different domains to reduce the need for full supervision. This is achieved by learning to transfer features across both tasks and domains. In this work, we show that constraining the structure of the source and target feature spaces is key to improving the performance of such a transfer framework. In particular, we demonstrate the benefits of: learning features able to capture fine-grained details of the image and aligning the spaces across tasks by means of an auxiliary task; and aligning the feature spaces across domains by means of a novel norm discrepancy loss. We achieve state-of-the-art results in synthetic-to-real adaptation scenarios for this novel setting.
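As a hedged illustration of what a norm discrepancy loss could look like, the sketch below penalises the gap between the average feature norms computed on source and target batches, so that the two feature spaces occupy a comparable range; the actual formulation in the paper may differ.

# Minimal sketch (PyTorch), illustrative only.
import torch

def norm_discrepancy_loss(source_feats, target_feats):
    # source_feats, target_feats: (B, C, H, W) feature maps from the two domains
    source_norm = source_feats.flatten(1).norm(dim=1).mean()
    target_norm = target_feats.flatten(1).norm(dim=1).mean()
    return (source_norm - target_norm).abs()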
LatentSwap3D: Swapping Latent Codes for Semantic Edits
Enis Simsar
Evin Pınar Örnek
Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
Abstract
3D GANs have the ability to generate latent codes for entire 3D volumes rather than only 2D images. These models offer desirable features like high-quality geometry and multi-view consistency, but, unlike their 2D counterparts, complex semantic image editing tasks for 3D GANs have only been partially explored. To address this problem, we propose LatentSwap3D, a semantic edit approach based on latent space discovery that can be used with any off-the-shelf 3D or 2D GAN model and on any dataset. LatentSwap3D relies on identifying the latent code dimensions corresponding to specific attributes by feature ranking using a random forest classifier. It then performs the edit by swapping the selected dimensions of the image being edited with the ones from an automatically selected reference image. Compared to other latent space control-based edit methods, which were mainly designed for 2D GANs, our method on 3D GANs provides remarkably consistent semantic edits in a disentangled manner and outperforms others both qualitatively and quantitatively. We show results on seven 3D GANs (π-GAN, GIRAFFE, StyleSDF, MVCGAN, EG3D, StyleNeRF, and VolumeGAN) and on five datasets (FFHQ, AFHQ, Cats, MetFaces, and CompCars).
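The two steps translate naturally into a short sketch: rank latent dimensions by random-forest feature importance with respect to an attribute, then swap the top-k dimensions with those of a reference code. The source of the attribute pseudo-labels and the value of k are assumptions, not the paper's exact recipe.

# Minimal sketch (NumPy + scikit-learn), illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_latent_dims(latents, attribute_labels, n_estimators=200, seed=0):
    # latents: (N, D) sampled latent codes; attribute_labels: (N,) pseudo-labels for one attribute
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(latents, attribute_labels)
    return np.argsort(forest.feature_importances_)[::-1]  # most attribute-relevant dims first

def latent_swap(code, reference_code, ranked_dims, k=20):
    # code, reference_code: (D,) latent vectors; swap the k most relevant dimensions
    edited = code.copy()
    edited[ranked_dims[:k]] = reference_code[ranked_dims[:k]]
    return edited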
Continual Adaptation for Deep Stereo
Fabio Tosi
Luigi Di Stefano
Matteo Poggi
Stefano Mattoccia
IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) (to appear)
Abstract
Depth estimation from stereo images is carried out with unmatched results by convolutional neural networks trained end-to-end to regress dense disparities. As for most tasks, this is possible if large amounts of labelled samples are available for training, possibly covering the whole data distribution encountered at deployment time. Since such an assumption is systematically unmet in real applications, the capacity to adapt to any unseen setting becomes of paramount importance. Purposely, we propose a continual adaptation paradigm for deep stereo networks designed to deal with challenging and ever-changing environments. We design a lightweight and modular architecture, Modularly ADaptive Network (MADNet), and formulate Modular ADaptation algorithms (MAD, MAD++) which permit efficient optimization of independent sub-portions of the entire network. In our paradigm, the learning signals needed to continuously adapt models online can be sourced from self-supervision via right-to-left image warping or from traditional stereo algorithms. With both sources, no data other than the input images gathered at deployment time are needed. Thus, our network architecture and adaptation algorithms realize the first real-time self-adaptive deep stereo system and pave the way for a new paradigm that can facilitate practical deployment of end-to-end architectures for dense disparity regression.
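A minimal sketch of one online modular-adaptation step: only one sub-portion of the network receives gradients, and the learning signal is a photometric loss from warping the right image with the predicted disparity. The module list, the warping helper, and the random module choice (MAD; MAD++ uses a reward-based selection) are simplifications for illustration, not the MADNet code.

# Minimal sketch (PyTorch) of a modular online adaptation step.
import random
import torch

def mad_step(modules, full_forward, optimizer, left, right, warp_with_disparity):
    # modules: list of nn.Module sub-portions of the stereo network
    chosen = random.choice(modules)               # simplified selection of the module to adapt
    for m in modules:
        m.requires_grad_(m is chosen)             # only the chosen module gets gradients
    disp = full_forward(left, right)
    warped_left = warp_with_disparity(right, disp)  # hypothetical right-to-left warping helper
    loss = (warped_left - left).abs().mean()        # photometric self-supervision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # parameters without gradients are untouched
    return loss.item()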