Kangfu Mei is a research scientist at Google AI Innovation & Research (AIR). He got Ph.D. degree at Johns Hopkins University. See more details at https://kfmei.com/
39th Conference on Neural Information Processing Systems (2025)
Preview abstract
Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an N-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as ”collective wisdom”, steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.View details
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Preview abstract
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities –including depth, segmentation, edges, and text prompts– to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality’s guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity.View details