Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Haitong Ma
Ofir Nabati
Na Li
Shie Mannor
Guy Tennenholtz
Proceedings of the 43rd International Conference on Machine Learning (ICML-26), Seoul, South Korea (2026)

Abstract

Reinforcement learning (RL) algorithms have achieved superhuman performance
on many sequential decision-making tasks, but often struggle in domains with
large, combinatorial action spaces. To address this, we introduce a practical and
stable algorithm for training discrete diffusion models to represent policies in
such environments. We formulate a policy mirror descent algorithm that enhances
training stability by reframing policy optimization as an inference problem, which
naturally aligns with the learning objective of discrete diffusion models. Through
extensive experiments on a suite of challenging benchmark tasks, we demonstrate
that our approach achieves significant improvements over existing methods in both
performance and sample efficiency. This work opens a promising new direction
for applying discrete diffusion models in RL to tackle long-standing challenges in
large-scale combinatorial action spaces.