Fast and Flexible Neural Audio Synthesis

Lamtharn (Hanoi) Hantrakul
Chenjie Gu
ISMIR 2019 (to appear)

Abstract

Autoregressive neural networks, such as WaveNet, have opened up new avenues for expressive audio synthesis. High-quality speech synthesis uses detailed linguistic features for conditioning, but comparable levels of control have yet to be realized for musical instruments. Here, we demonstrate an autoregressive model capable of synthesizing realistic audio that closely follows fine-scale temporal conditioning for loudness and fundamental frequency. We find that an appropriate choice of conditioning features and architectures improves both the quantitative accuracy of audio resynthesis and the qualitative responsiveness to creative manipulation of the conditioning. While large autoregressive models generate audio much more slowly than real time, we achieve these results with a much more efficient WaveRNN model, opening the door to exploring real-time interactive audio synthesis with neural networks.
