Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks

Abstract

Both Convolutional Neural Networks (CNNs) and Long Short-Term
Memory (LSTM) networks have shown improvements over Deep Neural
Networks (DNNs) across a wide variety of speech recognition tasks.
CNNs, LSTMs and DNNs are complementary in their modeling
capabilities, as CNNs are good at reducing frequency variations,
LSTMs are good at temporal modeling, and DNNs are appropriate
for mapping features to a more separable space. In this paper, we
take advantage of the complementarity of CNNs, LSTMs and DNNs
by combining them into one unified architecture. We explore the
proposed architecture, which we call CLDNN, on a variety of large
vocabulary tasks with training sets ranging from 200 to 2,000 hours.
We find that the CLDNN provides a 4-6% relative improvement in word
error rate (WER) over an LSTM, the strongest of the three individual models.
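To make the layer ordering concrete, the following is a minimal sketch of a CLDNN-style model in PyTorch: convolutional layers over the frequency axis, a linear dimensionality-reduction layer, stacked LSTMs for temporal modeling, and fully connected (DNN) layers on top. All layer sizes and kernel shapes here are illustrative assumptions, not the configuration reported in the paper.

```python
# Illustrative CLDNN-style model; layer sizes are assumptions, not the
# paper's configuration.
import torch
import torch.nn as nn


class CLDNN(nn.Module):
    def __init__(self, num_freq_bins=40, num_classes=42):
        super().__init__()
        # CNN: convolve and pool along the frequency axis only, to reduce
        # spectral variation while preserving the time dimension.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(1, 9)),   # (time, freq) filter
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),        # pool only in frequency
        )
        conv_out_bins = (num_freq_bins - 9 + 1) // 3
        # Linear layer to reduce the CNN output dimension before the LSTMs.
        self.dim_reduce = nn.Linear(32 * conv_out_bins, 256)
        # LSTM: temporal modeling of the reduced feature sequence.
        self.lstm = nn.LSTM(256, 256, num_layers=2, batch_first=True)
        # DNN: map LSTM outputs to a more separable space, then to classes.
        self.dnn = nn.Sequential(
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        # x: (batch, time, freq) filterbank features
        b, t, f = x.shape
        h = self.conv(x.unsqueeze(1))           # (batch, 32, time, freq')
        h = h.permute(0, 2, 1, 3).reshape(b, t, -1)
        h = self.dim_reduce(h)                  # (batch, time, 256)
        h, _ = self.lstm(h)                     # (batch, time, 256)
        return self.dnn(h)                      # per-frame class scores


# Example: a batch of 4 utterances, 100 frames, 40 filterbank features each.
model = CLDNN()
scores = model(torch.randn(4, 100, 40))
print(scores.shape)  # torch.Size([4, 100, 42])
```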