Here is a list of papers and book chapters for anyone who wants to get started with automatic speech recognition (ASR). It was compiled from the required reading list for Shinji Watanabe’s Information Extraction course, offered at JHU in Spring 2019.

  1. Automatic speech recognition – a brief history of the technology development. Juang and Rabiner.
  2. An introduction to hidden Markov models. Rabiner and Juang.
  3. GMMs and k-means (Bishop’s PRML Sections 9.1 and 9.2)
  4. Feature extraction for ASR (Jurafsky and Martin 2nd edition Section 9.3)
  5. EM algorithm and its application to parameter estimation for GMM-HMM models. Jeff Bilmes.
  6. Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. Gauvain and Lee. IEEE Transactions on Speech and Audio Processing. April 1994.
  7. Tree-based state tying for high accuracy modeling. Young et al. ACL 1994.
  8. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. C. J. Leggetter, P. C. Woodland. Computer speech & language.
  9. Acoustic modeling based on the MDL principle for speech recognition. Shinoda and Watanabe. Eurospeech 1997.
  10. Linear discriminant analysis for improved LVCSR. Haeb-Umbach, R., & Ney, H. ICASSP 1992.
  11. An empirical study of smoothing techniques for language modeling. Chen and Goodman. ACL 1996.
  12. DNNs for acoustic modeling in speech recognition. Hinton et al. IEEE Signal Processing Magazine.
  13. Front-end factor analysis for speaker verification. Dehak et al. IEEE Transactions on Audio, Speech, and Language Processing.
  14. Conversational speech transcription using context-dependent DNNs. Yu, Seide, and Li. ICML 2012.
  15. Applying CNNs to hybrid NN-HMM model for speech recognition. Abdel-Hamid et al. ICASSP 2012.
  16. LSTM based RNNs for large vocabulary speech recognition. Sak et al. Interspeech 2014.
  17. Sequence-discriminative training of DNNs. Vesely et al. Interspeech 2013.
  18. RNN based language model. Mikolov et al. Interspeech 2010.
  19. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Weninger et al.
  20. WaveNet: a generative model for raw audio. van den Oord et al.
  21. Sequence to sequence learning with neural networks. Sutskever et al. NIPS 2014.
  22. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Graves et al. ICML 2006.
  23. End-to-end attention-based large vocabulary speech recognition. Bahdanau et al. ICASSP 2016.
  24. Listen, attend, and spell. Chan, Jaitly, Le, and Vinyals.
  25. Joint CTC-attention based end-to-end speech recognition using MTL. Kim et al. ICASSP 2017.