Interspeech 2020 just ended, and here is my curated list of papers that I found interesting from the proceedings.
Disclaimer: This list is based on my research interests at present: ASR, speaker diarization, target speech extraction, and general training strategies.
A. Automatic speech recognition
I. Hybrid DNN-HMM systems
-
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
- Key contributions: multi-stream CNN for acoustic modeling and self-attentive simple recurrent unit for language modeling.
- With SpecAugment and N-best rescoring, it achieves 1.75% and 4.46% WER on LibriSpeech test-clean and test-other.
- Most of the improvement seems to come from the 24-layer SRU LM.
-
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces
- Removes the GMM bootstrapping and decision-tree building steps by using word-pieces instead of context-dependent phones, and CTC instead of cross-entropy training.
-
On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data
- Key idea: use an error detector mechanism to control which transcripts are used for semi-supervised training.
- The error detector is a neural network classifier which takes the ASR decoded output and predicts if it contains errors.
-
On the Robustness and Training Dynamics of Raw Waveform Models
- They study raw waveforms as inputs (instead of MFCCs) on TIMIT, Aurora-4, and WSJ, in both matched and mismatched conditions.
- In the mismatched case, MFCCs perform better, but raw-waveform performance can be improved using normalization techniques.
- In the matched case, raw waveforms perform better. Better alignments improve performance considerably.
-
Speaker Adaptive Training for Speech Recognition Based on Attention-over-Attention Mechanism
- Instead of using external embeddings (like d-vector) for speaker adaptation, the embedding is obtained using an attention mechanism on the frames of the utterance.
- This paper improves upon a previous work by the authors by replacing frame-attention with attention-over-attention.
-
- Key things are “online” and “unsupervised”, as opposed to existing methods which are offline and supervised.
- This is done by formulating a gradient based on the conditional likelihood of the acoustic model and using a particle filter approach for efficient computation (I did not quite understand the details).
-
Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models
- Strong BLSTM teacher is used to guide the sequence discriminative training of LSTM student model.
-
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
- Replace state tying with a method that jointly models the tied context-dependent phones with DNN training.
- Phonetic context representation relies on phoneme embeddings.
- Interesting idea, but needs more empirical work to improve performance compared to standard tying approach.
II. End-to-end models
-
On the comparison of popular end-to-end models for large scale speech recognition
- Compared RNN-transducer, RNN-AED, transformer AED models on 65k hours of speech.
- Transformer AED beats other systems in both streaming and non-streaming modes.
- Both are better than the hybrid model.
-
Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard
- Through careful tuning of the whole seq2seq pipeline, they achieve SOTA performance on 300h SWBD subset.
- Lots of ablation experiments; SpecAug seems to be the most helpful.
-
SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition
- Idea: swap random time bands and frequency bands in the input spectrogram (see the sketch below).
- From the ablation, time swapping seems more useful than frequency swapping.
- Slightly worse than SpecAugment.
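A minimal numpy sketch of the swapping idea, assuming the input is a (frames x mel-bins) spectrogram; the band widths are illustrative, not the paper's hyperparameters:

```python
import numpy as np

def spec_swap(spec, max_time_width=40, max_freq_width=15, rng=None):
    """Swap two random time bands and two random frequency bands.

    spec: (num_frames, num_mels) array; returns an augmented copy.
    """
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape

    # Swap two non-overlapping time bands of the same random width.
    w = int(rng.integers(1, max_time_width + 1))
    if 2 * w < T:
        t0 = int(rng.integers(0, T - 2 * w))
        t1 = int(rng.integers(t0 + w, T - w))
        tmp = spec[t0:t0 + w].copy()
        spec[t0:t0 + w] = spec[t1:t1 + w]
        spec[t1:t1 + w] = tmp

    # Swap two non-overlapping frequency bands of the same random width.
    v = int(rng.integers(1, max_freq_width + 1))
    if 2 * v < F:
        f0 = int(rng.integers(0, F - 2 * v))
        f1 = int(rng.integers(f0 + v, F - v))
        tmp = spec[:, f0:f0 + v].copy()
        spec[:, f0:f0 + v] = spec[:, f1:f1 + v]
        spec[:, f1:f1 + v] = tmp

    return spec
```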
-
Speech Transformer with Speaker Aware Persistent Memory
- This is like i-vector based speaker adaptation (used often in hybrid ASR), but applied to end-to-end transformers.
- All speaker i-vectors are concatenated into a matrix (the “persistent memory”), transformed into key and value matrices, and appended to the corresponding matrices of each self-attention block (see the sketch below).
- Consistent improvements on SWBD, Librispeech (100h), and AISHELL-1.
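A rough PyTorch sketch of how such a persistent memory could be appended to a single-head self-attention block; the single head, dimensions, and projection layers are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SpeakerAwareSelfAttention(nn.Module):
    def __init__(self, d_model, ivector_dim, ivector_bank):
        super().__init__()
        # ivector_bank: (num_speakers, ivector_dim) tensor = "persistent memory".
        self.register_buffer("memory", ivector_bank)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.mem_k = nn.Linear(ivector_dim, d_model)  # memory -> extra keys
        self.mem_v = nn.Linear(ivector_dim, d_model)  # memory -> extra values

    def forward(self, x):
        # x: (batch, time, d_model)
        B = x.size(0)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        mk = self.mem_k(self.memory).unsqueeze(0).expand(B, -1, -1)
        mv = self.mem_v(self.memory).unsqueeze(0).expand(B, -1, -1)
        k = torch.cat([k, mk], dim=1)   # keys:   (B, T + num_speakers, d)
        v = torch.cat([v, mv], dim=1)   # values: (B, T + num_speakers, d)
        att = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return att @ v                  # (B, T, d)
```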
-
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias
- A new beam search algorithm is proposed which mitigates the problem with longer utterance decoding in ASR.
- The new algorithm explicitly models utterance length in the sequence posterior, and is robust across different beam sizes.
-
A New Training Pipeline for an Improved Neural Transducer
- Extensive experiments on the RNN-transducer model, showing that it outperforms attention-based models on long sequences (on SWBD).
- NOTE TO SELF: Needs more careful reading.
-
Semi-supervised end-to-end ASR via teacher-student learning with conditional posterior distribution
- A new TS training scheme for E2E ASR models.
- The scheme involves: (i) teacher forcing using the 1-best hypothesis from a teacher model, and (ii) matching the student’s conditional posterior to the teacher’s along the 1-best decoding path (see the sketch below).
- Improvements seen in WSJ and Librispeech.
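A minimal sketch of the posterior-matching part, assuming we already have teacher and student logits along the teacher's 1-best decoding path:

```python
import torch.nn.functional as F

def conditional_posterior_loss(student_logits, teacher_logits):
    """KL between teacher and student posteriors along the teacher's 1-best path.

    Both inputs: (num_steps, vocab_size) logits for the same decoding path.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```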
-
Early Stage LM Integration Using Local and Global Log-Linear Combination
- New method for integrating an external LM into sequence-to-sequence model training: log-linear model combination with per-token renormalization, as opposed to shallow fusion, which renormalizes globally (see the sketch below).
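A small sketch contrasting the two combinations; `lambda_lm` is an illustrative weight, not a value from the paper:

```python
import torch

def shallow_fusion_step(asr_logprobs, lm_logprobs, lambda_lm=0.3):
    # Global combination: add the scores; no renormalization at this step.
    return asr_logprobs + lambda_lm * lm_logprobs

def local_loglinear_step(asr_logprobs, lm_logprobs, lambda_lm=0.3):
    # Local combination: renormalize over the vocabulary at every token, so the
    # result is a proper per-token distribution that can be used in training.
    return torch.log_softmax(asr_logprobs + lambda_lm * lm_logprobs, dim=-1)
```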
-
An investigation of phone-based subword units for end-to-end speech recognition
- One-pass decoding with BPE is introduced, with corresponding forward function and beam search decoding.
- Phone BPEs outperform char BPEs on WSJ and SWBD.
III. Training strategies
-
Unsupervised Regularization-Based Adaptive Training for Speech Recognition
- New regularization objectives proposed for speaker adaptation of CTC-based models:
- Center loss: penalizes the distance between each speaker embedding and its corresponding center (see the sketch below)
- Speaker variance loss: minimizes the speaker interclass variance
- Key idea is to remove speaker-specific deep embedding variances from the acoustic model
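A minimal PyTorch sketch of a center-loss style regularizer of this kind; the learnable-centers formulation and the loss weighting are my assumptions:

```python
import torch
import torch.nn as nn

class SpeakerCenterLoss(nn.Module):
    def __init__(self, num_speakers, embed_dim):
        super().__init__()
        # One learnable center per training speaker.
        self.centers = nn.Parameter(torch.zeros(num_speakers, embed_dim))

    def forward(self, embeddings, speaker_ids):
        # embeddings: (batch, embed_dim); speaker_ids: (batch,) long tensor
        centers = self.centers[speaker_ids]
        return ((embeddings - centers) ** 2).sum(dim=1).mean()

# Usage: total = ctc_loss + lambda_center * center_loss(deep_embeddings, spk_ids)
```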
-
Iterative Pseudo-Labeling for Speech Recognition
- Semi-supervised training with unlabeled data: pseudo-labels for the unlabeled set are regenerated at each iteration of training (see the sketch below).
- An external LM and data augmentation are important to avoid local minima.
- SOTA WERs on Librispeech (100h and 960h).
- Also release a large text corpus from Project Gutenberg books (not overlapping with LibriVox and LibriSpeech).
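A skeleton of the iterative loop; `train_one_pass` and `decode_with_lm` are placeholder callables standing in for the acoustic-model trainer and the LM-fused beam-search decoder, not functions from the paper's code:

```python
def iterative_pseudo_labeling(model, labeled_data, unlabeled_audio, lm,
                              train_one_pass, decode_with_lm, num_iterations=5):
    # Start from a model trained on the labeled (supervised) set only.
    model = train_one_pass(model, labeled_data, augment=True)

    for _ in range(num_iterations):
        # Re-label the unlabeled audio with the current model + external LM.
        pseudo_labeled = [(audio, decode_with_lm(model, audio, lm))
                          for audio in unlabeled_audio]

        # Retrain on labeled + pseudo-labeled data, with augmentation,
        # so the model does not simply collapse onto its own errors.
        model = train_one_pass(model, labeled_data + pseudo_labeled, augment=True)

    return model
```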
-
Improved Noisy Student Training for Automatic Speech Recognition
- Several techniques for semi-supervised training are proposed, which together result in SOTA performance on Librispeech.
- Normalized filtering score, sub-modular sampling, gradational filtering, and gradational augmentation: all relatively simple ideas, but useful in combination.
-
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks
- Using XL-Net like pretraining schemes for self-attention network based ASR models.
- Pretraining objective is next-frame prediction.
- Importantly, L1 and L2 losses failed to converge, so they used the Huber loss instead. Also, only the last 20% of frames are predicted.
- Experiments conducted on both hybrid and E2E settings.
-
Combination of end-to-end and hybrid models for speech recognition
- 3 ways of combining: (i) ROVER over 1-best hypothesis, (ii) MBR-based combination, and (iii) ROVER over N-best lists (approximation of (ii)).
- MBR combination worked best, and consistently improved over the best single model.
- Note that length normalization was required for LAS and RNN-T models since they prefer shorter sequences.
B. Speaker diarization and recognition
I. Diarization
-
- This paper extends the EEND diarization system to unknown number of speakers.
- This is done using encoder-decoder attractor (EDA). The idea is to pass the EEND hidden state to an LSTM encoder-decoder which can produce a flexible number of outputs.
- Each attractor is passed through a sigmoid to get an existence probability; a threshold on this probability then determines the number of speakers (see the sketch below).
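A minimal PyTorch sketch of the attractor computation; layer sizes and the exact stopping rule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderDecoderAttractor(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.exist = nn.Linear(d_model, 1)   # attractor existence probability

    def forward(self, embeddings, max_speakers=10, threshold=0.5):
        # embeddings: (1, num_frames, d_model) frame embeddings from EEND
        _, state = self.encoder(embeddings)
        zeros = embeddings.new_zeros(1, max_speakers, embeddings.size(-1))
        attractors, _ = self.decoder(zeros, state)                    # (1, S_max, d)
        probs = torch.sigmoid(self.exist(attractors)).squeeze(-1)[0]  # (S_max,)

        # Keep attractors until the first existence probability below threshold.
        keep = (probs >= threshold).long().cumprod(dim=0).bool()
        return attractors[0][keep], probs
```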
-
- TS-VAD was used in STC’s winning submission to the CHiME-6 challenge.
- The idea is to use speaker i-vectors from a first-pass diarization to perform a multi-label classification per-frame, which considers all speakers simultaneously.
- Limitation: it can only handle a fixed number of speakers (4 in the case of CHiME-6).
- My slides from a reading group presentation: slides
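A rough PyTorch sketch of the TS-VAD idea (a shared per-speaker branch conditioned on each i-vector, followed by a joint layer that predicts all speakers' frame-level activities); the layer types and sizes are my assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TSVAD(nn.Module):
    def __init__(self, feat_dim, ivector_dim, num_speakers=4, hidden=256):
        super().__init__()
        self.num_speakers = num_speakers
        # Shared per-speaker branch: frame features + that speaker's i-vector.
        self.per_speaker = nn.LSTM(feat_dim + ivector_dim, hidden,
                                   batch_first=True, bidirectional=True)
        # Joint layer that sees all speakers at once.
        self.joint = nn.LSTM(2 * hidden * num_speakers, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_speakers)

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim); ivectors: (B, num_speakers, ivector_dim)
        per_spk = []
        for s in range(self.num_speakers):
            ivec = ivectors[:, s:s + 1, :].expand(-1, feats.size(1), -1)
            h, _ = self.per_speaker(torch.cat([feats, ivec], dim=-1))
            per_spk.append(h)
        h, _ = self.joint(torch.cat(per_spk, dim=-1))
        return torch.sigmoid(self.out(h))   # (B, T, num_speakers) activities
```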
-
New advances in speaker diarization
- Multiple small tweaks in clustering-based diarization.
- Key contributions:
- using x-vectors + d-vectors
- using a neural network for scoring segment similarity (also see this paper which uses self-attention based scoring)
- better estimation of speaker count
- All these tweaks provide gains on the Callhome dataset (from 8.6% to 5.1%).
-
Speaker attribution with voice profiles by graph-based semi-supervised learning
- Key idea: build a graph whose nodes are the sub-segment embeddings and whose edges carry their similarities, then add the profile nodes and use “label propagation” to assign labels to the remaining nodes (see the sketch below).
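A minimal numpy sketch of graph-based label propagation of this kind, using the textbook propagation update rather than the paper's exact formulation:

```python
import numpy as np

def propagate_labels(seg_emb, profile_emb, num_iters=20, alpha=0.9):
    """seg_emb: (N, D) sub-segment embeddings; profile_emb: (P, D) voice profiles.

    Returns a profile index for each sub-segment.
    """
    X = np.vstack([profile_emb, seg_emb])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = np.clip(X @ X.T, 0.0, None)               # cosine-similarity graph
    np.fill_diagonal(W, 0.0)
    S = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-8)  # row-normalized

    P = profile_emb.shape[0]
    Y0 = np.zeros((X.shape[0], P))
    Y0[:P] = np.eye(P)                            # one-hot labels for profiles
    Y = Y0.copy()
    for _ in range(num_iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0    # propagate along the graph
        Y[:P] = np.eye(P)                         # keep profile nodes clamped
    return Y[P:].argmax(axis=1)
```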
-
Speaker Diarization System based on DPCA Algorithm For Fearless Steps Challenge Phase-2
- Key novelty is in using a new clustering method called Density Peak Clustering.
- Performance is better than AHC and spectral clustering for data containing non-convex clusters and outliers.
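A small numpy sketch of density peak clustering itself (not the paper's diarization pipeline); the cutoff distance and the way centers are picked are illustrative simplifications:

```python
import numpy as np

def density_peak_clustering(X, d_c=0.5, num_clusters=2):
    """X: (N, D) points; returns a cluster label per point."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (D < d_c).sum(axis=1) - 1               # local density (exclude self)

    n = X.shape[0]
    delta = np.zeros(n)                           # distance to nearest denser point
    nearest_denser = np.full(n, -1)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        if denser.size == 0:                      # the globally densest point
            delta[i] = D[i].max()
        else:
            j = denser[np.argmin(D[i, denser])]
            delta[i], nearest_denser[i] = D[i, j], j

    # Cluster centers: points where both density and delta are large.
    centers = np.argsort(rho * delta)[-num_clusters:]
    labels = np.full(n, -1)
    labels[centers] = np.arange(num_clusters)

    # Assign the rest in order of decreasing density, following the label of
    # the nearest denser neighbor.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            j = nearest_denser[i]
            labels[i] = labels[j] if j >= 0 else 0
    return labels
```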
-
Detecting and Counting Overlapping Speakers in Distant Speech Scenarios
- Overlap detection and VAD are jointly formulated as an overlapped speech detection and counting (OSDC) task, and temporal convolutional networks (TCNs) are used to tackle it.
- Experiments on AMI and CHiME-6 dataset show that TCNs are better than LSTM and CRNN models for this task.
II. Recognition
-
- Proposes improvements over the RawNet architecture for end-to-end speaker verification from raw waveforms.
- This is an alternative to the traditional approach which involves a front-end embedding extractor and a back-end like a PLDA classifier.
-
In defence of metric learning for speaker recognition
- Through extensive experimentation (20k GPU hours) on VoxCeleb, they show that metric learning learns better speaker embeddings than classification-based losses.
- Good overview of various loss functions, including a new angular prototypical loss.
- NOTE TO SELF: Needs careful reading.
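A minimal PyTorch sketch of the angular prototypical loss, assuming a batch with M utterances per speaker where the last utterance is the query and the mean of the rest is the prototype:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularPrototypicalLoss(nn.Module):
    def __init__(self, init_w=10.0, init_b=-5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))   # learnable scale
        self.b = nn.Parameter(torch.tensor(init_b))   # learnable bias

    def forward(self, embeddings):
        # embeddings: (num_speakers, M, dim), M >= 2 utterances per speaker
        query = F.normalize(embeddings[:, -1], dim=-1)                # (N, dim)
        proto = F.normalize(embeddings[:, :-1].mean(dim=1), dim=-1)   # (N, dim)
        logits = self.w.clamp(min=1e-6) * (query @ proto.t()) + self.b  # (N, N)
        labels = torch.arange(embeddings.size(0), device=embeddings.device)
        return F.cross_entropy(logits, labels)
```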
-
Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms
- Replace MFCC, VAD, and CMVN with deep learning tools, and extract embeddings directly from waveforms.
- Outperforms x-vector system on VoxCeleb1.
-
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
- Extensive comparison of feature extraction methods (14 methods studied)
- Contains good overview of available extraction methods
C. Speech enhancement/separation
I. Target speech extraction
-
Neural Spatio-Temporal Beamformer for Target Speech Separation
- Multi-tap MVDR beamformer which uses complex-valued masks for enhancement in multi-channel scenario.
- The model is jointly trained with ASR objective.
-
SpEx+: A Complete Time Domain Speaker Extraction Network
- This is a follow-up to their previous SpEx model, with the difference that the speaker embedding is now also computed in the time domain.
- This is done to avoid phase estimation that would be required to reconstruct the target signal, if the extraction is performed in frequency domain (e.g. in SpeakerBeam).
-
X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
- This is another time-domain speaker extraction model, but it’s based on TasNet (which is SOTA for speech separation).
- A new training strategy called Speaker Presence Invariant Training (SPIT) is proposed to handle the case where the model is asked to extract a speaker who is not present in the mixture.
-
Time-Domain Target-Speaker Speech Separation With Waveform-Based Speaker Embedding
- Another extraction model, but one that works entirely in the time domain (including the speaker embedding); the model is called WaveFilter.
- Auxiliary input is fed step-wise into the separation network through residual blocks.
- Experiments show improvements in SDR over SpeakerBeam.
-
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
- This is a follow-up on VoiceFilter, which uses d-vectors to extract the speaker’s speech from mixture.
- The novelty is that the extraction is now performed directly on filterbanks, and an asymmetric L2 loss is used to mitigate the over-suppression problem (see the sketch below).
- Model is made to fit on-device by 8-bit quantization.
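A minimal sketch of an asymmetric L2 loss of this kind; the exact form and the weight `alpha` are assumptions, with the heavier penalty applied when the clean target exceeds the estimate (i.e., over-suppression):

```python
import torch

def asymmetric_l2_loss(enhanced, clean, alpha=4.0):
    # enhanced, clean: (batch, time, num_filterbanks)
    diff = clean - enhanced
    # diff > 0 means target energy was suppressed; weight it alpha times more.
    weighted = torch.where(diff > 0, alpha * diff, diff)
    return (weighted ** 2).mean()
```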
II. Speech enhancement
-
Single-channel speech enhancement by subspace affinity minimization
- Learns separate speech and noise embeddings from the input using a subspace affinity loss function.
- Theoretically proven to maximally decorrelate speech and noise representations; empirically outperforms other popular single-channel methods on VCTK.
-
Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement
- A “noise encoder” learns noise representation from the noisy speech, which is then used to obtain enhanced STFT magnitude.
- Improves VoiceFilter performance in terms of PESQ and STOI.
-
Real Time Speech Enhancement in the Waveform Domain
- Causal DEMUCS model that runs in real time and matches performance of non-causal models.
- Trained using multiple objective functions.
- Augmentation schemes: Remix, Band-Mask, Revecho, and random shift.
D. Joint modeling, LMs, and others
-
- Extends the Serialized Output Training model for multispeaker ASR using an attention-based encoder-decoder.
- Speaker inventory is used as auxiliary input from which the speaker embeddings are obtained.
- Experiments are conducted using simulated mixtures from LibriSpeech.
-
Identifying Important Time-frequency Locations in Continuous Speech Utterances
- By masking increasingly large regions of the spectrogram while keeping the error rate unchanged, the authors determine which regions of the input are important.
- They mention that they will use this analysis to aid data augmentation in future work.
-
FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition
- A simple technique to perform early fusion of multi-microphone inputs, through a “fusion layer”.
- Consistent improvements on DIRHA dataset over delay-and-sum beamforming.
-
Speaker-Conditional Chain Model for Speech Separation and Extraction
- The task is to extract the speech of all the speakers in a multi-speaker recording.
- This is done by sequentially extracting the speech, conditioned on the embeddings of the previously extracted ones.
- Promising results on both WSJ-mix and LibriCSS.
- Note: This “chain” model is closely associated with the EEND line of work on diarization.
-
- Proposes “completion” tasks for speech-to-text and speech-to-speech, and trains encoder-decoder models for these.
- Speech-to-text completion models performed better than RNN-LM and BERT baselines on WER (although I don’t quite understand what is considered a “correct” completion).
-
Serialized Output Training for End-to-End Overlapped Speech Recognition
- Similar line of work as the joint training (see #1 in this list); task is multi-speaker overlapped ASR.
- Transcriptions of the speakers are generated one after another, separated by a speaker-change token (see the sketch below).
- Several advantages over the traditional permutation invariant training (PIT).
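A tiny sketch of how the serialized training target can be built, assuming a first-in-first-out ordering by start time; the `<sc>` token name here is illustrative:

```python
def serialize_transcripts(utterances, change_token="<sc>"):
    """utterances: list of (start_time, transcript) for one mixed recording."""
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {change_token} ".join(text for _, text in ordered)

# Two overlapping speakers become a single target sequence:
# serialize_transcripts([(1.2, "how are you"), (0.4, "hello there")])
# -> "hello there <sc> how are you"
```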
-
- Main novelty is that the model works for unknown number of speakers, using an iterative speech extraction system.
- Promising results on WSJ-mix with 2, 3, and 4 speakers.
-
Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for N-gram Language Models
- A data structure called “DashHashLM” to store n-gram LMs with efficient lookup for rescoring.
- 6x query speedup at the cost of about 10% more memory.
-
LVCSR with Transformer Language Models
- Several tweaks are proposed to improve rescoring time with transformer LMs, and to use them in single-pass systems.
- Proposed improvements include quantization of LM state, common prefix, and a hybrid lattice/n-best list rescoring.
-
Vector-Quantized Autoregressive Predictive Coding
- Best paper award
- Quantized representations are produced using the APC self-supervised objective.
- Probing tasks and mutual information used to show the presence and absence of information in learned representations from increasingly limited models.
E. Datasets
-
Spot the conversation: speaker diarisation in the wild
- New VoxConverse dataset for multi-modal diarization.
- Dev set: 1,218 minutes; test set: 53 hours.
-
- Dinner-party-like conversations with close-talk and array microphone recordings.
- 10 sessions, each between 15 and 45 minutes long.
- Using the Kaldi CHiME-5 acoustic model with adaptation gives approx. 80% WER in the far-field setting.
-
Speech recognition and multi-speaker diarization of long conversations
- Long-form multi-speaker recordings (approx 1 hour each) collected from This American Life podcast.
- Contains approx 640 hours of speech comprising 6608 unique speakers.
- Aligned transcripts are made publicly available
-
JukeBox: A Multilingual Singer Recognition Dataset
- 467 hours of singing audio sampled at 16 kHz, containing 936 unique singers and 18 different languages.
- Publicly available here.
-
MLS: A Large-Scale Multilingual Dataset for Speech Research
- Multilingual Librispeech data containing 32k hours of English and 4.5k hours across other languages.
- Will be made available on OpenSLR.
- Paper includes baselines using wav2letter++.
F. Toolkits
-
PYCHAIN: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
- Fully parallelized PyTorch implementation of end-to-end LF-MMI.
- Provides a wrapper around the forward-backward computation required for the LF-MMI gradient in Kaldi, so that it can be used for PyTorch-based neural network training.
-
Asteroid: the PyTorch-based audio source separation toolkit for researchers
- Provides Kaldi-style reproducible recipes for single-channel source separation datasets.
- Contains implementations of popular architectures performing on par with the reference papers, such as deep clustering, TasNet, WaveSplit, etc.
-
Surfboard: Audio Feature Extraction for Modern Machine Learning
- A Python library for audio feature extraction.
- Can be used from native Python or as a CLI tool.