Interspeech 2020 just ended, and here is my curated list of papers that I found interesting from the proceedings.
Disclaimer: This list is based on my research interests at present: ASR, speaker diarization, target speech extraction, and general training strategies.
A. Automatic speech recognition
I. Hybrid DNN-HMM systems
-
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
- Key contributions: multi-stream CNN for acoustic modeling and self-attentive simple recurrent unit for language modeling.
- With SpecAugment and N-best rescoring, it achieves 1.75% and 4.46% WER on LibriSpeech test-clean and test-other.
- Most of the improvement seems to come from the 24-layer SRU LM.
-
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces
- Removes the GMM bootstrapping and decision-tree building steps by using word-pieces instead of context-dependent phones, and CTC instead of cross-entropy training.
-
On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data
- Key idea: use an error detector mechanism to control which transcripts are used for semi-supervised training.
- The error detector is a neural network classifier which takes the ASR decoded output and predicts if it contains errors.
-
On the Robustness and Training Dynamics of Raw Waveform Models
- They study raw waveforms as inputs (instead of MFCCs) on TIMIT, Aurora-4, and WSJ, in both matched and mismatched conditions.
- In the mismatched case, MFCCs perform better, but raw-waveform performance can be improved using normalization techniques.
- In the matched case, raw waveforms perform better. Better alignments improve performance considerably.
-
Speaker Adaptive Training for Speech Recognition Based on Attention-over-Attention Mechanism
- Instead of using external embeddings (like d-vector) for speaker adaptation, the embedding is obtained using an attention mechanism on the frames of the utterance.
- This paper improves upon a previous work by the authors by replacing frame-attention with attention-over-attention.
-
- Key things are “online” and “unsupervised”, as opposed to existing methods which are offline and supervised.
- This is done by formulating a gradient based on the conditional likelihood of the acoustic model and using a particle filter approach for efficient computation (I did not quite understand the details).
-
Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models
- Strong BLSTM teacher is used to guide the sequence discriminative training of LSTM student model.
-
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
- Replace state tying with a method that jointly models the tied context-dependent phones with DNN training.
- Phonetic context representation relies on phoneme embeddings.
- Interesting idea, but needs more empirical work to improve performance compared to standard tying approach.
II. End-to-end models
-
On the comparison of popular end-to-end models for large scale speech recognition
- Compared RNN-transducer, RNN-AED, transformer AED models on 65k hours of speech.
- Transformer AED beats other systems in both streaming and non-streaming modes.
- Both are better than the hybrid model.
-
Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard
- Through careful tuning of the whole seq2seq pipeline, they achieve SOTA performance on 300h SWBD subset.
- Lots of ablation experiments; SpecAug seems to be the most helpful.
-
SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition
- Idea: swap random time bands and frequency bands in the input spectrogram (see the sketch below).
- From the ablation, time swapping seems more useful than frequency swapping.
- Slightly worse than SpecAugment.
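A minimal numpy sketch of the swapping idea, assuming the input is a (frames x mel-bins) spectrogram; the band widths are illustrative, not the paper's hyperparameters:

```python
import numpy as np

def spec_swap(spec, max_time_width=40, max_freq_width=15, rng=None):
    """Swap two random time bands and two random frequency bands.

    spec: (num_frames, num_mels) array; returns an augmented copy.
    """
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape

    # Swap two non-overlapping time bands of the same random width.
    w = int(rng.integers(1, max_time_width + 1))
    if 2 * w < T:
        t0 = int(rng.integers(0, T - 2 * w))
        t1 = int(rng.integers(t0 + w, T - w))
        tmp = spec[t0:t0 + w].copy()
        spec[t0:t0 + w] = spec[t1:t1 + w]
        spec[t1:t1 + w] = tmp

    # Swap two non-overlapping frequency bands of the same random width.
    v = int(rng.integers(1, max_freq_width + 1))
    if 2 * v < F:
        f0 = int(rng.integers(0, F - 2 * v))
        f1 = int(rng.integers(f0 + v, F - v))
        tmp = spec[:, f0:f0 + v].copy()
        spec[:, f0:f0 + v] = spec[:, f1:f1 + v]
        spec[:, f1:f1 + v] = tmp

    return spec
```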
-
Speech Transformer with Speaker Aware Persistent Memory
- This is like i-vector based speaker adaptation (used often in hybrid ASR), but applied to end-to-end transformers.
- All speaker i-vectors are concatenated into a matrix (the “persistent memory”), transformed into key and value matrices, and appended to the corresponding matrices of each self-attention block (see the sketch below).
- Consistent improvements on SWBD, Librispeech (100h), and AISHELL-1.
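A rough PyTorch sketch of how such a persistent memory could be appended to a single-head self-attention block; the single head, dimensions, and projection layers are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SpeakerAwareSelfAttention(nn.Module):
    def __init__(self, d_model, ivector_dim, ivector_bank):
        super().__init__()
        # ivector_bank: (num_speakers, ivector_dim) tensor = "persistent memory".
        self.register_buffer("memory", ivector_bank)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.mem_k = nn.Linear(ivector_dim, d_model)  # memory -> extra keys
        self.mem_v = nn.Linear(ivector_dim, d_model)  # memory -> extra values

    def forward(self, x):
        # x: (batch, time, d_model)
        B = x.size(0)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        mk = self.mem_k(self.memory).unsqueeze(0).expand(B, -1, -1)
        mv = self.mem_v(self.memory).unsqueeze(0).expand(B, -1, -1)
        k = torch.cat([k, mk], dim=1)   # keys:   (B, T + num_speakers, d)
        v = torch.cat([v, mv], dim=1)   # values: (B, T + num_speakers, d)
        att = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return att @ v                  # (B, T, d)
```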
-
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias
- A new beam search algorithm is proposed which mitigates the problem with longer utterance decoding in ASR.
- The new algorithm explicitly models utterance length in the sequence posterior, and is robust across different beam sizes.
-
A New Training Pipeline for an Improved Neural Transducer
- Extensive experiments on the RNN-transducer model, showing that it outperforms attention-based models on long sequences (on SWBD).
- NOTE TO SELF: Needs more careful reading.
-
Semi-supervised end-to-end ASR via teacher-student learning with conditional posterior distribution
- A new TS training scheme for E2E ASR models.
- The scheme involves: (i) teacher forcing using the 1-best hypothesis from a teacher model, and (ii) matching the student’s conditional posterior to the teacher’s along the 1-best decoding path (see the sketch below).
- Improvements seen in WSJ and Librispeech.
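A minimal sketch of the posterior-matching part, assuming we already have teacher and student logits along the teacher's 1-best decoding path:

```python
import torch.nn.functional as F

def conditional_posterior_loss(student_logits, teacher_logits):
    """KL between teacher and student posteriors along the teacher's 1-best path.

    Both inputs: (num_steps, vocab_size) logits for the same decoding path.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```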
-
Early Stage LM Integration Using Local and Global Log-Linear Combination
- New method for integrating an external LM into sequence-to-sequence model training: log-linear model combination with per-token renormalization, as opposed to shallow fusion, which renormalizes globally (see the sketch below).
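A small sketch contrasting the two combinations; `lambda_lm` is an illustrative weight, not a value from the paper:

```python
import torch

def shallow_fusion_step(asr_logprobs, lm_logprobs, lambda_lm=0.3):
    # Global combination: add the scores; no renormalization at this step.
    return asr_logprobs + lambda_lm * lm_logprobs

def local_loglinear_step(asr_logprobs, lm_logprobs, lambda_lm=0.3):
    # Local combination: renormalize over the vocabulary at every token, so the
    # result is a proper per-token distribution that can be used in training.
    return torch.log_softmax(asr_logprobs + lambda_lm * lm_logprobs, dim=-1)
```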
-
An investigation of phone-based subword units for end-to-end speech recognition
- One-pass decoding with BPE is introduced, with corresponding forward function and beam search decoding.
- Phone BPEs outperform char BPEs on WSJ and SWBD.
III. Training strategies
-
Unsupervised Regularization-Based Adaptive Training for Speech Recognition
- New regularization objectives proposed for speaker adaptation of CTC-based models:
- Center loss: penalizes the distance between each speaker embedding and its corresponding center (see the sketch below)
- Speaker variance loss: minimizes the speaker interclass variance
- Key idea is to remove speaker-specific deep embedding variances from the acoustic model
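A minimal PyTorch sketch of a center-loss style regularizer of this kind; the learnable-centers formulation and the loss weighting are my assumptions:

```python
import torch
import torch.nn as nn

class SpeakerCenterLoss(nn.Module):
    def __init__(self, num_speakers, embed_dim):
        super().__init__()
        # One learnable center per training speaker.
        self.centers = nn.Parameter(torch.zeros(num_speakers, embed_dim))

    def forward(self, embeddings, speaker_ids):
        # embeddings: (batch, embed_dim); speaker_ids: (batch,) long tensor
        centers = self.centers[speaker_ids]
        return ((embeddings - centers) ** 2).sum(dim=1).mean()

# Usage: total = ctc_loss + lambda_center * center_loss(deep_embeddings, spk_ids)
```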
-
Iterative Pseudo-Labeling for Speech Recognition
- Semi-supervised training with unlabeled data: pseudo-labels for the unlabeled set are regenerated at each iteration of training (see the sketch below).
- An external LM and data augmentation are important to avoid local minima.
- SOTA WERs on Librispeech (100h and 960h).
- Also release a large text corpus from Project Gutenberg books (not overlapping with LibriVox and LibriSpeech).
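A skeleton of the iterative loop; `train_one_pass` and `decode_with_lm` are placeholder callables standing in for the acoustic-model trainer and the LM-fused beam-search decoder, not functions from the paper's code:

```python
def iterative_pseudo_labeling(model, labeled_data, unlabeled_audio, lm,
                              train_one_pass, decode_with_lm, num_iterations=5):
    # Start from a model trained on the labeled (supervised) set only.
    model = train_one_pass(model, labeled_data, augment=True)

    for _ in range(num_iterations):
        # Re-label the unlabeled audio with the current model + external LM.
        pseudo_labeled = [(audio, decode_with_lm(model, audio, lm))
                          for audio in unlabeled_audio]

        # Retrain on labeled + pseudo-labeled data, with augmentation,
        # so the model does not simply collapse onto its own errors.
        model = train_one_pass(model, labeled_data + pseudo_labeled, augment=True)

    return model
```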
-
Improved Noisy Student Training for Automatic Speech Recognition
- Several techniques for semi-supervised training are proposed, which together result in SOTA performance on Librispeech.
- Normalized filtering score, sub-modular sampling, gradational filtering, and gradational augmentation: all relatively simple ideas, but useful in combination.
-
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks
- Using XL-Net like pretraining schemes for self-attention network based ASR models.
- Pretraining objective is next-frame prediction.
- Importantly, L1 and L2 losses failed to converge, so they used the Huber loss instead. Also, only the last 20% of frames are predicted.
- Experiments conducted on both hybrid and E2E settings.
-
Combination of end-to-end and hybrid models for speech recognition
- 3 ways of combining: (i) ROVER over 1-best hypothesis, (ii) MBR-based combination, and (iii) ROVER over N-best lists (approximation of (ii)).
- MBR combination worked best, and consistently improved over the best single model.
- Note that length normalization was required for LAS and RNN-T models since they prefer shorter sequences.
B. Speaker diarization and recognition
I. Diarization
-
- This paper extends the EEND diarization system to unknown number of speakers.
- This is done using encoder-decoder attractor (EDA). The idea is to pass the EEND hidden state to an LSTM encoder-decoder which can produce a flexible number of outputs.
- Each attractor is passed through a sigmoid to get an existence probability; a threshold on this probability then determines the number of speakers (see the sketch below).
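A minimal PyTorch sketch of the attractor computation; layer sizes and the exact stopping rule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderDecoderAttractor(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.exist = nn.Linear(d_model, 1)   # attractor existence probability

    def forward(self, embeddings, max_speakers=10, threshold=0.5):
        # embeddings: (1, num_frames, d_model) frame embeddings from EEND
        _, state = self.encoder(embeddings)
        zeros = embeddings.new_zeros(1, max_speakers, embeddings.size(-1))
        attractors, _ = self.decoder(zeros, state)                    # (1, S_max, d)
        probs = torch.sigmoid(self.exist(attractors)).squeeze(-1)[0]  # (S_max,)

        # Keep attractors until the first existence probability below threshold.
        keep = (probs >= threshold).long().cumprod(dim=0).bool()
        return attractors[0][keep], probs
```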
-
- TS-VAD was used in STC’s winning submission to the CHiME-6 challenge.
- The idea is to use speaker i-vectors from a first-pass diarization to perform a multi-label classification per-frame, which considers all speakers simultaneously.
- Limitation: it can only handle a fixed number of speakers (4 in the case of CHiME-6).
- My slides from a reading group presentation: slides
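A rough PyTorch sketch of the TS-VAD idea (a shared per-speaker branch conditioned on each i-vector, followed by a joint layer that predicts all speakers' frame-level activities); the layer types and sizes are my assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TSVAD(nn.Module):
    def __init__(self, feat_dim, ivector_dim, num_speakers=4, hidden=256):
        super().__init__()
        self.num_speakers = num_speakers
        # Shared per-speaker branch: frame features + that speaker's i-vector.
        self.per_speaker = nn.LSTM(feat_dim + ivector_dim, hidden,
                                   batch_first=True, bidirectional=True)
        # Joint layer that sees all speakers at once.
        self.joint = nn.LSTM(2 * hidden * num_speakers, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_speakers)

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim); ivectors: (B, num_speakers, ivector_dim)
        per_spk = []
        for s in range(self.num_speakers):
            ivec = ivectors[:, s:s + 1, :].expand(-1, feats.size(1), -1)
            h, _ = self.per_speaker(torch.cat([feats, ivec], dim=-1))
            per_spk.append(h)
        h, _ = self.joint(torch.cat(per_spk, dim=-1))
        return torch.sigmoid(self.out(h))   # (B, T, num_speakers) activities
```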
-
New advances in speaker diarization
- Multiple small tweaks in clustering-based diarization.
- Key contributions:
- using x-vectors + d-vectors
- using a neural network for scoring segment similarity (also see this paper which uses self-attention based scoring)
- better estimation of speaker count
- All these tweaks provide gains on the Callhome dataset (from 8.6% to 5.1%).
-
Speaker attribution with voice profiles by graph-based semi-supervised learning
- Key idea: build a graph whose nodes are the sub-segment embeddings and whose edges carry their similarities, then add the profile nodes and use “label propagation” to assign labels to the remaining nodes (see the sketch below).
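A minimal numpy sketch of graph-based label propagation of this kind, using the textbook propagation update rather than the paper's exact formulation:

```python
import numpy as np

def propagate_labels(seg_emb, profile_emb, num_iters=20, alpha=0.9):
    """seg_emb: (N, D) sub-segment embeddings; profile_emb: (P, D) voice profiles.

    Returns a profile index for each sub-segment.
    """
    X = np.vstack([profile_emb, seg_emb])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = np.clip(X @ X.T, 0.0, None)               # cosine-similarity graph
    np.fill_diagonal(W, 0.0)
    S = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-8)  # row-normalized

    P = profile_emb.shape[0]
    Y0 = np.zeros((X.shape[0], P))
    Y0[:P] = np.eye(P)                            # one-hot labels for profiles
    Y = Y0.copy()
    for _ in range(num_iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0    # propagate along the graph
        Y[:P] = np.eye(P)                         # keep profile nodes clamped
    return Y[P:].argmax(axis=1)
```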
-
Speaker Diarization System based on DPCA Algorithm For Fearless Steps Challenge Phase-2
- Key novelty is in using a new clustering method called Density Peak Clustering.
- Performance is better than AHC and spectral clustering for data containing non-convex clusters and outliers.
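A small numpy sketch of density peak clustering itself (not the paper's diarization pipeline); the cutoff distance and the way centers are picked are illustrative simplifications:

```python
import numpy as np

def density_peak_clustering(X, d_c=0.5, num_clusters=2):
    """X: (N, D) points; returns a cluster label per point."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (D < d_c).sum(axis=1) - 1               # local density (exclude self)

    n = X.shape[0]
    delta = np.zeros(n)                           # distance to nearest denser point
    nearest_denser = np.full(n, -1)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        if denser.size == 0:                      # the globally densest point
            delta[i] = D[i].max()
        else:
            j = denser[np.argmin(D[i, denser])]
            delta[i], nearest_denser[i] = D[i, j], j

    # Cluster centers: points where both density and delta are large.
    centers = np.argsort(rho * delta)[-num_clusters:]
    labels = np.full(n, -1)
    labels[centers] = np.arange(num_clusters)

    # Assign the rest in order of decreasing density, following the label of
    # the nearest denser neighbor.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            j = nearest_denser[i]
            labels[i] = labels[j] if j >= 0 else 0
    return labels
```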
-
Detecting and Counting Overlapping Speakers in Distant Speech Scenarios
- Overlap detection and VAD are jointly formulated as an overlapped speech detection and counting (OSDC) task, and temporal convolutional networks (TCNs) are used to tackle it.
- Experiments on AMI and CHiME-6 dataset show that TCNs are better than LSTM and CRNN models for this task.
II. Recognition
-
- Proposes improvements over the RawNet architecture for end-to-end speaker verification from raw waveforms.
- This is an alternative to the traditional approach which involves a front-end embedding extractor and a back-end like a PLDA classifier.
-
In defence of metric learning for speaker recognition
- Through extensive experimentation (20k GPU hours) on VoxCeleb, they show that metric learning learns better speaker embeddings than classification-based losses.
- Good overview of various loss functions, including a new angular prototypical loss.
- NOTE TO SELF: Needs careful reading.
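A minimal PyTorch sketch of the angular prototypical loss, assuming a batch with M utterances per speaker where the last utterance is the query and the mean of the rest is the prototype:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularPrototypicalLoss(nn.Module):
    def __init__(self, init_w=10.0, init_b=-5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))   # learnable scale
        self.b = nn.Parameter(torch.tensor(init_b))   # learnable bias

    def forward(self, embeddings):
        # embeddings: (num_speakers, M, dim), M >= 2 utterances per speaker
        query = F.normalize(embeddings[:, -1], dim=-1)                # (N, dim)
        proto = F.normalize(embeddings[:, :-1].mean(dim=1), dim=-1)   # (N, dim)
        logits = self.w.clamp(min=1e-6) * (query @ proto.t()) + self.b  # (N, N)
        labels = torch.arange(embeddings.size(0), device=embeddings.device)
        return F.cross_entropy(logits, labels)
```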
-
Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms
- Replace MFCC, VAD, and CMVN with deep learning tools, and extract embeddings directly from waveforms.
- Outperforms x-vector system on VoxCeleb1.
-
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
- Extensive comparison of feature extraction methods (14 methods studied)
- Contains good overview of available extraction methods
C. Speech enhancement/separation
I. Target speech extraction
-
Neural Spatio-Temporal Beamformer for Target Speech Separation
- Multi-tap MVDR beamformer which uses complex-valued masks for enhancement in multi-channel scenario.
- The model is jointly trained with ASR objective.
-
SpEx+: A Complete Time Domain Speaker Extraction Network
- This is a follow-up to their previous SpEx model, with the difference that the speaker embedding is now also computed in the time domain.
- This is done to avoid phase estimation that would be required to reconstruct the target signal, if the extraction is performed in frequency domain (e.g. in SpeakerBeam).
-
X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
- This is another time-domain speaker extraction model, but it’s based on TasNet (which is SOTA for speech separation).
- A new training strategy called Speaker Presence Invariant Training (SPIT) is proposed to handle the case where the model is asked to extract a speaker who is not present in the mixture.
-
Time-Domain Target-Speaker Speech Separation With Waveform-Based Speaker Embedding
- Another extraction model, but one that works entirely in the time domain (including the speaker embedding); the model is called WaveFilter.
- Auxiliary input is fed step-wise into the separation network through residual blocks.
- Experiments show improvements in SDR over SpeakerBeam.
-
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
- This is a follow-up on VoiceFilter, which uses d-vectors to extract the speaker’s speech from mixture.
- The novelty is that the extraction is now performed directly on filterbanks, and an asymmetric L2 loss is used to mitigate the over-suppression problem (see the sketch below).
- Model is made to fit on-device by 8-bit quantization.
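A minimal sketch of an asymmetric L2 loss of this kind; the exact form and the weight `alpha` are assumptions, with the heavier penalty applied when the clean target exceeds the estimate (i.e., over-suppression):

```python
import torch

def asymmetric_l2_loss(enhanced, clean, alpha=4.0):
    # enhanced, clean: (batch, time, num_filterbanks)
    diff = clean - enhanced
    # diff > 0 means target energy was suppressed; weight it alpha times more.
    weighted = torch.where(diff > 0, alpha * diff, diff)
    return (weighted ** 2).mean()
```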
II. Speech enhancement
-
Single-channel speech enhancement by subspace affinity minimization
- Learns separate speech and noise embeddings from the input using a subspace affinity loss function.
- Theoretically proven to maximally decorrelate speech and noise representations; empirically outperforms other popular single-channel methods on VCTK.
-
Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement
- A “noise encoder” learns noise representation from the noisy speech, which is then used to obtain enhanced STFT magnitude.
- Improves VoiceFilter performance in terms of PESQ and STOI.
-
Real Time Speech Enhancement in the Waveform Domain
- Causal DEMUCS model that runs in real time and matches performance of non-causal models.
- Trained using multiple objective functions.
- Augmentation schemes: Remix, Band-Mask, Revecho, and random shift.
D. Joint modeling, LMs, and others
-
- Extends the Serialized Output Training model for multispeaker ASR using an attention-based encoder-decoder.
- Speaker inventory is used as auxiliary input from which the speaker embeddings are obtained.
- Experiments are conducted using simulated mixtures from LibriSpeech.
-
Identifying Important Time-frequency Locations in Continuous Speech Utterances
- By masking increasingly large regions of the spectrogram while keeping the error rate unchanged, the authors determine which regions of the input are important.
- They mention that they will use this analysis to aid data augmentation in future work.
-
FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition
- A simple technique to perform early fusion of multi-microphone inputs, through a “fusion layer”.
- Consistent improvements on DIRHA dataset over delay-and-sum beamforming.
-
Speaker-Conditional Chain Model for Speech Separation and Extraction
- The task is to extract the speech of all the speakers in a multi-speaker recording.
- This is done by sequentially extracting the speech, conditioned on the embeddings of the previously extracted ones.
- Promising results on both WSJ-mix and LibriCSS.
- Note: This “chain” model is closely associated with the EEND line of work on diarization.
-
- Proposes “completion” tasks for speech-to-text and speech-to-speech, and trains encoder-decoder models for these.
- Speech-to-text completion models performed better than RNN-LM and BERT baselines on WER (although I don’t quite understand what is considered a “correct” completion).
-
Serialized Output Training for End-to-End Overlapped Speech Recognition
- Similar line of work as the joint training (see #1 in this list); task is multi-speaker overlapped ASR.
- Transcriptions of the speakers are generated one after another, separated by a speaker-change token (see the sketch below).
- Several advantages over the traditional permutation invariant training (PIT).
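A tiny sketch of how the serialized training target can be built, assuming a first-in-first-out ordering by start time; the `<sc>` token name here is illustrative:

```python
def serialize_transcripts(utterances, change_token="<sc>"):
    """utterances: list of (start_time, transcript) for one mixed recording."""
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {change_token} ".join(text for _, text in ordered)

# Two overlapping speakers become a single target sequence:
# serialize_transcripts([(1.2, "how are you"), (0.4, "hello there")])
# -> "hello there <sc> how are you"
```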
-
- Main novelty is that the model works for unknown number of speakers, using an iterative speech extraction system.
- Promising results on WSJ-mix with 2, 3, and 4 speakers.
-
Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for N-gram Language Models
- A data structure called “DashHashLM” to store n-gram LMs with efficient lookup for rescoring.
- 6x query speedup at the cost of about 10% more memory.
-
LVCSR with Transformer Language Models
- Several tweaks are proposed to improve rescoring time with transformer LMs, and to use them in single-pass systems.
- Proposed improvements include quantization of LM state, common prefix, and a hybrid lattice/n-best list rescoring.
-
Vector-Quantized Autoregressive Predictive Coding
- Best paper award
- Quantized representations are produced using the APC self-supervised objective.
- Probing tasks and mutual information used to show the presence and absence of information in learned representations from increasingly limited models.
E. Datasets
-
Spot the conversation: speaker diarisation in the wild
- New VoxConverse dataset for multi-modal diarization.
- Dev set: 1,218 minutes; test set: 53 hours.
-
- Dinner-party-like conversations with close-talk and array microphone recordings.
- 10 sessions, each between 15 and 45 minutes long.
- Using the Kaldi CHiME-5 acoustic model with adaptation gives approx. 80% WER in the far-field setting.
-
Speech recognition and multi-speaker diarization of long conversations
- Long-form multi-speaker recordings (approx 1 hour each) collected from This American Life podcast.
- Contains approx 640 hours of speech comprising 6608 unique speakers.
- Aligned transcripts are made publicly available
-
JukeBox: A Multilingual Singer Recognition Dataset
- 467 hours of singing audio sampled at 16 kHz, containing 936 unique singers and 18 different languages.
- Publicly available here.
-
MLS: A Large-Scale Multilingual Dataset for Speech Research
- Multilingual Librispeech data containing 32k hours of English and 4.5k hours across other languages.
- Will be made available on OpenSLR.
- Paper includes baselines using wav2letter++.
F. Toolkits
-
PYCHAIN: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
- Fully parallelized PyTorch implementation of end-to-end LF-MMI.
- Provides a wrapper around the forward-backward computation required for the LF-MMI gradient in Kaldi, so that it can be used for PyTorch-based neural network training.
-
Asteroid: the PyTorch-based audio source separation toolkit for researchers
- Provides Kaldi-style reproducible recipes for single-channel source separation datasets.
- Contains implementations of popular architectures performing on par with the reference papers, such as deep clustering, TasNet, WaveSplit, etc.
-
Surfboard: Audio Feature Extraction for Modern Machine Learning
- A Python library for audio feature extraction.
- Can be used from native Python or as a CLI tool.