GBO notes: Mask estimation for GSS

Guided source separation (GSS) is an unsupervised algorithm for target speech extraction, first proposed in the Paderborn submission to the CHiME-5 challenge. Given a noisy (and reverberant) multi-channel recording containing multiple speakers, and a time-annotated segment where a desired speaker is active, GSS solves the task of extracting a (relatively) clean audio of the desired speaker in the segment, while removing interference in the form of background noise or overlapping speakers.

Note: I have recently implemented a GPU-accelearated version of GSS which can be used for datasets other than CHiME-5 (such as LibriCSS or AMI). It can be found here.

The overall GSS method contains 3 stages:

Dereverberation using WPE
Mask estimation using CACGMMs
Mask-based MVDR beamforming

In this note, I will focus specifically on Step 2, i.e., mask estimation using CACGMMs. We will look at WPE and MVDR components in other notes.

Let $\mathbf{Y}_{t,f}$ be a multi-channel signal in the STFT domain. Suppose there are $D$ channels, i.e., each T-F bin is a $D$-dimensional vector. We assume the following model of the signal:

\[\mathbf{Y}_{t,f} = \sum_k \mathbf{X}_{t,f,k}^{\mathrm{early}} + \sum_k \mathbf{X}_{t,f,k}^{\mathrm{tail}} + \mathbf{N}_{t,f},\]

where $k$ denotes the speaker indices, “early” and “tail” are the early reverberation part of the signal and the late reverberations, and $\mathbf{N}_{t,f}$ is the noise in STFT domain. We can sum up the late reverberations as denote it as $\mathbf{X}_{t,f}^{\mathrm{tail}}$.

The WPE component (stage 1 of the method) estimates this quantity $\mathbf{X}_{t,f}^{\mathrm{tail}}$ and removes it from $\mathbf{Y}_{t,f}$, so that at the mask estimation stage we only have the early-reverberated mixture with noise. Suppose we further normalize the T-F bin vectors into unit vectors and denote the resulting signal as $\tilde{\mathbf{Y}}_{t,f}$.

The mask estimation technique is based on the “sparsity assumption”, which says that only one speaker is active in each time-frequency bin. Using this assumption, the vector in each T-F bin can be assumed to have been generated from a mixture model where each component of the mixture belongs to a different speaker.

In the case of GSS, each mixture component is a complex angular central Gaussian. This can seem like a loaded term, but let us break it down. It is similar to a standard multivariate normal distribution, except for 2 things: (i) it models complex-valued random variables instead of real-valued variables (which is useful for us since STFT’s are complex-valued), and (ii) it distributes the random variable on a unit hypersphere $S$ (which is again relevant since we unit normalized each STFT bin).

Recall that a standard multi-variate Gaussian is characterized by a mean vector $\mathbf{\mu}$ and covariance matrix $\mathbf{\Sigma}$, and the density function is

\[p(\mathbf{x}) = \left(\frac{1}{2\pi}\right)^{\frac{D}{2}} |\mathbf{\Sigma}|^{-\frac{1}{2}} \exp \left[ -\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}) \right].\]

In the case of a CACG, since it is zero-centered, we only have one parameter, denoted as $\mathbf{B}$, which is a positive-definite Hermitian matrix that controls everything about the distribution. The density function is given as:

\[p(\mathbf{z}) = \left(\frac{1}{2\pi}\right)^{D} (D-1)! |\mathbf{B}|^{-1} (\mathbf{z}^H \mathbf{B}^{-1} \mathbf{z})^{-D}.\]

The CACGMM is then given as a mixture of CACG components as follows:

\[p(\tilde{\mathbf{Y}}_{t,f}) = \sum_k \pi_{f,k} \mathcal{A}(\tilde{\mathbf{Y}}_{t,f};\mathbf{B}_{f,k}),\]

where $\mathcal{A}(\tilde{\mathbf{Y}}_{t,f};\mathbf{B}_{f,k})$ is the contribution of a single CACG, and $\pi$ are the mixture weights.

At this point, it may seem like we can just run the EM algorithm independently for each frequency bin on the CACGMM model to compute its parameters. But there are two problems:

The same mixture component $k$ may correspond to different speakers in different frequency bins, leading to the well-known permutation problem.
We do not know the number of mixture components $k$.

This is where the “guided” part of GSS comes in: if we have external guidance in the form of speaker-level time annotations (either oracle or computed using a diarizer), we can use it to (i) fix the global speaker order, and (ii) fix the number of mixture components. We denote the speaker-time annotations as $a_{t,k}$, which takes values 0 or 1 based on whether the speaker $k$ is active at time $t$. We can then convert the time-invariant mixture weights $\pi_{f,k}$ to time-varying weights as

\[\pi_{t,f,k} = \frac{\pi_{f,k}a_{t,k}}{\sum_{k'}\pi_{f,k'}a_{t,k'}}.\]

Now we are ready to apply the EM algorithm to learn the CACGMM. The E-step involves computing the state posteriors at each time step as:

\[\gamma_{t,f,k} = \frac{\pi_{t,f,k}|\mathbf{B}_{f,k}|^{-1}(\tilde{\mathbf{Y}}_{t,f}^H \mathbf{B}^{-1} \tilde{\mathbf{Y}}_{t,f})^{-D}}{\sum_{k'}\pi_{t,f,k'}|\mathbf{B}_{f,k'}|^{-1}(\tilde{\mathbf{Y}}_{t,f}^H \mathbf{B}^{-1} \tilde{\mathbf{Y}}_{t,f})^{-D}}.\]

And the M-step is:

\[\pi_{f,k} = \sum_t \pi_{t,f,k},\] \[\mathbf{B}_{f, k}=D \frac{\sum_{t} \gamma_{t, f, k} \frac{\tilde{\mathbf{Y}}_{t, f}^{\mathrm{H}} \tilde{\mathbf{Y}}_{t, f}}{\tilde{\mathbf{Y}}_{t, f}^{\mathrm{H}} \mathbf{B}_{f, k}^{-1} \tilde{\mathbf{Y}}_{t, f}}}{\sum_{t} \gamma_{t, f, k}}.\]

The E and M steps are repeated for a specified number of iterations, and the $\gamma_{t,f,k}$ obtained at the end of this process are returned as the estimated speaker masks.

In subsequent notes, we will see how these masks can be used for target-speaker extraction from the noisy multi-speaker mixture.