Guided source separation (GSS) is an unsupervised algorithm for target speech extraction, first proposed in the Paderborn submission to the CHiME-5 challenge. Given a noisy (and reverberant) multichannel recording containing multiple speakers, and a time-annotated segment where a desired speaker is active, GSS extracts (relatively) clean audio of the desired speaker in that segment, while removing interference in the form of background noise or overlapping speakers.
Note: I have recently implemented a GPU-accelerated version of GSS which can be used for datasets other than CHiME-5 (such as LibriCSS or AMI). It can be found here.
The overall GSS method contains 3 stages:
1. Dereverberation using WPE
2. Mask estimation using CACGMMs
3. Mask-based MVDR beamforming
In this note, I will focus specifically on Step 2, i.e., mask estimation using CACGMMs. We will look at WPE and MVDR components in other notes.
Let $\mathbf{Y}_{t,f}$ be a multichannel signal in the STFT domain. Suppose there are $D$ channels, i.e., each TF bin is a $D$-dimensional vector. We assume the following model of the signal:
\[\mathbf{Y}_{t,f} = \sum_k \mathbf{X}_{t,f,k}^{\mathrm{early}} + \sum_k \mathbf{X}_{t,f,k}^{\mathrm{tail}} + \mathbf{N}_{t,f},\]where $k$ indexes the speakers, “early” and “tail” denote the early-reverberation part of the signal and the late reverberation, respectively, and $\mathbf{N}_{t,f}$ is the noise in the STFT domain. We can sum the late reverberation over all speakers and denote it as $\mathbf{X}_{t,f}^{\mathrm{tail}} = \sum_k \mathbf{X}_{t,f,k}^{\mathrm{tail}}$.
The WPE component (stage 1 of the method) estimates this quantity $\mathbf{X}_{t,f}^{\mathrm{tail}}$ and removes it from $\mathbf{Y}_{t,f}$, so that at the mask estimation stage we only have the early-reverberated mixture with noise. Suppose we further normalize the TF bin vectors into unit vectors and denote the resulting signal as $\tilde{\mathbf{Y}}_{t,f}$.
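As a small illustration of the normalization step, the NumPy sketch below scales every TF-bin vector to unit Euclidean norm. The function name `unit_normalize`, the `(T, F, D)` array layout, and the `eps` guard against all-zero bins are my assumptions for this example; they are not taken from any particular GSS implementation.

```python
import numpy as np

def unit_normalize(Y, eps=1e-10):
    """Scale each D-dimensional TF-bin vector to unit Euclidean norm.

    Y: complex array of shape (T, F, D) = (frames, frequency bins, channels).
    eps: assumed guard so that all-zero bins do not cause a division by zero.
    """
    norm = np.linalg.norm(Y, axis=-1, keepdims=True)  # (T, F, 1), real-valued
    return Y / np.maximum(norm, eps)

# Toy input: random complex "STFT" with T=100 frames, F=5 bins, D=4 channels.
rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5, 4)) + 1j * rng.standard_normal((100, 5, 4))
Y_tilde = unit_normalize(Y)
```

After this step only the direction of each bin vector is kept (its inter-channel phase and level differences); the overall magnitude is discarded.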
The mask estimation technique is based on the “sparsity assumption”, which says that only one speaker is active in each time-frequency bin. Using this assumption, the vector in each TF bin can be assumed to have been generated from a mixture model where each component of the mixture belongs to a different speaker.
In the case of GSS, each mixture component is a complex angular central Gaussian (CACG). This can seem like a loaded term, but let us break it down. It is similar to a standard multivariate normal distribution, except for 2 things: (i) it models complex-valued random variables instead of real-valued variables (which is useful for us since STFTs are complex-valued), and (ii) it distributes the random variable on a unit hypersphere $S$ (which is again relevant since we unit-normalized each STFT bin).
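To build some intuition for a distribution that lives on the unit hypersphere, here is a minimal sketch of one standard way to sample from a CACG: draw zero-mean complex Gaussian vectors with a given Hermitian positive-definite covariance and project them onto the unit sphere. GSS itself never needs to sample; this is only an illustration, and the names `sample_cacg` and `B` are my own choices for the example.

```python
import numpy as np

def sample_cacg(B, n, rng):
    """Draw n unit-norm complex vectors whose direction follows a CACG with matrix B.

    B: (D, D) Hermitian positive-definite matrix.
    """
    D = B.shape[0]
    L = np.linalg.cholesky(B)  # lower-triangular factor, B = L L^H
    noise = rng.standard_normal((n, D)) + 1j * rng.standard_normal((n, D))
    x = noise @ L.T            # each row is L @ noise_i, so its covariance is proportional to B
    return x / np.linalg.norm(x, axis=-1, keepdims=True)  # project onto the unit sphere

rng = np.random.default_rng(0)
D = 4
A = rng.standard_normal((D, D)) + 1j * rng.standard_normal((D, D))
B = A @ A.conj().T + np.eye(D)  # a random Hermitian positive-definite matrix
z = sample_cacg(B, 1000, rng)
```

Because of the final normalization, rescaling `B` by any positive constant yields the same distribution, which is why only the "shape" of the matrix matters.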
Recall that a standard multivariate Gaussian is characterized by a mean vector $\mathbf{\mu}$ and covariance matrix $\mathbf{\Sigma}$, and the density function is
\[p(\mathbf{x}) = \left(\frac{1}{2\pi}\right)^{\frac{D}{2}} \det(\mathbf{\Sigma})^{-\frac{1}{2}} \exp \left[ -\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}) \right].\]In the case of a CACG, since it is zero-centered, we only have one parameter, denoted as $\mathbf{B}$, which is a positive-definite Hermitian matrix that controls everything about the distribution. The density function is given as:
\[p(\mathbf{z}) = \frac{(D-1)!}{2\pi^{D}} \det(\mathbf{B})^{-1} (\mathbf{z}^H \mathbf{B}^{-1} \mathbf{z})^{-D}.\]The CACGMM is then given as a mixture of CACG components as follows:
\[p(\tilde{\mathbf{Y}}_{t,f}) = \sum_k \pi_{f,k} \mathcal{A}(\tilde{\mathbf{Y}}_{t,f};\mathbf{B}_{f,k}),\]where $\mathcal{A}(\tilde{\mathbf{Y}}_{t,f};\mathbf{B}_{f,k})$ is the density of a single CACG component, and $\pi_{f,k}$ are the mixture weights.
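The density of a single CACG component can be sketched in a few lines of NumPy. This is an illustrative version working in the log domain, not code from an existing GSS implementation. I assume the normalization constant $(D-1)!/(2\pi^D)$, i.e., the reciprocal of the surface area of the complex unit sphere, and the function name `cacg_log_pdf` is hypothetical.

```python
import numpy as np
from math import lgamma, log, pi

def cacg_log_pdf(z, B):
    """Log-density of a CACG with parameter matrix B, evaluated at unit vectors z.

    z: complex array of shape (..., D) with unit-norm vectors along the last axis.
    B: (D, D) Hermitian positive-definite matrix.
    """
    D = B.shape[-1]
    _, logdet = np.linalg.slogdet(B)  # log det(B); real for Hermitian PD matrices
    Binv = np.linalg.inv(B)
    # Quadratic form z^H B^{-1} z (real and positive for positive-definite B).
    quad = np.einsum('...d,de,...e->...', z.conj(), Binv, z).real
    # log[(D-1)! / (2 pi^D)] - log det(B) - D log(z^H B^{-1} z)
    return lgamma(D) - log(2.0) - D * log(pi) - logdet - D * np.log(quad)

rng = np.random.default_rng(0)
D = 4
A = rng.standard_normal((D, D)) + 1j * rng.standard_normal((D, D))
B = A @ A.conj().T + D * np.eye(D)  # random Hermitian positive-definite matrix
z = rng.standard_normal((10, D)) + 1j * rng.standard_normal((10, D))
z /= np.linalg.norm(z, axis=-1, keepdims=True)
log_p = cacg_log_pdf(z, B)
```

Two properties are easy to check with this sketch: with $\mathbf{B} = \mathbf{I}$ the density is constant (uniform on the sphere), and scaling $\mathbf{B}$ by a positive constant leaves the density unchanged, because the determinant and the quadratic form compensate each other.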
At this point, it may seem like we can just run the EM algorithm independently for each frequency bin on the CACGMM model to compute its parameters. But there are two problems:

The same mixture component $k$ may correspond to different speakers in different frequency bins, leading to the well-known permutation problem.

We do not know the number of mixture components $k$.
This is where the “guided” part of GSS comes in: if we have external guidance in the form of speaker-level time annotations (either oracle or computed using a diarizer), we can use it to (i) fix the global speaker order, and (ii) fix the number of mixture components. We denote the speaker-time annotations as $a_{t,k}$, which take values 0 or 1 based on whether the speaker $k$ is active at time $t$. We can then convert the time-invariant mixture weights $\pi_{f,k}$ to time-varying weights as
\[\pi_{t,f,k} = \frac{\pi_{f,k}a_{t,k}}{\sum_{k'}\pi_{f,k'}a_{t,k'}}.\]Now we are ready to apply the EM algorithm to learn the CACGMM. The E-step involves computing the state posteriors at each time step as:
\[\gamma_{t,f,k} = \frac{\pi_{t,f,k}\det(\mathbf{B}_{f,k})^{-1}(\tilde{\mathbf{Y}}_{t,f}^H \mathbf{B}_{f,k}^{-1} \tilde{\mathbf{Y}}_{t,f})^{-D}}{\sum_{k'}\pi_{t,f,k'}\det(\mathbf{B}_{f,k'})^{-1}(\tilde{\mathbf{Y}}_{t,f}^H \mathbf{B}_{f,k'}^{-1} \tilde{\mathbf{Y}}_{t,f})^{-D}}.\]And the M-step is:
\[\pi_{f,k} = \frac{1}{T}\sum_t \gamma_{t,f,k},\] \[\mathbf{B}_{f, k}=D \frac{\sum_{t} \gamma_{t, f, k} \frac{\tilde{\mathbf{Y}}_{t, f} \tilde{\mathbf{Y}}_{t, f}^{\mathrm{H}}}{\tilde{\mathbf{Y}}_{t, f}^{\mathrm{H}} \mathbf{B}_{f, k}^{-1} \tilde{\mathbf{Y}}_{t, f}}}{\sum_{t} \gamma_{t, f, k}},\]where $T$ is the number of frames and the $\mathbf{B}_{f,k}$ on the right-hand side is the estimate from the previous iteration. The E and M steps are repeated for a specified number of iterations, and the $\gamma_{t,f,k}$ obtained at the end of this process are returned as the estimated speaker masks.
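To make the whole procedure concrete, here is a minimal NumPy sketch of the guided EM loop, vectorized over frequency bins. It is a sketch under stated assumptions, not the reference implementation. The function names, the identity initialization of $\mathbf{B}_{f,k}$, the `eps` guards, and the small diagonal loading `diag_load` are my own choices for the example. I also assume the weight update averages the posteriors over the $T$ frames, and that the covariance update uses the outer product $\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^{\mathrm{H}}$.

```python
import numpy as np

def guided_weights(pi, a, eps=1e-10):
    """Time-varying weights pi_{t,f,k} from weights pi (F, K) and 0/1 activity a (T, K)."""
    w = pi[None, :, :] * a[:, None, :]  # (T, F, K)
    return w / np.maximum(w.sum(axis=-1, keepdims=True), eps)

def guided_cacgmm_em(Y, a, n_iter=10, eps=1e-10, diag_load=1e-6):
    """Guided CACGMM mask estimation.

    Y: (T, F, D) complex array, unit-norm along the last axis.
    a: (T, K) array of 0/1 speaker activities.
    Returns the posteriors gamma with shape (T, F, K).
    """
    T, F, D = Y.shape
    K = a.shape[1]
    eye = np.eye(D, dtype=complex)
    pi = np.full((F, K), 1.0 / K)      # assumed initialization: uniform weights
    B = np.tile(eye, (F, K, 1, 1))     # assumed initialization: identity matrices
    active = a[:, None, :] > 0         # (T, 1, K), broadcasts over frequency
    for _ in range(n_iter):
        # E-step: posteriors from the current pi and B, computed in the log domain.
        Binv = np.linalg.inv(B)
        _, logdet = np.linalg.slogdet(B)  # (F, K)
        quad = np.einsum('tfd,fkde,tfe->tfk', Y.conj(), Binv, Y).real
        quad = np.maximum(quad, eps)
        loglik = -logdet[None] - D * np.log(quad)   # constant factor cancels
        loglik = np.where(active, loglik, -np.inf)  # inactive speakers get zero mass
        peak = loglik.max(axis=-1, keepdims=True)
        peak = np.where(np.isfinite(peak), peak, 0.0)
        gamma = guided_weights(pi, a, eps) * np.exp(loglik - peak)
        gamma = gamma / np.maximum(gamma.sum(axis=-1, keepdims=True), eps)
        # M-step: quad still holds the quadratic form under the previous B.
        Nk = gamma.sum(axis=0)  # (F, K)
        pi = Nk / T
        B = D * np.einsum('tfk,tfd,tfe->fkde', gamma / quad, Y, Y.conj())
        B = B / np.maximum(Nk, eps)[..., None, None] + diag_load * eye
    return gamma

# Toy example: 2 speakers with partially overlapping activity.
rng = np.random.default_rng(0)
T, F, D, K = 60, 2, 3, 2
Y = rng.standard_normal((T, F, D)) + 1j * rng.standard_normal((T, F, D))
Y /= np.linalg.norm(Y, axis=-1, keepdims=True)
a = np.zeros((T, K))
a[:40, 0] = 1  # speaker 0 active in frames 0..39
a[20:, 1] = 1  # speaker 1 active in frames 20..59
gamma = guided_cacgmm_em(Y, a, n_iter=5)
```

With the identity initialization, the first E-step simply returns the guided weights, so the activity annotations alone determine the initial masks. One caveat of this sketch: the time-varying weights are undefined in frames where no source is marked active, and two speakers with identical activity patterns would never be told apart from this symmetric starting point.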
In subsequent notes, we will see how these masks can be used for target-speaker extraction from the noisy multi-speaker mixture.