In a previous note, we described the process of mask estimation using complex angular central Gaussian mixture models (CACGMMs), as used in guided source separation (GSS). Mask estimation means computing the activity $\gamma_{t,f,k}$ of each speaker $k$ at each time-frequency bin $(t,f)$.

Of course, CACGMMs are not the only mask estimation method. Recently, it has become quite popular to estimate masks with neural networks. However, such networks require a fixed number of speakers at training and test time. In subsequent notes, we will see continuous speech separation methods, which circumvent this problem and make neural network based mask estimation practical.

In any case, once the speaker-specific masks have been estimated, we still need to extract each speaker's audio from the mixture (which was the task in the first place). In this note, we will describe a popular method for doing this, known as mask-based minimum variance distortionless response (MVDR) beamforming. This discussion is based on Erdogan et al.

Again, our problem formulation is the same as in the previous note: we have (unit-normalized and dereverberated) mixture STFT features $\tilde{\mathbf{Y}}_{t,f}$, which can be written as the sum of the early reverberation components of the sources and the noise:

\[\tilde{\mathbf{Y}}_{t,f} = \sum_k \mathbf{X}_{t,f,k} + \mathbf{N'}_{t,f},\]

where $\mathbf{N'}_{t,f}$ is the noise. If we focus on a specific speaker $k$, we can further write this as:

\[\begin{aligned} \tilde{\mathbf{Y}}_{t,f} &= \mathbf{X}_{t,f,k} + \sum_{k'\neq k} \mathbf{X}_{t,f,k'} + \mathbf{N'}_{t,f} \\ &= \mathbf{X}_{t,f,k} + \mathbf{N}_{t,f}, \end{aligned}\]

where we have combined the interfering speakers $k' \neq k$ and the noise into a single distortion component $\mathbf{N}_{t,f}$.

We now have the mask $\gamma_{t,f,k}$ and want to recover $\mathbf{X}_{t,f,k}$. A naive way to do this is to simply multiply the mask point-wise with the mixture. However, much better results can be obtained by doing something a little more sophisticated, which we describe next.
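For reference, here is what the naive approach looks like (a minimal numpy sketch; the function name and array shapes are my own conventions, not part of the original formulation):

```python
import numpy as np

def naive_masking(Y: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Point-wise masking of a single-channel mixture STFT.

    Y:     mixture STFT, shape (T, F), complex
    gamma: estimated mask for speaker k, shape (T, F), values in [0, 1]
    """
    return gamma * Y
```

Point-wise masking suppresses interference energy but also tends to distort the target, which is one reason beamforming works better when multichannel recordings are available.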

Filter-and-sum beamforming

This beamforming technique is similar to the simpler delay-and-sum method, which I will first summarize. The idea is that when the signal from a single source is captured by several receivers, the received signals all have the same waveform but differ in delay and phase. So, if a different delay is applied to each input signal depending on the location of the microphone, the source signals captured at all the microphones become “in-phase”, while the additive noise remains out of phase. Adding all the signals and normalizing by the number of channels then attenuates the noise relative to the source. Mathematically, for a linear microphone array consisting of $D$ microphones spaced $m$ meters apart, the beamformed signal is given as

\[y(t)=\frac{1}{D} \sum_{d=1}^{D} x_{d}\left(t-\tau_{d}\right)\]

where $\tau_d$ is the delay applied to the signal at microphone $d$, given as

\[\tau_{d}=\frac{(d-1) m \cos \phi^{\prime}}{c},\]

where $c$ is the speed of sound and $\phi^{\prime}$ is the direction in which we want to steer the main lobe of the array. Here, all channels are assumed to have equal frequency response and hence are assigned equal amplitude weights, i.e., $a_d(f) = \frac{1}{D}$. In filter-and-sum beamforming, this assumption is removed, and instead a frequency-dependent weight is computed for each microphone:

\[y(t)=\sum_{d=1}^{D} a_{d}(t) x_{d}\left(t-\tau_{d}\right).\]

Note that in the above equation the amplitude weight is written in the time domain, whereas in practice it is computed in the frequency domain and applied to the signal there.
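The following sketch implements delay-and-sum steering with a frequency-domain phase shift, since fractional-sample delays are awkward to apply directly in the time domain. The function and parameter names are mine, and the circular FFT shift is an implementation shortcut:

```python
import numpy as np

def delay_and_sum(x: np.ndarray, m: float, phi: float, fs: float,
                  c: float = 343.0) -> np.ndarray:
    """Delay-and-sum beamforming for a uniform linear array.

    x:   multichannel time signal, shape (D, N)
    m:   microphone spacing in meters
    phi: steering direction in radians
    fs:  sampling rate in Hz
    """
    D, N = x.shape
    X = np.fft.rfft(x, axis=1)                # per-channel spectra
    f = np.fft.rfftfreq(N, d=1.0 / fs)        # FFT bin frequencies in Hz
    y = np.zeros_like(X[0])
    for d in range(D):
        tau = d * m * np.cos(phi) / c         # steering delay for mic d
        # x_d(t - tau_d) corresponds to X_d(f) * exp(-j 2 pi f tau_d);
        # this is a circular shift, which is acceptable for a sketch.
        y += X[d] * np.exp(-2j * np.pi * f * tau)
    return np.fft.irfft(y / D, n=N)           # average over channels
```

Filter-and-sum generalizes this by replacing the uniform weight $\frac{1}{D}$ with per-channel, frequency-dependent weights $a_d(f)$; MVDR, described next, is one principled way of choosing such weights.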

MVDR beamforming

Given the masks $\gamma_{t,f,k}$ and the mixture STFT signal $\tilde{\mathbf{Y}}_{t,f}$, we first compute the spatial covariance matrices (SCMs) as

\[\mathbf{\Phi}_k (f) = \frac{1}{T} \sum_{t} \gamma_{t,f,k} \tilde{\mathbf{Y}}_{t,f} \tilde{\mathbf{Y}}_{t,f}^H,\]

for the target speaker $k$, and similarly $\mathbf{\Phi}_N(f)$ for the distortion, using the distortion mask (the sum of the noise mask and all interfering speaker masks).
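In code, the mask-weighted SCMs reduce to a single einsum (a sketch; storing the mixture channels-first as (D, T, F) is a convention I chose, not one fixed by the text):

```python
import numpy as np

def spatial_covariance(Y: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Mask-weighted spatial covariance matrices.

    Y:     mixture STFT, shape (D, T, F), complex
    gamma: mask, shape (T, F); pass the target mask for Phi_k and the
           distortion mask (noise + interfering speakers) for Phi_N
    Returns SCMs of shape (F, D, D), one D x D matrix per frequency.
    """
    T = Y.shape[1]
    # For each f: (1/T) * sum_t gamma[t, f] * Y[:, t, f] Y[:, t, f]^H
    return np.einsum('tf,atf,btf->fab', gamma, Y, Y.conj()) / T
```

Using the SCMs, the filters are computed as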

\[\mathbf{h}(f) = \frac{\mathbf{\Phi}_N^{-1}(f) \mathbf{\Phi}_k(f) \mathbf{e}_{\mathrm{ref}}}{\mathrm{tr}\left(\mathbf{\Phi}_N^{-1}(f) \mathbf{\Phi}_k(f)\right)},\]

where $\mathbf{e}_{\mathrm{ref}}$ is a one-hot vector selecting the reference channel, and $\mathbf{h}(f)$ is a $D$-dimensional vector that gives the weight of each channel for frequency $f$.
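This is again a short sketch, continuing the shape conventions from above; taking channel 0 as the reference is my assumption for illustration:

```python
import numpy as np

def mvdr_filter(Phi_k: np.ndarray, Phi_N: np.ndarray, ref: int = 0) -> np.ndarray:
    """Time-invariant MVDR filters, shape (F, D).

    Phi_k, Phi_N: target and distortion SCMs, shape (F, D, D)
    ref:          index of the reference channel (the one-hot e_ref)
    """
    numerator = np.linalg.solve(Phi_N, Phi_k)      # Phi_N^{-1} Phi_k, per frequency
    trace = np.trace(numerator, axis1=1, axis2=2)  # tr(Phi_N^{-1} Phi_k), per frequency
    # Multiplying by e_ref simply selects the ref-th column of the numerator.
    return numerator[:, :, ref] / trace[:, None]
```

The final beamformed signal is then given as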

\[\hat{X}_{t,f} = \mathbf{h}^{\mathsf{H}}(f)\, \tilde{\mathbf{Y}}_{t,f},\]

where $(\cdot)^{\mathsf{H}}$ denotes the conjugate transpose, i.e., the beamformed output at each time-frequency bin is the inner product of the filter with the mixture. Since the filter vector is constant for all $t$, this type of beamformer is called a time-invariant beamformer.
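Applying the filter is then a single inner product per time-frequency bin (again under my (D, T, F) convention):

```python
import numpy as np

def apply_beamformer(h: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Apply time-invariant filters to the mixture STFT.

    h: MVDR filters, shape (F, D)
    Y: mixture STFT, shape (D, T, F)
    Returns the single-channel estimate X_hat, shape (T, F).
    """
    # h^H Y at every (t, f): conjugate the filter, contract over channels.
    return np.einsum('fd,dtf->tf', h.conj(), Y)
```

An inverse STFT of the result then yields the estimated time-domain signal for speaker $k$.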