Speaker diarization is often formulated as a clustering of speaker embeddings. If we use conventional clustering methods such as k-means or spectral clustering, they ignore the sequential nature of turn-taking and only perform the clustering based on similarity of the embeddings. BUT’s VBx is a robust and mathematically principled approach to solve this problem by performing clustering by modeling speakers as HMM latent states and the embeddings as the emissions.

In this note, we will describe variational Bayes and the VBx algorithm. This note is based on Landini et al.’s paper and online resources on VB methods (such as this post).

Variational Bayes

Suppose there is some latent variable $\mathbf{Z}$ that we want to infer and we have some observations $\mathbf{X}$. In our case of speaker diarization, $\mathbf{X}$ may be the sequence of x-vectors, and $\mathbf{Z}$ are discrete speaker states.

Variational objective: Evidence lower bound (ELBO)

In most such problems, we are actually looking to compute the posterior $p(\mathbf{Z}\mid\mathbf{X})$, which is given as:

\[p(\mathbf{Z}\mid\mathbf{X}) = \frac{p(\mathbf{X}\mid\mathbf{Z})p(\mathbf{Z})}{\int p(\mathbf{X}\mid\mathbf{Z})p(\mathbf{Z}) \mathrm{d}\mathbf{Z}}.\]

Here, the denominator term is the marginal $p(\mathbf{X})$, and it is hard to compute because of the integral. Instead of computing this integral, we want to convert the problem to an optimization problem which can be solved by taking derivatives.

For this, we approximate the posterior by a distribution $q(\mathbf{Z})$ such that

\[q^{\ast}(\mathbf{Z}) = \text{arg}\min_{q(\mathbf{Z})\in Q} \text{KL} (q(\mathbf{Z})||p(\mathbf{Z}\mid\mathbf{X})),\]

where $Q$ is some family of densities. Let us try to expand the KL-divergence term:

\[\begin{aligned} \text{KL} (q(\mathbf{Z})||p(\mathbf{Z}\mid\mathbf{X})) &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid\mathbf{X})} \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z}\mid\mathbf{X}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z},\mathbf{X}) \right] + \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{X}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z},\mathbf{X}) \right] + \log p(\mathbf{X}), \end{aligned}\]

where the last step is because $p(\mathbf{X})$ is a constant for the expectation. This shows that computing the KL-divergence is again hard because of the marginal term. However, it is a constant regardless of the value of $q(\mathbf{Z})$, and so we can ignore it. Also, minimizing the KL-divergence is equal to maximizing its negative. Applying both of these ideas, we can write the objective as

\[\begin{aligned} \text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z},\mathbf{X}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{X}\mid\mathbf{Z}) \right] + \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{X}\mid\mathbf{Z}) \right] - \text{KL}(q(\mathbf{Z})||p(\mathbf{Z})). \end{aligned}\]

So maximizing the objective means maximizing the likelihood while ensuring that $q(\mathbf{Z})$ does not stray far from the prior $p(\mathbf{Z})$. The objective is called “evidence lower bound” because it is a lower bound for the marginal $p(\mathbf{X})$, which is sometimes called the evidence.

The objective is called “variational” because we are trying to find a function, instead of a single value, that maximizes some objective.

Mean field approximation

Since $q(\mathbf{Z})$ can come from a large distribution space, we need to constrain it in some way to make the optimization problem feasible. One popular way to do this is to assume that each $z_j$ is independent, such that

\[q(\mathbf{Z}) = \prod_{j=1}^m q_j(z_j).\]

We can then optimize the ELBO by considering one of the $z_j$’s at a time. This is usually referred to as “coordinate ascent”.

The VBx algorithm

Now that we have some idea of how variational inference works, let us turn to VBx, which applies this to the task of infering a sequence of speaker states given a sequence of x-vectors.

The algorithm assumes that the sequence of x-vectors is generated using a Bayesian HMM, where the latent variables are discrete speaker states. Transition probabilities are simple and are given as

\[p(z_t=s | z_{t-1}=s') = p(s|s').\]

The emission probabilities are where the “bayesian” part of the HMM comes in. Suppose the x-vectors are transformed to a space with zero mean. For each speaker state $s$, we assume that the emission for that state is given as

\[p(\mathbf{x}_t | s) = \mathcal{N}(\mathbf{x}_t; \mathbf{m}_s, \mathbf{I}),\]

where $\mathbf{m}_s$ is a speaker-specific mean and $I$ is an identity covariance matrix. We further assume that the speaker-specific means are distributed as a zero-centered Gaussian with diagonal covariance, i.e.,

\[p(\mathbf{m}_s) = \mathcal{N}(\mathbf{m}_s; \mathbf{0}, \Sigma).\]

By taking a standard normal distributed variable $\mathbf{y}_s$, we can reparametrize this as

\[\mathbf{m}_s = \mathbf{V} \mathbf{y}_s,\]

where $\mathbf{V} = \Sigma^{\frac{1}{2}}$. Substituting in above, we can characterize the emission probability as

\[p(\mathbf{x}_t | \mathbf{y}_s) = \mathcal{N}(\mathbf{x}_t; \mathbf{V}\mathbf{y}_s, \mathbf{I}).\]

The above distribution is fully characterized by $\mathbf{y}_s$, since the matrix \mathbf{V} is pre-computed and shared among all speaker states. Since the latent variable $\mathbf{y}_s$ itself has a standard normal prior, the model is called a Bayesian HMM (a prior is imposed on the emission probabilities). The complete model can be defined in terms of the joint probability of the observed and latent variables:

\[\begin{aligned} p(\mathbf{X},\mathbf{Z},\mathbf{Y}) &= p(\mathbf{X}\mid\mathbf{Z},\mathbf{Y}) p(\mathbf{Z}) p(\mathbf{Y}) \\ &= \prod_t p(\mathbf{x}_t|z_t) \prod_t p(z_t|z_{t-1}) \prod_s p(\mathbf{y}_s). \end{aligned}\]


For inference, we need to compute $p(\mathbf{Z}|\mathbf{X})$, which can be obtained by marginalizing out $\mathbf{Y}$ from the posterior

\[p(\mathbf{Z}|\mathbf{X}) = \int p(\mathbf{Z},\mathbf{Y}|\mathbf{X}) \mathrm{d}\mathbf{Y}.\]

As we saw before, the posterior $p(\mathbf{Z},\mathbf{Y}|\mathbf{X})$ is hard to compute because of the marginal term in the denominator, and so we will approximate it by $q(\mathbf{Z},\mathbf{Y})$. Again, using mean-field approximation, we will assume that

\[q(\mathbf{Z},\mathbf{Y}) = q(\mathbf{Z})q(\mathbf{Y}).\]

We can write this as an optimization problem similar to the one described in the previous section, and solve it by maximizing the ELBO, which is given as

\[\mathrm{ELBO}(q) = \mathbb{E}_{q(\mathbf{Z},\mathbf{Y})}\left[ \log p(\mathbf{X}\mid\mathbf{Y},\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Y})} \left[ \log \frac{q(\mathbf{Y})}{p(\mathbf{Y})}\right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{q(\mathbf{Z})}{p(\mathbf{Z})}\right],\]

similar to the simplification done earlier.

As per the mean-field approximation, we solve the above optimization problem by iteratively solving for $q(\mathbf{Y})$ and $q(\mathbf{Z})$ assuming that the other is fixed. These can be done similar to the forward-backward algorithm for learning an HMM, so we skip the details here.