In this note, I will review a popular clustering algorithm called spectral clustering. We will discuss its connection to the min-cut problem in graph partitioning, and then look at 2 methods to extend it to multi-class clustering. This post is based heavily on this tutorial.

Similarity graph and the Laplacian matrix

Suppose we have $n$ observations $x_1, \dots, x_n$ and pairwise similarity scores $s_{ij} \ge 0$. We can represent them as a similarity graph $G = (V, E)$, where $V = \{x_1, \dots, x_n\}$ and $E$ contains edges between nodes for which $s_{ij}$ is greater than some threshold, weighted by $s_{ij}$.

This is an undirected graph, and we can define it in terms of its adjacency matrix $W = (w_{ij})_{i,j = 1, \dots, n}$. We can also define a degree matrix $D$, which is a diagonal matrix such that

$$d_i = \sum_{j=1}^{n} w_{ij}.$$

Given two subsets of vertices $A$ and $B$, we define $W(A, B) = \sum_{i \in A, j \in B} w_{ij}$.

The unnormalized graph Laplacian is defined as $L = D - W$. Let us prove some properties of the Laplacian matrix.
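To make the construction concrete, here is a minimal sketch in NumPy that builds $W$, $D$, and $L$ from raw data. The Gaussian similarity $s_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ and the threshold value are illustrative choices of mine, not something fixed by the discussion above.

```python
import numpy as np

def unnormalized_laplacian(X, sigma=1.0, threshold=0.1):
    """Build W, D and L = D - W for data points X of shape (n, d)."""
    # Pairwise squared Euclidean distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian similarity scores s_ij (an illustrative choice of similarity).
    S = np.exp(-sq_dists / (2 * sigma ** 2))
    # Keep an edge only if the similarity exceeds the threshold, weighted by s_ij.
    W = np.where(S > threshold, S, 0.0)
    np.fill_diagonal(W, 0.0)        # no self-loops
    D = np.diag(W.sum(axis=1))      # degree matrix with d_i = sum_j w_ij
    L = D - W                       # unnormalized graph Laplacian
    return W, D, L
```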

Property 1: $L$ is symmetric and positive semi-definite.

Proof: Since both $D$ and $W$ are symmetric, $L$ must be symmetric. Consider some vector $f \in \mathbb{R}^n$. We have

$$f^T L f = f^T D f - f^T W f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} f_i f_j w_{ij} = \frac{1}{2}\left(\sum_{i=1}^{n} d_i f_i^2 - 2\sum_{i,j=1}^{n} f_i f_j w_{ij} + \sum_{j=1}^{n} d_j f_j^2\right) = \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}(f_i - f_j)^2 \ge 0,$$

which means $L$ is positive semi-definite.

The above also means that if $f$ is a vector that assigns a value to each node in the graph $G$, then the weighted sum of squared differences between the values at neighboring nodes can be obtained by computing $f^T L f$.
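As a quick numerical sanity check of this identity, here is a self-contained sketch using a random symmetric weight matrix (not any particular similarity graph):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric weight matrix with zero diagonal.
upper = np.triu(rng.random((10, 10)), k=1)
W = upper + upper.T
D = np.diag(W.sum(axis=1))
L = D - W

# f^T L f should equal the weighted sum of squared differences over all pairs.
f = rng.normal(size=10)
quad_form = f @ L @ f
weighted_sum = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(quad_form, weighted_sum)
```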

Property 2: The smallest eigenvalue of $L$ is 0, and the corresponding eigenvector is the constant vector $\mathbf{1}$.

Proof: We have

$$L \mathbf{1} = D \mathbf{1} - W \mathbf{1} = (d_1, \dots, d_n)^T - \left(\sum_{j=1}^{n} w_{1j}, \dots, \sum_{j=1}^{n} w_{nj}\right)^T = \mathbf{0} = 0 \cdot \mathbf{1}.$$

Hence, 0 is an eigenvalue of $L$ with $\mathbf{1}$ as the corresponding eigenvector.
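Continuing the numerical check above, the same $L$ indeed sends the constant vector to zero:

```python
# L @ 1 = 0, so 1 is an eigenvector of L with eigenvalue 0.
assert np.allclose(L @ np.ones(10), 0.0)
```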

Property 3: Eigenvectors of $L$ corresponding to distinct eigenvalues are orthogonal.

Proof: Let $v_1$ and $v_2$ be eigenvectors of $L$ corresponding to distinct eigenvalues $\lambda_1 \ne \lambda_2$. We have

$$\lambda_1 v_1^T v_2 = (\lambda_1 v_1)^T v_2 = (L v_1)^T v_2 = v_1^T L^T v_2 = v_1^T (L v_2) = v_1^T (\lambda_2 v_2) = \lambda_2 v_1^T v_2 \implies v_1^T v_2 = 0,$$

where $L^T = L$ since $L$ is symmetric, and the final implication holds because $\lambda_1 \ne \lambda_2$.

Property 4: $L$ has $n$ non-negative, real eigenvalues.

Proof: Since $L$ is symmetric, all its eigenvalues are real; since $L$ is positive semi-definite, they must also be non-negative. The smallest eigenvalue is 0, as shown in Property 2.

Some interesting results arise from the relationships between the graph and the spectrum of the Laplacian. But before that, let us define a few terms.

The geometric multiplicity of an eigenvalue $\lambda$ of a matrix $A$ is the maximum number of linearly independent eigenvectors corresponding to $\lambda$, i.e., it is the dimension of the null space of $A - \lambda I$.

The algebraic multiplicity of an eigenvalue $\lambda$ of a matrix $A$ is the number of times it occurs as a root of the characteristic equation $\det(A - \lambda I) = 0$.

As such, the geometric multiplicity cannot be greater than the algebraic multiplicity (we skip the proof). For example, the matrix $\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ has eigenvalue 1 with algebraic multiplicity 2 but geometric multiplicity 1.

Spectrum of the Laplacian

Result 1: The algebraic multiplicity of the eigenvalue 0 of $L$ is equal to the number of connected components of the graph $G$.

Proof: First, suppose we have $k$ connected components $S_1, \dots, S_k$. Define $k$ vectors $u_1, \dots, u_k \in \{0, 1\}^n$ such that $u_i$ contains 1 in the indices corresponding to the nodes in component $S_i$. Since all components are disjoint, $u_i^T u_j = 0$ for all $i \ne j$, i.e., the vectors are orthogonal. Also, it is easy to verify that $L u_i = 0$ for all $i$. Hence, there are at least $k$ orthogonal eigenvectors corresponding to eigenvalue 0. This means that the geometric multiplicity of 0 is at least $k$, and consequently, the algebraic multiplicity of 0 is at least $k$.

Now consider some eigenvector $v$ corresponding to eigenvalue 0. We have

$$L v = 0 \implies v^T L v = 0 \implies \sum_{i,j} w_{ij}(v_i - v_j)^2 = 0 \implies \sum_{(i,j) \in E} (v_i - v_j)^2 = 0,$$

since for the pairs $(i, j)$ which are not edges, the weight is 0. This means that $v$ takes the same value on all nodes within a connected component. Hence, we can find scalars $\alpha_1, \dots, \alpha_k$ such that

$$v = \sum_{i=1}^{k} \alpha_i u_i,$$

where the $u_i$'s are as defined earlier. This means that all eigenvectors corresponding to 0 lie in the subspace spanned by the $u_i$'s, so there are at most $k$ orthogonal eigenvectors, i.e., the geometric multiplicity of 0 is at most $k$. Since $L$ is symmetric, its algebraic and geometric multiplicities coincide, so the algebraic multiplicity of 0 is also at most $k$.
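A small sketch illustrating this result on a toy graph with two connected components (the graph itself is made up for illustration):

```python
import numpy as np

# Two disconnected components: a path on {0, 1, 2} and an edge on {3, 4}.
W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 1.0
W[1, 2] = W[2, 1] = 1.0
W[3, 4] = W[4, 3] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W

eigvals = np.linalg.eigvalsh(L)              # real eigenvalues, sorted ascending
num_zero = int(np.sum(np.isclose(eigvals, 0.0)))
print(num_zero)                              # 2, one zero eigenvalue per component
```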

Result 2: Thresholding the Fiedler vector of $L$ gives an approximation to the sparsest cut of the graph.

The Fiedler vector is the eigenvector associated with the second smallest eigenvalue of $L$.
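In practice, a common use of this result is to bipartition a connected graph by the sign of the Fiedler vector. Here is a sketch; the sign-based split is a standard heuristic, not something derived above:

```python
import numpy as np

def fiedler_bipartition(W):
    """Split the nodes of a connected graph by the sign of the Fiedler vector."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the second smallest eigenvalue
    return fiedler >= 0                    # boolean cluster assignment per node
```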

Spectral clustering using k-means

In these methods, we first compute the Laplacian $L$, obtain its first $k$ eigenvectors (those corresponding to the $k$ smallest eigenvalues) as the matrix $U \in \mathbb{R}^{n \times k}$, and then cluster the $n$ rows of the matrix $U$ using k-means clustering. The idea is that we use the spectrum of $L$ to convert the original data points into $k$-dimensional representations that are well separated.
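A minimal sketch of this procedure, assuming NumPy for the eigendecomposition and scikit-learn's KMeans for the final clustering step (tooling choices of mine, not prescribed by the text):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster the nodes of a similarity graph with adjacency matrix W into k groups."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Eigenvectors corresponding to the k smallest eigenvalues of L.
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                     # n x k spectral embedding
    # Each row of U is the new k-dimensional representation of a node.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```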

Spectral clustering using convex optimization

Another method, proposed in this paper, presents a more mathematically robust approach to multi-class spectral clustering. The idea is to represent the graph partitioning problem as a discrete optimization problem. Suppose we have estimated the number of clusters as $K$. Then, we can write the clustering problem as

$$\begin{aligned} \text{maximize} \quad & \epsilon(X) = \frac{1}{K} \sum_{k=1}^{K} \frac{X_k^T A X_k}{X_k^T D X_k} \\ \text{subject to} \quad & X \in \{0, 1\}^{N \times K}, \; X 1_K = 1_N. \end{aligned}$$

Here, the objective function tries to maximize the average "link-ratio" in the graph, and the constraint enforces that each item is assigned to exactly one cluster. Unfortunately, these discrete constraints make the problem NP-hard, so instead of solving the discrete version of the problem, we relax the constraints to make it continuous.
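Before moving to the relaxation, here is a small sketch of the discrete objective $\epsilon(X)$ for a given binary assignment matrix (an illustrative helper of mine, not code from the paper):

```python
import numpy as np

def link_ratio_objective(X, A, D):
    """Average link-ratio over the K clusters encoded by a binary N x K matrix X."""
    K = X.shape[1]
    total = 0.0
    for k in range(K):
        x_k = X[:, k]                      # indicator vector of cluster k
        total += (x_k @ A @ x_k) / (x_k @ D @ x_k)
    return total / K
```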

We also rewrite the relaxed problem as follows. Let $Z = f(X) = X (X^T D X)^{-\frac{1}{2}}$ (note that $f$ is invertible). We can verify that $Z^T D Z = I_K$. Using this, we can simplify the objective as

$$\begin{aligned} \text{maximize} \quad & \epsilon(Z) = \frac{1}{K} \operatorname{tr}(Z^T A Z) \\ \text{subject to} \quad & Z^T D Z = I_K. \end{aligned}$$

Let $P = D^{-1} A$. Since $P$ is diagonalizable, we can write its eigen-decomposition as $P = V S V^{-1}$, where $V$ is the matrix of eigenvectors and $S$ is the diagonal matrix of eigenvalues. Let $Z$ contain the $K$ eigenvectors of $P$ corresponding to its $K$ largest eigenvalues, and let $\Lambda$ be the corresponding $K \times K$ sub-matrix of $S$. Then, the solution set for the above problem is given as

$$\{Z R : R^T R = I_K, \; P Z = Z \Lambda\},$$

i.e., it is the subspace spanned by the top $K$ eigenvectors of $P$, acted on by orthogonal matrices $R$. The matrix $Z$ is a continuous solution to our clustering problem. We now need to discretize it to obtain an $X$ which obeys the discrete constraints. Suppose $\tilde{X} = f^{-1}(Z)$. Then our new goal is to solve the following optimization problem:

$$\begin{aligned} \text{minimize} \quad & \phi(X, R) = \|X - \tilde{X} R\|^2 \\ \text{subject to} \quad & X \in \{0, 1\}^{N \times K}, \; X 1_K = 1_N, \; R^T R = I_K. \end{aligned}$$

This means that we are trying to find a discrete $X$ which is closest to one of the solutions in the solution set described above. This problem is again difficult to solve jointly in $X$ and $R$, so we optimize it alternately. Given a fixed $R$, the optimal $X$ is obtained by applying non-maximal suppression to the rows of $\tilde{X} R$, i.e., setting the largest entry in each row to 1 and the rest to 0. Then, given a fixed $X$, the optimal $R$ is given by

$$R^* = \tilde{U} U^T,$$

where $X^T \tilde{X} = U \Sigma \tilde{U}^T$ is a singular value decomposition. We alternate between the two steps until convergence and finally return $X$ as the clustering assignment matrix.
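A sketch of this alternating discretization loop, following the description above (the variable names, the initialization of $R$, and the convergence test are my own choices; the rotation update uses the SVD of $X^T \tilde{X}$ as stated):

```python
import numpy as np

def discretize(X_tilde, num_iters=100, tol=1e-8):
    """Alternately solve for the binary X and the rotation R closest to X_tilde."""
    N, K = X_tilde.shape
    R = np.eye(K)                                 # start from the identity rotation
    prev_obj = np.inf
    for _ in range(num_iters):
        # Fix R: non-maximal suppression on the rows of X_tilde @ R.
        scores = X_tilde @ R
        X = np.zeros((N, K))
        X[np.arange(N), scores.argmax(axis=1)] = 1.0
        # Fix X: optimal rotation R* = U_tilde U^T, where X^T X_tilde = U Sigma U_tilde^T.
        U, sigma, Ut_tilde = np.linalg.svd(X.T @ X_tilde)
        R = Ut_tilde.T @ U.T
        # Track phi(X, R) and stop once it no longer improves.
        obj = np.linalg.norm(X - X_tilde @ R) ** 2
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return X
```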