
After reading about Michael Nielsen’s discovery fiction, I found the idea quite interesting and intriguing. When we read a published work, we see only the result, with no clue about the journey and failures that led to the final artifact; we can only guess and imagine the process the author or researcher went through. Such imaginative work is what Nielsen calls “Discovery Fiction.” But what about the case where the authors themselves provide the story instead of letting the reader guess? This is an attempt at doing so, which I call “Discovery Narrative.” I’m avoiding words like “story” or “journal” because the path that leads to a result is complex, sometimes unconscious, and often filled with luck and coincidence the author may not even be aware of. Their “story” is rarely an objective portrayal and more a record of what they think is relevant in hindsight, hence “narrative.”

In this post, I’m using my research on spectral clustering and graph partitioning as an example. The journey started, in my mind, at the end of my master’s thesis, which I spent as an intern at RIKEN AIP in Tokyo exploring deep reinforcement learning. At the time, PPO and TRPO were the talk of the town, and I was implementing the various algorithms and tweaks on toy environments. As has become well known since, RL is brittle and had a huge reproducibility problem: even the random seed can turn a successful run into a catastrophic one. What caught my attention back then was hierarchical RL (HRL), a close relative of what is these days called Mixture-of-Experts (MoE), although at the time MoE was a multi-armed bandit concept, not an RL one. I was building on Takayuki Osa’s HRL, but I wanted to apply it to TRPO to create an HTRPO. I used a simple four-room environment with sparse rewards, where the sparsity could be controlled by varying the resolution of the grid. I got some good results on my toy environment by the end of the internship, but I did not enjoy all the manual tweaking and hand-design of the hierarchy. The question was: how do we automatically set the number of experts and let them learn? Enter EigenOption discovery: use a basis of policies, which turn out to be eigenvectors of a Laplacian. Sugiyama-sensei had also explored this via the usual transition-matrix Laplacian; the EigenOption approach, although similar, uses its own construction. But it only worked in the tabular case; even when a neural network was used, the environment had a very limited number of states. The question then: how do we scale this to neural networks? Make a neural network learn eigenvectors, and the Laplacian we construct essentially produces a clustering of the state space into different domains, so that each policy specializes in a specific domain.

SpectralNets: ratio cut, looks nice, but only works on really clean datasets like MNIST. For other datasets, we need a pretrained encoder; even then, changing the dataset yields bad performance, because the encoder itself was brittle (mixture-of-Gaussians prior...). For the first two years of my PhD, I read everything I could find about random matrices, large matrices, online spectral clustering, and online eigendecomposition. I had this false conviction that I just needed to try hard enough to get it to work. But math does not care about one’s beliefs; some things are just impossible and mathematically incorrect. One can construct a 2x2 matrix whose eigendecomposition is highly sensitive to noise, so for a neural network to actually learn eigenvectors in a noisy setting, the vectors need to be “distinguishable” under noise. That is, the similarity used to construct the Laplacian has to be really good. Even if it is, the gradient approach is slow and, beyond a certain point, mixes up the eigenvectors. And while one can use an encoder to get nice embeddings to work with, if you already have nice embeddings you can simply run K-Means on them and get very good performance.

So, as I was visiting INRIA Paris, I decided to abandon spectral decomposition entirely. The method should be agnostic to the encoder, and clustering should be faithful to the embedding: perfect similarity should yield perfect clustering, and vice versa.

Where does spectral decomposition appear in “spectral clustering”? It comes from relaxing the objective: binary assignment vectors are replaced by unit-norm real vectors, which turns the objective into a Rayleigh quotient minimization. What is a better alternative to this substitution? Probabilistic assignment, which is actually more natural: instead of binary vectors, you move to elements of the simplex, whose vertices are exactly the binary assignments. But that’s easier said than done. Once the equations were written down, however, the two years of failed attempts had built some muscle memory that made subsequent work quite smooth. Ideas were flowing non-stop, and by the end of that summer, the concept of “Probabilistic Graph Cuts” had become very clear. The resulting algorithm is not revolutionary, but it is novel, theoretically solid, competitive with existing approaches, and able to explain many of the “ad hoc” contrastive and representation learning algorithms that usually give off the vibe of “we did this and it worked, but there is no formal proof why!”
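One way to see the two relaxations side by side (standard spectral clustering notation; the expected-cut line is my sketch of the probabilistic substitution, not necessarily the paper’s exact objective):

```latex
% Discrete ratio cut with indicator matrix H (L = D - W, the graph Laplacian):
\min_{A_1,\dots,A_K} \sum_{k=1}^{K} \frac{\operatorname{cut}(A_k,\bar A_k)}{|A_k|}
  = \min_{H \text{ discrete}} \operatorname{tr}\!\left(H^\top L H\right),
  \qquad H_{ik} = \begin{cases} 1/\sqrt{|A_k|} & i \in A_k \\ 0 & \text{otherwise.} \end{cases}

% Spectral relaxation: keep only orthonormality; the minimizer is the
% bottom K eigenvectors of L (a Rayleigh-quotient minimization):
\min_{H^\top H = I} \operatorname{tr}\!\left(H^\top L H\right)

% Probabilistic substitution: each row of P lies on the simplex,
% p_{ik} = \Pr(i \in A_k); under independent assignments,
\mathbb{E}\!\left[\operatorname{cut}(A_k,\bar A_k)\right]
  = \sum_{i \neq j} w_{ij}\, p_{ik}\,\bigl(1 - p_{jk}\bigr).
```

The binary case is recovered exactly when each row of P sits at a vertex of the simplex, which is why the probabilistic form is a relaxation rather than a different problem.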

That led to the first paper on Probabilistic Ratio Cut, which served more as a foundation to show that we can handle the ratio cut, and also to sidestep the challenge of estimating how much a data point is connected to the dataset, a hard thing to do with stochastic gradients or sampled batches.

Once that was published, I was already working on extending the method to other types of graph cuts, aiming for a theoretical foundation that is versatile: handle a wider range of cuts, get tighter bounds, and provide concentration bounds to identify the conditions under which the upper bound is a good or bad estimator. Well, the reviewers did not like it. They were not happy with the theoretical emphasis and wanted to see numbers. My response was that we were providing a unifying framework that can explain the success and limitations of many existing and well-established methods. Their success is a testament that such an algorithm works.

But then, I needed to get the paper through. So let’s code! I had already been running experiments before the reviewers responded, but the issue was that the bound and the objective use functions that have no GPU implementation. I did a band-aid fix by manually implementing a PyTorch autograd function that simply offloads the computation to the CPU, but that was painfully slow. Luckily, during the same period I was exploring sparsity on GPUs, learning CUDA, Triton, and some CuTe layout algebra, which helped tremendously.

The core function in the bound is the Gauss hypergeometric function 2F1(-m, b; c; z) where a = -m is a non-positive integer, so it’s a finite polynomial of degree m in z. Sounds easy. It’s not.

The first GPU implementation was the naive one: just evaluate the polynomial term by term, one CUDA thread per element of z. Works for small m, say m < 50. But for m = 512 or larger, the Pochhammer symbols grow combinatorially, and even in float64 the intermediate products overflow before the alternating signs cancel. NaN city. SciPy itself has this problem for m >= 16384 near z = 1.
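What the naive kernel does per element can be sketched on the CPU: each term of the terminating series is the previous term times a simple rational factor (function name is mine, not from the codebase):

```python
from scipy.special import hyp2f1

def hyp2f1_direct(m, b, c, z):
    """Direct series for 2F1(-m, b; c; z), a terminating polynomial of degree m.
    Recurrence: term_{k+1} = term_k * (-m + k)(b + k) / ((c + k)(k + 1)) * z.
    Fine for small m; for large m near z = 1 the alternating terms grow huge
    before cancelling, which is exactly the float64 blow-up described above."""
    total, term = 1.0, 1.0
    for k in range(m):
        term *= (-m + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
    return total
```

For modest m this agrees with SciPy’s `hyp2f1(-m, b, c, z)` to near machine precision; the instability only bites once m is large and z approaches 1.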

Then I found the Euler transformation in Abramowitz & Stegun (15.3.3): 2F1(a, b; c; z) = (1-z)^{c-a-b} * 2F1(c-a, c-b; c; z). Near z = 1, where the direct alternating series suffers catastrophic cancellation, the transformed series is far better behaved; for a = -m, the coefficients are much tamer. Two CUDA kernels came out of this: one that auto-selects between the direct series and the transformed one based on z > 0.5, and a branchless one that always uses the transformation to avoid warp divergence. Stable up to very large m, matching SciPy to ~11 digits. But they’re sequential: one thread per element, iterating m times. Not great.
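The A&S 15.3.3 identity is easy to sanity-check numerically against SciPy’s reference implementation (parameter values here are arbitrary illustrations):

```python
from scipy.special import hyp2f1

# 2F1(a,b;c;z) = (1-z)^(c-a-b) * 2F1(c-a, c-b; c; z)   [A&S 15.3.3]
# For a = -m the left side is a terminating, wildly cancelling polynomial;
# the right side is an infinite but convergent series with tamer terms.
m, b, c, z = 8, 2.5, 4.0, 0.9
lhs = hyp2f1(-m, b, c, z)
rhs = (1 - z) ** (c + m - b) * hyp2f1(c + m, c - b, c, z)
assert abs(lhs - rhs) <= 1e-9 * max(1.0, abs(lhs))
```

Both sides agree to roughly double precision here, even though the left side loses many digits to cancellation at z = 0.9.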

The series has a natural tree-reduction structure because each term is just the previous term times a ratio. Instead of one thread doing m sequential multiplies, I used 8 threads per element, each handling a strided subset of terms, then reducing via a parallel prefix scan. One variant precomputes the ratio array once and reuses it. This gave about a 4x speedup over the sequential version. But CUDA kernels are NVIDIA-only and need JIT compilation on first use.
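The scan formulation is easiest to see in NumPy: precompute the term ratios once, take a prefix product (the scan step), and sum. This is a CPU sketch of the structure the GPU kernels exploit, not the kernels themselves:

```python
import numpy as np

def hyp2f1_scan(m, b, c, z):
    """2F1(-m, b; c; z) via prefix products: term_k = prod_{j<=k} ratio_j.
    Multiplication is associative, so on a GPU the cumprod becomes a
    parallel scan across threads instead of m sequential multiplies."""
    k = np.arange(m, dtype=np.float64)
    ratios = (-m + k) * (b + k) / ((c + k) * (k + 1.0)) * z
    terms = np.cumprod(ratios)   # prefix product = the scan step
    return 1.0 + terms.sum()
```

Precomputing `ratios` once and reusing it is exactly the variant mentioned above; the scan then touches each ratio O(log m) times instead of walking the chain serially.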

For portability, I wrote a Triton kernel using a 32-lane associative scan. Triton’s tl.associative_scan maps naturally to the prefix-product structure. Portable across GPU vendors, no C++ compilation step. Performance within 15% of the CUDA kernels. Good enough, and the portability wins. The autograd wrapper defaults to Triton for this reason.

The whole journey from CPU offload to Triton autograd was roughly an 80x speedup. That’s the difference between “theoretically interesting but impractical” and actually being able to iterate on experiments.

With the kernel working, I could finally run things at scale. But a new problem showed up immediately: cluster collapse.

The original PRCut has an analytical gradient with an implicit balance term that vanishes as clusters collapse, creating a positive feedback loop. Small cluster, weak balance gradient, cluster shrinks further, empty cluster. With a normal learning rate, PRCut collapses on every dataset with more than a handful of clusters. The workaround in the original paper: use a very small learning rate and hope the collapse is slow enough. It kind of works on simple datasets but fails badly on anything with many classes.

I tried entropy regularization first: add a negative entropy term to the loss. Helps, but the optimal weight is dataset-dependent. Too strong and you get uniform assignments, too weak and you get collapse. Loss weighting was the same problem with more hyperparameters. Projecting onto the simplex was expensive and didn’t help with the gradient dynamics.

What worked was embarrassingly simple. The cut gradient and the balance gradient have completely different magnitudes and directions. Adding them with fixed weights means one always dominates. The fix: normalize each gradient independently before summing. Both objectives contribute equally in gradient direction, regardless of their magnitudes. I implemented it using backward hooks to capture and rescale gradients within context managers. The collapse problem vanished completely.
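A minimal sketch of the idea in PyTorch, using `torch.autograd.grad` instead of the backward-hook-and-context-manager machinery the actual implementation uses (function and variable names are mine, and the two losses below are stand-ins, not the real cut and balance terms):

```python
import torch

def normalized_grad_step(params, cut_loss, balance_loss, optimizer, eps=1e-12):
    """Normalize each objective's gradient to unit norm before summing,
    so neither the cut term nor the balance term can dominate the update."""
    g_cut = torch.autograd.grad(cut_loss, params, retain_graph=True)
    g_bal = torch.autograd.grad(balance_loss, params)
    for p, gc, gb in zip(params, g_cut, g_bal):
        p.grad = gc / (gc.norm() + eps) + gb / (gb.norm() + eps)
    optimizer.step()
    optimizer.zero_grad()

# Usage with placeholder losses:
model = torch.nn.Linear(8, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(16, 8)
out = model(x).softmax(dim=1)
cut_loss = out.var()                              # stand-in for the cut term
balance_loss = (out.mean(0) - 0.25).pow(2).sum()  # stand-in for the balance term
normalized_grad_step(list(model.parameters()), cut_loss, balance_loss, opt)
```

The hook-based version in the paper’s code achieves the same effect without a second backward pass, but this two-pass form is the shortest way to state the trick.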

When I added spectral clustering as a baseline to the tables, I expected it to be clearly worse on clustering accuracy. It was. But it was also clearly better on the raw cut values (RCut, NCut). Confusing at first: how can lower cut values come with lower clustering accuracy? Spectral clustering is non-parametric. It directly solves the graph Laplacian eigenvalue problem, the relaxation of the discrete cut, so it finds the lowest-cut partition of that specific graph. But that partition overfits to noisy edges: a single edge between two nodes from different classes is enough to pull the boundary through it. Our parametric methods (linear model + softmax) can’t represent arbitrary graph partitions, and that implicit regularization prevents overfitting to noisy edges. Higher accuracy, even though the cut values are worse. A lower cut does not mean better clustering. The parametric bottleneck is a feature, not a bug.

I also tried building the similarity graph on-the-fly within each batch instead of precomputing a global KNN graph. Sample a batch of nodes, build a KNN graph among them, compute the cut. No precomputation. The standard pipeline requires quadratic compute for distances and large storage, which for ImageNet is prohibitive. Results were surprisingly competitive. On some datasets, online even won over offline because the global graph can be noisy and the batch-local one is cleaner. For ImageNet, it ran but slowly. Not state-of-the-art, but it ran. No precomputed graph, no quadratic memory, just batched GPU operations.
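The batch-local construction can be sketched in NumPy (the real version runs on GPU, e.g. via `torch.cdist`; names are illustrative):

```python
import numpy as np

def batch_knn_adjacency(X, k=10):
    """Symmetric KNN adjacency built inside a sampled batch only:
    no precomputed global graph, no quadratic memory over the full dataset."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-edges
    nn = np.argpartition(d2, k, axis=1)[:, :k]           # k nearest per node
    n = len(X)
    A = np.zeros((n, n))
    A[np.arange(n)[:, None], nn] = 1.0
    return np.maximum(A, A.T)                            # symmetrize

rng = np.random.default_rng(0)
batch = rng.normal(size=(64, 16))
A = batch_knn_adjacency(batch, k=5)
```

The quadratic cost here is over the batch, not the dataset, which is what makes the ImageNet run feasible at all.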

Looking back, there are things I’d do differently. I’d start with the GPU kernel, not the theory. The theory was ready months before I had a practical implementation. If I’d built the Triton kernel first, I could have run experiments alongside the theory development. I wouldn’t fight the reviewers. The first submission was theory-heavy with minimal experiments. Reviewers wanted numbers. In hindsight, they were right: the experiments revealed things like gradient mixing and the spectral clustering comparison that actually improved the paper’s contribution. I’d also focus on graph quality earlier. I came up with a simple metric that measures how much class structure the KNN graph captures, and it turned out to be an almost perfect predictor of downstream clustering accuracy. It immediately explains why some datasets are easy and others are hard, independent of the algorithm. Should have been my first analysis, not my last. And I’d stop trying nonlinear models sooner. I spent weeks on MLPs and attention mechanisms. The linear model consistently matched or beat them. The graph does the heavy lifting; the model just needs to not overfit.
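For concreteness, one simple metric of that kind (illustrative only; the post doesn’t spell out the exact metric used) is the fraction of KNN edges that join same-class nodes:

```python
import numpy as np

def knn_label_agreement(A, y):
    """Fraction of graph edges connecting same-class nodes.
    A: symmetric 0/1 adjacency, y: integer class labels.
    Near 1, the graph respects class structure; near chance level,
    no clustering algorithm on top of it can do well."""
    ii, jj = np.nonzero(np.triu(A, k=1))  # count each undirected edge once
    return float((y[ii] == y[jj]).mean())
```

Any metric along these lines is a property of the graph alone, which is why it can predict downstream accuracy independently of the clustering algorithm.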