The Curse of Dimensionality is the name for a set of well-known mathematical results. One of them is that in high-dimensional spaces, any two randomly chosen vectors are very likely to be nearly orthogonal (at right angles to each other), and in fact you can pack in a huge number of nearly-orthogonal vectors, far more than you could in 2d or 3d. In many domains, like modern AI and language models, it’s more of a feature than a curse.
Motivation
I was watching this excellent video by Grant Sanderson of 3blue1brown the other week, and he mentioned how vector embeddings, where each token is represented by a high-dimensional vector, can in some cases be summed to produce a new meaning. The implication is that certain vectors (= directions in that space) represent concepts, and summing them produces a combination of those concepts, like “ruler” + “female” = “queen” (this is all very approximate, but it bears up under examination; check out Concept Activation Vectors for one example). So I started to wonder: how orthogonal can these vectors really be?
In three dimensions, once you have three mutually orthogonal vectors, a 4th will not be even close to orthogonal to the others; there’s no more room. Best case, it’ll end up around 55° from each of them (the angle a cube’s diagonal makes with its edges). Same for 2d; there’s no place to put a third vector at right angles to the first two (the x and y axes). But it turns out that as you increase the number of dimensions, and only ask the vectors to be very nearly orthogonal rather than exactly orthogonal, it apparently becomes much easier to find room for the extra vectors. You can use the dot product to measure that: the dot product of two unit vectors is cos θ, the cosine of the angle between them, so nearly orthogonal means a dot product near zero. So I thought: if you have a few thousand dimensions, maybe you can pack in more than a few thousand vectors, but if you try to pack in twice as many or 10x as many, some of them will surely end up in nearly the same direction, i.e. with a small angle and a big dot product. Let’s try it!
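Here’s a quick numpy sketch of that intuition (not part of the real experiment; the dimensions are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(n, d):
    # Gaussian coordinates, then normalize: uniform directions on the unit sphere.
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for d in (3, 12_000):
    a, b = random_unit_vectors(2, d)
    dot = a @ b  # for unit vectors this equals cos(theta)
    print(f"d={d:>6}  dot={dot:+.4f}  angle={np.degrees(np.arccos(dot)):6.1f} deg")
```

In 3d the two vectors land at some fairly arbitrary angle; in 12,000 dimensions the dot product is tiny and the angle is almost exactly 90°.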
Implementation
There are well-known mathematical proofs of this, but I really wanted to get a feel for it myself, so I thought: why not just code it up in python? Create a million random 12,000-dimensional unit vectors on the unit hypersphere, compute the dot product of each one against all the others, and see what actually comes out.
As I often do these days, rather than code it myself I had Claude do the first pass to save time. I’m not afraid to use AI for these kinds of jobs; I treat it like an intern that never gets tired and never gets upset when it fails (which it sometimes does, spectacularly). I pretty much gave it the spec in the previous paragraph, and within 5 minutes I had some nice runnable numpy/python code. Another 5 minutes and I had the final version with batching, memory usage tracking and nice statistics and plotting.
As a bonus, Claude already knew how to make uniformly distributed vectors on the surface of a hypersphere: generate each coordinate using a normal distribution (not uniform!), which by the spherical symmetry of the Gaussian gives a uniformly random direction (and, per the Gaussian Annulus theorem, a length concentrated near √d), then normalize as usual to land exactly on the unit hypersphere. (For gory details, check something like p. 23, theorem 2.9 of https://ttic.uchicago.edu/~avrim/MLT18/BHK2017book.pdf ).
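This isn’t the script Claude produced, just a scaled-down sketch of the same approach. The sizes are placeholders (N = 10,000 vectors, and d = 12,288, GPT-3’s embedding width, standing in for the “roughly 12,000” above) so it fits in memory and finishes in a reasonable time:

```python
import numpy as np

rng = np.random.default_rng(42)

N, D = 10_000, 12_288   # placeholder sizes; scale N up as far as memory and patience allow
BATCH = 1_000           # pairwise dot products are computed in blocks to bound memory use

# Gaussian coordinates, then normalize: uniformly distributed points on the unit hypersphere.
vecs = rng.standard_normal((N, D)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

max_dot = 0.0
for i in range(0, N, BATCH):
    block = vecs[i:i + BATCH]
    # Dot products of this block against itself and every later vector.
    dots = block @ vecs[i:].T
    # Zero the diagonal (self-products) and the within-block duplicate pairs, keep the rest.
    dots = np.triu(dots, k=1)
    max_dot = max(max_dot, float(dots.max()))

angle = np.degrees(np.arccos(max_dot))
print(f"max dot product = {max_dot:.4f}  ->  closest pair is {angle:.2f} deg apart")
```

The real script also tracked memory usage and plotted the full distribution of dot products; this just reports the single largest one.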
Results
As you’d expect, of course, my results agree very closely with the theory. With as many vectors as fit in my Mac’s memory and took only as long as I was willing to wait (a million was too big to do all those dot products without a lot more work), and with d of roughly 12,000 (the dimensionality of ChatGPT’s token vectors), I get a maximum dot product of about 0.057, which is amazingly small: that’s an angle of 86.75°! They’re all nearly orthogonal to each other; that’s the angle between the two closest ones. All the other pairs have even smaller dot products, i.e. angles even closer to 90°.
The extreme value theory (approximate version) predicts a max of about √(2 ln(N(N−1)/2) / d), which I do not pretend to understand well, but which is very close to the number I got. One reference suggested a slightly different formula whose prediction was even closer to my numerical result.
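For what it’s worth, here’s my reading of that prediction (my interpretation, not necessarily the exact formula from the references), evaluated for a few example values of N:

```python
import numpy as np

def predicted_max_dot(n_vectors: int, d: int) -> float:
    # Each of the ~n^2/2 pairwise dot products of random unit vectors is roughly
    # N(0, 1/d); the expected maximum of that many draws is about sqrt(2*ln(pairs)/d).
    pairs = n_vectors * (n_vectors - 1) / 2
    return float(np.sqrt(2 * np.log(pairs) / d))

for n in (10_000, 100_000, 1_000_000):   # example vector counts
    p = predicted_max_dot(n, 12_288)
    print(f"N={n:>9}: predicted max dot ~ {p:.4f}, angle ~ {np.degrees(np.arccos(p)):.2f} deg")
```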
Thoughts
I’m glad the theory works out in practice! Not that I doubted it, but it’s a fun exercise to get an intuition for these high-dimensional vectors.
Why is this so interesting? One reason is that if you have a 12,000-dimensional space but can embed millions of nearly-orthogonal vectors into it, you can represent millions of distinct concepts in your language model, and they barely interfere with each other. So at least conceptually, you could dial the amount of one concept in a token up or down without affecting the others: essentially that single token (a 12k-dimensional vector) can represent a whole collection of distinct, independent concepts.
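As a toy illustration of that last point (everything here is invented: random directions standing in for learned concept vectors), you can dial one concept up or down and barely move the token’s alignment with another, precisely because the two directions are nearly orthogonal:

```python
import numpy as np

rng = np.random.default_rng(7)
D = 12_288  # stand-in for the ~12k embedding width discussed above

def unit(v):
    return v / np.linalg.norm(v)

# Pretend these are learned concept directions; random high-d vectors are nearly orthogonal.
royalty = unit(rng.standard_normal(D))
female = unit(rng.standard_normal(D))

# A made-up token embedding containing known amounts of each concept, plus unrelated noise.
token = 0.8 * royalty + 0.1 * female + 0.3 * unit(rng.standard_normal(D))

def amount(vec, concept):
    # Projection of vec onto a unit-length concept direction.
    return float(vec @ concept)

print("before:", amount(token, royalty), amount(token, female))
token += 0.7 * female   # dial up "female"
print("after: ", amount(token, royalty), amount(token, female))
# The "royalty" amount changes only by 0.7 * (female . royalty), which is tiny in 12k dimensions.
```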