This visualisation maps the learned vocabulary of GPT-2, a 124-million-parameter language model released by OpenAI in 2019. It shows ~30,000 tokens filtered from the model's full 50,257-token vocabulary: a snapshot of how the neural network mathematically represents words before it even begins to process a sentence.
The high-dimensional manifold
To a human, a word is defined by its dictionary entry. To a language model, a token is defined by its position in a high-dimensional vector space. In GPT-2, every token is stored as a vector of 768 numbers. This means "happy" is not an emotion, but a precise coordinate in 768-dimensional space.
The geometry of this space encodes meaning: tokens that appear in similar contexts across the training data (like "king" and "queen", or "Python" and "Java") end up with vectors pointing in similar directions. This happens as a byproduct of training: the model learns to predict the next word in a sentence, and in doing so it discovers that words used in similar ways need similar representations. The similarity between two tokens is measured by cosine similarity: the cosine of the angle between their vectors.
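As a toy illustration, cosine similarity is just a normalised dot product. The 3-dimensional vectors below are made-up stand-ins for real 768-dimensional embeddings, chosen so that two of them point in similar directions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(angle between a and b) = a.b / (|a| * |b|)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-D stand-ins for 768-D token vectors:
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
rinse = np.array([-0.5, 0.1, 0.9])

print(cosine_similarity(king, queen))  # close to 1: similar directions
print(cosine_similarity(king, rinse))  # negative: dissimilar directions
```

A value near 1 means the vectors point the same way, 0 means they are orthogonal, and negative values mean they point in opposing directions.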
The raw data
The starting point is a single matrix of numbers: 50,257 rows × 768 columns, downloaded from openai-community/gpt2 on HuggingFace. Each row is one BPE token. Each column is one learned embedding dimension. This is the wte (word token embedding) weight tensor: the model's learned representation of its vocabulary, stored as raw floating-point numbers. These are the input embeddings: the starting point for how the model represents each token before any transformer layers process it. Here is what a single word looks like:
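In code, looking up a single token is just indexing one row of that matrix. The sketch below uses a random placeholder matrix with the real shape so it runs anywhere; the commented lines show how the actual wte tensor could be pulled from HuggingFace via the transformers library (the token id is hypothetical):

```python
import numpy as np

# In practice, the real 50,257 x 768 matrix can be extracted like this
# (requires the `transformers` package and a model download):
#   from transformers import GPT2Model
#   wte = GPT2Model.from_pretrained("openai-community/gpt2").wte.weight.detach().numpy()

# Placeholder with the same shape, so the indexing logic is runnable:
rng = np.random.default_rng(0)
wte = rng.normal(size=(50257, 768)).astype(np.float32)

token_id = 3772  # hypothetical id; real ids come from GPT2Tokenizer
vector = wte[token_id]
print(vector.shape)  # (768,) -- one word, one coordinate in 768-D space
```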
Everything you see on this map (the clusters, the distances, the connections) is derived from that single matrix. No human labelled any of it.
From 768D to 2D and 3D
Because humans cannot perceive 768 dimensions, we use UMAP (Uniform Manifold Approximation and Projection) to compress this space into two or three dimensions. UMAP (configured here with n_neighbors=15, min_dist=0.1, cosine metric) is a non-linear dimensionality reduction algorithm that tries to preserve local neighbourhood structure: if two tokens are close in 768 dimensions, UMAP tries to keep them close in the projection.
This means that nearby points on the map are likely to be close in the original space. But the reverse is not guaranteed: some tokens that are far apart on the map may actually be close in 768D, because compressing 768 dimensions inevitably distorts some relationships. You can switch between UMAP, t-SNE (perplexity=30, which prioritises tight local clusters), and PCA (a linear projection preserving the directions of greatest variance) to see which structures are robust across methods and which are artefacts of a particular projection.
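A minimal sketch of the projection step, using random stand-in embeddings: the PCA baseline runs as-is with scikit-learn, and the commented lines show the UMAP call with the settings quoted above (assuming the umap-learn package):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 768)).astype(np.float32)  # stand-in embeddings

# Linear baseline: PCA keeps the directions of greatest variance.
xy_pca = PCA(n_components=2).fit_transform(emb)

# Non-linear alternative (requires `umap-learn`):
#   import umap
#   xy_umap = umap.UMAP(n_neighbors=15, min_dist=0.1,
#                       metric="cosine", n_components=2).fit_transform(emb)

print(xy_pca.shape)  # (1000, 2): one 2-D coordinate per token
```

Comparing the linear and non-linear projections side by side is exactly how you spot which clusters are robust structure and which are projection artefacts.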
The graph structure
This visualisation is more than a scatter plot: it is built on a k-Nearest Neighbour graph. For each of the ~30,000 tokens, we normalise the embedding vectors and compute cosine similarity (via batched dot products) against every other token, then connect it to its 8 most similar words. The resulting graph is stored as undirected: if word A has B as a neighbour, the edge exists in both directions.
Because the graph is built from each token's top-8, a word can end up connected to many more than 8 others if multiple tokens independently select it as a nearest neighbour. Common, polysemous words (like "time" or "system") tend to have high degree because their embeddings sit in central regions of the space where many different semantic threads converge.
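The construction above can be sketched in a few lines of NumPy. This is a scaled-down stand-in (200 random tokens in 16 dimensions rather than 30,000 in 768), but the mechanics (row normalisation, batched dot products, top-k selection, undirected edges) are the same:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16)).astype(np.float32)  # stand-in embeddings

# Normalise rows so a plain dot product equals cosine similarity.
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

k = 8
edges = set()
for start in range(0, len(emb), 64):            # batched dot products
    sims = emb[start:start + 64] @ emb.T        # (batch, 200) similarity rows
    for row, s in enumerate(sims):
        i = start + row
        s[i] = -np.inf                          # a token is not its own neighbour
        for j in np.argpartition(s, -k)[-k:]:   # indices of the k most similar
            edges.add((min(i, int(j)), max(i, int(j))))  # store undirected

degree = np.zeros(len(emb), dtype=int)
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
print(degree.min(), degree.max())
```

Every node has degree at least k (it picked 8 neighbours of its own), but the maximum degree is typically higher: some nodes are also picked by many others, which is exactly the hub effect described above.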
Clusters & colours
The colours represent communities detected by the Louvain algorithm, which finds groups of words that are more densely connected to each other than to the rest of the graph. The algorithm found 27 clusters; this number was not chosen in advance but emerged from the structure of the data. (Louvain optimises "modularity," a measure of how cleanly a network partitions into communities. 27 is the partition that maximises this score for GPT-2's embedding graph.)
We have not manually labelled these clusters. Each one is identified by its most representative words: those closest to the cluster's centroid. Exploring the map, you can find regions that loosely correspond to geography, politics, science, names, and other semantic fields, though many clusters blend semantic and syntactic properties in ways that don't map neatly onto human categories.
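To see how the cluster count emerges rather than being chosen, here is a toy NetworkX example: two dense 5-cliques joined by a single bridge edge stand in for the real 30,000-node embedding graph, and Louvain recovers the two communities without being told how many to find:

```python
import networkx as nx

# Toy graph: two 5-cliques connected by one bridge edge.
G = nx.Graph()
G.add_edges_from((i, j) for i in range(5) for j in range(i + 1, 5))
G.add_edges_from((i, j) for i in range(5, 10) for j in range(i + 1, 10))
G.add_edge(4, 5)  # the bridge

# Louvain maximises modularity; the number of communities is an output.
communities = nx.community.louvain_communities(G, seed=0)
print(len(communities))  # 2 here; 27 on the real embedding graph
```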
The orange band (Cluster 0): You'll notice a distinctive loop-like band of orange points consisting of ~8,590 tokens. This cluster contains a heterogeneous mix: many are sub-word BPE fragments that happen to be 3+ characters and purely alphabetic (like "eping", "initions", "abdom"), some are proper-name fragments (like "rodham", "bezos"), and some are standalone words that don't fit neatly into any semantic category (like "binge", "kinetic", "rinse"). They cluster together because their embeddings lack strong affinity to any coherent thematic region, so the Louvain algorithm groups them into this catch-all band.
Views
Constellation shows all tokens as a galaxy; zoom in to reveal labels. Explorer lets you search for a word and walk through its 2-hop neighbourhood. Sample shows a random subset with labels. Bridges highlights the tokens with the highest betweenness centrality: the words that connect different semantic regions.
Why GPT-2?
Two reasons: transparency and scale. GPT-2 is fully open-weights, meaning we can extract and inspect its raw embedding matrix. Many frontier models are proprietary: their weights are not publicly available, making this kind of analysis impossible.
There is also the question of dimensionality. GPT-2 operates in 768 dimensions. GPT-3 uses 12,288 dimensions. Larger models use even higher-dimensional embedding spaces. GPT-2 offers a transparent and tractable laboratory for studying the fundamental geometry of how language models represent vocabulary.
Why not all 50,257 tokens? GPT-2 uses Byte Pair Encoding (BPE), which breaks text into sub-word tokens. Many of its 50,257 tokens are punctuation, numbers, Unicode characters, or very short fragments (1–2 characters). We filtered to tokens that are purely alphabetic, ASCII-only, and between 3 and 20 characters long, with case-insensitive deduplication. The result is ~30,000 tokens. Some of these are recognisable English words; others are longer sub-word fragments that passed the filter. They're included because they are genuine entries in the model's learned vocabulary.
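A sketch of that filter in plain Python. The exact predicate is our assumption from the description above; the "Ġ" handling reflects GPT-2's BPE convention of marking a leading space with that character, and the tiny vocabulary here is invented for illustration:

```python
def keep(token: str) -> bool:
    word = token.lstrip("Ġ")  # GPT-2 BPE marks a leading space with 'Ġ'
    # Purely alphabetic, ASCII-only, 3-20 characters long.
    return word.isascii() and word.isalpha() and 3 <= len(word) <= 20

# Invented mini-vocabulary for illustration:
vocab = ["Ġhappy", "Ġthe", "ab", "123", "Happy", "Ġкот", "initions"]

seen, kept = set(), []
for tok in vocab:
    word = tok.lstrip("Ġ")
    if keep(tok) and word.lower() not in seen:  # case-insensitive dedup
        seen.add(word.lower())
        kept.append(word)

print(kept)  # ['happy', 'the', 'initions']
```

Note that "initions" survives: it is a genuine sub-word fragment in GPT-2's vocabulary, and the filter keeps it because it is alphabetic and long enough, exactly as described above.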
Built with D3.js and Three.js. Dimensionality reduction via UMAP, t-SNE, and PCA. Graph analysis via NetworkX.
Model weights on HuggingFace → · Source code on GitHub →