CORPUS


THE SEMANTIC MAP OF GPT-2

This map visualises how GPT-2, a 124M-parameter language model, represents its vocabulary. Every word is stored as a vector of 768 numbers, and words used in similar contexts end up pointing in similar directions. This is what that structure looks like.

— Words · 768 Dimensions · — Clusters · 20 Neighbours / Word

The raw data
The starting point is a single matrix of 50,257 × 768 floating-point numbers: the wte weight tensor from GPT-2's open weights. Here is what a single word looks like:

Embedding vector for "happy" (24 of 768 dimensions shown)
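Extracting that tensor can be sketched as follows. This assumes the Hugging Face transformers package, which the text does not name; it exposes the matrix as model.wte.weight:

```python
# Sketch: pull the wte token-embedding matrix from GPT-2's open weights.
# Assumes the Hugging Face `transformers` package (not named in the text).
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

wte = model.wte.weight.detach().numpy()  # shape (50257, 768)

# GPT-2's BPE vocabulary stores most standalone words with a leading
# space, so " happy" is looked up rather than "happy".
token_id = tokenizer.encode(" happy")[0]
vector = wte[token_id]                   # the 768 numbers for "happy"
print(vector[:24])                       # first 24 of 768 dimensions
```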

We filter to ~30,000 alphabetic English words, compute cosine similarity between every pair in the full 768-dimensional space, and connect each word to its 20 nearest neighbours.
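The filter-and-neighbour step can be sketched with plain NumPy; the small random matrix below stands in for the real ~30,000 × 768 slice:

```python
import numpy as np

# Stand-in for the filtered embedding slice (real shape: ~30,000 x 768).
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 768))

# Normalising each row to unit length makes a plain dot product equal
# cosine similarity.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T                      # all-pairs cosine similarity

# Exclude self-similarity, then take each word's 20 nearest neighbours.
np.fill_diagonal(sim, -np.inf)
neighbours = np.argsort(-sim, axis=1)[:, :20]
print(neighbours.shape)                  # (100, 20)
```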

Projection
UMAP (n_neighbors=15, min_dist=0.1, cosine metric) compresses 768 dimensions into 2D and 3D. Nearby points are likely close in the original space, but the reverse isn't guaranteed: some truly similar words get separated by the projection. t-SNE (perplexity=30) offers a second projection to compare.
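A minimal sketch of the companion t-SNE projection, assuming scikit-learn (the text does not name the library); the primary layout is produced the same way with umap-learn's UMAP(n_neighbors=15, min_dist=0.1, metric="cosine"):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data; the real input is the ~30,000 x 768 embedding slice.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))

# perplexity=30 matches the text; metric="cosine" matches the space
# the neighbour graph was built in.
xy = TSNE(n_components=2, perplexity=30, metric="cosine",
          init="random", random_state=0).fit_transform(emb)
print(xy.shape)  # (200, 2)
```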

Clusters
The Louvain algorithm detects communities in the high-dimensional neighbour graph, not in the 2D layout. This is why you sometimes see mixed colours in a region: the cluster structure reflects the true 768D geometry, while the map is an approximation.
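The community step can be sketched with NetworkX's Louvain implementation (louvain_communities, NetworkX ≥ 2.8), run here on a toy nearest-neighbour graph; the real map's edges carry cosine-similarity weights:

```python
import numpy as np
import networkx as nx

# Toy embeddings; the real graph is built from the 768D neighbour lists.
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 16))
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, -np.inf)

# Connect each node to its 5 nearest neighbours (20 in the real map).
G = nx.Graph()
G.add_nodes_from(range(len(emb)))
for i in range(len(emb)):
    for j in np.argsort(-sim[i])[:5]:
        G.add_edge(i, int(j))

# Louvain partitions this graph, never the 2D coordinates.
communities = nx.community.louvain_communities(G, seed=0)
print(len(communities))
```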

Interacting
Hover a word to see its neighbours and edges. Click or search to lock the selection. Click a cluster label to highlight its members. Switch to Explore for a searchlight that reveals what's in each region. Toggle 3D to see the UMAP 3D projection. Press Escape or × to clear.

Built with D3.js and Three.js. Graph analysis via NetworkX. No human labelled any of the clusters or positions; everything emerges from the embedding matrix.

Model weights ↗  ·  Source code ↗