GloVe Explorer · Explainer

Word Embeddings, from scratch

How do you teach a computer what words mean? This page walks you through the full journey, from counting words in a table all the way to 300-dimensional meaning spaces.

01

What is an Embedding?

An embedding is a way of representing something — a word, a sentence, a document, an image — as a list of numbers called a vector. On its own that sounds unremarkable. The magic is in how those numbers are chosen.

A good embedding places similar things close together in that numerical space. If you embed every English word and look up the neighbors of "ocean", you should find "sea", "river", "coast" — not "Tuesday" or "mortgage". Meaning becomes measurable distance.

City-map analogy. Every location on Earth has exactly two numbers: latitude and longitude. Places that are geographically close have numbers that are numerically close. Word embeddings do the same for meaning — but with 50 to 300 numbers instead of 2, and semantic proximity instead of physical distance.

This page focuses on word embeddings (one vector per word), but the idea generalises: document embeddings, sentence embeddings and image embeddings follow the same principle.

02

The Simple Approach: Bag of Words

The most natural first attempt: count how often each word appears in each document and arrange those counts in a matrix whose rows represent documents and whose columns represent words. This is called a Document-Feature Matrix (DFM).

Each row is now a vector for that document. Documents that use similar words get similar vectors, so we can already measure something like "topical similarity", which is useful for search, recommendation, or clustering.
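To make the counting step concrete, here is a minimal sketch of building a DFM with the Python standard library. The function name `build_dfm`, the whitespace tokenizer and the three example sentences are assumptions made for illustration; they are not taken from the Explorer's own code.

```python
from collections import Counter

def build_dfm(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

# Made-up example documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell on monday",
]
vocab, dfm = build_dfm(docs)
print(vocab)
for row in dfm:
    print(row)
# Rows for the first two documents share most non-zero columns;
# the third overlaps with them only on "on".
```

The rows can be compared directly: the more columns two documents share, the more similar their vectors.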

[Interactive: Build a DFM. Edit the sentences for Doc A, Doc B and Doc C and watch the word-count matrix update live; non-zero cells are highlighted.]

This already works surprisingly well for document-level comparisons. Netflix-style recommendation ("if you liked this show, you might like that one") builds on a similar idea, applied to viewing histories instead of word counts.

We can also extract word embeddings from this matrix: each column is a vector for that word that reflects how the word is distributed across documents. Words that appear in similar documents will have similar column vectors.
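Reading the matrix column by column amounts to a transpose. The sketch below assumes a tiny hand-written DFM; the counts and the helper name `word_vectors` are invented for the example.

```python
# Hypothetical 3-document DFM; the counts are made up for illustration.
vocab = ["ocean", "sea", "mortgage"]
dfm = [
    [2, 1, 0],  # Doc A
    [1, 2, 0],  # Doc B
    [0, 0, 3],  # Doc C
]

def word_vectors(vocab, dfm):
    """Transpose the DFM so each word maps to its column of per-document counts."""
    columns = zip(*dfm)
    return {word: list(col) for word, col in zip(vocab, columns)}

vectors = word_vectors(vocab, dfm)
print(vectors["ocean"])     # [2, 1, 0]
print(vectors["sea"])       # [1, 2, 0]
print(vectors["mortgage"])  # [0, 0, 3]
```

Here "ocean" and "sea" occur in the same two documents, so their column vectors point in similar directions, while "mortgage" shares no document with either.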

03

How Close is "Close"? — Cosine Similarity

Once we have vectors, we need a way to measure how similar they are. The standard measure is cosine similarity: the cosine of the angle between two vectors.

[Interactive: Cosine Similarity. Drag the sliders to rotate two vectors and watch the similarity score change.] In real embeddings this happens in 50–300 dimensions, but the math is identical.

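Cosine similarity is the dot product of the two vectors divided by the product of their lengths: cos(θ) = (a · b) / (‖a‖ ‖b‖). Equivalently, once each vector is normalised to length 1, their dot product is cos(θ), the cosine of the angle θ between them. The value ranges from 1 (same direction) through 0 (orthogonal) to −1 (opposite directions).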

Why not Euclidean distance? Because we care about direction, not magnitude. A long document that mentions "ocean" ten times is about the ocean just as much as a short one that mentions it twice; the two differ only in length. Cosine similarity ignores length and focuses on orientation.
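The contrast between the two measures is easy to check numerically. The following sketch uses only the standard library; the two count vectors are invented, with the "long" document being the "short" one scaled by five.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the end points of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up count vectors: same direction, different magnitude.
short_doc = [2, 1, 0]
long_doc = [10, 5, 0]

print(cosine_similarity(short_doc, long_doc))   # very close to 1: same orientation
print(euclidean_distance(short_doc, long_doc))  # clearly non-zero: different lengths
```

Cosine similarity treats the two documents as essentially identical, while Euclidean distance reports them as far apart purely because one is longer.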

04

The Problem with Counting

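Bag-of-words embeddings are a solid starting point, but they have serious structural limitations: high-frequency words dominate the counts, the vectors are sparse, and, most fundamentally, document-level counts say little about what a word means.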

TF-IDF (down-weighting high-frequency words) helps with one of these problems. Non-negative Matrix Factorisation can produce denser representations. But the fundamental issue — that counting words in documents doesn't necessarily capture word meaning — requires a completely different approach.
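As a rough illustration of the down-weighting idea, the sketch below uses one common TF-IDF variant: raw term count times log(N / document frequency). Other variants add smoothing or normalisation, so treat this formula, the function name `tf_idf` and the example sentences as assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, using count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append({w: c * math.log(n_docs / df[w]) for w, c in counts.items()})
    return weighted

# Made-up example documents.
docs = ["the ocean is deep", "the sea is deep", "the mortgage is due"]
weights = tf_idf(docs)
print(weights[0])
# "the" and "is" occur in every document, so their weight is 0;
# "ocean" occurs in only one document and gets the largest weight.
```

Words that appear everywhere carry no information about which document you are looking at, and this weighting removes them from the comparison.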

05

Context = Meaning — The Key Insight

In 1957, linguist John Rupert Firth wrote one of the most-quoted sentences in all of NLP:

"You shall know a word by the company it keeps."

This Distributional Hypothesis says: words that appear in similar contexts carry similar meanings. We don't need a dictionary or hand-crafted rules — we just need to observe how words are actually used.

Consider "ocean" and "sea". They appear in nearly identical surroundings:

Contexts of "ocean"

… sailed across the ocean
… deep ocean currents …
ocean floor exploration …
… waves crashed on the ocean shore …

Contexts of "sea"

… sailed across the sea
… deep sea currents …
sea floor exploration …
… waves crashed on the sea shore …

Same contexts → similar meaning → they should end up close together in embedding space. No labels, no human annotation. Just patterns across billions of sentences.
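The comparison above can be reproduced by counting which words fall inside a small window around each target. This is a minimal sketch: the function name `context_counts` and the window size of 2 are assumptions, and the sentences are the example fragments from above with the ellipses dropped.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

sentences = [
    "sailed across the ocean", "deep ocean currents", "ocean floor exploration",
    "sailed across the sea", "deep sea currents", "sea floor exploration",
]
ocean = context_counts(sentences, "ocean")
sea = context_counts(sentences, "sea")
print(ocean)
print(ocean == sea)  # True
```

The two context profiles are exactly identical only because this toy corpus was hand-picked; in real text they would be similar rather than equal.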

This is how embeddings can discover semantic relationships that no linguist explicitly programmed in. The meaning was hiding in co-occurrence statistics all along.

06

Learning from Context: Word2Vec

In 2013, researchers at Google turned Firth's insight into a practical algorithm: Word2Vec. The idea is elegantly indirect — train a small neural network on a prediction task, then throw the predictions away and keep only the weights as embeddings.

Skip-gram: predict the context

Given a target word, predict which words are likely to appear nearby. This forces the model to encode what situations a word typically occurs in.

CBOW: predict the target

Given the surrounding context words, predict the missing word. This is the reverse task, but both produce the same kind of embeddings.

For example, in the sentence:

the   cat   sat   on   the   mat

The target word is "sat". The skip-gram model tries to predict context words like "cat", "on", "mat" from it.
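The training examples for skip-gram are simply (target, context) pairs read off a sliding window. The sketch below generates them for the sentence above; the function name `skipgram_pairs` and the window size of 3 are assumptions chosen so that "mat" falls inside the window of "sat".

```python
def skipgram_pairs(tokens, window=3):
    """Return every (target, context) pair within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=3)

# Context words the model is asked to predict from the target "sat".
print([context for target, context in pairs if target == "sat"])
# ['the', 'cat', 'on', 'the', 'mat']
```

Each pair becomes one small prediction problem; CBOW uses the same windows but groups the context words together as the input and predicts the target.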

The network sees hundreds of millions of such examples. After training, words that appear in similar contexts end up with similar weight vectors, because the network has to make similar predictions for them. Those weight vectors are the word embeddings.

The neural network itself is discarded. What remains is a simple lookup table: one compact vector per word in the vocabulary, learned purely from unlabelled text.

07

GloVe: The Global Picture

Word2Vec learns from a sliding window — each training step sees only the immediate neighborhood of one word. In 2014, Stanford researchers Jeffrey Pennington, Richard Socher, and Christopher Manning took a step back and asked: why not use the full co-occurrence statistics of the entire corpus at once?

GloVe (Global Vectors for Word Representation) first builds a massive matrix: for every pair of words, how often do they appear within a window of each other across the entire corpus? It then finds a low-dimensional factorisation of that matrix: compact vectors whose dot products (plus a bias term per word) best reconstruct the logarithm of the observed co-occurrence counts.
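For readers who want to see the fitting criterion, here is a sketch of the weighted least-squares loss from the GloVe paper: for each word pair, the dot product plus two bias terms is compared with the logarithm of the co-occurrence count, and the squared error is scaled by a weighting function. The defaults x_max = 100 and alpha = 0.75 are the values reported in the paper; the data layout (a dict keyed by index pairs, plain lists of floats) and the function names are assumptions for this example, and no optimisation step is shown.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, word_vecs, ctx_vecs, word_bias, ctx_bias):
    """Weighted squared error between dot products (+ biases) and log counts."""
    total = 0.0
    for (i, j), x in cooc.items():
        if x <= 0:
            continue  # log(0) is undefined; zero counts contribute nothing
        dot = sum(a * b for a, b in zip(word_vecs[i], ctx_vecs[j]))
        error = dot + word_bias[i] + ctx_bias[j] - math.log(x)
        total += weight(x) * error ** 2
    return total

# Made-up toy setup: one word pair seen 100 times, all parameters still zero.
cooc = {(0, 1): 100.0}
word_vecs = [[0.0, 0.0], [0.0, 0.0]]
ctx_vecs = [[0.0, 0.0], [0.0, 0.0]]
word_bias = [0.0, 0.0]
ctx_bias = [0.0, 0.0]
print(glove_loss(cooc, word_vecs, ctx_vecs, word_bias, ctx_bias))  # (log 100)^2
```

Training means adjusting the vectors and biases, typically by gradient descent, until this loss is as small as possible.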

Word2Vec vs. GloVe at a glance

                | Word2Vec                          | GloVe
Training signal | Local sliding windows             | Global co-occurrence matrix
Method          | Neural network (gradient descent) | Matrix factorisation (weighted least squares)
Training data   | Google News (~100B tokens)        | Wikipedia + Gigaword (~6B tokens)
Result quality  | Comparable: both capture semantic structure well

In practice, both models produce high-quality embeddings. The GloVe authors report faster training and slightly better scores on analogy benchmarks, although such comparisons are sensitive to corpus and hyperparameter choices. The vectors in this Explorer are GloVe embeddings pre-trained on 6 billion tokens of Wikipedia and Gigaword text, available here in 50-dimensional and 300-dimensional versions.

Here is the actual GloVe vector for the word "king" (50 dimensions):

[ 0.505, 0.686, −0.595, −0.023, 0.600, −0.135, −0.088, 0.474, −0.618, −0.310, −0.077, 1.493, −0.034, −0.982, 0.682, 0.817, −0.519, −0.315, −0.558, 0.664, 0.196, −0.135, −0.115, −0.303, 0.412, −2.223, −1.076, −1.078, −0.344, 0.335, 1.993, −0.042, −0.643, 0.711, 0.492, 0.168, 0.343, −0.257, −0.852, 0.166, 0.401, 1.169, −1.014, −0.216, −0.152, 0.783, −0.912, −1.611, −0.644, −0.510 ]

Fifty numbers that, together, encode everything the model has learned about what "king" means from billions of words of text: which words it co-occurs with, what topics it appears in, what it tends to precede and follow.
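In the distributed GloVe text files, each line holds a word followed by its space-separated vector components, so loading them yields a plain lookup table. The sketch below parses lines from a list instead of a file to stay self-contained; the function name `parse_glove_lines` is an assumption, and the sample line keeps only the first three components of the "king" vector shown above.

```python
def parse_glove_lines(lines):
    """Build a {word: vector} lookup table from 'word v1 v2 ...' lines."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table

# Truncated sample: a real 50d file has 50 values per line.
sample = ["king 0.505 0.686 -0.595"]
embeddings = parse_glove_lines(sample)
print(embeddings["king"])  # [0.505, 0.686, -0.595]
```

With a real file you would pass the open file object instead of `sample`, since iterating over a file also yields one line at a time.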

Do these numbers have any meaning individually? Mostly not — they are distributed, entangled representations. But the relationships between vectors are where the magic lives.

08

The Magic: Vector Arithmetic

The most surprising property of well-trained word embeddings: you can do arithmetic with meaning.

paris − france + germany ≈ berlin

Why does this work? The embedding space has learned geometric directions that encode abstract relationships. The vector pointing from any country to its capital is roughly constant across many countries:

vec("paris") − vec("france")  ≈  vec("berlin") − vec("germany")  ≈  vec("rome") − vec("italy")

This direction in embedding space encodes the concept "capital of". No one programmed it in; it emerged from the co-occurrence statistics of billions of words of text.
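The arithmetic itself takes only a few lines. The sketch below uses made-up 2-dimensional toy vectors (first axis roughly "which country", second axis roughly "is a capital") in place of real GloVe vectors, and the helper names `cosine` and `analogy` are assumptions. Excluding the three input words from the candidates is a common convention, because the nearest neighbor of the result is otherwise often one of the inputs.

```python
import math

def cosine(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(vectors, a, b, c, top_n=1):
    """Return the words nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    candidates.sort(reverse=True)
    return [word for _, word in candidates[:top_n]]

# Toy vectors invented for this example; real embeddings have 50 to 300 dimensions.
vectors = {
    "france": [1.0, 0.0], "paris": [1.0, 1.0],
    "germany": [2.0, 0.0], "berlin": [2.0, 1.0],
    "italy": [3.0, 0.0], "rome": [3.0, 1.0],
}
print(analogy(vectors, "paris", "france", "germany"))  # ['berlin']
```

In the toy space the "capital of" offset is exactly [0, 1] for every country; in a trained model it is only approximately constant, which is why some analogies fail.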

good − bad ≈ happy − sad

The difference good − bad captures something like "positive sentiment". The same direction works for many other word pairs because the model has learned a consistent geometry for evaluative polarity.

This is why the Explorer lets you type expressions like paris - france + germany and find the nearest neighbors to the resulting vector.

Not every analogy works perfectly — the model has 6 billion words of experience, but language is complicated. The failures are often just as instructive as the successes.

🧭

You're ready to explore

You now know where embeddings come from, why they work, and what you can do with them.
Go see it for yourself.

Open the Explorer →

GloVe 6B · Stanford NLP · 50d and 300d models · runs entirely in your browser