Word Embeddings, from scratch
How do you teach a computer what words mean? This page walks you through the full journey, from counting words in a table, all the way to 300-dimensional meaning spaces.
What is an Embedding?
An embedding is a way of representing something — a word, a sentence, a document, an image — as a list of numbers called a vector. On its own that sounds unremarkable. The magic is in how those numbers are chosen.
A good embedding places similar things close together in that numerical space. If you embed every English word and look up the neighbors of "ocean", you should find "sea", "river", "coast" — not "Tuesday" or "mortgage". Meaning becomes measurable distance.
This page focuses on word embeddings (one vector per word), but the idea generalises: document embeddings, sentence embeddings and image embeddings follow the same principle.
The Simple Approach: Bag of Words
The most natural first attempt is to count how often each word appears in each document and arrange those counts into a matrix whose rows represent documents and whose columns represent words. This is called a Document-Feature Matrix (DFM).
Each row is now a vector for that document. Documents that use similar words get similar vectors, so we can already measure something like "topical similarity", which is useful for search, recommendation, or clustering.
This already works surprisingly well for document-level comparisons. Netflix-style recommendation ("if you liked this show, you might like that one") is essentially this idea applied to viewing histories.
We can also extract word embeddings from this matrix: each column is a vector for that word that reflects how the word is distributed across documents. Words that appear in similar documents will have similar column vectors.
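As a minimal sketch of the idea above, the following builds a tiny Document-Feature Matrix with the standard library only. The function name `build_dfm`, the whitespace tokeniser, and the three example documents are illustrative assumptions, not part of the original page.

```python
from collections import Counter

def build_dfm(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    # Naive tokenisation (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up example documents.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vocabulary, dfm = build_dfm(documents)

# Row vector: the embedding of the first document.
doc_vector = dfm[0]

# Column vector: the embedding of the word "sat" (its counts across documents).
sat_column = [row[vocabulary.index("sat")] for row in dfm]

print(vocabulary)
print(doc_vector)
print(sat_column)
```

Reading the matrix by rows gives document embeddings, and reading it by columns gives word embeddings, so the same table serves both purposes.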
How Close is "Close"? — Cosine Similarity
Once we have vectors, we need a way to measure how similar they are. The standard measure is cosine similarity: the cosine of the angle between two vectors.
- Angle 0° → similarity 1.0 — vectors point in the same direction, identical meaning
- Angle 90° → similarity 0.0 — vectors are perpendicular, unrelated
- Angle 180° → similarity −1.0 — vectors point in opposite directions
Real embeddings have 50–300 dimensions, but the math is identical.
The formula is simply the dot product of the two vectors after each has been normalised to length 1. That dot product equals cos(θ), the cosine of the angle θ between them.
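The calculation can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the example vectors are illustrative; dividing the dot product by both lengths is equivalent to normalising each vector first.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 0], [2, 0])       # angle 0 degrees
perpendicular = cosine_similarity([1, 0], [0, 3])        # angle 90 degrees
opposite = cosine_similarity([1, 0], [-1, 0])            # angle 180 degrees
print(same_direction, perpendicular, opposite)
```

Because the lengths are divided out, only the direction of each vector matters: a long document and a short one on the same topic can still score close to 1.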
The Problem with Counting
Bag-of-words embeddings are a solid starting point, but they have serious structural limitations:
- Enormously sparse. A typical vocabulary has 50 000+ words. Most documents use only a small fraction — so over 95 % of the matrix is zeros.
- Synonyms are strangers. If "automobile" and "car" never occur in the same document, the cosine similarity of their column vectors is exactly 0.0, even though they mean the same thing.
- No word order. "The dog bites the man" and "The man bites the dog" produce identical bag-of-words vectors.
- No context sensitivity. "bank" gets the same vector whether it means a riverbank or a financial institution.
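Two of the limitations in the list above are easy to reproduce. The sketch below uses made-up documents and hypothetical helper names (`bag_of_words`, `word_column`, `cosine`) to show that word order disappears and that synonyms which never share a document get a similarity of zero.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts for one text, ignoring case and order."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# No word order: both sentences collapse to the same counts.
same_bag = bag_of_words("The dog bites the man") == bag_of_words("The man bites the dog")

# Synonyms are strangers: the two words never share a document.
documents = [
    "my car needs new tires",
    "the automobile was invented in germany",
    "she parked the car outside",
]

def word_column(word, docs):
    """Column vector of a word: its count in each document."""
    return [bag_of_words(doc)[word] for doc in docs]

synonym_similarity = cosine(word_column("car", documents),
                            word_column("automobile", documents))
print(same_bag, synonym_similarity)
```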
TF-IDF (down-weighting high-frequency words) helps with one of these problems. Non-negative Matrix Factorisation can produce denser representations. But the fundamental issue — that counting words in documents doesn't necessarily capture word meaning — requires a completely different approach.
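For reference, here is a minimal sketch of one common TF-IDF variant: raw term count multiplied by log(N / document frequency). Real libraries often add smoothing or normalisation, so treat the exact formula, the function name `tf_idf`, and the example documents as assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weighted

weights = tf_idf(["the cat sat", "the dog barked", "the bird sang"])
print(weights[0])
```

In this variant a word that occurs in every document, such as "the" here, gets weight zero, while words unique to one document keep a positive weight.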
Context = Meaning — The Key Insight
In 1957, linguist John Rupert Firth wrote one of the most-quoted sentences in all of NLP:
"You shall know a word by the company it keeps."
This Distributional Hypothesis says: words that appear in similar contexts carry similar meanings. We don't need a dictionary or hand-crafted rules — we just need to observe how words are actually used.
Consider "ocean" and "sea". They appear in nearly identical surroundings:
Contexts of "ocean"
… sailed across the ocean …
… deep ocean currents …
… ocean floor exploration …
… waves crashed on the ocean shore …
Contexts of "sea"
… sailed across the sea …
… deep sea currents …
… sea floor exploration …
… waves crashed on the sea shore …
Same contexts → similar meaning → they should end up close together in embedding space. No labels, no human annotation. Just patterns across billions of sentences.
This is how embeddings can discover semantic relationships that no linguist explicitly programmed in. The meaning was hiding in co-occurrence statistics all along.
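To make "same contexts" concrete, the sketch below counts the words that appear within a small window around a target word, using the eight example fragments from this section. The function name `context_counts` and the window size of 2 are illustrative choices.

```python
from collections import Counter

def context_counts(target, sentences, window=2):
    """Count words occurring within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

sentences = [
    "sailed across the ocean",
    "deep ocean currents",
    "ocean floor exploration",
    "waves crashed on the ocean shore",
    "sailed across the sea",
    "deep sea currents",
    "sea floor exploration",
    "waves crashed on the sea shore",
]

ocean_contexts = context_counts("ocean", sentences)
sea_contexts = context_counts("sea", sentences)
print(ocean_contexts)
print(ocean_contexts == sea_contexts)
```

On this toy corpus the two context profiles are identical, which is the extreme case of the pattern that embedding models pick up statistically from real text.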
Learning from Context: Word2Vec
In 2013, researchers at Google turned Firth's insight into a practical algorithm: Word2Vec. The idea is elegantly indirect — train a small neural network on a prediction task, then throw the predictions away and keep only the weights as embeddings.
Skip-gram: predict the context
Given a target word, predict which words are likely to appear nearby. This forces the model to encode the situations in which a word typically occurs.
CBOW (continuous bag of words): predict the target
Given the surrounding context words, predict the missing word. This is the reverse task, but both produce the same kind of embeddings.
For example, in the sentence:
"The cat sat on the mat."
The target word is "sat". The skip-gram model tries to predict context words like "cat", "on", "mat" from it.
The network sees hundreds of millions of such examples. After training, words that appear in similar contexts end up with similar weight vectors — because they activate similar patterns in the network. Those weight vectors are the word embeddings.
The neural network itself is discarded. What remains is a simple lookup table: one compact vector per word in the vocabulary, learned purely from unlabelled text.
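The training examples for both tasks can be generated with a sliding window. The sketch below only produces the examples and does not train a network. The function names, the example sentence, and the window sizes are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: predict each neighbour from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: predict the target from its neighbours."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, target))
    return examples

tokens = "the cat sat on the mat".split()

# Skip-gram pairs whose target is "sat", with a window of 2 words per side.
sat_pairs = [pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"]

# A wider window of 3 also reaches "mat".
sat_pairs_wide = [pair for pair in skipgram_pairs(tokens, window=3) if pair[0] == "sat"]

print(sat_pairs)
print(cbow_examples(tokens)[2])
```

The window size controls how much surrounding text counts as context: small windows tend to emphasise words that can substitute for each other, while larger ones pull in more topical associations.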
GloVe: The Global Picture
Word2Vec learns from a sliding window — each training step sees only the immediate neighborhood of one word. In 2014, Stanford researchers Jeffrey Pennington, Richard Socher, and Christopher Manning took a step back and asked: why not use the full co-occurrence statistics of the entire corpus at once?
GloVe (Global Vectors for Word Representation) first builds a massive matrix: for every pair of words, how often do they appear within a window of each other across the entire corpus? It then finds a low-dimensional factorisation of that matrix: compact vectors whose dot products best reconstruct the logarithm of the observed co-occurrence counts.
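For readers who want the formula, the objective in the published GloVe paper is a weighted least-squares fit of dot products to log co-occurrence counts. The symbols are not defined elsewhere on this page: X_ij is the co-occurrence count of words i and j, w_i and w̃_j are the word and context vectors, b_i and b̃_j are bias terms, and V is the vocabulary size.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function f keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur. The paper reports using x_max = 100 and α = 3/4.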
| | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local sliding windows | Global co-occurrence matrix |
| Method | Neural network (gradient descent) | Matrix factorisation (weighted least squares) |
| Training data | Google News (~100B tokens) | Wikipedia + Gigaword (~6B tokens) |
| Result quality | Comparable — both capture semantic structure well | |
In practice, both models produce high-quality embeddings. The GloVe authors reported that it trains faster and does slightly better on analogy benchmarks, though such comparisons are sensitive to hyperparameters. The vectors in this Explorer are GloVe embeddings pre-trained on 6 billion tokens of Wikipedia and Gigaword text, available in 50-dimensional and 300-dimensional versions.
Here is the actual GloVe vector for the word "king" (50 dimensions):
Fifty numbers that, together, encode everything the model has learned about what "king" means from billions of words of text: which words it co-occurs with, what topics it appears in, what it tends to precede and follow.
Do these numbers have any meaning individually? Mostly not — they are distributed, entangled representations. But the relationships between vectors are where the magic lives.
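If you want to inspect such vectors yourself, the pre-trained GloVe files are plain text with one word per line followed by its space-separated numbers. The sketch below parses that format. The function name `load_glove` is hypothetical, and the sample values are made up for illustration: they are not real GloVe numbers.

```python
import io

def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} lookup table."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

# Made-up 3-dimensional sample in the same text format (NOT real GloVe values).
sample = io.StringIO("king 0.5 -0.25 1.0\nqueen 0.25 0.5 0.75\n")
vectors = load_glove(sample)
print(vectors["king"])
```

With a downloaded file, the same function should work on an open file handle, for example `with open("glove.6B.50d.txt", encoding="utf-8") as f: vectors = load_glove(f)`, assuming that file sits in the working directory.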
The Magic: Vector Arithmetic
The most surprising property of well-trained word embeddings: you can do arithmetic with meaning.
Why does this work? The embedding space has learned geometric directions that encode abstract relationships. The vector pointing from any country to its capital is roughly constant across many countries:
vec("paris") − vec("france") ≈ vec("berlin") − vec("germany") ≈ vec("rome") − vec("italy")
This direction in embedding space encodes the concept "capital of". No one programmed it in; it emerged from reading billions of words of text.
The difference vec("good") − vec("bad") captures something like "positive sentiment". The same direction works for many other word pairs because the model has learned a consistent geometry for evaluative polarity.
This is why the Explorer lets you type expressions like
paris - france + germany
and find the nearest neighbors to the resulting vector.
Not every analogy works perfectly — the model has 6 billion words of experience, but language is complicated. The failures are often just as instructive as the successes.
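An expression such as paris - france + germany can be evaluated with a few lines of code: add and subtract the vectors, then rank the remaining words by cosine similarity to the result. The sketch below is one plausible way to do it, not necessarily how the Explorer is implemented. The 3-dimensional toy vectors are made up so that the example is exact, and excluding the input words from the results is a common convention assumed here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def analogy(vectors, positive, negative, top_n=1):
    """Nearest neighbours of sum(positive) - sum(negative), inputs excluded."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

# Toy vectors (made up): the third coordinate plays the role of "capital of".
toy = {
    "france":  [1.0, 0.0, 0.0],
    "paris":   [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 1.0],
    "banana":  [-1.0, -1.0, 0.0],
}

result = analogy(toy, positive=["paris", "germany"], negative=["france"])
print(result)
```

With real GloVe vectors the query rarely lands exactly on a word, which is why the answer is defined as the nearest neighbour rather than an exact match.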
You're ready to explore
You now know where embeddings come from, why they work, and what you can do with them.
Go see it for yourself.
GloVe 6B · Stanford NLP · 50d and 300d models · runs entirely in your browser