Week 2 — Embeddings and Dense Representations
Overview
Last week’s n-gram model treated every word as an atomic symbol: “dog” and “puppy” were as unrelated as “dog” and “thermodynamics.” This week fixes that with embeddings: dense, learned vectors in which words used in similar contexts land near each other, so models can generalize across words they have rarely or never seen together. You will learn the word2vec skip-gram objective, the geometry of the resulting space (analogies as vector arithmetic), and how text classification becomes a simple, strong application of these representations. This is the conceptual leap from symbolic to distributed representation that underlies modern neural NLP.
Course 5 already gave you logistic regression, gradient descent, Naive Bayes, and the backprop foundations; here we apply them to learn and use embeddings rather than re-deriving the optimization. The result is the representation that every sequence model and transformer in later weeks consumes.
Readings
- J&M Ch. 5: vector semantics and embeddings (count-based, word2vec, GloVe). Extract: the distributional hypothesis and the skip-gram objective.
- J&M Ch. 6: neural networks (as applied to NLP). Extract: the embedding layer and feedforward classification.
- J&M Ch. 4 (skim): logistic regression and text classification. Extract: the classification setup (theory assumed from Course 5).
- CS224N: word2vec/GloVe material. (Logistic regression, SGD, Naive Bayes, backprop: assumed from Course 5.)
Key Concepts
The distributional hypothesis
“You shall know a word by the company it keeps.” A word’s meaning is approximated by the distribution of contexts it appears in. Embeddings operationalize this: words with similar contexts get similar vectors. This is why embeddings capture semantic and syntactic regularities without any explicit supervision about meaning.
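As a rough illustration of the hypothesis (a count-based sketch, not the learned embeddings covered in the rest of this week), the following builds context-count vectors from a made-up toy corpus and compares them with cosine similarity. The corpus, the window size of 2, and the function names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (hypothetical): "dog" and "puppy" appear in the same contexts.
toy_corpus = [
    "the dog chased the ball".split(),
    "the puppy chased the ball".split(),
    "the dog ate the food".split(),
    "the puppy ate the food".split(),
    "thermodynamics studies heat and energy".split(),
]
vecs = cooccurrence_vectors(toy_corpus)
sim_dog_puppy = cosine(vecs["dog"], vecs["puppy"])
sim_dog_thermo = cosine(vecs["dog"], vecs["thermodynamics"])
print(sim_dog_puppy, sim_dog_thermo)
```

On this toy corpus, “dog” and “puppy” occur in identical contexts and so get a cosine similarity of essentially 1, while “dog” and “thermodynamics” share no context words and score 0. Learned embeddings compress the same signal into a few hundred dense dimensions.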
Skip-gram with negative sampling
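word2vec’s skip-gram trains embeddings by predicting the words that appear in a window around each target word. The efficient form, negative sampling, replaces the softmax over the whole vocabulary with binary logistic regression: push the dot product of a (word, true-context) pair high and that of (word, random-negative) pairs low. Here \(\mathbf{v}_w\) is the target word’s vector, \(\mathbf{v}_c\) the true context word’s vector, and \(\mathbf{v}_{n_j}\) the vector of the \(j\)-th sampled negative; in the original word2vec formulation, context and negative vectors come from a second embedding matrix, separate from the target vectors. The loss for one positive pair plus \(k\) negatives: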
\[ -\log\sigma(\mathbf{v}_w\cdot\mathbf{v}_c) - \sum_{j=1}^{k}\log\sigma(-\mathbf{v}_w\cdot\mathbf{v}_{n_j}). \]
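This is just logistic regression (Course 5) over dot products, optimized by SGD, with one twist: both vectors in each dot product are learned parameters, so there are no fixed input features. GloVe reaches comparable embeddings by a different route: it fits word vectors to the logarithm of global co-occurrence counts with a weighted least-squares objective, which amounts to factorizing a co-occurrence matrix.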
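The loss above can be sketched in NumPy as follows. This is a minimal, unbatched sketch for a single positive pair, not a reference implementation: the function name, the array shapes, and the plain SGD step with learning rate 0.1 are assumptions for illustration. The gradients follow from \(\frac{d}{dx}\log\sigma(x) = 1-\sigma(x)\) and \(\sigma(-x) = 1-\sigma(x)\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss_and_grads(v_w, v_c, v_neg):
    """Loss and gradients for one positive pair plus k negatives.

    v_w:   (d,)   target-word vector
    v_c:   (d,)   true-context vector
    v_neg: (k, d) vectors of the k sampled negatives
    """
    pos = sigmoid(v_w @ v_c)        # sigma(v_w . v_c), scalar
    neg = sigmoid(v_neg @ v_w)      # sigma(v_w . v_nj), shape (k,)
    # -log sigma(v_w.v_c) - sum_j log sigma(-v_w.v_nj), using sigma(-x) = 1 - sigma(x)
    loss = -np.log(pos) - np.sum(np.log(1.0 - neg))
    grad_w = (pos - 1.0) * v_c + neg @ v_neg   # (d,)
    grad_c = (pos - 1.0) * v_w                 # (d,)
    grad_neg = np.outer(neg, v_w)              # (k, d)
    return loss, grad_w, grad_c, grad_neg

# One SGD step on small random vectors (sizes and learning rate are illustrative).
rng = np.random.default_rng(0)
d, k = 8, 5
v_w = rng.normal(scale=0.1, size=d)
v_c = rng.normal(scale=0.1, size=d)
v_neg = rng.normal(scale=0.1, size=(k, d))

loss_before, g_w, g_c, g_neg = sgns_loss_and_grads(v_w, v_c, v_neg)
lr = 0.1
loss_after = sgns_loss_and_grads(v_w - lr * g_w, v_c - lr * g_c, v_neg - lr * g_neg)[0]
print(loss_before, loss_after)
```

The signs carry the intuition: the positive pair’s vectors are pulled toward each other in proportion to \(1-\sigma(\mathbf{v}_w\cdot\mathbf{v}_c)\), and each negative is pushed away from the target in proportion to \(\sigma(\mathbf{v}_w\cdot\mathbf{v}_{n_j})\), so pairs the model already gets right contribute little gradient.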
Embedding geometry
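The learned space has structure: cosine similarity measures relatedness, and vector offsets encode relations (the famous king − man + woman ≈ queen, where the answer is the vocabulary word whose vector is nearest to the result, conventionally excluding the three query words). This is linear-algebraic structure (Course 5) emerging from the training objective. Embeddings also absorb biases present in the corpus, such as stereotyped associations between gender and occupation words, an ethics point worth confronting.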
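A nearest-neighbor query and the analogy operation can be sketched as below. The three-dimensional vectors are hand-built for illustration (the axes loosely stand for royalty, gender, and "fruit-ness"), not trained embeddings, and the function names are assumptions. Normalizing to unit length before the arithmetic is one common convention, not the only one.

```python
import numpy as np

def normalize(E):
    """Scale each row of E to unit length so dot products are cosine similarities."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def nearest(E, vocab, word, topn=3):
    """Words with the highest cosine similarity to `word`, excluding itself."""
    idx = {w: i for i, w in enumerate(vocab)}
    En = normalize(E)
    scores = En @ En[idx[word]]
    scores[idx[word]] = -np.inf
    return [vocab[i] for i in np.argsort(-scores)[:topn]]

def analogy(E, vocab, a, b, c, topn=1):
    """Solve a : b :: c : ? as the word nearest to E[b] - E[a] + E[c]."""
    idx = {w: i for i, w in enumerate(vocab)}
    En = normalize(E)
    target = En[idx[b]] - En[idx[a]] + En[idx[c]]
    target /= np.linalg.norm(target)
    scores = En @ target
    for w in (a, b, c):              # conventional: never return a query word
        scores[idx[w]] = -np.inf
    return [vocab[i] for i in np.argsort(-scores)[:topn]]

# Hypothetical toy embedding: columns are (royalty, gender, fruit-ness).
vocab = ["man", "woman", "king", "queen", "apple"]
E = np.array([
    [0.1,  1.0, 0.0],
    [0.1, -1.0, 0.0],
    [1.0,  1.0, 0.0],
    [1.0, -1.0, 0.0],
    [0.0,  0.0, 1.0],
])
print(analogy(E, vocab, "man", "king", "woman"))  # ['queen']
print(nearest(E, vocab, "king", topn=1))          # ['man']
```

Excluding the query words matters in practice: without it, the nearest vector to the offset result is often one of the inputs themselves.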
Classification on embeddings
Pool word embeddings into a document vector (most simply, by averaging) and classify with logistic regression or a small MLP. This simple pipeline is a strong baseline and is still used in production. It demonstrates the payoff: dense features generalize across vocabulary in a way the n-gram’s sparse counts cannot.
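A minimal sketch of this pipeline, assuming a tiny hand-built embedding table in place of real pretrained vectors and plain batch gradient descent for the logistic-regression head. The words, vector values, and hyperparameters are all made up for illustration.

```python
import numpy as np

def doc_vector(tokens, emb, dim):
    """Mean-pool the embeddings of known tokens; zero vector if none are known."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Hypothetical "pretrained" embeddings: the first axis loosely encodes sentiment.
emb = {
    "good": np.array([1.0, 0.2]), "great": np.array([0.9, 0.1]),
    "superb": np.array([0.95, 0.15]),
    "bad": np.array([-1.0, 0.2]), "awful": np.array([-0.9, 0.1]),
    "dreadful": np.array([-0.95, 0.15]),
    "movie": np.array([0.0, 1.0]), "plot": np.array([0.05, 0.9]),
}

# The classifier never sees "superb" or "dreadful" during training.
train = [("good movie", 1), ("great plot", 1), ("bad movie", 0), ("awful plot", 0)]
X = np.stack([doc_vector(text.split(), emb, 2) for text, _ in train])
y = np.array([label for _, label in train], dtype=float)
w, b = train_logreg(X, y)

def predict(text):
    """Return 1 (positive) or 0 (negative) for a whitespace-tokenized text."""
    return int(doc_vector(text.split(), emb, 2) @ w + b > 0)

pred_pos = predict("superb movie")
pred_neg = predict("dreadful plot")
print(pred_pos, pred_neg)
```

The two test words never appear in the classifier’s training data, yet they are classified correctly because their vectors sit near “good”/“great” and “bad”/“awful”. A bag-of-words model would have no weight at all for them.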
Theory Exercises
- State the distributional hypothesis and explain how skip-gram operationalizes it.
- Derive the skip-gram-with-negative-sampling loss and show it is logistic regression over dot products (Course 5).
- Show why averaged embeddings + logistic regression generalize to words that never appear in the classifier’s training data but were seen when the embeddings were pretrained.
- Explain analogy arithmetic geometrically; construct an example and the vector operation that solves it.
- Contrast Naive Bayes, logistic regression, and embedding-based classification in terms of features and generalization.
Implementation
Implement skip-gram with negative sampling in embeddings/ (train on a corpus; use Course 1/Course 5 backprop/SGD knowledge). Build a text classifier that pools embeddings and trains a logistic-regression/MLP head. Provide nearest-neighbor and analogy queries to inspect the space.
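One possible starting point for the data side of the trainer is sketched below: generating (center, context) pairs and drawing negatives. Raising unigram counts to the 0.75 power follows the word2vec paper; the function names, the toy sentence, and the choice of 5 negatives per pair are assumptions for illustration.

```python
import numpy as np

def skipgram_pairs(token_ids, window=2):
    """Yield (center, context) id pairs within a symmetric window."""
    for i, center in enumerate(token_ids):
        lo, hi = max(0, i - window), min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, token_ids[j]

def negative_sampling_probs(token_ids, vocab_size, power=0.75):
    """Unigram counts raised to `power` and renormalized.

    power < 1 flattens the distribution, so rare words are drawn as
    negatives more often than their raw frequency would suggest.
    """
    counts = np.bincount(token_ids, minlength=vocab_size).astype(float)
    weights = counts ** power
    return weights / weights.sum()

# Toy data (hypothetical); a real run would stream a tokenized corpus.
tokens = "the dog chased the ball and the puppy chased the dog".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[t] for t in tokens]

pairs = list(skipgram_pairs(ids, window=1))
probs = negative_sampling_probs(ids, len(vocab))

# k = 5 negatives per positive pair. This sketch does not filter out draws
# that happen to equal the true context word.
rng = np.random.default_rng(0)
negatives = rng.choice(len(vocab), size=(len(pairs), 5), p=probs)
print(len(pairs), negatives.shape)
```

Each row of `negatives` pairs with one entry of `pairs`, giving the inputs for one evaluation of the negative-sampling loss from Key Concepts.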
Experiments
Embedding quality: nearest-neighbor coherence, analogy accuracy on a standard set, and effect of embedding dimension/window size. Classification: accuracy/F1 vs a bag-of-words + Naive Bayes baseline (Course 5), and vs pretrained embeddings. Visualize the space with t-SNE/UMAP.
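For the classification comparison, accuracy and F1 can be computed from scratch as sketched below. The labels are invented to show why both metrics are worth reporting; the function name and the convention that label 1 is the positive class are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented labels: 2 positives out of 8 examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_majority = [0, 0, 0, 0, 0, 0, 0, 0]   # always predicts the majority class
y_model = [1, 0, 0, 0, 0, 0, 0, 1]      # finds one positive, one false alarm

majority_scores = binary_metrics(y_true, y_majority)
model_scores = binary_metrics(y_true, y_model)
print(majority_scores)
print(model_scores)
```

Both prediction sets reach 0.75 accuracy here, but the majority-class baseline has an F1 of 0 while the second has an F1 of 0.5, which is why the comparison reports F1 alongside accuracy.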
Expected baselines: trained embeddings show sensible neighbors and solve simple analogies; embedding-based classification beats bag-of-words on tasks needing generalization across vocabulary; pretrained embeddings beat from-scratch on small data. Bias probes reveal corpus bias.
Connections
These embeddings are the input representation for the RNN (Week 3), and the idea of learned dense representations is the seed of contextual embeddings (transformers, Weeks 6–7) and of dense retrieval (Week 9). The classification head reappears wherever a model needs a task-specific output. Course 5’s optimization and probability are the machinery; this week is the NLP-specific application.
Further Reading
- J&M Ch. 5–6.
- Mikolov et al. (word2vec) and Pennington et al. (GloVe).
- CS224N word-vector lecture and Assignment 1.