Word Embeddings (Word2Vec, GloVe, FastText)
Category: Natural Language Processing
Type: AI/ML Concept
Generated on: 2025-08-26 11:01:04
For: Data Science, Machine Learning & Technical Interviews
Word Embeddings Cheatsheet: Word2Vec, GloVe, FastText
1. Quick Overview
- What is it? Word embeddings are vector representations of words that capture semantic and syntactic relationships. Instead of treating words as discrete symbols (like in one-hot encoding), word embeddings represent them as dense, low-dimensional vectors.
- Why important? Crucial for NLP tasks as they allow algorithms to understand relationships between words, leading to better performance in tasks like:
- Text classification
- Sentiment analysis
- Machine translation
- Information retrieval
- Question answering
2. Key Concepts
- Word Vector: A numerical vector representing a word. The dimensions of the vector encode semantic information. Example:
[0.2, -0.5, 0.8, ...]
- Dimensionality: The number of dimensions in the word vector. Commonly ranges from 50 to 300. Higher dimensionality can capture more nuanced relationships but increases computational cost.
- Context Window: The surrounding words of a target word. Used to train embeddings.
- Corpus: A large collection of text data used for training.
- Similarity: Words with similar meanings have vectors that are close in vector space (e.g., using cosine similarity).
cosine_similarity(vector_a, vector_b) = (vector_a . vector_b) / (||vector_a|| * ||vector_b||)
- Vocabulary: The set of unique words in the corpus.
- Skip-gram: Predicts context words given a target word.
- CBOW (Continuous Bag of Words): Predicts the target word given the context words.
- Negative Sampling: A technique to improve training efficiency by only updating a small number of negative examples for each training sample.
- Subword Information: Breaking words into smaller units (n-grams) to handle out-of-vocabulary words and morphological variations (used in FastText).
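As a quick illustration of the cosine similarity measure above, here is a minimal NumPy sketch. The vectors are made-up toy values, not trained embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (illustrative values only)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # close to 1.0 (similar words)
print(cosine_similarity(king, apple))  # much lower (unrelated words)
```

With real trained embeddings the same function is used to rank a word's nearest neighbors in vector space.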
3. How It Works
A. Word2Vec (Skip-gram example):
- Data Preparation: Create training pairs of (target word, context word) from a corpus.
- Example sentence: “the quick brown fox jumps over the lazy dog”
- Context window size = 2
- Training pairs:
- (quick, the)
- (quick, brown)
- (brown, quick)
- (brown, fox)
- …and so on
- Model Architecture: A shallow neural network with one hidden layer.
- Input: One-hot encoded target word (e.g., “quick” -> [0, 0, 1, 0, 0, ...] over the vocabulary)
- Hidden Layer: Learns the word embedding (e.g., 100 dimensions). The weights between the input and hidden layer are the word embeddings.
- Output: Probability distribution over the vocabulary, predicting the context word.
- Training: Optimize the model to predict context words accurately.
- Loss function: Cross-entropy loss.
- Optimization: Stochastic Gradient Descent (SGD) or variants.
- Negative sampling is commonly used to speed up training.
- Word Embeddings: The weights of the hidden layer represent the word embeddings.
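The data-preparation step above (sliding a context window over a sentence) can be sketched in a few lines; `skipgram_pairs` is a hypothetical helper written for illustration, not part of any library:

```python
def skipgram_pairs(tokens, window=2):
    """Slide a context window over the tokens and emit (target, context) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```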
ASCII Diagram (Skip-gram):
Input (One-hot) --> [Projection Layer (Weights = Word Embeddings)] --> Output (Probabilities)
"quick" --> [0.2, -0.5, 0.8, ...] --> "the", "brown", ...
B. GloVe (Global Vectors for Word Representation):
- Co-occurrence Matrix: Construct a matrix X where X<sub>ij</sub> represents the number of times word j appears in the context of word i.
- Objective Function: Minimize the following cost function:
J = Σ<sub>i,j</sub> f(X<sub>ij</sub>) (v<sub>i</sub><sup>T</sup>v<sub>j</sub> + b<sub>i</sub> + b<sub>j</sub> - log(X<sub>ij</sub>))<sup>2</sup>
Where:
- v<sub>i</sub> and v<sub>j</sub> are the word vectors for words i and j.
- b<sub>i</sub> and b<sub>j</sub> are biases.
- f(X<sub>ij</sub>) is a weighting function that prevents over-weighting frequent word pairs. A common choice is f(x) = (x/x<sub>max</sub>)<sup>α</sup> for x < x<sub>max</sub> and 1 otherwise.
- Optimization: Use gradient descent to learn the word vectors.
- Word Embeddings: The learned word vectors v<sub>i</sub> are the word embeddings.
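The co-occurrence counts and the weighting function f from the objective above can be sketched as follows. This is a toy version: real GloVe implementations also down-weight counts by distance within the window:

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count how often word j appears within `window` words of word i."""
    X = defaultdict(float)
    for tokens in sentences:
        for i, wi in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    X[(wi, tokens[j])] += 1.0
    return X

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) from the GloVe objective: damps very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

X = cooccurrence([["the", "quick", "brown", "fox"]])
print(X[("quick", "brown")])  # 1.0
print(glove_weight(1.0))      # (1/100)**0.75, about 0.0316
```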
C. FastText:
- Character n-grams: Represents each word as a bag of character n-grams. For example, for the word “where” and n=3: "<wh", "whe", "her", "ere", "re>" (angle brackets denote word boundaries).
- Word Representation: A word is represented by the sum of its n-gram vectors plus a vector for the word itself.
- Training: Similar to Word2Vec (Skip-gram or CBOW), but using the n-gram representation.
- Out-of-Vocabulary (OOV) words: Can generate embeddings for OOV words by summing the embeddings of their character n-grams.
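The n-gram decomposition FastText relies on can be sketched directly. Note that actual FastText also hashes n-grams into a fixed-size table and adds a vector for the whole word; this only shows the extraction step:

```python
def char_ngrams(word, n=3):
    """FastText-style character n-grams with boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An OOV word like "whereabouts" shares many of these n-grams with "where", which is why summing n-gram vectors yields a reasonable embedding for words never seen in training.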
# Example using gensim (Word2Vec)
from gensim.models import Word2Vec

# Sample sentences
sentences = [["the", "quick", "brown", "fox"], ["fox", "jumps", "over", "the", "lazy", "dog"]]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 for skip-gram

# Get word vector for "fox"
vector = model.wv['fox']
print(vector)

# Find similar words to "fox"
similar_words = model.wv.most_similar('fox', topn=3)
print(similar_words)

4. Real-World Applications
- Search Engines: Understanding the meaning of search queries to provide relevant results. For example, if someone searches for “big cat,” the search engine can also show results for “lion” and “tiger.”
- Chatbots: Enabling chatbots to understand user input and generate appropriate responses.
- Machine Translation: Mapping words from one language to another based on their semantic similarity.
- Recommender Systems: Recommending items based on the similarity of their descriptions.
- Document Clustering: Grouping documents based on their content.
- Spam Detection: Identifying spam emails based on the similarity of their content to known spam messages.
- Medical Diagnosis: Analyzing patient records and identifying potential diagnoses based on the similarity of symptoms.
- Financial Analysis: Analyzing financial news and identifying potential investment opportunities based on the similarity of market trends.
5. Strengths and Weaknesses
| Feature | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Strengths | Simple, efficient, captures context. | Uses global co-occurrence statistics. | Handles OOV words, captures morphology. |
| Weaknesses | Ignores global co-occurrence, doesn’t handle OOV well. | Doesn’t handle OOV words well. | Slower training, larger memory footprint due to n-gram vectors. |
| OOV Handling | Poor | Poor | Good (character n-grams) |
| Speed | Fast | Moderate | Moderate to Slow |
| Information | Local Context | Global Co-occurrence Statistics | Local Context + Subword Information |
Analogy:
Imagine teaching a child about animals.
- Word2Vec: You show the child pictures of a “dog” and then pictures of things that are usually around dogs like “leash,” “bark,” “walk.” The child learns the context of “dog.”
- GloVe: You tell the child facts about all the animals in the zoo and how often they appear next to each other in books and descriptions. The child learns global relationships.
- FastText: You teach the child that “dog” is built from letter chunks like “do” and “og,” and “dogs” from “do,” “og,” and “gs.” The child can then guess what “doggie” might mean even if they’ve never seen that exact word.
6. Interview Questions
- Q: What are word embeddings and why are they useful?
- A: They are vector representations of words that capture semantic relationships. Useful because they allow algorithms to understand the meaning of words and their relationships, improving NLP tasks.
- Q: Explain the difference between Word2Vec (Skip-gram and CBOW) and GloVe.
- A: Word2Vec uses a local context window to learn embeddings, while GloVe uses global co-occurrence statistics. Skip-gram predicts context words given a target word, while CBOW predicts the target word given the context.
- Q: What is negative sampling and why is it used?
- A: A technique to improve training efficiency by only updating a small number of negative examples for each training sample. It makes the training process faster by reducing the computational cost.
- Q: How does FastText handle out-of-vocabulary (OOV) words?
- A: By breaking words into character n-grams and learning embeddings for those n-grams. The embedding for an OOV word is the sum of the embeddings of its constituent n-grams.
- Q: What are some real-world applications of word embeddings?
- A: Search engines, machine translation, chatbots, recommender systems, sentiment analysis, document clustering, spam detection.
- Q: How do you choose the dimensionality of word embeddings?
- A: It depends on the size of the corpus and the complexity of the task. A higher dimensionality can capture more nuanced relationships but requires more computational resources. Experimentation is key.
- Q: How would you evaluate the quality of word embeddings?
- A: Intrinsic evaluation (e.g., word analogy tasks) and extrinsic evaluation (e.g., measuring the performance of a downstream NLP task). Word analogy tasks involve solving questions like “a is to b as c is to ?” using vector arithmetic (e.g., vector(king) - vector(man) + vector(woman) should be close to vector(queen)).
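The analogy arithmetic from the last answer can be demonstrated with toy vectors. These are illustrative 2-D values chosen by hand so the offsets work out, not trained embeddings:

```python
import numpy as np

# Hypothetical 2-D embeddings, hand-picked so the "gender" offset
# (man -> woman) is consistent across the royal pair.
vectors = {
    "king":  np.array([0.9, 0.9]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "queen": np.array([0.1, 0.9]),
    "apple": np.array([0.5, 0.05]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```

With real embeddings the same query is typically run via a library call such as gensim's `most_similar(positive=['king', 'woman'], negative=['man'])`.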
7. Further Reading
- Original Papers:
- Word2Vec: Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)
- GloVe: GloVe: Global Vectors for Word Representation (Pennington et al., 2014)
- FastText: Enriching Word Vectors with Subword Information (Bojanowski et al., 2017)
- Online Resources:
- Stanford NLP course: https://web.stanford.edu/class/cs224n/
- TensorFlow Tutorials: https://www.tensorflow.org/tutorials/text/word2vec
- Gensim Documentation: https://radimrehurek.com/gensim/
- Related Concepts:
- Attention Mechanisms: Focus on relevant parts of the input when processing sequences.
- Transformers: Neural network architectures that rely on attention mechanisms for sequence-to-sequence tasks.
- BERT (Bidirectional Encoder Representations from Transformers): A powerful pre-trained language model that generates contextualized word embeddings.
- Sentence Embeddings: Vector representations of entire sentences.
- Document Embeddings: Vector representations of entire documents.
This cheatsheet provides a solid foundation for understanding and applying word embeddings in various NLP tasks. Remember to practice with code and explore different datasets to solidify your knowledge. Good luck!