Chapter 22: TF-IDF and Bag-of-Words Representations¶
“Before transformers, we counted words. And in that counting, meaning emerged.”
Before deep learning, NLP was powered by statistical representations of text—primarily Bag-of-Words (BoW) and TF-IDF.
These techniques are still useful today:
- They’re fast, interpretable, and effective on small datasets
- They’re often used as baselines before training heavier models
- They’re key to understanding how NLP evolved into embedding-based approaches
In this chapter, you'll learn:
- How to convert text into BoW and TF-IDF vectors
- When to use count-based vs frequency-based methods
- How to implement these with TfidfVectorizer and CountVectorizer from scikit-learn
- How to visualize feature importance
Bag-of-Words (BoW)¶
BoW represents text by counting how often each word appears, ignoring order.
Example:
Doc 1: "TensorFlow is great"
Doc 2: "TensorFlow and Keras"
Vocabulary: ['and', 'great', 'is', 'keras', 'tensorflow']
Vectors:
Doc 1 → [0, 1, 1, 0, 1]
Doc 2 → [1, 0, 0, 1, 1]
Using CountVectorizer¶
from sklearn.feature_extraction.text import CountVectorizer

texts = ["TensorFlow is great", "TensorFlow and Keras"]

# Learn the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['and' 'great' 'is' 'keras' 'tensorflow']
print(X.toarray())  # [[0 1 1 0 1], [1 0 0 1 1]] -- matches the hand-computed vectors above
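Once fitted, the vectorizer's vocabulary is frozen. A minimal sketch of what that means at transform time (the new document here is a hypothetical example): words never seen during fitting are silently dropped, which previews the fixed-vocabulary limitation discussed later.

```python
# Transform a new document with the already-fitted CountVectorizer.
# "pytorch" is not in the learned vocabulary, so it is silently ignored.
new_doc = ["TensorFlow and PyTorch"]
print(vectorizer.transform(new_doc).toarray())
# [[1 0 0 0 1]] -> counts for 'and' and 'tensorflow' only
```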
TF-IDF: Term Frequency–Inverse Document Frequency¶
TF-IDF downweights common words (like "is", "the") and upweights rare but important ones.
Formula:
- TF (term frequency): how often a term appears in a given document
- IDF (inverse document frequency): how rare the term is across all documents
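Written out, tf-idf is the product of the two (a standard textbook form; scikit-learn's default differs slightly, as noted below):

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t),
\qquad
\text{idf}(t) = \log \frac{N}{\text{df}(t)}
$$

Here N is the total number of documents and df(t) is the number of documents containing term t. By default, TfidfVectorizer uses the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 and then L2-normalizes each document vector, so its output differs slightly from the textbook formula.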
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuse the same two example documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # TF-IDF weights, L2-normalized per document
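As a quick sanity check on these weights (a minimal sketch reusing the vectors just computed): "tensorflow" appears in both documents, so its IDF is low, while "great" appears only in Doc 1 and gets a higher weight there.

```python
# Map each vocabulary term to its TF-IDF weight in Doc 1
weights = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))

# "great" (rare across the corpus) outweighs "tensorflow" (common to both docs)
print(weights["great"] > weights["tensorflow"])  # True
```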
When to Use What?¶
| Use Case | Technique |
|---|---|
| Small text datasets | TF-IDF or BoW |
| Quick baseline for classification | TF-IDF (see the sketch below) |
| Sparse, interpretable features | BoW |
| Neural network input | Embeddings |
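To make the "quick baseline" row concrete, here is a minimal sketch of a TF-IDF classification baseline; the toy texts and labels are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy sentiment data: 1 = positive, 0 = negative
train_texts = ["great framework", "love keras", "terrible docs", "not good"]
train_labels = [1, 1, 0, 0]

# TF-IDF features feeding a linear classifier: a classic, strong baseline
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)
print(baseline.predict(["great keras docs"]))
```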
Optional: Visualize TF-IDF Features¶
import pandas as pd

# Rows = documents, columns = vocabulary terms, values = TF-IDF weights
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
This tabular view helps you:
- Identify key words driving predictions
- Visualize feature distributions (a minimal plotting sketch follows)
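One way to eyeball those distributions (a minimal sketch, assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

# One group of bars per term, one bar per document
df.T.plot(kind="bar")
plt.ylabel("TF-IDF weight")
plt.legend(["Doc 1", "Doc 2"])
plt.tight_layout()
plt.show()
```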
Limitations¶
- Doesn’t consider word order ("not good" vs "good" share the same unigram features) — see the n-gram sketch after this list
- The vocabulary is fixed at fit time; unseen words are dropped
- Large vocabularies produce high-dimensional, sparse vectors
- Can’t capture contextual meaning (a word is weighted the same regardless of its neighbors)
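N-grams can partially recover short-range word order, at the cost of a larger vocabulary. A minimal sketch using CountVectorizer's ngram_range parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams + bigrams: "not good" becomes its own feature, distinct from "good"
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["not good", "good movie"])
print(vectorizer.get_feature_names_out())
# ['good' 'good movie' 'movie' 'not' 'not good']
```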
That’s where embedding-based and deep learning models (RNNs, Transformers) step in.
Summary¶
In this chapter, you:
- Learned the intuition behind BoW and TF-IDF
- Used CountVectorizer and TfidfVectorizer in scikit-learn
- Compared their strengths and limitations
- Set the stage for embedding-based and deep NLP models
Classic vectorization methods are not obsolete—they’re foundational, fast, and still relevant in pipelines and search systems.
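As a closing illustration of the search-system point, here is a minimal sketch of TF-IDF retrieval ranked by cosine similarity; the document collection and query are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["TensorFlow is great", "TensorFlow and Keras", "Keras makes models easy"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Rank documents by cosine similarity to the query's TF-IDF vector
query_vector = vectorizer.transform(["keras models"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```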