Chapter 25: NLP Projects — Spam Detection, Sentiment Analysis, Autocomplete
“Language is powerful. Teaching machines to understand it unlocks infinite possibilities.”
With your knowledge of tokenization, vectorization, RNNs, and Transformers, you're now ready to build end-to-end NLP applications. In this chapter, we’ll walk through three mini-projects using TensorFlow:
- Spam Detection – Binary classification using classic vectorization
- Sentiment Analysis – Text-to-emotion prediction using LSTM
- Autocomplete – Next-word prediction using a Transformer
Each project includes:
- Data loading
- Text preprocessing
- Model architecture
- Training loop
- Evaluation and usage
1. Spam Detection with TF-IDF
Dataset: SMS Spam Collection (UCI)
import pandas as pd

# Load the tab-separated file of labeled SMS messages and map labels to 0/1
df = pd.read_csv(
    "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv",
    sep='\t', names=['label', 'message'])
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
Preprocessing with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split first, then fit TF-IDF on the training text only, to avoid leaking
# test-set statistics into the features
train_text, test_text, y_train, y_test = train_test_split(
    df['message'], df['label'].values, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer(max_features=1000)
x_train = vectorizer.fit_transform(train_text).toarray()
x_test = vectorizer.transform(test_text).toarray()
Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(64, activation='relu', input_shape=(1000,)),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_split=0.2)
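To check the model against the held-out test set, here is a minimal evaluation sketch using scikit-learn's metrics (the 0.5 decision threshold is a common default, not something fixed by the model):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Threshold the sigmoid outputs at 0.5 to get hard 0/1 predictions
y_pred = (model.predict(x_test) > 0.5).astype(int).ravel()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
```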
2. Sentiment Analysis with LSTM
Dataset: IMDB Reviews
import tensorflow as tf

# Reviews arrive pre-tokenized as integer sequences; pad/truncate to 200 tokens
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=10000)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=200)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=200)
Model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_split=0.2)
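To try the trained model on raw text, a review has to be encoded with the same word index the dataset uses. A minimal sketch (note that IMDB ids are the word's frequency rank plus an offset of 3, with 0 = padding, 1 = start, 2 = unknown; `encode_review` is an illustrative helper, not a library function):

```python
word_index = tf.keras.datasets.imdb.get_word_index()

def encode_review(text, maxlen=200, num_words=10000):
    ids = [1]  # 1 marks the start of a sequence
    for word in text.lower().split():
        idx = word_index.get(word, -1) + 3  # dataset ids = frequency rank + 3
        ids.append(idx if 2 < idx < num_words else 2)  # 2 = unknown word
    return tf.keras.preprocessing.sequence.pad_sequences([ids], maxlen=maxlen)

score = model.predict(encode_review("a wonderful heartfelt film"))[0, 0]
print("positive" if score > 0.5 else "negative", score)
```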
3. Autocomplete with Mini Transformer
This is a simplified version, using a small vocabulary.
Sample Corpus
sentences = [
    "hello how are you",
    "hello how is your day",
    "i love natural language processing",
    "tensorflow is powerful"
]
Vectorize & Build Transformer
Use TextVectorization and the custom Transformer layers built in Chapter 24, then train the result as a language model: for every input sequence, predict the next word. A runnable sketch of the recipe follows.
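The sketch below prepares shifted (input, next-word) pairs with TextVectorization and trains a tiny stand-in model; the LSTM layer is a placeholder for the TransformerBlock from Chapter 24 (not reproduced here), and all sizes and epoch counts are illustrative:

```python
import tensorflow as tf

# Tokenize the corpus; every sentence becomes a fixed-length integer sequence
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=6)
vectorizer.adapt(sentences)
vocab_size = vectorizer.vocabulary_size()

# Language-model targets: shift each sequence left by one token,
# so the model learns to predict token t+1 from tokens up to t
seqs = vectorizer(sentences)
x, y = seqs[:, :-1], seqs[:, 1:]

# Stand-in body: swap the LSTM for the TransformerBlock from Chapter 24
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x, y, epochs=200, verbose=0)  # tiny corpus, so many cheap epochs

# Autocomplete "hello how": read the distribution at the last real token
vocab = vectorizer.get_vocabulary()
probs = model.predict(vectorizer(["hello how"]))[0, 1]  # position of "how"
top_k = tf.math.top_k(probs, k=3).indices.numpy()
print([vocab[i] for i in top_k])  # e.g. ['are', 'is', ...]
```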
Evaluation Tips
| Project | Metric | Evaluation Example |
|---|---|---|
| Spam Detection | Accuracy, F1 | Confusion matrix on the test set |
| Sentiment Analysis | Accuracy | Spot-check predictions on hand-written reviews |
| Autocomplete | Top-k accuracy | “hello how” → “are”, “is”, “you” |
Use TensorBoard to visualize training curves, and checkpoints to save your models.
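For example, with placeholder paths for the log directory and checkpoint file:

```python
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir="logs"),  # training curves
    tf.keras.callbacks.ModelCheckpoint("best_model.keras",
                                       save_best_only=True),  # keep best weights
]
model.fit(x_train, y_train, epochs=5, validation_split=0.2, callbacks=callbacks)
```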
Deployment Ideas
| Task | Deploy As |
|---|---|
| Spam Detection | FastAPI REST API |
| Sentiment Model | Hugging Face + Gradio |
| Autocomplete | TensorFlow Lite on mobile |
All three can be integrated into websites, apps, or backend systems for real-time inference.
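As one example, the spam model could be wrapped in a minimal FastAPI service. This is a sketch, not a production setup: it assumes the `vectorizer` and `model` from project 1 are in scope, and the endpoint name is illustrative.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    text: str

@app.post("/predict")
def predict(msg: Message):
    # Reuse the fitted TF-IDF vectorizer and trained Keras model
    features = vectorizer.transform([msg.text]).toarray()
    score = float(model.predict(features)[0, 0])
    return {"spam": score > 0.5, "score": score}
```

Assuming the file is named app.py, run it with `uvicorn app:app` and POST JSON like `{"text": "win a free prize now"}` to `/predict`.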
Summary
In this chapter, you:
- Built three full NLP projects from start to finish
- Used both traditional ML and deep learning
- Combined text preprocessing, tokenization, and modeling
- Prepared these models for deployment and real-world use
These projects are a stepping stone to larger systems: chatbots, translation, summarization, and beyond.