Step-by-Step Guide to Text Sentiment Analysis on IMDB Reviews Using Python

Text sentiment analysis is the task of classifying a piece of text as expressing positive or negative sentiment. This blog post will guide you through implementing a simple sentiment analysis model for IMDB movie reviews using Python and NumPy. Let’s break the process down into easy-to-understand steps.

1. Data Preprocessing

Preprocessing Text

Before analyzing text, we need to preprocess it. This involves converting the text to lowercase and removing special characters. This step ensures consistency and helps the model focus on the meaningful parts of the text.

import re

def preprocess_text(text):
    # Convert to lowercase and remove special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    return text.split()
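
For example (an illustrative string, not from the dataset), the function strips punctuation, digits, and case before splitting on whitespace:

words = preprocess_text("I LOVED this movie!!! 10/10.")
print(words)  # ['i', 'loved', 'this', 'movie']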

Building the Vocabulary

The vocabulary is the set of the most common words in the dataset. We’ll limit it to the 5,000 most frequent words to keep the feature vectors and the computation manageable.

from collections import Counter

def build_vocabulary(texts, vocab_size=5000):
    # Count every token across all texts and keep the vocab_size most common
    all_words = [word for text in texts for word in preprocess_text(text)]
    word_counts = Counter(all_words)
    vocab = [word for word, _ in word_counts.most_common(vocab_size)]
    word_to_index = {word: i for i, word in enumerate(vocab)}
    return vocab, word_to_index
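
A quick sanity check on two toy reviews shows what the function returns:

vocab, word_to_index = build_vocabulary(["Great movie", "Terrible movie"])
print(vocab)          # ['movie', 'great', 'terrible']
print(word_to_index)  # {'movie': 0, 'great': 1, 'terrible': 2}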

2. Feature Vectorization

Vectorizing Text

We convert each review into a numerical vector using the bag-of-words model: each entry counts how often a vocabulary word appears in the review. We then scale the vector to unit length so that longer reviews don’t produce systematically larger features.

import numpy as np

def vectorize_text(text, word_to_index, vocab_size=5000):
    # Bag-of-words: count occurrences of each vocabulary word
    words = preprocess_text(text)
    vector = np.zeros(vocab_size)
    for word in words:
        if word in word_to_index:
            vector[word_to_index[word]] += 1
    # Normalize to unit length; guard against the all-zero vector
    # (a review containing no vocabulary words) to avoid division by zero
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
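
Reusing the toy vocabulary from the previous snippet, vectorizing a review yields a unit-length count vector:

vec = vectorize_text("Great great movie", word_to_index, vocab_size=3)
print(vec)  # [0.447 0.894 0.   ] -> counts [1, 2, 0] scaled to unit length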

3. Logistic Regression Model

Sigmoid Activation Function

The sigmoid function maps any real number into the interval (0, 1), which is what lets logistic regression interpret its output as a probability.

def sigmoid(x):
    # Clip to avoid overflow warnings in np.exp for large negative inputs
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))
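
A few reference values show the shape of the curve; this is just a sanity check you can run in a REPL:

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.982
print(sigmoid(-4))  # ~0.018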

Training the Model

We use gradient descent to optimize the model’s weights and bias, minimizing the binary cross-entropy loss -mean(y·log(p) + (1-y)·log(1-p)), where p is the predicted probability and y is the true label.

class IMDBSentimentAnalysis:
    def __init__(self, vocab_size=5000):
        self.vocab_size = vocab_size
        self.vocab = None
        self.word_to_index = None
        self.weights = None
        self.bias = None

    def train(self, X_train, y_train, learning_rate=0.1, epochs=100):
        num_samples, num_features = X_train.shape
        self.weights = np.zeros(num_features)
        self.bias = 0

        for epoch in range(epochs):
            # Forward pass: linear score followed by sigmoid
            z = np.dot(X_train, self.weights) + self.bias
            predictions = sigmoid(z)

            # Binary cross-entropy loss; clip predictions away from 0 and 1
            # so np.log never sees an exact zero
            eps = 1e-12
            p = np.clip(predictions, eps, 1 - eps)
            loss = -np.mean(y_train * np.log(p) + (1 - y_train) * np.log(1 - p))

            # Gradients of the loss with respect to weights and bias
            dw = np.dot(X_train.T, (predictions - y_train)) / num_samples
            db = np.mean(predictions - y_train)

            self.weights -= learning_rate * dw
            self.bias -= learning_rate * db

            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

    def predict(self, X):
        z = np.dot(X, self.weights) + self.bias
        return sigmoid(z)
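
Why do the gradients take such a compact form? Combining the chain rule with the sigmoid’s derivative p(1 - p) collapses the derivative of the loss to dw = Xᵀ(p − y) / N. If you want to convince yourself, here is an optional finite-difference check on tiny random data (a verification snippet, not part of the tutorial pipeline):

# Compare the analytic BCE gradient against a central finite difference
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([1.0, 0.0, 1.0, 0.0])
w = rng.normal(size=3)

def bce(w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

analytic = X.T @ (sigmoid(X @ w) - y) / len(y)
h = 1e-6
numeric = np.array([(bce(w + h * np.eye(3)[i]) - bce(w - h * np.eye(3)[i])) / (2 * h)
                    for i in range(3)])
print(np.allclose(analytic, numeric))  # True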

4. Putting It All Together

Data Loading

We’ll assume a function load_imdb_data() that loads the IMDB dataset. This function should return training and test datasets with their corresponding labels.

def load_imdb_data():
    # Implement IMDB data loading here.
    # For this example, we'll use placeholder data.
    X_train = ["I love this movie", "I hate this movie"]
    y_train = np.array([1, 0])
    X_test = ["This movie is great", "This movie is awful"]
    y_test = np.array([1, 0])
    return X_train, y_train, X_test, y_test
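
If you want real data, one option is the Stanford Large Movie Review Dataset (aclImdb), which ships as one .txt file per review under train/pos, train/neg, test/pos, and test/neg. A minimal loader sketch, assuming the archive is extracted to a local aclImdb/ directory:

from pathlib import Path

def load_imdb_from_disk(root="aclImdb"):
    def read_split(split):
        texts, labels = [], []
        for label, folder in [(1, "pos"), (0, "neg")]:
            for path in Path(root, split, folder).glob("*.txt"):
                texts.append(path.read_text(encoding="utf-8"))
                labels.append(label)
        return texts, np.array(labels)

    X_train, y_train = read_split("train")
    X_test, y_test = read_split("test")
    return X_train, y_train, X_test, y_test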

Main Execution

if __name__ == "__main__":
    X_train, y_train, X_test, y_test = load_imdb_data()

    model = IMDBSentimentAnalysis(vocab_size=5000)

    # Build the vocabulary from the training texts only, then vectorize both splits
    model.vocab, model.word_to_index = build_vocabulary(X_train)
    X_train_vectorized = np.array([vectorize_text(text, model.word_to_index) for text in X_train])
    X_test_vectorized = np.array([vectorize_text(text, model.word_to_index) for text in X_test])

    model.train(X_train_vectorized, y_train, learning_rate=0.1, epochs=100)

    # Threshold the predicted probabilities at 0.5 to get class labels
    train_predictions = model.predict(X_train_vectorized)
    test_predictions = model.predict(X_test_vectorized)

    train_accuracy = np.mean((train_predictions >= 0.5) == y_train)
    test_accuracy = np.mean((test_predictions >= 0.5) == y_test)

    print(f"Train Accuracy: {train_accuracy}")
    print(f"Test Accuracy: {test_accuracy}")

Summary

This implementation demonstrates a basic sentiment analysis model for IMDB movie reviews using logistic regression. Here are the key steps:

  1. Data Preprocessing: Convert text to lowercase, remove special characters, and split into words.
  2. Building Vocabulary: Create a vocabulary of the most common words.
  3. Feature Vectorization: Convert text into numerical vectors using the bag-of-words model.
  4. Model Training: Use logistic regression with gradient descent to train the model.
  5. Prediction and Evaluation: Make predictions and evaluate the model’s accuracy.

Further Improvements

To improve this implementation, consider:

  • Using advanced text preprocessing techniques (e.g., stemming, lemmatization).
  • Implementing n-gram features.
  • Adding regularization to prevent overfitting (see the sketch after this list).
  • Using more advanced optimization techniques (e.g., Adam optimizer).
  • Implementing cross-validation for more robust evaluation.
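
As a taste of the regularization point above, here is a minimal sketch of an L2-regularized weight update. lambda_reg is a hypothetical hyperparameter you would tune on a validation set; the bias is conventionally left unregularized:

# Sketch: replace the dw computation inside train() with an L2-regularized
# version. The penalty lambda_reg * ||w||^2 shrinks weights toward zero;
# its gradient contributes 2 * lambda_reg * w to dw.
lambda_reg = 0.01  # assumed value; tune on a validation set
dw = (np.dot(X_train.T, (predictions - y_train)) / num_samples
      + 2 * lambda_reg * self.weights)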

This simple yet effective model provides a foundation for text sentiment analysis, which can be expanded and refined for more complex applications.
