Lesson 26: Project Workshop — Problem Definition & Data
Define the final project problem; load and thoroughly explore the dataset.
Project Workshop: Define the Problem and Explore the Data
It's time for the capstone project! You will build a complete sentiment analysis system that classifies movie reviews as positive or negative. We'll put together everything you've learned: data exploration, preprocessing, feature engineering, model selection, evaluation, and interpretation.
Step 1: The Problem Statement
We are dealing with a binary text classification problem. Given the text of a movie review, our goal is to predict the label: Positive or Negative.
Step 2: Explore the Dataset
We have a dataset of 2,000 movie reviews (1,000 positive and 1,000 negative). Before jumping into modeling, you must understand your data.
Coding Challenge: Data Exploration
Use pandas to complete the tasks below and answer the questions along the way:
- Load the dataset into a pandas DataFrame.
- Check the class balance using df['sentiment'].value_counts(). Are the classes perfectly balanced?
- Examine 5 sample reviews from each class.
- Write a short script to calculate the average review length (in words) for positive vs. negative reviews. Is there a noticeable difference?
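As a starting point, the exploration steps above could be sketched as follows. This is a minimal sketch, not a reference solution: the tiny inline DataFrame is a hypothetical stand-in for the real 2,000-review dataset, and the text column name `review` is an assumption (only `sentiment` is named in this lesson). In the project you would load the actual file instead, for example with `pd.read_csv`, using whatever path and column names your copy of the data has.

```python
import pandas as pd

# Hypothetical stand-in data so the sketch runs on its own.
# In the project, replace this with a load of the real dataset.
# The column name "review" is an assumption; "sentiment" comes from the lesson.
df = pd.DataFrame({
    "review": [
        "A wonderful film with a moving story",
        "Great acting and a clever script",
        "Boring and far too long",
        "A dull plot and terrible dialogue throughout",
    ],
    "sentiment": ["positive", "positive", "negative", "negative"],
})

# Class balance: number of reviews per label.
class_counts = df["sentiment"].value_counts()
print(class_counts)

# Up to 5 reviews from each class (the first rows of each group).
samples = df.groupby("sentiment").head(5)
for label, group in samples.groupby("sentiment"):
    print(f"\n--- {label} ---")
    for text in group["review"]:
        print(text)

# Average review length in words, per class.
# Splitting on whitespace is a rough word count; punctuation stays attached.
df["word_count"] = df["review"].str.split().str.len()
avg_words = df.groupby("sentiment")["word_count"].mean()
print(avg_words)
```

`head(5)` always returns the first rows of each class, so the samples are not random. On the full dataset, `df.groupby("sentiment").sample(n=5)` is one way to draw random examples instead, provided each class has at least 5 reviews.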