Lesson 27: Project Workshop — Feature Engineering & Baseline

Project Workshop: Feature Engineering & Baseline Model

With our data explored, it's time to prepare it for our machine learning algorithms. We need to convert our raw text into numbers and establish a baseline model.

Step 1: Text Preprocessing and Feature Engineering

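Our models cannot read English; they need numbers. We will use scikit-learn's TfidfVectorizer to convert each review into a vector of TF-IDF (term frequency-inverse document frequency) features. A word's weight grows with how often it appears in a review and shrinks with how many reviews contain it, so words that show up everywhere (like "the") count for little, while distinctive words count for more.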

Step 2: Train/Test Split

We must set aside some data to evaluate our model later. A common split is 80% for training and 20% for testing. Make sure to use stratified sampling so that the training and test sets each keep the same proportion of positive and negative reviews as the full dataset.
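
A small sketch of a stratified 80/20 split with scikit-learn's train_test_split. The placeholder texts, the 75/25 class ratio, and random_state=42 are assumptions chosen for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 20 reviews, 15 positive (1) and 5 negative (0).
texts = [f"review {i}" for i in range(20)]
labels = [1] * 15 + [0] * 5

# stratify=labels keeps the 75/25 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(Counter(y_train))  # class counts in the training set
print(Counter(y_test))   # class counts in the test set
```

Without stratify, a random split of a small or imbalanced dataset can leave the test set with very few examples of one class, which makes the accuracy estimate less reliable.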

Step 3: Baseline Model

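A baseline model is a simple, standard model we use as a benchmark. If a more complex model can't beat the baseline, its extra complexity isn't worth it! We will use Logistic Regression, which trains quickly and is a common first choice for text classification on TF-IDF features.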

Coding Challenge: Build the Baseline

Put together your preprocessing and baseline model:

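1. Fit a TfidfVectorizer on the training data only, so that no information from the test set leaks into the vocabulary. Let's try max_features=3000.
2. Transform both the training and test data with the fitted vectorizer.
3. Train a LogisticRegression model on the transformed training data.
4. Predict on the transformed test data and calculate the accuracy. This is your baseline score!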
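
One possible way to put the challenge steps together is sketched below. The 20 short reviews are a hypothetical stand-in for the workshop dataset, and the variable names, max_iter=1000, and random_state=42 are illustrative choices rather than requirements. Note that the split happens before the vectorizer is fitted, so the vocabulary is learned from training data only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; replace with the workshop's reviews and labels.
positive = [
    "loved this film", "great acting and story", "wonderful and moving",
    "brilliant, would watch again", "a fantastic experience",
    "really enjoyable movie", "superb cast", "heartwarming and funny",
    "excellent direction", "amazing soundtrack",
]
negative = [
    "hated this film", "terrible acting and story", "boring and slow",
    "awful, would not watch again", "a dreadful experience",
    "really disappointing movie", "weak cast", "dull and unfunny",
    "poor direction", "annoying soundtrack",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Stratified 80/20 split, done before any fitting.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 1. Fit the vectorizer on the training text only.
vectorizer = TfidfVectorizer(max_features=3000)
X_train_tfidf = vectorizer.fit_transform(X_train_text)

# 2. Reuse the fitted vocabulary to transform the test text.
X_test_tfidf = vectorizer.transform(X_test_text)

# 3. Train the baseline classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# 4. Predict on the test set and score.
y_pred = model.predict(X_test_tfidf)
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

On a toy dataset this small the score is not meaningful; the point is the order of operations. With the real data, the printed number is the baseline that later models need to beat.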