
Lesson 27: Project Workshop — Feature Engineering & Baseline

Preprocess text data for modeling; build and evaluate a baseline model.

Project Workshop: Feature Engineering & Baseline Model

With our data explored, it's time to prepare it for our machine learning algorithms. We need to convert our raw text into numbers and establish a baseline model.

Step 1: Text Preprocessing and Feature Engineering

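Our models cannot read English; they need numbers. We will use scikit-learn's TfidfVectorizer to convert each review into a vector of TF-IDF features: one number per vocabulary word, higher when the word appears often in that review and lower when the word is common across many reviews.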
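
To make this concrete, here is a minimal sketch of TfidfVectorizer on three made-up reviews. The review texts are hypothetical and only for illustration; the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, invented for illustration only.
reviews = [
    "great movie, great acting",
    "terrible plot and terrible acting",
    "great plot",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)  # sparse matrix: one row per review

# vocabulary_ maps each learned word to its column index in X.
print(sorted(vectorizer.vocabulary_))
print(X.shape)               # (number of reviews, number of vocabulary words)
print(X.toarray().round(2))  # dense view is fine for a toy example
```

Each row is one review and each column is one word. In the first review, "great" should score higher than "movie" because it appears twice there.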

Step 2: Train/Test Split

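We must set aside some data to evaluate our model later. A common split is 80% for training and 20% for testing. Use stratified sampling so that both sets keep the same proportion of positive and negative reviews as the full dataset. Do the split before fitting the vectorizer, so the test reviews do not influence the vocabulary or the IDF weights.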
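
A stratified 80/20 split could look like the sketch below. The texts and labels are hypothetical placeholders (60 positive and 40 negative reviews), and random_state=42 is an arbitrary choice that just makes the split repeatable.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 60 positive (1) and 40 negative (0) reviews.
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

texts_train, texts_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 20% of the data is held out for testing
    stratify=labels,   # keep the 60/40 class ratio in both sets
    random_state=42,   # arbitrary seed so the split is reproducible
)

print(Counter(y_train))
print(Counter(y_test))
```

Both counters should show the same 60/40 ratio as the full dataset, which is what stratification is for.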

Step 3: Baseline Model

A baseline model is a simple, standard model we use as a benchmark. If a complex model can't beat the baseline, it's not worth using! We will use Logistic Regression.
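
Before applying it to text, it can help to see Logistic Regression on its own. The sketch below uses a made-up one-dimensional feature (think of it as a hypothetical "positivity score" per review) so we can focus on fit, predict, and predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature values: low scores are negative (0), high are positive (1).
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted class labels for two new, unseen values.
print(model.predict([[0.15], [0.85]]))

# Class probabilities: one column per class, each row sums to 1.
print(model.predict_proba([[0.85]]))
```

In the project, the single made-up feature is replaced by the thousands of TF-IDF columns, but the calls stay the same.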

Coding Challenge: Build the Baseline

Put together your preprocessing and baseline model:

  1. Fit a TfidfVectorizer on the training data only. Let's try max_features=3000, which keeps the 3,000 most frequent terms.
  2. Transform both the training and test data.
  3. Train a LogisticRegression model on the training data.
  4. Predict on the test data and calculate the accuracy. This is your baseline score!
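
If you get stuck, here is one possible sketch of the four steps. It is not the only valid solution. The review texts are hypothetical stand-ins, so swap in the project's own texts and labels. max_iter=1000 is an assumption added to give the solver room to converge; it is not part of the challenge.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; replace with the project's reviews and labels.
positive = ["great film", "loved it", "wonderful acting",
            "excellent story", "really enjoyable"]
negative = ["terrible film", "hated it", "awful acting",
            "boring story", "really disappointing"]
texts = positive * 10 + negative * 10
labels = [1] * 50 + [0] * 50

texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 1. Fit the vectorizer on the training text only.
vectorizer = TfidfVectorizer(max_features=3000)
X_train = vectorizer.fit_transform(texts_train)

# 2. Transform the test text with the already-fitted vectorizer.
X_test = vectorizer.transform(texts_test)

# 3. Train the baseline model on the training features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Predict on the test set and compute accuracy.
y_pred = model.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

The toy data repeats the same few sentences, so the score it prints says nothing about real reviews. Only the number you get on the project dataset counts as your baseline.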
