Lesson 27: Project Workshop — Feature Engineering & Baseline
Preprocess text data for modeling; build and evaluate a baseline model.
Project Workshop: Feature Engineering & Baseline Model
With our data explored, it's time to prepare it for our machine learning algorithms. We need to convert our raw text into numbers and establish a baseline model.
Step 1: Text Preprocessing and Feature Engineering
Our models cannot read English; they need numbers. We will use a TfidfVectorizer to convert the text into TF-IDF features.
Step 2: Train/Test Split
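We must set aside some data to evaluate our model later. A common split is 80% for training and 20% for testing. Make sure to use stratified sampling so that both sets keep the same proportion of positive and negative reviews as the full dataset. Do the split before fitting the vectorizer, so the test set stays unseen until evaluation.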
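A stratified 80/20 split might look like the sketch below. The placeholder texts and the 60/40 label mix are assumptions chosen so the class counts are easy to check; `random_state=42` is an arbitrary seed for reproducibility.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 30 positive (1) and 20 negative (0) reviews.
texts = [f"review {i}" for i in range(50)]
labels = [1] * 30 + [0] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # 20% held out for testing
    stratify=labels,    # keep the class proportions in both sets
    random_state=42,    # arbitrary seed so the split is repeatable
)

print(Counter(y_train))
print(Counter(y_test))
```

Both sets end up with the same 60/40 mix as the full data, which is what stratification is for.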
Step 3: Baseline Model
A baseline model is a simple, standard model we use as a benchmark. If a complex model can't beat the baseline, it's not worth using! We will use Logistic Regression.
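As a rough sketch of how a baseline number is produced, the example below fits Logistic Regression on synthetic numeric data and records its test accuracy. The dataset comes from `make_classification` and is purely an assumption for illustration; `max_iter=1000` is just a generous iteration limit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real features (not the project data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any more complex model we try later should be compared against this number on the same test set.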
Coding Challenge: Build the Baseline
Put together your preprocessing and baseline model:
- Fit a TfidfVectorizer on the training data. Let's try max_features=3000.
- Transform both the training and test data.
- Train a LogisticRegression model on the training data.
- Predict on the test data and calculate the accuracy. This is your baseline score!
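If you get stuck, the steps above can be sketched end to end as follows. This is one possible outline, not the only solution. The toy reviews and the variable names are hypothetical stand-ins for your own data, and because the toy rows repeat, the score printed here says nothing about a real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; replace with your review texts and labels.
pos = ["great film", "loved the acting", "wonderful story",
       "excellent and fun", "really enjoyable movie"]
neg = ["terrible film", "hated the acting", "boring story",
       "awful and dull", "really bad movie"]
texts = pos * 4 + neg * 4
labels = [1] * 20 + [0] * 20

# Split first so the test set stays unseen by the vectorizer.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit the vectorizer on the training data only, then transform both sets.
vectorizer = TfidfVectorizer(max_features=3000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Train the baseline model on the training features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set and compute accuracy: the baseline score.
y_pred = model.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Note that `fit_transform` is called only on the training text, while the test text goes through `transform`, so no vocabulary or IDF statistics leak from the test set.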