Lesson 13: The ML Workflow

The ML Workflow: From Data to Predictions

Machine learning is more than training a model: it is a complete pipeline that runs from understanding the problem to making predictions. A typical ML workflow has seven steps:

1. Define the problem: What are you trying to predict?
2. Collect & explore data: Gather relevant data and understand its properties.
3. Prepare data: Handle missing values, encode categorical and text features as numbers, and split into train/test sets.
4. Choose a model: Pick an algorithm that suits the task.
5. Train the model: Let the algorithm find patterns in the training data.
6. Evaluate the model: Measure its performance on data it has not seen during training.
7. Improve & iterate: Tune hyperparameters or add more data, then evaluate again.
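
To make the steps above concrete, here is a minimal sketch of the first six of them using scikit-learn. The dataset (the built-in iris flowers), the model (LogisticRegression), and the random_state value are all assumptions chosen for illustration; the lesson itself does not prescribe them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Define the problem and collect data: predict the iris species from 4 measurements
X, y = load_iris(return_X_y=True)
print(f"Features shape: {X.shape}, labels shape: {y.shape}")

# Prepare data: hold out 20% as a test set (this dataset has no missing values)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Choose a model (an illustrative choice, not the only reasonable one)
model = LogisticRegression(max_iter=1000)

# Train the model on the training set only
model.fit(X_train, y_train)

# Evaluate the model on data it has not seen
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")

# Improve & iterate would come next: change the model or its settings and repeat
```

Notice that the test set is created before training and is only touched once, at evaluation time.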

The Golden Rule: Train/Test Split

Imagine giving a student the answer key to a test before they take it. They would score 100%, but you wouldn't know if they actually learned the material. In ML, we never evaluate our model on the same data it learned from. We split our dataset into a Training Set, used to fit the model, and a Test Set, held back and used only for evaluation.
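
The answer-key problem can be shown in a few lines. The sketch below is an illustration under made-up assumptions: the features and labels are pure random noise, so there is nothing real to learn, and a 1-nearest-neighbor classifier is used because it simply memorizes its training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random features with random 0/1 labels: there is no pattern to learn
X_seen = rng.normal(size=(200, 5))
y_seen = rng.integers(0, 2, size=200)

# Fresh data from the same random process, never shown to the model
X_unseen = rng.normal(size=(200, 5))
y_unseen = rng.integers(0, 2, size=200)

# With one neighbor, each training point's nearest neighbor is itself
model = KNeighborsClassifier(n_neighbors=1).fit(X_seen, y_seen)

seen_accuracy = model.score(X_seen, y_seen)
unseen_accuracy = model.score(X_unseen, y_unseen)

print(f"Accuracy on data the model learned from: {seen_accuracy:.2f}")
print(f"Accuracy on unseen data: {unseen_accuracy:.2f}")
```

The score on the memorized data looks perfect even though the model has learned nothing useful, while the score on unseen data should hover around chance. Only the second number tells you how the model would behave in practice.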

Python Challenge: Split the Data!

Can you use scikit-learn to split your data into 80% training and 20% testing?

from sklearn.model_selection import train_test_split
import numpy as np

# X holds the features (50 samples, 2 columns); y holds the 50 matching labels
X = np.arange(100).reshape((50, 2))
y = np.arange(50)

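# TODO: Split the data into X_train, X_test, y_train, y_test
# Hint: pass test_size=0.2 to hold out 20% of the samples for testing
# (the print statements below raise a NameError until this line is filled in)
# X_train, X_test, y_train, y_test = ???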

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
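
Try the challenge yourself first! If you get stuck, here is one possible solution. The random_state=42 argument is an addition that is not part of the challenge: it is assumed here only so that the shuffle gives the same split on every run.

```python
from sklearn.model_selection import train_test_split
import numpy as np

# X holds the features (50 samples, 2 columns); y holds the 50 matching labels
X = np.arange(100).reshape((50, 2))
y = np.arange(50)

# test_size=0.2 holds out 20% of the samples; the rest go to training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")  # 40
print(f"Testing samples: {len(X_test)}")    # 10
```

train_test_split shuffles the rows before splitting by default, and it shuffles X and y together, so every feature row stays paired with its own label.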

Overfitting vs. Underfitting

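During training, a model can memorize the training data, noise included, so that it scores well on the training set but poorly on unseen data (Overfitting). It can also be too simple to capture the underlying pattern, so that it scores poorly on both (Underfitting). We want to find the sweet spot in the middle!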
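
One way to see both failure modes is to compare training and test accuracy as a model gets more flexible. The sketch below is an illustration under assumed choices: a noisy synthetic dataset from make_moons, and decision trees whose max_depth setting controls how complex the model is allowed to become.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy two-class toy dataset (synthetic, for illustration only)
X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_depth limits how complex the tree may get; None means no limit
settings = [("too simple", 1), ("moderate", 4), ("unconstrained", None)]

scores = {}
for name, depth in settings:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = scores[name]
    print(f"{name:>13}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

A low score on both sets points to underfitting. A near-perfect training score paired with a noticeably lower test score points to overfitting. The unconstrained tree memorizes every training point, so watch the gap between its two numbers.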