The ML Workflow: From Data to Predictions
Machine Learning isn't just about training a model; it's a complete pipeline from understanding the problem to making predictions. The typical ML workflow involves 7 steps:
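Imagine giving a student the answer key to a test before they take it. They would score 100%, but you wouldn't know if they actually learned the material. The same holds in ML: we never evaluate a model on the same data it learned from. Instead, we split the dataset into a Training Set, which the model learns from, and a Test Set, which is held back and used only to measure how well the model handles data it has not seen.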
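To make the answer-key analogy concrete, here is a minimal sketch using only the Python standard library. The toy task (squaring a number), the `answer_key` lookup table, and the helper names are all made up for illustration; the 80/20 split is done by hand here rather than with any library.

```python
import random

# Hypothetical toy task: the "correct answer" for a number n is n squared.
rng = random.Random(42)
numbers = list(range(100))
rng.shuffle(numbers)
dataset = [(n, n * n) for n in numbers]

# Hold back the last 20% of the shuffled data as a test set.
split = int(0.8 * len(dataset))
train_set, test_set = dataset[:split], dataset[split:]

# The "student with the answer key": a lookup table of training answers.
answer_key = dict(train_set)

def memorizing_model(n):
    # Returns the memorized answer, or None for a question it has never seen.
    return answer_key.get(n)

def accuracy(model, examples):
    # Fraction of examples where the model's answer matches the label.
    return sum(model(x) == label for x, label in examples) / len(examples)

train_acc = accuracy(memorizing_model, train_set)
test_acc = accuracy(memorizing_model, test_set)
print(f"Training accuracy: {train_acc:.0%}")
print(f"Test accuracy: {test_acc:.0%}")
```

Because the lookup table only contains training questions, this model is perfect on the data it memorized and has no answer for any held-out question. Scoring it on the training set alone would hide that completely.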
Can you use scikit-learn to split your data into 80% training and 20% testing?
from sklearn.model_selection import train_test_split
import numpy as np
# Let's say X is our features and y is our labels
X = np.arange(100).reshape((50, 2))
y = np.arange(50)
# TODO: Split the data into X_train, X_test, y_train, y_test
# Set test_size to 20%
# X_train, X_test, y_train, y_test = ???
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
When training, a model might memorize the training data (Overfitting) or fail to capture the pattern at all (Underfitting). We want to find the sweet spot in the middle!
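The difference between underfitting, a reasonable fit, and overfitting can be sketched by comparing training error with test error. The example below is only an illustration under stated assumptions: the noisy linear data, the three toy models, and all names are hypothetical, and it uses the standard library with a hand-rolled 80/20 split so it has no dependencies.

```python
import random

def mse(model, points):
    """Mean squared error of `model` (a function x -> prediction) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Hypothetical toy data: y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)
xs = [i * 0.2 for i in range(50)]
data = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in xs]
rng.shuffle(data)
train, test = data[:40], data[40:]  # 80% training, 20% testing

# Underfitting: ignore x and always predict the mean training label.
mean_y = sum(y for _, y in train) / len(train)

def constant_model(x):
    return mean_y

# Reasonable fit: a least-squares straight line through the training data.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line_model(x):
    return slope * x + intercept

# Overfitting: memorize the training set and echo the label of the nearest stored x.
def memorizer_model(x):
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

models = {"constant": constant_model, "line": line_model, "memorizer": memorizer_model}
results = {name: (mse(model, train), mse(model, test)) for name, model in models.items()}

for name, (train_mse, test_mse) in results.items():
    print(f"{name:>10}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

The constant model has a large error on both sets, which is the signature of underfitting. The memorizer reaches zero training error because it reproduces the noise in every training label, so its training score says nothing about how it handles new points. A gap between training and test error is the usual warning sign of overfitting, while similar, low errors on both sets suggest the model sits near the sweet spot.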