Lesson 20: Feature Engineering — Making Data Model-Ready

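Most machine learning models operate only on numbers. A column of text categories such as "Red", "Blue", and "Green" cannot be fed to them directly. Numeric columns on very different scales cause a different problem: with age in years next to income in dollars, the larger-valued column can dominate whatever the model computes. Feature engineering is the work of turning raw columns like these into inputs a model can use well.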

"Better features beat better algorithms."

Key Techniques

One-Hot Encoding: Turns a categorical column into multiple binary (1 or 0) columns, one per category. E.g., a color column becomes Is_Red, Is_Blue, and Is_Green.
Feature Scaling: Rescales numerical features so they are on the same scale. For example, standardizing features to have a mean of 0 and a standard deviation of 1. Distance-based algorithms like KNN are highly sensitive to unscaled data, because the feature with the largest numeric range dominates the distance calculation!
Feature Creation: Combining columns to make a more predictive feature, like calculating `FamilySize = Siblings + Parents + 1`.
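
To make one-hot encoding and feature creation concrete, here is a minimal sketch done by hand in plain Python. The records and field names (color, siblings, parents) are made up for illustration, and the "+1" in FamilySize is assumed to count the person themselves.

```python
# Hypothetical toy records; the field names are illustrative only.
people = [
    {"color": "Red", "siblings": 1, "parents": 0},
    {"color": "Blue", "siblings": 0, "parents": 2},
    {"color": "Green", "siblings": 3, "parents": 1},
]

# One-hot encoding: add one binary column per distinct category.
categories = sorted({p["color"] for p in people})
for p in people:
    for c in categories:
        p[f"Is_{c}"] = 1 if p["color"] == c else 0

# Feature creation: FamilySize = Siblings + Parents + 1
# (assumption: the +1 counts the person themselves).
for p in people:
    p["FamilySize"] = p["siblings"] + p["parents"] + 1

for p in people:
    print(p)
```

In practice you would more likely reach for a library helper such as pandas.get_dummies or scikit-learn's OneHotEncoder. The loop above is only meant to show what those helpers compute.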

Python Challenge: Scale it Up

Use scikit-learn's StandardScaler to scale some numerical data.

from sklearn.preprocessing import StandardScaler
import numpy as np

# Age (years) and Income ($)
X = np.array([[25, 40000], 
              [45, 85000], 
              [30, 50000]])

# TODO: Initialize StandardScaler
# scaler = ???

# TODO: Fit the scaler to X and transform X
# X_scaled = ???

# print(X_scaled)
# Notice how the values are now centered around 0!
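
Once you have attempted the challenge, one way to check your result is to compute the same standardization by hand with NumPy. This sketch assumes StandardScaler's default settings, which subtract each column's mean and divide by its population standard deviation (ddof=0, which is also NumPy's default). The name X_check is just an illustrative choice.

```python
import numpy as np

# Same data as the challenge: Age (years) and Income ($)
X = np.array([[25, 40000],
              [45, 85000],
              [30, 50000]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: subtract the mean, divide by the standard deviation.
X_check = (X - mean) / std

print(X_check)
print(X_check.mean(axis=0))  # approximately 0 for each column
print(X_check.std(axis=0))   # approximately 1 for each column
```

If your X_scaled from the challenge matches X_check (for example via np.allclose(X_scaled, X_check)), the scaler was fitted and applied correctly.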