Lesson 2: Optimization (Gradient Descent & Adam)

The gradient of the loss points in the direction of steepest increase in error, so stepping in the opposite direction reduces the error fastest. But how far should each step go? That is governed by optimization algorithms.

The Learning Rate

The **Learning Rate** (alpha) controls the size of each step taken toward a minimum of the loss during training. If the learning rate is too high, the updates can overshoot the minimum, oscillate around it, or diverge entirely. If the learning rate is too low, training is very slow and may stall in a poor local minimum or on a plateau.
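To make the effect of the step size concrete, here is a minimal sketch that runs gradient descent on a toy loss f(w) = w**2, whose minimum is at w = 0. The function name `descend`, the starting point, and the three learning rates are illustrative assumptions, not values from this lesson.

```python
def descend(learning_rate, steps=20, start=5.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w             # derivative of w**2 with respect to w
        w -= learning_rate * gradient  # step against the gradient
    return w

# Illustrative values: one too small, one reasonable, one too large.
for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: final w = {descend(lr):.4f}")
```

With the smallest rate, w barely moves from its starting point in 20 steps. With the middle rate it lands close to 0. With the largest rate each step overshoots by more than the previous one, so w grows instead of shrinking.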

Stochastic Gradient Descent (SGD)

Standard (batch) gradient descent calculates gradients across the *entire* dataset before making a single update. **Stochastic Gradient Descent (SGD)** updates the weights after examining a single sample (or a small batch) of data. Each update is therefore cheaper but based on a noisier gradient estimate. That noise can help the optimizer escape shallow local minima, though it makes the descent path erratic.
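The difference can be sketched on a toy problem: fitting y = w * x to noisy data generated with a true slope of 3. The dataset, the batch size of 8, and the helper names `gradient` and `train` are assumptions made up for this illustration.

```python
import random

random.seed(0)  # fixed seed so repeated runs behave the same way

# Toy data for y = 3x plus a little noise (values chosen for illustration).
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in xs]

def gradient(w, samples):
    """Gradient of the mean squared error of y ~ w * x over `samples`."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

def train(batch_size, learning_rate=0.1, epochs=20):
    """Fit w with one update per batch; returns the final weight."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)  # visit the samples in a new order each pass
        for i in range(0, len(data), batch_size):
            w -= learning_rate * gradient(w, data[i:i + batch_size])
    return w

w_full = train(batch_size=len(data))  # one update per pass over the data
w_sgd = train(batch_size=8)           # many noisy updates per pass
print(f"full batch: w = {w_full:.3f}, mini-batch SGD: w = {w_sgd:.3f}")
```

Both runs see the data the same number of times, but the mini-batch run makes 25 updates per pass instead of one, so under these settings it ends much closer to the true slope.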

Adding Momentum

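To smooth out SGD updates, we add **Momentum**. Momentum keeps an exponentially weighted moving average of past gradients and steps along that average rather than along the latest gradient alone. Gradients that keep pointing the same way build up speed across flat regions, while noisy oscillations tend to cancel out, much like a ball rolling down a valley.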
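A minimal sketch of one momentum update is shown below. It uses the exponential-moving-average form, which is one of several common formulations; the decay factor `beta=0.9`, the function name `momentum_step`, and the gradient values are assumptions for illustration.

```python
def momentum_step(weight, velocity, gradient, learning_rate=0.1, beta=0.9):
    """One momentum update; `velocity` is the running average of gradients."""
    velocity = beta * velocity + (1.0 - beta) * gradient  # blend in new gradient
    weight = weight - learning_rate * velocity            # step along the average
    return weight, velocity

weight, velocity = 5.0, 0.0
noisy_gradients = [0.4, -0.1, 0.5, 0.3, -0.2, 0.6]  # made-up noisy gradients
for g in noisy_gradients:
    weight, velocity = momentum_step(weight, velocity, g)
    print(f"gradient={g:+.1f}  velocity={velocity:+.4f}  weight={weight:.4f}")
```

Notice that the velocity changes far less from step to step than the raw gradients do, which is the smoothing effect described above.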

Adaptive Optimizers: Adam

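**Adam** (Adaptive Moment Estimation) is one of the most widely used optimizers in deep learning. Instead of applying one fixed step size to all weights, Adam scales the step for *each individual weight* using running averages of that weight's past gradients (the first moment) and past squared gradients (the second moment). Weights with consistently large gradients take proportionally smaller steps, and weights with small gradients take proportionally larger ones.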
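The sketch below writes out a single Adam update for one scalar weight, following the commonly published update rule with its usual default hyperparameters (learning rate 0.001, beta1 0.9, beta2 0.999, epsilon 1e-8). The function name `adam_step` and the example gradients are assumptions for illustration; real frameworks apply the same arithmetic to whole tensors.

```python
import math

def adam_step(weight, m, v, gradient, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the step count, starting at 1."""
    m = beta1 * m + (1.0 - beta1) * gradient       # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * gradient ** 2  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # correct the bias toward zero
    v_hat = v / (1.0 - beta2 ** t)
    weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return weight, m, v

# Two weights whose gradients differ by a factor of 100 (made-up values).
big, _, _ = adam_step(5.0, 0.0, 0.0, gradient=0.4, t=1)
small, _, _ = adam_step(5.0, 0.0, 0.0, gradient=0.004, t=1)
print(f"large-gradient weight: {big:.6f}, small-gradient weight: {small:.6f}")
```

On this first step both weights move by almost the same amount, roughly the learning rate, even though their gradients differ by a factor of 100. That per-weight normalization is what "adaptive" refers to.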

Coding Challenge: Gradient Descent Update

Let's write a standard gradient descent parameter update rule in Python!

1. Create a variable weight initialized to 5.0.
2. Create a variable gradient initialized to 0.4.
3. Create a variable learning_rate initialized to 0.1.
4. Update the weight variable by subtracting learning_rate * gradient.
5. Print the new weight value. It should equal 4.96.
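If you get stuck, one way the steps above could be written is sketched here. It is a possible solution rather than the only acceptable one; try the challenge yourself first.

```python
weight = 5.0         # current parameter value
gradient = 0.4       # slope of the loss with respect to the weight
learning_rate = 0.1  # step size (alpha)

# Gradient descent update: move the weight against the gradient.
weight = weight - learning_rate * gradient

print(weight)  # 5.0 - 0.1 * 0.4 = 4.96
```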

Run this script in the editor on the right to check that your manual parameter update produces the expected weight!