Calculated gradients show the direction of maximum error reduction, but how fast should we move in that direction? This is governed by optimization algorithms.
The **Learning Rate** (alpha) controls the step size taken towards the local minimum during training. If the learning rate is too high, the model might overshoot the minimum and fail to converge. If the learning rate is too low, training will be incredibly slow and can get stuck in poor local minima.
Standard gradient descent calculates gradients across the *entire* dataset before making a single update. **Stochastic Gradient Descent (SGD)** updates the weights after examining a single sample (or a small batch) of data. This adds helpful noise that can rescue optimization from local minima, though it makes the descent path erratic.
To smooth out SGD updates, we add **Momentum**. This keeps a moving average of past gradients, allowing optimization to roll quickly through flat regions and noisy oscillations, much like a ball rolling down a valley.
**Adam** (Adaptive Moment Estimation) is the industry standard optimization method. Instead of using a fixed learning rate for all weights, Adam dynamically scales the learning rate for *each individual weight* based on its historical update frequency.
Let's write a standard gradient descent parameter update rule in Python!
weight initialized to 5.0.gradient initialized to 0.4.learning_rate initialized to 0.1.weight variable by subtracting learning_rate * gradient.4.96.Run this script in the editor on the right to see if your manual parameter tuning works!