Basics of ML and DL
If you are already familiar with the basics of Deep Learning (DL) and Machine Learning (ML), you can skip this entire post. But if you are new to the field, the first few paragraphs are meant for you. As a Data Science newbie, there are a handful of things that you must know.
Most important of all are the parameters. Parameters are all there is to a model. People talk a lot about parameters or weights (the two terms mean the same thing here). Let’s say we have data that is already labelled as belonging to one of two classes, and the job of the model is to put each data point into its correct class. The parameters are the values that decide which point goes into which class.
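To make this concrete, here is a minimal sketch of a two-class model whose only parameters are two weights and a bias. The function name `predict`, the sample points, and the two parameter settings are made up for illustration; they are not from any particular library.

```python
# Hypothetical toy model: two weights and a bias are the only parameters.
def predict(point, weights, bias):
    """Return class 1 if the weighted sum is positive, else class 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0

points = [(1.0, 2.0), (3.0, 0.5), (-1.0, -1.5)]

# Two different parameter settings (both invented for this example).
params_a = {"weights": (1.0, -1.0), "bias": 0.0}
params_b = {"weights": (-1.0, 1.0), "bias": 0.0}

labels_a = [predict(p, **params_a) for p in points]
labels_b = [predict(p, **params_b) for p in points]
print(labels_a, labels_b)
```

The same three points end up in different classes under the two settings, which is all that "the parameters decide" means here.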
The loss function is the next important thing you should know. It acts as a quality check for the parameters. If the parameters are the values that classify each data point, the loss function tells us how well those parameters do on the given data: the lower the loss, the better the fit.
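As a rough illustration, the sketch below scores two parameter settings with one common choice of loss for two-class problems, the average log loss (binary cross-entropy). The choice of loss, the data, and the names `sigmoid` and `log_loss` are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Squash a score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, points, labels):
    """Average binary cross-entropy: lower means the parameters fit better."""
    total = 0.0
    for point, y in zip(points, labels):
        p = sigmoid(sum(w * x for w, x in zip(weights, point)) + bias)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(points)

# Made-up data: three points with their true classes.
points = [(1.0, 2.0), (3.0, 0.5), (-1.0, -1.5)]
labels = [0, 1, 1]

good = log_loss((1.0, -1.0), 0.0, points, labels)   # classifies all three correctly
bad = log_loss((-1.0, 1.0), 0.0, points, labels)    # classifies all three wrongly
print(good, bad)
```

The parameters that sort every point correctly get the smaller loss, so the loss gives us a single number to compare candidate parameters by.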
So training a model is nothing but finding the parameters that give the lowest loss. We start from a random set of parameters and update them after every iteration to bring the loss down. Gradient descent decides which direction to move the parameters in: the direction in which the loss falls fastest.
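The training loop described above can be sketched in a few lines. This is a toy version under stated assumptions: a one-parameter model y = w * x, a mean squared error loss, made-up data, and a small fixed `step_size` chosen by hand.

```python
import random

# Toy data generated by y = 2x, so the best parameter is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss(w):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

random.seed(0)              # fixed seed so the run is repeatable
w = random.uniform(-5, 5)   # start from a random parameter
step_size = 0.01            # assumed small constant step

for _ in range(200):
    w -= step_size * gradient(w)   # move against the gradient

print(w, loss(w))
```

Wherever the random start lands, repeatedly stepping against the gradient pulls w towards 2, where the loss is lowest.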
To give an analogy, if you are in a car at the top of a mountain and the aim is to drive down, gradient descent picks the direction in which your car travels. Speaking of driving downhill, there is one thing we all do, or rather MUST do, on the way down. What might that be? Take a wild guess… Got it? YES! We use the brakes.
No sane person drives downhill without taking friction’s help (brake failure excepted). While updating the parameters using gradient descent, we apply a special kind of brake, popularly known as the learning rate: a number that scales the size of each update.
The learning rate makes for a smooth transition from random parameters to a well-trained model. Setting it too high is like accelerating the car downhill: let alone reaching the foothills, the acceleration will throw you off the road. In the same way, a learning rate that is too high will not converge to the minimum loss. Each update overshoots, the loss can increase instead of decreasing, and the model stops learning.
If the learning rate is too small, it is like riding the brakes the whole way down. The model will usually still creep towards a minimum, but it takes a very long time to get there, and we cannot train a model indefinitely. So setting the right learning rate is always very important.
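To see both failure modes side by side, here is a small sketch on the simplest possible loss, loss(w) = w**2, whose minimum is at w = 0. The three learning rates, the starting point, and the helper name `run` are picked purely for illustration.

```python
def run(learning_rate, steps=50, w=10.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w * w   # final loss (the starting loss is 100)

too_high = run(1.1)     # every step overshoots the minimum, so the loss grows
too_low = run(0.001)    # heading the right way, but barely moved after 50 steps
reasonable = run(0.1)   # ends up very close to the minimum
print(too_high, too_low, reasonable)
```

With the same number of steps, the high rate ends with a loss far above where it started, the low rate has only shaved a little off, and the middle rate is essentially at the bottom.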
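Finally, parameter initialization. We always start with a set of random numbers for the parameters and update them using gradient descent. Rule of thumb for initialization: symmetry does not work. We all think alike, so the first idea anyone gets for initialization is a very simple one: “Why can’t we use all zeros?” Nope. That won’t work. If every parameter starts at the same value, every unit in a layer computes the same output and receives the same update, so the units never learn different things.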
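The symmetry problem is easy to check on a tiny network. The sketch below assumes a made-up setup: two inputs, two sigmoid hidden units, one linear output, a squared error loss, and two hand-written training examples. The function name `train` and all the numbers are illustrative only.

```python
import math

def train(W, v, data, learning_rate=0.5, steps=100):
    """Train a 2-input, 2-hidden-unit, 1-output network with plain gradient descent."""
    for _ in range(steps):
        for x, y in data:
            # Forward pass: sigmoid hidden units, linear output.
            h = [1 / (1 + math.exp(-(W[j][0] * x[0] + W[j][1] * x[1]))) for j in range(2)]
            out = v[0] * h[0] + v[1] * h[1]
            err = out - y   # gradient of 0.5 * (out - y) ** 2 with respect to out
            for j in range(2):
                grad_hidden = err * v[j] * h[j] * (1 - h[j])   # uses v[j] before its update
                v[j] -= learning_rate * err * h[j]
                W[j][0] -= learning_rate * grad_hidden * x[0]
                W[j][1] -= learning_rate * grad_hidden * x[1]
    return W, v

data = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)]

# All-zero start: both hidden units begin identical.
W, v = train([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], data)
print(W[0] == W[1], v[0] == v[1])
```

The weights do move away from zero, but the two hidden units receive exactly the same updates at every step, so they remain copies of each other and the network behaves as if it had a single hidden unit.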
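A common and reliable way to initialize is to draw random numbers from a Gaussian distribution and scale them. That just means taking standard normal random numbers and multiplying them by sqrt(2/n), where n is the number of inputs to the layer being initialized, not the total number of parameters across all the layers. This scheme is known as He initialization and is designed for layers with ReLU activations.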
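Here is a minimal sketch of that recipe for a single layer, assuming n is the number of inputs to the layer (the usual convention for the sqrt(2/n) scaling, known as He initialization). The helper name `he_init` and the layer sizes are hypothetical.

```python
import math
import random

def he_init(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)   # assumption: n in sqrt(2/n) is the layer's input count
    return [[rng.gauss(0.0, 1.0) * std for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)   # fixed seed so the run is repeatable
weights = he_init(fan_in=256, fan_out=128, rng=rng)

# Sanity check: the sample variance should be close to 2 / fan_in.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(mean, variance, 2.0 / 256)
```

Because every weight is drawn independently, no two units start out identical, and the scaling keeps the spread of the weights tied to the size of the layer.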