How is deep learning different from multilayer perceptron?
I’m going to try to keep this answer simple, and hopefully I don’t leave out too much detail in doing so. To me, the answer is all about the initialization and training process, which was perhaps the first major breakthrough in deep learning. Like others have said, an MLP is not really different from deep learning; it is arguably just one type of deep learning model.
Back-propagation (which has existed for decades) theoretically allows you to train a network with many layers. But before the advent of deep learning, researchers did not have widespread success training neural networks with more than 2 layers.
This was mostly because of vanishing and/or exploding gradients. Prior to deep learning, MLPs were typically initialized using random numbers. Like today, MLPs used the gradient of the network’s error w.r.t. the network’s parameters to adjust the parameters to better values in each training iteration. In back-propagation, evaluating this gradient involves the chain rule: you must multiply terms that depend on each layer’s parameters and local gradients together across all the layers. This is a lot of multiplication, especially for networks with more than 2 layers. If most of the weights across many layers are smaller than 1 in magnitude and they are multiplied many times, then eventually the gradient vanishes into a machine-zero and training stops. If most of the weights across many layers are larger than 1 in magnitude and they are multiplied many times, then eventually the gradient explodes into a huge number and the training process becomes intractable.
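To make the multiplication argument concrete, here is a minimal sketch in Python. It assumes a toy "network" that is just a chain of one-unit linear layers, so the chain-rule product collapses to a product of scalar weights. The function name `gradient_scale` and the 10% jitter on each weight are my own illustrative choices, not anything from a real library.

```python
import random


def gradient_scale(num_layers, weight_magnitude, seed=0):
    """Chain-rule product for a toy chain of one-unit linear layers.

    For y = w_n * ... * w_2 * w_1 * x, the gradient dy/dx is the product
    of all the weights, so its size depends on the depth and on whether
    the weights sit below or above 1.
    """
    rng = random.Random(seed)
    scale = 1.0
    for _ in range(num_layers):
        # Hypothetical weight: weight_magnitude with +/-10% random jitter.
        w = weight_magnitude * rng.uniform(0.9, 1.1)
        scale *= w
    return scale


if __name__ == "__main__":
    for depth in (2, 10, 50, 2000):
        small = gradient_scale(depth, 0.5)
        large = gradient_scale(depth, 2.0)
        print(f"{depth:5d} layers: weights ~0.5 -> {small:.3e}, "
              f"weights ~2.0 -> {large:.3e}")
```

With two layers both products stay in a sensible range. At 2000 layers the small-weight product underflows to 0.0 and the large-weight product overflows to inf in floating point. Real networks multiply matrices and activation derivatives rather than scalars, but the same compounding effect applies.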
Deep learning proposed a new initialization strategy: use a series of single-layer networks, which do not suffer from vanishing/exploding gradients, to find the initial parameters for a deep MLP. The steps below outline this process:
1.) A single-layer autoencoder network, trained on the raw inputs, is used to find initial parameters for the first layer of a deep MLP.
2.) A single-layer autoencoder network, trained on the outputs of the first layer, is used to find initial parameters for the second layer of a deep MLP.
3.) A single-layer autoencoder network, trained on the outputs of the second layer, is used to find initial parameters for the third layer of a deep MLP.
4.) A softmax classifier (logistic regression), trained on the outputs of the third layer, is used to find initial parameters for the output layer of a deep MLP.
Now that all the layers have been initialized through this pre-training process to values that are more suitable for the data, you can usually train the deep MLP using gradient descent techniques without the problem of vanishing/exploding gradients.
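The pre-training steps above can be sketched roughly as follows. This is a minimal NumPy sketch under my own assumptions (sigmoid hidden units, a linear decoder, squared reconstruction error, plain full-batch gradient descent), not the exact recipe from any particular paper. The names `pretrain_layer`, `pretrain_stack`, and `train_softmax` are hypothetical helpers made up for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pretrain_layer(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one single-hidden-layer autoencoder on X; keep only the encoder."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)          # encode
        X_hat = H @ W_dec + b_dec               # linear decode
        err = (X_hat - X) / len(X)              # grad of 0.5 * mean squared error
        dH = (err @ W_dec.T) * H * (1.0 - H)    # back through the sigmoid
        # Compute all gradients first, then update (only one hidden layer deep).
        grad_W_dec, grad_b_dec = H.T @ err, err.sum(axis=0)
        grad_W_enc, grad_b_enc = X.T @ dH, dH.sum(axis=0)
        W_dec -= lr * grad_W_dec
        b_dec -= lr * grad_b_dec
        W_enc -= lr * grad_W_enc
        b_enc -= lr * grad_b_enc
    return W_enc, b_enc


def pretrain_stack(X, layer_sizes, epochs=200, lr=0.5):
    """Steps 1-3: each autoencoder is trained on the previous layer's outputs."""
    params, inputs = [], X
    for i, n_hidden in enumerate(layer_sizes):
        W, b = pretrain_layer(inputs, n_hidden, epochs=epochs, lr=lr, seed=i)
        params.append((W, b))
        inputs = sigmoid(inputs @ W + b)        # features fed to the next layer
    return params, inputs


def train_softmax(H, y, n_classes, epochs=200, lr=0.5):
    """Step 4: softmax classifier trained on the top-layer features."""
    W = np.zeros((H.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                    # one-hot labels
    for _ in range(epochs):
        logits = H @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(H)                    # grad of mean cross-entropy
        W -= lr * (H.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((64, 8))                     # made-up data for illustration
    y = rng.integers(0, 3, size=64)             # made-up labels, 3 classes
    params, top = pretrain_stack(X, [6, 4, 3])
    W_out, b_out = train_softmax(top, y, n_classes=3)
    print([W.shape for W, _ in params], W_out.shape)
```

The returned `params` and `(W_out, b_out)` would then serve as the starting point for ordinary back-propagation over the whole network (the fine-tuning step), which is not shown here.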
Of course the field of deep learning has moved forward since this initial breakthrough, and many researchers now argue pre-training is not necessary. But even without pre-training, reliably training a deep MLP requires some additional sophistication, in either the initialization or the training process, beyond the older MLP approach of random initialization followed by standard gradient descent.
UPDATE: Please keep in mind deep learning has evolved quite a bit since I originally answered this question a few years ago. The methods described here are representative of some of the important early work in deep learning, but not really representative of the field today.