Weight Initialization Methods  Class review
Last updated on：a month ago
A well-chosen initialization of the weights can speed up the training of your learning system. This post summarizes some common weight initialization methods.
What is Initialization?
Before you can train a learning system, you must first set its parameters to some initial values. Better parameter values are then obtained through training updates.
A well-chosen initialization can:

Speed up the convergence of gradient descent

Increase the odds of gradient descent converging to a lower training (and generalization) error
Initializing weights to very large random values does not work well.
Zeros initialization
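Zeros initialization sets every weight in the network to 0. This is harmful: the network fails to break symmetry, so every unit in a layer computes the same output and receives the same gradient update, and each layer behaves as if it had a single unit ($n^{[l]} = 1$).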
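The symmetry problem can be checked numerically. The following is a minimal sketch, not taken from the original course material: the tiny two-layer network, the `train` helper, the learning rate, and the random data are all assumptions chosen for illustration. It trains the same network once from all-zero weights and once from small random weights, then checks whether the hidden units still share identical weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, W2, X, y, lr=0.1, steps=50):
    """Plain gradient descent on a 2-layer net (hypothetical helper):
    sigmoid hidden layer, linear output, mean squared error."""
    m = X.shape[0]
    for _ in range(steps):
        H = sigmoid(X @ W1)                              # hidden activations, (m, n_hidden)
        err = H @ W2 - y                                 # prediction error, (m, 1)
        dW2 = H.T @ err / m                              # gradient w.r.t. W2
        dW1 = X.T @ ((err @ W2.T) * H * (1 - H)) / m     # gradient w.r.t. W1
        W1 = W1 - lr * dW1
        W2 = W2 - lr * dW2
    return W1, W2

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))   # made-up inputs
y = rng.normal(size=(32, 1))   # made-up targets

# Same network, two different starting points.
W1_zero, _ = train(np.zeros((3, 4)), np.zeros((4, 1)), X, y)
W1_rand, _ = train(rng.normal(0, 0.1, size=(3, 4)),
                   rng.normal(0, 0.1, size=(4, 1)), X, y)

# Each column of W1 holds the incoming weights of one hidden unit.
zero_units_identical = np.allclose(W1_zero, W1_zero[:, :1])
rand_units_identical = np.allclose(W1_rand, W1_rand[:, :1])
print(zero_units_identical, rand_units_identical)
```

Starting from zeros, all four hidden units receive the same update at every step, so their weight columns never diverge from one another. With random starting values they begin different and stay different.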
Symmetry breaking
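In physics, symmetry breaking is a phenomenon in which (infinitesimally) small fluctuations acting on a system crossing a critical point decide the system's fate, by determining which branch of a bifurcation is taken. In a neural network, breaking symmetry means giving the units of a layer different initial weights, so that they receive different gradients and learn different features.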
Random initialization
Gaussian distributions initialization
$$Var[W^i] = \sigma^2$$
$$W\sim \mathcal{N} (0, \sigma^2)$$
Here $\mathcal{N}$ denotes the normal distribution, with mean 0 and variance $\sigma^2$.
Advantages
 Symmetry is broken so long as $W^{[l]}$ is initialized randomly.
Shortcoming
 Because the standard deviation is fixed, very deep models (more than 8 conv layers) have difficulty converging.
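One way to see why a fixed standard deviation causes trouble in deep models is to track how the activation scale changes from layer to layer. The sketch below is an illustration under assumed settings (a stack of 20 fully connected ReLU layers of width 256, random Gaussian inputs, and the hypothetical helper `activation_std_per_layer`), not an experiment from the references.

```python
import numpy as np

def activation_std_per_layer(sigma, n_layers=20, width=256, batch=512, seed=0):
    """Forward a random batch through ReLU layers whose weights are all
    drawn from N(0, sigma^2); return the activation std after each layer."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    stds = []
    for _ in range(n_layers):
        W = rng.normal(0.0, sigma, size=(width, width))
        a = np.maximum(a @ W, 0.0)   # linear layer followed by ReLU
        stds.append(float(a.std()))
    return stds

small = activation_std_per_layer(sigma=0.01)  # signal shrinks layer by layer
large = activation_std_per_layer(sigma=0.2)   # signal grows layer by layer

print(f"sigma=0.01: first {small[0]:.2e}, last {small[-1]:.2e}")
print(f"sigma=0.2:  first {large[0]:.2e}, last {large[-1]:.2e}")
```

Neither fixed value keeps the signal at a stable scale: the right standard deviation depends on the layer width, which is what the Xavier and He schemes account for.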
Xavier initialization
$$Var[W^i] = \frac{2}{n_i + n_{i+1}}$$
$$W\sim U[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}]$$
$U[a, b]$ denotes the continuous uniform distribution on $[a, b]$, whose density is:
$$f(x) =
\begin{cases}
\frac{1}{b-a}, & \text{for $a\leq x\leq b$} \\ 0, & \text{for $x<a$ or $x>b$}
\end{cases}$$
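The uniform bounds and the variance formula agree: a uniform distribution on $[-a, a]$ has variance $a^2/3$, so $a = \sqrt{6}/\sqrt{n_i + n_{i+1}}$ gives $\frac{6}{3(n_i + n_{i+1})} = \frac{2}{n_i + n_{i+1}}$. A minimal NumPy sketch follows. The function name `xavier_uniform` and the layer sizes are assumptions for illustration.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix from U[-limit, limit],
    with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_uniform(300, 500, rng)      # example sizes, chosen arbitrarily

target_var = 2.0 / (300 + 500)         # the variance Xavier initialization aims for
print(f"empirical variance: {W.var():.6f}, target: {target_var:.6f}")
```

The empirical variance of the sampled matrix should land close to the target, up to sampling noise.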
He initialization
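$$Var[W^l] = \frac{2}{n_l}$$
$$W\sim \mathcal{N} (0, \frac{2}{n_l})$$
That is, the standard deviation is $\sqrt{\frac{2}{n_l}}$, where $n_l$ is the number of inputs to layer $l$. In code, multiply standard normal samples by $\sqrt{\frac{2}{\text{layers\_dims}[l-1]}}$ when initializing $W^{[l]}$.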
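A minimal sketch of this scaling in NumPy is shown below. It assumes `layers_dims` is a list of layer sizes starting with the input dimension. The function name `initialize_parameters_he`, the dictionary layout, and the example sizes are illustrative choices, not a fixed API.

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=0):
    """He initialization: W[l] has shape (layers_dims[l], layers_dims[l-1]),
    drawn from a standard normal and scaled by sqrt(2 / layers_dims[l-1]).
    Biases start at zero."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layers_dims)):
        fan_in = layers_dims[l - 1]
        parameters[f"W{l}"] = rng.standard_normal((layers_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        parameters[f"b{l}"] = np.zeros((layers_dims[l], 1))
    return parameters

params = initialize_parameters_he([400, 300, 10])   # example layer sizes
print(params["W1"].shape, params["W1"].var())       # variance should be near 2/400
```

Biases can safely start at zero here because the random weights already break the symmetry between units.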
Xavier initialization vs. He initialization
Xavier initialization is derived under the assumption of linear activations. He initialization takes ReLU/PReLU into account in its derivation, so it remains valid for networks that use those activations.
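The difference shows up when signals pass through many ReLU layers. The sketch below is an illustration under assumed settings (20 square fully connected layers of width 512, random Gaussian inputs, and the hypothetical helper `rms_after_relu_stack`). A ReLU zeroes roughly half of the pre-activations, so with Xavier's variance of $1/n$ for square layers the mean squared activation roughly halves at each layer, while He's variance of $2/n$ compensates for that loss.

```python
import numpy as np

def rms_after_relu_stack(init, n_layers=20, width=512, batch=256, seed=0):
    """Forward a random batch through a stack of ReLU layers and return the
    root-mean-square activation after the last layer."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    for _ in range(n_layers):
        if init == "xavier":
            limit = np.sqrt(6.0 / (width + width))          # Var = 2 / (n_in + n_out)
            W = rng.uniform(-limit, limit, size=(width, width))
        elif init == "he":
            W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)  # Var = 2 / n_in
        else:
            raise ValueError(f"unknown init: {init}")
        a = np.maximum(a @ W, 0.0)                          # linear layer followed by ReLU
    return float(np.sqrt(np.mean(a ** 2)))

xavier_rms = rms_after_relu_stack("xavier")
he_rms = rms_after_relu_stack("he")
print(f"Xavier: {xavier_rms:.2e}, He: {he_rms:.2e}")
```

With He initialization the activation scale stays on the order of the input scale, whereas with Xavier initialization it decays by several orders of magnitude over the 20 layers.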
Reference
[1] Deeplearning.ai, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
[2] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
[3] Glorot, X. and Bengio, Y., 2010, March. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256). JMLR Workshop and Conference Proceedings.
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!