# How does regularization work in ML and DL? - Class review

Last updated on: a year ago

Regularization plays an important role in combating overfitting. These are my notes on it from class.

# Bias/Variance problems

Take mismatched train/test distribution as an example.

| Train set error | Dev set error | Remarks |
|---|---|---|
| 1% | 11% | high variance (had too much flexibility to fit) |
| 15% | 16% | high bias |
| 15% | 30% | high bias + high variance |

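The remarks in the table assume human-level error is about 0%.

If the optimal/Bayes error were 15% instead, the second row (15% train, 16% dev) would count as neither high bias nor high variance.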
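The reading of the table can be sketched as a small helper. This is only an illustration: the function name `diagnose` and the 5-point `threshold` are assumptions of mine, not something given in the class.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, threshold=5.0):
    """Label a model from its error rates, all given in percent.

    `threshold` is a hypothetical cutoff (in percentage points) picked for
    this illustration; it does not come from the notes.
    """
    avoidable_bias = train_err - bayes_err  # gap to the best achievable error
    variance = dev_err - train_err          # gap between train and dev sets
    labels = []
    if avoidable_bias >= threshold:
        labels.append("high bias")
    if variance >= threshold:
        labels.append("high variance")
    return " + ".join(labels) if labels else "low bias, low variance"


# The three rows of the table, with Bayes error taken as 0%.
for train_err, dev_err in [(1, 11), (15, 16), (15, 30)]:
    print(train_err, dev_err, diagnose(train_err, dev_err))
```

Passing `bayes_err=15` for the second row changes its label, which is the point of comparing against the optimal error rather than against 0%.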

## The basic recipe for ML

High bias (can *lead to underfitting*; too little flexibility, so the error is high even on the training data)? Check training data performance, then try: a bigger network, training longer, NN architecture search

High variance/overfitting? Check dev set performance, then try: more data, regularization, NN architecture search

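Bias-variance trade-off: in classical ML, reducing one tends to increase the other. With bigger networks and more data, each can often be reduced without hurting the other much.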

## Why regularization reduces overfitting

Regularization simplifies the neural network by pushing $w^{[l]}$ close to 0.

The larger $\lambda$ is, the smaller $w$ becomes.

$$z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$$

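If $w^{[l]} \approx 0$, then $z^{[l]}$ stays small. For an activation such as tanh, which is nearly linear around 0, each layer then behaves almost linearly, so the whole network approaches a linear function.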
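A quick numerical check of this intuition, assuming a tanh activation (the notes do not fix a particular activation, so that choice is mine):

```python
import numpy as np

# Small pre-activations, as produced when the weights are close to 0.
z_small = np.linspace(-0.1, 0.1, 5)
# Larger pre-activations, as produced by large weights.
z_large = np.linspace(-3.0, 3.0, 5)

# Largest deviation of tanh(z) from the identity line on each range.
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))

print(err_small, err_large)
```

On the small range tanh is almost indistinguishable from a straight line, while on the large range it clearly saturates.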

# Methods of regularization

## L2 regularization/weight decay

Add a penalty on the squared parameters $\theta$ / $w$ to the cost function. $\lambda$ is the regularization parameter.

### Linear regression

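$$J( \theta) = \frac{1}{2m} \left[ \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)}) ^2 + \lambda \sum^n_{j=1} \theta_j^2 \right]$$

### Gradient descent

Repeat:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)}) x_0^{(i)}$$

$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)}) x_j^{(i)}, \quad j = 1, \dots, n$$

Since $\alpha>0, \lambda>0, m>0$, the factor $1 - \alpha \frac{\lambda}{m}$ is less than 1, so every step shrinks $\theta_j$ a little before applying the usual gradient update.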
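A minimal NumPy sketch of regularized gradient descent for linear regression, assuming the penalty is scaled as $\frac{\lambda}{2m}$, that `X` carries a leading column of ones, and that $\theta_0$ is left unpenalized. The function name and the synthetic data are made up for the illustration.

```python
import numpy as np


def ridge_gradient_descent(X, y, lam, alpha, num_iters):
    """Gradient descent on the L2-regularized squared-error cost.

    X must have a leading column of ones (x_0 = 1); theta_0 is not shrunk.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    # Per-parameter shrink factor (1 - alpha * lam / m), except for theta_0.
    shrink = np.full(n_plus_1, 1.0 - alpha * lam / m)
    shrink[0] = 1.0
    for _ in range(num_iters):
        error = X @ theta - y          # h_theta(x) - y for every example
        grad = X.T @ error / m         # gradient of the unregularized cost
        theta = theta * shrink - alpha * grad
    return theta


# Synthetic data, only for demonstration.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=50)

theta_plain = ridge_gradient_descent(X, y, lam=0.0, alpha=0.1, num_iters=2000)
theta_reg = ridge_gradient_descent(X, y, lam=50.0, alpha=0.1, num_iters=2000)
print(theta_plain, theta_reg)
```

With a large $\lambda$ the penalized coefficients come out noticeably smaller than the unregularized ones.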

### Normal equation

If $\lambda \gt 0$,

$$\theta = \left ( X^T X + \lambda \left [ \begin{matrix}
0 & & & & \\
& 1 & & & \\
& & 1 & & \\
& & & \ddots & \\
& & & & 1 \\
\end{matrix} \right] \right) ^{-1} X^T y$$
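The regularized normal equation can be sketched in NumPy as follows. The function name and the toy data are my own; the example uses a duplicated feature column so that $X^T X$ alone is singular, while the regularized matrix can still be solved.

```python
import numpy as np


def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, with L = identity except L[0, 0] = 0."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0  # leave the intercept theta_0 unregularized
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)


# Toy data: the second and third columns are identical, so X^T X is singular.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta = ridge_normal_equation(X, y, lam=1.0)
print(theta)
```

Because the two duplicated features are interchangeable, the penalty splits the weight evenly between them.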

### Logistic regression

$$J( \theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_{\theta} (x^{(i)}) + (1- y^{(i)}) \log (1- h_{\theta} (x^{(i)}) ) \right] + \frac{\lambda}{2m} \sum^n_{j=1} \theta_j^2$$

**Gradient descent**

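Repeat:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)}) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_{\theta} (x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j = 1, \dots, n$$

The update has the same form as for linear regression; the only difference is the hypothesis, $$h_\theta (x) = \frac{1}{1 + e^{ - \theta ^T x}}$$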

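**Normal equation**: not available here. Unlike linear regression, regularized logistic regression has no closed-form solution, so $\theta$ has to be found iteratively (gradient descent or a more advanced optimizer).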
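A minimal sketch of the regularized logistic regression cost and one gradient step, assuming a leading column of ones in `X` and an unpenalized $\theta_0$. The function names, the synthetic labels, and the hyperparameter values are illustrative choices of mine.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_cost(theta, X, y, lam):
    """Cross-entropy averaged over examples plus lam / (2m) * sum_{j>=1} theta_j^2."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    data_term = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty


def logistic_gradient_step(theta, X, y, lam, alpha):
    """One gradient descent update; theta_0 gets no penalty term."""
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return theta - alpha * grad


# Synthetic binary labels, only for demonstration.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

theta = np.zeros(3)
costs = []
for _ in range(200):
    costs.append(logistic_cost(theta, X, y, lam=1.0))
    theta = logistic_gradient_step(theta, X, y, lam=1.0, alpha=0.5)
print(costs[0], costs[-1])
```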

## L1 regularization

L1 regularization compresses the model: many entries of $w$ are driven to 0, so $w$ becomes sparse.

$$\frac{\lambda}{2m} ||w||_1 = \frac{\lambda}{2m} \sum^{n_x}_{i=1} |w_i|$$
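The L1 penalty and a subgradient of it can be computed directly. This is a small sketch with made-up values for `w`, `lam`, and `m`; the function names are mine.

```python
import numpy as np


def l1_penalty(w, lam, m):
    """lam / (2m) * sum_i |w_i|"""
    return lam / (2 * m) * np.sum(np.abs(w))


def l1_subgradient(w, lam, m):
    """A subgradient of the penalty: lam / (2m) * sign(w_i), taken as 0 where w_i == 0."""
    return lam / (2 * m) * np.sign(w)


w = np.array([1.0, -2.0, 0.0, 3.0])
penalty = l1_penalty(w, lam=0.5, m=10)
subgrad = l1_subgradient(w, lam=0.5, m=10)
print(penalty, subgrad)
```

The size of the subgradient does not depend on how large $|w_i|$ is, so small weights are pushed with the same force as large ones. That is one common way to see why L1 tends to produce exact zeros.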

## Dropout regularization

Dropout randomly eliminates nodes of the neural network. Different units in the same hidden layer can be dropped at different iterations of gradient descent. At test time, we usually don’t use dropout.

Intuition: a node can’t rely on any single feature, so it has to spread out its weights.
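One common way to implement this is "inverted dropout", sketched below. The notes do not say which variant was used in class, so treat the scaling by `keep_prob` and the function name as assumptions.

```python
import numpy as np


def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to the activations `a` of one layer.

    Returns the (possibly masked) activations and the mask that was used.
    """
    if not training:
        # No units are dropped at test time.
        return a, None
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(a.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return a * mask / keep_prob, mask


rng = np.random.default_rng(0)
a = np.ones((4, 1000))
a_train, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
a_test, _ = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
print(a_train.mean(), a_test.mean())
```

A fresh mask is drawn on every call, which matches the idea that different units are dropped at different iterations.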

For example, with 3 input features and 7 units in the first hidden layer, the size of $w^{[1]}$ should be $7\times 3$.

In general, the number of neurons in the previous layer gives us the number of columns of the weight matrix, and the number of neurons in the current layer gives us the number of rows in the weight matrix.
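The shape rule can be checked with a few lines of NumPy. The layer sizes (3 inputs, 7 hidden units) and the batch size are assumed values for the illustration.

```python
import numpy as np

n_prev, n_curr, m = 3, 7, 5  # hypothetical: 3 inputs, 7 hidden units, 5 examples

W1 = np.zeros((n_curr, n_prev))  # rows: current layer, columns: previous layer
b1 = np.zeros((n_curr, 1))
A0 = np.ones((n_prev, m))        # one column per example

Z1 = W1 @ A0 + b1                # b1 broadcasts across the m columns
print(W1.shape, Z1.shape)
```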

## Data augmentation

Augmented data can be somewhat redundant, and it is not as good as an additional set of brand-new independent examples. But it is an inexpensive way to get more data: you don’t need to pay the expense of going out to take more pictures of cats.

Practical methods: horizontal flips, random crops, random distortions, zooming
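Two of these methods, horizontal flips and random crops, can be sketched with plain NumPy. The sketch assumes images stored as height × width × channels arrays; the function names and sizes are mine.

```python
import numpy as np


def flip_horizontal(img):
    """Mirror the image left to right by reversing the width axis."""
    return img[:, ::-1]


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)   # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]


rng = np.random.default_rng(0)
img = np.arange(6 * 8 * 3).reshape(6, 8, 3)  # stand-in for a 6x8 RGB image

flipped = flip_horizontal(img)
cropped = random_crop(img, crop_h=4, crop_w=4, rng=rng)
print(flipped.shape, cropped.shape)
```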

## Early stopping

Stop the iterations at the right time to get a “mid-size” $||w||^2_F$. **The method is not supposed to be used after fine tuning.**

*Now, regarding the quantity to monitor: prefer the loss to the accuracy. Why? The loss quantifies how certain the model is about a prediction (ideally a probability close to 1 for the right class and close to 0 for the other classes). The accuracy merely accounts for the number of correct predictions. Similarly, any metric that uses hard predictions rather than probabilities has the same problem.*

$w$ starts close to 0 and grows as training proceeds, so stopping partway yields a mid-size $||w||^2_F$.
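A minimal sketch of an early-stopping loop that monitors the validation loss. The `patience` rule (stop after a fixed number of epochs without improvement) is a common convention that I am assuming here, and `train_step` / `val_loss_fn` are hypothetical callbacks standing in for a real training setup.

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs, patience):
    """Train until the validation loss has not improved for `patience` epochs.

    Returns the epoch with the lowest validation loss and that loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = val_loss_fn(epoch)
        if loss < best_loss:
            # In a real setup the weights would be saved here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss


# Simulated validation losses, only for demonstration.
simulated_val_losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.4]
epochs_run = []
best_epoch, best_loss = train_with_early_stopping(
    train_step=epochs_run.append,
    val_loss_fn=lambda epoch: simulated_val_losses[epoch],
    max_epochs=len(simulated_val_losses),
    patience=2,
)
print(best_epoch, best_loss, epochs_run)
```

In this simulated run training halts before the late dip in the loss is ever seen, which shows the cost of a small `patience`.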

# Reference

[1] Andrew Ng, Machine Learning

[2] Deeplearning.ai, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!