Programming Assignment 3

Policies

Problem Description

Architecture of the LinearNN network.

Hint on L2 Regularization

(updated on Nov 19th)

Some students are confused about where to put the L2 regularization for the linear layers. You might think of adding the regularization term when defining the MSELoss; however, the loss layer does not have access to the weights of the linear layers.

Instead, L2 regularization can be applied in the backward propagation of the linear layer, when the weights are updated. With the L2 regularization term, the final loss function becomes:

$$\mathcal{L}(X, W) = \text{MSE} + \lambda\sum_{i=1}^{|L|}\|W_i\|^2_2,$$

Here $W_i$ is the weight of the $i$-th layer. So the gradient of each layer's weight becomes:

$$\frac{\partial \mathcal{L}}{\partial W_i} = \frac{\partial \text{MSE}}{\partial W_i} + \lambda\frac{\partial \|W_i\|^2_2}{\partial W_i}$$

The first term is computed through backward propagation, which is what we have already done in the linear layer. For the second term, $\frac{\partial \|W_i\|^2_2}{\partial W_i} = 2W_i$ (the factor of 2 is often absorbed into $\lambda$). This term depends only on the weight of the $i$-th layer, so no extra backward propagation is needed for the regularization. It can simply be added in the backward function when the weights are updated.
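To make the placement concrete, below is a minimal NumPy sketch of a linear layer that adds the regularization gradient inside its backward function, right before the weight update. The names and interface here (`Linear`, `W`, `b`, `lr`, `lam`, updating the weights inside `backward`) are assumptions for illustration, not the exact skeleton provided in the assignment; adapt the idea to your own layer and optimizer code.

```python
import numpy as np

class Linear:
    """Illustrative linear layer with L2 regularization folded into backward."""

    def __init__(self, in_dim, out_dim, lr=0.01, lam=1e-4):
        self.W = np.random.randn(in_dim, out_dim) * 0.01  # weight matrix W_i
        self.b = np.zeros(out_dim)                        # bias (not regularized)
        self.lr = lr    # learning rate
        self.lam = lam  # regularization strength lambda

    def forward(self, x):
        self.x = x  # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradient passed to the previous layer: unchanged by regularization.
        grad_in = grad_out @ self.W.T

        # MSE part of the weight/bias gradients, from standard backprop.
        grad_W = self.x.T @ grad_out
        grad_b = grad_out.sum(axis=0)

        # L2 term: d(lambda * ||W||_2^2)/dW = 2 * lambda * W
        # (the factor of 2 may be absorbed into lambda).
        grad_W += 2 * self.lam * self.W

        # Update the weights here, inside backward, as the hint suggests.
        self.W -= self.lr * grad_W
        self.b -= self.lr * grad_b
        return grad_in

# Example usage with made-up shapes:
layer = Linear(4, 2)
out = layer.forward(np.random.randn(8, 4))
grad_in = layer.backward(np.ones((8, 2)))  # gradient arriving from the loss layer
```

Note that only the weight gradient receives the extra $2\lambda W_i$ term; the gradient sent back to the previous layer and the bias gradient are unaffected by the regularization.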