
2 Deep Learning

Neural Network

Feed Forward Network

  • Each perceptron in one layer is connected to every perceptron in the next layer
  • Deep = Many hidden layers
  • Last layer still logistic regression


Perceptron (Neuron)

$o = f\left(\sum_{k=1}^{n} i_k W_k\right)$

where $f$ is an activation function:

  • Identity: Linear Regression
  • Sigmoid: Logistic Regression

One can stack perceptrons into a multi-layer perceptron (MLP)
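A minimal NumPy sketch of a single perceptron and a two-layer stack (the sizes and the sigmoid choice are just for illustration; the bias term is omitted to match the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(i, W, f=sigmoid):
    # o = f(sum_k i_k * W_k): weighted sum of the inputs followed by an activation
    return f(i @ W)

# Stacking perceptrons into a tiny MLP (hypothetical sizes: 3 inputs -> 4 hidden -> 1 output)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(3, 4))      # hidden layer weights
W2 = rng.normal(size=(4, 1))      # output layer weights ("still logistic regression")

h = perceptron(x, W1)             # hidden activations
o = perceptron(h, W2)             # final output in (0, 1)
```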

Activation functions

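A few standard activation functions as a small NumPy sketch (the selection shown here is illustrative, not exhaustive):

```python
import numpy as np

def identity(z):                 # identity at the output layer -> linear regression
    return z

def sigmoid(z):                  # sigmoid at the output layer -> logistic regression
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # zero-centered squashing function
    return np.tanh(z)

def relu(z):                     # common default for hidden layers
    return np.maximum(0.0, z)
```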

Feedforward and Backpropagation

  • One forward pass: Compute all the intermediate output values
  • One backward pass: Follow the chain rule to calculate the gradients at each layer, then multiply them together

Chain rule:

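As a concrete illustrative case, consider a two-layer network $h = f(W_1 x)$, $o = g(W_2 h)$ with loss $L(o)$ (the names $W_1, W_2, f, g$ are just placeholders). The chain rule decomposes the gradient into a product of local gradients:

$$
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial h} \cdot \frac{\partial h}{\partial W_1}
$$

The forward pass caches the intermediate values $h$ and $o$; the backward pass computes each local factor and multiplies them.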

CNN

Why CNN for image:

  • Some patterns are much smaller than the whole image
  • The same patterns appear in different regions
  • Subsampling the pixels will not change the object


Convolution Layer

Convolutional Filters


One can stack the outputs of several convolution filters into a new tensor: each filter contributes one channel


Padding: preserve the input's spatial dimensions in the output activations by adding an n-pixel border (typically zeros)


Input: $W \times H \times D$ tensor

Hyperparameters:

  • Number of filters: K, K is usually a power of 2 (32, 64, ...)
  • Filter size: $F \times F \times D$, F is usually an odd number (e.g., 3, 5, 7)
  • Stride: S, the number of pixels the filter moves at each step
  • Padding size: P

Output: $W' \times H' \times K$ tensor (first compute how far the filter can move, then add 1)

  • $W' = (W - F + 2P)/S + 1$
  • $H' = (H - F + 2P)/S + 1$

Number of parameters: each of the $K$ filters computes $w^T x + b$, giving $(F \times F \times D + 1) \times K$ parameters

Usage: with $1 \times 1$ filters, we can keep the width and height but change (e.g., increase) the number of channels.
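A quick sanity check of the output-size and parameter-count formulas above (a small sketch; the 32×32×3 input and the hyperparameter values are made up):

```python
def conv_output_size(W, H, D, K, F, S, P):
    """Return (W', H', K) and the parameter count for one convolution layer."""
    W_out = (W - F + 2 * P) // S + 1
    H_out = (H - F + 2 * P) // S + 1
    params = (F * F * D + 1) * K          # each filter: F*F*D weights + 1 bias
    return (W_out, H_out, K), params

# e.g. a 32x32x3 input, K=16 filters of size 3x3x3, stride 1, padding 1
shape, params = conv_output_size(W=32, H=32, D=3, K=16, F=3, S=1, P=1)
print(shape)    # (32, 32, 16) -- padding preserves the spatial size
print(params)   # (3*3*3 + 1) * 16 = 448
```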

Pooling Layer

We can use Pooling as Subsampling (Downsampling)

Input: $W \times H \times D$ tensor

Hyperparameters:

  • Filter size: $F \times F$
  • Stride: S

Output: $W' \times H' \times D$ tensor

  • $W' = (W - F)/S + 1$
  • $H' = (H - F)/S + 1$
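The same check for pooling, e.g. the common 2×2 max pooling with stride 2 (the example numbers are arbitrary):

```python
def pool_output_size(W, H, D, F, S):
    # Pooling has no padding term and keeps the depth D unchanged
    return ((W - F) // S + 1, (H - F) // S + 1, D)

print(pool_output_size(W=32, H=32, D=16, F=2, S=2))   # (16, 16, 16): W and H are halved, D is kept
```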

Fully Connected Layer

Flatten


Convolution vs Fully connected


RNN

Recurrent Neural Networks: Apply the same weights $W$ repeatedly

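A minimal NumPy sketch of that weight sharing (the names W_h, W_x, b and all sizes are made up for illustration): the same weights are applied at every timestep, and the loop runs for however long the input is.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Simple RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b), same weights at every step."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:                      # works for any sequence length
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
hidden, inp = 8, 4
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, inp))
b = np.zeros(hidden)
xs = [rng.normal(size=inp) for _ in range(5)]   # a length-5 input sequence
hs = rnn_forward(xs, W_h, W_x, b)
```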

Advantages:

  • Can process any length input
  • Computation for step t can (in theory) use information from many steps back
  • Model size doesn’t increase for longer input context
  • Same weights applied on every timestep, so there is symmetry in how inputs are processed

Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back

Vanishing and Exploding gradients

Vanishing gradients: model weights are updated only with respect to near effects, not long-term effects.


Exploding gradients: If the gradient becomes too big, then the SGD update step becomes too big. We take too large a step and reach a weird and bad parameter configuration.
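A toy numerical illustration (not a real backprop computation): the gradient flowing back over T steps is roughly a product of T per-step factors, so factors slightly below 1 drive it toward zero and factors slightly above 1 blow it up.

```python
import numpy as np

T = 50
print(np.prod(np.full(T, 0.9)))   # ~0.005 -> vanishing: long-range effects barely update the weights
print(np.prod(np.full(T, 1.1)))   # ~117   -> exploding: the SGD step becomes far too large
```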

Solution 1: LSTM

Long Short-Term Memory RNN

On step $t$, there is a hidden state $h_t$ and a cell state $c_t$

  • The cell stores long-term information
  • The LSTM can read from, erase (forget), and write to the cell


$$
\begin{aligned}
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) \\
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i) \\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c h_{t-1} + U_c x_t + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The LSTM architecture makes it much easier for an RNN to preserve information over many timesteps.

e.g., if the forget gate is set to 1 for a cell dimension and the input gate set to 0, then the information of that cell is preserved indefinitely.
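A direct NumPy transcription of the equations above (a sketch; the parameter dictionary, shapes, and random initialization are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the gate equations above (products are element-wise)."""
    f_t = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])   # forget gate
    i_t = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])   # input gate
    o_t = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])   # output gate
    c_tilde = np.tanh(p["W_c"] @ h_prev + p["U_c"] @ x_t + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde     # cell: keep old information + write new information
    h_t = o_t * np.tanh(c_t)               # hidden state read out of the cell
    return h_t, c_t

rng = np.random.default_rng(0)
H, X = 8, 4
p = {f"W_{g}": rng.normal(scale=0.1, size=(H, H)) for g in "fioc"}
p |= {f"U_{g}": rng.normal(scale=0.1, size=(H, X)) for g in "fioc"}
p |= {f"b_{g}": np.zeros(H) for g in "fioc"}
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), p)
```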

Solution 2: Other techniques

ResNet

The identity connection preserves information by default

Skip connection:

Say that, halfway through a normal network, the activations are already informative enough to classify the inputs well, but our chosen network still has more layers after that.

If the weights of the residual branch are set to zero ($F(x) = 0$), the block just passes its input through, so the remaining blocks can easily learn the identity function or small updates to it.

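A sketch of the skip connection (assuming NumPy; the two-layer residual branch F is hypothetical): the block outputs x + F(x), so driving F(x) to zero leaves the identity.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """out = x + F(x); if W1, W2 are ~0 then F(x) ~ 0 and the block passes x through unchanged."""
    F_x = W2 @ relu(W1 @ x)      # residual branch
    return x + F_x               # identity (skip) connection

x = np.ones(4)
W1 = np.zeros((4, 4))
W2 = np.zeros((4, 4))
print(residual_block(x, W1, W2))   # == x: the identity is the easy default
```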

DenseNet

Directly connect each layer to all future layers


Bidirectional and Multi-layer RNNs


The RNN here could be a simple RNN or an LSTM computation

Forward: $\overrightarrow{h_t} = \text{RNN}_{FW}(\overrightarrow{h}_{t-1}, x_t)$

Backward: $\overleftarrow{h_t} = \text{RNN}_{BW}(\overleftarrow{h}_{t+1}, x_t)$

Concatenated hidden states: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$
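Reusing the rnn_forward sketch from the RNN section above, a bidirectional pass just runs it a second time on the reversed sequence and concatenates the two hidden states per timestep (separate forward/backward parameters are assumed):

```python
import numpy as np

def bidirectional(xs, fw_params, bw_params):
    # rnn_forward(xs, W_h, W_x, b) is the simple RNN sketch defined in the RNN section
    hs_fw = rnn_forward(xs, *fw_params)                 # left-to-right pass
    hs_bw = rnn_forward(xs[::-1], *bw_params)[::-1]     # right-to-left pass, re-aligned to time order
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_fw, hs_bw)]
```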

Overfitting


Early Stop


Weight regularization

L1 norm

Objective adds the penalty $\alpha\|\theta\|_1$; we call this Lasso regression

Gradient descent: pushes more weights to exactly zero (sparse weights)

L2 norm

Objective adds the penalty $\beta\|\theta\|_2^2$; we call this Ridge regression

Decay in weights: $\theta_{t+1} = \theta_t - \eta \nabla_{\theta}\big(L(\theta_t) + \beta\|\theta_t\|_2^2\big) = (1-\lambda)\theta_t - \eta \nabla_{\theta} L(\theta_t)$, where $\lambda = 2\eta\beta$
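That update written out as a sketch (grad_L stands in for $\nabla_{\theta} L(\theta_t)$; all numbers are arbitrary):

```python
import numpy as np

eta, beta = 0.1, 0.01           # learning rate and L2 coefficient
lam = 2 * eta * beta            # effective decay factor

theta = np.array([1.0, -2.0, 0.5])
grad_L = np.array([0.3, -0.1, 0.2])       # stand-in for the data-loss gradient

theta = (1 - lam) * theta - eta * grad_L  # every step shrinks ("decays") the weights toward 0
```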


Dropout layer

Say the dropout rate is p.

During training, each intermediate output value is set to zero (dropped) with probability p; at test time nothing is dropped and the weights (or activations) are multiplied by 1 − p to compensate.

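A sketch of the training-time behaviour using the common "inverted dropout" variant, which rescales by 1/(1 − p) during training instead of multiplying the weights by 1 − p at test time (the two are equivalent in expectation):

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random.default_rng(0)):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return h                               # test time: use all units as-is
    mask = (rng.random(h.shape) >= p)          # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)                # rescale so the expected value is unchanged

h = np.ones(10)
print(dropout(h, p=0.5))   # roughly half the entries are zero, the rest are 2.0
```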

Training strategy

Batch normalization

Internal Covariate Shift (ICS)

  • While training a model, we expect independent and identically distributed data
  • In a deep model, as the parameters of a lower layer change, the input distribution of the upper layers changes as well. This is called Internal Covariate Shift
  • The upper layers' parameters must continuously adapt to the new input distribution, which slows down learning

Problems caused by ICS

  • Requires lower learning rates, so training is slower
  • Requires more careful initialization: activations that land in the saturated regime slow down convergence

Batch normalization

  • Computing the mean and variance over the whole training set is impractical, so use the mini-batch mean and variance instead
  • The scale and shift applied after normalization are network parameters to be learned

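A sketch of the training-time batch-norm transform: mini-batch mean and variance for normalization, plus a learnable scale gamma and shift beta (eps and the shapes are illustrative):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize with mini-batch statistics, then scale and shift."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # gamma, beta are learned network parameters

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```

At test time, running averages of the mean and variance collected during training replace the mini-batch statistics.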