2 Deep Learning

Neural Network

Feed Forward Network

  • Each perceptron in one layer is connected to every perceptron in the next layer
  • Deep = Many hidden layers
  • Last layer still logistic regression


Perceptron (Neuron): $o = f(\sum_{k=1}^n i_k W_k)$, where $f$ is an activation function

  • Identity: Linear Regression
  • Sigmoid: Logistic Regression

One can stack perceptrons into a multi-layer perceptron (MLP)
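A minimal NumPy sketch (the layer sizes and sigmoid activation are illustrative assumptions): one perceptron function, then two of them stacked into a small MLP.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, W, b, f=sigmoid):
    # o = f(sum_k i_k * W_k + b)
    return f(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # 3 input features

# Stack perceptrons into a 2-layer MLP: one hidden layer, one output layer
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # 1 output unit
hidden = perceptron(x, W1, b1)
output = perceptron(hidden, W2, b2)            # last layer is still logistic regression
```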

Activation functions

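A few commonly used activation functions as a minimal sketch (the selection here is illustrative, not exhaustive):

```python
import numpy as np

def identity(z):                 # at the output layer -> linear regression
    return z

def sigmoid(z):                  # at the output layer -> logistic regression
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)
```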

Feedforward and Backpropagation

  • One forward pass: Compute all the intermediate output values
  • One backward pass: Follow the chain rule to calculate the gradients at each layer, then multiply the local gradients together

Chain rule:

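A minimal sketch of one forward and one backward pass on a tiny two-layer network with squared-error loss (the architecture and loss are assumptions for the example); the backward pass multiplies local gradients following the chain rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(1, 3)), np.array([[1.0]])
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))

# Forward pass: compute all the intermediate output values
z1 = x @ W1
h = sigmoid(z1)
yhat = h @ W2
loss = 0.5 * np.sum((yhat - y) ** 2)

# Backward pass: chain rule, layer by layer, then take the products
d_yhat = yhat - y              # dL/dyhat
dW2 = h.T @ d_yhat             # dL/dW2
d_h = d_yhat @ W2.T            # dL/dh
d_z1 = d_h * h * (1 - h)       # dL/dz1 (sigmoid derivative)
dW1 = x.T @ d_z1               # dL/dW1
```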

CNN

Why CNN for images:

  • Some patterns are much smaller than the whole image
  • The same patterns appear in different regions
  • Subsampling the pixels will not change the object


Convolution Layer

Convolutional Filters


One can stack convolution filters into a new tensor; each filter produces one channel of the output


Padding: preserve the input spatial dimensions in the output activations by adding a border of $P$ pixels around the input


Input: $W \times H \times D$ tensor

Hyperparameters:

  • Number of filters: $K$, usually a power of 2 (32, 64, ...)
  • Filter size: $F \times F \times D$, where $F$ is usually an odd number (e.g., 3, 5, 7)
  • Stride: $S$, which describes how we move the filter
  • Padding size: $P$

Output: $W' \times H' \times K$ tensor (count how many strides the filter takes, then add 1 for the starting position)

  • $W' = (W - F + 2P) / S + 1$
  • $H' = (H - F + 2P) / S + 1$

Number of parameters: the layer performs $w^T x + b$ once per filter, $K$ times in total: $(F \times F \times D + 1) \times K$

Usage: with $1 \times 1$ filters, we can keep the width and height but change the number of channels.
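A small helper that applies the formulas above (the example input size is illustrative):

```python
def conv_output_shape(W, H, D, K, F, S, P):
    """Output dimensions and parameter count of a convolution layer."""
    W_out = (W - F + 2 * P) // S + 1
    H_out = (H - F + 2 * P) // S + 1
    n_params = (F * F * D + 1) * K      # +1 for each filter's bias
    return (W_out, H_out, K), n_params

# e.g. a 32x32x3 input, 64 filters of size 3x3, stride 1, padding 1
print(conv_output_shape(32, 32, 3, K=64, F=3, S=1, P=1))
# -> ((32, 32, 64), 1792)
```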

Pooling Layer

We can use pooling as subsampling (downsampling)

Input: $W \times H \times D$ tensor

Hyperparameters:

  • Filter size: $F \times F$
  • Stride: $S$

Output: $W' \times H' \times D$ tensor

  • $W' = (W - F) / S + 1$
  • $H' = (H - F) / S + 1$
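A minimal max-pooling sketch using the same output-size formulas (the 2×2 filter with stride 2 in the example is an assumption):

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """x: (W, H, D) tensor -> (W', H', D) with W' = (W - F)//S + 1."""
    W, H, D = x.shape
    W_out, H_out = (W - F) // S + 1, (H - F) // S + 1
    out = np.zeros((W_out, H_out, D))
    for i in range(W_out):
        for j in range(H_out):
            out[i, j] = x[i*S:i*S+F, j*S:j*S+F].max(axis=(0, 1))
    return out

print(max_pool(np.arange(32.0).reshape(4, 4, 2)).shape)   # (2, 2, 2)
```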

Fully Connected Layer

Flatten


Convolution vs Fully connected


CNN for NLP

  • Kernel size: only one dimension is specified; the other spans the full embedding dimension, e.g. $3\,(\times d)$
  • Parallel filters give different views of the data and can be computed in parallel
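A minimal sketch of convolution over text: each filter spans $k$ tokens and the full embedding dimension $d$, and several filters run in parallel (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 10, 8                      # sentence length, embedding dimension
X = rng.normal(size=(n_tokens, d))       # token embeddings

k, n_filters = 3, 4                      # kernel size 3 (x d), 4 parallel filters
W = rng.normal(size=(n_filters, k, d))

# Each filter slides over windows of k tokens; d is covered entirely
feats = np.array([
    [np.sum(W[f] * X[t:t + k]) for t in range(n_tokens - k + 1)]
    for f in range(n_filters)
])                                        # (n_filters, n_tokens - k + 1)
pooled = feats.max(axis=1)                # max-over-time pooling per filter
```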


RNN

Recurrent Neural Networks: apply the same weights $W$ repeatedly
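A minimal sketch of a vanilla RNN: the same weights $W$, $U$, $b$ are applied at every timestep (shapes and the tanh nonlinearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 5
W = rng.normal(size=(d_h, d_h))          # hidden-to-hidden weights
U = rng.normal(size=(d_h, d_in))         # input-to-hidden weights
b = np.zeros(d_h)
xs = rng.normal(size=(T, d_in))          # input sequence

h = np.zeros(d_h)
for x_t in xs:                           # same W, U, b reused at every step
    h = np.tanh(W @ h + U @ x_t + b)
```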


Advantages:

  • Can process any length input
  • Computation for step t can (in theory) use information from many steps back
  • Model size doesn't increase for longer input context
  • Same weights applied on every timestep, so there is symmetry in how inputs are processed

Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back

Vanishing and Exploding gradients

Vanishing gradients: model weights are updated only with respect to near effects, not long-term effects.


Exploding gradients: If the gradient becomes too big, then the SGD update step becomes too big. We take too large a step and reach a weird and bad parameter configuration.

Solution 1: LSTM

Long Short-Term Memory RNN

On step $t$, there is a hidden state $h_t$ and a cell state $c_t$

  • The cell stores long-term information
  • The LSTM can read, erase (forget), and write information in the cell


$$
\begin{aligned}
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) \\
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i) \\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_c h_{t-1} + U_c x_t + b_c) \\
h_t &= o_t \tanh c_t
\end{aligned}
$$

The LSTM architecture makes it much easier for an RNN to preserve information over many timesteps.

e.g., if the forget gate is set to 1 for a cell dimension and the input gate set to 0, then the information of that cell is preserved indefinitely.
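A minimal NumPy sketch of one LSTM step implementing the equations above (products with the gates are element-wise; the sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wc, Uc, bc = params
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t + bf)   # forget gate
    i_t = sigmoid(Wi @ h_prev + Ui @ x_t + bi)   # input gate
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t + bo)   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ h_prev + Uc @ x_t + bc)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for _ in range(4)
          for s in ((d_h, d_h), (d_h, d_in), (d_h,))]
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, params)
```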

Solution 2: Other techniques

ResNet

The identity connection preserves information by default

Skip connection:

Say, halfway through a normal network, the activations are informative enough to classify the inputs well, but our chosen network still has more layers after that.

If the weights are set to zero ($F(x) = 0$), the block passes its input through unchanged, so the blocks can easily learn the identity function or small updates on top of it
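A minimal sketch of a residual block, $y = F(x) + x$ (the two-layer form of $F$ and the ReLU are assumptions): with zero weights, $F(x) = 0$ and the block is exactly the identity.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    F = W2 @ relu(W1 @ x)        # the learned residual F(x)
    return F + x                 # skip connection adds the identity path

d = 8
x = np.random.default_rng(0).normal(size=d)
W1, W2 = np.zeros((d, d)), np.zeros((d, d))         # F(x) = 0
assert np.allclose(residual_block(x, W1, W2), x)    # block = identity
```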


DenseNet

Directly connect each layer to all future layers


Bidirectional and Multi-layer RNNs


The RNN unit could be a simple RNN or an LSTM computation

Forward: $\overrightarrow{h_t} = \text{RNN}_{FW}(h_{t-1}, x_t)$

Backward: $\overleftarrow{h_t} = \text{RNN}_{BW}(h_{t+1}, x_t)$

Concatenated hidden states: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$
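A minimal sketch of a bidirectional RNN: one pass left-to-right, one right-to-left, and the two hidden states concatenated per timestep (the vanilla RNN step and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 6
xs = rng.normal(size=(T, d_in))

def run_rnn(xs, W, U):
    h, hs = np.zeros(d_h), []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t)
        hs.append(h)
    return hs

W_fw, U_fw = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))
W_bw, U_bw = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))

h_fw = run_rnn(xs, W_fw, U_fw)                  # forward: left to right
h_bw = run_rnn(xs[::-1], W_bw, U_bw)[::-1]      # backward: right to left, re-aligned
h = [np.concatenate([f, b]) for f, b in zip(h_fw, h_bw)]   # [h_fw; h_bw]
```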

Transformer

Problems with CNN/RNN

  • Out-of-vocabulary words
    • Solution: tokenization with sub-words (e.g., "h", "u", "man")
  • Non-contextual embeddings
  • No attention mechanism
  • Sequentiality
    • Solution: include sequential information explicitly (positional embeddings)

Architecture


Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$
$$\text{head}_i = \text{softmax}\left(\text{mask}\left(\frac{Q K^T}{\sqrt{d_k}}\right)\right) V$$
$$Q = X W^Q,\quad K = X W^K,\quad V = X W^V,\quad d_k = d_v = \frac{d_{\text{model}}}{h}$$
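A minimal NumPy sketch of scaled dot-product attention and a loop-based multi-head wrapper following the formulas above; masking is omitted and the projection matrices are random placeholders:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, WQ, WK, WV, WO, h):
    d_k = X.shape[-1] // h                       # d_k = d_v = d_model / h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(X @ WQ[:, sl], X @ WK[:, sl], X @ WV[:, sl]))
    return np.concatenate(heads, axis=-1) @ WO   # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4
X = rng.normal(size=(T, d_model))
WQ, WK, WV, WO = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head(X, WQ, WK, WV, WO, h)           # self-attention: Q, K, V from the same X
```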


tip
  • Self-Attention: $Q = K = V$
  • Cross-Attention: $Q$ comes from one sequence (e.g., decoder), and $K, V$ come from another sequence (e.g., encoder).

Overfitting


Early Stop

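A minimal early-stopping sketch: stop once the validation loss has not improved for `patience` epochs (`train_one_epoch` and `eval_val_loss` are hypothetical callables, not a real API):

```python
def train_with_early_stopping(train_one_epoch, eval_val_loss,
                              max_epochs=100, patience=5):
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = eval_val_loss()
        if val_loss < best_loss:
            best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
            # in practice, also checkpoint the current weights here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                # validation loss stopped improving
    return best_epoch, best_loss
```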

Weight regularization

L1 norm

Objective: add the penalty $\alpha \|\theta\|_1$; we call it Lasso Regression

Gradient descent with this penalty drives more weights to exactly zero (sparse solutions)

L2 norm

Objective: add the penalty $\beta \|\theta\|_2^2$; we call it Ridge Regression

Decay in weights: $\theta_{t+1} = \theta_t - \eta \nabla_{\theta}\big(L(\theta_t) + \tfrac{\lambda}{2}\|\theta_t\|_2^2\big) = (1 - \eta\lambda)\theta_t - \eta \nabla_{\theta} L(\theta_t)$, i.e., each step shrinks the weights by a constant factor before the usual gradient update
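A minimal check that the L2 penalty is equivalent to decaying the weights each step (the toy gradient function is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
eta, lam = 0.1, 0.01

def grad_L(theta):                       # gradient of some loss; illustrative only
    return 2 * (theta - 1.0)

step_l2 = theta - eta * (grad_L(theta) + lam * theta)          # explicit L2 penalty
step_decay = (1 - eta * lam) * theta - eta * grad_L(theta)     # weight-decay form
assert np.allclose(step_l2, step_decay)
```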


Dropout layer

Say the dropout rate is $p$.

During training, each intermediate output value is deleted (set to zero) with probability $p$; at test time nothing is dropped, and the weights are instead multiplied by $1 - p$
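A minimal dropout sketch matching the description above: zero out activations with probability $p$ during training, and scale by $1 - p$ at test time:

```python
import numpy as np

def dropout(h, p, training, rng=np.random.default_rng(0)):
    if training:
        mask = rng.random(h.shape) >= p   # drop each unit with probability p
        return h * mask
    return h * (1 - p)                    # test time: scale instead of dropping

h = np.ones(8)
print(dropout(h, p=0.5, training=True))   # roughly half the units zeroed
print(dropout(h, p=0.5, training=False))  # all units scaled by 0.5
```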
