
1 Regression and Classification

Classification vs Regression:

  • The target variables in classification are discrete, while in regression they are continuous
  • Class labels are fixed and meaningful, while cluster labels are exchangeable

Notations and general concepts

L-norm

Minkowski distance (L-norm):

def minkowski_distance(a, b, p):
    # p-th root of the sum of |a_i - b_i|^p
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)
  • p = 1, Manhattan Distance
  • p = 2, Euclidean Distance
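
A quick sanity check of the function above:

a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, 1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, 2))  # 5.0 (Euclidean)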

Normalization

  • Z-score (see probability notes)
  • Min-max
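
A minimal NumPy sketch of both normalizations (the vector x is illustrative):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

z_score = (x - x.mean()) / x.std()             # zero mean, unit variance
min_max = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]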

Batch Normalization

  • Batch normalize: calculate the mini-batch mean and variance of each layer's inputs to reduce internal covariate shift
  • Scale and shift: learnable parameters can recover the identity transform, so the normalized inputs are not confined to the linear regime of the nonlinearity
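
Concretely, the standard formulation over a mini-batch $B$ is

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are the learned scale and shift.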


Layer Normalization

Compute the normalization statistics over all the hidden units in the same layer (per example, rather than over the batch)
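
A minimal NumPy sketch of the batch-norm vs layer-norm difference (axis 0 = batch, axis 1 = hidden units; $\epsilon$ and the learned scale/shift are omitted):

import numpy as np

h = np.random.randn(32, 64)  # activations: (batch, hidden units)

# Batch norm: statistics per hidden unit, computed across the batch
bn = (h - h.mean(axis=0)) / h.std(axis=0)

# Layer norm: statistics per example, computed across the hidden units
ln = (h - h.mean(axis=1, keepdims=True)) / h.std(axis=1, keepdims=True)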

Cross Validation

simple cross validation:

  1. Randomly divide the training data into $S_{train}$ and $S_{cv}$ (cross-validation set)
  2. For each model, train on $S_{train}$
  3. Estimate the error on the validation set $S_{cv}$
  4. Pick the model with the smallest validation error

K fold cross validation:

  1. Randomly divide the training data into k equal parts $S_1, S_2, \dots, S_k$
  2. For each model, train on all the data except $S_j$
  3. Estimate the error on the validation set $S_j$: $error_{S_j}$
  4. Repeat k times and pick the model with the smallest average error

$$Error = \frac{1}{k} \sum_{j=1}^k error_{S_j}$$
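
A minimal NumPy sketch of this procedure (train_fn and error_fn are hypothetical placeholders for a model's fit and evaluation routines):

import numpy as np

def k_fold_error(X, y, k, train_fn, error_fn):
    # Shuffle indices and split them into k roughly equal folds
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        model = train_fn(X[train], y[train])            # train on all but S_j
        errors.append(error_fn(model, X[val], y[val]))  # error on S_j
    return np.mean(errors)                              # Error = (1/k) * sum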

Leave-one-out cross validation:

Repeatedly train on all but one of the training examples and test on that held-out example.

K fold vs LOO:

  • K-fold is much faster (k trainings instead of N)
  • K-fold is more biased (LOO is preferred for small or sparse datasets)

Metrics


Linear Regression

Say we fit a linear function $f(x,\theta) = \theta^T x$

and minimize a loss function $L(\theta) = \sum_{i=1}^N l(f(x^{(i)},\theta),\, y^{(i)})$

Ordinary Least Square Regression

  • $\theta$: $\omega$, $b$
  • $l$: square loss

$$L = \frac{1}{2}\sum_{i=1}^N\big(f(x^{(i)},\theta) - y^{(i)}\big)^2$$ (the factor $\frac{1}{2}$ cancels the 2 when taking gradients)

Gradient descent

For square loss,

$$\frac{\partial L(\theta)}{\partial \theta_j} = x_j\,(h_{\theta}(x) - y)$$

(per example, with $h_{\theta}(x) = \theta^T x$)

  1. Initialize $\theta_0$ randomly, t = 0
  2. Repeat: $\theta_{t+1} = \theta_{t} - \eta \nabla_{\theta} L(\theta_t)$, t = t+1

$\eta$ is the learning rate

  • too small: very slow progress
  • too big: overshoots the minimum
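
A minimal sketch of the full loop on square loss (synthetic data; $\eta = 0.1$ is illustrative):

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)             # features
y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known theta
theta, eta = np.zeros(3), 0.1

for t in range(500):
    grad = X.T @ (X @ theta - y) / len(X)  # gradient of (1/2N) sum (h - y)^2
    theta -= eta * grad                    # theta_{t+1} = theta_t - eta * grad
# theta is now close to [1.0, -2.0, 0.5]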

Adjust Learning rate

  • constant learning rate
  • linear decay: linearly decreases over the training period
  • cosine learning rate: the max learning rate scaled by a cosine factor that decays from 1 to 0 over training
  • adaptive learning rate

Adagrad:

$$\begin{aligned} \eta_t &= \frac{\eta}{\sqrt{t+1}}, \quad g_t = \nabla_{\theta} L(\theta_t) \\ \theta_{t+1} &= \theta_t - \frac{\eta_t}{\sigma_t}g_t \\ \sigma_t &= \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}g^2_i} \end{aligned}$$

The $\sqrt{t+1}$ factors cancel, so the effective step is $\frac{\eta}{\sqrt{\sum_{i=0}^t g_i^2}}\, g_t$.
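
A minimal sketch of the per-parameter Adagrad update, using the cancelled form (eps is an implementation detail to avoid division by zero, not part of the formula above):

import numpy as np

def adagrad_step(theta, grad, g_sq_sum, eta=0.1, eps=1e-8):
    # Accumulate squared gradients, then scale the step per parameter
    g_sq_sum = g_sq_sum + grad ** 2
    theta = theta - eta * grad / (np.sqrt(g_sq_sum) + eps)
    return theta, g_sq_sum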

Stochastic Gradient Descent

Gradient descent can be slow: every update uses the entire training set

Stochastic Gradient Descent (SGD)

Sample a mini-batch of training data to estimate $\nabla L(\theta_t)$
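
A minimal sketch of one SGD step, reusing the square-loss gradient from the earlier example (batch size is illustrative):

import numpy as np

def sgd_step(X, y, theta, eta=0.1, batch_size=16):
    # Sample a mini-batch to estimate the full gradient
    idx = np.random.choice(len(X), batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size
    return theta - eta * grad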

Classification

Assume in $C_1$, $X \sim N(\mu_1,\sigma^2)$; in $C_2$, $X \sim N(\mu_2,\sigma^2)$

Apply min-max normalization to the target variable $\rightarrow [0,1]$


Bayes' rule:

$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}$$

we can show that logistic regression follows:

$$\log\frac{P(C_1|x)}{P(C_2|x)} = \theta^T x$$

Since $P(C_1|x) + P(C_2|x) = 1$, solving for $P(C_1|x)$ gives

$$P(C_1|x) = \frac{1}{1+e^{-\theta^T x}} = \frac{1}{1+e^{-z}}$$

We call this the sigmoid function.
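
The sigmoid in code (a direct translation of the formula above):

import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}): maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))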


Maximum Likelihood Estimate

Let us assume that $P(y=1|x;\theta) = h_{\theta}(x)$, $y_i \in \{0,1\}$

$$p(x_1,x_2,\dots,x_n;\theta) = \prod_{i=1}^n h_{\theta}(x_i)^{y_i}\,(1-h_{\theta}(x_i))^{1-y_i}$$

Cross-Entropy Loss

$$L(\theta) = \log p(x_1,x_2,\dots,x_n;\theta) = \sum_{i=1}^n y_i\log h_{\theta}(x_i) + (1-y_i)\log (1-h_{\theta}(x_i))$$

The cross-entropy loss is the negative of this log-likelihood, so maximizing $L(\theta)$ minimizes the cross-entropy.

Then

$$\frac{\partial L(\theta)}{\partial \theta_j} = x_j(y-h_{\theta}(x))$$

the same form as the square-loss gradient, so the update rule matches the gradient descent above (here we ascend on the log-likelihood, which is equivalent to descending on the cross-entropy loss)
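
A minimal sketch of logistic regression trained with this gradient (ascent on the log-likelihood; the data and labels are synthetic):

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 2)
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)  # synthetic 0/1 labels
theta, eta = np.zeros(2), 0.1

for t in range(1000):
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x) for every example
    theta += eta * X.T @ (y - h) / len(X)   # ascend along x_j * (y - h)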

Multiclass Classification

If we rewrite the sigmoid as $P(y_1|x) = \frac{e^{\theta^Tx}}{e^{\theta^Tx} + 1}$, we can extend it to multiple classes and show that

$$P(y=i|x;\theta) = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{k}\exp(\theta_j^T x)}$$

We define softmax: $\mathrm{softmax}(t_1, \dots, t_k) = \frac{\exp(\vec{t})}{\sum_{j=1}^{k}\exp(t_j)}$
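
In code, softmax is usually computed after subtracting the max, a standard trick for numerical stability (not specific to these notes):

import numpy as np

def softmax(t):
    # Subtracting max(t) leaves the result unchanged but avoids overflow in exp
    e = np.exp(t - np.max(t))
    return e / e.sum()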

In fact, that's the same as

$$p(y_i|x) = \frac{p(x|y_i)p(y_i)}{\sum_j p(x|y_j)p(y_j)}$$

We can also show that the MLE loss is the same as the cross-entropy $CE(y,\hat y)$.