1 Regression and Classification
Classification vs Regression:
- The target variables in classification are discrete, while in regression they are continuous
- Classes have fixed, meaningful labels, while clusters (in clustering) are exchangeable: relabeling them does not change the clustering
Notations and general concepts
Lp norm
Minkowski distance (Lp norm)
def minkowski_distance(a, b, p):
    # Lp (Minkowski) distance between two equal-length sequences
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)
- p = 1, Manhattan Distance
- p = 2, Euclidean Distance
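A quick check of the two special cases using the function above (the two points are just an illustrative choice):

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, 1))  # Manhattan distance: 7.0
print(minkowski_distance(a, b, 2))  # Euclidean distance: 5.0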
Normalization
- Z-score (see Probability)
- Min-max
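A minimal NumPy sketch of the two normalizations, applied column-wise; the function names are my own, and a small epsilon for constant columns is omitted:

import numpy as np

def min_max_normalize(X):
    # Rescale each feature to [0, 1]: (x - min) / (max - min)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def z_score_normalize(X):
    # Standardize each feature: (x - mean) / std
    return (X - X.mean(axis=0)) / X.std(axis=0)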
Batch Normalization
- Batch normalize: compute the mini-batch mean and variance of each layer's inputs and normalize with them, to reduce internal covariate shift
- Scale and shift: learnable parameters let the network recover the identity transform, so normalization does not confine the inputs to the linear regime of the nonlinearity (see the sketch below)
Layer Normalization
Compute the normalization statistics over all the hidden units in the same layer, per example (instead of over the mini-batch)
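A rough NumPy sketch of the difference between the two: the computation is identical except for the axis over which the statistics are taken (gamma, beta and eps are simplified to scalars here; X is assumed to have shape (batch, hidden_units)):

import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each hidden unit over the mini-batch dimension (axis=0),
    # then scale and shift so the identity transform can be recovered.
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

def layer_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize over all hidden units of the same layer (axis=1), per example.
    mean = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return gamma * (X - mean) / np.sqrt(var + eps) + beta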
Cross Validation
simple cross validation:
- Randomly divide the training data into S_train and S_cv (the cross-validation set)
- For each model, train on S_train
- Estimate the error on the validation set S_cv
- Pick the model with the smallest validation error
K fold cross validation:
- Randomly divide the training data into k equal parts S_1, S_2, ..., S_k
- For each model, train on all the data except S_j
- Estimate the error on the validation set S_j: error_{S_j}
- Repeat k times; pick the model with the smallest average error (see the sketch below)
Error = (1/k) ∑_{j=1}^{k} error_{S_j}
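A sketch of the K-fold procedure above, assuming X and y are NumPy arrays, a hypothetical model object with fit/predict methods, and squared error as the metric:

import numpy as np

def k_fold_cv_error(model, X, y, k=5, seed=0):
    # Randomly split the indices into k roughly equal parts S_1, ..., S_k
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        val = folds[j]                                   # held-out part S_j
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        model.fit(X[train], y[train])                    # train on everything except S_j
        pred = model.predict(X[val])
        errors.append(np.mean((pred - y[val]) ** 2))     # error_{S_j}
    return np.mean(errors)                               # (1/k) * sum_j error_{S_j}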
Leave-one-out cross validation:
repeatedly train on all but one of the training examples and test on that
held-out example
K fold vs LOO:
- K-fold is much faster than LOO
- but more biased (LOO is preferred for small/sparse datasets)
Metrics
Linear Regression
Say we fit a function f(x, θ) = θ^T x and
minimize a loss function L(θ) = ∑_{i=1}^{N} l(f(x^{(i)}, θ), y^{(i)})
Ordinary Least Square Regression
- θ: the weights w and the bias b
- l: squared loss
L = (1/2) ∑_{i=1}^{N} (f(x^{(i)}, θ) − y^{(i)})^2 (the 1/2 is there so the factor of 2 cancels when taking gradients)
Gradient descent
For the squared loss, the gradient components are
∂L(θ)/∂θ_j = ∑_{i=1}^{N} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
- Initialize θ_0 randomly, t = 0
- Repeat: θ_{t+1} = θ_t − η ∇_θ L(θ_t), t = t + 1 (a code sketch follows below)
η is the learning rate
- too small: not much progress
- too big: overshoot
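A minimal sketch of batch gradient descent for the OLS loss above (the bias is assumed to be folded into θ via a constant feature; η and the step count are placeholders):

import numpy as np

def ols_gradient_descent(X, y, eta=0.01, steps=1000):
    # X: (N, d) design matrix, y: (N,) targets
    theta = np.random.randn(X.shape[1])          # initialize theta_0 randomly
    for _ in range(steps):
        residual = X @ theta - y                 # h_theta(x) - y
        grad = X.T @ residual                    # gradient of (1/2) * sum of squared errors
        theta = theta - eta * grad               # theta_{t+1} = theta_t - eta * grad
    return theta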
Adjust Learning rate
- constant learning rate
- linear decay: linearly decreases over the training period
- cosine learning rate: the maximum learning rate scaled by a cosine factor that decays over the training period (see the sketch after this list)
- adaptive learning rate
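A small sketch of a cosine schedule, assuming the common form that decays from a maximum learning rate to zero over the training period:

import math

def cosine_lr(step, total_steps, eta_max):
    # eta_t = eta_max * 0.5 * (1 + cos(pi * step / total_steps)),
    # which decays smoothly from eta_max (step 0) to 0 (final step).
    return eta_max * 0.5 * (1 + math.cos(math.pi * step / total_steps))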
Adagrad:
η_t = η / √(t + 1),  g_t = ∇_θ L(θ_t)
σ_t = √( (1 / (t + 1)) ∑_{i=0}^{t} g_i² )
θ_{t+1} = θ_t − (η_t / σ_t) g_t
Stochastic Gradient Descent
Gradient descent could be slow: Every update involves the entire training data
Stochastic Gradient Descent (SGD)
Sample a mini-batch of training data to estimate ∇_θ L(θ_t)
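A sketch of mini-batch SGD for the squared loss above, with an optional Adagrad-style per-parameter step size following the update rule given earlier (batch size, η and the step count are arbitrary placeholders):

import numpy as np

def sgd(X, y, eta=0.01, batch_size=32, steps=1000, adagrad=False, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    accum = np.zeros(X.shape[1])                 # running sum of squared gradients (Adagrad)
    for _ in range(steps):
        batch = rng.choice(len(X), size=batch_size, replace=False)      # sample a mini-batch
        grad = X[batch].T @ (X[batch] @ theta - y[batch]) / batch_size  # estimate of grad L(theta_t)
        if adagrad:
            accum += grad ** 2
            theta = theta - eta * grad / np.sqrt(accum + 1e-8)          # eta_t / sigma_t scaling
        else:
            theta = theta - eta * grad
    return theta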
Classification
Assume in C_1, X ∼ N(μ_1, σ²); in C_2, X ∼ N(μ_2, σ²)
Apply min-max normalization for the target variable → [0,1]
Bayes' rule:
P(C_1|x) = P(x|C_1)P(C_1) / [ P(x|C_1)P(C_1) + P(x|C_2)P(C_2) ]
Logistic Regression
We can show that
log [ P(C_1|x) / P(C_2|x) ] = θ^T x
As P(C_1|x) + P(C_2|x) = 1,
P(C_1|x) = 1 / (1 + e^{−θ^T x}) = 1 / (1 + e^{−z})
We call this the "sigmoid"
Maximum Likelihood Estimate
Let us assume that P(y = 1 | x; θ) = h_θ(x), with y_i ∈ {0, 1}
p(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n; θ) = ∏_{i=1}^{n} h_θ(x_i)^{y_i} (1 − h_θ(x_i))^{1 − y_i}
Cross-Entropy Loss
L(θ) = log p(y_1, ..., y_n | x_1, ..., x_n; θ) = ∑_{i=1}^{n} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
Then
∂L(θ)/∂θ_j = ∑_{i=1}^{n} (y_i − h_θ(x_i)) x_{i,j}
Gradient ascent on this log-likelihood has the same update form as gradient descent on the squared loss
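A sketch of logistic regression trained with this gradient, i.e. gradient ascent on the log-likelihood (equivalently, gradient descent on the cross-entropy loss); η and the step count are placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, eta=0.1, steps=1000):
    # X: (N, d) design matrix, y: (N,) labels in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ theta)            # h_theta(x) = P(y = 1 | x)
        grad = X.T @ (y - h)              # dL/dtheta_j = sum_i (y_i - h_theta(x_i)) * x_ij
        theta = theta + eta * grad        # ascent on the log-likelihood
    return theta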
Multiclass Classification
If we look at the sigmoid function, P(y_1|x) = e^{θ^T x} / (e^{θ^T x} + 1); we can extend it to multiple classes and show that
P(y = k | x; θ) = exp(θ_k^T x) / ∑_{j=1}^{K} exp(θ_j^T x)
We define the softmax: softmax(t_1, ..., t_k)_i = exp(t_i) / ∑_{j=1}^{k} exp(t_j)
In fact, that is the same as
p(y_i|x) = p(x|y_i) p(y_i) / ∑_j p(x|y_j) p(y_j)
We can also show that the MLE loss is the same as the cross-entropy CE(y, ŷ)
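A small sketch of the softmax defined above; subtracting the maximum is a standard numerical-stability trick that does not change the result:

import numpy as np

def softmax(t):
    # softmax(t)_i = exp(t_i) / sum_j exp(t_j)
    t = np.asarray(t, dtype=float)
    e = np.exp(t - t.max())        # subtract the max for numerical stability
    return e / e.sum()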