Our data \((\mathbf y, \mathbf X)\) is divided into individual observations \((y_i, \mathbf x_i)\).
We create a function \(\hat y_i = f(\mathbf x_i)\).
The best predictor of \(y\) given \(x\), in the mean-squared-error sense, is the conditional expectation:
\(g(X)=E[Y|X]\)
The goal of regression is to find an approximation of this function.
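As a quick sanity check, here is a minimal simulation with a made-up data-generating process (\(y = \sin(x) +\) Gaussian noise, so \(E[Y|X=x]=\sin(x)\)), showing that the conditional mean achieves a lower mean squared error than a competing predictor, here a straight-line fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data-generating process (an assumption for illustration):
# y = sin(x) + Gaussian noise, so E[Y | X = x] = sin(x).
n = 100_000
x = rng.uniform(0, 2 * np.pi, n)
y = np.sin(x) + rng.normal(0, 0.5, n)

# Compare the conditional mean against another predictor, e.g. a linear fit.
mse_cond_mean = np.mean((y - np.sin(x)) ** 2)
coef = np.polyfit(x, y, 1)                     # straight-line competitor
mse_linear = np.mean((y - np.polyval(coef, x)) ** 2)

print(f"MSE of E[Y|X] = sin(x): {mse_cond_mean:.3f}")  # ~0.25 (the noise variance)
print(f"MSE of a linear fit:    {mse_linear:.3f}")     # strictly larger
```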
The residual is \(\epsilon_i = y_i- \hat y_i\).
\(RSS=\sum_i \epsilon_i^2\)
\(RSS=\sum_i (y_i-\hat y_i)^2\)
\(ESS=\sum_i (\bar y-\hat y_i)^2\)
\(TSS=\sum_i (y_i-\bar y)^2\)
A probabilistic model gives a predictive distribution \(P(y|X, \theta )\).
To get a point prediction \(\hat y =f(\mathbf x)\) from it, we can take the expectation, via integration:
\(E[y] = \int y\, P(y|X, \theta )\, dy\)
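A minimal numerical sketch, assuming a Gaussian predictive density purely for illustration, showing that integrating \(y\,P(y|X,\theta)\) over \(y\) recovers the predictive mean:

```python
import numpy as np

# Assumed predictive density p(y | x, theta): Gaussian with mean 2.0, sd 0.5.
mu, sigma = 2.0, 0.5
y = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 10_001)
p = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# E[y] = integral of y * p(y | x, theta) dy, approximated by a Riemann sum.
dy = y[1] - y[0]
e_y = np.sum(y * p) * dy
print(e_y)  # ~2.0, recovering the mean of the predictive density
```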
\(R^2= 1-\dfrac{RSS}{TSS}\)
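A short sketch on toy data (a made-up linear relationship, fitted by ordinary least squares via `np.polyfit`) computing the sums of squares defined above and \(R^2\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem (illustrative assumption): y is roughly linear in x.
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 2.0, 200)

# Ordinary least-squares straight-line fit.
coef = np.polyfit(x, y, 1)
y_hat = np.polyval(coef, x)
y_bar = y.mean()

rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ess = np.sum((y_hat - y_bar) ** 2)  # explained sum of squares
tss = np.sum((y - y_bar) ** 2)      # total sum of squares

r2 = 1 - rss / tss
print(f"RSS={rss:.1f}  ESS={ess:.1f}  TSS={tss:.1f}  R^2={r2:.3f}")
# For least squares with an intercept, TSS = ESS + RSS (up to rounding).
```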
Classification models are a type of regression model, where \(y\) is discrete rather than continuous.
So we want to find a mapping from a vector \(X\) to probabilities across discrete \(y\) values.
For a classifier we have \(K\) classes, so the classifier takes \(X\) and returns a vector with one entry per class.
Confusion matrix: true positive, false positive, false negative, true negative.
We can use these counts to compute:
Accuracy: percentage of all predictions that are correct
Precision: percentage of positive predictions which are correct
Recall (sensitivity): percentage of positive cases that were predicted as positive
Specificity: percentage of negative cases that were predicted as negative
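A minimal sketch with made-up confusion-matrix counts:

```python
# Metrics from a 2x2 confusion matrix; the counts below are made up.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions correct
precision   = tp / (tp + fp)                   # fraction of positive predictions correct
recall      = tp / (tp + fn)                   # fraction of actual positives found
specificity = tn / (tn + fp)                   # fraction of actual negatives found

print(accuracy, precision, recall, specificity)
```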
Multiclass classification
What if an email can be for work, friends, family, or a hobby?
Include error types here
A hard classifier can return a sparse vector with \(1\) in the position of the predicted class.
A soft classifier returns probabilities for each entry in the vector.
Each entry \(k\) of the vector represents \(P(Y=k|X=x)\).
For a binary soft classifier we can use a cutoff (threshold) on the predicted probability to make a hard decision.
If there are more than two classes we can choose the class with the highest score.
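A small sketch (made-up probabilities) showing a soft classifier's output turned into hard decisions, by a cutoff in the binary case and by the highest score in the multiclass case:

```python
import numpy as np

# Soft classifier output for 3 examples and K = 3 classes (made-up probabilities).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.1, 0.8],
])

# Multiclass hard decision: pick the class with the highest probability.
hard = probs.argmax(axis=1)
print(hard)                              # [0 1 2]

# Sparse "one-hot" hard output with 1 in the chosen class.
one_hot = np.eye(3)[hard]
print(one_hot)

# Binary case: apply a cutoff of 0.5 to the positive-class probability.
p_positive = np.array([0.9, 0.4, 0.6])
print((p_positive >= 0.5).astype(int))   # [1 0 1]
```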
Mean estimate: we can take the mean of a distribution as a single point estimate, either for a parameter or for a predicted value of \(y\).
Linear models
With Gaussian noise, maximum likelihood estimation (MLE) is the same as minimising squared-error loss.
Maximum a posteriori (MAP) estimation with a Gaussian prior is the same as squared-error loss with regularisation.
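A sketch of this equivalence for a linear model, assuming Gaussian noise and (for MAP) a zero-mean Gaussian prior on the weights, so that MLE reduces to ordinary least squares and MAP to ridge regression; the data and the value of `lam` are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear model y = X w + Gaussian noise (assumed for this sketch).
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 1.0, n)

# MLE under Gaussian noise = ordinary least squares (squared-error loss).
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior on w = ridge regression
# (squared-error loss + L2 regularisation); lam reflects the prior's strength.
lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_mle)
print(w_map)  # shrunk toward zero relative to the MLE
```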
For classification probabilities we don't want answers outside \(0\) and \(1\).
\(F_1\) score: \(F_1=\dfrac{2PR}{P+R}\), where \(P\) is precision and \(R\) is recall.
We may not just care about accuracy, e.g. in breast cancer screening, where false negatives are far more costly than false positives.
High accuracy can result from a very basic model (e.g. predicting that everyone on the Titanic died).
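A quick illustration with made-up imbalanced labels (95% negative): a model that always predicts the majority class scores high accuracy but zero recall:

```python
import numpy as np

# Imbalanced toy labels: 95% negative, 5% positive (illustrative numbers).
y = np.array([0] * 95 + [1] * 5)

# A "very basic model" that always predicts the majority class.
y_hat = np.zeros_like(y)

accuracy = np.mean(y_hat == y)
tp = np.sum((y_hat == 1) & (y == 1))
fn = np.sum((y_hat == 0) & (y == 1))
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # 0.95, looks great
print(f"recall   = {recall:.2f}")    # 0.00, misses every positive case
```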
We know:
\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)
\(P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}\)
The denominator is a normalisation factor, and so we can use:
\(P(\theta |y,X)\propto P(y, X| \theta)P(\theta)\)
We have here:
Our prior - \(P(\theta )\)
Our posterior - \(P(\theta |y,X)\)
Our likelihood function - \(P(y, X| \theta )\)
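A minimal sketch of posterior \(\propto\) likelihood \(\times\) prior, using a grid approximation for a coin's heads probability \(\theta\), with made-up data (7 heads in 10 flips) and a flat prior:

```python
import numpy as np

# Grid approximation of the posterior over theta (probability of heads),
# with made-up data: 7 heads in 10 flips, and a flat prior.
theta = np.linspace(0, 1, 1001)
prior = np.ones_like(theta)              # flat prior P(theta)
likelihood = theta**7 * (1 - theta)**3   # Bernoulli likelihood of the data

unnormalised = likelihood * prior        # P(theta | data) up to a constant
dtheta = theta[1] - theta[0]
posterior = unnormalised / (unnormalised.sum() * dtheta)  # normalise numerically

print(theta[posterior.argmax()])  # posterior mode, ~0.7 with a flat prior
```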
We can measure the risk of a classifier: the probability of misclassification.
\(R(C)=P(C(X)\ne Y)\)
The Bayes classifier is the classifier \(C(X)\) which minimises this risk.
It takes the output of the soft classifier and chooses the class with the highest probability \(P(Y=k|X=x)\).
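A simulation sketch with a made-up two-class setup (equal priors, Gaussian class-conditionals centred at \(\pm 1\)) where the posterior is known in closed form, showing the Bayes classifier as the highest-posterior choice and estimating its risk \(P(C(X)\ne Y)\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two equally likely classes with Gaussian class-conditionals N(-1, 1) and N(+1, 1).
n = 100_000
y = rng.integers(0, 2, n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def posterior_class1(x):
    # P(Y=1 | X=x) via Bayes' rule with equal priors (shared constants cancel).
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    return p1 / (p0 + p1)

# Bayes classifier: pick the class with the higher posterior probability.
c = (posterior_class1(x) >= 0.5).astype(int)   # equivalent to x >= 0 here

risk_estimate = np.mean(c != y)                # estimate of R(C) = P(C(X) != Y)
print(f"estimated risk: {risk_estimate:.3f}")  # ~0.159 for this setup
```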