If there are more independent variables than samples, OLS will not work: there are infinitely many perfect fits.
For example, if we regress height on genetic information for \(1000\) people, there is too little data relative to the number of predictors to fit using OLS.
This is due to collinearity: with more variables than samples, the columns of \(X\) are linearly dependent and \(X^TX\) is singular.
We could also end up with too many variables through the use of derived variables, for example if we choose to include \(x\), \(x^2\), \(x^3\), etc.
Optimal is \(\lambda = 2\sigma\sqrt{2\log(pn)/n}\)
This relies on knowing \(\sigma\), which we may not.
Instead we can use the square-root LASSO.
Minimise the square root of the sum-of-squares loss divided by \(n\), and use \(\lambda = \sqrt{2\log (pn)/n}\)
This choice of \(\lambda\) does not involve \(\sigma\).
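A minimal sketch of the square-root LASSO objective above, written as a convex program with cvxpy (an assumed dependency); the data is simulated purely for illustration.

```python
import numpy as np
import cvxpy as cp

# Simulated data (illustrative only): n samples, p features, sparse true weights.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = 2.0
y = X @ w_true + 0.5 * rng.standard_normal(n)

# Square-root LASSO: minimise sqrt(RSS / n) + lambda * ||w||_1,
# with the sigma-free choice lambda = sqrt(2 log(p n) / n).
lam = np.sqrt(2 * np.log(p * n) / n)
w = cp.Variable(p)
objective = cp.Minimize(cp.norm(y - X @ w, 2) / np.sqrt(n) + lam * cp.norm1(w))
cp.Problem(objective).solve()

print("non-zero coefficients:", int(np.sum(np.abs(w.value) > 1e-4)))
```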
LASSO is biased, and sets many of the estimated coefficients to exactly \(0\).
We can use LASSO for model selection, then run OLS on only the selected variables.
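A hedged sketch of that two-stage procedure with scikit-learn on simulated data: LASSO selects the variables, then plain OLS is refitted on the selected columns only (the penalty strength alpha=0.1 is chosen by hand here).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Simulated data (illustrative only).
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [1.5, -2.0, 0.8]
y = X @ w_true + rng.standard_normal(n)

# Stage 1: LASSO for model selection.
selected = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)

# Stage 2: OLS on the selected columns only, to remove the shrinkage bias.
ols = LinearRegression().fit(X[:, selected], y)
print("selected columns:", selected)
print("OLS coefficients on those columns:", np.round(ols.coef_, 2))
```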
With LASSO we add a constraint on \(\hat \theta\):
\(\sum_i |\hat \theta_i| \le t\)
Regularisation of LLS. The sum of the absolute values of the coefficients is constrained to be below the hyperparameter \(t\)
L1 regularisation
This is also known as sparse regression, because many weights are set to \(0\).
This now looks like:
\(w_{lasso} = \arg \min ||y-Xw||^2_2+\lambda ||w||_1\)
\(t\) is a hyperparameter.
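A quick illustration of how the penalty strength controls sparsity in the penalised form above, using scikit-learn on simulated data (note that scikit-learn's Lasso scales the squared loss by \(1/(2n)\), so its alpha is not numerically identical to the \(\lambda\) written here).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated data (illustrative only).
rng = np.random.default_rng(2)
n, p = 100, 30
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)

# Larger penalties drive more weights exactly to zero (sparse regression).
for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: {np.sum(coef == 0)} of {p} weights are exactly 0")
```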
Regularisation of LLS. The cost function now includes a norm penalty on \(M\theta\).
\(L_2\) regularisation
This allows us to solve problems where there are too many features; \(L_1\) regularisation also allows this.
If \(d>n\) we can minimise \(||w||_2\) subject to \(Xw=y\). This is the least-norm solution.
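A small numpy check of the least-norm idea, assuming \(d>n\): the Moore-Penrose pseudoinverse returns the minimum-norm \(w\) that satisfies \(Xw=y\) exactly (simulated data for illustration).

```python
import numpy as np

# Under-determined system: more features (d) than samples (n).
rng = np.random.default_rng(3)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm solution to Xw = y via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("fits exactly:", np.allclose(X @ w_min_norm, y))
print("norm of solution:", round(float(np.linalg.norm(w_min_norm)), 3))
```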
Maximum a posteriori (MAP) estimation: equivalent to ridge regression with a prior estimate of \(0\) for the weights.
\(w_{RR}=(\lambda I+X^TX)^{-1}X^Ty\)
\(E[w_{RR}]=(\lambda I+X^TX)^{-1}X^TXw\)
\(\mathrm{Var}[w_{RR}]=\sigma^2(\lambda I+X^TX)^{-1}X^TX(\lambda I+X^TX)^{-1}\)
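A Monte Carlo sketch (simulated data, illustrative \(\lambda\) and \(\sigma\)) checking the formulas above: averaging \(w_{RR}\) over many noise draws should approach \((\lambda I + X^TX)^{-1}X^TXw\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam, sigma = 200, 5, 10.0, 1.0
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)

# Ridge estimator as a linear map of y: w_RR = (lam I + X'X)^{-1} X'y.
A = np.linalg.inv(lam * np.eye(d) + X.T @ X) @ X.T

# Average the ridge estimator over many independent noise draws.
estimates = [A @ (X @ w_true + sigma * rng.standard_normal(n)) for _ in range(5000)]
w_bar = np.mean(estimates, axis=0)

expected = np.linalg.inv(lam * np.eye(d) + X.T @ X) @ X.T @ X @ w_true
print("Monte Carlo mean:   ", np.round(w_bar, 3))
print("theoretical E[w_RR]:", np.round(expected, 3))
```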
Regularisation of LLS. Combines LASSO and ridge regression (the elastic net).
\(L_1\) and \(L_2\) regularisation
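A short scikit-learn sketch of the combined \(L_1\)/\(L_2\) penalty on simulated data; note that scikit-learn's ElasticNet parameterises the mix with alpha and l1_ratio rather than two separate \(\lambda\)s, and the values used here are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Simulated data (illustrative only).
rng = np.random.default_rng(5)
n, p = 150, 40
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)

# l1_ratio interpolates between ridge (0) and lasso (1).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero weights:", int(np.sum(enet.coef_ != 0)), "of", p)
```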
We can generalise this to:
\(w_{l_p} = \arg \min ||y-Xw||^2_2+\lambda ||w||^p_p\)
For ridge regression there is always a closed-form solution.
For least squares there is a unique solution if \(X^TX\) is invertible.
For LASSO there is no closed form, so we must use numerical optimisation.
LASSO and the \(L_1\) penalty induce sparsity.
Goal is \(\min_w ||y - f(x)||^2 + \lambda g(w)\)
Ridge regression: \(g(w)=||w||_2^2\)
If \(\lambda =0\) we recover OLS; as \(\lambda \to \infty\), \(w \to 0\).
The normal equation changes to: \(w = (\lambda I + X^TX)^{-1}X^Ty\)
We can preprocess to avoid fitting a column of 1s (the intercept): shift the mean of \(y\) to \(0\), and normalise each column of \(x\) to mean \(0\) and variance \(1\).
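A small numpy sketch of that recipe on simulated data: centre \(y\), standardise the columns of \(X\), then apply the modified normal equation; scikit-learn's Ridge with fit_intercept=False should agree.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Simulated data (illustrative only).
rng = np.random.default_rng(6)
n, d, lam = 100, 8, 5.0
X_raw = rng.standard_normal((n, d)) * 3 + 1
y_raw = X_raw @ rng.standard_normal(d) + rng.standard_normal(n)

# Preprocess: centre y, standardise each column of X (so no column of 1s is needed).
y = y_raw - y_raw.mean()
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Modified normal equation: w = (lam I + X'X)^{-1} X'y.
w = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print("matches sklearn Ridge:", np.allclose(w, ridge.coef_))
```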
Alternative to ElasticNet
Each parameter is split into
\(\theta_i=\rho_i+\phi_i\)
There is an \(L_2\) penalty on \(\rho\) and an \(L_1\) penalty on \(\phi\).
This means that large coefficients can be penalised like \(L_1\) and small coefficients like \(L_2\).
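One concrete way to write this split penalty down is as a convex program; below is a hedged cvxpy sketch on simulated data, with purely illustrative penalty weights lam1 and lam2.

```python
import numpy as np
import cvxpy as cp

# Simulated data (illustrative only).
rng = np.random.default_rng(7)
n, p = 80, 20
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(n)

# Split each coefficient: theta_i = rho_i + phi_i, L2 penalty on rho, L1 penalty on phi.
rho, phi = cp.Variable(p), cp.Variable(p)
lam1, lam2 = 0.5, 0.5  # illustrative hyperparameters
objective = cp.Minimize(
    cp.sum_squares(y - X @ (rho + phi))
    + lam2 * cp.sum_squares(rho)
    + lam1 * cp.norm1(phi)
)
cp.Problem(objective).solve()

theta = rho.value + phi.value
print("combined coefficients (first 5):", np.round(theta[:5], 3))
```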
The Ramsey Regression Equation Specification Error Test (RESET)
Once we have done our OLS we have \(\hat y\).
The Ramsey RESET test is an additional stage, which takes these predictions and estimates:
\(y=\theta x+\sum_{i=2}^{4}\alpha_i \hat y^i\)
We then run an F-test on the \(\alpha_i\), with the null hypothesis that all \(\alpha_i = 0\).
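A hedged statsmodels sketch of the test on simulated data with a mild nonlinearity: fit the base OLS, augment it with \(\hat y^2\) to \(\hat y^4\), and F-test the added terms against the restricted model.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with a quadratic term, purely for illustration.
rng = np.random.default_rng(8)
n = 300
x = rng.standard_normal(n)
y = 1 + 2 * x + 0.5 * x**2 + rng.standard_normal(n)

# Base OLS fit and its predictions y_hat.
X = sm.add_constant(x)
base = sm.OLS(y, X).fit()
y_hat = base.fittedvalues

# Augmented regression with powers of y_hat; F-test the alphas jointly.
X_aug = np.column_stack([X, y_hat**2, y_hat**3, y_hat**4])
aug = sm.OLS(y, X_aug).fit()
f_stat, p_value, df_diff = aug.compare_f_test(base)
print(f"RESET F = {f_stat:.2f}, p = {p_value:.4f}")
```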
Alternative to RESET
We have \(\hat y\).
We regress \(y=\alpha + \beta \hat y + \gamma \hat y^2\).
We test that \(\gamma =0\).
If it is not, then this suggests the model is misspecified.
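A minimal statsmodels sketch of that check on simulated data: regress \(y\) on a constant, \(\hat y\) and \(\hat y^2\), then inspect the t-test on \(\gamma\).

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with a quadratic term, purely for illustration.
rng = np.random.default_rng(9)
n = 300
x = rng.standard_normal(n)
y = 1 + 2 * x + 0.5 * x**2 + rng.standard_normal(n)

# First-stage OLS predictions.
y_hat = sm.OLS(y, sm.add_constant(x)).fit().fittedvalues

# Second stage: y on [1, y_hat, y_hat^2]; test gamma (the coefficient on y_hat^2) = 0.
Z = sm.add_constant(np.column_stack([y_hat, y_hat**2]))
res = sm.OLS(y, Z).fit()
print("gamma =", round(res.params[2], 3), " p-value =", round(res.pvalues[2], 4))
```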
Trade-off between parameter accuracy and prediction accuracy.