One option for \(f(X)\) is a linear model.
\(f(X_i)=\hat{Y_i}= \beta_0+\sum_{j=1}^p\beta_jX_{ij}\)
The values for \(\beta\) are the regression coefficients.
So we have:
\(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+e(X_i)+e_i\)
We define the error of the estimate as:
\(\epsilon_i=Y_i-\hat{Y_i}\)
\(\epsilon_i=e(X_i)+e_i\)
So:
\(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+\epsilon_i\)
The linear model could be wrong for two reasons: no linear model may be appropriate, or the wrong coefficients may be chosen for an appropriate linear model.
Linear regression requires \(f\) to be a linear function of the parameters \(w\). NB: it is not necessarily linear in \(x\); we could have \(x^2\) etc., but the model is still linear in \(w\).
The function \(y=x^2\) is not linear in \(x\), but we can model it as linear by including \(x^2\) as a variable.
We can extend this, using linear models to estimate the parameters of functions such as:
\(y=ax^3+bx^2+cxz\)
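As a rough sketch (NumPy, simulated data; the variable names and true coefficient values are made up), the coefficients \(a\), \(b\), \(c\) can be estimated by placing \(x^3\), \(x^2\) and \(xz\) in the columns of a design matrix and running a linear least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)

# True coefficients (a, b, c) used to simulate y = a*x^3 + b*x^2 + c*x*z + noise
a, b, c = 1.5, -2.0, 0.7
y = a * x**3 + b * x**2 + c * x * z + rng.normal(scale=0.1, size=n)

# The model is nonlinear in x and z but linear in (a, b, c),
# so we stack the nonlinear terms as columns of a design matrix.
X = np.column_stack([x**3, x**2, x * z])

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)  # approximately [1.5, -2.0, 0.7]
```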
We can also transform data using logarithms and exponents.
For example we can model
\(\ln y=\theta \ln x\)
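For instance, a power law \(y=x^\theta\) becomes linear after taking logs of both sides. A minimal sketch, assuming multiplicative noise and no intercept (both assumptions mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=500)
theta_true = 2.5
y = x**theta_true * np.exp(rng.normal(scale=0.05, size=x.size))  # multiplicative noise

# ln y = theta * ln x is linear in theta, so fit it by least squares on the logs.
log_x = np.log(x).reshape(-1, 1)
log_y = np.log(y)
theta_hat, *_ = np.linalg.lstsq(log_x, log_y, rcond=None)
print(theta_hat[0])  # close to 2.5
```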
The square error is \(\sum_i (\hat{y_i}-y_i)^2\).
The derivative of this with respect to \(\hat{\theta_j }\) is:
\(2\sum_i \dfrac{\partial \hat{y_i}}{\partial \hat{\theta_j}}(\hat{y_i}-y_i)\)
The stationary point is where this is zero:
\(\sum_i \dfrac{\partial \hat{y_i}}{\partial \hat{\theta_j}}(\hat{y_i}-y_i)=0\)
Here, \(\hat{y_i}= \sum_j x_{ij}\hat{\theta_j}\)
Therefore: \(\dfrac{\partial \hat{y_i}}{\partial \hat{\theta_j}}=x_{ij}\)
And so the stationary point is where
\(\sum_i x_{ij}\left( \sum_k x_{ik}\hat{\theta}_k-y_i\right)=0\)
\(\sum_i x_{ij} \sum_k x_{ik}\hat{\theta}_k= \sum_i x_{ij}y_i\)
We can write this in matrix form.
\(X^TX\hat{\theta }=X^Ty\)
We can solve this as:
\(\hat{\theta }=(X^TX)^{-1}X^Ty\)
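A sketch of solving the normal equations directly with NumPy on simulated data (in practice np.linalg.lstsq or a QR decomposition is usually preferred for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # include an intercept column
theta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Normal equations: X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```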
If variables are perfectly correlated then we cannot solve the normal equation.
Intuitively, this is because for perfectly correlated variables there is no single best parameter, as changes to one parameter can be counteracted by changes to another.
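A quick illustration of this: if one column is an exact multiple of another, \(X^TX\) is rank deficient and the normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=50)
x2 = 2.0 * x1                      # perfectly correlated with x1
X = np.column_stack([x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 1, not 2: any increase in one coefficient
                                   # can be offset by a change in the other
```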
Consider the prediction \(\hat y =\theta x\). The bias is:
\(E[\hat y-y]=E[\theta x-y]\)
Since \(y=\hat y + \epsilon\), conditioning on \(X\) gives:
\(E[y-\hat y |X]=E[\epsilon |X]\)
The estimate is unbiased so long as \(X\) is independent of the error term, so that \(E[\epsilon |X]=0\).
Similarly for the variance:
\(Var [\hat y-y]=Var [\theta x-y]\)
\(Var[y-\hat y |X]=Var[\epsilon |X]\)
For a matrix \(X\) with full column rank, the pseudoinverse is \((X^*X)^{-1}X^*\).
For real matrices, this is: \((X^TX)^{-1}X^T\)
The pseudoinverse can be written as \(X^+\)
Therefore \(\hat\theta\) is the pseudoinverse of the inputs, multiplied by the outputs. Or:
\(\hat\theta = X^+y\)
The pseudoinverse satisfies:
\(XX^+X=X\)
\(X^+XX^+=X^+\)
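np.linalg.pinv computes \(X^+\) (via the SVD, so it also covers the rank-deficient case). A short check of the identities above and of \(\hat\theta=X^+y\) on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

X_pinv = np.linalg.pinv(X)

# Defining identities of the pseudoinverse
assert np.allclose(X @ X_pinv @ X, X)
assert np.allclose(X_pinv @ X @ X_pinv, X_pinv)

# For full-column-rank X, X^+ = (X^T X)^{-1} X^T, so theta_hat = X^+ y
theta_hat = X_pinv @ y
assert np.allclose(theta_hat, np.linalg.solve(X.T @ X, X.T @ y))
```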
Leverage measures how much the predicted value of \(y_i\), \(\hat y_i\), changes as \(y_i\) changes.
We have:
\(\mathbf y = \mathbf X \theta +\mathbf u\)
\(\hat{\mathbf y} =X(X^TX)^{-1}X^Ty\)
\(\hat{\mathbf y} =P_Xy\)
The leverage score is defined as:
\(h_i=P_{ii}\)
Given the design matrix \(X\), the projection matrix is \(P=X(X^TX)^{-1}X^T\).
The projection matrix maps the actual \(y\) to the predicted \(\hat y\):
\(\hat y = Py\)
Each entry of \(P\) is the covariance between a fitted value and an observed value, scaled by the variance:
\(p_{ij}=\dfrac{\operatorname{Cov}(\hat y_i, y_j)}{\operatorname{Var}(y_j)}\)
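A NumPy sketch on simulated data that forms the projection matrix, checks \(\hat y = Py\) against the least-squares fit, and reads the leverage scores off the diagonal:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=0.2, size=n)

# Projection ("hat") matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.solve(X.T @ X, X.T)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(P @ y, X @ theta_hat)   # P maps actual y to fitted y

h = np.diag(P)                             # leverage scores h_i = P_ii
print(h.sum())                             # trace of P = number of columns of X (here 4)
```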
We can get the residuals too:
\(u=y-\hat y=y-Py=(I-P)y\)
\(I-P\) is called the annihilator matrix.
We can now use the propagation of uncertainty
\(\Sigma^f = A\Sigma^x A^T\)
To get (with \(A=I-P\), which is symmetric):
\(\Sigma^u = (I-P)\Sigma^y (I-P)\)
Annihilator matrix is:
\(M_X=I-X(X^TX)^{-1}X^T\)
Called this because:
\(M_XX=X-X(X^TX)^{-1}X^TX\)
\(M_XX=0\)
It is also called the residual maker, since \(M_X\mathbf y=\mathbf y-\hat{\mathbf y}\) gives the residuals.
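A quick numerical check of both names, again on simulated data: \(M_XX\) is (numerically) zero and \(M_X\mathbf y\) reproduces the OLS residuals.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix
M = np.eye(n) - P                       # annihilator / residual maker

assert np.allclose(M @ X, 0)            # "annihilator": M_X X = 0
residuals = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(M @ y, residuals)    # "residual maker": M_X y = y - y_hat
```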
If we have a partitioned linear regression model:
\(\mathbf y=\mathbf X\theta+\mathbf Z\beta+\mathbf u\)
Use the annihilator matrix:
\(M_X\mathbf y=M_X\mathbf X\theta+M_X\mathbf Z\beta+M_X\mathbf u\)
Since \(M_X\mathbf X=0\):
\(M_X\mathbf y=M_X\mathbf Z\beta+M_X\mathbf u\)
We can then estimate \(\beta\) by regressing \(M_X\mathbf y\) on \(M_X\mathbf Z\).
The Frisch-Waugh-Lovell theorem says that this gives the same estimate of \(\beta\) as the original regression.
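A numerical sketch of the theorem on simulated data: regressing \(M_X\mathbf y\) on \(M_X\mathbf Z\) recovers the same \(\hat\beta\) as regressing \(\mathbf y\) on \([\mathbf X, \mathbf Z]\) jointly.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # controls (incl. intercept)
Z = rng.normal(size=(n, 2))                                  # regressors of interest
y = X @ np.array([1.0, 0.5, -0.5]) + Z @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Full regression of y on [X, Z]: beta is the last two coefficients
full = np.linalg.lstsq(np.column_stack([X, Z]), y, rcond=None)[0]
beta_full = full[-2:]

# Partialled-out regression: apply M_X to y and Z, then regress
M_X = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
beta_fwl = np.linalg.lstsq(M_X @ Z, M_X @ y, rcond=None)[0]

assert np.allclose(beta_full, beta_fwl)   # Frisch-Waugh-Lovell
```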
OLS for a single regressor:
\(\hat \theta =\frac{\sum_i (x_i-\mu_x)(y_i-\mu_y)}{\sum_i(x_i-\mu_x)^2}\)
Trimming
\(\hat \theta =\frac{n^{-1}\sum_i (x_i-\mu_x)(y_i-\mu_y)\mathbf 1_i}{n^{-1}\sum_i(x_i-\mu_x)^2\mathbf 1_i}\)
Where:
\(\mathbf 1_i=\mathbf 1(\hat f(z_i)\ge b)\)
and \(b=b(n)\) is a trimming parameter with \(b\rightarrow 0\) as \(n\rightarrow \infty\).
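A minimal sketch of this trimmed estimator. The argument hat_f_z (standing in for \(\hat f(z_i)\)) and the threshold b are placeholders supplied by the caller, and \(\mu_x\), \(\mu_y\) are taken to be the untrimmed sample means; both are my assumptions rather than part of the notes.

```python
import numpy as np

def trimmed_slope(x, y, hat_f_z, b):
    """Trimmed version of the single-regressor OLS slope above.

    hat_f_z[i] stands in for hat f(z_i); observations with hat_f_z[i] < b
    are dropped via the indicator 1_i. mu_x and mu_y are the untrimmed
    sample means (my assumption)."""
    keep = (hat_f_z >= b).astype(float)   # indicator 1_i = 1(hat f(z_i) >= b)
    x_c = x - x.mean()                    # x_i - mu_x
    y_c = y - y.mean()                    # y_i - mu_y
    return np.mean(x_c * y_c * keep) / np.mean(x_c**2 * keep)
```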
The best linear predictor is the one which minimises the expected squared error:
\(E[(Y-X\theta)^2 ]\)
Under what circumstances does OLS recover this? When \(n\gg p\). When \(n\) is not much larger than \(p\), other linear estimators (such as the LASSO) can be better.
Cook’s distance measures the effect of deleting an observation: refit the model with observation \(i\) removed and sum the squared differences in \(\hat y\) across all observations.
Influential outliers have a high Cook’s distance.
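A leave-one-out sketch in the spirit described above. The standard definition of Cook’s distance also scales by \(p\,s^2\), which is included here; refitting for every observation means \(n\) regressions, which is fine for small data sets.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation: refit with observation i deleted,
    then sum the squared changes in the fitted values, scaled by p * s^2."""
    n, p = X.shape
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ theta
    s2 = np.sum((y - y_hat) ** 2) / (n - p)        # residual variance estimate

    d = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        theta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        d[i] = np.sum((y_hat - X @ theta_i) ** 2) / (p * s2)
    return d
```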
In linear regression we have
\(P(y|X, \theta, \sigma^2_\epsilon )\)
For Bayesian linear regression we want:
\(P(\theta, \sigma^2_\epsilon |y, X)\)
We can use Bayes rule:
\(P(\theta, \sigma^2_\epsilon |y, X)\propto P(y|X, \theta, \sigma^2_\epsilon )P(\theta |\sigma^2_\epsilon )P(\sigma^2_\epsilon )\)
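A minimal sketch of the conjugate special case where \(\sigma^2_\epsilon\) is treated as known and \(\theta\) has a zero-mean Gaussian prior with variance tau2 (the full model above also places a prior on \(\sigma^2_\epsilon\)):

```python
import numpy as np

def gaussian_posterior(X, y, sigma2_eps, tau2):
    """Posterior over theta with a N(0, tau2 * I) prior and known noise
    variance sigma2_eps (a simplification of the model above, which also
    places a prior on sigma2_eps)."""
    p = X.shape[1]
    precision = np.eye(p) / tau2 + X.T @ X / sigma2_eps   # posterior precision
    cov = np.linalg.inv(precision)                        # posterior covariance
    mean = cov @ X.T @ y / sigma2_eps                     # posterior mean
    return mean, cov
```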