We take a sample from the distribution.
\(x=(x_1, x_2,...,x_n)\)
A statistic is a function of this sample:
\(S=S(x_1, x_2,...,x_n)\).
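As a minimal sketch (the standard normal population here is an assumption for illustration):

```python
import numpy as np

# Draw a sample x = (x_1, ..., x_n); the standard normal population here is
# an assumption purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# A statistic is any function of the sample, e.g. the sample mean.
S = x.mean()
print("S =", S)
```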
Suppose we want to estimate \(\theta\) in
\(y_i=f(\mathbf x_i, \theta) +g(\mathbf z_i) +\epsilon_i\)
Here \(\mathbf x_i\) and \(\mathbf z_i\) are not independent, so we cannot just estimate \(y_i=\mathbf x_i\theta + \epsilon_i\): the omitted \(g(\mathbf z_i)\) is correlated with \(\mathbf x_i\) and would bias \(\hat\theta\).
We could estimate the whole equation with a single ML algorithm.
For example, using LASSO. However, this would introduce bias into our estimate of \(\theta\), since regularization shrinks the coefficients toward zero.
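A minimal simulation sketch of this bias; the data-generating process below (\(\theta = 1\), a nonlinear \(g\), and \(\mathbf x_i\) constructed from \(\mathbf z_i\)) is invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 2000
theta = 1.0                                   # true parameter, assumed for the simulation
z = rng.normal(size=(n, 5))
x = z @ np.full(5, 0.5) + rng.normal(size=n)  # x depends on z (not independent)
g_z = np.sin(z[:, 0]) + z[:, 1] ** 2          # nuisance function g(z)
y = theta * x + g_z + rng.normal(size=n)

# Naive approach: one LASSO over x and z jointly. Regularization shrinks the
# coefficient on x (and a linear fit cannot absorb the nonlinear g), so the
# estimate of theta is biased.
model = Lasso(alpha=0.1).fit(np.column_stack([x, z]), y)
print("naive LASSO theta-hat:", model.coef_[0])
```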
We could instead estimate \(\theta\) and \(g(\mathbf z_i)\) iteratively, for example alternating OLS for \(\theta\) with random forests for \(g(\mathbf z_i)\). This would also introduce bias into \(\hat\theta\), because the nuisance fit \(\hat g\) is estimated on the same data and its errors are correlated with \(\mathbf x_i\).
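A rough sketch of such an iterative scheme, again with an invented data-generating process:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
theta = 1.0                                   # true value, assumed for the simulation
z = rng.normal(size=(n, 5))
x = z @ np.full(5, 0.5) + rng.normal(size=n)  # x depends on z
y = theta * x + np.sin(z[:, 0]) + z[:, 1] ** 2 + rng.normal(size=n)

# Alternate a random-forest step for g(z) with an OLS step for theta.
theta_hat = 0.0
for _ in range(10):
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(z, y - theta_hat * x)          # fit g on the current residual
    g_hat = forest.predict(z)
    theta_hat = x @ (y - g_hat) / (x @ x)     # OLS (no intercept) of y - g_hat on x
print("iterative theta-hat:", theta_hat)
```

Because the forest is fit in-sample, \(\hat g\) absorbs part of \(\mathbf x_i\)'s effect through its dependence on \(\mathbf z_i\), which is exactly the bias described above.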
To do inference on \(\hat\theta\) we need its sampling distribution. Typically some suitably scaled function of the estimator converges in distribution,
\(f(\hat \theta )\rightarrow^d G\)
where \(G\) is some distribution.
Many statistics are asymptotically normally distributed. This is a result of the central limit theorem.
For example, for a statistic \(S\) estimating a population value \(s\):
\(\sqrt n\,(S - s)\rightarrow^d N(0, \sigma^2)\)
so that \(S\) is approximately \(N(s, \sigma^2/n)\) in large samples.
We then know the limiting distribution along with its mean and variance, which allows us to calculate confidence intervals.
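A minimal sketch of forming such an interval for a sample mean (the exponential population is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.exponential(scale=2.0, size=n)  # non-normal population with mean s = 2

# CLT: sqrt(n) * (S - s) -> N(0, sigma^2), so S is approx N(s, sigma^2 / n).
S = x.mean()
se = x.std(ddof=1) / np.sqrt(n)         # estimate of sigma / sqrt(n)

z = 1.96                                # Phi^(-1)(0.975), for a 95% interval
print("95% CI for the mean:", (S - z * se, S + z * se))
```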