Trinity of statistical inference

The goal of statistics is to learn the distribution of \(X\) after having observed \(n\) (ideally independent) copies of it. To this end, we introduce a mathematical formalization of statistical modeling that makes principled sense of the following three tasks:

  • Define a function providing an estimate of the unknown parameter based on the observations;
  • Set a Confidence Interval (CI) in which the unknown parameter will fall with a given probability; and
  • Test whether there is statistical evidence that the unknown parameter satisfies a certain hypothesis.

Statistical model

Let \(X_i\overset{\text{i.i.d.}}{\sim}\mathcal P\) be the observed outcomes of a statistical experiment. In this setting, the statistical model associated with such an experiment is the following pair.

\[(\Omega,\{\mathcal P_\theta\}_{\theta\in\Theta})\]

Where:

  • \(\Omega\) is the sample space (usually \(\Omega\subseteq\mathbb R^n\)). A valid statistical model has \(\Omega\) that does not depend on parameter \(\theta\) and is compatible with the support of \(\mathcal P_\theta\);
  • \(\{\mathcal P_\theta\}_{\theta\in\Theta}\) is a family of probability measures on \(\Omega\) (i.e. each a legitimate PMF or PDF); and
  • \(\Theta\) is the parameter set.

We assume that the model is well-specified, that is, \(\exists\theta\) s.t. \(\mathcal P_{\theta}=\mathcal P\), with \(\theta\) being the (unknown) true parameter. The aim of the statistical experiment is usually to find \(\theta\) (estimation) or to check whether it lies within a certain region (hypothesis testing).

If the model is mis-specified, no matter how much data we collect, we will not be able to recover the exact probability distribution \(\mathcal P\), but this is not as bad as it may seem. Most modeling assumptions necessarily lead to mis-specified models, in exchange for avoiding an otherwise intractable analysis of the data. Hence the saying all models are wrong, but some are useful: although we may never get the true underlying distribution, we should aim at a good approximation that achieves our objectives. Also, when competing models explain a given phenomenon equally well, we should prefer the simpler one (e.g. a model with fewer parameters and a smaller sample space).

When \(\Theta\subseteq\mathbb R^d\) and \(1\le d\lt\infty\), we say that the model is parametric; if \(d=\infty\), the model is non-parametric. For completeness, a statistical model is semi-parametric when \(\Theta=\Theta_1\times\Theta_2\), meaning that \(\Theta\) can be split into a finite-dimensional \(\Theta_1\) and an infinite-dimensional \(\Theta_2\) (referred to as the nuisance parameter). In that case, one only cares to estimate \(\theta\in\Theta_1\).

Below is an indicative list of statistical models for the basic discrete and continuous distributions.

Distribution and unknown parameter | Statistical model
Bernoulli with parameter \(p\) | \(\left(\{0,1\},\{\text{Ber}(p)\}_{p\in(0,1)}\right)\)
Binomial with parameters \((k,p)\) | \(\left(\{0,1,\dots,k\},\{\text{Bin}(k,p)\}_{p\in(0,1)}\right)\)
Categorical with parameter \(\mathbf p\) | \(\left(\{\mathbf x:x_j\in\{0,1\},\sum_{j=1}^d x_j=1\},\{\text{Cat}(\mathbf p)\}_{p_j>0,\sum_{j=1}^d p_j=1}\right)\)
Multinomial with parameters \((k,\mathbf p)\) | \(\left(\{\mathbf x:x_j\in\{0,\dots,k\},\sum_{j=1}^d x_j=k\},\{\text{Mult}(k,\mathbf p)\}_{p_j>0,\sum_{j=1}^d p_j=1}\right)\)
Geometric with parameter \(p\) | \(\left(\mathbb Z_{\ge1},\{\text{Geom}(p)\}_{p\in(0,1)}\right)\)
Poisson with parameter \(\lambda\) | \(\left(\mathbb Z_{\ge0},\{\text{Pois}(\lambda)\}_{\lambda>0}\right)\)
Continuous uniform over \([0,\theta]\) | \(\left(\mathbb R_{\ge0},\{\text{Unif}(0,\theta)\}_{\theta>0}\right)\)
Exponential with parameter \(\lambda\) | \(\left(\mathbb R_{\ge0},\{\text{Exp}(\lambda)\}_{\lambda>0}\right)\)
Normal with parameters \((\mu,\sigma^2)\) | \(\left(\mathbb R,\{\mathcal N(\mu,\sigma^2)\}_{(\mu,\sigma^2)\in\mathbb R\times(0,\infty)}\right)\)
Multivariate Normal with parameter \(\mu\) | \(\left(\mathbb R^n,\{\mathcal N(\mu,\Sigma)\}_{\mu\in\mathbb R^n}\right)\)
Linear regression model | \((X_i,Y_i)\in\mathbb R^d\times\mathbb R\) are i.i.d. with \(Y_i=X_i^T\beta+\varepsilon_i\), where \(X_i\sim\mathcal N(0,I_d)\), \(\varepsilon_i\overset{\text{i.i.d.}}{\sim}\mathcal N(0,1)\) and \(\beta\in\mathbb R^d\) is unknown

While setting up the statistical model, we should ask ourselves whether the parameter \(\theta\) is identifiable, which holds if and only if \(\theta\neq\theta^\star\implies\mathcal P_\theta\neq\mathcal P_{\theta^\star}\), or equivalently \(\mathcal P_\theta=\mathcal P_{\theta^\star}\implies\theta=\theta^\star\).
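
Identifiability typically fails when the observation discards information about \(\theta\). A minimal simulation sketch (a hypothetical example, not from the text, assuming numpy): if we only observe \(\lvert Z\rvert\) with \(Z\sim\mathcal N(\mu,1)\), then \(\mu\) and \(-\mu\) induce the same distribution, so \(\mu\) is not identifiable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 100_000, 1.5

# |Z| with Z ~ N(mu, 1) and Z ~ N(-mu, 1) are equal in distribution:
# mu is NOT identifiable from observations of |Z|.
a = np.abs(rng.normal(mu, 1.0, n))
b = np.abs(rng.normal(-mu, 1.0, n))

# Compare the two empirical CDFs on a grid: the max gap is sampling noise only.
grid = np.linspace(0, 5, 200)
cdf_a = np.searchsorted(np.sort(a), grid) / n
cdf_b = np.searchsorted(np.sort(b), grid) / n
gap = np.abs(cdf_a - cdf_b).max()
print(f"max CDF gap between mu=+1.5 and mu=-1.5: {gap:.4f}")
```

Any estimator built on \(\lvert Z\rvert\) alone can at best recover \(\lvert\mu\rvert\).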

Estimation

Any quantity computed from values in a sample is called a statistic. Common examples are:

  • sample mean (\(\bar X_n\)), sample mode (the value maximizing the empirical frequency, \(\arg\max_x \hat f_X(x)\));
  • sample variance (e.g. biased \(S_n^2\), unbiased \(\tilde S_n^2\));
  • order statistics (e.g. sample maximum \(X_{(n)}\), minimum \(X_{(1)}\) or median \(X_{\left(\frac n 2\right)}\));
  • test statistics (e.g. \(z\)-statistic, \(t\)-statistic, \(\chi^2\)-statistic, \(F\)-statistic); and
  • in general, any measurable (i.e. computable) function of the sample.

When a statistic is used to estimate a population parameter, it is called an estimator. Let \(\hat\Theta_n\) be an estimator of the true parameter \(\theta\), which depends on the number of samples \(n\). If \(\hat\Theta_n\xrightarrow{a.s./\mathbb P}\theta\), the estimator is called consistent (strongly or weakly, respectively). If \(\sqrt{n}(\hat\Theta_n-\theta)\xrightarrow{(d)}\mathcal N(0,\sigma^2)\), the estimator is also asymptotically normal, and \(\sigma^2\) is the asymptotic variance of \(\hat\Theta_n\).

Below are examples of some empirical (or natural) estimators, where the term “empirical” refers to relative frequencies observed in a sample of data. An important property of any estimator is its bias, \(\text{bias}(\hat\Theta_n)=\mathbb E[\hat\Theta_n]-\theta\). Several of the natural estimators below are biased; where needed, adjusted alternatives are shown.

Property | Natural estimator | Alternative estimator
expectation \(\mathbb E[X_i]\) | \(\bar X_n=\frac 1 n \sum_{i=1}^n X_i\) | \(\bar X_n\) (already unbiased)
expectation of a function \(\mathbb E[f(X_i)]\) | \(\frac 1 n \sum_{i=1}^n f(X_i)\) | depends on \(f\)
variance \(\text{var}(X_i)\) | \(\frac 1 n\sum_{i=1}^n (X_i-\bar X_n)^2\) | \(\frac 1 {n-1}\sum_{i=1}^n (X_i-\bar X_n)^2\) (see back-up)
covariance \(\text{cov}(X_i, Y_i)\) | \(\frac 1 n\sum_{i=1}^n (X_i-\bar X_n)(Y_i-\bar Y_n)\) | \(\frac 1 {n-1}\sum_{i=1}^n (X_i-\bar X_n)(Y_i-\bar Y_n)\)
\(\frac1\lambda\) in \(X_i\overset{\text{i.i.d.}}{\sim}\text{Exp}(\lambda)\) | \(\bar X_n\) | \(nX_{(1)}\)
\(\theta\) in \(X_i\sim\text{Unif}(0,\theta)\) | \(2\bar X_n\) | \(\frac {n+1} n X_{(n)}\)
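
The uniform row above can be checked by simulation; a minimal sketch assuming numpy, comparing the natural estimator \(2\bar X_n\) with the adjusted maximum \(\frac{n+1}n X_{(n)}\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 3.0, 50, 20_000

# reps independent samples of size n from Unif(0, theta)
x = rng.uniform(0, theta, size=(reps, n))
est_mean = 2 * x.mean(axis=1)               # natural: 2 * sample mean
est_max = (n + 1) / n * x.max(axis=1)       # adjusted sample maximum

print(f"2*Xbar:        bias {est_mean.mean() - theta:+.4f}, var {est_mean.var():.5f}")
print(f"(n+1)/n X(n):  bias {est_max.mean() - theta:+.4f}, var {est_max.var():.5f}")
```

Both are (approximately) unbiased, but the maximum-based estimator has a far smaller variance, previewing the efficiency comparison discussed later.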

Consistency and bias are different properties, and one does not imply the other. For example, the estimator \(\hat\Theta_n=X_n\) is unbiased, but will not converge to anything, as it considers the last sample only.

The variance of an estimator, \(\text{var}(\hat\Theta_n)\), helps compare the performance of different estimators. A closely related quantity is the mean squared error, defined as the expected squared error loss \(\text{MSE}(\hat\Theta_n)=\mathbb E[(\hat\Theta_n-\theta)^2]\).

Since \(\text{var}(\hat\Theta_n)=\text{var}(\hat\Theta_n-\theta)=\mathbb E[(\hat\Theta_n-\theta)^2]-(\mathbb E[\hat\Theta_n-\theta])^2=\text{MSE}(\hat\Theta_n)-\text{bias}^2(\hat\Theta_n)\), it follows that \(\text{MSE}(\hat\Theta_n)=\text{var}(\hat\Theta_n)+\text{bias}^2(\hat\Theta_n)\ge0\). Also, if \(\text{MSE}(\hat\Theta_n)\rightarrow 0\) then estimator \(\hat\Theta_n\) is consistent (i.e. \(\hat\Theta_n\xrightarrow{\mathbb P}\theta\)).

The above can also be read as: convergence in \(L^2\) norm implies convergence in \(L^1\) norm, and therefore in probability (consistency). The implications in the opposite direction are not guaranteed; in particular, convergence in \(L^1\) norm does not imply convergence in \(L^2\) norm.
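
The decomposition \(\text{MSE}=\text{var}+\text{bias}^2\) can be checked numerically; a minimal sketch assuming numpy, using the biased \(\frac1n\) variance estimator of a Normal's \(\sigma^2\) as \(\hat\Theta_n\). The empirical identity holds exactly, because it is just the algebraic decomposition of the second moment applied to the Monte Carlo sample.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 2.0, 20, 50_000

# reps copies of the natural (biased, ddof=0) variance estimator on n samples
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = x.var(axis=1)

mse = np.mean((est - sigma2) ** 2)          # E[(est - theta)^2]
bias = est.mean() - sigma2                  # ~ -sigma2/n < 0
decomposition = est.var() + bias ** 2       # var + bias^2
print(f"MSE {mse:.4f} vs var + bias^2 {decomposition:.4f}, bias {bias:+.4f}")
```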

Delta method

Occasionally, the unknown parameter is a function of the expectation, \(\theta=g(\mu_X)\). In such cases, the CMT preserves consistency, \(g(\bar X_n)\xrightarrow{\mathbb P}g(\mu_X)\), and asymptotic normality, but not the asymptotic variance. By Taylor’s expansion, \(g(\bar X_n)=g(\mu_X)+g'(\mu_X)(\bar X_n-\mu_X)+O((\bar X_n-\mu_X)^2)\); therefore, \(g(\bar X_n)-g(\mu_X)\simeq g'(\mu_X)(\bar X_n-\mu_X)\), which allows computing the new asymptotic variance, provided that \(g\) is continuously differentiable around \(\mu_X\), which in turn may depend on the possible values of \(\theta\).

\[\sqrt{n}(g(\bar X_n)-g(\mu_X))\xrightarrow{(d)}\mathcal N(0,(g'(\mu_X))^2\sigma_X^2)\]

The above is known as the Delta method, and even though the example was developed around the sample mean, it works for any asymptotically normal estimator, including non-invertible \(g\).
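
A numerical sketch of the Delta method (assuming numpy; the choice \(g(x)=\frac1x\) applied to the sample mean of \(\text{Exp}(\lambda)\) is an illustrative example): since \(\mu_X=\sigma_X=\frac1\lambda\) and \(g'(\mu_X)=-\lambda^2\), the predicted asymptotic standard deviation of \(g(\bar X_n)\) is \(\frac\lambda{\sqrt n}\).

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 2.0, 400, 20_000

# g(x) = 1/x maps the sample mean of Exp(lam) (mu = 1/lam) to an estimate of lam
xbar = rng.exponential(1 / lam, size=(reps, n)).mean(axis=1)
est = 1 / xbar

# Delta method: sd(g(Xbar)) ~ |g'(mu)| * sigma / sqrt(n) = lam / sqrt(n)
predicted_sd = lam / np.sqrt(n)
print(f"empirical sd {est.std():.4f} vs delta-method sd {predicted_sd:.4f}")
```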

Multivariate Delta method

Assume a random vector \(\bar{\mathbf X}_n\in\mathbb R^d\) s.t. \(\sqrt{n}(\bar{\mathbf X}_n-\boldsymbol\mu_{\mathbf X})\xrightarrow{(d)}\mathcal N_d(\mathbf 0_d,\boldsymbol\Sigma_{\mathbf X})\) and a map \(\mathbf g:\mathbb R^d\rightarrow\mathbb R^{k}\), \(k\ge1\), that is continuously differentiable at \(\boldsymbol\mu_{\mathbf X}\); then the Delta method generalizes as follows.

\[\sqrt{n}(\mathbf g(\bar{\mathbf X}_n)-\mathbf g(\boldsymbol\mu_{\mathbf X}))\xrightarrow{(d)}\mathcal N_k(\mathbf 0_k,\nabla \mathbf g(\boldsymbol\mu_\mathbf X)^T\boldsymbol\Sigma_{\mathbf X}\nabla \mathbf g(\boldsymbol\mu_\mathbf X))\]

Where \(\nabla \mathbf g(\boldsymbol\mu_\mathbf X)=\begin{bmatrix}\nabla g_1(\boldsymbol\mu_\mathbf X)&\dots&\nabla g_k(\boldsymbol\mu_\mathbf X)\end{bmatrix}\in\mathbb R^{d\times k}\), which is the transpose of the Jacobian matrix of \(\mathbf g\).
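
A numerical check of the multivariate statement under made-up values (assuming numpy; the map \(g(u,v)=uv\) with independent Normal coordinates is a hypothetical choice, so \(\boldsymbol\Sigma\) is diagonal and \(\nabla g(\boldsymbol\mu)=(\mu_2,\mu_1)^T\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000
mu = np.array([1.0, 2.0])
sd = np.array([0.5, 1.0])

# g(u, v) = u * v applied to the two coordinate-wise sample means
x = rng.normal(mu, sd, size=(reps, n, 2))
means = x.mean(axis=1)                      # shape (reps, 2)
est = means[:, 0] * means[:, 1]

# grad g(mu) = (mu2, mu1); Sigma = diag(sd^2), so the asymptotic variance is
# mu2^2 * sd1^2 + mu1^2 * sd2^2
avar = (mu[1] * sd[0]) ** 2 + (mu[0] * sd[1]) ** 2
predicted_sd = np.sqrt(avar / n)
print(f"empirical sd {est.std():.4f} vs delta-method sd {predicted_sd:.4f}")
```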

Estimator properties recap

Below is a list of the main properties of any estimator, along with some considerations that one should bear in mind when choosing an estimator among multiple options. The examples below consider the following estimators:

  • \(nX_{(1)}\) estimator of \(\frac1\lambda\) with \(X_i\overset{\text{i.i.d.}}{\sim}\text{Exp}(\lambda)\); and
  • \(X_{(n)}\) estimator of \(\theta\) with \(X_i\overset{\text{i.i.d.}}{\sim}\text{Unif}(0,\theta)\).
  • Consistency, \(\hat\Theta_n\xrightarrow[n\rightarrow\infty]{\mathbb P}\theta\). A consistent estimator returns better estimates as \(n\) increases. Considering that \(\lim_{n\rightarrow\infty}\mathbb P(\lvert X_{(n)}-\theta\rvert\ge\varepsilon)=\lim_{n\rightarrow\infty}\left(1-\frac\varepsilon\theta\right)^n=0\), it follows that \(X_{(n)}\xrightarrow{\mathbb P}\theta\), i.e. \(X_{(n)}\) is consistent. For \(nX_{(1)}\), instead, \(\mathbb P\left(nX_{(1)}-\frac1\lambda\le\varepsilon\right)=1-e^{-(\lambda\varepsilon+1)}\), which does not depend on \(n\): no matter how many copies of \(X_i\) we collect, the probability of having \(nX_{(1)}\) arbitrarily far from \(\frac1\lambda\) remains constant, hence \(nX_{(1)}\) is not consistent;
  • Bias, \(\text{bias}(\hat\Theta_n)=\mathbb E[\hat\Theta_n]-\theta\). Ceteris paribus, an unbiased estimator is preferable; however, there are cases where biased estimators are used in practice. Although \(nX_{(1)}\) is unbiased, we may prefer a different estimator if consistency is a must. Conversely, \(X_{(n)}\) is biased, because \(\mathbb E[X_{(n)}]=\frac n{n+1}\theta\neq\theta\); in this case the bias can be fixed by defining \(\hat\Theta_n=\frac{n+1}n X_{(n)}\), which is both unbiased and consistent;
  • Variance, \(\text{var}(\hat\Theta_n)\). The smaller the variance, the higher the precision of the estimates. Relative efficiency compares two unbiased estimators \(\hat\Theta_1\) and \(\hat\Theta_2\) through the ratio \(\frac{\text{var}(\hat\Theta_1)}{\text{var}(\hat\Theta_2)}\), where \(\frac{\text{var}(\hat\Theta_1)}{\text{var}(\hat\Theta_2)}\le1\) indicates that the first estimator is more efficient. There is a relation between variance and consistency: \(\lim_{n\rightarrow\infty}\text{var}(\hat\Theta_n)=0\), together with a vanishing bias, means convergence in \(L^2\) norm, which implies convergence in probability;
  • MSE, \(\text{MSE}(\hat\Theta_n)=\text{var}(\hat\Theta_n)+\text{bias}^2(\hat\Theta_n)\). The MSE is a measure of the quality of an estimator that accounts for both bias and variance. Shrinkage estimators introduce a slight bias in exchange for a reduction of the overall MSE. Assume \(\hat\Theta_n\) unbiased, so that \(\text{MSE}(\hat\Theta_n)=\text{var}(\hat\Theta_n)=\sigma^2\). Now consider \(\alpha\hat\Theta_n\) and observe that \(\text{MSE}(\alpha\hat\Theta_n)=(\sigma^2+\theta^2)\alpha^2-2\theta^2\alpha+\theta^2\) reaches its minimum at \(\hat\alpha=\frac{\theta^2}{\sigma^2+\theta^2}\lt1\). Accordingly, \(\text{MSE}(\hat\alpha\hat\Theta_n)=\hat\alpha\sigma^2\lt\sigma^2\);
  • Distribution, \((\lim_{n\rightarrow\infty})\,\hat\Theta_n\sim\mathcal P\). When we mentioned bias and variance, we assumed that they can be computed; analyzing estimators is easier if they are distributed according to well-known probability laws. With \(nX_{(1)}\), we have \(X_{(1)}\sim\text{Exp}(n\lambda)\) for any \(n\), and one can rely on existing tables to derive its expectation and variance. The same consideration applies to \(\bar X_n\) with \(X_i\overset{\text{i.i.d.}}{\sim}\mathcal N(\mu_X,\sigma_X^2)\), which is distributed as a Normal r.v. with expectation \(\mu_X\) and variance \(\frac{\sigma_X^2}n\). These are examples of non-asymptotic distributions. Asymptotic distributions, on the other hand, hold only as \(n\rightarrow\infty\); examples include \(\sqrt n(\bar X_n-\mu_X)\xrightarrow{(d)}\mathcal N(0,\sigma_X^2)\) for any \(X_i\overset{\text{i.i.d.}}{\sim}\mathcal P\) with finite variance, thanks to the CLT, or \(n\left(1-\frac{X_{(n)}}\theta\right)\xrightarrow{(d)}\text{Exp}(1)\) in the case of \(X_{(n)}\);
  • Robustness. Outliers may be unavoidable during data collection, due to either flaws in the statistical model or measurement errors. Estimators susceptible to outliers can significantly distort our estimates, unless outliers are dealt with properly. For example, estimators that rely on \(X_{(1)}\) or \(X_{(n)}\) are less robust than those based on \(\bar X_n\).
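
The shrinkage computation above can be verified numerically; a minimal sketch assuming numpy, with \(\theta=\sigma^2=1\) so that \(\hat\alpha=\frac12\). Note that \(\hat\alpha\) depends on the unknown \(\theta\), so this is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma2, reps = 1.0, 1.0, 200_000

# An unbiased estimator with variance sigma^2: a single N(theta, sigma^2) draw
est = rng.normal(theta, np.sqrt(sigma2), reps)

alpha = theta**2 / (sigma2 + theta**2)              # optimal shrinkage = 0.5
mse_plain = np.mean((est - theta) ** 2)             # ~ sigma^2 = 1.0
mse_shrunk = np.mean((alpha * est - theta) ** 2)    # ~ alpha * sigma^2 = 0.5
print(f"MSE unbiased {mse_plain:.3f} vs shrunk {mse_shrunk:.3f}")
```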

Confidence Interval

Any random interval \(\mathcal I\) whose boundaries do not depend on \(\theta\) is called an (asymptotic) confidence interval (CI) of level \(1-\alpha\) if:

\[\left(\lim_{n\rightarrow\infty}\right)\mathbb P(\theta\in\mathcal I)\ge1-\alpha,\forall\theta\in\Theta\]

There are three major takeaways from the above definition:

  • a CI is a random interval that, depending on its realization, will either contain the unknown parameter or not, and the expected fraction of successes \(\mathbb E[\mathbb 1(\theta\in\mathcal I)]\) determines its level;
  • the inequality sign means that a CI of a given level is also a CI of any lower level; our objective, though, is finding the narrowest interval satisfying the given level; and
  • unless the estimator’s distribution is well known for every \(n\lt\infty\), we will need to rely on convergence theorems (CLT, Slutsky’s, CMT, Delta method) to derive an asymptotic CI for \(n\rightarrow\infty\).

The derivation of a CI can follow this procedure:

  • Define a statistical model \((\Omega,\{\mathcal P_\theta\}_{\theta\in\Theta})\) based on observations \(X_1,\dots,X_n\), where \(\Theta\subseteq\mathbb R\);
  • Construct an estimator \(\hat\Theta_n\) of parameter \(\theta\) and derive its distribution;
  • Compute an interval \(\mathcal J(\theta)=\theta+[-u,v]\) s.t. \(\mathbb P(\hat\Theta_n\in\mathcal J(\theta))\ge1-\alpha\); and
  • Derive an interval \(\mathcal I(\hat\Theta_n)=\hat\Theta_n+[-v,u]\) s.t. \(\mathbb P(\theta\in\mathcal I(\hat\Theta_n))\ge1-\alpha\).

Note that in spite of the last manipulation, the interval \(\mathcal I(\hat\Theta_n)\) may still depend on the unknown \(\theta\). To address this, we may rely on one of the following three strategies:

  • Conservative bound, particularly suitable to bounded r.v.;
  • Solving the (quadratic) equation; and
  • Plugging-in the estimator \(\hat\Theta_n\) in place of \(\theta\), by exploiting Slutsky’s theorem.

While the first two strategies do not change the (non-)asymptotic nature of the interval determined before, the plug-in strategy necessarily leads to an asymptotic CI, because Slutsky’s theorem applies to convergent sequences only.

Before moving forward, bear in mind that although a CI of a higher level is wider than a CI of a lower level, the former does not necessarily contain the latter if they were derived using different strategies.
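
The three strategies can be compared on the Bernoulli mean, where \(\sigma_X^2=p(1-p)\) depends on the unknown \(p\). A minimal sketch assuming numpy, with the quantile \(q_{\frac\alpha2}\approx1.96\) hardcoded for level \(0.95\) and made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p_true, q = 1_000, 0.3, 1.96

x = rng.binomial(1, p_true, n)
pbar = x.mean()

# 1) conservative bound: p(1-p) <= 1/4 for any p
half = q * 0.5 / np.sqrt(n)
cons = (pbar - half, pbar + half)

# 2) solve the quadratic |pbar - p| <= q * sqrt(p(1-p)/n) for p
a = 1 + q**2 / n
b = -(2 * pbar + q**2 / n)
c = pbar**2
disc = np.sqrt(b**2 - 4 * a * c)
quad = ((-b - disc) / (2 * a), (-b + disc) / (2 * a))

# 3) plug-in: replace p with pbar in the variance (Slutsky)
half = q * np.sqrt(pbar * (1 - pbar) / n)
plug = (pbar - half, pbar + half)

for name, ci in [("conservative", cons), ("quadratic", quad), ("plug-in", plug)]:
    print(f"{name:>12}: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

The conservative interval is the widest of the three; the quadratic one is not centered at \(\bar X_n\), as discussed below.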

Two-sided CI of the Mean

Considering the importance of the sample mean \(\bar X_n\) as a consistent estimator of the expectation with a well known distribution, it is quite common to see the CI defined as \(\left[\bar X_n\pm q_{\frac\alpha 2}\frac{\hat\sigma_X}{\sqrt n}\right]\), equivalent to:

\[\mathcal I(\bar X_n)=\left[\bar X_n-q_{\frac\alpha 2}\frac{\hat\sigma_X}{\sqrt n},\bar X_n+q_{\frac\alpha 2}\frac{\hat\sigma_X}{\sqrt n}\right]\text{, such that }\mathbb P\left(\mu_X\in\mathcal I(\bar X_n)\right)\ge1-\alpha\]

Where \(q_{\frac\alpha 2}\) is the standard normal quantile of order \(1-\frac\alpha2\) (so that the two tails have total probability \(\alpha\)) and \(\hat\sigma_X^2\) is either an upper bound of the variance or the (unbiased) sample variance obtained from the observations.

The CI obtained by solving a (quadratic) equation is generally staggered with respect to an interval centered around the sample mean, as in the following indicative examples.

 | \(\sigma_X^2=\theta^2\) | \(\sigma_X^2=\theta\)
equation | \(\lvert g(\bar X_n)-\theta\rvert\le q_{\frac \alpha 2}\frac{\theta}{\sqrt n}\) | \(\lvert g(\bar X_n)-\theta\rvert\le q_{\frac \alpha 2}\frac{\sqrt\theta}{\sqrt n}\)
solution interval | \(\left[\frac{g(\bar X_n)}{1\pm\frac{q_{\frac \alpha 2}}{\sqrt n}}\right]\) | \(\left[g(\bar X_n)+\frac {q_{\frac \alpha 2}^2}{2n}\left(1\pm\sqrt{1+\frac{4n\,g(\bar X_n)}{q_{\frac \alpha 2}^2}}\right)\right]\)

A generic approach for determining a new estimator can be as follows:

  • check out the converging limit of the sample mean (WLLN);
  • observe the Normal distribution and its asymptotic variance (CLT);
  • apply necessary transformations to isolate the desired (unknown) parameter (CMT);
  • determine the new asymptotic variance (Delta method); and
  • compute CI of the desired level.
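
The steps above can be sketched end-to-end for \(\text{Exp}(\lambda)\) (an illustrative choice, assuming numpy): \(\bar X_n\rightarrow\frac1\lambda\) by the WLLN, the CLT and the Delta method with \(g(x)=\frac1x\) give \(\sqrt n\left(\frac1{\bar X_n}-\lambda\right)\xrightarrow{(d)}\mathcal N(0,\lambda^2)\), and plugging in \(\hat\lambda\) yields an asymptotic 95% CI whose coverage we can check by simulation.

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n, reps, q = 2.0, 500, 5_000, 1.96

# lam_hat = g(Xbar) = 1/Xbar; plug-in CI: lam_hat +/- q * lam_hat / sqrt(n)
x = rng.exponential(1 / lam, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)
half = q * lam_hat / np.sqrt(n)
covered = np.mean((lam_hat - half <= lam) & (lam <= lam_hat + half))
print(f"empirical coverage: {covered:.3f}")
```

The empirical coverage lands close to the nominal 95%, as expected for an asymptotic CI with a large \(n\).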

Hypothesis Testing

Contrary to estimation, the objective of hypothesis testing is not to find the true value of \(\theta\), but to answer whether \(\theta\in\Theta_0\) or \(\theta\in\Theta_1\), where \(\Theta_0\cap\Theta_1=\emptyset\). In this context, \(\Theta_0\) can also contain a single element \(\{\theta_0\}\), and it is not required that \(\Theta_0\cup\Theta_1=\Theta\), the entire parameter set.

If we had access to the entire population, we would be able to compute the true value of \(\theta\); but in addition to being impractical, this is most of the time computationally challenging. Hence, assume we collect one sample from said population and test the following two hypotheses against each other.

\[\begin{cases}H_0:&\theta\in\Theta_0&\text{null hypothesis}\\H_1:&\theta\in\Theta_1&\text{alternative hypothesis}\end{cases}\]

The objective of the test is to look for evidence in the data to reject \(H_0\), although lack of evidence will not mean that \(H_0\) is true. To achieve this, assume \(H_0\) is true and compute the region where \(\hat\Theta_n\) would fall with (low) probability \(\alpha\). The heuristic is that if an estimator’s realization \(\hat\theta_n\) (estimate) falls in this low-probability region, then it is unlikely that \(H_0\) explains our data, and \(H_0\) can therefore be rejected. Clearly, a single experiment \(\hat\theta_n\) (based on one or more observations) is not sufficient to draw definitive conclusions, but it is sufficient to estimate the chances of obtaining realizations at least as extreme as \(\hat\theta_n\).

Occasionally, we collect two samples from separate, independent populations and test whether their respective parameters are equal (or whether their difference equals a given value). In such cases, one sample is called the control group (representing the reference baseline) and the other the test group.

In either case (one-sample or two-sample tests), we need to make some modeling assumptions and reduce hypotheses to yes/no questions. The definitions below are useful to understand how a test is carried out:

  • Test statistic \(T_n(\hat\Theta_n,\theta)\in\mathbb R\) is a pivot if we are able to write it in such a way that its distribution under \(H_0\) is known and does not depend on any other parameters (such as the \(z\)-distribution, \(t\)-distribution or \(\chi^2\)-distribution). Occasionally, it is a sort of distance function between \(\hat\Theta_n\) and a reference \(\theta\). For example, if \(\hat\Theta_n=\bar X_n\), then a possible choice is \(T_n(\bar X_n,\mu_0)=\sqrt n\frac{\bar X_n-\mu_0}{\sigma_X}\), which we know converges to \(Z\sim\mathcal N(0,1)\) if \(H_0\) were true. If \(\sigma_X\) depends on \(\mu\), we should use \(\sigma_X(\mu_0)\), to remain consistent with the hypothesis made;
  • (Statistical) test \(\psi_\alpha\in\{0,1\}\) is a statistic, usually of the form \(\psi_\alpha=\mathbb 1(T_n(\hat\Theta_n,\theta)\gt c_\alpha)\), where \(\theta\in\Theta_0\). Recall that \(T_n\) acts as a distance function, so \(\psi_\alpha=1\) means that \(T_n\) exceeded a certain threshold and we therefore reject \(H_0\). When \(H_1:\theta\neq\theta_0\), \(T_n\) will be based on the absolute distance \(\lvert\hat\Theta_n-\theta_0\rvert\) and the test will be two-sided. There are also examples of composite tests, such as \(\mathbb 1(T_n\le a\cup T_n\ge b)\), whose thresholds are the same as in the corresponding one-sided exercises;
  • Type 1 error \(\alpha_\psi(\theta)\in[0,\alpha]\) is the probability of rejecting \(H_0\) when \(H_0\) is true. More formally, \(\alpha_\psi(\theta):\Theta_0\rightarrow[0,\alpha];\theta\mapsto\mathbb P_{\theta}(\psi=1)\). The upper bound \(\alpha\) is the (asymptotic) level and is a design parameter that defines the test (and not the other way around). In general, \(\alpha=\alpha_\psi(\theta_0)\), where \(\theta_0\) is at the boundary between \(\Theta_0\) and \(\Theta_1\), because this is the point of highest ambiguity. Naturally, if during an experiment \(H_0\) is rejected at level \(\alpha\), then the same experiment would have led to rejection at any level \(\alpha'>\alpha\). This means that \(\alpha'\) can also be a level of the test, just not the smallest one;
  • Threshold \(c_\alpha\in\mathbb R\), occasionally referred to as the critical value, is the main criterion defining when the test rejects \(H_0\), and it depends on the designated level \(\alpha\). If \(T_n\) is a pivot, asymptotically distributed according to \(\mathcal N(0,1)\), then \(\mathbb P(T_n\gt c_\alpha)\) converges to \(1-\Phi(c_\alpha)\) as \(n\rightarrow\infty\). Since by design we want \(\mathbb P(T_n\gt c_\alpha)\le\alpha\), \(c_\alpha=q_\alpha\) is the lowest threshold satisfying such relation, where \(q_\alpha\) is the standard normal quantile of order \(1-\alpha\); bear in mind that \(c_\alpha=q_{\frac\alpha2}\) for a two-sided test;
  • Rejection region \(R_{\psi_\alpha}\subseteq\Omega\) is the subspace whose elements satisfy \(\psi(X_1,\dots,X_n)=1\). To define a test \(\psi_\alpha\), it is sufficient to define \(R_{\psi_\alpha}\) in terms of the sample space, even though it can also be expressed in terms of \(T_n\) as \(R_{\psi_\alpha}=\{T_n\gt c_\alpha\}\), or of \(\hat\Theta_n\) as \(R_{\psi_\alpha}=\{T_n(\hat\Theta_n,\theta)\gt c_\alpha\}\), with \(\theta\in\Theta_0\). This last manipulation highlights the duality between CIs and rejection regions: the complement of the rejection region (the acceptance region) can be read as a CI, and the assessment of \(H_1:\theta\neq\theta_0\) can be promptly translated into \(\psi_\alpha=\mathbb 1(\theta_0\notin\mathcal I(\hat\Theta_n))\);
  • Type 2 error \(\beta_\psi(\theta)\in[0,1]\) is the probability of not rejecting \(H_0\) when \(H_1\) is true. More formally, \(\beta_\psi(\theta):\Theta_1\rightarrow[0,1];\theta\mapsto\mathbb P_{\theta}(\psi=0)\). Recall that \(H_1\) does not play a symmetric role (the data is only used to disprove \(H_0\)), and \(\beta_\psi\) is determined only after the rejection and acceptance regions have been set. Analogously to \(\alpha_\psi(\theta)\), the highest \(\beta_\psi(\theta)\) occurs in the proximity of \(\theta_0\in\Theta_0\) when \(\theta_0\) is also the supremum or infimum of \(\Theta_1\); in such a case, \(\sup_{\theta\in\Theta_1}\beta_\psi(\theta)\) equals \(1-\alpha_\psi(\theta_0)=1-\alpha\);
  • Power \(\pi_\psi=\inf_{\theta\in\Theta_1}(1-\beta_\psi(\theta))\) is the (worst-case) probability that test \(\psi\) rejects \(H_0\) when \(H_1\) is true. Considering that \(1-\beta_\psi(\theta)\) is minimal when \(\beta_\psi(\theta)\) is maximal, we conclude that \(\pi_\psi=\alpha\) when \(\Theta_0\) and \(\Theta_1\) are two contiguous partitions of the real line. In addition, bearing in mind that a Type 2 error is better tolerated than a Type 1 error, the major drawback of a high \(\beta_\psi(\theta)\) is that we lose power to detect \(H_1\) when it is actually true;
  • (Asymptotic) \(p\)-value is the smallest (asymptotic) level \(\alpha\) at which \(\psi_\alpha\) rejects \(H_0\). This definition may seem counterintuitive, since the \(p\)-value is just the probability of obtaining a \(T_n\) at least as extreme as the actually observed \(t_n\), assuming \(H_0\) is true, that is \(p=\mathbb P_{\theta\in\Theta_0}(T_n\gt t_n)\); but it relies on the fact that the \(p\)-value is random and depends on the sample, and that based on the actual realization one can set the level at which to accept or reject \(H_0\). In either case, the larger \(T_n\) (as earlier defined), the smaller the \(p\)-value, and the more confidently we can reject \(H_0\).
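
The definitions above can be tied together in a minimal two-sided \(z\)-test sketch (assuming numpy; \(\mu_0\), \(\sigma\) and the data-generating mean are made-up values for illustration). The pivot, threshold, rejection decision and \(p\)-value are all computed explicitly:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(7)
mu0, sigma, n = 5.0, 2.0, 400

# data actually drawn with mean 5.8, so H0: mu = mu0 is false here
x = rng.normal(5.8, sigma, n)

# pivot: under H0, T_n = sqrt(n) |Xbar - mu0| / sigma behaves like |N(0,1)|
t_n = sqrt(n) * abs(x.mean() - mu0) / sigma
c_alpha = 1.96                      # q_{alpha/2} for level alpha = 0.05
reject = t_n > c_alpha

# two-sided p-value: 2 * (1 - Phi(t_n)) = erfc(t_n / sqrt(2))
p_value = erfc(t_n / sqrt(2))
print(f"T_n = {t_n:.2f}, p-value = {p_value:.2e}, reject H0: {reject}")
```

With the true mean this far from \(\mu_0\), the test rejects \(H_0\) with a tiny \(p\)-value.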

Below is an indicative summary of the error types. Recall that while \(\alpha_\psi(\theta)\) (with \(\theta\in\Theta_0\)) and \(\beta_\psi(\theta)\) (with \(\theta\in\Theta_1\)) are functions, the level \(\alpha\) and the power \(\pi_\psi\) are values.

test outcome | \(H_0\) is true | \(H_1\) is true
\(H_0\) is not rejected | Correct inference (true negative), probability \(1-\alpha\) | Wrong inference (false negative), probability \(\beta\)
\(H_0\) is rejected | Wrong inference (false positive), probability \(\alpha\) | Correct inference (true positive), probability \(\pi\)



Back-up

Quadratic equation

The roots of a quadratic equation \(ax^2+bx+c=0\) are \(x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}\). A few simplifications apply when \(a=1\) or \(b=2\beta\).

Condition | Roots
General form | \(x=-\frac b {2a}\pm\sqrt{\left(\frac b {2a}\right)^2-\frac c a}\)
\(a=1\) | \(x=-\frac b 2\pm\sqrt{\left(\frac b 2\right)^2-c}\)
\(b=2\beta\) | \(x=-\frac \beta a \pm\sqrt{\left(\frac \beta a\right)^2-\frac c a}\)
\(a=1\) and \(b=2\beta\) | \(x=-\beta\pm\sqrt{\beta^2-c}\)

Variance empirical estimator

Assume \(X_i\overset{\text{i.i.d.}}{\sim}\mathcal P\), where \(\mathbb E[X_i]=\mu_X\) and \(\text{var}(X_i)=\sigma_X^2\). Assume also \(\bar X_n=\frac 1 n\sum_{i=1}^n X_i\), where \(\mathbb E[\bar X_n]=\mu_X\) and \(\text{var}(\bar X_n)=\frac{\sigma_X^2}{n}\). Now, observe that \(\mathbb E[X_i^2]=\sigma_X^2+\mu_X^2\), \(\mathbb E[\bar X_n^2]=\frac{\sigma_X^2}{n}+\mu_X^2\) and \(\mathbb E[X_i\bar X_n]=\frac 1 n\mathbb E[X_i^2+X_i\sum_{j\neq i}X_j]=\mathbb E[\bar X_n^2]\).

Based on the above, we compute \(\mathbb E[(X_i-\bar X_n)^2]=\mathbb E[X_i^2-2X_i\bar X_n+\bar X_n^2]=\frac{n-1}{n}\sigma_X^2\) and notice that the empirical variance estimator \(\hat\sigma_X^2=\frac 1 n\sum_{i=1}^n(X_i-\bar X_n)^2\) is biased, since \(\mathbb E[\hat\sigma_X^2]=\frac{n-1} n \sigma_X^2\). The bias can be addressed by defining the unbiased estimator \(s_X^2=\frac 1 {n-1}\sum_{i=1}^n(X_i-\bar X_n)^2\). Exactly the same argument can be used to derive the unbiased estimator of the covariance.
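
The factor \(\frac{n-1}n\) can be observed directly by simulation; a minimal sketch assuming numpy, with \(\sigma_X^2=4\) and \(n=10\), so that \(\mathbb E[\hat\sigma_X^2]=3.6\):

```python
import numpy as np

rng = np.random.default_rng(8)
sigma2, n, reps = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
biased = x.var(axis=1, ddof=0).mean()     # 1/n estimator: ~ (n-1)/n * sigma^2
unbiased = x.var(axis=1, ddof=1).mean()   # 1/(n-1) estimator: ~ sigma^2
print(f"1/n estimator mean {biased:.3f}, 1/(n-1) estimator mean {unbiased:.3f}")
```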