\[\newcommand{\ind}{\perp\kern-5pt\perp}\]

Definition

A random variable (r.v.) \(X\) is a function that associates a value with every possible outcome \(\omega\in\Omega\). Its values \(X(\omega)=x\) can be discrete (e.g. \(x\in\mathbb Z\)) or continuous (e.g. \(x\in\mathbb R\)). There can be multiple r.v. associated with the same sample space, and a function of one or more r.v. is in turn a r.v.

The probability mass function (PMF) \(p_X(x)\), on the other hand, is a function that associates a probability with every value \(x\) that \(X\) can take. More formally, \(p_X(x)=\mathbb P(\omega\in\Omega: X(\omega)=x)\). The usual axioms apply:

  • \(p_X(x)\ge0\), non negativity
  • \(\sum_x p_X(x)=1\), normalization

Expectation

The expected value of \(X\) is defined as \(\mu_X=\mathbb E[X]=\sum_x xp_X(x)\) and can be interpreted as the arithmetic mean (referred to by some as the “statistical hammer”) of a large number of independent realizations. For infinite sums, the expectation needs to be well defined (read: unambiguous, especially if \(X\) can also take negative values), hence we assume that \(\sum_x \lvert x\rvert p_X(x)<\infty\).

| Property | Formula |
| --- | --- |
| Bounds | \(a\le X\le b\implies a\le\mathbb E[X]\le b\) |
| Non negativity | \(X\ge0\implies\mathbb E[X]\ge0\) |
| Tail-sum formula, for \(X\in\{0,1,2,\dots\}\) | \(\displaystyle\mathbb E[X]=\sum_{k=0}^\infty\mathbb P(X>k)\) |
| Law Of The Unconscious Statistician (LOTUS) | \(\displaystyle\mathbb E[g(X)]=\sum_x g(x)p_X(x)\) |
| Indicator, \(\mathbb 1_A=\begin{cases}1&\text{event $A$ occurs}\\0&\text{otherwise}\end{cases}\) | \(\mathbb E[\mathbb 1_A]=\mathbb P(A)\) |
| Linearity | \(\mathbb E[aX+b]=a\mathbb E[X]+b\) |

Observe that while \(\mathbb 1_{A\cap B}=\mathbb 1_A\cdot\mathbb 1_B\in\{0,1\}\) is still an indicator r.v., \(\mathbb 1_A+\mathbb 1_B\in\{0,1,2\}\) is not. Instead, \(\mathbb 1_{A\cup B}=\mathbb 1_A+\mathbb 1_B-\mathbb 1_A\mathbb 1_B\), as given by the inclusion-exclusion formula.
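
As a minimal sketch of the properties above (plain Python; the fair-die PMF is a hypothetical example chosen purely for illustration), LOTUS, linearity and the indicator identity can be checked directly from the definition of expectation:

```python
# Sketch: expectation, LOTUS, linearity and indicators for a (hypothetical) fair die.
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # p_X(x) of a fair six-sided die

def E(g, pmf):
    """LOTUS: E[g(X)] = sum_x g(x) p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mu = E(lambda x: x, pmf)                                       # E[X] = 7/2
assert E(lambda x: 2 * x + 3, pmf) == 2 * mu + 3               # linearity: E[aX+b] = aE[X]+b
A = {2, 4, 6}                                                  # event "outcome is even"
assert E(lambda x: 1 if x in A else 0, pmf) == Fraction(1, 2)  # E[1_A] = P(A)
```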

Variance

Defined as \(\sigma_X^2=\text{var}(X)=\mathbb E[(X-\mathbb E[X])^2]\), variance measures the spread of a PMF and is invariant to location. \(\sigma_X=\sqrt{\sigma_X^2}\) is referred to as standard deviation.

| Property | Formula |
| --- | --- |
| Non negativity (zero only for constants) | \(\text{var}(X)\ge0\) |
| Equivalency | \(\text{var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2\) |
| Bounded | \(\displaystyle\text{var}(X)\le\frac{1}{4}(b-a)^2\), if \(a\le X\le b\) |
| Scaled by the square | \(\text{var}(aX)=a^2\text{var}(X)\) |
| Invariant with respect to location | \(\text{var}(X+b)=\text{var}(X)\) |

From the above, we see that \((\mathbb E[X])^2\le\mathbb E[X^2]\); a more general version of this result is known as Jensen's inequality. As for the upper bound, intuitively the maximum dispersion occurs when \(X\) has the same chance of being \(a\) or \(b\), or equivalently \(X=a+(b-a)Y\) with \(Y\sim\text{Ber}(\frac 1 2)\).
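
A short numerical check of the variance properties above, reusing the same hypothetical fair-die PMF (illustrative only):

```python
# Sketch: variance identities checked on a (hypothetical) fair-die PMF.
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
E = lambda g: sum(g(x) * p for x, p in pmf.items())

mu = E(lambda x: x)
var = E(lambda x: (x - mu) ** 2)                   # definition of var(X)
assert var == E(lambda x: x ** 2) - mu ** 2        # var(X) = E[X^2] - (E[X])^2
assert var <= Fraction(1, 4) * (6 - 1) ** 2        # bound with a = 1, b = 6
a, b = 3, 10
assert E(lambda x: (a * x + b - (a * mu + b)) ** 2) == a ** 2 * var  # var(aX+b) = a^2 var(X)
```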

Basic discrete r.v. distributions

| r.v. | PMF \(p_X(x)\) | Expectation \(\mu_X=\mathbb E[X]\) | Variance \(\sigma_X^2=\text{var}(X)\) |
| --- | --- | --- | --- |
| Bernoulli \(\text{Ber}(p)\) | \(p^x(1-p)^{1-x}\), where \(x\in\{0,1\}\) | \(p\) | \(p(1-p)\) |
| Rademacher \(\text{Rad}\) | \(\frac12\), where \(x\in\{-1,1\}\) | \(0\) | \(1\) |
| Binomial \(\text{Bin}(k,p)\) | \(\displaystyle{k\choose x}p^x(1-p)^{k-x}\), where \(x\in\{0,1,\dots,k\}\) | \(kp\) | \(kp(1-p)\) |
| Geometric \(\text{Geom}(p)\) | \((1-p)^{x-1}p\), where \(x\in\mathbb Z_{\ge1}\) | \(\displaystyle\frac{1}{p}\) | \(\displaystyle\frac{1-p}{p^2}\) |
| Pascal \(\text{NBin}(k,p)\) | \(\displaystyle{x-1\choose k-1}p^k(1-p)^{x-k}\), where \(x\in\mathbb Z_{\ge k}\) | \(\displaystyle k\frac1p\) | \(\displaystyle k\frac{1-p}{p^2}\) |
| Poisson \(\text{Pois}(\lambda)\) | \(\displaystyle e^{-\lambda}\frac{\lambda^x}{x!}\), where \(x\in\mathbb Z_{\ge0}\) | \(\lambda\) | \(\lambda\) |
| Uniform \(\text{Unif}(a,b)\) | \(\displaystyle\frac{1}{n}\), where \(x\in\{a,a+1,\dots,b-1,b\}\) and \(n=b-a+1\) | \(\displaystyle\frac{a+b}{2}\) | \(\displaystyle\frac{1}{12}(n^2-1)\) |
| Constant \(\text{Unif}(c,c)\) | \(1\), where \(x=c\) | \(c\) | \(0\) |
| Categorical \(\text{Cat}(\mathbf p)\) | \(\begin{cases}\prod_{j=1}^d p_j^{x_j}&\mathbf x\in\{0,1\}^d\\0&\text{otherwise}\end{cases}\), where \(\mathbf 1_d^T\mathbf x=1\), \(\mathbf 1_d^T\mathbf p=1\) | \(\mathbf p\) | \(\href{/2022/01/24/further-topics-on-RV.html#categorical}{\Sigma_\mathbf X}\) |
| Multinomial \(\text{Mult}(k,\mathbf p)\) | \(\begin{cases}{k\choose\mathbf x}\prod_{j=1}^d p_j^{x_j}&\mathbf x\in\{0,1,\dots,k\}^d\\0&\text{otherwise}\end{cases}\), where \(\mathbf 1_d^T\mathbf x=k\), \(\mathbf 1_d^T\mathbf p=1\) | \(k\mathbf p\) | \(k\href{/2022/01/24/further-topics-on-RV.html#categorical}{\Sigma_\mathbf X}\) |

From the definition of variance, we have \(\mathbb E[X^2]=\text{var}(X)+(\mathbb E[X])^2=\sigma_X^2+\mu_X^2\). Also, observe that for \(X\sim\text{Ber}(p)\), \(\mathbb E[X^k]=p\), \(\forall k\ge1\), while for \(X\sim\text{Rad}\) we have that \(\mathbb E[X^k]\) is equal to \(0\) if \(k\) is odd, and \(1\) if \(k\) is even.
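
The moment claims for Bernoulli and Rademacher variables follow directly from LOTUS; a minimal check (the value of \(p\) is arbitrary):

```python
# Sketch: E[X^k] for Ber(p) and Rademacher r.v., computed via LOTUS.
from fractions import Fraction

p = Fraction(3, 10)                                  # arbitrary illustrative p
ber = {0: 1 - p, 1: p}                               # Ber(p)
rad = {-1: Fraction(1, 2), 1: Fraction(1, 2)}        # Rademacher

moment = lambda pmf, k: sum(x ** k * q for x, q in pmf.items())

assert all(moment(ber, k) == p for k in range(1, 10))                    # always p
assert all(moment(rad, k) == (0 if k % 2 else 1) for k in range(1, 10))  # 0 if odd, 1 if even
```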

Computing the expectation and variance of a Pascal r.v. (a.k.a. negative binomial) is easier if we write \(X=\sum_{i=1}^k X_i\), with \(X_i\) i.i.d. \(\text{Geom}(p)\), in which case \(\mathbb E[X]=k\mathbb E[X_i]\) and \(\text{var}(X)=k\text{var}(X_i)\).
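
A quick simulation of this sum-of-geometrics argument (numpy assumed; the values of \(k\), \(p\) and the sample size are arbitrary):

```python
# Sketch: Pascal (negative binomial) as a sum of k i.i.d. Geom(p) r.v.
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 5, 0.3, 200_000

# numpy's geometric counts trials up to and including the first success,
# i.e. it uses the same support {1, 2, ...} as the table above.
X = rng.geometric(p, size=(n, k)).sum(axis=1)        # X = X_1 + ... + X_k

print(X.mean(), k / p)                               # ~ k * E[X_i]
print(X.var(), k * (1 - p) / p ** 2)                 # ~ k * var(X_i)
```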

Refer to the back-up for the CDFs of the above basic distributions, or if you are curious about these distributions and want to experiment with various parameters.

Conditioning

Conditioning on an event \(A\), with \(\mathbb P(A)>0\), does not alter the main PMF and expectation properties. Note that the conditioning event could also involve another discrete r.v., e.g. \(A=\{Y=y\}\).

| Property | Conditional on event \(A\) | Conditional on r.v. \(Y\) |
| --- | --- | --- |
| PMF | \(\displaystyle p_{X\lvert A}(x)=\frac{\mathbb P(\{X=x\}\cap A)}{\mathbb P(A)}\); in particular \(\displaystyle p_{X\lvert X\in A}(x)=\begin{cases}\frac{p_X(x)}{\mathbb P(A)}&\text{if $x\in A$}\\0&\text{otherwise}\end{cases}\) | \(\displaystyle p_{X\lvert Y}(x\lvert y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}\) |
| Normalization | \(\displaystyle\sum_x p_{X\lvert A}(x)=1\) | \(\displaystyle\sum_x p_{X\lvert Y}(x\lvert y)=1\) |
| Expectation | \(\displaystyle\mathbb E[X\lvert A]=\sum_x xp_{X\lvert A}(x)\) | \(\displaystyle\mathbb E[X\lvert Y=y]=\sum_x xp_{X\lvert Y}(x\lvert y)\) |
| LOTUS | \(\displaystyle\mathbb E[g(X)\lvert A]=\sum_x g(x)p_{X\lvert A}(x)\) | \(\displaystyle\mathbb E[g(X)\lvert Y=y]=\sum_x g(x)p_{X\lvert Y}(x\lvert y)\) |
| Total probability | \(\displaystyle p_X(x)=\sum_{i=1}^n\mathbb P(A_i)p_{X\lvert A_i}(x)\) | \(\displaystyle p_X(x)=\sum_y p_Y(y)p_{X\lvert Y}(x\lvert y)\) |
| Total expectation | \(\displaystyle\mathbb E[X]=\sum_{i=1}^n\mathbb P(A_i)\mathbb E[X\lvert A_i]\) | \(\displaystyle\mathbb E[X]=\sum_y p_Y(y)\mathbb E[X\lvert Y=y]\) |

Notice that \(\mathbb E[g(X,Y)\lvert Y=y']=\sum_x g(x,y')p_{X\lvert Y}(x\lvert y')=\sum_x\sum_y g(x,y)p_{X,Y\lvert Y}(x,y\lvert y')\) because \(p_{X,Y\lvert Y}(x,y\lvert y')=0\) for any \(y\neq y'\).
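
These formulas can be checked on a small joint table; below is a sketch with an arbitrarily chosen joint PMF (the numbers are only for illustration):

```python
# Sketch: conditioning on another r.v. and the total expectation theorem.
from fractions import Fraction as F

# joint PMF p_{X,Y}(x, y), arbitrary but summing to 1
joint = {(0, 0): F(1, 8), (0, 1): F(1, 8), (1, 0): F(1, 4), (1, 1): F(1, 2)}

p_X = {x: sum(q for (x2, _), q in joint.items() if x2 == x) for x in (0, 1)}
p_Y = {y: sum(q for (_, y2), q in joint.items() if y2 == y) for y in (0, 1)}

# conditional PMF p_{X|Y}(x|y) and conditional expectation E[X | Y = y]
p_X_given_Y = {(x, y): joint[(x, y)] / p_Y[y] for (x, y) in joint}
E_X_given_Y = {y: sum(x * p_X_given_Y[(x, y)] for x in (0, 1)) for y in (0, 1)}

# normalization and total expectation: E[X] = sum_y p_Y(y) E[X | Y = y]
assert all(sum(p_X_given_Y[(x, y)] for x in (0, 1)) == 1 for y in (0, 1))
assert sum(x * q for x, q in p_X.items()) == sum(p_Y[y] * E_X_given_Y[y] for y in (0, 1))
```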

Independence

Analogously to independence of two events, \(X\ind A\) if \(p_{X\lvert A}(x)=p_X(x), \forall x\) (equivalently, \(\mathbb P(\{X=x\}\cap A)=p_X(x)\mathbb P(A)\)). Similarly, \(X\ind Y\) if \(p_{X,Y}(x,y)=p_X(x)p_Y(y), \forall x,y\). Therefore, if \(X\ind Y\) we have the following.

| Property | Formula |
| --- | --- |
| Expectation of the product | \(\mathbb E[XY]=\mathbb E[X]\mathbb E[Y]\) |
| Independence of functions | \(\mathbb E[g(X)h(Y)]=\mathbb E[g(X)]\mathbb E[h(Y)]\) |
| Variance of a sum of independent r.v. | \(\text{var}(aX+bY)=a^2\text{var}(X)+b^2\text{var}(Y)\) |

Sometimes independence involves more than two variables; in such cases we should always bear in mind that if certain variables are inter-dependent (say \(X\) and \(Y\)) but independent from the others (say \(Z\)), then \(p_{X,Y,Z}(x,y,z)=p_{X,Y}(x,y)p_Z(z)\neq p_X(x)p_Y(y)p_Z(z)\).
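
A simulation sketch of these properties (numpy assumed; the particular distributions are arbitrary), with a dependent pair included for contrast:

```python
# Sketch: E[XY] = E[X]E[Y] and var(aX+bY) = a^2 var(X) + b^2 var(Y) for independent X, Y.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
X = rng.binomial(10, 0.4, n)      # X ~ Bin(10, 0.4), drawn independently of Y
Y = rng.poisson(3.0, n)           # Y ~ Pois(3)
a, b = 2, -3

print((X * Y).mean(), X.mean() * Y.mean())                     # approximately equal
print((a * X + b * Y).var(), a**2 * X.var() + b**2 * Y.var())  # approximately equal

Z = X + Y                         # Z is NOT independent of X: the product rule fails
print((X * Z).mean(), X.mean() * Z.mean())                     # noticeably different
```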

Memorylessness

The geometric distribution has a remarkable property. Suppose it models the number of coin tosses until the first head, where each toss yields a head with constant probability \(\mathbb P(H)=p\), independently of the previous tosses. Then, given that the first \(k\) tosses were all tails, the number of additional tosses needed is distributed as if the first \(k\) tosses never occurred, i.e. \(\mathbb P(X-k=x\lvert X>k)=\mathbb P(X=x)\). Setting \(k=1\) we obtain the following relationships:

\[\begin{align} \mathbb E[X\lvert X>1]&=\mathbb E[1+X]\\ \mathbb E[X^2\lvert X>1]&=\mathbb E[(1+X)^2] \end{align}\]

An alternative way to derive the above is to note that \(\mathbb P(X>k)=(1-p)^{k}\); since each toss is independent, the conditional PMF is obtained by dividing \(p_X(x)\), for \(x>k\), by \((1-p)^{k}\), which gives \((1-p)^{(x-k)-1}p\), i.e. again a \(\text{Geom}(p)\) PMF in \(x-k\).
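
A simulation of memorylessness for the geometric distribution (numpy assumed; \(p\) and \(k\) are arbitrary):

```python
# Sketch: memorylessness of Geom(p): given X > k, X - k is again Geom(p).
import numpy as np

rng = np.random.default_rng(2)
p, k, n = 0.25, 3, 1_000_000
X = rng.geometric(p, n)

shifted = X[X > k] - k                                # X - k, conditional on X > k
for x in range(1, 6):
    print(x, (shifted == x).mean(), (X == x).mean())  # the two frequencies agree

print(X[X > 1].mean(), 1 + X.mean())                  # E[X | X > 1] ~= E[1 + X]
```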

Joint PMF

Taking two discrete r.v., each having its own marginal distribution, say \(p_X(x)=\mathbb P(X=x)\) and \(p_Y(y)=\mathbb P(Y=y)\), their joint distribution is defined as \(p_{X,Y}(x,y)=\mathbb P(X=x\cap Y=y)\). Marginal distributions can be derived again from the joint distribution as \(p_X(x)=\sum_y p_{X,Y}(x,y)\) and \(p_Y(y)=\sum_x p_{X,Y}(x,y)\).

| Property | Formula |
| --- | --- |
| Non negativity | \(p_{X,Y}(x,y)\ge0\) |
| Normalization | \(\sum_x\sum_y p_{X,Y}(x,y)=1\) |
| Probability of event \(A\) | \(\displaystyle\mathbb P((X,Y)\in A)=\sum_{(x,y)\in A}p_{X,Y}(x,y)\) |

Given \(Z=g(X,Y)\), we have \(p_Z(z)=\mathbb P(g(X,Y)=z)=\sum_{(x,y):g(x,y)=z}p_{X,Y}(x,y)\). Accordingly, \(\mathbb E[Z]=\mathbb E[g(X,Y)]=\sum_x\sum_y g(x,y)p_{X,Y}(x,y)\). Due to linearity, if \(g(X,Y)=aX+bY\) then \(\mathbb E[g(X,Y)]=g(\mathbb E[X],\mathbb E[Y])=a\mathbb E[X]+b\mathbb E[Y]\).

Quite often, we need to model the joint probability of independent and identically distributed (i.i.d.) r.v., such as \(X_i\overset{\text{i.i.d.}}{\sim}\mathcal P\). In such a case, the joint PMF is \(p_{X_1,\dots,X_n}(x_1,\dots,x_n)=\prod_{i=1}^n p_{X_i}(x_i)\).
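
The factorization of the joint PMF, marginalization, and the PMF of a function \(Z=g(X,Y)\) can all be illustrated on two i.i.d. Bernoulli draws (the value of \(p\) is arbitrary); a sketch:

```python
# Sketch: joint PMF of two i.i.d. Ber(p) r.v., marginalization, and p_Z for Z = X1 + X2.
from fractions import Fraction as F
from itertools import product

p = F(1, 3)
ber = {0: 1 - p, 1: p}

# independence: p_{X1,X2}(x1, x2) = p_{X1}(x1) p_{X2}(x2)
joint = {(x1, x2): ber[x1] * ber[x2] for x1, x2 in product(ber, ber)}
assert sum(joint.values()) == 1                       # normalization

marginal = {x1: sum(joint[(x1, x2)] for x2 in ber) for x1 in ber}
assert marginal == ber                                # marginal recovers Ber(p)

# p_Z(z) = sum of p_{X1,X2}(x1, x2) over {(x1, x2): x1 + x2 = z}; here Z ~ Bin(2, p)
p_Z = {}
for (x1, x2), q in joint.items():
    p_Z[x1 + x2] = p_Z.get(x1 + x2, 0) + q
print(p_Z)                                            # {0: 4/9, 1: 4/9, 2: 1/9}
```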

Go back to the syllabi breakdown.


Back-up

In deriving the various results above, it may be useful to bear in mind the following relations.

| Formula | Equivalent form |
| --- | --- |
| \(\displaystyle s_n=\sum_{i=1}^n a_i\) | \(s_n=n\bar a\), with \(\displaystyle\bar a=\frac{1}{n}\sum_{i=1}^n a_i\) the arithmetic mean |
| \(\displaystyle s_n=\prod_{i=1}^n a_i\) | \(s_n={\bar a}^n\), with \(\displaystyle\bar a=\sqrt[n]{\prod_{i=1}^n a_i}\) the geometric mean |
| \(\displaystyle s_n=\sum_{i=1}^n\frac{1}{a_i}\) | \(\displaystyle s_n=\frac{n}{\bar a}\), with \(\displaystyle\bar a=\frac{n}{\sum_{i=1}^n \frac{1}{a_i}}\) the harmonic mean |
| \(\displaystyle\sum_{k=1}^nk\) | \(\displaystyle\frac{n(n+1)}{2}\) |
| \(\displaystyle\sum_{k=1}^nk^2\) | \(\displaystyle\frac{n(n+1)(2n+1)}{6}\) |
| \(\displaystyle\sum_{k=0}^\infty\frac{\lambda^k}{k!}\) | \(e^\lambda\) |
| \((X_1+\dots+X_n)^2\) | \(\displaystyle\underbrace{\sum_{i=1}^n X_i^2}_\text{$n$ terms}+\underbrace{\sum_{i\neq j}X_iX_j}_\text{$n^2-n$ terms}\) |
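
A quick numerical spot-check of the closed-form sums above (the values of \(n\) and \(\lambda\) are arbitrary):

```python
# Sketch: spot-checking the closed-form sums listed above.
import math

for n in (1, 5, 12):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
    assert sum(k * k for k in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6

lam = 1.7
series = sum(lam ** k / math.factorial(k) for k in range(60))   # truncated series
assert abs(series - math.exp(lam)) < 1e-12
```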

CDF of basic discrete r.v. distributions

| r.v. | CDF \(F_X(x)\) |
| --- | --- |
| Bernoulli \(\text{Ber}(p)\) | \(\begin{cases}0&x\lt 0\\1-p&0\le x\lt1\\1& x\ge 1\end{cases}\) |
| Binomial \(\text{Bin}(k,p)\) | \(\begin{cases}0&x\lt 0\\\displaystyle\sum_{i=0}^{\lfloor x\rfloor}{k\choose i}p^i(1-p)^{k-i}&0\le x\lt k\\1& x\ge k\end{cases}\) |
| Uniform \(\text{Unif}(a,b)\) | \(\begin{cases}0&x\lt a\\\displaystyle\frac{\lfloor x\rfloor-a+1}{n}&a\le x\lt b\\1& x\ge b\end{cases}\), where \(n=b-a+1\) |
| Constant \(\text{Unif}(c,c)\) | \(\begin{cases}0&x\lt c\\1& x\ge c\end{cases}\) |
| Geometric \(\text{Geom}(p)\) | \(\begin{cases}0&x\lt 1\\\displaystyle 1-(1-p)^{\lfloor x\rfloor}&x\ge1\end{cases}\) |
| Pascal \(\text{NBin}(k,p)\) | \(\begin{cases}0&x\lt k\\\displaystyle\sum_{n=k}^{\lfloor x\rfloor}{n-1\choose k-1}p^k(1-p)^{n-k}&x\ge k\end{cases}\) |
| Poisson \(\text{Pois}(\lambda)\) | \(\begin{cases}0&x\lt 0\\\displaystyle e^{-\lambda}\sum_{k=0}^{\lfloor x\rfloor}\frac{\lambda^k}{k!}&x\ge0\end{cases}\) |
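
Each CDF above is just the cumulative sum of the corresponding PMF; a minimal check for the geometric case (the value of \(p\) is arbitrary):

```python
# Sketch: geometric CDF, 1 - (1-p)^floor(x), vs the cumulative sum of its PMF.
p = 0.2
for x in range(1, 11):
    closed_form = 1 - (1 - p) ** x
    pmf_sum = sum((1 - p) ** (i - 1) * p for i in range(1, x + 1))
    assert abs(closed_form - pmf_sum) < 1e-12
```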