
Elementary Probability Theory

Probability Space

Definitions of \(\sigma\)-algebra

A family \(\mathscr{F}\) of subsets of a non-empty set \(\Omega\) is called a \(\pmb{\sigma}\)-algebra on \(\Omega\), if it satisfies

(i) \(\varnothing, \Omega\in \mathscr{F}\),

(ii) If \(A\in \mathscr{F}\), then \(A^c\in \mathscr{F}\),

(iii) If a sequence of sets \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then \(\bigcup\limits_{n=1}^\infty A_n\in \mathscr{F}\).

It is easy to see that for a set \(\Omega\), the smallest \(\sigma\)-algebra is \(\mathscr{F}=\{\varnothing, \Omega\}\), and the biggest one consists of all subsets of \(\Omega\) (the power set), denoted by \(2^\Omega\).
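The axioms can be checked mechanically on a small finite \(\Omega\). The sketch below (plain Python; the helper names are ours, not from the text) verifies that both \(\{\varnothing, \Omega\}\) and \(2^\Omega\) satisfy (i)–(iii); on a finite space, countable unions reduce to finite ones, so checking pairwise unions suffices.

```python
from itertools import chain, combinations

def powerset(omega):
    """All subsets of omega as frozensets -- this is 2^Omega."""
    s = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))}

def is_sigma_algebra(F, omega):
    """Check axioms (i)-(iii) for a family F of frozensets on a finite omega."""
    omega = frozenset(omega)
    if frozenset() not in F or omega not in F:      # (i)
        return False
    if any(omega - A not in F for A in F):          # (ii) closed under complement
        return False
    return all(A | B in F for A in F for B in F)    # (iii) closed under (finite) union

Omega = {1, 2, 3}
smallest = {frozenset(), frozenset(Omega)}
largest = powerset(Omega)
```

Note that a family such as \(\{\varnothing, \{1\}, \Omega\}\) fails the complement axiom, which the checker detects.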

Properties of \(\sigma\)-algebra

Assume \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\), then

(i) If \(A,B\in \mathscr{F}\), then \(A\cap B, A\cup B, A-B \in \mathscr{F}\),

(ii) If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then \(\bigcap\limits_{n=1}^\infty A_n\in \mathscr{F}\).

(i) \(A\cup B=A\cup B\cup \varnothing\cup \cdots \cup \varnothing\in \mathscr{F}\). \(A\cap B=(A^c\cup B^c)^c\in \mathscr{F}\). \(A-B=A\cap B^c\in \mathscr{F}\).

(ii) By De Morgan's formula, \(\bigcap\limits_{n=1}^\infty A_n=\left(\bigcup\limits_{n=1}^\infty A_n^c\right)^c \in \mathscr{F}\).

Example. (Discrete) Assume the sequence of events \(\{\Omega_n\}_{n\geq 1}\) is a partition of \(\Omega\), then

\[ \mathscr{A}:=\left\{\bigcup_{i\in I}\Omega_i: I\subset \{1,2,\cdots\}\right\} \]

is a \(\sigma\)-algebra on \(\Omega\) (a sub-\(\sigma\)-algebra of \(2^\Omega\)).
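For a finite partition this \(\mathscr{A}\) can be enumerated directly: take all unions of sub-collections of the blocks. A minimal sketch, with an illustrative partition of \(\{1,\dots,6\}\) chosen by us:

```python
from itertools import chain, combinations

# Hypothetical finite partition of Omega = {1,...,6} into three blocks.
blocks = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]

def generated_algebra(blocks):
    """A = { union over i in I of Omega_i : I a subset of the index set }."""
    idx = range(len(blocks))
    index_subsets = chain.from_iterable(
        combinations(idx, r) for r in range(len(blocks) + 1))
    return {frozenset().union(*(blocks[i] for i in I)) for I in index_subsets}

A = generated_algebra(blocks)
```

With \(k\) blocks the family has \(2^k\) members, matching the count of index sets \(I\).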

Kolmogorov: Definition of probability measure

Assume \(\Omega\) is a sample space and \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\). A function \(\mathbb{P}\) on \(\mathscr{F}\) is called a probability measure, if it satisfies

(i) Non-negativity. \(\forall A\in \mathscr{F}\), \(\mathbb{P}(A)\geq 0\).

(ii) Normalization. \(\mathbb{P}(\Omega)=1\).

(iii) Sigma-additivity. Assume \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) with \(A_i\cap A_j=\varnothing\) for all \(i\neq j\), then

\[ \mathbb{P}\left(\bigcup_{n\geq 1}A_n\right)=\sum_{n\geq 1}\mathbb{P}(A_n). \]

In this case, \(\mathbb{P}(A)\) is called the probability that event \(A\) occurs.

We combine \(\Omega\), \(\mathscr{F}\) and \(\mathbb{P}\) above into the triple \((\Omega, \mathscr{F}, \mathbb{P})\), which is called a Probability Space.

Because of sigma-additivity, we can use all the tools and results of measure theory; indeed, apart from normalization, the definition of a probability measure is the same as that of a measure. However, probability theory has some phenomena and methods of its own.

Regarding the \(\sigma\)-algebra, we cannot always choose the power set, because \(\Omega\) may have non-denumerably many elements.

Properties of Probability Measure

(i) \(\mathbb{P}(\varnothing)=0\).

(ii) If \(A,B\in \mathscr{F}\), \(A\cap B=\varnothing\), then \(\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)\).

(iii) If \(A,B\in \mathscr{F}\), \(A\subset B\), then \(\mathbb{P}(B-A)=\mathbb{P}(B)-\mathbb{P}(A)\), so \(\mathbb{P}(A)\leq \mathbb{P}(B)\).

(iv) \(\mathbb{P}(A^c)=1-\mathbb{P}(A)\).

(v) Sub-sigma-additivity. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then

\[ \mathbb{P}\left(\bigcup_n A_n\right)\leq \sum_n \mathbb{P}(A_n). \]

(vi) Continuity from below. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) and the sequence monotonically increases, then

\[ \mathbb{P}\left(\bigcup_n A_n\right)=\lim_n \mathbb{P}(A_n). \]

(vii) Continuity from above. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) and the sequence monotonically decreases, then

\[ \mathbb{P}\left(\bigcap_n A_n\right)=\lim_n \mathbb{P}(A_n). \]

Note: a probability cannot be assigned arbitrarily, since it must be a function defined on a \(\sigma\)-algebra. When \(\Omega\) has denumerably many elements, a probability space is easy to construct.

Discrete Probability Space

(i) \((\Omega,\{\varnothing, A, A^c, \Omega\},\mathbb{P})\) is called Bernoulli probability space, if

\[ \mathbb{P}(\varnothing)=0,\quad\mathbb{P}(A)=p, \quad\mathbb{P}(A^c)=1-p, \quad\mathbb{P}(\Omega)=1. \]

(ii) As in the Example above, define \(\mathbb{P}\) by

\[ \mathbb{P}\left(\bigcup_{i\in I}\Omega_i\right)=\sum_{i\in I}\mathbb{P}(\Omega_i),\qquad I\subset\{1,2,\cdots\}, \]

then \((\Omega, \mathscr{A}, \mathbb{P})\) is a probability space, also called Discrete Probability Space.

Example. Assume \(\Omega\) has denumerably many elements and choose its power set as \(\mathscr{F}\). Take a function \(p:\Omega\rightarrow \mathbb{R}\) assigning a weight \(p(\omega)\geq 0\) to every sample point \(\omega\in\Omega\), such that \(\sum\limits_{\omega\in \Omega}p(\omega)=1\), and for all \(A\subset \Omega\) define

\[ \mathbb{P}(A):=\sum_{\omega\in A}p(\omega). \]

Show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. If \(|\Omega|<\infty\) and all weights are equal, then

\[ \mathbb{P}(\{\omega\})=\frac{1}{|\Omega|}, \]

which is the classical (equally likely) model.
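The classical model is simple enough to compute exactly. A minimal sketch for one fair die, using exact fractions so the axioms can be checked without rounding (the names `p`, `P` are ours):

```python
from fractions import Fraction

# Classical (equally likely) model on a finite Omega: one fair die.
Omega = {1, 2, 3, 4, 5, 6}
p = {omega: Fraction(1, len(Omega)) for omega in Omega}   # p(omega) = 1/|Omega|

def P(A):
    """P(A) = sum of p(omega) over omega in A, for any A subset of Omega."""
    return sum(p[omega] for omega in A)

even = {2, 4, 6}
```

Here \(\mathbb{P}(\text{even})=\frac{3}{6}=\frac12\), and normalization \(\mathbb{P}(\Omega)=1\) holds by construction.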

If \(\Omega\) has non-denumerably many elements, then it is not easy to construct its \(\sigma\)-algebra.

Example. Assume \(\Omega=[0,1]\) and \(\mathscr{F}\) is the Borel \(\sigma\)-algebra \(\mathscr{B}([0,1])\). For \(A\in \mathscr{F}\), let \(\mathbb{P}(A)=|A|\), the Lebesgue measure of \(A\). Then show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, also called a Geometric Probability Space.

Random Variable

This part follows a path similar to that of measurable functions, but gives the specific definitions.

Simply speaking, a random variable is a measurable function on \(\Omega\), endowing basic events with numbers.

Definition of Random Variable

Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. A function \(\xi:\Omega\rightarrow \mathbb{R}\) defined on \(\Omega\) is called a Random Variable, if \(\forall \alpha\in \mathbb{R}\)

\[ \{\xi\leq \alpha\}:=\{\omega\in \Omega: \xi(\omega)\leq \alpha\}\in \mathscr{F}. \]

Note: for a random variable \(\xi\), we say \(\xi\in A\) if \(\xi(\omega)\in A\) for all \(\omega\in \Omega\). If merely \(\mathbb{P}(\xi\in A)=1\) (even when the pointwise containment fails), we call \(\xi\) distributed on \(A\).

Readers could compare this definition with measurable functions and their equivalent definitions. Here measurability of \(\xi\) means that the information in the \(\sigma\)-algebra is enough to determine \(\xi\).

Properties of Random Variable

(i) The random variables on a given \(\sigma\)-algebra form a linear space. The proof of closure under addition is similar to that for measurable functions.

(i) We prove closure under addition. For random variables \(\xi, \eta\) and \(\forall \alpha\in \mathbb{R}\), we have

\[ \{\xi+\eta<\alpha\}=\bigcup_{r\in \mathbb{Q}}\left(\{\xi<r\}\cap\{\eta<\alpha-r\}\right) \]

where the right side is a denumerable union of subsets in \(\sigma\)-algebra, which still lies in \(\mathscr{F}\) by its definition.

As in measure theory, we have the characteristic (indicator) function. For \(A\subset \Omega\),

\[ 1_A(x)=\begin{cases} 1,\quad x\in A\\ 0,\quad x\notin A. \end{cases} \]

Example. A set \(A\subset \Omega\) has characteristic function \(1_A\); then

\[ \{1_A\leq \alpha\}=\begin{cases} \Omega,\quad &\alpha\geq 1,\\ A^c,\quad &\alpha\in[0,1),\\ \varnothing,\quad &\alpha<0. \end{cases} \]

Notice here we use \(\{\xi\leq \alpha\}\) rather than \(\{\xi>\alpha\}\), because the former has a more direct practical meaning.

Definitions of discrete random variables

If a random variable \(\xi\) is distributed on a denumerable set, i.e. the range of \(\xi\) is denumerable, then we call \(\xi\) a discrete random variable, and denote its range by \(R(\xi)\). If \(R(\xi)\) is finite, then we call \(\xi\) a simple random variable. In terms of form, we have

\[ \xi=\sum_{x\in R(\xi)}x 1_{\{\xi=x\}}. \]
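The canonical form \(\xi=\sum_{x\in R(\xi)}x\,1_{\{\xi=x\}}\) can be verified on a toy example. A sketch with an illustrative simple random variable of our choosing on a six-point \(\Omega\):

```python
# A simple random variable on a finite Omega, rebuilt from its canonical form
# xi = sum over x in R(xi) of x * 1_{xi = x}.
Omega = [1, 2, 3, 4, 5, 6]
xi = {omega: omega % 3 for omega in Omega}           # illustrative; values in {0,1,2}

R_xi = set(xi.values())                              # the range R(xi)

def indicator(event):
    """Characteristic function 1_A, as a dict omega -> 0 or 1."""
    return {omega: 1 if omega in event else 0 for omega in Omega}

# Rebuild xi pointwise from the canonical representation.
rebuilt = {omega: sum(x * indicator({w for w in Omega if xi[w] == x})[omega]
                      for x in R_xi)
           for omega in Omega}
```

The rebuilt function agrees with \(\xi\) at every sample point, since the level sets \(\{\xi=x\}\) partition \(\Omega\).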

Distribution Function

Definition of Distribution function

Assume \(\xi\) is a random variable, then \(\forall x\in \mathbb{R}\),

\[ F_\xi(x):=\mathbb{P}(\xi\leq x) \]

is called the Distribution Function of \(\xi\).

Properties of Distribution Function

Assume \(F_\xi\) is a distribution function, then

(i) \(F_\xi\) monotonically increases,

(ii) \(F_\xi\) is right continuous,

(iii) \(\lim\limits_{x\rightarrow -\infty}F_\xi(x)=0,\quad \lim\limits_{x\rightarrow +\infty}F_\xi(x)=1\).

Example. (Bernoulli Distribution). A random experiment with only two outcomes is usually called a Bernoulli experiment. Denote the success probability of an event \(A\) by \(\mathbb{P}(A)=p\) and its counterpart by \(q=1-p\); then the indicator of success \(\xi\) is a random variable with distribution

\[ \left(\begin{array}{ccc} \xi & 0&1\\ \mathbb{P} &q &p \end{array}\right). \]

The indicator \(1_A\) of an event \(A\) is Bernoulli distributed, and any Bernoulli random variable must be the indicator of some event.

We care more about distributions than about random variables themselves. Only random variables \(\xi,\eta\) defined on the same probability space have a chance to be equal; when defined on different probability spaces, we can still compare them through their distribution functions, since both the range \((R(\xi))\) and the distribution function \((F_\xi(x))\) live on \(\mathbb{R}\).

Definition of Same distribution

Two random variables (possibly defined on different probability spaces) \(\xi\) and \(\eta\) are said to have the Same Distribution, if their distribution functions are the same.

Conditional Probability

Definitions of Independence

Events \(A, B\) are said to be independent, if

\[ \mathbb{P}(A\cap B)=\mathbb{P}(A)\mathbb{P}(B). \]

A sequence of events \(\{A_n\}_{n\geq 1}\) is said to be mutually independent, if for every finite subcollection \(\{A_{n_j}\}_{1\leq j\leq k}\),

\[ \mathbb{P}\left(\bigcap_{j=1}^k A_{n_j}\right)=\prod_{j=1}^k \mathbb{P}(A_{n_j}). \]

Random variables \(\{\xi_i\}_{1\leq i\leq n}\) are said to be mutually independent, if \(\forall x_i\in \mathbb{R}\ (1\leq i\leq n)\),

\[ \mathbb{P}(\xi_1\leq x_1,\cdots, \xi_n\leq x_n)=\prod_{i=1}^n \mathbb{P}(\xi_i\leq x_i). \]
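On a finite space this independence criterion can be checked exhaustively over a grid of thresholds. A sketch with a uniform space of pairs, where the two coordinates are independent by construction (the helper `independent` is our own name):

```python
from fractions import Fraction
from itertools import product

# Uniform probability on Omega = {0,1} x {0,...,5}: 12 equally likely points.
Omega = list(product(range(2), range(6)))
P_point = Fraction(1, len(Omega))

def P(event):
    """Probability of {omega : event(omega) is true}."""
    return sum(P_point for omega in Omega if event(omega))

xi  = lambda omega: omega[0]          # first coordinate
eta = lambda omega: omega[1]          # second coordinate

def independent(f, g, xs, ys):
    """Check P(f<=x, g<=y) = P(f<=x) P(g<=y) on a grid of thresholds."""
    return all(
        P(lambda w: f(w) <= x and g(w) <= y)
        == P(lambda w: f(w) <= x) * P(lambda w: g(w) <= y)
        for x, y in product(xs, ys))
```

By contrast, \(\xi\) and \(1-\xi\) are functionally dependent, and the same check fails for them.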

Definitions of Conditional Probability

Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, \(A,B\in \mathscr{F}\), and \(\mathbb{P}(A)>0\). The conditional probability of \(B\) given \(A\) is

\[ \mathbb{P}(B|A):=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}. \]

The mapping \(B\mapsto \mathbb{P}(B|A)\) is a probability on \((\Omega, \mathscr{F})\), and also a probability on the restricted space \((A,A\cap \mathscr{F})\), where \(A\cap \mathscr{F}\) is a \(\sigma\)-algebra on \(A\).

Properties of Conditional Probability

(i) Random variables \(\xi_1,\cdots,\xi_n\) are mutually independent, iff \(\forall x_i\leq y_i, 1\leq i\leq n\),

\[ \mathbb{P}(x_1<\xi_1\leq y_1,\cdots,x_n<\xi_n\leq y_n)=\mathbb{P}(x_1<\xi_1\leq y_1)\cdots\mathbb{P}(x_n<\xi_n\leq y_n). \]

(ii) If the random variables are discrete, then they are independent iff

\[ \mathbb{P}(\xi_1=x_1,\cdots,\xi_n=x_n)=\mathbb{P}(\xi_1=x_1)\cdots\mathbb{P}(\xi_n=x_n). \]

(iii) \(\mathbb{P}((C|B)|A)=\mathbb{P}(C|A\cap B)\).

Example. An infinite sequence of coin tosses can be realized iff a uniform distribution on \([0,1]\) can be realized.

  • \(\Rightarrow\). Suppose the random variable \(\xi\) defined on the probability space \((\Omega, \mathscr{F},\mathbb{P})\) is uniformly distributed on \([0,1]\). Denote its \(n\)-th binary digit by \(\xi_n\), with \(\xi_n \in \{0,1\}\). The first \(n\) digits confine the value to an interval of length \(\frac{1}{2^n}\), i.e.
\[ \mathbb{P}(\xi_1=a_1,\cdots,\xi_n=a_n)=\frac{1}{2^n} \]

which means \(\{\xi_n\}_{n\geq 1}\) are mutually independent fair digits, so they describe the coin-toss problem.

  • \(\Leftarrow\). Suppose the random variables \(\{\xi_n\}_{n\geq 1}\) defined on \((\Omega, \mathscr{F},\mathbb{P})\) are mutually independent with \(\mathbb{P}(\xi_n=0)=\mathbb{P}(\xi_n=1)=\frac{1}{2}\). Define
\[ \xi:=\sum_{n=1}^\infty\frac{\xi_n}{2^n}. \]

Then \(\xi\) is given by a binary expansion, and \(\forall n\geq 1, 0\leq k\leq 2^n-1\), we have

\[ \mathbb{P}(\xi\in [\frac{k}{2^n},\frac{k+1}{2^n}])=\frac{1}{2^n}, \]

meaning \(\xi\) is uniformly distributed on \([0,1]\).

\(\square\)
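The \(\Leftarrow\) direction is easy to simulate: truncate \(\xi=\sum_n \xi_n/2^n\) at finitely many fair coin tosses and check empirically that the values spread uniformly over dyadic intervals. A sketch with a fixed seed and tolerances chosen loosely by us:

```python
import random

random.seed(0)

def uniform_from_coins(n_bits=32):
    """xi = sum_{n>=1} xi_n / 2^n, truncated at n_bits fair coin tosses."""
    return sum(random.randint(0, 1) / 2 ** n for n in range(1, n_bits + 1))

# The first 2 bits place xi in one of 4 dyadic intervals of length 1/4,
# so each interval should receive roughly a quarter of the samples.
N = 20000
samples = [uniform_from_coins() for _ in range(N)]
freqs = [sum(1 for s in samples if k / 4 <= s < (k + 1) / 4) / N
         for k in range(4)]
```

With 20000 samples the empirical frequencies sit within a few standard deviations of \(\frac14\), illustrating \(\mathbb{P}(\xi\in[\frac{k}{2^n},\frac{k+1}{2^n}])=\frac{1}{2^n}\).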

Total Probability Formula

Assume the events \(\{\Omega_n\}_{n\geq 1}\subset \mathscr{F}\) form a partition of \(\Omega\) with \(\mathbb{P}(\Omega_n)>0\), then \(\forall A\in \mathscr{F}\),

\[ \mathbb{P}(A)=\sum_{n\geq 1}\mathbb{P}(A\cap \Omega_n)=\sum_{n\geq 1}\mathbb{P}(A|\Omega_n)\mathbb{P}(\Omega_n). \]

Since \(A=A\cap \Omega=\bigcup_{n\geq 1}(A\cap\Omega_n)\) is a disjoint union, by sigma-additivity

\[ \mathbb{P}(A)=\mathbb{P}\left(\bigcup_{n\geq 1}(A\cap\Omega_n)\right)=\sum_{n\geq 1}\mathbb{P}(A\cap\Omega_n)=\sum_{n\geq 1}\mathbb{P}(A|\Omega_n)\mathbb{P}(\Omega_n). \]

\(\square\)

Definition of Bayes Formula

If \(A\) happens, then we can calculate the probability of each category \(\Omega_n\):

\[ \mathbb{P}(\Omega_n|A)=\frac{\mathbb{P}(A|\Omega_n)\mathbb{P}(\Omega_n)}{\mathbb{P}(A)}. \]

where \(\mathbb{P}(A)\) is calculated using the total probability formula. The above formula is called the Bayes Formula.
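Both formulas can be exercised on a small two-category example. The priors and likelihoods below are hypothetical numbers of our own choosing (a rare condition and an imperfect test); exact fractions keep the arithmetic transparent:

```python
from fractions import Fraction

# Hypothetical partition Omega_1 (condition), Omega_2 (no condition),
# with an observed event A (positive test).
prior = {1: Fraction(1, 100), 2: Fraction(99, 100)}       # P(Omega_n)
likelihood = {1: Fraction(99, 100), 2: Fraction(5, 100)}  # P(A | Omega_n)

# Total probability formula: P(A) = sum_n P(A | Omega_n) P(Omega_n).
P_A = sum(likelihood[n] * prior[n] for n in prior)

# Bayes formula: P(Omega_n | A) = P(A | Omega_n) P(Omega_n) / P(A).
posterior = {n: likelihood[n] * prior[n] / P_A for n in prior}
```

Here the posterior \(\mathbb{P}(\Omega_1|A)\) works out to exactly \(\frac16\): even a strong likelihood cannot overcome a very small prior.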

Mathematical Expectation

This part also follows a path similar to that of the Lebesgue integral.

Definitions of ME for simple RV

(i) Assume \(\xi\) is a simple random variable

\[ \xi=\sum_{x\in R(\xi)}x\cdot 1_{\{\xi=x\}} \]

define its mathematical expectation as a weighted average

\[ \mathbb{E}\xi=\sum_{x\in R(\xi)}x\mathbb{P}(\xi=x). \]

Actually, it is the Lebesgue integral of a simple function.
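The two readings of \(\mathbb{E}\xi\) — a weighted average over the range, and a pointwise sum over \(\Omega\) — must agree. A sketch on a fair die with an illustrative simple random variable:

```python
from fractions import Fraction

# A simple RV on a fair die; the two ways of computing E(xi) must agree.
Omega = range(1, 7)
p = Fraction(1, 6)                                   # uniform weight
xi = {omega: min(omega, 4) for omega in Omega}       # illustrative simple RV

# Weighted average over the range: sum over x of x * P(xi = x).
E_by_range = sum(x * sum(p for w in Omega if xi[w] == x)
                 for x in set(xi.values()))

# Pointwise Lebesgue-style sum over Omega: sum over omega of xi(omega) p(omega).
E_by_points = sum(xi[w] * p for w in Omega)
```

For this \(\xi\) the common value is \((1+2+3+4+4+4)/6=3\).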

Representation independence of ME for simple RV

If \(x_1,\cdots,x_n\in \mathbb{R}\), and \(\Omega_1,\cdots,\Omega_n\) form a finite partition of \(\Omega\), then

\[ \mathbb{E}\left(\sum_{i=1}^n x_i 1_{\Omega_i}\right)=\sum_{i=1}^nx_i\mathbb{P}(\Omega_i). \]

Note here \(\sum\limits_{i=1}^n\mathbb{P}(\Omega_i)=1\).

Use intersections to pass from a partition of the range to a partition of the domain of definition. Let \(\xi=\sum_i x_i 1_{\Omega_i}\), then

\[ \begin{align*} \mathbb{E}\xi&=\sum_{y\in R(\xi)}y\mathbb{P}(\xi=y)\\ &=\sum_{y\in R(\xi)} y\sum_{i=1}^n \mathbb{P}(\xi=y, \Omega_i)\\ &=\sum_{i=1}^n\sum_{y\in R(\xi)} y \mathbb{P}(\xi=y, \Omega_i)\\ &=\sum_{i=1}^n x_i \mathbb{P}(\xi=x_i, \Omega_i)\quad \text{ for } j\neq i, \mathbb{P}(\xi=x_j, \Omega_i)=0\\ &=\sum_{i=1}^n x_i \mathbb{P}(\Omega_i). \end{align*} \]

\(\square\)

Properties of simple RV

Assume \(\xi,\eta\) are simple random variables,

(i) if \(\xi\geq 0\), then \(\mathbb{E}\xi\geq 0\).

(ii) Homogeneity. \(\forall a\in \mathbb{R}\), \(\mathbb{E}(a\xi)=a\mathbb{E}\xi\).

(iii) Linearity. \(\mathbb{E}(\xi+\eta)=\mathbb{E}(\xi)+\mathbb{E}(\eta)\).

(iv) Characteristic function. If \(A\in \mathscr{F}\), then \(\mathbb{E}1_A=\mathbb{P}(A)\).

(v) Zero a.s.. If \(\mathbb{P}(\xi\neq 0)=0\), then \(\mathbb{E}\xi=0\).

(vi) Monotonicity. If \(\xi\leq \eta\), then \(\mathbb{E}\xi\leq \mathbb{E}\eta\).

(vii) Independence property. If \(\xi\) and \(\eta\) are independent, then \(\mathbb{E}(\xi\cdot \eta)=\mathbb{E}\xi\cdot\mathbb{E}\eta\).

Corollary: using linearity

Choose arbitrary events \(\{A_k\}_{1\leq k\leq n}\) and real number \(\{x_k\}_{1\leq k\leq n}\), then

\[ \mathbb{E}\left(\sum_{k=1}^n x_k1_{A_k}\right)=\sum_{k=1}^n x_k\mathbb{P}(A_k). \]

Here comes the mathematical expectation of non-negative random variables.

Definition of ME for Non-negative random variables

Assume \(\xi\) is a non-negative random variable; define its Mathematical Expectation to be

\[ \mathbb{E}\xi=\sup\{\mathbb{E}\eta: 0\leq\eta\leq \xi, \eta\text{ is a simple RV}\}. \]

If \(\mathbb{E}\xi<\infty\), we call \(\xi\) integrable (in the Lebesgue sense). If \(A\) is an event, then we use \(\mathbb{E}(\xi; A)\) to denote \(\mathbb{E}(\xi\cdot 1_A)\), the ME of \(\xi\) restricted to the event \(A\).

Readers could compare the following contents with similar results in Lebesgue Integral.

Levi Monotone Convergence Theorem

(i) If random variables \(\xi,\eta\) satisfy \(0\leq \eta\leq \xi\), then \(\mathbb{E}\eta\leq\mathbb{E}\xi\).

(ii) If a sequence of non-negative random variables \(\{\xi_n\}_{n\geq 1}\) monotonically increases and converges to \(\xi\), then \(\lim\limits_{n\rightarrow\infty}\mathbb{E}\xi_n=\mathbb{E}\xi\).

(iii) A non-negative random variable can always be represented as the limit of an increasing sequence of non-negative simple random variables.

\(\square\)

We use \(\mathscr{F}_+\) to denote the collection of all non-negative random variables on \((\Omega, \mathscr{F})\); then \(\mathbb{E}\) is a function defined on \(\mathscr{F}_+\).

Relationship between Probability and Mathematical Expectation

By the definition of ME for non-negative random variables and the Levi Theorem, we have the following results.

(i) Additivity of random variables. Assume \(\{\xi_n\}_{n\geq 1} \subset \mathscr{F}_+\) , then

\[ \mathbb{E}\sum_n\xi_n=\sum_n \mathbb{E}\xi_n. \]

(ii) Characteristic function. \(\forall A\in \mathscr{F}\), we have \(\mathbb{E}1_A=\mathbb{P}(A)\).

Conversely, if there exists a non-negative function \(\mathbb{E}\) defined on \(\mathscr{F}_+\) satisfying additivity and \(\mathbb{E}1=1\), then \(\forall A\in \mathscr{F}\) we can use \(\mathbb{E}\) to define a probability measure

\[ \mathbb{P}(A)=\mathbb{E}1_A. \]

Definitions of ME for general random variables

For a general random variable \(\xi\), we separate it into positive and negative parts (both non-negative random variables)

\[ \xi^+=\xi\cdot 1_{\{\xi>0\}}=\max\{\xi,0\} ,\quad \xi^-=\xi\cdot 1_{\{\xi<0\}}=\max\{-\xi,0\}. \]

Then \(\xi=\xi^+-\xi^-\), \(|\xi|=\xi^+ +\xi^-\). If \(\mathbb{E}|\xi|<\infty\), then we call \(\xi\) integrable. If \(\xi\) is integrable, then we can define its Mathematical Expectation

\[ \mathbb{E}\xi=\mathbb{E}\xi^+ -\mathbb{E}\xi^-. \]

Properties of non-negative random variables

Items (i) (ii) (iii) are almost the same as in the Properties of simple RV.

(iv) Independence. Assume \(\xi,\eta\) are independent. If they are non-negative, or they are integrable and their multiplication \(\xi\eta\) is also integrable, then \(\mathbb{E}\xi\eta=\mathbb{E}\xi\cdot \mathbb{E}\eta\).

Fatou Lemma

Assume \(\{\xi_n\}_{n\geq 1}\) are non-negative; then \(\mathbb{E}\liminf\limits_n\xi_n\leq \liminf\limits_n\mathbb{E}\xi_n\).

Lebesgue's Dominated Convergence Theorem

Assume \(\{\xi_n\}_{n\geq 1}\) satisfies \(\lim\limits_{n\rightarrow\infty}\xi_n=\xi\), and there exists an integrable non-negative random variable \(\eta\) such that \(|\xi_n|\leq \eta\); then \(\mathbb{E}\xi=\lim\limits_{n\rightarrow \infty}\mathbb{E}\xi_n\).

Analogously to convergence almost everywhere, we say a statement holds almost surely, denoted \(a.s.\). For example, \(\xi=0, a.s.\) means \(\mathbb{P}(\xi\neq 0)=0\).

Properties of general Random Variables

Assume \(\xi\) is a random variable on probability space \((\Omega,\mathscr{F},\mathbb{P})\),

(i) If \(\xi=0, a.s.\), then \(\xi\) is integrable, and \(\mathbb{E}\xi=0\);

(ii) If \(\mathbb{P}(A)=0\), then \(\mathbb{E}(\xi; A)=0\);

(iii) If \(\xi\) is integrable, then \(\xi=0, a.s.\) iff \(\forall A\in \mathscr{F}\), \(\mathbb{E}(\xi;A)=0\);

(iv) If \(\xi\) is non-negative, then \(\mathbb{E}\xi=0\) implies \(\xi=0, a.s.\)

Calculations

Theorem for calculating ME of discrete random variables

A discrete random variable \(\xi\) is integrable, iff

\[ \mathbb{E}(|\xi|)=\mathbb{E}\left(\sum_{x\in R(\xi)}|x|1_{\{\xi=x\}}\right)=\sum_{x\in R(\xi)}|x|\mathbb{P}(\xi=x)<\infty \]

and in that case

\[ \mathbb{E}(\xi)=\sum_{x\in R(\xi)}x\mathbb{P}(\xi=x). \]

Use the Levi Theorem. Assume first \(\xi\geq 0\) with \(R(\xi)=\{x_n\}_{n\geq 1}\); then \(\xi_n=\sum_{i=1}^n x_i 1_{\{\xi=x_i\}}\nearrow \xi\).

Corollary: ME for function of discrete random variable

Assume \(\xi\) is a discrete random variable, and \(\phi:\mathbb{R}\rightarrow \mathbb{R}\). Then \(\phi(\xi)\) is integrable, iff

\[ \mathbb{E}|\phi(\xi)|=\sum_{x\in R(\xi)}|\phi(x)|\mathbb{P}(\xi=x)<\infty, \]

and

\[ \mathbb{E}\phi(\xi)=\sum_{x\in R(\xi)}\phi(x)\mathbb{P}(\xi=x). \]

It is easy to show that \(\phi(\xi)\) is also a discrete random variable, and \(\{\phi^{-1}(\{y\}): y\in R(\phi(\xi))\}\) is a partition of \(R(\xi)\). So for \(y\in R(\phi(\xi))\),

\[ \mathbb{P}(\phi(\xi)=y)=\sum_{x\in \phi^{-1}(y)}\mathbb{P}(\xi=x). \]

Since

\[ \begin{align*} \sum_{x\in R(\xi)}|\phi(x)|\mathbb{P}(\xi=x)&=\sum_{y\in R(\phi(\xi))}\sum_{x\in \phi^{-1}(y)}|\phi(x)|\mathbb{P}(\xi=x)\quad \text{ using partition of }R(\xi)\\ &=\sum_{y\in R(\phi(\xi))}|y| \sum_{x\in \phi^{-1}(y)}\mathbb{P}(\xi=x)\\ &=\sum_{y\in R(\phi(\xi))}|y|\mathbb{P}(\phi(\xi)=y)<\infty. \end{align*} \]

The identity for the actual sum, without absolute values, then follows in the same way.

\(\square\)

Theorem for calculation of general random variables

Assume \(\xi\) is a random variable and \(F\) is its distribution function. If \(\phi\) is a non-negative continuous or bounded continuous function, then

\[ \mathbb{E}\phi(\xi)=\int_\mathbb{R}\phi(x)dF(x). \]

Assume \(\phi\geq 0\). We first prove in finite intervals \([a,b]\).

(Actually the argument only needs \(\phi\) measurable.) Assume \(\Delta: a=x_0<\cdots<x_n=b\) is a partition of \([a,b]\); let \(m_i=\inf\limits_{x\in(x_{i-1},x_{i}]}\phi\), \(g^\Delta:=\sum\limits_{i=1}^n m_i 1_{(x_{i-1},x_i]}\), \(\lambda=\max\limits_{1\leq i\leq n}{|x_{i}-x_{i-1}|}\). So

\[ g^\Delta\rightarrow \phi\cdot 1_{(a,b]} \quad(\lambda\rightarrow 0) \]

so

\[ g^\Delta (\xi) \rightarrow \phi(\xi)\cdot 1_{(a,b]}(\xi)\quad (\lambda\rightarrow 0) \]

Since \(g^\Delta (\xi)\) is a simple random variable, its limit is also a random variable, and

\[ \begin{align*} \mathbb{E}\phi(\xi)\cdot 1_{(a,b]}(\xi)&=\lim_{\lambda\rightarrow 0}\mathbb{E}g^\Delta (\xi)\\ &=\lim_{\lambda\rightarrow 0}\sum_{i=1}^n m_i \mathbb{E} 1_{(x_{i-1},x_i]}(\xi)\\ &=\lim_{\lambda\rightarrow 0} \sum_{i=1}^n m_i \mathbb{P}(x_{i-1}<\xi\leq x_i)\\ &=\lim_{\lambda\rightarrow 0} \sum_{i=1}^n m_i (F(x_i)-F(x_{i-1}))=\int_{(a,b]} \phi(x)dF(x). \end{align*} \]

Then for \(\mathbb{R}\), we use \(\phi(\xi)\cdot 1_{(-n,n]}(\xi)\nearrow \phi(\xi)\), and let \(n\rightarrow \infty\).

\(\square\)

Variance

The following inequalities are essentially the same as those discussed in real analysis.

Chebyshev Inequality (Markov Inequality)

Assume \(\xi\) is a random variable, \(\alpha>0\), then \(\forall m>0\), we have

\[ \mathbb{P}(|\xi|>m)\leq \frac{1}{m^\alpha}\mathbb{E}|\xi|^\alpha. \]

By definition

\[ \mathbb{E}|\xi|^\alpha\geq \mathbb{E}(|\xi|^\alpha;|\xi|>m)\geq m^\alpha \mathbb{P}(|\xi|>m). \]
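The bound is easy to see numerically. A Monte Carlo sketch (fixed seed, \(\xi\) uniform on \([0,1]\), \(\alpha=2\) — all illustrative choices of ours) checking \(\mathbb{P}(|\xi|>m)\leq \mathbb{E}|\xi|^2/m^2\) at several thresholds:

```python
import random

random.seed(1)

# Empirical check of Chebyshev/Markov: P(|xi| > m) <= E|xi|^alpha / m^alpha,
# with xi ~ Uniform[0,1] and alpha = 2.
N = 50000
xs = [random.random() for _ in range(N)]

def tail(m):
    """Empirical P(|xi| > m)."""
    return sum(1 for x in xs if abs(x) > m) / N

E_sq = sum(x * x for x in xs) / N                    # Monte Carlo E|xi|^2
bounds_hold = all(tail(m) <= E_sq / m ** 2 + 1e-9 for m in (0.5, 0.7, 0.9))
```

For the uniform distribution \(\mathbb{E}\xi^2=\frac13\), so e.g. at \(m=0.9\) the true tail \(0.1\) sits well under the bound \(\approx 0.41\); the inequality is loose but universal.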

Cauchy-Schwarz Inequality

Assume \(\xi,\eta\) are random variables, then

\[ |\mathbb{E}\xi\eta|^2\leq \mathbb{E}\xi^2\cdot \mathbb{E}\eta^2. \]

Use the most popular method: the quadratic function \(t\mapsto\mathbb{E}(t\xi+\eta)^2\geq 0\) and its discriminant.

If \(\mathbb{E}\xi^2<\infty\), then \(\xi\) is square integrable, which implies that \(\xi\) is integrable. In this case we can define its Variance as

\[ D\xi:=\mathbb{E}(\xi-\mathbb{E}\xi)^2. \]

Properties of Variance

(i) \(D\xi\geq 0\), and \(D\xi=0\) iff \(\xi=C, a.s.\) for some constant \(C\).

(ii) Calculation. \(D\xi=\mathbb{E}\xi^2-(\mathbb{E}\xi)^2\).

(iii) \(D(a\xi)=a^2D\xi\).

(iv) If \(\xi,\eta\) are independent, then \(D(\xi+\eta)=D\xi+D\eta\).

Bernoulli's Law of large numbers

We could now discuss the relationship between probability and frequency.

Bernoulli's Law of large numbers

Assume we carry out Bernoulli experiments with success probability \(p\), and use \(\xi_n\) to denote the number of successes within the first \(n\) trials. Then \(\frac{\xi_n}{n}\), the frequency of success in the first \(n\) trials, is also a random variable, and \(\forall \varepsilon>0\),

\[ \lim_{n\rightarrow\infty}\mathbb{P}\left(\left|\frac{\xi_n}{n}-p\right|>\varepsilon\right)=0. \]

which means that \(\eta_n=\frac{\xi_n}{n}\) converges to \(p\) in probability.

Use the Chebyshev Inequality:

\[ \begin{align*} \mathbb{P}\left(\left|\frac{\xi_n}{n}-p\right|>\varepsilon\right)\leq \frac{1}{\varepsilon^\alpha}\mathbb{E}\left|\frac{\xi_n}{n}-p\right|^\alpha \end{align*} \]

Choose \(\alpha=2\), and by \(\mathbb{E}\xi_n=np\), \(\mathbb{E}\xi_n^2=np(1-p)+(np)^2\)

\[ \begin{align*} \mathbb{P}\left(\left|\frac{\xi_n}{n}-p\right|>\varepsilon\right)&\leq \frac{1}{\varepsilon^2}\mathbb{E}\left|\frac{\xi_n}{n}-p\right|^2\\ &=\frac{1}{\varepsilon^2}\frac{\mathbb{E}(\xi_n^2)-2np\mathbb{E}\xi_n+(np)^2}{n^2}\\ &=\frac{1}{\varepsilon^2}\frac{p-p^2}{n}\rightarrow 0(n\rightarrow \infty). \end{align*} \]
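The concentration of \(\frac{\xi_n}{n}\) around \(p\) is visible in simulation. A sketch with illustrative parameters (\(p=0.3\), \(\varepsilon=0.05\), fixed seed) comparing the empirical deviation rate against the Chebyshev bound \(\frac{p(1-p)}{n\varepsilon^2}\):

```python
import random

random.seed(2)

# Bernoulli LLN empirically: the success frequency xi_n / n concentrates at p.
p, eps, trials = 0.3, 0.05, 200

def frequency(n):
    """One run of n Bernoulli(p) trials; returns xi_n / n."""
    return sum(1 for _ in range(n) if random.random() < p) / n

def deviation_rate(n):
    """Fraction of runs with |xi_n/n - p| > eps; Chebyshev bounds the true
    probability by p(1-p) / (n eps^2)."""
    return sum(1 for _ in range(trials) if abs(frequency(n) - p) > eps) / trials

rate_small, rate_large = deviation_rate(50), deviation_rate(5000)
```

At \(n=50\) large deviations are still common, while at \(n=5000\) the Chebyshev bound is below \(0.02\) and deviations essentially vanish.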

The total probability formula has an expectation form.

Conditional Mathematical Expectation

Assume an integrable random variable \(\xi\) is defined on the probability space \((\Omega,\mathscr{F}, \mathbb{P})\). For an arbitrary event \(A\) with positive probability, define the Conditional Mathematical Expectation

\[ \mathbb{E}(\xi|A):=\frac{\mathbb{E}(\xi;A)}{\mathbb{P}(A)}. \]

So with the above definition, we have an extension of total probability formula.

Expectation form of Total probability formula

Assume \(\{\Omega_n\}_{n\geq 1}\subset \mathscr{F}\) is a partition of \(\Omega\) with \(\mathbb{P}(\Omega_n)>0\); then for every integrable random variable \(\xi\), we have

\[ \mathbb{E}(\xi)=\sum_{n=1}^\infty \mathbb{E}(\xi;\Omega_n)=\sum_{n=1}^\infty \mathbb{E}(\xi|\Omega_n)\mathbb{P}(\Omega_n). \]

Continuous Random Variables

Measurability

In a discrete sample space \(\Omega_d\) we can choose the power set as the \(\sigma\)-algebra, but in a continuous sample space \(\Omega_c\) this choice has been proved to lead to contradictions. In fact, apart from the smallest and the biggest \(\sigma\)-algebras, it is usually not easy to write down a \(\sigma\)-algebra directly; they are often generated indirectly.

Properties of \(\sigma\)-algebra

Assume \(\{\mathscr{F}_\lambda\}_{\lambda\in\Lambda}\) are \(\sigma\)-algebra on \(\Omega\), then

\[ \bigcap_{\lambda\in \Lambda}\mathscr{F}_\lambda \]

is also a \(\sigma\)-algebra.

Definition of \(\sigma\)-algebra generated

For a family \(\mathscr{A}\) of subsets of \(\Omega\), define the \(\sigma\)-algebra generated by \(\mathscr{A}\) to be the intersection of all \(\sigma\)-algebras on \(\Omega\) that contain \(\mathscr{A}\), denoted by

\[ \sigma(\mathscr{A}):=\bigcap_{\lambda\in \Lambda\atop \mathscr{A}\subset \mathscr{F}_\lambda}\mathscr{F}_\lambda. \]

Example. For the discrete example above, we can write out the generated \(\sigma\)-algebra

\[ \sigma(\{\Omega_n\}_{n\geq 1})=\left\{\bigcup_{i\in I}\Omega_i: I\subset \mathbb{N}\right\}. \]

To go through the following theorem, we have to introduce some properties of the inverse image. Assume \(f:X\rightarrow Y\) is a mapping with domain \(X\) and codomain \(Y\). If \(B\subset Y\), define the inverse image

\[ f^{-1}(B)=\{x\in X:f(x)\in B\}, \]

which is also denoted by \(\{f\in B\}\).

Properties of Inverse Image

Assume \(f:X\rightarrow Y\) is a mapping, then

(i) \(f^{-1}(\varnothing)=\varnothing\), \(f^{-1}(Y)=X\).

(ii) If \(B\subset Y\), then \(f^{-1}(B^c)=f^{-1}(B)^c\).

(iii) If \(B_\lambda\subset Y\) for all \(\lambda\in\Lambda\), then

\[ \begin{align*} f^{-1}\left(\bigcap_{\lambda\in\Lambda}B_\lambda\right)&=\bigcap_{\lambda\in\Lambda}f^{-1}\left(B_\lambda\right)\\ f^{-1}\left(\bigcup_{\lambda\in\Lambda}B_\lambda\right)&=\bigcup_{\lambda\in\Lambda}f^{-1}\left(B_\lambda\right). \end{align*} \]
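These identities can be confirmed on a small explicit mapping. A sketch with a mapping \(f(x)=x \bmod 3\) of our choosing:

```python
# Checking f^{-1}(B^c) = (f^{-1}(B))^c and the union/intersection identities
# on a small explicit mapping f: X -> Y.
X = {0, 1, 2, 3, 4}
Y = {0, 1, 2}
f = {x: x % 3 for x in X}          # illustrative mapping

def preimage(B):
    """f^{-1}(B) = {x in X : f(x) in B}."""
    return {x for x in X if f[x] in B}

B1, B2 = {0}, {1, 2}
```

Inverse images commute with complements, unions and intersections, which is exactly why they transport \(\sigma\)-algebras so cleanly (direct images do not share this property).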

With the following properties, we have the following theorem.

Theorem for high-dimension \(\sigma\)-Algebra

Assume \(\xi:\Omega\rightarrow \mathbb{R}^n\) and let \(\mathscr{B}_0\) be a family of subsets of \(\mathbb{R}^n\). Define

\[ \xi^{-1}(\mathscr{B}_0)=\{\xi^{-1}(B): B\in \mathscr{B}_0\} \]

to be the inverse image of the family \(\mathscr{B}_0\). If \(\mathscr{B}_0\) is a \(\sigma\)-algebra, then \(\xi^{-1}(\mathscr{B}_0)\) is also a \(\sigma\)-algebra.

Use the properties of the inverse image to transfer the problem to the image side.

Sufficient and necessary condition for random variable

Assume \((\Omega,\mathscr{F}, \mathbb{P})\) is a probability space, then \(\xi\) defined on \(\Omega\) is a random variable, iff

\[ \xi^{-1}(\mathscr{B})\subset \mathscr{F}. \]
  • "\(\Leftarrow\)".

Just check the condition for \(\xi\) to be a random variable.

  • "\(\Rightarrow\)".

Denote \(\mathscr{B}'\) as the whole set \(B\subset \mathbb{R}\), which satisfies \(\{\xi\in B\}\in \mathscr{F}\), i.e.

\[ \mathscr{B}'=\{B\subset \mathbb{R}: \{\xi\in B\}\in \mathscr{F}\}. \]

Assume \(\xi\) is a random variable. then by its definition, we have

\[ \{(-\infty, x]: x\in \mathbb{R}\}\subset \mathscr{B}' \]

If \(\mathscr{B}'\) is a \(\sigma\)-algebra, then by another characterization of the Borel algebra (as generated by these half-lines), \(\mathscr{B}'\) is a \(\sigma\)-algebra containing \(\mathscr{B}\), and therefore \(\xi^{-1}(\mathscr{B})\subset \mathscr{F}\).

Check that

(i) total space \(\mathbb{R}\in\mathscr{B}'\), since \(\xi^{-1}(\mathbb{R})=\Omega\in \mathscr{F}\), so here \(B=\mathbb{R}\in \mathscr{B}'\).

(ii) Assume \(A\in \mathscr{B}'\), then by definition of \(\mathscr{B}'\), we have \(\xi^{-1}(A)\in \mathscr{F}\), then by operations of inverse mapping, we have

\[ \xi^{-1}(A^c)=(\xi^{-1}(A))^c\in \mathscr{F}. \]

So here \(B=A^c\in \mathscr{B}'\).

(iii) Assume \(\{A_n\}_{n\geq 1}\in \mathscr{B}'\), then \(\xi^{-1}(A_n)\in \mathscr{F}\), then by operations of inverse mapping, we have

\[ \xi^{-1}\left(\bigcup_n A_n\right)=\bigcup_n \xi^{-1}(A_n)\in\mathscr{F}. \]

So here \(B=\bigcup_n A_n\in \mathscr{B}'\).

In a nutshell, combining (i), (ii) and (iii), \(\mathscr{B}'\) is a \(\sigma\)-algebra.

From the above deduction, we see that \(\xi\) is a random variable iff for some family \(\mathscr{A}\) generating \(\mathscr{B}\) (i.e. \(\sigma(\mathscr{A})=\mathscr{B}\)),

\[ \xi^{-1}(\mathscr{A})\subset \mathscr{F}. \]

So if \(\xi\) is a random variable with respect to a \(\sigma\)-algebra \(\mathscr{A}\), then it is a random variable with respect to any \(\sigma\)-algebra \(\mathscr{A}'\supset \mathscr{A}\). By this logic, there exists a smallest \(\sigma\)-algebra, denoted \(\sigma(\xi)\), with respect to which \(\xi\) is a random variable. It is easy to see that

\[ \sigma(\xi)=\xi^{-1}(\mathscr{B}). \]

Definition of Borel measurable function

For the special case \(\Omega=\mathbb{R}\), we call \(\xi=f\) a Borel measurable function, if \(\forall x\in \mathbb{R}\),

\[ \{f\leq x\}\in \mathscr{B}. \]

It is natural to have the following properties.

Properties of Borel measurable function

(i) Every \(f\in C(\mathbb{R})\) is a Borel measurable function.

(ii) Assume \(\xi\) is a random variable, \(f\) is a Borel measurable function, then \(f(\xi)\) is a random variable.

Achievement of Distribution function

With the help of Borel algebra, we have the following extra properties of distribution function.

Properties of Distribution Function

(i) \(\forall a<b\), \(\mathbb{P}(\xi\in (a,b])=F(b)-F(a)\).

(ii) \(\forall x\in \mathbb{R}\), \(\mathbb{P}(\xi=x)=F(x)-F(x^-)\).

(iii) \(\forall x\in\mathbb{R}\), \(\mathbb{P}(\xi>x)=1-F(x)\).

(iv) Two distribution functions satisfy \(F(x)=G(x)\) for all \(x\in \mathbb{R}\) iff they agree on a dense subset of \(\mathbb{R}\).

Theorem for achievement of DF

An arbitrary distribution function \(F\) on \(\mathbb{R}\) can be realized as the distribution function of some random variable.

Use the generalized inverse of \(F\) applied to a uniform random variable. This proof does not extend directly to multi-dimensional distribution functions.

Density Function

Continuous DF & Density Function

A distribution function \(F\) is said to be continuous (more precisely, absolutely continuous), if there exists a non-negative Lebesgue integrable function \(f\) such that \(\forall x\in \mathbb{R}\),

\[ F(x)=\int_{-\infty}^x f(t)dt. \]

and \(f\) is called the Density Function. In this case, we also call its corresponding random variable \(\xi\) to be continuous.

Readers can check that \(F\) is absolutely continuous, by properties of the indefinite integral. The density function is not unique, since two densities may differ on a set of measure zero.

Calculation for ME of Continuous distribution

Assume \(F\) is the continuous distribution function of a random variable \(\xi\) with density function \(f\); then for a non-negative continuous or bounded continuous function \(\phi\),

\[ \mathbb{E}\phi(\xi)=\int_\mathbb{R}\phi(x)f(x)dx. \]
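The formula can be sanity-checked numerically. A sketch with illustrative choices of ours: the exponential density \(f(x)=e^{-x}\) on \([0,\infty)\) and \(\phi(x)=x^2\), comparing a quadrature of \(\int\phi f\) against a Monte Carlo estimate of \(\mathbb{E}\phi(\xi)\) (sampling \(\xi\) via the generalized inverse, \(\xi=-\log(1-U)\)):

```python
import math
import random

random.seed(3)

# Target: E xi^2 for xi ~ Exp(1); the exact value is 2.
phi = lambda x: x * x

# Midpoint-rule quadrature of integral of phi(x) * exp(-x) over [0, 40]
# (the truncated tail beyond 40 is negligible).
n, b = 400000, 40.0
h = b / n
integral = h * sum(phi((i + 0.5) * h) * math.exp(-(i + 0.5) * h)
                   for i in range(n))

# Monte Carlo estimate via inverse-transform sampling.
N = 100000
mc = sum(phi(-math.log(1.0 - random.random())) for _ in range(N)) / N
```

Both estimates land near the exact value \(2\), the quadrature to several digits and the Monte Carlo estimate within sampling error.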

Random Vector

Definition of joint distribution function

The joint distribution function of a random vector \(X=(\xi_1,\cdots,\xi_n)\) is defined by

\[ F_X(x_1,\cdots,x_n)=\mathbb{P}(\xi_1\leq x_1,\cdots,\xi_n\leq x_n). \]

If \(\phi\) is a continuous function on \(\mathbb{R}^n\) and \(\phi(X)\) is integrable, then

\[ \mathbb{E}\phi(X)=\int_{\mathbb{R}^n}\phi(x_1,\cdots,x_n)dF(x_1,\cdots,x_n). \]

Properties of joint distribution function

(i) It is easy to see that the marginal distribution function of a single random variable is

\[ F_{\xi_i}(x_i)=F_X(\infty,\cdots,x_i,\cdots,\infty). \]

(ii) If \(\xi_1,\cdots,\xi_n\) are mutually independent, then

\[ F_X(x_1,\cdots,x_n)=F_{\xi_1}(x_1)\cdots F_{\xi_n}(x_n). \]
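The factorization in (ii) can be verified exactly on a small discrete example, assuming two independent fair dice (an illustrative choice, computed with exact rationals):

```python
from fractions import Fraction

# Two independent fair dice (an illustrative finite example).
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
p = Fraction(1, 36)  # uniform probability of each outcome

# Joint distribution function and the marginal DF of a single die.
def F_joint(x1, x2):
    return sum(p for (i, j) in omega if i <= x1 and j <= x2)

def F_marg(x):
    return Fraction(min(max(int(x), 0), 6), 6)

# Independence makes the joint DF factor into the product of the marginals.
assert all(F_joint(a, b) == F_marg(a) * F_marg(b)
           for a in range(7) for b in range(7))
print("factorization verified")
```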

Covariance

Definition of Covariance

Assume \(\xi, \eta\) are two random variables; their Covariance is defined by

\[ \text{cov}(\xi,\eta):=\mathbb{E}[(\xi-\mathbb{E}\xi)(\eta-\mathbb{E}\eta)]=\mathbb{E}\xi\eta-\mathbb{E}\xi\cdot\mathbb{E}\eta. \]

Properties of Covariance

Assume \(\xi, \eta\) are two random variables, then

(i) \(\text{cov}(\xi,\xi)=D\xi\geq 0\);

(ii) \(\text{cov}(\xi,\eta)=\text{cov}(\eta,\xi)\);

(iii) Linearity. \(\forall c_1,c_2\in \mathbb{R}\),

\[ \text{cov}(c_1\xi_1+c_2\xi_2,\eta)=c_1\text{cov}(\xi_1,\eta)+c_2\text{cov}(\xi_2,\eta). \]
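Properties (i)-(iii) hold exactly for finite equally weighted samples, which gives a quick sanity check. A sketch with illustrative values, using exact rational arithmetic so bilinearity is an identity rather than an approximation:

```python
from fractions import Fraction

# Equally weighted finite samples standing in for random variables
# (values are illustrative only); cov is E[xy] - E[x] E[y].
def mean(v):
    return sum(v, Fraction(0)) / len(v)

def cov(x, y):
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

xi1 = [Fraction(v) for v in (1, 2, 3, 4)]
xi2 = [Fraction(v) for v in (2, 0, 1, 5)]
eta = [Fraction(v) for v in (0, 1, 1, 2)]

# Property (iii): covariance is linear in its first argument.
c1, c2 = Fraction(3), Fraction(-2)
lhs = cov([c1 * a + c2 * b for a, b in zip(xi1, xi2)], eta)
rhs = c1 * cov(xi1, eta) + c2 * cov(xi2, eta)
assert lhs == rhs
```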

Corresponding to covariance, we have its normalized counterpart, the correlation coefficient.

Definition of Correlation Coefficient

Assume \(\xi,\eta\) are two random variables, their Correlation Coefficient is defined by

\[ \rho(\xi,\eta):=\frac{\text{cov}(\xi,\eta)}{\sqrt{D\xi\cdot D\eta}}. \]

When the denominator equals \(0\), the coefficient is conventionally taken to be \(1\).

Properties of Correlation Coefficient

(i) \(|\rho(\xi,\eta)|\leq 1\).

(ii) \(|\rho(\xi,\eta)|=1\), iff \(\xi, \eta\) are linearly dependent, i.e. \(\exists a,b,c\in \mathbb{R}\), \(a,b\neq 0\), such that

\[ \mathbb{P}(a\xi+b\eta=c)=1. \]

Definition of Covariance Matrix

Assume \(X=(\xi_1,\cdots,\xi_n)\) and \(Y=(\eta_1,\cdots,\eta_m)\) are random vectors, then define their Covariance Matrix to be

\[ \pmb{\text{cov}}(X,Y):=(\text{cov}(\xi_i,\eta_j))_{n\times m}. \]

For \(Y=X\), we obtain a square matrix

\[ \pmb{\text{cov}}(X,X)=\mathbb{E}[X^TX]-(\mathbb{E}X)^T(\mathbb{E}X). \]

Properties of Covariance Matrix of a random vector \(X\)

Assume cov\((X,X)\) is a covariance matrix of \(X=(\xi_1,\cdots,\xi_n)\), then cov\((X,X)\) is a symmetric non-negative definite matrix.

\(\forall (x_1,\cdots,x_n)\in\mathbb{R}^n\),

\[ \sum_{1\leq i,j\leq n}x_i\text{cov}(\xi_i,\xi_j)x_j=\text{cov}\left(\sum_{i=1}^nx_i\xi_i,\sum_{j=1}^n x_j\xi_j\right)=D\left(\sum_{i=1}^n x_i\xi_i\right)\geq 0. \]
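The identity "quadratic form equals the variance of the linear combination" can be checked exactly on an equally weighted sample. A sketch with illustrative values for \(X=(\xi_1,\xi_2,\xi_3)\):

```python
from fractions import Fraction

# Equally weighted sample of a random vector X = (xi_1, xi_2, xi_3)
# (illustrative values only).
X = [(1, 2, 0), (0, 1, 1), (2, 2, 3), (1, 0, 2)]
n_vars = 3

def mean(v):
    return sum(v, Fraction(0)) / len(v)

def col(i):
    return [Fraction(row[i]) for row in X]

def cov(x, y):
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

C = [[cov(col(i), col(j)) for j in range(n_vars)] for i in range(n_vars)]

# The quadratic form sum x_i cov(xi_i, xi_j) x_j equals D(sum x_i xi_i) >= 0.
x = [Fraction(1), Fraction(-2), Fraction(1)]
quad = sum(x[i] * C[i][j] * x[j] for i in range(n_vars) for j in range(n_vars))
lin = [sum(x[i] * Fraction(row[i]) for i in range(n_vars)) for row in X]
assert quad == cov(lin, lin) and quad >= 0
```

Symmetry of \(C\) follows from property (ii) of covariance, and the assertion confirms non-negative definiteness along the chosen direction \(x\).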

Function of random variables

We focus on random variables with density functions.

Fubini Theorem

For a Borel measurable function \(h\) defined on \(\mathbb{R}\times \mathbb{R}\), we have

\[ \int_{\mathbb{R}^2}h(x,y) dF_\xi(x) dF_\eta(y)=\int_{\mathbb{R}}dF_\xi(x) \int_{\mathbb{R}} h(x,y)dF_\eta(y). \]

Sum of random variables

Assume \(\xi\), \(\eta\) are two independent random variables, with distribution functions \(F\) and \(G\). Then by the Fubini Theorem,

\[ \mathbb{P}(\xi+\eta\leq x)=\int_\mathbb{R}dG(v)\int_{u+v\leq x} dF(u)=\int_\mathbb{R} F(x-v)dG(v). \]

If \(\xi\) and \(\eta\) are continuous random variables with density functions \(f\) and \(g\), then

\[ \begin{align*} \mathbb{P}(\xi+\eta\leq x)&=\int_{-\infty}^\infty g(v)dv\int_{-\infty}^{x-v} f(u) du\\ &=\int_{-\infty}^\infty g(v)dv\int_{-\infty}^{x} f(z-v) dz\\ &=\int_{-\infty}^{x} dz \int_{-\infty}^\infty g(v)f(z-v) dv \end{align*} \]

so the density function of \(\xi+\eta\) is

\[ f * g (x)= \int_{-\infty}^\infty g(v)f(x-v) dv. \]

Actually, if only one of \(\xi\) and \(\eta\) is continuous, say \(G\) has a density function \(g\), then the density function of \(\xi+\eta\) is

\[ \int_{-\infty}^\infty g(x-v)dF(v). \]
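The convolution formula can be illustrated numerically. A sketch assuming \(\xi,\eta\sim \text{Uniform}(0,1)\) independent (an illustrative choice), whose sum has the triangular density \(\min(x,\,2-x)\) on \([0,2]\):

```python
# Densities of two independent Uniform(0,1) variables (an illustrative choice).
f = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
g = f

# Riemann-sum approximation of the convolution (f * g)(x) = int g(v) f(x - v) dv.
def conv(x, lo=-1.0, hi=2.0, n=30_000):
    h = (hi - lo) / n
    return sum(g(lo + k * h) * f(x - (lo + k * h)) for k in range(n)) * h

print(conv(0.5), conv(1.0))  # triangular density: min(x, 2 - x) on [0, 2]
```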

Convergence of Sequences of random variables

Definition of convergence

Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables, \(\xi\) is a random variable. We call

(i) \(\{\xi_n\}\) converges to \(\xi\) in probability, if \(\forall \varepsilon>0\),

\[ \lim_{n\rightarrow \infty}\mathbb{P}(\{\omega\in\Omega: |\xi_n(\omega)-\xi(\omega)|>\varepsilon\})=0. \]

which is denoted by \(\xi_n\overset{p}{\rightarrow}\xi\).

(ii) \(\{\xi_n\}\) almost surely converges to \(\xi\), if

\[ \mathbb{P}(\{\omega\in\Omega: \lim_{n\rightarrow\infty}\xi_n(\omega)=\xi(\omega)\})=1 \]

or equivalently, \(\exists A\in \mathscr{F}\) with \(\mathbb{P}(A)=0\), s.t.

\[ \lim_{n\rightarrow\infty} \xi_n(\omega)=\xi(\omega),\quad \forall \omega\in \Omega-A. \]

which is denoted by \(\xi_n\overset{a.s.}{\rightarrow}\xi\).

Law of large numbers

Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables. Let the partial sum \(S_n=\sum\limits_{i=1}^n\xi_i\), \(m_n=\mathbb{E}(S_n)\), \(s_n^2=DS_n\). Then \(\{\xi_n\}\) satisfies

(i) the Law of large numbers, if \(\frac{S_n-m_n}{n}\overset{P}{\rightarrow} 0\ (n\rightarrow \infty)\), i.e. \(\forall \varepsilon>0\),

\[ \lim_{n\rightarrow \infty}\mathbb{P}\left(\left\{\left|\frac{S_n-m_n}{n}\right|>\varepsilon\right\}\right)=0. \]

(ii) the Strong law of large numbers, if \(\frac{S_n-m_n}{n}\overset{a.s.}{\rightarrow} 0\ (n\rightarrow \infty)\), i.e.

\[ \mathbb{P}\left(\left\{\lim_{n\rightarrow \infty}\frac{S_n-m_n}{n}=0\right\}\right)=1. \]
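A fixed-seed Monte Carlo sketch of the weak law, assuming i.i.d. Bernoulli\((1/2)\) variables (an illustrative choice), where \(m_n=n/2\) so \(\frac{S_n-m_n}{n}=\frac{S_n}{n}-\frac{1}{2}\):

```python
import random

# Fixed-seed Monte Carlo sketch of the weak law of large numbers for
# i.i.d. Bernoulli(1/2) variables, where m_n = n/2.
random.seed(0)

def sample_mean(n):
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The deviation |S_n/n - 1/2| tends to shrink as n grows (single illustrative run).
for n in (100, 1_000, 10_000):
    print(n, abs(sample_mean(n) - 0.5))
```

A single run only illustrates the tendency; the theorem is a statement about the limit in probability, not about any one sample path.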

Borel-Cantelli Theorem

Assume \(\{A_n\}\) is a sequence of events,

(i) If \(\sum\limits_{n=1}^\infty\mathbb{P}(A_n)<\infty\), then

\[ \mathbb{P}(\limsup_n A_n)=0. \]

(ii) If \(\{A_n\}\) are independent events, and \(\sum\limits_{n=1}^\infty\mathbb{P}(A_n)=\infty\), then

\[ \mathbb{P}(\limsup_n A_n)=1. \]

The following is the probability form of Riesz theorem.

Riesz Theorem for Probability

Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables, \(\xi\) is a random variable, then

(i) If \(\xi_n\overset{P}{\rightarrow}\xi\), then there exists a subsequence \(\{\xi_{n_k}\}\) such that \(\xi_{n_k}\overset{a.s.}{\rightarrow} \xi\).

(ii) If \(\xi_{n}\overset{a.s.}{\rightarrow} \xi\), then \(\xi_n\overset{P}{\rightarrow}\xi\).

The following theorem is a sufficient condition for almost sure convergence.

Sufficient condition for almost sure convergence

Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables, \(\xi\) is a random variable. If \(\forall \varepsilon>0\),

\[ \sum_{n=1}^\infty\mathbb{P}(|\xi_n-\xi|>\varepsilon)<\infty, \]

then \(\xi_{n}\overset{a.s.}{\rightarrow} \xi\).

Uniform Convergence

This is the expectation form of the Hölder inequality.

Hölder Inequality

Assume \(1<p<\infty\), \(1<q<\infty\), \(\frac{1}{p}+\frac{1}{q}=1\), then

\[ \mathbb{E}|\xi\eta|\leq (\mathbb{E}|\xi|^p)^{\frac{1}{p}}(\mathbb{E}|\eta|^q)^{\frac{1}{q}}. \]

Absolute moment of order \(a\)

Assume \(\xi\) is a random variable and \(a>0\); we call \(\mathbb{E}|\xi|^a\) the absolute moment of order \(a\).

Properties of absolute moment

Assume \(0<a<b\), then

\[ (\mathbb{E}|\xi|^a)^{\frac{1}{a}}\leq (\mathbb{E}|\xi|^b)^{\frac{1}{b}} \]

Using the Hölder inequality with \(p=\frac{b}{a}\), \(q=\frac{b}{b-a}\),

\[ \mathbb{E}(|\xi|^a\cdot 1)\leq \left(\mathbb{E}(|\xi|^a)^{\frac{b}{a}}\right)^{\frac{a}{b}}\cdot \left(\mathbb{E}\,1^{\frac{b}{b-a}}\right)^{\frac{b-a}{b}}=(\mathbb{E}|\xi|^b)^{\frac{a}{b}}, \]

and raising both sides to the power \(\frac{1}{a}\) gives the claim.
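The moment inequality can be checked exactly on a finite distribution. A sketch with an illustrative three-point distribution, comparing both sides raised to the power \(ab\) so the computation stays in exact rational arithmetic:

```python
from fractions import Fraction

# A finite distribution: P(xi = v) = w for each (v, w) pair (illustrative values).
dist = [(1, Fraction(1, 2)), (2, Fraction(1, 4)), (4, Fraction(1, 4))]

def abs_moment(a):
    # E|xi|^a for integer a, computed exactly.
    return sum(w * Fraction(abs(v)) ** a for v, w in dist)

# (E|xi|^a)^(1/a) <= (E|xi|^b)^(1/b) for a < b is equivalent, after raising
# both sides to the power a*b, to (E|xi|^a)^b <= (E|xi|^b)^a.
a, b = 2, 3
assert abs_moment(a) ** b <= abs_moment(b) ** a
```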

\(L^r\) space for probability

Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables, \(\xi\) is a random variable. \(\{\xi_n\}\) is said to converge to \(\xi\) in the sense of the \(r\)-th order absolute moment, or in \(L^r\) (\(r\geq 1\)), if \(\mathbb{E}|\xi_n|^r\) and \(\mathbb{E}|\xi|^r\) are finite, and

\[ \lim_{n\rightarrow \infty}\mathbb{E}|\xi_n-\xi|^r=0, \]

which is denoted by \(\xi_n\overset{L^r}{\rightarrow}\xi\).

Definition of uniform integrability

A family of integrable random variables \(\{\xi_\lambda\}_{\lambda\in\Lambda}\) is said to be uniformly integrable, if

\[ \lim_{N\rightarrow \infty}\sup_{\lambda\in\Lambda}\mathbb{E}(|\xi_\lambda|; |\xi_\lambda|\geq N)=0. \]

Convergence in distribution

Analysis Tool

Probability Generating Function

Assume the distribution sequence of a discrete random variable \(\xi\) is \(\mathbb{P}(\xi=k)=p_k\), \(k=0,1,\cdots\). The function of the real variable \(s\)

\[ \psi_\xi(s)=\mathbb{E}(s^\xi)=\sum_{k=0}^\infty p_k s^k,\quad |s|\leq 1, \]

is called the Probability Generating Function.

Similar to the \(z\)-transform in signal analysis.

Properties for generating function

(i) \(|\psi_\xi(s)|\leq \psi_\xi(1)=1\).

(ii) Assume \(\{\xi_i\}_{1\leq i\leq n}\) are mutually independent, and have generating function \(\psi_{\xi_i}(s)\), then \(\eta=\sum_{i=1}^n \xi_i\) has generating function

\[ \psi_\eta(s)=\prod_{i=1}^n \psi_{\xi_i}(s). \]

(iii) \(p_k=\mathbb{P}(\xi=k)=\frac{\psi^{(k)}(0)}{k!}\), \(k=0,1,\cdots\)
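Properties (ii) and (iii) can be checked together for a bounded discrete variable, whose PGF is a polynomial. A sketch assuming Bernoulli\((p)\) summands, so the sum is Binomial\((n,p)\) (parameter values are illustrative only):

```python
import math

# Represent the PGF of a bounded discrete variable by its coefficient
# list [p_0, p_1, ...]; multiplying PGFs is then polynomial convolution.
def poly_mul(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Bernoulli(p) has PGF (1 - p) + p s; by property (ii) the sum of n
# independent copies, i.e. Binomial(n, p), has PGF ((1 - p) + p s)^n.
p, n = 0.3, 5
pgf = [1.0]
for _ in range(n):
    pgf = poly_mul(pgf, [1 - p, p])

# Property (iii): the k-th coefficient equals P(xi = k) = C(n,k) p^k (1-p)^(n-k).
for k, coeff in enumerate(pgf):
    assert abs(coeff - math.comb(n, k) * p**k * (1 - p) ** (n - k)) < 1e-12
```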

Example. Calculate the generating functions of random variables with the following distributions.

(i) Binomial distribution \(\xi\sim B(n,p)\).

(ii) Poisson distribution \(\xi\sim \pi(\lambda)\).