
Elementary Probability Theory

Probability Space

Definitions of \(\sigma\)-algebra

A family \(\mathscr{F}\) of subsets of a non-empty set \(\Omega\) is called a \(\pmb{\sigma}\)-algebra on \(\Omega\) if it satisfies

(i) \(\varnothing, \Omega\in \mathscr{F}\),

(ii) If \(A\in \mathscr{F}\), then \(A^c\in \mathscr{F}\),

(iii) If a sequence of sets \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then \(\bigcup\limits_{n=1}^\infty A_n\in \mathscr{F}\).

It is easy to see that for a set \(\Omega\), the smallest \(\sigma\)-algebra is \(\mathscr{F}=\{\varnothing, \Omega\}\), and the largest is the collection of all subsets of \(\Omega\) (the power set), denoted by \(2^\Omega\).

Properties of \(\sigma\)-algebra

Assume \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\), then

(i) If \(A,B\in \mathscr{F}\), then \(A\cap B, A\cup B, A-B \in \mathscr{F}\),

(ii) If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then \(\bigcap\limits_{n=1}^\infty A_n\in \mathscr{F}\).

(i) \(A\cup B=A\cup B\cup \varnothing\cup \cdots \cup \varnothing\in \mathscr{F}\). \(A\cap B=(A^c\cup B^c)^c\in \mathscr{F}\). \(A-B=A\cap B^c\in \mathscr{F}\).

(ii) By De Morgan's law, \(\bigcap\limits_{n=1}^\infty A_n=\left(\bigcup\limits_{n=1}^\infty A_n^c\right)^c \in \mathscr{F}\).

Example. (Discrete) Assume a sequence of events \(\{\Omega_n\}_{n\geq 1}\) is a partition of \(\Omega\); then

\[ \mathscr{A}:=\left\{\bigcup_{i\in I}\Omega_i: I\subset \{1,2,\cdots\}\right\} \]

is a sub-\(\sigma\)-algebra of \(\mathscr{F}\).
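
For concreteness, here is a minimal Python sketch (using a small hypothetical finite \(\Omega\) and partition) that builds the family \(\mathscr{A}\) of all unions of partition blocks and checks the \(\sigma\)-algebra axioms; on a finite space, closure under finite unions is all that condition (iii) requires.

```python
from itertools import chain, combinations

# A small hypothetical finite sample space and a partition of it.
omega = frozenset(range(6))
partition = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]

def subsets(blocks):
    """All subsets of the list of blocks (the index sets I)."""
    return chain.from_iterable(combinations(blocks, r) for r in range(len(blocks) + 1))

# The family A = { union of Omega_i over i in I } for every index set I.
family = {frozenset().union(*blocks) for blocks in subsets(partition)}

# Sigma-algebra axioms (finite case, so pairwise unions suffice).
assert frozenset() in family and omega in family             # (i)
assert all(omega - A in family for A in family)              # (ii) complements
assert all(A | B in family for A in family for B in family)  # (iii) unions
print(len(family), "sets; sigma-algebra axioms hold")        # 2**3 = 8 sets
```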

Kolmogorov: Definition of probability measure

Assume \(\Omega\) is a sample space and \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\). A function \(\mathbb{P}\) on \(\mathscr{F}\) is called a probability measure if it satisfies

(i) Non-negativity. \(\forall A\in \mathscr{F}\), \(\mathbb{P}(A)\geq 0\).

(ii) Normalization. \(\mathbb{P}(\Omega)=1\).

(iii) \(\sigma\)-additivity. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) with \(A_i\cap A_j=\varnothing\) for all \(i\neq j\), then

\[ \mathbb{P}\left(\bigcup_{n\geq 1}A_n\right)=\sum_{n\geq 1}\mathbb{P}(A_n). \]

In this case, \(\mathbb{P}(A)\) is called the probability that event \(A\) occurs.

We combine \(\Omega\), \(\mathscr{F}\) and \(\mathbb{P}\) into the triple \((\Omega, \mathscr{F}, \mathbb{P})\), which is called a Probability Space.

Because of \(\sigma\)-additivity, we can use all the tools and results of measure theory; apart from normalization, the definition of a probability measure is the same as that of a measure. However, probability theory has some phenomena and methods of its own.

Regarding the \(\sigma\)-algebra, we cannot always choose the power set, because \(\Omega\) may be uncountable.

Properties of Probability Measure

(i) \(\mathbb{P}(\varnothing)=0\).

(ii) If \(A,B\in \mathscr{F}\) and \(A\cap B=\varnothing\), then \(\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)\).

(iii) If \(A,B\in \mathscr{F}\), \(A\subset B\), then \(\mathbb{P}(B-A)=\mathbb{P}(B)-\mathbb{P}(A)\), so \(\mathbb{P}(A)\leq \mathbb{P}(B)\).

(iv) \(\mathbb{P}(A^c)=1-\mathbb{P}(A)\).

(v) Sub-\(\sigma\)-additivity. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then

\[ \mathbb{P}\left(\bigcup_n A_n\right)\leq \sum_n \mathbb{P}(A_n). \]

(vi) Continuity from below. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) is monotonically increasing, then

\[ \mathbb{P}\left(\bigcup_n A_n\right)=\lim_n \mathbb{P}(A_n). \]

(vii) Continuity from above. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) is monotonically decreasing, then

\[ \mathbb{P}\left(\bigcap_n A_n\right)=\lim_n \mathbb{P}(A_n). \]
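
As a sanity check, the following Python sketch defines a probability measure by weights on a small hypothetical finite space and verifies a few of the properties above (the complement rule, monotonicity, and sub-additivity); it is only an illustration of the finite case.

```python
from fractions import Fraction

# Hypothetical finite sample space with weights p(w) summing to 1.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
omega = set(p)

def P(A):
    """Probability of an event A, i.e. a subset of omega."""
    return sum(p[w] for w in A)

A, B = {"a", "b"}, {"b", "c"}

assert P(set()) == 0                   # (i)   P(empty set) = 0
assert P({"a"}) <= P(A)                # (iii) monotonicity: {"a"} is a subset of A
assert P(omega - A) == 1 - P(A)        # (iv)  complement rule
assert P(A | B) <= P(A) + P(B)         # (v)   sub-additivity
print(P(A), P(B), P(A & B), P(A | B))  # 5/6 1/2 1/3 1
```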

Note that a probability cannot be assigned arbitrarily, because it is a function defined on a \(\sigma\)-algebra. When \(\Omega\) is countable, a probability space is easy to construct.

Discrete Probability Space

(i) \((\Omega,\{\varnothing, A, A^c, \Omega\},\mathbb{P})\) is called a Bernoulli probability space, if

\[ \mathbb{P}(\varnothing)=0,\quad\mathbb{P}(A)=p, \quad\mathbb{P}(A^c)=1-p, \quad\mathbb{P}(\Omega)=1. \]

(ii) As in the Example above, choose numbers \(p_i\geq 0\) with \(\sum\limits_{i\geq 1}p_i=1\), set \(\mathbb{P}(\Omega_i)=p_i\), and define \(\mathbb{P}\) on \(\mathscr{A}\) by

\[ \mathbb{P}\left(\bigcup_{i\in I}\Omega_i\right)=\sum_{i\in I}\mathbb{P}(\Omega_i),\qquad I\subset\{1,2,\cdots\}; \]

then \((\Omega, \mathscr{A}, \mathbb{P})\) is a probability space, also called a Discrete Probability Space.

Example. Assume \(\Omega\) is countable and choose its power set as \(\mathscr{F}\). Let \(p:\Omega\rightarrow [0,1]\) be a function on the sample points satisfying \(\sum\limits_{\omega\in \Omega}p(\omega)=1\), and for every \(A\subset \Omega\) define

\[ \mathbb{P}(A):=\sum_{\omega\in A}p(\omega). \]

Show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. If \(|\Omega|<\infty\) and all sample points are equally likely, then

\[ \mathbb{P}(\{\omega\})=\frac{1}{|\Omega|}, \]

which is the classical model.
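
A minimal sketch of the classical model, using the hypothetical example of rolling two fair dice: every outcome carries weight \(1/|\Omega|\), so the probability of an event is simply \(|A|/|\Omega|\).

```python
from fractions import Fraction
from itertools import product

# Classical model: Omega = all outcomes of rolling two dice, each equally likely.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))            # p(w) = 1/|Omega| for every sample point w

def P(A):
    """P(A) = sum of p(w) over w in A = |A| / |Omega|."""
    return sum(p for _ in A)

A = [w for w in omega if sum(w) == 7]  # the event "the two dice sum to 7"
print(P(A))                            # 1/6
```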

If \(\Omega\) is uncountable, then it is not so easy to construct its \(\sigma\)-algebra.

Example. Assume \(\Omega=[0,1]\) and \(\mathscr{F}\) is the Borel \(\sigma\)-algebra \(\mathscr{B}([0,1])\). For \(A\in \mathscr{F}\), let \(\mathbb{P}(A)=|A|\), which is exactly the Lebesgue measure of \(A\). Then one can show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, also called a Geometric Probability Space.

Random Variable

In this part we follow a path similar to that of measurable functions, but give the specific definitions.

Simply speaking, a random variable is a measurable function on \(\Omega\); it assigns a number to each basic event.

Definition of Random Variable

Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. A function \(\xi:\Omega\rightarrow \mathbb{R}\) defined on \(\Omega\) is called a Random Variable if \(\forall \alpha\in \mathbb{R}\)

\[ \{\xi\leq \alpha\}:=\{\omega\in \Omega: \xi(\omega)\leq \alpha\}\in \mathscr{F}. \]

Note: for a random variable \(\xi\) and a set \(A\), \(\{\xi\in A\}\) denotes the event \(\{\omega\in \Omega: \xi(\omega)\in A\}\); its probability \(\mathbb{P}(\xi\in A)\) is in general less than \(1\). If \(\mathbb{P}(\xi\in A)=1\), we say \(\xi\) is distributed on \(A\).

Readers may compare this definition with that of measurable functions and its equivalent formulations. Here measurability of \(\xi\) means that the information contained in the \(\sigma\)-algebra is enough to determine \(\xi\).
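
To illustrate this reading of measurability, here is a small Python sketch reusing the hypothetical partition \(\sigma\)-algebra from the earlier example: a function is measurable with respect to that \(\sigma\)-algebra exactly when it is constant on every block, i.e. when knowing which block occurred already determines its value.

```python
# Measurability with respect to the sigma-algebra generated by a finite partition:
# every level set {xi <= alpha} must be a union of blocks, which here simply means
# that xi is constant on each block.  (Illustrative sketch on a finite space.)
partition = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]

def is_measurable(xi):
    """xi: dict mapping sample points to real values."""
    return all(len({xi[w] for w in block}) == 1 for block in partition)

xi = {0: 1.0, 1: 1.0, 2: -2.0, 3: -2.0, 4: -2.0, 5: 0.5}    # constant on blocks
eta = {0: 1.0, 1: 3.0, 2: -2.0, 3: -2.0, 4: -2.0, 5: 0.5}   # splits the block {0, 1}

print(is_measurable(xi), is_measurable(eta))                # True False
```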

Properties of Random Variable

(i) The random variables on a probability space form a linear space. The proof of closedness under addition is similar to the corresponding proof for measurable functions.

(i) We prove closedness under addition. For random variables \(\xi, \eta\) and \(\forall \alpha\in \mathbb{R}\), we have

\[ \{\xi+\eta<\alpha\}=\bigcup_{r\in \mathbb{Q}}\left(\{\xi<r\}\cap\{\eta<\alpha-r\}\right), \]

where the right-hand side is a countable union of sets in the \(\sigma\)-algebra and hence still lies in \(\mathscr{F}\). (Recall that \(\{\xi<r\}=\bigcup\limits_{n\geq 1}\{\xi\leq r-\tfrac{1}{n}\}\in\mathscr{F}\) and \(\{\xi+\eta\leq\alpha\}=\bigcap\limits_{n\geq 1}\{\xi+\eta<\alpha+\tfrac{1}{n}\}\), so events with strict inequalities may be used interchangeably with those in the definition.)

As in measure theory, we have the characteristic (indicator) function: for \(A\subset \Omega\),

\[ 1_A(\omega)=\begin{cases} 1,\quad \omega\in A\\ 0,\quad \omega\notin A. \end{cases} \]

Example. A set \(A\subset \Omega\) has characteristic function \(1_A\); then

\[ \{1_A\leq \alpha\}=\begin{cases} \Omega,\quad &\alpha\geq 1,\\ A^c,\quad &\alpha\in[0,1),\\ \varnothing,\quad &\alpha<0. \end{cases} \]

Notice that here we use \(\{\xi\leq \alpha\}\) rather than \(\{\xi>\alpha\}\), because of the practical meaning of the former.

Definitions of discrete random variables

If a random variable \(\xi\) is distributed on a countable set, i.e. the range of \(\xi\) is countable, then we call \(\xi\) a discrete random variable, and its range is denoted by \(R(\xi)\). If \(R(\xi)\) has finitely many elements, then we call \(\xi\) a simple random variable. In this case we can write

\[ \xi=\sum_{x\in R(\xi)}x 1_{\{\xi=x\}}. \]
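
The following sketch, on a hypothetical two-coin-toss space, checks this indicator decomposition pointwise: summing \(x\cdot 1_{\{\xi=x\}}\) over the range \(R(\xi)\) reproduces \(\xi\).

```python
# A simple random variable on a small hypothetical sample space and its
# indicator decomposition: xi = sum over x in R(xi) of x * 1_{xi = x}.
omega = ["HH", "HT", "TH", "TT"]
xi = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}     # e.g. the number of heads

R_xi = set(xi.values())                       # the finite range R(xi)

def indicator(event):
    return lambda w: 1 if w in event else 0

def decomposed(w):
    return sum(x * indicator({v for v in omega if xi[v] == x})(w) for x in R_xi)

assert all(decomposed(w) == xi[w] for w in omega)
print("the decomposition reproduces xi pointwise")
```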

Definition of Distribution function

Assume \(\xi\) is a random variable, then \(\forall x\in \mathbb{R}\),

\[ F_\xi(x):=\mathbb{P}(\xi\leq x) \]

is called the Distribution Function of \(\xi\).

Properties of Distribution Function

Assume \(F_\xi\) is a distribution function, then

(i) \(F_\xi\) monotonically increases,

(ii) \(F_\xi\) is right continuous,

(iii) \(\lim\limits_{x\rightarrow -\infty}F_\xi(x)=0,\quad \lim\limits_{x\rightarrow +\infty}F_\xi(x)=1\).

Example. (Bernoulli Distribution). A random experiment with only two outcomes is usually called a Bernoulli experiment. Denote the success probability of an event \(A\) by \(\mathbb{P}(A)=p\) and its counterpart by \(q=1-p\); then the indicator of success \(\xi\) is a random variable with distribution

\[ \left(\begin{array}{ccc} \xi & 0&1\\ \mathbb{P} &q &p \end{array}\right). \]

The indicator \(1_A\) of an event \(A\) has a Bernoulli distribution, and any Bernoulli-distributed random variable is the indicator of some event.
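
A short simulation sketch, assuming \(p=0.3\): it estimates \(\mathbb{P}(\xi=1)\) empirically and evaluates the step-shaped, right-continuous distribution function of a Bernoulli variable.

```python
import random

p = 0.3                                  # assumed success probability
n = 100_000
samples = [1 if random.random() < p else 0 for _ in range(n)]

# Empirical check of the distribution table: xi = 1 with probability p, 0 with q.
print(sum(s == 1 for s in samples) / n)  # approximately 0.3

def F(x):
    """Distribution function of Bernoulli(p): 0 for x < 0, q on [0, 1), 1 for x >= 1."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1 - p
    return 1.0

print(F(-1), F(0), F(0.5), F(1))         # 0.0 0.7 0.7 1.0
```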

We care more about distributions than about random variables themselves. Two random variables \(\xi,\eta\) have a chance to be equal only when they are defined on the same probability space; when they are defined on different probability spaces, we can still compare them through their distributions, since both the range \(R(\xi)\) and the distribution function \(F_\xi\) live on \(\mathbb{R}\).

Definition of Same distribution

Two random variables \(\xi\) and \(\eta\) (possibly defined on different probability spaces) are said to have the Same Distribution if their distribution functions are equal.

Conditional Probability

Definitions of Independence

Events \(A, B\) are said to be independent if

\[ \mathbb{P}(A\cap B)=\mathbb{P}(A)\mathbb{P}(B). \]

A sequence of events \(\{A_n\}_{n\geq 1}\) is said to be mutually independent if for any finitely many events \(\{A_{n_j}\}_{1\leq j\leq k}\),

\[ \mathbb{P}\left(\bigcap_{j=1}^k A_{n_j}\right)=\prod_{j=1}^k \mathbb{P}(A_{n_j}). \]

Random variables \(\{\xi_i\}_{1\leq i\leq n}\) are said to be mutually independent if \(\forall x_i\in \mathbb{R}\ (1\leq i\leq n)\),

\[ \mathbb{P}(\xi_1\leq x_1,\cdots, \xi_n\leq x_n)=\prod_{i=1}^n \mathbb{P}(\xi_i\leq x_i). \]
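
As an illustration of the product criterion, the sketch below takes the hypothetical two-dice space, lets \(\xi_1,\xi_2\) be the values of the two dice, and verifies \(\mathbb{P}(\xi_1\leq x_1,\xi_2\leq x_2)=\mathbb{P}(\xi_1\leq x_1)\,\mathbb{P}(\xi_2\leq x_2)\); since the dice are integer-valued, checking integer thresholds suffices.

```python
from fractions import Fraction
from itertools import product

# The two-dice classical model again; xi1 = first die, xi2 = second die.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))

def P(event):
    """Probability of the event {w in omega : event(w) is True}."""
    return sum(p for w in omega if event(w))

# Check P(xi1 <= x1, xi2 <= x2) == P(xi1 <= x1) * P(xi2 <= x2) for all thresholds.
independent = all(
    P(lambda w: w[0] <= x1 and w[1] <= x2) == P(lambda w: w[0] <= x1) * P(lambda w: w[1] <= x2)
    for x1 in range(1, 7)
    for x2 in range(1, 7)
)
print(independent)   # True
```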

Definitions of Conditional Probability

Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, \(A,B\in \mathscr{F}\), and \(\mathbb{P}(A)>0\). The conditional probability of \(B\) given \(A\) is

\[ \mathbb{P}(B|A):=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}. \]

The mapping \(B\mapsto \mathbb{P}(B|A)\) is a probability measure on \((\Omega, \mathscr{F})\), and also a probability measure on the restricted space \((A, A\cap \mathscr{F})\), where \(A\cap \mathscr{F}:=\{A\cap B: B\in \mathscr{F}\}\) is a \(\sigma\)-algebra on \(A\).

Properties of Conditional Probability

(i) Random variables \(\xi_1,\cdots,\xi_n\) are mutually independent, iff \(\forall x_i\leq y_i, 1\leq i\leq n\),

\[ \mathbb{P}(x_1<\xi_1\leq y_1,\cdots,x_n<\xi_n\leq y_n)=\mathbb{P}(x_1<\xi_1\leq y_1)\cdots\mathbb{P}(x_n<\xi_n\leq y_n). \]

(ii) If the random variables are discrete, then they are independent iff

\[ \mathbb{P}(\xi_1=x_1,\cdots,\xi_n=x_n)=\mathbb{P}(\xi_1=x_1)\cdots\mathbb{P}(\xi_n=x_n). \]

(iii) \(\mathbb{P}((C|B)|A)=\mathbb{P}(C|A\cap B)\), i.e. conditioning first on \(A\) and then on \(B\) is the same as conditioning on \(A\cap B\).

Example. Tossing a coin infinitely many times can be realized if and only if the uniform distribution on \([0,1]\) can be realized.

  • \(\Rightarrow\). Suppose a random variable \(\xi\) defined on a probability space \((\Omega, \mathscr{F},\mathbb{P})\) is uniformly distributed on \([0,1]\). Denote its \(n\)-th binary digit by \(\xi_n\), with \(\xi_n \in \{0,1\}\). Fixing the first \(n\) digits confines \(\xi\) to an interval of length \(\frac{1}{2^n}\), i.e.
\[ \mathbb{P}(\xi_1=a_1,\cdots,\xi_n=a_n)=\frac{1}{2^n} \]

which means \(\{\xi_n\}_{n\geq 1}\) are mutually independent with \(\mathbb{P}(\xi_n=0)=\mathbb{P}(\xi_n=1)=\frac{1}{2}\), i.e. they describe the coin-toss problem.

  • \(\Leftarrow\). Suppose random variables \(\{\xi_n\}_{n\geq 1}\) defined on \((\Omega, \mathscr{F},\mathbb{P})\) are mutually independent with \(\mathbb{P}(\xi_n=0)=\mathbb{P}(\xi_n=1)=\frac{1}{2}\). Define
\[ \xi:=\sum_{n=1}^\infty\frac{\xi_n}{2^n}. \]

so that the \(\xi_n\) are the binary digits of \(\xi\). Then \(\forall n\geq 1,\ 0\leq k\leq 2^n-1\), we have

\[ \mathbb{P}(\xi\in [\frac{k}{2^n},\frac{k+1}{2^n}])=\frac{1}{2^n}, \]

meaning \(\xi\) is uniformly distributed on \([0,1]\).

\(\square\)
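
The \(\Leftarrow\) direction can also be simulated: the sketch below builds \(\xi=\sum_n \xi_n/2^n\) from independent fair bits (truncated at finitely many bits) and checks that each dyadic interval \([k/2^n,(k+1)/2^n)\) receives roughly its share \(1/2^n\) of the samples.

```python
import random

def xi(n_bits=32):
    """sum of xi_n / 2**n for independent fair bits xi_n, truncated at n_bits."""
    return sum(random.getrandbits(1) / 2 ** n for n in range(1, n_bits + 1))

samples = [xi() for _ in range(100_000)]

# Each dyadic interval [k/2**n, (k+1)/2**n) should receive about 1/2**n of the mass.
n = 3
counts = [0] * 2 ** n
for s in samples:
    counts[min(int(s * 2 ** n), 2 ** n - 1)] += 1
print([round(c / len(samples), 3) for c in counts])   # each entry close to 1/8 = 0.125
```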

Total Probability Formula

Assume the events \(\{\Omega_n\}_{n\geq 1}\subset \mathscr{F}\) form a partition of \(\Omega\) with \(\mathbb{P}(\Omega_n)>0\); then \(\forall A\in \mathscr{F}\),

\[ \mathbb{P}(A)=\sum_{n\geq 1}\mathbb{P}(A\cap \Omega_n)=\sum_{n\geq 1}\mathbb{P}(A|\Omega_n)\mathbb{P}(\Omega_n). \]

Since \(A=A\cap \Omega=\bigcup_{n\geq 1}(A\cap\Omega_n)\) is a disjoint union, by \(\sigma\)-additivity

\[ \mathbb{P}(A)=\mathbb{P}\left(\bigcup_{n\geq 1}(A\cap\Omega_n)\right)=\sum_{n\geq 1}\mathbb{P}(A\cap\Omega_n)=\sum_{n\geq 1}\mathbb{P}(A|\Omega_n)\mathbb{P}(\Omega_n). \]

\(\square\)

Definition of Bayes Formula

If \(A\) happens, then we can calculate the probability of each category \(\Omega_n\):

\[ \mathbb{P}(\Omega_n|A)=\frac{\mathbb{P}(A|\Omega_n)\mathbb{P}(\Omega_n)}{\mathbb{P}(A)}. \]

where \(\mathbb{P}(A)\) is calculated using the Total Probability Formula. The above formula is called the Bayes Formula.
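
A numerical sketch of both formulas, with hypothetical priors \(\mathbb{P}(\Omega_n)\) and likelihoods \(\mathbb{P}(A|\Omega_n)\): the Total Probability Formula gives \(\mathbb{P}(A)\), and the Bayes Formula turns the priors into posteriors, which sum to \(1\).

```python
from fractions import Fraction

# Hypothetical three-category partition with prior probabilities P(Omega_n) ...
prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
# ... and conditional probabilities P(A | Omega_n) of an event A in each category.
likelihood = [Fraction(1, 10), Fraction(2, 5), Fraction(3, 4)]

# Total Probability Formula: P(A) = sum of P(A | Omega_n) * P(Omega_n).
P_A = sum(l * q for l, q in zip(likelihood, prior))

# Bayes Formula: P(Omega_n | A) = P(A | Omega_n) * P(Omega_n) / P(A).
posterior = [l * q / P_A for l, q in zip(likelihood, prior)]

print(P_A)                          # 8/25
print([str(q) for q in posterior])  # ['5/32', '3/8', '15/32']
print(sum(posterior))               # 1
```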

Mathematical Expectation

In this part we again follow a path similar to that of the Lebesgue integral.

Definitions of ME for simple RV

(i) Assume \(\xi\) is a simple random variable

\[ \xi=\sum_{x\in R(\xi)}x\cdot 1_{\{\xi=x\}} \]

we define its mathematical expectation as the weighted average

\[ \mathbb{E}\xi=\sum_{x\in R(\xi)}x\mathbb{P}(\xi=x). \]

Actually, this is the Lebesgue integral of a simple function.
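
A tiny sketch computing \(\mathbb{E}\xi=\sum_x x\,\mathbb{P}(\xi=x)\) for the hypothetical variable "number of heads in two fair tosses".

```python
from fractions import Fraction

# Expectation of the simple random variable "number of heads in two fair tosses".
omega = ["HH", "HT", "TH", "TT"]
p = {w: Fraction(1, 4) for w in omega}
xi = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}

# E(xi) = sum over x in R(xi) of x * P(xi = x).
E = sum(x * sum(p[w] for w in omega if xi[w] == x) for x in set(xi.values()))
print(E)   # 1
```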

Representation independence of ME for simple RV

If \(x_1,\cdots,x_n\in \mathbb{R}\) and \(\Omega_1,\cdots,\Omega_n\) form a finite partition of \(\Omega\), then

\[ \mathbb{E}\left(\sum_{i=1}^n x_i1_{\Omega_i}\right)=\sum_{i=1}^nx_i\mathbb{P}(\Omega_i). \]

Note here \(\sum\limits_{i=1}^n\mathbb{P}(\Omega_i)=1\).

Properties of ME for simple RV

Assume \(\xi,\eta\) are simple random variables,

(i) if \(\xi\geq 0\), then \(\mathbb{E}\xi\geq 0\).

(ii) Homogeneity. \(\forall a\in \mathbb{R}\), \(\mathbb{E}(a\xi)=a\mathbb{E}\xi\).

(iii) Linearity. \(\mathbb{E}(\xi+\eta)=\mathbb{E}(\xi)+\mathbb{E}(\eta)\).

(iv) Characteristic function. If \(A\in \mathscr{F}\), then \(\mathbb{E}1_A=\mathbb{P}(A)\).

(v) Zero a.s.. If \(\mathbb{P}(\xi\neq 0)=0\), then \(\mathbb{E}\xi=0\).

(vi) Monotonicity. If \(\xi\leq \eta\), then \(\mathbb{E}\xi\leq \mathbb{E}\eta\).

(vii) Independence property. If \(\xi\) and \(\eta\) are independent, then \(\mathbb{E}(\xi\cdot \eta)=\mathbb{E}\xi\cdot\mathbb{E}\eta\).

Corollary: using linearity

Choose arbitrary events \(\{A_k\}_{1\leq k\leq n}\) and real numbers \(\{x_k\}_{1\leq k\leq n}\); then

\[ \mathbb{E}\left(\sum_{k=1}^n x_k1_{A_k}\right)=\sum_{k=1}^n x_k\mathbb{P}(A_k). \]

Here comes the mathematical expectation of non-negative random variables.

Definition of ME for Non-negative random variables

Assume \(\xi\) is a non-negative random variable; define its Mathematical Expectation to be

\[ \mathbb{E}\xi=\sup\{\mathbb{E}\eta: 0\leq\eta\leq \xi, \eta\text{ is a simple RV}\}. \]

If \(\mathbb{E}\xi<\infty\), we say \(\xi\) is integrable (L). If \(A\) is an event, then we use \(\mathbb{E}(\xi; A)\) to denote \(\mathbb{E}(\xi\cdot 1_A)\), the ME of \(\xi\) restricted to the event \(A\).
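
To see the supremum at work, the sketch below uses the standard staircase simple variables \(\eta_n=\lfloor 2^n\xi\rfloor/2^n\leq\xi\) in the hypothetical case \(\xi(\omega)=\omega\) on \(([0,1],\mathscr{B}([0,1]),\text{Lebesgue})\); their expectations increase toward \(\mathbb{E}\xi=\frac{1}{2}\).

```python
from fractions import Fraction

# Staircase (simple) approximations of the non-negative variable xi(w) = w on
# ([0,1], Borel, Lebesgue): eta_n takes the value k/2**n on [k/2**n, (k+1)/2**n),
# an event of probability 1/2**n, so E(eta_n) = sum of (k/2**n) * (1/2**n).
def E_eta(n):
    return sum(Fraction(k, 2 ** n) * Fraction(1, 2 ** n) for k in range(2 ** n))

for n in (1, 2, 4, 8):
    print(n, E_eta(n))   # 1/4, 3/8, 15/32, 255/512 -> increasing toward 1/2
```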