Elementary Probability Theory¶
Probability Space¶
Definitions of \(\sigma\)-algebra
A family \(\mathscr{F}\) of subsets of a non-empty set \(\Omega\) is called a \(\pmb{\sigma}\)-algebra on \(\Omega\) if it satisfies
(i) \(\varnothing, \Omega\in \mathscr{F}\),
(ii) If \(A\in \mathscr{F}\), then \(A^c\in \mathscr{F}\),
(iii) If a sequence of sets \(\{A_n\}_{n\geq 1}\subset \mathscr{F}\), then \(\bigcup\limits_{n=1}^\infty A_n\in \mathscr{F}\).
It is easy to see that for a set \(\Omega\), the smallest \(\sigma\)-algebra is \(\mathscr{F}=\{\varnothing, \Omega\}\), and the largest one consists of all subsets of \(\Omega\) (the power set), denoted \(2^\Omega\).
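To make the axioms concrete, here is a minimal Python sketch (all names are illustrative) that checks conditions (i)–(iii) for a finite family of subsets; for a finite \(\Omega\) the countable union in (iii) reduces to pairwise unions:

```python
from itertools import chain, combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a finite family of frozensets."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # (i) the empty set and Omega belong to the family
    if frozenset() not in family or omega not in family:
        return False
    # (ii) closure under complement
    if any(omega - a not in family for a in family):
        return False
    # (iii) closure under unions; countable unions reduce to
    # pairwise ones when the family is finite
    return all(a | b in family for a in family for b in family)

omega = {1, 2, 3, 4}
power_set = {frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))}
print(is_sigma_algebra(omega, {frozenset(), frozenset(omega)}))  # True: smallest
print(is_sigma_algebra(omega, power_set))                        # True: largest
print(is_sigma_algebra(omega, {frozenset(), frozenset({1}), frozenset(omega)}))  # False
```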
Properties of \(\sigma\)-algebra
Assume \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\), then
(i) If \(A,B\in \mathscr{F}\), then \(A\cap B, A\cup B, A-B \in \mathscr{F}\),
(ii) If \(\{A_n\}_{n\geq 1}\subset \mathscr{F}\), then \(\bigcap\limits_{n=1}^\infty A_n\in \mathscr{F}\).
(i) \(A\cup B=A\cup B\cup \varnothing\cup \cdots \cup \varnothing\in \mathscr{F}\). \(A\cap B=(A^c\cup B^c)^c\in \mathscr{F}\). \(A-B=A\cap B^c\in \mathscr{F}\).
(ii) Use De Morgan's law: \(\bigcap\limits_{n=1}^\infty A_n=\left(\bigcup\limits_{n=1}^\infty A_n^c\right)^c \in \mathscr{F}\).
Example. (Discrete) Assume the sequence of events \(\{\Omega_n\}_{n\geq 1}\) is a partition of \(\Omega\); then
\[\mathscr{A}=\left\{\bigcup_{n\in I}\Omega_n : I\subset \mathbb{N}\right\}\]
is a sub-\(\sigma\)-algebra of \(\mathscr{F}\).
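As a sanity check, the family in this example can be generated mechanically by taking all unions of partition blocks; a small sketch under the same finite-\(\Omega\) assumption:

```python
from itertools import chain, combinations

def sigma_algebra_from_partition(blocks):
    """All unions of blocks of a finite partition form a sigma-algebra."""
    blocks = [frozenset(b) for b in blocks]
    index_subsets = chain.from_iterable(
        combinations(range(len(blocks)), r) for r in range(len(blocks) + 1))
    return {frozenset().union(*(blocks[i] for i in idx)) for idx in index_subsets}

partition = [{1, 2}, {3}, {4, 5}]        # a partition of Omega = {1,...,5}
A = sigma_algebra_from_partition(partition)
print(sorted(map(sorted, A)))            # 2^3 = 8 sets, closed under complement and union
```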
Kolmogorov: Definition of probability measure
Assume \(\Omega\) is a sample space and \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\). A function \(\mathbb{P}\) on \(\mathscr{F}\) is called a probability measure if it satisfies
(i) Non-negativity. \(\forall A\in \mathscr{F}\), \(\mathbb{P}(A)\geq 0\).
(ii) Normalization. \(\mathbb{P}(\Omega)=1\).
(iii) Sigma-Additivity. Assume \(\{A_n\}_{n\geq 1}\subset \mathscr{F}\) with \(A_i\cap A_j=\varnothing\) for all \(i\neq j\); then
\[\mathbb{P}\left(\bigcup_{n=1}^\infty A_n\right)=\sum_{n=1}^\infty \mathbb{P}(A_n).\]
In this case, \(\mathbb{P}(A)\) is called the probability that the event \(A\) happens.
We combine \(\Omega\), \(\mathscr{F}\), and \(\mathbb{P}\) above into the triple \((\Omega, \mathscr{F}, \mathbb{P})\), which is called a Probability Space.
Because of sigma-additivity, we can use all the tools and results from measure theory; apart from normalization, the definition of a probability measure is the same as that of a measure. However, probability theory has its own unique phenomena and methods.
Regarding the \(\sigma\)-algebra, we cannot always choose the power set, because \(\Omega\) might have non-denumerably many elements.
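When \(\Omega\) is finite or denumerable, a probability measure is determined by point masses. A minimal sketch on a three-point space (the weights are made up for illustration):

```python
from fractions import Fraction

# Point masses on Omega = {"a", "b", "c"}; they sum to 1.
mass = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def P(event):
    """Probability measure induced by point masses: P(A) = sum of masses in A."""
    return sum(mass[w] for w in event)

A, B = {"a"}, {"b", "c"}                  # disjoint events
assert P(set(mass)) == 1                  # normalization: P(Omega) = 1
assert P(A | B) == P(A) + P(B)            # additivity on disjoint events
print(P(A), P(B), P(A | B))               # 1/2  1/2  1
```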
Properties of Probability Measure
(i) \(\mathbb{P}(\varnothing)=0\).
(ii) If \(A,B\in \mathscr{F}\) and \(A\cap B=\varnothing\), then \(\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)\).
(iii) If \(A,B\in \mathscr{F}\), \(A\subset B\), then \(\mathbb{P}(B-A)=\mathbb{P}(B)-\mathbb{P}(A)\), so \(\mathbb{P}(A)\leq \mathbb{P}(B)\).
(iv) \(\mathbb{P}(A^c)=1-\mathbb{P}(A)\).
(v) Sub-sigma-additivity. If \(\{A_n\}_{n\geq 1}\subset \mathscr{F}\), then
\[\mathbb{P}\left(\bigcup_{n=1}^\infty A_n\right)\leq \sum_{n=1}^\infty \mathbb{P}(A_n).\]
(vi) Inferior continuity (continuity from below). If \(\{A_n\}_{n\geq 1}\subset \mathscr{F}\) is monotonically increasing, then (see the sketch after this list)
\[\mathbb{P}\left(\bigcup_{n=1}^\infty A_n\right)=\lim_{n\rightarrow\infty}\mathbb{P}(A_n).\]
(vii) Superior continuity (continuity from above). If \(\{A_n\}_{n\geq 1}\subset \mathscr{F}\) is monotonically decreasing, then
\[\mathbb{P}\left(\bigcap_{n=1}^\infty A_n\right)=\lim_{n\rightarrow\infty}\mathbb{P}(A_n).\]
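Continuity from below can be observed numerically. A tiny sketch with the geometric masses \(\mathbb{P}(\{k\})=2^{-k}\) on \(\mathbb{N}\) (the choice of masses is arbitrary), where \(A_n=\{1,\dots,n\}\) increases to \(\Omega\):

```python
def P_An(n):
    """P(A_n) for the increasing events A_n = {1, ..., n} under P({k}) = 2**-k."""
    return sum(0.5 ** k for k in range(1, n + 1))   # equals 1 - 2**(-n)

for n in (1, 5, 10, 30):
    print(n, P_An(n))   # increases to P(union of A_n) = P(Omega) = 1
```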
Note: a probability cannot be assigned arbitrarily, because it is a function defined on a \(\sigma\)-algebra. When \(\Omega\) has denumerably many elements, a probability space is easy to construct.
Discrete Probability Space
(i) \((\Omega,\{\varnothing, A, A^c, \Omega\},\mathbb{P})\) is called a Bernoulli probability space if, for some \(p\in[0,1]\),
\[\mathbb{P}(A)=p,\quad \mathbb{P}(A^c)=1-p.\]
(ii) As in the Example above, with \(\mathscr{A}\) generated by a partition \(\{\Omega_n\}_{n\geq 1}\), we define \(\mathbb{P}\) by
\[\mathbb{P}\left(\bigcup_{n\in I}\Omega_n\right)=\sum_{n\in I}p_n,\qquad p_n\geq 0,\ \sum_{n\geq 1}p_n=1;\]
then \((\Omega, \mathscr{A}, \mathbb{P})\) is a probability space, also called a Discrete Probability Space.
Example. Assume \(\Omega\) has denumerably many elements, and choose its power set as \(\mathscr{F}\). Take a function \(p:\Omega\rightarrow \mathbb{R}\) with \(p(\omega)\geq 0\) for every sample point \(\omega\in\Omega\) and \(\sum\limits_{\omega\in \Omega}p(\omega)=1\), and for all \(A\subset \Omega\) define
\[\mathbb{P}(A)=\sum_{\omega\in A}p(\omega).\]
Show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. If \(|\Omega|<\infty\) and \(p(\omega)=\frac{1}{|\Omega|}\), then
\[\mathbb{P}(A)=\frac{|A|}{|\Omega|},\]
which is the classical model.
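Under the classical model, computing a probability is just counting. A minimal sketch with a fair six-sided die (the event is chosen only for illustration):

```python
omega = range(1, 7)                      # fair six-sided die
A = {w for w in omega if w % 2 == 0}     # event: the outcome is even
P_A = len(A) / len(omega)                # classical model: P(A) = |A| / |Omega|
print(P_A)                               # 0.5
```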
If \(\Omega\) is non-denumerable, then it is not easy to construct its \(\sigma\)-algebra.
Example. Assume \(\Omega=[0,1]\) and \(\mathscr{F}\) is the Borel \(\sigma\)-algebra \(\mathscr{B}([0,1])\). For \(A\in \mathscr{F}\), let \(\mathbb{P}(A)=|A|\), which is exactly the Lebesgue measure of \(A\). Then show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, also called a Geometric Probability Space.
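Although constructing \(\mathscr{B}([0,1])\) is nontrivial, the measure of a simple event can be estimated by sampling. A sketch that estimates \(\mathbb{P}([0.2,0.5])=0.3\) by Monte Carlo (interval and sample size arbitrary):

```python
import random

random.seed(0)
n = 100_000
hits = sum(1 for _ in range(n) if 0.2 <= random.random() <= 0.5)
print(hits / n)   # approximately 0.3 = |[0.2, 0.5]|, the Lebesgue measure
```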
Random Variable¶
In this part we follow a path similar to that of measurable functions, but give the specific definitions.
Simply speaking, a random variable is a measurable function on \(\Omega\), endowing basic events with numbers.
Definition of Random Variable
Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. A function \(\xi:\Omega\rightarrow \mathbb{R}\) defined on \(\Omega\) is called a Random Variable if \(\forall \alpha\in \mathbb{R}\),
\[\{\xi\leq \alpha\}:=\{\omega\in\Omega: \xi(\omega)\leq \alpha\}\in \mathscr{F}.\]
Note: for a random variable \(\xi\), we write \(\xi\in A\) if \(\xi(\omega)\in A\) for all \(\omega\in \Omega\). If merely \(\mathbb{P}(\xi\in A)=1\), we say \(\xi\) is distributed on \(A\).
Readers could check this definition against measurable functions and their equivalent definitions. Here the measurability of \(\xi\) means that the information carried by the \(\sigma\)-algebra is enough to determine \(\xi\).
Properties of Random Variable
(i) The random variables on a given probability space form a linear space. The proof for addition is similar to that for measurable functions.
(i) We prove closure under addition. For random variables \(\xi, \eta\) and \(\forall \alpha\in \mathbb{R}\), we have
\[\{\xi+\eta<\alpha\}=\bigcup_{r\in\mathbb{Q}}\left(\{\xi<r\}\cap \{\eta<\alpha-r\}\right),\]
where the right side is a denumerable union of sets in the \(\sigma\)-algebra, which still lies in \(\mathscr{F}\) by definition.
As in measure theory, we have the characteristic function. For \(A\subset \Omega\),
\[1_A(\omega)=\begin{cases}1, & \omega\in A,\\ 0, & \omega\notin A.\end{cases}\]
Example. A set \(A\subset \Omega\) has characteristic function \(1_A\); then
\[\{1_A\leq \alpha\}=\begin{cases}\varnothing, & \alpha<0,\\ A^c, & 0\leq \alpha<1,\\ \Omega, & \alpha\geq 1,\end{cases}\]
so \(1_A\) is a random variable if and only if \(A\in \mathscr{F}\).
Notice that we use \(\{\xi\leq \alpha\}\) rather than \(\{\xi>\alpha\}\) because of the practical meaning of the former.
Definitions of discrete random variables
If a random variable \(\xi\) is distributed on a set with denumerably many elements, i.e. the range of \(\xi\) is denumerable, then we call \(\xi\) a discrete random variable, whose range is denoted by \(R(\xi)\). If \(R(\xi)\) has finitely many elements, then we call \(\xi\) a simple random variable. In terms of form, we have
\[\xi=\sum_{x\in R(\xi)} x\, 1_{\{\xi=x\}}.\]
Definition of Distribution function
Assume \(\xi\) is a random variable; then \(\forall x\in \mathbb{R}\),
\[F_\xi(x)=\mathbb{P}(\xi\leq x)\]
is called the Distribution Function of \(\xi\).
Properties of Distribution Function
Assume \(F_\xi\) is a distribution function, then
(i) \(F_\xi\) monotonically increases,
(ii) \(F_\xi\) is right continuous,
(iii) \(\lim\limits_{x\rightarrow -\infty}F_\xi(x)=0,\quad \lim\limits_{x\rightarrow +\infty}F_\xi(x)=1\).
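These properties can be observed on a concrete discrete example. A sketch of the distribution function of a fair die, illustrating monotonicity, the limits at \(\pm\infty\), and right continuity at a jump:

```python
def F(x):
    """Distribution function of a fair die: F(x) = P(xi <= x)."""
    return sum(1 for k in range(1, 7) if k <= x) / 6

print(F(-1), F(3), F(3.5), F(6))    # 0.0 0.5 0.5 1.0: monotone, limits 0 and 1
print(F(3 + 1e-9), F(3 - 1e-9))     # 0.5 vs 0.333...: right continuous at the jump x = 3
```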
Example. (Bernoulli Distribution). A random experiment with only two outcomes is usually called a Bernoulli experiment. Denote the success probability of an event \(A\) by \(\mathbb{P}(A)=p\) and its counterpart by \(q=1-p\); then the indicator of success \(\xi\) is a random variable with distribution
\[\mathbb{P}(\xi=1)=p,\quad \mathbb{P}(\xi=0)=q.\]
The indicator \(1_A\) of an event \(A\) has a Bernoulli distribution, and any Bernoulli-distributed random variable must be the indicator of some event.
We care more about distributions than about the random variables themselves. Only when defined on the same probability space do random variables \(\xi,\eta\) have a chance to be equal; when they are defined on different probability spaces, we can still see their resemblance by comparing their distribution functions, since both the range \(R(\xi)\) and the distribution function \(F_\xi\) live on \(\mathbb{R}\).
Definition of Same distribution
Two random variables (possibly defined on different probability spaces) \(\xi\) and \(\eta\) are said to have the Same Distribution if their distribution functions are the same.
Conditional Probability¶
Definitions of Independence
Events \(A, B\) are said to be independent if
\[\mathbb{P}(A\cap B)=\mathbb{P}(A)\,\mathbb{P}(B).\]
A sequence of events \(\{A_n\}_{n\geq 1}\) is said to be mutually independent if for every finite collection of events \(\{A_{n_j}\}_{1\leq j\leq k}\),
\[\mathbb{P}\left(\bigcap_{j=1}^k A_{n_j}\right)=\prod_{j=1}^k \mathbb{P}(A_{n_j}).\]
Random variables \(\{\xi_i\}_{1\leq i\leq n}\) are said to be mutually independent if \(\forall x_i\in \mathbb{R}\ (1\leq i\leq n)\),
\[\mathbb{P}(\xi_1\leq x_1,\cdots,\xi_n\leq x_n)=\prod_{i=1}^n \mathbb{P}(\xi_i\leq x_i).\]
Definitions of Conditional Probability
Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, \(A,B\in \mathscr{F}\), and \(\mathbb{P}(A)>0\). The conditional probability of \(B\) given \(A\) is
\[\mathbb{P}(B|A)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}.\]
The mapping \(B\mapsto \mathbb{P}(B|A)\) is a probability on \((\Omega, \mathscr{F})\), and also a probability on the restricted space \((A, A\cap \mathscr{F})\), where \(A\cap \mathscr{F}=\{A\cap B: B\in \mathscr{F}\}\) is a \(\sigma\)-algebra on \(A\).
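A concrete computation on a finite space; a sketch assuming the classical model with two fair dice (the events are chosen only for illustration):

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice, 36 outcomes

def P(event):
    return len(event) / len(omega)             # classical model

A = [w for w in omega if w[0] == 3]            # first die shows 3
B = [w for w in omega if w[0] + w[1] == 7]     # the sum is 7
A_and_B = [w for w in B if w in A]
print(P(A_and_B) / P(A))                       # P(B|A) = P(A & B) / P(A) = 1/6
```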
Properties of Conditional Probability
(i) Random variables \(\xi_1,\cdots,\xi_n\) are mutually independent iff \(\forall x_i\leq y_i,\ 1\leq i\leq n\),
\[\mathbb{P}\left(\bigcap_{i=1}^n\{x_i<\xi_i\leq y_i\}\right)=\prod_{i=1}^n \mathbb{P}(x_i<\xi_i\leq y_i).\]
(ii) If the random variables are discrete, then they are independent iff for all \(x_i\in R(\xi_i)\),
\[\mathbb{P}(\xi_1=x_1,\cdots,\xi_n=x_n)=\prod_{i=1}^n \mathbb{P}(\xi_i=x_i).\]
(iii) \(\mathbb{P}((C|B)|A)=\mathbb{P}(C|A\cap B)\).
Example. Being able to toss a coin infinitely many times is equivalent to being able to realize a uniform distribution.
- \(\Rightarrow\). Assume a random variable \(\xi\) defined on a probability space \((\Omega, \mathscr{F},\mathbb{P})\) is uniformly distributed on \([0,1]\). Denote its \(n\)-th binary digit by \(\xi_n\), with \(\xi_n \in \{0,1\}\). Fixing the first \(n\) digits confines \(\xi\) to an interval of length \(\frac{1}{2^n}\), i.e. for any \(a_1,\cdots,a_n\in\{0,1\}\),
\[\mathbb{P}(\xi_1=a_1,\cdots,\xi_n=a_n)=\frac{1}{2^n}=\prod_{i=1}^n \mathbb{P}(\xi_i=a_i),\]
which means \(\{\xi_n\}_{n\geq 1}\) are mutually independent, demonstrating that they describe the coin-tossing problem.
- \(\Leftarrow\). Assume random variables \(\{\xi_n\}_{n\geq 1}\) defined on \((\Omega, \mathscr{F},\mathbb{P})\) are mutually independent with \(\mathbb{P}(\xi_n=0)=\mathbb{P}(\xi_n=1)=\frac{1}{2}\). Define
\[\xi=\sum_{n=1}^\infty \frac{\xi_n}{2^n}.\]
Then \(\xi\) is given by a binary expansion, and \(\forall n\geq 1,\ 0\leq k\leq 2^n-1\), we have
\[\mathbb{P}\left(\frac{k}{2^n}<\xi\leq \frac{k+1}{2^n}\right)=\frac{1}{2^n},\]
meaning \(\xi\) is uniformly distributed on \([0,1]\).
\(\square\)
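The \(\Leftarrow\) direction is easy to simulate: truncating the binary expansion after finitely many fair bits approximates \(\xi\) to within \(2^{-n}\). A sketch (truncation depth and sample size arbitrary):

```python
import random

random.seed(1)

def xi(n_bits=32):
    """xi = sum_n xi_n / 2**n with independent fair bits xi_n, truncated at n_bits."""
    return sum(random.randint(0, 1) / 2 ** n for n in range(1, n_bits + 1))

samples = [xi() for _ in range(100_000)]
# Empirical check of uniformity: each quarter of [0,1] gets about 25% of the mass.
for a in (0.0, 0.25, 0.5, 0.75):
    print(a, sum(1 for s in samples if a < s <= a + 0.25) / len(samples))
```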
Total Probability Formula
Assume the events \(\{\Omega_n\}_{n\geq 1}\subset \mathscr{F}\) form a partition of \(\Omega\) with \(\mathbb{P}(\Omega_n)>0\); then \(\forall A\in \mathscr{F}\),
\[\mathbb{P}(A)=\sum_{n\geq 1}\mathbb{P}(A|\Omega_n)\,\mathbb{P}(\Omega_n).\]
Since \(A=A\cap \Omega=\bigcup_{n\geq 1}(A\cap\Omega_n)\), by sigma-additivity
\[\mathbb{P}(A)=\sum_{n\geq 1}\mathbb{P}(A\cap \Omega_n)=\sum_{n\geq 1}\mathbb{P}(A|\Omega_n)\,\mathbb{P}(\Omega_n).\]
\(\square\)
Definition of Bayes Formula
If \(A\) happens, then we can calculate the probability of each category \(\Omega_n\):
\[\mathbb{P}(\Omega_n|A)=\frac{\mathbb{P}(A|\Omega_n)\,\mathbb{P}(\Omega_n)}{\mathbb{P}(A)},\]
where \(\mathbb{P}(A)\) is calculated using the total probability formula. The above formula is called the Bayes Formula.
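A numerical sketch of both formulas with a made-up two-category example (all numbers are illustrative, e.g. a diagnostic test):

```python
# A partition {Omega_1, Omega_2} with prior probabilities, plus P(A | Omega_n).
prior = {1: 0.99, 2: 0.01}          # e.g. healthy / sick
likelihood = {1: 0.05, 2: 0.95}     # P(A | Omega_n), with A = "test is positive"

# Total probability formula: P(A) = sum_n P(A | Omega_n) * P(Omega_n)
P_A = sum(likelihood[n] * prior[n] for n in prior)

# Bayes formula: P(Omega_n | A) = P(A | Omega_n) * P(Omega_n) / P(A)
posterior = {n: likelihood[n] * prior[n] / P_A for n in prior}
print(P_A, posterior)   # P(A) = 0.059, P(Omega_2 | A) ~ 0.161
```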
Mathematical Expectation¶
In this part we also follow a path similar to that of the Lebesgue integral.
Definitions of ME for simple RV
(i) Assume \(\xi\) is a simple random variable,
\[\xi=\sum_{i=1}^n x_i 1_{\Omega_i},\]
where \(\{\Omega_i\}_{1\leq i\leq n}\) is a finite partition of \(\Omega\); define its mathematical expectation as the weighted average
\[\mathbb{E}\xi=\sum_{i=1}^n x_i\,\mathbb{P}(\Omega_i).\]
Actually, it is the Lebesgue integral of a simple function.
Region irrelevance of ME for simple RV
If \(x_1,\cdots,x_n\in \mathbb{R}\) and \(\Omega_1,\cdots,\Omega_n\) is a finite partition of \(\Omega\), then the value
\[\mathbb{E}\xi=\sum_{i=1}^n x_i\,\mathbb{P}(\Omega_i)\]
does not depend on the chosen representation of \(\xi\).
Note here \(\sum\limits_{i=1}^n\mathbb{P}(\Omega_i)=1\).
Properties of simple RV
Assume \(\xi,\eta\) are simple random variables,
(i) if \(\xi\geq 0\), then \(\mathbb{E}\xi\geq 0\).
(ii) Homogeneity. \(\forall a\in \mathbb{R}\), \(\mathbb{E}(a\xi)=a\mathbb{E}\xi\).
(iii) Linearity. \(\mathbb{E}(\xi+\eta)=\mathbb{E}(\xi)+\mathbb{E}(\eta)\).
(iv) Characteristic function. If \(A\in \mathscr{F}\), then \(\mathbb{E}1_A=\mathbb{P}(A)\).
(v) Zero a.s.. If \(\mathbb{P}(\xi\neq 0)=0\), then \(\mathbb{E}\xi=0\).
(vi) Monotonicity. If \(\xi\leq \eta\), then \(\mathbb{E}\xi\leq \mathbb{E}\eta\).
(vii) Independence property. If \(\xi\) and \(\eta\) are independent, then \(\mathbb{E}(\xi\cdot \eta)=\mathbb{E}\xi\cdot\mathbb{E}\eta\).
Corollary: using linearity
Choose arbitrary events \(\{A_k\}_{1\leq k\leq n}\) and real numbers \(\{x_k\}_{1\leq k\leq n}\); then
\[\mathbb{E}\left(\sum_{k=1}^n x_k 1_{A_k}\right)=\sum_{k=1}^n x_k\,\mathbb{P}(A_k).\]
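The corollary is directly computable even when the events \(A_k\) overlap. A sketch on the two-dice space (weights and events made up) comparing the linearity formula with a pointwise evaluation:

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice, classical model

def P(event):
    return len(event) / len(omega)

# xi = 2 * 1_{A1} + 5 * 1_{A2}; the events are allowed to overlap.
A1 = [w for w in omega if w[0] >= 4]           # first die shows at least 4
A2 = [w for w in omega if w[0] == w[1]]        # doubles
by_formula = 2 * P(A1) + 5 * P(A2)             # E(xi) = sum_k x_k P(A_k)
pointwise = sum(2 * (w in A1) + 5 * (w in A2) for w in omega) / len(omega)
print(by_formula, pointwise)                   # both equal 11/6 ~ 1.8333
```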
Here comes the mathematical expectation of non-negative random variables.
Definition of ME for Non-negative random variables
Assume \(\xi\) is a non-negative random variable; define its Mathematical Expectation to be
\[\mathbb{E}\xi=\sup\{\mathbb{E}\eta: \eta \text{ is a simple random variable},\ 0\leq \eta\leq \xi\}.\]
If \(\mathbb{E}\xi<\infty\), we say \(\xi\) is integrable (L). If \(A\) is an event, then we use \(\mathbb{E}(\xi; A)\) to denote \(\mathbb{E}(\xi\cdot 1_A)\), the mathematical expectation of \(\xi\) restricted to the event \(A\).
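One concrete way to approach this supremum is the dyadic staircase \(\eta_n = 2^{-n}\lfloor 2^n \xi\rfloor\), which is simple and increases to \(\xi\). A sketch on the geometric probability space with \(\xi(\omega)=\omega^2\), so that \(\mathbb{E}\xi=\frac{1}{3}\) (the example and its mass formula are assumptions for illustration):

```python
from math import sqrt

def E_simple_approx(n):
    """E of the staircase eta_n = floor(2**n * xi) / 2**n for xi(w) = w**2
    on ([0,1], B, Lebesgue), where P(eta_n = k/2**n) is the length of
    {w : k/2**n <= w**2 < (k+1)/2**n} = sqrt((k+1)/2**n) - sqrt(k/2**n)."""
    N = 2 ** n
    return sum((k / N) * (sqrt((k + 1) / N) - sqrt(k / N)) for k in range(N))

for n in (2, 4, 8, 12):
    print(n, E_simple_approx(n))   # increases toward E(xi) = 1/3
```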