Elementary Probability Theory¶
Probability Space¶
Definitions of \(\sigma\)-algebra
A family \(\mathscr{F}\) of subsets of a non-empty set \(\Omega\) is called a \(\pmb{\sigma}\)-algebra on \(\Omega\) if it satisfies
(i) \(\varnothing, \Omega\in \mathscr{F}\),
(ii) If \(A\in \mathscr{F}\), then \(A^c\in \mathscr{F}\),
(iii) If a sequence of sets \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then \(\bigcup\limits_{n=1}^\infty A_n\in \mathscr{F}\).
It is easy to see that for a set \(\Omega\), the smallest \(\sigma\)-algebra is \(\mathscr{F}=\{\varnothing, \Omega\}\), and the largest one consists of all subsets of \(\Omega\) (the power set), denoted by \(2^\Omega\).
Properties of \(\sigma\)-algebra
Assume \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\), then
(i) If \(A,B\in \mathscr{F}\), then \(A\cap B, A\cup B, A-B \in \mathscr{F}\),
(ii) If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then \(\bigcap\limits_{n=1}^\infty A_n\in \mathscr{F}\).
(i) \(A\cup B=A\cup B\cup \varnothing\cup \cdots \cup \varnothing\in \mathscr{F}\). \(A\cap B=(A^c\cup B^c)^c\in \mathscr{F}\). \(A-B=A\cap B^c\in \mathscr{F}\).
(ii) Use De Morgan's formula, i.e. \(\bigcap\limits_{n=1}^\infty A_n=\left(\bigcup\limits_{n=1}^\infty A_n^c\right)^c \in \mathscr{F}\).
Example. (Discrete) Assume a sequence of events \(\{\Omega_n\}_{n\geq 1}\) is a partition of \(\Omega\), then
$$\mathscr{A}=\left\{\bigcup_{n\in I}\Omega_n : I\subset \mathbb{N}\right\}$$
is a sub-\(\sigma\)-algebra of \(\mathscr{F}\).
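To make the discrete example concrete, here is a minimal sketch (not from the original notes; the sample space and partition are made up for illustration) that enumerates all unions of the blocks of a finite partition and checks the \(\sigma\)-algebra axioms directly.

```python
from itertools import combinations

# Hypothetical finite sample space and a partition of it (illustrative only).
omega = frozenset(range(6))
partition = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]

# The sigma-algebra generated by a partition consists of all unions of its blocks.
sigma_algebra = set()
for r in range(len(partition) + 1):
    for blocks in combinations(partition, r):
        union = frozenset().union(*blocks) if blocks else frozenset()
        sigma_algebra.add(union)

# Check the axioms: contains the empty set and omega, closed under complement and union.
assert frozenset() in sigma_algebra and omega in sigma_algebra
assert all(omega - A in sigma_algebra for A in sigma_algebra)
assert all(A | B in sigma_algebra for A in sigma_algebra for B in sigma_algebra)
print(f"{len(sigma_algebra)} sets, as expected 2**{len(partition)} = {2 ** len(partition)}")
```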
Kolmogorov: Definition of probability measure
Assume \(\Omega\) is a sample space, \(\mathscr{F}\) is a \(\sigma\)-algebra on \(\Omega\). A function \(\mathbb{P}\) on \(\mathscr{F}\) is called probability measure, if it satisfies
(i) Non-negativity. \(\forall A\in \mathscr{F}\), \(\mathbb{P}(A)\geq 0\).
(ii) Normalization. \(\mathbb{P}(\Omega)=1\).
(iii) Sigma-Additivity. Assume \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), \(A_i\cap A_j=\varnothing, \forall i\neq j\), then
$$\mathbb{P}\left(\bigcup_{n=1}^\infty A_n\right)=\sum_{n=1}^\infty \mathbb{P}(A_n).$$
In this case, \(\mathbb{P}(A)\) is called probability that event \(A\) happens.
We combine the above \(\Omega\), \(\mathscr{F}\) and \(\mathbb{P}\) into the triple \((\Omega, \mathscr{F}, \mathbb{P})\), which is called a Probability Space.
Because of sigma-additivity, we can use all the tools and results from measure theory; indeed, apart from normalization, the definition of a probability measure is the same as that of a measure. However, probability theory has some unique phenomena and methods of its own.
Regarding the \(\sigma\)-algebra, we cannot always choose the power set, because \(\Omega\) might have non-denumerably many elements.
Properties of Probability Measure
(i) \(\mathbb{P}(\varnothing)=0\).
(ii) If \(A,B\in \mathscr{F}\), \(A\cap B=\varnothing\), then \(\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)\).
(iii) If \(A,B\in \mathscr{F}\), \(A\subset B\), then \(\mathbb{P}(B-A)=\mathbb{P}(B)-\mathbb{P}(A)\), so \(\mathbb{P}(A)\leq \mathbb{P}(B)\).
(iv) \(\mathbb{P}(A^c)=1-\mathbb{P}(A)\).
(v) Sub-sigma-additivity. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\), then
$$\mathbb{P}\left(\bigcup_{n=1}^\infty A_n\right)\leq \sum_{n=1}^\infty \mathbb{P}(A_n).$$
(vi) Inferior continuity. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) and the sequence monotonically increases, then
$$\lim_{n\rightarrow\infty}\mathbb{P}(A_n)=\mathbb{P}\left(\bigcup_{n=1}^\infty A_n\right).$$
(vii) Superior continuity. If \(\{A_n\}_{n\geq 1}\in \mathscr{F}\) and the sequence monotonically decreases, then
$$\lim_{n\rightarrow\infty}\mathbb{P}(A_n)=\mathbb{P}\left(\bigcap_{n=1}^\infty A_n\right).$$
Note that a probability cannot be assigned arbitrarily, because it is a function defined on a \(\sigma\)-algebra. When \(\Omega\) has denumerably many elements, a probability space is easy to construct.
Discrete Probability Space
(i) \((\Omega,\{\varnothing, A, A^c, \Omega\},\mathbb{P})\) is called a Bernoulli probability space, if
$$\mathbb{P}(A)=p,\qquad \mathbb{P}(A^c)=1-p,\qquad p\in[0,1].$$
(ii) As in the Example above, with the partition \(\{\Omega_n\}_{n\geq 1}\) and weights \(p_n\geq 0\), \(\sum_n p_n=1\), we have \(\mathbb{P}\) defined by
$$\mathbb{P}\left(\bigcup_{n\in I}\Omega_n\right)=\sum_{n\in I}p_n,\qquad I\subset\mathbb{N};$$
then \((\Omega, \mathscr{A}, \mathbb{P})\) is a probability space, also called a Discrete Probability Space.
Example. Assume \(\Omega\) has denumerably many elements. Choose its power set as \(\mathscr{F}\). For every sample point \(\omega\in\Omega\), assign a value \(p(\omega)\) via a function \(p:\Omega\rightarrow \mathbb{R}\) satisfying \(p\geq 0\) and \(\sum\limits_{\omega\in \Omega}p(\omega)=1\), and for all \(A\subset \Omega\) define
$$\mathbb{P}(A)=\sum_{\omega\in A}p(\omega).$$
Show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. If \(|\Omega|<\infty\) and \(p(\omega)=\frac{1}{|\Omega|}\) for every \(\omega\), then
$$\mathbb{P}(A)=\frac{|A|}{|\Omega|},$$
which is the classical model.
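A minimal sketch of this construction (the die and the event below are made up for illustration): \(\mathbb{P}(A)\) is computed as \(\sum_{\omega\in A}p(\omega)\), and with equal weights it reduces to the classical \(|A|/|\Omega|\).

```python
from fractions import Fraction

# Hypothetical finite example: a fair six-sided die with p(omega) = 1/6.
omega = range(1, 7)
p = {w: Fraction(1, 6) for w in omega}          # weights summing to 1

def prob(A):
    """P(A) = sum of p(omega) over omega in A."""
    return sum(p[w] for w in A)

A = {2, 4, 6}                                    # event "even outcome"
print(prob(A))                                   # 1/2
print(Fraction(len(A), len(omega)))              # classical model |A|/|Omega|, also 1/2
```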
If \(\Omega\) has non-denumerably many elements, then it is not easy to construct its \(\sigma\)-algebra explicitly.
Example. Assume \(\Omega=[0,1]\), and \(\mathscr{F}\) is the Borel \(\sigma\)-algebra \(\mathscr{B}([0,1])\). For \(A\in \mathscr{F}\), let \(\mathbb{P}(A)=|A|\), the Lebesgue measure of \(A\). Then show that \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, also called a Geometric Probability Space.
Random Variable¶
This part follows a path similar to that of measurable functions, but gives the specific definitions.
Simply speaking, a random variable is a measurable function on \(\Omega\), endowing basic events with numbers.
Definition of Random Variable
Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space. A function \(\xi:\Omega\rightarrow \mathbb{R}\) defined on \(\Omega\) is called a Random Variable, if \(\forall \alpha\in \mathbb{R}\)
$$\{\xi\leq \alpha\}:=\{\omega\in\Omega : \xi(\omega)\leq \alpha\}\in\mathscr{F}.$$
Note: for a random variable \(\xi\) and \(A\subset\mathbb{R}\), \(\{\xi\in A\}\) denotes the event \(\{\omega\in \Omega : \xi(\omega)\in A\}\). If \(\mathbb{P}(\xi\in A)=1\), then we say \(\xi\) is distributed on \(A\).
Readers can compare this definition with that of measurable functions and its equivalent forms. Here the measurability of \(\xi\) means that the information in the \(\sigma\)-algebra is enough to determine \(\xi\).
Properties of Random Variable
(i) The random variables on a \(\sigma\)-algebra form a linear space. The proof for addition is similar to that for measurable functions.
(i) We prove closure under addition. For random variables \(\xi, \eta\) and \(\forall \alpha\in \mathbb{R}\), we have
$$\{\xi+\eta>\alpha\}=\bigcup_{r\in\mathbb{Q}}\big(\{\xi>r\}\cap\{\eta>\alpha-r\}\big),$$
where the right-hand side is a denumerable union of sets in the \(\sigma\)-algebra, which still lies in \(\mathscr{F}\) by definition.
Similarly to measure theory, we have the characteristic (indicator) function. For \(A\subset \Omega\),
$$1_A(\omega)=\begin{cases}1, & \omega\in A,\\ 0, & \omega\notin A.\end{cases}$$
Example. A set \(A\subset \Omega\) has characteristic function \(1_A\); then \(1_A\) is a random variable if and only if \(A\in\mathscr{F}\).
Notice that we use \(\{\xi\leq \alpha\}\) rather than \(\{\xi>\alpha\}\), because the former has a more direct practical meaning.
Definitions of discrete random variables
If a random variable \(\xi\) is distributed on a set with denumerably many elements, i.e. the range of \(\xi\) is denumerable, then we call \(\xi\) a discrete random variable, and denote its range by \(R(\xi)\). If \(R(\xi)\) has finitely many elements, then we call \(\xi\) a simple random variable. In terms of form, we have
$$\xi=\sum_{x\in R(\xi)} x\,1_{\{\xi=x\}}.$$
Distribution Function¶
Definition of Distribution function
Assume \(\xi\) is a random variable; then \(\forall x\in \mathbb{R}\),
$$F_\xi(x):=\mathbb{P}(\xi\leq x)$$
is called the Distribution Function of \(\xi\).
Properties of Distribution Function
Assume \(F_\xi\) is a distribution function, then
(i) \(F_\xi\) monotonically increases,
(ii) \(F_\xi\) is right continuous,
(iii) \(\lim\limits_{x\rightarrow -\infty}F_\xi(x)=0,\quad \lim\limits_{x\rightarrow +\infty}F_\xi(x)=1\).
Example. (Bernoulli Distribution). A random experiment with only two outcomes is usually called a Bernoulli Experiment. Denote the success probability of the event \(A\) by \(\mathbb{P}(A)=p\) and its counterpart by \(q=1-p\); then the indicator of success \(\xi\) is a random variable with distribution
$$\mathbb{P}(\xi=1)=p,\qquad \mathbb{P}(\xi=0)=q.$$
The indicator \(1_A\) of an event \(A\) has a Bernoulli distribution, and any Bernoulli-distributed random variable is the indicator of some event.
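The following is an illustrative simulation sketch (the probability \(p\) and sample size are made-up parameters): it draws the indicator \(1_A\) many times and compares the empirical distribution function with the step function \(F\) of the Bernoulli law.

```python
import numpy as np

# A hypothetical Bernoulli experiment: the event A has probability p (chosen for illustration).
rng = np.random.default_rng(0)
p = 0.3
xi = (rng.random(10_000) < p).astype(int)   # indicator 1_A over 10000 independent repetitions

# Distribution function F(x) = P(xi <= x): 0 for x < 0, 1-p for 0 <= x < 1, 1 for x >= 1.
def F(x):
    return 0.0 if x < 0 else (1 - p if x < 1 else 1.0)

for x in (-0.5, 0.0, 0.5, 1.0):
    empirical = np.mean(xi <= x)            # empirical distribution function at x
    print(f"x={x:4.1f}  F(x)={F(x):.3f}  empirical={empirical:.3f}")
```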
We care more about distributions than about the random variables themselves. Only when defined on the same probability space can random variables \(\xi,\eta\) have a chance to be equal; when defined on different probability spaces, we can still compare them through their distribution functions, since both the values taken \((R(\xi))\) and the function values \((F_\xi(x))\) live on \(\mathbb{R}\).
Definition of Same distribution
Two random variables (possibly defined on different probability spaces) \(\xi\) and \(\eta\) are said to have the Same Distribution if their distribution functions are equal.
Conditional Probability¶
Definitions of Independence
Events \(A, B\) are said to be independent if
$$\mathbb{P}(A\cap B)=\mathbb{P}(A)\,\mathbb{P}(B).$$
A sequence of events \(\{A_n\}_{n\geq 1}\) is said to be mutually independent if for every finite collection of events \(\{A_{n_j}\}_{1\leq j\leq k}\),
$$\mathbb{P}\left(\bigcap_{j=1}^k A_{n_j}\right)=\prod_{j=1}^k \mathbb{P}(A_{n_j}).$$
Random variables \(\{\xi_i\}_{1\leq i\leq n}\) are said to be mutually independent if \(\forall x_i\in \mathbb{R}\ (1\leq i\leq n)\),
$$\mathbb{P}(\xi_1\leq x_1,\cdots,\xi_n\leq x_n)=\prod_{i=1}^n \mathbb{P}(\xi_i\leq x_i).$$
Definitions of Conditional Probability
Assume \((\Omega, \mathscr{F}, \mathbb{P})\) is a probability space, \(A,B\in \mathscr{F}\), and \(\mathbb{P}(A)>0\). The conditional probability of \(B\) given \(A\) is
$$\mathbb{P}(B|A):=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)},$$
which is a mapping \(B\mapsto \mathbb{P}(B|A)\), is a probability on \((\Omega, \mathscr{F})\), and also a probability on a shrinked space \((A,A\cap \mathscr{F})\), where \(A\cap \mathscr{F}\) is a \(\sigma\)-algebra on \(A\).
Properties of Conditional Probability
(i) Random variables \(\xi_1,\cdots,\xi_n\) are mutually independent iff \(\forall x_i\leq y_i, 1\leq i\leq n\),
$$\mathbb{P}\left(\bigcap_{i=1}^n\{x_i<\xi_i\leq y_i\}\right)=\prod_{i=1}^n\mathbb{P}(x_i<\xi_i\leq y_i).$$
(ii) If the random variables are discrete, then they are independent iff for all \(x_i\in R(\xi_i)\),
$$\mathbb{P}(\xi_1=x_1,\cdots,\xi_n=x_n)=\prod_{i=1}^n\mathbb{P}(\xi_i=x_i).$$
(iii) \(\mathbb{P}((C|B)|A)=\mathbb{P}(C|A\cap B)\).
Example. The claim that tossing a coin infinitely many times can be realized is equivalent to the claim that the uniform distribution can be realized.
- \(\Rightarrow\). Suppose a random variable \(\xi\) defined on a probability space \((\Omega, \mathscr{F},\mathbb{P})\) is uniformly distributed on \([0,1]\). Denote its \(n\)-th binary digit by \(\xi_n\), with \(\xi_n \in \{0,1\}\). The first \(n\) digits determine an interval of length \(\frac{1}{2^n}\) into which \(\xi\) falls, i.e. for any \(a_1,\cdots,a_n\in\{0,1\}\),
$$\mathbb{P}(\xi_1=a_1,\cdots,\xi_n=a_n)=\frac{1}{2^n}=\prod_{i=1}^n\mathbb{P}(\xi_i=a_i),$$
which means the \(\{\xi_n\}_{n\geq 1}\) are mutually independent, demonstrating that this describes the coin-toss problem.
- \(\Leftarrow\). Suppose random variables \(\{\xi_n\}_{n\geq 1}\), defined on \((\Omega, \mathscr{F},\mathbb{P})\), are mutually independent and take the values \(0\) and \(1\) with probability \(\frac{1}{2}\) each. Define
$$\xi:=\sum_{n=1}^\infty \frac{\xi_n}{2^n}.$$
Then \(\xi\) is given by a binary expansion, and \(\forall n\geq 1, 0\leq k\leq 2^n-1\), we have
$$\mathbb{P}\left(\frac{k}{2^n}<\xi\leq \frac{k+1}{2^n}\right)=\frac{1}{2^n},$$
meaning \(\xi\) is uniformly distributed.
\(\square\)
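A minimal simulation sketch of the "\(\Leftarrow\)" direction (the number of bits, sample size, and seed are illustrative choices): build \(\xi=\sum_n \xi_n 2^{-n}\) from independent fair bits and check that the dyadic intervals each receive roughly equal probability.

```python
import numpy as np

# Illustrative sketch: build xi from independent fair coin flips (truncated expansion).
rng = np.random.default_rng(1)
n_bits, n_samples = 20, 100_000
bits = rng.integers(0, 2, size=(n_samples, n_bits))          # xi_1, ..., xi_20 per sample
xi = bits @ (0.5 ** np.arange(1, n_bits + 1))                # xi = sum_n xi_n / 2^n

# Check approximate uniformity: each dyadic interval of length 1/2^3 should get ~ 1/8.
n = 3
counts, _ = np.histogram(xi, bins=2 ** n, range=(0.0, 1.0))
print(counts / n_samples)        # each entry should be close to 1/8 = 0.125
```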
Total Probability Formula
Assume events \(\{\Omega_n\}_{n\geq 1}\subset \mathscr{F}\) form a partition of \(\Omega\) with \(\mathbb{P}(\Omega_n)>0\); then \(\forall A\in \mathscr{F}\),
$$\mathbb{P}(A)=\sum_{n\geq 1}\mathbb{P}(\Omega_n)\,\mathbb{P}(A|\Omega_n).$$
Since \(A=A\cap \Omega=\bigcup_{n\geq 1}(A\cap\Omega_n)\) is a disjoint union, by sigma-additivity
$$\mathbb{P}(A)=\sum_{n\geq 1}\mathbb{P}(A\cap\Omega_n)=\sum_{n\geq 1}\mathbb{P}(\Omega_n)\,\mathbb{P}(A|\Omega_n).$$
\(\square\)
Definition of Bayes Formula
If \(A\) happens, then we can calculate the probability of each category \(\Omega_n\):
$$\mathbb{P}(\Omega_n|A)=\frac{\mathbb{P}(\Omega_n)\,\mathbb{P}(A|\Omega_n)}{\mathbb{P}(A)},$$
where \(\mathbb{P}(A)\) is calculated using the total probability formula. The above formula is called the Bayes Formula.
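A small numeric sketch of both formulas (all probabilities below are made-up illustrative numbers): a partition \(\Omega_1,\Omega_2,\Omega_3\) with prior probabilities and conditional probabilities \(\mathbb{P}(A|\Omega_n)\).

```python
# Illustrative numbers (not from the notes).
prior = {1: 0.5, 2: 0.3, 3: 0.2}              # P(Omega_n)
cond = {1: 0.01, 2: 0.02, 3: 0.05}            # P(A | Omega_n)

# Total probability formula: P(A) = sum_n P(Omega_n) P(A | Omega_n).
p_A = sum(prior[n] * cond[n] for n in prior)

# Bayes formula: P(Omega_n | A) = P(Omega_n) P(A | Omega_n) / P(A).
posterior = {n: prior[n] * cond[n] / p_A for n in prior}

print(f"P(A) = {p_A:.4f}")
print({n: round(q, 4) for n, q in posterior.items()})
```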
Mathematical Expectation¶
This part also follows a path similar to that of the Lebesgue integral.
Definitions of ME for simple RV
(i) Assume \(\xi\) is a simple random variable,
$$\xi=\sum_{i=1}^n x_i 1_{A_i},\qquad A_i\in\mathscr{F}\ \text{disjoint};$$
define its mathematical expectation as the weighted average
$$\mathbb{E}\xi:=\sum_{i=1}^n x_i\,\mathbb{P}(A_i).$$
Actually, this is the Lebesgue integral of a simple function.
Region irrelevance of ME for simple RV
If \(x_1,\cdots,x_n\in \mathbb{R}\) and \(\Omega_1,\cdots,\Omega_n\) form a finite partition of \(\Omega\), then
$$\mathbb{E}\left(\sum_{i=1}^n x_i 1_{\Omega_i}\right)=\sum_{i=1}^n x_i\,\mathbb{P}(\Omega_i),$$
i.e. the expectation does not depend on the particular representation of the simple random variable.
Note here \(\sum\limits_{i=1}^n\mathbb{P}(\Omega_i)=1\).
Use intersections to pass from a partition of the range to a partition of the domain of definition. Let \(\xi=\sum_i x_i 1_{\Omega_i}\); then
$$\mathbb{E}\xi=\sum_{y\in R(\xi)}y\,\mathbb{P}(\xi=y)=\sum_{y\in R(\xi)}y\sum_{i:\,x_i=y}\mathbb{P}(\Omega_i)=\sum_{i=1}^n x_i\,\mathbb{P}(\Omega_i).$$
\(\square\)
Properties of simple RV
Assume \(\xi,\eta\) are simple random variables,
(i) if \(\xi\geq 0\), then \(\mathbb{E}\xi\geq 0\).
(ii) Homogeneity. \(\forall a\in \mathbb{R}\), \(\mathbb{E}(a\xi)=a\mathbb{E}\xi\).
(iii) Linearity. \(\mathbb{E}(\xi+\eta)=\mathbb{E}(\xi)+\mathbb{E}(\eta)\).
(iv) Characteristic function. If \(A\in \mathscr{F}\), then \(\mathbb{E}1_A=\mathbb{P}(A)\).
(v) Zero a.s.. If \(\mathbb{P}(\xi\neq 0)=0\), then \(\mathbb{E}\xi=0\).
(vi) Monotonicity. If \(\xi\leq \eta\), then \(\mathbb{E}\xi\leq \mathbb{E}\eta\).
(vii) Independence property. If \(\xi\) and \(\eta\) are independent, then \(\mathbb{E}(\xi\cdot \eta)=\mathbb{E}\xi\cdot\mathbb{E}\eta\).
Corollary: using linearity
Choose arbitrary events \(\{A_k\}_{1\leq k\leq n}\) and real numbers \(\{x_k\}_{1\leq k\leq n}\); then
$$\mathbb{E}\left(\sum_{k=1}^n x_k 1_{A_k}\right)=\sum_{k=1}^n x_k\,\mathbb{P}(A_k).$$
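A minimal sketch of this corollary on a finite space (the die, the values \(x_k\) and the events \(A_k\) are made up for illustration): the expectation is computed once pointwise and once via \(\sum_k x_k\mathbb{P}(A_k)\), with overlapping events allowed.

```python
from fractions import Fraction

# Hypothetical finite probability space: a fair die, P({w}) = 1/6.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}

# A simple random variable xi = sum_k x_k 1_{A_k} with (possibly overlapping) events A_k.
terms = [(2, {1, 2, 3}), (5, {3, 4})]          # pairs (x_k, A_k), chosen for illustration

def xi(w):
    return sum(x for x, A in terms if w in A)

# E xi computed pointwise ...
e_pointwise = sum(xi(w) * P[w] for w in omega)
# ... and by the corollary  E xi = sum_k x_k P(A_k).
e_corollary = sum(x * sum(P[w] for w in A) for x, A in terms)

print(e_pointwise, e_corollary)                # both print 8/3
```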
Here comes the mathematical expectation of non-negative random variables.
Definition of ME for Non-negative random variables
Assume \(\xi\) is a non-negative random variable; define its Mathematical Expectation to be
$$\mathbb{E}\xi:=\sup\{\mathbb{E}\eta : \eta \text{ is a simple random variable},\ 0\leq \eta\leq \xi\}.$$
If \(\mathbb{E}\xi<\infty\), we call \(\xi\) is integrable(L). If \(A\) is an event, then we use \(\mathbb{E}(\xi; A)\) to denote \(\mathbb{E}(\xi\cdot 1_A)\), the ME of \(\xi\) limited on event \(A\).
Readers could compare the following contents with similar results in Lebesgue Integral.
Lévi Monotonic Convergence Theorem
(i) If random variables \(\xi,\eta\) satisfy \(0\leq \eta\leq \xi\), then \(\mathbb{E}\eta\leq\mathbb{E}\xi\).
(ii) If a sequence of non-negative random variables \(\{\xi_n\}_{n\geq 1}\) monotonically increases and converges to \(\xi\), then \(\lim\limits_{n\rightarrow\infty}\mathbb{E}\xi_n=\mathbb{E}\xi\).
(iii) A non-negative random variable can always be represented as the limit of an increasing sequence of non-negative simple random variables.
\(\square\)
We use \(\mathscr{F}_+\) to denote all the non-negative random variables on \((\Omega, \mathscr{F})\), then we know \(\mathbb{E}\) is a function defined on \(\mathscr{F}_+\).
Relationship between Probability and Mathematical Expectation
By definition of non-negative random variables and Lévi Theorem, we have the following results.
(i) Additivity of random variables. Assume \(\{\xi_n\}_{n\geq 1} \subset \mathscr{F}_+\) , then
(ii) Characteristic function. \(\forall A\in \mathscr{F}\), we have \(\mathbb{E}1_A=\mathbb{P}(A)\).
Conversely, if there exists a non-negative function \(\mathbb{E}\) defined on \(\mathscr{F}_+\) satisfying additivity and \(\mathbb{E}1=1\), then \(\forall A\in \mathscr{F}\) we can use \(\mathbb{E}\) to define a probability measure
$$\mathbb{P}(A):=\mathbb{E}1_A.$$
Definitions of ME for general random variables
For a general random variable \(\xi\), we separate it into positive and negative parts (both non-negative random variables)
$$\xi^+:=\max(\xi,0),\qquad \xi^-:=\max(-\xi,0).$$
Then \(\xi=\xi^+-\xi^-\), \(|\xi|=\xi^+ +\xi^-\). If \(\mathbb{E}|\xi|<\infty\), then we call \(\xi\) integrable. If \(\xi\) is integrable, then we can define its Mathematical Expectation
$$\mathbb{E}\xi:=\mathbb{E}\xi^+-\mathbb{E}\xi^-.$$
Properties of non-negative random variables
Almost the same as (i) (ii) (iii) in Properties of simple RV.
(iv) Independence. Assume \(\xi,\eta\) are independent. If they are non-negative, or if they are integrable and their product \(\xi\eta\) is also integrable, then \(\mathbb{E}\xi\eta=\mathbb{E}\xi\cdot \mathbb{E}\eta\).
Fatou Lemma
Assume \(\{\xi_n\}_{n\geq 1}\) are non-negative, then \(\mathbb{E}\liminf\limits_{n}\xi_n\leq \liminf\limits_{n}\mathbb{E}\xi_n\).
Lebesgue's Dominated Convergence Theorem
Assume \(\{\xi_n\}_{n\geq 1}\) satisfies \(\lim\limits_{n\rightarrow\infty}\xi_n=\xi\), and there exists an integrable non-negative random variable \(\eta\) such that \(|\xi_n|\leq \eta\) for all \(n\); then \(\mathbb{E}\xi=\lim\limits_{n\rightarrow \infty}\mathbb{E}\xi_n\).
Similar to convergence almost everywhere, we say that a statement holds almost surely, denoted by \(a.s.\). For example, \(\xi=0,\ a.s.\) means \(\mathbb{P}(\xi\neq 0)=0\).
Properties of general Random Variables
Assume \(\xi\) is a random variable on probability space \((\Omega,\mathscr{F},\mathbb{P})\),
(i) If \(\xi=0, a.s.\), then \(\xi\) is integrable, and \(\mathbb{E}\xi=0\);
(ii) If \(\mathbb{P}(A)=0\), then \(\mathbb{E}(\xi; A)=0\);
(iii) If \(\xi\) is integrable, then \(\xi=0, a.s.\) iff \(\forall A\in \mathscr{F}\), \(\mathbb{E}(\xi;A)=0\);
(iv) If \(\xi\) is non-negative, then \(\mathbb{E}\xi=0\) implies \(\xi=0, a.s.\)
Calculations¶
Theorem for calculating ME of discrete random variables
A discrete random variable \(\xi\) is integrable iff
$$\sum_{x\in R(\xi)}|x|\,\mathbb{P}(\xi=x)<\infty,$$
and in that case
$$\mathbb{E}\xi=\sum_{x\in R(\xi)}x\,\mathbb{P}(\xi=x).$$
Use the Lévi theorem. First assume \(\xi\geq 0\) and write \(R(\xi)=\{x_n\}_{n\geq 1}\); then \(\xi_n=\sum_{i=1}^n x_i 1_{\{\xi=x_i\}}\nearrow \xi\).
Corollary: ME for function of discrete random variable
Assume \(\xi\) is a discrete random variable and \(\phi:\mathbb{R}\rightarrow \mathbb{R}\). Then \(\phi(\xi)\) is integrable iff
$$\sum_{x\in R(\xi)}|\phi(x)|\,\mathbb{P}(\xi=x)<\infty,$$
and
$$\mathbb{E}\phi(\xi)=\sum_{x\in R(\xi)}\phi(x)\,\mathbb{P}(\xi=x).$$
It is easy to show that \(\phi(\xi)\) is also a discrete random variable, and \(\{\phi^{-1}(\{y\})\cap R(\xi): y\in R(\phi(\xi))\}\) is a partition of \(R(\xi)\). So for \(y\in R(\phi(\xi))\),
$$\mathbb{P}(\phi(\xi)=y)=\sum_{x\in R(\xi):\ \phi(x)=y}\mathbb{P}(\xi=x).$$
Since
$$\sum_{y\in R(\phi(\xi))}|y|\,\mathbb{P}(\phi(\xi)=y)=\sum_{x\in R(\xi)}|\phi(x)|\,\mathbb{P}(\xi=x),$$
the same computation without the absolute values gives the stated formula.
\(\square\)
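A minimal sketch of the corollary (the distribution of \(\xi\) and the function \(\phi\) are made up for illustration): \(\mathbb{E}\phi(\xi)\) is computed once over the values of \(\xi\) and once by grouping \(\mathbb{P}(\phi(\xi)=y)\) over the values of \(\phi(\xi)\).

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical discrete distribution of xi (values and probabilities chosen for illustration).
dist = {-2: Fraction(1, 4), -1: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4)}

def phi(x):
    return x * x

# E phi(xi) summed over the values of xi ...
lhs = sum(phi(x) * p for x, p in dist.items())

# ... equals the sum over the values y of phi(xi), grouping P(phi(xi) = y).
grouped = defaultdict(Fraction)
for x, p in dist.items():
    grouped[phi(x)] += p
rhs = sum(y * p for y, p in grouped.items())

print(lhs, rhs)     # both 5/2
```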
Theorem for calculation of general random variables
Assume \(\xi\) is a random variable and \(F\) is its distribution function. If \(\phi\) is a non-negative continuous or bounded continuous function, then
$$\mathbb{E}\phi(\xi)=\int_{\mathbb{R}}\phi(x)\,\mathrm{d}F(x).$$
Assume \(\phi\geq 0\). We first prove the result on a finite interval \([a,b]\).
(In fact, the argument only requires \(\phi\) to be measurable.) Assume \(\Delta: a=x_0<\cdots<x_n=b\) is a partition of \([a,b]\), let \(m_i=\inf\limits_{x\in(x_{i-1},x_{i}]}\phi\), \(g^\Delta:=\sum\limits_{i=1}^n m_i 1_{(x_{i-1},x_i]}\), \(\lambda=\max\limits_{1\leq i\leq n}{|x_{i}-x_{i-1}|}\). So
$$g^\Delta\leq \phi\ \text{on}\ (a,b],\qquad g^\Delta\rightarrow\phi\ \text{as}\ \lambda\rightarrow 0,$$
so
$$\mathbb{E}\,g^\Delta(\xi)1_{(a,b]}(\xi)=\sum_{i=1}^n m_i\big(F(x_i)-F(x_{i-1})\big).$$
Since \(g^\Delta (\xi)\) is a simple random variable, its limit is also a random variable, and letting \(\lambda\rightarrow 0\),
$$\mathbb{E}\,\phi(\xi)1_{(a,b]}(\xi)=\int_a^b\phi(x)\,\mathrm{d}F(x).$$
Then for \(\mathbb{R}\), we use \(\phi(\xi)\cdot 1_{(-n,n]}(\xi)\nearrow \phi(\xi)\), and let \(n\rightarrow \infty\).
\(\square\)
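A numerical sketch of the Stieltjes sums used in the proof (the exponential law, \(\phi(x)=x^2\), and the grid are illustrative choices, not from the notes): the sum \(\sum_i m_i\,(F(x_i)-F(x_{i-1}))\) approximates \(\mathbb{E}\phi(\xi)\).

```python
import numpy as np

# Illustrative: xi ~ exponential(1), so F(x) = 1 - exp(-x) on [0, inf), and phi(x) = x^2.
def F(x):
    # distribution function of the exponential(1) law, valid for x >= 0
    return 1.0 - np.exp(-x)

def phi(x):
    return x ** 2

a, b, n = 0.0, 30.0, 200_000                  # [a, b] large enough to capture most mass
grid = np.linspace(a, b, n + 1)
m = phi(grid[:-1])                            # phi is increasing on [0, b], so inf over (x_{i-1}, x_i]
approx = np.sum(m * np.diff(F(grid)))         # sum_i m_i (F(x_i) - F(x_{i-1}))

print(approx)                                 # close to E xi^2 = 2 for the exponential(1) law
```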
Variance¶
The following inequality is essentially the same as the one discussed in real analysis.
Chebyshev Inequality (Markov Inequality)
Assume \(\xi\) is a random variable and \(\alpha>0\); then \(\forall m>0\), we have
$$\mathbb{P}(|\xi|\geq m)\leq \frac{\mathbb{E}|\xi|^\alpha}{m^\alpha}.$$
By definition, \(\mathbb{E}|\xi|^\alpha\geq \mathbb{E}\big(|\xi|^\alpha;|\xi|\geq m\big)\geq m^\alpha\,\mathbb{P}(|\xi|\geq m)\).
Cauchy-Schwarz Inequality
Assume \(\xi,\eta\) are random variables; then
$$\big(\mathbb{E}|\xi\eta|\big)^2\leq \mathbb{E}\xi^2\cdot \mathbb{E}\eta^2.$$
Use the standard method, i.e. a quadratic function of an auxiliary real variable and its discriminant.
If \(\mathbb{E}\xi^2<\infty\), then \(\xi\) is square integrable, which implies \(\xi\) is integrable. In this case we define its Variance as
$$D\xi:=\mathbb{E}(\xi-\mathbb{E}\xi)^2.$$
Properties of Variance
(i) \(D\xi\geq 0\), and \(D\xi=0\) iff \(\xi=C, a.s.\) for some constant \(C\).
(ii) Calculation. \(D\xi=\mathbb{E}\xi^2-(\mathbb{E}\xi)^2\).
(iii) \(D(a\xi)=a^2D\xi\).
(iv) If \(\xi,\eta\) are independent, then \(D(\xi+\eta)=D\xi+D\eta\).
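A Monte Carlo sketch of properties (ii)-(iv) (the exponential and uniform distributions, sample size, and seed are illustrative choices): the simulation only approximates the identities up to sampling error.

```python
import numpy as np

# Monte Carlo sketch of the variance properties, with illustrative distributions.
rng = np.random.default_rng(2)
n = 1_000_000
xi = rng.exponential(scale=2.0, size=n)        # D xi = 4
eta = rng.uniform(0.0, 1.0, size=n)            # D eta = 1/12, independent of xi

def var(z):
    # (ii) D z = E z^2 - (E z)^2
    return np.mean(z ** 2) - np.mean(z) ** 2

print(var(xi), np.var(xi))                     # two routes to the same number
print(var(3 * xi), 9 * var(xi))                # (iii) D(a xi) = a^2 D xi
print(var(xi + eta), var(xi) + var(eta))       # (iv) independence => additivity
```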
Bernoulli's Law of large numbers¶
We could now discuss the relationship between probability and frequency.
Bernoulli's Law of large numbers
Assume we carry out Bernoulli experiments with success probability \(p\), and use \(\xi_n\) to denote the number of successes within the first \(n\) experiments. Then \(\frac{\xi_n}{n}\), the frequency of success in the first \(n\) experiments, is also a random variable, and \(\forall \varepsilon>0\),
$$\lim_{n\rightarrow\infty}\mathbb{P}\left(\left|\frac{\xi_n}{n}-p\right|\geq \varepsilon\right)=0,$$
which means that \(\eta_n=\frac{\xi_n}{n}\) converges to \(p\) in the sense of probability.
Use the Chebyshev inequality.
Choose \(\alpha=2\); by \(\mathbb{E}\xi_n=np\) and \(\mathbb{E}\xi_n^2=np(1-p)+(np)^2\),
$$\mathbb{P}\left(\left|\frac{\xi_n}{n}-p\right|\geq \varepsilon\right)\leq \frac{\mathbb{E}(\xi_n-np)^2}{n^2\varepsilon^2}=\frac{p(1-p)}{n\varepsilon^2}\rightarrow 0.$$
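A simulation sketch of the law and of the Chebyshev bound from the proof (the success probability, tolerance, and number of trials are made-up illustrative parameters).

```python
import numpy as np

# Simulation sketch of Bernoulli's law of large numbers (parameters are illustrative).
rng = np.random.default_rng(3)
p, eps, trials = 0.3, 0.05, 5_000

for n in (100, 1_000, 10_000):
    freq = rng.binomial(n, p, size=trials) / n            # xi_n / n over many experiments
    prob = np.mean(np.abs(freq - p) >= eps)               # empirical P(|xi_n/n - p| >= eps)
    bound = p * (1 - p) / (n * eps ** 2)                  # Chebyshev bound from the proof
    print(f"n={n:6d}  P(|freq-p|>=eps) ~ {prob:.4f}  Chebyshev bound {bound:.4f}")
```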
The total probability formula has an expectation form.
Conditional Mathematical Expectation
Assume an integrable random variable \(\xi\) is defined on a probability space \((\Omega,\mathscr{F}, \mathbb{P})\). For an arbitrary event \(A\) with positive probability, define the Conditional Mathematical Expectation
$$\mathbb{E}(\xi|A):=\frac{\mathbb{E}(\xi; A)}{\mathbb{P}(A)}=\frac{\mathbb{E}(\xi\cdot 1_A)}{\mathbb{P}(A)}.$$
So with the above definition, we have an extension of total probability formula.
Expectation form of Total probability formula
Assume \(\{\Omega_n\}_{n\geq 1}\subset \mathscr{F}\) is a partition of \(\Omega\) with \(\mathbb{P}(\Omega_n)>0\); then for every integrable random variable \(\xi\), we have
$$\mathbb{E}\xi=\sum_{n\geq 1}\mathbb{P}(\Omega_n)\,\mathbb{E}(\xi|\Omega_n).$$
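A minimal sketch on a finite space (the die, the partition into odd/even outcomes, and the variable \(\xi(\omega)=\omega^2\) are all made up for illustration): the overall expectation is recovered from the conditional expectations.

```python
from fractions import Fraction

# Illustrative finite space: a fair die, partitioned into "odd" and "even" outcomes.
P = {w: Fraction(1, 6) for w in range(1, 7)}
partition = [{1, 3, 5}, {2, 4, 6}]

def xi(w):                       # an arbitrary random variable, chosen for illustration
    return w * w

expect = sum(xi(w) * P[w] for w in P)

# E(xi | Omega_n) = E(xi; Omega_n) / P(Omega_n), recombined with the weights P(Omega_n).
total = Fraction(0)
for block in partition:
    p_block = sum(P[w] for w in block)
    cond_expect = sum(xi(w) * P[w] for w in block) / p_block
    total += p_block * cond_expect

print(expect, total)             # both 91/6
```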
Continuous Random Variables¶
Measurability¶
In a discrete sample space \(\Omega_d\) we can choose the power set as the \(\sigma\)-algebra, but in a continuous sample space \(\Omega_c\) it has been proved that choosing the power set leads to contradictions. In fact, apart from the smallest and the largest families of sets, it is usually not easy to write down a \(\sigma\)-algebra directly; instead, it is often generated indirectly.
Properties of \(\sigma\)-algebra
Assume \(\{\mathscr{F}_\lambda\}_{\lambda\in\Lambda}\) are \(\sigma\)-algebras on \(\Omega\); then
$$\bigcap_{\lambda\in\Lambda}\mathscr{F}_\lambda$$
is also a \(\sigma\)-algebra.
Definition of \(\sigma\)-algebra generated
For a family \(\mathscr{A}\) of subsets of \(\Omega\), define the \(\sigma\)-algebra generated by \(\mathscr{A}\) to be the intersection of all \(\sigma\)-algebras on \(\Omega\) which contain \(\mathscr{A}\), denoted by
$$\sigma(\mathscr{A}):=\bigcap\{\mathscr{F}: \mathscr{F}\ \text{is a}\ \sigma\text{-algebra on}\ \Omega,\ \mathscr{A}\subset\mathscr{F}\}.$$
Example. Returning to the Discrete example, we can write out the generated \(\sigma\)-algebra
$$\sigma(\{\Omega_n\}_{n\geq 1})=\left\{\bigcup_{n\in I}\Omega_n : I\subset\mathbb{N}\right\}.$$
To go through the following theorem, we have to introduce some properties of the inverse image. Assume \(f:X\rightarrow Y\) is a mapping, where \(X\) and \(Y\) are the domain of definition and the range. If \(B\subset Y\), define the inverse image
$$f^{-1}(B):=\{x\in X: f(x)\in B\},$$
which is also denoted by \(\{f\in B\}\).
Properties of Inverse Image
Assume \(f:X\rightarrow Y\) is a mapping, then
(i) \(f^{-1}(\varnothing)=\varnothing\), \(f^{-1}(Y)=X\).
(ii) If \(B\subset Y\), then \(f^{-1}(B^c)=f^{-1}(B)^c\).
(iii) If each \(B_\lambda\subset Y\), \(\lambda\in\Lambda\), then
$$f^{-1}\left(\bigcup_{\lambda\in\Lambda}B_\lambda\right)=\bigcup_{\lambda\in\Lambda}f^{-1}(B_\lambda),\qquad f^{-1}\left(\bigcap_{\lambda\in\Lambda}B_\lambda\right)=\bigcap_{\lambda\in\Lambda}f^{-1}(B_\lambda).$$
With the above properties, we have the following theorem.
Theorem for high-dimension \(\sigma\)-Algebra
Assume \(\xi:\Omega\rightarrow \mathbb{R}^n\) and \(\mathscr{B}_0\) is a family of subsets of \(\mathbb{R}^n\). Define
$$\xi^{-1}(\mathscr{B}_0):=\{\xi^{-1}(B): B\in\mathscr{B}_0\}$$
to be the inverse image of the family \(\mathscr{B}_0\). If \(\mathscr{B}_0\) is a \(\sigma\)-algebra, then \(\xi^{-1}(\mathscr{B}_0)\) is also a \(\sigma\)-algebra.
Use the properties of inverse images to transfer the set operations on \(\Omega\) back to set operations on \(\mathbb{R}^n\).
Sufficient and necessary condition for random variable
Assume \((\Omega,\mathscr{F}, \mathbb{P})\) is a probability space; then \(\xi\) defined on \(\Omega\) is a random variable iff
$$\xi^{-1}(\mathscr{B})\subset\mathscr{F},\quad\text{i.e.}\quad \{\xi\in B\}\in\mathscr{F}\ \ \forall B\in\mathscr{B}(\mathbb{R}).$$
- "\(\Leftarrow\)".
Just check the condition for \(\xi\) to be a random variable.
- "\(\Rightarrow\)".
Denote by \(\mathscr{B}'\) the collection of all sets \(B\subset \mathbb{R}\) satisfying \(\{\xi\in B\}\in \mathscr{F}\), i.e.
$$\mathscr{B}':=\{B\subset\mathbb{R}: \xi^{-1}(B)\in\mathscr{F}\}.$$
Assume \(\xi\) is a random variable; then by its definition, we have
$$(-\infty,\alpha]\in\mathscr{B}'\qquad \forall \alpha\in\mathbb{R}.$$
If \(\mathscr{B}'\) is a \(\sigma\)-algebra, then, since the Borel algebra \(\mathscr{B}\) is generated by the intervals \((-\infty,\alpha]\), \(\mathscr{B}'\) is a \(\sigma\)-algebra containing \(\mathscr{B}\), and hence \(\xi^{-1}(\mathscr{B})\subset \mathscr{F}\).
Check that
(i) total space \(\mathbb{R}\in\mathscr{B}'\), since \(\xi^{-1}(\mathbb{R})=\Omega\in \mathscr{F}\), so here \(B=\mathbb{R}\in \mathscr{B}'\).
(ii) Assume \(A\in \mathscr{B}'\); then by the definition of \(\mathscr{B}'\) we have \(\xi^{-1}(A)\in \mathscr{F}\), and by the operations of inverse images,
$$\xi^{-1}(A^c)=\big(\xi^{-1}(A)\big)^c\in\mathscr{F}.$$
So here \(B=A^c\in \mathscr{B}'\).
(iii) Assume \(\{A_n\}_{n\geq 1}\subset \mathscr{B}'\); then \(\xi^{-1}(A_n)\in \mathscr{F}\), and by the operations of inverse images,
$$\xi^{-1}\left(\bigcup_n A_n\right)=\bigcup_n \xi^{-1}(A_n)\in\mathscr{F}.$$
So here \(B=\bigcup_n A_n\in \mathscr{B}'\).
Combining (i), (ii) and (iii), \(\mathscr{B}'\) is a \(\sigma\)-algebra.
From the above deduction, we see that \(\xi\) is a random variable iff for some family of sets \(\mathscr{A}\) with \(\sigma(\mathscr{A})=\mathscr{B}\), we have
$$\xi^{-1}(\mathscr{A})\subset\mathscr{F}.$$
So if \(\xi\) is a random variable with respect to a \(\sigma\)-algebra \(\mathscr{A}\), then it is a random variable with respect to any \(\sigma\)-algebra \(\mathscr{A}'\supset \mathscr{A}\). From this, we see that there exists a smallest \(\sigma\)-algebra, denoted by \(\sigma(\xi)\), such that \(\xi\) is a random variable with respect to \(\sigma(\xi)\). It is easy to see that
$$\sigma(\xi)=\xi^{-1}(\mathscr{B})=\{\xi^{-1}(B): B\in\mathscr{B}\}.$$
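On a finite sample space this can be made very concrete; the following sketch (the sample space and the simple random variable are made up for illustration) enumerates \(\sigma(\xi)\) as all unions of the level sets \(\{\xi=x\}\), i.e. all sets of the form \(\xi^{-1}(B)\).

```python
from itertools import combinations

# Illustrative: sigma(xi) on a finite Omega is generated by the level sets {xi = x}.
omega = frozenset(range(6))
xi = {0: 1.0, 1: 1.0, 2: 2.5, 3: 2.5, 4: 2.5, 5: 7.0}     # a simple random variable

# Level sets {xi = x} form a partition of Omega ...
levels = {}
for w, x in xi.items():
    levels.setdefault(x, set()).add(w)
blocks = [frozenset(block) for block in levels.values()]

# ... and sigma(xi) consists of all unions of level sets, i.e. all xi^{-1}(B).
sigma_xi = set()
for r in range(len(blocks) + 1):
    for chosen in combinations(blocks, r):
        sigma_xi.add(frozenset().union(*chosen) if chosen else frozenset())

print(sorted(sorted(A) for A in sigma_xi))      # 2^3 = 8 sets
```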
Definition of Borel measurable function
For the special case \(\Omega=\mathbb{R}\), we call \(\xi=f\) a Borel measurable function if \(\forall x\in \mathbb{R}\),
$$\{f\leq x\}=f^{-1}\big((-\infty,x]\big)\in\mathscr{B}(\mathbb{R}).$$
It is natural to have the following properties.
Properties of Borel measurable function
(i) Every \(f\in C(\mathbb{R})\) is a Borel measurable function.
(ii) Assume \(\xi\) is a random variable, \(f\) is a Borel measurable function, then \(f(\xi)\) is a random variable.
Achievement of Distribution function¶
With the help of Borel algebra, we have the following extra properties of distribution function.
Properties of Distribution Function
(i) \(\forall a<b\), \(\mathbb{P}(\xi\in (a,b])=F(b)-F(a)\).
(ii) \(\forall x\in \mathbb{R}\), \(\mathbb{P}(\xi=x)=F(x)-F(x^-)\).
(iii) \(\forall x\in\mathbb{R}\), \(\mathbb{P}(\xi>x)=1-F(x)\).
(iv) Two distribution functions satisfy \(F(x)=G(x)\) for all \(x\in \mathbb{R}\) iff they agree on a dense subset of \(\mathbb{R}\).
Theorem for achievement of DF
An arbitrary distribution function \(F\) on \(\mathbb{R}\) can be realized, i.e. there exist a probability space and a random variable on it whose distribution function is \(F\).
Use the generalized inverse \(F^{-1}(u):=\inf\{x: F(x)\geq u\}\) applied to a uniform random variable. This proof is not applicable to multi-dimensional distribution functions.
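A minimal sketch of the generalized-inverse construction (the exponential law, sample size, and seed are illustrative choices): \(\xi=F^{-1}(U)\) with \(U\) uniform on \((0,1)\) has distribution function \(F\).

```python
import numpy as np

# Sketch: xi = F^{-1}(U) with U uniform on (0, 1) has distribution function F.
# Here F is the exponential(1) law, F(x) = 1 - exp(-x), chosen for illustration.
rng = np.random.default_rng(4)
u = rng.uniform(size=100_000)
xi = -np.log(1.0 - u)                 # generalized inverse of F

for x in (0.5, 1.0, 2.0):
    print(f"x={x}:  F(x)={1 - np.exp(-x):.4f}  empirical={np.mean(xi <= x):.4f}")
```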
Density Function¶
Continuous DF & Density Function
A distribution function \(F\) is said to be continuous if there exists a non-negative Lebesgue integrable function \(f\) such that \(\forall x\in \mathbb{R}\),
$$F(x)=\int_{-\infty}^x f(t)\,\mathrm{d}t,$$
and \(f\) is called the Density Function. In this case, we also call the corresponding random variable \(\xi\) continuous.
Readers can see that \(F\) is in fact absolutely continuous, by the properties of the indefinite integral. The density function is not unique, since two densities may differ on a set of measure zero.
Calculation for ME of Continuous distribution
Assume \(F\) is the continuous distribution function of a random variable \(\xi\) with density function \(f\); then for a non-negative continuous or bounded continuous function \(\phi\),
$$\mathbb{E}\phi(\xi)=\int_{\mathbb{R}}\phi(x)f(x)\,\mathrm{d}x.$$
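A numerical sketch of this formula (the standard normal density and \(\phi(x)=x^2\) are illustrative choices, so the exact value is \(1\)): the integral \(\int \phi f\,\mathrm{d}x\) is approximated by a Riemann sum and compared with a Monte Carlo estimate.

```python
import numpy as np

# Sketch: E phi(xi) = integral of phi(x) f(x) dx for the standard normal density.
def f(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def phi(x):
    return x ** 2

x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
integral = np.sum(phi(x) * f(x)) * dx                 # Riemann sum of phi * f

rng = np.random.default_rng(5)
monte_carlo = np.mean(phi(rng.standard_normal(1_000_000)))

print(integral, monte_carlo)                          # both close to 1
```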
Random Vector¶
Definition of joint distribution function
The joint distribution function of a random vector \(X=(\xi_1,\cdots,\xi_n)\) is defined by
$$F_X(x_1,\cdots,x_n):=\mathbb{P}(\xi_1\leq x_1,\cdots,\xi_n\leq x_n);$$
if \(\phi\) is a continuous function on \(\mathbb{R}^n\) and \(\phi(X)\) is integrable, then
$$\mathbb{E}\phi(X)=\int_{\mathbb{R}^n}\phi(x_1,\cdots,x_n)\,\mathrm{d}F_X(x_1,\cdots,x_n).$$
Properties of joint distribution function
(i) It is easy to see that the marginal distribution function of one random variable is obtained by sending the other arguments to \(+\infty\):
$$F_{\xi_i}(x)=\lim_{x_j\rightarrow+\infty,\ j\neq i}F_X(x_1,\cdots,x_n).$$
(ii) If \(\xi_1,\cdots,\xi_n\) are mutually independent, then
$$F_X(x_1,\cdots,x_n)=\prod_{i=1}^n F_{\xi_i}(x_i).$$
Covariance¶
Definition of Covariance
Assume \(\xi, \eta\) are two square integrable random variables; their Covariance is defined by
$$\text{cov}(\xi,\eta):=\mathbb{E}\big[(\xi-\mathbb{E}\xi)(\eta-\mathbb{E}\eta)\big]=\mathbb{E}(\xi\eta)-\mathbb{E}\xi\,\mathbb{E}\eta.$$
Properties of Covariance
Assume \(\xi, \eta\) are two random variables, then
(i) \(\text{cov}(\xi,\xi)=D\xi\geq 0\);
(ii) \(\text{cov}(\xi,\eta)=\text{cov}(\eta,\xi)\);
(iii) Linearity. \(\forall c_1,c_2\in \mathbb{R}\),
$$\text{cov}(c_1\xi_1+c_2\xi_2,\eta)=c_1\,\text{cov}(\xi_1,\eta)+c_2\,\text{cov}(\xi_2,\eta).$$
Corresponding to covariance, we have its normalized quantity correlation coefficient.
Definition of Correlation Coefficient
Assume \(\xi,\eta\) are two random variables; their Correlation Coefficient is defined by
$$\rho(\xi,\eta):=\frac{\text{cov}(\xi,\eta)}{\sqrt{D\xi\cdot D\eta}};$$
when the denominator equals \(0\), it is conventionally taken to be \(1\).
Properties of Correlation Coefficient
(i) \(|\rho(\xi,\eta)|\leq 1\).
(ii) \(|\rho(\xi,\eta)|=1\) iff \(\xi, \eta\) are linearly dependent, i.e. \(\exists a,b,c\in \mathbb{R}\), \(a,b\neq 0\), such that
$$a\xi+b\eta=c,\quad a.s.$$
Definition of Covariance Matrix
Assume \(X=(\xi_1,\cdots,\xi_n)\) and \(Y=(\eta_1,\cdots,\eta_m)\) are random vectors; then define their Covariance Matrix to be
$$\text{cov}(X,Y):=\big(\text{cov}(\xi_i,\eta_j)\big)_{1\leq i\leq n,\ 1\leq j\leq m}.$$
For \(Y=X\), we obtain a square matrix
$$\text{cov}(X,X)=\big(\text{cov}(\xi_i,\xi_j)\big)_{1\leq i,j\leq n}.$$
Properties of Covariance Matrix of a random vector \(X\)
Assume cov\((X,X)\) is a covariance matrix of \(X=(\xi_1,\cdots,\xi_n)\), then cov\((X,X)\) is a symmetric non-negative definite matrix.
\(\forall (x_1,\cdots,x_n)\in\mathbb{R}^n\),
$$\sum_{i,j=1}^n x_i x_j\,\text{cov}(\xi_i,\xi_j)=D\left(\sum_{i=1}^n x_i\xi_i\right)\geq 0.$$
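A quick numerical sketch (the correlated Gaussian vector below is a made-up illustration): estimate \(\text{cov}(X,X)\) from samples and check symmetry and non-negative eigenvalues.

```python
import numpy as np

# Sketch: estimate cov(X, X) for a random vector X = (xi_1, xi_2, xi_3) from samples.
rng = np.random.default_rng(6)
z = rng.standard_normal((100_000, 3))
X = z @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 1.0]])            # correlated components, chosen for illustration

C = np.cov(X, rowvar=False)                    # entries cov(xi_i, xi_j)
print(np.allclose(C, C.T))                     # symmetric
print(np.linalg.eigvalsh(C))                   # non-negative definite: all eigenvalues >= 0
```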
Function of random variables¶
We focus on random variables with density functions.
Fubini Theorem
For a non-negative (or integrable) Borel measurable function \(h\) defined on \(\mathbb{R}\times \mathbb{R}\) and distribution functions \(F, G\), we have
$$\int_{\mathbb{R}\times\mathbb{R}} h(x,y)\,\mathrm{d}F(x)\,\mathrm{d}G(y)=\int_{\mathbb{R}}\left(\int_{\mathbb{R}} h(x,y)\,\mathrm{d}F(x)\right)\mathrm{d}G(y)=\int_{\mathbb{R}}\left(\int_{\mathbb{R}} h(x,y)\,\mathrm{d}G(y)\right)\mathrm{d}F(x).$$
Sum of random variables
Assume \(\xi\), \(\eta\) are two independent random variables with distribution functions \(F\) and \(G\); then by the Fubini theorem,
$$F_{\xi+\eta}(z)=\mathbb{P}(\xi+\eta\leq z)=\int_{\mathbb{R}}F(z-y)\,\mathrm{d}G(y).$$
If \(\xi\) and \(\eta\) are continuous random variables with density functions \(f\) and \(g\), then
$$F_{\xi+\eta}(z)=\int_{\mathbb{R}}\left(\int_{-\infty}^{z-y}f(x)\,\mathrm{d}x\right)g(y)\,\mathrm{d}y,$$
so the density function of \(\xi+\eta\) is the convolution
$$(f*g)(z)=\int_{\mathbb{R}}f(z-y)\,g(y)\,\mathrm{d}y.$$
Actually, if only one of \(\xi\) and \(\eta\) is continuous, say \(G\) has a density function \(g\), then the density function of \(\xi+\eta\) is
$$\int_{\mathbb{R}}g(z-x)\,\mathrm{d}F(x).$$
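A numerical sketch of the convolution formula (the choice of two Uniform\((0,1)\) densities is illustrative, so \(f*g\) is the triangular density on \([0,2]\)): discretize the integral and compare with the exact triangle.

```python
import numpy as np

# Sketch: density of xi + eta for independent xi, eta ~ Uniform(0, 1) is f*g, the triangle on [0, 2].
dx = 0.001
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)                       # density of Uniform(0, 1)
g = np.ones_like(x)

conv = np.convolve(f, g) * dx             # (f*g)(z) = int f(z - y) g(y) dy, discretized

for point in (0.5, 1.0, 1.5):
    i = int(point / dx)
    exact = point if point <= 1 else 2 - point
    print(f"z={point}:  conv={conv[i]:.3f}  exact triangle density={exact:.3f}")
```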
Convergence for Sequence of random variables¶
Definition of convergence
Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables, \(\xi\) is a random variable. We call
(i) \(\{\xi_n\}\) converges to \(\xi\) in the sense of probability, if \(\forall \varepsilon>0\),
$$\lim_{n\rightarrow\infty}\mathbb{P}(|\xi_n-\xi|\geq \varepsilon)=0,$$
which is denoted by \(\xi_n\overset{p}{\rightarrow}\xi\).
(ii) \(\{\xi_n\}\) almost surely converges to \(\xi\), if
$$\mathbb{P}\left(\lim_{n\rightarrow\infty}\xi_n=\xi\right)=1,$$
or \(\forall \varepsilon>0\), \(\exists A\in \mathscr{F}\), s.t. \(\mathbb{P}(A)=0\) and for every \(\omega\notin A\) there exists \(N\) with
$$|\xi_n(\omega)-\xi(\omega)|<\varepsilon,\qquad \forall n\geq N,$$
which is denoted by \(\xi_n\overset{a.s.}{\rightarrow}\xi\).
Law of large numbers
Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables. Let the partial sum \(S_n=\sum\limits_{i=1}^n\xi_i\), \(m_n=\mathbb{E}(S_n)\), \(s_n^2=DS_n\). Then \(\{\xi_n\}\) satisfies
(i) the Law of large numbers, if \(\frac{S_n-m_n}{n}\overset{P}{\rightarrow} 0\ (n\rightarrow \infty)\), i.e. \(\forall \varepsilon>0\),
$$\lim_{n\rightarrow\infty}\mathbb{P}\left(\left|\frac{S_n-m_n}{n}\right|\geq \varepsilon\right)=0;$$
(ii) the Strong law of large numbers, if \(\frac{S_n-m_n}{n}\overset{a.s.}{\rightarrow} 0\ (n\rightarrow \infty)\), i.e.
$$\mathbb{P}\left(\lim_{n\rightarrow\infty}\frac{S_n-m_n}{n}=0\right)=1.$$
Borel-Cantelli Theorem
Assume \(\{A_n\}\) is a sequence of events.
(i) If \(\sum\limits_{n=1}^\infty\mathbb{P}(A_n)<\infty\), then
$$\mathbb{P}\left(\bigcap_{N=1}^\infty\bigcup_{n=N}^\infty A_n\right)=0.$$
(ii) If \(\{A_n\}\) are independent events and \(\sum\limits_{n=1}^\infty\mathbb{P}(A_n)=\infty\), then
$$\mathbb{P}\left(\bigcap_{N=1}^\infty\bigcup_{n=N}^\infty A_n\right)=1.$$
The following is the probability form of Riesz theorem.
Riesz Theorem for Probability
Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables, \(\xi\) is a random variable, then
(i) If \(\xi_n\overset{P}{\rightarrow}\xi\), then there exists a subsequence \(\{\xi_{n_k}\}\) such that \(\xi_{n_k}\overset{a.s.}{\rightarrow} \xi\).
(ii) If \(\xi_{n}\overset{a.s.}{\rightarrow} \xi\), then \(\xi_n\overset{P}{\rightarrow}\xi\).
The following theorem gives a sufficient condition for almost sure convergence.
Sufficient condition for almost sure convergence
Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables and \(\xi\) is a random variable. If \(\forall \varepsilon>0\),
$$\sum_{n=1}^\infty\mathbb{P}(|\xi_n-\xi|\geq \varepsilon)<\infty,$$
then \(\xi_{n}\overset{a.s.}{\rightarrow} \xi\).
Uniform Convergence¶
This is the expectation form of the Hölder inequality.
Hölder Inequality
Assume \(1<p<\infty\), \(1<q<\infty\), \(\frac{1}{p}+\frac{1}{q}=1\); then
$$\mathbb{E}|\xi\eta|\leq \big(\mathbb{E}|\xi|^p\big)^{1/p}\big(\mathbb{E}|\eta|^q\big)^{1/q}.$$
\(a\)-th order absolute moment
Assume \(\xi\) is a random variable and \(a>0\); we call \(\mathbb{E}|\xi|^a\) the \(a\)-th order absolute moment of \(\xi\).
Properties of absolute moment
Assume \(0<a<b\); then
$$\big(\mathbb{E}|\xi|^a\big)^{1/a}\leq \big(\mathbb{E}|\xi|^b\big)^{1/b}.$$
Use the Hölder inequality with \(p=\frac{b}{a}\), applied to \(|\xi|^a\) and the constant function \(1\).
\(L^r\) space for probability
Assume \(\{\xi_n\}_{n\geq 1}\) is a sequence of random variables and \(\xi\) is a random variable. \(\{\xi_n\}\) is said to converge to \(\xi\) in the sense of the \(r\)-th order absolute moment, or in \(L^r\) (\(r\geq 1\)), if \(\mathbb{E}|\xi_n|^r\) and \(\mathbb{E}|\xi|^r\) are finite and
$$\lim_{n\rightarrow\infty}\mathbb{E}|\xi_n-\xi|^r=0,$$
which is denoted by \(\xi_n\overset{L^r}{\rightarrow}\xi\).
Definition of uniform integrability
A family of integrable random variables \(\{\xi_\lambda\}_{\lambda\in\Lambda}\) is said to be uniformly integrable if
$$\lim_{N\rightarrow\infty}\sup_{\lambda\in\Lambda}\mathbb{E}\big(|\xi_\lambda|;\,|\xi_\lambda|\geq N\big)=0.$$
Convergence in distribution¶
Analysis Tool¶
Probability Generating Function
Assume the distribution sequence of a discrete random variable \(\xi\) is \(\mathbb{P}(\xi=k)=p_k\), \(k=0,1,\cdots\). We call the function of the real variable \(s\)
$$\psi_\xi(s):=\mathbb{E}s^\xi=\sum_{k=0}^\infty p_k s^k,\qquad |s|\leq 1,$$
the Probability Generating Function.
It is similar to the \(z\)-transform in signal analysis.
Properties for generating function
(i) \(|\psi_\xi(s)|\leq \psi_\xi(1)=1\).
(ii) Assume \(\{\xi_i\}_{1\leq i\leq n}\) are mutually independent with generating functions \(\psi_{\xi_i}(s)\); then \(\eta=\sum_{i=1}^n \xi_i\) has generating function
$$\psi_\eta(s)=\prod_{i=1}^n\psi_{\xi_i}(s).$$
(iii) \(p_k=\mathbb{P}(\xi=k)=\frac{\psi^{(k)}(0)}{k!}\), \(k=0,1,\cdots\)
(i) Binomial distribution \(\xi\sim B(n,p)\):
$$\psi_\xi(s)=(q+ps)^n,\qquad q=1-p.$$
(ii) Poisson distribution \(\xi\sim \pi(\lambda)\):
$$\psi_\xi(s)=e^{\lambda(s-1)}.$$
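A small sketch of properties (ii) and (iii) for the binomial case (the parameters \(n\) and \(p\) are made up for illustration): the product of \(n\) Bernoulli generating functions \(q+ps\) is expanded as polynomial coefficients, and the coefficient of \(s^k\) equals \(\mathbb{P}(\eta=k)\).

```python
import numpy as np
from math import comb

# Sketch: for independent Bernoulli(p) variables, the generating function of their sum
# is the product (q + p s)^n, i.e. that of the binomial B(n, p); parameters are illustrative.
n, p = 5, 0.3
q = 1.0 - p

bernoulli = np.array([q, p])                  # coefficients of psi_xi(s) = q + p s
coeffs = np.array([1.0])
for _ in range(n):
    coeffs = np.convolve(coeffs, bernoulli)   # multiply the generating functions

# coeffs[k] = P(eta = k) = psi^{(k)}(0) / k!, compared with the binomial probabilities.
binom = [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
print(np.allclose(coeffs, binom))             # True
print(coeffs.round(5))
```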