These lecture notes include both discrete- and continuous-time processes. We consider applications to insurance, finance, biology, and social sciences. These lecture notes are intended for junior- and senior-level undergraduate courses. They contain enough material for two semesters or three academic quarters.
The first 10 sections are devoted to Probability Theory (first semester), and the next 10 sections are devoted to Stochastic Processes (second semester). The Appendix contains simulation methods. These lecture notes are suitable for STEM majors, but are likely too hard for business and humanities majors.
A prerequisite for the first semester is calculus: a standard two-semester or three-quarter sequence, including multivariable calculus. For the second semester, we also need differential equations and matrix algebra. We tried to keep the notes as condensed and concise as possible, very close to the actual notes written on the board.
I would like to thank Janko Gravner from the Department of Mathematics, University of California, Davis, who put his lecture notes on Probability and Markov Chains on his web page. I use his materials a lot. Last but not least, I am thankful to Soumik Pal, my Ph.D. adviser, Ioannis Karatzas, his Ph.D. adviser, Jean-Pierre Fouque, my postdoctoral mentor, and all my graduate and undergraduate professors in Probability Theory and Stochastic Processes, for helping me understand and appreciate this exciting area of mathematics.
A permutation of \(123\) is, say, \(321\) or \(231\): numbers cannot repeat. There are \(3\cdot 2\cdot 1 = 3! = 6\) permutations: there are \(3\) choices for the first slot, \(2\) choices for the second slot (because one of the numbers is already in the first slot and cannot be repeated), and only one choice for the last, third slot.
In general, for \(1, 2, 3, \ldots, n-1, n\) there are \[ n! = 1\cdot 2\cdot 3\cdot\ldots\cdot(n-1)\cdot n \] permutations. This number is called \(n\) factorial. Examples: \[ 1! = 1,\ \ 2! = 2,\ \ 3! = 6,\ \ 4! = 24,\ \ 5! = 120,\ \ 6! = 720. \] There is a convention that \(0! = 1\). Indeed, the main property of the factorial, which follows directly from its definition, is \((n-1)!\,n = n!\) for \(n = 2, 3, 4, \ldots\). We would like it to be true also for \(n = 1\): \(0!\cdot 1 = 1!\), so \(0! = 1\). The factorial grows very quickly. Indeed, \(100!\) is extremely large; no modern computer can go through all permutations of \(1, 2, \ldots, 100\). When a computational problem requires a brute-force search over all permutations of \(n\) numbers, it becomes infeasible already for moderately large \(n\). Stirling’s formula: \[ n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\ \ \mbox{as}\ \ n \to \infty, \] where \(f(n) \sim g(n)\) means \(\lim_{n \to \infty}f(n)/g(n) = 1\).
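To get a feel for how fast the factorial grows and how accurate Stirling’s formula is, one can compare the two numerically. Here is a minimal Python sketch (the choice of values of \(n\) is arbitrary, and only the standard library is used):

```python
import math

# Compare n! with Stirling's approximation sqrt(2*pi*n) * (n/e)^n.
for n in (5, 10, 20, 50, 100):
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    print(n, stirling / exact)   # the ratio approaches 1 as n grows
```

The ratio tends to \(1\), in agreement with the meaning of \(\sim\) above.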
If we have three slots for numbers \(1, 2, 3, 4, 5, 6, 7\), and repetitions are not allowed, this is called an arrangement. Say, \(364\), \(137\), \(634\). There are \(7\cdot 6\cdot 5 = 210\) such arrangements: \(7\) choices for the first slot, \(6\) for the second and \(5\) for the third. We can write this as \[ A_7^3 = 7\cdot6\cdot 5 = \frac{7\cdot 6\cdot 5\cdot 4\cdot 3\cdot 2\cdot 1}{4\cdot 3\cdot 2\cdot 1} = \frac{7!}{(7-3)!}. \] In general, if there are \(k\) slots for \(1, 2, \ldots, n\), then the number of arrangements is \[ A_n^k = n\cdot(n-1)\cdot(n-2)\cdot\ldots\cdot(n-k+1) = \frac{n!}{(n-k)!}. \] A permutation can be viewed as a particular case of an arrangement, when \(k = n\): the number of slots is the same as the total number of elements.
How many subsets of three elements are there in the set \(\{1, 2, \ldots, 7\}\)? The difference between an arrangement and a subset is that for a subset, order does not matter. (But in both of them, there are no repetitions.) For example, \(\{3, 4, 6\}\) and \(\{6, 3, 4\}\) is the same subset, but \(346\) and \(634\) are different arrangements. From any subset, we can create \(3! = 6\) arrangements. The quantity of subsets is equal to the quantity of arrangements divided by \(6\): \[ \frac{A^3_7}{3!} = \frac{7!}{4!3!} = \frac{210}{6} = 35. \] In general, the quantity of subsets of \(k\) elements in \(\{1, \ldots, n\}\) is equal to \[ \binom{n}{k} = \frac{A_n^k}{k!} = \frac{n\cdot(n-1)\cdot\ldots\cdot(n-k+1)}{k!} = \frac{n!}{(n-k)!k!} \] It is pronounced as \(n\) choose \(k\).
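These counts are available directly in Python’s standard library (math.perm and math.comb, available in Python 3.8+); a quick sketch checking the numbers above:

```python
import math

print(math.factorial(3))  # 6 permutations of 1, 2, 3
print(math.perm(7, 3))    # arrangements: A_7^3 = 7!/(7-3)! = 210
print(math.comb(7, 3))    # subsets: "7 choose 3" = 210/3! = 35
```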
\(\binom{1}{0} = 1\), because there is only one subset of zero elements in \(\{1\}\), and this is an empty set \(\varnothing\).
\(\binom{1}{1} = 1\), because there is only one subset of one element in \(\{1\}\): the set \(\{1\}\) itself.
\(\binom{n}{0} = 1\), for the same reason as in (1);
\(\binom{n}{n} = 1\), for the same reason as in (2);
\(\binom{2}{1} = 2\), because there are two subsets of one element of \(\{1, 2\}\): these are \(\{1\}\) and \(\{2\}\);
\(\binom{n}{1} = n\), because there are \(n\) subsets of one element of \(\{1, 2, \ldots, n\}\): \(\{1\},\ \{2\},\ldots, \{n\}\);
\(\binom{n}{n-1} = n\), because to choose a subset of \(n-1\) elements out of \(\{1, 2, \ldots, n\}\), we need to throw away one element, and it can be chosen in \(n\) ways;
We can say without calculations that \(\binom{8}{2} = \binom{8}{6}\). Indeed, for every subset of \(\{1, 2, \ldots, 8\}\) of two elements there is a subset of six elements: its complement. For example, \(\{3, 5\}\) corresponds to \(\{1, 2, 4, 6, 7, 8\}\). This is a one-to-one correspondence. So there are equally many subsets of two elements and subsets of six elements. Similarly, \(\binom{8}{3} = \binom{8}{5}\). More generally, \[ \binom{n}{k} = \binom{n}{n-k} \]
To build a subset \(E \subseteq \{1, 2, \ldots, n\}\), we ask, for each element, whether it belongs to \(E\). For each question, there are two possible answers. The total number of choices is \(2\cdot2\cdot\ldots\cdot2 = 2^n\). The set of all subsets of \(\{1, \ldots, n\}\) is called a power set, and it contains \(2^n\) elements. But we can also write the quantity of all subsets as the sum of binomial coefficients: \[ \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \ldots + \binom{n}{n}. \] So we get the following identity: \[ \binom{n}{0} + \binom{n}{1} + \ldots + \binom{n}{n} = 2^n \]
Let \(n = 2\). Yes-Yes: \(E = \{1, 2\}\), Yes-No: \(E = \{1\}\), No-Yes: \(E = \{2\}\), No-No: \(E = \varnothing\). Total number of subsets: \(2^2 = 4\). Two of them have one element: \(\binom{2}{1} = 2\), one has two elements, \(\binom{2}{2} = 1\), and one has zero elements, \(\binom{2}{0} = 1\). Total: \(1 + 2 + 1 = 4\).
We can claim that \[ \binom{5}{2} = \binom{4}{2} + \binom{4}{1}. \] Indeed, the total number of subsets \(E \subseteq \{1, 2, 3, 4, 5\}\) which contain two elements is \(\binom{5}{2}\). But there are two possibilities:
Case 1. \(5 \in E\). Then \(E\setminus\{5\}\) is a one-element subset of \(\{1, 2, 3, 4\}\); there are \(\binom{4}{1}\) such subsets.
Case 2. \(5\notin E\). Then \(E\) is a two-element subset of \(\{1, 2, 3, 4\}\). There are \(\binom{4}{2}\) such subsets.
So \(\binom{4}{1} + \binom{4}{2} = \binom{5}{2}\). In general, \[ \binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1} \] This is called the reduction formula. Using it, we can construct Pascal’s triangle: each element is the sum of the two elements immediately above it. We start by filling the edges with ones: \(\binom{n}{0} = \binom{n}{n} = 1\), see the previous lecture. Then we fill the inside from top to bottom using the reduction formula.
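The reduction formula translates directly into a short program that builds Pascal’s triangle row by row; a minimal sketch (the function name is my own):

```python
def pascal_triangle(n_rows):
    """Rows of Pascal's triangle via C(n, k) = C(n-1, k-1) + C(n-1, k)."""
    rows = [[1]]
    for n in range(1, n_rows):
        prev = rows[-1]
        # Edges are 1; each inner entry is the sum of the two entries above it.
        rows.append([1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1])
    return rows

for row in pascal_triangle(6):
    print(row)   # last row: [1, 5, 10, 10, 5, 1]
```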
We can expand \((x + y)^2 = x^2 + 2xy + y^2\), and \((x + y)^3 = x^3 + 3x^2y + 3xy^2 + y^3\). The coefficients are taken from the corresponding rows of Pascal’s triangle. Why is this? Let us show this for \(n = 3\). \[ (x+y)^3 = (x+y)(x+y)(x+y) = xxx + xxy + xyx + yxx + xyy + yxy + yyx + yyy. \] Each term has slots occupied by \(y\): \(xxy \leftrightarrow \{3\}\), \(yxy \leftrightarrow \{1, 3\}\). If there is one slot occupied by \(y\), this corresponds to \(x^2y\), and there are \(\binom{3}{1}\) such combinations. So we have: \(\binom{3}{1}x^2y\). Other terms give us: \[ \binom{3}{0}x^3 + \binom{3}{1}x^2y + \binom{3}{2}xy^2 + \binom{3}{3}y^3. \] The general formula looks like this: \[ (x + y)^n = \binom{n}{0}x^n + \binom{n}{1}x^{n-1}y + \ldots + \binom{n}{n}y^n \] Let \(x = y = 1\). Then we get: \[ 2^n = \binom{n}{0} + \binom{n}{1} + \ldots + \binom{n}{n}. \] This formula was already proven above. Let \(x = 1, y = -1\). Then \[ 0 = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \binom{n}{3} + \ldots, \] \[ \binom{n}{0} + \binom{n}{2} + \ldots = \binom{n}{1} + \binom{n}{3} + \ldots \] The quantity of subsets with an even number of elements is equal to the quantity of subsets with an odd number of elements.
Here, we study a few combinatorics (counting) problems, which can be reduced to counting permutations and combinations.
Five women and four men take an exam. We rank them from top to bottom, according to their performance. There are no ties.
A ranking is just another name for a permutation of nine people. The answer is \(9!\).
There are \(5!\) permutations for women and \(4!\) permutations for men. The total number is \(5!\cdot 4!\). We should multiply them, rather than add, because the men’s and women’s rankings are independent: we are interested in pairs, where the first item is a ranking of the women and the second item is a ranking of the men. If instead we needed to choose either to rank the women or to rank the men, then the answer would be \(5! + 4!\).
A licence plate consists of seven symbols: digits or letters. How many licence plates are there if the following is true:
Solution.
Choose three slots among seven for letters; this may be done in \(\binom{7}{3}\) ways. Fill each of these three slots with letters; there are \(26^3\) ways to do this, since letters can repeat. Fill each of the remaining four slots with digits; there are \(10^4\) ways to do this, since numbers can repeat. Answer: \(\binom{7}{3}\cdot26^3\cdot10^4.\)
This case is different because there are \(26\cdot 25\cdot 24 = 26!/23!\) ways to fill three chosen slots for letters, and \(10\cdot 9\cdot 8\cdot 7 = 10!/6!\) ways to fill four chosen slots for numbers. Answer: \(\binom{7}{3}26\cdot25\cdot24\cdot10\cdot9\cdot8\cdot7.\)
This case is easier than the previous ones, since there are \(36\) symbols, and each of the seven slots can be filled with any of these symbols. Answer: \(36^7.\)
We have five women and four men. We need to choose a committee of three women and two men. How many ways are there to do this if:
There are \(\binom{4}{2} = 6\) ways to choose two men out of four, and there are \(\binom{5}{3} = 10\) ways to choose three women out of five. So the answer is \(6\cdot 10 = 60\).
How many committees are there for which this restriction is violated, so Mike and Thomas do serve together? If they are already chosen, then we do not need to choose any other man, and there are \(\binom{5}{3} = 10\) ways to choose three women out of five. So the quantity of committees where Mike and Thomas do serve together is \(10\). The answer is \(60-10 = 50.\)
Similarly to the previous case, the number of committees where Britney and Lindsay serve together is \(\binom{3}{1}\binom{4}{2} = 18\), because you can choose one more woman out of the remaining three in \(\binom{3}{1} = 3\) ways, and the number of choices for men is \(\binom{4}{2}\). So the answer is \(60 - 18 = 42.\)
A set is a collection of its elements: \(A = \{1, 3, 4, 5\}\), \(B = \{red, blue\}\). We say: \(1 \in A\) (\(1\) belongs to \(A\)), but \(2 \notin A\) (\(2\) does not belong to \(A\)). For two sets \(A\) and \(B\), we can define their union: \(A\cup B = \{x\mid x \in A\ \mbox{or}\ x \in B\}\), intersection: \(A\cap B := \{x \mid x\in A\ \mbox{and}\ x \in B\}\), and difference: \(A\setminus B = \{x\mid x \in A\ \mbox{and}\ x \notin B\}\). The empty set, which does not contain any elements, is denoted by \(\varnothing\).
Let \(A = \{0, 1, 2, 3\}\), \(B = \{0 \le x \le 7\mid\ x\ \mbox{is odd}\} = \{1, 3, 5, 7\}\). Then \[ A\cup B = \{0, 1, 2, 3, 5, 7\},\ A\cap B = \{1, 3\},\ A\setminus B = \{0, 2\},\ B\setminus A = \{5, 7\}. \]
The following set of axioms was formulated by the Russian mathematician Andrey Kolmogorov. We have the set \(S\) of elementary outcomes \(s \in S\). This set is called the sample space, and subsets \(A \subseteq S\) are called events. Each event \(A\) has a number \(\mathbb{P}(A)\), which is called the probability of \(A\) and satisfies the following axioms:
By the last axiom, if we take the complement of the event \(A\): \(A^c = S\setminus A\), then \(\mathbb{P}(A) + \mathbb{P}(A^c) = \mathbb{P}(A\cup A^c) = \mathbb{P}(S) = 1\), and therefore \(\mathbb{P}(A^c) = 1 - \mathbb{P}(A)\).
Toss a coin twice. Then \(S\) = {HH, TT, HT, TH}, and the probability of each outcome is \(1/4\). For example, the event \(A\) = {the same result} = {HH, TT} has probability \(\mathbb{P}(A) = 1/2\).
(Chevalier de Mere’s Problem.) What is the probability of getting at least one six in four rolls of a fair die? Denote this event by \(A\). The probability space is \[ S = \{(i_1, i_2, i_3, i_4)\mid i_1, i_2, i_3, i_4 \in \{1, 2, 3, 4, 5, 6\}\}. \] The probability of each elementary outcome is \(1/6^4\). The complement of the event \(A\) is \(A^c = S\setminus A\), which means that there is no six in any of the four rolls. We have: \[ A^c = \{(i_1, i_2, i_3, i_4)\mid i_1, i_2, i_3, i_4 \in \{1, 2, 3, 4, 5\}\}. \] The event \(A^c\) contains \(5^4\) outcomes. Each has probability \(1/6^4\), so \(\mathbb{P}(A^c) = 5^4/6^4\). Therefore, \[ \mathbb{P}(A) = 1 - \mathbb{P}(A^c) = 1 - \left(\frac56\right)^4 \approx 0.5177. \] This event has probability greater than \(50\%\), so it is good to bet on it.
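This classical answer is easy to double-check by simulation; a sketch (the number of trials is my own choice):

```python
import random

trials = 100_000
hits = sum(
    any(random.randint(1, 6) == 6 for _ in range(4))  # at least one six in 4 rolls
    for _ in range(trials)
)
print(hits / trials)        # empirical frequency
print(1 - (5 / 6) ** 4)     # exact answer, about 0.5177
```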
Suppose we have \(n = 365\) days, each day is equally likely to be a birthday. There are \(k\) people. Then \[ \mathbb{P}(\mbox{there are two people with same birthdays}) = 1 - \mathbb{P}(\mbox{there are no people with same birthdays}) = 1 - \frac{n(n-1)\ldots(n-k+1)}{n^k}, \] because the number of birthday arrangements (assuming they are all different) is \(A^k_n = n(n-1)\ldots(n-k+1)\). This probability for \(k = 23\) equals \(0.5073\), greater than \(50\%\)!
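The birthday probability is a short product; a sketch that evaluates it for \(k = 23\) (the function name is my own):

```python
def birthday_collision(k, n=365):
    """P(at least two of k people share a birthday, n equally likely days)."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (n - i) / n   # the (i+1)-st person avoids the first i birthdays
    return 1 - p_distinct

print(birthday_collision(23))       # about 0.5073
```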
Consider two events, \(A\) and \(B\), which can intersect. Then \(\mathbb{P}(A\cup B) \ne \mathbb{P}(A) + \mathbb{P}(B)\) in general. Indeed, toss two coins, let \(A\) be the event that the first toss is H, and let \(B\) be the event that the second toss is H. Then \(\mathbb{P}(A\cup B) = 3/4\), but \(\mathbb{P}(A) = \mathbb{P}(B) = 1/2\). The reason is that the intersection \(A\cap B\) is counted twice in \(\mathbb{P}(A) + \mathbb{P}(B)\), so we need to subtract it once. Then we get the correct probability \(\mathbb{P}(A\cup B)\).
We have: \(\mathbb{P}(A\cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A\cap B)\).
Indeed, let \(C_1 = A\setminus B\), \(C_2 = A\cap B\), \(C_3 = B\setminus A\). Then \[ \mathbb{P}(A\cup B) = \mathbb{P}(C_1\cup C_2\cup C_3) = \mathbb{P}(C_1) + \mathbb{P}(C_2) + \mathbb{P}(C_3), \] \[ \mathbb{P}(A) = \mathbb{P}(C_1\cup C_2) = \mathbb{P}(C_1) + \mathbb{P}(C_2), \] \[ \mathbb{P}(B) = \mathbb{P}(C_2\cup C_3) = \mathbb{P}(C_2) + \mathbb{P}(C_3), \] \[ \mathbb{P}(A\cap B) = \mathbb{P}(C_2), \] and the formula follows from here.
\(\mathbb{P}(A) = 30\%\), \(\mathbb{P}(B) = 50\%\), \(\mathbb{P}(A\cup B) = 70\%\). Find: \(\mathbb{P}(A\cap B)\), \(\mathbb{P}(A^c\cup B^c)\), and \(\mathbb{P}(A^c\cap B)\). Solution: \(\mathbb{P}(A\cap B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A\cup B) = 0.3 + 0.5 - 0.7 = 0.1\); \(\mathbb{P}(A^c\cup B^c) = 1 - \mathbb{P}(A\cap B) = 1 - 0.1 = 0.9\); \(\mathbb{P}(A^c\cap B) = \mathbb{P}(B) - \mathbb{P}(A\cap B) = 0.5 - 0.1 = 0.4\).
(SOA) \(22\%\) of patients visit both a physical therapist (\(T\)) and a chiropractor (\(C\)), and \(12\%\) visit neither. Also, \(\mathbb{P}(C) = 0.14 + \mathbb{P}(T)\). Find \(\mathbb{P}(T)\). Indeed, \(\mathbb{P}(C\cap T) = 22\%\) and \(\mathbb{P}(C^c\cap T^c) = 12\%\), so \(\mathbb{P}(C\cup T) = 88\%\). Next, \(\mathbb{P}(C) + \mathbb{P}(T) = \mathbb{P}(C\cup T) + \mathbb{P}(C\cap T) = 1.1\). But \(\mathbb{P}(C) = 0.14 + \mathbb{P}(T)\). Thus \(2\mathbb{P}(T) + 0.14 = 1.1\) and \(\mathbb{P}(T) = 0.48\).
Choose a random number from \(1\) to \(1000\). What is the probability that it is divisible either by \(2\) or by \(3\)? Let \(A\) = divisible by 2, and \(B\) = divisible by 3. There are \(500\) numbers in \(A\) and \(333\) numbers in \(B\), because \(1000/3 = 333 + 1/3\) has integer part \(333\). More exactly, \[ A = \{2, 4, 6, \ldots, 1000\},\ \ B = \{3, 6, 9, \ldots, 996, 999\}. \] \[ \mathbb{P}(A) = \frac{500}{1000} = \frac12,\ \ \mathbb{P}(B) = \frac{333}{1000}. \] In addition, \(A\cap B = \{\mbox{divisible by}\ \ 6\}\) contains \(166\) numbers: \(1000/6\) has integer part \(166\). Therefore, \[ \mathbb{P}(A\cap B) = 166/1000,\quad \mathbb{P}(A\cup B) = \frac{500}{1000} + \frac{333}{1000} - \frac{166}{1000} = \frac{667}{1000} \]
We can state an analogous formula for three, four, or more events. For simplicity, let us state it for 3 events.
For any events \(A, B, C\), we have: \[ \mathbb{P}(A\cup B\cup C) = \mathbb{P}(A) + \mathbb{P}(B) + \mathbb{P}(C) - \mathbb{P}(A\cap B) - \mathbb{P}(A\cap C) - \mathbb{P}(B \cap C) + \mathbb{P}(A\cap B\cap C). \] Informal explanation: To correct for double counting in double intersections, we remove them. But this removes \(A\cap B\cap C\) three times, and it was initially counted three times too (as part of \(A, B, C\)). Thus we need to add it once again.
Let \(p_1\) be the probability of the event \(1\) on the Venn diagram, which is \(A\setminus(B\cup C)\). Let \(p_2\) be the probability of the event \(2\), etc. Then \[ \begin{cases} \mathbb{P}(A) = p_1 + p_3 + p_5 + p_7,\\ \mathbb{P}(B) = p_2 + p_3 + p_6 + p_7,\\ \mathbb{P}(C) = p_4 + p_6 + p_7 + p_5,\\ \mathbb{P}(A\cap B) = p_3 + p_7,\\ \mathbb{P}(A\cap C) = p_5 + p_7,\\ \mathbb{P}(B\cap C) = p_6 + p_7,\\ \mathbb{P}(A\cap B\cap C) = p_7. \end{cases} \] Finally, \(\mathbb{P}(A\cup B\cup C) = p_1 + p_2 + p_3 + p_4 + p_5 + p_6 + p_7\). Plugging this into the formula in the box above, we can check it is indeed true.
Out of numbers \(1, \ldots, 600\), how many are divisible by either 2, 3, or 5? We have: \(A\) = {divisible by 2}, \(B\) = {divisible by 3}, \(C\) = {divisible by 5}. Then \(A\cap B\) = {divisible by 6}, \(A\cap C\) = {divisible by 10}, \(B\cap C\) = {divisible by 15}, \(A\cap B\cap C\) = {divisible by 30}. Thus \[ \mathbb{P}(A) = \frac12,\, \mathbb{P}(B) = \frac13,\, \mathbb{P}(C) = \frac15,\, \mathbb{P}(A\cap B) = \frac16,\, \mathbb{P}(A\cap C) = \frac1{10},\, \mathbb{P}(B\cap C) = \frac1{15},\, \mathbb{P}(A\cap B\cap C) = \frac1{30}. \] By the inclusion-exclusion formula, \[ \mathbb{P}(A\cup B\cup C) = \frac12 + \frac13 + \frac15 - \frac16 - \frac1{10} - \frac1{15} + \frac1{30} = \frac{11}{15}. \]
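A brute-force count over \(1, \ldots, 600\) confirms the inclusion-exclusion answer (a sketch):

```python
count = sum(1 for n in range(1, 601) if n % 2 == 0 or n % 3 == 0 or n % 5 == 0)
print(count, count / 600)   # 440 numbers, i.e. probability 440/600 = 11/15
```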
A bag with \(11\) cubes: \(3\) red and fuzzy, \(2\) red and smooth, \(4\) blue and fuzzy, \(2\) blue and smooth. Put your hand in the bag and randomly pick a cube. Let \[ R = \{\mbox{the cube is red}\},\ F = \{\mbox{the cube is fuzzy}\}. \] Then the probability that it is red is \(\mathbb{P}(R) = 5/11\). But if you feel that it is fuzzy, then the probability that it is red is \(3/7\). This is called conditional probability of \(R\) given \(F\): \[ \mathbb{P}(R\mid F) = \frac{\mathbb{P}(R\cap F)}{\mathbb{P}(F)} = \frac 37. \]
Conditional probability of \(A\) given \(B\) is defined as \[ \mathbb{P}(A\mid B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}. \]
\(\mathbb{P}(A\cap B) = \mathbb{P}(B)\mathbb{P}(A\mid B) = \mathbb{P}(A)\mathbb{P}(B\mid A)\).
For a family with two children, assume that all four combinations of two kids are equally likely: \(S = \{GG, GB, BG, BB\}\). Let \(A\) = {at least one boy}, \(B\) = {at least one girl}. Then \[ \mathbb{P}(A\mid B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)} = \frac{2/4}{3/4} = \frac23. \]
\(\mathbb{P}(A\setminus B) = 0.2\), \(\mathbb{P}(B\setminus A) = 0.1\), \(\mathbb{P}(A^c\cap B^c) = 0.6\). Find \(\mathbb{P}(A\mid B)\). Solution: \(\mathbb{P}(A\cup B) = 1 - 0.6 = 0.4\), and \(\mathbb{P}(A\cap B) = 0.4 - 0.2 - 0.1 = 0.1\). Thus \(\mathbb{P}(B) = 0.1 + 0.1 = 0.2\) and \(\mathbb{P}(A\mid B) = 0.1/0.2 = 50\%\).
Take events \(F_1, \ldots, F_n\) such that \[ F_1\cup F_2\cup \ldots\cup F_n = S,\ \ F_i\cap F_j = \varnothing,\ i \ne j. \] This means one and only one of the events \(F_1, \ldots, F_n\) happens. This collection of events is called a partition. Examples: Heads or Tails in a coin toss; the numbers from 1 to 6 in a die roll.
For an event \(A\) and a partition \(F_1, \ldots, F_n\), we have: \[ \mathbb{P}(A) = \sum\limits_{k=1}^n\mathbb{P}(A\cap F_k) = \sum\limits_{k=1}^n\mathbb{P}(A\mid F_k)\mathbb{P}(F_k). \]
We have from conditional probability definition: \(\mathbb{P}(A\mid F_i)\mathbb{P}(F_i) = \mathbb{P}(A\cap F_i)\). But \[ \mathbb{P}(A\cap F_1) + \ldots + \mathbb{P}(A\cap F_n) = \mathbb{P}(A), \] because the union of events \(A\cap F_1, \ldots, A\cap F_n\) is \(A\), and these do not intersect.
A policyholder is classified as prime or subprime, with probabilities \(20\%\) and \(80\%\), respectively. Prime policyholders have accidents with probability \(2\%\), and subprime with probability \(10\%\). Find the probability that a random policyholder has an accident. Here, \(F_1\) = {prime}, \(F_2\) = {subprime}, \(A\) = {accident}. Then \[ \mathbb{P}(F_1) = 0.2,\, \mathbb{P}(F_2) = 0.8,\, \mathbb{P}(A\mid F_1) = 0.02,\, \mathbb{P}(A\mid F_2) = 0.1. \] Thus by the total probability formula \[ \mathbb{P}(A) = \mathbb{P}(A\mid F_1)\mathbb{P}(F_1) + \mathbb{P}(A\mid F_2)\mathbb{P}(F_2) = 0.2\cdot 0.02 + 0.8\cdot 0.1 = 8.4\%. \]
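The total probability formula can also be checked by simulating policyholders; a sketch with the numbers above (the trial count is my own choice):

```python
import random

trials = 100_000
accidents = 0
for _ in range(trials):
    prime = random.random() < 0.2            # 20% prime, 80% subprime
    p_accident = 0.02 if prime else 0.10
    accidents += random.random() < p_accident
print(accidents / trials)                    # close to the exact answer 0.084
```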
Flip a fair coin. If H, roll one die. If T, roll two dice. What is the probability \(p\) that there is at least one six? Let \(X_1, X_2\) be the first and the second dice. Let \(F_1\) = heads, \(F_2\) = tails, \(A\) = at least one six. But \(\mathbb{P}(X_1 = 6) = 1/6\), and by inclusion-exclusion formula we have: \[ \mathbb{P}(X_1 = 6\,\mbox{or}\, X_2 = 6) = \mathbb{P}(X_1 = 6) + \mathbb{P}(X_2 = 6) - \mathbb{P}(X_1 = 6,\ X_2 = 6) = \frac16 + \frac16 - \frac1{36} = \frac{11}{36}. \] Thus by the total probability formula, \[ p = \frac12\cdot\frac16 + \frac12\cdot\frac{11}{36} = \frac{17}{72}. \]
In the setting above, we have: \[ \mathbb{P}(F_1\mid A) = \frac{\mathbb{P}(F_1\cap A)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A\mid F_1)\mathbb{P}(F_1)}{ \mathbb{P}(A\mid F_1)\mathbb{P}(F_1) + \ldots + \mathbb{P}(A\mid F_n)\mathbb{P}(F_n)}. \]
\(1\%\) of people have a disease. If a person is sick, the test is positive with probability \(90\%\), and if a person is healthy, the test is positive only with probability \(20\%\). A random person is selected from the population and tests positive. What is the probability that he is sick? Let \(F_1\) = sick, \(F_2\) = healthy, \(A\) = tested positive. \[ \mathbb{P}(F_1) = 0.01,\ \mathbb{P}(F_2) = 0.99,\ \mathbb{P}(A\mid F_1) = 0.9, \ \mathbb{P}(A\mid F_2) = 0.2. \] \[ \mathbb{P}(F_1\mid A) = \frac{0.01\cdot0.9}{0.01\cdot0.9 + 0.99\cdot 0.2} \approx 4.3\%. \] We updated the probability of our hypothesis (that he is sick) from \(1\%\) to \(4.3\%\), using the new information that the test is positive. The probability \(\mathbb{P}(F_1)\) is called the prior probability, and \(\mathbb{P}(F_1\mid A)\) is called the posterior probability. The posterior probability is still small, because the prior probability was small. This is the prosecutor’s fallacy: even though the probability \(90\%\) is large, the posterior probability is small, because the prior is small.
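The same Bayes computation as a sketch, with the disease-testing numbers above:

```python
p_sick, p_healthy = 0.01, 0.99
p_pos_sick, p_pos_healthy = 0.90, 0.20

p_positive = p_pos_sick * p_sick + p_pos_healthy * p_healthy   # total probability
posterior = p_pos_sick * p_sick / p_positive                   # Bayes' formula
print(posterior)                                               # about 0.043
```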
We have a fair coin and a magic coin, which always comes out H. Choose one at random (each can be chosen with probability \(1/2\)), flip it twice. It comes out H both times. What is the probability it is fair? Let \(F_1\) = fair, \(F_2\) = magic, \(A\) = both heads. Then \[ \mathbb{P}(F_1) = \mathbb{P}(F_2) = \frac12,\ \ \mathbb{P}(A\mid F_1) = \frac14,\ \mathbb{P}(A\mid F_2) = 1. \] Therefore, by Bayes’ formula we get: \[ \mathbb{P}(F_1\mid A) = \frac{\frac12\cdot\frac14}{\frac12\cdot\frac14 + \frac12\cdot 1} = \frac{1}{5} = 20\%. \]
Events \(A\) and \(B\) are called independent if knowledge of whether \(A\) happened or not does not influence the probability of \(B\): \(\mathbb{P}(B\mid A) = \mathbb{P}(B)\). Since \(\mathbb{P}(B\mid A) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}\), we can rewrite this as \[ \mathbb{P}(A\cap B) = \mathbb{P}(A)\mathbb{P}(B). \]
Toss the coin twice. Let \(A\) = first toss H, \(B\) = second toss H, \(C\) = both tosses the same. Then \(A\) and \(B\) are independent. Indeed, the probability space (the space of all outcomes) is \(S = \{HH, HT, TH, TT\}\), and \(A\) = {HH, HT}, \(B\) = {TH, HH}, \(C\) = {TT, HH}. Next, \[ \mathbb{P}(A\cap B) = \mathbb{P}\{HH\} = \frac14 = \frac12\cdot \frac12 = \mathbb{P}(A)\cdot\mathbb{P}(B). \] It is obvious that these events are independent, because they depend on different tosses of the coin. In some other cases, it is not obvious. For example, \(A\) and \(C\) are also independent. Indeed, \[ \mathbb{P}(A\cap C) = \mathbb{P}\{HH\} = \frac14 = \frac12\cdot\frac12 = \mathbb{P}(A)\cdot\mathbb{P}(C). \] Similarly, \(B\) and \(C\) are independent.
A collateralized debt obligation (CDO) is backed by 5 subprime mortgages from California, each of which defaults with probability \(40\%\). A senior tranche defaults only if all 5 mortgages default. Find the probability of this in two cases: (a) the defaults are independent; (b) they default simultaneously because of a housing bubble. Solution: (a) \(0.4^5 \approx 1\%\); (b) \(40\%\). Case (b) is roughly what happened in 2007-8.
### Example 2.17.
Mary and Carlos are debugging code. Their results are independent. Mary can spot a mistake with probability \(60\%\), and Carlos with probability \(80\%\). What is the probability that this mistake remains? Answer: \((1 - 60\%)\cdot (1 - 80\%) = 0.08\).
A gene exhibits two alleles: dominant \(W\) and recessive \(w\), with \[ \mathbb{P}(WW) = 50\%,\, \mathbb{P}(Ww) = 20\%,\, \mathbb{P}(ww) = 30\%. \] What is the chance that the offspring is \(ww\)? The offspring needs to receive \(w\) from the father and \(w\) from the mother, and these two events are independent. The probability of getting \(w\) from the father is \[ \frac12\cdot 0.2 + 0.3 = 0.4. \] The probability of getting \(w\) from the mother is the same. Thus the final probability is \(0.4^2 = 0.16\).
Events \(A\), \(B\) and \(C\) are called independent if \[ \mathbb{P}(A\cap B) = \mathbb{P}(A)\mathbb{P}(B),\ \ \mathbb{P}(A\cap C) = \mathbb{P}(A)\mathbb{P}(C),\ \ \mathbb{P}(B\cap C) = \mathbb{P}(B)\mathbb{P}(C), \] \[ \mathbb{P}(A\cap B\cap C) = \mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C). \] This last condition is important, because it does not automatically follow from the first three conditions. For example, if \(A\), \(B\) and \(C\) are the events from the example, then \(A\) and \(B\) are independent, \(B\) and \(C\) are independent, \(A\) and \(C\) are independent, so these events are pairwise independent. But \(A\cap B\cap C\) = HH, so \[ \mathbb{P}(A\cap B\cap C) = \frac14 \ne \frac18 = \mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C). \]
Three clients, each has accident independently with probability \(20\%\). Let \(X\) be the number of total accidents. Then \(\mathbb{P}(X = 0) = (1 - 0.2)^3 = 0.512\), and \(\mathbb{P}(X = 3) = 0.2^3 = 0.008\).
Take a sample space \(S\).
Any function \(X : S \to \mathbb R\) is called a random variable. The distribution of a random variable is the list of its values, together with probabilities of each value.
For \(S\) = {HH, HT, TH, TT} (two coin tosses), let \(X\) be the number of Heads. Then \[ X(HH) = 2,\quad X(TH) = 1,\quad X(HT) = 1,\quad X(TT) = 0. \] The distribution, otherwise called the probability mass function, is \[ P(0) = \frac14,\quad P(1) = \frac12,\quad P(2) = \frac14. \]
Two dice are rolled, \(S = \{11, 12, \ldots, 66\}\). Let \(X\) be the absolute difference of the two numbers. Then \(X(12) = 1\), \(X(31) = 2\).
Some random variables are continuous. They take infinitely many values and have a density \(f(x)\): for all \(a, b\), \[ \mathbb P(a \le X \le b) = \int_a^bf(x)\,\mathrm{d}x. \] Thus the probability of getting between \(a\) and \(b\) is the area under the density curve. For such random variables, \(\mathbb P(X = c) = 0\) for all \(c\): any particular value is taken with probability \(0\). To be a density, a function needs to be nonnegative and to integrate to \(1\) over \((-\infty, +\infty)\). For example, \(f(x) = x\) or \(f(x) = x^2\) do not pass this test.
For an exponential random variable with unit mean, \[ f(x) = \begin{cases} e^{-x},\ x > 0;\\ 0,\, x \le 0. \end{cases} \] Then \(\mathbb P(X \le 1) = \int_0^1f(x)\,\mathrm{d}x = \int_0^1e^{-x}\,\mathrm{d}x = 1 - e^{-1}\).
Actually, random variables can be neither discrete nor continuous, but a mixture of both. Say, an accident happens with probability \(10\%\). In this case, losses are exponential, with distribution as above. Thus the amount \(X\) of losses is distributed as (for \(0 < a < b\)) \[ \mathbb{P}(X = 0) = 0.9,\quad \mathbb{P}(a \le X \le b) = \frac1{10}\int_a^be^{-x}\,\mathrm{d}x. \]
We have distribution of \(X\): \[ p(0) = 0.5,\, p(1) = 0.2,\, p(2) = p(3) = p(4) = 0.1 \] Find \(\mathbb P(X \ge 3\mid X \ge 1)\): this is equal to \[ \frac{\mathbb P(X \ge 3)}{\mathbb P(X \ge 1)} = \frac{0.1 + 0.1}{1 - \mathbb P(X = 0)} = \frac{0.2}{0.5} = 40\%. \]
A random variable \(X\) has density \(f(x) = 2x,\, 0 \le x \le 1\). Outside of this interval, the density is \(0\) (often we skip this \(0\) outside of effective interval). Then \[ \mathbb P(X \ge 0.5) = \int_{0.5}^12x\,\mathrm{d}x = \frac34. \]
For a random variable \(X\), its cumulative distribution function (CDF) is \(F_X(x) := \mathbb P(X \le x)\). It exists for every random variable, discrete, continuous, or otherwise (unlike PDF, which exists only for continuous distributions), and it uniquely determines the distribution: If \(F_X(x) = F_Y(x)\) for all \(x\), then random variables \(X\) and \(Y\) have the same distribution.
Take \(X\) which takes values \(0\) and \(1\) with equal probabilities \(0.5\). Then:
\[ F_X(x) = \begin{cases} 0,\, x < 0;\\ 1/2,\, 0 \le x < 1;\\ 1,\ x \ge 1. \end{cases} \]
Similarly, take \(X\) which takes values \(0, 1, 2\) with probabilities \(4/9, 4/9, 1/9\). Then: \[ F_X(x) = \begin{cases} 0,\, x < 0;\\ 4/9,\, 0 \le x < 1;\\ 8/9,\, 1 \le x < 2;\\ 1,\, x \ge 2. \end{cases} \]
We can see that for a discrete random variable \(X\), the CDF is piecewise constant. Jump points correspond to the values which \(X\) takes, and the size of the jump corresponds to the probability that this value is taken.
Let us solve the opposite problem: Take a cumulative distribution function of a random variable \(X\): \[ F_X(x) = \begin{cases} 0,\, x < 1;\\ 0.4,\, x \in [1, 2);\\ 1,\, x \ge 2. \end{cases} \] Let us find the distribution of \(X\): which values it takes and with which probabilities. It jumps at \(1\) and \(2\), and the size of jumps is \(0.4\) and \(0.6\) respectively. Thus \(\mathbb P(X = 1) = 0.4\) and \(\mathbb P(X = 2) = 0.6\).
For a continuous random variable \(X\) with density \(p_X\), we have \[ F_X(x) = \mathbb P(X \le x) = \int_{-\infty}^xp_X(y)\,\mathrm{d}y. \] Thus \(F_X\) is continuous. If \(p_X\) is continuous at the point \(x\), then \(F'_X(x) = p_X(x)\). That is, the derivative of the CDF is the PDF: the probability density function, or simply density.
Find the CDF for \(X\) with the density \[ p(x) = \begin{cases} 2e^{-2x},\, x \ge 0;\\ 0,\, x < 0. \end{cases} \] For \(x < 0\) the event \(\{X \le x\}\) is impossible, because \(X \ge 0\) always. Therefore, \(F_X(x) = \mathbb{P}(X \le x) = 0\). And for \(x \ge 0\) we have: \[ F_X(x) = \int_0^x(2e^{-2y})\,\mathrm{d}y = \left.\left(-e^{-2y}\right)\right|_{y=0}^{y=x} = 1 - e^{-2x}. \] Let us summarize: \[ F_X(x) = \begin{cases} 0,\, x < 0;\\ 1 - e^{-2x},\, x \ge 0. \end{cases} \] Unlike in the examples above, \(F_X(x) < 1\) for all \(x\), but \(F_X(x) \to 1\) as \(x \to +\infty\).
Find the distribution of a random variable \(X\) with CDF \[ F_X(x) = \begin{cases} 0,\, x < 0;\\ x^2,\, x \in [0, 1);\\ 1,\, x \ge 1. \end{cases} \] We have: \(\mathbb{P}(X \le 0) = F_X(0) = 0\), thus \(X > 0\) with probability \(1\). Next, \(\mathbb{P}(X \le 1) = F_X(1) = 1\), thus \(X \le 1\) with probability \(1\). Thus with probability \(1\), \(0 < X \le 1\). And for \(x \in (0, 1)\), the density is the derivative of \(F_X\): \(p_X(x) = (x^2)' = 2x\). Thus, \[ p_X(x) = \begin{cases} 2x,\, 0 \le x \le 1;\\ 0,\ \mbox{else}. \end{cases} \]
A common mistake is to assign, say, for a random variable \(X\) lying between \(0\) and \(2\), \(F_X(x) = 0\) for \(x \ge 2\). This confuses the CDF \(F_X\) with the PDF \(p_X\). While \(p_X(x) = 0\) in this case for \(x \le 0\) and \(x \ge 2\), for the CDF we have: \[ F_X(x) = \begin{cases} 0,\ x < 0;\\ 1,\ x \ge 2. \end{cases} \]
Take a random variable \(X : S \to \mathbb R\) on a sample space \(S\).
The expectation of \(X\) is the sum over all elementary outcomes: \[ \mathbb E\left[X\right] = \sum\limits_{s \in S}p(s)X(s). \] This is also called mean, or average, or expected value.
\(S\) = {HH, HT, TH, TT} (two fair coins). Let \(X\) be the number of Heads. Then \[ X(HH) = 2,\quad X(TH) = 1,\quad X(HT) = 1,\quad X(TT) = 0. \] Each of these 4 elementary outcomes has probability \(1/4\). The distribution, otherwise called the probability mass function, is \[ P(0) = \frac14,\quad P(1) = \frac12,\quad P(2) = \frac14. \] The expectation can be computed in two equivalent ways: \[ \mathbb E[X] = 0\cdot P(0) + 1\cdot P(1) + 2\cdot P(2) = \frac14\cdot X(TT) + \frac14 \cdot X(HT) + \frac14 \cdot X(TH) + \frac14\cdot X(HH) = 1. \]
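Both ways of computing the expectation (over the distribution, or over elementary outcomes) are easy to mirror in code; a sketch:

```python
# Over the probability mass function:
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
print(sum(x * p for x, p in pmf.items()))          # 1.0

# Over the elementary outcomes of two coin tosses:
outcomes = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}    # value of X on each outcome
print(sum(0.25 * x for x in outcomes.values()))    # 1.0
```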
In general, we have the following results.
Assume \(X\) takes values \(c_1, \ldots, c_n\), with probabilities \(p_1, \ldots, p_n\). Then the expectation is \[ \mathbb E[X] = c_1p_1 + \ldots + c_np_n. \]
If \(X\) is not discrete (that is, does not take finitely or countably many values), then it is harder to define expectation (we need measure theory). However, it is easy to compute it.
If \(X\) is continuous with density \(f\), then \[ \mathbb E[X] = \int_{-\infty}^{+\infty}xf(x)\,\mathrm{d}x. \] Indeed, split the real line into small subintervals \([y, y + \mathrm{d}y]\). Then \(\mathbb P(y \le X \le y + \mathrm{d}y) \approx f(y)\,\mathrm{d}y\), and on this interval \(X \approx y\), which contributes approximately \(yf(y)\,\mathrm{d}y\) to the expectation. Summing this over all such small intervals, we get the integral.
For a uniform random variable \(X\) on \([0, 1]\), the density is \(f(x) = 1,\, 0 \le x \le 1\). Then \[ \mathbb E[X] = \int_0^1xf(x)\,\mathrm{d}x = \int_0^1x\,\mathrm{d}x = \frac12. \]
Sometimes the expectation can be infinite. Take the density \(f(x) = x^{-2}\) for \(x \ge 1\). Then \[ \mathbb E[X] = \int_1^{\infty}xf(x)\,\mathrm{d}x = \int_1^{\infty}x^{-1}\,\mathrm{d}x = \left.\ln x\right|_{x=1}^{x = \infty} = +\infty. \]
(St Petersburg Paradox) A fair coin is flipped until the first tail. Win \(2\) if the tail appears on the first toss (which happens with probability \(1/2\)); win \(4\) if it appears on the second toss (which happens with probability \(1/4\)); etc. Thus \(\mathbb E[X] = 2\cdot \frac12 + 4\cdot \frac14 + \ldots = \infty\), so the expected value of this game is infinite. But would you pay, say, \(1000\) to play it?
Alaska Airlines flight Reno–Seattle on an Embraer 175 has 64 economy class seats. Each ticket costs 100$. The airline overbooks: it sells one additional ticket for a full flight. If all passengers show up, the airline needs to refund 100$ and give a 150$ voucher to the bumped passenger. Each passenger shows up independently with probability 98%. Is the airline better off overbooking? Solution: The airline gains 100$ if at least one passenger does not show up. But it loses 150$ if all show up, which happens with probability \(0.98^{64} \approx 0.27\). Thus it is better to overbook: the expected gain is \[ 100\cdot (1 - 0.27) + (-150)\cdot 0.27 > 0. \]
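A sketch of the expected-gain computation under this simplified model (all numbers as in the example):

```python
p_show = 0.98
p_all_show = p_show ** 64                           # probability everyone shows up, 0.98^64
expected_gain = 100 * (1 - p_all_show) - 150 * p_all_show
print(p_all_show, expected_gain)                    # about 0.27 and about +31 dollars
```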
For a function \(g : \mathbb R \to \mathbb R\), we get: \[ \mathbb E[g(X)] = \begin{cases} \int_{-\infty}^{+\infty}g(y)f(y)\,\mathrm{d}y,\\ \sum_cg(c)\mathrm{P}(X = c), \end{cases} \] if \(X\) has density \(f\) or discrete distribution, respectively.
For \(X\) with density \(f(x) = e^{-x},\, x > 0\), we get: \[ \mathbb E\bigl[e^{X/2}\bigr] = \int_0^{\infty}e^{x/2}e^{-x}\,\mathrm{d}x = \int_0^{\infty}e^{-x/2}\,\mathrm{d}x = 2. \]
This quantity measures the concentration of random variable around its mean.
For a random variable \(X\), if \(\mu = \mathbb E[X]\), then \(\mathrm{Var}(X) = \mathbb E(X - \mu)^2\) is called variance.
This quantity is always nonnegative, and it can be zero only when \(X = \mu\) is constant, not random.
One can compute variance as follows: \(\mathrm{Var}(X) = \mathbb E[X^2] - (\mathbb E X)^2\).
Note that \((X - \mu)^2 = X^2 - 2X\mu + \mu^2\). Take expectation and get: \(\mathbb E[X^2] - 2\mu\mathbb E[X] + \mu^2 = \mathbb E[X^2] - 2\mu\cdot\mu + \mu^2 = \mathbb E[X^2] - \mu^2\).
For a uniform random variable \(X\) on \([0, 1]\), its density is \(f(x) = 1\) for \(0 \le x \le 1\). Thus \[ \mathbb E[X^2] = \int_0^1x^2f(x)\,\mathrm{d}x = \int_0^1x^2\,\mathrm{d}x = \frac13. \] We can compute variance: \(\mathrm{Var}(X) = 1/3 - (1/2)^2 = 1/12\).
Let \(X\) be the number of Heads in two coin tosses. Its distribution is \(p(0) = 1/4\), \(p(1) = 1/2\), \(p(2) = 1/4\). Thus \[ \mathbb E[X^2] = \frac14\cdot 0^2 + \frac12\cdot 1^2 + \frac14\cdot 2^2 = \frac32, \] and the variance is equal to \(3/2 - 1^2 = 1/2\).
For a continuous random variable \(X\), its median is defined as the \(m\) such that \[ \mathbb P(X \ge m) = \int_m^{\infty}f(x)\,\mathrm{d}x = \frac12. \]
\(f(x) = e^{-x},\, x \ge 0\). Then \(\mathbb P(X \ge m) = e^{-m} = 1/2\) and thus \(m = \ln 2\). This is different from mean \(\mathbb E[X] = 1\).
For discrete random variables, it is sometimes not possible to find \(m\) such that \(\mathbb P(X \ge m) = \mathbb P(X \le m) = 1/2\). Instead, we start from the smallest value and add probabilities until we reach or exceed \(1/2\). If we exceed \(1/2\) strictly at some value, that value is the median; if we hit exactly \(1/2\) at some value, the median is conventionally taken as the midpoint between this value and the next one.
For the distribution \(p(-2) = 0.2\), \(p(-1) = 0.1\), \(p(0) = 0.4\), \(p(2) = 0.2\), \(p(3) = 0.1\), the cumulative probability is \(0.2 + 0.1 = 0.3\) at \(-1\) and \(0.2 + 0.1 + 0.4 = 0.7 > 1/2\) at \(0\). Thus the median is \(0\).
For the distribution \(p(-2) = 0.2\), \(p(-1) = 0.1\), \(p(0) = 0.2\), \(p(2) = 0.2\), \(p(3) = 0.3\), the cumulative probability at \(0\) is exactly \(0.2 + 0.1 + 0.2 = 0.5\), so \(\mathbb P(X \le 0) = \mathbb P(X \ge 2) = 1/2\). Thus the median is \((0 + 2)/2 = 1\).
A mode \(x\) is the \(x\) such that \(f(x)\) is maximal, or \(\mathrm{P}(X = x)\) is maximal.
For example, for \(f(x) = e^{-x},\, x \ge 0\), the mode is \(0\).
For a random variable \(X \ge 0\), the function \(S(c) = \mathbb{P}(X > c)\) is called the survival or tail function. Let \(X\) be a lifetime with density \(f\). The mortality rate is defined as \[ m(x) = \frac{f(x)}{S(x)}, \] which has the meaning \[ m(x)\,\mathrm{d}x = \mathbb P(x < X < x + \mathrm{d}x\mid X > x). \] This is the conditional probability that I will die within the additional time \(\mathrm{d}x\), given that I have survived until \(x\). These survival functions and mortality rates are available in actuarial tables of the Social Security Administration.
\(X\) | \(Y\) | \(P\) |
---|---|---|
0 | 0 | 1/2 |
1 | 1 | 1/6 |
2 | 1.3 | 1/12 |
3 | 1.6 | 1/20 |
4 | 1.9 | 1/30 |
\(\ge 5\) | 2 | \(p\) |
The probability \(p\) is equal to \[ p = 1 - \frac12 - \frac16 - \frac1{12} - \frac1{20} - \frac1{30} = \frac16. \] We can see that the expected out-of-pocket payment is \[ \mathbb E[Y] = 0\cdot\frac12 + 1\cdot\frac16 + 1.3\cdot\frac1{12} + 1.6\cdot\frac1{20} + 1.9\cdot \frac1{30} + 2\cdot \frac16 \approx 0.75. \]
Here we take functions of continuous random variables with known densities. We find densities of new random variables. Indeed, take \(X\) with \(f_X(x) = 2e^{-2x},\, x \ge 0\). Find density of \(X^2 = Y\). There are two ways: change-of-variables and CDF.
Change of variables. For every \(0 < a < b\), \[ \mathbb P(a \le X \le b) = \int_a^bf_X(x)\,\mathrm{d}x. \] Try to do the same for \(Y\) instead of \(X\): \[ \mathbb P(a \le Y \le b) = \mathbb P\left(a^{1/2} \le X \le b^{1/2}\right) = \int_{a^{1/2}}^{b^{1/2}}f_X(x)\,\mathrm{d}x. \] Next, change variables in the integral: \[ y^{1/2} = x;\ a^{1/2} \le x \le b^{1/2} \Leftrightarrow a \le y \le b;\ \mathrm{d}x = \frac12 y^{-1/2}\,\mathrm{d}y. \] Thus we get: \[ \int_{a^{1/2}}^{b^{1/2}}f_X(x)\,\mathrm{d}x = \int_a^b2e^{-2y^{1/2}}\frac12 y^{-1/2}\,\mathrm{d}y. \] Combining this, we get: \[ \mathbb P(a \le Y \le b) = \int_a^be^{-2y^{1/2}}y^{-1/2}\,\mathrm{d}y. \] Since this is true for all \(a < b\), we get that the density of \(Y\) is \[ f_Y(y) = e^{-2y^{1/2}}y^{-1/2},\ y > 0. \]
CDF. Compute CDF for \(X\): \(\mathrm{P}(X \le c) = 1 - e^{-2c},\, c > 0\). Thus \[ \mathbb P(Y \le c) = \mathbb P(X \le c^{1/2}) = 1 - e^{-2c^{1/2}},\, c > 0. \] The derivative of this function is the density of \(Y\): \[ f_Y(c) = \left(1 - e^{-2c^{1/2}}\right)' = -(-2c^{1/2})'e^{-2c^{1/2}} = c^{-1/2}e^{-2c^{1/2}},\ c > 0. \]
The first method seems to be better when you cannot compute the CDF explicitly by integrating density.
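Either answer can be sanity-checked by simulation: sample \(X\) with density \(2e^{-2x}\), set \(Y = X^2\), and compare an interval probability with the CDF found above. A sketch (sampling by inversion of the CDF; the trial count is my own choice):

```python
import math
import random

trials = 200_000
count = 0
for _ in range(trials):
    x = -math.log(1 - random.random()) / 2   # inverse-CDF sample from density 2*exp(-2x)
    count += 0.25 <= x * x <= 1.0            # event {0.25 <= Y <= 1} for Y = X^2
print(count / trials)                                                # empirical
print(math.exp(-2 * math.sqrt(0.25)) - math.exp(-2 * math.sqrt(1)))  # exact: F_Y(1) - F_Y(0.25)
```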
Take two random variables \(X\) and \(Y\) with joint distribution: If they are discrete, they have joint probability mass function \[ \mathbb P(X = x,\, Y = y) = p(x, y). \] If they are continuous, they have joint density or joint probability density function \(f(x, y)\) such that for all \(a, b, c, d\), \[ \mathbb{P}(a \le X \le b,\, c \le Y \le d) = \int_a^b\int_c^df(x, y)\,\mathrm{d}y\,\mathrm{d}x. \] To qualify as joint density, a function needs to be nonnegative: \(f \ge 0\), and integrate up to \(1\) over the coordinate plane \(\mathbb R^2\): \[ \iint_{\mathbb R^2}f(x, y)\,\mathrm{d}x\,\mathrm{d}y = 1. \]
Toss a coin twice. Let \(X = 1\) if first Heads, \(X = 0\) if first Tails, and \(Y\) be the number of Heads.
\(s\) | HH | HT | TH | TT |
---|---|---|---|---|
\(X\) | 1 | 1 | 0 | 0 |
\(Y\) | 2 | 1 | 1 | 0 |
\(P\) | 0.25 | 0.25 | 0.25 | 0.25 |
Take joint density \(f(x, y) = cx^2y\) if \(x^2 \le y \le 1\) (and \(0\) for other \(x\) and \(y\)) for some constant \(c\). Find \(c\) and \(\mathbb{P}(X \ge Y)\). Solution: \[ 1 = c\int_{-1}^1\int_{x^2}^1x^2y\,\mathrm{d}y\,\mathrm{d}x = c\int_{-1}^1\left.\frac{x^2y^2}2\right|_{y=x^2}^{y=1}\,\mathrm{d}x = \frac c2\int_{-1}^1\left(x^2 - x^6\right)\,\mathrm{d}x = \frac c2\left.\left(\frac{x^3}3 - \frac{x^7}7\right)\right|_{x=-1}^{x=1} = \frac{4c}{21}, \] so \(c = \frac{21}4\). Within the support, \(X \ge Y\) happens exactly when \(x^2 \le y \le x\), so \[ \mathbb{P}(X \ge Y) = \frac{21}{4}\int_{0}^1\int_{x^2}^xx^2y\,\mathrm{d}y\,\mathrm{d}x = \frac{21}4\int_{0}^1x^2\left.\frac{y^2}2\right|_{y=x^2}^{y=x}\,\mathrm{d}x = \frac{21}4\int_0^1\frac{x^2}2\left(x^2 - x^4\right)\,\mathrm{d}x = \frac{21}8\int_0^1\left(x^4 - x^6\right)\,\mathrm{d}x = \frac{21}8\cdot\frac2{35} = \frac3{20}. \]
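Both numbers can be verified with a crude numerical double integral over the region \(x^2 \le y \le 1\); a sketch (the grid step is my own choice, so the results are only approximate):

```python
h = 0.002                      # grid step for a midpoint Riemann sum
total = 0.0                    # integral of x^2 * y over the support
favorable = 0.0                # same integral restricted to x >= y
x = -1 + h / 2
while x < 1:
    y = x * x + h / 2
    while y < 1:
        w = x * x * y * h * h
        total += w
        if x >= y:
            favorable += w
        y += h
    x += h
print(1 / total)               # close to c = 21/4 = 5.25
print(favorable / total)       # close to P(X >= Y) = 3/20 = 0.15
```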
For any function \(g(x, y)\), we have: \[ \mathbb E g(X, Y) = \begin{cases} \sum_{x, y}g(x, y)\mathbb{P}(X = x,\, Y = y),\\ \iint_{\mathbb R^2}g(x, y)f(x, y)\,\mathrm{d}y\,\mathrm{d}x. \end{cases} \] if \((X, Y)\) are jointly discrete or jointly continuous with density \(f\).
For \(f(x, y) = x + y\), \(0 \le x \le 1\), \(0 \le y \le 1\), we get: \[ \mathbb E[XY] = \int_0^1\int_0^1xy(x+y)\,\mathrm{d}y\,\mathrm{d}x = \int_0^1\left.\left(\frac{x^2y^2}2 + \frac{xy^3}3\right)\right|_{y=0}^{y=1}\,\mathrm{d}x \] \[ = \int_0^1\left(\frac{x^2}2 + \frac{x}3\right)\,\mathrm{d}x = \frac13. \]
Take random variables \(X\) and \(Y\).
If \(X\) and \(Y\) are discrete, they are called independent if for all values \(x\) and \(y\) which they assume, events \(\{X = x\}\) and \(\{Y = y\}\) are independent: \[ \mathbb{P}(X = x,\, Y = y) = \mathbb{P}(X = x)\cdot \mathbb{P}(Y = y). \] If they are continuous, they are called independent if for all \(a, b, c, d\), the following events are independent: \(\{a \le X \le b\}\) and \(\{c \le Y \le d\}\).
The latter is equivalent to joint density having product form: \(f(x, y) = f_X(x)f_Y(y)\), where \(f_X\) and \(f_Y\) are marginal densities of \(X\) and \(Y\), respectively.
For independent \(X\) and \(Y\), we have: \(\mathbb E[XY] = \mathbb EX\cdot \mathbb EY\).
Say \(X\) and \(Y\) are discrete. Then apply the above to \(g(x, y) = xy\). \[ \mathbb E[XY] = \sum_{x, y}xy\,\mathbb{P}(X = x,\, Y = y) = \sum_{x, y}xy\,\mathbb{P}(X = x)\,\mathbb{P}(Y = y) \] \[ = \sum_xx\,\mathbb{P}(X = x)\cdot\sum_yy\,\mathbb{P}(Y = y) = \mathbb E X \cdot\mathbb E Y. \] Similarly it can be proved for continuous \(X\) and \(Y\).
For joint density \(f(x, y) = 6xy^2\), \(0 \le x \le 1\), \(0 \le y \le 1\). This is product form: \(2x\cdot 3y^2\). Thus \(X\) and \(Y\) are independent. We have: \[ \mathbb E[XY] = \mathbb E X \cdot\mathbb E Y = \int_0^12x\cdot x\,\mathrm{d}x \cdot \int_0^13y^2\cdot y\,\mathrm{d}y = \frac23\cdot\frac34 = \frac12. \]
For \(f(x, y) = x + y\), \(0 \le x \le 1\), \(0 \le y \le 1\), random variables \(X\) and \(Y\) are not independent, because the density is not product form.
For \(f(x, y) = 2\), \(0 \le x \le 1\), \(x \le y \le 1\), the density is not of product form, because the effective domain is not a rectangle.
These are measures of dependency between two random variables.
For two random variables \(X\) and \(Y\), their covariance is defined as \[ \mathrm{Cov}(X, Y) = \mathbb E[XY] - \mathbb E X \cdot \mathbb E Y. \] Their correlation is defined as \[ \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var} X}\sqrt{\mathrm{Var} Y}}. \]
Positive covariance/correlation implies that on average, if \(X\) increases, \(Y\) also increases. Negative covariance/correlation implies the reverse. Correlation is always between \(-1\) and \(+1\). If it is \(+1\), then \(Y\) is a linear function of \(X\): \(Y = aX + b\), with \(a > 0\). If it is \(-1\), then \(Y = aX + b\) with \(a < 0\).
Take random variables \(X\) and \(Y\) with joint distribution as in the table below.
\(X\) | 1 | 2 | 3 | 4 |
---|---|---|---|---|
\(Y\) | 0 | 1 | 1 | 2 |
\(P\) | 0.2 | 0.3 | 0.2 | 0.3 |
Then we have: \[ \mathbb E[X] = 1\cdot 0.2 + 2\cdot 0.3 + 3\cdot 0.2 + 4\cdot 0.3 = 2.6, \] \[ \mathbb E[Y] = 0\cdot 0.2 + 1\cdot 0.3 + 1\cdot 0.2 + 2\cdot 0.3 = 1.1, \] \[ \mathbb E[XY] = 1\cdot 0 \cdot 0.2 + 2\cdot 1\cdot 0.3 + 3\cdot 1\cdot 0.2 + 4\cdot 2\cdot 0.3 = 3.6. \] Thus \(\mathrm{Cov}(X, Y) = 3.6 - 2.6\cdot 1.1 = 0.74\). Next, \[ \mathrm{Var} X = \mathbb E[X^2] - (\mathbb E X)^2 = 0.2\cdot 1^2 + 0.3\cdot 2^2 + 0.2\cdot 3^2 + 0.3\cdot 4^2 - 2.6^2 = 1.24; \] \[ \mathrm{Var} Y = \mathbb E[Y^2] - (\mathbb E Y)^2 = 1.7 - 1.1^2 = 0.49. \] Thus the correlation is almost perfect: \[ \rho(X, Y) = \frac{0.74}{\sqrt{1.24}\sqrt{0.49}} = 0.95. \]
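The same covariance and correlation computation as a sketch:

```python
import math

table = [(1, 0, 0.2), (2, 1, 0.3), (3, 1, 0.2), (4, 2, 0.3)]   # (x, y, probability)
ex  = sum(x * p for x, y, p in table)
ey  = sum(y * p for x, y, p in table)
exy = sum(x * y * p for x, y, p in table)
vx  = sum(x * x * p for x, y, p in table) - ex ** 2
vy  = sum(y * y * p for x, y, p in table) - ey ** 2
cov = exy - ex * ey
print(cov, cov / math.sqrt(vx * vy))    # 0.74 and about 0.95
```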
If \(X\) and \(Y\) are independent then they are uncorrelated: \(\rho(X, Y) = 0\). But the converse is not true: \(\mathrm{Cov}(X, Y) = 0\) does not mean \(X\) and \(Y\) are independent.
Take \((X, Y)\) uniformly distributed on four points: \((0, 1), (1, 0), (0, -1), (-1, 0)\). Then (because \(XY = 0\) always) \[ \mathbb E[X] = \mathbb E[Y] = 0,\quad \mathbb E[XY] = 0. \] Thus \(\mathrm{Cov}(X, Y) = 0 - 0\cdot 0 = 0\). But \(X\) and \(Y\) are dependent: If \(X = 1\), then must be \(Y = 0\).
For all (not necessarily independent) \(X\) and \(Y\), \[ \mathrm{Var}(X+Y) = \mathrm{Var} X + \mathrm{Var} Y + 2\mathrm{Cov}(X, Y). \] In particular, for independent \(X\) and \(Y\) we have: \[ \mathrm{Var}(X + Y) = \mathrm{Var} X + \mathrm{Var} Y. \]
If \(\mathbb E X = \mathbb E Y = 0\), we have: \[ \mathrm{Var}(X+Y) = \mathbb E[(X+Y)^2] = \mathbb E(X^2 + 2XY + Y^2) = \mathbb E[X^2] + \mathbb E[Y^2] + 2\mathbb E[XY]. \] But \(\mathrm{Var} X = \mathbb E[X^2]\), \(\mathrm{Var} Y = \mathbb E[Y^2]\), and \(\mathbb E[XY] = \mathrm{Cov}(X, Y)\). This completes the proof. The case of general \(\mathbb E X\) and \(\mathbb E Y\) is similar.
Similarly we can get a more general linear combination: \(\mathrm{Var}(2X - 3Y + 1) = \mathrm{Var}(2X) + \mathrm{Var}(-3Y) + 2\mathrm{Cov}(2X, -3Y) = 4\mathrm{Var}(X) + 9\mathrm{Var}(Y) - 12\mathrm{Cov}(X, Y)\).
Invest in 3 independent stocks with returns \(S_1, S_2, S_3\) with means \(2, 1, 3\) and variances \(4, 3, 2\). Invest in these stocks in proportion \(2:1:1\): the total return is \[ R = \frac12S_1 + \frac14S_2 + \frac14S_3;\ \mathbb E R = \frac12\cdot 2 + \frac14\cdot 1 + \frac14\cdot 3 = 2;\ \mathrm{Var}(R) = \frac14\cdot 4 + \frac1{16}\cdot 3 + \frac1{16}\cdot 2 = \frac{21}{16}. \]
Two stocks with returns \(X\) and \(Y\), with \(\mathbb E X = 2\) and \(\mathrm{Var} X = 3\), \(\mathbb E Y = 1\) and \(\mathrm{Var} Y = 1\), and \(\rho(X, Y) = 0.5\). Invest in them equally; then the total return is \(R = (X+Y)/2\). We get: \[ \mathbb E R = \frac12(2 + 1) = 1.5,\ \mathrm{Var} R = \frac14\mathrm{Var}(X+Y) = \frac14(\mathrm{Var} X + \mathrm{Var} Y + 2\mathrm{Cov}(X, Y)) = \frac14(3 + 1 + 2\cdot\sqrt{3}\sqrt{1}\cdot 0.5) = 1 + \frac{\sqrt{3}}4 \approx 1.4. \] Thus diversification decreased the variance, and with it the risk, compared with investing everything in the first stock.
If we know the joint distribution of \((X, Y)\), then the distribution of only \(X\) or only \(Y\) is called a marginal distribution. If \((X, Y)\) are jointly discrete, then the marginal distribution of \(X\) can be found by summing the probabilities over all \(Y\), and vice versa: \[ \mathrm{P}(X = x) = \sum_y\mathrm{P}(X = x,\, Y = y);\quad \mathrm{P}(Y = y) = \sum_x\mathrm{P}(X = x,\, Y = y). \] If \((X, Y)\) are jointly continuous with density \(f\), we get: \[ f_X(x) = \int_{-\infty}^{+\infty}f(x, y)\,\mathrm{d}y;\quad f_Y(y) = \int_{-\infty}^{+\infty}f(x, y)\,\mathrm{d}x. \]
Consider \((X, Y)\) with joint distributions \[ \mathbb P(X = Y = 0) = 0.2,\ \mathbb P(X = 1,\, Y = 0) = 0.4,\ \mathbb P(X = Y = 1) = 0.4. \] Then the marginal distributions of \(X\) and \(Y\) are: \[ \mathrm{P}(X = 0) = 0.2,\, \mathrm{P}(X = 1) = 0.8;\quad \mathrm{P}(Y = 0) = 0.6,\, \mathrm{P}(Y = 1) = 0.4. \]
Consider \((X, Y)\) with joint density \(f(x, y) = x + y\), \(0 \le x \le 1\), \(0 \le y \le 1\). Then density of \(Y\) is \[ f_Y(y) = \int_0^1(x+y)\,\mathrm{d}x = y + \frac12,\ 0 \le y \le 1. \]
For \(f(x, y) = 2\), \(0 \le x \le 1\), \(x \le y \le 1\), we have: \[ f_X(x) = \int_x^12\,\mathrm{d}y = 2(1 - x),\ 0 \le x \le 1;\quad f_Y(y) = \int_0^y2\,\mathrm{d}x = 2y,\ 0 \le y \le 1. \]
Take i.i.d. random variables \(X_1,\ldots, X_n\) with CDF \(F_X(x) = \mathbb P(X \le x)\). Find the CDF of the minimum \(m_n = \min(X_1, \ldots, X_n)\) and the maximum \(M_n = \max(X_1, \ldots, X_n)\).
For the maximum, we can say that \(M_n \le x\) is equivalent to \(X_1 \le x\) and \(X_2 \le x\) and so on up to \(X_n \le x\). The CDF of \(M_n\) is given by \[ F_n(x) = \mathbb P(M_n \le x) = \mathbb P(X_1 \le x, \ldots, X_n \le x) = \mathbb P(X_1 \le x)\cdots\mathbb P(X_n \le x) = F_X^n(x). \] By differentiating this CDF, if this is a continuous random variable, we can get the density.
If \(X, Y \sim \mathrm{Exp}(\lambda)\) i.i.d. (waiting time), then \(F_X(x) = 1 - e^{-\lambda x}\) for \(x > 0\), and \(\max(X, Y)\) has CDF \((1 - e^{-\lambda x})^2\) for \(x > 0\). Thus density is \[ 2\lambda e^{-\lambda x}(1 - e^{-\lambda x}),\ x > 0. \]
For the minimum, we cannot say this, because \(m_n \le x\) is equivalent to \(X_1 \le x\) or \(X_2 \le x\), and so on: here we have OR instead of AND. Thus we need to invert: \(m_n > x\) is equivalent to \(X_1 > x\) and \(X_2 > x\) and so on. We can compute tail functions instead of CDFs: \(1 - F_X(x) = \mathbb P(X > x)\). \[ G_n(x) = \mathbb P(m_n \le x) = 1 - \mathbb P(m_n > x) = 1 - \mathbb P(X_1 > x, \ldots, X_n > x) = 1 - (1 - F_X(x))^n. \] Thus \(1 - G_n(x) = (1 - F_X(x))^n\).
If \(X, Y \sim \mathrm{Exp}(\lambda)\) i.i.d. (waiting time), then \(1 - F_X(x) = e^{-\lambda x}\) for \(x > 0\), and \(\min(X, Y)\) has tail function \((e^{-\lambda x})^2 = e^{-2\lambda x}\) for \(x > 0\). Thus the density is minus the derivative of the tail function: \[ 2\lambda e^{-2\lambda x},\quad x > 0. \]
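A quick simulation check that the minimum of two independent \(\mathrm{Exp}(\lambda)\) variables has the \(\mathrm{Exp}(2\lambda)\) tail found above (\(\lambda = 1\) here; a sketch):

```python
import math
import random

trials = 100_000
lam, x = 1.0, 0.5
count = sum(
    min(random.expovariate(lam), random.expovariate(lam)) > x
    for _ in range(trials)
)
print(count / trials)            # empirical P(min(X, Y) > 0.5)
print(math.exp(-2 * lam * x))    # exact tail exp(-2*lambda*x), about 0.368
```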
Recall how to find conditional probability from Section 2.
Toss a fair coin twice. Let \(A\) be the event that there are two Heads. Let \(Y = 1\) if the first toss is H, \(0\) otherwise. Then \[ \mathbb{P}(A\mid Y = 1) = \frac12, \] because this is the probability that the second toss is H. Next, if the first toss is not H, then we cannot have both H; thus, \[ \mathbb{P}(A\mid Y = 0) = 0. \] We can write this as \[ \mathbb{P}(A\mid Y) = \begin{cases} \mathbb{P}(A\mid Y = 0),\ Y = 0;\\ \mathbb{P}(A\mid Y = 1),\ Y = 1; \end{cases} = \quad \begin{cases} 0,\ Y = 0;\\ \frac12,\ Y = 1; \end{cases} = \frac Y2. \]
The random variable \(\mathbb{P}(A\mid Y)\) is called the conditional probability of \(A\) given \(Y\). We can also have conditional probability depending on many random variables.
Let \(X, Y, Z\) be random variables with the following joint distribution:
\(X\) | \(Y\) | \(Z\) | Prob. |
---|---|---|---|
\(0\) | \(0\) | \(0\) | \(1/8\) |
\(1\) | \(0\) | \(0\) | \(1/8\) |
\(0\) | \(1\) | \(2\) | \(1/8\) |
\(-1\) | \(1\) | \(1\) | \(1/8\) |
\(0\) | \(1\) | \(1\) | \(1/2\) |
Find \(\mathbb{P}(X = 0\mid Y, Z)\). We have: \[ \mathbb{P}(X = 0\mid Y = Z = 0) = \frac{1/8}{1/8 + 1/8} = \frac12. \] \[ \mathbb{P}(X = 0\mid Y = 1,\, Z = 2) = 1. \] \[ \mathbb{P}(X = 0\mid Y = Z = 1) = \frac{1/2}{1/2 + 1/8} = \frac45. \] In other words, \[ \mathbb{P}(X = 0\mid Y, Z) = \begin{cases} \frac12,\ Y = Z = 0;\\ 1,\ Y = 1,\, Z = 2;\\ \frac45,\ Y = Z = 1; \end{cases} \]
This is defined similarly to the conditional probability.
In the previous example, let us find \(\mathbb{E}(X\mid Y, Z)\). We have: \[ \mathbb{E}(X\mid Y = Z = 0) = \frac{(1/8)\cdot 0 + (1/8)\cdot 1}{(1/8) + (1/8)} = \frac12, \] \[ \mathbb{E}(X\mid Y = 1,\, Z = 2) = 0, \] \[ \mathbb{E}(X\mid Y = Z = 1) = \frac{(-1)\cdot (1/8) + 0\cdot (1/2)}{(1/8) + (1/2)} = -\frac15. \]
Toss a fair coin twice, and let \(X\) be the number of Heads, and let \(Y = 1\) if the first toss is H, \(Y = 0\) otherwise. If \(Y = 0\), then \(X = 0\) or \(X = 1\), each with probability \(1/2\). If \(Y = 1\), then \(X = 1\) or \(X = 2\), each with probability \(1/2\). Then \[ \mathbb{E}(X\mid Y = 0) = 0\cdot \frac12 + 1\cdot\frac12 = \frac12, \] \[ \mathbb{E}(X\mid Y = 1) = 1\cdot\frac12 + 2\cdot\frac12 = \frac32. \] We can represent this as \[ \mathbb{E}(X\mid Y) = \begin{cases} \frac12,\ Y = 0;\\ \frac32,\ Y = 1; \end{cases} = \frac12 + Y. \]
Take random variables \(X, Y, Z_1, \ldots, Z_n\).
\(\mathbb{E}(X + Y\mid Z_1, \ldots, Z_n) = \mathbb{E}(X\mid Z_1, \ldots, Z_n) + \mathbb{E}(Y\mid Z_1, \ldots, Z_n)\);
If \(X\) is a function of \(Z_1, \ldots, Z_n\), then \(\mathbb{E}(X\mid Z_1,\ldots, Z_n) = X\), because if you know \(Z_1, \ldots, Z_n\), you already know the exact value of \(X\).
If \(X\) is independent of \(Z_1, \ldots, Z_n\), then \(\mathbb{E}(X\mid Z_1, \ldots, Z_n) = \mathbb{E} X\), because knowledge of \(Z_1, \ldots, Z_n\) does not give us any additional information about the distribution of \(X\).
If \(X\) is a function of \(Z_1, \ldots, Z_n\), then \(\mathbb{E}(XY\mid Z_1, \ldots, Z_n) = X\mathbb{E}(Y\mid Z_1, \ldots, Z_n)\). Indeed, when you try to predict the value of \(XY\) based on the information from \(Z_1,\ldots, Z_n\), then you already know the value of \(X\), and you can assume \(X\) is just a constant, and put it outside of the expectation sign.
Take i.i.d. \(X, Y, Z \sim \mathrm{Poi}(1)\). Then \(\mathbb{E} X = \mathrm{Var} X = 1\), thus \(\mathbb{E} X^2 = (\mathbb{E} X)^2 + \mathrm{Var} X = 2\), and the same for \(Y, Z\). Therefore, \[ \mathbb{E}(X + 2Y + Z\mid Z) = \mathbb{E}(X\mid Z) + 2\mathbb{E}(Y\mid Z) + \mathbb{E}(Z\mid Z) = \mathbb{E} X + 2\mathbb{E} Y + Z = \boxed{3 + Z} \] \[ \mathbb{E}((2X + Z - 1)^2\mid Z) = \mathbb{E}(4X^2 + 4XZ - 4X + Z^2 - 2Z + 1\mid Z) = 4\mathbb{E}(X^2\mid Z) + 4\mathbb{E}(XZ\mid Z) - 4\mathbb{E}(X\mid Z) + \mathbb{E}(Z^2 - 2Z + 1\mid Z) \] \[ = 4\mathbb{E} X^2 + 4Z\mathbb{E} X - 4\mathbb{E} X + Z^2 - 2Z + 1 = 8 + 4Z - 4 + Z^2 - 2Z + 1 = \boxed{Z^2 + 2Z + 5} \]
This is the distribution of a random variable \(X\) given \(Z_1, \ldots, Z_n\), which consists of all probabilities \(\mathbb{P}(X = x\mid Z_1, \ldots, Z_n)\), where \(x\) is a possible value of the random variable \(X\).
Let \(S = \{1, 2, 3, 4, 5\}\), and \(p(1) = 0.5,\ p(2) = 0.2,\ p(3) = p(4) = p(5) = 0.1\). Let \[ X(\omega) = \begin{cases} 1,\ \omega = 1;\\ 0,\ \omega = 2, 3, 4, 5; \end{cases} \ \ Y(\omega) = \begin{cases} 1,\ \omega = 1, 2, 3;\\ 0,\ \omega = 4, 5; \end{cases} \ \ Z(\omega) = \begin{cases} 1,\ \omega = 1, 2, 3, 4;\\ 0,\ \omega = 5. \end{cases} \] Let us find the law (the joint distribution) of \((X, Y)\) given \(Z\). If \(Z = 1\), then this event has probability \[ \mathbb{P}(Z = 1) = p(1) + p(2) + p(3) + p(4) = 0.9, \] and therefore we can calculate \[ \mathbb{P}(X = 1, Y = 1\mid Z = 1) = \frac{p(1)}{\mathbb{P}(Z = 1)} = \frac{0.5}{0.9} = \frac59, \] \[ \mathbb{P}(X = 0, Y = 1\mid Z=1) = \frac{p(2)+p(3)}{\mathbb{P}(Z=1)} = \frac{0.3}{0.9} = \frac13, \] \[ \mathbb{P}(X = 0, Y = 0\mid Z=1) = \frac{p(4)}{\mathbb{P}(Z=1)} = \frac{0.1}{0.9} = \frac19. \] Also, if \(Z = 0\), then \(X = Y = 0\) with probability \(1\), but \[ \mathbb{P}(X = 0, Y = 0\mid Z=0) = 1. \]
Assume \((X, Y)\) have joint density \(f(x, y)\). The marginal density of \(X\) is given by \(f_X(x) = \int f(x, y)\mathrm{d} y\), and the marginal density of \(Y\) is given by \(f_Y(y) = \int f(x, y)\mathrm{d} x\). If we know that \(X = x\), then \(Y\) has conditional density \[ f_{Y\mid X}(y\mid x) = \frac{f(x, y)}{f_X(x)}. \] Similarly, if \(Y = y\), then \(X\) has conditional density \[ f_{X\mid Y}(x\mid y) = \frac{f(x, y)}{f_Y(y)}. \] We can also calculate the conditional expectation: \[ \mathbb E(Y\mid X = x) = \int yf_{Y\mid X}(y\mid x)\mathrm{d} y. \]
Take the density \(f(x, y) = x + y\) for \(0 \le x \le 1\) and \(0 \le y \le 1\). Then \[ f_X(x) = \int_0^1(x + y)\mathrm{d} y = \left.\left(xy + \frac{y^2}2\right)\right|_{y=0}^{y=1} = x + \frac12, \] \[ f_{Y\mid X}(y\mid x) = \frac{x+y}{x + 1/2},\ 0 \le y \le 1. \] For example, if \(x = 1/4\), then \[ f_{Y\mid X}(y\mid x) = \frac{1/4 + y}{3/4} = \frac13 + \frac43y,\ 0 \le y \le 1, \] \[ \mathbb E\left(Y\mid X = \frac14\right) = \int_0^1\left(\frac13 + \frac43y\right)y\mathrm{d} y = \left.\left(\frac16y^2 + \frac49y^3\right)\right|_{y=0}^{y=1} = \frac16 + \frac49 = \frac{11}{18}. \]
A moment generating function (MGF) for a random variable \(X\) is defined as \[ M_X(t) = \mathbb E\left[e^{tX}\right]. \]
For the random variable which assumes values \(\pm1\) with equal probability \(1/2\), this MGF is given by \[ M_X(t) = \frac12\cdot e^{1t} + \frac12\cdot e^{-1t} = \frac12\left(e^t + e^{-t}\right). \]
If \(X\) is the number of Heads in two coin tosses, then the law of \(X\) is \[ \mathbb P(X = 0) = \frac14,\quad \mathbb P(X = 1) = \frac12,\quad \mathbb P(X = 2) = \frac14. \] Thus \(\mathbb E\left[e^{tX}\right] = \frac14e^{0t} + \frac12e^{1t} + \frac14e^{2t} = \frac14(1 + 2e^t + e^{2t})\).
Let \(X\) be the number of tosses needed to get one Head. Then \[ \mathbb P(X = 1) = \mathbb P(H) = \frac12,\quad \mathbb P(X = 2) = \mathbb P(TH) = \frac14, \ldots, \mathbb P(X = n) = 2^{-n}. \] Recall: \(\sum_{n=1}^{\infty} a^n = a/(1-a)\) for \(|a| < 1\). Thus the MGF is \[ \mathbb E\left[e^{tX}\right] = \sum_{n=1}^{\infty}2^{-n}e^{tn} = \sum_{n=1}^{\infty}(e^t/2)^n = \frac{e^t/2}{1 - e^t/2},\quad e^t < 2. \]
\(X\) has density \(f(x) = 2e^{-2x},\ x > 0\). Thus, for \(t < 2\), \[ M_X(t) = \mathbb E\left[e^{tX}\right] = \int_0^{\infty}f(x)e^{tx}\,\mathrm{d}x = 2\int_0^{\infty}e^{-(2-t)x}\,\mathrm{d}x = \frac2{2-t}. \] Recall that \(\int_0^{\infty}e^{-ax}\,\mathrm{d}x = \frac1a\) for \(a > 0\).
We have: \(\mathbb E\left[X\right] = M'_X(0)\), and more generally, for \(k = 1, 2, \ldots\) we have: \(\mathbb E\left[X^k\right] = M^{(k)}_X(0)\).
\(M'_X(t) = \mathbb E\left[(e^{tX})'\right] = \mathbb E\left[Xe^{tX}\right]\). Letting \(t = 0\), we get: \(M'_X(0) = \mathbb E\left[Xe^0\right] = \mathbb E\left[X\right]\). Similarly for \(k \ge 2\).
For the previous example, \[ \mathbb E\left[X\right] = M'_X(0) = \left.\left(\frac2{2-t}\right)'\right|_{t=0} = \left.\frac2{(2-t)^2}\right|_{t=0} = \frac12. \] This is easier to compute than integrate by parts \[ \mathbb E\left[X\right] = \int_0^{\infty}2xe^{-2x}\,\mathrm{d}x. \] Next, the second moment is \[ \mathbb E\left[X^2\right] = M''_X(0) = \left.\frac{2\cdot 2}{(2 - t)^3}\right|_{t=0} = \frac12. \] Thus \(\mathrm{Var} X = \mathbb E\left[X^2\right] - (\mathbb E X)^2 = 1/4\).
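The same differentiation can be done symbolically; here is a sketch with sympy for the MGF \(M_X(t) = 2/(2-t)\) from the previous example:

```python
import sympy as sp

t = sp.symbols('t')
M = 2 / (2 - t)                      # MGF of the density f(x) = 2 e^{-2x}, x > 0

EX = sp.diff(M, t).subs(t, 0)        # first moment:  1/2
EX2 = sp.diff(M, t, 2).subs(t, 0)    # second moment: 1/2
print(EX, EX2, EX2 - EX**2)          # variance: 1/4
```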
For \(M_X(t) = \frac23e^{-2t} + \frac13e^t\), we have: \(\mathbb P(X = 1) = 1/3\) and \(\mathbb P(X = -2) = 2/3\).
For two independent random variables \(X\) and \(Y\), \(M_{X+Y}(t) = M_X(t)M_Y(t)\).
We have: \(\mathbb E\left[e^{t(X+Y)}\right] = \mathbb E\left[e^{tX}e^{tY}\right] = \mathbb E\left[e^{tX}\right] \mathbb E\left[e^{tY}\right]\).
For the sum of two independent random variables, each with density \(f(x) = 2e^{-2x},\, x > 0\), we get: \(M(t) = (2/(2-t))^2\).
Consider a sequence of Bernoulli trials: each trial results in success or failure with probabilities \(p\) and \(q\) respectively, where \(p + q = 1\). All trials are independent.
Toss a coin 10 times. Then Heads is success, Tails is failure, and \(p = q = 1/2\).
Roll a die. Success is 6, all others are failures. Then \(p = 1/6\) and \(q = 5/6\).
The stock market, as measured by the Standard & Poor's 500 index, rises or falls each day. Assume daily movements are independent. Then \(p = 53.29\%\) and \(q = 46.71\%\), estimated from daily movements over 1996–2016.
A binomial random variable is the number of successes in \(N\) trials, with probability \(p\) of success. The distribution is denoted by \(\mathrm{Bin}(N, p)\).
The distribution of \(X \sim \mathrm{Bin}(N, p)\) is: Can take values \(0, 1, \ldots, N\), with \[ \mathbb P(X = k) = \binom{N}{k}p^kq^{N-k},\quad k = 0, \ldots, N. \]
The binomial coefficient \(\binom{N}{k}\) is the number of ways to choose which \(k\) of the \(N\) trials are successful. In other words, this is the quantity of subsets of \(k\) elements in \(\{1, 2, \ldots, N\}\). Each such choice has probability \(p^kq^{N-k}\).
\(N = 4\) and \(k = 2\). Then there are 6 choices: SSFF, SFSF, SFFS, FSSF, FSFS, FFSS. Each choice has probability \(p^2q^2\). Thus \(\mathbb P(X = 2) = 6p^2q^2\).
For \(X \sim \mathrm{Bin}(N, p)\), we have: \(\mathbb E X = Np\) and \(\mathrm{Var}(X) = Npq\).
Let \(X = X_1 + \ldots + X_N\), where \(X_i = 1\) if the \(i\)th trial is a success and \(X_i = 0\) if it is a failure. Then \(\mathbb E X_i = 1\cdot p + 0\cdot q = p\) and \(\mathbb E [X_i^2] = 1^2\cdot p + 0^2\cdot q = p\), thus \(\mathrm{Var} X_i = p - p^2 = pq\). Therefore, \[ \mathbb E X = \mathbb E X_1 + \ldots + \mathbb E X_N = p + p + \ldots + p = Np. \] Since \(X_1, \ldots, X_N\) are independent, we have: \[ \mathrm{Var} X = \mathrm{Var} X_1 + \ldots + \mathrm{Var} X_N = pq + \ldots + pq = Npq. \]
Let \(X \sim \mathrm{Bin}(4, 0.3)\). Then \(\mathbb E X = 4\cdot 0.3 = 1.2\) and \(\mathrm{Var} X = 4\cdot 0.3\cdot 0.7 = 0.84\).
The MGF for \(X \sim \mathrm{Bin}(N, p)\) is \(M_X(t) = (pe^t + q)^N\).
We have: \(M_{X_i}(t) = \mathbb E\left[e^{tX_i}\right] = pe^{1t} + qe^{0t} = pe^t + q\). Thus by independence \[ M_X(t) = M_{X_1}(t)\ldots M_{X_N}(t) = (pe^t+q)^N. \]
An IT center uses 9 disk drives. Each disk is out of service, independently of others, with probability 6%. The instruction stipulates that at least 7 drives must be available. What is the probability of this? The number of working drives is \(X \sim \mathrm{Bin}(9, 0.94)\). Thus \[ \mathbb P(X \ge 7) = \mathbb P(X = 7) + \mathbb P(X = 8) + \mathbb P(X = 9) = \binom{9}{7}0.94^70.06^2 + \binom{9}{8}0.94^80.06 + \binom{9}{9}0.94^9 \approx 0.986. \]
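This binomial tail is easy to check numerically; a sketch with `scipy.stats` reproducing the \(0.986\) above:

```python
from scipy.stats import binom

# Number of working drives: X ~ Bin(9, 0.94); we need P(X >= 7).
n, p = 9, 0.94
print(sum(binom.pmf(k, n, p) for k in (7, 8, 9)))   # approximately 0.986
print(binom.sf(6, n, p))                            # same: P(X > 6) = P(X >= 7)
```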
An insurance company has 10 clients, each has an accident independently of others, with probability 20%. Then the number of accidents is \(X \sim \mathrm{Bin}(10, 0.2)\), and \(\mathbb P(X = 2) = \binom{10}{2}0.2^20.8^8 \approx 0.302\).
For a sequence of independent Bernoulli trials with probabilities \(p\) and \(q\) of success and failure, let \(X\) be the number of trials until the first success. Then \(X \sim \mathrm{Geo}(p)\). This is called a geometric distribution.
For such \(X\), \(\mathbb P(X = n) = pq^{n-1},\, n = 1, 2, \ldots\)
\(X = n\) means first \(n-1\) failures and \(1\) success: FF…FS, which has probability \(q\cdot q\cdot\ldots\cdot q\cdot p = pq^{n-1}\).
The probability that your first Heads appears on toss \(n\) is equal to \((1/2)^n\), since \(p = q = 1/2\).
The moment generating function is \[ M_X(t) = \frac{pe^t}{1 - qe^t}. \]
We have: \(M_X(t) = \sum_{n \ge 1}pq^{n-1}e^{tn} = pe^t\sum_{n \ge 1}(qe^t)^{n-1}\), and now recall that \(\sum_{n = 0}^{\infty}a^n = 1/(1 - a)\) for \(|a| < 1\).
For \(X \sim \mathrm{Geo}(p)\), we have \(\mathbb E[X] = 1/p\) and \(\mathrm{Var} X = q/p^2\).
Take \(M'_X(0) = \mathbb E[X] = 1/p\) and \(M''_X(0) = \mathbb E[X^2] = (1+q)/p^2\). Thus \(\mathrm{Var} X = q/p^2\).
If \(X\) is the number of tosses until the first Heads, then \(X \sim \mathrm{Geo}(1/2)\), and \[ M_X(t) = \frac{e^t/2}{1 - e^t/2},\quad \mathbb E X = 2,\quad \mathrm{Var} X = 2. \]
How many times \(X\) do you need to toss a coin to get \(r\) Heads, where \(r\) is a fixed parameter? This is called the negative binomial distribution \(\mathrm{NB}(r, p)\). For example, if \(r = 3\) and the sequence is TTTHTHH, then \(X = 7\). What is \(\mathbb{P}(X = n)\)? If the \(n\)th toss resulted in the \(r\)th Heads, another way to say this is that the first \(n-1\) tosses contain \(r-1\) Heads and \(n-r\) Tails, and the last, \(n\)th toss resulted in Heads. The probability that the first \(n-1\) tosses resulted in \(r-1\) Heads (and \(n-r\) Tails) is \[ \binom{n-1}{r-1}p^{r-1}q^{n-r}. \] Indeed, there are \[ \binom{n-1}{r-1} \] choices for slots in the first \(n-1\) tosses occupied by Heads. Each of these choices has probability \(p^{r-1}q^{n-r}\). Finally, the probability of the last toss being Heads is \(p\). So \[ \boxed{ \mathbb{P}(X = n) = \binom{n-1}{r-1}p^{r}q^{n-r} } \] This is true for \(n = r, r + 1, r + 2, \ldots\). The random variable \(X \sim \mathrm{NB}(r, p)\) can take only values greater than or equal to \(r\), because to get \(r\) Heads, you need at least \(r\) tosses. One can alternatively describe this distribution as the number of Bernoulli trials one needs to get to the \(r\)th success. For \(r = 1\), this becomes the Geometric distribution with parameter \(p\).
For the geometric and negative binomial distributions, there is an ambiguity in the definition. Sometimes they are defined as the number of Tails (failures), as opposed to the number of tosses, until your first (respectively, \(r\)th) H. Thus defined, the negative binomial random variable is shifted down by \(r\) and takes values \(0, 1, 2, \ldots\) We define this distribution as the number of tosses until your \(r\)th H.
### Example 7.10.
Let \(p = 1/2\), \(r = 2\). Then \[
\mathbb{P}(X = 3) = \binom{3-1}{2-1}\left(\frac12\right)^2\left(1 - \frac12\right) = 2\cdot\frac14\cdot\frac12 = \frac14.
\] We can equivalently get this as follows: to get two Heads in exactly three tosses, we need either THH or HTH. Each of these sequences has probability \(1/8\). So the resulting probability is \((1/8) + (1/8) = 1/4\).
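One can check this against scipy's negative binomial, keeping in mind the ambiguity discussed above: `scipy.stats.nbinom` counts failures before the \(r\)th success, so \(X = 3\) tosses with \(r = 2\) corresponds to \(1\) Tails. A sketch:

```python
from scipy.stats import nbinom

r, p = 2, 0.5
n = 3
# Our X counts tosses until the 2nd Heads; scipy's nbinom counts Tails only,
# so P(X = n) = nbinom.pmf(n - r, r, p).
print(nbinom.pmf(n - r, r, p))   # 0.25
```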
For \(X \sim \mathrm{NB}(r, p)\), we have: \[ X = X_1 + \ldots + X_r,\ \ X_1, \ldots, X_r \sim \mathrm{Geo}(p)\ \mbox{i.i.d.} \] Recall that i.i.d. means independent identically distributed. Indeed, to get to the \(r\)th Heads, we need to get the first Heads, which required geometric number \(X_1\) of tosses; then we need to get from first to second Heads, which required also geometric number \(X_2\) of tosses. This second number is independent of the first one, because the coin does not have a memory, etc. Therefore, \[ \mathbb E X = \mathbb E X_1 + \ldots + \mathbb E X_r = \frac{r}{p}, \] and using independence, we get: \[ \mathrm{Var} X = \mathrm{Var} X_1 + \ldots + \mathrm{Var} X_r = \frac{qr}{p^2}. \]
The negative binomial distribution and the binomial distribution answer somewhat opposite questions. For the binomial distribution, you have a fixed number of trials, and you ask how many of them are successful. For the negative binomial distribution, you have a fixed number of successful trials, and you ask how many trials you need to get them.
Take a parameter \(\lambda > 0\).
A Poisson distribution \(X \sim \mathrm{Poi}(\lambda)\) is given by \[ \mathbb{P}(X = k) = \frac{\lambda^k}{k!}e^{-\lambda},\quad k = 0, 1, 2, \ldots \]
For \(X \sim \mathrm{Poi}(2)\), we have: \(\mathbb{P}(X \ge 1) = 1 - \mathbb{P}(X = 0) = 1 - e^{-2}\); \(\mathbb{P}(X \ge 2) = 1 - \mathbb{P}(X = 0) - \mathbb{P}(X = 1) = 1 - e^{-2} - 2e^{-2} = 1 - 3e^{-2}\).
Consider a binomial distribution \(\mathrm{Bin}(N, p)\). If \(N \to \infty\), and \(Np = \lambda\), then \(\mathrm{Bin}(N, p) \to \mathrm{Poi}(\lambda)\). More precisely, for every fixed \(k = 0, 1, 2, \ldots\), we have as \(N \to \infty\): \[ \binom{N}{k}p^kq^{N-k} \to \frac{\lambda^k}{k!}e^{-\lambda}. \]
Note that \(p = \lambda/N\), \(q = 1 - \lambda/N\), and \[ \binom{N}{k} = \frac{N!}{k!(N-k)!} = \frac{N(N-1)\ldots(N-k+1)}{k!}. \] Therefore, \[ \binom{N}{k}p^kq^{N-k} = \frac{N(N-1)\ldots(N-k+1)}{k!}\left(\frac{\lambda}N\right)^k\left(1 - \frac{\lambda}N\right)^{N-k} \] \[ = \frac{\lambda^k}{k!}\cdot\frac{N(N-1)\ldots(N-k+1)}{N^k}\cdot\left(1 - \frac{\lambda}N\right)^{N-k}. \] The first factor \(\lambda^k/k!\) is constant (does not depend on \(N\)); the second factor is \[ \frac{N(N-1)\ldots(N-k+1)}{N^k} = \frac{N}{N}\cdot\frac{N-1}N\cdot\ldots\cdot\frac{N-k+1}N \to 1, \] because each of these fractions tends to one: indeed, \(N \to \infty\), and \(k\) is fixed. Finally, the third factor tends to \(e^{-\lambda}\), because \[ \log\left(1 - \frac{\lambda}N\right)^{N-k} = (N-k)\log\left(1 - \frac{\lambda}N\right) \backsim (N-k)\left(-\frac{\lambda}N\right) = -\lambda\frac{N-k}N \to -\lambda, \] and then we exponentiate both sides. We use the tangent line approximation: \[ \log(1 + x) \backsim x,\ x \to 0, \] which is true: let \(f(x) = \log(1 + x)\); then \(f'(x) = 1/(1 + x)\), and so \(f(0) = 0\), \(f'(0) = 1\), and \[ f(x) \backsim f(0) + f'(0)x = x,\ \ x \to 0. \] This is the meaning of the Poisson distribution: it is approximately the number of occurrences among many possible events, each of which is very rare.
A company which sells flood insurance has \(N = 20000\) clients, and each client has probability of flood \(p = 0.01\%\), independently of others. What is the distribution of the number \(X\) of floods? It is \(X \sim \mathrm{Bin}(N, p)\), but this is hard to use for calculations: for example, \[ \mathbb{P}(X = 10) = \binom{20000}{10}\left(\frac1{10000}\right)^{10}\left(\frac{9999}{10000}\right)^{19990}. \] Use the Poisson approximation: \(X\) is approximately distributed as \(\mathrm{Poi}(2)\), because \(\lambda = Np = 2\). So \[ \mathbb{P}(X = 10) \approx \frac{2^{10}}{10!}e^{-2}. \] Also, \[ \mathbb{P}(X \ge 3) = 1 - \mathbb{P}(X = 0) - \mathbb{P}(X = 1) - \mathbb{P}(X = 2) \] \[ = 1 - e^{-2} - \frac{2^1}{1!}e^{-2} - \frac{2^2}{2!}e^{-2} = 1 - e^{-2} - 2e^{-2} - 2e^{-2} = 1 - 5e^{-2}. \] A slightly more general statement is true. If we have \(N\) (a large number of) independent events, occurring with probabilities \(p_1, \ldots, p_N\), then the number of events which happened is approximately \(\mathrm{Poi}(\lambda)\) with \(\lambda = p_1 + \ldots + p_N\). So the Poisson distribution is the law of rare events: if we have many events, each of which is very unlikely to occur, then the number of events is approximately Poisson.
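The quality of the Poisson approximation is easy to inspect numerically; here is a sketch comparing the exact binomial probabilities with the Poisson ones for the flood example:

```python
from scipy.stats import binom, poisson

N, p = 20000, 1e-4          # 20000 clients, flood probability 0.01%
lam = N * p                 # lambda = 2

for k in (0, 1, 2, 10):
    # exact binomial pmf versus Poisson pmf: the pairs nearly coincide
    print(k, binom.pmf(k, N, p), poisson.pmf(k, lam))

print(1 - poisson.cdf(2, lam))   # P(X >= 3) = 1 - 5 e^{-2}, approximately 0.323
```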
In one city, independently of each other, \(N_1 = 1000\) houses have fire with probability \(p_1 = .2\%\) and \(N_2 = 500\) houses have fire with probability \(p_2 = .6\%\). Then the number of fires is approximately \(\mathrm{Poi}(\lambda)\) with \(\lambda = N_1p_1 + N_2p_2 = 5\).
Another application is a sequence of events, where an event happens with probability \(\lambda\triangle t\) during \([t, t + \triangle t]\), independently of other time intervals. Then the number of events during \([0, T]\) is \(\sim \mathrm{Poi}(\lambda T)\). Indeed, split \([0, T]\) into \(N\) equal subintervals of length \(T/N\), and apply the above law of rare events to \(p = \lambda(T/N)\). Then \(Np = \lambda T\).
The rate for \(\alpha\)-particle emission is \(\lambda \approx 3.87\) (with unit of time 1/8 of minute). Then during 2 minutes we have \(\mathrm{Poi}(3.87\cdot 16)\) particles.
We define first the standard normal distribution and then the general normal distribution.
A standard normal random variable has density \(f_Z(z) = (2\pi)^{-1/2}\exp(-z^2/2)\). For a standard normal \(Z\), we define normal \(X \sim \mathcal N(\mu, \sigma^2)\) as \(X = \mu + \sigma Z\).
The MGF for \(Z \sim \mathcal N(0, 1)\) is given by \[ M_Z(t) = \int_{-\infty}^{+\infty}e^{tz}f_Z(z)\,\mathrm{d}z = e^{t^2/2}. \] The MGF for \(X \sim \mathcal N(\mu, \sigma^2)\) is given by \[ M_X(t) = \exp\left[\mu t + \frac{\sigma^2t^2}2\right]. \]
First, we compute the MGF for the standard normal law: \[ M_Z(t) = \frac1{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{tz}e^{-z^2/2}\mathrm{d} z = \frac1{\sqrt{2\pi}}e^{t^2/2}\int_{-\infty}^{\infty}e^{-t^2/2+tz - z^2/2}\mathrm{d} z \] \[ = \frac{e^{t^2/2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-(z-t)^2/2}\mathrm{d} z = \frac{e^{t^2/2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-u^2/2}\mathrm{d} u = \frac{e^{t^2/2}}{\sqrt{2\pi}}\sqrt{2\pi} = e^{t^2/2}. \] (we changed variables \(u = z - t\)). Next, we can represent \(X = \mu + \sigma Z\), and \[ M_X(t) = \mathbb E\left[e^{tX}\right] = \mathbb E\left[e^{t\mu + t\sigma Z}\right] = e^{t\mu}\cdot\mathbb E\left[e^{t\sigma Z}\right] = e^{t\mu}e^{(t\sigma)^2/2} = \exp(t\mu + t^2\sigma^2/2). \]
For \(X \sim \mathcal N(-2, 2)\) we get: \(M_X(t) = e^{-2t + t^2}\).
For \(X \sim \mathcal N(\mu, \sigma^2)\), we have: \(\mathbb E X = \mu\) and \(\mathrm{Var} X = \sigma^2\).
We compute for standard normal \(Z\): \(\mathbb E Z = M_Z'(0) = 0\), \(\mathbb E Z^2 = M''_Z(0) = 1\), and thus \(\mathrm{Var} Z = 1\). Therefore \(\mathbb E[X] = \mu + \sigma \mathbb E[Z] = \mu\), and \(\mathrm{Var}(X) = \sigma^2\mathrm{Var}(Z) = \sigma^2\).
The density of \(\mathcal N(\mu, \sigma^2)\) is given by \[ f(y) = \frac1{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right). \]
Indeed, for all \(a\) and \(b\), \[ \mathbb P(a \le X \le b) = \mathbb P(a \le \mu + \sigma Z \le b) = \mathbb P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right) = \frac1{\sqrt{2\pi}}\int\limits_{(a - \mu)/\sigma}^{(b - \mu)/\sigma}e^{-z^2/2}\, \mathrm{d}z. \] Now let us change variables: \[ z = \frac{y - \mu}{\sigma};\ \mathrm{d}z = \frac{\mathrm{d}y}{\sigma};\ \ a \le y \le b. \] Therefore, \[ \frac1{\sqrt{2\pi}}\int\limits_{(a - \mu)/\sigma}^{(b - \mu)/\sigma}e^{-z^2/2}\, \mathrm{d}z = \frac1{\sqrt{2\pi}\sigma}\int\limits_a^be^{-(y - \mu)^2/(2\sigma^2)}\,\mathrm{d}y = \int_a^bf(y)\, \mathrm{d}y. \] This proves that \(X\) has density \(f(y)\), as shown above.
### Example 7.16.
For \(X \sim \mathcal N(1, 4)\), find \(\mathbb E\left[(X-2)^3\right]\): \(X = 1 + 2Z\) with \(Z \sim \mathcal N(0, 1)\), and \((X - 2)^3 = (2Z - 1)^3 = 8Z^3 - 12Z^2 + 6Z - 1\). Taking the expected value and using \(\mathbb E Z = \mathbb E Z^3 = 0\) (by symmetry of the standard normal distribution), \(\mathbb E Z^2 = 1\), we get: \(\mathbb E\left[(X-2)^3\right] = -12 - 1 = \boxed{-13}\)
For \(X \sim \mathcal N(1, 4)\), find \(\mathbb P(X \le 3)\): \(X = 1 + 2Z\) with \(Z \sim \mathcal N(0, 1)\), and \(\mathbb P(X \le 3) = \mathbb P(1 + 2Z \le 3) = \mathbb P(Z \le 1) = 0.8413\)
Find a normal distribution with mean \(1\) and \(99\%\) quantile \(4.5\). We can represent this random variable as \(X = 1 + \sigma Z\), where \(\sigma\) needs to be found. Because \[ 0.99 = \mathbb{P}(X \le 4.5) = \mathbb{P}(1 + \sigma Z \le 4.5) = \mathbb{P}(Z \le \sigma^{-1}3.5), \] from the Z-table at the end of these lecture notes we get: \(\sigma^{-1}3.5 = x_{99\%} = 2.326\), so \(\sigma = 3.5/2.326 = 1.505\), and \(X \sim \boxed{\mathcal N(1, 1.505^2)}\)
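The quantile \(x_{99\%}\) can also be taken from software instead of the Z-table; a sketch:

```python
from scipy.stats import norm

z99 = norm.ppf(0.99)      # approximately 2.326, the 99% quantile of N(0, 1)
sigma = 3.5 / z99         # approximately 1.505
print(z99, sigma)
print(norm.cdf(4.5, loc=1, scale=sigma))   # check: approximately 0.99
```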
The exponential distribution is sometimes used to model the lifespan of a device, when the remaining lifespan is independent of how long the device has worked so far. Take a \(\lambda > 0\).
A random variable \(X\) with density \(f(x) = \lambda e^{-\lambda x}\), \(x > 0\), is called exponential with rate \(\lambda\). It is denoted as \(\mathrm{Exp}(\lambda)\).
This distribution has the following properties: for \(a > 0\), \[ \mathbb{P}(X \ge a) = \int_a^{\infty}\lambda e^{-\lambda x}\, \mathrm{d}x = \left.\left(-e^{-\lambda x}\right)\right|_{x=a}^{x = \infty} = e^{-\lambda a}, \] and thus the CDF is given by \[ F_X(u) = \begin{cases} 1 - e^{-\lambda u},\, u > 0;\\ 0,\, u \le 0. \end{cases} \]
This distribution has memoryless property: for \(s, t > 0\), \(\mathbb{P}(X > t + s \mid X > s) = \mathbb{P}(X > t)\).
Indeed, for any \(t, s > 0\), we have: \[ \mathbb P(X > t + s \mid X > s) = \frac{\mathbb P(X > t + s,\, X > s)}{\mathbb P(X > s)} = \frac{\mathbb P(X > t + s)}{\mathbb P(X > s)} = \frac{e^{-\lambda(s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = \mathbb P(X > t). \]
Let \(X \sim \mathrm{Exp}(\lambda)\). Then we have: \(f_X(x) = \lambda e^{-\lambda x}\) for \(x > 0\). Thus for \(t < \lambda\), \[ M_X(t) = \mathbb E\left[e^{tX}\right] = \int_0^{\infty}e^{tx}f_X(x)\,\mathrm{d}x = \lambda\int_0^{\infty}e^{-(\lambda - t)x}\,\mathrm{d}x = \frac{\lambda}{\lambda - t}. \] Thus we can find \[ \mathbb E\left[X\right] = M'_X(0) = \left.\left(\frac{\lambda}{\lambda - t}\right)'\right|_{t = 0} = \left.\frac{\lambda}{(\lambda - t)^2}\right|_{t=0} = \frac1{\lambda}, \] and similarly \(\mathbb E\left[X^2\right] = M''_X(0) = 2/\lambda^2\), \(\mathrm{Var} X = \lambda^{-2}\).
Find an exponential distribution such that \(\mathbb{P}(X \ge 2) = 0.4\). We have: \(e^{-2\lambda} = 0.4 \Leftrightarrow \lambda = -\ln(0.4)/2 = 0.46\).
Calculate the CDF of \(X \sim \mathrm{Exp}(2)\). For \(x \le 0\), we have: \(F(x) = 0\), because \(X \ge 0\) always. For \(x > 0\), we have: \(F(x) = \int_0^x2e^{-2u}\,\mathrm{d} u = \left.(-e^{-2u})\right|_{u=0}^{u=x} = 1 - e^{-2x}\). Therefore, \[ F(x) = \begin{cases} 0,\, x \le 0;\\ 1 - e^{-2x},\, x \ge 0. \end{cases} \]
Take \(n\) i.i.d. \(X_1,\ldots, X_n \sim \mathrm{Exp}(\lambda)\). Then \(X = X_1 + \ldots + X_n\) has a Gamma distribution, denoted as \(\Gamma(n, \lambda)\).
### Theorem 7.12.
The MGF of \(X \sim \Gamma(n, \lambda)\) is \[ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^n. \] Its mean and variance are \[ \mathbb E[X] = \frac n{\lambda},\quad \mathrm{Var}(X) = \frac n{\lambda^2}. \]
This MGF is computed as \(M_X(t) = M_{X_1}(t)\cdots M_{X_n}(t)\), using independence. The mean and variance are computed as sums of the means and variances of \(X_1, \ldots, X_n\).
The density of \(\Gamma(n, \lambda)\) is equal to \[ f(x) = \frac{\lambda^n}{(n-1)!}x^{n-1}e^{-\lambda x},\quad x > 0. \]
Note that \[ n! = I_n := \int_0^{\infty}x^ne^{-x}\mathrm{d} x. \] Why? Integrate \(I_n\) by parts: \[ I_n = -\int_0^{\infty}x^{n}\mathrm{d} e^{-x} = -\left.x^{n}e^{-x}\right|_{x=0}^{x = \infty} + \int_0^{\infty}e^{-x}\mathrm{d}\left(x^{n}\right) = 0 + n\int_0^{\infty}x^{n-1}e^{-x}\mathrm{d} x = nI_{n-1}. \] Combining that with \(I_0 = \int_0^{\infty}e^{-x}\mathrm{d} x = 1\), we have: \(I_n = n!\). Doing change of variables \(\lambda x = y\): \[ \int_0^{\infty}x^{n-1}e^{-\lambda x}\,\mathrm{d}x = \frac{(n-1)!}{\lambda^{n}}. \] Applying this to \(\lambda - t\) instead of \(\lambda\), we get: \[ \int_0^{\infty}f(x)e^{tx}\,\mathrm{d}x = \frac{\lambda^n}{(n-1)!}\int_0^{\infty}x^{n-1}e^{tx}e^{-\lambda x}\,\mathrm{d}x \] \[ = \frac{\lambda^n}{(n-1)!}\frac{(n-1)!}{(\lambda - t)^{n}} = \left(\frac{\lambda}{\lambda - t}\right)^n. \]
We compare three modes: almost surely, in probability, and in law (distribution).
Take random variables \(Z_0, Z_1, Z_2,\ldots\) Then \(Z_n \to Z_0\) in probability if for all \(\varepsilon > 0\) we have: \(\mathbb P(|Z_n - Z_0| > \varepsilon) \to 0\) as \(n \to \infty\).
But there is another mode of convergence:
\(Z_n \to Z_0\) in law, or in distribution, if \(F_{Z_n}(x) = \mathbb P(Z_n \le x) \to F_{Z_0}(x) = \mathbb P(Z_0 \le x)\) for all \(x\) where \(F_{Z_0}\) is continuous.
The third convergence mode is almost surely: \(\mathbb P(Z_n \to Z_0) = 1\).
In particular, if \(Z_0\) is continuous, then \(F_{Z_n}(x) \to F_{Z_0}(x)\) as \(n \to \infty\) for all \(x\). Thus for all \(a < b\), \[ \mathbb P(a \le Z_n \le b) = F_{Z_n}(b) - F_{Z_n}(a) \to \mathbb P(a \le Z_0 \le b) = F_{Z_0}(b) - F_{Z_0}(a). \]
\(Z_n = 1/n\) and \(Z_0 = 0\) (constant random variables). Then \(Z_n \to Z_0\) in law, because \[ F_{Z_n}(x) = \begin{cases} 1,\ x \ge 1/n;\\ 0,\ x < 1/n; \end{cases} \quad F_{Z_0}(x) = \begin{cases} 1,\ x \ge 0;\\ 0,\ x < 0. \end{cases} \] Thus \(F_{Z_n}(x) \to F_{Z_0}(x)\) for \(x \ne 0\). Indeed, for \(x > 0\), we have \(x \ge 1/n\) for sufficiently large \(n\), and \(F_{Z_n}(x) = 1 = F_{Z_0}(x)\). And for \(x < 0\), \(F_{Z_n}(x) = 0 = F_{Z_0}(x)\). But \(F_{Z_n}(0) = 0 \nrightarrow 1 = F_{Z_0}(0)\). This is why we need to exclude points where \(F_{Z_0}\) is discontinuous.
Convergence in probability implies convergence in distribution, but not vice versa.
Toss a fair coin. Let \(Z_n = 1\) if Heads, \(Z_n = 0\) if Tails, for odd \(n\); vice versa when \(n\) is even. Then all \(Z_n\) have the same distribution: \(\mathbb P(Z_n = 0) = \mathbb P(Z_n = 1) = 1/2\). Thus \(Z_n \to Z_0\) in law. But not in probability: \(|Z_n - Z_0| = 1\) for \(n\) odd.
Convergence almost surely is stronger than convergence in probability.
Take a nonnegative random variable \(X \ge 0\).
For every \(a > 0\), we have: \[ \mathbb{P}(X \ge a) \le \frac{\mathbb{E} X}{a}. \]
Consider another random variable: \[ Y = \begin{cases} a,\ X \ge a;\\ 0,\ 0 \le X < a. \end{cases} \] Then always \(Y \le X\), so \(\mathbb{E} Y \le \mathbb{E} X\). But \(\mathbb{E} Y = a\mathbb{P}(X \ge a) + 0\mathbb{P}(0 \le X < a) = a\mathbb{P}(X \ge a)\). This gives us \[ a\cdot\mathbb{P}(X \ge a) \le \mathbb{E} X \Rightarrow \mathbb{P}(X \ge a) \le \frac{\mathbb{E} X}{a}. \]
Consider the exponential random variable \(X \sim \mathrm{Exp}(1)\). Then \(\mathbb{P}(X \ge 10) \le \frac{\mathbb{E} X}{10} = \frac1{10}\), because \(\mathbb{E} X = 1\). In reality, \(\mathbb{P}(X \ge 10) = e^{-10}\), which is much smaller than \(1/10\). Markov’s inequality gives us very rough and crude estimates; it is a one-size-fits-all tool. But it can be applied to any nonnegative random variable.
This measures large deviations of \(X\) from its mean.
If \(\mathbb{E} X = \mu\), then for every \(b > 0\) we have: \[ \mathbb{P}(|X - \mu| \ge b) \le \frac{\mathrm{Var} X}{b^2}. \]
Apply Markov’s inequality to \((X - \mu)^2\) instead of \(X\) and \(b^2\) instead of \(a\). We get: \[ \mathbb{P}(|X - \mu| \ge b) = \mathbb{P}((X - \mu)^2 \ge b^2) \le \frac{\mathbb{E}(X - \mu)^2}{b^2} = \frac{\mathrm{Var} X}{b^2}. \]
For \(X \sim \mathcal N(4, 1)\), \(\mathbb E X = 4\), \(\mathrm{Var} X = 1\), and thus \(\mathbb{P}(|X - 4| > 2) \le 1/4\). The event \(|X - 4| > 2\) is equivalent to \(X < 2\) or \(X > 6\).
Let \(X_1, X_2, \ldots\) be iid (independent identically distributed) random variables with
\[
\mathbb E X_1 = \mathbb E X_2 = \ldots = \mu,\quad \mathrm{Var} X_1 = \mathrm{Var} X_2 = \ldots = \sigma^2.
\]
Let \(S_N = X_1 + \ldots + X_N\). Then we have convergence in probability: \[ \frac{S_N}N \to \mu,\ N \to \infty, \] in the sense that for every \(\varepsilon > 0\) we have: \[ \mathbb{P}\left(\left|\frac{S_N}N - \mu\right| \ge \varepsilon\right) \to 0. \]
Apply Chebyshev’s inequality to \(S_N/N\). We have: \[ \mathbb{E}\frac{S_N}N = \frac1N(\mathbb{E} X_1 + \ldots + \mathbb{E} X_N) = \frac1N(\mu + \ldots + \mu) = \frac{\mu N}{N} = \mu; \] \[ \mathrm{Var}\frac{S_N}N = \frac1{N^2}\mathrm{Var} S_N = \frac1{N^2}\left(\mathrm{Var} X_1 + \ldots + \mathrm{Var} X_N\right) = \frac1{N^2}\left(\sigma^2 + \ldots + \sigma^2\right) = \frac{\sigma^2}N. \] Therefore, as \(N \to \infty\), we have: \[ \mathbb{P}\left(\left|\frac{S_N}N - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2/N}{\varepsilon^2} \to 0. \]
Actually it is possible to prove stronger convergence almost surely, but this is harder.
If \(X_1, X_2, \ldots \sim \mathrm{Exp}(2)\) are i.i.d., then \(\mu = 1/2\) and \(\sigma^2 = 1/4\), and for the sample mean \(\overline{X}_n = S_n/n\), \[ \mathbb{P}\left(\left|\overline{X}_n - \frac12\right| \ge 0.1\right) \le \frac{1/4}{0.1^2\cdot n} = \frac{25}n. \] For \(n = 10\) this bound is still greater than \(1\), so the estimate is useless. But for \(n = 1000\) it is useful: this probability is at most \(2.5\%\).
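Chebyshev's bound is crude; a small simulation (a sketch, with an arbitrary seed) shows that for \(n = 1000\) the actual probability is far below the bound \(2.5\%\):

```python
import numpy as np

# Sample means of n = 1000 i.i.d. Exp(2) variables (mean 1/2), repeated many times.
rng = np.random.default_rng(1)
n, trials = 1000, 10**4
samples = rng.exponential(scale=0.5, size=(trials, n))   # Exp(2) has scale 1/2
means = samples.mean(axis=1)
print(np.mean(np.abs(means - 0.5) >= 0.1))   # essentially 0, versus the bound 25/n = 0.025
```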
Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(\mathbb{E} X_1 = \mu\). Then by the Law of Large Numbers for \(S_N = X_1 + \ldots + X_N\) we have: \(S_N/N \to \mu\). Large Deviations refer to estimating the following probabilities for \(\varepsilon > 0\): \[ \mathbb{P}\left(\frac{S_N}{N} - \mu \ge \varepsilon\right)\ \mbox{and}\ \mathbb{P}\left(\frac{S_N}{N} - \mu \le -\varepsilon\right). \] This can be done by combining Markov’s inequality and moment generating functions. Instead of formulating general theory, we do a few examples.
Take \(X \sim \mathcal N(2, 1)\). Then \(\mathbb{E} X = 2\). Let us estimate \(\mathbb{P}(X \ge 3)\). Let \(t > 0\) be a parameter to be determined later. Then \(\mathbb{P}(X \ge 3) = \mathbb{P}\left(e^{tX} \ge e^{3t}\right)\). By Markov’s inequality, \[ \mathbb{P}\left(e^{tX} \ge e^{3t}\right) \le \frac{\mathbb{E} e^{tX}}{e^{3t}}. \] But the moment generating function for \(X \sim \mathcal N(2, 1)\) is given by \(\mathbb{E} e^{tX} = e^{2t + t^2/2}\). Choose \(t\) to minimize \[ \frac{e^{2t + t^2/2}}{e^{3t}} = \exp\left(\frac12t^2 - t\right). \] To minimize \(t^2/2 - t\), take the derivative with respect to \(t\): \((t^2/2 - t)' = t-1 = 0\ \Rightarrow\ t = 1\). This gives us \[ \exp\left(\frac12t^2 - t\right) = e^{-0.5}. \] Therefore, \(\mathbb{P}(X \ge 3) \le \boxed{e^{-0.5}}\) Of course, in this simple case we might as well directly calculate this probability. But in more complicated settings, this direct calculation is impossible.
Take \(X \sim \mathcal N(2, 1)\), and estimate \(\mathbb{P}(X \le 1)\). For \(t > 0\), we have: \[ \mathbb{P}(X \le 1) = \mathbb{P}\left(e^{-tX} \ge e^{-t}\right) \le \frac{\mathbb{E} e^{-tX}}{e^{-t}} = \frac{e^{-2t + t^2/2}}{e^{-t}} = \exp\left(\frac{t^2}2 - t\right), \] and this problem is solved similarly: \(\mathbb{P}(X \le 1) \le \boxed{e^{-0.5}}\)
This method works particularly well for sums of independent random variables.
Let \(X_1, \ldots, X_{10} \sim \mathrm{Exp}(3)\) be i.i.d. random variables, and let \(S := X_1 + \ldots + X_{10}\). Then \(\mathbb{E} S = \frac{10}{3}\). Let us estimate \(\mathbb{P}(S \le 2)\). For \(t > 0\), \[ \mathbb{P}(S \le 2) = \mathbb{P}\left(e^{-tS} \ge e^{-2t}\right) \le \frac{\mathbb{E} e^{-tS}}{e^{-2t}}. \] Since \(X_1, \ldots, X_{10}\) are independent, \[ \mathbb{E} e^{-tS} = \mathbb{E}\left[e^{-tX_1 - \ldots - tX_{10}}\right] = \mathbb{E}\left[e^{-tX_1}\ldots e^{-tX_{10}}\right] \] \[ = \mathbb{E}\left[e^{-tX_1}\right]\cdot\ldots\cdot\mathbb{E}\left[e^{-tX_{10}}\right] = \left(\mathbb{E} e^{-tX_1}\right)^{10} = \left(\frac{3}{3+t}\right)^{10}. \] We used the expression for the moment generating function of the exponential random variable: \(\mathbb{E} e^{-tX_1} = 3/(3+t)\). Therefore, we need to minimize \[ F(t) = \frac{\left(\frac{3}{3+t}\right)^{10}}{e^{-2t}} = e^{2t}\left(\frac{3}{3+t}\right)^{10} \] Take the logarithm: \[ \ln F(t) = 2t + 10\ln 3 - 10\ln(3+t)\ \Rightarrow\ (\ln F(t))' = 2 - \frac{10}{3+t} = 0\ \Rightarrow\ t = 2. \] Therefore, the minimal value is \(\mathbb{P}(S \le 2) \le F(2) = e^{2\cdot2}(3/5)^{10} \approx \boxed{0.33}\)
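One can verify both the minimization and the quality of this bound numerically; recall that the exact distribution of \(S\) is \(\Gamma(10, 3)\). A sketch:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

# Bound for P(S <= 2), where S is the sum of 10 i.i.d. Exp(3) variables.
F = lambda t: np.exp(2 * t) * (3 / (3 + t))**10
res = minimize_scalar(F, bounds=(0, 50), method='bounded')
print(res.x, F(res.x))                  # minimizer approximately 2, bound approximately 0.33

# Exact probability: S ~ Gamma(10, 3), i.e. shape 10 and scale 1/3.
print(gamma.cdf(2, a=10, scale=1/3))    # approximately 0.084, much smaller than the bound
```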
This is a special case of large deviations above. Let \(X_1, \ldots, X_N\) be independent Bernoulli random variables with \(\mathbb{P}(X_i = 1) = p_i\), \(\mathbb{P}(X_i = 0) = 1 - p_i\). Then \(S = X_1 + \ldots + X_N\) has mean \(\mu = \mathbb{E}[S] = \mathbb{E}[X_1] + \ldots + \mathbb{E}[X_N] = p_1 + \ldots + p_N\). Fix a \(\delta \in (0, 1)\) and estimate \[ \mathbb{P}(S \ge \mu + \delta\mu). \] Take a parameter \(t > 0\): \[ \mathbb{P}(S \ge \mu + \delta\mu) = \mathbb{P}\left(e^{tS} \ge e^{t\mu(1 + \delta)}\right) \le \frac{\mathbb{E}[e^{tS}]}{e^{t\mu(1 + \delta)}}. \] Since \(X_1, \ldots, X_N\) are independent, \[ \mathbb{E}[e^{tS}] = \mathbb{E}\left[e^{tX_1+\ldots + tX_N}\right] = \mathbb{E}e^{tX_1}\cdot\ldots\cdot\mathbb{E} e^{tX_N}. \] For each \(i\), we have: \(\mathbb{E} e^{tX_i} = p_ie^{t\cdot 1} + (1 - p_i)e^{t\cdot 0} = p_i e^t + (1 - p_i) = 1 + p_i(e^t - 1)\). Now comes the main step. Recall that \[ e^x = 1 + x + \frac{x^2}2 + \frac{x^3}6 + \ldots \ge 1 + x\ \mbox{for}\ x \ge 0. \] Apply this inequality to \(x = p_i(e^t - 1)\). Then \[ \mathbb{E} e^{tX_i} \le \exp\left(p_i(e^t - 1)\right). \] Substituting, we get: \[ \mathbb{E} e^{tS} \le \exp\left(p_1(e^t - 1)\right)\cdot\ldots\cdot\exp\left(p_N(e^t - 1)\right) = \exp\left((p_1 + \ldots + p_N)(e^t - 1)\right) = \exp(\mu(e^t - 1)). \] The right-hand side depends only on \(\mu\), regardless of the individual \(p_i\). Plugging this in, we get: \[ \mathbb{P}(S \ge \mu + \delta\mu) \le \exp\left(\mu(e^t - 1 - t(1 + \delta))\right) = \exp(\mu F(t)),\ F(t) := e^t - 1 - (1 + \delta)t. \] Minimize \(F(t)\) with respect to \(t\): \(F'(t) = e^t - (1 + \delta) = 0\ \Rightarrow\ t = \ln(1 + \delta)\). The minimal value is \(F(\ln(1 + \delta)) = e^{\ln(1 + \delta)} - 1 - (1+\delta)\ln(1 + \delta) = \delta - (1 + \delta)\ln(1 + \delta)\). It is possible to show that this is less than \(-\delta^2/3\) for \(\delta \in (0, 1)\). Thus, \[ \boxed{ \mathbb{P}(S \ge \mu + \delta\mu) \le \exp\left(-\frac{\delta^2\mu}3\right) } \] Similarly, it is possible to show that \[ \boxed{ \mathbb{P}(S \le \mu - \delta\mu) \le \exp\left(-\frac{\delta^2\mu}2\right) } \]
\(S \sim \mathrm{Bin}(20, 0.4)\) can be represented as \(S = X_1 + \ldots + X_{20}\) with i.i.d. Bernoulli \(X_i\) with \(p_i = 0.4\). Then \(\mu = p_1 + \ldots + p_{20} = 0.4\cdot 20 = 8\). Let us estimate \(\mathbb{P}(S \ge 12)\): this corresponds to \(8(1 + \delta) = 12 \Rightarrow \delta = 0.5\), and \(\mathbb{P}(S \ge 12) \le \exp\left(-\frac{0.5^2\cdot 8}{3}\right) = \boxed{e^{-2/3}}\)
Take \(S_1 \sim \mathrm{Bin}(10, 0.4)\) and \(S_2 \sim \mathrm{Bin}(20, 0.6)\), independent, and let \(S = S_1 + S_2\). Then \(S_1 = X_1 + \ldots + X_{10}\) with i.i.d. Bernoulli \(X_1, \ldots, X_{10}\) with \(p_1 = \ldots = p_{10} = 0.4\), and \(S_2 = X_{11} + \ldots + X_{30}\) with i.i.d. Bernoulli \(X_{11}, \ldots, X_{30}\) with \(p_{11} = \ldots = p_{30} = 0.6\). Then \(\mathbb{E} S = \mu = p_1 + \ldots + p_{30} = 10\cdot 0.4 + 20\cdot 0.6 = 16\). Therefore, \(\mathbb{P}(S \le 10) \le \mathbb{P}(S \le \mu(1 - \delta))\) for \(\delta = 3/8\), and is estimated as \(\exp(-0.5\cdot(3/8)^2\cdot 16) \approx \boxed{0.32}\)
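For both examples, the Chernoff-type bounds can be compared with the exact binomial tail probabilities; a sketch:

```python
import numpy as np
from scipy.stats import binom

# S ~ Bin(20, 0.4): exact P(S >= 12) versus the bound exp(-delta^2 mu / 3).
mu, delta = 8, 0.5
print(binom.sf(11, 20, 0.4), np.exp(-delta**2 * mu / 3))

# S = S_1 + S_2 with S_1 ~ Bin(10, 0.4), S_2 ~ Bin(20, 0.6): exact P(S <= 10),
# obtained by convolving the two pmfs, versus the bound exp(-delta^2 mu / 2).
pmf1 = binom.pmf(np.arange(11), 10, 0.4)
pmf2 = binom.pmf(np.arange(21), 20, 0.6)
pmf_S = np.convolve(pmf1, pmf2)          # pmf of S on 0, 1, ..., 30
mu, delta = 16, 3/8
print(pmf_S[:11].sum(), np.exp(-delta**2 * mu / 2))
```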
The LLN states: For \(X_1, X_2, \ldots\) i.i.d. with mean \(\mathbb E X_i = \mu\) and \(\mathrm{Var} X_i = \sigma^2\), if \(S_n = X_1 + \ldots + X_n\), then \(S_n/n \to \mu\) in probability. But how to estimate the speed of convergence? Note that \[ \mathbb E S_n = \mathbb E X_1 + \ldots + \mathbb E X_n = n\mu, \] and by independence, \[ \mathrm{Var}(S_n) = \mathrm{Var}(X_1) + \ldots + \mathrm{Var}(X_n) = n\sigma^2. \] Following is the statement of the CLT.
We have the following convergence in law, as \(n \to \infty\): \[ \frac{S_n - \mathbb E S_n}{\sqrt{\mathrm{Var} S_n}} = \frac{S_n - n\mu}{\sqrt{n}\sigma} \to \mathcal N(0, 1). \]
Since normal distribution is continuous, \[ \mathbb P\left(a \le \frac{S_n - n\mu}{\sqrt{n}\sigma} \le b\right) \to \frac1{\sqrt{2\pi}}\int_a^be^{-z^2/2}\,\mathrm{d}z = \Phi(b) - \Phi(a), \] where \(\Phi\) is the CDF of \(\mathcal N(0, 1)\): \[ \Phi(z) = \frac1{\sqrt{2\pi}}\int_{-\infty}^ze^{-u^2/2}\,\mathrm{d}u. \]
Compute the MGF of \((S_n - n\mu)/(\sqrt{n}\sigma)\) and prove that, as \(n \to \infty\), it converges to \(e^{t^2/2}\), which is the MGF of \(\mathcal N(0, 1)\).
In practice, these integrals are found from the Z-table.
Suppose we have a plane with \(N = 200\) passengers. Each passenger, independently of others, chooses chicken or pasta, each with probability \(50\%\). How many chicken and pasta meals are needed to satisfy chicken-lovers with probability \(95\%\) and pasta-lovers with probability \(95\%\) (so that everybody is satisfied with probability at least \(90\%\))?
If we prepare only \(100\) chicken and \(100\) pasta, we will likely run out of either chicken or pasta: it is very unlikely that exactly \(100\) people will choose chicken. The other extreme is to prepare \(200\) chicken and \(200\) pasta. This guarantees that every passenger will be pleased, but a lot of food (\(200\) meals) will be thrown away, which is not good, because airlines have tight budgets. Let us find a compromise. Let \(X_k = 1\) if the \(k\)th passenger chooses chicken, \(X_k = 0\) if they choose pasta. Then \(S_N = X_1 + \ldots + X_N\) is the total number of chicken required. The random variable \(X_k\) is Bernoulli, with \(p = 1/2\), so \[ \mu = \frac12,\ \ \sigma^2 = \frac14,\ \ \sigma = \frac12; \] \[ \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{S_N - 100}{\sqrt{200}/2} \approx \mathcal N(0, 1). \] We can find \(x_{95\%} = 1.645\) such that \[ \Phi(x_{95\%}) = \frac1{\sqrt{2\pi}}\int_{-\infty}^{x_{95\%}}e^{-u^2/2}\, \mathrm{d}u = 95\%. \] Therefore, approximately \[ \mathbb{P}\left(\frac{S_N - 100}{\sqrt{200}/2} \le 1.645\right) \approx 95\%. \] With this high probability \(95\%\), \[ \frac{S_N - 100}{\sqrt{200}/2} \le 1.645\ \ \Leftrightarrow\ \ S_N \le 100 + \frac12\sqrt{200}\cdot 1.645 \approx 112. \] So we need only \(112\) chicken: \(100\) chicken according to the mean, and \(12\) additional, as a buffer. The same applies to pasta. We will throw away \(12 + 12 = 24\) meals after the trip, which is not a lot. Or maybe just give them to pilots or flight attendants.
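The same calculation in code, as a sketch (it recovers the \(112\) meals):

```python
import numpy as np
from scipy.stats import norm

N, mu, sigma = 200, 0.5, 0.5            # Bernoulli(1/2): mean 1/2, st. deviation 1/2
z95 = norm.ppf(0.95)                    # approximately 1.645

# Number of chicken meals needed so that P(S_N <= meals) is approximately 95%.
meals = N * mu + z95 * sigma * np.sqrt(N)
print(meals, int(np.ceil(meals)))       # approximately 111.6, so prepare 112
```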
A research sample \(X_1, \ldots, X_n\) of size \(n = 100\) is drawn from a distribution with density \(f(x) = 3(1 - x)^2,\ 0 \le x \le 1\). Let \[ \overline{X} = \frac1n\sum_{k=1}^nX_k. \] Estimate \(\mathbb P(0.2 \le \overline{X} \le 0.3)\). Solution: This distribution has mean and variance (after computation) \[ \mu = \int_0^1xf(x)\,\mathrm{d}x = \frac14,\ \sigma^2 = \int_0^1x^2f(x)\,\mathrm{d}x - \mu^2 = \frac3{80}. \] Thus \(0.2 \le \overline{X} \le 0.3\) is equivalent to \(0.2n \le S_n \le 0.3n\), that is, \[ \frac{0.2n - n\mu}{\sqrt{n}\sigma} \le \frac{S_n - n\mu}{\sqrt{n}\sigma} \le \frac{0.3n - n\mu}{\sqrt{n}\sigma}. \] The lower limit is \(a = \frac{20 - 25}{10\sigma} \approx -2.58\) and the upper limit is \(b = \frac{30 - 25}{10\sigma} \approx 2.58\). By the CLT, the probability of this event is approximately \[ \Phi(2.58) - \Phi(-2.58) = 0.9951 - (1 - 0.9951) \approx 0.99. \]
The Poisson approximation is used when the events are rare, and the normal (CLT) approximation when they are not rare. In practice, if you have something like \(N = 1000\) and \(p = 0.1\%\), you should use the Poisson approximation, and if \(N = 1000\) but \(p = 10\%\), you should use the normal approximation.
An accountant lists each of \(N = 100\) expenses with precision up to \(1\) dollar. That is, the rounding error is \(\mathrm{Uni}[-0.5, 0.5]\). Find the probability that the total error exceeds 5 in absolute value. Let \(X_k\) = error in the \(k\)th expense. The total error is \(S = X_1 + \ldots + X_N\). We need \(|S| \ge 5\). After computations, we get: \(\mathbb E X_k = 0\) and \(\mathrm{Var} X_k = 1/12\). Thus \(\mathbb E S = 0\), \(\mathrm{Var} S = N/12\). By the Central Limit Theorem, \[ \mathbb{P}(S \ge 5) = \mathbb{P}\left(\frac{S - \mathbb E S}{\sqrt{\mathrm{Var} S}} \ge \frac{5 - 0}{\sqrt{100/12}}\right) \approx 1 - \Phi\left(\frac{5 - 0}{\sqrt{100/12}}\right) = 1 - \Phi(1.73) \approx 0.042. \] Similarly, \(\mathbb{P}(S \le -5) = \mathbb{P}(S \ge 5)\) because \(S\) is symmetric, so \(\mathbb{P}(|S| \ge 5) \approx 2\cdot 0.042 = 0.084\).
The CLT is valid not only for iid (independent identically distributed) random variables. It is (sometimes) true if variables are independent but not identically distributed, or if they have slight dependence. The search for the conditions under which CLT is valid is still an active research area. For the purposes of this class, you can assume it is true if \(\mathrm{Var} S_N\) is large. This is how it is formulated in the general form.
For independent (but not necessarily identically distributed) \(X_1, X_2, \ldots\), we have the convergence in law: \[ \frac{S_N - \mathbb E[S_N]}{\sqrt{\mathrm{Var} S_N}} \to \mathcal N(0, 1),\ \ N \to \infty. \]
An insurance company has \(N = 10000\) clients. Half of them file a claim which is distributed as \(\mathrm{Exp}(1)\). The other half file a claim distributed as \(\mathrm{Exp}(2)\); all claims are independent. Then the total amount of claims is \[
S_N = X_1 + \ldots + X_{5000} + Y_1 + \ldots + Y_{5000},
\] where \(X_i \sim \mathrm{Exp}(1)\), \(Y_i \sim \mathrm{Exp}(2)\). Thus
\[
\mathbb{E} X_i = 1,\ \mathrm{Var} X_i = 1,\ \mathbb{E} Y_i = 1/2,\ \mathrm{Var} Y_i = 1/4;
\] \[
\mathbb E S_N = 5000\cdot 1 + 5000\cdot (1/2) = 7500,\ \ \mathrm{Var} S_N = 5000\cdot 1 + 5000\cdot(1/4) = 6250.
\] By the Central Limit Theorem, \[
\frac{S_N - 7500}{\sqrt{6250}} \approx \mathcal N(0, 1).
\]
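A simulation check of this normal approximation, as a sketch with an arbitrary seed:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
trials = 1000
# Each trial: 5000 claims Exp(1) (scale 1) plus 5000 claims Exp(2) (scale 1/2).
S = (rng.exponential(1.0, size=(trials, 5000)).sum(axis=1)
     + rng.exponential(0.5, size=(trials, 5000)).sum(axis=1))
Z = (S - 7500) / np.sqrt(6250)          # standardize as in the CLT

# Compare a simulated probability with the standard normal one.
print(np.mean(Z <= 1), norm.cdf(1))     # both approximately 0.84
```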