Introduction

These lecture notes include both discrete- and continuous-time processes. We consider applications to insurance, finance, biology, and social sciences. These lecture notes are intended for junior- and senior-level undergraduate courses. They contain enough material for two semesters or three academic quarters.

The first 10 sections are devoted to Probability Theory (first semester), and the next 10 sections are devoted to Stochastic Processes (second semester). The Appendix contains simulation methods. These lecture notes are suitable for STEM majors, but are likely too hard for business and humanities majors.

A prerequisite for the first semester is calculus: a standard two-semester or three-quarter sequence, including multivariable calculus. For the second semester, we also need differential equations and matrix algebra. We tried to keep the notes as condensed and concise as possible, very close to the actual notes written on the board.

Acknowledgements

I would like to thank Janko Gravner from the Department of Mathematics, University of California, Davis, who posted his lecture notes on Probability and Markov Chains on his web page. I use his materials a lot. Last but not least, I am thankful to Soumik Pal, my Ph.D. adviser, Ioannis Karatzas, his Ph.D. adviser, Jean-Pierre Fouque, my postdoctoral mentor, and all my graduate and undergraduate professors in Probability Theory and Stochastic Processes, for helping me understand and appreciate this exciting area of mathematics.

1. Combinatorics

1.1. Permutations

A permutation of \(123\) is, say, \(321\) or \(231\): numbers cannot repeat. There are \(3\cdot 2\cdot 1 = 3! = 6\) permutations: there are \(3\) choices for the first slot, \(2\) choices for the second slot (because one of the numbers is already in the first slot and cannot be repeated), and only one choice for the last, third slot.

In general, for \(1, 2, 3, \ldots, n-1, n\) there are \[ n! = 1\cdot 2\cdot 3\cdot\ldots\cdot(n-1)\cdot n \] permutations. This number is called \(n\) factorial. Examples: \[ 1! = 1,\ \ 2! = 2,\ \ 3! = 6,\ \ 4! = 24,\ \ 5! = 120,\ \ 6! = 720. \] There is a convention that \(0! = 1\). Indeed, we have the property \((n-1)!\,n = n!\) for \(n = 2, 3, 4, \ldots\); it follows from the definition of the factorial, and it is the main property of the factorial. We would like it to be true also for \(n = 1\): \(0!\cdot 1 = 1!\), so \(0! = 1\). The factorial grows very quickly. Indeed, \(100!\) is extremely large; no modern computer can go through all permutations of \(1, 2, \ldots, 100\). When a computational problem requires searching through all permutations of \(n\) numbers, it is considered intractable for large \(n\). Stirling’s formula: \[ n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\ \ \mbox{as}\ \ n \to \infty, \] where \(f(n) \sim g(n)\) means \(\lim_{n \to \infty}f(n)/g(n) = 1\).
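
As a quick numerical check, here is a minimal Python sketch comparing \(n!\) with Stirling’s approximation for a few values of \(n\):

```python
import math

# Compare n! with Stirling's approximation sqrt(2*pi*n) * (n/e)^n.
for n in (5, 10, 20, 50):
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    print(n, exact / stirling)  # the ratio tends to 1 as n grows
```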

If we have three slots for numbers \(1, 2, 3, 4, 5, 6, 7\), and repetitions are not allowed, this is called an arrangement. Say, \(364\), \(137\), \(634\). There are \(7\cdot 6\cdot 5 = 210\) such arrangements: \(7\) choices for the first slot, \(6\) for the second and \(5\) for the third. We can write this as \[ A_7^3 = 7\cdot6\cdot 5 = \frac{7\cdot 6\cdot 5\cdot 4\cdot 3\cdot 2\cdot 1}{4\cdot 3\cdot 2\cdot 1} = \frac{7!}{(7-3)!}. \] In general, if there are \(k\) slots for \(1, 2, \ldots, n\), then the number of arrangements is \[ A_n^k = n\cdot(n-1)\cdot(n-2)\cdot\ldots\cdot(n-k+1) = \frac{n!}{(n-k)!}. \] A permutation can be viewed as a particular case of an arrangement, when \(k = n\): the number of slots is the same as the total number of elements.

1.2. Subsets

How many subsets of three elements are there in the set \(\{1, 2, \ldots, 7\}\)? The difference between an arrangement and a subset is that for a subset, order does not matter. (But in both of them, there are no repetitions.) For example, \(\{3, 4, 6\}\) and \(\{6, 3, 4\}\) is the same subset, but \(346\) and \(634\) are different arrangements. From any subset, we can create \(3! = 6\) arrangements. The quantity of subsets is equal to the quantity of arrangements divided by \(6\): \[ \frac{A^3_7}{3!} = \frac{7!}{4!3!} = \frac{210}{6} = 35. \] In general, the quantity of subsets of \(k\) elements in \(\{1, \ldots, n\}\) is equal to \[ \binom{n}{k} = \frac{A_n^k}{k!} = \frac{n\cdot(n-1)\cdot\ldots\cdot(n-k+1)}{k!} = \frac{n!}{(n-k)!k!} \] It is pronounced as \(n\) choose \(k\).

  1. \(\binom{1}{0} = 1\), because there is only one subset of zero elements in \(\{1\}\), and this is an empty set \(\varnothing\).

  2. \(\binom{1}{1} = 1\), because there is only one subset of one element in \(\{1\}\): the set \(\{1\}\) itself.

  3. \(\binom{n}{0} = 1\), for the same reason as in (1);

  4. \(\binom{n}{n} = 1\), for the same reason as in (2);

  5. \(\binom{2}{1} = 2\), because there are two subsets of one element of \(\{1, 2\}\): these are \(\{1\}\) and \(\{2\}\);

  6. \(\binom{n}{1} = n\), because there are \(n\) subsets of one element of \(\{1, 2, \ldots, n\}\): \(\{1\},\ \{2\},\ldots, \{n\}\);

  7. \(\binom{n}{n-1} = n\), because to choose a subset of \(n-1\) elements out of \(\{1, 2, \ldots, n\}\), we need to throw away one element, and it can be chosen in \(n\) ways;

  8. \(\binom{4}{2} = 4!/(2!2!) = 24/4 = 6\), and these subsets of \(\{1, 2, 3, 4\}\) are \[ \{1, 2\},\ \{1, 3\},\ \{1, 4\},\ \{2, 3\},\ \{2, 4\},\ \{3, 4\}. \]
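
These counts are easy to verify by brute force; here is a minimal Python sketch listing the two-element subsets of \(\{1, 2, 3, 4\}\) and checking the formula:

```python
import math
from itertools import combinations

# All two-element subsets of {1, 2, 3, 4}: there should be C(4, 2) = 6 of them.
subsets = list(combinations(range(1, 5), 2))
print(subsets)                        # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(subsets), math.comb(4, 2))  # 6 6
```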

1.3. Symmetry

We can say without calculations that \(\binom{8}{2} = \binom{8}{6}\). Indeed, for every subset of \(\{1, 2, \ldots, 8\}\) of two elements there is a subset of six elements: its complement. For example, \(\{3, 5\}\) corresponds to \(\{1, 2, 4, 6, 7, 8\}\). This is a one-to-one correspondence. So there are equally many subsets of two elements and subsets of six elements. Similarly, \(\binom{8}{3} = \binom{8}{5}\). More generally, \[ \binom{n}{k} = \binom{n}{n-k} \]

1.4. Power set

How many subsets does the set \(\{1, 2, \ldots, n\}\) contain? Answer: \(2^n\). Indeed, to construct an arbitrary subset \(E\), you should answer \(n\) questions:
  • Is \(1 \in E\)? Yes/No
  • Is \(2 \in E\)? Yes/No
  • …
  • Is \(n \in E\)? Yes/No

For each question, there are two possible answers. The total number of choices is \(2\cdot2\cdot\ldots\cdot2 = 2^n\). The set of all subsets of \(\{1, \ldots, n\}\) is called a power set, and it contains \(2^n\) elements. But we can also write the quantity of all subsets as the sum of binomial coefficients: \[ \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \ldots + \binom{n}{n}. \] So we get the following identity: \[ \binom{n}{0} + \binom{n}{1} + \ldots + \binom{n}{n} = 2^n \]

Example 1.1.

Let \(n = 2\). Yes-Yes: \(E = \{1, 2\}\), Yes-No: \(E = \{2\}\), No-Yes: \(E = \{1\}\), No-No: \(E = \varnothing\). Total number of subsets: \(2^2 = 4\). Two of them have one element: \(\binom{2}{1} = 2\), one has two elements, \(\binom{2}{2} = 1\), and one has zero elements, \(\binom{2}{0} = 1\). Total: \(1 + 2 + 1 = 4\).

1.5. Reduction property

We can claim that \[ \binom{5}{2} = \binom{4}{2} + \binom{4}{1}. \] Indeed, the total number of subsets \(E \subseteq \{1, 2, 3, 4, 5\}\) which contain two elements is \(\binom{5}{2}\). But there are two possibilities:

Case 1. \(5 \in E\). Then \(E\setminus\{5\}\) is a one-element subset of \(\{1, 2, 3, 4\}\); there are \(\binom{4}{1}\) such subsets.

Case 2. \(5\notin E\). Then \(E\) is a two-element subset of \(\{1, 2, 3, 4\}\). There are \(\binom{4}{2}\) such subsets.

So \(\binom{4}{1} + \binom{4}{2} = \binom{5}{2}\). In general, \[ \binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1} \] This is the reduction formula. Using it, we can construct Pascal’s triangle: each element is the sum of the two elements immediately above it. We start from the edges and fill them with ones: \(\binom{n}{0} = \binom{n}{n} = 1\), as shown above. Then we fill the inside from top to bottom using the reduction formula.
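
Here is a minimal Python sketch building the first rows of Pascal’s triangle from the reduction formula (the function name pascal_rows is ours):

```python
# Build the first rows of Pascal's triangle using the reduction formula
# C(n, k) = C(n-1, k) + C(n-1, k-1), with ones on the edges.
def pascal_rows(n_max):
    row = [1]                      # row n = 0
    rows = [row]
    for _ in range(n_max):
        row = [1] + [row[k] + row[k + 1] for k in range(len(row) - 1)] + [1]
        rows.append(row)
    return rows

for r in pascal_rows(6):
    print(r)   # last row: [1, 6, 15, 20, 15, 6, 1]
```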

1.6. Newton’s binomial formula

We can expand \((x + y)^2 = x^2 + 2xy + y^2\), and \((x + y)^3 = x^3 + 3x^2y + 3xy^2 + y^3\). The coefficients are taken from the corresponding rows of Pascal’s triangle. Why is this? Let us show this for \(n = 3\). \[ (x+y)^3 = (x+y)(x+y)(x+y) = xxx + xxy + xyx + yxx + xyy + yxy + yyx + yyy. \] Each term corresponds to the set of slots occupied by \(y\): \(xxy \leftrightarrow \{3\}\), \(yxy \leftrightarrow \{1, 3\}\). If there is one slot occupied by \(y\), this corresponds to \(x^2y\), and there are \(\binom{3}{1}\) such combinations. So we have: \(\binom{3}{1}x^2y\). Other terms give us: \[ \binom{3}{0}x^3 + \binom{3}{1}x^2y + \binom{3}{2}xy^2 + \binom{3}{3}y^3. \] The general formula looks like this: \[ (x + y)^n = \binom{n}{0}x^n + \binom{n}{1}x^{n-1}y + \ldots + \binom{n}{n}y^n \] Let \(x = y = 1\). Then we get: \[ 2^n = \binom{n}{0} + \binom{n}{1} + \ldots + \binom{n}{n}. \] This formula was already proven above. Let \(x = 1, y = -1\). Then \[ 0 = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \binom{n}{3} + \ldots, \] \[ \binom{n}{0} + \binom{n}{2} + \ldots = \binom{n}{1} + \binom{n}{3} + \ldots \] Thus the number of subsets with an even number of elements equals the number of subsets with an odd number of elements.

1.7. Combinatorics problems

Here, we study a few combinatorics (counting) problems, which can be reduced to counting permutations and combinations.

Example 1.2

Five women and four men take an exam. We rank them from top to bottom, according to their performance. There are no ties.

  • How many possible rankings?
  • What if we rank men and women separately?
  • As in the previous case, but Julie has the third place in women’s rankings.
Solution.
  • A ranking is just another name for permutation of nine people. The answer is \(9!\)

  • There are \(5!\) permutations for women and \(4!\) permutations for men. The total number is \(5!4!\) We should multiply them, rather than add, because men’s and women’s rankings are independent. We are interested in pairs: the first item is a ranking for women, the second item is a ranking for men. If we needed to choose: either rank women or rank men, then the solution would be \(5! + 4!\)

  • We exclude Julie from consideration, because her place is already reserved. There are four women remaining, so the number of permutations is \(4!\) For men, it is also \(4!\) The answer is \((4!)^2.\)

Example 1.3.

A licence plate consists of seven symbols: digits or letters. How many licence plates are there if the following is true:

  • there must be three letters and four digits, and symbols may repeat?
  • there must be three letters and four digits, and symbols may not repeat?
  • no restrictions on the quantity of letters and numbers, and symbols may repeat?
  • no restrictions on the quantity of letters and numbers, and symbols may not repeat?

Solution.

  • Choose three slots among seven for letters; this may be done in \(\binom{7}{3}\) ways. Fill each of these three slots with letters; there are \(26^3\) ways to do this, since letters can repeat. Fill each of the remaining four slots with digits; there are \(10^4\) ways to do this, since numbers can repeat. Answer: \(\binom{7}{3}\cdot26^3\cdot10^4.\)

  • This case is different because there are \(26\cdot 25\cdot 24 = 26!/23!\) ways to fill three chosen slots for letters, and \(10\cdot 9\cdot 8\cdot 7 = 10!/6!\) ways to fill four chosen slots for numbers. Answer: \(\binom{7}{3}26\cdot25\cdot24\cdot10\cdot9\cdot8\cdot7.\)

  • This case is easier than the previous ones, since there are \(36\) symbols, and each of the seven slots can be filled with any of these symbols. Answer: \(36^7.\)

  • Similarly to the previous case, we get: \(36\cdot35\cdot34\cdot33\cdot32\cdot31\cdot30 = 36!/29!\)

Example 1.4.

We have five women and four men. We need to choose a committee of three women and two men. How many ways are there to do this if:

  1. there are no additional restrictions?
  2. Mike and Thomas refuse to serve together?
  3. Britney and Lindsey refuse to serve together?
  4. Andrew and Anna refuse to serve together?
Solution.
  1. There are \(\binom{4}{2} = 6\) ways to choose two men out of four, and there are \(\binom{5}{3} = 10\) ways to choose three women out of five. So the answer is \(60\).

  2. How many committees are there for which this restriction is violated, so Mike and Thomas do serve together? If they are already chosen, then we do not need to choose any other man, and there are \(\binom{5}{3} = 10\) ways to choose three women out of five. So the quantity of committees where Mike and Thomas do serve together is \(10\). The answer is \(60-10 = 50.\)

  3. Similarly to the previous case, the number of committees where Britney and Lindsey serve together is \(\binom{3}{1}\binom{4}{2} = 18\), because you can choose one more woman out of the remaining three in \(\binom{3}{1} = 3\) ways, and the number of choices for men is \(\binom{4}{2}\). So the answer is \(60 - 18 = 42.\)

  4. Similarly to the previous case, the number of committees where Andrew and Anna serve together is \(\binom{3}{1}\binom{4}{2} = 18\), because you can choose one more man out of the remaining three in \(\binom{3}{1} = 3\) ways, and two more women out of the remaining four in \(\binom{4}{2}\) ways. So the answer is \(60 - 18 = 42.\)
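
All four committee counts can be verified by brute force; here is a minimal Python sketch (the names W4, W5, M4 are hypothetical placeholders for the unnamed people):

```python
from itertools import combinations

women = ["Britney", "Lindsey", "Anna", "W4", "W5"]
men = ["Mike", "Thomas", "Andrew", "M4"]

# All committees of three women and two men.
committees = [(w, m) for w in combinations(women, 3) for m in combinations(men, 2)]
print(len(committees))  # 60

def count_without_pair(a, b):
    # Committees in which a and b do NOT serve together.
    return sum(1 for w, m in committees if not (a in w + m and b in w + m))

print(count_without_pair("Mike", "Thomas"))      # 50
print(count_without_pair("Britney", "Lindsey"))  # 42
print(count_without_pair("Andrew", "Anna"))      # 42
```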

2. Axioms of Probability

2.1. Set-theoretic notation

A set is a collection of its elements: \(A = \{1, 3, 4, 5\}\), \(B = \{red, blue\}\). We say: \(1 \in A\) (\(1\) belongs to \(A\)), but \(2 \notin A\) (\(2\) does not belong to \(A\)). For two sets \(A\) and \(B\), we can define their union: \(A\cup B = \{x\mid x \in A\ \mbox{or}\ x \in B\}\), intersection: \(A\cap B := \{x \mid x\in A\ \mbox{and}\ x \in B\}\), and difference: \(A\setminus B = \{x\mid x \in A\ \mbox{and}\ x \notin B\}\). The empty set, which does not contain any elements, is denoted by \(\varnothing\).

Example 2.1.

Let \(A = \{0, 1, 2, 3\}\), \(B = \{0 \le x \le 7\mid\ x\ \mbox{is odd}\} = \{1, 3, 5, 7\}\). Then \[ A\cup B = \{0, 1, 2, 3, 5, 7\},\ A\cap B = \{1, 3\},\ A\setminus B = \{0, 2\},\ B\setminus A = \{5, 7\}. \]

2.2. Axioms of probability

The following set of axioms was formulated by a Russian mathematician Andrey Kolmogorov. We have the set \(S\) of elementary outcomes \(s \in S\). This set is called the sample space, and subsets \(A \subseteq S\) are called events. Each event \(A\) has a number \(\mathbb{P}(A)\), which is called the probability of \(A\) and satisfies the following axioms:

  • \(0 \le \mathbb{P}(A) \le 1\): every probability is between \(0\) and \(1\);
  • \(\mathbb{P}(\varnothing) = 0\): the probability of an empty set is \(0\);
  • \(\mathbb{P}(S) = 1\): the probability of the total event is \(1\);
  • If two events \(A\) and \(B\) are disjoint: \(A\cap B = \varnothing\), then \(\mathbb{P}(A\cup B) = \mathbb{P}(A) + \mathbb{P}(B)\). The same is true for three or more disjoint events, or even for infinitely many.

By the last axiom, if we take the complement of the event \(A\): \(A^c = S\setminus A\), then \(\mathbb{P}(A) + \mathbb{P}(A^c) = \mathbb{P}(A\cup A^c) = \mathbb{P}(S) = 1\), and therefore \(\mathbb{P}(A^c) = 1 - \mathbb{P}(A)\).

Example 2.2.

Toss a coin twice. Then \(S\) = {HH, TT, HT, TH}, and the probability of each outcome is \(1/4\). For example, the event \(A\) = {the same result} = {HH, TT} has probability \(\mathbb{P}(A) = 1/2\).

Example 2.3.

(Chevalier de Mere’s Problem.) What is the probability that there is at least one six in four rolls of a fair die? Denote this event by \(A\). The probability space is \[ S = \{(i_1, i_2, i_3, i_4)\mid i_1, i_2, i_3, i_4 \in \{1, 2, 3, 4, 5, 6\}\}. \] The probability of each elementary outcome is \(1/6^4\). The complement of the event \(A\) is \(A^c = S\setminus A\), which means that there is no six in any of the four rolls. We have: \[ A^c = \{(i_1, i_2, i_3, i_4)\mid i_1, i_2, i_3, i_4 \in \{1, 2, 3, 4, 5\}\}. \] The event \(A^c\) contains \(5^4\) outcomes. Each has probability \(1/6^4\), so \(\mathbb{P}(A^c) = 5^4/6^4\). Therefore, \[ \mathbb{P}(A) = 1 - \mathbb{P}(A^c) = 1 - \left(\frac56\right)^4 \approx 0.5177. \] This event has probability greater than \(50\%\), so it is good to bet on it.
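
A minimal Python sketch checking this both exactly and by simulation:

```python
import random

exact = 1 - (5 / 6) ** 4
print(round(exact, 4))   # 0.5177

# Monte Carlo estimate: roll a die four times, check for at least one six.
trials = 100_000
hits = sum(any(random.randint(1, 6) == 6 for _ in range(4)) for _ in range(trials))
print(hits / trials)     # close to 0.5177
```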

Example 2.4.

Suppose we have \(n = 365\) days, each day is equally likely to be a birthday. There are \(k\) people. Then \[ \mathbb{P}(\mbox{there are two people with same birthdays}) = 1 - \mathbb{P}(\mbox{there are no people with same birthdays}) = 1 - \frac{n(n-1)\ldots(n-k+1)}{n^k}, \] because the number of birthday arrangements (assuming they are all different) is \(A^k_n = n(n-1)\ldots(n-k+1)\). This probability for \(k = 23\) equals \(0.5073\), greater than \(50\%\)!
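
A minimal Python sketch computing this probability (the helper birthday_collision_prob is ours):

```python
def birthday_collision_prob(k, n=365):
    # 1 - n(n-1)...(n-k+1) / n^k
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (n - i) / n
    return 1 - p_distinct

print(round(birthday_collision_prob(23), 4))  # 0.5073
```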

2.3. Inclusion-exclusion formula for two events

Consider two events, \(A\) and \(B\), which can intersect. Then \(\mathbb{P}(A\cup B) \ne \mathbb{P}(A) + \mathbb{P}(B)\). Indeed, toss two coins, let \(A\) be the event that the first toss is H, and let \(B\) be the event that the second toss is H. Then \(\mathbb{P}(A\cup B) = 3/4\), but \(\mathbb{P}(A) = \mathbb{P}(B) = 1/2\). The reason is that the intersection \(A\cap B\) is counted twice in \(\mathbb{P}(A) + \mathbb{P}(B)\), so we need to subtract it once to get the correct probability \(\mathbb{P}(A\cup B)\).

Theorem 2.1.

We have: \(\mathbb{P}(A\cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A\cap B)\).

Proof of Theorem 2.1.

Indeed, let \(C_1 = A\setminus B\), \(C_2 = A\cap B\), \(C_3 = B\setminus A\). Then \[ \mathbb{P}(A\cup B) = \mathbb{P}(C_1\cup C_2\cup C_3) = \mathbb{P}(C_1) + \mathbb{P}(C_2) + \mathbb{P}(C_3), \] \[ \mathbb{P}(A) = \mathbb{P}(C_1\cup C_2) = \mathbb{P}(C_1) + \mathbb{P}(C_2), \] \[ \mathbb{P}(B) = \mathbb{P}(C_2\cup C_3) = \mathbb{P}(C_2) + \mathbb{P}(C_3), \] \[ \mathbb{P}(A\cap B) = \mathbb{P}(C_2), \] and the formula follows from here.

Example 2.5.

\(\mathbb{P}(A) = 30\%\), \(\mathbb{P}(B) = 50\%\), \(\mathbb{P}(A\cup B) = 70\%\). Find: \(\mathbb{P}(A\cap B)\), \(\mathbb{P}(A^c\cup B^c)\), and \(\mathbb{P}(A^c\cap B)\). Solution: \(\mathbb{P}(A\cap B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A\cup B) = 0.3 + 0.5 - 0.7 = 0.1\); \(\mathbb{P}(A^c\cup B^c) = 1 - \mathbb{P}(A\cap B) = 1 - 0.1 = 0.9\); \(\mathbb{P}(A^c\cap B) = \mathbb{P}(B) - \mathbb{P}(A\cap B) = 0.5 - 0.1 = 0.4\).

Example 2.6.

(SOA) \(22\%\) of patients visit both a physical therapist \(T\) and a chiropractor \(C\), and \(12\%\) visit neither. Next, \(\mathbb{P}(C) = 0.14 + \mathbb{P}(T)\). Find \(\mathbb{P}(T)\). Indeed, \(\mathbb{P}(C\cap T) = 22\%\) and \(\mathbb{P}(C^c\cap T^c) = 12\%\), so \(\mathbb{P}(C\cup T) = 88\%\). Next, \(\mathbb{P}(C) + \mathbb{P}(T) = \mathbb{P}(C\cup T) + \mathbb{P}(C\cap T) = 1.1\). But \(\mathbb{P}(C) = 0.14 + \mathbb{P}(T)\). Thus \(2\mathbb{P}(T) + 0.14 = 1.1\) and \(\mathbb{P}(T) = 0.48\).

Example 2.7.

Choose a random number from \(1\) to \(1000\). What is the probability that it is divisible either by \(2\) or by \(3\)? Let \(A\) = divisible by 2, and \(B\) = divisible by 3. There are \(500\) numbers in \(A\) and \(333\) numbers in \(B\), because \(1000/3 = 333 + 1/3\) has integer part \(333\). More exactly, \[ A = \{2, 4, 6, \ldots, 1000\},\ \ B = \{3, 6, 9, \ldots, 996, 999\}. \] \[ \mathbb{P}(A) = \frac{500}{1000} = \frac12,\ \ \mathbb{P}(B) = \frac{333}{1000}. \] In addition, \(A\cap B = \{\mbox{divisible by}\ \ 6\}\) contains \(166\) numbers: \(1000/6\) has integer part \(166\). Therefore, \[ \mathbb{P}(A\cap B) = 166/1000,\quad \mathbb{P}(A\cup B) = \frac{500}{1000} + \frac{333}{1000} - \frac{166}{1000} = \frac{667}{1000} \]
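
A direct count confirms the inclusion-exclusion answer; a minimal Python sketch:

```python
# Count the numbers from 1 to 1000 divisible by 2 or by 3.
count = sum(1 for n in range(1, 1001) if n % 2 == 0 or n % 3 == 0)
print(count, count / 1000)  # 667 0.667
```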

2.4. Inclusion-exclusion formula for three events

We can state an analogous formula for three, four, or more events. For simplicity, let us state it for 3 events.

Theorem 2.2.

For any events \(A, B, C\), we have: \[ \mathbb{P}(A\cup B\cup C) = \mathbb{P}(A) + \mathbb{P}(B) + \mathbb{P}(C) - \mathbb{P}(A\cap B) - \mathbb{P}(A\cap C) - \mathbb{P}(B \cap C) + \mathbb{P}(A\cap B\cap C). \] Informal explanation: to correct for double counting, we subtract the probabilities of the pairwise intersections. But this subtracts \(A\cap B\cap C\) three times, and it was initially counted three times too (as part of \(A\), \(B\), \(C\)). Thus we need to add it back once.

Proof of Theorem 2.2.

Let \(p_1\) be the probability of the event \(1\) on the Venn diagram, which is \(A\setminus(B\cup C)\). Let \(p_2\) be the probability of the event \(2\), etc. Then \[ \begin{cases} \mathbb{P}(A) = p_1 + p_3 + p_5 + p_7,\\ \mathbb{P}(B) = p_2 + p_3 + p_6 + p_7,\\ \mathbb{P}(C) = p_4 + p_6 + p_7 + p_5,\\ \mathbb{P}(A\cap B) = p_3 + p_7,\\ \mathbb{P}(A\cap C) = p_5 + p_7,\\ \mathbb{P}(B\cap C) = p_6 + p_7,\\ \mathbb{P}(A\cap B\cap C) = p_7. \end{cases} \] Finally, \(\mathbb{P}(A\cup B\cup C) = p_1 + p_2 + p_3 + p_4 + p_5 + p_6 + p_7\). Plugging this into the formula in the box above, we can check it is indeed true.

Example 2.8.

Out of numbers \(1, \ldots, 600\), how many are divisible by either 2, 3, or 5? We have: \(A\) = {divisible by 2}, \(B\) = {divisible by 3}, \(C\) = {divisible by 5}. Then \(A\cap B\) = {divisible by 6}, \(A\cap C\) = {divisible by 10}, \(B\cap C\) = {divisible by 15}, \(A\cap B\cap C\) = {divisible by 30}. Thus \[ \mathbb{P}(A) = \frac12,\, \mathbb{P}(B) = \frac13,\, \mathbb{P}(C) = \frac15,\, \mathbb{P}(A\cap B) = \frac16,\, \mathbb{P}(A\cap C) = \frac1{10},\, \mathbb{P}(B\cap C) = \frac1{15},\, \mathbb{P}(A\cap B\cap C) = \frac1{30}. \] By the inclusion-exclusion formula, \[ \mathbb{P}(A\cup B\cup C) = \frac12 + \frac13 + \frac15 - \frac16 - \frac1{10} - \frac1{15} + \frac1{30} = \frac{11}{15}. \]

2.5. Conditional probability

A bag with \(11\) cubes: \(3\) red and fuzzy, \(2\) red and smooth, \(4\) blue and fuzzy, \(2\) blue and smooth. Put your hand in the bag and randomly pick a cube. Let \[ R = \{\mbox{the cube is red}\},\ F = \{\mbox{the cube is fuzzy}\}. \] Then the probability that it is red is \(\mathbb{P}(R) = 5/11\). But if you feel that it is fuzzy, then the probability that it is red is \(3/7\). This is called conditional probability of \(R\) given \(F\): \[ \mathbb{P}(R\mid F) = \frac{\mathbb{P}(R\cap F)}{\mathbb{P}(F)} = \frac 37. \]

Definition 2.1.

Conditional probability of \(A\) given \(B\) is defined as \[ \mathbb{P}(A\mid B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}. \]

Theorem 2.3.

\(\mathbb{P}(A\cap B) = \mathbb{P}(B)\mathbb{P}(A\mid B) = \mathbb{P}(A)\mathbb{P}(B\mid A)\).

Example 2.9.

For a family with two children, assume that all four outcomes for the two kids are equally likely: \(S = \{GG, GB, BG, BB\}\). Let \(A\) = at least one boy, \(B\) = at least one girl. Then \[ \mathbb{P}(A\mid B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)} = \frac{2/4}{3/4} = \frac23. \]

Example 2.10.

\(\mathbb{P}(A\setminus B) = 0.2\), \(\mathbb{P}(B\setminus A) = 0.1\), \(\mathbb{P}(A^c\cap B^c) = 0.6\). Find \(\mathbb{P}(A\mid B)\). Solution: \(\mathbb{P}(A\cup B) = 1 - 0.6 = 0.4\), and \(\mathbb{P}(A\cap B) = 0.4 - 0.2 - 0.1 = 0.1\). Thus \(\mathbb{P}(B) = 0.1 + 0.1 = 0.2\) and \(\mathbb{P}(A\mid B) = 0.1/0.2 = 50\%\).

2.6. Total probability formula

Take events \(F_1, \ldots, F_n\) such that \[ F_1\cup F_2\cup \ldots\cup F_n = S,\ \ F_i\cap F_j = \varnothing,\ i \ne j. \] This means that one and only one of the events \(F_1, \ldots, F_n\) happens. Such a collection of events is called a partition. Examples: Heads or Tails for a coin toss, or the numbers from 1 to 6 for a die roll.

Theorem 2.4.

For an event \(A\) and a partition \(F_1, \ldots, F_n\), we have: \[ \mathbb{P}(A) = \sum\limits_{k=1}^n\mathbb{P}(A\cap F_k) = \sum\limits_{k=1}^n\mathbb{P}(A\mid F_k)\mathbb{P}(F_k). \]

Proof of Theorem 2.4.

We have from conditional probability definition: \(\mathbb{P}(A\mid F_i)\mathbb{P}(F_i) = \mathbb{P}(A\cap F_i)\). But \[ \mathbb{P}(A\cap F_1) + \ldots + \mathbb{P}(A\cap F_n) = \mathbb{P}(A), \] because the union of events \(A\cap F_1, \ldots, A\cap F_n\) is \(A\), and these do not intersect.

Example 2.11.

A policyholder is classified as prime or subprime, with probabilities \(20\%\) and \(80\%\), respectively. Prime policyholders have accidents with probability \(2\%\), and subprime with probability \(10\%\). Find the probability that a random policyholder has an accident. Here, \(F_1\) = {prime}, \(F_2\) = {subprime}, \(A\) = {accident}. Then \[ \mathbb{P}(F_1) = 0.2,\, \mathbb{P}(F_2) = 0.8,\, \mathbb{P}(A\mid F_1) = 0.02,\, \mathbb{P}(A\mid F_2) = 0.1. \] Thus by the total probability formula \[ \mathbb{P}(A) = \mathbb{P}(A\mid F_1)\mathbb{P}(F_1) + \mathbb{P}(A\mid F_2)\mathbb{P}(F_2) = 0.2\cdot 0.02 + 0.8\cdot 0.1 = 8.4\%. \]

Example 2.12.

Flip a fair coin. If H, roll one die. If T, roll two dice. What is the probability \(p\) that there is at least one six? Let \(X_1, X_2\) be the first and the second dice. Let \(F_1\) = heads, \(F_2\) = tails, \(A\) = at least one six. But \(\mathbb{P}(X_1 = 6) = 1/6\), and by inclusion-exclusion formula we have: \[ \mathbb{P}(X_1 = 6\,\mbox{or}\, X_2 = 6) = \mathbb{P}(X_1 = 6) + \mathbb{P}(X_2 = 6) - \mathbb{P}(X_1 = 6,\ X_2 = 6) = \frac16 + \frac16 - \frac1{36} = \frac{11}{36}. \] Thus by the total probability formula, \[ p = \frac12\cdot\frac16 + \frac12\cdot\frac{11}{36} = \frac{17}{72}. \]

2.7. Bayes’ formula

In the setting above, we have: \[ \mathbb{P}(F_1\mid A) = \frac{\mathbb{P}(F_1\cap A)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A\mid F_1)\mathbb{P}(F_1)}{ \mathbb{P}(A\mid F_1)\mathbb{P}(F_1) + \ldots + \mathbb{P}(A\mid F_n)\mathbb{P}(F_n)}. \]

Example 2.13.

\(1\%\) of people have a disease. If a person is sick, the test is positive with probability \(90\%\), and if a person is healthy, the test is positive only with probability \(20\%\). A random person is selected from the population and tests positive. What is the probability that he is sick? Let \(F_1\) = sick, \(F_2\) = healthy, \(A\) = tested positive.
\[ \mathbb{P}(F_1) = 0.01,\ \mathbb{P}(F_2) = 0.99,\ \mathbb{P}(A\mid F_1) = 0.9, \ \mathbb{P}(A\mid F_2) = 0.2. \] \[ \mathbb{P}(F_1\mid A) = \frac{0.01\cdot0.9}{0.01\cdot0.9 + 0.99\cdot 0.2} \approx 4.3\% \] We updated the probability of our hypothesis (that he is sick) from \(1\%\) to \(4.3\%\), using the new information that the test is positive. The probability \(\mathbb{P}(F_1)\) is called a prior probability, and \(\mathbb{P}(F_1\mid A)\) is called a posterior probability. The posterior probability is still small, because the prior probability was small. Confusing the large probability \(\mathbb{P}(A\mid F_1) = 90\%\) with the small posterior \(\mathbb{P}(F_1\mid A)\) is known as the prosecutor’s fallacy.

Example 2.14.

We have a fair coin and a magic coin, which always comes out H. Choose one at random (each can be chosen with probability \(1/2\)), flip it twice. It comes out H both times. What is the probability it is fair? Let \(F_1\) = fair, \(F_2\) = magic, \(A\) = both heads. Then \[ \mathbb{P}(F_1) = \mathbb{P}(F_2) = \frac12,\ \ \mathbb{P}(A\mid F_1) = \frac14,\ \mathbb{P}(A\mid F_2) = 1. \] Therefore, by Bayes’ formula we get: \[ \mathbb{P}(F_1\mid A) = \frac{\frac12\cdot\frac14}{\frac12\cdot\frac14 + \frac12\cdot 1} = \frac{1}{5} = 20\%. \]
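
Both posterior computations can be reproduced with a small helper; a minimal Python sketch (the function posterior is ours):

```python
def posterior(priors, likelihoods):
    # priors[i] = P(F_i), likelihoods[i] = P(A | F_i); returns P(F_1 | A).
    numerator = priors[0] * likelihoods[0]
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return numerator / total

print(round(posterior([0.01, 0.99], [0.9, 0.2]), 4))  # ~0.0435 (disease example)
print(posterior([0.5, 0.5], [0.25, 1.0]))             # 0.2 (fair vs. magic coin)
```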

2.8. Independence for two events

Events \(A\) and \(B\) are called independent if knowledge of whether \(A\) happened or not does not influence the probability of \(B\): \(\mathbb{P}(B\mid A) = \mathbb{P}(B)\). Since \(\mathbb{P}(B\mid A) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}\), we can rewrite this as \[ \mathbb{P}(A\cap B) = \mathbb{P}(A)\mathbb{P}(B). \]

Example 2.15.

Toss the coin twice. Let \(A\) = first toss H, \(B\) = second toss H, \(C\) = both tosses the same. Then \(A\) and \(B\) are independent. Indeed, the probability space (the space of all outcomes) is \(S = \{HH, HT, TH, TT\}\), and \(A\) = {HH, HT}, \(B\) = {TH, HH}, \(C\) = {TT, HH}. Next, \[ \mathbb{P}(A\cap B) = \mathbb{P}\{HH\} = \frac14 = \frac12\cdot \frac12 = \mathbb{P}(A)\cdot\mathbb{P}(B). \] It is obvious that these events are independent, because they refer to different tosses of the coin. In some other cases, it is not obvious. For example, \(A\) and \(C\) are also independent. Indeed, \[ \mathbb{P}(A\cap C) = \mathbb{P}\{HH\} = \frac14 = \frac12\cdot\frac12 = \mathbb{P}(A)\cdot\mathbb{P}(C). \] Similarly, \(B\) and \(C\) are independent.

Example 2.16.

A collateralized debt obligation (CDO) is backed by 5 subprime mortgages from California, each of which defaults with probability \(40\%\). A senior tranche defaults only if all 5 mortgages default. Find the probability of this in two cases: (a) they are independent; (b) they default simultaneously because of a housing bubble. Solution: (a) \(0.4^5 \approx 1\%\); (b) \(40\%\). Case (b) is roughly what happened in 2007-08.

Example 2.17.

Mary and Carlos are debugging a code. Their results are independent. Mary can spot a mistake with probability \(60\%\), and Carlos with probability \(80\%\). What is the probability that this mistake remains? Answer: \((1 - 60\%)\cdot (1 - 80\%) = 0.08\).

Example 2.18.

A gene exhibits two alleles: dominant \(W\) and recessive \(w\), with \[ \mathbb{P}(WW) = 50\%,\, \mathbb{P}(Ww) = 20\%,\, \mathbb{P}(ww) = 30\%. \] What is the chance that the offspring is \(ww\)? The offspring needs to receive \(w\) from the father and \(w\) from the mother, and these events are independent. The probability of getting \(w\) from the father is \[ \frac12\cdot 0.2 + 0.3 = 0.4, \] since a \(Ww\) father passes \(w\) with probability \(1/2\) and a \(ww\) father passes it with probability \(1\). The probability of getting \(w\) from the mother is the same. Thus the final probability is \(0.4^2 = 0.16\).

2.9. Independence for three or more events

Events \(A\), \(B\) and \(C\) are called independent if \[ \mathbb{P}(A\cap B) = \mathbb{P}(A)\mathbb{P}(B),\ \ \mathbb{P}(A\cap C) = \mathbb{P}(A)\mathbb{P}(C),\ \ \mathbb{P}(B\cap C) = \mathbb{P}(B)\mathbb{P}(C), \] \[ \mathbb{P}(A\cap B\cap C) = \mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C). \] This last condition is important, because it does not automatically follow from the first three conditions. For example, if \(A\), \(B\) and \(C\) are the events from Example 2.15, then \(A\) and \(B\) are independent, \(B\) and \(C\) are independent, \(A\) and \(C\) are independent, so these events are pairwise independent. But \(A\cap B\cap C = \{HH\}\), so \[ \mathbb{P}(A\cap B\cap C) = \frac14 \ne \frac18 = \mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C). \]

Example 2.19.

Three clients, each has accident independently with probability \(20\%\). Let \(X\) be the number of total accidents. Then \(\mathbb{P}(X = 0) = (1 - 0.2)^3 = 0.512\), and \(\mathbb{P}(X = 3) = 0.2^3 = 0.008\).

3. Random Variables

3.1. Definition

Take a sample space \(S\).

Definition 3.1.

Any function \(X : S \to \mathbb R\) is called a random variable. The distribution of a random variable is the list of its values, together with probabilities of each value.

Example 3.1.

For \(S\) = {HH, HT, TH, TT} (two coin tosses), let \(X\) be the number of Heads. Then \[ X(HH) = 2,\quad X(TH) = 1,\quad X(HT) = 1,\quad X(TT) = 0. \] The distribution, otherwise called the probability mass function, is \[ P(0) = \frac14,\quad P(1) = \frac12,\quad P(2) = \frac14. \]

Example 3.2.

Two dice rolled, \(S = \{11, 12, \ldots, 66\}\). Let \(X\) be the absolute difference of the two numbers. Then \(X(12) = 1\), \(X(31) = 2\).

Some random variables are continuous. They take infinitely many values, and have density \(f(x)\): For all \(a, b\), \[ \mathbb P(a \le X \le b) = \int_a^bf(x)\,\mathrm{d}x. \] Thus the probability of getting between \(a\) and \(b\) is the area under the density curve. For such random variables, \(\mathbb P(X = c) = 0\) for all \(c\). Any particular value is taken with probability \(0\). To be a density, a function needs to be nonnegative and to integrate to \(1\) over \((-\infty, +\infty)\). For example, \(f(x) = x\) or \(f(x) = x^2\) do not pass this test.

Example 3.3.

For an exponential random variable with unit mean, \[ f(x) = \begin{cases} e^{-x},\ x > 0;\\ 0,\, x \le 0. \end{cases} \] Then \(\mathbb P(X \le 1) = \int_0^1f(x)\,\mathrm{d}x = \int_0^1e^{-x}\,\mathrm{d}x = 1 - e^{-1}\).

Actually, random variables can be neither discrete nor continuous, but a mixture of both. Say, an accident happens with probability \(10\%\). In this case, losses are exponential, with distribution as above. Thus the amount \(X\) of losses is distributed as (for \(0 < a < b\)) \[ \mathbb{P}(X = 0) = 0.9,\quad \mathbb{P}(a \le X \le b) = \frac1{10}\int_a^be^{-x}\,\mathrm{d}x. \]

Example 3.4.

We have distribution of \(X\): \[ p(0) = 0.5,\, p(1) = 0.2,\, p(2) = p(3) = p(4) = 0.1 \] Find \(\mathbb P(X \ge 3\mid X \ge 1)\): this is equal to \[ \frac{\mathbb P(X \ge 3)}{\mathbb P(X \ge 1)} = \frac{0.1 + 0.1}{1 - \mathbb P(X = 0)} = \frac{0.2}{0.5} = 40\%. \]

Example 3.5.

A random variable \(X\) has density \(f(x) = 2x,\, 0 \le x \le 1\). Outside of this interval, the density is \(0\) (often we skip this \(0\) outside of effective interval). Then \[ \mathbb P(X \ge 0.5) = \int_{0.5}^12x\,\mathrm{d}x = \frac34. \]

3.2. Cumulative distribution function

For a random variable \(X\), its cumulative distribution function (CDF) is \(F_X(x) := \mathbb P(X \le x)\). It exists for every random variable, discrete, continuous, or otherwise (unlike PDF, which exists only for continuous distributions), and it uniquely determines the distribution: If \(F_X(x) = F_Y(x)\) for all \(x\), then random variables \(X\) and \(Y\) have the same distribution.

Theorem 3.1.

  • This function is nondecreasing: For \(x \le y\), we have \(F_X(x) \le F_X(y)\).
  • \(F_X(x) \to 1\) as \(x \to +\infty\). In other words, \(F_X(+\infty) = 1\).
  • \(F_X(x) \to 0\) as \(x \to -\infty\). In other words, \(F_X(-\infty) = 0\).

Proof of Theorem 3.1.

  • If \(X \le x\), then \(X \le y\). In other words, \(\{X \le x\} \subseteq \{X \le y\}\). Thus \(\mathbb{P}(X \le x) \le \mathbb{P}(X \le y)\). In other words, \(F_X(x) \le F_X(y)\).
  • The event \(X \le +\infty\) always happens, so its probability is \(1\).
  • The event \(X \le -\infty\) never happens, so its probability is \(0\).

Example 3.6.

Take \(X\) which takes values \(0\) and \(1\) with equal probabilities \(0.5\). Then:

  • \(x < 0\). Then the event \(X \le x\) cannot happen, because \(X\) can take only values \(0\) or \(1\). Therefore, the probability of this event is zero: \(F_X(x) = 0\).
  • \(x \in [0, 1)\). Then the event \(X \le x\) is equivalent to \(X = 0\), which has probability \(1/2\). Therefore, the probability of this event is \(0.5\): \(F_X(x) = 0.5\).
  • \(x \ge 1\). Then the event \(X \le x\) always happens, because \(X\) can take only values \(0\) or \(1\). Therefore, the probability of this event is one: \(F_X(x) = 1\).

\[ F_X(x) = \begin{cases} 0,\, x < 0;\\ 1/2,\, 0 \le x < 1;\\ 1,\ x \ge 1. \end{cases} \]

Example 3.7.

Take \(X\) the (random) number of successful trials in two independent trials, each with probability \(1/3\) of success (and \(2/3\) of failure). Then \[ \mathbb P(X = 0) = \left(\frac 23\right)^2 = \frac49,\, \mathbb P(X = 1) = 2\cdot\frac13\cdot\frac23 = \frac49,\, \mathbb P(X = 2) = \left(\frac13\right)^2 = \frac19. \]
  • \(x < 0\). Then the event \(X \le x\) cannot happen. Therefore, the probability of this event is zero: \(F_X(x) = 0\).
  • \(x \in [0, 1)\). Then the event \(X \le x\) is equivalent to \(X = 0\), which has probability \(4/9\). Therefore, \(F_X(x) = 4/9\).
  • \(x \in [1, 2)\). Then the event \(X \le x\) is equivalent to \(X \in \{0, 1\}\), which has probability \(4/9 + 4/9 = 8/9\). Therefore, \(F_X(x) = 8/9\).
  • \(x \ge 2\). Then \(X \le x\) always happens, with probability \(1\); therefore, \(F_X(x) = 1\).

\[ F_X(x) = \begin{cases} 0,\, x < 0;\\ 4/9,\, 0 \le x < 1;\\ 8/9,\, 1 \le x < 2;\\ 1,\, x \ge 2. \end{cases} \]

We can see that for a discrete random variable \(X\), the CDF is piecewise constant. Jump points correspond to the values which \(X\) takes, and the size of the jump corresponds to the probability that this value is taken.

Example 3.8.

Let us solve the opposite problem: Take a cumulative distribution function of a random variable \(X\): \[ F_X(x) = \begin{cases} 0,\, x < 1;\\ 0.4,\, x \in [1, 2);\\ 1,\, x \ge 2. \end{cases} \] Let us find the distribution of \(X\): which values it takes and with which probabilities. It jumps at \(1\) and \(2\), and the size of jumps is \(0.4\) and \(0.6\) respectively. Thus \(\mathbb P(X = 1) = 0.4\) and \(\mathbb P(X = 2) = 0.6\).

For a continuous random variable \(X\) with density \(p_X\), we have \[ F_X(x) = \mathbb P(X \le x) = \int_{-\infty}^xp_X(y)\,\mathrm{d}y. \] Thus \(F_X\) is continuous. If \(p_X\) is continuous at the point \(x\), then \(F'_X(x) = p_X(x)\). That is, the derivative of this CDF is the PDF: probability density function, or simply density.

Example 3.9.

Find the CDF for \(X\) with the density \[ p(x) = \begin{cases} 2e^{-2x},\, x \ge 0;\\ 0,\, x < 0. \end{cases} \] For \(x < 0\) the event \(\{X \le x\}\) is impossible, because \(X \ge 0\) always. Therefore, \(F_X(x) = \mathbb{P}(X \le x) = 0\). And for \(x \ge 0\) we have: \[ F_X(x) = \int_0^x(2e^{-2y})\,\mathrm{d}y = \left.\left(-e^{-2y}\right)\right|_{y=0}^{y=x} = 1 - e^{-2x}. \] Let us summarize: \[ F_X(x) = \begin{cases} 0,\, x < 0;\\ 1 - e^{-2x},\, x \ge 0. \end{cases} \] Unlike in the examples above, \(F_X(x) < 1\) for all \(x\), but \(F_X(x) \to 1\) as \(x \to +\infty\).

Example 3.10.

Find the distribution of a random variable \(X\) with CDF \[ F_X(x) = \begin{cases} 0,\, x < 0;\\ x^2,\, x \in [0, 1);\\ 1,\, x \ge 1. \end{cases} \] We have: \(\mathbb{P}(X \le 0) = F_X(0) = 0\), thus \(X > 0\) with probability \(1\). Next, \(\mathbb{P}(X \le 1) = F_X(1) = 1\), thus \(X \le 1\) with probability \(1\). Thus with probability \(1\), \(0 < X \le 1\). And for \(x \in (0, 1)\), the density is the derivative of \(F_X\): \(p_X(x) = (x^2)' = 2x\). Thus, \[ p_X(x) = \begin{cases} 2x,\, 0 \le x \le 1;\\ 0,\ \mbox{else}. \end{cases} \]

A common mistake is to assign, say, for a random variable \(X\) lying between \(0\) and \(2\), \(F_X(x) = 0\) for \(x \ge 2\). This confuses the CDF \(F_X\) with the PDF \(p_X\). While \(p_X(x) = 0\) in this case for \(x \le 0\) and for \(x \ge 2\), for the CDF we have: \[ F_X(x) = \begin{cases} 0,\ x < 0;\\ 1,\ x \ge 2. \end{cases} \]

3.3. Expectation

Take a random variable \(X : S \to \mathbb R\) on a sample space \(S\).

Definition 3.2.

The expectation of \(X\) is the sum over all elementary outcomes: \[ \mathbb E\left[X\right] = \sum\limits_{s \in S}p(s)X(s). \] This is also called mean, or average, or expected value.

Example 3.11.

\(S\) = {HH, HT, TH, TT} (two fair coins). Let \(X\) be the number of Heads. Then \[ X(HH) = 2,\quad X(TH) = 1,\quad X(HT) = 1,\quad X(TT) = 0. \] Each of these 4 elementary outcomes has probability \(1/4\). The distribution, otherwise called the probability mass function, is \[ P(0) = \frac14,\quad P(1) = \frac12,\quad P(2) = \frac14. \] The expectation can be computed in two equivalent ways: \[ \mathbb E[X] = 0\cdot P(0) + 1\cdot P(1) + 2\cdot P(2) = \frac14\cdot X(TT) + \frac14 \cdot X(HT) + \frac14 \cdot X(TH) + \frac14\cdot X(HH) = 1. \]

In general, we have the following results.

Theorem 3.2.

Assume \(X\) takes values \(c_1, \ldots, c_n\), with probabilities \(p_1, \ldots, p_n\). Then the expectation is \[ \mathbb E[X] = c_1p_1 + \ldots + c_np_n. \]

If \(X\) is not discrete (that is, does not take finitely or countably many values), then it is harder to define expectation (we need measure theory). However, it is easy to compute it.

Theorem 3.3.

If \(X\) is continuous with density \(f\), then \[ \mathbb E[X] = \int_{-\infty}^{+\infty}xf(x)\,\mathrm{d}x. \] Indeed, split real line into small subintervals \([y, y + \mathrm{d}y]\). Then \(\mathbb P(y \le X \le y + \mathrm{d}y) \approx f(y)\,\mathrm{d}y\). Sum this over all such small intervals and get the integral.

Example 3.12.

For a uniform random variable \(X\) on \([0, 1]\) with density \(f(x) = 1,\, 0 \le x \le 1\). Then \[ \mathbb E[X] = \int_0^1xf(x)\,\mathrm{d}x = \int_0^1x\,\mathrm{d}x = \frac12. \]

Example 3.13.

Sometimes the expectation can be infinite. Take density \(f(x) = x^{-2}\) for \(x \ge 1\). Then \[ \mathbb E[X] = \int_1^{\infty}xf(x)\,\mathrm{d}x = \int_1^{\infty}x^{-1}\,\mathrm{d}x = \left.\ln x\right|_{x=1}^{x = \infty} = +\infty. \]

Example 3.14.

(St Petersburg Paradox) A fair coin is flipped until the first tail. We win \(2\) if the first tail appears on the first toss (which happens with probability \(1/2\)); we win \(4\) if it appears on the second toss (which happens with probability \(1/4\)); etc. Thus \(\mathbb E[X] = 2\cdot \frac12 + 4\cdot \frac14 + \ldots = 1 + 1 + \ldots = \infty\). Thus the value of this game is infinite. But would you pay, say, \(1000\) to play it?

Example 3.15.

Alaska Airlines flight Reno-Seattle on an Embraer 175 has 64 economy class seats. Each ticket costs $100. The airline overbooks: it sells one additional ticket for a full flight. If all passengers show up, the airline needs to refund $100 and give a $150 voucher to the bumped passenger. Each passenger shows up independently with probability 98%. Is the airline better off overbooking? Solution: The airline gains $100 if at least one passenger does not show up, but it loses $150 if all show up, which happens with probability \(0.98^{64} \approx 0.27\). Thus it is better to overbook: the expected gain is \[ 100\cdot (1 - 0.27) + (-150)\cdot 0.27 > 0. \]
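
The comparison amounts to one line of arithmetic; a minimal Python sketch with the numbers above:

```python
# Expected gain from overbooking: +100 if someone no-shows, -150 if all show up.
p_all_show = 0.98 ** 64
expected_gain = 100 * (1 - p_all_show) + (-150) * p_all_show
print(round(p_all_show, 3), round(expected_gain, 2))  # ~0.274, positive gain
```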

Theorem 3.4.

For a function \(g : \mathbb R \to \mathbb R\), we get: \[ \mathbb E[g(X)] = \begin{cases} \int_{-\infty}^{+\infty}g(y)f(y)\,\mathrm{d}y,\\ \sum_cg(c)\mathrm{P}(X = c), \end{cases} \] if \(X\) has density \(f\) or discrete distribution, respectively.

Example 3.16.

For \(X\) with density \(f(x) = e^{-x},\, x > 0\), we get: \[ \mathbb E\bigl[e^{X/2}\bigr] = \int_0^{\infty}e^{x/2}e^{-x}\,\mathrm{d}x = \int_0^{\infty}e^{-x/2}\,\mathrm{d}x = 2. \]

3.4. Variance

This quantity measures the concentration of random variable around its mean.

Definition 3.3.

For a random variable \(X\), if \(\mu = \mathbb E[X]\), then \(\mathrm{Var}(X) = \mathbb E(X - \mu)^2\) is called the variance.

This quantity is always nonnegative, and it can be zero only when \(X = \mu\) is constant, not random.

Theorem 3.5.

One can compute variance as follows: \(\mathrm{Var}(X) = \mathbb E[X^2] - (\mathbb E X)^2\).

Proof of Theorem 3.5.

Note that \((X - \mu)^2 = X^2 - 2X\mu + \mu^2\). Take expectation and get: \(\mathbb E[X^2] - 2\mu\mathbb E[X] + \mu^2 = \mathbb E[X^2] - 2\mu\cdot\mu + \mu^2 = \mathbb E[X^2] - \mu^2\).

Example 3.17.

For a uniform random variable \(X\) on \([0, 1]\), its density is \(f(x) = 1\) for \(0 \le x \le 1\). Thus \[ \mathbb E[X^2] = \int_0^1x^2f(x)\,\mathrm{d}x = \int_0^1x^2\,\mathrm{d}x = \frac13. \] We can compute variance: \(\mathrm{Var}(X) = 1/3 - (1/2)^2 = 1/12\).

Example 3.18.

Let \(X\) be the number of Heads in two coin tosses. Its distribution is \(p(0) = 1/4\), \(p(1) = 1/2\), \(p(2) = 1/4\). Thus \[ \mathbb E[X^2] = \frac14\cdot 0^2 + \frac12\cdot 1^2 + \frac14\cdot 2^2 = \frac32, \] and the variance is equal to \(3/2 - 1^2 = 1/2\).

3.5. Median

For a continuous random variable \(X\), its median is defined as the \(m\) such that \[ \mathbb P(X \ge m) = \int_m^{\infty}f(x)\,\mathrm{d}x = \frac12. \]

Example 3.19.

\(f(x) = e^{-x},\, x \ge 0\). Then \(\mathbb P(X \ge m) = e^{-m} = 1/2\) and thus \(m = \ln 2\). This is different from mean \(\mathbb E[X] = 1\).

For discrete random variables, it is sometimes not possible to find \(m\) such that \(\mathbb P(X \ge m) = \mathbb P(X \le m) = 1/2\). Instead, we should start from the smallest value and add probabilities until we reach or exceed \(1/2\).

Example 3.20.

For the distribution \(p(-2) = 0.2\), \(p(-1) = 0.1\), \(p(0) = 0.4\), \(p(2) = 0.2\), \(p(3) = 0.1\), adding probabilities from the smallest value, we have \(0.2 + 0.1 = 0.3 < 1/2\) at \(-1\), and \(0.2+0.1+0.4 = 0.7 \ge 1/2\) at \(0\). Thus the median is \(0\).

Example 3.21.

For the distribution \(p(-2) = 0.2\), \(p(-1) = 0.1\), \(p(0) = 0.2\), \(p(2) = 0.2\), \(p(3) = 0.3\), we get to \(0\) with probability \(0.2+0.1+0.2 = 0.5\). Thus the median is \(0\).
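
The rule "add probabilities from the smallest value until the total reaches \(1/2\)" is easy to code; a minimal Python sketch applied to the two distributions above (the helper discrete_median is ours):

```python
def discrete_median(pmf):
    # pmf: dict value -> probability. Return the smallest value at which
    # the cumulative probability reaches or exceeds 1/2.
    total = 0.0
    for value in sorted(pmf):
        total += pmf[value]
        if total >= 0.5:
            return value

print(discrete_median({-2: 0.2, -1: 0.1, 0: 0.4, 2: 0.2, 3: 0.1}))  # 0
print(discrete_median({-2: 0.2, -1: 0.1, 0: 0.2, 2: 0.2, 3: 0.3}))  # 0
```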

Definition 3.4.

A mode is a value \(x\) at which \(f(x)\) is maximal (for a continuous random variable), or at which \(\mathbb{P}(X = x)\) is maximal (for a discrete one).

For example, for \(f(x) = e^{-x},\, x \ge 0\), the mode is \(0\).

3.6. Survival function

For a random variable \(X \ge 0\), the function \(S(c) = \mathbb{P}(X > c)\) is called the survival or tail function. Let \(X\) be a lifetime with density \(f\). The mortality rate is defined as \[ m(x) = \frac{f(x)}{S(x)}, \] which has the meaning \[ m(x)\,\mathrm{d}x = \mathbb P(x < X < x + \mathrm{d}x\mid X > x). \] This is the conditional probability of dying within the additional \(\mathrm{d}x\), given survival until \(x\). Survival functions and mortality rates are available in actuarial life tables from the Social Security Administration.

3.7. Deductibles and copays

An insurance policy has an out-of-pocket maximum of 2, a deductible of 1, and an actuarial value of 70%. This means the following: the client pays the first 1 (the deductible); after that, the company pays 70% of the losses until the client's out-of-pocket total reaches 2; from then on, the company pays everything. Assume the loss distribution is \[ \mathbb P(X = n) = \frac1{(n+1)(n+2)},\, n = 0, 1, 2, \ldots \] Let \(Y\) be the payment by the client. The distributions of \(X\) and \(Y\) are given in the table below.
\(X\) \(Y\) \(P\)
0 0 1/2
1 1 1/6
2 1.3 1/12
3 1.6 1/20
4 1.9 1/30
\(\ge 5\) 2 \(p\)

The probability \(p\) is equal to \[ p = 1 - \frac12 - \frac16 - \frac1{12} - \frac1{20} - \frac1{30} = \frac16. \] We can see that the expected out-of-pocket payment is \[ \mathbb E[Y] = 0\cdot\frac12 + 1\cdot\frac16 + 1.3\cdot\frac1{12} + 1.6\cdot\frac1{20} + 1.9\cdot \frac1{30} + 2\cdot \frac16 \approx 0.75. \]
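
The table and the expected payment can be reproduced programmatically; a minimal Python sketch with the policy parameters above (the helper client_payment is ours):

```python
def client_payment(x, deductible=1.0, coinsurance=0.30, oop_max=2.0):
    # The client pays the full deductible, then 30% of the remaining loss,
    # capped at the out-of-pocket maximum.
    paid = min(x, deductible) + coinsurance * max(x - deductible, 0.0)
    return min(paid, oop_max)

# Losses X = 0..4 have P(X = n) = 1/((n+1)(n+2)); lump X >= 5 together (Y = 2).
probs = {n: 1 / ((n + 1) * (n + 2)) for n in range(5)}
p_tail = 1 - sum(probs.values())                       # = 1/6
expected_y = sum(client_payment(n) * p for n, p in probs.items()) + 2 * p_tail
print(round(p_tail, 4), round(expected_y, 4))          # 0.1667 ~0.75
```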

3.8. Transformations

Here we take functions of continuous random variables with known densities. We find densities of new random variables. Indeed, take \(X\) with \(f_X(x) = 2e^{-2x},\, x \ge 0\). Find density of \(X^2 = Y\). There are two ways: change-of-variables and CDF.

Change of variables. For every \(0 < a < b\), \[ \mathbb P(a \le X \le b) = \int_a^bf_X(x)\,\mathrm{d}x. \] Try to do the same for \(Y\) instead of \(X\): \[ \mathbb P(a \le Y \le b) = \mathbb P\left(a^{1/2} \le X \le b^{1/2}\right) = \int_{a^{1/2}}^{b^{1/2}}f_X(x)\,\mathrm{d}x. \] Next, change variables in the integral: \[ y^{1/2} = x;\ a^{1/2} \le x \le b^{1/2} \Leftrightarrow a \le y \le b;\ \mathrm{d}x = \frac12 y^{-1/2}\,\mathrm{d}y. \] Thus we get: \[ \int_{a^{1/2}}^{b^{1/2}}f_X(x)\,\mathrm{d}x = \int_a^b2e^{-2y^{1/2}}\frac12 y^{-1/2}\,\mathrm{d}y. \] Combining this, we get: \[ \mathbb P(a \le Y \le b) = \int_a^be^{-2y^{1/2}}y^{-1/2}\,\mathrm{d}y. \] Since this is true for all \(a < b\), we get that the density of \(Y\) is \[ f_Y(y) = e^{-2y^{1/2}}y^{-1/2},\ y > 0. \]

CDF. Compute CDF for \(X\): \(\mathrm{P}(X \le c) = 1 - e^{-2c},\, c > 0\). Thus \[ \mathbb P(Y \le c) = \mathbb P(X \le c^{1/2}) = 1 - e^{-2c^{1/2}},\, c > 0. \] The derivative of this function is the density of \(Y\): \[ f_Y(c) = \left(1 - e^{-2c^{1/2}}\right)' = -(-2c^{1/2})'e^{-2c^{1/2}} = c^{-1/2}e^{-2c^{1/2}},\ c > 0. \]

The first method seems to be better when you cannot compute the CDF explicitly by integrating density.
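
A simulation check of the CDF of \(Y = X^2\), comparing the empirical CDF with \(1 - e^{-2\sqrt{c}}\) at a few points (a minimal Python sketch):

```python
import math
import random

# X ~ Exp(rate 2), Y = X^2. Compare empirical P(Y <= c) with 1 - exp(-2*sqrt(c)).
samples = [random.expovariate(2.0) ** 2 for _ in range(100_000)]
for c in (0.1, 0.5, 1.0):
    empirical = sum(y <= c for y in samples) / len(samples)
    exact = 1 - math.exp(-2 * math.sqrt(c))
    print(c, round(empirical, 3), round(exact, 3))
```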

4. Joint distributions

4.1. Definitions

Take two random variables \(X\) and \(Y\) with joint distribution: If they are discrete, they have joint probability mass function \[ \mathbb P(X = x,\, Y = y) = p(x, y). \] If they are continuous, they have joint density or joint probability density function \(f(x, y)\) such that for all \(a, b, c, d\), \[ \mathbb{P}(a \le X \le b,\, c \le Y \le d) = \int_a^b\int_c^df(x, y)\,\mathrm{d}y\,\mathrm{d}x. \] To qualify as joint density, a function needs to be nonnegative: \(f \ge 0\), and integrate up to \(1\) over the coordinate plane \(\mathbb R^2\): \[ \iint_{\mathbb R^2}f(x, y)\,\mathrm{d}x\,\mathrm{d}y = 1. \]

Example 4.1.

Toss a coin twice. Let \(X = 1\) if first Heads, \(X = 0\) if first Tails, and \(Y\) be the number of Heads.

\(s\) HH HT TH TT
\(X\) 1 1 0 0
\(Y\) 2 1 1 0
\(P\) 0.25 0.25 0.25 0.25

Example 4.2.

Take joint density \(f(x, y) = cx^2y\) if \(x^2 \le y \le 1\) (and \(0\) for other \(x\) and \(y\)) for some constant \(c\). Find \(c\) and \(\mathbb{P}(X \ge Y)\). Solution: \[ 1 = c\int_{-1}^1\int_{x^2}^1x^2y\,\mathrm{d}y\,\mathrm{d}x = c\int_{-1}^1\left.\frac{x^2y^2}2\right|_{y=x^2}^{y=1}\,\mathrm{d}x = \frac c2\int_{-1}^1\left(x^2 - x^6\right)\,\mathrm{d}x = \frac c2\left.\left(\frac{x^3}3 - \frac{x^7}7\right)\right|_{x=-1}^{x=1} = \frac{4c}{21}, \] so \(c = 21/4\). The event \(\{X \ge Y\}\) inside the region \(x^2 \le y \le 1\) corresponds to \(x^2 \le y \le x\), which requires \(0 \le x \le 1\): \[ \mathbb{P}(X \ge Y) = \frac{21}{4}\int_{0}^1\int_{x^2}^xx^2y\,\mathrm{d}y\,\mathrm{d}x = \frac{21}4\int_{0}^1x^2\left.\frac{y^2}2\right|_{y=x^2}^{y=x}\,\mathrm{d}x = \frac{21}8\int_0^1\left(x^4 - x^6\right)\,\mathrm{d}x = \frac{21}8\cdot\frac2{35} = \frac3{20}. \]
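
A crude midpoint-rule double integral over a grid confirms \(c = 21/4\) and \(\mathbb{P}(X \ge Y) = 3/20\) (a minimal Python sketch, Riemann-sum accuracy only):

```python
# Midpoint-rule check of the normalizing constant and of P(X >= Y)
# for f(x, y) = (21/4) * x^2 * y on the region x^2 <= y <= 1.
N = 500
hx = 2.0 / N     # x runs over [-1, 1]
hy = 1.0 / N     # y runs over [0, 1]
total = 0.0
prob_x_ge_y = 0.0
for i in range(N):
    x = -1 + (i + 0.5) * hx
    for j in range(N):
        y = (j + 0.5) * hy
        if x * x <= y <= 1:
            w = (21 / 4) * x * x * y * hx * hy
            total += w
            if x >= y:
                prob_x_ge_y += w
print(round(total, 3), round(prob_x_ge_y, 3))  # ~1.0 and ~0.15 = 3/20
```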

Theorem 4.1.

For any function \(g(x, y)\), we have: \[ \mathbb E g(X, Y) = \begin{cases} \sum_{x, y}g(x, y)\mathbb{P}(X = x,\, Y = y),\\ \iint_{\mathbb R^2}g(x, y)f(x, y)\,\mathrm{d}y\,\mathrm{d}x. \end{cases} \] if \((X, Y)\) are jointly discrete or jointly continuous with density \(f\).

Example 4.3.

For \(f(x, y) = x + y\), \(0 \le x \le 1\), \(0 \le y \le 1\), we get: \[ \mathbb E[XY] = \int_0^1\int_0^1xy(x+y)\,\mathrm{d}y\,\mathrm{d}x = \int_0^1\left.\left(\frac{x^2y^2}2 + \frac{xy^3}3\right)\right|_{y=0}^{y=1}\,\mathrm{d}x \] \[ = \int_0^1\left(\frac{x^2}2 + \frac{x}3\right)\,\mathrm{d}x = \frac13. \]

4.2. Independent random variables

Take random variables \(X\) and \(Y\).

Definition 4.2.

If \(X\) and \(Y\) are discrete, they are called independent if for all values \(x\) and \(y\) which they assume, events \(\{X = x\}\) and \(\{Y = y\}\) are independent: \[ \mathbb{P}(X = x,\, Y = y) = \mathbb{P}(X = x)\cdot \mathbb{P}(Y = y). \] If they are continuous, they are called independent if for all \(a, b, c, d\), the following events are independent: \(\{a \le X \le b\}\) and \(\{c \le Y \le d\}\).

The latter is equivalent to joint density having product form: \(f(x, y) = f_X(x)f_Y(y)\), where \(f_X\) and \(f_Y\) are marginal densities of \(X\) and \(Y\), respectively.

Theorem 4.2.

For independent \(X\) and \(Y\), we have: \(\mathbb E[XY] = \mathbb EX\cdot \mathbb EY\).

Proof of Theorem 4.2.

Say \(X\) and \(Y\) are discrete. Then apply the above to \(g(x, y) = xy\). \[ \mathbb E[XY] = \sum_{x, y}xy\,\mathbb{P}(X = x,\, Y = y) = \sum_{x, y}xy\,\mathbb{P}(X = x)\,\mathbb{P}(Y = y) \] \[ = \sum_xx\,\mathbb{P}(X = x)\cdot\sum_yy\,\mathbb{P}(Y = y) = \mathbb E X \cdot\mathbb E Y. \] Similarly it can be proved for continuous \(X\) and \(Y\).

Example 4.4.

For joint density \(f(x, y) = 6xy^2\), \(0 \le x \le 1\), \(0 \le y \le 1\). This is product form: \(2x\cdot 3y^2\). Thus \(X\) and \(Y\) are independent. We have: \[ \mathbb E[XY] = \mathbb E X \cdot\mathbb E Y = \int_0^12x\cdot x\,\mathrm{d}x \cdot \int_0^13y^2\cdot y\,\mathrm{d}y = \frac23\cdot\frac34 = \frac12. \]

Example 4.5.

For \(f(x, y) = x + y\), \(0 \le x \le 1\), \(0 \le y \le 1\), random variables \(X\) and \(Y\) are not independent, because the density is not product form.

Example 4.6.

For \(f(x, y) = 2\), \(0 \le x \le 1\), \(x \le y \le 1\), the density is not of product form, because the effective domain is not a rectangle.

4.3. Covariance and correlation

These are measures of dependency between two random variables.

Definition 4.3.

For two random variables \(X\) and \(Y\), their covariance is defined as \[ \mathrm{Cov}(X, Y) = \mathbb E[XY] - \mathbb E X \cdot \mathbb E Y. \] Their correlation is defined as \[ \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var} X}\sqrt{\mathrm{Var} Y}}. \]

Positive covariance/correlation implies that on average, if \(X\) increases, \(Y\) also increases. Negative covariance/correlation implies the reverse. Correlation is always between \(-1\) and \(+1\). If it is \(+1\), then \(Y\) is a linear function of \(X\): \(Y = aX + b\), with \(a > 0\). If it is \(-1\), then \(Y = aX + b\) with \(a < 0\).

Example 4.7.

Take random variables \(X\) and \(Y\) with joint distribution as in the table below.

\(X\) 1 2 3 4
\(Y\) 0 1 1 2
\(P\) 0.2 0.3 0.2 0.3

Then we have: \[ \mathbb E[X] = 1\cdot 0.2 + 2\cdot 0.3 + 3\cdot 0.2 + 4\cdot 0.3 = 2.6, \] \[ \mathbb E[Y] = 0\cdot 0.2 + 1\cdot 0.3 + 1\cdot 0.2 + 2\cdot 0.3 = 1.1, \] \[ \mathbb E[XY] = 1\cdot 0 \cdot 0.2 + 2\cdot 1\cdot 0.3 + 3\cdot 1\cdot 0.2 + 4\cdot 2\cdot 0.3 = 3.6. \] Thus \(\mathrm{Cov}(X, Y) = 3.6 - 2.6\cdot 1.1 = 0.74\). Next, \[ \mathrm{Var} X = \mathbb E[X^2] - (\mathbb E X)^2 = 0.2\cdot 1^2 + 0.3\cdot 2^2 + 0.2\cdot 3^2 + 0.3\cdot 4^2 - 2.6^2 = 1.24; \] \[ \mathrm{Var} Y = \mathbb E[Y^2] - (\mathbb E Y)^2 = 1.7 - 1.1^2 = 0.49. \] Thus the correlation is almost perfect: \[ \rho(X, Y) = \frac{0.74}{\sqrt{1.24}\sqrt{0.49}} = 0.95. \]
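
The arithmetic can be checked directly; a minimal Python sketch:

```python
import math

xs = [1, 2, 3, 4]
ys = [0, 1, 1, 2]
ps = [0.2, 0.3, 0.2, 0.3]

ex = sum(x * p for x, p in zip(xs, ps))
ey = sum(y * p for y, p in zip(ys, ps))
exy = sum(x * y * p for x, y, p in zip(xs, ys, ps))
var_x = sum(x * x * p for x, p in zip(xs, ps)) - ex ** 2
var_y = sum(y * y * p for y, p in zip(ys, ps)) - ey ** 2
cov = exy - ex * ey
corr = cov / math.sqrt(var_x * var_y)
print(round(cov, 2), round(corr, 2))  # 0.74 0.95
```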

If \(X\) and \(Y\) are independent then they are uncorrelated: \(\rho(X, Y) = 0\). But the converse is not true: \(\mathrm{Cov}(X, Y) = 0\) does not mean \(X\) and \(Y\) are independent.

Example 4.8.

Take \((X, Y)\) uniformly distributed on four points: \((0, 1), (1, 0), (0, -1), (-1, 0)\). Then (because \(XY = 0\) always) \[ \mathbb E[X] = \mathbb E[Y] = 0,\quad \mathbb E[XY] = 0. \] Thus \(\mathrm{Cov}(X, Y) = 0 - 0\cdot 0 = 0\). But \(X\) and \(Y\) are dependent: If \(X = 1\), then must be \(Y = 0\).

Theorem 4.3.

For all (not necessarily independent) \(X\) and \(Y\), \[ \mathrm{Var}(X+Y) = \mathrm{Var} X + \mathrm{Var} Y + 2\mathrm{Cov}(X, Y). \] In particular, for independent \(X\) and \(Y\) we have: \[ \mathrm{Var}(X + Y) = \mathrm{Var} X + \mathrm{Var} Y. \]

Proof of Theorem 4.3.

If \(\mathbb E X = \mathbb E Y = 0\), we have: \[ \mathrm{Var}(X+Y) = \mathbb E[(X+Y)^2] = \mathbb E(X^2 + 2XY + Y^2) = \mathbb E[X^2] + \mathbb E[Y^2] + 2\mathbb E[XY]. \] But \(\mathrm{Var} X = \mathbb E[X^2]\), \(\mathrm{Var} Y = \mathbb E[Y^2]\), and \(\mathbb E[XY] = \mathrm{Cov}(X, Y)\). This completes the proof. The case of general \(\mathbb E X\) and \(\mathbb E Y\) is similar.

Example 4.9.

Similarly we can get a more general linear combination: \(\mathrm{Var}(2X - 3Y + 1) = \mathrm{Var}(2X) + \mathrm{Var}(-3Y) + 2\mathrm{Cov}(2X, -3Y) = 4\mathrm{Var}(X) + 9\mathrm{Var}(Y) - 12\mathrm{Cov}(X, Y)\).

Example 4.10.

Invest in 3 independent stocks with returns \(S_1, S_2, S_3\) with means \(2, 1, 3\) and variances \(4, 3, 2\). Invest in these stocks in proportion \(2:1:1\): the total return is \[ R = \frac12S_1 + \frac14S_2 + \frac14S_3;\ \mathbb E R = \frac12\cdot 2 + \frac14\cdot 1 + \frac14\cdot 3 = 2;\ \mathrm{Var}(R) = \frac14\cdot 4 + \frac1{16}\cdot 3 + \frac1{16}\cdot 2 = \frac{21}{16}. \]

Example 4.11.

Two stocks with returns \(X\) and \(Y\), with \(\mathbb E X = 2\) and \(\mathrm{Var} X = 3\), \(\mathbb E Y = 1\) and \(\mathrm{Var} Y = 1\), and \(\rho(X, Y) = 0.5\). Invest in them equally, then total return is \(R = (X+Y)/2\). We get: \[ \mathbb E R = \frac12(2 + 1) = 1.5,\ \mathrm{Var} R = \frac14\mathrm{Var}(X+Y) = \frac14(\mathrm{Var} X + \mathrm{Var} Y + 2\mathrm{Cov}(X, Y)) = \frac14(3 + 1 + 2\cdot\sqrt{3}\sqrt{1}\cdot 0.5) = 1 + \frac{\sqrt{3}}4 \approx 1.4. \] Thus diversification decreased variance and with it risk.

4.4. Marginal distributions

If we know the joint distribution of \((X, Y)\), then the distribution of only \(X\) or only \(Y\) is called a marginal distribution. If \((X, Y)\) are jointly discrete, then the marginal distribution of \(X\) can be found by summing the probabilities over all \(Y\), and vice versa: \[ \mathrm{P}(X = x) = \sum_y\mathrm{P}(X = x,\, Y = y);\quad \mathrm{P}(Y = y) = \sum_x\mathrm{P}(X = x,\, Y = y). \] If \((X, Y)\) are jointly continuous with density \(f\), we get: \[ f_X(x) = \int_{-\infty}^{+\infty}f(x, y)\,\mathrm{d}y;\quad f_Y(y) = \int_{-\infty}^{+\infty}f(x, y)\,\mathrm{d}x. \]

Example 4.12.

Consider \((X, Y)\) with joint distributions \[ \mathbb P(X = Y = 0) = 0.2,\ \mathbb P(X = 1,\, Y = 0) = 0.4,\ \mathbb P(X = Y = 1) = 0.4. \] Then the marginal distributions of \(X\) and \(Y\) are: \[ \mathrm{P}(X = 0) = 0.2,\, \mathrm{P}(X = 1) = 0.8;\quad \mathrm{P}(Y = 0) = 0.6,\, \mathrm{P}(Y = 1) = 0.4. \]

Example 4.13.

Consider \((X, Y)\) with joint density \(f(x, y) = x + y\), \(0 \le x \le 1\), \(0 \le y \le 1\). Then density of \(Y\) is \[ f_Y(y) = \int_0^1(x+y)\,\mathrm{d}x = y + \frac12,\ 0 \le y \le 1. \]

Example 4.14.

For \(f(x, y) = 2\), \(0 \le x \le 1\), \(x \le y \le 1\), we have: \[ f_X(x) = \int_x^12\,\mathrm{d}y = 2(1 - x),\ 0 \le x \le 1;\quad f_Y(y) = \int_0^y2\,\mathrm{d}x = 2y,\ 0 \le y \le 1. \]

4.5. Order statistics

Take i.i.d. random variables \(X_1,\ldots, X_n\) with CDF \(F_X(x) = \mathbb P(X \le x)\). Find the CDF of the minimum \(m_n = \min(X_1, \ldots, X_n)\) and the maximum \(M_n = \max(X_1, \ldots, X_n)\).

For the maximum, we can say that \(M_n \le x\) is equivalent to \(X_1 \le x\) and \(X_2 \le x\) and \(\ldots\) and \(X_n \le x\). The CDF of \(M_n\) is given by \[ F_n(x) = \mathbb P(M_n \le x) = \mathbb P(X_1 \le x, \ldots, X_n \le x) = \mathbb P(X_1 \le x)\cdots\mathbb P(X_n \le x) = F_X^n(x). \] By differentiating this CDF, if this is a continuous random variable, we can get the density.

Example 4.15.

If \(X, Y \sim \mathrm{Exp}(\lambda)\) i.i.d. (waiting time), then \(F_X(x) = 1 - e^{-\lambda x}\) for \(x > 0\), and \(\max(X, Y)\) has CDF \((1 - e^{-\lambda x})^2\) for \(x > 0\). Thus density is \[ 2\lambda e^{-\lambda x}(1 - e^{-\lambda x}),\ x > 0. \]

For the minimum, we cannot argue in the same way, because \(m_n \le x\) is equivalent to \(X_1 \le x\) or \(X_2 \le x\) or \(\ldots\) or \(X_n \le x\): here we have OR instead of AND. Thus we invert: \(m_n > x\) is equivalent to \(X_1 > x\) and \(X_2 > x\) and \(\ldots\) and \(X_n > x\). We can compute tail functions instead of CDFs: \(1 - F_X(x) = \mathbb P(X > x)\). \[ G_n(x) = \mathbb P(m_n \le x) = 1 - \mathbb P(m_n > x) = 1 - \mathbb P(X_1 > x, \ldots, X_n > x) = 1 - \mathbb P(X_1 > x)\cdot\ldots\cdot\mathbb P(X_n > x) = 1 - (1 - F_X(x))^n. \] Thus \(1 - G_n(x) = (1 - F_X(x))^n\).

Example 4.16.

If \(X, Y \sim \mathrm{Exp}(\lambda)\) i.i.d. (waiting time), then \(1 - F_X(x) = e^{-\lambda x}\) for \(x > 0\), and \(\min(X, Y)\) has tail function \((e^{-\lambda x})^2 = e^{-2\lambda x}\) for \(x > 0\). Thus the density is minus the derivative of the tail function: \[ 2\lambda e^{-2\lambda x},\quad x > 0. \] In other words, \(\min(X, Y) \sim \mathrm{Exp}(2\lambda)\).
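
A short simulation check of Examples 4.15 and 4.16 (a sketch; the rate \(\lambda = 1\) and the seed are arbitrary choices): the minimum of two i.i.d. \(\mathrm{Exp}(\lambda)\) variables should behave like \(\mathrm{Exp}(2\lambda)\).

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n = 1.0, 1_000_000
X = rng.exponential(1 / lam, size=n)   # numpy parametrizes by the mean 1/lambda
Y = rng.exponential(1 / lam, size=n)

m = np.minimum(X, Y)                   # should be Exp(2*lambda), mean 1/(2*lambda)
M = np.maximum(X, Y)
print(m.mean())                        # close to 0.5
# Empirical tail P(min > 1) vs theoretical e^{-2*lambda}
print((m > 1).mean(), np.exp(-2 * lam))
print(M.mean())                        # E[max] = E[X] + E[Y] - E[min] = 1.5
```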

5. Conditional Distributions

5.1. Conditional probability, discrete case

Recall how to find conditional probability from Section 2.

Example 5.1.

Toss a fair coin twice. Let \(A\) be the event that there are two Heads. Let \(Y = 1\) if the first toss is H, \(0\) otherwise. Then \[ \mathbb{P}(A\mid Y = 1) = \frac12, \] because this is the probability that the second toss is H. Next, if the first toss is not H, then we cannot have both H; thus, \[ \mathbb{P}(A\mid Y = 0) = 0. \] We can write this as \[ \mathbb{P}(A\mid Y) = \begin{cases} \mathbb{P}(A\mid Y = 0),\ Y = 0;\\ \mathbb{P}(A\mid Y = 1),\ Y = 1; \end{cases} = \quad \begin{cases} 0,\ Y = 0;\\ \frac12,\ Y = 1; \end{cases} = \frac Y2. \]

The random variable \(\mathbb{P}(A\mid Y)\) is called the conditional probability of \(A\) given \(Y\). We can also have conditional probability depending on many random variables.

Example 5.2.

Let \(X, Y, Z\) be random variables with the following joint distribution:

\(X\) \(Y\) \(Z\) Prob.
\(0\) \(0\) \(0\) \(1/8\)
\(1\) \(0\) \(0\) \(1/8\)
\(0\) \(1\) \(2\) \(1/8\)
\(-1\) \(1\) \(1\) \(1/8\)
\(0\) \(1\) \(1\) \(1/2\)

Find \(\mathbb{P}(X = 0\mid Y, Z)\). We have: \[ \mathbb{P}(X = 0\mid Y = Z = 0) = \frac{1/8}{1/8 + 1/8} = \frac12. \] \[ \mathbb{P}(X = 0\mid Y = 1,\, Z = 2) = 1. \] \[ \mathbb{P}(X = 0\mid Y = Z = 1) = \frac{1/2}{1/2 + 1/8} = \frac45. \] In other words, \[ \mathbb{P}(X = 0\mid Y, Z) = \begin{cases} \frac12,\ Y = Z = 0;\\ 1,\ Y = 1,\, Z = 2;\\ \frac45,\ Y = Z = 1; \end{cases} \]
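
These conditional probabilities can be recomputed mechanically from the joint table (a minimal sketch of Example 5.2 in code):

```python
from collections import defaultdict

# Joint distribution of (X, Y, Z) from Example 5.2
joint = {(0, 0, 0): 1/8, (1, 0, 0): 1/8, (0, 1, 2): 1/8,
         (-1, 1, 1): 1/8, (0, 1, 1): 1/2}

# P(X = 0 | Y = y, Z = z) = P(X = 0, Y = y, Z = z) / P(Y = y, Z = z)
num, den = defaultdict(float), defaultdict(float)
for (x, y, z), p in joint.items():
    den[(y, z)] += p
    if x == 0:
        num[(y, z)] += p

for yz in den:
    print(yz, num[yz] / den[yz])   # (0,0): 0.5, (1,2): 1.0, (1,1): 0.8
```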

5.2. Conditional expectation, discrete case

This is defined similarly to the conditional probability.

Example 5.3.

In the previous example, let us find \(\mathbb{E}(X\mid Y, Z)\). We have: \[ \mathbb{E}(X\mid Y = Z = 0) = \frac{(1/8)\cdot 0 + (1/8)\cdot 1}{(1/8) + (1/8)} = \frac12, \] \[ \mathbb{E}(X\mid Y = 1,\, Z = 2) = 0, \] \[ \mathbb{E}(X\mid Y = Z = 1) = \frac{(-1)\cdot (1/8) + 0\cdot (1/2)}{(1/8) + (1/2)} = -\frac15. \]

Example 5.4.

Toss a fair coin twice, and let \(X\) be the number of Heads; let \(Y = 1\) if the first toss is H, \(Y = 0\) otherwise. If \(Y = 0\), then \(X = 0\) or \(X = 1\), each with probability \(1/2\). If \(Y = 1\), then \(X = 1\) or \(X = 2\), each with probability \(1/2\). Then \[ \mathbb{E}(X\mid Y = 0) = 0\cdot \frac12 + 1\cdot\frac12 = \frac12, \] \[ \mathbb{E}(X\mid Y = 1) = 1\cdot\frac12 + 2\cdot\frac12 = \frac32. \] We can represent this as \[ \mathbb{E}(X\mid Y) = \begin{cases} \frac12,\ Y = 0;\\ \frac32,\ Y = 1; \end{cases} = \frac12 + Y. \]

5.3. Properties of a conditional expectation

Take random variables \(X, Y, Z_1, \ldots, Z_n\).

  • \(\mathbb{E}(X + Y\mid Z_1, \ldots, Z_n) = \mathbb{E}(X\mid Z_1, \ldots, Z_n) + \mathbb{E}(Y\mid Z_1, \ldots, Z_n)\);

  • If \(X\) is a function of \(Z_1, \ldots, Z_n\), then \(\mathbb{E}(X\mid Z_1,\ldots, Z_n) = X\), because if you know \(Z_1, \ldots, Z_n\), you already know the exact value of \(X\).

  • If \(X\) is independent of \(Z_1, \ldots, Z_n\), then \(\mathbb{E}(X\mid Z_1, \ldots, Z_n) = \mathbb{E} X\), because knowledge of \(Z_1, \ldots, Z_n\) does not give us any additional information about the distribution of \(X\).

  • If \(X\) is a function of \(Z_1, \ldots, Z_n\), then \(\mathbb{E}(XY\mid Z_1, \ldots, Z_n) = X\mathbb{E}(Y\mid Z_1, \ldots, Z_n)\). Indeed, when you try to predict the value of \(XY\) based on the information from \(Z_1,\ldots, Z_n\), then you already know the value of \(X\), and you can assume \(X\) is just a constant, and put it outside of the expectation sign.

  • \(\mathbb{E}(\mathbb{E}(X\mid Z_1, \ldots, Z_n)) = \mathbb{E} X\). The prediction of \(X\) given \(Z_1, \ldots, Z_n\) is itself a random variable, which is a function of \(Z_1, \ldots, Z_n\). If you average over all possible values of \(Z_1, \ldots, Z_n\), you give up your knowledge of \(Z_1, \ldots, Z_n\), and arrive at the original situation, when you did not know anything. There, the best prediction for \(X\) is \(\mathbb{E} X\).

Example 5.5.

Take i.i.d. \(X, Y, Z \sim \mathrm{Poi}(1)\). Then \(\mathbb{E} X = \mathrm{Var} X = 1\), thus \(\mathbb{E} X^2 = (\mathbb{E} X)^2 + \mathrm{Var} X = 2\), and the same for \(Y, Z\). Therefore, \[ \mathbb{E}(X + 2Y + Z\mid Z) = \mathbb{E}(X\mid Z) + 2\mathbb{E}(Y\mid Z) + \mathbb{E}(Z\mid Z) = \mathbb{E} X + 2\mathbb{E} Y + Z = \boxed{3 + Z} \] \[ \mathbb{E}((2X + Z - 1)^2\mid Z) = \mathbb{E}(4X^2 + 4XZ - 4X + Z^2 - 2Z + 1\mid Z) = 4\mathbb{E}(X^2\mid Z) + 4\mathbb{E}(XZ\mid Z) - 4\mathbb{E}(X\mid Z) + \mathbb{E}(Z^2 - 2Z + 1\mid Z) \] \[ = 4\mathbb{E} X^2 + 4Z\mathbb{E} X - 4\mathbb{E} X + Z^2 - 2Z + 1 = 8 + 4Z - 4 + Z^2 - 2Z + 1 = \boxed{Z^2 + 2Z + 5} \]
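
A Monte Carlo sanity check of the first identity (a sketch; we estimate \(\mathbb{E}(X + 2Y + Z\mid Z = z)\) by averaging over simulated draws that share the same value of \(z\); the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X, Y, Z = (rng.poisson(1.0, size=n) for _ in range(3))
W = X + 2 * Y + Z

# Group W by the observed value of Z and average: should be close to 3 + z
for z in range(4):
    mask = (Z == z)
    print(z, W[mask].mean(), 3 + z)
```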

5.4. Conditional distribution

This is the distribution of a random variable \(X\) given \(Z_1, \ldots, Z_n\), which consists of all probabilities \(\mathbb{P}(X = x\mid Z_1, \ldots, Z_n)\), where \(x\) is a possible value of the random variable \(X\).

Example 5.6.

Toss two fair coins, and let \[ X = \begin{cases} 1,\ \mbox{first H};\\ 0,\ \mbox{first T}, \end{cases} \ \ Y = \mbox{number of Heads} = \begin{cases} 2,\ \mbox{HH};\\ 1,\ \mbox{HT, TH};\\ 0,\ \mbox{TT}. \end{cases} \]
  • if \(Y = 2\), then \(\mathbb{P}(X = 1\mid Y) = 1\), \(\mathbb{E}(X\mid Y) = 1\), and \(\mathrm{Var}(X\mid Y) = 0\), because a constant random variable has zero variance;
  • if \(Y = 1\), then \(\mathbb{P}(X = 1\mid Y) = \mathbb{P}(X = 0\mid Y) = 0.5\), \(\mathbb{E}(X\mid Y) = 0.5\), \(\mathbb{E}(X^2\mid Y) = 1^2\cdot 0.5 + 0^2\cdot 0.5 = 0.5\), and \(\mathrm{Var}(X\mid Y) = \mathbb{E}(X^2\mid Y) - (\mathbb{E}(X\mid Y))^2 = 0.5 - 0.5^2 = 0.25\);
  • if \(Y = 0\), then \(\mathbb{P}(X = 0\mid Y) = 1\), \(\mathbb{E}(X\mid Y) = 0\), and \(\mathrm{Var}(X\mid Y) = 0\).

Example 5.7.

Let \(S = \{1, 2, 3, 4, 5\}\), and \(p(1) = 0.5,\ p(2) = 0.2,\ p(3) = p(4) = p(5) = 0.1\). Let \[ X(\omega) = \begin{cases} 1,\ \omega = 1;\\ 0,\ \omega = 2, 3, 4, 5; \end{cases} \ \ Y(\omega) = \begin{cases} 1,\ \omega = 1, 2, 3;\\ 0,\ \omega = 4, 5; \end{cases} \ \ Z(\omega) = \begin{cases} 1,\ \omega = 1, 2, 3, 4;\\ 0,\ \omega = 5. \end{cases} \] Let us find the law (the joint distribution) of \((X, Y)\) given \(Z\). If \(Z = 1\), then this event has probability \[ \mathbb{P}(Z = 1) = p(1) + p(2) + p(3) + p(4) = 0.9, \] and therefore we can calculate \[ \mathbb{P}(X = 1, Y = 1\mid Z = 1) = \frac{p(1)}{\mathbb{P}(Z = 1)} = \frac{0.5}{0.9} = \frac59, \] \[ \mathbb{P}(X = 0, Y = 1\mid Z=1) = \frac{p(2)+p(3)}{\mathbb{P}(Z=1)} = \frac{0.3}{0.9} = \frac13, \] \[ \mathbb{P}(X = 0, Y = 0\mid Z=1) = \frac{p(4)}{\mathbb{P}(Z=1)} = \frac{0.1}{0.9} = \frac19. \] Also, if \(Z = 0\), then \(X = Y = 0\) with probability \(1\), that is, \[ \mathbb{P}(X = 0, Y = 0\mid Z=0) = 1. \]

5.5. Continuous distributions

Assume \((X, Y)\) have joint density \(f(x, y)\). The marginal density of \(X\) is given by \(f_X(x) = \int f(x, y)\mathrm{d} y\), and the marginal density of \(Y\) is given by \(f_Y(y) = \int f(x, y)\mathrm{d} x\). If we know that \(X = x\), then \(Y\) has conditional density \[ f_{Y\mid X}(y\mid x) = \frac{f(x, y)}{f_X(x)}. \] Similarly, if \(Y = y\), then \(X\) has conditional density \[ f_{X\mid Y}(x\mid y) = \frac{f(x, y)}{f_Y(y)}. \] We can also calculate the conditional expectation: \[ \mathbb E(Y\mid X = x) = \int yf_{Y\mid X}(y\mid x)\mathrm{d} y. \]

Take the density \(f(x, y) = x + y\) for \(0 \le x \le 1\) and \(0 \le y \le 1\). Then \[ f_X(x) = \int_0^1(x + y)\mathrm{d} y = \left.\left(xy + \frac{y^2}2\right)\right|_{y=0}^{y=1} = x + \frac12, \] \[ f_{Y\mid X}(y\mid x) = \frac{x+y}{x + 1/2},\ 0 \le y \le 1. \] For example, if \(x = 1/4\), then \[ f_{Y\mid X}(y\mid x) = \frac{1/4 + y}{3/4} = \frac13 + \frac43y,\ 0 \le y \le 1, \] \[ \mathbb E\left(Y\mid X = \frac14\right) = \int_0^1\left(\frac13 + \frac43y\right)y\mathrm{d} y = \left.\left(\frac16y^2 + \frac49y^3\right)\right|_{y=0}^{y=1} = \frac16 + \frac49 = \frac{11}{18}. \]
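
A quick numerical check of the last computation (a sketch using a simple midpoint-rule integral; only numpy is assumed):

```python
import numpy as np

x = 0.25
n = 200_000
y = (np.arange(n) + 0.5) / n          # midpoints of a uniform grid on [0, 1]

f_cond = (x + y) / (x + 0.5)          # conditional density f_{Y|X}(y | 1/4)
E_Y_given_X = np.mean(y * f_cond)     # midpoint rule for the integral over [0, 1]
print(E_Y_given_X, 11 / 18)           # both close to 0.6111
```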

6. Moment Generating Functions

6.1. Definition and Examples

A moment generating function (MGF) for a random variable \(X\) is defined as \[ M_X(t) = \mathbb E\left[e^{tX}\right]. \]

Example 6.1.

For the random variable which assumes values \(\pm1\) with equal probability \(1/2\), this MGF is given by \[ M_X(t) = \frac12\cdot e^{1t} + \frac12\cdot e^{-1t} = \frac12\left(e^t + e^{-t}\right). \]

Example 6.2.

If \(X\) is the number of Heads in two coin tosses, then the law of \(X\) is \[ \mathbb P(X = 0) = \frac14,\quad \mathbb P(X = 1) = \frac12,\quad \mathbb P(X = 2) = \frac14. \] Thus \(\mathbb E\left[e^{tX}\right] = \frac14e^{0t} + \frac12e^{1t} + \frac14e^{2t} = \frac14(1 + 2e^t + e^{2t})\).

Example 6.3.

Let \(X\) be the number of tosses needed to get one Head. Then \[ \mathbb P(X = 1) = \mathbb P(H) = \frac12,\quad \mathbb P(X = 2) = \mathbb P(TH) = \frac14, \ldots, \mathbb P(X = n) = 2^{-n}. \] Recall: \(\sum_{n=1}^{\infty} a^n = a/(1-a)\) for \(|a| < 1\). Thus, for \(e^t < 2\), the MGF is \[ \mathbb E\left[e^{tX}\right] = \sum_{n=1}^{\infty}2^{-n}e^{tn} = \sum_{n=1}^{\infty}(e^t/2)^n = \frac{e^t/2}{1 - e^t/2}. \]

Example 6.4.

\(X\) has density \(f(x) = 2e^{-2x},\ x > 0\). Thus \[ M_X(t) = \mathbb E\left[e^{tX}\right] = \int_0^{\infty}f(x)e^{tx}\,\mathrm{d}x = 2\int_0^{\infty}e^{-(2-t)x}\,\mathrm{d}x = \frac2{2-t}. \] Recall that \(\int_0^{\infty}e^{-ax}\,\mathrm{d}x = \frac1a\) for \(a > 0\).

6.2. Properties of MGF

Theorem 6.1.

We have: \(\mathbb E\left[X\right] = M'_X(0)\), and more generally \(\mathbb E\left[X^k\right] = M^{(k)}_X(0)\) for \(k = 1, 2, \ldots\)

Proof of Theorem 6.1.

\(M'_X(t) = \mathbb E\left[(e^{tX})'\right] = \mathbb E\left[Xe^{tX}\right]\). Letting \(t = 0\), we get: \(M'_X(0) = \mathbb E\left[Xe^0\right] = \mathbb E\left[X\right]\). Similarly for \(k \ge 2\).

Example 6.5.

For the previous example, \[ \mathbb E\left[X\right] = M'_X(0) = \left.\left(\frac2{2-t}\right)'\right|_{t=0} = \left.\frac2{(2-t)^2}\right|_{t=0} = \frac12. \] This is easier to compute than integrate by parts \[ \mathbb E\left[X\right] = \int_0^{\infty}2xe^{-2x}\,\mathrm{d}x. \] Next, the second moment is \[ \mathbb E\left[X^2\right] = M''_X(0) = \left.\frac{2\cdot 2}{(2 - t)^3}\right|_{t=0} = \frac12. \] Thus \(\mathrm{Var} X = \mathbb E\left[X^2\right] - (\mathbb E X)^2 = 1/4\).
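
These derivative computations can be checked symbolically (a sketch assuming the sympy library is available):

```python
import sympy as sp

t = sp.symbols('t')
M = 2 / (2 - t)                     # MGF of the density f(x) = 2 e^{-2x}, x > 0

EX  = sp.diff(M, t, 1).subs(t, 0)   # first moment
EX2 = sp.diff(M, t, 2).subs(t, 0)   # second moment
print(EX, EX2, EX2 - EX**2)         # 1/2, 1/2, 1/4
```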

Example 6.6.

For \(M_X(t) = \frac23e^{-2t} + \frac13e^t\), we have: \(\mathbb P(X = 1) = 1/3\) and \(\mathbb P(X = -2) = 2/3\).

Theorem 6.2.

For two independent random variables \(X\) and \(Y\), \(M_{X+Y}(t) = M_X(t)M_Y(t)\).

Proof of Theorem 6.2.

We have, by independence: \(\mathbb E\left[e^{t(X+Y)}\right] = \mathbb E\left[e^{tX}e^{tY}\right] = \mathbb E\left[e^{tX}\right] \mathbb E\left[e^{tY}\right]\).

Example 6.7.

For sum of two independent random variables with \(f(x) = 2e^{-2x},\, x > 0\), we get: \(M(t) = (2/(2-t))^2\).

7. Special Distributions

7.1. Binomial distribution

Consider a sequence of Bernoulli trials: Each results in success or failure with probabilities \(p\) and \(q\), with \(p + q = 1\). All trials are independent.

Example 7.1.

Toss a coin 10 times. Then Heads is success, Tails is failure, and \(p = q = 1/2\).

Example 7.2.

Roll a die. Success is 6, all others are failures. Then \(p = 1/6\) and \(q = 5/6\).

Example 7.3.

The stock market, as measured by the Standard & Poor's 500 index, rises or falls each day. Assume daily movements are independent. Then \(p = 53.29\%\) and \(q = 46.71\%\), estimated from daily movements, 1996–2016.

Definition 7.1.

A binomial random variable is the number of successes in \(N\) trials, with probability \(p\) of success. The distribution is denoted by \(\mathrm{Bin}(N, p)\).

Theorem 7.1.

A random variable \(X \sim \mathrm{Bin}(N, p)\) takes values \(0, 1, \ldots, N\), with \[ \mathbb P(X = k) = \binom{N}{k}p^kq^{N-k},\quad k = 0, \ldots, N. \]

Proof of Theorem 7.1.

The binomial coefficient \(\binom{N}{k}\) is the number of ways to choose which \(k\) of the \(N\) trials are successful. In other words, it is the number of \(k\)-element subsets of \(\{1, 2, \ldots, N\}\). Each such choice has probability \(p^kq^{N-k}\).

Example 7.4.

\(N = 4\) and \(k = 2\). Then there are 6 choices: SSFF, SFSF, SFFS, FSSF, FSFS, FFSS. Each choice has probability \(p^2q^2\). Thus \(\mathbb P(X = 2) = 6p^2q^2\).

Theorem 7.2.

For \(X \sim \mathrm{Bin}(N, p)\), we have: \(\mathbb E X = Np\) and \(\mathrm{Var}(X) = Npq\).

Proof of Theorem 7.2.

Let \(X = X_1 + \ldots + X_N\) with \(X_i = 1\) if the \(i\)th trial is a success, \(X_i = 0\) if it is a failure. Then \(\mathbb E X_i = 1\cdot p + 0\cdot q = p\) and \(\mathbb E [X_i^2] = 1^2\cdot p + 0^2\cdot q = p\), thus \(\mathrm{Var} X_i = p - p^2 = pq\). Therefore, \[ \mathbb E X = \mathbb E X_1 + \ldots + \mathbb E X_N = p + p + \ldots + p = Np. \] Since \(X_1, \ldots, X_N\) are independent, we have: \[ \mathrm{Var} X = \mathrm{Var} X_1 + \ldots + \mathrm{Var} X_N = pq + \ldots + pq = Npq. \]

Example 7.5.

Let \(X \sim \mathrm{Bin}(4, 0.3)\). Then \(\mathbb E X = 4\cdot 0.3 = 1.2\) and \(\mathrm{Var} X = 4\cdot 0.3\cdot 0.7 = 0.84\).

Theorem 7.3.

The MGF for \(X \sim \mathrm{Bin}(N, p)\) is \(M_X(t) = (pe^t + q)^N\).

Proof of Theorem 7.3.

We have: \(M_{X_i}(t) = \mathbb E\left[e^{tX_i}\right] = pe^{1t} + qe^{0t} = pe^t + q\). Thus by independence \[ M_X(t) = M_{X_1}(t)\ldots M_{X_N}(t) = (pe^t+q)^N. \]

Example 7.6.

An IT center uses 9 disk drives. Each drive is out of service, independently of the others, with probability 6%. The instruction stipulates that at least 7 drives must be available. What is the probability of this? The number of working drives is \(X \sim \mathrm{Bin}(9, 0.94)\). Thus \[ \mathbb P(X \ge 7) = \mathbb P(X = 7) + \mathbb P(X = 8) + \mathbb P(X = 9) = \binom{9}{7}0.94^70.06^2 + \binom{9}{8}0.94^80.06 + \binom{9}{9}0.94^9 \approx 0.986. \]
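
This probability can be computed directly from the binomial pmf (a minimal sketch using only the standard library):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 9, 0.94
prob = sum(binom_pmf(k, n, p) for k in (7, 8, 9))
print(prob)   # approximately 0.986
```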

Example 7.7.

An insurance company has 10 clients, each has an accident independently of others, with probability 20%. Then the number of accidents is \(X \sim \mathrm{Bin}(10, 0.2)\), and \(\mathbb P(X = 2) = \binom{10}{2}0.2^20.8^8 \approx 0.302\).

7.2. Geometric distribution

Definition 7.2.

For a sequence of independent Bernoulli trials with probabilities \(p\) and \(q\) of success and failure, let \(X\) be the number of trials until the first success. Then \(X \sim \mathrm{Geo}(p)\). This is called a geometric distribution.

Theorem 7.4.

For such \(X\), \(\mathbb P(X = n) = pq^{n-1},\, n = 1, 2, \ldots\)

Proof of Theorem 7.4.

\(X = n\) means first \(n-1\) failures and \(1\) success: FF…FS, which has probability \(q\cdot q\cdot\ldots\cdot q\cdot p = pq^{n-1}\).

Example 7.8.

The probability that your first Heads appears on the \(n\)th toss is equal to \((1/2)^n\), since \(p = q = 1/2\).

Theorem 7.5.

The moment generating function is \[ M_X(t) = \frac{pe^t}{1 - qe^t}. \]

Proof of Theorem 7.5.

We have: \(M_X(t) = \sum_{n \ge 1}pq^{n-1}e^{tn} = pe^t\sum_{n \ge 1}(qe^t)^{n-1}\), and now recall that \(\sum_{n = 0}^{\infty}a^n = 1/(1 - a)\).

Theorem 7.6.

For \(X \sim \mathrm{Geo}(p)\), we have \(\mathbb E[X] = 1/p\) and \(\mathrm{Var} X = q/p^2\).

Proof of Theorem 7.6.

Take \(M'_X(0) = \mathbb E[X] = 1/p\) and \(M''_X(0) = \mathbb E[X^2] = (1+q)/p^2\). Thus \(\mathrm{Var} X = q/p^2\).

Example 7.9.

If \(X\) is the number of tosses until the first Heads, then \(X \sim \mathrm{Geo}(1/2)\), and \[ M_X(t) = \frac{e^t/2}{1 - e^t/2},\quad \mathbb E X = 2,\quad \mathrm{Var} X = 2. \]

7.3. Negative binomial distribution

How many times \(X\) do you need to toss a coin to get \(r\) Heads, where \(r\) is a fixed parameter? This is called the negative binomial distribution \(\mathrm{NB}(r, p)\). For example, if \(r = 3\) and the sequence is TTTHTHH, then \(X = 7\). What is \(\mathbb{P}(X = n)\)? Saying that the \(n\)th toss gives the \(r\)th Heads is the same as saying that the first \(n-1\) tosses contain \(r-1\) Heads and \(n-r\) Tails, and the last, \(n\)th toss is Heads. The probability that the first \(n-1\) tosses result in \(r-1\) Heads (and \(n-r\) Tails) is \[ \binom{n-1}{r-1}p^{r-1}q^{n-r}. \] Indeed, there are \[ \binom{n-1}{r-1} \] choices for the slots in the first \(n-1\) tosses occupied by Heads. Each of these choices has probability \(p^{r-1}q^{n-r}\). Finally, the probability of the last toss being Heads is \(p\). So \[ \boxed{ \mathbb{P}(X = n) = \binom{n-1}{r-1}p^{r}q^{n-r} } \] This is true for \(n = r, r + 1, r + 2, \ldots\). The random variable \(X \sim \mathrm{NB}(r, p)\) can take only values greater than or equal to \(r\), because to get \(r\) Heads, you need at least \(r\) tosses. One can alternatively describe this distribution as the number of Bernoulli trials needed to get the \(r\)th success. For \(r = 1\), this becomes the geometric distribution with parameter \(p\).

For the geometric and negative binomial distributions, there is some ambiguity in the definition. Sometimes the negative binomial distribution is defined as the number of Tails, as opposed to tosses, until the \(r\)th Heads. In this case it is shifted down by \(r\), and the random variable thus defined takes values \(0, 1, 2, \ldots\) (Similarly, the geometric distribution is sometimes defined as the number of Tails until the first Heads.) We define this distribution as the number of tosses until the \(r\)th Heads.

Example 7.10.

Let \(p = 1/2\), \(r = 2\). Then \[ \mathbb{P}(X = 3) = \binom{3-1}{2-1}\left(\frac12\right)^2\left(1 - \frac12\right) = 2\cdot\frac14\cdot\frac12 = \frac14. \] We can equivalently get this as follows: to get two Heads in exactly three tosses, we need either THH or HTH. Each of these sequences has probability \(1/8\). So the resulting probability is \((1/8) + (1/8) = 1/4\).

For \(X \sim \mathrm{NB}(r, p)\), we have: \[ X = X_1 + \ldots + X_r,\ \ X_1, \ldots, X_r \sim \mathrm{Geo}(p)\ \mbox{i.i.d.} \] Recall that i.i.d. means independent identically distributed. Indeed, to get to the \(r\)th Heads, we need to get the first Heads, which required geometric number \(X_1\) of tosses; then we need to get from first to second Heads, which required also geometric number \(X_2\) of tosses. This second number is independent of the first one, because the coin does not have a memory, etc. Therefore, \[ \mathbb E X = \mathbb E X_1 + \ldots + \mathbb E X_r = \frac{r}{p}, \] and using independence, we get: \[ \mathrm{Var} X = \mathrm{Var} X_1 + \ldots + \mathrm{Var} X_r = \frac{qr}{p^2}. \]
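
A simulation check of the decomposition into a sum of geometric variables (a sketch; \(r = 3\), \(p = 0.4\), and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
r, p, n = 3, 0.4, 1_000_000

# numpy's geometric counts the number of trials until the first success
G = rng.geometric(p, size=(n, r))
X = G.sum(axis=1)                       # X ~ NB(r, p): trials until the r-th success

print(X.mean(), r / p)                  # both close to 7.5
print(X.var(), r * (1 - p) / p**2)      # both close to 11.25
```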

The negative binomial distribution and the binomial distribution answer somewhat opposite questions. For the binomial distribution, you have a fixed number of trials, and you ask how many of them are successful. For the negative binomial distribution, you have a fixed number of successes, and you ask how many trials you need.

7.4. Poisson distribution

Take a parameter \(\lambda > 0\).

Definition 7.3.

A Poisson distribution \(X \sim \mathrm{Poi}(\lambda)\) is given by \[ \mathbb{P}(X = k) = \frac{\lambda^k}{k!}e^{-\lambda},\quad k = 0, 1, 2, \ldots \]

Example 7.11.

For \(X \sim \mathrm{Poi}(2)\), we have: \(\mathrm{P}(X \ge 1) = 1 - \mathrm{P}(X = 0) = 1 - e^{-2}\); \(\mathrm{P}(X \ge 2) = 1 - \mathrm{P}(X = 0) - \mathrm{P}(X = 1) = 1 - e^{-2} - 2e^{-2} = 1 - 3e^{-2}\).

Theorem 7.7.

Consider a binomial distribution \(\mathrm{Bin}(N, p)\). If \(N \to \infty\), and \(Np = \lambda\), then \(\mathrm{Bin}(N, p) \to \mathrm{Poi}(\lambda)\). More precisely, for every fixed \(k = 0, 1, 2, \ldots\), we have as \(N \to \infty\): \[ \binom{N}{k}p^kq^{N-k} \to \frac{\lambda^k}{k!}e^{-\lambda}. \]

Proof of Theorem 7.7.

Note that \(p = \lambda/N\), \(q = 1 - \lambda/N\), and \[ \binom{N}{k} = \frac{N!}{k!(N-k)!} = \frac{N(N-1)\ldots(N-k+1)}{k!}. \] Therefore, \[ \binom{N}{k}p^kq^{N-k} = \frac{N(N-1)\ldots(N-k+1)}{k!}\left(\frac{\lambda}N\right)^k\left(1 - \frac{\lambda}N\right)^{N-k} \] \[ = \frac{\lambda^k}{k!}\cdot\frac{N(N-1)\ldots(N-k+1)}{N^k}\cdot\left(1 - \frac{\lambda}N\right)^{N-k}. \] The first factor \(\lambda^k/k!\) is constant (does not depend on \(N\)); the second factor is \[ \frac{N(N-1)\ldots(N-k+1)}{N^k} = \frac{N}{N}\cdot\frac{N-1}N\cdot\ldots\cdot\frac{N-k+1}N \to 1, \] because each of these fractions tends to one: indeed, \(N \to \infty\), and \(k\) is fixed. Finally, the third factor tends to \(e^{-\lambda}\), because \[ \log\left(1 - \frac{\lambda}N\right)^{N-k} = (N-k)\log\left(1 - \frac{\lambda}N\right) \backsim (N-k)\left(-\frac{\lambda}N\right) = -\lambda\frac{N-k}N \to -\lambda, \] and then we exponentiate both sides. We use the tangent line approximation: \[ \log(1 + x) \backsim x,\ x \to 0, \] which is true: let \(f(x) = \log(1 + x)\); then \(f'(x) = 1/(1 + x)\), and so \(f(0) = 0\), \(f'(0) = 1\), and \[ f(x) \backsim f(0) + f'(0)x = x,\ \ x \to 0. \] This is the meaning of the Poisson distribution: it is approximately the number of occurrences among many independent events, each of which is rare.

Example 7.12.

A company which sells flood insurance has \(N = 20000\) clients, and each client has probability of flood \(p = 0.01\%\), independently of the others. What is the distribution of the number \(X\) of floods? Exactly, \(X \sim \mathrm{Bin}(N, p)\), but this is hard to use for calculations: for example, \[ \mathbb{P}(X = 10) = \binom{20000}{10}\left(\frac1{10000}\right)^{10}\left(\frac{9999}{10000}\right)^{19990}. \] Use the Poisson approximation: \(X\) is approximately distributed as \(\mathrm{Poi}(2)\), because \(\lambda = Np = 2\). So \[ \mathbb{P}(X = 10) \approx \frac{2^{10}}{10!}e^{-2}. \] Also, \[ \mathbb{P}(X \ge 3) = 1 - \mathbb{P}(X = 0) - \mathbb{P}(X = 1) - \mathbb{P}(X = 2) \] \[ \approx 1 - e^{-2} - \frac{2^1}{1!}e^{-2} - \frac{2^2}{2!}e^{-2} = 1 - e^{-2} - 2e^{-2} - 2e^{-2} = 1 - 5e^{-2}. \] A slightly more general statement is true. If we have \(N\) (a large number of) independent events, occurring with probabilities \(p_1, \ldots, p_N\), then the number of events which happened is approximately \(\mathrm{Poi}(\lambda)\) with \(\lambda = p_1 + \ldots + p_N\). So the Poisson distribution is the law of rare events: if we have many events, each of which is very unlikely to occur, then the number of events is approximately Poisson.
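
A direct comparison for this example (a sketch; Python handles the huge binomial coefficient exactly as an integer, so the "hard" formula can still be evaluated here):

```python
from math import comb, exp, factorial

N, p = 20_000, 1 / 10_000
lam = N * p                       # = 2

k = 10
exact   = comb(N, k) * p**k * (1 - p)**(N - k)
poisson = lam**k / factorial(k) * exp(-lam)
print(exact, poisson)             # both around 3.8e-05
```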

Example 7.13.

In one city, independently of each other, \(N_1 = 1000\) houses have fire with probability \(p_1 = .2\%\) and \(N_2 = 500\) houses have fire with probability \(p_2 = .6\%\). Then the number of fires is approximately \(\mathrm{Poi}(\lambda)\) with \(\lambda = N_1p_1 + N_2p_2 = 5\).

Another application is a stream of events in which an event happens with probability \(\lambda\Delta t\) on \([t, t + \Delta t]\), independently of other time intervals. Then the number of events during \([0, T]\) is \(\sim \mathrm{Poi}(\lambda T)\). Indeed, split \([0, T]\) into \(N\) equal subintervals of length \(T/N\), and apply the above law of rare events to \(p = \lambda(T/N)\). Then \(Np = \lambda T\).

Example 7.14.

The rate for \(\alpha\)-particle emission is \(\lambda \approx 3.87\) (with unit of time 1/8 of minute). Then during 2 minutes we have \(\mathrm{Poi}(3.87\cdot 16)\) particles.

7.5. Normal distribution

We define first the standard normal distribution and then the general normal distribution.

Definition 7.4.

A standard normal random variable has density \(f_Z(z) = (2\pi)^{-1/2}\exp(-z^2/2)\). For a standard normal \(Z\), we define normal \(X \sim \mathcal N(\mu, \sigma^2)\) as \(X = \mu + \sigma Z\).

Theorem 7.8.

The MGF for \(Z \sim \mathcal N(0, 1)\) is given by \[ M_Z(t) = \int_{-\infty}^{+\infty}e^{tz}f_Z(z)\,\mathrm{d}z = e^{t^2/2}. \] The MGF for \(X \sim \mathcal N(\mu, \sigma^2)\) is given by \[ M_X(t) = \exp\left[\mu t + \frac{\sigma^2t^2}2\right]. \]

Proof of Theorem 7.8.

First, we compute the MGF for the standard normal law: \[ M_Z(t) = \frac1{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{tz}e^{-z^2/2}\mathrm{d} z = \frac1{\sqrt{2\pi}}e^{t^2/2}\int_{-\infty}^{\infty}e^{-t^2/2+tz - z^2/2}\mathrm{d} z \] \[ = \frac{e^{t^2/2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-(z-t)^2/2}\mathrm{d} z = \frac{e^{t^2/2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-u^2/2}\mathrm{d} u = \frac{e^{t^2/2}}{\sqrt{2\pi}}\sqrt{2\pi} = e^{t^2/2} \] (we changed variables \(u = z - t\)). Next, we can represent \(X = \mu + \sigma Z\), and \[ M_X(t) = \mathbb E\left[e^{tX}\right] = \mathbb E\left[e^{t\mu + t\sigma Z}\right] = e^{t\mu}\cdot\mathbb E\left[e^{t\sigma Z}\right] = e^{t\mu}e^{(t\sigma)^2/2} = \exp(t\mu + t^2\sigma^2/2). \]

Example 7.15.

For \(X \sim \mathcal N(-2, 2)\) we get: \(M_X(t) = e^{-2t + t^2}\).

Theorem 7.9.

For \(X \sim \mathcal N(\mu, \sigma^2)\), we have: \(\mathbb E X = \mu\) and \(\mathrm{Var} X = \sigma^2\).

Proof of Theorem 7.9.

We compute for standard normal \(Z\): \(\mathbb E Z = M_Z'(0) = 0\), \(\mathbb E Z^2 = M''_Z(0) = 1\), and thus \(\mathrm{Var} Z = 1\). Therefore \(\mathbb E[X] = \mu + \sigma \mathbb E[Z] = \mu\), and \(\mathrm{Var}(X) = \sigma^2\mathrm{Var}(Z) = \sigma^2\).

Theorem 7.10.

The density of \(\mathcal N(\mu, \sigma^2)\) is given by \[ f(y) = \frac1{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right). \]

Proof of Theorem 7.10.

Indeed, with \(X = \mu + \sigma Z\) and \(Z\) standard normal, for all \(a\) and \(b\), \[ \mathbb P(a \le X \le b) = \mathbb P(a \le \mu + \sigma Z \le b) = \mathbb P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right) = \frac1{\sqrt{2\pi}}\int\limits_{(a - \mu)/\sigma}^{(b - \mu)/\sigma}e^{-z^2/2}\, \mathrm{d}z. \] Now let us change variables: \[ z = \frac{y - \mu}{\sigma};\ \mathrm{d}z = \frac{\mathrm{d}y}{\sigma};\ \ a \le y \le b. \] Therefore, \[ \frac1{\sqrt{2\pi}}\int\limits_{(a - \mu)/\sigma}^{(b - \mu)/\sigma}e^{-z^2/2}\, \mathrm{d}z = \frac1{\sqrt{2\pi}\sigma}\int\limits_a^be^{-(y - \mu)^2/(2\sigma^2)}\,\mathrm{d}y = \int_a^bf(y)\, \mathrm{d}y. \] This proves that \(X\) has density \(f(y)\), as shown above.

Example 7.16.

For \(X \sim \mathcal N(1, 4)\), find \(\mathbb E\left[(X-2)^3\right]\): \(X = 1 + 2Z\) with \(Z \sim \mathcal N(0, 1)\), and \((X - 2)^3 = (2Z - 1)^3 = 8Z^3 - 12Z^2 + 6Z - 1\). Taking the expected value and using \(\mathbb E Z = \mathbb E Z^3 = 0\) (by symmetry of the standard normal distribution), \(\mathbb E Z^2 = 1\), we get: \(\mathbb E\left[(X-2)^3\right] = -12 - 1 = \boxed{-13}\)

Example 7.17.

For \(X \sim \mathcal N(1, 4)\), find \(\mathbb P(X \le 3)\): \(X = 1 + 2Z\) with \(Z \sim \mathcal N(0, 1)\), and \(\mathbb P(X \le 3) = \mathbb P(1 + 2Z \le 3) = \mathbb P(Z \le 1) = 0.8413\)

Example 7.18.

Find a normal distribution with mean \(1\) and \(99\%\) quantile \(4.5\). We can represent this random variable as \(X = 1 + \sigma Z\), where \(\sigma\) needs to be found. Because \[ 0.99 = \mathbb{P}(X \le 4.5) = \mathbb{P}(1 + \sigma Z \le 4.5) = \mathbb{P}(Z \le \sigma^{-1}3.5), \] from the Z-table at the end of these lecture notes we get: \(\sigma^{-1}3.5 = x_{99\%} = 2.326\), so \(\sigma = 3.5/2.326 = 1.505\), and \(X \sim \boxed{\mathcal N(1, 1.505^2)}\)
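
Example 7.18 can be reproduced without the Z-table (a sketch using the standard library's NormalDist):

```python
from statistics import NormalDist

z99 = NormalDist().inv_cdf(0.99)     # 99% quantile of N(0, 1), about 2.326
sigma = 3.5 / z99
print(sigma)                         # about 1.505

# check: the 99% quantile of N(1, sigma^2) is indeed 4.5
print(NormalDist(mu=1, sigma=sigma).inv_cdf(0.99))
```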

7.6. Exponential distribution

The exponential distribution is sometimes used to model the lifespan of a device whose remaining lifespan is independent of how long it has worked so far. Take a \(\lambda > 0\).

Definition 7.5.

A random variable \(X\) with density \(f(x) = \lambda e^{-\lambda x}\), \(x > 0\), is called exponential with rate \(\lambda\). It is denoted as \(\mathrm{Exp}(\lambda)\).

This distribution has the following properties: for \(a > 0\), \[ \mathbb{P}(X \ge a) = \int_a^{\infty}\lambda e^{-\lambda x}\, \mathrm{d}x = \left.\left(-e^{-\lambda x}\right)\right|_{x=a}^{x = \infty} = e^{-\lambda a}, \] and thus the CDF is given by \[ F_X(u) = \begin{cases} 1 - e^{-\lambda u},\, u > 0;\\ 0,\, u \le 0. \end{cases} \]

Theorem 7.11.

This distribution has memoryless property: for \(s, t > 0\), \(\mathbb{P}(X > t + s \mid X > s) = \mathbb{P}(X > t)\).

Proof of Theorem 7.11.

Indeed, for any \(t, s > 0\), we have: \[ \mathbb P(X > t + s \mid X > s) = \frac{\mathbb P(X > t + s,\, X > s)}{\mathbb P(X > s)} = \frac{\mathbb P(X > t + s)}{\mathbb P(X > s)} = \frac{e^{-\lambda(s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = \mathbb P(X > t). \]

Let \(X \sim \mathrm{Exp}(\lambda)\). Then we have: \(f_X(x) = \lambda e^{-\lambda x}\) for \(x > 0\). Thus for \(t < \lambda\), \[ M_X(t) = \mathbb E\left[e^{tX}\right] = \int_0^{\infty}e^{tx}f_X(x)\,\mathrm{d}x = \lambda\int_0^{\infty}e^{-(\lambda - t)x}\,\mathrm{d}x = \frac{\lambda}{\lambda - t}. \] Thus we can find \[ \mathbb E\left[X\right] = M'_X(0) = \left.\left(\frac{\lambda}{\lambda - t}\right)'\right|_{t = 0} = \left.\frac{\lambda}{(\lambda - t)^2}\right|_{t=0} = \frac1{\lambda}, \] and similarly \(\mathbb E\left[X^2\right] = M''_X(0) = 2/\lambda^2\), \(\mathrm{Var} X = \lambda^{-2}\).

Example 7.19.

Find an exponential distribution such that \(\mathbb{P}(X \ge 2) = 0.4\). We have: \(e^{-2\lambda} = 0.4 \Leftrightarrow \lambda = -\ln(0.4)/2 = 0.46\).

Example 7.20.

Calculate the CDF of \(X \sim \mathrm{Exp}(2)\). For \(x \le 0\), we have: \(F(x) = 0\), because \(X \ge 0\) always. For \(x > 0\), we have: \(F(x) = \int_0^x2e^{-2u}\,\mathrm{d} u = \left.(-e^{-2u})\right|_{u=0}^{u=x} = 1 - e^{-2x}\). Therefore, \[ F(x) = \begin{cases} 0,\, x \le 0;\\ 1 - e^{-2x},\, x > 0. \end{cases} \]

7.7. Gamma distribution

Take \(n\) i.i.d. \(X_1,\ldots, X_n \sim \mathrm{Exp}(\lambda)\). Then \(X = X_1 + \ldots + X_n\) has a Gamma distribution and is denoted as \(\Gamma(n, \lambda)\).

Theorem 7.12.

The MGF of \(X \sim \Gamma(n, \lambda)\) is \[ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^n. \] Its mean and variance are \[ \mathbb E[X] = \frac n{\lambda},\quad \mathrm{Var}(X) = \frac n{\lambda^2}. \]

Proof of Theorem 7.12.

This MGF is computed as \(M_X(t) = M_{X_1}(t)\cdot\ldots\cdot M_{X_n}(t)\) by independence. The mean and variance are computed as sums of the means and variances of \(X_1, \ldots, X_n\).

Theorem 7.13.

The density of \(\Gamma(n, \lambda)\) is equal to \[ f(x) = \frac{\lambda^n}{(n-1)!}x^{n-1}e^{-\lambda x},\quad x > 0. \]

Proof of Theorem 7.13.

Note that \[ n! = I_n := \int_0^{\infty}x^ne^{-x}\mathrm{d} x. \] Why? Integrate \(I_n\) by parts: \[ I_n = -\int_0^{\infty}x^{n}\mathrm{d} e^{-x} = -\left.x^{n}e^{-x}\right|_{x=0}^{x = \infty} + \int_0^{\infty}e^{-x}\mathrm{d}\left(x^{n}\right) = 0 + n\int_0^{\infty}x^{n-1}e^{-x}\mathrm{d} x = nI_{n-1}. \] Combining that with \(I_0 = \int_0^{\infty}e^{-x}\mathrm{d} x = 1\), we have: \(I_n = n!\). Doing change of variables \(\lambda x = y\): \[ \int_0^{\infty}x^{n-1}e^{-\lambda x}\,\mathrm{d}x = \frac{(n-1)!}{\lambda^{n}}. \] Applying this to \(\lambda - t\) instead of \(\lambda\), we get: \[ \int_0^{\infty}f(x)e^{tx}\,\mathrm{d}x = \frac{\lambda^n}{(n-1)!}\int_0^{\infty}x^{n-1}e^{tx}e^{-\lambda x}\,\mathrm{d}x \] \[ = \frac{\lambda^n}{(n-1)!}\frac{(n-1)!}{(\lambda - t)^{n}} = \left(\frac{\lambda}{\lambda - t}\right)^n. \] Since the MGF uniquely determines the distribution, \(f\) is indeed the density of \(\Gamma(n, \lambda)\).

8. Convergence Modes

We compare three modes: almost surely, in probability, and in law (distribution).

8.1. Concepts

Definition 8.1.

Take random variables \(Z_0, Z_1, Z_2,\ldots\) Then \(Z_n \to Z_0\) in probability if for all \(\varepsilon > 0\) we have: \(\mathbb P(|Z_n - Z_0| > \varepsilon) \to 0\) as \(n \to \infty\).

But there is another mode of convergence:

Definition 8.2.

\(Z_n \to Z_0\) in law, or in distribution, if \(F_{Z_n}(x) = \mathbb P(Z_n \le x) \to F_{Z_0}(x) = \mathbb P(Z_0 \le x)\) for all \(x\) where \(F_{Z_0}\) is continuous.

Definition 8.3.

The third convergence mode is almost surely: \(\mathbb P(Z_n \to Z_0) = 1\).

In particular, if \(Z_n \to Z_0\) in law and \(Z_0\) is a continuous random variable, then \(F_{Z_n}(x) \to F_{Z_0}(x)\) as \(n \to \infty\) for all \(x\). Thus for all \(a < b\), \[ \mathbb P(a \le Z_n \le b) = F_{Z_n}(b) - F_{Z_n}(a) \to \mathbb P(a \le Z_0 \le b) = F_{Z_0}(b) - F_{Z_0}(a). \]

Example 8.1.

\(Z_n = 1/n\) and \(Z_0 = 0\) (constant random variables). Then \(Z_n \to Z_0\) in law, because \[ F_{Z_n}(x) = \begin{cases} 1,\ x \ge 1/n;\\ 0,\ x < 1/n; \end{cases} \quad F_{Z_0}(x) = \begin{cases} 1,\ x \ge 0;\\ 0,\ x < 0. \end{cases} \] Thus \(F_{Z_n}(x) \to F_{Z_0}(x)\) for \(x \ne 0\). Indeed, for \(x > 0\), we have \(x \ge 1/n\) for sufficiently large \(n\), and \(F_{Z_n}(x) = 1 = F_{Z_0}(x)\). And for \(x < 0\), \(F_{Z_n}(x) = 0 = F_{Z_0}(x)\). But \(F_{Z_n}(0) = 0 \nrightarrow 1 = F_{Z_0}(0)\). This is why we need to exclude points of discontinuity in the definition of convergence in law.

8.2. Relations

Theorem 8.1.

Convergence in probability implies convergence in distribution, but not vice versa.

Example 8.2.

Toss a fair coin. Let \(Z_n = 1\) if Heads, \(Z_n = 0\) if Tails, for odd \(n\); vice versa when \(n\) is even. Then all \(Z_n\) have the same distribution: \(\mathbb P(Z_n = 0) = \mathbb P(Z_n = 1) = 1/2\). Thus \(Z_n \to Z_0\) in law. But not in probability: \(|Z_n - Z_0| = 1\) for \(n\) odd.

Theorem 8.2.

Convergence almost surely is stronger than convergence in probability.

9. Inequalities

9.1. Markov’s inequality

Take a nonnegative random variable \(X \ge 0\).

Theorem 9.1.

For every \(a > 0\), we have: \[ \mathbb{P}(X \ge a) \le \frac{\mathbb{E} X}{a}. \]

Proof of Theorem 9.1.

Consider another random variable: \[ Y = \begin{cases} a,\ X \ge a;\\ 0,\ 0 \le X < a. \end{cases} \] Then always \(Y \le X\), so \(\mathbb{E} Y \le \mathbb{E} X\). But \(\mathbb{E} Y = a\mathbb{P}(X \ge a) + 0\mathbb{P}(0 \le X < a) = a\mathbb{P}(X \ge a)\). This gives us \[ a\cdot\mathbb{P}(X \ge a) \le \mathbb{E} X \Rightarrow \mathbb{P}(X \ge a) \le \frac{\mathbb{E} X}{a}. \]

Example 9.1.

Consider the exponential random variable \(X \sim \mathrm{Exp}(1)\). Then \(\mathbb{P}(X \ge 10) \le \frac{\mathbb{E} X}{10} = \frac1{10}\), because \(\mathbb{E} X = 1\). In reality, \(\mathbb{P}(X \ge 10) = e^{-10}\), which is much smaller than \(1/10\). Markov’s inequality gives very rough, one-size-fits-all estimates, but it applies to any nonnegative random variable.

9.2. Chebyshev’s inequality

This measures large deviations of \(X\) from its mean.

Theorem 9.2.

If \(\mathbb{E} X = \mu\), then for every \(b > 0\) we have: \[ \mathbb{P}(|X - \mu| \ge b) \le \frac{\mathrm{Var} X}{b^2}. \]

Proof of Theorem 9.2.

Apply Markov’s inequality to \((X - \mu)^2\) instead of \(X\) and \(b^2\) instead of \(a\). We get: \[ \mathbb{P}(|X - \mu| \ge b) = \mathbb{P}((X - \mu)^2 \ge b^2) \le \frac{\mathbb{E}(X - \mu)^2}{b^2} = \frac{\mathrm{Var} X}{b^2}. \]

Example 9.2.

For \(X \sim \mathcal N(4, 1)\), \(\mathbb E X = 4\), \(\mathrm{Var} X = 1\), and thus \(\mathbb{P}(|X - 4| > 2) \le 1/4\). The event \(|X - 4| > 2\) is equivalent to \(X < 2\) or \(X > 6\).

9.3. Law of large numbers

Let \(X_1, X_2, \ldots\) be iid (independent identically distributed) random variables with
\[ \mathbb E X_1 = \mathbb E X_2 = \ldots = \mu,\quad \mathrm{Var} X_1 = \mathrm{Var} X_2 = \ldots = \sigma^2. \]

Theorem 9.3.

Let \(S_N = X_1 + \ldots + X_N\). Then we have convergence in probability: \[ \frac{S_N}N \to \mu,\ N \to \infty, \] in the sense that for every \(\varepsilon > 0\) we have: \[ \mathbb{P}\left(\left|\frac{S_N}N - \mu\right| \ge \varepsilon\right) \to 0. \]

Proof of Theorem 9.3.

Apply Chebyshev’s inequality to \(S_N/N\). We have: \[ \mathbb{E}\frac{S_N}N = \frac1N(\mathbb{E} X_1 + \ldots + \mathbb{E} X_N) = \frac1N(\mu + \ldots + \mu) = \frac{\mu N}{N} = \mu; \] \[ \mathrm{Var}\frac{S_N}N = \frac1{N^2}\mathrm{Var} S_N = \frac1{N^2}\left(\mathrm{Var} X_1 + \ldots + \mathrm{Var} X_N\right) = \frac1{N^2}\left(\sigma^2 + \ldots + \sigma^2\right) = \frac{\sigma^2}N. \] Therefore, as \(N \to \infty\), we have: \[ \mathbb{P}\left(\left|\frac{S_N}N - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2/N}{\varepsilon^2} \to 0. \]

Actually it is possible to prove stronger convergence almost surely, but this is harder.

Example 9.3.

If \(X_i \sim \mathrm{Exp}(2)\), then \(\mu = 1/2\) and \(\sigma^2 = 1/4\), and for the sample mean \(\overline{X}_n = S_n/n\), \[ \mathbb{P}\left(\left|\overline{X}_n - \frac12\right| \ge 0.1\right) \le \frac{1/4}{0.1^2\cdot n} = \frac{25}n. \] For \(n = 10\) this bound is still greater than \(1\), so the estimate is useless. But for \(n = 1000\) it is useful: the bound is \(25/1000 = 2.5\%\).
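
A simulation comparing the Chebyshev bound of Example 9.3 with the observed frequency (a sketch; the seed and the number of repetitions are arbitrary, and the bound is typically very conservative):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, trials = 2.0, 1000, 10_000

samples = rng.exponential(1 / lam, size=(trials, n))
Xbar = samples.mean(axis=1)

actual = np.mean(np.abs(Xbar - 0.5) >= 0.1)
bound = (1 / 4) / (0.1**2 * n)          # Chebyshev: sigma^2 / (eps^2 * n) = 2.5%
print(actual, bound)                     # the observed frequency is essentially 0
```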

9.4. Large deviations

Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(\mathbb{E} X_1 = \mu\). Then by the Law of Large Numbers for \(S_N = X_1 + \ldots + X_N\) we have: \(S_N/N \to \mu\). Large Deviations refer to estimating the following probabilities for \(\varepsilon > 0\): \[ \mathbb{P}\left(\frac{S_N}{N} - \mu \ge \varepsilon\right)\ \mbox{and}\ \mathbb{P}\left(\frac{S_N}{N} - \mu \le -\varepsilon\right). \] This can be done by combining Markov’s inequality and moment generating functions. Instead of formulating general theory, we do a few examples.

Example 9.4.

Take \(X \sim \mathcal N(2, 1)\). Then \(\mathbb{E} X = 2\). Let us estimate \(\mathbb{P}(X \ge 3)\). Take \(t > 0\), a parameter to be determined later. Then \(\mathbb{P}(X \ge 3) = \mathbb{P}\left(e^{tX} \ge e^{3t}\right)\). By Markov’s inequality, \[ \mathbb{P}\left(e^{tX} \ge e^{3t}\right) \le \frac{\mathbb{E} e^{tX}}{e^{3t}}. \] But the moment generating function for \(X \sim \mathcal N(2, 1)\) is given by \(\mathbb{E} e^{tX} = e^{2t + t^2/2}\). Choose \(t\) to minimize \[ \frac{e^{2t + t^2/2}}{e^{3t}} = \exp\left(\frac12t^2 - t\right). \] To minimize \(t^2/2 - t\), take the derivative with respect to \(t\): \((t^2/2 - t)' = t-1 = 0\ \Rightarrow\ t = 1\). This gives us \[ \exp\left(\frac12t^2 - t\right) = e^{-0.5}. \] Therefore, \(\mathbb{P}(X \ge 3) \le \boxed{e^{-0.5}}\) Of course, in this simple case we might as well directly calculate this probability. But in more complicated settings, this direct calculation is impossible.

Example 9.5.

Take \(X \sim \mathcal N(2, 1)\), and estimate \(\mathbb{P}(X \le 1)\). For \(t > 0\), we have, by Markov’s inequality: \[ \mathbb{P}(X \le 1) = \mathbb{P}\left(e^{-tX} \ge e^{-t}\right) \le \frac{\mathbb{E} e^{-tX}}{e^{-t}} = \frac{e^{-2t + t^2/2}}{e^{-t}} = \exp\left(\frac{t^2}2 - t\right), \] and this problem is solved similarly: \(\mathbb{P}(X \le 1) \le \boxed{e^{-0.5}}\)

This method works particularly well for sums of independent random variables.

Example 9.6.

Let \(X_1, \ldots, X_{10} \sim \mathrm{Exp}(3)\) be i.i.d. random variables, and let \(S := X_1 + \ldots + X_{10}\). Then \(\mathbb{E} S = \frac{10}{3}\). Let us estimate \(\mathbb{P}(S \le 2)\). For \(t > 0\), \[ \mathbb{P}(S \le 2) = \mathbb{P}\left(e^{-tS} \ge e^{-2t}\right) \le \frac{\mathbb{E} e^{-tS}}{e^{-2t}}. \] Since \(X_1, \ldots, X_{10}\) are independent, \[ \mathbb{E} e^{-tS} = \mathbb{E}\left[e^{-tX_1 - \ldots - tX_{10}}\right] = \mathbb{E}\left[e^{-tX_1}\ldots e^{-tX_{10}}\right] \] \[ = \mathbb{E}\left[e^{-tX_1}\right]\cdot\ldots\cdot\mathbb{E}\left[e^{-tX_{10}}\right] = \left(\mathbb{E} e^{-tX_1}\right)^{10} = \left(\frac{3}{3+t}\right)^{10}. \] We used the expression for the moment generating function of the exponential random variable: \(\mathbb{E} e^{-tX_1} = 3/(3+t)\). Therefore, we need to minimize \[ F(t) = \frac{\left(\frac{3}{3+t}\right)^{10}}{e^{-2t}} = e^{2t}\left(\frac{3}{3+t}\right)^{10}. \] Take the logarithm: \[ \ln F(t) = 2t - 10\ln(3+t)\ \Rightarrow\ (\ln F(t))' = 2 - \frac{10}{3+t} = 0\ \Rightarrow\ t = 2. \] Therefore, the minimal value gives \(\mathbb{P}(S \le 2) \le F(2) = e^{2\cdot2}(3/5)^{10} \approx \boxed{0.33}\)
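
A quick check of this example: the optimized bound versus a Monte Carlo estimate of the true probability (a sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, k = 3.0, 10

# Optimized bound at t = 2: e^{2t} (lam / (lam + t))^10
t = 2.0
bound = np.exp(2 * t) * (lam / (lam + t)) ** k
print(bound)                              # about 0.33

# Monte Carlo estimate of P(S <= 2) for S = sum of 10 i.i.d. Exp(3)
S = rng.exponential(1 / lam, size=(1_000_000, k)).sum(axis=1)
print((S <= 2).mean())                    # noticeably smaller than the bound
```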

9.5. Chernoff’s inequality

This is a special case of large deviations above. Let \(X_1, \ldots, X_N\) be independent Bernoulli random variables with \(\mathbb{P}(X_i = 1) = p_i\), \(\mathbb{P}(X_i = 0) = 1 - p_i\). Then \(S = X_1 + \ldots + X_N\) has mean \(\mu = \mathbb{E}[S] = \mathbb{E}[X_1] + \ldots + \mathbb{E}[X_N] = p_1 + \ldots + p_N\). Fix a \(\delta \in (0, 1)\) and estimate \[ \mathbb{P}(S \ge \mu + \delta\mu). \] Take a parameter \(t > 0\): \[ \mathbb{P}(S \ge \mu + \delta\mu) \le \mathbb{P}\left(e^{tS} \ge e^{t\mu(1 + \delta)}\right) \le \frac{\mathbb{E}[e^{tS}]}{e^{t\mu(1 + \delta)}}. \] Since \(X_1, \ldots, X_N\) are independent, \[ \mathbb{E}[e^{tS}] = \mathbb{E}\left[e^{tX_1+\ldots + tX_N}\right] = \mathbb{E}e^{tX_1}\cdot\ldots\cdot\mathbb{E} e^{tX_N}. \] For each \(i\), we have: \(\mathbb{E} e^{tX_i} = p_ie^{t\cdot 1} + (1 - p_i)e^{t\cdot 0} = p_i e^t + (1 - p_i) = 1 + p_i(e^t - 1)\). Now comes the main step. Recall that \[ e^x = 1 + x + \frac{x^2}2 + \frac{x^3}6 + \ldots \ge 1 + x\ \mbox{for}\ x \ge 0. \] Apply this inequality to \(x = p_i(e^t - 1)\). Then \[ \mathbb{E} e^{tX_i} \le \exp\left(p_i(e^t - 1)\right). \] Substituting, we get: \[ \mathbb{E} e^{tS} \le \exp\left(p_1(e^t - 1)\right)\cdot\ldots\cdot\exp\left(p_N(e^t - 1)\right) = \exp\left((p_1 + \ldots + p_N)(e^t - 1)\right) = \exp(\mu(e^t - 1)). \] The right-hand side depends only on \(\mu\), regardless of individual \(p_i\). Plugging, we get: \[ \mathbb{P}(S \ge \mu + \delta\mu) \le \exp\left(\mu(e^t - 1 - t(1 + \delta))\right) = \exp(\mu F(t)),\ F(t) := e^t - 1 - (1 + \delta)t. \] Minimize \(F(t)\) with respect to \(t\): \(F'(t) = e^t - (1 + \delta) = 0\ \Rightarrow\ t = \ln(1 + \delta)\). The minimal value is \(F(\ln(1 + \delta)) = e^{\ln(1 + \delta)} - 1 - (1+\delta)\ln(1 + \delta) = \delta - (1 + \delta)\ln(1 + \delta)\). It is possible to show that this is less than \(-\delta^2/3\) for \(\delta \in (0, 1)\). Thus, \[ \boxed{ \mathbb{P}(S \ge \mu + \delta\mu) \le \exp\left(-\frac{\delta^2\mu}3\right) } \] Similarly, it is possible to show that \[ \boxed{ \mathbb{P}(S \le \mu - \delta\mu) \le \exp\left(-\frac{\delta^2\mu}2\right) } \]

Example 9.7.

\(S \sim \mathrm{Bin}(20, 0.4)\) can be represented as \(S = X_1 + \ldots + X_{20}\) with i.i.d. Bernoulli \(X_i\) with \(p_i = 0.4\). Then \(\mu = p_1 + \ldots + p_{20} = 0.4\cdot 20 = 8\). Let us estimate \(\mathbb{P}(S \ge 12)\): this corresponds to \(8(1 + \delta) = 12 \Rightarrow \delta = 0.5\), and by the upper-tail bound, \(\mathbb{P}(S \ge 12) \le \exp\left(-\frac{0.5^2\cdot 8}{3}\right) = \boxed{e^{-2/3}}\)

Example 9.8.

Take \(S_1 \sim \mathrm{Bin}(10, 0.4)\) and \(S_2 \sim \mathrm{Bin}(20, 0.6)\), independent, and let \(S = S_1 + S_2\). Then \(S_1 = X_1 + \ldots + X_{10}\) with i.i.d. Bernoulli \(X_1, \ldots, X_{10}\) with \(p_1 = \ldots = p_{10} = 0.4\), and \(S_2 = X_{11} + \ldots + X_{30}\) with i.i.d. Bernoulli \(X_{11}, \ldots, X_{30}\) with \(p_{11} = \ldots = p_{30} = 0.6\). Then \(\mathbb{E} S = \mu = p_1 + \ldots + p_{30} = 10\cdot 0.4 + 20\cdot 0.6 = 16\). Therefore, \(\mathbb{P}(S \le 10) = \mathbb{P}(S \le \mu(1 - \delta))\) for \(\delta = 3/8\), and it is estimated by \(\exp(-0.5\cdot(3/8)^2\cdot 16) \approx \boxed{0.32}\)

10. Central Limit Theorem

10.1. Statement of the CLT

The LLN states: For \(X_1, X_2, \ldots\) i.i.d. with mean \(\mathbb E X_i = \mu\) and \(\mathrm{Var} X_i = \sigma^2\), if \(S_n = X_1 + \ldots + X_n\), then \(S_n/n \to \mu\) in probability. But how to estimate the speed of convergence? Note that \[ \mathbb E S_n = \mathbb E X_1 + \ldots + \mathbb E X_n = n\mu, \] and by independence, \[ \mathrm{Var}(S_n) = \mathrm{Var}(X_1) + \ldots + \mathrm{Var}(X_n) = n\sigma^2. \] Following is the statement of the CLT.

Theorem 10.1.

We have the following convergence in law, as \(n \to \infty\): \[ \frac{S_n - \mathbb E S_n}{\sqrt{\mathrm{Var} S_n}} = \frac{S_n - n\mu}{\sqrt{n}\sigma} \to \mathcal N(0, 1). \]

Since the normal distribution is continuous, \[ \mathbb P\left(a \le \frac{S_n - n\mu}{\sqrt{n}\sigma} \le b\right) \to \frac1{\sqrt{2\pi}}\int_a^be^{-z^2/2}\,\mathrm{d}z = \Phi(b) - \Phi(a), \] where \(\Phi\) is the CDF of \(\mathcal N(0, 1)\): \[ \Phi(z) = \frac1{\sqrt{2\pi}}\int_{-\infty}^ze^{-u^2/2}\,\mathrm{d}u. \]

Proof of Theorem 10.1.

Compute the MGF of \((S_n - n\mu)/(\sqrt{n}\sigma)\) and prove that, as \(n \to \infty\), it converges to \(e^{t^2/2}\), which is the MGF of \(\mathcal N(0, 1)\).

In practice, these integrals are found from the Z-table.

Example 10.1.

Suppose we have a plane with \(N = 200\) passengers. Each passenger, independently of the others, chooses chicken or pasta, with probability \(50\%\) each. How many chicken and pasta meals are needed to satisfy chicken-lovers with probability \(95\%\) and pasta-lovers with probability \(95\%\) (so that everybody will be satisfied with probability at least \(90\%\))?

If we prepare only \(100\) chicken and \(100\) pasta, we will likely run out of either chicken or pasta: it is very unlikely that exactly \(100\) people will choose chicken. The other extreme is to prepare \(200\) chicken and \(200\) pasta. This guarantees that every client will be pleased, but a lot of food (\(200\) meals) will be thrown away, which is not good, because airlines have tight budgets. Let us find a compromise. Let \(X_k = 1\) if the \(k\)th person chooses chicken, \(X_k = 0\) if he chooses pasta. Then \(S_N = X_1 + \ldots + X_N\) is the total number of chicken required. The random variable \(X_k\) is Bernoulli, with \(p = 1/2\), so \[ \mu = \frac12,\ \ \sigma^2 = \frac14,\ \ \sigma = \frac12; \] \[ \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{S_N - 100}{\sqrt{200}/2} \approx \mathcal N(0, 1). \] We can find \(x_{95\%} = 1.645\) such that \[ \Phi(x_{95\%}) = \frac1{\sqrt{2\pi}}\int_{-\infty}^{x_{95\%}}e^{-u^2/2}\, \mathrm{d}u = 95\%. \] Therefore, approximately \[ \mathbb{P}\left(\frac{S_N - 100}{\sqrt{200}/2} \le 1.645\right) \approx 95\%. \] With this high probability \(95\%\), \[ \frac{S_N - 100}{\sqrt{200}/2} \le 1.645\ \ \Leftrightarrow\ \ S_N \le 100 + \frac12\sqrt{200}\cdot 1.645 \approx 112. \] So we need only \(112\) chicken meals: \(100\) according to the mean, and \(12\) additional, as a buffer. The same applies to pasta. We will throw away \(12 + 12 = 24\) meals after the trip, which is not a lot. Or maybe just give them to pilots or flight attendants.

Example 10.2.

A research sample \(X_1, \ldots, X_n\) of size \(n = 100\) is drawn from a distribution with density \(f(x) = 3(1 - x)^2,\ 0 \le x \le 1\). Let \[ \overline{X} = \frac1n\sum_{k=1}^nX_k. \] Estimate \(\mathbb P(0.2 \le \overline{X} \le 0.3)\). Solution: This distribution has mean and variance (after computation) \[ \mu = \int_0^1xf(x)\,\mathrm{d}x = \frac14,\ \sigma^2 = \int_0^1x^2f(x)\,\mathrm{d}x - \mu^2 = \frac3{80}. \] Thus \(0.2 \le \overline{X} \le 0.3\) is equivalent to \(0.2n \le S_n \le 0.3n\), that is, \[ \frac{0.2n - n\mu}{\sqrt{n}\sigma} \le \frac{S_n - n\mu}{\sqrt{n}\sigma} \le \frac{0.3n - n\mu}{\sqrt{n}\sigma}. \] The lower limit is \(a = (20 - 25)/(10\sqrt{3/80}) \approx -2.58\) and the upper limit is \(b \approx 2.58\). By the CLT, the probability of this event is approximately \[ \Phi(2.58) - \Phi(-2.58) = 0.9951 - (1 - 0.9951) \approx 0.99. \]
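
Since the standardized limits are easy to get wrong, here is a numerical cross-check (a sketch: the standardization is recomputed, and the probability is also estimated by simulating the sample mean directly via the inverse CDF \(x = 1 - u^{1/3}\); the seed and sample sizes are arbitrary):

```python
import numpy as np
from statistics import NormalDist

n, mu, sigma2 = 100, 0.25, 3 / 80
sigma = np.sqrt(sigma2)

a = (0.2 * n - n * mu) / (np.sqrt(n) * sigma)   # about -2.58
b = (0.3 * n - n * mu) / (np.sqrt(n) * sigma)   # about +2.58
clt = NormalDist().cdf(b) - NormalDist().cdf(a)
print(a, b, clt)                                # roughly 0.99

# Monte Carlo: sample X with density 3(1-x)^2 via X = 1 - U^(1/3)
rng = np.random.default_rng(6)
U = rng.random(size=(100_000, n))
Xbar = (1 - U ** (1 / 3)).mean(axis=1)
print(np.mean((0.2 <= Xbar) & (Xbar <= 0.3)))   # close to the CLT value
```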

10.2. Normal and Poisson approximations

Suppose you have a large quantity \(N\) of independent events, each of which happens with probability \(p\). The number of these events is \(\mathrm{Bin}(N, p)\). We have studied two approximations:
  • Poisson, when \(\mathrm{Bin}(N, p) \to \mathrm{Poi}(\lambda)\) if \(p = \lambda/N\), as \(N \to \infty\).
  • Normal, when for \(X \sim \mathrm{Bin}(N, p)\), as \(N \to \infty\), for constant \(p\) (independent of \(N\)): \[ \frac{X - pN}{\sqrt{pqN}} \to \mathcal N(0, 1). \]

The first approximation is used when the events are rare, the second when they are not rare. In practice, if you have something like \(N = 1000\) and \(p = 0.1\%\), you should use Poisson, and if \(N = 1000\) but \(p = 10\%\), you should use Normal.

Example 10.3.

An accountant lists each of \(N = 100\) expenses with precision up to \(1\) dollar. That is, each rounding error is \(\mathrm{Uni}[-0.5, 0.5]\). Find the probability that the total error exceeds 5 in absolute value. Let \(X_k\) be the error in the \(k\)th expense. The total error is \(S = X_1 + \ldots + X_N\). We need \(|S| \ge 5\). After computations, we get: \(\mathbb E X_k = 0\) and \(\mathrm{Var} X_k = 1/12\). Thus \(\mathbb E S = 0\), \(\mathrm{Var} S = N/12\). By the Central Limit Theorem, \[ \mathbb{P}(S \ge 5) = \mathbb{P}\left(\frac{S - \mathbb E S}{\sqrt{\mathrm{Var} S}} \ge \frac{5 - 0}{\sqrt{100/12}}\right) \approx 1 - \Phi\left(\frac{5}{\sqrt{100/12}}\right) = 1 - \Phi(1.73) \approx 0.042. \] Similarly, \(\mathbb{P}(S \le -5) = \mathbb{P}(S \ge 5)\) because \(S\) is symmetric, so \(\mathbb{P}(|S| \ge 5) \approx 0.083\).
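
The numbers in this example can be finished off with the standard library (a sketch):

```python
from math import sqrt
from statistics import NormalDist

N = 100
sd_S = sqrt(N / 12)                  # standard deviation of the total error S
z = 5 / sd_S                         # about 1.73

p_one_side = 1 - NormalDist().cdf(z)
print(p_one_side)                    # about 0.042
print(2 * p_one_side)                # P(|S| >= 5), about 0.083
```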

10.3. Extensions of the Central Limit Theorem

The CLT is valid not only for i.i.d. (independent identically distributed) random variables. It is (sometimes) true if the variables are independent but not identically distributed, or if they have slight dependence. The search for conditions under which the CLT is valid is still an active research area. For the purposes of this class, you can assume it is true if \(\mathrm{Var} S_N\) is large. Here is how it is formulated in general form.

Theorem 10.3.

For independent (but not necessarily identically distributed) \(X_1, X_2, \ldots\), we have convergence in law: \[ \frac{S_N - \mathbb E[S_N]}{\sqrt{\mathrm{Var} S_N}} \to \mathcal N(0, 1),\ \ N \to \infty. \]

Example 10.4.

An insurance company has \(N = 10000\) clients. Half of them file a claim distributed as \(\mathrm{Exp}(1)\). The other half file a claim distributed as \(\mathrm{Exp}(2)\); all claims are independent. Then the total amount of claims is \[ S_N = X_1 + \ldots + X_{5000} + Y_1 + \ldots + Y_{5000}, \] where \(X_i \sim \mathrm{Exp}(1)\), \(Y_i \sim \mathrm{Exp}(2)\). Thus
\[ \mathbb{E} X_i = 1,\ \mathrm{Var} X_i = 1,\ \mathbb{E} Y_i = 1/2,\ \mathrm{Var} Y_i = 1/4; \] \[ \mathbb E S_N = 5000\cdot 1 + 5000\cdot (1/2) = 7500,\ \ \mathrm{Var} S_N = 5000\cdot 1 + 5000\cdot(1/4) = 6250. \] By the Central Limit Theorem, \[ \frac{S_N - 7500}{\sqrt{6250}} \approx \mathcal N(0, 1). \]