11. Discrete-Time Markov Chains with Applications

11.1. Definitions

Assume at first there was only one fast-food restaurant in a small town: McDonalds. Every day, customers went to this restaurant. But then another fast-food place, a Subway, opened there, and people started to switch between the two. On each day, \(20\%\) of the customers who went to McDonalds the day before switched to Subway. Conversely, \(10\%\) of the customers who went to Subway the day before switched to McDonalds.

On the \(0\)th day, before Subway opened, \(100\%\) of customers went to McDonalds and \(0\%\) to Subway. On the \(1\)st day, \(80\%\) of customers went to McDonalds and \(20\%\) went to Subway. On the \(2\)nd day, \(0.8\cdot 0.8 + 0.2\cdot 0.1 = 0.66 = 66\%\) of customers went to McDonalds and \(0.8\cdot0.2 + 0.2\cdot0.9 = 0.34 = 34\%\) of customers went to Subway. If at day \(n\) the share \(\mathrm{mc}_n\) of customers went to McDonalds and the share \(\mathrm{sub}_n\) went to Subway, then \[ \begin{cases} \mathrm{mc}_{n+1} = 0.8\mathrm{mc}_n + 0.1\mathrm{sub}_n,\\ \mathrm{sub}_{n+1} = 0.2\mathrm{mc}_n + 0.9\mathrm{sub}_n. \end{cases} \] We can write this as \[ \begin{bmatrix} \mathrm{mc}_{n+1} & \mathrm{sub}_{n+1} \end{bmatrix} = \begin{bmatrix} \mathrm{mc}_n & \mathrm{sub}_n\end{bmatrix}A,\ \ \mbox{where}\ \ A = \begin{bmatrix} 0.8 & 0.2\\ 0.1 & 0.9 \end{bmatrix} \] is called the transition matrix. Consider a sequence of random variables \[ X = (X_n)_{n \ge 0} = (X_0, X_1, X_2, \ldots) \] which evolves according to the following laws. Each random variable takes values \(M\) and \(S\). The initial distribution is given by \[ \mathbb{P}(X_0 = M) = 1,\ \mathbb{P}(X_0 = S) = 0. \] The distribution of \(X_{n+1}\) given \(X_n\) is given by \[ \begin{cases} \mathbb{P}(X_{n+1} = M\mid X_n = M) = 0.8,\\ \mathbb{P}(X_{n+1} = S\mid X_n = M) = 0.2,\\ \mathbb{P}(X_{n+1} = M\mid X_n = S) = 0.1,\\ \mathbb{P}(X_{n+1} = S\mid X_n = S) = 0.9. \end{cases} \] Then we have: \[ \mathbb{P}(X_{n} = M) = \mathrm{mc}_n,\ \mathbb{P}(X_n = S) = \mathrm{sub}_n. \] In other words, the vector \(x(n) = \begin{bmatrix}\mathrm{mc}_n & \mathrm{sub}_n\end{bmatrix}\) is the distribution of \(X_n\) at time \(n\). The random process \(X = (X_0, X_1, \ldots)\) is called a Markov chain.
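To make this concrete, here is a minimal numerical sketch (in Python with numpy; the code and tooling are our illustration, not part of the original notes) of the evolution \(x(n+1) = x(n)A\):

    import numpy as np

    # Transition matrix of the McDonalds/Subway chain; each row sums to 1.
    A = np.array([[0.8, 0.2],
                  [0.1, 0.9]])

    x = np.array([1.0, 0.0])   # day 0: everyone goes to McDonalds
    for n in range(1, 4):
        x = x @ A              # x(n) = x(n-1) A
        print(n, x)            # day 1: [0.8 0.2]; day 2: [0.66 0.34]; ...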

11.2. Stationary and limiting distributions

Does \(x(n)\) converge to something as \(n \to \infty\)? Actually, it does. To find this limit, let us find nonzero vectors \(v = \begin{bmatrix}v_1 & v_2\end{bmatrix}\) such that \(vA = \lambda v\) for some real number \(\lambda\). These \(\lambda\) and \(v\) are called eigenvalues and eigenvectors. Usually, it is formulated in terms of column vectors rather than row vectors, but here it is convenient for us to study row vectors, and it does not really make any substantial difference.

We can rewrite this as \(v(A - \lambda I_2) = 0\), where \(I_2\) is the \(2\times 2\) identity matrix. Since a nonzero vector multiplied by the matrix \(A - \lambda I_2\) gives the zero vector, this matrix must be singular: \(\det(A - \lambda I_2) = 0\). But \[ \det(A - \lambda I_2) = \begin{vmatrix} 0.8-\lambda & 0.2\\ 0.1 & 0.9-\lambda \end{vmatrix} = 0\ \ \Rightarrow\ \ (0.8-\lambda)(0.9-\lambda) - 0.1\cdot 0.2 = 0\ \Rightarrow\ \lambda = 1, 0.7 \] These are the eigenvalues. Let us find a vector corresponding to each of these eigenvalues. For \(\lambda = 1\), we have: \[ v = vA\ \Rightarrow\ \begin{cases} v_1 = 0.8v_1 + 0.1v_2\\ v_2 = 0.2v_1 + 0.9v_2 \end{cases} \ \Rightarrow\ \ 2v_1 = v_2\ \Rightarrow\ v = \begin{bmatrix}1 & 2\end{bmatrix} \] For \(\lambda = 0.7\), we have: \[ 0.7w = wA\ \Rightarrow\ \begin{cases} 0.7w_1 = 0.8w_1 + 0.1w_2\\ 0.7w_2 = 0.2w_1 + 0.9w_2 \end{cases} \ \Rightarrow\ w_1 + w_2 = 0\ \Rightarrow\ w = \begin{bmatrix}1 & -1\end{bmatrix} \] Every vector \(x(0)\) can be decomposed as a linear combination of these two vectors. For example, if \(x(0) = \begin{bmatrix}1 & 0 \end{bmatrix}\) (initially all customers went to McDonalds), then \[ x(0) = c_1v + c_2w. \] How do we find \(c_1\) and \(c_2\)? Solve the system of equations \[ \begin{cases} 1 = c_1 + c_2\\ 0 = 2c_1 - c_2 \end{cases} \ \ \Rightarrow\ \ c_1 = \frac13,\ \ c_2 = \frac23. \] Therefore, \(x(0) = \frac13v + \frac23w\), and \[ x(n) = x(0)A^n = \frac13v + \frac23\cdot 0.7^n\,w \to \frac13v = \begin{bmatrix}\frac13 & \frac23\end{bmatrix} =: p. \] This is the limiting distribution. It is also a stationary distribution: if \(x(0) = p\), then \(x(1) = p\), because \(p = pA\). If we start from this distribution, we remain in it forever. Any stationary distribution must be a solution to the equation \(p = pA\) with \(p_1 + p_2 = 1\) (because the total probability is \(1\)). Here, \(p\) is the unique such solution, that is, the unique stationary probability distribution.

Actually, \(p\) is the limit regardless of the initial distribution. Indeed, suppose that \(x(0)\) were different. Then this would change only \(c_1\) and \(c_2\): the matrix \(A\) would remain the same, with its eigenvalues and eigenvectors \(\lambda_1, \lambda_2, v, w\). So \(x(n) \to c_1v\). But \(x(n)\) is a probability distribution for each \(n\), so its components sum up to \(1\). The same must be true for \(c_1v\), because it is the limit of \(x(n)\). However, \(v = \begin{bmatrix}1 & 2\end{bmatrix}\), so we must have \(c_1 = 1/(1 + 2) = 1/3\) and \(c_1v = \begin{bmatrix}1/3 & 2/3\end{bmatrix} = p\).

You can see that in the long run (actually, within a few weeks, because \(0.7^{10}\) is already quite small) Subway will have \(2/3\) of the customers, and McDonalds will have only \(1/3\). In addition, in the long run each customer will spend approximately \(2/3\) of the days at Subway and \(1/3\) at McDonalds. Say, among the first \(1500\) days, approximately \(1000\) will be spent at Subway and \(500\) at McDonalds. This type of statement holds when a Markov chain has a unique stationary distribution.

Assume a person always buys coffee with the meal. The coffee at McDonalds costs \(\$1\), and at Subway it costs \(\$2\). Then the long-term average cost of coffee is given by \[ 1\cdot p_1 + 2 \cdot p_2 = \frac53. \] This is the approximate average daily cost of coffee during the first \(N\) days, when \(N\) is large.

Not all Markov chains have a unique stationary distribution. For example, if
\[ A = \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} \] then every distribution is stationary: \(x(1) = x(0)A = x(0)\), and so \(x(n) = x(0)\) for all \(n\). The limit is simply the initial distribution, which is itself stationary. This corresponds to the case where all customers are completely loyal: if they went to McDonalds yesterday, they go there again today, and the same for Subway.

11.3. General finite-state Markov chains

Such a chain has a state space \(\{1, \ldots, N\}\) of \(N\) elements; if \(p_{ij}\) is the probability of moving from state \(i\) to state \(j\), then \[ p_{i1} + \ldots + p_{iN} =1,\ i = 1, \ldots, N. \] We can write these transition probabilities in the form of an \(N\times N\) transition matrix \[ P = (p_{ij})_{i, j = 1,\ldots, N}. \] The Markov chain is a collection \(X = (X_n)_{n \ge 0} = (X_0, X_1, X_2, \ldots)\) of random variables \(X_n\), each of which takes values in the state space \(\{1, \ldots, N\}\). If \(X_n = i\), then \(X_{n+1} = j\) with probability \(p_{ij}\). We can write this as \[ \mathbb{P}(X_{n+1} = j\mid X_n = i) = p_{ij}. \] We denote by \(x(n)\) the distribution of \(X_n\): \[ x(n) = \begin{bmatrix} x_1(n) & \ldots & x_N(n)\end{bmatrix},\ x_i(n) = \mathbb{P}(X_n = i). \] These components \(x_1(n), \ldots, x_N(n)\) must satisfy \[ x_1(n) + \ldots + x_N(n) = 1. \] Also, we have the following matrix multiplication formula: \[ x(n+1) = x(n)P; \] indeed, by the law of total probability, \(x_j(n+1) = \sum_i x_i(n)p_{ij}\), which is exactly the \(j\)th component of \(x(n)P\). The difference between \(X_n\) and \(x(n)\) is that the \(X_n\) are random variables, while the \(x(n)\) are (deterministic) vectors.
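As an illustration (a sketch in Python with numpy; the function name simulate_chain is ours, not part of the notes), we can sample a path of a finite-state Markov chain and compare the empirical state frequencies with the limiting distribution computed above:

    import numpy as np

    def simulate_chain(P, x0, steps, seed=0):
        """Sample a path X_0, ..., X_steps; states are 0, ..., N-1.
        P[i, j] = P(X_{n+1} = j | X_n = i); x0 is the distribution of X_0."""
        rng = np.random.default_rng(seed)
        N = P.shape[0]
        path = [rng.choice(N, p=x0)]
        for _ in range(steps):
            path.append(rng.choice(N, p=P[path[-1]]))
        return np.array(path)

    # The McDonalds/Subway chain: long-run frequencies approach [1/3, 2/3].
    P = np.array([[0.8, 0.2], [0.1, 0.9]])
    path = simulate_chain(P, np.array([1.0, 0.0]), 100_000)
    print(np.bincount(path, minlength=2) / len(path))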

Every Markov chain (with a finite state space) has a stationary distribution (at least one, possibly more than one). But the distribution \(x(n)\) does not necessarily converge to a stationary distribution as \(n \to \infty\). However, if it converges, then each probability \(p_i\) in the limiting stationary distribution corresponds to the long-term proportion of time spent in the state \(i\).

11.4. Rate of convergence

Consider the transition matrix \[ A = \begin{bmatrix} 0 & 0.3 & 0.7\\ 0.2 & 0.8 & 0\\ 1 & 0 & 0 \end{bmatrix} \] It has a unique stationary distribution \(p = \begin{bmatrix}p_1 & p_2 & p_3\end{bmatrix}\), which can be found from \[ p = pA,\ \ p_1 + p_2 + p_3 = 1. \] Let us solve this system of equations: \[ p_1 = 0.2p_2 + p_3,\ \ p_2 = 0.3p_1 + 0.8p_2,\ \ p_3 = 0.7p_1. \] From the second equation we get: \[ 0.2p_2 = 0.3p_1\ \Rightarrow\ p_2 = 1.5p_1. \] And \(p_3 = 0.7p_1\), so \(1 = p_1 + p_2 + p_3 = (1.5 + 1 + 0.7)p_1 = 3.2p_1\), and \(p_1 = 1/3.2 = 5/16\), \(p_2 = 15/32\), \(p_3 = 7/32\). What is the rate of convergence of \(x(n) = x(0)A^n\), the distribution at the \(n\)th step, to this distribution? Let us find the eigenvalues and eigenvectors of \(A\). Eigenvalues: \[ \lambda_1 = 1,\ \ \lambda_2 = \frac{1}{10}\left(\sqrt{57} - 1\right),\ \ \lambda_3 = \frac1{10}\left(-\sqrt{57} - 1\right). \] We have: \(|\lambda_2| < |\lambda_3| < 1\). Let \(v_1, v_2, v_3\) be (row) eigenvectors corresponding to \(\lambda_1, \lambda_2, \lambda_3\): \[ v_1\lambda_1 = v_1A,\ \ v_2\lambda_2 = v_2A,\ \ v_3\lambda_3 = v_3A. \] We can take \(v_1 = p\) because \(\lambda_1 = 1\). Then for any initial distribution \(x(0)\) we can decompose \[ x(0) = c_1v_1 + c_2v_2 + c_3v_3 \] for some numbers \(c_1, c_2, c_3\), and, since \(v_kA^n = \lambda_k^nv_k\), \[ x(n) = x(0)A^n = c_1\lambda_1^nv_1 + c_2\lambda_2^nv_2 + c_3\lambda_3^nv_3 = c_1p + c_2\lambda_2^nv_2 + c_3\lambda_3^nv_3. \] As \(n \to \infty\), \(x(n) \to c_1p\). But we know that \(x(n) \to p\), so \(c_1 = 1\). Therefore, \[ x(n) - p = c_2\lambda_2^nv_2 + c_3\lambda_3^nv_3, \] \[ |x(n) - p| \le |c_2||\lambda_2|^n|v_2| + |c_3||\lambda_3|^n|v_3|. \] As \(n \to \infty\), \(|\lambda_2|^n\) converges to zero faster than \(|\lambda_3|^n\), because \(|\lambda_2| < |\lambda_3|\). Therefore, the whole expression \[ |c_2||\lambda_2|^n|v_2| + |c_3||\lambda_3|^n|v_3| \] converges to zero at the same rate as \(|c_3||\lambda_3|^n|v_3|\). We say that the rate of convergence is \(|\lambda_3|^n\) (the rest are just constants, independent of \(n\)).
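A numerical sketch of this rate (numpy; our illustration, not part of the notes): the error \(|x(n) - p|\) should shrink by roughly the factor \(|\lambda_3| \approx 0.855\) per step.

    import numpy as np

    A = np.array([[0.0, 0.3, 0.7],
                  [0.2, 0.8, 0.0],
                  [1.0, 0.0, 0.0]])
    p = np.array([5/16, 15/32, 7/32])

    # Moduli of the eigenvalues, sorted: approximately [1, 0.855, 0.655].
    print(np.sort(np.abs(np.linalg.eigvals(A)))[::-1])

    x, prev = np.array([1.0, 0.0, 0.0]), None
    for n in range(1, 21):
        x = x @ A
        err = np.linalg.norm(x - p)
        if prev is not None:
            print(n, err, err / prev)   # the ratio tends to about 0.855
        prev = err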

11.5. Recurrence and transience

Let \(f_i := \mathbb P(X\ \mbox{ever returns to the state}\ i\mid X_0 = i)\) be the probability that the Markov chain returns to the state \(i\) if it started from \(i\). The state \(i\) is called recurrent if \(f_i = 1\), and transient if \(f_i < 1\). If the state \(i\) is recurrent, then the process returns to \(i\); but then, starting afresh from there, it returns to \(i\) again, and again, etc. So the process returns to a recurrent state infinitely often. If \(N_i\) is the number of times the process returns to \(i\), then \(N_i = \infty\) almost surely, and \(\mathbb E[N_i] = \infty\).

For a transient state \(i\), with probability \(1 - f_i\) the process never returns there. With probability \(f_i\), it does return there. After this, with probability \(f_i(1 - f_i)\), it never returns there again. With probability \(f_i^2\), it returns there for a second time; with probability \(f_i^2(1 - f_i)\), it never returns after that, etc. So \[ \mathbb P(N_i = k) = f_i^k(1 - f_i),\ \ k = 0, 1, 2, \ldots \] Therefore, \(N_i\) has a geometric distribution with parameter \(1 - f_i\), and \(\mathbb E[N_i] = f_i/(1 - f_i) < \infty\). In summary: if we start from a state, we return there infinitely many times if and only if the state is recurrent; for a transient state, the number of returns is finite.

Example 11.1.

Consider the following Markov chain: \[ A = \begin{bmatrix} 0.3 & 0 & 0.7\\ 0.4 & 0 & 0.6\\ 0 & 0 & 1 \end{bmatrix} \] Starting from any state, you will eventually get to state \(3\) and stay there: states \(1\) and \(2\) are transient, and state \(3\) is recurrent. After removing the transient states (see the convention in Subsection 11.6 below), the remaining chain is irreducible and aperiodic; therefore, it is ergodic. Let \(p = \begin{bmatrix}p_1 & p_2 & p_3\end{bmatrix}\) be the stationary distribution. We can solve for the stationary distribution: \(p = pA\), and find \(p = \begin{bmatrix}0 & 0 & 1\end{bmatrix}\).

Example 11.2.

Consider the following Markov chain: \[ A = \begin{bmatrix} 0.5 & 0.5 & 0\\ 0.4 & 0.6 & 0\\ 0.4 & 0.4 & 0.2 \end{bmatrix} \] The stationary distribution \(p = \begin{bmatrix}p_1 & p_2 & p_3\end{bmatrix}\) is determined from the equation \(p = pA\) and the additional condition \(p_1 + p_2 + p_3 = 1\). But the state \(3\) is transient, therefore \(p_3 = 0\) (see the previous example), and so we have the system of equations \[ \begin{cases} p_1 = 0.5p_1 + 0.4p_2 + 0.4p_3\\ p_2 = 0.5p_1 + 0.6p_2 + 0.4p_3\\ p_3 = 0.2p_3 \end{cases} \ \Rightarrow\ 0.5p_1 = 0.4p_2. \] Because \(p_1 + p_2 + p_3 = 1\), we have: \(p_1 + p_2 = 1\). Therefore, \(p_1 = 4/9\) and \(p_2 = 5/9\).

11.6. Irreducibility

Take a Markov chain and ignore its transient states, because the Markov chain eventually leaves them forever. In other words, suppose all states are recurrent. Take two states \(i, j\). Suppose we can get from \(i\) to \(j\) with positive probability (in some number of steps): we write this as \(i \to j\). Then we can also get from \(j\) to \(i\) with positive probability: \(j \to i\). Otherwise, \(j\) would serve as a “sink”: with positive probability you leave \(i\) for \(j\) and never return to \(i\). Then the probability of returning to \(i\), starting from \(i\), would be less than \(1\), which means \(i\) is transient. So if \(i \to j\), then \(j \to i\); we write this as \(i \leftrightarrow j\) and call these states communicating.

We can split the state space into a few classes of communicating states. If there is more than one class, the chain is called reducible; if there is only one class, it is called irreducible. Each such class can be considered as a Markov chain in its own right. (Some textbooks use a different definition of irreducibility, without removal of transient states. However, we shall use the one given above, because it is related to stationary distributions, as shown below.)

If the chain is reducible, there is more than one stationary distribution. Indeed, suppose the Markov chain has a state space \(\{1, 2, 3, 4, 5\}\), where \(\{1, 3, 5\}\) form an irreducible class with stationary distribution \[ \pi_1 = \frac12,\ \pi_3 = \frac38,\ \pi_5 = \frac18, \] and \(\{2, 4\}\) form an irreducible class with stationary distribution \[ \pi_2 = \frac13,\ \pi_4 = \frac23. \] Then there are many stationary distributions for the whole chain: add the first distribution multiplied by \(p_1\) and the second distribution multiplied by \(p_2\), where \(p_1 + p_2 = 1\), \(p_1, p_2 \ge 0\). Therefore, \[ \pi = \begin{bmatrix}\frac12p_1 & \frac13p_2 & \frac38p_1 & \frac23p_2 & \frac18p_1\end{bmatrix} \] You can view it as follows: we toss a coin which has Heads with probability \(p_1\) and Tails with probability \(p_2\). If Heads, we move to the class \(\{1, 3, 5\}\) and start from distribution \(\begin{bmatrix}\pi_1 & \pi_3 & \pi_5\end{bmatrix}\), and forever remain in this distribution (because it is stationary). If Tails, we move to the class \(\{2, 4\}\) and start from the distribution \(\begin{bmatrix} \pi_2 & \pi_4\end{bmatrix}\), and forever remain in this distribution. For example, take \(p_1 = 3/4\) and \(p_2 = 1/4\); then \[ \pi = \begin{bmatrix}\frac38 & \frac1{12} & \frac9{32} & \frac16 & \frac3{32}\end{bmatrix} \] We might also take \(p_1 = 0\) and \(p_2 = 1\); then we only move in the class \(\{2, 4\}\), and \[ \pi = \begin{bmatrix}0 & \frac13 & 0 & \frac23 & 0\end{bmatrix} \] We showed that if a Markov chain is reducible, then there are multiple stationary distributions. (Because there exists a stationary distribution for each communicating class, and we can combine them as above in many ways.) We can also show the following theorem: If a Markov chain is irreducible, then there is only one stationary distribution.

11.7. Aperiodicity

Again, ignore transient states, and suppose the Markov chain is irreducible (or, if it is reducible, restrict attention to any class of communicating states). Start at some state \(i\) and consider the paths that return to \(i\); let \(\phi_1, \phi_2, \ldots\) be the numbers of steps of these different return paths. Take \(d\), the greatest common divisor of \(\phi_1, \phi_2, \ldots\). This is called the period of this Markov chain; it does not depend on \(i\). This means we can split the state space into \(d\) subclasses such that if you start in the \(1\)st subclass, at the next step you get to the \(2\)nd one, etc., up to the \(d\)th one, and then back to the first one. If \(d = 1\), the Markov chain is called aperiodic, and if \(d \ge 2\), it is called periodic. If there is a loop (a state from which you can return to itself at the next step), then some \(\phi_k = 1\), and \(d = 1\).

As mentioned before, an irreducible chain has a unique stationary distribution \(p\). However, if \(d \ge 2\) (the chain is periodic), then \(x(n)\) might not converge to this stationary distribution \(p\). That is, it is not necessarily true that \(x(n) \to p\) as \(n \to \infty\). (Some textbooks use another definition of aperiodicity, without removal of transient states. However, we shall remove transient states first, because the connection to stationary distributions becomes clearer.)

Example 11.3.

Consider a Markov chain with the state space \(\{1, 2\}\) and the transition matrix \[ A = \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix} \] This chain is \(2\)-periodic, and its stationary distribution \(p\) satisfies \(p = pA\), which implies \[ \begin{cases} p_1 = p_2\\ p_2 = p_1\end{cases}\ \Rightarrow\ p_1 = p_2 = \frac12. \] But if \(x(0) = \begin{bmatrix}1/3 & 2/3\end{bmatrix}\), then \(x(1) = x(0)A = \begin{bmatrix}2/3 & 1/3\end{bmatrix}\), \(x(2) = x(1)A = \begin{bmatrix}1/3 & 2/3\end{bmatrix}\), etc. So \[ x(n) = x(0)\ \ \mbox{for even}\ \ n,\ \ x(n) = x(1)\ \ \mbox{for odd}\ \ n. \] Therefore, \(x(n)\) does not have a limit. The distribution switches between \(x(0)\) and \(x(1)\).

If the Markov chain is irreducible and aperiodic, then \(x(n) \to p\) for \(n \to \infty\), regardless of the initial distribution \(x(0)\). Moreover, \(p_i > 0\) for each state \(i\). In this case, the Markov chain is called ergodic.

11.8. Summary

When we analyze a Markov chain \(X\), the following steps are required:

Identify all transient and all recurrent states: For every state, you should find whether it is transient or recurrent. Then remove all transient states. You get a certain Markov chain \(Y\), with all states being recurrent.

Is the resulting Markov chain \(Y\) irreducible? That is, can you get from any state to any state? If yes, the Markov chain is irreducible. If not, it splits into more than one communicating class \(C^1, \ldots, C^r\). Each such class is by itself an irreducible Markov chain, with all its states recurrent.

Each such communicating class has a unique stationary distribution. In fact, any irreducible Markov chain has a unique stationary distribution. If the class \(C^i\) has a stationary distribution \(\pi^i\), then we can weight these distributions (as explained earlier in this section) with coefficients \(p_1,\ldots, p_r\) which satisfy \[ p_1, \ldots, p_r \ge 0,\ p_1 + \ldots + p_r = 1, \] and get a stationary distribution \(\pi\) for the Markov chain \(Y\). If a Markov chain is reducible (that is, if it consists of more than one communicating class), then it has more than one (in fact, infinitely many) stationary distributions, because \(p_1, \ldots, p_r\) can be chosen in multiple ways.

Every stationary distribution of \(Y\) corresponds to a stationary distribution of \(X\): add zeroes in the places corresponding to the removed transient states. Every stationary distribution of \(X\) has zeroes in the components corresponding to the transient states.

The stationary distribution of an irreducible Markov chain \(Z = (Z_n)_{n \ge 0}\) has the meaning of long-term proportions of time spent in each state. Take a state \(i\), and let \(\pi_i\) be the corresponding component of the stationary distribution \(\pi\). Let \(N_i(n)\) be the number of steps \(k = 1, \ldots, n\) spent in \(i\) (that is, with \(Z_k = i\)). Then we have: \[ \frac{N_i(n)}n \to \pi_i\ \mbox{a.s. as}\ n \to \infty. \] In particular, if \(i\) is a transient state (of the original chain), then the Markov chain spends only finitely many steps in \(i\). Say it spends \(M_i\) steps there. Then \(N_i(n) = M_i\) for all large enough \(n\), so \(N_i(n)\) does not grow as \(n \to \infty\). Thus, \[ \frac{N_i(n)}n \to 0\ \mbox{a.s. as}\ n \to \infty. \] However, if \(i\) is a recurrent state, then \(\pi_i > 0\). In other words, the long-term proportion of steps spent at \(i\) is positive. During the first \(n\) steps, for large \(n\), the Markov chain spends approximately \(\pi_in\) steps in the state \(i\).

Finally, fix a communicating class; or, alternatively, assume the Markov chain is irreducible and does not contain any transient states. We can find its period \(d\) by fixing a state \(i\) and taking the gcd of the lengths of all paths returning from \(i\) to \(i\), as explained above. This number is the same for all states \(i\). Fix a state \(i\); it is in a subclass \(D_0\). All states \(j\) that you can reach from \(i\) in one step belong to the next subclass \(D_1\). All states that you can reach in one step from states in \(D_1\) form the subclass \(D_2\), etc. The whole Markov chain is now split into \(d\) subclasses \(D_0, \ldots, D_{d-1}\), and the dynamics is as follows: \[ D_0 \Rightarrow D_1 \Rightarrow D_2 \Rightarrow \ldots \Rightarrow D_{d-1} \Rightarrow D_0. \] If an irreducible Markov chain is aperiodic (\(d = 1\)), then for every initial distribution \(x(0)\), we have convergence to the stationary distribution: \[ x(n) \to \pi,\ n \to \infty. \] However, if this Markov chain is periodic (\(d \ge 2\)), then for some initial distributions \(x(0)\) there is no such convergence. For example, consider an initial distribution concentrated on \(D_0\). That is, \(X_0 \in D_0\): the initial state is among the states of the subclass \(D_0\). Then at the next step the process goes to the subclass \(D_1\): \(X_1 \in D_1\). Next, \[ X_2 \in D_2,\ X_3 \in D_3,\ \ldots, X_{d-1} \in D_{d-1},\ X_d \in D_0,\ X_{d+1} \in D_1, \ldots \] For each step \(n\), only some of the components of the distribution \(x(n)\) are positive: those that correspond to the current subclass. The others are zero. But for the stationary distribution \(\pi\), all components are positive. So we cannot have convergence \(x(n) \to \pi\) as \(n \to \infty\).

Every communicating class has its own period; these periods might be different.

The overall scheme is as follows:

  • For each state, find whether it is recurrent or transient. Remove all transient states.
  • Split the resulting Markov chain (with only recurrent states left) into communicating classes. If there is only one, the chain is irreducible; otherwise, it is reducible.
  • For each communicating class, find its period and split into subclasses. If the period is \(1\), this class is aperiodic; otherwise, it is periodic.
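A sketch of this scheme in code (Python with numpy; the reachability-based recurrence test and the cut-off in the period computation are our own implementation choices, assuming transition probabilities are exactly zero or positive):

    import numpy as np
    from math import gcd

    def classify(P):
        """Return (recurrent communicating classes, transient states)."""
        N = P.shape[0]
        R = P > 0                          # one-step reachability
        for k in range(N):                 # Warshall: transitive closure
            R = R | (R[:, [k]] & R[[k], :])
        # i is recurrent iff every state reachable from i leads back to i.
        rec = [i for i in range(N)
               if all(R[j, i] for j in range(N) if R[i, j])]
        classes = []
        for i in rec:
            cls = sorted({j for j in rec if R[i, j] and R[j, i]} | {i})
            if cls not in classes:
                classes.append(cls)
        return classes, [i for i in range(N) if i not in rec]

    def period(P, i, max_len=100):
        """gcd of lengths (up to max_len) of return paths from i to i."""
        d, Q = 0, np.eye(P.shape[0])
        for n in range(1, max_len + 1):
            Q = Q @ P
            if Q[i, i] > 0:
                d = gcd(d, n)
        return d

    # Example 11.2: states 1, 2 (indices 0, 1) are recurrent, 3 is transient.
    P = np.array([[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.4, 0.4, 0.2]])
    print(classify(P), period(P, 0))       # ([[0, 1]], [2]) and period 1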

11.9. Time spent in transient states

Assume we have a Markov chain with transition matrix \[ \begin{bmatrix} 0.6 & 0.3 & 0.1\\ 0 & 0.5 & 0.5\\ 0 & 0 & 1 \end{bmatrix} \] States \(1\) and \(2\) are transient. If we start from state \(1\), then every attempt to escape succeeds with probability \(0.4\) and fails with probability \(0.6\). And once the process leaves state \(1\), it never returns there. Therefore, the time \(T_1\) spent in \(1\) is distributed geometrically with parameter \(0.4\). In particular, \[ \mathbb P(T_1 = n) = 0.4\cdot 0.6^{n-1},\, n = 1, 2, \ldots;\quad \mathbb E[T_1] = \frac1{0.4} = 2.5. \] Similarly, if we start from state \(2\), then the time \(T_2\) spent in \(2\) is \(\mathrm{Geo}(0.5)\), and \(\mathbb{E}[T_2] = 1/0.5 = 2\).
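A simulation sketch (numpy; our illustration) confirming \(\mathbb E[T_1] = 2.5\):

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.6, 0.3, 0.1],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])

    def time_in_state_1():
        t, s = 1, 0                  # start in state 1 (index 0)
        while True:
            s = rng.choice(3, p=P[s])
            if s != 0:
                return t             # left state 1, never to return
            t += 1

    print(np.mean([time_in_state_1() for _ in range(100_000)]))  # about 2.5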

11.10. Comparison of hitting times

Consider the Markov chain with four states \(1, 2, 3, 4\) and transition matrix \[ \begin{bmatrix} 0.5 & 0.3 & 0.2 & 0\\ 0.3 & 0.4 & 0.2 & 0.1\\ * & * & * & *\\ * & * & * & * \end{bmatrix} \] (the rows for states \(3\) and \(4\) are irrelevant for this problem, hence the stars). Let \(p_x\) be the probability that, starting from \(x\), the process reaches \(3\) before it reaches \(4\). By definition, \[ p_3 = 1,\ \ p_4 = 0. \] \[ \begin{cases} p_1 = 0.5p_1 + 0.3p_2 + 0.2p_3;\\ p_2 = 0.3p_1 + 0.4p_2 + 0.2p_3 + 0.1p_4.\\ \end{cases} \] To derive the first equation: starting from \(1\), we can move back to \(1\) with probability \(0.5\), and then we start from \(1\) again; or we can move to \(2\) with probability \(0.3\), and then we start from \(2\); etc. Using \(p_3 = 1\) and \(p_4 = 0\), we can rewrite these equations as \[ 0.5p_1 = 0.3p_2 + 0.2,\ \ 0.6p_2 = 0.3p_1 + 0.2. \] Solving them gives \(p_2 = 16/21\) and \(p_1 = 6/7\).
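The final \(2\times 2\) system can also be solved numerically (numpy; a sketch verifying the hand computation):

    import numpy as np

    # 0.5 p1 - 0.3 p2 = 0.2  and  -0.3 p1 + 0.6 p2 = 0.2
    M = np.array([[0.5, -0.3],
                  [-0.3, 0.6]])
    b = np.array([0.2, 0.2])
    p1, p2 = np.linalg.solve(M, b)
    print(p1, p2)    # p1 = 6/7 ~ 0.857, p2 = 16/21 ~ 0.762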

11.11. Random walk on graphs

Consider a graph with finitely many vertices \(1, 2, \ldots, N\). A Markov chain starts from some vertex and at each step moves to one of the neighbors of the current vertex, chosen uniformly at random. We write \(i \leftrightarrow j\) if the vertices \(i\) and \(j\) are connected. Let \(\deg i\) be the degree of the vertex \(i\) (the number of neighbors of \(i\)). Therefore, the transition matrix is given by \[ A_{ij} = \begin{cases} (\deg i)^{-1},\ \ i \leftrightarrow j;\\ 0,\ \ \mbox{otherwise.} \end{cases} \] Consider the distribution \[ \pi_i := \frac{\deg i}{2M},\ i = 1, \ldots, N, \] where \(M\) is the total number of edges (so that \(\sum_i \deg i = 2M\)). Let us check that this is indeed a stationary distribution. For each vertex \(j\), we have: \[ (\pi A)_j = \sum\limits_i\pi_iA_{ij} = \sum\limits_{i\leftrightarrow j}\pi_iA_{ij} = \sum\limits_{i\leftrightarrow j}\frac{\deg i}{2M}\frac1{\deg i} = \frac1{2M}\sum\limits_{i \leftrightarrow j}1 = \frac{\deg j}{2M} = \pi_j. \]
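A quick numerical check of this computation (numpy; the small graph below is an arbitrary example of ours):

    import numpy as np

    # Adjacency matrix of a small undirected graph.
    Adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 1],
                    [1, 1, 0, 0],
                    [0, 1, 0, 0]])
    deg = Adj.sum(axis=1)
    M = Adj.sum() // 2                  # each edge is counted twice
    A = Adj / deg[:, None]              # A_ij = 1/deg(i) for neighbors j
    pi = deg / (2 * M)
    print(pi, np.allclose(pi @ A, pi))  # pi is stationary: True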

This Markov chain (or, rather, its analogue for directed graphs) is important for Google's PageRank search algorithm; see the section on simulation.

If the graph is connected, then the random walk on this graph is irreducible and has a unique stationary distribution. However, if the graph is disconnected, then the random walk is reducible and has more than one stationary distribution.

11.12. Application: a Bonus-Malus system

An insurance company classifies car drivers into three tiers: tier \(0\) consists of the best drivers, tier \(1\) consists of not-so-good drivers, and tier \(2\) consists of bad drivers. The following table shows next year's tier, depending on the current tier (rows) and on the number of accidents this year (columns):

Current tier   0 accidents   1 accident   2 accidents   \(\ge 3\) accidents
0              0             1            1             2
1              0             1            2             2
2              1             2            2             2

For example, if a driver from tier \(1\) had no accidents this year, this driver moves to tier \(0\) next year. Assume now that the number of accidents in a year is \(X \sim \mathrm{Poi}(2)\) for the first two tiers and \(Y \sim \mathrm{Poi}(3)\) for the worst tier. Then \[ \mathbb P(X = 0) = e^{-2},\ \mathbb P(X = 1) = \mathbb P(X = 2) = 2e^{-2},\ \mathbb P(X \ge 3) = 1 - 5e^{-2}, \] \[ \mathbb P(Y = 0) = e^{-3},\ \mathbb P(Y \ge 1) = 1 - e^{-3}. \] The transition matrix is \[ P = \begin{bmatrix} e^{-2} & 4e^{-2} & 1 - 5e^{-2}\\ e^{-2} & 2e^{-2} & 1 - 3e^{-2}\\ 0 & e^{-3} & 1 - e^{-3} \end{bmatrix} \] We can find the stationary distribution \(\begin{bmatrix} p_0 & p_1 & p_2\end{bmatrix}\) of this Markov chain by solving the system of equations \(pP = p\) together with \(p_0 + p_1 + p_2 = 1\). If \(r_k\) is the premium for tier \(k\), then the long-term average premium is \(p_0r_0 + p_1r_1 + p_2r_2\).
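A sketch of this computation (numpy; the premiums \(r_0, r_1, r_2\) below are hypothetical numbers of ours, only to illustrate the final step):

    import numpy as np

    e2, e3 = np.exp(-2.0), np.exp(-3.0)
    P = np.array([[e2, 4*e2, 1 - 5*e2],
                  [e2, 2*e2, 1 - 3*e2],
                  [0.0, e3, 1 - e3]])

    # Solve p P = p together with p_0 + p_1 + p_2 = 1:
    # stack the system (P^T - I) p = 0 with the normalization row.
    A_sys = np.vstack([P.T - np.eye(3), np.ones((1, 3))])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    p, *_ = np.linalg.lstsq(A_sys, b, rcond=None)
    print(p)                                 # stationary [p_0, p_1, p_2]

    r = np.array([400.0, 600.0, 800.0])      # hypothetical premiums
    print(p @ r)                             # long-term average premium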

12. Random Walk and the Binomial Model in Finance

12.1. Construction

Consider a sequence \(X_1, X_2, \ldots\) of i.i.d. (independent identically distributed) random variables with distribution \(\mathbb P(X_i = 1) = p\) and \(\mathbb P(X_i = -1) = q\), \(p + q = 1\). Take \[ S_0 := x,\ \mbox{and}\ S_{n} = S_{n-1} + X_n,\ n = 1, 2, \ldots \] so that \[ S_1 = x + X_1,\ S_2 = x + X_1 + X_2,\ S_3 = x + X_1 + X_2 + X_3, \ldots \] Each time, the particle moves either one step to the right (with probability \(p\)) or one step to the left (with probability \(q\)). Each step is independent of the past. An example of a trajectory starting from \(x = 0\) is: \[ (S_0 = 0,\ S_1 = 1,\ S_2 = 2,\ S_3 = 1,\ S_4 = 2,\ S_5 = 3, \ldots) \] This process \((S_0, S_1, \ldots) = (S_n)_{n \ge 0}\) is called a random walk. When \(p = q = 1/2\), it is called a simple or symmetric random walk.

Example 12.1.

Consider a random walk starting from \(S_0 = -1\) with \(p = 2/3\) and \(q = 1/3\). Find \(\mathbb E[S_{12}]\) and \(\mathrm{Var}[S_{12}]\). \[ \mathbb E[X_k] = 1\cdot p + (-1)\cdot q = \frac13, \] \[ \mathbb E[X_k^2] = 1^2\cdot p + (-1)^2\cdot q = 1,\ \mathrm{Var}(X_k) = \mathbb E[X_k^2] - (\mathbb E[X_k])^2 = 1 - \left(\frac13\right)^2 = \frac89. \] \[ \mathbb E[S_{12}] = \mathbb E\left[-1 + X_1 + \ldots + X_{12}\right] = -1 + 12\cdot \frac13 = \boxed{3} \] \[ \mathrm{Var}(S_{12}) = \mathrm{Var}(X_1) + \ldots + \mathrm{Var}(X_{12}) = 12\cdot \frac 89 = \boxed{\frac{32}3} \]

12.2. Transition probabilities

The transition probabilities are \(\mathbb P(S_n = z\mid S_m = y)\). Let us give an example of how to calculate them: take \(\mathbb P(S_{10} = 4\mid S_{5} = 3)\). First, the number of paths from \((5, 3)\) to \((10, 4)\): suppose such a path has \(a\) steps up and \(b\) steps down. The total number of steps is \(a + b = 5\), and \(a - b = 1\) (because the total ascent from \(3\) to \(4\) is \(1\)). So \(a = (5+1)/2 = 3\) and \(b = (5-1)/2 = 2\): the path makes \(3\) steps up and \(2\) steps down. There are five possible slots for steps; we must choose three of them for the upward steps, and the rest will be occupied by downward steps. The number of ways to choose is \(\binom{5}{3} = (5\cdot 4)/2 = 10\). Each such path has probability \(p^3q^2\), so the total probability is \[ \mathbb P(S_{10} = 4\mid S_{5} = 3) = 10p^3q^2. \]
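The same counting argument as a small function (plain Python; our own sketch):

    from math import comb

    def rw_transition(m, y, n, z, p):
        """P(S_n = z | S_m = y) for a random walk with up-probability p."""
        steps, rise = n - m, z - y
        if steps < 0 or abs(rise) > steps or (steps + rise) % 2 != 0:
            return 0.0
        a = (steps + rise) // 2          # number of up-steps
        return comb(steps, a) * p**a * (1 - p)**(steps - a)

    # The example above, say with p = 0.6:
    print(rw_transition(5, 3, 10, 4, 0.6))   # 10 * 0.6**3 * 0.4**2 = 0.3456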

12.3. Recurrence and transience

Recall the definition of recurrent and transient states. Let \(N_i\) be the number of times the process returns to the state \(i\). If \(N_i = \infty\) almost surely, so that \(\mathbb E[N_i] = \infty\), the state is recurrent. If \(N_i\) is finite (with a geometric distribution), and therefore \(\mathbb E[N_i] < \infty\), it is transient. Now, let us find which of the two cases holds for the random walk. Since all states are alike, let us take the state \(i = 0\). Let \(1_A\) be the indicator of the event \(A\): \[ 1_A = \begin{cases} 1,\ \mbox{if}\ A\ \mbox{happened};\\ 0,\ \mbox{if}\ A\ \mbox{did not happen}; \end{cases} \ \ \mbox{then}\ \ \mathbb E[1_A] = \mathbb P(A). \] The random walk can return to zero only in an even number \(n = 2k\) of steps. Therefore, \[ N_0 = \sum\limits_{n=1}^{\infty}1_{\{S_n = 0\}},\ \ \mathbb E[N_0] = \sum\limits_{n=1}^{\infty}\mathbb P(S_n = 0) = \sum\limits_{k=1}^{\infty}\mathbb P(S_{2k} = 0). \] Let us calculate \(\mathbb P(S_{2k} = 0)\). To get in \(2k\) steps from \(0\) to \(0\), the process has to make \(k\) steps up and \(k\) steps down. The number of ways to choose \(k\) steps upward is \(\binom{2k}{k}\). Each such path has probability \(p^kq^k\). Therefore, \[ \mathbb P(S_{2k} = 0) = \binom{2k}{k}p^kq^k. \] Now, we need to find when the series \[ \sum\limits_{k=1}^{\infty}\binom{2k}{k}(pq)^k \] converges or diverges. By the Stirling formula, we have: \(k! \sim \sqrt{2\pi k}(k/e)^{k}\). Therefore, \[ \binom{2k}{k} = \frac{(2k)!}{k!^2} \sim \frac{\sqrt{2\pi\cdot 2k}(2k/e)^{2k}}{\left(\sqrt{2\pi k}(k/e)^{k}\right)^2} = \frac{2^{2k}}{\sqrt{\pi k}}. \] The series \[ \sum\limits_{k=1}^{\infty}\frac{2^{2k}}{\sqrt{\pi k}}(pq)^k = \sum\limits_{k=1}^{\infty}\frac{(4pq)^k}{\sqrt{\pi k}} \] converges if \(4pq < 1\), by comparison with the geometric series: \[ \sum\limits_{k=1}^{\infty}\frac{(4pq)^k}{\sqrt{\pi k}} \le \sum\limits_{k=1}^{\infty}(4pq)^k < \infty. \] And it diverges if \(4pq = 1\), because \[ \sum\limits_{k=1}^{\infty}\frac1{\sqrt{\pi k}} = \infty. \] Now, if \(p = q = 1/2\), then \(4pq = 1\); otherwise \(4pq = 4p(1 - p) < 1\). Indeed, \[ 1 - 4p(1 - p) = 1 - 4p + 4p^2 = (1 - 2p)^2 \ge 0, \] and this equals zero only if \(p = 1/2\). Result: for the symmetric random walk, every state is recurrent; otherwise, every state is transient.

12.4. Gambler’s ruin

Consider a random walk \(S_0, S_1, S_2, \ldots\) with probability \(p\) of going up and probability \(q\) of going down. Suppose \(S_0 = n\). What is the probability \(q_n\) that this random walk hits \(N\) before hitting zero? We can restate it as follows: if a gambler wants to get \(N\) dollars, starting from \(n\) dollars, without going bankrupt before that, what is the probability that he can pull this off?

Let us deduce this for \(N = 3\), \(p = .25\), \(q = .75\). We will simultaneously deduce this for \(n = 0, 1, 2, 3\). At the first step, starting from \(n\), the random walk can either go up (with probability \(p\)) and reach \(n+1\), or go down (with probability \(q\)) and reach \(n - 1\). Therefore, \[ q_n = pq_{n+1} + qq_{n-1},\ \ n = 1, \ldots, N-1. \] Also, \(q_0 = 0\), because if the gambler had nothing in the beginning, he is already bankrupt. And \(q_N = 1\), because if the gambler started with \(N\) dollars, he has already achieved his goal. Plugging in \(n = 1\) and \(n = 2\), we get: \[ q_2 = .25\cdot q_3 + .75\cdot q_1 = .25 + .75\cdot q_1;\quad q_1 = .25\cdot q_2 + .75\cdot q_0 = .25\cdot q_2. \] Solving this system, we get: \(q_2 = .25 + .75\cdot .25\cdot q_2 = 1/4 + (3/16)q_2\), therefore \(q_2 = 4/13\), and \(q_1 = .25q_2 = 1/13\). In the general case, with \(\alpha = q/p\), we get: \[ q_k = \begin{cases} \frac{1 - \alpha^k}{1 - \alpha^N},\ \alpha \ne 1;\\ \frac{k}{N},\ \alpha = 1. \end{cases} \] If \(p > q\), then \(\alpha < 1\), and as \(N \to \infty\), we have: \[ q_k \to 1 - \alpha^k > 0. \] Therefore, \(1 - \alpha^k\) is the probability of never going bankrupt for a good gambler (one with \(\alpha < 1\), who is more likely to win than to lose at each step) playing against a bank with an infinite amount of money.
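The closed form as code (plain Python; a sketch reproducing the worked example):

    def ruin(n, N, p):
        """Probability of hitting N before 0, starting from n."""
        a = (1 - p) / p
        return n / N if a == 1 else (1 - a**n) / (1 - a**N)

    # N = 3, p = 0.25: q_1 = 1/13, q_2 = 4/13, as computed above.
    print(ruin(1, 3, 0.25), ruin(2, 3, 0.25))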

12.5. Reflection principle

Consider \((S_n)_{n \ge 0}\), a random walk. Let us find the number of paths from \(S_{2} = 1\) to \(S_{12} = 3\) that do not hit or cross the line \(y = -1\).

First, let us find the number of all paths from \(S_2 = 1\) to \(S_{12} = 3\), that is, from \((2, 1)\) to \((12, 3)\), regardless of whether they cross or hit this line. Let \(n\) and \(m\) be the numbers of up and down steps. Then the total number of steps is \(n+m = 12-2 = 10\), and the change in altitude is \(n-m = 3-1 = 2\). Therefore, \(n = 6\) and \(m = 4\). There are \(\binom{10}{6}\) ways to choose the \(6\) upward steps out of \(10\), so there are \(\binom{10}{6}\) such random walk paths. Next, let us find the number of these paths that do hit or cross the line \(y = -1\). Starting from the first hitting point, reflect such a path across this line. Then we get a new random walk path, which goes from \((2, 1)\) to \((12, -5)\) (the point symmetric to \((12, 3)\) with respect to the line \(y = -1\)). Every path from \((2, 1)\) to \((12, 3)\) which hits \(y = -1\) corresponds to a path from \((2, 1)\) to \((12, -5)\), and vice versa. Let us find the number of paths from \((2, 1)\) to \((12, -5)\). If such a path has \(n\) steps up and \(m\) steps down, then \(n + m = 12-2 = 10\) and \(n - m = (-5)-1 = -6\). Therefore, \(n = 2\) and \(m = 8\). The number of such paths is \(\binom{10}{8}\).

Thus, the number of paths from \((2, 1)\) to \((12, 3)\) which do not hit or cross the line \(y = -1\) is \(\binom{10}{6} - \binom{10}{8} = \frac{10\cdot9\cdot8\cdot7}{4!} - \frac{10\cdot9}{2} = 210 - 45 = 165\).

Another way to say that the path \((S_n)_{n \ge 0}\) of the random walk from \((2, 1)\) to \((12, 3)\) does not cross or hit \(y = -1\) is \[ S_{12} = 3,\ S_n \ge 0,\ n = 2, \ldots, 12. \] Assume we have a random walk with \(p = 0.6\) and \(q = 0.4\). Every such path has \(6\) steps up and \(4\) steps down (see above), and the probability of each such path is \(p^6q^4\). Therefore, the probability of this event is \[ \mathbb{P}\left(S_{12} = 3,\ S_n \ge 0,\ n = 2, \ldots, 12\mid S_2 = 1\right) = \boxed{165\cdot p^6q^4} \]

12.6. The ballot problem

Assume there were \(1000\) voters and two candidates, A and B, who got \(600\) and \(400\) votes, respectively. Therefore A won. We start counting ballots.

How many different paths of counting ballots are there, that is, random walks \((S_n)_{n \ge 0}\) from \((0, 0)\) to \((1000, 200)\)? Here, \(S_n\) is the number of ballots cast for A minus the number of ballots cast for B after the \(n\)th ballot has been counted. Therefore, \(S_0 = 0\) and \(S_{1000} = 600-400 = 200\). There have to be \(600\) ballots for A among the \(1000\) total ballots, and the number of ways to choose them is \(\binom{1000}{600}\).

How many such paths are there in which A is always ahead of B? The first counted ballot must be for A; otherwise B immediately comes ahead. So we need the number of paths from \((1, 1)\) to \((1000, 200)\) that do not hit the line \(y = 0\). Similarly to the previous subsection, first count all paths from \((1, 1)\) to \((1000, 200)\): each such path has \(n\) steps up and \(m\) steps down, with \(n+m = 1000-1 = 999\) and \(n - m = 200-1 = 199\). Therefore, \(n = 599\), \(m = 400\), and there are \(\binom{999}{400}\) such paths.

How many of these paths do cross or hit the line? Reflect such a path, starting from the first hitting point, across the line \(y = 0\). Then we get a random walk path from \((1, 1)\) to \((1000, -200)\). It has \(n\) steps up and \(m\) steps down, with \(n+m = 1000-1 = 999\) and \(n - m = -200-1 = -201\). Therefore, \(n = 399\) and \(m = 600\). There are \(\binom{999}{399}\) such paths. Therefore, the number of ways to count the ballots so that A is always ahead of B is: \[ \binom{999}{400} - \binom{999}{399} \] We can rewrite this (because \(1000! = 1000\cdot999!\), \(600! = 600\cdot599!\), and \(400! = 400\cdot399!\)) as \[ \frac{999!}{599!400!} - \frac{999!}{399!600!} = \frac{600}{1000}\frac{1000!}{600!400!} - \frac{400}{1000}\frac{1000!}{600!400!} = \frac{600-400}{1000}\binom{1000}{600}. \] The probability that A was always ahead of B is \[ \frac{600-400}{1000} = \boxed{\frac15} \]

12.7. The maximum of a random walk

Using the reflection principle, we can find the distribution of the maximum \[ M_n := \max(S_0, S_1, \ldots, S_n) \] of the symmetric random walk, with \(p = q = 0.5\). For example, let us find \(\mathbb{P}(M_{10} \ge 5)\). The event \(\{M_{10} \ge 5\}\) can happen in two ways: (a) \(S_{10} \ge 5\); (b) \(M_{10} \ge 5\) but \(S_{10} < 5\). By the reflection principle, the probability of the second event is equal to \(\mathbb{P}(S_{10} > 5)\).

Indeed, starting from the first hitting point of the line \(y = 5\), we reflect the trajectory of the random walk across this line. Thus, \[ \mathbb{P}(M_{10} \ge 5) = \mathbb{P}(S_{10} \ge 5) + \mathbb{P}(S_{10} > 5) = \mathbb{P}(S_{10} \ge 5) + \mathbb{P}(S_{10} \ge 6). \] Similarly, substituting \(6\) instead of \(5\) and \(7\) instead of \(6\), we get: \[ \mathbb{P}(M_{10} \ge 6) = \mathbb{P}(S_{10} \ge 6) + \mathbb{P}(S_{10} \ge 7). \] Subtracting one from the other, we get: \[ \mathbb{P}(M_{10} = 5) = \mathbb{P}(S_{10} = 5) + \mathbb{P}(S_{10} = 6). \] But \(S_{10}\) can take only even values, therefore the first term on the right-hand side is zero, and this equation takes the form \[ \mathbb{P}(M_{10} = 5) = \mathbb{P}(S_{10} = 6). \] This we can find as explained earlier: if a path with \(S_{10} = 6\) takes \(a\) steps up and \(b\) steps down, then \(a + b = 10\) and \(a - b = 6\), so \(a = 8\) and \(b = 2\). Each given path has probability \(2^{-10}\). Thus \[ \mathbb{P}(S_{10} = 6) = 2^{-10}\binom{10}{2} = \frac{10\cdot 9}{2\cdot 1024} \approx 0.0439. \]
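A simulation sketch (numpy; our illustration) to sanity-check \(\mathbb P(M_{10} = 5) = 45/1024 \approx 0.0439\):

    import numpy as np

    rng = np.random.default_rng(0)
    steps = rng.choice([-1, 1], size=(500_000, 10))
    S = np.cumsum(steps, axis=1)          # S_1, ..., S_10 for each path
    M = np.maximum(S.max(axis=1), 0)      # include S_0 = 0 in the maximum
    print(np.mean(M == 5))                # close to 45/1024 = 0.0439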

12.8. Financial modeling

Random walks are used for elementary modeling of stock price movements. Assume \(P_n\) is the price of a stock on day \(n\). We can model it as a geometric random walk: \(P_{n} = P_0Z_1\ldots Z_{n}\), where \(Z_1, \ldots, Z_n\) are i.i.d. positive random variables. Take a random walk \(S = (S_n)_{n \ge 0}\) starting from \(S_0 = 0\), with probabilities \(p\) and \(q\) of moving up and down. Then we let \[ P_n = P_0\exp\left(\sigma S_n + bn\right),\ n = 0, 1, 2, \ldots \] for some coefficients \(\sigma > 0\) and \(b\). The coefficient \(\sigma\) is called the volatility, since it is responsible for random fluctuations. The coefficient \(b\) is called the drift. We can alternatively write this as follows: \[ P_n = P_0\exp\left(\sigma \sum\limits_{k=1}^nX_k + bn\right) = P_0\exp\left(\sum\limits_{k=1}^nY_k\right), \] where \(Y_k = \sigma X_k + b\). All \(Y_1, Y_2, \ldots\) are i.i.d. and have distribution \[ \mathbb{P}(Y_k = b + \sigma) = p,\ \mathbb{P}(Y_k = b - \sigma) = q. \]

Example 12.2.

For \(p = q = 0.5\) (symmetric random walk), drift \(b = -2\), and volatility \(\sigma = 3\), find \(\mathbb E[P_{10}]\) if \(P_0 = 2\). We have: \[ \mathbb E[e^{Y_k}] = 0.5e^{-2 + 3} + 0.5e^{-2 - 3} = 0.5\left(e + e^{-5}\right) \approx 1.363. \] Therefore, \[ \mathbb E[P_{10}] = \mathbb E\left[P_0e^{Y_1}e^{Y_2}\ldots e^{Y_{10}}\right] = 2\mathbb E\left[e^{Y_1}\right]\cdot\ldots\cdot\mathbb E\left[e^{Y_{10}}\right] = 2\cdot 1.363^{10} \approx 44.1. \]

Example 12.3.

For \(p = 0.75\), \(q = 0.25\), \(b = 1\) and \(\sigma = 2\), given that \(P_0 = 0.5\), what are the possible values of \(P_3\), and what is the distribution of \(P_3\)? Each random variable \(Y_k\) takes values \(3\) and \(-1\) with probabilities \(p\) and \(q\), respectively. Therefore, the possible values of \(P_3\) are \[ P_3 = e^{3 + 3 + 3}P_0 = \frac{e^9}{2},\ \mbox{with probability}\ p^3; \] \[ P_3 = e^{3 + 3 - 1}P_0 = \frac{e^5}2,\ \mbox{with probability}\ 3p^2q; \] \[ P_3 = e^{3 -1-1}P_0 = \frac{e}2,\ \mbox{with probability}\ 3pq^2; \] \[ P_3 = e^{-1-1-1}P_0 = \frac{e^{-3}}2,\ \mbox{with probability}\ q^3. \]

12.9. Options and other derivatives

Assume the price of a stock is modeled by \((P_n)_{n \ge 0}\), as above. A European call option is the right to buy this stock at a given price \(K\), called the strike, at some future time \(T\), called the maturity. A European put option is the right to sell this stock at the price \(K\) at time \(T\). Let us find the values of these options at maturity.

If the stock price \(P_T\) at maturity is less than \(K\), then it makes no sense to exercise the European call option: we can just buy the stock on the open market. Therefore, the value of the call option is zero. However, the European put option is valuable, and its value is \(K - P_T\). For the case \(P_T > K\), the situation is reversed: the value of the European call option is \(P_T - K\), and the European put option has value \(0\). Thus, if we denote \(a_+ := \max(a, 0)\) for every real number \(a\), then the value of the European call option at maturity is \((P_T - K)_+\), and the value of the European put option is \((K - P_T)_+\).

An American call/put option can be exercised at any time up to maturity. Therefore, such an option is generally at least as valuable as the corresponding European option. We shall discuss valuation of American options later.

European options are particular cases of European derivatives, which are worth \(f(P_T)\) at maturity, where \(f(x)\) is a certain function and \(T\) is the maturity. For example, \(f(x) = 1(x \ge a)\) is the binary option: it pays \(1\) if the price at maturity \(P_T\) is at least \(a\), and pays nothing otherwise. A forward contract corresponds to \(f(x) = x - K\): the obligation to buy the stock at time \(T\) for the price \(K\). Such contracts are used by farmers and ranchers, who want to lock in good prices (to protect themselves from downside risk), and by international companies (airlines, oil, steel), which want to lock in an exchange rate (say, to be able to buy euros next year at a rate fixed now, rather than at the future exchange rate).

12.10. Hedging a derivative

Hedging means holding a certain portfolio of the stock and cash that replicates the derivative: exactly matches its value at maturity. Let us illustrate this using a one-step model: \(T = 1\), and \[ P_0 = 1,\quad P_1 = \begin{cases} 1.2\ \mbox{with probability}\ p = 40\%;\\ 0.6\ \mbox{with probability}\ q = 60\%. \end{cases} \] We sell a European call option with maturity \(T = 1\) and strike \(K = 0.9\). Assume \(v\) is its current fair value. Then we need to use this money \(v\) to buy \(s\) shares of stock and hold \(c\) in cash, and hedge the option. At time \(t = 0\), the stock price is \(1\); therefore, \[ v = s + c. \] Next, at time \(t = 1\), if the stock price is \(1.2\), then we need our portfolio (consisting of cash \(c\) and \(s\) shares) to have value \(1.2s + c\), equal to our obligation (which is the option value): \[ 1.2s + c = (1.2 - 0.9)_+ = 0.3. \] Similarly, if the stock price is \(0.6\), then we need: \[ 0.6s + c = (0.6 - 0.9)_+ = 0. \] Solving these two equations: \(s = 0.5\), \(c = -0.3\). Thus, we need to borrow \(0.3\) in cash and buy \(0.5\) shares of stock. The fair value is \(v = -0.3 + 0.5 = 0.2\). Note that this is not the expected value \(\mathbb E(P_1 - K)_+\), which is equal to \(0.3\cdot 40\% + 0\cdot 60\% = 0.12\). Rather, it is the expected value \(\mathbb E_0(P_1 - K)_+\) with respect to the risk-neutral probabilities \(p_0, q_0\), chosen so that \(\mathbb E_0P_1 = P_0\). Let us find them: \[ \begin{cases} p_0\cdot1.2 + q_0\cdot 0.6 = 1\\ p_0 + q_0 = 1 \end{cases} \Rightarrow p_0 = \frac23,\ q_0 = \frac13\ \Rightarrow\ \mathbb E_0(P_1 - K)_+ = \frac23\cdot0.3 + \frac13\cdot0 = 0.2. \] To see why \(v\) can be computed this way, multiply the first hedging equation by \(p_0\) and the second by \(q_0\), and sum them: the coefficient of \(s\) becomes \(p_0\cdot 1.2 + q_0\cdot 0.6 = 1\) (which is exactly the condition \(\mathbb E_0P_1 = P_0\)), so the sum reads \(s + c = p_0\cdot 0.3 + q_0\cdot 0\), that is, \(v = \mathbb E_0(P_1 - K)_+\). The same can be done with any other derivative with maturity \(T = 1\). Actually, the same result is true for any maturity.
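The one-step hedge as code (plain Python; the function name is ours):

    def hedge_one_step(P0, up, down, pay_up, pay_down):
        """Solve  up*s + c = pay_up  and  down*s + c = pay_down;
        the fair value of the derivative is v = P0*s + c."""
        s = (pay_up - pay_down) / (up - down)  # shares of stock to hold
        c = pay_up - up * s                    # cash (negative = borrowing)
        return P0 * s + c, s, c

    # The example above: strike 0.9, stock moves from 1 to 1.2 or 0.6.
    v, s, c = hedge_one_step(1.0, 1.2, 0.6, 0.3, 0.0)
    print(v, s, c)    # 0.2 0.5 -0.3: buy half a share, borrow 0.3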

This works if \(P_1\) takes only two values. If it takes three or more values, or has a continuous distribution (such as normal), then the system of hedging equations has three or more equations but only two variables, and in general it does not have a solution. Therefore, not all derivatives can be replicated. Such a market is called incomplete. There is then more than one risk-neutral probability, because there are two equations for them but more than two variables.

For more than one step (\(T \ge 2\)), we hedge at every step, starting from the end. Assume \(P_n = P_0Z_1\ldots Z_n\), where \(Z_i\) are i.i.d. with values \(2\) and \(0.5\) with some (irrelevant) probabilities. Let \(P_0 = 1\). Consider the European call option with maturity \(T = 2\) and strike \(K = .5\). We have three cases: \[ \begin{cases} Z_1 = Z_2 = 2:\, \mbox{then}\quad P_2 = 4,\, (P_2 - K)_+ = 3.5;\\ Z_1 = 2,\ Z_2 = 0.5\ \mbox{or}\ Z_1 = 0.5,\ Z_2 = 2:\, \mbox{then}\quad P_2 = 1,\, (P_2 - K)_+ = 0.5;\\ Z_1 = Z_2 = 0.5:\, \mbox{then}\quad P_2 = 0.25,\, (P_2 - K)_+ = 0.\\ \end{cases} \] Assume we are at \(P_1 = 0.5\); then hedge: \[ c + 0.5s = v,\, c + s = 0.5,\, c + 0.25s = 0\ \Rightarrow\ v = \frac{0.5}3 = \frac16. \] The risk-neutral probabilities in this case are \(1/3\) and \(2/3\), corresponding to the jumps up and down, because \[ 0.5 = \frac13\cdot1 + \frac23\cdot 0.25. \] Similarly, if \(P_1 = 2\), then \[ c + 2s = v,\, c + s = 0.5,\, c + 4s = 3.5\ \Rightarrow\ v = \frac13\cdot 3.5 + \frac23\cdot 0.5 = 1.5. \] Having the fair values corresponding to \(P_1 = 0.5\) (namely \(1/6\)) and \(P_1 = 2\) (namely \(1.5\)), we can hedge on the first step, with \(P_0 = 1\): \[ c + s = v,\, c + 0.5s = \frac16,\ c + 2s = 1.5\ \Rightarrow\ v = \frac13\cdot 1.5 + \frac23\cdot\frac16 = \frac{11}{18}. \] At every step, we take the expectation with respect to the risk-neutral probabilities, under which \(\mathbb E Z_1 = \mathbb E Z_2 = 1\), and therefore \(\mathbb E(P_2\mid P_1) = \mathbb E(P_1Z_2\mid P_1) = P_1\mathbb E Z_2 = P_1\), and \(\mathbb E(P_1\mid P_0) = \mathbb E P_1 = \mathbb E Z_1 = 1 = P_0\). We say that under this measure, the process \((P_n)_{n \ge 0}\) is a martingale (see Section 13).

There is a general result (see more details in courses on stochastic finance): if there exists a unique risk-neutral probability, then the market is complete: every European derivative is replicable, and its fair price is the expectation of its payoff with respect to this risk-neutral probability.

13. Discrete-Time Martingales

13.1. Definitions

A process \(X = (X_n)_{n \ge 0}\) is called a martingale if for each \(n = 0, 1, 2, \ldots\), we have: \[ \mathbb E(X_{n+1}\mid X_0, \ldots, X_n) = X_n. \] That is, the best prediction of the next value, given the entire history, is the current value. If we have \[ \mathbb E(X_{n+1}\mid X_0, \ldots, X_n) \ge X_n, \] then the process is called a submartingale. If we have: \[ \mathbb E(X_{n+1}\mid X_0, \ldots, X_n) \le X_n, \] then the process is called a supermartingale.

Example 13.1.

Take independent random variables \(Z_1, Z_2, \ldots\). Let \[ X_0 := 0,\ \ \mbox{and}\ \ X_n := Z_1 + \ldots + Z_n\ \ \mbox{for}\ \ n = 1, 2, \ldots \] Then \(X_{n+1} = X_n + Z_{n+1}\). Therefore, \[ \mathbb E(X_{n+1}\mid X_0, \ldots, X_n) = \mathbb E(X_n\mid X_0, \ldots, X_n) + \mathbb E(Z_{n+1}\mid X_0, \ldots, X_n) = X_n + \mathbb E Z_{n+1}, \] because \(Z_{n+1}\) is independent of \(X_0, \ldots, X_n\). Thus, \((X_n)_{n \ge 0}\) is a martingale if and only if \(\mathbb E Z_1 = \mathbb E Z_2 = \ldots = 0\). It is a submartingale if all \(\mathbb E Z_k \ge 0\), and a supermartingale if all \(\mathbb E Z_k \le 0\).

Example 13.2.

Again, take independent random variables \(Z_1, Z_2, \ldots\), and let \(X_n := Z_1 + \ldots + Z_n\) as in Example 13.1. Let \(Y_n := e^{X_n}\). Then \(Y_{n+1} = Y_ne^{Z_{n+1}}\). Therefore, \[ \mathbb E(Y_{n+1}\mid Y_0, \ldots, Y_n) = \mathbb E(Y_ne^{Z_{n+1}}\mid Y_0, \ldots, Y_n) = Y_n\mathbb E(e^{Z_{n+1}}\mid Y_0, \ldots, Y_n) = Y_n\mathbb E\left[e^{Z_{n+1}}\right], \] because \(Z_{n+1}\) is independent of \(Y_0, \ldots, Y_n\). Thus, \((Y_n)_{n \ge 0}\) is a martingale if and only if \(\mathbb E[e^{Z_1}] = \mathbb E[e^{Z_2}] = \ldots = 1\). It is a submartingale if all \(\mathbb E[e^{Z_k}] \ge 1\), and a supermartingale if all \(\mathbb E[e^{Z_k}] \le 1\).

For a submartingale \(X = (X_n)_{n \ge 0}\), taking expectations, we get: \(\mathbb E(X_{n+1}) \ge \mathbb E(X_n)\). Therefore, \[ \mathbb E X_0 \le \mathbb E X_1 \le \mathbb E X_2 \le \ldots \] For a martingale, we have the equality. Our goal now is to show this is true for random times \(\tau\) instead of fixed times \(n\).

13.2. Stopping times

A random variable \(\tau\) which takes values \(0, 1, 2, \ldots\) is called a stopping time if the event \(\{\tau = n\}\) depends only on \(X_0, \ldots, X_n\). If \(X_n\) is the price of your stock on day \(n\), then \(\tau\) can be the moment when you decide to sell your stock. But you can decide whether to do this right now (on day \(n\)) or not based only on the current information, that is, on \(X_0, \ldots, X_n\); not on \(X_{n+1}\), for example.

Example 13.3.

The moment \(\tau := \min\{n: X_n > 1\}\) is a stopping time, because \(\{\tau = n\}\) means that \(X_0 \le 1, X_1 \le 1, \ldots, X_{n-1} \le 1, X_n > 1\). So \(\{\tau = n\}\) depends only on \(X_0, \ldots, X_n\).

Example 13.4.

Similarly, \(\tau = \min\{n: a \le X_n \le b\}\) is a stopping time, for fixed \(a\) and \(b\). More generally, for every subset \(D \subseteq \mathbb R\), \(\tau = \min\{n: X_n \in D\}\) is a stopping time.

Example 13.5.

However, \(\tau = \max\{n: X_n > 1\}\) is not a stopping time. Indeed, consider the event \(\{\tau = 1\}\). It means that \(X_1 > 1\) and \(X_n \le 1\) for all \(n \ge 2\); in particular, it depends on the future values of the process.

13.3. Optional Stopping Theorem

Consider a stopping time \(\tau\) which is bounded from above by some constant \(N\): \(\tau \le N\). If \(X = (X_n)_{n \ge 0}\) is a martingale, then \[ \mathbb E[X_{\tau}] = \mathbb E[X_N] = \mathbb E[X_0]. \] Indeed, \(\tau\) can take values \(0, 1, \ldots, N-1, N\). Therefore, \[ \mathbb E[X_N] = \sum\limits_{n=0}^N\mathbb E\left[X_N1_{\{\tau = n\}}\right],\ \mathbb E[X_{\tau}] = \sum\limits_{n=0}^N\mathbb E\left[X_{\tau}1_{\{\tau = n\}}\right]. \] But \(\{\tau = n\}\) depends only on \(X_0, \ldots, X_n\), and \(\mathbb E\left[X_N\mid X_0, \ldots, X_n\right] = X_n\) (apply the martingale property repeatedly for the steps \(n+1, \ldots, N\)). Therefore \[ \mathbb E\left[X_N1_{\{\tau = n\}}\right] = \mathbb E\left[\mathbb E\left[X_N1_{\{\tau = n\}}\mid X_0, \ldots, X_n\right]\right] = \mathbb E\left[1_{\{\tau = n\}}\mathbb E\left[X_N\mid X_0, \ldots, X_n\right]\right] = \mathbb E\left[1_{\{\tau = n\}}X_n\right]. \] Summing these equations over \(n = 0, \ldots, N\), we complete the proof of the Optional Stopping Theorem.

For a submartingale, we have \(\mathbb E X_0 \le \mathbb E X_{\tau} \le \mathbb E X_N\), and for a supermartingale, these inequalities are reversed. Assume \(X\) is the price of a stock. If it is a martingale, then the optional stopping theorem means the following: Suppose you need to sell the stock by day \(N\). Then you cannot devise a strategy (a rule) which will make you more (expected) profit than if you simply sell the stock at the initial day. No matter how you observe and analyze the behavior of this stock, you will not gain extra profit. However, if the price is a submartingale, then you should hold it and sell it at the last day. If the price is a supermartingale, you should sell it immediately.

The condition that \(\tau\) is bounded is important. Consider the symmetric random walk from Section 12: \[ X_n = Z_1 + \ldots + Z_n,\ \mathbb P(Z_n = \pm 1) = \frac12. \] Since \(\mathbb E Z_n = 1\cdot \frac12 + (-1)\cdot \frac12 = 0\), the process \(X = (X_n)_{n \ge 0}\) is a martingale. Let \(\tau = \min\{n : X_n = 1\}\). As noted in Section 12, this stopping time is well defined, since this random walk will eventually hit level \(1\). But this hitting can happen very late; this stopping time \(\tau\) is unbounded. By construction, \(X_{\tau} = 1\), so \(\mathbb E X_{\tau} = 1 \ne 0 = \mathbb E X_0\). If you are an investor trying to sell the stock with price \(X_n\) (forgetting for a minute about the fact that stock prices cannot go negative), then your strategy \(\tau\) is to sell the stock when it hits level \(1\). But until then, your investment might go far into negative territory. You would need to borrow a potentially unlimited amount of money, and no lender will agree to this.

13.4. Jensen’s inequality

Recall that \(g : \mathbb R \to \mathbb R\) is a convex function if \[ g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda)g(y)\ \ \mbox{for all}\ x, y\ \mbox{and}\ \lambda \in [0, 1]. \] For example, \(g(x) = x\) and \(g(x) = x^2\) are convex functions, while \(g(x) = \sin x\) is not. Equivalently, if you connect any two points on the graph of \(g\) by a segment, then the segment lies above the graph. A twice differentiable function \(g\) is convex if and only if its second derivative is nonnegative: \(g''(x) \ge 0\) for all \(x\). Jensen’s inequality says that if \(g\) is a convex function, then \[ \mathbb E g(Z) \ge g(\mathbb E Z). \] Indeed, let \(m = \mathbb E Z\). Since \(g\) is convex, there exists a real number \(a\) such that for all real \(x\) we have: \(g(x) - g(m) \ge a(x - m)\). (The graph of \(g\) lies above a tangent line at the point \(x = m\).) Plugging in \(x = Z\), we have: \(g(Z) - g(m) \ge a(Z - m)\). Take expectations: \[ \mathbb E g(Z) - g(m) \ge a\mathbb E(Z - m) = 0. \] Therefore, \[ \mathbb E g(Z) \ge g(m) = g(\mathbb E Z). \] One example is the well-known fact that \(\mathbb E[Z^2] \ge (\mathbb E Z)^2\), that is, \(\mathrm{Var}\, Z = \mathbb E Z^2 - (\mathbb E Z)^2 \ge 0\). This immediately follows from Jensen’s inequality: just apply it with \(g(x) = x^2\).

Similarly, we can show Jensen’s inequality for conditional expectation instead of the unconditional: \[ \mathbb E\left[g(Z)\mid Y_1, \ldots, Y_n\right] \ge g\left(\mathbb E\left[Z\mid Y_1, \ldots, Y_n\right]\right). \]

13.5. Preservation of the martingale property

Take a martingale \(X = (X_n)_{n \ge 0}\) and apply a convex function \(g\) to it. By Jensen’s inequality with conditional expectations, we have: \[ \mathbb E\left[g(X_{n+1})\mid X_0, \ldots, X_n\right] \ge g\left(\mathbb E\left[X_{n+1}\mid X_0, \ldots, X_n\right]\right) = g(X_n). \] Therefore, \(g(X) = (g(X_n))_{n \ge 0}\) is a submartingale.

Let us apply this to option pricing. We already discussed European options and other European derivatives in Section 12. Recall that a call option is the right to buy a stock at a certain strike price \(K\). A European call option has maturity time \(N\), when you can exercise this option: demand to buy this stock at price \(K\). If the market price \(S_N\) of this stock at time \(N\) is less than \(K\), then you can just buy the stock at the market price and forget about your option. Then your option does not have value. However, if the market price \(S_N \ge K\), then you should exercise this option, and its value is \(S_N - K\). In general, the value of this option is \(\max(S_N - K, 0) = g(S_N)\), where \(g(x) = \max(x - K, 0)\).

An American call option differs from a European one in the following way: the former can be exercised at any time up to maturity \(N\), while the latter must be exercised at maturity. Therefore, let \(\tau\) be the time you decide to exercise your American call option, chosen to get the best expected value \(\mathbb E g\left(S_{\tau}\right)\). When is the best exercise time? \(\tau\) is a stopping time, since your decision whether to exercise at time \(n\) is based only on your observations up to time \(n\), not on future observations. But the function \(g\) is convex (draw the graph and check). Frequently, the stock price \(S\) is modeled by a martingale. Then \(g(S) = (g(S_n))_{n \ge 0}\) is a submartingale. By the optional stopping theorem, \[ \mathbb E g\left(S_{\tau}\right) \le \mathbb E g\left(S_N\right). \] Therefore, the best time to exercise your American call option is at maturity \(n = N\). In fact, American and European call options have the same value in this case: the additional freedom to choose the exercise time does not give you anything.

13.6. Doob’s inequalities

These are generalizations of Markov’s and Chebyshev’s inequalities. Take a nonnegative submartingale \(X = (X_n)_{n \ge 0}\) and a number \(\lambda > 0\). Then \[ \mathbb P\left(\max\limits_{k = 0, \ldots, n}X_k \ge \lambda\right) \le \frac{\mathbb E X_n}{\lambda}. \] Indeed, take the stopping time \(\tau := \min\{k = 0, \ldots, n: X_k \ge \lambda\}\), with \(\tau = n\) if \(X_0 < \lambda, \ldots, X_n < \lambda\). If the event \[ A = \left\{\max\limits_{k = 0, \ldots, n}X_k \ge \lambda\right\} \] has happened, then \(X_{\tau} \ge \lambda\), and \[ \lambda\mathbb{P}(A) \le \mathbb E\left(X_{\tau}1_A\right). \] But \(X\) is nonnegative, so \[ \mathbb E\left(X_{\tau}1_A\right) \le \mathbb E\left(X_{\tau}1_A\right) + \mathbb E\left(X_{\tau}1_{A^c}\right) = \mathbb E X_{\tau}. \] By the optional stopping theorem, \(\mathbb E X_{\tau} \le \mathbb E X_n\). Combining these estimates, we get: \[ \lambda\mathbb P(A) \le \mathbb E\left(X_{\tau}1_A\right) \le \mathbb E X_{\tau} \le \mathbb E X_n. \] It suffices to divide by \(\lambda\). A corollary: for a nonnegative submartingale or a martingale \(X = (X_n)_{n \ge 0}\), if \(p \ge 1\), then \[ \mathbb{P}\left(\max\limits_{k = 0, \ldots, n}|X_k| \ge \lambda\right) \le \frac{\mathbb{E}|X_n|^p}{\lambda^p}. \] Indeed, the function \(x \mapsto |x|^p\) is convex for \(p \ge 1\), so \((|X_n|^p)_{n \ge 0}\) is a nonnegative submartingale; since \(\max_k |X_k| \ge \lambda\) if and only if \(\max_k |X_k|^p \ge \lambda^p\), it remains to apply the first inequality with the level \(\lambda^p\).

Letting \(p = 2\) and recalling the very first example of a martingale, \(X_n = Z_1 + \ldots + Z_n\) for independent mean-zero \(Z_1, \ldots, Z_n\), we have the following Kolmogorov’s inequality: \[ \mathbb{P}\left(\max\left(|X_0|, \ldots, |X_n|\right) \ge \lambda\right) \le \frac1{\lambda^2}\mathbb E X_n^2 = \frac1{\lambda^2}\sum\limits_{k=1}^n\mathbb E Z_k^2. \]

Example 13.6.

The simple random walk \(S_n\) starting from \(S_0 = 0\) is a martingale. We estimate by Kolmogorov’s inequality: \[ \mathbb{P}\left[\max\limits_{0 \le n \le 10}|S_n| \ge 5\right] \le \frac{\mathbb{E} S^2_{10}}{5^2}. \] Since \(\mathbb E S_{10} = 0\), \(\mathbb E S^2_{10} = \mathrm{Var} S_{10} = \mathrm{Var}(X_1 + \ldots + X_{10}) = \mathrm{Var} X_1 + \ldots + \mathrm{Var} X_{10} = 1 + \ldots + 1 = 10\), where \(X_i = \pm 1\) with equal probabilities and are independent. Thus the right-hand side in Kolmogorov’s inequality is \(10/25 = \boxed{0.4}\)
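This bound is easy to check by simulation (a sketch, assuming NumPy; the true probability is in fact much smaller than the bound):

    import numpy as np

    rng = np.random.default_rng(1)
    steps = rng.choice([-1, 1], size=(10**5, 10))   # X_i = +1 or -1 with equal probabilities
    paths = np.cumsum(steps, axis=1)                # S_1, ..., S_10 in each trial

    prob = np.mean(np.abs(paths).max(axis=1) >= 5)  # P(max |S_n| >= 5 for n <= 10)
    print(prob, "<=", 0.4)                          # the Kolmogorov bound from above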

Example 13.7.

Now, consider \(M_n = e^{S_n - cn}\). We need to find \(c\) such that this is a martingale: \[ \mathbb E e^{X_1 - c} = 1\ \Rightarrow\ \frac12\left(e^1 + e^{-1}\right) = e^c\ \Rightarrow\ c = \ln\left(e^1 + e^{-1}\right) - \ln 2. \] Apply Doob’s martingale inequality with \(f(x) = x^3\). For \(x > 0\), this function is convex, since \(f''(x) = 6x > 0\); and only this matters, because \(M_n > 0\) always. Thus for \(\lambda > 0\) we have: \[ \mathbb P\left[\max\limits_{0 \le n \le 100}M_n \ge \lambda\right] \le \frac{\mathbb E\left[ M^3_{100}\right]}{\lambda^3}. \] Now we calculate the right-hand side: \[ M^3_{100} = e^{3S_{100}}e^{-300c} = e^{3X_1}e^{3X_2}\ldots e^{3X_{100}}e^{-300c}. \] All terms on the right-hand side of this equation are independent, so we can calculate the expectation: \[ \mathbb E\left[M^3_{100}\right] = e^{-300c}\left(\mathbb E\left[e^{3X_1}\right]\right)^{100} = e^{-300c}\left(\frac12\left[e^{3} + e^{-3}\right]\right)^{100}. \]

14. Poisson Process

14.1. Definition

Fix a positive real number \(\lambda\). A collection \(N = (N(t),\, t \ge 0)\) of random variables with values in \(\mathbb{Z}_+ := \{0, 1, 2, \ldots\}\) is called a Poisson process if: \(N(0) = 0\), and for every \(0 \le s < t\), \(N(t) - N(s)\) is distributed as a Poisson random variable \(\mathrm{Poi}(\lambda(t - s))\) and is independent of \(N(u),\, u \le s\). In particular, \[ N(t) \sim \mathrm{Poi}(\lambda t);\ \mathbb{P}(N(t) = k) = \frac{(\lambda t)^k}{k!}e^{-\lambda t},\ \ k = 0, 1, 2, \ldots \] This process is used, among other applications, to model the arrival of insurance claims and (in compound form, below) stock prices. Since a Poisson random variable with parameter \(\mu\) has mean and variance \(\mu\), we have: \(\mathbb{E} N(t) = \mathrm{Var} N(t) = \lambda t\).
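A minimal simulation sketch in Python (assuming NumPy is available), checking that \(\mathbb E N(t) = \mathrm{Var} N(t) = \lambda t\):

    import numpy as np

    rng = np.random.default_rng(2)
    lam, t = 3.0, 5.0

    samples = rng.poisson(lam * t, size=10**6)      # N(t) ~ Poi(lam * t)
    print(samples.mean(), samples.var())            # both should be close to lam * t = 15

    # One trajectory, built from Exp(lam) interarrival times (see the next subsection):
    jump_times = np.cumsum(rng.exponential(1 / lam, size=100))
    print(np.searchsorted(jump_times, t))           # N(t) = number of jumps by time t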

Example 14.1.

Take \(\lambda = 3\). What is the probability that \(N(5) = 3\), given \(N(2) = 1\), for the same Poisson process? \[ \mathbb{P}(N(5) = 3\mid N(2) = 1) = \mathbb{P}(N(5) - N(2) = 2\mid N(2) = 1) = \mathbb{P}(N(5) - N(2) = 2) \] \[ = \mathbb{P}(\mathrm{Poi}(3\cdot 3) = 2) = \frac{9^2}{2!}e^{-9} = \frac{81}{2}e^{-9}. \] \[ \mathbb{P}(N(5) = 3,\, N(2) = 1) = \mathbb{P}(N(5) = 3\mid N(2) = 1)\cdot \mathbb{P}(N(2) = 1) = \frac{81}2e^{-9}\cdot \frac{6^1}{1!}e^{-6} = \boxed{243e^{-15}} \]

Example 14.2.

Take \(\lambda = 3\). Then \(N(2) \sim \mathrm{Poi}(6)\), therefore \(\mathbb E N(2) = 6\) and \(\mathrm{Var} N(2) = 6\). Thus \[ \mathbb E[N^2(2)] = \mathrm{Var}(N(2)) + (\mathbb E N(2))^2 = 6 + 6^2 = \boxed{42} \] Also, by independence of \(N(2)\) and \(N(5) - N(2)\), we have: \[ \mathbb E\left[N(2)N(5)\right] = \mathbb E N^2(2) + \mathbb E \left[N(2)(N(5) - N(2))\right] = 42 + \mathbb E N(2)\cdot \mathbb E\left[N(5) - N(2)\right] = 42 + 2\lambda\cdot (5-2)\lambda = \boxed{96} \]

14.2. Jump times of Poisson process

Note that a Poisson random variable is nonnegative, and so for \(s \le t\), we have: \(N(t) - N(s) \ge 0\). Therefore, the process \(N\) is nondecreasing: \(N(t) \ge N(s)\). It starts from \(0\), then jumps to \(1\) after waiting an exponential time \(\tau_1 \sim \mathrm{Exp}(\lambda)\). Why? Because the probability that this process has not yet jumped by time \(t\) is equal to the probability that \(N(t) = 0\). Plugging \(k = 0\) into the formula above, we get: \(\mathbb{P}(N(t) = 0) = e^{-\lambda t}\). Therefore, \[ \mathbb{P}(\tau_1 > t) = e^{-\lambda t}. \] This means that \(\tau_1 \sim \mathrm{Exp}(\lambda)\). This random variable has cumulative distribution function and density \[ F(t) = \mathbb{P}(\tau_1 \le t) = 1 - e^{-\lambda t},\ p(t) = F'(t) = \lambda e^{-\lambda t},\ t > 0. \] This random variable has \(\mathbb{E}\tau_1 = \lambda^{-1}\) and \(\mathrm{Var}\tau_1 = \lambda^{-2}\).

After jumping to \(1\) at time \(\tau_1\), the process waits some random time and jumps from \(1\) to \(2\). If \(\tau_2\) is the time when it jumps from \(1\) to \(2\), then \(\tau_2 - \tau_1\) is the time it waits between its first and second jumps. It turns out that \(\tau_2 - \tau_1\) is independent of \(\tau_1\) and is also distributed as \(\mathrm{Exp}(\lambda)\). The reason is that when the process \(N\) jumps to \(1\), it ``forgets the past'' and behaves as if instead of \(1\) it were at \(0\) and nothing had happened yet. This follows from the property above: that \(N(t) - N(s)\) for \(t > s\) is independent of \(N(u)\) for \(u \le s\). This important property is called the Markov property, after the Russian mathematician Andrey Markov.

Similarly, if \(\tau_k\) is the time of jump from \(k-1\) to \(k\), then \(\tau_k - \tau_{k-1}, \tau_{k-1} - \tau_{k-2}, \ldots, \tau_2 - \tau_1, \tau_1\) are i.i.d. \(\mathrm{Exp}(\lambda)\) random variables. It is possible to show that this process does not jump, say, from \(0\) to \(2\): all its jumps have magnitude \(1\).

Let us now find the distribution of \(\tau_2\). This is a sum of two independent random variables \(\tau_1\) and \(\tau_2 - \tau_1\), both with density \(p_1(t) = \lambda e^{-\lambda t}\) for \(t > 0\). As mentioned in subsection 5.8, the density \(p_2(x)\) of \(\tau_2\) is then the convolution of these exponential densities: \[ p_2(x) = \int_0^xp_1(y)p_1(x - y)\,\mathrm{d} y = \lambda^2xe^{-\lambda x}. \] It was calculated in subsection 5.9. More generally, the density \(p_k\) of \(\tau_k\) is given by \[ p_k(x) = \frac{\lambda^k}{(k-1)!}x^{k-1}e^{-\lambda x},\ x > 0. \] This is called the Gamma distribution \(\Gamma(k, \lambda)\) with parameters \(k\) and \(\lambda\). It has expectation and variance (because of independence): \[ \mathbb E\tau_k = \mathbb E\tau_1 + \mathbb E(\tau_2 - \tau_1) + \ldots + \mathbb E(\tau_{k} - \tau_{k-1}) = k\lambda^{-1}, \] \[ \mathrm{Var}\tau_k = \mathrm{Var}\tau_1 + \mathrm{Var}(\tau_2 - \tau_1) + \ldots + \mathrm{Var}(\tau_{k} - \tau_{k-1}) = k\lambda^{-2}. \]

Example 14.3.

What is the probability that the Poisson process with \(\lambda = 3\), by time \(t = 4\), has jumped two or more times? \[ \mathbb{P}(\tau_2 \le 4) = \mathbb{P}(N(4) \ge 2) = 1 - \mathbb{P}(N(4) = 0) - \mathbb{P}(N(4) = 1). \] Since \(N(4) \sim \mathrm{Poi}(12)\), this becomes \[ 1 - \frac{12^0}{0!}e^{-12} - \frac{12^1}{1!}e^{-12} = \boxed{1 - 13e^{-12}} \]

14.3. Compound Poisson process

Now, we modify the Poisson process to let it jump not only by \(1\) upward, but in a more general way. Define a sequence of i.i.d. (independent identically distributed) random variables \(Z_1, Z_2, \ldots\), independent of \((N(t),\, t \ge 0)\). Then a compound Poisson process is defined as \[ X(t) = \sum\limits_{k=1}^{N(t)}Z_k. \] It starts from \(X(0) = 0\), then waits time \(\tau_1\) and jumps to \(X(\tau_1) = Z_1\). Next, it waits time \(\tau_2 - \tau_1\) and jumps to \(X(\tau_2) = Z_1 + Z_2\), then waits time \(\tau_3 - \tau_2\) and jumps to \(X(\tau_3) = Z_1 + Z_2 + Z_3\), etc.

This process also has the property that \(X(t) - X(s)\) is independent of \(X(u),\, u \le s\) (Markov property), but it is usually not distributed as a Poisson random variable. The distribution of \(X(t) - X(s)\) is the same as the distribution of \(X(t - s)\): they are sums of \(N(t) - N(s)\) and, respectively, \(N(t - s)\) i.i.d. random variables \(Z_1, Z_2, \ldots\), and \(N(t) - N(s)\) and \(N(t - s)\) have the same distribution.

To find the distribution of \(X(t)\), let us recall the theory of Section 8: random sums of random variables. The generating function \(\varphi_{N(t)}(s)\) of \(N(t) \sim \mathrm{Poi}(\lambda t)\) is given by \[ \varphi_{N(t)}(s) = \mathbb E s^{N(t)} = e^{\lambda t(s - 1)}. \] Let \(F_Z(y) := \mathbb E e^{yZ_i}\) be the moment generating function of each \(Z_i\); it does not depend on \(i\), since \(Z_1, Z_2, \ldots\) have the same distribution. Then by results of Section 8 (random sums of random variables), the moment generating function of \(X(t)\) is equal to \[ G_t(y) := \mathbb E e^{yX(t)} = \varphi_{N(t)}(F_Z(y)) = \exp\left(\lambda t(F_Z(y) - 1)\right). \] We can also find the expectation and variance, using the formulas from Section 8. Assume \(\mathbb{E} Z_k = \mu\), and \(\mathrm{Var} Z_k = \sigma^2\). Because \(\mathbb{E} N(t) = \mathrm{Var} N(t) = \lambda t\), \[ \mathbb{E} X(t) = \mathbb{E} N(t)\cdot\mathbb{E} Z_k = \lambda\mu t, \] \[ \mathrm{Var} X(t) = \mathbb E N(t)\cdot\mathrm{Var} Z_k + \mathrm{Var} N(t)\cdot(\mathbb E Z_k)^2 = \lambda(\sigma^2 + \mu^2)t. \]
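These formulas for the mean and variance are easy to verify by simulation; a sketch (assuming NumPy; the parameters and the normal jump distribution are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    lam, t, mu, sigma = 2.0, 4.0, 3.0, 1.5           # hypothetical parameters
    N = rng.poisson(lam * t, size=10**5)             # number of jumps by time t, per trial
    X = np.array([rng.normal(mu, sigma, n).sum() for n in N])   # X(t) = Z_1 + ... + Z_{N(t)}

    print(X.mean(), lam * mu * t)                    # E X(t) = lam * mu * t = 24
    print(X.var(), lam * (sigma**2 + mu**2) * t)     # Var X(t) = lam * (sigma^2 + mu^2) * t = 90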

14.4. Sampling from a Poisson process

One can consider the Poisson process \(N = (N(t),\, t \ge 0)\) with intensity \(\lambda\) as the process of arrival of customers with independent \(\mathrm{Exp}(\lambda)\) interarrival times. Now assume that each customer is good with probability \(p\), independently of other customers and of the Poisson process. Then the number of good customers that have arrived by time \(t\) is \[ M(t) = \sum\limits_{k=1}^{N(t)}Z_k,\ \ Z_k = \begin{cases}1,\ \mbox{if the}\ k\mbox{th customer is good};\\ 0,\ \mbox{else.}\end{cases} \] This is a compound Poisson process, and \(M(t)\) has moment generating function \[ G_t(y) = \mathbb{E} e^{yM(t)} = \exp\left(\lambda t(F_Z(y) - 1)\right),\ \ F_Z(y) := \mathbb{E} e^{Z_ky} = pe^y + (1 - p). \] Since \(F_Z(y) - 1 = p(e^y - 1)\), we get: \[ G_t(y) = \exp\left(\lambda pt(e^y - 1)\right), \] which is the moment generating function of a \(\mathrm{Poi}(\lambda pt)\) random variable. Therefore, \(M = (M(t),\, t \ge 0)\) is itself a Poisson process, but with intensity \(\lambda p\).

14.5. Sum of two Poisson processes

Take two independent Poisson processes \(N_1\) and \(N_2\) with intensities \(\lambda_1\) and \(\lambda_2\). Then \(N = N_1 + N_2\) is also a Poisson process with intensity \(\lambda = \lambda_1 + \lambda_2\). Why? First, \(N(0) = N_1(0) + N_2(0) = 0\). Next, for \(t > s\) we have: \[ N(t) - N(s) = (N_1(t) - N_1(s)) + (N_2(t) - N_2(s)). \] The expression above is independent of \(N(u) = N_1(u) + N_2(u)\) for \(u \le s\). Indeed, \(N_1(t) - N_1(s)\) is independent of \(N_1(u)\), because \(N_1\) is a Poisson process; and \(N_1(t) - N_1(s)\) is independent of \(N_2\), because these two Poisson processes are independent. Therefore, \(N_1(t) - N_1(s)\) is independent of \(N(u)\). The same holds for \(N_2(t) - N_2(s)\). Moreover, \(N(t) - N(s)\) is the sum of two independent Poisson random variables: \[ N_1(t) - N_1(s) \sim \mathrm{Poi}(\mu_1),\ \ N_2(t) - N_2(s) \sim \mathrm{Poi}(\mu_2), \] \[ \mu_1 = \lambda_1(t-s),\ \mu_2 = \lambda_2(t-s). \] Therefore, this sum is also a Poisson random variable with parameter \(\mu_1 + \mu_2 = (t-s)(\lambda_1 + \lambda_2)\). This completes the proof that \(N\) is a Poisson process with intensity \(\lambda = \lambda_1 + \lambda_2\).

14.6. Central Limit Theorem

We can apply the Central Limit Theorem to a sum of a large number of i.i.d. random variables, even if the number of summands is itself random. Then we get: for a compound Poisson process, if \(\mathbb E Z_k = \mu\) and \(\mathrm{Var} Z_k = \sigma^2\), \[ \frac{X(t) - \mathbb E X(t)}{\sqrt{\mathrm{Var} X(t)}} \Rightarrow \mathcal N(0, 1). \] We can use this normal approximation to find the approximate distribution of \(X(t)\) for large \(t\), when using the moment generating function becomes inconvenient.

Example 14.4.

An insurance company receives claims as a Poisson process with \(\lambda = 2\). Each claim has mean \(\mu = 3\) and variance \(\sigma^2 = 4\). What is the value at risk \(\mathrm{VaR}_{95\%}\) for two years (\(t = 2\)) at confidence level \(95\%\)? That is, how much capital does the company need to accumulate so that it can pay its obligations with probability \(95\%\)? From the above formulas for mean and variance, we have: \[ \mathbb{E} X(t) = \lambda\mu t = 12,\ \mathrm{Var} X(t) = \lambda(\mu^2 + \sigma^2)t = 52. \] Therefore, from the Central Limit Theorem we get: \[ \frac{X(2) - 12}{\sqrt{52}} \approx \mathcal N(0, 1). \] Looking at the table of the normal distribution, the quantile corresponding to \(95\%\) is \(x_{95\%} = 1.645\). Therefore, the following event happens with probability approximately \(95\%\): \[ \frac{X(2) - 12}{\sqrt{52}} \le x_{95\%}\ \Leftrightarrow\ X(2) \le 12 + \sqrt{52}\cdot 1.645 = \boxed{23.86} \]
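The same computation in code, with a Monte Carlo cross-check (a sketch; the gamma claim distribution below is a hypothetical choice that merely matches the mean \(3\) and variance \(4\)):

    import numpy as np

    rng = np.random.default_rng(4)
    lam, t, mu, var_z = 2.0, 2.0, 3.0, 4.0

    mean = lam * mu * t                              # E X(2) = 12
    var = lam * (mu**2 + var_z) * t                  # Var X(2) = 52
    print(mean + 1.645 * var**0.5)                   # normal approximation: about 23.86

    # Monte Carlo: gamma claims with shape k and scale theta, so mean = 3, variance = 4
    k, theta = mu**2 / var_z, var_z / mu
    N = rng.poisson(lam * t, size=10**5)
    X = np.array([rng.gamma(k, theta, n).sum() for n in N])
    print(np.quantile(X, 0.95))                      # should be near the normal answer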

14.7. Cramer-Lundberg model

Assume the initial capital of an insurance company is \(u\), the constant flow rate of premiums is \(c\), and the claims arrive according to the Poisson process \(N = (N(t),\, t \ge 0)\) and have i.i.d. sizes \(Z_1, Z_2, \ldots\) Then the capital of the company at time \(t\) is \[ X(t) = u + ct - \sum\limits_{k=1}^{N(t)}Z_k. \] The ruin time is the first time when \(X(t) < 0\); this can happen only when a claim arrives. The ruin time can be infinite, if ruin actually never happens.

Example 14.5.

Assume \(u = 1,\, c = 2,\, \lambda = 2,\, Z_k \sim \mathrm{Exp}(1.5)\). Then \[ X(3) = 7 - \sum\limits_{k=1}^{N(3)}Z_k. \] Since \(\mathbb E N(3) = \mathrm{Var} N(3) = 6\) and \(\mathbb E Z_k = 1/1.5\), \(\mathrm{Var} Z_k = (1/1.5)^2\), we can calculate the mean and variance: \[ \mathbb E X(3) = 7 - \mathbb E Z_k\cdot \mathbb E N(3) = 7 - \frac1{1.5}\cdot 2\cdot 3 = 3, \] \[ \mathrm{Var} X(3) = \mathbb E N(3)\cdot\mathrm{Var} Z_k + \mathrm{Var} N(3)\cdot(\mathbb E Z_k)^2 = 6\cdot\frac1{1.5^2} + 6\cdot\frac1{1.5^2} = 16/3. \]
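Since the capital grows between claims, ruin can only happen at a claim arrival, which makes this model easy to simulate; a sketch (the horizon \(50\) is an arbitrary truncation of "ever"):

    import numpy as np

    rng = np.random.default_rng(5)
    u, c, lam, rate = 1.0, 2.0, 2.0, 1.5             # parameters of Example 14.5
    horizon, trials = 50.0, 10**4

    ruined = 0
    for _ in range(trials):
        t, claims = 0.0, 0.0
        while True:
            t += rng.exponential(1 / lam)            # wait for the next claim
            if t > horizon:
                break                                # survived up to the horizon
            claims += rng.exponential(1 / rate)      # claim size Z_k ~ Exp(1.5)
            if u + c * t - claims < 0:               # capital just after this claim
                ruined += 1
                break
    print(ruined / trials)                           # estimated probability of ruin by time 50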

15. Continuous-Time Markov Chains

15.1. Properties of exponential random variables

Recall that \(\tau \sim \mathrm{Exp}(\lambda)\) is an exponential random variable with intensity \(\lambda\) if it has density \(\lambda e^{-\lambda x}\) for \(x > 0\). Then it has the property \(\mathbb{P}(\tau > t) = e^{-\lambda t}\). It is commonly used to model waiting time until something happens.

Memoryless property: for \(s, t \ge 0\), \[ \mathbb{P}(\tau > t + s\mid \tau > s) = \mathbb{P}(\tau > t). \] The meaning is that if we already waited time \(s\), the remaining wait time is distributed in the same way as if we had not waited at all. In other words, how much more we need to wait is independent of how long we already waited. (Among continuous distributions, the exponential is the only one with this property.) This property follows from \[ \mathbb{P}(\tau> t + s\mid \tau > s) = \frac{\mathbb{P}(\tau > t + s,\, \tau > s)}{\mathbb{P}(\tau > s)} = \frac{\mathbb{P}(\tau > t+s)}{\mathbb{P}(\tau > s)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = \mathbb{P}(\tau > t). \]

If we have two independent exponential random variables \[ X_1 \sim \mathrm{Exp}(\lambda_1),\ \mbox{and}\ X_2 \sim \mathrm{Exp}(\lambda_2), \] then \(\min(X_1, X_2) \sim \mathrm{Exp}(\lambda_1 + \lambda_2)\): \[ \mathbb{P}(\min(X_1, X_2) > t) = \mathbb{P}(X_1 > t,\, X_2 > t) = \mathbb{P}(X_1 > t)\mathbb{P}(X_2 > t) = e^{-\lambda_1t}e^{-\lambda_2t} = e^{-(\lambda_1 + \lambda_2)t}. \]

If we have two independent exponential random variables \(X_1 \sim \mathrm{Exp}(\lambda_1)\) and \(X_2 \sim \mathrm{Exp}(\lambda_2)\), then \[ \mathbb{P}(X_1 < X_2) = \frac{\lambda_1}{\lambda_1 + \lambda_2}. \] Indeed, we condition on the value of \(X_1\): \[ \mathbb{P}(X_1 < X_2) = \int_0^{\infty}\mathbb{P}(X_2 > x)\lambda_1e^{-\lambda_1x}\,\mathrm{d} x = \int_0^{\infty}e^{-\lambda_2x}\lambda_1e^{-\lambda_1x}\,\mathrm{d} x = \int_0^{\infty}\lambda_1e^{-(\lambda_1 + \lambda_2)x}\,\mathrm{d} x = \frac{\lambda_1}{\lambda_1 + \lambda_2}. \]

15.2. Definition and construction of continuous-time Markov chains

A continuous-time Markov chain on the state space \(\{1, 2, \ldots, N\}\) is a process \(X = (X(t),\, t \ge 0)\), where for every \(t \ge 0\) \(X(t)\) is a random variable which takes values \(1, \ldots, N\), such that the behavior of \(X(t)\) for \(t \ge s\) is determined only by the knowledge of \(X(s)\); we do not need additional knowledge of \(X(u)\) for \(u < s\). This is called the Markov property.

If the process stays at, for example, \(1\), then it stays there for an exponential amount of time \(\tau_1\) (with some intensity \(\lambda_1\)) before jumping to some other state. This is because of the Markov property: how much longer we need to wait at state \(1\) does not depend on how long we have waited so far. Therefore, every state \(i\) has an associated waiting time \(\mathrm{Exp}(\lambda_i)\): an exponential “clock”. When this clock rings, we have to leave \(i\). As we leave \(i\), we have a choice of going to other states \(j \ne i\), with the probability of going from \(i\) to \(j\) being \(p_{ij}\). The parameter \(\lambda_{ij} = p_{ij}\lambda_i\) is called the intensity of moving from \(i\) to \(j\). We usually write these in the form of the generating matrix, or simply a generator: \[ A = (a_{ij}),\ \ a_{ij} = \lambda_{ij}\ \mbox{for}\ i \ne j,\ \ a_{ii} = -\lambda_i. \] This matrix has the property that its off-diagonal elements are nonnegative, and the sum of each row is equal to zero.

We can construct this Markov chain alternatively: for each state \(i\) we can generate \(N-1\) independent exponential clocks \(\tau_{ij} \sim \mathrm{Exp}(\lambda_{ij}),\, j \ne i\). Whichever rings first (whichever is the minimal of them), we move to the corresponding state \(j\). This construction gives us the same result as above, because by the minimum property above we have: \[ \min\limits_{j \ne i}\tau_{ij} \sim \mathrm{Exp}\left(\sum\limits_{j \ne i}\lambda_{ij}\right) = \mathrm{Exp}(\lambda_i). \] By the comparison property above, the probability that the clock \(\tau_{ij}\) rings first among these clocks is \[ \mathbb{P}\left(\tau_{ij} = \min\limits_{k \ne i}\tau_{ik}\right) = \frac{\lambda_{ij}}{\sum_{k \ne i}\lambda_{ik}} = \frac{\lambda_{ij}}{\lambda_i} = p_{ij}. \] The parameter \(\lambda_i\) is called the intensity of exit from state \(i\). The parameter \(\lambda_{ij}\) is called the intensity of moving from \(i\) to \(j\).

Example 15.1.

State space \(\{1, 2, 3\}\), with generating matrix \[ A = \begin{bmatrix} -2.9 & 0.8 & 2.1\\ 2.6 & -4.1 & 1.5\\ 1.3 & 0 & -1.3 \end{bmatrix} \]

Generate two independent exponential clocks \(\tau_{12} \sim \mathrm{Exp}(0.8)\), \(\tau_{13} \sim \mathrm{Exp}(2.1)\). Then \(\tau_1 = \min(\tau_{12}, \tau_{13}) \sim \mathrm{Exp}(2.9)\), so we will have to wait \(\tau_1\) time until leaving \(1\). Upon leaving \(1\), we will go to \(2\) with probability \[ \mathbb{P}(\tau_{12} < \tau_{13}) = \frac{0.8}{0.8+2.1} = \frac{0.8}{2.9} \] and to \(3\) with probability \[ \mathbb{P}(\tau_{13} < \tau_{12}) = \frac{2.1}{0.8+2.1} = \frac{2.1}{2.9}. \] Similarly with exit from \(2\). Exiting from \(3\), we will always go to \(1\), because the intensity of moving from \(3\) to \(2\) is \(0\). We can include this in our framework by letting \(\tau_{32} \sim \mathrm{Exp}(0) = \infty\). (An exponential distribution \(\mathrm{Exp}(\lambda)\) with intensity \(\lambda = 0\) is concentrated at infinity.)
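This clock construction translates directly into a simulation; a minimal sketch in Python (assuming NumPy), with states \(1, 2, 3\) indexed as \(0, 1, 2\):

    import numpy as np

    rng = np.random.default_rng(6)
    A = np.array([[-2.9,  0.8,  2.1],
                  [ 2.6, -4.1,  1.5],
                  [ 1.3,  0.0, -1.3]])

    def simulate(A, start, horizon):
        # Jump-chain construction: wait Exp(lambda_i), then move to j with probability p_ij
        n, t, i, path = A.shape[0], 0.0, start, [(0.0, start)]
        while True:
            rate = -A[i, i]                          # lambda_i, the intensity of leaving i
            t += rng.exponential(1 / rate)
            if t > horizon:
                return path
            p = np.where(np.arange(n) == i, 0.0, A[i] / rate)   # p_ij = lambda_ij / lambda_i
            i = rng.choice(n, p=p)
            path.append((t, i))

    print(simulate(A, start=0, horizon=3.0))         # list of (jump time, new state)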

15.3. Transition matrix

For states \(i\) and \(j\), let \(p_{ij}(t) = \mathbb{P}(X(t) = j\mid X(0) = i)\) be the probability that the Markov chain, starting from state \(i\), will be at time \(t\) at state \(j\). These probabilities form the transition matrix: \(P(t) = (p_{ij}(t))_{i, j = 1, \ldots, N}\). Let the row vector in \(\mathbb{R}^N\) \[ x(t) = \begin{bmatrix}\mathbb{P}(X(t) = 1) & \mathbb{P}(X(t) = 2) & \ldots & \mathbb{P}(X(t) = N)\end{bmatrix} \] be the distribution of \(X(t)\). If we start from the initial distribution \(x(0) \in \mathbb{R}^N\), then the distribution \(x(t)\) can be found as \(x(t) = x(0)P(t)\). Indeed, assume you want to be at state \(j\) at time \(t\). The probability \(\mathbb{P}(X(t) = j)\) of this is the \(j\)th element of the vector \(x(t)\). Then you can achieve this by being at state \(i\) initially (which happens with probability \(x_i(0) = \mathbb{P}(X(0) = i)\)), and moving from state \(i\) to state \(j\) in time \(t\) (which happens with probability \(p_{ij}(t)\)). Summing over all \(i = 1, \ldots, N\), we have: \[ x_j(t) = \sum\limits_{i=1}^Np_{ij}(t)x_i(0), \] which is the same as above. We can similarly show the Chapman-Kolmogorov equations: \(P(t+s) = P(t)P(s)\) for \(t, s \ge 0\). Note that \(p_{ii}(0) = 1\) and \(p_{ij}(0) = 0\) for \(i \ne j\). Therefore, \(P(0) = I_N\), which is the \(N\times N\)-identity matrix.

15.4. Forward and backward Kolmogorov equations

Taking the derivatives entrywise, we get: \(P'(t) = (p'_{ij}(t))\). The following Kolmogorov equations hold: \[ P'(t) = P(t)A\ \ \mbox{(forward Kolmogorov equations)} \] \[ P'(t) = AP(t)\ \ \mbox{(backward Kolmogorov equations)} \] Indeed, how does the matrix \(P(t)\) change from \(t\) to \(t + \mathrm{d} t\)? In other words, how can the process get from state \(i\) to state \(j\) in time \(t + \mathrm{d} t\)?

Case 1. It can already be at state \(j\) by time \(t\). This happens with probability \(p_{ij}(t)\). And the probability that it will stay at state \(j\) from time \(t\) to time \(t + \mathrm{d}t\) is \(\mathbb{P}(\tau_j > \mathrm{d} t) = e^{-\lambda_j\mathrm{d}t} \approx 1 - \lambda_j\mathrm{d} t\). This gives us the term \(p_{ij}(t)(1 - \lambda_j\mathrm{d} t)\).

Case 2. It can be at a different state \(k \ne j\) by time \(t\). This happens with probability \(p_{ik}(t)\). And the probability that the process will jump from \(k\) to \(j\) during the time interval \([t, t + \mathrm{d} t]\) is \(\mathbb{P}(\tau_{kj} \le \mathrm{d} t) = 1 - e^{-\lambda_{kj}\,\mathrm{d} t} \approx \lambda_{kj}\,\mathrm{d} t\). This gives us the term \(p_{ik}(t)\lambda_{kj}\,\mathrm{d} t\). Summing these terms, we have: \[ p_{ij}(t + \mathrm{d} t) = p_{ij}(t)(1 - \lambda_j\mathrm{d}t) + \sum\limits_{k \ne j}p_{ik}(t)\lambda_{kj}\,\mathrm{d}t. \] Subtracting \(p_{ij}(t)\), dividing by \(\mathrm{d}t\), and letting \(\mathrm{d}t \to 0\), we get: \[ p'_{ij}(t) = \frac{p_{ij}(t + \mathrm{d} t) - p_{ij}(t)}{\mathrm{d} t} = -\lambda_jp_{ij}(t) + \sum\limits_{k \ne j}p_{ik}(t)\lambda_{kj} = \sum\limits_{k=1}^Np_{ik}(t)a_{kj}. \] This shows, by conditioning on \(X(t)\), that \(P'(t) = P(t)A\); this proves the forward equations. One can also show the backward equations by conditioning on \(X(\mathrm{d} t)\). Finally, in light of the above, we have: \(P'(0) = A\).

15.6. Stationary distribution

Take a distribution \(\pi\). Assume we start from there: \(x(0) = \pi\). If the Markov chain remains there for each time \(t \ge 0\), that is, \(x(t) = \pi\), then \(\pi\) is called a stationary distribution. We can rewrite this as \(\pi = \pi P(t)\). Take the derivative at \(t = 0\). The left-hand side is constant, so its derivative is the zero row vector. The right-hand side has derivative \(\pi A\). Therefore, \(0 = \pi A\). A continuous-time Markov chain on the state space \(\{1, \ldots, N\}\) always has a stationary distribution.

Example 15.2.

Take the generator \[ A = \begin{bmatrix} -2 & 2\\ 3 & -3 \end{bmatrix} \] Solve the system of equations for \(\pi = \begin{bmatrix}\pi_1 & \pi_2\end{bmatrix}\): \[ 0 = \pi A\ \Rightarrow\ -2\pi_1 + 3\pi_2 = 0\ \Rightarrow\ \pi_1 = \frac32\pi_2. \] But since this is a probability distribution, it should also satisfy \[ \pi_1 + \pi_2 = 1. \] Therefore, \[ \boxed{ \pi_1 = \frac35,\ \pi_2 = \frac25 } \]
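The same system can be solved numerically for any generator, by appending the normalization condition to \(\pi A = 0\); a sketch (assuming NumPy):

    import numpy as np

    A = np.array([[-2.0,  2.0],
                  [ 3.0, -3.0]])

    # pi A = 0 is the same as A^T pi^T = 0; append a row of ones for sum(pi) = 1
    M = np.vstack([A.T, np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi = np.linalg.lstsq(M, b, rcond=None)[0]
    print(pi)                                        # [0.6, 0.4], that is, [3/5, 2/5]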

15.7. Convergence

The vector \(x(t)\), which is the distribution of \(X(t)\), satisfies an ODE of its own: differentiating \(x(t) = x(0)P(t)\) and using the forward Kolmogorov equations, we have: \[ x'(t) = x(0)P'(t) = x(0)P(t)A = x(t)A. \] Let us solve it, given the initial distribution \(x(0)\). Assume that this Markov chain has a unique stationary distribution \(\pi\); in other words, it has only one eigenvector corresponding to the eigenvalue \(\lambda_1 = 0\). The other eigenvalues \(\lambda_2, \ldots, \lambda_N\) are negative (or are complex with negative real parts; the analysis is the same) and have eigenvectors \(u_2, \ldots, u_N\): \[ u_iA = \lambda_iu_i,\ \ i = 2, \ldots, N. \] Then the vectors \(\pi, u_2, \ldots, u_N\) form a basis of \(\mathbb{R}^N\). We can find coefficients \(c_1(t), \ldots, c_N(t)\) such that \[ x(t) = c_1(t)\pi + c_2(t)u_2 + \ldots + c_N(t)u_N. \] Let us derive the ODE for each \(c_i(t)\): \[ x'(t) = c_1'(t)\pi + c'_2(t)u_2 + \ldots + c'_N(t)u_N = c_1(t)\pi A + c_2(t)u_2 A + \ldots + c_N(t) u_NA = 0 + \lambda_2c_2(t)u_2 + \ldots + \lambda_Nc_N(t)u_N. \] Comparing the left- and right-hand sides, we have: \[ c_1'(t) = 0\ \Rightarrow \ c_1(t) = c_1(0), \] \[ c_i'(t) = \lambda_ic_i(t)\ \Rightarrow\ c_i(t) = c_i(0)e^{\lambda_it},\ i = 2, \ldots, N. \] We can find \(c_1(0), \ldots, c_N(0)\) by decomposing the initial distribution \(x(0) = c_1(0)\pi + c_2(0)u_2 + \ldots + c_N(0)u_N\) as a linear combination of the basis vectors. Then plug these initial conditions into the above solution. We have: \[ x(t) = c_1(0)\pi + c_2(0)e^{\lambda_2t}u_2 + \ldots + c_N(0)e^{\lambda_Nt}u_N. \] Because \(\lambda_2, \ldots, \lambda_N < 0\), we have: \(x(t) \to c_1(0)\pi\) as \(t \to \infty\). Actually, \(c_1(0) = 1\), because the components of each \(x(t)\), as well as of \(\pi\), sum to one. Moreover, we get: \[ |x(t) - \pi| \le |c_2(0)||u_2|e^{\lambda_2t} + \ldots + |c_N(0)||u_N|e^{\lambda_Nt}. \] Each summand converges to zero with rate \(e^{\lambda_it}\). The slowest convergence corresponds to the nonzero eigenvalue \(\lambda\) with the smallest absolute value (closest to zero); this gives the overall convergence rate \(e^{\lambda t}\).

The matrix \(P(t)\) can be found in the same way, because its \(i\)th row is, in fact, \(x(t)\) for \[ x(0) = \begin{bmatrix}0 & \ldots & 0 & 1 & 0 & \ldots & 0\end{bmatrix}, \] where the \(1\) is in the \(i\)th place. Indeed, \(p_{ij}(t)\) is the probability that \(X(t) = j\), given that \(X(0) = i\); and if \(X(0) = i\) with probability one, then \(X(0)\) has the distribution above.

Example 15.3.

A Markov chain on two states \(\{1, 2\}\), with \[ A = \begin{bmatrix} -2 & 2\\ 3 & -3 \end{bmatrix} \] Let the initial distribution be \(x(0) = \begin{bmatrix} 1 & 0 \end{bmatrix}\). Then \(\pi = \begin{bmatrix}\frac35 & \frac25\end{bmatrix}\) (found above). The eigenvalues of \(A\): \(\det(A -\lambda I_2) = 0\ \Rightarrow\ \lambda^2 + 5\lambda = 0\ \Rightarrow\ \lambda_{1, 2} = 0, -5\). An eigenvector corresponding to \(\lambda_1 = 0\) is \(\pi\). An eigenvector corresponding to \(\lambda_2 = -5\) can be found from \[ -5u = uA\ \Rightarrow\ \begin{cases} -5u_1 = -2u_1 + 3u_2\\ -5u_2 = 2u_1 - 3u_2 \end{cases} \ \Rightarrow\ 3u_1 + 3u_2 = 0\ \Rightarrow\ u = \begin{bmatrix}1 & -1\end{bmatrix} \] We can decompose \(x(0) = c_1\pi + c_2u\) by comparing componentwise: \[ \begin{cases} 1 = \frac35c_1 + c_2\\ 0 = \frac25c_1 - c_2 \end{cases} \ \Rightarrow\ c_1 = 1,\ c_2 = \frac25. \] Then \[ x(t) = \pi + \frac25e^{-5t}u\ \Leftrightarrow\ \begin{cases} x_1(t) = \frac35 + \frac25e^{-5t}\\ x_2(t) = \frac25 - \frac25e^{-5t} \end{cases} \] The rate of convergence is \(e^{-5t}\).
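This answer is easy to verify numerically: \(P(t) = e^{tA}\) solves the Kolmogorov equations with \(P(0) = I_N\), so \(x(t) = x(0)e^{tA}\). A sketch, assuming SciPy is available for the matrix exponential:

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[-2.0,  2.0],
                  [ 3.0, -3.0]])
    x0 = np.array([1.0, 0.0])

    for t in (0.1, 0.5, 2.0):
        numeric = x0 @ expm(t * A)                   # x(t) = x(0) P(t)
        exact = np.array([0.6 + 0.4 * np.exp(-5*t),
                          0.4 - 0.4 * np.exp(-5*t)])
        print(numeric, exact)                        # the two rows should agree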

15.8. Relation between discrete- and continuous-time Markov chains

Take a continuous-time Markov chain on three states \(1, 2, 3\), with generator \[ A = \begin{bmatrix} -2.9 & 0.8 & 2.1\\ 1.3 & -3.9 & 2.6\\ 3 & 0 & -3 \end{bmatrix} \] Let \(\tau_0 = 0 < \tau_1 < \tau_2 < \ldots\) be the jump times. For example, if the Markov chain starts from \(1\): \(X(0) = 1\), then \(\tau_1 \sim \mathrm{Exp}(2.9)\). Next, if \(X(\tau_1) = 2\) (that is, if \(X\) makes its first jump from \(1\) to \(2\)), then \(\tau_2 - \tau_1 \sim \mathrm{Exp}(3.9)\), etc. Let \[ Y_n := X(\tau_n),\, n = 0, 1, 2, \ldots \] Then \(Y = (Y_n)_{n \ge 0} = (Y_0, Y_1, Y_2, \ldots)\) is a discrete-time Markov chain, with transition matrix \[ P = \begin{bmatrix} 0 & \frac{0.8}{2.9} & \frac{2.1}{2.9}\\ \frac{1.3}{3.9} & 0 & \frac{2.6}{3.9}\\ 1 & 0 & 0 \end{bmatrix} \] Indeed, if the continuous-time Markov chain \(X\) jumps from \(1\), then it goes to \(2\) with probability \(0.8/2.9\), and to \(3\) with probability \(2.1/2.9\).

This discrete-time Markov chain \(Y\), by construction, cannot go in one step from a state \(i\) to the same state \(i\). It has to jump somewhere else (although it can return to the same state in more than one step).

When we switch from a continuous-time to this discrete-time Markov chain, we lose information about when jumps occurred. We only know from where to where the process has jumped.

We can also move backward, from a discrete-time to a continuous-time Markov chain. Assume we have a discrete-time Markov chain with transition matrix \[ P = \begin{bmatrix} 0 & 0.7 & 0.3\\ 0.4 & 0 & 0.6\\ 1 & 0 & 0 \end{bmatrix} \] It has to have zeros on the main diagonal, which corresponds to the property from the above remark. Choose any intensities \(\lambda_1, \lambda_2, \lambda_3 > 0\) of exiting from states \(1, 2, 3\); for example, \[ \lambda_1 = 2,\, \lambda_2 = 3,\,\lambda_3 = 4. \] We have complete freedom in choosing these intensities; by doing this, we restore the information lost when switching from continuous time to discrete time (see above). Then the corresponding continuous-time Markov chain will have generator \[ A = \begin{bmatrix} -2 & 2\cdot 0.7 & 2\cdot 0.3\\ 3\cdot 0.4 & -3 & 3\cdot 0.6\\ 4 & 0 & -4 \end{bmatrix} \] There is also a relation between stationary distributions of the discrete-time and the continuous-time Markov chains. Assume \[ \pi = \begin{bmatrix}\pi_1 & \pi_2 & \pi_3\end{bmatrix} \] is a stationary distribution for the discrete-time Markov chain. Then we have: \[ \pi = \pi P\ \Rightarrow\ \begin{cases} \pi_1 = 0.4\pi_2 + \pi_3\\ \pi_2 = 0.7\pi_1\\ \pi_3 = 0.3\pi_1 + 0.6\pi_2 \end{cases} \] Take \(\rho_i = \pi_i/\lambda_i,\, i = 1, 2, 3\), and \(\rho = \begin{bmatrix}\rho_1 & \rho_2 & \rho_3\end{bmatrix}\) Then \[ \begin{cases} -2\rho_1 + 3\cdot0.4\rho_2 + 4\rho_3 = -\pi_1 + 0.4\pi_2 + \pi_3 = 0\\ 2\cdot 0.7\rho_1 - 3\rho_2 = 0.7\pi_1 - \pi_2 = 0\\ 2\cdot 0.3\rho_1 + 3\cdot 0.6\rho_2 - 4\rho_3 = 0.3\pi_1 + 0.6\pi_2 - \pi_3 = 0 \end{cases} \ \Rightarrow\ \rho A = 0. \] This means \(\rho\), or, rather, \[ \rho' = \begin{bmatrix}\rho'_1 & \rho'_2 & \rho'_3\end{bmatrix},\ \ \rho'_i = \frac{\rho_i}{\rho_1 + \rho_2 + \rho_3},\ i = 1, 2, 3 \] (we divide each \(\rho_i\) by their sum to make them add up to \(1\)) is a stationary distribution for the continuous-time Markov chain. Conversely, if \(\rho\) is a stationary distribution for a continuous-time Markov chain, then \[ \pi = \begin{bmatrix} \lambda_1\rho_1 & \lambda_2\rho_2 & \lambda_3\rho_3\end{bmatrix} \] is a stationary distribution for the corresponding discrete-time Markov chain.

There is a heuristic explanation of the relation between \(\pi\) and \(\rho\). If the Markov chain is ergodic, then it converges to its stationary distribution as time goes to infinity. And the long-term share of time spent at state \(i\) is equal to \(\pi_i\) for the discrete-time Markov chain, or \(\rho'_i\) for the continuous-time Markov chain. Assume that \(\lambda_i\) is very large compared to the other \(\lambda_j\). Then, whenever the continuous-time Markov chain jumps to \(i\), it almost immediately jumps out of \(i\). The share of time spent at \(i\) is small. This corresponds to a small \(\rho_i = \pi_i/\lambda_i\), which comes from the large \(\lambda_i\).

15.9. Reducible and irreducible Markov chains

There is a well-developed theory of discrete-time Markov chains, see Section 11. We can apply the same theory to continuous-time Markov chains. A state \(i\) is called recurrent for a continuous-time Markov chain if it is recurrent for the corresponding discrete-time Markov chain; that is, if the probability of ever returning to \(i\), starting from \(i\), is \(1\). Otherwise, it is called transient. After removal of all transient states, if all remaining states are connected, then the Markov chain is called irreducible; otherwise, it is called reducible, and it splits into the corresponding communicating classes. An irreducible Markov chain (in both discrete and continuous time) has a unique stationary distribution, but a reducible one, with more than one recurrent class, has infinitely many stationary distributions.

Take a continuous-time Markov chain on \(\{1, 2, 3, 4\}\) with generating matrix \[ A = \begin{bmatrix} -2 & 0 & 2 & 0\\ 0 & -1 & 0 & 1\\ 3 & 0 & -3 & 0\\ 0 & 1 & 0 & -1 \end{bmatrix} \] Then this Markov chain splits into two parts: a Markov chain with states \(1\) and \(3\), with generator \[ A_1 = \begin{bmatrix} -2 & 2\\ 3 & -3 \end{bmatrix} \] and a Markov chain with states \(2\) and \(4\), with generator \[ A_2 = \begin{bmatrix} -1 & 1\\ 1 & -1 \end{bmatrix} \]

These two parts are not connected to each other. We can find a stationary distribution for the first part: \[ \begin{bmatrix}\pi_1 & \pi_3\end{bmatrix}A_1 = \begin{bmatrix}0 & 0\end{bmatrix}\ \Rightarrow\ -2\pi_1 + 3\pi_3 = 0\ \Rightarrow\ \pi_1 = \frac32\pi_3, \] and because \(\pi_1 + \pi_3 = 1\), we have: \(\pi_1 = \frac35,\ \pi_3 = \frac25\). Similarly, the second part has the following stationary distribution: \[ \begin{bmatrix}\pi_2 & \pi_4\end{bmatrix} = \begin{bmatrix}\frac12 & \frac12\end{bmatrix} \] Now, let us construct a stationary distribution for the whole Markov chain. Let \(p_1\) be the probability that we are in the part \(1-3\), and let \(p_2\) be the probability that we are in the part \(2-4\). Then \(p_1 + p_2 = 1\) and \(p_1, p_2 \ge 0\). Given that we are in \(1-3\), we are at \(1\) with probability \(3/5\) (conditional probability). Therefore, the unconditional probability of being in \(1\) is \((3/5)p_1\). Similarly, we can find it for other states: \[ \begin{bmatrix} \frac35p_1 & \frac12p_2 & \frac25p_1 & \frac12p_2\end{bmatrix} \] This is an infinite family of stationary distributions. Members of this family include: \[ \begin{bmatrix} \frac35 & 0 & \frac25 & 0\end{bmatrix} \ \ \mbox{for}\ \ p_1 = 1,\, p_2 = 0; \] \[ \begin{bmatrix} 0 & \frac12 & 0 & \frac12\end{bmatrix} \ \ \mbox{for}\ \ p_1 = 0,\, p_2 = 1; \] \[ \begin{bmatrix} \frac3{10} & \frac14 & \frac15 & \frac14\end{bmatrix} \ \ \mbox{for}\ \ p_1 = p_2 = \frac12. \] Another way to find this would be simply to solve the system \[ \pi A = \begin{bmatrix}0 & 0 & 0 & 0\end{bmatrix}\ \ \pi_1 + \pi_2 + \pi_3 + \pi_4 = 1. \] This would give us a free variable: \[ -2\pi_1 + 3\pi_3 = 0,\ -\pi_2 + \pi_4 = 0,\ \ \pi_1 + \pi_2 + \pi_3 + \pi_4 = 1. \] \[ \pi_1 = \frac32\pi_3,\ \pi_4 = \pi_3,\ \frac52\pi_3 + 2\pi_4 = 1\ \Rightarrow\ \pi_4 = \frac12 - \frac54\pi_3. \] \[ \begin{bmatrix}\pi_1 & \pi_2 & \pi_3 & \pi_4\end{bmatrix} = \begin{bmatrix}\frac32\pi_3 & \frac12 - \frac54\pi_3 & \pi_3 & \frac12 - \frac54\pi_3\end{bmatrix} \] \[ = \pi_3\begin{bmatrix}\frac32 & -\frac54 & 1 & -\frac54\end{bmatrix} + \begin{bmatrix}0 & \frac12 & 0 & \frac12\end{bmatrix} \] This is the same answer, but in a different form.

15.10. Applications

Consider a common infection model, with three states: S = susceptible, I = infected, R = recovered, and intensity \(2\) of moving from S to I, and intensity \(3\) from I to R; the state R is absorbing, and S, I are transient. The generator is \[ A = \begin{bmatrix} -2 & 2 & 0\\ 0 & -3 & 3\\ 0 & 0 & 0 \end{bmatrix} \] If a virus mutates and can threaten again after recovery, we can add, say, intensity \(4\) transition from R to S: \[ A = \begin{bmatrix} -2 & 2 & 0\\ 0 & -3 & 3\\ 4 & 0 & -4 \end{bmatrix} \]

16. Queueing Theory

16.1. M/M/1 queue

Assume we have a server (cashier) which serves customers, one by one. Each customer is served for a random time, while others in the queue wait. This random time is independent for each customer and is distributed as \(\mathrm{Exp}(\mu)\). Customers arrive (from the shop) with intensity \(\lambda\); that is, the interarrival time of each next customer is distributed as \(\mathrm{Exp}(\lambda)\), independently of other interarrival times, and of serving times. Let \(X(t)\) be the number of customers in the queue at time \(t\) (including the customer who is currently being served). Then \(X(t)\) can take values \(0, 1, 2, \ldots\) Actually, this is a continuous-time Markov chain on the state space \(\{0, 1, 2, \ldots\}\) with transition intensities \[ \lambda_{n, n+1} = \lambda,\ \lambda_{n, n-1} = \mu. \] It has generating matrix \[ A = \begin{bmatrix} -\lambda & \lambda & 0 & 0 & \ldots\\ \mu & -\lambda-\mu & \lambda & 0 & \ldots\\ 0 & \mu & -\lambda-\mu & \lambda & \ldots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \] Let us find its stationary distribution \[ \pi = \begin{bmatrix}\pi_0 & \pi_1 & \pi_2 & \ldots\end{bmatrix} \] It has to satisfy \(\pi A = \begin{bmatrix}0 & 0 & 0 & \ldots\end{bmatrix}\) Therefore, \[ -\lambda\pi_0 + \mu\pi_1 = 0, \] \[ \lambda\pi_{n-1} - (\lambda + \mu)\pi_n + \mu\pi_{n+1} = 0,\ n \ge 1. \] Let us try to find a solution to this system of equations. First, consider the case \(\lambda < \mu\): the arrival intensity is less than the service intensity. Try \(\pi_n = c\rho^n,\ n = 0, 1, 2, \ldots\) Then plugging into the equation, we get: \[ \lambda\cdot c\rho^{n-1} - (\lambda + \mu)\cdot c\rho^n + \mu\cdot c\rho^{n+1} = 0. \] Canceling \(c\rho^{n-1}\), we have: \(\lambda - (\lambda + \mu)\rho + \mu \rho^2 = 0\). Solving this quadratic equation, we get: \(\rho_1 = 1,\ \rho_2 = \lambda/\mu\). Therefore, we get the following solutions: \[ \pi_n = c_1\cdot 1^n = c_1,\ \mbox{and}\ \pi_n = c_2\left(\frac{\lambda}{\mu}\right)^n. \] Their sum is also a solution to the above difference equations; actually, it is the most general solution: \[ \pi_n = c_1 + c_2\left(\frac{\lambda}{\mu}\right)^n. \] Plug this into the boundary condition \(-\lambda\pi_0 + \mu\pi_1 = 0\) to find \(c_1\): \[ \pi_0 = c_1 + c_2,\ \pi_1 = c_1 + c_2\frac{\lambda}{\mu}\ \Rightarrow\ -\lambda(c_1 + c_2) + \mu\left(c_1 + c_2\frac{\lambda}{\mu}\right) = (\mu - \lambda)c_1 = 0. \] Since \(\lambda < \mu\), this forces \(c_1 = 0\). Therefore, we have: \[ \pi_n = c_2\left(\frac{\lambda}{\mu}\right)^n,\ \ n = 0, 1, 2, \ldots \] Next, because of \(\sum_{n=0}^{\infty}\pi_n = 1\), we have: \[ c_2\sum_{n=0}^{\infty}\left(\frac{\lambda}{\mu}\right)^n = 1\ \Rightarrow\ c_2\left(1 - \frac{\lambda}{\mu}\right)^{-1} = 1\ \Rightarrow\ c_2 = 1 - \frac{\lambda}{\mu}. \] \[ \boxed{ \pi_n = \left(1 - \frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^n,\ \ n = 0, 1, 2, \ldots } \] This is the distribution of \(N = Z - 1\), where \(Z\) has a geometric distribution \(\mathrm{Geo}(q)\) with success probability \(q = 1 - \lambda/\mu \in (0, 1)\). We know the mean and variance of the geometric distribution. Therefore, the distribution \(\pi\) has mean and variance \[ \mathbb E[N] = \mathbb E[Z] - 1 = \frac1{q} - 1,\ \mathrm{Var} N = \mathrm{Var} Z = \frac{1 - q}{q^2}. \]
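As a sanity check, a long simulation of the queue-length chain gives time-averaged occupation frequencies close to this stationary distribution; a sketch (with hypothetical \(\lambda = 1\), \(\mu = 2\)):

    import numpy as np

    rng = np.random.default_rng(7)
    lam, mu, t_end = 1.0, 2.0, 10**4
    rho = lam / mu

    t, n, occupation = 0.0, 0, {}
    while t < t_end:
        rate = lam if n == 0 else lam + mu           # total intensity of leaving state n
        dt = rng.exponential(1 / rate)
        occupation[n] = occupation.get(n, 0.0) + dt
        t += dt
        if n == 0 or rng.random() < lam / rate:
            n += 1                                   # arrival
        else:
            n -= 1                                   # service completion

    for k in range(5):
        print(k, occupation.get(k, 0.0) / t, (1 - rho) * rho**k)   # simulated vs pi_k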

For the case \(\lambda \ge \mu\), the intensity of arrival is greater than or equal to the intensity of service. One can show that there is no stationary distribution, and the queue, on average, grows infinitely large as time goes to infinity: \(\mathbb E[X(t)] \to \infty\) as \(t \to \infty\).

In the stationary distribution, the outflow of customers also forms a Poisson process with intensity \(\lambda\), just like the inflow (Burke’s theorem). This follows from the fact that the system is in equilibrium.

16.2. Finite queues

Now, assume new people do not come when \(N\) people already are in the queue. Then \(X = (X(t),\, t \ge 0)\) is a continuous-time Markov chain with state space \(\{0, 1, \ldots, N-1, N\}\), with generator \[ A = \begin{bmatrix} -\lambda & \lambda & 0 & \ldots & 0 & 0\\ \mu & -\lambda-\mu & \lambda & \ldots & 0 & 0\\ 0 & \mu & -\lambda-\mu & \ldots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & 0 & 0\\ 0 & 0 & 0 & \ldots & -\lambda-\mu & \lambda\\ 0 & 0 & 0 & \ldots & \mu & -\mu \end{bmatrix} \] Let us find its stationary distribution \(\pi = \begin{bmatrix}\pi_0 & \pi_1 & \ldots & \pi_N\end{bmatrix}\) It has to satisfy \(\pi A = \begin{bmatrix}0 & 0 & \ldots & 0\end{bmatrix}\) The equations for \(n = 1, \ldots, N-1\) are the same as above; the last column gives us another boundary condition: \[ \lambda\pi_{N-1} - \mu\pi_N = 0. \] The general solution to the difference equation is \(\pi_n = c_1 + c_2(\lambda/\mu)^n\), as above. The first boundary condition again gives us \(c_1 = 0\) (for \(\lambda \ne \mu\)), so we have \[ \pi_n = c\left(\frac{\lambda}{\mu}\right)^n,\ n = 0, \ldots, N. \] One can check that this solution also satisfies the last boundary condition. And to find \(c\), we need to use \[ \pi_0 + \ldots + \pi_N = 1\ \Rightarrow\ c\left(1 + \rho + \ldots + \rho^N\right) = 1,\ \ \rho := \frac{\lambda}{\mu}. \] Summing this finite geometric series, we get (for \(\lambda \ne \mu\ \Leftrightarrow\ \rho \ne 1\)): \[ c\frac{\rho^{N+1} - 1}{\rho - 1} = 1\ \Rightarrow\ c = \frac{\rho-1}{\rho^{N+1} - 1}. \] The answer is: \[ \boxed{ \pi_n = \frac{\rho-1}{\rho^{N+1} - 1}\rho^n,\, n = 0, \ldots, N } \] Separate case: \(\lambda = \mu\). Then we have: \[ \pi_{n-1} - 2\pi_n + \pi_{n+1} = 0,\ n = 1, 2, \ldots, N-1, \] and the boundary conditions give \(\pi_0 = \pi_1\) and \(\pi_{N-1} = \pi_N\). We can just let \(\pi_n = 1\) for all \(n = 0, \ldots, N\); then all these conditions are satisfied. But we also need to normalize. Then we get: \[ \boxed{ \pi_0 = \ldots = \pi_N = \frac{1}{N+1} } \]

This stationary distribution exists even if \(\lambda \ge \mu\), because the queue cannot grow indefinitely.

16.3. Finite queues with varying intensities

Assume we have a lemonade stand. People come there with intensity \(\lambda = 2\) and are served with intensity \(\mu = 2\), but with the following qualifications. If there are currently no people in the queue, then the newcomer joins the queue. If there are one or two people in the queue, the newcomer joins the queue with probability \(1/2\). If there are three or more people, then the newcomer does not join the queue. Let \(X(t)\) be the number of people in the queue at time \(t\). Then \(X\) is a continuous-time Markov chain with state space \(\{0, 1, 2, 3\}\). When \(X(t) = 1\), the intensity of jumping to \(2\) is \(2\cdot 0.5 = 1\). Indeed, if two people per minute on average come to the lemonade stand, but each of them chooses independently, with probability \(0.5\), whether to stay, then the effective rate of people joining the queue is \(1\) per minute. Similarly, when \(X(t) = 2\), the intensity of jumping to \(3\) is \(2\cdot 0.5 = 1\). The intensity of moving from \(3\) to \(2\), from \(2\) to \(1\), and from \(1\) to \(0\) is \(2\). The generator is \[ A = \begin{bmatrix} -2 & 2 & 0 & 0\\ 2 & -3 & 1 & 0\\ 0 & 2 & -3 & 1\\ 0 & 0 & 2 & -2 \end{bmatrix} \] To find the stationary distribution \(\pi = \begin{bmatrix}\pi_0 & \pi_1 & \pi_2 & \pi_3\end{bmatrix}\), solve \(\pi A = \begin{bmatrix}0 & 0 & 0 & 0\end{bmatrix}\) together with the normalization \(\pi_0 + \pi_1 + \pi_2 + \pi_3 = 1\).
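Solving this system numerically (the same approach as in Example 15.2; a sketch) gives \(\pi = \begin{bmatrix}\frac4{11} & \frac4{11} & \frac2{11} & \frac1{11}\end{bmatrix}\):

    import numpy as np

    A = np.array([[-2.0,  2.0,  0.0,  0.0],
                  [ 2.0, -3.0,  1.0,  0.0],
                  [ 0.0,  2.0, -3.0,  1.0],
                  [ 0.0,  0.0,  2.0, -2.0]])

    M = np.vstack([A.T, np.ones(4)])                 # pi A = 0 plus normalization
    b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
    pi = np.linalg.lstsq(M, b, rcond=None)[0]
    print(pi)                                        # about [0.364, 0.364, 0.182, 0.091]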

17. Brownian Motion

17.1. Continuous-time random walk

Take a compound Poisson process \[ S(t) = \sum\limits_{k=1}^{N(t)}Z_k, \] with \(N = (N(t),\, t \ge 0)\) a Poisson process with intensity \(\lambda = 1\), and with \(Z_1, Z_2, \ldots\) i.i.d. random variables with \(\mathbb P(Z_i = 1) = \mathbb P(Z_i = -1) = 0.5\). This process \(S\) is called a continuous-time random walk, by analogy with a discrete-time random walk \(X_n = Z_1 + \ldots + Z_n\). In fact, if you take \(0 =: \tau_0 < \tau_1 < \ldots\) to be the jump times of this continuous-time random walk \(S\), then the corresponding discrete-time Markov chain is nothing else but a discrete-time random walk: \(X_n = S(\tau_n)\). \[ \mathbb E Z_k = 1\cdot0.5 + (-1)\cdot 0.5 = 0,\ \mathbb E Z_k^2 = 1^2\cdot0.5 + (-1)^2\cdot 0.5 = 1, \] \[ \mbox{and so}\ \ \mathrm{Var} Z_k = \mathbb E Z_k^2 - (\mathbb E Z_k)^2 = 1. \] Therefore, \[ \mathbb E S(t) = \lambda t\cdot\mathbb E Z_k = 0,\ \mathrm{Var} S(t) = \lambda t\left((\mathbb{E} Z_k)^2 + \mathrm{Var} Z_k\right) = t. \]

17.2. Zooming out: Scaling limit is a Brownian motion

To make jumps very small but very frequent, take a large \(n\), and define \[ S_n(t) = \frac1{\sqrt{n}}\sum\limits_{k=1}^{N(nt)}Z_k = \frac1{\sqrt{n}}S(nt). \] This is also a compound Poisson process: the Poisson process \((N(tn),\, t \ge 0)\) has intensity \(\lambda n = n\), so jumps occur every \(n^{-1}\) time units, on average; and each jump is up or down with equal probability, with jump size \(n^{-1/2}\). We have: \[ \mathbb E S_n(t) = \frac1{\sqrt{n}}\mathbb E S(nt) = 0,\ \mathrm{Var} S_n(t) = \left(\frac1{\sqrt{n}}\right)^2\mathrm{Var} S(nt) = \frac1n\cdot nt = t. \] The random variable \(S_n(t)\), for a large \(n\), is the sum of many very small summands. By the Central Limit Theorem, as \(n \to \infty\), it converges to the normal distribution with the same mean and variance as \(S_n(t)\). In other words, \[ S_n(t) = \frac1{\sqrt{n}}\sum\limits_{k=1}^{N(nt)}Z_k\ \Rightarrow\ \mathcal N(0, t). \] Similarly, for \(t > s\), as \(n \to \infty\), \[ S_n(t) - S_n(s) = \frac1{\sqrt{n}}\sum\limits_{k = N(ns)+1}^{N(nt)}Z_k\ \Rightarrow\ \mathcal N(0, t-s). \] And this difference is independent of \(S_n(u)\) for \(u \le s\). Therefore, \(S_n(t) \Rightarrow W(t)\) as \(n \to \infty\). Here, we have the following properties of \(W\):

  • \(W(t) - W(s) \sim \mathcal N(0, t-s)\) for \(t > s\), and is independent of \(W(u)\), \(u \le s\);

  • \(W(0) = 0\), because \(S_n(0) = 0\).

  • \(W\) has continuous trajectories, because the jumps of \(S_n\) are very small: only of size \(n^{-1/2}\).

This process \(W\) which satisfies these properties is called a Brownian motion. This is the most important random (stochastic) process in Probability Theory.

This movement is a sum of very frequent but very small jumps. It was first used by Einstein to model the motion of a small dust particle hit by molecules (many small hits). Brownian motion is also used to model the stock price, because its movement is influenced by a lot of new bits of information, which are random (if they were not random, they would be predictable, that is, not new).

Brownian motion is a Markov process: Future movement \(W(t),\, t > s\), does not depend on the past \(W(t),\, t < s\), if we know the present \(W(s)\). This is because \(W(t) - W(s)\) is independent of \(W(u),\, u \le s\). The state space of the Brownian motion is the whole real line: \(\mathbb{R}\).

17.3. Expectation, variance, and transition density

Assume \(W(s) = x\). Then \(W(t) \sim \mathcal N(x, t-s)\), because \(W(t) - W(s) \sim \mathcal N(0, t-s)\). Therefore, the density of \(W(t)\) given \(W(s) = x\) is \[ p(s, t, x, y) = \frac1{\sqrt{2\pi(t-s)}}\exp\left(-\frac{(y-x)^2}{2(t-s)}\right). \] In other words, for every \(a < b\), \[ \mathbb{P}(a \le W(t) \le b\mid W(s) = x) = \int_a^bp(s, t, x, y)\,\mathrm{d} y. \] We often write this as \[ p(s, t, x, y) = \phi(t-s, y-x),\ \phi(t, z) := \frac1{\sqrt{2\pi t}}\exp\left(-\frac{z^2}{2t}\right). \]

Example 17.1.

We have: \[ \mathbb E W(2)W(3) = \mathbb E W^2(2) + \mathbb E\left[W(2)(W(3) - W(2))\right] = \mathrm{Var} W(2) + \mathbb E(W(3) - W(2))\mathbb E W(2) = 2. \] More generally, \(\mathbb E W(s)W(t) = \min(s, t)\).

Example 17.2.

For every number \(x\), we have: \[ \mathbb E\left[W(3)\mid W(2) = x\right] = \mathbb E\left[W(3) - W(2)\mid W(2) = x\right] + \mathbb E\left[W(2)\mid W(2) = x\right] = \mathbb E(W(3) - W(2)) + x = 0 + x = x. \]

Example 17.3.

\(\mathbb E\left[W^2(5)\mid W(3) = 4\right] = \mathbb E\left(\xi + 4\right)^2 = \mathbb E\xi^2 + 8\mathbb E\xi + 16 = 2 + 0 + 16 = 18\), with \(\xi = W(5) - W(3) \sim \mathcal N(0, 2)\).

Example 17.4.

\(\mathbb E\left[(W(6) - W(4))W(5)\right] = \mathbb E W(5)W(6) - \mathbb E W(4)W(5) = \min(5, 6) - \min(4, 5) = 1\).

Example 17.5.

\(\mathbb{P}(W(3) > 0) = 0.5\), because \(W(3) \sim \mathcal N(0, 3)\) is symmetric with respect to zero.

Example 17.6.

Let us find \(\mathbb{P}(W(3) > 1\mid W(1) = 2)\). We have: \(\xi := W(3) - W(1) \sim \mathcal N(0, 2)\). Therefore, \[ \mathbb P(W(3) > 1\mid W(1) = 2) = \mathbb P(\xi + 2 > 1) = \mathbb P(\xi > -1). \] But \(\xi = \sqrt{2}Z\) for \(Z \sim \mathcal N(0, 1)\), so \[ \mathbb P(\xi > -1) = \mathbb P(\sqrt{2}Z > -1) = \mathbb P(Z > -1/\sqrt{2}) = \mathbb P(Z > -0.707) \approx 0.7611. \]

Example 17.7.

Let us find the density of \(W^3(2)\), given \(W(1) = 2\). We have: \(W^3(2) = (\xi + 2)^3\) for \(\xi := W(2) - W(1) \sim \mathcal N(0, 1)\). Then the density of \(\xi\) is \[ f(x) = \frac1{\sqrt{2\pi}}e^{-x^2/2}, \] and the density of \((\xi+2)^3\) can be found as follows: \[ \mathbb{P}\left(a \le (\xi + 2)^3 \le b\right) = \mathbb{P}\left(a^{1/3} - 2 \le \xi \le b^{1/3} - 2\right) = \int_{a^{1/3} - 2}^{b^{1/3}-2}f(x)\,\mathrm{d} x = \int_{a}^bf(y^{1/3} - 2)\frac13y^{-2/3}\mathrm{d} y, \] where we change variables \(x = y^{1/3} - 2\), so \(\mathrm{d} x = y^{-2/3}\mathrm{d} y/3\), and \[ a^{1/3} - 2 \le x \le b^{1/3} - 2\ \Leftrightarrow\ a^{1/3} - 2 \le y^{1/3} -2 \le b^{1/3} - 2\ \Leftrightarrow\ a \le y \le b. \] Therefore, the density of \((\xi + 2)^3\) is given by \[ f(y^{1/3} - 2)\frac13y^{-2/3} = \frac1{3\sqrt{2\pi}y^{2/3}}\exp\left(-\frac{(y^{1/3} - 2)^2}{2}\right). \]

17.4. Brownian motion with drift and diffusion

Take coefficients \(g\) and \(\sigma\), a starting point \(x\), and define \[ X(t) = x + gt + \sigma W(t). \] This is a Brownian motion starting from \(x\), with drift coefficient \(g\), and diffusion coefficient \(\sigma^2\). The standard Brownian motion has \(x = 0,\, g = 0,\, \sigma = 1\). When we say Brownian motion without specifying the coefficients and the starting point, we mean the standard Brownian motion. This process has properties similar to the standard Brownian motion: for \(t > s\), \[ X(t) - X(s) = g(t-s) + \sigma(W(t) - W(s)) \sim \mathcal N\left(g(t-s), \sigma^2(t-s)\right), \] and this random variable is independent of \(X(u),\, u \le s\).

Example 17.8.

Consider \(x = 0,\, g = -1,\, \sigma = 2\). Find \(\mathbb E(X(3)\mid X(2) = 2)\). We have: \(X(t) = -t + 2W(t)\), and so \(X(3) - X(2) = -1 + 2(W(3) - W(2)) \sim \mathcal N(-1, 4)\). Therefore, \(\mathbb E(X(3)\mid X(2) = 2) = \mathbb E(X(3) - X(2)) + 2 = -1 + 2 = 1\). Next, find \(\mathbb E(X^2(5)\mid X(2) = 2)\): if \(\xi := X(5) - X(2) \sim \mathcal N(-3, 12)\), then we have \(X^2(5) = (2 + \xi)^2\), and the answer is \[ \mathbb E(2 + \xi)^2 = 4 + 4\mathbb E \xi + \mathbb E \xi^2 = 4 + 4\cdot(-3) + (-3)^2 + 12 = 13. \]

17.5. Levy processes

Combine a Brownian motion with a compound Poisson process: \[ L(t) = x + gt + \sigma W(t) + \sum\limits_{k=1}^{N(t)}Z_k. \] Here, \(W\) is a standard Brownian motion, \(N\) is a Poisson process with intensity \(\lambda\), and \(Z_1, Z_2, \ldots\) are i.i.d. Also, \(W, N, Z_k\) are all independent. This process behaves as a Brownian motion between jumps, which occur every \(\mathrm{Exp}(\lambda)\) times, and the displacement during the \(k\)th jump is \(Z_k\). This process also has the Markov property: \(L(t) - L(s)\) is independent of \(L(u),\, u \le s\). Also, it has stationary increments: the distribution of \(L(t) - L(s)\) depends only on \(t - s\) for \(t > s\). For example, the distribution of \(L(6) - L(4)\) and of \(L(5) - L(3)\) is the same.

Example 17.9.

\(x = 1,\, g = 2,\, \sigma = 3,\, \lambda = 2,\, Z_k \sim \mathrm{Exp}(2)\). Then \[ \mathbb E L(t) = x + gt + \sigma\mathbb E W(t) + \mathbb E\sum\limits_{k=1}^{N(t)}Z_k = 1 + 2t + \lambda t\mathbb E Z_k = 1 + 2t + 2t\cdot\frac12 = 1 + 3t, \] \[ \mathrm{Var} L(t) = \sigma^2\mathrm{Var} W(t) + \mathrm{Var}\sum\limits_{k=1}^{N(t)}Z_k = \sigma^2t + t\lambda\left((\mathbb E Z_k)^2 + \mathrm{Var} Z_k\right) = 9t + 2t\left(0.5^2 + \frac14\right) = 10t. \] Let \(G(u) := \mathbb E e^{uZ_k}\) be the moment generating function of \(Z_k\). It is equal to \[ G(u) = \frac{2}{2 - u},\ \ u < 2. \] Then the moment generating function of \(L(t)\) is given by \[ F_t(u) := \mathbb E\exp(uL(t)) = \mathbb E\exp\left[u\left(x + gt + \sigma W(t)\right)\right]\cdot \mathbb E\exp\left[u\sum\limits_{k=1}^{N(t)}Z_k\right] \] \[ = \exp\left(ux + gut + \frac{\sigma^2u^2}2t\right)\exp\left(\lambda t(G(u) - 1)\right) \] \[ = e^{ux}\exp\left(t\left[gu + \sigma^2u^2/2 + \lambda(G(u) - 1)\right]\right) = \exp\left(u + t\left[2u + \frac92u^2 + 2\left(\frac{2}{2 - u} - 1\right)\right]\right). \]

Any process with stationary independent increments is called a Levy process. One can show that if such process has at most finitely many jumps on a finite time interval, then it can be represented in this way above.

17.6. Reflection principle and the maximum of Brownian motion

Let us find the distribution of \(M(t) = \max_{0 \le s \le t}W(s)\). For example, find \[ \mathbb{P}(M(4) \ge 3,\, 1 \le W(4) \le 2). \] A trajectory of the Brownian motion which satisfies this condition crosses the line \(y = 3\). Let us take the first moment of crossing and reflect the trajectory of the Brownian motion across this line, starting from this moment. Then we get another trajectory of a Brownian motion, because a Brownian motion is symmetric (it can go up or down with equal probability). But this new trajectory ends up between \(4\) and \(5\) at time \(t = 4\), because \(4\) and \(5\) are the mirror images of \(2\) and \(1\) with respect to the line \(y = 3\). Conversely, every trajectory of a Brownian motion with \(4 \le W(4) \le 5\), after being reflected from its first crossing of the line \(y = 3\), becomes a trajectory of a Brownian motion satisfying the original event. Therefore, the probability of this event is equal to (with \(W(4) = 2Z,\, Z \sim \mathcal N(0, 1)\)): \[ \mathbb{P}(4 \le W(4) \le 5) = \frac1{\sqrt{2\pi}2}\int_4^5e^{-x^2/8}\,\mathrm{d} x = \mathbb{P}(2 \le Z \le 2.5) = 0.9938 - 0.9772 = \boxed{0.0166} \] Similarly, \[ \mathbb{P}(M(4) \ge 3,\, W(4) \le 3) = \mathbb{P}(W(4) \ge 3) = \frac1{\sqrt{2\pi}2}\int_3^{\infty}e^{-x^2/8}\,\mathrm{d} x. \] Note that if \(W(4) \ge 3\), then certainly \(M(4) \ge 3\). And \[ \mathbb P(M(4) \ge 3) = \mathbb P(M(4) \ge 3,\, W(4) \le 3) + \mathbb P(W(4) \ge 3) = 2\mathbb P(W(4) \ge 3). \] Similarly, for every \(t > 0\) and \(y \ge 0\), \[ \mathbb{P}(M(t) \ge y) = 2\mathbb{P}(W(t) \ge y) = \mathbb{P}(W(t) \ge y) + \mathbb{P}(W(t) \le -y) = \mathbb{P}(|W(t)| \ge y). \] That is, \(M(t)\) and \(|W(t)|\) have the same distribution, with density \[ \frac2{\sqrt{2\pi t}}e^{-x^2/(2t)},\ x \ge 0. \]
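A Monte Carlo check of this computation on discretized Brownian paths (a sketch; the discrete-time maximum slightly underestimates \(M(4)\), so the first estimate comes out a bit low):

    import numpy as np

    rng = np.random.default_rng(8)
    trials, steps, T = 10**4, 1000, 4.0
    dt = T / steps

    paths = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(trials, steps)), axis=1)
    M = paths.max(axis=1)                            # approximates the running maximum M(4)
    W = paths[:, -1]                                 # W(4)

    print(np.mean((M >= 3) & (1 <= W) & (W <= 2)))   # P(M(4) >= 3, 1 <= W(4) <= 2)
    print(np.mean((4 <= W) & (W <= 5)))              # P(4 <= W(4) <= 5), about 0.0166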

17.7. Reflected Brownian motion

This is the process \(|W| = (|W(t)|,\, t \ge 0)\). It is a Markov process, because the behavior of \(|W(t)|\) for \(t \ge s\) depends only on \(|W(s)|\); if you know \(W(u),\, u < s\), this does not give you any additional information. The state space of this process is \(\mathbb R_+ := [0, \infty)\). Skorohod representation: \[ |W(t)| = B(t) + \ell(t), \] where \(B\) is another Brownian motion (not \(W\)!), and \(\ell\) is a continuous nondecreasing process with \(\ell(0) = 0\), with \(\ell\) increasing only when \(|W| = 0\). As long as \(|W|\) stays inside the positive half-line, it behaves as a Brownian motion. But when it hits zero, it “wants” to go down, but is not allowed to, because it has to stay positive. Then we add a push \(\mathrm{d}\ell(t)\) to make it positive, and prevent it from crossing \(y = 0\). The process \(\ell\) is called a local time of \(|W|\) at zero.

The transition density of \(|W(t)|\): for any \(0 \le a < b\), \[ \mathbb{P}\left(a \le |W(t)| \le b\mid |W(s)| = x\right) = \mathbb{P}\left(a \le W(t) \le b\mid W(s) = x\right) + \mathbb{P}\left(-b \le W(t) \le -a\mid W(s) = x\right) \] \[ = \frac1{\sqrt{2\pi(t-s)}}\int_a^b\exp\left(-\frac{(y-x)^2}{2(t-s)}\right)\,\mathrm{d} y + \frac1{\sqrt{2\pi(t-s)}}\int_{-b}^{-a}\exp\left(-\frac{(y-x)^2}{2(t-s)}\right)\,\mathrm{d} y \] \[ = \frac1{\sqrt{2\pi(t-s)}}\int_a^b\left[\exp\left(-\frac{(y-x)^2}{2(t-s)}\right) + \exp\left(-\frac{(-y-x)^2}{2(t-s)}\right) \right]\,\mathrm{d} y \] so given \(|W(s)| = x\), \(|W(t)|\) has density \[ p(s, t, x, y) = \frac1{\sqrt{2\pi(t-s)}}\left[\exp\left(-\frac{(y-x)^2}{2(t-s)}\right) + \exp\left(-\frac{(y+x)^2}{2(t-s)}\right)\right]. \]

Example 17.10.

Let us find \(\mathbb{P}(|W(2)| \le 1\mid |W(1)| = 1)\). Let \(\xi := W(2) - W(1) \sim \mathcal N(0, 1)\), which is independent of \(W(1)\). Then \[ \mathbb{P}(|W(2)| \le 1\mid |W(1)| = 1) = \mathbb{P}(-1 \le W(2) \le 1\mid W(1) = 1) = \mathbb{P}(-1 \le 1 + \xi \le 1) = \mathbb{P}(-2 \le \xi \le 0) = \mathbb{P}(0 \le \xi \le 2) \] \[ = \frac1{\sqrt{2\pi}}\int_0^2e^{-x^2/2}\,\mathrm{d} x \approx 0.4772. \]

Example 17.11.

Let us find \(\mathbb E|W(3)|\). We can represent \(W(3) = \sqrt{3}\xi\), \(\xi \sim \mathcal N(0, 1)\). Then \[ \mathbb E|W(3)| = \sqrt{3}\mathbb E|\xi| = \sqrt{3}\frac1{\sqrt{2\pi}}\int_{-\infty}^{\infty}|y|e^{-y^2/2}\,\mathrm{d} y = \sqrt{3}\frac{2}{\sqrt{2\pi}}\int_0^{\infty}ye^{-y^2/2}\,\mathrm{d} y = \sqrt{3}\frac{2}{\sqrt{2\pi}}\left.\left(-e^{-y^2/2}\right)\right|_{y=0}^{\infty} = \frac{2\sqrt{3}}{\sqrt{2\pi}}. \]

17.8. Transformations of Brownian motion

Note that the process \((W(9t),\, t \ge 0)\) is also a Brownian motion with diffusion coefficient \(9\) and drift coefficient \(0\). In other words, the following processes have the same distribution: \[ (W(9t),\, t \ge 0)\ \mbox{and}\ (3W(t),\, t \ge 0). \] Indeed, \(W(9t)\) and \(3W(t)\) have the same distribution: \(\mathcal N(0, 9t)\). Moreover, \(W(9t) - W(9s)\) and \(3W(t) - 3W(s)\) have the same distribution: \(\mathcal N(0, 9(t-s))\), for \(t > s\). And \(W(9t) - W(9s)\) is independent of \(W(9u),\, u \le s\).

Consider two independent Brownian motions: \(X = (X(t),\, t \ge 0)\) is a Brownian motion with drift \(g_1 = -2\) and diffusion \(\sigma_1^2 = 3\), starting from \(x_1 = -1\), and \(Y = (Y(t),\, t \ge 0)\) is a Brownian motion with drift \(g_2 = 3\) and diffusion \(\sigma_2^2 = 4\), starting from \(x_2 = 2\). Then take the process \[ Z = (Z(t),\, t \ge 0) = 2X - 3Y = (2X(t) - 3Y(t),\, t \ge 0). \] This is also a Brownian motion, starting from \[ Z(0) = 2X(0) - 3Y(0) = 2\cdot(-1) - 3\cdot2 = -8. \] For \(t > s\), we have: \[ Z(t) - Z(s) = 2(X(t) - X(s)) - 3(Y(t) - Y(s)). \] But the following increments are normally distributed and independent: \[ X(t) - X(s) \sim \mathcal N(-2(t-s), 3(t-s)),\ Y(t) - Y(s) \sim \mathcal N(3(t-s), 4(t-s)). \] Therefore, using results of Section 9, we get: \(Z(t) - Z(s)\) is also normal with \[ \mathbb E(Z(t) - Z(s)) = 2\mathbb E(X(t) - X(s)) - 3\mathbb E(Y(t) - Y(s)) = 2\cdot(-2)(t-s) - 3\cdot 3(t-s) = -13(t-s), \] \[ \mathrm{Var}(Z(t) - Z(s)) = 2^2\mathrm{Var}(X(t) - X(s)) + (-3)^2\mathrm{Var}(Y(t) - Y(s)) = 2^2\cdot 3(t-s) + (-3)^2\cdot 4(t-s) = 48(t-s). \] And \(Z(t) - Z(s)\) is independent of \(Z(u),\, u \le s\), because \(X\) and \(Y\) satisfy the same property. Therefore, \(Z\) is a Brownian motion with drift coefficient \(-13\) and diffusion coefficient \(48\), starting from \(-8\).
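Here is a quick numeric sanity check (an illustrative sketch; the variable names are ours). We sample the increments of \(X\) and \(Y\) over a unit time step and confirm that \(Z(t) - Z(s)\) has mean \(-13\) and variance \(48\) per unit time.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
dX = rng.normal(-2.0, np.sqrt(3.0), n)   # X(t) - X(s) ~ N(-2, 3) for t - s = 1
dY = rng.normal(3.0, np.sqrt(4.0), n)    # Y(t) - Y(s) ~ N(3, 4)
dZ = 2 * dX - 3 * dY

print(dZ.mean(), dZ.var())               # ~ -13 and ~ 48
```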

17.9. Derivatives in the Black-Scholes model

This is an extension of the stochastic finance theory from Section 12 to the continuous case. We model the stock price as a geometric Brownian motion: \[ S(t) = S_0\exp(gt + \sigma W(t)). \] Here, \(\sigma\) is called the volatility; this parameter measures the magnitude of fluctuations in the price. See Section 12 for the definitions of European options and derivatives. We would like to hedge a derivative \(D = f(S(T))\), where \(f(x)\) is some real-valued function. In particular, we are interested in the following functions, with \(a_+ := \max(a, 0)\):

  • \(f(x) = (x-K)_+\): European call option with strike \(K\);
  • \(f(x) = (K - x)_+\): European put option with strike \(K\);
  • \(f(x) = 1(x > K)\): binary option with barrier \(K\).

If we had a random walk instead of a Brownian motion, we would switch to the risk-neutral probability, under which \(\mathbb E S(t) = S_0\). Here, we have a (geometric) Brownian motion, which is a limit of geometric random walks with small but frequent jumps. It is possible to show that in this case, the risk-neutral probability corresponds to a change of drift. That is, we need to find a new drift \(g_*\) such that \(\mathbb E S(t) = S_0\). Calculating the moment generating function of \(W(t) \sim \mathcal N(0, t)\), we get: \[ \mathbb E \exp(g_*t + \sigma W(t)) = e^{g_*t}e^{\sigma^2t/2} = 1\ \Rightarrow\ g_* = -\frac{\sigma^2}2. \] Therefore, under the new risk-neutral probability, the stock price is modeled by the geometric Brownian motion \[ S(t) = S_0\exp\left(\sigma W(t) - \frac{\sigma^2}2t\right),\, t \ge 0. \] And we need to take the expected value: \(v = \mathbb E f(S(T))\). This is the fair price. We show how to hedge, or replicate, these derivatives using stochastic calculus in the next section.
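A small simulation makes the change of drift tangible. In the sketch below (our own; \(\sigma = 0.3\) and \(t = 1\) are arbitrary illustrative values), the sample mean of \(S(t)\) stays at \(S_0\) under the risk-neutral drift \(g_* = -\sigma^2/2\), while with zero drift it exceeds \(S_0\).

```python
import numpy as np

rng = np.random.default_rng(3)
S0, sigma, t = 1.0, 0.3, 1.0
W = np.sqrt(t) * rng.standard_normal(1_000_000)   # W(t) ~ N(0, t)

print(np.mean(S0 * np.exp(sigma * W - 0.5 * sigma**2 * t)))  # ~ S0 = 1
print(np.mean(S0 * np.exp(sigma * W)))   # ~ S0 exp(sigma^2 t / 2) = 1.046 > S0
```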

Example 17.12.

For the binary option with barrier \(K = 3\), maturity \(T = 2\), written on a stock with volatility \(\sigma = 3\) and initial price \(S_0 = 1.5\), the risk-neutral dynamics is given by \(S(t) = 1.5\,\exp\left(3W(t) - 9t/2\right)\), and therefore the fair price of this option is (with \(W(2) = \sqrt{2}Z\) for \(Z \sim \mathcal N(0, 1)\)): \[ \mathbb E 1(S(2) > 3) = \mathbb{P}\left(1.5\,\exp\left(3W(2) - 9\cdot 2/2\right) > 3\right) = \mathbb{P}\left(\exp\left(3\sqrt{2}Z - 9\right) > 2\right) \] \[ = \mathbb{P}\left(3\sqrt{2}Z - 9 > \ln 2\right) = \mathbb{P}\left(Z > \frac3{\sqrt{2}} + \frac{\ln 2}{3\sqrt{2}}\right) = \mathbb{P}(Z > 2.28) = \boxed{0.011} \]
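As a check (our own sketch, assuming numpy and scipy are available), the same number comes from the exact normal tail and from a direct Monte Carlo estimate; indicator payoffs are bounded, so plain Monte Carlo is reliable here.

```python
import numpy as np
from scipy.stats import norm

z = 3 / np.sqrt(2) + np.log(2) / (3 * np.sqrt(2))
print(norm.sf(z))                                  # 0.0112...

rng = np.random.default_rng(4)
W2 = np.sqrt(2) * rng.standard_normal(2_000_000)   # W(2) ~ N(0, 2)
S2 = 1.5 * np.exp(3 * W2 - 9)                      # risk-neutral S(2)
print(np.mean(S2 > 3))                             # ~ 0.011
```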

For the European call option, we get the celebrated Black-Scholes formula (Economics Nobel Prize, 1997).

Example 17.13.

Take strike \(K = 1.5\), maturity \(T = 3\), volatility \(\sigma = 2\), and current price \(S_0 = 1\). Then \(S(t) = \exp(2W(t) - 2t)\), and \(S(T) = \exp(2W(3) - 6)\). The fair price is \[ v = \mathbb E(S(T) - K)_+ = \mathbb E\left(\exp(2W(3) - 6) - 1.5\right)_+. \] The option is exercised if the terminal price at maturity is greater than the strike, that is, \[ \exp(2W(3) - 6) - 1.5 > 0\ \Leftrightarrow\ W(3) > 3 + \frac12\log 1.5 \approx 3.2027. \] The probability density function of \(W(3)\) is \((6\pi)^{-1/2}\exp(-x^2/6)\). Then \[ v = \mathbb E\left(\exp(2W(3) - 6) - 1.5\right)_+ = \frac1{\sqrt{6\pi}}\int_{3 + 0.5\log 1.5}^{\infty}\left[e^{2x - 6} - 1.5\right]e^{-x^2/6}\,\mathrm{d} x \] \[ = \frac1{\sqrt{6\pi}}\int_{3 + 0.5\log 1.5}^{\infty}e^{2x - 6}\,e^{-x^2/6}\,\mathrm{d} x - \frac1{\sqrt{6\pi}}\int_{3 + 0.5 \log 1.5}^{\infty}1.5\,e^{-x^2/6}\,\mathrm{d} x. \] The density \((6\pi)^{-1/2}e^{-x^2/6}\) is that of \(\sqrt{3}Z,\, Z \sim \mathcal N(0, 1)\). The second term on the right-hand side is equal to: \[ \frac1{\sqrt{6\pi}}\int_{3+0.5\log 1.5}^{\infty}1.5\,e^{-x^2/6}\,\mathrm{d} x = 1.5\,\mathbb{P}(\sqrt{3}Z > 3 + 0.5\log 1.5) = 1.5\,\mathbb{P}\left(Z > \sqrt{3} + \frac{0.5\log 1.5}{\sqrt{3}}\right). \] The first term on the right-hand side is equal to: \[ \frac1{\sqrt{6\pi}}\int_{3 + 0.5\log 1.5}^{\infty}e^{2x - 6}\,e^{-x^2/6}\,\mathrm{d} x = \frac1{\sqrt{6\pi}}\int_{3 + 0.5\log 1.5}^{\infty}e^{-x^2/6 + 2x - 6}\,\mathrm{d} x \] \[ = \frac1{\sqrt{6\pi}}\int_{3 + 0.5\log 1.5}^{\infty}e^{-(x-6)^2/6}\,\mathrm{d} x = \mathbb{P}(\sqrt{3}Z + 6 > 3 + 0.5\log 1.5) = \mathbb P\left(Z > -\sqrt{3} + \frac{0.5\log 1.5}{\sqrt{3}}\right). \] Therefore, the answer is \[ v = \mathbb P\left(Z > - \sqrt{3} + \frac{0.5\log 1.5}{\sqrt{3}}\right) - 1.5\cdot \mathbb P\left(Z > \sqrt{3} + \frac{0.5\log 1.5}{\sqrt{3}}\right) \] \[ = \mathbb P(Z > -1.62) - 1.5\cdot\mathbb P(Z > 1.85) \approx 0.9474 - 1.5\cdot (1 - 0.9678) = \boxed{0.899} \]
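The same value can be checked against the standard zero-interest-rate Black-Scholes call formula \(v = S_0\Phi(d_1) - K\Phi(d_2)\), with \(d_1 = (\ln(S_0/K) + \sigma^2T/2)/(\sigma\sqrt T)\) and \(d_2 = d_1 - \sigma\sqrt T\), which for these parameters reduces to exactly the two tail probabilities above. A quick sketch (ours, assuming scipy):

```python
import numpy as np
from scipy.stats import norm

S0, K, sigma, T = 1.0, 1.5, 2.0, 3.0
d1 = (np.log(S0 / K) + 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
print(S0 * norm.cdf(d1) - K * norm.cdf(d2))   # 0.8985...

# The same number from the two tail probabilities derived above:
d = 0.5 * np.log(1.5)
print(norm.sf(-np.sqrt(3) + d / np.sqrt(3))
      - 1.5 * norm.sf(np.sqrt(3) + d / np.sqrt(3)))
```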

We can also find the fair price of a European derivative at any time \(t\), if we know the price \(S(t) = x\) at this time: \[ u(t, x) = \mathbb E(f(S(T))\mid S(t) = x), \] where the expectation is again taken with respect to the risk-neutral probability, that is, for the process \(S(t)\) above. We simply treat time \(t\) as the initial time and \(S(t) = x\) as the initial price. Hedging such a derivative, that is, replicating it with a combination of cash and the stock, is slightly harder, and is deferred to the next section on stochastic calculus.

17.10. Drawbacks of the Black-Scholes model

Heavy tails. Large fluctuations of stock prices occur more frequently than the normal distribution prescribes. A better model has power tails: \(\mathbb P(\ln(S(t_2)/S(t_1)) \ge x) \approx cx^{-\alpha}\) for large \(x\), and similarly for \(\mathbb P(\ln(S(t_2)/S(t_1)) \le -x)\); here, \(c, \alpha > 0\) are constants.

Short-term dependence. Increments of log prices are not independent: for example, \(\ln S(2.1) - \ln S(2)\) and \(\ln S(2.2) - \ln S(2.1)\) are dependent in real data, although they would be independent in the Black-Scholes model.

Volatility is not constant. In the Black-Scholes model, \(\sigma^2\) estimated from real data should be constant, but it is not. In fact, the graph of volatility \(\sigma(t)\) against the stock price \(S(t)\) is not a horizontal line: it is decreasing (the volatility skew) or U-shaped (the volatility smile). Several techniques are available to capture this, including stochastic volatility models, in which \(\sigma(t)\) is itself modeled by a stochastic process, for example driven by a second Brownian motion \(B(t)\) correlated with \(W(t)\).

Bid-ask spread. The stock price is actually two different prices: the bid, the highest price buyers offer, and the ask, the lowest price sellers demand. A trade occurs when a bid matches an ask. The dynamics of bid and ask offers are modeled separately.

18. Stochastic Calculus

In this section, our aim is to differentiate the Brownian motion \(W = (W(t),\, t \ge 0)\). We cannot do this in the usual sense, because its trajectories are not differentiable. Indeed, \[ W(t + \mathrm{d} t) - W(t) \sim \mathcal N(0, \mathrm{d} t)\ \Rightarrow\ \frac{W(t + \mathrm{d} t) - W(t)}{\mathrm{d} t} \sim \mathcal N\left(0, \frac1{\mathrm{d} t^2}\cdot\mathrm{d} t\right) = \mathcal N\left(0, \frac1{\mathrm{d} t}\right). \] As \(\mathrm{d} t \to 0\), the variance \(1/\mathrm{d} t \to \infty\). Therefore, this ratio does not converge.

18.1. Quadratic variation

For a function \(f : [0, \infty) \to \mathbb R\), define its quadratic variation on the interval \([0, t]\): \[ \langle f\rangle_t := \lim\limits_{N \to \infty}\sum\limits_{k=0}^{N-1}\left(f\left(\frac{(k+1)t}N\right) - f\left(\frac{kt}N\right)\right)^2. \] We split the interval \([0, t]\) into \(N\) equal subintervals \[ \left[0, \frac tN\right], \left[\frac tN, \frac{2t}N\right], \ldots, \left[\frac{(N-1)t}{N}, t\right], \] calculate the increment of \(f\) on each subinterval, square each increment, and sum them. Then we let the size of the subintervals go to zero. For smooth functions \(f\), one can show that this quantity is zero. But for the Brownian motion \(W\), we have independent identically distributed increments: \[ W\left(\frac{(k+1)t}{N}\right) - W\left(\frac{kt}{N}\right) \sim \mathcal N\left(0, \frac{t}{N}\right). \] Therefore, we can represent \[ W\left(\frac{(k+1)t}{N}\right) - W\left(\frac{kt}{N}\right) = (t/N)^{1/2}Z_k,\ k = 0, \ldots, N-1, \] where \(Z_0, \ldots, Z_{N-1}\) are i.i.d. \(\mathcal N(0, 1)\). Therefore, by the Law of Large Numbers, as \(N \to \infty\): \[ \left((t/N)^{1/2}Z_0\right)^2 + \ldots + \left((t/N)^{1/2}Z_{N-1}\right)^2 = \frac tN\left(Z_0^2 + \ldots + Z_{N-1}^2\right) \to t\,\mathbb E Z_0^2 = t. \] We have shown that \(\langle W\rangle_t = t\). We can also symbolically represent quadratic variation as \[ \langle f\rangle_t = \int_0^t\left(\mathrm{d} f\right)^2. \] In other words, \(\mathrm{d}\langle f\rangle_t = (\mathrm{d} f)^2\). For smooth functions \(f\), we have: \(\mathrm{d} f = f'(t)\,\mathrm{d} t\), and therefore \[ \mathrm{d}\langle f\rangle_t = (f'(t)\,\mathrm{d} t)^2 = f'^2(t)\,(\mathrm{d} t)^2 = 0. \] Indeed, in such calculations we must consider \((\mathrm{d} t)^2 = 0\), and in general \((\mathrm{d} t)^{a} = 0\) for every \(a > 1\).
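The contrast between \(\langle W\rangle_t = t\) and \(\langle f\rangle_t = 0\) for smooth \(f\) is easy to see numerically. A minimal sketch (ours; \(f(s) = \sin s\) is an arbitrary smooth test function):

```python
import numpy as np

rng = np.random.default_rng(6)
t = 2.0
for N in (100, 10_000, 1_000_000):
    dW = rng.standard_normal(N) * np.sqrt(t / N)   # Brownian increments
    s = np.linspace(0, t, N + 1)
    print(N, np.sum(dW**2),                        # -> t = 2
          np.sum(np.diff(np.sin(s))**2))           # -> 0
```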

18.2. Construction of a stochastic integral

For a deterministic function or a random process \(H = (H(t),\, t \ge 0)\), define the stochastic integral, or Itô integral: \[ \int_0^tH(s)\,\mathrm{d} W(s) = \lim\limits_{N \to \infty}\sum\limits_{k=0}^{N-1}H\left(\frac{kt}N\right)\left(W\left(\frac{(k+1)t}{N}\right) - W\left(\frac{kt}{N}\right)\right) \] \[ = H(0)\left(W\left(\frac tN\right) - W(0)\right) + H\left(\frac tN\right)\left(W\left(\frac{2t}{N}\right) - W\left(\frac{t}{N}\right)\right) + \ldots + H\left(\frac{t(N-1)}{N}\right)\left(W(t) - W\left(\frac{t(N-1)}{N}\right)\right). \] This is similar to a Riemann sum for the integral \(\int_0^tH(s)\,\mathrm{d} s\), except that the increments of \(s\) are replaced by increments of \(W\), and \(H\) is always evaluated at the left endpoint of each subinterval.

Example 18.1.

Let \(H(s) \equiv c\). Then the sum above telescopes to \(c(W(t) - W(0)) = cW(t)\). That is, \[ \int_0^tc\,\mathrm{d} W(s) = cW(t). \]

18.3. Stochastic integral of a deterministic function

More generally, take a deterministic (non-random) function \(f : [0, t] \to \mathbb R\), and let \(H := f\). Then the sum above is a linear combination of independent normal random variables \[ Z_k := W\left(\frac{(k+1)t}{N}\right) - W\left(\frac{kt}{N}\right) \sim \mathcal N\left(0, \frac{t}{N}\right),\ \ k = 0, \ldots, N-1, \] each with \(\mathbb E Z_k = 0\) and \(\mathrm{Var}\, Z_k = t/N\). We can rewrite this sum as \[ S_N := f(0)\,Z_0 + f\left(\frac tN\right)\,Z_1 + \ldots + f\left(\frac{(N-1)t}{N}\right)\,Z_{N-1}. \] It also has a normal distribution (as a linear combination of independent normal random variables), with \[ \mathbb E S_N = f(0)\,\mathbb E Z_0 + \ldots + f\left(\frac{(N-1)t}{N}\right)\,\mathbb E Z_{N-1} = 0, \] \[ \mathrm{Var} (S_N) = f^2(0)\,\mathrm{Var} Z_0 + \ldots + f^2\left(\frac{(N-1)t}{N}\right)\,\mathrm{Var} Z_{N-1} = \frac tN\sum\limits_{k=0}^{N-1}f^2\left(\frac{kt}{N}\right). \] Note that \(\mathrm{Var}\, S_N\) is a Riemann sum for the integral \(I := \int_0^tf^2(s)\,\mathrm{d} s\). Therefore, \(\mathrm{Var}\, S_N \to I\). And the stochastic integral is distributed as a normal random variable: \[ \int_0^tf(s)\,\mathrm{d} W(s) \sim \mathcal N(0, I). \]

Example 18.2.

Consider the integral \(\int_0^2(3-t)\,\mathrm{d} W(t)\). It is distributed as \(\mathcal N(0, \sigma^2)\) with \[ \sigma^2 = \int_0^2(3-s)^2\,\mathrm{d} s = \int_0^2\left(9 - 6s + s^2\right)\,\mathrm{d} s = \left.\left(9s - 3s^2 + \frac{s^3}3\right)\right|_{s=0}^{s=2} = \frac{26}{3}. \]
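Sketch of a simulation check (ours): form the left-endpoint sums for \(\int_0^2(3-t)\,\mathrm{d} W(t)\) on many paths and compare the sample mean and variance with \(0\) and \(26/3 \approx 8.667\).

```python
import numpy as np

rng = np.random.default_rng(7)
n_paths, N, t = 50_000, 500, 2.0
dt = t / N
s = np.linspace(0, t, N, endpoint=False)     # left endpoints s_k = k t / N
dW = rng.standard_normal((n_paths, N)) * np.sqrt(dt)
I = dW @ (3 - s)                             # sum_k f(s_k)(W(s_{k+1}) - W(s_k))

print(I.mean(), I.var())                     # ~ 0 and ~ 26/3 = 8.667
```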

Example 18.3.

Consider the integral \(\int_0^1t\,\mathrm{d} X(t)\), where \(X\) is a Brownian motion with drift \(g = -1\) and diffusion \(\sigma^2 = 4\). Then we have: \(\mathrm{d} X(t) = -\mathrm{d} t + 2\mathrm{d} W(t)\), and therefore \[ \int_0^1t\,\mathrm{d} X(t) = -\int_0^1t\,\mathrm{d} t + 2\int_0^1t\,\mathrm{d} W(t) \sim \mathcal N\left(-\int_0^1t\,\mathrm{d} t, 4\int_0^1t^2\,\mathrm{d} t\right) = \mathcal N\left(-\frac12, \frac43\right). \]

If the integrand \(X = (X(t),\, t \ge 0)\) is random, then similarly one can show that \[ \mathbb E\int_0^TX(t)\,\mathrm{d} W(t) = 0\ \mbox{and}\ \mathbb E\left[\int_0^TX(t)\,\mathrm{d} W(t)\right]^2 = \mathbb E\int_0^TX^2(t)\,\mathrm{d} t. \] However, in this case the distribution of the stochastic integral is not necessarily normal.

18.4. Relation between stochastic integral and quadratic variation

The stochastic integral \(X(t) = \int_0^tH(s)\,\mathrm{d} W(s)\) for \(t \ge 0\) can itself be viewed as a random process. We can write it in the form \[ \mathrm{d} X(t) = H(t)\,\mathrm{d} W(t). \] Its quadratic variation is equal to \[ \langle X\rangle_t = \int_0^t(\mathrm{d} X(s))^2 = \int_0^tH^2(s)\,(\mathrm{d} W(s))^2 = \int_0^tH^2(s)\mathrm{d}\langle W\rangle_s = \int_0^tH^2(s)\,\mathrm{d} s. \] Alternatively, we can write this as \[ \mathrm{d}\langle X\rangle_t = H^2(t)\,\mathrm{d} t. \] When \(X\) is a sum of a stochastic integral and a usual integral: \[ X(t) = X(0) + \int_0^tG(s)\,\mathrm{d} s + \int_0^tH(s)\,\mathrm{d} W(s), \] we can rewrite this as \(\mathrm{d} X(t) = G(t)\,\mathrm{d} t + H(t)\,\mathrm{d} W(t)\). Therefore, \[ \mathrm{d}\langle X\rangle_t = (\mathrm{d} X(t))^2 = G^2(t)(\mathrm{d} t)^2 + 2G(t)H(t)\mathrm{d} t\mathrm{d} W(t) + H^2(t)\,\mathrm{d} t. \] The term \((\mathrm{d} t)^2\) should be neglected in such calculations, as explained before. The term \(\mathrm{d} t\mathrm{d} W(t)\) is of order \((\mathrm{d} t)^{3/2}\), because \(\mathrm{d} W(t) = W(t + \mathrm{d} t) - W(t) \sim \mathcal N(0, \mathrm{d} t)\) is of order \((\mathrm{d} t)^{1/2}\). So it should also be neglected. At the end, we get: \(\mathrm{d}\langle X\rangle_t = H^2(t)\,\mathrm{d} t\).
The processes \(X\) and \(\int_0^{\cdot}H(s)\,\mathrm{d} W(s)\) have the same quadratic variation: adding a usual (non-stochastic) integral does not change the quadratic variation. Taking expectations in the display above, we get \[ \mathbb E (X(t) - X(0)) = \int_0^t\mathbb E G(s)\,\mathrm{d} s. \] In particular, if \(\mathbb E X(t) = \mathbb E X(0)\) for all \(t \ge 0\), then \(\mathbb E G(s) = 0\) for all \(s \ge 0\).

18.5. Itô's formula

This is the main formula of stochastic calculus. Take a stochastic process \(X = (X(t),\, t \ge 0)\) with quadratic variation \(\langle X\rangle_t\), and a twice continuously differentiable function \(f : \mathbb R \to \mathbb R\). Apply \(f\) to \(X(t)\) to get \(Y(t) = f(X(t))\). Then \[ \mathrm{d} Y(t) = f'(X(t))\,\mathrm{d} X(t) + \frac12f''(X(t))\,\mathrm{d} \langle X\rangle_t. \] Let us explain this. Use the Taylor expansion \[ f(y) - f(x) \approx f'(x)(y-x) + \frac12f''(x)(y-x)^2\ \mbox{for}\ y \approx x. \] Then \[ \mathrm{d} Y(t) = Y(t + \mathrm{d} t) - Y(t) = f(X(t + \mathrm{d} t)) - f(X(t)) = f'(X(t))(X(t + \mathrm{d} t) - X(t)) + \frac12f''(X(t))\left(X(t + \mathrm{d} t) - X(t)\right)^2 \] \[ = f'(X(t))\,\mathrm{d} X(t) + \frac12f''(X(t))(\mathrm{d} X(t))^2 = f'(X(t))\,\mathrm{d} X(t) + \frac12f''(X(t))\mathrm{d}\langle X\rangle_t. \]

Example 18.4.

Let \(f(x) = x^2\) and \(X(t) = W(t)\). Then \(f'(x) = 2x\) and \(f''(x) = 2\). Also, \(\langle W\rangle_t = t\). Therefore, applying Itô's formula, we get: \[ \mathrm{d} Y(t) = \mathrm{d} W^2(t) = 2W(t)\,\mathrm{d} W(t) + \mathrm{d} t. \] We can write this as (because \(W(0) = 0\)): \[ W^2(t) = 2\int_0^tW(s)\,\mathrm{d} W(s) + t\ \Rightarrow\ \int_0^tW(s)\,\mathrm{d} W(s) = \frac{W^2(t) - t}2. \] For smooth functions \(f\) with \(f(0) = 0\), we have: \[ \int_0^tf(s)\,\mathrm{d} f(s) = \int_0^tf(s)f'(s)\,\mathrm{d} s = \frac{f^2(t)}2. \] The additional term \(-t/2\) is due to the difference between ordinary and stochastic calculus. We can immediately calculate the quadratic variation of \(W^2(t)\): since \(\mathrm{d} W^2(t) = 2W(t)\,\mathrm{d} W(t) + \mathrm{d} t\), only the stochastic term contributes, and \[ \langle W^2\rangle_t = \int_0^t4W^2(s)\,\mathrm{d} s. \] For a function \(f(t, x)\), Itô's formula takes the form \[ \mathrm{d} f(t, X(t)) = \frac{\partial f}{\partial t}(t, X(t))\,\mathrm{d} t + \frac{\partial f}{\partial x}(t, X(t))\,\mathrm{d} X(t) + \frac12\frac{\partial^2f}{\partial x^2}(t, X(t))\mathrm{d}\langle X\rangle_t. \]
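The Itô correction \(-t/2\) shows up clearly in simulation. A minimal sketch (ours): the left-endpoint sums for \(\int_0^t W\,\mathrm{d} W\) match \((W^2(t) - t)/2\), not the ordinary-calculus guess \(W^2(t)/2\).

```python
import numpy as np

rng = np.random.default_rng(8)
N, t = 1_000_000, 1.0
dW = rng.standard_normal(N) * np.sqrt(t / N)
W = np.concatenate([[0.0], np.cumsum(dW)])   # W on the grid, W(0) = 0

ito_sum = np.sum(W[:-1] * dW)                # left-endpoint (Ito) sums
print(ito_sum, (W[-1]**2 - t) / 2)           # these two agree
print(W[-1]**2 / 2)                          # the ordinary-calculus guess differs
```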

Example 18.5.

Find the quadratic variation of the process \(tW(t)\). Apply Itô's formula to \(f(t, x) = tx\) and \(X(t) = W(t)\), with \(\langle W\rangle_t = t\). We have: \[ \frac{\partial f}{\partial t} = x,\ \frac{\partial f}{\partial x} = t,\ \frac{\partial^2f}{\partial x^2} = 0. \] \[ \mathrm{d} (tW(t)) = W(t)\,\mathrm{d} t + t\,\mathrm{d} W(t)\ \Rightarrow\ \langle tW\rangle_t = \int_0^ts^2\,\mathrm{d} s = \frac{t^3}3. \]

Let us introduce a geometric Brownian motion with drift \(g\) and diffusion \(\sigma^2\), starting from \(x\): \[ X(t) = xe^{gt + \sigma W(t)}. \]

Example 18.6.

Apply Itô's formula to the geometric Brownian motion \(X(t)\), starting from \(x = 3\), with parameters \(g = 2\) and \(\sigma = 3\), and to the function \(f(t, x) = x^2t\). We have: \[ X(t) = 3e^{2t + 3W(t)} = e^{Y(t)},\ Y(t) := \ln 3 + 2t + 3W(t). \] We solve this problem in three steps:

  1. Find \(\mathrm{d} X(t)\) and \(\langle X\rangle_t\). To this end, apply Itô's formula to \(g(x) := e^x\) and \(Y(t)\). Then \(g'(x) = g''(x) = e^x\), and \(g'(Y(t)) = g''(Y(t)) = X(t)\). Also, \(\mathrm{d} Y(t) = 2\mathrm{d} t + 3\mathrm{d} W(t)\), so \(\langle Y\rangle_t = 9t\). Therefore, \[ \mathrm{d} X(t) = \mathrm{d} g(Y(t)) = X(t)\,\mathrm{d} Y(t) + \frac12X(t)\mathrm{d} \langle Y\rangle_t = X(t)\left(2\mathrm{d} t + 3\mathrm{d} W(t) + \frac92\mathrm{d} t\right) = \] \[ X(t)\left(\frac{13}2\mathrm{d} t + 3\mathrm{d} W(t)\right); \] \[ \langle X\rangle_t = \int_0^t(9X^2(s))\,\mathrm{d} s\ \Rightarrow\ \mathrm{d}\langle X\rangle_t = 9X^2(t)\,\mathrm{d} t. \]

  2. Calculate \[ \frac{\partial f}{\partial t} = x^2,\ \frac{\partial f}{\partial x} = 2tx,\ \frac{\partial^2f}{\partial x^2} = 2t. \]

  3. Apply Itô's formula. We have: \[ \mathrm{d} f(t, X(t)) = X^2(t)\,\mathrm{d} t + 2tX(t)\mathrm{d} X(t) + t\,\mathrm{d}\langle X\rangle_t \] \[ = X^2(t)\,\mathrm{d} t + 13tX^2(t)\,\mathrm{d} t + 6tX^2(t)\,\mathrm{d} W(t) + 9tX^2(t)\,\mathrm{d} t \] \[ = (1 + 22t)X^2(t)\,\mathrm{d} t + 6tX^2(t)\,\mathrm{d} W(t). \]

18.6. Hedging a European derivative

Consider a European derivative \(f(S(T))\), where \[ S(t) = S_0e^{gt + \sigma W(t)} \] is the Black-Scholes model of a stock price, \(T\) is the maturity, and \(f(x)\) is a real-valued function. Assume at time \(t\) we construct a portfolio: we split our wealth \(V(t)\) between \(H(t)\) shares of the stock and \(V(t) - H(t)S(t)\) in cash. Then the change in this wealth during the time interval \([t, t + \mathrm{d} t]\) is equal to \[ H(t)(S(t + \mathrm{d} t) - S(t)) = H(t)\,\mathrm{d} S(t). \] Thus \(\mathrm{d} V(t) = H(t)\,\mathrm{d} S(t)\). We need \(V(T) = f(S(T))\): at time \(T\), our wealth should exactly match the derivative. Then \(V(0) = v\) is the fair price at time \(t = 0\), and \(V(t)\) is the fair price at time \(t\). The function \(H(t)\) gives our hedging strategy, or replicating portfolio. Let us find \(V(t)\) and \(H(t)\). As discussed in the previous section, the fair price at time \(t\), given \(S(t) = x\), is \[ u(t, x) = \mathbb E(f(S(T))\mid S(t) = x),\, S(t) = S_0\exp(\sigma W(t) - \sigma^2t/2);\quad u(t, S(t)) = V(t). \] Note that \(\mathbb E u(t, S(t)) = \mathbb E f(S(T))\), because the expectation of a conditional expectation equals the unconditional expectation. Therefore, \(\mathbb E u(t, S(t))\) does not depend on \(t\): it is constant. Now decompose \(u(t, S(t))\) according to Itô's formula: \[ \mathrm{d} u(t, S(t)) = \frac{\partial u}{\partial t}(t, S(t))\,\mathrm{d} t + \frac{\partial u}{\partial x}(t, S(t))\,\mathrm{d} S(t) + \frac12\frac{\partial^2u}{\partial x^2}(t, S(t))\,\mathrm{d} \langle S\rangle_t. \] To find \(\mathrm{d} S(t)\) and \(\mathrm{d}\langle S\rangle_t\), apply Itô's formula with \(\varphi(t, x) = S_0e^{\sigma x - \sigma^2t/2}\) to \(x = W(t)\) (we write \(\varphi\) to avoid confusion with the payoff \(f\)). Then we have: \[ \frac{\partial \varphi}{\partial t} = -\frac{\sigma^2}2\varphi,\, \frac{\partial \varphi}{\partial x} = \sigma \varphi,\, \frac{\partial^2 \varphi}{\partial x^2} = \sigma^2\varphi. \] \[ \mathrm{d} S(t) = \mathrm{d} \varphi(t, W(t)) = \left[\frac{\partial \varphi}{\partial t}(t, W(t)) + \frac12\frac{\partial^2\varphi}{\partial x^2}(t, W(t))\right]\,\mathrm{d} t + \frac{\partial \varphi}{\partial x}(t, W(t))\,\mathrm{d} W(t) \] \[ = \sigma \varphi(t, W(t))\,\mathrm{d} W(t) = \sigma S(t)\,\mathrm{d} W(t). \] Therefore, \(\mathrm{d}\langle S\rangle_t = \sigma^2S^2(t)\,\mathrm{d} t\). Thus we can rewrite Itô's formula as \[ \mathrm{d} u(t, S(t)) = \left[\frac{\partial u}{\partial t}(t, S(t)) + \frac12\sigma^2S^2(t) \frac{\partial^2u}{\partial x^2}(t, S(t))\right]\,\mathrm{d} t + \frac{\partial u}{\partial x}(t, S(t))\cdot \sigma S(t)\,\mathrm{d} W(t). \] But \(u(t, S(t)) = V(t)\) must satisfy \(\mathrm{d} V(t) = H(t)\,\mathrm{d} S(t)\). Therefore the coefficient of \(\mathrm{d} t\) is equal to \(0\): \[ \frac{\partial u}{\partial t} + \frac12\sigma^2x^2\frac{\partial^2u}{\partial x^2} = 0. \] This is called the Black-Scholes equation. And \(H(t) = \frac{\partial u}{\partial x}(t, S(t))\). This is called delta hedging: the derivatives of \(u(t, x)\) with respect to the current price \(x\), the volatility \(\sigma\), and the current time \(t\) are called delta, vega, and theta, respectively. They measure the sensitivity of the option price to the parameters of the model. Collectively, they are called the greeks. The delta gives the number of shares we need to hold at time \(t\) in our replicating portfolio.

Example 18.7.

\(f(x) = x^2\), \(\sigma = T = 1\). Then \(S(t) = e^{W(t) - t/2}\), and the fair price is given by \[ V(t) = u(t, S(t)) = \mathbb E\left[\left(e^{W(1) - 1/2}\right)^2\mid S(t)\right] = \mathbb E\left[e^{2W(1) - 1}\mid W(t)\right] = e^{2W(t) - 1}\mathbb E\,e^{2(W(1) - W(t))}. \] In the last derivation, we used that \(W(1) = W(t) + (W(1) - W(t))\), and that \(W(1) - W(t)\) is independent of \(W(t)\). Continuing, since \(W(1) - W(t) \sim \mathcal N(0, 1-t)\), the moment generating function gives: \[ V(t) = e^{2W(t) - 1}e^{2^2(1-t)/2} = e^{2W(t) - 1}e^{2(1-t)} = e^{2W(t)+1 - 2t}. \] Now express this in terms of \(S(t)\): \(W(t) = t/2 + \ln S(t)\), and thus \[ V(t) = \exp\left(2(t/2 + \ln S(t)) + 1 - 2t\right) = \exp\left(2\ln S(t) + 1 - t\right) = S^2(t)e^{1-t}. \] Therefore, the fair price function is \(u(t, x) = x^2e^{1-t}\), and delta hedging gives: \[ \frac{\partial u}{\partial x} = 2xe^{1-t},\, H(t) = \frac{\partial u}{\partial x}(t, S(t)) = 2S(t)e^{1-t}. \] For example, if \(S(0.5) = 1.4\), then at time \(t = 0.5\) we need to hold \(2\cdot 1.4\cdot e^{1 - 0.5} \approx 4.616\) shares of this stock.
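As a symbolic check (ours, assuming sympy), \(u(t, x) = x^2e^{1-t}\) indeed solves the Black-Scholes equation \(u_t + \frac12\sigma^2x^2u_{xx} = 0\) with \(\sigma = 1\), matches the terminal payoff \(u(1, x) = x^2\), and has delta \(2xe^{1-t}\).

```python
import sympy as sp

t, x = sp.symbols('t x', positive=True)
u = x**2 * sp.exp(1 - t)

# Black-Scholes equation with sigma = 1 and zero interest rate:
print(sp.simplify(sp.diff(u, t) + sp.Rational(1, 2) * x**2 * sp.diff(u, x, 2)))  # 0
print(u.subs(t, 1))     # terminal payoff: x**2
print(sp.diff(u, x))    # delta: 2*x*exp(1 - t)
```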

19. Stochastic Differential Equations

19.1. Definition

Take two functions \(g, \sigma : \mathbb R \to \mathbb R\). An equation \[ \mathrm{d} X(t) = g(X(t))\,\mathrm{d} t + \sigma(X(t))\,\mathrm{d} W(t) \] is called a stochastic differential equation (SDE) with drift \(g\) and diffusion \(\sigma\). If we impose the initial condition \(X(0) = x_0\), then this equation (under some conditions) has a unique solution. One can rewrite the SDE together with this initial condition in the form \[ X(t) = x_0 + \int_0^tg(X(s))\,\mathrm{d} s + \int_0^t\sigma(X(s))\,\mathrm{d} W(s). \]

Example 19.1.

If \(g(x) \equiv g\) and \(\sigma(x) \equiv \sigma\) are constant functions, then the equation takes the form \[ X(t) = x_0 + \int_0^tg\,\mathrm{d} s + \int_0^t\sigma\,\mathrm{d} W(s) = x_0 + gt + \sigma W(t). \] This is a Brownian motion with drift \(g\) and diffusion \(\sigma\).
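For SDEs without an explicit solution, the simplest numerical scheme replaces \(\mathrm{d} t\) and \(\mathrm{d} W\) by small increments. Below is a minimal Euler-Maruyama sketch (our own; the function and parameter names are not from these notes).

```python
import numpy as np

def euler_maruyama(g, sigma, x0, T, N, rng):
    """Simulate dX = g(X) dt + sigma(X) dW on [0, T] with N steps."""
    dt = T / N
    X = np.empty(N + 1)
    X[0] = x0
    for k in range(N):
        dW = np.sqrt(dt) * rng.standard_normal()
        X[k + 1] = X[k] + g(X[k]) * dt + sigma(X[k]) * dW
    return X

rng = np.random.default_rng(9)
# Constant coefficients reproduce Example 19.1: Brownian motion with drift.
path = euler_maruyama(lambda x: 2.0, lambda x: 0.5, 0.0, 1.0, 1000, rng)
print(path[-1])   # one sample of X(1) ~ N(2, 0.25)
```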

19.2. Geometric Brownian motion

Let \(g(x) = g_0x\) and \(\sigma(x) = \sigma_0 x\) for some constants \(g_0\) and \(\sigma_0\). Then the SDE takes the form \[ \mathrm{d} X(t) = X(t)\left[g_0\,\mathrm{d} t + \sigma_0\,\mathrm{d} W(t)\right]. \] Let us show this is a geometric Brownian motion. From this SDE, we get: \[ \mathrm{d}\langle X\rangle_t = \sigma_0^2X^2(t)\,\mathrm{d} t. \] Apply Itô's formula to this SDE with \(f(x) = \log x\). Then \[ f'(x) = \frac1x,\ f''(x) = -\frac1{x^2}, \] \[ \mathrm{d}\log X(t) = \mathrm{d} f(X(t)) = f'(X(t))\,\mathrm{d} X(t) + \frac12f''(X(t))\,\mathrm{d}\langle X\rangle_t \] \[ = \frac1{X(t)}\cdot X(t)\left[g_0\,\mathrm{d} t + \sigma_0\,\mathrm{d} W(t)\right] - \frac1{2X^2(t)}\sigma_0^2X^2(t)\,\mathrm{d} t = \mu\,\mathrm{d} t + \sigma_0\mathrm{d} W(t), \] where \(\mu = g_0 - \sigma_0^2/2\) is the drift. Therefore, \(\log X(t) = \log X(0) + \mu t + \sigma_0 W(t)\), which implies \(X(t) = X(0)e^{\mu t + \sigma_0W(t)}\).

Example 19.2.

Let \(\mathrm{d} X(t) = 4X(t)\,\mathrm{d} t + 0.5X(t)\,\mathrm{d} W(t)\), and \(X(0) = 3\). Then \(g_0 = 4\), \(\sigma_0 = 0.5\), and \(\mu = \frac{31}{8}\). Therefore, \[ X(t) = 3e^{\frac{31}8t + 0.5W(t)}. \] Recall the moment generating function for normal random variables: \[ Z \sim \mathcal N(m, \rho^2)\ \Rightarrow\ \mathbb E e^{uZ} = \exp\left(mu + \frac12\rho^2u^2\right). \] \[ Z = \frac{31}8t + 0.5W(t)\ \Rightarrow\ m = \mathbb E Z = \frac{31}8t, \] \[ \rho^2 = \mathrm{Var} Z = \mathrm{Var}(0.5W(t)) = 0.5^2\mathrm{Var} W(t) = \frac t4. \] Plugging in this \(Z\) and \(u = 1\), we get after calculation: \(\mathbb E X(t) = 3\mathbb E e^Z = 3e^{4t}\). Similarly, plugging \(u = 2\) into the equation with \(Z\), we can find the second moment: \[ \mathbb E X^2(t) = 3^2\mathbb E e^{2Z} = 9\exp\left(\frac{33}4t\right). \]
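Sketch of a Monte Carlo check of these moments at \(t = 1\) (ours), sampling the explicit solution directly:

```python
import numpy as np

rng = np.random.default_rng(10)
t = 1.0
W = np.sqrt(t) * rng.standard_normal(2_000_000)   # W(t) ~ N(0, t)
X = 3 * np.exp(31 / 8 * t + 0.5 * W)

print(X.mean(), 3 * np.exp(4 * t))                # both ~ 163.8
print((X**2).mean(), 9 * np.exp(33 / 4 * t))      # both ~ 3.44e4
```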

19.3. Ornstein-Uhlenbeck process

Assume \(g(x) = c(m - x)\) and \(\sigma(x) = \sigma\) for \(c, \sigma > 0\). Then the SDE takes the form \[ \mathrm{d} X(t) = c(m - X(t))\,\mathrm{d} t + \sigma\,\mathrm{d} W(t). \] This process has the mean-reversion property: if \(X(t) > m\), then \(X\) tends to move, on average, down toward \(m\); if \(X(t) < m\), then \(X\) tends to move, on average, up toward \(m\). Therefore, this process tends to oscillate around \(m\). Compare this with Brownian motion, which simply diffuses (as a random walk) to infinity, without any limits, or with geometric Brownian motion, which does the same on the logarithmic scale.
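Mean reversion is easy to see in an Euler simulation. A sketch (ours; the parameters \(c = 2\), \(m = 1\), \(\sigma = 0.5\) and the starting point are arbitrary): the long-run sample mean is close to \(m\), and the long-run variance is close to \(\sigma^2/(2c)\), the stationary variance mentioned in Section 19.5 below.

```python
import numpy as np

rng = np.random.default_rng(11)
c, m, sigma = 2.0, 1.0, 0.5
T, N = 200.0, 400_000
dt = T / N
X = np.empty(N + 1)
X[0] = 5.0                       # start far above the long-run mean m
for k in range(N):
    X[k + 1] = X[k] + c * (m - X[k]) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()

tail = X[N // 2:]                # discard the transient part of the path
print(tail.mean())               # ~ m = 1
print(tail.var())                # roughly sigma^2 / (2c) = 0.0625
```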

This equation is an example of a linear SDE: one whose drift \(g(x)\) and diffusion \(\sigma(x)\) are linear functions of \(x\).

19.4. Linear equations

An SDE is called linear if the drift and diffusion coefficients are linear functions. For such SDEs, we can find the mean and variance. As an example, consider the process \[ \mathrm{d} X(t) = (X(t) - 2)\,\mathrm{d} t + 2X(t)\,\mathrm{d} W(t),\ X(0) = -3. \] Find \(m(t) := \mathbb E X(t)\). We have: \(m(0) = -3\), and \[ X(t) = -3 + \int_0^t(X(s) - 2)\,\mathrm{d} s + 2\int_0^tX(s)\,\mathrm{d} W(s). \] Taking the expectation, note that the stochastic integral has expectation zero, and \[ m(t) = -3 + \int_0^t(m(s) - 2)\,\mathrm{d} s\ \Rightarrow\ m'(t) = m(t) - 2,\ m(0) = -3. \] Let us solve this linear ODE using an integrating factor. Since the homogeneous equation \(m' = m\) has solutions \(m(t) = Ce^t\), the integrating factor is \(e^{-t}\), and so \[ m' - m = -2\ \Rightarrow\ e^{-t}m' - e^{-t}m = -2e^{-t}\ \Rightarrow\ (e^{-t}m)' = -2e^{-t} \] \[ e^{-t}m = \int(-2e^{-t})\,\mathrm{d} t\ \Rightarrow e^{-t}m = 2e^{-t} + C\ \Rightarrow\ \boxed{m = 2 + Ce^t} \] Because \(m(0) = -3\), we find \(2 + C = -3\), so \(C = -5\). Thus, \[ m(t) = 2 - 5e^t. \] Now, apply Itô's formula with \(f(x) = x^2\), \(f'(x) = 2x\), \(f''(x) = 2\): \[ \mathrm{d} X^2(t) = f'(X(t))\,\mathrm{d} X(t) + \frac12f''(X(t))\,\mathrm{d}\langle X\rangle_t = 2X(t)\,\mathrm{d} X(t) + \mathrm{d}\langle X\rangle_t. \] But \(\mathrm{d}\langle X\rangle_t = 4X^2(t)\,\mathrm{d} t\). Therefore, \[ \mathrm{d} X^2(t) = 2X(t)\left[(X(t) - 2)\,\mathrm{d} t + 2X(t)\,\mathrm{d} W(t)\right] + 4X^2(t)\,\mathrm{d} t \] \[ = \left[6X^2(t) - 4X(t)\right]\,\mathrm{d} t + 4X^2(t)\,\mathrm{d} W(t). \] We can rewrite this as \[ X^2(t) = X^2(0) + \int_0^t\left[6X^2(s) - 4X(s)\right]\,\mathrm{d} s + \int_0^t4X^2(s)\,\mathrm{d} W(s). \] Take the expectation and use that \(X(0) = -3\): \[ \mathbb E X^2(t) = (-3)^2 + \int_0^t\mathbb E\left[6X^2(s) - 4X(s)\right]\,\mathrm{d} s. \] Denote \(a(t) := \mathbb E X^2(t)\). Then \[ a(t) = 9 + \int_0^t\left[6a(s) - 4m(s)\right]\,\mathrm{d} s. \] We can rewrite this as \[ a'(t) = 6a(t) - 4m(t) = 6a(t) - 4(2 - 5e^t),\ a(0) = 9. \] It remains to solve this linear ODE by the same integrating factor method; a numerical check is sketched below.
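A numeric sketch (ours, assuming scipy): integrate the two moment ODEs \(m' = m - 2\) and \(a' = 6a - 4m\), and compare \(m(1)\) with the closed form \(2 - 5e^t\).

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    m, a = y                         # m(t) = E X(t), a(t) = E X^2(t)
    return [m - 2, 6 * a - 4 * m]

sol = solve_ivp(rhs, (0, 1), [-3, 9], rtol=1e-10, atol=1e-12)
m1, a1 = sol.y[:, -1]
print(m1, 2 - 5 * np.e)              # both ~ -11.59
print(a1)                            # E X^2(1)
```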

19.5. General remarks

The solution of an SDE is a continuous-time Markov process with state space \(\mathbb R\), the whole real line, unlike the Markov chains of earlier sections, whose state space is a finite or countable set such as \(\{1, 2, 3\}\). This Markov process has a transition density \(p(t, x, y)\): the density of \(X(t)\) at point \(y\), given that it started from \(X(0) = x\). We already know this transition density for Brownian motion. For a general SDE, the transition density solves two PDEs, called the forward and backward Kolmogorov equations.

Sometimes the solution of an SDE has a stationary distribution. For example, Brownian motion and geometric Brownian motion do not have one, but the Ornstein-Uhlenbeck process does, and it is normal: \[ \mathcal N(m, \rho^2),\ \mbox{with}\ \rho^2 := \frac{\sigma^2}{2c}. \] The rate of convergence to the stationary distribution is known for the Ornstein-Uhlenbeck process, but not for a general SDE.

20. Continuous-Time Martingales

The content of this section largely parallels Section 13, adapted to continuous time. We have allowed some repetition for the reader's convenience.

20.1. Definition and examples

A process \(M = (M(t),\, t \ge 0)\) is called a martingale if for every \(0 \le s \le t\), \[ \mathbb E\left[M(t)\mid M(u),\, 0 \le u \le s\right] = M(s)\ \Leftrightarrow\ \mathbb E\left[M(t) - M(s)\mid M(u),\, 0 \le u \le s\right] = 0. \] If we have \(\ge\) or \(\le\) instead of equality, then \(M\) is called a submartingale or supermartingale, respectively. This equation means that the best prediction of the future value at time \(t\) is the current value at time \(s\). In particular, for a martingale \(M\), we have \(\mathbb E M(t) = \mathbb E M(0)\).

Example 20.1.

A Brownian motion \(W = (W(t),\, t \ge 0)\) is a martingale: Since \(W(t) - W(s)\) is independent of \(W(u),\, 0 \le u \le s\), and \(W(t) - W(s) \sim \mathcal N(0, t-s)\), we have: \[ \mathbb E\left[W(t)\mid W(u),\, 0 \le u \le s\right] = \mathbb E\left[W(t) - W(s)\mid W(u),\, 0 \le u \le s\right] + \mathbb E\left[W(s)\mid W(u),\, 0 \le u \le s\right] \] \[ = \mathbb E\left[W(t) - W(s)\right] + W(s) = W(s). \]

Example 20.2.

Compensated Poisson process \(M(t) = N(t) - \lambda t\), where \(N(t)\) is a Poisson process with intensity \(\lambda\). Then \(N(t) - N(s) \sim \mathrm{Poi}(\lambda(t-s))\) is independent of \(N(u),\, 0 \le u \le s\), and we can write \(M(t) - M(s) = N(t) - N(s) - \lambda(t-s)\). Thus, \[ \mathbb E \left[M(t) - M(s)\mid M(u),\, 0 \le u \le s\right] = \mathbb E\left[N(t) - N(s)\mid N(u),\, 0 \le u \le s\right] - \lambda(t-s) = \mathbb E\left[N(t) - N(s)\right] - \lambda(t-s) = 0. \]

Note that in both of these examples, we used the property that \(M(t) - M(s)\) is independent of \(M(u),\, 0 \le u \le s\). Thus every Lévy process \(L(t)\) with \(\mathbb E\left[L(t) - L(s)\right] = 0\) is a martingale. If \(\mathbb E\left[L(t) - L(s)\right] \ge 0\), then \(L\) is a submartingale.

Example 20.3.

Consider a Lévy process \(L(t) = 1 + 2t + 2W(t) + \sum_{k=1}^{N(t)}Z_k\), with \(N\) a Poisson process with intensity \(\lambda = 3\). Then \(\mathbb E\left[L(t) - L(s)\right] = \left(2 + 3\,\mathbb E Z_1\right)(t-s)\), so it is a martingale if and only if \(\mathbb E Z_1 = -\frac23\).

We can consider a geometric Brownian motion \(X(t) = X(0)\exp\left(gt + \sigma W(t)\right)\). When is it a martingale? \[ X(t) = X(s)\exp\left(g(t-s) + \sigma(W(t) - W(s))\right),\quad s < t. \] And \(W(t) - W(s)\) is independent of \(X(u),\, 0 \le u \le s\). Therefore, we can write \[ \mathbb E\left[X(t)\mid X(u),\, 0 \le u \le s\right] = X(s)\mathbb E\exp\left(g(t-s) + \sigma(W(t) - W(s))\right) = X(s)\exp\left((t-s)(g + \sigma^2/2)\right). \] Indeed, \(W(t) - W(s) \sim \mathcal N(0, t-s)\) and therefore \(\mathbb E\exp(\sigma(W(t) - W(s))) = \exp(\sigma^2(t-s)/2)\). Thus, this geometric Brownian motion is a martingale if and only if \(g + \sigma^2/2 = 0\). More generally, take a Lévy process \(L = (L(t),\, t \ge 0)\). Then \((e^{L(t)},\, t \ge 0)\) is a martingale if and only if \[ \mathbb E e^{L(t) - L(s)} = 1\quad \mbox{for every}\ 0 \le s < t. \]

Example 20.4.

Consider \(L(t) = W(t) + ct + \sum_{k=1}^{N(t)}Z_k\) with \(Z_k \sim \mathcal N(0, 1)\) and \(N\) having intensity \(2\). Then we can find the moment generating function, as in Section 17. Because \(N(t) - N(s) \sim \mathrm{Poi}(2(t-s))\), and each \(Z_k\) has moment generating function \(e^{u^2/2}\), the compound Poisson part satisfies \(\mathbb E\exp\left(\sum_{k=N(s)+1}^{N(t)}Z_k\right) = \exp\left(2(t-s)\left(e^{1/2} - 1\right)\right)\). Therefore, \[ \mathbb E\left[e^{L(t) - L(s)}\right] = e^{c(t-s)}\mathbb E\left[e^{W(t) - W(s)}\right]\mathbb E\left[\exp\left(\sum_{k=N(s)+1}^{N(t)}Z_k\right)\right] \] \[ = e^{c(t-s)}\exp\left(\frac12(t-s)\right) \exp\left(2(t-s)\left(e^{1/2} - 1\right)\right) = \exp\left[\left(c + \frac12 + 2\left(e^{1/2} - 1\right)\right)(t-s)\right]. \] Therefore, we need \(c = -\frac12 - 2\left(e^{1/2} - 1\right)\) for \(e^L\) to be a martingale.

For every random variable \(\xi\) and every process \(X = (X(t),\, t \ge 0)\), the following conditional expectation is a martingale: \[ M(t) = \mathbb E\left[\xi\mid X(u),\, 0 \le u \le t\right]. \] Indeed, \(M(t)\) is the best prediction of \(\xi\) given the information available at time \(t\), and the best prediction at an earlier time \(s\) of this prediction is just the best prediction of \(\xi\) at time \(s\). In particular, take a stock price \(S = (S(t),\, t \ge 0)\). This remark applies to the fair price process \(V(t)\) of a European derivative \(f(S(T))\) with maturity \(T\). Indeed, this fair price, as shown in Section 17, is \[ V(t) = \mathbb E_*\left[f(S(T))\mid S(u),\, 0 \le u \le t\right], \] where \(\mathbb E_*\) is the expectation with respect to the risk-neutral probability, under which \(S\) is a martingale.

Finally, let us mention that a stochastic integral with respect to Brownian motion is a martingale: \[ M(t) = \int_0^tX(s)\,\mathrm{d} W(s),\, t \ge 0, \] provided \(X(t)\) depends only on \(W(u),\, 0 \le u \le t\) (and suitable integrability holds). Indeed, fix \(0 \le s < t\) and split the interval \([s, t]\) into many small subintervals: \(s = u_0 < u_1 < \ldots < u_N = t\). We can express the difference \[ M(t) - M(s) = \int_s^tX(u)\,\mathrm{d} W(u) \approx \sum\limits_{k=0}^{N-1}X(u_k)\left(W(u_{k+1}) - W(u_k)\right). \] If \(X\) is deterministic (non-random), then each summand on the right-hand side is a multiple of an increment of the Brownian motion, which is independent of \(W(u),\, 0 \le u \le s\), and has expectation zero. Therefore, its conditional expectation given \(M(u),\, 0 \le u \le s\), is zero. For random \(X\), this can also be shown, although it is not as easy.

20.2. Optional stopping

For every continuous-time martingale \(M = (M(t),\, t \ge 0)\) and every \(\epsilon > 0\), the sequence \((M(\epsilon n),\, n = 0, 1, 2, \ldots)\) is a discrete-time martingale, and we can approximate a continuous-time martingale by such discrete-time martingales. Therefore, the optional stopping theorem from Section 13 extends to continuous-time martingales.

We say that \(\tau\) is a stopping time with respect to a process \(X = (X(t),\, t \ge 0)\), if for every \(t \ge 0\), the event \(\{\tau \le t\}\) depends only on \(X(u),\, 0 \le u \le t\). One can think about this as time to sell the stock, based only on its observed prices so far.

Example 20.5.

Stopping times \(\tau\): \[ \inf\{t \ge 0\mid X(t) \le 0\}, \quad \inf\{t \ge 0\mid X(t) \in [0, 1]\}, \] \[ \min(\inf\{t \ge 0\mid X(t) = 0\}, 3),\quad \inf\{t \ge 1\mid X(t-1) \ge 0\} \]

Example 20.6.

These are not stopping times: \[ \tau = \inf\{t \ge 0\mid X(t+1) \ge 0\},\, \quad \tau = \sup\{t \ge 0\mid X(t) = 0\}. \]

If \(M(t)\) is a martingale, and \(\tau\) is bounded, or \(M(t),\, 0 \le t \le \tau\), is bounded, then \(\mathbb E M(\tau) = \mathbb E M(0)\). The boundedness condition is essential: we can weaken it, but not remove it entirely. Indeed, consider a Brownian motion \(W\) starting from zero. Eventually it hits \(1\), so we let \(\tau = \inf\{t \ge 0\mid W(t) = 1\}\). Then \(W(\tau) = 1\) but \(W(0) = 0\). This does not contradict the optional stopping theorem, however, since \(\tau\) is unbounded (we may wait a very long time until we hit \(1\); in fact, \(\mathbb E\tau = \infty\)), and \((W(t),\, 0 \le t \le \tau)\) is also unbounded (the Brownian motion can go arbitrarily far below zero before it hits the level \(1\)).

This corresponds to the doubling strategy already discussed in Section 13: keep playing until you end up with a cumulative win. But this is only feasible if you are prepared to wait an unlimited time (unfeasible) and to borrow arbitrarily much (also unfeasible). If you decide to wait only until time \(T = 100\), then you get \(\tau' := \min(\tau, 100)\), which is bounded. Then we have \(\mathbb E W(\tau') = \mathbb E W(0) = 0\). Or if you decide to borrow at most \(100\), then you get \(\tau'' := \inf\{t \ge 0\mid W(t) \ge 1\ \mbox{or}\ W(t) \le -100\}\). Then \(W(t)\) is bounded for \(t \le \tau''\), and \(\mathbb E W(\tau'') = \mathbb E W(0) = 0\).
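A Monte Carlo sketch (ours) of the last claim: with the stopping rule capped at a finite horizon, so that the stopping time stays bounded, the sample mean of \(W(\tau)\) is near zero, even though the uncapped rule "wait until \(W\) hits \(1\)" would give mean \(1\).

```python
import numpy as np

rng = np.random.default_rng(12)
n_paths, N, T = 2000, 100_000, 1000.0
dt = T / N
vals = np.empty(n_paths)
for i in range(n_paths):
    W = np.cumsum(np.sqrt(dt) * rng.standard_normal(N))
    hit = np.nonzero((W >= 1) | (W <= -100))[0]
    vals[i] = W[hit[0]] if hit.size else W[-1]   # stop at the hit, or at time T
print(vals.mean())   # ~ 0 up to Monte Carlo noise, by optional stopping
```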

20.3. Transformation of martingales

Take a martingale \(M = (M(t),\, t \ge 0)\) and apply a convex function \(g\) to it. By Jensen's inequality from Section 13, applied to conditional expectations, we have: \[ \mathbb E\left[g(M(t))\mid M(u),\, 0 \le u \le s\right] \ge g\left(\mathbb E\left[M(t)\mid M(u),\, 0 \le u \le s\right]\right) = g(M(s)). \] Therefore, \((g(M(t)),\, t \ge 0)\) is a submartingale.

Let us apply this to option pricing. We already discussed European options and other European derivatives in Sections 13, 17, and 18. Recall that a call option is the right to buy a stock at a certain strike \(K\). A European call option has a maturity time \(T\), when you can exercise this option: demand to buy the stock at price \(K\). If the market price \(S(T)\) of the stock at time \(T\) is less than \(K\), then you can simply buy the stock at the market price and forget about your option; in this case, the option has no value. However, if the market price \(S(T) \ge K\), then you should exercise the option, and its value is \(S(T) - K\). In general, the value of this option is \((S(T) - K)_+ = g(S(T))\), where \(g(x) = (x - K)_+\), \(a_+ := \max(a, 0)\).

An American call option differs from a European one in the following way: the former can be exercised at any time up to maturity \(T\), while the latter must be exercised at maturity. Therefore, let \(\tau\) be the time you decide to exercise your American call option, chosen to maximize the expected value \(\mathbb E g\left(S(\tau)\right)\). When is the best exercise time? The time \(\tau\) must be a stopping time, since your decision to exercise at time \(t\) or not can be based only on your observations up to time \(t\), not on future observations. The function \(g\) is convex (draw the graph and check). Frequently, the stock price \(S\) is modeled by a martingale. Then \((g(S(t)),\, t \ge 0)\) is a submartingale. By the optional stopping theorem, \[ \mathbb E g\left(S(\tau)\right) \le \mathbb E g\left(S(T)\right). \] Therefore, the best time to exercise your American call option is at maturity \(t = T\). In fact, American and European call options have the same value in this case: the additional freedom to choose the exercise time does not give you anything.

20.4. Doob’s inequalities

These are generalizations of Markov’s and Chebyshev’s inequalities. Take a nonnegative submartingale \(X = (X(t),\, t \ge 0)\) and a number \(\lambda > 0\). Then \[ \mathbb{P}\left(\max\limits_{0 \le t \le T}X(t) \ge \lambda\right) \le \frac{\mathbb{E} X(T)}{\lambda}. \] This is proved by approximating the continuous-time submartingale \((X(t),\, 0 \le t \le T)\) by discrete-time submartingales \((X(\epsilon n),\ n = 0, 1, \ldots)\) for small \(\epsilon > 0\). In particular, applying a convex function \(f(x) = x^2\) to a martingale \(M = (M(t),\, t \ge 0)\), we get a submartingale \(M^2(t)\). Therefore, we get Kolmogorov’s inequality: \[ \mathbb{P}\left(\max\limits_{0 \le t \le T}|M(t)| \ge \lambda\right) = \mathbb{P}\left(\max\limits_{0 \le t \le T}M^2(t) \ge \lambda^2\right) \le \frac{\mathbb E M^2(T)}{\lambda^2}. \]

Example 20.7.

For a geometric Brownian motion \(M(t) = e^{W(t) - t/2}\), which is a martingale, apply Kolmogorov's inequality: \[ \mathbb{P}\left(\max\limits_{0 \le t \le 4}M(t) \ge 20\right) \le \frac{1}{20^2}\mathbb E\left[e^{W(4) - 2}\right]^2 = \frac1{400}\mathbb E\left[e^{2W(4) - 4}\right] \] \[ = \frac1{400}\exp\left(4\cdot 4/2 - 4\right) = \boxed{0.136} \]
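For comparison (a sketch, ours): the probability itself is much smaller than Doob's bound, as a simulation shows.

```python
import numpy as np

rng = np.random.default_rng(13)
n_paths, N, T = 10_000, 1000, 4.0
dt = T / N
W = np.cumsum(np.sqrt(dt) * rng.standard_normal((n_paths, N)), axis=1)
t = dt * np.arange(1, N + 1)
M = np.exp(W - t / 2).max(axis=1)    # running max of the martingale on a grid
print(np.mean(M >= 20))              # ~ 0.02, well below the bound 0.136
```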

Example 20.8.

In the previous example, we can instead apply the function \(f(x) = x^3\), which is convex on \(x > 0\) (since \(f''(x) = 6x > 0\)). Thus \[ \mathbb{P}\left(\max\limits_{0 \le t \le 4}M(t) \ge 20\right) \le \frac{1}{20^3}\mathbb E\left[e^{W(4) - 2}\right]^3 = \frac1{8000}\mathbb E\left[e^{3W(4) - 6}\right] \] \[ = \frac1{8000}\exp\left(9\cdot 4/2 - 6\right) > 1, \] so this bound is useless.