Imagine that we have a set of states $S = \{1, 2, \dots, m\}$. Then, the idea of a Markov process is that no matter where we are in the chain of transitions, no matter how many state transitions we have seen, or what path we have followed to get to this state, the next state is determined by the current state alone, with some noise. That is, if $X_n = i$, then there are probabilities $p_{ij}$ for each $j \in S$ such that $P(X_{n+1} = j \mid X_n = i) = p_{ij}$, regardless of the history before $X_n$. And such transition probabilities are present for each state $i$ of $S$.
A nice way to draw a Markov process is to draw a directed graph (which may have self loops, but no multi-edges) whose vertices are the states and whose edges represent a single-step transition from state $i$ to state $j$, with the probability weight $p_{ij}$ on that edge. (note that $p_{ij}$ can be different from $p_{ji}$)
these edge weights could be written as $p_{ij} = P(X_{n+1} = j \mid X_n = i)$ as well. We can also enforce a complete directed graph, with all self loops, and put $p_{ij} = 0$ when we can't get from state $i$ to state $j$.
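For concreteness, here is a made-up two-state example (my own, just for illustration): let state $1$ be "sunny" and state $2$ be "rainy", with $$p_{11} = 0.9, \quad p_{12} = 0.1, \quad p_{21} = 0.5, \quad p_{22} = 0.5.$$ If today is sunny, tomorrow is sunny with probability $0.9$ no matter how long the sunny streak has been; that forgetfulness is exactly the Markov property.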
Then, the question is, starting from state $i$, and taking $n$ transitions or steps, what is the probability I get to state $j$?
denote $r_{ij}(n)$ as the probability of starting at state $i$ and reaching state $j$ after $n$ steps.
Then, recursively, using the complete directed graph idea, for each state $k$ (ofc state transitions are independent of previous history), we can look at the probability of reaching $k$ from $i$ in $n-1$ steps, depicted $r_{ik}(n-1)$, and then multiply that by the probability $p_{kj}$ of reaching $j$ from $k$ in one step, and add up the branches for each $k$.
That is, $$r_{ij}(n) = \sum_{k=1}^mp_{kj} \cdot r_{ik}(n-1)$$
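Here is a minimal sketch of this recursion in code (the function name `n_step_probs` and the 0-indexed convention are my own; I assume the chain is given as a row-stochastic nested list `P` with `P[i][j]` holding $p_{ij}$):

```python
def n_step_probs(P, i, n):
    """Return the list [r_ij(n) for j in range(m)]: the distribution
    over states after n steps, starting from state i (0-indexed)."""
    m = len(P)
    r = [1.0 if j == i else 0.0 for j in range(m)]  # r_ij(0) is 1 iff j == i
    for _ in range(n):
        # r_ij(n) = sum_k p_kj * r_ik(n-1)
        r = [sum(P[k][j] * r[k] for k in range(m)) for j in range(m)]
    return r

# usage, with the sunny/rainy example chain from earlier
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(n_step_probs(P, 0, 3))  # distribution of X_3 given X_0 = sunny
```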
Note that for any state $i$, the probabilities of the outgoing edges to every state (maybe even itself) should add up to one: $\sum_{j=1}^m p_{ij} = 1$.
generic limits for $r_{ij}(n)$ as $n \to \infty$ are what we will eventually be after.
Recurrent and Transient states:
When drawing graphs of Markov chains, it is customary to omit zero-probability edges; hence in the graph of a Markov chain, there is a path between two vertices (in at least one direction) iff there is a non-zero probability of reaching one state from the other.
A graph $G = (V, E)$ is a Markov graph if $V = S$, the set of all possible states, and each edge $(i, j) \in E$ carries the weight $p_{ij} > 0$; moreover, for any vertex $i$, $\sum_{j : (i,j) \in E} p_{ij} = 1$.
define the relation $\to$ on $V$, where $i \to j$ iff there is a (directed) path from $i$ to $j$ in $G$.
Now, define the set $R$ of recurrent vertices, where $i \in R$ if and only if whenever $i \to j$ (that is, whenever $j$ is on some path from $i$), $j \to i$ as well (that is, there is a path from $j$ back to $i$).
That is, $R$ is the set of all vertices $i$ such that whenever $j$ is on some path from $i$, there is also a path from $j$ back to $i$. Of course, mutual reachability ($i \to j$ and $j \to i$) is an equivalence relation on $R$, and its equivalence classes partition these vertices into groups of states where once you get there, you stay in the same equivalence class (like two-way connected components). So once you're in any class $R_a$, there is no way to reach any other class $R_b$.
These equivalence classes are called recurrent classes, and their members are called recurrent states.
Notice that this is not a full partition of $V$: some vertices are NOT in ANY equivalence class, because they are not in $R$ at all. Such vertices/states are called transient.
Eventually, moving between transient states, we will have to escape into some recurrent state, and we can never get back; that is, we will move into some recurrent class.
if $i$ is a transient state, then in the long run $P(X_n = i) \to 0$ as $n \to \infty$.
In a general Markov chain, you have some recurrent classes of states, with all edges internal, and some transient states that can get to some vertex of some recurrent class (and by definition can never leave it).
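To see this escape into a recurrent class concretely, here is a small Monte Carlo sketch (the chain below is made up for illustration: state 0 is transient, while {1, 2} and {3} are two recurrent classes):

```python
import random

# made-up chain: state 0 is transient; {1, 2} and {3} are recurrent classes
P = [[0.0, 0.5, 0.0, 0.5],
     [0.0, 0.3, 0.7, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

def run(P, start, steps):
    state = start
    for _ in range(steps):
        state = random.choices(range(len(P)), weights=P[state])[0]
    return state

ends = [run(P, 0, 50) for _ in range(10_000)]
print("ended in {1,2}:", sum(e in (1, 2) for e in ends) / len(ends))  # ~0.5
print("ended in {3}:  ", sum(e == 3 for e in ends) / len(ends))       # ~0.5
# the transient state 0 is never visited again after the first step
```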
Periodic Markov chains:
Suppose $G$ is a Markov graph. Then $G$ is periodic if $V$ can be partitioned into $d \geq 2$ color classes $C_0, C_1, \dots, C_{d-1}$ such that an edge from (a vertex in) $C_a$ can only go to $C_{a + 1 \bmod d}$. (the period of the chain is the largest $d$ for which such a partition exists)
if we have a self loop, the chain is not periodic :) (a self loop is an edge from a class back to itself, which the coloring condition forbids)
Steady state Markov theorem:
So before we go on, let us talk about the matrix interpretation.
so if I have some states $1, \dots, m$ and I have some initial random variable $X_0$, which encapsulates the probability $P(X_0 = i)$ for each $i$, then we can write this distribution as a vector $\mathbf{v_0}$ where $\mathbf{v_0}^i = P(X_0 = i)$ and $\sum_{i=1}^m \mathbf{v_0}^i = 1$, each component being between $0$ and $1$ of course.
Now, using the Markov graph $G$, make an $m \times m$ matrix $M$ where the $(i, j)$th entry is the probability edge weight (or zero if that edge is not in $G$) of the edge from $j$ to $i$ (we are reversing the direction here).
That is, $M[i][j] = p_{ji}$ for each pair $(i, j)$.
Why do we reverse the direction? Just see below (we do it to preserve the notation of multiplying vectors on the right).
So WHAT is $M\mathbf{v_0}$? well as a computation, $$(M\mathbf{v_{0}})^i = \sum_{k=1}^m M[i][k] \mathbf{v_{0}}^k$$
Hence, the $i$th entry of $M\mathbf{v_0}$ can be seen as:
Probabalistically, we can say $$ P(X_{1} = i) = \sum_{k=1}^m p_{ki} P(X_{0} = k)$$
So the above thing looks like the total probability theorem, right? The probability that $X_1 = i$ is the sum over the disjoint events $X_0 = k$ of: given $X_0 = k$, what is the probability that $X_1 = i$? The conditional becomes multiplication because $P(X_1 = i, X_0 = k) = P(X_1 = i \mid X_0 = k)\,P(X_0 = k)$.
Therefore, if $M$ is the matrix of a Markov graph $G$, where $M[i][j] = p_{ji}$, and the initial random variable $X_0$ is represented by the distribution $\mathbf{v_0}$, then the distribution of $X_1$ is represented by $M\mathbf{v_0}$. Inductively, since Markov chains are only determined by the current state, and forget the previous states, we can treat $X_1$ as the initial state, with $\mathbf{v_1} = M\mathbf{v_0}$, to get that $X_2$ is represented by the distribution $M\mathbf{v_1} = M^2\mathbf{v_0}$.
Therefore, using the transition matrix we have that $X_n$ is represented by the distribution $M^n\mathbf{v_0}$.
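As a quick sketch with numpy (remembering our convention that $M$ is the transpose of the row-stochastic matrix of $p_{ij}$'s, so distributions are column vectors multiplied on the right):

```python
import numpy as np

# row-stochastic P with P[i][j] = p_ij (the sunny/rainy example again);
# M = P.T matches the reversed convention M[i][j] = p_ji
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
M = P.T

v0 = np.array([1.0, 0.0])                # X_0 = state 0 with certainty
vn = np.linalg.matrix_power(M, 10) @ v0  # distribution of X_10
print(vn)
```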
We sort of want to know if the distribution of the states after $n$ transitions, given by $M^n\mathbf{v_0}$, approaches a steady distribution $\pi$ when $n$ becomes very large, one that is independent of the initial choice of $\mathbf{v_0}$.
If a Markov chain has more than one recurrent class (and assume $X_0$ is deterministic, that is, we choose exactly one initial state $j$, without any uncertainty): if $X_0$ is a transient vertex, then with non-zero probability we enter one of the (at least 2) recurrent classes, after which we cannot leave it. However, if we start in the OTHER recurrent class, we can never enter this one. So the steady state distribution depends on where we start.
A Markov chain that has more than one recurrent class is called "reducible": intuitively, given a long time, the distribution settles into one (of the many) recurrent classes and never leaves, so each recurrent class can be treated as a different chain.
A Markov chain is irreducible if there is exactly one recurrent class, and no transient vertices.
So we can turn a transition matrix $M$ into an adjacency matrix $A$ by just taking the ceiling of each entry, so a non-zero probability becomes a $1$. So the adjacency matrix is $A = \lceil M \rceil$ (entrywise).
So the nice thing about $A$ is that $A[i][j]$ represents the number of length-1 paths from $j$ to $i$ (still with the reversed convention), $A^2[i][j]$ represents the number of length-2 paths, and so on. To determine reachability, it is sufficient to compute $A + A^2 + \dots + A^{m-1}$, and wherever there is a non-zero element make it one, else leave it zero (a shortest path never repeats a vertex, so it has length at most $m-1$), to get the reachability matrix $T$.
An aside:
NOTE! there are many ways to get an algorithm for the sum $A + A^2 + \dots + A^{m-1}$. We could repeatedly square and store the powers $A, A^2, A^4, \dots, A^{2^{\lfloor \log_2 m \rfloor}}$, and for any power $A^n$ that we want to calculate, we write $n$ in binary as $n = \sum_i b_i 2^i$, and whichever $b_i = 1$, we have $A^{2^i}$ calculated in our look-up table, and we have to multiply all of these together, doing one matrix multiplication per factor. So how many matrix multiplications are we doing to calculate all of $A, A^2, \dots, A^{m-1}$? Well, first we do $\lfloor \log_2 m \rfloor$ multiplications, one for each repeated squaring (roughly). And then, we have to calculate $A^n$ for each $n$ from $1$ to $m-1$ that is not a power of 2, and each such calculation takes $O(\log n)$ multiplications.
For a rough upper bound, suppose we just do each $A^n$ this way, without skipping the powers of two we have already computed.
So we do about $\sum_{n=2}^{m-1} \log_2 n$ total matrix multiplies, which is roughly $\log_2((m-1)!)$, which is roughly $m \log m$.
So we are doing $O(m \log m)$ matmuls, and we might parallelize these matmuls or whatnot. Just saying that it's doable and not intractable. And then we have to do $m - 2$ matrix additions, which we could also parallelize.
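Here is a sketch of the whole reachability computation (using the simple loop of $m-1$ multiplications rather than the repeated-squaring trick above, and binarizing the intermediate powers so entries stay 0/1 instead of counting paths):

```python
import numpy as np

def reachability(M):
    """T[i][j] = 1 iff there is a path of length 1..m-1 from j to i
    (keeping the reversed convention M[i][j] = p_ji)."""
    A = (M > 0).astype(int)                  # adjacency: "ceiling" of M
    m = A.shape[0]
    S = np.zeros_like(A)
    power = np.eye(m, dtype=int)
    for _ in range(m - 1):                   # accumulate A + A^2 + ... + A^(m-1)
        power = (power @ A > 0).astype(int)  # 0/1 pattern of the next power
        S = S + power
    return (S > 0).astype(int)               # the "make non-zero entries one" step

M = np.array([[0.9, 0.5],                    # columns sum to 1
              [0.1, 0.5]])
T = reachability(M)
print((T == T.T).all())   # symmetric: no one-way-only reachability
print((T == 1).all())     # all ones: irreducible in the strong sense below
```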
Now that we have the reachability matrix $T$ of the Markov graph $G$, then (as long as the graph is connected, i.e. we are looking at a single chain) $G$ is irreducible if and only if $T$ is symmetric. Symmetry means $j$ is reachable from $i$ if and only if $i$ is reachable from $j$, so no state is transient, and connectedness then forces a single recurrent class.
If we want, we can impose a stronger notion on irreducible Markov chains, as given here. We have said that $G$ is irreducible if there is only one recurrent class and no transient vertices. But we can go one step further and say that $G$ is irreducible if and only if ANY node $j$ is reachable from any other node $i$. That is, the reachability matrix $T$ of $G$ has a one in ALL of its entries.
So there is this process of getting $A$, then the powers $A^2, \dots, A^{m-1}$, then the sum $S = A + A^2 + \dots + A^{m-1}$, then $T = \operatorname{sign}(S)$, where $\operatorname{sign}$ gives $1$ if an entry is positive, else gives $0$. So we want $T$ to have all ones, so let us reverse engineer this. This means that $S$ has all positive elements.
Which means for every pair $(i, j)$ there exists some $1 \le n \le m-1$ such that $A^n[i][j] > 0$. So if this is violated, there exists (at least one) pair $(i, j)$ for which each of $A, A^2, \dots, A^{m-1}$ has a zero in the $(i, j)$th entry, which is easier to work with.
So we will not push on what $A$ might look like given these conditions. (at least for now)
Next, given the adjacency matrix $A$ of Markov chain $G$, we say $G$ is periodic if it is possible to partition the vertices of $G$ into $C_0, C_1, \dots, C_{d-1}$ with $d \ge 2$, such that for any two vertices $u \in C_a$ and $v \in C_b$, an edge from $u$ to $v$ can exist only if $b = a + 1$ modulo $d$ (and the period is the largest $d$ for which such a partition exists).
Now, there is no harm in assuming that $C_0 = \{1, \dots, |C_0|\}$, $C_1 = \{|C_0| + 1, \dots, |C_0| + |C_1|\}$, and so on. This is just a relabeling of the vertices of $G$, so that each color class has labels that are consecutive. If this re-labeling is given by an isomorphism (a permutation matrix $P$), then the isomorphic adjacency matrix is given by $PAP^{-1}$.
The matrix $PAP^{-1}$ is particularly cool, as we can partition it into block matrices, where the $(a, b)$ block is made by taking the rows of $C_a$ and the columns of $C_b$ and intersecting them.
Writing the matrix as blocks like this, we see that the only blocks that can have ones in them are the blocks joining consecutive color classes, and all other blocks must be made entirely of zeros. (i will put a picture for this sometime)
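In the meantime, here is a sketch of the shape for $d = 3$ (keeping our reversed convention that $A[i][j] = 1$ means an edge from $j$ to $i$, so edges leaving $C_a$ land in block-row $a + 1 \bmod 3$):

$$PAP^{-1} = \begin{pmatrix} \mathbf{0} & \mathbf{0} & B_{0,2} \\ B_{1,0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & B_{2,1} & \mathbf{0} \end{pmatrix}$$

where the block $B_{a+1 \bmod 3,\, a}$ holds the edges from $C_a$ to $C_{a+1 \bmod 3}$, and $\mathbf{0}$ denotes an all-zeros block.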
A matrix that DOES NOT admit this block structure (under any relabeling, for any $d \ge 2$) represents the adjacency of an aperiodic Markov chain.
Now to state the theorem: if $M$ is the state transition matrix of Markov chain $G$, and $A$ is the adjacency matrix of $G$ (both $M$ and $A$ are allowed to be re-written, as long as they use the same isomorphism on $V$), then a steady state distribution is a distribution $\pi$ where $M\pi = \pi$. If $A$ (and hence $M$) is irreducible and aperiodic, then there exists a steady state $\pi$ such that no matter the initial distribution $\mathbf{v_0}$, for any $\epsilon > 0$, there exists $N$ such that $\|M^n\mathbf{v_0} - \pi\| < \epsilon$ for all $n \ge N$.
As far as I know, $N$ is allowed to depend on $\mathbf{v_0}$ as well; just that no matter what $\mathbf{v_0}$ is, it eventually has to converge (maybe at a different rate than some other $\mathbf{v_0}$) to $\pi$.
A lot of the ideas we built high on drugs today are refined and tied up to become proofs here. (note that I use a transposed transition matrix, so I can right-multiply column vectors the usual way, but the paper does it the other way around)
Finally, if $G$ is a Markov chain on $m$ states, described by the transition matrix $M$, which contains probabilities as its entries, such that the columns of $M$ are normalized (they sum to one), and such that $G$ is aperiodic and irreducible, then there exists a unique steady state distribution, which we can solve for by requiring $M\pi = \pi$ where $\pi \ge 0$ componentwise and the sum of the components of $\pi$ is $1$.
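A minimal sketch of solving for $\pi$ numerically (the helper name `steady_state` is mine; I append the normalization $\sum_i \pi_i = 1$ to the singular system $(M - I)\pi = 0$ and solve by least squares):

```python
import numpy as np

def steady_state(M):
    """Solve M @ pi = pi with pi >= 0 and sum(pi) = 1
    (M's columns are assumed to sum to one)."""
    m = M.shape[0]
    A = np.vstack([M - np.eye(m),      # (M - I) pi = 0
                   np.ones((1, m))])   # sum of components = 1
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

M = np.array([[0.9, 0.5],
              [0.1, 0.5]])
print(steady_state(M))   # ~ [0.833, 0.167] for the sunny/rainy chain
```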
Given the steady state distribution $\pi$, and observing over a long, long number of transitions, the fraction of times we reach some state $i$ is equal (in the limit) to $\pi_i$.
And the fraction of times we transition from state $i$ to state $j$ is $\pi_i \, p_{ij}$.
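These two long-run fractions are easy to sanity-check by simulation; a sketch (the counters and names are my own):

```python
import random
from collections import Counter

P = [[0.9, 0.1],   # sunny/rainy chain again; pi is about (5/6, 1/6)
     [0.5, 0.5]]

steps = 100_000
state, visits, transitions = 0, Counter(), Counter()
for _ in range(steps):
    nxt = random.choices(range(len(P)), weights=P[state])[0]
    visits[state] += 1
    transitions[(state, nxt)] += 1
    state = nxt

print(visits[0] / steps)            # ~ pi_0 = 5/6 ~ 0.833
print(transitions[(0, 1)] / steps)  # ~ pi_0 * p_01 = 5/6 * 0.1 ~ 0.083
```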