0RL - Introduction

Supervised, Unsupervised, and Reinforcement Learning:

1. Supervised Learning

The model learns from labeled data: every training example comes with the correct output, and the goal is to predict outputs for new inputs.

Example:
If you have a dataset of images of cats and dogs labeled as "cat" or "dog," a supervised learning model learns to classify new, unseen images as either a cat or a dog.


2. Unsupervised Learning

The model learns from unlabeled data: there are no target outputs, and the goal is to discover structure or patterns in the data itself.

Example:
Given a set of news articles without labels, an unsupervised learning algorithm can group them into clusters of similar topics.


3. Reinforcement Learning (RL)

An agent learns by interacting with an environment: it takes actions, receives rewards, and tries to maximize cumulative reward over time.

Example:
A robot learning to navigate a maze receives positive rewards for moving closer to the exit and negative rewards for hitting walls. Over time, it learns a strategy (policy) to reach the exit efficiently.


| Feature | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Data | Labeled | Unlabeled | Interactive (states, actions, rewards) |
| Goal | Predict outputs | Discover patterns | Maximize cumulative rewards |
| Feedback | Direct | Indirect (latent) | Sparse and delayed |
| Example Tasks | Classification, Regression | Clustering, Dim. Reduction | Game AI, Robotics |
| Typical Algorithms | SVM, Neural Networks | K-means, PCA | Q-learning, DQN, PPO |

The RL problem template:

![[Support/Pasted image 20250326075332.png]]

1. Agent’s Internal State $S_t$

2. Action $A_t$ as a Function of History $H_t$

3. Environment State $S^e_t$ and Its Transition

4. Reward $R_{t+1}$ and Observation $O_{t+1}$

5. Updating the Agent’s Internal State

In this way there is a cycle: the agent acts, the environment responds with a reward and an observation, the agent updates its internal state, and then it acts again.

None of this is formal!!!

Notation always makes things look final, formal and well defined, but the above discussion is purely informal, and we are only describing a rough idea of what a reinforcement learning problem looks like. The only thing to take away here is that there is an agent that acts based on its history so far, and that this action (along with the history of actions, and the environment being affected by it) prompts the environment to give a reward and an observation to the agent. The agent then uses these, together with its history, to update its state, and the new state is merged into the history.
Also note that these functions can be stochastic rather than deterministic (they might spit out a probability distribution).
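
Purely as an illustration of this cycle (the `Environment` and `Agent` classes below are made-up toy stand-ins, not any real API), the loop looks roughly like this:

```python
import random

class Environment:
    """A toy stand-in environment: rewards the action 1, emits a random observation."""
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        observation = random.random()
        return reward, observation

class Agent:
    """A toy stand-in agent: picks actions at random and keeps a trivial internal state."""
    def initial_state(self):
        return 0.0
    def act(self, state):
        return random.choice([0, 1])                    # A_t, here just random
    def update_state(self, state, action, reward, obs):
        return state + reward                           # toy state-update rule

env, agent = Environment(), Agent()
history = []                                            # H_t: everything seen so far
state = agent.initial_state()                           # agent's internal state S_t
for t in range(10):
    action = agent.act(state)                           # A_t
    reward, obs = env.step(action)                      # R_{t+1}, O_{t+1}
    history.append((action, reward, obs))               # the history keeps growing
    state = agent.update_state(state, action, reward, obs)
```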

Formalizing the RL problem:

Okay, what now? How do we formalize this problem? Well, firstly, we have to get rid of storing histories, which can get very unwieldy. Next, we can slowly build up to the formalization of an RL problem, instead of directly jumping into it.

1. Markov Process:

Imagine a tumbleweed floating around, simply moved by the wind. The wind doesn't say anything to the tumbleweed, and the tumbleweed doesn't think or do anything. Such is the nature of a Markov process. The agent is like a tumbleweed: completely at the mercy of the environment, unable to think or even act, while the environment simply drives the agent's dynamics and gives it no feedback.

Markov process

A Markov process is a tuple $\langle S, \mathscr{P} \rangle$, where $S$ is a finite set of states and $\mathscr{P}$ is a matrix depicting the state transition dynamics, $\mathscr{P}[i,j] = \mathbb{P}[S_{t+1} = s_j \mid S_t = s_i]$. We may write the state transition dynamics as $\mathscr{P}_{ss'}$ or $\mathscr{P}_{s \to s'}$, where both mean the probability of going to state $s'$ given that we are in state $s$.

Note that a Markov process has the property that $\mathbb{P}[S_{t+1} \mid S_t, S_{t-1}, \dots, S_1] = \mathbb{P}[S_{t+1} \mid S_t]$. So the time step we are in does not matter: the probability distribution over the next state is completely determined by the current state and nothing else.
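
To see the dynamics in action, here is a minimal sketch (the states and transition probabilities are made-up toy values) of sampling a trajectory from a Markov process:

```python
# Sampling a trajectory from a toy Markov process <S, P>.
import numpy as np

states = ["sunny", "rainy", "windy"]           # S: a finite set of states
P = np.array([[0.7, 0.2, 0.1],                 # P[i, j] = P[S_{t+1} = s_j | S_t = s_i]
              [0.3, 0.5, 0.2],
              [0.4, 0.4, 0.2]])
rng = np.random.default_rng(0)

def sample_chain(start, steps):
    """Roll the 'tumbleweed' forward: the next state depends only on the current one."""
    s, chain = start, [states[start]]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])    # Markov property: only row P[s] matters
        chain.append(states[s])
    return chain

print(sample_chain(start=0, steps=5))
```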

2. Markov Reward Process

Now the environment can talk to the tumbleweed: it can tell the tumbleweed whether it's doing well or badly, and the tumbleweed can think, evaluating things based on what the environment tells it. However, the poor tumbleweed still can't ACT.

That is, we're talking about an agent that can calculate things based on the feedback of the environment, i.e. do prediction, but it cannot act; its state is still fully dictated by the environment and natural chaos.

Markov Reward Process

A Markov reward process is a tuple $\langle S, \mathscr{P}, \mathbf{R}, \gamma \rangle$, where:

  • $\langle S, \mathscr{P} \rangle$ is a Markov process.
  • The state-reward $\mathbf{R}$ is a vector where $\mathbf{R}[s_i] = \mathbb{E}[R_{t+1} \mid S_t = s_i]$. That is, the reward of a state is the expectation over all possible immediately next rewards the environment gives, given that we (the agent) are currently in that state. We may write $\mathbf{R}^t_s$ for the state reward of state $s$ at time $t$, but which time step we're in doesn't matter, so we may simply write $\mathbf{R}_s$.
  • $\gamma$ is a discount factor for future rewards.

note: intuitively, we can think that the environment gives an immediate reward for every state that we (the agent) are in, but that the reward is noisy, following some probability distribution, and the agent simply calculates the expectation of this distribution for each state and stores it in $\mathbf{R}$. But it still can't act.

In truth, the agent likes to calculate not just the expectation of the next immediate reward, but rather the expected value over all possible chains of rewards into the future, given the current state. The agent also likes to discount future rewards, so as not to get stuck in infinite loops and to allow each chain of rewards to converge. This is known as the state-value function.

Discounted return, State-value function

Consider a chain of rewards $R_{t+1}, R_{t+2}, \dots$ For this chain, the discounted return is $G_t = \sum_{j=0}^{\infty} \gamma^{j} R_{t+1+j}$, where $\gamma \in [0,1]$ is a discount factor which decides how myopic the agent is when calculating returns.

The state-value function maps a state $s$ to the expectation over all possible discounted returns given that the agent is currently in the state $s$. That is, $v(s) = \mathbb{E}[G_t \mid S_t = s]$.
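
As a tiny sketch (with a made-up, finite chain of rewards standing in for the infinite one), computing a discounted return looks like this:

```python
# Discounted return G_t = sum_j gamma^j * R_{t+1+j}, truncated to a finite toy chain.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** j) * r for j, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0, 1.0]))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```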

Bellman equation for the state-value function in an MRP.

Suppose we (the agent) are currently in the state $s$. The first reward we get is determined by which state $s'$ we visit next. Hence, if we know the value function for $s'$, we can make the following intuitive observation: with probability $\mathscr{P}_{s \to s'}$ we make a reward $\mathbf{R}_{s \to s'}$ and expect $\gamma v(s')$ additional reward after reaching state $s'$.

Hence, $v(s) = \sum_{s' \in S} \mathscr{P}_{s \to s'}(\mathbf{R}_{s \to s'} + \gamma v(s'))$. But $\sum_{s' \in S} \mathscr{P}_{s \to s'}\mathbf{R}_{s \to s'} = \mathbb{E}[R_{t+1} \mid S_t = s] = \mathbf{R}_s$. Hence, $$ v(s) = \mathbf{R}_{s} + \gamma\sum_{s'\in S}\mathscr{P}_{s\to s'}v(s')$$

The above equation is known as the Bellman equation for the state-value function in an MRP.
note: we can also expand $G_t$ in the definition of $v(s)$ and use the linearity of conditional expectation to derive the same equation, but I find the intuitive explanation more satisfying.

As vectors and matrices, we write $v = \mathbf{R} + \gamma\mathscr{P}v$, or equivalently $v = (I - \gamma\mathscr{P})^{-1}\mathbf{R}$. Matrix inversion takes $O(n^3)$ for $n$ states. We will utilize better methods.

A note on converting to matrix form: we see that $\sum_{s' \in S}\mathscr{P}_{s \to s'}v(s')$ is the dot product of the row $\mathscr{P}[s]$ with the column vector $v$, which simply stores the state-value $v(s)$ for each state $s$.
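
As a tiny sketch with toy numbers (not from these notes), the matrix-form solution can be computed with a linear solve instead of an explicit inverse:

```python
# Solve v = (I - gamma * P)^{-1} R for a toy 3-state MRP.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],     # row-stochastic transition matrix
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])      # R[s] = E[R_{t+1} | S_t = s]
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # avoids forming the inverse explicitly
print(v)
```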

3. Markov Decision Process

Now, finally, the environment can talk to the agent, the agent can calculate and predict, and it can also do CONTROL, which is the ability to act. In a Markov decision process, the state transition dynamics given by the environment depend not only on the state of the agent, but also on the action of the agent, since the agent can ACT now.

Markov Decision Process

A Markov decision process is a tuple $\langle S, A, \mathscr{P}, \mathbf{R}, \gamma \rangle$, where:

  • $S$ is a finite set of states, and $A$ is a finite set of actions.
  • $\mathscr{P}$ is now a "3d-tensor like thing" where $\mathscr{P}^a_{s \to s'} = \mathbb{P}[S_{t+1} = s' \mid A_t = a, S_t = s]$.
  • $\mathbf{R}$ is the reward matrix, where $\mathbf{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
  • $\gamma$ is the discount factor.

The first thing we have to understand is that the agent now has a "policy": a distribution over the set of actions given that it is in some state. This policy is what the agent (currently) thinks is the best way to pick actions given its current state, in order to maximize future returns.
Note that the reward is now given immediately after a state followed by an action, not just after a state.

Policy

In an MDP, the policy of the agent is a distribution over the set of actions given a state: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$. Since the time step we're in is irrelevant, we may write it as $\pi(a \mid s) = \mathbb{P}[A = a \mid S = s]$, but more importantly this means the policy is stationary: it only depends on the current state.
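
One way to picture a stationary stochastic policy for a small, finite MDP (a sketch with toy numbers; states and actions are just integer indices here):

```python
# A stationary policy stored as a matrix pi[s, a] = pi(a | s), with toy values.
import numpy as np

pi = np.array([[0.9, 0.1],    # in state 0: mostly action 0
               [0.2, 0.8],    # in state 1: mostly action 1
               [0.5, 0.5]])   # in state 2: indifferent

rng = np.random.default_rng(0)
state = 1
action = rng.choice(pi.shape[1], p=pi[state])   # sample A ~ pi(. | s)
print(action)
```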

Given a policy, we can define the state-value function, as well as the action-value function, in an MDP.

State-Value and Action-Value functions in MDP

For an MDP and a policy $\pi$, we have:

  • The state-value function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
  • The action-value function $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.

Yet again, we can decompose both of these using linearity of expectation and so on, but we will stick to an intuitive approach.
Imagine we (the agent) are in a state $s$: whenever we take an action $a$ from state $s$ we expect a return of $q_\pi(s,a)$, and we take this action with probability $\pi(a \mid s)$.
Therefore $v_\pi(s) = \sum_{a \in A}\pi(a \mid s)\,q_\pi(s,a)$.

Further, if we start at the state $s$ and take an action $a$, we will first incur an immediate reward of $\mathbf{R}^a_s$; then the environment dynamics will blow us to a state $s'$ with probability $\mathscr{P}^a_{s \to s'}$, where we expect a discounted return of $\gamma v_\pi(s')$.

Therefore $q_\pi(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_\pi(s')$.

We can stitch the above two equations together to get the Bellman expectation equations for the state-value and action-value functions:

$$v_\pi(s) = \sum_{a \in A}\pi(a \mid s)\left(\mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_\pi(s')\right)$$
$$q_\pi(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}\left(\sum_{a' \in A}\pi(a' \mid s')\,q_\pi(s',a')\right)$$
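
A minimal sketch of how these stitched equations get used in practice: iterative policy evaluation, which applies the Bellman expectation equation as an update rule until the values stop changing (the array shapes, toy MDP and stopping rule below are my own assumptions):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation equation until the values stop changing.

    P  : (S, A, S) array, P[s, a, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
    R  : (S, A) array,    R[s, a]     = E[R_{t+1} | S_t=s, A_t=a]
    pi : (S, A) array,    pi[s, a]    = pi(a | s)
    """
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("sap,p->sa", P, v)   # q(s,a) backup
        v_new = np.einsum("sa,sa->s", pi, q)           # v(s) = sum_a pi(a|s) q(s,a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Toy 2-state, 2-action MDP (arbitrary numbers) evaluated under a uniform policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi))
```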
Optimality of MDP:
Optimal value functions and policies

The optimal state-value function is the maximum, over all policies, of the state-value functions. The optimal action-value function is defined analogously.

$$v_*(s) = \max_\pi v_\pi(s)$$
$$q_*(s,a) = \max_\pi q_\pi(s,a)$$

Here, the maximum over functions means a pointwise maximum over all inputs.
Now define a partial order on the policies: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)\ \forall s \in S$. The (an) upper bound of this partial order is called the optimal policy $\pi_*$.
There are theorems that show that the optimal policy exists and induces both the optimal state-value and action-value functions.
We can infer an optimal deterministic policy from the optimal action value function in the following way:

$$\pi_*(a \mid s) := \begin{cases} 1 & \text{if } a = \arg\max_{a' \in A} q_*(s,a') \\ 0 & \text{otherwise} \end{cases}$$
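
As a small sketch (the $q_*$ table below is just made-up numbers), this extraction is a per-state argmax:

```python
# Greedy deterministic policy from an (assumed known) optimal action-value table.
import numpy as np

q_star = np.array([[1.0, 3.0],
                   [2.0, 0.5],
                   [0.0, 0.0]])

pi_star = np.argmax(q_star, axis=1)   # one greedy action index per state
print(pi_star)                        # -> [1 0 0]
```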
Recursion to calculate optimal value functions

Suppose we are at state $s$, and for each action $a \in A$ we know the optimal action-value $q_*(s,a)$. The optimal state-value function at $s$, $v_*(s)$, is the same as the maximum of the action-value function at $s$ over all actions $a$.
That is, $v_*(s) = \max_{a \in A} q_*(s,a)$.

Now suppose we start at the state $s$ and do the action $a$. First we incur an immediate reward of $\mathbf{R}^a_s$, then the environment dynamics will move us to a state $s'$ with probability $\mathscr{P}^a_{s \to s'}$, where we expect a return of $\gamma v_*(s')$. But beware!! There is no role for the policy after doing action $a$ from state $s$; it's purely the environment dynamics that blow our agent into the state $s'$. Therefore, $q_*(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_*(s')$.

Stitching the two equations yet again:

$$v_*(s) = \max_{a \in A}\left(\mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_*(s')\right)$$
$$q_*(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}\left(\max_{a' \in A} q_*(s',a')\right)$$

Both Bellman optimality equations are non-linear (think about how $\max(a,b)$ is non-linear), unlike the normal Bellman equations, and there are many methods to solve them:
Policy iteration, value iteration, Q-learning, SARSA, etc.
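
Of these, value iteration is perhaps the most direct to sketch: it turns the Bellman optimality equation into an update rule and iterates it to convergence (the toy MDP, array shapes and stopping rule here are my own choices):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Apply the backup v(s) <- max_a (R[s,a] + gamma * sum_s' P[s,a,s'] v(s'))."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("sap,p->sa", P, v)   # optimality backup for q
        v_new = q.max(axis=1)                          # v(s) = max_a q(s,a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # values and a greedy policy
        v = v_new

# Same toy 2-state, 2-action MDP shapes as before (arbitrary numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, greedy = value_iteration(P, R)
print(v_star, greedy)
```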