0RL - Introduction

Supervised, Unsupervised, and Reinforcement Learning:

1. Supervised Learning

The model learns from labeled data: every training example comes with the correct output, and the goal is to predict outputs for new inputs.

Example:
If you have a dataset of images of cats and dogs labeled as "cat" or "dog," a supervised learning model learns to classify new, unseen images as either a cat or a dog.


2. Unsupervised Learning

The model learns from unlabeled data: there are no target outputs, and the goal is to discover structure or patterns in the data itself.

Example:
Given a set of news articles without labels, an unsupervised learning algorithm can group them into clusters of similar topics.


3. Reinforcement Learning (RL)

An agent learns by interacting with an environment: it takes actions, receives rewards, and tries to maximize cumulative reward over time.

Example:
A robot learning to navigate a maze receives positive rewards for moving closer to the exit and negative rewards for hitting walls. Over time, it learns a strategy (policy) to reach the exit efficiently.


| Feature | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Data | Labeled | Unlabeled | Interactive (states, actions, rewards) |
| Goal | Predict outputs | Discover patterns | Maximize cumulative rewards |
| Feedback | Direct | Indirect (latent) | Sparse and delayed |
| Example Tasks | Classification, Regression | Clustering, Dim. Reduction | Game AI, Robotics |
| Typical Algorithms | SVM, Neural Networks | K-means, PCA | Q-learning, DQN, PPO |

The RL problem template:

![[Support/Pasted image 20250326075332.png]]

1. Agent’s Internal State $S_t$

2. Action $A_t$ as a Function of History $H_t$

3. Environment State $S^e_t$ and Its Transition

4. Reward $R_{t+1}$ and Observation $O_{t+1}$

5. Updating the Agent’s Internal State

In this way there is a cycle: the agent acts, the environment responds with a reward and an observation, the agent updates its internal state, and then it acts again.

None of this is formal!!!

Notation always makes things look final, formal and well defined, but the above discussion is purely informal, and we are only describing a rough idea of what a reinforcement learning problem looks like. The only thing to take away here is that there is an agent that acts based on its history so far, and that this action (along with the history of actions, and the environment being affected by it) prompts the environment to give a reward and an observation to the agent. The agent then uses these, together with its history, to update its state, and the new state is merged into the history.
Also note that these functions can be stochastic rather than deterministic (they might spit out a probability distribution).
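
Purely as an illustration of this cycle (the `Environment` and `Agent` classes below are made-up toy stand-ins, not any real API), the loop looks roughly like this:

```python
import random

class Environment:
    """A toy stand-in environment: rewards the action 1, emits a random observation."""
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        observation = random.random()
        return reward, observation

class Agent:
    """A toy stand-in agent: picks actions at random and keeps a trivial internal state."""
    def initial_state(self):
        return 0.0
    def act(self, state):
        return random.choice([0, 1])                    # A_t, here just random
    def update_state(self, state, action, reward, obs):
        return state + reward                           # toy state-update rule

env, agent = Environment(), Agent()
history = []                                            # H_t: everything seen so far
state = agent.initial_state()                           # agent's internal state S_t
for t in range(10):
    action = agent.act(state)                           # A_t
    reward, obs = env.step(action)                      # R_{t+1}, O_{t+1}
    history.append((action, reward, obs))               # the history keeps growing
    state = agent.update_state(state, action, reward, obs)
```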

Formalizing the RL problem:

Okay, what now? How do we formalize this problem? Well, firstly, we have to get rid of storing histories, which can get very unwieldy. Next, we can slowly build up to the formalization of an RL problem, instead of directly jumping into it.

1. Markov Process:

Imagine a tumbleweed floating around, simply moved by the wind. The wind doesn't say anything to the tumbleweed, and the tumbleweed doesn't think or do anything. Such is the nature of a Markov process. The agent is like a tumbleweed: completely at the mercy of the environment, unable to think or even act, while the environment simply drives the agent's dynamics and gives it no feedback.

Markov process

A Markov process is a tuple $\langle S, \mathscr{P} \rangle$, where $S$ is a finite set of states and $\mathscr{P}$ is a matrix depicting the state transition dynamics, $\mathscr{P}[i,j] = \mathbb{P}[S_{t+1} = s_j \mid S_t = s_i]$. We may write the state transition dynamics as $\mathscr{P}_{ss'}$ or $\mathscr{P}_{s \to s'}$, where both mean the probability of going to state $s'$ given that we are in state $s$.

Note that a Markov process has the property that $\mathbb{P}[S_{t+1} \mid S_t, S_{t-1}, \dots, S_1] = \mathbb{P}[S_{t+1} \mid S_t]$. So the time step we are in does not matter: the probability distribution over the next state is completely determined by the current state and nothing else.
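
To see the dynamics in action, here is a minimal sketch (the states and transition probabilities are made-up toy values) of sampling a trajectory from a Markov process:

```python
# Sampling a trajectory from a toy Markov process <S, P>.
import numpy as np

states = ["sunny", "rainy", "windy"]           # S: a finite set of states
P = np.array([[0.7, 0.2, 0.1],                 # P[i, j] = P[S_{t+1} = s_j | S_t = s_i]
              [0.3, 0.5, 0.2],
              [0.4, 0.4, 0.2]])
rng = np.random.default_rng(0)

def sample_chain(start, steps):
    """Roll the 'tumbleweed' forward: the next state depends only on the current one."""
    s, chain = start, [states[start]]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])    # Markov property: only row P[s] matters
        chain.append(states[s])
    return chain

print(sample_chain(start=0, steps=5))
```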

2. Markov Reward Process

Now the environment can talk to the tumbleweed: it can tell the tumbleweed whether it's doing well or badly, and the tumbleweed can think, evaluating things based on what the environment tells it. However, the poor tumbleweed still can't ACT.

That is, we're talking about an agent that can calculate things based on the feedback of the environment, i.e. do prediction, but it cannot act; its state is still fully dictated by the environment and natural chaos.

Markov Reward Process

A Markov reward process is a tuple $\langle S, \mathscr{P}, \mathbf{R}, \gamma \rangle$, where:

  • $\langle S, \mathscr{P} \rangle$ is a Markov process.
  • The state-reward $\mathbf{R}$ is a vector where $\mathbf{R}[s_i] = \mathbb{E}[R_{t+1} \mid S_t = s_i]$. That is, the reward of a state is the expectation over all possible immediately next rewards the environment gives, given that we (the agent) are currently in that state. We may write $\mathbf{R}^t_s$ for the state reward of state $s$ at time $t$, but which time step we're in doesn't matter, so we may simply write $\mathbf{R}_s$.
  • $\gamma$ is a discount factor for future rewards.

note: intuitively, we can think that the environment gives an immediate reward for every state that we (the agent) are in, but that the reward is noisy, following some probability distribution, and the agent simply calculates the expectation of this distribution for each state and stores it in $\mathbf{R}$. But it still can't act.

In truth, the agent likes to calculate not just the expectation of the next immediate reward, but rather the expected value over all possible chains of rewards into the future, given the current state. The agent also likes to discount future rewards, so as not to get stuck in infinite loops and to allow each chain of rewards to converge. This is known as the state-value function.

Discounted return, State-value function

Consider a chain of rewards $R_{t+1}, R_{t+2}, \dots$ For this chain, the discounted return is $G_t = \sum_{j=0}^{\infty} \gamma^{j} R_{t+1+j}$, where $\gamma \in [0,1]$ is a discount factor which decides how myopic the agent is when calculating returns.

The state-value function maps a state $s$ to the expectation over all possible discounted returns given that the agent is currently in the state $s$. That is, $v(s) = \mathbb{E}[G_t \mid S_t = s]$.
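
As a tiny sketch (with a made-up, finite chain of rewards standing in for the infinite one), computing a discounted return looks like this:

```python
# Discounted return G_t = sum_j gamma^j * R_{t+1+j}, truncated to a finite toy chain.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** j) * r for j, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0, 1.0]))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```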

Bellman equation for the state-value function in an MRP.

Suppose we (the agent) are currently in the state $s$. The first reward we get is determined by which state $s'$ we visit next. Hence, if we know the value function for $s'$, we can make the following intuitive observation: with probability $\mathscr{P}_{s \to s'}$ we make a reward $\mathbf{R}_{s \to s'}$ and expect $\gamma v(s')$ additional reward after reaching state $s'$.

Hence, $v(s) = \sum_{s' \in S} \mathscr{P}_{s \to s'}(\mathbf{R}_{s \to s'} + \gamma v(s'))$. But $\sum_{s' \in S} \mathscr{P}_{s \to s'}\mathbf{R}_{s \to s'} = \mathbb{E}[R_{t+1} \mid S_t = s] = \mathbf{R}_s$. Hence, $$ v(s) = \mathbf{R}_{s} + \gamma\sum_{s'\in S}\mathscr{P}_{s\to s'}v(s')$$

The above equation is known as the Bellman equation for the state-value function in an MRP.
note: we can also expand $G_t$ in the definition of $v(s)$ and use the linearity of conditional expectation to derive the same equation, but I find the intuitive explanation more satisfying.

As vectors and matrices, we write $v = \mathbf{R} + \gamma\mathscr{P}v$, or equivalently $v = (I - \gamma\mathscr{P})^{-1}\mathbf{R}$. Matrix inversion takes $O(n^3)$ for $n$ states. We will utilize better methods.

A note on converting to matrix form: we see that $\sum_{s' \in S}\mathscr{P}_{s \to s'}v(s')$ is the dot product of the row $\mathscr{P}[s]$ with the column vector $v$, which simply stores the state-value $v(s)$ for each state $s$.
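
As a tiny sketch with toy numbers (not from these notes), the matrix-form solution can be computed with a linear solve instead of an explicit inverse:

```python
# Solve v = (I - gamma * P)^{-1} R for a toy 3-state MRP.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],     # row-stochastic transition matrix
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])      # R[s] = E[R_{t+1} | S_t = s]
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # avoids forming the inverse explicitly
print(v)
```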

3. Markov Decision Process

Now, finally, the environment can talk to the agent, the agent can calculate and predict, and it can also do CONTROL, which is the ability to act. In a Markov decision process, the state transition dynamics given by the environment depend not only on the state of the agent, but also on the action of the agent, since the agent can ACT now.

Markov Decision Process

A Markov decision process is a tuple $\langle S, A, \mathscr{P}, \mathbf{R}, \gamma \rangle$, where:

  • $S$ is a finite set of states, and $A$ is a finite set of actions.
  • $\mathscr{P}$ is now a "3d-tensor like thing" where $\mathscr{P}^a_{s \to s'} = \mathbb{P}[S_{t+1} = s' \mid A_t = a, S_t = s]$.
  • $\mathbf{R}$ is the reward matrix, where $\mathbf{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
  • $\gamma$ is the discount factor.

The first thing we have to understand is that the agent now has a "policy": a distribution over the set of actions given that it is in some state. This policy is what the agent (currently) thinks is the best way to pick actions given its current state, in order to maximize future returns.
Note that the reward is now given immediately after a state followed by an action, not just after a state.

Policy

In an MDP, the policy of the agent is a distribution over the set of actions given a state: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$. Since the time step we're in is irrelevant, we may write it as $\pi(a \mid s) = \mathbb{P}[A = a \mid S = s]$, but more importantly this means the policy is stationary: it only depends on the current state.
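
One way to picture a stationary stochastic policy for a small, finite MDP (a sketch with toy numbers; states and actions are just integer indices here):

```python
# A stationary policy stored as a matrix pi[s, a] = pi(a | s), with toy values.
import numpy as np

pi = np.array([[0.9, 0.1],    # in state 0: mostly action 0
               [0.2, 0.8],    # in state 1: mostly action 1
               [0.5, 0.5]])   # in state 2: indifferent

rng = np.random.default_rng(0)
state = 1
action = rng.choice(pi.shape[1], p=pi[state])   # sample A ~ pi(. | s)
print(action)
```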

Given a policy, we can define the state-value function, as well as the action-value function, in an MDP.

State-Value and Action-Value functions in MDP

For an MDP and a policy $\pi$, we have:

  • The state-value function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
  • The action-value function $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.

Yet again, we can decompose both of these using linearity of expectation and so on, but we will stick to an intuitive approach.
Imagine we (the agent) are in a state $s$: whenever we take an action $a$ from state $s$ we expect a return of $q_\pi(s,a)$, and we take this action with probability $\pi(a \mid s)$.
Therefore $v_\pi(s) = \sum_{a \in A}\pi(a \mid s)\,q_\pi(s,a)$.

Further, if we start at the state $s$ and take an action $a$, we will first incur an immediate reward of $\mathbf{R}^a_s$; then the environment dynamics will blow us to a state $s'$ with probability $\mathscr{P}^a_{s \to s'}$, where we expect a discounted return of $\gamma v_\pi(s')$.

Therefore $q_\pi(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_\pi(s')$.

We can stitch the above two equations together to get the Bellman expectation equations for the state-value and action-value functions:

$$v_\pi(s) = \sum_{a \in A}\pi(a \mid s)\left(\mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_\pi(s')\right)$$
$$q_\pi(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}\left(\sum_{a' \in A}\pi(a' \mid s')\,q_\pi(s',a')\right)$$
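
A minimal sketch of how these stitched equations get used in practice: iterative policy evaluation, which applies the Bellman expectation equation as an update rule until the values stop changing (the array shapes, toy MDP and stopping rule below are my own assumptions):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation equation until the values stop changing.

    P  : (S, A, S) array, P[s, a, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
    R  : (S, A) array,    R[s, a]     = E[R_{t+1} | S_t=s, A_t=a]
    pi : (S, A) array,    pi[s, a]    = pi(a | s)
    """
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("sap,p->sa", P, v)   # q(s,a) backup
        v_new = np.einsum("sa,sa->s", pi, q)           # v(s) = sum_a pi(a|s) q(s,a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Toy 2-state, 2-action MDP (arbitrary numbers) evaluated under a uniform policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi))
```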
Optimality of MDP:
Optimal value functions and policies

The optimal state-value function is the maximum, over all policies, of the state-value functions. The optimal action-value function is defined analogously.

$$v_*(s) = \max_\pi v_\pi(s)$$
$$q_*(s,a) = \max_\pi q_\pi(s,a)$$

Here, the maximum over functions means a pointwise maximum over all inputs.
Now define a partial order on the policies: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)\ \forall s \in S$. The (an) upper bound of this partial order is called the optimal policy $\pi_*$.
There are theorems that show that the optimal policy exists and induces both the optimal state-value and action-value functions.
We can infer an optimal deterministic policy from the optimal action value function in the following way:

$$\pi_*(a \mid s) := \begin{cases} 1 & \text{if } a = \arg\max_{a' \in A} q_*(s,a') \\ 0 & \text{otherwise} \end{cases}$$
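
As a small sketch (the $q_*$ table below is just made-up numbers), this extraction is a per-state argmax:

```python
# Greedy deterministic policy from an (assumed known) optimal action-value table.
import numpy as np

q_star = np.array([[1.0, 3.0],
                   [2.0, 0.5],
                   [0.0, 0.0]])

pi_star = np.argmax(q_star, axis=1)   # one greedy action index per state
print(pi_star)                        # -> [1 0 0]
```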
Recursion to calculate optimal value functions

Suppose we are at state $s$, and for each action $a \in A$ we know the optimal action-value $q_*(s,a)$. The optimal state-value function at $s$, $v_*(s)$, is the same as the maximum of the action-value function at $s$ over all actions $a$.
That is, $v_*(s) = \max_{a \in A} q_*(s,a)$.

Now suppose we start at the state $s$ and do the action $a$. First we incur an immediate reward of $\mathbf{R}^a_s$, then the environment dynamics will move us to a state $s'$ with probability $\mathscr{P}^a_{s \to s'}$, where we expect a return of $\gamma v_*(s')$. But beware!! There is no role for the policy after doing action $a$ from state $s$; it's purely the environment dynamics that blow our agent into the state $s'$. Therefore, $q_*(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_*(s')$.

Stitching the two equations yet again:

$$v_*(s) = \max_{a \in A}\left(\mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}v_*(s')\right)$$
$$q_*(s,a) = \mathbf{R}^a_s + \gamma\sum_{s' \in S}\mathscr{P}^a_{s \to s'}\left(\max_{a' \in A} q_*(s',a')\right)$$

Both Bellman optimality equations are non-linear (think about how $\max(a,b)$ is non-linear), unlike the normal Bellman equations, and there are many methods to solve them:
Policy iteration, value iteration, Q-learning, SARSA, etc.
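
Of these, value iteration is perhaps the most direct to sketch: it turns the Bellman optimality equation into an update rule and iterates it to convergence (the toy MDP, array shapes and stopping rule here are my own choices):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Apply the backup v(s) <- max_a (R[s,a] + gamma * sum_s' P[s,a,s'] v(s'))."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("sap,p->sa", P, v)   # optimality backup for q
        v_new = q.max(axis=1)                          # v(s) = max_a q(s,a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # values and a greedy policy
        v = v_new

# Same toy 2-state, 2-action MDP shapes as before (arbitrary numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, greedy = value_iteration(P, R)
print(v_star, greedy)
```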