0RL - Introduction
Supervised, Unsupervised and Reinforcement Learning:
1. Supervised Learning
- Goal: Learn a mapping from inputs (features) to outputs (labels).
- Data: Labeled data — each input has a corresponding target output.
- Feedback: Direct feedback is provided during training through the loss function.
- Example Problems: Classification (e.g., spam detection), regression (e.g., predicting house prices).
- Algorithms: Linear Regression, Decision Trees, Support Vector Machines, Neural Networks.
Example:
If you have a dataset of images of cats and dogs labeled as "cat" or "dog," a supervised learning model learns to classify new, unseen images as either a cat or a dog.
2. Unsupervised Learning
- Goal: Discover underlying patterns, structures, or distributions in data without explicit labels.
- Data: Unlabeled data — only input data is available, no target output.
- Feedback: No direct feedback; evaluation is typically subjective or indirect.
- Example Problems: Clustering (e.g., customer segmentation), dimensionality reduction (e.g., PCA), anomaly detection.
- Algorithms: K-means, DBSCAN, Autoencoders, Gaussian Mixture Models.
Example:
Given a set of news articles without labels, an unsupervised learning algorithm can group them into clusters of similar topics.
3. Reinforcement Learning (RL)
- Goal: Learn to make a sequence of decisions that maximize a cumulative reward.
- Data: The agent interacts with an environment, receiving feedback through rewards or penalties.
- Feedback: Delayed, sparse, or intermittent — feedback is not always immediate or direct.
- Example Problems: Game playing (e.g., AlphaGo), robotic control, autonomous driving.
- Algorithms: Q-learning, Deep Q-Networks (DQN), Policy Gradients, Actor-Critic methods.
Example:
A robot learning to navigate a maze receives positive rewards for moving closer to the exit and negative rewards for hitting walls. Over time, it learns a strategy (policy) to reach the exit efficiently.
| Feature | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Data | Labeled | Unlabeled | Interactive (states, actions, rewards) |
| Goal | Predict outputs | Discover patterns | Maximize cumulative reward |
| Feedback | Direct | Indirect (latent) | Sparse and delayed |
| Example Tasks | Classification, Regression | Clustering, Dim. Reduction | Game AI, Robotics |
| Typical Algorithms | SVM, Neural Networks | K-means, PCA | Q-learning, DQN, PPO |
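To make the labeled vs. unlabeled distinction above concrete, here is a minimal sketch using scikit-learn; the toy data, the choice of a decision tree, and K-means are arbitrary and purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised: inputs X come with target labels y, and the model learns the mapping X -> y.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = ["cat", "dog", "cat", "dog"]          # labels provide direct feedback via the loss
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0.9, 0.9]]))          # predict the label of a new, unseen input

# Unsupervised: only inputs are available; the model discovers structure (here, clusters).
X_unlabeled = [[0.1, 0.2], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
print(km.labels_)                         # cluster assignments; no ground-truth labels used
```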
The RL problem template:
![[Support/Pasted image 20250326075332.png]]
1. Agent’s Internal State
- Definition:
  The state $s^a_t$ represents the agent's internal, private representation of all the variables and knowledge that it uses to decide what to do at time $t$.
2. Action as a Function of History
- History:
  The agent maintains a history of its states and rewards:
  $$h^a_t = (s^a_1, r_1, s^a_2, r_2, \dots, s^a_t, r_t)$$
- Action:
  The action taken at time $t$ is determined by this history:
  $$a_t = f(h^a_t)$$
  This means that the agent's decision is based on all the past information it has accumulated.
3. Environment State and Its Transition
- Environment State:
  The environment has its own state $s^e_t$, which is typically not fully known to the agent.
- Transition:
  When the agent takes an action, the environment's state transitions from $s^e_t$ to a new state $s^e_{t+1}$:
  $$s^e_{t+1} = g(h^e_t)$$
  where the environment's history is (the actions of the agent affect the environment):
  $$h^e_t = (s^e_1, a_1, s^e_2, a_2, \dots, s^e_t, a_t)$$
4. Reward and Observation
- Reward:
  After the environment transitions to $s^e_{t+1}$, it provides a reward:
  $$r_{t+1} = r(h^e_t)$$
  This reward quantifies the immediate benefit (or cost) of the action taken.
- Observation:
  The environment also provides an observation $o_{t+1}$ to the agent (some part of $s^e_{t+1}$), which carries new information from the environment.
5. Updating the Agent’s Internal State
- State Update:
  The agent updates its internal state based on:
  - Its previous state $s^a_t$,
  - The action it took $a_t$,
  - The observation $o_{t+1}$ received, and
  - The reward $r_{t+1}$ provided by the environment.

  This update can be expressed as:
  $$s^a_{t+1} = u(s^a_t, a_t, o_{t+1}, r_{t+1})$$
  Here, the new state $s^a_{t+1}$ incorporates the impact of both the observation and the reward following the agent's action.
- History Update:
  The history of the agent is updated as well:
  $$h^a_{t+1} = (h^a_t, s^a_{t+1}, r_{t+1})$$
In this way there is a cycle:
- The agent acts based on its history of states and rewards.
- The history of actions and environment states, including the action that the agent just performed, determines the new state of the environment and the reward given by the environment.
- Some part of the new state of the environment is packaged as an observation and sent to the agent along with the reward.
- The agent updates its internal state based on the observation, the action it previously took, and the reward it received, along with its previous state. In some sense, the new state of the agent is at least partially shaped by the environment via the observation sent to it.
Notation always makes things look final, formal and well defined, but the above discussion is purely informal; we are only describing a rough idea of what a reinforcement learning problem looks like. The only thing to take away here is that there is an agent that acts based on its history so far, and that this action (along with the history of actions and the environment being affected by it) prompts the environment to give a reward and an observation to the agent, which the agent then combines with its history to update its state. The new state is then merged with the history.
Also note that these functions can be stochastic rather than deterministic (they might output a probability distribution instead of a single value).
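As a rough illustration of this cycle, here is a minimal Python sketch of the interaction loop. The `Environment` and `Agent` classes and all the method names are hypothetical placeholders (not from any library); they only mirror the informal acting, transitioning, and state-updating functions described above.

```python
import random

class Environment:
    """Toy, made-up environment: a 1-D corridor; reaching position 5 is rewarded."""
    def __init__(self):
        self.position = 0  # environment state (hidden from the agent)

    def step(self, action):
        # The new environment state depends on its current state and the agent's action.
        self.position += action
        reward = 1.0 if self.position == 5 else -0.1   # reward r_{t+1}
        observation = self.position                    # observation o_{t+1} (here the full state)
        return observation, reward

class Agent:
    def __init__(self):
        self.state = 0      # internal state s^a_t
        self.history = []   # history h^a_t of (state, reward) pairs

    def act(self):
        # a_t = f(h^a_t); this toy agent just picks randomly, ignoring its history.
        return random.choice([-1, +1])

    def update_state(self, action, observation, reward):
        # s^a_{t+1} = u(s^a_t, a_t, o_{t+1}, r_{t+1}); here we simply adopt the observation.
        self.state = observation
        self.history.append((self.state, reward))

env, agent = Environment(), Agent()
for t in range(20):
    action = agent.act()                               # agent acts based on its history
    observation, reward = env.step(action)             # environment transitions and responds
    agent.update_state(action, observation, reward)    # agent updates its state and history
```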
Formalizing the RL problem:
Okay, what now? How do we formalize this problem? Well, firstly, we have to get rid of storing histories, which can get very unwieldy. Next, we can slowly build up to the formalization of an RL problem instead of directly jumping into it.
1. Markov Process:
Imagine a tumbleweed floating around, simply moved by the wind. The wind doesn't say anything to the tumbleweed, and the tumbleweed doesn't think or do anything. Such is the nature of a Markov process. The agent is like a tumbleweed: completely at the mercy of the environment, unable to think or even act, while the environment simply drives the dynamics of the agent and gives it no feedback.
A Markov process is a tuple $(\mathcal{S}, \mathcal{P})$, where $\mathcal{S}$ is a finite set of states and $\mathcal{P}$ is the state transition probability matrix, with $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.
Note that a Markov process has the property that
$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, S_2, \dots, S_t],$$
i.e., the next state depends only on the current state, not on the rest of the history.
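As a small sketch of these definitions, here is one way to sample a trajectory from a made-up 3-state Markov process with NumPy; the transition matrix is arbitrary, just for illustration.

```python
import numpy as np

# Transition matrix: P[s, s2] = probability of moving from state s to state s2 (rows sum to 1).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
rng = np.random.default_rng(0)

def sample_chain(P, start_state=0, length=10):
    """Sample S_1, S_2, ...; by the Markov property, the next state depends only on the current one."""
    states = [start_state]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P))
```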
2. Markov Reward Process
Now the environment can talk to the tumbleweed: it can tell the tumbleweed whether it's doing well or badly, and the tumbleweed can think, i.e., it can evaluate things based on what the environment says to it. However, the poor tumbleweed still can't ACT.
That is, we're talking about an agent that can calculate things based on the feedback of the environment, i.e., do prediction, but it cannot act; its state is still fully dictated by the environment and natural chaos.
A Markov reward process is a tuple $(\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$ where:
- $(\mathcal{S}, \mathcal{P})$ is a Markov process.
- The state-reward $\mathcal{R}$ is a vector where $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$. That is, the reward of a state is the expectation over all possible immediately next rewards the environment gives, given that we (the agent) are currently in that state. We may write $\mathbb{E}[R_{t+1} \mid S_t = s]$ for the state reward of state $s$ at time $t$, but which time step we're in doesn't matter, so we may simply write $\mathcal{R}_s$.
- $\gamma \in [0, 1]$ is a discount factor for future rewards.

note: intuitively, we can think that the environment gives an immediate reward for every state that we (the agent) are in, but that the reward is noisy, following some probability distribution, and the agent simply calculates the expectation of this distribution for each state and stores it in $\mathcal{R}$.
In truth, the agent likes to calculate not just the expectation of the next immediate reward, but rather the expected value over all possible chains of rewards into the future, given the current state. The agent also likes to discount future rewards, so as to not get stuck in infinite loops and allow each chain of rewards to converge. This is known as the state-value function.
Consider a chain of rewards from time $t$ onwards. Its discounted sum is called the return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The state-value function maps a state $s$ to the expected return starting from that state:
$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
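As a quick numerical sketch (all reward numbers made up), the return of one sampled reward chain can be computed directly, and $v(s)$ can be approximated by averaging returns over many chains sampled from $s$:

```python
import numpy as np

gamma = 0.9

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite sampled chain of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# One sampled chain of rewards R_{t+1}, R_{t+2}, ...
print(discounted_return([1.0, 0.0, -0.5, 2.0], gamma))  # 1.0 + 0.9*0 + 0.81*(-0.5) + 0.729*2 = 2.053

# v(s) is the expectation of G_t over chains starting from s; a Monte Carlo estimate
# simply averages the returns of many sampled chains.
chains_from_s = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
v_estimate = np.mean([discounted_return(c, gamma) for c in chains_from_s])
```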
Bellman equation for the state-value function in an MRP:
Suppose we (the agent) are currently in the state $s$ at time $t$. The return decomposes into the immediate reward plus the discounted return from the next state onwards: $G_t = R_{t+1} + \gamma G_{t+1}$.
Hence,
$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} \, v(s')$$
The above equation is known as the Bellman equation for the state-value function in an MRP.
note: we can also expand $\mathbb{E}[G_{t+1} \mid S_t = s]$ formally, using linearity of expectation and the law of total expectation (conditioning on the next state $S_{t+1}$), to get the $\sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} \, v(s')$ term above.
As vectors and matrices, we write
$$v = \mathcal{R} + \gamma \mathcal{P} v$$
A note on converting to matrix form: We see that $(\mathcal{P} v)_s = \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$, so stacking the Bellman equation for every state gives one linear system, which can be solved directly as $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$ (the inverse exists for $\gamma < 1$).
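Because the matrix form is a linear system, $v$ can be computed with one linear solve. A minimal NumPy sketch on a made-up 3-state MRP:

```python
import numpy as np

# Made-up 3-state MRP for illustration.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
R = np.array([1.0, 0.0, -0.5])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)  # state values v(s), one entry per state
```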
3. Markov Decision Process
Now, finally, the environment can talk to the agent, the agent can calculate and predict, and it can also do CONTROL, which is the ability to act. In a Markov decision process, the state transition dynamics given by the environment depend not only on the state of the agent but also on the action of the agent, since the agent can ACT now.
A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ where:
- $\mathcal{S}$ is a finite set of states, and $\mathcal{A}$ is a finite set of actions.
- $\mathcal{P}$ is now a "3d-tensor like thing" where $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$.
- $\mathcal{R}$ is the reward matrix, where $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
- $\gamma \in [0, 1]$ is the discount factor.
The first thing we have to understand is that the agent now has a "policy": a distribution over the set of actions given that it is in some state. This policy is what the agent (currently) thinks is the best way to pick actions in its current state to maximize future returns.
Note that the reward is now given immediately after a state followed by an action, not just after a state.
In an MDP, the policy of the agent is a distribution over the set of actions given a state:
$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$
Given a policy, we can define the state-value function as well as the action-value function in an MDP.
For an MDP and a policy $\pi$:
- The state-value function is $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
- The action-value function is $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
Yet again, we can decompose both of these using linearity of expectation and so on, but we will stick to an intuitive approach.
Imagine we (the agent) are in a state $s$ and follow the policy $\pi$. The policy picks an action $a$ with probability $\pi(a \mid s)$, and from then on the expected return is exactly $q_\pi(s, a)$.
Therefore
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_\pi(s, a)$$
Further, if we start at the state $s$ and commit to the action $a$, we first receive the expected immediate reward $\mathcal{R}^a_s$, then land in a next state $s'$ with probability $\mathcal{P}^a_{ss'}$, from which we continue following $\pi$ and expect the return $v_\pi(s')$.
Therefore
$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\pi(s')$$
We can stitch the above two equations together to get the Bellman equations for the state-value and action-value functions:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\pi(s') \right)$$
$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a')$$
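As a sketch of how these equations get used, a fixed policy can be evaluated on a (made-up) MDP by averaging the dynamics and rewards over the policy, which reduces the problem to the MRP linear solve from before; the array shapes and numbers here are arbitrary.

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions.
# P[a, s, s2] = P(S_{t+1}=s2 | S_t=s, A_t=a);  R[s, a] = E[R_{t+1} | S_t=s, A_t=a].
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.3, 0.7]],  # dynamics under action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]],  # dynamics under action 1
])
R = np.array([[1.0, 0.0], [0.0, 0.5], [-1.0, 2.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])        # pi[s, a] = pi(a | s)

# Averaging dynamics and rewards over the policy turns the MDP into an MRP.
P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi[s, s2] = sum_a pi(a|s) * P^a_{s s2}
R_pi = (pi * R).sum(axis=1)             # R_pi[s]     = sum_a pi(a|s) * R^a_s

# Solve v_pi = R_pi + gamma * P_pi v_pi as a linear system.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# q_pi(s, a) = R^a_s + gamma * sum_{s2} P^a_{s s2} * v_pi(s2)
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)
print(v_pi, q_pi, sep="\n")
```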
Optimality of MDP:
The optimal state-value function is the maximum of all state-value functions over all policies, and the optimal action-value function is defined analogously:
$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$$
Here the maximum over functions means the pointwise maximum over all inputs.
Now define a partial order on the policies:
$$\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s) \text{ for all } s \in \mathcal{S}$$
There are theorems that show that an optimal policy $\pi_*$ exists and that it induces both the optimal state-value and action-value functions.
We can infer an optimal deterministic policy from the optimal action-value function in the following way:
$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a' \in \mathcal{A}}{\arg\max}\; q_*(s, a') \\ 0 & \text{otherwise} \end{cases}$$
Recursion to calculate optimal value functions:
Suppose we are at state $s$ and act optimally. The best we can do is pick the action with the highest optimal action-value.
That is,
$$v_*(s) = \max_{a \in \mathcal{A}} q_*(s, a)$$
Now suppose we start at the state $s$ and commit to the action $a$. We receive the expected immediate reward $\mathcal{R}^a_s$ and land in a next state $s'$ with probability $\mathcal{P}^a_{ss'}$, from where we act optimally:
$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s')$$
Stitching the two equations yet again:
$$v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \right)$$
$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a' \in \mathcal{A}} q_*(s', a')$$
Both Bellman optimality equations are non-linear (think about how $\max(a, b)$ is non-linear), unlike the ordinary Bellman equations, so there is no general closed-form solution. There are many iterative methods to solve them:
Policy iteration, value iteration, Q-learning, SARSA, etc.
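As a final sketch, here is value iteration on the same made-up MDP format used above: it turns the Bellman optimality equation into an update rule, iterates it until the values stop changing, and then reads a greedy deterministic policy off $q_*$.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000, tol=1e-8):
    """P[a, s, s2]: transition probabilities; R[s, a]: expected immediate rewards.
    Iterates v <- max_a ( R^a_s + gamma * sum_{s2} P^a_{s s2} v(s2) ) to a fixed point."""
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q[s, a] under the current v
        v_new = q.max(axis=1)                           # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)  # greedy deterministic policy: pick argmax_a q_*(s, a)
    return v, q, policy

# Made-up MDP with the same shapes as in the policy-evaluation sketch.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]],
])
R = np.array([[1.0, 0.0], [0.0, 0.5], [-1.0, 2.0]])
v_star, q_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```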