Terminologies

General:

| Term | Meaning |
| --- | --- |
| Reinforcement Learning | A branch/paradigm of machine learning concerned with how an intelligent agent should behave in a dynamic environment |
| BLEU (bilingual evaluation understudy) | An algorithm for evaluating the quality of text that has been machine-translated from one natural language to another |
| Reward Model | A model aligned with human feedback, predicting the reward of a given action/response |
| $G_{t}$ | Return (aka the future reward), the total sum of discounted rewards after time $t$: $G_{t} = {\sum}^{\infty}_{k = 0}\gamma^{k}R_{t + k + 1}$ |
| $V_{\pi}(s)$ | State-value function, the expected return from state $s$ under $\pi$: $V_{\pi}(s) = \mathbb{E}_{\pi}[G_{t}\vert S_{t} = s]$ |
| $Q_{\pi}(s,a)$ | Action-value function, the expected return of taking action $a$ in state $s$ under $\pi$: $Q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_{t} \vert S_{t} = s, A_{t} = a]$ |
| Bellman Equations | A set of equations that decompose the value function into immediate reward + discounted future values |
| $A_{q}$ | The action used to update $Q$ |
| $A_{t+1}$ | The action actually taken |

In RL Algorithms (mostly adjectives):

| Term | Meaning |
| --- | --- |
| Model-based | RL algorithms that rely on an (environment dynamics) model, which defines $P(s'\vert s,a)$ and $R(s,a)$ |
| Model-free | RL algorithms that learn directly from the agent's interaction with the environment, without an explicit environment model |
| Policy-Based (Policy Gradient) Methods | A branch of RL: model the policy directly as a probability distribution (PDF) over actions |
| Value-Based Methods | A branch of RL: assign a value to each action and derive the policy from those values |
| Current policy | The policy (actions) actually followed by the agent in an episode |
| On-policy | Using the action from the current (actually followed) policy to update $V$/$Q$ |
| Off-policy | Using an action not from the current policy to update $V$/$Q$ |

Introduction

Value-based

Dynamic Programming

We can use Dynamic Programming to iteratively update and query value functions ($V_{\pi}$), with the help of Bellman equations, when the model is fully known.
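As a minimal sketch (the tabular MDP layout `P[s][a] = [(prob, next_state, reward), ...]` and `policy[s][a]` action probabilities are assumptions for illustration), iterative policy evaluation repeatedly applies the Bellman expectation backup until the values stop changing:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a fully known model.

    P[s][a] is a list of (prob, next_state, reward) transitions,
    policy[s][a] the probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman expectation backup: immediate reward + discounted future value
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```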

Monte-Carlo

#model-free

Instead of modeling the environment, MC methods learn from complete episodes of raw experience, approximating the expected return by the observed mean return.

To learn optimally with MC, we iterate over the following steps (a code sketch follows the list):

  1. Improve the policy greedily: $\pi(s) = \underset{a}{argmax}Q(s, a)$
  2. Generate a new episode with the combination of the new policy $\pi$ and randomness (e.g. $\epsilon$-greedy), balancing between exploitation and exploration
  3. Estimate $Q$ from the newly generated episode
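A minimal every-visit MC control sketch of these steps, assuming a hypothetical `env` whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`, with discrete states and `env.n_actions` actions:

```python
import random
from collections import defaultdict

def mc_control(env, n_episodes=1000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)        # Q[(s, a)] estimates
    returns = defaultdict(list)   # observed returns per (s, a)

    for _ in range(n_episodes):
        # Steps 1-2: generate an episode with the epsilon-greedy (greedily improved) policy
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)                      # explore
            else:
                a = max(range(env.n_actions), key=lambda a: Q[(s, a)])   # exploit: argmax_a Q(s, a)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Step 3: estimate Q as the mean of observed returns
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```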

Temporal Difference methods

#model-free

[!NOTE] TD learning can learn from incomplete episodes

Bootstrapping

Estimate the remaining return from the current value estimates, rather than waiting for the full episode to play out.

Value Estimation

The estimated value function $V$ is updated towards an estimated return $R_{t+1} + \gamma V(S_{t+1})$ (the TD target).
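In code, a single TD(0) backup might look like this sketch (tabular `V`, e.g. a dict or array; names are illustrative):

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup: move V(s) towards the bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = r_next + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
```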

SARSA

#on-policy

[!TIP] Define $A_{q}$ as the action to update $Q$

State-Action-Reward-State-Action. In each step $t$:

  1. Choose $A_{t} = \underset{a \in A}{argmax}{Q(S_{t}, a)}$ with $\epsilon$-greedy
  2. Obtain $R_{t + 1}$
  3. Sample $A_{t+1} \sim \pi(\cdot|S_{t+1})$ under the current policy and use it as $A_{q}$
  4. Update $Q$ towards the on-policy target built from the actually taken $A_{t+1}$:
$$ Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha(\underbrace{R_{t+1} + \gamma Q(S_{t + 1}, A_{t + 1})}_{\text{value under current policy, on-policy}} - \underbrace{Q(S_{t},A_{t})}_{\text{current estimate}}) $$
  5. $t = t + 1$

[!NOTICE] $A_{q} == A_{t + 1}$, making it on-policy
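A sketch of one SARSA step for a tabular `Q` (the `epsilon_greedy(Q, s)` helper is assumed, not defined here); note that the bootstrap action is the one the policy will actually take, i.e. $A_{q} = A_{t+1}$:

```python
def sarsa_step(Q, s, a, r_next, s_next, epsilon_greedy, alpha=0.1, gamma=0.9):
    """One on-policy SARSA backup."""
    a_next = epsilon_greedy(Q, s_next)                # A_{t+1} sampled from the current policy
    td_target = r_next + gamma * Q[(s_next, a_next)]  # bootstrap from the action actually taken
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return a_next                                     # becomes A_t of the next step
```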

Q-Learning

#off-policy

Q-learning is an off-policy method; the steps within one episode (at step $t$, state $S_{t}$) are:

  1. Choose $A_{t} = \underset{a \in A}{argmax}Q(S_{t}, a)$ with $\epsilon$-greedy
  2. Obtain $R_{t + 1}$
  3. Sample $A_{t+1} \sim \pi(\cdot|S_{t+1})$, but set $A_{q} = \underset{a \in A}{argmax}\,Q(S_{t + 1}, a)$
  4. Update $Q$ towards the off-policy target built from the greedy action $A_{q}$:
    $$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha(\underbrace{R_{t+1} + \gamma \underset{a \in A}{\max} Q(S_{t + 1}, a)}_{\text{best value after $A_{t}$, off-policy}} - \underbrace{Q(S_{t}, A_{t})}_{\text{current estimate}}) $$
  5. $t = t + 1$

[!NOTICE] $A_{q} = \underset{a \in A}{argmax}\,Q(S_{t + 1}, a) \ne A_{t + 1}$, making it off-policy
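The corresponding Q-learning step, as a sketch with the same tabular conventions; the only change is the off-policy target, which bootstraps from the greedy action regardless of what the behaviour policy does next:

```python
def q_learning_step(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning backup."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # max_a Q(S_{t+1}, a); A_q may differ from A_{t+1}
    td_target = r_next + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```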

DQN

#off-policy

Deep Q-Network, an improvement over Q-Learning:

  • Experience Replay: All the episode steps $e_{t} = (S_{t}, A_{t}, R_{t}, S_{t+1})$ are stored in one replay memory $D_{t} = \{e_{1}, \dots, e_{t}\}$. During Q-learning updates, samples are drawn at random from the replay memory, so one sample can be used multiple times.
  • Periodically Updated Target: Q is optimized towards target values that are only periodically updated (not in every iteration). The Q network is cloned and kept frozen as the optimization target, refreshed every C steps (C is a hyperparameter).

[!WARNING] Known for overestimating value function $Q$
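A sketch of the two ingredients, with the network abstracted behind hypothetical `q_net` / `target_net` objects exposing `predict`, `train_step`, and `copy_from` (these method names are assumptions, not a real library API):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)   # each transition can be reused many times

def dqn_update(q_net, target_net, buffer, step, batch_size=32, gamma=0.99, sync_every=1000):
    for s, a, r, s_next, done in buffer.sample(batch_size):
        # Target comes from the frozen network, only refreshed every C = sync_every steps
        target = r if done else r + gamma * max(target_net.predict(s_next))
        q_net.train_step(s, a, target)
    if step % sync_every == 0:
        target_net.copy_from(q_net)                 # periodic hard update of the frozen target
```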

Policy Gradient

$$ J(\theta) = \sum\limits_{s \in S} d^{\pi}(s)V^{\pi}(s) = \sum\limits_{s \in S} d^{\pi}(s) \sum\limits_{a \in A} \pi_{\theta}(a|s)Q^{\pi}(s,a) $$
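The policy gradient theorem turns this into $\nabla_{\theta}J(\theta) \propto \mathbb{E}_{\pi}[Q^{\pi}(s,a)\,\nabla_{\theta}\ln\pi_{\theta}(a|s)]$. A minimal REINFORCE-style sketch, using the Monte-Carlo return $G_{t}$ in place of $Q^{\pi}$ and a tabular softmax policy (`theta[s]` holds per-state action preferences; all names are illustrative):

```python
import numpy as np

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()           # stabilized softmax over action preferences
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (s, a, r); the return G_t estimates Q^pi(s_t, a_t)."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs                    # grad of log softmax w.r.t. theta[s] ...
        grad_log_pi[a] += 1.0                   # ... is one_hot(a) - probs
        theta[s] += alpha * G * grad_log_pi     # gradient ascent on J(theta)
    return theta
```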

Actor-Critic

Actor-Critic learns the value function in addition to the policy, assisting the policy update.

It consists of two models:

  • Actor updates the policy parameters $\theta$ of $\pi_\theta(a|s)$, in the direction suggested by the critic
  • Critic updates the value estimation function, $Q_{w}(s,a)$ or $V_{w}(s)$

The main process, for $t = 1, \dots, T$:

  1. Sample $a \sim \pi_{\theta}(a|s)$, $r_{t} \sim R(s,a)$, $s' \sim P(s'|s,a)$, and the next action $a' \sim \pi_{\theta}(a'|s')$
  2. Update the Actor parameters $\theta$ to maximize the reward:
$$ \theta \leftarrow \theta + \alpha_{\theta} Q_{w}(s,a)\nabla_{\theta} \ln \pi_{\theta}(a|s) $$
  3. Compute the correction (TD error, which measures the quality of the current policy's choice $a'$):
$$ \delta_{t} = \underbrace{r_{t} + \gamma Q_{w}(s', a')}_{\text{target: action-value of } a'} - \underbrace{Q_{w}(s,a)}_{\text{current estimate}} $$
  4. Update the Critic: $w \leftarrow w + \alpha_{w}\delta_{t}\nabla_{w}Q_{w}(s,a)$ to reduce the estimation error (ideally $\delta_{t} \to 0$, as $a' \sim \pi_{\theta}(a'|s')$)
  5. Set $a \leftarrow a'$, $s \leftarrow s'$
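A sketch of one pass through this loop with tabular parameters (`theta[s]` are softmax preferences, `Q[(s, a)]` the critic table); `env.step(a)` returning `(next_state, reward, done)` is an assumed interface:

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def actor_critic_step(theta, Q, s, a, env, alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    s_next, r, done = env.step(a)                                             # 1. sample r, s'
    a_next = np.random.choice(len(theta[s_next]), p=softmax(theta[s_next]))   # 1. a' ~ pi_theta(.|s')

    # 2. actor update: theta <- theta + alpha_theta * Q_w(s, a) * grad log pi_theta(a|s)
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * Q[(s, a)] * grad_log_pi

    # 3. TD error and 4. critic update (tabular, so grad_w Q_w(s, a) is just 1 at that entry)
    td_error = r + gamma * (0.0 if done else Q[(s_next, a_next)]) - Q[(s, a)]
    Q[(s, a)] += alpha_w * td_error

    return s_next, a_next, done                                               # 5. s <- s', a <- a'
```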

[!TIP] Resembles adversarial training in a GAN: (actor, critic) ≈ (generator, discriminator)

A2C

| Model | Meaning |
| --- | --- |
| Actor $\pi_{\theta}$ | The target model |
| Critic | Estimates $V$ |

$$ L(\theta) = -\log \pi_{\theta}(a_{t}|s_{t})\hat A_{t} $$

where:

  • $\hat A_{t}$: the advantage function, how much better $a_{t}$ is than average, normally with $V$ as the baseline: $\hat A_{t} = Q(s_{t},a_{t}) - V(s_{t})$

[!WARNING] This objective function can lead to massive change to policy
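A per-step sketch of this loss with $\hat A_{t} = G_{t} - V(s_{t})$ as the baseline-corrected advantage (inputs are assumed to come from the actor's log-probability and the critic's value estimate):

```python
def a2c_policy_loss(log_prob_a, value, return_to_go):
    """-log pi_theta(a_t|s_t) * A_hat_t, with A_hat_t = G_t - V(s_t)."""
    advantage = return_to_go - value   # treated as a constant w.r.t. the policy parameters
    return -log_prob_a * advantage
```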

A3C

Asynchronous Advantage Actor-Critic focuses on parallel training. Multiple actors are trained in parallel and get synced with global parameters.

DPG

#model-free #off-policy

Deterministic Policy Gradient models the policy as a deterministic function $a = \mu(s)$.

It is trained by maximizing the objective function: the expected discounted reward:

$$ J(\theta) = \int_{S}\rho^{\mu}(s)Q(s, \mu_{\theta}(s))ds $$

where:

  • $\rho^{\mu}(s)$: the discounted state (visitation) distribution
  • $\mu$: the deterministic action predictor
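By the deterministic policy gradient theorem, the gradient of this objective chains through the actor and then through the critic at $a = \mu_{\theta}(s)$:

$$ \nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q(s, a)\big\vert_{a = \mu_{\theta}(s)}\right] $$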

DDPG

#model-free #off-policy

Deep Deterministic Policy Gradient combines DQN (experience replay, frozen target networks) and DPG.

Key design:

  • Better exploration: $\mu'(s) = \mu_{\theta}(s) + \mathcal{N}$, adding noise $\mathcal{N}$ to the policy
  • Soft updates: exponential moving average of the target network parameters $\theta$
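The soft update in code, as a minimal sketch over a dict of plain numpy parameter arrays (the layout is an assumption for illustration):

```python
def soft_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target (exponential moving average)."""
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]
```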

TD3

Twin Delayed Deep Deterministic policy gradient applies three tricks to DDPG to prevent overestimating the value function:

  1. Clipped Double Q-learning: Action selection and Q-value estimation are made by two networks separately.
  2. Delayed update of the target and policy networks: Instead of updating the actor and critic together every iteration, TD3 updates the actor (and the target networks) at a lower frequency than the critic, waiting for the value estimate to stabilize. This helps reduce variance.
  3. Target policy smoothing: Introduce a smoothing regularization by adding clipped noise $\epsilon \sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c , +c)$ to the target action, i.e. using $Q_{w}(s', \mu_{\theta}(s') + \epsilon)$. This mitigates the risk of the deterministic policy overfitting sharp peaks of the value function.
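A sketch of how tricks 1 and 3 combine when building the TD3 critic target (`q1_target`, `q2_target`, `mu_target` are hypothetical callables for the frozen twin critics and the target actor):

```python
import numpy as np

def td3_target(r, s_next, done, q1_target, q2_target, mu_target,
               gamma=0.99, sigma=0.2, c=0.5):
    # 3. target policy smoothing: clipped Gaussian noise on the target action
    noise = np.clip(np.random.normal(0.0, sigma), -c, c)
    a_next = mu_target(s_next) + noise
    # 1. clipped double Q-learning: bootstrap from the smaller of the two critics
    q_next = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```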

SAC

Soft Actor-Critic learns a more random policy by incorporating the entropy of the policy $H(\pi)$ into the reward.

Three key components:

  • An actor-critic architecture
  • An off-policy approach
  • Entropy Maximization to encourage exploration

The policy is trained by maximizing the objective function, the expected return plus the entropy: $$ J(\theta) = \sum\limits_{t = 1}^{T}\mathbb{E}_{(s_{t},a_{t}) \sim \rho_{\pi_{\theta}}} [r(s_{t},a_{t}) + \alpha \mathcal{H}(\pi_{\theta}(\cdot | s_{t}))] $$
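The same entropy bonus shows up in the soft value used as the critic's bootstrap target (the standard soft-value relation in SAC):

$$ V(s_{t}) = \mathbb{E}_{a_{t} \sim \pi_{\theta}}\left[Q(s_{t}, a_{t}) - \alpha \log \pi_{\theta}(a_{t} | s_{t})\right] $$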

PPO (Proximal Policy Optimization)

#on-policy

[!TIP]

  • clipped objective
  • Proximal refers to keeping the new policy close (proximal) to the old/reference policy, via clipping and the KL penalty

As a successor of A2C, PPO defines 2 more models:

| Model | Meaning |
| --- | --- |
| Reward $r_{\theta}$ | Calculates the reward $R$ |
| Reference $\pi_{ref}$ | Applies constraint and guidance to the Actor |
| $r^{\ast}$ | Ground-truth reward function |
| $r_{\phi}$ | MLE of $r^{\ast}$ |

$$ L(\theta) = \underbrace{-\hat A_{t} \cdot \min(r_{t}(\theta),\ \operatorname{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon))}_{\text{A2C loss, ratio clipped to } [1 - \epsilon,\ 1 + \epsilon]} - \underbrace{\beta D_{KL}(\pi_{\theta}(y|x)\,||\,\pi_{ref}(y|x))}_{\text{penalty for drifting too far from the reference response}} $$

where:

  • $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}$: the ratio of new policy to old policy
  • $\epsilon$: normally 0.1 or 0.2
  • Generate two outputs from the same input $x$: $y_{1}, y_{2}$
    • Objective: $\mathcal{L} = \underset{\pi_{\theta}}{\max}\mathbb{E}[r_{\theta}(x,y_{2})]$
      • Update:
        • Optimize with the reward of the current batch
        • TRO (Trust Region Optimization): constrain the update (e.g. via gradient/ratio clipping) so that it doesn't destabilize the learning process
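A sketch of the clipped term for a single action/token, written in the standard form $-\min(r_{t}\hat A_{t},\ \operatorname{clip}(r_{t}, 1-\epsilon, 1+\epsilon)\hat A_{t})$ (equivalent to the formula above for positive advantages); the KL penalty would be added separately:

```python
import math

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Clipped surrogate loss for one action; inputs are log pi_new(a|s) and log pi_old(a|s)."""
    ratio = math.exp(log_prob_new - log_prob_old)              # r_t(theta)
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))    # clip(r_t, 1 - eps, 1 + eps)
    return -min(ratio * advantage, clipped * advantage)
```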

[!TIP]

  • $r$ and $\pi$ can be optimized iteratively.
  • RLHF and PPO are difficult to train.

DPO(Direct Preference Optimization)

[!NOTE] The major difference is the Direct part: optimize the policy directly from preference/reward data, rather than via $V$ or $Q$ (expected rewards) estimated with a separate reward model

Rewrite objective:

$$ \begin{align*} \pi &= \underset{\pi}{\max}(r_{\phi}(x,y) - \beta D_{KL}(\pi_{\theta}(y|x)||\pi_{ref}(y|x)))\\ &= \underset{\pi}{\max}(r_{\phi}(x,y) - \beta \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x)})\\ &= \underset{\pi}{\min}( \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x)} - r_{\phi}(x,y)/\beta)\\ &= \underset{\pi}{\min}( \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x) e^{r_{\phi}(x,y)/\beta} }) \end{align*} $$

^0639d4

Define the partition function $Z(x) = \sum_{y}{\pi_{ref}(y|x)\, e^{r_{\phi}(x,y)/\beta}}$, which normalizes $\pi_{ref}(y|x)e^{r_{\phi}(x,y)/\beta}$ into a distribution and depends only on $x$.

We can then write the optimal policy $\pi^{\ast}$ under $r_{\phi}$ (independent of $\theta$):

$$ \pi^{*}(y|x) = \pi_{ref}(y|x)e^{\frac{r_{\phi} (x,y)}{\beta}} \frac{1}{Z(x)} $$

^5ee375

Then Eq. [[#^0639d4]] becomes:

$$ \begin{align*} \pi &= \underset{\pi}{\min}\left( \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x) e^{r_{\phi}(x,y)/\beta}}\right)\\ &= \underset{\pi}{\min}\left( \log \frac{\pi_{\theta}(y|x)}{\pi^{\ast}(y|x) Z(x)}\right)\\ &= \underset{\pi}{\min}\left( \log \frac{\pi_{\theta}(y|x)}{\pi^{\ast}(y|x)}\right)\\ &= \underset{\pi}{\min}\left( D_{KL}(\pi_{\theta}(y|x) \,||\, \pi^{\ast}(y|x))\right) \end{align*} $$

Clearly, the minimum is reached when $\pi_{\theta} \to \pi^{\ast}$.

Notice that the reward function in Eq. [[#^5ee375]] can be rewritten (reparameterized) in terms of the optimal policy and the reference policy $\pi_{ref}$:

$$ r_{\phi} (x,y) = \beta \log \frac{\pi^{\ast}(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x) $$

[!TIP] The reward function can be represented with the optimal policy trained under it

By replacing $r_{\phi}(x,y)$ in the RLHF objective with its expression in terms of $\pi^{\ast}$, we get an objective function that no longer involves an explicit reward model:

$$ \begin{align} \mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}\left[\log \sigma\left(\beta \log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{ref}(y_{w}|x)} - \beta\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{ref}(y_{l}|x)}\right)\right] \end{align} $$

From this equation we can see that training the reward model in RLHF is equivalent to training $\pi_{\theta}$ directly with the derived objective.

That is to say, there is no need for 4 models: we can achieve the same goal as RLHF by directly training $\pi_{\theta}$.
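A sketch of the resulting loss for one preference pair, taking the (sequence-level) log-probabilities of the chosen and rejected responses under the policy and the reference model as inputs (variable names are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigma(beta * log-ratio(y_w) - beta * log-ratio(y_l))."""
    chosen_logratio = logp_chosen - ref_logp_chosen        # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = logp_rejected - ref_logp_rejected  # log pi_theta(y_l|x) / pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))                   # -log sigmoid(logits)
```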

Methods

RLHF

RLHF (Reinforcement Learning from Human Feedback) is a technique that trains a reward model from human feedback and then fine-tunes the policy against it.

It has following key concepts:

  • Reward Model: trained in advance directly from human feedback
  • human feedback: data collected by asking humans to rank instances of the agent’s behavior

The procedure consists of 3 steps:

1. SFT

Supervised fine-tune a pre-trained (target) model: $\pi^{SFT}$

2. Reward Modeling Phase

Train a reward model $r_{\phi}(x,y) = r$, $r \in (0, + \infty)$, where $r$ is the scalar reward for the given prompt–response pair.

  • Initialization: Often initialized from Pretrained Models

  • Data:

    • $D$: $Prompt: x \to (Generation: y, Reward: r)$, generated by humans or models
    • Human Feedback: ranking the outputs of different models under the same prompt to obtain $r$
      • effective ways of ranking: pairwise comparison / ELO
    • $(y_{w}, y_{l})$: a preferred/rejected pair sampled from the generations
  • Train the RM with the data. The objective is a negative log-likelihood loss: $$ \begin{align*} \mathcal{L}_{R}(r_{\phi}, D) = -{{\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}[\log{\sigma({r_{\phi}(x, y_{w}) - r_{\phi}(x, y_{l})}})]}} \end{align*} $$

    i.e., maximize the reward gap between the better and the worse response
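A sketch of this pairwise objective for a single comparison, given the reward model's scalar outputs for the preferred and rejected responses (names are illustrative):

```python
import math

def rm_pairwise_loss(reward_chosen, reward_rejected):
    """-log sigma(r(x, y_w) - r(x, y_l)): widens the reward gap between the pair."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))   # equals -log sigmoid(margin)
```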

3. RL Fine-Tuning Phase: train the policy $\pi_{\theta}(y|x)$

  • In the past, training an LM with RL was considered impossible.
  • One of the proposed feasible approaches is Policy Gradient RL (PGR), in particular PPO (Proximal Policy Optimization)

Conclusion