Terminology
General:
Term | Meaning |
---|---|
Reinforcement Learning | A branch/paradigm of machine learning concerned with how an intelligent agent learns to act in a dynamic environment so as to maximize cumulative reward |
BLEU (bilingual evaluation understudy) | An algorithm for evaluating the quality of text which has been machine-translated from one natural language to another |
Reward Model | A model aligned with human feedback, predicting the reward of a given action/response |
$G_{t}$ | Return (a.k.a. the future reward), the total sum of discounted rewards after time $t$: $G_{t} = {\sum}^{\infty}_{k = 0}\gamma^{k}R_{t + k + 1}$ |
$V_{\pi}(s)$ | State-value function, measures the expected return from state $s$ under $\pi$: $V_{\pi}(s) = \mathbb{E}_{\pi}[G_{t}\vert S_{t} = s]$ |
$Q_{\pi}(s,a)$ | Action-value function, measures the expected return of action $a$ under state $s$: $Q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_{t} \vert S_{t} = s, A_{t} = a]$ under $\pi$ |
Bellman Equations | A set of equations that decompose the value function into immediate reward + discounted future values |
$A_{q}$ | the action used to compute the TD target when updating $Q$ |
$A_{t+1}$ | the action actually taken at step $t+1$ |
In RL Algorithms (mostly adjectives):
Term | Meaning |
---|---|
Model-based | RL algorithms that rely on a model of the environment dynamics, which defines $P(s'\vert s,a)$ and $R(s,a)$ |
Model-free | RL algorithms that learn purely from interaction with the environment, without an explicit dynamics model |
Policy-Based (Policy Gradient) Methods | A branch of RL: directly parameterize the policy as a probability distribution over actions, $\pi_{\theta}(a\vert s)$, and optimize it |
Value-Based Methods | A branch of RL: learn a value function ($V$ or $Q$) over states/actions and derive the policy from it (e.g. greedily) |
Current policy | The policy (actions) actually followed by the agent in an episode |
On-policy | Updating $V$/$Q$ using actions drawn from the current (actually exploited/taken) policy |
Off-policy | Updating $V$/$Q$ using actions not drawn from the current policy (e.g. the greedy action) |
Introduction
Value-based
Dynamic Programming
When the model is fully known, we can use Dynamic Programming to iteratively update and query the value functions ($V_{\pi}$), with the help of the Bellman equations.
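A minimal sketch of iterative policy evaluation on a known tabular MDP, applying the Bellman expectation backup until convergence (the array layout of `P`, `R`, and `pi` is an illustrative assumption):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-6):
    """Iteratively apply the Bellman expectation backup until V converges.

    P:  (S, A, S) array, P[s, a, s'] = transition probability
    R:  (S, A) array, expected immediate reward
    pi: (S, A) array, pi[s, a] = probability of taking action a in state s
    """
    V = np.zeros(P.shape[0])
    while True:
        # V(s) = sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
        Q = R + gamma * P @ V               # (S, A) action values
        V_new = (pi * Q).sum(axis=1)        # expectation under the policy
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```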
Monte-Carlo
#model_free
Instead of modeling the environment, MC methods learn from episodes of raw experience, using the observed mean return as an approximation of the expected return.
To learn optimally with MC, we take the following steps:
- Improve the policy greedily: $\pi(s) = \underset{a}{\arg\max}\, Q(s, a)$
- Generate a new episode with the new policy $\pi$ plus some randomness (e.g. $\epsilon$-greedy), balancing exploitation and exploration
- Estimate $Q$ from the generated episode
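A minimal Monte-Carlo control sketch combining the steps above with an $\epsilon$-greedy policy (the Gym-like `env` interface and hashable states are assumptions for illustration):

```python
import random
from collections import defaultdict

def mc_control(env, n_episodes=5000, gamma=0.99, eps=0.1):
    """Every-visit Monte-Carlo control with an epsilon-greedy policy.

    `env` is assumed to expose a Gym-like API: reset() -> state,
    step(action) -> (next_state, reward, done, info), and env.action_space.n.
    """
    Q = defaultdict(lambda: [0.0] * env.action_space.n)
    counts = defaultdict(lambda: [0] * env.action_space.n)

    for _ in range(n_episodes):
        # 1. Generate an episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                action = random.randrange(env.action_space.n)
            else:
                action = max(range(env.action_space.n), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Update Q towards the observed returns (incremental mean)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            counts[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / counts[state][action]
    return Q
```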
Temporal Difference methods
#model-free
[!NOTE] TD learning can learn from incomplete episodes
Bootstrapping
Update estimates from other learned estimates (the value of the successor state), rather than waiting to carry out the full episode.
Value Estimation
The estimated value function $V$ is updated towards the estimated return (TD target) $R_{t+1} + \gamma V(S_{t+1})$.
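A minimal TD(0) update step for a tabular value function (the dict-like table `V` is an illustrative assumption):

```python
def td0_value_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) update: move V(S_t) towards the bootstrapped target R_{t+1} + gamma * V(S_{t+1}).

    V is assumed to be a dict-like table mapping states to value estimates.
    """
    td_target = reward + gamma * V[next_state]   # bootstrapped estimate of the return
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return V
```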
SARSA
#on-policy
[!TIP] Define $A_{q}$ as the action to update $Q$
State-Action-Reward-State-Action. At each step $t$:
- Choose $A_{t} = \underset{a \in A}{\arg\max}\, Q(S_{t}, a)$ with $\epsilon$-greedy exploration
- Obtain $R_{t + 1}$
- Sample $A_{t+1} \sim \pi(\cdot|S_{t+1})$ under the current ($\epsilon$-greedy) policy, and set $A_{q} = A_{t+1}$
- Update $Q$ with the advantage of the actually taken $A_{t+1}$ over the expected reward: $$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha(\underbrace{R_{t+1} + \gamma Q(S_{t + 1}, A_{t+1})}_{\text{value of the action actually taken, on-policy}} - \underbrace{Q(S_{t}, A_{t})}_{\text{expected value}})$$
- $t = t + 1$
[!NOTICE] $A_{q} == A_{t + 1}$, making it on-policy
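A minimal tabular SARSA step, in which the same $\epsilon$-greedy action $A_{t+1}$ is both executed and used in the TD target (the `(state, action)`-keyed dict `Q` is an illustrative assumption):

```python
import random

def sarsa_step(Q, state, action, reward, next_state, n_actions,
               alpha=0.1, gamma=0.99, eps=0.1):
    """One on-policy SARSA update. Q maps (state, action) -> value."""
    # Choose A_{t+1} epsilon-greedily from Q(S_{t+1}, .)
    if random.random() < eps:
        next_action = random.randrange(n_actions)
    else:
        next_action = max(range(n_actions), key=lambda a: Q.get((next_state, a), 0.0))

    # On-policy target uses the action that will actually be taken (A_q == A_{t+1})
    td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
    return next_action
```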
Q-Learning
#off-policy
Q-learning is an off-policy method, with the following steps at each step $t$ (state $S_{t}$):
- Choose $A_{t} = \underset{a \in A}{\arg\max}\, Q(S_{t}, a)$ with $\epsilon$-greedy exploration
- Obtain $R_{t + 1}$
- Sample $A_{t+1} \sim \pi(\cdot|S_{t+1})$ (behavior policy), but set $A_{q} = \underset{a \in A}{\arg\max}\, Q(S_{t + 1}, a)$ (the greedy action, used only for the update)
- Update $Q$ with the advantage of the greedy $A_{q}$ over the expected reward:
$$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha(\underbrace{R_{t+1} + \gamma \underset{a \in A}{\max} Q(S_{t + 1}, a)}_{\text{best value after $A_{t}$, off-policy}} - \underbrace{Q(S_{t}, A_{t})}_{\text{expected value}})$$
- $t = t + 1$
[!NOTICE] $A_{q} = \underset{a \in A}{\arg\max}\, Q(S_{t + 1}, a) \ne A_{t + 1}$ in general, making it off-policy
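A minimal tabular Q-learning step for comparison: the target uses the greedy $\max_{a} Q(S_{t+1}, a)$, regardless of which action the $\epsilon$-greedy behavior policy actually takes next (names are illustrative):

```python
import random

def q_learning_step(Q, state, action, reward, next_state, n_actions,
                    alpha=0.1, gamma=0.99, eps=0.1):
    """One off-policy Q-learning update. Q maps (state, action) -> value."""
    # Off-policy target: the best value achievable from S_{t+1}
    best_next = max(Q.get((next_state, a), 0.0) for a in range(n_actions))
    td_target = reward + gamma * best_next
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error

    # The action actually executed next may still be exploratory
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q.get((next_state, a), 0.0))
```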
DQN
#off-policy
Deep Q-Network, an improvement of Q-Learning:
- Experience Replay: all episode steps $e_{t} = (S_{t}, A_{t}, R_{t}, S_{t+1})$ are stored in a replay memory $D_{t} = \{e_{1}, \dots, e_{t}\}$. During Q-learning updates, samples are drawn at random from the replay memory, so one sample can be used multiple times.
- Periodically Updated Target: Q is optimized towards target values that are only periodically updated (no longer updated at every iteration). The Q network is cloned and kept frozen as the optimization target, refreshed every $C$ steps ($C$ is a hyperparameter).
[!WARNING] Known for overestimating the value function $Q$
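A minimal PyTorch sketch of the two ingredients, assuming `q_net`/`target_net` are small networks over vector observations (all names and shapes are illustrative assumptions):

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Fixed-size experience replay memory storing (s, a, r, s', done) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = (np.array(x) for x in zip(*batch))
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=64, gamma=0.99):
    """One DQN step: regress Q(s, a) onto r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s_next, done = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                      # frozen, periodically refreshed target
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps: target_net.load_state_dict(q_net.state_dict())
```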
Policy Gradient
$$ \begin{align*} J(\theta) = \underset{s \in S}{\sum\limits} d^{\pi}(s)V^{\pi}(s) = \underset{s \in S}{\sum\limits} d^{\pi}(s) \underset{a \in A}{\sum\limits} \pi_{\theta}(a|s)Q^{\pi}(s,a) \end{align*} $$
Actor-Critic
Actor-Critic learns a value function in addition to the policy; the value estimate assists the policy update.
It consists of two models:
- Actor: updates the policy parameters $\theta$ of $\pi_\theta(a|s)$, in the direction suggested by the Critic
- Critic: updates the value-estimation function, $Q_{w}(s,a)$ or $V_{w}(s)$
The main loop, for $t = 1, \dots, T$:
- Sample $a \sim \pi_{\theta}(a|s)$, $r_{t} \sim R(s,a)$, $s' \sim P(s'|s,a)$, and the next action $a' \sim \pi_{\theta}(a'|s')$
- Update the Actor $\theta$ to maximize the reward: $$ \theta \leftarrow \theta + \alpha_{\theta} Q_{w}(s,a)\nabla_{\theta} \ln \pi_{\theta}(a|s) $$
- Compute the correction (TD error, which measures the quality of the current value estimate): $$ \delta_{t} = \underbrace{r_{t} + \gamma Q_{w}(s', a')}_{\text{action-value of } a'} - \underbrace{Q_{w}(s,a)}_{\text{current estimate}} $$
- Update the Critic: $w \leftarrow w + \alpha_{w}\delta_{t}\nabla_{w}Q_{w}(s,a)$ to reduce the estimation error (ideally $\delta_{t} \to 0$, since $a' \sim \pi_{\theta}(a'|s')$)
- Set $a \leftarrow a'$, $s \leftarrow s'$
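A minimal PyTorch sketch of one actor-critic update following the loop above; the interfaces of `actor` and `critic` are illustrative assumptions, and the critic minimizes a squared TD error, which yields the same $\alpha_{w}\delta_{t}\nabla_{w}Q_{w}(s,a)$ update direction:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      s, a, r, s_next, a_next, gamma=0.99):
    """One Q actor-critic update (illustrative sketch).

    Assumptions: `actor(s)` returns a torch.distributions.Categorical over
    actions, `critic(s)` returns a vector of Q-values (one per action),
    and `a`, `a_next` are integer action indices.
    """
    q_sa = critic(s)[a]
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)[a_next]   # bootstrapped target using a'

    # Critic: reduce the TD error delta = td_target - Q_w(s, a)
    critic_loss = (td_target - q_sa) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: theta <- theta + alpha * Q_w(s, a) * grad log pi_theta(a|s)
    log_prob = actor(s).log_prob(torch.as_tensor(a))
    actor_loss = -(q_sa.detach() * log_prob)             # ascent via minimizing the negative
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```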
[!TIP] The two-model setup resembles a GAN's (generator, discriminator) pair, though the training here is cooperative rather than adversarial
A2C
Model | Meaning |
---|---|
Actor $\pi_{\theta}$ | The target model |
Critic $V_{w}$ | Estimates $V$, used to compute the advantage $\hat A_{t}$ |
$$ L(\theta) = -\log \pi_{\theta}(a_{t}|s_{t})\hat A_{t} $$
where:
- $\hat A_{t}$: the advantage function, the advantage of $a_{t}$ compared with the average, normally computed against the baseline $V$ (e.g. $\hat A_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})$)
[!WARNING] This objective function can lead to massive changes in the policy
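A minimal sketch of the A2C objective over a collected batch; the added value-regression term for the critic is an assumption of a typical implementation, not part of the equation above (names are illustrative):

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, returns):
    """A2C sketch: L(theta) = -log pi(a_t|s_t) * A_t, with A_t = G_t - V(s_t).

    log_probs, values, returns are 1-D tensors over the collected batch;
    the advantage is detached so only the actor receives its gradient.
    """
    advantages = (returns - values).detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)     # critic regression towards the returns
    return policy_loss + 0.5 * value_loss
```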
A3C
Asynchronous Advantage Actor-Critic focuses on parallel training. Multiple actors are trained in parallel and get synced with global parameters.
DPG
#model-free #off-policy
Deterministic Policy Gradient models the policy as a deterministic function $a = \mu(s)$.
It is trained by maximizing the objective function, the expected discounted reward:
$$ J(\theta) = \int_{S}\rho^{\mu}(s)Q(s, \mu_{\theta}(s))ds $$
where:
- $\rho^{\mu}(s)$: the discounted state (visitation) distribution under $\mu$
- $\mu$: the deterministic action predictor
DDPG
#model-free #off-policy
Deep Deterministic Policy Gradient
Combines DQN's ingredients (experience replay, a frozen target model) with DPG.
Key design:
- Better exploration: $\mu'(s) = \mu_{\theta}(s) + \mathcal{N}$, adding noise $\mathcal{N}$ to the deterministic policy
- Soft target updates: an exponential moving average of the parameters, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$
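A minimal sketch of the soft target update (the parameter pairing via `zip` assumes the two networks share the same architecture):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """DDPG-style soft target update: theta' <- tau * theta + (1 - tau) * theta'."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```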
TD3
Twin Delayed Deep Deterministic (TD3) applies several tricks to DDPG to prevent overestimation of the value function:
- Clipped Double Q-learning: two critic networks are trained, and the smaller of the two Q-value estimates is used as the target, decoupling action selection from Q-value estimation.
- Delayed updates of the policy (and target) networks: instead of updating actor and critic together at every iteration, TD3 updates the actor at a lower frequency than the critic, waiting for the value estimate to become more stable. This helps reduce variance.
- Target policy smoothing: introduce a smoothing regularization by adding clipped noise $\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c)$ to the target action, i.e. use $Q_{w}(s', \mu_{\theta}(s') + \epsilon)$. It mitigates the risk of deterministic policies overfitting the value function.
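A minimal sketch of the TD3 target computation combining target policy smoothing and clipped double Q-learning (network names are illustrative assumptions, and clamping the noised action to the valid action range is omitted):

```python
import torch

@torch.no_grad()
def td3_target(critic1_t, critic2_t, actor_t, reward, next_state, done,
               gamma=0.99, sigma=0.2, clip_c=0.5):
    """TD3 target sketch. critic*_t(s, a) and actor_t(s) are frozen target networks."""
    # Target policy smoothing: add clipped Gaussian noise to the target action
    action = actor_t(next_state)
    noise = (torch.randn_like(action) * sigma).clamp(-clip_c, clip_c)
    next_action = action + noise

    # Clipped double Q: take the smaller of the two critic estimates
    q1 = critic1_t(next_state, next_action)
    q2 = critic2_t(next_state, next_action)
    return reward + gamma * (1.0 - done) * torch.min(q1, q2)
```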
SAC
Soft Actor-Critic learns a more random policy by incorporating the entropy of the policy $H(\pi)$ into the reward.
Three key components:
- An actor-critic architecture
- An off-policy approach
- Entropy Maximization to encourage exploration
The policy is trained by maximizing the objective function, the expected return plus the entropy: $$ J(\theta) = \sum\limits_{t = 1}^{T}\mathbb{E}_{(s_{t},a_{t}) \sim \rho_{\pi_{\theta}}} [r(s_{t},a_{t}) + \alpha \mathcal{H}(\pi_{\theta}(\cdot \mid s_{t}))] $$
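A minimal sketch of how the entropy term enters the bootstrapped target in practice, where $-\alpha \log \pi(a'|s')$ is the per-sample entropy bonus (tensor names are illustrative):

```python
import torch

@torch.no_grad()
def soft_q_target(reward, done, q_next, log_prob_next, gamma=0.99, alpha=0.2):
    """Entropy-regularized (soft) Q target sketch:
    y = r + gamma * (1 - done) * (Q(s', a') - alpha * log pi(a'|s')),
    where a' is sampled from the current policy.
    """
    return reward + gamma * (1.0 - done) * (q_next - alpha * log_prob_next)
```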
PPO (Proximal Policy Optimization)
#on-policy
[!TIP]
- clipped objective
- "Proximal" refers to keeping the new policy close (proximal) to the old policy
As a successor of A2C, PPO defines 2 more models:
Model | Meaning |
---|---|
Reward $r_{\theta}$ | Calculates the reward $R$ |
Reference $\pi_{ref}$ | Applies constraints and guidance to the Actor |
$r^{\ast}$ | Ground-truth reward function |
$r_\phi$ | MLE of $r^{\ast}$ |
$$ L(\theta) = \underbrace{-\hat A_{t} \cdot \min(r_{t}(\theta),\ \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon))}_{\text{A2C loss, ratio clipped to } 1 \pm \epsilon} - \underbrace{\beta D_{KL}(\pi_{\theta}(y|x)\,||\,\pi_{ref}(y|x))}_{\text{penalty for being too distant from the reference response}} $$
where:
- $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}$: the ratio of new policy to old policy
- $\epsilon$: normally 0.1 or 0.2
The training loop:
- Generate two outputs $y_{1}, y_{2}$ from the same input $x$
- Objective: $\mathcal{L} = \underset{\pi_{\theta}}{\max}\mathbb{E}[r_{\theta}(x,y_{2})]$
- Update:
	- Optimize with the reward of the current batch
	- TRO (Trust Region Optimization): use a gradient constraint to make sure the update does not sabotage the stability of the learning process
[!TIP]
- $r$ and $\pi$ can be optimized iteratively.
- RLHF and PPO are difficult to train.
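A minimal sketch of the clipped surrogate in its standard $\min(r_{t}\hat A_{t}, \text{clip}(r_{t})\hat A_{t})$ form, which matches the loss above for positive advantages; the KL-penalty term is omitted here (names are illustrative):

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO clipped surrogate sketch (KL penalty omitted).

    r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed from log-probs.
    """
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the pessimistic (minimum) surrogate, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```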
DPO(Direct Preference Optimization)
[!NOTE] The major difference, "Direct": directly optimize the policy against the reward on preference data, rather than against $V$ / $Q$ (expected rewards) estimated by a separate reward model
Rewrite the RLHF objective:
$$ \begin{align*} \pi &= \underset{\pi}{\max}(r_{\phi}(x,y) - \beta D_{KL}(\pi_{\theta}(y|x)||\pi_{ref}(y|x)))\\ &= \underset{\pi}{\max}(r_{\phi}(x,y) - \beta \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x)})\\ &= \underset{\pi}{\min}( \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x)} - r_{\phi}(x,y)/\beta)\\ &= \underset{\pi}{\min}( \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x) e^{r_{\phi}(x,y)/\beta} }) \end{align*} $$
^0639d4
Define the partition function: $Z(x) = \sum_{y}{\pi_{ref}(y|x) e^{r_{\phi}(x,y)/\beta}}$, which normalizes $\pi_{ref}(y|x) e^{r_{\phi}(x,y)/\beta}$ into a probability distribution over $y$
We can get the optimal policy $\pi^{\ast}$ under $r_{\phi}$ (independent of $\theta$):
$$ \pi^{*}(y|x) = \pi_{ref}(y|x)e^{\frac{r_{\phi} (x,y)}{\beta}} \frac{1}{Z(x)} $$
^5ee375
Then Eq. [[#^0639d4]] becomes:
$$ \begin{align*} \pi &= \underset{\pi}{\min}\left( \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x) e^{r_{\phi}(x,y)/\beta}}\right)\\ &= \underset{\pi}{\min}\left( \log \frac{\pi_{\theta}(y|x)}{\pi^{\ast}(y|x) Z(x)}\right)\\ &= \underset{\pi}{\min}\left( \log \frac{\pi_{\theta}(y|x)}{\pi^{\ast}(y|x)}\right)\\ &= \underset{\pi}{\min}\left( D_{KL}(\pi_{\theta}(y|x) || \pi^{\ast}(y|x))\right) \end{align*} $$
(the constant $Z(x)$ does not depend on $\pi$ and is dropped; the last step holds in expectation over $y \sim \pi_{\theta}$)
Clearly, the optimum is reached at $\pi_{\theta} = \pi^{\ast}$.
Notice that the reward function of Eq. [[#^5ee375]] can be rewritten (reparameterized) as (where $\pi_{ref}$ is the reference policy, e.g. the SFT model):
$$ r_{\phi} (x,y) = \beta \log \frac{\pi^{\ast}(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x) $$
[!TIP] The reward function can be represented by the optimal policy trained under it
By replacing $r_{\phi} (x,y)$ in the RLHF objective with this expression in $\pi^{*}$, we get an objective function that needs no explicit reward model:
$$ \begin{align} \mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{ref}) = -{{\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}[\log \sigma{({\beta \log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{ref}(y_{w}|x)} - \beta\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{ref}(y_{l}|x)} }})]}} \end{align} $$
From this equation we can see that training the reward model in RLHF is equivalent to training $\pi_{\theta}$ with the derived objective function.
That is to say, there is no need for 4 models; we can achieve the same target as RLHF by directly training $\pi_{\theta}$.
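A minimal sketch of the DPO loss over a preference batch, assuming the per-response log-probabilities have already been summed over tokens (names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO sketch: -log sigma(beta * (log-ratio of chosen - log-ratio of rejected)).

    Inputs are summed token log-probabilities of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model.
    """
    chosen_logratio = logp_w_policy - logp_w_ref
    rejected_logratio = logp_l_policy - logp_l_ref
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```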
Methods
RLHF
RLHF (Reinforcement Learning from Human Feedback) is a technique that trains a reward model from human preferences and then fine-tunes the policy against it.
It has following key concepts:
- Reward Model: trained in advance directly from human feedback
- human feedback: data collected by asking humans to rank instances of the agent’s behavior
The procedure consists of 3 steps:
1. SFT
Supervised fine-tuning of a pretrained (target) model: $\pi^{SFT}$
2. Reward Modeling Phase
Train a reward model: $r_{\phi}(x,y) = r$, $r \in (0, + \infty)$, where $r$ is the reward of the given (prompt, response) pair.
- Initialization: often initialized from pretrained models
- Data:
	- $D$: $Prompt: x \to (Generation: y, Reward: r)$, generated by humans or models
	- Human feedback: rank the outputs of different models under the same prompt with $r$
		- effective ways of ranking: pairwise comparison / ELO
	- $(y_{w}, y_{l})$: preferred / dispreferred pairs sampled from the generations
- Train the RM with the data. The objective is a negative log-likelihood loss: $$ \begin{align*} \mathcal{L}_{R}(r_{\phi}, D) = -{{\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}[\log{\sigma({r_{\phi}(x, y_{w}) - r_{\phi}(x, y_{l})}})]}} \end{align*} $$
  which maximizes the gap between the rewards of the better and the worse response.
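A minimal sketch of the pairwise reward-model loss above (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise RM loss sketch: -E[log sigma(r(x, y_w) - r(x, y_l))].

    reward_chosen / reward_rejected are per-example scalar rewards produced by
    the reward model for the preferred and dispreferred responses.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```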
3. RL Fine-Tuning Phase: optimize the policy $\pi_{\theta}(y|x)$
- In the past, training LMs with RL was considered impossible.
- One of the proposed feasible approaches is Policy-Gradient RL, in particular PPO (Proximal Policy Optimization).