Terminologies
Term | Meaning |
---|---|
Full fine-tuning | Fine-tune all the weights of a pretrained model |
Intrinsic dimension | An attribute of a dataset: essentially the minimum number of variables needed to encode the data |
Low intrinsic dimension | A description of a dataset whose intrinsic dimension is small |
$h$ | The output of the model |
Introduction
PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that reduces the training cost of full fine-tuning by minimizing the number of trainable parameters and the computational complexity.
According to UniPELT, existing PELT methods mainly differ along the following design dimensions (see the sketch after this list):
- The functional form of $\Delta h$
- The form of insertion into the Transformer
    - Parallel: $\Delta h$ is computed from the sublayer input
    - Sequential: $\Delta h$ is computed from the sublayer output
- The representation being modified
    - the attention layer
    - the FFN layer
- The composition function of $h$ and $\Delta h$
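A minimal PyTorch sketch of how these dimensions fit together (my own illustration, not code from the UniPELT paper); the class name `UnifiedPEFTBlock`, the bottleneck MLP used for $\Delta h$, and the fixed scale are assumptions:

```python
import torch
import torch.nn as nn

class UnifiedPEFTBlock(nn.Module):
    """Illustrative sketch of the design dimensions above:
    - functional form of delta_h: a bottleneck MLP
    - insertion form: 'parallel' (uses the sublayer input) or 'sequential' (uses its output)
    - composition: scaled addition h + s * delta_h
    """
    def __init__(self, sublayer: nn.Module, dim: int, bottleneck: int = 16,
                 insertion: str = "parallel", scale: float = 1.0):
        super().__init__()
        self.sublayer = sublayer              # frozen pretrained sublayer (attention or FFN)
        for p in self.sublayer.parameters():
            p.requires_grad = False
        self.insertion = insertion
        self.scale = scale
        self.delta = nn.Sequential(           # functional form of delta_h
            nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):
        h = self.sublayer(x)
        # parallel: delta_h is a function of the sublayer input x
        # sequential: delta_h is a function of the sublayer output h
        delta_h = self.delta(x if self.insertion == "parallel" else h)
        return h + self.scale * delta_h       # composition function


# usage: wrap a (frozen) linear sublayer of width 32
block = UnifiedPEFTBlock(nn.Linear(32, 32), dim=32, insertion="parallel")
out = block(torch.randn(4, 32))               # -> shape (4, 32)
```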
Adapter Tuning
Only the parameters of the newly added, task-specific layers are fine-tuned.
During training, the parameters of the original pre-trained model are frozen, and a newly added adapter module is inserted (see the sketch after this list):
- Down-project layer: projects the high-dimensional feature to a lower dimension
- Non-linearity
- Up-project layer: projects back to the original high dimension
- Skip-connection: in the worst case the adapter degenerates to the $identity$
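A minimal PyTorch sketch of such an adapter; the bottleneck size, the GELU non-linearity, and the near-zero initialization of the up-projection are common choices assumed here, not prescribed by the text above:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> non-linearity -> up-project,
    with a skip connection around the whole block."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-project layer
        self.act = nn.GELU()                     # non-linearity
        self.up = nn.Linear(bottleneck, dim)     # up-project layer
        # near-zero init keeps the adapter close to the identity at the start
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # skip connection: if up(act(down(x))) ~ 0, the adapter is the identity
        return x + self.up(self.act(self.down(x)))


adapter = Adapter(dim=768)
y = adapter(torch.randn(2, 10, 768))             # same shape as the input
```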
Prefix Tuning
- Prefix: prepend learnable, task-specific virtual tokens ahead of the input tokens at the keys and values ($W_{k}$ and $W_{v}$) of each layer (sketched below)
- A reparameterization MLP after the prefix layer (used only during training): maps a smaller matrix $P_{\theta}'$ to the actual prefix $P_{\theta}$, to stabilize training
> [!NOTE] Similar to a text prompt, but continuous and implicit.
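A simplified PyTorch sketch of the idea; for brevity the prefix vectors are passed through the layer's own key/value projections instead of being injected after them, and the reparameterization MLP is omitted, so treat this only as an illustration:

```python
import torch
import torch.nn as nn

class PrefixSelfAttention(nn.Module):
    """Prefix tuning for one attention layer: `prefix_len` learnable virtual
    tokens are prepended to the keys and values only (names are assumptions)."""
    def __init__(self, dim: int, n_heads: int, prefix_len: int = 10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        for p in self.attn.parameters():         # the pretrained layer stays frozen
            p.requires_grad = False
        # learnable prefix key/value vectors, shared across the batch
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, x):                        # x: (batch, seq, dim)
        b = x.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(query=x, key=k, value=v)
        return out


layer = PrefixSelfAttention(dim=64, n_heads=4)
print(layer(torch.randn(2, 5, 64)).shape)        # torch.Size([2, 5, 64])
```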
Prompt Tuning
A simplified version of Prefix Tuning (sketched below), with:
- Prefix virtual tokens prepended only at the input (embedding) layer
- The reparameterization MLP removed
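A minimal PyTorch sketch of prompt tuning; the number of virtual tokens and the random initialization are assumptions:

```python
import torch
import torch.nn as nn

class PromptTuningEmbedding(nn.Module):
    """Learnable virtual-token embeddings are prepended once, at the input
    embedding layer only (no per-layer prefixes, no reparameterization MLP)."""
    def __init__(self, embed: nn.Embedding, num_virtual_tokens: int = 20):
        super().__init__()
        self.embed = embed                                   # frozen word embeddings
        self.embed.weight.requires_grad = False
        self.prompt = nn.Parameter(
            torch.randn(num_virtual_tokens, embed.embedding_dim) * 0.02
        )

    def forward(self, input_ids):                            # (batch, seq)
        tok = self.embed(input_ids)                          # (batch, seq, dim)
        prompt = self.prompt.expand(input_ids.size(0), -1, -1)
        return torch.cat([prompt, tok], dim=1)               # (batch, virtual + seq, dim)


emb = PromptTuningEmbedding(nn.Embedding(1000, 64))
print(emb(torch.randint(0, 1000, (2, 8))).shape)             # torch.Size([2, 28, 64])
```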
P-Tuning
P-Tuning starts from a known problem of LLMs: the exact wording of a prompt has a significant impact on downstream-task performance.
P-Tuning is therefore proposed to replace the discrete input prompt with learnable continuous embeddings.
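A minimal sketch of such a prompt encoder, assuming the LSTM + MLP reparameterization used in the original P-Tuning paper; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class PTuningPromptEncoder(nn.Module):
    """Produces continuous prompt embeddings that replace the discrete prompt;
    the raw virtual-token embeddings are re-encoded by an LSTM and an MLP."""
    def __init__(self, num_virtual_tokens: int = 20, dim: int = 64):
        super().__init__()
        self.input_embeds = nn.Parameter(torch.randn(num_virtual_tokens, dim) * 0.02)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self):
        # encode the raw virtual-token embeddings into the actual prompt embeddings
        out, _ = self.lstm(self.input_embeds.unsqueeze(0))   # (1, num_tokens, 2*dim)
        return self.mlp(out).squeeze(0)                      # (num_tokens, dim)


prompt = PTuningPromptEncoder()()
print(prompt.shape)                                          # torch.Size([20, 64])
```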
LoRA
All of the PEFT methods mentioned above have drawbacks:
- either they increase the model depth and inference latency, e.g. Adapter Tuning,
- or they introduce learnable prompt parameters that are hard to optimize, e.g. Prefix/Prompt Tuning.
It is observed that the adaptation of LLMs has a low intrinsic dimension, so the weight update can be well approximated by a low-rank decomposition. Based on this observation, a pre-trained weight matrix (e.g. an attention projection) can be re-written as (see the sketch after this list): $$ h = \underbrace{W_{0}}_{\text{original weight}}x + \underbrace{\Delta W}_{\text{adapter}}x = W_{0}x + BAx $$ where:
- $A \in \mathbb{R}^{r \times d}$, initialized from $\mathcal{N}(0, \sigma^{2})$
- $B \in \mathbb{R}^{d \times r}$, initialized to zero, so that $\Delta W = BA = 0$ at the start of training
- $d \gg r$
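A minimal PyTorch sketch of a LoRA-wrapped linear layer; the `alpha / r` scaling factor is a detail from the LoRA paper, while the other names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x, with W0 frozen and only the low-rank factors
    A (Gaussian init) and B (zero init) being trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                                     # frozen pretrained W0
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B = 0 => delta W = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


lora = LoRALinear(nn.Linear(128, 128))
print(lora(torch.randn(4, 128)).shape)                       # torch.Size([4, 128])
```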
Advantages:
- No additional depth or inference latency is introduced: after training, $BA$ can be merged back into $W_{0}$ (see below)
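Because $\Delta W = BA$ has the same shape as $W_{0}$, it can be folded into the frozen weight at inference time. A standalone sketch with random matrices (the sizes are arbitrary):

```python
import torch

# At inference, the low-rank update is merged into the frozen weight, so the
# adapted model has exactly the same architecture and latency as the base model.
d, r = 128, 8
W0 = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01    # trained LoRA factors
B = torch.randn(d, r) * 0.01
W_merged = W0 + B @ A           # W = W0 + B A, same shape as W0

x = torch.randn(4, d)
# merged forward pass equals the two-branch forward pass
assert torch.allclose(x @ W_merged.T, x @ W0.T + x @ A.T @ B.T, atol=1e-4)
```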
UniPELT
UniPELT provides a unified view of existing PEFT methods and compares the choices along each design dimension:
- The parallel insertion form is better than the sequential one
- Modified representation:
    - When the number of modified parameters is large, modifying the FFN is better
        - the FFN learns task-specific patterns
    - Otherwise, modifying attention is better
        - attention captures textual patterns
- A scaled composition function (e.g. $h + s \cdot \Delta h$) is better than plain addition (illustrated below)
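A minimal sketch of the scaled composition, with the scale $s$ made learnable so that it can act like a gate; making it learnable is my own choice for illustration, not necessarily UniPELT's exact gating mechanism:

```python
import torch
import torch.nn as nn

class ScaledComposition(nn.Module):
    """Composition h + s * delta_h with a learnable scalar s."""
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.s = nn.Parameter(torch.tensor(init_scale))

    def forward(self, h, delta_h):
        return h + self.s * delta_h


compose = ScaledComposition(init_scale=0.5)
h, delta_h = torch.randn(2, 16), torch.randn(2, 16)
print(compose(h, delta_h).shape)    # torch.Size([2, 16])
```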