Terminologies
Term | Meaning |
---|---|
Full fine-tuning | Fine-tune all the weights of a pretrained model |
Intrinsic dimension | An attribute of a dataset: essentially the minimum number of variables needed to encode the data |
Low intrinsic dimension | A description of a dataset whose intrinsic dimension is small |
$h$ | The output of the model |
Introduction
PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that reduces the training cost of full fine-tuning by minimizing the number of trainable parameters and the computational complexity.
According to UniPELT, existing PELT methods mainly differ along the following design dimensions (see the sketch after this list):
- The functional form of $\Delta h$
- The form of insertion into the Transformer
    - Parallel: $\Delta h$ is computed from the sublayer input
    - Sequential: $\Delta h$ is computed from the sublayer output
- The representation being modified
    - the attention layer
    - the FFN layer
- The composition function of $h$ and $\Delta h$
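A minimal PyTorch sketch of how these dimensions fit together (my own illustration, not code from the UniPELT paper); the class name `UnifiedPEFTBlock`, the bottleneck MLP used for $\Delta h$, and the fixed scale are assumptions:

```python
import torch
import torch.nn as nn

class UnifiedPEFTBlock(nn.Module):
    """Illustrative sketch of the design dimensions above:
    - functional form of delta_h: a bottleneck MLP
    - insertion form: 'parallel' (uses the sublayer input) or 'sequential' (uses its output)
    - composition: scaled addition h + s * delta_h
    """
    def __init__(self, sublayer: nn.Module, dim: int, bottleneck: int = 16,
                 insertion: str = "parallel", scale: float = 1.0):
        super().__init__()
        self.sublayer = sublayer              # frozen pretrained sublayer (attention or FFN)
        for p in self.sublayer.parameters():
            p.requires_grad = False
        self.insertion = insertion
        self.scale = scale
        self.delta = nn.Sequential(           # functional form of delta_h
            nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):
        h = self.sublayer(x)
        # parallel: delta_h is a function of the sublayer input x
        # sequential: delta_h is a function of the sublayer output h
        delta_h = self.delta(x if self.insertion == "parallel" else h)
        return h + self.scale * delta_h       # composition function


# usage: wrap a (frozen) linear sublayer of width 32
block = UnifiedPEFTBlock(nn.Linear(32, 32), dim=32, insertion="parallel")
out = block(torch.randn(4, 32))               # -> shape (4, 32)
```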
Adapter Tuning
Only the parameters of the newly added, task-specific layers are fine-tuned.
During training, the parameters of the original pre-trained model are frozen, and a newly added adapter module is inserted (see the sketch after this list):
- Down-project layer: projects the high-dimensional feature to a lower dimension
- Non-linearity
- Up-project layer: projects back to the original high dimension
- Skip-connection: in the worst case the adapter degenerates to the $identity$
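A minimal PyTorch sketch of such an adapter; the bottleneck size, the GELU non-linearity, and the near-zero initialization of the up-projection are common choices assumed here, not prescribed by the text above:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> non-linearity -> up-project,
    with a skip connection around the whole block."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-project layer
        self.act = nn.GELU()                     # non-linearity
        self.up = nn.Linear(bottleneck, dim)     # up-project layer
        # near-zero init keeps the adapter close to the identity at the start
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # skip connection: if up(act(down(x))) ~ 0, the adapter is the identity
        return x + self.up(self.act(self.down(x)))


adapter = Adapter(dim=768)
y = adapter(torch.randn(2, 10, 768))             # same shape as the input
```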
Prefix Tuning
- Prefix: prepend learnable, task-specific virtual tokens ahead of the input tokens at the keys and values ($W_{k}$ and $W_{v}$) of each layer (sketched below)
- A reparameterization MLP after the prefix layer (used only during training): maps a smaller matrix $P_{\theta}'$ to the actual prefix $P_{\theta}$, to stabilize training
> [!NOTE] Similar to a text prompt, but continuous and implicit.
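A simplified PyTorch sketch of the idea; for brevity the prefix vectors are passed through the layer's own key/value projections instead of being injected after them, and the reparameterization MLP is omitted, so treat this only as an illustration:

```python
import torch
import torch.nn as nn

class PrefixSelfAttention(nn.Module):
    """Prefix tuning for one attention layer: `prefix_len` learnable virtual
    tokens are prepended to the keys and values only (names are assumptions)."""
    def __init__(self, dim: int, n_heads: int, prefix_len: int = 10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        for p in self.attn.parameters():         # the pretrained layer stays frozen
            p.requires_grad = False
        # learnable prefix key/value vectors, shared across the batch
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, x):                        # x: (batch, seq, dim)
        b = x.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(query=x, key=k, value=v)
        return out


layer = PrefixSelfAttention(dim=64, n_heads=4)
print(layer(torch.randn(2, 5, 64)).shape)        # torch.Size([2, 5, 64])
```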
Prompt Tuning
A simplified version of Prefix Tuning (sketched below), with:
- Prefix virtual tokens prepended only at the input (embedding) layer
- The reparameterization MLP removed
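A minimal PyTorch sketch of prompt tuning; the number of virtual tokens and the random initialization are assumptions:

```python
import torch
import torch.nn as nn

class PromptTuningEmbedding(nn.Module):
    """Learnable virtual-token embeddings are prepended once, at the input
    embedding layer only (no per-layer prefixes, no reparameterization MLP)."""
    def __init__(self, embed: nn.Embedding, num_virtual_tokens: int = 20):
        super().__init__()
        self.embed = embed                                   # frozen word embeddings
        self.embed.weight.requires_grad = False
        self.prompt = nn.Parameter(
            torch.randn(num_virtual_tokens, embed.embedding_dim) * 0.02
        )

    def forward(self, input_ids):                            # (batch, seq)
        tok = self.embed(input_ids)                          # (batch, seq, dim)
        prompt = self.prompt.expand(input_ids.size(0), -1, -1)
        return torch.cat([prompt, tok], dim=1)               # (batch, virtual + seq, dim)


emb = PromptTuningEmbedding(nn.Embedding(1000, 64))
print(emb(torch.randint(0, 1000, (2, 8))).shape)             # torch.Size([2, 28, 64])
```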
P-Tuning
P-Tuning starts from a known problem of LLMs: the exact wording of a prompt has a significant impact on downstream-task performance.
P-Tuning is therefore proposed to replace the discrete input prompt with learnable continuous embeddings.
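A minimal sketch of such a prompt encoder, assuming the LSTM + MLP reparameterization used in the original P-Tuning paper; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class PTuningPromptEncoder(nn.Module):
    """Produces continuous prompt embeddings that replace the discrete prompt;
    the raw virtual-token embeddings are re-encoded by an LSTM and an MLP."""
    def __init__(self, num_virtual_tokens: int = 20, dim: int = 64):
        super().__init__()
        self.input_embeds = nn.Parameter(torch.randn(num_virtual_tokens, dim) * 0.02)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self):
        # encode the raw virtual-token embeddings into the actual prompt embeddings
        out, _ = self.lstm(self.input_embeds.unsqueeze(0))   # (1, num_tokens, 2*dim)
        return self.mlp(out).squeeze(0)                      # (num_tokens, dim)


prompt = PTuningPromptEncoder()()
print(prompt.shape)                                          # torch.Size([20, 64])
```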
LoRA
All of the PEFT methods mentioned above have drawbacks:
- either they increase the model depth and inference latency, e.g. Adapter Tuning,
- or they introduce learnable prompt parameters that are hard to optimize, e.g. Prefix/Prompt Tuning.
It is observed that the adaptation of LLMs has a low intrinsic dimension, so the weight update can be well approximated by a low-rank decomposition. Based on this observation, a pre-trained weight matrix (e.g. an attention projection) can be re-written as (see the sketch after this list): $$ h = \underbrace{W_{0}}_{\text{original weight}}x + \underbrace{\Delta W}_{\text{adapter}}x = W_{0}x + BAx $$ where:
- $A \in \mathbb{R}^{r \times d}$, initialized from $\mathcal{N}(0, \sigma^{2})$
- $B \in \mathbb{R}^{d \times r}$, initialized to zero, so that $\Delta W = BA = 0$ at the start of training
- $d \gg r$
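A minimal PyTorch sketch of a LoRA-wrapped linear layer; the `alpha / r` scaling factor is a detail from the LoRA paper, while the other names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x, with W0 frozen and only the low-rank factors
    A (Gaussian init) and B (zero init) being trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                                     # frozen pretrained W0
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B = 0 => delta W = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


lora = LoRALinear(nn.Linear(128, 128))
print(lora(torch.randn(4, 128)).shape)                       # torch.Size([4, 128])
```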
Advantages:
- No additional depth or inference latency is introduced: after training, $BA$ can be merged back into $W_{0}$ (see below)
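Because $\Delta W = BA$ has the same shape as $W_{0}$, it can be folded into the frozen weight at inference time. A standalone sketch with random matrices (the sizes are arbitrary):

```python
import torch

# At inference, the low-rank update is merged into the frozen weight, so the
# adapted model has exactly the same architecture and latency as the base model.
d, r = 128, 8
W0 = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01    # trained LoRA factors
B = torch.randn(d, r) * 0.01
W_merged = W0 + B @ A           # W = W0 + B A, same shape as W0

x = torch.randn(4, d)
# merged forward pass equals the two-branch forward pass
assert torch.allclose(x @ W_merged.T, x @ W0.T + x @ A.T @ B.T, atol=1e-4)
```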
UniPELT
UniPELT provides a unified view of existing PEFT methods and compares the choices along each design dimension:
- The parallel insertion form is better than the sequential one
- Modified representation:
    - When the number of modified parameters is large, modifying the FFN is better
        - the FFN learns task-specific patterns
    - Otherwise, modifying attention is better
        - attention captures textual patterns
- A scaled composition function (e.g. $h + s \cdot \Delta h$) is better than plain addition (illustrated below)
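A minimal sketch of the scaled composition, with the scale $s$ made learnable so that it can act like a gate; making it learnable is my own choice for illustration, not necessarily UniPELT's exact gating mechanism:

```python
import torch
import torch.nn as nn

class ScaledComposition(nn.Module):
    """Composition h + s * delta_h with a learnable scalar s."""
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.s = nn.Parameter(torch.tensor(init_scale))

    def forward(self, h, delta_h):
        return h + self.s * delta_h


compose = ScaledComposition(init_scale=0.5)
h, delta_h = torch.randn(2, 16), torch.randn(2, 16)
print(compose(h, delta_h).shape)    # torch.Size([2, 16])
```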