Terminology

| Notation | Meaning |
| --- | --- |
| $X \sim p_{r}$ | the input data |
| $z$ | the encoded latent |
| $\theta$ | the parameters of the model (the decoder) |
| $\phi$ | the parameters of the encoder |
| $p_{\theta}(x)$ | the likelihood of the data reconstruction |
| $p(z)$ | the prior distribution of the latent variable $z$, often $\mathcal{N}(0, I)$ |
| $q_{\phi}(z \mid x)$ | the variational distribution |
| MDL | Minimum Description Length |
| Self-Information $I$ | the amount of information, interpreted as the level of “surprise”: $I(w_{n}) = f(P(w_{n})) = -\log P(w_{n}) \ge 0$ |
| Entropy $H(X)$ | the average amount of information in a message, a measure of uncertainty: $H(X) = \mathbb{E}[I(X)] = \mathbb{E}[-\ln P(X)]$ |
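
To make the last two quantities concrete, here is a minimal Python sketch (an added illustration, not part of the original note) computing self-information and entropy for a biased coin:

```python
import numpy as np

# Biased coin: P(heads) = 0.9, P(tails) = 0.1
p = np.array([0.9, 0.1])

self_information = -np.log(p)            # I(w_n) = -log P(w_n), in nats
entropy = np.sum(p * self_information)   # H(X) = E[I(X)]

print(self_information)  # the rare outcome carries more "surprise"
print(entropy)           # ≈ 0.325 nats, below the ln(2) ≈ 0.693 maximum for two outcomes
```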

Background

The AutoEncoder was proposed to compress data and reduce dimensionality as a generalization of PCA, and was largely used in signal processing, until it was found that new samples can be generated by adding noise to the latents and decoding them with the decoder.

However, the ability of the AutoEncoder to generate new samples depends on knowing the distribution of the latents $z$; this is why and when the Variational AutoEncoder was developed.

[!TIP] AE can be viewed as an approach to MDL (Minimum Description Length).

Requirements

  • In order to be able to generate new samples using the decoder, we would like $z \sim \mathcal{N}(0, I)$, so that new latents can simply be sampled from the prior and decoded (see the sketch below).
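
A minimal sketch of that generative use-case, assuming a trained decoder (the `decoder` below is an untrained stand-in with assumed shapes):

```python
import torch
import torch.nn as nn

# Stand-in decoder mapping a 32-d latent to a 784-d sample (e.g. a flattened 28x28 image)
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

z = torch.randn(16, 32)     # sample 16 latents from the prior N(0, I)
new_samples = decoder(z)    # decode them into 16 new samples
```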

Modeling

We apply Maximum Likelihood Estimation here.

The log-likelihood is defined as: $$ \text{Log-Likelihood} = \log p_{\theta}(x) $$

which represents the ability of the model to reconstruct the input data.

Hence, we define the loss function as the negative expected log-likelihood:

$$ \mathcal{L}(\theta) = - \mathbb E_{x \sim data}[\log p_{\theta}(x)] $$

Normally, the subscript $x \sim data$ is omitted for brevity.
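
For a finite dataset $\{x_{i}\}_{i=1}^{N}$, the expectation becomes an empirical average, so minimizing $\mathcal{L}(\theta)$ is exactly Maximum Likelihood Estimation:

$$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(x_{i}), \qquad \arg\min_{\theta}\mathcal{L}(\theta) = \arg\max_{\theta}\prod_{i=1}^{N} p_{\theta}(x_{i}) $$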

Our goal is to minimize the loss function while at the same time forcing the encoder to encode $X$ as $z \sim \mathcal{N}(\mu, \sigma^{2}I)$.

Implicit Model

We define $z$ as a latent (implicit) variable, making our model a latent variable model.

Rewrite the likelihood by marginalizing over $z$: $$ p_{\theta}(x) = \int p_{\theta}(x|z)\,p(z)\,dz $$ where $\theta$ parameterizes the generative part of the model (the decoder).

However, there is a common problem for such latent variable models: evaluating the integral requires exhausting all possible values of the latent variable $z$.

In our case, $z$ is continuous ($z \sim \mathcal{N}(\mu, \sigma^{2}I)$), so exact integration is deemed impossible.

MC

Monte-Carlo is a method to approximate an intractable integral by sampling a large number of points (here, samples of $z$ at which $p_{\theta}(x|z)$ is evaluated): $$ \begin{align*} p_{\theta}(x) &= \int p_{\theta}(x|z)\,p(z)\,dz\\ &\approx \frac{1}{m} \sum\limits_{j=1}^{m} p_{\theta}(x|z_{j}), \qquad z_{j} \sim p(z) \end{align*} $$ But that does not enforce $z \sim \mathcal{N}(\mu, \sigma^{2}I)$.
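
As a concrete illustration, here is a minimal Python sketch (an added toy example, not from the original note) of the naive Monte-Carlo estimator for a 1-D model with a standard-normal prior and a Gaussian decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_likelihood(x, z):
    """Toy Gaussian decoder: p_theta(x | z) = N(x; 2*z, 0.5^2).
    The mean function 2*z stands in for a learned network."""
    mu, sigma = 2.0 * z, 0.5
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mc_marginal_likelihood(x, m=10_000):
    """p_theta(x) ~= (1/m) * sum_j p_theta(x | z_j), with z_j ~ p(z) = N(0, 1)."""
    z = rng.standard_normal(m)
    return decoder_likelihood(x, z).mean()

print(mc_marginal_likelihood(x=1.0))
```

Most prior samples land where $p_{\theta}(x|z)$ is nearly zero, which is also why this estimator scales poorly to high-dimensional $z$.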

Variational Bayes

Deriving ELBO

The log-likelihood can be rewritten as follows:

$$ \begin{align*} \log p_{\theta}(x) &= \log p_{\theta}(x) \int_{z}p_{\theta}(z|x)dz &\text{Normalization} \\ &= \int_{z}p_{\theta}(z|x)\log p_{\theta}(x)dz \\ &= \int_{z}p_{\theta}(z|x) \log \frac{p_{\theta}(x,z)}{p_{\theta}(z|x)} dz &\text{Bayes’ Theorem} \\ &= \int_{z}\big(p_{\theta}(z|x)\log p_{\theta}(x,z) - p_{\theta}(z|x)\log p_{\theta}(z|x)\big)dz \\ &= \mathbb{E}_{p_{\theta}(z|x)}\big[\log p_{\theta}(x,z) - \log p_{\theta}(z|x)\big] \end{align*} $$

Since the posterior $p_{\theta}(z|x)$ is intractable (by Bayes’ theorem it requires the marginal $p_{\theta}(x)$, i.e. exactly the integral over the latent variable $z$ we cannot compute), a new, easy-to-learn distribution $q_{\phi}(z|x)$ is used to approximate it, where $\phi$ parameterizes the encoder.

Let’s continue by replacing the intractable posterior with $q_{\phi}(z|x)$:

$$ \begin{align*} \underbrace{\log p_{\theta}(x)}_{\text{evidence}} &= \int_{z} q_{\phi}(z|x)\log p_{\theta}(x)dz \\ &= \int_{z} q_{\phi}(z|x)\log\frac{p_{\theta}(x,z)}{p_{\theta}(z|x)}dz \\ &= \int_{z}q_{\phi}(z|x)\log\Big(\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)} \cdot \frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}\Big)dz \\ &= \int_{z}q_{\phi}(z|x)\log\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}dz + \int_{z}q_{\phi}(z|x)\log\frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}dz \\ &= \mathcal L(\theta,\phi; x) + D_{KL}\big(q_{\phi}(z|x) \,\|\, p_{\theta}(z|x)\big) \\ &\ge \underbrace{\mathcal L(\theta,\phi; x)}_{\text{ ELBO }} & \text{$D_{KL}\ge 0$} \end{align*} $$

$\mathcal L(\theta, \phi; x) = \mathbb{E}_{z \sim q_{\phi}(\cdot|x)}\left[\log \frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]$ is the ELBO (Evidence Lower Bound): it lower-bounds the evidence $\log p_{\theta}(x)$ once the non-negative KL term is dropped. Maximizing the ELBO is simultaneously:

  • maximizing (a lower bound on) the log-likelihood
  • minimizing the KL divergence between the variational distribution $q_{\phi}(z|x)$ and the true posterior $p_{\theta}(z|x)$

Maximizing ELBO

And we can break it down further: $$ \begin{align*} \underbrace{\mathcal L(\theta, \phi; x)}_{\text{ELBO}} &= \int_{z}q_{\phi}(z|x)\log\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}dz = \mathcal{H}[q_{\phi}(z|x)] + \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x,z)] \\\\ &= \int_{z}q_{\phi}(z|x)\log\frac{p(z) \cdot p_{\theta}(x|z)}{q_{\phi}(z|x)}dz & \text{Bayes' Theorem}\\\\ &= \int_{z}q_{\phi}(z|x)\log\frac{p(z)}{q_{\phi}(z|x)}dz + \int_{z}q_{\phi}(z|x)\log p_{\theta}(x|z)dz\\\\ &= \underbrace{-D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z)\big)}_{\text{$\mathcal L_{reg}$}} + \underbrace{\mathbb E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{\text{$\mathcal L_{reconstruct}$}}\\\\ \end{align*} $$

[!Note] $\int_{z}p(z)\,f(z)\,dz = \mathbb E_{z \sim p(\cdot)}[f(z)]$, i.e. the expectation of $f(z)$ with $z$ sampled from $p(z)$.

The ELBO thus splits into two terms:

  • $\mathcal L_{reg}$: the KL divergence between the variational distribution and the prior (see the closed form below)
  • $\mathcal L_{reconstruct}$: the expectation of the log reconstruction likelihood under the variational distribution
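
For the usual choice of a diagonal-Gaussian encoder $q_{\phi}(z|x) = \mathcal{N}(\mu, \sigma^{2}I)$ and the standard-normal prior $p(z) = \mathcal{N}(0, I)$, the regularization term has a well-known closed form (stated here for reference):

$$ \mathcal L_{reg} = -D_{KL}\big(\mathcal{N}(\mu, \sigma^{2}I) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{i=1}^{d}\big(1 + \log\sigma_{i}^{2} - \mu_{i}^{2} - \sigma_{i}^{2}\big) $$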

Since $\mathcal{L}(\theta) = -\log p_{\theta}(x) \le -\text{ELBO}$, maximizing the ELBO minimizes an upper bound on $\mathcal{L}(\theta)$, i.e. indirectly minimizes the loss.

Hence, we define $\mathcal{L} = -\text{ELBO}$.

Training

$$ \begin{align*} \text{ELBO} &= \underbrace{-D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z)\big)}_{\text{$\mathcal L_{reg}$}} + \underbrace{\mathbb E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{\text{$\mathcal L_{reconstruct}$}}\\ &= \underbrace{-D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z)\big)}_{\text{$\mathcal L_{reg}$}} - \text{MSE}(x, \hat x) \end{align*} $$ where the second line assumes a Gaussian decoder, under which $\mathbb E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$ reduces to the negative MSE between $x$ and its reconstruction $\hat x$ (up to a constant).

As $z$ is sampled from the variational distribution $q_{\phi}(z|x)$, the sampling operation is not differentiable, so by the chain rule the gradient of the ELBO cannot propagate back to the encoder $\phi$.

Thus, re-parameterization is applied: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\phi(X) = (\mu, \sigma)$. This way, $z$ becomes a deterministic, differentiable function of the encoder outputs, and the gradient is passed back to $\phi$ because $z$ participates in the loss calculation.
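
Putting the pieces together, here is a minimal PyTorch sketch (an added illustration under the Gaussian-decoder assumption above, not the note's own code) of a VAE forward pass with re-parameterization and the $\mathcal{L} = -\text{ELBO}$ loss (KL term plus MSE reconstruction):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # phi outputs mu ...
        self.logvar = nn.Linear(h_dim, z_dim)   # ... and log sigma^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # re-parameterization: z = mu + sigma * eps
        return self.dec(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    # L = -ELBO = D_KL(q_phi(z|x) || N(0, I)) + MSE(x, x_hat)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=1).mean()
    rec = F.mse_loss(x_hat, x, reduction="none").sum(dim=1).mean()
    return rec + kl

# usage: x_hat, mu, logvar = model(x); loss = loss_fn(x, x_hat, mu, logvar); loss.backward()
```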

Problems

Blurry output

  • the prior is fixed to $p(z) = \mathcal{N}(0, I)$
  • MSE is used for $\mathcal L_{reconstruct}$, which tends to average over plausible reconstructions

DAE (Denoising AutoEncoder): corrupt $X$ to reduce the redundancy of the image (image redundancy is usually very high).
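
A minimal sketch of one denoising training step, assuming any encoder-decoder `model` and an assumed noise level:

```python
import torch
import torch.nn.functional as F

def dae_step(model, x, noise_std=0.3):
    """One denoising-autoencoder step: corrupt x, reconstruct the clean x."""
    x_noisy = x + noise_std * torch.randn_like(x)  # corrupt X with Gaussian noise
    x_hat = model(x_noisy)
    return F.mse_loss(x_hat, x)                    # the target is the clean input
```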

DALL·E

Two stages:

  1. CLIP builds positive/negative sample pairs for contrastive learning
  2. text -> CLIP text encoder -> text embedding -> (diffusion) prior -> image embedding -> diffusion decoder -> image (see the sketch below)
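
A hypothetical pseudocode sketch of that second-stage inference pipeline; `clip_text_encoder`, `diffusion_prior`, and `diffusion_decoder` are assumed pre-trained components, not a real API:

```python
def generate_image(text, clip_text_encoder, diffusion_prior, diffusion_decoder):
    text_embedding = clip_text_encoder(text)            # text -> CLIP text embedding
    image_embedding = diffusion_prior(text_embedding)   # prior: text embedding -> image embedding
    image = diffusion_decoder(image_embedding)          # decoder: image embedding -> image
    return image
```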

A transformer encoder is essentially an autoregressive model: based on self-attention and its input, it can autoregressively generate content of the same type.

![[Pasted image 20230618153050.png]]

![[Pasted image 20230618154911.png]]