RoFormer

RoFormer: Enhanced Transformer with Rotary Position Embedding

contributions

Absolute position encoding
$f_t(x_i, i) = W_t(x_i + p_i)$

$\left\{\begin{matrix} p_{i,2t} &= \sin{(i/10000^{2t/d})} \\ p_{i,2t+1} &= \cos{(i/10000^{2t/d})} \end{matrix}\right.$
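
The sinusoidal encoding above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `sinusoidal_position_encoding` and the `base` parameter are hypothetical names chosen here, and positions are assumed to start at 0.

```python
import math

def sinusoidal_position_encoding(position, d, base=10000.0):
    """Return the d-dimensional encoding p_i for a single position i.

    Hypothetical helper: even indices hold sines, odd indices hold cosines,
    with the frequency falling geometrically across the d/2 pairs.
    """
    if d % 2 != 0:
        raise ValueError("d must be even")
    p = [0.0] * d
    for t in range(d // 2):
        angle = position / base ** (2 * t / d)
        p[2 * t] = math.sin(angle)       # p_{i,2t}
        p[2 * t + 1] = math.cos(angle)   # p_{i,2t+1}
    return p

p0 = sinusoidal_position_encoding(0, 8)   # position 0: sin = 0, cos = 1
p5 = sinusoidal_position_encoding(5, 8)
```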

Relative position encoding
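$\left\{\begin{matrix} f_q(x_m) = W_q x_m \\ f_k(x_n, n) = W_k(x_n + \hat{p}_r^k) \\ f_v(x_n, n) = W_v(x_n + \hat{p}_r^v) \end{matrix}\right.$
- The core idea is to replace the sinusoidal terms of the absolute position encoding: learnable relative position encodings are added to $k$ and $v$, while $q$ receives no offset. $r$ denotes the relative distance.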

Transformer-based models pass position information mainly through attention.

$\operatorname{Attention}(Q,K,V) = \operatorname{Softmax}\left(\dfrac{QK^T\odot M}{\sqrt{d_k}}\right)V$

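From this formula it is easy to see that information passes between tokens at different positions through $q^T_mk_n$. We therefore want the inner product to encode relative position information, i.e.: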

$\left<f_q(x_m,m),f_k(x_n,n)\right> = g(x_m, x_n, m-n)$

Using the properties of complex numbers, the encoding functions are designed as:

$\begin{array}{c} f_q(x_m, m) = (W_qx_m)e^{im\theta} \\ f_k(x_n, n) = (W_kx_n)e^{in\theta} \\ g(x_m, x_n, m-n) = \text{Re}\left[(W_qx_m)(W_kx_n)^*e^{i(m-n)\theta}\right] \end{array}$
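
As a quick numerical check of the complex form, the sketch below treats $W_qx_m$ and $W_kx_n$ as single complex numbers (the 2-D case) and confirms that the score depends only on $m-n$. The values `wq_x`, `wk_x` and `theta` are arbitrary illustrative numbers, and the helper names `f` and `score` are hypothetical.

```python
import cmath

def f(w_x, position, theta):
    """Rotate the complex number w_x (standing in for W x) by position * theta."""
    return w_x * cmath.exp(1j * position * theta)

def score(q, k):
    """Inner product of two complex numbers viewed as 2-D real vectors: Re[q * conj(k)]."""
    return (q * k.conjugate()).real

theta = 0.3
wq_x = complex(0.8, -1.1)   # illustrative stand-in for W_q x_m
wk_x = complex(-0.4, 0.9)   # illustrative stand-in for W_k x_n

# Two position pairs with the same offset m - n = 4.
s_near = score(f(wq_x, 7, theta), f(wk_x, 3, theta))
s_far = score(f(wq_x, 104, theta), f(wk_x, 100, theta))

# Closed form g(x_m, x_n, m - n) from the equation above.
g = (wq_x * wk_x.conjugate() * cmath.exp(1j * 4 * theta)).real
```

Both `s_near` and `s_far` agree with `g` up to floating-point error, because the absolute positions cancel in $e^{im\theta}\,(e^{in\theta})^* = e^{i(m-n)\theta}$.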

Written in rotation-matrix form:

$f_{\{q, k\}}\left(\boldsymbol{x}_{m}, m\right)=\left(\begin{array}{cc} \cos m \theta & -\sin m \theta \\ \sin m \theta & \cos m \theta \end{array}\right)\left(\begin{array}{ll} W_{\{q, k\}}^{(11)} & W_{\{q, k\}}^{(12)} \\ W_{\{q, k\}}^{(21)} & W_{\{q, k\}}^{(22)} \end{array}\right)\binom{x_{m}^{(1)}}{x_{m}^{(2)}}$

$f_{\{q, k\}}\left(\boldsymbol{x}_{m}, m\right)=\boldsymbol{R}_{\Theta, m}^{d} \boldsymbol{W}_{\{q, k\}} \boldsymbol{x}_{m}$

$\scriptsize \begin{array}{c} \boldsymbol{R}_{\Theta, m}^{d}=\left(\begin{array}{ccccccc} \cos m \theta_{1} & -\sin m \theta_{1} & 0 & 0 & \cdots & 0 & 0 \\ \sin m \theta_{1} & \cos m \theta_{1} & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m \theta_{2} & -\sin m \theta_{2} & \cdots & 0 & 0 \\ 0 & 0 & \sin m \theta_{2} & \cos m \theta_{2} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m \theta_{d / 2} & -\sin m \theta_{d / 2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m \theta_{d / 2} & \cos m \theta_{d / 2} \end{array}\right) \end{array}$

where $d$ is even and $\Theta=\left\{\theta_{i}=10000^{-2(i-1) / d}, i \in[1,2, \ldots, d / 2]\right\}$. Applying this to self-attention gives:

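$\boldsymbol{q}_{m}^{\top} \boldsymbol{k}_{n}=\left(\boldsymbol{R}_{\Theta, m}^{d} \boldsymbol{W}_{q} \boldsymbol{x}_{m}\right)^{\top}\left(\boldsymbol{R}_{\Theta, n}^{d} \boldsymbol{W}_{k} \boldsymbol{x}_{n}\right)=\boldsymbol{x}_{m}^{\top} \boldsymbol{W}_{q}^{\top} \boldsymbol{R}_{\Theta, n-m}^{d} \boldsymbol{W}_{k} \boldsymbol{x}_{n}$

where $\boldsymbol{R}_{\Theta, n-m}^{d}=\left(\boldsymbol{R}_{\Theta, m}^{d}\right)^{\top} \boldsymbol{R}_{\Theta, n}^{d}$, since the transpose of a rotation is the rotation by the opposite angle.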
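
The identity behind this step, that the transpose of the position-$m$ rotation times the position-$n$ rotation equals the rotation for offset $n-m$, can be checked numerically. The following is a small pure-Python sketch under the definition of $\Theta$ given above; the helper names (`rope_angles`, `rotation_matrix`, `matmul`, `transpose`) are hypothetical and the dense matrices are built only for illustration.

```python
import math

def rope_angles(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1..d/2 (the loop index here is i-1)."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    return [base ** (-2 * i / d) for i in range(d // 2)]

def rotation_matrix(m, d):
    """Block-diagonal d x d matrix: one 2x2 rotation by m * theta_i per block."""
    R = [[0.0] * d for _ in range(d)]
    for i, theta in enumerate(rope_angles(d)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

d, m, n = 6, 4, 9
lhs = matmul(transpose(rotation_matrix(m, d)), rotation_matrix(n, d))
rhs = rotation_matrix(n - m, d)

# Largest element-wise difference between R_m^T R_n and R_{n-m}.
max_err = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
```

`max_err` stays at floating-point noise level, which is why only the offset $n-m$ survives in $\boldsymbol{q}_m^\top\boldsymbol{k}_n$.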

The encoding is multiplicative: relative position information is incorporated through the product of rotation matrices.

$\small \begin{array}{c} \operatorname{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{m}=\dfrac{\sum_{n=1}^{N} \phi\left(\boldsymbol{q}_{m}\right)^{\top} \varphi\left(\boldsymbol{k}_{n}\right) \boldsymbol{v}_{n}}{\sum_{n=1}^{N} \phi\left(\boldsymbol{q}_{m}\right)^{\top} \varphi\left(\boldsymbol{k}_{n}\right)} \\ \Downarrow \\ \operatorname{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{m}=\dfrac{\sum_{n=1}^{N}\left(\boldsymbol{R}_{\Theta, m}^{d} \phi\left(\boldsymbol{q}_{m}\right)\right)^{\top}\left(\boldsymbol{R}_{\Theta, n}^{d} \varphi\left(\boldsymbol{k}_{n}\right)\right) \boldsymbol{v}_{n}}{\sum_{n=1}^{N} \phi\left(\boldsymbol{q}_{m}\right)^{\top} \varphi\left(\boldsymbol{k}_{n}\right)} \end{array}$

$\phi(x)=\varphi(x) = \text{elu}(x) + 1$
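
The rotated linear-attention formula can be evaluated directly as a sketch. The code below is an assumption-laden illustration rather than an efficient implementation: it uses the $\text{elu}(x)+1$ feature map from above, rotates only the numerator terms, leaves the denominator unrotated as in the formula, and loops over all pairs in $O(N^2)$ instead of using the reordering that makes linear attention linear. Positions are 0-based here, which leaves the relative offsets unchanged. All function names are hypothetical.

```python
import math

def elu_plus_one(x):
    """Feature map elu(v) + 1: v + 1 for v > 0, exp(v) otherwise (always positive)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def rope(x, m, base=10000.0):
    """Rotate consecutive pairs of x by m * theta_i (d = len(x) must be even)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def linear_attention_rope(Q, K, V):
    """Direct evaluation of the rotated linear-attention formula, one row per query."""
    outputs = []
    for m, q in enumerate(Q):
        phi_q = elu_plus_one(q)
        numerator = [0.0] * len(V[0])
        denominator = 0.0
        for n, (k, v) in enumerate(zip(K, V)):
            phi_k = elu_plus_one(k)
            weight = dot(rope(phi_q, m), rope(phi_k, n))   # rotated score
            denominator += dot(phi_q, phi_k)               # unrotated, stays positive
            numerator = [acc + weight * vi for acc, vi in zip(numerator, v)]
        outputs.append([acc / denominator for acc in numerator])
    return outputs

Q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
K = [[0.3, 0.4], [-0.6, 0.2], [0.7, -0.1]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
out = linear_attention_rope(Q, K, V)
```

Keeping the denominator unrotated avoids dividing by a sum that could approach zero, since the rotated scores are no longer guaranteed to be positive.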

$\small \boldsymbol{R}_{\Theta, m}^{d}\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots\\ x_{d-1} \\ x_d \\ \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \\ \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots\\ -x_d \\ x_{d-1}\\ \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \\ \end{pmatrix}$
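
This element-wise form avoids building the sparse $d \times d$ matrix. A minimal pure-Python sketch of it follows; `apply_rope` and `dot` are hypothetical helper names, the vectors `q` and `k` are arbitrary illustrative values, and $\theta_i = 10000^{-2(i-1)/d}$ is taken from the definition above.

```python
import math

def apply_rope(x, m, base=10000.0):
    """Compute R_{Theta,m} x as x * cos + rotate_pairs(x) * sin, element-wise."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("d must be even")
    cos, sin, rotated = [], [], []
    for i in range(d // 2):
        angle = m * base ** (-2 * i / d)
        cos += [math.cos(angle)] * 2            # each angle is used for two coordinates
        sin += [math.sin(angle)] * 2
        rotated += [-x[2 * i + 1], x[2 * i]]    # (-x2, x1, -x4, x3, ...)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rotated, sin)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

# Same offset n - m = 7 at two different absolute positions.
score_a = dot(apply_rope(q, 3), apply_rope(k, 10))
score_b = dot(apply_rope(q, 50), apply_rope(k, 57))
```

`score_a` and `score_b` match up to rounding, and `apply_rope` preserves the vector norm because each pair of coordinates is only rotated.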

RoPE

def

$L(D, N, P) \approx A*N^{-\alpha} + B*D^{-\beta} + C^{-\gamma}$