Location
- File: chapters/nlp-book-chapter8.pdf
- Page: 57
- Section: 8.3.5.2 Attention with Non-learned Biases
Problem Description
I noticed a discrepancy in the formula for the ALiBi slopes ($m$) presented in the book compared to the original paper (Press et al., 2022).
The book appears to present the slope formula as:
$$\beta_k = \frac{1}{2^{\frac{8}{k}}} \quad (\text{or similar incorrect form where } k \text{ is in the denominator})$$
However, according to the ALiBi paper, for a model with $n$ heads, the slopes form a geometric sequence:
"In general, for $n$ heads, our set of slopes is the geometric sequence that starts at $2^{\frac{-8}{n}}$ and uses that same value as its ratio."
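As a quick numerical check of the discrepancy, the sketch below builds the sequence term by term from the quoted sentence and compares it with the form I believe the book uses. The helper names (`paper_slopes`, `book_like_slopes`, `is_geometric`) are hypothetical, and `book_like_slopes` assumes the book's formula really is $2^{-8/k}$ as transcribed above.

```python
def paper_slopes(n_heads: int) -> list[float]:
    """Geometric sequence from the quoted sentence: start and ratio are both 2^(-8/n)."""
    start = ratio = 2.0 ** (-8.0 / n_heads)
    return [start * ratio ** i for i in range(n_heads)]


def book_like_slopes(n_heads: int) -> list[float]:
    """Assumed book form with the head index k in the denominator: 2^(-8/k)."""
    return [2.0 ** (-8.0 / k) for k in range(1, n_heads + 1)]


def is_geometric(seq: list[float], rel_tol: float = 1e-9) -> bool:
    """True if every consecutive ratio matches the first one within rel_tol."""
    ratios = [b / a for a, b in zip(seq, seq[1:])]
    return all(abs(r - ratios[0]) <= rel_tol * abs(ratios[0]) for r in ratios)


if __name__ == "__main__":
    n = 8
    print("paper form is geometric:", is_geometric(paper_slopes(n)))
    print("book-like form is geometric:", is_geometric(book_like_slopes(n)))
```

Under these assumptions, the form with $k$ in the denominator does not have a constant ratio between consecutive heads, and it grows with $k$ instead of shrinking.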
Reasoning & Derivation
Based on the definition provided in the original paper:
- Start term: $2^{-8/n}$
- Ratio: $2^{-8/n}$
- Head index: $k$ ($1, \dots, n$)
The slope $m_k$ for the $k$-th head is the $k$-th term of this sequence. Because the start term and the ratio are the same value, the standard term formula collapses to a single power:
$$
\begin{aligned}
m_k &= (\text{Start}) \times (\text{Ratio})^{k-1} \\
&= 2^{-8/n} \cdot \left(2^{-8/n}\right)^{k-1} \\
&= \left(2^{-8/n}\right)^{k} \\
&= 2^{-\frac{8k}{n}} \\
&= \frac{1}{2^{\frac{8k}{n}}}
\end{aligned}
$$
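The head index $k$ must therefore appear in the numerator of the exponent, so that the slopes decrease geometrically from $2^{-8/n}$ at $k = 1$ to $2^{-8}$ at $k = n$. For $n = 8$ this gives $\frac{1}{2}, \frac{1}{4}, \dots, \frac{1}{256}$.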
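A minimal sketch of the closed form, in case it helps when checking the corrected text. The function name `alibi_slopes` is my own, and it implements only the formula $m_k = 2^{-8k/n}$ derived above. It does not try to reproduce any special handling a real implementation might apply for particular head counts.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Closed-form slopes m_k = 2^(-8k/n) for head indices k = 1..n."""
    if n_heads < 1:
        raise ValueError("n_heads must be a positive integer")
    return [2.0 ** (-8.0 * k / n_heads) for k in range(1, n_heads + 1)]


if __name__ == "__main__":
    for n in (4, 8, 16):
        slopes = alibi_slopes(n)
        # The first slope is the start term 2^(-8/n); the last is always 2^(-8).
        print(n, slopes[0], slopes[-1])
```

For any $n$, the last slope should come out as $2^{-8} = 1/256$ and the first as $2^{-8/n}$, matching the start term in the paper's description.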
Suggested Fix
Please update the formula to reflect the correct definition:
$$
m_k = \frac{1}{2^{\frac{8 \cdot k}{n}}}
$$
(where $n$ is the total number of heads and $k \in \{1, \dots, n\}$ is the head index)
Thank you for maintaining this excellent resource.