This section describes the fast integer square root algorithm used by vl_fast_sqrt_ui8, vl_fast_sqrt_ui16, vl_fast_sqrt_ui32, vl_fast_sqrt_ui64.

Given a non-negative integer \(x \in \mathbb{Z}_+\), the goal of this algorithm is to quickly compute the integer approximation of the square root of an integer number:

\[ y = \max_{\bar y\in\mathbb{Z}} \bar y, \qquad \text{such that}\ \bar y^2 \leq x. \]

Consider determining the k-th bit of \(y\). To this end, decompose \(y\) in three parts:

\[ y = y_{k+1} + q 2^k + r, \qquad \text{where}\ y_{k+1} \geq 2^{k+1}, r < 2^k, \]

and \(q\in\{0,1\}\) is the bit to be determined. Here \(y_{k+1}\) is a part of the result \(y\) that has already been determined, while the bit \(q\) and the remainder \(r\) are still unknown. Recall that the goal is to find the largest \(y^2\) such that \(y^2 \leq x\). Expanding \(y^2\) this condition becomes

\[ q (2^{2k} + 2 y_{k+1} 2^k) + r(r + 2q 2^k + 2 y_{k+1}) \leq x - y_{k+1}^2. \]

We can now determine if \(q=1\) or \(q=0\) based on the value of the residual \(x - y_{k+1}^2\). Specifically, \(q=1\) requires that:

\[ \boxed{ 2^{2k} + 2a2^k \leq x - y_{k+1}^2. } \]

On the other hand, if this equation is satisfied, then setting \(r=0\) shows that there exists at least one \(y\) such that \(q=1\) and \(y^2 \leq x\). In particular, greedily choosing \(q=1\) in \(x=y_{k+1} + 2^k q + r\) is optimal because \(2^k > r\). This yields the algorithm:

- Note that if \(x\) is stored in \(n\) bits and \(n\) is even, then the integer square root \(y\) does not require more than \(m = n / 2\) bit to be stored. Thus the first bit to be determined is \(k \leftarrow m - 1 = n/2 - 1\) and \(y_{n/2}=0\).
- The algorithm stores and updates \(y_k/2^{k}\) and \(x - y_{k}^2\) for convenience.
- During iteration \(k\), \(y_k\) is determined. On entering the iteration, the first step is to compute \(y_{k+1}/2^k = 2 y_{k+1}/2^{k+1}\).
- Then the bound \(t = (2^{2k} + 2 y_{k+1})2^k = 2^{2k}(1 + 2 y_{k+1}/2^k)\).
- If \(t \geq x - y_{k+1}\), the \(k\)-th bit of \(y_k\) is set to one. This means applying the update \(\hat y_{k}/2^k \leftarrow y_{k+1}/2^k + 1\). This also requires computing \(x - y_{k}^2 \leftarrow x - y_{k+1}^2 - t\).
- Decrement \(k \leftarrow k -1\) and, if \(k\geq 0\), continue from 3.