Variations on Montgomery’s trick

Peter Montgomery is a cryptographer at Microsoft. Just recently, Joppe Bos and Arjen Lenstra have edited a book titled Topics in Computational Number Theory Inspired by Peter L. Montgomery. The chapters range to Montgomery’s work on elliptic curves, factoring, evaluating polynomials, computing null-spaces of matrices over finite fields, and FFT extensions for algebraic groups. Bos and Lenstra say in their intro that most of this was “inspired by integer factorization, Peter’s main research interest since high school.” Factoring, always factoring…

Today we discuss one of Montgomery’s simpler but most influential ideas going back to a 1985 paper: how to compute mod ${N}$ when you can only do operations mod ${R}$.

We’ve previously thought in terms of the “make” and “break” sides of cryptography. Montgomery has certainly done a lot of work on the latter. He has even said to have demonstrated in the 1980s that the Data Encryption Standard as adopted in 1977 was vulnerable to attacks on even on IBM PCs.

However, we realize that the “make” side has three parts, two of which the “breakers” often help directly. The first part is the design of new cryptosystems. The only help from breakers is knowing what not to design. The second part however is their implementation. Any mathematical ideas used to make attacks faster might also help speed methods that are safer but were previously clunky. The third part is defense, specifically against timing attacks. Speedier execution is often smoother execution and hence less vulnerable to inference by timing.

## The Montgomery Reduction

The simple function conceived by Montgomery solves a problem whose most common case is as follows:

Compute ${a^x}$ modulo an odd number ${N}$ using only addition, multiplication, and operations modulo a power of ${2}$, that is, modulo ${R = 2^\ell}$. Pre-computed quantities such as ${R \bmod{N}}$ are also allowed.

Of course this is used for implementations of the RSA public-key cryptosystem. With binary notation, mod ${R}$ means ignoring all but the bottom ${\ell}$ bits and division by ${R}$ is just a right shift by ${\ell}$ bits. Montgomery multiplication has also been used to implement Peter Shor’s famous quantum factoring algorithm here and here. I and my student Chaowen Guan, who contributed some of the text and equations below, are using it to emulate Shor’s algorithm in a model where we translate quantum circuits into Boolean formulas in order to apply heuristic ${\mathsf{\#SAT}}$ solvers. We have a paper with Amlan Chakrabarti in the Springer LNCS journal Transactions on Computational Systems, which became the destination for this refinement of my old “cookbook” draft with Amlan which I described some years ago.

The basic trick is to compute the following function. Given ${N}$ and ${R}$ relatively prime, we can find integers ${s,t}$ such that

$\displaystyle Rs - Nt = 1.$

Then define the Montgomery reduction of any natural number ${a}$ by

$\displaystyle M\!R(a) = \frac{1}{R}(a + (at \pmod{R})\cdot N). \ \ \ \ \ (1)$

We have omitted the last line of the routine which Montgomery called “REDC” in his paper. This last line takes the output ${b}$ of ${M\!R(a)}$ as we have defined it and tests ${b < N}$; upon failure it outputs ${b - N}$ rather than ${b}$. We, however, want to avoid even this dependence on ${N}$. From several papers on avoiding this test, all surveyed in the new book's chapter on “Montgomery Arithmetic From a Software Perspective” by Bos and Montgomery himself, we follow this 2009 paper by Qiong Pu and Xiuying Zhao and section 2.2 of this earlier paper by Colin Walter and Susan Thompson (note the latter’s attention to timing attacks).

First we note that the value is always a natural number. Let ${c = at/\!/R}$ where ${/\!/}$ means integer division as in Python. Then ${at \bmod{R} = at - cR}$. We get:

$\displaystyle \begin{array}{rcl} M\!R(a) &=& \frac{1}{R}\left(a + (at - cR)\cdot N\right) \\ &=& \frac{1}{R}\left(a + aNt - cRN\right) \\ &=& \frac{1}{R}\left(a + a(Rs - 1) - cRN\right) \\ &=& as - cN. \end{array}$

So ${M\!R(a)}$ is an integer, and by (1) it is positive except that ${M\!R(0) = 0}$. Hence we can freely compose and iterate ${M\!R(\cdot)}$. What kind of function is it?

## Some Properties of MR

By our last calculation, ${M\!R}$ embodies integer division by ${R}$ both in the implicit use of ${c}$ and the explicit product with ${\frac{1}{R}}$. It does not, however, compute ${a/\!/R}$. Its actions on numbers decomposed via ${N}$ and via ${R}$ go as follows:

$\displaystyle \begin{array}{rcl} M\!R(a+bN) &=& \frac{1}{R}(a + bN + (at + bNt \pmod{R})N)\\ &=& \frac{1}{R}(a + bN + (at + b(Rs - 1) \pmod{R})N) \\ &=& \frac{1}{R}(a + bN + (at - b \pmod{R})N)\\ &=& \frac{1}{R}(a + bN + (at - b - c'R)N)\qquad \text{where } c' = (at-b)/\!/R\\ &=& \frac{1}{R}(a + atN -c'RN) \\ &=& \frac{1}{R}(a + a(Rs - 1) -c'RN)\\ &=& \frac{1}{R}(aRs - c'RN) \\ &=& as - c'N\\ &=& M\!R(a) + (c-c')N. \end{array}$

Thus ${M\!R(a+bN)}$ increases by one with ${b}$ only when subtracting ${b}$ from ${at}$ crosses a multiple of ${R}$. In that sense ${M\!R}$ imitates division by ${R}$. Moreover, when ${a}$ is relatively prime to ${N}$ then so is ${as}$ because ${aRs - aNt = a}$, so if ${as}$ shared a proper divisor with ${N}$ then so would ${a}$. Thus for such ${a}$, ${M\!R(a+bN)}$ is relatively prime to ${N}$ and in particular cannot be a multiple of ${N}$.

$\displaystyle \begin{array}{rcl} M\!R(a+bR) &=& \frac{1}{R}(a + bR + (at + bRt \pmod{R})N)\\ &=& \frac{1}{R}(a + bR + (at \pmod{R})N) \\ &=& b + M\!R(a). \end{array}$

It follows with ${a = 0}$ that ${M\!R(bR) = b}$ and ${M\!R(bR^2) = bR}$. Now presume ${R > N}$ and let ${R^k = qN + r_k}$. Then

$\displaystyle \begin{array}{rcl} M\!R(br_k) &=& M\!R(bR^k - bqN)\\ &=& bR^{k-1} + M\!R(-bqN) \\ &=& bR^{k-1} + \frac{1}{R}(-bqN + (-bqNt \pmod{R})N)\\ &=& bR^{k-1} + \frac{1}{R}(-bqN + (-bqRS + bq \pmod{R})N) \\ &=& bR^{k-1} + \frac{1}{R}(-bqN + (bq \pmod{R})N)\\ &=& bR^{k-1} + \frac{1}{R}(-bqN + (bq - \delta R)N)\qquad\text{ where } \delta = bq/\!/R\\ &=& bR^{k-1} - \delta N. \end{array}$

Thus with ${k = 1}$, ${M\!R(br)}$ has the same congruence modulo ${N}$ as ${M\!R(bR)}$, while with ${k = 2}$, ${M\!R(br_2) \equiv br \pmod{N}}$.

## Montgomery Multiplication and Inequalities

Most in particular, for any ${a,b \in \mathbb{Z}}$, if we define ${\bar{a} = aR}$ and ${\bar{b} = bR}$, then

$\displaystyle M\!R(\bar{a}\bar{b}) = M\!R(abR^2) = abR = \bar{ab}.$

Thus defining

$\displaystyle a \odot b = M\!R(ab)$

gives a commutative operator that when restricted to ${R\mathbb{Z}}$ acts like multiplication. Moreover, if we have ${a' = \bar{a} \pmod{N}}$ and ${b' = \bar{b} \pmod{N}}$ then with ${\alpha = aR/\!/N}$ and ${\beta = bR/\!/N}$ we have:

$\displaystyle \begin{array}{rcl} a' \odot b' &=& M\!R((\bar{a} - \alpha N)(\bar{b} - \beta N))\\ &=& M\!R(\bar{a}\bar{b} + \gamma N)\qquad(\text{where } \gamma = \alpha\beta - \alpha - \beta)\\ &=& M\!R(\bar{a}\bar{b}) + \gamma'N \qquad(\text{for some }\gamma')\\ &\equiv& \bar{a} \odot \bar{b} \pmod{N}. \end{array}$

Thus the multiplicative behavior of ${\odot}$ carries through mod ${N}$ on multiplying by ${R}$ modulo ${N}$. The ${\odot}$ operator is Montgomery multiplication proper. The idea is that one can do whole iterated products ${a' \odot b' \odot c' \odot \cdots}$ with one division by ${R}$ and one reduction mod ${R}$ per ${\odot}$ and have only the single reduction mod ${N}$ at the end.

But again we want to avoid even that one use. But there are cases where ${a \odot b < N}$ fails. The largest ratio for ${N \leq 1001}$ and ${R = 2^\ell}$ with ${\ell \leq 21}$ is ${1.96096096\dots}$ given for ${N = 999}$ and ${R = 1,\!024}$ by ${987 \odot 997 = 1,\!959}$. In search of a general result, we are helped by the following inequalities:

$\displaystyle \begin{array}{rcl} M\!R(a) &\leq& \frac{a}{R} + \frac{(R-1)N}{R} = N + \frac{a-N}{R}\\ &<& N + 1 \qquad\text{when } a < N+R\\ &\leq& N - 1 \qquad\text{when } a < N+R \text{ and } \gcd(a,N) = 1. \end{array}$

The last line follows because ${\gcd(a,N) = 1}$ prevents ${M\!R(a) = N}$. Finally note that if ${a,b \leq R}$ then

$\displaystyle a \odot b = M\!R(ab) \leq \frac{R^2 + RN - N}{R} < R + N.$

To use this iteratively, one wants the output ${c}$ to be ${ R}$ and if so return ${c - N}$ instead. The comparison with ${R}$ is simple to do.

## Iterated Multiplication

Suppose we want to multiply ${c = a_1,\dots,a_m}$ modulo ${N}$. We may suppose that each ${a_i}$ already satisfies ${1 \leq a_i \leq N-1}$. The problem with transforming each ${a_i}$ to ${\bar{a}_i = a_i R}$, besides the extra multiplication for each ${i}$, is going to magnitudes as high as ${R(N-1)}$. Transforming to ${a'_i = a_i R \bmod{N}}$ would defeat our purpose because while we can efficiently pre-compute ${s,t}$ giving ${Rs - Nt = 1}$, we cannot suppose the reductions mod ${N}$ for the ${a_i}$ to be pre-computed.

So we plunge ahead and do:

$\displaystyle b = a_1 \odot a_2 \odot a_3 \odot \cdots \odot a_m.$

First, how much can this grow? Put ${b_2 = a_1 \odot a_2}$. Since ${a_1\cdot a_2}$ can be greater than ${N+R}$, the precondition for ${M\!R(b_2) < N}$ is not met. But ${b_2 2N}$ then ${b_3 < 2N}$, so that we preserve the bound ${2N}$.

Note that when ${R}$ is a power of 2 it may need to approach ${4N}$, but that does not affect the ${2N}$ upper bound. The issue now is that ${b}$ includes ${m-1}$ extra factors of ${s}$, that is, ${b \equiv cs^{m-1} \pmod{N}}$. Hence ${c \equiv br^{m-1} \pmod{N}}$. If we can compute a number ${c'}$ such that ${c' \equiv br^m \pmod{N}}$ and ${0 < c' < n+R}$, then a final ${M\!R(c')}$ will give the final targeted value ${c}$ such that ${1 \leq c \leq N-1}$.

Thus, modulo ${N}$, we have reduced iterated multiplication to exponentiation. The extra ${O(\log m)}$ multiplications—or rather, ${O(\log m)}$ more applications of ${\odot}$—beats computing ${\bar{a}_i}$ or ${a'_i}$ for each ${i}$.

## Montgomery Exponentiation

When it comes to implementing fast modular exponentiation, we need to square a running product ${u}$ in the sense of computing ${u' = u \odot u}$ or something related. To maintain a constant bound ${u' < eN}$ given ${u dN}$, we need ${d}$ and ${e}$ so that:

$\displaystyle \frac{1}{R}((eN-1)^2 + (R-1)N) \leq eN-1,$

so

$\displaystyle e^2 N^2 - 2eN + 1 - N \leq R(eN - N - 1).$

Presupposing ${e > 1}$, we can substitute ${R = dN}$ to require

$\displaystyle e^2 N^2 - (2e+1)N + 1 \leq edN^2 - dN^2 - dN.$

Ignoring lower-order terms gives us ${e^2 \leq ed - d}$, so ${e^2 - de + d \leq 0}$. The discriminant is ${d^2 - 4d}$ so we need ${d \geq 4}$, which yields ${e = 2}$ to realize the same ${2N}$ upper bound. Since ${R}$ is a power of 2, ${R}$ may need to be chosen nearly ${8N}$ to use this guarantee. In hardware contexts this is annoying since ${R}$ may thus need one more machine word than ${N}$ (or else ${N}$ must be restricted to ${1/8}$ the possible size), but when working with individual bits—or qubits—this extra overhead from squaring is no big deal.

It remains to show how to imitate standard algorithms for fast exponentiation. Here is one, using (*) to mean ${\odot}$. Recall that ${r = R \bmod{N}}$ and ${r_2 = R^2 \bmod{N}}$ can be pre-computed.

          fastexp(a,x):                   montyfastexp(a,x): # needs R > 4N
A = a                           A = a (*) r_2   # cong. to ar
X = x                           X = x
Y = 1                           Y = r
while X > 0:                    while X > 0:
if X is even:                   if X is even:
A = (A * A)                     A = A (*) A
X = X // 2                      X = X // 2
else:                           else:
Y = (A * Y)                     Y = A (*) Y
X = X - 1                       X = X - 1
return Y                        return Y (*) 1


By induction, A and Y always have an “extra ${r}$” until the return ${Y \odot 1 = M\!R(Y)}$ strips it away. They always have magnitude at most ${2N < N + R}$, so that the final ${M\!R(Y)}$ returns a value ${< N}$. This value equals ${a^x \bmod{N}}$.

An example of why requiring only ${R > 2N}$ can fail is ${N = 123}$, ${R = 256}$, ${a = 76}$, ${e = 91}$: ${a^e = 13 \bmod{N}}$ but the right-hand code outputs ${136}$, which has the right congruence but is not ${ 3N}$; there is none with ${N R}$ as in the aforementioned papers, or returning ${Y \odot r \odot 1}$ as the last line of our code, make it work with smaller ${R}$.

## Open Problems

Our little “scholium” has turned up some interesting congruence properties. Does the above suggest any further ideas and savings?

[removed erroneous “Lemma 1”]

1. February 6, 2018 8:12 pm

It is interesting to note that a longstanding bug in OpenSSL found in 2016 came from improper implementation of Montgomery multipliers for larger than byte-length values of R.

On a separate note, have you thought about the parallel depth of these algorithms?

• February 6, 2018 8:21 pm

Wow about the bug—and there seems to be a bug in my Lemma 1: With N = 131, R = 2048, r = 83 which is primitive, but 21 = r^{98}, 110 = r^{33}, and 21 (*) 110 = 132 > 131. Maybe the exponents summing to N upsets something in the proof I gave. I saw papers about parallel Montgomery implementations but did not look at them.