# Doing Mod N Via Mod R

*Variations on Montgomery’s trick*

Peter Montgomery is a cryptographer at Microsoft. Just recently, Joppe Bos and Arjen Lenstra have edited a book titled *Topics in Computational Number Theory Inspired by Peter L. Montgomery*. The chapters range over Montgomery’s work on elliptic curves, factoring, evaluating polynomials, computing null-spaces of matrices over finite fields, and FFT extensions for algebraic groups. Bos and Lenstra say in their intro that most of this was “inspired by integer factorization, Peter’s main research interest since high school.” Factoring, always factoring…

Today we discuss one of Montgomery’s simpler but most influential ideas going back to a 1985 paper: how to compute mod $N$ when you can only do operations mod $R$.

We’ve previously thought in terms of the “make” and “break” sides of cryptography. Montgomery has certainly done a lot of work on the latter. He is even said to have demonstrated in the 1980s that the Data Encryption Standard as adopted in 1977 was vulnerable to attacks even on IBM PCs.

However, we realize that the “make” side has three parts, two of which the “breakers” often help directly. The first part is the *design* of new cryptosystems. The only help from breakers is knowing what not to design. The second part however is their *implementation*. Any mathematical ideas used to make attacks faster might also help speed methods that are safer but were previously clunky. The third part is *defense*, specifically against timing attacks. Speedier execution is often smoother execution and hence less vulnerable to inference by timing.

## The Montgomery Reduction

The simple function conceived by Montgomery solves a problem whose most common case is as follows:

Compute $x \bmod N$ for an odd number $N$ using only addition, multiplication, and operations modulo a power of $2$, that is, modulo $R = 2^k$ where $R > N$. Pre-computed quantities such as $R^2 \bmod N$ are also allowed.

Of course this is used for implementations of the RSA public-key cryptosystem. With binary notation, mod $R$ means ignoring all but the bottom $k$ bits, and division by $R$ is just a right shift by $k$ bits. Montgomery multiplication has also been used to implement Peter Shor’s famous quantum factoring algorithm here and here. I and my student Chaowen Guan, who contributed some of the text and equations below, are using it to emulate Shor’s algorithm in a model where we translate quantum circuits into Boolean formulas in order to apply heuristic solvers. We have a paper with Amlan Chakrabarti in the Springer LNCS journal *Transactions on Computational Science*, which became the destination for this refinement of my old “cookbook” draft with Amlan which I described some years ago.

The basic trick is to compute the following function. Given $N$ and $R$ relatively prime, we can find integers $R', N'$ such that

$$RR' - NN' = 1, \qquad 0 < R' < N, \qquad 0 < N' < R. \qquad (1)$$

Then define the *Montgomery reduction* of any natural number $x$ by

$$MR(x) \;=\; \frac{x + (xN' \bmod R)\,N}{R}.$$

We have omitted the last line of the routine which Montgomery called “REDC” in his paper. This last line takes the output $t$ of $MR$ as we have defined it and tests $t < N$; upon failure it outputs $t - N$ rather than $t$. We, however, want to avoid even this dependence on $N$. From several papers on avoiding this test, all surveyed in the new book’s chapter on “Montgomery Arithmetic From a Software Perspective” by Bos and Montgomery himself, we follow this 2009 paper by Qiong Pu and Xiuying Zhao and section 2.2 of this earlier paper by Colin Walter and Susan Thompson (note the latter’s attention to timing attacks).

First we note that the value $MR(x)$ is always a natural number. Let $u = xN' \bmod R$ and $d = xN' \,//\, R$, where $//$ means integer division as in Python, so that $u = xN' - dR$. We get:

$$x + uN \;=\; x + (xN' - dR)N \;=\; x(1 + NN') - dRN \;=\; xRR' - dRN \;=\; R(xR' - dN).$$

So $MR(x) = xR' - dN$ is an integer, and by (1) it is positive except that $MR(0) = 0$. Hence we can freely compose and iterate $MR$. What kind of function is it?
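As a concrete sketch, here is the reduction in Python together with Montgomery’s original REDC and its final conditional subtraction. The helper names (`mont_params`, `mr`, `redc`, `Rp`, `Np`) are mine, not from the paper, and `pow(R, -1, N)` needs Python 3.8+:

```python
# A sketch of MR and Montgomery's REDC for odd N and R = 2**k > N.
# Helper names are mine, not from the post.

def mont_params(N, k):
    """Return R, R', N' satisfying R*R' - N*N' = 1 as in the text."""
    R = 1 << k
    Rp = pow(R, -1, N)          # R' = R^{-1} mod N  (Python 3.8+)
    Np = (R * Rp - 1) // N      # exact division: then R*Rp - N*Np == 1
    return R, Rp, Np

def mr(x, N, k, Np):
    """MR(x) = (x + (x*N' mod R)*N) / R -- an exact integer division."""
    R = 1 << k
    u = (x * Np) & (R - 1)      # mod R: keep only the bottom k bits
    return (x + u * N) >> k     # division by R: right shift by k bits

def redc(x, N, k, Np):
    """Montgomery's REDC: MR plus the final test we wish to avoid."""
    t = mr(x, N, k, Np)
    return t - N if t >= N else t
```

For $x < RN$ the output of `mr` is below $2N$, and `redc` finishes the reduction to $[0, N)$.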

## Some Properties of MR

By our last calculation, $MR$ embodies integer division by $R$ both in the implicit use of $d$ and the explicit product with $R'$. It does not, however, compute $x \,//\, R$. Its actions on numbers decomposed via $R$ and via $N$ go as follows:

$$MR(aR + b) \;=\; a + MR(b) \qquad (0 \le b < R),$$

$$MR(cN + e) \;=\; MR(e) + tN, \qquad t = \left\lceil \frac{c - (eN' \bmod R)}{R} \right\rceil.$$

Thus $t$ increases by one with $c$ only when subtracting $c$ from $eN' \bmod R$ crosses a multiple of $R$. In that sense $MR$ imitates division by $R$. Moreover, when $x$ is relatively prime to $N$ then so is $MR(x)$, because $R \cdot MR(x) = x + uN$, so if $MR(x)$ shared a proper divisor with $N$ then so would $x$. Thus for such $x$, $MR(x)$ is relatively prime to $N$ and in particular cannot be a multiple of $N$.

It follows with $x = cN + e$, $0 \le e < N$, that $MR(x) \equiv MR(e) \pmod N$ and $MR(x) \equiv xR' \pmod N$. Now presume $x < RN$ and let $y = MR(x)$. Then

$$y \;=\; \frac{x + (xN' \bmod R)\,N}{R} \;\le\; \frac{x + (R-1)N}{R} \;<\; \frac{x}{R} + N \;<\; 2N.$$

Thus with $x \equiv x' \pmod N$, $MR(x)$ has the same congruence modulo $N$ as $MR(x')$, while with $x < RN$, $MR(x) < 2N$.
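These facts are easy to check numerically. A small sketch with toy parameters (all helper names are mine):

```python
# Numerical check of the decomposition identity and the congruence
# MR(x) ≡ x*R' (mod N). Toy parameters; helper names are mine.

N, k = 97, 8
R = 1 << k                      # R = 256 > N
Rp = pow(R, -1, N)              # R' = R^{-1} mod N  (Python 3.8+)
Np = (R * Rp - 1) // N          # so R*Rp - N*Np == 1

def mr(x):
    u = (x * Np) & (R - 1)      # x*N' mod R
    return (x + u * N) >> k     # exact division by R

for a in range(5):
    for b in range(0, R, 37):
        assert mr(a * R + b) == a + mr(b)        # action on x = aR + b
for x in range(1, R * N, 997):
    assert mr(x) % N == (x * Rp) % N             # MR(x) ≡ x*R' (mod N)
    assert mr(x) * R <= x + (R - 1) * N          # whence MR(x) < x/R + N
```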

## Montgomery Multiplication and Inequalities

Most particularly, for any $a, b$, if we define $x = aR \bmod N$ and $y = bR \bmod N$, then

$$MR(xy) \;\equiv\; xyR' \;\equiv\; (aR)(bR)R' \;\equiv\; abR \pmod{N}.$$

Thus defining

$$x \circledast y \;=\; MR(xy)$$

gives a commutative operator that when restricted to multiples of $R$ modulo $N$ acts like multiplication. Moreover, if we have $x \equiv aR \pmod N$ and $y \equiv bR \pmod N$ then with $c = ab \bmod N$ and $z = x \circledast y$ we have:

$$z \;\equiv\; xyR' \;\equiv\; abR^2R' \;\equiv\; abR \;\equiv\; cR \pmod{N}.$$

Thus the multiplicative behavior of $\circledast$ carries through mod $N$ on multiplying by $R$ modulo $N$. The operator $\circledast$ is **Montgomery multiplication** proper. The idea is that one can do whole iterated products with one division by $R$ and one reduction mod $R$ per $\circledast$ and have only the single reduction mod $N$ at the end.

But again we want to avoid even that one use. But there are cases where $x \circledast y < N$ fails even for $x, y < N$. In search of a general result, we are helped by the following inequalities:

$$x \circledast y \;=\; \frac{xy + (xyN' \bmod R)N}{R} \;\le\; \frac{xy + (R-1)N}{R} \;<\; \frac{xy}{R} + N.$$

The last line follows because $N \ge 1$ prevents $\frac{(R-1)N}{R} \ge N$. Finally note that if $x, y < 2N$ and $R \ge 4N$ then

$$x \circledast y \;<\; \frac{4N^2}{R} + N \;\le\; N + N \;=\; 2N.$$

To use this iteratively, one wants the output to stay $< 2N$, which the last inequality guarantees when $R \ge 4N$; any comparison against $R$, a power of $2$, is simple to do.
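A sketch of this closure property with toy parameters (helper names mine): with $R \ge 4N$, inputs below $2N$ always give an output below $2N$.

```python
# Montgomery multiplication x (*) y = MR(x*y), plus a check that R >= 4N
# keeps outputs below 2N when inputs are. Toy parameters; names are mine.

N, k = 97, 9                     # R = 512 >= 4N = 388
R = 1 << k
Rp = pow(R, -1, N)               # Python 3.8+ modular inverse
Np = (R * Rp - 1) // N           # R*R' - N*N' = 1

def mont_mul(x, y):
    t = x * y
    u = (t * Np) & (R - 1)       # (xy)*N' mod R
    return (t + u * N) >> k      # MR(xy)

for x in range(0, 2 * N, 13):
    for y in range(0, 2 * N, 17):
        z = mont_mul(x, y)
        assert z < 2 * N                      # closure below 2N
        assert z % N == (x * y * Rp) % N      # z ≡ xyR' (mod N)
```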

## Iterated Multiplication

Suppose we want to multiply $a_1 a_2 \cdots a_n$ modulo $N$. We may suppose that each $a_i$ already satisfies $a_i < N$. The problem with transforming each $a_i$ to $a_i R$, besides the extra multiplication for each $i$, is going to magnitudes as high as $RN$. Transforming to $a_i R \bmod N$ would defeat our purpose because while we can efficiently pre-compute $r_2 = R^2 \bmod N$, giving $a_i \circledast r_2 \equiv a_i R \pmod N$, we cannot suppose the reductions mod $N$ of the values $a_i R$ to be pre-computed.

So we plunge ahead and do:

$$P_2 = a_1 \circledast a_2, \qquad P_{j+1} = P_j \circledast a_{j+1} \quad (2 \le j < n).$$

First, how much can this grow? Put $B = 2N$. Since $P_j$ can be greater than $N$, the precondition $x, y < N$ is not met. But then $P_{j+1} < \frac{2N \cdot N}{R} + N < 2N$ for $R > 2N$, so that we preserve the bound $P_j < 2N$.

Note that since $R$ is a power of 2 it may need to approach $4N$, but that does not affect the upper bound. The issue now is that $P_n$ includes extra factors of $R'$, that is, $P_n \equiv a_1 a_2 \cdots a_n (R')^{\,n-1} \pmod N$. Hence $P_n R^{\,n-1} \equiv a_1 \cdots a_n \pmod N$. If we can compute a number $S$ such that $S \equiv R^n \pmod N$ and $S < 2N$, then a final $P_n \circledast S$ will give the final targeted value $P$ such that $P \equiv a_1 \cdots a_n \pmod N$ and $P < 2N$.

Thus, modulo computing $S$, we have reduced iterated multiplication to exponentiation. The extra $O(\log n)$ multiplications—or rather, more applications of $\circledast$—beat computing $a_i R \bmod N$ or $a_i \circledast r_2$ for each $i$.
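The whole scheme can be sketched as follows. For brevity this sketch computes $S \equiv R^n \pmod N$ with Python’s built-in `pow`; the point in the text is that $S$ can itself be obtained by $\circledast$-exponentiation. Toy parameters; helper names are mine.

```python
import math

# Iterated product a_1 * ... * a_n mod N using one (*) per factor and a
# final fix-up by S ≡ R^n (mod N). Toy parameters; helper names are mine.

N, k = 97, 9                       # R = 512 > 4N
R = 1 << k
Rp = pow(R, -1, N)                 # Python 3.8+ modular inverse
Np = (R * Rp - 1) // N

def mont_mul(x, y):
    t = x * y
    u = (t * Np) & (R - 1)
    return (t + u * N) >> k        # ≡ x*y*R^{-1} (mod N), stays < 2N here

a = [5, 23, 81, 40, 7, 66]
P = a[0]
for ai in a[1:]:
    P = mont_mul(P, ai)            # P_n ≡ (prod a_i) * R^{-(n-1)} (mod N)

S = pow(R, len(a), N)              # S ≡ R^n (mod N); obtainable via (*) too
P = mont_mul(P, S)                 # strips all R factors: P ≡ prod a_i (mod N)
```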

## Montgomery Exponentiation

When it comes to implementing fast modular exponentiation, we need to square a running product in the sense of computing $Y \circledast Y$ or something related. To maintain a constant bound $Y < B$ given $Y \circledast Y < \frac{Y^2}{R} + N$, we need $B$ and $R$ so that:

$$\frac{B^2}{R} + N \;\le\; B,$$

so

$$B^2 - RB + RN \;\le\; 0.$$

Presupposing $B = cN$, we can substitute to require

$$c^2 N^2 - cNR + RN \;\le\; 0.$$

Ignoring lower-order terms gives us $c^2 N \le cR$, so $R \ge cN$. The discriminant of $B^2 - RB + RN$ is $R^2 - 4RN$, so we need $R \ge 4N$, which yields $B = 2N$ to realize the same upper bound. Since $R$ is a power of 2, it may need to be chosen nearly $8N$ to use this guarantee. In hardware contexts this is annoying since $R$ may thus need one more machine word than $N$ (or else $N$ must be restricted to the possible size), but when working with individual bits—or qubits—this extra overhead from squaring is no big deal.
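The $R \ge 4N$ requirement can be checked numerically. In this sketch (toy numbers, helper names mine), squaring below $2N$ stays below $2N$ at $R = 512 \ge 4N$, but escapes the bound at $R = 256 < 4N$:

```python
# Checking the squaring bound: MR(Y^2) < 2N for all Y < 2N holds when
# R >= 4N (non-positive quadratic B^2 - R*B + R*N at B = 2N), and can
# fail for smaller R. Toy numbers; helper names are mine.

def make_mr(N, k):
    R = 1 << k
    Rp = pow(R, -1, N)                  # Python 3.8+ modular inverse
    Np = (R * Rp - 1) // N
    return lambda x: (x + ((x * Np) & (R - 1)) * N) >> k

N = 97
mr_ok = make_mr(N, 9)                   # R = 512 >= 4N = 388
assert all(mr_ok(y * y) < 2 * N for y in range(2 * N))

mr_bad = make_mr(N, 8)                  # R = 256 < 4N
assert any(mr_bad(y * y) >= 2 * N for y in range(2 * N))
```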

It remains to show how to imitate standard algorithms for fast exponentiation. Here is one, using `(*)` to mean $\circledast$. Recall that $r = R \bmod N$ and $r_2 = R^2 \bmod N$ can be pre-computed.

```
fastexp(a,x):                montyfastexp(a,x):       # needs R > 4N
    A = a                        A = a (*) r_2        # cong. to ar
    X = x                        X = x
    Y = 1                        Y = r
    while X > 0:                 while X > 0:
        if X is even:                if X is even:
            A = (A * A)                  A = A (*) A
            X = X // 2                   X = X // 2
        else:                        else:
            Y = (A * Y)                  Y = A (*) Y
            X = X - 1                    X = X - 1
    return Y                     return Y (*) 1
```

By induction, `A` and `Y` always have an “extra $R$” until the return strips it away. They always have magnitude at most $2N$, so that the final `Y (*) 1` returns a value $v < \frac{2N}{R} + N \le N + 1$, that is, $v \le N$. This value equals $a^x \bmod N$ whenever $a$ is relatively prime to $N$.

An example of why requiring only $R > 2N$ can fail is that the right-hand code can output a value which has the right congruence modulo $N$ but is not $< N$; there is none with $R > 4N$. Adjustments as in the aforementioned papers, or a conditional subtraction of $N$ after the last line of our code, make it work with smaller $R$.
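The right-hand routine can be rendered in runnable Python like so (toy parameters with $R > 4N$; helper names are mine), and checked against Python’s built-in three-argument `pow`:

```python
# Runnable rendering of montyfastexp with (*) = Montgomery multiplication.
# Toy parameters satisfying R > 4N; helper names are mine.

N, k = 97, 9                     # R = 512 > 4*97 = 388
R = 1 << k
Rp = pow(R, -1, N)               # Python 3.8+ modular inverse
Np = (R * Rp - 1) // N           # R*R' - N*N' = 1
r, r_2 = R % N, (R * R) % N      # pre-computed once per modulus

def mont_mul(x, y):              # x (*) y = MR(x*y)
    t = x * y
    u = (t * Np) & (R - 1)
    return (t + u * N) >> k

def montyfastexp(a, x):
    A = mont_mul(a, r_2)         # cong. to a*R (mod N)
    Y = r                        # cong. to 1*R (mod N)
    while x > 0:
        if x % 2 == 0:
            A = mont_mul(A, A)   # square; A stays below 2N since R > 4N
            x //= 2
        else:
            Y = mont_mul(A, Y)
            x -= 1
    return mont_mul(Y, 1)        # strips the extra R; result <= N
```

Only the last line touches anything resembling a reduction mod $N$, and even it is just one more `(*)`.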

## Open Problems

Our little “scholium” has turned up some interesting congruence properties. Does the above suggest any further ideas and savings?

[removed erroneous “Lemma 1”]

It is interesting to note that a longstanding bug in OpenSSL, found in 2016, came from an improper implementation of Montgomery multipliers for larger-than-byte-length values of R.

On a separate note, have you thought about the parallel depth of these algorithms?

Wow about the bug—and there seems to be a bug in my Lemma 1: With N = 131, R = 2048, r = 83 which is primitive, but 21 = r^{98}, 110 = r^{33}, and 21 (*) 110 = 132 > 131. Maybe the exponents summing to N upsets something in the proof I gave. I saw papers about parallel Montgomery implementations but did not look at them.

It was a nice article. Basically reducing the expensive division or modulo N operation.