What it takes to understand and verify the claim

 Cropped from 2014 Wired source

John Martinis of U.C. Santa Barbara and Google is the last author of a paper published Wednesday in Nature that claims to have demonstrated a task executed with minimum effort by a quantum computer that no classical computer can emulate without expending Herculean—or Sisyphean—effort.

Today we present a lay understanding of the claim and discuss degrees of establishing it.

There are 76 other authors of the paper. The first 75 are alphabetical, then comes Hartmut Neven before Martinis. Usually pride of place goes to the first author, but that depends on size. Martinis is also the corresponding author. The cox in a rowing race rides at the rear. We have discussed aspects of papers with a huge number of authors here.

Three planks of a quantum supremacy claim are:

1. Build a physical device capable of a nontrivial sampling task.

2. Prove that it gains advantage over known classical approaches.

3. Prove that comparable classical hardware cannot gain such advantage.

Scott Aaronson not only has made two great posts on these and many other aspects of the claim, he independently proposed in 2015 the sampling task that was programmed, and he analyzed it in a foundational paper with Lijie Chen of MIT. Researchers at Google had already been thinking along those lines, and they anchored the team composed from numerous other institutions as well. As if on cue—just a couple days before Wednesday’s announcement—a group from IBM put out a post and paper taking issue with the argument for the third plank.

Any ${n}$-qubit quantum circuit ${C}$ and input ${x}$ to ${C}$ induce a probability distribution ${D_C}$ on ${\{0,1\}^n}$. Because it will not matter if we prepend up to ${n}$ NOT gates to ${C}$, we may suppose ${x = 0^n}$. Then ${C(0^n)}$ is a unit complex vector of length ${N = 2^n}$ with entries ${a_z}$ corresponding to possible outputs ${z \in \{0,1\}^n}$. The probability of getting ${z}$ by a final measurement of all qubits is

$\displaystyle p_z = D_C(z) = |a_z|^2.$
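For concreteness, here is a toy two-qubit illustration of how a circuit induces ${D_C}$. The circuit is our own made-up example, not one of the paper's:

```python
import numpy as np

# A toy 2-qubit C: an X^{1/2} gate on each qubit, then a CZ, applied to |00>.
sqrtX = 0.5 * np.array([[1 + 1j, 1 - 1j], [1 - 1j, 1 + 1j]])
CZ = np.diag([1.0, 1.0, 1.0, -1.0]).astype(complex)

state = np.zeros(4, dtype=complex)
state[0] = 1.0                       # the input |00>
state = CZ @ np.kron(sqrtX, sqrtX) @ state

D_C = np.abs(state) ** 2             # D_C(z) = |a_z|^2 for z in {00,01,10,11}
assert abs(D_C.sum() - 1.0) < 1e-12  # a unit vector gives a distribution
```

This tiny instance happens to give the uniform distribution; the heavy-hitter structure discussed below only emerges from deeper random circuits.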

Next we consider probability distributions ${D_1}$ that are generated uniformly at random by the following process, for some ${r \geq n}$ and taking ${R = 2^r}$:

for ${i = 1}$ to ${R}$:
choose a ${z \in \{0,1\}^n}$ uniformly at random;
increment its probability ${D_1(z)}$ by ${\frac{1}{R}}$.
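The loop above can be simulated directly. Here is a minimal sketch, with small toy sizes and strings encoded as integers:

```python
import random
from collections import Counter

def random_d1(n, r, seed=0):
    """Build a D_1 distribution: drop R = 2**r units of probability,
    each of mass 1/R, onto the N = 2**n strings uniformly at random."""
    rng = random.Random(seed)
    R = 2 ** r
    counts = Counter(rng.randrange(2 ** n) for _ in range(R))
    return {z: c / R for z, c in counts.items()}

D1 = random_d1(n=10, r=10)  # small toy sizes; the text takes r >= n
```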

Here we intend ${r}$ to be the number of binary nondeterministic gates in the circuit. In place of Hadamard gates the experimental circuits get their nondeterminism from these three single-qubit gates (ignoring global phase for ${\mathbf{Y}^{1/2}}$ in particular):

$\displaystyle \mathbf{X}^{1/2} = \frac{1}{2}\begin{bmatrix} 1 + i & 1 - i \\ 1 - i & 1 + i \end{bmatrix},~ \mathbf{Y}^{1/2} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix},~ \mathbf{W}^{1/2} = \frac{1}{2}\begin{bmatrix} 1 + i & - i\sqrt{2} \\ \sqrt{2} & 1 + i \end{bmatrix}.$

Here ${\mathbf{W} = \frac{1}{\sqrt{2}}(\mathbf{X} + \mathbf{Y})}$ where ${\mathbf{Y} = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}}$ and ${\mathbf{X}}$ is another name for NOT. The difference from using Hadamard gates matters to technical analysis of the distributions ${D_C}$ but the interplay between quantum nondeterministic gates and classical random coins remains in force.
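These matrices can be checked numerically; the sketch below (our own check) verifies that each squares to its namesake gate, with ${\mathbf{Y}^{1/2}}$ doing so up to the global phase noted above:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)   # NOT
Y = np.array([[0, -1j], [1j, 0]])
W = (X + Y) / np.sqrt(2)

sqrtX = 0.5 * np.array([[1 + 1j, 1 - 1j], [1 - 1j, 1 + 1j]])
sqrtY = np.array([[1, -1], [1, 1]], dtype=complex) / np.sqrt(2)
sqrtW = 0.5 * np.array([[1 + 1j, -1j * np.sqrt(2)], [np.sqrt(2), 1 + 1j]])

assert np.allclose(sqrtX @ sqrtX, X)
assert np.allclose(sqrtW @ sqrtW, W)
# This Y^{1/2} squares to Y only up to the global phase -i:
assert np.allclose(sqrtY @ sqrtY, -1j * Y)
```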

The choice of ${\mathbf{X}^{1/2}}$, ${\mathbf{Y}^{1/2}}$, or ${\mathbf{W}^{1/2}}$ is itself uniformly random at each point where a single-qubit gate is used, except for not repeating the same gate on the same qubit, and those choices determine ${C}$. Now we can give an initial statement of the task tailored to what the paper achieves:

Given randomly-generated quantum circuits ${C}$ as inputs, distinguish ${D_C}$ with high probability from any ${D_1}$.

In more detail, the object is to take a number ${\delta > 0}$ and moderately large integer ${k}$, both dictated by practical elements of the experiment, and fulfill this task statement:

Given randomly-generated ${C}$, generate samples ${z_1,\dots,z_k \in \{0,1\}^n}$ such that ${\frac{N}{k}(D_C(z_1) + \cdots + D_C(z_k)) \geq 1 + \delta}$.

It’s important to note that there are two stages of randomness: one over ${C}$ which chooses ${D_C}$, and then the stage of measuring after (perhaps imperfectly) executing ${C(0^n)}$. The latter can be repeated to get a large sample of strings ${z}$ for a given ${C}$. The nature of the former stage matters most to justifying how to interpret tests of the samples and to closing loopholes. Our ${D_1}$ does not signify having uniform distribution in the latter sampling, but rather covers classical alternatives in the former stage that (with overwhelming probability) belong to a class we call ${\mathcal{D}_1}$. The ${D_C}$ for random ${C}$ will (again w.o.p.) belong to a class ${\mathcal{D}_2}$ which we explain next.

## The World Series of Quantum Computing

In honor of the baseball World Series, we offer a baseball analogy. To make differences sharper to see, we take ${r = n}$, so ${R = N = 2^n}$. This is not what the experiment does: their biggest instance has 20 layers totaling ${r = 1,\!113}$ nondeterministic single-qubit gates (plus ${430}$ two-qubit gates) on the ${n = 53}$ qubits. But let us continue.

We are distributing ${N}$ units of probability among ${N}$ “batters” ${z \in \{0,1\}^n}$. A batter who gets two units hits a double, three units makes a triple, and so on. The key distinction is between the familiar batting average and the slugging average, which averages all the bases scored with hits:

• The chance of making an out—that is, getting no units—is ${(\frac{N-1}{N})^N}$ which is approximately ${\frac{1}{e} = 0.367879\dots}$

• The chance of hitting a single is also about ${\frac{1}{e}}$, leaving ${1 - \frac{2}{e}}$ as the frequency of getting an extra-base hit—which makes ${z}$ a “heavy hitter.”

• From ${k}$ batters chosen uniformly at random, their expected batting average will be ${1 - \frac{1}{e} = 0.632\dots}$.

• Their expected slugging average, however, will just be ${1}$: they expect ${k}$ units to be distributed among them.

Thus with respect to a random ${D_1}$, and without any knowledge of ${D_1}$, a chosen team of ${k}$ hitters cannot expect to have a joint slugging average higher than ${1}$. Moreover, for any fixed ${\delta > 0}$, the chance of getting a slugging average higher than ${1 + \delta}$ tails away exponentially in ${k}$ (provided ${N}$ also grows).

With respect to ${D_C}$, however, a quantum device can do better. Google’s device programs itself given ${C}$ as the blueprint. So it just executes ${C}$ and measures all qubits to sample the output. Finding its own heavy hitters is what a quantum circuit is good at. The probability of getting a hitter who hits a triple is magnified by ${3}$ compared to a uniform choice. Moreover, ${C}$ will never output a string with zero hits—a “can’t miss” property denied to a classical reader of ${C}$. For large ${N}$ the probability distribution approaches ${xe^{-x}}$ and the slugging expectation is approximately

$\displaystyle \int_0^\infty x^2 e^{-x}\,dx = 2.$

That is, a team ${z_1,\dots,z_k}$ drafted by sampling from random quantum circuits ${C}$ expects to have a slugging average near ${2}$. This defines the class ${\mathcal{D}_2}$. If ${C}$ works perfectly, the average will surpass ${1 + \delta}$ whenever ${0 < \delta < 1}$ with near certainty as ${N}$ grows.
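The slugging expectation of ${2}$ can be checked by Monte Carlo: in the ${xe^{-x}}$ (Porter-Thomas) limit the scaled probabilities ${x = Np_z}$ are exponentially distributed, and sampling the circuit weights each output by its own probability. A quick sketch, our own and illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Porter-Thomas limit: the scaled probabilities x = N * p_z of a random
# circuit's outputs are approximately Exp(1)-distributed for large N.
x = rng.exponential(1.0, size=1_000_000)

# Sampling the circuit picks z with probability p_z, so the expected
# slugging average N * p_z of a sampled z is E[x^2] / E[x], close to 2.
slugging = np.mean(x ** 2) / np.mean(x)
```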

Google’s circuits have up to ${r = 20n}$, so ${R \gg N}$. Then the “can’t miss” aspect of the quantum advantage is less sharp but the ${xe^{-x}}$ approximation is closer and the idea of ${\mathcal{D}_2}$ is the same. The nature of ${\mathcal{D}_2}$ can actually be seen from point intensities in speckle patterns of laser light.

## Real-World Execution

The practical challenge is that the implementation of ${C}$ is not perfect. The consequence of an error in the final output is severe. The heavy-hitter outputs ${z}$ of a random ${C}$ are generally not bit-wise similar, so sampling their neighbors is like sampling uniform distribution. As the paper says, “A single bit or phase flip over the course of the algorithm will completely shuffle the speckle pattern and result in close to zero fidelity.”

Their circuits are sufficiently random that effects of sporadic errors over millions of samples can be modeled by a simple equation using quantum mixed states. We shortcut the paper’s physical analysis by drawing on John Preskill’s illustration of a de-polarizing channel in chapter 3 of his wonderful online notes on quantum computation to reach the same equation (1). The modeling has informative symmetry when the errors of a bit flip, phase flip, or both are considered equally likely with probability ${\frac{p}{3}}$. The action on the entangled pair ${|\Phi^+\rangle}$ in the Bell basis is given by the density matrix evolution ${\rho = |\Phi^+\rangle\langle\Phi^+| \mapsto \rho'}$ where

$\displaystyle \begin{array}{rcl} \rho' &=& (1 - p)|\Phi^+\rangle\langle\Phi^+| \;+\; \frac{p}{3}\left(|\Psi^+\rangle\langle\Psi^+| \;+\; |\Phi^-\rangle\langle\Phi^-| \;+\; |\Psi^-\rangle\langle\Psi^-|\right)\\ ~~~\\ &=& (1 - p') |\Phi^+\rangle\langle\Phi^+| \;+\; p'\frac{I}{4}, \end{array}$

where ${p' = \frac{4}{3} p}$ and ${\frac{I}{4}}$ is the density matrix of the completely mixed two-qubit state, which is just a classical distribution. This presumes ${p \leq \frac{3}{4}}$; note that ${p = \frac{3}{4}}$ completely mixes the Bell basis already. The effective fidelity of ${\rho'}$ to the original state, meaning its weight in this decomposition, is then given by

$\displaystyle F = 1 - p'.$
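The decomposition can be verified numerically; the following sketch, with a made-up error probability ${p}$, checks that the two expressions for ${\rho'}$ agree:

```python
import numpy as np

# Bell basis vectors in the 2-qubit computational basis.
phi_p = np.array([1, 0, 0, 1]) / np.sqrt(2)    # |Phi+>
phi_m = np.array([1, 0, 0, -1]) / np.sqrt(2)   # |Phi->
psi_p = np.array([0, 1, 1, 0]) / np.sqrt(2)    # |Psi+>
psi_m = np.array([0, 1, -1, 0]) / np.sqrt(2)   # |Psi->

def dm(v):
    return np.outer(v, v.conj())  # rank-one density matrix |v><v|

p = 0.03  # an illustrative (made-up) depolarizing error probability
rho = (1 - p) * dm(phi_p) + (p / 3) * (dm(psi_p) + dm(phi_m) + dm(psi_m))

# Rewriting as a mixture of |Phi+> and the completely mixed state I/4,
# with weight p' = 4p/3 on the mixed part, gives the same density matrix:
pp = 4 * p / 3
assert np.allclose(rho, (1 - pp) * dm(phi_p) + pp * np.eye(4) / 4)
```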

This modeling already indicates that with ${m}$ serial opportunities for error the fidelity will decay as ${(1 - p')^m}$. The Google team found low ‘crosstalk’ between qubits and they used exactly this expression in the form ${(1 - \frac{e_1}{1 - 1/D^2})^m}$, evidently with ${p}$ being the native gate error rate they call ${e_1}$ and ${D = 2^k,}$ where having ${k=1}$ for single-qubit gates supplies the factor ${\frac{2^{2k}}{2^{2k} - 1} = \frac{4}{3}}$.
The error ${e_2}$ for the two-qubit gates is similarly represented. (The full modeling in the supplement, section V, is more refined.)
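As a sketch of the decay formula, with illustrative error rates near (but not exactly) the reported figures:

```python
def estimated_fidelity(e1, e2, m1, m2):
    """Depolarizing-decay model: each k-qubit gate multiplies fidelity
    by (1 - e/(1 - 1/D**2)) with D = 2**k."""
    f1 = 1 - e1 / (1 - 1 / 4)     # single-qubit gates: D = 2, factor 4/3
    f2 = 1 - e2 / (1 - 1 / 16)    # two-qubit gates:    D = 4, factor 16/15
    return f1 ** m1 * f2 ** m2

# Illustrative rates (e1 ~ 0.15%, e2 ~ 0.6%) applied to the biggest
# instance's 1,113 single-qubit and 430 two-qubit gates:
F = estimated_fidelity(0.0015, 0.006, 1113, 430)
# F lands between 0.001 and 0.01, the range discussed below.
```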

By observing their benchmarks (discussed below) for varying small ${m}$ they could calculate the decay concretely and hence estimate values of ${F}$ for the vast majority of runs with larger ${m}$. The random nature of the circuits ${C}$ evidently makes covariance of errors that could systematically upset this modeling negligible. Thus they can conclude that their device effectively samples from the distribution

$\displaystyle F|\langle z \;|\; C \;|\; 0^n\rangle|^2 \;\;+\;\; (1 - F)\frac{1}{N}. \ \ \ \ \ (1)$

Such distributions can be said to belong to the class ${\mathcal{D}_{1 + F}}$. The paper reports that their ${F}$ is driven below ${0.01}$ but stays above ${0.001}$ in trials. This bounds the range of the ${\delta}$ they can separate by. That ${\delta}$ is separated from zero achieves the first plank and starts on the second. The third needs attention first, however.
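Sampling from a ${\mathcal{D}_{1 + F}}$ distribution is just a two-stage coin flip. Here is a toy sketch with a made-up ideal distribution ${q}$, since computing the real ${q_z}$ requires simulating ${C}$:

```python
import random

def sample_noisy_output(n, q, F, rng):
    """Sample from equation (1): with probability F draw from the ideal
    distribution q of C, else output a uniform n-bit string (as an int)."""
    if rng.random() < F:
        zs = list(q)
        return rng.choices(zs, weights=[q[z] for z in zs])[0]
    return rng.randrange(2 ** n)

rng = random.Random(1)
q = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}  # made-up 2-qubit ideal distribution
# F = 0.5 is exaggerated for visibility; the experiment's F is nearer 0.002.
samples = [sample_noisy_output(2, q, 0.5, rng) for _ in range(20000)]
# Output 0 should appear with frequency near 0.5*0.7 + 0.5*0.25 = 0.475.
```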

## The Third Plank

Both concrete and asymptotic complexity evidence matter for the third plank, the former for now and the latter for how ${n}$ and everything else may scale up in the future. In asymptotic complexity, we still don’t know that ${\mathsf{P}}$ and ${\mathsf{PSPACE}}$, which sandwich the quantum feasible class ${\mathsf{BQP}}$, are different. Thus asymptotic evidence about polynomial bounds must be conditional. Asymptotic evidence about linear time bounds can be sharper but then tends to be conditioned on forms of SETH in ways we still find puzzling.

Lower bounds in concrete complexity are less known and have a self-defeating aspect: We are trying to say that any program ${P}$ run for less than an infeasible time ${T}$ must fail. But we can’t run ${P}$ for time ${T-1}$ to show that it fails because time ${T-1}$ is just as infeasible as time ${T}$. The best we can do is run ${P}$ for a feasible ${T_0 \ll T}$, either (i) on a smaller task size, or (ii) on the original task but argue it doesn’t show progress. Neither is the same; we made some attempts on (ii).

What the paper does instead is argue that a particular classical approach ${P}$ (also from the Aaronson-Chen paper) would take 10,000 years on today’s hardware. This reminds us of a famous 1977 “Mathematical Games” column by Martin Gardner, which quotes an estimate by Ron Rivest that for factoring a 126-digit number on then-current hardware, “the running time required would be about 40 quadrillion years!” It took only until 1994 for this to be broken. Sure enough, IBM calculated that a more-clever implementation of ${P}$ on the Summit supercomputer would take under 3 days. The point is not so much that the Summit hardware is comparable as that estimates based on what are currently thought to be the best possible (classical) methods need asterisks.

On the asymptotic side, the last section (XI) of the paper’s 66-page supplement proves a theorem toward showing that a classical simulation from ${\mathcal{D}_{1 + \delta}}$ that scales polynomially with ${n}$ would collapse ${\mathsf{\#P}}$ to ${\mathsf{AM}}$, and similarly for sub-exponential running times. It does not get all the way there, however: improvements would need to be made in upper bounds for approximation and for worst-case to average-case equivalence. [Added 10/31/19: see this new paper by Scott A. and Sam Gunn.] Moreover, there is a difference from what their statistical testing achieves that we try to explain next.

## The Statistical Tests

We can cast the second plank in the general context of predictive modeling. Consider a forecaster who places estimates ${\{q_i\}}$ on the true probabilities ${\{p_i\}}$ of various events. Here we need to compute the probabilities ${\Pr(z)}$ of output strings ${z}$ observed from the physical device, using the given circuit ${C}$ and the estimate of ${F}$. This must be done classically, and incurs the “${T_0}$-versus-${T}$” issue discussed above.

But before we get to that issue, let’s say more from the viewpoint of predictive modeling. We measure how well the forecasts ${q_i}$ conform to the true ${p_i}$ by applying a prediction scoring rule. If outcome ${i}$ happens, then the log-likelihood rule assesses a penalty of

$\displaystyle L_i = \log(\frac{1}{q_i}).$

This is zero if the outcome was predicted with certainty but goes to infinity if the individual ${q_i}$ is very low—which is an issue in the quantum case. The expected score based on the true probabilities is

$\displaystyle E[L_i] = \sum_i p_i \log(\frac{1}{q_i}). \ \ \ \ \ (2)$

The log-likelihood rule is strictly proper insofar as the unique way to minimize ${E[L_i]}$ is to set ${q_i = p_i}$ for each ${i}$. In human contexts this means the model has incentive to be as accurate as possible.
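A quick numerical check of strict propriety, using a hypothetical three-outcome forecast:

```python
import math

p = [0.5, 0.3, 0.2]  # hypothetical true outcome probabilities

def expected_log_loss(q):
    """Expected penalty E[L] = sum_i p_i * log(1/q_i) of forecast q."""
    return sum(pi * math.log(1 / qi) for pi, qi in zip(p, q))

honest = expected_log_loss(p)
# Any other forecast scores strictly worse in expectation (Gibbs' inequality):
for q in ([0.4, 0.4, 0.2], [0.6, 0.2, 0.2], [1/3, 1/3, 1/3]):
    assert expected_log_loss(q) > honest
```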

The formula (2) is the cross-entropy from the ${\vec{q}}$ distribution to the ${\vec{p}}$ distribution. Before we can use it, we need to ask a question:

What is forecasting what? Is the device the imperfect model and do the “true ${p_i}$” come from the analysis of ${C}$ giving ${\Pr(z)}$?

This is how it appeared to us and seems from other writing, but we can argue the opposite from first principles: The physical device is the “ground truth” however it works. The assertion that it is executing a blueprint ${C}$ with some estimated loss in fidelity ${F}$ is really the model. Then it follows that ${\Pr(z)}$ is analogous to ${q_i}$ not ${p_i}$, and we can call it ${q_z}$. Since we can compute it, we can calculate ${\log(\frac{1}{q_i})}$ in (2).

Saying this leaves “${p_i}$” in (2) as denoting the device’s true probabilities ${p_z}$ of giving the output strings ${z}$. These are not directly observable: it is infeasible to sample the device often enough to expect a sufficiently large number of repeated occurrences based on the “birthday” threshold. Thus there appears to be no way to estimate individual values ${p_i}$ in (2), but this doesn’t matter: the very act of sampling the device carries out the “${\sum_i p_i}$” part of (2). Summing ${\log(\frac{1}{q_z})}$ for the ${z}$ that occur over a large but feasible number ${T}$ of trials gives estimates of (2) that are close enough to make the needed distinctions with high confidence. We can then match the estimate of ${E[L_i]}$ against the theoretical estimate, which we may call ${E_{1+F}}$, assuming accurate knowledge of ${F}$. By the ${L_i}$ scoring function being strictly proper, this match entails achieving ${p_z = q_z}$ with sufficient approximation. This property of the goal mitigates some of the modeling issue.

This issue was clarified by reading Ryan O’Donnell’s 2018 notes on quantum supremacy, which preview this same experiment. The above view on which is “forecaster”/”forecastee” might defend the team against his opinion that it “kind of gets its stats backwards”—but the inability to compute cross-entropy from the blueprints’ distribution to that of the device remains an issue. What the team did instead, however, is shift to something simpler they call “linear cross-entropy.” They simply show that the ${q_z}$ from their samples collectively beat the “${E_1}$” that applies to ${\mathcal{D}_1}$—more simply put, that when summed over ${T}$-many trials ${z_t}$,

$\displaystyle \frac{1}{T} \sum_{t = 1}^T q_{z_t} > \frac{1 + \delta}{N}. \ \ \ \ \ (3)$

This just boils down to giving a z-score based on the modeling for ${\mathcal{D}_1}$. It is analogous to how I (Ken writing this) test for cheating at chess. We are blowing a whistle to say the physical device is getting surreptitious input from quantum mechanics to achieve a strength of ${1 + \delta}$ compared to a “classical player” who is “rated” as having strength ${1}$.
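The linear cross-entropy score can be sketched as follows. The toy distribution below stands in for a real circuit's output probabilities and is illustrative only:

```python
import numpy as np

def linear_xeb(qs, n):
    """Linear cross-entropy benchmark from the classically computed
    probabilities q_{z_t} of the sampled strings: N * mean(q) - 1.
    Uniform samples score near 0; an ideal sampler scores near 1."""
    return (2 ** n) * np.mean(qs) - 1

rng = np.random.default_rng(0)
n, T = 10, 200_000
N = 2 ** n
p = rng.exponential(1.0, N)
p /= p.sum()                               # Porter-Thomas-like toy "ideal"
ideal_qs = p[rng.choice(N, size=T, p=p)]   # a perfect device's samples
unif_qs = p[rng.integers(0, N, size=T)]    # a classical uniform "spoofer"
```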

The difference from showing that the device’s score from (2) is within a hair of ${E_{1+F}}$ is that this is based on ${E_1}$. To be sure, the paper shows that their ${z}$-scores conform to those one would expect an “${E_{1+F}}$-rated” device to achieve. But this is still not the same as (2). Whether it is tantamount for enough purposes—including the theorem about ${\mathsf{AM}}$—is where we’re most unsure, and we note distinctions between fully (classically) sampling and “spoofing” the statistical test(s) raised by Scott (including directly in reply to me here) and others. The authors say that using “linear cross-entropy” gave sharper results and that they tried other (unspecified) measures. We wonder how much of the space of scoring rules familiar in predictive modeling has been tried, and whether rules having more gentle tail behavior for tiny ${q_i}$ than ${L_i}$ might do better.

Finally, there is the issue that the team were able to verify ${q_z}$ exactly only for circuits up to ${43}$ qubits and/or with ${14}$ levels, not ${53}$ with ${20}$ levels. This creates a dilemma in that IBM’s paper may push them toward ${n = 60}$ or ${70}$, but that increases the gap from instance sizes they can verify. This also pushes away from the possibility of observing the ${\mathcal{D}_{1+F}}$ nature of ${D_C}$ more directly by finding repeated strings ${z}$ in the second-stage sampling of a fixed ${C}$. The “birthday paradox” threshold for repeats is roughly ${2^{n/2}}$ samples, which might be feasible for ${n}$ around ${50}$ (given the classical work needed for each ${z}$, which IBM’s cleverness might speed) but not above ${60}$. The distinguishing power of repeats drops further with ${F}$. We intend to say more about these last few points, and we are sure there are many chapters still to write about supremacy experiments.
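The birthday arithmetic can be made concrete with the standard pair-counting heuristic (our own sketch):

```python
def expected_repeats(n, T):
    """Expected number of colliding pairs among T uniform samples
    from N = 2**n strings: T*(T-1)/(2*N), the birthday heuristic."""
    return T * (T - 1) / (2 * 2 ** n)

# At the threshold T = 2**(n/2) one expects about half a colliding pair;
# a perfect Porter-Thomas sampler roughly doubles this rate (its sum of
# p_z**2 is about 2/N), which is what observing repeats would exploit.
half_pair = expected_repeats(50, 2 ** 25)
```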

## Open Problems

Is the evidence so far convincing to you? Is enough being done on the third plank to exclude possible clever classical use of the fact that the circuits ${C}$ are given as “white boxes”? Are there possible loopholes?

We would also be grateful to know where we may have oversimplified our characterization of the task and our analysis of the issues.

[Added more error-modeling details to the real-world section; some minor word changes; clarified how X,Y,W are chosen; addendum to clarify modeling issues; 10/31/19: removed addendum after blending it into a revision of the last main section—original version preserved here—and linked new Aaronson-Gunn paper.]

October 27, 2019 10:31 am

I am a quantum computer skeptic who is amazed that it appears to have worked for a 53-qubit machine. However, I am disappointed that there does not appear to be a way to directly test it for, say, a 100-qubit machine, since no digital computer can simulate such a machine. So how long will we have to wait until Shor’s algorithm can be tested? How many qubits will be required?

October 27, 2019 5:54 pm

I’m between skeptic and believer because I think quantum supremacy is undecidable. Whenever people manage to build a quantum machine with tremendous effort, a possibility appears to solve the same task with huge classical resources. So we may never see quantum computers leave the laboratory.

October 28, 2019 11:27 am

It’s unfortunate that the Google authors didn’t just announce their device and the results of running it, and let the academic community engage in a tasteful discussion of whether this was an example of Quantum Supremacy, any outcome of which would not have reflected negatively on their impressive advance in engineering a reliable superconducting quantum logic circuit.

But of course, that’s not what happened, and with Pons and Fleischmann-like bravado, they had to proclaim to the world that they had won the Gold in the Quantum Supremacy Olympics, with the entirely predictable fertilizer storm over that claim grabbing press attention away from their actual accomplishment.

Quantum Supremacy should at least strongly suggest the inevitability of working quantum computers with low noise and useful numbers of bits. This result, especially with IBM’s contribution, seems to fall a bit short of that mark.

October 30, 2019 11:03 am

Excellent post!

As you write, there are two crucial questions regarding Google’s quantum supremacy claims. The first is about the quality of the evidence for their prediction regarding the outcome of the 53-qubit experiment. They predict that the statistical test will give a value larger than t = 1/10,000. (Or 1 + 1/10,000 if we don’t subtract the 1.) Let’s call t the fidelity. The second question is about the quality of their claim that achieving a sample with t larger than 1/10,000 represents “quantum supremacy”. (To save latex power I use t instead of $\delta$.)

My own prediction (in several works) about experiments of this kind is that (modulo some small issues that I will neglect) the resulting distribution can be achieved by a classical computer with a polynomial time algorithm in the size of the circuit. And here “polynomial-time” refers to a low degree polynomial with moderate constants.

Regarding the second question. Let me remark that as far as I can see the IBM claim is about a full computation of all the 2^53 probabilities. It is reasonable to think that producing a sample (or even a complete list of 2^53 probabilities) with fidelity t reduces the classical effort linearly with t. (This is the claim about the specific algorithm used by the Google team.) If this holds for the IBM algorithm then the 2.5 days will go down to less than a minute. It will be very interesting to explore whether for a version of the IBM algorithm we can have a running time which is linear in t. (It is also interesting whether, in a discussion based on heuristic arguments both for computational hardness and computational easiness, we can take for granted such a linear dependence of the algorithm’s running time on t.)

In my view the second issue is the crucial one. Reaching a fidelity above 1/10,000 for a 53-qubit experiment would cause me to have doubts about my conjecture “against” quantum supremacy. (Even if it is only one minute on the IBM super-duper computer.) The crucial aspect, in my view, is to understand (and also further develop, and replicate) the experiments in the regime that can be tested by a classical computer. (This regime is larger now by the IBM method.)

What I would like to see is:

A) an independent verification of the statistical tests outcomes for the experiments in the regime where the Google team classically computed the probabilities.

This seems quite easy to perform and maybe this was already done as part of the refereeing process. This looks to me like a crucial step in the verification of such an important experiment.

B) Experiments giving full histograms for circuits in the 10-25 qubit range. (Again maybe this was already done.) See this comment by Ryan.

Perhaps at a later stage also:

C) Experiments in the 10-30 qubit range on the IBM quantum computer. (This is an excellent suggestion by Greg Kuperberg.)

D) In practice the 53-qubit samples might still be hard for IBM to check. (The 1/10,000 improvement that I heuristically suggested is not relevant for checking the fidelity in Google’s sample.) But maybe Google can produce samples for 41–49 qubits for which IBM can compute the probabilities quickly and test Google’s prediction of the fidelity.

December 30, 2019 3:10 pm

I ask: Does the distinction “polynomial time-nondeterministic polynomial time” dissolve here?

June 27, 2020 10:06 am

I have a question concerning the main task statement:

” Given randomly-generated {C}, generate samples {z_1,…,z_k \in \{0,1\}^n} such that {\frac{1}{k}(D_C(z_1) + \cdots D_C(z_k)) \geq 1 + \delta}. ”

Concretely, we have for each i that D_C(z_i) \leq 1, so the left side is always at most 1
and cannot be greater than or equal to 1 + \delta. Right?

July 1, 2020 8:58 am

OK, but what if building an n-qubit quantum computer takes exponential time in n? Human tasks also have their time complexity…

July 4, 2020 9:56 am

What is the quantum equivalent of Moore’s law? How many years to double the number of qubits?