
Making Public Information Secret

May 20, 2016


A way to recover and enforce privacy

[Photo of Scott McNealy; source: McNealy bio]

Scott McNealy, when he was the CEO of Sun Microsystems, famously said nearly 15 years ago, “You have zero privacy anyway. Get over it.”

Today I want to talk about how to enforce privacy by changing what we mean by “privacy.”

We seem to see an unending series of break-ins into databases. There is, of course, a huge literature of theory and methods for protecting privacy. Yet databases are still breached and people lose their information. We wish to explore whether this can be fixed. We believe the key to the answer is to change the question:

Can we protect data that has been illegally obtained?

This sounds hopeless: how can we secure data that has already been stolen? The answer is that we need to look deeper into what it means to steal private data.

After The Horse Leaves The Barn

The expression “the horse has left the barn” means:

Closing/shutting the stable door after the horse has bolted, or trying to stop something bad happening when it has already happened and the situation cannot be changed.

Indeed, our source gives as its main example: “Improving security after a major theft would seem to be a bit like closing the stable door after the horse has bolted.”


Photo by artist John Lund via Blend Images, all rights reserved.


This strikes us as the nub of privacy. Once information is released on the Internet, whether by accident or by a break-in, there seems to be little that one can do. However, we believe that there may be hope to protect the information anyway. Somehow we believe we can shut the barn door after the horse has left, and get the horse back.

Suppose that some company makes a series of decisions. Can we detect whether those decisions depend on information that they should not be using? Let us call this Post-Privacy Detection.

Post-Privacy Detection

Consider a database that stores values {(v,a)} where {v} is an {n}-bit vector of attributes and {a} is an attribute. Think of {a} as small, even a single bit such as the sex of the individual with attributes {v}. Let us also suppose that the database is initially secure for {a}, insofar as given many samples of the values of {v} only, it is impossible to gain any advantage in inferring the values of {a}. Thus a leak of {a} is meaningful information.
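As a concrete toy model (the names and setup here are our own illustration, not part of any real system), one can generate such a database in Python with the secret bit {a} drawn independently of {v}, so that {v} alone gives no advantage in guessing {a}:

```python
import random

def make_records(n_bits, n_records, seed=0):
    """Generate records (v, a) where v is an n-bit attribute vector and
    a is a single secret bit drawn independently of v, so that v alone
    gives no advantage in guessing a (the database is 'secure for a')."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_records):
        v = tuple(rng.randint(0, 1) for _ in range(n_bits))
        a = rng.randint(0, 1)          # independent of v
        records.append((v, a))
    return records

records = make_records(n_bits=8, n_records=1000)
```

Independence of {a} from {v} is the strongest form of initial security; weaker statistical notions would also fit the definition.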

Now say a decider is an entity {D} that uses information from this database to make decisions. {D} has one or more Boolean functions {f(v,a)} of the attributes. Think of {f(v,a)} as a yes/no decision on some issue: granting a loan, selling a house, setting an insurance rate, and so on. The idea is that while {a} may no longer be secret (the database has been broken into), we can check that, in aggregate, {a} is effectively secret.

The point here is that we can detect whether {a} is being used in an unauthorized manner to make some decision, given protocols for transparency that enable sampling the values {(v,f(v,a))}. If, given a polynomial number of samples, we cannot infer the values of {a} to within {\epsilon}, then we have large-scale assurance that {a} was not material to the decision. Our point is this: a leak of values about individuals is material only if someone uses them to make a decision that should not depend on their "private" information. Thus if a bank obtains values of {a} but does not use them to make a decision, then we would argue that the information, while public, was effectively private.
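As a hedged sketch of such an audit (assuming, hypothetically, that the auditor can join the leaked {a} values with the sampled decisions, and that the decider is deterministic), one can flag any decider whose output changes with {a} for a fixed {v}:

```python
import random
from collections import defaultdict

def decision_depends_on_a(samples):
    """Audit samples (v, a, decision) from a deterministic decider: if two
    samples share the same attribute vector v but differ in the secret bit
    a and receive different decisions, the decider must have used a."""
    seen = defaultdict(dict)                 # v -> {a: decision}
    for v, a, d in samples:
        seen[v][a] = d
    return any(len(set(by_a.values())) > 1 for by_a in seen.values())

rng = random.Random(1)
recs = [(tuple(rng.randint(0, 1) for _ in range(2)), rng.randint(0, 1))
        for _ in range(200)]
leaky = [(v, a, a) for v, a in recs]         # decision is the secret bit
clean = [(v, a, v[0]) for v, a in recs]      # decision ignores the secret
```

On these samples the audit flags `leaky` and passes `clean`; a noisy or randomized decider would need a statistical test in place of this exact check.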

Definition 1 Let a database contain values of the form {(v,a)}, and let {f(v,a)} be a Boolean function. Say that the {a} part is effectively private for the decision {f} provided there is another function {g(v)} so that

\displaystyle  \mathsf{Prob}[f(v,a)=g(v)] \ge 1-\epsilon,

where {\epsilon>0}. A decider {D} respects {a} if {a} is effectively private in all of its decision functions.
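Definition 1 suggests a direct empirical check: estimate the agreement probability between {f} and a candidate witness {g} by sampling. The following sketch (our own illustration; the data is synthetic with {a} independent of {v}) shows a decider for which a perfect witness exists, and one for which no function of {v} alone can do much better than chance:

```python
import random

def agreement(f, g, samples):
    """Empirical estimate of Prob[f(v, a) == g(v)] over sampled records;
    the a-part is effectively private for f if some g achieves
    agreement of at least 1 - eps."""
    return sum(f(v, a) == g(v) for v, a in samples) / len(samples)

rng = random.Random(0)
samples = [(tuple(rng.randint(0, 1) for _ in range(4)), rng.randint(0, 1))
           for _ in range(1000)]

respects = lambda v, a: v[0]   # decision ignores a entirely
leaks = lambda v, a: a         # decision is the secret bit itself

perfect_witness = agreement(respects, lambda v: v[0], samples)
best_constant = agreement(leaks, lambda v: 0, samples)
```

Here `perfect_witness` is exactly 1.0 by construction, while `best_constant` hovers near 1/2 because {a} is an independent fair coin, so no witness {g} certifies effective privacy for `leaks` with small {\epsilon}.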

We can prove a simple lemma showing that this definition implies that {a} is not compromised by sampling the decision values.

Lemma 2 If the database is secure for {a} and {a} is effectively private, then there is no function {h} such that {\mathsf{Prob}[h(v,f(v,a)) = a] \ge 1-\epsilon}.

Proof: Suppose for contradiction that such an {h} exists. Since {a} is effectively private, there is a function {g(v)} as above. Then given {v}, the value {u = g(v)} equals {f(v,a)} with probability at least {1 - \epsilon}. Using {h} we obtain {h(v,u) = a} with overall probability at least {1 - 2\epsilon}, by a union bound. Thus {v \mapsto h(v, g(v))} predicts {a} from {v} alone, contradicting the initial security of the database for {a}. \Box

Pulling in the Long Reins

To be socially effective, our detection concept should exert influence on deciders to behave in a manner that overtly does not depend on the unauthorized information. This applies to repeatable decisions whose results can be sampled. The sampling would use protocols that effect transparency while likewise protecting the data.

Thus our theoretical notion would require social suasion for its effectiveness. This includes requiring deciders to provide infrastructure by which their decisions can be securely sampled. It might not require them to publish their {a}-oblivious decision functions {g}, only that they could, if challenged, provide one. Most of this is to ponder for the future.

What we can say now, however, is that there do exist ways we can rein in the bad effects of lost privacy. The horses may have bolted, but we can still exert some long-range control over the herd.

Open Problems

Is this idea effective? What things like it have been proposed?

19 Comments
  1. Roielle Hugh permalink
    May 20, 2016 3:33 pm

    Announcement

    A thorough treatment of the settlement of P vs. NP is now put in book form.

    Of the various questions pertaining to the conjecture that the book answers is the following from GLL:

    Could some open problems fall to: "We know [...] and by induction [...] we are done"?

    People should soon see a follow-up with a bit more details (unless things go suddenly berserk like last time).

    — Roielle Hugh

  2. Roielle Hugh permalink
    May 20, 2016 7:05 pm

    The Longest Second, now made available from both Amazon and NOOK, is a condensed version of the saga.

    Even though there are very elaborate proofs for P vs. NP both formal and ‘relaxed’ in the book, the true purpose is not for such superfluity. We are to precisely answer the more fundamental questions:

    Why can we still not solve the problem?
    Why is (and has been) the problem so hard?
    Why are all other attempts doomed to fail?
    

    Maybe more dramatically, with our resolution, is the following:

    Why, in a philosophical sense, can such problem never be solved?

    Despite the mild philosophical tint, the book has a quite practical side. It is to suggest to people who want to prove it a way to produce a valid proof, presenting a vivid and precise picture of what a proof will look like in two short paragraphs (of about 111 words) in the chapter “NP = P“. Implicitly, the book will also offer means to recognize insufficiency so that one who had an attempt can know in an exact manner whether it is valid or invalid. For those who have or will have such an attempt, a sense about the validity may also be gained (possibly before a claim is made).

    The book has a specific chapter, Vetting Claims, to provide the guideline, the necessary and sufficient conditions for a valid proof.

    It may or may not be a simple thing to settle, but won’t be as simple a task as prior believed with three bricks. The world evolved since 1998. It is already out of the scope of a single ASCII.

    The book starts, as can be seen via a preview, with the (slightly metaphorical) re-capture of the moment in which the solution of the Trio was announced to the world.

    — Roielle Hugh


    NOTE:
    Due to limitations of electronic delivery, the reader may have to put up with the less ideal formatting. Besides, math-related material is limited to a minimum and largely to text form, as shown in the following excerpt (of part of an informal discussion) from the book:

    Since $$C_{n}^{*}$$ is large (with regard to $$F_{n}$$) and $$C_{n}^{*} \subseteq C_{n} \subseteq F_{n}$$, $$C_{n}$$ is primarily $$C_{n}^{*}$$, which means $$C_{n}$$ is primarily easy/hard, contradicting $$C_{n}$$ is hard/easy.
    ...
    On the other hand, if we have an easy predicate $$p$$ for $$f_{n} \in C_{n}$$ with $$C_{n}$$ hard, ...
    

    will be in the book in text form as:

    Since $$C_{n}^{*}$$ is large (with regard to $$F_{n}$$) and $$C_{n}^{*}$$ is subset of $$C_{n}$$ which is subset of $$F_{n}$$, $$C_{n}$$ is primarily $$C_{n}^{*}$$, which means $$C_{n}$$ is primarily easy/hard, contradicting $$C_{n}$$ is hard/easy.
    ...
    On the other hand, if we have an easy predicate $$p$$ for "$$f_{n}$$ being a member of $$C_{n}$$ that is hard", ...
    
    • Roielle Hugh permalink
      May 20, 2016 10:02 pm

      Blockquote may work for the note above.



  3. May 20, 2016 7:24 pm

    Dear Professors,

    A major complexity class is NL. NL is the class of languages that are decidable on a nondeterministic logspace machine. A logspace machine is a Turing machine with a read-only input tape, a write-only output tape, and read/write work tapes. The work tapes may contain at most O(log n) symbols. In addition, Arora and Barak describe in their book “Computational Complexity: A Modern Approach” the certificate-based definition of NL.

    The certificate-based definition of NL assumes that a deterministic logspace machine verifies the elements of an NL language using another, separate read-only tape for the certificate. At each step the machine’s head on that tape can either stay in place or move to the right. In particular, it cannot reread any bit to the left of where the head currently is. For that reason this kind of special tape is called “read once”.

    CLIQUE is a well-known NP-complete problem. We show that a certificate u which represents a clique V’ of size k on a graph G can be verified by a deterministic logspace machine M using an additional special read-once input tape, and thereby CLIQUE will be in NL as a direct consequence of this previous definition. If any single NP-complete problem can be solved in polynomial time, then every NP problem has a polynomial time algorithm. However, every problem in NL is in P too. Consequently, we can conclude that P = NP.

    You can read more about this idea and check my verifier algorithm at

    https://hal.archives-ouvertes.fr/hal-01316353/document

    Any suggestion will be welcome,
    Thanks in advance,
    Frank.

    • Peter Gerdes permalink
      May 21, 2016 11:42 am

      Did it ever occur to you that begging people to read your P=NP proofs via comments on blog posts might be strong evidence against the claim that you have solved the problem?

      I mean surely if you are smart enough to have solved the problem it has occurred to you that blog comments to that effect wouldn’t be particularly believable.

      I am going to do everything I can to convince all the other mathematicians I know to announce their major results through crackpot blog postings just for the humor value however.

  4. May 20, 2016 7:59 pm

    This sounds like a very good idea if you want to regulate companies that (arguably illegally) buy and sell private user data. Unfortunately this technique doesn’t stop criminals who illegally buy and sell account information resulting from these leaks. Social pressure doesn’t influence them.

    • Daniel permalink
      July 26, 2016 1:59 pm

      Of course it can stop criminals. If they cannot re-introduce this information (or a derivative of it) into the mainstream, then it is useless.

      • July 26, 2016 2:50 pm

        That’s my point: criminals don’t work in the mainstream. The technique in this post only applies to institutions that are subject to oversight, which (uncaught) criminals are by definition not. If someone hacks into your bank, steals your loan information, and sells it to scammers who later try to scam you using that information, this scheme will not protect you. The horse is still out of the barn.

  5. Dave Tweed permalink
    May 21, 2016 2:36 am

    Ignoring the detailed mathematics, in terms of information leaked there’s “access data” (bank account numbers, SSNs, etc.) and there’s more attribute-like data (e.g., has recently lost a lot of weight, buys a lot of alcohol, has the spending pattern of an adulterer, etc.). In privacy terms the attribute data is more significant in decision making.

    The thing is that unlike access data, which you can only know if you see it stored, roughly the same attribute data can be obtained from many sources (e.g., for alcohol abuse: from AA attendance, from empty alcohol cans in the garbage, from purchase records, from DUI citations, etc.). So I think you face the problem of needing to show, to high probability, that the particular leaked version of the attribute data was used in a decision, and not the same sort of datum from a different source.

  6. May 21, 2016 3:29 am

    This is not very related, but reading your post, I was thinking, why don’t they store dummy information in databases? For example, if someone stores one million records, they could add another million (or trillion!) phony entries. Only real entries could be logged into, because no one would know the password for the phony ones. If sometimes statistics needs to be done, then of course all phony entries should be generated from the real ones with some algorithm. If a real record is deleted, then even that could be balanced by generating more phony records. Even the number of the real records could be stored, if needed.
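    The decoy idea in this comment resembles what security practice calls “honeytokens.” A minimal hypothetical sketch (field names and generation scheme are invented for illustration; real decoys would need to be statistically indistinguishable from real rows, which this does not attempt):

```python
import random

def add_decoys(real_rows, factor, seed=0):
    """Pad a table of (username, balance) rows with synthetic decoy rows
    drawn from the same value ranges, shuffled in so that a thief cannot
    tell which rows are real; the owner keeps the real keys separately."""
    rng = random.Random(seed)
    decoys = [("user%06d" % rng.randrange(10**6), rng.randrange(10_000))
              for _ in range(factor * len(real_rows))]
    padded = list(real_rows) + decoys
    rng.shuffle(padded)
    return padded

real = [("alice", 1200), ("bob", 340)]
padded = add_decoys(real, factor=100)   # 2 real rows hidden among 200 decoys
```

    As the comment notes, any statistics computed over the padded table must either filter to the real keys or use decoys generated to preserve the aggregates.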

  7. May 21, 2016 6:47 am

    If you are a good person with nothing to hide, then “privacy” should not be an “issue”. End of story. Honest people are not afraid to post on the internet using their real identities.

    • Roielle Hugh permalink
      May 21, 2016 8:26 am

      You probably did not get it completely right.

      A good person has nothing to hide and that is the exact reason for “privacy”. That is actually what “privacy” means.

      Make a distinction between an identity (a name for the purpose) not used for deceit and information that has no purpose other than ‘intended’ use.

      Not something quite clear cut, perhaps.

    • May 21, 2016 10:47 am

      Am I a bad person because I have a credit card?

  8. May 21, 2016 6:54 am

    Fake-named phonies are the real problem with the internet.

  9. Peter Gerdes permalink
    May 21, 2016 11:25 am

    Interesting, but it’s just not what people care about when keeping private information private.

    The information people really care about getting out is information that is personally or professionally embarrassing: things like membership in Ashley Madison, sexual orientation, or activities at odds with one’s professional colleagues (or that can be used against us legally, like weed smoking).

    In all these cases we KNOW (or at least suspect) that this information has been used when the bad effects happen.

    The reason we worry about this information getting out isn’t that large corporations might secretly use it but we fear that some people will feel PERFECTLY JUSTIFIED IN USING THAT INFORMATION AGAINST US.

    I suspect that if we had the choice to entirely abandon the kind of big data privacy your scheme protects (statistical decisions by banks etc..) while retaining the personal privacy I mention above we wouldn’t hesitate to do so. After all these statistical decisions lack any personal barb and promise to reward the virtuous (and we all think we will benefit from that the same way almost everyone believes they are an above average driver).

  10. Craig Alan permalink
    May 25, 2016 9:55 am

    See an imaginary dialogue about the P vs NP problem: http://www.ptep-online.com/index_files/2016/PP-46-18.PDF

    Craig

    Sent from my iPhone

    • Craig permalink
      June 10, 2016 11:31 am

      I’m sorry, I wasn’t trying to post it to this forum, since it is off topic. That was from a private email sent to Professors Lipton and Regan from my iphone.

  11. Daniel permalink
    July 26, 2016 2:11 pm

    You are putting privacy and copyright in the same bucket, like a huge DRM system; which isn’t necessarily bad.
