Hard Truths about the Hard Business of finding Hard Random Numbers
Editorial note: this rant was originally posted here but has now moved to a permanent home where it will be updated with new thoughts.
As many have noticed, there is now a permathread (Paul's term) on how to do random numbers. It's always been warm. Now the arguments are on solid simmer, raging on half a dozen cryptogroups, all thanks to the NSA and their infamous breach of NIST, American industry, mom's apple pie and the privacy of all things from Sunday school to Angry Birds.
Why is the topic of random numbers so bubbling, effervescent, unsatisfying? In short, because generators of same (RNGs) are *hard*. They are in practical experience trickier than most of the other modules we deal with: ciphers, HMACs, public key, protocols, etc.
Yet, we have come a long way. We now have a working theory. When Ada put together her RNG this last summer, it wasn't that hard. Out of our experience, herein is a collection of things we figured out; with the normal caveat that, even as RNs require stirring, the recipe for 'knowing' is also evolving.
- Use what your platform provides. Random numbers are hard, which is the first thing you have to remember, and always come back to. Random numbers are so hard, that you have to care a lot before you get involved. A hell of a lot. Which leads us to the following rules of thumb for RNG production.
- Use what your platform provides.
- Unless you really really care a lot, in which case, you have to write your own RNG.
- There isn't a lot of middle ground.
- So much so that for almost all purposes, and almost all users, Rule #1 is this: Use what your platform provides.
- When deciding to breach Rule #1, you need a compelling argument that your RNG delivers better results than the platform's. Without that compelling argument, your results are likely to be more random than the platform's system in every sense except the quality of the numbers.
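For what it's worth, following Rule #1 is usually a one-liner. Here is a minimal sketch in Python (our choice of language for illustration only): `os.urandom` and the standard `secrets` module both draw from the operating system's generator. The helper names `new_key` and `new_session_token` are hypothetical.

```python
import os
import secrets


def new_key(nbytes: int = 32) -> bytes:
    """Return key material straight from the platform's RNG."""
    return os.urandom(nbytes)


def new_session_token(nbytes: int = 32) -> str:
    """Return a URL-safe token built from platform randomness."""
    return secrets.token_urlsafe(nbytes)


key = new_key()
token = new_session_token()
```

Nothing here mixes, expands or caches anything; all of that is left to the platform, which is the point of the rule.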
- Software is our domain.
- Software is unreliable. It can be made reliable under bench conditions, but out in the field, any software of more than 1 component (always) has opportunities for failure. In practice, we're usually talking dozens or hundreds, so failure of another component is a solid possibility; a real threat.
- What about hardware RNGs? Eventually they have to go through some software, to be of any use. Although there are some narrow environments where there might be a pure hardware delivery, this is so exotic, and so alien to the reader here, that there is no point in considering it. Hardware serves software. Get used to it.
- As a practical reliability approach, we typically model every component as failing, and try and organise our design to carry on.
- Security is also our domain, which is to say we have real live attackers.
- Many of the sciences rest on a statistical model, which they can do in the absence of any attackers. According to Bernoulli's law of large numbers, models of data will even out over time and quantity. In essence, we then can use statistics to derive strong predictions. If random numbers followed the law of large numbers, then measuring 1000 of them would tell us with near certainty that the machine was good for another 1000.
- In security, we live in a byzantine world, which means we have real live attackers who will turn our assumptions upside down, out of spite. When an attacker is trying to aggressively futz with your business, he will also futz with any assumptions and with any tests or protections you have that are based on those assumptions. Once attackers start getting their claws and bits in there, the assumption behind Bernoulli's law falls apart. In essence this rules out lazy reliance on statistics.
- No Test. There is no objective test of random numbers, because it is impossible to test for unpredictability. Which in practical terms means that you cannot easily write a test for it, nor can any test you write do the job you want it to do. This is the key unfortunate truth that separates RNs out from ciphers, etc (which latter are amenable to test vectors, and with vectors in hand, they become tractable).
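To make that concrete, here is a small sketch, assuming Python and a deliberately bad generator of our own invention: SHA-256 over a public counter. Anyone can reproduce every byte of it, yet a simple frequency (monobit) check sees nothing wrong. The threshold of 4 standard deviations is an arbitrary, generous choice for illustration.

```python
import hashlib
import math


def predictable_stream(n_bytes: int) -> bytes:
    """Bytes any attacker can regenerate: SHA-256 of a public counter."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n_bytes])


def monobit_z_score(data: bytes) -> float:
    """Distance of the one-bit count from n/2, in standard deviations."""
    ones = sum(bin(byte).count("1") for byte in data)
    n_bits = len(data) * 8
    return abs(2 * ones - n_bits) / math.sqrt(n_bits)


data = predictable_stream(100_000)
z = monobit_z_score(data)
looks_random = z < 4.0  # the statistics are happy...
# ...but the attacker can call predictable_stream() too.
```

The check measures the distribution of the bits, not whether an attacker can guess them, which is why passing it proves so little.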
- Entropy. Everyone talks about entropy so we must too, else your future RNG will exhibit the wrong sort of unpredictability. Sadly, entropy is not precisely the answer, enough such that talking about it is likely missing the point. If we could collect it reliably, RNs would be easy. We can't so it isn't.
- Entropy is manifest physical energy, causing events which cannot be predicted using any known physical processes, by the laws of science. Here, we're typically talking about quantum energy, such as the unknown state of electrons, which can collapse either way into some measurable state, but it can only be known by measurement, and not predicted earlier. It's worth noting that quantum energy abounds inside chips and computers, but chips are designed to reduce the noise, not increase it, so turning chip entropy into RNs is not as easy as talking about it.
- There are objective statements we can make about entropy. The objective way to approach the collection of entropy is to carefully analyse the properties of the system and apply science to estimate the amount of (e.g.) quantum uncertainty one can derive from it. This is possible and instructive, and for a nice (deep) example of this, see John Denker's Turbid.
- At the level of implementation, objective statements about entropy fail for 2 reasons. Let's look at those, as understanding these limitations on objectivity is key to understanding why entropy does not serve us so willingly.
- Entropy can be objectively analysed as long as we do not have an attacker. An attacker can deliver a faulty device, can change the device, and can change the way the software deals with the device at the device driver level. And much more...
- This approach also only works if we have control of our environment. Of course, it is very easy to say "Buy the XYZ RNG and plug it in." But many environments do not have that capability, often enough we don't know our environment, and the environment can break or be changed. Examples: rack servers lacking sound cards; phones; VMs; routers/firewalls; early startup on embedded hardware.
- In conclusion, entropy is too high a target to reach. We can reach it briefly, in controlled environments, but not enough to make it work for us. Not enough, given our limitations.
- CSRNs. The practical standard to reach therefore is what we call Cryptographically Secure Random Numbers.
- Cryptographically secure random numbers (or CSRNs) are numbers that are not predictable /to an attacker/. In contrast to entropy, we might be able to predict our CSRNs, but our enemies cannot. This is a strictly broader and easier definition than entropy, which is needed because collecting entropy is too hard, as above.
- Note our one big assumption here: that we can determine who is our attacker and keep him out, and determine who is friendly and let them in. This is a big flaw! But it happens to be a very basic and ever-present one in security, so while it exists, it is one we can readily work with.
- Design. Many experiments and research seem to have settled on the following design pattern, which we call a Trident Design Pattern:
Entropy collector ----\
                       \ _____          _________
                        /     \        /         \
Entropy collector ---->( mixer )----->( expansion )-----> RNs
Entropy collector ----/
In short, many collectors of entropy feed their small contributions in to a Mixer, which uses the melded result to seed an Expander. The high level caller (application) uses this Expander to request her random numbers.
- Collectors. After all the above bad news, what is left in the software toolkit is: redundancy.
- A redundant approach tells us to draw our RNs from different places. The component that collects RNs from one place is called a Collector. Therefore we want many Collectors.
- Each of the many places should be uncorrelated with each other. If one of these were to fail, it would be unlikely that others also would fail, as they are uncorrelated. Typical studies of fault-tolerant systems often suggest the number 3 as the target.
- Some common collector ideas are:
- the platform's own RNG, as a Collector into your RNG,
- any CPU RNG such as Intel's RDRAND,
- measuring the difference between two uncorrelated clocks,
- timings and other measurands from events (e.g., mouse clicks and locations),
- available sensors (movement on phones),
- differences seen in incoming new business packets,
- a roughly protected external source such as a business feed.
By the analysis that got us past Rule #1, there are no great Collectors by definition, as otherwise we'd already be using them, and this problem would go away.
- An attacker is assumed to be able to take a poke at one or two of these sources, but not all. If the attacker can futz with all our sources, this implies that he has more or less unlimited control over our entire machine. In which case, it's his machine, and not ours. We have bigger problems than RNs.
- We tend to want more numbers than fault-tolerant reliability suggests because we want to make it harder for the attacker. E.g., 6 would be a good target.
- Remember, we want maximum uncorrelation. Adding correlated collectors doesn't improve the numbers.
- Because we have redundancy, on a large scale, we are not that fussed about the quality of each Collector. Better to add another collector than improve the quality of one of them by 10%. This is an important benefit of redundancy, we don't have to be paranoid about the quality of this code.
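As a rough illustration of the Collector idea, here is a sketch in Python with three sources of very different quality and a polling loop that treats any of them as liable to fail. All the names (`platform_collector`, `clock_difference_collector`, `process_noise_collector`, `collect_all`) are hypothetical, and whether the two clocks used are really uncorrelated on a given machine is an assumption that would need checking.

```python
import hashlib
import os
import time
from typing import Callable, List

Collector = Callable[[], bytes]


def platform_collector() -> bytes:
    """The platform's own RNG, used as just one Collector among several."""
    return os.urandom(32)


def clock_difference_collector(samples: int = 64) -> bytes:
    """Hash the differences between two clocks sampled in a tight loop.

    Assumption: the wall clock and the performance counter jitter a little
    against each other. On many machines they share one oscillator, so
    treat this as a weak source at best.
    """
    h = hashlib.sha256()
    for _ in range(samples):
        delta = (time.time_ns() - time.perf_counter_ns()) & 0xFFFFFFFFFFFFFFFF
        h.update(delta.to_bytes(8, "little"))
    return h.digest()


def process_noise_collector() -> bytes:
    """Low-grade odds and ends: process id and CPU time used so far."""
    return f"{os.getpid()}:{time.process_time_ns()}".encode()


def collect_all(collectors: List[Collector]) -> List[bytes]:
    """Poll every Collector; a failing one contributes nothing but is not fatal."""
    contributions = []
    for collect in collectors:
        try:
            contributions.append(collect())
        except Exception:
            contributions.append(b"")  # model every component as failing
    return contributions
```

None of these is individually trustworthy, which is fine under the redundancy argument: adding a fourth uncorrelated source is worth more than polishing any one of these.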
- Mixer. Because we want the best and simplest result delivered to the caller, we have to take the output of all those above Collectors, mix them together, and deliver downstream.
- The Mixer is the trickiest part of it all. Here, you make or break. Here, you need to be paranoid. Careful. Seek more review.
- The Mixer has to provide some seed numbers of say 128-512 bits to the Expander (see below for rationale). It has to provide this on demand, quickly, without waiting around.
- There appear to be two favourite designs here: Push or Pull. In Push, the Collectors send their data directly into the Mixer, forcing it to mix it in as it's pushed in. In contrast, a Pull design will have the Mixer asking the Collectors to provide what they have right now. This in short suggests that in a Push design the Mixer has to have a cache, while in Pull mode, the Collectors might be well served in having caches within themselves.
- Push or Mixer-Cache designs are probably more popular. See Yarrow and Fortuna as perhaps the best documented efforts.
- We wrote our recent Trident effort (AdazPRING) using Pull. The benefits include: simplified API as it is direct pull all the way through; no cache or thread in mixer; and as the Collectors better understand their own flow, so they better understand the need for caching and threading.
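A Pull-style Mixer can be sketched in a few lines. This is not the AdazPRING code, just an illustration under our own assumptions: SHA-256 as the mixing function, a 256-bit seed, and a hypothetical `mix` function. Each contribution is prefixed with its position and length so that two different sets of inputs cannot collapse into the same byte string before hashing.

```python
import hashlib
import os
import time
from typing import Callable, Iterable

Collector = Callable[[], bytes]


def mix(collectors: Iterable[Collector]) -> bytes:
    """Pull from every Collector right now and meld the results into a seed."""
    h = hashlib.sha256()
    h.update(b"trident-mixer-sketch")  # hypothetical domain-separation label
    for index, collect in enumerate(collectors):
        try:
            data = collect()
        except Exception:
            data = b""  # a dead Collector must not stop the seed being made
        h.update(index.to_bytes(4, "big"))
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()  # 256 bits, within the 128-512 bit range discussed above


# Example pull: the platform RNG plus one low-grade timing source.
seed = mix([
    lambda: os.urandom(32),
    lambda: time.perf_counter_ns().to_bytes(8, "little"),
])
```

There is no cache and no thread in this Mixer; it asks, hashes and returns, which is the simplification a Pull design buys. The cost is that slow Collectors would need their own caches to keep `mix` quick.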
- Expander. Out of the Mixer come some nice RNs, but not a lot. That's because good collectors are typically not firehoses but rather dribbles, and the Mixer can't improve on that: a deterministic mixing function cannot create entropy that was not already in its inputs.
- The caller often wants a lot of RNs and doesn't want to wait around.
- To solve the mismatch between the Mixer output and the caller's needs, we create an expansion function or Expander. This function is pretty simple: (a) it takes a small seed and (b) turns that into a hugely long stream. It could be called the Firehose...
- Recalling our truth above of (c) CSRNs being the goal, not entropy, we now have a really easy solution to this problem: Use a cryptographic stream cipher. This black box takes a small seed (a-check!) and provides a near-infinite series of bytes (b-check!) that are cryptographically secure (c-check!). We don't care about the plaintext, but by the security claims behind the cipher, the stream is cryptographically unpredictable without access to the seed.
- Super easy: Any decent, modern, highly secure stream cipher is probably good for this application. Our current favourite is ChaCha20 but any of the eSTREAM portfolio would be fine.
- In summary, the Expander is simply this: when the application asks for a PRNG, we ask the Mixer for a seed, initialise a stream cipher with the seed, and return it back to the user. The caller sucks on the output of the stream cipher until she's had her fill!
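To show how little there is to the Expander, here is a sketch in pure Python. The block function follows the ChaCha20 layout published in RFC 8439 (32-byte key, 32-bit block counter, 12-byte nonce); the `Expander` class, its `read` method, and the choice of an all-zero nonce with the seed as the key are our own assumptions for illustration. Pure Python is slow, so a real build would lean on a reviewed library instead.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(value: int, count: int) -> int:
    return ((value << count) & MASK32) | (value >> (32 - count))


def _quarter_round(s, a, b, c, d):
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 7)


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """Produce one 64-byte keystream block."""
    initial = (
        list(struct.unpack("<4I", b"expand 32-byte k"))
        + list(struct.unpack("<8I", key))
        + [counter & MASK32]
        + list(struct.unpack("<3I", nonce))
    )
    s = initial[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        _quarter_round(s, 0, 4, 8, 12)   # columns
        _quarter_round(s, 1, 5, 9, 13)
        _quarter_round(s, 2, 6, 10, 14)
        _quarter_round(s, 3, 7, 11, 15)
        _quarter_round(s, 0, 5, 10, 15)  # diagonals
        _quarter_round(s, 1, 6, 11, 12)
        _quarter_round(s, 2, 7, 8, 13)
        _quarter_round(s, 3, 4, 9, 14)
    words = [(x + y) & MASK32 for x, y in zip(s, initial)]
    return struct.pack("<16I", *words)


class Expander:
    """Turn one 32-byte seed from the Mixer into a long stream of bytes."""

    def __init__(self, seed: bytes) -> None:
        if len(seed) != 32:
            raise ValueError("seed must be exactly 32 bytes")
        self._key = seed
        self._nonce = bytes(12)  # assumption: each seed is used only once
        self._counter = 0
        self._buffer = b""

    def read(self, n: int) -> bytes:
        while len(self._buffer) < n:
            if self._counter > MASK32:
                raise RuntimeError("stream exhausted; ask the Mixer for a new seed")
            self._buffer += chacha20_block(self._key, self._counter, self._nonce)
            self._counter += 1
        out, self._buffer = self._buffer[:n], self._buffer[n:]
        return out
```

A caller would do something like `Expander(seed).read(16)` with a fresh `seed` from the Mixer. The stream is only as unpredictable as that seed, so reusing a seed reproduces the same bytes.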
- When a system first starts up there is often a shortage of easy entropy to collect. This can lead to catastrophic results if your app decides that it needs to generate high-value keys as soon as it starts up. This is a real problem -- scans of keys on the net have found significant numbers that are the same, which is generally traced to the restart problem. To solve this, either change the app (hard) ... or store some entropy for next time. How you do this is beyond scope.
- Then, assuming the above, the problem is that your attacker can do a halt, read off your RNG's state in some fashion, and then use it for nefarious purposes. This is especially a problem with VMs. We therefore set the goal that the current state of the RNG cannot be rolled forwards or backwards to predict future or prior uses. To deal with this, a good RNG will typically:
- stir fresh entropy into its cache(s) even if not required by the callers. This can be done (e.g.) by feeding one's own Expander's output in, or by setting a timer to poll the Collectors.
- Use hash whiteners between elements. Typically, a SHA digest or similar will be used to protect the state of a caching element as it passes its input to the next stage.
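The two habits above can be sketched together. This is an illustration under our own assumptions, not a description of any particular RNG: a hypothetical `StirredState` class keeps a 32-byte state, passes it through SHA-256 before anything leaves (the whitener), ratchets it one-way after every use so an old state cannot be recovered from the current one, and folds in fresh Collector data whenever `stir` is called so a stolen state stops being useful.

```python
import hashlib


class StirredState:
    """A cached seed pool that is whitened on the way out and ratcheted forward."""

    def __init__(self, initial: bytes) -> None:
        self._state = hashlib.sha256(b"init" + initial).digest()

    def stir(self, fresh: bytes) -> None:
        """Fold new Collector output into the state (call on a timer or on demand)."""
        self._state = hashlib.sha256(b"stir" + self._state + fresh).digest()

    def next_seed(self) -> bytes:
        """Hand out a whitened seed, then move the state on one-way."""
        seed = hashlib.sha256(b"out" + self._state).digest()
        # Ratchet: the old state cannot be recomputed from the new one
        # without inverting SHA-256, which protects prior outputs.
        self._state = hashlib.sha256(b"next" + self._state).digest()
        return seed
```

Rolling backwards is blocked by the ratchet alone. Rolling forwards is only blocked once `stir` has been called with something the attacker did not see, which is why the stirring has to happen even when no caller is asking.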
- As a technical design argument, the only objective way that you can show that your design is at least as good as or better than the platform-provided RNG is the following:
- Very careful review and testing of the software and design, and especially the Mixer; and
- including the platform's RNG as a Collector.
- Business Justifications.
As you can see, doing RNGs is hard! Rule #1 -- use what the platform provides. You shouldn't be doing this. About the only rationales for doing your own RNG are the following.
- Your application has something to do with money or journalism or anti-government protest or is a CVP. By money, we mean Bitcoin or other forms of hard digital cash, not online banking. The most common CVP or centralised vulnerability party (aka TTP or trusted third party) is the Certification Authority.
- Your operating platform is likely to be attacked by a persistent and aggressive attacker. This might be true if the platform is one of the following: any big American or government controlled software, Microsoft Windows, Java (code, not applets), any mobile phone OS, COTS routers/firewalls, virtual machines (VMs).
- You write your own application software, your own libraries *and* your own crypto!
- You can show objectively that you can do a better job.
Note that it is still a hard test, you want ALL of those to be true before you start mucking around in this chaotic area.
That all said, good luck! Comments to the normal place, please, and Ed's note: this will improve in time.
Posted by iang at January 30, 2014 12:34 PM
Very good article, congratulations for the synthesis of so many sources.
You wrote: "the humble sound card can be put to lots of different uses, and it is relatively hard for the bad guys to mess with it in a way that subverts the crypto without making the device unusable for other purposes".
Well, actually it's trivial to add a filter (either in the kernel/usermode driver or directly in the hardware) to remove the precious random "noise" from the input source - and yet claim that removing noise is a valuable feature if someone is ever going to question that most probably undocumented choice.
We have seen this happening for some physical randomness sources you quote, and for an obvious reason: defeating the purpose of encryption is the cheapest way to break code.
I really enjoyed reading about your philosophy of RNGs. It reminds me of very similar discussions about RNGs that I had with Dr. M.M. Atalla so many years ago.
The RNG I implemented under his tutelage incorporated many of your rules of thumb and had a similar architecture of collector(s), mixer and expander. One interesting difference is that we collected a lot of (electron thermal noise) randomness up front to incorporate into the mixer/expander machinery. A lot was on the order of a CD-ROM worth of random bits. It chewed up RAM, which had to be tested constantly to make sure the bits didn't flip.
Then we used real-time collectors (it varied whether it came from electron thermal noise or from "good enough" sources) to feed into the mixer/expander. We would refresh the mixer/expander store of persistent random bits periodically.