Rust needs a better non-cryptographic prng for its rand crate. This is an explanation of how I went about designing one. I hope it will not only demonstrate that the resulting algorithm is worthy of consideration, but be useable as a guide for others who want to build a PRNG.

A bit of history

Originally SmallRng was PCG-32 and PCG-64. About a year or so ago it was switched to xoshiro128++ and xoshiro256++.

This was done after some concerns were raised by the author of xoshiro, who posted a blog entry on PCG and also raised an issue on the possibility of PCG being invertible. PCG’s author’s response is worth a read.

Later some students at the Sorbonne found a way to invert it. (In practice if you are using Rust’s PCG implementation, in a case where cryptographic security is not required this is probably not a concern. To invert Rust’s implementation and obtain the state it takes between 2⁶² — 2⁷⁰ operations) However it does highlight some areas which could be improved upon.

Xoshiro was selected as a replacement because it was faster and able to generate higher quality numbers (more on that later). There are not currently any known issues with xoshiro256++. However it is not clear how long this will remain the case. There are several known issues with other members of the family of generators:

Despite all of this it is likely that xoshiro256++ gives better quality numbers than PCG-64.

So what’s wrong with PCG?

PCG’s overall design is solid. However there are some practical issues.

PCG is based on multiplication. Multiplication is a great mixing function. If you work out what multiplication looks like in terms of ANDs and XORs it becomes apparent how elaborate a 64bit multiply is. The amount of transistors required to implement it in hardware prohibits multiplication from being used in most cryptographic algorithms. But for non-cryptographic PRNGs which only need to run on a general purpose CPU, multiplication is very useful because there is already a hardware implementation. Multiplication does have one major downside: low order bits can affect high order bits, but never the other way around.

In PCG this means the period of the low order bits is smaller than the high order bits. In fact, the period of the Nth-bit of a LCG is 2ⁿ. For PCG-64 this means that the lower 32bits of the 128bit internal state aren’t of very good quality. PCG’s output premutation is specifically designed to prevent this from being an issue.

With an LCG (which PCG is based on) the increment can be any odd number. In some implementations the increment is used as a separate parameter to create “streams” which are “independent” sequences which also have the full period of the generator. This is often pitched in older literature on LCGs; as a feature as a way to create multiple generators such as one for each of many threads. But this doesn’t really make a lot of sense today because it simply isn’t that expensive to initialize a whole new PRNG. (This is well handled by libraries, and a single seed can be expanded to many using a cryptographically strong prng)

The Rust rand library treats the increment as simply part of the seed. This avoids confusion caused by the exact definition of a “stream” and what assumptions can be safely made. It also makes the seed 256 bits. This is useful because it ensures large numbers of generators created at random don’t end up close to one another in the sequence due to the birthday paradox. The downside is it uses 256 bits of memory even though it only has 128 bits of state.

The other problem is PCG-64 is somewhat slow. The update requires a 128bit multiply. This isn’t a hardware instruction. (At least not on any CPU I have access to). It compiles to four 64bit multiplies and seven adds. On my computer PCG-64 is 10–15% slower than xoshiro256++ even though its state is half the size and its update function is simpler. (PCG-64-fast is faster, but still close.)

Because of its speed and memory use the PCG-64 implementation is compared to 256 bit generators even though it only has 128 bits of state. While PGC’s excellent design allows it to punch above its weight, there is no way even the best algorithm could compete with a decent one with twice the state size. For PRNGs the number bits of state can have a huge impact on quality. (The PGC paper does a good job of explaining why state size matters so much)

What’s good about PCG?

If you haven’t read it, the PGC paper is one of the best papers on random number generation. It’s well worth reading and is very accessible.

The paper makes a strong case for PCG’s design. It shows combining the state update function from an LCG with an output permutation makes a really robust generator. (Assuming the permutation does not commute with multiplication)

Instilling confidence is no easy task because while it is easy to show an algorithm is bad it is much harder to show it is good. PCG paper manages this.

The problem it addressed is that randomness tests aren’t very good. They cannot be. Applications are going to use generated numbers in millions of different ways that couldn’t possibly be covered by any test suite.

Instead the PCG paper shows how the ability to detect statistical flaws is a function of state size. By creating a “scaled down” version of the same algorithm which operates with a smaller state, it is much easier to detect problems. If an algorithm is robust with a small state size then the larger version should be to a exponentially greater extent. (Or if it doesn’t, something is wrong)

Goals for a new generator

Rather than trying to compete with this work, my goal is to build on it and create a generator that is a very close cousin of PCG but solves the problems in Rust’s implementation outlined above. The goals for such an algorithm are to:

  • Have a 64bit output
  • Have a 256bit state and utilize all of it.
  • A have a provable period. (We want to know for sure that there isn’t say a 1 in 2⁴⁵ probability of looping through the same 5 numbers, which was not spotted in testing)
  • Be structured similar to PCG, so that it can be easily analyzed and validated using the same techniques.
  • Be able to “scale down” to a smaller version of the algorithm that can be more thoroughly tested. (IE: prove it is not simply too big to fail)
  • Take advantage of multiplication hardware instructions
  • Use 64 bit multiplication rather than 128bit so the update is not expensive
  • Ensure that high order bits can affect low order bits
  • Pass all of the standard test suites for randomness with many bits to spare.
  • Be significantly faster than both PCG-64 and xoshiro256++.

Design idea

First we want a way to create a generator with a 256 bit state where none of the bits are wasted. PCG-Fast uses a MCG generator. This eliminates the increment which reduces the generator’s size to just the state size.

The difficulty with a 256bit LCG or MCG based generator is that it would require a 256 bit multiply which would makes the the update function expensive.

There is an alternative to an LCG which allows for extending the state: a Multiply with carry generator (MWC). A MWC generator is actually equivalent to a very particular MCG generator so it should still be able to fit in the PCG framework.

It should be possible to make a Lag-3 64bit MWC generator to have a 256bit state where the update function is the same cost as an LCG with a 64bit state. Combined with an appropriate permutation of the state for the output, it should be possible to build a fast high quality generator.

How a MWC generator works

A 64bit Lag-3 MWC generator has four 64 bit variables: “C”, “X1”, “X2”, “X3”, where the Xs contain the computed output for the previous 3 rounds. The update function consists of:

T = a * X3 + C

Where a is a constant. T is a temporary variable. The upper 64 bits of T are assigned to C and the lower are assigned to X1. After the previous value of X1 is stored in X2 and X2’s value in X3.

What’s going on here?

This is actually equivalent to a 256bit multiplicative congruential generator (MCG) of a particular form. Namely it is the same as:

X = a*X mod a*2¹⁹²-1

(Where here X is 256bit value)

The process works similar to long multiplication where each “digit” can have 2⁶⁴ values. So the process is:

When X3 is multiplied by a, the high order bits carry over and get added onto the result of a*X2 to generates the result in the 2⁶⁴s place. Similarly the high order bits from a*X2 carry over to add to a*X1 to generate the result in the 2¹²⁸s place. The high order bits from a*X1 carry over into the 2¹⁹²s place.

The true genius of the MWC generator is what happens when you need to multiply c by a, because c is already starting in the 2¹⁹²s place, and the whole operation is modulo a*2¹⁹²-1 and when c is “multiplied” by a it is guaranteed to wrap around exactly at the modulus boundary. So instead of performing a multiplication, c can simply be added to the 1s place.

This means that with three 64bit multiplies and three adds we can multiply a 256bit value by a 64bit value modulo another 256bit value. Three calls to the update function of the MWC are exactly the same as one update to the larger MCG. What’s more, the update is computed incrementally where after each multiply and add the bits for one 64 bit section is known. If only 64 bits of output are needed there is only one multiply and add required per output. (Additionally the Xs can be reordered to avoid the need for a pointer.)

The standard analysis techniques that apply to MCGs to be easily adapted to the MWC generator. We don’t need to store an increment and because the modulus is not a power of 2, high order bits end up affecting low order bits. However there are some rather strict criteria are needed when selecting the multiple, because the multiple also dictates the modulus.

Small scale testing

To study a random number generator, we need to look as some “randograms” of a scaled down version of the generator. A randogram is an image where the background is white, and each pixel has a X and a Y coordinate. Random numbers are generated the first number forms the X coordinate and the second becomes the Y and the corresponding pixel is darkened.

For example below is the output of an 8 bit LCG.

Each output is one byte, and each pair colors one pixel. The above runs through the full period of the LCG. As a result, its structure is visually apparent.

The same concept can be extended by adding a “lag”. Instead of plotting pairs of adjacent numbers, some number of values in between can be skipped over.

Here is the same 8bit LCG skipping 0–8 values. (From left to right)

Changing the lag makes the pattern appear different, but the structure is still apparent.

PCG adds a scrambling function on top of the LCG to make the structure harder to spot. Here is what it looks like when scaled down to 8 bits of state:

This looks a LOT better! Those clearly look much more random. The fact that there is still some hidden structure there is apparent by the fact that the apparent darkness or lightness varies depending on the lag. This corresponds to different rates of exact collisions.

If we instead plot the spectrum of an 8bit Lag-3 MWC generator we get the following:

These all look amazing… except for that 3rd section. That corresponds to the case where two numbers are skipped. If it looks a lot like the spectrum of the LCG above, that’s because that’s what it is.

When two numbers are skipped that corresponds to when the MWC’s update function is operating on the same X again. So it’s just performing:

X = a*X + c mod 2⁶⁴

On that same cell. This is exactly the same formula as an LCG, so of course it should look similar.

In fact if you go one step further an pair up consecutive outputs to turn the 8 bit generator into a 16 bit generator and graph a 65536 by 65536 version of the spectrum. (I’ve done this but not uploaded it here for obvious reasons) It looks just like the LCG or MCG spectrum. This is expected. After all the MWC is just a clever trick to allow working with a much larger MCG than the size of the hardware multiply.

Selecting an output permutation

To avoid having patterns in the output, we need to find an output permutation to mask it. It is important to understand WHY a pattern exists before attempting to mask it. In this case because we know it is the pattern of an LCG we can use the same technique that PCG uses to mask patterns from its LCG.

Fortunately the PGC paper provides a set of guidelines on just how to do this. There are two key ideas. First, the function should be a permutation (ie be reversible) this ensure that it can’t make the output any worse than the input. Second, it should introduce non-linearity by using operators which do not commute. (IE: if the update function is multiplication, then the output function should not also be multiplication) With a high probability, this makes the result is more random.

We can do: Additions, XORs, XorShifts, Random Rotations, and even a second multiplication if there is a xor in between. And there are 6 possible numbers to work with in each round X1–3, C, the High order bits after the multiplication and the low order bits.

Testing these combinations revealed that even very basic transformations like: X1^X3 left only faint traces of the LCG pattern. (You can spot it if you look closely)

(If this is blown up to the full 65k*65k image size the pattern can be seen better). I experimented with over a 100 different combinations of operations and parameters by constructing the 65k*65k image in each of 8 lags with two different multipliers and with and without reversing the output bits (32 images per combination). Then for all combinations for which no pattern was visually discernible, I coded the equivalent 64bit version of the algorithm and benchmarked it. I ended up with:

(X3 ^ X2) + (X1 ^ HI)

(Where HI are the high order bits from X3*a, ie what will become the next c if there is no carry from the add) Here is what it looks like:

Below are four 512x512 slices each from one of the 65k*65k images resulting from pairing the output and running through the full period. These correspond to lag 0 through 3 from left to right.

This looks really good and is understandable from a theoretical point of view the last operation performed on the Xs was an add, so it ends up with sequence of add, xor, add. This provides strong non-linearity, and combines distantly related parts of the state.

Per PCG’s naming conventions this means the full size algorithm is named:

Mwc-256-X-X-A-64

This is short for: “MultiplyWithCarry with a 256 bit state permuted with XOR, XOR, and Add (in that order) to obtain a 64 bit output.” Similarly the small version is named: Mwc-32-X-X-A-8.

How does it compare? We already saw what PCG looked like. Here is the equivalent 4 radiograms for a version of xoshiro256++ that has been scaled down to a 32bit state. aka xoshiro32++

The situation is similar to PCG. There is no obvious pattern but the differences in density at different lags hint at the underlying structure of the generator. This isn’t surprising or necessarily problematic. It can be reasonably expected for a generator when run through it’s full period.

However it does mean there may be a chance this approach may be able to generate higher quality number than both lcg-xls-rr and xoshiro++. Only testing will tell. However before we can do that, we first need to select a multiple.

Finding the right multiple

Finding the right multiple is tricky because there are several requirements. First it should have the maximal period.

With a normal MCG multiple needs to be coprime to the modulus to achieve maximum period with respect to the modulus. So if:

a * 2¹⁹² − 1

Is prime then it will be the maximum period. However we also want mod 64 output to be of the maximum period. This will be the case if:

a * 2¹⁹¹ − 1

is also prime. Then the the period will be:

a * 2¹⁹¹ − 1

This means if a is large (close to 2⁶⁴) this can get fairly close the the theoretical maximum period of a generator with 256 bits of state of 2²⁵⁶.

So far the requirements for a are:

  • Be large (close to 2⁶⁴)
  • a * 2¹⁹² − 1 is a prime number
  • a * 2¹⁹¹ − 1 is a prime number

Normally for LCGs and MCGs the multiplication parameter is evaluated with a “Spectral test”. This is based on work done in the 60’s and 70's. After asking around I was able to track down some code to do this. (Thank you, Dr. Melissa O’Neill :-) )

A spectral test works similarly to what we were doing visually above with randograms. It algorithmically looks at what the underlying structure of an LCG would create if run through it’s full period. The idea is that while an LCG will always have structure, it is better if the structure is more uniform and doesn’t create clear ‘bands’ of numbers which would create a distinctive pattern in the output.

Obviously this is less of an issue with a permuted output because the scrambling function will dramatically reduce the effect of the structure. And a 256bit PRNG is never going to get run anywhere near it’s full period. But given the choice it’s better to pick a multiple which works well before the output permutation.

I setup the code to search for a good spectrum for a MCG with the form:

X = a*X mod a*2¹⁹²-1

and quickly realized there is a problem. If a is nearly 2⁶⁴, then that is attempting to find a good 64bit multiple modulo 2²⁵⁶. This is simply not possible. To understand why; Imagine in our randogram that X just so happened to equal 1, the next value would be a. If X were 2 it would be 2a these obviously differ by a which is by definition less than 2⁶⁴. This is a miniscule fraction of 2²⁵⁶. There is no possible way this can even approach uniform distribution. To do that a would need to have no fewer than half the number of bits of the modulus.

So is that it? Are all MWCs with lag > 1 bad?

Not quite. We aren’t actually interested in:

X = a*X mod a*2¹⁹²-1

but rather

X = a*X mod a*2¹⁹²-1 mod 2⁶⁴

Because we are only concerned with individual 64bit components, not the whole number of the equivalent MCG.

In the case where X is small (< 2¹⁹²), this is the same as:

X = a*X mod mod 2⁶⁴

So we should evaluate the spectrum of a like it was for a normal 64bit MCG. For this to work a must be congruent to 3 or 5 mod 8.

There is also the question of when multiplication wrapps the a*2¹⁹²-1 modulus boundary. To avoid this we should also test the spectrum of:

X = a²*X mod a*2¹⁹²-1

and

X = a³*X mod a*2¹⁹²-1

etc.

So now our requirements for a are:

  • Be large (close to 2⁶⁴)
  • a * 2¹⁹² − 1 is a prime number
  • a * 2¹⁹¹ − 1 is a prime number
  • a mod 8 = 3 or 5
  • X = a*X mod mod 2⁶⁴ has good spectrum
  • X = a²*X mod a*2¹⁹²-1 has good spectrum
  • X = a³*X mod a*2¹⁹²-1 has good spectrum
  • X = a⁵*X mod a*2¹⁹²-1 has good spectrum

I wrote a program to automate this search. After testing over 100 million numbers I arrived at:

0xfeb344657c0af413

This provides a period of more than 2²⁵⁴, and has excellent spectral properties. It is possible to find a better value with more CPU time, but I am not sure it make a meaningful difference.

Benchmarking

Before getting deep into quality assessment, it’s worth checking if this actually worth it. Is a permuted Mwc actually faster? After some minor optimizing, on my laptop here are benchmarks on filling a 1kb byte array:

That is more than a factor of 2 improvement in speed. Obtaining this required manually optimizing the rotation of the rotation of the X values. For whatever reason LLVM was not great at doing that automatically. So if the function is not inlined it is not as fast, but there is a performance gain over the other algorithms. There may be opportunities to optimize this in the future.

Assessing quality

Ultimately it all comes down to quality. How can we be sure we are actually getting good output?

There are three major test suites for random number generators:

  • DieHarder — This is the simplest test suite and should be the easiest to pass. (Just install the package and pipe random to: `dieharder -a -k 2 -Y 1 -g 200`)
  • TestU01 — This includes many tests, but is a major pain to use, because it requires writing your PRNG in C and passing a function pointer that works by modifying the state in static memory. It also doesn’t hande of P value nearly as well as PractRand and DieHarder do. However it's suite BigCrush is the standard, so it's a very important benchmark.
  • PractRand — By far the easiest to deal with and is convenient for measuring “efficiency”

Mwc-256-X-X-A-64 passes all three test suites.

But that isn’t good enough. Instead tests are run on the “scaled down” version Mwc-32-X-X-A-8 to make sure they are not passing by coincidence because the state is too large.

Additionally, rather than looking at these test suites as a binary pass/fail, a better question is how many bits of state are required to pass.

For example BigCrush has tests which require a period above 2³⁵ to pass. So any PRNG with less than 36 bits of state will by definition fail because it’s output must cycle before the test is over. In practice even very good PRNGs will fail if they are only within a few bits of the theoretical limit due to the distribution of birthdays.

We can define the “efficiency” of a generator as the ratio of the number of bits it needs to pass a statistical test vs the theoretical minimum which could.

PractRand’s output is very convenient for measuring this because it keeps running tests and outputs a status update every time it crosses a power of two in the size of random values read, until eventually the test fails. All generators with a finite state will fail given an infinite time frame. However if they are good the length of time it takes for them to fail should increases exponentially with size of the state. For practical reasons the default cutoff is 32TB.

Consider the following graphic (blatantly stolen from the pcg-random.org blog “Does it beat the minimal standard”)

Based on this we can define the efficiency on the PractRand test suite of an LCG and an MCG as 58% and 56% respectively because they are as good as a theoretically ideal generator with that percentage of their bits.

For comparison other approaches:

  • The scaled down version of PCG from above, Lcg-32-Xsh-RR-16 aka PCG-16, has a failure point of 2²⁸ giving it an efficiency of 87%.
  • The scaled down version of xoshiro from above, xoshiro32++, also has a failure point of 2²⁸ giving it an efficiency of 87%.
  • The 8 bit lag-3 version from above, Mwc-32-X-X-A-8, has failure point in the PractRand testsuite of 2²⁸ giving it an efficiency of 87%

This implies all three algorithms with 32bits have approximately the same quality at 32bits.

Note: when testing the miniature MWCs, the multiplier that was used was NOT selected based on its spectra and would be considered a poor choice for an LCG or MCG. This is to validate the algorithm works even with a sub-optimal multiple.

As a second test, I created a different scaled down version: a 16bit MWC with a lag of 2 (48 total bits) and the output permutation:

(X2 ^ X1) + (C ^ HI);

The different output permutation is needed because there is no X3. This is referred to as Mwc-48-X-X-A-16.

Mwc-48-X-X-A-16's failure point in the PractRand testsuite is at 2⁴³. This gives the algorithm an efficiency of 89%. This means that the efficiency remain constant as the size increase. This is not surprising given that is is true of an MCG is what underpins the permuted MWC.

Scale invariant efficiency is a very important property because if an algorithm’s efficiency decreases as it scales up it indicates a design flaw where it may hit a ceiling at some point beyond which adding bits won’t improve its quality. (Such as an LFSR without any non-linear step)

The 32 bit version (Mwc-32-X-X-A-8) passes DieHarder and SmallCrush. Very few 32 bit generators can pass DieHarder and fewer pass SmallCrush, so this is significant. It does not pass BigCrush. (But that is impossible because it is below the theoretical minimum number of bits).

The 48 bit version (Mwc-48-X-X-A-16) passes the full DieHarder, Crush, and BigCrush test suites. This means for the BigCrush test suite the algorithm has an efficiency of at least 75%.

I wanted to test a 40 bit version of the algorithm by using a Lag-4 generator. However there are no 8 bit values for which a*2⁴⁰-1 and a*2³⁹-1 are both prime, so it is not possible to build a full period generator of this size. I tested one with the multiple of 227. This does not provide the maximal period but I experimentally measured the period as 32498585873 which is shy (~95%) of 2³⁵. (If a full period generator existed it would have had a period close to 2³⁹).

Shockingly this generator passes BigCrush! This makes it the shortest period generator to pass the test suite! (It’s period is actually below the size at which statistical problems must start to be detectable, but apparently not yet so far beyond as to cause the tests to actually fail.)

The reduced period 40 bit generator is somewhat analogous to a hypothetical full period 37 bit generator, because such a generator would have a similar period. This gives permuted MWC an efficiency of 97% on BigCrush. This efficiency is only beaten by PCG-RXS-M-XS the strongest and slowest member of PCG, which currently holds the record for the generator with the fewest bits of state to pass BigCrush.

Another metric to compare different generators is to measure the number different Crush and BigCrush tests that failed when the generator was below the theoretical minimum size for it to pass all the tests. This allows using Mwc-32-X-X-A-8 rather than the partial period generator.

In all of these quality metrics the ranking is the same. The Mwc-X-X-A generator’s quality surpass all of the commonly used generators: PCG-XSH-RR, PCG-XSL-RR, XorShift* and xoshiro, and is only exceeded by PCG-RXS-M-XS (which is rarely used for performance reasons)

These efficiencies are quite good. Assuming applications are just as sensitive to deviations from true random as PractRand; an application which uses Mwc-256-X-X-A-64 to instantiate 2⁶⁴ generators, which each produce 2⁶⁴ outputs will still have 100 bits of “headroom”. This gives me a fair degree of confidence that detectable deviations from true random will not ever arise in practice.

Advantages and Disadvantages

Obviously this particular algorithm is not the be all and end all of PRNGs. So here are some pros and cons that may help in deciding if it is appropriate.

  • MWC leans heavily on multiplication. That’s great if you are running on a CPU. But for GPUs, FPGAs, microcontrollers, or implementing in silicon, there are better alternatives.
  • There aren’t 64 bit multiply simd instructions, so while the performance is better than most other algorithms, there will probably be other faster algorithms that take advantage of hardware specific instructions. One interesting technique would be to use the aes-ni instruction. Randen does this and is very fast. A reduced round version could be even faster.
  • The output does not offer k-Dimensional Equidistribution. For most applications this is simply not relevant. For example while it can be done with PCG, but the Rust version never bothered to implement it. (or support it on any other generator). I think the main reason this exists, comes from legal requirements in areas like gambling, where a bias of 1 part in 2²⁵⁶ for or against a particular sequence of outputs has a legal consequence even if there is no practical one.
  • Length and size flexibility is good for other architectures. For example a 192bit version with a 32bit multiply for 32bit architectures is possible.
  • It is possible to extend the size of the state and increase the period by increasing the Lag. However doing so requires searching for a new multiple.
  • In the cryptographic sense of the word the PRNG is already “broken”. The output combines three 64bit values. Knowing any two of them and observing the output will reveal the third. Observing the next iteration will reveal the the whole state. Hence even though the state is 256bits it should be possible to figure out what it is in 2¹²⁸ operations by simply brute force guessing two of the values. This should not pose a problem for any non-cryptographic use.

Open Questions

  • Is 256 bits the right size? 192 would be smaller. 320 could allow for a faster permutation function, thus trading off memory for CPU.
  • Can performance be further improved? For example each round involves 4 loads and 4 stores in memory. It may be possible to combine those.
  • How much is gained by the larger size multiplication? For example would it be significantly worse to have a larger lag with a smaller multiple? Could it be faster?

Links and References

I have checked in all the code into a github repo and enabled discussions on the github repo. Any questions or clarifications can be posted there.

Related links worth reading:

Tom Kaitchuck is a software developer working on Pravega.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store