What is Romu?
Romu is a new family of random-number generators that match the statistical quality (randomness) of the best popular generators, but are faster. In fact, due to the processor’s ILP (explained below), Romu generators add no delay to an application when inlined. Having no delay makes Romu generators appear to be infinitely fast when inlined because Romu’s processing occurs in parallel with the application. And these generators are very fast when not inlined.
Such speed and high quality are possible because Romu generators are nonlinear, unlike most popular generators. A Romu generator combines the linear operation of multiplication with the nonlinear operation of rotation, yielding better randomness at higher speed.
Romu stands for rotate-multiply, which are the two arithmetic operations that all Romu generators employ. Most also perform one or more additions and/or subtractions. The names of Romu generators are suffixed with the name of the number of internal 64-bit variables they contain (which is their state), so the three-variable version is named RomuTrio. We highly recommend RomuTrio because its state-size is large enough to make trouble overwhelmingly unlikely, and is small enough to not consume too many registers in the processor, which would slow down an application due to a phenomenon called register pressure. RomuQuad is a larger version for those running gigantic jobs while wanting even lower probabilities of trouble. RomuDuo might be faster due to its lower register pressure. RomuDuoJr might be even faster still, and it’s very simple, consisting of only three arithmetic operations, but it might lose some randomness on large jobs.
You can learn all the details about Romu generators in the paper. Although this is a formal academic paper, the math in it isn’t overwhelming because it’s at the level of advanced algebra. The math derives equations for probabilities of trouble, such as the snowflake probability described next.
One Snowflake in History
Above, we mentioned that Romu generators are nonlinear. There is a reason people have avoided nonlinear generators in the past: Unlike a linear generator, a nonlinear generator does not have one cycle that covers all possible internal state-values. Instead, it gives you a random permutation of those values. Random permutations are known to consist of multiple cycles of random lengths. In the past, when generators stored their state in one 32-bit variable, these random cycle-lengths would cause users to sometimes encounter cycles that were too short. That problem eliminated nonlinear generators from consideration.
But nowadays, processors can easily handle 64-bit variables, and having two or three such variables in the generator’s state is no problem, which makes encountering a too-short cycle all but impossible. For example, if you gave RomuTrio (with 192 bits of state) the impossibly large job of outputting 262 numbers in one stream, the probability of encountering a too-short cycle is less than that of randomly selecting a specific snowflake among all snowflakes that have fallen in Earth’s history (293) and also randomly selecting the Queen of England out of all of Earth’s population (233). That inconceivably low probability of 2-126 means that you will never encounter a cycle that’s too short.
Some people might insist that a generator have one cycle with a guaranteed period (which eliminates nonlinear generators). That is unwise because they will lose the substantial benefit of higher speed in order to avoid a will-never-happen problem. The chance of randomly selecting one snowflake in Earth’s history and the Queen is nil. Losing a benefit to avoid that nil is unwise.
The snowflake-and-Queen probability tells us that the problem of too-short cycles has been solved, and now we can enjoy the increased speed that nonlinear generators can provide.
The Romu family has unusually good randomness, which is verified by passing the toughest test suites: BigCrush and PractRand. Both are brutal to random number generators, especially PractRand (which is newer). All Romu random number generators using 64-bit arithmetic pass both suites with no signs of stress. RomuDuoJr is the worst such generator, and even it passes easily.
It’s easier to appreciate the improved randomness of Romu generators using dot-plots of consecutive pairs of random numbers. Below, the left plot is from an LCG (linear congruential generator), which is historically the most common. The right plot is from a simple Romu generator. To make their patterns obvious, both generators were scaled down to 10 bits of state and run for their entire periods. The nonlinear Romu is obviously more random than the linear LCG.
There are several reasons why Romu generators are faster than others:
ILP stands for instruction-level parallelism, and most modern processors have this feature. It means that the CPU (core) executes multiple instructions in each clock cycle whenever possible. Intel/AMD CPUs can execute up to four instructions in each clock cycle, so we can think of each clock cycle as having four slots. For example, let’s analyze the source-code of RomuTrio:
The computations in this function will consume only 3 clock cycles in today’s PC CPUs. Here’s what happens in each of those clock cycles:
The multiply consumes 3 cycles, which is why the function consumes 3 cycles. The “Multiplier” column on the right shows when the multiplier is busy. Note that the other calculations are done at the same time as the multiply (in parallel). That is ILP in action. Romu generators are designed to take maximum advantage of this processor feature, making them very fast. But that’s not the only way that ILP helps.
Notice that the function returns the old value of
Capacity is the maximum number of values (or bytes) a generator can output while still maintaining good randomness. For generators with at least 64 bits of state, capacity is difficult to measure because doing so takes too long. The capacity of such a generator is determined in two ways: (1) by running PractRand up to some quantity of data, establishing a lower bound on its capacity, and (2) by running a scaled-down version of the generator until it fails, which provides an estimated capacity of the full-size generator. Scaled-down capacity measurements revealed that RomuQuad, RomuTrio, and RomuDuo maintain their randomness even under the heavy load of the largest jobs, with margin to spare.
Unless we missed one in the literature, it appears that the capacities of only three families of generators have been measured by scaled-down testing: Romu, PCG, and the Blackman-Vigna generators. Blackman has kindly supplied us with capacity estimates of their best generators (found in the paper). Thus, these are the only generator-families that will give you confidence for substantial jobs. If you run a medium-size or larger job with any other kind of generator, you will not know whether you will exceed its capacity.
Huge jobs are divided into several thousand streams of random numbers, all running in parallel for weeks or months. Each stream can consist of up to 250 random numbers, and perhaps more for the longest-running jobs. But what is the chance that one or more streams will overlap another stream? (Overlap means a portion of one stream is identical to another.) For RomuTrio and RomuQuad, the probability of overlap is lower than that of selecting a specific snowflake among all snowflakes in Earth’s history. Sound familiar? Streams will never overlap.
The Romu Generators
Romu generators use 64- or 32-bit arithmetic. The source-code for all of these is here.
Here are the 64-bit generators. All of them were tested in PractRand to 248 bytes, and were characterized using scaled-down versions, so they are known to have ample capacity unless otherwise noted.
Here are the 32-bit generators. Use these only if your processor lacks fast 64-bit instructions.
Seeding. The state variables must be seeded such that at least one bit of state is non-zero. If your seed is a single non-zero variable, you can conveniently seed a Romu generator by setting all of its state variables to that seed as in this example for RomuTrio:
Romu generators are forgiving about seeding. For example, in multiple runs, it’s fine to set one state variable to a run-counter (1,2,3,...) and to set the remaining variables to zero. Romu generators quickly escape from an initial state that’s nearly all zeros, but it doesn’t hurt to call the generator a few times to scramble its state so that the first few values used in the job will be random. However, to guarantee that multiple streams are uncorrelated, it’s prudent to seed the state variables with a different kind of random number generator. We recommend SplitMix64 (by Vigna) and SplitMix32. The source-code of both are in appendices in the paper.
Lagged Fibonacci. Variants of this generator has been around for decades. It uses a fair amount of memory, and its output can be quite random. But its Achille’s heel is its lags: Output values tend to have detectable correlations at those lags. It’s not too slow, but checking pointers or indices for wrap-around and reading/writing array-entries consume time, so it’s not among the fastest generators.
Blackman-Vigna. Blackman and Vigna invented a family of linear generators consisting of a base generator and a post-scrambler to improve randomness. Their randomness is excellent and they are fast, delaying the application by 3-4 clock cycles. They are the strongest competitor against the Romu family. But Romu generators are faster, having lower register pressure and a delay of 0 cycles.
LCG. The Linear Congruential Generator is simple, and has been the most popular generator everywhere for decades. But it has serious problems with its output that are tantamount to poor randomness, mostly in its lower-order bits. Many people realize that we should no longer use this relic. So the LCG was dead, until the PCG brought it back to life.
PCG. The PCG is an LCG with a post-scrambler on it that permutes its outputs. PCG stands for Permuted Congruential Generator. Unlike the LCG, the randomness of the PCG is excellent. The PCG with 64 bits of state is fast, delaying the application by 3 clock cycles. PCG generators have been tested with PractRand with scaled-down models, so their capacities are known. But 64 bits is not quite sufficient for the largest jobs, nor is it enough to ensure nonoverlapping streams unless its jump-function is used, which it seems few users do, So it needs at least 128 bits of state. But that causes the LCG in it to perform a 128-bit multiply and add, which are slow. Even with only 64 bits, the PCG is slower than Romu generators, and with 128 bits, it’s much slower.
Mersenne Twister. Although it’s popular, the MT consumes much memory, is slow, and its randomness is mediocre. It cannot compete against a modern random number generator.
Others. Other generators tend to be either very random or fast, but not both. Romu is the fastest of the few generators that are both very random and fast.
The paper has all the details, including source-code. It gives you the details of how testing and extrapolation were done, how constants were determined, plus the underlying mathematical theory of multi-cycle nonlinear generators, including equations for probabilities. Source-code is also available here.
The Contact link at the upper-right corner will tell you how to contact me both privately and through social media. I welcome thoughtful comments and suggestions.