Sometimes we’ll have a few thoughts, and we’ll write them down over here.

Generating lots of unique random strings

A customer had an interesting problem. They needed a file with 220 million unique strings, each 8 characters long, containing only characters from a specified range, and absolutely no duplicates. One of the requirements was that the distribution be as good as possible, so a shuffled list of 220 million elements was not going to suffice.

The customer had tried to achieve this on their own, but their efforts failed with a solution that was too resource intensive. This required a more creative approach.

A problem such as this immediately brings you to Donald Knuth. His random sampling algorithm seemed like a perfect candidate to tackle this. The basis of the algorithm is, in pseudo-code:

int n = 220000000;
int result[n];
for (int i = max; i > 0; i--)
    const int random = getrandom() % i;

    if (random < n)
      const int candidateValue = max - i;
      result[--n] = candidatevalue;

In other words: go over the entire range of numbers, and randomly decide per element to select it or not. The interesting aspect of this algorithm is that you manipulate random < n in such a way that, by the end, you are guaranteed to have picked n elements.

Next thing to address was the strings. They needed to be eight characters long and the amount of permissible characters was 24. So, there were 24^8 = 110075314176 combinations. That is less than a 64 bit integer can hold, so the aforementioned algorithm was the perfect choice to generate them. And with 8*220,000,000 bytes, it would 'only' take 1.8 GB of RAM. All we had to do was convert the integers to strings with a sort of base-24 encoder.

But we don't stop there. The range 0 through 220,000,000 can easily be cut up into segments, each assignable to a CPU core, thereby making the algorithm multi-threaded. To be safe, when all the data is combined, we do a sort+verify to be sure the range is truly unique, and re-shuffle. The end result is an application that can generate the data within a few minutes: