New GenAI Evaluation Metric, Ultrafast Search, and Perfect Randomness

This article covers three different GenAI topics. First, I introduce one of the best pseudo-random number generators (PRNGs), featuring an infinite period. Then I show how to evaluate the synthesized numbers using the full multivariate empirical distribution (the same KS approach that I used for NoGAN evaluation), this time with ultrafast radix search, a competitor to KNN vector search. KS is the only metric that captures any kind of pattern, in any dimension. Finally, I illustrate how it applies to large language models. In particular, the system is based on words with letters from an arbitrary alphabet, and it can be adapted to any prespecified multivariate distribution. It is very similar to synthesizing DNA sequences, an LLM technique discussed here.

At each step, the focus is on both quality and speed, revisiting old methods or inventing new ones to get solutions that perform significantly better and require much less computing time. The three components of this system are:

New powerful random number system

In its simplest form, the random numbers are the binary digits d_n = x_n mod 2, from the sequence x_{n+1} = 3 (x_n // 2), where the double slash denotes integer division. It is an improvement over the binary digits of quadratic irrationals used previously (see section 4.4 in [12]), in the sense that x_n grows only by a factor 3/2 at each iteration, rather than 2. All sequences (x_n) that do not grow indefinitely necessarily result in periodic numbers. This is the case for all PRNGs on the market.
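To make the recurrence concrete, here is a minimal Python sketch, relying on Python's native big integers; the seed below is a hypothetical choice, not one recommended in the full article, which refines this basic form.

```python
def random_bits(x, n):
    """Return n pseudo-random bits d_k = x_k mod 2,
    from the recurrence x_{k+1} = 3 * (x_k // 2)."""
    bits = []
    for _ in range(n):
        bits.append(x % 2)   # d_k = x_k mod 2
        x = 3 * (x // 2)     # x grows by a factor 3/2 per step
    return bits

# Hypothetical seed, for illustration only
print(random_bits(100_000_003, 32))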

In addition, despite having very long periods, these random generators with finite periods exhibit subtle patterns in rather low dimensions: in short, lack of randomness. They can be quite sensitive to the seed and may require many warm-up iterations before reaching higher randomness. See here how you can crack the Mersenne twister used in the Numpy random function.

The question is this: how slowly can x_n grow while preserving perfect randomness, fast implementation, and an infinite period? Read on to see how I managed to reduce the aforementioned exponential growth down to linear, while keeping an infinite period.

Ultrafast, robust evaluation metrics

The first step is to define what a strongly random sequence is, when it consists of deterministic digits. Details are again in chapter 4 in [12]. The takeaway: you need a metric that captures just that, when testing your system. This is true for all GenAI systems. Indeed, here I am re-using the full multivariate Kolmogorov-Smirnov distance (KS) specifically implemented in the context of synthetic data generation: see section 6.4.2 in [7] for details. There, I showed how poorly implemented metrics used by vendors fail to capture subtle departures from the target distribution.
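For reference, here is a naive, quadratic-time sketch of the multivariate KS distance based on the empirical CDF. It is only an illustration of the definition, not the fast radix implementation discussed in this article.

```python
import numpy as np

def multivariate_ks(a, b):
    """Naive multivariate KS distance between samples a (n, d) and b (m, d):
    the maximum gap between the two empirical CDFs, evaluated at every
    sample point. Quadratic cost; for illustration only."""
    pts = np.vstack([a, b])
    ecdf_a = np.array([np.mean(np.all(a <= p, axis=1)) for p in pts])
    ecdf_b = np.array([np.mean(np.all(b <= p, axis=1)) for p in pts])
    return float(np.max(np.abs(ecdf_a - ecdf_b)))

# Example: two small 3D samples from the same distribution
rng = np.random.default_rng(0)
print(multivariate_ks(rng.normal(size=(500, 3)), rng.normal(size=(500, 3))))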

In this article, I present a very fast implementation of KS. I also include a few other tests. Very large test batteries exist, for instance Diehard. However, most rely on old statistical practice, offering a large number of disparate, weak tests rather than a centralized approach to the problem. You can do a lot better with far fewer tests. This is one of the goals of this project, with a particular focus on hard-to-detect patterns.

Also note that the KS distance relies on the CDF (cumulative distribution function) rather than the PDF (probability density function). The latter, used in many tests such as Chi-squared, does not work when you have billions of cross-feature buckets in high dimensions, each with very few observations. This is what we face in many GenAI systems. To give you an idea, think about counting occurrences of billions of “words” such as

321023201031022303412310332310300311023102

in a sequence of trillions of digits in base 4 (in this case, the alphabet has 4 letters). Most counts will be zero. Likewise, the base (that is, the size of the alphabet) may be a very large integer. The KS distance handles this problem transparently by looking at the closest strings found in the digit sequences, which themselves occur only once most of the time. It also easily takes care of conditional probabilities when needed.

My previous KS implementation involved thousands of Pandas SQL queries spanning many features. The new version discussed here is based on the radix numeration system, turning long strings into big integers (called blocks) and allowing fast retrieval with simple binary search in a sorted list of big numbers. In this context, a block can have many digits: the k-th feature is the k-th digit, although blocks may have a varying number of digits. I implicitly rely on Python's native bignum arithmetic to handle the computations. Finally, I improve on standard binary search with what I call weighted binary search, accelerating the computations by a factor of 3 or 4 in the examples tested.
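The sketch below illustrates the idea under my own naming conventions: radix encoding of words into blocks, then an interpolation-style "weighted" search in the sorted block list. It is a simplified stand-in for the actual code in the full article, not a copy of it.

```python
import bisect

def encode(word, base):
    """Radix encoding: turn a list of digits (one per feature)
    into a single big integer, the 'block'."""
    block = 0
    for d in word:
        block = block * base + d
    return block

def weighted_search(blocks, q):
    """'Weighted' binary search in a sorted list: pick the split point
    proportionally to where q falls between the endpoints, instead of
    always at the midpoint (interpolation-search style)."""
    lo, hi = 0, len(blocks) - 1
    while lo < hi:
        span = blocks[hi] - blocks[lo]
        if span == 0:
            break
        mid = lo + int((hi - lo) * (q - blocks[lo]) / span)
        mid = min(max(mid, lo), hi - 1)  # keep mid in [lo, hi-1] to guarantee progress
        if blocks[mid] < q:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Example: locate the closest block to a query word, in base 4
blocks = sorted(encode(w, 4) for w in [[0, 1, 2], [3, 2, 1], [1, 1, 1], [2, 0, 3]])
q = encode([1, 2, 0], 4)
i = weighted_search(blocks, q)
nearest = min((blocks[j] for j in (i - 1, i) if 0 <= j < len(blocks)),
              key=lambda b: abs(b - q))
print(nearest)
```

Plain binary search (the bisect module) would work just as well here; the weighted variant only pays off when the blocks are roughly uniformly spread, which is what you expect from a good generator.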

Connection to LLM

The problem is strikingly similar to the DNA sequence synthesis discussed in section 7.1, where the alphabet has four letters (A, C, G, T) and the words consist of DNA subsequences. The main difference is that DNA sequences are far from random. Yet, the methodology presented here can easily be adapted to arbitrary target distributions, in particular empirical distributions such as those associated with DNA sequencing, or keyword distributions in ordinary text.
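As a hypothetical illustration of that adaptation, the snippet below maps the DNA alphabet to base-4 digits, so that each DNA word becomes a block handled exactly like the radix blocks above; the example word is made up.

```python
# Map the 4-letter DNA alphabet to base-4 digits, then reuse the same
# radix encoding: each DNA word becomes one big-integer block.
DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

def dna_block(word):
    block = 0
    for letter in word:
        block = 4 * block + DNA[letter]
    return block

print(dna_block("GATTACA"))  # hypothetical example word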

Figure: comparing 4 versions of my PRNG, with Numpy.

Download the full article and Python code

Download the full article on GitHub, here. No sign-up required. It includes the code, also available in the same folder on GitHub. This article is part of my upcoming book “Practical AI & Machine Learning Projects”, to be published here. Links are clickable only in the book. You may request a free copy of the book (126 pages so far, not yet finished) if you purchased any other book in my e-Store.

To not miss future articles and access members-only content, sign up for my free newsletter, here.

Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and patent owner (one patent related to LLM). Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.

Vincent is also a former post-doc at Cambridge University and at the National Institute of Statistical Sciences (NISS). He has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math, and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise-grade projects to participants.
