New Python Library to Evaluate AI-generated Data and Compare Models

Called GenAI-Evaluation, the library can be used, for instance, to assess the quality of tabular synthetic data. In that case, it measures how faithfully the synthetization mimics the real data it is derived from, by comparing the full joint empirical distributions (ECDFs) attached to the two datasets. It works with both categorical and numerical features, and returns a value between 0 (best fit) and 1 (worst fit), adjusted for the dimension of the problem.

It comes with two functions:

  • multivariate_ecdf, to compute the multivariate empirical distribution. It generalizes the standard one-dimensional ECDF function available in Python to any dimension.
  • ks_statistic, to compute the goodness of fit, based on the Kolmogorov-Smirnov distance between the two ECDFs: real versus synthetic data.

The library is available here. Install it with pip install genai-evaluation.
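
To illustrate the underlying idea, here is a minimal numpy sketch, not the library's actual implementation (the function names below are hypothetical, and unlike the library, this sketch does not adjust the distance for the dimension): both ECDFs are evaluated at a shared set of query points, and the KS distance is the largest absolute difference between them.

```python
import numpy as np

def multivariate_ecdf_sketch(data, query_points):
    """Evaluate the multivariate ECDF of `data` at each query point:
    ECDF(q) = proportion of rows x with x_j <= q_j in every dimension j."""
    data = np.asarray(data, dtype=float)
    return np.array([np.all(data <= q, axis=1).mean() for q in query_points])

def ks_statistic_sketch(ecdf_a, ecdf_b):
    """Kolmogorov-Smirnov distance: sup of |ECDF_a - ECDF_b| over the queries."""
    return float(np.max(np.abs(np.asarray(ecdf_a) - np.asarray(ecdf_b))))

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 3))
good_synth = rng.normal(size=(2000, 3))         # same distribution: small KS
bad_synth = rng.uniform(-3, 3, size=(2000, 3))  # wrong distribution: larger KS

# Evaluate all ECDFs on a shared set of query points drawn from the real data
queries = real[rng.choice(len(real), size=500, replace=False)]
ecdf_real = multivariate_ecdf_sketch(real, queries)
ks_good = ks_statistic_sketch(ecdf_real, multivariate_ecdf_sketch(good_synth, queries))
ks_bad = ks_statistic_sketch(ecdf_real, multivariate_ecdf_sketch(bad_synth, queries))
# ks_good should come out well below ks_bad
```

The design choice to evaluate both ECDFs at the same query points is what makes the distance tractable in higher dimensions: the sup is taken over a sample of locations rather than over the full combinatorial grid.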

Highlights

  • First implementation of the multivariate Kolmogorov-Smirnov distance in arbitrary dimension, for categorical features, numerical features, or a mix of both.
  • Fast, returning results in a few seconds. The minimum value is zero (best fit), the maximum is one (worst fit), making the metric easy to interpret.
  • Outperforms the evaluation metrics currently implemented by vendors, correctly identifying poor synthetizations even on the very challenging “circle dataset”.
  • Adjusted for the number of features (the dimension), and produces a comparison scatterplot that is easy to interpret regardless of dimension.
  • Also returns the multivariate ECDF (empirical distribution) attached to your datasets, synthetic and real, generalizing the one-dimensional ECDF function available in Python to any dimension.
  • Free and easy to install.
  • Free and easy to install.

Many evaluation metrics poorly capture the subtle interdependencies among the features in your dataset, resulting in numerous false negatives: some output rated as very good when it is in fact really bad. For instance, all the synthetic data vendors that I tested rate their synthetization of the circle dataset (pictured in Figure 1) as excellent, based on poor evaluation metrics. Details are posted here. Clearly, none of these synthetizations deserves that rating. If your evaluation is based on comparing univariate distributions or pairwise feature correlations (all zero in this example), the fit looks great, but you miss all the circular dependencies. The purpose of this new distance is to fix this type of issue.
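
The failure mode is easy to reproduce. The numpy sketch below builds a stand-in for the circle dataset (my own construction, for illustration only): the real data lies on the unit circle, while the fake data has the same marginals but independent features. Correlations are near zero in both cases, so a correlation-based metric sees a perfect fit, yet most fake points are nowhere near the circle.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# "Real" data: points on the unit circle (strong nonlinear dependency)
theta = rng.uniform(0, 2 * np.pi, n)
real = np.column_stack([np.cos(theta), np.sin(theta)])

# "Synthetic" data: same marginals, but the two features are independent,
# so the circular structure is completely lost
fake = np.column_stack([np.cos(rng.uniform(0, 2 * np.pi, n)),
                        np.sin(rng.uniform(0, 2 * np.pi, n))])

# Pairwise correlation is near zero for BOTH datasets: a correlation-based
# metric sees a perfect match and misses the broken dependency
corr_real = np.corrcoef(real.T)[0, 1]
corr_fake = np.corrcoef(fake.T)[0, 1]

# A joint-distribution check exposes the problem immediately:
# fraction of points whose radius is far from 1
real_off_circle = np.mean(np.abs(np.hypot(real[:, 0], real[:, 1]) - 1) > 0.1)
fake_off_circle = np.mean(np.abs(np.hypot(fake[:, 0], fake[:, 1]) - 1) > 0.1)
```

A metric based on the joint ECDF, rather than on marginals or correlations, picks up this discrepancy because the two joint distributions differ everywhere off the circle.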

Figure 1: Synthesizing the circle dataset

The KS distance in itself is not new. It has been known for a long time. However, it was never implemented in moderate or high dimensions, due to its intricacies and its combinatorial complexity in high dimensions. Thus it remained mostly a topic for academic research and theoretical analysis. Now, the GenAI-Evaluation library finally offers a fast implementation tested on practical use cases. Along the way, it solves the problem of computing multivariate quantiles or percentiles, thus generalizing the one-dimensional percentile function available in Python.
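
As a rough illustration of what a multivariate quantile can mean (this is one possible definition, sketched by me for illustration, not necessarily the library's algorithm): pick the observation whose multivariate ECDF value is closest to the target level p. In one dimension, this essentially reduces to the usual empirical percentile.

```python
import numpy as np

def multivariate_quantile_sketch(data, p):
    """Hypothetical generalization of np.percentile: return the observation
    whose multivariate ECDF value is closest to the target level p."""
    data = np.asarray(data, dtype=float)
    # ECDF at each observation: share of rows it dominates in every dimension
    ecdf = np.array([np.all(data <= q, axis=1).mean() for q in data])
    return data[np.argmin(np.abs(ecdf - p))]

rng = np.random.default_rng(1)

# 1-D sanity check: the 0.5-quantile should sit at the empirical median
x = rng.normal(size=(1000, 1))
q50_1d = multivariate_quantile_sketch(x, 0.50)
diff_from_median = abs(q50_1d[0] - np.median(x))

# 2-D case: the returned point has an ECDF value close to the requested level
sample = rng.normal(size=(1000, 2))
q50_2d = multivariate_quantile_sketch(sample, 0.50)
level_2d = np.all(sample <= q50_2d, axis=1).mean()
```

Note that in dimension above one, many points can share nearly the same ECDF level, so a multivariate quantile is a level set rather than a single number; the sketch returns one representative point.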

Figure 2: comparing two multivariate ECDFs, three ways

The multivariate KS and ECDF were first tested on several datasets when designing NoGAN, a tabular data synthesizer described here, running 1000x faster and consistently delivering better results than methods based on deep neural networks. I use the new Python library in NoGAN2, another fast and high-performance synthesizer. This method is based on distribution-free hierarchical Bayesian models and deep resampling. The technical article will be published later this month. In the meantime, sample code can be found here. To not miss this important upcoming release, sign up (for free) to my newsletter, here.

Finally, the multivariate_ecdf function returns the full multivariate empirical distributions. It allows you to compare two high-dimensional ECDFs (with numerical and categorical features) in a single scatterplot, as shown in Figure 2. A perfect fit occurs when all the points lie on the main diagonal.

About the Author

Vincent Granville is a pioneering AI and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Synthetic Data and Generative AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.
