Resampling statistics

Many standard statistical tests assume that the quantities have a normal Gaussian distribution. This is often a fine approximation. Indeed, when the quantities of interest are averages of many components, the Central Limit Theorem (a.k.a. the law of large numbers) guarentees they will be close to a normal distribution.

However, strongly non-normal measurements abound in machine learning, and in my dissertation. Standard tests of significance can be misleading or erroneous.


Fortunately, there are "non-parametric" statistical procedures, so called because they assume nothing about the actual distributions (and therefore have no parameters, like average and standard deviation, etc.). Many of these techniques were known but impractible until the advent of cheap computing power, and therefore were not covered in elementary textbooks.
As in many machine learning works, this dissertation will often need to compare the "performance" of different methods. This work adopts the number of fitness evaluations required to find a "solution" as the natural figure of merit. Since not every run will find a solution, there needs to be a principled way to account for the evaluations expended on unsuccessful runs.

Some works just report the success percentage along with the evaluations per solution of successful runs. This isn't too bad when the methods have similiar success rates. Reporting total evaluations expended divided by the number of solutions found is fine for comparing averages: the expectations of the methods involved.

However, one also needs to know how much to trust the averages (standard deviation, perhaps), and, most importantly, whether the difference between two averages is "significant," or could be due to random chance. Thus, many works also report a statistical test of significance, often Student's t-test.

Whereas the average of many runs tends to be normally distributed, there is no guarentee whatsoever that the performance of each individual run will be so well-behaved. Furthermore, such tests take as input a simple set of numbers: there is no place to input the pass/fail indicator!

Often you can redefine your method to be a "restart" type, wherein it reinitializes to a random state when the failure predicate becomes true, so it is then natural to add the unsuccessful evaluations to those of the following successful run. But if restart is not a fundamental part of your method, why adopt this straight-jacket, when a more appropriate treatment is available?


Resampling offers an easy and principled way out of this conundrum. See the t-test page for discussion, examples, and references.

The only assumption of resampling is that the input samples be unbiased, and that there be enough of them to adaquately approximate the entire universe of possibilities. We use quality random number generators to avoid bias. The sample size requirement is not very difficult: Even with only 20 samples, if they are unbiased there is only about one chance in a million that they are all above or all below the true average.)


Resampling can sample from sets of vectors as well as any other kind of set. Treat each measurement (a run of some genetic algorithm) as a 2-vector consisting of (1) the number of fitness evaluations used, and (2) the pass/fail flag. Given a set of (resampled) vectors computing the evaluations per solution metric is trivial. Any basic text on resampling will explain the exact mechanics of how thousands of such metrics are combined to give confidence intervals, medians, percentiles, etc., with no dependence on the Gaussian distribution.

Lunneborg, C. E. (2000) Random assignment of available cases: Let the inference fit the design.
is an excellent introduction, and supplies this limit on resampling error:
Roughly, if the exact P-value for a test is 0.05, then using a reference set based on 5,000 rerandomizations will yield a P-value that falls between 0.042 and 0.058, 99% of the time. And, if the exact P-value is 0.01, then a 10,000 element reference distribution will give a value between 0.007 and 0.013, again 99% of the time.


Email: mcquesten@gmail.com. Author home page: http://bulldog2.redlands.edu/fac/paul_mcquesten.
Last update: $Id: resample.html 2656 2006-08-27 00:54:58Z paulmcq $