Many standard statistical tests assume that the quantities have a normal (Gaussian) distribution. This is often a fine approximation. Indeed, when the quantities of interest are averages of many independent components, the Central Limit Theorem guarantees they will be close to a normal distribution.
However, strongly non-normal measurements abound in machine learning, and in my dissertation. For such data, standard tests of significance can be misleading or erroneous.
Some works just report the success percentage along with the evaluations per solution of successful runs. This isn't too bad when the methods have similar success rates. Reporting total evaluations expended (over successful and failed runs alike) divided by the number of solutions found is fine for comparing averages: the expectations of the methods involved.
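As a minimal sketch of the "total evaluations divided by solutions found" figure, assuming each run is recorded as a hypothetical (evaluations, succeeded) pair (the record layout and function name are illustrative, not taken from any particular work):

```python
def evaluations_per_solution(runs):
    """Total evaluations over ALL runs divided by the number of successes.

    `runs` is a list of (evaluations, succeeded) pairs, one per run.
    Failed runs still contribute their evaluations to the numerator.
    """
    total_evals = sum(evals for evals, _ in runs)
    solutions = sum(1 for _, ok in runs if ok)
    if solutions == 0:
        # No solution was ever found, so the cost per solution is unbounded.
        return float("inf")
    return total_evals / solutions


# Four illustrative runs: three succeed, one fails after 300 evaluations.
runs = [(120, True), (300, False), (90, True), (150, True)]
print(evaluations_per_solution(runs))  # 660 / 3 = 220.0
```

Note that averaging only the successful runs would give 120 here, hiding the cost of the failure.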
However, one also needs to know how much to trust the averages (standard deviation, perhaps), and, most importantly, whether the difference between two averages is "significant," or could be due to random chance. Thus, many works also report a statistical test of significance, often Student's t-test.
Whereas the average of many runs tends to be normally distributed, there is no guarantee whatsoever that the performance of each individual run will be so well-behaved. Furthermore, such tests take as input a simple set of numbers: there is no place to input the pass/fail indicator!
Often you can redefine your method to be a "restart" type, wherein it reinitializes to a random state when the failure predicate becomes true, so it is then natural to add the unsuccessful evaluations to those of the following successful run. But if restart is not a fundamental part of your method, why adopt this straitjacket when a more appropriate treatment is available?
The only assumptions of resampling are that the input samples be unbiased, and that there be enough of them to adequately approximate the entire universe of possibilities. We use quality random number generators to avoid bias. The sample size requirement is not very difficult: even with only 20 samples, if they are unbiased there is only about one chance in a million (0.5^20) that they all fall above the true median, and the same chance that they all fall below it.
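One common form of resampling is a permutation (rerandomization) test. The following is a minimal sketch under stated assumptions, not necessarily the procedure used here: each run is a hypothetical (evaluations, succeeded) pair, the test statistic is the absolute difference in evaluations per solution, and the function names and defaults are illustrative. Unlike a t-test, it has a natural place for the pass/fail indicator.

```python
import random


def evals_per_solution(runs):
    """Total evaluations divided by number of successful runs."""
    total = sum(evals for evals, _ in runs)
    solved = sum(1 for _, ok in runs if ok)
    return total / solved if solved else float("inf")


def abs_difference(runs_a, runs_b):
    """Absolute difference of the two groups' evaluations per solution."""
    x, y = evals_per_solution(runs_a), evals_per_solution(runs_b)
    if x == y:  # also covers the case where neither group has a success
        return 0.0
    return abs(x - y)


def permutation_p_value(runs_a, runs_b, n_resamples=5000, seed=0):
    """Estimate a two-sided P-value by randomly reassigning runs to groups."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    observed = abs_difference(runs_a, runs_b)
    pooled = list(runs_a) + list(runs_b)
    n_a = len(runs_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        # Count relabelings at least as extreme as the observed difference.
        if abs_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # The +1 terms count the observed assignment itself in the reference set.
    return (hits + 1) / (n_resamples + 1)


# Illustrative data: method B fails more often and uses more evaluations.
method_a = [(100, True), (120, True), (90, True), (110, True), (400, False)]
method_b = [(300, True), (500, False), (350, True), (500, False), (320, True)]
print(permutation_p_value(method_a, method_b))
```

A small P-value says that few random relabelings of the runs produce a difference as large as the one observed, so the difference is unlikely to be due to chance alone.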
Lunneborg, C. E. (2000), "Random assignment of available cases: Let the inference fit the design," is an excellent introduction, and supplies this limit on resampling error:
Roughly, if the exact P-value for a test is 0.05, then using a
reference set based on 5,000 rerandomizations will yield a P-value that
falls between 0.042 and 0.058, 99% of the time. And, if the exact
P-value is 0.01, then a 10,000 element reference distribution will
give a value between 0.007 and 0.013, again 99% of the time.
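The quoted ranges can be reproduced with a short calculation. This is a sketch that assumes each rerandomization independently lands in the tail with probability equal to the exact P-value, and that uses the normal approximation to the binomial; the quotation does not say how its figures were derived, so this derivation is an assumption. The function name is illustrative, and 2.576 is the two-sided 99% normal quantile.

```python
import math


def p_value_interval(p_exact, n_resamples, z=2.576):
    """Approximate 99% range of a resampled P-value estimate.

    The estimate is a binomial proportion, so its standard error is
    sqrt(p * (1 - p) / n); z = 2.576 spans 99% of a normal distribution.
    """
    se = math.sqrt(p_exact * (1.0 - p_exact) / n_resamples)
    return p_exact - z * se, p_exact + z * se


for p, n in [(0.05, 5000), (0.01, 10000)]:
    lo, hi = p_value_interval(p, n)
    print(f"exact P = {p}, {n} resamples: {lo:.3f} to {hi:.3f}")
```

Because the standard error shrinks with the square root of the number of resamples, halving the width of the range requires four times as many rerandomizations.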