2. Cross-validation: some gotchas

Cross-validation is the ubiquitous test of a machine learning model. Yet many things can go wrong.

2.1. The uncertainty of measured accuracy

The first thing to keep in mind is that the results of a cross-validation are a noisy estimate of the real prediction accuracy.

Let us create a simple artificial dataset:

from sklearn import datasets, discriminant_analysis
import numpy as np
np.random.seed(0)
data, target = datasets.make_blobs(centers=[(0, 0), (0, 1)])
classifier = discriminant_analysis.LinearDiscriminantAnalysis()

A single cross-validation run gives spread-out measures:

from sklearn.model_selection import cross_val_score
print(cross_val_score(classifier, data, target))

Out:

[ 0.64705882  0.67647059  0.84375   ]

What if we try different random shuffles of the data?

from sklearn import utils
for _ in range(10):
    data, target = utils.shuffle(data, target)
    print(cross_val_score(classifier, data, target))

Out:

[ 0.76470588  0.70588235  0.65625   ]
[ 0.70588235  0.67647059  0.75      ]
[ 0.73529412  0.64705882  0.71875   ]
[ 0.70588235  0.58823529  0.8125    ]
[ 0.67647059  0.73529412  0.71875   ]
[ 0.70588235  0.64705882  0.75      ]
[ 0.67647059  0.67647059  0.71875   ]
[ 0.70588235  0.61764706  0.8125    ]
[ 0.76470588  0.76470588  0.59375   ]
[ 0.76470588  0.61764706  0.625     ]

This should not be surprising: if the classification rate is p, the observed number of correct classifications on a test set of size n follows a binomial distribution.

from scipy import stats
n = len(data)
distrib = stats.binom(n=n, p=.7)

We can plot it:

from matplotlib import pyplot as plt
plt.figure(figsize=(6, 3))
plt.plot(np.linspace(0, 1, n), distrib.pmf(np.arange(0, n)))
[Figure: probability mass function of the binomial distribution of the observed accuracy]

It is wide, because there are not that many samples to measure the error upon: this simulated dataset has only 100 points.

We can look at the interval in which 95% of the observed accuracies lie, for different sample sizes:

for n in [100, 1000, 10000, 100000]:
    distrib = stats.binom(n, .7)
    # Width of the central 95% interval, as a fraction of the sample size
    interval = (distrib.isf(.025) - distrib.isf(.975)) / n
    print("Size: {0: 7} | interval: {1}%".format(n, 100 * interval))

Out:

Size:     100 | interval: 18.0%
Size:    1000 | interval: 5.7%
Size:   10000 | interval: 1.8%
Size:  100000 | interval: 0.568%

Even at 100 000 samples, 5% of the observed classification accuracies still fall more than 0.28% away from the true rate (half the width of the 0.568% interval).
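
As a sanity check, these exact binomial intervals can be compared with the usual normal approximation, whose 95% width is about 2 * 1.96 * sqrt(p * (1 - p) / n). A minimal sketch, reusing the imports above; it closely matches the exact values:

# Normal approximation to the width of the 95% interval of the
# observed accuracy
for n in [100, 1000, 10000, 100000]:
    print("Size: {0: 7} | approx interval: {1:.3}%".format(
        n, 100 * 2 * 1.96 * np.sqrt(.7 * .3 / n)))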

Keep in mind that cross-validation is a noisy measure.

Importantly, the variance across folds is not a good measure of this error, as the different data folds are not independent. For instance, doing many random splits can reduce the variance across folds arbitrarily, but it does not actually provide new data points, as the sketch below shows.
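
To make this concrete, here is a minimal sketch using sklearn.model_selection.ShuffleSplit on the data above: the standard error of the mean fold score shrinks as the number of splits grows, even though the amount of data stays the same.

from sklearn.model_selection import ShuffleSplit
for n_splits in [10, 100, 1000]:
    cv = ShuffleSplit(n_splits=n_splits, test_size=.25, random_state=0)
    scores = cross_val_score(classifier, data, target, cv=cv)
    # The standard error shrinks as 1/sqrt(n_splits), yet the splits
    # keep reusing the same 100 samples
    print("{0:5} splits | mean: {1:.3f} | standard error: {2:.4f}".format(
        n_splits, scores.mean(), scores.std() / np.sqrt(n_splits)))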

2.2. Confounding effects and non-independence

2.3. Measuring baselines and chance

Because of class imbalances or confounding effects, it is easy to get it wrong in terms of what constitutes chance. There are two approaches to measuring the performance of baselines or chance:

The dummy classifier: sklearn.dummy.DummyClassifier, with different strategies to provide simple baselines.

from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy="stratified")
print(cross_val_score(dummy, data, target))

Out:

[ 0.44117647  0.61764706  0.40625   ]
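
Other strategies exist, for instance "most_frequent", which always predicts the majority class and is a useful baseline when classes are imbalanced. A minimal sketch on the same data:

# Always predicts the most common class seen during fit
dummy_majority = DummyClassifier(strategy="most_frequent")
print(cross_val_score(dummy_majority, data, target))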

Chance level: to measure the actual chance level, the most robust approach is to use permutations, via sklearn.model_selection.permutation_test_score(), which is used like cross_val_score:

from sklearn.model_selection import permutation_test_score
score, permuted_scores, p_value = permutation_test_score(classifier, data, target)
print("Classifier score: {0},\np value: {1}\nPermutation scores {2}"
.format(score, p_value, permuted_scores))

Out:

Classifier score: 0.669117647059,
p value: 0.00990099009901
Permutation scores [ 0.54963235 0.47120098 0.51041667 0.59926471 0.45036765 0.44852941
0.52941176 0.59865196 0.47855392 0.39031863 0.45955882 0.56066176
0.60110294 0.38112745 0.45159314 0.46017157 0.58026961 0.57904412
0.59191176 0.58026961 0.51041667 0.53921569 0.41176471 0.37806373
0.62193627 0.52022059 0.41789216 0.50980392 0.4497549 0.59987745
0.47855392 0.53921569 0.44056373 0.60784314 0.47120098 0.47916667
0.48958333 0.5814951 0.50857843 0.43014706 0.53002451 0.48039216
0.48835784 0.43872549 0.43872549 0.59987745 0.45894608 0.40931373
0.52022059 0.45955882 0.44914216 0.52022059 0.55147059 0.47120098
0.49080882 0.49203431 0.47794118 0.5379902 0.62990196 0.51041667
0.46997549 0.44056373 0.56127451 0.60968137 0.47120098 0.54963235
0.54718137 0.56066176 0.47977941 0.42953431 0.43872549 0.38051471
0.43872549 0.41115196 0.48897059 0.48039216 0.60968137 0.60906863
0.57169118 0.52757353 0.5122549 0.52022059 0.46017157 0.62009804
0.47058824 0.59987745 0.55085784 0.46813725 0.53737745 0.54105392
0.48897059 0.51102941 0.48039216 0.50122549 0.47058824 0.59926471
0.49080882 0.4185049 0.47058824 0.49019608]
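
Note that the p value is 1/101 here: none of the 100 permutation scores (the largest is about 0.63) exceeds the classifier score. A minimal sketch visualizing this, reusing the variables above:

plt.figure(figsize=(6, 3))
# Distribution of scores obtained on permuted labels, compared with
# the score on the true labels
plt.hist(permuted_scores, bins=20, label="permutation scores")
plt.axvline(score, color="red", label="classifier score")
plt.legend(loc="upper left")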
