1.2. Cross-validation: some gotchas

Cross-validation is the ubiquitous test of a machine learning model. Yet many things can go wrong.

1.2.1. Uncertainty of measured accuracy

1.2.1.1. Variations in cross_val_score: simple experiments

The first thing to keep in mind is that the results of a cross-validation are a noisy estimate of the real prediction accuracy.

Let us create a simple artificial dataset:

from sklearn import datasets, discriminant_analysis
import numpy as np
np.random.seed(0)
data, target = datasets.make_blobs(centers=[(0, 0), (0, 1)], n_samples=100)
classifier = discriminant_analysis.LinearDiscriminantAnalysis()

A single cross-validation gives spread-out measures:

from sklearn.model_selection import cross_val_score
print(cross_val_score(classifier, data, target))

Out:

[ 0.64705882  0.67647059  0.84375   ]

What if we try different random shuffles of the data?

from sklearn import utils
for _ in range(10):
    data, target = utils.shuffle(data, target)
    print(cross_val_score(classifier, data, target))

Out:

[ 0.76470588  0.70588235  0.65625   ]
[ 0.70588235  0.67647059  0.75      ]
[ 0.73529412  0.64705882  0.71875   ]
[ 0.70588235  0.58823529  0.8125    ]
[ 0.67647059  0.73529412  0.71875   ]
[ 0.70588235  0.64705882  0.75      ]
[ 0.67647059  0.67647059  0.71875   ]
[ 0.70588235  0.61764706  0.8125    ]
[ 0.76470588  0.76470588  0.59375   ]
[ 0.76470588  0.61764706  0.625     ]

1.2.1.2. A simple probabilistic model

A simple probabilistic model gives the distribution of observed errors: if the true classification rate is p, the number of correct classifications observed on a test set of size n follows a binomial distribution.
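
Concretely, under the simplifying assumption that each test sample is classified correctly with probability p, independently of the others, the number k of correct classifications on a test set of size n follows

    P(k) = \binom{n}{k} \, p^k \, (1 - p)^{n - k}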

from scipy import stats
n = len(data)
distrib = stats.binom(n=n, p=.7)

We can plot it:

from matplotlib import pyplot as plt
plt.figure(figsize=(6, 3))
plt.plot(np.linspace(0, 1, n), distrib.pmf(np.arange(0, n)))
../../_images/sphx_glr_02_cross_validation_001.png

It is wide, because there are not that many samples to measure the error upon: this is a small dataset.

We can look at the interval in which 95% of the observed accuracies lie, for different sample sizes:

for n in [100, 1000, 10000, 100000, 1000000]:
    distrib = stats.binom(n, .7)
    interval = (distrib.isf(.025) - distrib.isf(.975)) / n
    print("Size: {0: 8} | interval: {1}%".format(n, 100 * interval))

Out:

Size:      100 | interval: 18.0%
Size:     1000 | interval: 5.7%
Size:    10000 | interval: 1.8%
Size:   100000 | interval: 0.568%
Size:  1000000 | interval: 0.1796%

Even with 100 000 samples, the 95% interval on the observed classification accuracy is still more than .5% wide.
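
The same binomial machinery also gives a rough idea of the uncertainty on the score of a single small fold. This is a sketch, not part of the original example; the accuracy value below is just an illustrative plug-in estimate:

n_test = len(data) // 3          # approximate size of one test fold in 3-fold cross-validation
observed_accuracy = .7           # illustrative plug-in value, e.g. one of the scores printed above
fold_distrib = stats.binom(n=n_test, p=observed_accuracy)
lower = fold_distrib.isf(.975) / n_test
upper = fold_distrib.isf(.025) / n_test
print("Approximate 95% interval: [{0:.2f}, {1:.2f}]".format(lower, upper))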

Keep in mind that cross-validation is a noisy measure.

1.2.1.3. Empirical distribution of cross-validation scores

We can sample the distribution of scores using cross-validation iterators based on subsampling, such as sklearn.model_selection.ShuffleSplit, with many splits:

from sklearn import model_selection
cv = model_selection.ShuffleSplit(n_splits=200)
scores = cross_val_score(classifier, data, target, cv=cv)
import seaborn as sns
plt.figure(figsize=(6, 3))
sns.distplot(scores)
plt.xlim(0, 1)
../../_images/sphx_glr_02_cross_validation_002.png

The empirical distribution is broader than the theoretical one. This can be explained by the fact that, as we retrain the model on each split, its predictions also fluctuate due to sampling noise in the training data, while the model above only accounts for sampling noise in the test data.
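
To get a feel for this, here is a small sketch that is not part of the original example (it reuses the classifier, data, and ShuffleSplit iterator defined above): keep a single fitted model fixed and score it on the same test folds. The remaining spread then reflects only test-set sampling. Note that fitting on all the data makes the absolute scores optimistic; only the spread is of interest here.

# Fit once on all the data (optimistic in absolute terms, but the spread is the point)
fixed_model = classifier.fit(data, target)
test_only_scores = np.array([fixed_model.score(data[test], target[test])
                             for _, test in cv.split(data, target)])
print("Std of scores, model refit on each split: {0:.3f}".format(scores.std()))
print("Std of scores, model kept fixed:          {0:.3f}".format(test_only_scores.std()))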

The situation does get better with more data:

data, target = datasets.make_blobs(centers=[(0, 0), (0, 1)], n_samples=1000)
scores = cross_val_score(classifier, data, target, cv=cv)
plt.figure(figsize=(6, 3))
sns.distplot(scores)
plt.xlim(0, 1)
plt.title("Distribution with 1000 data points")
../../_images/sphx_glr_02_cross_validation_003.png

The distribution is still quite broad.

Testing the observed scores

Importantly, the standard error of the mean across folds is not a good measure of this error, as the different data folds are not independent. For instance, doing many random splits can reduce the variance arbitrarily, but does not actually provide new data points.

from scipy import stats
plt.figure(figsize=(6, 3))
sns.distplot(scores)
plt.axvline(np.mean(scores), color='k')
plt.axvline(np.mean(scores) + np.std(scores), color='b', label='std')
plt.axvline(np.mean(scores) - np.std(scores), color='b')
plt.axvline(np.mean(scores) + stats.sem(scores), color='r', label='SEM')
plt.axvline(np.mean(scores) - stats.sem(scores), color='r')
plt.legend(loc='best')
plt.xlim(0, 1)
plt.title("Distribution with 1000 data points")
../../_images/sphx_glr_02_cross_validation_004.png
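
A quick numerical sketch of this point (not in the original example; it reuses the classifier and the 1000-point dataset from above): the SEM shrinks roughly as 1 / sqrt(n_splits) as we add random splits, even though no new data is observed, while the standard deviation of the scores stays of the same order.

for n_splits in [10, 100, 1000]:
    many_splits = model_selection.ShuffleSplit(n_splits=n_splits, random_state=0)
    many_scores = cross_val_score(classifier, data, target, cv=many_splits)
    print("n_splits: {0:5d} | std: {1:.3f} | SEM: {2:.3f}".format(
        n_splits, many_scores.std(), stats.sem(many_scores)))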

1.2.1.4. Measuring baselines and chance

Because of class imbalances, or confounding effects, it is easy to get it wrong in terms of what constitutes chance. There are two approaches to measuring the performance of baselines or chance:

DummyClassifier The dummy classifier, sklearn.dummy.DummyClassifier, provides simple baselines with different strategies:

from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy="stratified")
dummy_scores = cross_val_score(dummy, data, target)
print(dummy_scores)

Out:

[ 0.46407186  0.5         0.52409639]

Chance level To measure the actual chance level, the most robust approach is to use permutations: sklearn.model_selection.permutation_test_score(), which is used like cross_val_score:

from sklearn.model_selection import permutation_test_score
score, permuted_scores, p_value = permutation_test_score(classifier,
                                                         data, target)
print("Classifier score: {0},\np value: {1}\nPermutation scores {2}"
      .format(score, p_value, permuted_scores))

Out:

Classifier score: 0.688021547267,
p value: 0.00990099009901
Permutation scores [ 0.50095592 0.51507227 0.52098814 0.48998389 0.4679797 0.50700406
0.51100209 0.52801626 0.52193805 0.51599211 0.48692975 0.5420304
0.48302792 0.49500998 0.51605824 0.49100594 0.48499988 0.51793401
0.51800014 0.50495996 0.53599428 0.50098598 0.51700815 0.48302191
0.51301614 0.53803237 0.51294399 0.50394392 0.46502176 0.48797586
0.49000192 0.50402809 0.49799798 0.51706827 0.4680278 0.49699998
0.52394007 0.50892793 0.52202222 0.49301998 0.4809778 0.51200611
0.48095977 0.49797393 0.51700214 0.49796792 0.52795614 0.50200202
0.47405791 0.49904408 0.51194599 0.49001395 0.50799004 0.49093981
0.50598201 0.48697785 0.53200827 0.49102398 0.48497583 0.50303008
0.51102013 0.46201573 0.5200202 0.50103408 0.52005026 0.481074
0.48202992 0.52198014 0.53108241 0.50101604 0.50400404 0.53400428
0.48401991 0.5320203 0.47397975 0.46098766 0.49798596 0.50091985
0.4769437 0.49303802 0.51002814 0.4929779 0.50993796 0.51700214
0.53998629 0.51301614 0.49906212 0.47192963 0.52991006 0.51099608
0.48801193 0.48895582 0.48404396 0.50301806 0.50304211 0.5189741
0.51494 0.48407402 0.49101796 0.50002405]

We can plot all the scores:

plt.figure(figsize=(6, 3))
sns.distplot(dummy_scores, color="g", label="Dummy scores")
sns.distplot(permuted_scores, color="r", label="Permuted scores")
sns.distplot(scores, label="Cross validation scores")
plt.legend(loc='best')
plt.xlim(0, 1)
../../_images/sphx_glr_02_cross_validation_005.png

Permutations and many cross-validation splits are computationally expensive, but they give trustworthy answers.

1.2.2. Cross-validation with non-iid data

Another common caveat for cross-validation is dependencies between observations that can easily creep in between the train and the test sets. Let us explore these problems in two settings.

1.2.2.1. Stock market: time series

Download: Fetch and load the data:

import pandas as pd
import os
# Python 2 vs Python 3:
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve
symbols = {'TOT': 'Total', 'XOM': 'Exxon', 'CVX': 'Chevron',
           'COP': 'ConocoPhillips', 'VLO': 'Valero Energy'}
quotes = pd.DataFrame()
for symbol, name in symbols.items():
    url = ('https://raw.githubusercontent.com/scikit-learn/examples-data/'
           'master/financial-data/{}.csv')
    filename = "{}.csv".format(symbol)
    if not os.path.exists(filename):
        urlretrieve(url.format(symbol), filename)
    this_quote = pd.read_csv(filename)
    quotes[name] = this_quote['open']

Prediction: Predict ‘Chevron’ from the others

from sklearn import linear_model, model_selection, ensemble
cv = model_selection.ShuffleSplit(random_state=0)
print(cross_val_score(linear_model.RidgeCV(),
                      quotes.drop(columns=['Chevron']),
                      quotes['Chevron'],
                      cv=cv).mean())

Out:

0.255791000942

Is this a robust prediction?

Does it carry over across quarters?

Stratification: To test this, we need to stratify cross-validation using sklearn.model_selection.LeaveOneGroupOut:

quarters = pd.to_datetime(this_quote['date']).dt.to_period('Q')
cv = model_selection.LeaveOneGroupOut()
print(cross_val_score(linear_model.RidgeCV(),
                      quotes.drop(columns=['Chevron']),
                      quotes['Chevron'],
                      cv=cv, groups=quarters).mean())

Out:

-55.0821056887

The problem that we are facing here is the auto-correlation in the data: these datasets are time-series.

quotes_with_dates = pd.concat((quotes, this_quote['date']),
                              axis=1).set_index('date')
quotes_with_dates.plot()
../../_images/sphx_glr_02_cross_validation_006.png

Testing for forecasting: If the goal is to do forecasting, then prediction should be done on the future, for instance using sklearn.model_selection.TimeSeriesSplit.

Can we do forecasting: predict the future?

cv = model_selection.TimeSeriesSplit(n_splits=quarters.nunique())
print(cross_val_score(linear_model.RidgeCV(),
                      quotes.drop(columns=['Chevron']),
                      quotes['Chevron'],
                      cv=cv, groups=quarters).mean())

Out:

-177.720872861

No. This prediction is abysmal.

1.2.2.2. School grades: repeated measures

Let us look at another dependency structure across samples: repeated measures. This is often the case in longitudinal data. Here we are looking at the grades of school students across the years.

Download First we download some data on grades across several schools (centers)

The junior school data, originally from http://www.bristol.ac.uk/cmm/learning/support/datasets/

if not os.path.exists('exams.csv.gz'):
    # Download the file if it is not present
    urlretrieve('https://raw.githubusercontent.com/GaelVaroquaux/interpreting_ml_tuto/blob/master/src/01_how_well/exams.csv.gz',
                'exams.csv.gz')
exams = pd.read_csv('exams.csv.gz')
# Select data for students present all three years
continuing_students = exams.StudentID.value_counts()
continuing_students = continuing_students[continuing_students > 2].index
exams = exams[exams.StudentID.isin(continuing_students)]

Visualization: Grades at the exams depend on socio-economic status, year at school, …

The simplest way to visualize these relationships is with seaborn’s pairplot function.

import seaborn as sns
sns.pairplot(exams.drop(columns=['StudentID']))
../../_images/sphx_glr_02_cross_validation_007.png

A more elaborate plot using density estimation gives better understanding of the dense regions:

g = sns.PairGrid(exams.drop(columns=['StudentID']),
                 diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_upper(plt.scatter, s=2)
g.map_diag(sns.kdeplot, lw=3)
../../_images/sphx_glr_02_cross_validation_008.png

Prediction: Can we predict test grades in maths from demographics (ie, not from other grades)?

# A bit of feature engineering to get a numerical matrix (easily done
# with the ColumnTransformer in scikit-learn >= 0.20)
X = exams.drop(columns=['StudentID', 'Maths', 'Ravens', 'English'])
# Encode gender as an integer variable
X['Gender'] = X['Gender'] == 'Girl'
# One-hot encode social class
X = pd.get_dummies(X, drop_first=True)
y = exams['Maths']
from sklearn import ensemble
print(cross_val_score(ensemble.GradientBoostingRegressor(), X, y,
                      cv=10).mean())

Out:

0.0704221342991

We can predict!
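
For reference, the feature engineering above could also be written with a ColumnTransformer, as mentioned in the code comment. This is a rough sketch assuming scikit-learn >= 0.20; the categorical columns are detected from their dtype rather than named explicitly, since the exact column names are not listed here:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

raw_X = exams.drop(columns=['StudentID', 'Maths', 'Ravens', 'English'])
# Assumption: one-hot encode every string column, pass the rest through
categorical_columns = list(raw_X.columns[raw_X.dtypes == object])
preprocessing = ColumnTransformer(
    [('one_hot', OneHotEncoder(handle_unknown='ignore'), categorical_columns)],
    remainder='passthrough')
model = make_pipeline(preprocessing, ensemble.GradientBoostingRegressor())
print(cross_val_score(model, raw_X, y, cv=10).mean())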

But there is one caveat: are we simply learning to recognize students across the years? There is a lot of implicit information about students, notably in the school ID and the class ID.

Stratification: To test for this, we can make sure that we have different students in the train and the test sets.

from sklearn import model_selection
cv = model_selection.GroupKFold(10)
print(cross_val_score(ensemble.GradientBoostingRegressor(), X, y,
                      cv=cv, groups=exams['StudentID']).mean())

Out:

0.14218996362

It works better!

The classifier generalizes better, probably by learning stronger invariances from the repeated measures on the students.

1.2.2.3. Summary

Samples often have a dependency structure, such as with time series or with repeated measures. To obtain a meaningful measure of prediction error, the link between the train and the test set must match the one that matters for the application. In time-series prediction, the test set must lie in the future. To learn a predictor of the success of an individual from demographics, it might be more relevant to predict across individuals. If the variance across individuals is much larger than the variance across repeated measurements, as in many biomedical applications, the choice of cross-validation strategy can make a huge difference.
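
To make this last point concrete, here is a minimal synthetic sketch (artificial data, not the tutorial's): each subject has a strong individual offset and the subject identifier leaks into the features, so a plain shuffled KFold scores the model on subjects it has already seen, while a GroupKFold over subjects measures generalization to new individuals.

import numpy as np
from sklearn import ensemble, model_selection

rng = np.random.RandomState(0)
n_subjects, n_repeats = 50, 10
subjects = np.repeat(np.arange(n_subjects), n_repeats)
subject_offset = rng.randn(n_subjects)[subjects]   # dominant per-subject effect
x = rng.randn(len(subjects))                       # weak feature of interest
y = subject_offset + .3 * x + .1 * rng.randn(len(subjects))
X = np.c_[subjects, x]                             # the subject ID leaks into X

# Shuffle the rows so that a plain KFold mixes subjects across train and test
permutation = rng.permutation(len(y))
X, y, subjects = X[permutation], y[permutation], subjects[permutation]

model = ensemble.GradientBoostingRegressor()
# Same subjects in train and test folds: optimistic
print(model_selection.cross_val_score(model, X, y, cv=10).mean())
# Unseen subjects in the test folds: the relevant measure for new individuals
print(model_selection.cross_val_score(model, X, y,
                                      cv=model_selection.GroupKFold(10),
                                      groups=subjects).mean())

With such data, only the group-aware split answers the question of how well we predict for a new individual.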
