1.1. Metrics to judge the success of a model¶
Pros & cons of various performance metrics.
The simplest way to use a scoring metric during cross-validation is via the scoring parameter of sklearn.model_selection.cross_val_score().
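For instance, a minimal sketch (the estimator and toy dataset here are arbitrary illustrations, not part of this example):
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# A toy regression problem, only to show the scoring parameter
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
# scoring='neg_mean_absolute_error' overrides the estimator's default score
print(cross_val_score(Ridge(), X, y, scoring='neg_mean_absolute_error'))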
1.1.1. Regression settings¶
1.1.1.1. The Boston housing data¶
from sklearn import datasets
boston = datasets.load_boston()
# Shuffle the data
from sklearn.utils import shuffle
data, target = shuffle(boston.data, boston.target, random_state=0)
A quick plot of how each feature is related to the target
from matplotlib import pyplot as plt
for feature, name in zip(data.T, boston.feature_names):
    plt.figure(figsize=(4, 3))
    plt.scatter(feature, target)
    plt.xlabel(name, size=22)
    plt.ylabel('Price (US$)', size=22)
    plt.tight_layout()
We will be using a random forest regressor to predict the price
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
1.1.1.2. Explained variance vs Mean Square Error¶
The default score is the regressor's R², a measure of explained variance
from sklearn.model_selection import cross_val_score
print(cross_val_score(regressor, data, target))
Out:
[ 0.80185876 0.83501125 0.83122505]
This score is convenient because it has a natural scaling: 1 is perfect prediction, and 0 is around chance
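To see what is being computed behind the scoring, here is a sketch on a single train/test split (reusing the data, target and regressor defined above; the split itself is only illustrative):
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, r2_score

X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=0)
predicted = regressor.fit(X_train, y_train).predict(X_test)
# Explained variance and R2 are very close when the residuals have zero mean
print(explained_variance_score(y_test, predicted))
print(r2_score(y_test, predicted))  # this is what regressor.score reports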
Now let us see which houses are easier to predict:
Not along the Charles river (feature 3)
print(cross_val_score(regressor, data[data[:, 3] == 0],
target[data[:, 3] == 0]))
Out:
[ 0.78068719 0.87449332 0.82714336]
Along the Charles river
print(cross_val_score(regressor, data[data[:, 3] == 1],
target[data[:, 3] == 1]))
Out:
[ 0.47939268 -0.23091137 0.83871587]
So the houses along the Charles are harder to predict?
It is not so easy to conclude this from the explained variance: the variance of the target differs between the two sets of observations, and the explained variance is a relative measure
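To make this concrete, a quick check of how the target spread differs between the two groups (a sketch reusing data and target from above):
import numpy as np

# The explained variance is relative to the spread of the target in each subset
print(np.var(target[data[:, 3] == 0]))  # not along the Charles river
print(np.var(target[data[:, 3] == 1]))  # along the Charles river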
MSE: We can use the mean squared error (here negated)
Not along the Charles river
print(cross_val_score(regressor, data[data[:, 3] == 0],
target[data[:, 3] == 0],
scoring='neg_mean_squared_error'))
Out:
[-14.84768471 -8.53250191 -16.77990573]
Along the Charles river
print(cross_val_score(regressor, data[data[:, 3] == 1],
target[data[:, 3] == 1],
scoring='neg_mean_squared_error'))
Out:
[-80.363225 -62.05679167 -28.57741818]
So the error is larger along the Charles river
1.1.1.3. Mean Squared Error versus Mean Absolute Error¶
What if we want to report an error in dollars, meaningful for an application?
The Mean Absolute Error is useful for this goal
print(cross_val_score(regressor, data, target,
scoring='neg_mean_absolute_error'))
Out:
[-2.51940828 -2.26195266 -2.52642857]
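The Boston target (MEDV) is conventionally expressed in thousands of dollars, so a mean absolute error around 2.5 corresponds to roughly 2500 US$. As a sketch, the same metric computed directly on a single split (reusing data, target and regressor from above):
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=0)
predicted = regressor.fit(X_train, y_train).predict(X_test)
# Multiply by 1000 to report the error in dollars
print("MAE: %.0f US$" % (1000 * mean_absolute_error(y_test, predicted)))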
1.1.1.4. Summary¶
- explained variance: scaled with regard to chance: 1 = perfect, 0 = around chance, but it shouldn't be used to compare predictions across datasets
- mean absolute error: enables comparison across datasets in the units of the target
1.1.2. Classification settings¶
1.1.2.1. The digits data¶
digits = datasets.load_digits()
# Let us try to detect sevens:
sevens = (digits.target == 7)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
1.1.2.2. Accuracy and its shortcomings¶
The default metric is the accuracy: the fraction of correct predictions. It takes values between 0 and 1, where 1 is perfect prediction
print(cross_val_score(classifier, digits.data, sevens))
Out:
[ 0.97333333 0.97662771 0.98160535]
However, a naive classifier can reach seemingly good accuracy with imbalanced classes
from sklearn.dummy import DummyClassifier
most_frequent = DummyClassifier(strategy='most_frequent')
print(cross_val_score(most_frequent, digits.data, sevens))
Out:
[ 0.9 0.89983306 0.90133779]
Balanced accuracy (available in scikit-learn 0.20 and later) corrects for this, but can have surprising behaviors, such as taking negative values in its chance-adjusted variant.
print(cross_val_score(classifier, digits.data, sevens,
scoring='balanced_accuracy'))
Out:
[ 0.85833333 0.93240569 0.93882268]
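As a sketch of what this metric does on a single split (reusing digits, sevens and classifier from above): balanced accuracy is the recall averaged over the classes, so the majority class no longer dominates the score.
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, recall_score

X_train, X_test, y_train, y_test = train_test_split(digits.data, sevens, random_state=0)
predicted = classifier.fit(X_train, y_train).predict(X_test)
# Balanced accuracy is the macro-averaged recall over the two classes
print(balanced_accuracy_score(y_test, predicted))
print(recall_score(y_test, predicted, average='macro'))  # same value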
1.1.2.3. Precision, recall, and their shortcomings¶
We can measure false detections and misses separately
Precision: Precision counts the fraction of detections that are correct
print(cross_val_score(classifier, digits.data, sevens,
scoring='precision'))
Out:
[ 1. 1. 0.98039216]
Our classifier has a good precision: most of the sevens that it predicts are really sevens.
As the most-frequent strategy never predicts sevens, its precision is ill-defined; scikit-learn sets it to zero
print(cross_val_score(most_frequent, digits.data, sevens,
scoring='precision'))
Out:
[ 0. 0. 0.]
Recall: Recall counts the fraction of class 1 actually detected
print(cross_val_score(classifier, digits.data, sevens, scoring='recall'))
Out:
[ 0.73333333 0.8 0.84745763]
Our recall isn’t as good: we miss many sevens
And the most-frequent strategy, which never predicts a seven, detects none of them:
print(cross_val_score(most_frequent, digits.data, sevens, scoring='recall'))
Out:
[ 0. 0. 0.]
Note: Measuring only the precision without the recall makes no sense: it is easy to maximize one at the cost of the other. Ideally, classifiers should be compared at the precision achieved for a given recall, as sketched below
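One way to look at this tradeoff explicitly is the precision-recall curve, sketched here on a single split (reusing digits, sevens and classifier from above):
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X_train, X_test, y_train, y_test = train_test_split(digits.data, sevens, random_state=0)
# Probability of the positive class, used as a varying decision threshold
probas = classifier.fit(X_train, y_train).predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probas)
plt.figure(figsize=(4, 3))
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.tight_layout()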
1.1.2.4. Area under the ROC curve¶
If the classifier provides a decision function that can be thresholded to control false positives versus false negatives, the ROC curve summarizes the different tradeoffs that can be achieved by varying this threshold.
Its Area Under the Curve (AUC) is a useful metric where 1 is perfect prediction and .5 is chance, independently of class imbalance
print(cross_val_score(classifier, digits.data, sevens, scoring='roc_auc'))
Out:
[ 0.98606481 0.99795918 0.99927675]
print(cross_val_score(most_frequent, digits.data, sevens, scoring='roc_auc'))
Out:
[ 0.5 0.5 0.5]
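To visualize the tradeoffs that this number summarizes, a sketch of the ROC curve itself on a single split (reusing digits, sevens and classifier, mirroring the precision-recall sketch above):
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X_train, X_test, y_train, y_test = train_test_split(digits.data, sevens, random_state=0)
probas = classifier.fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probas)
plt.figure(figsize=(4, 3))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.tight_layout()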
1.1.2.4.1. Average precision¶
When the classifier exposes its unthresholded decision, another interesting metric is the precision averaged over all recall levels: the average precision. Compared to the ROC AUC, it behaves more linearly for very rare classes: with very rare classes, small changes in the ROC AUC may correspond to large changes in precision
print(cross_val_score(classifier, digits.data, sevens,
scoring='average_precision'))
Out:
[ 0.93169079 0.98168713 0.98500183]
Naive decisions are no longer at .5
print(cross_val_score(most_frequent, digits.data, sevens,
scoring='average_precision'))
Out:
[ 0.1 0.10016694 0.09866221]
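Indeed, for an uninformative classifier the average precision is close to the prevalence of the positive class. A quick check (reusing sevens from above):
# Fraction of sevens in the dataset, roughly 10%
print(sevens.mean())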
1.1.2.5. Multiclass and multilabel settings¶
To simplify the discussion, we have reduced the problem to detecting sevens, but maybe it is more interesting to predict the digit: a 10-class classification problem
Accuracy: The accuracy is naturally defined in such multiclass settings
print(cross_val_score(classifier, digits.data, digits.target))
Out:
[ 0.87873754 0.91318865 0.90100671]
The most frequent label is no longer a very interesting baseline; we compare to random guessing instead
random_choice = DummyClassifier(strategy='stratified')  # random guesses following the class frequencies
print(cross_val_score(random_choice, digits.data, digits.target))
Out:
[ 0.11295681 0.08013356 0.08892617]
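With ten roughly balanced classes, any such naive strategy scores around one in ten. A quick check of the class proportions (reusing digits):
import numpy as np

# The ten digit classes are roughly balanced, each around 10% of the samples
print(np.bincount(digits.target) / len(digits.target))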
Precision and recall require the notion of a specific class to detect (called the positive class) and are not as easily defined in these settings; for the same reason, the ROC AUC cannot be easily computed.
These notions are however well defined in a multi-label problem. In such a problem, the goal is to assign one or more labels to each instance, as opposed to a multiclass problem, where each instance gets exactly one label. A multiclass problem can be turned into a multilabel one, though the predictions will then be slightly different
from sklearn.preprocessing import LabelBinarizer
digit_labels = LabelBinarizer().fit_transform(digits.target)
print(digit_labels[:15])
Out:
[[1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 1]
[1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]]
The ROC AUC can then be computed for each label, and the mean is reported
print(cross_val_score(classifier, digits.data, digit_labels,
scoring='roc_auc'))
Out:
[ 0.98369166 0.9886015 0.98884622]
as well as the average precision
print(cross_val_score(classifier, digits.data, digit_labels,
scoring='average_precision'))
Out:
[ 0.93498163 0.94880999 0.93500118]
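The averaging can hide differences between labels; as a sketch, the per-label breakdown of the ROC AUC on a single split (reusing digits, digit_labels and classifier from above):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(digits.data, digit_labels,
                                                    random_state=0)
classifier.fit(X_train, y_train)
# For multilabel targets, predict_proba returns one array per label;
# stack the probability of the positive class for each label
probas = np.column_stack([p[:, 1] for p in classifier.predict_proba(X_test)])
print(roc_auc_score(y_test, probas, average=None))  # one AUC per digit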
Note that such measures may not capture the confusion between classes well: multiclass predictions are mutually exclusive, whereas multilabel predictions are not.
1.1.2.6. Summary¶
Class imbalance and the tradeoffs between accepting many misses or many false detections are the things to keep in mind in classification.
In single-class settings, ROC AUC and average precision give nice summaries to compare classifiers when the threshold can be varied. In multiclass settings, this is harder, unless we are willing to consider the problem as multiple single-class problems (one-vs-all).