1.3. Underfit vs overfit: do I need more data, or more complex models?

This is adapted from the scikit-learn chapter in the scipy lectures.

1.3.1. A toy problem: fitting polynomials

1.3.1.1. Data generation model

We consider data generated by the following mechanism, as a toy model of housing prices (x plays the role of house size, y of price):

import numpy as np

def generating_func(x, err=1):
    # Noisy observations around the mean curve 10 - 1 / (x + 0.1)
    return np.random.normal(10 - 1. / (x + 0.1), err)
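To get a feel for this mechanism, we can sample it at a few points (a quick illustration; the variable x_demo is ours, not part of the original script):

# Sample the generator at a handful of points; each value scatters
# around the mean curve 10 - 1 / (x + 0.1) with standard deviation err
x_demo = np.linspace(0.1, 1.0, 5)
print(generating_func(x_demo))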

1.3.1.2. Polynomial regression

The crucial hyperparameter is the degree of the polynomial:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
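Since make_pipeline names each step after its lowercased class, the degree can be changed on the assembled pipeline without rebuilding it; a minimal sketch:

# The PolynomialFeatures step is automatically named 'polynomialfeatures',
# so its degree is reachable through the double-underscore syntax
model.set_params(polynomialfeatures__degree=4)
print(model.get_params()['polynomialfeatures__degree'])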

1.3.2. Train error versus test error

Generate a dataset with only a few samples, so that the fits are easy to inspect:

n_samples = 8
np.random.seed(0)
x = 10 ** np.linspace(-2, 0, n_samples)
y = generating_func(x)
# For plotting
x_plot = np.linspace(-0.2, 1.2, 1000)
# Randomly sample a larger test set from the same mechanism
np.random.seed(1)
x_test = np.random.random(200)
y_test = generating_func(x_test)

1.3.2.1. Underfit: high bias

import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.scatter(x, y, marker='x', c='k', s=100)
plt.scatter(x_test, y_test, marker='.', c='k', s=50, alpha=.2)
degree = 1
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x[:, np.newaxis], y)
plt.plot(x_plot, model.predict(x_plot[:, np.newaxis]), '-b')
plt.xlim(-0.2, 1.2)
plt.ylim(0, 12)
plt.xlabel('house size')
plt.ylabel('price')
[Figure: degree-1 fit to the data; house size vs. price]

This model has a low train score (the score here is the R² coefficient of determination, so a low score means a high train error):

print(model.score(x[:, np.newaxis], y))

Out:

0.557186529575

If we apply it to unseen data, the test score is similarly low:

print(model.score(x_test[:, np.newaxis], y_test))

Out:

0.556942968895
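The same comparison can be made with an explicit error metric rather than the R² score; a minimal sketch using mean_squared_error (not part of the original tutorial):

from sklearn.metrics import mean_squared_error

# Explicit train and test errors for the degree-1 model
print(mean_squared_error(y, model.predict(x[:, np.newaxis])))
print(mean_squared_error(y_test, model.predict(x_test[:, np.newaxis])))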

1.3.2.2. Overfit: high variance

plt.figure(figsize=(6, 4))
plt.scatter(x, y, marker='x', c='k', s=100)
plt.scatter(x_test, y_test, marker='.', c='k', s=50, alpha=.2)
degree = 6
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x[:, np.newaxis], y)
plt.plot(x_plot, model.predict(x_plot[:, np.newaxis]), '-b')
plt.xlim(-0.2, 1.2)
plt.ylim(0, 12)
plt.xlabel('house size')
plt.ylabel('price')
[Figure: degree-6 fit to the data; house size vs. price]

This model has a high train score (very low train error):

print(model.score(x[:, np.newaxis], y))

Out:

0.994643291526

If we apply it to unseen data, however, the test score collapses; a negative R² means the model predicts worse than a constant at the mean of y_test:

print(model.score(x_test[:, np.newaxis], y_test))

Out:

-97767.6303449

1.3.3. Validation curve varying model complexity

Fit polynomials of different degrees to a dataset: for too small a degree the model underfits, while for too large a degree it overfits.

1.3.3.1. Generate a larger dataset

np.random.seed(1)
x = np.random.random(200)
y = generating_func(x)
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.4)

Show the training and test sets

plt.figure(figsize=(6, 4))
plt.scatter(x_train, y_train, color='red', label='Training set')
plt.scatter(x_test, y_test, color='blue', label='Test set')
plt.title('The data')
plt.legend(loc='best')
[Figure: training set (red) and test set (blue)]

1.3.3.2. The validation_curve function

from sklearn.model_selection import validation_curve

degrees = np.arange(1, 21)
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# The parameter to vary is the "degree" of the pipeline step
# named "polynomialfeatures"
train_scores, validation_scores = validation_curve(
    model, x[:, np.newaxis], y,
    param_name='polynomialfeatures__degree',
    param_range=degrees)

# Plot the mean train score and validation score across folds
plt.figure(figsize=(6, 4))
plt.plot(degrees, validation_scores.mean(axis=1), lw=2,
         label='cross-validation')
plt.plot(degrees, train_scores.mean(axis=1), lw=2, label='training')
plt.legend(loc='best')
plt.xlabel('degree of fit')
plt.ylabel('explained variance')
plt.title('Validation curve')
plt.tight_layout()
[Figure: validation curve; training and cross-validation explained variance vs. degree of fit]
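The curve can also be read off programmatically; a minimal sketch (best_degree is our name, not part of the original script):

# Degree with the highest mean cross-validation score
best_degree = degrees[np.argmax(validation_scores.mean(axis=1))]
print(best_degree)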

1.3.4. Learning curves: varying the amount of data

Plot the train and cross-validation scores as the number of training samples increases:

from sklearn.model_selection import learning_curve

# A learning curve for degree 1, 5, and 15
for d in [1, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    train_sizes, train_scores, validation_scores = learning_curve(
        model, x[:, np.newaxis], y,
        train_sizes=np.logspace(-1, 0, 20))

    # Plot the mean train score and validation score across folds
    plt.figure(figsize=(6, 4))
    plt.plot(train_sizes, validation_scores.mean(axis=1),
             lw=2, label='cross-validation')
    plt.plot(train_sizes, train_scores.mean(axis=1),
             lw=2, label='training')
    plt.ylim(-0.1, 1)
    plt.legend(loc='best')
    plt.xlabel('number of train samples')
    plt.ylabel('explained variance')
    plt.title('Learning curve (degree=%i)' % d)
    plt.tight_layout()
[Figures: learning curves for degree 1, 5, and 15]
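The gap between the two curves at the largest training size summarizes the diagnosis; a minimal sketch reusing the variables left over from the last loop iteration (degree 15):

# A large gap between train and cross-validation score at the
# largest training size signals overfitting
gap = train_scores.mean(axis=1)[-1] - validation_scores.mean(axis=1)[-1]
print(gap)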

1.3.5. Summary

Comparing train and test error, ideally while varying model complexity and the amount of data, tells us whether the model is:

  • Overfitting, i.e. limited by the data: the model should not be made more complex; regularization becomes important, and ideally more data should be acquired (see the sketch after this list), or
  • Underfitting, i.e. not rich enough for the data available: the model should be made more complex.
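For the overfitting case, one standard remedy mentioned above is regularization: a penalized linear model can replace plain least squares in the same pipeline. A minimal sketch using Ridge (the alpha value is an arbitrary illustration, not a tuned choice):

from sklearn.linear_model import Ridge

# Same degree-6 features as the overfitting example, but with an
# L2 penalty on the coefficients to damp the wild oscillations
regularized = make_pipeline(PolynomialFeatures(degree=6), Ridge(alpha=0.1))
regularized.fit(x_train[:, np.newaxis], y_train)
print(regularized.score(x_test[:, np.newaxis], y_test))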
