3.1.6.6. Test for an education/gender interaction in wages

Wages depend mostly on education. Here we investigate how this dependence is related to gender: not only does gender create an offset in wages, it also seems that wages increase more with education for males than for females.

Does our data support this last hypothesis? We will test this using statsmodels’ formulas (http://statsmodels.sourceforge.net/stable/example_formulas.html).

Load and massage the data

import pandas
import urllib.request
import os

if not os.path.exists('wages.txt'):
    # Download the file if it is not present
    urllib.request.urlretrieve('http://lib.stat.cmu.edu/datasets/CPS_85_Wages',
                               'wages.txt')

# EDUCATION: Number of years of education
# SEX: 1=Female, 0=Male
# WAGE: Wage (dollars per hour)
# sep=None and skipfooter are only supported by the python parsing engine
data = pandas.read_csv('wages.txt', skiprows=27, skipfooter=6, sep=None,
                       header=None, names=['education', 'gender', 'wage'],
                       usecols=[0, 2, 5], engine='python',
                       )

# Convert genders to strings (this is particularly useful so that the
# statsmodels formulas detect that gender is a categorical variable)
import numpy as np
data['gender'] = np.choose(data.gender, ['male', 'female'])

# Log-transform the wages, because they typically increase by
# multiplicative factors
data['wage'] = np.log10(data['wage'])
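
The log transform rests on the idea that wage differences are multiplicative. As a minimal sketch of why this helps a linear model, the following uses made-up numbers (the starting wage and the 10% yearly raise are illustrative assumptions, not values from the dataset): a constant percentage increase per year of education becomes a constant additive step after `np.log10`.

```python
import numpy as np

# Hypothetical wages: 5 dollars/hour, growing by 10% per year of education
years = np.arange(6)
wage = 5.0 * 1.10 ** years

# On the raw scale, the yearly increments keep growing...
raw_steps = np.diff(wage)
# ...while on the log10 scale every increment equals log10(1.10)
log_steps = np.diff(np.log10(wage))

print(raw_steps)
print(log_steps)
```

A straight line is therefore a reasonable model for the log wage even when the raw wage curves upward.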

Simple plotting

import seaborn
# Plot 2 linear fits for male and female.
seaborn.lmplot(y='wage', x='education', hue='gender', data=data)

[Figure: scatter plot of log wage against education, with one linear fit per gender]

Statistical analysis

import statsmodels.formula.api as sm
# Note that this model is not the one plotted above: it is one
# joint model for male and female, not separate models for male and
# female. The reason is that a single model enables statistical testing
result = sm.ols(formula='wage ~ education + gender', data=data).fit()
print(result.summary())

Out:

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.193
Model:                            OLS   Adj. R-squared:                  0.190
Method:                 Least Squares   F-statistic:                     63.42
Date:                Tue, 03 Oct 2017   Prob (F-statistic):           2.01e-25
Time:                        07:34:30   Log-Likelihood:                 86.654
No. Observations:                 534   AIC:                            -167.3
Df Residuals:                     531   BIC:                            -154.5
Df Model:                           2
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.4053      0.046      8.732      0.000       0.314       0.496
gender[T.male]     0.1008      0.018      5.625      0.000       0.066       0.136
education          0.0334      0.003      9.768      0.000       0.027       0.040
==============================================================================
Omnibus:                        4.675   Durbin-Watson:                   1.792
Prob(Omnibus):                  0.097   Jarque-Bera (JB):                4.876
Skew:                          -0.147   Prob(JB):                       0.0873
Kurtosis:                       3.365   Cond. No.                         69.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
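
Because the model is fitted on log10 wages, its coefficients are easier to read once converted back to the wage scale. The sketch below copies the two coefficients from the summary table and turns each into a multiplicative factor on the hourly wage. The variable names are illustrative, and the factors inherit the rounding of the printed table.

```python
# Coefficients copied from the summary table above (log10 wage scale)
coef_education = 0.0334
coef_male = 0.1008

# A coefficient b on log10(wage) corresponds to a factor 10**b on the wage
factor_per_year = 10 ** coef_education
factor_male = 10 ** coef_male

print("wage factor per extra year of education: %.3f" % factor_per_year)
print("wage factor male vs. female at equal education: %.3f" % factor_male)
```

Under this model, one extra year of education is associated with a wage roughly 8% higher, and being male with a wage roughly 26% higher at equal education.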

The plot above suggests that there is not only a different offset in wage but also a different slope.

We need to model this using an interaction term.
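
To make the idea of an interaction concrete, here is a minimal sketch on synthetic data. The coefficients, sample size and noise level are illustrative assumptions, unrelated to the wage dataset, and the fit uses plain NumPy least squares rather than statsmodels. An interaction adds a column equal to the product of the two predictors, which lets the slope with respect to education differ between the two groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
education = rng.uniform(8, 18, size=n)
male = rng.integers(0, 2, size=n)  # 1 = male, 0 = female (reference group)

# Hypothetical ground truth: intercept, male offset, education slope,
# and the extra education slope for males (the interaction)
true_beta = np.array([0.3, 0.2, 0.04, -0.01])

# Design matrix: the last column is the product education * male
X = np.column_stack([np.ones(n), male, education, education * male])
y = X @ true_beta + rng.normal(scale=0.02, size=n)

# Ordinary least squares recovers one coefficient per column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

In the statsmodels fit that follows, the `education:gender[T.male]` row corresponds to this product column.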

result = sm.ols(formula='wage ~ education + gender + education * gender',
                data=data).fit()
print(result.summary())

Out:

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.198
Model:                            OLS   Adj. R-squared:                  0.194
Method:                 Least Squares   F-statistic:                     43.72
Date:                Tue, 03 Oct 2017   Prob (F-statistic):           2.94e-25
Time:                        07:34:30   Log-Likelihood:                 88.503
No. Observations:                 534   AIC:                            -169.0
Df Residuals:                     530   BIC:                            -151.9
Df Model:                           3
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    0.2998      0.072      4.173      0.000       0.159       0.441
gender[T.male]               0.2750      0.093      2.972      0.003       0.093       0.457
education                    0.0415      0.005      7.647      0.000       0.031       0.052
education:gender[T.male]    -0.0134      0.007     -1.919      0.056      -0.027       0.000
==============================================================================
Omnibus:                        4.838   Durbin-Watson:                   1.825
Prob(Omnibus):                  0.089   Jarque-Bera (JB):                5.000
Skew:                          -0.156   Prob(JB):                       0.0821
Kurtosis:                       3.356   Cond. No.                         194.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

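Looking at the p-value of the interaction of gender and education, the data does not support the hypothesis that education benefits males more than females: the interaction is not significant at the 0.05 level (p = 0.056), and its estimated coefficient (-0.0134) is negative.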
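
As a rough reading aid, the interaction model can be unpacked into one fitted line per gender. The sketch below copies the coefficients from the summary table (so it inherits their rounding) and uses illustrative variable names. Female is the reference level, so the `gender[T.male]` and `education:gender[T.male]` terms only apply to males.

```python
# Coefficients copied from the interaction-model summary (log10 wage scale)
intercept = 0.2998
male_offset = 0.2750
education_slope = 0.0415
interaction = -0.0134
interaction_std_err = 0.007

# Reference level (female): base intercept and base slope only
female_line = (intercept, education_slope)
# Males: the gender term shifts the intercept, the interaction shifts the slope
male_line = (intercept + male_offset, education_slope + interaction)

# t statistic = coefficient / standard error; it differs slightly from the
# table's -1.919 because the inputs here are rounded
t_stat = interaction / interaction_std_err

print("female: intercept=%.4f slope=%.4f" % female_line)
print("male:   intercept=%.4f slope=%.4f" % male_line)
print("interaction t statistic: %.2f" % t_stat)
```

The estimated slope is smaller for males than for females, but the difference is within what the standard error allows at the 0.05 level.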

import matplotlib.pyplot as plt
plt.show()
