For the first exercise, we will use a table that is already present in the R environment about the occupational status of sons (labeled destination) and the occupational status of their fathers (labeled origin). You can find the table in an R object called occupationalStatus.

Analyze the contingency table where the dependent variable is destination and the independent variable is origin (of occupational status). Use chisq.test.
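A minimal sketch of this step; occupationalStatus ships with base R's datasets package, so no data file is needed:

```r
# Print the built-in 8x8 origin-by-destination table and run
# Pearson's chi-squared test of independence on it
occupationalStatus
chisq.test(occupationalStatus)
```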
## destination
## origin 1 2 3 4 5 6 7 8
## 1 50 19 26 8 7 11 6 2
## 2 16 40 34 18 11 20 8 3
## 3 12 35 65 66 35 88 23 21
## 4 11 20 58 110 40 183 64 32
## 5 2 8 12 23 25 46 28 12
## 6 12 28 102 162 90 554 230 177
## 7 0 6 19 40 21 158 143 71
## 8 0 3 14 32 15 126 91 106
## Warning in chisq.test(occupationalStatus): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: occupationalStatus
## X-squared = 1416, df = 49, p-value < 2.2e-16
The warning is given because the table violates an assumption of the chi-squared test. Get the expected frequencies from your solution above: look under "Value" in the help page of chisq.test; one of the outputs of this function is the expected frequencies. What assumption is violated here? How many times?
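One way to extract the expected frequencies and count the violations; the conventional rule of thumb is that every expected cell count should be at least 5:

```r
# Expected frequencies under independence are stored in the test object
expected <- chisq.test(occupationalStatus)$expected
table(expected < 5)   # TRUE counts the cells that violate the rule of thumb
```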
##
## FALSE TRUE
## 61 3
Collapse columns so that the assumption is no longer violated. How does the X-squared statistic change? One way would be to first collapse only the first two columns, so that the cells with few observations are gone, and then collapse the first two rows so that the table remains square:
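A sketch of this collapsing, merging columns 1 and 2 and then rows 1 and 2 so that the table stays square:

```r
# Merge the first two columns of the 8x8 table
tmp <- cbind(occupationalStatus[, 1] + occupationalStatus[, 2],
             occupationalStatus[, 3:8])
# Merge the first two rows of the result, giving a 7x7 table
occupationalStatus2 <- rbind(tmp[1, ] + tmp[2, ], tmp[3:8, ])
occupationalStatus2
chisq.test(occupationalStatus2)
```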
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 125 60 26 18 31 14 5
## [2,] 47 65 66 35 88 23 21
## [3,] 31 58 110 40 183 64 32
## [4,] 10 12 23 25 46 28 12
## [5,] 40 102 162 90 554 230 177
## [6,] 6 19 40 21 158 143 71
## [7,] 3 14 32 15 126 91 106
##
## Pearson's Chi-squared test
##
## data: occupationalStatus2
## X-squared = 1138.5, df = 36, p-value < 2.2e-16
##
## FALSE
## 49
For the exercises on logit regression, you can download the survey data file. Read in the data using the read.table function and assign the data frame a name. The variable names are on the top line, so use header=TRUE.
This dataset contains:
First, have a look at the dataset using head and summary. Explain the passing rate of the IMC by warning: 1 means they did receive a warning, 0 that they didn't. Calculate the odds ratio. Compare the glm coefficient with the odds ratio.
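A possible approach, assuming the data frame is named imc and the variables are IMCcorrect and warning (as in the glm call shown in the output):

```r
head(imc)
summary(imc)

# Logit model of passing the IMC on receiving a warning
fit <- glm(IMCcorrect ~ warning, family = binomial, data = imc)
summary(fit)

# Cross-table (warning in rows, IMCcorrect in columns) and the odds ratio by hand
tab <- table(imc$warning, imc$IMCcorrect)
tab
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
log(or)   # equals the glm coefficient for warning
```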
##
## Call:
## glm(formula = IMCcorrect ~ warning, family = binomial, data = imc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8722 0.6170 0.6170 0.7876 0.7876
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0116 0.2611 3.874 0.000107 ***
## warning 0.5506 0.4015 1.371 0.170318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 158.07 on 149 degrees of freedom
## Residual deviance: 156.16 on 148 degrees of freedom
## AIC: 160.16
##
## Number of Fisher Scoring iterations: 4
##
## 0 1
## 0 20 55
## 1 13 62
## [1] 0.5505841
Calculate the effect of warning on IMCcorrect separately for men and women, and the interaction effect, which is the ratio of the two odds ratios. Tip: first create two tables, one for males and one for females. Note that sex is coded 1 for men and 2 for women.
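A sketch, assuming the variables in imc are called sex, warning, and IMCcorrect:

```r
men   <- imc[imc$sex == 1, ]
women <- imc[imc$sex == 2, ]

tabM <- table(men$warning,   men$IMCcorrect)
tabW <- table(women$warning, women$IMCcorrect)

orM <- (tabM[1, 1] * tabM[2, 2]) / (tabM[1, 2] * tabM[2, 1])
orW <- (tabW[1, 1] * tabW[2, 2]) / (tabW[1, 2] * tabW[2, 1])
log(orM)    # log odds ratio for men
log(orW)    # log odds ratio for women
orW / orM   # interaction effect: ratio of the two odds ratios
```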
Men:
##
## 0 1
## 0 11 31
## 1 5 31
## [1] 0.7884574
Women:
##
## 0 1
## 0 9 24
## 1 8 31
## [1] 0.3737164
And calculate the interaction effect:
## [1] 0.6605114
Check with the glm function, including the interaction effect between warning and sex. Note that a logit coefficient needs to be transformed to an odds ratio by using exp().
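The same model fitted with glm (fit2 is just an illustrative name):

```r
# sex enters as a factor, so 1 (men) becomes the reference category
fit2 <- glm(IMCcorrect ~ warning * factor(sex), family = binomial, data = imc)
summary(fit2)
```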
##
## Call:
## glm(formula = IMCcorrect ~ warning * factor(sex), family = binomial,
## data = imc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9870 0.5469 0.6776 0.7793 0.7981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03609 0.35095 2.952 0.00315 **
## warning 0.78846 0.59618 1.323 0.18599
## factor(sex)2 -0.05526 0.52530 -0.105 0.91622
## warning:factor(sex)2 -0.41474 0.81576 -0.508 0.61116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 158.07 on 149 degrees of freedom
## Residual deviance: 155.57 on 146 degrees of freedom
## AIC: 163.57
##
## Number of Fisher Scoring iterations: 4
And transform the coefficients in the model with the interaction effect to odds ratios:
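A sketch of the transformation, with the interaction model refitted under the illustrative name fit2:

```r
fit2 <- glm(IMCcorrect ~ warning * factor(sex), family = binomial, data = imc)

exp(coef(fit2)["warning"])                   # odds ratio of warning for men
exp(coef(fit2)["warning"] +
    coef(fit2)["warning:factor(sex)2"])      # odds ratio of warning for women
exp(coef(fit2)["warning:factor(sex)2"])      # ratio of the two odds ratios
```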
## [1] 2.200094
## [1] 1.453246
## [1] 0.6605384
Fit two models, one with more independent variables than the other (i.e. nested models). For instance, use warning and the interval variable timeSeconds to explain the passing rate of the IMC (model 1). Then test whether timeSeconds influences the effect of warning on successfully passing the IMC by adding an interaction term (model 2).
##
## Call:
## glm(formula = IMCcorrect ~ warning * timeSeconds, family = binomial,
## data = survey)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8673 0.6204 0.6219 0.7716 0.9837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1622749 0.3788804 3.068 0.00216 **
## warning 0.3792952 0.4875195 0.778 0.43656
## timeSeconds -0.0002866 0.0005111 -0.561 0.57491
## warning:timeSeconds 0.0002981 0.0005152 0.579 0.56289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 158.07 on 149 degrees of freedom
## Residual deviance: 155.49 on 146 degrees of freedom
## AIC: 163.49
##
## Number of Fisher Scoring iterations: 8
Test whether the larger model fits better than the model with fewer parameters. Use pchisq and check your result with anova.
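The likelihood-ratio test by hand, assuming the two nested models were stored as m1 and m2:

```r
# The difference in deviance is chi-squared distributed, with df equal to
# the difference in the number of parameters between the two models
dev.diff <- deviance(m1) - deviance(m2)
df.diff  <- df.residual(m1) - df.residual(m2)
pchisq(dev.diff, df = df.diff, lower.tail = FALSE)

anova(m1, m2, test = "Chisq")   # should give the same p-value
```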
## [1] 0.5698695
## Analysis of Deviance Table
##
## Model 1: IMCcorrect ~ warning + timeSeconds
## Model 2: IMCcorrect ~ warning * timeSeconds
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 147 155.81
## 2 146 155.49 1 0.3229 0.5699
Estimate McFadden's R² using deviance and null.deviance. Use the formula in the slides. Make sure you compare the same cases across models, so that the null model uses the same cases! The R² of the first model is listed first and that of the second model second.
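McFadden's pseudo-R² is 1 minus the ratio of the residual deviance to the null deviance (m1 and m2 are the assumed names of the two models fitted above):

```r
# McFadden's R^2 = 1 - (residual deviance / null deviance)
1 - m1$deviance / m1$null.deviance   # model 1
1 - m2$deviance / m2$null.deviance   # model 2
```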
## [1] 0.01432478
## [1] 0.01636752
Check whether null.deviance corresponds to the intercept-only model. Compare your results with the models estimated above. Make sure that the intercept-only model has the same number of cases as the model you estimated before. For a binary dependent variable, the null model is always the intercept-only model.
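A quick check, assuming the data frame survey and the model m1 from above:

```r
# Fit the intercept-only model on the same cases
m0 <- glm(IMCcorrect ~ 1, family = binomial, data = survey)
deviance(m0)       # should equal m1$null.deviance
m1$null.deviance
```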
## [1] 158.0724