For the first exercise, we will use a table that is already present in the R environment about the occupational status of sons (labeled destination) and the occupational status of their fathers (labeled origin). You can find the table in an R object called occupationalStatus.

Analyze the contingency table where the dependent variable is destination and the independent variable is origin (of occupational status). Use chisq.test.
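A minimal sketch of this step; occupationalStatus ships with base R's datasets package, so no data file is needed:

```r
# Print the built-in 8x8 origin-by-destination table and run
# Pearson's chi-squared test of independence on it
occupationalStatus
chisq.test(occupationalStatus)
```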
## destination
## origin 1 2 3 4 5 6 7 8
## 1 50 19 26 8 7 11 6 2
## 2 16 40 34 18 11 20 8 3
## 3 12 35 65 66 35 88 23 21
## 4 11 20 58 110 40 183 64 32
## 5 2 8 12 23 25 46 28 12
## 6 12 28 102 162 90 554 230 177
## 7 0 6 19 40 21 158 143 71
## 8 0 3 14 32 15 126 91 106
## Warning in chisq.test(occupationalStatus): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: occupationalStatus
## X-squared = 1416, df = 49, p-value < 2.2e-16
The warning is given because the table violates an assumption of the chi-squared test. Get the expected frequencies from your solution above: look under "Value" in the help page of chisq.test; one of the outputs of this function is the expected frequencies. What assumption is violated here? How many times?
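One way to extract the expected frequencies and count the violations; the conventional rule of thumb is that every expected cell count should be at least 5:

```r
# Expected frequencies under independence are stored in the test object
expected <- chisq.test(occupationalStatus)$expected
table(expected < 5)   # TRUE counts the cells that violate the rule of thumb
```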
##
## FALSE TRUE
## 61 3
Collapse columns so that the assumption is no longer violated. How does the X-squared statistic change? One way would be to first collapse only the first two columns, so that the cells with few observations are gone, and then collapse the first two rows so that the table remains square:
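A sketch of this collapsing, merging columns 1 and 2 and then rows 1 and 2 so that the table stays square:

```r
# Merge the first two columns of the 8x8 table
tmp <- cbind(occupationalStatus[, 1] + occupationalStatus[, 2],
             occupationalStatus[, 3:8])
# Merge the first two rows of the result, giving a 7x7 table
occupationalStatus2 <- rbind(tmp[1, ] + tmp[2, ], tmp[3:8, ])
occupationalStatus2
chisq.test(occupationalStatus2)
```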
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 125 60 26 18 31 14 5
## [2,] 47 65 66 35 88 23 21
## [3,] 31 58 110 40 183 64 32
## [4,] 10 12 23 25 46 28 12
## [5,] 40 102 162 90 554 230 177
## [6,] 6 19 40 21 158 143 71
## [7,] 3 14 32 15 126 91 106
##
## Pearson's Chi-squared test
##
## data: occupationalStatus2
## X-squared = 1138.5, df = 36, p-value < 2.2e-16
##
## FALSE
## 49
For the exercises on logit regression, you can download the survey data file. Read in the data using the read.table function and assign the data frame a name. The variable names are on the top line, so use header=TRUE.
This dataset contains:
First, have a look at the dataset using head and summary. Explain the passing rate of the IMC by warning: 1 means they did receive a warning, 0 that they didn't. Calculate the odds ratio. Compare the glm coefficient with the odds ratio.
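A possible approach, assuming the data frame is named imc and the variables are IMCcorrect and warning (as in the glm call shown in the output):

```r
head(imc)
summary(imc)

# Logit model of passing the IMC on receiving a warning
fit <- glm(IMCcorrect ~ warning, family = binomial, data = imc)
summary(fit)

# Cross-table (warning in rows, IMCcorrect in columns) and the odds ratio by hand
tab <- table(imc$warning, imc$IMCcorrect)
tab
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
log(or)   # equals the glm coefficient for warning
```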
##
## Call:
## glm(formula = IMCcorrect ~ warning, family = binomial, data = imc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8722 0.6170 0.6170 0.7876 0.7876
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0116 0.2611 3.874 0.000107 ***
## warning 0.5506 0.4015 1.371 0.170318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 158.07 on 149 degrees of freedom
## Residual deviance: 156.16 on 148 degrees of freedom
## AIC: 160.16
##
## Number of Fisher Scoring iterations: 4
##
## 0 1
## 0 20 55
## 1 13 62
## [1] 0.5505841
Calculate the effect of warning on IMCcorrect separately for men and women, and the interaction effect, which is the ratio of the two odds ratios. Tip: first create two tables, one for males and one for females. Note that sex is coded 1 for men and 2 for women.
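A sketch, assuming the variables in imc are called sex, warning, and IMCcorrect:

```r
men   <- imc[imc$sex == 1, ]
women <- imc[imc$sex == 2, ]

tabM <- table(men$warning,   men$IMCcorrect)
tabW <- table(women$warning, women$IMCcorrect)

orM <- (tabM[1, 1] * tabM[2, 2]) / (tabM[1, 2] * tabM[2, 1])
orW <- (tabW[1, 1] * tabW[2, 2]) / (tabW[1, 2] * tabW[2, 1])
log(orM)    # log odds ratio for men
log(orW)    # log odds ratio for women
orW / orM   # interaction effect: ratio of the two odds ratios
```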
Men:
##
## 0 1
## 0 11 31
## 1 5 31
## [1] 0.7884574
Women:
##
## 0 1
## 0 9 24
## 1 8 31
## [1] 0.3737164
And calculate the interaction effect:
## [1] 0.6605114
Check with the glm function, including the interaction effect between warning and sex. Note that a logit coefficient needs to be transformed to an odds ratio by using exp().
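The same model fitted with glm (fit2 is just an illustrative name):

```r
# sex enters as a factor, so 1 (men) becomes the reference category
fit2 <- glm(IMCcorrect ~ warning * factor(sex), family = binomial, data = imc)
summary(fit2)
```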
##
## Call:
## glm(formula = IMCcorrect ~ warning * factor(sex), family = binomial,
## data = imc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9870 0.5469 0.6776 0.7793 0.7981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03609 0.35095 2.952 0.00315 **
## warning 0.78846 0.59618 1.323 0.18599
## factor(sex)2 -0.05526 0.52530 -0.105 0.91622
## warning:factor(sex)2 -0.41474 0.81576 -0.508 0.61116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 158.07 on 149 degrees of freedom
## Residual deviance: 155.57 on 146 degrees of freedom
## AIC: 163.57
##
## Number of Fisher Scoring iterations: 4
And transform the coefficients in the model with the interaction effect to odds ratios:
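A sketch of the transformation, with the interaction model refitted under the illustrative name fit2:

```r
fit2 <- glm(IMCcorrect ~ warning * factor(sex), family = binomial, data = imc)

exp(coef(fit2)["warning"])                   # odds ratio of warning for men
exp(coef(fit2)["warning"] +
    coef(fit2)["warning:factor(sex)2"])      # odds ratio of warning for women
exp(coef(fit2)["warning:factor(sex)2"])      # ratio of the two odds ratios
```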
## [1] 2.200094
## [1] 1.453246
## [1] 0.6605384
Fit two models, one with more independent variables than the other (i.e. nested models). For instance, use warning and the interval variable timeSeconds to explain the passing rate of the IMC (model 1). Then test whether timeSeconds influences the effect of warning on successfully passing the IMC by adding an interaction term (model 2).
##
## Call:
## glm(formula = IMCcorrect ~ warning * timeSeconds, family = binomial,
## data = survey)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8673 0.6204 0.6219 0.7716 0.9837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1622749 0.3788804 3.068 0.00216 **
## warning 0.3792952 0.4875195 0.778 0.43656
## timeSeconds -0.0002866 0.0005111 -0.561 0.57491
## warning:timeSeconds 0.0002981 0.0005152 0.579 0.56289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 158.07 on 149 degrees of freedom
## Residual deviance: 155.49 on 146 degrees of freedom
## AIC: 163.49
##
## Number of Fisher Scoring iterations: 8
Test whether the larger model fits better than the model with fewer parameters. Use pchisq and check your result with anova.
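The likelihood-ratio test by hand, assuming the two nested models were stored as m1 and m2:

```r
# The difference in deviance is chi-squared distributed, with df equal to
# the difference in the number of parameters between the two models
dev.diff <- deviance(m1) - deviance(m2)
df.diff  <- df.residual(m1) - df.residual(m2)
pchisq(dev.diff, df = df.diff, lower.tail = FALSE)

anova(m1, m2, test = "Chisq")   # should give the same p-value
```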
## [1] 0.5698695
## Analysis of Deviance Table
##
## Model 1: IMCcorrect ~ warning + timeSeconds
## Model 2: IMCcorrect ~ warning * timeSeconds
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 147 155.81
## 2 146 155.49 1 0.3229 0.5699
Estimate McFadden's R² using deviance and null.deviance. Use the formula in the slides. Make sure you compare the same cases across models, so that the null model uses the same cases! The R² of the first model is listed first and that of the second model second.
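McFadden's pseudo-R² is 1 minus the ratio of the residual deviance to the null deviance (m1 and m2 are the assumed names of the two models fitted above):

```r
# McFadden's R^2 = 1 - (residual deviance / null deviance)
1 - m1$deviance / m1$null.deviance   # model 1
1 - m2$deviance / m2$null.deviance   # model 2
```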
## [1] 0.01432478
## [1] 0.01636752
Check whether null.deviance corresponds to the intercept-only model. Compare your results with the models estimated above. Make sure that the intercept-only model has the same number of cases as the model you estimated before. For a binary dependent variable, the null model is always the intercept-only model.
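A quick check, assuming the data frame survey and the model m1 from above:

```r
# Fit the intercept-only model on the same cases
m0 <- glm(IMCcorrect ~ 1, family = binomial, data = survey)
deviance(m0)       # should equal m1$null.deviance
m1$null.deviance
```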
## [1] 158.0724