Using plots to describe variables

The imdb dataset used in this course is a subset of the full imdb data (which covers various countries, movies, and periods). It consists of movies shown in Germany, together with the ratings given by users on imdb.com. For each movie, the dataset gives the name, the genre, the runtime, and the budget. This dataset is also used to illustrate the commands discussed below.

Distributions

Variables have various measurement levels. Nominal when the categories have no order (i.e. the numbers attached to the categories have no meaning). Ordinal when the categories follow an ordering (i.e. the order of the numbers attached to the categories has a meaning). Interval when the categories are equally spaced (i.e. the distance between the numbers of the categories has a meaning). Ratio when, in addition, the zero point has a meaning (zero means 'none'). Examples are gender (nominal), education level (ordinal), income in 1000 euro (interval), prices (ratio). A classic example of an interval variable is Celsius (or Fahrenheit) temperature, where zero is not absolute (0 degrees Celsius is not ‘no heat’). This dataset contains ratio (budget), interval (rating), ordinal (ratingcat, created further below), and nominal variables (genre).
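
As a rough illustration of how these measurement levels map onto R data types, here is a minimal sketch. The values are invented for the example and are not taken from the imdb data; the variable names only mirror those of the dataset.

```r
# Toy values, invented for illustration (not the imdb data)
genre     <- factor(c("Action", "Comedy", "Drama", "Action"))   # nominal
ratingcat <- factor(c("low", "high", "middle", "low"),
                    levels = c("low", "middle", "high"),
                    ordered = TRUE)                             # ordinal
rating    <- c(5.8, 7.5, 6.4, 5.9)                              # interval
budget    <- c(21, 80, 40, 1.5)                                 # ratio

is.ordered(genre)       # FALSE: the categories have no order
is.ordered(ratingcat)   # TRUE: low < middle < high
ratingcat < "high"      # comparisons work for ordered factors
budget[2] / budget[3]   # a ratio is meaningful because 0 means "no budget"
```

Storing a nominal variable as a plain factor and an ordinal one as an ordered factor lets R refuse comparisons that make no sense for unordered categories.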

summary(imdb)
##                   movie        runtime          budget      
##  (500) Days of Summer:  1   Min.   : 80.0   Min.   :  1.50  
##  2 Guns              :  1   1st Qu.: 97.0   1st Qu.: 21.00  
##  2012                :  1   Median :107.0   Median : 40.00  
##  21 and Over         :  1   Mean   :109.7   Mean   : 60.33  
##  21 Jump Street      :  1   3rd Qu.:120.0   3rd Qu.: 80.00  
##  22 Jump Street      :  1   Max.   :180.0   Max.   :250.00  
##  (Other)             :478                                   
##        genre         rating         revenues           screens      
##  Action   :179   Min.   :1.600   Min.   :   11705   Min.   :   7.0  
##  Comedy   :107   1st Qu.:5.800   1st Qu.:  481093   1st Qu.: 241.0  
##  Drama    : 58   Median :6.400   Median : 1215030   Median : 350.0  
##  Adventure: 37   Mean   :6.318   Mean   : 2191898   Mean   : 369.3  
##  Horror   : 33   3rd Qu.:7.000   3rd Qu.: 2769544   3rd Qu.: 501.0  
##  Crime    : 22   Max.   :8.800   Max.   :23845427   Max.   :1265.0  
##  (Other)  : 48                   NA's   :47         NA's   :45

Plot ratio (and interval) variables

You can plot interval (and ratio) variables using plot. The y-axis shows the value of the variable, and the x-axis shows the position of each value in the variable. The first value is 7.5, which appears at the beginning of the x-axis.

imdb$rating[1:10]
##  [1] 7.5 5.8 7.8 6.8 5.9 7.2 7.1 6.2 6.1 6.3
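
The plotting call itself is not printed in this section; it is presumably something like plot(imdb$rating). The sketch below reproduces the idea with the ten ratings shown above typed in by hand, so it runs without the dataset. The object name rating10 and the axis labels are choices made for this example.

```r
# The ten ratings printed above, typed in by hand (rating10 is a made-up name)
rating10 <- c(7.5, 5.8, 7.8, 6.8, 5.9, 7.2, 7.1, 6.2, 6.1, 6.3)

# With a single numeric vector, plot() puts the position (index) on the x-axis
plot(rating10, xlab = "Position in the vector", ylab = "Rating")

# The same plot with the x-axis made explicit
plot(seq_along(rating10), rating10,
     xlab = "Position in the vector", ylab = "Rating")
```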

A histogram shows the distribution of an interval variable. If you save the histogram to an object, this object contains the information behind the histogram. As you can see, the breaks are equally spaced and span the whole range, from below the lowest value to above the highest value. The counts show how many values fall between each pair of breaks. The density is the share of values in a bin divided by the bin width, so that the bar areas sum to 1; it is related to the probability density function.

## $breaks
## [1] 1 2 3 4 5 6 7 8 9
## 
## $counts
## [1]   1   0  11  37 124 205  94  12
## 
## $density
## [1] 0.002066116 0.000000000 0.022727273 0.076446281 0.256198347 0.423553719
## [7] 0.194214876 0.024793388
## 
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
## 
## $xname
## [1] "imdb$rating"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
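
The relation between counts and density can be checked by hand. The sketch below copies the breaks and counts from the output above and recomputes the density; it does not need the dataset. The call that produced the output is not shown here, but presumably something like h <- hist(imdb$rating) would store such an object (the name h is an assumption).

```r
# Breaks and counts copied from the histogram object printed above
breaks <- 1:9
counts <- c(1, 0, 11, 37, 124, 205, 94, 12)

# density = count / (number of observations * bin width)
dens <- counts / (sum(counts) * diff(breaks))
round(dens, 9)

# The bar areas (density * bin width) add up to 1
sum(dens * diff(breaks))
```

Because every bin here has width 1, the density of a bin equals the proportion of movies in that bin. With other bin widths the two differ.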

You can change the parameters of the histogram. Here we change the breaks, which makes the bars wider or narrower than with the default breaks. Instead of frequencies (or counts), the y-axis can also display the density of the values.

# breaks: ask for roughly 50 bins, which gives narrower bars
hist(imdb$budget, breaks=50)

# density instead of counts on the y-axis
hist(imdb$budget, prob=TRUE)

# kernel density plot
plot(density(imdb$budget))
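
Since the density histogram and the kernel density estimate are on the same scale, they can be combined in one plot. The following is a minimal sketch on simulated data: budget_sim and its distribution parameters are hypothetical and only mimic a right-skewed variable such as budget.

```r
# Simulated, right-skewed values (hypothetical, not the imdb data)
set.seed(1)
budget_sim <- rlnorm(500, meanlog = 3.7, sdlog = 0.9)

# freq = FALSE has the same effect as prob = TRUE
h <- hist(budget_sim, breaks = 50, freq = FALSE, main = "Simulated budgets")

# Overlay the kernel density estimate on the histogram
lines(density(budget_sim), lwd = 2)

# breaks = 50 is only a suggestion: hist() picks "pretty" break points,
# so the actual number of bins can differ
length(h$breaks) - 1
```

If a variable contains missing values, as revenues and screens do in this dataset, density() stops with an error unless you add na.rm = TRUE.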

Plot ordinal (and nominal) variables

Nominal and ordinal variables are better displayed using barplot, which plots the frequencies. To get the frequencies you need the function table, which counts how often each value of the variable occurs. If you give the function two variables, you get a cross table. Just for this example we recode the variable rating into 3 categories: low, middle and high.

table(imdb$genre)
## 
##      Action   Adventure   Animation   Biography      Comedy       Crime 
##         179          37          12          21         107          22 
## Documentary       Drama     Fantasy      Horror     Mystery     Romance 
##           6          58           1          33           3           1 
##      Sci-Fi       Short 
##           2           2
imdb$ratingcat<-cut(imdb$rating, breaks=c(1,6,7,8.8))
table(imdb$ratingcat,imdb$genre)
##          
##           Action Adventure Animation Biography Comedy Crime Documentary
##   (1,6]       56        15         7         1     46     4           5
##   (6,7]       91        12         3         5     50     8           0
##   (7,8.8]     32        10         2        15     11    10           1
##          
##           Drama Fantasy Horror Mystery Romance Sci-Fi Short
##   (1,6]      15       1     22       0       0      0     1
##   (6,7]      25       0     10       1       0      0     0
##   (7,8.8]    18       0      1       2       1      2     1
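
The categories above are printed as intervals such as (1,6]. If you prefer the names low, middle and high, cut accepts a labels argument. The sketch below uses the first ten ratings and genres printed in this section, typed in by hand; the object names rating10, genre10 and ratingcat10 are made up for the example.

```r
# First ten ratings and genres as printed in this section
rating10 <- c(7.5, 5.8, 7.8, 6.8, 5.9, 7.2, 7.1, 6.2, 6.1, 6.3)
genre10  <- c("Biography", "Action", "Comedy", "Action", "Comedy",
              "Action", "Action", "Action", "Action", "Action")

# Same breaks as above; labels replaces "(1,6]", "(6,7]" and "(7,8.8]"
ratingcat10 <- cut(rating10, breaks = c(1, 6, 7, 8.8),
                   labels = c("low", "middle", "high"))

# Cross table of rating category by genre
tab <- table(ratingcat10, genre10)
tab

# Column proportions: the share of each rating category within a genre
prop.table(tab, margin = 2)
```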

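As with the histogram, you can change the parameters of the barplot. The y-axis can also display proportions instead of counts, for example by passing the table through prop.table first. The labels under the bars appear because genre is a factor with text labels as levels, which table uses as names for the counts.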

barplot(table(imdb$genre))

imdb$genre[1:10]
##  [1] Biography Action    Comedy    Action    Comedy    Action    Action   
##  [8] Action    Action    Action   
## 14 Levels: Action Adventure Animation Biography Comedy ... Short

You can also depict the bars horizontally:

barplot(table(imdb$genre), horiz=TRUE)

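A stacked barplot is created when you give barplot a cross table of two categorical variables, here the nominal variable genre and the ordinal variable ratingcat. Each bar is a rating category, split by genre. The legend shows which color represents which genre.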

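# keep only two genres and drop the unused factor levels,
# so that the cross table has exactly two rows, one per legend entry
imdbshort<-droplevels(imdb[imdb$genre=="Action"| imdb$genre=="Comedy",])
barplot(table(imdbshort$genre,imdbshort$ratingcat), legend=c("Action","Comedy"))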
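
One thing to watch for in this kind of plot: subsetting a factor keeps all of its original levels, so table can return rows of zeros for categories that were filtered out, and a hand-written legend may then no longer line up with the rows. The sketch below shows the effect on a toy factor (all values are hypothetical) and how droplevels removes the unused levels.

```r
# Toy data with three genres (hypothetical values)
genre     <- factor(c("Action", "Comedy", "Drama", "Action", "Comedy"))
ratingcat <- factor(c("low", "high", "low", "high", "high"))
keep      <- genre %in% c("Action", "Comedy")

# Subsetting keeps the unused level "Drama" as a row of zeros
table(genre[keep], ratingcat[keep])

# droplevels() removes it, leaving one row per remaining genre
tab <- table(droplevels(genre[keep]), ratingcat[keep])
tab

# legend.text = TRUE takes the legend from the row names of the table
barplot(tab, legend.text = TRUE)

# beside = TRUE draws grouped bars instead of stacked bars
barplot(tab, beside = TRUE, legend.text = TRUE)
```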

Recoding

You can recode variables using the commands learned above. Recoding can be achieved by simply creating a new object and filling in its values based on the values of the old variable. I first create an object that is completely empty.

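# empty object: one missing value per row of juul
newvar<-rep(NA_character_, nrow(juul))
juul$newvar<-newvar
# fill in the labels based on the old variable; rows where sex is missing stay NA
juul[juul$sex== 2 & !is.na(juul$sex),"newvar"]<-"F"
juul[juul$sex== 1 & !is.na(juul$sex),"newvar"]<-"M"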
table(juul$newvar)
## 
##   F   M 
## 713 621
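
For a variable with only two codes, ifelse is a compact alternative. The sketch below runs on a toy vector with one missing value rather than on the juul data; the names sex and newvar2 are chosen for the example.

```r
# Toy sex codes with one missing value (hypothetical, not the juul data)
sex <- c(1, 2, 2, NA, 1)

# ifelse() recodes in one line; a missing value in sex stays missing.
# Note that every non-missing code other than 2 becomes "M".
newvar2 <- ifelse(sex == 2, "F", "M")
newvar2

# useNA = "ifany" also counts the missing values
table(newvar2, useNA = "ifany")
```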

Another way to recode is to use factor and cut. The first function transforms a numeric variable into a factor, where each number is considered a separate level, and labels are attached to these levels:

sexcat<-factor(juul$sex, levels=c(1,2), labels=c("M","F"))
table(sexcat)
## sexcat
##   M   F 
## 621 713

The second function allows you to create categories out of a numeric variable with many distinct values.

agecat<-cut(juul$age, breaks=c(0,5,10,15,20,83),
            labels=c("<5 years","5 to 10 years","10 to 15 years","15 to 20 years", ">20 years"))
table(agecat)
## agecat
##       <5 years  5 to 10 years 10 to 15 years 15 to 20 years      >20 years 
##             45            379            432            331            147
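
A detail worth knowing when choosing labels: by default cut uses intervals that are closed on the right, so a value exactly equal to a break falls into the lower category. With the breaks above, someone aged exactly 5 is therefore counted in the first category. The sketch below shows this on a few toy ages (hypothetical values, not the juul data) and how right = FALSE changes it.

```r
# Toy ages on and around the break points (hypothetical values)
age    <- c(4.9, 5, 5.1, 10, 20, 20.5)
breaks <- c(0, 5, 10, 15, 20, 83)

# Default right = TRUE: intervals are (0,5], (5,10], ...
# so age 5 falls into the first category
table(cut(age, breaks = breaks))

# right = FALSE: intervals are [0,5), [5,10), ...
# so age 5 moves to the second category
table(cut(age, breaks = breaks, right = FALSE))
```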