The imdb dataset used in this course is a subset of the full imdb data (which covers various countries, movies, and periods). It consists of the movies shown in Germany, together with the ratings given by users of imdb.com. For each movie, the dataset gives the name of the movie, the genre, the runtime, and the budget. This dataset is also used to illustrate the commands discussed below.
Variables have different measurement levels. A variable is nominal when its categories have no natural order (i.e. the numbers attached to the categories have no meaning). It is ordinal when the categories follow an ordering (i.e. the order of the numbers attached to the categories has a meaning). It is interval when the values are equally spaced (i.e. the distance between the numbers has a meaning). It is ratio when, in addition, zero is a meaningful starting point, so that ratios of values can be interpreted. Examples are gender (nominal), education level (ordinal), temperature in degrees Celsius (interval), and prices (ratio). Celsius (or Fahrenheit) temperature is the classic example of an interval variable because its zero is not absolute (0 degrees Celsius does not mean ‘no heat’). The imdb dataset contains ratio (budget), interval (rating), ordinal (ratingcat, created below), and nominal (genre) variables.
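In R, measurement levels map roughly onto data types: nominal variables are usually stored as factors, ordinal variables as ordered factors, and interval or ratio variables as numeric vectors. The sketch below illustrates this with small made-up vectors (hypothetical values, not the imdb data).

```r
# Hypothetical toy data, only for illustration
genre  <- factor(c("Action", "Comedy", "Drama", "Comedy"))   # nominal
level  <- factor(c("low", "high", "middle", "low"),
                 levels = c("low", "middle", "high"),
                 ordered = TRUE)                              # ordinal
rating <- c(7.5, 5.8, 7.8, 6.8)                               # interval
budget <- c(20, 35.5, 60, 1.5)                                # ratio

is.ordered(genre)    # FALSE: the categories have no order
is.ordered(level)    # TRUE: low < middle < high
level < "high"       # comparisons are meaningful for ordered factors
mean(rating)         # arithmetic is meaningful for interval and ratio data
```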
summary(imdb)
##                   movie        runtime          budget
##  (500) Days of Summer:  1   Min.   : 80.0   Min.   :  1.50
##  2 Guns              :  1   1st Qu.: 97.0   1st Qu.: 21.00
##  2012                :  1   Median :107.0   Median : 40.00
##  21 and Over         :  1   Mean   :109.7   Mean   : 60.33
##  21 Jump Street      :  1   3rd Qu.:120.0   3rd Qu.: 80.00
##  22 Jump Street      :  1   Max.   :180.0   Max.   :250.00
##  (Other)             :478
##        genre         rating         revenues           screens
##  Action   :179   Min.   :1.600   Min.   :   11705   Min.   :   7.0
##  Comedy   :107   1st Qu.:5.800   1st Qu.:  481093   1st Qu.: 241.0
##  Drama    : 58   Median :6.400   Median : 1215030   Median : 350.0
##  Adventure: 37   Mean   :6.318   Mean   : 2191898   Mean   : 369.3
##  Horror   : 33   3rd Qu.:7.000   3rd Qu.: 2769544   3rd Qu.: 501.0
##  Crime    : 22   Max.   :8.800   Max.   :23845427   Max.   :1265.0
##  (Other)  : 48                   NA's   :47         NA's   :45
You can plot interval (and ratio) variables using plot. The y-axis shows the value of the variable, and the x-axis shows the position of each value in the variable. The first value is 7.5, which appears at the far left of the x-axis.
imdb$rating[1:10]
## [1] 7.5 5.8 7.8 6.8 5.9 7.2 7.1 6.2 6.1 6.3
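The plot itself was presumably produced with a call such as plot(imdb$rating). That call is not shown in the text, so it is an assumption. A minimal sketch using only the ten ratings printed above:

```r
# The first ten ratings, copied from the output above
rating10 <- c(7.5, 5.8, 7.8, 6.8, 5.9, 7.2, 7.1, 6.2, 6.1, 6.3)

# With a single numeric vector, plot() puts the position (index) on the
# x-axis and the value on the y-axis
plot(rating10, xlab = "Index", ylab = "Rating")

# This is equivalent to giving the positions explicitly
plot(seq_along(rating10), rating10, xlab = "Index", ylab = "Rating")
```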
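A histogram shows the distribution of an interval variable. If you save the histogram to an object, you can inspect the information behind the plot. As you can see, the breaks are equally spaced and span the range of the data, from at or below the lowest value to at or above the highest value. The counts show how many values fall between each pair of breaks. The density is the relative frequency divided by the bin width, so that the areas of the bars sum to 1; it is the histogram's estimate of the probability density function. Because the bins here have width 1, the density equals the relative frequency (for example, 1/484 ≈ 0.002 for the first bin).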
## $breaks
## [1] 1 2 3 4 5 6 7 8 9
##
## $counts
## [1] 1 0 11 37 124 205 94 12
##
## $density
## [1] 0.002066116 0.000000000 0.022727273 0.076446281 0.256198347 0.423553719
## [7] 0.194214876 0.024793388
##
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
##
## $xname
## [1] "imdb$rating"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
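The output above was presumably produced by saving the histogram to an object and printing it. The call is not shown, so the sketch below assumes it, and it uses simulated ratings rather than the imdb data. It also checks how the density relates to the counts.

```r
set.seed(1)                                   # reproducible toy data
x <- round(runif(200, min = 1, max = 9), 1)   # hypothetical ratings

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = 1:9, plot = FALSE)

h$breaks    # bin boundaries
h$counts    # number of observations per bin

# density = count / (n * bin width), so the bar areas sum to 1
all.equal(h$density, h$counts / (length(x) * diff(h$breaks)))
sum(h$density * diff(h$breaks))
```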
You can change the parameters of the histogram. Here we change the breaks, which makes the bars wider or narrower than with the default breaks. Instead of frequencies (or counts), the y-axis can also display the density of the values. A kernel density plot shows a smoothed version of the same distribution.
# breaks: suggest about 50 bins instead of the default number
hist(imdb$budget, breaks=50)
# density instead of counts on the y-axis
hist(imdb$budget, freq=FALSE)
# kernel density plot
plot(density(imdb$budget))
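Note that a single number passed to breaks is only a suggestion: R rounds the break points to "pretty" values, so the number of bars can differ from the number requested. A vector of break points gives exact control. A minimal sketch with simulated budgets (hypothetical values, not the imdb data):

```r
set.seed(2)
budget <- runif(300, min = 1.5, max = 250)    # hypothetical budgets

# A single number is a suggestion; the actual number of bins may differ
h1 <- hist(budget, breaks = 50, plot = FALSE)
length(h1$counts)

# A vector of break points is used exactly as given (bins of width 25)
h2 <- hist(budget, breaks = seq(0, 250, by = 25), plot = FALSE)
length(h2$counts)                             # 10 bins

# freq = FALSE puts densities instead of counts on the y-axis
hist(budget, breaks = seq(0, 250, by = 25), freq = FALSE)
lines(density(budget))                        # overlay the kernel density
```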
Nominal and ordinal variables are better analyzed with barplot, which displays frequencies. To get the frequencies you need the function table, which counts how often each value of a variable occurs. If you give the function two arguments, you get a cross table. Just for this example we have recoded the variable rating into three categories (low, middle, and high) using cut.
table(imdb$genre)
##
## Action Adventure Animation Biography Comedy Crime
## 179 37 12 21 107 22
## Documentary Drama Fantasy Horror Mystery Romance
## 6 58 1 33 3 1
## Sci-Fi Short
## 2 2
imdb$ratingcat<-cut(imdb$rating, breaks=c(1,6,7,8.8))
table(imdb$ratingcat,imdb$genre)
##
## Action Adventure Animation Biography Comedy Crime Documentary
## (1,6] 56 15 7 1 46 4 5
## (6,7] 91 12 3 5 50 8 0
## (7,8.8] 32 10 2 15 11 10 1
##
## Drama Fantasy Horror Mystery Romance Sci-Fi Short
## (1,6] 15 1 22 0 0 0 1
## (6,7] 25 0 10 1 0 0 0
## (7,8.8] 18 0 1 2 1 2 1
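The categories can also be given the names low, middle, and high through the labels argument of cut, and prop.table turns a cross table into proportions. A minimal sketch with made-up values (the vectors below are hypothetical, not the imdb data):

```r
rating <- c(7.5, 5.8, 7.8, 6.8, 5.9, 7.2, 4.0, 6.2)
genre  <- c("Action", "Comedy", "Action", "Comedy",
            "Action", "Action", "Comedy", "Comedy")

# Same breaks as above, but with descriptive labels
ratingcat <- cut(rating, breaks = c(1, 6, 7, 8.8),
                 labels = c("low", "middle", "high"))

tab <- table(ratingcat, genre)
tab

# Column proportions: share of each rating category within a genre
prop.table(tab, margin = 2)
```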
The barplot shows the frequency of each genre. The category labels appear below the bars because genre is a factor: table uses its levels as the names of the counts, and barplot uses those names as labels.
barplot(table(imdb$genre))
imdb$genre[1:10]
## [1] Biography Action Comedy Action Comedy Action Action
## [8] Action Action Action
## 14 Levels: Action Adventure Animation Biography Comedy ... Short
You can also depict the bars horizontally:
barplot(table(imdb$genre), horiz=TRUE)
A stacked barplot is created when you pass a cross table of two categorical variables (here the nominal variable genre and the ordinal variable ratingcat) to barplot. The legend shows which color represents which category.
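# keep only Action and Comedy, and drop the unused genre levels
imdbshort<-droplevels(imdb[imdb$genre %in% c("Action","Comedy"),])
# legend.text=TRUE labels the legend with the row names of the table
barplot(table(imdbshort$genre,imdbshort$ratingcat), legend.text=TRUE)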
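Two details matter for this kind of plot: subsetting a factor keeps its unused levels unless they are dropped, and beside = TRUE draws grouped instead of stacked bars. A minimal sketch with made-up data (hypothetical values, not the imdb data):

```r
genre <- factor(c("Action", "Comedy", "Drama", "Action", "Comedy", "Action"))
ratingcat <- factor(c("low", "high", "low", "high", "low", "low"),
                    levels = c("low", "high"))

keep <- genre %in% c("Action", "Comedy")

# Without droplevels() the unused level "Drama" stays in the factor
nlevels(genre[keep])                 # 3
g <- droplevels(genre[keep])
nlevels(g)                           # 2

tab <- table(g, ratingcat[keep])

# Stacked bars (default) and grouped bars (beside = TRUE);
# legend.text = TRUE takes the legend labels from the row names
barplot(tab, legend.text = TRUE)
barplot(tab, beside = TRUE, legend.text = TRUE)
```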
You can recode variables using the commands learned above. Recoding can be achieved by simply creating a new variable and filling in its values based on the values of the old variable. We first create a variable that is missing (NA) for every row, and then assign the new codes.
# start with a variable that is missing (NA) for every row
juul$newvar<-NA_character_
# fill in the new codes; rows with a missing sex stay NA
juul[juul$sex== 2 & !is.na(juul$sex),"newvar"]<-"F"
juul[juul$sex== 1 & !is.na(juul$sex),"newvar"]<-"M"
table(juul$newvar)
##
## F M
## 713 621
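The same recoding can be written in one step with ifelse, which keeps missing values missing. A minimal sketch on a small made-up vector (hypothetical, not the juul data):

```r
sex <- c(1, 2, 2, NA, 1)             # 1 = male, 2 = female, NA = unknown

# ifelse() returns NA wherever the condition is NA.
# Note that every non-missing value other than 2 is coded "M" here.
newvar <- ifelse(sex == 2, "F", "M")
newvar

table(newvar)                        # NA is left out of the table by default
table(newvar, useNA = "ifany")       # show the missing values as well
```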
Another way to recode is to use factor and cut. The first function transforms a numeric variable into a factor, where each number is considered a separate level, and labels are attached to these levels:
sexcat<-factor(juul$sex, levels=c(1,2), labels=c("M","F"))
table(sexcat)
## sexcat
## M F
## 621 713
The second function allows you to create categories out of a numeric variable with many distinct values.
agecat<-cut(juul$age, breaks=c(0,5,10,15,20,83),
labels=c("<5 years","5 to 10 years","10 to 15 years","15 to 20 years", ">20 years"))
table(agecat)
## agecat
## <5 years 5 to 10 years 10 to 15 years 15 to 20 years >20 years
## 45 379 432 331 147
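By default cut uses intervals that are open on the left and closed on the right, so a value exactly on a break falls into the lower category (an age of exactly 5 is counted in the first category). Setting right = FALSE reverses this. A minimal sketch with made-up ages (hypothetical values, not the juul data):

```r
age <- c(3, 5, 7.5, 10, 20)          # hypothetical ages
brk <- c(0, 5, 10, 15, 20, 83)

# Default: intervals (0,5], (5,10], ... are closed on the right
right_closed <- cut(age, breaks = brk)
right_closed

# right = FALSE: intervals [0,5), [5,10), ... are closed on the left
left_closed <- cut(age, breaks = brk, right = FALSE)
left_closed
```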