class: center, middle, inverse, title-slide

.title[
# Data Types and Data Classes
]
.author[
### Termeh Shafie
]

---

layout: true

---

class: middle

## Why should you care about data types?

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

```r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 × 3
##   name           number_of_cats handedness
##   <chr>          <chr>          <chr>
## 1 Bernice Warren 0              left
## 2 Woodrow Stone  0              left
## 3 Willie Bass    1              left
## 4 Tyrone Estrada 3              left
## 5 Alex Daniels   3              left
## # ℹ 55 more rows
```

---

## Oh why won't you work?!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats = mean(number_of_cats)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

```r
?mean
```

<img src="img/mean-help.png" width="75%" style="display: block; margin: auto;" />

---

## Oh why won't you still work??!!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

.question[
What is the type of the `number_of_cats` variable?
]

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## Let's take another look

.small[
]

---

## Sometimes you might need to babysit your respondents

.midi[
```r
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
  )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `number_of_cats = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```
]

---

## You always need to respect data types

```r
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
    ),
    number_of_cats = as.numeric(number_of_cats)
  ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Now that we know what we're doing...

```r
*cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
    ),
    number_of_cats = as.numeric(number_of_cats)
  )
```

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.
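---

One way to "go in and investigate" is to ask which entries refuse to become numbers. The chunk below is a minimal sketch rather than part of the original analysis: the vector `responses` is made up for illustration and stands in for a column such as `number_of_cats` that was read in as character.

```r
# Hypothetical survey responses (illustration only, not the real data)
responses <- c("0", "1", "three", "2", "1.5 cats")

# Try the conversion; entries that are not numbers become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
converted <- suppressWarnings(as.numeric(responses))

# Positions that were not missing before but are NA after converting
bad <- which(is.na(converted) & !is.na(responses))

# The offending entries, ready for a manual fix
responses[bad]
```

Here the third and fifth entries are flagged. Comparing against `!is.na(responses)` keeps values that were already missing from being reported as conversion failures.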
---

class: middle

.hand[.light-blue[now that we have a good motivation for]]

.hand[.light-blue[learning about data types in R]]

<br>

.large[
.hand[.light-blue[let's learn about data types in R!]]
]

---

class: middle

## Data types

---

## Data types in R

- **logical**
- **double**
- **integer**
- **character**
- and some more, but we won't be focusing on those

---

## Logical & character

.pull-left[
**logical** - boolean values `TRUE` and `FALSE`

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]
.pull-right[
**character** - character strings

```r
typeof("hello")
```

```
## [1] "character"
```
]

---

## Double & integer

.pull-left[
**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[
**integer** - integer numerical values (indicated with an `L`)

```r
typeof(7L)
```

```
## [1] "integer"
```

```r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

```r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Converting between types

.hand[with intention...]

.pull-left[
```r
x <- 1:3
x
```

```
## [1] 1 2 3
```

```r
typeof(x)
```

```
## [1] "integer"
```
]
--
.pull-right[
```r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

```r
typeof(y)
```

```
## [1] "character"
```
]

---

## Converting between types

.hand[with intention...]

.pull-left[
```r
x <- c(TRUE, FALSE)
x
```

```
## [1]  TRUE FALSE
```

```r
typeof(x)
```

```
## [1] "logical"
```
]
--
.pull-right[
```r
y <- as.numeric(x)
y
```

```
## [1] 1 0
```

```r
typeof(y)
```

```
## [1] "double"
```
]

---

## Converting between types

.hand[without intention...]
R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!

.pull-left[
```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```
]
--
.pull-right[
```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

```r
c(2L, "two")
```

```
## [1] "2"   "two"
```
]

---

## Explicit vs. implicit coercion

Let's give formal names to what we've seen so far:

--

- **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()`

--

- **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector

---

.midi[
.your-turn[
### .hand[Your turn!]

- Open `type-coercion.qmd`.
- What is the type of the given vectors? First, guess. Then, try it out in R. If your guess was correct, great! If not, discuss why they have that type.
]
]

--

.small[
**Example:** Suppose we want to know the type of `c(1, "a")`. First, I'd look at:

.pull-left[
```r
typeof(1)
```

```
## [1] "double"
```
]
.pull-right[
```r
typeof("a")
```

```
## [1] "character"
```
]

and make a guess based on these. Then finally I'd check:

.pull-left[
```r
typeof(c(1, "a"))
```

```
## [1] "character"
```
]
]

---

class: middle

## Special values

---

## Special values

- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity

--

.pull-left[
```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```
]
.pull-right[
```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
]

---

## `NA`s are special ❄️s

```r
x <- c(1, 2, 3, 4, NA)
```

```r
mean(x)
```

```
## [1] NA
```

```r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

```r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

---

## `NA`s are logical

R uses `NA` to represent missing values in its data structures.
```r
typeof(NA)
```

```
## [1] "logical"
```

---

## Mental model for `NA`s

- Unlike `NaN`, `NA`s are genuinely unknown values
- But that doesn't mean they can't function in a logical way
- Let's think about why `NA`s are logical...

--

.question[
Why do the following give different answers?
]

.pull-left[
```r
# TRUE or NA
TRUE | NA
```

```
## [1] TRUE
```
]
.pull-right[
```r
# FALSE or NA
FALSE | NA
```

```
## [1] NA
```
]

`\(\rightarrow\)` See next slide for answers...

---

- `NA` is unknown, so it could be `TRUE` or `FALSE`

.pull-left[
.midi[
- `TRUE | NA`

```r
TRUE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
TRUE | FALSE # if NA was FALSE
```

```
## [1] TRUE
```
]
]
.pull-right[
.midi[
- `FALSE | NA`

```r
FALSE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
FALSE | FALSE # if NA was FALSE
```

```
## [1] FALSE
```
]
]

- Doesn't make sense for mathematical operations
- Makes sense in the context of missing data

---

class: middle

## Data classes

---

## Data classes

We talked about *types* so far, next we'll introduce the concept of *classes*

- Vectors are like Lego building blocks
- We stick them together to build more complicated constructs, e.g.
*representations of data*
- The **class** attribute relates to the S3 class of an object, which determines its behaviour
  - You don't need to worry about what S3 classes really mean, but you can read more about it [here](https://adv-r.hadley.nz/s3.html#s3-classes) if you're curious
- Examples: factors, dates, and data frames

---

## Factors

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values

```r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```

---

## More on factors

We can think of factors like characters (level labels) and integers (level numbers) glued together

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
as.integer(x)
```

```
## [1] 1 2 3 2
```

---

## Dates

```r
y <- as.Date("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

```r
typeof(y)
```

```
## [1] "double"
```

```r
class(y)
```

```
## [1] "Date"
```

---

## More on dates

We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) and an integer (the origin) glued together

```r
as.integer(y)
```

```
## [1] 18262
```

```r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```

---

## Data frames

We can think of data frames like vectors of equal length glued together

```r
df <- data.frame(x = 1:2, y = 3:4)
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

---

## Lists

Lists are a generic vector container; vectors of any type can go in them

```r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l
```

```
## $x
## [1] 1 2 3 4
##
## $y
## [1] "hi"    "hello" "jello"
##
## $z
## [1]  TRUE FALSE
```

---

## Lists and data frames

- A data frame is a special list containing vectors of equal length
- When we use the `pull()` function, we extract a vector from the data frame

```r
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

```r
df %>%
  pull(y)
```

```
## [1] 3 4
```

---

class: middle

## Working with factors

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## But coerce when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

<img src="07-data-types-classes_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />

---

## Use forcats to manipulate factors

```r
cat_lovers %>%
* mutate(handedness = fct_infreq(handedness)) %>%
  ggplot(mapping = aes(x = handedness)) +
  geom_bar()
```

<img src="07-data-types-classes_files/figure-html/unnamed-chunk-46-1.png" width="55%" style="display: block; margin: auto;" />

---

## Come for the functionality

... stay for the logo

<img src="img/forcats-part-of-tidyverse.png" width="40%" style="display: block; margin: auto;" />

.pull-left-wide[
- Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display
- They are also useful in modeling scenarios
- The **forcats** package provides a suite of useful tools that solve common problems with factors
]

---

.midi[
.your-turn[
### .hand[Your turn!]

- `Hotels + Data types` > `hotels-forcats.qmd`.
- Recreate the x-axis of the following plot.
- **Stretch goal**: Recreate the y-axis.
<img src="07-data-types-classes_files/figure-html/unnamed-chunk-47-1.png" width="90%" style="display: block; margin: auto;" />
]
]

---

class: middle

## Working with dates

---

## Make a date

.pull-left[
<img src="img/lubridate-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
- **lubridate** is the tidyverse package that makes dealing with dates a little easier
]

---

class: middle

.hand[.light-blue[we're just going to scratch the surface of working with dates in R here...]]

---

.question[
Calculate and visualise the number of bookings on any given arrival date.
]

```r
hotels %>%
  select(starts_with("arrival_"))
```

```
## # A tibble: 119,390 × 4
##   arrival_date_year arrival_date_month arrival_date_week_number
##               <dbl> <chr>                                 <dbl>
## 1              2015 July                                     27
## 2              2015 July                                     27
## 3              2015 July                                     27
## 4              2015 July                                     27
## 5              2015 July                                     27
## # ℹ 119,385 more rows
## # ℹ 1 more variable: arrival_date_day_of_month <dbl>
```

---

## Step 1. Construct dates

.midi[]

```r
library(glue)

hotels %>%
  mutate(
    arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")
  ) %>%
  relocate(arrival_date)
```

```
## # A tibble: 119,390 × 33
##   arrival_date hotel      is_canceled lead_time arrival_date_year
##   <glue>       <chr>            <dbl>     <dbl>             <dbl>
## 1 2015 July 1  Resort Ho…           0       342              2015
## 2 2015 July 1  Resort Ho…           0       737              2015
## 3 2015 July 1  Resort Ho…           0         7              2015
## 4 2015 July 1  Resort Ho…           0        13              2015
## 5 2015 July 1  Resort Ho…           0        14              2015
## # ℹ 119,385 more rows
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```

---

## Step 2. Count bookings per date

.midi[]

```r
hotels %>%
  mutate(
    arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")
  ) %>%
  count(arrival_date)
```

```
## # A tibble: 793 × 2
##   arrival_date       n
##   <glue>         <int>
## 1 2015 August 1    110
## 2 2015 August 10   207
## 3 2015 August 11   117
## 4 2015 August 12   133
## 5 2015 August 13   107
## # ℹ 788 more rows
```

---

## Step 3. Visualise bookings per date

.midi[]

```r
hotels %>%
  mutate(
    arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")
  ) %>%
  count(arrival_date) %>%
  ggplot(aes(x = arrival_date, y = n, group = 1)) +
  geom_line()
```

<img src="07-data-types-classes_files/figure-html/unnamed-chunk-51-1.png" width="60%" style="display: block; margin: auto;" />

---

.hand[zooming in a bit...]

.question[
Why does the plot start with August when we know our data start in July? And why does August 10 come right after August 1?
]

.midi[]

<img src="07-data-types-classes_files/figure-html/unnamed-chunk-52-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 1. *REVISED* Construct dates "as dates"

.midi[]

```r
library(glue)

hotels %>%
  mutate(
    arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))
  ) %>%
  relocate(arrival_date)
```

```
## # A tibble: 119,390 × 33
##   arrival_date hotel      is_canceled lead_time arrival_date_year
##   <date>       <chr>            <dbl>     <dbl>             <dbl>
## 1 2015-07-01   Resort Ho…           0       342              2015
## 2 2015-07-01   Resort Ho…           0       737              2015
## 3 2015-07-01   Resort Ho…           0         7              2015
## 4 2015-07-01   Resort Ho…           0        13              2015
## 5 2015-07-01   Resort Ho…           0        14              2015
## # ℹ 119,385 more rows
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```

---

## Step 2. Count bookings per date

```r
hotels %>%
  mutate(
    arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))
  ) %>%
  count(arrival_date)
```

```
## # A tibble: 793 × 2
##   arrival_date     n
##   <date>       <int>
## 1 2015-07-01     122
## 2 2015-07-02      93
## 3 2015-07-03      56
## 4 2015-07-04      88
## 5 2015-07-05      53
## # ℹ 788 more rows
```

.midi[]

---

## Step 3a. Visualise bookings per date

.midi[]

```r
hotels %>%
  mutate(
    arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))
  ) %>%
  count(arrival_date) %>%
  ggplot(aes(x = arrival_date, y = n, group = 1)) +
  geom_line()
```

<img src="07-data-types-classes_files/figure-html/unnamed-chunk-55-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 3b. Visualise using a smooth curve

.midi[]

```r
hotels %>%
  mutate(
    arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))
  ) %>%
  count(arrival_date) %>%
  ggplot(aes(x = arrival_date, y = n, group = 1)) +
  geom_smooth()
```

<img src="07-data-types-classes_files/figure-html/unnamed-chunk-56-1.png" width="60%" style="display: block; margin: auto;" />
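---

The ordering problem from the first attempt can be reproduced on a small scale. The chunk below is a minimal sketch, not part of the hotels analysis: the vector `arrival_chr` is made up for illustration, and it assumes the lubridate package is installed and that month names are in English.

```r
library(lubridate)

# Hypothetical arrival dates written as "year month day" text
arrival_chr <- c("2015 July 1", "2015 August 10", "2015 August 2", "2015 August 1")

# Character strings are sorted as text: "August" sorts before "July",
# and "August 10" sorts before "August 2"
sort(arrival_chr)

# ymd() parses the text into Date objects, which sort chronologically
arrival_date <- ymd(arrival_chr)
sort(arrival_date)
```

Once the column has class `Date`, sorting, counting, and plotting all follow calendar order, which is why the revised Step 1 wraps `glue()` in `ymd()`.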