class: center, middle, inverse, title-slide # R Programming & Descriptive statistics ## Lecture 17 ###
Louis SIRUGUE ### CPES 2 - Spring 2023 --- <h3>Today: Refresher on R Programming and Descriptive Statistics</h3> -- <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. The basics of R programming</b></li> <ul style = "list-style: none"> <li>1.1. Types of R objects</li> <li>1.2. The dplyr grammar</li> <li>1.3. Data visualization</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Descriptive statistics</b></li> <ul style = "list-style: none"> <li>2.1. Distributions</li> <li>2.2. Central tendency</li> <li>2.3. Spread</li> <li>2.4. Joint distributions</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. A few words on using R</b></li> <ul style = "list-style: none"> <li>3.1. When it doesn't work the way you want</li> <li>3.2. Where to find help</li> <li>3.3. When it doesn't work at all</li> </ul> </ul> ] --- <h3>Today: Refresher on R Programming and Descriptive Statistics</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. The basics of R programming</b></li> <ul style = "list-style: none"> <li>1.1. Types of R objects</li> <li>1.2. The dplyr grammar</li> <li>1.3. Data visualization</li> </ul> </ul> ] --- ### 1. The basics of R programming #### 1.1. Types of R objects * The most <b>basic</b> element in R is just a <b>value</b>, an object of dimension `\(1\times1\)`: -- ```r a <- 1 a ``` ``` ## [1] 1 ``` <p style = "margin-bottom:1cm;"> -- ```r b <- "monday" b ``` ``` ## [1] "monday" ``` <p style = "margin-bottom:1cm;"> -- ```r c <- a == b c ``` ``` ## [1] FALSE ``` --- ### 1. The basics of R programming #### 1.1. 
Types of R objects <ul> <li>Next, there are <b>vectors</b>, objects of dimension \(n\times1\):</li> <ul> <li>Vectors can be created with <b>c()</b></li> <li>The elements of a vector should be of the <b>same class</b></li> <li>Class can be changed with <b>as</b> functions: as.[numeric/character/logical]()</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"> .pull-left[ .pull-left[ <center><b>Numeric</b></center> ```r a <- 1:4 a ``` ``` ## [1] 1 2 3 4 ``` <p style = "margin-bottom:1.5cm;"> ```r a * 2 ``` ``` ## [1] 2 4 6 8 ``` ] .pull-right[ <center><b>Character</b></center> ```r b <- c("a", "xyz") b ``` ``` ## [1] "a" "xyz" ``` <p style = "margin-bottom:1.5cm;"> ```r paste0("b", b) ``` ``` ## [1] "ba" "bxyz" ``` ] ] -- .pull-left[ .pull-left[ <center><b>Logical</b></center> ```r c <- c(F, 1 < 2) c ``` ``` ## [1] FALSE TRUE ``` <p style = "margin-bottom:1.5cm;"> ```r !c ``` ``` ## [1] TRUE FALSE ``` ] .pull-right[ <center><b>Factor</b></center> ```r d <- as.factor(a) d ``` ``` ## [1] 1 2 3 4 ## Levels: 1 2 3 4 ``` <p style = "margin-bottom:.9cm;"> ```r relevel(d, 3) ``` ``` ## [1] 1 2 3 4 ## Levels: 3 1 2 4 ``` ] ] --- ### 1. The basics of R programming #### 1.1. Types of R objects * Some useful functions/operators for vectors ```r vec <- c("a", "b", "c", "d") ``` -- <p style = "margin-bottom:1cm;"> ```r length(vec) ``` ``` ## [1] 4 ``` -- <p style = "margin-bottom:.8cm;"> ```r match("b", vec) ``` ``` ## [1] 2 ``` -- <p style = "margin-bottom:.8cm;"> ```r vec[3] ``` ``` ## [1] "c" ``` --- ### 1. The basics of R programming #### 1.1. 
Types of R objects <ul> <li>Finally, there are <b>tables</b>, objects of dimension \(n\times m\):</li> <ul> <li>Gather \(m\) vectors (columns) of \(n\) observations</li> <li>Several possible classes, e.g., <b>tibble()</b> from tidyverse</li> </ul> </ul> -- ```r library(tidyverse) data <- tibble(name = c("Bob", "Tom", "Kim"), age = c(43, 19, 27), male = c(T, T, F)) ``` -- ```r data ``` ``` ## # A tibble: 3 x 3 ## name age male ## <chr> <dbl> <lgl> ## 1 Bob 43 TRUE ## 2 Tom 19 TRUE ## 3 Kim 27 FALSE ``` --- ### 1. The basics of R programming #### 1.1. Types of R objects <ul> <li>Such datasets can be imported into R with <b>read functions</b></li> <ul> <li>There is one read function per <b>data format</b> (csv, xls, dta, ...)</li> <li>The main argument is the <b>path</b>, written with forward slashes: "C:/User/.../data.csv"</li> </ul> </ul> <p style = "margin-bottom:1cm;"> -- ```r data <- read.csv("data/cereals.csv") # Import csv data data <- as_tibble(data) # Put in tibble format head(data, 5) # Print first 5 rows ``` -- ``` ## # A tibble: 5 x 16 ## name mfr type calories protein fat sodium fiber carbo sugars potass ## <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> ## 1 100% Bran N C 70 4 1 130 10 5 6 280 ## 2 100% Natu~ Q C 120 3 5 15 2 8 8 135 ## 3 All-Bran K C 70 4 1 260 9 7 5 320 ## 4 All-Bran ~ K C 50 4 0 140 14 8 0 330 ## 5 Almond De~ R C 110 2 2 200 1 14 8 -1 ## # ... with 5 more variables: vitamins <dbl>, shelf <int>, weight <dbl>, ## # cups <dbl>, rating <dbl> ``` --- ### 1. The basics of R programming #### 1.2. 
The dplyr grammar * `dplyr` provides **useful functions** to manipulate data and the **pipe operator (%>%)** to chain operations -- <p style = "margin-bottom:1.25cm;"> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption><b>Important functions of the dplyr grammar</b></caption> <thead> <tr> <th style="text-align:left;"> Function </th> <th style="text-align:left;"> Meaning </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> mutate() </td> <td style="text-align:left;"> Modify or create a variable </td> </tr> <tr> <td style="text-align:left;"> select() </td> <td style="text-align:left;"> Keep a subset of variables </td> </tr> <tr> <td style="text-align:left;"> filter() </td> <td style="text-align:left;"> Keep a subset of observations </td> </tr> <tr> <td style="text-align:left;"> arrange() </td> <td style="text-align:left;"> Sort the data </td> </tr> <tr> <td style="text-align:left;"> group_by() </td> <td style="text-align:left;"> Group the data </td> </tr> <tr> <td style="text-align:left;"> summarise() </td> <td style="text-align:left;"> Summarize variables into 1 observation per group </td> </tr> <tr> <td style="text-align:left;"> left/right/inner/full_join() </td> <td style="text-align:left;"> Merge data </td> </tr> </tbody> </table> --- ### 1. The basics of R programming #### 1.2. 
The dplyr grammar <ul> <li>We can first subset the data:</li> <ul> <li>The type <b>variable</b> only takes the value "C", so we can remove it with <b>select()</b></li> <li>Some <b>observations</b> have negative values of potassium; we can remove them with <b>filter()</b></li> <li>These two operations can be <b>chained</b> using the pipe operator <b>%>%</b></li> </ul> </ul> -- <p style = "margin-bottom:1cm;"> ```r dim(data) # Dimensions of the data before the operation ``` ``` ## [1] 77 16 ``` -- ```r data <- data %>% select(-type) %>% filter(potass >= 0) ``` -- ```r dim(data) # Dimensions of the data after the operation ``` ``` ## [1] 75 15 ``` --- ### 1. The basics of R programming #### 1.2. The dplyr grammar <ul> <li>The <b>mutate()</b> function lets you modify and create variables</li> <ul> <li>Using simple <b>vector operations</b></li> <li>With <b>ifelse()</b> to create a binary variable based on a condition</li> <li>With <b>case_when()</b> to create a categorical variable</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> ```r data <- data %>% mutate(cal_100g = 100 * (calories / weight), ``` -- <p style = "margin-bottom:-.7cm;"></p> ```r low_cal = ifelse(cal_100g < 100, T, F), ``` -- <p style = "margin-bottom:-.7cm;"></p> ```r mfr = case_when(mfr == "N" ~ "Nestlé", mfr == "Q" ~ "Quaker Oats", mfr == "K" ~ "Kellogg's", mfr %in% c("G", "R") ~ "General Mills", mfr == "P" ~ "Post Consumer Brands LLC", mfr == "A" ~ "Maltex Co.")) ``` --- ### 1. The basics of R programming #### 1.2. 
The dplyr grammar * Such computations can also be done <b>separately</b> for each value of a variable <b>with group_by()</b> -- ```r data <- data %>% group_by(mfr) %>% mutate(n_brands = n()) %>% ungroup() ``` <p style = "margin-bottom:.75cm;"></p> -- .pull-left[ <ul> <li>Using <b>summarise()</b> instead of mutate() lets you:</li> <ul> <li>Keep only the grouping and summarized variables</li> <li>Keep one value per group (no duplicate row)</li> </ul> </ul> ```r data %>% group_by(mfr) %>% summarise(n_brands = n()) ``` ] -- .pull-right[ ``` ## # A tibble: 6 x 2 ## mfr n_brands ## <chr> <int> ## 1 General Mills 29 ## 2 Kellogg's 23 ## 3 Maltex Co. 1 ## 4 Nestlé 5 ## 5 Post Consumer Brands LLC 9 ## 6 Quaker Oats 8 ``` ] --- ### 1. The basics of R programming #### 1.2. The dplyr grammar * `dplyr` also provides functions to: -- <ul><ul><li>Rename variables ➜ <b>rename()</b></li></ul></ul> ```r data <- data %>% rename(manufacturer = mfr) ``` -- <ul><ul><li>Sort rows according to the values of one or several variables ➜ <b>arrange()</b></li></ul></ul> ```r data <- data %>% arrange(cal_100g) ``` -- <ul><ul><li>Join another dataset on a common variable ➜ <b>[left/right/full/inner]_join()</b>:</li></ul></ul> ```r data <- data %>% left_join(tibble(manufacturer = c("Kellogg's", "Nestlé", "General Mills", "Post Consumer Brands LLC", "Quaker Oats", "Maltex Co."), creation = c(1906, 1966, 1928, 1895, 1877, 1899)), by = "manufacturer") ``` --- ### 1. The basics of R programming #### 1.3. 
Data visualization * The tidyverse packages also give access to the <b>ggplot</b> grammar for data visualization -- <ul> <li>The core arguments of the ggplot() function are the following:</li> <ul> <li><b>Data</b>: the values to plot</li> <li><b>Mapping</b> (aes, for aesthetics): the structure of the plot</li> <li><b>Geometry</b>: the type of plot</li> </ul> </ul> -- <ul> <li>These arguments should be specified as follows:</li> <ul> <li>Data and mapping should be specified within the parentheses</li> <li>The geometry and any other element should be added with a <b>+</b> sign</li> </ul> </ul> ```r ggplot(data, aes) + geometry + anything_else ``` -- * You can also apply the `ggplot()` function to your data with a pipe: ```r data %>% ggplot(., aes) + geometry ``` --- ### 1. The basics of R programming #### 1.3. Data visualization ```r test_data <- tibble(V1 = 1:6, V2 = c(64, 60, 16, 8, 16, 32)) ggplot(test_data, aes(x = V1, y = V2)) + geom_point(size = 3) ``` <p style = "margin-bottom:.75cm;"> -- .pull-left[ <p style = "margin-bottom:1cm;"> * We first specified our data: <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> V1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> V2 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 32 </td> </tr> </tbody> </table> * Then assigned `V1` to the x-axis and `V2` to the y-axis with `aes()` * And chose the `point` geometry with a size of 3 ] -- .pull-right[ <img 
src="slides_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> ] --- ### 1. The basics of R programming #### 1.3. Data visualization <ul> <li>In some cases you may want to convey information by means other than a position on an axis</li> <ul> <li>It can be with the color, size or shape of a geometry, ...</li> <li>For instance if you have two groups</li> </ul> </ul> ```r test_data <- test_data %>% mutate(Group = paste("Group", c(1, 1, 2, 2, 2, 2))) ``` -- .pull-left[ <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:right;"> V1 </th> <th style="text-align:right;"> V2 </th> <th style="text-align:left;"> Group </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:left;"> Group 1 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:left;"> Group 1 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> Group 2 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Group 2 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> Group 2 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:left;"> Group 2 </td> </tr> </tbody> </table> ] -- .pull-right[ * Just as we assigned the two numeric variables to the x and y axes with aes, we have to assign the group variable to the 'color axis' with aes ```r ggplot(test_data, aes(x = V1, y = V2, color = Group)) + ... ``` * But there is no proper 'color axis', which is why a legend is generated ] --- ### 1. The basics of R programming #### 1.3. 
Data visualization <img src="slides_files/figure-html/unnamed-chunk-43-1.png" width="75%" style="display: block; margin: auto;" /> --- ### 1. The basics of R programming #### 1.3. Data visualization <p style = "margin-bottom:.5cm;"> <b>➜ Case 1:</b> The style element does not depend on the value of a variable -- <ul> <li>The style element should be <b>uniform</b> across all data points</li> <ul> <li>So it should be specified <b>within the geometry</b> function</li> </ul> </ul> ```r ggplot(test_data, aes(x = V1, y = V2)) + geom_point(color = "red", shape = 18) ``` -- <p style = "margin-bottom:1.25cm;"> <b>➜ Case 2:</b> The style element depends on the value of a variable -- <ul> <li>The style should <b>depend on the value of the variable</b> it has been assigned to in <b>aes</b></li> <ul> <li>So just as for regular axes, modifications should take place in a scale function</li> </ul> </ul> ```r ggplot(test_data, aes(x = V1, y = V2, color = Group)) + scale_color_manual(name = "Group:", values = c("red", "blue")) + geom_point(shape = 18) ``` --- class: inverse, hide-logo ### Practice <p style = "margin-bottom:2cm;"></p> #### 1) Import the dataset `cereals.csv` -- <p style = "margin-bottom:2cm;"></p> #### 2) There is no documentation on the variable `rating`. Use the summary() function to deduce the unit of the variable based on its distribution. -- <p style = "margin-bottom:2cm;"></p> #### 3) Generate a scatter plot with `sugars` on the `x` axis and `rating` on the `y` axis to deduce whether the rating was made by nutritionists or consumers -- <p style = "margin-bottom:3cm;"></p> <center><h3><i>You've got 10 minutes!</i></h3></center>
--- class: inverse, hide-logo ### Solution <p style = "margin-bottom:2cm;"></p> #### 1) Import the dataset `cereals.csv` ```r cereals <- read.csv("C:/User/Documents/cereals.csv") ``` -- <p style = "margin-bottom:2cm;"></p> #### 2) There is no documentation on the variable `rating`. Use the summary() function to deduce the unit of the variable based on its distribution. -- ```r summary(cereals$rating) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 18.04 33.17 40.40 42.67 50.83 93.70 ``` <p style = "margin-bottom:1.5cm;"></p> <center>The variable is probably in percentages</center> --- class: inverse, hide-logo ### Solution #### 3) Generate a scatter plot with `sugars` on the `x` axis and `rating` on the `y` axis to deduce whether the rating was made by nutritionists or consumers -- ```r ggplot(cereals, aes(x = sugars, y = rating)) + geom_point(alpha = .8) ``` -- .left-column[ <img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="80%" style="display: block; margin: auto;" /> ] -- .right-column[ <p style = "margin-bottom:4cm;"></p> <center>The rating was probably made by nutritionists</center> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. The basics of R programming ✔</b></li> <ul style = "list-style: none"> <li>1.1. Types of R objects</li> <li>1.2. The dplyr grammar</li> <li>1.3. Data visualization</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Descriptive statistics</b></li> <ul style = "list-style: none"> <li>2.1. Distributions</li> <li>2.2. Central tendency</li> <li>2.3. Spread</li> <li>2.4. Joint distributions</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. A few words on using R</b></li> <ul style = "list-style: none"> <li>3.1. When it doesn't work the way you want</li> <li>3.2. Where to find help</li> <li>3.3. 
When it doesn't work at all</li> </ul> </ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. The basics of R programming ✔</b></li> <ul style = "list-style: none"> <li>1.1. Types of R objects</li> <li>1.2. The dplyr grammar</li> <li>1.3. Data visualization</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Descriptive statistics</b></li> <ul style = "list-style: none"> <li>2.1. Distributions</li> <li>2.2. Central tendency</li> <li>2.3. Spread</li> <li>2.4. Joint distributions</li> </ul> </ul> ] --- ### 2. Descriptive statistics #### 2.1. Distributions -- * The point of <b>descriptive statistics</b> is to <b>summarize variables</b> into a small set of tractable statistics. * The most comprehensive way to characterize a variable is to compute its distribution: * What are the values the variable takes? * How frequently does each of these values appear? 
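* As a minimal sketch of these two questions in R: the vector `x` below is a hypothetical example (not the lecture data), and `count()` from `dplyr` is one possible way to tabulate it

```r
library(dplyr)

# Hypothetical values, chosen only for illustration
x <- c(3, 5, 4, 6, 5, 4, 5)

# One row per distinct value (what values does the variable take?)
# and a column n with its number of occurrences (how frequently?)
distribution <- tibble(value = x) %>%
  count(value)

distribution
```

* Base R's `table(x)` would return the same counts without loading any package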
-- <b> ➜ Consider for instance the following variable:</b> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Variable 1</caption> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> -- .pull-left[ * We can count how many times each value appears <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> Variable 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 8 </td> </tr> 
<tr> <td style="text-align:left;"> n </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table> ] -- .pull-right[ <p style = "margin-bottom: 1.5cm;"></p> * And we can represent this distribution graphically with a bar plot * Each possible value on the x-axis * Their number of occurrences on the y-axis ] --- ### 2. Descriptive statistics #### 2.1. Distributions <img src="slides_files/figure-html/unnamed-chunk-53-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.1. Distributions * But what if we would like to do the same thing for the following variable? -- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Variable 2</caption> <tbody> <tr> <td style="text-align:right;"> 5.912877 </td> <td style="text-align:right;"> 5.006781 </td> <td style="text-align:right;"> 5.517149 </td> <td style="text-align:right;"> 5.854849 </td> <td style="text-align:right;"> 5.177872 </td> <td style="text-align:right;"> 3.815240 </td> </tr> <tr> <td style="text-align:right;"> 1.666582 </td> <td style="text-align:right;"> 4.422721 </td> <td style="text-align:right;"> 6.025062 </td> <td style="text-align:right;"> 5.411020 </td> <td style="text-align:right;"> 5.889811 </td> <td style="text-align:right;"> 6.729103 </td> </tr> <tr> <td style="text-align:right;"> 4.160800 </td> <td style="text-align:right;"> 6.519049 </td> <td style="text-align:right;"> 6.849172 </td> <td style="text-align:right;"> 8.368158 </td> <td style="text-align:right;"> 6.167404 </td> <td style="text-align:right;"> 2.882974 </td> </tr> <tr> <td style="text-align:right;"> 6.751888 </td> <td style="text-align:right;"> 
3.202183 </td> <td style="text-align:right;"> 6.390224 </td> <td style="text-align:right;"> 3.942039 </td> <td style="text-align:right;"> 6.488909 </td> <td style="text-align:right;"> 8.195647 </td> </tr> <tr> <td style="text-align:right;"> 7.073922 </td> <td style="text-align:right;"> 4.790039 </td> <td style="text-align:right;"> 5.297919 </td> <td style="text-align:right;"> 1.218109 </td> <td style="text-align:right;"> 5.754213 </td> <td style="text-align:right;"> 7.225030 </td> </tr> </tbody> </table> <p style = "margin-bottom:1.5cm;"> -- * Each value appears only once * So the count of each value does not help summarizing the variable -- <center><h4>➜<i>We should rather do a histogram</i></h4></center> --- ### 2. Descriptive statistics #### 2.1. Distributions * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount. -- <p style = "margin-bottom: 1.97cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-55-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.1. Distributions * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount. * We can divide the domain of this variable into 5 bins <p style = "margin-bottom: 1.25cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-56-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.1. Distributions * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount. * We can divide the domain of this variable into 5 bins * And count the number of observations within each bin <img src="slides_files/figure-html/unnamed-chunk-57-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.1. Distributions * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount. 
* We can divide the domain of this variable into 5 bins * And count the number of observations within each bin <img src="slides_files/figure-html/unnamed-chunk-58-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.1. Distributions <ul> <li>There's no definitive rule to choose the number of bins</li> <ul> <li>But too many or too few can yield misleading histograms</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-59-1.png" width="83%" style="display: block; margin: auto;" /> -- <ul> <li>Densities are often used instead of histograms</li> <ul> <li>Both are based on the same principle, but densities are continuous</li> </ul> </ul> --- ### 2. Descriptive statistics #### 2.1. Distributions * Distributions are comprehensive representations but not simple statistics -- <img src="slides_files/figure-html/unnamed-chunk-60-1.png" width="100%" style="display: block; margin: auto;" /> * How to summarize these distributions with simple statistics? --- ### 2. Descriptive statistics #### 2.1. Distributions * Distributions are comprehensive representations but not simple statistics <img src="slides_files/figure-html/unnamed-chunk-61-1.png" width="100%" style="display: block; margin: auto;" /> * How to summarize these distributions with simple statistics? * By describing their central tendency (e.g., mean, median) --- ### 2. Descriptive statistics #### 2.1. Distributions * Distributions are comprehensive representations but not simple statistics <img src="slides_files/figure-html/unnamed-chunk-62-1.png" width="100%" style="display: block; margin: auto;" /> * How to summarize these distributions with simple statistics? * By describing their central tendency (e.g., mean, median) * And their spread (e.g., standard deviation, inter-quartile range) --- ### 2. Descriptive statistics #### 2.2. 
Central tendency <ul> <li>The mean is the most common statistic to describe central tendencies</li> <ul> <li>Take for instance the grades of group 2 last year for the second-semester final project</li> </ul> </ul> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Grades of G2 last year</caption> <tbody> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 16.0 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> </tbody> </table> <p style = "margin-bottom:1.5cm;"> -- * The mean is simply the sum of all the grades divided by the number of grades: <p style = "margin-bottom:.5cm;"> `$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$` -- `$$\frac{20 + 20 + 17.5 + 17.5 + 16 + 16 + 16 + 14.5 + 14.5 + 19.5 + 19.5 + 18.5 + 18.5 + 18.5}{14} = 17.61$$` --- ### 2. Descriptive statistics #### 2.2. 
Central tendency <ul> <li>The mean is the most common statistic to describe central tendencies</li> <ul> <li>Take for instance the grades of group 2 last year for the second-semester final project</li> </ul> </ul> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Grades of G2 last year</caption> <tbody> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 16.0 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> </tbody> </table> <p style = "margin-bottom:1.5cm;"> * It can also be expressed as the average of each possible value weighted by its number of occurrences: <p style = "margin-bottom:.5cm;"> `$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$` `$$\bar{x} = \frac{(2 \times 20) + (2 \times 17.5) + (3 \times 16) + (2 \times 14.5) + (2 \times 19.5) + (3 \times 18.5)}{2 + 2 + 3 + 2 + 2 + 3 = 14} = 17.61$$` --- ### 2. Descriptive statistics #### 2.2. 
Central tendency * To obtain the median you first need to sort the values: <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Grades of G2 last year</caption> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> 14 </td> </tr> <tr> <td style="text-align:left;"> 14.5 </td> <td style="text-align:left;"> 14.5 </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> 17.5 </td> <td style="text-align:left;"> 17.5 </td> <td style="text-align:left;"> 18.5 </td> <td style="text-align:left;"> 18.5 </td> <td style="text-align:left;"> 18.5 </td> <td style="text-align:left;"> 19.5 </td> <td style="text-align:left;"> 19.5 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 20 </td> </tr> </tbody> </table> -- <p style = "margin-bottom:1cm;"> <ul> <li>The median is the value that divides the distribution into two halves</li> <ul> <li>With N even: Average of the last value of the first half and the first value of the second half</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"> * As we have 14 observations, here the median is the average of the 7<sup>th</sup> and the 8<sup>th</sup> observations: `$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\ \frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even} \end{cases} = \frac{17.5 + 18.5}{2} = 18$$` --- ### 2. 
Descriptive statistics #### 2.3. Spread -- <ul> <li>The most intuitive statistic to describe the spread of a variable is probably its <b>range</b></li> <ul> <li>The minimum and maximum value of the distribution</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> * But consider the following two distributions: <p style = "margin-bottom: 1.25cm;"></p> <style> .left-column {width: 66%;} .right-column {width: 32%;} </style> .left-column[ <img src="slides_files/figure-html/unnamed-chunk-66-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .right-column[ <p style = "margin-bottom:-1cm;"> <ul> <li>In the presence of outliers or very skewed distributions, the full range of a variable may not be representative of what we mean by 'spread'</li> </ul> <ul> <li>That's why we tend to prefer:</li> <ul> <li>The <b>inter-quantile range</b></li> <li>The <b>standard deviation</b></li> </ul> </ul> ] --- ### 2. Descriptive statistics #### 2.3. Spread <ul> <li><b>Quantiles</b> are observations that divide the population into <b>groups of equal size</b></li> <ul> <li>The median divides the population into 2 groups of equal size</li> <li>Quartiles divide the population into 4 groups of equal size</li> <li>There are also terciles, quintiles, deciles, and so on</li> </ul> </ul> -- <ul> <li><b>The interquartile range</b> is the difference between the third and the first quartile: \(\text{IQR} = Q_3 - Q_1\)</li> <ul> <li>Put differently, it corresponds to the bounds of the set which contains the <b>middle half of the distribution</b></li> </ul> </ul> -- <p style = "margin-bottom:.5cm;"> <img src="slides_files/figure-html/unnamed-chunk-67-1.png" width="90%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.3. 
Spread <ul> <li>The <b>variance</b> is a way to quantify how values of a variable tend to deviate from their mean</li> <ul> <li>If values tend to be close to the mean, then the spread is low</li> <li>If values tend to be far from the mean, then the spread is large</li> </ul> </ul> <ul> <li>Because deviations from the mean sum to 0, they have to be squared</li> <ul> <li>This is how the variance is computed: by <b>averaging the squared deviations from the mean</b></li> </ul> </ul> -- `$$\text{Var}(x) = \frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2$$` -- <p style = "margin-bottom:.75cm;"> <ul> <li>The variance is an average of squared deviations, so it is expressed in squared units: we have to take its square root to get back to the same unit as the data</li> <ul> <li>This is what we call the <b>standard deviation</b></li> </ul> </ul> -- <p style = "margin-bottom:.5cm;"> `$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$` --- ### 2. Descriptive statistics #### 2.4. Joint distributions * The joint distribution shows the possible values and associated frequencies for two variables simultaneously * Earlier we plotted the observations of a variable on a line, randomly shifted on the vertical axis -- <p style = "margin-bottom: .75cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-68-1.png" width="80%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.4. Joint distributions * The joint distribution shows the possible values and associated frequencies for two variables simultaneously * Earlier we plotted the observations of a variable on a line, randomly shifted on the vertical axis * Instead of shifting observations randomly, vertical coordinates can indicate the value of a second variable <p style = "margin-bottom: -.5cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-69-1.png" width="80%" style="display: block; margin: auto;" /> --- ### 2. Descriptive statistics #### 2.4. 
Joint distributions * When describing a <b>single distribution</b>, we're interested in its <b>spread and central tendency</b> * When describing a <b>joint distribution</b>, we're interested in the <b>relationship between the two variables</b> * This can be characterized by the covariance -- `$$\text{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$` -- <p style = "margin-bottom:1cm;"></p> <ul> <li>The contribution of observation \(i\) to \(\text{Cov}(x, y)\) is:</li> <ul> <li>Positive when both \(x_i\) and \(y_i\) are above their respective mean</li> <li>Positive when both \(x_i\) and \(y_i\) are below their respective mean</li> <li>Negative when \(x_i\) and \(y_i\) are on different sides of their respective mean</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <center><h4> ➜ <i> If y tends to be large relative to its mean when x is large relative to its mean, their covariance is positive. Conversely, if one tends to be large when the other tends to be low, the covariance is negative.</i></h4></center> --- ### 2. Descriptive statistics #### 2.4. Joint distributions <img src="slides_files/figure-html/unnamed-chunk-70-1.png" width="75%" style="display: block; margin: auto;" /> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>One disadvantage of the <b>covariance</b> is that it is <b>not standardized</b></li> <ul> <li>You cannot directly compare the covariance of two pairs of completely different variables</li> <li>Theoretically the covariance can take values from \(-\infty\) to \(+\infty\)</li> </ul> </ul> --- ### 2. Descriptive statistics #### 2.4. 
Joint distributions <ul> <li>This is why we often use the <b>correlation coefficient</b></li> <ul> <li>It is obtained by dividing the covariance by the product of the standard deviations of the two variables</li> <li>This <b>standardizes the coefficient</b> between -1 and 1</li> </ul> </ul> -- `$$\text{Corr}(x, y) = \frac{\text{Cov}(x, y)}{\text{SD}(x)\times\text{SD}(y)}$$` -- * Consider for instance the following two distributions: .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-71-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <ul> <li>Here the association between the two variables feels tighter on the right panel</li> <ul> <li>But the covariance is larger for the first relationship because units are larger</li> <li>While the correlation, standardized between -1 and 1, is larger for the second one</li> </ul> </ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. The basics of R programming ✔</b></li> <ul style = "list-style: none"> <li>1.1. Types of R objects</li> <li>1.2. The dplyr grammar</li> <li>1.3. Data visualization</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Descriptive statistics ✔</b></li> <ul style = "list-style: none"> <li>2.1. Distributions</li> <li>2.2. Central tendency</li> <li>2.3. Spread</li> <li>2.4. Joint distributions</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. A few words on using R</b></li> <ul style = "list-style: none"> <li>3.1. When it doesn't work the way you want</li> <li>3.2. Where to find help</li> <li>3.3. When it doesn't work at all</li> </ul> </ul> ] --- ### 3. A few words on using R #### 3.1. 
When it doesn't work the way you want <ul> <li>When things do not work the way you want, <b>NA</b>s are the usual suspects</li> <ul> <li>For instance, this is how the mean function reacts to NAs:</li> </ul> </ul> -- .pull-left[ ```r mean(c(1, 2, NA)) ``` ``` ## [1] NA ``` ] -- .pull-right[ ```r mean(c(1, 2, NA), na.rm = TRUE) ``` ``` ## [1] 1.5 ``` ] <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>Here it is obvious that NAs are the problem, but when chaining operations it's not always that transparent</li> <ul> <li>So check your data using <b>is.na()</b> to see whether NAs could mess things up</li> </ul> </ul> ```r is.na(c(1, 2, NA)) ``` ``` ## [1] FALSE FALSE TRUE ``` --- ### 3. A few words on using R #### 3.2. Where to find help <ul> <li>You can find help in <b>help files</b></li> <ul> <li>Sometimes things don't work just because you did not understand the arguments of the function</li> <li>Just enter the name of the function preceded by a <b>?</b> in your console</li> <li>The help file will appear in the Help tab of RStudio</li> </ul> </ul> ```r ?pivot_longer ``` -- <center> <img src = "pivot_longer.png"/> </center> --- ### 3. A few words on using R #### 3.2. Where to find help <ul> <li>When it doesn't work, search on the <b>internet</b></li> <ul> <li><b>Almost every question</b> you might have at this stage has already been asked and <b>answered</b> on <a href="https://stackoverflow.com/">stackoverflow.com</a></li> </ul> </ul> -- <center> <img src = "ask_google.png" width = 750 /> </center> --- ### 3. A few words on using R #### 3.3. 
When it doesn't work at all * Sometimes R breaks and returns an error, which is often cryptic -- ```r read.csv("C:\Users\l.sirugue\Documents\R") ``` ``` ## Error: '\U' used without hex digits in character string starting ""C:\U" ``` -- <p style = "margin-bottom:1cm;"></p> * Try to look for keywords that might help you understand where it comes from * Then paste the message into Google along with the name of your command: chances are many people have already struggled with the same error -- <center> <img src = "error.png" width = 750 /> </center>
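
* For the `'\U'` error shown on this slide, the keyword is the backslash: inside an R string, `\` starts an escape sequence, so `"\U..."` is read as the beginning of a Unicode escape. The sketch below shows three usual ways of writing the same Windows path. The path is the one from the slide, the variable names are only illustrative, and nothing is actually read from disk:

```r
# 1. Forward slashes: accepted by R on Windows
path_forward <- "C:/Users/l.sirugue/Documents/R"

# 2. Doubled backslashes: each "\\" stands for one literal backslash
path_escaped <- "C:\\Users\\l.sirugue\\Documents\\R"

# 3. Raw string (assumes R >= 4.0.0): backslashes are taken literally
path_raw <- r"(C:\Users\l.sirugue\Documents\R)"

# Options 2 and 3 produce exactly the same string
identical(path_escaped, path_raw)

# Replacing each backslash with "/" gives option 1
gsub("\\", "/", path_escaped, fixed = TRUE) == path_forward
```

Forward slashes are probably the simplest choice, since the same string also works on macOS and Linux.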