# R Programming & Descriptive statistics
## Lecture 17
### <br>Louis SIRUGUE
### CPES 2 - Spring 2023

---

<h3>Today: Refresher on R Programming and Descriptive Statistics</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>1. The basics of R programming</b></li>
  <ul style = "list-style: none">
    <li>1.1. Types of R objects</li>
    <li>1.2. The dplyr grammar</li>
    <li>1.3. Data visualization</li>
  </ul>
</ul>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>2. Descriptive statistics</b></li>
  <ul style = "list-style: none">
    <li>2.1. Distributions</li>
    <li>2.2. Central tendency</li>
    <li>2.3. Spread</li>
    <li>2.4. Joint distributions</li>
  </ul>
</ul>


<ul style = "margin-left:-1cm;list-style: none">
  <li><b>3. A few words on using R</b></li>
  <ul style = "list-style: none">
    <li>3.1. When it doesn't work the way you want</li>
    <li>3.2. Where to find help</li>
    <li>3.3. When it doesn't work at all</li>
  </ul>
</ul>


---

### 1. The basics of R programming

#### 1.1. Types of R objects

* The most <b>basic</b> element in R is just a <b>value</b>, an object of dimension `$1\times1$`:

```r
a <- 1
a
```

```
## [1] 1
```
<p style = "margin-bottom:1cm;">

```r
b <- "monday"
b
```

```
## [1] "monday"
```
<p style = "margin-bottom:1cm;">

```r
c <- a == b
c
```

```
## [1] FALSE
```

---

### 1. The basics of R programming

#### 1.1. Types of R objects

<ul>
  <li>Next, there are <b>vectors</b>, objects of dimension $n\times1$:</li>
  <ul>
    <li>Vectors can be created with <b>c()</b></li>
    <li>The elements of a vector should be of the <b>same class</b></li>
    <li>Class can be changed with <b>as</b> functions: as.[numeric/character/logical]()</li>
  </ul>
</ul>

```r
a <- 1:4
a
```

```
## [1] 1 2 3 4
```
<p style = "margin-bottom:1.5cm;">

```r
a * 2
```

```
## [1] 2 4 6 8
```

.pull-right[
<center><b>Character</b></center>

```r
b <- c("a", "xyz")
b
```

```
## [1] "a"   "xyz"
```
<p style = "margin-bottom:1.5cm;">

```r
paste0("b", b)
```

```
## [1] "ba"   "bxyz"
```
]

```r
c <- c(F, 1 < 2)
c
```

```
## [1] FALSE  TRUE
```
<p style = "margin-bottom:1.5cm;">

```r
!c
```

```
## [1]  TRUE FALSE
```
.pull-right[
<center><b>Factor</b></center>

```r
d <- as.factor(a)
d
```

```
## [1] 1 2 3 4
## Levels: 1 2 3 4
```
<p style = "margin-bottom:.9cm;">

```r
relevel(d, 3)
```

```
## [1] 1 2 3 4
## Levels: 3 1 2 4
```
]
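
The rules about classes can be sketched as follows: mixing classes inside `c()` silently coerces everything to a common class, and the `as.*()` functions convert explicitly. The object names (`mixed`, `num`, `chr`, `lgl`) are made up for this illustration.

```r
# Mixing classes in c() coerces all elements to a common class
mixed <- c(1, "a", TRUE)
mixed
class(mixed)                       # "character"

# Explicit conversion with the as.*() functions
num <- as.numeric(c("1", "2.5"))   # character -> numeric
chr <- as.character(1:3)           # integer   -> character
lgl <- as.logical(c(0, 1, 2))      # numeric   -> logical (0 is FALSE, the rest TRUE)

# Values that cannot be converted become NA (with a warning)
as.numeric("monday")
```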

---

### 1. The basics of R programming

#### 1.1. Types of R objects

* Some useful functions/operators for vectors

```r
vec <- c("a", "b", "c", "d")
```

--
<p style = "margin-bottom:1cm;">

```r
length(vec)
```

```
## [1] 4
```

--
<p style = "margin-bottom:.8cm;">

```r
match("b", vec)
```

```
## [1] 2
```

--
<p style = "margin-bottom:.8cm;">

```r
vec[3]
```

```
## [1] "c"
```

---

### 1. The basics of R programming

#### 1.1. Types of R objects

<ul>
  <li>Finally, there are <b>tables</b>, objects of dimension $n\times m$:</li>
  <ul>
    <li>Gather $m$ vectors (columns) of $n$  observations</li>
    <li>Several possible classes, e.g., <b>tibble()</b> from tidyverse</li>
  </ul>
</ul>

```r
library(tidyverse)

data <- tibble(name = c("Bob", "Tom", "Kim"),
               age = c(43, 19, 27),
               male = c(T, T, F))
```

```r
data
```

```
## # A tibble: 3 x 3
##   name    age male 
##   <chr> <dbl> <lgl>
## 1 Bob      43 TRUE 
## 2 Tom      19 TRUE 
## 3 Kim      27 FALSE
```

---

### 1. The basics of R programming

#### 1.1. Types of R objects

<ul>
  <li>Such datasets can be imported into R with <b>read functions</b></li>
  <ul>
    <li>There is one read function per <b>data format</b> (csv, xls, dta, ...)</li>
    <li>The main argument is the <b>path</b>, with slashes: "C:/User/.../data.csv"</li>
  </ul>
</ul>

```r
data <- read.csv("data/cereals.csv") # Import csv data
data <- as_tibble(data)              # Put in tibble format
head(data, 5)                        # Print first 5 rows 
```

```
## # A tibble: 5 x 16
##   name       mfr   type  calories protein   fat sodium fiber carbo sugars potass
##   <chr>      <chr> <chr>    <int>   <int> <int>  <int> <dbl> <dbl>  <int>  <int>
## 1 100% Bran  N     C           70       4     1    130    10     5      6    280
## 2 100% Natu~ Q     C          120       3     5     15     2     8      8    135
## 3 All-Bran   K     C           70       4     1    260     9     7      5    320
## 4 All-Bran ~ K     C           50       4     0    140    14     8      0    330
## 5 Almond De~ R     C          110       2     2    200     1    14      8     -1
## # ... with 5 more variables: vitamins <dbl>, shelf <int>, weight <dbl>,
## #   cups <dbl>, rating <dbl>
```

---

### 1. The basics of R programming

#### 1.2. The dplyr grammar

* `dplyr` provides **useful functions** to manipulate data and the **pipe operator (%>%)** to chain operations

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption><b>Important functions of the dplyr grammar</b></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Function </th>
   <th style="text-align:left;"> Meaning </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> mutate() </td>
   <td style="text-align:left;"> Modify or create a variable </td>
  </tr>
  <tr>
   <td style="text-align:left;"> select() </td>
   <td style="text-align:left;"> Keep a subset of variables </td>
  </tr>
  <tr>
   <td style="text-align:left;"> filter() </td>
   <td style="text-align:left;"> Keep a subset of observations </td>
  </tr>
  <tr>
   <td style="text-align:left;"> arrange() </td>
   <td style="text-align:left;"> Sort the data </td>
  </tr>
  <tr>
   <td style="text-align:left;"> group_by() </td>
   <td style="text-align:left;"> Group the data </td>
  </tr>
  <tr>
   <td style="text-align:left;"> summarise() </td>
   <td style="text-align:left;"> Summarise variables into one observation per group </td>
  </tr>
  <tr>
   <td style="text-align:left;"> left/right/inner/full_join() </td>
   <td style="text-align:left;"> Merge data </td>
  </tr>
</tbody>
</table>

---

### 1. The basics of R programming

#### 1.2. The dplyr grammar

<ul>
  <li>We can first subset the data:</li>
  <ul>
    <li>The <b>variable</b> type only takes the value "C", so we can remove it with <b>select()</b></li>
    <li>Some <b>observations</b> have negative values of potassium, so we can remove them with <b>filter()</b></li>
    <li>These two operations can be <b>chained</b> using the pipe operator <b>%>%</b></li>
  </ul>
</ul>

```r
dim(data) # Dimensions of the data before the operation
```

```
## [1] 77 16
```

```r
data <- data %>% 
  select(-type) %>% 
  filter(potass >= 0)
```

```r
dim(data) # Dimensions of the data after the operation
```

```
## [1] 75 15
```
 
---

### 1. The basics of R programming

#### 1.2. The dplyr grammar

<ul>
  <li>The <b>mutate()</b> function allows you to modify and create variables</li>
  <ul>
    <li>Using simple <b>vector operations</b></li>
    <li>With <b>ifelse()</b> to create a binary variable based on a condition</li>
    <li>With <b>case_when()</b> to create a categorical variable</li>
  </ul>
</ul>

```r
data <- data %>% 
  mutate(cal_100g = 100 * (calories / weight),     # vector operation
         low_cal = ifelse(cal_100g < 100, T, F),   # binary variable
         mfr = case_when(mfr == "N"           ~ "Nestlé",   # categorical variable
                         mfr == "Q"           ~ "Quaker Oats",
                         mfr == "K"           ~ "Kellogg's",
                         mfr %in% c("G", "R") ~ "General Mills",
                         mfr == "P"           ~ "Post Consumer Brands LLC",
                         mfr == "A"           ~ "Maltex Co."))
```
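
One detail worth checking when using `case_when()`: values that match none of the conditions become `NA`. The sketch below illustrates this on a made-up three-row table; the code `"X"`, the column `cal_per_weight`, and the object `toy` are assumptions for the example and do not come from the cereals data.

```r
library(dplyr)

toy <- tibble(mfr      = c("N", "G", "X"),   # "X" is a made-up code
              calories = c(70, 110, 100),
              weight   = c(1, 1, 1.33))

toy <- toy %>%
  mutate(cal_per_weight = calories / weight,
         low_cal = ifelse(cal_per_weight < 100, TRUE, FALSE),
         maker = case_when(mfr == "N"           ~ "Nestlé",
                           mfr %in% c("G", "R") ~ "General Mills"))

toy$maker   # the unmatched code "X" gives NA
```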

---

### 1. The basics of R programming

#### 1.2. The dplyr grammar

* Such computations can also be done <b>separately</b> for each value of a variable <b>with group_by()</b>

```r
data <- data %>% 
  group_by(mfr) %>% 
  mutate(n_brands = n()) %>% 
  ungroup()
```

<ul>
  <li>Using <b>summarise()</b> instead of mutate() allows you to:</li>
  <ul>
    <li>Keep only the grouping and summarized variables</li>
    <li>Keep one value per group (no duplicate row)</li>
  </ul>
</ul>

```r
data %>% 
  group_by(mfr) %>% 
  summarise(n_brands = n())
```

```
## # A tibble: 6 x 2
##   mfr                      n_brands
##   <chr>                       <int>
## 1 General Mills                  29
## 2 Kellogg's                      23
## 3 Maltex Co.                      1
## 4 Nestlé                          5
## 5 Post Consumer Brands LLC        9
## 6 Quaker Oats                     8
```
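
The difference between the two functions can be seen on a small made-up table (the names `toy`, `with_mutate`, and `with_summarise` are only for this sketch): `mutate()` keeps every row, while `summarise()` returns one row per group.

```r
library(dplyr)

toy <- tibble(group = c("a", "a", "b"),
              value = c(1, 3, 10))

# mutate() keeps all rows and repeats the group statistic on each of them
with_mutate <- toy %>%
  group_by(group) %>%
  mutate(group_mean = mean(value)) %>%
  ungroup()

# summarise() collapses the data to one row per group
with_summarise <- toy %>%
  group_by(group) %>%
  summarise(group_mean = mean(value), n = n())

dim(with_mutate)      # 3 rows, 3 columns
dim(with_summarise)   # 2 rows, 3 columns
```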
---

### 1. The basics of R programming

#### 1.2. The dplyr grammar

* `dplyr` also provides functions to:

<ul><ul><li>Rename variables &#10140; <b>rename()</b></li></ul></ul>

```r
data <- data %>% rename(manufacturer = mfr)
```

<ul><ul><li>Sort rows according to the values of one or several variables &#10140; <b>arrange()</b></li></ul></ul>

```r
data <- data %>% arrange(cal_100g)
```

<ul><ul><li>Join another dataset on a common variable &#10140; <b>[left/right/full/inner]_join()</b>:</li></ul></ul>

```r
data <- data %>% 
  left_join(tibble(manufacturer = c("Kellogg's", "Nestlé", "General Mills", 
                                    "Post Consumer Brands LLC", "Quaker Oats", "Maltex Co."),
                   creation = c(1906, 1966, 1928, 1895, 1877, 1899)),
            by = "manufacturer")
```
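
The join variants differ in which rows they keep. A minimal sketch on two made-up tables (`cereals_toy` and `makers_toy`, with a fictional "Unknown Co." that has no match) shows the effect on the number of rows:

```r
library(dplyr)

cereals_toy <- tibble(name = c("Cereal 1", "Cereal 2", "Cereal 3"),
                      manufacturer = c("Kellogg's", "Nestlé", "Unknown Co."))
makers_toy  <- tibble(manufacturer = c("Kellogg's", "Nestlé", "Quaker Oats"),
                      creation = c(1906, 1966, 1877))

# left_join: all rows of the left table, NA where there is no match
left  <- left_join(cereals_toy, makers_toy, by = "manufacturer")
# inner_join: only rows with a match in both tables
inner <- inner_join(cereals_toy, makers_toy, by = "manufacturer")
# full_join: all rows of both tables
full  <- full_join(cereals_toy, makers_toy, by = "manufacturer")

c(left = nrow(left), inner = nrow(inner), full = nrow(full))
```

Comparing the number of rows before and after a join is a quick way to spot unmatched keys.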

---

### 1. The basics of R programming

#### 1.3. Data visualization

* The tidyverse packages also give access to the <b>ggplot</b> grammar for data visualization

<ul>
  <li>The core arguments of the ggplot() function are the following</li>
  <ul>
    <li><b>Data</b>: the values to plot</li>
    <li><b>Mapping</b> (aes, for aesthetics): the structure of the plot</li>
    <li><b>Geometry</b>: the type of plot</li>
  </ul>
</ul>

<ul>
  <li>These arguments should be specified as follows:</li>
  <ul>
    <li>Data and mapping should be specified within the parentheses</li>
    <li>The geometry and any other element should be added with a <b>+</b> sign</li>
  </ul>
</ul>

```r
ggplot(data, aes) + geometry + anything_else
```

* You can also apply the `ggplot()` function to your data with a pipe:

```r
data %>% ggplot(., aes) + geometry
```

---

### 1. The basics of R programming

#### 1.3. Data visualization

```r
test_data <- tibble(V1 = 1:6, 
                    V2 = c(64, 60, 16, 8, 16, 32))
ggplot(test_data, aes(x = V1, y = V2)) + geom_point(size = 3)
```

* We first specified our data: 
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
<tbody>
  <tr>
   <td style="text-align:left;"> V1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> V2 </td>
   <td style="text-align:right;"> 64 </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 32 </td>
  </tr>
</tbody>
</table>

* Then assigned `V1` to the x-axis and `V2` to the y-axis with `aes()` 
 
 * And chose the `point` geometry with a size of 3


.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />
]

---

### 1. The basics of R programming

#### 1.3. Data visualization

<ul>
  <li>In some cases you may want to convey information by means other than a position on an axis</li>
  <ul>
    <li>For instance with the color, size, or shape of a geometry</li>
    <li>This is useful when, say, the observations belong to two groups</li>
  </ul>
</ul>

```r
test_data <- test_data %>% mutate(Group = paste("Group", c(1, 1, 2, 2, 2, 2)))
```

.pull-left[
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:right;"> V1 </th>
   <th style="text-align:right;"> V2 </th>
   <th style="text-align:left;"> Group </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 64 </td>
   <td style="text-align:left;"> Group 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:left;"> Group 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:left;"> Group 2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:left;"> Group 2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:left;"> Group 2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 32 </td>
   <td style="text-align:left;"> Group 2 </td>
  </tr>
</tbody>
</table>
]

* Just as we assigned the two numeric variables to the x and y axes with aes, we have to assign the group variable to the 'color axis' with aes

```r
ggplot(test_data, aes(x = V1, y = V2, 
                      color = Group)) + ...
```

* But there is no actual 'color axis' on the plot, which is why a legend is generated instead


---

### 1. The basics of R programming

#### 1.3. Data visualization

<b>&#10140; Case 1:</b> The style does not depend on the value of a variable

<ul>
  <li>The style element should be <b>uniform</b> across all data points</li>
  <ul>
    <li>So it should be specified <b>within the geometry</b> function</li>
  </ul>
</ul>

```r
ggplot(test_data, aes(x = V1, y = V2)) + 
  geom_point(color = "red", shape = 18)
```

<b>&#10140; Case 2:</b> The style element depends on the value of a variable

<ul>
  <li>The style should <b>depend on the value of the variable</b> it has been assigned to in <b>aes</b></li>
  <ul>
    <li>So just as for regular axes, modifications should take place in a scale function</li>
  </ul>
</ul>

```r
ggplot(test_data, aes(x = V1, y = V2, color = Group)) + 
  scale_color_manual(name = "Group:", values = c("red", "blue"))  + 
  geom_point(shape = 18)
```

---

### Practice

#### 1) Import the dataset `cereals.csv`

#### 2) There is no documentation on the variable `rating`. Use the summary() function to deduce the unit of the variable based on its distribution.

#### 3) Generate a scatter plot with  `sugars` on the `x` axis and `rating` on the `y` axis to deduce whether the rating was made by nutritionists or consumers

<center><h3><i>You've got 10 minutes!</i></h3></center>

---

### Solution

#### 1) Import the dataset `cereals.csv`

```r
cereals <- read.csv("C:/User/Documents/cereals.csv")
```

#### 2) There is no documentation on the variable `rating`. Use the summary() function to deduce the unit of the variable based on its distribution.

```r
summary(cereals$rating)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.04   33.17   40.40   42.67   50.83   93.70
```

<center>The variable is probably in percentages</center>

---

### Solution

#### 3) Generate a scatter plot with `sugars` on the `x` axis and `rating` on the `y` axis to deduce whether the rating was made by nutritionists or consumers

```r
ggplot(cereals, aes(x = sugars, y = rating)) +
  geom_point(alpha = .8)
```

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="80%" style="display: block; margin: auto;" />
]

.right-column[
<p style = "margin-bottom:4cm;"></p>
<center>The rating was probably made by nutritionists</center>
]
---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>1. The basics of R programming &#10004;</b></li>
  <ul style = "list-style: none">
    <li>1.1. Types of R objects</li>
    <li>1.2. The dplyr grammar</li>
    <li>1.3. Data visualization</li>
  </ul>
</ul>

---

### 2. Descriptive statistics

#### 2.1. Distributions

* The point of <b>descriptive statistics</b> is to <b>summarize variables</b> into a small set of tractable statistics.
* The most comprehensive way to characterize a variable is to compute its distribution:
  * What are the values the variable takes?
  * How frequently does each of these values appear?

<b> &#10140; Consider for instance the following variable:</b>

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Variable 1</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
</tbody>
</table>

* We can count how many times each value appears

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
<tbody>
  <tr>
   <td style="text-align:left;"> Variable 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> n </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
</tbody>
</table>


* And we can represent this distribution graphically with a bar plot
  * Each possible value on the x-axis
  * Their number of occurrences on the y-axis
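
In R, the counting and plotting steps described above could look like the sketch below. The vector `variable_1` reproduces the values of Variable 1; the object names `counts` and `p` are choices made for this illustration.

```r
library(ggplot2)

variable_1 <- c(3, 5, 4, 6, 5, 4, 5, 7, 7, 6, 1, 7, 6, 7, 6,
                4, 7, 7, 6, 6, 5, 6, 6, 3, 4, 5, 2, 6, 8, 8)

# Count how many times each value appears
counts <- table(variable_1)
counts

# Bar plot: one bar per value, height equal to the number of occurrences
p <- ggplot(data.frame(x = variable_1), aes(x = x)) +
  geom_bar()
p
```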

---

### 2. Descriptive statistics

#### 2.1. Distributions

* But what if we would like to do the same thing for the following variable?

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Variable 2</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 5.912877 </td>
   <td style="text-align:right;"> 5.006781 </td>
   <td style="text-align:right;"> 5.517149 </td>
   <td style="text-align:right;"> 5.854849 </td>
   <td style="text-align:right;"> 5.177872 </td>
   <td style="text-align:right;"> 3.815240 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.666582 </td>
   <td style="text-align:right;"> 4.422721 </td>
   <td style="text-align:right;"> 6.025062 </td>
   <td style="text-align:right;"> 5.411020 </td>
   <td style="text-align:right;"> 5.889811 </td>
   <td style="text-align:right;"> 6.729103 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4.160800 </td>
   <td style="text-align:right;"> 6.519049 </td>
   <td style="text-align:right;"> 6.849172 </td>
   <td style="text-align:right;"> 8.368158 </td>
   <td style="text-align:right;"> 6.167404 </td>
   <td style="text-align:right;"> 2.882974 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6.751888 </td>
   <td style="text-align:right;"> 3.202183 </td>
   <td style="text-align:right;"> 6.390224 </td>
   <td style="text-align:right;"> 3.942039 </td>
   <td style="text-align:right;"> 6.488909 </td>
   <td style="text-align:right;"> 8.195647 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 7.073922 </td>
   <td style="text-align:right;"> 4.790039 </td>
   <td style="text-align:right;"> 5.297919 </td>
   <td style="text-align:right;"> 1.218109 </td>
   <td style="text-align:right;"> 5.754213 </td>
   <td style="text-align:right;"> 7.225030 </td>
  </tr>
</tbody>
</table>

* Each value appears only once
  * So the count of each value does not help summarize the variable

<center><h4>&#10140;<i>We should use a histogram instead</i></h4></center>

---

### 2. Descriptive statistics

#### 2.1. Distributions

* Consider for instance the following variable. For clarity each point is shifted vertically by a random amount.
  * We can divide the domain of this variable into 5 bins
  * And count the number of observations within each bin
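
The binning logic can be sketched in R using the values of Variable 2. Here `cut()` splits the range into 5 bins of equal width and `table()` counts the observations in each; `geom_histogram(bins = 5)` draws the corresponding plot, although ggplot2 may place the bin edges slightly differently than `cut()`. The object names are assumptions for this example.

```r
library(ggplot2)

variable_2 <- c(5.912877, 5.006781, 5.517149, 5.854849, 5.177872, 3.815240,
                1.666582, 4.422721, 6.025062, 5.411020, 5.889811, 6.729103,
                4.160800, 6.519049, 6.849172, 8.368158, 6.167404, 2.882974,
                6.751888, 3.202183, 6.390224, 3.942039, 6.488909, 8.195647,
                7.073922, 4.790039, 5.297919, 1.218109, 5.754213, 7.225030)

# Divide the domain into 5 bins of equal width and count observations per bin
bins <- cut(variable_2, breaks = 5)
bin_counts <- table(bins)
bin_counts

# Histogram with 5 bins
p <- ggplot(data.frame(x = variable_2), aes(x = x)) +
  geom_histogram(bins = 5)
p
```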

---

### 2. Descriptive statistics

#### 2.1. Distributions

<img src="slides_files/figure-html/unnamed-chunk-58-1.png" width="78%" style="display: block; margin: auto;" />
 
---

### 2. Descriptive statistics

#### 2.1. Distributions

<ul>
  <li>There's no definitive rule to choose the number of bins</li>
  <ul>
    <li>But too many or too few can yield misleading histograms</li>
  </ul>
</ul>

<ul>
  <li>Densities are often used instead of histograms</li>
  <ul>
    <li>Both are based on the same principle, but densities are continuous</li>
  </ul>
</ul>

---

### 2. Descriptive statistics

#### 2.1. Distributions

* Distributions are comprehensive representations but not simple statistics

* How to summarize these distributions with simple statistics?
  * By describing their central tendency (e.g., mean, median)
  * And their spread (e.g., standard deviation, inter-quartile range)

---

### 2. Descriptive statistics

#### 2.2. Central tendency

<ul>
  <li>The mean is the most common statistic to describe central tendencies</li>
  <ul>
    <li>Take for instance the grades of group 2 last year for the second-semester final project</li>
  </ul>
</ul>
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Grades of G2 last year</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 17.5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 16.0 </td>
   <td style="text-align:right;"> 14.5 </td>
   <td style="text-align:right;"> 19.5 </td>
   <td style="text-align:right;"> 18.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 17.5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 14.5 </td>
   <td style="text-align:right;"> 19.5 </td>
   <td style="text-align:right;"> 18.5 </td>
   <td style="text-align:right;"> 18.5 </td>
  </tr>
</tbody>
</table>

* The mean is simply the sum of all the grades divided by the number of grades:

`$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$`

`$$\frac{20 + 20 + 17.5 + 17.5 + 16 + 16 + 16 + 14.5 + 14.5 + 19.5 + 19.5 + 18.5 + 18.5 + 18.5}{14} = 17.61$$`

---

### 2. Descriptive statistics

#### 2.2. Central tendency

* It can also be expressed as the average of each possible value weighted by its number of occurrences:
 
 <p style = "margin-bottom:.5cm;">
 
`$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$`

`$$\bar{x} = \frac{(2 \times 20) + (2 \times 17.5) + (3 \times 16) + (2 \times 14.5) + (2 \times 19.5) + (3 \times 18.5)}{2 + 2 + 3 + 2 + 2 + 3} = \frac{246.5}{14} = 17.61$$`
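
Both ways of writing the mean can be checked in R. The sketch below uses the grades from the table above; the names `grades`, `values`, and `counts` are chosen for the illustration.

```r
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)

# Sum of all grades divided by the number of grades
manual_mean <- sum(grades) / length(grades)
manual_mean
mean(grades)          # built-in equivalent

# Same result as an average of distinct values weighted by their counts
values <- c(20, 17.5, 16, 14.5, 19.5, 18.5)
counts <- c(2, 2, 3, 2, 2, 3)
weighted <- sum(counts * values) / sum(counts)
weighted
weighted.mean(values, w = counts)   # built-in equivalent
```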

---

### 2. Descriptive statistics

#### 2.2. Central tendency

* To obtain the median you first need to sort the values:
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Grades of G2 last year</caption>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:left;"> 7 </td>
   <td style="text-align:left;"> 8 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 10 </td>
   <td style="text-align:left;"> 11 </td>
   <td style="text-align:left;"> 12 </td>
   <td style="text-align:left;"> 13 </td>
   <td style="text-align:left;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 14.5 </td>
   <td style="text-align:left;"> 14.5 </td>
   <td style="text-align:left;"> 16 </td>
   <td style="text-align:left;"> 16 </td>
   <td style="text-align:left;"> 16 </td>
   <td style="text-align:left;"> 17.5 </td>
   <td style="text-align:left;"> 17.5 </td>
   <td style="text-align:left;"> 18.5 </td>
   <td style="text-align:left;"> 18.5 </td>
   <td style="text-align:left;"> 18.5 </td>
   <td style="text-align:left;"> 19.5 </td>
   <td style="text-align:left;"> 19.5 </td>
   <td style="text-align:left;"> 20 </td>
   <td style="text-align:left;"> 20 </td>
  </tr>
</tbody>
</table>

<ul>
  <li>The median is the value that divides the distribution into two halves</li>
  <ul>
    <li>With N even: Average of the last value of the first half and the first value of the second half</li>
  </ul>
</ul>
 
--

* As we have 14 observations, here the median is the average of the 7<sup>th</sup> and the 8<sup>th</sup> observations:

`$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\
\frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even}
\end{cases} = \frac{17.5 + 18.5}{2} = 18$$`
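
The same computation can be sketched in R by sorting the grades and applying the formula by hand, then comparing with the built-in `median()`. The names `sorted` and `manual_median` are assumptions for this example.

```r
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)

sorted <- sort(grades)
n <- length(sorted)

# Middle value if N is odd, average of the two middle values if N is even
manual_median <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]
} else {
  (sorted[n / 2] + sorted[n / 2 + 1]) / 2
}

manual_median     # 18
median(grades)    # 18
```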

---

### 2. Descriptive statistics

#### 2.3. Spread

<ul>
  <li>The most intuitive statistic to describe the spread of a variable is probably its <b>range</b></li>
  <ul>
    <li>The interval between the minimum and the maximum value of the distribution</li>
  </ul>
</ul>
 
--

* But consider the following two distributions:

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-66-1.png" width="100%" style="display: block; margin: auto;" />
]

<ul>
  <li>In the presence of outliers or very skewed distributions, the full range of a variable may not be representative of what we mean by 'spread'</li>
</ul>
 
<ul>
  <li>That's why we tend to prefer:</li>
  <ul>
    <li>The <b>interquartile range</b></li>
    <li>The <b>standard deviation</b></li>
  </ul>
</ul>
 

---

### 2. Descriptive statistics

#### 2.3. Spread

<ul>
  <li><b>Quantiles</b> are observations that divide the population into <b>groups of equal size</b></li>
  <ul>
    <li>The median divides the population into 2 groups of equal size</li>
    <li>Quartiles divide the population into 4 groups of equal size</li>
    <li>There are also terciles, quintiles, deciles, and so on</li>
  </ul>
</ul>

<ul>
  <li><b>The interquartile range</b> is the difference between the third and the first quartile: $\text{IQR} = Q_3 - Q_1$</li>
  <ul>
    <li>Put differently, it is the width of the interval that contains the <b>middle half of the distribution</b></li>
 </ul>
</ul>
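
In R, quartiles and the interquartile range can be obtained with `quantile()` and `IQR()`. The sketch below reuses the grades from section 2.2 as an illustration. Note that with few observations the exact quartile values depend on the interpolation convention: `quantile()` supports several through its `type` argument, and the default is used here.

```r
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)

# First quartile, median, and third quartile
q <- quantile(grades, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: Q3 - Q1
iqr_manual <- unname(q[3] - q[1])
iqr_manual
IQR(grades)   # built-in equivalent
```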

---

### 2. Descriptive statistics

#### 2.3. Spread

<ul>
  <li>The <b>variance</b> is a way to quantify how values of a variable tend to deviate from their mean</li>
  <ul>
    <li>If values tend to be close to the mean, then the spread is low</li>
    <li>If values tend to be far from the mean, then the spread is large</li>
  </ul>
</ul>

<ul>
  <li>Because deviations from the mean sum to 0, they have to be squared</li>
  <ul>
    <li>This is how the variance is computed: by <b>averaging the squared deviations from the mean</b></li>
  </ul>
</ul>
  
--

`$$\text{Var}(x) = \frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2$$`

<ul>
  <li>The variance is expressed in squared units, so we take its square root to get back to the same unit as the data</li>
  <ul>
    <li>This is what we call the <b>standard deviation</b></li>
  </ul>
</ul>

`$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$`
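
One practical caveat when moving to R: the built-in `var()` and `sd()` divide by `N - 1` (the sample versions) rather than by `N` as in the formulas above. The sketch below computes both on the grades from section 2.2 and shows how they relate; `pop_var` and `pop_sd` are names made up for this example.

```r
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)
n <- length(grades)

# Variance as defined above: average of squared deviations from the mean
pop_var <- mean((grades - mean(grades))^2)
pop_sd  <- sqrt(pop_var)

# Built-in functions divide by N - 1 instead of N
var(grades)
sd(grades)

# The two definitions differ by a factor (N - 1) / N
all.equal(var(grades) * (n - 1) / n, pop_var)
```

With large samples the two versions are nearly identical; the difference only matters when `N` is small.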

---

### 2. Descriptive statistics

#### 2.4. Joint distributions

* The joint distribution shows the possible values and associated frequencies for two variables simultaneously
  * Earlier we plotted the observations of a variable on a line, randomly shifted on the vertical axis
  * Instead of shifting observations randomly, vertical coordinates can indicate the value of a second variable

---

### 2. Descriptive statistics

#### 2.4. Joint distributions

* When describing a <b>single distribution</b>, we're interested in its <b>spread and central tendency</b>
 * When describing a <b>joint distribution</b>, we're interested in the <b>relationship between the two variables</b>
  * This can be characterized by the covariance
 
--

`$$\text{Cov}(x, y) = \frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})(y_i-\bar{y})$$`

<ul>
  <li>The contribution of observation $i$ to $\text{Cov}(x, y)$ is:</li>
  <ul>
    <li>Positive when both $x_i$ and $y_i$ are above their respective mean</li>
    <li>Positive when both $x_i$ and $y_i$ are below their respective mean</li>
    <li>Negative when $x_i$ and $y_i$ are on different sides of their respective mean</li>
  </ul>
</ul>

<p style = "margin-bottom:1cm;"></p>
 
<center><h4>  &#10140; <i> If y tends to be large relative to its mean when x is large relative to its mean, their covariance is positive. Conversely, if one tends to be large when the other tends to be low, the covariance is negative.</i></h4></center>
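
As a worked illustration, the covariance formula can be applied by hand to the two variables of `test_data` from section 1.3 (here stored in plain vectors `x` and `y`, names chosen for this sketch). As with the variance, R's built-in `cov()` divides by `N - 1` rather than `N`.

```r
x <- 1:6
y <- c(64, 60, 16, 8, 16, 32)
n <- length(x)

# Contribution of each observation: product of the two deviations from the mean
contributions <- (x - mean(x)) * (y - mean(y))
contributions

# Covariance as defined above: average of the contributions
pop_cov <- mean(contributions)
pop_cov          # -25: y tends to be low when x is high

# Built-in version, which divides by N - 1
cov(x, y)
all.equal(cov(x, y) * (n - 1) / n, pop_cov)
```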

---

### 2. Descriptive statistics

#### 2.4. Joint distributions

<ul>
  <li>One disadvantage of the <b>covariance</b> is that it is <b>not standardized</b></li>
  <ul>
    <li>You cannot directly compare the covariance of two pairs of completely different variables</li>
    <li>Theoretically the covariance can take values from $-\infty$ to $+\infty$</li>
  </ul>
</ul>

---

### 2. Descriptive statistics

#### 2.4. Joint distributions

<ul>
  <li>This is why we often use the <b>correlation coefficient</b></li>
  <ul>
    <li>It is obtained by dividing the covariance by the product of the standard deviations of the two variables</li>
    <li>This <b>standardizes the coefficient</b> between -1 and 1</li>
  </ul>
</ul>

`$$\text{Corr}(x, y) = \frac{\text{Cov}(x, y)}{\text{SD}(x)\times\text{SD}(y)}$$`

* Consider for instance the following two distributions:
 
.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-71-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<ul>
  <li>Here the association between the two variables looks tighter in the right panel</li>
  <ul>
    <li>But the covariance is larger for the first relationship because its units are larger</li>
    <li>While the correlation, standardized between -1 and 1, is larger for the second one</li>
  </ul>
</ul>
]
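
The effect of units can be checked numerically. In the sketch below (the vectors `x`, `y`, and `x_rescaled` are made up for the illustration), multiplying one variable by 1000 multiplies the covariance by 1000 but leaves the correlation unchanged.

```r
x <- 1:6
y <- c(64, 60, 16, 8, 16, 32)
x_rescaled <- 1000 * x    # same variable expressed in a smaller unit

# The covariance depends on the units of the variables
cov(x, y)
cov(x_rescaled, y)

# The correlation does not
cor(x, y)
cor(x_rescaled, y)

# Correlation computed from its definition
manual_cor <- cov(x, y) / (sd(x) * sd(y))
manual_cor
```

Because the same divisor appears in the covariance and in both standard deviations, the correlation is identical whether they are computed with `N` or `N - 1`.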

---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>2. Descriptive statistics &#10004;</b></li>
  <ul style = "list-style: none">
    <li>2.1. Distributions</li>
    <li>2.2. Central tendency</li>
    <li>2.3. Spread</li>
    <li>2.4. Joint distributions</li>
  </ul>
</ul>

---

### 3. A few words on using R

#### 3.1. When it doesn't work the way you want

<ul>
  <li>When things do not work the way you want, <b>NA</b>s are the usual suspects</li>
  <ul>
    <li>For instance, this is how the mean function reacts to NAs:</li>
  </ul>
</ul>
 
--

```r
mean(c(1, 2, NA))
```

```
## [1] NA
```

```r
mean(c(1, 2, NA), na.rm = T)
```

```
## [1] 1.5
```

<ul>
  <li>Here it is obvious that NAs are the problem, but when chaining operations it's not always that transparent</li>
  <ul>
    <li>So check your data using <b>is.na()</b> to see whether NAs could mess things up</li>
  </ul>
</ul>

```r
is.na(c(1, 2, NA))
```

```
## [1] FALSE FALSE  TRUE
```
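
A few additional checks can help locate missing values before they propagate through a chain of operations. This is a minimal sketch on made-up data (`x` and `df` are hypothetical objects):

```r
x <- c(1, 2, NA)

# Comparisons with NA return NA, so use is.na() rather than == NA
NA == NA

sum(is.na(x))    # number of missing values
anyNA(x)         # is there at least one?

# Missing values per column of a table
df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, NA, "z"))
na_per_column <- colSums(is.na(df))
na_per_column

mean(df$a, na.rm = TRUE)   # mean computed on non-missing values only
```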

---

### 3. A few words on using R

#### 3.2. Where to find help

<ul>
  <li>You can find help in <b>help files</b></li>
  <ul>
    <li>Sometimes things don't work simply because you misunderstood the arguments of the function</li>
    <li>Just enter the name of the function preceded by a <b>?</b> in your console</li>
    <li>The help file will appear in the Help tab of RStudio</li>
  </ul>
</ul>

```r
?pivot_longer
```

---

### 3. A few words on using R

#### 3.2. Where to find help

<ul>
  <li>When it doesn't work, search on the <b>internet</b></li>
  <ul>
    <li><b>Almost every question</b> you might have at this stage has already been asked and <b>answered</b> on <a href="https://stackoverflow.com/">stackoverflow.com</a></li>
  </ul>
</ul>
 
---

### 3. A few words on using R

#### 3.3. When it doesn't work at all

* Sometimes R breaks and returns an error, which is usually kind of cryptic

```r
read.csv("C:\Users\l.sirugue\Documents\R")
```

```
## Error: '\U' used without hex digits in character string starting ""C:\U"
```

* Try to look for keywords that might help you understand where it comes from
* Then paste the message into Google along with the name of your command: chances are that many people have already struggled with the same error
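
For the error shown above, the cause is that a backslash starts an escape sequence inside an R string, so `\U` is read as the start of a Unicode code rather than as part of the path. A sketch of three common ways to write the same Windows path (the raw-string syntax assumes R 4.0.0 or later; the object names are made up for this example):

```r
# 1. Forward slashes, as recommended in section 1.1
path_forward <- "C:/Users/l.sirugue/Documents/R"

# 2. Escaped (doubled) backslashes
path_escaped <- "C:\\Users\\l.sirugue\\Documents\\R"

# 3. Raw string, where backslashes are taken literally (R >= 4.0.0)
path_raw <- r"(C:\Users\l.sirugue\Documents\R)"

# The last two are the same string
identical(path_escaped, path_raw)

# Replacing each backslash by a forward slash gives the first one
identical(chartr("\\", "/", path_escaped), path_forward)
```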