Basic data manipulation

# Basic data manipulation
## Lecture 3
### <br>Louis SIRUGUE
### CPES 2 - Fall 2022

---

### Quick reminder

#### 1. Distributions

* The **distribution** of a variable documents all its possible values and how frequent they are

* We can describe a distribution with:
  
---

### Quick reminder

#### 1. Distributions

* The **distribution** of a variable documents all its possible values and how frequent they are

* We can describe a distribution with:
  * Its **central tendency**

---

### Quick reminder

#### 1. Distributions

* The **distribution** of a variable documents all its possible values and how frequent they are

* We can describe a distribution with:
  * Its **central tendency**
  * And its **spread**
  
---

### Quick reminder

#### 2. Central tendency

* The **mean** is the sum of all values divided by the number of observations

`$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$`

]

* The **median** is the value that divides the (sorted) distribution into two groups of equal size

`$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\
\frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even}
\end{cases}$$`

]

#### 3. Spread

* The **standard deviation** is square root of the average squared deviation from the mean

`$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$`
 
]

* The **interquartile range** is the difference between the maximum and the minimum value from the middle half of the distribution

$$\text{IQR} = Q_3 - Q_1 $$

]

---

### Quick reminder

#### 4. Inference

<ul> 
  <li>In Statistics, we view variables as a given realization of a <b>data generating process</b></li>
  <ul>
    <li>Hence, the <b>mean</b> is what we call an <b>empirical moment</b>, which is an <b>estimation</b>...</li>
    <li>... of the <b>expected value</b>, the <b>theoretical moment</b> of the DGP we're interested in</li>
  </ul>
</ul>

<ul> 
  <li>To know how confident we can be in this estimation, we need to compute a <b>confidence interval</b></li>
</ul>

`$$[\bar{x} - t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}; \:\bar{x} + t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}]$$`

<ul>
  <ul>
    <li>It gets <b>larger</b> as the <b>variance</b> of the distribution of $x$ increases</li>
    <li>And gets <b>smaller</b> as the <b>sample size</b> $n$ increases</li>
  </ul>
</ul>

---

### Warm up practice

#### 1) Import the `ligue1.csv` dataset and store it in an object called `fb`

#### 2) Create a subset of this dataset containing only matches that took place at 13h

#### 3) Print the number of matches in this subset and compute the average attendance

#### 4) Redo the same exercise on matches that took place at 20h45

<center><h3><i>You've got 5 minutes!</i></h3></center>

---

### Solution

#### 1) Import the `ligue1.csv` dataset and store it in an object called `fb`

```r
fb <- read.csv("C:/User/Documents/ligue1.csv", encoding = "UTF-8")
```

#### 2) Create a subset of this dataset containing only matches that took place at 13h

```r
sub13 <- fb[fb$Time == "13:00", ]
```

#### 3) Print the number of matches in this subset and compute the average attendance

```r
nrow(sub13)
```

```
## [1] 32
```

```r
mean(sub13$Attendance)
```

```
## [1] NA
```

---

### Solution

<ul>
  <li>When there are <b>missing values</b> in a vector, the <b>mean</b> function returns <b>NA</b></li>
  <ul>
    <li>We need to set the <b>na.rm</b> option to <b>TRUE</b></li>
  </ul>
</ul>

#### 3) Print the number of matches in this subset and compute the average attendance

```r
mean(sub13$Attendance, na.rm = T)
```

```
## [1] 19038
```

#### 4) Redo the same exercise on matches that took place at 20h45

```r
sub2045 <- fb[fb$Time == "20:45", ]
nrow(sub2045)
```

```
## [1] 29
```

```r
mean(sub2045$Attendance, na.rm = T)
```

```
## [1] 36418.64
```

---

<h3>Today we learn how to manipulate data</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>1. The dplyr package</b></li>
  <ul style = "list-style: none">
    <li>1.1. Packages</li>
    <li>1.2. Basic functions</li>
    <li>1.3. group_by() and summarise()</li>
  </ul>
</ul>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>2. Merge and reshape</b></li>
  <ul style = "list-style: none">
    <li>2.1. Merge and append data</li>
    <li>2.2. Reshape data</li>
  </ul>
</ul>

]

<ul style = "margin-left:-1cm;list-style: none">
  <li><b>3. A few words on learning R</b></li>
  <ul style = "list-style: none">
    <li>3.1. When it doesn't work the way you want</li>
    <li>3.2. Where to find help</li>
    <li>3.3. When it doesn't work at all</li>
  </ul>
</ul>
 
<p style = "margin-bottom:2.25cm;"></p>

<ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul>
]

---

<h3>Today we learn how to manipulate data</h3>

]

---

### 1. The `dplyr` package

#### 1.1. Packages

<ul>
  <li>So far we only used functions that are directly available in R</li>
  <ul>
    <li>But there are tons of <b>user-created functions</b> out there that can make your life so much easier</li>
    <li>These functions are shared in what we call <b>packages</b></li>
  </ul>
</ul>

<ul>
  <li>Packages are <b>bundles of functions</b> that R users put at the disposal of other R users</li>
  <ul>
    <li>Packages are <b>centralized</b> on the <a href = "https://cran.r-project.org/">Comprehensive R Archive Network (CRAN)</a></li>
    <li>To <b>download</b> and install a CRAN package you can simply use <b>install.packages()</b></li>
  </ul>
</ul>

<ul>
  <li>All the functions of the dplyr grammar are gathered in the <b>dplyr package</b></li>
  <ul>
    <li>We can download these functions and make them ready to use with the install.packages() function</li>
  </ul>
</ul>

```r
install.packages("dplyr") # Requires an internet connection
```

<ul>
  <li>The dplyr package is <b>now installed</b> on your computer</li>
  <ul>
    <li>You won't have to do it again</li>
  </ul>
</ul>
 
---

### 1. The `dplyr` package

#### 1.1. Packages

* The `dplyr` package is now <b>on your computer</b>, but it is <b>not loaded in R</b>

```r
ls("package:dplyr")
```

```
## Error in as.environment(pos): no item called "package:dplyr" on the search list
```

<ul>
  <li>You need to use the <b>library()</b> command to load it</li>
</ul>

```r
library(dplyr)
ls("package:dplyr")[1:5]
```

```
## [1] "%>%"        "across"     "add_count"  "add_count_" "add_row"
```

<ul>
  <li>But even though the package is permanently installed, it is <b>loaded only for your current session</b></li>
  <ul>
    <li>Each time you start a <b>new R session</b>, you'll have to load the packages you need with <b>library()</b></li>
  </ul>
</ul>

---

### 1. The `dplyr` package

#### 1.2. Basic functions

`dplyr` is a **grammar** of data manipulation providing very **user-friendly functions** to handle the most common **data manipulation** tasks:

* `mutate()`: add/modify variables
 * `select()`: keep/drop variables (columns)
 * `filter()`: keep/drop observations (rows)
 * `arrange()`: sort rows according to the values of given variable(s)
 * `summarise()`: aggregate the data into descriptive statistics

<ul style = "margin-left:6cm;">
  <li>A very handy <b>operator</b> to use with the <b>dplyr</b> grammar is the <b>pipe %>%</b></li>
</ul>

<ul style = "margin-left:6cm;">
  <ul>
    <li>You can basically read <b>a %>% b()</b> as <i>"apply function b() to object a"</i></li>
  </ul>
</ul>

<ul style = "margin-left:6cm;">
  <ul>
    <li>With this operator you can easily <b>chain the operations</b> you apply to an object</li>
  </ul>
</ul>

---

### 1. The `dplyr` package

#### 1.2. Basic functions

```r
fb 
                                                    #
                                                    #
                                                    #
                                                    #
                                                    #
                                                    #
```

```text
##     Wk Day       Date  Time          Home  xG Score xG.1          Away Attendance  ...
## 1    1 Fri 2021-08-06 21:00        Monaco 2.0   1–1  0.3        Nantes       7500  ...
## 2    1 Sat 2021-08-07 17:00          Lyon 1.4   1–1  0.8         Brest      29018  ...
## 3    1 Sat 2021-08-07 21:00        Troyes 0.8   1–2  1.2     Paris S-G      15248  ...
## 4    1 Sun 2021-08-08 13:00        Rennes 0.6   1–1  2.0          Lens      22567  ...
## 5    1 Sun 2021-08-08 15:00      Bordeaux 0.7   0–2  3.3 Clermont Foot      18748  ...
## 6    1 Sun 2021-08-08 15:00    Strasbourg 0.4   0–2  0.9        Angers      23250  ...
## 7    1 Sun 2021-08-08 15:00          Nice 0.8   0–0  0.2         Reims      18030  ...
## 8    1 Sun 2021-08-08 15:00 Saint-Étienne 2.1   1–1  1.3       Lorient      20461  ...
## 9    1 Sun 2021-08-08 17:00          Metz 0.7   3–3  1.4         Lille      15551  ...
 ...  ... ...        ...   ...           ... ...   ...  ...           ...        ...  ...                      
```

---

### 1. The `dplyr` package

#### 1.2. Basic functions

```r
fb %>%
  select(Home, xG, Score, xG.1, Away)               # Keep/drop certain columns
                                                    #
                                                    #
                                                    #
                                                    #
                                                    #
```

```text
##              Home  xG Score xG.1          Away
## 1          Monaco 2.0   1–1  0.3        Nantes
## 2            Lyon 1.4   1–1  0.8         Brest
## 3          Troyes 0.8   1–2  1.2     Paris S-G
## 4          Rennes 0.6   1–1  2.0          Lens
## 5        Bordeaux 0.7   0–2  3.3 Clermont Foot
## 6      Strasbourg 0.4   0–2  0.9        Angers
## 7            Nice 0.8   0–0  0.2         Reims
## 8   Saint-Étienne 2.1   1–1  1.3       Lorient
## 9            Metz 0.7   3–3  1.4         Lille
 ...             ... ...   ...  ...           ...
```

---

### 1. The `dplyr` package

#### 1.2. Basic functions

```r
fb %>%
  select(Home, xG, Score, xG.1, Away) %>%           # Keep/drop certain columns
  mutate(home_winner = xG > xG.1)                   # Create a new variable
                                                    #
                                                    #
                                                    #
                                                    #
```

```text
##              Home  xG Score xG.1          Away home_winner
## 1          Monaco 2.0   1–1  0.3        Nantes        TRUE
## 2            Lyon 1.4   1–1  0.8         Brest        TRUE
## 3          Troyes 0.8   1–2  1.2     Paris S-G       FALSE
## 4          Rennes 0.6   1–1  2.0          Lens       FALSE
## 5        Bordeaux 0.7   0–2  3.3 Clermont Foot       FALSE
## 6      Strasbourg 0.4   0–2  0.9        Angers       FALSE
## 7            Nice 0.8   0–0  0.2         Reims        TRUE
## 8   Saint-Étienne 2.1   1–1  1.3       Lorient        TRUE
## 9            Metz 0.7   3–3  1.4         Lille       FALSE
 ...             ... ...   ...  ...           ...         ...
```

---

### 1. The `dplyr` package

#### 1.2. Basic functions

```r
fb %>%
  select(Home, xG, Score, xG.1, Away) %>%           # Keep/drop certain columns
  mutate(home_winner = xG > xG.1) %>%               # Create a new variable
  filter(Home == "Rennes")                          # Keep/drop certain rows 
                                                    #
                                                    #
                                                    #
```

```text
##      Home  xG Score xG.1          Away home_winner
## 1  Rennes 0.6   1–1  2.0          Lens       FALSE
## 2  Rennes 0.9   1–0  0.5        Nantes        TRUE
## 3  Rennes 1.0   0–2  0.5         Reims        TRUE
## 4  Rennes 2.4   6–0  0.3 Clermont Foot        TRUE
## 5  Rennes 0.8   2–0  1.4     Paris S-G       FALSE
## 6  Rennes 1.5   1–0  0.6    Strasbourg        TRUE
## 7  Rennes 3.8   4–1  1.1          Lyon        TRUE
## 8  Rennes 3.1   2–0  0.7   Montpellier        TRUE
## 9  Rennes 0.8   1–2  0.6         Lille        TRUE
         ... ...   ...  ...           ...         ...
```

---

### 1. The `dplyr` package

#### 1.2. Basic functions

```r
fb %>%
  select(Home, xG, Score, xG.1, Away) %>%           # Keep/drop certain columns
  mutate(home_winner = xG > xG.1) %>%               # Create a new variable
  filter(Home == "Rennes") %>%                      # Keep/drop certain rows
  arrange(-xG)                                      # Sort rows
                                                    #
                                                    #
```

```text
##      Home  xG Score xG.1          Away home_winner
## 1  Rennes 3.8   4–1  1.1          Lyon        TRUE
## 2  Rennes 3.3   6–0  0.4      Bordeaux        TRUE
## 3  Rennes 3.3   6–1  0.9          Metz        TRUE
## 4  Rennes 3.1   2–0  0.7   Montpellier        TRUE
## 5  Rennes 2.7   2–0  0.3         Brest        TRUE
## 6  Rennes 2.6   4–1  0.4        Troyes        TRUE
## 7  Rennes 2.4   6–0  0.3 Clermont Foot        TRUE
## 8  Rennes 1.9   2–3  2.9        Monaco       FALSE
## 9  Rennes 1.7   2–0  0.3        Angers        TRUE
         ... ...   ...  ...           ...         ...
```

---

### 1. The `dplyr` package

#### 1.2. Basic functions

```r
fb %>%
  select(Home, xG, Score, xG.1, Away) %>%           # Keep/drop certain columns
  mutate(home_winner = xG > xG.1) %>%               # Create a new variable
  filter(Home == "Rennes") %>%                      # Keep/drop certain rows
  arrange(-xG) %>%                                  # Sort rows
  summarise(expected_wins = mean(home_winner),      # Aggregate into statistics
            expected_goals = sum(xG))               #
```

```
##   expected_wins expected_goals
## 1     0.8421053           36.6
```

---

### 1. The `dplyr` package

#### 1.2. Basic functions

* Here are two very **handy functions** to use within `mutate()`
   
  <p style = "margin-bottom:1cm;"></p>

```r
fb %>%
  select(Home, Attendance) %>%  
  mutate(att_bin = ifelse(Attendance > 10000, 
                          "Large", 
                          "Low")
         ) %>% head()
```

```
##         Home Attendance att_bin
## 1     Monaco       7500     Low
## 2       Lyon      29018   Large
## 3     Troyes      15248   Large
## 4     Rennes      22567   Large
## 5   Bordeaux      18748   Large
## 6 Strasbourg      23250   Large
```
]

```r
fb %>%
  select(Home, xG, xG.1, Away) %>%  
  mutate(xWin = case_when(xG > xG.1 ~ "Home",
                          xG == xG.1 ~ "Draw",
                          xG < xG.1 ~ "Away")
         ) %>% head()
```

```
##         Home  xG xG.1          Away xWin
## 1     Monaco 2.0  0.3        Nantes Home
## 2       Lyon 1.4  0.8         Brest Home
## 3     Troyes 0.8  1.2     Paris S-G Away
## 4     Rennes 0.6  2.0          Lens Away
## 5   Bordeaux 0.7  3.3 Clermont Foot Away
## 6 Strasbourg 0.4  0.9        Angers Away
```
]

---

### 1. The `dplyr` package

#### 1.3. group_by() and summarise()

* With `group_by()` you can perform **computations separately** for the different **categories of a variable**

```r
fb %>%
  select(Wk, Home, xG) %>%
  mutate(all.xG = mean(xG)) %>%
  head(10)
```

```
##    Wk          Home  xG   all.xG
## 1   1        Monaco 2.0 1.473421
## 2   1          Lyon 1.4 1.473421
## 3   1        Troyes 0.8 1.473421
## 4   1        Rennes 0.6 1.473421
## 5   1      Bordeaux 0.7 1.473421
## 6   1    Strasbourg 0.4 1.473421
## 7   1          Nice 0.8 1.473421
## 8   1 Saint-Étienne 2.1 1.473421
## 9   1          Metz 0.7 1.473421
## 10  1   Montpellier 0.5 1.473421
```
]

```r
fb %>%
  select(Wk, Home, xG) %>%
* group_by(Home) %>%
  mutate(home.xG = mean(xG)) %>%
  head(6)
```

```
## # A tibble: 6 x 4
## # Groups:   Home [6]
##      Wk Home          xG home.xG
##   <int> <chr>      <dbl>   <dbl>
## 1     1 Monaco       2      1.69
## 2     1 Lyon         1.4    2.07
## 3     1 Troyes       0.8    1.21
## 4     1 Rennes       0.6    1.93
## 5     1 Bordeaux     0.7    1.23
## 6     1 Strasbourg   0.4    1.73
```
]

---

### 1. The `dplyr` package

#### 1.3. group_by() and summarise()

<ul>
  <li>It is particularly <b>useful with summarise()</b></li>
  <ul>
    <li>summarise keeps the grouping variable</li>
    <li>and computes <b>statistics for each category</b></li>
  </ul>
</ul>

```r
fb %>%
  group_by(Wk) %>%
  summarise(n = n(),
            tot_xG = sum(xG)+sum(xG.1),
            avg_WG = tot_xG/n) %>%
  head(4)
```

```
## # A tibble: 4 x 4
##      Wk     n tot_xG avg_WG
##   <int> <int>  <dbl>  <dbl>
## 1     1    10   23.4   2.34
## 2     2    10   26.6   2.66
## 3     3    10   25.7   2.57
## 4     4    10   30.4   3.04
```

]

<center><b>mutate() $\neq$ summarise()</b></center>

<ul>
  <li><b>mutate()</b> takes an operation that converts:</li>
  <ul>
    <li><b>A vector into another vector</b></li>
  </ul>
  <li><b>summarise()</b> takes an operation that converts:</li>
  <ul>
    <li><b>A vector into a value</b></li>
  </ul>
</ul>
]

<center><b>Ungrouping</b></center>

<ul>
  <li><b>group_by()</b> applies to all subsequent operations</li>
  <li>To cancel its effect you must <b>ungroup()</b> the data</li>
</ul>

```r
fb %>% 
  group_by(Wk) %>% 
  mutate(test = mean(xG)) %>% 
* ungroup() %>%
  ...
```
]

---

### Practice

#### 1) Start from the `fb` dataset and keep only the variables `Home`, `Score` and `Away`

#### 2) Use the `separate()` function from `tidyr` to split the `Score` variable into `home_score` and `away_score`

```r
data.frame(x = "a_b") %>% 
  separate(x, c("x", "y"), "_")
```

```
##   x y
## 1 a b
```

#### 3) Convert these two variables into numeric vectors

#### 4) Create a variable named `winner` that takes the values `Home`, `Draw` and `Away` depending on the score

#### 5) Use `group_by()` and `summarise()` to compute the percentage of draws, home wins and away wins

<center><h3><i>You've got 10 minutes!</i></h3></center>

---

### Solution

#### 1) Start from the `fb` dataset and keep only the variables `Home`, `Score` and `Away`

```r
fb %>% 
* select(Home, Score, Away) %>%
  head(2)
```

```
##     Home Score   Away
## 1 Monaco   1-1 Nantes
## 2   Lyon   1-1  Brest
```

#### 2) Use the `separate()` function from `tidyr` to split the `Score` variable into `home_score` and `away_score`

```r
fb %>% 
  select(Home, Score, Away) %>% 
* separate(Score, c("home_score", "away_score"), "-") %>%
  head(2)
```

```
##     Home home_score away_score   Away
## 1 Monaco          1          1 Nantes
## 2   Lyon          1          1  Brest
```

---

### Solution

#### 3) Convert these two variables into numeric vectors

#### 4) Create a variable named `winner` that takes the values `Home`, `Draw` and `Away` depending on the score

```r
fb %>% 
  select(Home, Score, Away) %>% 
  separate(Score, c("home_score", "away_score"), "-") %>%
* mutate(home_score = as.numeric(home_score),
*        away_score = as.numeric(away_score),
*        winner = case_when(home_score < away_score ~ "Away",
*                           home_score == away_score ~ "Draw",
*                           home_score > away_score ~ "Home")) %>%
  head()
```

```
##         Home home_score away_score          Away winner
## 1     Monaco          1          1        Nantes   Draw
## 2       Lyon          1          1         Brest   Draw
## 3     Troyes          1          2     Paris S-G   Away
## 4     Rennes          1          1          Lens   Draw
## 5   Bordeaux          0          2 Clermont Foot   Away
## 6 Strasbourg          0          2        Angers   Away
```

---

### Solution

#### 5) Use `group_by()` and `summarise() `to compute the percentage of draws, home wins and away wins

```r
fb %>% 
  select(Home, Score, Away) %>% 
  separate(Score, c("home_score", "away_score"), "-") %>%
  mutate(home_score = as.numeric(home_score),
         away_score = as.numeric(away_score),
         winner = case_when(home_score < away_score ~ "Away",
                            home_score == away_score ~ "Draw",
                            home_score > away_score ~ "Home")) %>%
* group_by(winner) %>%
* summarise(pct = 100 * (n() / nrow(fb)))
```

```
## # A tibble: 3 x 2
##   winner   pct
##   <chr>  <dbl>
## 1 Away    30.5
## 2 Draw    26.8
## 3 Home    42.6
```

---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>1. The dplyr package &#10004;</b></li>
  <ul style = "list-style: none">
    <li>1.1. Packages</li>
    <li>1.2. Basic functions</li>
    <li>1.3. group_by() and summarise()</li>
  </ul>
</ul>

]

<ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul>
]

---

<h3>Overview</h3>

---

### 2. Merge and reshape

#### 2.1. Merge and append data

<ul>
  <li>Research projects often imply to <b>combine data</b> from different sources</li>
  <ul>
    <li>Either to <b>add observations</b> (append rows)</li>
    <li>Either to <b>add variables</b> (merge columns)</li>
  </ul>
</ul>

---

### 2. Merge and reshape

#### 2.1. Merge and append data

---

### 2. Merge and reshape

#### 2.1. Merge and append data

---

### 2. Merge and reshape

#### 2.1. Merge and append data: The `bind_rows()` function

--
.pull-left[

```r
read.csv("attainment_FR_UK_US.csv")
```

```
##   country year share_tertiary
## 1     FRA 2015       44.68760
## 2     GBR 2015       49.94341
## 3     USA 2015       46.51771
```
]
--
.pull-right[

```r
read.csv("attainment_IT_SP.csv")
```

```
##   country year share_tertiary
## 1     ITA 2015       25.14996
## 2     ESP 2015       40.95978
```
]

```r
attainment <- read.csv("attainment_FR_UK_US.csv") %>%
  bind_rows(read.csv("attainment_IT_SP.csv"))
attainment
```

```
##   country year share_tertiary
## 1     FRA 2015       44.68760
## 2     GBR 2015       49.94341
## 3     USA 2015       46.51771
## 4     ITA 2015       25.14996
## 5     ESP 2015       40.95978
```

.pull-right[
<p style = "margin-bottom:-4cm;"></p>
**Variables** in the two datasets should be the same:
<p style = "margin-bottom:-.5cm;"></p>
 * **Same name**
 * **Same class**
]

---

### 2. Merge and reshape

#### 2.1. Merge and append data: `*_join()` functions

* Join functions all work the same way:
  * A **dataset A** with a **variable X** and other variables
  * A **dataset B** with a **variable X** and other variables
  * X is the common variable, so datasets will be **joined *by* X**
  
--

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>The 4 main join functions</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Function </th>
   <th style="text-align:left;"> For X in A &amp; B </th>
   <th style="text-align:left;"> For X in A only </th>
   <th style="text-align:left;"> For X in B only </th>
   <th style="text-align:left;"> Summary </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A %&gt;% left_join(B, by = "X") </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Dropped </td>
   <td style="text-align:left;"> Only keeps what's in A </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A %&gt;% right_join(B, by = "X") </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Dropped </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Only keeps what's in B </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A %&gt;% inner_join(B, by = "X") </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Dropped </td>
   <td style="text-align:left;"> Dropped </td>
   <td style="text-align:left;"> Only keeps what's common </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A %&gt;% full_join(B, by = "X") </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Kept </td>
   <td style="text-align:left;"> Keeps everything </td>
  </tr>
</tbody>
</table>

---

### 2. Merge and reshape

<center><h4> &#9888; <i>Beware of NAs!</i> &#9888; </h4></center>

<ul>
  <li>When you have <b>values</b> of X that are <b>not common</b> to both datasets</li>
  <ul>
    <li>Any other join than the inner_join() will <b>generate NAs</b></li>
  </ul>
</ul>

```r
attainment %>% full_join(read.csv("spending.csv"), by = "country")
```

```
##   country year.x share_tertiary year.y share_gdp
## 1     FRA   2015       44.68760   2015     3.398
## 2     GBR   2015       49.94341     NA        NA
## 3     USA   2015       46.51771   2015     3.207
## 4     ITA   2015       25.14996     NA        NA
## 5     ESP   2015       40.95978     NA        NA
## 6     RUS     NA             NA   2015     1.843
```

* Any variable from `A` (`B`) other than those stated in `by= ` will be `NA` for observations that are only in `B` (`A`)

* This holds when a variable that is not mentioned in the `by= ` argument appears in both datasets:
     * In that case, R adds a data-specific suffix to the names and keeps them both
     * The variable from `B` (here `year.y`) will be `NA` for observations that are only in `A` only (here `GBR`, `ITA`, `ESP`)

---

### 2. Merge and reshape

#### 2.1. Merge and append data: example

```r
attainment %>% left_join(read.csv("spending.csv"), by = "country")
```
--

```r
attainment %>% right_join(read.csv("spending.csv"), by = "country")
```
--

```
##   country year.x share_tertiary year.y share_gdp
## 1     FRA   2015       44.68760   2015     3.398
## 2     USA   2015       46.51771   2015     3.207
## 3     RUS     NA             NA   2015     1.843
```
--

<center><h4><i> &#10140; What would be the result of an inner_join() here?</i></h4></center>

---

### 2. Merge and reshape

#### 2.2. Reshape data

<ul>
  <li>It is important to be able to <b>switch from</b> the <b><i>long</i> to</b> the <b><i>wide</i></b> format and conversely</li>
  <ul>
    <li>Some computations should be done in one format or the other</li>
  </ul>
</ul>
 
 <br>
 
--

.pull-left[
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Wide format</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> share_tertiary </th>
   <th style="text-align:right;"> share_gdp </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> FRA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:right;"> 44.69 </td>
   <td style="text-align:right;"> 3.40 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> USA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:right;"> 46.52 </td>
   <td style="text-align:right;"> 3.21 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Long format</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:right;"> Value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> FRA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_tertiary </td>
   <td style="text-align:right;"> 44.69 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_gdp </td>
   <td style="text-align:right;"> 3.40 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> USA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_tertiary </td>
   <td style="text-align:right;"> 46.52 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> USA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_gdp </td>
   <td style="text-align:right;"> 3.21 </td>
  </tr>
</tbody>
</table>
]

---

### 2. Merge and reshape

#### 2.2. Reshape data: From wide to long with pivot_longer()

```r
wide <- attainment %>% 
  inner_join(read.csv("spending.csv") %>% select(-year), 
             by = "country")
wide
```
--

```
##   country year share_tertiary share_gdp
## 1     FRA 2015       44.68760     3.398
## 2     USA 2015       46.51771     3.207
```

<i> &#10140; Pivoting to <b>long format</b> can be seen as putting <b>variables on top of each other</b> rather side to side </i>

* We need to indicate:
  * **Which variables to stack**
  * The **name of** the variable in which we want the **values** of the stacked variables to be stored
  * The **name of** the variable that will indicate to which **variable** corresponds each value

---

### 2. Merge and reshape

#### 2.2. Reshape data: From wide to long with pivot_longer()

```r
long <- wide %>%
               # Which variable to should be stacked
  pivot_longer(c(share_tertiary, share_gdp), 
```
--
<p style = "margin-bottom:-1.01cm;"></p>

```r
               # Where their values should be stored
               values_to = "Value", 
```
--
<p style = "margin-bottom:-1.01cm;"></p>

```r
               # Where to store which variable corresponds each value
               names_to = "Variable")
```
--
<p style = "margin-bottom:-1.01cm;"></p>

```r
long
```

```
## # A tibble: 4 x 4
##   country  year Variable       Value
##   <chr>   <int> <chr>          <dbl>
## 1 FRA      2015 share_tertiary 44.7 
## 2 FRA      2015 share_gdp       3.40
## 3 USA      2015 share_tertiary 46.5 
## 4 USA      2015 share_gdp       3.21
```

---

### 2. Merge and reshape

#### 2.2. Reshape data: From long to wide with pivot_wider()

<ul>
  <li>To <b>pivot in a wide</b> format we need to indicate:</li>
  <ul>
    <li><b>Which variable</b> contains <b>values</b> of the variables we want to put side to side</li>
    <li><b>Which variable</b> indicates which <b>variable</b> correspond to each value</li>
  </ul>
</ul>

```r
wide <- long %>%
              # Where the values are
  pivot_wider(values_from = "Value", 
```
--
<p style = "margin-bottom:-1.01cm;"></p>

```r
              # Where the corresponding variable names are
              names_from = "Variable")
```
--
<p style = "margin-bottom:-1.01cm;"></p>

```r
wide
```

```
## # A tibble: 2 x 4
##   country  year share_tertiary share_gdp
##   <chr>   <int>          <dbl>     <dbl>
## 1 FRA      2015           44.7      3.40
## 2 USA      2015           46.5      3.21
```

---
class: inverse, hide-logo

### Practice

#### 1) From the `fb` dataset, create a variable `league` equal to `"ligue1"` and a variable `season` equal to `"2021-2022"` and save this new data in an object named `full_fb`

#### 2) In data.zip you will find the rest of the data for the seasons 2019-2020 to 2021-2022 for the league 1, the bundesliga and the premier league. Append all these data to `full_fb`. Make sure to create the variables `league` and `season` in each data set before appending.

#### 3) Use the `separate` function from `tidyr` to extract the number of goals scored by the home and away team

#### 4) Convert these variables as numeric and create a variable equal to the sum of the goals from the two teams

#### 5) Summarise your data into the total number of goals score per league/season

#### 6) Reshape your data such that you have 1 row per league and 1 column per season

<center><h3><i>You've got 10 minutes!</i></h3></center>

---
class: inverse, hide-logo

### Solution

#### 1) From the `fb` dataset, create a variable `league` equal to `"ligue1"` and a variable `season` equal to `"2021-2022"` and save this new data in an object named `full_fb`

```r
full_fb <- fb %>% mutate(league = "ligue1", season = "2021-2022")
```

```r
full_fb <- full_fb %>%
  bind_rows(read.csv("ligue1_2021.csv") %>% mutate(league = "ligue1", season = "2020-2021")) %>%
  bind_rows(read.csv("ligue1_1920.csv") %>% mutate(league = "ligue1", season = "2019-2020")) %>%
  bind_rows(read.csv("preml_2122.csv") %>% mutate(league = "preml", season = "2021-2022")) %>%
  bind_rows(read.csv("preml_2021.csv") %>% mutate(league = "preml", season = "2020-2021")) %>%
  bind_rows(read.csv("preml_1920.csv") %>% mutate(league = "preml", season = "2019-2020")) %>%
  bind_rows(read.csv("bundes_2122.csv") %>% mutate(league = "bundes", season = "2021-2022")) %>%
  bind_rows(read.csv("bundes_2021.csv") %>% mutate(league = "bundes", season = "2020-2021")) %>%
  bind_rows(read.csv("bundes_1920.csv") %>% mutate(league = "bundes", season = "2019-2020"))
```

---
class: inverse, hide-logo

### Solution

#### 3) Use the `separate` function from `tidyr` to extract the number of goals scored by the home and away team

```r
full_fb <- full_fb %>%
  separate(Score, c("home_score", "away_score"), "-") 
```

#### 4) Convert these variables as numeric and create a variable equal to the sum of the goals from the two teams

```r
full_fb <- full_fb %>%
  mutate(home_score = as.numeric(home_score),
         away_score = as.numeric(away_score),
         goals = home_score + away_score) 
```

#### 5) Summarise your data into the total number of goals score per league/season

```r
full_fb <- full_fb %>%
  group_by(league, season) %>%
  summarise(goals = sum(goals))
```

---
class: inverse, hide-logo

### Solution

#### 6) Reshape your data such that you have 1 row per league and 1 column per season

```r
full_fb %>%
  pivot_wider(names_from = "season", values_from = "goals")
```

```
## # A tibble: 3 x 4
## # Groups:   league [3]
##   league `2019-2020` `2020-2021` `2021-2022`
##   <chr>        <dbl>       <dbl>       <dbl>
## 1 bundes         982         928         954
## 2 ligue1         704        1049        1067
## 3 preml         1034        1024        1071
```

---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>2. Merge and reshape &#10004;</b></li>
  <ul style = "list-style: none">
    <li>2.1. Merge and append data</li>
    <li>2.2. Reshape data</li>
  </ul>
</ul>

]

<ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul>
]

---

<h3>Overview</h3>

]

---

### 3. A few words on learning R

#### 3.1. When it doesn't work the way you want

<ul>
  <li>When things do not work the way you want, <b>NAs are the usual suspects</b></li>
  <ul>
    <li>For instance, this is how the mean function reacts to NAs:</li>
  </ul>
</ul>
 
--

```r
mean(c(1, 2, NA))
```

```
## [1] NA
```
--

```r
mean(c(1, 2, NA), na.rm = T)
```

```
## [1] 1.5
```

<ul>
  <li>You should systematically <b>check for NAs!</b></li>
</ul>

```r
is.na(c(1, 2, NA))
```

```
## [1] FALSE FALSE  TRUE
```

---

### 3. A few words on learning R

#### 3.1. When it doesn't work the way you want

<ul>
  <li><b>Don't pipe blindfolded!</b></li>
  <ul>
    <li><b>Check</b> that each command does what it's expected to do</li>
    <li>View or print your data <b>at each step</b></li>
  </ul>
</ul>

```r
fb %>% 
  select(Home, Score, Away) %>%
  head(1)
```

```
##     Home Score   Away
## 1 Monaco   1-1 Nantes
```

```r
fb %>% 
  select(Home, Score, Away) %>% 
  separate(Score, c("home_score", "away_score"), "-") %>%
  head(1)
```

```
##     Home home_score away_score   Away
## 1 Monaco          1          1 Nantes
```

---

### 3. A few words on learning R

#### 3.2. Where to find help

<ul>
  <li>Oftentimes things don't work either because:</li>
  <ul>
    <li><b>You don't understand</b> a function's argument</li>
    <li>Or <b>you don't know</b> that there exists an argument that you should use</li>
  </ul>
</ul>

<ul>
  <li>This is precisely what <b>help files</b> are made for</li>
  <ul>
    <li>Every function has a help file, just enter <b>?</b> and the name of your <b>function</b> in the console</li>
    <li>The help file will <b>pop up in the Help tab</b> of R studio</li>
  </ul>
</ul>

```r
?paste
```

---

### 3. A few words on learning R

#### 3.2. Where to find help

<ul>
  <li>Search on the internet!</li>
  <ul>
    <li>Your question is for sure already asked and answered on <a href = "https://stackoverflow.com/">stackoverflow</a></li>
  </ul>
</ul>
 
--

---

### 3. A few words on learning R

#### 3.3. When it doesn't work at all

* Sometimes R breaks and returns an <b>error</b> (usually kind of cryptic)

```r
read.csv("C:\Users\Documents\R")
```

```
## Error: '\U' used without hex digits in character string starting ""C:\U"
```

<ol>
  <li>Look for <b>keywords</b> that might help you understand where it comes from</li>
  <li>Paste in on <b>Google</b> with the name of your command</li>
</ol>

---

<h3>Overview</h3>

]

<ul style = "margin-left:-1cm;list-style: none">
  <li><b>3. A few words on learning R &#10004;</b></li>
  <ul style = "list-style: none">
    <li>3.1. When it doesn't work the way you want</li>
    <li>3.2. Where to find help</li>
    <li>3.3. When it doesn't work at all</li>
  </ul>
</ul>
 
<p style = "margin-bottom:2.25cm;"></p>

<ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul>
]

---

### 4. Wrap up!

#### 1. Packages

```r
library(dplyr)
```

#### 2. Main dplyr functions

.left-column[
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Function </th>
   <th style="text-align:left;"> Meaning </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> mutate() </td>
   <td style="text-align:left;"> Modify or create a variable </td>
  </tr>
  <tr>
   <td style="text-align:left;"> select() </td>
   <td style="text-align:left;"> Keep a subset of variables </td>
  </tr>
  <tr>
   <td style="text-align:left;"> filter() </td>
   <td style="text-align:left;"> Keep a subset of observations </td>
  </tr>
  <tr>
   <td style="text-align:left;"> arrange() </td>
   <td style="text-align:left;"> Sort the data </td>
  </tr>
  <tr>
   <td style="text-align:left;"> group_by() </td>
   <td style="text-align:left;"> Group the data </td>
  </tr>
  <tr>
   <td style="text-align:left;"> summarise() </td>
   <td style="text-align:left;"> Summarizes variables into 1 observation per group </td>
  </tr>
</tbody>
</table>
]

---

### 4. Wrap up!

#### 3. Merge data

```r
a <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
b <- data.frame(x = c(4, 5, 6), y = c("d", "e", "f"))
c <- data.frame(x = 1:6, z = c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot"))
```

```r
a %>% bind_rows(b) %>% left_join(c, by = "x")
```

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:right;"> x </th>
   <th style="text-align:left;"> y </th>
   <th style="text-align:left;"> z </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> alpha </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:left;"> b </td>
   <td style="text-align:left;"> bravo </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:left;"> charlie </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:left;"> delta </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> echo </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:left;"> f </td>
   <td style="text-align:left;"> foxtrot </td>
  </tr>
</tbody>
</table>

---

### 4. Wrap up!

#### 4. Reshape data

```r
data %>% pivot_longer(c(share_tertiary, share_gdp), names_to = "Variable", values_to = "Value")
```

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:right;"> Value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> FRA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_tertiary </td>
   <td style="text-align:right;"> 44.69 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_gdp </td>
   <td style="text-align:right;"> 3.40 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> USA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_tertiary </td>
   <td style="text-align:right;"> 46.52 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> USA </td>
   <td style="text-align:right;"> 2015 </td>
   <td style="text-align:left;"> share_gdp </td>
   <td style="text-align:right;"> 3.21 </td>
  </tr>
</tbody>
</table>

---

### For next time

<center><b><u>Install the R packages needed for Part I of the course:</u></b></center>

<center><b>ggplot2</b></center>
<p style = margin-bottom:.15cm;"></p>
<center><b>rmarkdown</b></center>
<p style = margin-bottom:.15cm;"></p>
<center><b>knitr</b></center>
<p style = margin-bottom:.15cm;"></p>
<center><b>DT</b></center>