Descriptive statistics

# Descriptive statistics
## Lecture 2
### <br>Louis SIRUGUE
### CPES 2 - Fall 2022

---

### Quick reminder

#### 1. Import data

```r
fb <- read.csv("C:/User/Documents/ligue1.csv", encoding = "UTF-8")
```

#### 2. Class

```r
is.numeric("1.6180339") # What would be the output?
```

```
## [1] FALSE
```

#### 3. Subsetting

```r
fb$Home[3]
```

```
## [1] "Troyes"
```

---

<h3>Today we learn how to describe data</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>1. Distributions</b></li>
  <ul style = "list-style: none">
    <li>1.1. Definition</li>
    <li>1.2. Graphical representation</li>
    <li>1.3. Common distributions</li>
  </ul>
</ul>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>2. Central tendency</b></li>
  <ul style = "list-style: none">
    <li>2.1. Mean</li>
    <li>2.2. Median</li>
    <li>2.3. Mean vs. median</li>
  </ul>
</ul>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>3. Spread</b></li>
  <ul style = "list-style: none">
    <li>3.1. Range, quantiles, and the IQR</li>
    <li>3.2. Variance and standard deviation</li>
    <li>3.3. Standard deviation vs. IQR</li>
  </ul>
</ul>


<ul style = "margin-left:-1cm;list-style: none">
  <li><b>4. Inference</b></li>
  <ul style = "list-style: none">
    <li>4.1. Data generating process</li>
    <li>4.2. Empirical vs. theoretical moments</li>
    <li>4.3. Confidence interval</li>
  </ul>
</ul>
 
<p style = "margin-bottom:1cm;"></p>

<ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul>

---

### 1. Distributions

#### 1.1. Definition

* The point of descriptive statistics is to **summarize a big table** of values with a small set of **tractable statistics**  
 * The most comprehensive way to characterize a variable/vector is to compute its **distribution**:
  * **What** are the **values** the variable takes?
  * **How frequently** does each of these values appear?

<b> &#10140; Consider for instance the following variable:</b>

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Variable 1</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
</tbody>
</table>

* We can count how many times each value appears

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
<tbody>
  <tr>
   <td style="text-align:left;"> Variable 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> n </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
</tbody>
</table>


* And we can represent this distribution graphically with a bar plot
  * Each possible value on the x-axis
  * Their number of occurrences on the y-axis
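The counting step and the bar plot can be reproduced in a few lines of R. This is only a sketch, not the code used to build the slides: the name `variable_1` is an assumption, and the values are copied from the table above.

```r
# Values copied from the "Variable 1" table (the name variable_1 is hypothetical)
variable_1 <- c(3, 5, 4, 6, 5, 4, 5, 7, 7, 6, 1, 7, 6, 7, 6,
                4, 7, 7, 6, 6, 5, 6, 6, 3, 4, 5, 2, 6, 8, 8)

# Count how many times each value appears
counts <- table(variable_1)
counts

# Bar plot: possible values on the x-axis, number of occurrences on the y-axis
barplot(counts, xlab = "Value", ylab = "Number of occurrences")
```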

---

### 1. Distributions

#### 1.2. Graphical representation

* But what if we would like to do the same thing for the following variable?

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Variable 2</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 5.912877 </td>
   <td style="text-align:right;"> 5.006781 </td>
   <td style="text-align:right;"> 5.517149 </td>
   <td style="text-align:right;"> 5.854849 </td>
   <td style="text-align:right;"> 5.177872 </td>
   <td style="text-align:right;"> 3.815240 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.666582 </td>
   <td style="text-align:right;"> 4.422721 </td>
   <td style="text-align:right;"> 6.025062 </td>
   <td style="text-align:right;"> 5.411020 </td>
   <td style="text-align:right;"> 5.889811 </td>
   <td style="text-align:right;"> 6.729103 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4.160800 </td>
   <td style="text-align:right;"> 6.519049 </td>
   <td style="text-align:right;"> 6.849172 </td>
   <td style="text-align:right;"> 8.368158 </td>
   <td style="text-align:right;"> 6.167404 </td>
   <td style="text-align:right;"> 2.882974 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6.751888 </td>
   <td style="text-align:right;"> 3.202183 </td>
   <td style="text-align:right;"> 6.390224 </td>
   <td style="text-align:right;"> 3.942039 </td>
   <td style="text-align:right;"> 6.488909 </td>
   <td style="text-align:right;"> 8.195647 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 7.073922 </td>
   <td style="text-align:right;"> 4.790039 </td>
   <td style="text-align:right;"> 5.297919 </td>
   <td style="text-align:right;"> 1.218109 </td>
   <td style="text-align:right;"> 5.754213 </td>
   <td style="text-align:right;"> 7.225030 </td>
  </tr>
</tbody>
</table>

* Each value appears only once
  * So counting how often each value appears does not help summarize the variable

<center><h4>  &#10140; <i> Let's have a look at the corresponding bar plot </i></h4></center>

---

### 1. Distributions

#### 1.2. Graphical representation

<ul>
  <li>It does not look good for this variable because it is continuous, while the first one was discrete</li>
  <ul>
    <li><b>Discrete variables:</b> variables that can take a finite (or, in practice, a sufficiently small) number of values, e.g., number of siblings, eye color, ...</li>
    <li><b>Continuous variables:</b> variables that can take an infinite (or, in practice, a sufficiently large) number of values, e.g., annual income, height in centimeters, ...</li>
  </ul>
</ul>

<center><i>&#10140; In practice some variables can be difficult to classify. For instance, <b>age (in years)</b> can be viewed <b>as a discrete</b> variable because it can take a finite set of values, but this set being possibly quite wide, one could also view it <b>as a continuous variable</b>. It often depends on the context.</i></center>

<p style = "margin-bottom:1.25cm;"></p>
  
--

<ul>
<li> One way to get a sense of the <b>distribution</b> of a <b>continuous variable</b> is to draw a <b>histogram</b></li>
  <ul>
    <li>Instead of taking each value separately, group them into <i>bins</i> and show how many values fall into each bin</li>
    <li>The bar plots we've seen so far are basically histograms with the number of bins being equal to the number of possible values</li>
  </ul>
</ul>

---

### 1. Distributions

#### 1.2. Graphical representation

* Consider for instance the following variable. For clarity each point is shifted vertically by a random amount

  * We can divide the domain of this variable into 5 bins
  * And count the number of observations within each bin

---

### 1. Distributions

#### 1.2. Graphical representation

<img src="slides_files/figure-html/unnamed-chunk-15-1.png" width="78%" style="display: block; margin: auto;" />
 
---

### 1. Distributions

#### 1.2. Graphical representation

<ul>
  <li>There's no definitive rule to choose the number of bins</li>
  <ul>
    <li>But too many or too few can yield misleading histograms</li>
  </ul>
</ul>

<center><h4> &#10140; <i>  Note that choosing the number of bins is equivalent to choosing the width of each bin </i></h4></center>
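To see how the choice of bins changes the picture, here is a minimal sketch using the values of Variable 2 shown earlier. The name `variable_2` and the two bin widths are assumptions chosen for illustration.

```r
# Values copied from the "Variable 2" table (the name variable_2 is hypothetical)
variable_2 <- c(5.912877, 5.006781, 5.517149, 5.854849, 5.177872, 3.815240,
                1.666582, 4.422721, 6.025062, 5.411020, 5.889811, 6.729103,
                4.160800, 6.519049, 6.849172, 8.368158, 6.167404, 2.882974,
                6.751888, 3.202183, 6.390224, 3.942039, 6.488909, 8.195647,
                7.073922, 4.790039, 5.297919, 1.218109, 5.754213, 7.225030)

# Explicit break points: four bins of width 2 between 1 and 9
h_wide <- hist(variable_2, breaks = seq(1, 9, by = 2), main = "Bin width = 2")
h_wide$counts  # number of observations falling into each bin

# Narrower bins (width 0.5): more detail, but a noisier histogram
h_narrow <- hist(variable_2, breaks = seq(1, 9, by = 0.5), main = "Bin width = 0.5")
```

Passing a single number to `breaks` (for example `breaks = 5`) is only treated as a suggestion by `hist()`, which is why the break points are spelled out here.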

---

### 1. Distributions

#### 1.2. Graphical representation

<ul>
  <li><b>Densities</b> are often used instead of <b>histograms</b></li>
  <ul>
    <li>Both are based on the <b>same principle</b>, but densities are <b>continuous</b></li>
  </ul>
</ul>

<ul>
  <li>We won't learn how to derive it in this course but the idea is the same</li>
  <ul>
    <li>The <b>higher the value</b> on the y-axis, the <b>more observations</b> there are around the corresponding x location</li>
  </ul>
</ul>

<ul>
  <li>The <b>smoothness</b> of the density can be tuned with the <b>bandwidth</b></li>
  <ul>
    <li>The larger the smoother</li>
  </ul>
</ul>

--
 
<img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="83%" style="display: block; margin: auto;" />
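A density can be drawn with base R's `density()` function. The sketch below uses simulated data (not the variable from the figure) and an arbitrary seed; the `adjust` argument multiplies the default bandwidth, so a larger value gives a smoother curve.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
x <- rnorm(200, mean = 5, sd = 2)   # simulated data, assumed for illustration

d_default <- density(x)             # default bandwidth
d_smooth  <- density(x, adjust = 3) # bandwidth multiplied by 3: smoother curve

plot(d_default, main = "Density with two bandwidths")
lines(d_smooth, lty = 2)            # dashed line: larger bandwidth

# Bandwidths actually used by the two estimates
c(default = d_default$bw, smooth = d_smooth$bw)
```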

---

### 1. Distributions

#### 1.3. Common distributions: Normal distribution

---

### 1. Distributions

#### 1.3. Common distributions: Log-normal distribution

---

### 1. Distributions

#### 1.3. Common distributions: Uniform distribution
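The three distributions named on these slides can be simulated with base R's random number generators. This is a sketch under assumed parameters (standard normal, log-normal with `meanlog = 0` and `sdlog = 1`, uniform on [0, 1]); the slides' own figures may use different ones.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
n <- 1000

normal_draws    <- rnorm(n, mean = 0, sd = 1)          # symmetric, bell-shaped
lognormal_draws <- rlnorm(n, meanlog = 0, sdlog = 1)   # positive, long right tail
uniform_draws   <- runif(n, min = 0, max = 1)          # flat between min and max

# One histogram per distribution, side by side
par(mfrow = c(1, 3))
hist(normal_draws, main = "Normal")
hist(lognormal_draws, main = "Log-normal")
hist(uniform_draws, main = "Uniform")
par(mfrow = c(1, 1))
```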

---

### 1. Distributions

#### 1.3. Common distributions: Summarizing distributions

* How to **summarize** these distributions with simple statistics?

  * By describing their **central tendency** (e.g., mean, median)
  * And their **spread** (e.g., standard deviation, inter-quartile range)

---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>1. Distributions &#10004;</b></li>
  <ul style = "list-style: none">
    <li>1.1. Definition</li>
    <li>1.2. Graphical representation</li>
    <li>1.3. Common distributions</li>
  </ul>
</ul>


<ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul>

---

### 2. Central tendency

#### 2.1. Mean

<ul>
  <li>The mean is the most common statistic to describe central tendencies</li>
  <ul>
    <li>Take for instance the grades I gave to the final projects in spring 2021:</li>
  </ul>
</ul>
 
--
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Grades I gave in spring 2021</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 17.5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 16.0 </td>
   <td style="text-align:right;"> 14.5 </td>
   <td style="text-align:right;"> 19.5 </td>
   <td style="text-align:right;"> 18.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 17.5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 14.5 </td>
   <td style="text-align:right;"> 19.5 </td>
   <td style="text-align:right;"> 18.5 </td>
   <td style="text-align:right;"> 18.5 </td>
  </tr>
</tbody>
</table>

* The mean is simply the sum of all the grades divided by the number of grades:

`$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$`

`$$\frac{20 + 20 + 17.5 + 17.5 + 16 + 16 + 16 + 14.5 + 14.5 + 19.5 + 19.5 + 18.5 + 18.5 + 18.5}{14} = 17.61$$`
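The same computation can be checked in R. The vector name `grades` is an assumption; the values are the 14 grades from the table.

```r
# Grades copied from the table (the name grades is hypothetical)
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)

# Sum of all the grades divided by the number of grades
manual_mean <- sum(grades) / length(grades)

# Same result with the dedicated function
c(manual = manual_mean, builtin = mean(grades))
```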

---

### 2. Central tendency

#### 2.1. Mean

<ul>
  <li>The mean is the most common statistic to describe central tendencies</li>
  <ul>
    <li>Take for instance the grades I gave to the final projects in spring 2021:</li>
  </ul>
</ul>
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Grades I gave in spring 2021</caption>
<tbody>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 17.5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 16.0 </td>
   <td style="text-align:right;"> 14.5 </td>
   <td style="text-align:right;"> 19.5 </td>
   <td style="text-align:right;"> 18.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 17.5 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 14.5 </td>
   <td style="text-align:right;"> 19.5 </td>
   <td style="text-align:right;"> 18.5 </td>
   <td style="text-align:right;"> 18.5 </td>
  </tr>
</tbody>
</table>

* Note that it can also be expressed as the sum of each value weighted by its proportion in the distribution
 
 <p style = "margin-bottom:1.5cm;"></p>

`$$\bar{x} = \frac{2}{14} \times 20 + \frac{2}{14} \times 17.5 + \frac{3}{14} \times 16 + \frac{2}{14} \times 14.5 + \frac{2}{14} \times 19.5 + \frac{3}{14} \times 18.5 = 17.61$$`
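A short sketch of the weighted form, assuming the grades are stored in a vector called `grades` (a hypothetical name): `table()` counts each distinct value and `prop.table()` turns the counts into proportions.

```r
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)

# Proportion of each distinct grade in the distribution
proportions <- prop.table(table(grades))
proportions

# Distinct values, recovered from the names of the table
values <- as.numeric(names(proportions))

# Sum of each value weighted by its proportion
weighted_mean <- sum(values * as.vector(proportions))
weighted_mean
```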

---

### 2. Central tendency

#### 2.2. Median

* To obtain the median you first need to **sort the values**:

--
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Grades I gave in spring 2021</caption>
<tbody>
  <tr>
   <td style="text-align:left;text-align: center;"> 1 </td>
   <td style="text-align:left;text-align: center;"> 2 </td>
   <td style="text-align:left;text-align: center;"> 3 </td>
   <td style="text-align:left;text-align: center;"> 4 </td>
   <td style="text-align:left;text-align: center;"> 5 </td>
   <td style="text-align:left;text-align: center;"> 6 </td>
   <td style="text-align:left;text-align: center;"> 7 </td>
   <td style="text-align:left;text-align: center;"> 8 </td>
   <td style="text-align:left;text-align: center;"> 9 </td>
   <td style="text-align:left;text-align: center;"> 10 </td>
   <td style="text-align:left;text-align: center;"> 11 </td>
   <td style="text-align:left;text-align: center;"> 12 </td>
   <td style="text-align:left;text-align: center;"> 13 </td>
   <td style="text-align:left;text-align: center;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:left;text-align: center;"> 14.5 </td>
   <td style="text-align:left;text-align: center;"> 14.5 </td>
   <td style="text-align:left;text-align: center;"> 16 </td>
   <td style="text-align:left;text-align: center;"> 16 </td>
   <td style="text-align:left;text-align: center;"> 16 </td>
   <td style="text-align:left;text-align: center;"> 17.5 </td>
   <td style="text-align:left;text-align: center;"> 17.5 </td>
   <td style="text-align:left;text-align: center;"> 18.5 </td>
   <td style="text-align:left;text-align: center;"> 18.5 </td>
   <td style="text-align:left;text-align: center;"> 18.5 </td>
   <td style="text-align:left;text-align: center;"> 19.5 </td>
   <td style="text-align:left;text-align: center;"> 19.5 </td>
   <td style="text-align:left;text-align: center;"> 20 </td>
   <td style="text-align:left;text-align: center;"> 20 </td>
  </tr>
</tbody>
</table>

* The median is the value that **divides** the distribution into **two halves**
 * When there is an even number of observations, the median is the average of the last value of the first half and the first value of the second half
 
--
 
As we have 14 observations, the median is the average of the 7<sup>th</sup> and the 8<sup>th</sup> observations:

`$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\
\frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even}
\end{cases}$$`

`$$\text{Med}(x) = \frac{x[7] + x[8]}{2} = \frac{17.5 + 18.5}{2} = 18$$`
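The sorting and averaging steps can be written out by hand and compared with `median()`. A minimal sketch, assuming the grades are stored in a vector named `grades`:

```r
grades <- c(20, 17.5, 16, 16, 14.5, 19.5, 18.5,
            20, 17.5, 16, 14.5, 19.5, 18.5, 18.5)

sorted <- sort(grades)   # step 1: sort the values
n <- length(sorted)

# Step 2: middle value if n is odd, average of the two middle values if n is even
manual_median <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]
} else {
  (sorted[n / 2] + sorted[n / 2 + 1]) / 2
}

c(manual = manual_median, builtin = median(grades))
```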

---

### 2. Central tendency

#### 2.3. Mean vs. median: relative magnitude

* The **relative magnitude** of the mean and the median depends on the **skewness of the distribution**:
  * The **mean is typically larger** than the median if the distribution is **right-skewed**
  * The mean and the median are **equal** if the distribution is **symmetric**
  * The **mean is typically smaller** than the median if the distribution is **left-skewed**
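These three cases can be illustrated on simulated data. The distributions and the seed below are assumptions chosen for the example: a log-normal sample for the right-skewed case, its mirror image for the left-skewed case, and a normal sample for the symmetric case.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

right_skewed <- rlnorm(10000)    # long right tail
symmetric    <- rnorm(10000)     # symmetric around 0
left_skewed  <- -rlnorm(10000)   # mirror image: long left tail

# Mean minus median: positive, close to zero, and negative respectively
gaps <- c(right = mean(right_skewed) - median(right_skewed),
          symmetric = mean(symmetric) - median(symmetric),
          left = mean(left_skewed) - median(left_skewed))
gaps
```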

---

### 2. Central tendency

#### 2.3. Mean vs. median: robustness

* The **median** is **less sensitive** than the mean to thick tails and outliers
 * For this reason we say that the median is a ***robust statistic***
 
--

<center><h4><i>Let's illustrate that with a small example!</i></h4></center>

.pull-left[
 
 * Consider the following variable:
 
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
<tbody>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:right;"> -3 </td>
   <td style="text-align:right;"> -2 </td>
   <td style="text-align:right;"> -2 </td>
   <td style="text-align:right;"> -1 </td>
   <td style="text-align:right;"> -1 </td>
   <td style="text-align:right;"> -1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>

* How would the mean and the median **react** if we were to **add one single observation**?

- We can plot the value of the additional observation on the `$x$` axis and the value of the mean and the median on the `$y$` axis
]

<img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="95%" style="display: block; margin: auto;" />
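The experiment behind this plot can be sketched as follows. The set of additional observations (`extra`) is an assumption chosen for illustration; the base variable is the one from the table.

```r
x <- c(-3, -2, -2, -1, -1, -1, 0, 1, 1, 1, 2, 2, 3)

# Hypothetical values for the single additional observation
extra <- c(0, 10, 100, 1000)

# Mean and median of the variable once each extra value is appended
with_extra <- sapply(extra, function(v) {
  y <- c(x, v)
  c(mean = mean(y), median = median(y))
})
colnames(with_extra) <- extra
with_extra
```

The mean keeps growing with the size of the additional observation, while the median stops moving once the new value lies above the middle of the data.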

---

### 2. Central tendency

#### 2.3. Mean vs. median: in R

* Both statistics have **dedicated R functions**
 
--

```r
variable <- c(1, 2, 4, 8, 12)
c(mean(variable), median(variable))
```

```
## [1] 5.4 4.0
```

* As always, you should **pay attention to NAs** when using these functions

```r
mean(c(1, 2, 3, 4, NA))
```

```
## [1] NA
```

```r
mean(c(1, 2, 3, 4, NA), na.rm = TRUE)
```

```
## [1] 2.5
```

---

### 2. Central tendency

#### 2.3. Mean vs. median: with binary variable

<ul>
  <li>A <b>binary variable</b> is a variable that can take only <b>two values</b> <i>(e.g., male/female, accepted/rejected)</i></li>
  <ul>
    <li>Any binary variable can be expressed as a sequence of <b>0s and 1s</b></li>
  </ul>
</ul>

* Consider the following binary variable of length 4: 0, 1, 1, 1

- The **mean** of a binary variable is equal to the **proportion of 1s**:
 
`$$\frac{0 + 1 + 1 + 1}{4} = \frac{3}{4} = 75\%$$`

- The **median** of a binary variable is equal to the **mode** *(mode = most frequent value of a variable)*, unless 0s and 1s are equally frequent

`$$\frac{1 + 1}{2} = 1$$`
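A quick check in R, using the same four values. The variable names are assumptions; the last lines show that a logical vector behaves the same way because `TRUE` counts as 1 and `FALSE` as 0.

```r
binary <- c(0, 1, 1, 1)

mean(binary)     # share of 1s
median(binary)   # most frequent value here

# Same idea with a logical variable (hypothetical accepted/rejected outcome)
accepted <- c(FALSE, TRUE, TRUE, TRUE)
mean(accepted)
```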

---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>2. Central tendency &#10004;</b></li>
  <ul style = "list-style: none">
    <li>2.1. Mean</li>
    <li>2.2. Median</li>
    <li>2.3. Mean vs. median</li>
  </ul>
</ul>


<ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul>

---

### 3. Spread

#### 3.1. Range, quantiles, and the IQR

<ul>
  <li>The <b>most intuitive</b> statistic to describe the spread of a variable is probably</li>
  <ul>
    <li><b>The range: the minimum and maximum value it can take</b></li>
  </ul>
</ul>

<p style = "margin-bottom: 1.25cm;"></p>
 
 * But consider the following two distributions:

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" />
]

* In the presence of outliers or very skewed distributions, the <b>full range</b> of a variable <b>may not be representative</b> of what we mean by *'spread'*
 
 * That's why we tend to prefer **inter-quantile** ranges 
 

---

### 3. Spread

#### 3.1. Range, quantiles, and the IQR

<ul>
  <li><b>Quantiles</b> are observations that <b>divide</b> the population into <b>groups of equal size</b></li>
  <ul>
    <li>The <b>median</b> divides the population into <b>2 groups</b> of equal size</li>
    <li><b>Quartiles</b> divide the population into <b>4 groups</b> of equal size</li>
    <li>There are also <b>terciles</b>, <b>quintiles</b>, <b>deciles</b>, and so on</li>
  </ul>
</ul>

<ul>
  <li>One way to <b>compute quartiles</b>: divide the ordered variable according to the median</li>
    <ul>
      <li>The lower quartile value is the median of the lower half of the data</li>
      <li>The upper quartile value is the median of the upper half of the data</li>
      <li><i>If there is an odd number of data points in the original ordered data set, don't include the median in either half</i></li>
  </ul>
</ul>
 
--

.pull-left[
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
<tbody>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:right;"> -3 </td>
   <td style="text-align:right;"> -2 </td>
   <td style="text-align:right;"> -1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>

`$$Q_1 = -2,\:\:Q_2 = 0,\:\:Q_3 = 2$$`
]

.pull-right[
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
<tbody>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:right;"> -3 </td>
   <td style="text-align:right;"> -2 </td>
   <td style="text-align:right;"> -1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>

`$$Q_1 = -1.5,\:\:Q_2 = 0,\:\:Q_3 = 1.5$$`
]
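The median-of-halves rule described above can be sketched as a small function. The name `quartiles_by_halves` is hypothetical and this is not how `quantile()` computes its default output.

```r
# Quartiles as the medians of the lower and upper halves of the sorted data
# (with an odd number of points, the overall median is left out of both halves)
quartiles_by_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- n %/% 2                  # size of each half
  lower <- x[seq_len(half)]        # first half
  upper <- x[(n - half + 1):n]     # last half
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

odd_case  <- quartiles_by_halves(c(-3, -2, -1, 0, 1, 2, 3))
even_case <- quartiles_by_halves(c(-3, -2, -1, 0, 0, 1, 2, 3))
odd_case
even_case
```

By default `quantile()` interpolates between observations instead, so `quantile(c(-3, -2, -1, 0, 1, 2, 3), 0.25)` gives -1.5 rather than -2.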

---

### 3. Spread

#### 3.1. Range, quantiles, and the IQR

* The **interquartile range** is the difference between the third and the first quartile: `$\text{IQR} = Q_3 - Q_1$`
 
--
 
 * Put differently, it is the **width** of the interval that contains the **middle half** of the distribution
 
--

---

### 3. Spread

#### 3.2. Variance and standard deviation

<ul>
  <li>The <b>variance</b> is a way to quantify how the values of a variable tend to <b>deviate</b> from their <b>mean</b></li>
  <ul>
    <li>If values tend to be <b>close to the mean</b>, then the <b>spread is low</b></li>
    <li>If values tend to be <b>far from the mean</b>, then the <b>spread is large</b></li>
  </ul>
</ul>
  
--

* Can we just take the **average deviation** from the mean?

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:center;"> x </th>
   <th style="text-align:center;"> mean(x) </th>
   <th style="text-align:center;"> x - mean(x) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 2.5 </td>
   <td style="text-align:center;"> -1.5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> 2.5 </td>
   <td style="text-align:center;"> 1.5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> -3 </td>
   <td style="text-align:center;"> 2.5 </td>
   <td style="text-align:center;"> -5.5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 8 </td>
   <td style="text-align:center;"> 2.5 </td>
   <td style="text-align:center;"> 5.5 </td>
  </tr>
</tbody>
</table>

<br>

* By construction it would **always be 0**: deviations above and below the mean cancel out
 
  * But we can use the **absolute value** of each deviation: `$|x_i-\bar{x}|$`
  
  * Or their **square**: `$(x_i-\bar{x})^2$`
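The table above can be reproduced in R to see the three options side by side. A minimal sketch using the same four values:

```r
x <- c(1, 4, -3, 8)
deviations <- x - mean(x)   # -1.5, 1.5, -5.5, 5.5

mean(deviations)        # plain average: positive and negative deviations cancel
mean(abs(deviations))   # average absolute deviation
mean(deviations^2)      # average squared deviation
```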
  

---

### 3. Spread

#### 3.2. Variance and standard deviation

* This is how the **variance** is computed: by **averaging the squared deviations from the mean**
 
--

`$$\text{Var}(x) = \frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2$$`

* Because the **variance** is an **average of squared deviations**, it is expressed in squared units and can get **quite big** compared to other statistics like the mean, the median, or the interquartile range
  * To express the spread in the **same unit** as the data, we can take the **square root** of the variance, which is called the **standard deviation**
  * In a way, *the standard deviation is to the mean what the IQR is to the median*

`$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$`
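One practical point when checking these formulas in R: the built-in `var()` and `sd()` divide by `N - 1` rather than by `N`, so they differ slightly from the formulas above on small samples. A sketch with an illustrative vector:

```r
x <- c(1, 4, -3, 8)
n <- length(x)

# Variance and standard deviation as defined above (dividing by N)
var_n <- mean((x - mean(x))^2)
sd_n  <- sqrt(var_n)

# Built-in functions divide by N - 1 instead
c(var_n = var_n, var_builtin = var(x))
c(sd_n = sd_n, sd_builtin = sd(x))

# The two versions are linked by a factor N / (N - 1)
all.equal(var(x), var_n * n / (n - 1))
```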

---

### 3. Spread

#### 3.3. Standard deviation vs. interquartile range

* Remember that the median is **less sensitive** than the mean to thick tails and outliers
 * This is also the case for the **IQR** relative to the **standard deviation**
 
--

<center><h4><i>Let's go back to our previous example!</i></h4></center>

* How would the standard deviation and the IQR **react** if we were to **add one single observation**?

- We can plot the value of the additional observation on the `$x$` axis and the value of the standard deviation and the IQR on the `$y$` axis

<img src="slides_files/figure-html/unnamed-chunk-39-1.png" width="95%" style="display: block; margin: auto;" />
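A sketch of the same experiment for the two spread statistics, using the variable from the earlier example and one hypothetical additional observation equal to 100:

```r
x <- c(-3, -2, -2, -1, -1, -1, 0, 1, 1, 1, 2, 2, 3)
x_outlier <- c(x, 100)   # one extra, very large observation (assumed value)

# Standard deviation and IQR before and after adding the observation
spread <- rbind(original = c(sd = sd(x), IQR = IQR(x)),
                with_100 = c(sd = sd(x_outlier), IQR = IQR(x_outlier)))
spread
```

The standard deviation is multiplied more than tenfold, while the IQR barely moves.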

---

### 3. Spread

#### 3.3. Standard deviation vs. interquartile range

* But like for the median vs. the mean, it does **not** mean that one is **better than the other**
   * They just **capture different things**
 
--

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-40-1.png" width="100%" style="display: block; margin: auto;" />
]

<br>

* These two distributions
 
   * Have the **same interquartile range**
   
   * Have **different standard deviations**
   

---

### 3. Spread

#### 3.3. Standard deviation vs. interquartile range: in R

* Both statistics have **dedicated R functions**
 
--

```r
variable <- c(0, 1, 3, 4, 6, 7, 8, 10, 11)
c(sd(variable), IQR(variable))
```

```
## [1] 3.844188 5.000000
```

* You can obtain the **quantiles** of a variable using the `quantile()` function

```r
quantile(variable)
```

```
##   0%  25%  50%  75% 100% 
##    0    3    6    8   11
```

<center> &#10140; <i><b> See the help file ?quantile() for more info on quantile computation</b> </i></center>

---

### Practice

#### &#10140; Consider the following variable

```r
variable <- c(1, 3, 8, 4, 9, 5, 3, 8, 8, 7, 4, 9, 
              6, 5, 1, 999, 1, 2, 4, 5, 6, 9, 7, NA)
```

#### 1) Copy/paste the line above into an .R script and run it

#### 2) Compute the mean of this distribution

#### 3) Compute the three quartiles of this distribution

#### 4) Compute the interquartile range of this distribution

<center><h3><i>You've got 5 minutes!</i></h3></center>

---

### Solution

#### 2) Compute the mean of this distribution

```r
mean(variable, na.rm = TRUE)
```

```
## [1] 48.43478
```

#### 3) Compute the three quartiles

```r
quartiles <- quantile(variable, 1:3/4, na.rm = TRUE, names = FALSE)
quartiles
```

```
## [1] 3.5 5.0 8.0
```

#### 4) Compute the interquartile range

```r
quartiles[3] - quartiles[1]
```

```
## [1] 4.5
```

<center> &#10140; <i><b>The outlier 999 pulls the mean far above the third quartile! Descriptive statistics are a good tool to check that the data are clean</b></i></center>

---

<h3>Overview</h3>

<ul style = "margin-left:1.5cm;list-style: none">
  <li><b>3. Spread &#10004;</b></li>
  <ul style = "list-style: none">
    <li>3.1. Range, quantiles, and the IQR</li>
    <li>3.2. Variance and standard deviation</li>
    <li>3.3. Standard deviation vs. IQR</li>
  </ul>
</ul>


<ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul>

---

### 4. Inference

#### 4.1. Data generating process

* In **practice**, we manipulate **concrete** variables such as age, sex, earnings, etc.
  * But on the **theoretical** side, we denote such variables with an **abstract** letter like `$x$`

* In Statistics and Econometrics, we indeed use letters like `$x$` to denote what we call **random variables**
  * These variables can take values according to a **data generating process** (DGP)
  * The data generating process is the *mechanism that causes the data to be the way we observe it*

* For instance your grades can be seen as a random variable
  * Which takes given values according to an unknown data generating process
  * The DGP probably depends on your effort, your background, many environmental factors, ...
  
--

* With descriptive statistics, we actually **infer** properties of the DGP **given the outcomes** we observe
  * **Like reverse engineering**, from the output we try to understand the process
  * One **crucial implication** is that the mean we compute is just an **estimate** of the parameter of the DGP we're interested in

---

### 4. Inference

#### 4.1. Data generating process

<ul>
  <li>Consider for instance the <b>sum of two dice</b> as a random variable</li>
  <ul>
    <li>Contrary to the variables we usually study, <b>we know the DGP</b> of this one</li>
  </ul>
</ul>

* The DGP causes our random variable to take the following values with the following probabilities:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2 - 1/36 (&#x2680;&#x2680;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3 - 2/36 (&#x2680;&#x2681; - &#x2681;&#x2680;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4 - 3/36 (&#x2680;&#x2682; - &#x2682;&#x2680; - &#x2681;&#x2681;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;5 - 4/36 (&#x2681;&#x2682; - &#x2682;&#x2681; - &#x2680;&#x2683; - &#x2683;&#x2680;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;6 - 5/36 (&#x2682;&#x2682; - &#x2681;&#x2683; - &#x2683;&#x2681; - &#x2684;&#x2680; - &#x2680;&#x2684;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;7 - 6/36 (&#x2682;&#x2683; - &#x2683;&#x2682; - &#x2684;&#x2681; - &#x2681;&#x2684; - &#x2680;&#x2685; - &#x2685;&#x2680;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;8 - 5/36 (&#x2683;&#x2683; - &#x2684;&#x2682; - &#x2682;&#x2684; - &#x2685;&#x2681; - &#x2681;&#x2685;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;9 - 4/36 (&#x2683;&#x2684; - &#x2684;&#x2683; - &#x2685;&#x2682; - &#x2682;&#x2685;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;10 - 3/36 (&#x2685;&#x2683; - &#x2683;&#x2685; - &#x2684;&#x2684;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;11 - 2/36 (&#x2685;&#x2684; - &#x2684;&#x2685;)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;12 - 1/36 (&#x2685;&#x2685;)
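This probability table can be rebuilt in R by enumerating the 36 equally likely combinations of two dice. A minimal sketch (the object names are assumptions):

```r
# 6 x 6 matrix holding the sum for every combination of the two dice
sums <- outer(1:6, 1:6, FUN = "+")

# Probability of each possible sum: number of combinations divided by 36
dgp <- table(sums) / 36
dgp

barplot(dgp, xlab = "Sum of the two dice", ylab = "Probability")
```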

]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-47-1.png" width="100%" style="display: block; margin: auto;" />

]
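
The probabilities listed above can be recovered by enumerating the 36 equally likely outcomes. This is a minimal sketch assuming two fair six-sided dice; the object names `faces`, `sums`, `ways`, and `probs` are chosen here for illustration.

```r
# All 36 equally likely outcomes of two fair six-sided dice
faces <- 1:6
sums <- outer(faces, faces, "+")   # 6 x 6 matrix: sum for each pair of faces

# Number of ways to obtain each sum, and the corresponding probability
ways <- table(sums)
probs <- ways / length(sums)

ways
probs
```

The counts in `ways` should match the numerators of the fractions listed above (1, 2, ..., 6, ..., 2, 1).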
---

### 4. Inference

#### 4.2. Empirical vs. theoretical moments

* Because we know the data generating process of our random variable, we can compute its **expected value**:

`$$\begin{align} \text{E}(x) = \frac{(2\times1) + (3\times2) + (4\times3) + (5\times4) + (6\times5) + (7\times6)}{36} +\\ \frac{(8\times5) + (9\times4) + (10\times3) + (11\times2) + (12\times1)}{36} = \frac{252}{36} = 7\end{align}$$`

<ul>
  <li>This is the parameter we are actually interested in</li>
  <ul>
    <li>The <b>expected value</b> is what we call a <b>theoretical moment</b> <i>(the first one)</i></li>
    <li>While the <b>mean</b> is the corresponding <b>empirical moment</b></li>
  </ul>
</ul>

<ul>
  <li><b>How confident</b> we can be in our estimate of the expected value (i.e., <i>the mean</i>) depends on the <b>sample size</b></li>
  <ul>
    <li>For a given number of draws the mean won't necessarily be exactly 7</li>
    <li>But if we were to do <b>infinitely many draws</b>, the mean would <b>converge</b> towards 7 <i>(Law of Large Numbers)</i></li>
  </ul>
</ul>
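
A quick simulation can illustrate this convergence. The sketch below assumes two fair dice and an arbitrary seed; the names `draws`, `sizes`, and `running_means` are illustrative.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
n_draws <- 100000

# Each draw is the sum of two fair dice
draws <- sample(1:6, n_draws, replace = TRUE) +
  sample(1:6, n_draws, replace = TRUE)

# Mean computed on the first 10, 100, ..., 100000 draws
sizes <- 10^(1:5)
running_means <- sapply(sizes, function(k) mean(draws[1:k]))

data.frame(n = sizes, mean = running_means)
```

With few draws the mean can be noticeably off, while with many draws it should get very close to 7.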
 
---

### 4. Inference

#### 4.2. Empirical vs. theoretical moments

* Just like the **mean** that we compute empirically is an estimate of the **first moment** of the distribution,
  * the **variance** that we compute empirically is an estimate of the **second moment** of the distribution

<b>First moment:</b>

<b>Second moment:</b>

]

<center><h4>Theoretical moment</h4></center>

`$$\text{E}(x_{\text{discrete}}) = \sum_{i=1}^{k}x_ip_i$$`

`$$\text{E}(x_{\text{continuous}}) = \int_{\mathbb{R}}xf(x)dx$$`

`$$\text{Var}(x) = \text{E}\left[(x - \text{E}(x))^2\right] \equiv \sigma^2$$`

]

<center><h4>Empirical moment</h4></center>

`$$\bar{x} = \frac{1}{N}\sum_{i=1}^Nx_i$$`

`$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N(x_i-\bar{x})^2$$`

]
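
As an illustration, both kinds of moments can be computed for the two-dice example. This is a sketch under the assumption of fair dice, with an arbitrary seed and sample size; note that the empirical variance below uses the 1/N formula shown above, whereas R's `var()` divides by N - 1.

```r
values <- 2:12
probs <- c(1:6, 5:1) / 36   # probabilities of each sum of two fair dice

# Theoretical moments, from the known DGP
expected_value <- sum(values * probs)
variance <- sum((values - expected_value)^2 * probs)

# Empirical moments, from a simulated sample
set.seed(1)                 # arbitrary seed
x <- sample(values, 1000, replace = TRUE, prob = probs)
x_bar <- mean(x)
sigma2_hat <- mean((x - x_bar)^2)   # 1/N version; var(x) would divide by N - 1

c(expected_value = expected_value, x_bar = x_bar)
c(variance = variance, sigma2_hat = sigma2_hat)
```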
  
---

### 4. Inference

#### 4.2. Empirical vs. theoretical moments

`$$\begin{align}
    \text{E}[X + Y]    = & \text{E}[X] + \text{E}[Y] \\[1em]
    \text{E}[aX]       = & a\text{E}[X] \\[1em]
    \text{E}[a]        = & a \\[1em]
    \text{E}[\text{E}[X]] = &\text{E}[X] \\[1em]
    \text{E}[XY]    \neq & \text{E}[X]\text{E}[Y]  \text{ unless } X \perp Y
\end{align}$$`
]

`$$\begin{align}
    \text{Var}(X)       \geq & 0  \,\,\,\,\,\, \text{Var}(a)       =  0 \\[1em]
    \text{Var}(X + a)   = & \text{Var}(X) \\[1em]
    \text{Var}(aX)      = & a^2\text{Var}(X) \\[1em]
    \text{Var}(aX + bY) = & a^2\text{Var}(X) +    b^2\text{Var}(Y) +\\ 
                          & 2ab\text{Cov}(X, Y) \\[1em]
    \text{Var}(aX - bY) = & a^2\text{Var}(X) +  b^2\text{Var}(Y) -\\
                          & 2ab\text{Cov}(X, Y)
\end{align}$$`
]
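
These properties also hold exactly for the sample mean, variance, and covariance, so they can be checked numerically. The sketch below uses simulated data; the constants `a` and `b` and the way `y` is built are arbitrary choices for illustration.

```r
set.seed(1)                    # arbitrary seed
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)     # correlated with x by construction
a <- 2
b <- 3

# Linearity of the mean
all.equal(mean(x + y), mean(x) + mean(y))
all.equal(mean(a * x), a * mean(x))

# Adding a constant leaves the variance unchanged
all.equal(var(x + a), var(x))
# Multiplying by a constant scales the variance by its square
all.equal(var(a * x), a^2 * var(x))
# The variance of a linear combination involves the covariance
all.equal(var(a * x + b * y), a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y))
all.equal(var(a * x - b * y), a^2 * var(x) + b^2 * var(y) - 2 * a * b * cov(x, y))
```

Each comparison should return `TRUE`, up to floating-point tolerance.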

---

### 4. Inference

#### 4.3. Confidence interval

<ul>
  <li>Because the mean is an empirical <b>estimate</b> of the theoretical expected value</li>
  <ul>
    <li>We need a measure of the <b>confidence</b> we can have in this estimate</li>
    <li>This is something we can do as long as our variable is <b>normally distributed</b> <i>(bell-shaped)</i></li>
  </ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-48-1.png" width="65%" style="display: block; margin: auto;" />
 
---

### 4. Inference

#### 4.3. Confidence interval

<ul>
  <li>Indeed, with such distributions we can recover <b>something we know</b></li>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-49-1.png" width="65%" style="display: block; margin: auto;" />
 
---

### 4. Inference

#### 4.3. Confidence interval

<img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="65%" style="display: block; margin: auto;" />
 
---

### 4. Inference

#### 4.3. Confidence interval

<ul>
  <li>Indeed, with such distributions we can recover <b>something we know</b></li>
  <ul>
    <li>If we <b>divide</b> all the values of the variable by its <b>standard deviation</b>, the <b>variance becomes 1</b></li>
    <li>If we <b>subtract the mean</b> from all the values of the variable, the <b>mean becomes 0</b></li>
  </ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-51-1.png" width="65%" style="display: block; margin: auto;" />
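
The two operations above can be tried on simulated data. This is a minimal sketch: the mean of 12 and standard deviation of 3 are hypothetical values chosen for illustration.

```r
set.seed(1)                            # arbitrary seed
x <- rnorm(10000, mean = 12, sd = 3)   # hypothetical normally distributed variable

# Subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

c(mean = mean(z), sd = sd(z))
```

The standardized variable `z` should have a mean of 0 and a standard deviation of 1, up to floating-point error.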
 
---

### 4. Inference

#### 4.3. Confidence interval

<ul>
  <li>In mathematical notation, what we just saw can be written as:</li>
</ul>

`$$\frac{x - \text{E}(x)}{\text{SD}(x)} \sim \mathcal{N}(0, 1)$$`
]

<ul>
  <li>And if we compute means on random draws of $x$</li>
  <ul>
    <li>These means are also normally distributed around $\text{E}(x)$, but their spread shrinks with the number $n$ of draws:</li>
  </ul>
</ul>

`$$\frac{\bar{x} - \text{E}(x)}{\text{SD}(x)/\sqrt{n}} \sim \mathcal{N}(0, 1)$$`
]

<ul>
  <ul>
    <li>This is actually <b>true when</b> the theoretical $\text{SD}(x)$ of the DGP <b>is known</b></li>
    <li>But <b>in practice</b>, we work with <b>finite samples</b> and must estimate it, so things work slightly differently</li>
  </ul>
</ul>

<ul>
  <li>When we have a limited number $n$ of observations:</li>
  <ul>
    <li>We standardize using the <b>standard <i>error</i></b> of the mean $\text{SE}(x) = \text{SD}(x)/\sqrt{n}$, with $\text{SD}(x)$ computed on the sample</li>
    <li>And we know that:</li>
  </ul>
</ul>

`$$\frac{\bar{x} - \text{E}(x)}{\text{SE}(x)} \equiv t \sim t_{n-1}$$`
    
<p style = "margin-bottom:.5cm;"></p>

<ul>
  <ul>    
    <li>Where $t$ reads <i>"t-stat"</i> and $t_{n-1}$ denotes a Student's $t$ distribution with $n-1$ degrees of freedom</li>
  </ul>
</ul>
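
To see why the standard error divides the standard deviation by the square root of $n$, one can simulate many sample means and look at their spread. This is a sketch with hypothetical values for the sample size, the mean, and the standard deviation.

```r
set.seed(1)    # arbitrary seed
n <- 25        # hypothetical sample size
sigma <- 2     # hypothetical standard deviation of x

# Mean of n normal draws, repeated 5000 times
means <- replicate(5000, mean(rnorm(n, mean = 10, sd = sigma)))

c(sd_of_means = sd(means), se_formula = sigma / sqrt(n))
```

The standard deviation of the simulated means should be close to `sigma / sqrt(n)`.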

---

### 4. Inference

#### 4.3. Confidence interval

<ul>
  <li>The Student's $t$ distribution is <b>very similar to the normal</b> distribution</li>
  <ul>
    <li>It is just <b>a bit flatter</b> when $n$ is low</li>
    <li>But it <b>converges quickly</b> to a normal distribution as $n \rightarrow \infty$</li>
  </ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-52-1.png" width="65%" style="display: block; margin: auto;" />
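
The convergence can also be seen by comparing quantiles. The sketch below compares the 97.5% quantile of the Student's $t$ distribution, for a few arbitrary degrees of freedom, with that of the standard normal distribution.

```r
df <- c(2, 5, 10, 30, 100, 1000)   # arbitrary degrees of freedom
t_quantiles <- qt(0.975, df)       # 97.5% quantile of each t distribution
normal_quantile <- qnorm(0.975)    # 97.5% quantile of the standard normal

data.frame(df = df,
           t = round(t_quantiles, 3),
           normal = round(normal_quantile, 3))
```

The $t$ quantiles are larger than the normal one because of the heavier tails, and the gap shrinks as the degrees of freedom increase.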
 
---

### 4. Inference

#### 4.3. Confidence interval

<ul>
  <li>The good news is that:</li>
  <ul>
    <li>Because <b>we know</b> how $t \equiv \frac{\bar{x} - \text{E}(x)}{\text{SE}(x)}$ is distributed ($\sim t_{n-1}$)</li>
    <li>We also know <b>the chances</b> that $\frac{\bar{x} - \text{E}(x)}{\text{SE}(x)}$ takes <b>certain values</b></li>
  </ul>
</ul>

<ul>
  <li>Consider a variable $x \sim \mathcal{N}(\text{E}(x), \text{SD}(x)^2)$ from which we observe a sample of $n = 100$ values, so that $t \sim t_{99}$</li>
</ul>

<ul>
  <ul>
    <li>And we know between which values lies a given share of the $t_{99}$ distribution</li>
  </ul>
</ul>
  
<ul>
  <ul>
    <li>For instance, 95% of the distribution lies in</li>
  </ul>
</ul>

`$$[-t_{99, 97.5\%};\:t_{99, 97.5\%}]\: \approx\: [-1.98;\: 1.98]$$`

]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-53-1.png" width="150%" style="display: block; margin: auto;" />
]
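
These bounds can be obtained in R with `qt()`, and the share of the distribution between them can be checked with `pt()`. A minimal sketch, where `t_crit` and `share` are names chosen for illustration:

```r
# 97.5% quantile of the t distribution with 99 degrees of freedom
t_crit <- qt(0.975, df = 99)

# Share of the distribution lying between -t_crit and t_crit
share <- pt(t_crit, df = 99) - pt(-t_crit, df = 99)

c(t_crit = t_crit, share = share)
```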

---

### 4. Inference

#### 4.3. Confidence interval

* In mathematical notation, what the previous graph shows can be written as:

--
 
`$$\text{Pr}\left[-t_{99, 97.5\%}\leq\frac{\bar{x} - \text{E}(x)}{\text{SE}(x)}\leq t_{99, 97.5\%}\right] = 95\%$$`

* Rearranging the terms yields:

`$$\text{Pr}\left[\bar{x} - t_{99, 97.5\%}\times \text{SE}(x)\leq \text{E}(x) \leq\bar{x} + t_{99, 97.5\%}\times \text{SE}(x)\right] = 95\%$$`

* Thus, we can say that there's a 95% chance for `$\text{E}(x)$` to be within:
 
`$$\bar{x} \pm t_{99, 97.5\%}\times \text{SE}(x)$$`

<center><i><b>&#10140; This is our 95% confidence interval of the mean!</b></i></center>
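
One way to read this statement is in terms of repeated sampling: if we drew many samples and computed the interval on each, about 95% of these intervals should contain the expected value. The simulation below is a sketch with a hypothetical normal DGP (`mu` and `sigma` are arbitrary) and samples of 100 observations.

```r
set.seed(1)                  # arbitrary seed
n <- 100                     # sample size, so that t has 99 degrees of freedom
mu <- 10                     # hypothetical expected value
sigma <- 2                   # hypothetical standard deviation
t_crit <- qt(0.975, n - 1)

# For each simulated sample, check whether the interval contains mu
covered <- replicate(2000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  abs(mean(x) - mu) <= t_crit * se
})

mean(covered)   # share of intervals containing mu
```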

---

### 4. Inference

#### 4.3. Confidence interval

* We can apply these calculations **in R** to get a **95% CI of the mean** of the grade distribution

```r
grades <- c(20, 20, 17.5, 17.5, 16, 16, 16, 14.5, 14.5, 19.5, 19.5, 18.5, 18.5, 18.5)
```

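```r
# Mean, standard deviation, and n
# (named so as not to mask the base functions mean() and sd())
grades_mean <- mean(grades)
grades_sd <- sd(grades)
n <- length(grades)
```

```r
# Standard error
se <- grades_sd / sqrt(n)
```

```r
# Critical t value
t_crit <- qt(.975, n - 1) # qt() returns the t quantile for probability 1 - ((1 - CL) / 2) and n - 1 degrees of freedom
```

```r
# Confidence interval
c(grades_mean - t_crit * se, grades_mean + t_crit * se)
```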

```
## [1] 16.49665 18.71764
```
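
As a cross-check, the same interval can be obtained from the built-in `t.test()` function, whose result has a `conf.int` element. In the sketch below, `mean_ci()` is a hypothetical helper that wraps the manual steps above; it is not a base R function.

```r
grades <- c(20, 20, 17.5, 17.5, 16, 16, 16, 14.5, 14.5, 19.5, 19.5, 18.5, 18.5, 18.5)

# Hypothetical helper: confidence interval of the mean at a given confidence level
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, n - 1)
  mean(x) + c(-1, 1) * t_crit * se
}

manual <- mean_ci(grades)
builtin <- as.numeric(t.test(grades, conf.level = 0.95)$conf.int)

manual
all.equal(manual, builtin)
```

Changing `level` (for instance to 0.99) gives the interval for another confidence level.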

---

<h3>Overview</h3>

]

<ul style = "margin-left:-1cm;list-style: none">
  <li><b>4. Inference &#10004;</b></li>
  <ul style = "list-style: none">
    <li>4.1. Data generating process</li>
    <li>4.2. Empirical vs. theoretical moments</li>
    <li>4.3. Confidence interval</li>
  </ul>
</ul>
 
<p style = "margin-bottom:1cm;"></p>

<ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul>
]

---

### 5. Wrap up!

#### 1. Distributions

* The **distribution** of a variable documents all its possible values and how frequent they are

* We can describe a distribution with:
  * Its **central tendency**
  * And its **spread**
  
---

### 5. Wrap up!

#### 2. Central tendency

* The **mean** is the sum of all values divided by the number of observations

`$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$`

]

* The **median** is the value that divides the (sorted) distribution into two groups of equal size

`$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\
\frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even}
\end{cases}$$`

]

#### 3. Spread

* The **standard deviation** is the square root of the average squared deviation from the mean

`$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$`
 
]

* The **interquartile range** is the difference between the maximum and the minimum value from the middle half of the distribution

`$$\text{IQR} = Q_3 - Q_1$$`

]

---

### 5. Wrap up!

#### 4. Inference

<ul> 
  <li>In statistics, we view a variable as a given realization of a <b>data generating process</b></li>
  <ul>
    <li>Hence, the <b>mean</b> is what we call an <b>empirical moment</b>, which is an <b>estimate</b>...</li>
    <li>... of the <b>expected value</b>, the <b>theoretical moment</b> of the DGP we're interested in</li>
  </ul>
</ul>

<ul> 
  <li>To know how confident we can be in this estimation, we need to compute a <b>confidence interval</b></li>
</ul>

`$$[\bar{x} - t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}; \:\bar{x} + t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}]$$`

<ul>
  <ul>
    <li>The interval gets <b>wider</b> as the <b>variance</b> of the distribution of $x$ increases</li>
    <li>And gets <b>narrower</b> as the <b>sample size</b> $n$ increases</li>
  </ul>
</ul>
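
Both effects can be illustrated by computing the half-width of the interval for different values. In this sketch, `ci_half_width()` is a hypothetical helper, and the standard deviations and sample sizes are arbitrary.

```r
# Hypothetical helper: half-width of the confidence interval of the mean
ci_half_width <- function(sd, n, level = 0.95) {
  qt(1 - (1 - level) / 2, n - 1) * sd / sqrt(n)
}

# Same standard deviation, increasing sample size: the interval narrows
by_n <- ci_half_width(sd = 2, n = c(10, 100, 1000))

# Same sample size, increasing standard deviation: the interval widens
by_sd <- ci_half_width(sd = c(1, 2, 4), n = 100)

by_n
by_sd
```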