class: center, middle, inverse, title-slide # Descriptive statistics ## Lecture 2 ###
Louis SIRUGUE ### CPES 2 - Fall 2022 --- ### Quick reminder #### 1. Import data ```r fb <- read.csv("C:/User/Documents/ligue1.csv", encoding = "UTF-8") ``` -- <p style = "margin-bottom:1.5cm;"></p> #### 2. Class ```r is.numeric("1.6180339") # What would be the output? ``` -- ``` ## [1] FALSE ``` -- <p style = "margin-bottom:1.5cm;"></p> #### 3. Subsetting ```r fb$Home[3] ``` ``` ## [1] "Troyes" ``` --- <h3>Today we learn how to describe data</h3> <p style = "margin-bottom:1.5cm;"></p> -- .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Inference</b></li> <ul style = "list-style: none"> <li>4.1. Data generating process</li> <li>4.2. Empirical vs. theoretical moments</li> <li>4.3. Confidence interval</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul> ] --- <h3>Today we learn how to describe data</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> ] --- ### 1. Distributions #### 1.1. 
Definition -- * The point of descriptive statistics is to **summarize a big table** of values with a small set of **tractable statistics** * The most comprehensive way to characterize a variable/vector is to compute its **distribution**: * **What** are the **values** the variable takes? * **How frequently** does each of these values appear? -- <b> ➜ Consider for instance the following variable:</b> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Variable 1</caption> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> -- .pull-left[ * We can count how many times each value appears <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td 
style="text-align:left;"> Variable 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> n </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table> ] -- .pull-right[ <p style = "margin-bottom: 1.5cm;"></p> * And we can represent this distribution graphically with a bar plot * Each possible value on the x-axis * Their number of occurrences on the y-axis ] --- ### 1. Distributions #### 1.2. Graphical representation <img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.2. Graphical representation * But what if we would like to do the same thing for the following variable? 
--
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Variable 2</caption> <tbody>
<tr> <td style="text-align:right;"> 5.912877 </td> <td style="text-align:right;"> 5.006781 </td> <td style="text-align:right;"> 5.517149 </td> <td style="text-align:right;"> 5.854849 </td> <td style="text-align:right;"> 5.177872 </td> <td style="text-align:right;"> 3.815240 </td> </tr>
<tr> <td style="text-align:right;"> 1.666582 </td> <td style="text-align:right;"> 4.422721 </td> <td style="text-align:right;"> 6.025062 </td> <td style="text-align:right;"> 5.411020 </td> <td style="text-align:right;"> 5.889811 </td> <td style="text-align:right;"> 6.729103 </td> </tr>
<tr> <td style="text-align:right;"> 4.160800 </td> <td style="text-align:right;"> 6.519049 </td> <td style="text-align:right;"> 6.849172 </td> <td style="text-align:right;"> 8.368158 </td> <td style="text-align:right;"> 6.167404 </td> <td style="text-align:right;"> 2.882974 </td> </tr>
<tr> <td style="text-align:right;"> 6.751888 </td> <td style="text-align:right;"> 3.202183 </td> <td style="text-align:right;"> 6.390224 </td> <td style="text-align:right;"> 3.942039 </td> <td style="text-align:right;"> 6.488909 </td> <td style="text-align:right;"> 8.195647 </td> </tr>
<tr> <td style="text-align:right;"> 7.073922 </td> <td style="text-align:right;"> 4.790039 </td> <td style="text-align:right;"> 5.297919 </td> <td style="text-align:right;"> 1.218109 </td> <td style="text-align:right;"> 5.754213 </td> <td style="text-align:right;"> 7.225030 </td> </tr>
</tbody> </table>
<p style = "margin-bottom:1.5cm;">
--
* Each value appears only once
* So counting how many times each value appears does not help to summarize the variable
--
<center><h4> ➜ <i> Let's have a look at the corresponding bar plot </i></h4></center>
---
### 1. Distributions
#### 1.2. 
Graphical representation <img src="slides_files/figure-html/unnamed-chunk-11-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.2. Graphical representation <ul> <li>It does not look good for this variable because it is continuous, while the first one was discrete</li> <ul> <li><b>Discrete variables:</b> variables that can take a finite (or, in practice, a sufficiently small) number of values, e.g., number of siblings, eye color, ...</li> <li><b>Continuous variables:</b> variables that can take an infinite (or, in practice, a sufficiently large) number of values, e.g., annual income, height in centimeters, ...</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"> <center><i>➜ In practice some variables can be difficult to classify. For instance, <b>age (in years)</b> can be viewed <b>as a discrete</b> variable because it can take a finite set of values, but this set being possibly quite wide, one could also view it <b>as a continuous variable</b>. It often depends on the context.</i></center> <p style = "margin-bottom:1.25cm;"> -- <ul> <li> One solution to get a sense of the <b>distribution</b> of a <b>continuous variable</b> is to do a <b>histogram</b></li> <ul> <li>Instead of taking each value separately, group them into <i>bins</i> and show how many values fall into each bin</li> <li>The bar plots we've seen so far are basically histograms with the number of bins being equal to the number of possible values</li> </ul> </ul> --- ### 1. Distributions #### 1.2. Graphical representation * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount -- <p style = "margin-bottom: 1.97cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.2. Graphical representation * Consider for instance the following variable. 
For clarity each point is shifted vertically by a random amount * We can divide the domain of this variable into 5 bins <p style = "margin-bottom: 1.25cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.2. Graphical representation * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount * We can divide the domain of this variable into 5 bins * And count the number of observations within each bin <img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.2. Graphical representation * Consider for instance the following variable. For clarity each point is shifted vertically by a random amount * We can divide the domain of this variable into 5 bins * And count the number of observations within each bin <img src="slides_files/figure-html/unnamed-chunk-15-1.png" width="78%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.2. Graphical representation <ul> <li>There's no definitive rule to choose the number of bins</li> <ul> <li>But too many or too few can yield misleading histograms</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="83%" style="display: block; margin: auto;" /> -- <center><h4> ➜ <i> Note that choosing the number of bins is equivalent to choosing the width of each bin </i></h4></center> --- ### 1. Distributions #### 1.2. 
Graphical representation <ul> <li><b>Densities</b> are often used instead of <b>histograms</b></li> <ul> <li>Both are based on the <b>same principle</b>, but densities are <b>continuous</b></li> </ul> </ul> <p style = "margin-bottom:-.5cm;"> -- <ul> <li>We won't learn how to derive it in this course but the idea is the same</li> <ul> <li>The <b>higher the value</b> on the y-axis, the <b>more observations</b> there are around the corresponding x location</li> </ul> </ul> <p style = "margin-bottom:-.5cm;"> -- <ul> <li>The <b>smoothness</b> of the density can be tuned with the <b>bandwidth</b></li> <ul> <li>The larger the smoother</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.3. Common distributions: Normal distribution <img src="slides_files/figure-html/unnamed-chunk-18-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.3. Common distributions: Log-normal distribution <img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.3. Common distributions: Uniform distribution <img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="83%" style="display: block; margin: auto;" /> --- ### 1. Distributions #### 1.3. Common distributions: Summarizing distributions <p style = "margin-bottom:1cm;"> -- <img src="slides_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" /> -- * How to **summarize** these distributions with simple statistics? --- ### 1. Distributions #### 1.3. Common distributions: Summarizing distributions <p style = "margin-bottom:1cm;"> <img src="slides_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" /> * How to **summarize** these distributions with simple statistics? 
* By describing their **central tendency** (e.g., mean, median) --- ### 1. Distributions #### 1.3. Common distributions: Summarizing distributions <p style = "margin-bottom:1cm;"> <img src="slides_files/figure-html/unnamed-chunk-23-1.png" width="100%" style="display: block; margin: auto;" /> * How to **summarize** these distributions with simple statistics? * By describing their **central tendency** (e.g., mean, median) * And their **spread** (e.g., standard deviation, inter-quartile range) --- <h3>Overview</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Inference</b></li> <ul style = "list-style: none"> <li>4.1. Data generating process</li> <li>4.2. Empirical vs. theoretical moments</li> <li>4.3. Confidence interval</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. 
Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> ] --- ### 2. Central tendency #### 2.1. Mean <ul> <li>The mean is the most common statistic to describe central tendencies</li> <ul> <li>Take for instance the grades I gave to the final projects in spring 2021:</li> </ul> </ul> -- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Grades I gave in spring 2021</caption> <tbody> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 16.0 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> </tbody> </table> <p style = "margin-bottom:1.5cm;"> -- * The mean is simply the sum of all the grades divided by the number of grades: -- <p style = "margin-bottom:.5cm;"> `$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$` -- `$$\frac{20 + 20 + 17.5 + 17.5 + 16 + 16 + 16 + 14.5 + 14.5 + 19.5 + 19.5 + 18.5 + 18.5 + 18.5}{14} = 17.61$$` --- ### 2. Central tendency #### 2.1. 
Mean <ul> <li>The mean is the most common statistic to describe central tendencies</li> <ul> <li>Take for instance the grades I gave to the final projects in spring 2021:</li> </ul> </ul> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Grades I gave in spring 2021</caption> <tbody> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 16.0 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17.5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 14.5 </td> <td style="text-align:right;"> 19.5 </td> <td style="text-align:right;"> 18.5 </td> <td style="text-align:right;"> 18.5 </td> </tr> </tbody> </table> <p style = "margin-bottom:1.5cm;"> * Note that it can also be expressed as the sum of each value weighted by its proportion in the distribution <p style = "margin-bottom:1.5cm;"> `$$\bar{x} = \frac{2}{14} \times 20 + \frac{2}{14} \times 17.5 + \frac{3}{14} \times 16 + \frac{2}{14} \times 14.5 + \frac{2}{14} \times 19.5 + \frac{3}{14} \times 18.5 = 17.61$$` --- ### 2. Central tendency #### 2.2. 
Median * To obtain the median you first need to **sort the values**: -- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Grades I gave in spring 2021</caption> <tbody> <tr> <td style="text-align:left;text-align: center;"> 1 </td> <td style="text-align:left;text-align: center;"> 2 </td> <td style="text-align:left;text-align: center;"> 3 </td> <td style="text-align:left;text-align: center;"> 4 </td> <td style="text-align:left;text-align: center;"> 5 </td> <td style="text-align:left;text-align: center;"> 6 </td> <td style="text-align:left;text-align: center;"> 7 </td> <td style="text-align:left;text-align: center;"> 8 </td> <td style="text-align:left;text-align: center;"> 9 </td> <td style="text-align:left;text-align: center;"> 10 </td> <td style="text-align:left;text-align: center;"> 11 </td> <td style="text-align:left;text-align: center;"> 12 </td> <td style="text-align:left;text-align: center;"> 13 </td> <td style="text-align:left;text-align: center;"> 14 </td> </tr> <tr> <td style="text-align:left;text-align: center;"> 14.5 </td> <td style="text-align:left;text-align: center;"> 14.5 </td> <td style="text-align:left;text-align: center;"> 16 </td> <td style="text-align:left;text-align: center;"> 16 </td> <td style="text-align:left;text-align: center;"> 16 </td> <td style="text-align:left;text-align: center;"> 17.5 </td> <td style="text-align:left;text-align: center;"> 17.5 </td> <td style="text-align:left;text-align: center;"> 18.5 </td> <td style="text-align:left;text-align: center;"> 18.5 </td> <td style="text-align:left;text-align: center;"> 18.5 </td> <td style="text-align:left;text-align: center;"> 19.5 </td> <td style="text-align:left;text-align: center;"> 19.5 </td> <td style="text-align:left;text-align: center;"> 20 </td> <td style="text-align:left;text-align: center;"> 20 </td> </tr> </tbody> </table> -- <p style = "margin-bottom:1cm;"> * The median is the value that 
**divides** the distribution into **two halves** * When there is an even number of observations, the median is the average of the last value of the first half and the first value of the second half -- As we have 14 observations, the median is the average of the 7<sup>th</sup> and the 8<sup>th</sup> observations: `$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\ \frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even} \end{cases} = \frac{17.5 + 18.5}{2} = 18$$` --- ### 2. Central tendency #### 2.3. Mean vs. median: relative magnitude -- * The **relative magnitude** of the mean and the median depends on the **symmetry of the distribution**: * The **mean is larger** than the median if the distribution is **right-skewed** * The mean and the median are **equal** if the distribution is **symmetric** * The **mean is lower** than the median if the distribution is **left-skewed** -- <p style = "margin-bottom:1.25cm;"> <img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 2. Central tendency #### 2.3. Mean vs. 
median: robustness
* The **median** is indeed **less sensitive** than the mean to thick tails and outliers
* For this reason we say that the median is a ***robust statistic***
--
<center><h4><i>Let's illustrate that with a small example!</i></h4></center>
--
.pull-left[
* Consider the following variable:
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table>
* How would the mean and the median **react** if we were to **add one single observation**?
    - We can plot the value of the additional observation on the `\(x\)` axis and the value of the mean and the median on the `\(y\)` axis
]
--
.pull-right[
<p style = "margin-bottom:-1cm;">
<img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="95%" style="display: block; margin: auto;" />
]
---
### 2. Central tendency
#### 2.3. Mean vs. median: in R
* Both statistics have **dedicated R functions**
--
```r
variable <- c(1, 2, 4, 8, 12)
c(mean(variable), median(variable))
```
```
## [1] 5.4 4.0
```
--
<p style = "margin-bottom:1.25cm;">
* As always, you should **pay attention to NAs** when using these functions
--
```r
mean(c(1, 2, 3, 4, NA))
```
```
## [1] NA
```
```r
mean(c(1, 2, 3, 4, NA), na.rm = TRUE)
```
```
## [1] 2.5
```
---
### 2. Central tendency
#### 2.3. Mean vs. 
median: with binary variable
<ul> <li>A <b>binary variable</b> is a variable that can take only <b>two values</b> <i>(e.g., male/female, accepted/rejected)</i></li> <ul> <li>Any binary variable can be expressed as a sequence of <b>0s and 1s</b></li> </ul> </ul>
<p style = "margin-bottom:1cm;">
* Consider the following binary variable of length 4
--
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table>
<p style = "margin-bottom:1cm;">
--
- The **mean** of a binary variable is equal to the **percentage of 1s**:
`$$\frac{0 + 1 + 1 + 1}{4} = \frac{3}{4} = 75\%$$`
<p style = "margin-bottom:1cm;">
--
- The **median** of a binary variable is equal to the **mode** *(mode = most frequent value of a variable)*, unless 0s and 1s are exactly tied
`$$\frac{1 + 1}{2} = 1$$`
---
<h3>Overview</h3>
<p style = "margin-bottom:1.5cm;"></p>
.pull-left[
<ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul>
<p style = "margin-bottom:1cm;"></p>
<ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency ✔</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul>
<p style = "margin-bottom:1cm;"></p>
<ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul>
]
.pull-right[
<ul style = "margin-left:-1cm;list-style: none"> <li><b>4. 
Inference</b></li> <ul style = "list-style: none"> <li>4.1. Data generating process</li> <li>4.2. Empirical vs. theoretical moments</li> <li>4.3. Confidence interval</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency ✔</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul> ] --- ### 3. Spread #### 3.1. Range, quantiles, and the IQR -- <ul> <li>The <b>most intuitive</b> statistic to describe the spread of a variable is probably</li> <ul> <li><b>The range: the minimum and maximum value it can take</b></li> </ul> </ul> -- <p style = "margin-bottom: 1.25cm;"></p> * But consider the following two distributions: <p style = "margin-bottom: 1.25cm;"></p> <style> .left-column {width: 66%;} .right-column {width: 32%;} </style> .left-column[ <img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .right-column[ <p style = "margin-bottom:-1cm;"> * In the presence of outliers or very skewed distributions, the <b>full range</b> of a variable <b>may not be representative</b> of what we mean by *'spread'* * That's why we tend to prefer **inter-quantile** ranges ] --- ### 3. 
Spread #### 3.1. Range, quantiles, and the IQR <ul> <li><b>Quantiles</b> are observations that <b>divide</b> the population into <b>groups of equal size</b></li> <ul> <li>The <b>median</b> divides the population into <b>2 groups</b> of equal size</li> <li><b>Quartiles</b> divide the population into <b>4 groups</b> of equal size</li> <li>There are also <b>terciles</b>, <b>quintiles</b>, <b>deciles</b>, and so on</li> </ul> </ul> <p style = "margin-bottom: 1.25cm;"></p> -- <ul> <li>One way to <b>compute quartiles</b>: divide the ordered variable according to the median</li> <ul> <li>The lower quartile value is the median of the lower half of the data</li> <li>The upper quartile value is the median of the upper half of the data</li> <li><i>If there is an odd number of data points in the original ordered data set, don't include the median in either half</i></li> </ul> </ul> -- <p style = "margin-bottom:1.5cm;"> .pull-left[ <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table> `$$Q_1 = -2,\:\:Q_2 = 0,\:\:Q_3 = 2$$` ] -- .pull-right[ <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> 
</table> `$$Q_1 = -1.5,\:\:Q_2 = 0,\:\:Q_3 = 1.5$$` ] --- ### 3. Spread #### 3.1. Range, quantiles, and the IQR * The **interquartile range** is the difference between the third and the first quartile: `\(\text{IQR} = Q_3 - Q_1\)` -- * Put differently, it corresponds to the **bounds** of the set which contains the **middle half** of the distribution -- <p style = "margin-bottom:.5cm;"> <img src="slides_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 3. Spread #### 3.2. Variance and standard deviation <ul> <li>The <b>variance</b> is a way to quantify how the values of a variable tend to <b>deviate</b> from their <b>mean</b></li> <ul> <li>If values tend to be <b>close to the mean</b>, then the <b>spread is low</b></li> <li>If values tend to be far <b>from the mean</b>, then the <b>spread is large</b></li> </ul> </ul> -- <p style = "margin-bottom:1cm;"> .pull-left[ * Can we just take the **average deviation** from the mean? <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:center;"> x </th> <th style="text-align:center;"> mean(x) </th> <th style="text-align:center;"> x - mean(x) </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 2.5 </td> <td style="text-align:center;"> -1.5 </td> </tr> <tr> <td style="text-align:center;"> 4 </td> <td style="text-align:center;"> 2.5 </td> <td style="text-align:center;"> 1.5 </td> </tr> <tr> <td style="text-align:center;"> -3 </td> <td style="text-align:center;"> 2.5 </td> <td style="text-align:center;"> -5.5 </td> </tr> <tr> <td style="text-align:center;"> 8 </td> <td style="text-align:center;"> 2.5 </td> <td style="text-align:center;"> 5.5 </td> </tr> </tbody> </table> ] -- .pull-right[ <br> * By construction it would **always be 0**: values above and under the mean compensate * But we can use the 
**absolute value** of each deviation: `\(|x_i-\bar{x}|\)` * Or their **square**: `\((x_i-\bar{x})^2\)` ] --- ### 3. Spread #### 3.2. Variance and standard deviation * This is how the **variance** is computed: by **averaging the squared deviations from the mean** -- `$$\text{Var}(x) = \frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2$$` -- <p style = "margin-bottom:1cm;"> * Because the **variance** is a **sum of squares**, it can get **quite big** compared to the other statistics like the mean, the median or the interquartile range. * To express the spread in the **same unit** as the data, we can take the **square root** of the variance, which is called the **standard deviation** * In a way, *the standard deviation is to the mean what the IQR is to the median* <p style = "margin-bottom:1cm;"> -- `$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$` --- ### 3. Spread #### 3.3. Standard deviation vs. interquartile range * Remember that the median is **less sensitive** than the mean to thick tails and outliers * This is also the case for the **IQR** relative to the **standard deviation** -- <center><h4><i>Let's go back to our previous example!</i></h4></center> -- .pull-left[ * Consider the following variable: <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <tbody> <tr> <td style="text-align:left;"> </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table> * How would the 
standard deviation and the IQR **react** if we were to **add one single observation**? - We can plot the value of the additional observation on the `\(x\)` axis and the value of the standard deviation and the IQR on the `\(y\)` axis ] -- .pull-right[ <p style = "margin-bottom:-.5cm;"> <img src="slides_files/figure-html/unnamed-chunk-39-1.png" width="95%" style="display: block; margin: auto;" /> ] --- ### 3. Spread #### 3.3. Standard deviation vs. interquartile range * But as with the median vs. the mean, this does **not** mean that one is **better than the other** * They just **capture different things** -- .left-column[ <img src="slides_files/figure-html/unnamed-chunk-40-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .right-column[ <br> * These two distributions * Have the **same interquartile range** * Have **different standard deviations** ] --- ### 3. Spread #### 3.3. Standard deviation vs. interquartile range: in R * Both statistics have **dedicated R functions** --
```r
variable <- c(0, 1, 3, 4, 6, 7, 8, 10, 11)
# Note: sd() uses the n - 1 denominator (sample standard deviation)
c(sd(variable), IQR(variable))
```
```
## [1] 3.844188 5.000000
```
-- <p style = "margin-bottom:1.5cm;"> * You can obtain the **quantiles** of a variable using the `quantile()` function --
```r
quantile(variable)
```
```
##   0%  25%  50%  75% 100%
##    0    3    6    8   11
```
<p style = "margin-bottom:1cm;"> <center> ➜ <i><b> See the help file ?quantile for more info on quantile computation</b> </i></center> --- class: inverse, hide-logo ### Practice #### ➜ Consider the following variable
```r
variable <- c(1, 3, 8, 4, 9, 5, 3, 8, 8, 7, 4, 9, 6, 5, 1, 999, 1, 2, 4, 5, 6, 9, 7, NA)
```
-- <p style = "margin-bottom:1.5cm;"></p> #### 1) Copy/paste the line above into an .R script and run it #### 2) Compute the mean of this distribution #### 3) Compute the three quartiles of this distribution #### 4) Compute the interquartile range of this distribution -- <p style = "margin-bottom:2cm;"></p> <center><h3><i>You've got 5 minutes!</i></h3></center>
--- class: inverse, hide-logo ### Solution #### 2) Compute the mean of this distribution
```r
# na.rm = TRUE drops the missing value before computing the mean
mean(variable, na.rm = TRUE)
```
```
## [1] 48.43478
```
-- #### 3) Compute the three quartiles
```r
# Probabilities 0.25, 0.50 and 0.75 give Q1, Q2 and Q3
quartiles <- quantile(variable, 1:3/4, na.rm = TRUE, names = FALSE)
quartiles
```
```
## [1] 3.5 5.0 8.0
```
-- <p style = "margin-bottom:-.5cm;"></p> .pull-left[ #### 4) Compute the interquartile range
```r
quartiles[3] - quartiles[1]
```
```
## [1] 4.5
```
] -- .pull-right[ <p style = "margin-bottom:1.25cm;"></p> <center> ➜ <i><b>The outlier 999 pulls the mean far above the third quartile! Descriptive statistics are a good tool to check that the data is clean</b></i></center> ] --- <h3>Overview</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency ✔</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread ✔</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Inference</b></li> <ul style = "list-style: none"> <li>4.1. Data generating process</li> <li>4.2. Empirical vs. theoretical moments</li> <li>4.3. Confidence interval</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. 
Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency ✔</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread ✔</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Inference</b></li> <ul style = "list-style: none"> <li>4.1. Data generating process</li> <li>4.2. Empirical vs. theoretical moments</li> <li>4.3. Confidence interval</li> </ul> </ul> ] --- ### 4. Inference #### 4.1. Data generating process * In **practice**, we manipulate **concrete** variables such as age, sex, earnings, etc. * But on the **theoretical** side, we denote such variables with an **abstract** letter like `\(x\)` -- <p style = "margin-bottom:1cm;"></p> * In Statistics and Econometrics, we indeed use letters like `\(x\)` to denote what we call **random variables** * These variables can take values according to a **data generating process** (DGP) * The data generating process is the *mechanism that causes the data to be the way we observe it* -- <p style = "margin-bottom:1cm;"></p> * For instance your grades can be seen as a random variable * Which takes given values according to an unknown data generating process * The DGP probably depends on your effort, your background, many environmental factors, ... 
-- <p style = "margin-bottom:1cm;"></p> * With descriptive statistics, we actually **infer** properties of the DGP **given the outcomes** we observe * **Like reverse engineering**, from the output we try to understand the process * One **crucial implication** is that the mean we compute is just an **estimation** of the parameter of the DGP we're interested in --- ### 4. Inference #### 4.1. Data generating process <ul> <li>Consider for instance the <b>sum of two dice</b> as a random variable</li> <ul> <li>Contrary to the variables we usually study, <b>we know the DGP</b> of this one</li> </ul> </ul> -- * The DGP causes our random variable to take the following values with the following probabilities: .pull-left[ 2 - 1/36 (⚀⚀) 3 - 2/36 (⚀⚁ - ⚁⚀) 4 - 3/36 (⚀⚂ - ⚂⚀ - ⚁⚁) 5 - 4/36 (⚁⚂ - ⚂⚁ - ⚀⚃ - ⚃⚀) 6 - 5/36 (⚂⚂ - ⚁⚃ - ⚃⚁ - ⚄⚀ - ⚀⚄) 7 - 6/36 (⚂⚃ - ⚃⚂ - ⚄⚁ - ⚁⚄ - ⚀⚅ - ⚅⚀) 8 - 5/36 (⚃⚃ - ⚄⚂ - ⚂⚄ - ⚅⚁ - ⚁⚅) 9 - 4/36 (⚃⚄ - ⚄⚃ - ⚅⚂ - ⚂⚅) 10 - 3/36 (⚅⚃ - ⚃⚅ - ⚄⚄) 11 - 2/36 (⚅⚄ - ⚄⚅) 12 - 1/36 (⚅⚅) ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-47-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### 4. Inference #### 4.2. Empirical vs. 
theoretical moments * Because we know the data generating process of our random variable, we can compute its **expected value**: <p style = "margin-bottom:1cm;"></p> `$$\begin{align} \text{E}(x) = \frac{(2\times1) + (3\times2) + (4\times3) + (5\times4) + (6\times5) + (7\times6)}{36} +\\ \frac{(8\times5) + (9\times4) + (10\times3) + (11\times2) + (12\times1)}{36} = \frac{252}{36} = 7\end{align}$$` -- <p style = "margin-bottom:1cm;"></p> <ul> <li>This is the parameter we are actually interested in</li> <ul> <li>The <b>expected value</b> is what we call a <b>theoretical moment</b> <i>(the first one)</i></li> <li>While the <b>mean</b> is the corresponding <b>empirical moment</b></li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li><b>How confident</b> to be in our estimate of the expected value (i.e., <i>the mean</i>) depends on the <b>sample size</b></li> <ul> <li>For a given number of draws the mean won't necessarily be exactly 7</li> <li>But if we were to do <b>infinitely many draws</b>, the mean would <b>converge</b> towards 7 <i>(Law of Large Numbers)</i></li> </ul> </ul> --- ### 4. Inference #### 4.2. Empirical vs. 
theoretical moments * Just like the **mean** that we compute empirically is an estimate of the **first moment** of the distribution, * the **variance** that we compute empirically is an estimate of the **second moment** of the distribution -- .pull-left[ .pull-left[ <p style = "margin-bottom:4cm;"></p> <b>First moment:</b> <p style = "margin-bottom:3.5cm;"></p> <b>Second moment:</b> ] .pull-right[ <center><h4>Theoretical moment</h4></center> `$$\text{E}(x_{\text{discrete}}) = \sum_{i=1}^{k}x_ip_i$$` `$$\text{E}(x_{\text{continuous}}) = \int_{\text{R}}xf(x)dx$$` <p style = "margin-bottom:2cm;"></p> `$$\text{Var}(x) = \text{E}\left[(x - \text{E}(x))^2\right] \equiv \sigma^2$$` ] ] .pull-right[ <center><h4>Empirical moment</h4></center> <p style = "margin-bottom:1.9cm;"></p> `$$\bar{x} = \frac{1}{N}\sum_{i=1}^Nx_i$$` <p style = "margin-bottom:2.25cm;"></p> `$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N(x_i-\bar{x})^2$$` ] --- ### 4. Inference #### 4.2. Empirical vs. theoretical moments <p style = "margin-bottom:1cm;"></p> .pull-left[ <center><b>Expected value operations</b></center> <p style = "margin-bottom:2cm;"></p> `$$\begin{align} \text{E}[X + Y] = & \text{E}[X] + \text{E}[Y] \\[1em] \text{E}[aX] = & a\text{E}[X] \\[1em] \text{E}[a] = & a \\[1em] \text{E}[\text{E}[X]] = &\text{E}[X] \\[1em] \text{E}[XY] \neq & \text{E}[X]\text{E}[Y] \text{ unless } X \perp Y \end{align}$$` ] -- .pull-right[ <center><b>Variance operations</b></center> <p style = "margin-bottom:1.25cm;"></p> `$$\begin{align} \text{Var}(X) > & 0 \,\,\,\,\,\, \text{Var}(a) = 0 \\[1em] \text{Var}(X + a) = & \text{Var}(X) \\[1em] \text{Var}(aX) = & a^2\text{Var}(X) \\[1em] \text{Var}(aX + bY) = & a^2\text{Var}(X) + b^2\text{Var}(Y) +\\ & 2ab\text{Cov}(X, Y) \\[1em] \text{Var}(aX - bY) = & a^2\text{Var}(X) + b^2\text{Var}(Y) -\\ & 2ab\text{Cov}(X, Y) \end{align}$$` ] --- ### 4. Inference #### 4.3. 
Confidence interval <ul> <li>Because the mean is an empirical <b>estimation</b> of the theoretical expected value</li> <ul> <li>We need a measure of the <b>confidence</b> we can have in this estimation</li> <li>This is something we can do as long as our variable is <b>normally distributed</b> <i>(bell-shaped)</i></li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-48-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 4. Inference #### 4.3. Confidence interval <ul> <li>Indeed, with such distributions we can recover <b>something we know</b></li> </ul> <p style = "margin-bottom:1.95cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-49-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 4. Inference #### 4.3. Confidence interval <ul> <li>Indeed, with such distributions we can recover <b>something we know</b></li> <ul> <li>If we <b>divide</b> all the values of the variable by its <b>standard deviation</b>, the <b>variance becomes 1</b></li> </ul> </ul> <p style = "margin-bottom:1.2cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 4. Inference #### 4.3. Confidence interval <ul> <li>Indeed, with such distributions we can recover <b>something we know</b></li> <ul> <li>If we <b>divide</b> all the values of the variable by its <b>standard deviation</b>, the <b>variance becomes 1</b></li> <li>If we <b>subtract the mean</b> from all the values of the variable, the <b>mean becomes 0</b></li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-51-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 4. Inference #### 4.3. 
Confidence interval <p style = "margin-bottom:-.75cm;"></p> .pull-left[ <ul> <li>In mathematical notation, what we just saw writes:</li> </ul> `$$\frac{x - \text{E}(x)}{\text{SD}(x)} \sim \mathcal{N}(0, 1)$$` ] -- .pull-right[ <ul> <li>And if we compute means on random draws of \(x\)</li> <ul> <li>These means would behave the same way:</li> </ul> </ul> `$$\frac{\bar{x} - \text{E}(x)}{\text{SD}(x)} \sim \mathcal{N}(0, 1)$$` ] -- <p style = "margin-bottom:.5cm;"></p> <ul> <ul> <li>This is actually <b>true with</b> a theoretical <b>infinite sample</b> of \(x\) (i.e., the DGP)</li> <li>But <b>in practice</b>, we work with <b>finite samples</b> so things work slightly differently</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <ul> <li>When we have a limited number \(n\) of observations:</li> <ul> <li>We standardize using the <b>standard <i>error</i></b> of the mean \(\text{SE}(x) = \text{SD}(x)/\sqrt{n}\)</li> <li>And we know that:</li> </ul> </ul> <p style = "margin-bottom:-.25cm;"></p> `$$\frac{\bar{x} - \text{E}(x)}{\text{SE}(x)} \equiv t \sim t_{n-1}$$` <p style = "margin-bottom:.5cm;"></p> <ul> <ul> <li>Where \(t\) reads <i>"t-stat"</i> and \(t_{n-1}\) denotes a Student's \(t\) distribution with \(n-1\) degrees of freedom</li> </ul> </ul> --- ### 4. Inference #### 4.3. Confidence interval <ul> <li>The Student's \(t\) distribution is <b>very similar to the normal</b> distribution</li> <ul> <li>It is just <b>a bit flatter</b> when \(n\) is low</li> <li>But it <b>converges quickly</b> to a normal distribution as \(n \rightarrow \infty\)</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-52-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 4. Inference #### 4.3. 
Confidence interval <ul> <li>The good news is that:</li> <ul> <li>Because <b>we know</b> how \(t \equiv \frac{\bar{x} - \text{E}(x)}{\text{SE}(x)}\) is distributed (\(\sim t_{n-1}\))</li> <li>We also know <b>the chances</b> that \(\frac{\bar{x} - \text{E}(x)}{\text{SE}(x)}\) takes <b>certain values</b></li> </ul> </ul> <p style = "margin-bottom:.75cm;"></p> -- .pull-left[ <ul> <li>Consider a variable \(x \sim \mathcal{N}(\text{E}(x), \text{SD}(x)^2)\)</li> </ul> <ul> <ul> <li>We know that with \(n = 100\), \(\frac{\bar{x} - \text{E}(x)}{\text{SD}(x)/\sqrt{n}} \sim t_{99}\)</li> </ul> </ul> <ul> <ul> <li>And we know between which values lies a given share of the \(t_{99}\) distribution</li> </ul> </ul> <ul> <ul> <li>For instance, 95% of the distribution lies in</li> </ul> </ul> `$$[-t_{99, 97.5\%};\:t_{99, 97.5\%}]\: \approx\: [-1.98;\: 1.98]$$` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-53-1.png" width="150%" style="display: block; margin: auto;" /> ] --- ### 4. Inference #### 4.3. Confidence interval * In mathematical notation, what the previous graph shows can be written as: -- `$$\text{Pr}\left[-t_{99, 97.5\%}\leq\frac{\bar{x} - \text{E}(x)}{\text{SE}(x)}\leq t_{99, 97.5\%}\right] = 95\%$$` <p style = "margin-bottom:1cm;"></p> -- * Rearranging the terms yields: `$$\text{Pr}\left[\bar{x} - t_{99, 97.5\%}\times \text{SE}(x)\leq \text{E}(x) \leq\bar{x} + t_{99, 97.5\%}\times \text{SE}(x)\right] = 95\%$$` -- <p style = "margin-bottom:1.75cm;"></p> * Thus, we can say that there's a 95% chance for `\(\text{E}(x)\)` to be within: `$$\bar{x} \pm t_{99, 97.5\%}\times \text{SE}(x)$$` -- <p style = "margin-bottom:1.25cm;"></p> <center><i><b>➜ This is our 95% confidence interval of the mean!</b></i></center> --- ### 4. Inference #### 4.3. 
Confidence interval * We can apply these calculations **in R** to get a **95% CI of the mean** of the grade distribution
```r
grades <- c(20, 20, 17.5, 17.5, 16, 16, 16, 14.5, 14.5, 19.5, 19.5, 18.5, 18.5, 18.5)
```
<p style = "margin-bottom:-.55cm;"></p> --
```r
# Mean, standard deviation, and n
mean <- mean(grades)
sd <- sd(grades)
n <- length(grades)
```
<p style = "margin-bottom:-.55cm;"></p> --
```r
# Standard error
se <- sd / sqrt(n)
```
<p style = "margin-bottom:-.55cm;"></p> --
```r
# t-stat
# qt() returns the t quantile for probability 1 - (1 - CL) / 2 and n - 1 degrees of freedom
t <- qt(.975, n - 1)
```
<p style = "margin-bottom:-.55cm;"></p> --
```r
# Confidence interval
c(mean - t*se, mean + t*se)
```
```
## [1] 16.49665 18.71764
```
--- <h3>Overview</h3> <p style = "margin-bottom:1.5cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Graphical representation</li> <li>1.3. Common distributions</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Central tendency ✔</b></li> <ul style = "list-style: none"> <li>2.1. Mean</li> <li>2.2. Median</li> <li>2.3. Mean vs. median</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>3. Spread ✔</b></li> <ul style = "list-style: none"> <li>3.1. Range, quantiles, and the IQR</li> <li>3.2. Variance and standard deviation</li> <li>3.3. Standard deviation vs. IQR</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Inference ✔</b></li> <ul style = "list-style: none"> <li>4.1. Data generating process</li> <li>4.2. Empirical vs. theoretical moments</li> <li>4.3. Confidence interval</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>5. Wrap up!</b></li></ul> ] --- ### 5. Wrap up! #### 1. 
Distributions * The **distribution** of a variable documents all its possible values and how frequent they are -- <img src="slides_files/figure-html/unnamed-chunk-59-1.png" width="95%" style="display: block; margin: auto;" /> -- <p style = "margin-bottom:-1cm;"> * We can describe a distribution with: --- ### 5. Wrap up! #### 1. Distributions * The **distribution** of a variable documents all its possible values and how frequent they are <img src="slides_files/figure-html/unnamed-chunk-60-1.png" width="95%" style="display: block; margin: auto;" /> <p style = "margin-bottom:-1cm;"> * We can describe a distribution with: * Its **central tendency** --- ### 5. Wrap up! #### 1. Distributions * The **distribution** of a variable documents all its possible values and how frequent they are <img src="slides_files/figure-html/unnamed-chunk-61-1.png" width="95%" style="display: block; margin: auto;" /> <p style = "margin-bottom:-1cm;"> * We can describe a distribution with: * Its **central tendency** * And its **spread** --- ### 5. Wrap up! #### 2. Central tendency -- .pull-left[ * The **mean** is the sum of all values divided by the number of observations `$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$` ] -- .pull-right[ * The **median** is the value that divides the (sorted) distribution into two groups of equal size `$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\ \frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even} \end{cases}$$` ] -- #### 3. Spread -- .pull-left[ * The **standard deviation** is the square root of the average squared deviation from the mean `$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$` ] -- .pull-right[ <p style = "margin-bottom:-5.5cm;"></p> * The **interquartile range** is the difference between the maximum and the minimum value from the middle half of the distribution <p style = "margin-bottom:1cm;"></p> `$$\text{IQR} = Q_3 - Q_1$$` ] --- ### 5. Wrap up! #### 4. 
Inference <ul> <li>In Statistics, we view variables as a given realization of a <b>data generating process</b></li> <ul> <li>Hence, the <b>mean</b> is what we call an <b>empirical moment</b>, which is an <b>estimation</b>...</li> <li>... of the <b>expected value</b>, the <b>theoretical moment</b> of the DGP we're interested in</li> </ul> </ul> -- <ul> <li>To know how confident we can be in this estimation, we need to compute a <b>confidence interval</b></li> </ul> `$$[\bar{x} - t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}; \:\bar{x} + t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}]$$` -- <ul> <ul> <li>It gets <b>larger</b> as the <b>variance</b> of the distribution of \(x\) increases</li> <li>And gets <b>smaller</b> as the <b>sample size</b> \(n\) increases</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-62-1.png" width="95%" style="display: block; margin: auto;" />
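
---

### 5. Wrap up!

#### 4. Inference: a simulation sketch

* The last two properties (the interval widens with the variance and narrows with the sample size) can be checked on a DGP we know, the sum of two dice. The code below is only a sketch under stated assumptions and is not part of the original lecture code: `ci_mean()` and `two_dice()` are hypothetical helper names, and the seed and the sample sizes (20 and 2000) are arbitrary choices.

```r
# Sketch: 95% confidence interval of the mean for simulated draws of two dice
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: confidence interval of the mean at a given level
ci_mean <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                  # standard error of the mean
  tq <- qt(1 - (1 - level) / 2, n - 1)   # t quantile with n - 1 degrees of freedom
  c(lower = mean(x) - tq * se, upper = mean(x) + tq * se)
}

# Hypothetical helper: n draws of the sum of two fair dice (expected value 7)
two_dice <- function(n) {
  sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE)
}

small <- two_dice(20)    # few draws
large <- two_dice(2000)  # many draws

ci_small <- ci_mean(small)
ci_large <- ci_mean(large)

# Compare the two intervals and their widths
ci_small
ci_large
c(width_small = unname(diff(ci_small)), width_large = unname(diff(ci_large)))
```

* With many more draws, the interval should be much narrower and the mean closer to 7, in line with the Law of Large Numbers. Applied to the `grades` vector from section 4.3, `ci_mean(grades)` follows the same steps as the step-by-step computation shown there.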