class: center, middle, inverse, title-slide # Univariate regressions ## Lecture 8 ###
Louis SIRUGUE ### CPES 2 - Fall 2022 --- ### Part I recap #### Import data ```r fb <- read.csv("C:/User/Documents/ligue1.csv", encoding = "UTF-8") ``` -- <p style = "margin-bottom:1.5cm;"></p> #### Class ```r is.numeric("1.6180339") # What would be the output? ``` -- ``` ## [1] FALSE ``` -- <p style = "margin-bottom:1.5cm;"></p> #### Subsetting ```r fb$Home[3] ``` ``` ## [1] "Troyes" ``` --- ### Part I recap #### Distributions * The **distribution** of a variable documents all its possible values and how frequent they are -- <img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="95%" style="display: block; margin: auto;" /> -- <p style = "margin-bottom:-1cm;"> * We can describe a distribution with: --- ### Part I recap #### Distributions * The **distribution** of a variable documents all its possible values and how frequent they are <img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="95%" style="display: block; margin: auto;" /> <p style = "margin-bottom:-1cm;"> * We can describe a distribution with: * Its **central tendency** --- ### Part I recap #### Distributions * The **distribution** of a variable documents all its possible values and how frequent they are <img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="95%" style="display: block; margin: auto;" /> <p style = "margin-bottom:-1cm;"> * We can describe a distribution with: * Its **central tendency** * And its **spread** --- ### Part I recap #### Central tendency -- .pull-left[ * The **mean** is the sum of all values divided by the number of observations `$$\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i$$` ] -- .pull-right[ * The **median** is the value that divides the (sorted) distribution into two groups of equal size `$$\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\ \frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even} \end{cases}$$` ] -- #### Spread -- .pull-left[ * The **standard deviation** is the square root of the average
squared deviation from the mean `$$\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}$$` ] -- .pull-right[ <p style = "margin-bottom:-5.5cm;"></p> * The **interquartile range** is the difference between the maximum and the minimum value from the middle half of the distribution <p style = "margin-bottom:1cm;"></p> $$\text{IQR} = Q_3 - Q_1 $$ ] --- ### Part I recap #### Inference <ul> <li>In Statistics, we view variables as a given realization of a <b>data generating process</b></li> <ul> <li>Hence, the <b>mean</b> is what we call an <b>empirical moment</b>, which is an <b>estimation</b>...</li> <li>... of the <b>expected value</b>, the <b>theoretical moment</b> of the DGP we're interested in</li> </ul> </ul> -- <ul> <li>To know how confident we can be in this estimation, we need to compute a <b>confidence interval</b></li> </ul> `$$[\bar{x} - t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}; \:\bar{x} + t_{n-1, \:97.5\%}\times\frac{\text{SD}(x)}{\sqrt{n}}]$$` -- <ul> <ul> <li>It gets <b>larger</b> as the <b>variance</b> of the distribution of \(x\) increases</li> <li>And gets <b>smaller</b> as the <b>sample size</b> \(n\) increases</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="95%" style="display: block; margin: auto;" /> --- ### Part I recap #### Packages ```r library(dplyr) ``` -- <p style = "margin-bottom:1.5cm;"></p> #### Main dplyr functions .left-column[ <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Function </th> <th style="text-align:left;"> Meaning </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> mutate() </td> <td style="text-align:left;"> Modify or create a variable </td> </tr> <tr> <td style="text-align:left;"> select() </td> <td style="text-align:left;"> Keep a subset of variables </td> </tr> <tr> <td 
style="text-align:left;"> filter() </td> <td style="text-align:left;"> Keep a subset of observations </td> </tr> <tr> <td style="text-align:left;"> arrange() </td> <td style="text-align:left;"> Sort the data </td> </tr> <tr> <td style="text-align:left;"> group_by() </td> <td style="text-align:left;"> Group the data </td> </tr> <tr> <td style="text-align:left;"> summarise() </td> <td style="text-align:left;"> Summarizes variables into 1 observation per group </td> </tr> </tbody> </table> ] -- .right-column[ <img style = "margin-top:0cm; margin-left:1.5cm;" src = "pipe.png" width = "180"/> ] --- ### Part I recap #### Merge data ```r a <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c")) b <- data.frame(x = c(4, 5, 6), y = c("d", "e", "f")) c <- data.frame(x = 1:6, z = c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot")) ``` -- ```r a %>% bind_rows(b) %>% left_join(c, by = "x") ``` <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:right;"> x </th> <th style="text-align:left;"> y </th> <th style="text-align:left;"> z </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> alpha </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> b </td> <td style="text-align:left;"> bravo </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> c </td> <td style="text-align:left;"> charlie </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> delta </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> echo </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> foxtrot </td> </tr> </tbody> </table> --- ### 
Part I recap #### Reshape data <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> share_tertiary </th> <th style="text-align:right;"> share_gdp </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> FRA </td> <td style="text-align:right;"> 2015 </td> <td style="text-align:right;"> 44.69 </td> <td style="text-align:right;"> 3.40 </td> </tr> <tr> <td style="text-align:left;"> USA </td> <td style="text-align:right;"> 2015 </td> <td style="text-align:right;"> 46.52 </td> <td style="text-align:right;"> 3.21 </td> </tr> </tbody> </table> -- <p style = "margin-bottom:1.25cm;"></p> ```r data %>% pivot_longer(c(share_tertiary, share_gdp), names_to = "Variable", values_to = "Value") ``` <p style = "margin-bottom:1.25cm;"></p> -- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> Variable </th> <th style="text-align:right;"> Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> FRA </td> <td style="text-align:right;"> 2015 </td> <td style="text-align:left;"> share_tertiary </td> <td style="text-align:right;"> 44.69 </td> </tr> <tr> <td style="text-align:left;"> FRA </td> <td style="text-align:right;"> 2015 </td> <td style="text-align:left;"> share_gdp </td> <td style="text-align:right;"> 3.40 </td> </tr> <tr> <td style="text-align:left;"> USA </td> <td style="text-align:right;"> 2015 </td> <td style="text-align:left;"> share_tertiary </td> <td style="text-align:right;"> 46.52 </td> </tr> <tr> <td style="text-align:left;"> USA </td> <td style="text-align:right;"> 2015 </td> <td style="text-align:left;"> share_gdp </td> <td
style="text-align:right;"> 3.21 </td> </tr> </tbody> </table> --- ### Part I recap <p style = "margin-bottom:2cm;"> <center><h4> The 3 core components of the ggplot() function </h4></center> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Component </th> <th style="text-align:center;"> Contribution </th> <th style="text-align:center;"> Implementation </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Data </td> <td style="text-align:center;"> Underlying values </td> <td style="text-align:center;"> ggplot(data, | data %>% ggplot(., </td> </tr> <tr> <td style="text-align:left;"> Mapping </td> <td style="text-align:center;"> Axis assignment </td> <td style="text-align:center;"> aes(x = V1, y = V2, ...)) </td> </tr> <tr> <td style="text-align:left;"> Geometry </td> <td style="text-align:center;"> Type of plot </td> <td style="text-align:center;"> + geom_point() + geom_line() + ... 
</td> </tr> </tbody> </table> <p style = "margin-bottom:2cm;"> -- * Any **other element** should be added with a **`+` sign** ```r ggplot(data, aes(x = V1, y = V2)) + geom_point() + geom_line() + anything_else() ``` --- ### Part I recap .pull-left[ <p style = "margin-bottom:1.75cm;"> <center><h4> Main customization tools </h4></center> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Item to customize </th> <th style="text-align:left;"> Main functions </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Axes </td> <td style="text-align:left;"> scale_[x/y]_[continuous/discrete] </td> </tr> <tr> <td style="text-align:left;"> Baseline theme </td> <td style="text-align:left;"> theme_[void/minimal/.../dark]() </td> </tr> <tr> <td style="text-align:left;"> Annotations </td> <td style="text-align:left;"> geom_[[h/v]line/text](), annotate() </td> </tr> <tr> <td style="text-align:left;"> Theme </td> <td style="text-align:left;"> theme(axis.[line/ticks].[x/y] = ..., </td> </tr> </tbody> </table> ] -- .pull-right[ <center><h4> Main types of geometry </h4></center> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Geometry </th> <th style="text-align:center;"> Function </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Bar plot </td> <td style="text-align:center;"> geom_bar() </td> </tr> <tr> <td style="text-align:left;"> Histogram </td> <td style="text-align:center;"> geom_histogram() </td> </tr> <tr> <td style="text-align:left;"> Area </td> <td style="text-align:center;"> geom_area() </td> </tr> <tr> <td style="text-align:left;"> Line </td> <td style="text-align:center;"> geom_line() </td> </tr> <tr> <td style="text-align:left;"> Density </td> <td style="text-align:center;"> 
geom_density() </td> </tr> <tr> <td style="text-align:left;"> Boxplot </td> <td style="text-align:center;"> geom_boxplot() </td> </tr> <tr> <td style="text-align:left;"> Violin </td> <td style="text-align:center;"> geom_violin() </td> </tr> <tr> <td style="text-align:left;"> Scatter plot </td> <td style="text-align:center;"> geom_point() </td> </tr> </tbody> </table> ] --- ### Part I recap .pull-left[ <center><h4> Main types of aesthetics </h4></center> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Argument </th> <th style="text-align:left;"> Meaning </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> alpha </td> <td style="text-align:left;"> opacity from 0 to 1 </td> </tr> <tr> <td style="text-align:left;"> color </td> <td style="text-align:left;"> color of the geometry </td> </tr> <tr> <td style="text-align:left;"> fill </td> <td style="text-align:left;"> fill color of the geometry </td> </tr> <tr> <td style="text-align:left;"> size </td> <td style="text-align:left;"> size of the geometry </td> </tr> <tr> <td style="text-align:left;"> shape </td> <td style="text-align:left;"> shape for geometries like points </td> </tr> <tr> <td style="text-align:left;"> linetype </td> <td style="text-align:left;"> solid, dashed, dotted, etc. 
</td> </tr> </tbody> </table> ] -- .pull-right[ <p style = "margin-bottom:3.25cm;"></p> <ul> <li>If specified <b>in the geometry</b></li> <ul> <li>It will apply uniformly to <b>the whole geometry</b></li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>If assigned to a variable <b>in aes</b></li> <ul> <li>It will <b>vary with the variable</b> according to a scale documented in the legend</li> </ul> </ul> ] <br> -- ```r ggplot(data, aes(x = V1, y = V2, size = V3)) + geom_point(color = "steelblue", alpha = .6) ``` --- ### Part I recap #### R Markdown: Three types of content .left-column[ <img src = "report_example_3.png" width = "700"/> ] .right-column[ <p style = "margin-bottom:1.5cm;"> <b>YAML header</b> <p style = "margin-bottom:1.75cm;"> <b>Code chunks</b> <p style = "margin-bottom:5cm;"> <b>Text</b> ] --- ### Part I recap #### Useful features ➜ **Inline code** allows you to include the output of some **R code within text areas** of your report <p style = "margin-bottom:-.5cm;"> -- .pull-left[ <center> <h4> Syntax </h4> </center> ```r `paste("a", "b", sep = "-")` ``` ```r `r paste("a", "b", sep = "-")` ``` ] .pull-right[ <center> <h4> Output </h4> </center> `paste("a", "b", sep = "-")` <p style = "margin-bottom:1cm;"> a-b ] <p style = "margin-bottom:2cm;"> -- ➜ **`kable()`** for clean **html tables** and **`datatable()`** to navigate **large tables** ```r kable(results_table) datatable(results_table) ``` --- ### Part I recap #### LaTeX for equations * `\(\LaTeX\)` is a convenient way to display **mathematical** symbols and to structure **equations** * The **syntax** is mainly based on **backslashes \ and braces {}** -- <p style = "margin-bottom:1cm;"> ➜ What you **type** in the text area: `$x \neq \frac{\alpha \times \beta}{2}$` ➜ What is **rendered** when knitting the document: `\(x \neq \frac{\alpha \times \beta}{2}\)` -- <p style = "margin-bottom:1.5cm;"> <center>To <b>include</b> a <b>LaTeX equation</b> in R Markdown, you simply have to surround it
with the <b>$ sign</b></center> <p style = "margin-bottom:0cm;"> .pull-left[ <h4 style = "margin-bottom:0cm;">The mean formula with one `$` on each side</h4> ➜ For inline equations `\(\overline{x}=\frac{1}{N}\sum_{i=1}^N x_i\)` ] .pull-right[ <h4 style = "margin-bottom:0cm;">The mean formula with two `$` on each side</h4> ➜ For large/emphasized equations `$$\overline{x}=\frac{1}{N}\sum_{i=1}^N x_i$$` ] --- <h3>Today: <i>We start Econometrics!</i></h3> -- <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Joint distributions</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Univariate regressions</b></li> <ul style = "list-style: none"> <li>2.1. Introduction to regressions</li> <li>2.2. Coefficients estimation</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Binary variables</b></li> <ul style = "list-style: none"> <li>3.1. Binary dependent variables</li> <li>3.2. Binary independent variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Today: <i>We start Econometrics!</i></h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Joint distributions</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> ] --- ### 1. Joint distributions #### 1.1. 
Definition <ul> <li>The <b>joint distribution</b> shows the <b>values</b> and associated <b>frequencies</b> for <b>two variables</b> simultaneously</li> <ul> <li>Remember how the <b>density</b> could represent the distribution of a <b>single variable</b></li> </ul> </ul> <p style = "margin-bottom: 2cm;"></p> -- .pull-left[ <p style = "margin-bottom: 1cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### 1. Joint distributions #### 1.1. Definition <ul> <li>The <b>joint distribution</b> shows the <b>values</b> and associated <b>frequencies</b> for <b>two variables</b> simultaneously</li> <ul> <li>Remember how the <b>density</b> could represent the distribution of a <b>single variable</b></li> <li>The <b>joint density</b> can represent the joint distribution of <b>two variables</b></li> </ul> </ul> <p style = "margin-bottom: 1.25cm;"></p> .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[
] --- ### 1. Joint distributions #### 1.2. Covariance <ul> <li>When describing a <b>single distribution</b>, we're interested in its <b>spread</b> and <b>central tendency</b></li> <li>When describing a <b>joint distribution</b>, we're interested in the <b>relationship</b> between the two variables</li> <ul> <li>This can be characterized by the <b><i>covariance</i></b></li> </ul> </ul> -- $$ \text{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i − \bar{x})(y_i − \bar{y}) $$ -- .pull-left[ <p style = "margin-bottom: -.5cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-32-1.png" width="97%" style="display: block; margin: auto;" /> ] .pull-right[ <p style = "margin-bottom: 1.25cm;"></p> <center><i>If <b>y</b> tends to be <b>large</b> relative to its mean when <b>x</b> is <b>large</b> relative to its mean, their <b>covariance</b> is <b>positive</b></i></center> <p style = "margin-bottom: 1.25cm;"></p> <center><i>Conversely, if <b>one</b> tends to be <b>large</b> when the <b>other</b> tends to be <b>low</b>, the <b>covariance</b> is <b>negative</b></i></center> ] --- ### 1. Joint distributions #### 1.2. Covariance <img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 1. Joint distributions #### 1.2. Covariance `$$\begin{align} \text{Cov}(X, a) = & 0\\[1.5em] \text{Cov}(X, X) = & \text{Var}(X)\\[1.5em] \text{Cov}(X, Y) = & \text{Cov}(Y, X)\\[1.5em] \text{Cov}(aX, bY) = & ab\text{Cov}(X, Y)\\[1.5em] \text{Cov}(X + a, Y + b) = & \text{Cov}(X, Y)\\[1.5em] \text{Cov}(aX + bY, cW + dZ) = & ac\text{Cov}(X, W) + ad\text{Cov}(X, Z) + \\ & bc\text{Cov}(Y, W) + bd\text{Cov}(Y, Z) \end{align}$$` --- ### 1. Joint distributions #### 1.3. 
Correlation <ul> <li>One disadvantage of the <b>covariance</b> is that it is <b>not standardized</b></li> <ul> <li>You <b>cannot</b> directly <b>compare</b> the covariance of two pairs of completely different variables</li> <li>The same distance variable will have a larger covariance when measured in centimeters than in meters</li> </ul> </ul> <p style = "margin-bottom: 1cm;"></p> -- <center>➜ Theoretically the <b>covariance</b> can take <b>values</b> from \(-\infty\) to \(+\infty\)</center> <p style = "margin-bottom: 1.5cm;"></p> -- <ul> <li>To <b>net out</b> the covariance from the <b>unit</b> of the data, we can <b>divide</b> it by \(\text{SD}(x)\times\text{SD}(y)\)</li> <ul> <li>We call this <b>standardized</b> measure the <b>correlation</b></li> <li>Correlation coefficients are <b>comparable</b> because they are independent of the unit of the data</li> </ul> </ul> -- `$$\text{Corr}(x, y) = \frac{\text{Cov}(x, y)}{\text{SD}(x)\times\text{SD}(y)}$$` <p style = "margin-bottom: 1cm;"></p> -- <center>➜ The <b>correlation</b> coefficient is bounded, taking <b>values</b> from \(-1\) to \(1\)</center> --- ### 1. Joint distributions #### 1.3. Correlation <img src="slides_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 1. Joint distributions <center><i><b>➜ But the same correlation can hide very different relationships</b></i></center> <img src="slides_files/figure-html/unnamed-chunk-35-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 1.
Joint distributions <center><i><b>➜ Covariance and correlation in R</b></i></center> ```r x <- c(50, 70, 60, 80, 60) y <- c(10, 30, 20, 30, 40) ``` -- <p style = "margin-bottom:1.5cm;"> * The <b>covariance</b> can be obtained with the function `cov()` ```r cov(x, y) ``` ``` ## [1] 70 ``` <p style = "margin-bottom:1.5cm;"> -- * The <b>correlation</b> can be obtained with the function `cor()` ```r cor(x, y) ``` ``` ## [1] 0.5384615 ``` --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Joint distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Univariate regressions</b></li> <ul style = "list-style: none"> <li>2.1. Introduction to regressions</li> <li>2.2. Coefficients estimation</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Binary variables</b></li> <ul style = "list-style: none"> <li>3.1. Binary dependent variables</li> <li>3.2. Binary independent variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Joint distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Univariate regressions</b></li> <ul style = "list-style: none"> <li>2.1. Introduction to regressions</li> <li>2.2. Coefficients estimation</li> </ul> </ul> ] --- ### 2. Univariate regressions #### 2.1.
Introduction to regressions .pull-left[ * Consider the following dataset ```r ggcurve <- read.csv("ggcurve.csv") kable(head(ggcurve, 5), caption = "First 5 rows") ``` <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>First 5 rows</caption> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> ige </th> <th style="text-align:right;"> gini </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> 0.38 </td> </tr> <tr> <td style="text-align:left;"> Norway </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> 0.33 </td> </tr> <tr> <td style="text-align:left;"> Finland </td> <td style="text-align:right;"> 0.18 </td> <td style="text-align:right;"> 0.38 </td> </tr> <tr> <td style="text-align:left;"> Canada </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.46 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.44 </td> </tr> </tbody> </table> ] -- .pull-right[ <p style = "margin-bottom:2cm;"></p> The data contains <b>2 variables</b> at the <b>country level</b>: <p style = "margin-bottom:1cm;"> <ul style = "margin-left:-.5cm;list-style: none"> <li>1. <b>IGE:</b> Intergenerational elasticity, which captures</li> </ul> <p style = "margin-left:2.38cm; margin-top:-.5cm">the average % increase in child income for</p> <p style = "margin-left:2.38cm; margin-top:-.5cm">a 1% increase in parental income</p> <p style = "margin-bottom:1.5cm;"> <ul style = "margin-left:-.5cm;list-style: none"> <li>2. <b>Gini:</b> Gini index of income inequality between</li> </ul> <p style = "margin-left:2.53cm; margin-top:-.5cm">0: everybody has the same income</p> <p style = "margin-left:2.53cm; margin-top:-.5cm">1: a single individual has all the income</p> ] --- ### 2.
Univariate regressions #### 2.1. Introduction to regressions * To investigate the **relationship** between these two variables, we can start with a **scatterplot** -- ```r ggplot(ggcurve, aes(x = gini, y = ige, label = country)) + geom_text() ``` <img src="slides_files/figure-html/unnamed-chunk-41-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Univariate regressions #### 2.1. Introduction to regressions <ul> <li>We see that the two variables are <b>positively correlated</b> with each other:</li> <ul> <li>When <b>one</b> tends to be <b>high</b> relative to its mean, <b>the other as well</b></li> <li>When <b>one</b> tends to be <b>low</b> relative to its mean, <b>the other as well</b></li> </ul> </ul> -- <p style = "margin-bottom:1cm;"> ```r cor(ggcurve$gini, ggcurve$ige) ``` ``` ## [1] 0.6517277 ``` <p style = "margin-bottom:1cm;"> <ul> <li>The <b>correlation</b> coefficient is equal to <b>.65</b></li> <ul> <li>Remember that the correlation can take values from -1 to 1</li> <li>Here the correlation is indeed <b>positive</b> and <b>fairly strong</b></li> </ul> </ul> <p style = "margin-bottom:1cm;"> -- <ul> <li>But how useful is this for real-life applications? We may want more <b>practical</b> information:</li> <ul> <li>Like by how much \(y\) is <b>expected</b> to <b>increase</b> for a given change in \(x\)</li> <li>This is of particular interest for economists and <b>policy</b> makers</li> </ul> </ul> --- <style> .left-column {width: 70%;} .right-column {width: 30%;} </style> ### 2. Univariate regressions #### 2.1.
Introduction to regressions * Consider these two relationships: .left-column[ <img src="slides_files/figure-html/unnamed-chunk-43-1.png" width="90%" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ <p style = "margin-bottom:2.5cm;"> ➜ One is less noisy but flatter <p style = "margin-bottom:.5cm;"> ➜ One is noisier but steeper <p style = "margin-bottom:1.5cm;"> <h4>Both have a correlation of .75</h4> ] --- ### 2. Univariate regressions #### 2.1. Introduction to regressions * Consider these two relationships: .left-column[ <img src="slides_files/figure-html/unnamed-chunk-44-1.png" width="90%" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ <p style = "margin-bottom:3cm;"> <center><b><i>But a given increase in \(x\)</i></b></center> <center><b><i>is not associated with</i></b></center> <center><b><i>the same increase in \(y\)!</i></b></center> ] --- ### 2. Univariate regressions #### 2.1. Introduction to regressions <ul> <li>Knowing that income inequality is <b>negatively correlated</b> with intergenerational mobility is one thing</li> </ul> <p style = "margin-bottom:1.25cm;"> -- <ul> <li>But how much more intergenerational mobility could we expect for a given reduction in inequality?</li> <ul> <li>We need to characterize the <b><i>"steepness"</i></b> of the relationship!</li> </ul> </ul> <p style = "margin-bottom:1.25cm;"> -- <ul> <li>These are usually the <b>types of questions</b> we're interested in:</li> <ul> <li><i>How much more should I expect to earn for an additional year of education?</i></li> <li><i>By how many years would life expectancy be expected to decrease for a given increase in air pollution?</i></li> <li><i>By how much would test scores increase for a given decrease in the number of students per teacher?</i></li> </ul> </ul> <p style = "margin-bottom:1.25cm;"> -- * And once again, this is typically what is of interest for <b>policymakers</b> <p style = "margin-bottom:1.25cm;"> -- <center><h4><i>➜ But how to
compute this expected change in \(y\) for a given change of \(x\)?</i></h4></center> --- ### 2. Univariate regressions #### 2.2. Coefficients estimation <ul> <li>The idea is to find the <b>line that fits the data</b> the best</li> <ul> <li>Such that its <b>slope</b> can indicate how we <b>expect y to change</b> if we <b>increase x by 1</b> unit</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-45-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 2. Univariate regressions #### 2.2. Coefficients estimation * But how do we <b>find that line?</b> -- <img src="slides_files/figure-html/unnamed-chunk-46-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 2. Univariate regressions #### 2.2. Coefficients estimation * We try to <b>minimize the distance</b> between each point and our line <img src="slides_files/figure-html/unnamed-chunk-47-1.png" width="100%" style="display: block; margin: auto;" /> --- ### 2. Univariate regressions #### 2.2. Coefficients estimation .pull-left[ <p style = "margin-bottom:1cm;"></p> Take for instance the 20<sup>th</sup> observation: Peru <img src="slides_files/figure-html/unnamed-chunk-48-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ And consider the following **notations**: * We denote `\(y_i\)` the ige of the `\(i^{\text{th}}\)` country * We denote `\(x_i\)` the gini of the `\(i^{\text{th}}\)` country * We denote `\(\widehat{y_i}\)` the value of the `\(y\)` coordinate of our line for `\(x = x_i\)` <p style = "margin-bottom:1.25cm;"></p> <center>➜ The distance between the \(i^{\text{th}}\) y value and the line is \(y_i - \widehat{y_i}\)</center> <p style = "margin-bottom:1.25cm;"></p> * We label that distance `\(\widehat{\varepsilon_i}\)` ] --- ### 2. Univariate regressions #### 2.2. 
Coefficients estimation .pull-left[ <p style = "margin-bottom:2.375cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-49-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ * `\(\widehat{\varepsilon_i}\)` being the distance between a point `\(y_i\)` and its corresponding value on the line `\(\widehat{y_i}\)`, we can write: `$$y_i = \widehat{y_i} + \widehat{\varepsilon_i}$$` <p style = "margin-bottom:1cm;"></p> * And because `\(\widehat{y_i}\)` is a **straight line**, it can be expressed as `$$\widehat{y_i} = \hat{\alpha} + \hat{\beta}x_i$$` <p style = "margin-bottom:1cm;"></p> * Where: * `\(\hat{\alpha}\)` is the **intercept** * `\(\hat{\beta}\)` is the **slope** ] --- ### 2. Univariate regressions #### 2.2. Coefficients estimation * **Combining** these two **definitions** yields the equation: `$$y_i = \hat{\alpha} + \hat{\beta}x_i + \widehat{\varepsilon_i} \begin{cases} y_i = \widehat{y_i} + \widehat{\varepsilon_i}& \text{Definition of distance}\\ \widehat{y_i} = \hat{\alpha} + \hat{\beta}x_i & \text{Definition of the line} \end{cases}$$` -- <p style = "margin-bottom:1.25cm;"></p> * Depending on the values of `\(\hat{\alpha}\)` and `\(\hat{\beta}\)`, the value of every `\(\widehat{\varepsilon_i}\)` will change -- <p style = "margin-bottom:-.5cm;"></p> .left-column[ <img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="90%" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ <p style = "margin-bottom:-.75cm;"></p> **Attempt 1:** `\(\hat{\alpha}\)` is too high and `\(\hat{\beta}\)` is <p style = "margin-left:2.97cm;margin-top:-.5cm;">too low ➜ \(\widehat{\varepsilon_i}\) are large</p> **Attempt 2:** `\(\hat{\alpha}\)` is too low and `\(\hat{\beta}\)` is <p style = "margin-left:2.97cm;margin-top:-.5cm;">too high ➜ \(\widehat{\varepsilon_i}\) are large</p> **Attempt 3:** both `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` seem <p style = "margin-left:2.97cm;margin-top:-.5cm;">right ➜ \(\widehat{\varepsilon_i}\) 
are low</p> ] --- ### 2. Univariate regressions #### 2.2. Coefficients estimation * We want to find the values of `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` that **minimize** the overall **distance** between the points and the line -- `$$\min_{\hat{\alpha}, \hat{\beta}}\sum_{i=1}^{n}\widehat{\varepsilon_i}^2$$` <ul> <ul> <li>Note that we square \(\widehat{\varepsilon_i}\) so that its positive and negative values do not cancel each other out</li> <li>This method is what we call <b>Ordinary Least Squares (OLS)</b></li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> -- * To solve this **optimization problem**, we need to express `\(\widehat{\varepsilon_i}\)` in terms of `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` `$$y_i = \hat{\alpha} + \hat{\beta}x_i + \widehat{\varepsilon_i}$$` `$$\Longleftrightarrow$$` `$$\widehat{\varepsilon_i} = y_i -\hat{\alpha} - \hat{\beta}x_i$$` --- ### 2. Univariate regressions #### 2.2. Coefficients estimation * And our minimization problem becomes `$$\min_{\hat{\alpha}, \hat{\beta}}\sum_{i=1}^{n}(y_i -\hat{\alpha} - \hat{\beta}x_i)^2$$` -- $$ `\begin{align} \frac{\partial}{\partial\hat{\alpha}} = 0 & \:\: \Longleftrightarrow \:\: -2\sum_{i=1}^n(y_i - \hat{\alpha} - \hat{\beta}x_i) = 0 \\ \frac{\partial}{\partial\hat{\beta}} = 0 & \:\: \Longleftrightarrow \:\: -2\sum_{i=1}^n x_i(y_i - \hat{\alpha} - \hat{\beta}x_i) = 0 \end{align}` $$ <p style = "margin-bottom:1cm;"></p> -- * Rearranging the first equation yields `$$\sum_{i=1}^ny_i - n\hat{\alpha} - \sum_{i=1}^n\hat{\beta}x_i = 0 \:\: \Longleftrightarrow \:\: \hat{\alpha} =\bar{y} - \hat{\beta}\bar{x}$$` --- ### 2. Univariate regressions #### 2.2.
Coefficients estimation * Replacing `\(\hat{\alpha}\)` in the second equation by its new expression gives `$$-2\sum_{i=1}^nx_i(y_i - \hat{\alpha} - \hat{\beta}x_i) = 0 \:\: \Longleftrightarrow \:\: -2\sum_{i=1}^nx_i\left[y_i - (\bar{y} - \hat{\beta}\bar{x}) - \hat{\beta}x_i\right] = 0$$` -- <p style = "margin-bottom:1.25cm;"></p> * And by rearranging the terms we obtain `$$\hat{\beta} = \frac{\sum_{i = 1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i = 1}^n(x_i-\bar{x})^2}$$` -- <p style = "margin-bottom:1.25cm;"></p> * Notice that multiplying the numerator and the denominator by `\(1/n\)` yields: `$$\hat{\beta} = \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)} \:\:\:\:\:\:\:\:\: ; \:\:\:\:\:\:\:\:\: \hat{\alpha} = \bar{y} - \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)} \times\bar{x}$$` --- class: inverse, hide-logo ### Practice #### 1) Import `ggcurve.csv` and compute the `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` coefficients of that equation: `$$\text{IGE}_i = \hat{\alpha} + \hat{\beta}\times\text{gini}_i + \widehat{\varepsilon_i}$$` -- <p style = "margin-bottom:1cm;"></p> #### 2) Create a new variable in the dataset for `\(\widehat{\text{IGE}}\)` -- <p style = "margin-bottom:1cm;"></p> #### 3) Plot your results (scatter plot + line) *Hints: You can use different y variables for different geometries by specifying the mapping within the geometry function:* <p style = "margin-bottom:-.5cm;"></p> <center><i>geom_point(aes(y = y))</i></center> <p style = "margin-bottom:1cm;"></p> `$$\hat{\beta} = \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)} \:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\: \hat{\alpha} = \bar{y} - \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)} \times\bar{x}$$` -- <p style = "margin-bottom:1.25cm;"></p> <center><h3><i>You've got 10 minutes!</i></h3></center>
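---

### 2. Univariate regressions
#### 2.2. Coefficients estimation

* The minimization view of OLS can also be checked **numerically**: the sketch below simulates data with a known intercept and slope *(an illustrative example, not the `ggcurve` data)* and lets `optim()` minimize the sum of squared residuals

```r
# Simulate data with known intercept (2) and slope (3)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# Sum of squared residuals as a function of candidate (alpha, beta)
ssr <- function(par) sum((y - par[1] - par[2] * x)^2)

# Numerical minimization, starting from (0, 0)
optim(c(0, 0), ssr)$par
```

* The two values returned by `optim()` are close to the true intercept and slope, and coincide with the closed-form `\(\text{Cov}/\text{Var}\)` solution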
--- class: inverse, hide-logo ### Solution #### 1) Import `ggcurve.csv` and compute the `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` coefficients of that equation: -- ```r # Read the data ggcurve <- read.csv("ggcurve.csv") # Compute beta beta <- cov(ggcurve$gini, ggcurve$ige) / var(ggcurve$gini) # Compute alpha alpha <- mean(ggcurve$ige) - (beta * mean(ggcurve$gini)) ``` -- ```r c(alpha, beta) ``` ``` ## [1] -0.09129311 1.01546204 ``` -- #### 2) Create a new variable in the dataset for `\(\widehat{\text{IGE}}\)` -- ```r ggcurve <- ggcurve %>% mutate(fit = alpha + beta * gini) ``` --- class: inverse, hide-logo ### Solution #### 3) Plot your results (scatter plot + line) -- ```r ggplot(ggcurve, aes(x = gini)) + geom_point(aes(y = ige)) + geom_line(aes(y = fit)) ``` -- <img src="slides_files/figure-html/unnamed-chunk-55-1.png" width="57%" style="display: block; margin: auto;" /> --- ### 2. Univariate regressions #### 2.2. Coefficients estimation * As usual there are <b>functions</b> to do that <b>in R</b> -- .pull-left[ <ul> <li><b>lm()</b> to estimate regression coefficients</li> <li>It has two main <b>arguments:</b></li> <ul> <li><b>Formula:</b> written as <b>y ~ x</b></li> <li><b>Data:</b> where y and x are</li> </ul> </ul> ```r lm(ige ~ gini, ggcurve) ``` ``` ## ## Call: ## lm(formula = ige ~ gini, data = ggcurve) ## ## Coefficients: ## (Intercept) gini ## -0.09129 1.01546 ``` ] -- .pull-right[ <ul> <li><b>geom_smooth()</b> to plot the fit</li> </ul> ```r ggplot(ggcurve, aes(x = gini, y = ige)) + geom_point() + geom_smooth(method = "lm", formula = y ~ x) ``` <img src="slides_files/figure-html/unnamed-chunk-57-1.png" width="80%" style="display: block; margin: auto;" /> ] --- class: inverse, hide-logo <center><h3> Vocabulary </h3></center> <p style = "margin-bottom:1.5cm;"></p> * This equation we're working on is called a <b>regression model</b> `$$y_i = \hat{\alpha} + \hat{\beta}x_i + \widehat{\varepsilon_i}$$` -- <ul> <ul> <li> We say that we <b>regress \(y\) on 
\(x\)</b> to find the coefficients \(\hat{\alpha}\) and \(\hat{\beta}\) that characterize the regression line</li> <li> We often call \(\hat{\alpha}\) and \(\hat{\beta}\) <b>parameters</b> of the regression because we tune them to fit our model to the data</li> </ul> </ul> <p style = "margin-bottom:1.25cm;"></p> -- <ul> <li>We also have different names for the \(x\) and \(y\) variables</li> <ul> <li> \(y\) is called the <b>dependent</b> or <b>explained</b> variable</li> <li> \(x\) is called the <b>independent</b> or <b>explanatory</b> variable</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"></p> * We call `\(\widehat{\varepsilon_i}\)` the <b>residuals</b> because they are what is left after we fitted the line to the data as well as we could -- <p style = "margin-bottom:1.25cm;"></p> * And `\(\hat{y_i} = \hat{\alpha} + \hat{\beta}x_i\)`, i.e., the values on the regression line for each `\(x_i\)`, are called the <b>fitted values</b> --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Joint distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Univariate regressions ✔</b></li> <ul style = "list-style: none"> <li>2.1. Introduction to regressions</li> <li>2.2. Coefficients estimation</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Binary variables</b></li> <ul style = "list-style: none"> <li>3.1. Binary dependent variables</li> <li>3.2. Binary independent variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1.
Joint distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Univariate regressions ✔</b></li> <ul style = "list-style: none"> <li>2.1. Introduction to regressions</li> <li>2.2. Coefficients estimation</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Binary variables</b></li> <ul style = "list-style: none"> <li>3.1. Binary dependent variables</li> <li>3.2. Binary independent variables</li> </ul> </ul> ] --- ### 3. Binary variables #### 3.1. Binary dependent variables <ul> <li><b>So far</b> we've considered only <b>continuous variables</b> in our regression models</li> <ul> <li>But what if our <b>dependent</b> variable is <b>discrete?</b></li> </ul> </ul> -- <ul> <li>Consider that we have data on candidates to a job:</li> <ul> <li>Their <i>Baccalauréat</i> grade (/20) </li> <li>Whether they got accepted</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"> <img src="slides_files/figure-html/unnamed-chunk-58-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 3. Binary variables #### 3.1. 
Binary dependent variables <ul> <li>Even if the <b>outcome variable</b> is binary we can regress it on the grade variable</li> <ul> <li>We can convert it into a <b>dummy</b> variable, a variable taking either the value <b>0 or 1</b></li> <li>Here consider a dummy variable taking the value 1 if the person was accepted</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"> `$$1\{y_i = \text{Accepted}\} = \hat{\alpha} + \hat{\beta} \times \text{Grade}_i + \hat{\varepsilon_i}$$` <p style = "margin-bottom:1cm;"> <style> .left-column {width: 65%;} .right-column {width: 35%;} </style> .left-column[ <img src="slides_files/figure-html/unnamed-chunk-59-1.png" width="85%" style="display: block; margin: auto;" /> ] -- .right-column[ <p style = "margin-bottom:2cm;"> <center><h4><i> ➜ How would you interpret the beta coefficient from this regression?</i></h4></center> ] --- ### 3. Binary variables #### 3.1. Binary dependent variables <ul> <li>The <b>fitted values</b> can be viewed as the <b>probability</b> to be accepted for a given grade</li> <ul> <li>\(\hat{\beta}\) is thus by how much this probability would vary on expectation for a 1 point increase in the grade</li> <li>That's why we call OLS regression models with a binary outcome <b>Linear <i>Probability</i> Models</b></li> </ul> </ul> <p style = "margin-bottom:1.35cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-60-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 3. Binary variables #### 3.1. Binary dependent variables <ul> <li>But what kind of <b>problems</b> could we encounter with <b>such models?</b></li> <ul> <li>What would be the \(\hat{\alpha}\) coefficient here?</li> <li>And what's the probability to be accepted for a grade of 18?</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-61-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 3. Binary variables #### 3.1. 
Binary dependent variables <ul> <li>With an <b>LPM</b> you can end up with <b><i>"probabilities"</i></b> that are <b>lower than 0</b> and <b>greater than 1</b></li> <ul> <li><b>Interpretation</b> is only <b>valid</b> for values of x sufficiently <b>close to the mean</b></li> <li>Keep that in mind and be <b>careful</b> when interpreting the results of an LPM</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-62-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 3. Binary variables #### 3.2. Binary independent variables <ul> <li>Now consider that we have individual <b>data</b> containing</li> <ul> <li>The <b>sex</b></li> <li>The <b>height</b> (centimeters)</li> </ul> </ul> -- <p style = "margin-bottom:1.5cm;"> <ul> <li>So the situation is different</li> <ul> <li>We used to have a <b>binary dependent variable:</b></li> </ul> </ul> `$$1\{y_i = \text{Accepted}\} = \hat{\alpha} + \hat{\beta} \times \text{Grade}_i + \hat{\varepsilon_i}$$` <ul> <ul> <li>We now have a <b>binary independent variable:</b></li> </ul> </ul> `$$\text{Height}_i = \hat{\alpha} + \hat{\beta} \times 1\{\text{Sex}_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- <p style = "margin-bottom:1.25cm;"> <center><h4><i> ➜ How would you interpret the coefficient \(\hat{\beta}\) from this regression?</i></h4></center> --- ### 3. Binary variables #### 3.2. Binary independent variables <ul> <li>If the sex variable was <b>continuous</b> it would be the expected increase in height for a <b><i>"1 unit increase"</i></b> in sex</li> <ul> <li>Here the <b><i>"1 unit increase"</i></b> is switching from 0 to 1, i.e. <b>from female to male</b></li> <li>With that in mind, how would you interpret the coefficient \(\hat{\beta}\)?</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-63-1.png" width="50%" style="display: block; margin: auto;" /> --- ### 3. Binary variables #### 3.2. 
Binary independent variables <ul> <li>If I replace the point geometry by the corresponding <b>boxplots</b></li> <ul> <li>What this <b><i>"1 unit increase"</i></b> corresponds to should be <b>clearer</b></li> <li>The coefficient \(\hat{\beta}\) is actually the <b>difference</b> between the <b>average height</b> for males and females</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-64-1.png" width="50%" style="display: block; margin: auto;" /> --- ### 3. Binary variables #### 3.2. Binary independent variables .pull-left[ `\(\overline{\text{Height}_{\left[\text{Sex}_i = \text{Female}\right]}} = 165\)` `\(\overline{\text{Height}_{\left[\text{Sex}_i = \text{Male}\right]}} = 176\)` <p style = "margin-bottom:1.5cm;"> `$$\text{Height}_i = \hat{\alpha} + \hat{\beta} \times 1\{\text{Sex}_i = \text{Male}\} + \hat{\varepsilon_i}$$` `$$\hat{\alpha} = 165 \:\:\:\:\:\:\:\:\:\:\:\:\:\: \hat{\beta} = 11$$` <p style = "margin-bottom:1.5cm;"> `$$\text{Height}_i = \hat{\alpha} + \hat{\beta} \times 1\{\text{Sex}_i = \text{Female}\} + \hat{\varepsilon_i}$$` `$$\hat{\alpha} = 176 \:\:\:\:\:\:\:\:\:\:\:\:\:\: \hat{\beta} = -11$$` ] .pull-right[ <p style = "margin-bottom:1cm;"> <img src="slides_files/figure-html/unnamed-chunk-65-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### 3. Binary variables #### 3.2. 
Binary independent variables * In terms of <b>fitted values:</b> `$$\text{Height}_i = \hat{\alpha} + \hat{\beta} \times 1\{\text{Sex}_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- * We now have `\(\hat{\alpha}\)` and `\(\hat{\beta}\)`: `$$\text{Height}_i = 165 + 11 \times 1\{\text{Sex}_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- * The fitted values write: `$$\widehat{\text{Height}_i} = 165 + 11 \times 1\{\text{Sex}_i = \text{Male}\}$$` -- .pull-left[ * When the dummy equals 0 *(females)*: `$$\begin{align} \widehat{\text{Height}_i} & = 165 + 11 \times 0\\ &= 165 =\overline{\text{Height}_{\left[\text{Sex}_i = \text{Female}\right]}} \end{align}$$` ] -- .pull-right[ * When the dummy equals 1 *(males)*: `$$\begin{align}\widehat{\text{Height}_i} & = 165 + 11 \times 1\\ &= 176 =\overline{\text{Height}_{\left[\text{Sex}_i = \text{Male}\right]}}\end{align}$$` ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Joint distributions ✔</b></li> <ul style = "list-style: none"> <li>1.1. Definition</li> <li>1.2. Covariance</li> <li>1.3. Correlation</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Univariate regressions ✔</b></li> <ul style = "list-style: none"> <li>2.1. Introduction to regressions</li> <li>2.2. Coefficients estimation</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Binary variables ✔</b></li> <ul style = "list-style: none"> <li>3.1. Binary dependent variables</li> <li>3.2. Binary independent variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- ### 4. Wrap up! #### 1. 
Joint distribution <center>The <b>joint distribution</b> shows the possible <b>values</b> and associated <b>frequencies</b> for <b>two variables</b> simultaneously</center> -- .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-66-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <p style = "margin-bottom:1.5cm;"></p>
] --- ### 4. Wrap up! #### 1. Joint distribution <center><h4> ➜ <i> When describing a joint distribution, we're interested in the relationship between the two variables </i></h4></center> <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>The <b>covariance</b> quantifies the joint deviation of two variables from their respective means</li> <ul> <li>It can take values from \(-\infty\) to \(\infty\) and depends on the unit of the data</li> </ul> </ul> $$ \text{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$ <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>The <b>correlation</b> is the covariance of two variables divided by the product of their standard deviations</li> <ul> <li>It can take values from \(-1\) to \(1\) and is independent of the unit of the data</li> </ul> </ul> `$$\text{Corr}(x, y) = \frac{\text{Cov}(x, y)}{\text{SD}(x)\times\text{SD}(y)}$$` --- ### 4. Wrap up! #### 2. Regression .pull-left[ <p style = "margin-bottom:-.75cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-68-1.png" width="100%" style="display: block; margin: auto;" /> <p style = "margin-bottom:-1.2cm;"></p> ``` ## ## Call: ## lm(formula = y ~ x, data = data) ## ## Coefficients: ## (Intercept) x ## -0.09129 1.01546 ``` ] -- .pull-right[ * This can be expressed with the **regression equation:** `$$y_i = \hat{\alpha} + \hat{\beta}x_i + \hat{\varepsilon_i}$$` * Where `\(\hat{\alpha}\)` is the **intercept** and `\(\hat{\beta}\)` the **slope** of the **line** `\(\hat{y_i} = \hat{\alpha} + \hat{\beta}x_i\)`, and `\(\hat{\varepsilon_i}\)` the **distances** between the points and the line <p style = "margin-bottom:1cm;"> `$$\hat{\beta} = \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)}$$` `$$\hat{\alpha} = \bar{y} - \hat{\beta} \times\bar{x}$$` * `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` minimize `\(\sum_i\hat{\varepsilon_i}^2\)` ] --- ### 4. Wrap up! #### 3.
Binary variables .pull-left[ <center>Binary <b>dependent</b> variables</center> <ul> <li>The <b>fitted values</b> can be viewed as <b>probabilities</b></li> <ul> <li>\(\hat{\beta}\) is the expected increase in the probability that \(y = 1\) for a one unit increase in \(x\)</li> </ul> </ul> <p style = "margin-bottom:1cm;"> <img src="slides_files/figure-html/unnamed-chunk-70-1.png" width="100%" style="display: block; margin: auto;" /> <p style = "margin-bottom:1cm;"> <ul> <ul> <li>We call that a <b>Linear Probability Model</b></li> </ul> </ul> ] -- .pull-right[ <center>Binary <b>independent</b> variables</center> <ul> <li>The \(x\) variable should be viewed as a <b>dummy 0/1</b></li> <ul> <li>\(\hat{\beta}\) is the difference between the average \(y\) for the group \(x = 1\) and the group \(x = 0\)</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-71-1.png" width="90%" style="display: block; margin: auto;" /> ]
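---

### 3. Binary variables
#### 3.2. Binary independent variables

* The equivalence between `\(\hat{\beta}\)` and the **difference in group means** can be checked with `lm()`: the sketch below uses **simulated** heights *(illustrative values centered on 165 and 176 cm, not real data)*

```r
# Simulate heights: females around 165 cm, males around 176 cm
set.seed(1)
sex    <- rep(c(0, 1), each = 50)          # dummy: 1 = male, 0 = female
height <- ifelse(sex == 1, 176, 165) + rnorm(100)

# Intercept = average height of the sex = 0 group, slope = male/female gap
coef(lm(height ~ sex))

# The slope equals the difference in group means exactly
mean(height[sex == 1]) - mean(height[sex == 0])
```

* The intercept returned by `lm()` equals the average height when the dummy is 0, and the slope equals the gap between the two group averages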