class: center, middle, inverse, title-slide # Multivariate regressions ## Lecture 9 ###
Louis SIRUGUE ### CPES 2 - Fall 2022 --- <style> .left-column {width: 65%;} .right-column {width: 35%;} </style> ### Quick reminder #### 1. Joint distribution <center>The <b>joint distribution</b> shows the possible <b>values</b> and associated <b>frequencies</b> for <b>two variables</b> simultaneously</center> -- .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <p style = "margin-bottom:1.5cm;"></p>
] --- ### Quick reminder #### 1. Joint distribution <center><h4> ➜ <i> When describing a joint distribution, we're interested in the relationship between the two variables </i></h4></center> <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>The <b>covariance</b> quantifies the joint deviation of two variables from their respective means</li> <ul> <li>It can take values from \(-\infty\) to \(\infty\) and depends on the unit of the data</li> </ul> </ul> $$ \text{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$ <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>The <b>correlation</b> is the covariance of two variables divided by the product of their standard deviations</li> <ul> <li>It can take values from \(-1\) to \(1\) and is independent of the unit of the data</li> </ul> </ul> `$$\text{Corr}(x, y) = \frac{\text{Cov}(x, y)}{\text{SD}(x)\times\text{SD}(y)}$$` --- ### Quick reminder #### 2. Regression .pull-left[ <p style = "margin-bottom:-.75cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" /> <p style = "margin-bottom:-1.2cm;"></p> ``` ## ## Call: ## lm(formula = y ~ x, data = data) ## ## Coefficients: ## (Intercept) x ## -0.09129 1.01546 ``` ] -- .pull-right[ * This can be expressed with the **regression equation:** `$$y_i = \hat{\alpha} + \hat{\beta}x_i + \hat{\varepsilon_i}$$` * Where `\(\hat{\alpha}\)` is the **intercept** and `\(\hat{\beta}\)` the **slope** of the **line** `\(\hat{y_i} = \hat{\alpha} + \hat{\beta}x_i\)`, and `\(\hat{\varepsilon_i}\)` the **distances** between the points and the line <p style = "margin-bottom:1cm;"> `$$\hat{\beta} = \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)}$$` `$$\hat{\alpha} = \bar{y} - \hat{\beta} \times\bar{x}$$` * `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` minimize the sum of squared `\(\hat{\varepsilon_i}\)` ] --- ### Quick reminder #### 3. Binary variables .pull-left[ <center>Binary <b>dependent</b> variables</center> <ul> <li>The <b>fitted values</b> can be viewed as <b>probabilities</b></li> <ul> <li>\(\hat{\beta}\) is the expected increase in the probability that \(y = 1\) for a one-unit increase in \(x\)</li> </ul> </ul> <p style = "margin-bottom:1cm;"> <img src="slides_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" /> <p style = "margin-bottom:1cm;"> <ul> <ul> <li>We call that a <b>Linear Probability Model</b></li> </ul> </ul> ] -- .pull-right[ <center>Binary <b>independent</b> variables</center> <ul> <li>The \(x\) variable should be viewed as a <b>dummy 0/1</b></li> <ul> <li>\(\hat{\beta}\) is the difference between the average \(y\) for the group \(x = 1\) and the group \(x = 0\)</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" /> ] --- class: inverse, hide-logo ### Warm up practice <p style = "margin-bottom:2cm;"></p> #### 1) Open the `asec.csv` data containing sex, race, weekly work hours, and annual earnings ($) -- <p style = "margin-bottom:1.5cm;"></p> #### 2) Regress the earnings variable on the sex variable -- <p style = "margin-bottom:1.5cm;"></p> #### 3) Check that the slope coefficient is equal to the difference between male and female average earnings <p style = "margin-bottom:3cm;"></p> -- <center><h3><i>You've got 10 minutes!</i></h3></center>
--- class: inverse, hide-logo ### Solution <p style = "margin-bottom:2cm;"></p> #### 1) Open the `asec.csv` data containing sex, race, weekly work hours, and annual earnings ($) ```r asec <- read.csv("asec.csv") ``` -- <p style = "margin-bottom:2cm;"></p> #### 2) Regress the earnings variable on the sex variable ```r lm(Earnings ~ Sex, asec) ``` ``` ## ## Call: ## lm(formula = Earnings ~ Sex, data = asec) ## ## Coefficients: ## (Intercept) SexMale ## 50915 21612 ``` --- class: inverse, hide-logo ### Solution #### 3) Check that the slope coefficient is equal to the difference between male and female average earnings -- ```r asec %>% # Group the data by sex group_by(Sex) %>% # Summarise mean earnings -> 2x2 dataset summarise(Mean = mean(Earnings)) %>% # Put means in columns instead of rows -> 1x2 dataset pivot_wider(names_from = Sex, values_from = Mean) %>% # Compute the difference in means mutate(Difference = Male - Female) ``` -- ``` ## # A tibble: 1 x 3 ## Female Male Difference ## <dbl> <dbl> <dbl> ## 1 50915. 72527. 21612. ``` --- <h3>Today: Multivariate regressions</h3> -- <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Control variables</b></li> <ul style = "list-style: none"> <li>2.1. Motivation</li> <li>2.2. Discrete controls</li> <li>2.3. Continuous controls</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Interactions</b></li> <ul style = "list-style: none"> <li>3.1. Motivation</li> <li>3.2. Discrete interactions</li> <li>3.3. Continuous interactions</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Today: Multivariate regressions</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> ] --- ### 1. Adding variables #### 1.1. Continuous variables .pull-left[ * So far we focused on two-variable relationships <img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ * What about three variables? *(rotate the plot)*
] --- ### 1. Adding variables #### 1.1. Continuous variables <p style = "margin-bottom:1.25cm;"></p> .pull-left[
] .pull-right[ <ul> <li>In this case we must fit a <b>plane</b></li> <ul> <li>It is characterized by <b>3 parameters</b></li> <li>And can be expressed as:</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> `$$y_i = \hat{\alpha} + \hat{\beta_1}x_{1,i} + \hat{\beta_2}x_{2,i} + \hat{\varepsilon_i}$$` <p style = "margin-bottom:1cm;"></p> <ul> <li>\(\hat{\alpha}\) is still the <b>intercept</b></li> <ul> <li>The value of \(\hat{y}\) (height) when \(x_1 = x_2= 0\)</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>And now there are <b>2 slopes</b></li> <ul> <li>\(\hat{\beta_1}\) along the \(x_1\) axis and \(\hat{\beta_2}\) along the \(x_2\) axis</li> </ul> </ul> ] --- ### 1. Adding variables #### 1.1. Continuous variables <ul> <li>The <b>same</b> applies with <b>more than 2</b> independent variables</li> <ul> <li>We would fit a <b>hyperplane</b> with as many dimensions as \(x\) variables</li> <li>We would obtain one intercept and one slope per \(x\) variable</li> </ul> </ul> -- `$$y_i = \hat{\alpha} + \hat{\beta_1}x_{1, i} + \hat{\beta_2}x_{2, i} +...+ \hat{\beta_k}x_{k, i}+\hat{\varepsilon_i}$$` -- <ul> <li>We can estimate the parameters of these hyperplanes in <b>lm()</b></li> <ul> <li><b>Additional variables</b> must be introduced after a <b>+ sign</b></li> </ul> </ul> -- ```r lm(ige ~ gini + third_variable, ggcurve) ``` -- ``` ## ## Call: ## lm(formula = ige ~ gini + third_variable, data = ggcurve) ## ## Coefficients: ## (Intercept) gini third_variable ## -0.09536 0.98153 0.01122 ``` --- ### 1. Adding variables #### 1.2. Discrete variables <ul> <li><b>So far</b> we've been working with <b>binary</b> categorical variables:</li> <ul> <li>Accepted vs. Rejected, Male vs. Female</li> <li>But what about discrete variables with <b>more than two categories?</b></li> </ul> </ul> -- .pull-left[ * Take for instance the <b>race variable:</b> ```r asec %>% group_by(Race) %>% tally() ``` ``` ## # A tibble: 3 x 2 ## Race n ## <chr> <int> ## 1 Black 6835 ## 2 Other 6950 ## 3 White 50551 ``` ] -- .pull-right[ <p style = "margin-bottom:3.5cm;"></p> <center><b><i>How can we use this variable</i></b></center> <center><b><i>as an independent variable</i></b></center> <center><b><i>in our regression framework?</i></b></center> ] --- ### 1. Adding variables #### 1.2. 
Discrete variables <ul> <li>Remember how we converted our <b>2-category</b> variable into <b>1 dummy</b> variable</li> <ul> <li>We can convert an <b>n-category</b> variable into <b>n-1 dummy</b> variables</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- .pull-left[ <p style = "margin-bottom:1cm;"></p> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption> </caption> <thead> <tr> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Male </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> Race </th> <th style="text-align:right;"> Black </th> <th style="text-align:right;"> Other </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Black </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Black </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Other </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Other </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] -- .pull-right[ <p style = "margin-bottom:-.25cm;"></p> ***➜ But why do we omit one category every time?*** <ul> <li>Because it would be redundant</li> <li>We only need 2 dummies for 3 groups:</li> <ul> <li><b>White:</b> Black = <b>0</b> & Other = <b>0</b></li> <li><b>Black:</b> Black = <b>1</b> & Other = <b>0</b></li> <li><b>Other:</b> Black = <b>0</b> & Other = <b>1</b></li> </ul> </ul> <ul> <li>\(\hat{\alpha}\) is the expected \(\hat{y}\) when \(x_k=0 \:\forall k\)</li> <ul> <li>Thus it does the job for the omitted group!</li> <li>This group is called the <b>reference group</b></li> <li>\(\hat{\beta_k}\) are interpreted <b>relative</b> to that group</li> </ul> </ul> ] --- ### 1. Adding variables #### 1.2. Discrete variables <p style = "margin-bottom:1cm;"></p> .pull-left[ <center><b>2-category variable</b></center> <p style = "margin-bottom:2cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ <center><b>3-category variable</b></center>
] --- ### 1. Adding variables #### 1.2. Discrete variables * This <b>plane</b> can be expressed as: `$$\text{Earnings}_i = \hat{\alpha} + \hat{\beta_1} 1\{\text{Race}_i = \text{Other}\} + \hat{\beta_2} 1\{\text{Race}_i = \text{White}\} + \hat{\varepsilon_i}$$` -- <p style = "margin-bottom:1cm;"></p> <ul> <li>And the <b>average</b> incomes for each group equal:</li> <ul> <li><b>Black: \(\hat{\alpha} + 0\hat{\beta_1} + 0\hat{\beta_2} = \hat{\alpha}\)</b></li> <li><b>Other: \(\hat{\alpha} + 1\hat{\beta_1} + 0\hat{\beta_2} = \hat{\alpha} + \hat{\beta_1}\)</b></li> <li><b>White: \(\hat{\alpha} + 0\hat{\beta_1} + 1\hat{\beta_2} = \hat{\alpha} + \hat{\beta_2}\)</b></li> </ul> </ul> <p style = "margin-bottom:-.25cm;"></p> -- .pull-left[ ``` ## ## Call: ## lm(formula = Earnings ~ Race, data = asec) ## ## Coefficients: ## (Intercept) RaceOther RaceWhite ## 50577 17477 12303 ``` ] .pull-right[ <p style = "margin-bottom:-2.5cm;"></p> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Average by group</caption> <thead> <tr> <th style="text-align:left;"> Race </th> <th style="text-align:right;"> Mean earnings </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Black </td> <td style="text-align:right;"> 50577.49 </td> </tr> <tr> <td style="text-align:left;"> Other </td> <td style="text-align:right;"> 68054.63 </td> </tr> <tr> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 62880.49 </td> </tr> </tbody> </table> ] --- ### 1. Adding variables #### 1.2. Discrete variables <ul> <li>By <b>default</b>, lm() sorts categories by <b>alphabetical</b> order</li> <ul> <li>So every coefficient should be <b>interpreted relative</b> to the group which is first alphabetically</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <ul> <li>But usually this is <b>not</b> the most <b>intuitive</b></li> <ul> <li>You may want everything to be relative to the <b>majority group</b></li> <li>Or to any group that has reasons to be the <b>reference</b></li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <ul> <li>The <b>relevel()</b> function allows you to <b>change the reference</b> category</li> <ul> <li>But it works <b>only on factor</b> variables</li> </ul> </ul> -- .pull-left[ ```r asec <- asec %>% mutate(Race_fct = relevel(as.factor(Race), "White")) lm(Earnings ~ Race_fct, asec) ``` ] .pull-right[ <p style = "margin-bottom:-.5cm;"></p> ``` ## ## Call: ## lm(formula = Earnings ~ Race_fct, data = asec) ## ## Coefficients: ## (Intercept) Race_fctBlack Race_fctOther ## 62880 -12303 5174 ``` ] --- ### 1. Adding variables #### 1.2. Discrete variables <ul> <li>The <b>factor class</b> is made for variables whose values <b>indicate</b> different <b>groups</b></li> <ul> <li>Values are just <b>arbitrary group classifiers</b></li> </ul> </ul> -- ```r individuals <- as.factor(c(1, 2, 3, 4, 5)) individuals[1] ``` ``` ## [1] 1 ## Levels: 1 2 3 4 5 ``` -- <p style = "margin-bottom:1cm;"></p> <ul> <li>With <b>factors</b>, R understands that the different values <b>do not mean anything</b></li> <ul> <li>And applying <b>standard operations</b> to factors <b>does not make sense</b></li> </ul> </ul> -- ```r individuals * 2 ``` ``` ## Warning in Ops.factor(individuals, 2): '*' not meaningful for factors ``` ``` ## [1] NA NA NA NA NA ``` --- ### 1. Adding variables #### 1.2. 
Discrete variables * What you can also do is <b>create the dummies yourself:</b> ```r asec <- asec %>% mutate(Black = as.numeric(Race == "Black"), Other = as.numeric(Race == "Other")) ``` -- ```r lm(Earnings ~ Black + Other, asec) ``` ``` ## ## Call: ## lm(formula = Earnings ~ Black + Other, data = asec) ## ## Coefficients: ## (Intercept) Black Other ## 62880 -12303 5174 ``` -- <p style = "margin-bottom:1cm;"></p> <center><i>➜ This might be the <b>safest</b> option</i></center> --- ### 1. Adding variables #### 1.2. Discrete variables * But a <b>categorical</b> variable must <b>not</b> be introduced <b>as numeric</b> in lm() ```r asec <- asec %>% mutate(num_cat = case_when(Race == "White" ~ 0, Race == "Black" ~ 1, Race == "Other" ~ 2)) ``` ```r lm(Earnings ~ num_cat, asec) ``` ``` ## ## Call: ## lm(formula = Earnings ~ num_cat, data = asec) ## ## Coefficients: ## (Intercept) num_cat ## 62093.8 119.6 ``` -- <center><i>➜ lm() used our <b>categorical</b> variable as a <b>continuous</b> variable</i></center> --- ### 1. Adding variables #### 1.2. Discrete variables * Use the <b>factor</b> class ```r asec <- asec %>% mutate(fac_cat = as.factor(num_cat)) ``` -- ```r lm(Earnings ~ fac_cat, asec) ``` ``` ## ## Call: ## lm(formula = Earnings ~ fac_cat, data = asec) ## ## Coefficients: ## (Intercept) fac_cat1 fac_cat2 ## 62880 -12303 5174 ``` -- <p style = "margin-bottom:1.5cm;"></p> <center><i>➜ <b>Converting</b> all your <b>categorical</b> variables into <b>factors</b> is also a <b>safe</b> option</i></center> --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Control variables</b></li> <ul style = "list-style: none"> <li>2.1. Motivation</li> <li>2.2. Discrete controls</li> <li>2.3. Continuous controls</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Interactions</b></li> <ul style = "list-style: none"> <li>3.1. Motivation</li> <li>3.2. Discrete interactions</li> <li>3.3. Continuous interactions</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Control variables</b></li> <ul style = "list-style: none"> <li>2.1. Motivation</li> <li>2.2. Discrete controls</li> <li>2.3. Continuous controls</li> </ul> </ul> ] --- ### 2. Control variables #### 2.1. 
Motivation <ul> <li>But <b>why</b> would we include <b>additional variables</b> in our regressions?</li> <ul> <li>The main reason is to <b>control</b> for potential <b>confounders</b></li> </ul> </ul> -- <ul> <li>Consider estimating the <b>relationship</b> between <b>income</b> and exposure to air <b>pollution</b> in the Paris region</li> </ul> `$$\text{Pollution}_i = \hat{\alpha_1} + \hat{\beta_1} \text{Income}_i + \hat{\varepsilon_i}$$` -- <ul> <li>You would probably expect that \(\hat{\beta_1} < 0\)</li> <ul> <li>Meaning that <b>higher income</b> earners live in <b>less polluted</b> areas</li> <li>But the closer to <b>Paris</b>, the higher the <b>rents</b> and the closer the <b>ring-road</b></li> <li>This phenomenon might counteract this effect and pull \(\hat{\beta_1}\) towards 0</li> </ul> </ul> -- <ul> <li>But how to <b>remove</b> the <b>impact</b> that <b>distance</b> from Paris has on the relationship?</li> <ul> <li><b>Including it</b> in the regression would make the corresponding coefficient <b>absorb the confounding effect</b></li> <li>In that case we would call distance a <b><i>control</i> variable</b></li> </ul> </ul> `$$\text{Pollution}_i = \hat{\alpha_2} + \hat{\beta_2} \text{Income}_i + \hat{\beta_3} \text{Distance}_i + \hat{\varepsilon_i}$$` --- ### 2. Control variables #### 2.2. Discrete <ul> <li>The most <b>common control</b> variable is probably <b>sex/gender</b></li> <ul> <li>It may play a role in the <b>relationship</b> between <b>earnings</b> and <b>hours worked</b> for instance</li> <li>The fact that <b>women</b> work <b>part time</b> more often and <b>earn less</b> contributes to the relationship</li> <li>Just like distance did in the previous example</li> </ul> </ul> <p style = "margin-bottom:1.75cm;"></p> -- <img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. Control variables #### 2.2. Discrete <ul> <li>The most <b>common control</b> variable is probably <b>sex/gender</b></li> <ul> <li>It may play a role in the <b>relationship</b> between <b>earnings</b> and <b>hours worked</b> for instance</li> <li>The fact that <b>women</b> work <b>part time</b> more often and <b>earn less</b> contributes to the relationship</li> <li>Just like distance did in the previous example</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-34-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. Control variables #### 2.2. Discrete <ul> <li>The most <b>common control</b> variable is probably <b>sex/gender</b></li> <ul> <li>It may play a role in the <b>relationship</b> between <b>earnings</b> and <b>hours worked</b> for instance</li> <li>The fact that <b>women</b> work <b>part time</b> more often and <b>earn less</b> contributes to the relationship</li> <li>Just like distance did in the previous example</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-35-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. Control variables #### 2.2. Discrete <ul> <li>The most <b>common control</b> variable is probably <b>sex/gender</b></li> <ul> <li>It may play a role in the <b>relationship</b> between <b>earnings</b> and <b>hours worked</b> for instance</li> <li>The fact that <b>women</b> work <b>part time</b> more often and <b>earn less</b> contributes to the relationship</li> <li>Just like distance did in the previous example</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-36-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. 
Control variables #### 2.2. Discrete ➜ The <b>relationship</b> is indeed <b>inflated</b> by the sex variable .pull-left[ <p style = "margin-bottom:1cm;"></p> <ul> <li>Because being a <b>male</b> is positively <b>correlated</b> with <b>both \(x\) and \(y\)</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li><b>Controlling</b> for sex would <b>solve that problem</b> by absorbing this effect</li> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>Controlling for a <b>discrete</b> variable amounts to allowing <b>one intercept per category</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>Giving <b>two parallel fitted lines</b>, which are the intersections of the fitted plane with each group's scatterplot</li> </ul> ] .pull-right[ <p style = "margin-bottom:-2cm;"></p>
] --- ### 2. Control variables #### 2.2. Discrete ➜ The <b>relationship</b> is indeed <b>inflated</b> by the sex variable .pull-left[ <p style = "margin-bottom:1cm;"></p> <ul> <li>Because being a <b>male</b> is positively <b>correlated</b> with <b>both \(x\) and \(y\)</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li><b>Controlling</b> for sex would <b>solve that problem</b> by absorbing this effect</li> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>Controlling for a <b>discrete</b> variable amounts to allowing <b>one intercept per category</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>Giving <b>two parallel fitted lines</b>, which are the intersections of the fitted plane with each group's scatterplot</li> </ul> ] .pull-right[ <p style = "margin-bottom:-2cm;"></p>
] --- ### 2. Control variables #### 2.2. Discrete `$$\text{Earnings}_i = \hat{\alpha} + \hat{\beta_1}\text{Hours}_i + \hat{\beta_2}1\{\text{Sex}_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- <p style = "margin-bottom:1cm;"></p> ``` ## (Intercept) Hours SexMale ## 1019.34269 11.86326 200.98782 ``` -- <p style = "margin-bottom:-.5cm;"></p> .left-column[ <img src="slides_files/figure-html/unnamed-chunk-40-1.png" width="85%" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ <p style = "margin-bottom:1cm;"></p> <center><b>Graphical counterpart</b></center> <p style = "margin-bottom:1cm;"></p> `\(\hat{\alpha}\)`: Intercept of the reference group `\(\hat{\beta_1}\)`: Common slope `\(\hat{\beta_2}\)`: Gap between the two lines `\(\hat{\alpha} +\hat{\beta_2}\)`: Intercept of the other group ] --- ### 2. Control variables #### 2.2. Discrete <ul> <li>We can <b>obtain</b> this common <b>slope</b> by:</li> <ol> <li><b>Demeaning</b> earnings and hours by group</li> <li><b>Regressing</b> the demeaned earnings on the demeaned hours</li> </ol> </ul> <img src="slides_files/figure-html/unnamed-chunk-41-1.gif" width="60%" style="display: block; margin: auto;" /> --- ### 2. Control variables #### 2.2. Discrete <ul> <li>Note that once we <b>control</b> for a third variable:</li> <ol> <li>As we move along the x axis, this <b>third variable remains constant</b></li> <li>Here, as the number of <b>hours increases</b> the probability of being a <b>male does not</b> increase anymore</li> </ol> </ul> <img src="slides_files/figure-html/unnamed-chunk-42-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Control variables #### 2.3. Continuous <ul> <li>The <b>same</b> idea applies when we control for <b>continuous</b> variables</li> <ul> <li>Including it in the regression allows us to <b>account for another dimension</b></li> <li>Such that when \(x\) moves, this variable <b>remains constant</b></li> <li>This <b>nets out</b> the relationship between \(x\) and \(y\) from the potential <b>confounding effect</b> of this variable</li> <li>This is why we call it <b><i>controlling</i> for something</b></li> </ul> </ul> <p style = "margin-bottom:-.55cm;"></p> --
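A minimal sketch of how this looks in `lm()`, assuming a hypothetical data frame `paris` with `pollution`, `income`, and `distance` columns (not part of the course data):

```r
# Bivariate slope: potentially confounded by distance from Paris
lm(pollution ~ income, paris)$coefficients

# Adding distance as a control: the income coefficient now measures
# the income-pollution relationship at a constant distance from Paris
lm(pollution ~ income + distance, paris)$coefficients
```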
--- class: inverse, hide-logo ### Practice <p style = "margin-bottom:2cm;"></p> #### 1) Using the `asec` data, regress (yearly) earnings on (weekly) hours worked <p style = "margin-bottom:2cm;"></p> #### 2) Regress earnings on hours worked controlling for sex <p style = "margin-bottom:2cm;"></p> #### 3) Interpret the difference between the results from 1) and 2) -- <p style = "margin-bottom:3cm;"></p> <center><h3><i>You've got 8 minutes!</i></h3></center>
--- class: inverse, hide-logo ### Solution <p style = "margin-bottom:2cm;"></p> #### 1) Using the `asec` data, regress (yearly) earnings on (weekly) hours worked ```r lm(Earnings ~ Hours, asec)$coefficients ``` ``` ## (Intercept) Hours ## -20038.85 2077.79 ``` -- <p style = "margin-bottom:2cm;"></p> #### 2) Regress earnings on hours worked controlling for sex ```r lm(Earnings ~ Hours + Sex, asec)$coefficients ``` ``` ## (Intercept) Hours SexMale ## -22296.150 1953.829 13794.385 ``` --- class: inverse, hide-logo ### Solution <p style = "margin-bottom:2cm;"></p> #### 3) Interpret the difference between the results from 1) and 2) .pull-left[ <ul><li>The <b>slope</b> is still positive but <b>less steep</b></li></ul> <ul><ul><li>In the <b>first regression</b>, as the number of <b>hours increases</b>, the probability of being a <b>male does increase</b> as well</li></ul></ul> <ul><ul><li>Because <b>males</b> tend to <b>earn more</b>, this <b>contributes</b> to the positive <b>relationship</b> between Hours and Earnings</li></ul></ul> <ul><ul><li>In the <b>second regression</b>, <b>controlling</b> for sex keeps the probability of being a <b>male constant</b> along the hours axis, which <b>removes this effect</b></li></ul></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Control variables ✔</b></li> <ul style = "list-style: none"> <li>2.1. Motivation</li> <li>2.2. Discrete controls</li> <li>2.3. Continuous controls</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Interactions</b></li> <ul style = "list-style: none"> <li>3.1. Motivation</li> <li>3.2. Discrete interactions</li> <li>3.3. Continuous interactions</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Control variables ✔</b></li> <ul style = "list-style: none"> <li>2.1. Motivation</li> <li>2.2. Discrete controls</li> <li>2.3. Continuous controls</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Interactions</b></li> <ul style = "list-style: none"> <li>3.1. Motivation</li> <li>3.2. Discrete interactions</li> <li>3.3. Continuous interactions</li> </ul> </ul> ] --- ### 3. Interactions #### 3.1. 
Motivation <ul> <li>Now we know how to <b>remove</b> the <b>confounding effect</b> of a third variable by <b>controlling</b> for it</li> <ul> <li>But what if the main <b>relationship varies</b> depending on the value of the <b>third variable?</b></li> </ul> </ul> * Let's get back to the previous example `$$\text{Pollution}_i = \hat{\alpha} + \hat{\beta_1} \text{Income}_i + \hat{\beta_2} \text{Distance}_i + \hat{\varepsilon_i}$$` -- <ul> <li>The <b>equation imposes</b> that the <b>effect</b> of income on pollution is <b>constant:</b> \(\hat{\beta_1}\)</li> <ul> <li>But what if the relationship was actually not the same close to Paris as further away?</li> <li>Maybe the closer to Paris, the larger the effect (higher segregation, ...)</li> </ul> </ul> -- <ul> <li>But how to <b>capture how the relationship</b> between income and pollution <b>varies</b> with distance?</li> <ul> <li>We should allow for it in the equation!</li> <li>By <b>adding a term</b> that depends both on income and distance</li> <li>What we use is their <b>product</b>, and we call that an <b>interaction</b></li> </ul> </ul> `$$\text{Pollution}_i = \hat{\alpha_2} + \hat{\beta_3} \text{Income}_i + \hat{\beta_4} \text{Distance}_i+ \hat{\beta_5} (\text{Distance}_i\times\text{Income}_i) + \hat{\varepsilon_i}$$` --- ### 3. Interactions #### 3.2. Discrete <ul> <li>Take for instance the following <b>relationship</b> between <b>household income</b> and the <b>number of children</b></li> </ul> <p style = "margin-bottom:3.30cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-46-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 3. Interactions #### 3.2. Discrete <ul> <li>Take for instance the following <b>relationship</b> between <b>household income</b> and the <b>number of children</b></li> <ul> <li>The level of <b>education</b> seems to <b>play a role</b> in the relationship</li> </ul> </ul> <p style = "margin-bottom:1.25cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-47-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 3. Interactions #### 3.2. Discrete <ul> <li>Take for instance the following <b>relationship</b> between <b>household income</b> and the <b>number of children</b></li> <ul> <li>The level of <b>education</b> seems to <b>play a role</b> in the relationship</li> <li>But simply <b>controlling</b> for education does <b>not</b> seem <b>sufficient</b></li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-48-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 3. Interactions #### 3.2. Discrete <ul> <li>This is because the <b>relationship</b> between income and children <b>varies with education</b></li> <ul> <li><b>Interacting</b> income with education allows us to <b>account for that</b></li> <li>Just as <b>controlling</b> allows for <b>different intercepts</b>, <b>interacting</b> allows for <b>different slopes</b></li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-49-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 3. Interactions #### 3.2. 
Discrete <center><i>➜ It is clearly <b>equivalent to regressing</b> children on income separately <b>per education group</b></i></center> <p style = "margin-bottom:2cm;"></p> `$$\begin{align} \text{Children}_i = \:& \hat{\alpha_A} + \hat{\beta_A}\text{Income}_i + & \small{\text{Baseline equation}}\\ & \hat{\beta_B}\text{Highschool}_i+ \hat{\beta_C}\text{College}_i+& \small{\text{Allow for} \neq \text{intercepts}} \\ & \text{Income}_i\times\left[\hat{\beta_D}\text{Highschool}_i+ \hat{\beta_E}\text{College}_i\right]+\hat{\varepsilon_i}& \small{\text{Allow for} \neq \text{slopes}}\end{align}$$` --- ### 3. Interactions #### 3.2. Discrete <center><i>➜ It is clearly <b>equivalent to regressing</b> children on income separately <b>per education group</b></i></center> <p style = "margin-bottom:2cm;"></p> `$$\begin{align} \text{Children}_i = \:& \hat{\alpha_A} + \hat{\beta_A}\text{Income}_i + & \small{\text{Baseline equation}}\\ & \hat{\beta_B}\underbrace{\text{Highschool}_i}_{0}+ \hat{\beta_C}\underbrace{\text{College}_i}_{0}+& \small{\text{Allow for} \neq \text{Intercepts}} \\ & \text{Income}_i\times\left[\hat{\beta_D}\underbrace{\text{Highschool}_i}_{0}+ \hat{\beta_E}\underbrace{\text{College}_i}_{0}\right]+\hat{\varepsilon_i}& \small{\text{Allow for} \neq \text{slopes}}\end{align}$$` <p style = "margin-bottom:2cm;"></p> <ul style = "list-style: none; margin-left: 5cm;"> <li><b>< Highschool:</b> \(\:\:\: \text{Children}_i = \hat{\alpha_A} + \hat{\beta_A}\text{Income}_i+\hat{\varepsilon_i}\)</li> </ul> --- ### 3. Interactions #### 3.2. Discrete <center><i>➜ It is clearly <b>equivalent to regressing</b> children on income separately <b>per education group</b></i></center> <p style = "margin-bottom:2cm;"></p> `$$\begin{align} \text{Children}_i = \:& \hat{\alpha_A} + \hat{\beta_A}\text{Income}_i + & \small{\text{Baseline equation}}\\ & \hat{\beta_B}\underbrace{\text{Highschool}_i}_{1}+ \hat{\beta_C}\underbrace{\text{College}_i}_{0}+& \small{\text{Allow for} \neq \text{intercepts}} \\ & \text{Income}_i\times\left[\hat{\beta_D}\underbrace{\text{Highschool}_i}_{1}+ \hat{\beta_E}\underbrace{\text{College}_i}_{0}\right]+\hat{\varepsilon_i}& \small{\text{Allow for} \neq \text{slopes}}\end{align}$$` <p style = "margin-bottom:2cm;"></p> <ul style = "list-style: none; margin-left: 5cm;"> <li><b>Highschool:</b> \(\:\:\: \text{Children}_i = (\hat{\alpha_A} + \hat{\beta_B}) + (\hat{\beta_A} + \hat{\beta_D})\text{Income}_i+\hat{\varepsilon_i}\)</li> </ul> --- ### 3. Interactions #### 3.2. Discrete <center><i>➜ It is clearly <b>equivalent to regressing</b> children on income separately <b>per education group</b></i></center> <p style = "margin-bottom:2cm;"></p> `$$\begin{align} \text{Children}_i = \:& \hat{\alpha_A} + \hat{\beta_A}\text{Income}_i + & \small{\text{Baseline equation}}\\ & \hat{\beta_B}\underbrace{\text{Highschool}_i}_{0}+ \hat{\beta_C}\underbrace{\text{College}_i}_{1}+& \small{\text{Allow for} \neq \text{intercepts}} \\ & \text{Income}_i\times\left[\hat{\beta_D}\underbrace{\text{Highschool}_i}_{0}+ \hat{\beta_E}\underbrace{\text{College}_i}_{1}\right]+\hat{\varepsilon_i}& \small{\text{Allow for} \neq \text{slopes}}\end{align}$$` <p style = "margin-bottom:2cm;"></p> <ul style = "list-style: none; margin-left: 5cm;"> <li><b>College:</b> \(\:\:\: \text{Children}_i = (\hat{\alpha_A} + \hat{\beta_C}) + (\hat{\beta_A} + \hat{\beta_E})\text{Income}_i+\hat{\varepsilon_i}\)</li> </ul> --- ### 3. Interactions #### 3.2. 
Discrete <center><i>➜ It is clearly <b>equivalent to regressing</b> children on income separately <b>per education group</b></i></center> <p style = "margin-bottom:2cm;"></p> `$$\begin{align} \text{Children}_i = \:& \hat{\alpha_A} + \hat{\beta_A}\text{Income}_i + & \small{\text{Baseline equation}}\\ & \hat{\beta_B}\text{Highschool}_i+ \hat{\beta_C}\text{College}_i+& \small{\text{Allow for} \neq \text{intercepts}} \\ & \text{Income}_i\times\left[\hat{\beta_D}\text{Highschool}_i+ \hat{\beta_E}\text{College}_i\right]+\hat{\varepsilon_i}& \small{\text{Allow for} \neq \text{slopes}}\end{align}$$` <p style = "margin-bottom:2cm;"></p> <ul style = "list-style: none; margin-left: 5cm;"> <li><b>< Highschool:</b> \(\:\:\: \text{Children}_i = \hat{\alpha_A} + \hat{\beta_A} \text{Income}_i+\hat{\varepsilon_i}\)</li> <li><b>Highschool:</b> \(\:\:\: \text{Children}_i = (\hat{\alpha_A} + \hat{\beta_B}) + (\hat{\beta_A} + \hat{\beta_D})\text{Income}_i+\hat{\varepsilon_i}\)</li> <li><b>College:</b> \(\:\:\: \text{Children}_i = (\hat{\alpha_A} + \hat{\beta_C}) + (\hat{\beta_A} + \hat{\beta_E})\text{Income}_i+\hat{\varepsilon_i}\)</li> </ul> --- ### 3. Interactions #### 3.3. Continuous <ul> <li>The <b>same principle</b> applies to <b>continuous variables:</b></li> </ul> `$$\text{Pollution}_i = \hat{\alpha} + \hat{\beta_1} \text{Income}_i + \hat{\beta_2} \text{Distance}_i+ \hat{\beta_3} (\text{Distance}_i\times\text{Income}_i) + \hat{\varepsilon_i}$$` <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>What is the <b>effect of</b> a 1-unit increase in <b>income here?</b></li> </ul> -- `$$\hat{\beta_1} + \hat{\beta_3}\text{Distance}_i$$` <p style = "margin-bottom:1cm;"></p> -- <ul> <li>The <b>coefficient</b> associated with the <b>interaction</b>, \(\hat{\beta_3}\), indicates:</li> <ul> <li>By how much the <b>effect</b> of a 1-unit increase in <b>income</b> on pollution <b>varies with distance</b></li> <li>When <b>distance = 0</b>, the effect of income is \(\hat{\beta_1}\)</li> <li>For every <b>additional unit</b> of distance, the effect of income on pollution <b>increases by \(\hat{\beta_3}\)</b></li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <center><i>➜ Don't forget to also include the variables you interact as controls in the regression</i></center> --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Adding variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Continuous variables</li> <li>1.2. Discrete variables</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Control variables ✔</b></li> <ul style = "list-style: none"> <li>2.1. Motivation</li> <li>2.2. Discrete controls</li> <li>2.3. Continuous controls</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Interactions ✔</b></li> <ul style = "list-style: none"> <li>3.1. Motivation</li> <li>3.2. Discrete interactions</li> <li>3.3. Continuous interactions</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- ### 4. Wrap up! #### 1. Multivariate regressions <ul> <li><b>Adding</b> a second independent <b>variable</b> to the regression amounts to <b>fitting a plane</b> instead of a line</li> <ul> <li>Adding a third variable would fit a hyperplane of dimension 3 and so on</li> </ul> </ul> -- .pull-left[ <center><b>Adding a continuous variable</b></center>
] .pull-right[ <center><b>Adding a discrete variable</b></center>
] --- ### 4. Wrap up! #### 2. Control variables <ul> <li>Adding a third variable \(z\) <b>removes</b> its potential <b>confounding effect</b> from the relationship between \(x\) and \(y\)</li> <ul> <li>As we move along the \(x\) axis, the <b>third variable remains constant</b></li> </ul> </ul> -- `$$y_i = \hat{\alpha} + \hat{\beta_1} x_i + \hat{\beta_2} z_i + \hat{\varepsilon_i}$$` -- <img src="slides_files/figure-html/unnamed-chunk-52-1.gif" width="55%" style="display: block; margin: auto;" /> --- ### 4. Wrap up! #### 3. Interactions <ul> <li>Adding an <b>interaction</b> term with \(z\) allows us to see <b>how the effect</b> of \(x\) on \(y\) <b>varies</b> with \(z\)</li> <ul> <li>If \(z\) is <b>discrete</b>, it amounts to <b>regressing</b> \(y\) on \(x\) <b>separately</b> for each \(z\) group</li> </ul> </ul> -- `$$y_i = \hat{\alpha} + \hat{\beta_1} x_i + \hat{\beta_2} z_i + \hat{\beta_3}(x_i \times z_i)+ \hat{\varepsilon_i}$$` -- <img src="slides_files/figure-html/unnamed-chunk-53-1.png" width="55%" style="display: block; margin: auto;" />
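A minimal sketch of how such an interaction can be estimated with `lm()`, using the `asec` data from section 2 (in an R formula, `Hours * Sex` expands to the two main effects plus their product):

```r
# Hours * Sex is shorthand for Hours + Sex + Hours:Sex,
# so each sex gets its own intercept and its own slope
lm(Earnings ~ Hours * Sex, asec)$coefficients
```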