class: center, middle, inverse, title-slide # Introductory Econometrics ## Lecture 18 ###
Louis SIRUGUE ### CPES 2 - Spring 2023 --- <style> .left-column {width: 70%;} .right-column {width: 30%;} </style> <h3>Today: Refresher on Introductory Econometrics</h3> -- <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1cm;list-style: none"> <li><b>2. Regressions with discrete variables</b></li> <ul style = "list-style: none"> <li>2.1. Binary dependent variable</li> <li>2.2. Binary independent variable</li> <li>2.3. Categorical independent variable</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Controls and interactions</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Interpretation</b></li> </ul> ] --- <h3>Today: Refresher on Introductory Econometrics</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> ] --- ### 1. Regressions with continuous variables #### 1.1. Estimation * Consider these two relationships: .left-column[ <img src="slides_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto auto auto 0;" /> ] -- .right-column[ <p style = "margin-bottom:2.5cm;"> ➜ One is less noisy but flatter <p style = "margin-bottom:.5cm;"> ➜ One is noisier but steeper <p style = "margin-bottom:1.5cm;"> <h4>Both have a correlation of .75</h4> ] --- ### 1. Regressions with continuous variables #### 1.1. Estimation * Consider these two relationships: .left-column[ <img src="slides_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ <p style = "margin-bottom:3cm;"> ***But a given increase in x is not associated with the same increase in y!*** ] --- ### 1. Regressions with continuous variables #### 1.1. Estimation * The idea of a regression is to find the <b>line</b> that <b>fits</b> the data the <b>best</b> * Such that its slope can indicate how we expect <b>y</b> to <b>change</b> if we <b>increase x by 1 unit</b> -- <img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Regressions with continuous variables #### 1.1. Estimation * To do so we should <b>minimize the distance</b> between each <b>point</b> and the <b>line</b> <p style = "margin-bottom:1cm;"></p> -- <img src="slides_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" /> --- ### 1. Regressions with continuous variables #### 1.1.
Estimation .pull-left[ <p style = "margin-bottom:1cm;"></p> Take for instance the 20<sup>th</sup> observation: Peru <img src="slides_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ And consider the following notations: * We denote `\(y_i\)` the IGE of the `\(i^{\text{th}}\)` country * We denote `\(x_i\)` the Gini of the `\(i^{\text{th}}\)` country * We denote `\(\widehat{y_i}\)` the value of the `\(y\)` coordinate of our line when `\(x = x_i\)` <p style = "margin-bottom:1.5cm;"></p> ➜ The distance between the `\(i^{\text{th}}\)` y value and the line is thus `\(y_i - \widehat{y_i}\)` * We label that distance `\(\widehat{\varepsilon_i}\)` ] --- ### 1. Regressions with continuous variables #### 1.1. Estimation .pull-left[ <p style = "margin-bottom:2cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <p style = "margin-bottom:-1cm;"></p> * Because `\(\widehat{\varepsilon_i}\)` is the value of the distance between a point `\(y_i\)` and its corresponding value on the line `\(\widehat{y_i}\)`, we can write: `$$y_i = \widehat{y_i} + \widehat{\varepsilon_i}$$` * And because `\(\widehat{y_i}\)` lies on a straight line, it can be expressed as `$$\widehat{y_i} = \hat{\alpha} + \hat{\beta}x_i$$` * Where: * `\(\hat{\alpha}\)` is the y-intercept * `\(\hat{\beta}\)` is the slope * Both are estimations of the actual `\(\alpha\)` and `\(\beta\)` of the unknown DGP (data generating process) ] --- ### 1. Regressions with continuous variables #### 1.1. Estimation * Combining these two definitions yields the equation: `$$y_i = \hat{\alpha} + \hat{\beta}x_i + \widehat{\varepsilon_i} \begin{cases} y_i = \widehat{y_i} + \widehat{\varepsilon_i}& \text{Definition of distance}\\ \widehat{y_i} = \hat{\alpha} + \hat{\beta}x_i & \text{Definition of the line} \end{cases}$$` -- <p style = "margin-bottom:1cm;"></p> * Depending on the values of `\(\hat{\alpha}\)` and `\(\hat{\beta}\)`, the value of every `\(\widehat{\varepsilon_i}\)` will change -- <p style = "margin-bottom:-.5cm;"></p> .left-column[ <img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ <p style = "margin-bottom:-.25cm;"></p> **Attempt 1:** `\(\hat{\alpha}\)` is too high and `\(\hat{\beta}\)` is too low ➜ `\(\widehat{\varepsilon_i}\)` are large **Attempt 2:** `\(\hat{\alpha}\)` is too low and `\(\hat{\beta}\)` is too high ➜ `\(\widehat{\varepsilon_i}\)` are large **Attempt 3:** `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` seem appropriate ➜ `\(\widehat{\varepsilon_i}\)` are low ] --- ### 1. Regressions with continuous variables #### 1.1.
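Estimation

* The three attempts can be compared numerically: each candidate pair `\(\hat{\alpha}\)`, `\(\hat{\beta}\)` implies an overall distance between the points and the line, and the best line is the one making that distance smallest (the criterion is made precise on the next slide)

* A minimal sketch, with hypothetical data and hypothetical candidate values:

```r
# Overall distance implied by a candidate line, measured as the
# sum of squared residuals (hypothetical data and candidate values)
set.seed(1)
x <- rnorm(22)               # e.g., Gini
y <- 1 + .8 * x + rnorm(22)  # e.g., IGE

dist_line <- function(alpha, beta) sum((y - alpha - beta * x)^2)

dist_line(2, .1)    # attempt 1: intercept too high, slope too low
dist_line(-1, 2.5)  # attempt 2: intercept too low, slope too high
dist_line(1, .8)    # attempt 3: about right, smallest distance
```

---

### 1. Regressions with continuous variables

#### 1.1.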
Estimation * We want to find the values of `\(\hat{\alpha}\)` and `\(\hat{\beta}\)` that minimize the overall distance between the points and the line -- `$$\min_{\hat{\alpha}, \hat{\beta}}\sum_{i=1}^{n}\widehat{\varepsilon_i}^2$$` <ul> <ul> <li>Note that we square \(\widehat{\varepsilon_i}\) so that its positive and negative values do not cancel out</li> <li>This method is what we call <b>Ordinary Least Squares (OLS)</b></li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> -- <ul> <li>If we replace \(\widehat{\varepsilon_i}\) with \(y_i -\hat{\alpha} - \hat{\beta}x_i\)</li> <ul> <li>We can solve the minimization problem (see Lecture 7) to obtain:</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> `$$\hat{\beta} = \frac{\text{Cov}(x_i, y_i)}{\text{Var}(x_i)} \:\:\:\:\:\:\:\:\: ; \:\:\:\:\:\:\:\:\: \hat{\alpha} = \bar{y} - \hat{\beta} \times\bar{x}$$` --- <center><h3> Vocabulary </h3></center> <p style = "margin-bottom:1.5cm;"></p> * This equation we're working on is called a <b>regression model</b> `$$y_i = \alpha + \beta x_i + \varepsilon_i$$` -- <ul><ul> <li> We say that we regress \(y\) on \(x\) to find the coefficients \(\hat{\alpha}\) and \(\hat{\beta}\) that characterize the regression line</li> <li> We often call \(\hat{\alpha}\) and \(\hat{\beta}\) <b><i>parameters</i></b> of the regression because they are what we tune to fit our model to the data</li> </ul></ul> <p style = "margin-bottom:1.25cm;"></p> -- <ul> <li>We also have different names for the \(x\) and \(y\) variables</li> <ul> <li> \(y\) is called the <b><i>dependent</i></b> or <i>explained</i> variable <li> \(x\) is called the <b><i>independent</i></b> or <i>explanatory</i> variable </ul> </ul> -- <p style = "margin-bottom:1.25cm;"></p> * We call `\(\widehat{\varepsilon_i}\)` the <b>residuals</b> because they are what is left once we have fitted the data as well as we could -- <p style = "margin-bottom:1.25cm;"></p> * And `\(\hat{y_i} = \hat{\alpha} + \hat{\beta}x_i\)`, i.e., the values on the regression line for a given `\(x_i\)`, are called the <b>fitted values</b> --- ### 1. Regressions with continuous variables #### 1.2. Inference <ul> <li>Inference refers to being able to <b>conclude</b> something from our estimation</li> <ul> <li>The \(\hat{\beta}\) from our sample is actually an <b>estimation</b> of the unobserved \(\beta\) of the underlying population</li> <li>We would like to know how reliable \(\hat{\beta}\) is, <b>how confident we are</b> in its estimation</li> <li>The first step of inference is to compute the <b>standard error</b> of \(\hat{\beta}\)</li> </ul> </ul> -- <p style = "margin-bottom:1.75cm;"></p> `$$\text{se}(\hat{\beta}) = \sqrt{\widehat{\text{Var}(\hat{\beta})}} = \sqrt{\frac{\sum_{i = 1}^n\hat{\varepsilon_i}^2}{(n-\#\text{parameters})\sum_{i = 1}^n(x_i-\bar{x})^2}}$$` <p style = "margin-bottom:1.75cm;"></p> -- <ul> <li>Notice that the variance, and thus the standard error of our estimate:</li> <ul> <li>Decreases as our sample gets bigger</li> <li>Gets larger if the points are further away from the regression line on average for a given variance of \(x\)</li> </ul> </ul> --- ### 1. Regressions with continuous variables #### 1.2.
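Inference

* These formulas are easy to verify in R — a minimal sketch, assuming a hypothetical data frame `df` with columns `x` and `y` (`lm()` reports the same numbers):

```r
# Hypothetical data
set.seed(1)
df <- data.frame(x = rnorm(22))
df$y <- 1 + .8 * df$x + rnorm(22)

# OLS formulas: beta-hat = Cov(x, y)/Var(x), alpha-hat = y-bar - beta-hat * x-bar
beta_hat  <- cov(df$x, df$y) / var(df$x)
alpha_hat <- mean(df$y) - beta_hat * mean(df$x)

# Standard error of beta-hat, with 2 estimated parameters (alpha and beta)
res <- df$y - alpha_hat - beta_hat * df$x
n   <- nrow(df)
se  <- sqrt(sum(res^2) / ((n - 2) * sum((df$x - mean(df$x))^2)))

# Compare with R's built-in estimation
summary(lm(y ~ x, data = df))$coefficients
```

---

### 1. Regressions with continuous variables

#### 1.2.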
Inference * The magnitude of the standard error gives an indication of the <b>precision</b> of our estimate: * The larger the estimate relative to its standard error, the more precise the estimate <p style = "margin-bottom:1.2cm;"></p> -- * But standard errors are not easily interpretable by themselves * A more direct way to get a sense of the precision for inference is to construct a <b>confidence interval</b> <p style = "margin-bottom:1.2cm;"></p> -- <center>➜ <b>Instead of saying that our estimation \(\hat{\beta}\) is equal to 1.02, we would like to say that we are 95% sure that the actual \(\beta\) lies between two given values</b></center> <p style = "margin-bottom:1.2cm;"></p> -- * To obtain a confidence interval we can use the fact that under specific conditions (that you're going to see next year) it is possible to derive how this object is distributed: `$$\hat{t} \equiv \frac{\hat{\beta} - \beta}{\text{se}(\hat{\beta})}$$` --- ### 1. Regressions with continuous variables #### 1.2. Inference * Theory shows that `\(\hat{t} \equiv \frac{\hat{\beta} - \beta}{\text{se}(\hat{\beta})}\)` follows a Student t distribution whose number of degrees of freedom is equal to `\(n\)` (in our case 22 countries) minus the number of parameters estimated in the model (in our case 2: `\(\alpha\)` and `\(\beta\)`) -- <img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Regressions with continuous variables #### 1.2. Inference * Denote `\(t_{97.5\%}\)` the value such that 97.5% of the distribution is below that value * Then 95% of the distribution lies between `\(-t_{97.5\%}\)` and `\(t_{97.5\%}\)` -- <p style = "margin-bottom:1cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="67%" style="display: block; margin: auto;" /> --- ### 1. Regressions with continuous variables #### 1.2. Inference * Because we know that `\(\hat{t} \equiv \frac{\hat{\beta} - \beta}{\text{se}(\hat{\beta})}\)` follows this distribution, we know that it has a 95% chance to fall between the two values `\(-t_{97.5\%}\)` and `\(t_{97.5\%}\)` -- `$$\text{Pr}\left[-t_{97.5\%}\leq\frac{\hat{\beta} - \beta}{\text{se}(\hat{\beta})}\leq t_{97.5\%}\right] = 95\%$$` -- * Rearranging the terms yields: `$$\text{Pr}\left[\hat{\beta} - t_{97.5\%}\times\text{se}(\hat{\beta})\leq \beta \leq\hat{\beta} + t_{97.5\%}\times\text{se}(\hat{\beta})\right] = 95\%$$` -- <p style = "margin-bottom:1.25cm;"></p> .left-column[ * Thus, we can say that there is a 95% chance for `\(\beta\)` to be within $$\hat{\beta} \pm t_{97.5\%}\times\text{se}(\hat{\beta}) $$ ] -- .right-column[ <p style = "margin-bottom:-.6cm;"></p> * To get `\(t_{97.5\%}\)` with 20 df:

```r
qt(.975, 20)
```

] --- ### 1. Regressions with continuous variables #### 1.2.
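Inference

* Putting the pieces together, the 95% confidence interval can be computed directly from the regression output — a minimal sketch, reusing the hypothetical `df` from the previous sketch:

```r
# 95% confidence interval for beta-hat ('df' as in the previous sketch)
fit  <- lm(y ~ x, data = df)
b    <- coef(summary(fit))["x", "Estimate"]
se   <- coef(summary(fit))["x", "Std. Error"]
t975 <- qt(.975, df = nrow(df) - 2)   # n minus 2 estimated parameters

c(lower = b - t975 * se, upper = b + t975 * se)
confint(fit, "x", level = .95)        # built-in equivalent
```

---

### 1. Regressions with continuous variables

#### 1.2.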
Inference * ***Confidence intervals*** are very effective for getting a sense of the precision of our estimates and of the **range of values the true parameters could reasonably take** -- * But the ***p-value*** is what we tend to ultimately focus on: it is the **probability of obtaining an estimate at least as far from a given value (generally 0) by chance alone, if the true parameter were actually equal to that value** <p style = "margin-bottom:1cm;"></p> -- <ul> <li><b>Confidence intervals and p-values are tightly linked</b></li> <ul> <li>If the p-value against 0 of a parameter estimated at 2 is 4%, I know that the 95% confidence interval will start just above 0 and, being symmetric around 2, stop just below 4</li> <li>If a 95% confidence interval is bounded by 4 and 5, I know that the p-value will be way below 5%</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <ul> <li>But these two indicators are <b>complementary</b> to easily get the full picture:</li> <ul> <li>With a p-value we can easily know how sure we are that the parameter is different from a given value, but it is difficult to get a sense of the set of values the parameter can reasonably take</li> <li>With the confidence interval it is the opposite</li> </ul> </ul> --- ### 1. Regressions with continuous variables #### 1.2. Inference <ul> <li><b>P-val. computation:</b> The principle is the same as for confidence intervals but the reasoning is reversed</li> <ul> <li>For <i>confidence intervals</i>: we want to know between which values the parameter has a given percentage chance of falling</li> <li>For <i>p-value</i>: we want to know with what percentage chance 0 lies outside the set of values that the parameter could reasonably take</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <ul> <li><b>Vocabulary:</b> We talk about <i>significance level</i></li> <ul> <li> When \(\text{P-value} \leq .05\), we say that the estimate is significant(ly different from 0) at the 5% level</li> <li> When the p-value is greater than a given threshold of acceptability, we say that the estimate is not significant</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> -- <ul> <li><b>In practice:</b> Usually in Economics we use the 5% threshold</li> <ul> <li>But this is arbitrary; in other fields the benchmark p-value is different</li> <li>With this threshold, we wrongly reject a true null hypothesis about once in 20 times</li> </ul> </ul> --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1cm;list-style: none"> <li><b>2. Regressions with discrete variables</b></li> <ul style = "list-style: none"> <li>2.1. Binary dependent variable</li> <li>2.2. Binary independent variable</li> <li>2.3. Categorical independent variable</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Controls and interactions</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Interpretation</b></li> </ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1cm;list-style: none"> <li><b>2.
Regressions with discrete variables</b></li> <ul style = "list-style: none"> <li>2.1. Binary dependent variable</li> <li>2.2. Binary independent variable</li> <li>2.3. Categorical independent variable</li> </ul> </ul> ] --- ### 2. Regressions with discrete variables #### 2.1. Binary dependent variable <ul> <li>So far we've considered only continuous variables in our regression models</li> <ul> <li>But what if our dependent variable is discrete?</li> </ul> </ul> -- <ul> <li>Consider that we have data on candidates for a job:</li> <ul> <li>Their <i>Baccalauréat</i> grade (/20) </li> <li>Whether they got accepted</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"> <img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Regressions with discrete variables #### 2.1. Binary dependent variable <ul> <li>Even if the outcome variable is binary we can regress it on the grade variable</li> <ul> <li>We can convert it into a <b>dummy</b> variable, a variable taking either the value 0 or 1</li> <li>Here consider a dummy variable taking the value 1 if the person was accepted</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"> `$$1\{y_i = \text{Accepted}\} = \hat{\alpha} + \hat{\beta} \times \text{Grade}_i + \hat{\varepsilon_i}$$` <p style = "margin-bottom:1cm;"> <style> .left-column {width: 65%;} .right-column {width: 35%;} </style> <img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Regressions with discrete variables #### 2.1. Binary dependent variable <ul> <li>The fitted values can be viewed as the probability of being accepted for a given grade</li> </ul> <p style = "margin-bottom:-.5cm;"></p> -- <ul> <ul> <li>The slope is thus by how much the probability of being accepted would increase on expectation for a 1 point increase in the grade</li> <li>That's why we call OLS regression models with a binary outcome <i>Linear Probability Models</i></li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 2. Regressions with discrete variables #### 2.1. Binary dependent variable <ul> <li>But with an LPM you can end up with 'probabilities' that are lower than 0 or greater than 1</li> <ul> <li>Interpretation is only valid for values of x sufficiently close to the mean</li> <li>Keep that in mind and be careful when interpreting the results of an LPM</li> </ul> </ul> <p style = "margin-bottom:1.75cm;"> <img src="slides_files/figure-html/unnamed-chunk-15-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 2. Regressions with discrete variables #### 2.2. Binary independent variable <ul> <li>Now consider that we have individual data containing:</li> <ul> <li>The sex</li> <li>The height (centimeters)</li> </ul> </ul> -- <p style = "margin-bottom:1.5cm;"> <ul> <li>So instead of</li> <ul> <li>having a binary dependent variable:</li> </ul> </ul> `$$1\{y_i = \text{Accepted}\} = \hat{\alpha} + \hat{\beta} \times \text{Grade}_i + \hat{\varepsilon_i}$$` <ul> <ul> <li>we have a binary independent variable</li> </ul> </ul> `$$\text{Height}_i = \hat{\alpha} + \hat{\beta} \times 1\{x_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- <p style = "margin-bottom:1.75cm;"> <center><h4><i> ➜ How to interpret the coefficient \(\hat{\beta}\) from this regression?</i></h4></center> --- ### 2. Regressions with discrete variables #### 2.2.
Binary independent variable <ul> <li>If the sex variable was continuous it would be the expected increase in height for a <i>'1 unit increase'</i> in sex</li> <ul> <li>Here the <i>'1 unit increase'</i> is switching from 0 to 1, i.e. from female to male</li> <li>Here is the traditional scatter plot representation</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" /> --- ### 2. Regressions with discrete variables #### 2.2. Binary independent variable <ul> <li>Replacing the point geometry with the corresponding boxplots: </li> <ul> <li>What this <i>'1 unit increase'</i> corresponds to should be clearer</li> <li>The coefficient \(\hat{\beta}\) is actually the difference between the average height for males and females</li> </ul> </ul> <img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" /> --- ### 2. Regressions with discrete variables #### 2.2. Binary independent variable * Let's have a look at the regression results and at the summary statistics of both distributions: -- <p style = "margin-bottom:-.5cm;"></p> .pull-left[

```
## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         Height          
## ----------------------------------------
## SexMale                 9.5***          
##                         (0.6)           
##                                         
## Constant               165.0***         
##                         (0.4)           
##                                         
## ----------------------------------------
## Observations            1,000           
## R2                       0.2            
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
```

] -- .pull-right[ <p style = "margin-bottom:1cm;"></p> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Height summary statistics by sex</caption> <thead> <tr> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Q1 </th> <th style="text-align:right;"> Med </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Q3 </th> <th style="text-align:right;"> Max </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 135.9 </td> <td style="text-align:right;"> 158.8 </td> <td style="text-align:right;"> 164.6 </td> <td style="text-align:right;"> 165.0 </td> <td style="text-align:right;"> 170.9 </td> <td style="text-align:right;"> 194.7 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 145.2 </td> <td style="text-align:right;"> 168.3 </td> <td style="text-align:right;"> 174.4 </td> <td style="text-align:right;"> 174.5 </td> <td style="text-align:right;"> 180.6 </td> <td style="text-align:right;"> 202.6 </td> </tr> </tbody> </table> <p style = "margin-bottom:1.25cm;"></p> ➜ The `\(\hat{\alpha}\)` coefficient is equal to the expected value of `\(y\)` when `\(x = 0\)`, i.e., to the average height for females <p style = "margin-bottom:.75cm;"></p> ➜ The `\(\hat{\beta}\)` coefficient is equal to the expected increase in `\(y\)` when going from `\(x = 0\)` to `\(x = 1\)`, i.e., to the difference between male and female average height ] --- ### 2. Regressions with discrete variables #### 2.2.
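Binary independent variable

* These equalities are easy to check in R — a minimal sketch with simulated data (hypothetical numbers chosen to mimic the tables above):

```r
# Simulated heights: female mean 165, male mean 174.5 (hypothetical values)
set.seed(1)
heights <- data.frame(
  Sex    = rep(c("Female", "Male"), each = 500),
  Height = c(rnorm(500, mean = 165, sd = 9), rnorm(500, mean = 174.5, sd = 9))
)

# Intercept = female mean; Sex coefficient = male-female difference in means
coef(lm(Height ~ Sex, data = heights))

# Compare with the group means computed directly
tapply(heights$Height, heights$Sex, mean)
```

---

### 2. Regressions with discrete variables

#### 2.2.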
Binary independent variable * Let's think of it in terms of a regression model: `$$\text{Height}_i = \hat{\alpha} + \hat{\beta} \times 1\{x_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- * We now have `\(\hat{\alpha}\)` and `\(\hat{\beta}\)`: `$$\text{Height}_i = 165.0 + 9.5 \times 1\{x_i = \text{Male}\} + \hat{\varepsilon_i}$$` -- * The fitted values are: `$$\widehat{\text{Height}_i} = 165.0 + 9.5 \times 1\{x_i = \text{Male}\}$$` -- .pull-left[ * When the dummy equals 0 (females): `$$\begin{align} \widehat{\text{Height}_i} & = 165.0 + 9.5 \times 0\\ &= 165.0 =\overline{\text{Height}_{\left[x_i = \text{Female}\right]}} \end{align}$$` ] -- .pull-right[ * When the dummy equals 1 (males): `$$\begin{align}\widehat{\text{Height}_i} & = 165.0 + 9.5 \times 1\\ &= 174.5 =\overline{\text{Height}_{\left[x_i = \text{Male}\right]}}\end{align}$$` ] --- ### 2. Regressions with discrete variables #### 2.3. Categorical independent variable <ul> <li>So far we've been working with binary categorical variables:</li> <ul> <li>Accepted vs. Rejected, Male vs. Female</li> <li>But what about discrete variables with more than two categories?</li> </ul> </ul> -- * Take for instance the race variable:

```r
asec_2020 <- read.csv("asec_2020.csv")
kable(asec_2020 %>% group_by(Race) %>% summarise(N = n()) %>% t(),
      caption = "Distribution of the Race categorical variable")
```

<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Distribution of the Race categorical variable</caption> <tbody> <tr> <td style="text-align:left;"> Race </td> <td style="text-align:left;"> Asian </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> Other </td> <td style="text-align:left;"> White </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> 4528 </td> <td style="text-align:left;"> 6835 </td> <td style="text-align:left;"> 2422 </td> <td style="text-align:left;"> 50551 </td> </tr> </tbody> </table> -- <p style = "margin-bottom:.75cm;"></p> <center><b><i>➜ How can we use this variable as an independent variable in our regression framework?</i></b></center> --- ### 2. Regressions with discrete variables #### 2.3.
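Categorical independent variable

* A first hint: we can inspect the design matrix R builds when a factor enters a regression — a minimal sketch with a hypothetical mini-sample (the `Race` variable of `asec_2020` works the same way):

```r
# R encodes a k-category factor as k - 1 dummies, omitting the
# first category in alphabetical order (here: Asian)
race <- factor(c("Asian", "Black", "Other", "White", "White"))
model.matrix(~ race)
```

* The next slides detail why one category is always omitted

---

### 2. Regressions with discrete variables

#### 2.3.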
Categorical independent variable * Just as we converted our `\(2\)`-category variable into `\(1\)` dummy variable, we can convert an `\(n\)`-category variable into `\(n-1\)` dummy variables: -- .pull-left[ <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption> </caption> <thead> <tr> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Male </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> Race </th> <th style="text-align:right;"> Black </th> <th style="text-align:right;"> Other </th> <th style="text-align:right;"> White </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Asian </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Asian </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Black </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Black </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Other </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> Other </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;width: 3em; "> </td> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] -- .pull-right[ <p style = "margin-bottom:-.25cm;"></p> ***➜ But why do we omit one category every time?*** * Females are observations for which Male equals 0 * Asians are observations for which Black, Other, and White each equals 0 ➜ Females and Asians are ***reference categories*** * The coefficient associated with the Male dummy was interpreted ***relative*** to females * The coefficients associated with the Black, Other, and White dummies will be interpreted 
***relative*** to Asians ] --- ### 2. Regressions with discrete variables #### 2.3. Categorical independent variable * Thus, regressing earnings on the race categorical variable amounts to estimating the equation: `$$\text{Earnings}_i = \hat{\alpha} + \hat{\beta_1} 1\{\text{Race}_i = \text{Black}\} + \hat{\beta_2} 1\{\text{Race}_i = \text{Other}\} + \hat{\beta_3} 1\{\text{Race}_i = \text{White}\} + \hat{\varepsilon_i}$$` -- <p style = "margin-bottom:1.25cm;"></p> * And if we compare the regression results to the average earnings by group: <p style = "margin-bottom:-.75cm;"></p> -- .left-column[

```r
summary(lm(Earnings ~ Race, asec_2020))$coefficients
```

```
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  77990.78   1149.552  67.84449 0.000000e+00
## RaceBlack   -27413.29   1482.197 -18.49503 3.571079e-76
## RaceOther   -28512.08   1947.305 -14.64181 1.819073e-48
## RaceWhite   -15110.29   1199.933 -12.59262 2.559272e-36
```

<p style = "margin-bottom:1cm;"></p> <ul><ul> <li>\(\hat{\alpha}\) is still the average earnings for the reference category </li> <li>coefficients are still <i>relative</i> to the reference category</li> </ul></ul> ] .right-column[ <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Mean earnings by race</caption> <thead> <tr> <th style="text-align:left;"> Race </th> <th style="text-align:right;"> Mean earnings </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Asian </td> <td style="text-align:right;"> 77990.78 </td> </tr> <tr> <td style="text-align:left;"> Black </td> <td style="text-align:right;"> 50577.49 </td> </tr> <tr> <td style="text-align:left;"> Other </td> <td style="text-align:right;"> 49478.70 </td> </tr> <tr> <td style="text-align:left;"> White </td> <td style="text-align:right;"> 62880.49 </td> </tr> </tbody> </table> ] --- ### 2. Regressions with discrete variables #### 2.3. Categorical independent variable * As you can see from the previous regression results, by default R sorts categories in alphabetical order:

```
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  77990.78   1149.552  67.84449 0.000000e+00
## RaceBlack   -27413.29   1482.197 -18.49503 3.571079e-76
## RaceOther   -28512.08   1947.305 -14.64181 1.819073e-48
## RaceWhite   -15110.29   1199.933 -12.59262 2.559272e-36
```

-- * But oftentimes we would prefer the reference category to be the majority group * In R we can use the `relevel()` function to change the reference category of a factor --

```r
summary(lm(Earnings ~ relevel(as.factor(Race), "White"), asec_2020))$coefficients[, c(1, 2, 4)]
```

```
##                                         Estimate Std. Error     Pr(>|t|)
## (Intercept)                             62880.49   344.0464 0.000000e+00
## relevel(as.factor(Race), "White")Asian  15110.29  1199.9326 2.559272e-36
## relevel(as.factor(Race), "White")Black -12302.99   996.8981 5.947231e-35
## relevel(as.factor(Race), "White")Other -13401.79  1609.0045 8.294160e-17
```

--- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1cm;list-style: none"> <li><b>2. Regressions with discrete variables ✔</b></li> <ul style = "list-style: none"> <li>2.1. Binary dependent variable</li> <li>2.2. Binary independent variable</li> <li>2.3.
Categorical independent variable</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Controls and interactions</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Interpretation</b></li> </ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1cm;list-style: none"> <li><b>2. Regressions with discrete variables ✔</b></li> <ul style = "list-style: none"> <li>2.1. Binary dependent variable</li> <li>2.2. Binary independent variable</li> <li>2.3. Categorical independent variable</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Controls and interactions</b></li> </ul> ] --- ### 3. Controls and interactions <ul> <li>We can add a third variable z in the regression for two reasons:</li> <ul> <li><b>Controlling for z</b> allows us to <b>net out</b> the relationship between x and y from how they both relate to <b>z</b></li> <li><b>Interacting x with z</b> allows us to <b>estimate how the relationship</b> between x and y <b>varies with z</b></li> </ul> </ul> -- <ul> <li>Consider the following fictitious dataset at the household level</li> <ul> <li>Household annual income</li> <li>Number of children in the household</li> <li>Parents' education level</li> </ul> </ul> .pull-left[

```r
data <- read.csv("household_data.csv")
head(data, 7) # fictitious data
```

```
##   Income Children    Education
## 1     20        1 < Highschool
## 2     10        1 < Highschool
## 3     10        2 < Highschool
## 4     15        0 < Highschool
## 5     15        1 < Highschool
## 6     20        0 < Highschool
## 7     15        2   Highschool
```

] -- .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="85%" style="display: block; margin: auto;" /> ] --- ### 3. Controls and interactions <ul> <li>There's a clear positive relationship</li> </ul>

```
##             Estimate Pr(>|t|)
## (Intercept)   -0.885    0.319
## Income         0.166    0.000
```

-- <ul> <ul> <li>But what if this relationship was driven by a third variable?</li> <li>Maybe it's just that more educated parents tend to earn more and to have more children</li> </ul> </ul> -- <img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="50%" style="display: block; margin: auto;" /> -- .pull-right[ <ul> <li>In this example, <b>education</b> is indeed <b>positively correlated with both variables</b></li> <ul> <li>So at least part of the positive relationship we observe is actually due to education</li> <li><b>Controlling</b> for education estimates the relationship by <b>netting out the contribution of education</b></li> </ul> </ul> ] --- ### 3.
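Controls and interactions

* In R, controlling for education simply means adding it to the formula — a minimal sketch with hypothetical data mimicking `household_data.csv`:

```r
# Hypothetical household data: education raises both income and fertility
set.seed(1)
n   <- 60
edu <- sample(c("< Highschool", "Highschool", "College"), n, replace = TRUE)
hh  <- data.frame(
  Education = edu,
  Income    = 15 + 10 * (edu == "Highschool") + 25 * (edu == "College") + rnorm(n, sd = 5),
  Children  = 1 + 1 * (edu == "Highschool") + 2 * (edu == "College") + rnorm(n, sd = .8)
)

# Raw relationship: picks up the contribution of education
coef(summary(lm(Children ~ Income, hh)))[, c(1, 4)]

# Controlling for education: the income slope is netted out
coef(summary(lm(Children ~ Income + Education, hh)))[, c(1, 4)]
```

---

### 3.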
Controls and interactions <ul> <li><b>Controlling</b> for education does the same to the slope <b>as recentering</b> the graph with respect to education</li> <ul> <li>In that way, when moving along the x axis, <b>z</b> does not increase but <b>remains constant</b></li> </ul> </ul> <p style = "margin-bottom:-.75cm;"></p> -- .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" /> <ul> <li>The crosses are located at the average x and y values for each education group</li> <ul> <li>Controlling for education shifts x and y by group such that the crosses superimpose</li> </ul> </ul> ] -- .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />

```
##                     Estimate Pr(>|t|)
## (Intercept)           -0.120    0.892
## Income                 0.064    0.196
## EducationCollege       3.456    0.015
## EducationHighschool    1.856    0.037
```

] --- ### 3. Controls and interactions <ul> <li>Here when we <b>do not control</b> for education:</li> </ul> `$$Children_i = \alpha + \beta Income_i + \varepsilon_i$$` <ul><ul> <li>We estimate the overall relationship (here, significantly positive)</li> </ul></ul> -- <p style = "margin-bottom:1.25cm;"></p> <ul> <li>But when we <b>control</b> for education:</li> </ul> `$$Children_i = \alpha + \beta Income_i + \gamma_1 1\{Education_i=\text{Highschool}\} + \gamma_2 1\{Education_i=\text{College}\} +\varepsilon_i$$` <ul><ul> <li>We estimate the relationship net of the effect of education (here, not significant)</li> </ul></ul> -- <p style = "margin-bottom:1.25cm;"></p> <ul> <li><b>Interacting</b> the two variables is going one step further:</li> </ul> `$$\begin{align}Children_i & = \alpha + \beta Income_i + \gamma_1 1\{Education_i=\text{Highschool}\} + \gamma_2 1\{Education_i=\text{College}\} + \\ & \delta_1 Income_i\times1\{Education_i=\text{Highschool}\} + \delta_2 Income_i \times 1\{Education_i=\text{College}\} + \varepsilon_i\end{align}$$` <ul><ul> <li>It is not simply taking into account the fact that education may play a role</li> <li>It estimates by how much the relationship between x and y varies according to z</li> </ul></ul> --- ### 3. Controls and interactions * <b>Interacting</b> income with education provides <b>one slope per education group</b>: -- .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <p style = "margin-bottom:2.5cm;"></p>

```
##                            Estimate Pr(>|t|)
## (Intercept)                   2.333    0.225
## Income                       -0.100    0.411
## EducationCollege             -1.768    0.553
## EducationHighschool           0.596    0.819
## Income:EducationCollege       0.239    0.095
## Income:EducationHighschool    0.111    0.445
```

] -- <ul> <li>The principle is the same when the third variable is continuous:</li> <ul> <li>Controlling nets out the slope from how the third variable enters the relationship</li> <li>Interacting gives by how much the slope changes on expectation when the third variable increases by 1</li> <li>And we can control for/interact with multiple third variables</li> </ul> </ul> --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1cm;list-style: none"> <li><b>1. Regressions with continuous variables ✔</b></li> <ul style = "list-style: none"> <li>1.1. Estimation</li> <li>1.2. Inference</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1cm;list-style: none"> <li><b>2. Regressions with discrete variables ✔</b></li> <ul style = "list-style: none"> <li>2.1.
Binary dependent variable</li> <li>2.2. Binary independent variable</li> <li>2.3. Categorical independent variable</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Controls and interactions ✔</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:-1cm;list-style: none"> <li><b>4. Interpretation</b></li> </ul> ] --- class: inverse, hide-logo ### 4. Interpretation <center><b>Practice interpreting coefficients from randomly drawn relationships</b></center> <p style = "margin-bottom:1cm;"></p> <center><a href="https://sirugue.shinyapps.io/lecture15/"><img src = "html.png" width = "900"/></a></center>
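---

class: hide-logo

### 4. Interpretation

* To practice offline, a minimal sketch (same hypothetical data as the Part 3 sketch) fitting the interacted model from Part 3 — try to interpret each coefficient before checking against the slides:

```r
# Same hypothetical household data as in the Part 3 sketch
set.seed(1)
n   <- 60
edu <- sample(c("< Highschool", "Highschool", "College"), n, replace = TRUE)
hh  <- data.frame(
  Education = edu,
  Income    = 15 + 10 * (edu == "Highschool") + 25 * (edu == "College") + rnorm(n, sd = 5),
  Children  = 1 + 1 * (edu == "Highschool") + 2 * (edu == "College") + rnorm(n, sd = .8)
)

# 'Income * Education' expands to both main effects plus their
# interactions, i.e., one income slope per education group
summary(lm(Children ~ Income * Education, hh))$coefficients[, c(1, 4)]
```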