class: center, middle, inverse, title-slide

# Causality
## Lecture 11
### Louis SIRUGUE
### CPES 2 - Fall 2022

---

<style>
.left-column {width: 65%;}
.right-column {width: 35%;}
</style>

### Quick reminder
#### Data generating process

<ul>
<li>In practice we estimate coefficients on a <b>given realization of a data generating process</b></li>
<ul>
<li>So the <b>true coefficient</b> is <b>unobserved</b></li>
<li>But our <b>estimation</b> is <b>informative</b> about the values the true coefficient is likely to take</li>
</ul>
</ul>

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto auto auto 0;" />
]

.right-column[
<p style = "margin-bottom:3cm"></p>
`$$\frac{\hat{\beta}-\beta}{\text{SD}(\hat{\beta})} \sim \mathcal{N}(0, 1)$$`
]

---

### Quick reminder
#### Confidence interval

<ul>
<li>This allows us to infer a <b>confidence interval:</b></li>
</ul>

`$$\hat{\beta}\pm t(\text{df})_{1-\frac{\alpha}{2}}\times\text{se}(\hat{\beta})$$`

<p style = "margin-bottom:1.5cm;"></p>

--

<ul>
<li>Where \(t(\text{df})_{1-\frac{\alpha}{2}}\) is the value from a <b>Student \(t\) distribution</b></li>
<ul>
<li>With the relevant number of <b>degrees of freedom</b> \(\text{df}\) (n - #parameters)</li>
<li>And the desired <b>confidence level</b> \(1-\alpha\)</li>
</ul>
</ul>

<p style = "margin-bottom:1.5cm;"></p>

--

<ul>
<li>And where \(\text{se}(\hat{\beta})\) denotes the <b>standard error</b> of \(\hat{\beta}\):</li>
</ul>

`$$\text{se}(\hat{\beta}) = \sqrt{\widehat{\text{Var}(\hat{\beta})}} = \sqrt{\frac{\sum_{i = 1}^n\hat{\varepsilon}_i^2}{(n-\#\text{parameters})\sum_{i = 1}^n(x_i-\bar{x})^2}}$$`

---

### Quick reminder
#### P-value

<ul>
<li>It also allows us to <b>test</b> how likely it is that \(\beta\) <b>differs from a given value:</b></li>
<ul>
<li>If the <b>p-value</b> < 5%, we can <b>reject</b> that \(\beta\) equals the <b>hypothesized value</b> at the 95% confidence level</li>
<li>This threshold, very common in Economics, implies a 1 in 20 chance of rejecting a null hypothesis that is actually true</li>
</ul>
</ul>

--

```r
linearHypothesis(lm(ige ~ gini, ggcurve), "gini = 0")
```

```
## Linear hypothesis test
## 
## Hypothesis:
## gini = 0
## 
## Model 1: restricted model
## Model 2: ige ~ gini
## 
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
## 1     21 0.46733                                
## 2     20 0.26883  1    0.1985 14.767 0.001016 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

<h3>Today: Causality</h3>

--

<p style = "margin-bottom:4.25cm;"></p>

.pull-left[
<ul style = "margin-left:1.5cm;list-style: none">
<li><b>1. Main sources of bias</b></li>
<ul style = "list-style: none">
<li>1.1. Omitted variables</li>
<li>1.2. Functional form</li>
<li>1.3. Selection bias</li>
<li>1.4. Measurement error</li>
<li>1.5. Simultaneity</li>
</ul>
</ul>
]

.pull-right[
<ul style = "margin-left:-1cm;list-style: none">
<li><b>2. Randomized control trials</b></li>
<ul style = "list-style: none">
<li>2.1. Introduction to RCTs</li>
<li>2.2. Types of randomization</li>
<li>2.3. Multiple testing</li>
</ul>
</ul>
<p style = "margin-bottom:.65cm;"></p>
<ul style = "margin-left:-1cm;list-style: none"><li><b>3. Wrap up!</b></li></ul>
]

---

<h3>Today: Causality</h3>

<p style = "margin-bottom:4.25cm;"></p>

.pull-left[
<ul style = "margin-left:1.5cm;list-style: none">
<li><b>1. Main sources of bias</b></li>
<ul style = "list-style: none">
<li>1.1. Omitted variables</li>
<li>1.2. Functional form</li>
<li>1.3. Selection bias</li>
<li>1.4. Measurement error</li>
<li>1.5. Simultaneity</li>
</ul>
</ul>
]

---
### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>Consider the following regression:</li>
<ul>
<li>Where \(\text{Earnings}_i\) denotes individuals' annual labor earnings</li>
<li>And \(\text{Education}_i\) stands for individuals' number of years of education</li>
</ul>
</ul>

`$$\text{Earnings}_i = \alpha + \beta \times \text{Education}_i + \varepsilon_i$$`

--

```r
summary(lm(Earnings ~ Education, sim_dat))$coefficients
```

```
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 7514.800  2994.3060  2.509697 1.209949e-02
## Education   2643.312   205.2692 12.877294 1.220064e-37
```

--

<p style = "margin-bottom:1cm;"></p>

<ul>
<li>Taking \(\hat{\beta}\) at face value, the <b>"expected returns"</b> from an additional year of education amount to $2,643/year</li>
<ul>
<li>But if we were to enforce an additional year of education for randomly selected individuals, would they earn $2,643 more than they would have earned otherwise?</li>
</ul>
</ul>

--

<center><i>➜ The answer is <b>no</b>, because the estimated effect is <b>not causal!</b></i></center>

---

### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>The estimated relationship could be partly driven by some <b>confounding factors:</b></li>
<ul>
<li>Maybe <b>more skilled</b> individuals both <b>study longer</b> and <b>earn more</b> because they are skilled</li>
<li>Even without the additional education, they would still earn more because they are skilled</li>
</ul>
</ul>

--

<p style = "margin-bottom:1.25cm;"></p>

<ul>
<li>The skill variable acts as a <b>confounding factor</b> because it is correlated with both \(x\) and \(y\)</li>
<ul>
<li>This would also be the case for parental socio-economic status and many other variables</li>
<li>We need to include these variables in the regression as <b>control variables</b></li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

`$$\text{Earnings}_i = \alpha + \beta_1\times \text{Education}_i + \beta_2\times\text{Skills}_i + \varepsilon_i$$`

<p style = "margin-bottom:1.25cm;"></p>

<ul>
<li>In your view, would the estimated effect of education be higher or lower in this regression?</li>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

<center><i>➜ If skills are indeed <b>positively correlated with both</b> education and earnings, the new coefficient will be <b>lower</b></i></center>

---

### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>Remember that <b>controlling</b> for a variable can be viewed as:</li>
<ul>
<li></li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="75%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>Remember that <b>controlling</b> for a variable can be viewed as:</li>
<ul>
<li>Allowing the <b>intercept</b> to <b>vary</b> with that variable</li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>Remember that <b>controlling</b> for a variable can be viewed as:</li>
<ul>
<li>Allowing the <b>intercept</b> to <b>vary</b> with that variable</li>
<li>Keeping this <b>variable constant</b> as we move along the \(x\)-axis</li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="75%" style="display: block; margin: auto;" />

---
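### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>A short simulation makes the bias concrete <i>(a minimal sketch on made-up data, not the <code>sim_dat</code> used above: here skills drive both education and earnings by construction)</i></li>
</ul>

```r
set.seed(1)

# Hypothetical DGP: skills raise education AND earnings,
# and the true return to education is 2,000
sim_ovb <- tibble(skills    = rnorm(1000),
                  education = 12 + 2 * skills + rnorm(1000),
                  earnings  = 10000 + 2000 * education + 5000 * skills +
                              rnorm(1000, 0, 1000))

# Omitting skills: the coefficient also picks up part of the effect of skills
lm(earnings ~ education, sim_ovb)$coefficients

# Controlling for the confounder recovers a coefficient close to 2,000
lm(earnings ~ education + skills, sim_ovb)$coefficients
```

<ul>
<li>The first estimate should be well above 2,000, the second close to it</li>
</ul>

---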
### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>In that case the <b>confounding</b> variable <b>no longer affects</b> our relationship of interest</li>
<ul>
<li>It accounts for the fact that more skilled individuals tend to have both higher education and earnings</li>
<li>Such that the <b>relationship</b> between education and earnings is <b>net of the effect of skills</b></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="75%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.1. Omitted variable bias

<ul>
<li>But <b>we are never able to control for all</b> potential confounding factors</li>
<ul>
<li>We can almost always think of variables that may affect both \(x\) and \(y\) but that are not in the data</li>
<li>Resulting in what is called the <b>omitted variable bias</b></li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>In that case you should either:</li>
<ul>
<li>Use econometric techniques for causal identification (not covered in this course, except RCTs)</li>
<li>Acknowledge that your estimated effect is not causal with the phrase <i><b>"ceteris paribus"</b></i></li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li><i>Ceteris paribus</i> means <b>"everything else equal"</b></li>
<ul>
<li>We use this phrase to indicate that our <b>estimation is correct under the hypothesis that</b> when our \(x\) of interest moves, <b>no confounding factor</b> affecting \(y\) moves with it</li>
<li>Indeed, if no other variable varies with both \(x\) and \(y\), our regression doesn't need more controls</li>
<li>We know this assumption is <b>not correct</b>, but it is <b>important to be transparent and clear</b> about what the coefficient means</li>
</ul>
</ul>

---

### 1. Main sources of bias
#### 1.2. Functional form

<ul>
<li>Now consider the following relationship between years of education and earnings</li>
<ul>
<li></li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-11-1.png" width="75%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.2. Functional form

<ul>
<li>Now consider the following relationship between years of education and earnings</li>
<ul>
<li>We can fit a regression line as we usually do</li>
<li>But would that be an appropriate estimation?</li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="75%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.2. Functional form

<ul>
<li>We must capture the <b>non-linearity</b></li>
<ul>
<li>The relationship cannot be correctly captured by a straight line</li>
<li></li>
</ul>
</ul>

<p style = "margin-bottom:.75cm;"></p>

`$$\text{Earnings}_i = \alpha + \beta_1\times \text{Education}_i + \varepsilon_i$$`

---

### 1. Main sources of bias
#### 1.2. Functional form

<ul>
<li>We must capture the <b>non-linearity</b></li>
<ul>
<li>The relationship cannot be correctly captured by a straight line</li>
<li>It has the shape of a <b>polynomial of degree 2</b></li>
</ul>
</ul>

<p style = "margin-bottom:.75cm;"></p>

`$$\text{Earnings}_i = \alpha + \beta_1\times \text{Education}_i + \color{SkyBlue}{\beta_2\times\text{Education}^2_i} + \varepsilon_i$$`

--

<p style = "margin-bottom:1.25cm;"></p>

<ul>
<li>Given the previous graph, what would be the signs of \(\hat{\beta}_1\) and \(\hat{\beta}_2\)?</li>
</ul>

---
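### 1. Main sources of bias
#### 1.2. Functional form

<ul>
<li>In lm(), the squared term can be added with I() <i>(a sketch assuming the same <code>quadratic</code> data as in the next slides; output not shown)</i></li>
</ul>

```r
# I() makes R square Education before estimating,
# instead of interpreting ^ as a formula operator
lm(Earnings ~ Education + I(Education^2), quadratic)

# An equivalent specification with raw polynomials
lm(Earnings ~ poly(Education, 2, raw = TRUE), quadratic)
```

---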
### 1. Main sources of bias
#### 1.2. Functional form

<ul>
<li>We must capture the <b>non-linearity</b></li>
<ul>
<li>The relationship cannot be correctly captured by a straight line</li>
<li>It has the shape of a <b>polynomial of degree 2</b></li>
</ul>
</ul>

<p style = "margin-bottom:.75cm;"></p>

`$$\text{Earnings}_i = \alpha + \beta_1\times \text{Education}_i + \color{SkyBlue}{\beta_2\times\text{Education}^2_i} + \varepsilon_i$$`

<p style = "margin-bottom:1.25cm;"></p>

<ul>
<li>Given the previous graph, what would be the signs of \(\hat{\beta}_1\) and \(\hat{\beta}_2\)?</li>
<ul>
<li>\(\hat{\beta}_1\) would be positive because the relationship is increasing</li>
<li>\(\hat{\beta}_2\) would be negative because the relationship is concave</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>Polynomial functional forms are easy to handle in R</li>
<ul>
<li>You can <b>square the explanatory variable and add it</b> in lm()</li>
<li>geom_smooth() also allows you to plot a polynomial fit</li>
</ul>
</ul>

---

### 1. Main sources of bias
#### 1.2. Functional form

```r
ggplot(quadratic, aes(x = Education, y = Earnings)) + 
  geom_point() +
  geom_smooth(method = "lm")
```

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]

---

### 1. Main sources of bias
#### 1.2. Functional form

```r
ggplot(quadratic, aes(x = Education, y = Earnings)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))
```

.left-column[
<img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.right-column[
<p style = "margin-bottom:2cm;"></p>
<ul>
<li>But functional form is not only about polynomial degrees:</li>
<ul>
<li>Interactions</li>
<li>Logs</li>
<li>Discretization</li>
<li>...</li>
</ul>
</ul>
]

---

### 1. Main sources of bias
#### 1.3. Selection bias

<ul>
<li>Now remember the example on high-school grades and job application acceptance</li>
<ul>
<li>We plotted the <b>grades</b> of individuals on the \(x\)-axis</li>
<li>And <b>whether</b> or not <b>they got the job</b> on the \(y\)-axis</li>
</ul>
</ul>

<p style = "margin-bottom:1cm;"></p>

.left-column[
<p style = "margin-bottom:-1cm;"></p>
<img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.right-column[
<p style = "margin-bottom:-.25cm;"></p>
<ul>
<li>We estimated that a <b>1 unit</b> increase in Grade (/20) would <b>increase the probability</b> of being accepted by about <b>a third</b> on expectation, <b>ceteris paribus</b></li>
</ul>
<ul>
<li>Is this estimation relevant?</li>
<ul>
<li>Look at the support of \(x\)</li>
</ul>
</ul>
]

---
### 1. Main sources of bias
#### 1.3. Selection bias

<ul>
<li>The fact that almost all grades range between 13 and 17 hints at a <b>selection problem:</b></li>
<ul>
<li>Individuals with very <b>low grades won't apply</b> to the position because <b>they know they will be rejected</b></li>
<li>Individuals with very <b>high grades won't apply</b> to the position because <b>they apply to better positions</b></li>
</ul>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

.left-column[
<p style = "margin-bottom:-1cm;"></p>
<img src="slides_files/figure-html/unnamed-chunk-18-1.png" width="90%" style="display: block; margin: auto;" />
]

.right-column[
<ul>
<li>Had these individuals applied, the estimated effect would be lower</li>
</ul>
<ul>
<li>Our coefficient is specific to a non-representative sample</li>
<ul>
<li>Issue of <b>external validity</b></li>
<li>The interpretation only holds in our specific setting</li>
</ul>
</ul>
]

---

### 1. Main sources of bias
#### 1.3. Selection bias

<ul>
<li>Such <b>selection problems</b> are very common <b>threats to causality</b></li>
</ul>

--

<p style = "margin-bottom:.75cm;"></p>

<ul>
<li>What is the impact of going to a better neighborhood on your children's outcomes?</li>
<ul>
<li>Those who move may be different from those who stay: <b>self-selection issue</b></li>
<li>Here it is not that the sample is not representative of the population, but that <b>the outcomes of those who stayed are different from the outcomes those who moved would have had, if they had stayed</b></li>
</ul>
</ul>

--

<p style = "margin-bottom:.75cm;"></p>

<ul>
<li>This relates to the notion of <b>counterfactual</b></li>
<ul>
<li>If those who moved were comparable to those who stayed, it would be valid to use the outcome of those who stayed as the counterfactual outcome of those who moved</li>
<li>But because of selection, movers are not comparable to stayers, so we don't have a credible counterfactual</li>
</ul>
</ul>

--

<p style = "margin-bottom:.75cm;"></p>

<ul>
<li>The notion of counterfactual is key to answering many questions:</li>
<ul>
<li>What is the impact of an immigrant inflow on the labor market outcomes of locals?</li>
<li>We need to know how the labor market outcomes of locals would have evolved absent the immigrant inflow, but we do not observe this situation</li>
</ul>
</ul>

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Another way of obtaining <b>biased estimates</b> is to have an <b>independent variable measured with error</b></li>
<ul>
<li>For instance if you want to measure the effect of cognitive skills but you only have IQ scores</li>
<li>IQ is a noisy measure of cognitive skills, as individuals' performance on such tests is not always consistent</li>
</ul>
</ul>

--

<ul>
<li>It seems reasonable to assume that the measurement error follows a normal distribution:</li>
<ul>
<li></li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="50%" style="display: block; margin: auto;" />

---
### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Another way of obtaining <b>biased estimates</b> is to have an <b>independent variable measured with error</b></li>
<ul>
<li>For instance if you want to measure the effect of cognitive skills but you only have IQ scores</li>
<li>IQ is a noisy measure of cognitive skills, as individuals' performance on such tests is not always consistent</li>
</ul>
</ul>

<ul>
<li>It seems reasonable to assume that the measurement error follows a normal distribution:</li>
<ul>
<li>Individuals <b>usually</b> perform <b>close to their average</b> performance</li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="50%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Another way of obtaining <b>biased estimates</b> is to have an <b>independent variable measured with error</b></li>
<ul>
<li>For instance if you want to measure the effect of cognitive skills but you only have IQ scores</li>
<li>IQ is a noisy measure of cognitive skills, as individuals' performance on such tests is not always consistent</li>
</ul>
</ul>

<ul>
<li>It seems reasonable to assume that the measurement error follows a normal distribution:</li>
<ul>
<li>Individuals <b>usually</b> perform <b>close to their average</b> performance</li>
<li>And <b>larger deviations</b> are <b>rarer</b></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-21-1.png" width="50%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.4. Measurement error

<p style = "margin-bottom:1cm;"></p>

.pull-left[
<center>Denote \(x\) the IQ variable</center>
`$$x \sim \mathcal{N}(100,\, 15^2)$$`
]

.pull-right[
<center>Denote \(\eta\) the measurement error</center>
`$$\eta \sim \mathcal{N}(0,\, 1)$$`
]

--

<p style = "margin-bottom:1cm;"></p>

* The true relationship is

`$$y = \alpha + \beta x + \varepsilon$$`

--

* But we only observe

`$$\tilde{x} = x + \eta$$`

--

* So we can only estimate:

`$$y = \alpha + \beta \tilde{x} + \varepsilon \,\,\, \Longleftrightarrow \,\,\, y = \alpha + \beta (x + \eta) + \varepsilon$$`

--

<p style = "margin-bottom:1cm;"></p>

<center><i>➜ Let's <b>use simulations</b> to see how it may affect our estimation</i></center>

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>We can start by <b>generating a relationship</b> without measurement error</li>
</ul>

`$$y_i = 1 + 2 x_i + \varepsilon_i,\, \text{with}\, \varepsilon \sim \mathcal{N}(0,\, 1)$$`

```r
dat <- tibble(x = rnorm(1000, 100, 15),
              y = 1 + (2 * x) + rnorm(1000, 0, 1))
```

--

<p style = "margin-bottom:1cm;"></p>

.pull-left[
<ul>
<li>Estimate the <b>unbiased</b> relationship</li>
</ul>

```r
lm(y ~ x, dat)$coefficient
```

```
## (Intercept)           x 
##    0.824755    2.001394
```

<p style = "margin-bottom:1cm;"></p>

Is it just random chance or is `\(\hat{\beta}\)` downward biased? ➜
]

.pull-right[
<ul>
<li>And <b>with measurement error</b> \(\eta \sim \mathcal{N}(0,\, 1)\)</li>
</ul>

```r
dat <- dat %>% mutate(noisy_x = x + rnorm(1000, 0, 1))
lm(y ~ noisy_x, dat)$coefficient
```

```
## (Intercept)     noisy_x 
##    1.995596    1.990358
```
]

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Let's have a look at how \(\hat{\beta}\) behaves with an increasingly large \(\text{SD}(\eta)\)</li>
</ul>

--

```r
# Vector of standard deviations from 0 to 20
sd_noise <- 0:20

#
#

#
#

#
#

#
#

#
#

#
```

---
### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Let's have a look at how \(\hat{\beta}\) behaves with an increasingly large \(\text{SD}(\eta)\)</li>
</ul>

```r
# Vector of standard deviations from 0 to 20
sd_noise <- 0:20

# Empty vector for beta...
beta <- c()

#
#

#
#

#
#

#
#

#
```

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Let's have a look at how \(\hat{\beta}\) behaves with an increasingly large \(\text{SD}(\eta)\)</li>
</ul>

```r
# Vector of standard deviations from 0 to 20
sd_noise <- 0:20

# Empty vector for beta...
beta <- c()

# ... to be filled in a loop
for (i in sd_noise) {
  #
  #
  
  #
  #
  
  #
  #
}
```

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Let's have a look at how \(\hat{\beta}\) behaves with an increasingly large \(\text{SD}(\eta)\)</li>
</ul>

```r
# Vector of standard deviations from 0 to 20
sd_noise <- 0:20

# Empty vector for beta...
beta <- c()

# ... to be filled in a loop
for (i in sd_noise) {
  
  # Generate noisy x with corresponding SD(eta)
  dat_i <- dat %>% mutate(noisy_x = x + rnorm(1000, 0, i))
  
  #
  #
  
  #
  #
}
```

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Let's have a look at how \(\hat{\beta}\) behaves with an increasingly large \(\text{SD}(\eta)\)</li>
</ul>

```r
# Vector of standard deviations from 0 to 20
sd_noise <- 0:20

# Empty vector for beta...
beta <- c()

# ... to be filled in a loop
for (i in sd_noise) {
  
  # Generate noisy x with corresponding SD(eta)
  dat_i <- dat %>% mutate(noisy_x = x + rnorm(1000, 0, i))
  
  # Estimate the regression
  beta_i <- lm(y ~ noisy_x, dat_i)$coefficient[2]
  
  #
  #
}
```

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>Let's have a look at how \(\hat{\beta}\) behaves with an increasingly large \(\text{SD}(\eta)\)</li>
</ul>

```r
# Vector of standard deviations from 0 to 20
sd_noise <- 0:20

# Empty vector for beta...
beta <- c()

# ... to be filled in a loop
for (i in sd_noise) {
  
  # Generate noisy x with corresponding SD(eta)
  dat_i <- dat %>% mutate(noisy_x = x + rnorm(1000, 0, i))
  
  # Estimate the regression
  beta_i <- lm(y ~ noisy_x, dat_i)$coefficient[2]
  
  # Store the coefficient
  beta <- c(beta, beta_i)
}
```

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>We can then plot the \(\hat{\beta}\) for each value of \(\text{SD}(\eta)\)</li>
<ul>
<li></li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="75%" style="display: block; margin: auto;" />

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>We can then plot the \(\hat{\beta}\) for each value of \(\text{SD}(\eta)\)</li>
<ul>
<li>It is clear that the <b>measurement error</b> puts <b>downward pressure</b> on our estimate</li>
<li>And that the <b>noisier</b> the measure of \(x\), the <b>larger</b> the <b>bias</b></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-32-1.png" width="75%" style="display: block; margin: auto;" />

---
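### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>For reference, here is one way such a figure can be drawn from the simulated vectors <i>(a sketch; not necessarily the exact code behind the previous chart)</i></li>
</ul>

```r
# Combine the simulation results and plot beta against SD(eta),
# with a dashed line at the true coefficient of 2
tibble(sd_noise, beta) %>% 
  ggplot(aes(x = sd_noise, y = beta)) +
  geom_hline(yintercept = 2, linetype = "dashed") +
  geom_point() +
  geom_line() +
  labs(x = "SD of the measurement error", y = "Estimated coefficient")
```

---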
### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>And this phenomenon can easily be shown <b>mathematically:</b></li>
<ul>
<li></li>
<li></li>
</ul>
</ul>

`$$\hat{\beta} = \frac{\text{Cov}(\tilde{x},\, y)}{\text{Var}(\tilde{x})}$$`

<p style = "margin-bottom:.8cm;"></p>

--

`$$\hat{\beta} = \frac{\text{Cov}(x + \eta,\, y)}{\text{Var}(x + \eta)}$$`

<p style = "margin-bottom:.8cm;"></p>

--

`$$\hat{\beta} = \frac{\text{Cov}(x,\, y) + \text{Cov}(\eta,\, y)}{\text{Var}(x) + \text{Var}(\eta) + 2\text{Cov}(x,\, \eta)}$$`

--

<p style = "margin-bottom:.8cm;"></p>

Since the noise \(\eta\) is independent of \(x\) and \(\varepsilon\), \(\text{Cov}(\eta,\, y) = \text{Cov}(x,\, \eta) = 0\), so that:

`$$\hat{\beta} = \frac{\text{Cov}(x,\, y)}{\text{Var}(x) + \text{Var}(\eta)}$$`

---

### 1. Main sources of bias
#### 1.4. Measurement error

<ul>
<li>And this phenomenon can easily be shown <b>mathematically:</b></li>
<ul>
<li>The extra term in the denominator puts <b>downward pressure</b> on our estimate</li>
<li>And the bias is <b>increasing</b> in the amplitude of the <b>measurement error</b></li>
</ul>
</ul>

`$$\hat{\beta} = \frac{\text{Cov}(\tilde{x},\, y)}{\text{Var}(\tilde{x})}$$`

<p style = "margin-bottom:.8cm;"></p>

`$$\hat{\beta} = \frac{\text{Cov}(x + \eta,\, y)}{\text{Var}(x + \eta)}$$`

<p style = "margin-bottom:.8cm;"></p>

`$$\hat{\beta} = \frac{\text{Cov}(x,\, y) + \text{Cov}(\eta,\, y)}{\text{Var}(x) + \text{Var}(\eta) + 2\text{Cov}(x,\, \eta)}$$`

<p style = "margin-bottom:.8cm;"></p>

Since the noise \(\eta\) is independent of \(x\) and \(\varepsilon\), \(\text{Cov}(\eta,\, y) = \text{Cov}(x,\, \eta) = 0\), so that:

`$$\hat{\beta} = \frac{\text{Cov}(x,\, y)}{\text{Var}(x) + \color{SkyBlue}{\text{Var}(\eta)}}$$`

---

### 1. Main sources of bias
#### 1.5. Simultaneity

<ul>
<li><b>So far</b> we have considered relationships whose <b>directions</b> were quite <b>unambiguous</b></li>
<ul>
<li>Education ➜ Earnings, and not the opposite</li>
<li>High-school grades ➜ Job acceptance, and not the opposite</li>
</ul>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

<center><i>But now consider the relationship between <b>crime rate and police coverage</b> intensity</i></center>

<p style = "margin-bottom:1.25cm;"></p>

<ul>
<li><b>What is the direction</b> of the relationship?</li>
<ul>
<li>It's likely that more crime would cause a positive response in police activity</li>
<li>But also that police activity would deter crime</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>There is no easy solution to this problem apart from:</li>
<ul>
<li>Working out a <b>theoretical model</b> that sorts out this issue beforehand</li>
<li>Or <b>designing an RCT</b> that cuts one of the two channels</li>
</ul>
</ul>

---

<h3>Overview: Causality</h3>

<p style = "margin-bottom:4.25cm;"></p>

.pull-left[
<ul style = "margin-left:1.5cm;list-style: none">
<li><b>1. Main sources of bias ✔</b></li>
<ul style = "list-style: none">
<li>1.1. Omitted variables</li>
<li>1.2. Functional form</li>
<li>1.3. Selection bias</li>
<li>1.4. Measurement error</li>
<li>1.5. Simultaneity</li>
</ul>
</ul>
]

.pull-right[
<ul style = "margin-left:-1cm;list-style: none">
<li><b>2. Randomized control trials</b></li>
<ul style = "list-style: none">
<li>2.1. Introduction to RCTs</li>
<li>2.2. Types of randomization</li>
<li>2.3. Multiple testing</li>
</ul>
</ul>
<p style = "margin-bottom:.65cm;"></p>
<ul style = "margin-left:-1cm;list-style: none"><li><b>3. Wrap up!</b></li></ul>
]

---
<h3>Overview: Causality</h3>

<p style = "margin-bottom:4.25cm;"></p>

.pull-left[
<ul style = "margin-left:1.5cm;list-style: none">
<li><b>1. Main sources of bias ✔</b></li>
<ul style = "list-style: none">
<li>1.1. Omitted variables</li>
<li>1.2. Functional form</li>
<li>1.3. Selection bias</li>
<li>1.4. Measurement error</li>
<li>1.5. Simultaneity</li>
</ul>
</ul>
]

.pull-right[
<ul style = "margin-left:-1cm;list-style: none">
<li><b>2. Randomized control trials</b></li>
<ul style = "list-style: none">
<li>2.1. Introduction to RCTs</li>
<li>2.2. Types of randomization</li>
<li>2.3. Multiple testing</li>
</ul>
</ul>
]

---

### 2. Randomized control trials
#### 2.1. Introduction to RCTs

<ul>
<li>A Randomized Controlled Trial (RCT) is a type of <b>experiment</b> in which the thing we want to know the impact of (called the treatment) is <b>randomly allocated</b> in the population</li>
<ul>
<li>It is a way to obtain causality from randomness</li>
</ul>
</ul>

--

<p style = "margin-bottom:.85cm;"></p>

<ul>
<li>RCTs are very powerful tools to <b>sort out issues of:</b></li>
<ul>
<li>Omitted variables</li>
<li>Selection bias</li>
<li>Simultaneity</li>
</ul>
</ul>

--

<p style = "margin-bottom:.85cm;"></p>

<ul>
<li>This method is particularly used to <b>identify causal relationships</b> in:</li>
<ul>
<li>Medicine</li>
<li>Psychology</li>
<li>Economics</li>
<li>...</li>
</ul>
</ul>

<p style = "margin-bottom:.75cm;"></p>

--

<center><i><b>But how does randomness help us obtain causality?</b></i></center>

---

### 2. Randomized control trials
#### 2.1. Introduction to RCTs

<ul>
<li>Consider estimating the <b>effect of vitamin</b> supplement intake<b> on health</b></li>
<ul>
<li>Comparing health outcomes of vitamin <b>consumers vs. non-consumers</b>, the effect <b>won't be causal</b></li>
<li>Vitamin consumers might be <b>richer</b> and <b>healthier in general</b>, for reasons other than vitamin intake</li>
</ul>
</ul>

--

<ul>
<li><b>Randomization</b> allows us to <b>solve</b> this selection <b>bias</b></li>
<ul>
<li>If you form two groups at random, they would have the <b>same characteristics</b> on expectation</li>
<li>And thus they would be perfectly <b>comparable</b></li>
</ul>
</ul>

--

Take for instance the `asec_2020.csv` dataset we've been working with:

```r
asec_2020 %>% 
  summarise(Earnings = mean(Earnings),
            Hours = mean(Hours),
            Black = mean(Race == "Black"),
            Asian = mean(Race == "Asian"),
            Other = mean(Race == "Other"),
            Female = mean(Sex == "Female"))
```

```
##   Earnings    Hours     Black     Asian      Other    Female
## 1 62132.37 39.54742 0.1062391 0.0703805 0.03764611 0.4809749
```

---

### 2. Randomized control trials
#### 2.1. Introduction to RCTs

<ul>
<li>Let's compare the <b>average characteristics</b> for two <b>randomly selected groups:</b></li>
</ul>

--

```r
asec_2020 %>% 
* mutate(Group = ifelse(rnorm(n(), 0, 1) > 0, "Treatment", "Control")) %>%
  group_by(Group) %>% 
  summarise(n = n(),
            Earnings = mean(Earnings),
            Female = 100 * mean(Sex == "Female"),
            Black = 100 * mean(Race == "Black"),
            Asian = 100 * mean(Race == "Asian"),
            Other = 100 * mean(Race == "Other"),
            Hours = mean(Hours))
```

--

```
## # A tibble: 2 x 8
##   Group         n Earnings Female Black Asian Other Hours
##   <chr>     <int>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Control   32195   62234.   48.2  10.7  7.02  3.80  39.5
## 2 Treatment 32141   62030.   48.0  10.5  7.05  3.73  39.6
```

---
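### 2. Randomized control trials
#### 2.1. Introduction to RCTs

<ul>
<li>We can also <b>test</b> such differences formally <i>(a sketch; with a genuinely random split, a t-test should typically fail to reject equal means)</i></li>
</ul>

```r
# Draw random groups, then test whether mean earnings differ between them
balance <- asec_2020 %>% 
  mutate(Group = ifelse(rnorm(n(), 0, 1) > 0, "Treatment", "Control"))

# A large p-value means no detectable difference in means across groups
t.test(Earnings ~ Group, data = balance)
```

---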
### 2. Randomized control trials
#### 2.1. Introduction to RCTs

<ul>
<li>Their average <b>characteristics</b> are very close!</li>
<ul>
<li><b>On expectation</b> their average characteristics are <b>the same</b></li>
</ul>
</ul>

--

<ul>
<li>And just as the two randomly selected populations are comparable in terms of their observable characteristics</li>
<ul>
<li>On expectation they are also <b>comparable</b> in terms of their <b>unobservable characteristics!</b></li>
<li>Randomization, if properly conducted, thus solves the problem of omitted variable bias</li>
</ul>
</ul>

--

<center><h4><i>If we assign a treatment to Group 1, Group 2 would then be a valid counterfactual to estimate a causal effect!</i></h4></center>

--

<ul>
<li>But <b>RCTs are not immune</b> to every problem:</li>
<ul>
<li>If individuals <b>self-select</b> into participating in the experiment, there would be a selection bias</li>
<li>Even without self-selection, if the population among which the treatment is randomized is not <b>representative</b>, there is a problem of external validity</li>
<li>For the RCT to work, individuals should <b>comply</b> with the treatment allocation</li>
<li>The <b>sample</b> must be <b>sufficiently large</b> for the average characteristics across groups to be close enough to their expected values</li>
<li>...</li>
</ul>
</ul>

---

### 2. Randomized control trials
#### 2.2. Types of randomization

<ul>
<li>To some extent there are ways to deal with these problems</li>
<ul>
<li>Notably, we can <b>adjust the way the treatment is randomized</b></li>
</ul>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

<ul>
<li>For instance if we want to ensure that a characteristic is well balanced across the two groups, we can <b>randomize within categories of this variable</b></li>
<ul>
<li>We don't give the treatment randomly hoping that we'll obtain the same % of females in both groups</li>
<li>We assign the treatment randomly among females and among males separately</li>
<li>This is called <b>randomizing by block</b></li>
<li><i>Note that this only works with observable characteristics!</i></li>
</ul>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

```r
asec_2020 %>% 
* group_by(Sex) %>% # Randomize treatment by sex
  mutate(Group = ifelse(rnorm(n(), 0, 1) > 0, 1, 0)) %>%
  ungroup() %>% 
  group_by(Group) %>% 
  summarise(...)
```

---

### 2. Randomized control trials
#### 2.2. Types of randomization

<ul>
<li>What if you want to estimate the impact of <b>calorie intake</b> at the <b>10am break</b> on <b>pupils' grades</b></li>
<ol>
<li>Find a school to run your experiment</li>
<li>Take the list of pupils and randomly allocate them to treatment and control groups</li>
<li>Provide families with treated pupils a snack for the 10am break every school day</li>
<li>Do that for a few months and collect the data on the grades of both groups</li>
<li>Compute the difference in average grades between the treated and the control group</li>
</ol>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>If the 10am snack has a <b>positive effect:</b></li>
<ul>
<li>This causal identification framework should ensure the correct estimation of that effect</li>
<li>Right?</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>But what about <b>non-compliance?</b></li>
<ul>
<li>It is likely that during the 10am break, treated children share their snack with their untreated friends</li>
<li>How would that <b>affect our estimation?</b></li>
</ul>
</ul>

---
### 2. Randomized control trials
#### 2.2. Types of randomization

<ul>
<li>While the observed effect would be positive under full compliance, <b>under treatment sharing:</b></li>
<ul>
<li></li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-37-1.png" width="31%" style="display: block; margin: auto;" />

---

### 2. Randomized control trials
#### 2.2. Types of randomization

<ul>
<li>While the observed effect would be positive under full compliance, <b>under treatment sharing:</b></li>
<ul>
<li><b>Treated children</b> would have <b>lower grades</b> because they would benefit from fewer calories</li>
<li></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-38-1.png" width="31%" style="display: block; margin: auto;" />

---

### 2. Randomized control trials
#### 2.2. Types of randomization

<ul>
<li>While the observed effect would be positive under full compliance, <b>under treatment sharing:</b></li>
<ul>
<li><b>Treated children</b> would have <b>lower grades</b> because they would benefit from fewer calories</li>
<li><b>Untreated children</b> would have <b>higher grades</b> because they would benefit from more calories</li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-39-1.png" width="31%" style="display: block; margin: auto;" />

---

### 2. Randomized control trials
#### 2.2. Types of randomization

<ul>
<li>Thus <b>non-compliance</b> can bias our estimation</li>
<ul>
<li>There would be a <b>downward bias</b></li>
<li>And our estimation <b>wouldn't be causal</b></li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>One solution to that problem is to <b>randomize by cluster</b></li>
<ul>
<li>Children cannot share their snack with children from other schools</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>We must <b>treat at the school level</b> instead of the child level</li>
<ul>
<li>A treated school is a school where some or all children are treated</li>
<li>An untreated school is a school where no child is treated</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<center>
<i>Beware that in terms of inference, computing standard errors the usual way<br>
while the treatment is assigned at a broader observational level than the outcome<br>
would give misleadingly low standard errors, which would need to be corrected</i>
</center>

---
### 2. Randomized control trials
#### 2.3. Multiple testing

<ul>
<li>Another inference issue that RCTs can be subject to is <b>multiple testing</b></li>
<ul>
<li>If you conduct an RCT you might be tempted to exploit the causal framework to test a myriad of effects</li>
</ul>
</ul>

<p style = "margin-bottom:1.2cm;"></p>

--

<ul>
<li>You randomize your treatment and you compare the averages of many outcomes between treated and untreated individuals</li>
<ul>
<li>You would be tempted to <b>conclude</b> that there is a <b>significant effect</b> for <b>every variable</b> whose corresponding <b>p-value < .05</b></li>
<li>But <b>you cannot do that!</b></li>
</ul>
</ul>

<p style = "margin-bottom:1.2cm;"></p>

--

<ul>
<li>The probability of getting a p-value lower than .05 just by chance is indeed 5% for a single test</li>
<ul>
<li>But if you run <b>multiple tests</b> in a row, the <b>probability</b> of getting a <b>p-value lower than .05</b> for at least one true null effect among these tests is <b>greater than 5%</b></li>
<li>The greater the number of tests, the higher the probability of getting a significant result just by chance: with 10 independent tests, it is already \(1 - 0.95^{10} \approx 40\%\)</li>
</ul>
</ul>

<p style = "margin-bottom:1.2cm;"></p>

--

<center><h4>This is what we call <i>multiple testing</i></h4></center>

---

### 2. Randomized control trials
#### 2.3. Multiple testing

<img src="slides_files/figure-html/unnamed-chunk-40-1.png" width="75%" style="display: block; margin: auto;" />

---

### 2. Randomized control trials
#### 2.3. Multiple testing

* There are many ways to correct for multiple testing

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>The simplest one is called the <b>Bonferroni</b> correction</li>
<ul>
<li>It consists in <b>multiplying each p-value by the number of tests</b></li>
<li>But it also leads to a large <b>loss of power</b> (the probability of finding an effect when there is indeed an effect decreases a lot)</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul>
<li>There are more sophisticated ways to deal with the problem, which can be categorized into two approaches</li>
<ul>
<li><b>Family Wise Error Rate</b>: Control the probability of rejecting at least one true null hypothesis</li>
<li><b>False Discovery Rate</b>: Control the share of true null hypotheses among the rejected hypotheses</li>
</ul>
</ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<center><i>➜ We won't cover these methods in this course, but keep the multiple testing issue in mind when you encounter a long series of statistical tests</i></center>

---

<h3>Overview: Causality</h3>

<p style = "margin-bottom:4.25cm;"></p>

.pull-left[
<ul style = "margin-left:1.5cm;list-style: none">
<li><b>1. Main sources of bias ✔</b></li>
<ul style = "list-style: none">
<li>1.1. Omitted variables</li>
<li>1.2. Functional form</li>
<li>1.3. Selection bias</li>
<li>1.4. Measurement error</li>
<li>1.5. Simultaneity</li>
</ul>
</ul>
]

.pull-right[
<ul style = "margin-left:-1cm;list-style: none">
<li><b>2. Randomized control trials ✔</b></li>
<ul style = "list-style: none">
<li>2.1. Introduction to RCTs</li>
<li>2.2. Types of randomization</li>
<li>2.3. Multiple testing</li>
</ul>
</ul>
<p style = "margin-bottom:.65cm;"></p>
<ul style = "margin-left:-1cm;list-style: none"><li><b>3. Wrap up!</b></li></ul>
]

---
### 3. Wrap up!
#### Omitted variable bias

<ul>
<li>If a third <b>variable</b> is correlated with both \(x\) and \(y\), it would <b>bias the relationship</b></li>
<ul>
<li>We must then <b>control</b> for such variables</li>
<li>And if we can't, we must acknowledge that our estimate is not causal with <i><b>'ceteris paribus'</b></i></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-41-1.png" width="75%" style="display: block; margin: auto;" />

---

### 3. Wrap up!
#### Functional form

<ul>
<li>Not capturing the <b>right functional form</b> might also lead to biased estimations:</li>
<ul>
<li>Polynomial order, interactions, logs, and discretization matter</li>
<li><b>Visualizing the relationship</b> is key</li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-42-1.png" width="75%" style="display: block; margin: auto;" />

---

### 3. Wrap up!
#### Selection bias

<ul>
<li><b>Self-selection</b> is also a common threat to causality</li>
</ul>

<p style = "margin-bottom:.75cm;"></p>

<ul>
<li>What is the impact of going to a better neighborhood on your children's outcomes?</li>
<ul>
<li>We cannot just regress children's outcomes on a mobility dummy</li>
<li>Individuals who move may be different from those who stay: <b>self-selection issue</b></li>
<li>Here <b>the outcomes of those who stayed are different from the outcomes those who moved would have had, if they had stayed</b></li>
</ul>
</ul>

--

<p style = "margin-bottom:1.25cm;"></p>

#### Simultaneity

<ul>
<li>Consider the relationship between <b>crime</b> rate and <b>police coverage</b> intensity</li>
</ul>

<p style = "margin-bottom:.75cm;"></p>

<ul>
<li>What is the <b>direction of the relationship?</b></li>
<ul>
<li>We cannot just regress the crime rate on police intensity</li>
<li>It's likely that more crime would cause a positive response in police activity</li>
<li>And also that police activity would deter crime</li>
</ul>
</ul>

---

### 3. Wrap up!
#### Measurement error

<ul>
<li><b>Measurement error</b> in the independent variable also induces a bias</li>
<ul>
<li>The resulting estimation would mechanically be <b>downward biased</b></li>
<li>The <b>noisier</b> the measure, the <b>larger the bias</b></li>
</ul>
</ul>

<img src="slides_files/figure-html/unnamed-chunk-43-1.png" width="75%" style="display: block; margin: auto;" />

---

### 3. Wrap up!
#### Randomized Controlled Trials

<ul>
<li>A Randomized Controlled Trial (RCT) is a type of experiment in which the thing we want to know the impact of (called the treatment) is <b>randomly allocated</b> in the population</li>
<ul>
<li>The two <b>groups</b> would then have the same characteristics on expectation, and would be <b>comparable</b></li>
<li>It is a way to obtain <b>causality</b> from randomness</li>
</ul>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

<ul>
<li>RCTs are very <b>powerful tools</b> to sort out issues of:</li>
<ul>
<li>Omitted variables</li>
<li>Selection bias</li>
<li>Simultaneity</li>
</ul>
</ul>

--

<p style = "margin-bottom:1cm;"></p>

<ul>
<li>But RCTs are <b>not immune</b> to every problem:</li>
<ul>
<li>The sample must be representative and large enough</li>
<li>Participants should comply with their treatment status</li>
<li>Independent variables must not be noisy measures of the variable of interest</li>
<li>...</li>
</ul>
</ul>
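
---

### 3. Wrap up!
#### Multiple testing

<ul>
<li>A quick simulation illustrates the issue <i>(a sketch: 20 tests on pure noise, so any significant result is a false positive by construction)</i></li>
</ul>

```r
set.seed(1)

# Run 20 two-sample t-tests on pure noise and collect the p-values
pvalues <- replicate(20, t.test(rnorm(100), rnorm(100))$p.value)

# Chances are some p-values fall below .05 just by chance...
sum(pvalues < .05)

# ... while a Bonferroni correction (multiplying them by the
# number of tests) guards against such false positives
sum(p.adjust(pvalues, method = "bonferroni") < .05)
```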