class: center, middle, inverse, title-slide # How to conduct a research project ## Lecture 16 ###
Louis SIRUGUE ### CPES 2 - Spring 2023 --- <style> .left-column {width: 65%;} .right-column {width: 35%;} </style> <center><h3> Welcome to the second semester of this course!</h3></center> -- .pull-left[ <ul> <li>Dedicated to an empirical <b>research project</b>:</li> <ul> <li><b>By pairs</b>, apply programming & econometric tools from S1 to your own research question</li> <li>Find an example of what is expected <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">here</a></li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li>In Part I: <b>formal lectures</b>:</li> <ul> <li>Today: The steps of the research process</li> <li>The next two lectures: Refreshers from S1</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <table class="table table-hover table-condensed table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto; margin-left: auto; margin-right: auto;"> <caption><b>Part 1: Guidelines and refreshers</b></caption> <tbody> <tr> <td style="text-align:left;"> Lecture 1 </td> <td style="text-align:left;"> How to conduct a research project </td> </tr> <tr> <td style="text-align:left;"> Lecture 2 </td> <td style="text-align:left;"> Refresher: R Programming </td> </tr> <tr> <td style="text-align:left;"> Lecture 3 </td> <td style="text-align:left;"> Refresher: Econometrics </td> </tr> </tbody> </table> <p style = "margin-bottom:1cm;"></p> <ul> <li>In Part II: <b>follow-ups</b> and <b>reports/presentations</b></li> </ul> ] -- .pull-right[ <p style = "margin-bottom:.8cm;"></p> <table class="table table-hover table-condensed table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto; margin-left: auto; margin-right: auto;"> <caption><b>Part 2: Research project</b></caption> <tbody> <tr> <td style="text-align:left;font-weight: bold;font-style: italic;"> Lecture 4 </td> <td style="text-align:left;font-weight: bold;font-style: italic;"> Presentation of 
your project </td> </tr> <tr> <td style="text-align:left;"> Lecture 5-6 </td> <td style="text-align:left;"> Follow-up: Data cleaning </td> </tr> <tr> <td style="text-align:left;"> Lecture 7 </td> <td style="text-align:left;"> Follow-up: Descriptive statistics </td> </tr> <tr> <td style="text-align:left;"> Lecture 8 </td> <td style="text-align:left;"> Follow-up: Visualizing the data </td> </tr> <tr> <td style="text-align:left;"> Lecture 9 </td> <td style="text-align:left;"> Follow-up: Regression analysis </td> </tr> <tr> <td style="text-align:left;font-weight: bold;font-style: italic;"> Lecture 10 </td> <td style="text-align:left;font-weight: bold;font-style: italic;"> Midterm report feedback </td> </tr> <tr> <td style="text-align:left;"> Lecture 11 </td> <td style="text-align:left;"> Follow-up: Causality assessment </td> </tr> <tr> <td style="text-align:left;"> Lecture 12 </td> <td style="text-align:left;"> Follow-up: Robustness </td> </tr> <tr> <td style="text-align:left;"> Lecture 13 </td> <td style="text-align:left;"> Follow-up: Heterogeneity </td> </tr> <tr> <td style="text-align:left;"> Lecture 14 </td> <td style="text-align:left;"> Follow-up: Last tips </td> </tr> <tr> <td style="text-align:left;font-weight: bold;font-style: italic;"> Lecture 15 </td> <td style="text-align:left;font-weight: bold;font-style: italic;"> Final presentation </td> </tr> </tbody> </table> ] --- <h3>Today: How to conduct a research project</h3> -- <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Data description</b></li> <ul style = "list-style: none"> <li>2.1. Data cleaning</li> <li>2.2. Descriptive statistics</li> <li>2.3. 
Data visualization</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Analysis</b></li> <ul style = "list-style: none"> <li>3.1. Regression analysis</li> <li>3.2. Robustness</li> <li>3.3. Heterogeneity</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Today: How to conduct a research project</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> ] --- ### 1. Preliminary steps #### 1.1. Research question <ul> <li>The starting point of the research project is the <b>research question</b></li> <ul> <li>It is not easy to find a suitable research question, and not all questions are relevant</li> <li>Here are some guidelines to help you in the process</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>The question should aim to <b>explain rather than describe</b> a phenomenon</li> <ul> <li style="color:#9B0000;">Do football teams win more at home than away?</li> <li>This only calls for a descriptive statistic, not an explanation</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>The question should be <b>specific enough</b></li> <ul> <li style="color:#9B0000;">What are the reasons why football teams win more often at home than away?</li> <li>You won't be able to cover all the determinants</li> <li>Closed-ended questions are recommended (Yes/No, to what extent, ...)</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>It should be relatively <b>original</b> and <b>interesting to you!</b></li> <ul> <li>You're going to spend the whole semester on it</li> </ul> </ul> --- ### 1. Preliminary steps #### 1.1. 
Research question * Here is an example of a valid research question: <center><i><b><p style="color:#1E5128;">Do supporters help the home team win the match?</p></b></i></center> -- <p style = "margin-bottom:1cm;"></p> * This is the research question we will use as a running example to go through all the steps of the research process -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>It is:</li> <ul> <li>More about explanation than description</li> <li>Specific enough, not too broad</li> <li>Relatively original</li> <li>Relatively interesting with respect to the sports literature</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>And importantly there is <b>data available</b> to answer this question</li> <ul> <li>There's no point in having a good research question if you can't find data to answer it</li> <li>Usually, finding data comes after settling on a research question</li> <li>But given the time constraint you should look for data while thinking about your research question</li> </ul> </ul> --- ### 1. Preliminary steps #### 1.2. 
Finding data * **Open** access online **data** is increasingly common <p style = "margin-bottom:1cm;"></p> <ul> <li>Academic journals ask authors to share their data more and more systematically</li> <ul> <li>The <b>academic literature</b> is a great source of data</li> <li>Especially RCTs as they usually include many variables</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>At these two links you'll find an incredibly rich set of academic datasets</li> <ul> <li><a href="https://www.openicpsr.org/openicpsr/search/studies">openICPSR</a></li> <li><a href="https://dataverse.harvard.edu/dataverse/harvard?q=&fq0=subject_ss%3A%22Social%20Sciences%22&types=dataverses%3Adatasets&sort=dateSort&order=desc">Harvard Dataverse</a></li> <li>Browse the available datasets and check the corresponding articles, this may give you inspiration</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>The <a href="https://www.aeaweb.org/resources/data">American Economic Association</a> also gathered a lot of data sources</li> <ul> <li>Mostly from national statistical institutes</li> <li>Here you may not find so much individual level data but rather local data</li> <li>If relevant it is also possible to combine different data sources</li> </ul> </ul> --- ### 1. Preliminary steps #### 1.2. Finding data * Sometimes a simple Google search can be sufficient: <center><img src = "fbref.png" width = "800"/></center> --- ### 1. Preliminary steps #### 1.2. Finding data * At [fbref.com](https://fbref.com/) data on scores and attendance of football matches are available: <center><img src = "ligue1.png" width = "825"/></center> --- ### 1. Preliminary steps #### 1.2. 
Finding data <ul> <li>This data is appropriate for the exercise for two reasons</li> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>It contains the <b>necessary variables</b> to study the research question</li> <ul> <li>For each match the score and who played home and away</li> <li>The number of people in the stadium</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>And <b>additional variables</b> to use for robustness and heterogeneity analysis</li> <ul> <li>The time of day and the day of the week of the match</li> <li>The league/season (data available for several leagues/seasons)</li> <li>(We'll come back to that point in a few slides)</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>So we now have a valid research question and appropriate data to work with</li> <ul> <li>But there is one last preliminary step</li> <li>We need to know where the research idea stands with respect to the existing literature</li> </ul> </ul> --- ### 1. Preliminary steps #### 1.3. 
Literature review <ul> <li>A good research project should be relevant with respect to the academic literature on the issue</li> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>You should find academic articles to get a sense of where your analysis will stand in the literature</li> <ul> <li>What do we <b>already know</b> on the topic?</li> <li>What <b>remains to be known</b>?</li> <li>What is your <b>contribution</b> to the literature?</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>You should refer to articles that are published in (peer reviewed) academic journals</li> <ul> <li>You can find such articles on <a href="https://scholar.google.com/">Google scholar</a></li> <li>And via <a href="https://catalogue.explore.psl.eu/">PSL explore</a></li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>The articles you should start by looking for are:</li> <ul> <li><b>Reviews/meta analyses</b> that will have a lot of references that may be relevant</li> <li>Articles that are <b>as close as possible</b> to what you intend to do</li> </ul> </ul> --- ### 1. Preliminary steps #### 1.3. Literature review <p style = "margin-bottom:1.5cm;"></p> <center><img src = "literature.png" width = "825"/></center> --- ### 1. Preliminary steps #### 1.3. Literature review <center><img src = "psl.png" width = "675"/></center> --- ### 1. Preliminary steps #### 1.3. Literature review <ul> <li>Almost every academic article includes some review of the literature</li> <ul> <li>You can go through it to find some inspiration and references</li> </ul> </ul> <p style = "margin-bottom:1.25cm;"></p> <center><img src = "in_text.png" width = "675"/></center> --- ### 1. Preliminary steps #### 1.3. 
Literature review <ul> <li>In an academic paper, every article mentioned in the text can be found in the <i>References</i> section at the end</li> </ul> <p style = "margin-bottom:.75cm;"></p> <center><img src = "refs.png" width = "675"/></center> -- <p style = "margin-bottom:.5cm;"></p> <center><i><b>➜ Citing articles this way is also something you will have to do</b></i></center> --- ### 1. Preliminary steps #### 1.3. Literature review * The way you should refer to academic articles in the text is codified: <p style = "margin-bottom:1cm;"></p> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Referring to an academic article in-text</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> One author </th> <th style="text-align:center;"> Two authors </th> <th style="text-align:center;"> More authors </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Within the sentence </td> <td style="text-align:center;"> Smith (2012) showed that ... </td> <td style="text-align:center;"> Smith and Watson (2012) showed that ... </td> <td style="text-align:center;"> Smith et al. (2012) showed that ... </td> </tr> <tr> <td style="text-align:left;"> Outside the sentence </td> <td style="text-align:center;"> It has been shown that ... (Smith, 2012) </td> <td style="text-align:center;"> It has been shown that ... (Smith and Watson, 2012) </td> <td style="text-align:center;"> It has been shown that ... 
(Smith et al., 2012) </td> </tr> </tbody> </table> <p style = "margin-bottom:1.25cm;"></p> * Conventionally (in Economics) authors are listed in alphabetical order of surname <ul> <li>The reference of every article you cite should be added in alphabetical order in a <i>References</i> section at the end</li> <ul> <li>How to write the reference in this last section is also codified</li> <li>Take a look at the research project example available <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">here</a> to see what it should look like</li> </ul> </ul> --- ### 1. Preliminary steps #### 1.3. Literature review * To find the proper reference of an article, click on the Cite button and copy-paste it into your *References* section -- .pull-left[ <center><b>On Google scholar</b></center> <p style = "margin-bottom:.5cm;"></p> <center><img src = "cite_gs.png" width = "400"/></center> <center><img src = "cite_gs2.png" width = "400"/></center> ] -- .pull-right[ <center><b>On PSL explore</b></center> <p style = "margin-bottom:.5cm;"></p> <center><img src = "cite_psl.png" width = "400"/></center> <center><img src = "cite_psl2.png" width = "400"/></center> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps ✔</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Data description</b></li> <ul style = "list-style: none"> <li>2.1. Data cleaning</li> <li>2.2. Descriptive statistics</li> <li>2.3. Data visualization</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Analysis</b></li> <ul style = "list-style: none"> <li>3.1. Regression analysis</li> <li>3.2. Robustness</li> <li>3.3. 
Heterogeneity</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps ✔</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Data description</b></li> <ul style = "list-style: none"> <li>2.1. Data cleaning</li> <li>2.2. Descriptive statistics</li> <li>2.3. Data visualization</li> </ul> </ul> ] --- ### 2. Data description #### 2.1. Data cleaning <ul> <li>The <b>first</b> thing to do with the <b>data</b> is to <b>clean</b> it</li> <ul> <li>You should open the data and take a close look at it to understand what's inside</li> </ul> </ul> --
```r
library(tidyverse)
library(knitr) # provides kable(), used for the tables in later slides

data_match <- read.csv("data_match.csv")
dim(data_match)
```
```
## [1] 4845   16
```
<ul> <li>The data contains 4845 rows and 16 variables</li> <ul> <li>Let's see what these variables are</li> </ul> </ul> --
```r
names(data_match)
```
```
##  [1] "Wk"           "Day"          "Date"         "Time"         "Home"
##  [6] "xG"           "Score"        "xG.1"         "Away"         "Attendance"
## [11] "Venue"        "Referee"      "Match.Report" "Notes"        "League"
## [16] "Season"
```
--- ### 2. Data description #### 2.1. Data cleaning
```r
str(data_match)
```
```
## 'data.frame':    4845 obs. of  16 variables:
##  $ Wk          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day         : chr  "Fri" "Sat" "Sat" "Sat" ...
##  $ Date        : chr  "2018-08-10" "2018-08-11" "2018-08-11" "2018-08-11" ...
##  $ Time        : chr  "20:45" "17:00" "20:00" "20:00" ...
##  $ Home        : chr  "Marseille" "Nantes" "Montpellier" "Lille" ...
##  $ xG          : num  2.8 1.6 2 1.5 2.5 1 1.3 1 0.2 2.8 ...
##  $ Score       : chr  "4-0" "1-3" "1-2" "3-1" ...
##  $ xG.1        : num  0.3 2.2 2 0.5 1.8 1.9 0.5 0.5 1.7 0.2 ...
##  $ Away        : chr  "Toulouse" "Monaco" "Dijon" "Rennes" ...
##  $ Attendance  : int  60756 32760 12765 25708 9534 26006 21421 48263 23079 47289 ...
##  $ Venue       : chr  "Orange Vélodrome" "Stade de la Beaujoire - Louis Fonteneau" "Stade de la Mosson" "Stade Pierre-Mauroy" ...
##  $ Referee     : chr  "Ruddy Buquet" "Jérôme Brisard" "Florent Batta" "Willy Delajod" ...
##  $ Match.Report: chr  "Match Report" "Match Report" "Match Report" "Match Report" ...
##  $ Notes       : chr  NA NA NA NA ...
##  $ League      : chr  "Ligue 1" "Ligue 1" "Ligue 1" "Ligue 1" ...
##  $ Season      : chr  "2018-2019" "2018-2019" "2018-2019" "2018-2019" ...
```
--- ### 2. Data description #### 2.1. Data cleaning <ul> <li>The dataset contains the following 16 variables:</li> <ul> <li><b>Wk:</b> Season week when the match took place</li> <li><b>Day:</b> Week day when the match took place</li> <li><b>Date:</b> Date of the match</li> <li><b>Time:</b> Time of the match</li> <li><b>Home:</b> Team that played home</li> <li><b>xG:</b> Expected number of goals for home team</li> <li><b>Score:</b> Score of the match</li> <li><b>xG.1:</b> Expected number of goals for away team</li> <li><b>Away:</b> Team that played away</li> <li><b>Attendance:</b> Number of supporters in the stadium</li> <li><b>Venue:</b> Name of the stadium where the match took place</li> <li><b>Referee:</b> Name of the referee</li> <li><b>Match.Report:</b> Link to an online report of the match</li> <li><b>Notes:</b> Miscellaneous information on the match</li> <li><b>League:</b> Name of the league</li> <li><b>Season:</b> Season from 2018-2019 to 2020-2021</li> </ul> </ul> --- ### 2. Data description #### 2.1. 
Data cleaning * We can keep only the relevant variables and look at the first rows of the data --
```r
data_match <- data_match %>%
  select(Day, Date, Time, Home, Score, Away, Attendance, League, Season)

kable(head(data_match, n = 5), caption = "Outlook of the data:")
```
-- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Outlook of the data:</caption> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:left;"> Home </th> <th style="text-align:left;"> Score </th> <th style="text-align:left;"> Away </th> <th style="text-align:right;"> Attendance </th> <th style="text-align:left;"> League </th> <th style="text-align:left;"> Season </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> 2018-08-10 </td> <td style="text-align:left;"> 20:45 </td> <td style="text-align:left;"> Marseille </td> <td style="text-align:left;"> 4-0 </td> <td style="text-align:left;"> Toulouse </td> <td style="text-align:right;"> 60756 </td> <td style="text-align:left;"> Ligue 1 </td> <td style="text-align:left;"> 2018-2019 </td> </tr> <tr> <td style="text-align:left;"> Sat </td> <td style="text-align:left;"> 2018-08-11 </td> <td style="text-align:left;"> 17:00 </td> <td style="text-align:left;"> Nantes </td> <td style="text-align:left;"> 1-3 </td> <td style="text-align:left;"> Monaco </td> <td style="text-align:right;"> 32760 </td> <td style="text-align:left;"> Ligue 1 </td> <td style="text-align:left;"> 2018-2019 </td> </tr> <tr> <td style="text-align:left;"> Sat </td> <td style="text-align:left;"> 2018-08-11 </td> <td style="text-align:left;"> 20:00 </td> <td style="text-align:left;"> Montpellier </td> <td style="text-align:left;"> 1-2 </td> <td style="text-align:left;"> Dijon </td> <td style="text-align:right;"> 12765 </td> <td style="text-align:left;"> Ligue 1 
</td> <td style="text-align:left;"> 2018-2019 </td> </tr> <tr> <td style="text-align:left;"> Sat </td> <td style="text-align:left;"> 2018-08-11 </td> <td style="text-align:left;"> 20:00 </td> <td style="text-align:left;"> Lille </td> <td style="text-align:left;"> 3-1 </td> <td style="text-align:left;"> Rennes </td> <td style="text-align:right;"> 25708 </td> <td style="text-align:left;"> Ligue 1 </td> <td style="text-align:left;"> 2018-2019 </td> </tr> <tr> <td style="text-align:left;"> Sat </td> <td style="text-align:left;"> 2018-08-11 </td> <td style="text-align:left;"> 20:00 </td> <td style="text-align:left;"> Angers </td> <td style="text-align:left;"> 3-4 </td> <td style="text-align:left;"> Nîmes </td> <td style="text-align:right;"> 9534 </td> <td style="text-align:left;"> Ligue 1 </td> <td style="text-align:left;"> 2018-2019 </td> </tr> </tbody> </table> --- ### 2. Data description #### 2.1. Data cleaning <ul> <li>Data cleaning involves:</li> <ul> <li><b>Recoding variables</b> in a practical way (it may imply <b>creating new variables</b>)</li> <li><b>Removing observations</b> that are not relevant, if any, typically <b>missing values</b></li> <li>Potentially <b>joining</b> data, and <b>pivoting</b> variables from wide to long or conversely</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:left;"> Home </th> <th style="text-align:left;"> Score </th> <th style="text-align:left;"> Away </th> <th style="text-align:right;"> Attendance </th> <th style="text-align:left;"> League </th> <th style="text-align:left;"> Season </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> 2018-08-10 </td> <td 
style="text-align:left;"> 20:45 </td> <td style="text-align:left;"> Marseille </td> <td style="text-align:left;"> 4-0 </td> <td style="text-align:left;"> Toulouse </td> <td style="text-align:right;"> 60756 </td> <td style="text-align:left;"> Ligue 1 </td> <td style="text-align:left;"> 2018-2019 </td> </tr> </tbody> </table> <p style = "margin-bottom:1cm;"></p> <ul> <li>Here is what we can do already:</li> <ul> <li>Divide the score variable into two variables, for home and away</li> <li>Create a variable indicating who won</li> <li>Recode the Time variable as numeric</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> * Some datasets are cleaner than others, but there's always some data cleaning to do --- ### 2. Data description #### 2.1. Data cleaning * Recoding variables
```r
data_match <- data_match %>%
  # Separate the home and away score into 2 variables
  separate(Score, c("Home", "Away"), "-") %>%
  # Convert these variables as numeric
  mutate(Home = as.numeric(Home),
         Away = as.numeric(Away),
         # Generate a variable for the outcome of the match depending on who scored the most
         Winner = case_when(Home > Away ~ "Home",
                            Home == Away ~ "Draw",
                            Home < Away ~ "Away"),
         # Recode the Time variable as a continuous variable
         Time = as.numeric(substr(Time, 1, 2)) + as.numeric(substr(Time, 4, 5)) / 60)
```
--- ### 2. Data description #### 2.1. Data cleaning <ul> <li>Let's take a look at the cleaned data</li> <ul> <li>(Pay attention to rows 11, 22, ...)</li> </ul> </ul> --
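A minimal sketch of the recoding step above, run on a made-up three-row data frame (`toy`, `toy_clean` and their values are hypothetical and not taken from the fbref data). The second row mimics a blank separator row:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two matches and one blank row in between
toy <- data.frame(Time  = c("20:45", NA, "17:00"),
                  Score = c("4-0", NA, "1-3"))

toy_clean <- toy %>%
  # Split the score into home and away goals
  separate(Score, c("Home", "Away"), "-") %>%
  mutate(Home = as.numeric(Home),
         Away = as.numeric(Away),
         # Outcome of the match depending on who scored the most
         Winner = case_when(Home > Away ~ "Home",
                            Home == Away ~ "Draw",
                            Home < Away ~ "Away"),
         # Kick-off time in decimal hours (e.g. 20:45 becomes 20.75)
         Time = as.numeric(substr(Time, 1, 2)) + as.numeric(substr(Time, 4, 5)) / 60)

toy_clean
```

The blank row comes out with `NA` in every recoded column, which is the kind of pattern to look for when scanning the cleaned data.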
--- ### 2. Data description #### 2.1. Data cleaning <ul> <li><b>Between each week</b> of competition there is an <b>empty line</b> with missing values</li> <ul> <li>These rows are not actual observations so we should <b>delete</b> them</li> </ul> </ul>
```r
data_match <- data_match %>%
  filter(!is.na(Home))
```
-- <p style = "margin-bottom:1.5cm;"></p> * But we still need to **check** for actual **missing values**
```r
data_match %>%
  summarise_all(~sum(is.na(.)))
```
```
##   Day Date Time Attendance Home Away League Season Winner
## 1   0    0    0       1670    0    0      0      0      0
```
-- <p style = "margin-bottom:1.5cm;"></p> <ul> <li>There are no missing values except in the <b>Attendance</b> variable, which has <b>many NAs</b></li> <ul> <li>This is suspicious, so we should <b>investigate</b> further</li> </ul> </ul> --- ### 2. Data description #### 2.1. Data cleaning * We must check the distribution of the variable:
```r
summary(data_match$Attendance)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##      13   16158   27717   31790   45014   93426    1670
```
-- <p style = "margin-bottom:1.25cm;"></p> <ul> <li>Except for NAs, the distribution seems fine</li> <ul> <li>But the smallest recorded attendance is 13, whereas the COVID-19 pandemic forced many matches to be played with no spectators at all</li> <li>These <b>NAs</b> for attendance may actually <b>mean 0 attendance</b></li> <li> This is particularly plausible given that there is no other variable with missing values in the data</li> </ul> </ul> -- <p style = "margin-bottom:1.25cm;"></p> <center><i>➜ To check this hypothesis we can plot the evolution of the monthly attendance, <b>replacing NAs with 0s</b></i></center> <center><i>(code in the <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">research project example</a>)</i></center> --- ### 2. Data description #### 2.1. 
Data cleaning .left-column[ <img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .right-column[ <p style = "margin-bottom:1cm;"></p> <ul> <li>This graph confirms the hypothesis</li> <ul> <li>There is a drop to 0 attendance (the NAs) due to the pandemic right after March 2020</li> <li>Missing values for the Attendance variable should indeed be recoded as 0</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p>
```r
data_match <- data_match %>%
  mutate(Attendance = ifelse(is.na(Attendance), 0, Attendance))
```
] --- ### 2. Data description #### 2.2. Descriptive statistics <ul> <li>Now that the data is clean, we should describe it with <b>relevant statistics</b></li> <ul> <li>For <b>categorical variables</b>: Number of observations per category</li> <li>For <b>continuous variables</b>: Summarizing the distribution</li> </ul> </ul> --
```r
data_match %>%
  group_by(Winner) %>%
  # nrow(.) refers to the full piped data, so Pct is the share of all matches
  summarise(N = n(), Pct = 100 * (n() / nrow(.))) %>%
  # The caption must be passed by name: the second positional argument of kable() is `format`
  kable(., caption = "Distribution of match outcomes")
```
<table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Distribution of match outcomes</caption> <thead> <tr> <th style="text-align:left;"> Winner </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> Pct </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Away </td> <td style="text-align:right;"> 1343 </td> <td style="text-align:right;"> 31.70 </td> </tr> <tr> <td style="text-align:left;"> Draw </td> <td style="text-align:right;"> 1067 </td> <td style="text-align:right;"> 25.18 </td> </tr> <tr> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1827 </td> <td style="text-align:right;"> 43.12 </td> </tr> </tbody> </table> --- ### 2. Data description #### 2.2. 
Descriptive statistics <ul> <li>The number of observations per season/league is also interesting to know:</li> <ul> <li><i>(code in the <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">research project example</a>)</i> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Number of matches:</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 2018-2019 </th> <th style="text-align:right;"> 2019-2020 </th> <th style="text-align:right;"> 2020-2021 </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Bundesliga </td> <td style="text-align:right;"> 306 </td> <td style="text-align:right;"> 306 </td> <td style="text-align:right;"> 306 </td> <td style="text-align:right;font-weight: bold;"> 918 </td> </tr> <tr> <td style="text-align:left;"> La Liga </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;font-weight: bold;"> 1140 </td> </tr> <tr> <td style="text-align:left;"> Ligue 1 </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;"> 279 </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;font-weight: bold;"> 1039 </td> </tr> <tr> <td style="text-align:left;"> Premier League </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;"> 380 </td> <td style="text-align:right;font-weight: bold;"> 1140 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> Total </td> <td style="text-align:right;font-weight: bold;"> 1446 </td> <td style="text-align:right;font-weight: bold;"> 1345 </td> <td style="text-align:right;font-weight: bold;"> 1446 </td> <td style="text-align:right;font-weight: bold;font-weight: bold;"> 4237 </td> 
</tr> </tbody> </table> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>The distribution of the main continuous variables can also be summarized by season/league</li> <ul> <li><i>(code in the <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">research project example</a>)</i> </ul> </ul> --- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Season 2018-2019</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Q1 </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Q3 </th> <th style="text-align:right;"> Max </th> </tr> </thead> <tbody> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Attendance</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 19205 </td> <td style="text-align:right;"> 29230.50 </td> <td style="text-align:right;"> 40911.0 </td> <td style="text-align:right;"> 43453.18 </td> <td style="text-align:right;"> 52500.00 </td> <td style="text-align:right;"> 81365 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 3592 </td> <td style="text-align:right;"> 12074.50 </td> <td style="text-align:right;"> 19367.5 </td> <td style="text-align:right;"> 27118.68 </td> <td style="text-align:right;"> 39587.75 </td> <td style="text-align:right;"> 93265 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 12795.75 </td> <td style="text-align:right;"> 17577.5 </td> <td style="text-align:right;"> 22807.27 </td> <td style="text-align:right;"> 27378.50 </td> <td style="text-align:right;"> 64696 </td> </tr> <tr> <td 
style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 9980 </td> <td style="text-align:right;"> 25034.75 </td> <td style="text-align:right;"> 31948.0 </td> <td style="text-align:right;"> 38181.29 </td> <td style="text-align:right;"> 53282.75 </td> <td style="text-align:right;"> 81332 </td> </tr> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Goals away</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.39 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.13 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.09 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.25 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Goals home</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga 
</td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 1.79 </td> <td style="text-align:right;"> 3.00 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.45 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.47 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.57 </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 6 </td> </tr> </tbody> </table> --- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Season 2019-2020</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Q1 </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Q3 </th> <th style="text-align:right;"> Max </th> </tr> </thead> <tbody> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Attendance</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> 
<td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 27062.5 </td> <td style="text-align:right;"> 29783.37 </td> <td style="text-align:right;"> 49025.0 </td> <td style="text-align:right;"> 81365 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 16001.5 </td> <td style="text-align:right;"> 20694.99 </td> <td style="text-align:right;"> 33583.5 </td> <td style="text-align:right;"> 93426 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 12418.0 </td> <td style="text-align:right;"> 15814.0 </td> <td style="text-align:right;"> 22427.67 </td> <td style="text-align:right;"> 29440.5 </td> <td style="text-align:right;"> 65421 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10346.5 </td> <td style="text-align:right;"> 30534.0 </td> <td style="text-align:right;"> 29796.04 </td> <td style="text-align:right;"> 45594.5 </td> <td style="text-align:right;"> 73737 </td> </tr> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Goals away</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.55 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.04 </td> <td 
style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.03 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.21 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 9 </td> </tr> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Goals home</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.66 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.44 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.49 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td 
style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 1.52 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> --- <table class="table table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Season 2020-2021</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Q1 </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Q3 </th> <th style="text-align:right;"> Max </th> </tr> </thead> <tbody> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Attendance</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 503.57 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 11500 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 33.54 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4800 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 46.90 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 5000 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td 
style="text-align:right;"> 0 </td> <td style="text-align:right;"> 224.22 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10000 </td> </tr> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Goals away</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.36 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.36 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.34 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 7 </td> </tr> <tr grouplength="4"><td colspan="7" style="border-bottom: 1px solid;"><strong>Goals home</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Bundesliga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.68 </td> <td style="text-align:right;"> 2 </td> <td 
style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> La Liga </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.37 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Ligue 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.40 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Premier League </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.35 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 9 </td> </tr> </tbody> </table> --- ### 2. Data description #### 2.3. Data visualization <ul> <li>The last step before the analysis is to <b>visualize the data</b></li> <ul> <li>The idea is also to describe the data, but <b>with relevant graphs</b></li> </ul> </ul> -- <b>➜ Attendance:</b> ```r ggplot(data_match, aes(x = Season, y = Attendance, fill = Season)) + geom_boxplot(show.legend = F, alpha = .75) + coord_flip() ``` -- <img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 2. Data description #### 2.3. Data visualization <b>➜ Goals home vs. away:</b> ```r data_match %>% pivot_longer(c(Home, Away), names_to = "Variable", values_to = "Value") %>% ggplot(., aes(x = Season, y = Value, fill = Variable)) + geom_boxplot(alpha = .75) ``` -- <img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="70%" style="display: block; margin: auto;" /> --- ### 2. Data description #### 2.3. 
Data visualization <b>➜ Winner:</b> ```r ggplot(data_match, aes(x = Season, fill = Winner)) + ylab("Number of matches") + geom_bar(stat = "count", position = position_dodge(width = .8), width = .7, alpha = .85) ``` -- <img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="70%" style="display: block; margin: auto;" /> --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps ✔</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Data description ✔</b></li> <ul style = "list-style: none"> <li>2.1. Data cleaning</li> <li>2.2. Descriptive statistics</li> <li>2.3. Data visualization</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Analysis</b></li> <ul style = "list-style: none"> <li>3.1. Regression analysis</li> <li>3.2. Robustness</li> <li>3.3. Heterogeneity</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:3cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps ✔</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Data description ✔</b></li> <ul style = "list-style: none"> <li>2.1. Data cleaning</li> <li>2.2. Descriptive statistics</li> <li>2.3. Data visualization</li> </ul> </ul> ] .pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Analysis</b></li> <ul style = "list-style: none"> <li>3.1. Regression analysis</li> <li>3.2. Robustness</li> <li>3.3. 
Heterogeneity</li> </ul> </ul> ]

---

### 3. Analysis
#### 3.1. Regression analysis

* The first step of the regression analysis is to properly write down the <b>equation</b> you want to estimate:

--

<p style = "margin-bottom:1.25cm;"></p>

`$$1\{Winner_m=\text{Home}\}=\alpha+\beta \times1\{Public_m=\text{Yes}\}+\varepsilon_m$$`

<p style = "margin-bottom:1.25cm;"></p>

<ul> <li>Where for a given match \(m\):</li> <ul> <li>\(1\{Winner_m=\text{Home}\}\) takes the value \(1\) if the winning team is the one playing at home and \(0\) otherwise</li> <li>\(1\{Public_m=\text{Yes}\}\) takes the value \(1\) if there are spectators in the stadium and \(0\) otherwise</li> </ul> </ul>

--

<p style = "margin-bottom:1.5cm;"></p>

* The two variables of interest should be coded properly for the regression:

```r
data_match <- data_match %>%
  mutate(Winner_home = ifelse(Winner == "Home", 1, 0),                    # 1 if the home team wins
         Public = ifelse(Attendance > 0, "Public", "No public"))          # "No public" is the reference category
```

---

### 3. Analysis
#### 3.1. Regression analysis

.pull-left[

* Then the regression should be properly reported:

```r
stargazer(lm(Winner_home ~ Public, data_match),
          dep.var.labels = c("Home win"),
          keep.stat = c("n", "adj.rsq"),
          type = "text")
```

<ul> <li>And the coefficient of interest properly interpreted:</li> </ul>

<center><b><i>The presence of spectators in the stadium increases the probability that the home team wins the match, relative to losing or drawing, by 5.9 percentage points on average, everything else equal.
The coefficient is statistically significantly different from 0 at the 1% significance level.</i></b></center>

]

.pull-right[

```
##
## ========================================
##                  Dependent variable:
##              ---------------------------
##                       Home win
## ----------------------------------------
## PublicPublic          0.059***
##                        (0.016)
##
## Constant              0.395***
##                        (0.012)
##
## ----------------------------------------
## Observations            4,237
## Adjusted R2             0.003
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
```

]

---

### 3. Analysis
#### 3.1. Regression analysis

<ul> <li>It is also good practice to provide a visual representation of the relationship you estimate</li> <ul> <li>In this case it is not simple because both variables are binary</li> <li>But geom_jitter() adds noise to the location of each data point around the 4 possible coordinates</li> <li><i>(code in the <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">research project example</a>)</i></li> </ul> </ul>

--

.pull-left[ <img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> ]

--

.pull-right[ <img src="slides_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" /> ]

---

### 3. Analysis
#### 3.1. 
Regression analysis

<ul> <li>It is also crucial to discuss whether or not the effect is <b>causal</b></li> <ul> <li>Self-selection issue?</li> <li>Omitted variable bias?</li> <li>Under which assumptions would the effect be causal?</li> </ul> </ul>

<p style = "margin-bottom:.9cm;"></p>

--

<ul> <li><b>Self-selection</b> issue</li> <ul> <li>Teams, whether playing at home or away, cannot self-select into playing with or without spectators</li> <li>Variations in x are fully driven by decisions that teams have no control over</li> </ul> </ul>

<p style = "margin-bottom:.9cm;"></p>

--

<ul> <li><b>Omitted variable</b> bias</li> <ul> <li>There may be other variables correlated with both x and y that drive this relationship</li> <li>When there are no spectators (i.e., during the pandemic), the trip to the stadium may be less tiring because there is less congestion on the roads due to remote working, or for any other reason</li> </ul> </ul>

--

<p style = "margin-bottom:.9cm;"></p>

<center><i><b>Thus, this result can be considered causal only if no change concomitant with the attendance restrictions could have had a differential impact on home and away teams</b></i></center>

---

### 3. Analysis
#### 3.2. 
Robustness

<ul> <li>Assessing the robustness of the result consists in progressively <b>adding control variables</b> to the regression</li> <ul> <li>If the result is robust, this should <b>not change the magnitude</b> of the coefficient much</li> <li>If the result is robust, the coefficient should <b>remain statistically significant</b></li> </ul> </ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul> <li>We can control for the day and time of the match</li> <ul> <li>These factors could be linked to the mechanism related to transport</li> <li>Even though this cannot rule out the mechanism, it can suggest whether or not time and day are a channel</li> <li>We can also control for the league in case the effect is driven by differences across leagues</li> </ul> </ul>

<p style = "margin-bottom:1.25cm;"></p>

--

```r
stargazer(lm(Winner_home ~ Public, data_match),
          lm(Winner_home ~ Public + League, data_match),
          lm(Winner_home ~ Public + League + Time, data_match),
          lm(Winner_home ~ Public + League + Time + Day, data_match),
          dep.var.labels = c("Home win"), type = "text",
          keep.stat = c("n", "adj.rsq"))
```

---

.left-column[

<p style = "margin-bottom:-1cm;"></p>

```
##
## ========================================================
##                              Dependent variable:
##                      -----------------------------------
##                                   Home win
## --------------------------------------------------------
## PublicPublic         0.059*** 0.060*** 0.061*** 0.060***
##                      (0.016)  (0.016)  (0.016)  (0.016)
##
## LeagueBundesliga               0.003    0.009    0.008
##                               (0.022)  (0.023)  (0.024)
##
## LeagueLa Liga                  0.019    0.020    0.023
##                               (0.021)  (0.021)  (0.022)
##
## LeaguePremier League           0.014    0.023    0.023
##                               (0.021)  (0.022)  (0.023)
##
## Time                                    0.004    0.004
##                                        (0.003)  (0.004)
##
## DayMon                                          -0.040
##                                                 (0.050)
##
## DaySat                                           0.019
##                                                 (0.033)
##
## DaySun                                           0.008
##                                                 (0.035)
##
## DayThu                                           0.009
##                                                 (0.059)
##
## DayTue                                           0.057
##                                                 (0.047)
##
## DayWed                                           0.026
##                                                 (0.039)
##
## Constant             0.395*** 0.385*** 0.312*** 0.294***
##                      (0.012)  (0.018)  (0.059)  (0.086)
##
## --------------------------------------------------------
## Observations          4,237    4,237    4,237    4,237
## Adjusted R2           0.003    0.003    0.003    0.002
## ========================================================
## Note:                        *p<0.1; **p<0.05; ***p<0.01
```

]

.right-column[

<p style = "margin-bottom:6cm;"></p>

<ul> <li>As control variables are included:</li> <ul> <li>The <b>magnitude</b> of the coefficient does not vary much</li> <li>The statistical <b>significance</b> does not change either</li> </ul> </ul>

<p style = "margin-bottom:1.25cm;"></p>

<center><i>➜ So the result is robust to controlling for these characteristics</i></center>

]

---

### 3. Analysis
#### 3.2. Robustness

<ul> <li>Note that robustness is not necessarily about including controls</li> <ul> <li>It can be about <b>excluding/including</b> some observations (e.g., outliers)</li> <li>Or about <b>changing the definition</b> of one or several variables, etc.</li> </ul> </ul>

--

<p style = "margin-bottom:1.25cm;"></p>

<ul> <li>For instance, the dependent variable of the regression could be defined in two ways:</li> <ul> <li>So far: Probability of winning relative to losing or drawing</li> <li>Alternative: Probability of winning relative to losing only, omitting draws</li> </ul> </ul>

--

<p style = "margin-bottom:1.25cm;"></p>

```r
# Alternative outcome: same as Winner_home, but set to missing for draws
data_match <- data_match %>%
  mutate(Winner_home2 = ifelse(Winner != "Draw", Winner_home, NA))

# Labels follow the order of the models: Winner_home first, Winner_home2 second
stargazer(lm(Winner_home ~ Public, data_match),
          lm(Winner_home2 ~ Public, data_match),
          keep.stat = c("n", "adj.rsq"), model.numbers = FALSE, type = "text",
          dep.var.labels = c("Home win vs. Home loss/Draw", "Home win vs. Home loss"))
```

---

### 3. Analysis
#### 3.2. Robustness

.left-column[

```
##
## ===============================================================
##                             Dependent variable:
##              --------------------------------------------------
##              Home win vs. Home loss/Draw Home win vs. Home loss
## ---------------------------------------------------------------
## PublicPublic          0.059***                  0.078***
##                        (0.016)                   (0.018)
##
## Constant              0.395***                  0.529***
##                        (0.012)                   (0.014)
##
## ---------------------------------------------------------------
## Observations            4,237                     3,170
## Adjusted R2             0.003                     0.006
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01
```

]

.right-column[

<p style = "margin-bottom:-1.25cm;"></p>

<ul> <li>The coefficients cannot be compared directly because omitting draws mechanically inflates both the baseline probability of winning and the coefficient</li> </ul>

<ul><ul> <li>But the ratio of the effect of spectators on the probability of winning to the probability of winning without spectators is very similar in the two cases: 0.059/0.395 and 0.078/0.529 (both \(\approx\)0.15)</li> </ul></ul>

<ul><ul> <li>And both coefficients are statistically significantly different from 0 at the 1% significance level</li> </ul></ul>

<center><i><b>➜ Also robust</b></i></center>

]

---

### 3. Analysis
#### 3.3. 
Heterogeneity

<ul> <li>The last step of the analysis is to investigate the potential heterogeneity of the results</li> <ul> <li><b>Homogeneous</b> effect: The coefficient is more or less <b>the same</b> for everybody</li> <li><b>Heterogeneous</b> effects: The coefficient <b>varies a lot</b> depending on individual (/match) characteristics</li> <li>It can vary according to sex, education, income, or, here, league for instance</li> </ul> </ul>

<p style = "margin-bottom:1.25cm;"></p>

--

<ul> <li>While <b>robustness</b> consisted in controlling for variables</li> <ul> <li>Estimating the relationship <b>net of the effect</b> of other (potentially confounding) <b>variables</b></li> </ul> </ul>

<ul> <li><b>Heterogeneity</b> consists in interacting x with a third variable</li> <ul> <li>By how much the relationship between x and y <b>varies depending on</b> the value of a <b>third variable</b></li> </ul> </ul>

<p style = "margin-bottom:1.25cm;"></p>

--

```r
stargazer(lm(Winner_home ~ Public, data_match),
          lm(Winner_home ~ Public + League, data_match),
          lm(Winner_home ~ Public + League + Public * League, data_match),
          dep.var.labels = c("Home win"), type = "text",
          keep.stat = c("n", "adj.rsq"))
```

---

.left-column[

<p style = "margin-bottom:-1cm;"></p>

```
##
## ===============================================================
##                                        Dependent variable:
##                                   -----------------------------
##                                             Home win
## ---------------------------------------------------------------
## PublicPublic                      0.059***  0.060***  0.075**
##                                   (0.016)   (0.016)   (0.032)
##
## LeagueBundesliga                             0.003     0.024
##                                             (0.022)   (0.037)
##
## LeagueLa Liga                                0.019     0.035
##                                             (0.021)   (0.034)
##
## LeaguePremier League                         0.014     0.016
##                                             (0.021)   (0.034)
##
## PublicPublic:LeagueBundesliga                         -0.034
##                                                       (0.046)
##
## PublicPublic:LeagueLa Liga                            -0.026
##                                                       (0.044)
##
## PublicPublic:LeaguePremier League                     -0.002
##                                                       (0.044)
##
## Constant                          0.395***  0.385***  0.376***
##                                   (0.012)   (0.018)   (0.025)
##
## ---------------------------------------------------------------
## Observations                       4,237     4,237     4,237
## Adjusted R2                        0.003     0.003     0.002
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01
```

]

.right-column[

<p style = "margin-bottom:3cm;"></p>

<ul> <li style = "margin-left:.5cm;">Point estimates are:</li> <ul> <li>Ligue 1: 7.5pp</li> <li>Bundesliga: 7.5-3.4=4.1pp</li> <li>La Liga: 7.5-2.6=4.9pp</li> <li>Premier League: 7.5-0.2=7.3pp</li> </ul> </ul>

<p style = "margin-bottom:1.25cm;"></p>

<ul> <li style = "margin-left:.5cm;">But none of the coefficients associated with the interaction terms is statistically significantly different from 0</li> <ul> <li>These variations across groups are likely enough to be random noise that we cannot conclude there is heterogeneity across leagues, at least none we can detect</li> </ul> </ul>

]

---

<h3>Overview</h3>

<p style = "margin-bottom:3cm;"></p>

.pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Preliminary steps ✔</b></li> <ul style = "list-style: none"> <li>1.1. Research question</li> <li>1.2. Finding data</li> <li>1.3. Literature review</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Data description ✔</b></li> <ul style = "list-style: none"> <li>2.1. Data cleaning</li> <li>2.2. Descriptive statistics</li> <li>2.3. Data visualization</li> </ul> </ul> ]

.pull-right[ <ul style = "margin-left:-1cm;list-style: none"> <li><b>3. Analysis ✔</b></li> <ul style = "list-style: none"> <li>3.1. Regression analysis</li> <li>3.2. Robustness</li> <li>3.3. Heterogeneity</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <ul style = "margin-left:-1cm;list-style: none"><li><b>4. Wrap up!</b></li></ul> ]

---

### Wrap up! 
#### Preliminary steps <ul> <li>It all starts with a good <b>research question</b>:</li> <ul> <li>More about <b>explanation</b> than description</li> <li><b>Specific</b> enough, not too broad (Yes/No question, to what extent, ...)</li> <li>Relatively <b>original and interesting</b> to you!</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>That can be studied with <b>data</b>:</li> <ul> <li><a href="https://www.openicpsr.org/openicpsr/search/studies">openICPSR</a>, <a href="https://dataverse.harvard.edu/dataverse/harvard?q=&fq0=subject_ss%3A%22Social%20Sciences%22&types=dataverses%3Adatasets&sort=dateSort&order=desc">Harvard Dataverse</a>, <a href="https://www.aeaweb.org/resources/data">American Economic Association</a>, ...</li> <li>With the <b>necessary variables</b> to study the research question</li> <li>And <b>additional variables</b> to use for robustness and heterogeneity analysis</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>And that is relevant with respect to the academic <b>literature</b> on the issue:</li> <ul> <li>What do we <b>already know</b> on the topic?</li> <li>What <b>remains to be known</b>?</li> <li>What is your <b>contribution</b> to the literature?</li> </ul> </ul> <p style = "margin-bottom:.75cm;"></p> <center><b>➜ That's what you have to do for the next 3 weeks!</b></center> --- ### Wrap up! 
#### Preliminary steps

<ul> <li>During <b>lecture 4</b> you'll have to give a <b>5- to 10-minute presentation</b> with slides in which you should:</li> <ul> <li>Present and motivate your <b>research question</b></li> <li>Present your <b>data</b> (source, main variables description)</li> <li>Present a short review of the <b>related literature</b></li> </ul> </ul>

--

<ul> <li>You can come up with <b>your own</b> research question or take one <b>from an existing article</b></li> <ul> <li>When you have an idea, send it by e-mail along with the data to make sure it is suitable and not already taken</li> </ul> </ul>

--

<ul> <li>It will be <b>graded</b> (detailed grading scheme <a href="https://louissirugue.github.io/metrics_on_R/project/grading.html">here</a>):</li> <ul> <li>25% of the grade on this presentation</li> <li>30% of the grade on the midterm report</li> <li>45% of the grade on the final research project/presentation</li> </ul> </ul>

--

<p style = "margin-bottom:1cm;"></p>

<center><b><i>Please go through the <a href="https://louissirugue.github.io/metrics_on_R/project/example.html">example</a> to get familiar with what is expected from you</i></b></center>

<p style = "margin-bottom:1cm;"></p>

* All the following steps of the research process will be subject to weekly 10-minute follow-ups with each group

---

### Wrap up! 
#### Data description

<ul> <li>After opening and eyeballing the data, the first thing to do is <b>data cleaning</b></li> <ul> <li><b>Recoding variables</b> in a practical way (it may imply <b>creating new variables</b>)</li> <li><b>Removing observations</b> that are not relevant, if any, typically <b>missing values</b></li> <li>Potentially <b>joining</b> datasets, and <b>pivoting</b> variables from wide to long or conversely</li> </ul> </ul>

```r
mutate() %>% filter() %>% select()
```

--

<ul> <li>The data should then be summarized with relevant <b>descriptive statistics</b></li> <ul> <li>For <b>categorical variables:</b> Number of observations per category</li> <li>For <b>continuous variables:</b> Summarizing the distribution</li> </ul> </ul>

```r
summarise(N = n())   # categorical variables (after group_by())
summary(variable)    # continuous variables
```

--

<ul> <li>And the last step of the data description is <b>data visualization</b> (+/- same thing but with graphs)</li> </ul>

```r
ggplot(., aes()) +
```

---

### Wrap up!
#### Analysis

<ul> <li>The analysis should be carried out as follows</li> <ul> <li><b>Write down the equation</b> to estimate</li> <li>Estimate it and <b>properly interpret the coefficient(s)</b> of interest</li> <li><b>Represent graphically</b> the estimated relationship</li> </ul> </ul>

<p style = "margin-bottom:1cm;"></p>

`$$1\{Winner_m=\text{Home}\}=\alpha+\beta \times1\{Public_m=\text{Yes}\}+\varepsilon_m$$`

```r
stargazer(lm(Winner_home ~ Public, data_match))
```

--

<p style = "margin-bottom:1.25cm;"></p>

<ul> <li>And it should be followed by these three steps</li> <ul> <li>A discussion on the <b>causality</b> of the estimated effect: OVB, selection, ...</li> <li>A <b>robustness</b> assessment: Include control variables, omit groups, ...</li> <li>A <b>heterogeneity</b> analysis: Interact with third variable(s)</li> </ul> </ul>

--

<p style = "margin-bottom:.75cm;"></p>

* At some point, add the introduction (with literature review), conclusion, and references sections

---

### Wrap up! 
#### Some important remarks

<ul> <li>Your final document should be an HTML file produced with <b>R Markdown</b></li> <ul> <li>It should be well formatted (stargazer, kable, LaTeX, inline code, ...)</li> <li>It can be written in English or in French</li> </ul> </ul>

--

<p style = "margin-bottom:.75cm;"></p>

<ul> <li>It should be <b>reproducible</b></li> <ul> <li>The R Markdown should knit without error</li> <li>Every data modification should be in the code, so that the code produces the HTML document from the raw data</li> </ul> </ul>

--

<p style = "margin-bottom:.75cm;"></p>

<ul> <li><b>When you send an email</b> related to a coding issue</li> <ul> <li>Send your .Rmd and the data as attachments; do not copy-paste your code into the email or send screenshots</li> <li>You should first have viewed your data at each step to see where the problem comes from</li> <li>And searched for your error message, with keywords, on Google to try to understand the problem</li> </ul> </ul>

<p style = "margin-bottom:.75cm;"></p>

--

<ul> <li>Beware of <b>technical issues</b></li> <ul> <li>Knit your .Rmd regularly to check it works</li> <li>Save your files regularly and on multiple devices/in your mailbox</li> </ul> </ul>
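
The reproducibility and regular-knitting advice above can be checked from the R console. The following is only a minimal sketch under stated assumptions: `project.Rmd` is a hypothetical file name for your report, and the rmarkdown package and the `Rscript` executable are assumed to be available. Knitting in a separate R process means the document cannot rely on objects that only exist in your current session.

```r
# Hypothetical file name: replace with the path to your own .Rmd
rmd_file <- "project.Rmd"

# Build the R expression to run in a separate, fresh R process
expr <- sprintf("rmarkdown::render('%s')", rmd_file)

# system2() returns the exit status of the command (0 means success)
status <- system2("Rscript", args = c("-e", shQuote(expr)))

if (status != 0) {
  stop("Knitting failed: the document is not reproducible from a fresh session")
}
```

If the call stops with an error, the usual culprits are a data-cleaning step done by hand in the console or a package loaded interactively but not in the .Rmd itself.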