I. Introduction

It is well established that playing at home in football confers an advantage over the visiting team (Pollard, 1986). Yet the specific determinants of this advantage, and the extent of their respective contributions, have not been clearly identified so far. The presence of local supporters in the stadium ranks among the most plausible explanations for this stylized fact, but whether supporters actually help the home team to win is difficult to establish because, under normal conditions, there is little variation in the presence of supporters at football matches. By preventing supporters from attending football matches, the COVID-19 pandemic provides a unique setting to study this question. I study the effect of the presence of supporters on the probability of winning the match by comparing the outcomes of matches played with supporters before the pandemic to those played without supporters during the pandemic.

Dowie (1982) was the first to document a home advantage in football. Even though no causal effect could be identified, he stressed three potential reasons: fatigue for the away team due to travel, familiarity with the environment for the home team, and fans who support the home team and may boost its motivation. Evidence for these three channels was then put forward in later studies. Concerning fatigue, Pollard et al. (2008) showed that the distance traveled by the away team significantly increases the number of expected goals in favor of the home team, by 0.115 goals per thousand kilometers traveled. Loughead et al. (2003) found mixed evidence about the familiarity hypothesis: high-quality teams suffered after a move from their familiar venue, whereas low-quality teams seemed to benefit from it. Overall, their results provide little support for facility familiarity as an explanation for the home advantage. Finally, Greer (1983) showed that booing from the crowd at basketball games had a positive effect on the performance of the home team and a negative effect on that of the away team. Still, the overall effect of supporters on the outcome of sports events remains to be quantified.

II. Data cleaning

I use data on every match of the Premier League, Ligue 1, La Liga, and the Bundesliga from the 2018-2019 season to the 2020-2021 season. The data are publicly available at fbref.com and document not only the score but also when and where each match took place, as well as the number of supporters attending. Each of the variables of the dataset is briefly described below.

# Load necessary packages
library(tidyverse)  # To manipulate the data
library(stargazer)  # To display regression results
library(kableExtra) # To make html tables

# Import data from csv file
data_match <- read.csv("data/data_match.csv")

# Display the name of each variable
names(data_match)
##  [1] "Wk"           "Day"          "Date"         "Time"         "Home"        
##  [6] "xG"           "Score"        "xG.1"         "Away"         "Attendance"  
## [11] "Venue"        "Referee"      "Match.Report" "Notes"        "League"      
## [16] "Season"

The dataset contains 16 variables.

Not all of these variables are useful, so I keep only the date and time at which the match took place, the teams involved and the score, the number of supporters in the stadium, the league, and the season. The following table displays the first five observations of the data.

data_match <- data_match %>%
  # Keep only the variables listed below in data_match
  select(Day, Date, Time, Home, Score, Away, Attendance, League, Season)

# Display the first five observations of the data
kable(head(data_match, n = 5), caption = "Outlook of the data:")
Outlook of the data:
Day Date Time Home Score Away Attendance League Season
Fri 2018-08-10 20:45 Marseille 4-0 Toulouse 60756 Ligue 1 2018-2019
Sat 2018-08-11 17:00 Nantes 1-3 Monaco 32760 Ligue 1 2018-2019
Sat 2018-08-11 20:00 Montpellier 1-2 Dijon 12765 Ligue 1 2018-2019
Sat 2018-08-11 20:00 Lille 3-1 Rennes 25708 Ligue 1 2018-2019
Sat 2018-08-11 20:00 Angers 3-4 Nîmes 9534 Ligue 1 2018-2019

Before starting the analysis, some variables must be recoded for convenience. For instance, the Score variable is not in a practical format: it stores the number of goals scored by each team, separated by a dash. The score of each team should be assigned to a distinct variable whose class is numeric instead of character. The same type of modification can be applied to the Time variable, which is currently in character format as hh:mm. To transform it into a continuous variable expressed in hours, the number of minutes divided by 60 is added to the number of hours. The following table displays the first 15 rows of the data recoded as described above.

data_match <- data_match %>%
  
  # Separate the home and away score into 2 variables; as the output below
  # shows, the new Home and Away columns (goals) replace the team-name columns
  separate(Score, c("Home", "Away"), "-") %>%
  
  # Convert these variables as numeric
  mutate(Home = as.numeric(Home),
         Away = as.numeric(Away),
         
         # Generate a variable for the outcome of the match depending on who scored the most
         Winner = case_when(Home > Away ~ "Home",
                            Home == Away ~ "Draw",
                            Home < Away ~ "Away"),
         
         # Recode the Time variable as a continuous variable
         Time = as.numeric(substr(Time, 1, 2)) + as.numeric(substr(Time, 4, 5)) / 60)

# Display the first 15 rows of the recoded data
kable(head(data_match, n = 15), caption = "Recoded data:")
Recoded data:
Day Date Time Attendance Home Away League Season Winner
Fri 2018-08-10 20.75 60756 4 0 Ligue 1 2018-2019 Home
Sat 2018-08-11 17.00 32760 1 3 Ligue 1 2018-2019 Away
Sat 2018-08-11 20.00 12765 1 2 Ligue 1 2018-2019 Away
Sat 2018-08-11 20.00 25708 3 1 Ligue 1 2018-2019 Home
Sat 2018-08-11 20.00 9534 3 4 Ligue 1 2018-2019 Away
Sat 2018-08-11 20.00 26006 2 1 Ligue 1 2018-2019 Home
Sat 2018-08-11 20.00 21421 0 1 Ligue 1 2018-2019 Away
Sun 2018-08-12 15.00 48263 2 0 Ligue 1 2018-2019 Home
Sun 2018-08-12 17.00 23079 0 2 Ligue 1 2018-2019 Away
Sun 2018-08-12 21.00 47289 3 0 Ligue 1 2018-2019 Home
NA NA NA NA Ligue 1 2018-2019 NA
Fri 2018-08-17 20.75 18917 1 0 Ligue 1 2018-2019 Home
Sat 2018-08-18 17.00 19003 1 3 Ligue 1 2018-2019 Away
Sat 2018-08-18 20.00 10402 1 2 Ligue 1 2018-2019 Away
Sat 2018-08-18 20.00 19300 1 0 Ligue 1 2018-2019 Home
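To make the recoding rule concrete, here is a minimal base-R sketch applied to a few made-up strings. The helper names `time_to_hours` and `split_score` are hypothetical and not part of the original script; they only mirror the logic described above.

```r
# Hypothetical helper: convert "hh:mm" strings into hours as a decimal number
time_to_hours <- function(x) {
  as.numeric(substr(x, 1, 2)) + as.numeric(substr(x, 4, 5)) / 60
}

# Hypothetical helper: split "h-a" scores into numeric home and away goals
split_score <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  data.frame(home_goals = as.numeric(sapply(parts, `[`, 1)),
             away_goals = as.numeric(sapply(parts, `[`, 2)))
}

time_to_hours(c("20:45", "17:00"))  # 20.75 and 17
split_score(c("4-0", "1-3"))        # one row per match, goals as numbers
```

For example, a kick-off at 20:45 becomes 20 + 45/60 = 20.75, which is the value shown in the first row of the recoded table.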

An important step of the data cleaning process is to handle missing values. It can be seen from the table above that between each week of competition there is an empty row with missing values. These rows can be deleted by filtering out every observation for which the Home variable (now the number of goals scored by the home team) is missing.

# Drop blank rows
data_match <- data_match %>% filter(!is.na(Home))

To check for the presence of actual missing values in the data, the following table shows the number of missing values for each variable of the dataset.

# Show the number of missing values for each variable
kable(data_match %>% summarise_all(~sum(is.na(.))), 
      caption = "Number of missing values per variable:")
Number of missing values per variable:
Day Date Time Attendance Home Away League Season Winner
0 0 0 1670 0 0 0 0 0

The only variable with missing values is Attendance. There are 1670 matches for which the number of supporters in the stadium is not reported. To get a better understanding of what is going on with this variable, the following table summarizes the distribution of Attendance with its minimum and its maximum value, its mean, and the three quartiles.

# Display the summary statistics of the Attendance variable
kable(as.matrix(summary(data_match$Attendance)) %>% t(), 
      caption = "Attendance - Descriptive statistics:")
Attendance - Descriptive statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
13 16158 27717 31789.99 45014 93426 1670

The number of spectators per match ranges from 13 to 93426. But because the COVID-19 pandemic prevented many matches from having supporters in the stadium, some values of Attendance should be equal to 0. It is thus possible that the missing values of Attendance correspond to the matches that supporters were not allowed to attend, i.e., that they are not truly missing but a way of coding no attendance. This hypothesis is all the more plausible given that no other variable in the data has missing values. It can be checked visually by recoding missing values to 0 and plotting the monthly evolution of the average number of supporters in stadiums. This is done separately for each league to see whether the issue is league-specific.

attendance_data <- data_match %>%
         # Replace missing values of Attendance by 0
  mutate(Attendance = ifelse(is.na(Attendance), 0, Attendance),
         # Keep only the YYYY-MM part of the Date variable (YYYY-MM-DD)
         Month = substr(Date, 1, 7)) %>%
  # Do computations separately for each month and each league
  group_by(Month, League) %>%
  # Compute the average number of supporters in the stadium
  summarize(Attendance = mean(Attendance)) %>%
  # Sort the data by ascending order of month and group by month
  ungroup() %>% arrange(Month) %>%  group_by(Month) %>%
  # Attribute a number from 1 to N to each month whatever the league
  mutate(Month_id = cur_group_id())

ggplot(attendance_data, 
       # Assign month/attendance to the x-/y-axis and one color per league
       aes(x = Month_id, y = Attendance, color = League)) +
  # Draw a line and a point geometry with transparency and rename legend
  geom_line(size = 1.2, alpha = .75) + geom_point(size = 1.5, alpha = .75) + 
  labs(color = "League:") +
  # Label the x axis with months in character format
  scale_x_continuous(name = "Month", breaks = unique(attendance_data$Month_id), 
                     labels = unique(attendance_data$Month)) + 
  # Rotate the month labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

The sudden drop to 0 attendance due to the pandemic right after March 2020 is striking, and confirms that the missing values of the Attendance variable should indeed be recoded as 0. It also illustrates that the COVID-19 pandemic provides an ideal setting to test for the potential effect of the presence of supporters in the stadium on the probability for the home team to win the match.

# Replace missing values of Attendance by 0
data_match <- data_match %>% mutate(Attendance = ifelse(is.na(Attendance), 0, Attendance))
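A tabular check can complement the visual one: if missing values really encode empty stadiums, the share of missing Attendance computed on the raw data (before recoding) should be close to zero before the pandemic and jump afterwards. The sketch below illustrates the computation in base R on a small made-up data frame; `toy` and all its values are invented for illustration and are not taken from the dataset.

```r
# Toy data (invented values): NA stands for an unreported attendance
toy <- data.frame(
  Date = c("2019-11-02", "2019-11-09", "2020-02-15",
           "2020-06-20", "2020-06-27", "2020-10-03"),
  Attendance = c(41000, 38500, 40210, NA, NA, NA)
)

# Keep only the YYYY-MM part of the date
toy$Month <- substr(toy$Date, 1, 7)

# Share of matches with missing attendance, separately for each month
share_missing <- tapply(is.na(toy$Attendance), toy$Month, mean)
share_missing
```

In this toy example the share is 0 for the pre-pandemic months and 1 afterwards, which is the pattern one would expect in the real data if the hypothesis holds.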

III. Descriptive statistics

Once the data is cleaned and recoded, it should be described with appropriate statistics. The first relevant information is the number of observations. The observation level of the data being the match, the following table displays the number of matches in the data separately for each league and each season.

nb_obs <- data_match %>%
  # Do computations separately for each season and each league
  group_by(League, Season) %>%
  # Compute the number of matches per season/league
  summarise(n_match = n()) %>% 
  # Put these values in separate columns for each season
  pivot_wider(names_from = "Season", values_from = "n_match") %>%
  # Compute the total number of matches per league
  mutate(Total = `2018-2019` + `2019-2020` + `2020-2021`) 

nb_obs %>%
  # Add one Total row which is the sum of all the above
  bind_rows(nb_obs %>% mutate(League = "Total") %>% group_by(League) %>% summarise_all(~sum(.))) %>%
  # Display in an html table
  kable(.,  caption = "Number of matches:") %>%
  # Set characters in the column Total in bold
  column_spec(5, bold = T) %>%
  # Set characters in the row Total in bold
  row_spec(5, bold = T) 
Number of matches:
League 2018-2019 2019-2020 2020-2021 Total
Bundesliga 306 306 306 918
La Liga 380 380 380 1140
Ligue 1 380 279 380 1039
Premier League 380 380 380 1140
Total 1446 1345 1446 4237
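The per-season counts of 380 and 306 in the table above are consistent with a double round-robin format, in which each club hosts every other club exactly once. The short sketch below makes this arithmetic explicit; the function name `matches_per_season` is hypothetical, and the league sizes (20 clubs, 18 in the Bundesliga) are an assumption that is not recorded in the dataset itself.

```r
# Number of matches in a double round-robin season with n_teams clubs:
# each club hosts every other club exactly once
matches_per_season <- function(n_teams) n_teams * (n_teams - 1)

matches_per_season(20)  # 380, assuming 20 clubs (Premier League, La Liga, Ligue 1)
matches_per_season(18)  # 306, assuming 18 clubs (Bundesliga)
```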

The data contain a total of 4237 observations, with slightly fewer observations in the 2019-2020 season than in the two others due to the cancellation of matches in Ligue 1. Apart from this event, each league has 380 matches per season except the Bundesliga, for which the number of matches per season amounts to 306. The following table shows the number of matches won at home, won away, and drawn, along with their respective proportions in the dataset.

data_match %>%
  # Do computations separately for each outcome
  group_by(Winner) %>%
  # Compute the number of observations and the percentage
  summarise(N = n(), Pct = n() / nrow(.)) %>%
  # Display in an html table
  kable(., caption = "Distribution of match outcomes", digits = 2)
Distribution of match outcomes
Winner N Pct
Away 1343 0.32
Draw 1067 0.25
Home 1827 0.43

This confirms the well-established stylized fact that football matches are more likely to be won by the home team. To provide an overview of the variables that are used in this analysis, the following tables summarize the distribution of the three main variables: the number of supporters in the stadium, the number of goals scored by the home team, and the number of goals scored by the away team. These statistics are provided separately for each league and each season.

descriptive_data <- data_match %>%
  # Put the variables of interest in long format
  pivot_longer(c(Attendance, Home, Away), 
               names_to = "Variable", values_to = "Value") %>%
  # Group the data by variable of interest, season, and league
  group_by(Variable, Season, League) %>%
  # Compute the descriptive statistics
  summarise(Min = min(Value), 
            Q1 = quantile(Value, 1/4),
            Median = median(Value), 
            Mean = mean(Value), 
            Q3 = quantile(Value, 3/4),
            Max = max(Value)) %>%
  # Ungroup the data
  ungroup()

2018-2019
descriptive_data %>% 
  # Keep only the observations of the 2018-2019 season
  filter(Season == "2018-2019") %>% 
  # Keep only the variables to display
  select(-c(Variable, Season)) %>%
  # Add a caption to the table
  kable(., caption = paste("Season", "2018-2019")) %>%
  # Display the name of the variable for the corresponding rows
  pack_rows("Attendance", 1, 4) %>%
  pack_rows("Goals away", 5, 8) %>%
  pack_rows("Goals home", 9, 12)
Season 2018-2019
League Min Q1 Median Mean Q3 Max
Attendance
Bundesliga 19205 29230.50 40911.0 43453.18 52500.00 81365
La Liga 3592 12074.50 19367.5 27118.68 39587.75 93265
Ligue 1 0 12795.75 17577.5 22807.27 27378.50 64696
Premier League 9980 25034.75 31948.0 38181.29 53282.75 81332
Goals away
Bundesliga 0 0.00 1.0 1.39 2.00 6
La Liga 0 0.00 1.0 1.13 2.00 6
Ligue 1 0 0.00 1.0 1.09 2.00 5
Premier League 0 0.00 1.0 1.25 2.00 6
Goals home
Bundesliga 0 1.00 2.0 1.79 3.00 8
La Liga 0 1.00 1.0 1.45 2.00 8
Ligue 1 0 1.00 1.0 1.47 2.00 9
Premier League 0 1.00 1.0 1.57 2.00 6
2019-2020
descriptive_data %>% 
  # Keep only the observations of the 2019-2020 season
  filter(Season == "2019-2020") %>% 
  # Keep only the variables to display
  select(-c(Variable, Season)) %>%
  # Add a caption to the table
  kable(., caption = paste("Season", "2019-2020")) %>%
  # Display the name of the variable for the corresponding rows
  pack_rows("Attendance", 1, 4) %>%
  pack_rows("Goals away", 5, 8) %>%
  pack_rows("Goals home", 9, 12)
Season 2019-2020
League Min Q1 Median Mean Q3 Max
Attendance
Bundesliga 0 0.0 27062.5 29783.37 49025.0 81365
La Liga 0 0.0 16001.5 20694.99 33583.5 93426
Ligue 1 0 12418.0 15814.0 22427.67 29440.5 65421
Premier League 0 10346.5 30534.0 29796.04 45594.5 73737
Goals away
Bundesliga 0 1.0 1.0 1.55 2.0 6
La Liga 0 0.0 1.0 1.04 2.0 5
Ligue 1 0 0.0 1.0 1.03 2.0 5
Premier League 0 0.0 1.0 1.21 2.0 9
Goals home
Bundesliga 0 1.0 1.0 1.66 2.0 8
La Liga 0 1.0 1.0 1.44 2.0 6
Ligue 1 0 1.0 1.0 1.49 2.0 6
Premier League 0 1.0 1.0 1.52 2.0 8
2020-2021
descriptive_data %>% 
  # Keep only the observations of the 2020-2021 season
  filter(Season == "2020-2021") %>% 
  # Keep only the variables to display
  select(-c(Variable, Season)) %>%
  # Add a caption to the table
  kable(., caption = paste("Season", "2020-2021")) %>%
  # Display the name of the variable for the corresponding rows
  pack_rows("Attendance", 1, 4) %>%
  pack_rows("Goals away", 5, 8) %>%
  pack_rows("Goals home", 9, 12)
Season 2020-2021
League Min Q1 Median Mean Q3 Max
Attendance
Bundesliga 0 0 0 503.57 0 11500
La Liga 0 0 0 33.54 0 4800
Ligue 1 0 0 0 46.90 0 5000
Premier League 0 0 0 224.22 0 10000
Goals away
Bundesliga 0 0 1 1.36 2 5
La Liga 0 0 1 1.14 2 6
Ligue 1 0 1 1 1.36 2 5
Premier League 0 0 1 1.34 2 7
Goals home
Bundesliga 0 1 1 1.68 2 8
La Liga 0 0 1 1.37 2 6
Ligue 1 0 0 1 1.40 2 6
Premier League 0 0 1 1.35 2 9

From these tables it appears that the average number of supporters in the stadium started to decline during the 2019-2020 season, to the extent that in 2020-2021 most matches in all leagues had no attendance at all, and the few matches with supporters were well below full capacity. Also, the average and maximum number of goals scored tend to be larger for teams that play at home than for teams that play away, especially in the 2018-2019 season.

IV. Visualizing the data

To get a finer depiction of the distribution of these variables, they can be represented graphically by superimposing their density and their boxplot separately for each league and each season.

# Assign the League to the x and fill axes and the attendance to the y axis
ggplot(data_match, aes(x = League, y = Attendance, fill = League)) +
  # Overlay a violin density and a boxplot with transparency
  geom_violin(show.legend = F, alpha = .55) +
  geom_boxplot(width = 0.1, show.legend = F, alpha = .75) + 
  # Rotate the graph and plot separately by season
  coord_flip() + facet_wrap(~ Season, ncol = 1) + ylab("") + xlab("")

data_match %>%
  # Put the goals scored home and away in long format
  pivot_longer(c(Home, Away), names_to = "Variable", values_to = "Value") %>%
  # Assign the League to the x and fill axis and the goals scored to the y axis
  ggplot(., aes(x = League, y = Value, fill = League)) +
  # Overlay a violin density and a boxplot with transparency
  geom_violin(show.legend = F, alpha = .55) + 
  geom_boxplot(width = 0.1, show.legend = F, alpha = .75) + 
  # Plot separately by season and for home/away, and custom the axes
  facet_grid(Season ~ Variable) + ylab("") + xlab("") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

While the decline in attendance over the seasons is visually striking, the difference between the distributions of goals scored at home and goals scored away is less so. To get a more precise picture of the evolution of match outcomes over the seasons, the following graph displays the number of matches won by the home team, by the team playing away, and the number of draws, separately for each league and each season.

# Assign the League to the x axis and the outcome of the match to the fill axis
ggplot(data_match, aes(x = League, fill = Winner)) +
  # Bar plot geometry counting the number of each outcome, bars side to side
  geom_bar(stat = "count", position = "dodge", alpha = .85) + 
  # Plot separately by season and custom the axes
  facet_wrap(~Season) + ylab("Number of matches") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

First, it confirms that the home team tends to win more frequently than the team that plays away. But except for the Bundesliga, it is quite clear visually that there is a decline in the difference between the number of matches won by the team that plays home and the number of matches won by the team that plays away, which seems concomitant with the restrictions imposed on attendance in stadiums.

Before estimating formally the relationship between the presence of supporters in the stadium and the probability to win the match, the following graph compares the ratio between home wins and home losses when there are supporters in the stadium and when there is none, separately for each league.

# Generate a binary variable indicating the presence of supporters
data_match <- data_match %>%
  mutate(Public = ifelse(Attendance > 0, "Public", "No public")) 

data_match %>%
  # Do computations separately by league and presence of supporters
  group_by(League, Public) %>%
  # Compute the ratio of home vs. away wins
  summarise(Ratio = sum(Winner == "Home") / sum(Winner == "Away")) %>%
  # Assign the presence of supporters to the x axis, the ratio to the y axis,
  # and the league to the color axis
  ggplot(., aes(x = Public, y = Ratio, fill = League)) +
  # Add a bar geometry with transparency to display the values side to side
  geom_bar(position = "dodge", stat = "identity", show.legend = FALSE, alpha = .85) + 
  # Plot separately for each league and custom the axes
  facet_wrap(~League, nrow = 1) + ylab("Home wins/Home losses ratio") + xlab("")

From this graph it is clear that the ratio between home wins and home losses tends to be higher when there are supporters in the stadium than when there is none, and that holds for the four leagues considered. But to be able to draw clear conclusions on the relationship between the presence of supporters in the stadium and the probability for the home team to win the match, a regression analysis should be carried out.

V. Regression analysis

The equation to estimate is:

\[1\{Winner_m=\text{Home}\}=\alpha+\beta \times1\{Public_m=\text{Yes}\}+\varepsilon_m,\] where for a given match \(m\) the variable \(1\{Winner_m=\text{Home}\}\) takes the value \(1\) if the winning team is the one playing at home and \(0\) otherwise, and the variable \(1\{Public_m=\text{Yes}\}\) takes the value \(1\) if there are supporters in the stadium and \(0\) otherwise. Because the dependent variable is binary, this equation corresponds to a linear probability model, and coefficients have to be interpreted in percentage points of the probability that the home team wins the match. Because the independent variable is binary, the constant \(\alpha\) in this model corresponds to the probability that the home team wins the match when there is no public, and the slope \(\beta\) corresponds to the expected percentage-point change in this probability when there are supporters in the stadium.

# Generate a binary variable that takes the value one if home team won
data_match <- data_match %>%
  mutate(Winner_home = ifelse(Winner == "Home", 1, 0)) 

# Estimate the regression model
stargazer(lm(Winner_home ~ Public, data_match), dep.var.labels = c("Home win"))
Dependent variable:
Home win
PublicPublic 0.059***
(0.016)
Constant 0.395***
(0.012)
Observations 4,237
Adjusted R2 0.003
Note: *p<0.1; **p<0.05; ***p<0.01

According to these results, the presence of supporters in the stadium is associated with an increase of 5.9 percentage points in the probability that the home team wins the match. Given that the probability that the home team wins (rather than loses or draws) is equal to 39.5% when there is no public, this corresponds to an average increase of about 15% in relative terms. Given that the p-values associated with \(\hat{\alpha}\) and \(\hat{\beta}\) are lower than 1%, both estimates are statistically significantly different from 0 at the 99% confidence level.
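The interpretation of \(\hat{\alpha}\) and \(\hat{\beta}\) as a group share and a difference in shares can be verified numerically. The sketch below uses simulated matches rather than the actual data; the sample size and the win probabilities (0.40 and 0.46) are made-up assumptions chosen only for illustration.

```r
# Simulated matches (made-up probabilities, for illustration only)
set.seed(1)
n <- 1000
public <- rep(c(0, 1), each = n / 2)  # 1 if supporters attend, 0 otherwise
home_win <- rbinom(n, size = 1, prob = 0.40 + 0.06 * public)

# Linear probability model: binary outcome on a binary regressor
fit <- lm(home_win ~ public)

# Share of home wins without and with supporters
p_no_public <- mean(home_win[public == 0])
p_public <- mean(home_win[public == 1])

# The intercept equals the share of home wins without supporters, and the
# slope equals the difference between the two shares
all.equal(unname(coef(fit)), c(p_no_public, p_public - p_no_public))
```

With a single binary regressor, ordinary least squares simply reproduces the two group means, which is why the regression table can be read as a comparison of home-win frequencies with and without public.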

The following plot represents the regression line estimated in the previous table. Because both the dependent and the independent variables are binary, each data point can only fall at one of 4 locations on the graph. To facilitate visualization, I use geom_jitter() to introduce some noise in the location of each data point around these 4 possible coordinates. It appears that the number of home wins relative to the number of home losses and draws is indeed lower when there are no supporters in the stadium.

# Assign the dependent and the independent variables to x and y axes
ggplot(data_match, aes(x = Public, y = Winner_home)) +
  # Plot the data points with some noise to avoid overplotting
  geom_jitter(width = .25, height = .25, alpha = .5, color = "#6794A7") +
  # Plot the regression line centered with respect to the data points
  geom_smooth(data = data_match %>% 
                mutate(Public = ifelse(Public == "Public", 1, 0) + 1), 
              aes(x = Public, y = Winner_home), 
              method = "lm", se = F, color = "#014D64") +
  # Custom the axes
  scale_x_discrete(name = "1{Public[m] = Yes}", labels = 0:1) +
  ylab("Probability of a home win") 

VI. Causality assessment

The previous regression table documented a positive and statistically significant relationship between the presence of supporters in the stadium and the probability for the home team to win. These results suggest that supporters have an influence on the outcome of the match, be it directly, e.g., by affecting the motivation of players, or indirectly, e.g., by swaying the decisions of the referees in favor of the home team.

But even though this result provides support for this hypothesis, it is not sufficient to prove the presence of a causal effect. Indeed, there may be other variables, correlated with both the dependent and the independent variable, that drive this relationship. The COVID-19 pandemic may have simultaneously prevented supporters from going to the stadium and changed the conditions for the away team in a favorable way, for instance if the trip to the stadium is less tiring because there is less congestion on the roads due to remote working, or for any other reason. In other words, there may be an omitted variable bias driving part or all of the estimated relationship. Because it is not feasible to control for such variables in the regression, more sophisticated econometric specifications would be required to draw conclusions about the causality of the effect.
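The concern can be made explicit with the textbook omitted-variable-bias formula. As a sketch, suppose that the true model also includes an unobserved variable \(Z_m\), a hypothetical stand-in for, e.g., the travel conditions of the away team (the symbols \(Z_m\), \(\gamma\), and \(u_m\) are introduced here for illustration only), with \(u_m\) uncorrelated with both regressors:

\[1\{Winner_m=\text{Home}\}=\alpha+\beta \times1\{Public_m=\text{Yes}\}+\gamma Z_m+u_m.\]

If \(Z_m\) is omitted from the regression, the estimated slope converges to

\[\text{plim}\,\hat{\beta}=\beta+\gamma\times\frac{\text{Cov}(1\{Public_m=\text{Yes}\},Z_m)}{\text{Var}(1\{Public_m=\text{Yes}\})}.\]

The second term vanishes only if \(Z_m\) has no effect on the outcome (\(\gamma=0\)) or is uncorrelated with the presence of supporters. For instance, if road congestion makes a home win more likely (\(\gamma>0\)) and is higher when supporters are allowed in stadiums (positive covariance), the estimate overstates \(\beta\).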

VII. Robustness

But even if it is not possible to include all the relevant controls, some variables can still be added to the regression to check the robustness of the baseline result. Indeed, if the probability differential with and without supporters can be linked to changes in transport conditions during the pandemic, it could also be linked to the day of the week and the time of day at which the match takes place, as transport conditions may also depend on them. Thus, even though controlling for these variables would not rule out the mechanisms mentioned in the above section, it is important to check that the baseline result is robust to the inclusion of the variables that can be controlled for given the data available. The following table progressively includes the league, the day of the week, and the time of day as controls in the regression. Because the League variable is categorical, I first set its reference category using the relevel() function.

# Set the League variable as factor and its reference category to "Premier League"
data_match <- data_match %>% 
  mutate(League = relevel(as.factor(League), "Premier League"))

# Progressively include control variables in the regression
stargazer(lm(Winner_home ~ Public, data_match), 
          lm(Winner_home ~ Public + League, data_match), 
          lm(Winner_home ~ Public + League + Day, data_match), 
          lm(Winner_home ~ Public + League + Day + Time, data_match),
          dep.var.labels = c("Home win vs. Home loss/Draw"))
Dependent variable:
Home win vs. Home loss/Draw
(1) (2) (3) (4)
PublicPublic 0.059*** 0.060*** 0.060*** 0.060***
(0.016) (0.016) (0.016) (0.016)
LeagueBundesliga -0.012 -0.012 -0.015
(0.022) (0.022) (0.022)
LeagueLa Liga 0.004 0.007 -0.001
(0.021) (0.021) (0.022)
LeagueLigue 1 -0.014 -0.013 -0.023
(0.021) (0.022) (0.023)
DayMon -0.039 -0.040
(0.050) (0.050)
DaySat 0.005 0.019
(0.031) (0.033)
DaySun -0.008 0.008
(0.031) (0.035)
DayThu 0.006 0.009
(0.059) (0.059)
DayTue 0.055 0.057
(0.047) (0.047)
DayWed 0.023 0.026
(0.039) (0.039)
Time 0.004
(0.004)
Constant 0.395*** 0.400*** 0.396*** 0.317***
(0.012) (0.017) (0.034) (0.078)
Observations 4,237 4,237 4,237 4,237
Adjusted R2 0.003 0.003 0.002 0.002
Note: *p<0.1; **p<0.05; ***p<0.01

The baseline coefficient remains virtually unchanged in magnitude with the inclusion of control variables, and is always statistically significantly different from 0 at the 99% confidence level. Thus, the baseline estimate is robust to the inclusion of these three control variables.

Another robustness check can be performed regarding the definition of the dependent variable. Indeed, the regressions estimated so far are about the probability of winning relative to losing or drawing. An alternative definition is to consider the probability of winning relative to losing only, omitting draws. The following table compares the results from the baseline regression using these two possible definitions of the dependent variable.

# Generate an outcome variable that does not account for draws
data_match <- data_match %>%
  mutate(Winner_home2 = ifelse(Winner != "Draw", Winner_home, NA))

# Regress whether the home team won on the presence of supporters for these 
# two definitions of the reference group
stargazer(lm(Winner_home ~ Public, data_match), 
          lm(Winner_home2 ~ Public, data_match), 
          dep.var.labels = c("Home win vs. Home loss/Draw", "Home win vs. Home loss"))
Dependent variable:
Home win vs. Home loss/Draw Home win vs. Home loss
(1) (2)
PublicPublic 0.059*** 0.078***
(0.016) (0.018)
Constant 0.395*** 0.529***
(0.012) (0.014)
Observations 4,237 3,170
Adjusted R2 0.003 0.006
Note: *p<0.1; **p<0.05; ***p<0.01

Using this alternative definition, the presence of supporters is again associated with a higher probability of winning for the home team. Even with no public, however, the home team remains more likely to win than the team playing away: its probability of winning is 52.9%, about 3 percentage points above even odds (\(\hat{\alpha}>50\%\)). The coefficient of interest is statistically significantly different from 0 at the 99% confidence level for both variable definitions. In terms of magnitude, the coefficients from the two definitions cannot be compared directly because those of the second definition are mechanically inflated by the omission of draws, but the ratio of the effect of public in the stadium on the probability to win, relative to the probability to win when there is no public, is very similar in the two cases (\(\frac{0.059}{0.395}=0.1494\approx0.1475=\frac{0.078}{0.529}\)). It is thus reasonable to conclude that this result is also robust to variations in the definition of the reference category of the outcome variable.

VIII. Heterogeneity

But the fact that the coefficient is robust does not mean that it is homogeneous. To investigate whether the relationship differs from one league to another, the independent variable of interest should be interacted with the League variable. Combined with the League controls, this yields the same point estimates as estimating the regression separately for each league.

# Progressively control and interact with League in the regression
stargazer(lm(Winner_home ~ Public, data_match),
          lm(Winner_home ~ Public + League, data_match),
          lm(Winner_home ~ Public + League + Public * League, data_match),
          dep.var.labels = c("Home win"))
Dependent variable:
Home win
(1) (2) (3)
PublicPublic 0.059*** 0.060*** 0.074**
(0.016) (0.016) (0.030)
LeagueBundesliga -0.012 0.008
(0.022) (0.035)
LeagueLa Liga 0.004 0.019
(0.021) (0.032)
LeagueLigue 1 -0.014 -0.016
(0.021) (0.034)
PublicPublic:LeagueBundesliga -0.032
(0.045)
PublicPublic:LeagueLa Liga -0.025
(0.042)
PublicPublic:LeagueLigue 1 0.002
(0.044)
Constant 0.395*** 0.400*** 0.392***
(0.012) (0.017) (0.023)
Observations 4,237 4,237 4,237
Adjusted R2 0.003 0.003 0.002
Note: *p<0.1; **p<0.05; ***p<0.01

Column (3) shows that the coefficient of interest for the reference category, Premier League, amounts to 7.4 percentage points and is statistically different from 0 at the 95% confidence level. The difference between the effect in Premier League and that in other leagues ranges from -3.2 percentage points (i.e., an effect of 4.2 percentage points, for Bundesliga) to 0.2 percentage points (i.e., an effect of 7.6 percentage points, for Ligue 1). Yet, because the coefficients associated with the interaction terms are not significant, we cannot conclude that these league-specific effects are statistically different from that in Premier League. In other words, there is no evidence of heterogeneity of the effect across leagues.
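The league-specific effects quoted above follow from adding each interaction coefficient to the reference effect. The short sketch below reproduces this arithmetic from the rounded coefficients reported in column (3); the object names are hypothetical and the results inherit the rounding of the table.

```r
# Rounded coefficients reported in column (3) of the table above
beta_premier_league <- 0.074
interactions <- c("Bundesliga" = -0.032, "La Liga" = -0.025, "Ligue 1" = 0.002)

# League-specific effect = effect in the reference league + interaction term
league_effects <- c("Premier League" = beta_premier_league,
                    beta_premier_league + interactions)
round(league_effects, 3)
```

This gives 0.074 for Premier League, 0.042 for Bundesliga, 0.049 for La Liga, and 0.076 for Ligue 1.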

IX. Conclusion

In this analysis I use data on football matches in Premier League, Ligue 1, La Liga, and Bundesliga, from season 2018-2019 to season 2020-2021, to investigate the relationship between the presence of supporters in the stadium and the probability for the home team to win the match. The estimation of this relationship relies on the fact that the COVID-19 pandemic prevented supporters from going to the stadium, such that the outcomes of these matches can be compared to those of matches played in regular conditions. Graphical evidence indeed shows a clear and sudden drop to 0 attendance in stadiums, concomitant with the pandemic, right after March 2020.

Results show that the presence of supporters in the stadium is associated with an increase of 5.9 percentage points in the probability that the home team wins the match. Yet, this result may not be interpreted as causal if the COVID-19 pandemic has simultaneously prevented supporters from going to the stadium and changed the conditions for the away team relative to the conditions for the home team. In addition, the external validity of the result is not guaranteed, as it is estimated using four European football leagues only. Still, the estimated coefficient appears to be robust to controlling for the league, the day of the week, and the time of day, as well as to changes in the definition of the outcome variable, and results show no evidence of heterogeneity of the effect across leagues.

References

Dowie, J. (1982). Why Spain should win the world cup. New Scientist, 94(10), 693-695.

Greer, D. L. (1983). Spectator booing and the home advantage: A study of social influence in the basketball arena. Social Psychology Quarterly, 252-261.

Loughead, T. M., Carron, A. V., Bray, S. R., & Kim, A. J. (2003). Facility familiarity and the home advantage in professional sports. International Journal of Sport and Exercise Psychology, 1(3), 264-274.

Pollard, R. (1986). Home advantage in soccer: A retrospective analysis. Journal of Sports Sciences, 4(3), 237-248.

Pollard, R., Silva, C. D., & Medeiros, N. C. (2008). Home advantage in football in Brazil: differences between teams and the effects of distance traveled. Revista Brasileira de Futebol (The Brazilian Journal of Soccer Science), 1(1), 3-10.