class: center, middle, inverse, title-slide # Maps and geolocalized data ## Lecture 14 ###
Louis SIRUGUE ### CPES 2 - Fall 2022 --- <style> .left-column {width: 65%;} .right-column {width: 35%;} </style> ### Quick reminder #### Causal approach (Behaghel et al., 2015) * Applicants resumes randomly anonymized or not before being sent to employers -- `$$Y_i = \alpha + \beta D_i +\gamma An_i + \delta D_i\times An_i + \varepsilon_i$$` * `\(\hat{\delta}\)` captures how the difference in interview rates between the minority and the majority differs between the treated and the control employers -- ```r summary(lm(interview ~ minority + treatment + minority*treatment, data_rct, weights = weight))$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.11638530 0.01630149 7.1395491 1.575140e-12 ## minority -0.02365790 0.02368346 -0.9989208 3.180243e-01 ## treatment 0.06101349 0.02419977 2.5212424 1.181630e-02 ## minority:treatment -0.10670915 0.03479712 -3.0666092 2.210982e-03 ``` <p style = "margin-bottom:1cm;"></p> <center><h4> ➜ Self-selection issue: discriminatory employers did not enter the program </h4></center> --- ### Quick reminder #### Correlational approach (Chetty et al., 2014) `$$\text{percentile}(y^c_i) = \alpha + \beta_{RRC}\text{percentile}(y^p_i)+\varepsilon_i$$` -- .left-column[ <center><img src = "local_cef.png" width = "520"/></center> ] -- .right-column[ **Relative mobility:** `\(\widehat{\beta_{RRC}}\)` **Absolute mobility:** `\(\widehat{\alpha} + 25\times\widehat{\beta_{RRC}}\)` <p style = "margin-bottom:1cm;"></p> <ul> <li>Strong persitence in the United-States</li> <li>Large variations across commuting zones</li> <li>Intergenerational mobility correlated with characteristics of childhood environment</li> </ul> ] --- ### Quick reminder #### Structural approach (Nerlove, 1963) * **Theoretical modeling** `$$\begin{cases} \text{min } & C = p_LL + p_KK\\ \text{s.t. } & Y = A L^\lambda K^\kappa u \end{cases} \,\, \Longleftrightarrow \,\, \text{min }\,\mathcal{L} = p_LL + p_KK + \mu(Y - A L^\lambda K^\kappa u)$$` -- <p style = "margin-bottom:1cm;"></p> * **Regression expression** `$$\log (C) = \underbrace{\log \Bigg(\left[\frac{\big(\frac{\lambda}{\kappa}\big)^\kappa +\big(\frac{\kappa}{\lambda}\big)^\lambda}{A}\right]^\frac{1}{\lambda + \kappa} \Bigg)}_{\alpha} + \underbrace{\frac{1}{\lambda + \kappa}}_{\beta}\log(Y) + \underbrace{\frac{\lambda}{\lambda+\kappa}}_{\gamma}\log(p_L) + \underbrace{\frac{\kappa}{\lambda+\kappa}}_{\delta}\log(p_K) + \underbrace{\log \bigg(u^\frac{1}{\lambda+\kappa}\bigg)}_{\varepsilon}$$` -- <p style = "margin-bottom:1cm;"></p> * **Estimation** `$$\log(C)= \alpha + \beta\log(Y) + \gamma\log(p_L) + \delta\log(p_K) + \varepsilon \,\,\,\, \Rightarrow \,\,\,\, H_0: \gamma + \delta = 1$$` --- <h3>Today: Maps and geolocalized data</h3> -- <p style = "margin-bottom:2cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Geolocalized data</b></li> <ul style = "list-style: none"> <li>1.1. Shapefiles and rasters</li> <li>1.2. Opening geolocalized data</li> <li>1.3. Coordinate Reference Systems</li> <li>1.4. Subsetting geolocalized data</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Geographic variables</b></li> <ul style = "list-style: none"> <li>2.1. Import from csv</li> <li>2.2. Zonal statistics</li> <li>2.3. Centroids and distance</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"><li><b>3. Wrap up!</b></li></ul> ] --- <h3>Today: Maps and geolocalized data</h3> <p style = "margin-bottom:2cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Geolocalized data</b></li> <ul style = "list-style: none"> <li>1.1. Shapefiles and rasters</li> <li>1.2. Opening geolocalized data</li> <li>1.3. Coordinate Reference Systems</li> <li>1.4. Subsetting geolocalized data</li> </ul> </ul> ] --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters ➜ The <b>shapefile</b> is a format for storing <b>geographic location</b> and associated attribute information <p style = "margin-bottom:1cm;"></p> -- .left_column[ <ul> <li>Out-software it <b>can seem quite complicated:</b></li> <ul> <li>There are several files for a single dataset</li> <li>Always the necessary .shp, .shx, and .dbf</li> <li>Sometimes other files such as .prj, .cpg, etc.</li> </ul> </ul> ] .right-column[ <p style = "margin-bottom:-4.5cm;"></p> <center><img src = "out_software.png" width = "150"/></center> ] -- <ul> <li>But in-software, it is <b>not so far from what we're used to:</b></li> <ul> <li>There is one line per geographic entity</li> <li>And one column per variable about these entities</li> <li>But there is one non-standard variable: the <b>geometry</b></li> </ul> </ul> -- <ul> <li>The geometry can be:</li> <ul> <li><b>Points</b> at given coordinates: location of weather stations/of a phone</li> <li><b>Polylines</b> that link sets of points: rivers/roads</li> <li><b>Polygons</b> that link sets of points to form a closed shape: countries/land parcels</li> </ul> </ul> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * For instance consider a shapefile * With two observations * And a polygon geometry -- <img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * The shapefile can also contain **attribute variable** * For instance the **median income** of residents of the area delimited by the polygon * We can plot this attribute variable using the fill aesthetic of the polygon geometry -- <img src="slides_files/figure-html/unnamed-chunk-5-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * The <b>raster</b> is another type of format for storing geolocalized data -- <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>It works <b>like a picture</b>, with cells like pixels</li></ul></ul> <p style = "margin-bottom:1.22cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-6-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * The <b>raster</b> is another type of format for storing geolocalized data <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>It works <b>like a picture</b>, with cells like pixels</li></ul></ul> <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>And each cell can take a given value, e.g. pollution observed from satellites</li></ul></ul> <img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * The <b>raster</b> is another type of format for storing geolocalized data <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>It works <b>like a picture</b>, with cells like pixels</li></ul></ul> <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>And each cell can take a given value, e.g. pollution observed from satellites</li></ul></ul> <img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * Today we're gonna use this to estimate the relationship between <b>income and exposure to air pollution</b> -- <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>Using <b>raster</b> satellite data on pollution level</li></ul></ul> <p style = "margin-bottom:1.22cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.1. Shapefiles and rasters * Today we're gonna use this to estimate the relationship between <b>income and exposure to air pollution</b> <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>Using <b>raster</b> satellite data on pollution level</li></ul></ul> <p style = "margin-bottom:-.55cm;"></p> <ul><ul><li>And a <b>shapefile</b> to compute the average exposure in different location and merge it with income data</li></ul></ul> <img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.2. Opening geolocalized data * We can read shapefiles in R using the `read_sf()` function from the `sf` package (data from [IGN](https://geoservices.ign.fr/adminexpress)) -- ```r library(sf) dep_shp <- read_sf("data/dep_shp/DEPARTEMENT.shp") head(dep_shp, 5) ``` -- ``` ## Simple feature collection with 5 features and 5 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: 0.06558525 ymin: 45.23103 xmax: 7.621946 ymax: 50.0722 ## Geodetic CRS: WGS 84 ## # A tibble: 5 x 6 ## ID NOM_DEP_M NOM_DEP INSEE_DEP INSEE_REG geometry ## <chr> <chr> <chr> <chr> <chr> <MULTIPOLYGON [°]> ## 1 DEPARTEM00000~ JURA Jura 39 27 (((5.442842 46.84592, 5.~ ## 2 DEPARTEM00000~ SEINE-MA~ Seine-~ 76 28 (((0.3875418 49.77178, 0~ ## 3 DEPARTEM00000~ YONNE Yonne 89 27 (((3.126677 47.99127, 3.~ ## 4 DEPARTEM00000~ LOIRE Loire 42 84 (((3.831716 45.99944, 3.~ ## 5 DEPARTEM00000~ HAUT-RHIN Haut-R~ 68 44 (((6.923645 47.95226, 6.~ ``` --- ### 1. Geolocalized data #### 1.2. Opening geolocalized data * We can then view the data using the `geom_sf()` geometry * It will understand that the `geometry` variable contains the coordinates of the polygons to be plotted -- ```r library(tidyverse) ggplot(dep_shp) + geom_sf(fill = "#6794A7", color = "#014D64", alpha = .6) + theme_void() ``` -- <img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="45%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.2. Opening geolocalized data * There are several **formats of raster** files (.tif, .nc, ...) * We're gonna use **NetCDF** satellite data on PM<sub>2.5</sub> from the [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datasets/surface-pm2-5/) * So we're gonna need the packages `raster` and `ncdf4` -- ```r library(raster) select <- dplyr::select # The raster packages has an overriding select function library(ncdf4) ``` -- <p style = "margin-bottom:-.55cm;"></p> ```r pm_data <- raster("data/acag_2016.nc") pm_data ``` <p style = "margin-bottom:-.55cm;"></p> .left-column[ ``` ## class : RasterLayer ## dimensions : 2300, 5000, 11500000 (nrow, ncol, ncell) ## resolution : 0.01, 0.009999999 (x, y) ## extent : -15, 35, 36, 59 (xmin, xmax, ymin, ymax) ## crs : +proj=longlat +datum=WGS84 +no_defs ## source : acag_2016.nc ## names : Hybrid.PM_2_._5...mug.m.3. ## zvar : GWRPM25 ``` ] -- .right-column[ <ul> <li>As you can see NetCDFs are complex objects:</li> <ul> <li><b>Not a simple table</b> with figures</li> <li>Note that the PM<sub>2.5</sub> variable is Hybrid.PM_2_._5...mug.m.3.</li> </ul> </ul> ] --- ### 1. Geolocalized data #### 1.2. Opening geolocalized data * We can **rename** the PM<sub>2.5</sub> variable into something more convenient ```r names(pm_data) <- "pm2.5" ``` -- * And convert the data into a standard dataset using the `rasterToPoints()` function * It will **generate a table** with 1 row per cell * An x variable for the longitude, y for latitude, and pm2.5 for pollution -- ```r head(as_tibble(rasterToPoints(pm_data)), 5) ``` ``` ## # A tibble: 5 x 3 ## x y pm2.5 ## <dbl> <dbl> <dbl> ## 1 -3.38 59.0 4.60 ## 2 -3.37 59.0 4.60 ## 3 -3.36 59.0 4.60 ## 4 -3.35 59.0 4.70 ## 5 -3.34 59.0 4.70 ``` --- ### 1. Geolocalized data #### 1.2. Opening geolocalized data * This table format allows to easily **plot raster data** using the `geom_tile` geometry: <p style = "margin-bottom:1.25cm;"></p> -- ```r library(viridis) # for nice color gradient ggplot(as_tibble(rasterToPoints(pm_data)), # Data converted in table aes(x = x, # Longitude on the x-axis y = y, # Latitude on the y-axis fill = pm2.5)) + # Pollution on the fill-axis geom_tile() + # One filled square per cell scale_fill_viridis(option = "B", # Nice color gradient direction = -1) + # The darker the more polluted theme_void() # Remove axes and grid ``` <p style = "margin-bottom:1.25cm;"></p> -- * This is gonna plot the 2,300 x 5,000 = 11,500,000 cells * With a color that is proportional to the value of `pm2.5` * Like the pixels of picture --- ### 1. Geolocalized data #### 1.2. Opening geolocalized data <img src="slides_files/figure-html/unnamed-chunk-21-1.png" width="65%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems <ul> <li>You might have noticed that France does not look the same on the two graphs</li> <ul> <li>This is because we have not <b>reprojected the data</b> yet</li> <li>But this is the <b>first thing to do</b> when working with geolocalized data</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>Right now our two maps may not be in the same CRS</li> <ul> <li>CRS stands for <b>Coordinate Reference System</b></li> <li>It is a model of the Earth in which each location is coded using degrees</li> <li>WGS84 is the standard CRS used by GPS, Google maps, ...</li> <li>But it is <b>not suited to all applications</b></li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>To visualize spatial data, we need to <b>project the surface of the globe on a plane</b></li> <ul> <li>There is <b>no correct way of doing that</b></li> <li>Like you cannot flatten an orange peel without distorting it</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <center><h4><i>➜ There is a tradeoff between shape and scale preservation</i></h4></center> --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems * For instance the **Mercator projection** preserves shape but distorts scale: <p style = "margin-bottom:-.58cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-22-1.png" width="45%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems * While the **Equal-Area Cylindrical projection** preserves scale but distorts shape: <p style = "margin-bottom:1.5cm;"></p> <img src="slides_files/figure-html/unnamed-chunk-23-1.png" width="90%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems * This is also the case for the **Mollweide projection:** <img src="slides_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems * But most projections are in between, like the **Robin projection:** <img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="75%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems <ul> <li>The <b>smaller</b> the <b>area</b> you want to map the <b>less</b> it is a <b>problem</b></li> <ul> <li>While you would have a hard time flatten the whole orange peel</li> <li>Flattening a tiny bit of orange peel wouldn't require to distort it too much</li> </ul> </ul> -- <ul> <li>Even though there's <b>no perfect projection</b>, some are more suited to specific regions</li> <ul> <li>You wouldn't use the Mercator projection if you focus on the poles</li> </ul> </ul> -- <ul> <li>The website <a href="https://epsg.io/?q=France">epsg.io</a> allows you to <b>find the appropriate CRS</b> for the area you want to map</li> <ul> <li>For <b>France</b>, the recommended projection is <b>Lambert 93</b>, the corresponding EPSG code is 2154</li> <li>The most common projection is WGS84, the corresponding EPSG code is 4326</li> <li>What is important is that all the datasets are projected the same way</li> </ul> </ul> -- .pull-left[ * For shapefile: `st_transform()` ```r dep_shp <- st_transform(dep_shp, "EPSG:4326") ``` ] .pull-right[ * For raster: `crs()` ```r crs(pm_data) <- "EPSG:4326" ``` ] --- ### 1. Geolocalized data #### 1.3. Coordinate Reference Systems * Data projected in WGS84: .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data * To keep only metropolitan France in the shapefile, we can filter 2-digit department codes -- ```r dep_shp <- dep_shp %>% filter(nchar(INSEE_DEP) == 2) ``` -- <img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="35%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data * Then, we can drop from the raster everything outside the longitude/latitude frame of our shapefile -- ```r pm_data <- crop(pm_data, extent(dep_shp)) ``` -- <img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="40%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data * We can then replace everything that is not in a polygon of the shapefile by missing values -- ```r pm_data <- mask(pm_data, dep_shp) ``` -- <img src="slides_files/figure-html/unnamed-chunk-35-1.png" width="40%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data * The two datasets can now be overlaid: ```r ggplot() + geom_tile(data = as_tibble(rasterToPoints(pm_data)), aes(x = x, y = y, fill = pm2.5)) + geom_sf(data = dep_shp, fill = NA, color = alpha("grey20", .6)) + scale_fill_viridis(option = "A", direction = -1) + theme_void() ``` -- <img src="slides_files/figure-html/unnamed-chunk-37-1.png" width="30%" style="display: block; margin: auto;" /> --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data * We do not see much variation... Let's take a look at the PM<sub>2.5</sub> distribution -- ```r subset_data <- as_tibble(rasterToPoints(pm_data)) ggplot(subset_data, aes(x = pm2.5)) + geom_density(fill = "#6794A7", color = "#014D64", alpha = .6) ``` -- .left-column[ <img src="slides_files/figure-html/unnamed-chunk-39-1.png" width="80%" style="display: block; margin: auto;" /> ] -- .right-column[ <ul style = "margin-top:1cm;"> <li style = "margin-left:-1.1cm;">There are <b>few</b> very high and very low <b>values that stretch the scale</b></li> </ul> <p style = "margin-bottom:1cm;"></p> <i>➜ To better visualize the variations, we can <b>discretize the variable into deciles</b></i> ] --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data <ul> <li>To convert a continuous into deciles we should:</li> <ol> <li>Arrange the values in ascending order</li> <li>Compute the ratio of their index over N to get something \(\in (0; 1]\)</li> <li>Multiply by the desired number of quantiles and take the ceiling</li> <li>Set as factor so that R does not interpret it as continuous</li> </ol> </ul> -- ```r subset_data <- subset_data %>% arrange(pm2.5) %>% mutate(decile = as.factor(ceiling(10 * row_number()/n()))) ``` -- <p style = "margin-bottom:.75cm;"></p> * And plot the deciles instead of the continuous values ```r ggplot() + geom_tile(data = subset_data, aes(x = x, y = y, fill = decile)) + geom_sf(data = dep_shp, fill = NA, color = alpha("grey20", .6)) + scale_fill_viridis_d(option = "A", direction = -1) + theme_void() ``` --- ### 1. Geolocalized data #### 1.4. Subsetting geolocalized data <img src="slides_files/figure-html/unnamed-chunk-42-1.png" width="48%" style="display: block; margin: auto;" /> --- class: inverse, hide-logo ### Practice <p style = "margin-bottom:.5cm;"></p> #### First, make sure we're on the same page ```r # Load packages library(tidyverse) library(viridis) library(sf) library(raster) select <- dplyr::select library(ncdf4) # Import data dep_shp <- read_sf("data/dep_shp/DEPARTEMENT.shp") pm_data <- raster("data/acag_2016.nc") # Reproject data dep_shp <- st_set_crs(dep_shp, "EPSG:4326") crs(pm_data) <- "EPSG:4326" # Rename PM 2.5 layer names(pm_data) <- "pm2.5" ``` --- class: inverse, hide-logo ### Practice <p style = "margin-bottom:2.5cm;"></p> #### 1) Filter only the Île-de-France region both in the shapefile and the raster *Hint: The Île-de-France geographic code is 11* -- <p style = "margin-bottom:1.5cm;"></p> #### 2) Plot the PM<sub>2.5</sub> concentration in Île-de-France and overlay the department shapefile <p style = "margin-bottom:3.5cm;"></p> -- <center><h3><i>You've got 8 minutes!</i></h3></center>
−
+
08
:
00
--- class: inverse, hide-logo ### Solution #### 1) Filter only the Île-de-France region both in the shapefile and the raster -- ```r # Filter only departments of Île-de-France in shapefile dep_shp <- dep_shp %>% filter(INSEE_REG == 11) ``` -- <p style = "margin-bottom:-.8cm;"></p> ```r # Drop fromraster everything outside the lon/lat frame of the shapefile pm_data <- crop(pm_data, extent(dep_shp)) ``` -- <p style = "margin-bottom:-.8cm;"></p> ```r # Replace everything that is not in a polygon of the shapefile by NA pm_data <- mask(pm_data, dep_shp) ``` -- #### 2) Plot the PM<sub>2.5</sub> concentration in Île-de-France and overlay the department shapefile ```r ggplot() + ``` -- <p style = "margin-bottom:-.8cm;"></p> ```r # Plot the raster geom_tile(data = as_tibble(rasterToPoints(pm_data)), aes(x = x, y = y, fill = pm2.5)) + ``` -- <p style = "margin-bottom:-.8cm;"></p> ```r # Plot the shapefile geom_sf(data = dep_shp, fill = NA) + ``` -- <p style = "margin-bottom:-.8cm;"></p> ```r # Custom scale scale_fill_viridis(option = "B") + theme_void() ``` --- class: inverse, hide-logo ### Solution <img src="slides_files/figure-html/unnamed-chunk-51-1.png" width="65%" style="display: block; margin: auto;" /> --- <h3>Overview</h3> <p style = "margin-bottom:2cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Geolocalized data ✔</b></li> <ul style = "list-style: none"> <li>1.1. Shapefiles and rasters</li> <li>1.2. Opening geolocalized data</li> <li>1.3. Coordinate Reference Systems</li> <li>1.4. Subsetting geolocalized data</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Geographic variables</b></li> <ul style = "list-style: none"> <li>2.1. Import from csv</li> <li>2.2. Zonal statistics</li> <li>2.3. Centroids and distance</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"><li><b>3. Wrap up!</b></li></ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:2cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Geolocalized data ✔</b></li> <ul style = "list-style: none"> <li>1.1. Shapefiles and rasters</li> <li>1.2. Opening geolocalized data</li> <li>1.3. Coordinate Reference Systems</li> <li>1.4. Subsetting geolocalized data</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Geographic variables</b></li> <ul style = "list-style: none"> <li>2.1. Import from csv</li> <li>2.2. Zonal statistics</li> <li>2.3. Centroids and distance</li> </ul> </ul> ] --- ### 2. Geographic variables #### 2.1. Import from csv <ul> <li>Geographic variables can simply be stored in csv:</li> <ul> <li>The Gini index of inequality of countries</li> <li>The unemployment rate of departments</li> <li>The number of inhabitants of cities</li> <li>...</li> </ul> </ul> -- <ul> <li>In this case you can simply</li> <ul> <li>Import the csv data you need</li> <li>Join it to the shapefile as you would do with any other data</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <center><i><b>➜ Let's study the relationship between income and exposure to air pollution at the city level in Île-de-France</b></i></center> <p style = "margin-bottom:1cm;"></p> -- <ul> <li>We need:</li> <ol> <li>To import the shapefile of cities in Île-de-France</li> <li>To import and join median income by city</li> </ol> </ul> --- ### 2. Geographic variables #### 2.1. Import from csv * Import shapefile of cities in Île-de-France and project it in WGS84 -- ```r idf_shp <- read_sf("data/idf_shp/idf.shp") idf_shp <- st_set_crs(idf_shp, "EPSG:4326") head(idf_shp, 5) ``` -- ``` ## Simple feature collection with 5 features and 5 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: 1.643507 ymin: 48.36067 xmax: 2.365918 ymax: 49.17332 ## Geodetic CRS: WGS 84 ## # A tibble: 5 x 6 ## NOM_COM INSEE_COM INSEE_DEP INSEE_REG POPULATION geometry ## <chr> <chr> <chr> <chr> <int> <MULTIPOLYGON [°]> ## 1 Haute-Isle 95301 95 11 278 (((1.691893 49.07523, 1.~ ## 2 Ambleville 95011 95 11 372 (((1.691683 49.12863, 1.~ ## 3 Chalou-Mou~ 91131 91 11 431 (((2.043785 48.36067, 2.~ ## 4 Morsang-su~ 91434 91 11 20909 (((2.36231 48.64807, 2.3~ ## 5 Vienne-en-~ 95656 95 11 416 (((1.74087 49.07267, 1.7~ ``` --- ### 2. Geographic variables #### 2.1. Import from csv <ul> <li>Join csv data on median income by city and plot together with department shapefile</li> <ul> <li>We shall make sure that the <b>join variable(s)</b> have the <b>same name and class</b></li> </ul> </ul> -- ```r insee_data <- as_tibble(read.csv("data/insee_2016.csv")) head(insee_data, 1) ``` ``` ## # A tibble: 1 x 3 ## INSEE_COM CITY MED16 ## <int> <chr> <dbl> ## 1 75056 Paris 26.8 ``` -- <p style = "margin-bottom:.75cm;"></p> * Same name but not same class -- ```r idf_shp <- idf_shp %>% mutate(INSEE_COM = as.numeric(INSEE_COM)) %>% left_join(insee_data) ggplot() + geom_sf(data = idf_shp, aes(fill = MED16), color = alpha("grey20", .4)) + geom_sf(data = dep_shp, fill = NA, color = alpha("grey20", .6), size = 1.2) + scale_fill_viridis(option = "B") + theme_void() ``` --- ### 2. Geographic variables #### 2.1. Import from csv <img src="slides_files/figure-html/unnamed-chunk-56-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.2. Zonal statistics * Right now we have pollution at the cell level in our raster -- <img src="slides_files/figure-html/unnamed-chunk-57-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.2. Zonal statistics * And a shapefile delimiting the cities in which we want to know the pollution level <img src="slides_files/figure-html/unnamed-chunk-58-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.2. Zonal statistics <ul> <li>This is the principle of <b>zonal statistics:</b></li> <ul> <li>Compute <b>statistics on areas delimited by a shapefile from values of a raster</b></li> <li>We want the average PM<sub>2.5</sub> value of the cells that fall within the municipality geometry</li> <li>With our shapefile and raster that cover the same area and are projected the same way</li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> <ul> <li>In R zonal statistics can be computed with the function <b>extract()</b></li> <ul> <li><b>x:</b> the data containing the values to compute statistics from</li> <li><b>y:</b> the data delimiting the zones in which to compute statistics</li> <li><b>FUN:</b> the statistic to compute (min, max, median, mean, ...) </li> </ul> </ul> -- <p style = "margin-bottom:1cm;"></p> ```r idf_shp <- idf_shp %>% mutate(pm2.5 = extract(x = pm_data, y = ., fun = mean)) ggplot() + geom_sf(data = idf_shp, aes(fill = pm2.5), color = alpha("grey20", .4)) + geom_sf(data = dep_shp, fill = NA, color = alpha("grey20", .6), size = 1.2) + scale_fill_viridis(option = "B") + theme_void() ``` --- ### 2. Geographic variables #### 2.2. Zonal statistics <img src="slides_files/figure-html/unnamed-chunk-60-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.2. Zonal statistics .pull-left[ <p style = "margin-bottom:1.25cm;"></p> <ul> <li>We now have in the same dataset:</li> <ul> <li>Average exposure to PM<sub>2.5</sub></li> <li>Median income</li> <li>Both at the city level</li> </ul> </ul> <p style = "margin-bottom:1.5cm;"></p> <center><h4><i>➜ Let's do the regression</i><h4></center> <p style = "margin-bottom:1.5cm;"></p> ```r stargazer(lm(pm2.5 ~ MED16, idf_shp), type = "text", keep.stat = c("n", "rsq")) ``` ] -- .pull-right[ ``` ## ## ======================================== ## Dependent variable: ## --------------------------- ## pm2.5 ## ---------------------------------------- ## MED16 -0.051*** ## (0.007) ## ## Constant 13.037*** ## (0.183) ## ## ---------------------------------------- ## Observations 1,250 ## R2 0.040 ## ======================================== ## Note: *p<0.1; **p<0.05; ***p<0.01 ``` ] --- ### 2. Geographic variables #### 2.3. Centroids and distances <center><i><b>Results indicate that at the city level in Île-de-France, an increase of 1,000 euros is associated</b></i></center> <p style = "margin-bottom:-.65cm;"></p> <center><i><b>with a reduction of 0.05 μg/m<sup>3</sup> in the concentration of PM<sub>2.5</sub>, ceteris paribus</b></i></center> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>But could this <b>effect</b> just be due to the <b>distance to Paris?</b></li> <ul> <li>Maybe richer individuals tend to live away from the urban center</li> <li>And thus to be less exposed to pollution</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> <ul> <li>We need to <b>control</b> for the distance to Paris</li> <ul> <li>To do so we can find the location of the <b>centroid</b> of each municipality with st_centroid()</li> <li>It corresponds to the arithmetic mean position of all the points in the polygon</li> </ul> </ul> -- <p style = "margin-bottom:.75cm;"></p> ```r idf_shp <- idf_shp %>% mutate(centroid = st_centroid(geometry)) ggplot() + geom_sf(data = idf_shp$geometry, fill = "#6794A7", color = "#014D64", alpha = .6) + geom_sf(data = idf_shp$centroid, color = "#014D64") + theme_void() ``` --- ### 2. Geographic variables #### 2.3. Centroids and distances <img src="slides_files/figure-html/unnamed-chunk-64-1.png" width="55%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.3. Centroids and distances <ul><li>Note that the centroid of a polygon can be outside the polygon</li></ul> -- <p style = "margin-bottom:-.5cm;"></p> <ul><ul><li>Take for instance the municipality Les Ulis:</li></ul></ul> <img src="slides_files/figure-html/unnamed-chunk-65-1.png" width="45%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.3. Centroids and distances * To compute distances we should first convert the centroid variable into a longitude and a latitude variable -- ```r idf_shp <- idf_shp %>% group_by(INSEE_COM) %>% mutate(cent_lon = unlist(centroid)[1], cent_lat = unlist(centroid)[2]) ``` -- * And store the coordinates of Paris ```r paris <- idf_shp %>% filter(NOM_COM == "Paris") %>% select(cent_lon, cent_lat) %>% st_drop_geometry() # Specific function to drop geometry paris ``` ``` ## # A tibble: 1 x 2 ## cent_lon cent_lat ## * <dbl> <dbl> ## 1 2.34 48.9 ``` --- ### 2. Geographic variables #### 2.3. Centroids and distances * We can now compute the distance between each centroid and that of Paris using `geodist_vec()` -- ```r library(geodist) idf_shp <- idf_shp %>% mutate(dist_paris = geodist_vec(x1 = cent_lon, y1 = cent_lat, x2 = paris$cent_lon, y2 = paris$cent_lat, measure = "geodesic") / 1000) ``` <p style = "margin-bottom:.7cm;"></p> -- * We can plot this new variable ```r ggplot() + geom_sf(data = idf_shp, aes(fill = dist_paris), color = alpha("grey20", .4)) + geom_sf(data = dep_shp, fill = NA, color = alpha("grey20", .6), size = 1.2) + scale_fill_viridis(option = "B") + theme_void() ``` <p style = "margin-bottom:.7cm;"></p> -- * And add it in the regression ```r stargazer(lm(pm2.5 ~ MED16 + dist_paris, idf_shp), type = "text", keep.stat = c("n", "rsq")) ``` --- ### 2. Geographic variables #### 2.3. Centroids and distances <img src="slides_files/figure-html/unnamed-chunk-71-1.png" width="60%" style="display: block; margin: auto;" /> --- ### 2. Geographic variables #### 2.3. Centroids and distances .pull-left[ <p style = "margin-bottom:-.7cm;"></p> ``` ## ## ======================================== ## Dependent variable: ## --------------------------- ## pm2.5 ## ---------------------------------------- ## MED16 -0.092*** ## (0.005) ## ## dist_paris -0.041*** ## (0.001) ## ## Constant 15.749*** ## (0.131) ## ## ---------------------------------------- ## Observations 1,250 ## R2 0.620 ## ======================================== ## Note: *p<0.1; **p<0.05; ***p<0.01 ``` ] -- .pull-right[ <p style = "margin-bottom:.75cm;"></p> <ul> <li style = "margin-left:-.5cm;">Once controlling for distance to Paris, the coefficients associated with median income is even more negative</li> <ul> <li style = "margin-left:-.5cm;">Distance to Paris has opposite relationships with pollution and median income</li> <li style = "margin-left:-.5cm;">Richer municipalities tend to be closer to Paris</li> <li style = "margin-left:-.5cm;">But also to be less polluted than poorer municipalities</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul> <li style = "margin-left:-.5cm;">In Île-de-France the relationship between exposure to PM<sub>2.5</sub> and median income (in thousand euros) is even stronger than with distance to Paris (in km)</li> <ul> <li style = "margin-left:-.5cm;">Both coefficients are significant at 99% confidence level</li> </ul> </ul> ] --- <h3>Overview</h3> <p style = "margin-bottom:2cm;"></p> .pull-left[ <ul style = "margin-left:1.5cm;list-style: none"> <li><b>1. Geolocalized data ✔</b></li> <ul style = "list-style: none"> <li>1.1. Shapefiles and rasters</li> <li>1.2. Opening geolocalized data</li> <li>1.3. Coordinate Reference Systems</li> <li>1.4. Subsetting geolocalized data</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"> <li><b>2. Geographic variables ✔</b></li> <ul style = "list-style: none"> <li>2.1. Import from csv</li> <li>2.2. Zonal statistics</li> <li>2.3. Centroids and distance</li> </ul> </ul> <p style = "margin-bottom:1cm;"></p> <ul style = "margin-left:1.5cm;list-style: none"><li><b>3. Wrap up!</b></li></ul> ] --- ### 3. Wrap up! #### Shapefiles and rasters <center><b>Two main types of geolocalized datasets:</b></center> -- <p style = "margin-bottom:.5cm;"></p> .pull-left[ <center><b>Shapefiles</b></center> <ul> <li>One row per entity/one column per variable</li> <li>A geometry variable with the coordinates of the points/polylines/polygons</li> </ul> <img src="slides_files/figure-html/unnamed-chunk-73-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <center><b>Rasters</b></center> <ul> <li>Works like a picture, with cells like pixels</li> <li>And each cell can take a given value, e.g. pollution observed from satellites</li> </ul> <img src="slides_files/figure-html/unnamed-chunk-74-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### 3. Wrap up! #### Coordinates Reference Systems <ul> <li>A Coordinate Reference System (CRS) is a model of the Earth in which each location is coded using degrees</li> <ul> <li>It allows to <b>project the surface of the globe on a plane</b></li> <li>But there is a <b>tradeoff</b> between preserving:</li> </ul> -- <p style = "margin-bottom:.5cm;"></p> .pull-left[ <center><b>Shape</b> (like the Mercator projection)</center> <img src="slides_files/figure-html/unnamed-chunk-75-1.png" width="70%" style="display: block; margin: auto;" /> ] -- .pull-right[ <center><b>Scale</b> (like the Equal-Area Cylindrical projection)</center> <img src="slides_files/figure-html/unnamed-chunk-76-1.png" width="100%" style="display: block; margin: auto;" /> <p style = "margin-bottom:1cm;"></p> <ul> <li>Most projections are somewhere in between</li> <li>For France: Lambert 93 projection (EPSG:2154)</li> </ul> <p style = "margin-bottom:.5cm;"></p> <center><i>➜ First thing to do: <b>reprojection</b></i></center> ] --- ### 3. Wrap up! #### Operations on geolocalized data .pull-left[ <center><b>Zonal statistics</b></center> <ul> <li>Computing statistics on areas delimited by a shapefile from values of a raster</li> <ul> <li>Project shapefile and raster the same way</li> <li>Compute the mean/max/... of cell values</li> </ul> <img src="slides_files/figure-html/unnamed-chunk-77-1.png" width="90%" style="display: block; margin: auto;" /> ] -- .pull-right[ <center><b>Centroids</b></center> <ul> <li>The centroid is the arithmetic mean position of all the points in the polygon</li> <ul> <li>To compute distances between polygons</li> <li>A centroid is not always within its polygon</li> </ul> <img src="slides_files/figure-html/unnamed-chunk-78-1.png" width="82%" style="display: block; margin: auto;" /> ]