I’m double posting this, even though it’s over on my mapping in R site. I’ve gotten a number of questions along these lines lately and it seemed a good time to “repost” it here. It’s a general summary of joins in R, plus some fun ways to dispaly climate data as polar plots.
Spatial Joins in R with sf
Some of the most common and useful geospatial operations are joins based on some component of the spatial topology. For example, you want to figure out what attributes of certain points that are associated with or within certain polygons on the landscape…like bus-stops in a county or river gaging stations within a watershed.
Spatial joins are based on the intersection between two spatial objects, often points and the polygons. There are many ways we can join objects, which may include specific options like crosses,near, within, touches, etc. The point being, we can do all this in R! Robin Lovelace et al. have a great online book available: https://geocompr.robinlovelace.net/spatial-operations.html that covers some of this material. Check it out!
Let’s load the libraries we’re going to need first.
Polygon Data
We’ll be using California and CA counties pulled from the USAboundaries package. Note, the first time you use this, you may need to install the dataset and restart your R session.
Point Data
Now we have some polygon data to work with…let’s add some climate data and practice joining polygons to points and points to polygons! First let’s use the GSODR (Global Surface Summary of the Day) package to get global climate station locations. Then we can join to a few specific states/counties, and plot. First the GSOD data:
Note, this is a fairly large set of point data, with 28,104 observations globally. Let’s map this so we can see how dense this dataset actually is. Let’s use a nice set of global maps from the rnaturalearth package. Because the points are so dense, let’s plot those first, then we’ll add a layer of world country outlines.
That’s a lot of points! Let’s look at just California to make this a little more manageable.
Great, now we have a dataframe in our environment that has both global climate station locations, and only stations associated with California, USA. You’ll notice there are a number of stations that fall outside of the CA border, largely those associated with buoys along the coast.
Spatial Joins
Select POLYGONS containing POINTS
This first approach only selects polygons that contain points. For demonstration sake, let’s use the larger global point dataset. Note this does not modify the polygon dataframe in any form (i.e., add attributes, update, summarize, etc). It is only selecting or filtering to the polygons that contain points using a spatial join.
Anti-Join Non-Matching Objects
So most counties have at least one point present. What if we specifically wanted to find the counties that don’t have a climate GSOD station in them? We can use something called an “anti_join”, which does precisely that, it identifies the items that don’t have a match. There’s a few possible ways to do this, but the most flexible I’ve found is using the following, because it’s easy to return whatever spatial object you prefer (e.g., points, polygons, lines).
The key is to use the same subsetting [ ] option, but add the !lengths() function to return a logical vector of all the non-matching objects. We are essentially filtering by this vector, so this doesn’t actually add any data from one layer to the other, it simply filters where there aren’t any overlapping bits.
Join Attributes: POINTS inside POLYGONS
Great, what about joining the data attributes? Let’s look for points that fall within CA counties, and add ATTRIBUTES from the county polygons to the climate station points. Just a reminder, here’s the data columns (or attributes) in the polygon dataset:
So in this case, let’s say we want to add the county name attribute to our POINT dataset, which looks like this (notice there’s no county field or name field):
So to spatially join the county name attribute with the appropriate point locations, let’s use st_join. If we use left=TRUE here, our result will retain all the points in the dataset rather than just the the spatial overlaps (where points fall inside polygons). So left=TRUE is essentially a dplyr::left_join, and left=FALSE is equivalent to a dplyr::inner_join.
Now we have only points that fall inside of a CA county, AND the new data frame now has a new column/attribute called “name” (all our climate station points have a named CA county associated with them). We could easily specify additional columns inside our st_join function, or if we don’t specify any columns, then all columns from the polygon dataframe that spatially joined/matched the points data would be added to the points dataframe.
isd_ca_co_pts <- st_join(isd_history, left = FALSE, ca_co) # join all columns
Practice with Climate Data Example!
Hopefully the above was useful…but let’s actually practice how we may use this by actually using some spatial joins to select and download some climate data from the GSODR package, and then make some visualizations. To start, let’s take a look at what stations have data between 1980 and 2018. Check the GSODR vignette for more details, I’m just applying some of the commands they lay describe.
Check Stations for Data Availability
Here we check what stations contain data between a given date range. Some of these stations go back to the 1930’s, but we’ll focus on 1980–2018.
Calculate Stations Per County
Looks like there are 53 stations, and some counties have more than one. Let’s apply our spatial join powers to filter this list down a bit. Let’s:
Summarize our station data using the spatial_joined county name attribute so we can calculate how many stations we have per county.
Create a dataset that includes only stations from counties with a single station
Create a dataset that contains stations that are within a set distance from the centroid of the county
We’ll mainly use some basic dplyr here, which is possible because as sf objects, these are still simple data frames.
Pick Station Nearest the Centroid of County
Well great, what do we do for counties with multiple stations? How about picking the station nearest the centroid of the county. Steps:
We need to work with just counties with more than one station.
We need to add the centroid of each of the counties in question.
We need to select the station nearest the centroid.
For Step 2, we’re going to use the purrr package to map or apply the st_centroid function over each county in our dataframe. This is the equivalent of a for-loop, it just looks very different.
Step 3: This is the trickiest part…
There are probably a few different ways to do this, but let’s try to use the one that seems simplest and uses the fewest lines of code. A handy package (st_nn) will allows us to nearest neighbors between points/lines/polygons, and can provide distances as well. So let’s get the station nearest our county centroid for all the counties with stations > 1.
That was a lot! But Looks like all works, and now we have a final set of stations we can use to download data. For simplicity in this example, I’m picking three stations, one with the lowest elevation, one with the highest elevation, and one with the greatest latitude.
Download Climate Data
Now we can use our station list to download daily data for each station for any period of record we want. Note, this code works but took 3-4 minutes to run. To speed things up I’ve saved the file here, or just grab the summarized data shown in the next section.
Summarize and Visualize!
Finally, we have data, let’s summarize and visualize these data. I’ll aggregate these data to monthly means so we can see different timing patterns in precipitation and temperature across the state.
Monthly Average
Let’s calculate averages and then see how these look for these different stations.
Daily
Let’s do the same thing for temperature!
Summary
California is an interesting place because we get most of our water during the winter, and typically have a long warm/dry period through the summer (a common Mediterranean climate pattern). It’s also striking to see the difference in mean precipitation (over a ~40 year period) across the state.
Hopefully there were some useful tidbits in this lesson/post. Please let me know if you have questions/comments (see the Contact tab).