I’m double posting this, even though it’s over on my mapping in R site. I’ve gotten a number of questions along these lines lately and it seemed a good time to “repost” it here. It’s a general summary of joins in R, plus some fun ways to display climate data as polar plots.

## Spatial Joins in R with `sf`

Some of the most common and useful geospatial operations are joins based on some component of the spatial topology. For example, you may want to figure out which points are associated with or fall within certain polygons on the landscape, and what attributes they carry…like bus stops in a county or river gaging stations within a watershed.

Spatial joins are based on the intersection between two spatial objects, often points and polygons. There are many ways we can join objects, which may include specific options like crosses, near, within, touches, etc. The point being, we can do all this in R! Robin Lovelace et al. have a great online book available: https://geocompr.robinlovelace.net/spatial-operations.html that covers some of this material. Check it out!

Let’s load the libraries we’re going to need first.
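A minimal sketch of that setup step. The package list is an assumption pieced together from the packages named in this post (`sf`, `dplyr`, `purrr`, `USAboundaries`, `GSODR`, `rnaturalearth`), plus `nngeo` (the home of `st_nn`) and `ggplot2` for plotting, which are guesses at what the later steps need. It attaches whatever is installed and reports the rest instead of stopping with an error.

```r
# Packages this walkthrough appears to rely on (the list itself is an assumption)
pkgs <- c("sf", "dplyr", "purrr", "ggplot2",
          "USAboundaries", "GSODR", "rnaturalearth", "nngeo")

# Check which of them can be loaded on this machine
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach everything that is available
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```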

### Polygon Data

We’ll be using California and CA counties pulled from the `USAboundaries` package. Note, the first time you use this, you may need to install the dataset and restart your R session.
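A hedged sketch of pulling those polygons. It assumes `USAboundaries` and its companion data package `USAboundariesData` are installed; if they are not, it falls back to two made-up square "counties" so that `ca` and `ca_co` exist either way. The fallback shapes are purely illustrative and are not real boundaries.

```r
library(sf)

have_usab <- requireNamespace("USAboundaries", quietly = TRUE) &&
  requireNamespace("USAboundariesData", quietly = TRUE)

if (have_usab) {
  # State outline and county polygons for California
  ca    <- USAboundaries::us_states(resolution = "high", states = "CA")
  ca_co <- USAboundaries::us_counties(resolution = "high", states = "CA")
} else {
  # Stand-in polygons (hypothetical): two unit squares side by side
  sq <- function(x0) st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
  ca_co <- st_sf(name = c("A", "B"), geometry = st_sfc(sq(0), sq(1)))
  ca    <- st_sf(name = "CA", geometry = st_union(st_geometry(ca_co)))
}
```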

### Point Data

Now we have some polygon data to work with…let’s add some climate data and practice joining polygons to points and points to polygons! First let’s use the `GSODR` (Global Surface Summary of the Day) package to get global climate station locations. Then we can join to a few specific states/counties, and plot. First the GSOD data:
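Station tables arrive as plain data frames with latitude and longitude columns, so the first step is turning them into `sf` points. Here is a small sketch with a made-up three-row table; the column names (`STNID`, `NAME`, `LAT`, `LON`) are modeled on what I’d expect from the GSOD station list, and the rows are invented.

```r
library(sf)

# Hypothetical station table (IDs, names, and coordinates are made up)
stn <- data.frame(
  STNID = c("000001-99999", "000002-99999", "000003-99999"),
  NAME  = c("TOY COAST", "TOY VALLEY", "TOY MTN"),
  LAT   = c(38.5, 36.7, 39.3),
  LON   = c(-123.0, -119.8, -120.2)
)

# coords takes x first (longitude), then y (latitude);
# remove = FALSE keeps LAT/LON as ordinary columns too
stn_sf <- st_as_sf(stn, coords = c("LON", "LAT"), crs = 4326, remove = FALSE)
```

Keeping `LAT` and `LON` as regular columns is handy later, for example when picking the station with the greatest latitude.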

Note, this is a fairly large set of point data, with `28,104` observations globally. Let’s map this so we can see how dense this dataset actually is. Let’s use a nice set of global maps from the `rnaturalearth` package. Because the points are so dense, let’s plot those first, then we’ll add a layer of world country outlines.

That’s a lot of points! Let’s look at just California to make this a little more manageable.

Great, now we have a dataframe in our environment that has both global climate station locations, and only stations associated with California, USA. You’ll notice there are a number of stations that fall outside of the CA border, largely those associated with buoys along the coast.
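To make the "labeled CA but outside the border" point concrete, here is a toy sketch. The square standing in for California, the `STATE` column, and the station rows are all hypothetical; the coordinates are planar with no CRS set, just to keep it small.

```r
library(sf)

# A square standing in for the state outline (not a real boundary)
ca <- st_sf(state = "CA", geometry = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))))

# Made-up stations; "d" is tagged CA but sits west of the polygon (a "buoy")
stn <- st_as_sf(
  data.frame(STNID = c("a", "b", "c", "d"),
             STATE = c("CA", "CA", "NV", "CA"),
             x = c(1, 1.5, 3, -0.5),
             y = c(1, 0.5, 1, 1)),
  coords = c("x", "y"))

# Attribute filter: every station tagged CA
stn_ca <- stn[stn$STATE == "CA", ]

# Spatial check: which of those actually fall inside the polygon?
inside   <- lengths(st_intersects(stn_ca, ca)) > 0
offshore <- stn_ca[!inside, ]
offshore$STNID
```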

## Spatial Joins

### Select POLYGONS containing POINTS

This first approach only selects polygons that contain points. For demonstration’s sake, let’s use the larger global point dataset. Note this does not modify the polygon dataframe in any form (i.e., add attributes, update, summarize, etc). It is only selecting or filtering to the polygons that contain points using a spatial join.
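Here is a toy sketch of that selection using `sf`’s spatial subsetting with `[ ]`. The three square "counties" and four "stations" are made up (planar coordinates, no CRS); with the real data the same pattern would be the county polygons subset by the station points.

```r
library(sf)

# Hypothetical unit-square counties A, B, C laid out left to right
sq <- function(x0) st_polygon(list(rbind(
  c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
counties <- st_sf(name = c("A", "B", "C"),
                  geometry = st_sfc(sq(0), sq(1), sq(2)))

# Hypothetical stations: two in A, one in B, one outside everything
stations <- st_as_sf(
  data.frame(id = c("s1", "s2", "s3", "s4"),
             x = c(0.5, 0.2, 1.5, 5.0),
             y = c(0.5, 0.8, 0.5, 0.5)),
  coords = c("x", "y"))

# Keep only polygons that intersect at least one point;
# no columns are added or changed
counties_with_pts <- counties[stations, ]
counties_with_pts$name
```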

### Anti-Join Non-Matching Objects

So most counties have at least one point present. What if we specifically wanted to find the counties that don’t have a climate GSOD station in them? We can use something called an “`anti_join`”, which does precisely that: it identifies the items that don’t have a match. There are a few possible ways to do this, but the most flexible I’ve found is using the following, because it’s easy to return whatever spatial object you prefer (e.g., points, polygons, lines).

The key is to use the same subsetting `[ ]` option, but add the `!lengths()` function to return a logical vector of all the non-matching objects. We are essentially filtering by this vector, so this doesn’t actually add any data from one layer to the other, it simply filters where there aren’t any overlapping bits.
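A small sketch of that pattern on made-up data (three square "counties", four "stations", planar coordinates with no CRS). `st_intersects()` returns, for each feature, the indices of the features it overlaps, so `lengths()` of that is zero exactly where nothing matched.

```r
library(sf)

sq <- function(x0) st_polygon(list(rbind(
  c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
counties <- st_sf(name = c("A", "B", "C"),
                  geometry = st_sfc(sq(0), sq(1), sq(2)))
stations <- st_as_sf(
  data.frame(id = c("s1", "s2", "s3", "s4"),
             x = c(0.5, 0.2, 1.5, 5.0),
             y = c(0.5, 0.8, 0.5, 0.5)),
  coords = c("x", "y"))

# Counties with no station: a count of 0 becomes TRUE under `!`
no_station <- counties[!lengths(st_intersects(counties, stations)), ]

# Flip the arguments to get the other object type back:
# stations that fall in no county
no_county <- stations[!lengths(st_intersects(stations, counties)), ]
```

Swapping which object sits outside the brackets is what makes it easy to get points, polygons, or lines back.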

### Join Attributes: POINTS inside POLYGONS

Great, what about joining the data attributes? Let’s look for points that fall within CA counties, and add ATTRIBUTES from the county polygons to the climate station points. Just a reminder, here’s the data columns (or attributes) in the polygon dataset:

So in this case, let’s say we want to add the county `name` attribute to our POINT dataset, which looks like this (notice there’s no `county` field or `name` field):

So to spatially join the county `name` attribute with the appropriate point locations, let’s use `st_join`. If we use `left=TRUE` here, our result will retain all the points in the dataset rather than just the spatial overlaps (where points fall inside polygons). So `left=TRUE` is essentially a `dplyr::left_join`, and `left=FALSE` is equivalent to a `dplyr::inner_join`.

Now we have only points that fall inside of a CA county, AND the new data frame now has a new column/attribute called “`name`” (all our climate station points have a named CA county associated with them). We could easily specify additional columns inside our `st_join` function, or if we don’t specify any columns, then all columns from the polygon dataframe that spatially joined/matched the points data would be added to the points dataframe.

`isd_ca_co_pts <- st_join(isd_history, left = FALSE, ca_co) # join all columns`
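To see the `left = TRUE` versus `left = FALSE` difference on something small, here is a sketch with made-up squares and points (planar coordinates, no CRS). Subsetting the polygons to `counties["name"]` is how a single column gets carried over; the geometry column tags along automatically.

```r
library(sf)

sq <- function(x0) st_polygon(list(rbind(
  c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
counties <- st_sf(name = c("A", "B", "C"),
                  geometry = st_sfc(sq(0), sq(1), sq(2)))
stations <- st_as_sf(
  data.frame(id = c("s1", "s2", "s3", "s4"),
             x = c(0.5, 0.2, 1.5, 5.0),
             y = c(0.5, 0.8, 0.5, 0.5)),
  coords = c("x", "y"))

# Left join: every station is kept; unmatched ones get NA for `name`
all_pts <- st_join(stations, counties["name"], left = TRUE)

# Inner join: only stations that fall inside a county survive
matched <- st_join(stations, counties["name"], left = FALSE)

nrow(all_pts)   # all four stations
nrow(matched)   # the three that sit inside a county
```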

## Practice with Climate Data Example!

Hopefully the above was useful…but let’s practice how we might use this by applying some spatial joins to select and download some climate data from the `GSODR` package, and then make some visualizations. To start, let’s take a look at what stations have data between 1980 and 2018. Check the `GSODR` vignette for more details, I’m just applying some of the commands they describe.

### Check Stations for Data Availability

Here we check which stations contain data across a given date range. Some of these stations go back to the 1930s, but we’ll focus on 1980–2018.
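One way to express that check, sketched on a made-up table. It assumes each station row carries its first and last observation dates as `YYYYMMDD` integers in columns called `BEGIN` and `END`; treat those names and that encoding as assumptions and check them against your own station table.

```r
# Hypothetical station records with period-of-record columns
stn <- data.frame(
  STNID = c("s1", "s2", "s3", "s4"),
  BEGIN = c(19350101L, 19900501L, 19730101L, 19791231L),
  END   = c(20191231L, 20200101L, 20101231L, 20181231L)
)

# Keep stations whose record spans the whole 1980-2018 window
covers_window <- stn$BEGIN <= 19800101L & stn$END >= 20181231L
stn_8018 <- stn[covers_window, ]
stn_8018$STNID
```

Note this only says a station was operating across the window; it says nothing about gaps inside it.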

### Calculate Stations Per County

Looks like there are 53 stations, and some counties have more than one. Let’s apply our spatial join powers to filter this list down a bit. Let’s:

• Summarize our station data using the spatial_joined county `name` attribute so we can calculate how many stations we have per county.
• Create a dataset that includes only stations from counties with a single station
• Create a dataset that contains stations that are within a set distance from the centroid of the county

We’ll mainly use some basic `dplyr` here, which is possible because as `sf` objects, these are still simple data frames.
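The first two bullets can be sketched like this, assuming the stations already carry a county `name` column from the spatial join. The four stations and their county labels are made up, and the coordinates are planar with no CRS.

```r
library(sf)
library(dplyr)

# Hypothetical joined stations: county A has two, B and C have one each
stn <- st_as_sf(
  data.frame(STNID = c("s1", "s2", "s3", "s4"),
             name  = c("A", "A", "B", "C"),
             x = c(0.4, 0.9, 1.5, 2.5),
             y = c(0.6, 0.1, 0.5, 0.5)),
  coords = c("x", "y"))

# Stations per county; drop geometry first so summarize() does not
# also try to union the points within each group
counts <- stn %>%
  st_drop_geometry() %>%
  group_by(name) %>%
  summarize(n = n())

# Split into single-station and multi-station counties
single <- stn %>% filter(name %in% counts$name[counts$n == 1])
multi  <- stn %>% filter(name %in% counts$name[counts$n > 1])
```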

### Pick Station Nearest the Centroid of County

Well great, what do we do for counties with multiple stations? How about picking the station nearest the centroid of the county. Steps:

1. We need to work with just counties with more than one station.
2. We need to add the centroid of each of the counties in question.
3. We need to select the station nearest the centroid.

For Step 2, we’re going to use the `purrr` package to `map` or `apply` the `st_centroid` function over each county in our dataframe. This is the equivalent of a for-loop, it just looks very different.
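A sketch of Step 2 on two made-up square counties. Rather than `purrr`, this version uses the fact that `st_centroid()` already works over a whole geometry column, and then spells out the one-county-at-a-time version with `lapply()` to show it is the same computation a `map` or for-loop would do.

```r
library(sf)

sq <- function(x0) st_polygon(list(rbind(
  c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
counties <- st_sf(name = c("A", "B"),
                  geometry = st_sfc(sq(0), sq(1)))

# Vectorized: one centroid per county, in row order
ctr_vec <- st_centroid(st_geometry(counties))

# Same result, one county at a time (what a map/for-loop would do)
ctr_loop <- st_sfc(lapply(seq_len(nrow(counties)), function(i) {
  st_centroid(st_geometry(counties)[[i]])
}))

st_coordinates(ctr_vec)
```

Working on `st_geometry(counties)` rather than the full `sf` object also sidesteps the warning about attributes being assumed constant over geometries.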

Step 3: This is the trickiest part…

There are probably a few different ways to do this, but let’s try to use the one that seems simplest and uses the fewest lines of code. A handy function, `st_nn` (from the `nngeo` package), will allow us to find nearest neighbors between points/lines/polygons, and can provide distances as well. So let’s get the station nearest our county centroid for all the counties with more than one station.
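For comparison, here is a sketch of the same idea that swaps `st_nn` for `sf`’s own `st_nearest_feature()`, so it needs no extra package. The data are made up (two square counties with two stations each, planar coordinates, no CRS). It works county by county, so a station in a neighboring county can never be picked by accident.

```r
library(sf)

sq <- function(x0) st_polygon(list(rbind(
  c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
counties <- st_sf(name = c("A", "B"),
                  geometry = st_sfc(sq(0), sq(1)))
stations <- st_as_sf(
  data.frame(id = c("s1", "s2", "s3", "s4"),
             x = c(0.4, 0.9, 1.2, 1.5),
             y = c(0.6, 0.1, 0.2, 0.6)),
  coords = c("x", "y"))

# Tag each station with its county
joined <- st_join(stations, counties["name"], left = FALSE)

# For each county, keep the station closest to that county's centroid
nearest <- do.call(rbind, lapply(split(joined, joined$name), function(stn) {
  county <- counties[counties$name == stn$name[1], ]
  ctr <- st_centroid(st_geometry(county))
  stn[st_nearest_feature(ctr, stn), ]
}))
nearest$id
```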

That was a lot! But it looks like it all works, and now we have a final set of stations we can use to download data. For simplicity in this example, I’m picking three stations: one with the lowest elevation, one with the highest elevation, and one with the greatest latitude.
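Picking those three is a matter of `which.min()` and `which.max()`. A sketch on a made-up table follows; the `ELEV_M` and `LAT` column names are assumptions about how elevation and latitude are stored.

```r
# Hypothetical final station list
stn <- data.frame(
  STNID  = c("s1", "s2", "s3", "s4"),
  ELEV_M = c(5, 1900, 300, 120),
  LAT    = c(34.0, 37.9, 41.8, 36.1)
)

# Row indices of lowest elevation, highest elevation, greatest latitude;
# unique() guards against one station winning two categories
picks <- unique(c(which.min(stn$ELEV_M),
                  which.max(stn$ELEV_M),
                  which.max(stn$LAT)))
final <- stn[picks, ]
final$STNID
```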

Now we can use our station list to download daily data for each station for any period of record we want. Note, this code works but took 3-4 minutes to run. To speed things up I’ve saved the file here, or just grab the summarized data shown in the next section.

### Summarize and Visualize!

Finally, we have data, let’s summarize and visualize these data. I’ll aggregate these data to monthly means so we can see different timing patterns in precipitation and temperature across the state.

#### Monthly Average

Let’s calculate averages and then see how these look for these different stations.
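A minimal sketch of the monthly aggregation with `dplyr`, on a tiny made-up daily table. The column names (`STNID`, `YEARMODA`, `PRCP`) are assumptions about what the downloaded daily data looks like, and the values are invented.

```r
library(dplyr)

# Hypothetical daily records: two stations, two days each in Jan and Jul
daily <- data.frame(
  STNID    = rep(c("s1", "s2"), each = 4),
  YEARMODA = rep(as.Date(c("2000-01-01", "2000-01-02",
                           "2000-07-01", "2000-07-02")), times = 2),
  PRCP     = c(10, 20, 0, 0, 2, 4, 1, 1)
)

# Mean daily precipitation per station and calendar month
monthly <- daily %>%
  mutate(month = as.integer(format(YEARMODA, "%m"))) %>%
  group_by(STNID, month) %>%
  summarize(mean_prcp = mean(PRCP, na.rm = TRUE), .groups = "drop")
monthly
```

Grouping on calendar month alone pools every year together, which is what you want for a long-term seasonal picture.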

#### Daily

Let’s do the same thing for temperature!
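For the polar view, one option is `ggplot2` with `coord_polar()`, which wraps the month axis into a circle so December sits next to January. This is only a sketch: the twelve temperature values are made up for illustration and are not results from any station.

```r
library(ggplot2)

# Invented monthly mean temperatures (deg C), just to have something to draw
monthly <- data.frame(
  month     = factor(month.abb, levels = month.abb),
  mean_temp = c(8, 10, 12, 15, 19, 23, 26, 26, 23, 18, 12, 8)
)

# Bars around a circle: one wedge per month
p <- ggplot(monthly, aes(x = month, y = mean_temp)) +
  geom_col(fill = "steelblue") +
  coord_polar() +
  labs(x = NULL, y = "Mean temperature (C)")

# print(p) draws the plot
```

Adding `facet_wrap()` on a station column would give one polar panel per station for side-by-side comparison.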

### Summary

California is an interesting place because we get most of our water during the winter, and typically have a long warm/dry period through the summer (a common Mediterranean climate pattern). It’s also striking to see the difference in mean precipitation (over a ~40 year period) across the state.

Hopefully there were some useful tidbits in this lesson/post. Please let me know if you have questions/comments (see the Contact tab).