Part I: Calculating Distance Matrices for Nearest Points with sf, SearchTrees, & Imap

As usual I wanted to post more regularly and it hasn’t happened. However, I’ve been using the sf package a bunch lately, and I wanted to provide a series of posts that highlight some of the ways I’ve used it. Overall I’ve been impressed: it’s fast, does nearly everything I need, and it even plays nicely with ggplot2[1].

Hopefully the next set of posts will show a few ways you can use these tools, and a few random tricks for data mining and working with spatial data.

One of the more difficult and interesting tasks I’ve had to sort out recently is how to snap/attribute the nearest stream segment or USGS flow gage to a set of study sites in order to calculate distances between sites. So, more plainly, how to identify the nearest neighbor for one set of points to another. Later, I’ll show how to snap a set of points to a river network and calculate distances along that line network.

Load the Packages

I recognize this part isn’t always trivial, especially if you are working on a Linux system. I’m not going to go into too much detail, but I know getting sf and tidyverse installed does require a number of system dependencies, so if you run into trouble, try some googling, or drop me a note.
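
If you’re starting from scratch, here’s a rough, hedged install sketch; the Ubuntu system library names below are an assumption based on common sf setups, so check your distro’s documentation:

# R packages (sf needs GDAL/GEOS/PROJ system libraries installed first)
install.packages(c("tidyverse", "sf", "rvest", "SearchTrees", "Imap"))

# On Debian/Ubuntu the system libraries are typically something like
# (names are an assumption, check your distro's docs):
# sudo apt-get install libudunits2-dev libgdal-dev libgeos-dev libproj-dev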

The main packages I’m going to use in this post:

# load libraries
suppressMessages({
  library(tidyverse)
  library(sf)
  library(rvest)
  library(SearchTrees)
  library(Imap)
})

Getting Data to Calculate a Distance Matrix

For my purposes, I’m often interested in associating a given stream reach or sample site with a USGS Gage in order to extract flow data. See the section on Snapping to Nearest Points. A future post will show how to calculate river or network distances between sets of points, as well as how to download stream/river data. For now, let’s focus on how to find the nearest point and calculate pairwise distances.

Scrape Web for CDEC Water Temperature Localities

Just for fun, I’m going to show how to scrape some web data (a table on a website) using the rvest package. This is a fairly painless process, though a little tricky at first.

CDEC has a bunch of stations, and it can be hard to find the one you need/want. This webpage provides a way to filter things down to what you want. For this example, let’s filter to active CDEC stations that collect water temperature data (sensor=25), and get a table of sites and locations.

I’ve linked to the CDEC page with the table here; go take a look so you know what the data look like.

Identify the Table & get the Xpath

To pull that data in as a dataframe, there are a few small intermediate steps that will help us grab the data. I’m going to show this using Chrome, but other browsers likely have similar capabilities.

  1. Once you have the webpage up, if you right click anywhere on the page and select Inspect, you’ll see a window with all the crazy underbits of the webpage.
  2. If you click inside the upper right hand side box, and hover your mouse around in that same box, you’ll see a corresponding highlighted area on the left hand side (the actual webpage). Do that until you find the line that highlights the exact table you are interested in, as shown in this screen shot below.

In this case, you should have clicked on a line that says something like:

<table width="850" border="1">...</table> == $0

  3. Right click on that line in the upper right hand box, and select Copy and then Copy XPath. You’ll have completed all the tricky parts of this section! On to getting some R code written.

Now Scrape/Download the Data

Let’s load up the rvest package, set our URL, and then paste that XPath into the xpath= argument. This basically tells the parser to pull out only the good bits inside that specific table, and nothing else.

library(rvest)

url <- "https://cdec.water.ca.gov/cgi-progs/staSearch?sta=&sensor_chk=on&sensor=25&dur=&active=Y&lon1=&lon2=&lat1=&lat2=&elev1=-5&elev2=99000&nearby=&basin=ALAMEDA+CR&hydro=CENTRAL+COAST&county=ALAMEDA&operator=%2B&display=sta"

df <- url %>%
  read_html() %>% 
  html_nodes(xpath='//*[@id="main_content"]/div/div[1]/table') %>% # inspect and xpath with chrome
  html_table()
df_locs <- df[[1]] # html_table() returns a list of tables, so pull out the first as a dataframe

head(df_locs)
##    ID                         Station Name   River Basin     County
## 1 130 SAN JOAQUIN R - MONITORING WELL #130 SAN JOAQUIN R     FRESNO
## 2 142 SAN JOAQUIN R - MONITORING WELL #142 SAN JOAQUIN R     FRESNO
## 3 49B                      SJRRP MW-09-49B SAN JOAQUIN R     FRESNO
## 4 AFD          AMERICAN R BELOW FOLSOM DAM    AMERICAN R SACRAMENTO
## 5 AFO          AMERICAN RIVER AT FAIR OAKS    AMERICAN R SACRAMENTO
## 6 AHZ       AMERICAN R AT HAZEL AVE BRIDGE    AMERICAN R SACRAMENTO
##   Longitude Latitude Elevation Feet                 Operator Map
## 1 -120.5149 36.99820      130 &nbsp US Bureau of Reclamation    
## 2 -120.6829 37.18610      100 &nbsp US Bureau of Reclamation    
## 3 -120.2702 36.77120      171 &nbsp US Bureau of Reclamation    
## 4 -121.1667 38.68830      220 &nbsp     US Geological Survey    
## 5 -121.2277 38.63546       72 &nbsp     US Geological Survey    
## 6 -121.2240 38.63600      100 &nbsp     US Geological Survey

Clean and Make Spatial (with sf)

Ok! We now have a dataframe of CDEC sites that collect temperature data. Let’s do a little cleaning and make our data an sf spatial object:

# names
names(df_locs)
## [1] "ID"             "Station Name"   "River Basin"    "County"        
## [5] "Longitude"      "Latitude"       "Elevation Feet" "Operator"      
## [9] "Map"
# rename cols:
df_locs <- df_locs %>% rename(station="Station Name", river_basin="River Basin", 
                              lon=Longitude, lat=Latitude, elev_ft=`Elevation Feet`) %>%
  select(-Map)

# remove the HTML non-breaking space entity "&nbsp" and convert to numeric
df_locs$elev_ft <- as.numeric(gsub(pattern = "&nbsp", replacement = "", df_locs$elev_ft)) 

# make CDEC data sf: 
df_locs <- st_as_sf(df_locs, 
                    coords = c("lon", "lat"), # for point data
                    remove = F, # don't remove these lat/lon cols from df
                    crs = 4326) # add projection (this is WGS84)

summary(df_locs)
##       ID              station          river_basin       
##  Length:365         Length:365         Length:365        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##     County               lon              lat           elev_ft       
##  Length:365         Min.   :-124.2   Min.   :34.22   Min.   :    0.0  
##  Class :character   1st Qu.:-122.0   1st Qu.:37.73   1st Qu.:    9.0  
##  Mode  :character   Median :-121.5   Median :38.06   Median :  100.0  
##                     Mean   :-121.4   Mean   :38.40   Mean   :  736.1  
##                     3rd Qu.:-120.9   3rd Qu.:39.38   3rd Qu.:  580.0  
##                     Max.   :-117.3   Max.   :41.93   Max.   :10050.0  
##    Operator                  geometry  
##  Length:365         POINT        :365  
##  Class :character   epsg:4326    :  0  
##  Mode  :character   +proj=long...:  0  
##                                        
##                                        
## 

Great! The first thing you’ll notice is that sf reads this in as a regular dataframe, with a list-column at the end called geometry, which is where the spatial data lives. Now we have data ready to go.
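
If you want to poke at that structure directly, here’s a quick (optional) check using standard sf accessors:

class(df_locs)       # an sf object that is still a data.frame
st_geometry(df_locs) # the geometry list-column on its own
st_crs(df_locs)      # the projection we assigned (EPSG 4326)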

Get USGS Gage Locations

Next step, let’s get another set of points to compare/snap our data to; we’ll use the USGS gage dataset. These data are available via the Gages II dataset (download the shapefile here). We can read this shapefile in (once unzipped) with the sf package. Note, these are gages for the entire US.

library(sf)
gages <- st_read("data/gagesII_9322_sept30_2011.shp", quiet = F)

Insider sf/shapefile Trick

I figured this out recently: you can keep your shapefile zipped! The great thing about this is it saves space and cuts down the sheer number of files you have to deal with. You can take it or leave it, but here’s how it works. You keep the shapefile zipped, unzip and read it in temporarily, then remove the temporary unzipped files while keeping the data in your R environment for analysis.

library(sf)

# read in zipped
gages <- st_read(unzip("data/gagesII_9322_sept30_2011.zip"), quiet = F)
## Reading layer `gagesII_9322_sept30_2011' from data source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/gagesII_9322_sept30_2011.dbf' using driver `ESRI Shapefile'
## Simple feature collection with 9322 features and 14 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -6233389 ymin: -47038.1 xmax: 3271609 ymax: 6043894
## epsg (SRID):    NA
## proj4string:    +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
# then remove the unzipped files since the data is now in memory
file.remove(list.files(pattern = "gagesII_9322_sept30_2011*",recursive = F))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

There are 9322 records, and this data already has a projection (Albers Equal Area) in NAD83, so we should reproject it into something different for the next few sections. Let’s also filter our data to only sites in California. The cool thing is we can just use normal dplyr/subsetting functions as we would on a regular dataframe.

# filter to only active gages in CA
gages_ca <- gages %>% filter(ACTIVE09=="yes", STATE=="CA") %>% 
  st_transform(crs = 4326)

Great, only 532 gages now…that should be enough to cover our CDEC temperature locations.
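
A quick sanity check (optional, just standard sf/base calls) confirms the filter and the reprojection:

nrow(gages_ca)   # should be 532 active CA gages
st_crs(gages_ca) # now WGS84 (EPSG 4326), matching the CDEC points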

MAPS: Make a Leaflet Map of Our Point Data

Now that we have two layers to work with, let’s make a quick leaflet map so we can get a sense of what we have. The blue dots are USGS gages, and the orange circles are the CDEC temperature stations.

#quick plot:
suppressPackageStartupMessages({
  library(leaflet)
  library(htmltools)
})

# MAP
aMAP<-leaflet() %>% #width = "100%", height="100%"
  addTiles() %>% 
  addProviderTiles("Esri.WorldTopoMap", group = "Topo") %>%
  addProviderTiles("Esri.WorldImagery", group = "ESRI Aerial") %>%

  # add scale bar
  addMeasure(position = "topright",
             primaryLengthUnit = "meters",
             primaryAreaUnit = "sqmeters",
             activeColor = "#3D535D",
             completedColor = "#7D4479") %>%
 
  # all CA gages
  addCircleMarkers(data=gages_ca, group="USGS gages CA",
                   stroke=TRUE, weight=0.3, radius=4,
                   fillOpacity = 0.7,
                   fillColor = "blue",
                   popup=paste0("USGS Site: ", gages_ca$STANAME, 
                                "<br>","SiteNo: ", gages_ca$STAID,
                                "<br>", "Ecoregion: ", gages_ca$AGGECOREGI,
                                "<br>", "Drainage Area (sqkm): ", gages_ca$DRAIN_SQKM)) %>% 
                   #clusterOptions = markerClusterOptions(),
                   #clusterId = "gagesCluster") %>%
  
  # add samples
  addCircleMarkers(data=df_locs, group="CDEC Stream Temp Localities",
                   opacity = 0.8, 
                   popup=paste0("CDEC Site: ", df_locs$ID, "<br>",
                                "Station Name: ", df_locs$station, "<br>",
                                "Elev_ft: ", df_locs$elev_ft, "<br>",
                                "Operator: ",df_locs$Operator),
                   weight=0.6,radius=10, stroke=TRUE,
                   fillColor = "orange") %>%
  #hideGroup("GW Localities") %>% 
  
  addLayersControl(
    baseGroups = c("Topo","ESRI Aerial"),
    overlayGroups = c("CDEC Stream Temp Localities", "USGS gages CA"),
    options = layersControlOptions(collapsed = T)) %>% 
  
  addLegend(position = "bottomright", labels=c("USGS Flow", "CDEC Water Temp"),
            colors=c("blue", "orange"), opacity=1)

aMAP
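
If you want a standalone copy of this map, the htmlwidgets package (not loaded above, so consider this an extra assumption) can write it out as a self-contained HTML file:

# save the leaflet map as standalone HTML (requires htmlwidgets)
htmlwidgets::saveWidget(aMAP, file = "cdec_usgs_map.html", selfcontained = TRUE)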

Snap CDEC Sites to Nearest USGS Gages

Now we can calculate the nearest USGS gage to each CDEC water temperature location. There are a few options for doing this (including a built-in sf route, sketched just after this list), but the two primary packages I found useful for this purpose were:

  • SearchTrees: uses a k nearest-neighbor (knn) lookup
  • Imap: uses the gdist function to calculate geodesic distances
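
As a quick aside before the SearchTrees approach: if your sf version is recent enough, there’s also a built-in route via st_nearest_feature() and st_distance(). This is just a sketch of that alternative, not what the rest of the post uses:

# sf-native alternative: index of the single nearest gage for each CDEC site
nearest_idx <- st_nearest_feature(df_locs, gages_ca)
dist_to_nearest <- st_distance(df_locs, gages_ca[nearest_idx, ], by_element = TRUE)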

Nearest Points Using knn lookup

This code determines the nearest n points to an existing set of points. Here, we can use it to find/assign the nearest USGS gage to each of our existing sampling sites. First we load the SearchTrees package, then build a lookup tree from the spatial coordinates, and then get the 2 nearest locations to each point (you can select as many nearest neighbors as you need). Then make a simple plot using base plotting with sf. The plot shows the CDEC temperature gage (orange square), and the two nearest USGS gages in matching colored circles.

library(SearchTrees)

# filter df_locs to just a few counties for nicer visualization:
df_locs <- df_locs %>% filter(County %in% c("EL DORADO", "PLACER", "SACRAMENTO", "NEVADA"))

## Find indices of the two nearest points in A to each of the points in B
tree <- createTree(st_coordinates(gages_ca))
inds <- knnLookup(tree, newdat=st_coordinates(df_locs), k=2) # can be 2 or more

## Show that it worked
plot(st_coordinates(df_locs), # the filtered CDEC points
       #bg=c("orange","maroon","purple","yellow", "green"), 
       pch=22, col="orange", cex=1.5)
points(st_coordinates(gages_ca), pch=1, cex=0.8)

## Plot the two nearest neighbors for a few example sites
points(df_locs$lon[3], df_locs$lat[3], pch=21, col="forestgreen", cex=3)
points(st_coordinates(gages_ca)[inds[3,],], pch=16, col=adjustcolor("forestgreen", alpha=0.9), cex=1.5)

points(df_locs$lon[11], df_locs$lat[11], pch=21, col="blue", cex=3)
points(st_coordinates(gages_ca)[inds[11,],], pch=16, col=adjustcolor("blue", alpha=0.7), cex=1.5)

points(df_locs$lon[4], df_locs$lat[4], pch=21, col="purple", cex=3)
points(st_coordinates(gages_ca)[inds[4,],], pch=16, col=adjustcolor("purple", alpha=0.7), cex=1.5)

points(df_locs$lon[6], df_locs$lat[6], pch=21, col="maroon", cex=3)
points(st_coordinates(gages_ca)[inds[6,],], pch=16, col=adjustcolor("maroon", alpha=0.7), cex=1.5)
legend(legend=c("CDEC","USGS"), "bottomright", col=c("orange", "black"), pch=c(22, 1), pt.cex=c(1.5, 0.8))

Make a Dataframe Joining the Nearest Points

Now that we have the data and an index of the nearest point (or nearest 2) for each site, we can rebuild a dataframe for future use. This just lists the 2 nearest USGS gages for each CDEC temperature gage, along with a few other important pieces of info.

inds_df <- tibble("usgs01"=inds[,1], "usgs02"=inds[,2],
                  "usgs01_ID"=gages_ca$STAID[inds[,1]],
                  "usgs02_ID"=gages_ca$STAID[inds[,2]],
                  "cdec_id"=df_locs$ID,
                  "cdec_sta"=df_locs$station,
                  "cdec_lon"=df_locs$lon,
                  "cdec_lat"=df_locs$lat)

head(inds_df)
## # A tibble: 6 x 8
##   usgs01 usgs02 usgs01_ID usgs02_ID cdec_id
##    <int>  <int>    <fctr>    <fctr>   <chr>
## 1    444    445  11446500  11447293     AFD
## 2    444    445  11446500  11447293     AFO
## 3    444    445  11446500  11447293     AHZ
## 4    446    342  11447360  11336580     AWB
## 5    446    444  11447360  11446500     AWP
## 6    420    419  11422500  11421790     BRE
## # ... with 3 more variables: cdec_sta <chr>, cdec_lon <dbl>,
## #   cdec_lat <dbl>
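
One handy extension (a sketch using sf’s st_distance, not something from the original workflow): attach the straight-line distance to the nearest gage, so you can flag pairings that are suspiciously far apart.

# distance (in meters) from each CDEC site to its nearest USGS gage
inds_df$dist_m <- st_distance(df_locs, gages_ca[inds[,1], ], by_element = TRUE)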

Calculate Distance Matrix Using Imap

Well, that was awesome the first time it worked for me. One other thing that can be helpful is making a matrix of the distances (i.e., all the possible pairwise comparisons between sites). This example uses straight-line (geodesic) distances rather than river distances, but it could be useful for comparing straight-line vs. river network distances. Also check out the geosphere package for additional functions like daylength, dist2Line, and random sampling of lat/longs.

Here we load two different functions, which I did not write (I need to track down the reference for where I found these!):

  • ReplaceLowerOrUpperTriangle: fills in the upper or lower half of the distance matrix, since the matrix is symmetric and the two triangles mirror each other
  • GeoDistanceInMetresMatrix: calculates the matrix of distances for each pairwise comparison (each pair of sites); see the quick gdist check just below
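
Before wiring these together, here’s a quick sanity check of Imap’s gdist on its own (a minimal sketch with made-up Sacramento-area coordinates):

# distance between two arbitrary points, in meters
gdist(lon.1 = -121.49, lat.1 = 38.58, lon.2 = -120.66, lat.2 = 38.92, units = "m")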

To use these, we:

  • load the dataframe of CDEC temperature sites (keeping only the ID and lat/lon columns),
  • run the function,
  • plot these data and add row/col names,
  • add the actual distance values in the grid,
  • add a title.

library(Imap)

# make function
ReplaceLowerOrUpperTriangle <- function(m, triangle.to.replace){
  # If triangle.to.replace="lower", replaces the lower triangle of a square matrix with its upper triangle.
  # If triangle.to.replace="upper", replaces the upper triangle of a square matrix with its lower triangle.
  
  if (nrow(m) != ncol(m)) stop("Supplied matrix must be square.")
  if      (tolower(triangle.to.replace) == "lower") tri <- lower.tri(m)
  else if (tolower(triangle.to.replace) == "upper") tri <- upper.tri(m)
  else stop("triangle.to.replace must be set to 'lower' or 'upper'.")
  m[tri] <- t(m)[tri]
  return(m)
}

GeoDistanceInMetresMatrix <- function(df.geopoints){
  # Returns a matrix (M) of distances between geographic points.
  # M[i,j] = M[j,i] = Distance between (df.geopoints$lat[i], df.geopoints$lon[i]) and
  # (df.geopoints$lat[j], df.geopoints$lon[j]).
  # The row and column names are given by df.geopoints$name.
  
  GeoDistanceInMetres <- function(g1, g2){
    # Returns a vector of distances. (But if g1$index > g2$index, returns zero.)
    # The 1st value in the returned vector is the distance between g1[[1]] and g2[[1]].
    # The 2nd value in the returned vector is the distance between g1[[2]] and g2[[2]]. Etc.
    # Each g1[[x]] or g2[[x]] must be a list with named elements "index", "lat" and "lon".
    # E.g. g1 <- list(list("index"=1, "lat"=12.1, "lon"=10.1), list("index"=3, "lat"=12.1, "lon"=13.2))
    DistM <- function(g1, g2){
      require("Imap")
      return(ifelse(g1$index > g2$index, 0, gdist(lat.1=g1$lat, lon.1=g1$lon, lat.2=g2$lat, lon.2=g2$lon, units="m")))
    }
    return(mapply(DistM, g1, g2))
  }
  
  n.geopoints <- nrow(df.geopoints)
  
  # The index column is used to ensure we only do calculations for the upper triangle of points
  df.geopoints$index <- 1:n.geopoints
  
  # Create a list of lists
  list.geopoints <- by(df.geopoints[,c("index", "lat", "lon")], 1:n.geopoints, function(x){return(list(x))})
  
  # Get a matrix of distances (in metres)
  mat.distances <- ReplaceLowerOrUpperTriangle(outer(list.geopoints, list.geopoints, GeoDistanceInMetres), "lower")
  
  # Set the row and column names (note: this expects a "Locality" column;
  # our CDEC input doesn't have one, so we set the names ourselves after calling this)
  rownames(mat.distances) <- df.geopoints$Locality
  colnames(mat.distances) <- df.geopoints$Locality
  
  return(mat.distances)
}

# calc dist in km
distmatrix<-round(GeoDistanceInMetresMatrix(df_locs[,c(1,5:6)]) / 1000)
head(distmatrix)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    0    8    8   23   18   53   57   84  128    39    59    50    59
## [2,]    8    0    0   16   10   60   50   76  136    31    51    43    51
## [3,]    8    0    0   16   11   60   50   76  135    32    51    43    51
## [4,]   23   16   16    0    5   73   37   62  151    16    38    28    38
## [5,]   18   10   11    5    0   69   41   67  146    21    43    33    43
## [6,]   53   60   60   73   69    0  110  135   86    89   111   102   111
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]   125    80   121   115    21   118    71    47    86    58    54
## [2,]   133    88   129   123    28   125    63    39    78    50    46
## [3,]   133    87   128   123    28   125    64    39    78    50    46
## [4,]   148   103   144   139    43   141    49    25    64    36    31
## [5,]   143    98   139   133    38   135    54    30    69    41    36
## [6,]    82    57    79    75    32    76   123    98   137   109   104
##      [,25] [,26] [,27] [,28] [,29] [,30]
## [1,]    69   121    80    79   103    79
## [2,]    61   129    72    71   111    71
## [3,]    61   129    72    72   111    71
## [4,]    47   145    58    58   127    58
## [5,]    52   139    63    62   121    62
## [6,]   120    80   131   131    70   131
# dist matrix in heatmap form
image(1:nrow(distmatrix), 1:ncol(distmatrix), distmatrix, axes = FALSE, 
      xlab="", ylab="", col = terrain.colors(100))
colnames(distmatrix)<-df_locs$station # station is the site-name column we created earlier
rownames(distmatrix)<-df_locs$station
axis(1, 1:nrow(distmatrix), rownames(distmatrix), cex.axis = 0.7, las=3, family="Roboto Condensed")
axis(2, 1:nrow(distmatrix), colnames(distmatrix), cex.axis = 0.5, las=1, family="Roboto Condensed")
text(expand.grid(1:nrow(distmatrix), 1:ncol(distmatrix)), sprintf("%0.1f", distmatrix), cex=0.6, family="Roboto Condensed")
title("Euclidean Distance matrix (km) \n for All CDEC Water Temp Sites", cex.main=0.8, family="Roboto Condensed")

VICTORY!!

Yup…we did some stuff. Maybe it was helpful!

Summary

Stay tuned for Part II next week, which will cover some more detailed ways to use sf to buffer, clip/crop, and intersect, as well as how to plot sf spatial data using the ggplot2 framework.

[1] Getting sf and ggplot2 to play together can be quirky at times, but overall once you understand how to get the layers to work it’s great.