Part III: Snapping Points to NHD Flowline Data and Calculating River Distance with riverdist

Though it’s a bit late in getting posted, this is part III in a spatial series that is becoming a quasi-workshop in the variety of ways we can work with spatial river & stream data in R. This post will continue using the sf package, and introduce the riverdist package, but we’ll still be playing with data on rivers and streams.

To recap, in the first few posts we made a leaflet map, scraped data from tables on a webpage, demonstrated how to calculate distance matrices between point data, and showed how to use functions written by folks at USGS (OWI) as well as the Hydrosheds data to download riverline data. That’s a fair amount to chew on I realize, but I’d like to demo a few additional things that might be useful for folks working with river data.

This post will (try to) show you:

  • How to use the riverdist package to process a riverline network
  • How to snap spatial point data to the processed riverline data (which we downloaded in the previous post)
  • Finally, how to calculate the river distance between sites (so along the river network)
  • and a quick ggmap example

I think in a future post I’d like to demo how to implement a wavelet analysis of these flow data, as well as show an example of the very cool geoknife package to assess patterns of river impairment. Onward!

So, hopefully these posts are useful and it doesn’t feel like this:

Load the Packages

The main packages I’m going to use in this post:

# load libraries
suppressMessages({
  library(dplyr); # data munging and piping
  library(purrr); # for list functions
  library(ggplot2); # plotting
  library(ggrepel) # for labeling
  library(sf); # spatial simple features
  library(USAboundaries); # state/county data
  library(Imap); # nice mapping/color functions
  #library(geoknife); # USGS tool set (Next post)
  #library(dataRetrieval); # USGS tool set (next post)
  #library(httr) # scraping webdata (to download flowlines)
})

# Note, to plot `sf` objects with `ggplot`, you need to have the most recent version (>2.2.1). A quick and easy check you can add to your code is an `if` statement: `if (utils::packageVersion("ggplot2") > "2.2.1"))`

Calculating River Distances

To use the riverdist package, we need to load a shapefile with a river network of some type. This example uses NHD Streamline data, cropped to the North Fork American watershed. I downloaded the flowline data for the North Fork American watershed using the get_flowline function from our previous post, at Stream Order 1, 2, 3, and 4. The function was from an excellent gist by Laura DeCicco (see here), and I modified it slightly to work with the sf package.

To save time, I’ve put these data on github here for download. Let’s use HUC8 data from the previous post as well. Download it to a “data” folder and the code below should work.

Load Flowline & HUC Data

Let’s take a look at the Streamline data for Stream Order=2. Note the river network was downloaded using a bounding box or extent (max and min lat and lon values for a given shapefile), so it extends outside of the specific HUC8 watershed boundary we’re interested in.

# load the streamline data
load("data/nhd_rivs_ord_1_4.rda")

# load the hucs
huc8 <- read_sf(unzip("data/h8_AMR_BEA_YUB.zip"), quiet = F) %>%
  st_transform(crs=4326) #%>% 
## Reading layer `h8_AMR_BEA_YUB' from data source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/h8_AMR_BEA_YUB.dbf' using driver `ESRI Shapefile'
## Simple feature collection with 4 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -121.5976 ymin: 38.61455 xmax: -119.9829 ymax: 39.77696
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
# warning message here is ok!

# then remove raw files since file is added in memory
file.remove(list.files(pattern = "h8_AMR_BEA_YUB*",recursive = F))
## [1] TRUE TRUE TRUE TRUE
# filter to the NF American only
h8 <- huc8 %>% filter(HU_8_NAME=="North Fork American") 

# plot 
plot(st_geometry(h8), axes=F, lwd = 3)
plot(st_geometry(rivers2), col="blue", add=T)
title(main = "NF American Watershed & NHD Flowline (Stream order=2)", family="Roboto Condensed")

Crop and Save as Shapefile

In order to use the riverdist package, we are going to want to trim our flowline data to a single watershed, and get rid of the streams outside of that boundary. Let’s crop the flowline data to the NF American Watershed using sf::st_intersection. Then we can write the data as a shapefile.

# crop river data by huc8 data with sf function
rivs2 <- st_intersection(rivers2, h8)

# plot to check:
plot(st_geometry(h8), axes=F, lwd=3)
plot(st_geometry(rivs2), col="blue", add=T)
title(main = "NF American Watershed & NHD Flowline (Stream order=2)", family="Roboto Condensed")

# okay looks good lets save:
st_write(rivs2, "data/nfa_rivers2.shp", delete_dsn = T) 
## Deleting source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/data/nfa_rivers2.shp' using driver `ESRI Shapefile'
## Writing layer `nfa_rivers2' to data source `/Users/ryanpeek/Documents/github/ryanpeek.github.io/data/nfa_rivers2.shp' using driver `ESRI Shapefile'
## features:       654
## fields:         95
## geometry type:  Line String
# "delete_dsn=T" will allow you to overwrite existing file so you don't get an error like this:
### Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options),  : 
###  Dataset already exists.

You might see a warning message about attribute field names, that’s okay, it’s just making field names unique. The file should be where you saved it, and have 4 separate files (.dbf, .prj, .shp, .shx). There can be more associated with a shapefile, but these 4 are the core set that are required.

Clean River Network Topology

Now that we have a river network, we need to clean it and check for errors so that when riverdist is routing things along the network, distances will be accurate.

The main steps are:

  • Load data and transform into a projected coordinate system
  • Plot and Check Topology
  • Clean the network topology so we can snap points, save data
  • Calculate distances along the river network between sites

Load Data and Project

First we need to load the package and project our data. riverdist won’t work unless data are in a projected coordinate system. If you aren’t sure what the difference between projected (PCS) and geographic coordinate systems (GCS) are, here’s the quickest summary I can provide.

  • GCS are based on a spherical earth model (3-dimensional), so use coordinates like latitude & longitude (a datum) in a 3-dimensional space to define locations based on a spheroid model (of which there are several). The datum is used to link the spheroid model to the earth’s surface.
  • PCS project locations onto a 2-dimensional Cartesian coordinate plane. That’s why we refer to these as map projections. Different projections have different ways of stretching the 3-d earth’s surface onto a flat 2-d plane. But by putting things onto a flat 2-d surface, calculations (like calculating distances along a river network) are usually easier to do.
# load the package
library(riverdist)

# riverdist only works w/ projected coordinate systems...a few options:
## EPSG:102003 USA_Contiguous_Albers_Equal_Area_Conic
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
## EPSG:102004 USA_Contiguous_Lambert_Conformal_Conic
## EPSG:102005 USA_Contiguous_Equidistant_Conic

# read in the network and project with Albers Equal Area
rivs <- line2network(path="data/", layer="nfa_rivers2", reproject = "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs")
## 
##  Units: m 
## 
##  Removed 36 segments with lengths shorter than the connectivity tolerance.
# should see message about "Removed 36 Segments w short lengths"

# plot of all the segments
plot(x=rivs)

Plot and Check Topology

Ok, so we should see a whole bunch of rainbow colored numbered segments. If we want to look at topology, we can plot the nodes (confluences). Green dots represent connections or confluences between segments, red dots are “headwaters” or isolated branches. Notice there are some red dots in some odd locations.

# check topology (nodes)
topologydots(rivers=rivs)

Cleanup: Insert Vertices and Identify River Mouth

The riverdist tutorial/vignette is pretty good and will walk folks through the main steps. For larger river networks and the finer scale data you have, the more likely you can end up with weird and even persistent things you need to clean (stream braiding, unconnected segments, etc). Be aware this cleaning of topology is a stepwise process and it can take some time, but once you have everything cleaned up, you can save the data and re-use network in the future.

I’ll try to screen shot this process below. First let’s run the cleanup() function.

# do a clean up:
rivs_fixed <- cleanup(rivers = rivs)

Initially the cleanup will identify all the segments and attempt to remove duplicate lines and dissolve segments that are split (uncessarily). When it asks whether to Dissolve or not, select yes.

Dissolving... 
Simplified from 615 to 243 segments. 

 Checking for segments with length less than the connectivity tolerance... 

 0 microsegments identified. 

 Checking if splitting segments is needed... 
Checking between-vertex lengths... 
Maximum distance between vertices is 772 
Insert vertices to reduce distances between vertices and increase point snapping precision? (y/n) ?

The Insert vertices question makes a more evenly distributed network so when we try snap points, there will be more vertices available in the network. The histogram provides a sense of the distribution of the segement sizes. I try to select something near the peak so most segments are a consistent size.

step1

step1

Select “y”, and pick a number…I picked 100 meters for this example. The output will then say it’s inserting vertices, followed by:

Please identify segment number of river mouth:

step2

step2

Here I’ve selected segment 192, and then selected the mouth as 1. Once the river mouth has been identified, the program will ask if you want to remove any additional segments. I select n here, but many options.

Cleanup: Remove Additional Segments/Braiding

The cleanup function will then continue and ask if you want to remove additional segments, and check for braiding. Just walk through and determine what you want to keep or not. If there are braided segment, you’ll need to pick out the components you want to keep and which to delete. This can be a bit tedious (especially with very fine-scale datasets), but once you walk through, you’ll be able to save your updated network. Here I remove as many of the braided segments as I can (207, 213, 205, 204, 214, 138, 10, 11, 232, 230).

Once done with removing braided segments, riverdist will ask if you want to build segment routes. I select y here.

Save the Cleaned Data

According to riverdist, we should save our cleaned topology file as a Rdata file. Let’s save our file.

save(rivs_fixed, file = "data/nfa_rivs2_fixed.rda")

Plot Cleaned Topology

Let’s take a look at what we did, and why it matters, by plotting the original river network topology vs. the cleaned version.

load("data/nfa_rivs2_fixed.rda")

# now plot the old vs. the new
par(mfrow=c(1,2))
topologydots(rivers=rivs)
graphics::title("Raw River Topology", family="Roboto Condensed")
topologydots(rivs_fixed)
graphics::title("Clean River Topology", family="Roboto Condensed")

We can see that the Clean river topology has fewer nodes and is a cleaner network, which means it will be more accurate and faster for calculating network distances.

## null device 
##           1

Snapping Points to Lines

Now that we have a river network, we can snap points to the nearest line, and use that to calculate distances between sites along the river network. First let’s grab some points! For this example let’s go back and grab all CDEC stations from the NF American watershed. To do that I searched for any/all stations in the American River basin using this interface. The resulting page I ended up with was this. Then I used the web-scraping technique I showed in the first post in this series) to pull that data in.

Get Point Data (for Snapping)

library(rvest)

url <- "https://cdec.water.ca.gov/cgi-progs/staSearch?sta=&sensor=211&dur=&active=&lon1=&lon2=&lat1=&lat2=&elev1=-5&elev2=99000&nearby=&basin_chk=on&basin=AMERICAN+R&hydro=CENTRAL+COAST&county=ALAMEDA&operator=%2B&display=sta"

df <- url %>%
  read_html() %>% 
  html_nodes(xpath='//*[@id="main_content"]/div/div[1]/table') %>% # inspect and xpath with chrome
  html_table()
df_locs <- df[[1]] # reads in as a list, so we are just eliminating the list and making this a dataframe

head(df_locs)
##    ID                   Station Name River Basin     County Longitude
## 1 ABN                   LAKE AUDRAIN  AMERICAN R  EL DORADO -120.0370
## 2 ACK      ARCADE CK NR DEL PASO HTS  AMERICAN R SACRAMENTO -121.3820
## 3 ADR               AUBURN DAM RIDGE  AMERICAN R     PLACER -121.0450
## 4 AFD    AMERICAN R BELOW FOLSOM DAM  AMERICAN R SACRAMENTO -121.1667
## 5 AFO    AMERICAN RIVER AT FAIR OAKS  AMERICAN R SACRAMENTO -121.2277
## 6 AHZ AMERICAN R AT HAZEL AVE BRIDGE  AMERICAN R SACRAMENTO -121.2240
##   Latitude Elevation Feet                    Operator Map
## 1 38.82000     7300 &nbsp Placerville Ranger District    
## 2 38.63400       50 &nbsp        US Geological Survey    
## 3 38.88200     1200 &nbsp    US Bureau of Reclamation    
## 4 38.68830      220 &nbsp        US Geological Survey    
## 5 38.63546       72 &nbsp        US Geological Survey    
## 6 38.63600      100 &nbsp        US Geological Survey
dim(df_locs)
## [1] 139   9
# cleanup data
df_locs <- df_locs %>% rename(station="Station Name", river_basin="River Basin", 
                              lon=Longitude, lat=Latitude, elev_ft=`Elevation Feet`) %>%
  select(-Map)

# remove and filter out the weird UNIX code "&nbsp"
df_locs$elev_ft <- as.numeric(gsub(pattern = "&nbsp", replacement = "", df_locs$elev_ft)) 

# make CDEC data sf object (spatial): 
df_locs <- st_as_sf(df_locs, 
                    coords = c("lon", "lat"), # for point data
                    remove = F, # don't remove these lat/lon cols from df
                    crs = 4326) # add projection (this is WGS84)