---
title: "Indexing and Referencing"
author: "dblodgett@usgs.gov"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Indexing and Referencing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(nhdplusTools)

local <- (Sys.getenv("BUILD_VIGNETTES") == "TRUE")
if(local) {
  cache_path <- file.path(nhdplusTools_data_dir(), "index_v")
} else {
  cache_path <- tempdir()
}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=6, 
  fig.height=4,
  eval=local,
  cache=local,
  cache.path=(cache_path)
)

oldoption <- options(scipen = 9999,
                     "rgdal_show_exportToProj4_warnings"="none")
```

## Introduction

`nhdplusTools` offers a few indexing functions and some supporting functions that are worth being aware of. Most of these functions wrap similar but more general functions from [`hydroloom`](https://doi-usgs.github.io/hydroloom/reference/index.html#indexing-and-linear-referencing). 

The core functions are: `get_flowline_index()` and `get_waterbody_index()` -- they do the heavy lifting of finding flowlines and waterbodies near point locations. 

For flowline indexes, there are a number of useful utilities: 
- `disambiguate_flowline_indexes()` uses numeric or character attributes to attempt to determine the best flowline match when many near by matches exist. This is especially useful for mainstem / tributary disambiguation.
- `get_hydro_location()` retrieves the point location of an index along a flowline.
- `rescale_measures()` converts 0:100 reachcode measures to 0:100 flowline measures.
- `get_partial_length()` retrieves a partial length (upstream and downstream) of an index location.
- `get_path_lengths()` retrieves the distance between the outlets of pairs of flowlines. Use with `get_partial_length()` to determine network distance between indexes.

For waterbody indexes, the `get_wb_outlet()` function is helpful too determine which flowline is the outlet of a waterbody.

## Flowline Indexing

First we'll load up some data. In this case, we use flowlines from the NHDPlus subset that's included in the package and a set of points to index. We'll use the NHDPlus Gages layer for this example. The data in this example is big. The R session needs a lot of memory to hold the whole NHDPlus flowline layer and run the calculations. 

```{r nhdplus_path_setup, echo=FALSE, include=FALSE}
library(dplyr, warn.conflicts = FALSE)

work_dir <- file.path(nhdplusTools_data_dir(), "index_vignette")
dir.create(work_dir, recursive = TRUE)

source(system.file("extdata/sample_data.R", package = "nhdplusTools"))

file.copy(sample_data,
          file.path(work_dir, "natseamless.gpkg"))
```

```{r nhdplus_path, echo=TRUE}
library(nhdplusTools)

nhdplus_path(file.path(work_dir, "natseamless.gpkg"))

flowlines <- sf::read_sf(nhdplus_path(), "NHDFlowline_Network") |>
  sf::st_zm()
gages <- sf::read_sf(nhdplus_path(), "Gage")
```

Now we can call `get_flowline_index()` on the data we just loaded. The `get_flowline_index()` function has an input, `search_radius` which should correspond to the units of the `points` input. Projection and unit conversion is attempted. See the function documentation for details. See the documentation of the [nn2 function from the RANN package](https://cran.r-project.org/package=RANN) for more information on how the search works.

NOTE: If you have a small area in which you need flowline indexes, `get_flowline_index()` has an 
option to download flowlines in the bounding box of your input points.

```{r get_indexes}
indexes <- get_flowline_index(sf::st_transform(flowlines, 5070), # albers
                              sf::st_transform(sf::st_geometry(gages), 5070), 
                              search_radius = units::set_units(200, "meters"), 
                              max_matches = 1)

indexes <- left_join(sf::st_sf(id = c(1:nrow(gages)), 
                               geom = sf::st_geometry(gages)), 
                     indexes, by = "id")

plot(sf::st_geometry(sf::st_zm(flowlines)))
plot(sf::st_geometry(indexes), add = TRUE)

```

Now let's look at the results and see how the `get_flowline_index()` did. The below shows the percent of COMIDs and REACHCODEs that match and shows a histogram of the measure differences for the REACHCODEs that were matched.

```{r analyze_index}
p_match <- 100 * length(which(indexes$COMID %in% gages$FLComID)) / nrow(gages)
paste0(round(p_match, digits = 1), 
       "% were found to match the COMID in the NHDPlus gages layer")

p_match <- 100 * length(which(indexes$REACHCODE %in% gages$REACHCODE)) / nrow(gages)
paste0(round(p_match, digits = 1), 
       "% were found to match the REACHCODE in the NHDPlus gages layer")

matched <- cbind(indexes, 
                 dplyr::select(sf::st_drop_geometry(gages), 
                               REACHCODE_ref = REACHCODE, 
                               COMID_ref = FLComID, 
                               REACH_meas_ref = Measure)) %>%
  dplyr::filter(REACHCODE == REACHCODE_ref) %>%
  dplyr::mutate(REACH_meas_diff = REACH_meas - REACH_meas_ref)

hist(matched$REACH_meas_diff, breaks = 100, 
     main = "Difference in measure for gages matched to the same reach.")

round(quantile(matched$REACH_meas_diff, 
               probs = c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1)), 
      digits = 2)
```

## Flowline Indexing with higher precision

The above example used the native nodes of the NHDPlus as the potential measure snap locations. The `get_flowline_index()` function has the ability to refine these by segmentizing the line to some given resolution. Let's try the same thing using a resolution of 10m and see if we can do any better. 

Note that the `sf::st_segmentize` function takes care of the distance conversion and segmentizes our lon/lat lines to 10m on the fly. Also note, we are working in units of degrees for this sample. (this is probably not a good way to do things, but is being shown here for the sake of demonstration)

```{r get_indexes_precise}
indexes <- get_flowline_index(flowlines, 
                              sf::st_geometry(gages), 
                              search_radius = units::set_units(0.1, "degrees"), 
                              precision = 10)

indexes <- left_join(data.frame(id = seq_len(nrow(gages))), indexes, by = "id")
```

Now lets look at out comparison again.

```{r analyze_inde_precise}
p_match <- 100 * length(which(indexes$COMID %in% gages$FLComID)) / nrow(gages)
paste0(round(p_match, digits = 1), 
       "% were found to match the COMID in the NHDPlus gages layer")

p_match <- 100 * length(which(indexes$REACHCODE %in% gages$REACHCODE)) / nrow(gages)
paste0(round(p_match, digits = 1), 
       "% were found to match the REACHCODE in the NHDPlus gages layer")

matched <- cbind(indexes, 
                 dplyr::select(sf::st_set_geometry(gages, NULL), 
                               REACHCODE_ref = REACHCODE, 
                               COMID_ref = FLComID, 
                               REACH_meas_ref = Measure)) %>%
  dplyr::filter(REACHCODE == REACHCODE_ref) %>%
  dplyr::mutate(REACH_meas_diff = REACH_meas - REACH_meas_ref)

hist(matched$REACH_meas_diff, breaks = 100, 
     main = "Difference in measure for gages matched to the same reach.")

round(quantile(matched$REACH_meas_diff, 
               probs = c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1)), digits = 2)
```

## Finding multiple indexes

`get_flowline_index()` has a parameter `max_matches` that controls how many indexed flowlines are returned per point. This is useful for points that are near many flowlines and some further disambiguation is needed to determine exactly which flowline the point should be indexed to.

For this example, we'll just look at one point but keep all the sites for disambiguation down below.

```{r multi}
all_indexes <- get_flowline_index(flowlines,
                                  sf::st_geometry(gages), 
                                  search_radius = units::set_units(0.01, "degrees"), 
                                  max_matches = 10)

indexes <- left_join(sf::st_sf(id = 42, 
                               geom = sf::st_geometry(gages)[42]), 
                     all_indexes[all_indexes$id == 42, ], by = "id")

plot(sf::st_geometry(sf::st_buffer(indexes, 500)), border = NA)
plot(sf::st_geometry(indexes), add = TRUE)
plot(sf::st_geometry(sf::st_zm(flowlines)), col = "blue", add = TRUE)
indexes
```

Now that we have multiple matches, we can use the function `disambiguate_flowline_indexes()` to figure out which is the "best" match. "best" is in scare quotes here because there are many potential sources of ambiguity here and we are really just narrowing down based on the information we have at hand. (read below for a case in point)

Below, we run `disambiguate_flowline_indexes()` on all the indexes we found then pull out the one we looked at just above as an example (gage 42 in our list).

```{r disamb}
unique_indexes <- disambiguate_flowline_indexes(
  all_indexes, 
  flowlines[, c("COMID", "TotDASqKM"), drop = TRUE],
  data.frame(ID = seq_len(nrow(gages)),
             area = gages$DASqKm))

unique_index <- left_join(sf::st_sf(id = 42, 
                                    geom = sf::st_geometry(gages)[42]), 
                          unique_indexes[unique_indexes$id == 42, ], by = "id")

plot(sf::st_geometry(sf::st_buffer(indexes, 500)), border = NA)
plot(sf::st_geometry(indexes), add = TRUE)
plot(sf::st_geometry(sf::st_zm(flowlines[flowlines$COMID %in% indexes$COMID,])), 
     col = "grey", lwd = 3, add = TRUE)
plot(sf::st_geometry(sf::st_zm(flowlines[flowlines$COMID %in% unique_index$COMID,])),
     col = "blue", add = TRUE)

unique_index
```

As can be seen in this example, a drainage area disambiguation resulted in an unexpected result. Further inspection of this particular gage and the network data used, shows that the main path through this small diversion is coded counter to the main path in the real world. So in this case, if our interest is in the best match to the hydrographic network data we have, this **is** the best match, as the closest spatial match is incorrectly modeled by the hydrographic data set. As always, buyer beware! 

## Waterbody Indexing

The `get_flowline_index()` function estimates a hydrographic address as a linear reference to a flowline. For points near bodies of water, these could be an inappropriate kind of index. This is because where flowlines run through a waterbody. they are "artificial paths" and do not represent the waterbody. The `get_waterbody_index()` function is intended to address points that are in or near the shore of a waterbody.

This next block of code loads the NHDPlus Waterbody layer and creates an interactive map. Of interest on gages that are near the short of bodies of water but far away from flowlines. Note that we drop the NHDPlus geometry and use the source `LonSite` and `LatSite` attributes for geometry.

```{r waterbodies}
waterbody <- sf::read_sf(nhdplus_path(), "NHDWaterbody")

gages <- sf::st_drop_geometry(gages) %>%
  dplyr::filter(!is.na(LonSite)) %>%
  sf::st_as_sf(coords = c("LonSite", "LatSite"), crs = 4326)

plot(sf::st_geometry(sf::st_zm(flowlines)))
plot(sf::st_geometry(waterbody), add = TRUE)
plot(sf::st_geometry(gages), add = TRUE)
```

This next block shows how to call `get_flowline_index()` and `get_waterbody_index()` and what the output looks like. 

```{r index_waterbodies}

flowline_indexes <- left_join(data.frame(id = seq_len(nrow(gages))),
                              get_flowline_index(
                                sf::st_transform(flowlines, 5070), 
                                sf::st_geometry(sf::st_transform(gages, 5070)), 
                                search_radius = units::set_units(200, "m")), by = "id")
                              
indexed_gages <- cbind(dplyr::select(gages, 
                                      orig_REACHCODE = REACHCODE, 
                                      orig_Measure = Measure, 
                                      FLComID, 
                                      STATION_NM), 
                        flowline_indexes,
                        get_waterbody_index(
                          st_transform(waterbody, 5070), 
                          st_transform(gages, 5070), 
                          st_drop_geometry(flowlines), 
                          search_radius = units::set_units(200, "m")))

plot(sf::st_geometry(sf::st_zm(flowlines)))
plot(sf::st_geometry(waterbody), add = TRUE)
plot(sf::st_geometry(indexed_gages), add = TRUE)

dplyr::select(sf::st_drop_geometry(indexed_gages), near_wb_COMID, near_wb_dist, in_wb_COMID, outlet_fline_COMID)

```

```{r teardown, include=FALSE}
options(oldoption)
if(Sys.getenv("BUILD_VIGNETTES") != "TRUE") {
  unlink(work_dir, recursive = TRUE)
}
```