EGRET Overview and Update

Introduction

Evaluating long-term changes in river conditions (water quality and discharge) is an important use of hydrologic data. To carry out such evaluations, the hydrologist needs tools to facilitate several key steps in the process: acquiring the data records from a variety of sources, structuring it in ways that facilitate the analysis, processing the data with routines that extract information about changes that may be happening, and displaying findings with graphical techniques. An R package called EGRET (Exploration and Graphics for RivEr Trends) has been developed for carrying out each of these steps in an integrated manner. The analysis of the water quality data used in EGRET is a statistical method called Weighted Regressions on Time, Discharge, and Season (WRTDS). This EGRET Update Overview is intended as a guide for users, pointing to the particular capabilities that they may want to use to meet their data analysis goals. The most detailed description of the EGRET package and WRTDS method are contained in the EGRET User Guide.

However, several major enhancements and extensions of the method have been made since the User Guide was published in 2015. This EGRET Update Overview is designed to help a user decide which of EGRET’s capabilities they want to apply, and then point them to the relevant documents that describe those in detail. These include various journal articles that explain motivations for these enhancements, provide the mathematical formulations for them, and show examples of their application. It also includes references to various supporting vignettes that explain how to apply these enhancements to the EGRET package. There are also several special applications for which R scripts have been written that depend on the EGRET package, this EGRET Overview and Update will describe them and point to those scripts.

There is an additional R package that users of EGRET should be aware of, called EGRETci (which stands for confidence intervals). The EGRETci package is designed to provide a variety of approaches to uncertainty analysis related to the water quality trend results computed in EGRET. EGRETci is tightly linked to EGRET, such that output dataframes or other objects produced by EGRET serve as inputs to functions in EGRETci. Throughout this EGRET Update and Overview we will make mention of the relevant functions of EGRETci. There is a home page for EGRETci. At the end of this document there is a brief description in the recent changes that have been made to EGRETci.

General Introduction to the EGRET package

EGRET includes statistics and graphics for streamflow history, water quality trends, and the statistical modeling algorithm Weighted Regressions on Time, Discharge, and Season (WRTDS). It is entirely focused on two data types.

  • A record of daily mean discharge at a single streamgage. This record must be a complete record, no missing values are allowed. It also is designed to deal with streams that are perennial, although there is an adjustment that is made for days with zero or negative discharge, but this is only intended to be used in situations where the number of zero or negative discharge is very small (e.g. less than 0.1% of days).

  • A record of water quality observations at (or near) the streamgage for which the discharge data are available. These water quality observations are only for a single analyte. There are no requirements of maximum or minimum frequency for these data, nor is there a requirement that the frequency be consistent over time. There can be multiple samples on any given day and there can be many days with no samples. There is no time of day information stored for the sample data. This means that the EGRET software is not appropriate for working with data that are believed to experience substantial diurnal variations, unless the samples are always collected at the same time each day. The data can be censored or uncensored. A censored value is one for which the reported value is either: less than some reporting limit, or greater than some reporting limit.

There are analyses in EGRET that work only with discharge data (we call them Flow History) analyses and others that work with water quality data, but all of the water quality analyses depend on their being discharge data covering the entire time period being analyzed.

Sources of information about 11 EGRET topics

1. EGRET overview and data entry

These topics are covered extensively both in the Introduction to the EGRET package and in more detail in the EGRET User Guide.

3. Describing water quality without using the WRTDS model.

In some situations, the analyst may want to simply display a set of water quality data without any use of models or modeling assumptions. This can be a useful addition to showing results derived from the WRTDS statistical model. These are described in section 6 of Introduction to the EGRET package as well as on pages 29 -37 of the EGRET User Guide.

Improved graphical treatment of censored data.

An enhancement of the EGRET packages provides an additional way to graph the data when there are censored values in the data set (typically values reported as “less than” the laboratory reporting limit). The standard way that EGRET depicts less-than values is by showing a vertical line that extends from the reporting limit value all the way to the bottom of the graph. This is not the most visually effective way of showing the data even though it is a more honest approach than simply representing them by plotting them at their reporting limit. The vertical lines tend to draw a great amount of attention to the less-than values.

The enhancement now in the EGRET package is called Random Residuals. What is done here is to take each censored value in the data set and randomly select a value that lies between the reporting limit and zero and show it on various plots as an open circle (as opposed to actual measured values, which are shown as a closed circle).

The function that accomplishes this is called makeAugmentedSample(). These random values are not used in computing any of the WRTDS results (trends, mean concentrations or fluxes) but are only used for graphical representations. They are very valuable for visual presentation but have the drawbacks that: 1) there is no unique “correct” representation of the data. Multiple implementations will result in different random values. 2) the graphs of the data are no longer free from the influence of any statistical model, the computation of these random values depends on an assumption about the distribution of the true value below the reporting limit, and that is based on the assumptions of the fitted WRTDS model.

Having said that, we should recognize that the purpose of many of these graphs of the data is to present a general idea of how the data behave (as a function of time or of discharge) and this approach will provide an appropriate impression of the data. Many of the graphical functions in EGRET have an argument called randomCensored and it should be set to TRUE for this feature to be implemented.

4. Making the best possible estimates of concentration or flux for some specific day, month, or year.

Although the primary purpose for the development of the WRTDS method and EGRET software was to evaluate long term trends, they can also be used for this different purpose: creating the best possible estimate of conditions for any specific day, month, season, or year. For trend analysis (which is discussed in item 5 below) our objective is to remove the confounding influence of the year-to-year variations in discharge in order to see more clearly the underlying trend in water quality. The results from the WRTDS method that are used for the trend analysis purpose are the “Flow-Normalized” values. But here we are considering a different kind of purpose. Here, the question we are focused on is something like: “What’s our best estimate of the flux of total phosphorus into the lake during the month of May 2020?” or “What’s our best estimate of the chloride concentration on February 2, 2019?” For these kinds of estimates we can use a new method called “WRTDS Kalman” or WRTDS_K. In addition there are two publications that describe the concept and the mathematics. Lee, Hirsch and Crawford, 2019 and Zhang and Hirsch, 2019.

WRTDS_K is based on this simple idea. On any given sampled day, the measured concentration will almost certainly differ from the value that WRTDS would predict. The measured value is always a better value to use for that day than is the WRTDS estimate for that day. Furthermore, days surrounding the day of measurement are likely to have values not too different from the sampled day. For any given day, the WRTDS_K method uses a mixture of the WRTDS model’s estimated value and the measured value on the nearest sampled day prior and nearest sampled day after the given day to create an optimal estimate for that day. The mathematics of the method uses an auto-regressive, lag-one day, stochastic model to make these estimates of concentration (and hence flux) for each day in the record.

The new functions in EGRET that are used in this process are WRTDSKalman(), plotWRTDSKalman(), and plotTimeSlice(). The EGRETci package also has new capabilities to estimate the uncertainty of these WRTDS_K estimates. See Graphics for Prediction Intervals.

6. Abrupt change in water quality.

We refer to this as the wall. The idea of the wall is to appropriately handle events that you believe would result in a sharp discontinuity in the way that concentrations behave as a function of discharge, year, and season. WRTDS was designed with a working assumption that changes in the concentration-discharge relationship are gradual, responding to changes in behavior of many actors in the watershed. The use of statistical smoothing in time, which is central to WRTDS, is predicated on the idea that the changes in the system are gradual. The gradual changes could be driven by changes in population (and hence wasteloads), changes in land use or land use practices by many landowners, or many changes in point source controls implemented by many dischargers. But we know that there can be situations that depart from that assumption of gradual change. Some obvious ones are: upgrades to a single dominant point source discharge, large infrastructure changes in storm-water management (e.g. tunnel projects in cities that significantly curtail combined sewer overflows), construction or removal of a major dam (causing the pollutant of interest to mix with a large volume of water in the reservoir significantly changing the concentration-discharge relationship and in the most extreme cases virtually eliminating any relationship between concentration and discharge). There is another category of abrupt change that could also be considered, and that is a “reset” of river conditions that may be the result of an extreme flood or extreme drought. The hypothesis is that the behavior of water quality has changed as a result of this extreme event and that this change is not a short-term thing (i.e. the duration of the event) but rather is something that persists for a number of years. The new “wall” concept can provide an effective way to perform a hypothesis test that the event in question brought about a lasting change, and the approach allows us to describe the nature of the change that took place.

Conceptually the approach is simple, the analyst specifies the date on which this change takes place, and then the smoothing process involved in estimating the WRTDS model places a “wall” at that specific time. What this means is that estimates of the concentration for any time, discharge and season, that lies prior to the time of the wall, is estimated only from data collected before the wall. Conversely the estimates after the wall are only based on data collected after the wall. The result of this process is that the contour surface seen in plotContours() can have a sharp break at a vertical line where the wall is located.

The idea of using this as a way of testing if some specific action brought about a statistically significant change in water quality can be accomplished by using the runGroups() function and letting the two groups be distinguished by being either before or after the wall. The confidence level or uncertainty of this before and after difference can then be determined in using EGRETci::runGroupsBoot() in the EGRETci package. The wall can also be implemented in the runPairs() and runSeries() functions as well.

8. Parallel computing

Because EGRET can be computationally intensive, and EGRETci even more intensive, there are a set of techniques provided to run EGRET processes in parallel and thus greatly compress the amount of time required to complete a large analysis that covers many sites and many analytes. EGRET and EGRETci allow for parallel computing. These techniques can be executed on a single computer with several cores, or they can be executed across a large array of connected computers.

9. errorStats to describe the overall accuracy of a WRTDS model

Sometimes it is useful to have statistics that describe the magnitude of the errors in a WRTDS model, the way we would for a multiple linear regression. A function called errorStats() has been added to the EGRET package. It computes the equivalent of an R-squared value (like in multiple regression). The R-squared is computed on the log of the concentration as well as on the log of the flux. It also provides a root mean squared error and a standard error of prediction in percent. It can provide a useful tool for comparison to similar water quality statistical models, or for comparing different settings for the three window parameters of the WRTDS model (windowY, windowS, and windowQ).

10. Dealing with problems of model convergence

On rare occasions the WRTDS estimation process fails to converge numerically, or results in highly unreasonable results (e.g. estimated concentrations that are far beyond the limits of those observed in the record). These problems are very rare in EGRET, but they can be a little more common in EGRETci, because of the odd properties that the bootstrap samples can have. A fix for these rare cases has been developed for the EGRET package, and it is also used in the EGRETci package. Because these problems are sometimes related to multicollinearity in the Sample data, the idea is to add some “jitter” (small amounts of random noise) to the time and discharge data. This often solves the numerical problems. It should only be used if there is a clear problem with the results. The function used is called jitterSam(). Its use is primarily in functions in the EGRETci package.

11. Alternative discharge variables in the WRTDS model

Sometimes the usual formulation of WRTDS is not ideal because we believe that the daily discharge on the sampling day is not a very good explanatory variable and we have an idea of a better variable we might use. Dams and reservoirs create situations where this may be an issue. It may turn out that a better explanatory variable model for the WRTDS model might be a rolling-mean of discharges over some period of many days up to (and including) the sampled day instead of the daily discharge on the sampling day. There is an article describing this concept and a script for making such an analysis using an alternative Q variable.

Enhancements of EGRETci

As mentioned above EGRETci is an R package, closely related to EGRET. It runs a variety of analyses that can provide information about the uncertainty of any of the trend results that are run in EGRET. These outputs include p-values that can be used for hypothesis tests, confidence intervals on the magnitude of annual flow-normalized values, and also likelihood measures that can provide an indication of how confident we should be that the sign of the estimated change is actually the correct sign (e.g. our estimate of trend said it was upward, what’s the chance it might actually be downward?). These techniques, for evaluating trend results are described in detail in the Guide to EGRETci 2.0 enhancements.

In addition, there is now a set of functions in EGRETci which allow one to quantify the uncertainty about individual values (not flow-normalized) of average concentration or average flux. Here’s an example of how these results might be used. A colleague has a deterministic model of a particular lake that predicts the summertime chlorophyl a concentration based on the nutrient input over recent months as well as air temperature, sunlight, precipitation and other variables relevant to the processes involved. The colleague asks for your estimates of the actual time series of N and P inputs to the lake at a monthly time step for a period of several years. You provide your colleague with a time series of the WRTDSKalman estimates of the flux for each of these months. They say “Thanks, but can you quantify your uncertainty about those values. Can you place confidence intervals around those values. Is the 90% confidence interval on the order of +/- 5%, or more like 20% or even worse 50%?” These techniques are designed to provide answers to those kinds of questions. They are designed to calculate prediction intervals for daily, monthly, or annual averages of concentration or flux. These methods are described in detail in Graphics for Prediction Intervals. They also provide the means to graphically depict the probability of concentration values exceeding some regulatory threshold value.