Data Preparation

Step 2: Assemble Available Data

The availability of data often determines whether stressor-response analysis can be applied. After collecting all available data, assess whether enough data are available and whether they provide sufficient temporal and spatial coverage.

Identify sources of data

State and national monitoring data sets often are the primary sources of data, but other entities also might have applicable data. A list of potential data sources is provided in the Data Library.

Are there enough data?

The more data you can obtain, the more flexibility you will have in analyzing the data. At a minimum, 10 independent samples are required for each degree of freedom estimated in the model. For a simple linear regression line defined by two coefficients, this rule of thumb suggests that a minimum of 20 samples is required. Each additional variable you consider further increases this minimum requirement, so if a single classification variable is considered in addition to the linear regression, the minimum number of samples increases to 30. The precision of different model parameter estimates also depends on the number of samples, so inferences from the stressor-response model are more accurate with more data.
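
The arithmetic behind this rule of thumb can be sketched in a few lines. The function name and its parameters are illustrative assumptions, not part of any published tool; the multiplier of 10 comes from the text above.

```python
def min_samples(n_coefficients, samples_per_coefficient=10):
    """Rule-of-thumb minimum number of independent samples for a model.

    n_coefficients: number of coefficients (degrees of freedom) estimated.
    samples_per_coefficient: multiplier from the rule of thumb (10 in the text).
    """
    if n_coefficients < 1:
        raise ValueError("a model must estimate at least one coefficient")
    return n_coefficients * samples_per_coefficient


# Simple linear regression: intercept + slope.
simple_regression = min_samples(2)

# The same regression plus one classification variable.
with_classification = min_samples(3)

print(simple_regression, with_classification)
```

These counts are a lower bound for planning purposes. They say nothing about whether the samples are independent or cover the conditions of interest.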

Do the data provide adequate temporal and spatial coverage?

Consider whether the temporal or spatial coverage of the available data limits the applicability of the analysis results. For example, criteria derived from data collected only in the summer might be applicable only during the summer. However, summer is commonly regarded as the critical period for the deleterious effects of elevated nutrient concentrations (e.g., primary production rates increase with warmer temperatures), so criteria based on summer data often are broadly protective.

Data matching

Matching data collected by different entities, or at different frequencies by the same agency, can be challenging. For example, in a particular lake, weekly measurements of cyanotoxins might be available, but only a single concentration each for total nitrogen (TN) and total phosphorus (TP) is available, and from a different year. Deciding how to match these data requires you to understand the underlying processes by which elevated concentrations of nutrients are manifested as ecological effects and the management decisions that could be informed by the analysis. Possible questions to consider include the following:

  • What are the timescales of the assessment endpoint and the management goal (e.g., the duration and frequency of the assessment endpoint)?
  • Cyanobacteria blooms and associated elevated cyanotoxin concentrations can appear and disappear within days. How often should I allow exceedances of a cyanotoxin threshold while still assessing whether a water body is meeting its designated uses?
  • What are the timescales of the nutrient concentrations? Nutrient concentrations in streams can vary substantially over short periods of time as flow changes, whereas nutrient concentrations in receiving lakes can be somewhat less variable in time.
  • How quickly do I expect assessment endpoints to change in response to changes in nutrient concentrations? Conventional wisdom suggests that lakes respond to seasonally integrated nutrient loads, whereas in streams, near-field effects can occur in response to much briefer periods of elevated concentrations. Beneficial effects from reductions in phosphorus loads can occur relatively quickly in small streams, whereas in lakes, reductions in phosphorus loads might not yield immediate changes because loading from lake sediments may continue.

Data for different variables can be matched based on an understanding of the relevant temporal and spatial scales. For example, you might match summer mean nutrient concentrations in lakes with all cyanotoxin measurements collected during that summer because you expect that the variability of cyanotoxin concentrations within one summer is not related to the overall nutrient load. Rather, that within-summer variability reflects the response of cyanobacteria to other environmental factors such as temperature, growth dynamics, and water column stability.
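
The matching approach described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the June through August definition of summer, the record layout (lake, date, value), and all of the sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

SUMMER_MONTHS = {6, 7, 8}  # assumption: "summer" means June through August


def summer_means(nutrient_samples):
    """Average nutrient concentrations by (lake, year), summer samples only."""
    grouped = defaultdict(list)
    for lake, sample_date, concentration in nutrient_samples:
        if sample_date.month in SUMMER_MONTHS:
            grouped[(lake, sample_date.year)].append(concentration)
    return {key: mean(values) for key, values in grouped.items()}


def match_toxin_to_nutrients(toxin_samples, nutrient_samples):
    """Pair each summer cyanotoxin measurement with the summer mean
    nutrient concentration from the same lake and year."""
    means = summer_means(nutrient_samples)
    matched = []
    for lake, sample_date, toxin in toxin_samples:
        key = (lake, sample_date.year)
        if sample_date.month in SUMMER_MONTHS and key in means:
            matched.append((lake, sample_date, toxin, means[key]))
    return matched


# Hypothetical records: (lake, sampling date, concentration).
tp = [
    ("Lake A", date(2010, 6, 15), 40.0),
    ("Lake A", date(2010, 8, 15), 60.0),
    ("Lake A", date(2010, 11, 1), 20.0),  # outside summer, excluded from the mean
]
toxin = [
    ("Lake A", date(2010, 7, 1), 0.5),
    ("Lake A", date(2010, 7, 8), 2.0),
    ("Lake A", date(2011, 7, 1), 1.0),  # no nutrient data for 2011, left unmatched
]

pairs = match_toxin_to_nutrients(toxin, tp)
for row in pairs:
    print(row)
```

Toxin measurements without a nutrient mean from the same lake and year are dropped rather than paired with data from another year. Whether that is appropriate depends on the answers to the questions listed above.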

Step 3: Explore Relationships across Data

Exploratory data analysis is a critical first step in understanding and visualizing relationships across different variables. It can provide you with initial insights into how different parameters vary in relation to each other. You can determine whether different variables are related and gain an initial understanding of the shape of those relationships. Data gaps and unanticipated relationships between variables also can be identified by exploring all of the available data. You can use graphical or numerical methods to explore the available data.

Graphical methods

  • Scatter plots: One of the simplest ways to visualize the relationship between two variables (see Figure 2).

Figure 2. Simultaneous scatter plots of several variables can be a convenient way to examine relationships. These pairwise relationships suggest that detection limits affect observations of chlorophyll a (as shown by the nearly straight lower boundary of the cloud of points) and that TP, TN, and chlorophyll a are all strongly correlated.


  • Coplots: An enhancement of scatter plots in which data are first grouped with respect to a third variable, then scatter plots are examined within groups (see Figure 3). This technique is particularly useful for examining the potential effect of different classification variables on the relationship between stressor and response variables.



Figure 3. Coplots display scatter plots between variables (e.g., TN and chlorophyll a in lakes) while conditioning on a third variable (e.g., lake color). The resulting plot can show how the third variable influences relationships estimated between the two variables of interest.
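
A numerical counterpart to a coplot can help confirm what the panels suggest. The sketch below groups records by a third variable and fits a separate least-squares slope within each group. It does not draw a plot, and the two-group split, the cutoff, and the data values are hypothetical choices made only for illustration.

```python
from statistics import mean


def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def conditioned_slopes(records, cutoff):
    """Split (x, y, z) records at z <= cutoff and fit a slope in each group."""
    groups = {"low": [], "high": []}
    for x, y, z in records:
        groups["low" if z <= cutoff else "high"].append((x, y))
    return {
        name: slope([x for x, _ in points], [y for _, y in points])
        for name, points in groups.items()
        if len(points) >= 2  # a slope needs at least two points
    }


# Hypothetical records: (log TN, log chlorophyll a, lake color).
records = [
    (1.0, 1.0, 10.0), (2.0, 3.0, 10.0), (3.0, 5.0, 10.0),  # low-color lakes
    (1.0, 1.0, 50.0), (2.0, 1.5, 50.0), (3.0, 2.0, 50.0),  # high-color lakes
]

slopes = conditioned_slopes(records, cutoff=30.0)
print(slopes)
```

A clear difference in slope between groups is the numerical analogue of panels in a coplot that show differently shaped point clouds. It suggests that the conditioning variable might be worth considering as a classification variable.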

Numerical methods

  • Data summaries: Examining means, standard deviations, ranges, and quartiles of different variables can help identify outliers and suggest appropriate variable transformation. For example, measurements for nutrients such as TN or TP often need to be log-transformed to reduce the skewness in their distributions.
  • Correlation analysis: Calculating the correlation coefficients between different pairs of variables can supplement insights gained from examining scatter plots. Strongly correlated variables might need to be included in subsequent analysis.
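
Both numerical methods can be sketched with the Python standard library alone. The concentrations below are invented for illustration, and the helper names are assumptions rather than functions from an existing package.

```python
import math
from statistics import mean, quantiles, stdev


def summarize(values):
    """Mean, standard deviation, range, and quartiles of a variable."""
    q1, median, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values), "sd": stdev(values),
        "min": min(values), "max": max(values),
        "q1": q1, "median": median, "q3": q3,
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical, right-skewed concentrations.
tp = [10.0, 20.0, 40.0, 80.0, 160.0]
chl = [2.0, 4.0, 8.0, 16.0, 32.0]

raw_summary = summarize(tp)

# Log-transform to reduce skewness before examining the relationship.
log_tp = [math.log10(v) for v in tp]
log_chl = [math.log10(v) for v in chl]
log_summary = summarize(log_tp)

r = pearson(log_tp, log_chl)
print(raw_summary, log_summary, r)
```

In the raw summary, a mean well above the median is one sign of right skew. After the log transform, the mean and median should be much closer.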

Case Studies

Yaquina Estuary, OR

  • Gathered available historical and recent data
  • Assembled causal (nutrient) and response variables and physical data at three spatial scales

Pensacola Bay

  • Data gap from 1975 to 1996, when EPA began conducting surveys
  • No dramatic change in concerns from 1975 to present

Coastal Bays in MD and VA

  • Used water quality data collected at multiple stations throughout the bays
  • Monitored parameters include nutrients, Secchi depth, temperature, DO, and salinity

Barnegat Bay-Little Egg Harbor

  • Nitrogen loading rates and ambient concentrations are documented
  • Chlorophyll and primary production values are documented
  • SAV coverage and standing stock of hard clams are documented

Yaquina Estuary

  • Looked at data between 1960 and 1984
  • Gathered nutrient, TSS, DO, salinity data

San Francisco Bay

  • Monitoring stations are located throughout the bay
  • Collected data include nutrients, Secchi depth, temperature, and DO

Nutrients in Neuse River Estuary

  • Monitoring stations are located throughout the estuary
  • Collected data include nutrients, Secchi depth, temperature, and DO

Nutrients in Chesapeake Bay

  • Water quality data collected throughout the Bay
  • Monitored parameters include nutrients, Secchi depth, temperature, and DO

Nutrients in Delaware Estuary

  • Water quality data are collected throughout the estuary
  • Parameters include nutrients, Secchi depth, temperature, and DO

Nutrients in Narragansett Bay

  • Water quality data were regularly collected throughout the bay
  • Data include nutrients, Secchi depth, temperature, DO, and salinity
  • Data pulled from several reports and studies

Nutrient Effects in CA Streams

  • Survey data compiled from wadeable stream monitoring programs
  • Sites selected using a combination of stratification and unequal probability weighting
  • Results from two probability surveys were used

Red River of the North

  • Compiled data from all jurisdictions into a single database
  • Exploratory data analysis could include relationships among ecological components and human disturbance of the relationships

Virginia Freshwater Nutrient Criteria

  • Analyze DEQ monitoring and associated data
  • Ambient monitoring data taken from the state’s lakes since the late 1970s

Proposed Criteria for Tampa Bay

  • Collected data from 52 stations over the course of several decades
  • Parameters included chlorophyll a, light attenuation, nutrients, and nitrogen loads

St. Louis Bay, MS

  • Data from a comprehensive sampling program conducted in 2011 were used
  • Exploratory data analysis was used to characterize data

Wisconsin Lake Phosphorus Criteria

  • Historical TP data from STORET were selected by the three classified regions
  • Data were screened for specific criteria on acreage, depth, and collected date
  • Compared data with more recent data