Gauge data and gauged period

Author

Galen Holt

library(foreach)
library(dplyr)
library(hydrogauge)
library(reticulate)

import mdba_gauge_getter as gg

We often end up needing to pull gauge data for lots of projects. If we just want flow data, and especially if we’re working in python, there’s the mdba_gauge_getter, and I’ve written {hydrogauge} for R and expanded it to cover Victoria, NSW, and Qld. A primary advantage of {hydrogauge} is that it can access more of the API calls and return any desired variables.

One issue we often end up having is wanting to grab gauge data for a range of dates that may or may not go earlier than a gauge was put in place. There’s actually an issue with the Kisters API for that, where it silently returns values of 0 for those dates, which is not good. Because it comes from the API itself, it affects both packages.

For example, we can go to the website and find the period of record for NSW gauge 410007 (gauges_to_pull[138]) is 10/01/1979 - present. We’ll show first what happens for three situations with both mdba_gauge_getter and hydrogauge

all dates pre-gauge
all dates within period
dates before and after

and then we’ll show how to handle it in {hydrogauge}.

Set some variables

gaugenum <- '410007'

# initial date (plus a couple days to be sure there's data)
gaugestart = lubridate::ymd(19790110)
# a week days later- this data should all exist
weeklater <- lubridate::ymd(19790117)
# Before initial date
weekbefore = lubridate::ymd(19790103)
daybefore <- lubridate::ymd(19790109)

All dates pre-gauge

If we ask for the period before the gauge is operational {hydrogauge} passes the API error through.

get_ts_traces(portal = 'NSW', 
                site_list = gaugenum, 
                var_list = '141',
                start_time = weekbefore,
                end_time = daybefore,
                interval = 'day',
                data_type = 'mean')

Error in {: task 1 failed - "API error number 126. Message: No data within specified period"

We get an empty dataframe from mdba_gauge_getter

demo_levs_pre = gg.gauge_pull(r.gaugenum, start_time_user = r.weekbefore, end_time_user = r.daybefore)

C:\Users\galen\DOCUME~1\code\WEB_TE~1\GALEN_~1\RpyEnvs\PYTEST~1\VENV~1\Lib\site-packages\mdba_gauge_getter\gauge_getter.py:82: FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
  gauges['State'] = gauges['gauge_owner'].str.strip().str.split(' ', 1).str[0]

demo_levs_pre

Empty DataFrame
Columns: [DATASOURCEID, SITEID, SUBJECTID, DATETIME, VALUE, QUALITYCODE]
Index: []

So, that’s slightly different behavior, but neither is returning misleading data.

All dates with gauge

Now {hydrogauge} gives a dataframe.

get_ts_traces(portal = 'NSW', 
                site_list = gaugenum, 
                var_list = '141',
                start_time = gaugestart,
                end_time = weeklater,
                interval = 'day',
                data_type = 'mean')

# A tibble: 8 × 20
  error_num compressed site_short_name    longitude site_name  latitude org_name
      <int> <chr>      <chr>                  <dbl> <chr>         <dbl> <chr>   
1         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
2         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
3         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
4         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
5         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
6         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
7         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
8         0 0          YANCO CK @ OFFTAKE      146. YANCO CRE…    -34.7 WaterNSW
# ℹ 13 more variables: value <dbl>, time <dttm>, quality_codes_id <int>,
#   site <chr>, variable_short_name <chr>, precision <chr>, subdesc <chr>,
#   variable <chr>, units <chr>, variable_name <chr>, database_timezone <chr>,
#   quality_codes <chr>, data_type <chr>

As does mdba_gauge_getter

demo_levs_exists = gg.gauge_pull(r.gaugenum, start_time_user = r.gaugestart, end_time_user = r.weeklater)
demo_levs_exists

  DATASOURCEID  SITEID SUBJECTID    DATETIME    VALUE  QUALITYCODE
0          NSW  410007     WATER  1979-01-10  510.939          255
1          NSW  410007     WATER  1979-01-11  551.848          130
2          NSW  410007     WATER  1979-01-12  535.988          130
3          NSW  410007     WATER  1979-01-13  515.685          130
4          NSW  410007     WATER  1979-01-14  512.529          130
5          NSW  410007     WATER  1979-01-15  486.422          130
6          NSW  410007     WATER  1979-01-16  491.710          130
7          NSW  410007     WATER  1979-01-17  532.730          130

That again returns what it should. All the dates have data.

Dates spanning gauge start

Now {hydrogauge} gives a dataframe, but that initial period has value = 0, which is wrong. It should be NA, but the API returns 0 silently.

get_ts_traces(portal = 'NSW', 
                site_list = gaugenum, 
                var_list = '141',
                start_time = weekbefore,
                end_time = weeklater,
                interval = 'day',
                data_type = 'mean')

# A tibble: 15 × 20
   error_num compressed site_short_name    longitude site_name latitude org_name
       <int> <chr>      <chr>                  <dbl> <chr>        <dbl> <chr>   
 1         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 2         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 3         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 4         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 5         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 6         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 7         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 8         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
 9         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
10         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
11         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
12         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
13         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
14         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
15         0 0          YANCO CK @ OFFTAKE      146. YANCO CR…    -34.7 WaterNSW
# ℹ 13 more variables: value <dbl>, time <dttm>, quality_codes_id <int>,
#   site <chr>, variable_short_name <chr>, precision <chr>, subdesc <chr>,
#   variable <chr>, units <chr>, variable_name <chr>, database_timezone <chr>,
#   quality_codes <chr>, data_type <chr>

The same thing happens with mdba_gauge_getter

demo_levs_span = gg.gauge_pull(r.gaugenum, start_time_user = r.weekbefore, end_time_user = r.weeklater)
demo_levs_span

   DATASOURCEID  SITEID SUBJECTID    DATETIME    VALUE  QUALITYCODE
0           NSW  410007     WATER  1979-01-03    0.000          255
1           NSW  410007     WATER  1979-01-04    0.000          255
2           NSW  410007     WATER  1979-01-05    0.000          255
3           NSW  410007     WATER  1979-01-06    0.000          255
4           NSW  410007     WATER  1979-01-07    0.000          255
5           NSW  410007     WATER  1979-01-08    0.000          255
6           NSW  410007     WATER  1979-01-09    0.000          255
7           NSW  410007     WATER  1979-01-10  510.939          255
8           NSW  410007     WATER  1979-01-11  551.848          130
9           NSW  410007     WATER  1979-01-12  535.988          130
10          NSW  410007     WATER  1979-01-13  515.685          130
11          NSW  410007     WATER  1979-01-14  512.529          130
12          NSW  410007     WATER  1979-01-15  486.422          130
13          NSW  410007     WATER  1979-01-16  491.710          130
14          NSW  410007     WATER  1979-01-17  532.730          130

So, that’s not good. Especially the silent part

Solution

I can’t figure out how to use mdba_gauge_getter to automatically query the period of record. So for that, you have to make sure to check for every request.

On the other hand, {hydrogauge} can hit other API calls (get_variable_list) that return the period of record, and then use them to call get_ts_traces. We can do this manually, which is a bit of a hassle, because get_variable_list returns all the variables, and we have to remember to filter for ‘100’ if we want ‘100’ or ‘140’, or ‘141’ (see testing hydrogauge).

With multiple sites, we always will have to loop, since the start_times (and maybe end_time) will differ.

varl <- get_variable_list(portal = 'NSW', 
                          site_list = c('422028', '410007'),
                          datasource = 'A')

varl100 <- varl %>% dplyr::filter(variable == "100.00")

# If we have multiple sites, we have to loop over them.
siteout_m <- foreach(i = 1:nrow(varl100)) %do% {
  get_ts_traces(portal = 'NSW', 
                site_list = varl100$site[i], 
                var_list = '141',
                start_time = varl100$period_start[i],
                end_time = varl100$period_end[i],
                interval = 'day',
                data_type = 'mean')
}

That produces a list of dataframes for the period of record, which we can check do not have leading zeros.

head(siteout_m[[1]] |> select(time, value, site_name, everything() ))

# A tibble: 6 × 20
  time                value site_name       error_num compressed site_short_name
  <dttm>              <dbl> <chr>               <int> <chr>      <chr>          
1 1999-06-25 20:38:00  695. BARWON RIVER A…         0 0          BARWON @ BEEME…
2 1999-06-26 20:38:00  748. BARWON RIVER A…         0 0          BARWON @ BEEME…
3 1999-06-27 20:38:00  829. BARWON RIVER A…         0 0          BARWON @ BEEME…
4 1999-06-28 20:38:00  890. BARWON RIVER A…         0 0          BARWON @ BEEME…
5 1999-06-29 20:38:00  922. BARWON RIVER A…         0 0          BARWON @ BEEME…
6 1999-06-30 20:38:00  942. BARWON RIVER A…         0 0          BARWON @ BEEME…
# ℹ 14 more variables: longitude <dbl>, latitude <dbl>, org_name <chr>,
#   quality_codes_id <int>, site <chr>, variable_short_name <chr>,
#   precision <chr>, subdesc <chr>, variable <chr>, units <chr>,
#   variable_name <chr>, database_timezone <chr>, quality_codes <chr>,
#   data_type <chr>

head(siteout_m[[2]] |> select(time, value, site_name, everything() ))

# A tibble: 6 × 20
  time                value site_name       error_num compressed site_short_name
  <dttm>              <dbl> <chr>               <int> <chr>      <chr>          
1 1979-01-09 15:30:00  514. YANCO CREEK AT…         0 0          YANCO CK @ OFF…
2 1979-01-10 15:30:00  553. YANCO CREEK AT…         0 0          YANCO CK @ OFF…
3 1979-01-11 15:30:00  534. YANCO CREEK AT…         0 0          YANCO CK @ OFF…
4 1979-01-12 15:30:00  515. YANCO CREEK AT…         0 0          YANCO CK @ OFF…
5 1979-01-13 15:30:00  512. YANCO CREEK AT…         0 0          YANCO CK @ OFF…
6 1979-01-14 15:30:00  485. YANCO CREEK AT…         0 0          YANCO CK @ OFF…
# ℹ 14 more variables: longitude <dbl>, latitude <dbl>, org_name <chr>,
#   quality_codes_id <int>, site <chr>, variable_short_name <chr>,
#   precision <chr>, subdesc <chr>, variable <chr>, units <chr>,
#   variable_name <chr>, database_timezone <chr>, quality_codes <chr>,
#   data_type <chr>

The cleaner way used to be to use get_ts_traces2, which does the hunting for dates automatically when given start_time = 'all' and end_time = 'all'. This function is now deprecated while a more general solution is found.

siteout_a <- get_ts_traces2(portal = 'NSW', 
                            site_list = c('422028', '410007'), 
                            var_list = "141",
                            start_time = 'all',
                            end_time = 'all',
                            interval = 'day',
                            data_type = 'mean',
                            returnformat = 'sitelist')

That’s really just a wrapper over the manual method above, and also does not contain leading zeros. It can return a long dataframe if returnformat = 'df', but for comparison I’ve used returnformat = 'sitelist'.

head(siteout_a[[1]] |> select(time, value, site_name, everything() ))
head(siteout_a[[2]] |> select(time, value, site_name, everything() ))