library(foreach)
library(dplyr)
library(hydrogauge)
library(reticulate)
Gauge data and gauged period
import mdba_gauge_getter as gg
We often end up needing to pull gauge data for lots of projects. If we just want flow data, and especially if we’re working in python, there’s the mdba_gauge_getter, and I’ve written {hydrogauge} for R and expanded it to cover Victoria, NSW, and Qld. A primary advantage of {hydrogauge} is that it can access more of the API calls and return any desired variables.
One issue we often end up having is wanting to grab gauge data for a range of dates that may or may not go earlier than a gauge was put in place. There’s actually an issue with the Kisters API for that, where it silently returns values of 0 for those dates, which is not good. Because it comes from the API itself, it affects both packages.
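One defensive pattern is to clamp the requested window to a known period of record before calling either package, so the API is never asked for pre-gauge dates at all. A minimal sketch (my own helper, not part of either package; the period-of-record dates would come from the gauge metadata):

```python
from datetime import date

def clamp_to_record(start, end, record_start, record_end=None):
    """Clamp a requested [start, end] window to a gauge's period of record.

    Returns None when the request falls entirely outside the record,
    so the caller can skip the API call instead of getting padded zeros.
    """
    start = max(start, record_start)
    if record_end is not None:
        end = min(end, record_end)
    if start > end:
        return None  # no overlap with the period of record
    return start, end

# A request spanning the 10/01/1979 start of record for gauge 410007
# gets trimmed to begin at the record start
print(clamp_to_record(date(1979, 1, 3), date(1979, 1, 17), date(1979, 1, 10)))
```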
For example, we can go to the website and find the period of record for NSW gauge 410007 (gauges_to_pull[138]) is 10/01/1979 - present. We'll first show what happens in three situations with both mdba_gauge_getter and hydrogauge:

all dates pre-gauge
all dates within period
dates before and after

and then we'll show how to handle it in {hydrogauge}.
Set some variables
gaugenum <- '410007'

# initial date (plus a couple days to be sure there's data)
gaugestart <- lubridate::ymd(19790110)

# a week later - this data should all exist
weeklater <- lubridate::ymd(19790117)

# Before initial date
weekbefore <- lubridate::ymd(19790103)
daybefore <- lubridate::ymd(19790109)
All dates pre-gauge
If we ask for the period before the gauge is operational {hydrogauge} passes the API error through.
get_ts_traces(portal = 'NSW',
site_list = gaugenum,
var_list = '141',
start_time = weekbefore,
end_time = daybefore,
interval = 'day',
data_type = 'mean')
Error in {: task 1 failed - "API error number 126. Message: No data within specified period"
We get an empty dataframe from mdba_gauge_getter
demo_levs_pre = gg.gauge_pull(r.gaugenum, start_time_user = r.weekbefore, end_time_user = r.daybefore)
C:\Users\galen\DOCUME~1\code\WEB_TE~1\GALEN_~1\RpyEnvs\PYTEST~1\VENV~1\Lib\site-packages\mdba_gauge_getter\gauge_getter.py:82: FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
gauges['State'] = gauges['gauge_owner'].str.strip().str.split(' ', 1).str[0]
demo_levs_pre
Empty DataFrame
Columns: [DATASOURCEID, SITEID, SUBJECTID, DATETIME, VALUE, QUALITYCODE]
Index: []
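That empty frame is easy to miss in a pipeline; a small guard (my own sketch, with column names copied from the output above) keeps downstream summaries from silently operating on nothing:

```python
import pandas as pd

# Simulated gauge_pull result for an all-pre-gauge window:
# same columns, zero rows
demo_levs_pre = pd.DataFrame(
    columns=['DATASOURCEID', 'SITEID', 'SUBJECTID', 'DATETIME', 'VALUE', 'QUALITYCODE']
)

def safe_mean(df, value_col='VALUE'):
    """Return the mean value, or None when the pull came back empty."""
    if df.empty:
        return None
    return df[value_col].mean()

print(safe_mean(demo_levs_pre))  # nothing to summarise
```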
So, that’s slightly different behavior, but neither is returning misleading data.
All dates with gauge
Now {hydrogauge} gives a dataframe.
get_ts_traces(portal = 'NSW',
site_list = gaugenum,
var_list = '141',
start_time = gaugestart,
end_time = weeklater,
interval = 'day',
data_type = 'mean')
# A tibble: 8 × 20
error_num compressed site_short_name longitude site_name latitude org_name
<int> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
2 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
3 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
4 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
5 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
6 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
7 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
8 0 0 YANCO CK @ OFFTAKE 146. YANCO CRE… -34.7 WaterNSW
# ℹ 13 more variables: value <dbl>, time <dttm>, quality_codes_id <int>,
# site <chr>, variable_short_name <chr>, precision <chr>, subdesc <chr>,
# variable <chr>, units <chr>, variable_name <chr>, database_timezone <chr>,
# quality_codes <chr>, data_type <chr>
As does mdba_gauge_getter
demo_levs_exists = gg.gauge_pull(r.gaugenum, start_time_user = r.gaugestart, end_time_user = r.weeklater)
demo_levs_exists
DATASOURCEID SITEID SUBJECTID DATETIME VALUE QUALITYCODE
0 NSW 410007 WATER 1979-01-10 510.939 255
1 NSW 410007 WATER 1979-01-11 551.848 130
2 NSW 410007 WATER 1979-01-12 535.988 130
3 NSW 410007 WATER 1979-01-13 515.685 130
4 NSW 410007 WATER 1979-01-14 512.529 130
5 NSW 410007 WATER 1979-01-15 486.422 130
6 NSW 410007 WATER 1979-01-16 491.710 130
7 NSW 410007 WATER 1979-01-17 532.730 130
That again returns what it should. All the dates have data.
Dates spanning gauge start
Now {hydrogauge} gives a dataframe, but that initial period has value = 0, which is wrong. It should be NA, but the API returns 0 silently.
get_ts_traces(portal = 'NSW',
site_list = gaugenum,
var_list = '141',
start_time = weekbefore,
end_time = weeklater,
interval = 'day',
data_type = 'mean')
# A tibble: 15 × 20
error_num compressed site_short_name longitude site_name latitude org_name
<int> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
2 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
3 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
4 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
5 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
6 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
7 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
8 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
9 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
10 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
11 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
12 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
13 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
14 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
15 0 0 YANCO CK @ OFFTAKE 146. YANCO CR… -34.7 WaterNSW
# ℹ 13 more variables: value <dbl>, time <dttm>, quality_codes_id <int>,
# site <chr>, variable_short_name <chr>, precision <chr>, subdesc <chr>,
# variable <chr>, units <chr>, variable_name <chr>, database_timezone <chr>,
# quality_codes <chr>, data_type <chr>
The same thing happens with mdba_gauge_getter
demo_levs_span = gg.gauge_pull(r.gaugenum, start_time_user = r.weekbefore, end_time_user = r.weeklater)
demo_levs_span
DATASOURCEID SITEID SUBJECTID DATETIME VALUE QUALITYCODE
0 NSW 410007 WATER 1979-01-03 0.000 255
1 NSW 410007 WATER 1979-01-04 0.000 255
2 NSW 410007 WATER 1979-01-05 0.000 255
3 NSW 410007 WATER 1979-01-06 0.000 255
4 NSW 410007 WATER 1979-01-07 0.000 255
5 NSW 410007 WATER 1979-01-08 0.000 255
6 NSW 410007 WATER 1979-01-09 0.000 255
7 NSW 410007 WATER 1979-01-10 510.939 255
8 NSW 410007 WATER 1979-01-11 551.848 130
9 NSW 410007 WATER 1979-01-12 535.988 130
10 NSW 410007 WATER 1979-01-13 515.685 130
11 NSW 410007 WATER 1979-01-14 512.529 130
12 NSW 410007 WATER 1979-01-15 486.422 130
13 NSW 410007 WATER 1979-01-16 491.710 130
14 NSW 410007 WATER 1979-01-17 532.730 130
So, that's not good, especially the silent part.
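If a pull has already happened, the padding can also be trimmed post hoc. This is my own heuristic, not part of mdba_gauge_getter: in a date-sorted pull, a leading run of exact zeros is almost certainly pre-record padding (with the caveat that a genuinely zero-flow start of record would be dropped too):

```python
import pandas as pd

def drop_leading_zeros(df, value_col='VALUE'):
    """Drop the leading run of rows whose value is exactly 0.

    The API silently pads pre-record dates with 0, so a leading zero
    run in a date-sorted pull is treated as padding. Zeros occurring
    after the first nonzero value are left alone.
    """
    nonzero = df[value_col].ne(0)
    if not nonzero.any():
        return df.iloc[0:0]  # all padding: return an empty frame
    return df.loc[nonzero.idxmax():]

# First rows of the spanning pull above: a week of padded zeros, then data
span = pd.DataFrame({
    'DATETIME': pd.date_range('1979-01-03', '1979-01-12'),
    'VALUE': [0, 0, 0, 0, 0, 0, 0, 510.939, 551.848, 535.988],
})
trimmed = drop_leading_zeros(span)
print(trimmed['DATETIME'].iloc[0].date())  # first retained date: 1979-01-10
```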
Solution
I can't figure out how to use mdba_gauge_getter to automatically query the period of record, so with it you have to check every request yourself. On the other hand, {hydrogauge} can hit other API calls (get_variable_list) that return the period of record, and then use them to call get_ts_traces. We can do this manually, which is a bit of a hassle, because get_variable_list returns all the variables, and we have to remember to filter for '100' whether we want '100', '140', or '141' (see testing hydrogauge). With multiple sites, we will always have to loop, since the start_times (and maybe end_times) will differ.
varl <- get_variable_list(portal = 'NSW',
                          site_list = c('422028', '410007'),
                          datasource = 'A')

varl100 <- varl %>% dplyr::filter(variable == "100.00")

# If we have multiple sites, we have to loop over them.
siteout_m <- foreach(i = 1:nrow(varl100)) %do% {
  get_ts_traces(portal = 'NSW',
                site_list = varl100$site[i],
                var_list = '141',
                start_time = varl100$period_start[i],
                end_time = varl100$period_end[i],
                interval = 'day',
                data_type = 'mean')
}
That produces a list of dataframes covering each site's period of record, which we can check to confirm there are no leading zeros.
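That check can also be scripted rather than eyeballed; a sketch (hypothetical helper of mine, assuming a date-sorted trace with a value column as in the output below):

```python
import pandas as pd

def has_leading_zeros(df, value_col='value', n=5):
    """True if the first n rows of a date-sorted trace are all exactly 0,
    the signature of the API's silent pre-record padding."""
    head = df[value_col].head(n)
    return bool(len(head)) and bool((head == 0).all())

# First values of the Barwon trace below start around 695, so no padding
trace = pd.DataFrame({'value': [695.0, 748.0, 829.0, 890.0, 922.0, 942.0]})
print(has_leading_zeros(trace))  # False
```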
head(siteout_m[[1]] |> select(time, value, site_name, everything() ))
# A tibble: 6 × 20
time value site_name error_num compressed site_short_name
<dttm> <dbl> <chr> <int> <chr> <chr>
1 1999-06-25 20:38:00 695. BARWON RIVER A… 0 0 BARWON @ BEEME…
2 1999-06-26 20:38:00 748. BARWON RIVER A… 0 0 BARWON @ BEEME…
3 1999-06-27 20:38:00 829. BARWON RIVER A… 0 0 BARWON @ BEEME…
4 1999-06-28 20:38:00 890. BARWON RIVER A… 0 0 BARWON @ BEEME…
5 1999-06-29 20:38:00 922. BARWON RIVER A… 0 0 BARWON @ BEEME…
6 1999-06-30 20:38:00 942. BARWON RIVER A… 0 0 BARWON @ BEEME…
# ℹ 14 more variables: longitude <dbl>, latitude <dbl>, org_name <chr>,
# quality_codes_id <int>, site <chr>, variable_short_name <chr>,
# precision <chr>, subdesc <chr>, variable <chr>, units <chr>,
# variable_name <chr>, database_timezone <chr>, quality_codes <chr>,
# data_type <chr>
head(siteout_m[[2]] |> select(time, value, site_name, everything() ))
# A tibble: 6 × 20
time value site_name error_num compressed site_short_name
<dttm> <dbl> <chr> <int> <chr> <chr>
1 1979-01-09 15:30:00 514. YANCO CREEK AT… 0 0 YANCO CK @ OFF…
2 1979-01-10 15:30:00 553. YANCO CREEK AT… 0 0 YANCO CK @ OFF…
3 1979-01-11 15:30:00 534. YANCO CREEK AT… 0 0 YANCO CK @ OFF…
4 1979-01-12 15:30:00 515. YANCO CREEK AT… 0 0 YANCO CK @ OFF…
5 1979-01-13 15:30:00 512. YANCO CREEK AT… 0 0 YANCO CK @ OFF…
6 1979-01-14 15:30:00 485. YANCO CREEK AT… 0 0 YANCO CK @ OFF…
# ℹ 14 more variables: longitude <dbl>, latitude <dbl>, org_name <chr>,
# quality_codes_id <int>, site <chr>, variable_short_name <chr>,
# precision <chr>, subdesc <chr>, variable <chr>, units <chr>,
# variable_name <chr>, database_timezone <chr>, quality_codes <chr>,
# data_type <chr>
The cleaner way used to be get_ts_traces2, which does the hunting for dates automatically when given start_time = 'all' and end_time = 'all'. This function is now deprecated while a more general solution is found.
siteout_a <- get_ts_traces2(portal = 'NSW',
                            site_list = c('422028', '410007'),
                            var_list = "141",
                            start_time = 'all',
                            end_time = 'all',
                            interval = 'day',
                            data_type = 'mean',
                            returnformat = 'sitelist')
That's really just a wrapper over the manual method above, and also does not contain leading zeros. It can return a long dataframe if returnformat = 'df', but for comparison I've used returnformat = 'sitelist'.
head(siteout_a[[1]] |> select(time, value, site_name, everything() ))
head(siteout_a[[2]] |> select(time, value, site_name, everything() ))