Read in data and aggregate along theme and spatial dimension
read_and_agg.Rd
Allows passing only paths to data rather than objects (though objects work as
well for consistency). Wrapper over read_and_geo(), make_edges() and
multi_aggregate(). Particularly useful if we want to pass parameters as strings
from a config before anything is read in, and for parallelisation (set a
future::plan()).
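As a rough illustration of the "parameters as strings" idea, the sketch below builds an argument list that contains only strings and logicals, so it could be stored in a config file. Every path, the column name ewr_achieved, and the theme levels ewr_code and env_obj are placeholder assumptions, not values confirmed by this page; only the argument names come from the usage below. The actual call is left commented out because it needs HydroBOT, real data, and an object that get("sdl_units") can find.

```r
# Hypothetical config values: every path and column name is a placeholder.
agg_params <- list(
  datpath = "path/to/module_output",
  type = "yearly",
  geopath = "path/to/gauge_locations.csv",
  causalpath = "path/to/causal_network.rds",
  groupers = "scenario",
  prepargs = list(type = "achievement"),
  aggCols = "ewr_achieved",
  aggsequence = list(
    # theme step: character vector of from and to levels (assumed names)
    env_obj = c("ewr_code", "env_obj"),
    # spatial step given as a length-1 character; the object must be
    # reachable with get("sdl_units") when the aggregation runs
    sdl_units = "sdl_units"
  ),
  funsequence = list("ArithmeticMean", "LimitingFactor"),
  savepath = "path/to/aggregated_output",
  returnList = FALSE
)

# Check that every leaf is a plain string or logical, i.e. config-friendly
all_simple <- all(rapply(
  agg_params,
  function(x) is.character(x) || is.logical(x),
  how = "unlist"
))

# Not run: requires HydroBOT, furrr, and real data at the paths above.
# future::plan(future::multisession)
# do.call(HydroBOT::read_and_agg, c(agg_params, list(rparallel = TRUE)))
```

Keeping everything as strings means nothing is read into memory until the function itself runs, which is what makes the list easy to serialise or hand to parallel workers.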
Usage
read_and_agg(
datpath,
type,
geopath,
causalpath,
groupers = "scenario",
group_until = rep(NA, length(groupers)),
prepfun = "prep_ewr_output",
prepargs = list(),
aggCols,
aggsequence,
funsequence,
saveintermediate = FALSE,
namehistory = TRUE,
keepAllPolys = FALSE,
failmissing = TRUE,
auto_ewr_PU = FALSE,
pseudo_spatial = NULL,
returnList = TRUE,
savepath = NULL,
extrameta = NULL,
rparallel = FALSE,
par_recursive = TRUE,
savepar = "combine",
...
)
Arguments
- datpath
path to indicator data, or indicator data itself as a data frame. Currently needs to be EWR (same as the ewrpath argument in read_and_geo()), but left more general here for future use.
- type
character, a grep for the files to choose. The special case 'everything' gets all files
- geopath
sf object with geographic locations matching a column in the data, or path to a csv with gauge locations in lat/long (assumes BOM currently) or a shapefile
- causalpath
path to the causal relationships .rds file or the causal network list object or its name
- groupers
as in general_aggregate(), with the note that these should be all grouping columns except theme and spatial groupings. These are both automatically added to groupers according to aggsequence before passing to general_aggregate().
- group_until
named list of groupers (column names) and the step to which they should be retained. Default NA (retain all groupers for all steps). For EWR use, the best option is group_until = list(SWSDLName = 'sdl_units', planning_unit_name = 'sdl_units', gauge = is_notpoint). This groups by SDL unit, planning unit, and gauge until larger spatial grouping has happened, dealing with the issue of gauges reporting into multiple PUs and SDLs. Leaving 'gauge' off is mathematically safe, since the gauge geometry forces that grouping, but then the 'gauge' column gets dropped. Step can be an index, a name, or a function that evaluates to TRUE or FALSE when run on the aggregation sequence. The named list does not need to contain all groupers, but if it does, those that persist throughout should be given NA or a numeric value greater than the length of aggsequence. Vectors the length of groupers usually work, but are less well supported.
- prepfun
a function that does any post-read and pre-aggregation preparation of the module data. This is where mutates should go. If no data transformation is needed, use identity(). That should be the default for generality, but given the common use with the EWR tool, the defaults for the EWR tool are included in HydroBOT (and prep_ewr_output() is the default here).
- prepargs
a list of arguments to prepfun, e.g. list(type = 'achievement', add_max = FALSE). Setting prepargs = list(type = 'achievement') and type = 'yearly' is a better (more general) way to declare that processing.
- aggCols
an expression for the columns to aggregate (the data columns). See selectcreator for formats.
- aggsequence
a named list of aggregation steps in the order to apply them. Entries for theme aggregation should be character vectors, e.g. name = c('from_theme', 'to_theme'). Entries for spatial aggregation should be the sf polygon to aggregate to, e.g. name = sfpolygons, or a length-1 character, e.g. name = "sfpolygons". The latter requires the object to be available with get("sfpolygons"), but allows passing characters rather than objects. Not requiring names is high on the list of improvements. If we want to be able to re-run from auto-saved metadata params, we need the names of the spatial levels to match the object, e.g. basin: basin.
- funsequence
a list of aggregation functions to apply in the order to apply them. Each list entry can be one value, e.g. a character or bare name, or can be multiple if multiple aggregations should be done at that step, e.g. c('ArithmeticMean', 'LimitingFactor'). The entries can also be lists themselves, useful for passing functions with arguments, e.g. list(wm = ~weighted.mean(., w = area, na.rm = TRUE)). Important: as of dplyr 1.1, if these are anonymous functions that refer to data variables (like the w = area argument in the weighted.mean() example), that list needs to be wrapped in rlang::quo(), e.g. rlang::quo(list(wm = ~weighted.mean(., w = area, na.rm = TRUE))). We also can no longer mix character and other forms in the same sub-list (single aggregation step).
- saveintermediate
logical, default FALSE.
  - FALSE (the default): Save only the final result as a tibble or sf.
  - TRUE: Save every step of the aggregation as a tibble or sf in a list.
- namehistory
logical, default TRUE.
  - TRUE (the default): The names of the aggregated column(s) retain the full aggregation history, in the form agglevelN_aggfunctionN_...agglevel1_aggfunction1_originalcolumn. This is ugly, but saves memory and says exactly what the values in each column are.
  - FALSE: The aggregation history is moved out of the column names and into new columns that define it, using agg_names_to_cols(). The column name(s) become the original column name(s) specified by aggCols. This is far cleaner and easier for analysis (e.g. filtering on aggregation functions at a particular step), but increases the size of the dataset, and the values in the aggregation column have to be interpreted alongside the values in the new columns defining history.
- keepAllPolys
logical, default FALSE. Should polygons in to_geo that have no values in dat be retained? The default FALSE keeps NA polygons from cluttering things up, but TRUE can be useful to not lose them, especially for later plotting. However, it is typically best from a data and cleanliness perspective to use FALSE here and use the bare set of polys as an underlay in plot_outcomes().
- failmissing
logical, default TRUE: fail if the requested grouping or aggregation columns do not exist. If FALSE, proceed with those that do exist and silently drop those that don't. Similar to tidyselect::all_of() vs tidyselect::any_of().
- auto_ewr_PU
logical, default FALSE. Auto-detect EWRs and enforce appropriate theme and spatial scaling related to gauges and planning units, as defined in theme_aggregate() and spatial_aggregate(). Specifically, if TRUE, this automatically manages the group_until and pseudo_spatial arguments.
- pseudo_spatial
a character or numeric vector giving the names or indices (NOT the column names to join on) of the aggsequence steps that should have 'pseudo-spatial' aggregation. This is when we go from one spatial data level to another, but do the join and aggregation with a non-spatial dplyr::left_join(). It was developed for the EWR situation, where the incoming data is indexed to gauges, planning units, and sdl units, but has gauge point geometry, and spatial joining to planning units or sdl units is not appropriate, because single gauges affect multiple units. So it joins to planning_units or sdl_units by column names instead of spatially, and then aggregates according to those units. For EWR use, typically the best option is pseudo_spatial = c('planning_units', 'sdl_units'), with the grouping to sdls only pseudo-spatial because some planning units spill over sdl boundaries.
- returnList
default TRUE, whether to return the output to the current session
- savepath
default NULL, a path to save the output to. Note that this names the output rds file directly as 'type_aggregated.rds', so the path should include only the directory structure. If NULL, does not save. If savepath = NULL and returnList = FALSE, the function errors to avoid wasting resources.
- extrameta
list, extra information to include in saved metadata documentation for the run. Default NULL.
- rparallel
logical, default FALSE. If TRUE, parallelises over the scenarios in hydro_dir using furrr. To use, install furrr and set a future::plan() (likely multisession or multicore).
- par_recursive
logical, default TRUE. If parallel, do we use the innermost level of directory containing EWR outputs (TRUE) or the next level in from datpath (FALSE)
- savepar
'combine' (default) or 'each'. If parallel over scenarios, should this combine the output (default) or save each scenario's aggregation separately ('each')
- ...
passed to read_and_geo(), primarily gaugefilter and scenariofilter.
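The column-name pattern described under namehistory can be made concrete with a small sketch. The helper history_name() below is hypothetical and is not part of HydroBOT; it only reconstructs the documented form agglevelN_aggfunctionN_...agglevel1_aggfunction1_originalcolumn from assumed step names, and the real package may build names differently in detail.

```r
# Hypothetical helper, not a HydroBOT function: rebuilds the documented pattern
# agglevelN_aggfunctionN_...agglevel1_aggfunction1_originalcolumn
history_name <- function(original, levels, funs) {
  stopifnot(length(levels) == length(funs))
  # Pair each step's level with its function, then put the latest step first
  steps <- rev(paste(levels, funs, sep = "_"))
  paste(c(steps, original), collapse = "_")
}

# Two assumed steps: a theme step to env_obj, then a spatial step to sdl_units
nm <- history_name(
  original = "ewr_achieved",
  levels = c("env_obj", "sdl_units"),
  funs = c("ArithmeticMean", "LimitingFactor")
)
print(nm)
```

With namehistory = FALSE, the same information would instead sit in separate columns, and the data column would keep its original name.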