Read in data and aggregate along theme and spatial dimensions

Allows passing only paths to data rather than objects (though objects work as well, for consistency). A wrapper over read_and_geo(), make_edges() and multi_aggregate(). Particularly useful for passing parameters as strings from a config before anything is read in, and for parallelisation (set a future::plan()).

Usage

read_and_agg(
  datpath,
  type,
  geopath,
  causalpath,
  groupers = "scenario",
  group_until = rep(NA, length(groupers)),
  prepfun = "prep_ewr_output",
  prepargs = list(),
  aggCols,
  aggsequence,
  funsequence,
  saveintermediate = FALSE,
  namehistory = TRUE,
  keepAllPolys = FALSE,
  failmissing = TRUE,
  auto_ewr_PU = FALSE,
  pseudo_spatial = NULL,
  returnList = TRUE,
  savepath = NULL,
  extrameta = NULL,
  rparallel = FALSE,
  par_recursive = TRUE,
  savepar = "combine",
  ...
)

Arguments

datpath

path to indicator data, or the indicator data itself as a data frame. Currently needs to be EWR (same as the ewrpath argument in read_and_geo()), but left more general here for future use.

type

character, a grep for the files to choose. The special case 'everything' gets all files

geopath

sf object with geographic locations matching a column in the data, or path to a csv with gauge locations in lat/long (assumes BOM currently) or a shapefile

causalpath

path to the causal relationships .rds file or the causal network list object or its name

groupers

as in general_aggregate(), with the note that these should be all grouping columns except theme and spatial groupings. These are both automatically added to groupers according to aggsequence before passing to general_aggregate().

group_until

named list of groupers (column names) and the step to which they should be retained. Default NA (retain all groupers for all steps). FOR EWR USE, the best option is group_until = list(SWSDLName = 'sdl_units', planning_unit_name = 'sdl_units', gauge = is_notpoint). This groups by SDL unit, planning unit and gauge until larger spatial grouping has happened, dealing with the issue of gauges reporting into multiple PUs and SDLs. Leaving 'gauge' off is mathematically safe, since the gauge geometry forces that grouping, but then the 'gauge' column gets dropped. A step can be an index, a name, or a function that evaluates to TRUE or FALSE when run on the aggregation sequence. The named list does not need to contain all groupers, but if it does, those that persist throughout should be given NA or a numeric value larger than the length of aggsequence. Vectors the length of groupers usually work, but are less well supported.
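As a rough illustration of the named-list form, the sketch below fills in any unspecified groupers with NA so that they persist through every step. The helper fill_group_until() is hypothetical and written only for this example; it is not a HydroBOT function, and the package's own argument handling may differ.

```r
# Hypothetical helper (not part of HydroBOT): expand a partial named
# group_until list so every grouper has an entry, with NA meaning
# "retain this grouper for all steps".
fill_group_until <- function(groupers, group_until = list()) {
  unknown <- setdiff(names(group_until), groupers)
  if (length(unknown) > 0) {
    stop("group_until names not in groupers: ", paste(unknown, collapse = ", "))
  }
  # Start with NA for every grouper, then overwrite the ones supplied.
  full <- stats::setNames(rep(list(NA), length(groupers)), groupers)
  full[names(group_until)] <- group_until
  full
}

groupers <- c("scenario", "planning_unit_name", "gauge")

# Steps given by name ('sdl_units') and by index (2); 'scenario' is left out.
filled <- fill_group_until(
  groupers,
  list(planning_unit_name = "sdl_units", gauge = 2)
)
str(filled)
```

Here scenario ends up as NA, so it would be kept as a grouper at every step, while the other two carry the step at which they stop being used.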

prepfun

a function that does any post-read and pre-aggregation preparation of the module data. This is where mutates should go. If no data transformation is needed, use identity(). That should be the default for generality, but given the common use with the EWR tool, the defaults for EWR tool are included in HydroBOT (and prep_ewr_output() is the default here).

prepargs

a list of arguments to prepfun, e.g. list(type = 'achievement', add_max = FALSE). Setting prepargs = list(type = 'achievement') and type = 'yearly' is a better (more general) way to declare that processing.

aggCols

an expression for the columns to aggregate (the data columns). See selectcreator for formats

aggsequence

a named list of aggregation steps in the order to apply them. Entries for theme aggregation should be character vectors, e.g. name = c('from_theme', 'to_theme'). Entries for spatial aggregation should be the sf polygon to aggregate to, e.g. name = sfpolygons, or a length-1 character, e.g. name = "sfpolygons". The latter requires the object to be available with get("sfpolygons"), but allows passing characters rather than objects. Not requiring names is high on the list of improvements. To be able to re-run from auto-saved metadata params, the names of the spatial levels need to match the object, e.g. basin: basin.
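The character form of a spatial step can be pictured with the base-R sketch below, which builds an aggsequence and swaps length-1 characters for the objects they name using get(). The resolver resolve_spatial(), the theme level names, and the data frame standing in for an sf polygon layer are all assumptions made for illustration; this is not how HydroBOT necessarily does it internally.

```r
# Placeholder standing in for an sf polygon layer; a plain data frame is
# used so the sketch runs without the sf package.
sdl_units <- data.frame(SWSDLName = c("unit_a", "unit_b"))

# Theme steps are length-2 character vectors (from, to); the spatial step
# is a length-1 character naming the object. Level names are illustrative.
aggsequence <- list(
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  sdl_units = "sdl_units"
)

# Hypothetical resolver: length-1 characters are looked up with get(),
# everything else is passed through unchanged.
resolve_spatial <- function(aggsequence) {
  lapply(aggsequence, function(step) {
    if (is.character(step) && length(step) == 1) get(step) else step
  })
}

resolved <- resolve_spatial(aggsequence)
str(resolved, max.level = 1)
```

Note that the spatial step's name matches the object name (sdl_units = "sdl_units"), which is the naming pattern the text recommends for re-running from saved metadata.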

funsequence

a list of aggregation functions to apply in the order to apply them. Each list entry can be one value, e.g. a character or bare name, or can be multiple if multiple aggregations should be done at that step, e.g. c('ArithmeticMean', 'LimitingFactor'). The entries can also be lists themselves, useful for passing functions with arguments, e.g. list(wm = ~weighted.mean(., w = area, na.rm = TRUE)). Important: as of dplyr 1.1, if these are anonymous functions that refer to data variables (like the w = area argument in the weighted.mean() example), that list needs to be wrapped in rlang::quo(), e.g. rlang::quo(list(wm = ~weighted.mean(., w = area, na.rm = TRUE))). Also, character and other forms can no longer be mixed in the same sub-list (single aggregation step).
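To make the weighted.mean() example concrete, this base-R sketch computes the per-group area-weighted mean that a formula such as ~weighted.mean(., w = area, na.rm = TRUE) describes, next to a plain mean for comparison. The data and the column names sdl, value and area are invented, and the code does not call HydroBOT or dplyr.

```r
# Invented polygon-level values with areas; one value is missing.
polys <- data.frame(
  sdl   = c("a", "a", "a", "b", "b"),
  value = c(1, 3, NA, 2, 4),
  area  = c(1, 3, 10, 5, 5)
)

# Area-weighted mean per group. With na.rm = TRUE, weighted.mean() drops
# the missing value together with its weight.
wm <- sapply(split(polys, polys$sdl), function(d) {
  weighted.mean(d$value, w = d$area, na.rm = TRUE)
})

# Unweighted mean per group, for comparison.
am <- tapply(polys$value, polys$sdl, mean, na.rm = TRUE)

wm  # a = 2.5, b = 3
am  # a = 2,   b = 3
```

The weight column (area here) is a data variable looked up at aggregation time, which is why the text asks for such functions to be wrapped in rlang::quo() when passed through funsequence.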

saveintermediate

logical, default FALSE.

FALSE (the default): Save only the final result as a tibble or sf.
TRUE: Save every step of the aggregation as a tibble or sf in a list.

namehistory

logical, default TRUE.

TRUE (the default): The name of the aggregated column(s) retain the full aggregation history of the form agglevelN_aggfunctionN_...agglevel1_aggfunction1_originalcolumn. This is ugly, but saves memory and says exactly what the values in each column are.
FALSE: The aggregation history is moved out of the column names and into new columns that define it using agg_names_to_cols(). The column name(s) become(s) the original column name(s) specified by aggCols. This is far cleaner and easier for analysis (e.g. filtering on aggregation functions at a particular step), but increases the size of the dataset and the meaning of the values in the aggregation column have to be interpreted with the values in the new columns defining history.
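The naming pattern for namehistory = TRUE can be sketched in base R by prepending each step's level and function to the running column name. The step names, function names and the original column ewr_achieved are made-up examples, and this only mimics the pattern described above rather than reproducing HydroBOT's own code.

```r
# Steps in the order they are applied (illustrative names only).
steps <- list(
  list(level = "env_obj", fun = "ArithmeticMean"),
  list(level = "sdl_units", fun = "LimitingFactor")
)

# Each step prepends "<level>_<function>_" to the name built so far,
# so the most recent step ends up at the front.
history_name <- Reduce(
  function(name, step) paste(step$level, step$fun, name, sep = "_"),
  steps,
  "ewr_achieved"
)

history_name
# "sdl_units_LimitingFactor_env_obj_ArithmeticMean_ewr_achieved"
```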

keepAllPolys

logical, default FALSE. Should polygons in to_geo that have no values in dat be retained? The default FALSE keeps NA polygons from cluttering things up, but TRUE can be useful to not lose them, especially for later plotting. However, it is typically best from a data and cleanliness perspective to use FALSE here and use the bare set of polys as an underlay in plot_outcomes().

failmissing

logical, default TRUE: fail if the requested grouping or aggregation columns do not exist. If FALSE, proceed with those that do exist and silently drop those that don't. Similar to tidyselect::all_of() vs tidyselect::any_of().
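The two behaviours can be sketched in base R as follows. The function select_cols() is a hypothetical stand-in written for this example (HydroBOT is described as relying on tidyselect semantics), and the data frame is invented.

```r
# Hypothetical selector: strict when failmissing = TRUE, lenient otherwise.
select_cols <- function(data, cols, failmissing = TRUE) {
  missing_cols <- setdiff(cols, names(data))
  if (failmissing && length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  # Keep only the requested columns that exist, in the requested order.
  data[, intersect(cols, names(data)), drop = FALSE]
}

dat <- data.frame(scenario = "base", gauge = "g1", ewr_achieved = 1)
wanted <- c("scenario", "not_a_column")

# Lenient: the missing column is silently dropped.
lenient <- select_cols(dat, wanted, failmissing = FALSE)

# Strict: the call errors; the message is captured here for inspection.
strict <- tryCatch(
  select_cols(dat, wanted, failmissing = TRUE),
  error = function(e) conditionMessage(e)
)
```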

auto_ewr_PU

logical, default FALSE. Auto-detect EWRs and enforce appropriate theme and spatial scaling related to gauges and planning units, as defined in theme_aggregate() and spatial_aggregate(). Specifically, if TRUE, this automatically manages the group_until and pseudo_spatial arguments.

pseudo_spatial

a character or numeric vector giving the names or indices (NOT the column names to join on) of aggsequence that should have 'pseudo-spatial' aggregation. This is when we go from one spatial data level to another, but do the join and aggregation with a non-spatial dplyr::left_join(). It was developed for the EWR situation, where the incoming data is indexed to gauges, planning units, and sdl units, but has gauge point geometry, and spatial joining to planning units or sdl units is not appropriate, because single gauges affect multiple units. So it joins to the planning_units or sdl_units by column names instead of spatially, and then aggregates according to those units. For EWR use, typically the best option is pseudo_spatial = c('planning_units', 'sdl_units'), with the grouping to sdl units only pseudo-spatial because some planning units spill over sdl boundaries.
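The idea of a pseudo-spatial step can be sketched with plain data frames: join on a shared name column rather than on geometry, then aggregate within the joined units. This example uses base-R merge() with all.x = TRUE as an analogue of dplyr::left_join() so that it runs without extra packages; the data, the area_km2 column and the gauge and unit names are invented, and geometry is omitted entirely.

```r
# Gauge-indexed values: gauge g1 reports into two planning units, which a
# point-in-polygon spatial join could not represent.
ewr <- data.frame(
  gauge = c("g1", "g1", "g2"),
  planning_unit_name = c("pu_1", "pu_2", "pu_2"),
  ewr_achieved = c(1, 1, 0)
)

# Attribute table standing in for the planning-unit polygons.
planning_units <- data.frame(
  planning_unit_name = c("pu_1", "pu_2"),
  area_km2 = c(100, 250)
)

# Non-spatial join on the shared name column, keeping every data row.
joined <- merge(ewr, planning_units, by = "planning_unit_name", all.x = TRUE)

# Aggregate within the joined units.
pu_means <- aggregate(ewr_achieved ~ planning_unit_name, data = joined, FUN = mean)
pu_means
# pu_1 has mean 1; pu_2 has mean 0.5
```

Because g1 contributes to both pu_1 and pu_2 through its two rows, each planning unit receives every gauge that reports into it.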

returnList

default TRUE, whether to return the output to the current session

savepath

default NULL, a path to save the output to. Note that this names the output rds file directly 'type_aggregated.rds', so the path should include only the directory structure. If NULL, does not save. If savepath = NULL and returnList = FALSE, the function errors to avoid wasting resources.

extrameta

list, extra information to include in saved metadata documentation for the run. Default NULL.

rparallel

logical, default FALSE. If TRUE, parallelises over the scenarios in datpath using furrr. To use, install furrr and set a future::plan() (likely multisession or multicore).

par_recursive

logical, default TRUE. If running in parallel, whether to parallelise over the innermost level of directory containing EWR outputs (TRUE) or the next level in from datpath (FALSE).

savepar

'combine' (default) or 'each'. If parallel over scenarios, should this combine the output (default) or save each scenario's aggregation separately ('each')

...

passed to read_and_geo(), primarily gaugefilter, scenariofilter.

Value

either a tibble or sf of aggregated values at the final level (if saveintermediate = FALSE) or a list of tibbles or sfs with aggregated values at each step (saveintermediate = TRUE)