Iterative aggregation along theme and spatial dimensions

Wraps spatial_aggregate() and theme_aggregate() in a loop over lists of aggregation levels and the functions to apply at each level. Also does some minor data preparation and cleanup, depending on the arguments that control what the output should look like.

Usage

multi_aggregate(
  dat,
  causal_edges = NULL,
  groupers = "scenario",
  group_until = rep(NA, length(groupers)),
  aggCols,
  aggsequence,
  funsequence,
  saveintermediate = FALSE,
  namehistory = TRUE,
  keepAllPolys = FALSE,
  failmissing = TRUE,
  auto_ewr_PU = FALSE,
  pseudo_spatial = NULL
)

Arguments

dat

input dataframe. Must be sf if aggsequence includes any spatial aggregation. Otherwise, as in theme_aggregate() and spatial_aggregate()

causal_edges

causal links between all theme levels included in aggsequence, though can also include others, which are ignored. Creates the theme grouping

groupers

as in general_aggregate(), with the note that these should be all grouping columns except theme and spatial groupings. These are both automatically added to groupers according to aggsequence before passing to general_aggregate().

group_until

named list of groupers (column names) and the step to which they should be retained. Default NA (retain all groupers for all steps). FOR EWR USE, the best option is group_until = list(SWSDLName = 'sdl_units', planning_unit_name = 'sdl_units', gauge = is_notpoint). This groups by gauge and planning unit until larger spatial grouping has happened, dealing with the issue of gauges reporting into multiple PUs and SDLs. Leaving 'gauge' off is mathematically safe, since the gauge geometry forces that grouping, but then the 'gauge' column gets dropped. A step can be an index, a name, or a function that evaluates to TRUE or FALSE when run on the aggregation sequence. The named list does not need to contain all groupers, but if it does, those that persist throughout should be given NA or numeric values greater than the length of aggsequence. Vectors the length of groupers usually work, but are less well supported.
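The three ways of specifying a step (index, name, or function) can be sketched in base R. This is only an illustration of the idea, not the package's implementation: resolve_step() is a hypothetical helper, is_length_one() is a made-up stand-in for a predicate such as is_notpoint, and the aggsequence names are placeholders.

```r
# Hypothetical helper, not part of the package: resolve one group_until
# entry to a step index in aggsequence.
resolve_step <- function(step, aggsequence) {
  if (is.function(step)) {
    # First step for which the predicate returns TRUE
    hits <- which(vapply(aggsequence, step, logical(1)))
    return(if (length(hits) > 0) unname(hits[1]) else Inf)
  }
  if (is.character(step)) {
    # Step given by name
    return(match(step, names(aggsequence)))
  }
  if (is.na(step)) {
    # NA: retain the grouper for all steps
    return(Inf)
  }
  # Already a numeric index
  step
}

# Placeholder aggregation sequence (names are assumptions)
aggsequence <- list(
  to_theme = c("from_theme", "to_theme"),
  planning_units = "planning_units",
  sdl_units = "sdl_units"
)

# Made-up predicate standing in for something like is_notpoint
is_length_one <- function(x) length(x) == 1

group_until <- list(
  SWSDLName = "sdl_units",
  gauge = is_length_one,
  scenario = NA
)
until_index <- lapply(group_until, resolve_step, aggsequence = aggsequence)
```

Here a grouper given NA resolves to Inf, which mirrors the advice to use NA or a number beyond the end of aggsequence for groupers that should persist throughout.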

aggCols

an expression for the columns to aggregate (the data columns). See selectcreator() for accepted formats.

aggsequence

a named list of aggregation steps in the order to apply them. Entries for theme aggregation should be character vectors, e.g. name = c('from_theme', 'to_theme'). Entries for spatial aggregation should be either the sf polygons to aggregate to, e.g. name = sfpolygons, or a length-1 character, e.g. name = "sfpolygons". The latter requires the object to be available with get("sfpolygons"), but allows passing characters rather than objects. Removing the requirement for names is high on the list of improvements. To be able to re-run from auto-saved metadata params, the names of the spatial levels need to match the object names, e.g. basin: basin.
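As a minimal sketch of that structure, the list below mixes one theme step with two spatial steps given in the character form. The level names are placeholders (assumptions), and the length-based check at the end is for illustration only; it is not how the package itself identifies step types.

```r
# Level names are hypothetical; substitute the theme levels present in
# causal_edges and the polygon objects available in your session.
aggsequence <- list(
  to_theme = c("from_theme", "to_theme"),  # theme step: from-level, to-level
  sdl_units = "sdl_units",                 # spatial step, length-1 character
  basin = "basin"                          # name matches the object name
)

# Illustration only: theme entries here are the length-2 character vectors
is_theme_step <- vapply(
  aggsequence,
  function(x) is.character(x) && length(x) == 2,
  logical(1)
)
```

Using the character form means objects called sdl_units and basin would need to exist where get() can find them when the aggregation runs.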

funsequence

a list of aggregation functions to apply, in the order to apply them. Each list entry can be one value, e.g. a character or bare name, or can be multiple if multiple aggregations should be done at that step, e.g. c('ArithmeticMean', 'LimitingFactor'). The entries can also be lists themselves, useful for passing functions with arguments, e.g. list(wm = ~weighted.mean(., w = area, na.rm = TRUE)). Important: as of dplyr 1.1, if these are anonymous functions that refer to data variables (like the w = area argument in the weighted.mean() example), that list needs to be wrapped in rlang::quo(), e.g. rlang::quo(list(wm = ~weighted.mean(., w = area, na.rm = TRUE))). In addition, character and other forms can no longer be mixed in the same sub-list (single aggregation step).
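The forms described above might be combined as in the sketch below, which assumes one funsequence entry per aggregation step and requires the rlang package. The column name area comes from the example in the text and is not evaluated when the list is built.

```r
# One entry per aggregation step (an assumption about alignment with
# aggsequence). Each entry keeps to a single form, with no mixing of
# characters and functions within a step.
funsequence <- list(
  "ArithmeticMean",                        # single function by name
  c("ArithmeticMean", "LimitingFactor"),   # two aggregations at one step
  # A function that refers to a data variable (area) is wrapped in quo()
  rlang::quo(list(wm = ~weighted.mean(., w = area, na.rm = TRUE)))
)
```

Wrapping in rlang::quo() delays evaluation, so area is only looked up later, inside the data being aggregated.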

saveintermediate

logical, default FALSE.

FALSE (the default): Save only the final result as a tibble or sf.
TRUE: Save every step of the aggregation as a tibble or sf in a list.

namehistory

logical, default TRUE.

TRUE (the default): The names of the aggregated column(s) retain the full aggregation history, in the form agglevelN_aggfunctionN_...agglevel1_aggfunction1_originalcolumn. This is ugly, but saves memory and says exactly what the values in each column are.
FALSE: The aggregation history is moved out of the column names and into new columns that define it, using agg_names_to_cols(). The column name(s) become the original column name(s) specified by aggCols. This is far cleaner and easier for analysis (e.g. filtering on aggregation functions at a particular step), but increases the size of the dataset, and the meaning of the values in the aggregation column has to be interpreted together with the values in the new columns defining the history.
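To make the naming pattern for namehistory = TRUE concrete, the base R sketch below assembles a column name of that form by hand. The level, function, and column names are hypothetical, and the code only mimics the documented pattern; it does not call the package.

```r
# Hypothetical two-step aggregation, listed in the order applied
steps <- c("to_theme", "sdl_units")
funs <- c("ArithmeticMean", "LimitingFactor")
original <- "value"

# The most recent step comes first in the name, so reverse the order
history <- paste(rev(steps), rev(funs), sep = "_")
colname <- paste(c(history, original), collapse = "_")
colname
# "sdl_units_LimitingFactor_to_theme_ArithmeticMean_value"
```

With namehistory = FALSE, the same information would instead sit in separate columns and the data column would simply keep its original name.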

keepAllPolys

logical, default FALSE. Should polygons in to_geo that have no values in dat be retained? The default FALSE keeps NA polygons from cluttering things up, but TRUE can be useful to not lose them, especially for later plotting. However, it is typically best from a data and cleanliness perspective to use FALSE here and use the bare set of polys as an underlay in plot_outcomes().

failmissing

logical, default TRUE. Fail if the requested grouping or aggregation columns do not exist. If FALSE, proceed with those that do exist and silently drop those that don't. Similar to tidyselect::all_of() vs tidyselect::any_of().
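The difference between the two settings can be sketched in base R. select_cols() is a hypothetical helper written for this illustration; it imitates the strict (all_of-like) and lenient (any_of-like) behaviour and is not the package's code.

```r
# Hypothetical helper: strict or lenient column selection
select_cols <- function(df, cols, failmissing = TRUE) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0 && failmissing) {
    # Strict: error on any requested column that is absent
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Lenient: keep only the requested columns that exist
  df[, intersect(cols, names(df)), drop = FALSE]
}

df <- data.frame(a = 1:2, b = 3:4)

# Lenient selection drops the absent column "z" without complaint
lenient <- select_cols(df, c("a", "z"), failmissing = FALSE)

# Strict selection fails on the same request
strict <- try(select_cols(df, c("a", "z")), silent = TRUE)
```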

auto_ewr_PU

logical, default FALSE. Auto-detect EWRs and enforce appropriate theme and spatial scaling related to gauges and planning units, as defined in theme_aggregate() and spatial_aggregate(). Specifically, if TRUE, this automatically manages the group_until and pseudo_spatial arguments.

pseudo_spatial

a character or numeric vector giving the names or indices (NOT the column names to join on) of the steps in aggsequence that should use 'pseudo-spatial' aggregation. This is when we go from one spatial data level to another, but do the join and aggregation with a non-spatial dplyr::left_join(). It was developed for the EWR situation, where the incoming data is indexed to gauges, planning units, and sdl units, but has gauge point geometry, and spatial joining to planning units or sdl units is not appropriate because single gauges affect multiple units. Instead, it joins to planning_units or sdl_units by column names and then aggregates according to those units. For EWR use, typically the best option is pseudo_spatial = c('planning_units', 'sdl_units'). The grouping to sdl units needs to be pseudo-spatial only because some planning units spill over sdl boundaries.
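The idea of joining by column and then aggregating can be sketched with a toy example. Base R merge(..., all.x = TRUE) stands in here for dplyr::left_join(); the data, the value and area columns, and the use of a plain mean are all assumptions made for illustration.

```r
# Toy gauge-level data: gauge g1 reports into two planning units
gauge_vals <- data.frame(
  gauge = c("g1", "g1", "g2"),
  planning_unit_name = c("PU_A", "PU_B", "PU_B"),
  value = c(1, 1, 3)
)

# Planning unit attributes, with geometry omitted for simplicity
planning_units <- data.frame(
  planning_unit_name = c("PU_A", "PU_B"),
  area = c(10, 20)
)

# Non-spatial left join on the shared column name
joined <- merge(gauge_vals, planning_units,
                by = "planning_unit_name", all.x = TRUE)

# Aggregate gauge values within each planning unit
pu_means <- aggregate(value ~ planning_unit_name, data = joined, FUN = mean)
```

Because the match is on planning_unit_name rather than on geometry, gauge g1 contributes to both PU_A and PU_B, which a point-in-polygon join could not express.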

Value

Either a tibble or sf of aggregated values at the final level (if saveintermediate = FALSE), or a list of tibbles or sfs with aggregated values at each step (if saveintermediate = TRUE).