library(tidyverse)
Tidy programming
The issue
Tidyverse, and particularly dplyr and ggplot, are great for quickly doing very powerful rearrangements and calculations of data and making plots. One of the main ways they achieve this is by allowing us to use bare variable names: unquoted, with no $ syntax. However, that becomes tricky when programming, where we might want to pass variables as arguments. Passing other things as arguments can also be a pain, e.g. functions for summarize. I’ve encountered many different things that trip me up, depending on what I’m trying to pass, but my fixes are typically ad hoc and scattered around my code. I’ll use this doc as a central place to sort out solutions to various problems as they come up. There are quite a lot of answers from dplyr itself, but for some reason I always have to figure things out for myself.
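As a minimal sketch of the problem (naive_mean is a made-up name, and this assumes there is no object called gear in the calling environment), here is what happens if a wrapper just passes its argument straight through to group_by without any special handling:

```r
library(dplyr)

# Hypothetical wrapper that passes its argument through without embracing it
naive_mean <- function(data, grp) {
  data %>%
    group_by(grp) %>%
    summarise(meanmpg = mean(mpg))
}

# group_by() looks for `grp`, finds the unevaluated argument `gear`, and then
# looks for `gear` outside the data frame, where it does not exist
naive_result <- try(naive_mean(mtcars, gear), silent = TRUE)
inherits(naive_result, "try-error")
```

The call errors rather than grouping by the gear column, which is the behaviour the rest of this doc works around.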
Passing to group_by
Let’s say we want to allow the user to pass which columns to group_by. The two usual ways I end up doing this are double-embracing or just using character vectors. Let’s demo and test with a grouped mean for mtcars. Embracing allows the user to pass bare names; chars makes them pass characters, and we have to use across(all_of()), which is annoying syntax.
# embracing
groupbrace <- function(data, groupers) {
  gm <- data %>%
    group_by({{groupers}}) %>%
    summarise(meanmpg = mean(mpg)) %>%
    ungroup()
  return(gm)
}
# characters
groupchar <- function(data, groupers) {
  gm <- data %>%
    group_by(across(all_of(groupers))) %>%
    summarise(meanmpg = mean(mpg)) %>%
    ungroup()
  return(gm)
}
How do we use those for a single grouping variable?
groupbrace(mtcars, groupers = gear)
# A tibble: 3 × 2
gear meanmpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
groupchar(mtcars, groupers = 'gear')
# A tibble: 3 × 2
gear meanmpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
What happens when we try to group by more than one column?
# groupbrace(mtcars, groupers = c(gear, carb))
#
# groupchar(mtcars, groupers = c('gear', 'carb'))
This works with the characters, but the embracing fails (unsurprisingly).
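To see the embracing failure without stopping a script, the call can be wrapped in try(). This is only a sketch; groupbrace_demo is a made-up copy of the embracing function so the block stands alone:

```r
library(dplyr)

# Same body as the embracing version, repeated so this block is self-contained
groupbrace_demo <- function(data, groupers) {
  data %>%
    group_by({{groupers}}) %>%
    summarise(meanmpg = mean(mpg)) %>%
    ungroup()
}

# c(gear, carb) is treated as one expression to compute a single grouping
# column, not as two columns, so the call errors
brace_multi <- try(groupbrace_demo(mtcars, groupers = c(gear, carb)),
                   silent = TRUE)
inherits(brace_multi, "try-error")
```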
The website says to use ..., so we can do that as follows:
groupdots <- function(data, ...) {
  gm <- data %>%
    group_by(...) %>%
    summarise(meanmpg = mean(mpg)) %>%
    ungroup()
  return(gm)
}
groupdots(mtcars, gear, carb)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb meanmpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
That works, but it becomes an issue if we’re ALSO supplying arguments for other things in the function. See below.
The website only uses the dots example, but across() works like it does with summarize. I think this ends up being the answer for bare variable names that don’t get mixed up between grouping and summarizing. See below.
groupacross <- function(data, groupers) {
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(meanmpg = mean(mpg)) %>%
    ungroup()
  return(gm)
}
groupacross(mtcars, c(gear, carb))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb meanmpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
Passing to summarise/mutate
I’m going to set this up with a simple group_by in all cases because it sets up the combo, and I almost never actually call summarise on a full dataset anyway.
Columns to operate on
If we just want one column, but the user supplies its name, we can again embrace or quote.
Naming is an issue here too. The name can just be left as a fixed value, but if we want the name of the new column to reflect what’s being passed in, we handle that in different ways: with the braces we use glue syntax with :=, and with characters we use the .names argument.
Now, the dots don’t seem to work to pass multiple bare names, I think probably because of issues with names? But we can modify the simple embraced version to use across(), making it more similar to the character version.
# embracing
sumbrace <- function(data, sumcols) {
  gm <- data %>%
    group_by(gear) %>%
    summarise("mean_{{sumcols}}" := mean({{sumcols}})) %>%
    ungroup()
  return(gm)
}
# characters
sumchar <- function(data, sumcols) {
  gm <- data %>%
    group_by(gear) %>%
    summarise(across(all_of(sumcols), mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
# multiple bare
sumbaremulti <- function(data, sumcols) {
  gm <- data %>%
    group_by(gear) %>%
    summarise(across({{sumcols}}, mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
With a single user-supplied column
sumbrace(mtcars, sumcols = mpg)
# A tibble: 3 × 2
gear mean_mpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
sumchar(mtcars, sumcols = 'mpg')
# A tibble: 3 × 2
gear mean_mpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
Multiple user-supplied cols
sumbaremulti(mtcars, sumcols = c(mpg, hp))
# A tibble: 3 × 3
gear mean_mpg mean_hp
<dbl> <dbl> <dbl>
1 3 16.1 176.
2 4 24.5 89.5
3 5 21.4 196.
sumchar(mtcars, sumcols = c('mpg', 'hp'))
# A tibble: 3 × 3
gear mean_mpg mean_hp
<dbl> <dbl> <dbl>
1 3 16.1 176.
2 4 24.5 89.5
3 5 21.4 196.
Combine with group_by
I often want to pass a set of variable names to group_by and a set of names to summarize. If we use the dots method, these would get all jumbled together. So the options are embracing or characters, and when embracing we still need the c(bare1, bare2, …, bareN) so each component is a single argument.
# characters
gsumchar <- function(data, groupers, sumcols) {
  gm <- data %>%
    group_by(across(all_of(groupers))) %>%
    summarise(across(all_of(sumcols), mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
# multiple bare
gsumbaremulti <- function(data, groupers, sumcols) {
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
Now we can feed it multiple grouping columns and multiple summary columns
gsumbaremulti(mtcars, groupers = c(gear, carb), sumcols = c(mpg, hp))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 4
gear carb mean_mpg mean_hp
<dbl> <dbl> <dbl> <dbl>
1 3 1 20.3 104
2 3 2 17.2 162.
3 3 3 16.3 180
4 3 4 12.6 228
5 4 1 29.1 72.5
6 4 2 24.8 79.5
7 4 4 19.8 116.
8 5 2 28.2 102
9 5 4 15.8 264
10 5 6 19.7 175
11 5 8 15 335
gsumchar(mtcars, groupers = c('gear', 'carb'), sumcols = c('mpg', 'hp'))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 4
gear carb mean_mpg mean_hp
<dbl> <dbl> <dbl> <dbl>
1 3 1 20.3 104
2 3 2 17.2 162.
3 3 3 16.3 180
4 3 4 12.6 228
5 4 1 29.1 72.5
6 4 2 24.8 79.5
7 4 4 19.8 116.
8 5 2 28.2 102
9 5 4 15.8 264
10 5 6 19.7 175
11 5 8 15 335
It’s really not clear why I’d ever use the dots version, or why we wouldn’t always use the across() wrap to give us generality. I guess if that generality isn’t needed? But while dots can be handy, they’re vague and it’s not like the across() wrap is hard to type.
What this makes very clear is the similarity between the two methods: they’re really just using the select() syntax in the across(), but one has to embrace bare names and the other uses the all_of() modifier we always have to include when we want to select() with a character vector.
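The same parallel shows up in a plain select() call outside any function. A quick sketch to check that the bare-name and character forms pick out the same columns:

```r
library(dplyr)

# Bare names combined with c(), as passed to the embraced versions
bare_sel <- mtcars %>% select(c(gear, carb))

# Character vector wrapped in all_of(), as used in the character versions
char_sel <- mtcars %>% select(all_of(c('gear', 'carb')))

identical(bare_sel, char_sel)
```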
Passing select syntax
Since we’re using that across, is it possible to pass other select() syntax than variable names? e.g. is.numeric, starts_with() or b:f? Let’s test it just with the summarize bit.
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = is.numeric)
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(is.numeric, mean, .names = "mean_{.col}")`.
Caused by warning:
! Use of bare predicate functions was deprecated in tidyselect 1.1.0.
ℹ Please use wrap predicates in `where()` instead.
# Was:
data %>% select(is.numeric)
# Now:
data %>% select(where(is.numeric))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 11
gear carb mean_mpg mean_cyl mean_disp mean_hp mean_drat mean_wt mean_qsec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 20.3 5.33 201. 104 3.18 3.05 19.9
2 3 2 17.2 8 346. 162. 3.04 3.56 17.1
3 3 3 16.3 8 276. 180 3.07 3.86 17.7
4 3 4 12.6 8 416. 228 3.22 4.69 16.9
5 4 1 29.1 4 84.2 72.5 4.06 2.07 19.2
6 4 2 24.8 4 121. 79.5 4.16 2.68 20.0
7 4 4 19.8 6 164. 116. 3.91 3.09 17.7
8 5 2 28.2 4 108. 102 4.1 1.83 16.8
9 5 4 15.8 8 351 264 4.22 3.17 14.5
10 5 6 19.7 6 145 175 3.62 2.77 15.5
11 5 8 15 8 301 335 3.54 3.57 14.6
# ℹ 2 more variables: mean_vs <dbl>, mean_am <dbl>
That works but is angry about the missing where(). Just throwing the bare select syntax straight in works though, for where()-type arguments, and it seems to be general: it works for col:col and starts_with() as well.
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = where(is.numeric))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 11
gear carb mean_mpg mean_cyl mean_disp mean_hp mean_drat mean_wt mean_qsec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 20.3 5.33 201. 104 3.18 3.05 19.9
2 3 2 17.2 8 346. 162. 3.04 3.56 17.1
3 3 3 16.3 8 276. 180 3.07 3.86 17.7
4 3 4 12.6 8 416. 228 3.22 4.69 16.9
5 4 1 29.1 4 84.2 72.5 4.06 2.07 19.2
6 4 2 24.8 4 121. 79.5 4.16 2.68 20.0
7 4 4 19.8 6 164. 116. 3.91 3.09 17.7
8 5 2 28.2 4 108. 102 4.1 1.83 16.8
9 5 4 15.8 8 351 264 4.22 3.17 14.5
10 5 6 19.7 6 145 175 3.62 2.77 15.5
11 5 8 15 8 301 335 3.54 3.57 14.6
# ℹ 2 more variables: mean_vs <dbl>, mean_am <dbl>
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = mpg:disp)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 5
gear carb mean_mpg mean_cyl mean_disp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 20.3 5.33 201.
2 3 2 17.2 8 346.
3 3 3 16.3 8 276.
4 3 4 12.6 8 416.
5 4 1 29.1 4 84.2
6 4 2 24.8 4 121.
7 4 4 19.8 6 164.
8 5 2 28.2 4 108.
9 5 4 15.8 8 351
10 5 6 19.7 6 145
11 5 8 15 8 301
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = starts_with('d'))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 4
gear carb mean_disp mean_drat
<dbl> <dbl> <dbl> <dbl>
1 3 1 201. 3.18
2 3 2 346. 3.04
3 3 3 276. 3.07
4 3 4 416. 3.22
5 4 1 84.2 4.06
6 4 2 121. 4.16
7 4 4 164. 3.91
8 5 2 108. 4.1
9 5 4 351 4.22
10 5 6 145 3.62
11 5 8 301 3.54
Select syntax issues
Sometimes we might want to pass a vector of columns to select, but have those that don’t exist get ignored: basically, select however many of this set of columns exist in the dataset. With a character vector, that’s straightforward with any_of. But it fails with bare names, and any_of requires characters.
# These both fail
mtcars %>% select(c(mpg, fakecolumn))
Error in `select()`:
! Can't select columns that don't exist.
✖ Column `fakecolumn` doesn't exist.
mtcars %>% select(any_of(mpg, fakecolumn))
Error in `select()`:
ℹ In argument: `any_of(mpg, fakecolumn)`.
Caused by error in `any_of()`:
! `...` must be empty.
ℹ Did you forget `c()`?
ℹ The expected syntax is `any_of(c("a", "b"))`, not `any_of("a", "b")`
An obvious solution is to use character vectors.
mtcars %>% select(any_of(c('mpg', 'fakecolumn')))
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
Valiant 18.1
Duster 360 14.3
Merc 240D 24.4
Merc 230 22.8
Merc 280 19.2
Merc 280C 17.8
Merc 450SE 16.4
Merc 450SL 17.3
Merc 450SLC 15.2
Cadillac Fleetwood 10.4
Lincoln Continental 10.4
Chrysler Imperial 14.7
Fiat 128 32.4
Honda Civic 30.4
Toyota Corolla 33.9
Toyota Corona 21.5
Dodge Challenger 15.5
AMC Javelin 15.2
Camaro Z28 13.3
Pontiac Firebird 19.2
Fiat X1-9 27.3
Porsche 914-2 26.0
Lotus Europa 30.4
Ford Pantera L 15.8
Ferrari Dino 19.7
Maserati Bora 15.0
Volvo 142E 21.4
But does that then preclude using other tidyselect syntax such as :, starts_with, etc? Sure, we can swap back and forth if we’re accessing select directly, but not if this is embedded in a function. The answer is sometimes: it works with starts_with but not : (the : version is left commented out below because it fails).
mtcars %>% select(any_of(starts_with('d')))
disp drat
Mazda RX4 160.0 3.90
Mazda RX4 Wag 160.0 3.90
Datsun 710 108.0 3.85
Hornet 4 Drive 258.0 3.08
Hornet Sportabout 360.0 3.15
Valiant 225.0 2.76
Duster 360 360.0 3.21
Merc 240D 146.7 3.69
Merc 230 140.8 3.92
Merc 280 167.6 3.92
Merc 280C 167.6 3.92
Merc 450SE 275.8 3.07
Merc 450SL 275.8 3.07
Merc 450SLC 275.8 3.07
Cadillac Fleetwood 472.0 2.93
Lincoln Continental 460.0 3.00
Chrysler Imperial 440.0 3.23
Fiat 128 78.7 4.08
Honda Civic 75.7 4.93
Toyota Corolla 71.1 4.22
Toyota Corona 120.1 3.70
Dodge Challenger 318.0 2.76
AMC Javelin 304.0 3.15
Camaro Z28 350.0 3.73
Pontiac Firebird 400.0 3.08
Fiat X1-9 79.0 4.08
Porsche 914-2 120.3 4.43
Lotus Europa 95.1 3.77
Ford Pantera L 351.0 4.22
Ferrari Dino 145.0 3.62
Maserati Bora 301.0 3.54
Volvo 142E 121.0 4.11
# mtcars %>% select(any_of(hp:wt))
Is the trick to pass it the whole any_of expression? That IS a tidyselect call. Try it in the function directly, to get all the across in there correctly. First, this fails if we just pass extra columns:
# gsumbaremulti(mtcars,
# groupers = c(gear, carb),
# sumcols = c(mpg, fakecol))
If we know some might not exist, we can instead pass the whole any_of and character names. Is this cleaner? No, now we’re back to characters, but ALSO needing to pass the any_of. So why do it? Because we sometimes also need to pass other tidyselect syntax.
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = any_of(c('mpg', 'fakecol')))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb mean_mpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
Now, what if that is in turn buried in a function, so we need to set the argument outside the call? This might happen if we have a user interface where they choose columns. For example, they might set the cols, and then call a function that calls what we have above.
whichcols <- c('mpg', 'fakecol')
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = any_of(whichcols))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb mean_mpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
That’s easy enough. But what if whichcols could be tidyselect syntax? That can’t be saved to an object. It can be saved with expr, but then that has to be unpacked with !!.
# Fails
# whichcols <- starts_with('m')
whichcols <- expr(starts_with('m'))
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = !!whichcols)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb mean_mpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
That allows passing tidyselect, but does it break the any_of situation? Not if we wrap it in expr.
whichcols <- expr(any_of(c('mpg', 'fakecol')))
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = !!whichcols)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb mean_mpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
That means that if we might have a character vector and might have tidyselect, we can have a multi-step process to create the expression and pass it to the function. I.e. the user can set whichcols directly as an expr-wrapped tidyselect, OR if it is a character vector the code wraps it in any_of itself. See the next two code blocks.
colstosum <- c('mpg', 'fakecol')
# colstosum <- expr(starts_with('d'))
if (is.character(colstosum)) {
  whichcols <- expr(any_of(colstosum))
} else {
  whichcols <- colstosum
}
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = !!whichcols)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 3
gear carb mean_mpg
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
colstosum <- expr(starts_with('d'))

if (is.character(colstosum)) {
  whichcols <- expr(any_of(colstosum))
} else {
  whichcols <- colstosum
}
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = !!whichcols)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 4
gear carb mean_disp mean_drat
<dbl> <dbl> <dbl> <dbl>
1 3 1 201. 3.18
2 3 2 346. 3.04
3 3 3 276. 3.07
4 3 4 416. 3.22
5 4 1 84.2 4.06
6 4 2 121. 4.16
7 4 4 164. 3.91
8 5 2 108. 4.1
9 5 4 351 4.22
10 5 6 145 3.62
11 5 8 301 3.54
Because that’s ugly, I’m not going to spend more time on it, but it is a workaround for sometimes needing to pass tidyselect syntax and sometimes column names that might not exist. There’s likely a more general way to do this using tidyselect::eval_select, but what I have here will work for now.
tidyselect::eval_select
I’m now running into issues where the approach above isn’t working well, because sometimes the expression ends up including the name of an object (e.g. a passed-in character vector), and by the time we get to the {{}}, we’re too far into the call stack and it ends up failing because it essentially tries to do something like group_by(starts_with(NAME_OF_VECTOR)) instead of group_by(starts_with(VALUES_IN_VECTOR)).
So, one way to handle this is to use tidyselect::eval_select in the outer layer to get column names and indices. Then we can just pass those around rather than all the promises that get lost doing it other ways. It’s a bit cruder, but I think it will involve less gymnastics.
How does eval_select work?
First, how does eval_select work? What do we need to feed it?
A bare tidyselect function fails
colstosum <- starts_with('d')
tidyselect::eval_select(colstosum, mtcars)
Works if wrapped in expr
colstosum <- expr(starts_with('d'))

tidyselect::eval_select(colstosum, mtcars)
disp drat
3 5
Works with character vectors.
colstosum <- c('disp', 'mpg')
tidyselect::eval_select(colstosum, mtcars)
disp mpg
3 1
Does not work if there are values in the character vector that aren’t in the data.
colstosum <- c('disp', 'mpg', 'notinmtcars')
tidyselect::eval_select(colstosum, mtcars)
So we likely still need the conditional to use any_of
colstosum <- c('disp', 'mpg', 'notinmtcars')

if (is.character(colstosum)) {
  whichcols <- expr(any_of(colstosum))
} else {
  whichcols <- colstosum
}
tidyselect::eval_select(whichcols, mtcars)
disp mpg
3 1
And, what if we pass an argument to a tidyselect? I don’t think this is enough to break the original way without some intervening function calls, but it’s the same idea that’s breaking it as we move down a stack.
startletter <- 'd'

colstosum <- expr(starts_with(startletter))
tidyselect::eval_select(colstosum, mtcars)
disp drat
3 5
What actually is that returning? A named vector of indices.
tsout <- tidyselect::eval_select(colstosum, mtcars)
str(tsout)
Named int [1:2] 3 5
- attr(*, "names")= chr [1:2] "disp" "drat"
So, with the conditional in there to guard against grabbing things that don’t exist, that looks like it should work by basically transporting around our selects as character vectors or indices if we evaluate them early enough. In the sort of uses I’m imagining- evaluating this early, and then passing in to further functions- I’d be really nervous about using indices, and so would tend to use the names. How might that work?
gsumbaremulti(mtcars,
groupers = c(gear, carb),
sumcols = names(tsout))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 11 × 4
gear carb mean_disp mean_drat
<dbl> <dbl> <dbl> <dbl>
1 3 1 201. 3.18
2 3 2 346. 3.04
3 3 3 276. 3.07
4 3 4 416. 3.22
5 4 1 84.2 4.06
6 4 2 121. 4.16
7 4 4 164. 3.91
8 5 2 108. 4.1
9 5 4 351 4.22
10 5 6 145 3.62
11 5 8 301 3.54
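On the nervousness about indices: here is a small sketch of how they can go stale if anything reorders the columns between evaluating the selection and using it. The reordering step is just an illustration, not something from the workflow above.

```r
library(dplyr)

# Evaluate the selection early, against the original column order
idx <- tidyselect::eval_select(rlang::expr(starts_with('d')), mtcars)

# Some intermediate step moves a column to the front
reordered <- mtcars %>% select(carb, everything())

# The stored positions now point at different columns
names(reordered)[idx]

# The stored names still refer to the intended columns
names(idx)
```

The positions 3 and 5 now land on cyl and hp, while the names still say disp and drat, which is why passing names around feels safer.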
A function to parse eval_select
What if I actually make the function do the parsing? So I can pass it the characters, bare names, or expr(selectsyntax)?
gsumtidy <- function(data, groupers, sumcols) {

  if (is.character(groupers)) {
    whichg <- expr(any_of(groupers))
  } else {
    whichg <- groupers
  }
  if (is.character(sumcols)) {
    whichs <- expr(any_of(sumcols))
  } else {
    whichs <- sumcols
  }
  gnames <- whichg %>%
    tidyselect::eval_select(data) %>%
    names()
  snames <- whichs %>%
    tidyselect::eval_select(data) %>%
    names()
  gm <- data %>%
    group_by(across({{gnames}})) %>%
    summarise(across({{snames}}, mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
Test that with different sorts of things.
gsumtidy(mtcars,
groupers = 'cyl',
sumcols = expr(starts_with('d')))
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
How about if we include extra cols? Works fine.
gsumtidy(mtcars,
groupers = c('cyl', 'notinmtcars'),
sumcols = expr(starts_with('d')))
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
The whole top part of that could be its own function, and get run at any point in a call stack. It returns the names and not the indices, but could return the whole thing I guess, depending on the safety of indices.
selectnames <- function(data, selector) {

  if (is.character(selector)) {
    whichg <- expr(any_of(selector))
  } else {
    whichg <- selector
  }
  selnames <- whichg %>%
    tidyselect::eval_select(data) %>%
    names()
  return(selnames)
}
gtidysimple <- function(data, groupers, sumcols) {

  gnames <- selectnames(data, groupers)
  snames <- selectnames(data, sumcols)

  gm <- data %>%
    group_by(across({{gnames}})) %>%
    summarise(across({{snames}}, mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
gtidysimple(mtcars,
groupers = c('cyl', 'notinmtcars'),
sumcols = expr(starts_with('d')))
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
expr() vs enquo()
The above needs to wrap tidyselect syntax with expr to work; passing the bare starts_with fails.
gtidysimple(mtcars,
groupers = c('cyl', 'notinmtcars'),
sumcols = starts_with('d'))
Likewise with bare names
gtidysimple(mtcars,
groupers = c(cyl, notinmtcars),
sumcols = expr(starts_with('d')))
That’s because things other than character vectors need to be “defused” (see ?enquo). expr defuses ‘your own local expressions’, while enquo defuses function arguments. So, there are two options: defuse locally with expr when giving the argument to the function (as I’ve done above), or defuse internally with enquo.
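A small sketch of the two defusing routes side by side. capture_arg is a made-up helper, and nothing is evaluated here, so starts_with does not even need to be run against data:

```r
library(rlang)

# Defuse locally: the caller wraps the expression themselves
local_expr <- expr(starts_with('d'))

# Defuse internally: a hypothetical function captures its own argument
capture_arg <- function(arg) {
  enquo(arg)
}
captured <- capture_arg(starts_with('d'))

is_call(local_expr)      # an unevaluated call
is_quosure(captured)     # the same call plus the environment it came from

# Both hold the same underlying expression
identical(quo_get_expr(captured), local_expr)
```

Either way the selection arrives unevaluated; the difference is only who does the wrapping, the caller or the function.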
To defuse internally, we re-write the outer function to enquo its arguments.
gtidyquo <- function(data, groupers, sumcols) {

  gnames <- selectnames(data, enquo(groupers))
  snames <- selectnames(data, enquo(sumcols))

  gm <- data %>%
    group_by(across({{gnames}})) %>%
    summarise(across({{snames}}, mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
Now, that should work without wrapping tidyselect syntax in expr(), and take bare names or character vectors.
gtidyquo(mtcars,
groupers = cyl,
sumcols = starts_with('d'))
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
It also takes characters
gtidyquo(mtcars,
groupers = 'cyl',
sumcols = c('disp', 'drat'))
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
But it’s no longer ignoring values not in the data
gtidyquo(mtcars,
groupers = c('cyl', 'notinmtcars'),
sumcols = expr(starts_with('d')))
That’s because the internal enquo(groupers) in gtidyquo means that selectnames is always seeing selector as language, not character, and so bypassing the any_of() conditional. I don’t want to drop that whole conditional section from selectnames, because that keeps selectnames more general (it doesn’t have to be fed enquo’d arguments). Instead, we can use the strict argument in eval_select to decide whether to fail or silently ignore missing columns. This choice is probably good to have, rather than enforcing one or the other; it’s often the case that we should fail if missing columns are called, rather than just ignore them silently. The same argument can also be used in the conditional as a switch to make the situation with a character selector fail or pass.
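The first point, that a defused character vector no longer looks like characters, can be checked directly. seen_downstream is a made-up helper that defuses its argument the way gtidyquo does and reports what a later is.character() test would see:

```r
library(rlang)

# Hypothetical helper: defuse the argument, then inspect the result
seen_downstream <- function(selector) {
  defused <- enquo(selector)
  c(is_character = is.character(defused),
    is_quosure = is_quosure(defused))
}

# Even though the caller typed a character vector, the function receiving
# the defused argument gets a quosure, so an is.character() branch is skipped
seen_result <- seen_downstream(c('cyl', 'notinmtcars'))
seen_result
```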
selectnames <- function(data, selector, failmissing = TRUE) {

  if (is.character(selector)) {
    if (failmissing) {
      whichg <- expr(all_of(selector))
    } else {
      whichg <- expr(any_of(selector))
    }
  } else {
    whichg <- selector
  }
  selnames <- whichg %>%
    tidyselect::eval_select(data, strict = failmissing) %>%
    names()
  return(selnames)
}
We also need to rewrite gtidyquo to pass failmissing. Could use ..., but that’s vague.
gtidyquo <- function(data, groupers, sumcols, failmissing = TRUE) {

  gnames <- selectnames(data, enquo(groupers), failmissing)
  snames <- selectnames(data, enquo(sumcols), failmissing)

  gm <- data %>%
    group_by(across({{gnames}})) %>%
    summarise(across({{snames}}, mean, .names = 'mean_{.col}')) %>%
    ungroup()
  return(gm)
}
Now, does that work with values not in the data?
gtidyquo(mtcars,
groupers = c('cyl', 'notinmtcars'),
sumcols = starts_with('d'),
failmissing = FALSE)
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
It also works with bare names.
gtidyquo(mtcars,
groupers = c(cyl, notinmtcars),
sumcols = starts_with('d'),
failmissing = FALSE)
# A tibble: 3 × 3
cyl mean_disp mean_drat
<dbl> <dbl> <dbl>
1 4 105. 4.07
2 6 183. 3.59
3 8 353. 3.23
And it should fail if failmissing = TRUE (or left off, since that’s the default).
gtidyquo(mtcars,
groupers = c('cyl', 'notinmtcars'),
sumcols = starts_with('d'))
Conclusions
It seems a bit lame to just translate to characters, but it ends up being a very robust and flexible workaround for situations where passing an object into a tidyselect ends up trying to select the object instead of its contents once we’re further down a call stack. It also lets us use characters, bare names, and tidyselect, and choose whether or not to fail when columns don’t exist.
Functions to use
Sometimes we want to tell the function how to summarise the data. Sometimes we want to do this including arguments, e.g. mean with na.rm = TRUE. Sometimes we want to pass multiple functions and have the names appended, and sometimes those functions are user-defined. Further, sometimes they have an argument internal to the data (such as a weighting column) that they need to access.
We’ll start simple, though I’ll keep the multi-group and multi-col syntax from above because it keeps things general, and allows testing with multiple summarise cols. I’ll use the bare names and embracing for the grouping and summarise variables, but that shouldn’t affect the way function-passing works if we used the character version instead.
Passing a function by name
It’s typically a good idea to name the resulting column with the function when we don’t know what the function will be. And that sets us up for multi-functions.
In the simplest case we can just use a FUN argument. While using the all-caps “FUN” as the argument name seems to be a convention, this isn’t a special argument name and it could be whatever we want.
Previously, we had defined the function to apply inside our function, and so we had hardcoded the naming, e.g. 'mean_{.col}'. But now, we won’t know what it is. We thus need to get the name of the function as well, using as.character(substitute(FUN)).
funpass <- function(data, groupers, sumcols,
                    FUN) {
  # function name as character
  funname <- as.character(substitute(FUN))

  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')

  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUN, .names = funcolname)) %>%
    ungroup()
  return(gm)
}
funpass(mtcars,
groupers = gear,
sumcols = mpg,
FUN = mean)
# A tibble: 3 × 2
gear mean_mpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
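One thing to watch with the name capture: as.character(substitute(FUN)) gives a single string for a plain function name, but not for a namespaced one. A sketch with two made-up helpers; the deparse() variant is an alternative I have not used elsewhere in this doc:

```r
# Hypothetical helper mirroring the name capture in funpass
get_fun_name <- function(FUN) {
  as.character(substitute(FUN))
}

# Hypothetical alternative that turns the expression into text instead
get_fun_name_deparse <- function(FUN) {
  deparse(substitute(FUN))
}

get_fun_name(mean)                    # a single string
get_fun_name(stats::median)           # three strings: "::", "stats", "median"
get_fun_name_deparse(stats::median)   # one string: "stats::median"
```

With the three-element result, paste0(funname, '_{.col}') would produce three name templates rather than one, so namespaced functions would need handling before being passed this way.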
We run into problems as soon as we try to pass arguments to that function, for example when there are NA values and we want to use na.rm.
nacars <- mtcars %>%
  mutate(randnum = rnorm(n()),
         nampg = ifelse(randnum >= 0, mpg, NA))
# funpass(nacars,
# groupers = gear,
# sumcols = nampg,
# FUN = mean, na.rm = TRUE)
Using dots syntax works to allow arguments.
funpasst <- function(data, groupers, sumcols,
                     FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))

  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUN, ..., .names = funcolname)) %>%
    ungroup()
  return(gm)
}
funpasst(nacars,
groupers = gear,
sumcols = nampg,
FUN = mean, na.rm = TRUE)
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(nampg, FUN, ..., .names = funcolname)`.
ℹ In group 1: `gear = 3`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 3 × 2
gear mean_nampg
<dbl> <dbl>
1 3 15.6
2 4 25.4
3 5 21.7
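The deprecation warning suggests supplying the arguments through an anonymous function instead. A minimal sketch of what that might look like; funpass_lambda is a made-up name, the \(x) shorthand assumes R 4.1 or later, and I use the built-in airquality data here only because its NAs are fixed rather than random:

```r
library(dplyr)

# Hypothetical variant: the dots go to FUN inside an anonymous function,
# rather than being handed to across() itself
funpass_lambda <- function(data, groupers, sumcols, FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  funcolname <- paste0(funname, '_{.col}')

  data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, \(x) FUN(x, ...), .names = funcolname)) %>%
    ungroup()
}

ozone_means <- funpass_lambda(airquality,
                              groupers = Month,
                              sumcols = Ozone,
                              FUN = mean, na.rm = TRUE)
ozone_means
```

This keeps the calling interface the same as the dots version, so it would be a drop-in change if the warning becomes an error in a later dplyr.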
As usual, dots can be an issue if we’re doing several things. But we’ll get to that. One solution that is also relevant generally is to specify a custom function. In a simple case this could be mean with na.rm = TRUE, but it could be anything.
Custom function
Maybe we want a custom function. That might be as simple as changing the na.rm default, or it might be something complicated with a few arguments. Here, I’ll demo a version with a swapped na.rm default, illustrating a way to avoid passing arguments, and a more complex function that lags values and multiplies them.
meanna <- function(x) {
  mean(x, na.rm = TRUE)
}
customfun <- function(x, lag_k = 1, na.rm = TRUE, multiplier) {
  xl <- lag(x, lag_k)
  xs <- sum(xl, na.rm = na.rm)*multiplier
  return(xs)
}
funpasst(nacars,
groupers = gear,
sumcols = nampg,
FUN = meanna)
# A tibble: 3 × 2
gear meanna_nampg
<dbl> <dbl>
1 3 15.6
2 4 25.4
3 5 21.7
funpasst(nacars,
groupers = gear,
sumcols = nampg,
FUN = customfun, lag_k = 0, multiplier = 10)
# A tibble: 3 × 2
gear customfun_nampg
<dbl> <dbl>
1 3 1246
2 4 1524
3 5 651
And that works with multiple columns and groupers as well.
funpasst(nacars,
groupers = c(gear, am),
sumcols = c(nampg, hp),
FUN = customfun, lag_k = 0, multiplier = 10)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
# A tibble: 4 × 4
gear am customfun_nampg customfun_hp
<dbl> <dbl> <dbl> <dbl>
1 3 0 1246 26420
2 4 0 244 4030
3 4 1 1280 6710
4 5 1 651 9780
Function with internal data argument
Sometimes we might want to use a function that relies on multiple columns- for example, the mean of one column using weights in another.
In the simplest case, we can hardcode that column. Here is a silly example of finding the mean mpg weighted by wt. I’ve removed the dots for now; we’ll get to other arguments next.
funinternal <- function(data, groupers, sumcols,
                        FUN) {
  # function name as character
  funname <- as.character(substitute(FUN))

  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUN, wt, .names = funcolname)) %>%
    ungroup()
  return(gm)
}
funinternal(nacars,
groupers = gear,
sumcols = mpg,
FUN = weighted.mean)
# A tibble: 3 × 2
gear weighted.mean_mpg
<dbl> <dbl>
1 3 15.6
2 4 23.6
3 5 19.7
And yes, that is weighting. If we just pass mean we get
funpasst(nacars,
groupers = gear,
sumcols = mpg,
FUN = mean)
# A tibble: 3 × 2
gear mean_mpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
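The weighting can also be checked by hand, since a weighted mean is just the sum of value times weight divided by the sum of the weights. A sketch for the gear = 3 group using base R only:

```r
# Rows for one group
g3 <- mtcars[mtcars$gear == 3, ]

# Weighted mean written out: sum(x * w) / sum(w)
manual_wmean <- sum(g3$mpg * g3$wt) / sum(g3$wt)

# The built-in function should give the same value
builtin_wmean <- weighted.mean(g3$mpg, g3$wt)

all.equal(manual_wmean, builtin_wmean)
```

That value should correspond to the gear = 3 row of the weighted.mean_mpg output above, since mpg has no missing values.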
But what if we need to specify other arguments? We can use dots again.
funinternald <- function(data, groupers, sumcols,
                         FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUN, wt, ..., .names = funcolname)) %>%
    ungroup()
  return(gm)
}
funinternald(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
Another way to do this, which might be a bit clearer as the number of arguments grows, is to use tilde function specification. This is nearly the same, but makes it clear which arguments belong to the FUN.
funinternaldt <- function(data, groupers, sumcols,
                          FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, ~FUN(., wt, ...), .names = funcolname)) %>%
    ungroup()
  return(gm)
}
That yields the same result; we’ve just specified the summary function differently.
funinternaldt(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
Passing internal columns by name
So far, the internal columns have been hardcoded, and at a known position in the arguments to the FUN. What if we want to specify them on calling the function?
Can we just use the dots? Not with a bare name.
# funpasst(nacars,
# groupers = gear,
# sumcols = nampg,
# FUN = weighted.mean, wt, na.rm = TRUE)
Does it work to use the tilde version?
funtildedots <- function(data, groupers, sumcols,
                         FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, ~FUN(., ...), .names = funcolname)) %>%
    ungroup()
  return(gm)
}
No, that still can’t find the bare name- it looks for an object, not something internal to the data.
funtildedots(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, wt, na.rm = TRUE)
If we know that there will be a second data-variable argument to the function, we might be able to embrace.
funinteralembrace <- function(data, groupers, sumcols,
                              FUN, arg2, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUN, {{arg2}}, ..., .names = funcolname)) %>%
    ungroup()
  return(gm)
}
funinteralembraceT <- function(data, groupers, sumcols,
                               FUN, arg2, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, ~FUN(., {{arg2}}, ...), .names = funcolname)) %>%
    ungroup()
  return(gm)
}
That works for both the tilde and non-tilde versions.
funinteralembrace(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean,
arg2 = wt, na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
funinteralembraceT(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean,
arg2 = wt, na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
But what if we want a function that works with FUNs that may or may not require a second data-variable argument? Do the above functions work with something like mean that won’t have an arg2? No.
funinteralembrace(nacars,
groupers = gear,
sumcols = nampg,
FUN = mean, na.rm = TRUE)
funinteralembraceT(nacars,
groupers = gear,
sumcols = nampg,
FUN = mean, na.rm = TRUE)
Is there a way to write a function that may have any number from 0 to n internal data arguments, as well as other non-data arguments (e.g. na.rm etc)? It will be tricky, because some unknown number of items will need to be embraced. Usual methods to unpack the ellipses using list(...) won’t work, I don’t think. And if they do, it’s still unclear how many of the items in the list should be embraced. Does it even work if we know how many need to be embraced? Test with a simple case of whether we can even do the list(...).
testdots <- function(data, groupers, sumcols,
                     FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')

  dots <- list(...)

  print(dots)
  # gm <- data %>%
  #   group_by(across({{groupers}})) %>%
  #   summarise(across({{sumcols}}, ~FUN(., {{dots[1]}}, ...), .names = funcolname)) %>%
  #   ungroup()
  # return(gm)
}
Even that doesn’t work- including bare names in the dots and then embracing doesn’t work because list() needs them as objects.
testdots(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, wt, na.rm = TRUE)
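The failure comes from list(...) forcing each argument to be evaluated, so a bare wt is looked up as an ordinary object. As a minimal sketch (not something the functions in this document use), rlang::enquos() captures the dots without evaluating them. The helper name capture_dots is made up for illustration.

```r
library(rlang)

# Hypothetical helper: capture everything passed in ... without evaluating it
capture_dots <- function(...) {
  dots <- enquos(...)
  # as_label() turns each captured expression into a string
  vapply(dots, as_label, character(1))
}

# wt does not exist as an object here, yet capturing it does not error
captured <- capture_dots(wt, na.rm = TRUE)
print(captured)
```

This only shows that bare names can survive being passed through the dots. Deciding which captured entries are data-variables and which are ordinary arguments remains the hard part.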
What is it I’m actually trying to do here? Write a function that takes an arbitrary number of data-variable arguments and an arbitrary number of passed env-arguments. That’s always going to be tricky, and will get trickier to sort things out like the order of the arguments. Is it possible? Almost certainly. But I think I’ll leave sorting it out for later. We have a version that works for a known number of arguments in a known order, which is enough in some situations. A workaround will become apparent anyway after the next section, where I pass in external vectors.
Function with vector argument passed in
One way to get around the issue above is, instead of passing the name of a data variable, to pass in the vector itself as an object. This also allows passing in vectors unattached to the dataframe being operated on, though since they’ll need to have the same number of rows, in most cases they’ll be attached.
How does this work? We write the main function to do the grouping and summarizing, and within it define the function to evaluate in the summarize, accounting for the various types of arguments and the grouping. This works because the ... are all env-variables (vectors and scalars) instead of bare names of data-variables. This is all based on funpasst above, with the addition of the internal function creation. Because the function we define may be grouped, it needs to be passed the indices for the current group rows so it only operates on those. I’m using tilde notation to keep it clearer how that function gets called in the summarise.
We could write the function that creates the function to evaluate inside the main function, or elsewhere. Writing it inside allows us to take some shortcuts because it can access objects in the outer function environment and avoid explicitly passing as many objects around. Though that can be dangerous.
The !!! unpacks a list of function arguments.
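The group-index idea can be seen in isolation before wrapping it in a function. This is a minimal sketch on plain mtcars, with a made-up external vector ext_w standing in for weights that live outside the dataframe.

```r
library(dplyr)

# An external vector with one value per row of the data (assumed for illustration)
ext_w <- seq_len(nrow(mtcars))

# Inside summarise(), cur_group_rows() gives the row indices of the current
# group, so the external vector can be clipped to line up with the grouped column
res <- mtcars %>%
  group_by(gear) %>%
  summarise(wm_mpg = weighted.mean(mpg, ext_w[cur_group_rows()]))

print(res)
```

Without the [cur_group_rows()] subsetting, the full-length ext_w would be paired with a group-length mpg, which is the mismatch the internal function has to guard against.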
arbvecscal <- function(data, groupers, sumcols,
                       FUN, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')

  # Define the function to evaluate
  thisfun <- function(x, indices) {
    elip <- list(...)

    # deal with the case of no passed arguments
    if (length(elip) == 0) {
      return(rlang::exec(FUN, x))
    } else {
      # clip vector ... arguments (e.g. weights) to just the group
      for (i in 1:length(elip)) {
        if (length(elip[[i]]) == nrow(data)) {
          elip[[i]] <- elip[[i]][indices]
        }
      }
      return(rlang::exec(FUN, x, !!!elip))
    }
  }

  # The main group and summarise
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, ~thisfun(., cur_group_rows()), .names = funcolname)) %>%
    ungroup()
  return(gm)
}
So, for something like the weighted average with an na.rm argument, we specify the vector of weights, rather than their bare name in the dataframe.
arbvecscal(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, nacars$wt, na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
And that also works if we want a function without any data-variables.
arbvecscal(nacars,
groupers = gear,
sumcols = nampg,
FUN = mean, na.rm = TRUE)
# A tibble: 3 × 2
gear mean_nampg
<dbl> <dbl>
1 3 15.6
2 4 25.4
3 5 21.7
If we don’t want to pass vectors but pass bare names, we might be able to do that with the same approach, but will need to specify which are which. Then we’d create the vectors internal to the function using the same select syntax as before.
Now the internal function has to be a bit different (simpler) since it doesn’t have to do the checking for length since we’ve specified dataargs.
arbdatanames <- function(data, groupers, sumcols,
                         FUN, dataargs, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')

  # make a tibble so it doesn't collapse to vector if only one column
  datavecs <- data %>%
    as_tibble() %>%
    select({{dataargs}})

  # Define the function to evaluate
  thisfun <- function(x, indices) {
    elip <- list(...)

    # deal with the case of no passed arguments
    if (length(elip) == 0 & nrow(datavecs) == 0) {
      return(rlang::exec(FUN, x))
    } else {
      # clip data arguments (e.g. weights) to just the group
      thisdata <- datavecs[indices, ]

      # make all the arguments a list so we can call it
      allargs <- c(as.list(thisdata), elip)

      return(rlang::exec(FUN, x, !!!allargs))
    }
  }

  # The main group and summarise
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, ~thisfun(., cur_group_rows()), .names = funcolname)) %>%
    ungroup()
  return(gm)
}
Now, that should work for the weighted mean as well. Note that now the data variables have to be part of the dataframe; this function does not accept vectors passed in from elsewhere.
BUT, it doesn’t work, because the names in the list of arguments following the !!! need to match the function’s argument names. Here, wt is the name in the list, but weighted.mean wants w. So we not only need to specify the data-variable name, but the function-argument name for that variable as well. This is getting very in the weeds.
arbdatanames(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, dataargs = wt, na.rm = TRUE)
For example, arguments with the wrong names just get ignored. Names are essential, the execution does not just rely on order like if we called a function directly. Which makes sense for safety, but makes things harder here.
vals <- rnorm(10)
arglist <- list(x = vals, wt = 1:10, na.rm = TRUE)
rlang::exec(weighted.mean, !!!arglist)
[1] 0.1166201
arglist2 <- list(x = vals, w = 1:10, na.rm = TRUE)
rlang::exec(weighted.mean, !!!arglist2)
[1] 0.08719472
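The same point can be checked with fixed numbers so the results are reproducible. This sketch also adds a third case that is not in the text above: as far as I understand rlang::exec(), when the spliced list has no names at all, arguments are matched by position as in a direct call, so it is the names on the list that switch matching from position to name.

```r
library(rlang)

vals <- c(1, 2, 3, 4)
wts  <- c(4, 3, 2, 1)

# 'wt' is not an argument of weighted.mean(), so it falls into ... and is ignored
ignored <- exec(weighted.mean, !!!list(x = vals, wt = wts))

# Correct argument name: the weights are used
named <- exec(weighted.mean, !!!list(x = vals, w = wts))

# No names at all: arguments are matched by position
positional <- exec(weighted.mean, !!!list(vals, wts))

print(c(ignored = ignored, named = named, positional = positional))
```

Here the ignored case is just the plain mean of vals, while the named and positional cases agree with each other.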
It would be nice to pass name-value pairs, but the bare names are going to trip us up, I think. Could do it with paired characters I guess, but we’ve just spent quite a lot of time trying to avoid that. Would work though. Kind of a pain to set up; it would make most sense as two paired columns or vectors. And if we do that, it’d end up being roughly equivalent to just adding another argument to the function for the matched names.
Skipping the rename if dataargnames aren’t specified allows ignoring it if the columns have the correct names, and helps it work more smoothly if there aren’t dataargs at all.
arbdatanames <- function(data, groupers, sumcols,
                         FUN, dataargs, dataargnames = NULL, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')

  # make a tibble so it doesn't collapse to vector if only one column
  datavecs <- data %>%
    as_tibble() %>%
    select({{dataargs}})
  if (!is.null(dataargnames)) {
    names(datavecs) <- dataargnames
  }

  # Define the function to evaluate
  thisfun <- function(x, indices) {
    elip <- list(...)

    # deal with the case of no passed arguments
    if (length(elip) == 0 & nrow(datavecs) == 0) {
      return(rlang::exec(FUN, x))
    } else {
      # clip data arguments (e.g. weights) to just the group
      thisdata <- datavecs[indices, ]

      # make all the arguments a list so we can call it
      allargs <- c(as.list(thisdata), elip)

      return(rlang::exec(FUN, x, !!!allargs))
    }
  }

  # The main group and summarise
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, ~thisfun(., cur_group_rows()), .names = funcolname)) %>%
    ungroup()
  return(gm)
}
Now that works. The alternative would be to have a table of matched dataargs and dataargnames, and have that table be a single argument to arbdatanames, but we’d still have to create it and that’d involve more overhead.
arbdatanames(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, dataargs = wt, dataargnames = 'w',
na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
And it works in situations without data args.
arbdatanames(nacars,
groupers = gear,
sumcols = nampg,
FUN = mean, na.rm = TRUE)
# A tibble: 3 × 2
gear mean_nampg
<dbl> <dbl>
1 3 15.6
2 4 25.4
3 5 21.7
That should work for >1 data variable as well. Let’s define a function that needs multiple data variables. This is very contrived with just some division and multiplication, but works as a check.
multidat <- function(x, w, d, m, na.rm) {
  preprep <- x/d*m
  outcome <- weighted.mean(preprep, w, na.rm = na.rm)
}
That works. Note that the dependence on argument names means we can specify out of order: we get two very different answers depending on whether we call cyl and hp d and m, or m and d.
arbdatanames(nacars,
groupers = gear,
sumcols = nampg,
FUN = multidat, dataargs = c(wt, cyl, hp), dataargnames = c('w', 'd', 'm'),
na.rm = TRUE)
# A tibble: 3 × 2
gear multidat_nampg
<dbl> <dbl>
1 3 358.
2 4 466.
3 5 654.
arbdatanames(nacars,
groupers = gear,
sumcols = nampg,
FUN = multidat, dataargs = c(wt, cyl, hp), dataargnames = c('w', 'm', 'd'),
na.rm = TRUE)
# A tibble: 3 × 2
gear multidat_nampg
<dbl> <dbl>
1 3 0.643
2 4 1.39
3 5 0.608
Is there any reason to specify thisfun externally to the main function? I guess maybe? It forces us to specify arguments, and potentially makes things clearer.
# Define the function to evaluate within the summary
sumfun <- function(x, indices, FUN, datavecs, ...) {
  elip <- list(...)

  # deal with the case of no passed arguments
  if (length(elip) == 0 & nrow(datavecs) == 0) {
    return(rlang::exec(FUN, x))
  } else {
    # clip data arguments (e.g. weights) to just the group
    thisdata <- datavecs[indices, ]

    # make all the arguments a list so we can call it
    allargs <- c(as.list(thisdata), elip)

    return(rlang::exec(FUN, x, !!!allargs))
  }
}
newfun <- function(data, groupers, sumcols,
                   FUN, dataargs, dataargnames = NULL, ...) {
  # function name as character
  funname <- as.character(substitute(FUN))
  # This just avoids clutter in the summarise
  funcolname <- paste0(funname, '_{.col}')

  # make a tibble so it doesn't collapse to vector if only one column
  datavecs <- data %>%
    as_tibble() %>%
    select({{dataargs}})
  if (!is.null(dataargnames)) {
    names(datavecs) <- dataargnames
  }

  # The main group and summarise
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}},
                     ~sumfun(., indices = cur_group_rows(),
                             FUN = FUN, datavecs = datavecs, ...),
                     .names = funcolname)) %>%
    ungroup()
  return(gm)
}
That does work just as above. I’m not sure which will be cleaner in practice, but I like that this relies less on borrowing variables from the creating environment.
newfun(nacars,
groupers = gear,
sumcols = nampg,
FUN = weighted.mean, dataargs = wt, dataargnames = 'w',
na.rm = TRUE)
# A tibble: 3 × 2
gear weighted.mean_nampg
<dbl> <dbl>
1 3 14.9
2 4 24.8
3 5 19.6
newfun(nacars,
groupers = gear,
sumcols = nampg,
FUN = multidat, dataargs = c(wt, cyl, hp), dataargnames = c('w', 'm', 'd'),
na.rm = TRUE)
# A tibble: 3 × 2
gear multidat_nampg
<dbl> <dbl>
1 3 0.643
2 4 1.39
3 5 0.608
Multiple functions- with appropriate named outputs
Simple - hardcoded number of functions
Sometimes we might want to calculate multiple summary or mutate functions for the same set of data, and so rather than repeating the above functions multiple times with different FUN arguments, it would be good to be able to send them all at once for one run-through. The simplest way to do this is to have a known number of functions and write that number of summaries, e.g.
simplemultifun <- function(data, groupers, sumcols,
                           FUN1, dataargs1, dataargnames1 = NULL,
                           FUN2, dataargs2, dataargnames2 = NULL, ...) {
  # function name as character
  funname1 <- as.character(substitute(FUN1))
  # This just avoids clutter in the summarise
  funcolname1 <- paste0(funname1, '_{.col}')

  # function name as character
  funname2 <- as.character(substitute(FUN2))
  # This just avoids clutter in the summarise
  funcolname2 <- paste0(funname2, '_{.col}')

  # make a tibble so it doesn't collapse to vector if only one column
  datavecs1 <- data %>%
    as_tibble() %>%
    select({{dataargs1}})
  if (!is.null(dataargnames1)) {
    names(datavecs1) <- dataargnames1
  }

  # make a tibble so it doesn't collapse to vector if only one column
  datavecs2 <- data %>%
    as_tibble() %>%
    select({{dataargs2}})
  if (!is.null(dataargnames2)) {
    names(datavecs2) <- dataargnames2
  }

  # The main group and summarise
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}},
                     ~sumfun(., indices = cur_group_rows(),
                             FUN = FUN1, datavecs = datavecs1, ...),
                     .names = funcolname1),
              across({{sumcols}},
                     ~sumfun(., indices = cur_group_rows(),
                             FUN = FUN2, datavecs = datavecs2, ...),
                     .names = funcolname2)) %>%
    ungroup()
  return(gm)
}
Then as an example, let’s do a weighted mean but unweighted sd. Note that the two functions need to share the dots, so every argument passed there (here na.rm) goes to both.
simplemultifun(nacars,
groupers = gear,
sumcols = nampg,
FUN1 = weighted.mean, dataargs1 = wt, dataargnames1 = 'w',
FUN2 = sd,
na.rm = TRUE)
# A tibble: 3 × 3
gear weighted.mean_nampg sd_nampg
<dbl> <dbl> <dbl>
1 3 14.9 4.01
2 4 24.8 4.84
3 5 19.6 7.89
That works, but is really hardcoded in terms of what we can do. It has to have two functions. So, let’s try to say we can pass an arbitrary set of functions from 1 to n.
Note that a different data structure out the end is likely to be warranted, especially if we calculate these functions on multiple variables. Making this long, with a column for the variable name and then columns for the values of the functions, might be the way to go in that case.
Variable number of functions
What we really want here is to be able to pass in an arbitrary number of functions. That will get complicated if they have things like different data-variable arguments. In the simplest case, we can make the FUNS a list, and summarise just handles it. However, this breaks the names and the dots for arguments- the list needs to have all the info in it.
funmulti <- function(data, groupers, sumcols,
                     FUNS, ...) {
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUNS)) %>%
    ungroup()
  return(gm)
}
If the list is named (using lst here, but list(mean = mean, sd = sd) would work too), those names get appended.
funmulti(nacars,
groupers = gear,
sumcols = mpg,
FUNS = lst(mean, sd))
# A tibble: 3 × 3
gear mpg_mean mpg_sd
<dbl> <dbl> <dbl>
1 3 16.1 3.37
2 4 24.5 5.28
3 5 21.4 6.66
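The naming behaviour can be checked directly on plain mtcars, outside any wrapper function. This is a minimal sketch: the default appends the list names after the column name, and the .names glue specification of across() rearranges them. The pattern "{.fn}_of_{.col}" is just an arbitrary choice for illustration.

```r
library(dplyr)

fun_list <- list(mean = mean, sd = sd)

# Default naming: {.col}_{.fn}
default_names <- mtcars %>%
  group_by(gear) %>%
  summarise(across(mpg, fun_list))

# Custom naming through the .names glue specification
custom_names <- mtcars %>%
  group_by(gear) %>%
  summarise(across(mpg, fun_list, .names = "{.fn}_of_{.col}"))

print(names(default_names))
print(names(custom_names))
```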
That approach should work for arbitrary arguments if I use the tilde notation, and even allows data variables. This is a bit messier in the function call than I’d like, and there’s a bit less control over the names, but I think neither of those are major issues. Would be hard to be less verbose, really, and still have argument specification make any sense across multiple functions.
Actually, can I control the names with .names after all?
funmulti <- function(data, groupers, sumcols,
                     FUNS, ...) {
  # nameparser <- paste0('prefix_{.fn}_{.col}')
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, FUNS,
                     .names = 'prefix_{.fn}_{.col}')) %>%
    ungroup()
  return(gm)
}
As of {dplyr} 1.1, this no longer works: it looks for wt as an object, not a data-variable. We’ll need to find a new solution.
funmulti(nacars,
groupers = gear,
sumcols = nampg,
FUNS = list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
wm = ~weighted.mean(., wt, na.rm = TRUE)))
Error in `summarise()`:
ℹ In argument: `across(nampg, FUNS, .names = "prefix_{.fn}_{.col}")`.
ℹ In group 1: `gear = 3`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_nampg`.
Caused by error:
! object 'wt' not found
Note that now the function args in the list work with data-variables and scalars, not with vectors passed in. This is not solely because of the grouping needing to be handled as we did above with the sumfun cutting to the correct indices, because even if we don’t group, we get errors about promise evals. This is because the FUNS list is being evaluated inside the summarise, and so thinks everything is a data-variable. There is probably a way to sort that out by using .env[['variablename']] in the specification, but that’ll just get more complex than just adding the column to the dataframe if we hit this situation. Especially since we’d have to pass the vector in so it’s available inside the funmulti environment, not just the global environment.
outerweights <- 1:nrow(nacars)
funmulti(nacars,
# groupers = gear,
sumcols = nampg,
FUNS = list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
wm = ~weighted.mean(., w = outerweights,
na.rm = TRUE)))
# A tibble: 1 × 3
prefix_mean_nampg prefix_sd_nampg prefix_wm_nampg
<dbl> <dbl> <dbl>
1 20.1 6.59 20.3
Could we do something fancy with a list of FUNS and lists of arglists, parallelling how we did things above? Probably. I think in most instances though, this approach will work. I’ll develop that more complex situation only if needed.
Bringing it all together
Now, let’s choose a couple grouping columns, a selection of cols to summarise, and multiple summary functions, some with data arguments and some that are custom.
None of this works as of dplyr 1.1. The issue is that with new behaviour in dplyr, it is looking for the additional arguments not in the column names but as objects. See section below.
complexSummary <- funmulti(nacars,
                           groupers = c(gear, carb),
                           sumcols = c(starts_with('d'), nampg),
                           FUNS = list(mean = ~mean(., na.rm = TRUE),
                                       sd = ~sd(., na.rm = TRUE),
                                       wm = ~weighted.mean(., wt, na.rm = TRUE),
                                       custom = ~multidat(., w = wt, d = cyl, m = hp,
                                                          na.rm = FALSE)))
Error in `summarise()`:
ℹ In argument: `across(c(starts_with("d"), nampg), FUNS, .names =
"prefix_{.fn}_{.col}")`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error:
! object 'wt' not found
complexSummary
Error in eval(expr, envir, enclos): object 'complexSummary' not found
This is complex enough that it’s probably useful to pivot_longer.
longsums <- complexSummary %>%
  pivot_longer(cols = -c(gear, carb),
               names_to = c('variable', 'summary_statistic'),
               names_sep = '_',
               values_to = 'value')
Error in eval(expr, envir, enclos): object 'complexSummary' not found
longsums
Error in eval(expr, envir, enclos): object 'longsums' not found
But that actually puts a lot of values with different meaning in the same value column. What’s probably better is to give different statistics their own columns, as sort of an intermediate long/wide.
longwide <- longsums %>%
  pivot_wider(names_from = summary_statistic, values_from = value)
Error in eval(expr, envir, enclos): object 'longsums' not found
longwide
Error in eval(expr, envir, enclos): object 'longwide' not found
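Since the chain above errors before it reaches the pivots, here is a minimal stand-in that runs end to end. It makes two assumptions that differ from the code above: it uses plain mtcars with plain functions instead of nacars with data-variable arguments, and it names the columns "{.col}_{.fn}" so that splitting on "_" gives exactly two pieces in the order variable, statistic.

```r
library(dplyr)
library(tidyr)

# Stand-in summary: two columns, two statistics, named {.col}_{.fn}
sums <- mtcars %>%
  group_by(gear) %>%
  summarise(across(c(mpg, hp), list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))

# Fully long: one row per gear x variable x statistic
longsums <- sums %>%
  pivot_longer(cols = -gear,
               names_to = c("variable", "summary_statistic"),
               names_sep = "_",
               values_to = "value")

# Intermediate long/wide: one row per gear x variable, one column per statistic
longwide <- longsums %>%
  pivot_wider(names_from = summary_statistic, values_from = value)

print(longwide)
```

A name pattern with three underscore-separated pieces, such as 'prefix_{.fn}_{.col}', would need three entries in names_to (or a different separator) for the split to line up.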
Anyway, this sort of arrangement isn’t the point of this document, so I’ll stop there.
Adjusting to dplyr 1.1
As of dplyr 1.1, new behaviour means that if we pass multi-argument functions, it looks for the additional arguments not as data-variables (column names), but as objects. E.g., we now get errors for all the weighted.mean calls above, since it cannot find a wt object when wt is a column name.
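For contrast, a lambda written directly inside across() should still be able to see other columns. This is a minimal sketch on plain mtcars (not nacars), compared against a manual per-group calculation; it does not involve passing the function in from outside, which is where the trouble starts.

```r
library(dplyr)

# Lambda defined inline, inside the summarise call: wt resolves to the column
wm <- mtcars %>%
  group_by(gear) %>%
  summarise(across(mpg, ~ weighted.mean(.x, wt), .names = "wm_{.col}"))

# Manual check: weighted mean of mpg by wt within each gear
manual <- sapply(split(mtcars, mtcars$gear),
                 function(d) weighted.mean(d$mpg, d$wt))

print(wm)
print(manual)
```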
This is discussed as a dplyr GitHub issue, where there is a workaround using rlang::quo, but I really don’t like it for a couple reasons, primarily that it forces a user to wrap their code in rlang::quo, and it matters where in the call stack the function gets defined. I’m not sure I’ll figure anything out that works better for me, since the tidyverse people came up with the workaround, but I need to try.
Re-demoing the issue
That workaround uses {{}} around FUNS in the function. Building on funmulti above,
funbrace <- function(data, groupers, sumcols,
                     FUNS, ...) {
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, {{FUNS}},
                     .names = 'prefix_{.fn}_{.col}')) %>%
    ungroup()
  return(gm)
}
That actually works when we define the function to call inside the function argument.
bracecheck <- funbrace(nacars,
                       groupers = c(gear, carb),
                       sumcols = c(starts_with('d'), nampg),
                       FUNS = list(mean = ~mean(., na.rm = TRUE),
                                   wm = ~weighted.mean(., wt, na.rm = TRUE)))
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
bracecheck
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
However, if we define the functions to call in an object, it fails
funstocall <- list(mean = ~mean(., na.rm = TRUE),
                   wm = ~weighted.mean(., rlang::data_sym('wt'), na.rm = TRUE))
bracecheck2 <- funbrace(nacars,
                        groupers = c(gear, carb),
                        sumcols = c(starts_with('d'), nampg),
                        FUNS = funstocall)
Error in `summarise()`:
ℹ In argument: `across(c(starts_with("d"), nampg), funstocall, .names =
"prefix_{.fn}_{.col}")`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error in `x * w`:
! non-numeric argument to binary operator
bracecheck2
Error in eval(expr, envir, enclos): object 'bracecheck2' not found
And the ‘solution’ is to use rlang::quo, followed by !! in the call
funstocallq <- rlang::quo(list(mean = ~mean(., na.rm = TRUE),
                               wm = ~weighted.mean(., wt, na.rm = TRUE)))
bracecheckq <- funbrace(nacars,
                        groupers = c(gear, carb),
                        sumcols = c(starts_with('d'), nampg),
                        FUNS = !!funstocallq)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
bracecheckq
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
That works, but it sure requires a lot of fiddling by the user with quosures.
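For anyone unsure what the quosure is doing in that workaround, here is a minimal sketch, independent of the functions above: rlang::quo() captures an expression without evaluating it, and the expression is only resolved later, when it is evaluated against some data.

```r
library(rlang)

# Defuse: capture the expression (and the environment it was written in)
# without evaluating it
q <- quo(mean(mpg))

# Evaluate later with a data mask, so 'mpg' resolves to the column of mtcars
masked <- eval_tidy(q, data = mtcars)

print(q)
print(masked)
```

Wrapping the function list in quo() delays its evaluation in the same way, so the formulas inside it are only created once dplyr evaluates the expression against the data.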
We can bring the !! inside the function, which seems to work. I’ve run into issues before where this then requires quosures for everything, but it seems to be working here for mean, which doesn’t need the quosure because it doesn’t reference data-variables.
The !! method is
fundefuse <- function(data, groupers, sumcols,
                      FUNS, ...) {
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, !!FUNS,
                     .names = 'prefix_{.fn}_{.col}')) %>%
    ungroup()
  return(gm)
}
And so now we don’t have to use !! in the function call; the user still wraps the list in rlang::quo, but passes the resulting object as-is.
funstocallq <- rlang::quo(c(mean = ~mean(., na.rm = TRUE),
                            wm = ~weighted.mean(., wt, na.rm = TRUE)))
defusecheckb <- fundefuse(nacars,
                          groupers = c(gear, carb),
                          sumcols = c(starts_with('d'), nampg),
                          FUNS = funstocallq)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
defusecheckb
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
And mean works as well, even when it’s not wrapped in rlang::quo, because it doesn’t reference data-variables.
funmean <- list(mean = ~mean(., na.rm = TRUE))

defusecheckm <- fundefuse(nacars,
                          groupers = c(gear, carb),
                          sumcols = c(starts_with('d'), nampg),
                          FUNS = funmean)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
defusecheckm
# A tibble: 11 × 5
gear carb prefix_mean_disp prefix_mean_drat prefix_mean_nampg
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 3.18 21.4
2 3 2 346. 3.04 17.8
3 3 3 276. 3.07 NaN
4 3 4 416. 3.22 12.4
5 4 1 84.2 4.06 27.6
6 4 2 121. 4.16 25.4
7 4 4 164. 3.91 21
8 5 2 108. 4.1 30.4
9 5 4 351 4.22 NaN
10 5 6 145 3.62 19.7
11 5 8 301 3.54 15
Searching for a solution
I really don’t want to require quosures. And I want to be able to pass character function names.
Make reference internal to a custom function?
Attempt 1: can I simply define a function with the data-var internally referenced so it only takes one argument? I doubt it, but that might be the easiest.
weightcars <- function(x) {
  weighted.mean(x, w = wt, na.rm = TRUE)
}
That doesn’t work with either the !! or {{}} method.
funscustom <- list(mean = ~mean(., na.rm = TRUE),
                   wm = ~weightcars(.))
defusecheckc <- fundefuse(nacars,
                          groupers = c(gear, carb),
                          sumcols = c(starts_with('d'), nampg),
                          FUNS = funscustom)
Error in `summarise()`:
ℹ In argument: `across(...)`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error in `weightcars()`:
! object 'wt' not found
defusecheckc
Error in eval(expr, envir, enclos): object 'defusecheckc' not found
bracecheckc <- funbrace(nacars,
                        groupers = c(gear, carb),
                        sumcols = c(starts_with('d'), nampg),
                        FUNS = funscustom)
Error in `summarise()`:
ℹ In argument: `across(c(starts_with("d"), nampg), funscustom, .names =
"prefix_{.fn}_{.col}")`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error in `weightcars()`:
! object 'wt' not found
bracecheckc
Error in eval(expr, envir, enclos): object 'bracecheckc' not found
And that doesn’t even work with the rlang::quo wrapper (unsurprisingly, I suppose).
funscustom <- rlang::quo(list(mean = ~mean(., na.rm = TRUE),
                              wm = ~weightcars(.)))
defusecheckc <- fundefuse(nacars,
                          groupers = c(gear, carb),
                          sumcols = c(starts_with('d'), nampg),
                          FUNS = funscustom)
Error in `summarise()`:
ℹ In argument: `across(...)`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error in `weightcars()`:
! object 'wt' not found
defusecheckc
Error in eval(expr, envir, enclos): object 'defusecheckc' not found
Modify the aggregation function somehow
My basic thought here is whether I can auto-build the data referencing. I’ve tried using rlang::data_sym in the weighted mean function, and doing a bunch of other things, but I haven’t come up with anything yet. Maybe rlang::inject?
Even if I specify the FUNS as a list inside the function, I need the rlang::quo. Which is surprising, since I don’t need it if they’re specified as a function argument. I’m missing something about quoting, I think.
funbrace <- function(data, groupers, sumcols,
                     FUNS, ...) {
  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, {{FUNS}},
                     .names = 'prefix_{.fn}_{.col}')) %>%
    ungroup()
  return(gm)
}
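One plausible piece of the puzzle, offered as an interpretation rather than something confirmed here: a formula remembers the environment in which it was created. A formula written in the call and evaluated inside summarise() is created where the columns are visible, while one stored in an object beforehand stays tied to the environment where that object was built, where wt is not an object. The sketch below only demonstrates the underlying mechanism with a made-up helper, make_lambda; it does not touch dplyr.

```r
library(rlang)

# Hypothetical helper: the formula is created inside this function,
# so it remembers this function's environment
make_lambda <- function() {
  local_w <- 10
  ~ .x * local_w
}

f <- make_lambda()

# local_w is visible from the formula's environment
has_w <- env_has(environment(f), "local_w")

# as_function() builds a function that evaluates in that remembered environment,
# so local_w is found even though it is not defined where the function is called
result <- as_function(f)(2)

print(has_w)
print(result)
```

If that reading is right, it would also explain why a custom function like weightcars cannot see wt: its body is evaluated in the environment where the function was defined, not in the data.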
Different formula specification
What if instead of using the formula version of anonymous functions, we use \(x)? I think this will behave like the custom weightcars above, but maybe we can have more control inside the aggregation function?
First, does it work with the quo?
<- rlang::quo(list(mean = \(x) mean(x, na.rm = TRUE),
anonq wm = \(x) weighted.mean(x, wt, na.rm = TRUE)))
<- fundefuse(nacars,
defusechecka groupers = c(gear, carb),
sumcols = c(starts_with('d'), nampg),
FUNS = anonq)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
defusechecka
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
Works with the !! but not {{}}.
bracechecka <- funbrace(nacars,
                        groupers = c(gear, carb),
                        sumcols = c(starts_with('d'), nampg),
                        FUNS = anonq)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'gear', 'carb'. You can override using the
`.groups` argument.
bracechecka
# A tibble: 22 × 5
gear carb prefix_1_disp prefix_1_drat prefix_1_nampg
<dbl> <dbl> <named list> <named list> <named list>
1 3 1 <fn> <fn> <fn>
2 3 1 <fn> <fn> <fn>
3 3 2 <fn> <fn> <fn>
4 3 2 <fn> <fn> <fn>
5 3 3 <fn> <fn> <fn>
6 3 3 <fn> <fn> <fn>
7 3 4 <fn> <fn> <fn>
8 3 4 <fn> <fn> <fn>
9 4 1 <fn> <fn> <fn>
10 4 1 <fn> <fn> <fn>
# ℹ 12 more rows
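One way to see why the two operators differ is to build the across() call without evaluating it and look at what lands in the .fns position. This is only a rough sketch using rlang: the helpers show_embrace and show_bang are made up for illustration, and cols is just a placeholder symbol. {{FUNS}} captures what the caller typed (the bare name anonq), while !!FUNS injects the quosure stored in anonq, so the list(...) call itself ends up inside across(). My guess, and it is only a guess about dplyr’s internals, is that across() needs to see the list(...) call to treat the functions as inline.

```r
library(rlang)

# Hypothetical helpers: build (but do not evaluate) an across() call,
# so we can inspect what ends up in the .fns position.
show_embrace <- function(FUNS) quo(across(cols, {{ FUNS }}))
show_bang <- function(FUNS) quo(across(cols, !!FUNS))

anonq <- quo(list(mean = \(x) mean(x, na.rm = TRUE),
                  wm = \(x) weighted.mean(x, wt, na.rm = TRUE)))

# The third element of each call is the .fns argument; both are quosures
fns_embrace <- quo_get_expr(show_embrace(anonq))[[3]]
fns_bang <- quo_get_expr(show_bang(anonq))[[3]]

# {{ }} wraps the symbol the caller typed
quo_get_expr(fns_embrace)
# !! injects the stored quosure, whose expression is the list(...) call
quo_get_expr(fns_bang)
```

So with {{}}, across() is handed something that evaluates to a quosure object rather than to a list of functions, which would fit the odd list-column output above.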
Now, can we get it to work without quo?
anonbare <- list(mean = \(x) mean(x, na.rm = TRUE),
                 wm = \(x) weighted.mean(x, wt, na.rm = TRUE))
Not immediately. But can we modify those functions?
defusecheckab <- fundefuse(nacars,
                           groupers = c(gear, carb),
                           sumcols = c(starts_with('d'), nampg),
                           FUNS = anonbare)
Error in `summarise()`:
ℹ In argument: `across(...)`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error:
! object 'wt' not found
defusecheckab
Error in eval(expr, envir, enclos): object 'defusecheckab' not found
bracecheckab <- funbrace(nacars,
                         groupers = c(gear, carb),
                         sumcols = c(starts_with('d'), nampg),
                         FUNS = anonbare)
Error in `summarise()`:
ℹ In argument: `across(c(starts_with("d"), nampg), anonbare, .names =
"prefix_{.fn}_{.col}")`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error:
! object 'wt' not found
bracecheckab
Error in eval(expr, envir, enclos): object 'bracecheckab' not found
Is the answer to drop dplyr?
I thought about moving to stats::aggregate, but it seems like that is going to cause just as many problems, especially when we get to passing it arbitrary lists of functions. The syntax is just so clumsy (at least to me).
Does it just work if I give it the vector?
This won’t solve the whole problem, and I think it still won’t actually work with the groupings, but I should test it.
anonbarevec <- list(mean = \(x) mean(x, na.rm = TRUE),
                    wm = \(x) weighted.mean(x, nacars$wt, na.rm = TRUE))
As expected, that fails because the external vector doesn’t get broken up by the groups.
defusecheckab <- fundefuse(nacars,
                           groupers = c(gear, carb),
                           sumcols = c(starts_with('d'), nampg),
                           FUNS = anonbarevec)
Error in `summarise()`:
ℹ In argument: `across(...)`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error in `weighted.mean.default()`:
! 'x' and 'w' must have the same length
defusecheckab
Error in eval(expr, envir, enclos): object 'defusecheckab' not found
bracecheckab <- funbrace(nacars,
                         groupers = c(gear, carb),
                         sumcols = c(starts_with('d'), nampg),
                         FUNS = anonbarevec)
Error in `summarise()`:
ℹ In argument: `across(c(starts_with("d"), nampg), anonbarevec, .names =
"prefix_{.fn}_{.col}")`.
ℹ In group 1: `gear = 3` and `carb = 1`.
Caused by error in `across()`:
! Can't compute column `prefix_wm_disp`.
Caused by error in `weighted.mean.default()`:
! 'x' and 'w' must have the same length
bracecheckab
Error in eval(expr, envir, enclos): object 'bracecheckab' not found
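If the only obstacle is that the external vector is not split by group, one possible workaround is to subset it with dplyr::cur_group_rows(), which returns the row indices of the current group. The sketch below is untested against nacars, so it uses plain mtcars, and make_wm is a made-up helper name. It assumes the weight vector lines up row-for-row with the data being summarised, so it would break if the data were filtered or reordered first.

```r
library(dplyr)

# Hypothetical helper: wrap an external weight vector `w` in a function that
# subsets it to the rows of whichever group is currently being summarised.
make_wm <- function(w) {
  function(x) weighted.mean(x, w[cur_group_rows()], na.rm = TRUE)
}

funs_ext <- list(mean = \(x) mean(x, na.rm = TRUE),
                 wm = make_wm(mtcars$wt))

by_rows <- mtcars %>%
  group_by(gear, carb) %>%
  summarise(across(c(disp, drat), funs_ext,
                   .names = 'prefix_{.fn}_{.col}'),
            .groups = 'drop')

# Reference: the same weighted mean written directly against the column
ref <- mtcars %>%
  group_by(gear, carb) %>%
  summarise(wm_disp = weighted.mean(disp, wt), .groups = 'drop')

all.equal(by_rows$prefix_wm_disp, ref$wm_disp)
```

This sidesteps the quoting question entirely, because the functions never refer to a data column by name, but it gives up the convenience of writing wt bare.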
Build and feed a character string
We know we need the rlang::quo to get this to work, but we can see the expressions we need in the list inside the function while debugging. So can we build the list wrapped in rlang::quo inside the function? Not very directly, as far as I can tell. But eval(parse(text = STRING)) seems to be a crude way forward.
It works to feed it a character string.
charfuns <- "rlang::quo(list(mean = function(x) mean(x, na.rm = TRUE), wm = function(x) weighted.mean(x, wt, na.rm = TRUE)))"

# seems to work. NOW, how can I do that, and do it safely?
# Likely turn the list into characters, then put rlang::quo on it, and round and round we go. Going to need lots of testing.
And a function that parses that
funchar <- function(data, groupers, sumcols,
                    FUNS, ...) {
  # turn the character string back into a quosure
  FUNS <- eval(parse(text = FUNS))

  gm <- data %>%
    group_by(across({{groupers}})) %>%
    summarise(across({{sumcols}}, {{FUNS}},
                     .names = 'prefix_{.fn}_{.col}')) %>%
    ungroup()
  return(gm)
}
charcheck <- funchar(nacars,
                     groupers = c(gear, carb),
                     sumcols = c(starts_with('d'), nampg),
                     FUNS = charfuns)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charcheck
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
So, that works. This is getting very messy though. We certainly don’t want to make a user send us that string; that’s far worse than just wrapping in rlang::quo.
BUT, does this allow us to programmatically build that string inside the function? We should try without it first, and then if it fails, build the string. Make a function that does that.
funbracechar <- function(data, groupers, sumcols,
                         FUNS, ...) {
  # first attempt: pass FUNS straight through
  gm <- try(data %>%
              group_by(across({{groupers}})) %>%
              summarise(across({{sumcols}}, {{FUNS}},
                               .names = 'prefix_{.fn}_{.col}')) %>%
              ungroup(), silent = TRUE)

  if (inherits(gm, 'try-error')) {
    # fallback: deparse FUNS, wrap it in rlang::quo(), and parse it back
    fchar <- paste0(c("rlang::quo(", deparse(FUNS), ")"), collapse = '')
    # FUNS2 <- eval(parse(text = fchar)) # base R
    FUNS3 <- rlang::eval_tidy(rlang::parse_expr(fchar)) # rlang claims to be faster?

    # rerun only when the direct attempt failed, so FUNS3 always exists here
    gm <- data %>%
      group_by(across({{groupers}})) %>%
      summarise(across({{sumcols}}, {{FUNS3}},
                       .names = 'prefix_{.fn}_{.col}')) %>%
      ungroup()
  }
  return(gm)
}
Will need to test this with ~ functions, bare names, and \(x) anonymous functions. I don’t think I expect it to work with character names. But it might work with character specification of the whole function?
anonbare <- list(mean = \(x) mean(x, na.rm = TRUE),
                 wm = \(x) weighted.mean(x, wt, na.rm = TRUE))
It works with the \(x) style anonymous function
charcheck <- funbracechar(nacars,
                          groupers = c(gear, carb),
                          sumcols = c(starts_with('d'), nampg),
                          FUNS = anonbare)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charcheck
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
It works with tilde-style anonymous functions
funstilde <- list(mean = ~mean(., na.rm = TRUE),
                  wm = ~weighted.mean(., wt, na.rm = TRUE))
chartilde <- funbracechar(nacars,
                          groupers = c(gear, carb),
                          sumcols = c(starts_with('d'), nampg),
                          FUNS = funstilde)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
chartilde
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
And, unsurprisingly, with the long-form anonymous functions
funsfullanon <- list(mean = function(x) mean(x, na.rm = TRUE),
                     wm = function(x) weighted.mean(x, wt, na.rm = TRUE))
charfullanon <- funbracechar(nacars,
                             groupers = c(gear, carb),
                             sumcols = c(starts_with('d'), nampg),
                             FUNS = funsfullanon)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charfullanon
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
It works with custom functions with the argument inside. If we look at what the deparse does inside the debugger, we can see that it expands those functions out, and so the thing that gets quoted is actually exactly the same as the previous version in funsfullanon.
weightcustom <- function(x) {
  weighted.mean(x, w = wt, na.rm = TRUE)
}

meancustom <- function(x) {
  mean(x, na.rm = TRUE)
}

funscustom <- list(mean = meancustom,
                   wm = weightcustom)
charweightcustom <- funbracechar(nacars,
                                 groupers = c(gear, carb),
                                 sumcols = c(starts_with('d'), nampg),
                                 FUNS = funscustom)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charweightcustom
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_wm_disp prefix_mean_drat prefix_wm_drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18 3.13
2 3 2 346. 347. 3.04 3.03
3 3 3 276. 276. 3.07 3.07
4 3 4 416. 425. 3.22 3.19
5 4 1 84.2 85.3 4.06 4.05
6 4 2 121. 128. 4.16 4.05
7 4 4 164. 164. 3.91 3.91
8 5 2 108. 110. 4.1 4.16
9 5 4 351 351 4.22 4.22
10 5 6 145 145 3.62 3.62
11 5 8 301 301 3.54 3.54
# ℹ 2 more variables: prefix_mean_nampg <dbl>, prefix_wm_nampg <dbl>
It works when there’s a single function, not a list, too. If we look in the debugger, this does still fail with the simple {{}}, triggers the try fallback, and gets deparsed.
funsnolist <- weightcustom

charweightcustom <- funbracechar(nacars,
                                 groupers = c(gear, carb),
                                 sumcols = c(starts_with('d'), nampg),
                                 FUNS = funsnolist)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charweightcustom
# A tibble: 11 × 5
gear carb prefix_1_disp prefix_1_drat prefix_1_nampg
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 208. 3.13 21.4
2 3 2 347. 3.03 17.8
3 3 3 276. 3.07 NaN
4 3 4 425. 3.19 12.3
5 4 1 85.3 4.05 27.5
6 4 2 128. 4.05 24.6
7 4 4 164. 3.91 21
8 5 2 110. 4.16 30.4
9 5 4 351 4.22 NaN
10 5 6 145 3.62 19.7
11 5 8 301 3.54 15
I expect it not to work for a character vector, and it doesn’t.
funsnolistchar <- 'weightcustom'

charnolistchar <- funbracechar(nacars,
                               groupers = c(gear, carb),
                               sumcols = c(starts_with('d'), nampg),
                               FUNS = funsnolistchar)
Error in `summarise()`:
ℹ In argument: `across(c(starts_with("d"), nampg), "weightcustom",
.names = "prefix_{.fn}_{.col}")`.
Caused by error in `across()`:
! `.fns` must be a function, a formula, or a list of functions/formulas.
charnolistchar
Error in eval(expr, envir, enclos): object 'charnolistchar' not found
But, does it work if we add an mget line?
funbracechar <- function(data, groupers, sumcols,
                         FUNS, ...) {
  # look up functions given by name
  if (is.character(FUNS)) {
    FUNS <- mget(FUNS, inherits = TRUE)
  }

  gm <- try(data %>%
              group_by(across({{groupers}})) %>%
              summarise(across({{sumcols}}, {{FUNS}},
                               .names = 'prefix_{.fn}_{.col}')) %>%
              ungroup(), silent = TRUE)

  if (inherits(gm, 'try-error')) {
    fchar <- paste0(c("rlang::quo(", deparse(FUNS), ")"), collapse = '')
    # FUNS2 <- eval(parse(text = fchar)) # base R
    FUNS3 <- rlang::eval_tidy(rlang::parse_expr(fchar)) # rlang claims to be faster?

    gm <- data %>%
      group_by(across({{groupers}})) %>%
      summarise(across({{sumcols}}, {{FUNS3}},
                       .names = 'prefix_{.fn}_{.col}')) %>%
      ungroup()
  }
  return(gm)
}
It works for a single function
funsnolistchar <- 'weightcustom'

charnolistchar <- funbracechar(nacars,
                               groupers = c(gear, carb),
                               sumcols = c(starts_with('d'), nampg),
                               FUNS = funsnolistchar)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charnolistchar
# A tibble: 11 × 5
gear carb prefix_weightcustom_disp prefix_weightcustom_drat
<dbl> <dbl> <dbl> <dbl>
1 3 1 208. 3.13
2 3 2 347. 3.03
3 3 3 276. 3.07
4 3 4 425. 3.19
5 4 1 85.3 4.05
6 4 2 128. 4.05
7 4 4 164. 3.91
8 5 2 110. 4.16
9 5 4 351 4.22
10 5 6 145 3.62
11 5 8 301 3.54
# ℹ 1 more variable: prefix_weightcustom_nampg <dbl>
And for multiple functions if they are in a character vector
funsmultichar <- c('mean', 'weightcustom')

charmultichar <- funbracechar(nacars,
                              groupers = c(gear, carb),
                              sumcols = c(starts_with('d'), nampg),
                              FUNS = funsmultichar)
`summarise()` has grouped output by 'gear'. You can override using the
`.groups` argument.
charmultichar
# A tibble: 11 × 8
gear carb prefix_mean_disp prefix_weightcustom_disp prefix_mean_drat
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 201. 208. 3.18
2 3 2 346. 347. 3.04
3 3 3 276. 276. 3.07
4 3 4 416. 425. 3.22
5 4 1 84.2 85.3 4.06
6 4 2 121. 128. 4.16
7 4 4 164. 164. 3.91
8 5 2 108. 110. 4.1
9 5 4 351 351 4.22
10 5 6 145 145 3.62
11 5 8 301 301 3.54
# ℹ 3 more variables: prefix_weightcustom_drat <dbl>, prefix_mean_nampg <dbl>,
# prefix_weightcustom_nampg <dbl>
But not for a list. This is not unexpected: the mget is inside if (is.character(FUNS)), so the list won’t get mgot. I think that’s good enough for now. It would obviously be doable to purrr over the list and mget the items that are characters, but that’s not really the focus here. We have figured out an (ugly) workaround for the dplyr 1.1 issue, and that will have to do for now. Applying it over all possible organisations of FUNS will have to be for another day.
funsmulticharl <- list(m = 'mean', wm = 'weightcustom')

charmulticharl <- funbracechar(nacars,
                               groupers = c(gear, carb),
                               sumcols = c(starts_with('d'), nampg),
                               FUNS = funsmulticharl)
Error in `summarise()`:
ℹ In argument: `across(...)`.
Caused by error in `across()`:
! `.fns` must be a function, a formula, or a list of functions/formulas.
charmulticharl
Error in eval(expr, envir, enclos): object 'charmulticharl' not found
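For the record, the purrr-over-the-list idea might look something like the sketch below. resolve_funs is a made-up helper, not part of funbracechar, and it only handles the lookup step: character entries are swapped for the functions they name and everything else is left alone. I have not wired it into funbracechar.

```r
library(purrr)

# Hypothetical helper: replace character entries in FUNS with the functions
# they name, searching from `envir` upwards.
resolve_funs <- function(FUNS, envir = parent.frame()) {
  force(envir)
  if (is.character(FUNS)) {
    # a character vector, as in the existing mget() branch
    return(mget(FUNS, envir = envir, mode = "function", inherits = TRUE))
  }
  if (!is.list(FUNS)) {
    # a single function or formula: nothing to look up
    return(FUNS)
  }
  # a list: look up only the character items, keep the rest as-is
  map(FUNS, \(f) {
    if (is.character(f)) {
      get(f, envir = envir, mode = "function", inherits = TRUE)
    } else {
      f
    }
  })
}

weightcustom <- function(x) weighted.mean(x, w = wt, na.rm = TRUE)
funsmulticharl <- list(m = 'mean', wm = 'weightcustom')

resolved <- resolve_funs(funsmulticharl)
str(resolved)
```

The resolved list keeps its names (m and wm). Functions that refer to wt would presumably still need the deparse-and-quo fallback afterwards.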
eval_tidy
I keep feeling like eval_tidy should work somehow, since it allows the .data pronoun, but I can’t seem to get my head around how it would work here. I’d happily write something like \(x) eval_tidy(weighted.mean(x, .data$wt, na.rm = TRUE)). Maybe I can get that to work with the right sort of enquoing? I tried for a while and couldn’t figure it out, but maybe I’ll come back fresh later on.
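Part of the difficulty, as far as I can tell, is that eval_tidy only knows about .data when it is handed both a quoted expression and a data argument, and a function called by across() only receives the column vector x. A minimal sketch of the two cases, using mtcars rather than nacars:

```r
library(rlang)

# With a quosure and an explicit data argument, .data resolves to that data
q <- quo(weighted.mean(mpg, .data$wt, na.rm = TRUE))
wm_mask <- eval_tidy(q, data = mtcars)
wm_mask

# Without a data argument there is no mask for .data to refer to,
# so this errors; the message is captured instead of stopping the script
no_mask <- tryCatch(eval_tidy(quo(.data$wt)),
                    error = function(e) conditionMessage(e))
no_mask
```

So any eval_tidy-based version would need some way to get hold of the current group’s data from inside the function, which is the same problem in a different place.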