library(microbenchmark)
library(doFuture)
library(foreach)
library(furrr)
library(future.apply)
library(doRNG)
library(listenv)
Investigating parallel speedups
I tend to run a lot of code that can be parallelised, but it's not always clear when it's worth it and how best to structure the parallelisation. Should it be at the outermost layer, where I'm typically looping over parameters, at some intermediate layer where I might be looping over indices or iterators, or at an inner layer that handles large datasets?
For reference, I often have population dynamics models with many species and locations. At each timestep I need to make a lot of calculations on the species, including some large matrix multiplications to get dispersal. These could be parallelised over species. And after simulations are complete, I calculate a lot of covariances over species, space, and time that can be parallelised over pairwise combinations. Both of these cases operate on large arrays, and so would feed large amounts of data to parallelised functions, which would then do some limited processing on it (e.g. calculate covariances and clean them up for return). At the other extreme, each of these situations is governed by an initial set of parameters, giving, for example, environmental conditions, species growth rates, etc. These are often just vectors, and so parallelising over them would feed the parallel function small amounts of data and kick off large amounts of work.
To test parallel performance under these different situations, I’ll attempt to build an example that is non-trivial, but still try to stay minimally complex to avoid getting into writing a complex population dynamics model.
Packages and setup
I'll use the {future} package, along with {doFuture} and {foreach}, because I tend to like writing for loops (there's a reason; I'll try to write it up sometime later). I'll also test {furrr} and {future.apply} to see if they differ in any appreciable way.
Just set up a typical doFuture situation with plan(multisession). Sorting out plans is a topic for another day.
registerDoFuture()
plan(multisession)
A data-heavy loop
Let’s just say it’s a biggish (but not obscene) matrix multiplication, which for a single iteration (maybe this is species 1’s dispersal) looks like this:
a <- rnorm(1000)
b <- matrix(rnorm(1000*1000), nrow = 1000)
microbenchmark(c <- a %*% b, times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
c <- a %*% b 1.213801 1.345602 1.475621 1.442501 1.6289 1.7425 10
Now, let's say we have a lot of species, i.e. there are 100 a's in the situation above. I'll keep them as columns of a matrix (and make a list version where needed), so I can use foreach or furrr or future.apply, whichever makes the most sense. My code often makes the most sense in loops because of dependencies, and so I tend to keep writing them, but anything parallelisable should be able to be coerced into working with furrr or future.apply.
So, let's say I have 100 'species' that each get multiplied by the b above:
a100 <- matrix(rnorm(100*1000), ncol = 100)
In a single-species case, this is
c1 <- a100[,1] %*% b
So what we want to do is multiply each column of a100 by b and then glue them back together. And yes, we could do all this more cleanly with linear algebra. The point here is to test the situation where we have a reasonably large amount of data we want to iterate over and do some smallish number of operations on. I’ll make each one a function so we can easily run them through microbenchmark
(and later, nest them).
foreach
mult_foreach <- function(a, b) {
  c_foreach <- foreach(i = 1:ncol(a), .combine = rbind) %dopar% {
    a[,i] %*% b
  }
  return(t(c_foreach))
}
furrr
purrr (and so furrr) doesn't seem to work on matrices. So, I guess we need a silly pre-step to make it a list. I'm going to do that outside the function, simply because if we went this way, we'd set the data up that way in the first place.
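For reference, a minimal sketch of that pre-step on the a100 matrix defined above; the asplit() line is a base-R alternative I'm noting here, not something used in the rest of the post.
as.list(as.data.frame(a100))  # one list element per column; this is what I do below
asplit(a100, MARGIN = 2)      # base-R alternative that also gives a list of columns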
mult_furrr <- function(a_list, b) {
  c_map <- future_map(a_list, \(x) x %*% b)
  matrix(unlist(c_map), ncol = length(a_list))
}
future.apply
mult_apply <- function(a, b) {
  future_apply(a, MARGIN = 2, FUN = function(x) x %*% b)
}
simple for
Preallocate, because I’m not a heathen
mult_for <- function(a, b) {
  c_for <- a
  for(i in 1:ncol(a)) {
    c_for[,i] <- a[,i] %*% b
  }
  return(c_for)
}
future for
We can write a usual for loop if we use futures directly. The futures themselves have to go in a list, because they are futures, not values, and so can't go straight into a matrix. That list can be preallocated.
There are two ways to do this, explicit and implicit; see the future docs.
Explicit futures
mult_for_future_e <- function(a, b) {
  c_for <- vector(mode = 'list', length = ncol(a))
  for(i in 1:ncol(a)) {
    c_for[[i]] <- future({a[,i] %*% b})
  }
  # get values and make a matrix
  v_for <- lapply(c_for, FUN = value)
  return(matrix(unlist(v_for), ncol = ncol(a)))
}
Implicit futures
using listenv
mult_for_future_i <- function(a, b) {
  c_for <- listenv()
  for(i in 1:ncol(a)) {
    c_for[[i]] %<-% {a[,i] %*% b}
  }
  # get values and make a matrix
  v_for <- as.list(c_for)
  return(matrix(unlist(v_for), ncol = ncol(a)))
}
linear algebra
mult_linear <- function(a, b) {
  t(a %*% b)
}
Test they all work
Set up a small test case and see if they are all returning the same answer.
asmall <- matrix(rnorm(10*2), ncol = 2)
asmall_l <- as.list(as.data.frame(asmall)) # This is silly
bsmall <- matrix(rnorm(10*10), ncol = 10)
c_foreach <- mult_foreach(asmall, bsmall)
c_furrr <- mult_furrr(asmall_l, bsmall)
c_apply <- mult_apply(asmall, bsmall)
c_for <- mult_for(asmall, bsmall)
c_for_fe <- mult_for_future_e(asmall, bsmall)
c_for_fi <- mult_for_future_i(asmall, bsmall)
c_linear <- mult_linear(t(asmall), bsmall)
all(c_foreach == c_furrr)
[1] TRUE
all(c_foreach == c_apply)
[1] TRUE
all(c_foreach == c_for)
[1] TRUE
all(c_foreach == c_for_fe)
[1] TRUE
all(c_foreach == c_for_fi)
[1] TRUE
all(c_foreach == c_linear)
[1] TRUE
Benchmark
a100_l <- as.list(as.data.frame(a100))
microbenchmark(
futureforeach = mult_foreach(a100, b),
futurefurrr = mult_furrr(a100_l, b),
futureapply = mult_apply(a100, b),
bare_for = mult_for(a100, b),
future_for_e = mult_for_future_e(a100, b),
future_for_i = mult_for_future_i(a100, b),
bare_linear = mult_linear(t(a100), b),
times = 10
)
Unit: milliseconds
expr min lq mean median uq
futureforeach 2391.7859 2547.7285 2824.28290 2858.64935 3109.1553
futurefurrr 2735.9775 2758.9662 3276.79272 3026.24630 3230.1335
futureapply 2563.8114 2641.1631 2797.72951 2722.93280 2952.8528
bare_for 91.1790 98.1436 107.60745 106.17695 108.6603
future_for_e 13317.3280 14126.5008 15066.07841 15425.70625 15793.8948
future_for_i 13905.7498 14113.6200 14788.64297 14595.06900 15508.8098
bare_linear 39.3198 43.7770 46.23252 44.91655 45.9903
max neval
3168.6752 10
5711.8441 10
3171.3136 10
137.4717 10
16143.9597 10
15898.4413 10
60.5414 10
Well, that's interesting. The futures are hugely slower than even the bare for. And the direct futures in a for are the worst by far (so bad there must be something major wrong). Is it just overhead passing the matrices to workers? That might explain the poor performance of the direct futures in the for: if the other future functions chunk to avoid passing everything every time, then the overhead hits the direct versions much harder. Or something else? Does preallocating help for the foreach, apply, etc?
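As a rough check on the data-transfer suspicion (my addition, not part of the original benchmarks), it's worth at least looking at how much data has to reach each multisession worker:
# Under plan(multisession) the workers are separate R processes, so a100 and b
# both have to be serialised and shipped to them.
format(object.size(a100), units = "MB")
format(object.size(b), units = "MB")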
Preallocation and lists
This is a bit of an aside: we know preallocation speeds up for loops. Does it speed up foreach? Seems unlikely, since by nature foreach constructs the object additively as a list (or whatever's specified in .combine). That can make the code cleaner in a lot of cases if we're building something in a loop, but it seems more like the un-preallocated for method of doing things.
Let's test and see.
Preallocate the foreach
I typically don't do this, since my understanding of foreach is that it builds the result with .combine, so preallocating doesn't do anything. But maybe?
mult_foreach_pre <- function(a, b) {
  c_foreach <- t(a)
  c_foreach <- foreach(i = 1:ncol(a), .combine = rbind) %dopar% {
    a[,i] %*% b
  }
  return(t(c_foreach))
}
Return the foreach as a list
It is possible that using .combine is forcing slower behaviour for the foreach, and that it's optimised to return a list?
mult_foreach_list <- function(a, b) {
  c_foreach <- foreach(i = 1:ncol(a)) %dopar% {
    a[,i] %*% b
  }
  # Do the binding in one step
  return(matrix(unlist(c_foreach), ncol = ncol(a)))
}
Don’t preallocate the for
How bad is this? I almost always DO preallocate (it's faster, the loop starts at 1, and it's just cleaner), but it's possible not to.
mult_for_build <- function(a, b) {
  c_for <- a[,1] %*% b
  for(i in 2:ncol(a)) {
    ctemp <- a[,i] %*% b
    c_for <- rbind(c_for, ctemp)
  }
  return(t(c_for))
}
I'll include the furrr and future.apply here too, I guess as reference.
microbenchmark(
futurefurrr = mult_furrr(a100_l, b),
futureapply = mult_apply(a100, b),
futureforeach = mult_foreach(a100, b),
preallocate_foreach = mult_foreach_pre(a100, b),
foreach_list = mult_foreach_list(a100, b),
bare_for = mult_for(a100, b),
unallocate_for = mult_for_build(a100, b),
bare_linear = mult_linear(t(a100), b),
times = 10
)
Unit: milliseconds
expr min lq mean median uq
futurefurrr 2571.1246 2767.6439 2857.16904 2866.4501 2897.1658
futureapply 2507.3732 2652.2697 2799.40048 2846.0770 2917.3628
futureforeach 2293.2059 2438.9546 2593.98055 2637.0460 2725.0197
preallocate_foreach 2343.7728 2587.9531 2620.46876 2612.0335 2671.4092
foreach_list 2291.8114 2404.4547 2544.75845 2517.0032 2714.5863
bare_for 93.8633 99.2174 100.78608 99.9848 101.3844
unallocate_for 117.2363 120.5628 124.45097 124.5449 127.4235
bare_linear 40.2038 40.7167 42.76278 42.6698 43.5522
max neval
3297.9195 10
3056.1415 10
2785.5028 10
2820.4449 10
2794.5918 10
109.7385 10
132.2418 10
47.7955 10
These results are interesting, and the actual slowdown (not just lack of speedup) is worrying for how I do some things. I think the overhead of shifting the data around is absolutely killing these parallel processes. And I definitely have code that does this sort of thing.
I have two more questions now (plus one for later):

- Does the foreach loop work as fast as a for if I use %do% instead of %dopar%? Or is the overhead still there?
  - And same with all the futures if I set plan(sequential)?
- If I don't send prebuilt data, but just some parameters and build the data internally, how do they compare?
  - Sometimes this flow makes sense, and sometimes it doesn't; e.g. if I'm simulating populations, the outer set of parallelisation just sends parameters. But the internal set often needs to work on things like population matrices. And so maybe that internal loop just shouldn't be parallel.
- How do other plan options affect these answers? I think this deserves its own page, and could get very complicated once I get into future.callr, future.batchtools, cluster, etc. And multicore might avoid the passing and use pointers, but I can't test that on Windows (there's a quick check of that just below).
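On that last point, a quick check I'm adding here: the parallelly package, which future builds on, can report whether forked processing is supported on the current platform, and future can report how many workers the registered plan implies.
# Forked (multicore) processing isn't supported on Windows; this reports
# whether the current platform supports it at all.
parallelly::supportsMulticore()
# How many workers the currently registered plan would use.
future::nbrOfWorkers()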
Un-parallelising
foreach %do%
Let's try shifting to %do% for the foreach.
mult_foreach_do <- function(a, b) {
  c_foreach <- foreach(i = 1:ncol(a), .combine = rbind) %do% {
    a[,i] %*% b
  }
  return(t(c_foreach))
}
list foreach %do%
mult_foreach_list_do <- function(a, b) {
  c_foreach <- foreach(i = 1:ncol(a)) %do% {
    a[,i] %*% b
  }
  # Do the binding in one step
  return(matrix(unlist(c_foreach), ncol = ncol(a)))
}
Test against the parallel foreach, the bare for, and the unallocated for (since that's kind of what the foreach is doing: building up an object). Since that's sometimes nice behaviour and leads to cleaner code than a for loop, I'd like to see how they compare.
microbenchmark(
futureforeach = mult_foreach(a100, b),
foreachdo = mult_foreach_do(a100, b),
foreachlistdo = mult_foreach_list_do(a100, b),
bare_for = mult_for(a100, b),
unallocate_for = mult_for_build(a100, b),
bare_linear = mult_linear(t(a100), b),
times = 10
)
Unit: milliseconds
expr min lq mean median uq max
futureforeach 2294.3315 2413.2883 2441.58145 2426.2623 2491.8492 2593.4213
foreachdo 108.1478 110.7516 117.48842 118.1481 122.2505 128.2518
foreachlistdo 110.9159 111.7610 118.85446 117.0267 119.7476 144.9399
bare_for 91.0620 96.5476 106.97241 108.5482 112.5208 131.5772
unallocate_for 116.1919 120.7025 124.04948 124.7919 127.6500 130.3755
bare_linear 39.1770 40.9914 42.25231 42.5430 43.0318 46.4730
neval
10
10
10
10
10
10
Interesting. Nearly identical to the unallocated for and quite a bit faster than the parallel version, which seems to hint that it's the data transfer that's killing things. It also backs up my assumption of what's going on under the hood in terms of constructing the object as in an unallocated for loop. There's no appreciable difference between using the foreach with a list and then combining, versus combining as we go with .combine.
plan(sequential)
How do the parallel versions work with plan(sequential)? Do they all get a speedup from avoiding data transfer? This is the same benchmark test as above, but now run sequentially.
plan(sequential)
microbenchmark(
futurefurrr = mult_furrr(a100_l, b),
futureapply = mult_apply(a100, b),
futureforeach = mult_foreach(a100, b),
bare_for = mult_for(a100, b),
unallocate_for = mult_for_build(a100, b),
bare_linear = mult_linear(t(a100), b),
times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
futurefurrr 116.6781 119.5224 124.35162 125.1431 127.5115 134.7231 10
futureapply 105.6057 110.9022 114.05056 113.2071 119.2455 120.0307 10
futureforeach 115.8364 123.0657 127.43182 125.7630 128.5088 141.5289 10
bare_for 92.1208 103.9310 108.69223 110.5858 113.8986 118.0519 10
unallocate_for 118.7530 123.6296 127.38882 125.9754 130.5657 140.5084 10
bare_linear 40.6936 42.4847 43.83156 44.3669 44.9938 45.7687 10
Interesting. foreach and furrr both sped up about as expected (furrr just seems a bit slower in general), but future.apply had much less of a speedup. It must not fall back to a simpler function, and still tries to use the parallel data shuffling? It is still much faster than with plan(multisession) (was 590 microseconds), so something is happening, but it's not getting down to the speeds of the other futures. And the simple for is still fastest (other than just using linear algebra, obviously).
The message so far
Test the parallel implementation at different points in the code: if there's no way to avoid data passing, a simple for (or another sequential function like apply) could be fastest.
Create the data in the function
Sometimes it does work (or makes more sense) to create the data inside each parallel process. This is certainly the case where we're looping over parameter sets for simulations, which tend to be embarrassingly parallel and low-data, and which then set off lots of processing. It's also likely to be the case in another situation I encounter often: reading chunks of large raster data, where I can keep it as stars_proxy objects until inside a parallel loop, where I bring a chunk into memory. Something similar is likely true of things like netcdf reading.
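A hedged sketch of that pattern (the paths and the summary step here are hypothetical, just to show the shape): send lightweight identifiers to the workers and only pull the heavy data into memory inside the parallel function.
# Hypothetical example: only the file path crosses to each worker; the heavy
# read happens inside the parallel function.
paths <- list.files("rasters", pattern = "\\.tif$", full.names = TRUE)
chunk_summaries <- future_map(paths, \(p) {
  r <- stars::read_stars(p, proxy = FALSE) # bring this chunk into memory here
  mean(r[[1]], na.rm = TRUE)               # stand-in for the real processing
})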
To test this sort of situation, let's use the same functions as above, but now create the vectors and matrices internally, rather than pass them in. I'm going to remake the b matrix internally because the point is to check this use of parallelism (lots of processing triggered from small data), even though in practice, for this particular sort of calculation, we'd need to pass it in so it's common to all the a's. Here, it's just a stand-in for "do some work". Let's make the first argument actually do something other than just replicate, so have it set the mean of a.
I’m not going to bother with setting seeds because it’ll be a pain to get them to match across the different ways of doing this, so the answers will be different.
Do we need to adjust globals here, e.g. .options = furrr_options(globals = "y") for furrr? If we leave the defaults, do we copy in the whole parent environment, and so don't end-run anything? No; see the tests of this below. future only passes in the needed objects from the global environment.
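One way to check this (my addition; I believe getGlobalsAndPackages() is the relevant helper in future, though it's mostly meant for internal use) is to ask what a given expression would export.
# Ask future which globals (and packages) it would identify for an expression;
# the expectation is just a100 and b, not everything in the session.
gp <- future::getGlobalsAndPackages(quote(a100[, 1] %*% b))
names(gp$globals)
gp$packages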
Turn parallel processing back on.
plan(multisession)
foreach
Changing this a bit too so all the work is done inside (no t() on the return).
mult_foreach_internal <- function(n_reps = 100, size = 1000) {
  c_foreach <- foreach(i = 1:n_reps, .combine = cbind) %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  return(c_foreach)
}
furrr
Reframing this one too, because future_map needs to call the function that does all the generating. This construction gets contrived very quickly.
mult_furrr_internal <- function(n_reps = 100, size = 1000) {
  fn_to_call <- function(rep, size) {
    a <- rnorm(size, mean = rep)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  c_map <- future_map(1:n_reps, fn_to_call, size = size, .options = furrr_options(seed = TRUE))
  matrix(unlist(c_map), ncol = n_reps)
}
future.apply
Modifying to future_lapply because it doesn't make sense to apply over the margins of a matrix that gets created. Again, this is getting contrived.
mult_apply_internal <- function(n_reps = 100, size = 1000) {
  fn_to_call <- function(rep, size) {
    a <- rnorm(size, mean = rep)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  c_apply <- future_lapply(1:n_reps, FUN = fn_to_call, size, future.seed = TRUE)
  matrix(unlist(c_apply), ncol = n_reps)
}
For
Back to preallocating; we know it's better.
mult_for_internal <- function(n_reps = 100, size = 1000) {
  c_for <- matrix(nrow = size, ncol = n_reps)
  for(i in 1:n_reps) {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    c_for[,i] <- a %*% b
  }
  return(c_for)
}
future for
We can write a usual for loop if we use futures directly. The futures themselves have to go in a list, because they are futures, not values, and so can't go straight into a matrix. That list can be preallocated.
There are two ways to do this, explicit and implicit; see the future docs.
Explicit futures
mult_for_future_internal_e <- function(n_reps = 100, size = 1000) {
  c_for <- vector(mode = 'list', length = n_reps)
  for(i in 1:n_reps) {
    c_for[[i]] <- future({
      a <- rnorm(size, mean = i)
      b <- matrix(rnorm(size * size), nrow = size)
      a %*% b
    }, seed = TRUE)
  }
  # get values and make a matrix
  v_for <- lapply(c_for, FUN = value)
  return(matrix(unlist(v_for), ncol = n_reps))
}
Implicit futures
using listenv
That’s a funny way to set the seed. Good to know.
mult_for_future_internal_i <- function(n_reps = 100, size = 1000) {
  c_for <- listenv()
  for(i in 1:n_reps) {
    c_for[[i]] %<-% {
      a <- rnorm(size, mean = i)
      b <- matrix(rnorm(size * size), nrow = size)
      a %*% b
    } %seed% TRUE
  }
  # get values and make a matrix
  v_for <- as.list(c_for)
  return(matrix(unlist(v_for), ncol = n_reps))
}
There’s no reason to have a linear algebra version here, since we’re by definition not operating on existing matrices.
Try that without adjusting what globals are passed to the futures
microbenchmark(
futurefurrr = mult_furrr_internal(n_reps = 100, size = 1000),
futureapply = mult_apply_internal(n_reps = 100, size = 1000),
futureforeach = mult_foreach_internal(n_reps = 100, size = 1000),
futurefor_e = mult_for_future_internal_e(n_reps = 100, size = 1000),
futurefor_i = mult_for_future_internal_i(n_reps = 100, size = 1000),
bare_for = mult_for_internal(n_reps = 100, size = 1000),
times = 10
)
Unit: seconds
expr min lq mean median uq max
futurefurrr 2.911519 2.937818 3.446520 3.176596 3.296120 6.415807
futureapply 2.824095 2.847494 3.067614 3.021533 3.228708 3.546101
futureforeach 2.829144 2.940019 3.237918 3.160864 3.245488 4.222348
futurefor_e 12.650138 13.192132 13.856649 13.579496 14.916949 15.034860
futurefor_i 13.213725 13.252212 13.510405 13.378892 13.803717 14.127473
bare_for 3.626247 3.706880 3.980060 4.089718 4.211040 4.323426
neval
10
10
10
10
10
10
Now the parallelisation is helping quite a bit, even for the for loops with futures, though they're still much slower than the other future methods. However, this is all much slower than creating the matrices up front, as we actually should do for this particular calculation, and is even slower than passing those matrices in. So, if the operations need to happen this way anyway (parallelising lots of work over parameters), this makes lots of sense and speeds things up. If we CAN pre-generate the matrices, linear algebra is fastest (unsurprisingly), and the loops are next, even if we have to eat the pass-in cost, though in that case sequential is probably better.
I think the slower bare futures are likely because there's no chunking being done, and so data is copied to each iteration, rather than to chunks of iterations (chunking is built into doFuture and, I think, furrr and future.apply). I could try to manually chunk to test that, but I think I'm just going to skip it (a rough sketch of the idea follows).
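For the record, here's roughly what manual chunking could look like with explicit futures (my addition, not benchmarked): one future per chunk of iterations instead of one per iteration, which is approximately what doFuture, furrr, and future.apply arrange for us.
mult_for_future_chunked <- function(n_reps = 100, size = 1000, n_chunks = 4) {
  # split the iterations into n_chunks roughly equal groups
  chunks <- split(seq_len(n_reps), cut(seq_len(n_reps), n_chunks, labels = FALSE))
  f_list <- vector(mode = 'list', length = length(chunks))
  for (ch in seq_along(chunks)) {
    idx <- chunks[[ch]]
    # one future per chunk; each future loops over its own iterations
    f_list[[ch]] <- future({
      sapply(idx, function(i) {
        a <- rnorm(size, mean = i)
        b <- matrix(rnorm(size * size), nrow = size)
        a %*% b
      })
    }, seed = TRUE)
  }
  v_for <- lapply(f_list, FUN = value)
  matrix(unlist(v_for), ncol = n_reps)
}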
Globals
One of the nice things about future compared to some other parallel backends is that it passes the global environment, so I don't have to manage what each worker gets; it works as it does interactively. BUT, if data passing is what's killing the speed, maybe I do need to manage what gets passed. In theory, the test above should get even faster if we don't pass the global environment, but only the arguments.
So, a new version of the above, explicitly limiting passing.
foreach
mult_foreach_internal_g <- function(n_reps = 100, size = 1000) {
  c_foreach <- foreach(i = 1:n_reps,
                       .combine = cbind,
                       .export = NULL) %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  return(c_foreach)
}
furrr
mult_furrr_internal_g <- function(n_reps = 100, size = 1000) {
  fn_to_call <- function(rep, size) {
    a <- rnorm(size, mean = rep)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  c_map <- future_map(1:n_reps, fn_to_call, size = size,
                      .options = furrr_options(seed = TRUE, globals = NULL))
  matrix(unlist(c_map), ncol = n_reps)
}
future.apply
mult_apply_internal_g <- function(n_reps = 100, size = 1000) {
  fn_to_call <- function(rep, size) {
    a <- rnorm(size, mean = rep)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  c_apply <- future_lapply(1:n_reps, FUN = fn_to_call, size,
                           future.seed = TRUE, future.globals = FALSE)
  matrix(unlist(c_apply), ncol = n_reps)
}
Benchmark
microbenchmark(
futurefurrr = mult_furrr_internal(n_reps = 100, size = 1000),
futureapply = mult_apply_internal(n_reps = 100, size = 1000),
futureforeach = mult_foreach_internal(n_reps = 100, size = 1000),
futurefurrr_g = mult_furrr_internal_g(n_reps = 100, size = 1000),
futureapply_g = mult_apply_internal_g(n_reps = 100, size = 1000),
futureforeach_g = mult_foreach_internal_g(n_reps = 100, size = 1000),
bare_for = mult_for_internal(n_reps = 100, size = 1000),
times = 10
)
Unit: seconds
expr min lq mean median uq max neval
futurefurrr 2.934941 2.991153 3.073430 3.036014 3.115680 3.383588 10
futureapply 2.961602 2.986763 3.314419 3.268155 3.529925 3.972054 10
futureforeach 2.981263 2.996048 3.114741 3.084971 3.208947 3.359805 10
futurefurrr_g 2.852686 2.938126 3.049807 3.008130 3.150936 3.425673 10
futureapply_g 2.933841 3.054000 3.218608 3.119524 3.173425 3.860498 10
futureforeach_g 2.941432 3.039213 3.142060 3.138042 3.216203 3.484246 10
bare_for 3.615668 3.733735 3.783454 3.758691 3.801958 4.025737 10
So, the default versions evidently weren't hauling around a ton of extra stuff. I think, based on the doFuture vignette, that future will send the globals, but only those that are needed. So it's not that they're in functions, it's that future can tell it doesn't need to pass extra variables. This means we don't really get performance gains here, because future already had our back.
Conclusions
- If we can just write good linear algebra, that's best. No surprise there. But sometimes we can't.
- If we have large amounts of data, it's likely fastest to use a preallocated for loop, or apply, or purrr::map, or some other sequential operation.
  - Even if we have to parallelise and take a performance hit (why? who knows? maybe because it's buried in a function that needs the ability to be parallel if it's the outer function?), passing in existing data is still faster than generating the data inside (over what range of data sizes?). But sometimes we need to generate the data inside anyway. In which case:
- If we have a bunch of work to do, based on small amounts of data (e.g. parameters, iterators), then the parallel futures are best, without much difference between the flavours.
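Related to the "maybe the inner loop just shouldn't be parallel" point, future does let you declare that topology up front with a nested plan; a sketch (not something I've tested here):
# Outer level of parallelism gets multisession workers; anything that tries to
# parallelise inside those workers falls back to running sequentially.
plan(list(tweak(multisession, workers = 4), sequential))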