Foreach globals and speed

Author

Galen Holt

I previously tested the impact of unused globals on speed, but only briefly. Here, I’ll be more systematic, because it gets tricky fast if we need to be super careful about what objects exist in the global environment.

There are a couple of things to check here: does just having a big object in the global environment slow parallel processing down, and does it only get passed to the workers (and slow things down) if it's asked for?

I’ll tackle these by:

  1. Running speed tests before I initialise any globals

    1. Bare processing

    2. Inside a function

  2. Creating a big global and comparing two identical processing steps that either ignore it or reference it without doing any processing on it.

    1. Bare

    2. Inside a function

library(doFuture)
library(future.apply)
library(furrr)
library(doRNG)
library(microbenchmark)
registerDoFuture()
plan(multisession)
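A quick check worth making at this point (my addition, not part of the original setup): how many workers did plan(multisession) actually give us? That’s the parallelism behind all the timings below.

# Number of parallel workers under the current plan (from the future package)
nbrOfWorkers()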

Nothing exists

Well, almost nothing. I’m going to set a couple of scalars and define a function for furrr and future.apply. I’m not using any of the globals or export arguments in the functions.

Bare

n_reps <- 100
size <- 1000

fn_to_call <- function(rep, size) {
  a <- rnorm(size, mean = rep)
  b <- matrix(rnorm(size * size), nrow = size)
  t(a %*% b)
}
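As a quick sanity check of that worker function (my addition), one serial call should return a size-by-1 matrix:

# A single serial call; the result should be a 1000 x 1 matrix
dim(fn_to_call(1, size))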

Benchmark

microbenchmark(
  dofut0 = {foreach(i = 1:n_reps, 
                       .combine = cbind) %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
                       }},
  furr0 = {future_map(1:n_reps, fn_to_call, size = size, 
                      .options = furrr_options(seed = TRUE))},
  
  fuapply0 = {future_lapply(1:n_reps, FUN = fn_to_call, size, 
                           future.seed = TRUE)},
  times = 10
)
Unit: seconds
     expr      min       lq     mean   median       uq      max neval
   dofut0 3.082435 3.167459 3.289073 3.304749 3.409156 3.455932    10
    furr0 2.996915 3.189255 3.787848 3.425178 3.551950 7.862844    10
 fuapply0 3.127753 3.205452 3.283468 3.256335 3.342933 3.515369    10

So, doFuture and furrr come out a bit slower than future.apply here (the furrr mean is inflated by one slow run), but not by a ton. The key thing is that this sets the baseline, so we can see whether things slow down once we have big objects in memory.

Inside a function

These functions are from my parallel speed testing, though they have different names here. I’ve added arguments to change the way they handle globals so I don’t have to write new functions for that comparison later, with the defaults set to each function’s own defaults.

foreach

foreach_fun <- function(n_reps = 100, size = 1000, .export = NULL, .noexport = NULL) {
  c_foreach <- foreach(i = 1:n_reps, 
                       .combine = cbind,
                       .export = .export,
                       .noexport = .noexport) %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  return(c_foreach)
}

furrr

furrr_fun <- function(n_reps = 100, size = 1000, globals = TRUE) {
  fn_to_call <- function(rep, size) {
    a <- rnorm(size, mean = rep)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  
  c_map <- future_map(1:n_reps, fn_to_call, size = size, 
                      .options = furrr_options(seed = TRUE, 
                                               globals = globals))
  matrix(unlist(c_map), ncol = n_reps)
}

future.apply

fuapply_fun <- function(n_reps = 100, size = 1000, future.globals = TRUE) {
  fn_to_call <- function(rep, size) {
    a <- rnorm(size, mean = rep)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
  }
  
  c_apply <- future_lapply(1:n_reps, FUN = fn_to_call, size, 
                           future.seed = TRUE,
                           future.globals = future.globals)
  
  matrix(unlist(c_apply), ncol = n_reps)
}
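Before benchmarking, a quick sanity check (my addition) that the three wrappers return the same shape of output on a small problem:

# Each of these should be a 50 x 5 matrix
dim(foreach_fun(n_reps = 5, size = 50))
dim(furrr_fun(n_reps = 5, size = 50))
dim(fuapply_fun(n_reps = 5, size = 50))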

Benchmark

microbenchmark(
  dofut_fun = foreach_fun(n_reps = 100, size = 1000),
  fur_fun = furrr_fun(n_reps = 100, size = 1000),
  app_fun = fuapply_fun(n_reps = 100, size = 1000),
  times = 10
)
Unit: seconds
      expr      min       lq     mean   median       uq      max neval
 dofut_fun 2.894709 2.943020 3.032240 3.025625 3.048341 3.243421    10
   fur_fun 2.907043 2.959023 3.009576 2.999129 3.035445 3.159187    10
   app_fun 2.874484 3.003329 3.057768 3.066695 3.118491 3.288474    10

This sets the other baseline before we have big objects in memory, so we can see whether things respond differently when called from inside a function’s environment vs directly in the global environment. Here, all three approaches are basically equivalent.

With big global

The default future.globals.maxSize is 500 MB. Should I increase that, or just try to get close to it? I think I’ll aim for just under the limit.

# This is 1.6GB
# big_obj <- matrix(rnorm(20000*10000), nrow = 10000)
# 496 MB
big_obj <- matrix(rnorm(10000*6200), nrow = 10000)
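For reference (my addition): the option controlling that limit is future.globals.maxSize (note the capital S), set in bytes. A quick check that big_obj really comes in under the default, and a sketch of how I’d raise the cap if I wanted to:

# Confirm big_obj is under the default limit of 500 * 1024^2 bytes
object.size(big_obj)

# If I did want to raise the cap instead, e.g. to roughly 1 GB:
# options(future.globals.maxSize = 1000 * 1024^2)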

Now, the same tests as before, plus some that reference the big object but don’t use it.

The comparisons to make here are:

  • Matched to above: does just having the object exist slow things down, even if it’s not called?

  • Referenced and not: does it only get passed in (and slow things down) if it’s asked for?

    • Not exactly sure how I’ll check that. Maybe instead of referencing it in the function (which is hard to do without using it, especially with furrr and future.apply), I’ll explicitly send it in with their globals arguments.

Bare

Benchmark

I’m going to run this for the default (no globals argument), for explicitly excluding the big object, and for explicitly sending it in.

microbenchmark(
  # default: same as above, but now big_obj exists; it is not used in the actual processing
  dofut0 = {foreach(i = 1:n_reps, 
                       .combine = cbind) %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
                       }},
  furr0 = {future_map(1:n_reps, fn_to_call, size = size, 
                      .options = furrr_options(seed = TRUE))},
  
  fuapply0 = {future_lapply(1:n_reps, FUN = fn_to_call, size, 
                           future.seed = TRUE)},
  
  # Explicitly telling it not to send the big global (I can't sort out getting .export to do this, so I use .noexport)
  dofut_no_g = {foreach(i = 1:n_reps, 
                       .combine = cbind,
                       .noexport = "big_obj") %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
                       }},
  
  furr_no_g = {future_map(1:n_reps, fn_to_call, size = size, 
                      .options = furrr_options(seed = TRUE, 
                                               globals = FALSE))},
  
  fuapply_no_g = {future_lapply(1:n_reps, FUN = fn_to_call, size, 
                           future.seed = TRUE,
                           future.globals = FALSE)},
  
  # Explicitly telling it to send the unused global
  dofut_g = {foreach(i = 1:n_reps, 
                       .combine = cbind,
                       .export = 'big_obj') %dorng% {
    a <- rnorm(size, mean = i)
    b <- matrix(rnorm(size * size), nrow = size)
    t(a %*% b)
                       }},
  
  furr_g = {future_map(1:n_reps, fn_to_call, size = size, 
                      .options = furrr_options(seed = TRUE, 
                                               globals = 'big_obj'))},
  
  fuapply_g = {future_lapply(1:n_reps, FUN = fn_to_call, size, 
                           future.seed = TRUE,
                           future.globals = 'big_obj')},
  
  times = 10
)
Unit: seconds
         expr       min        lq      mean    median        uq       max neval
       dofut0  2.922838  3.000344  3.249182  3.182809  3.262618  4.264726    10
        furr0  2.949936  2.981717  3.102662  3.052143  3.217054  3.398791    10
     fuapply0  2.992581  3.023170  3.220997  3.148239  3.436145  3.624800    10
   dofut_no_g  2.945903  3.066295  3.140082  3.118970  3.276743  3.335264    10
    furr_no_g  2.874103  3.076374  3.204159  3.253832  3.306481  3.568914    10
 fuapply_no_g  2.912136  3.057193  3.108612  3.094443  3.177574  3.341436    10
      dofut_g 13.140054 13.625128 14.188250 14.119873 14.878983 15.087787    10
       furr_g 13.786607 14.154656 14.397246 14.394677 14.609236 15.120794    10
    fuapply_g 13.455395 13.675143 14.085484 13.918776 14.251080 15.601045    10

Now there’s a big object sitting in the global environment, but it does not slow down the default run relative to the enforced-non-pass version or to the version from before it existed (above). There is a major slowdown when the object is explicitly passed.

Unused globals therefore are NOT passed by default, even when code is running straight in the global environment.
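As a more direct check (my addition; a sketch using a single future under the same multisession plan): the expression below only mentions big_obj inside a string, so the automatic globals detection shouldn’t pick it up, and the worker shouldn’t receive a copy.

# Ask a worker whether big_obj was shipped to it.
# Expecting FALSE: big_obj is never referenced as a symbol, so it isn't exported.
f_unused <- future(exists("big_obj"))
value(f_unused)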

Inside functions

The functions have an option to change the way globals are handled.

Benchmark

microbenchmark(
  # default
  dofut_default = foreach_fun(n_reps = 100, size = 1000),
  fur_default = furrr_fun(n_reps = 100, size = 1000),
  app_default = fuapply_fun(n_reps = 100, size = 1000),
  
  # No globals
  dofut_no_g = foreach_fun(n_reps = 100, size = 1000, .noexport = 'big_obj'),
  fur_no_g = furrr_fun(n_reps = 100, size = 1000,
                          globals = FALSE),
  app_no_g = fuapply_fun(n_reps = 100, size = 1000,
                            future.globals = FALSE),
  
  # Explicit globals
  dofut_g = foreach_fun(n_reps = 100, size = 1000,
                              .export = 'big_obj'),
  fur_g = furrr_fun(n_reps = 100, size = 1000, 
                          globals = 'big_obj'),
  app_g = fuapply_fun(n_reps = 100, size = 1000,
                            future.globals = 'big_obj'),
  
  
  times = 10
)
Unit: seconds
          expr       min        lq      mean    median        uq       max neval
 dofut_default  2.820070  3.133994  3.418282  3.494786  3.643098  4.122530    10
   fur_default  2.955963  3.170553  3.401131  3.247953  3.674548  4.264421    10
   app_default  2.777021  2.957015  3.264931  3.286259  3.545510  3.748255    10
    dofut_no_g  2.741006  3.076965  3.271767  3.285788  3.454271  3.721240    10
      fur_no_g  2.961480  3.051561  3.413412  3.445728  3.553680  4.109865    10
      app_no_g  2.835050  3.020232  3.264410  3.235451  3.499824  3.772277    10
       dofut_g 13.666079 14.104627 15.448594 15.349449 15.978492 18.617738    10
         fur_g 13.761059 15.330178 15.521168 15.707339 16.317537 16.369009    10
         app_g 12.988882 14.554517 15.117242 15.415819 15.672232 16.115645    10

Using functions yields the same result as before: the big object sitting in the global environment does not get passed in and slow things down if it isn’t actually used in the functions (or explicitly sent in).

Unused globals therefore are NOT passed by default into parallelised functions.
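And the flip side, for completeness (my addition; another single-future sketch): if the parallel code actually references big_obj as a symbol, the automatic globals detection finds it and ships it to the worker, and that transfer is what makes the explicit-export runs above so much slower.

# big_obj is referenced here, so it is auto-exported to the worker (it fits under the 500 MB default)
f_used <- future(dim(big_obj))
value(f_used)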