Errors with map

Sometimes when we use purrr::map or similar functions, one of the iterations hits an error. When this happens, we lose the whole set of runs, even if the others would run or have already run. That can be a waste of time, make it hard to find the issue, and prevent re-running just the failed bits if the error is intermittent (e.g. many HTTP errors).

We’ll use the test function from error handling, which errors on even numbers, warns if the number is 5, and otherwise returns the input.

err_even_warn5 <- function(x) {
  if ((x %% 2) == 0) {
    stop('Even numbers are error')
  } else if (x == 5) {
    warning('5 throws a warning')
  } else {x}
}

The issue here is that if we try to run that, it errors at the first even number (index 2), and we don’t get any of the results, even for the inputs that would have run cleanly.

purrr::map(1:10, err_even_warn5)
Error in `purrr::map()`:
ℹ In index: 2.
Caused by error in `.f()`:
! Even numbers are error

But what if we want to get all the other results, and possibly identify the failures and correct them or retry?

One option is to use purrr::safely, as it returns a list with a result and error item. This means that purrring over things where some may fail doesn’t kill everything, but we need to unpack it a bit.

The syntax is typical map, but with the function to apply wrapped in the ‘adverb’ safely.

errpurr <- purrr::map(1:10, 
                      purrr::safely(err_even_warn5))
Warning in .f(...): 5 throws a warning
errpurr
[[1]]
[[1]]$result
[1] 1

[[1]]$error
NULL


[[2]]
[[2]]$result
NULL

[[2]]$error
<simpleError in .f(...): Even numbers are error>


[[3]]
[[3]]$result
[1] 3

[[3]]$error
NULL


[[4]]
[[4]]$result
NULL

[[4]]$error
<simpleError in .f(...): Even numbers are error>


[[5]]
[[5]]$result
[1] "5 throws a warning"

[[5]]$error
NULL


[[6]]
[[6]]$result
NULL

[[6]]$error
<simpleError in .f(...): Even numbers are error>


[[7]]
[[7]]$result
[1] 7

[[7]]$error
NULL


[[8]]
[[8]]$result
NULL

[[8]]$error
<simpleError in .f(...): Even numbers are error>


[[9]]
[[9]]$result
[1] 9

[[9]]$error
NULL


[[10]]
[[10]]$result
NULL

[[10]]$error
<simpleError in .f(...): Even numbers are error>

Note that safely only deals with errors. The ‘warning’ at index 5 just passes through, and because warning() returns its message invisibly, that message string is what ends up in the result. We could use quietly instead if we want to capture everything except errors, which still cause quietly to fail. We can also do things like look for the values with or without errors.

whicherrors <- purrr::map(errpurr, 
                          \(x) !is.null(x$error)) |> 
  unlist() |> 
  which()

whicherrors 
[1]  2  4  6  8 10
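
Coming back to quietly: a minimal sketch of combining the two adverbs, assuming we want the warnings captured as well as the errors. Wrapping quietly inside safely should let quietly collect the warnings while safely catches the errors quietly lets through. The names safe_quiet, errwarn, and whichwarnings are made up for this example.

```r
# Same test function as above, repeated so this runs on its own
err_even_warn5 <- function(x) {
  if ((x %% 2) == 0) {
    stop('Even numbers are error')
  } else if (x == 5) {
    warning('5 throws a warning')
  } else {x}
}

# quietly() captures warnings, messages, and printed output;
# safely() around it catches the errors
safe_quiet <- purrr::safely(purrr::quietly(err_even_warn5))
errwarn <- purrr::map(1:10, safe_quiet)

# Which iterations raised a warning?
# For errors, x$result is NULL, so the length is 0 there too
whichwarnings <- purrr::map_lgl(errwarn,
                                \(x) length(x$result$warnings) > 0) |>
  which()
whichwarnings

# The warning text itself sits one level down, next to the result
errwarn[[5]]$result$warnings
```

The cost is another layer of nesting: each successful element is now a list with result, output, warnings, and messages inside the safely result.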

Those without errors (or with a non-null result) can be used to extract the clean outputs. Note that this includes the warning, and that plucking result from every element leaves a NULL wherever there was an error.

noterrors <- purrr::map(errpurr,
                        \(x) purrr::pluck(x, 'result'))

noterrors
[[1]]
[1] 1

[[2]]
NULL

[[3]]
[1] 3

[[4]]
NULL

[[5]]
[1] "5 throws a warning"

[[6]]
NULL

[[7]]
[1] 7

[[8]]
NULL

[[9]]
[1] 9

[[10]]
NULL
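
If we only want the iterations that succeeded, one option is to filter on the error slot before plucking. This is a sketch rather than the only way to do it; ok and cleanres are placeholder names, and I’m naming the outputs by their original index so we can still tell where they came from.

```r
# Same test function as above, repeated so this runs on its own
err_even_warn5 <- function(x) {
  if ((x %% 2) == 0) {
    stop('Even numbers are error')
  } else if (x == 5) {
    warning('5 throws a warning')
  } else {x}
}

# suppressWarnings just keeps the console quiet for this example
errpurr <- suppressWarnings(
  purrr::map(1:10, purrr::safely(err_even_warn5))
)

# TRUE where the iteration finished without an error
ok <- purrr::map_lgl(errpurr, \(x) is.null(x$error))

# Results for the successful iterations only, named by original index
cleanres <- purrr::map(errpurr[ok], 'result') |>
  stats::setNames(which(ok))

names(cleanres)
```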

Another option is to use list_transpose and then get the result and error lists. Two plucks is likely better, especially if we usually only need one.

terr <- purrr::list_transpose(errpurr)

terr$result
[[1]]
[1] 1

[[2]]
NULL

[[3]]
[1] 3

[[4]]
NULL

[[5]]
[1] "5 throws a warning"

[[6]]
NULL

[[7]]
[1] 7

[[8]]
NULL

[[9]]
[1] 9

[[10]]
NULL
terr$error
[[1]]
NULL

[[2]]
<simpleError in .f(...): Even numbers are error>

[[3]]
NULL

[[4]]
<simpleError in .f(...): Even numbers are error>

[[5]]
NULL

[[6]]
<simpleError in .f(...): Even numbers are error>

[[7]]
NULL

[[8]]
<simpleError in .f(...): Even numbers are error>

[[9]]
NULL

[[10]]
<simpleError in .f(...): Even numbers are error>

The use of safely above is really handy if we want to read the errors. If not, and we just want to save the non-errors, purrr::possibly with a default value is likely better (cleaner).

errpurrP <- purrr::map(1:10, 
                       purrr::possibly(err_even_warn5,
                                       NA))
Warning in .f(...): 5 throws a warning
errpurrP
[[1]]
[1] 1

[[2]]
[1] NA

[[3]]
[1] 3

[[4]]
[1] NA

[[5]]
[1] "5 throws a warning"

[[6]]
[1] NA

[[7]]
[1] 7

[[8]]
[1] NA

[[9]]
[1] 9

[[10]]
[1] NA

Note that in use we’d likely still want to have a cleanup step/function to chuck out the warnings before concatenating the rest.
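
A minimal sketch of what that cleanup could look like, assuming the valid results are always numeric (that assumption is mine, and it’s what lets us tell real results from the NA default and the warning string). cleanP is a made-up name.

```r
# Same test function as above, repeated so this runs on its own
err_even_warn5 <- function(x) {
  if ((x %% 2) == 0) {
    stop('Even numbers are error')
  } else if (x == 5) {
    warning('5 throws a warning')
  } else {x}
}

errpurrP <- suppressWarnings(
  purrr::map(1:10, purrr::possibly(err_even_warn5, NA))
)

# Keep only numeric, non-missing values. This drops the NA defaults
# (which are logical) and the warning message at index 5 (character)
cleanP <- purrr::keep(errpurrP, \(x) is.numeric(x) && !is.na(x)) |>
  unlist()

cleanP
```

For functions that return something other than numbers, the test inside keep would need to change to match.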

There’s also a question of what happens if an iteration has a warning and a result. For example

err_even_warn510 <- function(x) {
  if ((x %% 2) == 0) {
    stop('Even numbers are error')
  } else if (x == 5) {
    warning('5 doubles')
    x <- 10
  } else {x}
  return(x)
}

For both safely and possibly, a real result plus warning ends up with the real result in the output and the warning bubbling up. So that’s good- warnings don’t change the structure of the data if there is data.

ep5 <- purrr::map(1:10, 
                      purrr::safely(err_even_warn510))
Warning in .f(...): 5 doubles
ep5
[[1]]
[[1]]$result
[1] 1

[[1]]$error
NULL


[[2]]
[[2]]$result
NULL

[[2]]$error
<simpleError in .f(...): Even numbers are error>


[[3]]
[[3]]$result
[1] 3

[[3]]$error
NULL


[[4]]
[[4]]$result
NULL

[[4]]$error
<simpleError in .f(...): Even numbers are error>


[[5]]
[[5]]$result
[1] 10

[[5]]$error
NULL


[[6]]
[[6]]$result
NULL

[[6]]$error
<simpleError in .f(...): Even numbers are error>


[[7]]
[[7]]$result
[1] 7

[[7]]$error
NULL


[[8]]
[[8]]$result
NULL

[[8]]$error
<simpleError in .f(...): Even numbers are error>


[[9]]
[[9]]$result
[1] 9

[[9]]$error
NULL


[[10]]
[[10]]$result
NULL

[[10]]$error
<simpleError in .f(...): Even numbers are error>
epP <- purrr::map(1:10, 
                       purrr::possibly(err_even_warn510,
                                       NA))
Warning in .f(...): 5 doubles
epP
[[1]]
[1] 1

[[2]]
[1] NA

[[3]]
[1] 3

[[4]]
[1] NA

[[5]]
[1] 10

[[6]]
[1] NA

[[7]]
[1] 7

[[8]]
[1] NA

[[9]]
[1] 9

[[10]]
[1] NA

Programmatic use

I have several cases where I want to run map in packages or large analyses, and assess what’s happening to the fails and possibly re-run. That needs a few wrappers or standard sequences of steps around what I have above. The standard steps might be better, then we don’t have to deal with function passing, which is a hassle.

Let’s set up a function that will fail about half the time, but re-runs might work.

failhalf <- function(x) {
  if (runif(1) <= 0.5) {
    x <- x+5
  } else {
    stop("random above 0.5")
  }
  return(x)
}

I’ll work with safely- I think that’s more general than possibly, and gives a developer the ability to go in and look at the errors in debug, even if they’re not returned. Let’s assume we have a variable to feed it, as we usually would in a function.

larg <- 1:10

# first run
x5 <- purrr::map(larg, purrr::safely(failhalf))
# get the results
# A string passed as .f tells map() to extract that element from each item
r5 <- purrr::map(x5, 'result')

# Get the errors- we might have this somewhere a dev could get it, but not always use it.
e5 <- purrr::map(x5, 'error')
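
The error list holds condition objects, which are awkward to scan by eye. As a sketch of how a dev might summarise them, we can drop the NULLs and pull out the messages with conditionMessage(). The names failed and errmsgs are placeholders, and the set.seed is only there to make this example repeatable.

```r
failhalf <- function(x) {
  if (runif(1) <= 0.5) {
    x <- x + 5
  } else {
    stop("random above 0.5")
  }
  return(x)
}

set.seed(1)
x5 <- purrr::map(1:10, purrr::safely(failhalf))
e5 <- purrr::map(x5, 'error')

# TRUE where there was an error (the error slot is not NULL)
failed <- !purrr::map_lgl(e5, is.null)

# Pull the message text out of each error condition
errmsgs <- purrr::map_chr(e5[failed], conditionMessage)

# One row per failure: where it was and what it said
data.frame(index = which(failed), message = errmsgs)
```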

If we don’t want to retry, that’s as far as we need to go. We could easily return a list just like what would be returned normally and another list of the errors. That’s as simple as

safepurr <- function(input, fun) {
  # first run
  x5 <- purrr::map(input, purrr::safely(fun))
  # get the results
  r5 <- purrr::map(x5, 'result')

  # Get the errors- we might have this somewhere a dev could get it, but not always use it.
  e5 <- purrr::map(x5, 'error')

  # name the pieces so they can be pulled out with $result and $error
  return(list(result = r5, error = e5))
}

Though I’m not sure what the point is. By the time we unpack that we might as well have just done it inline with pluck().

Retries

If we do want to retry, we need to re-run the failures. This actually makes sense to do in a single while loop, rather than running the above first and then looping. We can put a retries argument in easily enough.

# if we have e5, we could use it, but it's not any harder to get error indices directly
larg <- 1:10
whicherrors <- seq_along(larg)
lout <- vector(mode = 'list', length = 0)
while (length(whicherrors) > 0) {
  # cut the data to whatever failed last time (everything on the first pass)
  larg <- larg[whicherrors]
  x5 <- purrr::map(larg, purrr::safely(failhalf))
  # get the results; failures are NULL
  r5 <- purrr::map(x5, 'result')

  # where are the errors
  iserror <- purrr::map_lgl(x5, \(x) rlang::is_error(x$error))
  whicherrors <- which(iserror)

  # append the successes. Logical indexing matters here: r5[-whicherrors]
  # returns an empty list when whicherrors is empty, dropping the last batch
  lout <- c(lout, r5[!iserror])
}

That works fine if we don’t care about order, but if we do, we’ll need to make sure we know which list items are erroring and replace them. That will almost always be what we want to do, and isn’t any more complicated.

larg <- 1:10
whicherrors <- seq_along(larg)
lout <- vector(mode = 'list', length = length(larg))
# the indices, to track which are being filled/left
indlist <- seq_along(larg)
while (length(whicherrors) > 0) {
  larg <- larg[whicherrors]
  # run (or re-run) whatever is left
  x5 <- purrr::map(larg, purrr::safely(failhalf))
  # get the results; failures are NULL for now
  r5 <- purrr::map(x5, 'result')

  # replace the indices that were errors with new data. Some might still be errors, they will fill subsequently
  lout[indlist] <- r5

  # where are the errors
  whicherrors <- purrr::map(x5,
                            \(x) rlang::is_error(x$error)) |>
    unlist() |>
    which()
  # which ORIGINAL indices are we left with?
  indlist <- indlist[whicherrors]
}

Rather than a while, can we recurse? Yes, and it’s a bit cleaner. But it’s not tail-recursive, and a retry limit would need an extra counter argument passed down through the calls.

getsafe <- function(larg) {
  x5 <- purrr::map(larg, purrr::safely(failhalf))
  # get the results; failures are NULL for now
  r5 <- purrr::map(x5, 'result')

  whicherrors <- purrr::map(x5,
                            \(x) rlang::is_error(x$error)) |>
    unlist() |>
    which()

  # re-run just the failures and slot them back into place
  if (length(whicherrors) > 0) {
    eout <- getsafe(larg[whicherrors])
    r5[whicherrors] <- eout
  }

  return(r5)
}
getsafe(1:10)
[[1]]
[1] 6

[[2]]
[1] 7

[[3]]
[1] 8

[[4]]
[1] 9

[[5]]
[1] 10

[[6]]
[1] 11

[[7]]
[1] 12

[[8]]
[1] 13

[[9]]
[1] 14

[[10]]
[1] 15
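
A retry limit can be added to the recursive version by passing a counter down and decrementing it on each call. This is only a sketch: getsafe_retries and its fun and retries arguments are names I’ve made up here, and anything still failing once the retries run out is left as NULL.

```r
getsafe_retries <- function(larg, fun, retries = 5) {
  x5 <- purrr::map(larg, purrr::safely(fun))
  # failures are NULL for now
  r5 <- purrr::map(x5, 'result')

  whicherrors <- purrr::map_lgl(x5, \(x) !is.null(x$error)) |>
    which()

  # recurse on the failures only, and only while retries remain
  if (length(whicherrors) > 0 && retries > 0) {
    r5[whicherrors] <- getsafe_retries(larg[whicherrors], fun, retries - 1)
  }

  return(r5)
}

failhalf <- function(x) {
  if (runif(1) <= 0.5) {
    x <- x + 5
  } else {
    stop("random above 0.5")
  }
  return(x)
}

res <- getsafe_retries(1:10, failhalf, retries = 5)
```

With retries = 5 each input gets at most six attempts (the first run plus five retries), so a few NULLs can still come back.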

How bad is it to make a function that takes the input and the function and does the while loop?

safe_clean_retries <- function(input, fun, retries) {
  # seq_along() rather than 1:length(), so an empty input doesn't give c(1, 0)
  whicherrors <- seq_along(input)
  lout <- vector(mode = 'list', length = length(input))
  # the indices, to track which are being filled/left
  indlist <- seq_along(input)
  counter <- 0
  
  # counter starts at 0, so this allows the first run plus `retries` re-runs
  while (length(whicherrors) > 0 && counter <= retries) {
    # run the purrr
    x5 <- purrr::map(input, purrr::safely(fun))
    # get the results; failures are NULL for now
    r5 <- purrr::map(x5, 'result')
    
    # if we want the errors, we could put in a debug here
    e5 <- purrr::map(x5, 'error')
    
    # replace the indices that were errors with new data. Some might still be errors, they will fill subsequently
    lout[indlist] <- r5
    
    # where are the errors
    whicherrors <- purrr::map(x5,
                              \(x) rlang::is_error(x$error)) |>
      unlist() |>
      which()
    
    # Cut the data to the fails
    input <- input[whicherrors]

    # which ORIGINAL indices are we left with?
    indlist <- indlist[whicherrors]
    
    counter <- counter + 1
    
  }
  
  return(lout)
}

And that lets us use it

safe_clean_retries(1:10, failhalf, retries = 5) |> 
  unlist()
 [1]  6  7  8  9 10 11 12 13 14 15

It should also work if we pass it an anonymous or otherwise custom function.

safe_clean_retries(1:10,
                   \(x) ifelse(sample(c(1,2), 1) == 1,
                               stop(), x), 
                   retries = 10) |> 
  unlist()
 [1]  1  2  3  4  5  6  7  8  9 10

It works for furrr too, though in this case it’s slower (not surprising, for this test case the overhead will be much bigger than the computation).

safe_clean_retries_f <- function(input, fun, retries) {
  # seq_along() rather than 1:length(), so an empty input doesn't give c(1, 0)
  whicherrors <- seq_along(input)
  lout <- vector(mode = 'list', length = length(input))
  # the indices, to track which are being filled/left
  indlist <- seq_along(input)
  counter <- 0
  
  # counter starts at 0, so this allows the first run plus `retries` re-runs
  while (length(whicherrors) > 0 && counter <= retries) {
    # run the purrr
    # Only parallel this one. The others are just indexing
    x5 <- furrr::future_map(input, purrr::safely(fun),
                            .options = furrr::furrr_options(seed = TRUE))
    # get the results; failures are NULL for now
    r5 <- purrr::map(x5, 'result')
    
    # if we want the errors, we could put in a debug here
    e5 <- purrr::map(x5, 'error')
    
    # replace the indices that were errors with new data. Some might still be errors, they will fill subsequently
    lout[indlist] <- r5
    
    # where are the errors
    whicherrors <- purrr::map(x5,
                              \(x) rlang::is_error(x$error)) |>
      unlist() |>
      which()
    
    # Cut the data to the fails
    input <- input[whicherrors]

    # which ORIGINAL indices are we left with?
    indlist <- indlist[whicherrors]
    
    counter <- counter + 1
    
  }
  
  return(lout)
}
library(furrr)
Loading required package: future
plan(multisession)

safe_clean_retries_f(1:10,
                   \(x) ifelse(sample(c(1,2), 1) == 1,
                               stop(), x), 
                   retries = 10) |> 
  unlist()
 [1]  1  2  3  4  5  6  7  8  9 10

And finally, we can clean that up to use the same argument names as purrr and run either in parallel or not.

safe_map <- function(.x, .f, ..., retries = 10, parallel = FALSE) {
  # seq_along() rather than 1:length(), so an empty input doesn't give c(1, 0)
  whicherrors <- seq_along(.x)
  result_list <- vector(mode = 'list', length = length(.x))
  # the indices, to track which are being filled/left
  orig_indices <- seq_along(.x)
  counter <- 0
  
  # counter starts at 0, so this allows the first run plus `retries` re-runs
  while (length(whicherrors) > 0 && counter <= retries) {
    # run the purrr
    # Only parallel this one. The others are just indexing
    # Extra arguments in ... are passed on to .f in both branches
    if (parallel) {
      full_out <- furrr::future_map(.x, purrr::safely(.f), ...,
                                    .options = furrr::furrr_options(seed = TRUE))
    } else {
      full_out <- purrr::map(.x, purrr::safely(.f), ...)
    }
    # get the results; failures are NULL for now
    intermed_result <- purrr::map(full_out, 'result')
    
    # if we want the errors, we could put in a debug here
    err_list <- purrr::map(full_out, 'error')
    
    # replace the indices that were errors with new data. Some might still be errors, they will fill subsequently
    result_list[orig_indices] <- intermed_result
    
    # where are the errors
    whicherrors <- purrr::map(full_out,
                              \(x) rlang::is_error(x$error)) |>
      unlist() |>
      which()
    
    # Cut the data to the fails
    .x <- .x[whicherrors]

    # which ORIGINAL indices are we left with?
    orig_indices <- orig_indices[whicherrors]
    
    counter <- counter + 1
    
  }
  
  return(result_list)
}

Benchmarking

The speed question is interesting- how much does it slow things down to run in this wrapper? Should I put everything in it, or is the speed hit only worth it where there’s a high likelihood of failure and each iteration is big?

Let’s set something a bit bigger up and test. Just purrr; I’ll assume furrr scales similarly. I’m not going to have any errors- the point here is to ask how much the wrapper hurts when there aren’t errors, and whether that tradeoff is worth the ability to recover from them when there are.

inlist <- list(iris, mtcars, iris, mtcars, iris, mtcars)

testfun <- function(x) {
 x <- x |> 
   dplyr::mutate(across(where(is.numeric), mean)) |> 
   dplyr::summarise(across(where(is.numeric), sum))
 
 return(x)
}

The hit there isn’t too bad: in the run below the median goes from about 23.2 ms to 24.4 ms, roughly 5%. Seems like it’s probably usually worth it, especially for big computations. For big jobs, the consequences of errors will be worse in terms of lost time/results, and the additional overhead will be a smaller proportion of the time compared to the main purrr call.

microbenchmark::microbenchmark(
  barepurrr = purrr::map(inlist, testfun),
  safepurrr = safe_clean_retries(inlist, testfun, retries = 10),
  times = 100
)
Unit: milliseconds
      expr     min       lq     mean  median       uq      max neval
 barepurrr 19.8354 22.37945 25.35412 23.1743 25.61785 159.4209   100
 safepurrr 21.1953 22.90615 25.38399 24.4367 25.77285  88.9560   100

Function construction

The functions above all just use a function with a single unspecified argument. But things get trickier with anonymous functions or multiple arguments. The locations for the arguments aren’t always intuitive- they go after the closing parenthesis of possibly(fun) or safely(fun). The reason is that possibly and safely both create new functions.

For example, if we have a simple function, still just with one argument

add5 <- function(x) {
  x+5
}

Then the simple version works

purrr::map(1:5, purrr::safely(add5))
[[1]]
[[1]]$result
[1] 6

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 7

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 8

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 9

[[4]]$error
NULL


[[5]]
[[5]]$result
[1] 10

[[5]]$error
NULL

If we want to be more specific and make it anonymous, though, where does the x go? What is the safe equivalent of this?

purrr::map(1:5, \(x) add5(x))
[[1]]
[1] 6

[[2]]
[1] 7

[[3]]
[1] 8

[[4]]
[1] 9

[[5]]
[1] 10

This works. The anonymous function is wholly inside safely, and so the whole anonymous function gets transformed into a safe version.

purrr::map(1:5, purrr::safely(\(x) add5(x)))
[[1]]
[[1]]$result
[1] 6

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 7

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 8

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 9

[[4]]$error
NULL


[[5]]
[[5]]$result
[1] 10

[[5]]$error
NULL

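This does not. The safely can’t be inside the anonymous function like this: it is handed the already-evaluated value add5(x) rather than a function, and the anonymous function then returns the wrapper safely builds without ever calling it.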

purrr::map(1:5, \(x) purrr::safely(add5(x)))
[[1]]
function (...) 
capture_error(.f(...), otherwise, quiet)
<bytecode: 0x000001c0b17027d8>
<environment: 0x000001c0b56976d0>

[[2]]
function (...) 
capture_error(.f(...), otherwise, quiet)
<bytecode: 0x000001c0b17027d8>
<environment: 0x000001c0b579a518>

[[3]]
function (...) 
capture_error(.f(...), otherwise, quiet)
<bytecode: 0x000001c0b17027d8>
<environment: 0x000001c0b5797040>

[[4]]
function (...) 
capture_error(.f(...), otherwise, quiet)
<bytecode: 0x000001c0b17027d8>
<environment: 0x000001c0b57a35b8>

[[5]]
function (...) 
capture_error(.f(...), otherwise, quiet)
<bytecode: 0x000001c0b17027d8>
<environment: 0x000001c0b57a41d0>

But this does- safely(fun) is a function, and so we can give it the argument.

purrr::map(1:5, \(x) purrr::safely(add5)(x))
[[1]]
[[1]]$result
[1] 6

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 7

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 8

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 9

[[4]]$error
NULL


[[5]]
[[5]]$result
[1] 10

[[5]]$error
NULL

This can be useful with multiple arguments, e.g.

adder <- function(x,y) {
  x + y
}

Again, as anonymous, wholly inside works

purrr::map(1:5, purrr::safely(\(x) adder(x, 10)))
[[1]]
[[1]]$result
[1] 11

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 12

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 13

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 14

[[4]]$error
NULL


[[5]]
[[5]]$result
[1] 15

[[5]]$error
NULL

It does not work if it’s not anonymous, i.e. just giving safely the second argument. While this syntax works normally,

purrr::map(1:5, adder, 10)
[[1]]
[1] 11

[[2]]
[1] 12

[[3]]
[1] 13

[[4]]
[1] 14

[[5]]
[1] 15

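Similar does not work with safely. In the first attempt the 10 is matched to safely’s otherwise argument, so it shows up as the result while y is still missing. In the second, adder(10) is evaluated (and fails) before safely ever sees it.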

purrr::map(1:5, purrr::safely(adder, 10))
[[1]]
[[1]]$result
[1] 10

[[1]]$error
<simpleError in .f(...): argument "y" is missing, with no default>


[[2]]
[[2]]$result
[1] 10

[[2]]$error
<simpleError in .f(...): argument "y" is missing, with no default>


[[3]]
[[3]]$result
[1] 10

[[3]]$error
<simpleError in .f(...): argument "y" is missing, with no default>


[[4]]
[[4]]$result
[1] 10

[[4]]$error
<simpleError in .f(...): argument "y" is missing, with no default>


[[5]]
[[5]]$result
[1] 10

[[5]]$error
<simpleError in .f(...): argument "y" is missing, with no default>
purrr::map(1:5, purrr::safely(adder(10)))
Error in adder(10): argument "y" is missing, with no default

One way to get this to work with safely is to anonymize, being careful to feed the arguments after the final safely parenthesis.

purrr::map(1:5, \(x) purrr::safely(adder)(x, y=10))
[[1]]
[[1]]$result
[1] 11

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 12

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 13

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 14

[[4]]$error
NULL


[[5]]
[[5]]$result
[1] 15

[[5]]$error
NULL
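
Two other routes should give the same thing without the anonymous function, because the extra argument is handed to the function safely creates rather than to safely itself. This is a sketch; viadots and viapartial are just names for this example.

```r
adder <- function(x, y) {
  x + y
}

# map() forwards extra named arguments to .f, which here is the safe
# wrapper, and the wrapper passes them on to adder
viadots <- purrr::map(1:5, purrr::safely(adder), y = 10)

# purrr::partial() pre-fills y, leaving a one-argument function to wrap
viapartial <- purrr::map(1:5, purrr::safely(purrr::partial(adder, y = 10)))

# Pull the results out as a numeric vector to compare
purrr::map_dbl(viadots, 'result')
purrr::map_dbl(viapartial, 'result')
```

The key in both cases is that y = 10 sits outside the safely() call, so it isn’t mistaken for safely’s otherwise argument.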