Using future.batchtools

Author

Galen Holt

I need to sort out how to use futures for parallel processing on the HPC. There are a few things I’ve tried previously that didn’t work, and a way I’ve cobbled together that’s not ideal.

There are some issues with my current approach.

What do I want?

There are some improvements I want to make so my code works better (and faster):

  • Minor (if any) changes when running locally or on an HPC

  • In-code splitting of work into nodes to balance the work across them

  • Make sure we’re using both nodes and cores within them

The main solution seems to be using future.batchtools, but I wasn’t able to get it working quickly.
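To make that concrete, here’s roughly what I’m after (a minimal sketch of standard future/future.batchtools usage, not something I’ve verified on this cluster yet): the plan() call should be the only line that changes between my laptop and the HPC.

library(future)
library(future.batchtools)

# Locally: parallelise over the cores on this machine
plan(multisession)

# On the HPC: same downstream code, but each future becomes a Slurm job
# (relies on finding a batchtools.slurm.tmpl template; see Templates below)
plan(batchtools_slurm)

# Everything after the plan() call stays the same
f <- future(Sys.info()[["nodename"]])
value(f)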

Some questions

There are a few big-picture things I have questions about that aren’t clear from reading the docs (in addition to just ‘how do we get a run to work’):

  • How do jobs actually start? I think, but am not 100% sure, that the R script essentially builds slurm bash scripts and then calls sbatch. Is that what happens?

  • Do we still use sbatch or other command-line bash at all? Or is everything managed in R? If so, how do we actually start the runs? Rscript? srun on an R control script?

    • If Rscript, do we end up with that main R process running on the login node the whole time? What about with srun or…
    • This issue implies we need to leave R running. But can it run with Rscript? srun? sinteractive? sbatch any_R.sh analysis_script.R? With any_R.sh having low resources but maybe long walltime? It looks like srun barfs to the terminal and blocks, while sbatch outputs to file and is non-blocking, so that’s likely the way to go.
  • Does it make sense to manage nodes and cores separately, or do we just ask for a ton of cores and it auto-manages nodes to get them?

    • I think {slurmR} with plan(cluster) does the latter, but I’m not positive

    • I’m not actually sure what future.batchtools does by default (will check), but I think a list-plan likely makes sense.

  • Do we use SLURM job arrays? Or does it generate a bunch of batch scripts that get called as separate jobs instead of array jobs? Does it matter?

    • What are chunks?
  • If we feed it a big set of iterations, does it send each one to its own node? Its own core? Is there any chunking?

Getting it to run

I’ll use my HPC testing repo to build some template and testing files. There are useful help pages, but somehow none of them quite makes complete sense to me on its own, so I’m going to have to try things and see what happens.

Useful pages I’m referencing:

To get it working

future.batchtools github

batchtools github

To make it work better

list-plans

Templates

I’m still a bit confused by the overall workflow, but it’s clear I need a template. There’s one at the future.batchtools github, and a few at the batchtools github.

Then I think in the plan call, we tweak that? Let’s just get it working. Trying first with the one that comes from future.batchtools. Though the one from batchtools looks like it has more capability for doing things like managing cores on nodes. Maybe try them both as we go?

I have both of those. I kind of want to test both. The docs say the template should be either at ./batchtools.slurm.tmpl (associated with a particular working directory) or ~/.batchtools.slurm.tmpl (for all processes to find it). But I want to be able to test multiple templates. I should be able to use

plan(batchtools_slurm, template = "/path/to/batchtools.slurm.tmpl")

but it’s a bit unclear whether the templates still need to be named batchtools.slurm.tmpl, or can have whatever filename we want as long as I give the path. Guess I’ll test that. Try it as batchtools.slurm.tmpl in the repo directory first though.
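For reference, the sort of call I have in mind is below; the filename is made up, and the resources list only does anything if the template actually reads those fields, so treat this as a guess at the syntax rather than a known-good call.

# Point plan() at a specific template file and pass scheduler resources.
# The names in `resources` must match whatever the template expects
# (e.g. resources$walltime, resources$memory in the example templates).
plan(batchtools_slurm,
     template  = "future_batchtools.slurm.tmpl",   # hypothetical filename
     resources = list(walltime = 3600, memory = "4G"))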

Try a simple job

Use the future.batchtools template, and a foreach loop using %:%.
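Something like this (illustrative, not the exact contents of the test file, which also prints the plan, worker, and PID summaries shown below):

library(doFuture)          # attaches foreach and future
library(future.batchtools)

registerDoFuture()         # use futures as the foreach backend
plan(batchtools_slurm)     # picks up ./batchtools.slurm.tmpl

# 25 x 25 = 625 iterations crossed with %:%; each one reports where it ran
out <- foreach(i = 1:25, .combine = rbind) %:%
  foreach(j = 1:25, .combine = rbind) %dopar% {
    data.frame(i = i, j = j,
               node = Sys.info()[["nodename"]],
               pid = Sys.getpid())
  }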

Start with Rscript filename.R

It starts printing directly to the terminal, which is annoying.

Opening another terminal and typing squeue shows that I have 4 nodes, though actually that was just at that moment.

It doesn’t seem to produce stdout or stderr files, which is going to make it tricky to see what happened. See below: this is true unless we run the master R session through sbatch.

I can copy in from the terminal output:

Loading required package: foreach
Loading required package: future
Warning message:
package ‘future’ was built under R version 4.0.5
Loading required package: parallelly
Warning messages:
1: package ‘future.batchtools’ was built under R version 4.0.5
2: package ‘parallelly’ was built under R version 4.0.5

Plan is:
List of future strategies:
1. batchtools_slurm:
   - args: function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = NULL, resources = list(), workers = NULL, registry = list(), ...)
   - tweaked: FALSE
   - call: plan(batchtools_slurm)

available workers:

localhost localhost localhost localhost localhost localhost localhost localhost

total workers:

8

unique workers:

localhost

available Cores:

non-slurm

8

slurm method

1

Main PID:

3195570
There were 50 or more warnings (use warnings() to see the first 50)

Unique processes

100

IDs of all cores used

238059 1524768 1518289 1500375 1513942 3264114 1269189 1345061 2623254 238128 3264182 1269259 1345124 1524852 1518372 3264251 238197 1500450 1514026 1524930 1518447 238268 3264327 1500525 1514101 238335 1525010 1518527 1500596 1514176 3264408 238401 1525084 1518604 1500666 1514254 3264478 1518681 238466 1525164 1500739 1518756 1514328 238530 1500811 1525241 1514402 1518828 1500879 238594 1525316 1514479 3264562 1518901 1500952 238662 1525391 1518977 1514555 1501019 3264640 238727 1525467 1519049 1514630 238791 1501090 1525542 3264715 1519121 1501158 238857 1525621 1519194 1514708 1501225 3264796 238922 1525693 1519271 1514786 1501292 3264863 1525766 238989 1519345 1514861 1525839 1501361 239060 3264939 1519420 1514936 1501428 1525912 239129 3265003 1519494 1501496 1525987

Nodes and pids

           238059 238128 238197 238268 238335 238401 238466 238530 238594

bilbo             6      6      6      6      6      6      7      6      7
frodo-vs01        0      0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0      0

           238662 238727 238791 238857 238922 238989 239060 239129 1269189

bilbo             6      7      7      6      7      6      7      6      0
frodo-vs01        0      0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0      7
gandalf-vm04      0      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0      0

           1269259 1345061 1345124 1500375 1500450 1500525 1500596 1500666

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      6      6      6      6      7
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      6      0      0      0      0      0      0      0
gandalf-vm04      0      6      6      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1500739 1500811 1500879 1500952 1501019 1501090 1501158 1501225

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        6      6      6      6      6      6      6      6
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1501292 1501361 1501428 1501496 1513942 1514026 1514101 1514176

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        7      6      7      6      0      0      0      0
frodo-vs04        0      0      0      0      6      7      6      6
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1514254 1514328 1514402 1514479 1514555 1514630 1514708 1514786

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        6      7      7      6      6      6      6      6
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1514861 1514936 1518289 1518372 1518447 1518527 1518604 1518681

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        0      0      7      7      6      6      6      6
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        6      6      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1518756 1518828 1518901 1518977 1519049 1519121 1519194 1519271

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        6      6      7      7      6      7      7      6
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1519345 1519420 1519494 1524768 1524852 1524930 1525010 1525084

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      6      6      6      7      6
frodo-vs02        7      6      7      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1525164 1525241 1525316 1525391 1525467 1525542 1525621 1525693

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        6      6      6      6      6      6      6      6
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      0      0      0
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           1525766 1525839 1525912 1525987 2623254 3264114 3264182 3264251

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        6      6      6      6      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      0      0      0      0      0      6      7      6
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      6      0      0      0

           3264327 3264408 3264478 3264562 3264640 3264715 3264796 3264863

bilbo             0      0      0      0      0      0      0      0
frodo-vs01        0      0      0      0      0      0      0      0
frodo-vs02        0      0      0      0      0      0      0      0
frodo-vs03        0      0      0      0      0      0      0      0
frodo-vs04        0      0      0      0      0      0      0      0
gandalf-vm02      7      7      6      6      6      6      6      6
gandalf-vm03      0      0      0      0      0      0      0      0
gandalf-vm04      0      0      0      0      0      0      0      0
gandalf-vm05      0      0      0      0      0      0      0      0

           3264939 3265003

bilbo             0      0
frodo-vs01        0      0
frodo-vs02        0      0
frodo-vs03        0      0
frodo-vs04        0      0
gandalf-vm02      6      6
gandalf-vm03      0      0
gandalf-vm04      0      0
gandalf-vm05      0      0

The formatting of the table is all boogered up, but it looks like I got 9 nodes and used something like 9-14 PIDs each, each about 6-7 times. It’s a bit confusing, because sinfo --Node --long shows that the number of CPUs differs across those nodes, and that seems to be reflected in what happened here. I need to change the output so I can better see what I used. But, roughly, that looks about right: it sent jobs to CPUs as needed.

# number of iterations in the nested loop
25*25
[1] 625
# roughly: 9 nodes x ~12 PIDs each x ~6 uses each
9*12*6
[1] 648

Calling and output

We seem to need to leave an R session running to operate the futures. We can probably do that interactively (module load R, then R, then run the code), or with Rscript. But that’s all interactive, as is srun, and it prints everything to the terminal without saving it. I think sbatch might be the way to go. Try that.

That does run, and spawns the other jobs.

It’s silly that we need another wrapper script, but maybe that’s OK; it gives us the option to do some auto-setup in an intermediate script.

Repeatedly calling squeue shows something interesting: it looks like it’s creating a separate job for each task, not doing anything node-aware (e.g. chunking tasks to a node and then multisessioning within it). I guess that makes sense based on the help and other testing, but we can almost certainly speed things up by chunking and list-planning.
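What I have in mind is something like the following (untested; workers = 9 is just a guess at a sensible number of outer jobs, and whether the nested foreach picks up the inner level properly is exactly what I’d need to check):

plan(list(
  tweak(batchtools_slurm, workers = 9),  # outer: each future is one Slurm job
  multisession                           # inner: spread across cores within that job
))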

If the object-passing overhead really becomes an issue, we might want to just use batchtools directly (roughly as sketched below), which I think could also remove the need to leave a master R session running. But that gives up the major advantage of being able to run essentially the same code locally and on the HPC just by changing the plan.
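For the record, the batchtools-only version would look roughly like this (a sketch based on the batchtools docs, not something I’ve run here; the registry path and resource names are made up):

library(batchtools)

# A registry holds the job definitions, logs, and results on disk
reg <- makeRegistry(file.dir = "registries/testrun", seed = 1)
reg$cluster.functions <- makeClusterFunctionsSlurm(template = "batchtools.slurm.tmpl")

# One job per parameter combination
params <- expand.grid(i = 1:25, j = 1:25)
batchMap(fun = function(i, j) Sys.getpid(), args = params, reg = reg)

# Submit, then quit R; come back later with loadRegistry() to collect results
submitJobs(reg = reg, resources = list(walltime = 3600, memory = "4G"))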

The output using sbatch batchtools_R.sh testing_future_batchtools.R (where batchtools_R.sh is a lightweight master job) is:

ℹ Using R 4.0.3 (lockfile was generated with R 4.2.2)

 Plan is:
List of future strategies:
1. batchtools_slurm:
   - args: function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = NULL, resources = list(), workers = NULL, registry = list(), ...)
   - tweaked: FALSE
   - call: plan(batchtools_slurm)

### available workers:
gandalf-vm01


### total workers:
1

### unique workers:
gandalf-vm01

### available Cores:

#### non-slurm
1

#### slurm method
1

### Main PID:
4135554

### Unique processes
100

IDs of all cores used

241089
1528094
1522003
1503388
1516888
3269690
1522077
241160
1528170
1503459
1516963
3269759
1522156
1528243
241230
3269825
1271878
1528317
1522235
241297
1503531
1528392
1517044
1522308
241367
1503606
1528471
1522380
1517120
241436
3269894
1528548
1522456
1503681
1517195
241505
1522531
1528626
3269975
1271968
1503749
241573
1522605
3270042
1528703
1503817
241641
1522682
1528778
1517275
3270121
1272046
241709
1528853
1522758
1503892
1517348
241778
1528932
1522829
1503961
1517420
3270199
241843
1529005
1522901
1504030
241910
1529082
1517500
3270277
1522975
1504098
241978
1529157
1523051
1504167
1517581
3270355
1529231
242042
1523125
1504238
1529306
1517659
3270422
242108
1523197
1529384
1504307
1517731
3270488
242173
1523272
1529459
1504376
1517812
3270555
242239
1529532

## Nodes and pids
# A tibble: 100 × 3
# Groups:   node [7]
    node             pid n_reps
    <chr>          <int>  <int>
  1 bilbo         241089      6
  2 bilbo         241160      6
  3 bilbo         241230      7
  4 bilbo         241297      6
  5 bilbo         241367      6
  6 bilbo         241436      6
  7 bilbo         241505      6
  8 bilbo         241573      6
  9 bilbo         241641      7
 10 bilbo         241709      6
 11 bilbo         241778      7
 12 bilbo         241843      6
 13 bilbo         241910      6
 14 bilbo         241978      7
 15 bilbo         242042      6
 16 bilbo         242108      6
 17 bilbo         242173      6
 18 bilbo         242239      6
 19 frodo-vs01   1528094      6
 20 frodo-vs01   1528170      6
 21 frodo-vs01   1528243      6
 22 frodo-vs01   1528317      6
 23 frodo-vs01   1528392      6
 24 frodo-vs01   1528471      7
 25 frodo-vs01   1528548      6
 26 frodo-vs01   1528626      6
 27 frodo-vs01   1528703      6
 28 frodo-vs01   1528778      6
 29 frodo-vs01   1528853      7
 30 frodo-vs01   1528932      6
 31 frodo-vs01   1529005      6
 32 frodo-vs01   1529082      6
 33 frodo-vs01   1529157      6
 34 frodo-vs01   1529231      6
 35 frodo-vs01   1529306      6
 36 frodo-vs01   1529384      6
 37 frodo-vs01   1529459      6
 38 frodo-vs01   1529532      6
 39 frodo-vs02   1522003      7
 40 frodo-vs02   1522077      7
 41 frodo-vs02   1522156      6
 42 frodo-vs02   1522235      7
 43 frodo-vs02   1522308      6
 44 frodo-vs02   1522380      6
 45 frodo-vs02   1522456      6
 46 frodo-vs02   1522531      6
 47 frodo-vs02   1522605      7
 48 frodo-vs02   1522682      6
 49 frodo-vs02   1522758      6
 50 frodo-vs02   1522829      6
 51 frodo-vs02   1522901      7
 52 frodo-vs02   1522975      6
 53 frodo-vs02   1523051      6
 54 frodo-vs02   1523125      7
 55 frodo-vs02   1523197      6
 56 frodo-vs02   1523272      7
 57 frodo-vs03   1503388      6
 58 frodo-vs03   1503459      6
 59 frodo-vs03   1503531      6
 60 frodo-vs03   1503606      6
 61 frodo-vs03   1503681      6
 62 frodo-vs03   1503749      6
 63 frodo-vs03   1503817      6
 64 frodo-vs03   1503892      6
 65 frodo-vs03   1503961      6
 66 frodo-vs03   1504030      6
 67 frodo-vs03   1504098      6
 68 frodo-vs03   1504167      6
 69 frodo-vs03   1504238      6
 70 frodo-vs03   1504307      7
 71 frodo-vs03   1504376      6
 72 frodo-vs04   1516888      6
 73 frodo-vs04   1516963      7
 74 frodo-vs04   1517044      7
 75 frodo-vs04   1517120      6
 76 frodo-vs04   1517195      7
 77 frodo-vs04   1517275      7
 78 frodo-vs04   1517348      6
 79 frodo-vs04   1517420      7
 80 frodo-vs04   1517500      7
 81 frodo-vs04   1517581      7
 82 frodo-vs04   1517659      6
 83 frodo-vs04   1517731      6
 84 frodo-vs04   1517812      6
 85 gandalf-vm02 3269690      6
 86 gandalf-vm02 3269759      6
 87 gandalf-vm02 3269825      6
 88 gandalf-vm02 3269894      7
 89 gandalf-vm02 3269975      7
 90 gandalf-vm02 3270042      6
 91 gandalf-vm02 3270121      6
 92 gandalf-vm02 3270199      6
 93 gandalf-vm02 3270277      6
 94 gandalf-vm02 3270355      6
 95 gandalf-vm02 3270422      7
 96 gandalf-vm02 3270488      6
 97 gandalf-vm02 3270555      7
 98 gandalf-vm03 1271878      6
 99 gandalf-vm03 1271968      6
100 gandalf-vm03 1272046      6

Time taken for code: 185

So, we can see the lightweight wrapper uses 1 CPU on one node, but spawns 100 workers, each of which gets used 6-7 times.

TODO

other templates

managing resources

list-plans and nesting

chunks.as.array

auto-managing plan locally vs HPC

consequences of master R session dying (if Rscript, if sbatched)

etc