Running on a SLURM Cluster

library(ssdsims)

The “Running a Sharded Pipeline” vignette runs a scenario’s shards through a targets pipeline on a single machine. This vignette takes that same pipeline to a SLURM cluster — and the only thing that changes is the crew controller (TARGETS-DESIGN.md §4). The per-task sample/fit/hc results are byte-identical to the local run; the cluster is just a faster backend.

You do not need to know crew or targets to follow this guide — only your cluster’s own (non-R) usage instructions. It is four steps: map your site’s SLURM instructions onto the controller, confirm the mapping with a preflight, run a minimal first job, then swap in your own scenario. This guide targets SLURM; crew.cluster also supports SGE, PBS/TORQUE, and LSF — see Targeting a non-SLURM cluster below.

The cluster template

The package ships an editable cluster template (a sibling of small and large). Three ingredients assemble into the run, and only B is cluster-specific:

Ingredient | What | Where
A: shape | the per-shard target graph | ssd_scenario_targets() in _targets.R (reused verbatim)
B: backend | the SLURM crew controller (queue, modules, scratch, workers, walltime) | the one editable block in controller.R
C: content | the scenario (seed, datasets, grids) | the inline scenario block in _targets.R

It is just four files plus a README.md, separating only what cannot be inlined:

list.files(system.file("targets-templates", "cluster", package = "ssdsims"))
#> [1] "_targets.R"   "controller.R" "preflight.R"  "README.md"    "run.R"

controller.R (the one editable controller block, sourced by both the pipeline and the preflight), _targets.R (the clean pipeline: controller + the inline scenario + factory), preflight.R (the standalone connectivity/prerequisite check, kept out of the pipeline), and run.R (the driver: preflight then tar_make()). Copy them to your project root:

dir <- system.file("targets-templates", "cluster", package = "ssdsims")
file.copy(list.files(dir, full.names = TRUE, all.files = TRUE, no.. = TRUE), ".")
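
Because file.copy() does not overwrite by default and silently returns FALSE for files that already exist, it can help to see which template files were actually copied. The following is a minimal sketch in base R; copy_template() is a hypothetical helper, not part of ssdsims.

```r
# Hypothetical helper (not part of ssdsims): copy every file in `from` into
# `to`, leaving files that already exist in `to` untouched, and report both.
copy_template <- function(from, to = ".") {
  src <- list.files(from, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  exists_already <- file.exists(file.path(to, basename(src)))
  copied <- file.copy(src[!exists_already], to, overwrite = FALSE)
  list(
    copied = basename(src[!exists_already])[copied],
    skipped = basename(src[exists_already])
  )
}

# Usage (assumes ssdsims is installed):
# copy_template(system.file("targets-templates", "cluster", package = "ssdsims"))
```

Anything listed under skipped is your existing file; compare it with the template copy by hand before deciding whether to replace it.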

Step 1 — Map your site’s SLURM instructions to the controller

Your real starting point is your site’s documentation: how to log in, submit an sbatch job, which partition/queue and account/allocation to use, the module system that provides R, the scratch filesystem, and the walltime/core limits. None of that mentions R, crew, or targets. This table bridges it to the one editable controller block in controller.R (crew.cluster::crew_controller_slurm() / crew.cluster::crew_options_slurm()):

Your site says… | …goes here
log in to the login/submit node | run run.R from there (where sbatch is on PATH)
submit with sbatch | nothing to set; crew.cluster calls sbatch for you
partition / queue name | crew_options_slurm(partition = "...")
account / allocation code | a "#SBATCH --account=..." line in script_lines
module load R/4.x (and any deps) | a "module load R/4.x" line in script_lines
scratch filesystem path | an "export TMPDIR=/scratch/$USER/..." line in script_lines
walltime limit (per job) | crew_options_slurm(time_minutes = ...)
cores per task | crew_options_slurm(cpus_per_task = ...)
memory per cpu | crew_options_slurm(memory_gigabytes_per_cpu = ...)
how many jobs you may run at once | crew_controller_slurm(workers = ...)

So a site whose instructions read “sbatch to the short queue, module load R/4.3, account abc123, scratch under /scratch/$USER, 60-minute walltime” becomes:

controller <- crew.cluster::crew_controller_slurm(
  workers = 4L, # max concurrent SLURM jobs
  seconds_idle = 60, # let an idle worker release its allocation
  options_cluster = crew.cluster::crew_options_slurm(
    partition = "short",
    time_minutes = 60,
    cpus_per_task = 1,
    memory_gigabytes_per_cpu = 4,
    script_lines = c(
      "#SBATCH --account=abc123",
      "module load R/4.3", # exposes R + the ssdsims binary library (below)
      "export TMPDIR=/scratch/$USER/ssdsims" # writable scratch for tempdir()
    )
  )
)

Anything not covered by a named argument (a constraint, a QOS, a --gres) goes in script_lines as a literal #SBATCH line — script_lines is injected verbatim into the generated sbatch script. Also set expected_r_version (in controller.R) to the major.minor your module load provides, so the preflight (step 2) can confirm it.
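
As an illustration of the extra directives described above, here is a sketch of a script_lines vector that adds a QOS, a constraint, and a generic resource. The directive values (normal, skylake, gpu:1) are hypothetical placeholders, not defaults of the template. The sketch also checks ordering, because sbatch stops reading #SBATCH directives at the first non-comment line of the script.

```r
# Hypothetical site values: replace each with what your site documents.
extra_directives <- c(
  "#SBATCH --qos=normal",
  "#SBATCH --constraint=skylake",
  "#SBATCH --gres=gpu:1"
)

# Directives first, shell commands after.
script_lines <- c(
  "#SBATCH --account=abc123",
  extra_directives,
  "module load R/4.3",
  "export TMPDIR=/scratch/$USER/ssdsims"
)

# Sanity check: every #SBATCH line comes before the first shell command.
is_directive <- startsWith(script_lines, "#SBATCH")
first_command <- which(!is_directive)[1]
stopifnot(all(which(is_directive) < first_command))
```

The resulting vector is what you would pass as script_lines = script_lines inside crew_options_slurm() in controller.R.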

Backend prerequisite — the worker install path

A SLURM worker is a fresh R process on a compute node. It must be able to library(ssdsims) without compiling the dependency tree from source (compute nodes are often offline or slow to build). The crew labs pinned this down with a ManyLinux binary install path: install ssdsims and its dependencies as pre-built Linux binaries (e.g. from the Posit Public Package Manager binary repository for your distro) into the library your module load R/... exposes, so the worker resolves everything at load time. Put the module load that exposes that library in script_lines. The preflight in step 2 is the login-node prerequisite checker: it fails loudly if a worker cannot load ssdsims, so you fix the install/module path before launching the study.
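
A minimal sketch of pointing install.packages() at a binary repository follows. It assumes the Posit Public Package Manager URL layout and uses "jammy" (Ubuntu 22.04) as an example distro codename; ppm_binary_repo() is a hypothetical helper, and the install call is left commented out because the right package list and library path depend on your site.

```r
# Hypothetical helper: build a Posit Public Package Manager binary repository
# URL for a Linux distro codename (assumed layout; confirm on the repository's
# setup page for your distro).
ppm_binary_repo <- function(distro, snapshot = "latest") {
  sprintf(
    "https://packagemanager.posit.co/cran/__linux__/%s/%s",
    distro, snapshot
  )
}

repo <- ppm_binary_repo("jammy") # "jammy" is an example, not a requirement

# On the login node, with the same R module loaded that the workers will use:
# options(repos = c(CRAN = repo))
# install.packages("<dependency>", lib = .libPaths()[1])
```

The repository's setup page lists any further options (for example an HTTP user agent setting) that R needs before binaries, rather than source packages, are served. Whichever library you install into must be the one the module load line in script_lines exposes on the compute nodes.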

Step 2 — Confirm the mapping with the preflight

preflight.R is a standalone check, kept out of the main pipeline. Run it on its own:

source("preflight.R") # or, from a shell:  Rscript preflight.R

It submits one SLURM job through your controller and, inside that job, checks the things that actually break a real run:

  1. R resolves at expected_r_version — else fix the module load line;
  2. library(ssdsims) loads — else fix the worker install/module path (the ManyLinux binary path above);
  3. tempdir() is writable — else fix the scratch export TMPDIR=... line.

A green preflight prints a small witness (the worker’s R version + node id) and means your controller block matches the site. A failed preflight aborts before the pipeline runs and names exactly which check failed, with the fix:

  • no job dispatched → controller / queue / account wiring;
  • R or ssdsims missing on the worker → module load / install path;
  • scratch not writable → the TMPDIR / scratch path.

So you debug the cluster wiring, never an obscure scenario error. run.R runs this preflight for you before tar_make(), so a broken cluster never reaches the expensive scenario shards.
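
To make the three worker-side checks concrete, here is a sketch of what such a check could look like in base R. It is not the code in preflight.R: check_worker() is a hypothetical stand-in, and it reports failures with a message rather than aborting.

```r
# Hypothetical stand-in for the checks a preflight job runs on a worker.
check_worker <- function(expected_r_version, pkg = "ssdsims") {
  # major.minor of the running R, e.g. "4.3"
  actual <- paste(R.version$major, sub("\\..*$", "", R.version$minor), sep = ".")
  # Try to write a probe file under tempdir() (which honours TMPDIR).
  probe <- tempfile("preflight-")
  writable <- tryCatch(
    {
      writeLines("ok", probe)
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
  unlink(probe)
  c(
    r_version = identical(actual, expected_r_version),
    package = requireNamespace(pkg, quietly = TRUE),
    scratch = writable
  )
}

checks <- check_worker(expected_r_version = "4.3")
failed <- names(checks)[!checks]
if (length(failed) > 0L) {
  message("preflight failed: ", paste(failed, collapse = ", "))
}
```

Each name in the result maps onto one fix from the list above: r_version to the module load line, package to the worker install path, and scratch to the TMPDIR export.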

Step 3 — Run a minimal first job

For your first job, make the scenario tiny and cheap so you are testing the wiring, not waiting on a big study. Temporarily shrink the scenario block near the top of _targets.R to a single cheap cell:

scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 2L,
  seed = 42L,
  nrow = 5L
)

Then run the pipeline from the login node:

source("run.R") # or, from a shell:  Rscript run.R

run.R runs the preflight first, then dispatches the scenario’s shards across SLURM jobs (independent shards run concurrently), writes one Parquet per shard under results/layout=<hash>/<step>/..., unions them into results/layout=<hash>/summary.parquet, and prints a peek at the estimates. That is your first running cluster job, end to end.
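
To inspect the output afterwards without knowing the layout hash, the summary file can be located by name. This is a sketch: find_summaries() is a hypothetical helper, and reading the file with arrow::read_parquet() assumes the arrow package is installed.

```r
# Hypothetical helper: find every summary.parquet under the results directory,
# whatever the layout=<hash> directory is called.
find_summaries <- function(root = "results") {
  list.files(
    root,
    pattern = "^summary\\.parquet$",
    recursive = TRUE,
    full.names = TRUE
  )
}

summaries <- find_summaries()

# Read the summary only when exactly one layout exists and arrow is available.
if (length(summaries) == 1L && requireNamespace("arrow", quietly = TRUE)) {
  estimates <- arrow::read_parquet(summaries)
  print(head(estimates))
}
```

More than one match means several layouts have been written under results/; pick the one whose hash matches the run you care about.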

Note

No cluster handy? run.R guards: if crew.cluster is not installed or sbatch is not on PATH, it aborts with a clear message naming the missing prerequisite. To run the same study off a cluster (no scheduler), use the large template — it builds the identical pipeline (same factory + scenario) under a crew::crew_controller_local() controller, and its run-serial.R asserts the results are byte-identical.
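
The guard described in the note can be sketched in base R as follows. This is an illustration of the idea, not the code in run.R; missing_prerequisites() is a hypothetical helper.

```r
# Hypothetical helper: name each prerequisite that is missing on this machine.
missing_prerequisites <- function(pkg = "crew.cluster", cmd = "sbatch") {
  missing <- character(0)
  if (!requireNamespace(pkg, quietly = TRUE)) {
    missing <- c(missing, paste("R package", pkg))
  }
  # Sys.which() returns "" when the command is not found on PATH.
  if (!nzchar(Sys.which(cmd))) {
    missing <- c(missing, paste(cmd, "on PATH"))
  }
  missing
}

missing <- missing_prerequisites()
if (length(missing) > 0L) {
  message("cannot run on SLURM, missing: ", paste(missing, collapse = "; "))
}
```

An empty result means both prerequisites are present and the SLURM run can proceed; otherwise the large template is the fallback.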

Step 4 — Swap in your own scenario

Once the minimal job succeeds, expand the scenario block in _targets.R back to the full sweep the template ships (or to your own study; see large/scenario.R). Leave controller.R and the ssd_scenario_targets() call unchanged — the scenario and the factory call are scheduler-independent. Run run.R again.

Targeting a non-SLURM cluster (untested)

This template and guide target SLURM, but crew.cluster also ships controllers for SGE (Sun Grid Engine / Son of Grid Engine), PBS/TORQUE, and LSF, so the same pipeline can in principle run on those schedulers too. ssdsims has only been exercised on SLURM — treat the others as supported by crew.cluster but untested here, and lean on the preflight (step 2) to catch wiring problems early.

Only ingredient B changes: in controller.R, swap the controller constructor and its options function. The factory (A), the scenario (C), and the whole preflight stay exactly the same.

Scheduler | Controller | Options
SLURM | crew_controller_slurm() | crew_options_slurm()
SGE | crew_controller_sge() | crew_options_sge()
PBS / TORQUE | crew_controller_pbs() | crew_options_pbs()
LSF | crew_controller_lsf() | crew_options_lsf()

The cluster-independent parts carry over unchanged:

  • script_lines exists on every options function and works the same way — put your module load R/..., your account/project directive, and your scratch export TMPDIR=... there (these are shell / #SCHEDULER directives, not R). So the worker-prerequisite mapping (the ManyLinux binary path) and the step-2 preflight are identical on every scheduler.
  • workers and seconds_idle are crew settings on the controller, not scheduler-specific.

The named resource arguments differ by scheduler; anything without a named argument goes in script_lines as a literal directive (#$ for SGE, #PBS for PBS/TORQUE, #BSUB for LSF):

Resource | SLURM | SGE | PBS / TORQUE | LSF
cores per worker | cpus_per_task | cores | cores | cores
memory | memory_gigabytes_per_cpu | memory_gigabytes_limit | memory_gigabytes_required | memory_gigabytes_limit
walltime | time_minutes | script_lines (#$ -l h_rt=...) | walltime_hours | script_lines (#BSUB -W ...)
queue / partition | partition | script_lines (#$ -q ...) | script_lines (#PBS -q ...) | script_lines (#BSUB -q ...)

For example, the SLURM controller from step 1 becomes, on SGE:

controller <- crew.cluster::crew_controller_sge(
  workers = 4L,
  seconds_idle = 60,
  options_cluster = crew.cluster::crew_options_sge(
    cores = 1,
    memory_gigabytes_limit = 4,
    script_lines = c(
      "#$ -q short.q", # queue (SGE has no `partition` argument)
      "#$ -l h_rt=01:00:00", # walltime
      "module load R/4.3",
      "export TMPDIR=/scratch/$USER/ssdsims"
    )
  )
)

See ?crew.cluster::crew_controller_sge (and the _pbs / _lsf equivalents) for the exact arguments. Everything else in this guide — the preflight, the minimal first job, swapping in your scenario — is unchanged.
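
The walltime row of the resource table is the one that changes shape most between schedulers, so here is a sketch that turns a limit in minutes into the setting each scheduler expects. walltime_setting() is a hypothetical helper, not part of crew.cluster or ssdsims. The PBS value is rounded up to whole hours on the assumption that fractional hours may not be accepted.

```r
# Hypothetical helper: express a walltime in minutes as either a named
# options argument or a literal script_lines directive, per scheduler.
walltime_setting <- function(minutes,
                             scheduler = c("slurm", "sge", "pbs", "lsf")) {
  scheduler <- match.arg(scheduler)
  hh <- as.integer(minutes %/% 60)
  mm <- as.integer(minutes %% 60)
  switch(scheduler,
    slurm = list(time_minutes = minutes),
    # Assumption: round up to whole hours for PBS.
    pbs = list(walltime_hours = ceiling(minutes / 60)),
    # SGE takes h_rt as HH:MM:SS.
    sge = list(script_lines = sprintf("#$ -l h_rt=%02d:%02d:00", hh, mm)),
    # LSF takes -W as [hour:]minute.
    lsf = list(script_lines = sprintf("#BSUB -W %d:%02d", hh, mm))
  )
}

walltime_setting(60, "sge")
walltime_setting(90, "lsf")
```

For a 60-minute limit the SGE result is the same "#$ -l h_rt=01:00:00" line used in the SGE controller above.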

Shard ↔ SLURM-job packing

Shards are the unit of parallelism: independent shard targets run concurrently across your SLURM jobs. How many shard targets ride in one job is a crew configuration knob (workers caps concurrent jobs; tasks_max on the controller packs several shards into one worker’s lifetime — “many-to-one”; tasks_max = 1L gives one shard per job — “one-to-one”). It is documented in controller.R, not hard-coded as 1:1.
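
The effect of tasks_max on job counts can be sketched with a little arithmetic. min_job_launches() is a hypothetical helper that gives only a lower bound: a worker that exits early (for example after seconds_idle) is replaced by a new job, so the real count can be higher.

```r
# Hypothetical helper: fewest worker (SLURM job) launches needed to run
# n_shards shard targets when each worker exits after tasks_max tasks.
min_job_launches <- function(n_shards, tasks_max = Inf) {
  stopifnot(n_shards >= 1, tasks_max >= 1)
  max(1, ceiling(n_shards / tasks_max))
}

min_job_launches(100, tasks_max = 1)  # one-to-one: 100 launches
min_job_launches(100, tasks_max = 25) # many-to-one: 4 launches
min_job_launches(100)                 # unlimited tasks per worker: 1 launch
```

workers is a separate limit: it caps how many of those jobs run at the same time, not how many are launched over the whole run.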

See also

  • The template’s own README.md (the same guide, alongside the files) and TARGETS-DESIGN.md §4 (from local to a cluster), §11 (open questions), §12 (cluster-pipeline).
  • ?ssd_scenario_targets, ?crew.cluster::crew_controller_slurm.
  • The “Running a Sharded Pipeline” vignette — the small/large/cluster template trio.
  • “Uploading Shards to Cloud Storage” — ship the cluster’s shards to an object store (the upload = ssd_upload_azure(...) line in this template’s _targets.R) and read them back in place.