Running on a SLURM Cluster

library(ssdsims)

The vignette("sharded-pipeline") article runs a scenario’s shards through a targets pipeline on a single machine. This vignette takes that same pipeline to a SLURM cluster — and the only thing that changes is the crew controller. The per-task sample/fit/hc results are byte-identical to the local run; the cluster is just a faster backend.

You do not need to know crew or targets to follow this guide, only your cluster’s own (non-R) usage instructions. It has four steps: map your site’s SLURM instructions onto the controller, confirm the mapping with a preflight, run a minimal first job, then swap in your own scenario. This guide targets SLURM; crew.cluster also supports SGE, PBS/TORQUE, and LSF (see Targeting a non-SLURM cluster below).

The `cluster` template

The package ships an editable cluster template (a sibling of small and large). Three ingredients assemble into the run, and only B is cluster-specific:

Ingredient	What	Where
A — shape	the per-shard target graph	`ssd_scenario_targets()` in `_targets.R` (reused verbatim)
B — backend	the SLURM `crew` controller (queue, modules, scratch, workers, walltime)	the one editable block in `controller.R`
C — content	the scenario (seed, datasets, grids)	the inline `scenario` block in `_targets.R`

The template is just five files, split apart only where inlining is not possible:

list.files(system.file("targets-templates", "cluster", package = "ssdsims"))
#> [1] "_targets.R"   "controller.R" "preflight.R"  "README.md"    "run.R"

Each file has one job:

controller.R: the one editable controller block, sourced by both the pipeline and the preflight;
_targets.R: the clean pipeline (controller + the inline scenario + factory);
preflight.R: the standalone connectivity/prerequisite check, kept out of the pipeline;
run.R: the driver (preflight, then tar_make());
README.md: a copy of this guide.

Copy them to your project root:

dir <- system.file("targets-templates", "cluster", package = "ssdsims")
file.copy(list.files(dir, full.names = TRUE, all.files = TRUE, no.. = TRUE), ".")
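
file.copy() reports a refused copy by returning FALSE rather than raising an error, so a file that already exists in the project root is skipped silently. The sketch below wraps the same call in a hypothetical helper, copy_template() (not part of ssdsims), that names any file that was not copied. It assumes you would rather stop than overwrite an existing file.

```r
# Hypothetical helper (not part of ssdsims): copy every file from a template
# directory into a project directory and fail loudly if any copy is refused.
copy_template <- function(from, to = ".") {
  files <- list.files(from, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  # file.copy() returns FALSE (it does not error) when the target already
  # exists and overwrite = FALSE, so the result is inspected explicitly.
  ok <- file.copy(files, to, overwrite = FALSE)
  names(ok) <- basename(files)
  if (!all(ok)) {
    stop("not copied (already present?): ",
         paste(names(ok)[!ok], collapse = ", "))
  }
  invisible(ok)
}

# Usage, assuming ssdsims is installed:
# copy_template(system.file("targets-templates", "cluster", package = "ssdsims"))
```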

Step 1 — Map your site’s SLURM instructions to the controller

Your real starting point is your site’s documentation: how to log in, submit an sbatch job, which partition/queue and account/allocation to use, the module system that provides R, the scratch filesystem, and the walltime/core limits. None of that mentions R, crew, or targets. This table bridges it to the one editable controller block in controller.R (crew.cluster::crew_controller_slurm() / crew.cluster::crew_options_slurm()):

Your site says…	…goes here
log in to the login/submit node	run `run.R` from there (where `sbatch` is on `PATH`)
submit with `sbatch`	nothing to set — `crew.cluster` calls `sbatch` for you
partition / queue name	`crew_options_slurm(partition = "...")`
account / allocation code	a `"#SBATCH --account=..."` line in `script_lines`
`module load R/4.x` (and any deps)	a `"module load R/4.x"` line in `script_lines`
scratch filesystem path	an `"export TMPDIR=/scratch/$USER/..."` line in `script_lines`
walltime limit (per job)	`crew_options_slurm(time_minutes = ...)`
cores per task	`crew_options_slurm(cpus_per_task = ...)`
memory per cpu	`crew_options_slurm(memory_gigabytes_per_cpu = ...)`
how many jobs you may run at once	`crew_controller_slurm(workers = ...)`

So a site whose instructions read “sbatch to the short queue, module load R/4.3, account abc123, scratch under /scratch/$USER, 60-minute walltime” becomes:

controller <- crew.cluster::crew_controller_slurm(
  workers = 4L, # max concurrent SLURM jobs
  seconds_idle = 60, # let an idle worker release its allocation
  options_cluster = crew.cluster::crew_options_slurm(
    partition = "short",
    time_minutes = 60,
    cpus_per_task = 1,
    memory_gigabytes_per_cpu = 4,
    script_lines = c(
      "#SBATCH --account=abc123",
      "module load R/4.3", # exposes R + the ssdsims binary library (below)
      "export TMPDIR=/scratch/$USER/ssdsims" # writable scratch for tempdir()
    )
  )
)

Anything not covered by a named argument (a constraint, a QOS, a --gres) goes in script_lines as a literal #SBATCH line — script_lines is injected verbatim into the generated sbatch script. Also set expected_r_version (in controller.R) to the major.minor your module load provides, so the preflight (step 2) can confirm it.
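
Because script_lines is pasted into the job script as-is, the order of its entries matters: SLURM stops reading #SBATCH directives at the first executable command, so every directive has to come before module load and export. A minimal sketch of a helper that keeps that order is shown below. The name site_script_lines() and the QOS value are illustrative assumptions, not part of the template.

```r
# Hypothetical helper: assemble `script_lines` from site settings, keeping
# all #SBATCH directives ahead of the first shell command.
site_script_lines <- function(account, r_module, scratch, extra = character()) {
  c(
    sprintf("#SBATCH --account=%s", account),
    extra,                                   # further literal #SBATCH lines
    sprintf("module load %s", r_module),     # first executable line
    sprintf("export TMPDIR=%s", scratch)
  )
}

lines <- site_script_lines(
  account = "abc123",
  r_module = "R/4.3",
  scratch = "/scratch/$USER/ssdsims",
  extra = "#SBATCH --qos=normal"  # example QOS name (an assumption)
)
writeLines(lines)
```

The resulting vector can be passed as script_lines = lines in the options call above.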

Backend prerequisite — the worker install path

A SLURM worker is a fresh R process on a compute node. It must be able to library(ssdsims) without compiling the dependency tree from source (compute nodes are often offline or slow to build). The crew labs pinned this down with a ManyLinux binary install path: install ssdsims and its dependencies as pre-built Linux binaries (e.g. from the Posit Public Package Manager binary repository for your distro) into the library your module load R/... exposes, so the worker resolves everything at load time. Put the module load that exposes that library in script_lines. The preflight in step 2 is the login-node prerequisite checker: it fails loudly if a worker cannot load ssdsims, so you fix the install/module path before launching the study.
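
As a rough illustration of the binary install path, the sketch below builds a Posit Public Package Manager repository URL for a given Linux distribution. The helper name ppm_binary_repo() and the "jammy" codename are assumptions for the example; check your site’s distribution and Package Manager’s own setup page before relying on it. Whether ssdsims itself is served from that repository is also an assumption to verify.

```r
# Hypothetical helper: URL of the Posit Public Package Manager repository
# that serves pre-built Linux binaries for one distribution.
# `distro` is the distribution codename your compute nodes run.
ppm_binary_repo <- function(distro, snapshot = "latest") {
  sprintf("https://packagemanager.posit.co/cran/__linux__/%s/%s",
          distro, snapshot)
}

repo <- ppm_binary_repo("jammy")  # "jammy" (Ubuntu 22.04) is only an example
print(repo)

# On the login node, with the same R module loaded that the workers use,
# an install into the module's library could then look like this
# (left commented out because it needs network access):
# install.packages("ssdsims", repos = repo)
```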

Step 2 — Confirm the mapping with the preflight

preflight.R is a standalone check, kept out of the main pipeline. Run it on its own:

source("preflight.R") # or, from a shell:  Rscript preflight.R

It submits one SLURM job through your controller and, inside that job, checks the things that actually break a real run:

R resolves at expected_r_version — else fix the module load line;
library(ssdsims) loads — else fix the worker install/module path (the ManyLinux binary path above);
tempdir() is writable — else fix the scratch export TMPDIR=... line.
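
The three checks are simple enough to express in base R. The sketch below is not the code in preflight.R; it is a minimal stand-in, with a hypothetical function name worker_checks(), that shows what each check amounts to when it runs inside the worker’s R session.

```r
# A sketch of the worker-side checks; NOT the contents of preflight.R.
worker_checks <- function(expected_r_version, package = "ssdsims") {
  # Compare major.minor only, so "4.3" matches R 4.3.0, 4.3.2, ...
  actual <- paste(R.version$major, sub("\\..*$", "", R.version$minor), sep = ".")
  # Prove tempdir() is writable by actually creating a file in it.
  probe <- tempfile("preflight-")
  writable <- isTRUE(tryCatch(
    file.create(probe),
    warning = function(w) FALSE,
    error = function(e) FALSE
  ))
  unlink(probe)
  c(
    r_version = identical(actual, expected_r_version),
    package = requireNamespace(package, quietly = TRUE),
    tempdir = writable
  )
}

# Example call; "4.3" stands for whatever your module provides.
worker_checks("4.3")
```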

A green preflight prints a small witness (the worker’s R version + node id) and means your controller block matches the site. A failed preflight aborts before the pipeline runs and names exactly which check failed, with the fix:

no job dispatched → controller / queue / account wiring;
R or ssdsims missing on the worker → module load / install path;
scratch not writable → the TMPDIR / scratch path.

So you debug the cluster wiring, never an obscure scenario error. run.R runs this preflight for you before tar_make(), so a broken cluster never reaches the expensive scenario shards.
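
The failure-to-fix mapping above can be pictured as a small lookup. In the sketch below the check names and the function first_failure() are hypothetical, chosen only to illustrate how a failed check leads to one specific line of controller.R; the real preflight’s messages may differ.

```r
# Hypothetical lookup from a failed check to the part of controller.R to fix.
fixes <- c(
  dispatch = "the controller / queue / account wiring",
  r_version = "the `module load` line",
  package = "the worker install / module path",
  tempdir = "the `export TMPDIR=...` scratch line"
)

# Return a message for the first failed check, or NULL if all passed.
first_failure <- function(results) {
  failed <- names(results)[!results]
  if (length(failed) == 0L) {
    return(NULL)
  }
  sprintf("check '%s' failed; fix %s", failed[1], fixes[[failed[1]]])
}

first_failure(c(dispatch = TRUE, r_version = TRUE, package = FALSE, tempdir = TRUE))
```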

Step 3 — Run a minimal first job

For your first job, make the scenario tiny and cheap so you are testing the wiring, not waiting on a big study. Temporarily shrink the scenario block near the top of _targets.R to a single cheap cell:

data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(
  data,
  nsim = 2L,
  seed = 42L,
  nrow = 5L
)

Then run the pipeline from the login node:

source("run.R") # or, from a shell:  Rscript run.R

run.R runs the preflight first, then dispatches the scenario’s shards across SLURM jobs (independent shards run concurrently), writes one Parquet per shard under results/layout=<hash>/<step>/..., unions them into results/layout=<hash>/summary.parquet, and prints a peek at the estimates. That is your first running cluster job, end to end.
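
After the run, a quick inventory of results/ confirms that every step wrote its shards. The sketch below uses only base R and assumes the path shape described above (layout=<hash>/<step>/...). The helper name shard_inventory() and the exact shard file names are assumptions.

```r
# Hypothetical helper: summarise the Parquet files a run left under `root`.
shard_inventory <- function(root = "results") {
  files <- list.files(root, pattern = "\\.parquet$", recursive = TRUE)
  is_summary <- basename(files) == "summary.parquet"
  shards <- files[!is_summary]
  # Assumed relative path shape: layout=<hash>/<step>/<shard file>,
  # so the second path component is the step name.
  parts <- strsplit(shards, "/", fixed = TRUE)
  steps <- vapply(parts, function(p) p[2], character(1))
  list(summaries = files[is_summary], shards_per_step = table(steps))
}

# Usage from the project root after run.R has finished:
# shard_inventory("results")
```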

Note

No cluster handy? run.R guards against that: if crew.cluster is not installed or sbatch is not on PATH, it aborts with a clear message naming the missing prerequisite. To run the same study without a scheduler, use the large template. It builds the identical pipeline (same factory + scenario) under a crew::crew_controller_local() controller, and its run-serial.R checks (with all.equal()) that the single-core and targets estimates match.
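
A guard of this kind needs only two base R calls: requireNamespace() for the package and Sys.which() for the command. The sketch below is an illustration under that assumption, with a hypothetical function name; it is not the code in run.R.

```r
# A sketch of a prerequisite guard; NOT the code in run.R.
# Returns one message per missing prerequisite (character(0) if none).
missing_prerequisites <- function(package = "crew.cluster", command = "sbatch") {
  missing <- character()
  if (!requireNamespace(package, quietly = TRUE)) {
    missing <- c(missing, sprintf("R package '%s' is not installed", package))
  }
  # Sys.which() returns "" when the command is not found on PATH.
  if (!nzchar(Sys.which(command))) {
    missing <- c(missing, sprintf("'%s' is not on PATH", command))
  }
  missing
}

problems <- missing_prerequisites()
if (length(problems) > 0L) {
  message("Cannot run on SLURM: ", paste(problems, collapse = "; "))
}
```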

Step 4 — Swap in your own scenario

Once the minimal job succeeds, expand the scenario block in _targets.R back to the full sweep the template ships (or to your own study; see large/scenario.R). Leave controller.R and the ssd_scenario_targets() call unchanged — the scenario and the factory call are scheduler-independent. Run run.R again.

Targeting a non-SLURM cluster (untested)

This template and guide target SLURM, but crew.cluster also ships controllers for SGE (Sun Grid Engine / Son of Grid Engine), PBS/TORQUE, and LSF, so the same pipeline can in principle run on those schedulers too. ssdsims has only been exercised on SLURM — treat the others as supported by crew.cluster but untested here, and lean on the preflight (step 2) to catch wiring problems early.

Only ingredient B changes: in controller.R, swap the controller constructor and its options function. The factory (A), the scenario (C), and the whole preflight stay exactly the same.

Scheduler	Controller	Options
SLURM	`crew_controller_slurm()`	`crew_options_slurm()`
SGE	`crew_controller_sge()`	`crew_options_sge()`
PBS / TORQUE	`crew_controller_pbs()`	`crew_options_pbs()`
LSF	`crew_controller_lsf()`	`crew_options_lsf()`

The cluster-independent parts carry over unchanged:

script_lines exists on every options function and works the same way — put your module load R/..., your account/project directive, and your scratch export TMPDIR=... there (these are shell / #SCHEDULER directives, not R). So the worker-prerequisite mapping (the ManyLinux binary path) and the step-2 preflight are identical on every scheduler.
workers and seconds_idle are crew settings on the controller, not scheduler-specific.

The named resource arguments differ by scheduler; anything without a named argument goes in script_lines as a literal directive (#$ for SGE, #PBS for PBS/TORQUE, #BSUB for LSF):

Resource	SLURM	SGE	PBS / TORQUE	LSF
cores per worker	`cpus_per_task`	`cores`	`cores`	`cores`
memory	`memory_gigabytes_per_cpu`	`memory_gigabytes_limit`	`memory_gigabytes_required`	`memory_gigabytes_limit`
walltime	`time_minutes`	`script_lines` (`#$ -l h_rt=...`)	`walltime_hours`	`script_lines` (`#BSUB -W ...`)
queue / partition	`partition`	`script_lines` (`#$ -q ...`)	`script_lines` (`#PBS -q ...`)	`script_lines` (`#BSUB -q ...`)
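
To make the walltime row concrete, the sketch below turns a limit in minutes into the setting each scheduler expects, following the table: a named argument for SLURM and PBS/TORQUE, a literal directive for SGE and LSF. The function walltime_setting() is hypothetical, and the directive formats (h_rt as hh:mm:ss, -W in minutes) are the common forms; confirm them against your site’s documentation.

```r
# Hypothetical helper: express a walltime limit (in minutes) the way each
# scheduler's options function expects it, per the table above.
walltime_setting <- function(scheduler = c("slurm", "sge", "pbs", "lsf"), minutes) {
  scheduler <- match.arg(scheduler)
  minutes <- as.integer(minutes)
  hms <- sprintf("%02d:%02d:00", minutes %/% 60L, minutes %% 60L)
  switch(scheduler,
    slurm = list(time_minutes = minutes),                  # named argument
    pbs = list(walltime_hours = minutes / 60),             # named argument
    sge = list(script_lines = paste0("#$ -l h_rt=", hms)), # literal directive
    lsf = list(script_lines = sprintf("#BSUB -W %d", minutes))
  )
}

walltime_setting("sge", 60)
walltime_setting("pbs", 90)
```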

For example, the SLURM controller from step 1 becomes, on SGE:

controller <- crew.cluster::crew_controller_sge(
  workers = 4L,
  seconds_idle = 60,
  options_cluster = crew.cluster::crew_options_sge(
    cores = 1,
    memory_gigabytes_limit = 4,
    script_lines = c(
      "#$ -q short.q", # queue (SGE has no `partition` argument)
      "#$ -l h_rt=01:00:00", # walltime
      "module load R/4.3",
      "export TMPDIR=/scratch/$USER/ssdsims"
    )
  )
)

See crew.cluster::crew_controller_sge() (and the _pbs / _lsf equivalents) for the exact arguments. Everything else in this guide — the preflight, the minimal first job, swapping in your scenario — is unchanged.

Shard ↔︎ SLURM-job packing

Shards are the unit of parallelism: independent shard targets run concurrently across your SLURM jobs. How many shard targets ride in one job is a crew configuration option, set by two controller arguments. workers caps the number of concurrent jobs. tasks_max packs several shards into one worker’s lifetime (“many-to-one”), and tasks_max = 1L gives one shard per job (“one-to-one”). The packing is documented in controller.R, not hard-coded as 1:1.
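
The trade-off can be estimated with a little arithmetic. The sketch below is an idealised model, assuming a finite tasks_max, that every worker runs exactly tasks_max shards, and that only shard targets count as tasks. Real runs can differ because crew scales workers up and down (for example through seconds_idle) and the pipeline contains non-shard targets too. The function packing() is hypothetical.

```r
# Idealised packing arithmetic (an illustration, not crew's scheduling logic).
packing <- function(n_shards, workers, tasks_max) {
  jobs <- ceiling(n_shards / tasks_max)  # SLURM jobs needed in total
  c(
    jobs = jobs,
    concurrent = min(workers, jobs),     # jobs running at the same time
    waves = ceiling(jobs / workers)      # rounds of job submission
  )
}

packing(n_shards = 20, workers = 4, tasks_max = 1)  # one-to-one
packing(n_shards = 20, workers = 4, tasks_max = 5)  # many-to-one
```

Under these assumptions, one-to-one packing submits more, shorter jobs (more scheduler queueing), while many-to-one packing submits fewer, longer jobs that must fit inside the walltime limit.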