library(ssdsims)

The “Running a Sharded Pipeline” vignette runs a scenario’s shards through a targets pipeline on a single machine. This vignette takes that same pipeline to a SLURM cluster, and the only thing that changes is the crew controller (TARGETS-DESIGN.md §4). The per-task sample/fit/hc results are byte-identical to the local run; the cluster is just a faster backend.
You do not need to know crew or targets to follow this guide — only your cluster’s own (non-R) usage instructions. It is four steps: map your site’s SLURM instructions onto the controller, confirm the mapping with a preflight, run a minimal first job, then swap in your own scenario. This guide targets SLURM; crew.cluster also supports SGE, PBS/TORQUE, and LSF — see Targeting a non-SLURM cluster below.
cluster template

The package ships an editable cluster template (a sibling of small and large). Three ingredients assemble into the run, and only B is cluster-specific:
| Ingredient | What | Where |
|---|---|---|
| A: shape | the per-shard target graph | ssd_scenario_targets() in _targets.R (reused verbatim) |
| B: backend | the SLURM crew controller (queue, modules, scratch, workers, walltime) | the one editable block in controller.R |
| C: content | the scenario (seed, datasets, grids) | the inline scenario block in _targets.R |
It is just four R files plus a README.md, separating only what cannot be inlined:
list.files(system.file("targets-templates", "cluster", package = "ssdsims"))
#> [1] "_targets.R" "controller.R" "preflight.R" "README.md" "run.R"

controller.R (the one editable controller block, sourced by both the pipeline and the preflight), _targets.R (the clean pipeline: controller + the inline scenario + factory), preflight.R (the standalone connectivity/prerequisite check, kept out of the pipeline), and run.R (the driver: preflight then tar_make()). Copy them to your project root:
dir <- system.file("targets-templates", "cluster", package = "ssdsims")
file.copy(list.files(dir, full.names = TRUE, all.files = TRUE, no.. = TRUE), ".")

Your real starting point is your site’s documentation: how to log in, submit an sbatch job, which partition/queue and account/allocation to use, the module system that provides R, the scratch filesystem, and the walltime/core limits. None of that mentions R, crew, or targets. This table bridges it to the one editable controller block in controller.R (crew.cluster::crew_controller_slurm() / crew.cluster::crew_options_slurm()):
| Your site says… | …goes here |
|---|---|
| log in to the login/submit node | run run.R from there (where sbatch is on PATH) |
| submit with sbatch | nothing to set: crew.cluster calls sbatch for you |
| partition / queue name | crew_options_slurm(partition = "...") |
| account / allocation code | a "#SBATCH --account=..." line in script_lines |
| module load R/4.x (and any deps) | a "module load R/4.x" line in script_lines |
| scratch filesystem path | an "export TMPDIR=/scratch/$USER/..." line in script_lines |
| walltime limit (per job) | crew_options_slurm(time_minutes = ...) |
| cores per task | crew_options_slurm(cpus_per_task = ...) |
| memory per cpu | crew_options_slurm(memory_gigabytes_per_cpu = ...) |
| how many jobs you may run at once | crew_controller_slurm(workers = ...) |
So a site whose instructions read “sbatch to the short queue, module load R/4.3, account abc123, scratch under /scratch/$USER, 60-minute walltime” becomes:
controller <- crew.cluster::crew_controller_slurm(
  workers = 4L, # max concurrent SLURM jobs
  seconds_idle = 60, # let an idle worker release its allocation
  options_cluster = crew.cluster::crew_options_slurm(
    partition = "short",
    time_minutes = 60,
    cpus_per_task = 1,
    memory_gigabytes_per_cpu = 4,
    script_lines = c(
      "#SBATCH --account=abc123",
      "module load R/4.3", # exposes R + the ssdsims binary library (below)
      "export TMPDIR=/scratch/$USER/ssdsims" # writable scratch for tempdir()
    )
  )
)

Anything not covered by a named argument (a constraint, a QOS, a --gres) goes in script_lines as a literal #SBATCH line; script_lines is injected verbatim into the generated sbatch script. Also set expected_r_version (in controller.R) to the major.minor version your module load provides, so the preflight (step 2) can confirm it.
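As a sketch of passing extra directives through script_lines: the QOS, GPU request, and constraint values below are hypothetical placeholders, not settings from the template, so substitute whatever your site documents. The options object is only built if crew.cluster is installed.

```r
# Hypothetical site-specific directives; replace the values with your site's.
extra_lines <- c(
  "#SBATCH --qos=normal",         # assumed QOS name
  "#SBATCH --gres=gpu:1",         # assumed generic resource request
  "#SBATCH --constraint=skylake"  # assumed node feature
)

# Keep the #SBATCH directives ahead of the shell commands.
script_lines <- c(extra_lines, "module load R/4.3")

if (requireNamespace("crew.cluster", quietly = TRUE)) {
  options_cluster <- crew.cluster::crew_options_slurm(
    partition = "short",
    script_lines = script_lines
  )
}
```

Listing the #SBATCH lines first matters because sbatch stops reading directives at the first executable command in a script.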
A SLURM worker is a fresh R process on a compute node. It must be able to library(ssdsims) without compiling the dependency tree from source (compute nodes are often offline or slow to build). The crew labs pinned this down with a ManyLinux binary install path: install ssdsims and its dependencies as pre-built Linux binaries (e.g. from the Posit Public Package Manager binary repository for your distro) into the library your module load R/... exposes, so the worker resolves everything at load time. Put the module load that exposes that library in script_lines. The preflight in step 2 is the login-node prerequisite checker: it fails loudly if a worker cannot load ssdsims, so you fix the install/module path before launching the study.
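To see which library a package would resolve from, a minimal base-R sketch like the following can be run in an R session started after the same module load. The helper name where_is is hypothetical and not part of ssdsims.

```r
# Hypothetical helper: report the directory a package resolves from,
# or NA if it is not installed in any library on .libPaths().
where_is <- function(pkg) {
  tryCatch(find.package(pkg), error = function(e) NA_character_)
}

print(.libPaths())           # libraries visible to this R session
print(where_is("ssdsims"))   # NA means a worker could not load it either
```

If the path printed for ssdsims is not under the library your module exposes, the worker may be picking up a different (or no) install.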
preflight.R is a standalone check, kept out of the main pipeline. Run it on its own:
source("preflight.R") # or, from a shell: Rscript preflight.R

It submits one SLURM job through your controller and, inside that job, checks the things that actually break a real run:
- the worker’s R version matches expected_r_version, else fix the module load line;
- library(ssdsims) loads, else fix the worker install/module path (the ManyLinux binary path above);
- tempdir() is writable, else fix the scratch export TMPDIR=... line.

A green preflight prints a small witness (the worker’s R version + node id) and means your controller block matches the site. A failed preflight aborts before the pipeline runs and names exactly which check failed, with the fix:

- module load / install path;
- TMPDIR / scratch path.

So you debug the cluster wiring, never an obscure scenario error. run.R runs this preflight for you before tar_make(), so a broken cluster never reaches the expensive scenario shards.
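The three worker-side checks described above can be sketched in base R. This is an illustration only, not the contents of preflight.R; the function name check_worker and its return shape are assumptions.

```r
# Hypothetical sketch of the worker-side checks; preflight.R may differ.
check_worker <- function(expected_r_version, pkg = "ssdsims") {
  # R.version$minor looks like "3.1" for R 4.3.1; keep only the first part.
  minor_first <- strsplit(R.version$minor, ".", fixed = TRUE)[[1]][1]
  actual <- paste(R.version$major, minor_first, sep = ".")

  # Try to create a file under tempdir() to prove it is writable.
  probe <- tempfile("preflight-")
  writable <- isTRUE(tryCatch(
    file.create(probe),
    warning = function(w) FALSE,
    error = function(e) FALSE
  ))
  unlink(probe)

  list(
    r_version = identical(actual, expected_r_version),
    package = requireNamespace(pkg, quietly = TRUE),
    tempdir = writable,
    witness = paste0("R ", actual, " on ", Sys.info()[["nodename"]])
  )
}

str(check_worker("4.3"))
```

Each element maps to one fix: r_version to the module load line, package to the install/module path, and tempdir to the TMPDIR export.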
For your first job, make the scenario tiny and cheap so you are testing the wiring, not waiting on a big study. Temporarily shrink the scenario block near the top of _targets.R to a single cheap cell:
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 2L,
  seed = 42L,
  nrow = 5L
)

Then run the pipeline from the login node:
source("run.R") # or, from a shell: Rscript run.R

run.R runs the preflight first, then dispatches the scenario’s shards across SLURM jobs (independent shards run concurrently), writes one Parquet per shard under results/layout=<hash>/<step>/..., unions them into results/layout=<hash>/summary.parquet, and prints a peek at the estimates. That is your first running cluster job, end to end.
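To look at the unioned summary yourself afterwards, a sketch like this locates it without knowing the layout hash. The helper find_summaries is hypothetical, and reading the file assumes the arrow package is installed.

```r
# Hypothetical helper: find every summary.parquet under the results root.
find_summaries <- function(root = "results") {
  list.files(
    root,
    pattern = "^summary\\.parquet$",
    recursive = TRUE,
    full.names = TRUE
  )
}

paths <- find_summaries()
if (length(paths) > 0 && requireNamespace("arrow", quietly = TRUE)) {
  # Assumes arrow is available; prints the first rows of the first summary.
  print(head(arrow::read_parquet(paths[[1]])))
}
```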
Note
No cluster handy?
run.R guards: if crew.cluster is not installed or sbatch is not on PATH, it aborts with a clear message naming the missing prerequisite. To run the same study off a cluster (no scheduler), use the large template: it builds the identical pipeline (same factory + scenario) under a crew::crew_controller_local() controller, and its run-serial.R asserts the results are byte-identical.
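The two guards can be reproduced by hand before you copy the template. This is a sketch of the idea only; the helper name missing_prereqs is hypothetical and run.R's actual implementation may differ.

```r
# Hypothetical helper: names of the prerequisites that are absent here.
missing_prereqs <- function() {
  checks <- c(
    "crew.cluster" = requireNamespace("crew.cluster", quietly = TRUE),
    # Sys.which() returns "" when the command is not on PATH.
    "sbatch" = nzchar(Sys.which("sbatch")[[1]])
  )
  names(checks)[!checks]
}

absent <- missing_prereqs()
if (length(absent) > 0) {
  message("Missing prerequisite(s): ", paste(absent, collapse = ", "))
}
```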
Once the minimal job succeeds, expand the scenario block in _targets.R back to the full sweep the template ships (or to your own study; see large/scenario.R). Leave controller.R and the ssd_scenario_targets() call unchanged — the scenario and the factory call are scheduler-independent. Run run.R again.
This template and guide target SLURM, but crew.cluster also ships controllers for SGE (Sun Grid Engine / Son of Grid Engine), PBS/TORQUE, and LSF, so the same pipeline can in principle run on those schedulers too. ssdsims has only been exercised on SLURM — treat the others as supported by crew.cluster but untested here, and lean on the preflight (step 2) to catch wiring problems early.
Only ingredient B changes: in controller.R, swap the controller constructor and its options function. The factory (A), the scenario (C), and the whole preflight stay exactly the same.
| Scheduler | Controller | Options |
|---|---|---|
| SLURM | crew_controller_slurm() | crew_options_slurm() |
| SGE | crew_controller_sge() | crew_options_sge() |
| PBS / TORQUE | crew_controller_pbs() | crew_options_pbs() |
| LSF | crew_controller_lsf() | crew_options_lsf() |
The cluster-independent parts carry over unchanged:
- script_lines exists on every options function and works the same way: put your module load R/..., your account/project directive, and your scratch export TMPDIR=... there (these are shell / #SCHEDULER directives, not R). So the worker-prerequisite mapping (the ManyLinux binary path) and the step-2 preflight are identical on every scheduler.
- workers and seconds_idle are crew settings on the controller, not scheduler-specific.

The named resource arguments differ by scheduler; anything without a named argument goes in script_lines as a literal directive (#$ for SGE, #PBS for PBS/TORQUE, #BSUB for LSF):
| Resource | SLURM | SGE | PBS / TORQUE | LSF |
|---|---|---|---|---|
| cores per worker | cpus_per_task | cores | cores | cores |
| memory | memory_gigabytes_per_cpu | memory_gigabytes_limit | memory_gigabytes_required | memory_gigabytes_limit |
| walltime | time_minutes | script_lines (#$ -l h_rt=...) | walltime_hours | script_lines (#BSUB -W ...) |
| queue / partition | partition | script_lines (#$ -q ...) | script_lines (#PBS -q ...) | script_lines (#BSUB -q ...) |
For example, the SLURM controller from step 1 becomes, on SGE:
controller <- crew.cluster::crew_controller_sge(
  workers = 4L,
  seconds_idle = 60,
  options_cluster = crew.cluster::crew_options_sge(
    cores = 1,
    memory_gigabytes_limit = 4,
    script_lines = c(
      "#$ -q short.q", # queue (SGE has no `partition` argument)
      "#$ -l h_rt=01:00:00", # walltime
      "module load R/4.3",
      "export TMPDIR=/scratch/$USER/ssdsims"
    )
  )
)

See ?crew.cluster::crew_controller_sge (and the _pbs / _lsf equivalents) for the exact arguments. Everything else in this guide (the preflight, the minimal first job, swapping in your scenario) is unchanged.
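By the same mapping, a PBS/TORQUE version might look like the sketch below. It follows the argument names in the resource table, but the queue name and resource values are hypothetical and, like the other non-SLURM schedulers, it has not been exercised with ssdsims. The controller is only built if crew.cluster is installed.

```r
# Hypothetical values; argument names follow the resource table above.
pbs_script_lines <- c(
  "#PBS -q short", # queue (no named argument on PBS/TORQUE)
  "module load R/4.3",
  "export TMPDIR=/scratch/$USER/ssdsims"
)

if (requireNamespace("crew.cluster", quietly = TRUE)) {
  controller <- crew.cluster::crew_controller_pbs(
    workers = 4L,
    seconds_idle = 60,
    options_cluster = crew.cluster::crew_options_pbs(
      cores = 1,
      memory_gigabytes_required = 4,
      walltime_hours = 1, # the 60-minute limit, expressed in hours
      script_lines = pbs_script_lines
    )
  )
}
```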
Shards are the unit of parallelism: independent shard targets run concurrently across your SLURM jobs. How many shard targets ride in one job is a crew configuration knob (workers caps concurrent jobs; tasks_max on the controller packs several shards into one worker’s lifetime — “many-to-one”; tasks_max = 1L gives one shard per job — “one-to-one”). It is documented in controller.R, not hard-coded as 1:1.
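As a rough illustration of the trade-off, the helper below gives a lower bound on how many SLURM jobs a run launches for a given tasks_max. The helper min_jobs and the shard count are hypothetical; the real count can be higher because idle workers also exit after seconds_idle.

```r
# Each worker runs at most tasks_max tasks before exiting, so the number
# of jobs launched is at least ceiling(n_shards / tasks_max).
min_jobs <- function(n_shards, tasks_max) {
  ceiling(n_shards / tasks_max)
}

min_jobs(n_shards = 12, tasks_max = 1) # one-to-one: at least 12 jobs
min_jobs(n_shards = 12, tasks_max = 4) # many-to-one: at least 3 jobs

if (requireNamespace("crew.cluster", quietly = TRUE)) {
  # One shard per job; workers still caps how many jobs run at once.
  one_to_one <- crew.cluster::crew_controller_slurm(
    workers = 4L,
    tasks_max = 1L
  )
}
```

One-to-one keeps each job short, while many-to-one pays the queue wait and R startup cost less often.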
- README.md (the same guide, alongside the files) and TARGETS-DESIGN.md §4 (from local to a cluster), §11 (open questions), §12 (cluster-pipeline).
- ?ssd_scenario_targets, ?crew.cluster::crew_controller_slurm.
- … small/large/cluster template trio.
- … upload = ssd_upload_azure(...) line in this template’s _targets.R) and read them back in place.