Uploading Shards to Cloud Storage

library(ssdsims)

The vignette("sharded-pipeline") article materialises each step as Hive-partitioned Parquet shards, and vignette("cluster-pipeline") runs that same pipeline on a cluster. This vignette covers the last link: shipping each shard to an object store as it is produced, so the results are readable from outside the cluster — analysis notebooks, dashboards, downstream R/Python.

The model is simple: upload is a runner argument, the remote-destination sibling of root. You pass an upload object to the ssd_scenario_targets() factory and it pairs each step shard with an upload_<step> target. There are three modes:

upload = NULL (the default) — no upload targets; the clean DAG.
upload = ssd_upload_dryrun() — upload targets that no-op (reach no network), so the DAG shape can be exercised offline and in CI.
upload = ssd_upload_azure(url, container) — upload targets that ship to Azure Blob Storage.

The per-task sample/fit/hc results are byte-identical across all three — only the presence and behaviour of the upload_<step> targets differ. This vignette’s live chunks use ssd_upload_dryrun(), so the build needs no network and no credentials; the Azure path is shown as described, non-evaluated chunks.

The destination objects

ssd_upload_azure() and ssd_upload_dryrun() return plain, classed S3 objects that carry only the destination — never credentials, connections, or environments — so they travel unchanged to crew workers and through targets:

dryrun <- ssd_upload_dryrun()
azure <- ssd_upload_azure(
  url = "https://acct.blob.core.windows.net",
  container = "ssdsims-results"
)
class(azure)
#> [1] "ssdsims_upload_azure_blob" "ssdsims_upload"
unclass(azure)
#> $url
#> [1] "https://acct.blob.core.windows.net"
#> 
#> $container
#> [1] "ssdsims-results"
#> 
#> $prefix
#> NULL
#> 
#> $domain
#> [1] "blob.core.windows.net"
#> 
#> $account
#> [1] "acct"

Credentials stay external: the Azure methods resolve the secret from the environment at call time — one of SSDSIMS_AZURE_STORAGE_KEY, SSDSIMS_AZURE_STORAGE_SAS, or the service-principal trio (SSDSIMS_AZURE_TENANT_ID/SSDSIMS_AZURE_CLIENT_ID/SSDSIMS_AZURE_CLIENT_SECRET) — and the object itself holds no secrets. The storage account name is derived from url (the acct in https://acct.blob.core.windows.net), so there is no account environment variable; for a sovereign cloud, set domain (e.g. domain = "blob.core.usgovcloudapi.net").

To write under a subdirectory of the container (so one container can hold several independent result sets), pass prefix:

ssd_upload_azure(
  url = "https://acct.blob.core.windows.net",
  container = "ssdsims-results",
  prefix = "study-2026/run-3"
)$prefix
#> [1] "study-2026/run-3"

The shards then land at <container>/<prefix>/<step>/<partition>/part.parquet, and ssd_open_uploaded() reads them back from the same prefixed location.

Run it locally with a dry run

The dry-run probe is trivially OK — no credentials, no network:

ssd_test_upload(ssd_upload_dryrun())

Build a scenario and hand it to the factory with upload = ssd_upload_dryrun(). You run ssd_test_upload() yourself as a preflight (above); the factory does no network I/O of its own — it just pairs each step shard with a no-op upload_<step> target:

data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(
  data,
  nsim = 2L,
  seed = 42L,
  nrow = 6L,
  dists = ssd_distset(lnorm = "lnorm"))
root <- tempfile("results-")
targets_dry <- ssd_scenario_targets(scenario, root = root, upload = dryrun)

The upload targets are present in the DAG, one paired with each shard:

target_names <- function(x) {
  if (inherits(x, "tar_target")) {
    return(x$settings$name)
  }
  if (is.list(x)) {
    return(unlist(lapply(x, target_names), use.names = FALSE))
  }
  character(0)
}
names_dry <- target_names(targets_dry)
grep("^upload_", names_dry, value = TRUE)
#> [1] "upload_sample_42_ccme_boron_1_TRUE" "upload_sample_42_ccme_boron_2_TRUE"
#> [3] "upload_fit_42_ccme_boron_1_6_FALSE" "upload_fit_42_ccme_boron_2_6_FALSE"
#> [5] "upload_hc_42_ccme_boron_1"          "upload_hc_42_ccme_boron_2"         
#> [7] "upload_summary"

The summary ships too: alongside the per-shard pairs, the single upload_summary target (visible above) takes the summary fan-in’s output and ships the combined summary.parquet — and, when the scenario sets samples = TRUE, the full summary-samples.parquet alongside it. Like the shard pairs it is content-hashed, so the summary re-ships only when its bytes change.

Contrast upload = NULL — the default — which emits no upload nodes at all:

targets_null <- ssd_scenario_targets(scenario, root = root)
grep("^upload_", target_names(targets_null), value = TRUE)
#> character(0)

A no-op upload is exactly a no-op: it reaches no network and returns the shard’s local path unchanged. Materialise one shard locally, then “ship” it with the dry-run destination:

run <- ssd_run_scenario_shards(scenario)
#> duckdb is keeping downloaded extensions in a temporary directory:
#> ℹ /tmp/Rtmpq2C2cT/duckdb/extensions
#> This is removed when the R session ends, so extensions are re-downloaded each session.
#> ℹ To keep them, point `options(duckdb.extension_directory =)` or the `DUCKDB_EXTENSION_DIRECTORY` environment variable at a permanent path.
shard <- list.files(
  file.path(run$dir, "hc"),
  pattern = "part.parquet",
  recursive = TRUE,
  full.names = TRUE
)[1]
identical(ssd_upload_shard(shard, dryrun), shard)
#> Dry-run upload: skipped
#> "/tmp/Rtmpq2C2cT/ssdsims-shards-17882cfda288/hc/dataset=ccme_boron/sim=1/part.parquet".
#> [1] TRUE

That is the whole upload DAG — the probe, the paired upload_<step> targets, and the per-shard ship — exercised end to end with no network and no credentials.

Extend the same call to Azure on a cluster

To ship to a real Azure container, swap ssd_upload_dryrun() for ssd_upload_azure(url, container) in the cluster template’s _targets.R (see vignette("cluster-pipeline")). It is the same factory call — one line changes:

ssd_scenario_targets(
  scenario,
  upload = ssd_upload_azure(
    url = "https://<account>.blob.core.windows.net",
    container = "ssdsims-results"
  )
)

Run the credentials/connectivity probe as an interactive preflight before tar_make() — it lists the container and writes then deletes a marker blob, aborting loudly (naming the missing SSDSIMS_AZURE_* variable) if your wiring is wrong:

Sys.setenv(SSDSIMS_AZURE_STORAGE_KEY = "<key>") # account comes from the url
ssd_test_upload(ssd_upload_azure("https://<account>.blob.core.windows.net", "ssdsims-results"))
# silent on success; aborts naming the missing variable otherwise

Verify the upload, in place

Right after an upload, read the results back in place to confirm they landed — no download. ssd_open_uploaded(upload, step) returns a lazy duckplyr/DuckDB table over the <container>/<step>/**/part.parquet Hive glob, read straight against blob storage via DuckDB’s azure extension (it remaps your SSDSIMS_AZURE_* credentials into a DuckDB azure secret for you). Because it is lazy and dplyr-composable, a one-line count() is the immediate smoke test:

upload <- ssd_upload_azure("https://<account>.blob.core.windows.net", "ssdsims-results")
ssd_open_uploaded(upload, step = "hc") |>
  dplyr::count()

A row-for-row compare of ssd_open_uploaded(upload, step) against the local shard verifies the transfer. (A dry-run destination has nothing to read back: ssd_open_uploaded(ssd_upload_dryrun(), ...) aborts and points you at the local shards.)

For the analysis-ready table, ssd_summarise_uploaded() is the cloud counterpart of ssd_summarise(): it unions a step’s uploaded shards in place and collects them, dropping the heavy dists/samples list-columns by default. Pass drop_samples = FALSE when you need the in-flight bootstrap samples:

ssd_summarise_uploaded(upload, step = "hc") # analysis-ready summary tibble
ssd_summarise_uploaded(upload, step = "hc", drop_samples = FALSE) # keep samples

The uploaded summaries are addressable the same way: step = "summary" is the combined compact summary the upload_summary target shipped, and step = "summary_samples" the full summary retaining the dists/samples list-columns (uploaded only when the scenario set samples = TRUE). Because the compact blob physically lacks those columns, ssd_summarise_uploaded(upload, "summary", drop_samples = FALSE) aborts and points you at step = "summary_samples" instead of silently returning a sample-less table:

ssd_open_uploaded(upload, step = "summary") |> dplyr::count()
ssd_summarise_uploaded(upload, step = "summary") # the uploaded compact summary
ssd_summarise_uploaded(upload, step = "summary_samples", drop_samples = FALSE)

What to pay attention to

Important

Credentials must reach the workers. ssd_test_upload() is easy to run interactively on the login node, but the shards upload on the compute nodes. Set SSDSIMS_AZURE_* so the workers see them — via the controller’s script_lines/module loads, or your scheduler’s environment propagation — not just on the login node.

The Azure client and the DuckDB azure extension must be on the workers. Install AzureStor/AzureRMR (and, for the read-back, DuckDB’s azure extension) on the workers, alongside ssdsims — the same ManyLinux binary path the cluster vignette covers.

A missing credential fails loud. Azure with absent credentials aborts — it is never a silent no-op. Intent to skip the network is expressed only by ssd_upload_dryrun(). Under the pipeline’s per-shard error = "null", a failed upload isolates to its own branch and the rest keep shipping, so a re-driven run retries only the failed uploads.

Unchanged shards are not re-uploaded. Each upload_<step> target takes the shard’s path as a format = "file" input, so targets re-uploads a shard only when its content hash changes — a re-driven tar_make() that rebuilt nothing uploads nothing; a partial extension uploads only the new shards. The same holds for upload_summary: the summary re-ships only when its bytes change (e.g. after a grown run), and summary-samples.parquet exists at the destination only when the scenario set samples = TRUE.

The read-back is in place. ssd_open_uploaded() predicate-pushes straight against blob storage — it does not download the Parquet.