| Title: | Simulation Analyses for Species Sensitivity Distributions |
|---|---|
| Description: | Runs reproducible simulation studies for species sensitivity distribution (SSD) models built on the 'ssdtools' package. Expands a declarative scenario into per-step task tables, draws data, fits distributions, and estimates hazard concentrations, with a 'targets'-based, Hive-partitioned shard pipeline for running studies in parallel or on a cluster. |
| Authors: | Joe Thorley [aut, cre] (ORCID: <https://orcid.org/0000-0002-7683-4592>), Rebecca Fisher [aut] |
| Maintainer: | Joe Thorley <[email protected]> |
| License: | Apache License (== 2.0) \| file LICENSE |
| Version: | 0.0.0.9012 |
| Built: | 2026-06-05 02:51:05 UTC |
| Source: | https://github.com/poissonconsulting/ssdsims |
Activates the dqrng pcg64 RNG backend for the duration of the calling
frame, then resets it when .local_envir exits. While active, base R's
runif(), rnorm(), rbinom(), rexp(), rgamma(), rpois(),
sample.int(), and sample() (and therefore dplyr::slice_sample() and
ssdtools::ssd_r*()) draw from dqrng's pcg64, seeded via
dqrng::dqset.seed(). pcg64 is forced explicitly because it accepts the
length-2 stream argument the per-task primer design relies on; dqrng's own
default (Xoroshiro128++) does not.
local_dqrng_backend(.local_envir = parent.frame())
.local_envir |
|
Registering the backend is a process-global side effect that also advances
base R's .Random.seed. local_dqrng_backend() follows the withr
convention (compare withr::local_seed()): it pairs activation with
deferred reset so the backend is always restored, including on error.
The helper is reentrant. dqrng::register_methods() /
dqrng::restore_methods() keep a single global save-slot, so a nested
reset would tear the backend down for the still-open outer scope. To avoid
this, a local_dqrng_backend() call made while the backend is already
active is a no-op: it does not re-activate the backend and schedules no
further reset. Only the outermost call activates the backend on entry and
resets it on exit, so the RNG stream is identical whether or not a nested
call occurs.
Invisibly returns TRUE if this call activated the backend (the
outermost scope) or FALSE if the backend was already active and the call
was a no-op.
withr::local_seed(), dqrng::dqset.seed().
local_dqrng_backend()
dqrng::dqset.seed(42, stream = c(1L, 2L))
runif(3)
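The reentrancy rule described for local_dqrng_backend() can be illustrated with a minimal base R sketch. It is not the package's implementation: `with_backend_sketch()`, the `state` environment, and its `log` are hypothetical stand-ins for the real backend registration, and the sketch uses a `with_`-style wrapper with `on.exit()` rather than a deferred reset on a caller's frame.

```r
# Hypothetical stand-ins: a flag for "backend active" and a log of events.
state <- new.env()
state$active <- FALSE
state$log <- character()

with_backend_sketch <- function(code) {
  if (state$active) {
    # Nested call: a no-op that schedules no reset of its own.
    return(list(activated = FALSE, value = code))
  }
  state$active <- TRUE
  state$log <- c(state$log, "activate")
  # Only the outermost call resets, and it does so even on error.
  on.exit({
    state$active <- FALSE
    state$log <- c(state$log, "reset")
  })
  list(activated = TRUE, value = code)
}

# The inner call is evaluated lazily, while the outer scope is still open.
outer <- with_backend_sketch(with_backend_sketch("inner work"))
```

After this runs, `state$log` holds a single activate/reset pair, which is the property that keeps a nested call from tearing down the outer scope.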
local_dqrng_state() installs a per-task (seed, primer) starting point as
the running dqrng RNG state via dqrng::dqset.seed(), restoring the previous
state when .local_envir exits. with_dqrng_state() evaluates code with
that state installed, then restores the previous state. The primer argument
is the per-task primer (the value handed to dqrng's stream argument, per
TARGETS-DESIGN.md §2 and the GLOSSARY); the _state suffix marks that the
wrapper installs that primer as the running RNG state.
local_dqrng_state(seed, primer, .local_envir = parent.frame())
with_dqrng_state(seed, primer, code)
seed |
|
primer |
|
.local_envir |
|
code |
|
These are the dqrng-path analogues of local_lecuyer_cmrg_state() /
with_lecuyer_cmrg_state(). Like those helpers they snapshot the RNG state
on entry (via dqrng::dqrng_get_state()) and withr::defer() a restore (via
dqrng::dqrng_set_state()), so a call leaves the surrounding RNG stream
undisturbed, including on error.
Both require an active dqrng backend: they abort unless a
local_dqrng_backend() scope is open. This fails fast rather than silently
seeding base R's Mersenne-Twister.
local_dqrng_state() invisibly returns primer; with_dqrng_state()
returns the value of code.
withr::local_seed(), local_dqrng_backend(),
local_lecuyer_cmrg_state().
local_dqrng_backend()
local_dqrng_state(42, c(1L, 2L))
runif(3)
with_dqrng_state(42, c(1L, 2L), runif(3))
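The snapshot-then-restore pattern these helpers follow can be sketched in base R against the default generator. This is an analogy only: `with_seed_sketch()` is a hypothetical helper, and it saves and restores `.Random.seed` rather than calling the dqrng state functions named above.

```r
# Hypothetical helper (not part of ssdsims): evaluate `code` under a
# temporary seed, then put the caller's RNG state back, even on error.
with_seed_sketch <- function(seed, code) {
  genv <- globalenv()
  had_state <- exists(".Random.seed", envir = genv, inherits = FALSE)
  old_state <- if (had_state) get(".Random.seed", envir = genv) else NULL
  on.exit({
    if (had_state) {
      assign(".Random.seed", old_state, envir = genv)
    } else if (exists(".Random.seed", envir = genv, inherits = FALSE)) {
      rm(".Random.seed", envir = genv)
    }
  })
  set.seed(seed)
  code # forced here, after the temporary seed is installed
}

# Reference: the first draw after set.seed(1) with nothing in between.
set.seed(1)
expected <- runif(1)

# Same starting point, but with a seeded block evaluated in between.
set.seed(1)
inner <- with_seed_sketch(42, runif(3))
observed <- runif(1)
```

Here `observed` equals `expected`, showing that the seeded block left the surrounding stream undisturbed.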
local_lecuyer_cmrg_seed() seeds the L'Ecuyer-CMRG RNG with a scalar integer
via base::set.seed(), restoring the previous state when .local_envir
exits. with_lecuyer_cmrg_seed() evaluates code with that seed in effect,
then restores the previous state. For a .Random.seed-style state vector
(e.g. from get_lecuyer_cmrg_stream_state() or parallel::nextRNGStream())
use local_lecuyer_cmrg_state() / with_lecuyer_cmrg_state().
local_lecuyer_cmrg_seed(seed, .local_envir = parent.frame())
with_lecuyer_cmrg_seed(seed, code)
seed |
|
.local_envir |
|
code |
|
with_lecuyer_cmrg_seed() returns the value of code.
withr::local_seed(), local_lecuyer_cmrg_state(),
parallel::nextRNGStream().
local_lecuyer_cmrg_seed(42)
runif(3)
with_lecuyer_cmrg_seed(42, { runif(3) })
local_lecuyer_cmrg_state() sets the L'Ecuyer-CMRG RNG state to a
.Random.seed-style integer vector (length 7) by assigning to
.Random.seed directly, restoring the previous state when .local_envir
exits. with_lecuyer_cmrg_state() evaluates code with that state in
effect, then restores the previous state. A state is the full internal
RNG state (as returned by parallel::nextRNGStream() or
get_lecuyer_cmrg_stream_state()); contrast with base::set.seed()
which takes a scalar seed (see local_lecuyer_cmrg_seed() /
with_lecuyer_cmrg_seed()).
local_lecuyer_cmrg_state(state, .local_envir = parent.frame())
with_lecuyer_cmrg_state(state, code)
state |
|
.local_envir |
|
code |
|
local_lecuyer_cmrg_state() invisibly returns state;
with_lecuyer_cmrg_state() returns the value of code.
parallel::nextRNGStream(), local_lecuyer_cmrg_seed().
state <- with_lecuyer_cmrg_seed(42, parallel::nextRNGStream(.Random.seed))
local_lecuyer_cmrg_state(state)
runif(3)
with_lecuyer_cmrg_state(state, runif(3))
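The difference between a scalar seed and a full state vector can be seen with base R and the parallel package alone. The sketch below does not use the ssdsims helpers; it switches the generator by hand and restores the previous kinds at the end.

```r
# Switch to L'Ecuyer-CMRG, remembering the previous kinds for later.
old_kind <- RNGkind("L'Ecuyer-CMRG")
set.seed(42) # scalar seed

# The full state is a length-7 integer vector; advance it to the next stream.
root <- get(".Random.seed", envir = globalenv())
stream2 <- parallel::nextRNGStream(root)

# Installing the same state twice reproduces the same draws.
assign(".Random.seed", stream2, envir = globalenv())
a <- runif(3)
assign(".Random.seed", stream2, envir = globalenv())
b <- runif(3)

# Restore the generator kinds that were active before the sketch.
RNGkind(old_kind[1], old_kind[2], old_kind[3])
```

Unlike this sketch, the helpers documented above also restore the previous RNG state when their scope exits.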
Returns the validated, materialised dataset tibble stored on scenario under
name. The dataset was validated (a numeric Conc column) and materialised
at construction by ssd_define_scenario(), so this accessor performs no
registration, persistence, or re-validation - it just isolates the value a
shard body fits. Aborts with an informative error when name is not one of
the scenario's datasets.
scenario_dataset(scenario, name)
scenario |
An |
name |
A scalar string naming one of the scenario's datasets. |
Names - not values - drive task hashing (TARGETS-DESIGN.md section 1.1):
the task path carries the dataset name and this accessor resolves it back to
the tibble at run time, so the tibble never enters a task identity.
The materialised dataset tibble stored under name.
scenario_min_pmix() for the min_pmix counterpart.
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 1L, seed = 42L)
scenario_dataset(scenario, "ccme_boron")
min_pmix Function from a Scenario by Name
Returns the single-argument min_pmix function materialised on scenario
under name. ssd_define_scenario() resolves each min_pmix reference to a
function once, at construction (so a cluster worker needs no shared
interactive environment), and stores it keyed by name; this accessor isolates
it. Aborts with an informative error when name is not one of the scenario's
min_pmix names.
scenario_min_pmix(scenario, name)
scenario |
An |
name |
A scalar string naming one of the scenario's |
Names - not function values - drive task hashing (TARGETS-DESIGN.md section
1.1): the fit-task path carries the min_pmix name, and this accessor
resolves it back to the function at run time, so the function value never
enters a task identity (no byte-stability concern from byte-compilation or
captured environments).
The single-argument min_pmix function stored under name.
scenario_dataset() for the dataset counterpart.
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 1L, seed = 42L)
scenario_min_pmix(scenario, "ssd_min_pmix")
Returns <root>/layout=<hash>, where the hash is derived from the scenario's
partition_by. A step's Hive shard path depth and axes are a function of
partition_by/bundle, so writing two different layouts into one root would
leave shards of different granularity side by side - and the depth-agnostic
glob the readers use (<step>/**/part.parquet) would then union stale and
current shards, double-counting tasks. Keying the results root on the layout
isolates each partition_by into its own subtree: re-running a scenario with
a changed partition_by/bundle writes to a fresh root (never mixing
granularities), while re-running the same layout reuses the root
(idempotent and cache-friendly - the same shard paths are simply rewritten).
scenario_results_dir(scenario, root = "results")
scenario |
An |
root |
The results root directory (default |
The targets pipeline writes under this root (see the shipped _targets.R
template). The single-core ssd_run_scenario_shards() takes the complementary
approach: it owns and clears a fixed dir on each run.
The layout-keyed path file.path(root, paste0("layout=", <hash>)).
ssd_run_scenario_shards(), ssd_summarize().
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 1L, seed = 42L)
scenario_results_dir(scenario)
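The idea of keying the results root on the layout can be sketched in base R. Everything here is an assumption for illustration: `layout_dir_sketch()` is hypothetical, the `partition_by` values are made up, and the MD5-of-deparsed-text hash is a stand-in, since this page does not say which hash the package uses.

```r
# Hypothetical sketch: derive a short hash from a partition_by-like list
# and use it to build a layout-keyed results directory.
layout_dir_sketch <- function(partition_by, root = "results") {
  key <- paste(deparse(partition_by), collapse = "")
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(key, tmp)
  # Stand-in hash: MD5 of the deparsed layout, truncated to 8 hex digits.
  hash <- substr(unname(tools::md5sum(tmp)), 1L, 8L)
  file.path(root, paste0("layout=", hash))
}

same_a <- layout_dir_sketch(list(fit = "dataset"))
same_b <- layout_dir_sketch(list(fit = "dataset"))
changed <- layout_dir_sketch(list(fit = c("dataset", "nrow")))
```

The same layout maps to the same directory, so a re-run rewrites the same shard paths, while a changed layout lands in a different subtree and never mixes granularities under one glob.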
Collects one or more datasets into a validated, named collection - the
single entry point through which ssd_define_scenario() takes dataset
input. Each dataset must carry a numeric Conc column (the species
sensitivity distribution convention); additional columns are preserved.
ssd_data(...)
... |
One or more data frames, optionally named. Each is validated for
a numeric |
Names are taken from the argument names where supplied, otherwise derived
from the argument expression by symbol capture (e.g. ssddata::ccme_boron
becomes "ccme_boron"). A literal with no derivable name (e.g. a bare
data.frame(...) call) must be given an explicit name.
ssd_data() is intended to grow: the planned scenario-input-types change
(see TARGETS-DESIGN.md section 12) will let each input also be one of the data
generators ssd_run_scenario() accepts today - a fitdists or tmbfit
object, a generator function, or a function-name string - with the data
materialised by the dataset registry. For now each input must be a data
frame.
An ssdsims_data object: a named list of validated tibbles.
ssd_data(ssddata::ccme_boron)
ssd_data(boron = ssddata::ccme_boron, cadmium = ssddata::ccme_cadmium)
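Deriving a name from the argument expression by symbol capture can be sketched with `substitute()` in base R. `derive_name_sketch()` is a hypothetical single-argument helper, not the logic ssd_data() actually uses, and it is demonstrated on `datasets::mtcars` so it runs without ssddata installed.

```r
# Hypothetical sketch: recover a name from the unevaluated argument.
derive_name_sketch <- function(x) {
  expr <- substitute(x)
  if (is.symbol(expr)) {
    # A bare symbol such as `mtcars`.
    return(as.character(expr))
  }
  if (is.call(expr) && identical(expr[[1L]], quote(`::`))) {
    # A namespaced reference such as `datasets::mtcars`: keep the object name.
    return(as.character(expr[[3L]]))
  }
  # A literal such as data.frame(...) has no derivable name.
  stop("cannot derive a name; supply one explicitly", call. = FALSE)
}

from_namespace <- derive_name_sketch(datasets::mtcars)
from_symbol <- derive_name_sketch(mtcars)
from_literal <- try(derive_name_sketch(data.frame(Conc = 1)), silent = TRUE)
```

The last call fails, mirroring the rule that a bare data.frame(...) call must be given an explicit name.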
Constructs a purely declarative ssdsims_scenario object: the root of
the targets-based pipeline (see TARGETS-DESIGN.md section 1). The object stores
only declarative fields - a scalar seed, the replicate count nsim, the
sample sizes nrow, the dataset names, and the fit and hc argument
grids. It performs no random-number generation, no task expansion,
and has no dependency on targets.
ssd_define_scenario(
  data,
  nsim,
  seed,
  ...,
  name = NULL,
  nrow = 6L,
  replace = FALSE,
  dists = ssdtools::ssd_dists_bcanz(),
  rescale = FALSE,
  computable = FALSE,
  at_boundary_ok = TRUE,
  min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)),
  range_shape2 = list(c(0.05, 20)),
  proportion = 0.05,
  ci = FALSE,
  nboot = 1000,
  est_method = "multi",
  ci_method = "weighted_samples",
  parametric = TRUE,
  samples = FALSE,
  partition_by = NULL,
  bundle = NULL,
  upload = NULL
)
data |
An |
nsim |
A count of the number of data sets to generate. |
seed |
A scalar whole number; the scenario's RNG root. Required - changing it fully re-roots the scenario's random-number draws. |
... |
Unused; must be empty. |
name |
An optional dataset name for the single-data-frame form,
overriding the derived name. Must not be combined with a named list or an
|
nrow |
A positive whole number of the minimum number of non-missing rows. |
replace |
A logical vector specifying whether to sample with replacement. |
dists |
A character vector of the distribution names. |
rescale |
A flag specifying whether to leave the values unchanged (FALSE) or to rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values (TRUE) or a string specifying whether to leave the values unchanged ("no") or to rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values ("geomean") or to logistically transform ("odds"). |
computable |
A flag specifying whether to only return fits with numerically computable standard errors. |
at_boundary_ok |
A flag specifying whether a model with one or more parameters at the boundary should be considered to have converged (default = TRUE). |
min_pmix |
The |
range_shape1 |
A list of numeric vectors of length two of the lower and upper bounds for the shape1 parameter. |
range_shape2 |
A list of numeric vectors of length two of the lower and upper bounds for the shape2 parameter. |
proportion |
A numeric vector of proportion values to estimate hazard concentrations for. |
ci |
A flag specifying whether to estimate confidence intervals (by bootstrapping). |
nboot |
A count of the number of bootstrap samples to use to estimate the confidence limits. A value of 10,000 is recommended for official guidelines. |
est_method |
A string specifying whether to estimate directly from
the model-averaged cumulative distribution function ( |
ci_method |
A string specifying which method to use for estimating
the standard error and confidence limits from the bootstrap samples.
The default and recommended value is still |
parametric |
A flag specifying whether to perform parametric bootstrapping as opposed to non-parametrically resampling the original data with replacement. |
samples |
A logical scalar (default |
partition_by |
An optional, possibly-partial named list keyed by step
( |
bundle |
An optional, possibly-partial named list keyed by step, the
per-step complement of |
upload |
An optional upload specification (a list), or |
Input data is forwarded through ssd_data() for validation (a numeric
Conc column is required) and retained on the scenario (as $data) so a
local run (ssd_run_scenario_baseline()) can sample it directly. The dataset
names ($datasets) are what the targets/cluster path hashes; the validated
tibbles ride on the scenario and are isolated by name via scenario_dataset(),
so the hash need not carry the data frames.
An S3 object of class ssdsims_scenario.
The preferred form is an ssd_data() collection, which owns validation and
naming: ssd_define_scenario(ssd_data(boron = ccme_boron, cadmium = ccme_cadmium), ...). For convenience, bare data frame input is also
accepted in four forms (routed through the same Conc validation):
A single data frame, name derived from the argument expression:
ssd_define_scenario(ssddata::ccme_boron, ...) gives "ccme_boron".
A single data frame with an explicit name=:
ssd_define_scenario(ssddata::ccme_boron, name = "boron", ...).
A named list, names taken from the list:
ssd_define_scenario(list(boron = ccme_boron, cadmium = ccme_cadmium), ...).
An unnamed list, names derived per element:
ssd_define_scenario(list(ccme_boron, ccme_cadmium), ...).
Supplying both a named list and name= is an error.
ci = FALSE
When ci = FALSE is the only confidence-interval value, the bootstrap-only
knobs nboot, ci_method, and parametric are meaningless. Passing any of
them in that case is an error; set ci = c(FALSE, TRUE) to enable bootstrap,
or omit the knobs.
ssd_define_scenario(ssddata::ccme_boron, nsim = 100L, nrow = c(5L, 10L), seed = 42L)
Fit SSD Distributions to Simulated Data
ssd_fit_dists_sims(
  x,
  dists = ssdtools::ssd_dists_bcanz(),
  ...,
  rescale = FALSE,
  computable = FALSE,
  at_boundary_ok = TRUE,
  min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)),
  range_shape2 = range_shape1,
  seed = NULL,
  silent = TRUE,
  .progress = FALSE
)
x |
A data frame with sim and stream integer columns and a list column of the data frames to fit distributions to. |
dists |
A character vector of the distribution names. |
... |
Additional arguments passed to |
rescale |
A flag specifying whether to leave the values unchanged (FALSE) or to rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values (TRUE) or a string specifying whether to leave the values unchanged ("no") or to rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values ("geomean") or to logistically transform ("odds"). |
computable |
A flag specifying whether to only return fits with numerically computable standard errors. |
at_boundary_ok |
A flag specifying whether a model with one or more parameters at the boundary should be considered to have converged (default = TRUE). |
min_pmix |
A list of one or more functions with a single argument that inputs the number of rows of data and returns a proportion between 0 and 0.5. |
range_shape1 |
A list of numeric vectors of length two of the lower and upper bounds for the shape1 parameter. |
range_shape2 |
A list of numeric vectors of length two of the lower and upper bounds for the shape2 parameter. |
seed |
An integer of the starting seed or NULL. |
silent |
A flag indicating whether fits should fail silently. |
.progress |
Whether to show a |
The x tibble with a list column fits of fitdists objects.
Estimate hazard concentrations for multiple simulations using bootstrapping
ssd_hc_sims(
  x,
  proportion = 0.05,
  ...,
  ci = FALSE,
  nboot = 1000,
  est_method = "multi",
  ci_method = "weighted_samples",
  parametric = TRUE,
  seed = NULL,
  save_to = NULL,
  .progress = FALSE
)
x |
A data frame with sim and stream integer columns and a list column of fitdists objects. |
proportion |
A numeric vector of proportion values to estimate hazard concentrations for. |
... |
Additional arguments passed to |
ci |
A flag specifying whether to estimate confidence intervals (by bootstrapping). |
nboot |
A count of the number of bootstrap samples to use to estimate the confidence limits. A value of 10,000 is recommended for official guidelines. |
est_method |
A string specifying whether to estimate directly from
the model-averaged cumulative distribution function ( |
ci_method |
A string specifying which method to use for estimating
the standard error and confidence limits from the bootstrap samples.
The default and recommended value is still |
parametric |
A flag specifying whether to perform parametric bootstrapping as opposed to non-parametrically resampling the original data with replacement. |
seed |
An integer of the starting seed or NULL. |
save_to |
NULL or a string specifying a directory in which to save the bootstrap datasets and parameter estimates (when successfully converged). |
.progress |
Whether to show a |
The x tibble with a list column hc of data frames produced by applying ssd_hc() to fits.
Runs the fit tasks bundled into one shard: reads the distinct set of parent
sample shards the shard's tasks reference (each once - they may span several
sample shards), isolates each task's draw by sample_id (restoring row
order), truncates it inline (head(sample, nrow),
RNG-free, section 5), and fits with the per-task (seed, primer) through
fit_data_task_primer() (resolving min_pmix off the scenario via
scenario_min_pmix()). The fitted fitdists object is serialised into a
fit_blob string column keyed by fit_id, and one Parquet is written at the
shard's partition path.
ssd_run_fit_step(tasks, scenario, sample_dir, out_dir)
tasks |
A tibble of the shard's |
scenario |
The |
sample_dir |
The |
out_dir |
The |
The shard's Parquet path.
scenario <- ssd_define_scenario(
  ssddata::ccme_boron, nsim = 1L, nrow = 6L, seed = 42L, dists = "lnorm"
)
dir <- tempfile()
ssd_run_sample_step(
  ssd_scenario_sample_shards(scenario)$tasks[[1L]],
  scenario,
  file.path(dir, "sample")
)
ssd_run_fit_step(
  ssd_scenario_fit_shards(scenario)$tasks[[1L]],
  scenario,
  file.path(dir, "sample"),
  file.path(dir, "fit")
)
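The isolate, re-order, and truncate sequence the fit step applies to a parent sample shard can be sketched in base R. The `shard` data frame and `isolate_draw_sketch()` are hypothetical; only the `sample_id` and `.row` column names and the `head(sample, nrow)` truncation come from the description above.

```r
# Hypothetical stacked sample shard: two draws, rows stored out of order.
shard <- data.frame(
  sample_id = c("s1", "s2", "s1", "s2", "s1", "s1"),
  .row = c(3L, 1L, 1L, 2L, 4L, 2L),
  Conc = c(3.3, 10.1, 1.1, 20.2, 4.4, 2.2)
)

isolate_draw_sketch <- function(shard, id, nrow) {
  # Isolate one task's draw by its sample_id.
  draw <- shard[shard$sample_id == id, , drop = FALSE]
  # Restore the original row order recorded at sampling time.
  draw <- draw[order(draw$.row), , drop = FALSE]
  # Truncate inline; no random numbers are consumed here.
  head(draw, nrow)
}

fit_input <- isolate_draw_sketch(shard, "s1", nrow = 3L)
```

Because truncation takes a prefix of the ordered draw, a smaller `nrow` yields the leading rows of the same draw used for a larger `nrow`.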
Runs the hc tasks bundled into one shard: reads the distinct set of parent
fit shards the shard's tasks reference (each once - an hc shard typically
spans several fit shards), isolates each task's fit by fit_id,
deserialises the fitdists object, and estimates the hazard concentration
with the per-task (seed, primer) through hc_data_task_primer(). Each
task's hc tibble (one or more rows - the proportion fan-out and the
ci = FALSE collapse, section 1.2) is tagged with its hc_id and parent
fit_id, stacked, and written as one Parquet at the shard's partition path.
ssd_run_hc_step(tasks, scenario, fit_dir, out_dir)
tasks |
A tibble of the shard's |
scenario |
The |
fit_dir |
The |
out_dir |
The |
The shard's Parquet path.
scenario <- ssd_define_scenario(
  ssddata::ccme_boron, nsim = 1L, nrow = 6L, seed = 42L, dists = "lnorm"
)
dir <- tempfile()
ssd_run_sample_step(
  ssd_scenario_sample_shards(scenario)$tasks[[1L]],
  scenario,
  file.path(dir, "sample")
)
ssd_run_fit_step(
  ssd_scenario_fit_shards(scenario)$tasks[[1L]],
  scenario,
  file.path(dir, "sample"),
  file.path(dir, "fit")
)
ssd_run_hc_step(
  ssd_scenario_hc_shards(scenario)$tasks[[1L]],
  scenario,
  file.path(dir, "fit"),
  file.path(dir, "hc")
)
Runs the sample tasks bundled into one shard: under one
local_dqrng_backend() scope, reads each task's dataset off the scenario via
scenario_dataset(), draws n_max rows with the per-task (seed, primer)
through sample_data_task_primer(), and writes one Parquet at the shard's
Hive partition path. Each task's draw is tagged with its sample_id and a
.row order index so a downstream fit shard can isolate and re-order it.
ssd_run_sample_step(tasks, scenario, out_dir)
tasks |
A tibble of the shard's task rows (the |
scenario |
The |
out_dir |
The |
The shard's Parquet path (the format = "file" contract).
ssd_run_fit_step(), ssd_run_hc_step(), ssd_scenario_sample_shards().
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 1L, seed = 42L)
shards <- ssd_scenario_sample_shards(scenario)
dir <- tempfile()
ssd_run_sample_step(shards$tasks[[1L]], scenario, file.path(dir, "sample"))
Run Scenario
ssd_run_scenario(x, ...)

## S3 method for class 'data.frame'
ssd_run_scenario(
  x, ..., nrow = 6L, replace = FALSE,
  dists = ssdtools::ssd_dists_bcanz(), rescale = FALSE, computable = FALSE,
  at_boundary_ok = TRUE, min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)), range_shape2 = list(c(0.05, 20)),
  proportion = 0.05, ci = FALSE, nboot = 1000, est_method = "multi",
  ci_method = "weighted_samples", parametric = TRUE, seed = NULL,
  nsim = 100L, stream = getOption("ssdsims.stream", 1L), start_sim = 1L,
  .progress = FALSE
)

## S3 method for class 'fitdists'
ssd_run_scenario(
  x, ..., nrow = 6L, dist_sim = "top",
  dists = ssdtools::ssd_dists_bcanz(), rescale = FALSE, computable = FALSE,
  at_boundary_ok = TRUE, min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)), range_shape2 = list(c(0.05, 20)),
  proportion = 0.05, ci = FALSE, nboot = 1000, est_method = "multi",
  ci_method = "weighted_samples", parametric = TRUE, seed = NULL,
  nsim = 100L, stream = getOption("ssdsims.stream", 1L), start_sim = 1L,
  .progress = FALSE
)

## S3 method for class 'tmbfit'
ssd_run_scenario(
  x, ..., nrow = 6L,
  dists = ssdtools::ssd_dists_bcanz(), rescale = FALSE, computable = FALSE,
  at_boundary_ok = TRUE, min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)), range_shape2 = list(c(0.05, 20)),
  proportion = 0.05, ci = FALSE, nboot = 1000, est_method = "multi",
  ci_method = "weighted_samples", parametric = TRUE, seed = NULL,
  nsim = 100L, stream = getOption("ssdsims.stream", 1L), start_sim = 1L,
  .progress = FALSE
)

## S3 method for class 'character'
ssd_run_scenario(
  x, ..., nrow = 6L, args = list(),
  dists = ssdtools::ssd_dists_bcanz(), rescale = FALSE, computable = FALSE,
  at_boundary_ok = TRUE, min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)), range_shape2 = list(c(0.05, 20)),
  proportion = 0.05, ci = FALSE, nboot = 1000, est_method = "multi",
  ci_method = "weighted_samples", parametric = TRUE, seed = NULL,
  nsim = 100L, stream = getOption("ssdsims.stream", 1L), start_sim = 1L,
  .progress = FALSE
)

## S3 method for class 'function'
ssd_run_scenario(
  x, ..., nrow = 6L, args = list(),
  dists = ssdtools::ssd_dists_bcanz(), rescale = FALSE, computable = FALSE,
  at_boundary_ok = TRUE, min_pmix = list(ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)), range_shape2 = list(c(0.05, 20)),
  proportion = 0.05, ci = FALSE, nboot = 1000, est_method = "multi",
  ci_method = "weighted_samples", parametric = TRUE, seed = NULL,
  nsim = 100L, stream = getOption("ssdsims.stream", 1L), start_sim = 1L,
  .progress = FALSE
)
x |
The object to use for the scenario. |
... |
Unused. |
nrow |
A positive whole number of the minimum number of non-missing rows. |
replace |
A logical vector specifying whether to sample with replacement. |
dists |
A character vector of the distribution names. |
rescale |
A flag specifying whether to leave the values unchanged (FALSE) or to rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values (TRUE) or a string specifying whether to leave the values unchanged ("no") or to rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values ("geomean") or to logistically transform ("odds"). |
computable |
A flag specifying whether to only return fits with numerically computable standard errors. |
at_boundary_ok |
A flag specifying whether a model with one or more parameters at the boundary should be considered to have converged (default = TRUE). |
min_pmix |
A number between 0 and 0.5 specifying the minimum proportion in mixture models. |
range_shape1 |
A numeric vector of length two of the lower and upper bounds for the shape1 parameter. |
range_shape2 |
A numeric vector of length two of the lower and upper bounds for the shape2 parameter. |
proportion |
A numeric vector of proportion values to estimate hazard concentrations for. |
ci |
A flag specifying whether to estimate confidence intervals (by bootstrapping). |
nboot |
A count of the number of bootstrap samples to use to estimate the confidence limits. A value of 10,000 is recommended for official guidelines. |
est_method |
A string specifying whether to estimate directly from
the model-averaged cumulative distribution function ( |
ci_method |
A string specifying which method to use for estimating
the standard error and confidence limits from the bootstrap samples.
The default and recommended value is still |
parametric |
A flag specifying whether to perform parametric bootstrapping as opposed to non-parametrically resampling the original data with replacement. |
seed |
An integer of the starting seed or NULL. |
nsim |
A count of the number of data sets to generate. |
stream |
A count of the stream number. |
start_sim |
A count of the number of the simulation to start from. |
.progress |
Whether to show a |
dist_sim |
A character vector specifying the distributions in the fitdists object or |
args |
A named list of the argument values. |
A tibble of nested data sets.
ssd_run_scenario(data.frame): Run scenario using data.frame to sample data
ssd_run_scenario(fitdists): Run scenario using fitdists object to generate data
ssd_run_scenario(tmbfit): Run scenario using tmbfit object to generate data
ssd_run_scenario(character): Run scenario using name of function to generate sequence of random numbers
ssd_run_scenario(`function`): Run scenario using function to generate sequence of random numbers
ssd_run_scenario(ssddata::ccme_boron, nsim = 2)
fit <- ssdtools::ssd_fit_dists(ssddata::ccme_boron)
ssd_run_scenario(fit, dist_sim = c("lnorm", "top"), nsim = 3)
fit <- ssdtools::ssd_fit_dists(ssddata::ccme_boron)
ssd_run_scenario(fit[[1]], nsim = 3)
ssd_run_scenario("rlnorm", nsim = 3)
ssd_run_scenario(ssdtools::ssd_rlnorm, nsim = 3)
Executes the three task tables in dependency order - sample, fit, then
hc - by looping over each table with purrr::pmap() and looking up each
task's parent result by the parent's <step>_id foreign key. The fit step
truncates its parent sample inline (head(sample, nrow)) before fitting. The
runner does no task expansion of its own (it consumes ssd_scenario_tasks());
it just threads outputs forward and returns the collected per-step results.
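The foreign-key threading described above can be sketched in base R. This is an illustration only, not the package's runner: the task tables are toy data, `Map()` stands in for `purrr::pmap()`, and the "fit" is a placeholder mean.

```r
# Toy task tables; only the <step>_id key columns mirror the description.
sample_tasks <- data.frame(
  sample_id = c("s1", "s2"),
  n_max = c(4L, 4L),
  stringsAsFactors = FALSE
)
fit_tasks <- data.frame(
  fit_id = c("f1", "f2", "f3"),
  sample_id = c("s1", "s1", "s2"),
  nrow = c(2L, 4L, 3L),
  stringsAsFactors = FALSE
)

# Step 1: run every sample task and key the results by sample_id.
# The "draw" here is just 1..n_max so the outcome is easy to check.
sample_results <- lapply(sample_tasks$n_max, seq_len)
names(sample_results) <- sample_tasks$sample_id

# Step 2: each fit task looks up its parent by foreign key, truncates it
# inline with head(), and then does its (placeholder) work.
fit_results <- Map(
  function(sample_id, nrow) mean(head(sample_results[[sample_id]], nrow)),
  fit_tasks$sample_id,
  fit_tasks$nrow
)
names(fit_results) <- fit_tasks$fit_id
```

Because each child only needs its parent's key, two fit tasks (`f1`, `f2`) can share one sample result without re-running the sample step.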
ssd_run_scenario_baseline(scenario)
scenario |
An |
This is the no-frills baseline: it runs in-process, with no targets
dependency, no shard grouping or partition_by, and no Parquet I/O.
It is reproducible without an external seed. The runner opens one
local_dqrng_backend() scope and seeds each sample/fit/hc task exactly
once through its *_data_task_primer() wrapper, with seed = scenario$seed
and a per-task primer derived from the task's canonical identity
(task_primer() over the task_axes(step) columns). Because each task's
(seed, primer) pair fully determines its RNG starting point, two runs of a
scenario with a fixed seed yield identical results, and a task's result is
independent of the order in which tasks run. The same
*_data_task_primer() wrappers are the per-task entry point that a future
targets shard body and the replay helper (TARGETS-DESIGN.md section 7) will reuse.
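A minimal sketch of why a per-task (seed, primer) pair makes results order-independent, using dqrng directly rather than the package's wrappers. The primers below are placeholder values, not output of task_primer(), and `run_task()` is a hypothetical helper.

```r
library(dqrng)

# pcg64 is selected because it accepts a length-2 stream.
dqRNGkind("pcg64")

# Hypothetical task body: reseed from (seed, primer), then draw.
run_task <- function(seed, primer) {
  dqset.seed(seed, stream = primer)
  dqrunif(3)
}

seed <- 42L
primers <- list(a = c(1L, 2L), b = c(3L, 4L)) # placeholder primers

# Run the same tasks in two different orders.
forward <- lapply(primers, run_task, seed = seed)
reverse <- lapply(rev(primers), run_task, seed = seed)

# Each task's draws depend only on its own (seed, primer) pair.
same_a <- identical(forward$a, reverse$a)
same_b <- identical(forward$b, reverse$b)
```

Since every task resets the generator before drawing, nothing a previously run task consumed can leak into the next one.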
The scenario retains the data frames it was built from, so the runner reads
them directly - no separate data argument. min_pmix names are resolved to
their materialised functions off the scenario via scenario_min_pmix()
(resolved once, at construction), not by a runtime ssdtools/global-env
search.
A named list with sample, fit, and hc elements: each the
corresponding task table augmented with a list column of per-task results
(sample draws, fits objects, and hc tibbles).
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 1L, nrow = 6L, seed = 42L, dists = "lnorm"
)
out <- ssd_run_scenario_baseline(scenario)
out$hc
Executes a scenario's three task steps in dependency order - sample, then
fit, then hc - materialising each step's results as one Parquet per
partition_by path cell under a Hive-partitioned tree
<dir>/<step>/<axis=value>/.../part.parquet, and linking steps by reading the
parent step's shards back via duckplyr (predicate pushdown), rather than
threading results in memory. This is the single-core, targets-free sibling
of ssd_run_scenario_baseline() and the first consumer of partition-by's
path/inner split (scenario_dataset()'s sibling scenario_partition_axes()).
ssd_run_scenario_shards(scenario, dir = tempfile("ssdsims-shards-"))
scenario |
An |
dir |
A results root to write the Hive-partitioned shards under; created
if absent. Defaults to a per-run session temp dir (the shards are left on
disk for inspection and reuse). The runner owns the |
It reuses the per-task seed-and-run wrappers, so for a fixed scenario$seed
it is reproducible and order-independent, and its per-task results are
byte-identical to ssd_run_scenario_baseline() - partition_by is a free
re-layout that moves only file paths, never results. The m:n parent-shard
dependency (a child shard reading several parent shards, or a parent shard
feeding several children, per the section 5 coarsening defaults) is resolved
at read time: each task opens the parent shard at its <parent>_id identity
projected onto the parent's path axes and filters to the rows it needs.
No targets, crew, manifest, or cloud upload - this is the plain-R storage
loop only, de-risking hive-partitioning/task-tables.
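The Hive layout and the read-time parent lookup can be illustrated with a small path builder. This is a sketch under assumptions: `shard_path()` and the choice of path axes are hypothetical, and only the `<dir>/<step>/<axis=value>/.../part.parquet` shape comes from the description above.

```r
# Build <dir>/<step>/<axis=value>/.../part.parquet from named axis values.
shard_path <- function(dir, step, axes) {
  parts <- paste0(names(axes), "=", vapply(axes, as.character, character(1)))
  do.call(
    file.path,
    c(list(dir, step), as.list(unname(parts)), list("part.parquet"))
  )
}

# A child (fit) shard path.
fit_path <- shard_path("results", "fit", list(dataset = "boron", replace = FALSE))

# A fit task carries its parent's axis values, so the parent shard is found
# by projecting the task onto the parent's (hypothetical) path axes.
fit_task <- list(dataset = "boron", replace = FALSE, nrow = 6L)
sample_path_axes <- c("dataset", "replace")
parent_path <- shard_path("results", "sample", fit_task[sample_path_axes])
```

Changing which axes go into the path only moves files around; the rows inside each shard are the same task results either way.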
An ssdsims_shard_run object: a list with dir and the written
sample, fit, and hc shard Parquet paths (one per shard).
ssd_run_scenario_baseline() (the in-memory reference oracle),
ssd_scenario_sample_shards(), ssd_run_sample_step().
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 1L, nrow = 6L, seed = 42L, dists = "lnorm"
)
run <- ssd_run_scenario_shards(scenario)
run$hc
As ssd_scenario_sample_shards() for the fit step: groups
ssd_scenario_fit_tasks() by partition_by$fit. Each task row in tasks
carries its parent sample path-axis values and sample_id, so the runner
opens the matching sample shard by partition path.
ssd_scenario_fit_shards(scenario)
scenario |
An |
A tibble with one row per fit shard (path-axis columns + a tasks
list-column).
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 2L, seed = 42L, rescale = c(FALSE, TRUE)
)
ssd_scenario_fit_shards(scenario)
Crosses each sample-task identity (dataset, sim, replace) with the
scenario's nrow values and each row of the scenario's fit argument grid
(rescale, computable, at_boundary_ok, min_pmix name, range_shape1,
range_shape2). nrow is a genuine fit cross-join axis: the fit step
truncates its parent sample inline (head(sample, nrow), RNG-free) before
fitting, so the shared draw is sub-truncated without a separate data step
(TARGETS-DESIGN.md section 5). Parent-identity columns are preserved
verbatim so the table can be grouped directly downstream. min_pmix is
referenced by name, not by function value (TARGETS-DESIGN.md section 1.1).
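As a rough illustration of the cross-join described above, the following base R sketch crosses toy sample identities with an nrow axis and a reduced fit grid. The values are made up and the key columns are omitted; it is not the package's derivation.

```r
# Parent sample identities (dataset, sim, replace).
sample_ids <- data.frame(
  dataset = "boron",
  sim = 1:2,
  replace = FALSE,
  stringsAsFactors = FALSE
)
# nrow is a genuine fit axis, crossed with every sample identity.
nrow_axis <- data.frame(nrow = c(6L, 10L))
# A reduced fit argument grid (two of the six knobs, for brevity).
fit_grid <- data.frame(rescale = c(FALSE, TRUE), computable = FALSE)

# merge(by = NULL) is base R's Cartesian product.
fit_tasks <- merge(merge(sample_ids, nrow_axis, by = NULL), fit_grid, by = NULL)

n_tasks <- nrow(fit_tasks) # 2 samples x 2 nrow values x 2 grid rows
```

The parent columns survive unchanged in `fit_tasks`, which is what allows grouping by them downstream.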
ssd_scenario_fit_tasks(scenario)
scenario |
An |
Each row carries a fit_id primary key and a sample_id foreign key
referencing its parent sample task.
An ssdsims_tasks object recording the "fit" step, with one row per
(dataset, sim, replace, nrow) identity crossed with the fit grid.
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 3L, seed = 42L, rescale = c(FALSE, TRUE)
)
ssd_scenario_fit_tasks(scenario)
As ssd_scenario_sample_shards() for the hc step: groups
ssd_scenario_hc_tasks() by partition_by$hc. Each task row in tasks
carries its parent fit path-axis values and fit_id, so the runner opens
the matching fit shard by partition path.
ssd_scenario_hc_shards(scenario)
scenario |
An |
A tibble with one row per hc shard (path-axis columns + a tasks
list-column).
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 2L, seed = 42L, ci = c(FALSE, TRUE)
)
ssd_scenario_hc_shards(scenario)
Crosses each fit-task identity with each row of the scenario's hc argument
grid (nboot, est_method, ci_method, parametric). The expansion honours
the construction-time ci = FALSE collapse (TARGETS-DESIGN.md section 1.2): rows
where ci = FALSE are not multiplied across the bootstrap-only knobs
(nboot, ci_method, parametric), which are stored as NA, while
ci = TRUE rows fan out across the full grid.
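The ci = FALSE collapse can be sketched with a plain grid. This is an assumption-laden illustration in base R, not the package's construction-time logic: it builds the full grid, blanks the bootstrap-only knobs where ci is FALSE, and drops the resulting duplicates.

```r
# Full hc grid before the collapse.
grid <- expand.grid(
  ci = c(FALSE, TRUE),
  nboot = c(10L, 100L),
  est_method = "multi",
  ci_method = "weighted_samples",
  parametric = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Bootstrap-only knobs are irrelevant when ci = FALSE, so store them as NA.
boot_only <- c("nboot", "ci_method", "parametric")
grid[!grid$ci, boot_only] <- NA

# The ci = FALSE rows are now identical and collapse to one.
hc_grid <- unique(grid)

n_no_ci <- sum(!hc_grid$ci) # one collapsed row
n_ci <- sum(hc_grid$ci)     # 2 nboot x 2 parametric
```

Without the collapse the ci = FALSE half of the grid would schedule four identical computations here instead of one.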
ssd_scenario_hc_tasks(scenario)
scenario |
An |
Each row carries an hc_id primary key and a fit_id foreign key
referencing its parent fit task.
An ssdsims_tasks object recording the "hc" step, with one row per
fit-task identity crossed with the (collapsed) hc grid.
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 2L, seed = 42L, ci = c(FALSE, TRUE), nboot = c(10L, 100L)
)
ssd_scenario_hc_tasks(scenario)
Groups the scenario's sample task table (ssd_scenario_sample_tasks()) by
its partition_by$sample path axes into one row per shard: each shard row
carries the path-axis columns (the tar_map target-name suffix and Hive path)
and a tasks list-column of that shard's task rows, each decorated with
seed = scenario$seed and its per-task primer (task_primer() over
task_axes("sample")). The decoration is RNG-free; the bare task table keeps
its no-(seed, primer) contract.
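Grouping a task table into one row per shard can be sketched as follows. The single path axis (`dataset`) and the toy tasks are assumptions, and the per-task primer is left out; only the shape (path-axis columns plus a `tasks` list-column) follows the description.

```r
# Toy sample task table.
tasks <- data.frame(
  dataset = c("boron", "boron", "cadmium"),
  sim = c(1L, 2L, 1L),
  replace = FALSE,
  stringsAsFactors = FALSE
)

# Group by the (hypothetical) path axis and decorate each task row with
# the scenario seed; no random numbers are drawn here.
groups <- split(tasks, tasks$dataset)
groups <- lapply(groups, function(g) {
  g$seed <- 42L
  g
})

# One row per shard: the path-axis value plus a list-column of task rows.
shards <- data.frame(dataset = names(groups), stringsAsFactors = FALSE)
shards$tasks <- unname(groups)
```

The original `tasks` table is left untouched, so the bare task table still has no seed column.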
ssd_scenario_sample_shards(scenario)
scenario |
An |
A tibble with one row per sample shard: the path-axis columns and a
tasks list-column. Suitable as tarchetypes::tar_map(values = ).
ssd_run_sample_step(), ssd_scenario_fit_shards().
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 2L, seed = 42L)
ssd_scenario_sample_shards(scenario)
Expands an ssdsims_scenario into the sample task table: one row per cell
of the cross-join of the scenario's dataset names, replicate index (1:nsim),
and replace values. Each row is the single random draw of n_max = max(nrow) rows that every nrow value sub-truncates (TARGETS-DESIGN.md
section 5), so nrow is not a sample axis - the draw is shared. n_max is
carried as an ordinary integer column. The derivation performs no
random-number generation and adds no seed/primer/stream columns (those
arrive in later roadmap steps; see TARGETS-DESIGN.md section 2).
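A base R sketch of the expansion described above, with nrow contributing only n_max rather than an axis. The inputs are toy values and the `sample_id` format shown is hypothetical; the package's actual key format is not specified here.

```r
datasets <- "boron"
nsim <- 3L
replace <- c(FALSE, TRUE)
nrow_values <- c(6L, 10L)

# Cross-join of dataset x replicate index x replace; nrow is not an axis.
sample_tasks <- expand.grid(
  dataset = datasets,
  sim = seq_len(nsim),
  replace = replace,
  stringsAsFactors = FALSE
)

# Every nrow value shares one draw of n_max rows.
sample_tasks$n_max <- max(nrow_values)

# Hypothetical path-style key built from the identity columns.
sample_tasks$sample_id <- with(
  sample_tasks,
  paste0("dataset=", dataset, "/sim=", sim, "/replace=", replace)
)
```

Adding a third nrow value would leave the number of sample tasks unchanged and at most raise n_max.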
ssd_scenario_sample_tasks(scenario)
scenario |
An |
Each row carries a path-style sample_id primary key.
An ssdsims_tasks object (a classed tibble recording the "sample"
step) with one row per (dataset, sim, replace) cell, a sample_id key,
and a carried n_max column.
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 3L, seed = 42L)
ssd_scenario_sample_tasks(scenario)
A target factory: returns the list of targets objects that runs a
scenario as a static-branching Hive-sharded pipeline (TARGETS-DESIGN.md
section 6), so a whole _targets.R reduces to building a scenario and calling
this function:
ssd_scenario_targets(scenario, root = scenario_results_dir(scenario))
scenario |
An |
root |
The results root the shards and summary are written under;
defaults to the per-layout |
library(targets)
library(tarchetypes)
library(ssdsims)
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 2L, seed = 42L)
ssd_scenario_targets(scenario)
For each step it tarchetypes::tar_map()s one named, format = "file",
error = "null" target per partition_by path cell (the names are the
step's path axes), wires sample -> fit -> hc -> summary ordering with
tar_combine() barriers (a step body reads its parents from disk by partition
path, so there is no automatic edge), and writes every shard and the summary
under the per-layout scenario_results_dir() root (so a changed
partition_by/bundle never mixes shard granularities). scenario is
referenced as a global, so editing it invalidates the dependent shards.
To parallelise the shards, set a controller (e.g. a mirai-backed
crew::crew_controller_local()) with targets::tar_option_set() in
_targets.R before calling this - the target set is unchanged.
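A possible _targets.R with a local crew controller might look like the sketch below. The worker count is an arbitrary assumption, and the ssdsims calls simply repeat the documented example; adjust the controller to the machine or cluster in use.

```r
# _targets.R (sketch)
library(targets)
library(tarchetypes)
library(ssdsims)

# Set the controller before building the target list.
# workers = 2 is an illustrative value, not a recommendation.
tar_option_set(
  controller = crew::crew_controller_local(workers = 2)
)

scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 2L, seed = 42L)
ssd_scenario_targets(scenario)
```

Only the option changes; the list returned by ssd_scenario_targets() is the same as in the single-worker case.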
A list of targets target objects, for _targets.R to return.
scenario_results_dir(), ssd_run_scenario_shards() (the
single-core, targets-free equivalent).
## Not run:
# _targets.R
library(targets)
library(tarchetypes)
library(ssdsims)
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 2L, seed = 42L)
ssd_scenario_targets(scenario)
## End(Not run)
The canonical expansion entry point (TARGETS-DESIGN.md section 1/section 2): derives the
sample, fit, and hc task tables from a scenario in one call and
bundles them into an ssdsims_task_set. The per-step derivations
(ssd_scenario_sample_tasks(), ssd_scenario_fit_tasks(),
ssd_scenario_hc_tasks()) remain available for callers that need a single
table.
ssd_scenario_tasks(scenario, step = NULL)
scenario |
An |
step |
Optional single step name ( |
An ssdsims_task_set object (a list with sample, fit, and hc
elements, each an ssdsims_tasks table), or - when step is supplied - the
single ssdsims_tasks table for that step.
scenario <- ssd_define_scenario(ssddata::ccme_boron, nsim = 3L, seed = 42L)
tasks <- ssd_scenario_tasks(scenario)
tasks
tasks$hc
ssd_scenario_tasks(scenario, "hc")
A family of functions to generate a tibble of nested data sets.
ssd_sim_data(x, ...)
## S3 method for class 'data.frame'
ssd_sim_data(
  x,
  ...,
  nrow = 6L,
  replace = FALSE,
  seed = NULL,
  nsim = 100L,
  stream = getOption("ssdsims.stream", 1L),
  start_sim = 1L,
  .progress = FALSE
)
## S3 method for class 'fitdists'
ssd_sim_data(
  x,
  ...,
  nrow = 6L,
  dist_sim = "top",
  seed = NULL,
  nsim = 100L,
  stream = getOption("ssdsims.stream", 1L),
  start_sim = 1L,
  .progress = FALSE
)
## S3 method for class 'tmbfit'
ssd_sim_data(
  x,
  ...,
  nrow = 6L,
  seed = NULL,
  nsim = 100L,
  stream = getOption("ssdsims.stream", 1L),
  start_sim = 1L,
  .progress = FALSE
)
## S3 method for class 'character'
ssd_sim_data(
  x,
  ...,
  nrow = 6L,
  args = list(),
  seed = NULL,
  nsim = 100L,
  stream = getOption("ssdsims.stream", 1L),
  start_sim = 1L,
  .progress = FALSE
)
## S3 method for class 'function'
ssd_sim_data(
  x,
  ...,
  nrow = 6L,
  args = list(),
  seed = NULL,
  nsim = 100L,
  stream = getOption("ssdsims.stream", 1L),
  start_sim = 1L,
  .progress = FALSE
)
x |
The object to use for generating the data. |
... |
Unused. |
nrow |
A numeric vector of the number of rows in the generated data, which must be between 5 and 1,000. |
replace |
A logical vector specifying whether to sample with replacement. |
seed |
An integer of the starting seed or NULL. |
nsim |
A count of the number of data sets to generate. |
stream |
A count of the stream number. |
start_sim |
A count of the number of the simulation to start from. |
.progress |
Whether to show a |
dist_sim |
A character vector specifying the distributions in the fitdists object or |
args |
A named list of the argument values. |
A tibble of nested data sets.
ssd_sim_data(data.frame): Generate data by sampling from data.frame
ssd_sim_data(fitdists): Generate data from fitdists object
ssd_sim_data(tmbfit): Generate data from tmbfit object
ssd_sim_data(character): Generate data using name of function
ssd_sim_data(`function`): Generate data using function to generate sequence of random numbers
ssd_sim_data(ssddata::ccme_boron, nrow = 5, nsim = 3)
fit <- ssdtools::ssd_fit_dists(ssddata::ccme_boron)
ssd_sim_data(fit, nrow = 5, nsim = 3)
fit <- ssdtools::ssd_fit_dists(ssddata::ccme_boron)
ssd_sim_data(fit[[1]], nrow = 5, nsim = 3)
ssd_sim_data("rnorm", nrow = 5, nsim = 3)
ssd_sim_data(ssdtools::ssd_rlnorm, nrow = 5, nsim = 3)
Fans in the run's results without pulling shard target values back into R or
recomputing anything: reads every hc shard Parquet under dir_hc (a Hive
glob) with duckplyr - the analysis-ready per-task hazard-concentration
estimates - unions them, and writes path. Because it reads the result
directory (not the shard targets), it sees whatever shards landed, so it
unions the survivors of a partially-failed run (error = "null", section
6.2). dir_sample and dir_fit are accepted for signature symmetry with the
three result layers; the sample draws and serialised fit objects are not
summary material, so the combined summary is the hc layer.
ssd_summarize(dir_sample, dir_fit, dir_hc, path)
dir_sample |
The |
dir_fit |
The |
dir_hc |
The |
path |
The output Parquet path for the combined summary. |
In a targets pipeline a directory read carries no dependency edge, so order
summary after the shards by referencing an upstream barrier in its command
(see the shipped _targets.R template's tar_combine() barriers). Reading
the directory - rather than the shard target values - is what lets it union
whatever shards landed (the survivors of a partially-failed run, section 6.2).
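The "union whatever landed" behaviour can be illustrated without Parquet or duckplyr. The sketch below is a stand-in under stated assumptions: it uses CSV files in a temporary Hive-style tree, toy estimates, and a hypothetical `write_shard()` helper, purely to show that reading the directory picks up only the shards that exist.

```r
dir_hc <- file.path(tempfile("shards-"), "hc")

# Hypothetical helper: write one shard under <dir_hc>/dataset=<value>/.
write_shard <- function(dataset, est) {
  d <- file.path(dir_hc, paste0("dataset=", dataset))
  dir.create(d, recursive = TRUE)
  write.csv(data.frame(est = est), file.path(d, "part.csv"), row.names = FALSE)
}

# Two shards land; imagine a third failed and wrote nothing.
write_shard("boron", 1.2)   # toy value
write_shard("cadmium", 0.4) # toy value

# Glob the tree and union the survivors, recovering the Hive key from
# each shard's directory name.
files <- list.files(
  dir_hc, pattern = "^part\\.csv$", recursive = TRUE, full.names = TRUE
)
summary_tbl <- do.call(rbind, lapply(files, function(f) {
  out <- read.csv(f)
  out$dataset <- sub("^dataset=", "", basename(dirname(f)))
  out
}))
```

The reader never consults a list of expected shards, which is why a partially failed run still yields a usable summary.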
The summary Parquet path (the format = "file" contract).
scenario <- ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 1L, nrow = 6L, seed = 42L, dists = "lnorm"
)
# Materialise the shards single-core, then fan in the hc layer.
run <- ssd_run_scenario_shards(scenario)
ssd_summarize(
  file.path(run$dir, "sample"),
  file.path(run$dir, "fit"),
  file.path(run$dir, "hc"),
  file.path(run$dir, "summary.parquet")
)
Derives the per-task primer – a length-2 integer vector – from
rlang::hash(params), suitable for the stream argument of
dqrng::dqset.seed(). Together with the scenario seed, the primer fully
specifies a task's RNG starting point:
dqrng::dqset.seed(seed, stream = task_primer(params)). It pairs with
local_dqrng_state(), which installs the (seed, primer) pair under an
active local_dqrng_backend() scope.
task_primer(params)
params |
A plain named list of task parameters, or a single-row data frame (one task-table row). |
The primer packs 64 bits of the rlang::hash() digest (xxhash128) as
c(hi32, lo32). Each 32-bit half is encoded as a signed int32, with the
reserved bit pattern 0x80000000 (INT_MIN, which R cannot represent as a
non-NA integer) mapped to NA_integer_; dqrng accepts NA_integer_ in
stream and treats it as INT_MIN, so the encoding recovers the full 64 bits
of stream entropy.
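The signed-int32 packing can be sketched in base R. This is an illustration of the encoding as described, not the package's code: `to_int32()` and `pack_primer()` are hypothetical names, and which 64 bits of the digest task_primer() actually takes is an assumption (the first 16 hex characters here).

```r
# Convert 8 hex characters (32 bits) to a signed int32, mapping the
# INT_MIN bit pattern 0x80000000 to NA_integer_.
to_int32 <- function(hex8) {
  stopifnot(nchar(hex8) == 8L)
  # strtoi() cannot parse values above 2^31 - 1, so combine 16-bit halves.
  hi16 <- strtoi(substr(hex8, 1L, 4L), base = 16L)
  lo16 <- strtoi(substr(hex8, 5L, 8L), base = 16L)
  u <- hi16 * 65536 + lo16                 # unsigned value as a double
  s <- if (u >= 2^31) u - 2^32 else u      # two's-complement reinterpretation
  if (s == -2^31) NA_integer_ else as.integer(s)
}

# Pack 16 hex characters (64 bits) as c(hi32, lo32).
pack_primer <- function(hex16) {
  c(to_int32(substr(hex16, 1L, 8L)), to_int32(substr(hex16, 9L, 16L)))
}

p1 <- pack_primer("0000000180000000")
p2 <- pack_primer("ffffffff7fffffff")
```

Every one of the 2^32 bit patterns per half maps to a distinct value (NA included), so no stream entropy is lost.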
params may be a plain named list or a single-row data frame (one row of a
{sample,fit,hc}_tasks table). A data-frame row is normalised to a
canonical plain list – the inverse of tibble::tibble_row() – by dropping
all attributes, unwrapping length-1 list-style columns to their element, and
leaving df-style (nested data-frame) columns as data frames, before hashing.
The primer is therefore identical whether derived from the row or from the
equivalent plain list. Note that rlang::hash() is order-sensitive, so the
plain list must use the same name order as the task-table columns to
reproduce the row's primer (assembling params in a canonical column order
is part of the task-tables caller contract below).
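The row normalisation and the order sensitivity can be sketched with base R alone. `normalise_row()` is a hypothetical stand-in for what task_primer() does internally, and `identical()` is used in place of hashing to compare the structures.

```r
# Hypothetical normaliser: one-row data frame -> canonical plain list.
normalise_row <- function(row) {
  stopifnot(is.data.frame(row), nrow(row) == 1L)
  lapply(unclass(row), function(col) {
    # Unwrap length-1 list-style columns to their element.
    if (is.list(col) && !is.data.frame(col)) return(col[[1L]])
    # Drop attributes from ordinary columns; keep nested data frames.
    if (!is.data.frame(col)) attributes(col) <- NULL
    col
  })
}

row <- data.frame(
  dataset = "boron", sim = 1L, replace = FALSE,
  stringsAsFactors = FALSE
)
row$range_shape1 <- list(c(0.05, 20)) # a list-style column

plain <- list(
  dataset = "boron", sim = 1L, replace = FALSE,
  range_shape1 = c(0.05, 20)
)

same_structure <- identical(normalise_row(row), plain)

# The same entries in a different name order are a different object, which
# is why an order-sensitive hash gives a different primer.
same_when_reordered <- identical(plain, plain[rev(names(plain))])
```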
task_primer() normalises structure, not meaning: it hashes whatever
params it is given. The canonical, name-keyed representation is a caller
contract assembled where params is built (task-tables, over the
task-lists tables). Per the three-step model the RNG-consuming steps each
take a primer over their task identity:
sample – keyed (dataset, sim, replace) only. nrow is deliberately
absent: every nrow shares one n_max-row draw that the fit step
truncates inline (head(sample, nrow), RNG-free, no separate primer), so
excluding nrow is load-bearing for the sub-truncation property
(TARGETS-DESIGN.md §5).
fit – the parent sample identity plus nrow and the fit-grid row
(rescale, computable, at_boundary_ok, min_pmix name,
range_shape1, range_shape2). nrow IS part of the fit primer: a fit on
a different truncation is a genuinely different computation.
hc – the parent fit identity plus the hc-grid row (ci, nboot,
est_method, ci_method, parametric).
Function-valued parameters (e.g. min_pmix) MUST be referenced by name,
not by function value, so a recompile or JIT does not move a task's primer.
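The sub-truncation property that motivates excluding nrow from the sample identity can be checked with a small base R sketch. It uses `set.seed()` and `rlnorm()` as stand-ins for the package's dqrng-based draw, and the nrow values are arbitrary.

```r
set.seed(42)

# One shared draw of n_max rows per sample task.
n_max <- 10L
draw <- rlnorm(n_max)

# Each nrow value truncates the same draw; no further RNG is consumed.
by_nrow <- lapply(c(6L, 8L, 10L), function(n) head(draw, n))

# Smaller truncations are exact prefixes of larger ones.
prefix_6_of_8 <- identical(by_nrow[[1]], head(by_nrow[[2]], 6L))
prefix_8_of_10 <- identical(by_nrow[[2]], head(by_nrow[[3]], 8L))
```

If nrow were part of the sample identity, each nrow would get its own primer and its own draw, and the prefix relationship would no longer hold.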
An integer vector of length 2 – the per-task primer – to pass as
the stream argument of dqrng::dqset.seed() (via local_dqrng_state()).
local_dqrng_state(), local_dqrng_backend().
task_primer(list(dataset = "boron", sim = 1L, replace = FALSE))