---
title: "Defining a Scenario"
vignette: >
  %\VignetteIndexEntry{Defining a Scenario}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
---

```{r}
#| label: setup
#| include: false
# The vignette exercises the live API, so it needs the optional fitting
# dependencies. Skip evaluation gracefully if they are unavailable (e.g. on a
# minimal CI runner) rather than failing the build.
evaluate <- requireNamespace("ssddata", quietly = TRUE) &&
  requireNamespace("ssdtools", quietly = TRUE)
knitr::opts_chunk$set(eval = evaluate)
```

```{r}
#| label: library
library(ssdsims)
```

## Why a declarative scenario?

ssdsims is moving from running a simulation study *immediately* (the legacy
`ssd_run_scenario()`) to a cluster-friendly [targets](https://docs.ropensci.org/targets/)
pipeline. The root of that pipeline is a **scenario**: a small, purely
declarative description of the study you want to run. It contains only the
knobs --- a seed, the number of simulations, the sample sizes, the dataset
*names*, and the fit/hc argument grids. It draws **no** random numbers, expands
**no** tasks, writes **nothing**, and has **no** dependency on `targets`.

Keeping the scenario declarative buys two things:

- It **serialises to a compact manifest** --- no data frames, no RNG state, no
  function bodies --- so it can be shipped to a cluster and stored alongside
  results.
- The set of work a pipeline expands to is a **pure function of the scenario**,
  so the same scenario always describes the same study.

This vignette walks through the stages that work *today*, in order. It is
intended to grow: as roadmap features land (a per-scenario manifest, cloud
upload, the crew/SLURM controller), new sections will document them here. See
`TARGETS-DESIGN.md` for the north-star design.

The four stages this vignette covers:

1. **Assemble** the data with `ssd_data()`.
2. **Declare** the scenario with `ssd_define_scenario()`.
3. **Expand** it into per-step task tables with `ssd_scenario_tasks()`.
4. **Run** the baseline in-process loop with `ssd_run_scenario_baseline()`.

Stages 1--3 are side-effect-free. Only stage 4 touches the RNG.

## Stage 1: assemble the data

`ssd_data()` is the single entry point for dataset input. It validates each
data frame --- every dataset must carry a numeric `Conc` column, the species
sensitivity distribution convention --- and assembles them into a named
collection.

```{r}
#| label: ssd-data
datasets <- ssd_data(
  boron = ssddata::ccme_boron,
  cadmium = ssddata::ccme_cadmium
)
datasets
```
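
The `Conc` rule can be pictured with a few lines of base R. This is a minimal sketch of that kind of check, not the `ssd_data()` implementation; `check_conc()` and the two toy data frames are hypothetical names introduced for illustration.

```r
# Hypothetical helper: reject anything that is not a data frame with a
# numeric `Conc` column. A sketch only, not the package's own validator.
check_conc <- function(data, name) {
  if (!is.data.frame(data)) {
    stop(sprintf("`%s` must be a data frame.", name), call. = FALSE)
  }
  if (!is.numeric(data[["Conc"]])) {
    stop(sprintf("`%s` must have a numeric `Conc` column.", name), call. = FALSE)
  }
  invisible(data)
}

good <- data.frame(Conc = c(1.0, 2.5, 4.0))
bad <- data.frame(Conc = c("low", "high"))

check_conc(good, "good") # passes silently
bad_result <- try(check_conc(bad, "bad"), silent = TRUE)
inherits(bad_result, "try-error") # the non-numeric column is rejected
```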

Names come from the argument names where you supply them; otherwise they are
derived from the argument expression by symbol capture (so a bare
`ssddata::ccme_boron` becomes `"ccme_boron"`). Names must be unique, and a
literal with no derivable name (e.g. a bare `data.frame(...)`) must be given an
explicit name.
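
The naming rules above can be approximated with `substitute()`, which exposes the unevaluated argument expressions. The sketch below is an assumption about how symbol capture of this kind might work, not the package's code; `derive_names()` is a hypothetical helper, and because nothing is evaluated it runs even without `ssddata` installed.

```r
# Hypothetical helper: derive one name per argument without evaluating it.
derive_names <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1L]
  supplied <- names(exprs)
  if (is.null(supplied)) {
    supplied <- rep("", length(exprs))
  }
  derived <- vapply(
    exprs,
    function(expr) {
      if (is.symbol(expr)) {
        return(as.character(expr)) # bare symbol, e.g. `boron`
      }
      is_qualified <- is.call(expr) &&
        as.character(expr[[1L]])[1L] %in% c("::", ":::")
      if (is_qualified) {
        return(as.character(expr[[3L]])) # `pkg::name` keeps `name`
      }
      NA_character_ # a literal such as `data.frame(...)` has no name
    },
    character(1)
  )
  # Explicit argument names win; derived names fill the gaps.
  out <- supplied
  unnamed <- !nzchar(out)
  out[unnamed] <- derived[unnamed]
  if (anyNA(out)) {
    stop("Inputs with no derivable name need an explicit name.", call. = FALSE)
  }
  if (anyDuplicated(out) > 0L) {
    stop("Dataset names must be unique.", call. = FALSE)
  }
  out
}

derive_names(boron = ssddata::ccme_boron, ssddata::ccme_cadmium)
#> [1] "boron"        "ccme_cadmium"
```

In this sketch a bare `data.frame(...)` argument or a repeated name stops with an error, mirroring the two constraints stated above.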

`ssd_data()` is the extensible input point. A planned change
(`scenario-input-types`) will let each input also be a data *generator* --- a
`fitdists`/`tmbfit` object, a generator function, or a function-name string ---
materialised by a dataset registry. For now each input must be a data frame.

## Stage 2: declare the scenario

`ssd_define_scenario()` is the constructor. Its required arguments are the
dataset input, `nsim` (the replicate count), and `seed` (the RNG root --- it has
no default because changing it fully re-roots every draw). Everything else is a
knob with a sensible default.

```{r}
#| label: define-scenario
scenario <- ssd_define_scenario(
  datasets,
  nsim = 3L,
  nrow = c(5L, 10L, 20L),
  seed = 42L,
  dists = c("lnorm", "gamma"),
  proportion = c(0.05, 0.2),
  ci = c(FALSE, TRUE),
  nboot = c(10L, 100L),
  est_method = "multi",
  ci_method = "weighted_samples"
)
scenario
```

The `print()` method shows the declarative fields: the scalar `seed` and
`nsim`, the `nrow` sample sizes, the dataset names, and the two argument grids
(`fit` and `hc`). Note what is *not* shown --- the data frames themselves are
retained for the local runner but are not part of the declarative identity; the
cluster path carries only the names.

### Dataset input is flexible

The preferred form is an `ssd_data()` collection (above), which owns validation
and naming. For convenience the constructor also accepts bare data frame input,
routed through the same `Conc` validation, in several forms:

```{r}
#| label: input-forms
# 1. A single data frame; name derived from the expression ("ccme_boron").
ssd_define_scenario(ssddata::ccme_boron, nsim = 2L, seed = 1L)

# 2. A single data frame with an explicit name.
ssd_define_scenario(ssddata::ccme_boron, name = "boron", nsim = 2L, seed = 1L)
```

A named list (`list(boron = ..., cadmium = ...)`) takes names from the list; an
unnamed list derives them per element. Supplying both a named list and `name=`
is an error.

### `min_pmix` is referenced by name

The `min_pmix` knob is referenced **by name**: the name --- not the function
body --- is what enters the task identity and hashes, so the scenario's identity
stays stable under a recompile or a cosmetic edit. You can pass a character
vector of names directly, or a function (or list of functions) whose name is
derived by symbol capture (mirroring dataset naming). The default
`ssdtools::ssd_min_pmix` is stored as `"ssd_min_pmix"`:

```{r}
#| label: min-pmix
scenario$fit$min_pmix
```

The resolved single-argument *function* is additionally **materialised on the
scenario** at construction (a name-string is resolved then, from `ssdtools` or
the caller's environment, failing fast if it cannot be resolved). It rides along
for execution and is retrieved by name with the `scenario_min_pmix()` accessor;
datasets are reached the same way with `scenario_dataset()`:

```{r}
#| label: accessors
identical(scenario_min_pmix(scenario, "ssd_min_pmix"), ssdtools::ssd_min_pmix)
head(scenario_dataset(scenario, "boron"), 3)
```

Because the *name* drives hashing while the *value* only rides along for
execution, two scenarios with the same `min_pmix` name but different function
bodies produce byte-identical task identities --- the split that lets a cluster
worker resolve `min_pmix` off the transported scenario with no shared
interactive environment.
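
The name/value split can be illustrated without the package. In the sketch below every name (`make_scenario()`, `task_key()`, the `identity` and `payload` fields, and the key format) is hypothetical and chosen for illustration only; the point is that a key built from names alone cannot change when a function body does.

```r
# Hypothetical scenario shape: `identity` is what would be hashed, `payload`
# only rides along for execution.
make_scenario <- function(min_pmix_name, min_pmix_fun) {
  list(
    identity = list(min_pmix = min_pmix_name),
    payload = stats::setNames(list(min_pmix_fun), min_pmix_name)
  )
}

# Hypothetical task key built from identity fields only.
task_key <- function(scenario, dataset, sim) {
  sprintf(
    "dataset=%s/sim=%d/min_pmix=%s",
    dataset, sim, scenario$identity$min_pmix
  )
}

# Same name, different function bodies.
first <- make_scenario("my_pmix", function(n) 0.1)
second <- make_scenario("my_pmix", function(n) min(0.5, 3 / n))

key_first <- task_key(first, "boron", 1L)
key_second <- task_key(second, "boron", 1L)
identical(charToRaw(key_first), charToRaw(key_second)) # same bytes

# A worker looks the function up by name on the transported object.
first$payload[["my_pmix"]](10)
second$payload[["my_pmix"]](10)
```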

### The `ci = FALSE` rule

When `ci = FALSE` is the *only* confidence-interval value, the bootstrap-only
knobs (`nboot`, `ci_method`, `parametric`) are meaningless, so passing any of
them is an error:

```{r}
#| label: ci-false-error
#| error: true
ssd_define_scenario(
  ssddata::ccme_boron,
  nsim = 2L,
  seed = 1L,
  ci = FALSE,
  nboot = 1000
)
```

To ask for *both* a point estimate and bootstrap intervals --- the common case
--- set `ci = c(FALSE, TRUE)` (as in the scenario above) rather than a bare
`ci = TRUE`. This keeps the study symmetric: the `ci = FALSE` row is retained
and collapses its bootstrap knobs to `NA` at expansion time, while the
`ci = TRUE` rows fan out across the full bootstrap grid.
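
A base R sketch of that collapse, under the assumption that it behaves like a conditional cross-join: `expand_hc_grid()` is a hypothetical helper, not a package function, and it covers only `ci`, `nboot`, and `ci_method` to keep the example short.

```r
# Hypothetical helper: one NA-filled row for ci = FALSE, a full cross-join of
# the bootstrap knobs for ci = TRUE.
expand_hc_grid <- function(ci, nboot, ci_method) {
  parts <- list()
  if (any(!ci)) {
    parts$point <- data.frame(
      ci = FALSE,
      nboot = NA_integer_,
      ci_method = NA_character_,
      stringsAsFactors = FALSE
    )
  }
  if (any(ci)) {
    parts$boot <- expand.grid(
      ci = TRUE,
      nboot = nboot,
      ci_method = ci_method,
      KEEP.OUT.ATTRS = FALSE,
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, unname(parts))
}

hc_grid <- expand_hc_grid(
  ci = c(FALSE, TRUE),
  nboot = c(10L, 100L),
  ci_method = "weighted_samples"
)
hc_grid
```

With the knobs used in this vignette, the sketch yields three rows: one `ci = FALSE` row with `NA` bootstrap knobs, and one `ci = TRUE` row per `nboot` value.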

## Stage 3: expand into task tables

`ssd_scenario_tasks()` expands the scenario into the three per-step task tables
--- `sample`, `fit`, and `hc` --- bundled into a task set. This is
still RNG-free: no draws happen here.

```{r}
#| label: tasks
tasks <- ssd_scenario_tasks(scenario)
tasks
```

The counts show how each step fans out from its parent. The key design choice
is that the single expensive **random draw** is its own `sample` task, keyed
only by `(dataset, sim, replace)`. The draw is of `n_max = max(nrow)` rows;
every `nrow` value is then a cheap, RNG-free `head()` *sub-truncation* of that
one draw, done inline at the `fit` step. So `nrow` is an ordinary cross-join
axis of the `fit` step and never multiplies the underlying draw --- one draw is
shared across all sample sizes.
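
The draw-once, truncate-many idea can be sketched in base R. The example is illustrative only: `pool` is a made-up dataset, and `set.seed()` with `sample.int()` stands in for whatever RNG the package uses.

```r
# Hypothetical source data with the conventional `Conc` column.
pool <- data.frame(Conc = exp(seq(0, 3, length.out = 28)))

nrow_values <- c(5L, 10L, 20L)
n_max <- max(nrow_values)

# The only step that consumes random numbers: one draw of n_max rows.
set.seed(42)
draw <- pool[sample.int(nrow(pool), n_max, replace = TRUE), , drop = FALSE]

# Every sample size is an RNG-free prefix of that single draw.
subsamples <- lapply(nrow_values, function(n) head(draw, n))

vapply(subsamples, nrow, integer(1))
# The smallest sample is the first rows of the largest one.
identical(subsamples[[1]]$Conc, head(subsamples[[3]]$Conc, 5L))
```

Because the smaller samples are prefixes of the larger ones, adding another `nrow` value no larger than the current maximum leaves the existing samples unchanged in this sketch.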

Each row carries a path-style `<step>_id` primary key (the Hive partition path)
plus its parent step's id as a foreign key, so dependencies are explicit and
joinable:

```{r}
#| label: sample-table
tasks$sample
```
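
To make the key structure concrete, here is a base R sketch that builds Hive-style `key=value` paths and uses the parent id as a foreign key. The exact id format the package emits may differ; the column layout and separators below are assumptions for illustration.

```r
# Sample tasks keyed by (dataset, sim, replace).
sample_tasks <- expand.grid(
  dataset = c("boron", "cadmium"),
  sim = 1:3,
  replace = FALSE,
  stringsAsFactors = FALSE
)
sample_tasks$sample_id <- sprintf(
  "dataset=%s/sim=%d/replace=%s",
  sample_tasks$dataset, sample_tasks$sim, sample_tasks$replace
)

# With no shared columns, merge() returns the Cartesian product, so each
# sample task is crossed with every nrow value.
fit_tasks <- merge(
  sample_tasks["sample_id"],
  data.frame(nrow = c(5L, 10L, 20L))
)
# The child id extends the parent path; sample_id stays as the foreign key.
fit_tasks$fit_id <- paste0(fit_tasks$sample_id, "/nrow=", fit_tasks$nrow)

nrow(sample_tasks)
nrow(fit_tasks)
head(fit_tasks, 3)
```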

The `ci = FALSE` collapse is visible in the `hc` table: the `ci = FALSE` rows
carry `NA` for the bootstrap-only knobs and are not multiplied across them,
while the `ci = TRUE` rows fan out fully.

```{r}
#| label: hc-table
tasks$hc
```

The per-step derivations are also available individually
(`ssd_scenario_sample_tasks()`, `ssd_scenario_fit_tasks()`,
`ssd_scenario_hc_tasks()`) when you only need one table.

## Stage 4: run the baseline loop

`ssd_run_scenario_baseline()` is the no-frills runner. It executes the three
task tables in dependency order, threading each task's result forward to its
children by the foreign-key id. It runs **in-process**, with no `targets`, no
shard grouping, and no Parquet I/O.

The run **is reproducible without an external seed**, so the caller never needs
to seed the RNG first: the runner opens one `local_dqrng_backend()` scope and
seeds each task exactly once from `scenario$seed` plus a per-task primer
(`task_primer()` over the task's identity). Two runs of a scenario with a fixed
`seed` therefore yield identical results, regardless of the order in which
tasks run:

```{r}
#| label: run-baseline
out <- ssd_run_scenario_baseline(scenario)
names(out)
```
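
Why per-task seeding makes results independent of execution order can be shown with a toy example. Everything here is a stand-in: `toy_primer()` is a hypothetical substitute for `task_primer()`, base R's `set.seed()` replaces the dqrng backend, and combining the root seed and primer by addition is an assumption made for the sketch.

```r
# Hypothetical primer: a deterministic integer derived from the task id.
toy_primer <- function(task_id) {
  codes <- utf8ToInt(task_id)
  as.integer(sum(codes * seq_along(codes)) %% 100000L)
}

# Each task seeds itself once from the root seed and its own primer.
run_task <- function(task_id, root_seed) {
  set.seed(root_seed + toy_primer(task_id))
  runif(3)
}

ids <- c("dataset=boron/sim=1", "dataset=boron/sim=2", "dataset=cadmium/sim=1")

forward <- lapply(ids, run_task, root_seed = 42L)
# Run the tasks in reverse, then restore the original order for comparison.
backward <- rev(lapply(rev(ids), run_task, root_seed = 42L))

identical(forward, backward) # the order of execution does not matter
```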

Each element of `out` is the corresponding task table augmented with a list column of
per-task results --- `sample` draws, `fits` objects, and `hc` tibbles (the
`fit` step truncates its sample inline before fitting):

```{r}
#| label: hc-out
out$hc
```
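
The way a child step reaches its parent's result through the foreign key can be sketched with plain data frames and list columns. The tables, ids, and the use of `mean()` as a stand-in for model fitting are all hypothetical.

```r
# Parent step: a task table with its results in a list column.
sample_out <- data.frame(sample_id = c("s1", "s2"), stringsAsFactors = FALSE)
sample_out$sample <- list(c(1.2, 3.4, 5.6, 7.8), c(0.5, 2.5, 4.5, 6.5))

# Child step: each row names its parent through `sample_id`.
fit_tasks <- data.frame(
  fit_id = c("s1/nrow=2", "s1/nrow=3", "s2/nrow=2"),
  sample_id = c("s1", "s1", "s2"),
  nrow = c(2L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Resolve the foreign key, truncate the parent's sample inline, then "fit".
parent_row <- match(fit_tasks$sample_id, sample_out$sample_id)
fit_out <- fit_tasks
fit_out$fit <- Map(
  function(row, n) mean(head(sample_out$sample[[row]], n)),
  parent_row,
  fit_tasks$nrow
)
fit_out
```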

Unnest the hazard-concentration estimates back onto their task identities to
get a tidy results table. `proportion` is not a task axis --- it is passed
whole to each `hc` call, so every result already carries its own `proportion`
column:

```{r}
#| label: unnest
hcs <- tidyr::unnest(
  out$hc[c("dataset", "sim", "nrow", "ci", "est_method", "hc")],
  hc,
  names_sep = "_"
)
hcs[c("dataset", "sim", "nrow", "ci", "hc_proportion", "hc_est")]
```

## Next: shards and the targets pipeline

The baseline runner threads results in memory. The next layer materialises each
step as **Hive-partitioned Parquet shards** grouped by the scenario's
`partition_by` axes, and links steps by reading parent shards back --- the
storage hand-off the cluster `targets` pipeline uses. The
["Running a sharded pipeline"](sharded-pipeline.html) vignette covers that: the
single-core `ssd_run_scenario_shards()` (whose results are byte-identical to
the baseline runner's) and the shipped `targets` template.

Still ahead on the roadmap (`TARGETS-DESIGN.md` §12): a per-scenario manifest,
cloud upload, shard-completeness assertions, and the crew/SLURM controller. As
each lands, the vignettes grow to cover it.
