ANZTOX data processing

Summary

This article explains how the anztox dataset was derived. The dataset contains machine-curated species-level toxicity data, based on the ANZTOX SQL database originally served by the QLD government and subsequently served on the SETAC website. The species-level toxicity data were screened and aggregated following the documented principles of ANZECC and ARMCANZ (2000) and the post-2000 ANZG guideline revision process (Warne et al. 2015, 2018, 2025). Exact reproduction of the original guideline datasets was not possible due to undocumented expert judgement and manual processing steps that cannot be fully captured algorithmically; however, a transparent and fully reproducible approximation was applied.

In brief, for each species, chemical, and water type, toxicity data were first filtered to records where a pre-populated (in the original source SQL) concentrationused value was available. Salt and alternative chemical forms were mapped to their parent compound CAS numbers before grouping, so that data from toxicologically equivalent forms (e.g. cadmium chloride, cadmium sulfate → cadmium) contribute to a single SSD. Data were then categorised as chronic, subchronic, or acute; where chronic data were available for a species, subchronic and acute data were discarded; where only subchronic data were available, they were reclassified as chronic. For species where only acute data existed, those acute LC/EC/IC50 values were converted to chronic negligible-effect equivalents by dividing by a default acute-to-chronic ratio (ACR) of 10, following Warne et al. (2025) Section 3.4.2.2. Toxicity values were then summarised using geometric means within endpoint categories, and the most sensitive endpoint was selected. The minimum eligibility threshold for SSD fitting is ≥ 5 species from ≥ 4 distinct major taxonomic groups.

Two datasets were processed and combined in a single pipeline: - toxicityvalue2000 (17,755 records), originally sourced from , a Microsoft Access database distributed on CD-ROM with the ANZECC & ARMCANZ Water Quality Guidelines.

  • toxicityvalue2016 (2,794 records), which is the data underpinning the subsequent 2015 and 2018 updates to the water quality guidelines .

Both datasets are cleaned, harmonised, and combined into toxicityvalue_combined_clean, which feeds a single shared SSD eligibility workflow producing anztox_data.


Pipeline overview

All processing is implemented in a single consolidated script:

data-raw/anztox/raw/DATASET.R

The script prepares ecotoxicology data from the ANZTOX database for Species Sensitivity Distribution (SSD) analysis — the statistical method used to derive Guideline Values (DGVs) for water quality. The pipeline has five main stages: 1. Helpers — Three normalisation functions are defined to ensure consistent matching keys: normalize_cas strips CAS numbers to digits only, normalize_mediatype maps variants to “Freshwater” or “Marine”, and normalize_name lowercases and strips punctuation from chemical names. 2. Inputs — Two additional files are read: a CAS parent lookup table (a csv mapping chemical variants/salts to a parent compound, see details below) and a master DGV table from the ANZG Water Quality Guideline website: https://www.waterquality.gov.au/sites/default/files/documents/toxicants-dgvs-mastertable-april2026.xlsx. 3. Database extraction — Connects to the local PostgreSQL database, reads the toxicityvalue2000 and toxicityvalue2016 tables, and de-normalises them by joining all the lookup tables (species, chemical, mediatype, endpoint, effect, testtype, etc.) to produce two clean flat tables. 4. SSD eligibility & nesting — The two clean datasets are combined, chemicals are grouped to their parent CAS via the lookup, then filtered to keep only valid concentration records. A test type priority rule is applied (chronic preferred over subchronic, subchronic over acute, per species/chemical/endpoint group). Concentrations are geometric-mean averaged per species and endpoint, then the most sensitive endpoint per species is selected. Only chemical×mediatype combinations with ≥5 species from ≥4 major groups are retained (the minimum required for an SSD based on Warne et al. (2025)). These are then nested into a list-column ready for SSD fitting. 5. DGV matching — The SSD-eligible chemicals are matched back against the published 2000 DGVs to see how many of the historical guidelines can be reproduced from the database data. A summary table categorises matches and unmatched DGVs by likely derivation method.


Why exact reproduction of the ANZECC & ARMCANZ 2000 guideline input data is not possible

Despite access to a rich relational database that includes the toxicity values collated to support the 2000 guideline derivation, several steps in the original ANZECC and ARMCANZ (2000) workflow cannot be replicated exactly:

Expert judgement and undocumented exclusions

The original guideline derivation involved expert review of studies for methodological acceptability. Decisions were sometimes made by inspection (e.g. excluding anomalous studies, preferring one endpoint over another) without being fully codified in machine-readable rules.

Endpoint selection within species

For each species, the most sensitive ecologically relevant endpoint was selected. This often required subjective judgement across life stage, exposure duration, and endpoint comparability (e.g. EC10 vs NOEC). The database retains all candidate endpoints but does not encode which one was ultimately chosen. Here we algorithmically selected the lowest geometric mean across available endpoints for a given species and chemical.

Temporal and contextual decisions

Some values were included or excluded based on context (e.g. relevance to Australian/New Zealand waters, water chemistry applicability), which cannot be inferred purely from database fields. Therefore, this dataset should be regarded as a curated candidate pool, not the exact SSD input dataset used in the 2000 guidelines. This data should not be used to derive water quality guidelines directly without further manual data curation and expert review.


What can be approximated

The following principles are reproducible and align well with the documented ANZECC and ARMCANZ (2000) and ANZG framework (Warne et al. 2015, 2018, 2025):

  • Use only records where a pre-calculated concentrationused value is present — this field is assumed to represent the database curators’ judgement that a value is suitable for guideline derivation
  • Map salt and alternative chemical forms to parent compound CAS numbers using a curated lookup table (cas_parent_lookup) before grouping, so toxicologically equivalent forms contribute to a single SSD
  • Prefer chronic endpoints where available; discard subchronic and acute data for a species if any chronic data exist; reclassify subchronic as chronic where no chronic data exist
  • For species where only acute data are available, convert the retained acute LC/EC/IC50 values to chronic negligible-effect equivalents by dividing by a default ACR of 10 (Warne et al. (2025), Section 3.4.2.2; ANZECC and ARMCANZ (2000))
  • Collapse to one concentration value per species using geometric means within endpoint categories (applied to ACR-converted concentrations where relevant), then select the most sensitive endpoint
  • Retain only chemical–media combinations with ≥ 5 species from ≥ 4 distinct major taxonomic groups

Filtering approach: concentrationused

The primary data filter uses the concentrationused field rather than status == 'A' and concentrationcode %in% c("U", "C", "HC"). The rationale is:

  • concentrationused is a pre-calculated field populated by the database curators specifically to flag values appropriate for guideline derivation; here we assume its presence is a direct expression of data acceptability
  • Filtering on !is.na(concentrationused) & concentrationused > 0 captures a broader set of accepted records compared to the stricter status/concentrationcode approach (available in the 2000 dataset), reflecting that the database encodes acceptability more richly than those fields alone
  • For toxicityvalue2016, all 2,794 rows have concentrationused populated — the filter passes all rows from this dataset

Structural differences between the 2000 and 2016 databases

The two source tables have different structures, which affects how each is cleaned:

Table 1
Aspect toxicityvalue2000 toxicityvalue2016
Chronic/acute flag `testtype_id` join → `testtype` text field `ischronic` boolean — direct, no join needed
`concentrationused` Some NA — filter on `!is.na()` All rows populated — no filter needed
Endpoint field `endpoint_id` → `endpoint.name` `endpointmeasurement_id` + `endpointfrompaper_id`, coalesced
Endpoint vocabulary Standardised 2000 codes (`MORT`, `GRO`, etc.) Free-text 2016 labels — mapped to 2000 codes via `endpoint_2016_to_2000_lookup`
Reference / status `reference_id`, `status_id` joins Not present — `datasource` text field instead, very limited coverage
Water chemistry fields `temperature`, `ph`, `hardness` present Not present
`guidelinegroup_id` Not present Present — links to pre-computed guideline values

The ischronic boolean in the 2016 dataset makes the chronic preference step cleaner and unambiguous compared to the 2000 pipeline’s testtype join.


Endpoint harmonisation between 2016 and 2000

Because the 2016 dataset uses a different and less standardised endpoint vocabulary, a mapping table is required before the two datasets can be combined. This is handled by:

data-raw/anztox/raw/endpoint_2016_to_2000_lookup.csv

Built by:

data-raw/anztox/raw/endpoint_2016_to_2000_lookup_build.R

The lookup maps each distinct 2016 endpoint label (endpoint_2016_raw) to its nearest equivalent 2000 code (endpoint_2000_code). It is applied within the toxicityvalue2016_clean build step in the main pipeline. A endpoint_mapping_missing flag is added to 2016 rows where no mapping was found, so these can be identified in downstream analysis.

The lookup is validated for uniqueness at load time in DATASET.R — a duplicate endpoint_2016_raw entry will throw an explicit stop() error rather than silently producing incorrect joins.

See the Endpoint Lookup Build vignette for details.


Chemical identifier resolution and CAS parent grouping

cas_parent_lookup

Many chemicals in the ANZTOX database are recorded as specific salt or alternative chemical forms (e.g. cadmium chloride, copper sulfate) whose toxicity arises from the parent compound (e.g. cadmium, copper). Based in expert judgement the ANZECC and ARMCANZ (2000) were often derived at the parent compound level. There appeared to be no available field in the supplied ANZTOX database to indicate which chemical and CAS numbers were grouped for any given DGV. To align with this, a curated lookup table maps variant CAS numbers to their parent compound CAS numbers before data are grouped.

File Description
cas_parent_lookup.csv Curated lookup; columns: chemicalname, casnumber, parent_cas_dashed, parent_casnumber, parent_name, match_rationale

The lookup was derived by:

  1. Identifying candidates: CAS numbers in the combined toxicity data that did not match the DGV 2000 master table were identified, then classified them by name pattern (chloride, sulfate, oxide, nitrate, etc.)
  2. Assigning parent CAS numbers (LLM-assisted, not fully reproducible): Parent CAS numbers were supplied by an LLM (GitHub Copilot using Claude Sonnet 4.6) drawing on its training data (chemistry textbooks, PubChem, ChemSpider, Wikipedia etc.). For example, the model supplies that cadmium chloride (CAS 10108-64-2) dissociates to give cadmium ion, and that elemental cadmium is CAS 7440-43-9. There is no derivation script for this step — the assignments exist only as rows in the CSV.
  3. Extending coverage: The same LLM-assisted approach was used to extend the table to cyanide salts, arsenite, vanadate, dichromate, nitrate, and hypochlorite forms.

Because steps 2 and 3 rely on LLM recall rather than a verifiable algorithm, the parent CAS assignments must be treated as a first draft requiring expert chemical verification. Errors or hallucinated CAS numbers are possible.

Note for domain experts: cas_parent_lookup.csv is intended to be reviewed and extended by chemists. Each row has a match_rationale column explaining the basis for the mapping. The recommended verification approach is to check each parent_casnumber against commonchemistry.cas.org or your institutional SciFinder/CAS Registry access.

After any modifications to the lookup, re-run DATASET.R from scratch to regenerate all outputs. We welcome pull requests from domain experts. If you would like to contribute without GitHub experience, please post an issue on the issues page and we will facilitate your contribution.

Grouped identifier columns

After the parent CAS join, two derived columns are added to toxicityvalue_combined_clean:

Column Description
casnumber_grouped coalesce(parent_casnumber, casnumber) — parent CAS where a mapping exists, otherwise original CAS
chemicalname_grouped coalesce(parent_name, commonname) — parent name where a mapping exists, otherwise original common name

These replace the original CAS/name fields as the primary chemical grouping keys in all downstream processing.

Chemical identifiers available post-mapping

Identifier Notes
casnumber_grouped Primary grouping key; each parent CAS maps to exactly one chemical name post-mapping
chemicalname_grouped 1-to-1 with casnumber_grouped; used as the chemical display name in nested outputs
commonname Original database common name (2000 dataset); may differ from parent name

Note that the 2016 dataset uses IUPAC chemical names (e.g. "6-chloro-n2,n4-diethyl-1,3,5-triazine-2,4diamine") while the 2000 dataset uses common names (e.g. "Simazine"). In toxicityvalue_combined_clean, chemicalname_grouped will carry whichever name is available from the parent lookup, with the original name as fallback.


Reference Data for 2016 Toxicity Values

The 2016 dataset presents a challenge for reference attribution: unlike the 2000 dataset, which includes direct reference_id foreign keys to the reference table, 2016 records contain only datasource and record text fields without explicit reference linkage.

Reference Resolution Strategy

A best-effort reference matcher was implemented:

  1. Primary extraction: Numeric IDs are parsed from the datasource and record fields using regex (\d+).
  2. Lookup: These extracted IDs are matched against lu_reference$orgrefnumber to retrieve bibliographic data (authors, year, title, journal, etc.).
  3. Preference order: datasource-derived matches are preferred over record-derived matches.
  4. Fallback: If a match is found, it is formatted using the same reference_bib convention as the 2000 dataset.

Coverage

Of 2,794 records in the 2016 dataset: - 22 records (0.79%) have valid reference matches - 2,772 records (99.21%) remain unmatched

The low coverage reflects the database schema design: 2016 datasource and record codes do not reliably map to orgrefnumber values in the reference table. Remaining unmapped records retain NA in the reference_bib column.

SSD eligibility workflow

The shared SSD eligibility workflow is applied to toxicityvalue_combined_clean and produces ssd_species_eligible_combined and anztox_data. The steps are:

1. Base filters

Records are excluded where:

  • concentrationused is NA or ≤ 0
  • scientificname is NA
  • testtype is NA
  • mediatype is NA (i.e. not “Freshwater” or “Marine”)
  • casnumber_grouped is NA

2. Test type classification and priority selection

Each record is classified into one of four test classes based on the normalised testtype field:

Table 2
Test class testtype values matched
`chronic` `"chronic"`
`subchronic` `"sub-chronic"`, `"subchronic"` (regex)
`acute` `"acute"`
`other` anything else

Within each combination of chemical × mediatype × species × endpoint, a priority rule is applied:

  • If any chronic data exist for that group, retain only chronic records
  • If no chronic data exist but subchronic data do, retain only subchronic records (reclassified as “Chronic”)
  • If neither chronic nor subchronic data exist, retain acute records

This implements the ANZECC & ARMCANZ preference for chronic data without discarding data from groups where only acute data are available.

3. ACR conversion for retained acute records

After the priority selection step, some species × endpoint groups will have had only acute data retained (i.e. no chronic or subchronic data existed for that species). The concentrationused field in ANZTOX stores the raw acute concentration for these records — it does not pre-apply an ACR. Per Warne et al. (2025) Section 3.4.2.2 and ANZECC & ARMCANZ (2000), acute LC/EC/IC50 values must be divided by an ACR before inclusion in a chronic SSD.

The script applies a default ACR of 10 (stored as ACR_DEFAULT <- 10 at the top of the script) to all retained acute records:

concentration_chronic_equiv = if_else(
  acr_applied,            # TRUE for records still classed as "acute"
  concentrationused / ACR_DEFAULT,
  concentrationused
)

The acr_applied flag is retained in the summarised output as n_acute_converted (count of acute rows that contributed to each geometric mean), so downstream users can identify which species values were ACR-derived. The acr_summary diagnostic table in results_combined reports, per chemical × mediatype, the proportion of species rows that involved ACR conversion.

Why a default ACR of 10? Chemical- or taxon-specific ACRs are preferable where they can be calculated from matched acute/chronic pairs in the same database (Warne et al. 2025). The ANZTOX database does not currently provide a pre-calculated ACR field, and deriving species × chemical ACRs algorithmically is outside the current scope. The default of 10 is the value recommended by Warne et al. (2025), ANZECC & ARMCANZ (2000), U.S. Environmental Protection Agency (1986) and Organisation for Economic Co-operation and Development (1992) where no specific ACR is available.

Important caveat: The default ACR of 10 is conservative for some chemicals (where acute and chronic toxicity are similar) and may be insufficiently conservative for others. Warne et al. (2025) note that the default should be applied with care for chemicals known to have similar acute and chronic toxicity, essential elements that can be physiologically regulated, and chemicals where acute toxicity is the primary concern (e.g. chlorine).

4. Geometric mean concentration per species × endpoint

Within each combination of chemical × mediatype × species × endpoint × testtype, all retained concentration values (after ACR conversion where applicable, stored in concentration_chronic_equiv) are averaged using the geometric mean (exp(mean(log(concentration_chronic_equiv)))). The source datasets contributing to each average are recorded in source_datasets (semicolon-separated, e.g. "2000;2016").

5. Most sensitive endpoint selection

For each species within a chemical × mediatype group, the single endpoint with the lowest endpoint_concentration is retained (slice_min(..., with_ties = FALSE)). This implements the most-sensitive-endpoint selection principle.

6. Minimum species and taxonomic group threshold

Chemical × mediatype combinations are retained only where the eligible species set contains:

  • ≥ 5 distinct scientificname values, and
  • ≥ 4 distinct majorgroup values

Combinations not meeting both thresholds are dropped entirely.

7. Nesting

The eligible species table is nested by casnumber_grouped, chemicalname_grouped, and mediatype to produce anztox_data. Each row’s data column contains all species-level rows for that chemical × media combination, ready for SSD fitting.


DGV matching

Purpose

The DGV matching step assesses how well the SSD-eligible chemicals derived from the ANZTOX database correspond to the published 2000 Derived Guideline Values (DGVs). It does not modify the toxicity data; it is a validation and coverage diagnostic.

Name-based CAS fallback

Some DGV rows in the master table lack a CAS number (i.e. Tox CAS is blank or missing). A name-based fallback is used for these:

  1. A name_cas_lookup is built from both cas_parent_lookup and toxicityvalue_combined_clean, mapping normalised chemical names to their CAS keys
  2. Only unambiguous mappings are used — where a normalised name maps to exactly one CAS number (name_cas_unique)
  3. DGV rows with a missing raw CAS are joined to this lookup by normalised name; if a unique match is found, the matched CAS is used

The cas_fill_method column on dgv_method_flag_2000 records how the CAS key was obtained for each DGV row: "raw_cas", "name_fallback_unique", or "missing".

DGV matching summary

summary_2000_dgvs_combined is written to summary_2000_dgvs_combined.csv each time the pipeline runs. The table below is read directly from that file so it always reflects the most recent run.

Table 3
> **Note:** `summary_2000_dgvs_combined.csv` not found at `data-raw/anztox/raw/summary_2000_dgvs_combined.csv`. Run `DATASET.R` to generate it, then re-render this document.

The four categories are:

Category Meaning
Total 2000 DGVs All rows in dgv_method_flag_2000 with a 2000 publish date and valid mediatype
Matched to anztox_data DGV rows whose CAS × mediatype appears in the SSD-eligible combined data
Unmatched — likely SSD derived Unmatched rows where tox_reliability != "unknown" (suggests SSD derivation in the original guidelines)
Unmatched — other method Unmatched rows where tox_reliability == "unknown" (likely assessment factor or other method in the original guidelines)

Remaining unmatched chemicals

Chemicals in anztox_data with no corresponding 2000 DGV entry fall into three categories:

  1. Pseudo-CAS mixtures — substances with database-assigned CAS numbers that do not correspond to a single compound (e.g. linear alkylbenzene sulfonates, branched sulfonates). These cannot be resolved algorithmically.
  2. Genuinely absent from DGV 2000 — chemicals with sufficient ANZTOX data to pass the eligibility threshold but for which no SSD-based DGV was published in 2000 (e.g. glyphosate, simazine, trifluralin, boric acid, ammonia freshwater, sodium/potassium nitrate).
  3. Residual CAS formatting mismatches — a small number of chemicals where CAS number formatting differences between ANZTOX and the DGV master table prevent matching (e.g. allyl chloride, p-nitrophenol). These can be resolved by extending cas_parent_lookup or by manual review of dgv_missing_from_ssd_combined.

References

ANZECC, and ARMCANZ. 2000. Australian and New Zealand Guidelines for Fresh and Marine Water Quality. Australian; New Zealand Environment; Conservation Council; Agriculture; Resource Management Council of Australia; New Zealand.
Organisation for Economic Co-operation and Development. 1992. Report of the OECD Workshop on the Extrapolation of Laboratory Aquatic Toxicity Data to the Real Environment. OECD Environment Monograph No. 59, OCDE/GD(92)169. OECD.
U.S. Environmental Protection Agency. 1986. Quality Criteria for Water, 1986. EPA 440/5-86-001. U.S. Environmental Protection Agency, Office of Water Regulations; Standards.
Warne, M. StJ, G. E. Batley, R. A. van Dam, et al. 2015. Revised Method for Deriving Australian and New Zealand Water Quality Guideline Values for Toxicants. Department of Science, Information Technology; Innovation.
Warne, M. StJ, G. E. Batley, R. A. van Dam, et al. 2018. Revised Method for Deriving Australian and New Zealand Water Quality Guideline Values for Toxicants. Australian; New Zealand Governments.
Warne, M. StJ, G. E. Batley, R. A. van Dam, et al. 2025. Method for Deriving Australian and New Zealand Water Quality Guideline Values for Protecting Aquatic Ecosystems from Toxicants – Update of 2018 Version. Australian; New Zealand Governments.