---
title: "Rescaling and Distribution Fitting"
author: "Joe Thorley and Rebecca Fisher"
date: '`r format(Sys.time(), "%Y-%m-%d", tz = "UTC")`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Rescaling and Distribution Fitting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

By default, `ssdtools` does not rescale data when fitting distributions, so the parameter estimates can be used directly to estimate the HCx values.
However, if `rescale = TRUE` in the `ssd_fit_dists()` or `ssd_fit_burrlioz()` functions, the data are rescaled by dividing by the geometric mean of the minimum and maximum positive finite values, which may aid model fitting in some instances.

To examine the extent to which rescaling improves model fitting, we fit the `r length(ssdtools::ssd_dists_all())` distributions with valid likelihoods currently implemented in `ssdtools` to the `r length(unique(envirotox::envirotox_acute$Chemical))` acute datasets in the `envirotox` R data package, with and without rescaling.

## Methods

The R code that performs the analysis is as follows.
Consistent with the default settings in `ssdtools`, a distribution was considered to have fitted successfully if it converged, irrespective of whether the likelihood-based standard errors of the estimates were computable or whether a parameter was at a boundary.

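As an aside, the rescaling described in the introduction can be sketched in base R. The helper `rescale_conc()` and the example concentrations below are hypothetical and for illustration only; this is not the internal `ssdtools` implementation, which may differ in detail.

```r
# Hypothetical helper (not part of ssdtools): divide the concentrations by the
# geometric mean of the minimum and maximum positive finite values.
rescale_conc <- function(x) {
  ok <- is.finite(x) & x > 0              # keep only positive finite values
  scale <- sqrt(min(x[ok]) * max(x[ok]))  # geometric mean of min and max
  x / scale
}

# Made-up concentrations spanning four orders of magnitude.
conc <- c(0.1, 1, 10, 1000)
conc_rescaled <- rescale_conc(conc)

# After rescaling, the minimum and maximum are reciprocals of each other,
# so the data are centred on 1 on the log scale.
min(conc_rescaled) * max(conc_rescaled)
```

The analysis chunk that follows does not use this helper; it relies on the `rescale` argument of `ssd_fit_dists()`.
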
```{r}
# All distributions with valid likelihoods currently implemented in ssdtools.
dists <- ssdtools::ssd_dists_all()

# Fit the distributions to one dataset; wrap the result in a list so it can be
# stored in a list-column. Non-computable standard errors and parameters at a
# boundary are accepted, so only non-convergence counts as a failed fit.
fit_dists <- function(data, d, r) {
  list(ssdtools::ssd_fit_dists(
    data = data, dists = d, rescale = r,
    computable = FALSE, at_boundary_ok = TRUE, silent = TRUE
  ))
}

# One row per chemical; record the names of the distributions that fitted
# without and with rescaling, then drop the fit objects themselves.
data <- envirotox::envirotox_acute |>
  dplyr::nest_by(Chemical) |>
  dplyr::mutate(
    ssd_fit_unscale = fit_dists(.data$data, d = dists, r = FALSE),
    ssd_fit_rescale = fit_dists(.data$data, d = dists, r = TRUE),
    dists_unscale = list(names(ssd_fit_unscale)),
    dists_rescale = list(names(ssd_fit_rescale))
  ) |>
  dplyr::select(!c(ssd_fit_unscale, ssd_fit_rescale))

# Percentage of datasets to which each distribution fitted without rescaling.
unscaled <- data |>
  dplyr::select(Chemical, Distribution = dists_unscale) |>
  tidyr::unnest(Distribution) |>
  dplyr::ungroup() |>
  dplyr::count(Distribution) |>
  dplyr::mutate(n = n / nrow(data) * 100) |>
  dplyr::select(Distribution, Unscaled = n)

# Percentage of datasets to which each distribution fitted with rescaling.
rescaled <- data |>
  dplyr::select(Chemical, Distribution = dists_rescale) |>
  tidyr::unnest(Distribution) |>
  dplyr::ungroup() |>
  dplyr::count(Distribution) |>
  dplyr::mutate(n = n / nrow(data) * 100) |>
  dplyr::select(Distribution, Rescaled = n)

results <- unscaled |>
  dplyr::inner_join(rescaled, by = "Distribution")
```

## Findings

```{r, echo = FALSE, tab.cap="The percentage of the acute datasets in the envirotox R data package to which the distribution was successfully fitted."}
knitr::kable(results, digits = 1)
```

The results indicate that, for the `r length(unique(envirotox::envirotox_acute$Chemical))` acute datasets considered, rescaling has little to no effect on the fitting rate of the currently implemented distributions with valid likelihoods, with one exception.
The exception is the gompertz distribution, for which the fitting rate increases from `r round(results$Unscaled[results$Distribution == "gompertz"], 1)` % to `r round(results$Rescaled[results$Distribution == "gompertz"], 1)` %.
Despite the substantial improvement for the gompertz distribution, its fitting rate is still only approximately `r round(results$Rescaled[results$Distribution == "gompertz"], -1)` %, which is insufficient to warrant reconsidering its inclusion in the default set.

## Recommendations

Rescaling the data has little to no effect on the fitting rate for the models in the default set.
Consequently, we recommend that `ssd_fit_dists()` and `ssd_fit_burrlioz()` continue to use `rescale = FALSE` as the default value and that it remain the fixed option in the `ssd_fit_bcanz()` function.

## Session Info

The results were generated with the following packages.

```{r, echo = FALSE}
sessioninfo::session_info()
```