Random effects with uneven groups

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(withr)
library(glmmTMB)
library(patchwork)
library(latex2exp)

devtools::load_all()

ℹ Loading galenR

# need consistent plot colors
mod_types <- c("cluster_rand_x", "cluster_fixed_x",
               "cluster_rand", "cluster_fixed",
               "cluster_raw", 'no_cluster') 
mod_pal <- make_pal(mod_types, palette = 'ggsci::default_uchicago')

xrange = c(0, 10)
# we'll want this for predicting the fits
xdata <- tibble(x = seq(from = min(xrange), to = max(xrange), by = diff(xrange)/100))

This is essentially a preamble to understanding the beta-binomial issues we’re having. We should be able to sharpen our intuition about what to expect if we start off fully Gaussian.

This builds on the outline I developed and the work Sarah did (students/Sarah_Taig/Random Effects Simulations) (though I think the emphasis will be different; we’ll see), as well as some beta-binom errorbar checks in caddis/Testing/Error_bars.qmd.

I think I’ll likely just do Gaussian here, then beta-binomial (bb) in a parallel doc. I’ll likely also do a model comparison, i.e. spaMM vs glmmTMB vs lme4::glmer as in caddis/Analyses/Testing/df_z_t_for_sarah/ (and add {brms}).

And will likely need to incorporate some assessments of the se being calculated at various scales, as explored in caddis/Analyses/Testing/Error_bars.qmd.

We’ll need to bring in real data at some point, but I think not in this doc (which is why it’s here, and some of the other testing is in caddis).

I think we’ll want to do some actual math here somewhere too, to show exactly how the random coefficients relate to the estimates of the fit and the clusters.
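
A first pass at that math, for the random-intercept-only case. This is the standard result for a Gaussian random-intercept model, and it treats the fixed effects and variance components as known. The symbols $\tau$ (random sd, i.e. `sd_rand_intercept`), $\sigma$ (residual sd, i.e. `obs_sigma`), $n_j$ (observations in cluster $j$) and $\lambda_j$ are my notation, not anything defined elsewhere in this doc.

```latex
\begin{align}
y_{ij} &= \beta_0 + \beta_1 x_j + u_j + \varepsilon_{ij},
  \qquad u_j \sim N(0, \tau^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2) \\
\hat{u}_j &= \lambda_j \left( \bar{y}_j - \beta_0 - \beta_1 x_j \right),
  \qquad \lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j} \\
\hat{\mu}_j &= \beta_0 + \beta_1 x_j + \hat{u}_j
  = \lambda_j \bar{y}_j + (1 - \lambda_j)\left( \beta_0 + \beta_1 x_j \right)
\end{align}
```

So the cluster estimate is a weighted average of the raw cluster mean and the line, and clusters with small $n_j$ get pulled further toward the line. With estimated rather than known variances, the fitted values should only approximately follow this.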

And extract the coefficients as Sarah did and see if they match what they should.

Are we just recapitulating this? Kind of. Slightly different emphasis and we’ll end up taking it further, but should be careful.

The data generation function lets us have random slopes. These are super relevant in some cases, e.g. following people or riffles through time, where observations might have different x-values within the cluster. I think I’ll largely ignore them here, because the situation we’re trying to address doesn’t have within-cluster variation in x (each cluster has a single x), so the in-cluster slopes are irrelevant. They could certainly be dealt with in this general exploration, but it would just make everything factorially complicated.
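
To make the random-intercept-only setup concrete, here is a minimal sketch of what such a generator might look like. `sim_clusters()` is a hypothetical stand-in written for illustration, not the package function used below, though the argument names echo the parameters used later.

```r
# Hypothetical stand-in for the data generator: random intercepts only,
# and a single x per cluster (so within-cluster slopes cannot be estimated).
sim_clusters <- function(n_per_cluster, cluster_x,
                         intercept = 1, slope = 0.5,
                         sd_rand_intercept = 0.5, obs_sigma = 1) {
  n_clusters <- length(cluster_x)
  stopifnot(length(n_per_cluster) == n_clusters)

  # one random deviation per cluster
  u <- rnorm(n_clusters, mean = 0, sd = sd_rand_intercept)

  # cluster index for every observation
  cluster <- rep(seq_len(n_clusters), times = n_per_cluster)

  data.frame(
    cluster = factor(cluster, levels = seq_len(n_clusters)),
    x = cluster_x[cluster],
    y = intercept + slope * cluster_x[cluster] + u[cluster] +
      rnorm(length(cluster), mean = 0, sd = obs_sigma)
  )
}

set.seed(2)
cluster_x <- runif(10, min = 0, max = 10)
dat <- sim_clusters(n_per_cluster = rep(5, 10), cluster_x = cluster_x)
head(dat)
```

Passing an uneven `n_per_cluster` gives the unbalanced case with no other changes.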

Impact of random and residual sd

Before we do anything with unbalanced clusters, let’s first just see how the random and residual sd alter the way fits work. More importantly, let’s establish some approaches to seeing how the random structure affects the outcome:

fits through points (i.e. the minimal data units, nested within ‘clusters’)
fits through clusters (i.e. the random units)
full model fit
cluster error bars on the clusters, i.e. from a fixed model of the clusters
cluster error bars from the random model. What is this? The random sd will be the error of each cluster around the line, and the residual sd will be the error within each cluster. Can I extract and plot that? I think that will actually be key to understanding what’s happening.

all of this from model fits.
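
The non-random pieces of that list can be sketched in base R. This is only an illustration under assumed values (10 balanced clusters of 5, the simulated data and all object names here are mine), not the analysis pipeline used below. It also shows why a fixed-effects model with both x and cluster is rank-deficient when each cluster has a single x, which is presumably where the "dropping columns from rank-deficient conditional model" messages come from.

```r
set.seed(2)
n_clusters <- 10
n_per <- 5
cluster_x <- runif(n_clusters, 0, 10)
u <- rnorm(n_clusters, sd = 0.5)

dat <- data.frame(cluster = factor(rep(seq_len(n_clusters), each = n_per)))
idx <- as.integer(dat$cluster)
dat$x <- cluster_x[idx]
dat$y <- 1 + 0.5 * dat$x + u[idx] + rnorm(nrow(dat), sd = 1)

# 1. Fit through points, ignoring the clusters entirely
fit_points <- lm(y ~ x, data = dat)

# 2. Fixed model of the clusters: one mean (with se) per cluster
fit_fixed <- lm(y ~ 0 + cluster, data = dat)
cluster_est <- data.frame(
  cluster = levels(dat$cluster),
  x = cluster_x,
  estimate = unname(coef(fit_fixed)),
  se = unname(sqrt(diag(vcov(fit_fixed))))
)

# 3. Fit through the cluster estimates
fit_clusters <- lm(estimate ~ x, data = cluster_est)

# 4. x is constant within cluster, so x + cluster is rank-deficient:
#    lm() reports one aliased coefficient as NA
fit_both <- lm(y ~ x + cluster, data = dat)
n_dropped <- sum(is.na(coef(fit_both)))

coef(fit_points)
coef(fit_clusters)
n_dropped
```

With balanced clusters the fit through points and the fit through cluster means give the same line; they only separate once cluster sizes differ.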

Do that for progressively more complex situations

Balanced N per cluster, vary resid and random variance

Set the parameters

with_seed(2,
          sdparams <- expand_grid(
            N = 50, 
            n_clusters = 10,
            cluster_N = 'fixed',
            intercept = 1, 
            slope = 0.5,
            obs_sigma = c(0.1, 1),
            sd_rand_intercept = c(0.5, 2),
            sd_rand_slope = 0, 
            rand_si_cor = 0,
            # putting this in a list lets me send in vectors that are all the same
            cluster_x = list(runif(n_clusters,
                                   min = min(xrange), max = max(xrange))),
            obs_x_sd = 0
          )
)

Generate the data

sddata <- with_seed(2,
                    make_analysed_tibble(sdparams, mod_pal)
)

dropping columns from rank-deficient conditional model: cluster9

Joining with `by = join_by(cluster)`

Warning: There were 4 warnings in `mutate()`.
The first warning was:
ℹ In argument: `fitted_x_lines = map(full_models, function(x) fit_x_full(x,
  xvals = xdata))`.
Caused by warning:
! SE for fits with fixed clusters are not correct
ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

Joining with `by = join_by(x, cluster)`

sdstacks <- extract_unnest(sddata, sdparams)

Joining with `by = join_by(group)`

The resulting data: each dot is an observation, and colors indicate the cluster (random unit).

sdstacks$points |> 
  ggplot(aes(x = x, y = y, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_point() +
  facet_grid(obs_sigma ~ sd_rand_intercept) +
  labs(color = 'Cluster ID')

Results

Main result

This is the model fit for the full model (line +- se), along with the estimates of the clusters +- se extracted from the full model fit. Gray points are the underlying observations.

sd_fit <- 
  # The estimates +- se out of the model.
  sdstacks$fitlines |>
  filter(model == 'cluster_rand_x') |> 
  ggplot(aes(x = x, color = model)) +
  # The raw observations
  geom_point(data = sdstacks$points, aes(y = y), color = 'grey20', alpha = 0.2) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  # The cluster estimates with se errorbars out of the model
  geom_point(data = sdstacks$fitclusters  |>
               filter(model == 'cluster_rand_x'),
             aes(y = estimate)) +
  geom_errorbar(data = sdstacks$fitclusters |>
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  # facet by random and residual sigmas
  facet_grid(obs_sigma ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(obs $\sigma$)'),
                                         breaks = NULL, labels = NULL))

sd_fit + theme(legend.position = 'none')

How did we do on the estimates? The error bars here are 95% CI of the estimates. The observation sd does not have an se (it’s the leftovers), so it has no error bars.

sdstacks$set_v_est |> 
  ggplot(aes(x = term, shape = type)) + 
  geom_point(mapping = aes(y = estimate)) +
  geom_errorbar(mapping = aes(ymin = cil, ymax = ciu), width = 0.1) +
  geom_point(mapping = aes(y = set_value), color = 'firebrick')  +
  facet_grid(obs_sigma ~ sd_rand_intercept, labeller = 'label_both') +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(obs $\sigma$)'),
                                         breaks = NULL, labels = NULL))

# The plot is nicer than the table
# knitr::kable(sdstacks$set_v_est |> 
#                select(-group, `2.5%` = cil, `97.5%` = ciu) |> 
#                mutate(across(where(is.numeric), \(x) round(x, digits = 3))))

Method comparison

Now, how does that compare with the raw cluster estimates or the fit through the data with no random structure? The dashed blue line is the fit through the raw cluster estimates for visualisation.

sd_method_comparison <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = sdstacks$fitlines |>
                filter(model %in% c('cluster_rand_x')),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = sdstacks$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster')),
            aes(y = estimate)) +
  geom_line(data = sdstacks$fitlinesc |>
              filter(model %in% c('cluster_raw')),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = sdstacks$fitclusters  |>
               filter(model %in% c('cluster_rand_x', "cluster_raw")),
             aes(y = estimate)) +
  geom_errorbar(data = sdstacks$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw")),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal)  +
  facet_grid(obs_sigma ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(obs $\sigma$)'),
                                         breaks = NULL, labels = NULL))

sd_method_comparison

Shrinkage

And here we can see the very different shrinkage. I guess what we really need to do is show how unbalanced shrinkage yields lines with different slopes.

sd_shrink <- sdstacks$shrink |> 
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)
  ) +
  geom_point(position = position_dodge(width = 0.1)) +
  geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(obs_sigma ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(obs $\sigma$)'),
                                         breaks = NULL, labels = NULL))

sd_shrink

Warning: `position_dodge()` requires non-overlapping x intervals.

Unbalanced groups

For now, we’ll just use two of the options from above, the sd_rand of 0.5 and 2 and sigma of 1, so the random sd is half or double the residual sd (i.e. the random variance is a quarter or four times the residual variance).

I’m tempted to include an ‘even’ option, but we can either just refer back to the previous or bind_rows the relevant rows.

with_seed(2,
          unbalparams <- expand_grid(
            N = 50, 
            n_clusters = 10,
            cluster_N = 'uneven',
            nobs_mean = 'fixed', # keep simple for now
            force_nclusters = TRUE,
            force_N = TRUE,
            # Definitely want 0, the others are essentially arbitrary
            nbsize = c(1, 5, 50),
            intercept = 1, 
            slope = 0.5,
            obs_sigma = c(1),
            sd_rand_intercept = c(0.5, 2),
            sd_rand_slope = 0, 
            rand_si_cor = 0,
            # putting this in a list lets me send in vectors that are all the same
            cluster_x = list(runif(n_clusters,
                                   min = min(xrange), max = max(xrange))),
            obs_x_sd = 0
          )
)

unbaldata <- with_seed(2, 
                       make_analysed_tibble(unbalparams, mod_pal)
)

dropping columns from rank-deficient conditional model: cluster9

Joining with `by = join_by(cluster)`

Warning: There were 6 warnings in `mutate()`.
The first warning was:
ℹ In argument: `fitted_x_lines = map(full_models, function(x) fit_x_full(x,
  xvals = xdata))`.
Caused by warning:
! SE for fits with fixed clusters are not correct
ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.

Joining with `by = join_by(x, cluster)`

# this can be more useful
unbal <- extract_unnest(unbaldata, unbalparams)

Joining with `by = join_by(group)`

Results

Data

Again, we look at the data, and now can see there are different numbers of points.

unbal$points |> 
  ggplot(aes(x = x, y = y, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_point() +
  facet_grid(nbsize ~ sd_rand_intercept) +
  labs(color = 'Cluster ID')

Probably worth being explicit here about the unevenness. I really do think I’m going to want to rethink this to use nbinom. Then once we get to our data, could parameterise based on what we see.

unbal$points |> 
  ggplot(aes(x = cluster, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_bar() +
  facet_grid(nbsize ~ sd_rand_intercept) +
  theme(legend.position = 'none')
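
On the nbinom idea, a minimal sketch of one way to draw uneven cluster sizes with a fixed total N and a fixed number of clusters. `draw_cluster_sizes()` is hypothetical, and the approach (negative-binomial weights followed by a multinomial allocation) is my assumption about a reasonable scheme, not a description of what the data generation function does with `nbsize`, `force_N` or `force_nclusters`.

```r
# Hypothetical helper: uneven cluster sizes that always sum to N and
# always give every cluster at least one observation.
draw_cluster_sizes <- function(N, n_clusters, nbsize) {
  stopifnot(N >= n_clusters)

  # Negative-binomial draws act as relative weights; small `nbsize`
  # means more overdispersion, so more uneven weights.
  # The + 1 keeps every weight strictly positive.
  w <- rnbinom(n_clusters, size = nbsize, mu = N / n_clusters) + 1

  # One observation per cluster is guaranteed; the remainder is
  # allocated across clusters in proportion to the weights.
  extra <- as.vector(rmultinom(1, size = N - n_clusters, prob = w))
  1L + extra
}

set.seed(2)
sizes <- lapply(c(1, 5, 50), function(s) {
  draw_cluster_sizes(N = 50, n_clusters = 10, nbsize = s)
})
names(sizes) <- paste0("nbsize_", c(1, 5, 50))
sizes
```

Once we get to real data, `nbsize` could be set from the observed spread of cluster sizes rather than picked arbitrarily.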

Main result

unbal_fit <- unbal$fitlines |>
  filter(model == 'cluster_rand_x') |> 
  ggplot(aes(x = x, color = model)) +
  geom_point(data = unbal$points, aes(y = y), color = 'grey20', alpha = 0.2) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  geom_point(data = unbal$fitclusters  |>
               filter(model == 'cluster_rand_x'),
             aes(y = estimate)) +
  geom_errorbar(data = unbal$fitclusters |>
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

unbal_fit + theme(legend.position = 'none')

How did we do on the estimates? The error bars here are 95% CI of the estimates. The observation sd does not have an se (it’s the leftovers), so it has no error bars.

unbal$set_v_est |> 
  ggplot(aes(x = term, shape = type)) + 
  geom_point(mapping = aes(y = estimate)) +
  geom_errorbar(mapping = aes(ymin = cil, ymax = ciu), width = 0.1) +
  geom_point(mapping = aes(y = set_value), color = 'firebrick') +
  facet_grid(nbsize ~ sd_rand_intercept, labeller = 'label_both')

Method comparison

Now we have some meaningful shrinkage and it can cause the lines to deviate. The dashed blue line is the fit through the raw cluster estimates for visualisation.

unbal_method_comparison <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = unbal$fitlines |>
                filter(model %in% c('cluster_rand_x')),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = unbal$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster')),
            aes(y = estimate)) +
  geom_line(data = unbal$fitlinesc |>
              filter(model %in% c('cluster_raw')),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = unbal$fitclusters  |>
               filter(model %in% c('cluster_rand_x', "cluster_raw")),
             aes(y = estimate)) +
  geom_errorbar(data = unbal$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw")),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

unbal_method_comparison

It can help to look at one of those more closely just to see what’s happening, and show the fit through the raw clusters as well.

unbal_method_comparison_single <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = unbal$fitlines |>
                filter(model %in% c('cluster_rand_x') &
                         nbsize == 1 & sd_rand_intercept == 2),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = unbal$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster') &
                       nbsize == 1 & sd_rand_intercept == 2),
            aes(y = estimate)) +
  geom_line(data = unbal$fitlinesc |>
              filter(model %in% c('cluster_raw') &
                       nbsize == 1 & sd_rand_intercept == 2),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = unbal$fitclusters  |>
               filter(model %in% c('cluster_rand_x', "cluster_raw") &
                        nbsize == 1 & sd_rand_intercept == 2),
             aes(y = estimate)) +
  geom_errorbar(data = unbal$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw") &
                           nbsize == 1 & sd_rand_intercept == 2),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

unbal_method_comparison_single
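
One way to see why the lines separate once clusters are uneven: with a single x per cluster, every one of these fits is a weighted regression through the cluster means, and only the weights differ. The sketch below uses made-up cluster sizes and base R only; the object names are mine, and the "random model" line uses the true variance components as a stand-in for what a fitted mixed model would estimate.

```r
set.seed(2)
n_j <- c(1, 1, 2, 2, 3, 4, 5, 7, 10, 15)  # made-up uneven sizes, sum to 50
tau <- 2      # random-intercept sd (assumed known here)
sigma <- 1    # residual sd (assumed known here)

cluster_x <- runif(length(n_j), 0, 10)
u <- rnorm(length(n_j), sd = tau)
cl <- rep(seq_along(n_j), times = n_j)

dat <- data.frame(
  cluster = cl,
  x = cluster_x[cl],
  y = 1 + 0.5 * cluster_x[cl] + u[cl] + rnorm(length(cl), sd = sigma)
)

# Raw cluster means, in cluster order
means <- data.frame(
  x = cluster_x,
  y = as.vector(tapply(dat$y, dat$cluster, mean)),
  n = n_j
)

# No clustering: every observation counts equally
fit_pooled <- lm(y ~ x, data = dat)

# Same thing, written as a regression on cluster means weighted by n
fit_wmeans <- lm(y ~ x, data = means, weights = n)

# Fit through raw cluster means: every cluster counts equally
fit_means <- lm(y ~ x, data = means)

# Random-intercept weights with known variances: 1 / Var(cluster mean)
means$w_rand <- 1 / (tau^2 + sigma^2 / means$n)
fit_rand <- lm(y ~ x, data = means, weights = w_rand)

rbind(pooled = coef(fit_pooled), weighted_means = coef(fit_wmeans),
      raw_means = coef(fit_means), random_known_var = coef(fit_rand))
```

The pooled fit and the n-weighted fit through means are identical. When the random sd is large relative to the residual sd, the random-model weights are nearly equal across clusters, so that line should sit close to the fit through raw means; when it is small, the weights approach n and the line moves toward the pooled fit.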

Shrinkage

And here we can see the very different shrinkage. I guess what we really need to do is show how unbalanced shrinkage yields lines with different slopes.

unbal_shrink <- unbal$shrink |> 
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)) +
  geom_point(position = position_dodge(width = 0.1)) +
  geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

unbal_shrink

Warning: `position_dodge()` requires non-overlapping x intervals.

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Do the clusters with few observations shrink more? Yes, but it’s not a 1:1 relationship.

unbal_shrinksize <- unbal$shrink |> 
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)) +
  geom_point(aes(size = n), position = position_dodge(width = 0.1), alpha = 0.5) +
  # geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

unbal_shrinksize

Warning: `position_dodge()` requires non-overlapping x intervals.
`position_dodge()` requires non-overlapping x intervals.
`position_dodge()` requires non-overlapping x intervals.
`position_dodge()` requires non-overlapping x intervals.
`position_dodge()` requires non-overlapping x intervals.
`position_dodge()` requires non-overlapping x intervals.
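
The lack of a 1:1 relationship is roughly what the standard random-intercept result would predict. Treating the variance components as known, the cluster deviation from the line is scaled by a weight tau^2 / (tau^2 + sigma^2 / n), which is nonlinear in n, and the amount a cluster actually moves also depends on how far its raw mean was from the line to begin with. A small sketch of the weight, using the sd values from this section; `shrink_weight()` is a hypothetical helper, not part of the package.

```r
# Weight on the raw cluster deviation (1 = no shrinkage, 0 = full shrinkage),
# assuming known variance components.
shrink_weight <- function(n, sd_rand_intercept, obs_sigma = 1) {
  tau2 <- sd_rand_intercept^2
  tau2 / (tau2 + obs_sigma^2 / n)
}

weights <- expand.grid(n = c(1, 2, 5, 10, 20),
                       sd_rand_intercept = c(0.5, 2))
weights$lambda <- shrink_weight(weights$n, weights$sd_rand_intercept)

# Fraction of the raw deviation that is removed by shrinkage
weights$shrunk_fraction <- 1 - weights$lambda
weights
```

With a random sd of 2, even a single-observation cluster keeps 80% of its raw deviation, so cluster size barely matters; with a random sd of 0.5, the same cluster keeps only 20%. A small cluster that happens to sit near the line will still barely move in absolute terms.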

Full vs through estimate diagnostic

And finally, do the fits through the random cluster estimates match the line from the random model?

unbal_fit_with_clusters_compare <- unbal$fitlines |> 
  filter(model == 'cluster_rand_x') |>
  ggplot(aes(x = x, color = model)) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  geom_line(data = unbal$fitlinesc |> 
              filter(model == 'cluster_rand_x'), 
            aes(y = estimate), color = 'black', linetype = 'dashed') +
  geom_point(data = unbal$fitclusters |> 
               filter(model == 'cluster_rand_x'), 
             aes(y = estimate)) +
  geom_errorbar(data = unbal$fitclusters |> 
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

unbal_fit_with_clusters_compare + theme(legend.position = 'none')

X is density

For now, we’ll just use two of the options from above, the sd_rand of 0.5 and 2 and sigma of 1, so the random sd is half or double the residual sd (i.e. the random variance is a quarter or four times the residual variance). By using x_is_density = TRUE, we generate the obs among clusters as before, and then calculate density from that. So this should have the same distribution of cluster sizes, but now it’s arranged on x.

with_seed(2,
          xdensparams <- expand_grid(
            N = 50, 
            n_clusters = 10,
            cluster_N = 'uneven',
            nobs_mean = 'fixed', # keep simple for now
            force_nclusters = TRUE,
            force_N = TRUE,
            # Definitely want 0, the others are essentially arbitrary
            nbsize = c(1, 5, 50),
            intercept = 1, 
            slope = 0.5,
            obs_sigma = c(1),
            sd_rand_intercept = c(0.5, 2),
            sd_rand_slope = 0, 
            rand_si_cor = 0,
            # putting this in a list lets me send in vectors that are all the same
            cluster_x = list(runif(n_clusters,
                                   min = min(xrange), max = max(xrange))),
            obs_x_sd = 0,
            x_is_density = TRUE
          )
)

xdensdata <- with_seed(2, 
                       make_analysed_tibble(xdensparams, mod_pal)
)

dropping columns from rank-deficient conditional model: cluster9

Joining with `by = join_by(cluster)`

Warning: There were 6 warnings in `mutate()`.
The first warning was:
ℹ In argument: `fitted_x_lines = map(full_models, function(x) fit_x_full(x,
  xvals = xdata))`.
Caused by warning:
! SE for fits with fixed clusters are not correct
ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.

Joining with `by = join_by(x, cluster)`

# this can be more useful
xdens <- extract_unnest(xdensdata, xdensparams)

Joining with `by = join_by(group)`

Results

Data

Again, we look at the data, same as above, but now arranged on x.

xdens$points |> 
  ggplot(aes(x = x, y = y, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_point() +
  facet_grid(nbsize ~ sd_rand_intercept) +
  labs(color = 'Cluster ID')

This is the same distribution as above.

xdens$points |> 
  ggplot(aes(x = cluster, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_bar() +
  facet_grid(nbsize ~ sd_rand_intercept) +
  theme(legend.position = 'none')

Main result

xdens_fit <- xdens$fitlines |>
  filter(model == 'cluster_rand_x') |> 
  ggplot(aes(x = x, color = model)) +
  geom_point(data = xdens$points, aes(y = y), color = 'grey20', alpha = 0.2) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  geom_point(data = xdens$fitclusters  |>
               filter(model == 'cluster_rand_x'),
             aes(y = estimate)) +
  geom_errorbar(data = xdens$fitclusters |>
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

xdens_fit + theme(legend.position = 'none')

How did we do on the estimates? The error bars here are 95% CI of the estimates. The observation sd does not have an se (it’s the leftovers), so it has no error bars.

xdens$set_v_est |> 
  ggplot(aes(x = term, shape = type)) + 
  geom_point(mapping = aes(y = estimate)) +
  geom_errorbar(mapping = aes(ymin = cil, ymax = ciu), width = 0.1) +
  geom_point(mapping = aes(y = set_value), color = 'firebrick') +
  facet_grid(nbsize ~ sd_rand_intercept, labeller = 'label_both')

Method comparison

Now we have some meaningful shrinkage and it can cause the lines to deviate. The dashed blue line is the fit through the raw cluster estimates for visualisation.

xdens_method_comparison <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = xdens$fitlines |>
                filter(model %in% c('cluster_rand_x')),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = xdens$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster')),
            aes(y = estimate)) +
  geom_line(data = xdens$fitlinesc |>
              filter(model %in% c('cluster_raw')),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = xdens$fitclusters  |>
               filter(model %in% c('cluster_rand_x', "cluster_raw")),
             aes(y = estimate)) +
  geom_errorbar(data = xdens$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw")),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

xdens_method_comparison

It can help to look at one of those more closely just to see what’s happening, and show the fit through the raw clusters as well.

xdens_method_comparison_single <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = xdens$fitlines |>
                filter(model %in% c('cluster_rand_x') &
                         nbsize == 1 & sd_rand_intercept == 2),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = xdens$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster') &
                       nbsize == 1 & sd_rand_intercept == 2),
            aes(y = estimate)) +
  geom_line(data = xdens$fitlinesc |>
              filter(model %in% c('cluster_raw') &
                       nbsize == 1 & sd_rand_intercept == 2),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = xdens$fitclusters  |>
               filter(model %in% c('cluster_rand_x', "cluster_raw") &
                        nbsize == 1 & sd_rand_intercept == 2),
             aes(y = estimate)) +
  geom_errorbar(data = xdens$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw") &
                           nbsize == 1 & sd_rand_intercept == 2),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
                width = 0.2) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

xdens_method_comparison_single

Shrinkage

And here we can see the very different shrinkage. I guess what we really need to do is show how unbalanced shrinkage yields different sloped lines.

xdens_shrink <- xdens$shrink |> 
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)) +
  geom_point(position = position_dodge(width = 0.1)) +
  geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

xdens_shrink

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Do the clusters with few observations shrink more? Yes, but it is not a 1:1 relationship.

xdens_shrinksize <- xdens$shrink |> 
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)) +
  geom_point(aes(size = n), position = position_dodge(width = 0.1), alpha = 0.5) +
  # geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

xdens_shrinksize
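
A minimal sketch of why shrinkage tracks cluster size without being 1:1, assuming a Gaussian random-intercept model with known variances (the fitted models estimate them, so this is only an approximation). The raw cluster deviation is multiplied by a weight lambda = sd_rand^2 / (sd_rand^2 + sd_obs^2 / n). `shrink_weight` is a hypothetical helper, not part of galenR, and the sd values simply echo the parameter grid.

```r
# Shrinkage weight for a Gaussian random-intercept model with known variances.
# lambda = 1 means no shrinkage; lambda near 0 means the cluster estimate is
# pulled almost entirely to the population line.
# `shrink_weight` is a hypothetical helper, not part of galenR.
shrink_weight <- function(n, sd_rand, sd_obs) {
  sd_rand^2 / (sd_rand^2 + sd_obs^2 / n)
}

n_obs <- c(1, 2, 5, 20, 100)

shrink_tab <- data.frame(
  n = n_obs,
  lambda_sd_rand_0.5 = shrink_weight(n_obs, sd_rand = 0.5, sd_obs = 1),
  lambda_sd_rand_2 = shrink_weight(n_obs, sd_rand = 2, sd_obs = 1)
)

print(round(shrink_tab, 3))
```

The weight is nonlinear in n: going from 1 to 5 observations changes it a lot, going from 20 to 100 barely at all. The absolute amount a cluster moves also depends on how far its raw mean sits from the line, so two clusters with the same n can shrink by different amounts.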

Full vs through estimate diagnostic

And finally, do the fits through the random cluster estimates match the line from the random model?

xdens_fit_with_clusters_compare <- xdens$fitlines |> 
  filter(model == 'cluster_rand_x') |>
  ggplot(aes(x = x, color = model)) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  geom_line(data = xdens$fitlinesc |> 
              filter(model == 'cluster_rand_x'), 
            aes(y = estimate), color = 'black', linetype = 'dashed') +
  geom_point(data = xdens$fitclusters |> 
               filter(model == 'cluster_rand_x'), 
             aes(y = estimate)) +
  geom_errorbar(data = xdens$fitclusters |> 
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

xdens_fit_with_clusters_compare + theme(legend.position = 'none')

So far, there’s not an obvious issue with biased shrinkage. Everything differs from a naive fit without random effects, but that’s clearly wrong anyway.

Larger numbers and freer obs per cluster distribution

I’ll adjust the above to have more observations and more variance between clusters, and also change the negative binomial mu. Basically, crank up the variation and biases. Should I crank up n_clusters? Maybe not (or just a bit), if they’re ‘riffles’.

with_seed(2,
          bigdensparams <- expand_grid(
            N = 1000,
            n_clusters = 15,
            cluster_N = 'uneven',
            nobs_mean = 'x',
            force_nclusters = FALSE,
            force_N = FALSE,
            # Definitely want 0, the others are essentially arbitrary
            nbsize = c(0.5, 1, 10),
            intercept = 1,
            slope = 0.5,
            obs_sigma = c(1),
            sd_rand_intercept = c(0.5, 2),
            sd_rand_slope = 0,
            rand_si_cor = 0,
            # putting this in a list lets me send in vectors that are all the same
            cluster_x = list(runif(n_clusters,
                                   min = min(xrange), max = max(xrange))),
            obs_x_sd = 0,
            x_is_density = TRUE
          )
)
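
A quick sketch of what the `nbsize` values do to the spread of observations per cluster. This assumes `nbsize` ends up as the `size` argument of `rnbinom()` (the 'N successes ("size")' axis label suggests so, but I haven't confirmed it against the galenR internals). The mean is held fixed here for simplicity, whereas the grid above sets `nobs_mean = 'x'`.

```r
# Sketch: how the negative-binomial `size` parameter controls unevenness in
# observations per cluster. Assumes cluster sizes are drawn with
# rnbinom(size = nbsize, mu = ...); that mapping is an assumption here.
set.seed(2)
n_clusters <- 15
mu <- 1000 / n_clusters  # illustrative mean observations per cluster

sizes <- c(0.5, 1, 10)
draws <- lapply(sizes, function(s) rnbinom(1e5, size = s, mu = mu))
names(draws) <- paste0("nbsize_", sizes)

# Theoretical variance is mu + mu^2 / size, so small size = very uneven clusters
nb_summary <- data.frame(
  nbsize = sizes,
  sim_mean = sapply(draws, mean),
  sim_var = sapply(draws, var),
  theory_var = mu + mu^2 / sizes
)
print(nb_summary, row.names = FALSE)
```

With the mean held constant, dropping `size` from 10 to 0.5 inflates the variance enormously, which is the "freer" distribution of observations per cluster this section is after.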

bigdensdata <- with_seed(2, 
                       make_analysed_tibble(bigdensparams, mod_pal)
)

dropping columns from rank-deficient conditional model: cluster9

dropping columns from rank-deficient conditional model: cluster8

dropping columns from rank-deficient conditional model: cluster9
dropping columns from rank-deficient conditional model: cluster9
dropping columns from rank-deficient conditional model: cluster9
dropping columns from rank-deficient conditional model: cluster9

Joining with `by = join_by(cluster)`

Warning: There were 6 warnings in `mutate()`.
The first warning was:
ℹ In argument: `fitted_x_lines = map(full_models, function(x) fit_x_full(x,
  xvals = xdata))`.
Caused by warning:
! SE for fits with fixed clusters are not correct
ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.

Joining with `by = join_by(x, cluster)`

# this can be more useful
bigdens <- extract_unnest(bigdensdata, bigdensparams)

Joining with `by = join_by(group)`

Results

Data

Again, we look at the data, same as above.

bigdens$points |> 
  ggplot(aes(x = x, y = y, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_point() +
  facet_grid(nbsize ~ sd_rand_intercept) +
  labs(color = 'Cluster ID')

This is the same distribution as above.

bigdens$points |> 
  ggplot(aes(x = cluster, color = factor(cluster, levels = c(1:max(as.numeric(cluster)))))) + 
  geom_bar() +
  facet_grid(nbsize ~ sd_rand_intercept) +
  theme(legend.position = 'none')

Main result

bigdens_fit <- bigdens$fitlines |>
  filter(model == 'cluster_rand_x') |> 
  ggplot(aes(x = x, color = model)) +
  geom_point(data = bigdens$points, aes(y = y), color = 'grey20', alpha = 0.2) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  geom_point(data = bigdens$fitclusters |>
               filter(model == 'cluster_rand_x'),
             aes(y = estimate)) +
  geom_errorbar(data = bigdens$fitclusters |>
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

bigdens_fit + theme(legend.position = 'none')

How did we do on the estimates? The error bars here are 95% CIs of the estimates. The observation sd has no se (it is what is left over after the other terms are estimated), so it has no error bars.

bigdens$set_v_est |> 
  ggplot(aes(x = term, shape = type)) + 
  geom_point(mapping = aes(y = estimate)) +
  geom_errorbar(mapping = aes(ymin = cil, ymax = ciu), width = 0.1) +
  geom_point(mapping = aes(y = set_value), color = 'firebrick') +
  facet_grid(nbsize ~ sd_rand_intercept, labeller = 'label_both')
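
For reference, a sketch of a Wald-type 95% interval built from an estimate and its se. Note the fit plots elsewhere use plus or minus one se, so those bands are narrower than 95% intervals. Whether `cil` and `ciu` in `set_v_est` are computed exactly this way is an assumption; `wald_ci` is a hypothetical helper and the numbers are made up for illustration.

```r
# Sketch: Wald-type 95% interval from an estimate and its standard error.
# `wald_ci` is a hypothetical helper; whether `cil`/`ciu` in set_v_est are
# computed this way is an assumption.
wald_ci <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(estimate = estimate, cil = estimate - z * se, ciu = estimate + z * se)
}

# Illustrative numbers only (the slope is set to 0.5 in the parameter grid)
ci <- wald_ci(estimate = 0.48, se = 0.05)
print(ci)

# Standard deviations are often given intervals on the log scale and then
# back-transformed, which keeps the lower bound positive and makes the
# interval asymmetric around the estimate.
log_sd_ci <- wald_ci(estimate = log(2), se = 0.3)
sd_ci <- exp(log_sd_ci)
print(sd_ci)
```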

Method comparison

Now we have some meaningful shrinkage, and it can cause the lines to deviate. The dashed blue line is the fit through the raw cluster estimates, for visualisation. Where these deviate from the no-cluster fit, it seems appropriate: they’re equalising the weight between clusters and not letting the super dense ones pull the whole fit around. That is right, provided we have enough data to estimate the sparse ones.

bigdens_method_comparison <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = bigdens$fitlines |>
                filter(model %in% c('cluster_rand_x')),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = bigdens$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster')),
            aes(y = estimate)) +
  geom_line(data = bigdens$fitlinesc |>
              filter(model %in% c('cluster_raw')),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = bigdens$fitclusters |>
               filter(model %in% c('cluster_rand_x', "cluster_raw")),
             aes(y = estimate)) +
  geom_errorbar(data = bigdens$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw")),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

bigdens_method_comparison
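
A sketch of the weighting argument in base R, with illustrative values rather than the simulation settings above. When x is constant within a cluster, the pooled (no-cluster) fit is identical to a regression through the cluster means weighted by n, so dense clusters dominate it. The unweighted fit through the cluster means gives every cluster equal say. A random-intercept fit lands somewhere between the two, depending on the variance components.

```r
# Sketch: a pooled (no-cluster) fit weights clusters by their number of
# observations; a fit through cluster means weights every cluster equally.
# All values below are illustrative, not the simulation settings used above.
set.seed(2)
n_clusters <- 15
cluster_x <- runif(n_clusters, min = 0, max = 10)
# make dense clusters sit at high x, loosely mimicking x_is_density
n_obs <- pmax(1, round(cluster_x^2))
cluster_dev <- rnorm(n_clusters, mean = 0, sd = 2)  # random intercepts

sim <- data.frame(
  cluster = rep(seq_len(n_clusters), times = n_obs),
  x = rep(cluster_x, times = n_obs)
)
sim$y <- 1 + 0.5 * sim$x + cluster_dev[sim$cluster] + rnorm(nrow(sim), sd = 1)

# Pooled fit: every observation counts once, so big clusters dominate
pooled <- lm(y ~ x, data = sim)

# Cluster-mean fits: unweighted, and weighted by n
means <- aggregate(y ~ cluster + x, data = sim, FUN = mean)
means$n <- n_obs[as.integer(means$cluster)]
through_means <- lm(y ~ x, data = means)
weighted_means <- lm(y ~ x, data = means, weights = n)

print(round(rbind(pooled = coef(pooled),
                  through_means = coef(through_means),
                  weighted_means = coef(weighted_means)), 3))
```

The `pooled` and `weighted_means` rows match exactly, which is the sense in which the naive fit lets the dense clusters pull the line around.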

It can help to look at one of those more closely just to see what’s happening, and show the fit through the raw clusters as well.

bigdens_method_comparison_single <-
  ggplot(mapping = aes(x = x, color = model)) +
  geom_ribbon(data = bigdens$fitlines |>
                filter(model %in% c('cluster_rand_x') &
                         nbsize == 1 & sd_rand_intercept == 2),
              aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(data = bigdens$fitlines |>
              filter(model %in% c('cluster_rand_x', 'cluster_raw', 'no_cluster') &
                       nbsize == 1 & sd_rand_intercept == 2),
            aes(y = estimate)) +
  geom_line(data = bigdens$fitlinesc |>
              filter(model %in% c('cluster_raw') &
                       nbsize == 1 & sd_rand_intercept == 2),
            aes(y = estimate), linetype = 'dashed') +
  geom_point(data = bigdens$fitclusters |>
               filter(model %in% c('cluster_rand_x', "cluster_raw") &
                        nbsize == 1 & sd_rand_intercept == 2),
             aes(y = estimate)) +
  geom_errorbar(data = bigdens$fitclusters |>
                  filter(model %in% c('cluster_rand_x', "cluster_raw") &
                           nbsize == 1 & sd_rand_intercept == 2),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
                width = 0.2) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

bigdens_method_comparison_single

Shrinkage

And here we can see the very different shrinkage. I guess what we really need to do is show how unbalanced shrinkage yields different sloped lines.

bigdens_shrink <- bigdens$shrink |>
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)) +
  geom_point(position = position_dodge(width = 0.1)) +
  geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

bigdens_shrink

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

Do the clusters with few observations shrink more? Yes, but it is not a 1:1 relationship.

bigdens_shrinksize <- bigdens$shrink |>
  
  ggplot(
    aes(x = x, y = cluster_resid,
        ymin = cluster_resid-se,
        ymax = cluster_resid + se,
        color = model)) +
  geom_point(aes(size = n), position = position_dodge(width = 0.1), alpha = 0.5) +
  # geom_linerange(position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

bigdens_shrinksize

Full vs through estimate diagnostic

And finally, do the fits through the random cluster estimates match the line from the random model?

bigdens_fit_with_clusters_compare <- bigdens$fitlines |>
  filter(model == 'cluster_rand_x') |>
  ggplot(aes(x = x, color = model)) +
  geom_ribbon(aes(y = estimate, ymin = estimate-se, ymax = estimate + se),
              alpha = 0.25, linetype = 0) +
  geom_line(aes(y = estimate)) +
  geom_line(data = bigdens$fitlinesc |>
              filter(model == 'cluster_rand_x'), 
            aes(y = estimate), color = 'black', linetype = 'dashed') +
  geom_point(data = bigdens$fitclusters |>
               filter(model == 'cluster_rand_x'), 
             aes(y = estimate)) +
  geom_errorbar(data = bigdens$fitclusters |>
                  filter(model == 'cluster_rand_x'),
                aes(y = estimate, ymin = estimate-se, ymax = estimate + se)) +
  scale_color_manual(values = mod_pal) +
  facet_grid(nbsize ~ sd_rand_intercept) +
  scale_x_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(random $\sigma$)'),
                                         breaks = NULL, labels = NULL)) +
  scale_y_continuous(sec.axis = sec_axis(~ . , name = TeX(r'(N successes ("size"))'),
                                         breaks = NULL, labels = NULL))

bigdens_fit_with_clusters_compare + theme(legend.position = 'none')

Subclusters

Testing for now.

with_seed(2,
          subparams <- expand_grid(
            N = 10000, 
            n_clusters = 10, # 'riffles'
            n_subclusters = list(c(100, 1000)), # 'rocks' (absolute, not per-riffle I think?)
            cluster_N = 'uneven',
            nobs_mean = 'fixed', # keep simple for now
            force_nclusters = TRUE,
            force_N = TRUE,
            # Definitely want 0, the others are essentially arbitrary
            nbsize = c(1, 5, 50),
            intercept = 1, 
            slope = 0.5,
            obs_sigma = c(1),
            sd_rand_intercept = list(c(0.5, 2, 1)),
            sd_rand_slope = list(c(0, 0, 0)), 
            rand_si_cor = 0,
            # putting this in a list lets me send in vectors that are all the same
            cluster_x = list(runif(n_clusters,
                                   min = min(xrange), max = max(xrange))),
            obs_x_sd = 0
          )
)

# This is throwing an error, need to fix.
subdata <- with_seed(2, 
                       make_analysed_tibble(subparams, mod_pal)
)

# this can be more useful
subout <- extract_unnest(subdata, subparams)

And how does nesting work, especially if it is again uneven?

Explicitly look at singletons? Scaling: if we have 50 singleton clusters at an x, is that as ‘good’ as one cluster of 50?
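
A back-of-envelope sketch of the singleton question, assuming a Gaussian random-intercept model with known variances. Each cluster mean has variance sd_rand^2 + sd_obs^2 / n, and averaging independent clusters divides that by the number of clusters. `var_of_mean` is a hypothetical helper and the sd values are illustrative.

```r
# Sketch: variance of the mean response at a single x under a Gaussian
# random-intercept model, comparing many small clusters with one big one.
# `var_of_mean` is a hypothetical helper; sd values are illustrative.
var_of_mean <- function(n_clusters, n_per_cluster, sd_rand, sd_obs) {
  # each cluster mean has variance sd_rand^2 + sd_obs^2 / n_per_cluster;
  # averaging independent clusters divides that by n_clusters
  (sd_rand^2 + sd_obs^2 / n_per_cluster) / n_clusters
}

singletons <- var_of_mean(n_clusters = 50, n_per_cluster = 1, sd_rand = 2, sd_obs = 1)
one_big    <- var_of_mean(n_clusters = 1, n_per_cluster = 50, sd_rand = 2, sd_obs = 1)

print(c(singletons = singletons, one_big = one_big, ratio = one_big / singletons))
```

For pinning down the mean at that x, the 50 singletons are far better, because the single big cluster never averages away its own random intercept. The catch is that with only singletons the random and residual variances cannot be separated from each other, so estimating the variance components is a different question from estimating the mean.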

Then, does the beta-binomial introduce additional issues?

AND, finally, what should we plot? What should we DO? It’s one thing to find out the stats are working right and understand why they’re counterintuitive; it’s another to decide how to present them or whether we need to do something else.
