Random effects with uneven groups

library(tidyverse)
library(withr)
library(glmmTMB)
devtools::load_all()

This is essentially preamble to understanding the beta-binomial issues we’re having. But we should be able to do a better job of sharpening our intuition about what to expect if we start off with gaussian.

This builds on the outline I developed and the work Sarah did (students/Sarah_Taig/Random Effects Simulations) (though I think the emphasis will be different; we’ll see), as well as some beta-binom errorbar checks in caddis/Testing/Error_bars.qmd.

I think I’ll likely just do gaussian here, then beta-binomial in a parallel doc. And likely will do a model comparison thing too, i.e. spaMM vs glmmTMB vs lme4::glmer as in caddis/Analyses/Testing/df_z_t_for_sarah/ (and add {brms}).

And will likely need to incorporate some assessments of the se being calculated at various scales, as explored in caddis/Analyses/Testing/Error_bars.qmd.

We’ll need to bring in real data at some point, but I think not in this doc (hence why it’s here, and some of the other testing is in caddis).

Are we just recapitulating this? Kind of. Slightly different emphasis and we’ll end up taking it further, but should be careful.

The data generation function lets us have random slopes. These are super relevant in some cases, e.g. following people or riffles through time, where observations might have different x-values within the cluster. I think I’ll largely ignore them here, because the situation we’re trying to address doesn’t have within-cluster x variation (each cluster has a single x), and so the in-cluster slopes are irrelevant. They could certainly be dealt with in this general exploration, but it would just make everything factorially complicated.

Random and residual sd and fits

Before we do anything with unbalanced clusters, let’s first just see how the random and residual sds alter the way fits work. More importantly, let’s establish some approaches to seeing how the random structure affects the outcome.

  • fits through points
  • fits through clusters
  • full model fit
  • cluster error bars on the clusters, i.e. from a fixed model of the clusters
  • cluster error bars from the random model. What is this? The random sd will be the error of each cluster around the line, and the residual sd will be within-cluster. Can I extract and plot that (sketched just below)? I think that will actually be key to understanding what’s happening.

All of this from model fits.
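
As a scaffold for that last item, here is a sketch of where those pieces live in a glmmTMB fit. Everything here is hypothetical (m is a stand-in for a model like the ones fit further down, y ~ x + (1|cluster)); the point is just which extractors give the random-intercept sd, the residual sd, and the per-cluster conditional modes and sds.

# Sketch only; m stands in for a glmmTMB fit of y ~ x + (1|cluster)
# VarCorr(m)                                  # random-intercept sd and residual sd together
# sigma(m)                                    # residual sd on its own
# re <- ranef(m)$cond$cluster                 # per-cluster conditional modes (deviations)
# re_sd <- sqrt(c(attributes(re)$condVar))    # their conditional sds
# tibble(cluster = rownames(re),
#        cluster_dev = re[['(Intercept)']],
#        cluster_sd = re_sd)                  # ready to plot as cluster error bars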

xrange = c(0, 10)

with_seed(11,
          sdparams <- expand_grid(
            n_per_cluster = 5, 
            n_clusters = 10, 
            intercept = 1, 
            slope = 0.5,
            sigma = c(0.1, 1),
            sd_rand_intercept = c(0.5, 2),
            sd_rand_slope = 0, 
            rand_si_cor = 0,
            # putting this in a list lets me send in vectors that are all the same
            cluster_x = list(runif(n_clusters,
                                   min = min(xrange), max = max(xrange))),
            obs_x_sd = 0
          )
)

# There's got to be a way to pmap straight into a mutate without having to send the dataframe in again.
with_seed(11, 
          sddata <- sdparams |>
            purrr::pmap(simulate_gaussian_mm) |> 
            tibble(simdata = _) |> 
            bind_cols(sdparams)
)
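
# One possible answer to the comment above (untested here, and assuming dplyr >= 1.1,
# where pick() is available inside mutate): pmap over pick(everything()) so the
# parameter tibble only gets referenced once.
# with_seed(11,
#           sddata <- sdparams |>
#             mutate(simdata = purrr::pmap(pick(everything()), simulate_gaussian_mm))
# )
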
sddata |> 
  unnest(cols = simdata) |> 
ggplot(aes(x = x, y = y, color = factor(cluster))) + 
  geom_point() +
  facet_grid(sigma ~ sd_rand_intercept)

Fit models

Now, we can fit the models. What do we want to use here? Build in the model comparison now, or deal with that later? Later, I think (though likely that will mean coming back up here).

# sddata <- sddata |> 
#   mutate(cluster_rand_x = map(simdata, \(x) glmmTMB(y ~ x + (1|cluster), data = x)),
#          # This one ends up rank-deficient for the last cluster usually
#          cluster_fixed_x = map(simdata, \(x) glmmTMB(y ~ x + cluster, data = x)),
#          cluster_fixed = map(simdata, \(x) glmmTMB(y ~ cluster, data = x)),
#          cluster_rand = map(simdata, \(x) glmmTMB(y ~ 1|cluster, data = x)))

I want to extract the cluster estimates for each model, skipping the intercept estimates here, though we might want to revisit that (see below).

First, set up a dataset we’re going to want to predict on for the continuous fits

# clusterdata <- sddata$simdata[[3]] |> 
#   select(x, cluster) |> 
#   distinct()

xdata <- tibble(x = seq(from = min(xrange), to = max(xrange), by = diff(xrange)/100))
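
The packaged make_clusterdata() used in the helpers below is assumed to be the same distinct x-cluster extraction as the commented chunk above; a sketch, not the packaged definition:

# make_clusterdata <- function(alldata) {
#   alldata |> dplyr::select(x, cluster) |> dplyr::distinct()
# }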

Start setting this up for a function.

For the models with cluster as a random effect, we could get the values straight out of the main fit. But I think we actually want to get them from ranef and coef, because those are more directly what it’s estimating for the point estimates of the clusters. It ends up being the same value, but not the same sd.
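
Concretely, the identity being leaned on here (the glmmTMB help is quoted on this further down), written out as a sketch:

# For a fit m with a (1|cluster) term (intercept column only):
# coef(m)$cond$cluster  ==  fixef(m)$cond['(Intercept)'] + ranef(m)$cond$cluster, per level,
# so the point estimates agree however we extract them; it's the attached sd/se that differs.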

# I'm not sure if I actually want this one or not. I think I might, in conjunction with a basic summarise() for the fixefs. Or a clever extraction from summary(model)$coefficients, but the catch is sorting out the x's and intercepts.
# cluster_estimates_rand <- function(model, alldata) {
#   clusterdata <- make_clusterdata(alldata)
#   clustib <- tibble(cluster = row.names(ranef(model)$cond$cluster),
#   intercept = fixef(model)$cond["(Intercept)"], 
#        slope = fixef(model)$cond['x'], 
#        cluster_deviation = ranef(model)$cond$cluster[[1]], 
#        cluster_sd = sqrt(c(attributes(ranef(model)$cond$cluster)$condVar))) |> 
#   mutate(slope = ifelse(is.na(slope), 0, slope)) |> 
#   left_join(clusterdata, by = 'cluster') |> 
#   mutate(cluster_estimate = intercept + slope*x + cluster_deviation)
#   
#   return(dplyr::select(clustib, cluster, x, cluster_estimate, cluster_sd))
# }

Try to get that for the non-random. There’s an issue with the se for the intercept level

# cluster_estimates_fixed <- function(model, alldata) {
#   clusterdata <- make_clusterdata(alldata)
#   sumvals <- summary(model)$coefficients$cond
#   unique(clusterdata$cluster)
#   row.names(sumvals) <- stringr::str_remove_all(row.names(sumvals), 'cluster')
#   missingc <- setdiff(unique(clusterdata$cluster), row.names(sumvals))
#   # We get the intercept elsewhere
#   sumvals[row.names(sumvals) == '(Intercept)', 'Estimate'] <- 0
#   row.names(sumvals)[row.names(sumvals) == '(Intercept)'] <- missingc
#   
#   
#   clustib <- as_tibble(sumvals, rownames = 'cluster') |> 
#     select(cluster, estimate = Estimate, se = `Std. Error`) |> 
#     mutate(intercept = fixef(model)$cond['(Intercept)'],
#            slope = fixef(model)$cond['x'],
#            slope = ifelse(is.na(slope), 0, slope)) |> 
#     filter(cluster != 'x') |> 
#     left_join(clusterdata) |> 
#     mutate(cluster_estimate = intercept + estimate + slope*x)
#   
#   return(dplyr::select(clustib, cluster, x, cluster_estimate, cluster_sd = se))
#   
# }
# extract_cluster_terms <- function(model, alldata) {
#   clusterdata <- make_clusterdata(alldata)
#   if (length(ranef(test_fixed)$cond) == 0) {
#     clustib <- cluster_estimates_fixed(model, clusterdata)
#   } else {
#     clustib <- cluster_estimates_rand(model, clusterdata)
#   }
#   
#   return(clustib)
# }

This gets the cluster estimates from the fits at the cluster values. Because it accounts for the full model, the se in particular are not the same.

# predict_at_clusters <- function(model, alldata) {
#   clusterdata <- make_clusterdata(alldata)
#   fitcluster <- clusterdata |> 
#   add_predictions(model, se.fit = TRUE) |> 
#   rename(predicted = fit, sd = se.fit)
#   return(fitcluster)
# }
# naive_clusters <- function(alldata) {
#   alldata |> 
#     summarise(cluster_mean = mean(y),
#            cluster_se = sd(y)/sqrt(n()), .by = cluster)
# }

So, the above should let us get the cluster points.

Do we really want to stay in this tibble framework, or move on to something else?

Break it down to a single thing to make sure we have what we want. Should be able to re-purrr it if we want, but I think we probably will want to skip some intermediate steps.

One set of data

simdata <- sddata$simdata[[3]]

Fit the models

cluster_rand_x <- glmmTMB(y ~ x + (1|cluster), data = simdata)
# This one ends up rank-deficient for the last cluster usually
cluster_fixed_x <- glmmTMB(y ~ x + cluster, data = simdata)
cluster_fixed <- glmmTMB(y ~ cluster, data = simdata)
cluster_rand <- glmmTMB(y ~ 1|cluster, data = simdata)

no_cluster <- glmmTMB(y ~ x, data = simdata)

Get estimates for each cluster

# Extraction as best I can of *just* the cluster estimates and se
clusters_from_rand_x <- extract_cluster_terms(cluster_rand_x, simdata)
clusters_from_fixed_x <- extract_cluster_terms(cluster_fixed_x, simdata)
clusters_from_rand <- extract_cluster_terms(cluster_rand, simdata)
clusters_from_fixed <- extract_cluster_terms(cluster_fixed, simdata)

# From the data
clusters_from_raw <- naive_clusters(simdata)

# Extraction of estimates from the full model for the clusters at their x values (this works here with no in-cluster x variation)
clusters_full_rand_x <- predict_at_clusters(cluster_rand_x, simdata)
clusters_full_fixed_x <- predict_at_clusters(cluster_fixed_x, simdata)
clusters_full_rand <- predict_at_clusters(cluster_rand, simdata)
clusters_full_fixed <- predict_at_clusters(cluster_fixed, simdata)

no_cluster_full <- predict_at_clusters(no_cluster, simdata)

There are a LOT of potential comparisons there; let’s try to clarify what they are.

Preliminary (sort of sideways, but checking):

  1. How do the cluster estimates differ from those from a full model?

The extracted cluster term has the same estimate and marginally larger se, as we would expect.

est_compare <- 
  bind_rows(clusters_from_fixed, clusters_full_fixed,
            clusters_from_fixed_x, clusters_full_fixed_x, 
            clusters_from_rand_x, clusters_full_rand_x, 
            clusters_from_rand, clusters_full_rand) |> 
  ggplot(aes(y = estimate, x = x, color = estimate_type, ymin = estimate-se, ymax = estimate+se)) + 
  geom_pointrange(position = position_dodge(width = 0.1)) +
  facet_wrap('model')

est_compare

  2. How does the inclusion of x affect cluster estimates?

If clusters are a fixed effect, nothing changes. If clusters are random, we do see some deviations between the estimates, which don’t look much different whether we use just the cluster term or the full predictions.

x_compare <- 
  bind_rows(clusters_from_fixed, clusters_full_fixed,
            clusters_from_fixed_x, clusters_full_fixed_x, 
            clusters_from_rand_x, clusters_full_rand_x, 
            clusters_from_rand, clusters_full_rand) |> 
  mutate(hasx = grepl('_x$', model),
         clusterrand = grepl('_rand', model)) |> 
  ggplot(aes(y = estimate, x = x, color = hasx, ymin = estimate-se, ymax = estimate+se)) + 
  geom_pointrange(position = position_dodge(width = 0.1)) +
  facet_grid(estimate_type ~ clusterrand, labeller = 'label_both')

x_compare

  3. Does the 'separate' estimation of each cluster match the fixeds?

At the mean, yes. It has a smaller se than the cluster terms themselves, but larger than the full model (usually).

sep_fixed <- 
  bind_rows(clusters_from_fixed,
            clusters_from_fixed_x,
            clusters_full_fixed, 
            clusters_full_fixed_x,
            # so I can include on both facets
            clusters_from_raw |> mutate(estimate_type = 'cluster_term'),
            clusters_from_raw |> mutate(estimate_type = 'full_model_predict')) |> 
  
  ggplot(aes(y = estimate, x = x, color = model, ymin = estimate-se, ymax = estimate+se)) + 
  geom_pointrange(position = position_dodge(width = 0.3)) +
  facet_wrap('estimate_type')

sep_fixed

Here’s what I really want to know- how do predictions change between fixed effect estimates (and the naive, but since that’s the same, not including here), and the random? I think I’ll use the cluster term fits (not full model), and I suppose with and without x in the random models. I’ll also include the ‘no cluster’ model as a reference.

So, the rands (particularly rand_x) tend to pull closer to the points without a cluster term (blue, ‘no_cluster’)

cluster_compare <- 
  bind_rows(clusters_from_fixed,
            clusters_from_rand_x, 
            clusters_from_rand,
            no_cluster_full) |> 
  ggplot(aes(y = estimate, x = x, color = model, ymin = estimate-se, ymax = estimate+se)) + 
  geom_pointrange(position = position_dodge(width = 0.3))

cluster_compare

On to lines- how does any of this change the FITS?

  • For the models themselves? (rand_x, fixed_x, no_cluster)

  • New fits through the cluster estimates (i.e. how we try to mentally fit the data in a plot) (rand_x- do I want the full model or just the cluster ests?-, fixed_x, naive)

Get the fits that our eyes will try to draw:

cluster_model_rand_x <- glmmTMB(y ~ x, data = clusters_from_rand_x |> rename(y = estimate))
cluster_model_fixed_x <- glmmTMB(y ~ x, data = clusters_from_fixed_x |> rename(y = estimate))
cluster_model_raw <- glmmTMB(y ~ x, data = clusters_from_raw |> rename(y = estimate))
# This is what we're doing in the plots, though there it's internally nested as well
cluster_model_fixed <- glmmTMB(y ~ x, data = clusters_from_fixed |> rename(y = estimate))

Fit the lines for each model, at the mean random level when relevant.
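
predict_fit() is another packaged helper (loaded via devtools::load_all()). A guess at what it does, based on how its output is used below (an estimate and se column plus a model label, predicting at the population level with re.form = NA); the real definition is in the package:

# predict_fit <- function(model, newdata) {
#   p <- predict(model, newdata = newdata, se.fit = TRUE, re.form = NA)
#   newdata |>
#     dplyr::mutate(estimate = p$fit, se = p$se.fit,
#                   model = deparse(substitute(model)))
# }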

# The full models
lines_rand_x <- predict_fit(cluster_rand_x, xdata)
lines_fixed_x <- predict_fit(cluster_fixed_x, xdata)
lines_no_cluster <- predict_fit(no_cluster, xdata)

# Fits to the cluster estimates
lines_rand_x_clusters <- predict_fit(cluster_model_rand_x, xdata)
lines_fixed_x_clusters <- predict_fit(cluster_model_fixed_x, xdata)
lines_raw_clusters <- predict_fit(cluster_model_raw, xdata)
lines_fixed_clusters <- predict_fit(cluster_model_fixed, xdata)

Now set up some comparisons

Again, some prelim to know what we’re dealing with

  1. Do the full models differ?

The no cluster points are annoying, but this basically works. The fit through the data matches the fit of the random clusters, and is slightly different than the fit you’d get if it were possible to fit the x from the model with fixed clusters.

bind_rows(lines_rand_x,
          lines_fixed_x,
          lines_no_cluster) |> 
ggplot(aes(x = x, y = estimate, color = model, linetype = model)) +
  geom_line() +
  geom_point(data = bind_rows(clusters_from_rand_x, 
                              clusters_from_fixed_x, 
                              simdata |> mutate(estimate = y, model = 'no_cluster')))

  2. How do the full fits differ from fits through the summary points for each set? The last two are a funny comparison (no cluster fit and the raw clusters).
bind_rows(lines_rand_x,
          lines_rand_x_clusters,
          lines_fixed_x, 
          lines_fixed_x_clusters,
          lines_no_cluster,
          lines_raw_clusters) |> 
  mutate(fittype = ifelse(grepl('model', model), 'clusters', 'data'),
         model_group = stringr::str_remove_all(model, 'model_'),
         model_group = case_when(model_group %in% c('no_cluster', 'cluster_raw') ~ 'data',
                                 .default = model_group)) |> 
  ggplot(aes(x = x, y = estimate, color = fittype, linetype = fittype)) +
  geom_line() +
  facet_wrap('model_group') +
  geom_point(data = bind_rows(clusters_from_rand_x, clusters_from_fixed_x, clusters_from_raw) |> 
               mutate(model_group = ifelse(model == 'separate', 'data', model),
                      fittype = 'clusters'))

Now, let’s figure out the situation we have, and what we actually want.

We’re plotting the model fits (lines_rand_x) and the cluster estimates from the fixed cluster model (clusters_from_fixed).

lines_rand_x |> 
  ggplot(aes(x = x, y = estimate, ymin = estimate-se, ymax = estimate + se)) +
    geom_ribbon(alpha = 0.25) +
  geom_line() +
  geom_point(data = clusters_from_fixed) +
  geom_errorbar(data = clusters_from_fixed)

What makes more sense as a target is the random estimates and the fit

lines_rand_x |> 
  ggplot(aes(x = x, y = estimate, ymin = estimate-se, ymax = estimate + se)) +
    geom_ribbon(alpha = 0.25) +
  geom_line() +
  geom_point(data = clusters_from_rand_x) +
  geom_errorbar(data = clusters_from_rand_x)

For error testing as we go, there are a few more things I want to plot- the naive cluster estimates, their fit, and the fit to all the data. We basically have that above, but this gets more directly at it. As we move to nested randoms, we will likely end up wanting the thing we currently do in the paper (the outer-level as fixed with inner random). But let’s cross that bridge when we get to it.

lines_rand_x |> 
  ggplot(aes(x = x, y = estimate, ymin = estimate-se, ymax = estimate + se)) +
    geom_ribbon(alpha = 0.25) +
  geom_line() +
  geom_point(data = clusters_from_rand_x) +
  geom_errorbar(data = clusters_from_rand_x) +
  geom_point(data = clusters_from_raw, color = 'seagreen1') +
  geom_errorbar(data = clusters_from_raw, color = 'seagreen1') +
  geom_line(data = lines_raw_clusters, color = 'seagreen1', linetype = 'dashed') +
  geom_line(data = lines_no_cluster, color = 'orchid', linetype = 'dotted')

Can I build that in a tidier way? Yes. Need to come up with a better set of colors, but whatever.

linedf <- bind_rows(lines_rand_x, lines_raw_clusters, lines_no_cluster) |> 
  mutate(model2 = case_when(model == 'cluster_rand_x' ~ 'Random cluster estimates',
                            model == 'cluster_model_raw' ~ 'Naive cluster estimates',
                            model == 'no_cluster' ~ 'No cluster term'))

pointdf <- bind_rows(clusters_from_rand_x, clusters_from_raw) |> 
  mutate(model2 = case_when(model == 'cluster_rand_x' ~ 'Random cluster estimates',
                            model == 'separate' ~ 'Naive cluster estimates'))

ggplot(linedf, 
       aes(x = x, y = estimate, ymin = estimate-se, ymax = estimate + se, 
           color = model2, linetype = model2)) +
    geom_ribbon(data = linedf |> filter(model2 == 'Random cluster estimates'),
                alpha = 0.25, linetype = 0) +
  geom_line() +
  geom_point(data = pointdf) +
  geom_errorbar(data = pointdf) +
  geom_text(data = pointdf |> filter(model == 'cluster_rand_x'), aes(label = cluster), nudge_x = 0.2)

Where are the fixeds? I end up needing them below (maybe?)

ggplot(lines_fixed_x, 
       aes(x = x, y = estimate, ymin = estimate-se, ymax = estimate + se, 
           color = model, linetype = model)) +
      geom_line() +
  geom_point(data = clusters_from_fixed_x) +
  geom_errorbar(data = clusters_from_fixed_x) +
  geom_text(data = clusters_from_fixed_x, aes(label = cluster), nudge_x = 0.2)

THEN, package all this up so we can do it for different sds. And unbalanced. And beta-binom. And nested.

Do I want to make the ‘Intercept’ plots below to show shrinkage directly? Or is it obvious enough here?

# These top two are comparable
rand_x_coefs <- as_tibble(coef(cluster_rand_x)$cond$cluster, 
                          rownames = 'cluster') |>
  # this adds the condVar- are these sd or se?
  bind_cols(se = 
              sqrt(c(attributes(ranef(cluster_rand_x)$cond$cluster)$condVar))) |>
  rename(intercept = `(Intercept)`) |> 
  mutate(grand_intercept = fixef(cluster_rand_x)$cond['(Intercept)']) |> 
  mutate(model = 'cluster_rand_x', hasx = TRUE, ismm = TRUE)

# I *think* this is what I want- the effect of cluster at x = 0 in the fixed model. That's not the same as the cluster estimates in the cluster model, which don't account for the x (or the purely naive estimates).
# Or is it unique(simdata$x[simdata$cluster == 1])
fixed_x_coefs <- tibble(cluster = unique(simdata$cluster), 
                        x = 0) |> 
 add_predictions(cluster_fixed_x, se.fit = TRUE) |> 
  rename(intercept = fit, se= se.fit) |>  # for this specific case at x = 0
    # mutate(grand_intercept = fixef(cluster_fixed_x)$cond['(Intercept)']) |> 
      mutate(grand_intercept = fixef(cluster_fixed_x)$cond['(Intercept)'] + fixef(cluster_fixed_x)$cond['x']*unique(simdata$x[simdata$cluster == 1])) |> 
  mutate(model = 'cluster_fixed_x', hasx = TRUE, ismm = FALSE)

# These are comparable and already created, but don't have the slope taken out.
# clusters_from_rand
# clusters_from_raw 
bind_rows(rand_x_coefs, fixed_x_coefs) |> 
ggplot(aes(x = cluster, y = intercept, 
           color = model)) +  
  geom_point(position = position_dodge(width = 0.1)) +
  geom_linerange(aes(ymax = intercept+se, ymin = intercept-se),
                  position = position_dodge(width = 0.1)) +
  geom_hline(aes(yintercept = grand_intercept, color = model, linetype = model)) 

The way I do it below is better with the ranefs.

bind_rows(clusters_from_rand |> 
            mutate(grand_intercept = fixef(cluster_rand)$cond['(Intercept)']),
          clusters_from_raw |> 
            # Should this be overall data mean or mean of the estimates? It doesn't matter here, but will when we have uneven data.
            mutate(grand_intercept = mean(estimate))) |> 
            # mutate(grand_intercept = mean(simdata$y))) |> 
ggplot(aes(x = x, y = estimate, 
           color = model)) +  
  geom_point(position = position_dodge(width = 0.1)) +
  geom_linerange(aes(ymax = estimate+se, ymin = estimate-se),
                  position = position_dodge(width = 0.1))  +
  geom_hline(aes(yintercept = grand_intercept, color = model, linetype = model)) 

I really need to set up a standard set of colors; this is really hard when the colors keep changing.

What about the way m-clark does it? He fits lmList, which splits the dataset by cluster and fits each one separately. The catch is that because I have a true fixed x and no in-cluster x variation, his method doesn’t work the same way, and so I end up with the full cluster estimates.

ll <- lme4::lmList(y ~ 1|cluster, simdata)

There’s no exact equivalent in glmmTMB, but it’s easy enough to loop; for example:

tl <- glmmTMB(y ~ 1|cluster, data = simdata |> filter(cluster == 9))

But we see that that has the full value (as does the lmList version), and adding x in doesn’t help.

tl <- glmmTMB(y ~ x|cluster, data = simdata |> filter(cluster == 9))
tl <- glmmTMB(y ~ x + 1|cluster, data = simdata |> filter(cluster == 9))
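
For reference, the full loop is just a map over the clusters, here fitting each cluster separately with a plain intercept-only model (rather than the one-cluster random-intercept version above); each per-cluster intercept is just that cluster's mean, i.e. the 'full value' noted above.

separate_fits <- purrr::map(split(simdata, simdata$cluster),
                            \(d) glmmTMB(y ~ 1, data = d))
purrr::map_dbl(separate_fits, \(m) fixef(m)$cond[['(Intercept)']])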

Can I look at the full model, extract the difference from the fixed line at that x value and compare to the ranef? Do we need to? These coef values from the clusters-random-but-separate fits are exactly clusters_from_fixed and clusters_full_fixed_x. So, really, I need to make sure I’m doing the proper comparisons to the fixed values. It’s clear from the above that the ‘intercept’ isn’t working, at least how I’m extracting it.

So yeah, coef is giving me the ranef + fixef, and there’s no obviously great way to do that for the fixed clusters. BUT, I should be able to extract them as the difference between their value and a ‘pure’ fit of the line, and compare that to ranef. TRY THAT.

So basically, what I want is the cluster residuals. But again, how do I get them for the fixed model? They’re also (at least for this case) the same as clusters_from_raw.

# The full models
nc_rand_x <- predict_fit(cluster_rand_x, make_clusterdata(simdata)) |> 
  rename(line_est = estimate, line_se = se) |> 
  left_join(clusters_from_rand_x) |> 
  mutate(cluster_resid = estimate-line_est) |> 
  # As a check that this is correct,
  left_join(as_tibble(ranef(cluster_rand_x)$cond$cluster, rownames = 'cluster')) |> 
  rename(extract_ranef = `(Intercept)`) |> 
  mutate(check = cluster_resid - extract_ranef)

# Need to figure this one out eventually, probably, but it's not working.
# nc_fixed_x <- predict_fit(cluster_fixed_x, make_clusterdata(simdata)) |> 
#   rename(line_est = estimate, line_se = se) |>
#   left_join(clusters_from_fixed_x) |> 
#   mutate(cluster_resid = estimate-line_est)

nc_raw_cluster <- predict_fit(no_cluster, make_clusterdata(simdata)) |> 
  rename(line_est = estimate, line_se = se) |> 
  select(-model) |> 
  left_join(clusters_from_raw) |> 
  mutate(cluster_resid = estimate-line_est)

# Fits to the cluster estimates
# I don't think I actually need this
# nc_rand_x_clusters <- predict_fit(cluster_model_rand_x, make_clusterdata(simdata))
# nc_fixed_x_clusters <- predict_fit(cluster_model_fixed_x, make_clusterdata(simdata))
# nc_raw_clusters <- predict_fit(cluster_model_raw, make_clusterdata(simdata))
# nc_fixed_clusters <- predict_fit(cluster_model_fixed, make_clusterdata(simdata))

This will work to show shrinkage.

ggplot(bind_rows(nc_rand_x, nc_raw_cluster), aes(x = x, y = cluster_resid, 
                                                 ymin = cluster_resid-se, 
                                                 ymax = cluster_resid + se, 
                                                 color = model)) + 
  geom_point() +
  geom_linerange() +
  geom_hline(yintercept = 0)

Can we get the full fixed model to give up its values? Following https://stackoverflow.com/questions/72820236/comparing-all-factor-levels-to-the-grand-mean-can-i-tweak-contrasts-in-linear-m

This might work, but the NA in cluster 9 kills me every time.

ContrSumMat <- function (fctr, sparse = FALSE) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N, sparse = sparse)
  dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
  Cmat
}
Cmat <- ContrSumMat(as.factor(simdata$cluster))

Extract

coef_ac <- summary(cluster_fixed_x)$coef$cond[3:nrow(summary(cluster_fixed_x)$coef$cond), 'Estimate']

Level-specific

coef_bc <- (Cmat %*% coef_ac)[, 1]

Testing

So, this is working. Let’s choose one of those and sort out everything we want to do and then do the purrrring.

# These are the fits to sddata$simdata[[3]] from above (the per-row model columns in sddata are still commented out)
test_full <- cluster_rand_x
test_fixed <- cluster_fixed_x
test_cluster <- cluster_fixed
test_rand <- cluster_rand

Can I get shrinkage from these? I’m now far enough from what the website is doing that I’m a bit lost, because he still seems to be building a mixed model at this point.

# 'For coef.glmmTMB: a similar list, but containing the overall coefficient value for each level, i.e., the sum of the fixed effect estimate and the random effect value for that level'
coef(test_full)
ranef(test_full)

I can extract the conditional variance too, though it’s a hassle to get. Will be very useful though, I think.

attributes(ranef(test_full)$cond$cluster)$condVar |> c()
fixef(test_full)
coef(test_rand)
ranef(test_rand)
fixef(test_rand)
fixef(test_cluster)
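
If the attributes() digging gets old, the as.data.frame method on the ranef object should (I think; worth checking) return the same information more tidily, with condval and condsd columns:

# as.data.frame(ranef(test_full))   # condval / condsd, one row per level and term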

Can I plot all of those? Will move on to predicting the whole line in a bit. I just really want to understand what I’m dealing with.

# add_predictions <- function(df, model, se.fit = TRUE, ...) {
#   preds <- predict(object = model,
#                    newdata = df,
#                    se.fit = TRUE, ...) |> 
#     as_tibble()
#   
#   df <- bind_cols(df, preds)
#   return(df)
# }
full_coefs <- as_tibble(coef(test_full)$cond$cluster, rownames = 'cluster') |> 
  # this adds the condVar- are these sd or se?
  bind_cols(sd = sqrt(c(attributes(ranef(test_full)$cond$cluster)$condVar))) |>
  rename(intercept = `(Intercept)`) |> 
  mutate(model = 'full_mm', hasx = TRUE, ismm = TRUE)

rand_coefs <- as_tibble(coef(test_rand)$cond$cluster, rownames = 'cluster') |> 
    # this adds the condVar- are these sd or se?
  bind_cols(sd = sqrt(c(attributes(ranef(test_rand)$cond$cluster)$condVar))) |>
  rename(intercept = `(Intercept)`) |> 
  mutate(model = 'rand', hasx = FALSE, ismm = TRUE)

# I *think* this is what I want- the effect of cluster at x = 0 in the fixed model. That's not the same as the cluster estimates in the cluster model, which don't account for the x. They'll come in when we look at the actual fit line.
fixed_coefs <- tibble(cluster = unique(full_coefs$cluster), x = 0) |> 
 add_predictions(test_fixed, se.fit = TRUE) |> 
  rename(intercept = fit, sd = se.fit) |>  # for this specific case at x = 0
  mutate(model = 'fixed', hasx = TRUE, ismm = FALSE)

# cluster coefs- these are comparable to 'rand', I think- there's no x term that gets rolled into cluster.
cluster_coefs <- tibble(cluster = unique(full_coefs$cluster), x = 0) |> 
  add_predictions(test_cluster) |> 
  rename(intercept = fit, sd = se.fit) |>  # these don't even have an x, so doesn't matter
  mutate(model = 'cluster', hasx = FALSE, ismm = FALSE)



intercept_compare <- bind_rows(full_coefs, rand_coefs, 
                               fixed_coefs, cluster_coefs) |> 
  mutate(model = forcats::fct_reorder(model, ismm))

We can see the ‘shrinkage’ here in two ways. First, in the models with an x term assessed at the origin: the random model estimates a deviation for each cluster around a common intercept, while for the fixed model we just extract the per-cluster intercepts. The blue (random) points move closer to the overall intercept (horizontal line), though in some cases they cross it.

# Leaving here to remind myself why we don't want rand.
ggplot(intercept_compare |> filter(hasx), 
       aes(x = cluster, y = intercept, 
           color = model)) +
  geom_pointrange(aes(ymax = intercept+sd, ymin = intercept-sd),
                  position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = fixef(test_full)$cond[1])

Similarly, if we look at a model without an x term, and so just estimate the clusters themselves, the blue (random) version is closer to the intercept than the red (clusters as fixed effects).

ggplot(intercept_compare |> filter(!hasx), 
       aes(x = cluster, y = intercept, color = model)) +
    geom_pointrange(aes(ymax = intercept+sd, ymin = intercept-sd),
                  position = position_dodge(width = 0.1)) +
  geom_hline(yintercept = fixef(test_rand)$cond[1])

So now, can we plot the fits along x, and see if we can get the lines to make sense in terms of what they’re doing relative to those points above?

full_preds_cluster <- as_tibble(coef(test_full)$cond$cluster, 
                                  rownames = 'cluster') |> 
  bind_cols(sd = sqrt(c(attributes(ranef(test_full)$cond$cluster)$condVar))) |>
  rename(intercept = `(Intercept)`, slope = x)

full_clusterx <- test_full$frame |> select(x, cluster) |> distinct()

full_preds_cluster <- full_preds_cluster |> 
  left_join(full_clusterx) |> 
  mutate(predicted = intercept + slope*x) |> 
  mutate(model = 'full_mm', hasx = TRUE, ismm = TRUE)

# add some checks
# The full model fit- fits match, se a bit smaller
full_preds_cluster <- full_preds_cluster |> 
  add_predictions(test_full) |> 
  rename(full_model_fit = fit, 
         full_model_sd = se.fit)

# can we build it from ranef?
full_preds_cluster <- full_preds_cluster |> 
  bind_cols(ranef(test_full)$cond$cluster) |> 
  rename(randdev = `(Intercept)`) |> 
  add_predictions(test_full, re.form = NA) |> 
  rename(no_clust_fit = fit, no_clust_sd = se.fit) |> 
  mutate(check_re = no_clust_fit + randdev)
  

# rand_preds <- as_tibble(coef(test_rand)$cond$cluster, rownames = 'cluster') |> 
#     # this adds the condVar- are these sd or se?
#   bind_cols(sd = sqrt(c(attributes(ranef(test_rand)$cond$cluster)$condVar))) |>
#   rename(intercept = `(Intercept)`) |> 
#   mutate(model = 'rand', hasx = FALSE, ismm = TRUE)

# I *think* this is what I want- the effect of cluster at x = 0 in the fixed model. That's not the same as the cluster estimates in the cluster model, which don't account for the x. They'll come in when we look at the actual fit line.
fixed_preds_cluster <- test_fixed$frame |>
  select(x, cluster) |> distinct() |> 
  add_predictions(test_fixed, se.fit = TRUE) |> 
  rename(predicted = fit, sd = se.fit) |>  # for this specific case at x = 0
  mutate(model = 'fixed', hasx = TRUE, ismm = FALSE)

preds_compare <- bind_rows(full_preds_cluster, 
                           fixed_preds_cluster) |> 
  mutate(model = forcats::fct_reorder(model, ismm))

A quick check of the various point fits for the full model- the estimates for the clusters are the same whether we use the full model or extract them in various ways from coef and ranef. The blue points here are the odd one out; they are the fits from the no-cluster (population-level) predict (and the orchid are those with the cluster deviations added back on).

ggplot(full_preds_cluster) +
  geom_point(aes(x = x, y = predicted), color = 'firebrick') +
  geom_point(aes(x = x+0.1, y = full_model_fit), color = 'forestgreen') +
  geom_point(aes(x = x+0.2, y = no_clust_fit), color = 'dodgerblue') +
  geom_point(aes(x = x+0.3, y = check_re), color = 'orchid')

Now, I want to fit some lines.

First, let’s fit those models themselves.

newdata_nc <- tibble(x = 0:10)
# use re.form = NA to fit at population
fit_full <- newdata_nc |> 
  add_predictions(test_full, se.fit = TRUE, re.form = NA) |> 
  rename(predicted = fit, sd = se.fit) |> 
  mutate(model = 'full_mm')

newdata_c <- expand_grid(newdata_nc, cluster = unique(full_coefs$cluster))
fit_fixed <- newdata_c |> 
  add_predictions(test_fixed, se.fit = TRUE) |> 
  rename(predicted = fit, sd = se.fit) |> 
  mutate(model = 'fixed')

# That gives a line for each cluster. But what I want is the main effect of x. I can get that slope from fixef(test_fixed), but the model intercept is cluster 1's intercept, not the true (grand) intercept, so calculate that as the cluster-1 intercept plus the mean cluster effect (na.rm for the rank-deficient level).
fixed_cluster_effects <- fixef(test_fixed)$cond[grepl('^cluster', names(fixef(test_fixed)$cond))]
fixed_intercept <- fixef(test_fixed)$cond['(Intercept)'] +
  mean(c(0, fixed_cluster_effects), na.rm = TRUE)

fixed_slope <- fixef(test_fixed)$cond['x']

# intercept error (note, I'm not trying to deal with slope error here yet)
int_se <- summary(test_fixed)$coefficients$cond[1, 'Std. Error']

fit_fixed2 <- newdata_nc |> 
  mutate(predicted = fixed_intercept + fixed_slope*x,
         sd = int_se,
         model = 'fixed')

Let’s also add lines that fit through the cluster estimates. Note that these will have much too high error, but we’re largely ignoring that anyway.

fitrandclusts <- glmmTMB(predicted ~ x, 
                         data = preds_compare |> filter(model == 'full_mm'))

fitfixedclusts <- glmmTMB(predicted ~ x, 
                         data = preds_compare |> filter(model == 'fixed'))

# I think we can add the other two on here too, if we join to the relevant x-values
xc <- sddata$simdata[[3]] |> 
  select(x, cluster) |> 
  distinct()

# Fit the cluster estimates
rand_addx <- rand_coefs |> 
  left_join(xc) |> 
  rename(predicted = intercept)

cluster_addx <- cluster_coefs |> 
  select(-x) |> 
  left_join(xc) |> 
  rename(predicted = intercept)

fitrandonly <- glmmTMB(predicted ~ x, 
                         data = rand_addx)
fitclustonly <- glmmTMB(predicted ~ x, 
                         data = cluster_addx)

fit_randonly <- newdata_nc |> 
  add_predictions(fitrandonly, se.fit = TRUE, re.form = NA) |> 
  rename(predicted = fit, sd = se.fit) |> 
  mutate(model = 'rand')

fit_clusteronly <- newdata_nc |> 
  add_predictions(fitclustonly, se.fit = TRUE, re.form = NA) |> 
  rename(predicted = fit, sd = se.fit) |> 
  mutate(model = 'cluster')

# Fit the points without cluster info
model_nocluster <- glmmTMB(y ~ x, data = sddata$simdata[[3]])

fit_nocluster <- newdata_nc |> 
  add_predictions(model_nocluster, se.fit = TRUE, re.form = NA) |> 
  rename(predicted = fit, sd = se.fit) |> 
  mutate(model = 'nocluster')

preds_withextra <- bind_rows(preds_compare, rand_addx, cluster_addx)

Need a linetype here to see that the red (fixed effect for cluster, then fit x) and blue (cluster random with x) sit on top of each other. Which seems odd. I expected the 'random cluster, then fit x' line (i.e. the purple) to match, but it doesn't. Doesn't this imply zero shrinkage? I guess in the slope. So maybe that's actually right??

ggplot(preds_withextra, aes(x = x, y = predicted, 
                          ymin = predicted-sd, 
                          ymax = predicted+sd, 
                          color = model,
                          linetype = model)) +
  geom_pointrange(position = position_dodge(width = 0.1)) +
  geom_line(data = fit_full) +
  geom_line(data = fit_fixed2) +
  geom_line(data = fit_randonly) +
  geom_line(data = fit_clusteronly) +
  geom_line(data = fit_nocluster)

What do I want to do now? Clean up the above so it makes more sense.

Cleanup


  • ranef(test_full)

  • Plot the fit and error

    • At mean random

    • At each random

  • The other stuff above- just points, just cluster means, etc.

  • What’s the ‘shrinkage’? Can we plot essentially the points they move to a la the website?

  • Plot

  • Nested? Is that where the shrinkage ends up mattering?

Simple illustrations

What do random effects mean?

Some plots here of how the data look: points scattered around their clusters, with the clusters following the mean trend.

For the situation we have in our data, with each ‘cluster’ (random level) having the same x-value, we can see how the various terms work out to yield the overall distribution given some ‘true’ slope and intercept, random variance, and observation-level variance.
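
As a reference point, a minimal sketch of the data-generating process simulate_gaussian_mm() is assumed to implement for this no-random-slope, one-x-per-cluster case (sketch only, with cluster_x taken as a vector of per-cluster x values rather than a generating function; the real generator is in the package):

# y_ij = (intercept + b_i) + slope * x_i + e_ij,
#   b_i ~ N(0, sd_rand_intercept),  e_ij ~ N(0, sigma)
# sim_sketch <- function(n_per_cluster, n_clusters, intercept, slope,
#                        sigma, sd_rand_intercept, cluster_x) {
#   b <- rnorm(n_clusters, 0, sd_rand_intercept)      # cluster deviations
#   tibble(cluster = rep(seq_len(n_clusters), each = n_per_cluster),
#          x = rep(cluster_x, each = n_per_cluster)) |>
#     mutate(y = intercept + b[cluster] + slope * x +
#              rnorm(n(), 0, sigma))                   # residual (observation) noise
# }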

simdata <- simulate_gaussian_mm(n_per_cluster = 5,
                                    n_clusters = 10,
                                    intercept = 1,
                                    slope = 0.5,
                                    sigma = 0.1,
                                    sd_rand_intercept = 0.2,
                                    sd_rand_slope = 0,
                                    rand_si_cor = 0,
                                    cluster_x = \(x) runif(x, min = 0, max = 10),
                                    obs_x_sd = 0)

ggplot(simdata, aes(x = x, y = y, color = factor(cluster))) + geom_point()

Increasing random variance

simdatarv <- simulate_gaussian_mm(n_per_cluster = 5,
                                    n_clusters = 10,
                                    intercept = 1,
                                    slope = 0.5,
                                    sigma = 0.1,
                                    sd_rand_intercept = 1,
                                    sd_rand_slope = 0,
                                    rand_si_cor = 0,
                                    cluster_x = \(x) runif(x, min = 0, max = 10),
                                    obs_x_sd = 0)

ggplot(simdatarv, aes(x = x, y = y, color = factor(cluster))) + geom_point()

Increasing obs variance

simdataov <- simulate_gaussian_mm(n_per_cluster = 5,
                                    n_clusters = 10,
                                    intercept = 1,
                                    slope = 0.5,
                                    sigma = 1,
                                    sd_rand_intercept = 0.2,
                                    sd_rand_slope = 0,
                                    rand_si_cor = 0,
                                    cluster_x = \(x) runif(x, min = 0, max = 10),
                                    obs_x_sd = 0)

ggplot(simdataov, aes(x = x, y = y, color = factor(cluster))) + geom_point()

This will make my life easier moving forward to figure this out:

simparams <- expand_grid(n_per_cluster = 5, 
                         n_clusters = 10, 
                         intercept = 1, 
                         slope = 0.5,
                         sigma = c(0.1, 1),
                         sd_rand_intercept = c(0.5, 2),
                         sd_rand_slope = 0, 
                         rand_si_cor = 0,
                         obs_x_sd = 0)
# Have to do cluster_x not in the expand_grid - it doesn't fit in a row.
# Tibbles *will* hold functions, but it's weirdly hard to get them in there if not handbuilding.
simresults <- simparams |> 
  bind_cols(tibble(simdata = purrr::pmap(simparams, simulate_gaussian_mm, 
                                         cluster_x = \(x) runif(x, min = 0, max = 10))))
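# (Aside on the comment above) wrapping the function in list() does seem to work for
# getting it into the grid, e.g.
#   expand_grid(simparams, cluster_x = list(\(x) runif(x, min = 0, max = 10)))
# and pmap() would then pass it through like any other column; supplying it as a
# fixed pmap() argument, as here, stays simpler though.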
# I thought there might be a way to ggplot straight out of the nested cols, but not obviously working
simresults |> 
  unnest(cols = simdata) |> 
ggplot(aes(x = x, y = y, color = factor(cluster))) + 
  geom_point() +
  facet_grid(sigma ~ sd_rand_intercept)