library(tidyverse)
library(fitdistrplus)
library(foreach)
Fitting partial distributions
We have some data that has a lot of zeros in it, but is otherwise reasonably modelled by a lognormal. We want to be able to shift the distribution to change the number of zeros, and so we need to fit that lognormal, and then shift it around.
One way to think about the problem is that there is some detection limit or process limit below which we get zeros in the data, but where theoretically the distribution continues. Specifically, we’re dealing with water flow, and so maybe it drops below the ability of the gauges to detect, but continues dropping below that. It may even drop underground (e.g river is dry), but to some approximation, water remains underground, and the amount can continue dropping.
Thus, we want to take the data we have, and say that any zeros are actually some small but unknown number, and fit the distribution. We can identify the zero/undetectable threshold as the smallest nonzero number. Then, we can simulate new data with a shifted distribution, which will have a different number of points below that threshold.
Here, we want to figure out how to shift the data.
Generate data
Make the ‘true’ data
<- rlnorm(10000, 5, 1.5) truedata
Set the detection limit. Making this a variable since we need it in a few places and might want to change it.
<- 10 detectionlimit
<- truedata
censdata < detectionlimit] <- 0 censdata[censdata
Plot those densities- big spike at 0 in censored (obviously).
<- tibble(truedata, censdata)
tc
ggplot(tc) +
geom_density(aes(x = truedata), color = 'black') +
geom_density(aes(x = censdata), color = 'firebrick') +
xlim(-1, 100)
Warning: Removed 6104 rows containing non-finite outside the scale range
(`stat_density()`).
Removed 6104 rows containing non-finite outside the scale range
(`stat_density()`).
Naive fit
We can just use fitdistr
to get a naive fit- it should be right for truedata but wrong for censdata. It’s not way off, but it is off.
<- fitdistr(truedata, densfun = 'lognormal')
fit_true <- fitdistr(censdata[censdata > 0], densfun = 'lognormal')
fit_cens
fit_true
meanlog sdlog
5.01228084 1.49945363
(0.01499454) (0.01060274)
fit_cens
meanlog sdlog
5.135346982 1.376120151
(0.014012165) (0.009908097)
How far off?
<- tibble(x = seq(0,1000, by = 0.1),
df_cdf cdf_true = plnorm(x,
$estimate['meanlog'],
fit_true$estimate['sdlog']),
fit_truecdf_cens = plnorm(x,
$estimate['meanlog'],
fit_cens$estimate['sdlog'])) fit_cens
ggplot() +
stat_ecdf(data = tc, mapping = aes(x = truedata), color = 'forestgreen') +
stat_ecdf(data = tc, mapping = aes(x = censdata), color = 'firebrick') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_true), color = 'darkseagreen') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_cens), color = 'deeppink2') +
coord_cartesian(xlim = c(-1, 2*detectionlimit), ylim = c(0, 0.1))
The empirical cdfs fit exactly after 10, because of the probability mass at 0 for the censored data. The fitted distributions are clearly very good for the true data, but quite a ways off for the censored.
Censored fit
We should be able to use fitdistrplus::fitdistcens
, though we need some weird data manipulation. Our data is either left-censored or true. From the help, left
should be NA for left-censored, or observed for non-censored, and right
should be the right bound of the interval for interval-censored or the data for non-censored. I think that means we set it to the detection limit for the censored observations- e.g. it’s the right bound of a censoring interval from -infinity to detection limit. But the interval censoring is confusing me.
<- tc |>
censframe ::mutate(left = ifelse(censdata == 0, NA, censdata),
dplyrright = ifelse(censdata == 0, detectionlimit, censdata)) |>
::select(left, right) dplyr
Now, fit that. There’s a bug where an internal error check fails for tibbles, so send it an old-fashioned df. (it asks for length(censdata[,1]), which is the length of the vector for data.frames, but because indexing remains a tibble, fails for tibbles).
<- fitdistcens(censdata = data.frame(censframe), distr = 'lnorm')
fit_cens2
fit_cens2
Fitting of the distribution ' lnorm ' on censored data by maximum likelihood
Parameters:
estimate
meanlog 5.013691
sdlog 1.496170
Add that to the cdfs
<- df_cdf |>
df_cdf mutate(cdf_cens2 = plnorm(x,
$estimate['meanlog'],
fit_cens2$estimate['sdlog']),) fit_cens2
And plot it
ggplot() +
stat_ecdf(data = tc, mapping = aes(x = truedata), color = 'forestgreen') +
stat_ecdf(data = tc, mapping = aes(x = censdata), color = 'firebrick') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_true), color = 'darkseagreen') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_cens), color = 'deeppink2') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_cens2), color = 'purple') +
coord_cartesian(xlim = c(-1, 2*detectionlimit), ylim = c(0, 0.1))
Shifted to censor
The above is working, but when I go to apply it, there’s a slightly different issue. The data above is still bounded by 0, we’re just saying we can’t see it below some limit. And so the censored fitting allows us to fill in that bit of distribution between 0 and the limit.
The issue I actually encounter with my real data is that it looks like the curve simply cuts off at zero, but should go below. In other words, it looks like this:
<- tc |>
tc mutate(shiftdata = truedata-detectionlimit,
shiftcens = ifelse(shiftdata > 0, shiftdata, 0))
ggplot(tc) +
stat_ecdf(mapping = aes(x = shiftdata), color = 'grey80') +
stat_ecdf(mapping = aes(x = shiftcens), color = 'black') +
stat_ecdf(mapping = aes(x = truedata), color = 'forestgreen') +
stat_ecdf(mapping = aes(x = censdata), color = 'firebrick') +
coord_cartesian(xlim = c(-detectionlimit-1, 10*detectionlimit), ylim = c(0, 0.5))
So, it looks like the same problem, but we run into problems when we try to fit it. Because any function we try to fit will still be bounded by 0, not allow some unknown pseudo-tail to go lower than 0.
Fit it. where do we put the detection limit in this case? It shouldn’t be the same as above, necessarily, because we do have numbers between 0 and 10 (or whatever it is). I’ll choose 1, but even that’s arbitrary. Changing the left to NA anything less than whatever that value is so we don’t have a few numbers and then the spike.
<- 1
shiftdetect <- tc |>
censframeshift ::mutate(left = ifelse(shiftcens < shiftdetect, NA, shiftcens),
dplyrright = ifelse(shiftcens == 0, shiftdetect, shiftcens)) |>
::select(left, right) dplyr
Now, fit that.
<- fitdistcens(censdata = data.frame(censframeshift), distr = 'lnorm')
fit_censshift
fit_censshift
Fitting of the distribution ' lnorm ' on censored data by maximum likelihood
Parameters:
estimate
meanlog 4.752459
sdlog 1.914297
That is quite a bit different than the previous
fit_cens2
Fitting of the distribution ' lnorm ' on censored data by maximum likelihood
Parameters:
estimate
meanlog 5.013691
sdlog 1.496170
Add that to the cdfs
<- df_cdf |>
df_cdf mutate(cdf_censshift = plnorm(x,
$estimate['meanlog'],
fit_censshift$estimate['sdlog'])) fit_censshift
And plot it
ggplot() +
stat_ecdf(data = tc, mapping = aes(x = truedata), color = 'forestgreen') +
stat_ecdf(data = tc, mapping = aes(x = censdata), color = 'firebrick') +
stat_ecdf(data = tc, mapping = aes(x = shiftcens), color = 'black') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_true),
color = 'darkseagreen') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_cens),
color = 'deeppink2') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_cens2),
color = 'purple') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_censshift),
color = 'magenta') +
coord_cartesian(xlim = c(-1, 2*detectionlimit), ylim = c(0, 0.1))
We can see that the magenta line is a terrible fit, because it is trying to fit an actual lognormal, which is necessarily bound by 0.
So, we want to find a way to fit the bit of the lognormal we have, assuming that some portion has gone below 0.
Shift the distribution
This approach exactly parallels how we created the dummy data- we shift the distribution up some amount, call that the detection limit, and fit (and then shift the fit back down). Let’s first do that for the case we know should work- exactly mirroring the shift we used to create the data. Assuming that works, we’ll need to then explore the situation where we don’t know how far to shift.
This is a bit silly, because we are just recovering the test data, but to make the process clear (and retain the zeros in shiftcens), we create a shifted version of it. This should match the censored true data censdata
for this precise re-shift, with the slight change in the location of the initial probability mass.
<- tc |>
tc mutate(shiftcensback = shiftcens + detectionlimit)
ggplot(tc) +
stat_ecdf(data = tc, mapping = aes(x = censdata), color = 'firebrick') +
stat_ecdf(data = tc, mapping = aes(x = shiftcensback), color = 'cyan', linetype = 2) + xlim(c(0,100))
Warning: Removed 6104 rows containing non-finite outside the scale range
(`stat_ecdf()`).
Removed 6104 rows containing non-finite outside the scale range
(`stat_ecdf()`).
Now, we fit that shifted data. We’re back to censoring at the detectionlimit (or whatever the shift back is). Now, instead of a mass at 0, all that mass gets pushed to detectonlimit (e.g. min(tc$shiftcensback) is 10, not 0). So we have to use that as the censoring value.
<- tc |>
censshiftback ::mutate(left = ifelse(shiftcensback <= detectionlimit, NA, shiftcensback),
dplyrright = ifelse(shiftcensback <= detectionlimit, detectionlimit, shiftcensback)) |>
::select(left, right)
dplyr
<- fitdistcens(censdata = data.frame(censshiftback), distr = 'lnorm')
fit_censshiftback
fit_censshiftback
Fitting of the distribution ' lnorm ' on censored data by maximum likelihood
Parameters:
estimate
meanlog 5.013691
sdlog 1.496170
That is the same as fit_cens2, which we know works. Plot to double check. Need to add to df_cdf. And make a shifted x that we can use to re-backshift the new cdf. If that makes any sense.
<- df_cdf |>
df_cdf mutate(cdf_censshiftback = plnorm(x,
$estimate['meanlog'],
fit_censshiftback$estimate['sdlog']),
fit_censshiftbackx_shiftback = x-detectionlimit)
Now, cdf_censshiftback should match the shifted-up distribution, and shifting it back down should match the ‘original’ distribution that looks like it should go below 0.
ggplot() +
stat_ecdf(data = tc, mapping = aes(x = censdata), color = 'firebrick') +
stat_ecdf(data = tc, mapping = aes(x = shiftcens), color = 'black') +
stat_ecdf(data = tc, mapping = aes(x = shiftcensback),
color = 'cyan', linetype = 2) +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_censshiftback),
color = 'deeppink2') +
geom_line(data = df_cdf, mapping = aes(x = x, y = cdf_censshift),
color = 'magenta') +
geom_line(data = df_cdf, mapping = aes(x = x_shiftback, y = cdf_censshiftback),
color = 'dodgerblue') +
coord_cartesian(xlim = c(-1, 2*detectionlimit), ylim = c(0, 0.1))
So, that seems to be working. Now, though, what if we don’t know how much to shift? E.g. we’ll have data with no obvious ‘true’ shift. Can we just shift it way off to make sure the cdf hits zero, and then shift it back?
I think at this point I probably need to use a clean dataframe and set of variables- detectionlimit doesn’t really mean the same thing as ‘arbitrary shift’.
Unknown shift
I’m going to start all the way back at the beginning with a clean set of truedata and then shift it.
<- 10 trueshift
<- tibble(truedata = truedata,
df_shift shiftdata = truedata-trueshift,
shiftcens = ifelse(shiftdata > 0, shiftdata, 0))
Is that what I think it is? Yes
ggplot(df_shift) +
stat_ecdf(aes(x = truedata), color = 'black') +
stat_ecdf(aes(x = shiftdata), color = 'firebrick') +
stat_ecdf(aes(x = shiftcens), color = 'dodgerblue') +
coord_cartesian(xlim = c(-10, 100), ylim = c(0, 0.5))
SO, the blue line there (shiftcens) is what we think we have for the real data. We know from above that it works to fit it if we bump it back up by exactly what we bumped down to make it, because, unsurprisingly, it becomes exactly what it was.
So the question now is, can we shift that blue line back up some arbitrary amount (as long as it’s greater than the downshift), and still recover the distribution?
<- 100 upshift
Now, the min of that is upshift, not 0.
<- df_shift |>
df_shift mutate(shiftup = shiftcens + upshift)
Get the fit for that. It clearly does not give the same values, by necessity.
<- df_shift |>
upcens ::mutate(left = ifelse(shiftup <= upshift, NA, shiftup),
dplyrright = ifelse(shiftup <= upshift, upshift, shiftup)) |>
::select(left, right)
dplyr
<- fitdistcens(censdata = data.frame(upcens), distr = 'lnorm')
fit_upcens
fit_upcens
Fitting of the distribution ' lnorm ' on censored data by maximum likelihood
Parameters:
estimate
meanlog 5.6890233
sdlog 0.9478606
Do we at least do a good job fitting that shifted distribution?
<- tibble(x = seq(0,1000, by = 0.1),
cdf_up cdf_shiftup = plnorm(x,
$estimate['meanlog'],
fit_upcens$estimate['sdlog'])) fit_upcens
That’s terrible.
ggplot() +
stat_ecdf(data = df_shift, mapping = aes(x = shiftup)) +
geom_line(data = cdf_up, mapping = aes(x = x, y = cdf_shiftup), color = 'purple') +
coord_cartesian(xlim = c(0, 1000), ylim = c(0, 0.5))
Write a function and loop to look at this over some range of shifts and clean up the dataframe.
<- function(indata, upshift) {
shiftfun # Shift the data
<- tibble(clipped = indata, shiftdat = indata + upshift, upshift)
shiftdf
# Create censored dataset and fit
# Handle the zero case- we just use the next value up
if (upshift == 0) {rightlim <- min(indata[indata>0])} else {rightlim <- upshift}
<- shiftdf |>
upcens ::mutate(left = ifelse(shiftdat <= upshift, NA, shiftdat),
dplyrright = ifelse(shiftdat <= upshift, rightlim, shiftdat)) |>
::select(left, right)
dplyr
<- fitdistcens(censdata = data.frame(upcens), distr = 'lnorm')
fit_up
# This isn't ideal, but we can shove the cdf on here too, it just has rows that don't mean the same thing. prevents us saving a list though.
<- shiftdf |>
shiftdf mutate(x = row_number()/10,
cdf_up = plnorm(x,
$estimate['meanlog'],
fit_up$estimate['sdlog']),
fit_uppdf_up = dlnorm(x,
$estimate['meanlog'],
fit_up$estimate['sdlog']),
fit_up# Some diagnostics
fitloglik = fit_up$loglik,
fitsemean = fit_up$sd['meanlog'],
fitsesd = fit_up$sd['sdlog'])
}
Loop over a few values of the shift, from as-is, below true, true, and above true. Indata is the dummy for the data we actually have (ending at 0, but too high). Here, it’s in df_shift$shiftcens.
<- foreach(upshift = c(0, 1, 5, 10, 20, 50, 100), .combine = bind_rows) %do% {
abshift shiftfun(df_shift$shiftcens, upshift)
}
ggplot(abshift) +
stat_ecdf(aes(x = shiftdat, color = factor(upshift))) +
geom_line(aes(x = x, y = cdf_up, color = factor(upshift)), linetype = 2) +
coord_cartesian(xlim = c(0,200), ylim = c(0, 0.5))
I think maybe we need to minimize Kullback-Leibler, but we might be able to just look at the fit diagnostics
|>
abshift distinct(upshift, fitloglik, fitsemean, fitsesd) |>
ggplot(aes(x = upshift, y = fitloglik)) +
geom_line() + geom_point() +
geom_label(aes(label = upshift), hjust = -0.1, vjust = -0.1, alpha = 0.2)
|>
abshift distinct(upshift, fitloglik, fitsemean, fitsesd) |>
ggplot(aes(x = upshift)) +
geom_line(aes(y = fitsemean)) + geom_point(aes(y = fitsemean)) +
geom_line(aes(y = fitsesd), color = 'firebrick') +
geom_point(aes(y = fitsesd), color = 'firebrick') +
geom_label(aes(y = fitsemean, label = upshift), hjust = -0.1, vjust = -0.1, alpha = 0.2)
So, unsurprisingly, the thing to do is maximise log-likelihood. I could probably figure out a way to include a shift in the main function and use whatever the optimiser is doing internally.
Or, I can use optim on the output?
<- df_shift$shiftcens
indata <- function(upshift) {
shiftoptim # Shift the data
<- tibble(clipped = indata, shiftdat = indata + upshift, upshift)
shiftdf
# Create censored dataset and fit
# Handle the zero case- we just use the next value up
if (upshift == 0) {rightlim <- min(indata[indata>0])} else {rightlim <- upshift}
<- shiftdf |>
upcens ::mutate(left = ifelse(shiftdat <= upshift, NA, shiftdat),
dplyrright = ifelse(shiftdat <= upshift, rightlim, shiftdat)) |>
::select(left, right)
dplyr
suppressWarnings(fit_up <- fitdistcens(censdata = data.frame(upcens), distr = 'lnorm'))
return(-fit_up$loglik)
}
This says to use optimize
optim(1, shiftoptim)
Warning in optim(1, shiftoptim): one-dimensional optimization by Nelder-Mead is unreliable:
use "Brent" or optimize() directly
$par
[1] 10.275
$value
[1] 67581.24
$counts
function gradient
28 NA
$convergence
[1] 0
$message
NULL
optimize(shiftoptim, interval = c(0, 1000))
$minimum
[1] 10.26533
$objective
[1] 67581.24
Let’s write a version that takes the data, then optimizes to get the right upshift.
<- function(rawdata, shift_up) {
fitshift # Handle the zero case- we just use the next value up
if (shift_up == 0) {rightlim <- min(rawdata[rawdata>0])
else {
} <- shift_up}
rightlim
<- rawdata + shift_up
inshift
<- tibble(left = ifelse(inshift <= shift_up, NA, inshift),
upcens right = ifelse(inshift <= shift_up, rightlim, inshift))
suppressWarnings(fit_up <- fitdistcens(censdata = data.frame(upcens),
distr = 'lnorm'))
return(fit_up)
}
<- function(shift_up, rawdata) {
opt_up
<- fitshift(rawdata, shift_up)
fit_up
return(-fit_up$loglik)
}
<- function(rawdata) {
optshift
# get the optimal shift
<- optimize(opt_up, interval = c(0, 1000), rawdata = rawdata)
shift
# Get the fit at that shift (would be nice to kick this out of opt_up somehow)
<- fitshift(rawdata, shift$minimum)
fit_up
# Create a df for output
# The shifted data
<- tibble(orig_data = rawdata,
shiftdf shift_data = rawdata + shift$minimum,
optimum_shift = shift$minimum)
# This isn't ideal, but we can shove the cdf on here too, it just has rows that don't mean the same thing. prevents us saving a list though.
<- shiftdf |>
shiftdf mutate(x = row_number()/10,
cdf_up = plnorm(x,
$estimate['meanlog'],
fit_up$estimate['sdlog']),
fit_uppdf_up = dlnorm(x,
$estimate['meanlog'],
fit_up$estimate['sdlog']),
fit_up# Some diagnostics
fitloglik = fit_up$loglik)
# and a shifted-back version of the cdf/pdf just needs a shifted x. The
# backshift of the data is just the original `rawdata`.
<- shiftdf |>
shiftdf mutate(x_back = x-shift$minimum)
}
<- optshift(df_shift$shiftcens) optimal_fit
That’s not exactly 10, but is it close? The CDF looks pretty good.
ggplot(optimal_fit) +
stat_ecdf(aes(x = shift_data)) +
geom_line(aes(x = x, y = cdf_up), linetype = 2) +
coord_cartesian(xlim = c(0,400), ylim = c(0, 0.5))
PDF not as perfect, but OK.
ggplot(optimal_fit) +
geom_density(aes(x = shift_data)) +
geom_line(aes(x = x, y = pdf_up), linetype = 2) +
xlim(c(0,1000))
Warning: Removed 1032 rows containing non-finite outside the scale range
(`stat_density()`).
And the backshift (e.g. back to the original position of the data)
ggplot(optimal_fit) +
stat_ecdf(aes(x = orig_data)) +
geom_line(aes(x = x_back, y = cdf_up), linetype = 2) +
coord_cartesian(xlim = c(0,400), ylim = c(0, 0.5))