For this example we’ll make use of a dataset found in the PROMISE repository, donated by Prof. Martin Shepperd in 2005 and originally from J. M. Desharnais’ master’s thesis.1 We would like to predict Effort (our outcome) for implementing a software artifact, given the programming language used (our predictor).
Data, script, etc. are found at the GitHub repository.2
library(foreign)     # provides read.arff()
library(rethinking)  # provides ulam(), precis(), compare(), etc.

f <- read.arff("data/desharnais.arff")
# remove columns we don't need
f <- f[-c(1:5,7:11)]
# convert Language (factor) to numeric
f$Language <- as.numeric(f$Language)
str(f)
## 'data.frame': 81 obs. of 2 variables:
## $ Effort : num 5152 5635 805 3829 2149 ...
## $ Language: num 1 1 1 1 1 1 2 1 1 1 ...
So, from the top, we have Effort, our outcome, and then Language, our predictor. Effort is in hours spent. Language we’ve converted to an integer \(1,2,3\), indicating three different languages. In total we have 81 rows (or observations, if you will).
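Before going further, it can be handy to see how the 81 projects are distributed over the three languages. This is a small addition, not shown in the original listing, and its output is omitted here:
# number of projects per language (output not shown)
table(f$Language)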
What ontological and epistemological assumptions can we make concerning the underlying process that generated the outcome Effort?
From an ontological perspective, the outcome is a count, i.e., a non-negative integer ranging from \(0\) to \(\infty\). For this type of data the \(\mathsf{Poisson}(\lambda)\) distribution is commonly used, where \(\lambda\) is the parameter we want to estimate.3
From an epistemological perspective, and from an information theoretical point of view, we want to use a likelihood that allows the data to happen in the most ways, i.e., it doesn’t constrain the story data wants to tell us. Given the above, the maximum entropy distribution is the \(\mathsf{Poisson}(\lambda)\).
Let us now design a set of models \(\mathbf{M}=\{\mathcal{M}_0,\ldots,\mathcal{M}_n\}\) and see how well they compare concerning out-of-sample predictions. For the first model we’ll use prior predictive checks (which we won’t report for the second model), while for the second model we’ll do posterior predictive checks.
We want a simple intercept-only model (to estimate the grand mean \(\alpha\)). A common prior is \(\mathsf{Normal}(0,10)\), but since we use a log link with a \(\mathsf{Poisson}\) likelihood, such a wide prior can have very strange effects on the outcome scale.
Without looking at the empirical data we can assume that we perhaps have a mean Effort in the thousands (not much of a project otherwise). The maximum Effort could very well end up in the millions (50–60 people working for ten years in a project), but let’s assume it’s not in the billions (70,000 people working for ten years in a project).
# prior predictive check: maximum of 100,000 draws from the implied log-normal (sdlog = 4)
max(rlnorm(1e5, 0, 4))
## [1] 37995323
which seems OK. Let’s have a look at a default prior,
# the same check with the wider default prior (sdlog = 10)
max(rlnorm(1e5, 0, 10))
## [1] 1.035314e+19
which seems absurd. It’s hard to see a project with that many hours.
Given these checks, let’s go for a \(\mathsf{Normal}(0,3)\) prior, slightly tighter still than the \(\mathsf{Normal}(0,4)\) we just sampled from, on the \(\alpha\) parameter that we want to estimate.
Now we sample the model with the empirical data. We sample using four independent chains, which are then used to check whether we have reached a stationary posterior distribution.
m0 <- ulam(
  alist(
    Effort ~ poisson(lambda),
    log(lambda) <- alpha, # log link
    alpha ~ normal(0, 3)
  ),
  data = f, cores = 4, chains = 4, cmdstan = TRUE, log_lik = TRUE, iter = 5e3
)
Check diagnostics.
precis(m0)
##            mean          sd    5.5%   94.5%    n_eff    Rhat4
## alpha  8.526364 0.001551667 8.52393 8.52885 3133.083 1.001823
The diagnostics look OK. \(\widehat{R} < 1.01\) and the effective sample size (n_eff) is in the thousands, so all is well there. Let’s also have a look at the trankplots for our estimated parameters.
This is what healthy trankplots should look like, i.e., four chains mixing well after the initial phase.
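The original listing doesn’t include the plotting call; a minimal sketch, assuming the trankplots were produced with rethinking’s trankplot():
# trace-rank plots for the estimated parameters of m0
trankplot(m0)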
The estimated \(\alpha\) was roughly \(8.5\). But remember, we used a log link, so we need to get the estimate back on the outcome scale, i.e., \(\exp(8.526) \approx 5046\). That’s the overall mean of Effort, regardless of which language was used.
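If you prefer to let R do the back-transformation, one way (a small sketch, not part of the original listing) is to summarise the exponentiated posterior samples:
post <- extract.samples(m0)  # posterior samples of alpha
mean(exp(post$alpha))        # mean Effort on the outcome (hours) scale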
Conducting posterior predictive checks when we only estimate a grand mean is not particularly meaningful, so we’ll leave this for now. Let’s focus on adding our predictor Language.
We’ll add Language as a varying intercept, i.e., each of the three languages get their own intercept.
m1 <- ulam(
  alist(
    Effort ~ poisson(lambda),
    log(lambda) <- a_lang[Language], # each lang its own intercept
    a_lang[Language] ~ normal(0, 3)
  ),
  data = f, cores = 4, chains = 4, cmdstan = TRUE, log_lik = TRUE
)
We’ve conducted prior predictive checks and checked diagnostics for the above model. Now it might be interesting to do a posterior predictive check, i.e., how well our model fits the data.
postcheck(m1, window = 81)
The vertical axis shows the outcome, for each case in our dataset along the horizontal axis. The blue dots are the empirical data. The circles are the posterior means. We see that the model fits terribly in many cases (i.e., many of the blue dots are far away from the circles). Our priors are sane, we do know that, but given this large variation the \(\mathsf{Poisson}\) might not be the right likelihood after all\(\ldots\)
Remember, the \(\mathsf{Poisson}\) assumes that the mean and the variance are approximately equal. Let’s check if this holds,
var(f$Effort)
## [1] 19525504
mean(f$Effort)
## [1] 5046.309
The variance is vastly larger than the mean, which clearly indicates that the \(\mathsf{Poisson}\) should not be used. We need to fall back on a mixture distribution (a \(\mathsf{Gamma}\) mixture of \(\mathsf{Poisson}\)s), often called the \(\mathsf{Negative}\)-\(\mathsf{Binomial}\), where we will be able to model the variance separately.
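To make the “model the variance separately” part concrete, here is the mean-variance relationship of the gamma-Poisson, under the assumption that dgampois maps to the mean-and-dispersion (neg_binomial_2) parameterization:
\[
\mathrm{E}[y] = \lambda, \qquad \mathrm{Var}[y] = \lambda + \frac{\lambda^2}{\phi},
\]
so the variance can greatly exceed the mean, and a smaller \(\phi\) allows more overdispersion (as \(\phi \rightarrow \infty\) we recover the plain \(\mathsf{Poisson}\)).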
m2 <- ulam(
  alist(
    Effort ~ dgampois(lambda, phi), # phi to model variance
    log(lambda) <- a_lang[Language],
    a_lang[Language] ~ normal(0, 3),
    phi ~ exponential(1) # prior on the variance component phi
  ),
  data = f, cores = 4, chains = 4, cmdstan = TRUE, log_lik = TRUE,
  control = list(adapt_delta = 0.95)
)
Let’s do a posterior predictive check again,
postcheck(m2, window=81)
The blue dots are the empirical data. The circles are the posterior means, with 90% intervals, and the ‘+’ symbols mark the 90% prediction interval. In some cases we see clear outliers (e.g., Cases 44 and 81), which fall outside the 90% prediction interval. This is not strange; after all, if we wanted a perfect model for our empirical data, why not simply use the data as-is? Well, the answer to that question is: we would inadvertently overfit, i.e., learn too much from the empirical data, which would lead to a model that breaks down when facing new data.
We now have a set of models \(\mathbf{M}\), which we can compare using PSIS-LOO. The comparison is relative and will not tell us whether we’ve found an optimal, ‘true’, model. It will simply rank the models according to their relative out-of-sample predictive capabilities.
(loo_est <- compare(m0, m1, m2, func=LOO))
##          PSIS          SE    dPSIS      dSE       pPSIS weight
## m2   1523.665    20.93278      0.0       NA    4.796302      1
## m1 219746.538 42463.13037 218222.9 42445.74 3269.957919      0
## m0 253159.747 48763.38642 251636.1 48748.82 2147.711144      0
What we see here, not surprisingly, is that \(\mathcal{M}_2\) is considered the best model. But how much better is it than the model that comes in second place, i.e., \(\mathcal{M}_1\)?
We can actually calculate a confidence interval for that difference if we want to. Using a \(z\)-score of 1.96 (i.e., 95%), we can combine the difference in PSIS (dPSIS) with the standard error of that difference (dSE) to calculate the interval,
loo_est[2,3] + c(-1,1) * loo_est[2,4] * 1.96
## [1] 135029.2 301416.5
which clearly does not cross zero. In short, modeling the variance separately using the \(\mathsf{Negative}\)-\(\mathsf{Binomial}\) paid off.
If we plot our estimates we can better appreciate the differences between the languages visually.
plot(precis(m2, depth = 2, pars = "a_lang", prob = 0.95))
Circles indicate the estimated posterior mean, and the bars are 95% posterior intervals.
We see that one language (Language 3) is clearly lower than the others. In short, using Language 3 means that a project requires less effort. Since Language 3’s interval clearly does not “touch” any of the other languages’ intervals, one could claim that the difference is significant. But then there are many other things that could prove this to be false…
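To go beyond eyeballing the intervals, one could compute the contrast between two languages directly from the posterior samples, on the outcome scale. A minimal sketch (the choice of comparing Language 1 with Language 3 is ours, not something reported above):
post <- extract.samples(m2)  # a_lang is a matrix with one column per language
contrast_1_3 <- exp(post$a_lang[, 1]) - exp(post$a_lang[, 3])  # difference in expected hours
quantile(contrast_1_3, probs = c(0.025, 0.5, 0.975))           # median and 95% interval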
For the sampling we refrain from using rstan and instead use cmdstan through the R package rethinking. Generally speaking, the community now prefers cmdstan since it is updated more frequently.
Install cmdstanr and cmdstan with,
CORES <- 4 # set to the number of available CPU cores
remotes::install_github("stan-dev/cmdstanr")
cmdstanr::install_cmdstan(cores = CORES)
# you can now run rethinking with cmdstan instead of rstan
For this execution we’ve used,
cmdstanr::cmdstan_version()
## [1] "2.26.1"
print(sessionInfo(), locale=FALSE)
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin20.3.0 (64-bit)
## Running under: macOS Big Sur 11.2.3
##
## Matrix products: default
## BLAS: /usr/local/Cellar/openblas/0.3.13/lib/libopenblasp-r0.3.13.dylib
## LAPACK: /usr/local/Cellar/r/4.0.5/lib/R/lib/libRlapack.dylib
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] digest_0.6.27 cmdstanr_0.4.0.9000 here_1.0.1
## [4] foreign_0.8-81 rethinking_2.13 rstan_2.26.1
## [7] StanHeaders_2.26.1
##
## loaded via a namespace (and not attached):
## [1] shape_1.4.5 tidyselect_1.1.0 xfun_0.22 bslib_0.2.4
## [5] purrr_0.3.4 lattice_0.20-41 V8_3.4.0 colorspace_2.0-0
## [9] vctrs_0.3.7 generics_0.1.0 htmltools_0.5.1.1 stats4_4.0.5
## [13] loo_2.4.1.9000 yaml_2.2.1 utf8_1.2.1 rlang_0.4.10
## [17] pkgbuild_1.2.0 jquerylib_0.1.3 pillar_1.6.0 glue_1.4.2
## [21] DBI_1.1.1 matrixStats_0.58.0 lifecycle_1.0.0 stringr_1.4.0
## [25] munsell_0.5.0 gtable_0.3.0 mvtnorm_1.1-1 coda_0.19-4
## [29] codetools_0.2-18 evaluate_0.14 inline_0.3.17 knitr_1.32
## [33] callr_3.6.0 ps_1.6.0 curl_4.3 fansi_0.4.2
## [37] Rcpp_1.0.6 scales_1.1.1 RcppParallel_5.1.2 jsonlite_1.7.2
## [41] gridExtra_2.3 ggplot2_3.3.3 stringi_1.5.3 processx_3.5.1
## [45] dplyr_1.0.5 rprojroot_2.0.2 grid_4.0.5 cli_2.4.0
## [49] tools_4.0.5 magrittr_2.0.1 sass_0.3.1 tibble_3.1.1
## [53] crayon_1.4.1 pkgconfig_2.0.3 MASS_7.3-53.1 ellipsis_0.3.1
## [57] prettyunits_1.1.1 assertthat_0.2.1 rmarkdown_2.7 R6_2.5.0
## [61] compiler_4.0.5
http://promise.site.uottawa.ca/SERepository/datasets/desharnais.arff↩︎
For the \(\mathsf{Poisson}(\lambda)\), the parameter \(\lambda\) is both the mean and the variance of the distribution.↩︎