1 Introduction

1.1 Synthetic data

In this replication package we have used empirical data that is under an NDA. However, one can use the synthetic data to evaluate our approach and reach essentially the same conclusions.

Here we show how to load the synthetic dataset (from the data/ directory in the GitHub repository), which can then be used in the analysis.

d <- readRDS("data/data.rds")

The above line is currently not executed, since we continue to run this analysis with the empirical data. However, by executing the line above instead of the one below, the analysis can be run as-is with the synthetic data.

1.2 Data cleaning

First we prepare the data, check for NAs, and look at some descriptive statistics. Since the data comes from Excel, we should be careful when loading it and check that nothing has gone wrong (data manipulated in arbitrary ways, incompatible data types, etc.).

# Execute the previous statement (synthetic data) instead of this one if you
# want to re-run the analysis without access to the empirical data
d <- read.xlsx("data/Features.xlsx", sheet = "Features")

This is what the empirical data looks like: in short, 11110 rows and 10 variables.1

str(d)
## 'data.frame':    11110 obs. of  10 variables:
##  $ ID                    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ State                 : chr  "Elicited, Prio, Dropped" "Elicited, Prio, Dropped" "Elicited, Prio, Planned, Implemented, Tested, Released" "Elicited, Prio, Dropped" ...
##  $ Team.priority         : num  88 83 957 79 88 76 74 73 72 71 ...
##  $ Critical.feature      : chr  "No" "No" "No" "No" ...
##  $ Business.value        : chr  "No value" "No value" "No value" "No value" ...
##  $ Customer.value        : chr  "No value" "No value" "No value" "No value" ...
##  $ Stakeholders          : num  1 1 0 0 0 1 1 1 1 1 ...
##  $ Key.customers         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dependency            : chr  "No" "No" "No" "No" ...
##  $ Architects.involvement: chr  "None" "None" "None" "None" ...

Each row (requirement) has a unique ID. The State, which is our outcome (dependent variable), shows how far the feature survived in the requirements process. It is of an ordered categorical type, which should be modeled with some type of ordinal likelihood (e.g., cumulative). There are six categories:

  1. “Elicited, Dropped”
  2. “Elicited, Prio, Dropped”
  3. “Elicited, Prio, Planned, Dropped”
  4. “Elicited, Prio, Planned, Implemented, Dropped”
  5. “Elicited, Prio, Planned, Implemented, Tested, Dropped”
  6. “Elicited, Prio, Planned, Implemented, Tested, Released”

Team.priority is the relative priority the feature received, an integer in \(\{0,\ldots,1000\}\). Critical.feature is a ‘Yes’/‘No’ answer (\(\mathbb{Z}_2\)). Business.value and Customer.value are also ordered categorical, with four levels each. The level ‘No value’ is an explicit choice, so it does not imply missingness:

  1. “No value” (explicit choice)
  2. “Valuable”
  3. “Important”
  4. “Critical”

Stakeholders takes integer values in \(\{0,\ldots,10\}\), and Key.customers likewise, but over a larger range, \(\{0,\ldots,60\}\).

Finally, Dependency is \(\mathbb{Z}_2\), while Architects.involvement is ordered categorical:

  1. “None” (explicit choice)
  2. “Simple”
  3. “Monitoring”
  4. “Active Participation”
  5. “Joint Design”

All ordered categorical predictors can be modeled as monotonic or category-specific effects, if necessary (Bürkner and Charpentier, 2020).
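For illustration only (these formulas are not fitted here; the full models \(\mathcal{M}_2\) and \(\mathcal{M}_5\) appear later), monotonic and category-specific terms are written in brms as follows:

# sketch: mo() declares a monotonic effect, cs() a category-specific effect
# (category-specific effects require a family such as sratio or acat)
f_mo <- bf(State ~ 1 + mo(Business.value), family = cumulative)
f_cs <- bf(State ~ 1 + cs(Business.value), family = sratio)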

Check for NAs,

table(is.na(d))
## 
##  FALSE 
## 111100

No NAs in the dataset. However, that does not necessarily mean there is no missingness: some codings could be stand-ins for NA, e.g., ‘No value.’ In this particular case we know that ‘No value’ and ‘None’ are genuine values in the dataset and not representations of NA.
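A minimal sketch for double-checking this, counting the explicit codings per column (output not shown):

# per-column NA counts plus the frequency of the explicit 'no' codings
colSums(is.na(d))
sum(d$Business.value == "No value")
sum(d$Customer.value == "No value")
sum(d$Architects.involvement == "None")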

Finally, we set correct data types.

# ordered categorical
d$State <- factor(d$State, 
                  levels = c("Elicited, Dropped", 
                             "Elicited, Prio, Dropped", 
                             "Elicited, Prio, Planned, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Tested, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Tested, Released"), 
                  ordered = TRUE)

# the remaining ordered categorical predictors (stored as ordered factors)
d$Business.value <- factor(d$Business.value, 
                           levels = c("No value",
                                      "Valuable",
                                      "Important",
                                      "Critical"), 
                           ordered = TRUE)


d$Customer.value <- factor(d$Customer.value, 
                           levels = c("No value",
                                      "Valuable",
                                      "Important",
                                      "Critical"), 
                           ordered = TRUE)

d$Architects.involvement <- factor(d$Architects.involvement,
                                   levels = c("None",
                                              "Simple",
                                              "Monitoring",
                                              "Active Participation",
                                              "Joint Design"), 
                                   ordered = TRUE)

1.3 Descriptive statistics

First, we see that for the outcome State approximately as many features are released (the final category) as are dropped in the first category. We also see that the counts drop off after the initial category.

For Team.priority many features have zero priority (\(5139\)), and a fair number (\(1516\)) have the priority set to the maximum value, i.e., \(1000\).

For Critical.feature we have a clear emphasis on ‘No.’

Business.value and Customer.value have fairly similar distributions (as one would perhaps expect).

For Stakeholders and Key.customers we see a strong emphasis on lower numbers, while for Dependency there is a clear emphasis on ‘No.’

Finally, for Architects.involvement one can see that in the vast majority of cases architects are not involved.

In short, it looks roughly like what one would expect, i.e., it is not hard to explain why the plots look the way they do.
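The plots themselves are rendered elsewhere in the replication package; a minimal sketch reproducing the tabular summaries discussed above (output not shown):

# outcome and predictor distributions behind the description above
table(d$State)
sum(d$Team.priority == 0)     # features with zero priority
sum(d$Team.priority == 1000)  # features at the maximum priority
table(d$Critical.feature)
table(d$Business.value)
table(d$Customer.value)
table(d$Dependency)
table(d$Architects.involvement)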

However, before we continue, we should standardize some of our predictors so that sampling becomes easier, i.e., we compute \((x - \bar{x})/\sigma_x\); multiplying by \(\sigma_x\) and adding \(\bar{x}\) later takes us back to the original scale (a sanity check follows the code below).

# standardize, abbreviate names, and change types where needed
d$prio_s <- scale(d$Team.priority)
d$sh_s <- scale(d$Stakeholders)
d$kc_s <- scale(d$Key.customers)

d$b_val <- as.integer(d$Business.value) # use int as input
d$c_val <- as.integer(d$Customer.value)
d$arch <- as.integer(d$Architects.involvement)

# Dichotomous predictors. We can set these to 1/0, but generally speaking
# one should know what one is doing and be careful doing this!
d$Critical.feature <- ifelse(d$Critical.feature == 'Yes', 1, 0)
d$Dependency <- ifelse(d$Dependency == 'Yes', 1, 0)

# only abbreviate names
d$crit <- d$Critical.feature
d$dep <- d$Dependency
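As a sanity check (a sketch, assuming scale() kept its attributes when the column was stored in the data frame), we can recover the original Team.priority values from prio_s:

# back-transform: multiply by the stored sd and add the stored mean
prio_back <- d$prio_s * attr(d$prio_s, "scaled:scale") + attr(d$prio_s, "scaled:center")
all.equal(as.numeric(prio_back), d$Team.priority)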

2 Model design

2.1 \(\mathcal{M}_0\)

Since our outcome is ordered categorical we have many options at hand, some of which barely existed in Bayesian modeling a few decades ago. Our outcome resembles a survival setting, i.e., a feature needs to survive each stage in order to reach the next. Taking this into account, we assume that the following types of models could be options (Bürkner and Vuorre, 2019):

  • Cumulative (single underlying continuous variable) (Samejima, 1997). A model with no predictors (\(\mathcal{M}_0\)), with predictors (\(\mathcal{M}_1\)), and with monotonic predictors (\(\mathcal{M}_2\)).
  • Adjacent-category (mathematically convenient) (Agresti, 2010). A model with predictors (\(\mathcal{M}_3\)).
  • Sequential model (higher response category is possible only after all lower categories are achieved) (Tutz, 1990). A model with predictors (\(\mathcal{M}_4\)) and with category-specific predictors (\(\mathcal{M}_5\)).

The reason we want to use monotonic or category-specific modeling of predictors is that predictor categories will not be assumed to be equidistant with respect to their effect on the outcome (Bürkner and Charpentier, 2020).

The Cumulative family is very common, so let us start with a null model, \(\mathcal{M}_0\), which uses this likelihood with no predictors. Later we will compare all our models to \(\mathcal{M}_0\) to ensure that adding predictors improves out-of-sample predictions, i.e., if we cannot improve by adding predictors we might as well spend our time elsewhere.

The outcome of the model design will be a set \(M = \{\mathcal{M}_0,\ldots,\mathcal{M}_5\}\).

2.1.1 Prior predictive checks

Let us see what priors such a null model needs.

(p <- get_prior(State ~ 1, 
               family = cumulative, 
               data = d))
##                 prior     class coef group resp dpar nlpar bound       source
##  student_t(3, 0, 2.5) Intercept                                       default
##  student_t(3, 0, 2.5) Intercept    1                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    2                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    3                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    4                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    5                             (vectorized)
# Set a wide Normal(0,2) on the intercepts (the cutpoints of our 
# 6-level scale, i.e., 5 cutpoints)
p$prior[1] <- "normal(0,2)"

Sample only from the priors.

# simplest cumulative model we can think of
M0 <- brm(State ~ 1, 
          family = cumulative, 
          data = d, 
          # threads = threading(4), # use if 16 CPUs
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0) # avoid printing sampling progress
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.

Plot the priors \(y_{\mathrm{rep}}\) vs. the empirical data \(y\).

pp_check(M0, type = "bars", nsamples = 250)

Evidently the medians are fairly evenly distributed across the categories \(1,\ldots,6\) (the bars), and the uncertainty is spread fairly uniformly among them. In short, this is what we like to see.

2.1.2 Sample with data

M0 <- brm(State ~ 1, 
          family = cumulative, 
          data = d, 
          # threads = threading(4),
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 135.1 seconds.
## Chain 3 finished in 135.2 seconds.
## Chain 2 finished in 137.6 seconds.
## Chain 4 finished in 145.3 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 138.3 seconds.
## Total execution time: 145.6 seconds.

2.1.3 Diagnostics

Our caterpillar plots look good for all estimated parameters (i.e., they look like fat caterpillars when the chains have mixed well).

Diagnostics such as divergences, tree depth, energy, \(\widehat{R}\), and \(\mathrm{ESS}\) all look good.

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M0$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M0), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M0), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

When using dynamic Hamiltonian Monte Carlo we have a plethora of diagnostics, which we should utilize to ensure validity and efficiency when sampling. Here follows a summary of the most common diagnostics.2

There should be no divergences, since a divergence indicates that the posterior estimates may be biased; divergences arise when the posterior landscape is hard for the sampler to explore (a validity concern).3 A reparameterization of the model might be necessary if divergences remain.4
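Should divergences appear, a common and less invasive first step (before reparameterizing) is to sample more carefully by raising adapt_delta, as the brm() calls above already do to some extent. A minimal sketch (the object name M0_refit is ours):

# refit with a stricter adapt_delta; slower sampling, but fewer divergences
M0_refit <- update(M0, control = list(adapt_delta = 0.99))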

Tree depth warnings are not a validity concern but rather an efficiency concern. Reaching the maximum tree depth indicates that the sampler is terminating prematurely to avoid long execution times (Hoffman and Gelman, 2014).

Having low energy values (E-BFMI) is an indication of a biased posterior (validity concern) (Betancourt, 2016).

The \(\widehat{R}\) convergence diagnostics indicates if the independent chains converged, i.e. explored the posterior in approximately the same way (validity concern) (Vehtari et al., 2021).

The effective sample size (\(\mathrm{ESS}\)) captures how many independent draws would carry the same amount of information as the dependent draws obtained by the MCMC algorithm. The higher the better. When the ratio of effective to total sample size approaches \(0.1\) we should start worrying, and in absolute numbers the ESS should be in the hundreds for the central limit theorem to hold (Betancourt, 2018).

Additionally, the Monte Carlo standard error (MCSE) has been checked for all models. The MCSE is yet another criterion reflecting the effective accuracy of a Markov chain, obtained by dividing the standard deviation of the chain by the square root of its effective sample size (Betancourt, 2018),

\[\begin{equation} \mathrm{MCSE} = \frac{\mathrm{SD}}{\sqrt{\mathrm{ESS}}} \end{equation}\]
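As a minimal sketch (assuming \(\mathcal{M}_0\) has been fitted with data), the MCSE can be computed from the rstan summary of the underlying Stan fit, where it is also reported directly as se_mean (output not shown):

# MCSE by hand: posterior SD divided by the square root of the ESS;
# this should agree with rstan's se_mean column up to rounding
s <- rstan::summary(M0$fit)$summary
head(cbind(se_mean = s[, "se_mean"], manual = s[, "sd"] / sqrt(s[, "n_eff"])))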

2.1.4 Posterior predictive check

Let’s look at a posterior predictive plot to see how well the model has estimated our \(6\) levels we have in our outcome.

pp_check(M0, type = "bars", nsamples = 250)

In short, very little uncertainty (i.e., the medians are quite well estimated). Let’s leave \(\mathcal{M}_0\) for now and add predictors to the next model, \(\mathcal{M}_1\), described in Section 2.2.

2.2 \(\mathcal{M}_1\)

Here we’ll design a Cumulative model with predictors.

For this and the coming models we won’t report on all the steps since it will take up too much space. However, rest assured, we have conducted all the steps, just as we did for \(\mathcal{M}_0\).

2.2.1 Prior predictive checks

Let’s set sensible priors that are approximately uniform on the outcome space.

p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
               family = cumulative, 
               data = d)

# Set N(0,1) on the betas and N(0,2) on the cutpoints
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"

Sample only from the priors.

M1 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.

Plot the priors \(y_{\mathrm{rep}}\) vs. the empirical data \(y\).

pp_check(M1, type = "bars", nsamples = 250)

2.2.2 Sample with data

M1 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 3 finished in 192.6 seconds.
## Chain 1 finished in 192.9 seconds.
## Chain 2 finished in 193.2 seconds.
## Chain 4 finished in 199.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 194.6 seconds.
## Total execution time: 199.9 seconds.

2.2.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M1$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M1), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M1), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.2.4 Posterior predictive check

pp_check(M1, type = "bars", nsamples = 250)

Slight overestimation in the third category and slight underestimations in the first two categories. Shouldn’t be a problem but worth noting. The next model is \(\mathcal{M}_2\).

2.3 \(\mathcal{M}_2\)

A Cumulative model with monotonic predictors.

2.3.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + 
                 kc_s + dep + mo(arch),
               family = cumulative, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
p$prior[16:18] <- "dirichlet(2)" # prior on the simplexes of the monotonic effects
M2 <- brm(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + kc_s + 
            dep + mo(arch),
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 4 finished in 0.2 seconds.
## Chain 1 finished in 0.3 seconds.
## Chain 2 finished in 0.3 seconds.
## Chain 3 finished in 0.3 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.3 seconds.
## Total execution time: 0.6 seconds.
pp_check(M2, type = "bars", nsamples = 250)

2.3.2 Sample with data

# simplest cumulative model we can think of
M2 <- brm(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + kc_s + 
            dep + mo(arch),
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 4 finished in 651.3 seconds.
## Chain 1 finished in 663.5 seconds.
## Chain 3 finished in 716.0 seconds.
## Chain 2 finished in 721.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 688.1 seconds.
## Total execution time: 721.9 seconds.

2.3.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M2$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M2), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M2), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.3.4 Posterior predictive check

pp_check(M2, type = "bars", nsamples = 250)

Not much to say; let’s instead turn to the next model, \(\mathcal{M}_3\).

2.4 \(\mathcal{M}_3\)

An Adjacent-category model with predictors.

2.4.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + 
                 dep + arch,
               family = acat, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M3 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = acat, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.4 seconds.
pp_check(M3, type = "bars", nsamples = 250)

2.4.2 Sample with data

M3 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = acat, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 547.7 seconds.
## Chain 4 finished in 558.9 seconds.
## Chain 3 finished in 569.6 seconds.
## Chain 2 finished in 581.4 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 564.4 seconds.
## Total execution time: 582.0 seconds.

2.4.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M3$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M3), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M3), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.4.4 Posterior predictive check

pp_check(M3, type = "bars", nsamples = 250)

Let’s move to the next model \(\mathcal{M}_4\).

2.5 \(\mathcal{M}_4\)

A Sequential model with predictors.

2.5.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + 
                 dep + arch,
               family = sratio, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M4 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.
pp_check(M4, type = "bars", nsamples = 250)

Here we see a large difference. The sratio family’s prior predictive distribution exhibits a decay over the categories. We’ll see if the data overcome this prior (they should, given that we have \(n=11110\)).

2.5.2 Sample with data

M4 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 158.2 seconds.
## Chain 3 finished in 161.4 seconds.
## Chain 2 finished in 161.9 seconds.
## Chain 4 finished in 183.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 166.2 seconds.
## Total execution time: 183.6 seconds.

2.5.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M4$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M4), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M4), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.5.4 Posterior predictive check

pp_check(M4, type = "bars", nsamples = 250)

Finally, let’s take a look at the final (for now) model \(\mathcal{M}_5\).

2.6 \(\mathcal{M}_5\)

A Sequential model with category-specific predictors.

2.6.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s + 
                 kc_s + dep + cs(arch),
               family = sratio, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M5 <- brm(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s + 
            kc_s + dep + cs(arch),
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.4 seconds.
pp_check(M5, type = "bars", nsamples = 250)

2.6.2 Sample with data

M5 <- brm(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s + 
            kc_s + dep + cs(arch),
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 2 finished in 635.8 seconds.
## Chain 3 finished in 636.5 seconds.
## Chain 4 finished in 639.1 seconds.
## Chain 1 finished in 645.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 639.3 seconds.
## Total execution time: 646.5 seconds.

2.6.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M5$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M5), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M5), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.6.4 Posterior predictive check

pp_check(M5, type = "bars", nsamples = 250)

3 Model comparison

Once we’ve sampled our set of models \(M\) we use LOO to compare the models’ relative out-of-sample prediction capabilities (Vehtari et al., 2017).

(l <- loo_compare(loo(M0), loo(M1), loo(M2), loo(M3), loo(M4), loo(M5)))
##    elpd_diff se_diff
## M5     0.0       0.0
## M4   -43.6      10.8
## M3   -61.4      15.4
## M2  -148.1      20.3
## M1  -148.6      20.2
## M0 -3318.0      63.2

LOO ranks \(\mathcal{M}_5\) first. Using a \(z_{\mathrm{99\%}}\)-score of \(2.58\) on the difference to \(\mathcal{M}_4\), zero is not in the interval, so \(\mathcal{M}_5\) has an advantage, i.e., \(\mathrm{CI}_{z_{99\%}}\) [-71.43, -15.84].
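A minimal sketch reproducing that interval (up to rounding) from the loo_compare() output above:

# 99% interval for the elpd difference of M4 relative to M5
l["M4", "elpd_diff"] + c(-1, 1) * 2.58 * l["M4", "se_diff"]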

By looking at the expected log pointwise predictive density (elpd), one can see that adding predictors in \(\mathcal{M}_1\) clearly has a significant effect (compared to \(\mathcal{M}_0\)). However, adding monotonic effects, \(\mathcal{M}_2\), brought very little added benefit (i.e., \(\mathcal{M}_1\) vs. \(\mathcal{M}_2\)), while adding category-specific effects, \(\mathcal{M}_5\), clearly improved out-of-sample predictions (i.e., \(\mathcal{M}_5\) vs. \(\mathcal{M}_4\)).

Also worth noting is that the Cumulative models are, relatively speaking, performing the worst (i.e., \(\mathcal{M}_1\) and \(\mathcal{M}_2\)). The Sequential models take the first two spots, while the Adjacent-category model comes third.

If we were interested in refining our models purely for optimal out-of-sample prediction we could conduct variable selection (Piironen et al., 2020). However, in this particular case we are interested in each predictor’s effect, so we keep them all and settle on \(\mathcal{M}_5\) as our target model, \(\mathcal{M}\), for now.

M <- M5

4 Estimates and effects

4.1 Parameter estimates

To make sense of the parameter estimates we need to transform them, since the values are on the \(\mathop{\mathrm{logit}}\) scale.

First, let’s compare the model where we did not use category-specific effects (\(\mathcal{M}_4\)) with the model that did use them (\(\mathcal{M}\)). The first table below shows \(\mathcal{M}_4\), while the second shows the ‘final’ model we’ve settled on, designated \(\mathcal{M}\).

\(\mathcal{M}_4\)

Parameter   Estimate  Est.Error   Q2.5  Q97.5
prio_s          1.22       0.02   1.18   1.26
crit            0.60       0.05   0.50   0.70
b_val           0.08       0.02   0.04   0.12
c_val           0.05       0.02   0.00   0.10
sh_s           -0.05       0.02  -0.08  -0.02
kc_s            0.01       0.02  -0.02   0.04
dep             0.09       0.04   0.00   0.17
arch            0.03       0.03  -0.02   0.08

\(\mathcal{M}\) (i.e., \(\mathcal{M}_5\))

Parameter   Estimate  Est.Error   Q2.5  Q97.5
prio_s          1.22       0.02   1.18   1.26
crit            0.62       0.05   0.52   0.71
sh_s           -0.05       0.02  -0.08  -0.02
kc_s            0.01       0.02  -0.02   0.04
dep             0.09       0.04   0.01   0.18
b_val[1]        0.19       0.03   0.13   0.25
b_val[2]       -0.04       0.03  -0.11   0.02
b_val[3]        0.15       0.04   0.06   0.23
b_val[4]       -0.18       0.06  -0.30  -0.06
b_val[5]       -0.09       0.14  -0.35   0.20
c_val[1]        0.00       0.04  -0.07   0.08
c_val[2]        0.05       0.04  -0.04   0.13
c_val[3]        0.13       0.06   0.02   0.24
c_val[4]        0.19       0.08   0.03   0.34
c_val[5]       -0.01       0.16  -0.33   0.30
arch[1]         0.13       0.05   0.03   0.22
arch[2]         0.09       0.05   0.00   0.18
arch[3]         0.03       0.05  -0.06   0.13
arch[4]        -0.23       0.06  -0.34  -0.11
arch[5]        -0.28       0.12  -0.51  -0.03

There are two things to note. First, because we model category-specific effects (second table), we obtain several estimates of interest for arch, which is not even a significant parameter in \(\mathcal{M}_4\), where category-specific effects are lacking. Second, for all category-specific effects we get a much more fine-grained view of precisely which categories of each predictor make a difference.

To start with, let’s focus on the second table, consisting of estimates from \(\mathcal{M}\), and turn our attention to the parameters that are not category-specific. The prio_s estimate, 1.22, becomes 0.77 when transformed with the inverse logit, i.e.,

\[\begin{equation} \frac{\exp(1.22)}{\exp(1.22)+1} = 0.77 \end{equation}\]

But remember the suffix _s! We need to multiply by \(\sigma_x\) and add \(\bar{x}\) from the data, which leads to an estimate of 689.74, \(\mathrm{CI}_{95\%}\) [686.78, 692.61].
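A minimal sketch of this back-transformation (assuming the fitted model object M and the data d are in the workspace; output not shown):

# inverse-logit the prio_s estimate, then rescale with the sd and mean of the
# original Team.priority column (which was standardized with scale())
b <- fixef(M)["prio_s", "Estimate"]              # approx. 1.22 on the logit scale
p <- exp(b) / (exp(b) + 1)                       # approx. 0.77 after inverse logit
p * sd(d$Team.priority) + mean(d$Team.priority)  # back on the 0-1000 scale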

As already discussed, the more exotic parameters are our category-specific effects b_val, c_val, and arch. To our knowledge such effects have never before been used in an analysis in software engineering or computer science. Each ordinal predictor has five category-specific effects estimated (e.g., rows \(6\)–\(10\) in the second table above). These estimates indicate to what degree b_val affected the six outcomes in State (remember, our outcome consists of six ordered categorical levels, with five borders, or cutpoints, between the levels).

If we continue taking b_val as an example, the \(95\)% credible interval does not cross \(0\) for the \(1\)st (positive), \(3\)rd (positive), and \(4\)th (negative) parameter (see the table above). This means that b_val affects the borders (cutpoints) positively between the \(1\)st and \(2\)nd categories and between the \(3\)rd and \(4\)th categories, while the opposite is true for the negative effect on the border between the \(4\)th and \(5\)th categories. Remember, if a border is affected positively it means that the border was shifted upwards and, hence, more probability mass was moved to lower levels! But what does this mean in practice?

If you recall, our outcome State had six categories,

levels(d$State)
## [1] "Elicited, Dropped"                                     
## [2] "Elicited, Prio, Dropped"                               
## [3] "Elicited, Prio, Planned, Dropped"                      
## [4] "Elicited, Prio, Planned, Implemented, Dropped"         
## [5] "Elicited, Prio, Planned, Implemented, Tested, Dropped" 
## [6] "Elicited, Prio, Planned, Implemented, Tested, Released"

So we can claim, when analyzing these estimates, that b_val moves probability mass to lower levels by shifting the border up between,

levels(d$State)[1:2]
## [1] "Elicited, Dropped"       "Elicited, Prio, Dropped"

and,

levels(d$State)[3:4]
## [1] "Elicited, Prio, Planned, Dropped"             
## [2] "Elicited, Prio, Planned, Implemented, Dropped"

while it shifts the border down between,

levels(d$State)[4:5]
## [1] "Elicited, Prio, Planned, Implemented, Dropped"        
## [2] "Elicited, Prio, Planned, Implemented, Tested, Dropped"

Next, let’s plot the posterior probability densities for our population-level estimates on the \(\mathop{\mathrm{logit}}\) scale.
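The density plot itself is rendered in the full document; a minimal sketch of how such a plot can be produced with brms’s mcmc_plot wrapper around bayesplot (output not shown):

# posterior densities with 95% intervals for the population-level parameters
mcmc_plot(M, pars = "^b_", type = "areas", prob = 0.95)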

Examining the above plot, from bottom to top, we can say that at the \(95\)% level the first three parameters are clearly positive or negative and do not cross \(0\). The fourth parameter, Key customers, is not significant, with a \(95\)% credible interval of [-0.02, 0.04]. The fifth parameter, Dependencies, is significant (positive), [0.01, 0.18]. For Business value, Customer value, and Architects' involvement we see that some of the category-specific parameters are significant.

To conclude what we’ve noticed so far: Priority, Critical feature, Stakeholders, and Dependencies are significant on \(\mathrm{CI}_{95\%}\), while for Business value, Customer value, and Architects' involvement some categories are significant. Key customers is not significant.

4.2 Conditional effects

Below we plot conditional effects for some of our estimates. The colors represent the different ordered categories, \(1,\ldots,6\), for our outcome State. We are particularly interested in Category \(6\), i.e., the final category which indicates a released feature.

  1. Elicited, Dropped
  2. Elicited, Prio, Dropped
  3. Elicited, Prio, Planned, Dropped
  4. Elicited, Prio, Planned, Implemented, Dropped
  5. Elicited, Prio, Planned, Implemented, Tested, Dropped
  6. Elicited, Prio, Planned, Implemented, Tested, Released
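The plots themselves are rendered in the full document; a minimal sketch of how they can be produced with brms (assuming the fitted model M; output not shown):

# conditional effects with categorical = TRUE, so that the probability of each
# State category is drawn as its own curve (one plot per predictor)
plot(conditional_effects(M, effects = "prio_s", categorical = TRUE))
plot(conditional_effects(M, effects = "crit", categorical = TRUE))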

One important question we would like answered is which independent variable(s) contribute(s) most to a feature ultimately being released, i.e., is it priority, criticality, business or customer value, number of stakeholders, number of key customers, having dependencies, and/or the level of architect involvement? The above plots answer this question, without our even having to conduct any statistical tests or examine \(p\)-values. Let’s start by analyzing the predictor Priority.

Concerning Priority we see that it has a very large effect for State \(6\) (i.e., a feature being released). The higher the priority (the further to the right), the more probability mass is placed on State \(6\). At the highest priorities it holds close to \(70\)% of the probability mass, while the other states are not even close. Also worth noting is how, for State \(4\) (the hump), medium priorities seem to be the recipe for reaching that stage.

For the predictor Critical we see some of the same effects, albeit with increased uncertainty. The clearest effect is visible for State \(6\), i.e., going from No to Yes significantly increases the probability, while the opposite holds for States \(1\)–\(3\) (logically so, since if a requirement is critical it should be released, i.e., reach State \(6\), with a higher probability).

Concerning Number of stakeholders, we see that virtually all states (except State \(1\)) have a lower probability with an increasing number of stakeholders (and more uncertainty is visible). For State \(1\), however, an increase in stakeholders leads to an increase in probability. One could claim this is natural, since having stakeholders would lead to the requirement being considered in the first place.

For Dependency not much changes when it moves from No to Yes.

We saw previously that Key customers was not a significant parameter. If we plot it we will see why, i.e., the uncertainty is too large.

5 References

Agresti, A., 2010. Analysis of ordinal categorical data. John Wiley & Sons.
Betancourt, M., 2018. A conceptual introduction to Hamiltonian Monte Carlo.
Betancourt, M., 2016. Diagnosing suboptimal cotangent disintegrations in Hamiltonian Monte Carlo.
Bürkner, P.-C., Charpentier, E., 2020. Modelling monotonic effects of ordinal predictors in Bayesian regression models. British Journal of Mathematical and Statistical Psychology 73, 420–451. https://doi.org/10.1111/bmsp.12195
Bürkner, P.-C., Vuorre, M., 2019. Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science 2, 77–101. https://doi.org/10.1177/2515245918823199
Hoffman, M.D., Gelman, A., 2014. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15, 1593–1623.
Piironen, J., Paasiniemi, M., Vehtari, A., 2020. Projective inference in high-dimensional problems: Prediction and feature selection. Electronic Journal of Statistics 14, 2155–2197. https://doi.org/10.1214/20-EJS1711
Samejima, F., 1997. Graded response model, in: Handbook of Modern Item Response Theory. Springer, pp. 85–100.
Tutz, G., 1990. Sequential item response models with an ordered response. British Journal of Mathematical and Statistical Psychology 43, 39–55. https://doi.org/10.1111/j.2044-8317.1990.tb00925.x
Vehtari, A., Gelman, A., Gabry, J., 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27, 1413–1432. https://doi.org/10.1007/s11222-016-9696-4
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.-C., 2021. Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC. Bayesian Analysis 1–28. https://doi.org/10.1214/20-BA1221

6 Appendix

6.1 CmdStan

For the sampling we refrain from using rstan and instead use cmdstan through the R package cmdstanr. Generally speaking, the community now prefers cmdstan, since it is updated more frequently.

Install cmdstanr and cmdstan by,

CORES = 4 # set to the number of available CPU cores
remotes::install_github("stan-dev/cmdstanr")
cmdstanr::install_cmdstan(cores = CORES)

# once you have brms installed you can now run brms with cmdstan instead of rstan
options(brms.backend="cmdstanr")
options(mc.cores = CORES) # set num cores
library(brms)

For this execution we’ve used,

cmdstanr::cmdstan_version()
## [1] "2.26.1"

6.2 Environment

print(sessionInfo(), locale=FALSE)
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cmdstanr_0.3.0.9000 brms_2.14.11        Rcpp_1.0.6         
##  [4] kableExtra_1.3.4    forcats_0.5.1       stringr_1.4.0      
##  [7] dplyr_1.0.4         purrr_0.3.4         readr_1.4.0        
## [10] tidyr_1.1.2         tibble_3.0.6        tidyverse_1.3.0    
## [13] patchwork_1.1.1     ggthemes_4.2.4      bayesplot_1.8.0    
## [16] ggplot2_3.3.3       openxlsx_4.2.3     
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1            backports_1.2.1         systemfonts_1.0.1      
##   [4] plyr_1.8.6              igraph_1.2.6            splines_4.0.4          
##   [7] crosstalk_1.1.1         TH.data_1.0-10          rstantools_2.1.1       
##  [10] inline_0.3.17           digest_0.6.27           htmltools_0.5.1.1      
##  [13] rsconnect_0.8.16        magrittr_2.0.1          modelr_0.1.8           
##  [16] RcppParallel_5.0.2      matrixStats_0.58.0      xts_0.12.1             
##  [19] sandwich_3.0-0          svglite_2.0.0           prettyunits_1.1.1      
##  [22] colorspace_2.0-0        rvest_0.3.6             haven_2.3.1            
##  [25] xfun_0.21               callr_3.5.1             crayon_1.4.1           
##  [28] jsonlite_1.7.2          lme4_1.1-26             survival_3.2-7         
##  [31] zoo_1.8-8               glue_1.4.2              gtable_0.3.0           
##  [34] emmeans_1.5.4           webshot_0.5.2           V8_3.4.0               
##  [37] pkgbuild_1.2.0          rstan_2.26.0.9000       abind_1.4-5            
##  [40] scales_1.1.1            mvtnorm_1.1-1           DBI_1.1.1              
##  [43] miniUI_0.1.1.1          viridisLite_0.3.0       xtable_1.8-4           
##  [46] stats4_4.0.4            StanHeaders_2.26.0.9000 DT_0.17                
##  [49] htmlwidgets_1.5.3       httr_1.4.2              threejs_0.3.3          
##  [52] ellipsis_0.3.1          pkgconfig_2.0.3         loo_2.4.1              
##  [55] sass_0.3.1              dbplyr_2.1.0            tidyselect_1.1.0       
##  [58] rlang_0.4.10            reshape2_1.4.4          later_1.1.0.1          
##  [61] munsell_0.5.0           cellranger_1.1.0        tools_4.0.4            
##  [64] cli_2.3.0               generics_0.1.0          broom_0.7.5            
##  [67] ggridges_0.5.3          evaluate_0.14           fastmap_1.1.0          
##  [70] yaml_2.2.1              processx_3.4.5          knitr_1.31             
##  [73] fs_1.5.0                zip_2.1.1               nlme_3.1-152           
##  [76] mime_0.10               projpred_2.0.2          xml2_1.3.2             
##  [79] compiler_4.0.4          shinythemes_1.2.0       rstudioapi_0.13        
##  [82] gamm4_0.2-6             curl_4.3                reprex_1.0.0           
##  [85] statmod_1.4.35          bslib_0.2.4             stringi_1.5.3          
##  [88] ps_1.5.0                Brobdingnag_1.2-6       lattice_0.20-41        
##  [91] Matrix_1.3-2            nloptr_1.2.2.2          markdown_1.1           
##  [94] shinyjs_2.0.0           vctrs_0.3.6             pillar_1.4.7           
##  [97] lifecycle_1.0.0         jquerylib_0.1.3         bridgesampling_1.0-0   
## [100] estimability_1.3        httpuv_1.5.5            R6_2.5.0               
## [103] bookdown_0.21           promises_1.2.0.1        gridExtra_2.3          
## [106] codetools_0.2-18        boot_1.3-27             colourpicker_1.1.0     
## [109] MASS_7.3-53.1           gtools_3.8.2            assertthat_0.2.1       
## [112] rprojroot_2.0.2         withr_2.4.1             shinystan_2.5.0        
## [115] multcomp_1.4-16         mgcv_1.8-34             parallel_4.0.4         
## [118] hms_1.0.0               grid_4.0.4              coda_0.19-4            
## [121] minqa_1.2.4             rmarkdown_2.7           shiny_1.6.0            
## [124] lubridate_1.7.9.2       base64enc_0.1-3         dygraphs_1.1.1.6

  1. In this text we use the words outcome and predictor for our dependent and independent variables, respectively. The word covariate can also often be used instead of predictor, even though it, in the traditional sense, means a continuous predictor.↩︎

  2. https://mc-stan.org/misc/warnings.html↩︎

  3. https://mc-stan.org/docs/2_26/reference-manual/divergent-transitions.html↩︎

  4. https://mc-stan.org/docs/2_26/stan-users-guide/reparameterization-section.html↩︎