In this replication package we have used empirical data under NDA. However, one can use synthetic data to evaluate our approach and reach very much the same conclusions.
Here we show how to load the synthetic dataset (from the data/ directory in the GitHub repository), which can then be used in the analysis.
d <- readRDS("data/data.rds")
The line above is currently not executed, since we continue to run this analysis with the empirical data. However, by executing the line above instead of the next one, the analysis can be run as-is with the synthetic data.
First we prepare the data, check for NAs, and look at some descriptive statistics. Since the data are loaded from Excel we should be careful and verify that nothing goes wrong during the import (silently altered values, incompatible data types, etc.).
# Make sure to execute the previous statement, and not this one, if you want to
# re-run the analysis
d <- read.xlsx("data/Features.xlsx", sheet = "Features")
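Given the caution above, a few quick sanity checks right after the import can catch silently coerced or malformed columns; a minimal sketch (not part of the original pipeline):
# quick post-import sanity checks: dimensions, column types, and a few rows
dim(d)
sapply(d, class)
head(d, n = 3)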
This is what the empirical data look like: in short, 11110 rows and 10 variables.1
str(d)
## 'data.frame': 11110 obs. of 10 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ State : chr "Elicited, Prio, Dropped" "Elicited, Prio, Dropped" "Elicited, Prio, Planned, Implemented, Tested, Released" "Elicited, Prio, Dropped" ...
## $ Team.priority : num 88 83 957 79 88 76 74 73 72 71 ...
## $ Critical.feature : chr "No" "No" "No" "No" ...
## $ Business.value : chr "No value" "No value" "No value" "No value" ...
## $ Customer.value : chr "No value" "No value" "No value" "No value" ...
## $ Stakeholders : num 1 1 0 0 0 1 1 1 1 1 ...
## $ Key.customers : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Dependency : chr "No" "No" "No" "No" ...
## $ Architects.involvement: chr "None" "None" "None" "None" ...
Each row (requirement) has a unique ID. The State, which is our outcome (dependent variable), shows how far the feature survived in the requirements process. It is of an ordered categorical type, which should be modeled with some type of Cumulative likelihood. There are six categories, listed in the factor definitions below.
Team.priority is the relative priority the feature got, \(\mathbb{N} = \{0,\ldots,1000\}\). Critical.feature is a ‘Yes’/‘No’ answer (\(\mathbb{Z}_2\)). Business.value and Customer.value are also ordered categorical with four levels, listed in the factor definitions below. The level ‘No value’ is an explicit choice, so it does not imply missingness.
Stakeholders have integers, i.e., \(\mathbb{N} = \{0,\ldots,10\}\), and Key.customers the same, but with a different set, i.e., \(\mathbb{N} = \{0,\ldots,60\}\).
Finally, Dependency is \(\mathbb{Z}_2\), while Architects.involvement is ordered categorical with five levels, listed in the factor definitions below.
All ordered categorical predictors can be modeled as monotonic or category-specific effects, if necessary (Bürkner and Charpentier, 2020).
Check for NAs,
table(is.na(d))
##
## FALSE
## 111100
There are no explicit NAs in the dataset. However, that does not necessarily mean there is no missingness: some codings could be stand-ins for NA, e.g., ‘No value.’ In this particular case we know that ‘No value’ and ‘None’ in the dataset actually are values and not representations of NA.
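As a small illustration (a sketch, not part of the original analysis), one can tabulate the codings that could in principle hide missingness and confirm they look like deliberate answers:
# tabulate the 'could-be-NA' codings; useNA = "ifany" also reveals true NAs
table(d$Business.value, useNA = "ifany")
table(d$Architects.involvement, useNA = "ifany")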
Finally, we set correct data types.
# ordered categorical
d$State <- factor(d$State,
levels = c("Elicited, Dropped",
"Elicited, Prio, Dropped",
"Elicited, Prio, Planned, Dropped",
"Elicited, Prio, Planned, Implemented, Dropped",
"Elicited, Prio, Planned, Implemented, Tested, Dropped",
"Elicited, Prio, Planned, Implemented, Tested, Released"),
ordered = TRUE)
# set the other ordered categorical variables as ordered factors
d$Business.value <- factor(d$Business.value,
levels = c("No value",
"Valuable",
"Important",
"Critical"),
ordered = TRUE)
d$Customer.value <- factor(d$Customer.value,
levels = c("No value",
"Valuable",
"Important",
"Critical"),
ordered = TRUE)
d$Architects.involvement <- factor(d$Architects.involvement,
levels = c("None",
"Simple",
"Monitoring",
"Active Participation",
"Joint Design"),
ordered = TRUE)
First, we see that for the outcome State approximately as many features are released (the final state) as are dropped in the first state. We also see that the counts drop off after the initial state.
For Team.priority many features have zero in priority (\(5139\)), and then there’s a bunch of them (\(1516\)) that have priority set to the maximum value, i.e., \(1000\).
For Critical.feature we have a clear emphasis on ‘No.’
Concerning Business.value and Customer.value they are fairly similar in their respective distribution (as one would perhaps expect).
For Stakeholder and Key.customers we see a strong emphasis on lower numbers, while for Dependency a clear emphasis on ‘No.’
Finally, for Architects.involvement one can see that in the absolute majority of the cases architects are not involved.
In short, it looks roughly like what one would expect, i.e., it is not hard to explain why the plots look the way they do.
However, before we continue, we should standardize some of our predictors so the sampling will be easier, i.e., we compute \((x - \bar{x})/\sigma_x\); multiplying by \(\sigma_x\) and adding back \(\bar{x}\) later returns us to the original scale.
# standardize, abbreviate names, and change types where needed
d$prio_s <- scale(d$Team.priority)
d$sh_s <- scale(d$Stakeholders)
d$kc_s <- scale(d$Key.customers)
d$b_val <- as.integer(d$Business.value) # use int as input
d$c_val <- as.integer(d$Customer.value)
d$arch <- as.integer(d$Architects.involvement)
# Dichotomous predictors. We can set these to 1/0, but generally speaking
# one should know what one is doing and be careful doing this!
d$Critical.feature <- ifelse(d$Critical.feature == 'Yes', 1, 0)
d$Dependency <- ifelse(d$Dependency == 'Yes', 1, 0)
# only abbreviate names
d$crit <- d$Critical.feature
d$dep <- d$Dependency
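A brief sketch of some sanity checks on the recoding (not part of the original analysis, but cheap to run):
# the dichotomous predictors should now be 0/1, and the integer-coded
# ordinal predictors should span 1..K
stopifnot(all(d$crit %in% c(0, 1)), all(d$dep %in% c(0, 1)))
table(d$b_val) # 1..4 (No value ... Critical)
table(d$arch)  # 1..5 (None ... Joint Design)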
Since our outcome is of an ordered categorical nature we have many options at our hands, some of which barely existed in Bayesian modeling a few decades ago. Our outcome reminds us of a survival model, i.e., a feature needs to survive in order to reach the next stage. Taking this into account we assume that Cumulative, Adjacent-category, and Sequential models could be options (Bürkner and Vuorre, 2019).
The reason we want to use monotonic or category-specific modeling of predictors is that predictor categories will not be assumed to be equidistant with respect to their effect on the outcome (Bürkner and Charpentier, 2020).
The Cumulative family is very common so let us assume a null model, \(\mathcal{M}_0\), which uses this likelihood, with no predictors. Later we will compare all our models to \(\mathcal{M}_0\), to ensure that adding predictors improves out of sample predictions, i.e., if we can’t improve by adding predictors we might as well do other things with our time.
The outcome of the model design will be a set \(M = \{\mathcal{M}_0,\ldots,\mathcal{M}_5\}\).
Let us see what priors such a null model needs.
(p <- get_prior(State ~ 1,
family = cumulative,
data = d))
## prior class coef group resp dpar nlpar bound source
## student_t(3, 0, 2.5) Intercept default
## student_t(3, 0, 2.5) Intercept 1 (vectorized)
## student_t(3, 0, 2.5) Intercept 2 (vectorized)
## student_t(3, 0, 2.5) Intercept 3 (vectorized)
## student_t(3, 0, 2.5) Intercept 4 (vectorized)
## student_t(3, 0, 2.5) Intercept 5 (vectorized)
# Set a wide Normal(0,2) on the intercepts (the cutpoints for our
# scale, which has 6 levels, i.e., 5 cutpoints)
p$prior[1] <- "normal(0,2)"
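As a side note, the same prior can be set without relying on the row order of p; a minimal sketch using brms’s set_prior():
# equivalent, index-free specification: target the Intercept class directly
p_alt <- set_prior("normal(0, 2)", class = "Intercept")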
Sample only from the priors.
# simplest cumulative model we can think of
M0 <- brm(State ~ 1,
family = cumulative,
data = d,
# threads = threading(4), # use if 16 CPUs
prior = p,
control = list(adapt_delta=0.9),
sample_prior = "only",
refresh = 0) # avoid printing sampling progress
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.
Plot the priors \(y_{\mathrm{rep}}\) vs. the empirical data \(y\).
pp_check(M0, type = "bars", nsamples = 250)
Evidently the medians are quite evenly set along the \(x\)-axis, and the uncertainty is fairly uniformly distributed among the categories \(1,\ldots,6\) (the bars). In short, this is what we like to see.
M0 <- brm(State ~ 1,
family = cumulative,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 135.1 seconds.
## Chain 3 finished in 135.2 seconds.
## Chain 2 finished in 137.6 seconds.
## Chain 4 finished in 145.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 138.3 seconds.
## Total execution time: 145.6 seconds.
Our caterpillar plots look good for all estimated parameters (i.e., they look like fat caterpillars when the chains have mixed well).
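The trace plots themselves are not shown here; a minimal sketch of how one can produce them with brms:
# trace ("caterpillar") plots for all estimated parameters of M0
mcmc_plot(M0, type = "trace")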
Diagnostics such as divergences, tree depth, energy, \(\widehat{R}\), and \(\mathrm{ESS}\) all look good.
# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(eval(M0)$fit)
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(eval(M0)), na.rm=T) >= 1.01) {
print("Warning: Rhat >=1.01")
} else {
print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(eval(M0)), na.rm=T) <= 0.2) {
print("Warning: ESS <=0.2")
} else {
print("All ESS >0.2")
}
## [1] "All ESS >0.2"
When using dynamic Hamiltonian Monte Carlo we have a plethora of diagnostics, which we should utilize to ensure validity and efficiency when sampling. Here follows a summary of the most common diagnostics.2
There should be no divergences, since divergences indicate that the posterior estimates may be biased; they arise when the posterior landscape is hard for the sampler to explore (validity concern).3 A reparameterization of the model might be necessary if divergences remain.4
Tree depth warnings are not a validity concern but rather an efficiency concern. Reaching the maximum tree depth indicates that the sampler is terminating prematurely to avoid long execution time (Homan and Gelman, 2014).
Having low energy values (E-BFMI) is an indication of a biased posterior (validity concern) (Betancourt, 2016).
The \(\widehat{R}\) convergence diagnostic indicates whether the independent chains converged, i.e., explored the posterior in approximately the same way (validity concern) (Vehtari et al., 2021).
The effective sample size (\(\mathrm{ESS}\)) captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. The higher the better. When the ratio approaches \(0.1\) we should start worrying, and in absolute numbers the \(\mathrm{ESS}\) should be in the hundreds for the Central Limit Theorem to hold (Betancourt, 2018).
Additionally, the Monte Carlo Standard Error (MCSE) has been checked for all models. The MCSE is yet another criterion that reflects the effective accuracy of a Markov chain; it is obtained by dividing the standard deviation of the chain by the square root of its effective sample size (Betancourt, 2018),
\[\begin{equation} \mathrm{MCSE} = \frac{\mathrm{SD}}{\sqrt{\mathrm{ESS}}} \end{equation}\]
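As an illustration, here is a minimal sketch of that check based on the model summary, assuming the summary exposes the Est.Error and Bulk_ESS columns (as recent brms versions do):
# MCSE = SD / sqrt(ESS) for the population-level parameters of M0
s <- summary(M0)$fixed
round(s[, "Est.Error"] / sqrt(s[, "Bulk_ESS"]), 4)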
Let’s look at a posterior predictive plot to see how well the model has estimated the \(6\) levels of our outcome.
pp_check(M0, type = "bars", nsamples = 250)
In short, very little uncertainty (i.e., the medians are quite well estimated). Let’s leave \(\mathcal{M}_0\) for now and add predictors to the next model, which you can read about by going back up and clicking on the tab \([2.2\ \mathcal{M}_1]\).
Here we’ll design a Cumulative model with predictors.
For this and the coming models we won’t report on all the steps since it will take up too much space. However, rest assured, we have conducted all the steps, just as we did for \(\mathcal{M}_0\).
Let’s set sane priors that are uniform on the outcome space.
p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = cumulative,
data = d)
# Set N(0,1) on \betas and N(0,2) on the cutpoints
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
Sample only from the priors.
M1 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = cumulative,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
sample_prior = "only",
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.
Plot the priors \(y_{\mathrm{rep}}\) vs. the empirical data \(y\).
pp_check(M1, type = "bars", nsamples = 250)
M1 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = cumulative,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 3 finished in 192.6 seconds.
## Chain 1 finished in 192.9 seconds.
## Chain 2 finished in 193.2 seconds.
## Chain 4 finished in 199.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 194.6 seconds.
## Total execution time: 199.9 seconds.
# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(eval(M1)$fit)
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(eval(M1)), na.rm=T) >= 1.01) {
print("Warning: Rhat >=1.01")
} else {
print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(eval(M1)), na.rm=T) <= 0.2) {
print("Warning: ESS <=0.2")
} else {
print("All ESS >0.2")
}
## [1] "All ESS >0.2"
pp_check(M1, type = "bars", nsamples = 250)
Slight overestimation in the third category and slight underestimations in the first two categories. Shouldn’t be a problem but worth noting. The next model is \(\mathcal{M}_2\).
A Cumulative model with monotonic predictors.
p <- get_prior(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s +
kc_s + dep + mo(arch),
family = cumulative,
data = d)
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
p$prior[16:18] <- "dirichlet(2)" # prior for the monotonic (ordered categorical) effects
M2 <- brm(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + kc_s +
dep + mo(arch),
family = cumulative,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
sample_prior = "only",
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 4 finished in 0.2 seconds.
## Chain 1 finished in 0.3 seconds.
## Chain 2 finished in 0.3 seconds.
## Chain 3 finished in 0.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.3 seconds.
## Total execution time: 0.6 seconds.
pp_check(M2, type = "bars", nsamples = 250)
# cumulative model with monotonic effects
M2 <- brm(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + kc_s +
dep + mo(arch),
family = cumulative,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 4 finished in 651.3 seconds.
## Chain 1 finished in 663.5 seconds.
## Chain 3 finished in 716.0 seconds.
## Chain 2 finished in 721.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 688.1 seconds.
## Total execution time: 721.9 seconds.
# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(eval(M2)$fit)
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(eval(M2)), na.rm=T) >= 1.01) {
print("Warning: Rhat >=1.01")
} else {
print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(eval(M2)), na.rm=T) <= 0.2) {
print("Warning: ESS <=0.2")
} else {
print("All ESS >0.2")
}
## [1] "All ESS >0.2"
pp_check(M2, type = "bars", nsamples = 250)
Not much to say here, so let’s turn to the next model, \(\mathcal{M}_3\).
An Adjacent-category model with predictors.
p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s +
dep + arch,
family = acat,
data = d)
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M3 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = acat,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
sample_prior = "only",
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.4 seconds.
pp_check(M3, type = "bars", nsamples = 250)
M3 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = acat,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 547.7 seconds.
## Chain 4 finished in 558.9 seconds.
## Chain 3 finished in 569.6 seconds.
## Chain 2 finished in 581.4 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 564.4 seconds.
## Total execution time: 582.0 seconds.
# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(eval(M3)$fit)
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(eval(M3)), na.rm=T) >= 1.01) {
print("Warning: Rhat >=1.01")
} else {
print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(eval(M3)), na.rm=T) <= 0.2) {
print("Warning: ESS <=0.2")
} else {
print("All ESS >0.2")
}
## [1] "All ESS >0.2"
pp_check(M3, type = "bars", nsamples = 250)
Let’s move to the next model \(\mathcal{M}_4\).
A Sequential model with predictors.
p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s +
dep + arch,
family = sratio,
data = d)
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M4 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = sratio,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
sample_prior = "only",
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.
pp_check(M4, type = "bars", nsamples = 250)
Here we see a large difference: the sratio family expects a decay over the categories. We’ll see if the data overcome this prior (they should, given that we have \(n=11110\)).
M4 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
family = sratio,
data = d,
# threads = threading(4),
prior = p,
control = list(adapt_delta=0.9),
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 158.2 seconds.
## Chain 3 finished in 161.4 seconds.
## Chain 2 finished in 161.9 seconds.
## Chain 4 finished in 183.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 166.2 seconds.
## Total execution time: 183.6 seconds.
# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(eval(M4)$fit)
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(eval(M4)), na.rm=T) >= 1.01) {
print("Warning: Rhat >=1.01")
} else {
print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(eval(M4)), na.rm=T) <= 0.2) {
print("Warning: ESS <=0.2")
} else {
print("All ESS >0.2")
}
## [1] "All ESS >0.2"
pp_check(M4, type = "bars", nsamples = 250)
Finally, let’s take a look at the final (for now) model \(\mathcal{M}_5\).
A Sequential model with category-specific predictors.
p <- get_prior(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s +
kc_s + dep + cs(arch),
family = sratio,
data = d)
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M5 <- brm(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s +
kc_s + dep + cs(arch),
family = sratio,
data = d,
# threads = threading(4),
prior = p,
sample_prior = "only",
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.4 seconds.
pp_check(M5, type = "bars", nsamples = 250)
M5 <- brm(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s +
kc_s + dep + cs(arch),
family = sratio,
data = d,
# threads = threading(4),
prior = p,
refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 2 finished in 635.8 seconds.
## Chain 3 finished in 636.5 seconds.
## Chain 4 finished in 639.1 seconds.
## Chain 1 finished in 645.9 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 639.3 seconds.
## Total execution time: 646.5 seconds.
# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(eval(M5)$fit)
##
## Divergences:
## 0 of 4000 iterations ended with a divergence.
##
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(eval(M5)), na.rm=T) >= 1.01) {
print("Warning: Rhat >=1.01")
} else {
print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(eval(M5)), na.rm=T) <= 0.2) {
print("Warning: ESS <=0.2")
} else {
print("All ESS >0.2")
}
## [1] "All ESS >0.2"
pp_check(M5, type = "bars", nsamples = 250)
Once we’ve sampled our set of models \(M\) we use LOO to compare the models’ relative out of sample prediction capabilities (Vehtari et al., 2017).
(l <- loo_compare(loo(M0), loo(M1), loo(M2), loo(M3), loo(M4), loo(M5)))
## elpd_diff se_diff
## M5 0.0 0.0
## M4 -43.6 10.8
## M3 -61.4 15.4
## M2 -148.1 20.3
## M1 -148.6 20.2
## M0 -3318.0 63.2
LOO ranks \(\mathcal{M}_5\) first. If we assume a \(z_{\mathrm{99\%}}\)-score of \(2.58\), zero is not in the interval for the difference to the runner-up \(\mathcal{M}_4\), i.e., \(\mathrm{CI}_{z_{99\%}}\) [-71.43, -15.84], so \(\mathcal{M}_5\) has a clear advantage.
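For transparency, a sketch of how such an interval is obtained from the loo_compare() output above (assuming the row names follow the model names, as printed):
# 99% normal-approximation interval around the elpd difference of M4 vs M5
z99 <- qnorm(0.995) # approx. 2.58
l["M4", "elpd_diff"] + c(-1, 1) * z99 * l["M4", "se_diff"]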
By looking at the expected log pointwise predictive density (elpd), one can see that adding predictors (\(\mathcal{M}_1\)) clearly has a significant effect compared to \(\mathcal{M}_0\). However, adding monotonic effects, \(\mathcal{M}_2\), had very little added benefit (i.e., \(\mathcal{M}_1\) vs. \(\mathcal{M}_2\)), while adding category-specific effects, \(\mathcal{M}_5\), clearly improved out of sample predictions (i.e., \(\mathcal{M}_5\) vs. \(\mathcal{M}_4\)).
Also worth noting is that, among the models with predictors, the Cumulative models (\(\mathcal{M}_1\) and \(\mathcal{M}_2\)) perform the worst, relatively speaking. The Sequential models take the first two spots, while the Adjacent-category model comes in third.
If we were interested in refining our models purely for optimal out of sample predictions we could conduct variable selection (Piironen et al., 2020). However, in this particular case we are interested in each predictor’s effect, so we keep all predictors and use \(\mathcal{M}_5\) as our target model, \(\mathcal{M}\), for now.
M <- M5
To make sense of the parameter estimates we need to transform them, since the values are on the \(\mathop{\mathrm{logit}}\) scale.
First, let’s compare the model that does not use category-specific effects (\(\mathcal{M}_4\)) with the model that does (\(\mathcal{M}\)). The first table below shows \(\mathcal{M}_4\), while the second shows the ‘final’ model we’ve settled on, designated \(\mathcal{M}\).
| Parameter | Estimate | Est.Error | Q2.5 | Q97.5 |
|---|---|---|---|---|
| prio_s | 1.22 | 0.02 | 1.18 | 1.26 |
| crit | 0.60 | 0.05 | 0.50 | 0.70 |
| b_val | 0.08 | 0.02 | 0.04 | 0.12 |
| c_val | 0.05 | 0.02 | 0.00 | 0.10 |
| sh_s | -0.05 | 0.02 | -0.08 | -0.02 |
| kc_s | 0.01 | 0.02 | -0.02 | 0.04 |
| dep | 0.09 | 0.04 | 0.00 | 0.17 |
| arch | 0.03 | 0.03 | -0.02 | 0.08 |
| Parameter | Estimate | Est.Error | Q2.5 | Q97.5 |
|---|---|---|---|---|
| prio_s | 1.22 | 0.02 | 1.18 | 1.26 |
| crit | 0.62 | 0.05 | 0.52 | 0.71 |
| sh_s | -0.05 | 0.02 | -0.08 | -0.02 |
| kc_s | 0.01 | 0.02 | -0.02 | 0.04 |
| dep | 0.09 | 0.04 | 0.01 | 0.18 |
| b_val[1] | 0.19 | 0.03 | 0.13 | 0.25 |
| b_val[2] | -0.04 | 0.03 | -0.11 | 0.02 |
| b_val[3] | 0.15 | 0.04 | 0.06 | 0.23 |
| b_val[4] | -0.18 | 0.06 | -0.30 | -0.06 |
| b_val[5] | -0.09 | 0.14 | -0.35 | 0.20 |
| c_val[1] | 0.00 | 0.04 | -0.07 | 0.08 |
| c_val[2] | 0.05 | 0.04 | -0.04 | 0.13 |
| c_val[3] | 0.13 | 0.06 | 0.02 | 0.24 |
| c_val[4] | 0.19 | 0.08 | 0.03 | 0.34 |
| c_val[5] | -0.01 | 0.16 | -0.33 | 0.30 |
| arch[1] | 0.13 | 0.05 | 0.03 | 0.22 |
| arch[2] | 0.09 | 0.05 | 0.00 | 0.18 |
| arch[3] | 0.03 | 0.05 | -0.06 | 0.13 |
| arch[4] | -0.23 | 0.06 | -0.34 | -0.11 |
| arch[5] | -0.28 | 0.12 | -0.51 | -0.03 |
There are two things to note. First, because we model category-specific effects (second table) we get several estimates of interest for arch, which is not even a significant parameter in \(\mathcal{M}_4\), which lacks category-specific effects. Second, for all category-specific effects we receive a much more fine-grained view of precisely which categories in each predictor are making a difference.
To start with, let’s focus on the second table, consisting of estimates from \(\mathcal{M}\), and turn our attention to the other parameters. The prio_s parameter, 1.22, becomes 0.77 when transformed with the inverse logit, i.e.,
\[\begin{equation} \frac{\exp(1.22)}{\exp(1.22)+1} = 0.77 \end{equation}\]
But remember the suffix _s! We need to multiply by \(\sigma_x\) and add \(\bar{x}\) from the data, which leads to an estimate of 689.74, \(\mathrm{CI}_{95\%}\) [686.78, 692.61].
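A sketch of the operation described above, using the centering and scaling constants stored by scale() (the exact numbers depend on the empirical data):
# inverse logit of the prio_s estimate, then undo the standardization
b_prio <- 1.22
p_prio <- plogis(b_prio) # exp(x) / (1 + exp(x)) = 0.77
p_prio * attr(d$prio_s, "scaled:scale") + attr(d$prio_s, "scaled:center")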
As already discussed, the more exotic parameters are our category-specific effects b_val, c_val, and arch. To our knowledge such effects have not been used in an analysis in software engineering or computer science before. We see that each ordinal parameter has five category-specific effects estimated (e.g., rows \(6\)–\(10\) in the second table above). These estimates indicate to what degree b_val affected the six outcome levels of State (remember, our outcome consists of six ordered categorical levels, with five borders, or cutpoints, between the levels).
If we continue taking b_val as an example, the \(95\)% credible interval does not cross \(0\) for the \(1\)st (positive), \(3\)rd (positive), and \(4\)th (negative) parameter (see the table above). This means that b_val affects the borders (cutpoints) positively between the \(1\)st and \(2\)nd categories and between the \(3\)rd and \(4\)th categories, while the opposite is true for the negative effect on the border between the \(4\)th and \(5\)th categories. Remember, if a border is affected positively it means that the border was shifted upwards and, hence, more probability mass was moved to lower levels! But what does this mean in practice?
If you recall, our outcome State had six categories,
levels(d$State)
## [1] "Elicited, Dropped"
## [2] "Elicited, Prio, Dropped"
## [3] "Elicited, Prio, Planned, Dropped"
## [4] "Elicited, Prio, Planned, Implemented, Dropped"
## [5] "Elicited, Prio, Planned, Implemented, Tested, Dropped"
## [6] "Elicited, Prio, Planned, Implemented, Tested, Released"
So we can claim, when analyzing these estimates, that b_val moves probability mass to lower levels by shifting the border up between,
levels(d$State)[1:2]
## [1] "Elicited, Dropped" "Elicited, Prio, Dropped"
and,
levels(d$State)[3:4]
## [1] "Elicited, Prio, Planned, Dropped"
## [2] "Elicited, Prio, Planned, Implemented, Dropped"
while it shifts the border down between,
levels(d$State)[4:5]
## [1] "Elicited, Prio, Planned, Implemented, Dropped"
## [2] "Elicited, Prio, Planned, Implemented, Tested, Dropped"
Next, let’s plot the posterior probability densities for our population-level estimates on the \(\mathop{\mathrm{logit}}\) scale.
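A sketch of one way to produce such a plot with brms (the exact plotting code may differ from what was used originally):
# posterior densities with 95% intervals for the population-level effects
mcmc_plot(M, type = "areas", prob = 0.95)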
Examining the above plot, from bottom to top, we can say that at the \(95\)% level the first three parameters are clearly positive or negative and do not cross \(0\). The fourth parameter, Key customers, is not significant, with a \(95\)% credible interval of [-0.02, 0.04]. The fifth parameter, Dependencies, is significant and positive, [0.01, 0.18]. For Business value, Customer value, and Architects' involvement we see that some of the category-specific parameters are significant.
To conclude what we’ve noticed so far: Priority, Critical feature, Stakeholders, and Dependencies are significant at the \(95\)% level, while for Business value, Customer value, and Architects' involvement some categories are significant. Key customers is not significant.
Below we plot conditional effects for some of our estimates. The colors represent the different ordered categories, \(1,\ldots,6\), for our outcome State. We are particularly interested in Category \(6\), i.e., the final category which indicates a released feature.
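A sketch of how such conditional-effects plots can be produced with brms, one predictor at a time (here for prio_s; the original figures may have been generated differently):
# probabilities of each State category as a function of priority
plot(conditional_effects(M, effects = "prio_s", categorical = TRUE))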
One important question we would like to answer is which independent variable(s) contribute(s) most to a feature ultimately being released, i.e., is it priority, criticality, business or customer value, number of stakeholders, number of key customers, having dependencies, and/or the level of architect involvement? The above plots answer this question, without us even having to conduct statistical tests or examine \(p\)-values. Let’s start by analyzing the predictor Priority.
Concerning Priority we see that it has a very large effect for State \(6\) (i.e., a feature being released). The higher the priority (the more to the right) the more probability mass is set on State \(6\). In the end it has close to \(70\)% of the probability mass, while the other states are not even close. Also worth noting is how, for State \(4\) (the hump), medium priorities seem to be the recipe for reaching this stage.
For the predictor Critical we see some of the same effects, albeit the uncertainty increases. The clearest effect is visible for State \(6\), i.e., going from No to Yes significantly increases the probability, while the opposite holds for States \(1\)–\(3\) (logically so, since a critical requirement should be released, i.e., reach State \(6\), with a higher probability).
Concerning Number of stakeholders, we see that virtually all states (except State \(1\)) have a lower probability with an increasing number of stakeholders (and more uncertainty is visible). For State \(1\), however, an increase in stakeholders leads to an increase in probability. One could claim this is natural since having stakeholders would lead to the requirement being considered in the first place.
For Dependency not much changes when it moves from No to Yes.
We saw previously that Key customers was not a significant parameter. If we plot it we will see why, i.e., the uncertainty is too large.
For the sampling we refrain from using rstan and instead use cmdstan through the R package cmdstanr. Generally speaking, the community now prefers cmdstan since it is updated more frequently.
Install cmdstanr and cmdstan by,
CORES = 4 # set to the number of available CPU cores
remotes::install_github("stan-dev/cmdstanr")
cmdstanr::install_cmdstan(cores = CORES)
# once you have brms installed you can now run brms with cmdstan instead of rstan
options(brms.backend="cmdstanr")
options(mc.cores = CORES) # set num cores
library(brms)
For this execution we’ve used,
cmdstanr::cmdstan_version()
## [1] "2.26.1"
print(sessionInfo(), locale=FALSE)
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cmdstanr_0.3.0.9000 brms_2.14.11 Rcpp_1.0.6
## [4] kableExtra_1.3.4 forcats_0.5.1 stringr_1.4.0
## [7] dplyr_1.0.4 purrr_0.3.4 readr_1.4.0
## [10] tidyr_1.1.2 tibble_3.0.6 tidyverse_1.3.0
## [13] patchwork_1.1.1 ggthemes_4.2.4 bayesplot_1.8.0
## [16] ggplot2_3.3.3 openxlsx_4.2.3
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 systemfonts_1.0.1
## [4] plyr_1.8.6 igraph_1.2.6 splines_4.0.4
## [7] crosstalk_1.1.1 TH.data_1.0-10 rstantools_2.1.1
## [10] inline_0.3.17 digest_0.6.27 htmltools_0.5.1.1
## [13] rsconnect_0.8.16 magrittr_2.0.1 modelr_0.1.8
## [16] RcppParallel_5.0.2 matrixStats_0.58.0 xts_0.12.1
## [19] sandwich_3.0-0 svglite_2.0.0 prettyunits_1.1.1
## [22] colorspace_2.0-0 rvest_0.3.6 haven_2.3.1
## [25] xfun_0.21 callr_3.5.1 crayon_1.4.1
## [28] jsonlite_1.7.2 lme4_1.1-26 survival_3.2-7
## [31] zoo_1.8-8 glue_1.4.2 gtable_0.3.0
## [34] emmeans_1.5.4 webshot_0.5.2 V8_3.4.0
## [37] pkgbuild_1.2.0 rstan_2.26.0.9000 abind_1.4-5
## [40] scales_1.1.1 mvtnorm_1.1-1 DBI_1.1.1
## [43] miniUI_0.1.1.1 viridisLite_0.3.0 xtable_1.8-4
## [46] stats4_4.0.4 StanHeaders_2.26.0.9000 DT_0.17
## [49] htmlwidgets_1.5.3 httr_1.4.2 threejs_0.3.3
## [52] ellipsis_0.3.1 pkgconfig_2.0.3 loo_2.4.1
## [55] sass_0.3.1 dbplyr_2.1.0 tidyselect_1.1.0
## [58] rlang_0.4.10 reshape2_1.4.4 later_1.1.0.1
## [61] munsell_0.5.0 cellranger_1.1.0 tools_4.0.4
## [64] cli_2.3.0 generics_0.1.0 broom_0.7.5
## [67] ggridges_0.5.3 evaluate_0.14 fastmap_1.1.0
## [70] yaml_2.2.1 processx_3.4.5 knitr_1.31
## [73] fs_1.5.0 zip_2.1.1 nlme_3.1-152
## [76] mime_0.10 projpred_2.0.2 xml2_1.3.2
## [79] compiler_4.0.4 shinythemes_1.2.0 rstudioapi_0.13
## [82] gamm4_0.2-6 curl_4.3 reprex_1.0.0
## [85] statmod_1.4.35 bslib_0.2.4 stringi_1.5.3
## [88] ps_1.5.0 Brobdingnag_1.2-6 lattice_0.20-41
## [91] Matrix_1.3-2 nloptr_1.2.2.2 markdown_1.1
## [94] shinyjs_2.0.0 vctrs_0.3.6 pillar_1.4.7
## [97] lifecycle_1.0.0 jquerylib_0.1.3 bridgesampling_1.0-0
## [100] estimability_1.3 httpuv_1.5.5 R6_2.5.0
## [103] bookdown_0.21 promises_1.2.0.1 gridExtra_2.3
## [106] codetools_0.2-18 boot_1.3-27 colourpicker_1.1.0
## [109] MASS_7.3-53.1 gtools_3.8.2 assertthat_0.2.1
## [112] rprojroot_2.0.2 withr_2.4.1 shinystan_2.5.0
## [115] multcomp_1.4-16 mgcv_1.8-34 parallel_4.0.4
## [118] hms_1.0.0 grid_4.0.4 coda_0.19-4
## [121] minqa_1.2.4 rmarkdown_2.7 shiny_1.6.0
## [124] lubridate_1.7.9.2 base64enc_0.1-3 dygraphs_1.1.1.6
In this text we use the words outcome and predictor for our dependent and independent variables, respectively. The word covariate can also often be used instead of predictor, even though it, in the traditional sense, means a continuous predictor.↩︎
https://mc-stan.org/docs/2_26/reference-manual/divergent-transitions.html↩︎
https://mc-stan.org/docs/2_26/stan-users-guide/reparameterization-section.html↩︎