Creating models in practice

Choosing a model

As discussed previously, it is essential to exercise engineering judgement when choosing an uncertainty model, both in terms of the speed of the computations that can be performed with the model and the appropriateness of the model’s representation of uncertainty. Engineers should also note that it is essential to exercise judgement even after the theoretical uncertainty framework has been chosen, since the set of hypotheses included in the uncertainty model has a strong effect on the conclusions drawn from the analysis.

For example, consider the following problem, proposed by [3]: two doctors examine a patient, but differ in their diagnoses. Doctor A believes the patient has a 99% chance of meningitis and a 1% chance of concussion. Doctor B believes the patient has a 99% chance of a tumour and a 1% chance of concussion. Since the doctors’ diagnoses strongly conflict with each other, a naïve application of Bayesian probability concludes that the patient most likely has concussion. Zadeh proposes that this problem can be solved with fuzzy logic. However, [4] shows that Bayesian probabilities can, in fact, be used to solve the problem, by allowing the model to consider that the doctors may have made a mistake in their estimations of the probabilities.

Another interesting example is demonstrated by [5] and [6]: using probability distributions to represent epistemic uncertainty in satellite conjunction analysis does not provide a useful description of the likelihood of collision between satellites, since the computed likelihood of collision appears to decrease as data with more incertitude is collected.

Training models from data

Here we provide a non-exhaustive review of methods to calibrate probabilistic and non-probabilistic generative uncertainty models, in order to set the context for the remainder of the thesis.

Creating parametric Bayesian probabilistic models

Consider the probability distribution \(p_\theta(x) = p(x | \theta)\) with vector of parameters \(\theta\), which we wish to identify based on a set of \(n\) training samples, \(\mathcal{X}_\text{train} = \{ x^{(1)}, \ldots, x^{(n)} \}\), drawn from the random variable specified by \(p_\theta(x)\). A distribution over the parameters \(\theta\), given the data \(\mathcal{X}_\text{train}\), can be obtained by applying Bayes’ law:

(7)\[P(\theta | \mathcal{X}_\text{train}) = \frac{P(\mathcal{X}_\text{train} | \theta) p(\theta)}{P(\mathcal{X}_\text{train})}, \]

where \(p(\theta)\) represents a prior distribution on \(\theta\), \(P(\mathcal{X}_\text{train}) = \int P(\mathcal{X}_\text{train} | \theta) p(\theta) d\theta\) acts as a normalising constant, and the data likelihood can be written as \(P(\mathcal{X}_\text{train} | \theta) = \prod_i p(x^{(i)} | \theta)\) by assuming independence of the training samples. This approach, known as Bayesian Hierarchical Modelling [7], has desirable properties. For example, the epistemic uncertainty on \(\theta\) will decrease as more data becomes available, which will be observed as a ‘concentration’ of the posterior distribution for \(\theta\) around one point.

Although simple analytical distributions are often used for the likelihood \(P(\mathcal{X}_\text{train} | \theta)\) (e.g. a Gaussian distribution with mean \(\theta_1\) and scale \(\theta_2\)), the framework can also be extended to more complex likelihood functions. For example, the likelihood often takes the form \(P(\mathcal{X}_\text{train} | \theta) = \int{P(\mathcal{X}_\text{train}|y) \delta(f(\theta) - y) dy}\), where \(f(\theta)\) is an arbitrarily expensive and complex function, for which we may not know the gradient, \(p(x^{(i)} | y)\) is a simple probability density (for example a normal distribution), and \(\delta(\cdot)\) is the Dirac delta function. This setting is referred to as an ‘inverse problem’ [8].

The probability distributions over \(\theta\) represent epistemic uncertainty in \(\theta\), whilst the data likelihood, \(p(x^{(i)} | \theta)\), represents the natural stochasticity (aleatory uncertainty) of the data generating mechanism. Note that the prior distribution, \(p(\theta)\), should be chosen to represent our prior knowledge of the parameter \(\theta\), and in the case of no knowledge, should be set to an appropriate uninformative distribution. The distribution used for the uninformative prior should be chosen based on physical considerations regarding the parameter of interest, but is often somewhat arbitrarily assumed to be uniform [1].

The prior distribution is not the only place where prior knowledge enters into the probabilistic model; the model specification, i.e. the data likelihood, represents another form of prior knowledge which must be carefully considered with this approach [1]. It is particularly important to decide which parameters in the likelihood function should be modelled as uncertain, e.g. if the likelihood is assumed to be a Gaussian density, will a value be assumed for the standard deviation of the distribution, or will this be an element of \(\theta\), and hence an uncertain parameter?

We can derive point estimates for \(\theta\) from the Bayesian approach [9]. The maximum a posteriori estimator for \(\theta\) is obtained by evaluating \(\theta_\text{MAP} = \operatorname{arg\,max}_\theta P(\theta | \mathcal{X}_\text{train})\), where \(P(\theta | \mathcal{X}_\text{train}) \propto P(\mathcal{X}_\text{train} | \theta) p(\theta)\). The maximum likelihood estimator for \(\theta\) is obtained by evaluating \(\theta_\text{ML} = \operatorname{arg\,max}_\theta P(\mathcal{X}_\text{train} | \theta)\). Note that the maximum likelihood estimator is equivalent to the maximum a posteriori estimator when a uniform prior distribution is used. These estimators can be evaluated using any optimisation method. Stochastic Gradient Descent, a widely used optimisation method, will be discussed in Machine Learning of Regression Models, since it is most often applied to regression models.
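For illustration, the following is a minimal Python sketch of maximum likelihood estimation for a Gaussian likelihood using a general-purpose optimiser; the synthetic data and parameterisation are assumptions made purely for this example.

```python
# Minimal sketch: maximum likelihood estimation of a Gaussian likelihood's
# parameters via numerical optimisation (the data here are synthetic).
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x_train = rng.normal(1.0, 0.05, size=10)  # synthetic training samples

def negative_log_likelihood(theta):
    mu, log_sigma = theta  # optimise log(sigma) so that sigma stays positive
    return -np.sum(stats.norm.logpdf(x_train, loc=mu, scale=np.exp(log_sigma)))

result = optimize.minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)  # close to the generating values 1.0 and 0.05
```

The maximum a posteriori estimate is obtained in the same way by adding the negative log prior, \(-\log p(\theta)\), to the objective.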

We do not necessarily have to disregard uncertainty in \(\theta\) when using the maximum a posteriori approach, since the covariance of the posterior can be estimated by inverting the Hessian (the matrix of second derivatives with respect to the parameters \(\theta\)) of the negative log posterior, \(-\log P(\theta | \mathcal{X}_\text{train})\), evaluated at \(\theta_\text{MAP}\). This is known as the Laplace approximation. The estimate is exact in some well known cases, e.g. the case of a Gaussian likelihood and prior, where the optimisation loss (the negative logarithm of the posterior) becomes the mean squared error [8].
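A minimal sketch of the Laplace approximation follows, for the assumed case of a one-dimensional parameter (the mean of a Gaussian likelihood with known noise); here the Hessian is estimated by a central finite difference, and the result is checked against the exact conjugate posterior variance.

```python
# Minimal sketch: Laplace approximation of the posterior variance for a
# one-dimensional parameter (all distributions here are assumed examples).
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
sigma = 0.05                           # known likelihood noise (assumed)
data = rng.normal(1.0, sigma, size=10)

def neg_log_posterior(mu):
    log_likelihood = np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
    log_prior = stats.norm.logpdf(mu, loc=1.0, scale=0.033)
    return -(log_likelihood + log_prior)

mu_map = optimize.minimize_scalar(neg_log_posterior).x

# Curvature of the negative log posterior at the MAP by central differences;
# its inverse is the Laplace estimate of the posterior variance.
h = 1e-4
curvature = (neg_log_posterior(mu_map + h) - 2.0 * neg_log_posterior(mu_map)
             + neg_log_posterior(mu_map - h)) / h**2
laplace_variance = 1.0 / curvature

# Exact posterior variance for this conjugate Gaussian case, for comparison.
exact_variance = 1.0 / (len(data) / sigma**2 + 1.0 / 0.033**2)
print(laplace_variance, exact_variance)  # the two agree
```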

In many cases the full posterior for \(\theta\) can be calculated analytically, for example where a conjugate prior is used so that the posterior distribution has the same functional form as the prior distribution. If this is not the case, and one wishes to compute the full posterior distribution, then a Markov Chain Monte Carlo (MCMC) method is often used to obtain samples from the posterior distribution, or other approximate numerical techniques are used.

An MCMC algorithm constructs a Markov chain with the desired distribution as its equilibrium distribution, so that if the Markov chain is simulated for a sufficient time then the samples drawn are from the posterior distribution [7]. MCMC methods typically do not require the gradient of the posterior to be known, and are hence applicable to a wide class of problems. Unfortunately, MCMC simulation can be computationally infeasible when \(\theta\) has high dimensionality, or when the training dataset is large. Recently, efficient sampling based algorithms have been proposed to combat this problem [10].
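A random-walk Metropolis sampler, one of the simplest members of the MCMC family, can be written in a few lines; the target below is a toy unnormalised log posterior assumed for illustration.

```python
# Minimal sketch: random-walk Metropolis sampling from an unnormalised
# target density; no gradients of the target are required.
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # Toy target (standard Gaussian); in practice, log likelihood + log prior.
    return -0.5 * np.sum(theta**2)

def metropolis(theta0, n_samples=5000, step=0.5):
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta)
    samples = np.empty((n_samples, theta.size))
    for i in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.size)
        log_p_proposal = log_target(proposal)
        # Accept with probability min(1, ratio of target densities).
        if np.log(rng.uniform()) < log_p_proposal - log_p:
            theta, log_p = proposal, log_p_proposal
        samples[i] = theta
    return samples

samples = metropolis(np.zeros(2))
print(samples[1000:].mean(axis=0), samples[1000:].std(axis=0))  # ~0 and ~1
```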

As an alternative to MCMC based methods, Variational Inference can be used to find the closest match between an approximating parametric ‘proposal’ distribution and the true posterior distribution. This method typically requires the gradient of the likelihood function to be known, but scales very well to high dimensionality problems [11].

Approximate Bayesian Computation is an efficient computational method which can be used to sample from an approximation of the posterior distribution in the case that the likelihood is too expensive to compute [12]. [13] demonstrate a similar method, where the true likelihood probability density function is replaced by an interval with associated probability, and show that bounds on the likelihood function can still be obtained in this case.
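A minimal rejection-ABC sketch is given below: parameter draws from the prior are kept when data simulated with them lie close to the observed data; the simulator, summary statistic, and tolerance are all illustrative assumptions.

```python
# Minimal sketch: rejection Approximate Bayesian Computation, where the
# likelihood is only accessed through a forward simulator.
import numpy as np

rng = np.random.default_rng(2)
observed = rng.normal(1.0, 0.05, size=10)  # 'observed' data (synthetic)

def simulate(theta, size):
    return rng.normal(theta, 0.05, size=size)

def rejection_abc(n_draws=100_000, tolerance=0.01):
    accepted = []
    for _ in range(n_draws):
        theta = rng.normal(1.0, 0.033)      # draw from the (assumed) prior
        simulated = simulate(theta, observed.size)
        # Keep theta when the summary statistics (here: means) are close.
        if abs(simulated.mean() - observed.mean()) < tolerance:
            accepted.append(theta)
    return np.array(accepted)

posterior_samples = rejection_abc()
print(posterior_samples.mean(), posterior_samples.std())
```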

Worked example: elementary probability theory

We wish to measure the length of bolts in a box. It is known that all of the bolts in the box have the same length. Your colleague tells you that he thinks the bolts are about 1 metre long, and he is almost certain that the bolts are between 0.9 and 1.1 metres. Your ruler measures the bolts to the nearest 5 centimetres. You draw 10 bolts with lengths measuring \(L = \{0.95m, 1m, 1m, 1m, 0.9m, 0.9m, 0.95m, 0.95m, 0.9m, 1m\}\).

This problem can be solved most easily by modelling your colleague’s prior belief with a normal prior distribution, \(P(l) = \mathcal{N}(\mu=1,\sigma=0.033)\), so that the \(3\sigma\) interval (\(\pm 0.1\) m, approximately 99.7% coverage) fits the colleague’s stated belief.

The likelihood function describes the precision of the ruler: \(P(L | l) = \prod_i P(l_i | l) = \prod_i \mathcal{N}(l_i; \mu=l, \sigma=0.05)\), where the first equality holds because the measurements are made independently.

Then the posterior is obtained easily, since the Gaussian prior is conjugate to the Gaussian likelihood: \(P(l|L) \propto P(L | l) P(l)\).

The equations for analytically calculating the posterior hyperparameters are given below.

Using the sample mean and standard deviation (note that it is the sample standard deviation, rather than the stated ruler precision of 0.05 m, that is plugged in for \(\sigma\) in the update below):

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{1}{10} (0.95 + 1 + 1 + 1 + 0.9 + 0.9 + 0.95 + 0.95 + 0.9 + 1) = 0.955 \ \text{m} \]
\[\sigma = 0.0437798 \ \text{m} \]

The posterior hyperparameters are:

\[{\sigma_0'}^2 = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} = 0.0128^2 \ \text{m}^2\]
\[\mu_0' = \frac{\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} = 0.962 \ \text{m} \]

So you can be more confident than your colleague about the length of the bolts in the box: the posterior standard deviation (0.0128 m) is smaller than the prior standard deviation (0.033 m). You also now believe the bolts are slightly shorter than he estimated.
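The computation above can be checked numerically; the following sketch reproduces the quoted hyperparameters (as in the text, the sample standard deviation is substituted for \(\sigma\) in the update).

```python
# Numerical check of the conjugate Gaussian update in the worked example.
import numpy as np

L = np.array([0.95, 1.0, 1.0, 1.0, 0.9, 0.9, 0.95, 0.95, 0.9, 1.0])
mu_0, sigma_0 = 1.0, 0.033              # prior hyperparameters
n, x_bar = len(L), L.mean()             # n = 10, x_bar = 0.955 m
sigma = L.std(ddof=1)                   # sample standard deviation, 0.0438 m

precision = n / sigma**2 + 1.0 / sigma_0**2
sigma_post = np.sqrt(1.0 / precision)   # 0.0128 m
mu_post = (n * x_bar / sigma**2 + mu_0 / sigma_0**2) / precision  # 0.962 m
print(mu_post, sigma_post)
```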

Helpfully, this type of computation allows you to simulate how much data is required to obtain a certain level of confidence in the posterior distribution, by drawing more data from the posterior predictive distribution and performing inference with this simulated data. This can also be achieved by studying analytic formulae to predict what is known as the posterior concentration rate.

Choosing a prior and likelihood function with support on the whole real line is probably a poor choice for a variable like length, which must be positive. How could the model be improved?

Frequentist confidence intervals

In this thesis, traditional frequentist statistics are not used, except in the validation of some Interval Predictor Models, but for the interested reader we briefly describe here how a frequentist confidence interval can be obtained for \(\theta\).

In frequentist statistics, one aims to identify a region of parameter space which would contain the true value of the parameter with a specified frequency if the experiment were repeated, i.e. we aim to find a confidence interval \(\Theta = [\underline{\theta}, \overline{\theta}]\), computed from the data, such that \(P(\theta \in \Theta) = 1 - \alpha\), where \(\alpha\) is an arbitrarily small probability.

Non-parametric prediction intervals

The maximum and minimum of a dataset (i.e. \([\min_i{x^{(i)}},\max_i{x^{(i)}}]\) when \(x\) is one dimensional) can be used to produce a prediction interval with coverage probability \(\frac{n-1}{n+1}\), i.e. the probability that the next sample, \(x^{(n+1)}\), will fall inside the prediction interval [14]. A tighter prediction interval, with a lower coverage probability of \(\frac{n + 1 - 2j}{n + 1}\), can be obtained by using the \(j\)-th smallest and largest values in the dataset.
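The stated coverage is easy to verify by Monte Carlo simulation; the sketch below checks the \(\frac{n-1}{n+1}\) coverage of the min-max interval (the Gaussian sampling distribution is an assumption of convenience; any continuous distribution gives the same result).

```python
# Minimal sketch: Monte Carlo check of the (n-1)/(n+1) coverage of the
# non-parametric min-max prediction interval.
import numpy as np

rng = np.random.default_rng(3)
n, trials, hits = 19, 100_000, 0
for _ in range(trials):
    x = rng.normal(size=n + 1)   # any continuous distribution will do
    hits += x[:n].min() <= x[n] <= x[:n].max()
print(hits / trials, (n - 1) / (n + 1))  # both approximately 0.9
```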

Creating parametric imprecise probability models

The application of Bayes’ law in (7) assumes that the data \(\mathcal{X}_\text{train}\) consists of real, ‘crisp’, values. However, we can also apply Bayes’ law to imprecise, interval data. For example, consider the set of training data \(\mathcal{X}_\text{train} = \{ [\underline{x}^{(1)}, \overline{x}^{(1)}], \ldots, [\underline{x}^{(n)}, \overline{x}^{(n)}] \}\). If an analytic equation is available for the posterior parameters then in many cases it is possible to obtain bounds on the posterior parameters given interval data. For example, if a Gaussian density is used for the likelihood and prior, then one may obtain bounds on the posterior normal distribution parameters analytically [15].
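For the conjugate Gaussian case, a minimal sketch of this idea follows: the posterior mean is monotone in the sample mean, so evaluating the update at the interval endpoints yields bounds (the interval data and hyperparameters are assumptions for illustration).

```python
# Minimal sketch: bounds on the posterior mean of a conjugate Gaussian
# model given interval-valued data, exploiting monotonicity in the
# sample mean.
import numpy as np

lower = np.array([0.90, 0.95, 0.95, 0.85, 0.90])  # interval data: lower ends
upper = np.array([1.00, 1.05, 1.00, 0.95, 1.00])  # interval data: upper ends
mu_0, sigma_0, sigma = 1.0, 0.033, 0.05           # assumed hyperparameters
n = len(lower)

def posterior_mean(x_bar):
    precision = n / sigma**2 + 1.0 / sigma_0**2
    return (n * x_bar / sigma**2 + mu_0 / sigma_0**2) / precision

# The posterior mean increases with x_bar, so the endpoint means bound it.
print(posterior_mean(lower.mean()), posterior_mean(upper.mean()))
```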

The standard Bayesian paradigm can also be made robust by considering a set of prior distributions. This is known as Robust Bayes [16]. Again, bounds on the posterior parameters are available analytically in many cases, e.g. the Imprecise Dirichlet model.

Probability boxes can also be obtained by creating so-called confidence structures (confidence boxes), which encode confidence intervals at all confidence levels as a probability box. The binomial confidence bounds, which bound the success probability of a binomial random variable, are a particularly useful example, and can be found by inverting the CDF of a binomial random variable [17].

Creating non-parametric imprecise probability models

Several methods exist to obtain non-parametric CDFs from data. The CDF can be estimated from \(n\) training samples, \(\mathcal{X}_\text{train} = \{ x^{(1)}, \ldots, x^{(n)} \}\), using the empirical cumulative distribution function (eCDF), which is given by

\[S_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x \ge x^{(i)}} (x),\]

where \(\mathbb{I}\) is the indicator function, which is equal to \(1\) if the subscript statement is satisfied, and is otherwise equal to zero [18]. The eCDF is effectively the CDF of the random variable formed by assigning probability mass equally to the point value of each sample, and hence when plotted the eCDF looks like a staircase function. The eCDF can be generalised to the case of imprecise sampled data by considering a separate eCDF for the upper and lower bounds of the samples. These upper and lower eCDFs represent the envelope of a probability box, and hence an empirical probability box is obtained [19].
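A minimal sketch of the eCDF and the empirical probability box follows; the interval data are assumptions for illustration. The eCDF of the lower endpoints bounds the box from above (mass shifted left raises the CDF), and the eCDF of the upper endpoints bounds it from below.

```python
# Minimal sketch: eCDF, and an empirical probability box from interval data.
import numpy as np

def ecdf(samples, x):
    # S_n(x): fraction of samples less than or equal to x.
    return np.searchsorted(np.sort(samples), x, side='right') / len(samples)

lower_ends = np.array([0.90, 0.95, 0.85, 0.90, 0.95])  # assumed intervals
upper_ends = np.array([1.00, 1.05, 0.95, 1.00, 1.05])

x_grid = np.linspace(0.8, 1.1, 7)
upper_bound = ecdf(lower_ends, x_grid)  # upper envelope of the p-box
lower_bound = ecdf(upper_ends, x_grid)  # lower envelope of the p-box
print(np.all(lower_bound <= upper_bound))  # True: a valid probability box
```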

So-called concentration inequalities can be used to obtain bounds on the CDF of a random variable with a certain confidence. A probability box can be obtained for the random variable by choosing a cutoff confidence, such that the CDFs at that confidence will form the envelope of the probability box [19].

The Kolmogorov-Smirnov statistic can be used to bound, at a given confidence level, how much the true CDF of a random variable differs from the eCDF obtained from sampled data (i.e. the maximum vertical distance between the CDFs is compared) [20]. The Kolmogorov-Smirnov statistic is given by

\[D = \sup_x |S_n(x) - F(x)|,\]

where \(F(x)\) is the true CDF; critical values for \(D\) at a given confidence level can be obtained from tables [18]. A set of CDFs with an associated confidence can thereby be found. The Kolmogorov-Smirnov statistic can also be applied to the eCDF bounds obtained by considering imprecise data [19].
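As a sketch of a distribution-free confidence band around the eCDF, the closely related Dvoretzky-Kiefer-Wolfowitz inequality can be used in place of tabulated Kolmogorov-Smirnov critical values, since it gives the band half-width in closed form; the sample data here are assumed.

```python
# Minimal sketch: distribution-free confidence band around the eCDF via
# the Dvoretzky-Kiefer-Wolfowitz inequality.
import numpy as np

rng = np.random.default_rng(4)
samples = np.sort(rng.normal(1.0, 0.05, size=50))
alpha = 0.05  # 95% confidence band

# DKW: P(sup_x |S_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2).
eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * len(samples)))

def ecdf(x):
    return np.searchsorted(samples, x, side='right') / len(samples)

x_grid = np.linspace(0.8, 1.2, 5)
band_lower = np.clip(ecdf(x_grid) - eps, 0.0, 1.0)
band_upper = np.clip(ecdf(x_grid) + eps, 0.0, 1.0)
print(np.column_stack([x_grid, band_lower, band_upper]))
```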

Chebyshev’s inequality bounds the probability that a random variable falls more than a certain number of standard deviations from its mean [21]. Therefore, knowledge of the mean and standard deviation of a random variable imposes bounds on its CDF. Chebyshev’s inequality is given by

\[P(|X-\mu| \ge k \sigma) \le \frac{1}{k^2},\]

where \(k>1\) is a real number, \(\mu\) is the mean of a random variable, and \(\sigma\) is the standard deviation of the random variable.
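A sketch of the probability box implied by Chebyshev's inequality, given only an assumed mean and standard deviation, is shown below.

```python
# Minimal sketch: CDF bounds (a probability box) from Chebyshev's
# inequality, using only the mean and standard deviation.
import numpy as np

mu, sigma = 1.0, 0.05  # assumed known moments

def chebyshev_pbox(x):
    k = (np.asarray(x, dtype=float) - mu) / sigma
    with np.errstate(divide='ignore'):  # k = 0 gives a discarded inf
        tail = 1.0 / k**2
    # More than k sigma below the mean: the CDF is capped from above;
    # more than k sigma above the mean: the CDF is propped up from below.
    upper = np.where(k < -1.0, tail, 1.0)
    lower = np.where(k > 1.0, 1.0 - tail, 0.0)
    return lower, upper

lower, upper = chebyshev_pbox(np.linspace(0.8, 1.2, 9))
print(lower)
print(upper)
```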

Creating uncertainty models without data

When insufficient data is available to create a satisfactory uncertainty model using the techniques described in the previous section, one may resort to creating a model based on the opinions of experts. This process is known as expert elicitation [22]. In this section we briefly outline how various uncertainty models can be obtained from expert opinion, in order to further justify and provide context for the uncertainty models used in this thesis. Expert elicitation is not the main focus of this thesis, and therefore this section may be skipped without consequence.

Probabilistic elicitation

In the Bayesian Hierarchical Modelling paradigm, discussed in the previous section, the posterior distribution concentrates as data is received and gradually the prior has less influence on the posterior. The prior distribution represents the state of knowledge about a parameter before data is available, and if limited data is available then more care should be taken to choose an appropriate prior, i.e. the opinions of experts should be considered and assessed quantitatively.

When eliciting multiple expert opinions one must attempt to aggregate the opinions of the experts, regardless of the model chosen. Usually the opinions of experts are fused using quantitative rules [22], and feedback may be given to the experts in order to allow their opinions to be changed. [23] propose the SHELF framework, which gives specific rules for how the opinions of experts should be elicited, and propose that the opinions be aggregated by a rational unbiased observer during the elicitation process.

Non-probabilistic elicitation

Imprecise probability models also have a role to play in expert elicitation. For example, if the model exhibits severe dependency on a probabilistic prior which can only be elicited approximately, one may wish to conduct a sensitivity analysis to the prior by considering a probability box prior, as discussed in Creating parametric imprecise probability models.

Other non-probabilistic models may be considered for a parameter, for example interval bounds on a parameter may be available from physical considerations. Alternatively, experts may prefer to specify their estimates as intervals, or may feel more comfortable specifying bounding CDFs (i.e. probability boxes). [19] discuss several methods for aggregating probability boxes which can be chosen based on the desired properties of the analysis.

Validating a trained model

Once an uncertainty model has been obtained, it is essential that the model is validated, to ensure that it suitably represents the analyst’s uncertainty. Even if one uses a Bayesian framework and trusts the priors and probability calculus, it is still possible that the model has been misspecified. For this reason, one should qualitatively inspect the results of the analysis, and quantitatively check that a probabilistic model is correctly calibrated using numerical techniques.

Validating probabilistic models

Usually one partitions the data available for creating the model into the data set used for training the model, \(\mathcal{X}_\text{train}\), and the data set used for testing, \(\mathcal{X}_\text{test}\) (containing \(N_\text{test}\) data points).

If a probabilistic model is correctly calibrated then we expect the stated probabilities to represent the real frequencies with which events occur; for example, if a set of events is predicted to occur with probability 0.9 then we expect them to occur 90% of the time in reality. This can be verified by plotting the test data relative to the trained distribution. One method of achieving this is ‘binning’ the data into a histogram, and visually comparing the histogram to the plotted distribution for the trained model [7].

One may use the test set, \(\mathcal{X}_\text{test}\), to compute various statistics of the model. For example, classical statistical tests can be used, such as the \(\chi^2\) summary statistic for the sum of squared errors which represents goodness of fit [7]. The test set can be used to compute the negative logarithmic predictive density for the model, i.e. \(-\log{P(\mathcal{X}_\text{test} | \text{Model})} = -\log{\mathbb{E}_{P(\theta | \mathcal{X}_\text{train})} P(\mathcal{X}_\text{test} | \theta)}\), which can be used as a figure of merit for comparing models.

If one wishes to compare two probabilistic models \(\text{Model}_1\) and \(\text{Model}_2\) then one may compare the evidence for the models by computing the Bayes Factor:

(8)\[\frac{P(\mathcal{X}_\text{train} | \text{Model}_1)}{P(\mathcal{X}_\text{train} | \text{Model}_2)} = \frac{ P(\text{Model}_1|\mathcal{X}_\text{train}) P(\text{Model}_2)}{P(\text{Model}_2|\mathcal{X}_\text{train})P(\text{Model}_1) },\]

where the model evidence (or marginal likelihood) \(P(\mathcal{X}_\text{train} | \text{Model})\) is computed by evaluating the expectation of the data likelihood for the model over the prior (\(P(\mathcal{X}_\text{train} | \text{Model}) = \mathbb{E}_{p(\theta)} P(\mathcal{X}_\text{train} | \theta) = \int P(\mathcal{X}_\text{train} | \theta) p(\theta) d\theta\)). If the Bayes factor is greater than one then \(\text{Model}_1\) is preferred, otherwise one should choose \(\text{Model}_2\) [24]. When \(P(\text{Model}_2)=P(\text{Model}_1)\), the prior belief in each model is equal and the Bayes factor becomes equal to the likelihood ratio. The likelihood ratio can also be computed for the test data set. One can also generate new data by sampling from the posterior predictive distribution, \(\mathbb{E}_{P(\theta | \mathcal{X}_\text{train})} p(x | \theta)\), and comparing this to the training and test data. This is known as a posterior predictive check [7].
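The following sketch computes the Bayes factor for two assumed models by evaluating each model's evidence with simple grid integration over a one-dimensional \(\theta\); the models, priors, and data are illustrative assumptions.

```python
# Minimal sketch: Bayes factor via grid integration of each model's
# evidence over a one-dimensional parameter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x_train = rng.normal(1.0, 0.05, size=10)

theta_grid = np.linspace(0.5, 1.5, 2001)
d_theta = theta_grid[1] - theta_grid[0]
log_lik = np.array([stats.norm.logpdf(x_train, t, 0.05).sum()
                    for t in theta_grid])

def evidence(prior_pdf):
    # P(X | Model) = integral of P(X | theta) p(theta) dtheta.
    return np.sum(np.exp(log_lik) * prior_pdf(theta_grid)) * d_theta

evidence_1 = evidence(lambda t: stats.norm.pdf(t, 1.0, 0.033))  # sharp prior
evidence_2 = evidence(lambda t: stats.norm.pdf(t, 1.0, 0.5))    # vague prior
print(evidence_1 / evidence_2)  # > 1 here: the data favour Model 1
```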

Validating non-probabilistic models

If a convex-set or interval based model is compared to crisp data, then one can check that all elements of the test set \(\mathcal{X}_\text{test}\) fall within the model. [25] proposes that interval models be validated against interval data using the following metric for the comparison of two sets \(A\) and \(B\):

\[\Delta(A,B) = \inf_{a\in A, b\in B}{|a-b|},\]

where \(A\) would represent the trained convex model, and \(B\) would represent an element of \(\mathcal{X}_\text{test}\). The mean of \(\Delta(A,B)\) over every element \(B \in \mathcal{X}_\text{test}\) could be used to validate against the entire test set, i.e. \(\frac{1}{N_\text{test}} \sum_{B \in \mathcal{X}_\text{test}} \Delta(A,B)\).
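For intervals, \(\Delta\) has a simple closed form: it is zero when the intervals overlap and otherwise equals the gap between them. A minimal sketch, with an assumed model and test data:

```python
# Minimal sketch: the set distance Delta(A, B) for intervals, and its mean
# over an interval-valued test set.
import numpy as np

def delta(a, b):
    # Zero when intervals [a0, a1] and [b0, b1] overlap; else the gap size.
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))

model = (0.90, 1.05)                               # trained interval model
x_test = [(0.95, 1.00), (1.10, 1.20), (0.80, 0.85)]
print(np.mean([delta(model, b) for b in x_test]))  # mean validation metric
```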

Following this, [25] proposes a generalisation of the Wasserstein distance to measure the distance between an eCDF and a probability box, as a probability box validation metric. The proposed metric, termed mean absolute difference of deviates, is given by

\[\mathbb{E}_x \Delta([\underline{F}(x), \overline{F}(x)], [\underline{S_n}(x), \overline{S_n}(x)]),\]

where \([\underline{F}(x), \overline{F}(x)]\) are the bounds of the probability box to be validated, and \([\underline{S_n}(x), \overline{S_n}(x)]\) are the bounds of the empirical probability box created from the training data, which becomes a single CDF, \([{S_n}(x), {S_n}(x)]\), in the case of crisp data. The metric reduces to zero when the probability boxes overlap at every \(x\). Models are compared by computing the metric for each model and choosing the model with the lowest value.
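A sketch of this validation metric, evaluated pointwise on a grid of \(x\) values with an assumed probability box and crisp test data, is given below.

```python
# Minimal sketch: mean absolute difference of deviates between a p-box
# and an eCDF (the p-box envelope and data are assumptions).
import numpy as np
from scipy import stats

def interval_distance(lo1, hi1, lo2, hi2):
    # Pointwise set distance between [lo1, hi1] and [lo2, hi2].
    return np.maximum(0.0, np.maximum(lo1, lo2) - np.minimum(hi1, hi2))

x = np.linspace(0.8, 1.2, 401)
# Assumed p-box envelope: two Gaussian CDFs (lower bound lies to the right).
F_lower = stats.norm.cdf(x, loc=1.02, scale=0.05)
F_upper = stats.norm.cdf(x, loc=0.98, scale=0.05)

samples = np.sort(np.random.default_rng(6).normal(1.0, 0.05, size=30))
S = np.searchsorted(samples, x, side='right') / len(samples)  # crisp eCDF

# Crisp data: the empirical p-box collapses to the single CDF [S, S].
print(interval_distance(F_lower, F_upper, S, S).mean())
```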