Chapter 3 Prior Distributions

3.1 Non-informative Prior Distributions

We have seen in a few examples how the choice of prior distribution (and prior parameters) can affect the posterior distribution and the resulting conclusions. Because the choice of prior distribution is subjective, it is the main target of criticism of Bayesian inference. A possible way around this is to use a prior distribution that reflects a lack of information about \(\theta\).

Definition 3.1 A non-informative prior distribution is a prior distribution that places equal weight on every possible value of \(\theta\).

Example 3.1 In Example 2.4, we assigned a uniform prior distribution to the parameter \(\theta\).

Definition 3.2 A vague prior distribution is a prior that conveys minimal information about \(\theta\) before observing data. It is chosen to be weakly informative or nearly flat over a wide range, so that the posterior distribution is driven mainly by the likelihood rather than the prior.

Example 3.2 An Exp(0.01) distribution is often used as a vague prior distribution for rate parameters. It has a mean of 100 and a standard deviation of 100, so it places weight over a wide range of values.
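We can check in R just how diffuse this prior is; a quick sketch using the built-in exponential distribution functions:

```r
# Exp(0.01) prior: for an exponential, mean = sd = 1/rate
rate <- 0.01
1 / rate                            # prior mean (and sd): 100

# Central 95% of the prior mass covers roughly (2.5, 369)
qexp(c(0.025, 0.975), rate = rate)
```

The wide central interval confirms that the prior says very little about where the rate parameter lies.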

3.2 Prior Elicitation

Throughout this course, we have tried to be objective in our choice of prior distributions, and we have discussed non-informative and vague prior distributions. This misses one real difference between Bayesian and frequentist inference: in Bayesian inference, we can include prior information about the model parameters. Determining the values of the prior parameters is known as prior elicitation. It is more of an art than a science, and it remains controversial in the Bayesian world. In this section, we look at a few ways of eliciting prior information from experts.

Example 3.3 This example shows how difficult it can be to talk to experts in other fields about probabilities and risk. Suppose we are interested in estimating the number of people who migrate into a new region each week. Denote the number of people migrating per week by \(X\) and suppose \(X \sim \hbox{Po}(\lambda)\). We place a \(\lambda \sim \Gamma(\alpha, \beta)\) prior distribution on the rate parameter \(\lambda\). The density function of this prior distribution is \[ p(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}\exp(-\lambda\beta). \] The parameter \(\alpha\) is known as the shape parameter and \(\beta\) as the rate parameter.

We interview migration charities, border enforcement officials and housing providers to estimate the values of \(\alpha\) and \(\beta\). The difficulty is that they do not know about the Gamma distribution or what shape and rate parameters are. Instead, we can ask them about summaries of the data. For the Gamma distribution, the mean is \(\alpha/\beta\), the mode is \((\alpha - 1)/\beta\) and the variance is \(\alpha/\beta^2\). If we can obtain information about two of these, we can solve for \(\alpha\) and \(\beta\). But needing two summaries brings another difficulty: non-mathematicians often have a limited understanding of statistics and probability. They can find it difficult to distinguish between the mean and the mode, and the concept of variance is very difficult to explain.
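For instance, if an expert supplies a mean \(m\) and a variance \(v\), the equations \(m = \alpha/\beta\) and \(v = \alpha/\beta^2\) solve to \(\beta = m/v\) and \(\alpha = m^2/v\). A small sketch (the elicited variance of 0.05 here is an illustrative value, not one from the example):

```r
# Solve for Gamma(shape, rate) parameters from an elicited mean and variance
gamma.from.moments <- function(m, v){
  beta  <- m / v     # rate:  mean / variance
  alpha <- m^2 / v   # shape: mean^2 / variance
  c(alpha = alpha, beta = beta)
}

# e.g. an elicited mean of 0.75 and (hypothetical) variance of 0.05
gamma.from.moments(0.75, 0.05)
## alpha = 11.25, beta = 15
```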

3.2.1 Prior Summaries

The first method we will look at is called summary matching. For this method, we ask experts to provide summaries of what they think the prior distribution is. We then choose a functional form for the prior distribution and use these summaries to estimate its parameters. The choice of summaries depends on the application at hand as well as on the choice of prior distribution. Common choices are the mean, median, mode, variance, and cumulative probabilities (e.g. \(p(\theta < 0.5)\)).

Example 3.4 Let’s return to the migration example above. Suppose that someone who works for a housing agency tells us that the expected number of new arrivals per week is 0.75 and that the probability that \(\lambda < 0.9\) is 80%. Matching summaries gives \[ \frac{\alpha}{\beta} = 0.75, \qquad \int_0^{0.9}p(\lambda)\, d\lambda = 0.8. \]

From the expectation, we have \(\alpha = 0.75\beta\). To estimate the value of \(\beta\) from the cumulative probability, we need to find the root of the equation \[ \int_0^{0.9} \frac{\beta^{0.75\beta}}{\Gamma(0.75\beta)}\lambda^{0.75\beta-1}e^{-\lambda\beta}\,d\lambda - 0.8 = 0. \]

This looks horrible, but standard numerical root-finding methods can solve it.

cumulative.eqn <- function(b){
  # Evaluate the cumulative-probability equation at beta = b,
  # with alpha = 0.75 * b fixed by the elicited mean
  pgamma(0.9, 0.75*b, b) - 0.8
}

uniroot(cumulative.eqn, lower = 1, upper = 1000)
## $root
## [1] 21.80224
## 
## $f.root
## [1] -5.747051e-08
## 
## $iter
## [1] 8
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.103516e-05

This gives us \(\beta = 21.8\) and \(\alpha = 0.75 \times 21.8 = 16.35\).
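As a sanity check, the fitted prior should reproduce both elicited summaries (up to the solver's tolerance):

```r
# Recover alpha from the root found by uniroot
beta  <- 21.80224
alpha <- 0.75 * beta          # 16.35168

alpha / beta                  # prior mean: exactly 0.75
pgamma(0.9, alpha, beta)      # cumulative probability: approximately 0.8
```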

3.2.2 Betting with Histograms

The difficulty with the summary-matching method is that it requires experts to describe probabilities about the prior parameter, which is an abstract and difficult task. Instead of asking them to describe summaries, we can ask them to draw out the prior distribution, using a betting framework.

In the prior weights (sometimes called roulette) method, we give the experts a set of intervals for the parameter of interest and a fixed number of coins. We then ask the experts to place the coins in the intervals according to how likely they think the parameter is to lie in each interval, effectively betting on the value the parameter will take. For example, they might place \(n_1\) coins in the interval \(\theta \in [a, b)\), \(n_2\) in \(\theta \in [b, c)\) and \(n_3\) in \(\theta \in [c, d]\). From this we can construct our prior density.
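A sketch of turning coin counts into a piecewise-constant prior density: normalise the counts to interval probabilities, then divide by the interval widths so the histogram integrates to one. The breakpoints and counts below are illustrative placeholders.

```r
# Convert coin counts over intervals into piecewise-constant density heights
coins.to.density <- function(breaks, coins){
  probs <- coins / sum(coins)   # probability assigned to each interval
  probs / diff(breaks)          # density height on each interval
}

breaks  <- c(0, 0.2, 0.4, 0.6, 0.8, 1)  # interval endpoints (illustrative)
coins   <- c(3, 7, 5, 3, 2)             # coins placed in each interval (illustrative)
heights <- coins.to.density(breaks, coins)

sum(heights * diff(breaks))             # the density integrates to 1
```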

Example 3.5 Suppose we are interested in the probability that a voter votes for the yellow party. Each voter’s choice can be modelled by \(X\sim\hbox{Bernoulli}(p)\). We consult a political expert about their experience with focus groups and past elections, and ask them to express their prior beliefs about \(p\) using the following table and 20 coins.

\(p\)         Coins
[0, 0.2)      3
[0.2, 0.4)    7
[0.4, 0.6)    5
[0.6, 0.8)    3
[0.8, 1]      2

We can use the website http://optics.eee.nottingham.ac.uk/match/uncertainty.php# to fit a distribution to this table. It proposes a \(p \sim \Gamma(3.10, 6.89)\) distribution as the best fit. We could use this as the prior distribution, although it is not conjugate and places weight on \(p > 1\). Another option is a Beta\((1.64, 2.16)\) distribution, which is restricted to \(0 \le p \le 1\).
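As a rough check of the Beta option (this comparison is our own, not part of the Match output), we can compare the probability the fitted Beta(1.64, 2.16) distribution assigns to each interval against the expert's coin proportions:

```r
breaks   <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
coins    <- c(3, 7, 5, 3, 2)

fitted   <- diff(pbeta(breaks, 1.64, 2.16))  # Beta probability per interval
elicited <- coins / sum(coins)               # coin proportions

round(cbind(fitted, elicited), 3)
```

The closer the two columns agree, the better the fitted prior reflects the expert's stated beliefs.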

3.2.3 Prior Intervals

The last method we will look at is the bisection method. In this method, we ask the experts to propose four intervals, into each of which the parameter is equally likely to fall, i.e. \(p(\theta \in [a, b)) = p(\theta \in [c, d)) = p(\theta \in [e, f))= p(\theta \in [g, h)) = 0.25\). From these intervals, we develop a prior distribution that fits them.
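One way to turn elicited quartiles into a parametric prior is least squares on the CDF: choose parameters so that the distribution's CDF is as close as possible to 0.25, 0.5 and 0.75 at the interval boundaries. A sketch for a Gamma prior (the boundary values here are illustrative, and the log-scale parameterisation simply keeps the shape and rate positive):

```r
# Elicited quartile boundaries: P(lambda < q) should be 0.25, 0.5, 0.75
q      <- c(5, 13, 20)       # illustrative boundaries
target <- c(0.25, 0.5, 0.75)

# Sum of squared differences between the Gamma CDF and the targets
quartile.loss <- function(par){
  shape <- exp(par[1])
  rate  <- exp(par[2])
  sum((pgamma(q, shape, rate) - target)^2)
}

fit <- optim(c(0, 0), quartile.loss)
exp(fit$par)                 # fitted shape and rate
```

With two parameters and three constraints the fit is usually approximate, so it is worth reporting the achieved CDF values back to the expert.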

Example 3.6 The police are interested in estimating the number of matching features between a fingerprint from a crime scene and a fingerprint from a suspect in the police station. The number of matching features is \(X \sim \hbox{Po}(\lambda)\). We speak to an experienced fingerprint analyst. She advises us that about a quarter of the time she would expect to see between 0 and 4 matches, and that this is when it is unlikely that the suspect is the criminal. She says that in some cases it is very clear that the suspect is the criminal, as she would expect to see 20 to 30 matches. The rest of the time she sees some matches, but the fingerprint collected at the crime scene is of poor quality, so she may see 5 to 19 matches. She agrees that the matches are uniform across this range. So our four intervals are [0, 4], [5, 12], [13, 19], [20, 30]. Using the Match software, we get a Uniform[0, 30] distribution.