In this first of a two part special guest blog, my colleague Paul McKeigue will explain how the formal logic of probability theory can be used to evaluate the evidence for alternative explanations of an event like the Khan Sheikhoun chemical incident in April this year. As an epidemiologist and genetic statistician, Paul is expert in this approach.

Paul picks up on the recent debate between George Monbiot and myself (here and here) observing how we were somewhat at cross-purposes. George was insisting that I offer a competing explanation to the ‘Assad did it’ story, but I declined to speculate, having no independent way of knowing. Because George believed a “mountain” of evidence supported his belief, he found it vexing that I should question it without venturing a specific alternative explanation.

Paul points the way forward by arguing that the logic of probability theory implies that you cannot evaluate the evidence for or against a single hypothesis, but only the evidence favouring one hypothesis over another. He shows, with simple examples, how an observation that is consistent with a hypothesis does not necessarily support that hypothesis against an alternative; in fact, an observation that is highly unlikely under one hypothesis may still support that hypothesis if it is even more unlikely under an alternative.

In this framework, then, evidence for a claim that the Syrian government carried out a chemical attack in Khan Sheikhoun cannot be evaluated except by comparison with an alternative explanation. The problem for anyone who formulates such an alternative explanation is that, in the current climate, they are likely to be denounced as a “conspiracy theorist”. Paul shows, however, that you cannot evaluate evidence without envisaging what you would expect to observe if each of the alternative hypotheses were true. This inevitably requires you to ‘speculate’: it doesn’t mean that you endorse any of the alternative hypotheses.

In Part 1 posted here below, Paul explains the approach and demonstrates how it can be applied to evaluate the evidence for alternative explanations of the alleged chemical attack in Ghouta in 2013. *We welcome discussions of the approach in the comments. In Part 2, he will examine the evidence for alternative explanations of the Khan Sheikhoun incident.*

**Paul McKeigue**


## Using probability calculus to evaluate evidence for alternative hypotheses, including deception operations

In this post I will try to show how the formal framework of hypothesis testing based on probability theory is able to separate subjective beliefs about the plausibility of alternative explanations, on which we can agree to differ, from the evaluation of the weight of evidence supporting each of these alternative explanations, on which it should be easier to reach a consensus. We can then begin to apply this to the Syrian conflict.

Although the mathematical basis for using evidence from observations to update the probability of a hypothesis was first set out by the 18th century clergyman Thomas Bayes, the first practical use of this framework was for cryptanalysis by Alan Turing at Bletchley Park. This was later elaborated by his assistant Jack Good as a general approach to evaluating evidence and testing hypotheses. This approach to testing hypotheses has been standard practice in genetics since the 1950s, and has spread into many other fields of scientific research, especially astronomy. It underlies the revolution in machine learning and artificial intelligence that is beginning to transform our lives. Although the practical usefulness of the Bayes-Turing framework is not in question, this does not prove that it is the only logical way to evaluate evidence. The basis for this was provided by the physicist Richard Cox, who showed that degrees of belief must obey the mathematical rules of probability theory if they satisfy simple rules of logical consistency. Another physicist, Edwin Jaynes, drew together the approach developed by Turing and Good with Cox’s proof to develop a philosophical framework for using Bayesian inference to evaluate uncertain propositions. In this framework, Bayesian inference is just an extension of the ordinary rules of logic to manipulating uncertain propositions; any other way of evaluating evidence would violate rules of logical consistency. There are too many names – not limited to Bayes, Turing, Good, Cox and Jaynes – attached to the development of this framework to name it after all of them, so I’ll follow Jaynes and just call it **probability calculus**.

The objective of this post and the one that follows is to show you, the reader, how to evaluate evidence for yourself using simple back-of-the-envelope calculations based on probability calculus.

Some fundamental principles of probability calculus can be expressed without using mathematical language:-

- For two alternative hypotheses, H_{1} and H_{2}, the evidence favouring H_{1} over H_{2} is evaluated by comparing how well H_{1} would have predicted the observations with how well H_{2} would have predicted the observations.
- We cannot evaluate the evidence for or against a single hypothesis, only the evidence favouring one hypothesis over another.
- The evidence favouring one hypothesis over another can be calculated without having to specify your prior degree of belief in which of these two hypotheses is correct. Two people may have different priors, but their calculations of the strength of evidence favouring one hypothesis over another should agree if they agree on what they would expect to observe if each of these hypotheses were true.

For a light-hearted tutorial in how to apply these principles in everyday life, try this exercise.

To take the argument further, I need to explain some simple maths. If you already have a basic grounding in Bayesian inference, you can skip to the next section. Otherwise, you can work through the brief tutorial below, or try an online tutorial like this one.

Suppose you are comparing two alternative hypotheses, H_{1} and H_{2}. Before you have seen the evidence, your degree of belief in which of these alternatives is correct can be represented as your **prior odds**. For instance if you believe H_{1} and H_{2} are equally probable, your prior odds are 1 to 1, or even odds in everyday language. After you have seen the evidence, your prior odds are updated to become your **posterior odds**.

Bayes’ theorem specifies how evidence updates prior odds to posterior odds. The theorem can be stated in the form:-

prior odds × likelihood ratio = posterior odds

- Your prior odds encode your degree of belief favouring H_{1} over H_{2}, before you have seen the observations. Priors are subjective: one person may assign prior odds of 100 to 1 favouring H_{1} over H_{2}, while another may believe that both hypotheses are equally probable.
- The **likelihood** of a hypothesis is the conditional probability of the observations given that hypothesis. To evaluate it, we have to envisage what would be expected to happen if the hypothesis were true. We can think of the likelihood as measuring how well the hypothesis can predict the observation.
- Likelihoods of hypotheses measure the relative support for those hypotheses; they are not the probabilities of those hypotheses.
- The ratio of the likelihood of H_{1} to the likelihood of H_{2} is called the **Bayes factor** or simply the **likelihood ratio**. In recognition of his mentor, Good called it the “Bayes-Turing factor”.
- It is only through the likelihood ratio that your prior odds are modified by evidence to posterior odds. All the evidence on whether the observations support H_{1} or H_{2} is contained in the likelihood ratio: this is the **likelihood principle**.
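As a minimal sketch, the odds form of Bayes’ theorem can be written as a one-line function; the numbers below are purely illustrative, not taken from any real case:

```python
# Odds form of Bayes' theorem: posterior odds = prior odds x likelihood ratio.
def posterior_odds(prior_odds, likelihood_h1, likelihood_h2):
    """Update prior odds favouring H1 over H2 using the likelihood ratio."""
    return prior_odds * (likelihood_h1 / likelihood_h2)

# Two people with different priors apply the same likelihood ratio of 4:
# prior odds of 100 to 1 become 400 to 1; even odds become 4 to 1.
print(posterior_odds(100, 0.8, 0.2))  # 400.0
print(posterior_odds(1, 0.8, 0.2))    # 4.0
```

Note that the function never needs to know whose prior is being updated: the likelihood ratio is the same for everyone who agrees on the conditional probabilities.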

### Examples

- You have two alternative hypotheses about a coin that is to be tossed: H_{1} that the coin is fair, and H_{2} that the coin is two-headed. In most situations your prior belief would be that H_{1} is far more probable than H_{2}. Given the observation that the coin comes up heads when tossed once, the likelihood of a fair coin is 0.5 and the likelihood of a two-headed coin is 1. The likelihood ratio favouring a two-headed coin over a fair coin is 2. This won’t change your prior odds much. If, after the first ten tosses, the coin has come up heads every time, the likelihood ratio is 2^{10}=1024, perhaps enough for you to suspect that someone has got hold of a two-headed coin.
- Hypothesis H_{1} is that all crows are black (as in eastern Scotland), and hypothesis H_{2} is that only 1 in 8 crows are black (as in Ireland where most crows are grey). The first crow you observe is black. Given this single observation, the likelihood of H_{1} is 1, and the likelihood of H_{2} is 1/8. The likelihood ratio favouring H_{1} over H_{2} is 8. So if your prior odds were 2 to 1 in favour of H_{1}, your posterior odds, after this first observation, will be 16 to 1. This posterior will be your prior when you next observe a crow. If this next crow is also black, the likelihood ratio contributed by this observation is again 8, and your posterior odds favouring H_{1} over H_{2} will be updated to (16×8=128) to 1.
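The crow example can be run as a short sketch, assuming (as in the text) that successive observations are independent, so each black crow multiplies the odds by the same likelihood ratio:

```python
# Sequential updating for the crow example.
# H1: all crows are black, so P(black crow | H1) = 1.
# H2: only 1 in 8 crows are black, so P(black crow | H2) = 1/8.
p_black_h1 = 1.0
p_black_h2 = 1 / 8

odds = 2.0  # prior odds of 2 to 1 favouring H1
for _ in range(2):  # observe two black crows in succession
    odds *= p_black_h1 / p_black_h2  # each observation contributes a ratio of 8

print(odds)  # 128.0, matching the (16x8=128) to 1 in the text
```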

Bayes’ theorem can be expressed in an alternative form by taking logarithms. If your maths course didn’t cover logarithms, don’t be put off. To keep things simple, we’ll work in logarithms to base 2. The logarithm of a number is defined as the power of 2 that equals the number. So for instance the logarithm of 8 is 3 (2 to the power of 3 equals 8). The logarithm of 1/8 is minus 3, and the logarithm of 1 is zero. Taking logarithms replaces multiplication and division by addition and subtraction, which is why if you went through secondary school before the arrival of cheap electronic calculators you were taught to use logarithms for calculations. However logarithms are not just for calculations but fundamental to using maths to solve problems in the real world, especially those that have to do with information. I was dismayed to find that here in Scotland, where logarithms were invented, they have disappeared from the national curriculum for maths up to year 5 of secondary school.

The logarithm of the likelihood ratio is called the **weight of evidence** favouring H_{1} over H_{2}. As taking logarithms replaces multiplying by adding, we can rewrite Bayes’ theorem as

prior weight + weight of evidence = posterior weight

where the prior weight and posterior weight are respectively the logarithms of the prior odds and posterior odds. If we use logarithms to base 2, the units of measurement of weight are called **bits** (binary digits).

So we can rewrite the crow example (prior odds 2 to 1, likelihood ratio 8, posterior odds 2×8=16) as

prior weight = 1 bit (2^{1}=2)

weight of evidence = 3 bits (likelihood ratio 8 = 2^{3})

posterior weight = 1 + 3 = 4 bits

One advantage of working with logarithms is that it gives us an intuitive feel for the accumulation of evidence: weights of evidence from independent observations can be added, just like physical weights. Thus in the coin-tossing example above, after one toss of the coin has come up heads the weight of evidence is one bit. After the first ten coin tosses have come up heads, the weight of evidence favouring a two-headed coin is 10 bits. As a rule of thumb, 1 bit of evidence can be interpreted as a hint, 2 to 3 bits as weak evidence, 5 to 6 bits as modest evidence, and anything more than that as strong evidence.
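The additive bookkeeping in bits can be sketched as follows, using the coin-tossing numbers above:

```python
import math

# Weight of evidence in bits = log2 of the likelihood ratio.
# Weights from independent observations add, like physical weights.
def weight_bits(likelihood_h1, likelihood_h2):
    return math.log2(likelihood_h1 / likelihood_h2)

# Each head contributes log2(1 / 0.5) = 1 bit favouring a two-headed coin (H1)
# over a fair coin (H2); ten heads in a row contribute 10 bits in total.
total_weight = sum(weight_bits(1.0, 0.5) for _ in range(10))
print(total_weight)  # 10.0
```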

## Hempel’s paradox

Within the framework of probability calculus we can resolve a problem first stated by the German philosopher Carl Gustav Hempel. What he called a paradox can be stated in the following form:

*An observation that is consistent with a hypothesis is not necessarily evidence in favour of that hypothesis.*

Good showed that this is not a paradox, but a corollary of Bayes’ theorem. To explain this, he constructed a simple example (I have changed the numbers to make it easier to work in logarithms to base 2). Suppose there are two Scottish islands denoted A and B. On island A, there are 2^{15} birds of which 2^{6} are crows and all these crows are black. On island B, there are 2^{15} birds of which 2^{12} are crows and 2^{9} of these crows (that is, one eighth of all crows) are black. You wake up on one of these islands and the first bird that you observe is a black crow. Is this evidence that you are on island A, where all crows are black?

You can’t do inference without making assumptions. I’ll assume that on each island all birds, whatever their species or colour, have equal chance of being seen first. The likelihood of island A, given this observation, is 2^{-9}. The likelihood of island B is 2^{-6} (2^{9} black crows out of 2^{15} birds). The weight of evidence favouring island B over island A is [−6−(−9)]=3 bits. So the observation of a black crow is weak evidence against the hypothesis that you are on island A where all crows are black. So, when two hypotheses are compared, *an observation that is consistent with a hypothesis can nevertheless be evidence against that hypothesis*.
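Using the island populations stated above, and the assumption that every bird is equally likely to be seen first, the arithmetic can be checked directly:

```python
import math

# Island A: 2^15 birds, of which 2^6 are crows, all black.
# Island B: 2^15 birds, of which 2^12 are crows and 2^9 of those are black.
p_black_crow_given_A = 2**6 / 2**15  # probability the first bird seen is a black crow
p_black_crow_given_B = 2**9 / 2**15

# Weight of evidence favouring island B over island A, in bits.
weight = math.log2(p_black_crow_given_B / p_black_crow_given_A)
print(weight)
```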

The converse applies: an observation that is highly improbable given a hypothesis is not necessarily evidence against that hypothesis. As an example, we can evaluate the evidence for a hypothesis that most readers will consider an implausible conspiracy theory: that the Twin Towers of the World Trade Center were brought down not by the hijacked planes that crashed into them but by demolition charges placed in advance, with the objective of bringing about a “new Pearl Harbour” in the form of a catastrophic event that would provoke the US into asserting military dominance. We’ll call the two alternative hypotheses for the cause of the collapses – plane crashes, planned demolitions – H_{1} and H_{2} respectively. The proponents of this hypothesis attach great importance to the observation that a nearby smaller tower (Building 7), collapsed several hours after the Twin Towers for reasons that are not obvious to non-experts. I have no expertise in structural engineering, but I’m prepared to go along with their assessment that the collapse of a nearby smaller tower has low probability given H_{1}. However I also assess that the probability of this observation given H_{2} is equally low. If the planners’ objective in destroying the Twin Towers was to create a catastrophic event, why would they have planned to demolish a nearby smaller tower several hours later, with the risk of giving away the whole operation? For the sake of argument, I’ll put a value of 0.05 on both these likelihoods. Note that it doesn’t matter whether the observation is stated as “collapse of a nearby tower” for which the likelihoods of H_{1} and H_{2} are both 0.05, or as “collapse of Building 7” for which (if there were five such buildings all equally unlikely to collapse) the likelihoods of H_{1} and H_{2} would both be 0.01. For inference, all that matters is the ratio of the likelihoods of H_{1} and H_{2} given this observation. 
If this ratio is 1, the weight of evidence favouring H_{1} over H_{2} is zero.

The conditional probabilities in this example are my subjective judgements. I make no apology for this; the logic of probability calculus says that you can’t evaluate evidence without making these subjective judgements, that these subjective judgements must obey the rules of probability theory, and that any other way of evaluating evidence violates axioms of logical consistency. If your assessment of these conditional probabilities differs from mine, that’s not a problem as long as you can explain your assessments of these probabilities in a way that makes sense to others. The general point on which I think most readers will agree is that although the collapse of a nearby smaller tower would not have been predicted from H_{1}, it would not have been predicted from H_{2} either. The likelihood of a hypothesis given an observation measures how well the hypothesis would have predicted that observation.

We can see from this example that to evaluate the evidence favouring H_{1} over H_{2}, you have to assess, for each hypothesis in turn, what you would expect to observe if that hypothesis were true. Like a detective solving a murder, you have to “speculate”, for each possible suspect, how the crime would have been carried out if that individual were the perpetrator. This requirement is imposed by the logic of probability calculus: complying with it does not imply that you are a “conspiracy theorist”. The principle of evaluating how the data could have been generated under alternative hypotheses applies in many other fields: for instance, medical diagnosis, historical investigation, and intelligence analysis. A CIA manual on intelligence analysis sets out a procedure for ‘analysis of competing hypotheses’ which ‘demands that analysts explicitly identify all the reasonable alternative hypotheses, then array the evidence against each hypothesis – rather than evaluate the plausibility of each hypothesis one at a time.’ I am not trying to tell people who are expert in these professions that they don’t know how to evaluate evidence. However it can still be useful to work through the formal framework of probability calculus to identify when intuition is misleading. For instance, where two analysts evaluating the same observations disagree on the weight of evidence, working through the calculation will identify where their assumptions differ, and how the evaluation of evidence depends on these assumptions.

An interesting argument about the use of Bayesian evidence in court can be found in this judgement of the Appeal Court in 2010. In a murder trial, the forensic expert had given evidence that there was “moderate scientific support” for a match of the defendant’s shoes to the shoe marks at the crime scene, but had not disclosed that this opinion was based on calculating a likelihood ratio. The judges held that where likelihood ratios would have to be calculated from statistical data that were uncertain and incomplete, such calculations should not be used by experts to form the opinions that they presented to the court. However the logic of probability calculus implies that you cannot evaluate the strength of evidence except as a likelihood ratio. Calculating this ratio makes explicit the assumptions that are used to assess the strength of evidence. In this case, the expert had used national data on shoe sales to assign the likelihood that the foot marks were made by someone else, given that the foot marks were made by size 11 trainers. The conditional probability of size 11 trainers, given that they were made by someone else, should have been based on the frequency of size 11 trainers among people present at similar crime scenes. It was because the calculations were made available at the appeal that the judges were able to criticize the assumptions on which they were based and to overturn the conviction.

We next consider an example of Hempel’s paradox from the Syrian conflict.

## Rockets used in the alleged chemical attack in Ghouta in 2013: evidence for or against Syrian government responsibility?

To explain the alleged chemical attack in Ghouta in 2013, two alternative hypotheses have been proposed, which we’ll denote H_{1} and H_{2}:

- H_{1} states that a chemical attack was carried out by the Syrian military, under orders from President Assad. The proponents of this hypothesis include the US, UK and French governments.
- H_{2} states that a false-flag chemical attack was carried out by the Syrian opposition, with the objective of bringing about a US-led attack on the Syrian armed forces. A leading proponent of this hypothesis was the blogger “sasa wawa”, who set up a crowd-sourced investigation of the Ghouta incident. The evidence generated during this investigation was later set out in the framework of probability calculus by the Rootclaim project, founded by Saar Wilf, an Israeli entrepreneur (and noted international poker player) with a background in the signals intelligence agency Unit 8200. I think we can tentatively identify “sasa wawa”, who seemed “to have unlimited time and energy and to be some sort of polymath”, as Wilf.

Other hypotheses are possible: for instance we can define hypothesis H_{3} that an unauthorized chemical attack was carried out by a rogue element in the Syrian military, and hypothesis H_{4} that there was no chemical attack but a massacre of captives, in which rockets and sarin were used to create a false trail of evidence for a chemical attack. But for now we’ll just consider H_{1} and H_{2}.

You may have strong prior beliefs about the plausibility of these two hypotheses: for instance you may believe that H_{1} is highly implausible because the Syrian government had no motive to carry out such an attack when OPCW inspectors had just arrived, or you may take the view that H_{2} is an absurd conspiracy theory requiring us to believe that the opposition carried out a large-scale chemical attack on themselves. Whatever your prior beliefs, to evaluate the evidence you must be prepared to envisage for each of these hypotheses what would be expected to happen if that hypothesis were correct, in order to compute the likelihood of that hypothesis. This requires for H_{1} that you put yourself in the shoes of a Syrian general ordered to carry out a chemical attack, or for H_{2} that you put yourself in the shoes of an opposition commander planning a false-flag chemical attack that will implicate the regime.

A key observation is credited to Eliot Higgins, who showed that the rockets examined by the OPCW inspectors at the impact sites were a close match to a type of rocket that the Syrian army had been using as an improvised short-range siege weapon. This “Volcano” rocket consisted of a standard artillery rocket with a 60-litre tank welded over the nose, giving it a heavier payload but a very short range (about 2 km).

In Higgins’s interpretation, which has been widely disseminated, this observation is evidence for hypothesis H_{1}. Let’s apply the framework of probability calculus to compute the weight of evidence favouring H_{1} over H_{2} given this observation.

First, we compute the likelihood of H_{1}. The Syrian military had large stocks of munitions specifically designed to deliver nerve agent at medium to long range, including missiles and air-delivered bombs together with equipment for safely filling them with sarin. Given that they had been ordered to carry out a chemical attack, I assess the probability that they would have used these purpose-designed munitions as at least 0.9. The probability that they would have used an improvised short-range siege weapon, which to reach the target would have had to be fired from the front line or from within opposition-controlled territory, is rather low: I assess this as about 0.05. This is the likelihood of H_{1} given the observation.

Second, we compute the likelihood of H_{2}. Given that under this hypothesis the objective of the attack was to implicate the Syrian government, the opposition had to be able to show munitions at the sites of sarin release that could plausibly be attributed to the Syrian military. They had two possible ways to do this: (1) to fake an air strike, with fragments of air-delivered munitions matching something in the Syrian arsenal; or (2) to use rockets or artillery shells matching something in the Syrian military arsenal. Volcano rockets, either captured from Syrian army stocks or copied, would have been ideal for this. With no other reason to choose between options (1) and (2), we assign equal probabilities to them under H_{2}. The likelihood of H_{2} given the observation is therefore 0.5.

The likelihood ratio favouring H_{2} over H_{1} is 10, corresponding to a weight of evidence of 3.3 bits. Your assessment of the conditional probabilities may vary from mine, but I think the general point is clear: from hypothesis H_{1} we would not have predicted that the Syrian military would have chosen to use an improvised chemical munition rather than their stocks of purpose-designed chemical munitions, but from hypothesis H_{2} we would have expected the opposition to use any munition available to them that would implicate the Syrian army. So this is a classic example of Hempel’s paradox: an observation consistent with hypothesis H_{1} does not necessarily support H_{1}, but instead contributes (under a plausible specification of the conditional probabilities) weak evidence favouring the alternative hypothesis H_{2}.

This also shows how, by using the framework of probability calculus we are able to separate prior beliefs from evaluation of the weight of evidence. Your evaluation of the weight of evidence depends only on the ratios of the conditional probabilities that you specify for the observation given H_{1} or given H_{2}; it does not depend on your prior odds.
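The rocket calculation can be reproduced in a few lines; the two conditional probabilities are the subjective assessments given above, not established facts:

```python
import math

p_volcano_given_h1 = 0.05  # Syrian military chooses an improvised siege rocket
p_volcano_given_h2 = 0.5   # opposition chooses a ground-fired Volcano-type munition

likelihood_ratio = p_volcano_given_h2 / p_volcano_given_h1
weight = math.log2(likelihood_ratio)
print(likelihood_ratio)  # 10.0
print(round(weight, 1))  # 3.3 bits favouring H2 over H1
```

Anyone who disagrees with the conclusion can substitute their own conditional probabilities and rerun the calculation; the prior odds never enter into it.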

## Comparison with Rootclaim’s evaluation of the weight of evidence contributed by the rockets

As discussed above, where two analysts disagree on evaluating the weight of evidence contributed by the same observations, using probability calculus allows us to identify exactly where their assumptions differ. For the weight of evidence favouring H_{2} over H_{1} given the observation of Volcano rockets, I assigned a value of 3.3 bits. Before making this assessment I had not looked at Rootclaim’s, which assigns a value of minus 0.5 bits to the weight of evidence favouring H_{2} over H_{1}.

Let’s see how Rootclaim’s assumptions differ from mine. For the probability of observing Volcano rockets given H_{1}, Rootclaim assigns the same value (0.05) as I have. However Rootclaim assigns a value of only 0.036 to the probability of observing Volcano rockets under H_{2}. Rootclaim obtains this value by multiplying together a probability of 0.4 that an opposition group would capture Volcano rockets, a probability of 0.3 that another opposition group with access to sarin would find this group, and a probability of 0.3 that these two groups would figure out how to fill the munition with sarin.
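Rootclaim’s figures, as reported above, can be checked the same way (these are Rootclaim’s probability assignments, reproduced here only for comparison):

```python
import math

p_capture = 0.4   # an opposition group captures Volcano rockets
p_contact = 0.3   # a group with access to sarin finds that group
p_filling = 0.3   # the two groups figure out how to fill the munition with sarin

p_volcano_given_h2 = p_capture * p_contact * p_filling  # ~0.036
p_volcano_given_h1 = 0.05

weight = math.log2(p_volcano_given_h2 / p_volcano_given_h1)
print(round(weight, 1))  # -0.5 bits favouring H2 over H1, i.e. slightly against H2
```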

I think Rootclaim’s assignment of the conditional probability of observing Volcano rockets given H_{2} does not correctly condition on what is implied by H_{2}. Under hypothesis H_{2}, the purpose of the operation is to implicate the Syrian government. The conditional probability of observing Volcano rockets given H_{2} is the probability that these rockets would be found given that the opposition plans to release sarin and to leave a false trail of evidence implicating the Syrian military. To release sarin the opposition has to figure out some way to fill munitions (rockets or improvised devices) with it. To implicate the Syrian military, the opposition has to use a munition (captured or copied) that matches something in the Syrian military’s arsenal. The only choice for the opposition, given hypothesis H_{2}, is whether to use a munition fired from the ground, like the Volcano rocket, or to use remnants of an air-dropped bomb with an improvised chemical-releasing device. With nothing to choose between these two options given H_{2}, I have assigned them equal probabilities.

Without these calculations being stated explicitly, there would have been no way for you, the reader, to evaluate the difference between my assessment that the rockets contribute weak evidence in favour of hypothesis H_{2} and Rootclaim’s assessment that the rockets contribute practically no evidence favouring either hypothesis. By working through the formal framework of probability calculus, you can see that this difference arises because my assignment of the likelihood of H_{2} is based on assumptions about the purpose of the deception operation that is implied by hypothesis H_{2}.

This example illustrates a more general principle: to evaluate the likelihood of a hypothesis that implies a deception operation, we must condition on what that deception would entail.

## Evidence contributed by the non-occurrence of an expected event

To evaluate all relevant evidence, we must include the non-occurrence of events that would have been expected under at least one of the alternative hypotheses. This is the principle set out in the case of “the curious incident of the dog in the night-time”: Holmes noted that the observation that the dog did not bark had low probability given the hypothesis of an unrecognized intruder, but high probability given the hypothesis that the horse was taken by someone that the dog knew.

From the alleged chemical attack in Ghouta, a “dog did not bark” observation is that despite the mass of stills and video clips uploaded that showed victims in hospitals or morgues, no images have appeared that showed victims being rescued in their homes or bodies being recovered from affected homes. The only images from Ghouta purporting to show victims being found where they had collapsed were obviously fraudulent, showing nine alleged victims of chemical attack dead in the stairwell of an unfinished building named the “Zamalka Ghost House” by researchers.

As an exercise, you can assess the likelihoods of each of the following hypotheses, given the observation that no images showing rescue of victims in their homes, or recovery of bodies of people who had died in their homes were made available. To put this observation in context, this page lists more than 150 original videos uploaded, most showing victims in hospitals or morgues, attributed to 18 different opposition media operations.

- H_{1}: a chemical attack was carried out by the Syrian military, authorized by the government
- H_{2}: a false-flag chemical attack was carried out by the Syrian opposition to implicate the government
- H_{3}: an unauthorized chemical attack was carried out by a rogue element in the Syrian military
- H_{4}: there was no chemical attack but a managed massacre of captives, with rockets and sarin used to create a trail of forensic evidence that would implicate the Syrian government in a chemical attack.

Given each of these hypotheses in turn, what do you assess to be the conditional probability that none of the uploaded videos would show the rescue of victims in their homes or the recovery of bodies of people who had died in their homes?

*In the next post we shall explore how to apply the formal framework of probability calculus to evaluate the weight of evidence for alternative explanations of the alleged chemical attack in Khan Sheikhoun.*

Great post. Here is a similar idea from an unpublished post a few years back.

Reverend Bayes and chemical weapons in Syria.

Approximately 250 years ago the Reverend Thomas Bayes discovered a mathematical formula that revolutionized statistics. Surprisingly, Bayes’ theorem is relevant to evaluating the evidence regarding who is responsible for using chemical weapons in Syria.

The critical insight of Bayes is that when assessing the probability that a hypothesis is true it is not enough to consider the evidence at hand. In addition, you need to consider the prior probability that the hypothesis is true. In many contexts we appreciate this point. For instance, in a criminal trial, innocence or guilt is determined not only on the basis of physical and circumstantial evidence; establishing a motive is also key. Motives are related to priors because people tend to act in accordance with their interests: the probability that a person committed a crime is higher if he or she had something to gain. In court, establishing a motive is not some ancillary point, but rather, is often a key part of making a case. The prior is considered significant here.

But we do not always appreciate the importance of prior probabilities. A classic example often included in textbooks concerns the probability of having a disease given a positive test for the disease. For example, imagine a test for breast cancer that is 99% accurate, meaning that it gives the correct result 99 times out of 100, whether or not the person tested has the disease. Nevertheless, if a young woman of 20 with no identified risk of breast cancer tests positive, the likelihood of cancer is very low. The reason is that the evidence needs to be combined with prior probabilities. For example, if the prior probability of breast cancer in 20-year-old women is 0.1%, then this woman only has about a 10% chance of having breast cancer despite the positive test. The key insight from Bayes is that even strong evidence in the form of a 99% accurate test can often be wrong. Priors matter all the more when the evidence is less compelling than a 99% test. Our failure to fully consider the relevance of priors in many contexts is so common it has a name: “base rate neglect”.
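The screening arithmetic can be sketched as follows, assuming “99% accurate” means 99% sensitivity with a 1% false-positive rate:

```python
prior = 0.001           # P(disease): 0.1% prior in 20-year-old women
sensitivity = 0.99      # P(test positive | disease)
false_positive = 0.01   # P(test positive | no disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_positive = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / p_positive
print(round(posterior, 2))  # 0.09: about a 10% chance despite the "99% accurate" test
```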

How does this relate to Syria? The evidence regarding who launched the chemical attack in Damascus is uncertain. The US and other Western governments claim the evidence is compelling, but have released little of it to the public [1]. A number of members of the US Congress have seen the classified evidence and say they are even less convinced after being briefed [2]. Multiple US officials have told the Associated Press that the evidence against Assad is “no slam dunk” [3]. Russia claims to have evidence that the rebels used the chemical weapons [4]. It would not be unreasonable to have some doubt about the strength of the evidence collected thus far. US intelligence has been wrong before.

At the same time, it is clear who had a strategic motivation to use chemical weapons. The Syrian army was winning the civil war at the time the chemical weapons were used in Damascus, and earlier Obama had drawn a “red line”, threatening US intervention if government forces used chemical weapons. Of course this does not rule out the hypothesis that the Syrian government is responsible, but it is hard to see a political or military motivation for it to do so under these conditions. On the other hand, the rebels had everything to gain from Western intervention.

The claim that motivations are relevant to decision making is not rocket science – indeed, in many contexts it is “common sense” – but it is striking how little consideration has been given to this fundamental point when estimating the probabilities of who used the chemical weapons in Syria on August 21st, killing hundreds of people. Ignoring motives is irrational: it is a case of base rate neglect. Of course, compelling evidence could become available that makes it clear that individuals within the Syrian government are guilty. People sometimes do act in ways counter to their own interests. And if compelling evidence is presented that the Syrian government is responsible, it does not entail that intervention is the correct policy decision. Perhaps intervention will only make things worse. Whatever the truth of the matter, when assigning probabilities to events based on evidence that is uncertain, our answer should be informed by the lesson of Reverend Bayes: don’t ignore priors.

Jeffrey Bowers is a professor in the School of Experimental Psychology, University of Bristol. http://www.bristol.ac.uk/expsych/people/jeffrey-s-bowers/index.html. His work focuses on how the brain supports language and memory, and he has recently criticised the hypothesis within psychology and neuroscience that the brain computes according to Bayesian principles. Follow Jeff on Twitter: @jeffrey_bowers

[1] http://www.nytimes.com/2013/09/07/opinion/on-syria-vote-trust-but-verify.html?hp&_r=1&

[2] http://www.independent.co.uk/voices/comment/gas-missiles-were-not-sold-to-syria-8831792.html

[3] http://www.huffingtonpost.co.uk/2013/08/29/syria-us-intelligence_n_3835129.html

[4] http://www.washingtonsblog.com/2013/09/classified-intelligence-doesnt-prove-anything.html

Thanks, Jeffrey, this is really interesting – and would have been good to publish! (Regarding the question of motive for the chemical attacks, it has been striking how the professional debunkers have tried to divert attention from it.)

They be like “motive? Who cares? There’s hexamine! How do you explain the hexamine? it’s in the sarin used in all those other attacks Syria had no motive for, and why are you supporting Assad? Is Putin paying you? Hexamine!”

Great and ambitious work here. I still haven’t read this straight through, and certainly didn’t read the back links to understand probability calculus. It’s noted as an extension of everyday logic, which is more where I operate usually, minus the extension. Putting numbers on things to turn it into math feels arbitrary. Like do I feel odds are 1:1, or 1.25:1? 100 to 1 or 94 to 1? As soon as you start multiplying, small differences can add up to distorted results. How do you set likelihood for different scenarios independently of your biases? Plus I have little patience for math, and have never used log-o-rhythms, or even been clear on what they are. (I know it’s not lumber-related, but something in that math area)

But a lot of work went into this, and I see how these issues are addressed and I can see the utility of this method. It may not overcome differences in prior bias, but it can help locate just where they are, so some effort could be made. And it shows how non-rigorous the reasoning used to #BlameAssad has been, in comparison. It’s all priors and unchanged posteriors with no connection to actual facts or likelihoods – a meaningless self-perpetuating cycle.

“…any other way of evaluating evidence would violate rules of logical consistency.” I hope that doesn’t mean all less formal work to this point is useless. No, it can’t mean quite that. But with this guide on hand, I feel it could be formalized into something more …scientific? Looking forward to part 2 to see how that might look.

I know exactly what you mean! It has been an eye-opener for me, too. (I had to go through some rather humiliatingly remedial maths to get half way to grips with the basics of this. I shall also confess that even instruction videos like “how to understand Bayes if you’re five” were a bit challenging. A video that I did find good at conveying its intuitive feel, without the technicalities, is this one: https://www.youtube.com/watch?v=BrK7X_XlGB8 )

Part 2 will definitely be interesting in a substantial way!

CL

I emphasize that I’m not trying to tell people who evaluate evidence less formally that they don’t know what they’re doing. An experienced detective solving a crime will consider each hypothesis in turn and evaluate how well that hypothesis could have predicted the observations. Like Molière’s bourgeois gentleman who was surprised to learn that he had been speaking prose all his life, people who are good at evaluating evidence intuitively use Bayesian inference. For your evaluations of the strength of evidence to be logically consistent, your assessments of how well each hypothesis could have predicted the data must map to a set of values that obey the rules of probability theory. If these values depend on subjective judgements about what would be expected to happen under each hypothesis, then you are implicitly making those subjective judgements whenever you make a statement about the strength of evidence.

"Putting numbers on things" by setting out a formal calculation of the weight of evidence helps us to see where different analysts using the same observations disagree, and to identify situations where we cannot trust our intuition to deliver a logically consistent judgement. It also, crucially, penalizes hypotheses that explain everything but predict nothing. This is the mathematical basis of Occam’s razor which tells us to accept the simplest hypothesis that fits the data. For a simple explanation of this, see this link (just look at the picture on the first page, and then skip to the example).

Where the assessment of the weight of evidence depends critically on a subjective judgement of conditional probabilities on which two analysts differ, this can guide us to seek more information that will help to set these probabilities. For instance, in assigning the likelihood of a false flag operation given the observation of Volcano rockets in Ghouta, I’ve assumed that it’s feasible for the opposition to have captured or copied these rockets. Another analyst might disagree. In this situation we might be able to reach a consensus by searching for reports that the opposition had captured sites where such munitions were held, or for expert opinions on how difficult it would have been for the opposition to copy the design.

I don’t think it’s so difficult for observers with different priors to reach consensus on the likelihoods, which are calculated for each hypothesis as the conditional probability of the observations given that hypothesis. In the examples of fallacious evaluation of evidence that I’ve discussed in this post, the problem was not that people disagreed on the likelihood ratios but rather that the likelihood of the alternative hypothesis given the observation was never evaluated. You can’t exclude the explanation you haven’t considered. For instance Higgins did not evaluate the likelihood of a false flag operation given the observation of Volcano rockets, which should have been evaluated as the conditional probability that the opposition would use this munition given that their objective was to implicate the Syrian government.
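A likelihood ratio of this kind can be written down in a few lines. The sketch below uses the deciban scale favoured by Turing and Good; the two conditional probabilities are purely illustrative placeholders, not assessments taken from the post:

```python
from math import log10

# Weight of evidence (in decibans) favouring hypothesis H1 over H2,
# given a single observation. Both conditional probabilities are
# subjective judgements; the values used here are illustrative only.
def weight_of_evidence(p_obs_given_h1, p_obs_given_h2):
    return 10 * log10(p_obs_given_h1 / p_obs_given_h2)

# If an observation were judged to have probability 0.5 under H1 but
# only 0.1 under H2, it would carry about 7 decibans favouring H1.
print(round(weight_of_evidence(0.5, 0.1), 1))  # 7.0
```

The point the formula makes explicit is the one argued above: the weight of evidence is undefined unless *both* conditional probabilities are assessed – there is no way to score an observation against one hypothesis alone.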


This is excellent and is just the kind of walkthrough that any journalist should be able to follow.

For me the real ‘proof’ of KS fraud is the lack of ANY photographic or video evidence available of the immediate aftermath of the alleged attack.

What are the chances that at least three ‘local journalists’ who videoed the conventional bombs and who attended the scene all chose not to take any pictures of the chaos they profess to have witnessed? Likewise the dozen or so White Helmets who attended, many of them saying they had time to return to base for protective equipment – are we to believe that each of these not-entirely-camera-shy individuals chose not to pick up a camera or even a mobile phone?

Actually we don’t have to believe that, as one of them, Anas al-Diab, has claimed to have been recording the events as they occurred – but then we have to ask why he has not shared any of these images at all.

http://www.aljazeera.com/indepth/features/2017/04/idlib-gruesome-170405115057834.html

The incentives for them to record and publish such pictures could not be more obvious, yet there are none whatsoever.

Likewise, what are the chances that a grieving father and husband would not be able to provide a single picture of him and his wife with their twins when such pictures would only serve to strengthen his case?

Yes, indeed. The method Paul outlines provides a sound basis for insisting that we bring all these crucial questions into the reckoning – in contrast to the professional debunkers who try to portray the odd thing they can obfuscate about as a clincher.

Anas al-Diab’s testimony is indeed highly relevant (and there are more such cases 🙂)

On Abdulhamid al-Youssef: no, why should he? There is no reason to doubt this part of his testimony. And he is part of a culture where women and wives are *not* shown:

@Qoppa, Thanks – I stand corrected, but this only goes to show the effectiveness of this method of analysis. Had I made my argument using this approach, I would have had to assign a probability to my judgement of the likelihood of a ‘non-existent wife’, which would therefore have been open to direct criticism and debate. Having considered your information, I would likely have reduced the odds considerably, though perhaps not dismissed the point completely (given the likely pressure from the global media for the kind of pictures they usually like to use).

I would worry that the usual suspects would tend to focus on the debates that we’re all familiar with (hexamine and all that), but it would make it very much harder for them to hide the other assumptions they are making.

Although I agree with the author on most points, I am quite sceptical about the method.

How can you calculate (the probability of) complex real-life events? The examples used in probability calculus always involve clearly defined conditions and countable outcomes (coin tosses, birds).

This we usually (and necessarily, I would add) *lack* in real life – and the basis (input) for the calculus will have to rest, as the author freely admits, on “subjective judgments” which are then, quite artificially, put on a numerical scale in order to “weigh” them. So the crux clearly is how good those judgments are – which I think gets distorted in the mathematical (pseudo-scientific) cloak in which they are presented. And making good judgments on the situation is not advanced by calculus, but first by diligent research, then by sound evaluation of the findings, then by formulating persuasive and compelling arguments.

Let us take the well-chosen example of the complete lack of in situ evidence for the “chemical attack”:

– this fact only became apparent after some researchers have geolocated the mysterious crater and the WhiteHelmet complex, and started to make a timeline and reconstruction of events. You need to study closely and *understand* the hypothetical situation in order to make a good judgment about how “probable” it is.

– The “improbability” of the official scenario will be highlighted once you realise that the decontamination we see in the hosing scene at the WhiteHelmet complex would, in a real-world scenario, have to take place in situ, *before* transporting the supposed victims 3 kms across town. Quite absurd, but nothing you can put into a calculus.

– The lack of footage from the impact site – how probable or improbable is it in itself? You need a qualified judgment to assess probability (in fact: plausibility here, there is again no way to “measure” this fact by statistical means). And for making such an assessment it helps enormously if you do in-depth, background research about the many people with cameras, and where they have been at the time … (I have plenty of evidence on that point, which I’ll be posting in September).

Thanks! While Paul can offer his own comments, I’d like quickly to make two. The first is that the kind of work you’ve been doing, and are pointing to here, is absolutely indispensable for any evaluation of the relative probabilities: I think Paul is clear that we are relying on the army of citizen investigators to come up with the hypotheses to test. The other is that, as he emphasises, the method only permits rational evaluation of evidence as a comparison between two hypotheses, so it is very much a question of judgement; the point, I believe, is that it helps make the assumptions informing each judgement more explicit than they might otherwise be. (Contrast Bellingcat, which just lists a bunch of apparently preposterous things the opponent must believe without indicating any of the preposterous things their own position may depend on believing.)

I agree in so far as the evaluation leads to discussing the many merits/weaknesses of the hypotheses in consideration.

What I object to (both on a pragmatic and a theoretical level) is the mathematical = pseudo-scientific cloak in which it is presented. Maybe persuasive for some people who believe in such an approach. Not for most, as they want to hear (rightly, in my opinion) good compelling arguments for either side. And it is not the accumulated “weight” of probabilities that wins – but the most cogent argument. Even if it is only one – if it is well presented, it will be the winner!

So, it is decisive for the debate to frame the arguments in a precise way. Major example is the much discussed OPCW report. It is certainly worthwhile to point out the weaknesses, especially the lack of a chain of custody. However, it doesn’t lead much further: even if there were a documented chain of custody, and the “presence” of sarin proper at KS were proved, it doesn’t tell us who put it there. Therefore, it would not be evidence that can distinguish between the two scenarios at stake: either “Assad” sarin attack – or false flag?

So, this is in fact a red herring, it tends to *obscure* the real issue. We need to look for reliable, strong evidence that can tell apart the two scenarios. Like demonstrable clues for fakery. Or impossible timelines (on the claim that all footage comes from 4 April).

Qoppa

See my reply to CL above for why (1) statements about the strength of evidence implicitly make judgements about conditional probabilities; (2) it is useful to put numbers on conditional probabilities and calculate likelihood ratios, even if these conditional probabilities are based on subjective judgements.

More generally, I think you’re arguing that probabilities have to be based on something like physical properties (symmetry, randomization) as in coin tosses, or on the observed long-term frequencies of events. This frequentist interpretation of probability was taught to most statisticians for most of the 20th century. The Bayesian interpretation of probability is subjectivist: probabilities are degrees of belief, but these subjective degrees of belief must obey the rules of probability theory if they are to satisfy simple axioms of logical consistency. This doesn’t mean that different observers can never agree on what the data show: in many situations, even after allowing for uncertainties in the conditional probabilities, the likelihood ratio is so large that the hypothesis best supported by the data will be accepted, however surprising it may be.

A simple puzzle that illustrates the difference between physical probability and subjective probability is the Monty Hall problem, involving a game show, a car and two goats. One reason why many distinguished mathematicians have trouble getting the right answer is that they think of probabilities as physical rather than subjective.
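For readers who want to check the Monty Hall answer for themselves, a quick simulation settles it. This is just a sketch of the standard puzzle, not part of Paul’s argument about Syria:

```python
import random

# Monte Carlo check of the Monty Hall problem: switching doors wins the
# car about 2/3 of the time, while sticking wins only about 1/3.
def play(switch, rng):
    car, pick = rng.randrange(3), rng.randrange(3)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = next(d for d in range(3) if d not in (pick, car))
    if switch:
        pick = next(d for d in range(3) if d not in (pick, opened))
    return pick == car

rng = random.Random(0)
trials = 100_000
wins = sum(play(switch=True, rng=rng) for _ in range(trials))
print(wins / trials)  # close to 2/3
```

The intuition the simulation confirms: switching loses only when the original pick was the car (probability 1/3), so it wins with probability 2/3.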

Peter – I can’t see the reply to CL you reference. Did it post correctly?

@Qoppa, You’re right in that it’s likely the most cogent arguments that will ‘win’, but I think you’re wrong to dismiss the assignment of assumed probabilities, even with a largely subjective component, as in some way rendering the approach ‘pseudo-science’. This method very much IS science – it forces assumptions to be explicit, to be disclosed, to be accompanied by evidence, and therefore to be available to challenge and revision.

Without such approaches we could be left waiting, and waiting, for more perfect evidence to accumulate before we could make judgements between hypotheses. This method offers a properly scientific way in which to make these judgements on the basis of the best available evidence, and to use them to support coherent arguments on that basis.

“This method very much IS science – it forces assumptions to be explicit, to be disclosed, to be accompanied by evidence and therefore to be available to challenge and revision.”

Yes, indeed. It moves us away from bald assertions and from pseudo-explanations that have large gaps in the logical chain.

@Paul McKeigue

Thanks for explaining. However, this doesn’t really alleviate my concerns. To apply a calculus you first need an “input” of something measurable (whatever it is); my point is that it is all about the *quality* of the arguments that go into the “subjective judgments”, and there is nothing quantitative in them (so if you like, call it a category problem).

When you say “probabilities are degrees of belief” – I am not sure this is really where you want to go. Sure, you can somehow apply a “measure” to subjective states of conviction – with the foreseeable result that the subjective states of fanatics and paranoids are close to 100%. I assume this is not what you mean, but rather the degrees a hypothetically rational observer would assign to various hypotheses. Yes? – Well, then we’re back to the quality of arguments, and more specifically to the soundness of whoever makes the judgment. So, what I am saying is that there is no shortcut around engaging with the discussion, understanding and evaluating the evidence (which is no little thing: there are huge differences in interpretation, and even in assigning relevance to various parts of it).

The probability degrees you assign to hypotheses are not more than an expression of how convincing you find an argument. It’s an artificial posterior exercise. Just the peculiar way a mathematically-minded person expresses himself (instead of saying “I am fully convinced/half convinced”).

Let’s discuss examples:

1) How do you arrive at probability degrees in respect to results of the OPCW report?

There’s a lot of material to discuss, many problems to be understood and assessed. How on earth can you “weigh” them? And what percentage do they add to the final tally? – Well, if you follow my argument above (it’s basically a red herring), this discussion won’t add anything. If you don’t, you will end up with a vastly different estimate.

2) Having said that, I do think that probability theory can help us on some points. People are quite bad at adequately assessing “improbable” events. Something is considered quite “unlikely” if there is a 1-in-10 chance of it materializing. But what about much rarer events, with only a 1% chance? Or just 0.001%?

I have made use of this reasoning with regard to the “head wound children” – the first clue (after the overriding cui bono? question) that something seriously doesn’t add up in the “official version”.

Sarin doesn’t cause head wounds. So how to explain that many of the dead children, in the very first pictures that circulated, had serious injuries to their heads? Well, the injuries could in some way be coincident with the hypothetical sarin attack: as sarin causes paralysis, the wounds could come from falling unconscious. Or they could be wounds from the conventional bombing that reportedly also took place. However, on either explanation they would be rather untypical wounds, as they clearly look as if caused by a pointed hit from a metal object. But many odd things happen, so we can’t exclude atypical injuries.

But being atypical they are rare. How rare? Impossible to say precisely, but let us (generously, I think) assume that 1 in 1000 cases of either falling unconscious or conventional shrapnel wounds would produce such an atypical injury that suggests an entirely different cause.

Now, my argument was that we have (at least) 4 such cases, uncorrelated, among ~90 victims. If for one case there is only a small probability of seeing an atypical wound, the probability of seeing it on at least 4 children at the same time is close to zero.

3) This was my evaluation before reconstructing a key sequence of events (“trail of blood”). It turns out that we actually see two of the head wound children before being injured – and this is right in the middle of the “hosing scene” (so no peripheral event):

The sequence is incontrovertible, so the wounds definitely cannot come from the hypothetical chemical attack. It happened after the alleged treatment from the WhiteHelmets.

What to make of this evidence? Well, we have to closely consider the situation: there seems to be no alternative, even remotely plausible explanation other than that these children were intentionally injured (and likely killed). Especially as on both kids we see “progressive stages” of injuries, so there must have been at least two rounds of violence directed at them.

The visual evidence only testifies to this specific episode which we can reconstruct – but it has implications for the main question (false flag or not). As we can safely exclude other causes, there seems to be no way to escape the conclusion that these children were meant to die!

So …. how do you now measure the “degree of probability”?

If you ask me: this is already the killer argument, what we call “beyond reasonable doubt”. If you find my argument convincing, you’ll put it pretty close to 100% (and we can indeed argue about what highly unlikely but theoretically possible alternative scenarios are still conceivable).

If you don’t find it convincing – well, you’ll also need to make considerations (and not just quantitative “measurements”) to include it in your calculus.

This response has grown into a very long text – I have spelled this out to show how exactly it is that we can arrive at “degrees of probability” from an informed assessment of the evidence. And that it all hangs on the *quality* of the arguments which are presented.

Very glad to see Bayes used on the two chemical attacks in Syria. May I suggest Paul do a third post critiquing the likelihood ratios used by Rootclaim on Zamalka?

I’m intrigued about the guess that Sasa Wawa is Saar Wilf. Any other evidence for this or any sources of speculation you’ve seen?

There is a long post about using Bayes’ Theorem for past events written by Richard Carrier. http://www.richardcarrier.info/archives/12742 He wrote a book about using Bayes’ Theorem for history called Proving History, and then used the method to evaluate the likelihood that Jesus was not a historical person (mythicism) in another book called On the Historicity of Jesus.

The key part of this methodology is formulating alternative hypotheses and listing the assumptions that underlie these hypotheses. Some form of diagram and/or table would help in clarifying the alternative hypotheses and their underlying assumptions, before assigning probabilities.

There will always be a group of politicians and commentators who have difficulty coping with the idea that there are alternative explanations: they see their job as making assertions and ex-cathedra pronouncements. However, critiquing their assertions logically may go a long way toward draining them of their power.

I’ve had a closer look at one of the currently most extensively documented cases against Assad in the Khan Sheikhoun attack. It really doesn’t bear any scrutiny at all.

Here (at last) is my walkthrough:

View story at Medium.com

Thanks for all the comments above, and a word of recommendation for Adrian’s dissection of the HRW report, as linked to.