The UK's lie detector scandal
The evidence base for the polygraph test is full of holes. So why are we using lie detectors increasingly often?
I’ve only known for a couple of months that the UK uses polygraph tests. Before that, I thought they were a quirky, somewhat benighted American thing (sorry, US readers!). The first I saw of it was back in February, when Justice Secretary (and Deputy Prime Minister) Dominic Raab tweeted the following:
It’s not just terrorists: we also use polygraphs on sex offenders and domestic abusers. Although the results of a polygraph aren’t admissible in criminal court (as they are in just under half of US states), they’re used to check if already-convicted and released criminals are sticking to the rules of their licence: not contacting certain people, not using the internet, not buying certain terror-related items - that kind of thing.
Indeed, it turns out that the penal service in the UK has been allowed to use the polygraph since 2007, and its use has been expanded with new criminal justice-related bills across the last couple of years.
Lie detectors are back in the news because the WIRED writer Amit Katwala has a new book out called Tremors in the Blood, which tells the story of the invention of the polygraph in the 1920s, the interpersonal drama between its creators, and its effect on two murder trials. It’s a ripping yarn - but as James Ball noted in his review for The Spectator, it isn’t really about the science. It only very briefly discusses modern scientific research on the polygraph in its epilogue. So I thought it might be fun to take a look at the science myself.
“Fun”. You know what I mean.
Poly want a grapher
The first thing to do is define our terms. I said “the UK” above, but actually I mean “England and Wales” - Scotland and Northern Ireland don’t use polygraphs. Also, it’s important to know that “polygraph” isn’t a standardised term by any means: it just means a lie detector test where multiple (hence “poly”) measurements are taken of people’s physiological reactions during an interrogation: skin conductance, heart rate, blood pressure, and breathing rate are the most common.
So: does it work? The UK Home Office has an FAQ (last updated in January 2022) on its website. It states:
Are polygraph examinations accurate?
Polygraph testing has been found to be 80-90% accurate.
That’s it - that’s the whole answer to the question. It doesn’t have a hyperlink or cite a source. Can’t Her Majesty’s Government do a little better than this?
What do the numbers even mean? You have to be careful here. It doesn’t make sense to just state baldly that a test like the polygraph is “X% accurate”, or that “in X% of cases the polygraph will give the correct answer”. That’s because tests like this need to be analysed in terms of their sensitivity and specificity (and yes, my eyes glaze over when I see these words in a popular-science work too, but they’re important!).
The sensitivity of your test—in this case, how likely it is to say that someone is lying when they actually are lying; in other words, its true-positive rate—is up to you. If you set the threshold differently (if you decide that a smaller physiological reaction counts as evidence of lying), then you’ll increase the sensitivity and catch more true positives. But you’ll also always make the test less specific - more likely to produce false positives. Think of an extreme case: you could choose to interpret the results as saying that every single person you test is lying(!). That way you’ll certainly catch every single liar, but you’ll also label every truth-teller a liar, too.
As one critical paper put it:
[I]f we wish to achieve a sensitivity of 90%… then the false-positive rate will be 27.2% (more than a quarter of the innocent suspects will be classified as guilty). Likewise, keeping the false-positive rate at no more than 10% would yield only 73.8% sensitivity. These figures are definitely much lower than claims made by [polygraph] proponents…
So, as we talk about percentages as we go on, bear this in mind: even very impressive-seeming numbers in this field are far from simple to interpret. For stats fans: single percentage figures given in polygraph studies often refer to the Area Under the Curve (AUC) of a ROC curve, which summarises how well the test discriminates liars from truth-tellers across all possible thresholds - but that’s still not a simple measure of the “accuracy” of the test.
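The trade-off above can be made concrete with a few lines of code. This is a toy sketch: the “physiological scores” are invented numbers, not real polygraph data, and the threshold rule is a deliberately simplified stand-in for how an examiner might call someone deceptive.

```python
# Toy illustration of the sensitivity/specificity trade-off and ROC AUC.
# All scores below are made-up numbers for illustration only.

def sens_spec(scores_liars, scores_truthful, threshold):
    """Classify anyone scoring >= threshold as 'lying'."""
    tp = sum(s >= threshold for s in scores_liars)    # liars correctly flagged
    tn = sum(s < threshold for s in scores_truthful)  # truth-tellers correctly cleared
    return tp / len(scores_liars), tn / len(scores_truthful)

liars    = [0.9, 0.8, 0.7, 0.6, 0.4]   # hypothetical reaction scores
truthful = [0.5, 0.4, 0.3, 0.2, 0.1]

# Lowering the threshold raises sensitivity but lowers specificity:
print(sens_spec(liars, truthful, 0.6))   # strict threshold
print(sens_spec(liars, truthful, 0.3))   # lax threshold

# The extreme case from the text: call everyone a liar.
print(sens_spec(liars, truthful, 0.0))   # sensitivity 1.0, specificity 0.0

# AUC: the probability that a randomly chosen liar outscores a randomly
# chosen truth-teller (the Mann-Whitney interpretation of the ROC area).
pairs = [(l, t) for l in liars for t in truthful]
auc = sum((l > t) + 0.5 * (l == t) for l, t in pairs) / len(pairs)
print(round(auc, 2))
```

Notice that no single threshold gives you a single “accuracy” number: each choice buys sensitivity at the cost of specificity, which is exactly why a bare “80-90% accurate” claim is uninterpretable.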
So we’re off to a great start: the numbers provided by the Government are meaningless. But where do they actually come from? The Government website doesn’t say, but luckily, a while ago a journalist emailed me to chat about this, and he told me he’d asked them for a source. They told him that the figures are from a meta-analysis published in 2011. A meta-analysis run and published by… the American Polygraph Association.
Okay, let’s not jump to conclusions! Sure, since it’s from the “professional association of polygraph examiners”, it does mean there’s something of a conflict of interest here. But it doesn’t mean we can just discount it. Our question is whether the numbers are biased in some way, or perhaps whether the meta-analysis uses low-quality studies, in a Garbage-In, Garbage-Out kind of deal.
But before we look at the meta-analysis itself, we need to talk about how the polygraph test really works. It turns out that polygraph testing is a lot more complicated—and a lot more questionable—than the version you see on TV.
How polygraphs “work”
As far as I can tell, the most common method for polygraph testing—and certainly the most commonly-studied—is the “Comparison Question Technique” (which also has a number of sub-types, just to make things more complicated). Here’s how it works.
The first step is to get the subject to believe that the machine really can detect their deception. Remember the scene in The Wire where the cops hook the murder suspect up to a photocopier—a device he’s clearly never seen before—and manage to elicit a confession by convincing him the machine can tell if he’s lying? That was apparently based on real life, and reading descriptions of the Comparison Question Technique drives that home: the technique starts with what’s essentially a magic trick where the examiner pretends to work out a number the subject is thinking of, “just from their physiological responses”.
Then, to the questions. There are two main types of question and, as the name of the technique suggests, the subject’s physiological responses are compared while answering them. They’re asked “relevant” questions which directly impinge on the crime that’s of interest: for example, “did you make contact with an ISIS recruiter during the week of 7 March 2022?”. Then there are “control” questions: “have you ever cheated on a test?” or “have you ever stolen anything, however minor?”.
The control questions are supposed to be the sort of thing that make innocent people uncomfortable: the assumption is that almost everyone has cheated at some point and almost everyone has stolen something, so saying “no” to these questions will be a lie. But, crucially, the subject is made to believe that they’re part of the test and that the examiner considers them highly important (“it’s very important that we know you’re not the sort of person who cheats or steals…”), so they might be incentivised to lie on them anyway.
The relevant questions, on the other hand, will be far more concerning to guilty people, since they contain specific details of the crime they committed. Innocent people won’t have a particularly strong reaction to them. And that’s how the test is scored: if you have a stronger reaction to the relevant questions than the control ones, you’re guilty. If it’s the opposite way around, you’re innocent.
Note that questions like “is your name Stuart J. Ritchie?” or “are you thirty-three years old?” (like you see in films; both sadly answered “yes” for me) aren’t used, or if they are, they’re considered “filler” or “irrelevant” questions that aren’t used in the actual measurement of truth or deception.
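The scoring rule described above can be sketched in a few lines. This is my own simplified rendering of the Comparison Question Technique’s logic: the question labels and reaction scores are invented, and real examiners use more elaborate chart-scoring schemes than a simple comparison of means.

```python
# A sketch of the Comparison Question Technique's scoring rule as described
# above. Question labels and reaction scores are invented for illustration.

def cqt_verdict(responses):
    """responses: list of (question_type, reaction_score) tuples.
    'irrelevant' filler questions are ignored; the verdict depends only on
    whether relevant questions provoke stronger reactions than control ones."""
    relevant = [r for t, r in responses if t == "relevant"]
    control  = [r for t, r in responses if t == "control"]
    mean = lambda xs: sum(xs) / len(xs)
    return "deceptive" if mean(relevant) > mean(control) else "truthful"

session = [
    ("irrelevant", 0.2),   # "Is your name ...?" - not scored
    ("control",    0.6),   # "Have you ever stolen anything?"
    ("relevant",   0.9),   # question about the specific offence
    ("control",    0.5),
    ("relevant",   0.8),
]
print(cqt_verdict(session))   # relevant reactions outweigh control reactions
```

Written out this way, it’s plain how much is riding on the assumption that guilt, and only guilt, drives the relevant-versus-control difference.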
Anyway, if you’re thinking that this sounds like an awful lot of assumptions about how people will react - you’re exactly right. You can imagine all sorts of situations where the wording of the “relevant” question isn’t precisely correct, or someone is guilty of a related act but not the specific one in the question, which could affect how they respond. You can imagine huge individual differences in how nervous people are when asked questions, and how aggressively the examiner asks them. You can imagine that some people will be far more sceptical than others about the truth-telling properties of the machine.
And yet, the numbers say the tests using this rather weird technique are highly accurate, don’t they? So why worry?
The big flaw
Here’s why you should worry: the field studies from which these numbers are derived are subject to an enormous, almost unbelievable flaw. It relates to the need for “ground truth” - the confirmation that the person they think committed the crime actually did so.
The problem is that in most cases this ground truth comes in the form of a confession: the guilty party admits that theydunnit (and in so doing, clears any other suspects). But where do these confessions come from? Often they’re obtained during the polygraph testing session itself. Studies of polygraph effectiveness have a fatal selection-bias problem that drastically undermines their results. Here’s a toy scenario, taken from a book chapter about polygraphs, to explain it:
Ten women are suspects in a criminal investigation. A polygrapher tests them one by one until a deceptive outcome is obtained, say on the sixth suspect tested… According to usual practice, the examiner then attempts to extract a confession from the sixth suspect. If the examinee fails to confess, her guilt or innocence cannot be confirmed.
It is possible that the polygrapher committed two errors in testing these six cases: The person with the deceptive [results] may have been innocent, and one of those tested before her could have been guilty. In the absence of confession-backed verification, however, the polygraph records from these six cases will never be included as part of a sample in a validity study.
If the sixth suspect does confess, however, these six [results], all of which confirm the original examiner’s assessment, will be included. The resulting sample of cases would consist entirely of [results] the original examiner judged correctly and would never include cases in which an error was made…
[I]f polygraph testing actually had no better than chance accuracy, by basing validity studies on confession-verified [results] selected in this manner, a researcher could misleadingly conclude that the technique was virtually infallible. Given how cases are selected in confession studies of validity, it should not be surprising that field validity studies typically report that the original examiner was 100% correct. [my italics]
Did you get that? It’s so circular that I could scarcely believe it, but there it is: there’s a huge bias towards including successful polygraph sessions in the studies, and skipping ones where the “ground truth” is undetermined (and where the polygraph might not have produced accurate results).
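You can simulate the toy scenario from the quote to see just how bad this is. In the sketch below—my own illustrative assumptions, not anyone’s real data—the “polygraph” is literally a coin flip, only guilty suspects ever confess, and unverified cases are dropped, exactly as the quoted passage describes. The chance-level test then looks infallible in the surviving sample.

```python
# Simulation of the selection bias described above: a "polygraph" that is a
# pure coin flip appears perfect once only confession-verified cases are kept.
# All settings (10 suspects, confession behaviour) are illustrative assumptions.
import random

random.seed(0)

def run_case():
    """One investigation: 10 suspects, one guilty; test each in turn until a
    'deceptive' coin-flip result; only guilty suspects ever confess."""
    guilty = random.randrange(10)
    for suspect in range(10):
        deceptive = random.random() < 0.5          # chance-level test
        if deceptive:
            if suspect != guilty:
                return None                        # no confession: case dropped
            # Verified case: everyone tested before the confessor "passed"
            # (called innocent, and they were), and the confessor "failed"
            # (called guilty, and they were) - every call looks correct.
            return 1.0                             # apparent accuracy
    return None                                    # no deceptive result: dropped

results = [r for r in (run_case() for _ in range(10_000)) if r is not None]
print(f"cases surviving verification: {len(results)} of 10000")
print(f"apparent accuracy in surviving cases: {sum(results)/len(results):.0%}")
```

The apparent accuracy in the retained cases is 100% by construction—the quote’s “virtually infallible” conclusion from a test that carries no information at all.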
Almost everyone agrees this is a big problem, even the American Polygraph Association - they say so in a footnote in their meta-analysis (the one the UK Government used; it’s in footnote 16 on page 210). Astonishingly though, they just carry on and include studies in the meta-analysis that they know are rendered useless by this flaw. This alone renders any conclusions they draw—that 80-90% number, for instance—extremely suspect.
As far as I’m aware, there’s just one study, from 1991, where the authors went out of their way to verify ground truth using evidence that had nothing to do with the polygraph (confessions from suspects who were never polygraphed; a case where a woman thought her jewellery had been stolen but it turned out her daughter had just borrowed it). It covered only a very small number of cases, but its conclusion was that the polygraph did only a tiny bit better than chance at identifying innocent people. To my knowledge, nobody has gone to these kinds of lengths since to gather independent evidence and subject the Comparison Question Technique to a proper field trial.
The National Academy of Sciences report in 2003 put it this way:
In summary, we were unable to find any field experiments, field quasi-experiments, or prospective research-oriented data collection specifically designed to address polygraph validity and satisfying minimal standards of research quality.
(Note this doesn’t include the 1991 study, which was a retrospective study, rather than one deliberately set up to address polygraph accuracy.)
And a review from 2019 concluded:
…that the quality of research has changed little in the years elapsing since the release of the NAS report, and that the report’s landmark conclusions still stand.
A new meta-analysis
So much for field trials with real-life crimes. But surely you could instead test the accuracy of the polygraph in a laboratory setting, where participants in an experiment have committed some mocked-up “crime”, and you know exactly who’s “guilty”?
Lots of such studies have been done - in fact, they make up the majority of trials in the American Polygraph Association meta-analysis. But that meta-analysis didn’t separate out field and laboratory studies, so we don’t have a number for the lab in particular. Luckily, there’s an even more up-to-date meta-analysis, from 2020, which does split them out. Its authors argue that lab studies, with mocked-up crimes, show a decent, above-chance accuracy - something like 67% for the overall area-under-the-curve measure.
And that sounds pretty decent! But I’m sure you’ve already noticed the problem here: a psychology lab is quite a different context from a real, criminal-justice interrogation where you’ve been accused of a crime that you may or may not have committed. Your reactions in each case may be rather different (the jargon term is “ecological validity” - does a study actually bear any relation to the real question we need answered, which in this case is: “how accurate is the polygraph when it’s being used in high-stakes situations?”).
Aside from this problem—an issue for any laboratory study of polygraphs—the meta-analysis doesn’t look very good. For example, there’s no assessment of the quality of the included studies. In their section on field studies, they include studies that are contaminated by the “confession bias” we talked about above. They do a cursory check for publication bias using a funnel plot - checking whether the distribution of studies looks as you’d expect if every study had been published, even the ones with smaller, non-significant effects. But they just make the plot, don’t run the actual statistical test for asymmetry, and only make one plot for the whole dataset (so we can’t see whether the lab or field studies alone might show publication bias). They do provide their dataset, but their description of their meta-analytic statistics is so under-detailed that I couldn’t reproduce what they did.
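For the curious, here’s a minimal sketch of the kind of formal asymmetry check they skipped: an Egger-style regression, where you regress each study’s standardised effect (effect divided by its standard error) on its precision (one over the standard error), and an intercept far from zero signals funnel-plot asymmetry. The effect sizes and standard errors below are invented to illustrate the idea—they’re not taken from the meta-analysis—and a real analysis would also compute a standard error and p-value for the intercept.

```python
# A sketch of an Egger-style regression for funnel-plot asymmetry.
# Effects and standard errors are invented for illustration.

def egger_intercept(effects, ses):
    """Regress (effect / SE) on (1 / SE) by least squares; an intercept far
    from zero suggests funnel asymmetry, i.e. possible publication bias."""
    y = [e / s for e, s in zip(effects, ses)]
    x = [1 / s for s in ses]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx    # the intercept is the asymmetry statistic

# Symmetric "funnel": small (high-SE) studies scatter evenly around the
# same underlying effect as the big ones.
effects_sym = [0.50, 0.55, 0.45, 0.60, 0.40]
ses_sym     = [0.05, 0.10, 0.10, 0.20, 0.20]

# Asymmetric: the small studies all report inflated effects, as if their
# null results went unpublished.
effects_asym = [0.50, 0.55, 0.52, 0.90, 0.85]
ses_asym     = [0.05, 0.10, 0.10, 0.20, 0.20]

print(round(egger_intercept(effects_sym, ses_sym), 2))    # near zero
print(round(egger_intercept(effects_asym, ses_asym), 2))  # well above zero
```

The point isn’t that this exact test is mandatory—it’s that eyeballing a single funnel plot, with field and lab studies lumped together, is nowhere near a serious check for publication bias.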
Then there are a couple of other, less scientific, reasons to be concerned. First is the conflict of interest section. There are three authors on the meta-analysis, and two of them are polygraph examiners (one is even the editor of the journal of the American Polygraph Association). So they have a heavy incentive to find results favourable to the tests. Second, they seem to have a slightly odd attitude to science. Two of the three wrote another article in 2019 that contained the following rant about critics of the polygraph:
…[critics] discounted those findings, saying the research methods were substandard. To the present authors this seems to be an arrogant conclusion as the [critics] substituted their judgment about the qualities of research published in first tier peer-reviewed journals of psychological science. Such a position is insulting to the editors of those first-tier journals and the working scientists who peer-review for them.
To argue that methodological criticism of science is “arrogant” and “insulting” shows a pretty impressive failure to understand what science is all about. And to imply that peer-reviewers are infallible is - well, put it this way: if they were, I wouldn’t have much to write about on this Substack.
So we find ourselves in the very frustrating situation where proponents of the polygraph give it a big thumbs up, and sceptics note all the methodological flaws. You can pick a study to support whichever side of the debate you’d like to join.
But what that really means is that there isn’t strong evidence that the polygraph is useful. This is a really serious question, and mistakes could involve either punishing people for crimes they didn’t commit, or letting genuine criminals off the hook. In countries with capital punishment, it can literally be a matter of life and death. To be expanding the use of a test with such a flimsy, heavily-contested evidence base is nothing other than a scandal.
What a tangled web
Just like with the psychedelic drugs I discussed in an earlier post, you can imagine a counterargument: “even if there’s no decent evidence that the polygraph works, isn’t it still useful if some suspects believe it works?”. Indeed the Government has published analyses that show that criminals give more admissions about breaches of their licence terms when the polygraph is used - regardless of whether the test is “really” uncovering whether they’re lying from their physiology, isn’t that still useful?
But all this shows is that the examiner has managed to dupe the subject into thinking the test can read their mind, and they duly fess up more often - it’s the photocopier from The Wire again. But this is theatricality - it’s not science. Aside from the irony of having to deceive subjects during a test of deception (and aside from the issue of false confessions, which we haven’t considered), it’s a big risk: do we want to lull ourselves into a false sense of security by saying “this terrorist passed the polygraph, so it’s perfectly fine to let him walk free”? Just one mistake here—and remember from our discussion of sensitivity and specificity that, depending on how you set the threshold, mistakes can be common—could lead to disaster.
You’ve also got to think that criminals will become more and more aware that the test is being used, and work out how to get around it: advice on how to fool the test by deploying mental “countermeasures” is freely available online.
In 2020, the UK Government scrapped the use of Unconscious Bias training for civil servants. To those of us who’ve read the research—which shows that there’s only extremely weak evidence for changes in implicit biases having any effect on explicit biases, or biased behaviour—this was a welcome move. It was an embarrassment that our government was relying on something that just seemed like it might work, without any solid evidence backing it up.
Polygraph testing might not have the same current culture-war valence as Unconscious Bias training. But it’s a very similar story: a superficial “scientific” method to measure something that’s actually extremely fuzzy and difficult to quantify; a bunch of studies that seem compelling on the surface but which, upon closer inspection, reveal that there’s very little relevance to what we actually want to know. And an outcome—you passed the test!—that might seem reassuring but could easily miss the exact thing we want to pin down - whether that’s bias or deception.
This isn’t to say that at some point the evidence won’t look more favourable, or that one of the many different techniques of polygraphy might prove to work more effectively. But we’re far from that right now.
We dumped the Unconscious Bias test because there was no solid evidence that it worked. Not a word of a lie: it’s time to do the same for the polygraph.
Thanks for reading. If you’d like to have Science Fictions delivered to your inbox, or to support my work with a contribution, consider subscribing below.
Acknowledgements: I’m grateful to Saloni Dattani for commenting on an earlier draft, and especially for making me clarify the part about sensitivity and specificity (but if it’s still unclear, that’s my fault).
Image credit: Getty
Edit 15 May 2022: adding link to audio version; fixing a couple of minor typos.