The Volokh Conspiracy - DNA Matches and Statistics:

Patterico criticizes the use of statistics in this L.A. Times article:

[I]n 2004, a search of California's DNA database of [338,000] criminal offenders yielded an apparent breakthrough [in a 1972 rape/murder case]: Badly deteriorated DNA from the assailant's sperm was linked to John Puckett, an obese, wheelchair-bound 70-year-old with a history of rape.

The DNA "match" was based on fewer than half of the genetic markers typically used to connect someone to a crime, and there was no other physical evidence.

Puckett insisted he was innocent, saying that although DNA at the crime scene happened to match his, it belonged to someone else.

At Puckett's trial earlier this year, the prosecutor told the jury that the chance of such a coincidence was 1 in 1.1 million.

Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

In Puckett's case, it was 1 in 3.... In every cold hit case, the [scientific expert advisory] panels advised, police and prosecutors should multiply the Random Match Probability (1 in 1.1 million in Puckett's case) by the number of profiles in the database (338,000)..

I'm not knowledgeable enough about these things to speak with confidence about just how they can be explained accurately and comprehensibly to a jury. I may also be mistaken even about the more basic things (I've forgotten far too much about statistics, I'm sorry to report). Still, I'm pretty sure that both "the chance of such a coincidence was 1 in 1.1 million" and "the probability that the database search had hit upon an innocent person ... was 1 in 3" aren't quite right.

To begin with, if I'm right that the "1 in 1.1 million" number means roughly that 1 in 1.1 million people have the particular DNA markers that Puckett had, that's about 6000 people worldwide. To say that the defendant is one of only 6000 people who may have committed a crime (or one of only 3000, if the 1 in 1.1 million figure means 1 in 1.1 million males) doesn't by itself tell you much. It certainly doesn't tell you that there's only a 1 in 1.1 million chance that "although DNA at the crime scene happened to match his, it belonged to someone else."
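The arithmetic behind the "about 6000" figure can be checked with a few lines of Python. This is only a sketch: the world population of roughly 6.6 billion (about the 2008 figure) is an assumption not stated above, and the function name is made up for illustration.

```python
# Assumption: a world population of about 6.6 billion (roughly the 2008 figure).
WORLD_POPULATION = 6_600_000_000
RANDOM_MATCH_PROBABILITY = 1 / 1_100_000  # the "1 in 1.1 million" figure


def expected_matching_people(population: int, match_probability: float) -> float:
    """Expected number of people in a population who share the DNA profile."""
    return population * match_probability


# If the figure applies to everyone, versus only to males (about half).
everyone = expected_matching_people(WORLD_POPULATION, RANDOM_MATCH_PROBABILITY)
males_only = expected_matching_people(WORLD_POPULATION // 2, RANDOM_MATCH_PROBABILITY)

print(f"matching people worldwide: about {everyone:.0f}")
print(f"matching males worldwide:  about {males_only:.0f}")
```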

Now it may well be that, coupled with other evidence, the DNA match information might be quite probative. To take a simple example, imagine that only 1 in 100 men have red hair, and it's discovered that the killer had red hair. That the defendant had red hair is surely relevant evidence, coupled with various other evidence. But it doesn't say that there's only a 1% chance that "although the hair color at the crime scene happened to match [the red-haired defendant's], it belonged to someone else."

On the other hand, "the probability that the database search had hit upon an innocent person ... was 1 in 3" also strikes me as wrong. Most obviously, I don't think it can be the case that you should just "multiply the Random Match Probability (1 in 1.1 million in Puckett's case) by the number of profiles in the database (338,000). That's the same as dividing 1.1 million by 338,000" to yield the "chance that the search would link an innocent person to the crime." Say that the database had 2.2 million profiles; under that calculation, the chance that the search would link an innocent person to the crime would be 200%, obviously nonsensical.
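Another way to see why the product cannot be a probability is to tabulate it for a few database sizes. The sketch below is illustrative only; the function name is invented, and the reading of the product as an expected number of coincidental matches among innocent profiles is an interpretation offered here, not something the article states.

```python
RANDOM_MATCH_PROBABILITY = 1 / 1_100_000  # the "1 in 1.1 million" figure


def naive_product(database_size: int, match_probability: float) -> float:
    """Database size times match probability, as the quoted rule prescribes.

    Read as an expected count of coincidental matches (an interpretation,
    not the article's wording), this value is allowed to exceed 1.
    Read as a probability, it is not.
    """
    return database_size * match_probability


# The actual database size, a size equal to 1/p, and the 2.2 million example.
for size in (338_000, 1_100_000, 2_200_000):
    value = naive_product(size, RANDOM_MATCH_PROBABILITY)
    print(f"{size:>9,} profiles -> product = {value:.3f}")
```

For the 2.2 million profile case the product is about 2, which makes sense as "about two coincidental matches expected" but not as a 200% chance.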

Now I think that the multiplication might be someone's oversimplification of a different formula — the chance that a database of 338,000 people would yield a match with an innocent person, when there's a 1/1,100,000 chance that any particular innocent person would have the DNA markers. That formula is 1-(1-1/1,100,000)^338,000, which yields 26.5%, rather than 338,000/1,100,000 (30.7%). Nor is it an accident that the two percentages are close; when n is relatively low compared to a, 1-(1-1/a)^n is relatively close to n/a.
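The two formulas above can be compared numerically. The following is a minimal sketch with made-up function names; it assumes, as the formula does, that each innocent profile matches independently with probability 1/a.

```python
import math


def prob_at_least_one_match(n: int, a: float) -> float:
    """1 - (1 - 1/a)^n: chance that at least one of n innocent profiles matches.

    log1p and expm1 are used only to keep precision when 1/a is tiny;
    the value is the same as the plain formula.
    """
    return -math.expm1(n * math.log1p(-1.0 / a))


def linear_approximation(n: int, a: float) -> float:
    """n/a: the simple multiplication, close to the exact value when n << a."""
    return n / a


A = 1_100_000  # one in 1.1 million
for n in (338_000, 2_200_000):
    exact = prob_at_least_one_match(n, A)
    approx = linear_approximation(n, A)
    print(f"n = {n:>9,}: exact = {exact:.1%}, n/a = {approx:.1%}")
```

With the 2.2 million profile database from the earlier example, the exact formula stays below 100% while n/a does not, which is where the shortcut breaks down.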

But even taking account of this oversimplification, the article's description strikes me as mistaken. 1-(1-1/1,100,000)^338,000 is the probability that, if the rapist is not in the database, a database search would still come up with someone (who would then be innocent, since by hypothesis the rapist is not in the database). It is not "the probability that the database search had hit upon an innocent person."

Here's one way of seeing this: Let's say that the prosecution comes up with a vast amount of other evidence against Puckett: he admitted the crime in a letter to a friend, items left at the murder site are eventually tied to him, and more. He would still, though, have been found through a search of a 338,000-item DNA database, looking for a DNA profile that is possessed by 1/1,100,000 of the population, and under the article's assertion, "the probability that the database search had hit upon an innocent person" would still have been "1 in 3."

Despite all the other evidence that the police would have found, and even if the prosecutors didn't introduce the DNA evidence, there would be, under the article's description, a 1/3 chance that the search had hit upon an innocent person (Puckett), and thus a 1/3 chance that Puckett was innocent, presumably more than enough for an acquittal. That can't, of course, be right. But that just reflects the fact that 1/3 is not "the probability that the database search had hit upon an innocent person." It's the probability that a search would have come up with someone innocent if the rapist wasn't in the database.
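The gap between the two probabilities can be illustrated with a toy Bayesian model. Everything in it is an assumption made for illustration and none of it comes from the article or the trial: the search returns exactly one hit, the true source (if in the database) always matches his own profile, each innocent profile matches independently with probability p, and all the other evidence is summarized as a prior probability that the true source is in the database. Under those assumptions the common factor (1-p)^(n-1) cancels, and the chance that the single hit is innocent is (1-q)np / (q + (1-q)np), where q is the prior.

```python
def prob_hit_is_innocent(prior_in_database: float,
                         database_size: int,
                         match_probability: float) -> float:
    """Posterior chance that a single database hit is a coincidental match.

    Toy model, hypothetical throughout: exactly one hit was observed, the
    true source always matches his own profile, and innocent profiles match
    independently with probability `match_probability`.
    """
    n_times_p = database_size * match_probability
    # Weight for "source not in database, one innocent profile matched".
    innocent_weight = (1.0 - prior_in_database) * n_times_p
    # Weight for "source in database and is the one hit".
    guilty_weight = prior_in_database
    return innocent_weight / (guilty_weight + innocent_weight)


P = 1 / 1_100_000
N = 338_000
# Priors are arbitrary illustrative values, not estimates for any real case.
for prior in (0.01, 0.5, 0.99):
    posterior = prob_hit_is_innocent(prior, N, P)
    print(f"prior that source is in database = {prior:.2f} "
          f"-> chance the hit is innocent = {posterior:.3f}")
```

The answer in this sketch moves from near 1 to near 0 as the prior changes, so no single figure such as "1 in 3" can stand in for it without saying something about the other evidence.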

So, as I said, I'm not sure what juries should be told about these statistics, and how to weigh them together with the other probative evidence that's introduced at trial. But it seems to me that both of the options given in the quote — "the chance [that although DNA at the crime scene happened to match [defendant's], it belonged to someone else] was 1 in 1.1 million" and "the probability that the database search had hit upon an innocent person ... was 1 in 3" — are incorrect.

[Eugene Volokh, May 5, 2008 at 4:30pm]