Judges want a tool that can help them make decisions. But the science behind those tools is limited.

The Calculus of Criminal Risk

Earlier this month, California Superior Court Judge Aaron Persky decided that a sex offender should spend six months in jail followed by probation, rather than go to prison. Brock Turner was convicted on three felony counts of sexual assault. The maximum sentence for his crimes was 14 years of imprisonment. Prosecutors recommended six years. But Persky chose to reduce that, largely because Turner had no previous criminal record, came from a good family, and reportedly showed remorse.


To those who have read about this case and disagree with the judge’s assessment, the news stands as a painful example of how subjective judgment can lead sentencing astray. Where some people — including Turner’s victim — saw an adult who committed a violent crime, Persky apparently saw a young man with potential who made a mistake — and crucially, one who wouldn’t pose a high risk to society going forward.

It’s worth noting, though, that the defense team offered the judge more than a casual and subjective assessment of Turner’s future criminal risk. Rather, as part of their sentencing memorandum, the attorneys cited two separate, data-driven risk evaluations, which suggested in aggregate that Turner presented little future threat, and that his rehabilitation needs could be met through probation. These sorts of so-called “algorithmic” risk evaluation tools are now common in the American criminal justice system — not least because they promise to bring a measure of objectivity and empiricism to what would otherwise be little more than a judge’s gut feeling.

And yet, a growing number of legal scholars, criminal justice researchers, and even scientists — many of whom support the use of such tools in theory — suggest that they nonetheless have flaws and biases that aren’t always well understood by the people using them.

Judges want a scientific tool that can help them make decisions in an objective way, said Christopher Slobogin, director of the criminal justice program at Vanderbilt Law School, a supporter of the use of evidence-based sentencing tools. But, he said, some judges “don’t inquire deeply into how scientific what they’re using actually is.”

The history of evidence-based risk assessment goes back to mid-century parole boards, said Sonya Starr, co-director of the empirical legal studies center at the University of Michigan Law School, and one of the tools’ biggest critics. Those original assessments were less quantitative than those used today. “You’d probably find them comically old fashioned. One characterized offenders as one of 9 or 11 different profiles, like ‘The Ne’er-Do-Well’ and ‘The Ruffian,’” she said.

But over time, they evolved into complex instruments. Evaluators, still often members of a parole board, present offenders with a series of questions about their criminal and personal history – gender, age, education, family, where they grew up. The answers to these questions are combined with data from criminal records and fed into an algorithm, which spits out a measurement of risk – usually “low,” “medium,” or “high.”

The evaluation is based on what criminologists have learned over the decades about factors that correlate with recidivism in former convicts, though how each specific algorithm works – and what weight it gives different factors – varies widely. In some cases, nobody knows how the algorithms work because they are proprietary intellectual property owned by a corporation that sells its services to the state. None of the risk assessment tools in use today include race as a factor in their evaluations. Almost none of them include information about the offender’s current crime.
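To make the mechanics concrete, here is a minimal sketch of how a questionnaire-style tool could turn answers into a risk label. Everything in it is an assumption made for illustration: the factor names, the weights, and the cutoffs are invented and do not come from any real instrument, many of which are proprietary.

```python
# Hypothetical illustration only: these factors, weights, and cutoffs are
# invented and do not describe any real risk assessment instrument.
HYPOTHETICAL_WEIGHTS = {
    "prior_arrests": 2,            # points per prior arrest
    "under_25": 3,                 # points if the person is under 25
    "no_high_school_diploma": 2,   # points if true
    "unemployed": 2,               # points if true
}


def risk_score(answers):
    """Add up weight * value for every factor; missing answers count as 0."""
    return sum(
        weight * int(answers.get(factor, 0))
        for factor, weight in HYPOTHETICAL_WEIGHTS.items()
    )


def risk_category(score, medium_cutoff=4, high_cutoff=8):
    """Bin a numeric score into a label using (assumed) cutoffs."""
    if score >= high_cutoff:
        return "high"
    if score >= medium_cutoff:
        return "medium"
    return "low"


# One invented person: one prior arrest, under 25, unemployed.
person = {"prior_arrests": 1, "under_25": True, "unemployed": True}
score = risk_score(person)
print(score, risk_category(score))
```

Notice that three of the four invented factors describe who the person is rather than what the person did, which is the pattern critics later in this piece object to.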

Virginia became the first state to use risk assessments in sentencing in 1994, but the practice did not begin to expand rapidly until around 2007. In a critique published in the Stanford Law Review in 2014, Starr counted 20 states that were using the assessment tools in sentencing in at least some jurisdictions. It’s likely to be even more widespread today, she said, though there is no agency that keeps a formal count.

Parole boards use the assessments to figure out who should be let out of prison early and what kinds of programs they might need to reduce their recidivism risks. Judges use the assessments to determine how long a person should be imprisoned to begin with – under the assumption that a high-risk offender should be kept off the streets for a longer period of time.

And research suggests these assessments do a better job of predicting who will go on to commit acts of violence than subjective judgment, said Slobogin. He pointed to a 2006 meta-analysis of 56 years’ worth of studies comparing clinical and statistical prediction. Algorithms haven’t been used in sentencing for that long, but they have been used by the psychiatrists who treat convicted offenders and other potentially violent people. In that setting, the meta-analysis found a 13 percent increase in accuracy when clinical psychiatrists used statistical methods as opposed to their own judgment.

That might not sound like much of an improvement, but the clinical prediction is already better than chance, Slobogin said. And statistical algorithms are better than that. It’s not perfect, of course. But nothing is. If someone was evaluating your likelihood of committing an act of violence, wouldn’t you want them to use the tool that was incrementally better?

The problem, say critics, is that there are a lot of assumptions built into this process – and those assumptions serve to stack the system against people who are already disadvantaged and in favor of people like Brock Turner – a white, middle-class Stanford student and champion swimmer.

Let’s start with the idea of evaluating the algorithms’ success by how accurate they are at predicting recidivism, said Jessica Eaglin, associate professor of law at Indiana University. The first thing you have to ask is what kind of recidivism we’re talking about.

The word “recidivism” can mean many different things, from committing the same crime you were previously convicted of, all the way down to not showing up for a mandatory meeting or failing a drug test. Some of those things, she said, don’t actually pose a threat to public safety. Some do. And the statistical assessment tools don’t distinguish between the two. So their successes don’t, either. “The commission of new violent crimes is so low in the data set that they have to flatten the data so any kind of recidivism, including things we aren’t really concerned about, will have positive hits for the tool because the tool has to give some kind of variation,” she said.
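A small sketch can show why the definition matters. The ten follow-up records below are invented for illustration and are not drawn from any study. The only point is that the measured “recidivism rate” depends heavily on which events are counted.

```python
# Hypothetical follow-up records, invented for illustration only.
def record(violent=False, nonviolent=False, technical=False):
    return {
        "violent_crime": violent,
        "nonviolent_crime": nonviolent,
        "technical_violation": technical,  # e.g. a missed meeting or failed drug test
    }


records = [
    record(violent=True),
    record(nonviolent=True),
    record(nonviolent=True, technical=True),
    record(technical=True),
    record(technical=True),
    record(technical=True),
] + [record() for _ in range(4)]  # four people with no new events


def recidivism_rate(records, counted_events):
    """Share of people with at least one event of a counted type."""
    hits = sum(any(r[event] for event in counted_events) for r in records)
    return hits / len(records)


narrow = recidivism_rate(records, ["violent_crime"])
broad = recidivism_rate(
    records, ["violent_crime", "nonviolent_crime", "technical_violation"]
)
print(narrow, broad)
```

In this made-up group, one person in ten commits a new violent crime, but six in ten count as recidivists under the broad definition. A tool scored against the broad definition gets credit for predicting events that have little to do with public safety.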

There are also problems with what judges understand about the tool and what it tells them, she said. Usually, a judge is getting this risk assessment in conjunction with other information, and they don’t always know what information went into the assessment and what didn’t. That lack of transparency means a judge might base a sentencing decision on two things: a statistical assessment that says an offender is “high risk,” which is itself partially based on the offender’s previous criminal record, and that same criminal record considered separately. From the judge’s perspective, Eaglin said, that can look like a high-risk individual who also has a history of violence. In reality, it’s actually a double count of that person’s history.

The transparency problem also bleeds over into the categorizations, said Cecelia Klingele, assistant professor at the University of Wisconsin Law School. “Low risk,” “medium risk,” and “high risk” mean different things depending on which tool you’re using. In particular, the cutoffs between the categories are themselves inherently subjective. There’s a fine line between what makes somebody low risk vs. medium risk, but the result looks very different to the person making a judgment based on the category. If your sentencing is based on a proprietary algorithm, Klingele said, you might not even have the information necessary to challenge the cutoff in an appeal.
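The effect of a cutoff is easy to see in a toy example. The two “tools” below are imaginary, their cutoffs are made up, and the sketch assumes both score people on the same numeric scale, which real instruments may not do.

```python
# Hypothetical cutoffs for two imaginary tools; neither reflects a real
# instrument, and a shared scoring scale is assumed for simplicity.
CUTOFFS = {
    "tool_a": {"medium": 4, "high": 8},
    "tool_b": {"medium": 6, "high": 10},
}


def label(score, tool):
    """Return the category a given tool would assign to a numeric score."""
    cut = CUTOFFS[tool]
    if score >= cut["high"]:
        return "high"
    if score >= cut["medium"]:
        return "medium"
    return "low"


for score in (3, 4, 5, 8):
    print(score, label(score, "tool_a"), label(score, "tool_b"))
```

Under these assumed cutoffs, a single point separates “low” from “medium” on the first tool, and a score of 8 reads as “high” on one tool but only “medium” on the other. A judge who sees only the label has no way to tell how close to the line a person fell.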


But the biggest problem, according to Sonya Starr, is that these risk assessments essentially take the same subjective bias that led to a middle-class, high-achieving, white college student appearing sympathetic to a judge (and, thus, getting a light sentence) and legitimize it as objective scientific fact. This problem was the subject of a deep investigation by ProPublica in May, which highlighted cases where black defendants and white defendants convicted of similar crimes were given very different risk levels – even in cases where the white defendant had a history of committing more, and more serious, crimes.

“Just about every marker of poverty has been included as a risk factor in these algorithms,” she said. The neighborhood a person grew up in, their parents’ criminal history, their own level of education – it’s all in there. “If all you want to predict is the crime rate, or rather the risk of being arrested, poverty is a decent predictor. The more disadvantaged factors you have in your life, the more likely you are to be arrested. But we have fundamental principles about treating rich and poor the same. You’re supposed to punish on the basis of what was done, not on who they are,” she said.

Ultimately, the real problem here is a disconnect between what the phrase “evidence-based sentencing” seems to promise and what it actually is. There are questions that go unanswered by math, Klingele said, and there are numbers that appear objective in their final form but have subjectivity and bias baked into them in ways that are difficult to tease apart.

The result, according to a 2012 British Medical Journal meta-analysis of the accuracy of statistical risk assessments, is a tool that may be better than chance, but “should not be the sole determinant of sentencing, detention, and release.”

That’s a conclusion that’s about to be put to the test, legally speaking. The Wisconsin Supreme Court is due to rule any day on whether statistical risk assessments can be used for sentencing without violating due process. A key part of that ruling, Slobogin said, will be evaluating whether algorithms meet a standard of scientific validity.