Incomplete information, changing circumstances, and human error can make it tricky to interpret the results of a replication study.

What We Can (and Can’t) Learn From Replicating Scientific Experiments

Anyone who enters the field of pupillometry — the study of pupil size — stumbles upon one classic research paper: “Pupil Size as Related to Interest Value of Visual Stimuli.” The study was published in Science in 1960 by psychologists Eckhard H. Hess and James M. Polt of the University of Chicago. The duo used 16-mm film to capture images of volunteers’ eyes as they viewed a picture of a baby, a mother holding a child, a nude male, a nude female, and a landscape. The baby photo and the mother holding a child caused the female volunteers’ pupils to dilate by about 20 percent, the researchers reported, as did the male nudes. Male volunteers’ pupils responded by far the most to the female nude. The landscape photo seemingly had no effect on the males and even caused the females’ pupils to constrict by about 7 percent. The researchers’ conclusion: When we look at something we find interesting, our pupils dilate.


Hess and Polt’s paper has been cited more than 700 times and counting. Hess in particular gathered fame by popularizing the findings in a book he wrote on pupillometry. But while his work received much praise — as this 1986 obituary in the Chicago Tribune attests — it was also criticized from the beginning. His studies typically involved fewer than a dozen participants, statistical justifications were all but absent, and his methodological approaches were highly prone to bias. “Hess’s publications were very entertaining and conversational, but his works are terribly lacking in detail,” says Bob Snowden, a professor of psychology at Cardiff University in the U.K. who uses pupillometry to study psychopathy. “The way he presented it was very anecdotal.”

When reading Hess’s papers, one starts wondering: Did he invent or exaggerate a phenomenon, or was he really onto something? For several years, that question has been eating at Joost de Winter, whose group at Delft University of Technology, in the Netherlands, studies human-technology interactions. “Hess’s paper kept on popping up, but we just didn’t fully comprehend it,” says de Winter. “So we decided to try and replicate it.”

De Winter’s project, co-directed by biomechanical engineer Dimitra Dodou, was among the first to be funded by a grant from the Netherlands Organization for Scientific Research (NWO) designated exclusively for replication studies — the first such grant program in the world. The project comes amid a growing movement among scientists to demand reproducibility of controversial and important findings. Now that replication studies are here to stay, an important question arises: What can replication studies tell us, and what can’t they?

It is tempting to regard a failed replication as a refutation of the original finding. But that interpretation is often too simplistic. There are several reasons a repeated study could deliver a new outcome: The replicators might have lacked crucial information needed to reproduce the experiment; they might have made errors themselves; the study population might have been different; or the circumstances around the study might have changed. The opposite is true as well: A confirmed finding is not necessarily true, since both teams might have made the same technical or conceptual mistakes.

All these nuances make it tricky to interpret the results of a replication study. Notwithstanding that, replication projects are portrayed in the popular media as the scientific “proof of the pudding” — as verdicts not only on the original study’s veracity but, oftentimes, on the researchers’ professional integrity.

“What I am concerned about is that, in many cases, the whole discourse gets transformed into a very naive and frankly simplistic and inadequate idea of what reproducibility is supposed to be,” says Sabina Leonelli, a professor of philosophy and history of science at the University of Exeter in the U.K. She recently argued in Research in the History of Economic Thought and Methodology that we should rethink reproducibility as a criterion for research quality. “In many cases, direct replication of the results is not the point. It is interesting to get different results and explore the differences.”

That’s exactly what de Winter and colleagues are trying to do. Today, many pupillometry labs have developed their own variations on the research methods Hess and Polt used. But de Winter says his group wants to go beyond just replicating or debunking the half-century-old findings: “We want to reconstruct what happened, and set a new standard for this type of research.”

Roughly speaking, there are three types of replication studies, each of which has benefits and limitations. One approach is to redo only the analysis, using the original data. A second approach, known as direct replication, is to redo the entire experiment as faithfully as possible, based on the description of the original methods. But if there are already doubts about the original study’s setup, a direct replication is of little value. In that case, experts advocate a third approach: devising new, improved experimental protocols, and seeing if they reproduce the original result.

The Dutch researchers took a rigorous approach that combined all three replication types. They visited an extensive Hess archive at the Cummings Center for the History of Psychology at the University of Akron in Ohio, where they collected as much of the original data as they could by analyzing handwritten notes and tables stored in 48 cardboard boxes. They also made copies of the original pictures and slides used by Hess and Polt. They even ordered, on eBay, the exact model of projector Hess and Polt used in their original experiment. On top of that, de Winter and his colleagues repeated the experiment using modern computer screens and eye trackers, to check where on a screen participants were looking and to measure pupil response at a higher frequency and accuracy than the original study.

The Dutch team found shortcomings not only in Hess and Polt’s experimental setup but also in their measurements, statistics, and interpretation. Several factors Hess never mentioned turned out to strongly impact pupil dilation, including the light conditions during a slide change. The Dutch researchers also confirmed what others have called “pupillary escape”: When a new image is shown, the pupil quickly constricts, possibly as a protective reflex, and only after that slowly starts to dilate. De Winter and colleagues are still trying to determine the impact of the brightness of the specific part of the image a viewer focuses on.

The Dutch project is ongoing, but so far, its measurements suggest that any effect was clearly smaller than Hess had reported. “We’re talking about tenths of millimeters, which corresponds to a few percent, not 20. Numerous things he did made the results look more impressive,” de Winter says. For example, Hess reported change in total pupil area instead of the diameter. And whereas Hess thought that negative stimuli caused pupils to shrink, just as positive ones caused them to grow, the relationship between pleasure and dilation seems to work in only one direction. “There don’t seem to be emotions leading to constriction,” says de Winter.
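Why would reporting area rather than diameter make results look more impressive? It’s simple geometry: area grows with the square of the diameter, so for small effects the percent change in area is roughly double the percent change in diameter. A quick arithmetic sketch (the numbers below are illustrative, not Hess and Polt’s actual data):

```python
def area_change_pct(diameter_change_pct: float) -> float:
    """Percent change in pupil area for a given percent change in diameter.

    Pupil area is proportional to diameter squared, so a relative change
    r in diameter yields a relative change of (1 + r)^2 - 1 in area.
    """
    ratio = 1 + diameter_change_pct / 100
    return (ratio ** 2 - 1) * 100

# A 10% increase in diameter corresponds to a 21% increase in area:
print(round(area_change_pct(10), 1))  # 21.0

# Even a few percent in diameter roughly doubles when quoted as area:
print(round(area_change_pct(3), 1))   # 6.1
```

This is one reason a dilation of “tenths of millimeters” — a few percent in diameter — can read as a much larger effect when expressed as a change in area.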

The risk of a project like de Winter’s is that it becomes an exercise in iconoclasm — one that unfairly judges yesterday’s scientists by today’s standards. De Winter is aware of that risk. “It was a different time,” he says. “But Hess seemed very eager to believe what he was observing as well.”

Does this devalue the heritage of Eckhard Hess? It’s not a revelation that Hess tended to exaggerate his findings and might have suffered from an overdose of imagination. According to Bob Snowden, pioneers like Hess are critical to the scientific enterprise — even when they get things wrong. “In the current age of preregistration and strict protocols, we should give ‘cowboys’ like Hess space to make new discoveries,” he says. Afterward, he adds, other groups can take their time to more rigorously retest the findings.

Not everyone shares that view. “There is never a good excuse for sloppy science,” says Eric-Jan Wagenmakers, a mathematical psychologist at the University of Amsterdam. “Wild ideas should be executed as thoroughly as dull ones. So those cowboys need rigorous colleagues to collaborate with, or other teams should try to replicate their findings right away.”

But we’re now more than 50 years out from Hess’s original experiment. One wonders: What’s the use of replicating such an old finding, when the field has moved on? On this crucial point, Daniel Lakens, a psychologist and methodologist at Eindhoven University of Technology in the Netherlands who co-initiated the NWO replication fund, argues that replication is about much more than simply confirming or refuting results. We should check results not just to see if they hold up but to investigate whether we still comprehend what the original authors did, he says. “After the 500th citation, check it. After the 1000th, check it again. Remaining critical of the findings we’re citing the most doesn’t seem to me like such a bad idea at all.”

Jop de Vrieze is a freelance science writer based in Amsterdam.