Opinion: The Advanced Placement Exams’ Grading System Gets Low Marks

Scoring of the popular pre-college exams is marred by inconsistencies and irregularities, writes a former grader.


Every May, millions of high school students in the United States and across the globe take specialized tests known as Advanced Placement (AP) exams. Offered for 38 subjects, ranging from staples like calculus and physics to specialized topics including computer science and psychology, the exams are high-stakes affairs. For the students who take them, a good AP exam score can enhance their academic transcripts, boost their chances of winning scholarships, help them gain admission to top-tier institutions, and earn them college credit, potentially saving thousands of dollars in tuition costs. At the more than 22,000 high schools where AP curricula are offered, taking the exams has become a rite of spring.

Afterward, a different kind of ritual takes place: Thousands of test readers — high school or college instructors with teaching experience in the exam subjects — gather to simultaneously grade the exams over the course of two week-long sessions. Multiple-choice responses are graded electronically, but answers to free-response questions — which, depending on the subject, can account for about a third to more than half of a student’s score — are graded by humans. Prior to the Covid-19 pandemic, exam readers gathered in a Cincinnati convention hall, where they were grouped into tables, with each table assigned one multipart question to score; post-Covid exams have been graded online, with readers grouped within virtual tables.

I know this because for four years, I served as a reader for the AP environmental science exam, and what I saw troubled me: Despite steps taken by exam administrators to standardize the scoring process, I noted multiple inconsistencies and irregularities in how exams were scored. As a seasoned instructor, I worry about the impact these irregularities might be having on students’ lives and learning outcomes: The points associated with a “borderline response” can make the difference between receiving college credit or not.

The problems largely center on the ever-changing exam scoring rubrics — the official lists of approved responses that readers use to assess answers. At the beginning of scoring week, table leaders talk their readers through each rubric and use sample student responses to calibrate the readers’ scoring accuracy for each part of their assigned question. Once scoring begins, table leaders constantly double-check scored responses to ensure readers are assigning points in accordance with the rubric.

Inevitably, subjectivity seeps into the process. Take, for example, this free-response question from the 2021 AP environmental science exam, which at one point asks the student to “identify one natural mechanism of soil erosion.” If an exam taker interprets that directive to include both primary and secondary mechanisms of soil erosion — a reasonable reading of the question — acceptable answers could include wind, precipitation, and flowing water as primary mechanisms, but also topsoil removal and wildfire as secondary mechanisms that facilitate erosion. But very possibly, a response like wildfire, which has only recently been recognized as a major contributor to soil erosion — and which is largely absent from the eastern U.S., where the big testing companies are based — might initially be missing from the rubric.

When a scientifically valid response isn’t included in the rubric, a reader has two choices: They can follow the rubric to the letter and mark the response wrong, penalizing the student for a novel but correct response; or they can bring it to the attention of their supervisors, who discuss the response with the reader and then pass judgment on whether to accept the answer. If the supervisors approve the answer, it’s marked correct and then added to the official rubric. But even then — as I discovered during my years reading exams — the current system doesn’t allow for retroactive correction of previously scored exams on which students gave the same novel answer. This means that when novel answers are added to the rubric over the course of the scoring week, exams graded later in the week are more likely to receive higher scores.

To be clear, the wildfire scenario is hypothetical. (Although official scoring rubrics of prior exams are available online, as part of a contractual agreement I signed I’m not allowed to discuss specifics of students’ responses.) But during my time as an exam reader, I witnessed similar cases where valid responses — responses that, as a university instructor, I would find acceptable on a college-level exam — were marked incorrect simply because they were not included in the original rubric. Other exam graders have told me of similar experiences.

Each year, when I flagged valid student responses that weren’t included in a given question’s rubric, the outcomes depended on the personalities of my table leader, the exam question writer, and other supervisors. Sometimes, my proposed additions were received with curiosity and flexibility — and with respect for the science and my scientific expertise — and the scoring rubric was altered. In other cases, they were met with suspicion and rigidity, nothing was changed, and every student who gave the correct yet off-rubric answer lost the point.

Some exam readers privately told me of cases where leaders of two different tables assigned to score the same question disagreed on the acceptability of a valid off-rubric response. (The readers indicated that in these cases, they were obligated to go with their table leaders’ decisions, and a response marked correct at one table might be marked incorrect at a different table.)

While the broader impacts of such scoring failures are almost impossible to quantify, the impact on an individual can be considerable. Every point is crucial on an AP exam, and points lost unfairly could initiate a costly domino effect: A student’s final exam score could suffer, as could their academic transcripts, their odds of college admission, their chances at securing financial aid, and their time to graduation.

The Educational Testing Service (ETS) — the nonprofit that contracts with the College Board to administer the AP exams and develops and administers other standardized tests, including the Graduate Record Examination (GRE) and the Test of English as a Foreign Language (TOEFL) — has come under fire for scoring irregularities before. In 2006, the organization paid out $11.1 million to settle a class action lawsuit over scoring of a middle- and high-school teacher certification test. “About 27,000 people who took the exam received lower scores than they should have, and 4,100 of them were wrongly told they had failed,” reported The New York Times. (The National Center for Fair and Open Testing, or FairTest — an advocacy organization that advised the lawyers representing the plaintiffs in the suit — provided me with a copy of the settlement agreement, shared tax documents filed by the College Board, and clarified for me the relationship between ETS and the College Board. A FairTest attorney who litigated a different case against ETS provided a legal review of an early draft of this essay.)
Students and teachers deserve better from ETS and the College Board — and so does the American public. According to recent tax filings, the College Board nets close to $500 million annually for its AP exams alone, and a portion of that money comes from U.S. taxpayers: The Every Student Succeeds Act provides funding to states and districts to subsidize AP exam fees for low-income students, fees that range from $94 to $143 per exam. In other words, we all have a stake in the exams’ fairness and transparency.

In response to the Covid-19 pandemic, the College Board took significant steps to invest in improved technology for students preparing for and taking online AP exams. ETS should take similar steps to shore up exam scoring. Whenever an exam reader encounters a novel, off-rubric response that they believe is scientifically valid, the response should be vetted by an on-call pool of university instructors serving as independent reviewers, who can assist with updating the scoring guidelines as needed. The scoring software used by ETS should also be updated to ensure that whenever a new response is added to the rubric of acceptable answers for a free-response question, previously graded responses to that question from that year’s exam are revisited and rescored.

Administered wisely, AP exams can be a force for educational good, promoting intellectual curiosity and critical thinking skills that will serve students for a lifetime. By the time someone sits down to take an exam, they will have invested many months preparing for it, and their college careers may hang in the balance. At the very least, we should give everyone a fair shake.
Jeanine Pfeiffer taught undergraduate and graduate courses in the biological and environmental sciences at the University of California, Davis, San Diego State University, and San Jose State University for 22 years. She currently provides strategic advising for tribal nations, government agencies, environmental nonprofits, and field practitioners.