The Advanced Placement Exams’ Grading System Gets Low Marks

Every May, millions of high school students in the United States and across the globe take specialized tests known as advanced placement (AP) exams. Offered for 38 subjects, ranging from staples like calculus and physics to specialized topics including computer science and psychology, the exams are high-stakes affairs. For the students who take them, a good AP exam score can enhance their academic transcripts, boost their chances of winning scholarships, help them gain admission to top-tier institutions, and earn them college credit, potentially saving thousands of dollars in tuition costs. At the more than 22,000 high schools where AP curricula are offered, taking the exams has become a rite of spring.

Afterward, a different kind of ritual takes place: Thousands of test readers — high school or college instructors with teaching experience in the exam subjects — gather to simultaneously grade the exams over the course of two week-long sessions. Multiple choice responses are graded electronically, but answers to free-response questions — which, depending on the subject, can account for about a third to more than half of a student’s score — are graded by humans. Prior to the Covid-19 pandemic, exam readers gathered in a Cincinnati convention hall, where they were grouped into tables, with each table assigned one multipart question to score; post-Covid exams have been graded online, with readers grouped within virtual tables.

I know this because for four years, I served as a reader for the AP environmental science exam, and what I saw troubled me: Despite steps taken by exam administrators to standardize the scoring process, I noted multiple inconsistencies and irregularities in the exam scoring. As a seasoned instructor, I worry about the impacts these irregularities might be having on students’ lives and learning outcomes, where points associated with a “borderline response” can make the difference between receiving college credit or not.

The problems largely center on the ever-changing exam scoring rubrics, the official lists of approved responses that readers use to assess answers. At the beginning of scoring week, table leaders talk their readers through each rubric and use sample student responses to calibrate the readers’ scoring accuracy for each part of their assigned question. Once scoring begins, table leaders constantly double-check scored responses to ensure readers are assigning points in accordance with the rubric.

Inevitably, subjectivity seeps into the process. Take, for example, this free-response question from the 2021 AP environmental science exam, which at one point asks the student to “identify one natural mechanism of soil erosion.” If an exam taker interprets that directive to include both primary and secondary mechanisms of soil erosion — a reasonable reading of the question — acceptable answers could include wind, precipitation, and flowing water as primary mechanisms, but also topsoil removal and wildfire as secondary mechanisms that facilitate erosion. But very possibly, a response like wildfire, which has only recently been recognized as a major contributor to soil erosion — and which is largely absent from the eastern U.S., where the big testing companies are based — might initially be missing from the rubric.

When a scientifically valid response isn’t included in the rubric, a reader has two choices: They can follow the rubric to the letter and mark the response wrong, penalizing the student for a novel but correct response; or they can bring it to the attention of their supervisors, who discuss the response with the reader and then pass judgment on whether or not to accept the answer. If the supervisors approve the answer, it’s marked correct and then added to the official rubric. But even then, as I discovered during my years reading exams, the current system doesn’t allow for retroactive correction of previously scored exams in which students gave the same novel answer. This means that when novel answers are added to the rubric over the course of the scoring week, exams graded later in the week are more likely to receive higher scores.

To be clear, the wildfire scenario is hypothetical. (Although official scoring rubrics of prior exams are available online, as part of a contractual agreement I signed I’m not allowed to discuss specifics of students’ responses.) But during my time as an exam reader, I witnessed similar cases where valid responses — responses that, as a university instructor, I would find acceptable on a college-level exam — were marked incorrect simply because they were not included in the original rubric. Other exam graders have told me of similar experiences.

Each year, when I flagged valid student responses that weren’t included in a given question’s rubric, the outcomes depended on the personalities of my table leader, the exam question writer, and other supervisors. Sometimes, my proposed additions were received with curiosity and flexibility, and with respect for the science and my scientific expertise, and the scoring rubric was altered. In other cases, they were met with suspicion and rigidity, nothing was changed, and every student who gave the correct yet off-rubric answer lost the point.

Some exam readers privately told me of cases where leaders of two different tables assigned to score the same question disagreed on the acceptability of a valid off-rubric response. (The readers indicated that in these cases, they were obligated to go with their table leaders’ decisions, and a response marked correct at one table might be marked incorrect at a different table.)

While the broader impacts of such scoring failures are almost impossible to quantify, the impact on an individual can be considerable. Every point is crucial on an AP exam, and points lost unfairly could initiate a costly domino effect: A student’s final exam score could suffer, as could their academic transcripts, their odds of college admission, their chances at securing financial aid, and their time to graduation.

The Educational Testing Service (ETS) — the nonprofit that contracts with the College Board to administer the AP exams and develops and administers other standardized tests, including the Graduate Record Examination (GRE) and the Test of English as a Foreign Language (TOEFL) — has come under fire for scoring irregularities before. In 2006, the organization paid out $11.1 million to settle a class action lawsuit over scoring of a middle- and high-school teacher certification test. “About 27,000 people who took the exam received lower scores than they should have, and 4,100 of them were wrongly told they had failed,” reported The New York Times. (The National Center for Fair and Open Testing, or FairTest — an advocacy organization that advised the lawyers representing the plaintiffs in the suit — provided me with a copy of the settlement agreement, shared tax documents filed by the College Board, and clarified for me the relationship between ETS and the College Board. A FairTest attorney who litigated a different case against ETS provided a legal review of an early draft of this essay.)

Students and teachers deserve better from ETS and the College Board — and so does the American public. According to recent tax filings, the College Board nets close to $500 million annually for its AP exams alone, and a portion of that money comes from U.S. taxpayers: The Every Student Succeeds Act provides funding to states and districts to subsidize AP exam fees for low-income students, fees that range from $94 to $143 per exam. In other words, we all have a stake in the exams’ fairness and transparency.

In response to the Covid-19 pandemic, the College Board took significant steps to invest in improved technology for students preparing for and taking online AP exams. Similar steps should be taken by ETS to shore up the exam scoring. Whenever an exam reader encounters a novel, off-rubric response that they believe is scientifically valid, the response should be vetted by an on-call pool of university instructors serving as independent reviewers, who can assist with updating the scoring guidelines as needed. The scoring software used by ETS should be updated to consistently ensure that whenever a new response is added to the rubric of acceptable answers for a free-response question, previously graded responses from that year’s exam are revisited and rescored.
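To make the rescoring proposal concrete, here is a minimal sketch in Python of what such an update could look like. It is an illustration under stated assumptions, not a description of the software ETS actually uses: the names `QuestionPart`, `ScoredResponse`, and `add_accepted_answer` are hypothetical, and matching answers by normalized text stands in for the human judgment that real free responses require.

```python
from dataclasses import dataclass, field


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


@dataclass
class ScoredResponse:
    exam_id: str          # hypothetical identifier for one student's exam
    answer: str           # the student's answer as written
    point_awarded: bool   # current scoring decision


@dataclass
class QuestionPart:
    """One part of a free-response question, with its rubric and scoring log."""
    accepted: set[str]
    scored: list[ScoredResponse] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Store rubric entries in normalized form so lookups are consistent.
        self.accepted = {normalize(a) for a in self.accepted}

    def score(self, exam_id: str, answer: str) -> bool:
        """Score one response against the current rubric and log it."""
        awarded = normalize(answer) in self.accepted
        self.scored.append(ScoredResponse(exam_id, answer, awarded))
        return awarded

    def add_accepted_answer(self, answer: str) -> list[str]:
        """Add an answer to the rubric, then revisit every logged response.

        Returns the exam ids whose score changed as a result.
        """
        self.accepted.add(normalize(answer))
        changed = []
        for response in self.scored:
            awarded_now = normalize(response.answer) in self.accepted
            if awarded_now != response.point_awarded:
                response.point_awarded = awarded_now
                changed.append(response.exam_id)
        return changed


# Usage, based on the hypothetical wildfire scenario described earlier.
part = QuestionPart(accepted={"wind", "precipitation", "flowing water"})
part.score("exam-001", "Wildfire")   # scored early in the week: no point
part.score("exam-002", "wind")       # on the original rubric: point
rescored = part.add_accepted_answer("wildfire")
print(rescored)
```

The design choice that matters here is that every scoring decision is logged alongside the answer that produced it, so a rubric change can be applied to exams graded earlier in the week as well as later ones.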

Administered wisely, AP exams can be a force for educational good, promoting intellectual curiosity and critical thinking skills that will serve students for a lifetime. By the time someone sits down to take an exam, they will have invested many months preparing for it, and their college careers may hang in the balance. At the very least, we should give everyone a fair shake.

Jeanine Pfeiffer taught undergraduate and graduate courses in the biological and environmental sciences at the University of California, Davis, San Diego State University, and San Jose State University for 22 years. She currently provides strategic advising for tribal nations, government agencies, environmental nonprofits, and field practitioners.

Comments are automatically closed one year after article publication. Archived comments are below.

Helen York

November 25, 2021 at 7:45 am

This is hardly a definitive assessment of the Advanced Placement Tests. While it is true that a certain amount of evaluation may be slanted by subjectivity, readers should be aware that the essay is only part of the entire test. There is also a very standardized multiple choice test which is taken alongside the essay portion in assigning scores.
Secondly, in the example the author gives of a test question asking for responses, the question only asks for ONE natural process. Leaving out wildfires because the student lives in the Eastern part of the country would hardly have much effect on the total outcome.
The concerns I have with the program include the wholesale ubiquity of test-taking, while many of the students who take the tests need remedial help, not advanced placement.
In general, however, I think any programs that can encourage students to study at higher levels, while saving them money in tuition, are to be praised!

Chandra

October 2, 2021 at 1:30 pm

My doubts about the rubric and false scoring have been strengthened: the rubric they are following is meaningless. I say this because a student with an A+ grade from school in a subject scored a 1 on the AP test for the same subject, which is unbelievable.
In addition, their customer service is horrible. Even though they take your call after a long queue, their answer/response would not satisfy the purpose of calling them. I am really frustrated.

C Lai

September 26, 2021 at 4:12 pm

I feel like this is probably subject driven. For Math and traditional Science subjects including Biology, the rubric is more likely to be well constructed and include almost all expected answers. But for the newer Science subjects like AP Environmental Science, which the author used as an example, it’s more likely that acceptable answers are omitted. For one, subject content is evolving faster to include the latest environmental developments; for another, the subject involves more “common sense” and “general knowledge” from everyday encounters, so students can be more creative in answering questions, giving answers that can be “borderline” and should be considered correct (but are not in the rubric).

Gary Feilich

September 21, 2021 at 9:21 pm

As a seasoned AP Bio reader since 1996, I can only speak of the grading process for this particular exam. It is certainly not perfect, but pre-reading and subsequent daily activities IMHO produce a quality product most of the time. We meet with our question leader, where we will read roughly 50 papers even before a discussion begins. Years ago, when we numbered 50-60 readers per question, the table leaders would have met earlier and worked out preliminary rubrics. But then feedback from the readers would allow for clarity and revisions. In one discussion a reader contested the rejection of his contribution. He became quite vocal, which resulted in the Question Leader declaring loudly, “I am the boss and I said no.” When you are a reader you agree to follow the rubrics. If an issue comes up, you can ask your table leader for a judgment call. To put it simply, I must have read well over 30 thousand FRQs and not one student ever wrote something that had not been covered in the rubrics. With some of the smartest people developing standards and random back reading which serves as an excellent means of controlling fluctuations of the graders

A. Black

September 21, 2021 at 10:08 am

Wildly inconsistent. And students cannot appeal if their score is a 3 or more. In addition, while colleges accept many of the courses’ credits, they rarely satisfy in-major requirements. My son started college with 27 AP credits, but only Calc AB & BC counted toward his major (Biomedical Engineering). Even Physics and Biology didn’t count. And that is at a major state university, not a private school.

Susie

September 21, 2021 at 8:30 am

As a former AP student, I feel as though we are not prepared enough to take the exam and the teachers are not provided with the right material to help their students pass. I feel as though I could have excelled way better in dual enrollment, and AP did nothing for me.
