Limits to Growth: Can AI’s Voracious Appetite for Data Be Tamed?

In the spring of 2019, artificial intelligence datasets started disappearing from the internet. Such collections — typically gigabytes of images, video, audio, or text data — are the foundation for the increasingly ubiquitous and profitable form of AI known as machine learning, which can mimic various kinds of human judgments such as facial recognition.

In April, it was Microsoft’s MS-Celeb-1M, consisting of 10 million images of 100,000 people’s faces — many of them celebrities, as the name suggests, but also many who were not public figures — harvested from internet sites. In June, Duke University researchers withdrew their multi-target, multi-camera dataset (DukeMTMC), which consisted of images taken from videos, mostly of students, recorded at a busy campus intersection over 14 hours on a day in 2014. Around the same time, people reported that they could no longer access Diversity in Faces, a dataset of more than a million facial images collected from the internet, released at the beginning of 2019 by a team of IBM researchers.

Altogether, about a dozen AI datasets vanished — hastily scrubbed by their creators after researchers, activists, and journalists exposed an array of problems with the data and the ways it was used, from privacy violations to race and gender bias to human rights concerns.

The problems originate in the mundane practices of computer coding. Machine learning reveals patterns in data — such algorithms learn, for example, how to identify common features of “cupness” from processing many, many pictures of cups. The approach is increasingly used by businesses and government agencies; in addition to facial recognition systems, it’s behind Facebook’s news feed and targeting of advertisements, digital assistants such as Siri and Alexa, guidance systems for autonomous vehicles, some medical diagnoses, and more.
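In code, that pattern-finding step is deceptively compact. The sketch below is a minimal illustration of supervised learning rather than any company’s production system; it uses the open-source scikit-learn library and its small bundled collection of handwritten digits in place of cup photos, showing the model labeled examples and letting it infer the distinguishing patterns on its own.

```python
# A minimal sketch of supervised machine learning: the model infers patterns
# ("what makes an 8 look like an 8") purely from labeled example images.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # ~1,800 labeled 8x8 images of handwritten digits
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)  # a simple stand-in for a deep network
model.fit(X_train, y_train)                # "learning" = fitting patterns in the data

print(f"accuracy on images the model has never seen: {model.score(X_test, y_test):.2f}")
```

Everything after this step, from news feeds to facial recognition, rests on some much larger version of that training step, and therefore on the data that feeds it.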

To learn, algorithms need massive datasets. But as the applications grow more varied and complex, the rising demand for data is exacting growing social costs. Some of those problems are well known, such as the demographic skew in many facial recognition datasets toward White, male subjects — a bias passed on to the algorithms.

But there is a broader data crisis in machine learning. As machine learning datasets expand, they increasingly infringe on privacy by using images, text, or other material scraped without user consent; recycle toxic content; and are the source of other, more unpredictable biases and misjudgments.

“You’ve created this system in industry that rewards shady practices, rewards people stealing data from the internet or scraping it from social networks and creating products with no standard of how well they work,” said Liz O’Sullivan, chief executive of Parity, a startup that provides companies with tools to build responsible AI systems.

Pushed by outside groups calling for more ethical data collection and by negative media attention, the industry has only recently begun to examine these problems — sometimes by pulling datasets offline. But many other large public datasets remain online, and, in general, a bigger-is-better imperative remains.

“There’s this race to have bigger datasets with more and more parameters,” said Daniel Leufer, a Brussels-based policy analyst at the digital rights organization Access Now. “So there’s this constant one-upmanship. That’s hugely problematic because it encourages the cheapest, laziest possible gathering of data.”

Some computer scientists are asking whether the brute-force approach of compiling ever-larger datasets is necessary to sustain machine learning, and whether the expansion can continue indefinitely. And they are looking at systemic reforms in data collection, combined with alternative techniques that use less data, to offer a potential exit ramp.


While machine learning dates back to the 1940s, it only really took off in the past decade. In 2012, a team of University of Toronto researchers trounced all comers in an object recognition contest based on the AI dataset ImageNet, which had been created by Princeton researchers just a few years before. Using a souped-up version of machine learning called deep learning, the team achieved an 85 percent accuracy rate, 10 percentage points higher than the previous year’s top team. By 2015 that figure topped 95 percent — surpassing that of humans, at least in laboratory settings — and people soon found myriad applications for the new approach.

To achieve better results, deep learning algorithms needed bigger datasets. Computer scientists found that in many cases deep learning scaled: The more data an algorithm was trained on, the more accurate it became.

This created a supply problem. These algorithms required data not only in quantity but also in quality. For object recognition, this meant photos with target items outlined and labeled, usually by a person (for instance, labeling a “cup”), and for language, sources of reasonably well-written text. Assembling quality data on large scales is costly and hard (less so, however, for giant technology companies with access to vast troves of customer information). Creating large datasets soon became AI’s version of sausage-making: tedious and difficult, with a high risk of using bad ingredients.

“The way in which datasets are selected for machine learning development right now is not a very mature process,” said Inioluwa Deborah Raji, a Mozilla Fellow who has studied the evolution of machine learning and datasets. “Among those who train and develop machine learning models, there’s often not a conscious awareness of the importance of their role as decision-makers in that process — and especially the selection of the datasets and the way in which their dataset represents, or does not represent, a specific population.”

ImageNet, the first public large-scale AI image dataset, is a case in point. A wildly ambitious attempt to map the world of everyday objects and people, the dataset would eventually include more than 14 million photos collected online and labeled through the cheap crowdsourcing service Amazon Mechanical Turk. For structure, the creators employed another dataset, WordNet, which mapped the English language in a tree-like form.
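WordNet itself is freely available, and its tree-like structure is easy to inspect. The short sketch below uses the NLTK library to walk the same kind of noun hierarchy that ImageNet’s creators borrowed, climbing from “cup” toward the most general categories; it illustrates WordNet’s structure only, not ImageNet’s actual construction pipeline.

```python
# Walk WordNet's noun hierarchy, the tree-like structure ImageNet borrowed.
# Requires: pip install nltk (plus a one-time download of the WordNet corpus).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

synset = wn.synset("cup.n.01")   # the first noun sense of "cup"
print(synset.definition())

# Climb from "cup" toward the root of the tree, one hypernym (parent) at a time.
chain = [synset]
while chain[-1].hypernyms():
    chain.append(chain[-1].hypernyms()[0])
print(" -> ".join(s.name() for s in chain))
# e.g. cup.n.01 -> crockery.n.01 -> tableware.n.01 -> ... -> entity.n.01
```

The same hierarchy that files “cup” under “tableware” also contains thousands of nouns describing people, and those categories came along with the structure.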


In 2019, AI researcher Kate Crawford, then with the AI Now Institute at New York University, and the visual artist Trevor Paglen published a report on ImageNet’s data. The authors concluded that the resulting picture was distorted. For instance, WordNet contained a number of slurs and offensive terms, since it is simply a record of the English language, and some of the labelers applied those terms to specific images of people, creating a collection of thousands of derogatory insults in picture form.

Crawford and Paglen wrote that the dataset contained “many racist slurs and misogynistic terms,” and that as they dug deeper, the classifications took “a sharp and dark turn”: “There are categories for Bad Person, Call Girl, Drug Addict, Closet Queen, Convict, Crazy, Failure, Flop, Fucker, Hypocrite, Jezebel, Kleptomaniac, Loser, Melancholic, Nonperson, Pervert, Prima Donna, Schizophrenic, Second-Rater, Spinster, Streetwalker, Stud, Tosser, Unskilled Person, Wanton, Waverer, and Wimp.”

This problem wasn’t limited to ImageNet. Many datasets created before and since use WordNet as a structural template. For instance, in a paper published earlier this year, researchers Vinay Prabhu, chief scientist at the biometrics company UnifyID, and Abeba Birhane, a doctoral candidate in cognitive and computer science at University College Dublin, examined two other large image datasets — Tiny Images, which includes 80 million images, and Tencent ML-Images, which includes 18 million — that also used the WordNet template, and found similar sets of offensive labels applied to people.

Such flaws suggest deeper problems with machine learning, as it is deployed for increasingly complex real-world applications, said Crawford, now a professor at the University of Southern California’s Annenberg School and the inaugural chair of AI and Justice at the École Normale Supérieure in Paris. “It really goes to the core of what supervised machine learning thinks it’s doing when it takes vast amounts of data — say, for example, discrete images — and then uses it to build a worldview as though that is a straightforward, uncomplicated and somehow objective task,” she said. “It is none of these things. It is intensely complicated and highly political.”


While vacuuming up public data may once have seemed like a harmless practice to AI researchers, it has had many real-world consequences. For image datasets, invasion of privacy may be the most serious. The public Google AI training dataset Open Images, which is currently still online, is made up of nine million photos with 600 labeled categories. Recent patent applications show it has been used for a variety of purposes — for example, by the Chinese police in creating a distracted-driving recognition application; by the Royal Bank of Canada to address issues of training algorithms with partial or incomplete labels; and by the Chinese technology company Huawei in automating image processing tasks.

Like many other datasets, Open Images harvested photos from public Flickr accounts, pulling in personal photos of children playing on beaches, women in bikinis, and people drinking. One of the photos, a horizontal close-up of a baby’s face, was posted in the 2000s by Marie, a North Carolina woman. When Undark alerted Marie about the use of the photo in Google’s database, she responded: “Oh, that’s gross. You think you’re not sharing things and then you find out oh, there’s just a loophole.” (Marie asked to be identified by her first name only, because she said her privacy was violated and she did not want to expose her identity again.)

Many people like Marie don’t realize that when they use Flickr, YouTube, Facebook, or other social media networks, their information is “empowering the neural networks run by those organizations,” said Adam Harvey, founder of Exposing AI, which tracks privacy violations in datasets used for facial recognition and other applications. “These photographs have become an important part of a data supply chain that powers the global biometrics industry.” And while many companies’ terms of service grant them permission to use that biometric data (typically separated from personally identifiable information), some companies aren’t bothering to obtain permission at all. The facial recognition company Clearview AI, for instance, reportedly sells access to a dataset it created by scraping billions of pictures of faces from social networks and other sites, a practice those sites nominally ban.

A selection of face images collected by IBM from Flickr for the Diversity in Faces dataset. Many people don’t realize their social media photos may be used by such databases. Faces are blurred to protect privacy. Visualization by Adam Harvey/Exposing.ai; Original images via Flickr

This practice faces legal restrictions in only a few places, such as Illinois, where the state’s Biometric Information Privacy Act requires express permission for acquiring personal data. (Earlier this month, the European Parliament called for restricting the use of biometric data, along with facial recognition, among member states.) And individual organizations typically allow the data collection. For instance, Creative Commons, the organization that developed the licensing system used by Flickr, says the practice does not violate its licenses: “No special or explicit permission is required from the licensor to use CC-licensed content to train AI applications to the extent that copyright permission is required at all.” In an email to Undark, Google spokesperson Jason Freidenfelds cited Creative Commons’ statement in support of projects like Open Images and said: “This is an active topic of discussion across the research community, and we’re attuned to how leading organizations in this area are approaching it. We’re working with a range of external research groups to improve on datasets for [machine learning].”

Still, when the collection of personal images in datasets is publicized — as in the case of IBM’s Diversity in Faces, the subject of an NBC News story — companies and researchers often end up taking the datasets offline to quiet any backlash.


Meanwhile, many datasets have continued to get bigger; since 2014, Google, for instance, has been developing a proprietary image dataset known as JFT-300M that now contains 300 million photos harvested, Google researchers wrote in their paper, from “all over the web.” The team wants to make it even bigger, too. “We should explore if models continue to improve in a meaningful way,” they wrote in a blog post, “in the regime of even larger (1 billion+ image) datasets.”

The trend encourages researchers to cut corners. “If the incentive is to get a big dataset as cheaply as possible, then you don’t have resources being allocated to careful construction and curation of that dataset,” said Emily Bender, a linguistics professor at the University of Washington and an expert on natural language processing, which uses machine learning involving text and speech. As the industry looks for bigger and bigger datasets, she added, it also puts pressure on smaller research groups and companies to try to keep up.

The proliferation of large datasets exposes people to various potential harms. “Reverse image search is growing at a scorching pace,” said UnifyID’s Prabhu. “It is so trivial for someone to write a simple tool to take all of these images, pass them through reverse image search, and figure out who these people are in real life.”

Large datasets may also be making it easier for the algorithms themselves to connect training images to people’s real identities. A 2020 study by researchers at the University of Waterloo in Canada found that ArcFace, an open-source facial recognition application, was modestly better at recognizing people whose images were in its training dataset than people whose images were not. In other words, if a person’s photo is in a training dataset, they will be more visible to facial recognition systems.
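The Waterloo finding is an instance of what researchers call a membership inference test. The sketch below illustrates the underlying idea; the embed function is a hypothetical stand-in for a trained face recognition model such as ArcFace, and the setup is a simplified illustration, not the study’s actual protocol. If people whose photos were in the training set consistently score higher than people whose photos were not, the model is leaking who it was trained on.

```python
# Sketch of a membership-inference check on a face recognition model.
# `embed` is a hypothetical stand-in for a real model that maps a face
# image to a feature vector; the data structures here are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_score(embed, probe_img, gallery_imgs) -> float:
    """Best similarity between one photo of a person and their other photos."""
    probe = embed(probe_img)
    return max(cosine(probe, embed(g)) for g in gallery_imgs)

def membership_gap(embed, in_training, out_of_training) -> float:
    """Average match score for people whose photos were in the training set,
    minus the average for people who were not. A persistent positive gap
    suggests the model 'remembers' the subjects it was trained on."""
    in_scores = [match_score(embed, probe, gallery) for probe, gallery in in_training]
    out_scores = [match_score(embed, probe, gallery) for probe, gallery in out_of_training]
    return float(np.mean(in_scores) - np.mean(out_scores))
```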

Large text-based datasets also raise personal security concerns. Huge amounts of data scraped from the internet can contain birth dates and Social Security numbers. A 2020 report by researchers with Google, OpenAI, Apple, Stanford, the University of California, Berkeley, and Northeastern University found that certain language models can essentially be reverse-engineered to output information from their underlying training data, and that larger models are more susceptible to such attacks than smaller ones.
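At a high level, the attack described in that report works by sampling heavily from a model and sifting the output for strings that look like memorized personal data. The toy sketch below gestures at the idea using the freely available GPT-2 model through the Hugging Face transformers library and a crude pattern match; it is an illustration of the general approach, not a reproduction of the paper’s method, and a pattern hit is not by itself proof of memorization.

```python
# Toy illustration of a training-data extraction probe: sample from a public
# language model and flag outputs that match personal-data-like patterns.
# Requires: pip install transformers torch
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Strings shaped like phone numbers or U.S. Social Security numbers.
PII_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{2,3}[-.\s]\d{4}\b")

samples = generator(
    "Contact information:",
    max_new_tokens=50,
    num_return_sequences=20,
    do_sample=True,
)

for sample in samples:
    hits = PII_PATTERN.findall(sample["generated_text"])
    if hits:
        # A real attack would filter candidates far more carefully, for
        # example by weighing the model's confidence in each string.
        print("candidate memorized string(s):", hits)
```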


When employees within large companies have pushed back, it has sometimes backfired. In the fall of 2020, researchers at Google and the University of Washington submitted a paper on large language models and the datasets used to train them to a computer science conference. The paper asked: “How big is too big?” Among other things, the authors noted that collecting training material from internet sites such as Reddit means algorithms end up recycling demographic biases and toxic language that harm disadvantaged groups.

Soon after, Google fired one of the paper’s authors, Timnit Gebru, from her job as co-leader of the company’s Ethical AI team, in part because the paper “painted too bleak a picture of the new technology,” according to Wired. The other co-leader, Margaret Mitchell, was fired shortly afterward. (Bender, from the University of Washington, was also a co-author.)

“It seems to me that the big internet companies are very reluctant to even talk about this because it threatens their core business,” said Walter Scheirer, a computer scientist at the University of Notre Dame. “They really are trying to collect as much data from users as possible so they can build products with that as the source material, and if you start limiting that, it really constrains what they can do — or at least what they think they can do.”


Faced with a wave of recent exposures of bias and privacy violations, many technology companies, along with academic researchers, have launched efforts to address specific dataset problems.

In January 2020, for instance, ImageNet administrators published a paper in which they acknowledged the dataset’s label problems. Of its 2,832 “people” categories, 1,593 — 56 percent — “are potentially offensive labels that should not be used in the context of an image recognition dataset,” the authors wrote. Of the categories that were left, only 158 were purely visual, according to the paper, “with the remaining categories simply demonstrating annotators’ bias.” Those non-visual categories were also removed. Ultimately, only 6 percent of ImageNet’s original categories for people remain.

Critics say it’s not enough. “I think we have a much bigger problem in the field,” said Crawford. “You’ll see that this is a very common response in the tech sector generally. When researchers spend a lot of time showing them why a system has very deep problems on the conceptual level, their response is just to delete a few of these problematic labels.”

Other researchers are looking for ways to build better labels from the beginning. The Data Nutrition Project, for instance, creates labels for datasets documenting where they came from and how they were assembled. The transparency can encourage better practices, said Kasia Chmielinski, a project co-founder and a digital expert with McKinsey and Company. The project name was inspired by nutrition labels. “I can see what’s inside this can of Coke before I drink it, and can tell whether it’s healthy for me,” Chmielinski said. “Why can’t we do the same with a dataset before we build a model with it?” So far, the project has created labels for datasets of payments to doctors from drugmakers and medical device companies, medical images used to diagnose melanoma, and New York City property tax bills.
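In practice, such a label is just structured documentation that travels with the data. The fragment below sketches what one might contain; the field names are hypothetical, loosely inspired by the Data Nutrition Project and related dataset-documentation proposals, not the project’s actual schema.

```python
# An illustrative dataset "nutrition label" as structured metadata.
# Field names are hypothetical, not the Data Nutrition Project's schema.
from dataclasses import dataclass, field

@dataclass
class DatasetLabel:
    name: str
    source: str                       # where the data came from
    collection_method: str            # scraped, licensed, collected with consent...
    consent_obtained: bool
    contains_personal_data: bool
    known_gaps_and_biases: list = field(default_factory=list)
    intended_uses: list = field(default_factory=list)
    prohibited_uses: list = field(default_factory=list)

label = DatasetLabel(
    name="example-faces-v1",
    source="photo-sharing site, public accounts",
    collection_method="web scraping",
    consent_obtained=False,
    contains_personal_data=True,
    known_gaps_and_biases=["skews toward North American users"],
    intended_uses=["research on image quality"],
    prohibited_uses=["identifying individuals"],
)
print(label)
```

A reader, a reviewer, or a regulator can tell at a glance whether this is a dataset they want to build a model on, which is the point of the nutrition-label analogy.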

Facial recognition companies have also improved performance across demographic groups in tests administered by the National Institute of Standards and Technology (though researchers have found collections of problematic images in NIST’s own facial recognition datasets, including of children exploited for pornography, U.S. visa applicants, and dead people). And in 2021, Google launched Open Images Extended, which uses crowdsourced photos from around the world to improve geographical diversity and blurs faces to protect privacy.

Still other researchers are exploring different ways to train algorithms. A technique called self-supervised learning, for example, trains algorithms with unlabeled datasets. Large technology companies including Google and Facebook are experimenting with this approach. A Facebook model known as SEER (SElf-supERvised), introduced in March 2021, “can learn from any random group of images on the internet — without the need for careful curation and labeling that goes into most computer vision training today,” a company blog post says.
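One common flavor of self-supervised learning is contrastive: instead of human labels, the training signal is the rule that two randomly augmented views of the same photo should land near each other in the model’s feature space. The sketch below is a minimal, SimCLR-style version of that loss, written with PyTorch as an illustration of the general technique; it is not Facebook’s SEER, which uses a different, clustering-based objective.

```python
# Sketch of a contrastive self-supervised loss: no labels, just the rule that
# two augmented views of the same photo should produce similar embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: embeddings of two augmented views of the same batch of images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # shape (2N, d)
    sim = z @ z.t() / temperature             # pairwise similarities
    sim.fill_diagonal_(-1e9)                  # an image can't match itself
    n = z1.size(0)
    # For each view, the "correct" match is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: encode two random augmentations of each unlabeled image with any
# backbone network, then minimize contrastive_loss(encoder(aug1), encoder(aug2)).
```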

Another method called dataset distillation, meanwhile, is based on the idea that a relatively small dataset with the most effective examples can work just as well as a giant one. In one experiment, researchers at Facebook and the Massachusetts Institute of Technology distilled a dataset of 60,000 images of handwritten numerals down to 10 synthetic images that were created by a computer, and achieved training results nearly as good. Here as well, researchers at a number of large companies including Google and Huawei are exploring the technique. (The use of synthetic data could also address privacy violations — for example, datasets composed of fake but photorealistic faces created by algorithms.)
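In skeletal form, the idea fits in a short loop: treat a handful of synthetic examples as learnable parameters, briefly “train” a throwaway model on them, then nudge the synthetic examples so that the briefly trained model does better on real data. The sketch below, written with PyTorch, is a toy single-step version of that loop for a linear classifier; the published work unrolls longer training runs and uses neural networks.

```python
# Toy sketch of dataset distillation: optimize a few synthetic examples so
# that a model trained on them (here, one gradient step of a linear
# classifier) performs well on real data. Illustrative only.
import torch
import torch.nn.functional as F

def distill(real_x, real_y, n_classes, n_synthetic=10, steps=500, inner_lr=0.1):
    d = real_x.size(1)
    syn_x = torch.randn(n_synthetic, d, requires_grad=True)  # learnable "images"
    syn_y = torch.arange(n_synthetic) % n_classes             # fixed labels
    opt = torch.optim.Adam([syn_x], lr=0.01)

    for _ in range(steps):
        # A fresh random model each step, so the synthetic set has to teach
        # many different initializations, not just one lucky one.
        w = (0.01 * torch.randn(d, n_classes)).requires_grad_(True)

        # Inner step: "train" the model on the synthetic data.
        inner_loss = F.cross_entropy(syn_x @ w, syn_y)
        (grad_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_trained = w - inner_lr * grad_w

        # Outer step: judge the trained model on real data and push the
        # gradient back into the synthetic examples themselves.
        outer_loss = F.cross_entropy(real_x @ w_trained, real_y)
        opt.zero_grad()
        outer_loss.backward()
        opt.step()

    return syn_x.detach(), syn_y
```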


Other groups are building small, curated datasets of video and images of people. In April 2021, Facebook released a dataset called Casual Conversations consisting of such data from 3,100 people of diverse racial and ethnic backgrounds, taken with permission. The richness of this kind of data — images from many angles and in varied lighting — obviates the need for a huge number of subjects, said Cristian Canton Ferrer, the Facebook research manager who oversaw the dataset’s construction. But producing it required more money (subjects were paid), more time, and more work hours than mass data collection, he noted. Subjects can also opt out at any time, which the team must track.

“There is growing evidence of the fact that you technically do not need that much data, which is kind of a volte face on all of the transgressions that have happened in the past few years,” said Prabhu.

Still, once published, datasets take on a life of their own. In the three years that the DukeMTMC dataset was publicly available, for example, dozens of companies and government agencies cited it in papers on facial recognition projects, including companies supplying technology for Chinese government surveillance of its oppressed Uighur minority. After the dataset’s creators pulled it from the internet, it kept circulating anyway. Since then, studies by Chinese scientists have used it to study the surveillance challenges of tracking people when the field of view is partially blocked and when subjects change clothing. A recent study by researchers at Princeton University found that versions of DukeMTMC and MS-Celeb-1M have been used and cited hundreds of times in research since being taken down and remain available on various websites.

As the new versions of the databases circulate, the original ethical violations remain online — and the datasets are often modified and used in ways the creators never contemplated. “The problems arise and they’re all, ‘sorry, we shouldn’t have done that,’” Leufer, from the digital rights organization Access Now, said of dataset retractions. “At which point it’s far too late to do anything about it.”


John McQuaid is a journalist and author. He reported this story while a fellow at the Woodrow Wilson International Center for Scholars in Washington, D.C. He is currently a Ph.D. student at the University of Maryland Merrill College of Journalism.