Opinion: The Challenge of Preserving Good Data in the Age of AI

If artificial intelligence-created content floods the internet, who decides what online information is worth archiving?

Growing up, people of my generation were told to be careful of what we posted online, because “the internet is forever.” But in reality, people lose family photos shared to social media accounts they’ve long since been locked out of. Streaming services pull access to beloved shows, content that was never even possible to own. Journalists, animators, and developers lose years of work when web companies and technology platforms die.

At the same time, artificial intelligence-driven tools such as ChatGPT and the image creator Midjourney have grown in popularity, and some believe they will one day replace work that humans have traditionally done, like writing copy or filming video B-roll. Regardless of their actual ability to perform these tasks, though, one thing is certain: The internet is about to become deluged with a mass of low-effort, AI-generated content, potentially drowning out human work. This oncoming wave poses a problem to computer scientists like me who think about data privacy, fidelity, and dissemination daily. But everyone should be paying attention. Without clear preservation plans in place, we’ll lose a lot of good data and information.

Ultimately, data preservation is a question of resources: Who will be responsible for storing and maintaining information, and who will pay for these tasks to be done? Further, who decides what is worth keeping? Companies developing so-called foundation AI models are some of the key players wanting to catalog online data, but their interests are not necessarily aligned with those of the average person.

The costs of electricity and server space needed to keep data indefinitely add up over time. Data infrastructure must be maintained, in the same way bridges and roads are. Especially for small-scale content publishers, these costs can be onerous. Even if we could just download and back up the entirety of the internet periodically, though, that would not be enough. Just as a library is useless without some sort of organizational structure, preserved data is useless unless it is archived mindfully. Compatibility is also an issue. If someday we move on from saving our documents as PDFs, for example, we will need to keep older computers (with compatible software) around.

When saving all these files and digital content, though, we must also respect and work with copyright holders. Spotify spent over $9 billion on music licensing last year, for example; any public-facing data archival system would hold many times this amount of value. A data preservation system is useless if it’s bankrupted by lawsuits. This can be especially tricky if the content was made by a group, or if it’s changed hands a few times: even if the original creator of a work approves, someone else may still be out there protecting the copyright they bought.

Finally, we must be careful to archive only true and useful information, a task that has become increasingly difficult in the internet age. Before the internet, the cost of producing physical media (books, newspapers, magazines, board games, DVDs, CDs, and so on) naturally limited the flow of information. Online, the barriers to publishing are much lower, and thus a lot of false or useless information can be disseminated every day. When data is decentralized, as it is on the internet, we still need some way to make sure that we are promoting the best of it, however that is defined.

This has never been more relevant than now, on an internet plagued with AI-generated babble. Generative AI models such as ChatGPT have been shown to unintentionally memorize training data (leading to a lawsuit brought by The New York Times), hallucinate false information, and at times offend human sensibilities, all while AI-generated content has become increasingly prevalent on websites and social media apps.

In my opinion, because AI-generated content can simply be regenerated, we don’t need to preserve it. Many of the leading AI developers will not disclose how they collected their training data, but it seems overwhelmingly likely that these models are trained on vast amounts of data scraped from the internet. That is why even AI companies are wary of so-called synthetic data online degrading the quality of their models.

While manufacturers, developers, and average people can solve some of these problems, the government is in the unique position of having the funds and legal power to save the breadth of our collective intelligence. Libraries save and document countless books, movies, music, and other forms of physical media. The Library of Congress even keeps some web archives, mainly historical and cultural documents. However, this is not nearly enough.

The scale of the internet, or even just digital-only media, almost certainly far outpaces the current digital stores of the Library of Congress. Beyond that, digital platforms (think software like the now-obsolete Adobe Flash) must also be preserved. Much as conservators maintain and care for the books and other physical goods they handle, digital goods need technicians who care for and keep original computers and operating systems in working order. While the Library of Congress does have some practices in place for digitization of old media formats, those practices fail to meet the preservation demands of the vast landscape that is computing.

Groups like the Wikimedia Foundation and the Internet Archive do a great job of picking up the slack. The latter in particular keeps a thorough record of deprecated software and websites. However, these platforms face serious obstacles to their archival goals. Wikipedia often asks for donations and relies on volunteer input for writing and vetting articles. This has a host of problems, not least of which are the biases in which articles get written and how they are written. The Internet Archive also relies on user input, for example with its Wayback Machine, which may limit what data gets archived, and when. The Internet Archive has also faced legal challenges from copyright holders, which threaten its scope and survival.

Government, however, is not nearly as bound by these constraints. In my view, the additional funding and resources needed to expand the goals of the Library of Congress to archive web data would be almost negligible relative to the U.S. budget. The government also has the power to create necessary carve-outs to intellectual property in a way that is beneficial for all parties. See, for example, the New York Public Library’s Theatre on Film and Tape Archive, which has preserved many Broadway and off-Broadway productions for educational and research purposes even though these shows otherwise strongly forbid audiences from taking photos or videos of them. Finally, the government is, in theory, the steward of the public will and interest, which must include our collective knowledge and facts. Since any form of archiving involves some form of choosing what gets preserved (and by complement, what doesn’t), I don’t see a better option than an accountable public body making that decision.

Of course, just as analog recordkeeping did not end with physical libraries, data archiving should not end with this proposal. But it is a good start. Especially as politicians let libraries wither away (as they are doing in my home of New York City), it is more important than ever that we right the course. We must refocus our attention on updating our libraries, centers of information that they are, for the Information Age.


Peter Hall is a computer science graduate student at the New York University Courant Institute of Mathematical Sciences. His research is focused on the theoretical foundations of cryptography and technology policy.
