
By: Lindsay Walker

For many investigations, journalists rely on screen captures (images) or embeds to publish copies of the content they find on websites, government databases, and social media platforms. Screen captures are simply flat, pixel-based snapshots of what an observer saw on a website, and they contain no audit trail that records what the journalist captured.

Embedded social media posts can also be taken down or changed at any time, because the actual content remains hosted by the site or company that originally published it. This puts the embedded versions of posts at risk of disappearing, erasing the record of what a journalist is trying to reference.

Another limitation of these methods of citing web content is that there is no way to prove where an image came from or that the content has not been changed since it was originally captured. With the rise of generative AI, there is a greater risk that content found online is inaccurate or will be manipulated down the line.

The approach to web capture taken by Black Voice News and Starling Lab was different. Web archives created with the Webrecorder suite of tools enabled the team to capture the full context of a web page in a zipped archive called a WACZ (Web Archive Collection Zipped) file. The information collected includes all content on the page, such as articles, comments, likes, and other multimedia. A WACZ file is a copy of the code and media that make up the page, along with an index of what was captured. When users later display (or “replay”) the page with a compatible viewer, it remains fully interactive, just as it was at the time of capture.
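For the technically curious, a WACZ file can be opened with any standard zip tool. Below is a minimal sketch in Python, with a placeholder filename, that lists a WACZ’s contents and reads the datapackage.json manifest it contains:

```python
import json
import zipfile

# A WACZ file is an ordinary zip archive; "example.wacz" is a placeholder.
with zipfile.ZipFile("example.wacz") as wacz:
    # Contents typically include raw WARC data, CDX indexes, a page list,
    # and a datapackage.json manifest describing what was captured.
    for name in wacz.namelist():
        print(name)

    # The manifest records each packaged resource and its hash.
    manifest = json.loads(wacz.read("datapackage.json"))
    for resource in manifest.get("resources", []):
        print(resource.get("path"), resource.get("hash"))
```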

The Authenticated WACZ Display

The Starling Lab team, alongside developer Giacomo Boscani-Gilroy and Esri’s Joe Allen, prototyped a new kind of visual display for both web archives and their accompanying authenticity information. To create the Authenticated WACZ Displays you see in the Combatting Racism as a Public Health Crisis data dashboard and throughout the article series, the Black Voice News team tirelessly researched and collected content from across the web, which Starling Lab then archived and displayed in the data analysis tool.

This project’s archives include almost 350 websites, along with two metadata files for each archive (a total of over 1,100 records). The first metadata file contains information about how the record was created, the software used, and the identity of the creator. The second is a list of blockchain registrations and identifiers for the content on distributed storage systems.
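As a rough illustration (the field names below are hypothetical, not the project’s actual schema), the pair of metadata records for a single archive might look something like this:

```python
# Hypothetical field names, for illustration only; the project's actual
# metadata schema may differ.
creation_record = {
    "archive": "example.wacz",          # placeholder filename
    "created": "2023-01-01T00:00:00Z",  # placeholder timestamp
    "software": "Browsertrix",
    "creator": "Starling Lab",
}

registration_record = {
    "archive": "example.wacz",
    "blockchain_registrations": ["numbers:...", "avalanche:...", "likecoin:..."],
    "ipfs_cid": "bafy...",              # placeholder content identifier
}
```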

Immutable Records

When a WACZ record of a website is created, an immutable digital ‘fingerprint’ called a hash is also generated. If any single byte of the data is changed, be it a pixel of an image or the timestamp of when it was collected, the hash you use to verify copies of that page will change as well. 
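A quick way to see this property in action is with Python’s built-in hashlib; the byte strings below simply stand in for a captured page:

```python
import hashlib

# Placeholder bytes standing in for a captured page's data.
original = b"<html>captured page</html>"
tampered = b"<html>captured page.</html>"  # a single byte added

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(tampered).hexdigest())
# The two digests have nothing in common, so any copy whose hash matches
# the original is byte-for-byte identical to it.
```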

A hash is nearly impossible to fake, and the hashed data is also cryptographically signed by the software that created it (and later by Starling Lab) with a Let’s Encrypt certificate. This adds a notarization attesting to exactly what was created and when, and ties it to a known identity (a domain name).
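Verifying such a signature might look roughly like the sketch below, which assumes the pyca/cryptography library and an RSA-keyed certificate; the actual WACZ signing format differs in its details:

```python
from cryptography import x509
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def verify_notarization(cert_pem: bytes, signature: bytes, data: bytes) -> bool:
    """Check that `data` was signed by the key in the given certificate."""
    cert = x509.load_pem_x509_certificate(cert_pem)
    try:
        # Assumes an RSA key; real signing setups may use ECDSA instead.
        cert.public_key().verify(
            signature, data, padding.PKCS1v15(), hashes.SHA256()
        )
        return True
    except InvalidSignature:
        return False
```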

The WACZ records were then registered on blockchains to establish exactly what content existed, where it was on the internet, and when. These records were also stored in a distributed, peer-to-peer data-sharing system called IPFS and archived on a distributed storage network called Filecoin, so users can access and inspect the data referenced by the blockchain registrations.

Workflow

Web Archive Creation

First, most of the web archives were created with an automated tool called Browsertrix (run by Starling Lab). Some were created with a Chrome extension called Archiveweb.page when websites were too complex to crawl with a bot. Both tools visit a website and capture all of the content on it.

When these tools capture an archive, each piece of that archive is hashed; then the list of hashes is packaged up, hashed again, and signed (notarized) by Browsertrix or the Chrome extension. The signature is made with a private key whose corresponding public key belongs to a verifiable identity, such as an Archiveweb.page user or a Browsertrix instance. Everything is packaged together into a zipped file with a .wacz extension.
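In simplified form (real WACZ signing involves more structure than this), the hash-then-package-then-hash step looks like:

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Placeholder contents standing in for the pieces of a capture.
pieces = {
    "archive/data.warc.gz": b"warc bytes",
    "indexes/index.cdx.gz": b"index bytes",
    "pages/pages.jsonl": b"page list bytes",
}

# Hash each piece individually, package the list of hashes, hash it again.
hash_list = {path: sha256_hex(data) for path, data in pieces.items()}
packaged = json.dumps(hash_list, sort_keys=True).encode()
final_digest = sha256_hex(packaged)

print(final_digest)  # this is the digest that gets signed (notarized)
```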

The Starling Integrity Pipeline

The Starling Integrity Pipeline is a data processing workflow that ingests and stores data such as photos, videos, and documents, along with their authenticity records. The pipeline helps create verifiable attestations and immutable records of digital content and preserves them in several different systems. After the archives are created, Starling Lab feeds the WACZ files into the pipeline to process the data into different formats, register hashes of the content, and store multiple copies of it (on Starling servers, on shared drives with Mapping Black California, and on distributed systems). In this implementation, the underlying set of records was all about public efforts to address racism as a public health crisis in the state of California.

Blockchain Registration

When a WACZ file is processed by the Starling Integrity Pipeline, it is signed again with Starling Lab’s TLS certificate, adding an additional notarization of the content. Next, backup records are created, and hashes of the content are registered on three different blockchains: Numbers, Avalanche, and Likecoin. These distributed, consensus-based networks establish another timestamp for the hashed content (using OpenTimestamps), creating records of what was published and when.
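Sketched with entirely hypothetical function names (this is not Starling Lab’s actual code), that sequence looks roughly like:

```python
import hashlib

# Hypothetical stand-ins for the pipeline stages described above.
def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def notarize(digest: str) -> str:
    # Stands in for signing the digest with Starling Lab's TLS key.
    return f"signature-over-{digest}"

def register(digest: str, chain: str) -> str:
    # Stands in for a real blockchain registration transaction.
    return f"{chain}-registration-of-{digest}"

def process_wacz(wacz_bytes: bytes) -> dict:
    digest = sha256_hex(wacz_bytes)
    return {
        "hash": digest,
        "signature": notarize(digest),
        "registrations": [register(digest, chain)
                          for chain in ("numbers", "avalanche", "likecoin")],
    }

print(process_wacz(b"placeholder wacz bytes"))
```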

Distributed Storage

In addition to blockchain registration, the WACZ files are given content identifiers, or CIDs, and pinned in the peer-to-peer data-sharing system IPFS using web3.storage. The same service also packages and archives the WACZ files on the Filecoin network.
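Because CIDs are derived from the content itself, anyone can retrieve a pinned file through any public IPFS gateway and check it against the registered hash. A minimal sketch, with a placeholder CID, is below:

```python
import hashlib
import urllib.request

# Placeholder CID; a real record's CID would come from its metadata file.
cid = "bafy-example-cid"
url = f"https://ipfs.io/ipfs/{cid}"

with urllib.request.urlopen(url) as response:
    wacz_bytes = response.read()

# The retrieved bytes can be hashed locally and compared against the
# digest registered on the blockchains.
print(hashlib.sha256(wacz_bytes).hexdigest())
```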

With the Starling Lab authentication workflow, surfaced in the unique display created for this project, we hope to enable readers to explore web archives and the metadata associated with them. By giving readers the ability to inspect authenticity information, we aim to lead the way in providing new tools to meet the technical and ethical challenges of establishing trust in our most sensitive digital records.