Practical Decentralization of Scholarly Data & Resources

The following post is based on my talk at the library and technology conference Online Northwest in March 2018, titled “A vision for decentralized data preservation across a network of libraries and trusted institutions.” Photo credit to Sarah Seymore. I'll update with the link to the recording when it's available.

Technology is imbued with the values and biases of its creators (see the work of Safiya Noble for more). Today’s online data storage and preservation systems are no exception. They are built on traditional web infrastructure, which was designed around the values of hierarchical and commercial organizations. It’s time for scholars to ask whether today’s data preservation technologies align with open scholarship’s values of access, preservation, privacy, and transparency. Alternative communication tools, such as email and RSS, were built for interoperability and portability but have been largely relegated to niche uses. Similar approaches built with decentralized technology have recently reached a new level of maturity and publicity.

It’s time for scholars to ask whether today’s data preservation technologies align with open scholarship’s values of access, preservation, privacy, and transparency.

Decentralized tools offer a more robust, open alternative for data management. At a fundamental level, decentralized systems distribute data across a network of linked participants. Beyond scholarly use cases, artists, activists, and technologists are using decentralized models to rethink how the web itself is built, opening new models for community-managed information sharing. As decentralization remakes the legacy web, it presents scholarship with an opportunity to rethink who owns scholarly data and the pathways to access that information.

Today, decentralized approaches like the Dat Project offer foundational tools for integrating isolated data silos, operating at a lower level of web infrastructure to link information in existing systems. (At the end of this post is a plain language Dat Project explainer.) These modern tools present opportunities for scholars, librarians, and technologists to redesign long-term data preservation in a way that formalizes the shared values of the community within the technology itself.

The Internet is broken and we are using it to access and distribute all of human knowledge.

Today’s web is dominated by centralized, non-interoperable entities that sustain their businesses by enclosing and selling the data we provide to them. The model of controlling content via an online platform has been implemented in every industry, from social media to scholarly publishing. We live in an age where data are increasingly collected, analyzed, and repackaged for sale. Across domains, data live online (and I mean data in the most inclusive sense): the work of a writer, government data, newspaper archives, your family photos, scientific data, film archives, artists’ creations. These data live on the web with varying degrees of strategy, management, and plans for long-term upkeep.

The lines drawn around which data have political value, research value, or business value constantly shift. Scholarly data, the majority of which are presented, accessed, and stored online, are no exception, and can move quickly from niche to politically charged (see DataRescue). Although data are valuable, data management and stewardship practices are extremely inconsistent and financially burdensome. Link rot (when links break) and content drift (when the information at a link changes) plague fields from legal judgements and scholarship to biomedical research. Together, link rot and content drift create “reference rot”; you can read more about its impact on scholarship across disciplines here, here, and here.

In this landscape, librarians, technologists, and scholars are trying to manage systems that will preserve humanity’s growing knowledge base indefinitely. The web is not designed for long-term preservation of information. As Laurie Allen said in 2017: “the internet is a terribly unstable way to keep information available.” As decentralized models for connecting people to information are developed, they present a unique opportunity to rethink how scholarly data are stored and accessed online.

A network approach

Centralized data storage systems can only preserve what they hold in their servers. These models require custody to provide access. Data custody becomes increasingly expensive and difficult to manage as data volumes increase. Stephen Abrams asks the question, “can we replace custody with easy access?” In other words, is knowing where data are, and trusting the preservation standards of that location, equivalent to (or better than) custody? Can we reduce the burden on institutions to own everything with a mandate to know where data are and how many verified copies exist?

The idea of “preservation in place” where libraries bring “preservation services to the content” is not new. In a decentralized model, custody is not required for access. By bringing preservation services to content, we replace custody with access and preservation. Data then live in a network of linked institutions that model a commons of trust.

Creating a functional system in which information is shared across silos between trusted institutions is a utopian idea. What would such a network require beyond technology? The most critical factor is trust. Trust in each participating entity’s standards and processes. Trust between institutions. Trust from the community that a commons can be sustained without tragedy. Today's decentralized models make preservation in place technically feasible and interoperable with existing data preservation silos. But perhaps the cultural part of modeling a new system will be more challenging than writing the software? ;)

Join us

Scholarship today is another form of online content creation. I resisted the idea for years. But the parallels are clear. Scholarly work is used by for-profit entities to drive clicks, bring advertisers and subscribers, and sustain their businesses. We give up custody of — and sometimes rights to — our work. We then must pay for access to our own data and publications, or lose access to those resources. I believe this system is unsustainable because it is not based on the community’s values. Reducing reliance on centralized systems will help to return the control of scholarly assets to the creators and trusted institutions that value access and scholarship. Data preservation and access are fundamentally about trust. Who do you trust to steward human knowledge? I trust libraries, scholars, and public interest technologists over entities with clear business interests.

What’s important to the scholarly community? Are those values reflected in our technology?

Today’s online scholarly infrastructure values systems of centralized control, people with reliable internet connections, and people and institutions with money to pay for access (and when there’s no money, it values people and institutions with the technical capacity and time to do the work themselves). The web is being reimagined today as a network wherein data are freely shared between linked users. For scholars and librarians, it’s a chance to step back and assess what assumptions and values are baked into today’s open scholarly infrastructure. What’s important to the scholarly community? Are those values reflected in our technology? Let’s reexamine how scholarship and data live on the web and move into a future where our values are reflected in our technology choices.

Save the date for our next community call: May 31st, 11am Pacific. Join our mailing list for a reminder. Follow us on the tweets: @daniellecrobins, @dat_project, and @codeforsociety.

Thanks to the 2018 Online Northwest Program Planning Committee for inviting me to speak and organizing such a fantastic conference! And extra thanks to Joe Hand, Karissa McKelvey, John Chodacki, Robin Champieux, and Stephen Abrams for great discussion and comments on drafts of this work.

What’s Dat and why are we working with this technology?

Dat is an open-source, non-profit-backed, peer-to-peer file-sharing protocol originally developed to distribute large datasets. It is not blockchain-based; instead, it uses an append-only log to track changes and a private key that allows the author to publish changes.
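To make that concrete, here is a minimal sketch of sharing a folder with Dat from Node.js. It assumes the dat-node package (a higher-level library behind the Dat command-line tool); the folder path is a placeholder, and the exact API may differ between versions.

```typescript
// Minimal sketch: track a local folder and share it over the Dat network.
// Assumes the dat-node package, which ships without type definitions,
// so the archive object is treated as `any`.
const Dat = require('dat-node');

Dat('/path/to/my-dataset', (err: Error | null, dat: any) => {
  if (err) throw err;

  dat.importFiles();   // add the folder's current files to the append-only log
  dat.joinNetwork();   // announce the archive to peers

  // This key is the folder's persistent identifier on the network.
  console.log('dat://' + dat.key.toString('hex'));
});
```

Only the holder of the corresponding private key can publish new changes; everyone else gets a read-only, verifiable copy.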

When a folder is tracked with Dat, Dat creates a unique, persistent identifier for that package of data (whatever is in the folder). This identifier is based on neither the folder’s location nor its content, so a folder of data can change location or contain dynamic content while keeping the same identifier. Dat then tracks changes to the contents of the folder in a transparent change log. Any reader can view the change log, retrieve earlier versions of the dataset, or keep the folder synced to always have the latest version.
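As a sketch of what that change log looks like in practice, the snippet below (again assuming dat-node, where dat.archive exposes the underlying hyperdrive archive) prints the current version, streams the history, and reads a file from an earlier snapshot; the file name and version number are placeholders.

```typescript
// Sketch: inspect the change log and read an earlier version of a tracked folder.
// Assumes dat-node; dat.archive is the underlying hyperdrive archive.
const Dat = require('dat-node');

Dat('/path/to/my-dataset', (err: Error | null, dat: any) => {
  if (err) throw err;

  console.log('current version:', dat.archive.version);

  // The change log is visible to any reader holding the identifier.
  dat.archive.history().on('data', (entry: any) => console.log(entry));

  // Check out a read-only snapshot at an earlier version and read one file from it.
  const snapshot = dat.archive.checkout(3);
  snapshot.readFile('results.csv', 'utf-8', (readErr: Error | null, contents: string) => {
    if (!readErr) console.log(contents);
  });
});
```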

The Dat identifier can be used to track that package of data across a network. Using the identifier, anyone can see how many verified copies exist, download copies, and re-share it from another computer. A Dat package can contain any file type and tracking with Dat does not change the contents of the package. It’s a lightweight and flexible system that prioritizes user control of data sharing across a decentralized network. For more on how Dat works, check out the whitepaper.
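For example, here is a sketch of fetching and re-sharing a copy from another machine using only the identifier (again assuming dat-node; the key shown is a placeholder, not a real archive):

```typescript
// Sketch: anyone holding the identifier can download a verified copy
// and immediately re-share it, adding one more copy to the network.
const Dat = require('dat-node');

Dat('/path/to/local-copy', { key: '<64-character-dat-key>' }, (err: Error | null, dat: any) => {
  if (err) throw err;

  // Joining the network both downloads the data and serves it to other peers.
  const network = dat.joinNetwork();
  network.once('connection', () => {
    console.log('connected to a peer holding this dataset');
  });
});
```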