How Does Data Rescue Work?

Data rescue is the process of preserving vulnerable digital information and making it accessible for the future. As data becomes increasingly digitized, there is a growing risk of losing valuable scientific, cultural, and historical records due to technological obsolescence, instability, or disasters. Data rescue aims to mitigate these threats.

Why is Data Rescue Needed?

There are several key factors that make data rescue necessary in the digital age:

  • The quantity of born-digital data is exploding, but storage media and file formats can become obsolete quickly.
  • Climate change and severe weather events threaten data stored in vulnerable locations.
  • Political instability and restricted access to information make data preservation more urgent.
  • Funding cuts and changes in institutional priorities often put data at risk.

Without interventions to actively preserve and migrate vulnerable digital content, critical data could be permanently lost or inaccessible. For example, an estimated 80% of government climate data worldwide is currently at risk of being lost.

What Types of Data Are Rescued?

Data rescue projects have focused on preserving many forms of vulnerable digital information, including:

  • Scientific data: Datasets from environmental, geospatial, biological, physics, astronomy, and other research fields.
  • Government records: Emails, documents, and databases from public offices and agencies.
  • Cultural heritage: Digitized books, photographs, audio, video, and records from libraries, museums, and other institutions.
  • Web content: Websites, blogs, forums, and other online content.
  • Personal digital records: Emails, photos, and other born-digital personal archives.

Priorities often depend on the anticipated cultural, scientific, or historical value of the data and the urgency of the threat it faces.
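One way to make this kind of prioritization explicit is a simple scoring scheme. The sketch below is illustrative only: the 1-to-5 scales, the rule of multiplying value by urgency, and the dataset names are all assumptions made for the example, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    value: int    # anticipated value, 1 (low) to 5 (high); hypothetical scale
    urgency: int  # urgency of the threat, 1 (low) to 5 (high); hypothetical scale


def rank_candidates(candidates):
    """Return candidates ordered from highest to lowest priority score."""
    return sorted(candidates, key=lambda c: c.value * c.urgency, reverse=True)


# Hypothetical datasets under consideration for rescue.
candidates = [
    Candidate("regional rainfall logs", value=4, urgency=5),
    Candidate("agency press releases", value=2, urgency=3),
    Candidate("museum photo scans", value=5, urgency=2),
]

ranked = rank_candidates(candidates)
for c in ranked:
    print(c.name, c.value * c.urgency)
```

A real project would likely weigh more factors, such as legal access and the cost of harvesting, but even a rough score helps a team agree on what to rescue first.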

Who Conducts Data Rescues?

Data rescue projects involve collaboration between various stakeholders:

  • Domain experts: Researchers, professionals, and enthusiasts with subject expertise related to the data.
  • Archivists and librarians: Information management professionals with skills in digital preservation.
  • Technologists: Software developers, systems administrators, and IT personnel.
  • Volunteers: Members of the public donating time and effort to preserve shared digital heritage.

This cooperative approach combines knowledge, skills, and perspective to build robust and ethical data rescue workflows.

What Are the Steps in a Data Rescue Project?

While specific techniques vary, most data rescue initiatives follow a general workflow:

  1. Identifying at-risk data: Evaluating which data is in danger and prioritizing content with high value.
  2. Gaining access: Securing authorization, permissions, tools, and funding required to preserve the data.
  3. Harvesting data: Making complete copies of the vulnerable digital content and metadata.
  4. Processing data: Cleaning, describing, and organizing the rescued data.
  5. Storing data: Archiving data in trustworthy long-term preservation systems.
  6. Providing access: Making the rescued data available to users in accessible formats.

Effective workflows also incorporate review, quality assurance, security, and documentation practices at each stage.

How is Data Harvested for Rescue?

Data rescue requires making complete copies of vulnerable digital content and metadata prior to any loss or deterioration. This process, known as data harvesting, employs a variety of techniques tailored to the source system and type of data being preserved.

Harvesting From Online Systems

For data stored in online databases, repositories, or other networked systems, data harvesting may involve:

  • Using APIs to systematically query or crawl the source system and retrieve copies of data.
  • Exporting or dumping complete databases.
  • Using web scraping tools to harvest content from websites.
  • Downloading individual files from online file sharing systems.
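As a rough illustration of the API approach, the sketch below pages through a record endpoint until it runs out of results. It assumes a hypothetical API that accepts `offset` and `limit` query parameters and returns JSON of the form `{"records": [...]}`; real source systems differ, so the URL shape, parameter names, and response key here are placeholders.

```python
import json
import urllib.request


def fetch_json(url):
    """Download one page and parse it as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def harvest_all(base_url, fetch=fetch_json, page_size=100):
    """Collect every record from a paginated endpoint.

    Assumes (hypothetically) that the endpoint accepts ?offset=&limit=
    and answers with {"records": [...]}.
    """
    records = []
    offset = 0
    while True:
        page = fetch(f"{base_url}?offset={offset}&limit={page_size}")
        batch = page["records"]
        records.extend(batch)
        # A short (or empty) page means the source has no more records.
        if len(batch) < page_size:
            break
        offset += len(batch)
    return records
```

Passing the `fetch` function as a parameter keeps the paging logic separate from the network call, which makes it easy to add rate limiting or retries, or to rehearse a harvest against saved responses before touching a fragile source system.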

Harvesting From Media and Storage Devices

To rescue data stored on physical media or storage devices, data harvesting may involve:

  • Imaging disk drives, servers, and offline media like tapes, CDs, and USB drives.
  • Cloning or copying the entire contents of the source media or device.
  • Extracting individual files using dedicated forensic tools.

Care must be taken not to alter the original source during data harvesting.
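On the software side, this usually means opening the source read-only and confirming that the copy matches it. The following is a minimal sketch using file-level copies; the function names and chunk size are choices made for this example, and physical media are typically also protected by a hardware write blocker, which code cannot replace.

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK = 1024 * 1024  # read in 1 MiB chunks so large files fit in memory


def sha256_of(path):
    """Compute the SHA-256 checksum of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy source to destination and confirm the copy is identical.

    The source is opened read-only; the destination is opened in
    exclusive mode ("xb") so an existing file is never overwritten.
    """
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(destination, "xb") as dst:
        for chunk in iter(lambda: src.read(CHUNK), b""):
            digest.update(chunk)
            dst.write(chunk)
    source_hash = digest.hexdigest()
    if sha256_of(destination) != source_hash:
        raise OSError(f"checksum mismatch after copying {source}")
    return source_hash


# Demonstration with a throwaway file standing in for the source media.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    src.write_bytes(b"example payload" * 1000)
    dst = Path(tmp) / "copy.bin"
    checksum = copy_and_verify(src, dst)
    copied_ok = dst.read_bytes() == src.read_bytes()
    print(checksum, copied_ok)
```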

Documenting the Harvest Process

Extensive documentation of the data harvesting process and source details provides critical context for preserving meaning and usability. Metadata collected often includes:

  • Source system environments and specifications.
  • Descriptions of how data was accessed and copied.
  • Hash values, disk images, and other technical evidence.
  • People, dates, locations, authorizations, and other relevant events.
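A harvest log along these lines can be kept as structured records. The sketch below builds one manifest entry per harvested file; the field names, the example URL, and the operator label are hypothetical and would need to follow whatever metadata schema a project has adopted.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest_entry(file_name, content, source, method, operator):
    """Describe one harvested file: what it is, where it came from, and who copied it."""
    return {
        "file": file_name,
        "sha256": hashlib.sha256(content).hexdigest(),  # hash value for later fixity checks
        "size_bytes": len(content),
        "source": source,      # where the data was accessed
        "method": method,      # how it was copied
        "operator": operator,  # who performed the harvest
        "harvested_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical example values.
entry = build_manifest_entry(
    "stations.csv",
    b"id,name\n1,Alpha\n",
    source="https://example.org/data/stations.csv",
    method="HTTP download",
    operator="volunteer-01",
)
print(json.dumps(entry, indent=2))
```

Writing the entry as JSON keeps it readable by both people and software, so later stages can check files against it automatically.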

Why Is Processing Required After Data Harvesting?

After at-risk data is harvested, additional processing is required before it can be preserved for long-term access. This processing is aimed at:

  • Cleaning: Detecting and correcting errors, inconsistencies, duplicates, and other issues.
  • Transforming formats: Converting proprietary, obsolete, or insecure data formats into widely supported preservation formats.
  • Describing: Adding metadata to describe context, relationships, contents, and meaning.
  • Reuniting data: Linking together related data objects that may have been separately harvested.
  • Validating: Checking completeness, accuracy, and authenticity of the rescued data.

Processing maximizes the clarity, integrity, and usability of data for future users. It also reduces the risk that errors introduced during unsupervised harvesting propagate into the archive, or that irreversible decisions made at that stage go unreviewed.
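Two of these tasks, detecting duplicates and validating completeness, lend themselves to a short example. The sketch below works on in-memory content for brevity and compares files by SHA-256 checksum; the file names and the shape of the `expected` mapping are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def sha256_hex(content):
    return hashlib.sha256(content).hexdigest()


def find_duplicates(files):
    """Group file names whose content is byte-for-byte identical."""
    by_hash = defaultdict(list)
    for name, content in files.items():
        by_hash[sha256_hex(content)].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


def validate(files, expected):
    """Compare harvested files against expected checksums.

    Returns (missing, altered): names absent from the harvest, and names
    whose content no longer matches the recorded checksum.
    """
    missing = sorted(n for n in expected if n not in files)
    altered = sorted(
        n for n, digest in expected.items()
        if n in files and sha256_hex(files[n]) != digest
    )
    return missing, altered


# Hypothetical harvest: one duplicate, one altered file, one missing file.
files = {
    "a.csv": b"1,2,3\n",
    "a_copy.csv": b"1,2,3\n",
    "b.csv": b"4,5,6\n",
}
expected = {
    "a.csv": sha256_hex(b"1,2,3\n"),
    "b.csv": sha256_hex(b"4,5,7\n"),
    "c.csv": sha256_hex(b""),
}

duplicates = find_duplicates(files)
missing, altered = validate(files, expected)
print(duplicates, missing, altered)
```

Checks like these only flag problems. Deciding which duplicate to keep, or whether an altered file is a corruption or a legitimate update, still calls for human review.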

What Are Some Data Storage Strategies for Preservation?

After data rescue, the processed digital content still requires reliable and well-managed storage to enable long-term preservation. Common archival storage solutions include:

  • Dedicated data repositories: Digital archives, data libraries, institutional repositories.
  • Cloud storage: Services like AWS, Google Cloud, Microsoft Azure.
  • Shared infrastructure: Consortial preservation networks, community-governed systems.
  • Distributed networks: Blockchain, peer-to-peer, decentralized networks.
  • Physical media: Archival quality hard drives, tape, optical media.

Key considerations when selecting storage to enable long-term digital preservation include:

  • Geographic redundancy (minimizing risks of single points of failure).
  • Ongoing monitoring, maintenance, and migration as technology changes.
  • Sufficient documentation for future management and use.
  • Trusted governance and sustainability models.
  • Mechanisms for authentication, version control, and fixity checks.
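A fixity check recomputes a checksum for stored content and compares it with a known-good value. The sketch below combines that idea with redundancy by comparing replicas against one another and flagging any copy that disagrees with the majority. The majority rule and the site names are assumptions made for this example; production systems more commonly compare each copy against a checksum recorded at ingest.

```python
import hashlib
from collections import Counter


def audit_replicas(replicas):
    """Compare copies of the same object held at different locations.

    replicas maps a location name to the bytes stored there. Returns the
    checksum most locations agree on and the locations that disagree.
    """
    digests = {
        location: hashlib.sha256(data).hexdigest()
        for location, data in replicas.items()
    }
    consensus, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(loc for loc, d in digests.items() if d != consensus)
    return consensus, damaged


# Hypothetical replicas; the copy at site-c has been corrupted.
replicas = {
    "site-a": b"dataset v1",
    "site-b": b"dataset v1",
    "site-c": b"dataset v?",
}

consensus, damaged = audit_replicas(replicas)
print(consensus, damaged)
```

Run on a schedule, an audit like this turns silent corruption into a repair task: the damaged copy can be replaced from a location that still matches.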

How Do Data Rescue Outputs Get Shared and Used?

To maximize impact, data preserved through rescue projects needs to be discoverable and usable by relevant communities over time. This requires:

  • User-friendly catalogs, finding aids, and search systems tailored to designated user groups.
  • Open APIs, robust metadata, and modern formats to power new computational uses.
  • Websites, exhibits, educational materials, and visualizations that showcase and explain the significance of rescued data.
  • Collaborations with researchers, artists, journalists, and other re-users to realize value.
  • Mechanisms for gathering feedback to guide ongoing improvements to access systems.

Usability is strengthened by proactive outreach and building connections between archives and audiences. Rescued data can gain new meanings and uses beyond what was originally envisioned.

What Are Some Notable Examples of Data Rescue Projects?

Many impactful data rescue initiatives have already preserved vast quantities of vulnerable digital content:

  • The End of Term Web Archive has preserved government websites and data at risk of changing during presidential transitions in the United States.
  • The DASISH project rescued multiple European social science datasets in danger of being lost.
  • Libraries and archives are conducting large-scale digitization of historical print materials to enable digital access and preservation.
  • Initiatives like Digitalt Arkiv are preserving websites and online content documenting contemporary society and culture.
  • Community-organized DataRescue events have extracted millions of environmental and climate datasets for preservation.

As these examples show, data rescue already plays an indispensable role in mitigating the loss of digital heritage. But more work is urgently needed to keep pace with the generation of new vulnerable digital content.

What Are Some Challenges Facing Data Rescue Efforts?

Despite growing awareness of its importance, data rescue initiatives face a variety of challenges:

  • Scale: The quantity of vulnerable data being produced outpaces capacity for preservation.
  • Awareness: Many data creators don’t consider long-term preservation.
  • Complexity: Heterogeneous, proprietary, and poorly documented legacy formats make rescue difficult.
  • Access: Legal, policy, and security barriers can block preservation efforts.
  • Resources: Data rescue requires sustained funding and institutional support.
  • Expertise: There is a scarcity of professionals trained in endangered data rescue techniques.

Overcoming these constraints requires greater investment, training, tools, and coordination across sectors, disciplines, and organizations.

How Can Data Rescue Efforts Be Improved?

Possible approaches to strengthen and expand data rescue work include:

  • More training programs, apprenticeships, and academic programs focused on digital preservation and data rescue.
  • Policy and funding incentives promoting lifecycle data management planning.
  • Proactive monitoring tools to identify vulnerable high-value data.
  • Improved software, automation, and AI to streamline scalable data rescue workflows.
  • Greater public understanding and support driving political priorities.
  • Legal reforms enabling non-discriminatory public interest preservation.
  • Collaborative infrastructure and standards for sharing data rescue processes globally.

With greater capacity, coordination, and urgency, the array of cultural memory and scientific knowledge encoded in vulnerable digital data can be secured for future generations.

Conclusion

Data rescue provides a vital bulwark against irreparable loss of valuable digital content due to technological, institutional, and environmental threats. By systematically preserving and providing access to vulnerable born-digital materials, data rescue projects serve the public interest today and enact responsible digital stewardship for the future.

Realizing the full vision and potential of data rescue requires increased awareness, funding, training, infrastructure, and technical tools. But it is a profoundly worthy investment. Safeguarding scientific knowledge, cultural memory, and humanity’s digital heritage is an obligation we all share in the digital age.