
This policy can also be downloaded as a PDF from Apollo, Cambridge University's research repository.

1. Policy overview

Apollo, the University of Cambridge’s institutional repository, underpins the University’s commitment to preserve its research outputs for the long term and to provide access to them as widely as possible, contributing to society as well as to academic advancement.

The Cambridge University Libraries Digital Preservation Policy outlines the Libraries’ commitment to long-term preservation of the digital collection materials both created (e.g., digitised versions of physical collection items) and acquired (e.g., research outputs of the University, born-digital archives).

The Apollo Repository Preservation Policy is focused specifically on the University’s research outputs as described in Apollo’s Terms of Use and outlines standards and procedures followed by this repository to guarantee the long-term preservation and availability of these materials.

1.2 Background

Established in 2003, Apollo is the institutional repository of the University of Cambridge. It provides a service to deposit, preserve, and share the research outputs of the University.

The University supports the principle that the results of its research should be freely accessible and re-usable, and therefore supports staff in making their scholarly research outputs (“Research Outputs” or “Outputs”) available and disseminated as widely as possible.

Along with this principle, the University is committed to making all research outputs available in accordance with the FAIR principles: Findable, Accessible, Interoperable and Re-usable.

The repository’s infrastructure is critical to the University’s compliance with funder mandates and requirements, such as those of UK Research and Innovation (UKRI) and eligibility for the Research Excellence Framework (REF), because it hosts and provides access to open access articles and open data from University of Cambridge researchers.

2. Repository scope

2.1 Content Scope

Research outputs of the University include publications, conference proceedings, book chapters, monographs, theses, various forms of research data (video recordings, spreadsheets, computational scripts, code, images, etc.), presentations, and others.

This policy document also covers the preservation of any accompanying metadata associated with deposited data, which the Repository aims to preserve, keep up to date, and enrich where necessary.

A current list of content types can be found in Apollo.

2.2 Services scope

Apollo’s core activities include the preservation, curation, and dissemination of the research outputs of the University with the aim of guaranteeing that all content entrusted to it by depositors remains suitable for the needs of its primary users now and in the future.

In this capacity, Apollo services actively work to promote and facilitate:

  • Secure storage of data
  • Reliability and usability of data
  • Quality and integrity of data
  • Publication of and access to data

Undertaking these activities helps Apollo meet the UK Concordat on Open Research Data, which advocates the retention of research data for at least ten years from the date of any publication associated with the data, or as specified by the funder. Further to this commitment, Apollo’s terms of use also state that the repository strives to maintain the availability of deposited works indefinitely, subject to technical, administrative, or legal exceptions.
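
The ten-year minimum amounts to simple date arithmetic. The sketch below is illustrative only and is not part of Apollo's systems: the function name `retention_end`, the choice to count from the most recent associated publication, and the handling of 29 February are all assumptions. A funder-specified period would take precedence over the default.

```python
from datetime import date

MIN_RETENTION_YEARS = 10  # minimum advocated by the Concordat, per the text above


def retention_end(publication_dates, years=MIN_RETENTION_YEARS):
    """Return the date on which the minimum retention period ends.

    Assumption: the period is counted from the most recent publication
    associated with the data.
    """
    if not publication_dates:
        raise ValueError("at least one publication date is required")
    latest = max(publication_dates)
    try:
        return latest.replace(year=latest.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return latest.replace(year=latest.year + years, day=28)


print(retention_end([date(2020, 3, 1), date(2022, 11, 15)]))  # 2032-11-15
```

Because Apollo aims to keep deposited works available indefinitely, a date computed this way is a lower bound rather than a disposal date.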

3. Collection Management

Activities that ensure current and long-term preservation start upstream before research outputs are submitted to an ingest workflow and preservation storage.

3.1 Deposit

There are four routes for depositing material in Apollo: Elements, command line, direct deposit, and external submission using the Apollo API.

3.1.1 Elements

Elements is the University’s CRIS (Current Research Information System). Depositors interact directly with Elements to enter deposit type-specific metadata, attach full text files, and grant a deposit licence. Depositors are presented with a range of mandatory and optional metadata fields. Access to this deposit route is determined by University credentials (Raven, an implementation of Shibboleth) and membership of an appropriate staff group in a feed from the University’s HR system.

3.1.2 Command line

Deposit by command line is reserved for complex deposits that consist of multiple interconnected items, deposits with large files (single files larger than 2 GB), and large volumes of material where other deposit routes are impractical. Command line deposit may only be performed by Apollo administrators, currently members of the Open Research Systems team.

3.1.3 Direct deposit

Direct deposit enables users to deposit material directly into Apollo. Only a small group of users may deposit directly; the primary group is Apollo administrators, although a small group of University staff may also deposit directly via submission forms. Direct deposit is allowed only to meet closely defined use cases.

Last updated: November 2022

Material is deposited directly via submission forms coded in Apollo. The fields present on each submission form are equivalent to those described in Elements.

3.1.4 API

As described further in Data Management (Section 3.3), a small number of authenticated external organisations may deposit to Apollo via the API (Application Programming Interface).

3.2 Appraisal

Depositors are responsible for data preparation and submission but are supported by our guidance documentation, helpdesks, and training activities. Once submitted, deposits are subject to quality checks, ethical/legal checks and curation actions that are dependent on output type. Research outputs will not be deposited into Apollo without being subject to these review and curation actions. Checks performed on research datasets are documented in the data submission guidance.

Every submission is described by type-appropriate metadata. This contextual information must be of sufficient quality to allow the research output to be understood and potentially re-used. The depositor is contacted if additional metadata are required that cannot be added by the appropriate repository team.

Research data that derive from human participants or appear to contain personal data or sensitive information are carefully checked pre-publication. Confirmation and documentation are sought from the depositor as evidence that permissions (e.g., participant consent) and agreements are in place that allow public data sharing. Depositors who incorporate third-party data into their dataset also need to confirm they have the rights to share the data in Apollo under their chosen licence. Deposits that do not meet these conditions are returned to the depositor for amendment and redeposit, or they are rejected from the repository.

Files within certain output types (data, software/code) are also downloaded and checked to determine whether:

  • Files open without error
  • Files can be opened with open-source software
  • Any personal data, sensitive data, or secondary (third-party) data are present that require evidence of permission to share publicly
  • Data are adequately described by existing documentation
  • Metadata are sufficient to enable the data to be understood and reused

If there are any problems, the depositor is contacted for more information and/or additional files, such as readme files, codebooks, data files in non-proprietary formats, or enhanced software usage instructions.

3.3 Data Management

To maintain the integrity of the research outputs held in Apollo, data management activities are undertaken regularly by the relevant teams and associated services. Persistent identifiers, such as Digital Object Identifiers (DOIs), are assigned to all repository item-level records in Apollo. Rich, descriptive information (metadata) about the research outputs is preserved, alongside associated digital assets. This information will be enriched by Repository staff, where possible, to include additional information such as:

  • Links between research datasets and related publications (journal articles, author accepted manuscripts, books, book chapters, reports, pre-prints) or associated materials (theses, code repositories), and links between different dataset versions. The relevant teams managing Apollo content share information with one another so that content can be updated.
  • Records with incomplete metadata are assessed by teams at regular intervals until all necessary metadata have been added to provide a comprehensively described deposit.
  • Apollo is connected to other University systems, and information about funding (funder names, identifiers, and grant references) associated with publications and research data is automatically added to Apollo content.

Additionally, and for some content types such as open access publications, the published versions (version of record – full text from journals) are deposited into Apollo via automated processes and include high quality metadata sourced directly from the publishers. Examples of such submissions include the following Apollo collections:

Apollo promotes the involvement of its designated user community in the management of their data through a range of systems available to users. Prior to publication in the repository, depositors can manage their submissions (view, edit, and update metadata and data files) via the University's CRIS, Symplectic Elements. Modifications to published content in Apollo require a new deposit which includes the new versions of files and updated metadata. A new DOI is minted for the new version, and links between versions are maintained by using appropriate "Relationships" metadata from the DataCite schema (DataCite is the authority used to mint DOIs for Apollo content).
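
The version links described above could be recorded along the following lines. This is a minimal sketch, not Apollo's deposit code: the helper `version_links` is hypothetical and the DOIs are placeholders. The `relatedIdentifier`, `relatedIdentifierType`, and `relationType` properties and the `IsNewVersionOf` and `IsPreviousVersionOf` values come from the DataCite Metadata Schema.

```python
def version_links(previous_doi, new_doi):
    """Build reciprocal DataCite related-identifier entries for two versions.

    Returns a dict mapping each DOI to the entry that would be added to
    that record's list of related identifiers.
    """
    def entry(target, relation):
        return {
            "relatedIdentifier": target,
            "relatedIdentifierType": "DOI",
            "relationType": relation,
        }

    return {
        new_doi: entry(previous_doi, "IsNewVersionOf"),
        previous_doi: entry(new_doi, "IsPreviousVersionOf"),
    }


# Placeholder DOIs, for illustration only.
links = version_links("10.1234/example.v1", "10.1234/example.v2")
print(links["10.1234/example.v2"]["relationType"])  # IsNewVersionOf
```

Recording the relationship on both records lets a reader who lands on either version find the other.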

Embargoes of a fixed period of time are applied to some outputs, either as requested by depositors (e.g., for datasets and theses) or as determined by publisher agreements for author accepted manuscripts. Embargoes are applied manually to the relevant files, and, for some records, the relevant team conducts periodic checks of embargoed records to determine if the embargo can be lifted and the files made publicly available. These manual checks apply predominantly to datasets and software/code.
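
A periodic embargo check of this kind can be sketched as a filter over embargoed files. The example is hypothetical: the `EmbargoedFile` fields and the function `files_due_for_release` are assumptions for illustration, as is treating `embargo_end` as the last restricted day. It only flags candidates; in line with the manual process described above, staff would still decide whether to lift each embargo.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EmbargoedFile:
    record_id: str     # hypothetical record identifier
    filename: str
    embargo_end: date  # assumption: last day on which the file is restricted


def files_due_for_release(files, today=None):
    """Return the files whose embargo period has fully elapsed."""
    today = today or date.today()
    return [f for f in files if f.embargo_end < today]


files = [
    EmbargoedFile("record-1", "data.csv", date(2022, 10, 31)),
    EmbargoedFile("record-2", "thesis.pdf", date(2023, 6, 30)),
]
due = files_due_for_release(files, today=date(2022, 11, 1))
print([f.filename for f in due])  # ['data.csv']
```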

Quality assurance processes are in place as research outputs are being deposited into Apollo and post-publication in the repository. In addition to metadata enrichment performed by repository staff (as discussed above), depositors and repository users can comment on item metadata and highlight any issues with files by contacting the repository’s support team. Feedback and suggestions received in this respect are reviewed regularly by the relevant repository teams and actioned where appropriate. If there are any problems with an item’s metadata or files then we attempt to rectify these and to also provide additional metadata to describe any changes made to the record, to datestamp these changes, and to detail any outstanding issues that we are unable to address.

Apollo has mechanisms in place for producing and monitoring “usage statistics” (access). These systems are used to generate logs that provide information about:

  • Modifications made to data or metadata
  • Changes to data status
  • External access to data

3.4 Storage and Security

3.4.1 Primary storage

The repository’s digital materials are retained indefinitely (subject to technical, administrative and legal exceptions). For this reason, storage is resilient, scalable, and highly configurable. Robust mechanisms for data management and integrity help ensure that the entire storage system can be verified on use, confirmed to be correctly stored, or remedied if corrupt.

To safeguard against data corruption and loss, Apollo stores its digital objects (data files) on storage with many layers of redundancy. The storage that underpins Apollo is managed by industry-standard NetApp appliances. These technologies provide both data replication and backups of the primary storage through a series of snapshots. The combination of these services provides data integrity at the storage level for all Apollo data.

3.4.2 Replication

All storage volumes are replicated to a second storage location in near real time.

Data integrity is ensured by the following series of steps:

  • Checksums during the initial replication stream
  • Checksums on subsequent incremental replication streams
  • Ongoing checksums on any data reads
  • Regular RAID scrubbing to identify and remedy errors

The initial (and subsequent) replication data streams will only be accepted if they are intact. Subsequent incremental streams will only be accepted if the current state on the receiving side exactly matches that from the last snapshot on the send side.
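
The acceptance rule can be modelled in a few lines. This is a simplified sketch of the logic only, not how the storage appliances implement replication: the `Stream` type, the use of SHA-256, and the snapshot identifiers are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stream:
    payload: bytes
    checksum: str                        # digest computed by the sender (SHA-256 assumed)
    base_snapshot: Optional[str] = None  # snapshot the increment builds on; None if initial


def accept_stream(stream, receiver_snapshot):
    """Accept a stream only if it is intact and, for an incremental stream,
    the receiver's state matches the sender's last snapshot."""
    intact = hashlib.sha256(stream.payload).hexdigest() == stream.checksum
    if not intact:
        return False
    if stream.base_snapshot is None:
        return True  # initial stream: integrity is the only condition
    return stream.base_snapshot == receiver_snapshot


payload = b"changed blocks"
increment = Stream(payload, hashlib.sha256(payload).hexdigest(), base_snapshot="snap-41")
print(accept_stream(increment, receiver_snapshot="snap-41"))  # True
print(accept_stream(increment, receiver_snapshot="snap-40"))  # False
```

Rejecting an increment whose base does not match prevents the replica from drifting silently away from the primary copy.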

All data is encrypted in transit to protect against security threats, such as a man-in-the-middle attack, a form of cyberattack whereby an attacker intercepts data.

3.4.3 Backups

The Apollo backup and tape management strategy outlines how often and when backups are created and how tapes are handled. Backup (full and incremental) is managed for both hot storage (disk) and offline physical media (tape). Hourly snapshots are taken of both the Apollo database and associated file store that contains the content files.

The data entrusted to Apollo is present in three discrete physical locations, and snapshots and backups are verified against the current active Apollo system at the time they are created.

3.4.4 Data monitoring

The deterioration of storage media is handled and monitored by the storage appliances through RAID-level scrubbing. RAID-level scrubs help improve data availability by uncovering and fixing media and checksum errors while the RAID group is in a normal state. If media errors or inconsistencies are found, the storage appliance uses RAID to reconstruct the data from other disks and rewrites the data.

The storage appliance has checksums on all data and metadata, be that stored on SSD, HDD, in memory, or being replicated for backup or redundancy. Every time data (or metadata) are read, checksums are verified. In the event of disparity, the data are reconstructed from other elements in the relevant RAID group(s).

In addition to the normal checksum verification on read, regular RAID scrubs are scheduled. These take place daily and work systematically through the entire set of storage volumes. Data are repaired if found not to match the saved checksums. When data are repaired in this way, the system sends automated notifications so that the Digital Initiatives’ Digital Services Operations team (DS Ops), which manages Apollo infrastructure, is aware of any issues as soon as they are detected.

In addition to the data integrity checks performed at the storage appliance level, another integrity process continuously checks the asset stores in all locations, re-calculating the checksums of all objects and comparing them to the stored value. An alert is generated if a discrepancy is found, and the following data recovery procedures are undertaken:

  • Apollo administrators receive alerts and determine the nature and extent of the incident
  • Depending on the scale of the incident, a team is assembled
  • The affected data owners are alerted
  • The last known good copy is recovered from the relevant storage location
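
The object-level integrity check that triggers these steps can be illustrated as follows. This is a minimal sketch under stated assumptions, not Apollo's monitoring code: the hash algorithm (SHA-256), the manifest structure mapping file paths to stored checksums, and the function names are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_discrepancies(manifest):
    """Return the paths that are missing or whose checksum differs from the stored value."""
    failures = []
    for path, stored in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != stored:
            failures.append(path)
    return failures


# Demonstration with temporary files standing in for an asset store.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    bad = Path(tmp) / "bad.txt"
    good.write_bytes(b"research data")
    bad.write_bytes(b"research data")
    manifest = {str(good): sha256_of(good), str(bad): sha256_of(bad)}
    bad.write_bytes(b"corrupted")  # simulate corruption after the checksum was stored
    failures = find_discrepancies(manifest)

print(failures == [str(bad)])  # True
```

Any path returned by such a check would be the starting point for the alert and recovery steps listed above.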

3.4.5 Disaster Planning

The information and resources necessary to recreate services in the event of a major disaster are stored securely off-site so that these remain accessible.

Recovering effectively demands spare resources and the ability to rapidly deploy new services. Apollo services can be redeployed quickly using a configuration management system. Backups of all critical data, including databases, are kept and can be restored in place. The asset stores are replicated to two physical locations to mitigate extended periods of data unavailability in the event of a disaster. The repository server can access replicated data across the network until the damage caused by the disaster is repaired.

Apollo maintains separate backups of both the metadata database and the file store in two geographically separate data centres. Apollo content copies on tape are stored in a third separate location, at the University Library. While the premises of data centres are managed by University Information Services, the Operations team within the Library’s Digital Initiatives directorate is responsible for the day-to-day management of the storage that underpins the Apollo repository.

A second repository site that we can fail over to is available through the storage infrastructure. This second site has a full repository stack installed, and should the Repository main site go down, we can switch over and maintain access.

3.5 Ingest workflows and preservation storage

The Digital Preservation team is responsible for the ingest workflows, preservation storage, and preservation planning, working with colleagues in the Open Research Systems and Scholarly Communications teams to ensure that the services the Digital Preservation team creates and maintains are fit for purpose.

Workflows that ingest digital collection materials into the preservation systems are built with and run on Amazon Web Services (AWS). Preservation copies of the data are stored in AWS as well as with a second provider.

The term “maintain” is used in the CUL Digital Preservation Policy. It is equivalent to what the OAIS Reference Model (ISO 14721) refers to as the functional entity of preservation planning: ensuring ongoing access to digital materials by monitoring the infrastructure, developing policies and strategies where needed, and creating migration plans.

4. Governance, Roles, and Responsibilities

The Open Research Systems (ORS) team within the University Library’s Digital Initiatives directorate is responsible for managing Apollo. The team manages the repository service and technical aspects of the underpinning software and its integrations with other University and Library systems.

In addition to service ownership, the team is also responsible for technical and software development, maintenance, and platform upgrades of Apollo and associated integrations, together with user support.

Other teams support the management of the technical infrastructure and operational aspects of Apollo. The technical infrastructure and storage are owned by Digital Initiatives and managed by the Digital Services team.

Role Responsibility
Open Research Systems
  • Ownership of Repository services
  • Technical & software development, maintenance and platform upgrades of Apollo and associated integrations
  • Provision of user support
Office of Scholarly Communication
  • Management and curation of the repository's content
  • Provision of guidance and support to Repository depositors
Digital Services
  • Management of the technical infrastructure for Apollo
  • Monitoring system performance, and coordinating updates to the system
Digital Preservation
  • Owner of digital preservation for the Libraries.
  • Works with colleagues across teams to ensure that activities that support the longevity of digital collections are embedded and staff are supported.
  • Oversees the ingest of digital files and metadata into systems and care of materials in these systems post-ingest.

See the Repository’s Governance and Policies page for detailed information on the governance structure and relevant policies.