A Guide to Choosing a Data Repository for NIH-Funded Research

The National Institutes of Health (NIH) prefers data to be shared via a data repository. This article provides a guide to data repositories and an algorithm for choosing where to deposit your data.

What is a Data Repository?

According to the NNLM’s data glossary, “a repository is a tool to share, preserve, and discover research outputs, including but not limited to data or datasets.” Generally, a data repository is a website that houses a collection of datasets, making them available to a broader audience. Repositories manage data-sharing infrastructure and provide a stable location for researchers to share their work.

Why Choose a Data Repository? 

For most researchers, using a data repository is the best practice for data sharing. Using a personal or lab website, or even uploading the dataset as journal article supplementary files, is not advisable because the long-term sustainability of the hosting website is unknown; it could be updated or taken down without warning. In addition, simply adding “available upon request” to the data availability statement in a publication is not sufficient. Studies have shown that many authors fail to provide their data when it is actually requested.

Types of Access in Repositories

Through repositories, several options are available to researchers for data sharing:

  1. Public access: Public datasets are available to all without restriction. This option is commonly used for animal studies or data without privacy concerns.
  2. Controlled access: In a controlled-access repository, researchers must verify their identity before they are allowed to download and analyze data. This can take the form of verifying a university-associated email address, signing a data-use agreement, or sending in an application before access is granted. Some repositories, such as Vivli, which specializes in clinical trial data, require that sensitive data be analyzed in a controlled cloud-computing environment.
  3. Embargoes: Most repositories allow for datasets to be embargoed. Datasets may be embargoed for a number of reasons. For example, the researchers may not wish to publish their data until the accompanying article is available, or they may be pursuing a patent based on their discoveries.

Data Repository Types

Here are the types of repositories that researchers can use for sharing data:

  1. Specialist data repository: Specialist data repositories accept scholarship from certain disciplines or on a specific topic. An example of this type of data repository is ImmPort, a repository funded by NIAID.
  2. Generalist data repository: Generalist data repositories accept scholarship from any discipline. These include figshare, Zenodo, Mendeley Data, Harvard Dataverse, and OSF. Other repositories often grouped with generalists have a narrower scope: Vivli accepts only clinical data, and Dryad primarily accepts data from the sciences.
  3. Institutional data repository: Some institutions host their own data repository to encourage their researchers to deposit data. These are typically limited to affiliates of the host institution, so check whether you or any of your collaborators are affiliated with an institution that hosts one. One exception is the Harvard Dataverse, which allows researchers to deposit their datasets regardless of affiliation.

Tip: Institutional repositories vary in their ability to accept and maintain data. Before committing to using an institutional repository, check that they routinely accept data.

Workflow for Choosing a Repository Under the NIH 2023 DMS Policy

[Figure: flowchart for finding a repository]

Data Repository Choices in Order of Preference

  1. Data repository identified by the NIH Institute, Center, or Office (ICO) or by the Funding Opportunity Announcement (FOA)
  2. Domain- or subject-specific data repository
  3. Generalist repository
  4. Other options:
    1. Datasets under 2 GB can be uploaded to PubMed Central with your final publication.
    2. Very large datasets (multiple terabytes or petabytes) can be stored using cloud storage under the NIH STRIDES Initiative.
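The preference order above can be read as a simple fall-through check. The following Python sketch is only an illustration of that reading, not an NIH tool: the function name, its parameters, and the 2,000 GB cutoff standing in for "multiple terabytes" are all assumptions made for this example.

```python
from typing import Optional, Sequence

# The 2 GB limit comes from the list above. The "very large" cutoff is an
# assumed stand-in for "multiple terabytes"; it is not an NIH-defined number.
PMC_MAX_GB = 2.0
VERY_LARGE_MIN_GB = 2000.0


def choose_repository(
    designated: Optional[str],
    domain_specific: Sequence[str],
    generalist: Sequence[str],
    size_gb: float,
) -> str:
    """Walk the preference order and return the first option that applies."""
    # 1. A repository named by the ICO or the FOA always comes first.
    if designated:
        return f"Designated by ICO or FOA: {designated}"
    # 2. Otherwise prefer a domain- or subject-specific repository.
    if domain_specific:
        return f"Domain-specific: {domain_specific[0]}"
    # 3. Then fall back to a generalist repository.
    if generalist:
        return f"Generalist: {generalist[0]}"
    # 4. Other options depend on dataset size.
    if size_gb < PMC_MAX_GB:
        return "Other: PubMed Central, alongside the final publication"
    if size_gb >= VERY_LARGE_MIN_GB:
        return "Other: cloud storage under the NIH STRIDES Initiative"
    return "No listed option applies"


# Example: no designated or domain repository, but a generalist is suitable.
print(choose_repository(None, [], ["Zenodo"], size_gb=12.0))
```

In practice each step involves judgment (for example, whether a candidate domain repository accepts your data type), so the candidate lists passed in would come from your own review rather than from an automated lookup.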

Finding Data Repositories

Selecting a repository is a personal decision, and no single repository will fit all needs. Luckily, there are a number of repository indices that aggregate data repositories and provide filters to help you pinpoint the one that works best for your research. 

  1. NNLM Repository Finder: An interactive tool that helps researchers find repositories meeting their research requirements.
  2. List of NIH-Supported Data Sharing Resources: A list of subject-specific data sharing repositories that are open for submitting and accessing data.

For a broader list of repositories, check out a repository database such as:

  1. NNLM Data Repository Finder: This tool is meant to help locate NIH-supported repositories for sharing research data.
  2. re3data: A global registry of research data repositories that covers a wide range of academic disciplines. re3data presents repositories for the permanent storage and access of datasets to researchers, funding bodies, publishers, and scholarly institutions.
  3. FAIRsharing: A registry of knowledge databases and repositories of data and other digital assets.

If no subject-specific repository fits your needs, consider the generalist repositories. These repositories generally accept data regardless of data type, format, or disciplinary focus. A comparison chart of generalist repositories can help you determine which one fits your needs.

Key Characteristics to Consider When Selecting a Repository

Selecting a research data repository is an important decision in the research process, as it can impact the visibility, accessibility, and long-term preservation of your data. When choosing a data repository, consider the following criteria:

  1. Unique Persistent Identifiers: Assigns datasets a citable, unique persistent identifier, such as a digital object identifier (DOI) or accession number, to support data discovery, reporting, and research assessment. The identifier points to a persistent landing page that remains accessible even if the dataset is deaccessioned or no longer available.
  2. Long-Term Sustainability: Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.
  3. Metadata: Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using a schema that is appropriate to and ideally widely used across the communities the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.
  4. Curation and Quality Assurance: Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.
  5. Free and Easy Access: Provides broad, equitable, and maximally open access to datasets and their metadata free of charge and in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.
  6. Broad and Measured Reuse: Makes datasets and their metadata available with the broadest possible terms of reuse, and provides the ability to measure attribution, citation, and reuse of data (i.e., through the assignment of adequate metadata and unique persistent identifiers).
  7. Clear Use Guidance: Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).
  8. Security and Integrity: Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.
  9. Confidentiality: Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.
  10. Common Format: Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary formats consistent with those used in the community(ies) the repository serves.
  11. Provenance: Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.
  12. Retention Policy: Provides documentation on policies for data retention within the repository.
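One way to apply these criteria when comparing candidates is to keep them as a checklist and note which ones you have not yet confirmed for a given repository. The Python sketch below is a minimal illustration under that assumption; the function name is hypothetical, and the criterion names are copied from the list above.

```python
# The twelve criteria, in the order listed above.
CRITERIA = [
    "Unique Persistent Identifiers",
    "Long-Term Sustainability",
    "Metadata",
    "Curation and Quality Assurance",
    "Free and Easy Access",
    "Broad and Measured Reuse",
    "Clear Use Guidance",
    "Security and Integrity",
    "Confidentiality",
    "Common Format",
    "Provenance",
    "Retention Policy",
]


def unmet_criteria(confirmed):
    """Return the criteria not yet confirmed, in the order listed above."""
    unknown = set(confirmed) - set(CRITERIA)
    if unknown:
        # Catch typos so a misspelled name is not silently ignored.
        raise ValueError(f"Not in the checklist: {sorted(unknown)}")
    return [name for name in CRITERIA if name not in confirmed]


# Example: criteria you have verified from a repository's documentation.
confirmed = {"Unique Persistent Identifiers", "Metadata", "Retention Policy"}
for name in unmet_criteria(confirmed):
    print("Still to check:", name)
```

A flat checklist treats every criterion as equally important, which may not match your project; data with privacy concerns, for instance, would likely weight the security and confidentiality items more heavily.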

Additional Considerations for Human Data

When working with human participant data, including de-identified human data, here are some additional characteristics to look for:

  1. Fidelity to Consent: Uses documented procedures to restrict dataset access and use to those that are consistent with participant consent and changes in consent.
  2. Restricted Use Compliant: Uses documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.
  3. Privacy: Implements and provides documentation of measures (for example, tiered access, credentialing of data users, and security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.
  4. Plan for Breach: Has security measures that include a response plan for detected data breaches.
  5. Download Control: Controls and audits access to and download of datasets (if download is permitted).
  6. Violations: Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.
  7. Request Review: Makes use of an established and transparent process for reviewing data access requests.

Ultimately, the choice of a data repository should align with your research goals and the needs of your specific project. Consider consulting with colleagues, your institution's librarians, publishers, and experts in your field for guidance on the most suitable repository for your data.

Video: Data Repository 101  

Presented October 30, 2020. Dr. Jennie Larkin of the National Institute on Aging, co-chair of the FAIR Data Repositories Team, provides an overview of data repositories available to researchers. She covers the types of repositories available, ways to find repositories, and the characteristics NIH seeks in a strong repository.

Sources

Bohman, Lena. “LibGuides: SOM NIH Data Management and Sharing Policy,” September 2023. https://libguides.hofstra.edu/c.php?g=1275561&p=9358324

The National Science and Technology Council, Desirable Characteristics of Data Repositories for Federally Funded Research, 2022, DOI: https://doi.org/10.5479/10088/113528 

NIH Selecting a Data Repository. https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/selecting-a-data-repository

 
 

About FASEB DataWorks! Initiative

DataWorks! is an initiative launched by FASEB to provide biological and biomedical researchers with guidance and support on best practices for research data management, sharing, and reuse. It does this by providing: