DataWorks! Digest

National Library of Medicine Launches New Dataset Catalog

Written by DataWorks! Team | Feb 28, 2024 10:36:18 PM

The National Library of Medicine (NLM) has released a new Dataset Catalog for beta testing by the public.

A data catalog serves as a directory of datasets. Unlike a repository, a data catalog does not store datasets; rather, it gives information about each dataset and where to find it in a public repository. Using a data catalog, researchers can search across different data repositories and find datasets of interest without knowing in which specific repository they are housed.
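To make the distinction concrete, here is a minimal Python sketch of a catalog as a searchable index of metadata records that point to data held elsewhere. The `CatalogEntry` fields, the `search` helper, and the example entries and URLs are all hypothetical illustrations; they do not reflect the NLM's actual schema or holdings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata about a dataset; the data itself stays in the repository."""
    title: str
    repository: str
    url: str  # where the dataset actually lives


# Hypothetical entries for illustration only.
CATALOG = [
    CatalogEntry("Asthma cohort genotypes", "Repository A", "https://repo-a.example/ds/1"),
    CatalogEntry("Asthma clinical trial results", "Repository B", "https://repo-b.example/ds/7"),
    CatalogEntry("Coral reef survey", "Repository B", "https://repo-b.example/ds/9"),
]


def search(catalog, keyword):
    """Return entries whose title contains the keyword, across all repositories."""
    needle = keyword.lower()
    return [entry for entry in catalog if needle in entry.title.lower()]


# One query surfaces matches from more than one repository.
for hit in search(CATALOG, "asthma"):
    print(f"{hit.title} -> {hit.repository}: {hit.url}")
```

The search runs over descriptions only, so the researcher does not need to know beforehand which repository holds a given dataset.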

In a blog post announcing the new catalog, the NLM states that the goal of this project is to create an easy-to-use, all-in-one tool. The new data catalog could eventually do for datasets what PubMed does for research literature: act as a clearinghouse that links out to the full resource, whether that is an article's full text or a dataset.

The catalog also serves as a test for the NLM’s new datasets metadata model (DATMM). All information about the datasets included in this catalog is converted to the DATMM format. According to the NLM, “By harmonizing and standardizing the structure of descriptive data, the Dataset Catalog facilitates discovery and reuse of biomedical datasets and will eventually make it easier to find and connect datasets to related objects on the Semantic Web.”
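As a rough illustration of what harmonizing descriptive data can involve, the Python sketch below renames fields from two differently structured source records into one shared set of names. The field names, the `FIELD_MAPS` table, and the `harmonize` function are assumptions made for this example and are not taken from DATMM itself.

```python
# Hypothetical source records: each repository names its fields differently.
record_a = {"study_name": "Example Study", "accession": "A-001"}
record_b = {"title": "Example Dataset", "doi": "10.0000/example"}

# Per-repository mapping from source field names to one shared set of names.
FIELD_MAPS = {
    "repo_a": {"study_name": "title", "accession": "identifier"},
    "repo_b": {"title": "title", "doi": "identifier"},
}


def harmonize(record, source):
    """Rename a record's fields to the shared names and note its source."""
    mapping = FIELD_MAPS[source]
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    common["source"] = source
    return common


harmonized = [harmonize(record_a, "repo_a"), harmonize(record_b, "repo_b")]
for entry in harmonized:
    print(entry)
```

Once every record shares the same field names, a single search or display routine can handle entries from any of the participating repositories.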

Currently, the repositories included in the beta are:

  • dbGaP: The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies investigating the interaction of genotype and phenotype in humans.
  • Dryad: Dryad is an international repository of data underlying peer-reviewed scientific and medical literature, particularly data for which no specialized repository exists.
  • ImmPort: The Immunology Database and Analysis Portal (ImmPort) archives clinical study and trial data generated by investigators funded by the National Institute of Allergy and Infectious Diseases’ Division of Allergy, Immunology, and Transplantation.
  • Harvard Dataverse: Harvard Dataverse Repository is a research data repository running on the open-source Dataverse software. The repository, which is fully open to the public and free for all researchers worldwide, allows upload and browsing of data from all fields of research. Harvard Dataverse Repository receives support from Harvard University, public and private grants, and an emergent consortium model.

The NLM invites feedback on the beta version via the blue “Give Feedback” button on the right side of the Dataset Catalog website.