A guide to finding an existing dataset to reuse in your research.
Who Is Collecting the Data?
The most effective way to locate datasets is by identifying the agency or organization that focuses on a specific research area of interest. This can include government agencies, nonprofits, or professional organizations. For example, if you want to find a dataset about surgeries, you might look at the American College of Surgeons, a professional organization for surgeons, which curates the NSQIP dataset. If you are looking for health statistics, you can consult KFF, a nonprofit that collects information on health.
In addition to government agencies, nonprofits, and professional organizations, data repositories also collect and curate datasets. According to the NNLM’s data glossary, “a repository is a tool to share, preserve, and discover research outputs, including but not limited to data or datasets.” Generally, a data repository is a website that houses a collection of datasets, making them available to a broader audience. Repositories manage data-sharing infrastructure and provide a stable location for researchers to share their work.
Finding Data in Repositories
The key to finding data in a repository is identifying which repositories collect the type of data you are interested in. For biomedical repositories, the NIH list of repositories is a good starting point. A more comprehensive directory of repositories is available at re3data. If you are still not finding promising candidates, try googling “[subject matter] AND data repository”.
After you find a candidate data repository, the next step is to identify datasets of interest. Most repositories have a standard search bar, like those used by other academic databases. Many also include a browsing feature, which gives you a sense of the data included in their collections. Often, clicking the search button without typing in any search terms will show you all datasets in the repository. It can be useful to start by browsing to see how the repository categorizes data and what type of information is included.
Some datasets and repositories control access to their data. A controlled access repository requires some form of verification before researchers can access data. This can mean filling out a form, agreeing to a data-use agreement, or submitting an application with information about your intended project.
Finding Data from Article Data Availability Statements
A data availability statement (DAS) details where the data used in a published paper can be found and how it can be accessed. Many publishers now require data availability statements for all publications. Looking through the data availability statements in articles in your area of interest can point you to associated datasets.
For example, a data availability statement in a PubMed Central article may direct you to a specific repository (such as the Gene Expression Omnibus) and give the accession number of the dataset. The accession number is an identifier assigned to the dataset by the repository, and it can be used to find the dataset. The statement may also provide a direct link.
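If you are reading through many statements, an accession number can also be picked out of the text programmatically. The following is a minimal Python sketch, assuming the statement cites Gene Expression Omnibus series accessions (the "GSE" prefix followed by digits). The function names and the example statement are made up for illustration, and GSE00000 is a placeholder rather than a real dataset.

```python
import re

# GEO series accessions have the form "GSE" followed by digits.
GEO_SERIES_PATTERN = re.compile(r"\bGSE\d+\b")

# Public GEO accession lookup page; the accession goes in the "acc" parameter.
GEO_ACCESSION_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"


def find_geo_accessions(statement: str) -> list[str]:
    """Return unique GEO series accessions in order of first appearance."""
    seen: list[str] = []
    for accession in GEO_SERIES_PATTERN.findall(statement):
        if accession not in seen:
            seen.append(accession)
    return seen


def geo_url(accession: str) -> str:
    """Build the lookup URL for a single accession (hypothetical helper)."""
    return GEO_ACCESSION_URL.format(accession=accession)


# Made-up statement for illustration only; GSE00000 is a placeholder.
example_statement = (
    "The data generated in this study are available in the Gene Expression "
    "Omnibus under accession number GSE00000."
)

for accession in find_geo_accessions(example_statement):
    print(accession, geo_url(accession))
```

Other repositories use different accession formats, so the pattern would need to be adapted for each one.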
PubMed Central includes a filter to search for DASs: “has data avail[filter]”. To build a search, add your keywords and then the filter. For example: “diabetes AND has data avail[filter]”. (The closing period is not part of the search string.)
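The same search string can be assembled in a script if you want to repeat it for several topics. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the NCBI E-utilities `esearch` endpoint with `db=pmc` accepts the same filter syntax as the PubMed Central search box. It only builds the query and URL and does not send any request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed to accept PMC filter syntax).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The PubMed Central filter for articles with a data availability statement.
DAS_FILTER = "has data avail[filter]"


def build_das_query(keywords: str) -> str:
    """Combine keywords with the DAS filter.

    Multi-word keyword expressions are wrapped in parentheses so that the
    filter applies to the whole expression.
    """
    keywords = keywords.strip()
    if " " in keywords:
        keywords = f"({keywords})"
    return f"{keywords} AND {DAS_FILTER}"


def build_esearch_url(keywords: str, max_results: int = 20) -> str:
    """Build (but do not send) an esearch URL for the PMC database."""
    params = {
        "db": "pmc",
        "term": build_das_query(keywords),
        "retmode": "json",
        "retmax": max_results,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(build_das_query("diabetes"))
print(build_esearch_url("diabetes"))
```

The first printed line is the string you would paste into the PubMed Central search box. The second is a URL you could open or fetch if the E-utilities assumption holds.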
Using Google Dataset Search
Google Dataset Search is a search engine that covers metadata for millions of datasets in thousands of repositories across the Web. Much as Google Scholar does for articles, Google Dataset Search lets you find datasets wherever they are hosted, so a single search spans many repositories.
Enter your keyword search, and Google will give you dataset results.
However, again like Google Scholar, Google Dataset Search is broadly inclusive rather than curated, so you should critically appraise the results.
Additional Resources
- Data Discovery at the National Library of Medicine
- Dataset Catalog at the National Library of Medicine: This allows you to search across multiple repositories.
- DataCite Commons: This allows you to search through multiple data repositories at once.
- Data.gov: This is the go-to resource for finding statistical information produced by the federal government across all disciplines.
- Healthdata.gov: One-stop shopping for health data produced by the U.S. government.
- Harvard Catalyst Policy Atlas: A list of sources for public health and policy data.
- ICPSR Data Search: This tool lets you search across the repositories housed at ICPSR.
Sources
Bezet, Amanda. “Datasets.” National University Library, November 13, 2023. https://resources.nu.edu/researchprocess/datasets.
A.T. Still Memorial Library. “Finding and Reusing Data,” September 15, 2023. https://guides.atsu.edu/DataServices/sources.