Datasets
This page describes various synthetic and de-identified health datasets available to researchers.
SyntheticRI
The SyntheticRI datasets were generated by the Brown Center for Biomedical Informatics (BCBI) for use in research and education. These datasets contain realistic but fictional residents of the state of Rhode Island. The synthetic population aims to statistically mirror the real population in terms of demographics, disease burden, vaccinations, medical visits, and social determinants.
SyntheticRI Demo: synthetic data representing 1,188 Rhode Island individuals of all ages
SyntheticRI Adult: synthetic data representing 145,010 Rhode Island adults, ages 19-99
SyntheticRI Peds: synthetic data representing 145,010 Rhode Island children, ages 0-18
The SyntheticRI datasets were generated using Synthea, an open-source, synthetic patient generator. The Synthea-generated datasets are in .csv file format.
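As a minimal sketch of working with the Synthea .csv output, the example below tallies patients by a column using only the Python standard library. The inline sample stands in for a patients file; the column names (Id, BIRTHDATE, GENDER) are assumed from Synthea's usual CSV export and should be checked against the actual SyntheticRI files.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for a Synthea patients file.
# Column names are an assumption; verify them against the real data.
SAMPLE_PATIENTS_CSV = """Id,BIRTHDATE,GENDER
a1,1980-04-02,M
b2,1975-11-30,F
c3,2010-06-15,F
"""


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    counts = count_by_column(io.StringIO(SAMPLE_PATIENTS_CSV), "GENDER")
    print(dict(counts))
```

For a file on disk, pass an open file handle (for example, open(path, newline="")) in place of the in-memory sample.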
Each dataset was also transformed to the Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model (CDM) using ETL-Synthea, a program from the OHDSI consortium. These datasets can be accessed through direct database queries or with OHDSI software tools such as ATLAS and HADES.
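A direct query against OMOP-formatted data might look like the sketch below. It uses an in-memory SQLite database as a stand-in, since this page does not state which database engine or connection details apply to the SyntheticRI OMOP datasets. The person table and its columns follow the OMOP CDM; the three rows are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for an OMOP database (illustrative rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        "person_id INTEGER PRIMARY KEY, "
        "gender_concept_id INTEGER, "
        "year_of_birth INTEGER)"
    )
    # 8507 and 8532 are the OMOP standard concept IDs for male and female.
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, 8507, 1980), (2, 8532, 1975), (3, 8532, 2010)],
    )
    return conn


def count_persons_by_gender(conn):
    """Return {gender_concept_id: number of persons} from the person table."""
    rows = conn.execute(
        "SELECT gender_concept_id, COUNT(*) FROM person "
        "GROUP BY gender_concept_id ORDER BY gender_concept_id"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(count_persons_by_gender(build_demo_db()))
```

The same SELECT statement should carry over to other SQL engines with little or no change; only the connection setup would differ.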
For more information, please email ursa-help@brown.edu.
MIMIC-IV
MIMIC-IV (Medical Information Mart for Intensive Care) is a large, freely available relational database comprising de-identified health-related data from real patients who were admitted to the critical care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA.
MIMIC-IV contains comprehensive information from 2008-2019 for over 60,000 hospitalized patients. The database is intended to support a wide variety of research in healthcare. MIMIC-IV builds upon the success of MIMIC-III and incorporates numerous improvements over it. [1]
Refer to MIT's MIMIC documentation for more information about MIMIC-IV. Researchers interested in accessing the complete MIMIC-IV dataset should follow MIT's "Getting Started" instructions.
HCUP
The Healthcare Cost and Utilization Project (HCUP) is a family of databases, software tools, and related products developed through a Federal-State-Industry partnership and sponsored by the Agency for Healthcare Research and Quality (AHRQ). HCUP databases are derived from administrative data and contain encounter-level, clinical, and nonclinical information including all-listed diagnoses and procedures, discharge status, patient demographics, and charges for all patients, regardless of payer, beginning in 1988. [2]
HCUP offers the following databases:
National (Nationwide) Inpatient Sample (NIS): largest publicly available all-payer hospital inpatient care database in the United States
Kids' Inpatient Database (KID): hospital inpatient stays for children, specifically designed to allow researchers to study a broad range of conditions and procedures related to children's health
Nationwide Emergency Department Sample (NEDS): emergency department (ED) visits that do not result in an admission as well as ED visits that result in an admission to the same hospital
Nationwide Readmissions Database (NRD): designed to support various types of analyses of national readmission rates for all payers and uninsured individuals
State Inpatient Databases (SID): inpatient discharge abstracts from participating States, translated into a uniform format to facilitate multi-State comparisons and analyses
State Ambulatory Surgery and Services Databases (SASD): encounter-level data for ambulatory surgery and other outpatient services from hospital-owned facilities
State Emergency Department Databases (SEDD): discharge information on all emergency department visits that do not result in an admission
Please note: Access to HCUP databases is not free. Database releases must be purchased through the Online HCUP Central Distributor.
Visit ahrq.gov/data/hcup for more information about the databases. Learn more about HCUP on the HCUP website.
SEER
The Surveillance, Epidemiology, and End Results (SEER) Program provides information on cancer statistics in an effort to reduce the cancer burden among the U.S. population. SEER collects cancer incidence data from population-based cancer registries covering approximately 47.9 percent of the U.S. population. The SEER datasets include data on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment. [3]
SEER offers the following datasets:
SEER Research Data
Register with any valid email
Excludes geography, month and year of diagnosis, and other demographic fields
SEER Research Plus and NCCR Data
Requires user authentication through eRA Commons or an HHS account
Includes geography, month and year of diagnosis, and other demographic fields
Includes National Childhood Cancer Registry (NCCR) data excluding geography
Visit https://seer.cancer.gov/data/ for a deeper comparison of the datasets. Learn more about SEER and SEER Datasets on the SEER website.
SyntheticMass
SyntheticMass is a Synthea-generated data set that contains realistic but fictional residents of the state of Massachusetts. The synthetic population aims to statistically mirror the state population in terms of demographics, disease burden, vaccinations, medical visits, and social determinants. [4] Refer to the SyntheticMass website for more information.
There are several data sets available on the SyntheticMass downloads page.
Complete SyntheticMass data sets: "SyntheticMass Data Version 2 (24 May, 2017)". This ZIP file is quite large (21 GB), so move it to a location with enough free storage before attempting to unzip it.
Sample data sets (<100MB) containing 100 or 1,000 patient records
Specialized data sets that have been generated using Synthea by other study teams. These include COVID-19 data sets, a Childhood Obesity data set and more.
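Because the complete archive is large, it can help to compare its uncompressed size with the free space at the destination before extracting. The sketch below does this with the Python standard library. The function names and the 10% safety margin are illustrative choices, not part of any SyntheticMass tooling.

```python
import shutil
import zipfile


def uncompressed_size(zip_path):
    """Total size in bytes of all members of the archive once extracted."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())


def has_room_to_extract(zip_path, dest_dir, margin=1.1):
    """True if dest_dir has at least margin times the uncompressed size free."""
    needed = uncompressed_size(zip_path) * margin
    free = shutil.disk_usage(dest_dir).free
    return free >= needed
```

A call such as has_room_to_extract("synthea_data.zip", "/path/to/scratch") (both paths hypothetical) returns False when the destination is too small, in which case the archive should be moved elsewhere before unzipping.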
The following versions are available for each data set.
CSV (data dictionary describing all CSV tables)
C-CDA (xml files)
FHIR (json files)
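For the FHIR version, each json file is typically a FHIR Bundle whose entries wrap individual resources. The sketch below counts resource types in one bundle. The inline sample is made up and far smaller than a real export; that the SyntheticMass files follow this Bundle layout is an assumption to confirm against the downloaded data.

```python
import json
from collections import Counter

# Made-up miniature bundle; real patient bundles contain many more entries.
SAMPLE_BUNDLE = json.dumps({
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Encounter", "id": "e2"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
    ],
})


def count_resource_types(bundle_json):
    """Count how many resources of each type a FHIR Bundle (JSON text) holds."""
    bundle = json.loads(bundle_json)
    return Counter(
        entry["resource"]["resourceType"] for entry in bundle.get("entry", [])
    )


if __name__ == "__main__":
    print(dict(count_resource_types(SAMPLE_BUNDLE)))
```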
References
[1] MIT Laboratory for Computational Physiology. (n.d.). About MIMIC. Retrieved May 1, 2024, from https://mimic.mit.edu/docs/about/
[2] Agency for Healthcare Research and Quality. (n.d.). Healthcare Cost and Utilization Project (HCUP). Retrieved May 1, 2024, from https://www.ahrq.gov/data/hcup/index.html
[3] National Cancer Institute. (n.d.). SEER Data. Retrieved May 1, 2024, from https://seer.cancer.gov/data/
[4] MITRE Corporation. (n.d.). About Synthea. Retrieved May 1, 2024, from https://synthea.mitre.org/about
Resources
Articles
Johnson, A.E.W., Bulgarelli, L., Shen, L. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10, 1 (2023). https://doi.org/10.1038/s41597-022-01899-x
Paris N, Lamer A, Parrot A. Transformation and Evaluation of the MIMIC Database in the OMOP Common Data Model: Development and Usability Study. JMIR Med Inform. 2021 Dec 14;9(12):e30970. doi: 10.2196/30970. PMID: 34904958; PMCID: PMC8715361
Johnson AE, Stone DJ, Celi LA, Pollard TJ. The MIMIC Code Repository: enabling reproducibility in critical care research. J Am Med Inform Assoc. 2018 Jan 1;25(1):32-39. doi: 10.1093/jamia/ocx084. PMID: 29036464; PMCID: PMC6381763.
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intell Based Med. 2020 Nov;1:100007. doi: 10.1016/j.ibmed.2020.100007. Epub 2020 Oct 2. PMID: 33043312; PMCID: PMC7531559.
Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, Duffett C, Dube K, Gallagher T, McLachlan S. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc. 2018 Mar 1;25(3):230-238. doi: 10.1093/jamia/ocx079. Erratum in: J Am Med Inform Assoc. 2018 Jul 1;25(7):921. PMID: 29025144; PMCID: PMC7651916.