URSA Initiative in Rhode Island

Datasets

This page describes various synthetic and de-identified health datasets available to researchers.

SyntheticRI

The SyntheticRI datasets were generated by the Brown Center for Biomedical Informatics (BCBI) for use in research and education. These datasets contain realistic but fictional residents of the state of Rhode Island. The synthetic population aims to statistically mirror the real population in terms of demographics, disease burden, vaccinations, medical visits, and social determinants.

  • SyntheticRI Demo: synthetic data representing 1,188 Rhode Island individuals of all ages

  • SyntheticRI Adult: synthetic data representing 145,010 Rhode Island adults, ages 19-99

  • SyntheticRI Peds: synthetic data representing 145,010 Rhode Island children, ages 0-18

The SyntheticRI datasets were generated using Synthea, an open-source synthetic patient generator. The Synthea-generated datasets are in .csv file format.
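As a rough illustration of working with Synthea-style CSV output, the sketch below counts patients by gender using only the Python standard library. The column names (`Id`, `BIRTHDATE`, `GENDER`) follow Synthea's usual `patients.csv` layout, but treating them as the SyntheticRI layout is an assumption, and the inline rows are made up for the example.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny inline stand-in for a patients.csv file (made-up rows, not SyntheticRI data).
sample = io.StringIO(
    "Id,BIRTHDATE,GENDER\n"
    "a1,1980-02-11,F\n"
    "b2,2015-07-30,M\n"
    "c3,1999-12-01,F\n"
)

gender_counts = count_by_column(sample, "GENDER")
print(dict(gender_counts))
```

With a real export, the same function could be called on a file opened with `open("patients.csv", newline="")`; the file name here is hypothetical.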

Each dataset was also transformed to the OMOP Common Data Model (CDM) using ETL-Synthea, a program from the Observational Health Data Sciences and Informatics (OHDSI) Consortium. These datasets can be accessed through direct database queries or with OHDSI software tools such as ATLAS and HADES.
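A direct database query against an OMOP CDM might look like the minimal sketch below. It uses an in-memory SQLite database as a stand-in, because the actual database engine and connection details for SyntheticRI are not stated here; the `person` table and its `gender_concept_id` and `year_of_birth` columns come from the OMOP CDM, where concept 8507 denotes male and 8532 denotes female. The three rows are invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for the real OMOP database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "person_id INTEGER PRIMARY KEY, "
    "gender_concept_id INTEGER, "
    "year_of_birth INTEGER)"
)
# Made-up rows; 8507 = male, 8532 = female in the OMOP vocabulary.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 8532, 1980), (2, 8507, 2015), (3, 8532, 1999)],
)


def persons_by_gender(connection):
    """Return a {gender_concept_id: person count} mapping."""
    rows = connection.execute(
        "SELECT gender_concept_id, COUNT(*) FROM person "
        "GROUP BY gender_concept_id ORDER BY gender_concept_id"
    )
    return dict(rows.fetchall())


counts = persons_by_gender(conn)
print(counts)
```

The SQL itself is plain enough to carry over to other engines, but the connection code would need to be replaced with whatever driver the hosting environment provides.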

For more information, please email ursa-help@brown.edu.

MIMIC-IV

MIMIC-IV (Medical Information Mart for Intensive Care) is a large, freely available relational database comprising de-identified health-related data from real patients who were admitted to the critical care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA.

MIMIC-IV contains comprehensive information from 2008-2019 for over 60,000 hospitalized patients. The database is intended to support a wide variety of research in healthcare. MIMIC-IV builds on the success of MIMIC-III and incorporates numerous improvements over it.

Refer to MIT's MIMIC documentation for more information about MIMIC-IV. Researchers interested in accessing the complete MIMIC-IV dataset should follow MIT's "Getting Started" instructions.

HCUP

The Healthcare Cost and Utilization Project (HCUP) is a family of databases, software tools, and related products developed through a Federal-State-Industry partnership and sponsored by the Agency for Healthcare Research and Quality (AHRQ). HCUP databases are derived from administrative data and contain encounter-level, clinical, and nonclinical information including all-listed diagnoses and procedures, discharge status, patient demographics, and charges for all patients, regardless of payer, beginning in 1988.

HCUP offers the following databases:

  • National (Nationwide) Inpatient Sample (NIS): the largest publicly available all-payer hospital inpatient care database in the United States

  • Kids' Inpatient Database (KID): hospital inpatient stays for children; specifically designed to allow researchers to study a broad range of conditions and procedures related to children's health

  • Nationwide Emergency Department Sample (NEDS): emergency department (ED) visits that do not result in an admission as well as ED visits that result in an admission to the same hospital

  • Nationwide Readmissions Database (NRD): designed to support various types of analyses of national readmission rates for all payers and uninsured individuals

  • State Inpatient Databases (SID): inpatient discharge abstracts from participating States, translated into a uniform format to facilitate multi-State comparisons and analyses

  • State Ambulatory Surgery and Services Databases (SASD): encounter-level data for ambulatory surgery and other outpatient services from hospital-owned facilities

  • State Emergency Department Databases (SEDD): discharge information on all emergency department visits that do not result in an admission

Please note: Access to HCUP databases is not free. Database releases must be purchased through the Online HCUP Central Distributor.

Visit ahrq.gov/data/hcup for more information about the databases. Learn more about HCUP on the HCUP website.

SEER

The Surveillance, Epidemiology, and End Results (SEER) Program provides information on cancer statistics in an effort to reduce the cancer burden among the U.S. population. SEER collects cancer incidence data from population-based cancer registries covering approximately 47.9 percent of the U.S. population. The SEER datasets include data on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment.

SEER offers the following datasets:

  • SEER Research Data

    • Register with any valid email

    • Includes National Childhood Cancer Registry (NCCR) data excluding geography

    • Excludes geography, month and year of diagnosis, and other demographic fields

  • SEER Research Plus and NCCR Data

    • Requires user authentication through eRA Commons or an HHS account

    • Includes geography, month and year of diagnosis, and other demographic fields

Visit https://seer.cancer.gov/data/ for a deeper comparison of the datasets. Learn more about SEER and SEER datasets on the SEER website.

SyntheticMass

SyntheticMass is a Synthea-generated data set that contains realistic but fictional residents of the state of Massachusetts. The synthetic population aims to statistically mirror the state population in terms of demographics, disease burden, vaccinations, medical visits, and social determinants. Refer to the SyntheticMass website for more information.

There are several data sets available on the SyntheticMass downloads page.

  • Complete SyntheticMass data sets: "SyntheticMass Data Version 2 (24 May, 2017)". This ZIP file is quite large (21 GB), so make sure you move the file to a location with enough storage before attempting to unzip.

  • Sample data sets (<100 MB) containing 100 or 1,000 patient records

  • Specialized data sets that have been generated using Synthea by other study teams. These include COVID-19 data sets, a Childhood Obesity data set, and more.

The following versions are available for each data set.

  • C-CDA (XML files)

  • FHIR (JSON files)

  • CSV (data dictionary describing all CSV tables)

References

Resources

MIT Laboratory for Computational Physiology. (n.d.). About MIMIC. Retrieved May 1, 2024, from https://mimic.mit.edu/docs/about/

Agency for Healthcare Research and Quality. (n.d.). Healthcare Cost and Utilization Project (HCUP). Retrieved May 1, 2024, from https://www.ahrq.gov/data/hcup/index.html

National Cancer Institute. (n.d.). SEER Data. Retrieved May 1, 2024, from https://seer.cancer.gov/data/

MITRE Corporation. (n.d.). About Synthea. Retrieved May 1, 2024, from https://synthea.mitre.org/about

Articles

Johnson, A.E.W., Bulgarelli, L., Shen, L. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10, 1 (2023). https://doi.org/10.1038/s41597-022-01899-x

Paris N, Lamer A, Parrot A. Transformation and Evaluation of the MIMIC Database in the OMOP Common Data Model: Development and Usability Study. JMIR Med Inform. 2021 Dec 14;9(12):e30970. doi: 10.2196/30970. PMID: 34904958; PMCID: PMC8715361.

Johnson AE, Stone DJ, Celi LA, Pollard TJ. The MIMIC Code Repository: enabling reproducibility in critical care research. J Am Med Inform Assoc. 2018 Jan 1;25(1):32-39. doi: 10.1093/jamia/ocx084. PMID: 29036464; PMCID: PMC6381763.

Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intell Based Med. 2020 Nov;1:100007. doi: 10.1016/j.ibmed.2020.100007. Epub 2020 Oct 2. PMID: 33043312; PMCID: PMC7531559.

Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, Duffett C, Dube K, Gallagher T, McLachlan S. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc. 2018 Mar 1;25(3):230-238. doi: 10.1093/jamia/ocx079. Erratum in: J Am Med Inform Assoc. 2018 Jul 1;25(7):921. PMID: 29025144; PMCID: PMC7651916.

Links

  • HCUP Website

  • HCUP Databases

  • SEER Website

  • SEER Data Products

  • SEER Data Access Request Process

  • SyntheticMass Website