The following section is adapted from Wu et al.
Health data research involves the following tasks that occur between the extraction of data and the dissemination of results:
These steps are performed over the entire database:
Address errors, inconsistencies, and redundancies
Standardize inconsistent coding schemes, units, and spellings
Combine data and variables that were mistakenly spread across different tables
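As a rough illustration of the standardization and redundancy steps above, here is a minimal sketch in plain Python. The record layout, the field names (`patient_id`, `sex`, `weight`, `unit`), and the mapping tables are all hypothetical; a real database would need mappings reviewed by someone who knows the source systems.

```python
# Hypothetical raw weight records; field names and values are illustrative only.
raw_records = [
    {"patient_id": 1, "sex": "F", "weight": 70.0, "unit": "kg"},
    {"patient_id": 1, "sex": "F", "weight": 70.0, "unit": "kg"},  # exact duplicate
    {"patient_id": 2, "sex": "female", "weight": 154.0, "unit": "lb"},
    {"patient_id": 3, "sex": " Male ", "weight": 80.0, "unit": "KG"},
]

# Assumed lookup tables for spellings and units.
SEX_MAP = {"f": "F", "female": "F", "m": "M", "male": "M"}
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}


def clean_record(rec):
    """Standardize spelling and units for a single record."""
    sex_key = rec["sex"].strip().lower()
    unit_key = rec["unit"].strip().lower()
    return {
        "patient_id": rec["patient_id"],
        "sex": SEX_MAP.get(sex_key),  # None when the spelling is not recognized
        "weight_kg": round(rec["weight"] * KG_PER_UNIT[unit_key], 1),
    }


def clean_records(records):
    """Standardize every record, then drop rows that became exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = clean_record(rec)
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


cleaned = clean_records(raw_records)
for row in cleaned:
    print(row)
```

Standardizing before deduplicating matters here: two rows that differ only in spelling or unit become identical only after they are normalized.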
Ideally, data will be in a tidy format, where
each variable has 1 column,
each observation has 1 row, and
each cell has 1 value.
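To make the tidy format concrete, the sketch below reshapes a small "wide" table, in which one variable is spread over several year columns, into one row per patient and year. The column naming pattern (`sbp_2020`, `sbp_2021`) and the helper `to_tidy` are assumptions made for this example.

```python
# Hypothetical wide table: systolic blood pressure spread over year columns.
wide_rows = [
    {"patient_id": 1, "sbp_2020": 120, "sbp_2021": 125},
    {"patient_id": 2, "sbp_2020": 135, "sbp_2021": None},
]


def to_tidy(rows, id_field="patient_id"):
    """Turn columns named '<variable>_<year>' into one row per (patient, year)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            variable, year = column.rsplit("_", 1)
            tidy.append({id_field: row[id_field], "year": int(year), variable: value})
    return tidy


tidy_rows = to_tidy(wide_rows)
for row in tidy_rows:
    print(row)
```

After reshaping, `year` and `sbp` are each a single column, every row is one observation, and a missing measurement stays visible as an explicit empty cell.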
Exercise caution with heuristic algorithms for data cleaning!
For example, suppose you assume that any decrease or unexpectedly large increase in a patient's height is always an error. You might then exclude a valid and interesting minority of cases in order to focus on the mundane, which can introduce bias into the dataset.
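One cautious alternative is to flag suspicious values for review instead of deleting them. The sketch below does this for a single patient's height series; the function name and both thresholds are hypothetical and would need clinical input before real use.

```python
def flag_height_changes(heights_cm, max_drop_cm=2.0, max_gain_cm=10.0):
    """Return (index, change) pairs for suspicious consecutive changes.

    Nothing is removed from the input; the thresholds are illustrative only.
    """
    flags = []
    for i in range(1, len(heights_cm)):
        change = heights_cm[i] - heights_cm[i - 1]
        if change < -max_drop_cm or change > max_gain_cm:
            flags.append((i, change))
    return flags


# Hypothetical series: 117.0 looks like a transposition of 171.0,
# while the final drop to 168.0 could be a real loss of height.
heights = [170.0, 171.0, 117.0, 171.5, 168.0]
flags = flag_height_changes(heights)
print(flags)
```

Both the likely typo and the plausible real change are flagged, and the decision about each one is left to a reviewer rather than to the heuristic.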
These steps are performed at the research project scale:
Summarize and extract useful features (e.g., feature engineering and dimension reduction)
Impute missing data, carefully considering why it is missing
Combine redundant information
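As a small sketch of the imputation step, the example below fills missing values with the median of the observed values and also keeps an indicator of which values were missing. Median imputation is only one simple choice, used here as an assumption; it is not the method prescribed by the source, and it is reasonable only when the reason for missingness has been considered.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill None with the observed median and record which entries were missing."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


# Hypothetical lab results; None means the test was not recorded.
lab_values = [5.1, None, 6.3, 5.8, None]
imputed, was_missing = impute_with_indicator(lab_values)
print(imputed)
print(was_missing)
```

Keeping the indicator alongside the imputed values preserves the fact that a measurement was absent, which may itself carry information in health records.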
This stage can involve some type of dimension reduction:
Variable grouping or clustering (Which hierarchical medical code should be used?)
Principal Component Analysis (Can most of the variance be explained by a small number of components?)
Embedding and deep learning (Can binary and categorical variables be turned into continuous feature vectors?)
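The variable grouping option can be sketched with hierarchical diagnosis codes. The example assumes ICD-10-style codes, where the leading three characters identify a broader category; the helper `to_category` and the sample codes are illustrative, and the right level of the hierarchy depends on the research question.

```python
from collections import Counter


def to_category(code, length=3):
    """Truncate a hierarchical code to its leading characters (assumed category level)."""
    return code.replace(".", "").upper()[:length]


# Hypothetical diagnosis codes recorded for one cohort.
diagnoses = ["E11.9", "E11.65", "I10", "e11.21", "I25.10"]
category_counts = Counter(to_category(code) for code in diagnoses)
print(category_counts)
```

Five distinct detailed codes collapse into three categories, which reduces the number of sparse variables at the cost of clinical detail.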
These steps might be done multiple times within a project:
Combine or separate overlapping time intervals
Define, classify, and label patients by different outcomes
Define encounters and time intervals to prepare for analyses
Prepare for different scenarios, such as sensitivity analysis
This can help check the validity of your approach
Reduce bias and confounding
Split your data into training, validation, and testing sets, or use a separate data source for validation
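For the splitting step, one practical concern with patient data is that records from the same patient can end up on both sides of a split. The sketch below assigns each patient to a split by hashing the patient identifier, so all of a patient's encounters stay together and the assignment is reproducible. The function `assign_split`, the split names, and the 20% fraction are assumptions for illustration.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2):
    """Deterministically map a patient identifier to 'train' or 'test'."""
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical encounters; several belong to the same patient.
encounters = [
    {"patient_id": 101, "encounter_id": "a"},
    {"patient_id": 101, "encounter_id": "b"},
    {"patient_id": 202, "encounter_id": "c"},
    {"patient_id": 303, "encounter_id": "d"},
    {"patient_id": 303, "encounter_id": "e"},
]

splits = {}
for enc in encounters:
    splits.setdefault(assign_split(enc["patient_id"]), []).append(enc)

for name, rows in splits.items():
    print(name, [row["encounter_id"] for row in rows])
```

Because the assignment depends only on the identifier, rerunning the extraction with new encounters does not move existing patients between splits.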
Clinical and epidemiological domain experts should work together carefully at this stage.
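Returning to the first step of this stage, combining overlapping time intervals can be sketched as a sort followed by a single pass. The example treats each interval as a `(start, end)` pair of dates and merges intervals that overlap or touch; whether touching intervals should count as one episode is an assumption that the experts involved would need to confirm.

```python
from datetime import date


def merge_intervals(intervals):
    """Merge (start, end) pairs that overlap or touch, returning them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval when the new one starts before it ends.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical hospital stays for one patient.
stays = [
    (date(2020, 1, 4), date(2020, 1, 10)),
    (date(2020, 1, 1), date(2020, 1, 5)),
    (date(2020, 2, 1), date(2020, 2, 3)),
]
merged_stays = merge_intervals(stays)
for start, end in merged_stays:
    print(start, end)
```

The two January stays become one continuous episode, while the February stay remains separate.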
Over the course of research, the exact steps might need to be updated frequently. Changes should be documented and tested individually.
- Wu H, et al., eds. Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics. CRC Press; 2021. ISBN 978-0-367-44239-2.
- Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA. 2014 Jun 25;311(24):2479-80.
- Sayers EW, Beck J, Bolton EE, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021 Jan 8;49(D1):D10-D17. doi: 10.1093/nar/gkaa892. PMID: 33095870; PMCID: PMC7778943.
- Blewett LA, Call KT, Turner J, Hest R. Data Resources for Conducting Health Services and Policy Research. Annu Rev Public Health. 2018;39:437–452. doi:10.1146/annurev-publhealth-040617-013544
- Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. Published 2016 May 24. doi:10.1038/sdata.2016.35
- Johnson AE, Stone DJ, Celi LA, Pollard TJ. The MIMIC Code Repository: enabling reproducibility in critical care research. J Am Med Inform Assoc. 2018;25(1):32–39. doi:10.1093/jamia/ocx084
- Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543630/ doi: 10.1007/978-3-319-43742-2