Health Data

Introduction
Working with health data
Key Readings

Introduction

Working with health data

The following section is adapted from Wu et al. [1].

Health data research involves the following tasks that occur between the extraction of data and the dissemination of results:

Data cleaning/cleansing/scrubbing
- These steps are performed over the entire database:
  - Address errors, inconsistencies, redundancies
  - Standardize inconsistent coding schemes, units, and spellings
  - Combine data and variables that were mistakenly spread across different tables
- Ideally, data will be in a tidy format, where
  - each variable has 1 column,
  - each observation has 1 row, and
  - each cell has 1 value.
- Exercise caution with heuristic algorithms for data cleaning!
  - For example, suppose you assume that any decrease in height, or any unexpectedly large increase, is always an error. You might then exclude a valid and interesting minority of cases and keep only the mundane ones! This can introduce bias into the dataset.
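
The cleaning ideas above can be sketched with pandas. This is a minimal illustration, not a recommended pipeline: the table, the column names (`patient_id`, `unit`, `visit_1`, ...), and the choice to flag height decreases rather than delete them are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical wide table: one column per visit, mixed units across patients.
wide = pd.DataFrame({
    "patient_id": [1, 2],
    "unit": ["cm", "in"],
    "visit_1": [170.0, 63.0],
    "visit_2": [171.0, 63.5],
    "visit_3": [168.0, 63.5],
})

# Reshape to a tidy layout: one row per patient visit, one column per variable.
tidy = wide.melt(id_vars=["patient_id", "unit"],
                 var_name="visit", value_name="height")
tidy["visit"] = tidy["visit"].str.replace("visit_", "", regex=False).astype(int)

# Standardize units: keep centimetre values, convert inches (1 in = 2.54 cm).
tidy["height_cm"] = tidy["height"].where(tidy["unit"] == "cm",
                                         tidy["height"] * 2.54)

# Flag, rather than drop, any decrease since the previous visit so that
# a reviewer can decide whether it is an error or a real clinical change.
tidy = tidy.sort_values(["patient_id", "visit"]).reset_index(drop=True)
change = tidy.groupby("patient_id")["height_cm"].diff()
tidy["height_decrease"] = change < 0

print(tidy)
```

Keeping the flagged rows in the table leaves the decision about exclusion to a later, documented step instead of burying it in the cleaning code.
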
Data preprocessing
- These steps are performed at the research project scale:
  - Summarize and extract useful features (e.g., feature engineering and dimension reduction)
  - Impute missing data, carefully considering why it is missing
  - Combine redundant information
- This stage can involve some type of dimension reduction:
  - Variable grouping or clustering (Which hierarchical medical code should be used?)
  - Principal Component Analysis (Can most of the variance be explained with a small number of components?)
  - Embedding and deep learning (Can binary and categorical variables be turned into continuous feature vectors?)
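
As a rough sketch of imputation followed by Principal Component Analysis, the example below uses NumPy only. The data are synthetic, and the median imputation, the 5% missingness rate, and the 90% variance threshold are illustrative assumptions rather than recommendations from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for six measurements driven by two latent factors.
latent = rng.normal(size=(200, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
X = latent @ loadings + rng.normal(scale=0.1, size=(200, 6))

# Remove about 5% of values completely at random. This is an assumption:
# real missingness is rarely random, so the reason for it matters.
missing_mask = rng.random(X.shape) < 0.05
X[missing_mask] = np.nan

# Simple median imputation per column; missing_mask records what was filled.
col_median = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), col_median, X)

# Standardize each column, then run PCA through a singular value decomposition.
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)
_, singular_values, components = np.linalg.svd(X_scaled, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative variance reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)
scores = X_scaled @ components[:k].T

print("cumulative explained variance:", np.round(cumulative, 3))
print("components kept:", k)
```

Because the six synthetic columns are built from two factors, a small number of components should capture most of the variance here; real clinical data seldom compress this cleanly.
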
Data preparation
- These steps might be done multiple times within a project:
  - Combine or separate overlapping time intervals
  - Define, classify, and label patients by different outcomes
  - Define encounters and time intervals to prepare for analyses
  - Prepare for different scenarios, such as sensitivity analysis
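
Combining overlapping time intervals, mentioned above, can be sketched in plain Python. The function name `merge_intervals` and the admission dates are hypothetical, and the sketch assumes that intervals sharing an endpoint belong to the same episode, which may not suit every study.

```python
from datetime import date


def merge_intervals(intervals):
    """Merge (start, end) intervals that overlap or share an endpoint."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend its end if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical admission and discharge dates for a single patient.
stays = [
    (date(2020, 1, 4), date(2020, 1, 10)),
    (date(2020, 1, 1), date(2020, 1, 5)),
    (date(2020, 2, 1), date(2020, 2, 3)),
]

episodes = merge_intervals(stays)
for start, end in episodes:
    print(start.isoformat(), "to", end.isoformat())
```

In a real project the same function would be applied per patient, and the rule for what counts as one episode (for example, allowing a gap of a day or two) would need to be defined and documented.
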
Data visualization
- Visualization can help you check the validity of your approach
Model selection
Result validation
- Reduce bias and confounding
- Split your data into training, validation, and testing sets, or validate on a different data source
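
When holding out data for validation, one common precaution with health records is to split by patient rather than by row, so that encounters from the same person never land on both sides. The sketch below illustrates that idea in plain Python; the function name `split_by_patient`, the record layout, and the 20% held-out fraction are assumptions for the example.

```python
import random


def split_by_patient(records, validation_fraction=0.2, seed=0):
    """Split records so that each patient appears in exactly one set."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    # Hold out at least one patient, even for very small cohorts.
    n_validation = max(1, round(len(patient_ids) * validation_fraction))
    validation_ids = set(patient_ids[:n_validation])
    train = [r for r in records if r["patient_id"] not in validation_ids]
    validation = [r for r in records if r["patient_id"] in validation_ids]
    return train, validation


# Hypothetical cohort: 10 patients with 3 encounters each.
records = [
    {"patient_id": pid, "encounter": enc}
    for pid in range(10)
    for enc in range(3)
]

train, validation = split_by_patient(records)
print(len(train), "training records,", len(validation), "validation records")
```

Fixing the seed keeps the split reproducible, which makes it easier to document and retest changes to later steps.
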
Result interpretation
- Clinical and epidemiological domain experts should work together closely at this stage.

Over the course of research, the exact steps might need to be updated frequently. Changes should be documented and tested individually.

[1] Wu H, et al., eds. Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics. CRC Press; 2021. ISBN 978-0-367-44239-2.

Key Readings

Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA. 2014 Jun 25;311(24):2479-80.
Sayers EW, Beck J, Bolton EE, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021 Jan 8;49(D1):D10-D17. doi: 10.1093/nar/gkaa892. PMID: 33095870; PMCID: PMC7778943.
Blewett LA, Call KT, Turner J, Hest R. Data Resources for Conducting Health Services and Policy Research. Annu Rev Public Health. 2018;39:437–452. doi:10.1146/annurev-publhealth-040617-013544
Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. Published 2016 May 24. doi:10.1038/sdata.2016.35
Johnson AE, Stone DJ, Celi LA, Pollard TJ. The MIMIC Code Repository: enabling reproducibility in critical care research. J Am Med Inform Assoc. 2018;25(1):32–39. doi:10.1093/jamia/ocx084
Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543630/ doi: 10.1007/978-3-319-43742-2