
Health Data


Working with health data

The following section is adapted from Wu et al.[1].

Health data research involves the following tasks that occur between the extraction of data and the dissemination of results:

  • Data cleaning/cleansing/scrubbing

    • These steps are performed over the entire database:

      • Address errors, inconsistencies, redundancies

      • Standardize inconsistent coding schemes, units, and spellings

      • Combine data and variables that were mistakenly spread across different tables

    • Ideally, data will be in a tidy format, where

      • each variable has 1 column,

      • each observation has 1 row, and

      • each cell has 1 value.

    • Exercise caution with heuristic algorithms for data cleaning!

      • For example, suppose you assume that any decrease or unexpected increase in height is always an error. You might then exclude a valid and interesting minority of cases and keep only the mundane ones, which can introduce bias into the dataset.

  • Data preprocessing

    • These steps are performed at the research project scale:

      • Summarize and extract useful features (e.g., feature engineering and dimension reduction)

      • Impute missing data, carefully considering why it is missing

      • Combine redundant information

    • This stage can involve some type of dimension reduction:

      • Variable grouping or clustering (Which hierarchical medical code should be used?)

      • Principal Component Analysis (Can most of the variance be explained with a small number of components?)

      • Embedding and deep learning (Can binary and categorical variables be turned into continuous feature vectors?)

  • Data preparation

    • These steps might be done multiple times within a project:

      • Combine or separate overlapping time intervals

      • Define, classify, and label patients by different outcomes

      • Define encounters and time intervals to prepare for analyses

      • Prepare for different scenarios, such as sensitivity analysis

  • Data visualization

    • This can help you check the validity of your approach

  • Model selection

  • Result validation

    • Reduce bias and confounding

    • Split your data into training and testing (validation) sets, or validate against a different data source

  • Result interpretation

    • Clinical and epidemiological domain experts should work together carefully at this stage.
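As an illustration of the data cleaning points in the list above, the following minimal sketch reshapes a small table into tidy format and then flags suspicious height changes for review instead of deleting them. The field names (`patient_id`, `height_cm_2019`), the sample values, and the 5 cm threshold are all hypothetical and chosen only for this example; they do not come from Wu et al.

```python
from collections import defaultdict

# Hypothetical wide-format records: one row per patient, one column per year.
wide_rows = [
    {"patient_id": "p1", "height_cm_2019": 170.0, "height_cm_2020": 169.0},
    {"patient_id": "p2", "height_cm_2019": 150.0, "height_cm_2020": 158.0},
]


def to_tidy(rows):
    """Reshape to one row per (patient, year) observation, one value per cell."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("height_cm_"):
                year = int(key.rsplit("_", 1)[1])
                tidy.append(
                    {"patient_id": row["patient_id"], "year": year, "height_cm": value}
                )
    return tidy


def flag_height_changes(tidy, max_increase_cm=5.0):
    """Add a 'review' flag to each observation instead of deleting rows.

    A decrease, or an increase above max_increase_cm (an arbitrary,
    illustrative threshold), is marked for review rather than treated
    as an error outright.
    """
    by_patient = defaultdict(list)
    for obs in tidy:
        by_patient[obs["patient_id"]].append(obs)

    flagged = []
    for observations in by_patient.values():
        observations.sort(key=lambda o: o["year"])
        previous = None
        for obs in observations:
            change = None if previous is None else obs["height_cm"] - previous["height_cm"]
            review = change is not None and (change < 0 or change > max_increase_cm)
            flagged.append({**obs, "review": review})
            previous = obs
    return flagged


tidy_rows = to_tidy(wide_rows)
flagged_rows = flag_height_changes(tidy_rows)
```

Every observation is kept, so a reviewer can later decide whether a flagged value is a data entry error or a genuine clinical change.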

Over the course of research, the exact steps might need to be updated frequently. Changes should be documented and tested individually.
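One way to keep a step individually testable is to write it as a small function with its own check. The sketch below does this for the "combine overlapping time intervals" task from the data preparation step. The function name `merge_overlapping`, the sample dates, and the decision to treat intervals that share an endpoint as a single interval are assumptions made for illustration.

```python
from datetime import date


def merge_overlapping(intervals):
    """Merge overlapping or touching (start, end) date intervals for one patient.

    Treating intervals that share an endpoint as one interval is an
    assumption of this sketch; a project may define it differently.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend its end if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical admission intervals for a single patient.
stays = [
    (date(2020, 1, 10), date(2020, 1, 15)),
    (date(2020, 1, 1), date(2020, 1, 5)),
    (date(2020, 1, 4), date(2020, 1, 8)),
]
merged_stays = merge_overlapping(stays)

# A small check that documents the intended behaviour of this step.
assert merged_stays == [
    (date(2020, 1, 1), date(2020, 1, 8)),
    (date(2020, 1, 10), date(2020, 1, 15)),
]
```

If the definition of an overlap changes later in the project, the check fails and makes the change visible, which supports documenting and testing each change on its own.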

[1] Hulin Wu et al. (eds.). Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics. CRC Press, 2021. ISBN 978-0-367-44239-2.

Key Readings