The Health Insurance Portability and Accountability Act (HIPAA), Public Law 104-191, enacted on August 21, 1996, protects the privacy of "protected health information" (PHI) [1]. There are 18 elements of PHI as defined by HIPAA:
Names;
All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000;
All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
Phone numbers;
Fax numbers;
Electronic mail addresses;
Social Security numbers;
Medical record numbers;
Health plan beneficiary numbers;
Account numbers;
Certificate/license numbers;
Vehicle identifiers and serial numbers, including license plate numbers;
Device identifiers and serial numbers;
Web Universal Resource Locators (URLs);
Internet Protocol (IP) address numbers;
Biometric identifiers, including finger and voice prints;
Full face photographic images and any comparable images; and
Any other unique identifying number, characteristic, or code, except a code to permit re-identification of the de-identified data by the Honest Broker. (Note: this does not include the unique code assigned by an investigator to code the data.)
Additional standards and criteria protect individuals from re-identification. Any code used to replace the identifiers in a data set cannot be derived from information related to the individual, and neither the master code list nor the method used to derive the codes may be disclosed. For example, a subject’s initials cannot be used to code their data because the initials are derived from their name. Additionally, the researcher must not have actual knowledge that the research subject could be re-identified from the information remaining in the data used in the research study. In other words, the information would still be considered identifiable if there were a way to identify the individual even after all 18 identifiers had been removed.
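As a rough illustration of a code that is not derived from any information about the individual, the sketch below assigns each record a random token. The function name `assign_study_codes`, the `MRN-...` record IDs, and the token length are hypothetical choices for this example, not part of HIPAA or of any institutional procedure.

```python
import secrets


def assign_study_codes(record_ids, n_bytes=8):
    """Map each source record ID to a random code unrelated to the person."""
    codes = {}
    used = set()
    for rid in record_ids:
        code = secrets.token_hex(n_bytes)
        while code in used:  # collisions are very unlikely; retry to guarantee uniqueness
            code = secrets.token_hex(n_bytes)
        used.add(code)
        codes[rid] = code
    return codes


# Hypothetical record IDs; the resulting mapping is the master code list.
master_key = assign_study_codes(["MRN-001", "MRN-002", "MRN-003"])
```

Because the tokens come from a random source rather than from names, dates, or record numbers, nothing about the individual can be recovered from the code itself. The mapping in `master_key` would need to be stored separately from the research data and not disclosed.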
For medical record information to be considered a de-identified “Safe Harbor” dataset, HIPAA requires that each of the 18 identifiers, whether of the individual or of the individual’s relatives, employers, or household members, be removed.
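The zip code and date rules among the 18 identifiers are generalization rules rather than simple deletions, so a small sketch may help. The example below assumes string zip codes and `datetime.date` values. The function names are hypothetical, and the `RESTRICTED_ZIP3` set is an illustrative placeholder that would have to be built from current Census Bureau data.

```python
from datetime import date

# Illustrative placeholder only: the real list of low-population three-digit
# zip prefixes must be built from current Census Bureau data.
RESTRICTED_ZIP3 = {"036", "059"}


def generalize_zip(zip_code, restricted=RESTRICTED_ZIP3):
    """Keep the first three zip digits, or return "000" for restricted prefixes."""
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in restricted else zip3


def age_on(birth, as_of):
    """Age in completed years on the date `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)


def generalize_age(age):
    """Collapse all ages over 89 into a single "90+" category."""
    return "90+" if age > 89 else age


def generalize_birth_date(birth, as_of):
    """Reduce a birth date to its year; drop the year too if it implies an age over 89."""
    return None if age_on(birth, as_of) > 89 else birth.year
```

Under these assumptions, `generalize_zip("15213")` returns `"152"`, and `generalize_birth_date(date(1930, 5, 1), date(2024, 1, 1))` returns `None` because the year alone would indicate an age over 89. This covers only two of the 18 identifiers; the remaining ones still have to be removed.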
A dataset can also be de-identified by “expert determination.” The expert must have professional, academic, or other formal training and experience in using health information de-identification methodologies. The expert must determine that the risk of re-identification is “very small” when the anticipated recipients use the information alone or in combination with other reasonably available information.
To qualify as a Limited Dataset, HIPAA requires that each of the following identifiers of the individual or of relatives, employers, or household members of the individual be removed from the data:
Names
Postal address information, other than town or city, State, and zip code
Telephone numbers
Fax numbers
Electronic mail addresses
Social Security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers and serial numbers; license plate numbers
Device identifiers and serial numbers
Web Universal Resource Locators (URLs)
Internet Protocol (IP) address numbers
Biometric identifiers
Full face photographic images and any comparable images
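The Limited Dataset list above can be applied as a column filter. The sketch below is a minimal example assuming each record is a Python dict. The column names in `LDS_DIRECT_IDENTIFIERS` and the function `to_limited_dataset` are hypothetical and would need to be mapped to the actual fields of a given extract.

```python
# Hypothetical column names, one per identifier category in the list above.
LDS_DIRECT_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_id", "account_number", "license_number", "vehicle_id",
    "device_id", "url", "ip_address", "biometric_id", "face_photo",
}


def to_limited_dataset(rows):
    """Return copies of the rows with the listed identifier columns dropped."""
    return [
        {key: value for key, value in row.items() if key not in LDS_DIRECT_IDENTIFIERS}
        for row in rows
    ]


# Hypothetical record for illustration.
records = [{
    "name": "Jane Doe", "mrn": "12345", "street_address": "1 Main St",
    "city": "Pittsburgh", "state": "PA", "zip": "15213",
    "admit_date": "2023-04-02", "diagnosis": "E11.9",
}]
limited = to_limited_dataset(records)
```

Fields that are not on the list, such as city, state, zip code, and dates, pass through unchanged. A filter by column name does not catch identifiers embedded in free-text fields, so those would need separate review.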
[1] U.S. Department of Health and Human Services. Health Information Privacy. [Accessed 2024-11-12]. Available from: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
[2] Institute of Medicine (US) Committee on Health Research and the Privacy of Health Information: The HIPAA Privacy Rule; Nass SJ, Levit LA, Gostin LO, editors. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. Washington (DC): National Academies Press (US); 2009. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9578/ doi: 10.17226/12458
"Clinicians and data scientists must apply the same level of academic rigor when analyzing research from clinical databases as they do with more traditional methods of clinical research." [1]
Conducting research with data from the Electronic Health Record (EHR) requires a structured process and a team science approach. The process should include protocols and standards for requesting or extracting the data, assessing the quality of the data, cleaning, standardizing, and analyzing the data, and maintaining security to ensure the confidentiality of the data. A multi-disciplinary team capable of overseeing and performing each specialized step of the process is essential.
-> Research questions, cohorts, and methodologies must be clearly defined using standard clinical definitions and terminologies. Attention must be paid to the timing of exposure and outcome events.
-> Understand the limitations of EHR data. Data may be incomplete, and the quality of the data can vary across sources. Use robust data management strategies to ensure "clean" data and properly handle missing values.
-> Observational studies are susceptible to confounding due to non-random assignment of treatments or exposures. Adjust for confounding factors and be aware of bias that may be inherent in the data.
-> Practice Open Science! Ensure transparency in the study design, data processing, and analysis to promote reproducibility. For NIH-funded studies, comply with the 2023 NIH Data Management and Sharing Policy.
-> Practice Team Science! Collaborate with clinicians, biomedical informaticians, biostatisticians, and data scientists to ensure the study is both methodologically sound and clinically meaningful.
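As one small example of the data quality checks mentioned above, the sketch below reports the fraction of missing values per field before any analysis is run. It assumes records are Python dicts and treats `None`, an empty string, or an absent key as missing. The function name `missingness_report` and the field names are hypothetical.

```python
def missingness_report(rows, fields):
    """Return the fraction of rows in which each field is missing."""
    n = len(rows)
    report = {}
    for field in fields:
        # Treat None, empty string, or an absent key as missing.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        report[field] = missing / n if n else 0.0
    return report


# Hypothetical lab and vitals extract for illustration.
rows = [
    {"a1c": 7.1, "sbp": None},
    {"a1c": None, "sbp": 120},
    {"a1c": 6.5},
    {"a1c": "", "sbp": 135},
]
report = missingness_report(rows, ["a1c", "sbp"])
```

A report like this only describes how much is missing. Deciding how to handle the gaps (for example, complete-case analysis or imputation) depends on why the values are missing and is a question for the study team.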
"There may come a time when data can be aggregated automatically from multiple EHR environments to answer a particular question without relying on a human to understand the particular idiosyncrasies of each institution’s data and EHR system. Until that day, effective EHR data set analysis requires collaboration with clinicians and scientists who have knowledge of the diseases being studied and the practices of their particular health care systems; informaticians with experience in the underlying structures of biomedical record repositories at their own institutions and the characteristics of their data; data harmonization experts to help with data transformation, standardization, integration, and computability; statisticians and epidemiologists well versed in the limitations and opportunities of EHR data sets and related sources of potential bias; machine learning experts; and at least one expert in regulatory and ethical standards." [2]
[1] Lokhandwala S, Rush B. Objectives of the Secondary Analysis of Electronic Health Record Data. 2016 Sep 10. In: Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Chapter 1. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543655/ doi: 10.1007/978-3-319-43742-2_1
[2] Kohane IS, Aronow BJ, Avillach P, Beaulieu-Jones BK, Bellazzi R, Bradford RL, Brat GA, Cannataro M, Cimino JJ, García-Barrio N, Gehlenborg N, Ghassemi M, Gutiérrez-Sacristán A, Hanauer DA, Holmes JH, Hong C, Klann JG, Loh NHW, Luo Y, Mandl KD, Daniar M, Moore JH, Murphy SN, Neuraz A, Ngiam KY, Omenn GS, Palmer N, Patel LP, Pedrera-Jiménez M, Sliz P, South AM, Tan ALM, Taylor DM, Taylor BW, Torti C, Vallejos AK, Wagholikar KB; Consortium For Clinical Characterization Of COVID-19 By EHR (4CE); Weber GM, Cai T. What Every Reader Should Know About Studies Using Electronic Health Record Data but May Be Afraid to Ask. J Med Internet Res. 2021 Mar 2;23(3):e22219. doi: 10.2196/22219. PMID: 33600347; PMCID: PMC7927948.
Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543630/ doi: 10.1007/978-3-319-43742-2
Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018 Apr 30;361:k1479. doi: 10.1136/bmj.k1479. Erratum in: BMJ. 2018 Oct 18;363:k4416. doi: 10.1136/bmj.k4416. PMID: 29712648; PMCID: PMC5925441.
Callahan A, Shah NH, Chen JH. Research and Reporting Considerations for Observational Studies Using Electronic Health Record Data. Ann Intern Med. 2020 Jun 2;172(11 Suppl):S79-S84. doi: 10.7326/M19-0873. PMID: 32479175; PMCID: PMC7413106.
Shang N, Weng C, Hripcsak G. A conceptual framework for evaluating data suitability for observational studies. J Am Med Inform Assoc. 2018 Mar 1;25(3):248-258. doi: 10.1093/jamia/ocx095. PMID: 29024976; PMCID: PMC7378879.