Below is an example pipeline for conducting research with health data, such as EHR data. The list is by no means exhaustive, but it is a good place to start. A page could be written on each step in the pipeline, and most likely will be in future releases of CODIAC for Health.
Conduct a literature review.
Explicitly describe the research question.
Form an interdisciplinary team that can guide and perform each step of the study.
Fully specify the research protocol in advance of executing the study.
Apply for IRB approval of the study.
Apply for an Institutional Reliance Agreement, if necessary.
Execute a Data (Transfer and) Use Agreement (DUA, DTUA), as required.
Comply with any application and approval procedures set forth by the data provider.
Request access to / Set up computing infrastructure, as necessary.
Assess the suitability (strengths and weaknesses) of the dataset(s) to be used in the study.
Assess the quality of the dataset(s).
Define the study cohort (and matching cases, if applicable).
Create standard code sets for each clinical concept in the cohort definition and every independent and dependent variable.
Compose a computable data request / data extraction specification.
Clean and stage extracted data for analysis; handle missing values according to protocol.
Characterize the study cohort (and matching cases, if applicable).
Adjust for any bias or confounders in the data.
Analyze the data according to protocol.
Produce research products.
Comply with any review procedures required by the data provider.
Publish your work!
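As an illustration of two of the steps above (creating a code set for a clinical concept and defining the study cohort), here is a minimal Python sketch. The code set, the toy diagnosis rows, and the function name define_cohort are assumptions made for this example; a real study would take its code sets from a reviewed standard vocabulary and its cohort logic from the approved protocol.

```python
from datetime import date

# Hypothetical code set for one clinical concept (type 2 diabetes, ICD-10-CM).
# A real code set would be longer and reviewed by clinical experts.
T2DM_CODES = {"E11.9", "E11.65", "E11.21"}

# Toy diagnosis rows: (patient_id, code, diagnosis_date). Fabricated for illustration.
DIAGNOSES = [
    ("p1", "E11.9", date(2021, 5, 1)),
    ("p2", "I10", date(2021, 6, 3)),
    ("p3", "E11.65", date(2019, 2, 10)),
    ("p1", "E11.21", date(2022, 1, 15)),
]


def define_cohort(diagnoses, code_set, start, end):
    """Return {patient_id: index_date}, where the index date is the earliest
    diagnosis in the code set that falls within [start, end]."""
    cohort = {}
    for patient_id, code, dx_date in diagnoses:
        if code in code_set and start <= dx_date <= end:
            if patient_id not in cohort or dx_date < cohort[patient_id]:
                cohort[patient_id] = dx_date
    return cohort


cohort = define_cohort(DIAGNOSES, T2DM_CODES, date(2020, 1, 1), date(2022, 12, 31))
print(cohort)
```

Here only p1 qualifies: p2 has no code in the set, and p3's only qualifying diagnosis falls before the study window. Keeping the code set and the date window as explicit inputs makes the cohort definition easy to review and reuse.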
Shang N, Weng C, Hripcsak G. A conceptual framework for evaluating data suitability for observational studies. J Am Med Inform Assoc. 2018 Mar 1;25(3):248-258. doi: 10.1093/jamia/ocx095. PMID: 29024976; PMCID: PMC7378879.
Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543630/ doi: 10.1007/978-3-319-43742-2
O’Neil ST, Beasley W, Loomba J, Patrick S, Wilkins KJ, Crowley KM., Anzalone, AJ (Eds.) (2023). The Researcher’s Guide to N3C: A National Resource for Analyzing Real-World Health Data. DOI: 10.5281/zenodo.7749367
The All of Us Research Program seeks to accelerate health research and medical breakthroughs by collecting and analyzing data from one million or more individuals living in the United States. The program was designed to engage historically underrepresented groups in biomedical research, with an emphasis on community outreach and transparency in research.
Research participants self-selectively enroll and share their data with the program. Data may include electronic health records (EHR), data from wearable devices, physical measurements, surveys, and whole genome sequencing and genotyping arrays. The data are curated and made available within a secure data repository. [1]
Anyone may visit the Research Hub to learn about the data and explore aggregated participant data with the Data Browser. Researchers may register for access to the Researcher Workbench and conduct research within the secure environment.
Brown University researchers interested in accessing the All of Us Researcher Workbench should review the Brown University Library website for more information.
Established in 2004, Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing (NCBC) based in Boston, MA. i2b2 seeks to enable clinical researchers to accelerate the translation of genomic and clinical findings into novel diagnostics, prognostics, and therapeutics through the creation of software and a methodological framework. [2]
The i2b2 tranSMART Foundation is "an Open-Source Community Enabling collaboration for precision medicine, through sharing, integration, standardization, and analysis of heterogeneous data from healthcare and research." [3]
Led by the i2b2 international academics users group, the Consortium for Clinical Characterization of COVID-19 by EHR (4CE) is an international consortium for electronic health record (EHR) data-driven studies of the COVID-19 pandemic. 4CE seeks to inform doctors, epidemiologists, and the public about COVID-19 treatments and patient outcomes. [4]
The National Clinical Cohort Collaborative (N3C) has four enclaves (as of 10/18/2024): N3C COVID, N3C Education, N3C Cancer, and N3C Renal. The Education Enclave provides simulated datasets for researchers to develop and practice the skills needed to analyze real-world data. The Cancer and Renal Enclaves are part of a broader feasibility testing initiative being done to refine the overall governance, data linkage, and institutional partnership components of N3C. These domain-specific enclaves are governed by their data contributors.
"The National COVID Cohort Collaborative (N3C) is a collaboration among the NCATS-supported Clinical and Translational Science Awards (CTSA) Program hubs, distributed clinical data networks (PCORnet, OHDSI, ACT, TriNetX), and other partner organizations, with overall stewardship by NIH’s National Center for Advancing Translational Sciences (NCATS). The N3C aims to improve the efficiency and accessibility of analyses with COVID-19 clinical data, expand our ability to analyze and understand COVID, and demonstrate a novel approach for collaborative data sharing." [5]
Founded in 2014, Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”) is a multi-stakeholder, interdisciplinary, open-science collaborative. [6] OHDSI's mission is "To improve health by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care." [7] OHDSI produces large-scale, real-world analytics through an international network of researchers and observational health databases and a central coordinating center at Columbia University. [8]
For more information about OHDSI, refer to OHDSI's website and the Book of OHDSI.
The Patient-Centered Outcomes Research Institute (PCORI®) is an independent, non-profit research funding organization that funds comparative clinical effectiveness research (CER). [9]
For more information about PCORI, please visit their website.
The National Patient-Centered Clinical Research Network, PCORnet®, is a distributed network of organizations and standardized health data for patient-centered health research, particularly CER. PCORnet® was developed with funding from the Patient-Centered Outcomes Research Institute® (PCORI®). [10]
For more information, please see their website.
TriNetX is a global health research network that enables researchers to perform large-scale observational studies with real-world data. Data from electronic health records (EHRs) are aggregated, anonymized, and made available through a secure analytics platform. Through TriNetX, researchers can assess study feasibility and identify eligible patient cohorts. TriNetX is used by pharmaceutical companies and by academic and clinical research institutions to perform cohort analyses and to conduct comparative effectiveness and epidemiological studies. [11]
For more information, visit their website.
All of Us. [Accessed 2024-11-11]; https://allofus.nih.gov/
i2b2. [Accessed 2024-11-11]; https://www.i2b2.org/about/index.html
i2b2 tranSMART Foundation. [Accessed 2024-11-11]; https://i2b2transmart.org/
4CE. [Accessed 2024-11-11]; https://covidclinical.net/index.html
N3C. (n.d.). Covid About. [Accessed 2024-10-07]; https://covid.cd2h.org/about/
OHDSI Who We Are. [Accessed 2024-09-30]; https://ohdsi.org/who-we-are/
OHDSI Mission, Vision & Values. [Accessed 2024-09-30]; https://ohdsi.org/who-we-are/mission-vision-values/
Hripcsak G, Duke JD, Shah NH, Reich CG, Huser V, Schuemie MJ, Suchard MA, Park RW, Wong IC, Rijnbeek PR, van der Lei J, Pratt N, Norén GN, Li YC, Stang PE, Madigan D, Ryan PB. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574-8. PMID: 26262116; PMCID: PMC4815923.
PCORI. [Accessed 2024-11-11]; https://www.pcori.org/
PCORnet. [Accessed 2024-11-11]; https://pcornet.org/
TriNetX. [Accessed 2024-11-11]; https://trinetx.com/#signe
O’Neil ST, Beasley W, Loomba J, Patrick S, Wilkins KJ, Crowley KM., Anzalone, AJ (Eds.) (2023). The Researcher’s Guide to N3C: A National Resource for Analyzing Real-World Health Data. DOI: 10.5281/zenodo.7749367
All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature. 2024 Mar;627(8003):340-346. doi: 10.1038/s41586-023-06957-x. Epub 2024 Feb 19. PMID: 38374255; PMCID: PMC10937371.
Cronin RM, Jerome RN, Mapes B, Andrade R, Johnston R, Ayala J, Schlundt D, Bonnet K, Kripalani S, Goggins K, Wallston KA, Couper MP, Elliott MR, Harris P, Begale M, Munoz F, Lopez-Class M, Cella D, Condon D, AuYoung M, Mazor KM, Mikita S, Manganiello M, Borselli N, Fowler S, Rutter JL, Denny JC, Karlson EW, Ahmedani BK, O'Donnell CJ; Vanderbilt University Medical Center Pilot Team, and the Participant Provided Information Committee. Development of the Initial Surveys for the All of Us Research Program. Epidemiology. 2019 Jul;30(4):597-608. doi: 10.1097/EDE.0000000000001028. PMID: 31045611; PMCID: PMC6548672.
Turner SP, Pompea ST, Williams KL, Kraemer DA Jr, Sholle ET, Chen C, Cole CL, Kaushal R, Campion TR Jr. Implementation of Informatics to Support the NIH All of Us Research Program in a Healthcare Provider Organization. AMIA Jt Summits Transl Sci Proc. 2019 May 6;2019:602-609. PMID: 31259015; PMCID: PMC6568061.
Doerr M, Grayson S, Moore S, Suver C, Wilbanks J, Wagner J. Implementing a universal informed consent process for the All of Us Research Program. Pac Symp Biocomput. 2019;24:427-438. PMID: 30963079; PMCID: PMC6417826.
Haendel MA, Chute CG, Bennett TD, Eichmann DA, Guinney J, Kibbe WA, Payne PRO, Pfaff ER, Robinson PN, Saltz JH, Spratt H, Suver C, Wilbanks J, Wilcox AB, Williams AE, Wu C, Blacketer C, Bradford RL, Cimino JJ, Clark M, Colmenares EW, Francis PA, Gabriel D, Graves A, Hemadri R, Hong SS, Hripscak G, Jiao D, Klann JG, Kostka K, Lee AM, Lehmann HP, Lingrey L, Miller RT, Morris M, Murphy SN, Natarajan K, Palchuk MB, Sheikh U, Solbrig H, Visweswaran S, Walden A, Walters KM, Weber GM, Zhang XT, Zhu RL, Amor B, Girvin AT, Manna A, Qureshi N, Kurilla MG, Michael SG, Portilla LM, Rutter JL, Austin CP, Gersing KR; N3C Consortium. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. J Am Med Inform Assoc. 2021 Mar 1;28(3):427-443. doi: 10.1093/jamia/ocaa196. PMID: 32805036; PMCID: PMC7454687.
Suver C, Harper J, Loomba J, Saltz M, Solway J, Anzalone AJ, Walters K, Pfaff E, Walden A, McMurry J, Chute CG, Haendel M. The N3C governance ecosystem: A model socio-technical partnership for the future of collaborative analytics at scale. J Clin Transl Sci. 2023 Nov 14;7(1):e252. doi: 10.1017/cts.2023.681. PMID: 38229902; PMCID: PMC10789985.
Hripcsak G, Schuemie MJ, Madigan D, Ryan PB, Suchard MA. Drawing Reproducible Conclusions from Observational Clinical Data with OHDSI. Yearb Med Inform. 2021 Aug;30(1):283-289. doi: 10.1055/s-0041-1726481. Epub 2021 Apr 21. PMID: 33882595; PMCID: PMC8416226.
Reich C, Ostropolets A, Ryan P, Rijnbeek P, Schuemie M, Davydov A, Dymshyts D, Hripcsak G. OHDSI Standardized Vocabularies-a large-scale centralized reference ontology for international data harmonization. J Am Med Inform Assoc. 2024 Feb 16;31(3):583-590. doi: 10.1093/jamia/ocad247. PMID: 38175665; PMCID: PMC10873827.
There was a time when researchers had direct access to their home hospital's electronic health record (EHR). Data were only available to researchers for hospital(s) that were affiliated with academic institutions that employed them. In many cases, EHR data and their associated databases could only be accessed onsite. Often, data from ancillary services outside of the hospital, such as pharmacy and laboratory data, were inaccessible. If the needed data belonged to a hospital information system module that had not been created yet, they were considered inaccessible and required manual chart review.
Over the years, researchers built systems and modules to enhance the collection of data for patient care, and eventually to fuel additional "secondary data" research. Nationally, biomedical informatics visionaries pointed the way and guided us in how to make research dreams into realities. There were many successes, and as many failures, in pursuing the vision of researchers having untethered access to EHR and other health data. The eventual result of these trials and tribulations is a new world of health system-wide data warehouses, secure research enclaves, federated research networks, statewide health information exchanges, all-payer claims databases, and national health data repositories. The term “big data” seems diminutive now that we can leverage trillions of points of health data for our research across the nation [].
Observational Health Research has come of age in terms of data and the tools to conduct research. There is still much work to be done in terms of structures, policies, and procedures. As CODIAC for Health grows and evolves, this chapter will be enhanced to serve as your go-to for the "whys," "whats," and "hows" of it all.
Butte A. . TEDxSanFrancisco. 2017.
"Clinicians and data scientists must apply the same level of academic rigor when analyzing research from clinical databases as they do with more traditional methods of clinical research." []
Conducting research with data from the Electronic Health Record (EHR) requires a structured process and a team science approach. The process should include protocols and standards for requesting or extracting the data, assessing the quality of the data, cleaning, standardizing, and analyzing the data, and maintaining security to ensure the confidentiality of the data. A multi-disciplinary team capable of overseeing and performing each specialized step of the process is essential.
-> Research questions, cohorts, and methodologies must be clearly defined using standard clinical definitions and terminologies. Attention must be paid to the timing of exposure and outcome events.
-> Understand the limitations of EHR data. Data may be incomplete, and the quality of the data can vary across sources. Use robust data management strategies to ensure "clean" data and properly handle missing values.
-> Observational studies are susceptible to confounding due to non-random assignment of treatments or exposures. Adjust for confounding factors and be aware of bias that may be inherent in the data.
-> Practice Open Science! Ensure transparency in the study design, data processing, and analysis to promote reproducibility. For NIH-funded studies, comply with the .
-> Practice Team Science! Collaborate with clinicians, biomedical informaticians, biostatisticians, and data scientists to ensure the study is both methodologically sound and clinically meaningful.
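To make the point about confounding concrete, the sketch below compares a crude risk ratio with a Mantel-Haenszel risk ratio that is stratified on a single confounder. Stratification is only one of several adjustment techniques and is chosen here for brevity. The counts are fabricated toy numbers, and the stratum labels and function names are assumptions made for this example.

```python
# Toy counts per stratum of one confounder (fabricated for illustration):
# (exposed_cases, exposed_total, unexposed_cases, unexposed_total)
STRATA = {
    "lower_risk": (10, 100, 40, 400),
    "higher_risk": (160, 400, 40, 100),
}


def crude_risk_ratio(strata):
    """Risk ratio computed after pooling all strata (ignores the confounder)."""
    exposed_cases = sum(s[0] for s in strata.values())
    exposed_total = sum(s[1] for s in strata.values())
    unexposed_cases = sum(s[2] for s in strata.values())
    unexposed_total = sum(s[3] for s in strata.values())
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel summary risk ratio across the strata."""
    numerator = 0.0
    denominator = 0.0
    for exposed_cases, exposed_total, unexposed_cases, unexposed_total in strata.values():
        stratum_total = exposed_total + unexposed_total
        numerator += exposed_cases * unexposed_total / stratum_total
        denominator += unexposed_cases * exposed_total / stratum_total
    return numerator / denominator


crude = crude_risk_ratio(STRATA)
adjusted = mantel_haenszel_risk_ratio(STRATA)
print(f"crude RR = {crude:.3f}, stratified RR = {adjusted:.3f}")
```

In this toy data the exposed group is concentrated in the higher-risk stratum, so the crude risk ratio is about 2.1 even though the risk is identical for exposed and unexposed patients within each stratum, which gives a stratified risk ratio of 1.0.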
"There may come a time when data can be aggregated automatically from multiple EHR environments to answer a particular question without relying on a human to understand the particular idiosyncrasies of each institution’s data and EHR system. Until that day, effective EHR data set analysis requires collaboration with clinicians and scientists who have knowledge of the diseases being studied and the practices of their particular health care systems; informaticians with experience in the underlying structures of biomedical record repositories at their own institutions and the characteristics of their data; data harmonization experts to help with data transformation, standardization, integration, and computability; statisticians and epidemiologists well versed in the limitations and opportunities of EHR data sets and related sources of potential bias; machine learning experts; and at least one expert in regulatory and ethical standards." []
Lokhandwala S, Rush B. Objectives of the Secondary Analysis of Electronic Health Record Data. 2016 Sep 10. In: Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Chapter 1. Available from: / doi: 10.1007/978-3-319-43742-2_1
Kohane IS, Aronow BJ, Avillach P, Beaulieu-Jones BK, Bellazzi R, Bradford RL, Brat GA, Cannataro M, Cimino JJ, García-Barrio N, Gehlenborg N, Ghassemi M, Gutiérrez-Sacristán A, Hanauer DA, Holmes JH, Hong C, Klann JG, Loh NHW, Luo Y, Mandl KD, Daniar M, Moore JH, Murphy SN, Neuraz A, Ngiam KY, Omenn GS, Palmer N, Patel LP, Pedrera-Jiménez M, Sliz P, South AM, Tan ALM, Taylor DM, Taylor BW, Torti C, Vallejos AK, Wagholikar KB; Consortium For Clinical Characterization Of COVID-19 By EHR (4CE); Weber GM, Cai T. What Every Reader Should Know About Studies Using Electronic Health Record Data but May Be Afraid to Ask. J Med Internet Res. 2021 Mar 2;23(3):e22219. doi: 10.2196/22219. PMID: ; PMCID: PMC7927948.
Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543630/ doi: 10.1007/978-3-319-43742-2
Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018 Apr 30;361:k1479. doi: 10.1136/bmj.k1479. Erratum in: BMJ. 2018 Oct 18;363:k4416. doi: 10.1136/bmj.k4416. PMID: ; PMCID: PMC5925441.
Callahan A, Shah NH, Chen JH. Research and Reporting Considerations for Observational Studies Using Electronic Health Record Data. Ann Intern Med. 2020 Jun 2;172(11 Suppl):S79-S84. doi: 10.7326/M19-0873. PMID: ; PMCID: PMC7413106.
Shang N, Weng C, Hripcsak G. A conceptual framework for evaluating data suitability for observational studies. J Am Med Inform Assoc. 2018 Mar 1;25(3):248-258. doi: 10.1093/jamia/ocx095. PMID: 29024976; PMCID: PMC7378879.
The Health Insurance Portability and Accountability Act (HIPAA), Public Law 104-191, enacted on August 21, 1996, protects the privacy of "protected health information" (PHI) [1]. There are 18 elements of PHI as defined by HIPAA:
Names;
All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000;
All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
Phone numbers;
Fax numbers;
Electronic mail addresses;
Social Security numbers;
Medical record numbers;
Health plan beneficiary numbers;
Account numbers;
Certificate/license numbers;
Vehicle identifiers and serial numbers, including license plate numbers;
Device identifiers and serial numbers;
Web Universal Resource Locators (URLs);
Internet Protocol (IP) address numbers;
Biometric identifiers, including finger and voice prints;
Full face photographic images and any comparable images; and
Any other unique identifying number, characteristic, or code, except a code to permit re-identification of the de-identified data by the Honest Broker. (Note: this does not include the unique code assigned by an investigator to code the data.)
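Three of the elements above (geographic subdivisions, dates, and ages) describe how a value may be generalized rather than simply removed. The Python sketch below illustrates those three rules only. The function names are assumptions made for this example, and the set of sparse ZIP prefixes is a placeholder, not actual Census data; a real implementation would build that set from current Bureau of the Census figures and would also need to handle dates that reveal an age over 89.

```python
from datetime import date

# Placeholder 3-digit ZIP prefixes standing in for areas with 20,000 or fewer
# people. These values are NOT real Census results.
SPARSE_ZIP3 = {"998", "999"}


def generalize_zip(zip_code, sparse_prefixes):
    """Keep only the first three digits, or '000' for sparsely populated areas."""
    prefix = zip_code[:3]
    return "000" if prefix in sparse_prefixes else prefix


def generalize_date(d):
    """Keep only the year of a date directly related to an individual."""
    return d.year


def generalize_age(age):
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)


print(generalize_zip("02912", SPARSE_ZIP3))
print(generalize_zip("99950", SPARSE_ZIP3))
print(generalize_date(date(2020, 3, 14)))
print(generalize_age(93))
```

Passing the sparse-prefix set in as a parameter keeps the population threshold data separate from the logic, so it can be refreshed when new Census data are published.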
There are also additional standards and criteria to protect individuals from re-identification. Any code used to replace the identifiers in a dataset cannot be derived from any information related to the individual, and neither the master codes nor the method used to derive the codes may be disclosed. For example, a subject’s initials cannot be used to code their data because the initials are derived from their name. Additionally, the researcher must not have actual knowledge that the research subject could be re-identified from the remaining identifiers in the PHI used in the research study. In other words, the information would still be considered identifiable if there was a way to identify the individual even though all of the 18 identifiers were removed.
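One way to satisfy the requirement that codes not be derived from information about the individual is to generate them randomly. The sketch below is a minimal illustration using Python's standard secrets module; the function name assign_study_codes and the sample identifiers are assumptions made for this example, and it does not address how the master key should be stored or who may hold it.

```python
import secrets


def assign_study_codes(patient_ids):
    """Map each patient identifier to a random code that carries no
    information about the patient (unlike initials or a hash of the name)."""
    master_key = {}
    used_codes = set()
    for patient_id in patient_ids:
        if patient_id in master_key:
            continue  # keep one stable code per patient
        code = secrets.token_hex(8)  # 16 random hexadecimal characters
        while code in used_codes:  # regenerate on the very unlikely collision
            code = secrets.token_hex(8)
        used_codes.add(code)
        master_key[patient_id] = code
    return master_key


# Fabricated identifiers for illustration only.
master_key = assign_study_codes(["MRN-0001", "MRN-0002", "MRN-0003", "MRN-0001"])
print(len(master_key))
```

The returned mapping plays the role of the master codes described above, so it would be held separately from the coded research dataset and not disclosed.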
HIPAA requires that each of the 18 PHI identifiers of the individual or of relatives, employers, or household members of the individual must be removed from medical record information in order for the records to be considered a de-identified “Safe Harbor” dataset.
A dataset can also be de-identified by “expert-determination.” The expert must have professional, academic, or other formal training and experience in using health information de-identification methodologies. The expert may determine that the risk of data re-identification is “very small” when the anticipated recipients use it alone or in combination with other reasonably available information.
To qualify as a Limited Dataset, HIPAA requires that each of the following identifiers of the individual or of relatives, employers, or household members of the individual must be removed from the data.
Names
Postal address information, other than town or city, State, and zip code
Telephone numbers
FAX numbers
Electronic mail addresses
Social security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers and serial numbers; license plate numbers
Device identifiers and serial numbers
Web Universal Resource Locators (URLs)
Internet Protocol (IP) address numbers
Biometric identifiers
Full face photographic images and any comparable images
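As a rough sketch of how the list above might be applied to a single record, the Python example below drops the excluded direct identifiers and keeps everything else, including city, State, zip code, and dates. The field names and the sample record are hypothetical, since every source system names its columns differently, and the sketch checks field names only: it does not detect identifiers embedded in free text or identifiers of relatives, employers, or household members stored under other names.

```python
# Hypothetical field names for the direct identifiers listed above.
EXCLUDED_FIELDS = {
    "name", "street_address", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_id", "account_number", "license_number", "vehicle_id",
    "device_id", "url", "ip_address", "biometric_id", "face_photo",
}


def to_limited_dataset(record):
    """Return a copy of the record without the excluded identifier fields."""
    return {field: value for field, value in record.items()
            if field not in EXCLUDED_FIELDS}


# Fabricated record for illustration only.
record = {
    "name": "Jane Example",
    "street_address": "1 Example Street",
    "city": "Providence",
    "state": "RI",
    "zip_code": "02912",
    "mrn": "MRN-0001",
    "admission_date": "2021-05-01",
    "diagnosis_code": "E11.9",
}

limited = to_limited_dataset(record)
print(sorted(limited))
```

Using an explicit exclusion set keeps the rule auditable: a reviewer can compare the set against the list of identifiers line by line.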
U.S. Department of Health and Human Services, Health Information Privacy. [Accessed 2024-11-12]; https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
Institute of Medicine (US) Committee on Health Research and the Privacy of Health Information: The HIPAA Privacy Rule; Nass SJ, Levit LA, Gostin LO, editors. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. Washington (DC): National Academies Press (US); 2009. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9578/ doi: 10.17226/12458