Why Linked Healthcare Data is the New Research Standard

Blog Posts  |  25 January 2022

Real-world data (RWD) are data relating to patients’ health or health care, which are routinely collected from several existing sources.1 This article explores what those sources are, along with an assessment of each. We also explore why linked healthcare data are the new research standard, harnessing the strengths of RWD while helping to compensate for some of its limitations.

Real-world data sources

Traditionally, researchers have obtained RWD primarily by mining sources such as:1-4

  • Insurance claims
  • Centers for Medicare and Medicaid Services (CMS) administrative records
  • Electronic health records (EHR)
  • Health surveys
  • Product and disease registries
  • Other sources that can provide health status information, such as digital health technologies and applications (apps)

Insurance claims, CMS records, EHRs, surveys, and registries can provide rich data that is already being collected in real-world settings, much of it at the point-of-care. Personal devices and apps allow non-stop monitoring and data collection.2

Strengths and limitations of electronic data sources

Insurance claims data

Insurance claims data capture information from millions of doctors’ appointments, bills, and other patient-provider communications.3

Claims data follow a standard format and are relatively complete, which makes them widely used among researchers. Insurance claims include information about every service from every medical provider covered by the payer.5 This information comes directly from the billing information that the provider’s practice submits for services provided.3

Claims data generally contains a broader set of information than data from EHRs, because EHRs may not be connected to every healthcare facility visited by the patient. This means claims data may do a better job capturing records of tests, procedures, and services received by the patient. It also means claims data may include important information about the patient’s medication: every filled prescription, the amount dispensed, etc., which can be assessed to determine whether the patient is taking their medication as directed.6
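As an illustration of how fill records might be assessed, the sketch below computes a simple proportion of days covered (PDC) from fill dates and days supplied. The function name, the input layout, and the sample fills are hypothetical, and this simplified version does not shift overlapping fills forward the way some published PDC definitions do.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, period_start, period_end):
    """Fraction of days in [period_start, period_end] covered by at least one fill.

    fills: iterable of (fill_date, days_supply) tuples, as might be taken
    from pharmacy claims (hypothetical layout).
    """
    total_days = (period_end - period_start).days + 1
    if total_days <= 0:
        raise ValueError("period_end must not precede period_start")
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            # Count only days that fall inside the measurement period.
            if period_start <= day <= period_end:
                covered.add(day)
    return len(covered) / total_days


# Hypothetical patient: three 30-day fills, with a gap before the third.
fills = [
    (date(2019, 1, 1), 30),
    (date(2019, 1, 31), 30),
    (date(2019, 3, 17), 30),
]
pdc = proportion_of_days_covered(fills, date(2019, 1, 1), date(2019, 3, 31))
print(f"PDC: {pdc:.2f}")  # 75 of 90 days covered -> PDC: 0.83
```

A value well below 1.0 suggests gaps between refills, which is the kind of signal claims data can offer about whether medication is being taken as directed.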

EHR data

EHR data are desirable for research because they are collected routinely by the physician at the point of care. Providers usually record data in the EHR during or soon after the patient encounter, making EHR data fairly reliable.3,7 Most importantly, EHRs contain a wealth of information not available elsewhere because they are created directly by providers:3,6,7

  • EHRs contain the “problem list,” a place where the provider can track all of the patient’s medical problems; since this list is maintained independently from specific medical encounters, it allows identification of conditions that may not be noted in claims data
  • EHRs usually contain more detail than claims data, which is focused on the specific information needed for payer reimbursement
  • EHRs may include information patients do not feel comfortable providing through other data sources, such as surveys
  • EHRs often contain key clinical information not found in claims data, such as vital signs, lab work, and test results—data that might be analyzed to allow identification of conditions the physician failed to recognize and code
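To make the last point concrete, here is a minimal sketch of how lab results might be screened for conditions that were never coded. It assumes a hypothetical record layout and uses an HbA1c value of 6.5% or higher with no ICD-10-CM diabetes code (E08 to E13) as the illustrative rule. It is a screening sketch only, not a clinical algorithm.

```python
# ICD-10-CM diabetes mellitus categories (E12 is not used in ICD-10-CM).
DIABETES_CODE_PREFIXES = ("E08", "E09", "E10", "E11", "E13")
HBA1C_THRESHOLD = 6.5  # percent; illustrative cutoff


def possibly_uncoded_diabetes(lab_results, diagnosis_codes):
    """Return sorted patient IDs with an elevated HbA1c but no diabetes code.

    lab_results: iterable of (patient_id, test_name, value)
    diagnosis_codes: iterable of (patient_id, icd10_code)
    """
    coded = {
        pid for pid, code in diagnosis_codes
        if code.startswith(DIABETES_CODE_PREFIXES)
    }
    elevated = {
        pid for pid, test, value in lab_results
        if test == "HbA1c" and value >= HBA1C_THRESHOLD
    }
    return sorted(elevated - coded)


# Hypothetical EHR extracts.
lab_results = [
    ("p1", "HbA1c", 7.2),
    ("p2", "HbA1c", 5.4),
    ("p3", "HbA1c", 6.9),
    ("p3", "LDL", 130.0),
]
diagnosis_codes = [("p1", "E11.9"), ("p3", "I10")]

print(possibly_uncoded_diabetes(lab_results, diagnosis_codes))  # ['p3']
```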

Data scientists have even begun extracting data found in unstructured or semi-structured EHR fields through use of natural language processing (NLP) and machine learning.7,8
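As a very small illustration of the idea, the sketch below pulls blood pressure readings out of free-text note content with a regular expression. The pattern, function name, and sample note are assumptions made for this example. Production NLP pipelines handle negation, context, and far more variation than a single pattern can.

```python
import re

# Matches phrases such as "BP 142/88", "BP: 120/80", or "blood pressure of 150/95".
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*(?:is|of|:)?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressures(note):
    """Return a list of (systolic, diastolic) integer pairs found in the note."""
    return [(int(sys), int(dia)) for sys, dia in BP_PATTERN.findall(note)]


# Hypothetical physician note.
note = (
    "Pt seen for follow-up. BP 142/88 today, down from "
    "blood pressure of 150/95 last visit. HR 72."
)
print(extract_blood_pressures(note))  # [(142, 88), (150, 95)]
```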

Surveillance and observational data

Patient and disease registries are types of public health surveillance that record health and demographic information about patients who are affected by specific diseases. The Centers for Disease Control and Prevention (CDC), the World Health Organization (WHO), and other medical institutions provide databases that track information about various disease outbreaks.3,4

Similarly, care improvement registries are used to provide a longitudinal view of patients with a specific disease or condition. They are often collaborative, with multiple physicians or healthcare facilities collecting EHR data toward a common purpose. For example, Veradigm® provides two clinical data registries in association with the American College of Cardiology (ACC). The PINNACLE Registry® captures data on coronary artery disease, hypertension, atrial fibrillation, and heart failure to create cardiology’s largest outpatient quality improvement registry. The Diabetes Collaborative Registry® is the first clinical ambulatory registry designed to track and improve the quality of diabetes and metabolic care in both primary and specialty care. Both registries draw data from multiple specialties, including primary care, family care, internal medicine, endocrinology, and cardiology.9

These types of data are valuable because they are usually provided in partnership with a broad spectrum of healthcare providers, such as labs, hospitals, and private physicians. Since the data are provided directly from patient records, they tend to be more reliable than, for example, survey data. In addition, these data are stored in registries that make the data easier to access and analyze than many other data types.

CMS administrative data

CMS administrative data are claims data derived from Medicaid and Medicare reimbursement information, bill payment, or enrollment/disenrollment information.10 CMS data files cover a broad population segment: Over 45 million beneficiaries are enrolled in Medicare today, or 98% of U.S. adults ages 65 and over. These data also include demographic information, such as date of birth, race, place of residence, and date of death.10

Power of linked healthcare data

Administrative claims records are a powerful source of data, but on their own they can handicap researchers with gaps in the information they provide. New linkages between claims data and clinical data give traditional RWD a priceless upgrade.6,11

Linked data can provide researchers with more complete information. Data linkages can be used to:1

  • Increase the breadth and depth of data related to individual patients
  • Provide additional data for use when data validation is needed
  • Help fill in gaps caused by data that is missing, either because that data was not collected or because the database in use does not have a field for collecting that data
  • Maintain data validity, because medical information recorded in the EHR is not always captured by claims data7
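The uses listed above can be sketched in a few lines: group EHR and claims rows by a shared patient key, then count how many patients appear in both sources versus only one. The field names, the `token` key, and the sample rows are hypothetical. Real linkage typically relies on privacy-preserving tokenization, which is not shown here.

```python
def link_records(ehr_records, claims_records):
    """Group EHR and claims rows by a shared, deidentified patient token."""
    linked = {}
    for row in ehr_records:
        linked.setdefault(row["token"], {"ehr": [], "claims": []})["ehr"].append(row)
    for row in claims_records:
        linked.setdefault(row["token"], {"ehr": [], "claims": []})["claims"].append(row)
    return linked


def linkage_summary(linked):
    """Count patients found in both sources, EHR only, or claims only."""
    summary = {"both": 0, "ehr_only": 0, "claims_only": 0}
    for sources in linked.values():
        if sources["ehr"] and sources["claims"]:
            summary["both"] += 1
        elif sources["ehr"]:
            summary["ehr_only"] += 1
        else:
            summary["claims_only"] += 1
    return summary


# Hypothetical rows: clinical detail from the EHR, fills from pharmacy claims.
ehr_records = [
    {"token": "a1", "systolic_bp": 142},
    {"token": "b2", "systolic_bp": 118},
]
claims_records = [
    {"token": "a1", "drug": "lisinopril", "days_supply": 30},
    {"token": "c3", "drug": "metformin", "days_supply": 90},
]

linked = link_records(ehr_records, claims_records)
print(linkage_summary(linked))  # {'both': 1, 'ehr_only': 1, 'claims_only': 1}
```

Patients in the "both" group are the ones for whom clinical measurements and billed services can be analyzed together. The single-source counts show where gaps remain.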

Seqirus™ and Veradigm conducted a non-interventional, retrospective cohort study on vaccine effectiveness during the 2018-2019 influenza season using RWD from a large dataset that linked ambulatory patient EHRs (Allscripts Touchworks® EHR, Veradigm EHR™, and Veradigm’s Practice Fusion) with medical and pharmacy claims. This dataset enabled them to assess relative vaccine effectiveness in over ten million individuals who had a record of receiving either the cell culture-derived inactivated quadrivalent influenza vaccine or the egg-derived inactivated quadrivalent influenza vaccine.12,13

Linking claims and EHR data provides the best of both worlds: detailed accounts of all costs and services covered in the claims data linked to deep, rich clinical information from individual patient records.

Veradigm is one of the largest providers of deidentified ambulatory EHR data, data that is captured directly from our point-of-care systems. These data are used to provide flexible linked data solutions supporting multiple claims data providers and linking technologies.

Our high-quality linked database set includes:

  • 180+ million patients*
  • 42+ million patients with EHR + closed claims* (Closed claims data is derived directly from the payer, so it captures nearly all patient interactions that occur during the patient’s enrollment period; however, this data generally covers a shorter time frame and is less recent than open claims data.)
  • 25+ million patients with linked EHR + open claims* (Open claims data can be captured from multiple different data sources, such as electronic medical records, practice management systems, billing systems, or medical claims clearinghouses; however, open claims data frequently include gaps where specific interactions aren’t captured.)
  • Physician notes (unstructured data) available for 50 million patients for NLP extraction*

Veradigm offers both off-the-shelf and custom linked data solutions to meet your research needs. To learn more about how Veradigm’s linked data assets and point-of-care platforms can help you with your research goals, click here.

*5 Year Time Period: November 2015 – October 2019


  1. Real-World Data: Assessing Electronic Health Records and Medical Claims Data to Support Regulatory Decision-Making for Drug and Biological Products - Guidance for Industry (Center for Biologics Evaluation and Research, Food and Drug Administration) (September 2021).
  2. Sherman RE, Anderson SA, Dal Pan GJ, et al. Real-World Evidence–What Is It and What Can It Tell Us? The New England Journal of Medicine. 2016;375:2293-2297. doi:10.1056/NEJMsb1609216.
  3. National Information Center on Health Services Research & Health Care Technology (NICHSR). Finding and Using Health Statistics. NIH - National Library of Medicine website. Updated April 3, 2019. Accessed January 10, 2021, https://www.nlm.nih.gov/nichsr/stats_tutorial/cover.html.
  4. Data Resources in the Health Sciences. Health Sciences Library - University of Washington website. Accessed January 9, 2021, https://guides.lib.uw.edu/c.php?g=99209&p=642709#12207218.
  5. West SL, Johnson W, Visscher W, Kluckman M, Qin Y, Larsen A. The challenges of linking health insurer claims with electronic medical records. Health Informatics Journal. February 18, 2014;20(1):22-34.
  6. Wilson J, MD, Bock A, MD. The benefit of using both claims data and electronic medical record data in health care analysis. Optum. Updated 2012. Accessed January 21, 2021, https://www.optum.com/content/dam/optum/resources/whitePapers/Benefits-of-using-both-claims-and-EMR-data-in-HC-analysis-WhitePaper-ACS.pdf.
  7. Reifsnyder C, Cohen A, Kallenbach L. Below the Surface: Why Multi-Dimensional SDoH Data Is Critical for Clinical Research. March 2021. https://veradigm.com/img/resource-ebook-why-multi-dimensional-sdoh-data-is-critical-for-clinical-research.pdf.
  8. Ferver K, Burton B, Jesilow P. The Use of Claims Data in Healthcare Research. The Open Public Health Journal. 2009;2:11-24.
  9. Clinical Data Registries. Veradigm. Updated 2021. Accessed October 7, 2021, https://veradigm.com/clinical-data-registries/.
  10. Virnig B. Strengths and Limitations of CMS Administrative Data in Research. Research Data Assistance Center website. Updated January 10, 2018. Accessed January 10, 2021, https://www.resdac.org/articles/strengths-and-limitations-cms-administrative-data-research.
  11. Life Sciences industry veterans with unique real-world data assets. Veradigm. Accessed September 28, 2021, https://connect.practicefusion.com/life-sciences-industry-real-world-assets/.
  12. Farah JM. Preventing Influenza-Related Medical Encounters (US 2018-2019 Season): A Real-World Study from Seqirus™ and Veradigm®. Updated May 26, 2021. Accessed October 9, 2021, https://veradigm.com/veradigm-news/seqirus-quadrivalent-influenza-vaccine-real-world-study/.
  13. Boikos C, Fischer L, O’Brien D, Vasey J, Sylvester GC, Mansi JA. Relative Effectiveness of the Cell-derived Inactivated Quadrivalent Influenza Vaccine Versus Egg-derived Inactivated Quadrivalent Influenza Vaccines in Preventing Influenza-related Medical Encounters During the 2018-2019 Influenza Season in the United States. Clin Infect Dis. 73(3):e692-e698. doi:10.1093/cid/ciaa1944.