4 Best Practices for Managing and Securing EHR Datasets

Blog  |  16 December 2024

Written by: Amanda Cohen, MPH

There are over 55 million visits to US community-based health centers every six months. Since most of these health centers use electronic health record (EHR) systems, these visits generate enormous amounts of patient data. Tapping into these data could completely transform how researchers find answers to pressing questions, identify disease trends, and explore potential treatments.

Accessing routinely collected data for research is not a future possibility but a present reality. Today, researchers can use de-identified data collected in EHRs to deepen their knowledge and test new ideas.

Handling EHR datasets safely is critical to the responsible use of patient health information. Here, we’ll cover best practices for using EHR datasets for research by answering the following questions:

  • How are EHR datasets used in clinical research?
  • Do EHR datasets fall under the same legislation as protected health information?
  • How can organizations protect EHR datasets from misuse?
  • How can organizations mitigate algorithm bias using EHR datasets?

How EHR Datasets Benefit Research

EHR datasets include information collected during routine clinic care: patient vitals, diagnosis and treatment information, demographic information, immunization history, insurance status, family and social history, allergies, therapeutic assessments, and much more. Some EHR datasets also use artificial intelligence (AI) models trained in natural language processing (NLP) to codify unstructured data, such as physician notes.

The type of data found in EHR datasets isn’t collected for research. However, data providers can use frameworks, such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), to standardize the data for research use. The OMOP CDM provides a shared format (vocabulary, terminology, coding schemes) for healthcare data. For example, some healthcare providers may record low blood glucose as “low blood sugar” and others as “hypoglycemia.” The OMOP CDM maps these variations to a single standard concept so researchers can pull data without searching for each variation separately.
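As a rough illustration of the standardization idea described above, the following Python sketch maps several source terms to one standard concept. It is not the OMOP CDM itself: the lookup table, the placeholder concept ID, and the function names `standardize` and `count_by_concept` are all assumptions made for this example. In a real deployment, the mappings would come from the OMOP standardized vocabulary tables.

```python
# Hypothetical, hand-written lookup table for illustration only.
# The concept ID below is a placeholder, not a real OMOP concept ID.
HYPOGLYCEMIA = {"concept_id": 1, "concept_name": "Hypoglycemia"}

SOURCE_TERM_TO_CONCEPT = {
    "low blood sugar": HYPOGLYCEMIA,
    "low blood glucose": HYPOGLYCEMIA,
    "hypoglycemia": HYPOGLYCEMIA,
}


def standardize(term):
    """Return the standard concept for a source term, or None if unmapped."""
    # Normalize case and whitespace before looking the term up.
    key = " ".join(term.lower().split())
    return SOURCE_TERM_TO_CONCEPT.get(key)


def count_by_concept(terms):
    """Count records per standard concept; unmapped terms are grouped together."""
    counts = {}
    for term in terms:
        concept = standardize(term)
        name = concept["concept_name"] if concept else "UNMAPPED"
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_concept(["Hypoglycemia", "low blood sugar", "fever"]))
```

Grouping unmapped terms under one label, rather than dropping them, keeps gaps in the mapping visible to whoever reviews the data.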

EHR datasets are just one example of the real-world data (RWD) researchers use to generate real-world evidence (RWE) supporting the safety and effectiveness of therapies. Researchers use RWD to conduct prospective and retrospective cohort studies and answer research questions that arise during randomized clinical trials. RWD is also essential in gathering the RWE for post-market follow-up studies, where researchers show that their drug or medical device is safe for the general population.

The information collected in EHRs has enormous potential to enrich research by giving researchers a comprehensive view of how external factors affect treatment effectiveness.

Regulating De-Identified Data

Before researchers can use EHR datasets, patient data must be stripped of any details that could lead back to individual patients. This process, known as de-identification, has rules. The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, administered by the United States Department of Health and Human Services, provides two methods of de-identifying patient data: HIPAA Safe Harbor and HIPAA Expert Determination.

HIPAA Safe Harbor mandates that data providers remove a list of 18 specific types of patient identifiers from datasets. Examples include a patient’s name, contact information, social security number, license plate number, and full-face photo. De-identifying information via HIPAA Expert Determination is relatively straightforward. HIPAA-covered organizations ask a qualified statistical expert to confirm that the risk of re-identifying individuals from the dataset is very small.
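The Safe Harbor approach can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not a compliance tool: the field names (`name`, `ssn`, `visit_date`, and so on) and the `deidentify` function are hypothetical, the identifier set covers only a few of the 18 Safe Harbor categories, and dates are assumed to be ISO-formatted strings (YYYY-MM-DD). The date and age handling reflects the Safe Harbor rules that dates are reduced to the year and ages over 89 are grouped into a single "90 or older" category.

```python
# Hypothetical field names; a real dataset needs its own field inventory,
# and the full Safe Harbor list is longer than this set.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "phone", "email", "ssn", "license_plate", "full_face_photo",
}
DATE_FIELDS = {"visit_date"}


def deidentify(record):
    """Return a copy of the record with listed identifiers removed or generalized."""
    clean = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # drop the identifier entirely
        if field in DATE_FIELDS:
            # Keep only the year; assumes an ISO "YYYY-MM-DD" string.
            clean[field.replace("_date", "_year")] = int(str(value)[:4])
        elif field == "age":
            # Group ages over 89 into one category.
            clean["age"] = "90+" if value >= 90 else value
        else:
            clean[field] = value
    return clean


if __name__ == "__main__":
    record = {
        "name": "Jane Doe",
        "ssn": "000-00-0000",
        "visit_date": "2024-03-15",
        "age": 93,
        "diagnosis": "hypoglycemia",
    }
    print(deidentify(record))
```

A field-by-field filter like this only handles structured columns. Identifiers buried in free text, such as physician notes, need separate treatment.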

De-identified EHR datasets do not fall under the same legislation as protected health information (PHI) because the information has an extremely low risk of identifying individual patients. However, no matter how carefully identifiers are removed or how thorough the analysis of the removal is, some risk of data being traced back to individuals (re-identification) remains.

Protecting data from misuse is critical to securing patient information and preventing breaches that could damage your organization’s reputation.

Best Practices for Managing EHR Datasets

Using secondary data to determine the safety of treatments is not new. Researchers have been accessing data stored in registries for treatment follow-up for years. However, using secondary data like EHR datasets in research is still an emerging field. Although legislation is not in place to ensure proper use or enforce the consequences of misuse, organizations like The Joint Commission and Health Evolution are putting together guidelines to help researchers responsibly manage EHR datasets.

The basic principles for protecting secondary data include using data for well-defined purposes, accessing only necessary data, being transparent about data use, keeping data accurate and up to date, storing data only as long as needed, treating data with integrity, and establishing accountability. The following points cover more specific ways to manage EHR datasets responsibly.

#1 Ensure EHR Dataset Control

Protecting EHR datasets requires a robust security infrastructure. This includes technical measures, like cybersecurity software and automatic system log-offs after periods of inactivity, and policy measures, such as training employees in data privacy best practices.

Although technical measures for dataset control look different for each life science organization, they often include layers of defenses. For example, an organization may choose to have database protection software, single sign-on software for employee devices, malware protection software for their entire platform, a firewall on their network, and locks for office doors. Organizations use different tools to mitigate the risks at each level.

Technical measures alone can’t guarantee life science organizations remain in control of EHR datasets. Organizations must also have enforceable policies to ensure employees do their best to protect data. Providing regular data security training, requiring encryption software, and restricting data access to company devices can all help ensure malicious actors can’t take control of EHR datasets.

If your organization already deals with PHI, you likely have a security infrastructure in place. Make sure this also applies to EHR datasets. Another best practice is to conduct periodic audits to look for gaps in your security infrastructure, such as employees sharing passwords or outdated anti-malware software. Regularly identifying areas for improvement helps ensure that your security infrastructure remains robust.

#2 Monitor Access to EHR Datasets

Even though de-identified EHR data doesn’t fall under the same protection as PHI, it must be used intentionally. Researchers must use EHR datasets only for legitimate purposes, such as conducting retrospective or prospective studies. Also, life science organizations must restrict access to a select number of researchers and data scientists.

Data providers may also have a data use agreement (DUA) to help customers monitor access to EHR datasets. A DUA outlines terms for protecting data from unauthorized use and provides helpful guidelines, including expectations for data use and who is responsible for data access oversight and restrictions.

Sharing EHR datasets between organizations opens doors for potential misuse. Life science organizations can protect against this by prohibiting redistribution or by allowing redistribution only if they can keep oversight of data use.
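One way to picture the access restrictions and oversight described in this section is a small access monitor that checks each request against an approved list and records every attempt. The Python sketch below is a simplified assumption of how such a check might look. The `DatasetAccessMonitor` class, its method names, and the rule that every request must state a purpose are hypothetical, and a production system would rely on the organization's existing identity and logging tools.

```python
from datetime import datetime, timezone


class DatasetAccessMonitor:
    """Grants access only to approved users and logs every request."""

    def __init__(self, authorized_users):
        self.authorized_users = set(authorized_users)
        self.audit_log = []

    def request_access(self, user, purpose):
        # Access requires an approved user and a non-empty stated purpose.
        granted = user in self.authorized_users and bool(purpose.strip())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "purpose": purpose,
            "granted": granted,
        })
        return granted

    def denied_attempts(self):
        """Return logged requests that were refused, for periodic review."""
        return [entry for entry in self.audit_log if not entry["granted"]]


if __name__ == "__main__":
    monitor = DatasetAccessMonitor(["alice"])
    monitor.request_access("alice", "retrospective cohort study")
    monitor.request_access("bob", "unspecified")
    print(monitor.denied_attempts())
```

Logging refused requests alongside granted ones gives whoever is responsible for oversight a record to review, which lines up with the accountability terms a DUA typically sets out.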

#3 Form a Dedicated EHR Dataset Oversight Team

Another way to ensure proper EHR dataset management is to create an internal team to monitor how the datasets are used. This group oversees tasks like drafting procedures for handling EHR datasets and addressing questions from employees about EHR datasets. To reduce friction between the oversight team and cybersecurity personnel, include data privacy officers on the team and align any decision with the broader company vision for data security.

Effective oversight teams typically include people from different areas of the organization, such as scientists and information technology specialists. Putting unique perspectives together can help uncover potential problems in data management and security. Most importantly, oversight teams must be trained on research ethics and why health information security matters.

#4 Address Potential Algorithm Bias in EHR Datasets

Life science organizations training AI algorithms on EHR datasets must consider algorithm bias. Algorithm bias can be introduced by the questions human developers ask or by historical datasets with limited data on certain populations. Research based on biased algorithms may contain inaccuracies or even cause harm to certain patient populations.

While regulatory bodies are working on developing standards for AI algorithms to reduce bias, life science organizations can work against it now by building internal processes to mitigate bias during development, training, and testing. For example, consider reviewing data for completeness, especially in representing patients across variables like age, sex, ethnicity, region, and socioeconomic status.
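A completeness review like the one suggested above can start with two simple measurements: how often each demographic variable is filled in, and how records are distributed across its categories. The Python sketch below shows one possible form, assuming records are plain dictionaries. The function names `completeness` and `representation` and the sample field names are hypothetical, and what counts as adequate coverage is a judgment left to the research team.

```python
from collections import Counter

MISSING = (None, "")


def completeness(records, variables):
    """Share of records with a non-missing value for each variable."""
    total = len(records)
    if total == 0:
        return {var: 0.0 for var in variables}
    return {
        var: sum(1 for r in records if r.get(var) not in MISSING) / total
        for var in variables
    }


def representation(records, variable):
    """Share of each observed category among records that have a value."""
    values = [r[variable] for r in records if r.get(variable) not in MISSING]
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


if __name__ == "__main__":
    records = [
        {"sex": "F", "region": "West"},
        {"sex": "M", "region": None},
        {"sex": "F", "region": "West"},
        {"sex": "F"},
    ]
    print(completeness(records, ["sex", "region"]))
    print(representation(records, "sex"))
```

Low completeness for a variable, or a category with a very small share, flags where a model trained on the data may perform poorly for some patients.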

Using large datasets from multiple EHR platforms helps ensure that data represents the broader population. Also, EHR datasets covering multiple specialties and primary care practices include a wider spectrum of demographics than those covering a single type of medical practice. For example, data from an oncology clinic may not include patients with other conditions unless those conditions are comorbidities.

Incorporating EHR Datasets in Your Research

As the use of secondary data in clinical research grows, regulatory bodies will continue to refine guidance, and this list of best practices will evolve. For now, safely handling EHR datasets requires practical data security measures and treating patient health information, even when de-identified, as a potential risk.

Network EHR Data includes multiple EHR data sources from the Veradigm Network of ambulatory and specialty clinics. Researchers can access de-identified information from over 154 million patient records across almost 2 billion visits through Veradigm Network EHR. Data is NLP-enriched, providing access to semi-structured and unstructured data (e.g., clinic notes) through a proprietary NLP model.

Benefits of using Veradigm Real World Data for research include:

  • NLP-enriched data
  • EHR datasets covering diverse patient populations
  • Access to Veradigm healthcare provider clients for data clarification
  • Integration with other third-party datasets
  • Long-term tracking and near real-time access

Interested in exploring how Veradigm Real World Data can meet your research needs? Talk to a Veradigm data expert to learn more.
