Natural Language Processing Applications in Health Care

Ongoing Projects

NLP for Automated Extraction of Cancer Diagnosis from Pathology Reports

NLP methods to extract Social Determinants of Health from EMR

NLP for Automated Electronic Medical Record Review

Evaluating Cardiac Risk in Breast Cancer Patients using Artificial Intelligence

NLP for Automated Extraction of Cancer Diagnosis from Pathology Reports

The widespread adoption of electronic health records (EHRs) has led to a significant increase in the availability of clinical data for research. However, large amounts of clinical data exist only in free-text medical notes (i.e. clinical or pathology notes dictated by health care providers) and are stored in unstructured form. Extracting clinically relevant data therefore requires manual review by human experts, a process that can be slow and expensive and is subject to human encoding errors. Natural language processing (NLP) offers a significantly faster and more reliable method of data extraction than traditional manual extraction.

We are developing a data pipeline for extracting tumor diagnosis information for breast cancer patients from electronic pathology reports in Health Level Seven International (HL7) and PDF formats. Although this diagnostic information can be extracted from the cancer registry, it takes up to two years for the registry to be updated by current manual methods. Automatic extraction allows us to obtain detailed cancer diagnostic information within days of diagnostic confirmation, enabling us to use the most up-to-date clinical diagnosis outcomes in our AI algorithm development and validation projects.
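
As a rough illustration of the first stage of such a pipeline, the sketch below pulls the free-text observation values (field OBX-5) out of an HL7 v2 message and applies a few keyword patterns. It is a minimal sketch under stated assumptions, not the project's actual implementation: the sample message is synthetic, the function names and the `HISTOLOGY_PATTERNS` table are hypothetical, and real reports would additionally need negation handling and HL7 escape-sequence decoding.

```python
import re

def obx_text(hl7_message: str) -> str:
    """Join the observation values (OBX-5) of all OBX segments into one string."""
    values = []
    # HL7 v2 segments are separated by carriage returns; tolerate newlines too.
    for segment in re.split(r"[\r\n]+", hl7_message.strip()):
        fields = segment.split("|")
        # After splitting on "|", index 5 of an OBX segment is OBX-5.
        if fields[0] == "OBX" and len(fields) > 5:
            values.append(fields[5])
    return " ".join(values)

# Hypothetical keyword patterns, for illustration only.
HISTOLOGY_PATTERNS = {
    "invasive ductal carcinoma": re.compile(
        r"\b(?:invasive|infiltrating) ductal carcinoma\b", re.I),
    "invasive lobular carcinoma": re.compile(
        r"\b(?:invasive|infiltrating) lobular carcinoma\b", re.I),
    "ductal carcinoma in situ": re.compile(
        r"\bductal carcinoma in situ\b|\bDCIS\b", re.I),
}

def extract_histology(text: str) -> list:
    """Return the labels whose pattern occurs in the text (no negation handling)."""
    return [label for label, pattern in HISTOLOGY_PATTERNS.items()
            if pattern.search(text)]

# Synthetic message; the "DX^Final diagnosis" identifier is made up.
SAMPLE_MESSAGE = "\r".join([
    "MSH|^~\\&|LAB|SITE|||202401010000||ORU^R01|1|P|2.3",
    "OBX|1|TX|DX^Final diagnosis||Left breast, core biopsy:",
    "OBX|2|TX|DX^Final diagnosis||Invasive ductal carcinoma, grade 2.",
])

if __name__ == "__main__":
    report = obx_text(SAMPLE_MESSAGE)
    print(report)
    print(extract_histology(report))
```

A keyword baseline like this is mainly useful as a sanity check; a statement such as "no evidence of invasive ductal carcinoma" would be matched incorrectly, which is one reason a trained NLP model is preferable.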

NLP methods to extract Social Determinants of Health from EMR

The goal of this research is to develop a Natural Language Processing pipeline capable of automatically extracting Social Determinants of Health (SDoH) from the Electronic Medical Records (EMR) in BC Cancer’s dataset.

SDoH encompass the non-medical conditions in which people are born, live, work, and play. They include elements such as employment, income, education, living and working environment, and race, among others. They often reflect an individual’s social status and significantly impact health outcomes. Moreover, SDoH influence health inequities, particularly affecting individuals with low socioeconomic status. Such individuals often face barriers to accessing quality healthcare and preventive services, resulting in delayed or inadequate treatment and poorer health outcomes. Acknowledging this issue, in September 2020 the BC Office of the Human Rights Commissioner published a report advocating for the collection of disaggregated data, which is closely related to SDoH data, to develop effective policies that address systemic inequalities and safeguard human rights.

However, legislating the collection, use and disclosure of SDoH data would be a long process. The BC Cancer EMR is a rich source of existing SDoH data, and mining it would greatly speed up collection. Although most SDoH information sits in unstructured EMR text, it can be extracted by training an NLP model on a large set of annotated records. Annotation is in progress, and a comprehensive annotation guideline has been drafted to ensure gold-standard label quality across the dataset.
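
To make the annotation-to-training step concrete, the sketch below converts character-level span annotations into token-level BIO tags, a common input format for training sequence-labeling models. This is only an illustration under assumptions: the label names (`LIVING_SITUATION`, `EMPLOYMENT`), the helper functions, and the example sentence are hypothetical and are not taken from the project's annotation guideline.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char_exclusive, label)

def span_of(text: str, phrase: str, label: str) -> Span:
    """Build a span for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tokenize into words and punctuation, then tag each token B-/I-/O."""
    tagged = []
    for token in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies entirely inside it.
            if token.start() >= start and token.end() <= end:
                prefix = "B-" if token.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((token.group(), tag))
    return tagged

# Synthetic example sentence with hypothetical labels.
NOTE = "Patient lives alone and is currently unemployed."
SPANS = [
    span_of(NOTE, "lives alone", "LIVING_SITUATION"),
    span_of(NOTE, "unemployed", "EMPLOYMENT"),
]

if __name__ == "__main__":
    for token, tag in to_bio(NOTE, SPANS):
        print(f"{token}\t{tag}")
```

Tagged sequences of this kind could then be fed to whichever token-classification model is chosen; the choice of model is outside the scope of this sketch.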

This work will pave the way for studying the correlation between SDoH and cancer development and for making policies to reduce health inequalities.

NLP for Automated Electronic Medical Record Review

The objective of this project is to improve and validate an in-house developed Natural Language Processing (NLP) computer algorithm to correctly identify patients who developed Radiation Pneumonitis by automated analysis of electronic medical record text.

Radiation Pneumonitis (RP) is a side effect of radiation treatment for lung cancer patients. RP can leave patients with permanently impaired oxygen transfer, which leads to supplemental oxygen dependence and lower quality of life. Although death from RP is uncommon, it still occurs in 2% of those affected. The overarching goal of this large project is to identify computed tomography (CT) radiomic features in pre-treatment CT images that indicate which patients are at greatest risk of developing RP. Before the radiomic analysis can begin, a cohort of patients who underwent radiation therapy and developed RP must be assembled, as well as a control cohort of patients who did not develop RP after treatment. An initial chart review has been completed by manually reading patient charts to identify patients who developed RP after radiation therapy. Manual chart review is very time-consuming, because each patient chart must be read individually to find any diagnostic criteria for RP. As a larger study will be required for future radiomic projects, an automated chart review would save many hours.

An in-house developed computer algorithm will be used to organize the input chart data and sort patients into the two cohorts. Once the data are organized, the in-house NLP algorithm will identify the patients who developed Radiation Pneumonitis from the free-text narratives in the electronic medical record. The program's results will be compared with the manual chart review results to determine whether it can accurately select the patient cohorts. If the program does not correctly identify the two patient cohorts found in the previous manual chart review, the NLP algorithm will be improved and re-tested. The workflow and the NLP algorithm developed in this project will also be used in larger chart review studies by this research group.
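
The comparison against manual chart review can be summarized with standard agreement metrics. The following is a minimal sketch, assuming each patient has a boolean RP label from both the NLP algorithm and the manual review; the function name, the patient identifiers, and the labels are all made up for illustration and are not project results.

```python
from typing import Dict, Optional

def compare_to_manual(nlp: Dict[str, bool],
                      manual: Dict[str, bool]) -> Dict[str, Optional[float]]:
    """Score NLP labels against manual chart review, treated as the reference."""
    tp = fp = tn = fn = 0
    for patient_id, truth in manual.items():
        predicted = nlp[patient_id]
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        # Undefined when the denominator is zero (e.g. no RP cases at all).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # RP cases the NLP found
        "specificity": ratio(tn, tn + fp),  # controls correctly excluded
        "ppv": ratio(tp, tp + fp),          # NLP-flagged cases that are true RP
    }

# Synthetic labels for five hypothetical patients.
MANUAL = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": False}
NLP = {"p1": True, "p2": False, "p3": False, "p4": True, "p5": False}

if __name__ == "__main__":
    print(compare_to_manual(NLP, MANUAL))
```

Patients on whom the two methods disagree (here `p2` and `p4`) are the natural starting point for the improve-and-re-test cycle described above.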

Evaluating Cardiac Risk in Breast Cancer Patients using Artificial Intelligence

The purpose of this project is to use deep learning and natural language processing methodologies to create an automated cardiac risk assessment tool for use in evaluating cardiac risk in breast cancer patients.

There is a known link between cardiac conditions and breast cancer. Women with breast cancer, for instance, may have a higher risk of later developing cardiovascular conditions than those without breast cancer. Further, women diagnosed with breast cancer who have a prior history of cardiovascular disease may be at increased risk of mortality from cardiac causes. Initial findings (not yet published) from BC Cancer show similar results. This increase in cardiac-related mortality may be caused by radiation-induced cardiac toxicity, with increases in heart disease proportionate to the radiation dose to the heart and potentially lasting 20 years or more.

Breast cancer patients receiving radiation therapy undergo a Computed Tomography (CT) scan of the chest for radiation dose calculation. These CT scans offer an opportunity to estimate the coronary artery calcium (CAC) score, an independent predictor of coronary events that improves cardiovascular risk prediction in asymptomatic individuals. If a patient is found to have a significantly high CAC score, they may be referred for formal cardiovascular risk evaluation. However, manually performing calcium scoring on all chest CT scans of breast cancer patients is impractical because the process is tedious and time-consuming. Recent developments in artificial intelligence methodologies, such as deep learning and natural language processing (NLP), provide an opportunity to automate cardiac risk assessments for breast cancer patients.
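
For readers unfamiliar with calcium scoring, the sketch below shows the arithmetic of the widely used Agatston method, in which each calcified lesion contributes its area multiplied by a weight based on its peak attenuation in Hounsfield units (HU). The text above does not state which scoring method this project uses, so Agatston scoring is an assumption here, as are the `Lesion` class, the 1 mm² minimum-area default, and the example values. Agatston scoring is defined for dedicated cardiac CT, so scores computed on radiotherapy planning CT should be treated as estimates.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Lesion:
    """One calcified lesion on one axial slice (hypothetical container)."""
    area_mm2: float  # lesion area on the slice
    peak_hu: float   # maximum attenuation within the lesion

def density_weight(peak_hu: float) -> int:
    """Agatston density weight; lesions below 130 HU do not count."""
    if peak_hu < 130:
        return 0
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4

def agatston_score(lesions: Iterable[Lesion], min_area_mm2: float = 1.0) -> float:
    """Sum area x density weight over lesions that meet the area threshold."""
    return sum(
        lesion.area_mm2 * density_weight(lesion.peak_hu)
        for lesion in lesions
        if lesion.area_mm2 >= min_area_mm2  # threshold value is an assumption
    )

# Synthetic lesions, for illustration only.
EXAMPLE_LESIONS = [
    Lesion(area_mm2=4.0, peak_hu=150),  # counted with weight 1
    Lesion(area_mm2=2.5, peak_hu=320),  # counted with weight 3
    Lesion(area_mm2=0.5, peak_hu=500),  # below the area threshold
    Lesion(area_mm2=3.0, peak_hu=100),  # below 130 HU
]

if __name__ == "__main__":
    print(agatston_score(EXAMPLE_LESIONS))
```

The laborious part of manual scoring is identifying and outlining the lesions on every slice, which is the step an automated segmentation model would replace; the final arithmetic is simple once the lesions are known.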

Deep neural networks are a type of representation-learning model that employs multiple processing layers to automatically detect, identify, or classify features of interest within a dataset. In recent years, deep neural networks have been successfully applied in numerous medical contexts, including diagnostics, radiology, and pathology. Radiology in particular has found many transformative uses for deep learning across various modalities, not only in classification tasks but also, increasingly, in segmentation tasks. Automatic semantic segmentation allows specific objects (such as anatomic structures, tissues, or organs) to be identified, contoured, and labeled, as in the automatic scoring of cardiac calcification on CT scans.

NLP is a field of artificial intelligence that uses computational techniques to process unstructured free-text data into structured data. Its applications are cross-disciplinary and include language translation, social media analysis, and web mining. Many earlier applications of NLP were in consumer contexts (e.g. Apple’s Siri, analysing web-based customer reviews). In recent years, however, there has been increased interest in using NLP in medical contexts, in particular for automated mining of electronic health records (EHRs), which are laborious and time-consuming to analyse by hand. In the radiation oncology context, NLP can identify cancer cases, attributes, and outcomes, with real-world applications in surveillance and epidemiology.

We believe this work will pave the way for using artificial intelligence models to aid cardiac risk prediction in breast cancer patients during routine clinical care. Future applications of this work could help prevent cardiac-related deaths in breast cancer patients and survivors.