NLP for automated extraction of cancer diagnosis from pathology reports
The widespread adoption of electronic health records (EHRs) has led to a significant increase in the availability of clinical data for research. However, large amounts of clinical data exist only in free-text medical notes (i.e. clinical or pathology notes dictated by health care providers) and are stored in unstructured form. Extracting clinically relevant data therefore requires manual review by human experts, a process that is slow, expensive, and subject to encoding errors. Natural language processing (NLP) offers a significantly faster and more reliable method of data extraction than traditional manual extraction.
We are developing a data pipeline for extracting tumor diagnosis information for breast cancer patients from electronic pathology reports in Health Level Seven International (HL7) (https://www.hl7.org/about/index.cfm?ref=nav) and PDF formats. Although this diagnostic information can be obtained from the cancer registry, it takes up to two years for the registry to be updated by current manual methods. Automatic extraction allows us to obtain detailed cancer diagnostic information within days of confirmation, enabling us to use the most up-to-date clinical diagnosis outcomes for our AI algorithm development and validation projects.
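As an illustration only, the extraction step for an HL7 report can be sketched as follows. This is a minimal sketch, not the pipeline described above: it assumes the report arrives as an HL7 v2 message whose free text sits in the OBX-5 (observation value) field, and the sample message, the function names `obx_text` and `extract_diagnosis`, and the regular expressions are all hypothetical. HL7 escape sequences and PDF input are not handled.

```python
import re

# Hypothetical, heavily simplified HL7 v2 message; real reports vary by sender.
SAMPLE_HL7 = "\r".join([
    "MSH|^~\\&|LAB|HOSP|||202401151200||ORU^R01|MSG0001|P|2.5",
    "PID|1||123456",
    "OBR|1|||PATH^Pathology report",
    "OBX|1|TX|DX^Diagnosis||Left breast, core biopsy: invasive ductal carcinoma, grade 2.",
    "OBX|2|TX|DX^Diagnosis||ER positive, PR positive, HER2 negative.",
])

# Illustrative patterns only; a production system needs far broader coverage.
HISTOLOGY = re.compile(r"\b(invasive|in situ)\s+(ductal|lobular)\s+carcinoma\b", re.I)
GRADE = re.compile(r"\bgrade\s+([1-3])\b", re.I)
RECEPTOR = re.compile(r"\b(ER|PR|HER2)\s+(positive|negative)\b", re.I)


def obx_text(message: str) -> str:
    """Join the OBX-5 (observation value) fields of an HL7 v2 message."""
    parts = []
    # HL7 v2 segments are separated by carriage returns; tolerate newlines too.
    for segment in re.split(r"[\r\n]+", message.strip()):
        fields = segment.split("|")
        if fields[0] == "OBX" and len(fields) > 5:
            parts.append(fields[5])
    return " ".join(parts)


def extract_diagnosis(message: str) -> dict:
    """Pull histology, grade, and receptor status out of the report text."""
    text = obx_text(message)
    histology = HISTOLOGY.search(text)
    grade = GRADE.search(text)
    return {
        "histology": histology.group(0).lower() if histology else None,
        "grade": int(grade.group(1)) if grade else None,
        "receptors": {
            m.group(1).upper(): m.group(2).lower() for m in RECEPTOR.finditer(text)
        },
    }


result = extract_diagnosis(SAMPLE_HL7)
print(result)
```

Separating the segment parsing from the pattern matching means the same matching code could in principle be reused on text recovered from PDF reports.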
NLP methods to extract demographics information from EMR
The objective of this project is to use an in-house NLP computer algorithm to identify socioeconomic and other health risk factors for a subset of patients who have received routine screening mammography.
Using socioeconomic and other risk factors to inform personalised screening recommendations may help clinicians advise patients on when to consider breast screening. However, incorporating these risk factors into a prognostication model through manual chart review is time-consuming, as each patient chart must be read individually. Natural language processing (NLP) can help researchers automate this process and provide enough data for the prognostication model.
An in-house developed computer algorithm will be used to organize the input chart data and sort patients into two datasets. Once organized, the in-house NLP algorithm will be used on one dataset to extract socioeconomic and health risk data from free-text consultation notes in the electronic medical record. The same socioeconomic and health risk data will be extracted manually from the other dataset. The computer program results will be compared to the manual chart review results to determine whether the computer program can accurately identify socioeconomic and health risk data. If the computer program does not reproduce the results of the manual chart review, the NLP algorithm will be improved and re-tested.
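The comparison between the computer program and the manual chart review could be scored per extracted field, for example as below. This is a sketch under stated assumptions: it presumes some patients have both an NLP result and a manual result, and the field names (`smoking_status`, `employment`), the records, and the function `field_agreement` are hypothetical.

```python
from collections import defaultdict


def field_agreement(nlp_records: dict, manual_records: dict) -> dict:
    """Per field, the fraction of shared patients where NLP matches manual review."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for patient_id, manual in manual_records.items():
        nlp = nlp_records.get(patient_id)
        if nlp is None:
            continue  # no NLP output for this patient, so it is skipped, not penalised
        for field, expected in manual.items():
            totals[field] += 1
            if nlp.get(field) == expected:
                matches[field] += 1
    return {field: matches[field] / totals[field] for field in totals}


# Hypothetical records keyed by an anonymised patient identifier.
manual = {
    "p1": {"smoking_status": "former", "employment": "employed"},
    "p2": {"smoking_status": "never", "employment": "retired"},
    "p3": {"smoking_status": "current", "employment": "unemployed"},
    "p4": {"smoking_status": "never", "employment": "employed"},
}
nlp = {
    "p1": {"smoking_status": "former", "employment": "employed"},
    "p2": {"smoking_status": "never", "employment": "employed"},
    "p3": {"smoking_status": "never", "employment": "unemployed"},
    "p4": {"smoking_status": "never", "employment": "employed"},
}

agreement = field_agreement(nlp, manual)
print(agreement)
```

Reporting agreement per field rather than as one overall number shows which risk factors the algorithm handles poorly and therefore where improvements should be targeted before re-testing.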
Extracted socioeconomic and health risk factor information will be used to enhance the performance of a prognostication model.
This work will pave the way for effective strategies to further optimise personalised recommendations for routine breast screening in British Columbia.
NLP for automated electronic medical record review
The objective of this project is to improve and validate an in-house developed Natural Language Processing (NLP) computer algorithm to correctly identify patients who developed Radiation Pneumonitis by automated analysis of electronic medical record text.
Radiation Pneumonitis (RP) is a side effect of radiation treatment for lung cancer patients. RP can leave patients with permanently impaired oxygen transfer, which leads to supplemental oxygen dependence and lower quality of life. Although death from RP is uncommon, it still occurs in 2% of affected patients. The overarching goal of the larger project is to identify computed tomography (CT) radiomic features in pre-treatment CT images that indicate which patients are at greatest risk of developing RP. Before the radiomic analysis can begin, a cohort of patients who underwent radiation therapy and developed RP must be assembled, along with a control cohort of patients who did not develop RP after treatment. An initial chart review has been completed by manually reading patient charts to identify patients who developed RP after radiation therapy. This manual chart review is very time-consuming, as each patient chart must be read individually to find any diagnostic criteria for RP. Because future radiomic projects will require a larger study, an automated chart review would save many hours.
An in-house developed computer algorithm will be used to organize the input chart data and sort patients into the two cohorts. Once organized, the in-house NLP algorithm will be used to identify the patients who have developed Radiation Pneumonitis from free-text narratives in the electronic medical record. The computer program results will be compared to the manual chart review results to determine whether the computer program can accurately select the patient cohorts. If the computer program does not correctly identify the two patient cohorts found in the previous manual chart review, the NLP algorithm will be improved and re-tested. The workflow and the NLP algorithm developed in this project will also be used in larger chart review studies by this research group.
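To make the validate-and-improve loop concrete, the sketch below pairs a very simple rule-based detector with a sensitivity and specificity check against manual labels. It is not the in-house algorithm: the keyword and negation patterns, the notes, the labels, and the function names are hypothetical, and the detector only looks for negation cues that appear before the mention within the same sentence.

```python
import re

# Hypothetical patterns; a real system would need a much richer vocabulary.
MENTION = re.compile(r"\b(?:radiation\s+)?pneumonitis\b", re.I)
NEGATION = re.compile(r"\b(?:no|not|without|denies|negative for)\b", re.I)


def mentions_rp(note: str) -> bool:
    """True if any sentence mentions pneumonitis with no negation cue before it."""
    for sentence in re.split(r"[.;\n]+", note):
        match = MENTION.search(sentence)
        if match and not NEGATION.search(sentence[:match.start()]):
            return True
    return False


def sensitivity_specificity(predicted: dict, reference: dict) -> tuple:
    """Compare predictions to manual labels; assumes both cohorts are non-empty."""
    tp = sum(1 for pid, truth in reference.items() if truth and predicted[pid])
    fn = sum(1 for pid, truth in reference.items() if truth and not predicted[pid])
    tn = sum(1 for pid, truth in reference.items() if not truth and not predicted[pid])
    fp = sum(1 for pid, truth in reference.items() if not truth and predicted[pid])
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical notes and manual chart review labels (True = developed RP).
notes = {
    "a": "Follow-up CT shows changes consistent with radiation pneumonitis. Started prednisone.",
    "b": "Chest clear. No evidence of pneumonitis.",
    "c": "Patient denies cough; imaging negative for radiation pneumonitis.",
    "d": "Grade 2 pneumonitis diagnosed three months after treatment.",
    "e": "Pneumonitis was considered but not confirmed.",
}
manual_labels = {"a": True, "b": False, "c": False, "d": True, "e": False}

predicted = {pid: mentions_rp(text) for pid, text in notes.items()}
sensitivity, specificity = sensitivity_specificity(predicted, manual_labels)
print(predicted, sensitivity, specificity)
```

Note "e" is deliberately misclassified: the negation follows the mention, which this sketch does not handle. That is the kind of disagreement with the manual review that would prompt a further round of improvement and re-testing.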
Evaluating Cardiac Risk in Breast Cancer Patients using Artificial Intelligence
The purpose of this project is to use deep learning and natural language processing methodologies to create an automated cardiac risk assessment tool for use in evaluating cardiac risk in breast cancer patients.
There is a known link between cardiac conditions and breast cancer. Women with breast cancer, for instance, may have a higher risk of later developing cardiovascular conditions than those without breast cancer. Further, women diagnosed with breast cancer who have a prior history of cardiovascular disease may be at increased risk of mortality from cardiac causes. Initial findings (not yet published) from BC Cancer show similar results. This increase in cardiac-related mortality may be caused by radiation-induced cardiac toxicity, with increases in heart disease proportional to the radiation dose to the heart and potentially lasting 20 years or more.
Breast cancer patients undergoing radiation therapy receive a computed tomography (CT) scan of the chest for radiation dose calculation. These CT scans offer an opportunity to estimate the coronary artery calcium (CAC) score, an independent predictor of coronary events that improves cardiovascular risk prediction in asymptomatic individuals. If a patient is found to have a significantly elevated CAC score, they may be referred for formal cardiovascular risk evaluation. However, manually performing calcium scoring on all chest CT scans of breast cancer patients is impractical because the process is tedious and time-consuming. Recent developments in artificial intelligence methodologies, such as deep learning and natural language processing (NLP), provide an opportunity to automate cardiac risk assessments for breast cancer patients.
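The text does not say which scoring method is used. As one possibility, the widely used Agatston method scores each calcified lesion as its area multiplied by a weight based on its peak attenuation, and the sketch below applies that rule to a single slice. It assumes the slice has already been restricted to the coronary arteries (the step a segmentation model would perform), uses a hypothetical toy slice, and treats the minimum lesion area as an adjustable assumption.

```python
def density_weight(peak_hu: float) -> int:
    """Agatston density weight from a lesion's peak attenuation in Hounsfield units."""
    if peak_hu >= 400:
        return 4
    if peak_hu >= 300:
        return 3
    if peak_hu >= 200:
        return 2
    return 1  # 130 to 199 HU


def agatston_slice_score(hu, pixel_area_mm2, min_area_mm2=1.0, threshold_hu=130):
    """Sum area * density weight over 4-connected lesions at or above the threshold."""
    rows, cols = len(hu), len(hu[0])
    seen = [[False] * cols for _ in range(rows)]
    score = 0.0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or hu[r][c] < threshold_hu:
                continue
            # Flood fill to collect one connected lesion.
            stack = [(r, c)]
            seen[r][c] = True
            n_pixels, peak = 0, hu[r][c]
            while stack:
                y, x = stack.pop()
                n_pixels += 1
                peak = max(peak, hu[y][x])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and hu[ny][nx] >= threshold_hu):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            area = n_pixels * pixel_area_mm2
            if area >= min_area_mm2:  # ignore specks that are likely noise
                score += area * density_weight(peak)
    return score


# Hypothetical 4 x 5 slice in HU, already masked to the coronary arteries.
toy_slice = [
    [0, 0, 0, 0, 0],
    [0, 150, 210, 0, 0],
    [0, 0, 0, 0, 450],
    [0, 0, 0, 0, 0],
]
slice_score = agatston_slice_score(toy_slice, pixel_area_mm2=0.5)
print(slice_score)
```

In this toy slice the two-pixel lesion contributes 1.0 mm² times a weight of 2, while the isolated 450 HU pixel falls below the minimum area and is ignored. A full score would sum the slice scores across the scan; the manual work this replaces lies mainly in identifying which high-attenuation regions belong to the coronary arteries.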
Deep neural networks are a type of representation-learning model that employs multiple processing layers to automatically detect, identify, or classify features of interest within a dataset. In recent years, deep neural network techniques have been successfully applied in numerous medical contexts, including diagnostics, radiology, and pathology. Radiology in particular has found many transformative uses for deep learning across various modalities, not only in classification tasks but also increasingly in segmentation tasks. Automatic semantic segmentation allows specific objects, such as anatomic structures, tissues, or organs, to be identified, contoured, and labeled, as in the automatic scoring of cardiac calcification on CT scans.
NLP is a branch of artificial intelligence that uses computational techniques to convert unstructured free text into structured data. Its applications are cross-disciplinary and include language translation, social media analysis, and web mining. Many earlier applications of NLP were in consumer contexts (e.g. Apple’s Siri, analysing web-based customer reviews). In recent years, however, there has been increased interest in using NLP in medical contexts, in particular for automated mining of electronic health records (EHRs), which are laborious and time-consuming to analyse by hand. In the radiation oncology context, NLP can identify cancer cases, attributes, and outcomes, with real-world applications in surveillance and epidemiology.
We believe this work will pave the way for using artificial intelligence models to aid cardiac risk prediction in breast cancer patients during routine clinical care. Future applications of this work could help prevent cardiac-related deaths in breast cancer patients and survivors.