Data Engineering, Wrangling and Curation for AI Applications

Ongoing Projects

Traceable medical image de-identification pipeline for AI applications

Medical Image matching using AI for record linkage

Image Data Curation in Oncology for CT Scan Segmentation

Acceptance testing and routine QA of AI systems


Traceable medical image de-identification pipeline for AI applications

The objective of this project is to extend open-source Picture Archiving and Communication Systems (PACS) to create a pipeline that lowers the barrier to gathering and accessing cohort data for training artificial intelligence (AI) algorithms. The service will extract and de-identify DICOM files from various facilities in British Columbia, generate a traceable patient information crosswalk, store the data in a data lake, and remove raw intermediate personal health information (PHI). The PACS will then act as a Query/Retrieve service that supplies users with the cohort they request.
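The de-identification and crosswalk steps can be sketched as follows. This is a minimal illustration, not the project's implementation: headers are modelled as plain dictionaries rather than real DICOM datasets, and the field list `PHI_FIELDS`, the keyed-hash pseudonym scheme, and the function names are all assumptions introduced for the example.

```python
import hashlib
import hmac

# Hypothetical subset of DICOM attributes treated as PHI in this sketch.
PHI_FIELDS = ("PatientName", "PatientID", "PatientBirthDate")


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from the patient ID with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(header: dict, secret_key: bytes, crosswalk: dict) -> dict:
    """Return a copy of `header` without PHI and record the ID mapping."""
    pseudo_id = pseudonym(header["PatientID"], secret_key)
    # The crosswalk is what keeps the process traceable; in practice it
    # would live in a separate, access-restricted store.
    crosswalk[pseudo_id] = header["PatientID"]
    cleaned = {k: v for k, v in header.items() if k not in PHI_FIELDS}
    cleaned["PatientID"] = pseudo_id
    return cleaned


crosswalk = {}
key = b"example-key-not-for-production"  # placeholder secret
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "PatientBirthDate": "19700101",
    "Modality": "MG",
}
clean = deidentify(header, key, crosswalk)
```

Because the pseudonym is derived deterministically from the original ID, files for the same patient arriving from different facilities map to the same de-identified ID, while the crosswalk allows authorized re-identification.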

Applications of AI in medical research have the potential to greatly impact clinical practice, for example by improving patient outcomes through more robust diagnoses. Currently, it is time-consuming and difficult to acquire and curate large volumes of medical imaging data across various facilities while respecting and protecting PHI.

A continuously accumulating store of nonspecific, de-identified data will make this data much more readily available to future projects that require large volumes to train AI.


Medical Image matching using AI for record linkage

The objective of this project is to curate a set of mammograms and subsequently train an artificial intelligence (AI) model on images known to be from the same patient, in order to resolve potentially mismatched patient records.

Large amounts of patient information are generated every day at many different healthcare institutions throughout the province. While measures are taken to ensure digital data integrity, in practice discrepancies often arise from sources such as manual data entry errors and non-unique identifiers. This is problematic because curating datasets for medical research requires as much accurate information about a patient as possible, and missing or duplicate records may cause issues when training future AI models and may introduce bias into the AI algorithms.

These cases could be investigated manually, but as systems and the need for data scale, so does the effort required to confidently resolve potentially mismatched records. This project explores methods that leverage deep learning on mammogram image data to reconcile patients flagged as mismatches in our in-house database. The ability to match images of the same patient taken on different dates would increase confidence that the records should indeed be linked.

Fig. Example of mammogram images (LCC view) of a patient captured on different dates. Note that the breast is compressed and a projection is taken, leading to variable positioning of structures.

We believe this work may be used in conjunction with other techniques such as probabilistic matching to alleviate this issue and reduce the need for manual resolution.
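As a rough illustration of how an image-based check might support record linkage, the sketch below compares two embedding vectors with cosine similarity. It assumes a trained model has already mapped each mammogram to a fixed-length vector; the model, the vectors shown, and the `threshold` value are hypothetical and are not described in the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined for zero vectors; treat as no similarity
    return dot / (norm_a * norm_b)


def same_patient(embedding_a, embedding_b, threshold=0.9):
    """Flag two images as a likely match when their embeddings are close."""
    return cosine_similarity(embedding_a, embedding_b) >= threshold


# Made-up embeddings standing in for model outputs on two visits.
visit_2019 = [1.0, 0.0, 1.0]
visit_2021 = [1.0, 0.1, 0.9]
other_patient = [0.0, 1.0, 0.0]

likely_match = same_patient(visit_2019, visit_2021)
likely_mismatch = same_patient(visit_2019, other_patient)
```

In a combined workflow, a similarity score like this could serve as one more piece of evidence alongside demographic fields in a probabilistic matcher, rather than as a decision on its own.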


Image Data Curation in Oncology for CT Scan Segmentation

The primary objective of this study is to estimate the percentage of breast cancer patients undergoing radiation therapy in British Columbia who have cardiac calcifications. This will be achieved by examining and quantitatively analyzing the CT scans of a cohort of breast cancer patients who underwent CT scanning for their breast radiation therapy.
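A percentage estimated from a cohort is usually reported with an interval that reflects the sample size. The sketch below uses the Wilson score interval as one common choice; the study does not state which method it will use, and the counts in the example are invented placeholders, not results.

```python
import math


def wilson_interval(positives: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = positives / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Made-up counts, purely for illustration.
low, high = wilson_interval(positives=30, total=200)
print(f"prevalence: {30 / 200:.1%}, 95% CI: {low:.1%} to {high:.1%}")
```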

A link has been demonstrated between radiation therapy of the chest area and subsequent coronary artery disease (CAD) and other forms of cardiovascular disease. This phenomenon is known as radiation-induced cardiac toxicity (RICT). CAD is caused by plaque building up along the inner walls of the arteries of the heart, narrowing these arteries and restricting blood flow. Plaque results from calcified fat, and its presence is frequently used as a marker of atherosclerosis and future CAD. Studies have found a similar relationship in left-sided breast cancer survivors. The risk of RICT has been shown to persist long after the completion of radiation treatment.

A coronary artery calcium (CAC) computed tomography (CT) scan provides information about the presence and extent of calcified plaques in the coronary arteries and has been used as a non-invasive tool to predict future CAD. The coronary artery calcium score, which was developed to quantify the extent of coronary calcification, is known to be an independent predictor of coronary events and improves cardiovascular risk prediction in asymptomatic individuals. Rates of RICT have been found to be higher in women with pre-existing cardiac risk factors such as CAC. In addition to starting out at a higher risk, these women have displayed greater absolute increases in risk compared to radiotherapy patients without pre-existing cardiac risk factors.

First, we plan to contour cardiac calcifications on retrospective CT scans using the ARIA radiation therapy treatment planning system, calculate the coronary artery calcium score (CACS), and enter the data elements into the study database. Then, we will analyze the distribution of CACS among patients in the study cohort and compare it with the incidence of possible RICT in the same cohort.
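To make the scoring step concrete, the sketch below assumes the widely used Agatston method, which the text above does not name explicitly: each calcified lesion on each slice contributes its area in mm² multiplied by a weight determined by its peak attenuation in Hounsfield units (130 to 199 HU gives 1, 200 to 299 gives 2, 300 to 399 gives 3, 400 and above gives 4). The input format and function names are hypothetical, and slice-thickness and minimum-area conventions are omitted.

```python
def density_weight(peak_hu: float) -> int:
    """Agatston density weight from a lesion's peak attenuation (HU)."""
    if peak_hu < 130:
        return 0  # below the calcium threshold, not counted
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4


def agatston_score(lesions) -> float:
    """Sum area x weight over (area_mm2, peak_hu) pairs, one per lesion per slice."""
    return sum(area * density_weight(peak) for area, peak in lesions)


# Made-up contoured lesions: (area in mm^2, peak HU).
lesions = [(4.0, 150), (2.5, 320), (1.0, 450), (3.0, 100)]
score = agatston_score(lesions)  # 4.0*1 + 2.5*3 + 1.0*4 + 3.0*0 = 15.5
```

The classic method was defined for dedicated, ECG-gated cardiac CT, so scores derived from radiotherapy planning scans may not be directly comparable with published thresholds.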


Acceptance testing and routine QA of AI systems

Project description to come