Deep dive into lung cancer diagnosis

Of course, lung cancer is a very wide subject. Put simply, detecting lung nodules/cancers while they are still small can save lives. When the nodule/cancer is asymptomatic, we call this screening imaging.

You can read an interesting review about lung nodule screening here:

Lung-RADS is a screening reporting and data system defined by the ACR (American College of Radiology):
https://www.acr.org/Quality-Safety/Resources/LungRADS
Summary table: https://www.acr.org/~/media/ACR/Documents/PDF/QualitySafety/Resources/LungRADS/AssessmentCategories.pdf

If we focus on incidentally detected lung nodules on CT scans, the most important and up-to-date clinical publication is:
Fleischner guidelines 2017: http://pubs.rsna.org/doi/abs/10.1148/radiol.2017161659 (msg me for some help to get the .pdf)
Summary table of Fleischner 2017: http://www.nucsradiology.com/fleischner-society-2017-guidelines/
Initial 2005 publication that introduced the concepts: http://pubs.rsna.org/doi/pdf/10.1148/radiol.2372041887
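
Just to make the flavor of these guidelines concrete, here is a deliberately simplified sketch of the 2017 Fleischner follow-up recommendations for a single solid nodule in a low-risk patient. This is my own toy encoding for illustration only, absolutely not for clinical use; see the summary table above for the real categories:

```python
def fleischner_2017_single_solid_low_risk(diameter_mm: float) -> str:
    """Toy encoding of the 2017 Fleischner follow-up recommendation for a
    single solid nodule in a low-risk patient. Illustration only -- not
    clinical advice; refer to the published guideline."""
    if diameter_mm < 6:
        return "No routine follow-up required"
    elif diameter_mm <= 8:
        return "CT at 6-12 months, then consider CT at 18-24 months"
    else:
        return "Consider CT at 3 months, PET/CT, or tissue sampling"

print(fleischner_2017_single_solid_low_risk(7.0))
```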

The historical usage scenario for this problem was consequently to develop an automated tool that detects nodules on CT scans with good sensitivity and specificity (high test accuracy, or a high area under the ROC curve). We commonly call these tools CAD (computer-assisted detection). Unfortunately, classical CADs have high sensitivity but low specificity; consequently, they are not seriously used in high-volume practice. You can see a review of this subject here:

The LUNA16 challenge focused on 1) detection of nodules and 2) false positive reduction (i.e., higher specificity):
https://luna16.grand-challenge.org/description/
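
For anyone new to the metrics mentioned above, here is a minimal scikit-learn sketch computing sensitivity, specificity, and ROC AUC from per-candidate predictions (the arrays are made-up toy data, just to show the mechanics):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy data: 1 = true nodule, 0 = false-positive candidate
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.4, 0.2, 0.7, 0.6, 0.8, 0.1, 0.3])
y_pred = (y_score >= 0.5).astype(int)  # threshold the model scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # fraction of true nodules detected
specificity = tn / (tn + fp)   # fraction of false positives rejected
auc = roc_auc_score(y_true, y_score)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} AUC={auc:.2f}")
```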

The goal of the Kaggle 2017 Data Science Bowl (https://www.kaggle.com/c/data-science-bowl-2017) was to create an automated method capable of determining whether or not a patient will be diagnosed with lung cancer within one year of the date the scan was taken. Of course, the winning methods analyzed nodules/masses in the dataset. Unfortunately for that competition, the training labels weren't localized on the 3D data. The task was challenging but not that useful from a clinical perspective. Maybe it was a commercially oriented challenge to find good prospects (developers and models) for the development of a real training/application.

I think the main historical usage scenario is still important: automatically detect nodules on CT scans and select the one(s) with the highest probability of cancer based on all the features. With strong evidence of performance, such a tool could completely change current practice (detecting smaller worrisome nodules, less follow-up for larger nodules with a low probability of cancer). As I already said in a previous post, the most important size range for a deep learning application is between 5 and 10 mm on CT scan. Under 5 mm, our technology (biopsy or PET-CT) can't confirm a cancer that small, and surgeons (currently) won't treat a patient based on a probability of cancer determined by a deep learning model (no matter the performance). Above 10 mm, the diagnosis is usually relatively straightforward with current technology.
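
To make that size argument concrete, here is a toy sketch of the triage bands just described (the thresholds come from the paragraph above; the function name and labels are mine, for illustration only):

```python
def dl_value_band(diameter_mm: float) -> str:
    """Toy triage reflecting the argument above: a deep learning model adds
    the most value in the 5-10 mm range. Illustration only."""
    if diameter_mm < 5:
        return "Too small: biopsy/PET-CT cannot confirm cancer at this size"
    elif diameter_mm <= 10:
        return "Sweet spot: model probability could guide management"
    else:
        return "Usually straightforward with the current workup"

print(dl_value_band(7.5))  # -> "Sweet spot: ..."
```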

An open-source project with high test performance could have a high impact in low-development-index countries, allowing them to implement CT screening programs at very low cost. Of course, this requires that the population has at least minimal access to a CT scanner. There is an open-source project for lung cancer screening with chest X-rays: https://aiai.care/
But unfortunately, detecting small lung cancers on a chest X-ray is not very efficient, even from a human perspective.

An open-source project could also have a high impact in high-development-index countries by lowering the cost of CT lung screening and delivering it universally to the entire population. For example, in Canada, with a universal public system, starting a country-wide screening program would require many more radiologists and come at a very high program cost.

I agree with @jeremy that replicating the winning data science entry is a good start. I still offer to volunteer as a radiologist (e.g. labeling and localization of nodules) if this is pursued as an open-source project. Validating the model against many different radiologists will eventually also be quite important to gain enough credibility to be applicable. I could help on that side too if needed.

I hope this helps.

Thanks for this wonderful intro to the problem and the domain!

@alexandrecc this is such great information - is there any chance you might consider copying it into a Medium post? If not, do you mind if I turn it into a post later?

@jeremy Count me in, I'm very interested.

Yes, sure. Let me find some images to improve the format and I'll let you know when I post it on Medium. I'll try to do this before leaving for the RSNA on Saturday.

Hi @alexandrecc,

Nice intro from a radiologist's perspective!
BTW, I'm coming to RSNA too. It would be very nice to meet up with you and say hello in person. If you don't mind, please let me know :wink: Let's enjoy RSNA 2017 :smile:

The blog post inspired by the previous forum post is available here:

Very good post! I'm very interested and halfway through the processing steps!

I cloned the grt123 team's repo and finished the preprocessing. Still trying to understand the details of each piece of code. Kinda overwhelmed by the complexity of their solution…

Learning Pytorch in the meantime.

I recently read more about this collaborative project, which looks promising if someone from fast.ai wants to get involved:
https://concepttoclinic.drivendata.org/

The project could have a very high impact if it works as planned. @kmader posted the link to the public GitHub repository earlier in this thread on Sept 16: Deep dive into lung cancer diagnosis

They are also using the grt123 solution.

I have a question for experts in this field. I am currently walking through Brad Kenstler's notebook. The images I am working on right now have modality: MR. I've tried to look online but couldn't find a good explanation of how to map MR intensities -> Hounsfield Units for visualization purposes. I would very much appreciate it if someone could help me out. Maybe I am searching for the wrong thing. Thanks!

I also suspect that the interpretation of intensities in CT and MR is probably different, which makes it hard to skip the visualization part of the notebook and continue with the actual preprocessing. In particular, the ROI thresholds probably differ across organs (lung vs. brain) and machines (CT vs. MR).

If you ask on Twitter and at-mention me, I'll ask the rad community if they can help. cc @Judywawira @alexandrecc @davecg

Hounsfield is just CT.
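
As a quick illustration: on CT, the stored DICOM pixel values map linearly to HU via the rescale tags, which is exactly the calibrated mapping MR lacks. A minimal pydicom sketch (file name hypothetical):

```python
import pydicom

ds = pydicom.dcmread("ct_slice.dcm")  # hypothetical CT slice on disk
# CT only: HU = stored_value * RescaleSlope + RescaleIntercept.
# MR pixel values have no such standardized physical scale.
hu = ds.pixel_array * float(ds.RescaleSlope) + float(ds.RescaleIntercept)
print(hu.min(), hu.max())  # air ~ -1000 HU, water ~ 0 HU, bone well above 0
```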

Depending on the MRI sequence it might have intrinsic meaning (e.g. ADC, quantitative flow sequences, some perfusion metrics), but even those will vary from scanner to scanner.

Demeaning and dividing by std for the volume should be a reasonable way to start, but you should check to make sure even/odd slices aren't very different (MRI series are sometimes collected "interleaved" and on some scans you will notice alternating intensity levels). Normalizing by slice might avoid this problem, but slices that are nearly empty will be normalized very differently than slices with a lot of tissue (you can see this when viewing images on many PACS systems).
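
A minimal numpy sketch of that starting point, assuming the volume is already loaded as a (slices, height, width) array: per-volume z-scoring plus a quick even/odd slice check for interleaving artifacts:

```python
import numpy as np

def normalize_volume(vol: np.ndarray) -> np.ndarray:
    """Demean and divide by the std over the whole volume."""
    return (vol - vol.mean()) / (vol.std() + 1e-8)

def interleave_check(vol: np.ndarray) -> float:
    """Ratio of mean intensity on even vs. odd slices; values far from 1.0
    suggest the series was acquired interleaved."""
    return vol[::2].mean() / (vol[1::2].mean() + 1e-8)

# vol = ...  # a (slices, H, W) array loaded from DICOM/NIfTI
# print(interleave_check(vol))  # ~1.0 means no obvious even/odd offset
# vol = normalize_volume(vol)
```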

You also usually need bias correction using a tool like N4 (http://stnava.github.io/ANTs/) for research workflows, and motion correction if you have time-series data (e.g. MCFLIRT from FSL, basically just rigid registration across timepoints).

Some of these tools might not be necessary for deep learning models, and many others could stand to be updated to use the GPU.
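
If it helps, N4 is also exposed through SimpleITK, which keeps the whole workflow in Python. A minimal sketch (file names hypothetical):

```python
import SimpleITK as sitk

img = sitk.ReadImage("t1.nii.gz", sitk.sitkFloat32)  # hypothetical input
mask = sitk.OtsuThreshold(img, 0, 1, 200)            # rough foreground mask
corrected = sitk.N4BiasFieldCorrection(img, mask)    # N4 bias-field correction
sitk.WriteImage(corrected, "t1_n4.nii.gz")
```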

This is all really valuable information, thank you so much. We currently have MRI scans for around 300 meningioma patients. Each MRI scan is 124x512x512 (slices, height, width). Our first task is to come up with a model that can auto-contour the meningioma tumor. Since we have raw data, we will probably do a lot of preprocessing before feeding it into neural nets, such as normalization, skull removal, and other steps that might be helpful for the task.

I appreciate your help, and if you don't mind, may I also ask for help in this thread: Lung cancer detection; Convolution features + Gradient boost. Thanks in advance.

I'm working on the same task. Following you!

Very interested in this topic. I'm part of the Deep Learning Brasília (Brazil) group and we're on lesson 6 of part 1.
Jeremy, congratulations on this initiative. And thanks for the opportunity of the fast.ai Deep Learning course. You're THE GUY!

I am really interested in this topic! Is it too late to help out?

Is this dataset different from the one hosted at drivendata.org?

I am very enthusiastic about this topic as well and would greatly appreciate hearing an update from Jeremy, since it sounds like the data has been available for more than a year now.

Our group at the University of Basel is currently working on lung cancer diagnosis (based on CT scans and reports) together with the radiology department of the university hospital, and I will draw their attention to this opportunity (and kindly ask the radiologists for their help in labelling).

Further, I'd like to point out that it's very unfortunate that the data from the Kaggle 2017 Data Science Bowl is not available anymore. In this regard, it would be extremely helpful if at least a sample of the NLST data were provided, to get people started with the preprocessing and replication of models.

Hi,

I'm currently exploring the CT scan challenge on Kaggle. I'd love to know where this thread went. Did anyone have any success? Did anyone produce a fastai-based CT scan example?