Deep dive in to lung cancer diagnosis

(Alexandre Cadrin-Chênevert) #81

Of course, lung cancer is a very wide subject. In its simplest form, detecting lung nodules/cancers when they are still small can save lives. When the nodule/cancer is asymptomatic, we call this screening imaging.

You can read an interesting review about lung nodule screening here:

Lung-Rads is a screening reporting and data system defined by ACR (American College of Radiology):
Summary table :

If we focus on incidentally detected lung nodules on CT scan the most important and up to date clinical publication is:
Fleischner guidelines 2017 : (msg me for some help to get the .pdf)
Summary table of Fleischner 2017 :
Initial 2005 publication to introduce the concepts :

The historical usage scenario for this problem was consequently to develop an automated tool to detect nodules on CT scan with good sensitivity and specificity (high test accuracy, or high area under the ROC curve). We commonly call these tools CAD (computer-aided detection). Unfortunately, classical CADs have high sensitivity but low specificity; consequently, they are not seriously used in high volume practice. You can see a review of this subject here:

The Luna Challenge focused on 1) detection of nodules and 2) false positive reduction (i.e. higher specificity):

The goal of the Kaggle 2017 Data Science Bowl was to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. Of course, the winning methods analyzed nodules/masses in the dataset. Unfortunately for that competition, training labels weren’t localized on the 3D data. The task was challenging but not that useful from a clinical perspective. Maybe it was a commercially oriented challenge to find good prospects (developers and models) for the development of a real training/application.
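To make the sensitivity and specificity terms used above concrete, here is a minimal sketch in plain Python. The labels are a made-up toy example (not from any real CAD system or dataset), and the function name `sensitivity_specificity` is only illustrative. It shows the pattern described for classical CADs: every true nodule is flagged, but so are many non-nodules.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = nodule, 0 = no nodule)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true nodules that were detected
    specificity = tn / (tn + fp)  # fraction of non-nodules correctly left alone
    return sensitivity, specificity


# Hypothetical toy data: 4 real nodules, 6 candidate locations that are not nodules.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A detector that finds every nodule but also raises 3 false alarms.
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy case nothing is missed, but half of the non-nodules are flagged, which is the kind of false positive load that makes a tool hard to use in high volume practice.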

I think the main historical usage scenario is still important: to automatically detect nodules on CT scan and select the one(s) with the highest probability of cancer based on all the features. With strong evidence of performance, this tool could completely change the current practice (detecting smaller worrisome nodules, less follow-up for larger nodules with low probability of cancer). As I already said in a previous post, the most important size range for a deep learning application is between 5 and 10 mm on CT scan. Under 5 mm, our technology (biopsy or PET-CT) can’t confirm a cancer that small, and surgeons (currently) won’t treat a patient based on a probability of cancer determined by a deep learning model (no matter the performance). Above 10 mm, the diagnosis is usually relatively straightforward with the current technology.

An open source project with high test performance could have high impact for low development index countries to implement CT screening programs at a very low cost. Of course, this means the population has at least minimal access to a CT scan. There is an open source project for lung cancer screening with chest X-ray :
But unfortunately, detecting small lung cancers from a chest X-ray is not very efficient from a human perspective.

An open source project could also have high impact for high development index countries to lower the cost of CT lung screening and deliver it universally to the entire population. For example, in Canada, with a universal public system, starting a country wide screening program would need a lot more radiologists with a very high program cost.

I agree with @jeremy that replicating the winning data science entry is a good start. I still offer to volunteer as a radiologist (e.g. labeling and localization of nodules) if this is focused as an open source project. Validating the model against many different radiologists will eventually also be quite important to gain enough credibility for it to be applicable. I could also help on that side if needed.

I hope this helps.

(Surya K) #82

Thanks for this wonderful intro to the problem and the domain!

(Jeremy Howard) #83

@alexandrecc this is such great information - is there any chance you might consider copying it into a Medium post? If not, do you mind if I turn it into a post later?

(Krishna Vishal V ) #84

@jeremy Count me in, I’m very interested.

(Alexandre Cadrin-Chênevert) #85

Yes sure, let me find some images to improve the format and I’ll let you know when I post it on Medium. I’ll try to do this before leaving for the RSNA on Saturday.

(Rikiya Yamashita) #86

Hi @alexandrecc,

Nice intro from a radiologist’s perspective!
BTW, I’m coming to RSNA too. It would be very nice to meet up with you and say hello in person. If you don’t mind, please let me know :wink: Let’s enjoy RSNA 2017 :smile:

(Alexandre Cadrin-Chênevert) #87

Blog post inspired by previous forum post available here:

(Octavio ) #88

Very good post! I’m very interested and half way through the processing steps!

(segovia) #89

I cloned the grt123 team’s repo and finished the preprocessing. Still trying to understand the details of each piece of code. Kinda overwhelmed by the complexity of their solution…

Learning PyTorch in the meantime.

(Alexandre Cadrin-Chênevert) #90

I recently read more about this collaborative project that looks promising if someone from wants to get involved:

The project could have very high impact if it works as planned. @kmader posted the link to the public GitHub repository earlier in this thread on Sept 16 : Deep dive in to lung cancer diagnosis

They are also using the grt123 solution.

(Kerem Turgutlu) #91

I have a question for experts in this field. I am currently walking through Brad Kenstler’s notebook. The images I am working on right now have modality : MR. I’ve tried to look online but couldn’t find a good explanation for mapping MR intensities -> Hounsfield units for visualization purposes. I would appreciate it very much if someone could help me out. Maybe I am searching for the wrong thing. Thanks !

I also suspect that CT and MR intensities are interpreted differently, which makes it hard to skip the visualization part of the notebook and continue with the actual preprocessing. In particular, ROI thresholds probably differ between organs (lung vs. brain) and between machines (CT vs. MR).

(Jeremy Howard) #92

If you ask on Twitter and at-mention me I’ll ask the rad community if they can help. cc @Judywawira @alexandrecc @davecg

(David Gutman) #93

Hounsfield is just CT.

Depending on the MRI sequence it might have intrinsic meaning (eg ADC, quantitative flow sequences, some perfusion metrics), but even those will vary from scanner to scanner.
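For contrast, on CT the mapping to Hounsfield units is just a linear rescale of the stored pixel values, using the DICOM attributes Rescale Slope (0028,1053) and Rescale Intercept (0028,1052). A minimal NumPy sketch, assuming the slope and intercept have already been read from the header; the function name `to_hounsfield` and the sample values are only illustrative. There is no equivalent general mapping for MR.

```python
import numpy as np


def to_hounsfield(stored, slope, intercept):
    """Apply the linear DICOM rescale: HU = stored * slope + intercept (CT only)."""
    return stored.astype(np.float32) * slope + intercept


# Hypothetical stored values; slope=1 and intercept=-1024 are assumed here,
# real values must come from the scan's own header.
stored = np.array([[0, 1024], [24, 2048]], dtype=np.int16)
hu = to_hounsfield(stored, slope=1.0, intercept=-1024.0)
print(hu)
```

On the HU scale water is 0 and air is about -1000, which is why fixed thresholds can work for CT but cannot be carried over to MR images.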

Demeaning and dividing by std for the volume should be a reasonable way to start, but you should check to make sure even/odd slices aren’t very different (MRI series are sometimes collected “interleaved” and on some scans you will notice alternating intensity levels). Normalizing by slice might avoid this problem, but slices that are nearly empty will be normalized very differently than slices with a lot of tissue (you can see this when viewing images on many PACS systems).
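The normalization options above could be sketched roughly as follows with NumPy. This assumes the volume is an array shaped (slice, height, width); the function names, the synthetic test volume and the `min_std` floor are illustrative choices, not part of any particular toolkit.

```python
import numpy as np


def zscore_volume(vol):
    """Demean and divide by the standard deviation of the whole volume."""
    vol = vol.astype(np.float32)
    return (vol - vol.mean()) / (vol.std() + 1e-8)


def even_odd_ratio(vol):
    """Mean intensity of even-indexed slices divided by that of odd-indexed slices.

    A value far from 1.0 hints at an interleaved acquisition with alternating
    intensity levels. Assumes the odd-slice mean is non-zero.
    """
    return float(vol[0::2].mean() / vol[1::2].mean())


def zscore_per_slice(vol, min_std=1e-3):
    """Normalize each slice separately; min_std guards nearly empty slices."""
    vol = vol.astype(np.float32)
    mean = vol.mean(axis=(1, 2), keepdims=True)
    std = vol.std(axis=(1, 2), keepdims=True)
    return (vol - mean) / np.maximum(std, min_std)


# Synthetic stand-in for an MR volume, with odd slices made brighter
# to imitate an interleaved series.
rng = np.random.default_rng(0)
vol = rng.uniform(100, 200, size=(8, 16, 16)).astype(np.float32)
vol[1::2] *= 1.5

ratio = even_odd_ratio(vol)
norm = zscore_volume(vol)
per_slice = zscore_per_slice(vol)
print(f"even/odd ratio: {ratio:.2f}")
```

Per-slice normalization removes the even/odd offset, but as noted above it also stretches nearly empty slices very differently from slices full of tissue, so it is worth looking at a few normalized slices before training on them.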

You also usually need bias correction using a tool like N4 for research workflows, and motion correction if you have time series data (e.g. MCFLIRT from FSL, basically just rigid registration across timepoints).

Some of these tools might not be necessary for deep learning models, and many others could stand to be updated to use the GPU.

(Kerem Turgutlu) #94

This is all really valuable information, thank you so much. We currently have MRI scans for around 300 meningioma patients. Each MRI scan is 124x512x512 (slice, height, width). Our first task is to come up with a model that can auto-contour meningioma tumors. Since we have raw data, we will probably do a lot of preprocessing before feeding it into neural nets, such as normalization, skull removal, and other steps that might be helpful for the task.

I appreciate your help, and if you don’t mind, may I ask for help with this thread as well: Lung cancer detection; Convolution features + Gradient boost. Thanks in advance.

(Andy) #95

I’m working on the same task. Following you!


Very interested in this topic. I’m part of the Deep Learning Brasilia (Brazil) group and we’re on lesson 6 - part 1.
Jeremy, congratulations on this initiative. And thanks for the opportunity of the Fastai Deep Learning course. You’re THE GUY!

(David) #97

I am really interested in this topic! Is it too late to help out?

(Jose Quesada) #98

Is this dataset different from the one hosted at

(Imant Daunhawer) #99

I am very enthusiastic about this topic as well and would greatly appreciate hearing an update from Jeremy, since it sounds like the data has been available for more than a year now.

Our group at the University of Basel is currently working on lung cancer diagnosis (based on CT scans and reports) together with the radiology department of the university hospital and I will draw their attention to this opportunity (and kindly ask the radiologists for their help in labelling).

Further, I’d like to point out that it’s very unfortunate that the data from the Kaggle 2017 Data Science Bowl is not available anymore. In this regard, it would be extremely helpful if at least a sample of the NLST data were provided, to get people started with the preprocessing and replication of models.