Deep learning with medical images

I was listening to @jeremy on TWIML last week, and at the end of the talk he mentioned the lack of publicly available medical imaging datasets. Applying deep learning to medical images is my research area, so I am intimately aware of the problem and thought I could contribute what I’ve learned about the practice back to the community.

I created a blog post with my thoughts on how to get started with using deep learning on medical images, specifically magnetic resonance (MR) and computed tomography (CT) images. I overview the two imaging modalities, suggest several publicly available datasets, discuss some techniques for data wrangling and preprocessing (with example scripts), and finally build a small 3D deep learning model using the fastai API.

It turned out a bit longer than expected, and while there is a lot more information to cover, this should (hopefully) help people get started with applying deep learning to structural MR and CT images. I know there has been some previous discussion on here (see here, here, and here). I’d be happy to answer any questions about the blog post or more general questions about working with medical images.

Just wanted to say thanks to everyone who has helped build the fastai package, it’s awesome!


Hi there! Good write-up :slight_smile:

I am one of the developers of NiftyNet, which you mention in the post: I would just like to add that we’ve put out a demo of using NiftyNet image readers/writers within PyTorch.

It shows how to get medical data into the PyTorch context and also how to output results in the correct format. As you mentioned, these operations are different in medical imaging compared to the normal computer-vision based approach, and we think our library makes this part of it much easier.

I’m also keen to talk more generally about medical imaging and will be following this thread eagerly.


Hi, I’m starting a project for my thesis on pancreas segmentation for pancreatic cancer diagnosis. I’ve been reading some articles about it, but as you say there are few databases for practicing, and fewer articles that describe their neural network architectures in detail.
Thanks for the blog posts and for the effort of including examples.
I have a question: how do you deal with the absence of 3D context when working with 2D CT slices versus a full 3D analysis?

Sorry for the late reply, I’ve been away for the last week.

The fundamental problem with using only 2D slices is that you are throwing away the 3D context. The optimal fix to this problem would be to use the entire 3D volumes instead of slices. If you are stuck using 2D slices (and DNNs) for whatever reason, then a naive solution would be to post-process the resulting segmentation volume using basic image processing techniques (e.g., morphological filters). There are better—but more complicated—methods to address this problem, and I’d take a look at some academic journals/arxiv for those methods.
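To make the naive fix concrete, here is a minimal sketch of post-processing a reconstructed binary segmentation volume with `scipy.ndimage`: a morphological closing to smooth slice-to-slice gaps, followed by keeping the largest connected component. The function name and parameters are my own, not from the blog post.

```python
import numpy as np
from scipy import ndimage

def clean_segmentation(seg, closing_iters=1):
    """Post-process a 3D binary segmentation assembled from 2D slice
    predictions (hypothetical helper). Closing smooths small
    inconsistencies; keeping the largest connected component drops
    spurious islands."""
    seg = seg.astype(bool)
    # 3D structuring element (6-connectivity).
    struct = ndimage.generate_binary_structure(3, 1)
    # Morphological closing fills small gaps between adjacent slices.
    seg = ndimage.binary_closing(seg, structure=struct, iterations=closing_iters)
    # Label connected components and keep only the largest one.
    labels, n = ndimage.label(seg, structure=struct)
    if n == 0:
        return seg
    sizes = ndimage.sum(seg, labels, range(1, n + 1))
    return labels == (np.argmax(sizes) + 1)
```

Whether this helps depends heavily on the anatomy being segmented (it assumes a single connected target structure), so treat it as a starting point rather than a recipe.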


FWIW, I believe that medical imaging applications using deep neural networks are moving more and more from 2D CNNs to 3D CNNs. Anecdotally, this appears to be the case (from looking at papers coming out at the relevant conferences), although I have no statistics to back this up. However, for a variety of reasons (e.g., computational limitations, limited dataset size), you may prefer to use a 2D network. Interestingly, you may also notice better performance with a 2D network than a 3D network trained on the same data. Choosing a 2D or 3D model falls within the realm of hyperparameter optimization and will be task/dataset-specific.

The main problem with applying 2D methods to 3D data is that you generally want to reconstruct a 3D result from your 2D outputs. This can lead to inconsistency from slice-to-slice. That is, if you scroll through your reconstructed 3D result, the segmentations may not align in sensible ways from slice-to-slice (which is why I previously suggested the use of morphological filters to smooth the segmentation). Note that this will also occur when using 3D patches (e.g., training/testing on 64x64x64 patches from a 256x256x256 voxel volume). This problem occurs because each patch/slice is predicted independently of the others.
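The slice-by-slice reconstruction described above can be sketched in a few lines of NumPy; `predict_slice` here is a hypothetical callable wrapping your 2D network, and the helper name is mine:

```python
import numpy as np

def predict_volume_slicewise(vol, predict_slice):
    """Apply a 2D model slice-by-slice along the first axis of a 3D
    volume and stack the outputs back into a volume. Each slice is
    predicted independently, which is exactly where slice-to-slice
    inconsistencies come from."""
    return np.stack([predict_slice(vol[z]) for z in range(vol.shape[0])], axis=0)
```

Nothing in this loop couples neighboring slices, so any smoothness across slices in the output is incidental rather than enforced.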

As speculation, my initial impression for why 2D models can sometimes perform better than 3D models would be: 1) inadequate/suboptimal data augmentation and 2) differences in training and testing data that are not as problematic in 2D setups. The two of these are connected, but to expand on the second: consider two sets of MR images that are acquired at 1x1x1mm^3 and 1x1x3mm^3. You train a 3D CNN on the first set and test on the second. Even if you interpolate the second to 1x1x1mm^3, I’d guess that your performance metric (e.g., Dice) will often come out worse than if you trained and tested a 2D CNN on the common 1x1mm^3 slices. Take that example with a grain of salt, but hopefully that provides some intuition on why 2D can outperform a 3D network.
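As a concrete sketch of the interpolation step in that example (resampling 1x1x3mm^3 voxels onto a 1x1x1mm^3 grid), assuming SciPy is available; the function name is my own:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_isotropic(vol, spacing, new_spacing=(1.0, 1.0, 1.0), order=1):
    """Resample a 3D volume from its native voxel spacing (in mm) to an
    isotropic grid. order=1 gives trilinear interpolation. Note that
    interpolation cannot recover information between thick slices,
    which is one reason the 3D network in the example above may still
    underperform after resampling."""
    factors = [s / ns for s, ns in zip(spacing, new_spacing)]
    return zoom(vol, factors, order=order)
```

For real images you would read the spacing from the NIfTI/DICOM header rather than hard-coding it.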

If people have other thoughts, I’d be very interested in hearing them.


Thanks for your answer; now at least I’m more aware of the problems I’m going to face with each method. I think I will try a 3D CNN first, and if I run into memory errors I will migrate to 2D.

Thanks for the interesting contributions in this topic. I work at a big healthcare company, and in my free time I’m a machine learning enthusiast.
I would like to propose that my company introduce a machine learning system for MR and CT images to cooperate with human radiologists in identifying cancers.
I can convince them to undertake this project if I can bring some studies and statistics demonstrating how good ML is at detecting cancer compared to human performance. Is there any document that compares ML systems and human radiologists at identifying cancer?
If I can start this project, I will be very glad to collaborate with others using these libraries.

Hey guys, it is such a pleasure to see other fellow fastai students who are passionate about applying deep learning techniques to healthcare problems.

Below is a post I originally created in the new 2019 Part 2 forum, which has access limited to the cohort taking the class. In hindsight that was a mistake, so I’ve moved it out here to reach a wider audience.

Hey @RomRoc, in the thread I reference above, there are a few papers that you might be interested in.

Hi, I have developed two U-Nets for pancreas segmentation in CT scans. One uses PNGs exported from the slices, and the other uses the information directly from the DICOM files. My goal is to build a 3D one and compare their accuracy and applicability for pancreatic cancer diagnosis. In the coming days I will share the notebook with some tricks I learned to make it work on Windows.
I’m very happy to see such a community working in AI and medicine.

Thanks a lot @PegasusWithoutWinds! I just started watching that thread to receive notifications; it’s very interesting.
Unfortunately, I didn’t find any comparison between human and ML detection performance on medical images there.
It would be important for the slides in my proposal; unfortunately, people outside the ML field need results to understand the great benefits we can obtain.

Check out the Stanford MURA competition, and also Stanford’s work on skin cancer classification.

Hi, I have read an article about a neural network that diagnosed pancreatic cancer early in 20% more cases than a human. I would like to help you with your investigation, including some practical work; I have learned that practice is a very good way to learn new things.

I have a general question about medical imaging. Many of the images are in grayscale. I have been looking at transfer learning techniques, e.g., Mask R-CNN, which was trained with color images. Is that why people are using U-Net? Has anyone looked at turning a grayscale image into a “color” image and then trying something like Mask R-CNN on it? What other approaches are there for working with grayscale images? Thanks.

You can adapt the Mask R-CNN model to take a one-channel image as input, which would probably mean retraining the whole model. Or you could copy the image two times and stack the copies with the original so that you have three channels, and then fine-tune the resulting network. The right option depends on your task, the amount of available training data, and observed performance. One of these options (i.e., using one channel as input or stacking the images) is how you work with grayscale images in general.
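The stacking option can be sketched in a couple of lines of NumPy (the helper name is mine):

```python
import numpy as np

def gray_to_3channel(img):
    """Replicate a single-channel (H, W) image into (H, W, 3) so it can
    be fed to an RGB-pretrained network such as Mask R-CNN. All three
    channels are identical copies of the grayscale input."""
    return np.repeat(img[..., None], 3, axis=-1)
```

The pretrained first-layer filters then see the same signal on every channel, which is usually a reasonable starting point for fine-tuning.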

People use the U-Net for a variety of reasons, but one such reason is that it performs sufficiently well for the task they are trying to address, so they stop there and do not explore more architectures. Searching the space of architectures is time-consuming, since you have to implement each model and optimize its hyperparameters to find out whether it is worthwhile. While training models has become faster with better practices (e.g., the ones taught here), this is still a very computationally expensive and time-intensive process.


ok that is super clear and useful!

I’m interested in segmenting tumors in ultrasound images, and I have around 500 images to work with. Has anyone looked at turning images from grayscale to color for this problem? I thought this could be another way around the issue. Or potentially converting the COCO dataset to grayscale and then training Mask R-CNN on that, so transfer learning is more applicable, based on features learned from grayscale images?

Edit: I’d add that I really believe in the power of transfer learning, as it has worked well for me in non-medical domains, so I could be a bit biased towards this approach. I’m open to doing things differently, but as you say, there are a bunch of architectures to explore, and this feels like a good way to go.

Hi! I am not an expert, but I think you should follow the approach of the Data Science Bowl 2018:
You will need to create masks for the tumors and non-tumors in the ultrasound images and train the Mask R-CNN. I do not think that the weights pretrained on the COCO dataset bring much. But as I said, I am not an expert.


I see that you have some experience working with MRI datasets. I am working with the BraTS dataset for a classification problem, and I have several doubts about how to evaluate my model. Some months ago I had access to 63 patients, so I used 2D slices and got good performance on the validation and test sets. Now I have access to 108 patients (45 more), but I cannot reproduce the same good performance with this bigger dataset. What worries me most is that if I evaluate the well-performing model on a second test set drawn from the rest of the same dataset (the 45 additional patients), with the same preprocessing, the results vary from subset to subset. The most logical answer would be overfitting, but the learning curves do not seem to indicate that.

I expected worse performance on a test set that came from a different database (different statistics), but not from one that belonged to the same dataset. Why does it not generalize within the same dataset? Do you have any idea why this is happening, or any comments or suggestions?

I’m confused about what dataset you are working with. You say BraTS, but then mention 63 and 108 patient image datasets. Are those parts of BraTS?

The reason I’m asking this is that if the images do not all come from the same site and scanner, you may have poor performance when incorporating both datasets into training (without additional preprocessing).
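One common additional preprocessing step to reduce intensity differences across sites and scanners is z-score normalization of each volume. A minimal sketch (the function name is my own, and whether it helps is dataset-specific):

```python
import numpy as np

def zscore_normalize(vol, mask=None):
    """Z-score intensity normalization for an MR volume: subtract the
    mean and divide by the standard deviation, computed over a (brain)
    mask when one is given. This puts volumes from different scanners
    on a roughly comparable intensity scale."""
    vox = vol[mask] if mask is not None else vol
    return (vol - vox.mean()) / vox.std()
```

More involved harmonization methods exist (e.g., histogram-based normalization), but per-volume z-scoring is a cheap first thing to try when mixing multi-site data.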

I understand the confusion. To be more specific, I am using this database:
The images already have some preprocessing applied, such as reorientation, coregistration to a template, resampling, skull stripping, denoising, and intensity normalization, as I found out in the associated paper, and they are used in the BraTS challenge. They come from 5 different centers. I did not do any additional preprocessing because it was already done. In total there are 108 patients now, but I started training with a subset of 63 because the rest were not available at that moment.