[AI + Impact] Detecting Breast Cancer in Digital Mammograms

Slides for my talk:


Comments welcome

And many thanks to @sravya8 for organizing


I tried DDSM too, but after initially being really happy with my results I realized there’s a huge amount of data leakage.

They used different scanners for different groups of the data, and the distribution of cancer/normal differs across sites/scanners. So the models could cheat by detecting which scanner was used.

Film mammography is also very different from digital mammography (for example, you can easily mask the digital mammograms by selecting pixels > 0).

It looks like the Cox lab at Harvard is pretty close to solving things (they’re number 1 on Leaderboard 2).


Great feedback David, thank you

I wasn’t aware of the scanner impact. Excellent information. Perhaps the confounding could be reduced by subsampling (i.e., selecting similar proportions of cancer/normal for each scanner). Of course, if the distributional differences are drastic, there isn’t much one can do. Maybe you already tried this.
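
A rough sketch of that subsampling idea with pandas (the `scanner` and `cancer` column names are hypothetical; adjust to whatever the actual metadata uses):

```python
import pandas as pd

def balance_per_scanner(df, label_col="cancer", group_col="scanner",
                        target_frac=0.5, seed=0):
    """Subsample each scanner's cases to the same positive fraction.

    Keeps all positives per scanner and randomly draws just enough
    negatives, so a model can no longer infer the label from
    scanner-specific artifacts. Column names are assumptions.
    """
    parts = []
    for _, g in df.groupby(group_col):
        pos = g[g[label_col] == 1]
        neg = g[g[label_col] == 0]
        n_neg = int(len(pos) * (1 - target_frac) / target_frac)
        parts.append(pos)
        parts.append(neg.sample(min(n_neg, len(neg)), random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Toy metadata: scanner A is mostly normals, scanner B is mostly cancers.
df = pd.DataFrame({
    "scanner": ["A"] * 12 + ["B"] * 7,
    "cancer":  [1, 1] + [0] * 10 + [1, 1, 1] + [0] * 4,
})
bal = balance_per_scanner(df)
```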

Could you elaborate on the difference between “screening mammography” and “digital mammography”?

That was a typo; I meant film vs. digital.

One thing that I should have done but never got around to was converting each film to optical density (there are graphs and formulas for how to do it on the DDSM website). Also, a segmentation approach is possible with the DDSM dataset (many masses and calcifications are outlined in the metadata in a custom format).
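
For the optical-density conversion, the DDSM website publishes per-scanner calibration formulas; the coefficients below are made up purely for illustration and the linear form is only one of the published shapes, so treat this as a template rather than the actual DDSM calibration:

```python
import numpy as np

# Hypothetical (intercept, slope) calibration per scanner -- NOT the real
# DDSM coefficients; look those up on the DDSM website per scanner model.
CALIBRATION = {"scanner_a": (3.0, -0.001)}

def to_optical_density(gray, scanner):
    """Map raw grayscale values to optical density via a linear calibration.

    Higher gray values (brighter film) correspond to lower optical density,
    hence the negative slope. OD is clipped to be non-negative.
    """
    a, b = CALIBRATION[scanner]
    od = a + b * np.asarray(gray, dtype=np.float64)
    return np.clip(od, 0.0, None)

od = to_optical_density(np.array([0, 1000, 4000]), "scanner_a")
```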

But beyond that, I agree with making sure the samples per scanner are similar for positive and negative cases.

Segmentation is definitely near the top of our TODO list

All clear now, thanks


In your comments after the talk, you mentioned visual attention models. It definitely seems like that would be helpful in cases like this and the DSB where the image is large but the portion required to make the diagnosis might be small.

Will we be going over these at all during the course?

Great question @davecg. I wasn’t sure if that was something already covered
in Part 1, which I did not attend.

Maybe this paper is applicable (see Fig. 5):


Came across a nice video describing attentional models and soft vs. hard attention:

I also wonder if the secret sauce that the current leaderboard leader is using is related to their previous work on saliency maps:

That might help focus an attentional model. (They also clearly have eye trackers, so they may have held a few breast radiologists hostage to create saliency maps specific to mammography.)

I haven’t seen really effective end-to-end visual attention models for large images as yet, unfortunately. The state of the art is still, as far as I’m aware, to simply do some kind of localization in a first step, and then manually create cropped regions to pass to a 2nd model after that. E.g. see http://blog.kaggle.com/2016/01/29/noaa-right-whale-recognition-winners-interview-1st-place-deepsense-io/ . Although simple, this is an effective and necessary technique for large images with small relevant features (as in most medical imaging).
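
The stage-2 cropping step of that localize-then-classify pipeline might be sketched like this (the stage-1 detector that proposes the centers is assumed to exist separately; this only shows extracting a fixed-size full-resolution patch around a proposal):

```python
import numpy as np

def crop_around(img, center, size):
    """Extract a fixed-size patch around a stage-1 proposal center.

    Stage 1 runs a detector on a downscaled image and proposes (y, x)
    centers; stage 2 then classifies full-resolution crops around each
    proposal. Patches that run off the edge are zero-padded.
    """
    y, x = center
    h = size // 2
    y0, x0 = max(0, y - h), max(0, x - h)
    patch = img[y0:y0 + size, x0:x0 + size]
    pad_y, pad_x = size - patch.shape[0], size - patch.shape[1]
    return np.pad(patch, ((0, pad_y), (0, pad_x)))

img = np.arange(100).reshape(10, 10)
corner_patch = crop_around(img, (9, 9), 4)   # partly off the edge
inner_patch = crop_around(img, (5, 5), 4)    # fully inside
```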

Spatial transformer networks seem like they should be a great solution, but there haven’t been many real world successes.

Maybe this approach from a couple of weeks ago will turn out to be successful: https://arxiv.org/abs/1703.06211


Very helpful comments

@davecg, out of curiosity: what makes you say the current leader on the DREAM
Digital Mammography challenge is the Cox lab at Harvard? I don’t see any mention
of the challenge on their Web site. And the teams on this leaderboard seem to be


Great, thank you! But how did you get the link :slight_smile:

I’m sneaky. :wink:

There is a list of registered teams on the wiki, and you can click on any of them to see who is on the team. I probably could have figured it out via the API too.



Good sleuthing, thanks. I’ve been looking for this and couldn’t find it.

Hello there,
I also tried this competition, cutting into my sleep hours …
Radiologists have about 90% specificity at 80% sensitivity for detection. More than 90% of screening exams in real life have a previous comparison exam. Radiologists’ AUROCs are a lot higher with a previous exam than without.

The current leader has about 80% specificity at 80% sensitivity in sub-challenges 1 and 2. Even if not better than a human, this result could easily be used to triage screening exams and prioritize their interpretation by a radiologist. If the results are truly open source as specified (code + weights), I’ll try to implement that kind of triage in my own department.

Like I said, as a radiologist I can easily improve my own ROC curve by comparing the images with the previous exam. Surprisingly, I don’t see much difference in the best AUROC from sub-challenge 2 compared to sub-challenge 1. Another important factor for improving AUROC is spatial 3D cross-correlation of a suspected feature across views (MLO and CC). A spatially correlated suspected feature present in both views is a lot more worrisome than one seen in only one view.

Consequently, I think a proper comparison implementation with optimal resolution, decent segmentation, and preprocessed normalisation using 2 channels (current image, last image) or 4 channels (current MLO, last MLO, current CC, last CC) could potentially be as good as or better than a radiologist. Inattention/variance is the weakness of almost any human task. As radiologists do, this comparison technique could also be used without a previous exam, simply by comparing the current exam with the contralateral current exam using a separately trained network (sensitivity to comparison changes is quite different).
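
Building the 4-channel input described above could be as simple as stacking the views along the channel axis. This sketch assumes all four views have already been registered and resized to a common shape, which in practice is the hard part:

```python
import numpy as np

def stack_exam_channels(cur_mlo, prior_mlo, cur_cc, prior_cc):
    """Stack current and prior MLO/CC views into an (H, W, 4) input.

    Assumes the four views are already aligned and share one shape;
    a real pipeline would register the prior to the current exam first.
    """
    return np.stack([cur_mlo, prior_mlo, cur_cc, prior_cc], axis=-1)

# Toy 4x3 "images" standing in for registered mammogram views.
views = [np.full((4, 3), v, dtype=np.float32) for v in (1, 2, 3, 4)]
x = stack_exam_channels(*views)
```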

I tried a 3-channel variant of this implementation, training from scratch, but I was stuck with bad overfitting since there are fewer than 1000 cancers in the dataset. Fine-tuning a pretrained model is interesting, but the lack of resolution probably explains the low pAUROC and low specificity at 90% sensitivity, even with a decent AUROC. I don’t really know how to use a low-res pretrained network with a higher-resolution image without resizing the image. I read that the weights/features of the first conv layers can usually be reused at higher resolution, but I don’t know how to do this in Caffe or Keras. Any ideas to help, if the dataset can still be used after the competition?

That competition was a great way to learn.


This is really interesting information - thanks for sharing! You may want to create a 2-stage network, where the 1st stage looks for potentially interesting regions in the whole (scaled-down) image, and the 2nd then zooms in and looks at each of those regions. You could then use a lower-resolution pretrained network for both stages.


You can definitely use a pretrained network even on the full-resolution image. I actually made the decision to resize after the pretrained convolutional layers instead of before (essentially using ROI pooling over the whole image to get a fixed size).
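
The “resize after the conv layers” trick works because convolutions are resolution-agnostic: only the dense head needs a fixed input size. As a toy stand-in for the ROI pooling mentioned above, global average pooling collapses a feature map of any spatial size to a fixed-length vector:

```python
import numpy as np

def global_avg_pool(feature_map):
    """Collapse an (H, W, C) conv feature map to a fixed C-length vector.

    A pretrained backbone can run fully convolutionally on the
    full-resolution mammogram; pooling afterwards yields a fixed-size
    input for the dense head regardless of the image resolution.
    """
    return feature_map.mean(axis=(0, 1))

# Feature maps from two different input resolutions, same channel count.
small = np.ones((8, 8, 512))
big = np.ones((64, 48, 512))
v_small, v_big = global_avg_pool(small), global_avg_pool(big)
```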

I initially wanted to use a Siamese network similar to what you were suggesting (MLO + CC) but ended up doing something simpler.

Indeed very interesting @alexandrecc, thank you

One question: is there a reference for the 90% sensitivity/90% specificity human expert performance on digital mammograms?

I am giving a talk at USF about my participation in the challenge and would like to reference that. I can say “personal communication”, but a paper would be more convincing. No disrespect :sunglasses:

Thank you Jeremy for the information.

In my experience in this case, higher-resolution analysis improves specificity by defining more precisely the contour (irregular, circumscribed, concave vs. convex) or the content (fat, tiny lines, calcifications) of a lesion (e.g., mass, asymmetry, distortion) detected from a lower-resolution image. Your 2-stage solution can potentially improve this kind of specificity.

But higher resolution is also needed to improve sensitivity for very small irregular masses (3-4 mm) or tiny calcification clusters (0.1 to 0.5 mm) that can only be seen by humans at high resolution. These small high-resolution features are probably completely masked/removed in a low-res resized image, even for machine perception. By comparison, a bone fracture is usually a 3D plane (or a line in 2D projection); that kind of tiny line can probably keep its spatial line feature at low resolution, so your solution would apply. But a cancer starts from a dot that will probably disappear into a noisy background with resizing. The dot could survive at low resolution if the signal-to-noise ratio of the high-res image is very high (e.g., MRI). The 2-stage solution will probably miss these small features during training if the position of the cancer in the image is not known.

I really hope the organising team of this competition will keep the dataset available after the competition to continue open-science improvement.


Two papers at the top of the page have what you’re looking for…

Actual numbers are 90% specificity at 80% sensitivity, which corresponds to 1/10 patients being recalled for additional testing. The 2006 paper shows ranges for radiologist performance.