Bone X-Ray Deep Learning Dataset and Competition

Stanford just publicly released a large bone X-ray dataset (MURA). There is also a related competition to determine whether a study is normal or abnormal. I hope some of you will be interested in participating!


Do you think the fact that many of the positively labeled images show hardware (e.g. screws) is going to be a problem?


That was a problem with the CheXray dataset: when we tried to classify a pathological sign (e.g. pneumothorax), the model converged on more easily detected but statistically correlated features (e.g. a chest tube).

It could also be a serious problem if we try to classify fractures specifically. The model could converge on the therapeutic hardware instead of actually identifying the fracture.

But the proposed metric is based on a normal-vs-abnormal result, which can be useful for triage. In that case, if the model extracts meaningful features for a normal representation, it is less important to know why a study is abnormal (fracture vs. orthopedic hardware). But if the model is trained almost exclusively on abnormal cases that contain orthopedic hardware, that bias will be baked into the model and it will not be able to detect initial fractures. The most important and prevalent use case for these types of bone X-rays is to detect fractures on the first, initial radiograph. For a potentially useful clinical application, it would consequently be much more useful to train the model on normal cases vs. initial, untreated fracture cases (without orthopedic hardware).


One needs to take into consideration that a study consists of several images, and that you evaluate a study and not individual images. A study can have 1-n images.

"Your program should output binary predictions for every study in the input file (not every image)." — from the submission instructions
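To make the study-level requirement concrete, here is a minimal sketch of collapsing per-image probabilities into one binary prediction per study, using the simple mean the paper reports (the function name and threshold are my own assumptions):

```python
import numpy as np

def predict_study(image_probs, threshold=0.5):
    """Aggregate per-image abnormality probabilities into a single
    binary prediction for the whole study (a study has 1-n images).
    Simple mean aggregation, as described in the paper."""
    return int(np.mean(image_probs) > threshold)

# A study with three views: two clearly abnormal, one borderline
print(predict_study([0.9, 0.8, 0.4]))  # -> 1
```

Any other pooling (max, weighted mean, etc.) slots into the same place; only the aggregation line changes.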


It seems that, according to the paper, hardware is considered an abnormality:

2.3 Abnormality Analysis
To investigate the types of abnormalities present in the dataset, we reviewed the radiologist reports
to manually label 100 abnormal studies with the abnormality finding: 53 studies were labeled with
fractures, 48 with hardware, 35 with degenerative joint diseases, and 29 with other miscellaneous
abnormalities, including lesions and subluxations.


That is an extremely important point that you highlight, @melonkernel.

The paper uses a simple average of the predictions from the different images. This is an easy, but likely underfitting, ensemble approach to the problem.

Hint: any radiologist with a minimum of experience definitely correlates each image's "features" across views in the spatial domain to improve their own diagnostic accuracy.

Beyond training a very good 2D deep CNN classifier, this potential 3D correlation is, IMHO, needed to create the most accurate model. Applying a random forest to the high-level semantic features extracted from all images is interesting, but would likely miss most of the fine-grained spatial correlation. A random forest combining the mid-level and high-level features of all the images could be worth a try, but the number of features would be incredibly high, with a very probable overfitting problem. Capsule networks probably hide a gem somewhere for this 3D correlation, since they encode the spatial relationships between features; I have been thinking about this for months, but unfortunately without any practical, mature solution…

@Judywawira @rikiya: still interested in creating a team?


Thanks @alexandrecc, I’m very interested in this :wink:

As you mentioned, handling multi-view images will be one of the key challenges in this competition. A weighted average based on output probabilities would be one idea to try: something like putting higher weights on probabilities close to 1 (and/or 0), as a way of taking uncertainty into account. But unfortunately I don't have a neat solution for this.
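One way to sketch that weighting idea: score each image by its distance from 0.5, so confident predictions (near 0 or 1) dominate uncertain ones. This is just an illustration of the scheme described above, not a tested recipe; the function name and the epsilon are my own assumptions:

```python
import numpy as np

def weighted_study_prob(probs, eps=1e-6):
    """Confidence-weighted mean of per-image probabilities:
    each image's weight is its distance from 0.5, so images the
    model is sure about (near 0 or 1) count more than uncertain ones."""
    probs = np.asarray(probs, dtype=float)
    weights = np.abs(probs - 0.5) + eps  # eps avoids an all-zero weight sum
    return float(np.sum(weights * probs) / np.sum(weights))

# One confident abnormal view outweighs two borderline views
print(weighted_study_prob([0.95, 0.55, 0.45]))
```

Compared with a plain mean (0.65 here), the weighted estimate is pulled toward the confident 0.95 prediction.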


Hi @alexandrecc and team,

I’d be interested in working on this problem with a team like yours. Are you looking for another helping hand for this challenge?


I would be interested in forming a team.
Although this is the first time I've tried working with X-ray images, it resonates with me very well. (Using AI to help people is the reason I got into this.)

I have been thinking about a couple of options; one might perhaps combine them in the end as an ensemble of sorts.

Option 1
Since X-rays are grayscale, you would not need an RGB tensor, but I am thinking one could combine all the views (images) into one tensor. One problem is that there are different numbers of images per study. Also, some of the X-rays were white while others were black, so perhaps one would need to normalize these by inverting them.

Option 2
Many-to-one classification, with an RNN or equivalent.

Option 3
Averaging the results of each image as in the paper.

Option 4
Adding embeddings for the extremity type (wrist, shoulder, etc.).
Although this might be picked up by the network anyway, so I am not sure it is needed.

Option 5.
Averaging 1-4 to give final result
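A minimal sketch of the polarity normalization mentioned in Option 1: flip white-background images so every input shares the same polarity. The mean-intensity heuristic and the function name are my own assumptions, not anything from the dataset documentation:

```python
import numpy as np

def normalize_polarity(img):
    """If the image looks inverted (mostly white background),
    flip it so all X-rays share a dark-background polarity.
    img: 2D uint8 array with values in 0..255."""
    if img.mean() > 127:      # mostly white -> assume inverted exposure
        img = 255 - img
    return img

dark = np.full((4, 4), 30, dtype=np.uint8)    # dark background, left as-is
light = np.full((4, 4), 220, dtype=np.uint8)  # white background, gets inverted
print(normalize_polarity(light).mean())  # -> 35.0
```

A global mean threshold is crude; checking only border pixels (where the background lives) would probably be more robust.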

Tell me if this doesn't make sense.

@alexandrecc, with 3D correlation, do you mean that if there is a probable abnormality in, let's say, the index finger's middle joint on image 1, and image 2 also shows a probable abnormality in the same place (from a different angle), the model would give that abnormality higher importance? Or do you mean that, as in one's mind, you build 3D "layers" from the 2D images?


Still interested, @alexandrecc; looking forward to meeting in person on Friday.

Is there anyone working on this problem?

Yes, we currently have a relatively large group working on this problem. @jeremy

How can I join this group, @alexandrecc?


Hello @alexandrecc. Did you download the MURA dataset? The online form is not working. How to get the dataset?

Hi @pierreguillou ,

Yes, I have had the dataset since last year. I guess you can contact the Stanford team if the online form isn't working. The research agreement doesn't allow transferring their dataset between individuals.

Thanks Alexandre. I sent an email to the Stanford team and I'm waiting for their answer.

[ EDIT ] : I received the email from ML Stanford and downloaded the MURA database :slight_smile:

Hi. I just published my medium post + jupyter notebook about the MURA competition.

My goal was to assess how far the standard fastai method could go in the search for better accuracy/kappa in the radiology domain, without any knowledge of radiology.

However, to go beyond a kappa of 0.642 (my score with the standard fastai method), I think I need a more complete understanding of radiology and more DL experiments.
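For anyone new to the metric: Cohen's kappa measures agreement with the labels after correcting for chance agreement, which is why it is stricter than plain accuracy. A self-contained sketch for the binary case (the labels below are made up for illustration):

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary 0/1 labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n       # observed agreement
    p_true = sum(y_true) / n                                   # label positive rate
    p_pred = sum(y_pred) / n                                   # prediction positive rate
    pe = p_true * p_pred + (1 - p_true) * (1 - p_pred)         # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical study-level predictions vs. ground truth (1 = abnormal)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohen_kappa(y_true, y_pred))  # -> 0.5 (accuracy here is 0.75)
```

`sklearn.metrics.cohen_kappa_score` computes the same quantity (and handles more than two classes).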

Feedback welcome!


Excellent work. I would be very interested in experts sharing some advanced techniques and optimizations for your notebook.


Part 2 of my journey in Deep Learning for medical images with the fastai framework on the MURA dataset.

I got a better kappa score but I need radiologists to go even further (and fastai specialists too :slight_smile: ).
Please, feel free to use (and improve) my notebook (ensemble models, squeezenet models, etc.).


Thank you @matejthetree. I just posted the part 2 of my research on the MURA dataset.
Feedback welcome to go further :slight_smile:
