New high-quality chest X-ray dataset on lung diseases

Have you guys seen this new high-quality chest X-ray dataset? It includes 100k X-ray images from 30k patients with several types of lung disease, annotated with bounding boxes, disease/condition labels, and some additional metadata. A quick glance through the dataset suggests that it is very clean, so it should be pleasant to work with, and it is also practical and less demanding to handle than CT scans.
I haven’t trained any models yet, as my GPU is 100% busy with Statoil’s ship vs. iceberg competition, but I’ll certainly try it when I’m out of ideas for the current competition.

Short description:

The dataset itself:
https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345

This is indeed a very cool dataset, but I wouldn’t necessarily call it “clean”.

The annotations are derived from radiology reports in an automated manner, which means that some of the assigned labels will be incorrect.

For example, almost every report for a normal chest radiograph will contain some variation of the phrase “no pneumothorax”, and it may be worded in a way that slips past the method they use to handle negation. I would expect a good percentage of that category in particular to be mislabeled.
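
To make the negation problem concrete, here is a toy sketch in Python. It is not the labeling method used for this dataset; the rule, the function name `naive_label`, and the example sentences are all made up for illustration. It treats a mention as negated only when "no" appears within a couple of words before "pneumothorax", so other phrasings get through as positives.

```python
import re

# Hypothetical toy rule (an assumption, not the dataset's real pipeline):
# "no", optionally followed by up to two words, then "pneumothorax".
NEGATED = re.compile(r"\bno\s+(?:\w+\s+){0,2}?pneumothorax\b", re.IGNORECASE)
MENTION = re.compile(r"\bpneumothorax\b", re.IGNORECASE)


def naive_label(report: str) -> bool:
    """Return True if this toy labeler would tag the report as pneumothorax."""
    mentioned = MENTION.search(report) is not None
    negated = NEGATED.search(report) is not None
    return mentioned and not negated


# Invented example sentences, all of which describe a normal study.
EXAMPLES = [
    "No pneumothorax.",
    "No evidence of pneumothorax.",
    "Pneumothorax is not seen.",
    "Lungs are clear without pneumothorax.",
]

for sentence in EXAMPLES:
    print(f"{naive_label(sentence)!s:5}  {sentence}")
```

With this rule the first two sentences are correctly left unlabeled, while the last two are tagged as pneumothorax even though they describe normal studies. A real pipeline would use a far richer set of negation rules, but the same kind of miss can still happen.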

There’s also the issue of “infiltration” as a category - that’s a vague term that many chest radiologists avoid.

Also check out the Andrew Ng paper on the same dataset:

https://stanfordmlgroup.github.io/projects/chexnet/

Thanks, David, for the industry insight!
I’ll have a look at the pneumothorax category to see if there are any obvious slips. I assume that infiltration here is a catch-all category for findings that didn’t fit into one of the other categories, so it is vague on purpose.
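
A rough sketch of how that check could start, using only the Python standard library. The file name `Data_Entry_2017.csv`, the column names `Image Index` and `Finding Labels`, and the `|` separator are assumptions about the metadata file, so adjust them to whatever the download actually contains. The helper names are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read the metadata CSV into a list of dicts, one per image."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def split_labels(row, label_column="Finding Labels", sep="|"):
    """Split the multi-label field of one row into clean label strings."""
    return [part.strip() for part in row[label_column].split(sep) if part.strip()]


def count_findings(rows, label_column="Finding Labels", sep="|"):
    """Count how many images carry each label."""
    counts = Counter()
    for row in rows:
        counts.update(split_labels(row, label_column, sep))
    return counts


def images_with(rows, finding, image_column="Image Index",
                label_column="Finding Labels", sep="|"):
    """List the image file names tagged with the given finding."""
    return [
        row[image_column]
        for row in rows
        if finding in split_labels(row, label_column, sep)
    ]


# Example usage (assumed file name, not run here):
# rows = load_rows("Data_Entry_2017.csv")
# print(count_findings(rows).most_common())
# print(images_with(rows, "Pneumothorax")[:20])
```

Pulling a small random sample from the pneumothorax list and viewing those images by eye would be a cheap first test of how noisy that label is.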

It was included because many radiologists still include it in their reports, although others vehemently object to it. :)

Standardization of reporting is a lot like herding cats.
