New open source medical image database

kvinicki · January 22, 2018, 5:33pm

Hello everyone. My name is Krunoslav Vinicki and I am studying veterinary medicine at the University of Zagreb, Croatia. I have started building an open source medical image database - the first of its kind in veterinary medicine that I am aware of. I have already taken over 2000 pictures of cat reticulocytes, and I got about 96% accuracy with the fastai library.

The problem is that I started coding literally 4 months ago and I have never started an open source project before. I can build the database with some fellow students and professors, but we will definitely need some help with open sourcing it and with making a better model. For example, I cropped the images and used just a regular CNN from lesson 1, but for a usable application we probably need a single shot multibox detector.

Let me first explain what reticulocytes are and why this is an important problem:
Reticulocytes are immature erythrocytes, and it is perfectly normal to find them in blood. After they are released from the bone marrow, they stay in the blood for about 24h (in humans), after which they mature into erythrocytes (red blood cells). So why are they so important? Well, if we have an anemic cat, we want to know whether this cat is producing new erythrocytes. If we find a lot of reticulocytes, then we know that the answer is yes. But if we do not find any reticulocytes, we have a big problem - this cat’s bone marrow is not producing any new red blood cells.
In other words, “Identification of reticulocytes allows assessment of whether bone marrow is responding to an anemia (given sufficient time) by increasing red blood cell (RBC) production.”
As I already said, in humans reticulocytes stay in the blood for 24h, after which they turn into erythrocytes. In cats it is a little more complicated: they have two types of reticulocytes, aggregate and punctate. Aggregate reticulocytes stay in the blood for 12-24h and then turn into punctate reticulocytes, which can stay for up to a week until they become mature red blood cells. For that reason we count only aggregate reticulocytes, so humans (and machines) need to differentiate between the two types. And here comes the main problem: this can be very subjective, and human error can in some cases go up to 30% - a huge problem. And that’s why 96% accuracy is really good.

aggregate reticulocytes
000061a

punctate reticulocytes
001552a

Sometimes the difference is not so clear

I think that machine learning can be a game changer in veterinary medicine. First of all, unlike in human medicine, we have more than one species - veterinarians can’t know everything about every animal. Let’s take the white blood cell (WBC) count as an example. In human medicine it is quite easy - it is done by laser flow cytometry. But this automated method can only be used on mammals. What about birds and reptiles? For them only the manual way is possible, but it is not done in practice because veterinarians, again, can’t differentiate the WBCs of every species.
Also, veterinary medicine is in some regards very similar to human medicine in third world countries: a lot of pet owners (at least in Croatia) are not prepared to pay a lot of money for even the most basic laboratory tests (and the same laboratory tests are usually more expensive than in human medicine).

I still haven’t uploaded the database. I suppose it is best to upload it to my faculty’s web page. But I can send it by email if anyone wants to play with it.

binarypoet · January 22, 2018, 7:06pm

Awesome initiative! I would be happy to help you with any issues in trying to make this a public dataset. I am not sure what that really means, but offering technical support anyway.

kvinicki · January 22, 2018, 7:18pm

Tnx, I really appreciate this

I am not sure what my first steps should be either. I suppose that I should first start with GitHub and then maybe create a webpage and upload it there.

binarypoet · January 22, 2018, 7:32pm

It’s a good question. Is there really a standard for datasets? There are at least a couple of widely adopted standards for some things, but I am not sure if there is a consensus. Do you have an idea of the total size? Do you know how to compress it?
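On the compression question, here is a minimal sketch of packing an image directory into a .tgz with Python's standard `tarfile` module. The function name `make_tgz` and the directory layout are only assumptions for illustration. Note that if the images are already in a compressed format such as JPEG or PNG, gzip will probably not shrink them much; the archive is mainly a convenient single file to host.

```python
import tarfile
from pathlib import Path


def make_tgz(image_dir: str, archive_path: str) -> int:
    """Pack every file under image_dir into a gzip-compressed tar archive.

    Returns the number of files added. (Hypothetical helper, not part of
    any library.)
    """
    root = Path(image_dir)
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Keep the top-level directory name inside the archive so it
                # unpacks into a single folder instead of loose files.
                arcname = (Path(root.name) / path.relative_to(root)).as_posix()
                tar.add(path, arcname=arcname)
                count += 1
    return count
```

Usage would be something like `make_tgz("images", "reticulocytes.tgz")`, where both names are placeholders.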

kvinicki · January 22, 2018, 8:04pm

Well, the database will definitely grow with time. Right now the cropped images take up only 14MB, but for SSD (single shot detector) we need the uncropped images, which are much larger (1.7GB unresized).

binarypoet · January 22, 2018, 11:13pm

How are the images labeled? I am wondering if the simple CSV format from the planet challenge (and the one I used for the fashion stuff) is sufficient. If so, it is just a single directory of images named by image ID, and a CSV that maps each ID to a space-separated list of labels. And if that’s the case, just hosting it somewhere as a .tgz, .bz2, or .zip should suffice.

That being said, is there any ‘standard’ for ‘meta-data’, i.e. information about the dataset presented in some standard format? If there isn’t a standard, does anyone know of one being worked on? I know PMML is a standard for describing a model, but I am not sure how widely used it is, or whether it is supported as a cross-library format, so that a model could be used in, say, TensorFlow and then in DL4J. Anyone have experience with this?
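To make the CSV idea concrete, here is a rough sketch of reading such a file with Python's standard `csv` module. The column names `image_name` and `tags` and the helper `read_labels` are assumptions for illustration; the two sample rows reuse the image IDs and class names from the first post.

```python
import csv
import io


def read_labels(csv_file):
    """Map each image ID to its list of labels.

    Assumes a header row with the columns 'image_name' and 'tags', where
    'tags' holds a space-separated list of labels (column names are an
    assumption, not a fixed standard).
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row["image_name"]] = row["tags"].split()
    return labels


# In-memory stand-in for a labels file on disk.
sample = io.StringIO(
    "image_name,tags\n"
    "000061a,aggregate\n"
    "001552a,punctate\n"
)
labels = read_labels(sample)
print(labels)  # {'000061a': ['aggregate'], '001552a': ['punctate']}
```

With a real file you would pass `open("labels.csv", newline="")` instead of the `StringIO` object.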

YJP · January 22, 2018, 11:40pm

Hi Krunoslav,

I truly appreciate this effort!

My family cares for many animals on the street (mainly driving around the areas near the house and small farm and providing food and water to cats and dogs twice a day). Because of tough conditions (e.g. no food, many cars, harsh weather), many street animals passed away even though we brought them to veterinarians when they got unwell. We got a lot of “I am not sure” answers from vets even after x-rays/scans; then we had to follow somewhat ‘experimental’ instructions to see whether a particular treatment works (it works about 50:50 - we buried many). I am, therefore, looking forward to progress in this field.

In the meantime, if you need any help, please count me in. Once again, thanks a lot.

kvinicki · January 23, 2018, 7:09am

I read in this post (About bounding box localization) that I need to label the images with labelImg, which will give me .xml files. I will try this today.

EDIT: we can just keep it simple and provide two zip files: one with the images and another with the .xml files.
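For anyone who wants to read those .xml files: labelImg saves annotations in Pascal VOC format by default, so a sketch like the following, using only the standard library, should be able to pull out the boxes. The sample annotation, the file name `slide_0001.jpg`, and the helper `read_boxes` are made up for illustration; the class names follow the two reticulocyte types described above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the Pascal VOC layout that labelImg writes.
SAMPLE_XML = """<annotation>
  <filename>slide_0001.jpg</filename>
  <object>
    <name>aggregate</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>60</ymax></bndbox>
  </object>
  <object>
    <name>punctate</name>
    <bndbox><xmin>70</xmin><ymin>15</ymin><xmax>110</xmax><ymax>55</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Coordinates are normally integers; go through float to be tolerant.
        coords = tuple(
            int(float(bndbox.findtext(key)))
            for key in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return root.findtext("filename"), boxes


filename, boxes = read_boxes(SAMPLE_XML)
# Count cells per class, e.g. to report only the aggregate reticulocytes.
counts = Counter(label for label, _ in boxes)
print(filename, boxes, counts["aggregate"])
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(xml_text)`.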

kvinicki · January 23, 2018, 7:31am

Tnx a lot

I am very familiar with such stories. Unfortunately, specialization is not so common in veterinary medicine, and the same person has to be a radiologist, surgeon, etc. - and this is simply not possible.

LaPatel · April 1, 2018, 3:26am

@kvinicki You seem to be a game changer in the making. I wish you all the best for your noble and scientific endeavor. If you need some web space, I can arrange for it on our website.

https://www.kaggle.com/datasets is a nice option worth considering.
The https://github.com/Reuben-Thorpe/open.data model may help you with using GitHub for your dataset.
Do get your dataset added to https://github.com/awesomedata/awesome-public-datasets to let others know about it.

kvinicki · April 1, 2018, 7:21am

Hi LaPatel,

Thanks, especially for the second link. I will probably use this model for linking vet datasets.

The dataset was uploaded to Kaggle a couple of days ago: https://www.kaggle.com/tentotheminus9/feline-reticulocytes. The reason it took so long for one dataset is that I was expected to write a paper, and that took some time. But after this paper, my faculty wants to be one of the pioneers in deep learning-based medical imaging in veterinary medicine, so everything will go faster now.

Roszko · December 17, 2020, 1:30pm

I have a similar product in mind - one that analyzes human blood cells, draws a rectangle around each recognized cell, and counts them. Did you have any success in the field?