Fine-Grained Medical Dataset Curation Problems

The Problem
I am dealing with an interesting problem in the medical imaging space: what ultimately might be a fine-grained, multi-label, multi-class classification task for which I do not have enough quality data.

My images are localised from a larger photo by an object detector, so each target image is mostly the desired surface, with small amounts of unrelated surrounding context around the object.

Before embarking on the costly process of further data gathering and curation, it's important to understand what is possible and what the correct approach is. I have decided to tackle a smaller binary classification subset first and see how that pans out with some fastai bag-of-tricks ResNets.

After object detection, which is working well, the second stage is to detect visual features of the target image that indicate the presence of a particular pathology.

Initial Results
For 13,000 images of Class A (absent) and 3,000 images of Class P (present), the results aren't great.

I have tried mixup, focal loss, and oversampling, with oversampling giving the best validation results: about 75% accuracy in detecting Class P. However, on a small subset (not used in training, of course) that has been re-validated by a single expert, the results are concerning: the model doesn't achieve anything meaningful, getting only 18% accuracy. So I suspect the original dataset is deeply flawed.
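For reference, the oversampling approach can be sketched with a plain PyTorch `WeightedRandomSampler` (a minimal toy sketch, not my actual training pipeline; the labels and feature tensors here are stand-ins for the real images):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy stand-in for the imbalanced dataset: many Class A (0), few Class P (1)
labels = torch.tensor([0] * 130 + [1] * 30)
features = torch.randn(len(labels), 8)

# Weight each sample inversely to its class frequency, so the minority
# Class P is drawn roughly as often as the majority Class A each epoch
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=16, sampler=sampler)

# Roughly half the samples drawn per epoch should now be Class P,
# versus ~19% under plain shuffling
drawn = torch.cat([y for _, y in loader])
print(drawn.float().mean().item())
```

Note that oversampling only rebalances what the model sees; it cannot fix labels that are wrong in the first place, which is my growing suspicion below.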

The Dataset
I know that the original dataset was collected without any attention to consensus, and the labellers were given many different tasks to do on the same image, which could have confounded the labelling process.

I have begun having the entire dataset of "suspected Class P" reclassified by a single expert.
So far 1,800 images have been done, and only about 34% of the suspected Class P images were actually Class P according to this single expert. However, average clinician accuracy on this task is around 60% anyway, so how meaningful are these results in the first place?

When I train on these 1,800 newly labelled images, I get about 70% accuracy on validation, without any holdout test set.

I have tried using a portion as a holdout test set, but the results are skewed heavily in favour of Class P for both positive and negative examples, so my guess is that this small dataset is too unbalanced and there's too much correlation with some other unrelated part of the images.
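A confusion matrix makes this kind of skew explicit. A sketch with hypothetical holdout predictions (the arrays below are invented to illustrate the "everything looks like Class P" failure mode, not my actual model outputs):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical hold-out labels and predictions: the model
# calls almost everything Class P (1)
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 0])

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# A large top-right cell (true A predicted as P) is the signature
# of the skew: accuracy on Class P looks fine while Class A collapses
```

Breaking accuracy down per class this way, rather than reporting a single number, is what exposed the problem for me.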

One major issue is that there appears to be a lack of consensus among labellers on this particular task.

I am thinking of running a batch of data through multiple labellers and keeping only the high-consensus data first, to see if the task itself is achievable. The harder, lower-consensus examples can be dealt with under more scrutiny later, and possibly fine-tuned over the easier model's weights with better success.

There are some complicated game-playing consensus ideas here:

But I figure even just some blind consensus labelling, followed by expert panel review of the non-consensus images, might be a quick way to achieve a reasonable improvement in data quality.
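The triage I have in mind can be sketched as a simple majority/unanimity filter (image IDs, vote lists, and the `triage` helper are all hypothetical; real consensus tooling would be more involved):

```python
from collections import Counter

# Hypothetical votes: each image blindly labelled by three raters
votes = {
    "img_001": ["P", "P", "P"],   # unanimous -> keep for the easy set
    "img_002": ["P", "A", "P"],   # majority but not unanimous
    "img_003": ["P", "A", "A"],
    "img_004": ["A", "P", "A"],
}

def triage(raters, threshold=1.0):
    """Split images into high-consensus keeps and needs-review.
    threshold is the fraction of raters that must agree."""
    keep, review = {}, []
    for img, labels in raters.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= threshold:
            keep[img] = label
        else:
            review.append(img)
    return keep, review

# First pass: unanimous-only, to test whether the task is learnable
keep, review = triage(votes)
print(keep)    # only the unanimous images survive
print(review)  # the rest go to the expert panel
```

Lowering `threshold` to, say, 2/3 would trade label purity for dataset size once the unanimous-only experiment shows the task is learnable at all.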

So my questions are:

  1. Should this really be a segmentation task? How do you know when you have sufficient information to apply classification, or when you need to use segmentation?

  2. If it is achievable with classification, what else is important in curating the right dataset to reach high accuracy in a binary absent/present scenario where the images themselves contain lots of other unimportant features?

  3. Does anyone have any good resources on curating and validating binary or multi-class image classification tasks that are fine-grained? I have watched nearly all the fastai videos and started on the older fastai ML videos, but have yet to see anything solid and concrete on building good datasets for hard problems. Does something like this already exist?

  4. Are there techniques I can use on small datasets to validate whether this task is even possible, or whether I'm heading in the right direction?

  5. Or should I just throw “MOAR DATA” at the problem?

I found an awesome YouTube video on getting good results for multi-class, multi-label classification problems, but it uses a well-curated dataset from Kaggle, so that's no help with my data problem.