Self-Supervised Learning on Giant Database

jfang · April 7, 2020, 5:39pm

Hi everyone,

I was lucky enough to land a wonderful internship thanks to learning fastai! I now have access to a giant unlabelled cancer histopathology database, but I feel slightly overwhelmed because I don’t know the best way to proceed, so I have a few questions.

Access to a massive database offers new opportunities, but new challenges as well. My first problem is preprocessing the data. Is it possible to download the data (e.g. from TCGA) and preprocess it at the same time (tiling the slides and removing tiles with too much white space), so that I do not fill up the hard disk?
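To make the question concrete, here is a minimal sketch of the kind of streaming pipeline I have in mind, assuming each slide can be decoded into an RGB NumPy array. The names `fetch_slide` and `save_tile` are hypothetical callables (a downloader and a writer), and the whiteness threshold of 220 and the 50% cutoff are arbitrary placeholder values, not recommendations.

```python
import numpy as np

TILE = 224  # tile edge length in pixels


def white_fraction(tile, threshold=220):
    """Fraction of pixels whose R, G and B values are all above `threshold`."""
    return float(np.mean(np.all(tile > threshold, axis=-1)))


def iter_tissue_tiles(slide, tile=TILE, max_white=0.5):
    """Yield (top, left, patch) for tiles that are not mostly white background."""
    height, width = slide.shape[:2]
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            patch = slide[top:top + tile, left:left + tile]
            if white_fraction(patch) <= max_white:
                yield top, left, patch


def stream_tiles(slide_ids, fetch_slide, save_tile):
    """Process one slide at a time so only one full slide is held at once.

    `fetch_slide` and `save_tile` are hypothetical callables supplied by the
    caller; they are not part of any library.
    """
    kept = 0
    for slide_id in slide_ids:
        slide = fetch_slide(slide_id)      # download and decode a single slide
        for top, left, patch in iter_tissue_tiles(slide):
            save_tile(slide_id, top, left, patch)
            kept += 1
        del slide                          # release it before fetching the next one
    return kept
```

In a real pipeline, `fetch_slide` would also have to delete the downloaded file once tiling is done, and a whole-slide image is usually too large to load as one array, so the tiles would probably have to be read region by region with a slide reader such as OpenSlide.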

My second problem concerns the self-supervised learning itself. I am going to follow the Caron et al. 2018 paper on deep clustering (https://arxiv.org/abs/1807.05520), with the clustering step modified to use Robust Continuous Clustering (https://www.pnas.org/content/114/37/9814). The idea is to extract the output of the penultimate layer of the network (the FC layer) for all examples, cluster those outputs, and then use the cluster assignments as pseudo-labels for the data. The model is trained for an epoch on the new pseudo-labels, and then the extract-and-cluster cycle starts over.

Unfortunately, I don’t have a clear picture of what this looks like in code. I imagine something like this:

for epoch in range(n_epochs):
     Extract activations
     Cluster the outputs
     Assign pseudo-labels
     Reassign labels
     Train for one epoch on the pseudo-labels
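
The loop above could be fleshed out roughly as follows. This is only a sketch of its shape: plain k-means in NumPy stands in for the clustering step (it is not Robust Continuous Clustering), and `extract_features` and `train_one_epoch` are hypothetical callables standing in for the real model code.

```python
import numpy as np


def kmeans(features, k, n_iter=10, seed=0):
    """Plain k-means, used here only as a stand-in for the clustering step."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance from every point to every centroid
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):           # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels


def deep_cluster_loop(extract_features, train_one_epoch, n_epochs, k):
    """Alternate between clustering features and training on the cluster ids.

    `extract_features()` should return an (n_samples, n_dims) array of
    penultimate-layer activations; `train_one_epoch(labels)` should train the
    model on those pseudo-labels. Both are hypothetical placeholders.
    """
    history = []
    for epoch in range(n_epochs):
        features = extract_features()
        pseudo_labels = kmeans(features, k, seed=epoch)
        train_one_epoch(pseudo_labels)
        history.append(pseudo_labels)  # kept so consecutive epochs can be compared
    return history
```

The full distance matrix in this toy k-means would never fit in memory for tens of millions of images, so a real run would need a scalable clustering implementation in its place.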

What I want isn’t just any solution, but an efficient solution given the sheer volume of images (on the scale of 20-30 million 224x224 images). This leads to two more questions:

What is the best way to reassign labels? Should I change label_from_df?
What is the most efficient way to remove the output neurons, add a new layer of neurons, and reassign the y-labels of a dataset? Reinitializing another Learner every epoch would not be very efficient!
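
For the label part, one option I can picture is keeping all labels in a single array that gets overwritten in place after each clustering step, so that nothing has to be rebuilt. The class below is a hypothetical illustration of that idea, not a fastai API; it only implements `__len__` and `__getitem__`, which is the interface a map-style dataset needs.

```python
import numpy as np


class PseudoLabelDataset:
    """Map-style dataset whose labels can be overwritten between epochs.

    Hypothetical class for illustration: `items` is any indexable collection
    of inputs (file paths, arrays, ...) and labels live in one integer array.
    """

    def __init__(self, items):
        self.items = items
        # -1 marks "no pseudo-label assigned yet"
        self.labels = np.full(len(items), -1, dtype=np.int64)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx], int(self.labels[idx])

    def set_labels(self, new_labels):
        """Overwrite every label in place; the dataset object is not rebuilt."""
        new_labels = np.asarray(new_labels, dtype=np.int64)
        if new_labels.shape != self.labels.shape:
            raise ValueError("expected one label per item")
        self.labels[:] = new_labels
```

Usage would be a single `ds.set_labels(cluster_ids)` call after each clustering step. The output layer could presumably be handled in the same spirit, by replacing only the final layer of the existing model instead of creating a new Learner, but that part depends on the framework and is not shown here.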
Lastly, a question about how outputs are displayed. Suppose I want to report the normalized mutual information (NMI) between the cluster assignments of epoch n and epoch n-1, alongside a loss chart (which I care about less than the NMI, since the NMI tells me when to stop training). What is the best way to display this?
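
For reference, this is the quantity I mean, written as a small standard-library sketch. It normalizes the mutual information by the arithmetic mean of the two entropies, which is one of several common conventions; scikit-learn also provides `sklearn.metrics.normalized_mutual_info_score` for this.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(prev_labels, curr_labels):
    """NMI between two clusterings, normalized by the mean of their entropies."""
    if len(prev_labels) != len(curr_labels):
        raise ValueError("both clusterings must label the same samples")
    n = len(prev_labels)
    joint = Counter(zip(prev_labels, curr_labels))
    prev_counts = Counter(prev_labels)
    curr_counts = Counter(curr_labels)
    # I(A; B) = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) * p(b)))
    mutual_info = sum(
        (c / n) * math.log(c * n / (prev_counts[a] * curr_counts[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(prev_labels) + entropy(curr_labels)) / 2
    if mean_entropy == 0:  # both clusterings put every sample in one cluster
        return 1.0
    return mutual_info / mean_entropy
```

Because NMI ignores how clusters are numbered, a value close to 1 between consecutive epochs would mean the assignments have stabilized even if the cluster ids were shuffled.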