Given a fairly well-bounded image of a quadrilateral object, detect the corners/perimeter of the object. I wasn't getting any results with my real data, so as an easier, toy problem I decided to try to get some results with synthetic data like below. This is a batch of 8 examples from my easy synthetic training set. The small red circles show the data labels.
I managed to get "ok" results for this synthetic dataset by treating the problem as one of linear regression to the corners. I generate a set of 4 anchor points (the average positions of the four corners of objects in the dataset) and adapt a ResNet to predict 8 values:
```python
import torchvision
import torch.nn as nn

resnet50 = torchvision.models.resnet50(pretrained=True)
resnet50.fc = nn.Linear(in_features=2048, out_features=8, bias=True)
```
I interpret those 8 values as 4 (x, y) offsets from the anchor points, normalised to the range 0.0 to 1.0 with respect to the width and height of the image. I use MSELoss against the 8 ground-truth corner coordinates (also expressed as normalised offsets from the same anchors).
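Roughly, the target construction looks like the following. This is a minimal sketch, not the exact notebook code: the tensor shapes, the `make_targets` helper, and the anchor computation are my assumptions.

```python
import torch
import torch.nn as nn

# Assumed shapes (illustrative): corners is (N, 4, 2) in pixel coordinates,
# anchors is (4, 2) - the mean corner positions over the training set.
# anchors = train_corners.mean(dim=0)

def make_targets(corners, anchors, img_w, img_h):
    # Offsets from the anchor points, normalised by image width/height
    offsets = (corners - anchors) / torch.tensor([img_w, img_h], dtype=torch.float32)
    return offsets.reshape(-1, 8)  # flatten to 8 values per example

criterion = nn.MSELoss()
# preds = resnet50(images)                                  # (N, 8)
# loss = criterion(preds, make_targets(corners, anchors, W, H))
```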
For the purposes of comparison I have written a function that simply returns a tensor of zeros (i.e. a prediction of zero deviation from the anchor points). I refer to this function as the baseline model.
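In code the baseline amounts to something like this (my sketch, assuming the same (N, 8) output shape as the network):

```python
def baseline_model(images):
    # Predict zero deviation from the anchor points for every example
    return torch.zeros(images.shape[0], 8)
```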
Results so far
ResNet is able to outperform the baseline model on the synthetic data, although not as accurately as I would have hoped, e.g. the predictions below. (Although I'm sure I could get something a bit better if I spent a while tuning the hyperparameters.)
However, ResNet can't even match the baseline model when training on the real data. I can't get it to overfit; even when making predictions on the training data after training, it doesn't match the baseline.
I tried a few other models like MnasNet and MobileNetV2 and found that they couldn't beat the baseline, even on the synthetic dataset. I also tried adding a tanh layer after the linear layer to map all the outputs to the range -1 to +1:

```python
net = nn.Sequential(resnet50, nn.Tanh())
```

but found that this stopped the model being able to beat the baseline on the synthetic data too.
Any advice on this problem would be great - am I framing the problem well? Am I missing a trick on the DNN output?
Here is a Colab upload of the notebook I'm using to experiment with the synthetic data.
This is an example of the real data I ultimately want to learn to detect. Of course I know that Canny edge detection is a thing; I'm hoping to find a DNN solution that can cope with very varied lighting conditions and hopefully learn to exploit higher-level semantic information that simple conventional CV could not.
OK, with large batches and many epochs of training on a big GPU I am now able to beat the baseline on real data. It's still not doing that well though; I would still really appreciate thoughts/feedback on how I can improve my overall approach.
OK - I'm actually getting good results with this same basic approach using EfficientNet-B3, a Yogi/Lookahead optimiser, and training for > 1000 epochs with batch sizes of 20. I've never had to train anything for more than a hundred epochs to reach convergence before, but this is a relatively small dataset (the training set is around 1300 unique images) with a bunch of augmentation built in.
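For anyone wanting to try the same combination, the setup looks roughly like this. It's a sketch under my assumptions: I'm using the `timm` package for EfficientNet-B3 and the `torch_optimizer` package for Yogi and Lookahead, and the learning rate and Lookahead parameters below are illustrative defaults, not my tuned values.

```python
import timm
import torch_optimizer as optim

# EfficientNet-B3 backbone with an 8-way regression head
net = timm.create_model('efficientnet_b3', pretrained=True, num_classes=8)

# Yogi wrapped in Lookahead (hyperparameters here are placeholders)
base_opt = optim.Yogi(net.parameters(), lr=1e-3)
optimizer = optim.Lookahead(base_opt, k=5, alpha=0.5)
```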