I am trying to use darkflow for custom object detection.
Darkflow is an implementation of the YOLO architecture for object detection (https://github.com/thtrieu/darkflow). I am manually annotating data and using the darkflow framework to train on it.
The objective is to extract some annotated fields from documents.
My dataset is also quite small, with 58 annotated images.
I was expecting the darkflow framework to overfit, since the dataset is small and the model is sufficiently big. But no matter how long I train the network, the loss plateaus, and the model's predictions are way off from what is desired.
I have tried image augmentation, varying the learning rate, and a new architecture (using ResNet50 as a feature detector and training a classifier on top of it), but with the same results.
I am not sure why a complex model would underfit a small training set.
Grateful for any suggestions about how to proceed!
“The objective is to extract some annotated fields from documents” – The document is an image of an ID card (a driving license, for example). I have drawn bounding boxes around only one field (date of birth). I have 58 such identification documents, each with a bounding box for a single class (date of birth, DOB), and I am trying to predict the bounding box over the DOB field for new documents.
I am actually loading pretrained YOLO weights and trying to fine-tune them for this problem.
I would actually expect it to overfit, but am quite puzzled why it's not overfitting.
My first hypothesis is that the pre-trained model’s objective (read: its target classes) is “far” from yours, so the model cannot discriminate using the signal you are providing.
You may have to rethink your approach: YOLO and other DL approaches work well when you have a lot of variety in your samples. Since you are only looking at ID cards, consider framing this as a text recognition task.
Well, for one, I moved away from YOLO and used ResNet-50 as the feature detector, then trained some layers on top of it, and improved it by adding one batch-norm layer.
Also, the target labels were initially the top-left and bottom-right corners of the bounding box; I changed them to center, width, and height. And finally, I applied different learning rates across different layers.
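The change of box parameterization mentioned above is just a coordinate transform. A minimal sketch (the helper name is mine, not from any library):

```python
# Hypothetical helper: converts a box given as (x1, y1, x2, y2)
# top-left / bottom-right corners into (cx, cy, w, h)
# center / width / height form.
def corners_to_center(x1, y1, x2, y2):
    w = x2 - x1          # width is the horizontal extent
    h = y2 - y1          # height is the vertical extent
    cx = x1 + w / 2      # center x is halfway across the box
    cy = y1 + h / 2      # center y is halfway down the box
    return cx, cy, w, h

# Example: a 100x50 box with top-left corner at (10, 20)
# has center (60.0, 45.0), width 100, height 50.
```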
Planning to try oversampling today. Please feel free to suggest more improvements that you think will help. Thanks!
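The different learning rates across layers can be set with PyTorch optimizer parameter groups. A minimal sketch, where the tiny model and the rates are made up for illustration (the real model and values would differ):

```python
from torch import nn, optim

# Stand-in model: the first Linear plays the role of the (frozen-ish)
# backbone, the last Linear plays the role of the newly added head.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

# Each dict is a parameter group with its own learning rate:
# a small LR for the backbone stand-in, a larger one for the head.
optimizer = optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-4},
        {"params": model[2].parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```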
Hi @gokul, I am also working on a similar problem: I am trying to extract text from an Aadhaar card, and I only need the name, DOB, and address. I was thinking about using SSD. Could you please share your code, and explain the difference between the top-left/bottom-right corner approach and the center, width, and height approach for the bounding box?
I was initially using darkflow for predicting the bounding boxes. Then I tried a simpler approach, which was to just take the ResNet stem as a feature detector and add some layers on top of it to predict 4 numbers for each image.
One thing that helped me was to start with just one class, for example just the Aadhaar number.
The core of the code was something like this. With some modifications to the final layers. (Credits: [Programming PyTorch for Deep Learning, by Ian Pointer])
import torch.nn as nn
from torchvision import models

# Load a pretrained ResNet-50 and freeze its weights
transfer_model = models.resnet50(pretrained=True)
for name, param in transfer_model.named_parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a small trainable head
transfer_model.fc = nn.Sequential(
    nn.Linear(transfer_model.fc.in_features, 500),
    nn.ReLU(),
    nn.Dropout(),
    nn.Linear(500, 2),
)
Will try to add the code soon, but hope this helps you to get started!