Object detection bounding boxes training

Hello,

I’m trying to reproduce the bounding box selection from Lesson 8, using the Google Open Images dataset.
That dataset provides bounding boxes as values in the [0, 1] range, and the images vary in size, so there’s no easy conversion to TfmType.COORD-style coordinates.

I’m creating a dataset that requests no alterations to the dependent tensor Y containing the bounding boxes, since fastai has no support for relative bounding box values. I’ll probably need to implement my own Rotate and Flip transforms.
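For what it’s worth, a flip transform for relative coordinates is fairly simple, since [0, 1] boxes are independent of image size: mirroring horizontally just maps each column coordinate c to 1 - c. Here’s a minimal sketch (the function name `flip_lr_rel` and the (row_min, col_min, row_max, col_max) box layout are my own assumptions, chosen to match the lesson’s row/column ordering):

```python
import numpy as np

def flip_lr_rel(im, bb):
    """Horizontally flip an image (H x W x C array) together with a
    relative bounding box (row_min, col_min, row_max, col_max) in [0, 1].
    Rows are unchanged; columns mirror around 1.0, and min/max swap."""
    r0, c0, r1, c1 = bb
    flipped_bb = np.array([r0, 1.0 - c1, r1, 1.0 - c0], dtype=np.float32)
    return im[:, ::-1], flipped_bb

im = np.zeros((4, 4, 3))
_, bb = flip_lr_rel(im, np.array([0.1, 0.2, 0.5, 0.6], dtype=np.float32))
# columns (0.2, 0.6) become (1 - 0.6, 1 - 0.2) = (0.4, 0.8)
```

A rotate transform is harder, since rotating an axis-aligned box requires rotating its corners and taking their bounding rectangle, but it is also size-independent in relative coordinates.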

f_model = resnet34
bs = 64
sz = 224
num_workers = 8

tfm_y = TfmType.NO
augs = [RandomRotate(5, p=0.5, tfm_y=tfm_y),
        RandomLighting(0.05,0.05, tfm_y=tfm_y)]

tfms = tfms_from_model(f_model=f_model, sz=sz, aug_tfms=augs, crop_type=CropType.NO, tfm_y=tfm_y, norm_y=False)
datasets = ImageClassifierData.get_ds(FilesIndexArrayRegressionDataset, trn_bbox_ds, val_bbox_ds, tfms, path=PATH)
md = ImageClassifierData(PATH, datasets, bs, num_workers, classes=[])

head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4))
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()

If I leave the coordinates in the [0, 1] range, the loss never drops below 0.35 and the predicted bounding boxes hardly make any sense. However, if I multiply the training values by 1e3, the loss optimizes down to about 50 (5% of the [0, 1e3] range) and the predictions look fairly accurate. Are there any requirements on the value range of regression targets?
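One thing worth noting: with nn.L1Loss, both the loss value and its gradients scale linearly with the target magnitude, so multiplying the targets by 1000 acts much like multiplying the loss (and hence the effective learning rate on the head) by 1000. A minimal sketch of the scale/unscale bookkeeping, assuming a hypothetical scale factor `SCALE` (everything here is illustrative, not fastai API):

```python
import numpy as np

SCALE = 1000.0  # hypothetical factor; the original post used 1e3

def scale_targets(y_rel):
    """Scale relative [0, 1] targets up before training, so L1 loss
    values and gradients are correspondingly larger."""
    return y_rel * SCALE

def unscale_preds(y_pred):
    """Map network outputs back into [0, 1] for evaluation/plotting."""
    return y_pred / SCALE

y = np.array([0.1, 0.2, 0.5, 0.6], dtype=np.float32)
round_trip = unscale_preds(scale_targets(y))  # recovers the original targets
```

If scaling targets and retuning the learning rate give different results, the remaining suspects are usually numerical (e.g. weight decay or batch norm statistics interacting with very small activations in the head).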

Also, what is the reason for generating class labels for bounding boxes in the regression case? I see this case is specifically handled in fastai’s dataset.py, although it doesn’t seem to have any effect later on: I supply an empty class-label list and it yields the same results for the regression case.

def dict_source(folder, fnames, csv_labels, suffix='', continuous=False):
    all_labels = sorted(list(set(p for o in csv_labels.values() for p in ([] if type(o) == float else o))))
    full_names = [os.path.join(folder,str(fn)+suffix) for fn in fnames]
    if continuous:
        label_arr = np.array([np.array(csv_labels[i]).astype(np.float32)
                for i in fnames])

Why not convert the relative bounding box labels to the cartesian (absolute pixel) format fast.ai expects? You should be able to read the height and width of each image and use that to calculate cartesian bounding box labels. That’s much easier than implementing your own rotate and flip transforms, and you only need to do it once; afterwards you simply save the converted bounding boxes to a file.
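The conversion itself is a one-liner per coordinate. A small sketch, assuming the Open Images (XMin, YMin, XMax, YMax) column order (the function name `rel_to_abs` is mine; in practice you would get `w, h` from `PIL.Image.open(fname).size` for each file):

```python
def rel_to_abs(rel_bb, w, h):
    """Convert a relative (x_min, y_min, x_max, y_max) box in [0, 1]
    to absolute pixel coordinates for an image of width w, height h."""
    x0, y0, x1, y1 = rel_bb
    return (x0 * w, y0 * h, x1 * w, y1 * h)

# For a 640x480 image:
abs_bb = rel_to_abs((0.25, 0.5, 0.75, 1.0), 640, 480)
# -> (160.0, 240.0, 480.0, 480.0)
```

Run this over the whole annotation CSV once, write the absolute boxes out, and the standard TfmType.COORD pipeline applies unchanged.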

My question is more general. While it is indeed possible to pre-calculate all cartesian coordinates in this example, what I’m really trying to understand is (a) why regression on values in the range [0, 1] gives noticeably worse results than on values in the range [0, 1000], and (b) why dataset.py:dict_source encodes the values into a categories array.

dataset.py:dict_source creates the all_labels and label_arr variables because that allows the regression problem to be formally treated like a classification problem. Instead of writing a bunch of new code specifically for regression tasks, fast.ai transforms the regression data into a format that resembles classification data and reuses the existing code. I’m fairly sure this was done for convenience of implementation (which makes sense; after all, the classification code is already well debugged). Is that what you were asking?

Why regression values in the range [0, 1] do worse than values in [0, 1000] is an interesting question. I assume you ran the learning rate finder and chose a suitable learning rate for each case individually? And the shapes of the learning rate curves looked reasonable in both cases?

re dict_source -
  1. It doesn’t seem to be used; I’m currently supplying an empty class list. Maybe it’s a leftover from some other use case?
  2. Regression is supposed to output continuous values, not a specific set of values. Consider training a neural network not for a bounding box but, say, for estimating the physical dimensions of an object.
re value ranges

Yes, I ran the LR finder; the LRs are different, and the curve shapes are also slightly different.

You are right about all of these. The labels array isn’t there because there is a theoretical need for it; it’s there because it was the quickest way to implement regression in fast.ai. Jeremy even comments on that in Lesson 8, if I recall correctly. The class is still called ImageClassifierData even though it should really be ImageRegressorData; the reason is simply that nobody has spent the time to refactor the code yet. If you have time to do so, a corresponding pull request would be highly appreciated and probably quickly merged.

That the two models don’t train equally well is very interesting and good to know; unfortunately I have no explanation for it. You might want to consider posting this as a specific, separate question if nobody responds here in the next couple of days.