Lesson 10 IMDB using custom dataset

yihwa · July 9, 2018, 1:09pm

Hello,
I am testing the pretrained model from Lesson 10 IMDB on a custom dataset.
Are there any requirements for such datasets?
My training data has 15 rows.

All the steps run fine until I reach the classifier.

At first, running

learn.lr_find(lrs/1000)
learn.sched.plot()

gave:
ValueError: need at least one array to concatenate

I then concatenated my training set with itself a couple of times just to increase the dataset size. After that, this step no longer raised an error, but no plot was displayed either.

Then, when running

learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

I get a ZeroDivisionError:

ZeroDivisionError Traceback (most recent call last)
in ()
----> 1 learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

~/lpsbigger-topic/fastai/text.py in fit(self, *args, **kwargs)
209
210 def _get_crit(self, data): return F.cross_entropy
--> 211 def fit(self, *args, **kwargs): return super().fit(*args, **kwargs, seq_first=True)
212
213 def save_encoder(self, name): save_model(self.model[0], self.get_model_path(name))

~/lpsbigger-topic/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
285 self.sched = None
286 layer_opt = self.get_layer_opt(lrs, wds)
--> 287 return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
288
289 def warm_up(self, lr, wds=None):

~/lpsbigger-topic/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
232 metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
233 swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234 swa_eval_freq=swa_eval_freq, **kwargs)
235
236 def get_layer_groups(self): return self.models.get_layer_groups()

~/lpsbigger-topic/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
107 avg_mom=0.98
108 batch_num,avg_loss=0,0.
--> 109 for cb in callbacks: cb.on_train_begin()
110 names = ["epoch", "trn_loss", "val_loss"] + [f.__name__ for f in metrics]
111 if swa_model is not None:

~/lpsbigger-topic/fastai/sgdr.py in on_train_begin(self)
269 def on_train_begin(self):
270 self.cycle_iter,self.cycle_count=0,0
--> 271 super().on_train_begin()
272
273 def calc_lr(self, init_lrs):

~/lpsbigger-topic/fastai/sgdr.py in on_train_begin(self)
135 def on_train_begin(self):
136 super().on_train_begin()
--> 137 self.update_lr()
138 if self.record_mom:
139 self.update_mom()

~/lpsbigger-topic/fastai/sgdr.py in update_lr(self)
147
148 def update_lr(self):
--> 149 new_lrs = self.calc_lr(self.init_lrs)
150 self.layer_opt.set_lrs(new_lrs)
151

~/lpsbigger-topic/fastai/sgdr.py in calc_lr(self, init_lrs)
275 if self.cycle_iter>cut_pt:
276 pct = 1 - (self.cycle_iter - cut_pt)/(self.nb - cut_pt)
--> 277 else: pct = self.cycle_iter/cut_pt
278 res = init_lrs * (1 + pct*(self.div-1)) / self.div
279 self.cycle_iter += 1

ZeroDivisionError: division by zero

It looks to me like the error is caused by the dataset not satisfying certain conditions, but it is not clear to me what those conditions are.

urmas.pitsi · July 9, 2018, 2:03pm

Try increasing cycle_len. In this setting, cycle_len is essentially the number of epochs.

Instead of:
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

try:
learn.fit(lrs, 1, wds=wd, cycle_len=10, use_clr=(8,3))
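
A rough sketch of why a larger cycle_len can help, reconstructed from the calc_lr lines shown in the traceback. It assumes that cut_pt is computed as nb // cut_div, that nb is the total number of iterations in the cycle (batches per epoch times cycle_len), and that use_clr=(8,3) is unpacked as (div, cut_div). None of that is visible in the traceback, so treat it as an assumption. The function clr_pct is a hypothetical stand-in, not fastai code.

```python
def clr_pct(cycle_iter, batches_per_epoch, cycle_len, cut_div=3):
    """Position on the triangular LR schedule (hypothetical helper, not fastai code)."""
    nb = batches_per_epoch * cycle_len   # assumed: total iterations in one cycle
    cut_pt = nb // cut_div               # assumed: iteration at which the LR peaks
    if cycle_iter > cut_pt:
        return 1 - (cycle_iter - cut_pt) / (nb - cut_pt)
    return cycle_iter / cut_pt           # divides by zero whenever nb < cut_div


# Tiny dataset: one batch per epoch and cycle_len=1 give nb=1, so cut_pt=0.
try:
    clr_pct(0, batches_per_epoch=1, cycle_len=1)
    failed = False
except ZeroDivisionError:
    failed = True
print("cycle_len=1 raises ZeroDivisionError:", failed)

# Same data with cycle_len=10 gives nb=10 and cut_pt=3, so the first step works.
print("cycle_len=10, first pct:", clr_pct(0, batches_per_epoch=1, cycle_len=10))
```

Under these assumptions, anything that pushes the total iteration count to at least cut_div (more epochs, more rows, or a smaller batch size) avoids the error, which would also explain why duplicating rows changed the behaviour.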

yihwa · July 10, 2018, 7:18am

Thanks @urmas.pitsi, that seems to solve the problem. However, the accuracy keeps decreasing over the epochs (although training has not completely finished yet).

Should I have used different parameters for lr_find?
As I wrote in my previous post, learn.lr_find(lrs/1000) followed by learn.sched.plot() did not produce a plot.

urmas.pitsi · July 10, 2018, 7:27am

I am not sure what your lrs are, but the model is not learning, which could indicate that the initial learning rate is far too high. You could try fitting with a small learning rate, e.g. 0.01 or 0.001, and see whether it starts learning.
About the training data: for starters, you could extract a smaller sample from the original IMDB data and run on that. Get everything working, then swap in your own data. If the model does not work on the original data, it will definitely not work on your custom data.
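
One way to take such a smaller sample is sketched below. It uses only the standard library; sample_per_class and the toy texts/labels lists are hypothetical placeholders for however the IMDB reviews and their labels are held in memory, not part of the lesson notebook.

```python
import random
from collections import defaultdict


def sample_per_class(texts, labels, n_per_class, seed=0):
    """Return a shuffled list of (text, label) pairs with up to n_per_class per label."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)            # fixed seed so the sample is reproducible
    sample = []
    for label, items in sorted(by_label.items()):
        k = min(n_per_class, len(items)) # never ask for more rows than exist
        sample.extend((text, label) for text in rng.sample(items, k))
    rng.shuffle(sample)                  # avoid having all of one class first
    return sample


# Toy stand-in for the real reviews: 20 texts, alternating labels 0 and 1.
texts = [f"review {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

small = sample_per_class(texts, labels, n_per_class=3)
print(len(small), "rows sampled")
```

Sampling the same number of rows per label keeps the small set balanced, so a drop in accuracy is less likely to come from one class dominating the sample.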

yihwa · July 10, 2018, 7:31am

I used the default values

lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

It definitely looks like overfitting: trn_loss goes down, while val_loss and accuracy get worse.

urmas.pitsi · July 10, 2018, 8:10am

It may be that your training set is too small, so the model memorizes the examples instead of generalizing, and validation accuracy doesn't improve.

yihwa · July 10, 2018, 8:16am

Yes, that is possible. My training set is in fact very small: one example per label, taken from Wikipedia. I searched thoroughly for other sources but could not find more training data, so I had to rely on the Wikipedia data alone.
I then duplicated these examples three times in the training set in order to get rid of the lr_find errors.

I thought the small number of training examples would not matter much, since the cats vs. dogs example in Lesson 1 and some other examples showed that a pretrained model can work well with few training examples.

When I trained with Naive Bayes I could push it to about 60% accuracy, and I hoped to get higher. However, the accuracy is very low now.