Lesson 4 Advanced Discussion ✅

Is the label_delim parameter in TextDataBunch functional? I get an error trying

data_clas = TextDataBunch.from_df(path, train_df=df_trn, valid_df=df_val, 
                                  vocab=data_lm.vocab, 
                                  text_cols='Narrative', 
                                  label_cols='Contributing Factors / Situations',
                                  label_delim='|',
                                  bs=bs)

Error: iterator should return strings, not float (did you open the file in text mode?)
I also get an error when the delimiter is two characters, e.g. '; '

1 Like

I’m wondering about this myself. I’m setting batch-size values as I load a data_lm I previously created with default values, but the GPU load seems to remain the same no matter what.

1 Like

Yes, normally I add a little to the max and min for this reason.

Do you refer to the + self.min_score after passing the result through a sigmoid, or to something else done afterwards? I can understand that adding something helps for values close to 5 (as the sigmoid will likely reduce them), but why is that the case for values close to 0? In my understanding, we should subtract something, shouldn’t we?
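
For reference, a minimal sketch of the kind of scaling being discussed (the padding of 0.5 and the names min_score/max_score are just for illustration, not the exact fastai code): the raw activation is squashed with a sigmoid and then stretched onto a slightly widened score range, so the extremes (near 0 and near 5) are easier for the network to reach.

import torch

def scale_output(x, min_score=0.0, max_score=5.0, pad=0.5):
    # Squash the raw activation into (0, 1), then stretch it onto a slightly
    # widened range [min_score - pad, max_score + pad] so the extremes are reachable.
    lo, hi = min_score - pad, max_score + pad
    return torch.sigmoid(x) * (hi - lo) + lo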

Jeremy mentioned there is a Chinese language model, (su, zoo?). Where could I find more information about it?
Many thanks~

Quick question: why use learn.fit() instead of learn.fit_one_cycle() with tabular data? (cf. the notebook Jeremy showed during the live stream)

1 Like

Search Model Zoo

While discussing the use of

learn.predict('I liked this movie because ', 100, temperature=1.1, min_p=0.001)

in the lesson3-imdb notebook, @jeremy mentioned that “this is not designed to be a good text generation system. There’s lots of tricks that you can use to generate much higher quality text, none of which we’re using here.”

What are those tricks? :wink: Where could I find more information about them?

BTW, I created RNN based Polish poetry generator and utilized Jeremy’s ideas of adding special tokens for uppercase and capitalized words, also adding a token for line break was a good idea :slight_smile: Many thanks, Jeremy!

As Polish is morphologically rich due to cases, gender forms and a large vocabulary, I’m working with sub-word units - syllables.

4 Likes

Thanks @fredguth :grinning:

I am also interested in how we can calculate feature importance for tabular data in fully connected networks.
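
One approach that isn’t fastai-specific is permutation importance: shuffle one column of the validation set at a time and measure how much a metric degrades. A minimal sketch (predict_fn, X_valid, y_valid and metric are placeholders you would supply from your own model and data):

import numpy as np

def permutation_importance(predict_fn, X_valid, y_valid, metric):
    # predict_fn: callable taking an array of rows and returning predictions
    # metric:     callable(y_true, y_pred) -> score, higher is better
    base = metric(y_valid, predict_fn(X_valid))
    importances = []
    for col in range(X_valid.shape[1]):
        X_shuffled = X_valid.copy()
        np.random.shuffle(X_shuffled[:, col])      # destroy this column's information
        score = metric(y_valid, predict_fn(X_shuffled))
        importances.append(base - score)           # drop in score = importance
    return importances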

The main one would be ‘beam search’.
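
For the curious, here is a minimal sketch of the beam-search idea (not fastai’s implementation; next_token_log_probs is a placeholder for whatever returns candidate next tokens and their log-probabilities given a prefix). Instead of greedily picking one token at a time, you keep the k most probable partial sequences at every step:

import heapq

def beam_search(next_token_log_probs, start_tokens, steps=20, beam_width=5):
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, list(start_tokens))]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for token, token_logp in next_token_log_probs(seq):
                candidates.append((logp + token_logp, seq + [token]))
        # Keep only the beam_width most probable partial sequences.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]  # best-scoring sequence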

4 Likes

No good reason - it’s just a little toy dataset that I didn’t really experiment with much.

1 Like

This is because the column where you have your labels is interpreted as a mix of numbers and strings by the CSV reader, I’m guessing. I’ve pushed something to master that should fix this; while waiting for the next release, you can type df[label_cols] = df[label_cols].astype(str) to force the column to a string type.
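
Concretely, using the column name from the question above as an example, the workaround looks like this (applied to both dataframes before building the DataBunch):

# Force the label column to string type on both dataframes first
label_col = 'Contributing Factors / Situations'
df_trn[label_col] = df_trn[label_col].astype(str)
df_val[label_col] = df_val[label_col].astype(str)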

1 Like

Hello guys, I am trying to do multi-label classification on a small text dataset, but I’m getting this error. The learner completes training of the first epoch, but throws the error while validating. Can anyone help?

torch 1.0.0.dev20181120
fastai 1.0.28

data: a dataframe with two columns (one with the text, the other with the multi-labels)

The learner was able to train, but it throws the error while validating.

Trying to do text classification, I am getting this error:

TypeError: list indices must be integers or slices, not NoneType

dc = TextDataBunch.from_df(path = path, train_df = train_df, valid_df = valid_df,
                           text_cols = ['title'], label_cols = ['labels'], label_delim = '|')
learn.fit_one_cycle(5, slice(0.01))


        TypeError                                 Traceback (most recent call last)
    <ipython-input-37-9c1fbaf9d6de> in <module>()
    ----> 1 learn.fit_one_cycle(5, slice(0.01))

    /opt/anaconda3/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
         18     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
         19                                         pct_start=pct_start, **kwargs))
    ---> 20     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
         21 
         22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
        160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
        161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
    --> 162             callbacks=self.callbacks+callbacks)
        163 
        164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         92     except Exception as e:
         93         exception = e
    ---> 94         raise e
         95     finally: cb_handler.on_train_end(exception)
         96 

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         87             if hasattr(data,'valid_dl') and data.valid_dl is not None:
         88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
    ---> 89                                        cb_handler=cb_handler, pbar=pbar)
         90             else: val_loss=None
         91             if cb_handler.on_epoch_end(val_loss): break

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
         47     with torch.no_grad():
         48         val_losses,nums = [],[]
    ---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
         50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
         51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

    /opt/anaconda3/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
         63         self.update(0)
         64         try:
    ---> 65             for i,o in enumerate(self._gen):
         66                 yield o
         67                 if self.auto_update: self.update(i+1)

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
         45     def __iter__(self):
         46         "Process and returns items from `DataLoader`."
    ---> 47         for b in self.dl:
         48             y = b[1][0] if is_listy(b[1]) else b[1]
         49             if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)

    /opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
        612     def __next__(self):
        613         if self.num_workers == 0:  # same-process loading
    --> 614             indices = next(self.sample_iter)  # may raise StopIteration
        615             batch = self.collate_fn([self.dataset[i] for i in indices])
        616             if self.pin_memory:

    /opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py in __iter__(self)
        158     def __iter__(self):
        159         batch = []
    --> 160         for idx in self.sampler:
        161             batch.append(idx)
        162             if len(batch) == self.batch_size:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/text/data.py in __iter__(self)
         59     def __len__(self) -> int: return len(self.data_source)
         60     def __iter__(self):
    ---> 61         return iter(sorted(range_of(self.data_source), key=self.key, reverse=True))
         62 
         63 class SortishSampler(Sampler):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/text/data.py in <lambda>(t)
        244         dataloaders = [train_dl]
        245         for ds in datasets[1:]:
    --> 246             sampler = SortSampler(ds.x, key=lambda t: len(ds[t][0].data))
        247             dataloaders.append(DataLoader(ds, batch_size=bs, sampler=sampler, **kwargs))
        248         return cls(*dataloaders, path=path, collate_fn=collate_fn)

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
        413     def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':
        414         if isinstance(try_int(idxs), int):
    --> 415             if self.item is None: x,y = self.x[idxs],self.y[idxs]
        416             else:                 x,y = self.item   ,0
        417             if self.tfms:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
         80 
         81     def __getitem__(self,idxs:int)->Any:
    ---> 82         if isinstance(try_int(idxs), int): return self.get(idxs)
         83         else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
         84 

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in get(self, i)
        276         o = self.items[i]
        277         if o is None: return None
    --> 278         return self._item_cls(one_hot(o, self.c), [self.classes[p] for p in o], o)
        279 
        280     def reconstruct(self, t):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
        276         o = self.items[i]
        277         if o is None: return None
    --> 278         return self._item_cls(one_hot(o, self.c), [self.classes[p] for p in o], o)
        279 
        280     def reconstruct(self, t):

    TypeError: list indices must be integers or slices, not NoneType

Thank you @sgugger. This fixed the problem. Now I have another one:

learn.fit_one_cycle(1, 3e-2, moms=(0.8,0.7))

It gives an error at the validation stage - maybe the default accuracy metric is not appropriate:

~/anaconda2/envs/fastai.1/lib/python3.6/site-packages/fastai/metrics.py in accuracy(input, targs)
     37     input = input.argmax(dim=-1).view(n,-1)
     38     targs = targs.view(n,-1)
---> 39     return (input==targs).float().mean()
     40 
     41 def error_rate(input:Tensor, targs:Tensor)->Rank0Tensor:

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'other'

I try to fix this with:

learn.metrics = accuracy_thresh

But I get an error:

TypeError: 'function' object is not iterable

Edit: a list of metrics is expected. This works:

learn.metrics = [accuracy_thresh]
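
And if you want a threshold other than the default, I believe you can wrap the metric in a partial (the 0.4 below is just an illustrative value):

from functools import partial
from fastai.metrics import accuracy_thresh

learn.metrics = [partial(accuracy_thresh, thresh=0.4)]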

Maybe you need to give the proper vocabulary with vocab=data_lm.vocab? E.g.

dc = TextClasDataBunch.from_df(path = path, train_df = train_df, valid_df = valid_df,
                               text_cols = ['title'], label_cols = ['labels'], label_delim = '|',
                               vocab=data_lm.vocab)
learn = text_classifier_learner(dc, drop_mult=0.5)
learn.metrics = [accuracy_thresh]

Regarding Language Modelling:

@sgugger, @jeremy , and anyone else knowledgeable about NLP/ language modelling

Two questions regarding transfer learning/ NLP:

(1) I’m not sure if I missed it, but is there a minimum amount of words needed in the domain-specific corpus to get decent results?

I’m trying to build a model for scientific writing (geosciences) and was wondering how much data I need to scrape. I plan to do topic modeling and want to try to predict the scientific subfield of a text at inference time (multi-class labeling)…

I think I’m now at 19 million words.

(2) I will have a lot of super-special words (also plenty of Greek symbols etc.) that might carry a lot of weight when deciding what area this is from…

Will this be lost (max. vocab = 60,000)? Do I have to take special care to retain this stuff (and is it wise to do so)?

Cheers

If there is a minimum amount of words, it’s certainly less than 19 million :wink: That’s more than what the pretrained model got from wikitext-103, I believe.
The vocabulary will take the 60,000 most common tokens by default. Keep in mind that your language model will never learn the meaning of a very rare word if it doesn’t see it at least a little bit. So if your special words are common in your corpus, they will be in the vocabulary and the model will learn their meaning during fine-tuning. Otherwise, they might not be useful.

You can try to play around with this maximum vocab size, but then you’ll probably need a custom loss function, because a full softmax will be too heavy in terms of computation/memory.
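
If I remember correctly, the vocabulary cap can be changed when building the DataBunch, along these lines (max_vocab/min_freq are passed through to the numericalization step in fastai v1; the exact keywords may differ between versions, and the dataframe/column names follow the earlier examples in this thread):

data_lm = TextLMDataBunch.from_df(path, train_df=df_trn, valid_df=df_val,
                                  text_cols='text',
                                  max_vocab=100000, min_freq=2)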

I think wikitext-103 is closer to a hundred million words. I’d guess IMDb might be closer to 19m.

Thanks

That’s good to know… I ramped up to 25 million… let’s see what we get. Due to the very technical nature (scientific terminology and niche expressions) I am a bit doubtful whether this works as is… but we’ll find out I guess ;-)… I’d probably try to extend the vocab size. Are new words (there should be a lot) just added to the wiki vocab at the moment? Or is 60,000 the limit, so that new words push out wiki words?

Thanks

A follow-up question.

Given that I have conference abstracts as documents, would it make sense to somehow tell the model that their titles are actually very important (more condensed info)? Should I somehow mark this special sentence (all caps, a surrounding tag), or does this not help in “weighting” what the model learns?
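
One simple thing to try (purely a sketch, assuming a dataframe with 'title' and 'abstract' columns; xxtitle/xxbody are made-up marker tokens, not part of fastai’s default special tokens) is to wrap the title in custom markers before concatenating it with the abstract, so the model at least has the chance to learn that text between those markers behaves differently:

def join_title_and_abstract(title, abstract):
    # xxtitle / xxbody are custom marker tokens, analogous in spirit to fastai's
    # xxmaj/xxup tokens; whether the model learns to exploit them is an empirical question.
    return f'xxtitle {title} xxbody {abstract}'

df['text'] = [join_title_and_abstract(t, a) for t, a in zip(df['title'], df['abstract'])]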