Lesson 4 Advanced Discussion ✅

I am also interested in how we can calculate feature importance for tabular data in fully connected networks.

The main one would be ‘beam search’.

4 Likes

No good reason - it’s just a little toy dataset that I didn’t really experiment with much.

1 Like

This is because the column where you have your labels is interpreted as a mix of numbers and strings by csv_reader, I’m guessing. I’ve pushed something in master that should fix this; while waiting for the next release, you should run df[label_cols] = df[label_cols].astype(str) to force the column to a string type.
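For example, a minimal sketch (assuming the labels live in a column called 'labels'):

label_cols = ['labels']  # hypothetical column name
df[label_cols] = df[label_cols].astype(str)  # cast the label column(s) to strings before building the DataBunch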

1 Like

Hello guys, I am trying to do multi-label classification on a small text dataset, but I’m getting this error. The learner completes training of the first epoch, but throws the error while validating. Can anyone help?

torch 1.0.0.dev20181120
fastai 1.0.28

data: a dataframe with two columns (one with the text, the other with the multi-labels)

The learner was able to train, but it throws the error during validation.

Trying to do text classification, I am getting this error:

TypeError: list indices must be integers or slices, not NoneType

dc = TextDataBunch.from_df(path = path, train_df = train_df, valid_df = valid_df,
                           text_cols = ['title'], label_cols = ['labels'], label_delim = '|')
learn.fit_one_cycle(5, slice(0.01))


        TypeError                                 Traceback (most recent call last)
    <ipython-input-37-9c1fbaf9d6de> in <module>()
    ----> 1 learn.fit_one_cycle(5, slice(0.01))

    /opt/anaconda3/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
         18     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
         19                                         pct_start=pct_start, **kwargs))
    ---> 20     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
         21 
         22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
        160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
        161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
    --> 162             callbacks=self.callbacks+callbacks)
        163 
        164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         92     except Exception as e:
         93         exception = e
    ---> 94         raise e
         95     finally: cb_handler.on_train_end(exception)
         96 

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         87             if hasattr(data,'valid_dl') and data.valid_dl is not None:
         88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
    ---> 89                                        cb_handler=cb_handler, pbar=pbar)
         90             else: val_loss=None
         91             if cb_handler.on_epoch_end(val_loss): break

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
         47     with torch.no_grad():
         48         val_losses,nums = [],[]
    ---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
         50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
         51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

    /opt/anaconda3/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
         63         self.update(0)
         64         try:
    ---> 65             for i,o in enumerate(self._gen):
         66                 yield o
         67                 if self.auto_update: self.update(i+1)

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
         45     def __iter__(self):
         46         "Process and returns items from `DataLoader`."
    ---> 47         for b in self.dl:
         48             y = b[1][0] if is_listy(b[1]) else b[1]
         49             if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)

    /opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
        612     def __next__(self):
        613         if self.num_workers == 0:  # same-process loading
    --> 614             indices = next(self.sample_iter)  # may raise StopIteration
        615             batch = self.collate_fn([self.dataset[i] for i in indices])
        616             if self.pin_memory:

    /opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py in __iter__(self)
        158     def __iter__(self):
        159         batch = []
    --> 160         for idx in self.sampler:
        161             batch.append(idx)
        162             if len(batch) == self.batch_size:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/text/data.py in __iter__(self)
         59     def __len__(self) -> int: return len(self.data_source)
         60     def __iter__(self):
    ---> 61         return iter(sorted(range_of(self.data_source), key=self.key, reverse=True))
         62 
         63 class SortishSampler(Sampler):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/text/data.py in <lambda>(t)
        244         dataloaders = [train_dl]
        245         for ds in datasets[1:]:
    --> 246             sampler = SortSampler(ds.x, key=lambda t: len(ds[t][0].data))
        247             dataloaders.append(DataLoader(ds, batch_size=bs, sampler=sampler, **kwargs))
        248         return cls(*dataloaders, path=path, collate_fn=collate_fn)

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
        413     def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':
        414         if isinstance(try_int(idxs), int):
    --> 415             if self.item is None: x,y = self.x[idxs],self.y[idxs]
        416             else:                 x,y = self.item   ,0
        417             if self.tfms:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
         80 
         81     def __getitem__(self,idxs:int)->Any:
    ---> 82         if isinstance(try_int(idxs), int): return self.get(idxs)
         83         else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
         84 

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in get(self, i)
        276         o = self.items[i]
        277         if o is None: return None
    --> 278         return self._item_cls(one_hot(o, self.c), [self.classes[p] for p in o], o)
        279 
        280     def reconstruct(self, t):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
        276         o = self.items[i]
        277         if o is None: return None
    --> 278         return self._item_cls(one_hot(o, self.c), [self.classes[p] for p in o], o)
        279 
        280     def reconstruct(self, t):

    TypeError: list indices must be integers or slices, not NoneType

Thank you @sgugger. This fixed the problem. Now I have another one:

learn.fit_one_cycle(1, 3e-2, moms=(0.8,0.7))

It gives an error at the validation stage; maybe the default accuracy metric is not appropriate:

~/anaconda2/envs/fastai.1/lib/python3.6/site-packages/fastai/metrics.py in accuracy(input, targs)
     37     input = input.argmax(dim=-1).view(n,-1)
     38     targs = targs.view(n,-1)
---> 39     return (input==targs).float().mean()
     40 
     41 def error_rate(input:Tensor, targs:Tensor)->Rank0Tensor:

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'other'

I try to fix this with:

learn.metrics = accuracy_thresh

But I get an error:

TypeError: 'function' object is not iterable

Edit: a list of metrics is expected. This works:

learn.metrics = [accuracy_thresh]
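As a side note, accuracy_thresh takes a thresh argument (0.5 by default), so if needed a custom threshold can be set with partial; a minimal sketch:

from functools import partial
# accuracy_thresh applies a sigmoid and thresholds each label independently;
# 0.4 here is just an example value, not a recommendation
learn.metrics = [partial(accuracy_thresh, thresh=0.4)]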

Maybe you need to give the proper vocabulary with vocab=data_lm.vocab? E.g.

dc = TextClasDataBunch.from_df(path = path, train_df = train_df, valid_df = valid_df,
                               text_cols = ['title'], label_cols = ['labels'], label_delim = '|',
                               vocab=data_lm.vocab)
learn = text_classifier_learner(dc, drop_mult=0.5)
learn.metrics = [accuracy_thresh]

Regarding Language Modelling:

@sgugger, @jeremy, and anyone else knowledgeable about NLP/language modelling

Two questions regarding transfer learning/NLP:

(1) I’m not sure if I missed it, but is there a minimum number of words needed in the domain-specific corpus to get decent results?

I’m trying to build a model for scientific writings (geosciences) and was wondering how much data I need to scrape. I plan to do topic modeling and want to try to predict the scientific subfield of a text at inference time (multi-class labeling)…

I think I’m now at 19 million words.

(2) I will have a lot of very specialized words (also plenty of Greek symbols, etc.) that might carry a lot of weight when deciding which area a text is from…

Will these be lost (max. vocab = 60,000)? Do I have to take special care to retain this stuff (and is it wise to do so)?

Cheers

If there is a minimum number of words, it’s certainly less than 19 million :wink: That’s more than what the pretrained model got on wikitext-103, I believe.
The vocabulary will take the 60,000 most common tokens by default. Keep in mind that your language model will never learn the meaning of a very rare word if it doesn’t see it at least a little. So if your special words are common in your corpus, they will be in the vocabulary and the model will understand their meaning during fine-tuning. Otherwise, they might not be useful.

You can try to play around with this maximum vocab size, but then you’ll probably need to have a custom loss function because a full softmax will be too heavy in terms of computation/memory.

I think wikitext-103 is closer to a billion words. I’d guess IMDb might be closer to 19m.

Thanks

That’s good to know… I ramped up to 25 million… let’s see what we get. Due to the very technical nature (scientific terminology and niche expressions) I am a bit doubtful whether this works as is… but we’ll find out I guess ;-)… I’d probably try to extend the vocab size. Are new words (there should be a lot) just added to the wiki vocab at the moment? Or is 60,000 the limit, so new words push out wiki words?

Thanks

A follow-up question.

Given that I have conference abstracts as documents, would it make sense to somehow tell the model that their title is actually very important (more condensed info)? Should I somehow mark this special sentence (all caps, a surrounding tag), or does this not help in “weighting” what the model learns?

If you make it a separate field in your CSV, then it’ll automatically be tagged for you when you import it.

1 Like

Hi @Jeremy

Thanks, but I don’t quite follow. I was thinking along the lines that the title content should probably carry more weight than the body of the abstract, as it’s the abstract of the abstract, if you will…

If my goal later is to do topic modeling, would it help to mark the title content (which I currently have in a separate column and merge with the body text column into a joined column) as more important for the model?

If I do, would that be at the language model learning step, or rather at the classifier training step…

Sorry if I ask the wrong questions - pretty new to NLP…

Put your title and doc in different fields in a CSV. Then it’ll be tagged by the dataset automatically. If the model finds that tag is more important, it’ll use it automatically.
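A minimal sketch of what that could look like (the column names 'title' and 'abstract' are assumptions); with several text_cols, the fastai tokenizer marks each column with a field token (xxfld 1, xxfld 2, …), which is the tag mentioned above:

data_clas = TextClasDataBunch.from_df(path, train_df=train_df, valid_df=valid_df,
                                      text_cols=['title', 'abstract'],
                                      label_cols=['labels'], label_delim='|',
                                      vocab=data_lm.vocab)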

Oh wow. That’s cool. I have to read up on the docs a bit more, it seems.

Looks like there’s something wrong with the example

In the section on prediction with tabular data, I’m getting an error.

You can set the batch size as follows (it’s in the updated course-v3)

bs = 24
data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=bs)
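The same keyword should work when loading the classifier data (a sketch; the folder name 'tmp_clas' is an assumption):

data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=bs)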

You can specify the settings of max_vocab and the language as follows.

txt_proc = [
    TokenizeProcessor(tokenizer=Tokenizer(lang='nl')),
    NumericalizeProcessor(min_freq=1, max_vocab=10000)
]

data_lm = (TextList.from_df(df, cols='text', processor=txt_proc)
            .random_split_by_pct(0.1)
            .label_for_lm()
            .databunch())

3 Likes

When we want to train the language model for our own dataset on top of the wiki103 LM, why don’t we have to align the vocab of our new dataset (IMDB) with the Wiki103 vocab? For example like this:
data_lm = (TextList.from_folder(path, vocab=data_lm.vocab)

Like we do when we want to use the classifier:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)

2 Likes

Thanks! This is great.