Lesson 4 Advanced Discussion ✅

I am also interested in how we can calculate feature importance for tabular data in fully connected networks.

The main one would be ‘beam search’.

4 Likes

No good reason - it’s just a little toy dataset that I didn’t really experiment with much.

1 Like

This is because the column where you have your labels is interpreted as a mix of numbers and strings by csv_reader, I’m guessing. I’ve pushed something in master that should fix this; while waiting for the next release, you should run df[label_cols] = df[label_cols].astype(str) to force the column to a string type.
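For example, a minimal sketch (assuming the labels live in a column called 'labels'):

label_cols = ['labels']  # hypothetical column name
df[label_cols] = df[label_cols].astype(str)  # cast the label column(s) to strings before building the DataBunch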

1 Like

Hello guys, I am trying to do multi-label classification on a small text dataset, but I’m getting this error. The learner completes training of the first epoch, but throws the error while validating. Can anyone help?

torch 1.0.0.dev20181120
fastai 1.0.28

data: a dataframe with two columns (one with the text, the other with the multi-labels)

The learner was able to train, but it throws the error during validation.

Trying to do text classification, I am getting this error:

TypeError: list indices must be integers or slices, not NoneType

dc = TextDataBunch.from_df(path = path, train_df = train_df, valid_df = valid_df,
                           text_cols = ['title'], label_cols = ['labels'], label_delim = '|')
learn.fit_one_cycle(5, slice(0.01))


        TypeError                                 Traceback (most recent call last)
    <ipython-input-37-9c1fbaf9d6de> in <module>()
    ----> 1 learn.fit_one_cycle(5, slice(0.01))

    /opt/anaconda3/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
         18     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
         19                                         pct_start=pct_start, **kwargs))
    ---> 20     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
         21 
         22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
        160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
        161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
    --> 162             callbacks=self.callbacks+callbacks)
        163 
        164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         92     except Exception as e:
         93         exception = e
    ---> 94         raise e
         95     finally: cb_handler.on_train_end(exception)
         96 

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         87             if hasattr(data,'valid_dl') and data.valid_dl is not None:
         88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
    ---> 89                                        cb_handler=cb_handler, pbar=pbar)
         90             else: val_loss=None
         91             if cb_handler.on_epoch_end(val_loss): break

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
         47     with torch.no_grad():
         48         val_losses,nums = [],[]
    ---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
         50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
         51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

    /opt/anaconda3/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
         63         self.update(0)
         64         try:
    ---> 65             for i,o in enumerate(self._gen):
         66                 yield o
         67                 if self.auto_update: self.update(i+1)

    /opt/anaconda3/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
         45     def __iter__(self):
         46         "Process and returns items from `DataLoader`."
    ---> 47         for b in self.dl:
         48             y = b[1][0] if is_listy(b[1]) else b[1]
         49             if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)

    /opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
        612     def __next__(self):
        613         if self.num_workers == 0:  # same-process loading
    --> 614             indices = next(self.sample_iter)  # may raise StopIteration
        615             batch = self.collate_fn([self.dataset[i] for i in indices])
        616             if self.pin_memory:

    /opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py in __iter__(self)
        158     def __iter__(self):
        159         batch = []
    --> 160         for idx in self.sampler:
        161             batch.append(idx)
        162             if len(batch) == self.batch_size:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/text/data.py in __iter__(self)
         59     def __len__(self) -> int: return len(self.data_source)
         60     def __iter__(self):
    ---> 61         return iter(sorted(range_of(self.data_source), key=self.key, reverse=True))
         62 
         63 class SortishSampler(Sampler):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/text/data.py in <lambda>(t)
        244         dataloaders = [train_dl]
        245         for ds in datasets[1:]:
    --> 246             sampler = SortSampler(ds.x, key=lambda t: len(ds[t][0].data))
        247             dataloaders.append(DataLoader(ds, batch_size=bs, sampler=sampler, **kwargs))
        248         return cls(*dataloaders, path=path, collate_fn=collate_fn)

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
        413     def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':
        414         if isinstance(try_int(idxs), int):
    --> 415             if self.item is None: x,y = self.x[idxs],self.y[idxs]
        416             else:                 x,y = self.item   ,0
        417             if self.tfms:

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
         80 
         81     def __getitem__(self,idxs:int)->Any:
    ---> 82         if isinstance(try_int(idxs), int): return self.get(idxs)
         83         else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
         84 

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in get(self, i)
        276         o = self.items[i]
        277         if o is None: return None
    --> 278         return self._item_cls(one_hot(o, self.c), [self.classes[p] for p in o], o)
        279 
        280     def reconstruct(self, t):

    /opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
        276         o = self.items[i]
        277         if o is None: return None
    --> 278         return self._item_cls(one_hot(o, self.c), [self.classes[p] for p in o], o)
        279 
        280     def reconstruct(self, t):

    TypeError: list indices must be integers or slices, not NoneType

Thank you @sgugger. This fixed the problem. Now I have another one:

learn.fit_one_cycle(1, 3e-2, moms=(0.8,0.7))

It gives an error at the validation stage; maybe the default accuracy metric is not appropriate:

~/anaconda2/envs/fastai.1/lib/python3.6/site-packages/fastai/metrics.py in accuracy(input, targs)
     37     input = input.argmax(dim=-1).view(n,-1)
     38     targs = targs.view(n,-1)
---> 39     return (input==targs).float().mean()
     40 
     41 def error_rate(input:Tensor, targs:Tensor)->Rank0Tensor:

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'other'

I try to fix this with:

learn.metrics = accuracy_thresh

But I get an error:

TypeError: 'function' object is not iterable

Edit: a list of metrics is expected. This works:

learn.metrics = [accuracy_thresh]
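As a side note, accuracy_thresh takes a thresh argument (0.5 by default), so if needed a custom threshold can be set with partial; a minimal sketch:

from functools import partial
# accuracy_thresh applies a sigmoid and thresholds each label independently;
# 0.4 here is just an example value, not a recommendation
learn.metrics = [partial(accuracy_thresh, thresh=0.4)]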

Maybe you need to give the proper vocabulary with vocab=data_lm.vocab? E.g.

dc = TextClasDataBunch.from_df(path = path, train_df = train_df, valid_df = valid_df,
                               text_cols = ['title'], label_cols = ['labels'], label_delim = '|',
                               vocab=data_lm.vocab)
learn = text_classifier_learner(dc, drop_mult=0.5)
learn.metrics = [accuracy_thresh]

Regarding Language Modelling:

@sgugger, @jeremy, and anyone else knowledgeable about NLP/language modelling

Two questions regarding transfer learning/NLP:

(1) I’m not sure if I missed it, but is there a minimum number of words needed in the domain-specific corpus to get decent results?

I’m trying to build a model for scientific writings (geosciences) and was wondering how much data I need to scrape. I plan to do topic modeling and want to try to predict the scientific subfield of a text at inference time (multi-class labeling)…

I think I’m now at 19 million words.

(2) I will have a lot of very specialized words (also plenty of Greek symbols, etc.) that might carry a lot of weight when deciding which area a text is from…

Will these be lost (max. vocab = 60,000)? Do I have to take special care to retain this stuff (and is it wise to do so)?

Cheers

If there is a minimum number of words, it’s certainly less than 19 million :wink: That’s more than what the pretrained model got on wikitext-103, I believe.
The vocabulary will take the 60,000 most common tokens by default. Keep in mind that your language model will never learn the meaning of a very rare word if it doesn’t see it at least a little. So if your special words are common in your corpus, they will be in the vocabulary and the model will understand their meaning during fine-tuning. Otherwise, they might not be useful.

You can try to play around with this maximum vocab size, but then you’ll probably need to have a custom loss function because a full softmax will be too heavy in terms of computation/memory.

I think wikitext-103 is closer to a billion words. I’d guess IMDb might be closer to 19m.

Thanks

That’s good to know… I ramped up to 25 million… let’s see what we get. Due to the very technical nature (scientific terminology and niche expressions) I am a bit doubtful whether this works as is… but we’ll find out I guess ;-)… I’d probably try to extend the vocab size. Are new words (there should be a lot) just added to the wiki vocab at the moment? Or is 60,000 the limit, so new words push out wiki words?

Thanks

A follow-up question.

Given that I have conference abstracts as documents, would it make sense to somehow tell the model that their title is actually very important (more condensed info)? Should I somehow mark this special sentence (all caps, a surrounding tag), or does this not help in “weighting” what the model learns?

If you make it a separate field in your CSV, then it’ll automatically be tagged for you when you import it.

1 Like

Hi @Jeremy

Thanks, but I don’t quite follow. I was thinking along the lines that the title content should probably carry more weight than the body of the abstract, as it’s the abstract of the abstract, if you will…

If my goal later is to do topic modeling, would it help to mark the title content (which I currently have in a separate column and merge with the body text column into a joined column) as more important for the model?

If I do, would that be at the language model learning step, or rather at the classifier training step…

Sorry if I ask the wrong questions - pretty new to NLP…

Put your title and doc in different fields in a CSV. Then it’ll be tagged by the dataset automatically. If the model finds that tag is more important, it’ll use it automatically.
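A minimal sketch of what that could look like (the column names 'title' and 'abstract' are assumptions); with several text_cols, the fastai tokenizer marks each column with a field token (xxfld 1, xxfld 2, …), which is the tag mentioned above:

data_clas = TextClasDataBunch.from_df(path, train_df=train_df, valid_df=valid_df,
                                      text_cols=['title', 'abstract'],
                                      label_cols=['labels'], label_delim='|',
                                      vocab=data_lm.vocab)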

Oh wow. That’s cool. I have to read up on the docs a bit more, it seems.

Looks like there’s something wrong with the example

In the section on prediction with tabular data, I’m getting an error.

You can set the batch size as follows (it’s in the updated course-v3)

bs = 24
data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=bs)
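The same keyword should work when loading the classifier data (a sketch; the folder name 'tmp_clas' is an assumption):

data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=bs)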

You can specify the settings of max_vocab and the language as follows.

txt_proc = [
    TokenizeProcessor(tokenizer=Tokenizer(lang='nl')),
    NumericalizeProcessor(min_freq=1, max_vocab=10000)
]

data_lm = (TextList.from_df(df, cols='text', processor=txt_proc)
            .random_split_by_pct(0.1)
            .label_for_lm()
            .databunch())

3 Likes

When we want to train the language model for our own dataset on top of the wiki103 LM, why don’t we have to align the vocab of our new dataset (IMDB) with the Wiki103 vocab? For example like this:
data_lm = (TextList.from_folder(path, vocab=data_lm.vocab)

Like we do when we want to use the classifier:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)

2 Likes

Thanks! This is great.