Lesson 4 In-Class Discussion ✅

How do we set max_vocab in TextLMDataBunch or with the data block API? I can get both to work perfectly for creating a databunch for a language model; I just can't limit the vocab without manually overriding the default in NumericalizeProcessor().

I have tried passing max_vocab to TextLMDataBunch() and TextList.from_csv(). I am assuming that it needs to be passed to

class NumericalizeProcessor(PreProcessor):

but in my case no arguments get passed to this class from

class LabelLists(ItemLists):
    def get_processors(self):
        procs_x,procs_y = listify(self.train.x._processor),listify(self.train.y._processor)
        xp = ifnone(self.train.x.processor, [p(ds=self.train.x) for p in procs_x])
        yp = ifnone(self.train.y.processor, [p(ds=self.train.y) for p in procs_y])
        return xp,yp

I have read the documentation but I cannot work out where I should be passing max_vocab. Can someone please point me in the right direction?

You can pass it in the kwargs when creating a TextDataBunch; for instance, in the from_df method it will be passed through and used to create an appropriate processor.
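For example, a hedged sketch of what that might look like (train_df, valid_df, the text_cols value and 30000 are placeholders, not from this thread):

# max_vocab travels along with the other kwargs and is used when the
# NumericalizeProcessor is created (placeholder DataFrames and values).
data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='text', max_vocab=30000)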

With the data block API, you have to override the default processor by creating a new one:

processor = [TokenizeProcessor(), NumericalizeProcessor(vocab=None, max_vocab=...)]

(note that if you pass an existing vocab, max_vocab will be ignored), then pass it in your TextList creation method.


Thank you snugger, using the data block API as you suggested worked perfectly:

data_lm = (TextList.from_csv(path,source_txt,cols=1,processor = [TokenizeProcessor(), NumericalizeProcessor(vocab=None, max_vocab=30000)])
.random_split_by_pct(0.1)
.label_for_lm()
.databunch())

I then tried

data_lm = TextDataBunch.from_csv(path,source_txt,text_cols=1, bs=32,max_vocab=30000)

to create the processor with max_vocab; however, when the kwargs (including max_vocab) are passed through, I get the following error:

207         collate_fn = partial(pad_collate, pad_idx=pad_idx, pad_first=pad_first)
208         train_sampler = SortishSampler(datasets[0].x, key=lambda t: len(datasets[0][t][0].data), bs=bs//2)
--> 209         train_dl = DataLoader(datasets[0], batch_size=bs//2, sampler=train_sampler, **kwargs)
210         dataloaders = [train_dl]
211         for ds in datasets[1:]:

TypeError: __init__() got an unexpected keyword argument 'max_vocab'

This doesn't matter because I can use the data block API, but if someone knows what I am doing wrong, could they point me in the right direction?

Ah, now I understand the PR that appeared a couple of days ago. Will have this fixed this morning.

I haven't searched for a solution yet because I've been busy, but if I find anything I'll post it here.

See this post for a possible fix to

TypeError: 'bool' object is not callable

Thanks

Doesn’t this also mean that the Amazon model did not correctly use a bias term in the model to remove this type of bias?

Am I getting an error because I am running on my CPU and not a GPU? I can’t figure out why I am getting these errors in lesson4-collab.ipynb

learn.lr_find()
learn.recorder.plot(skip_end=15)
> LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
> ---------------------------------------------------------------------------
> RuntimeError                              Traceback (most recent call last)
> <ipython-input-23-ebd3a191e924> in <module>()
> ----> 1 learn.lr_find()
>       2 learn.recorder.plot(skip_end=15)
> 
> ~/anaconda3/lib/python3.7/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
>      26     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
>      27     a = int(np.ceil(num_it/len(learn.data.train_dl)))
> ---> 28     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
>      29 
>      30 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:
> 
> ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
>     160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
>     161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
> --> 162             callbacks=self.callbacks+callbacks)
>     163 
>     164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:
> 
> ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
>      92     except Exception as e:
>      93         exception = e
> ---> 94         raise e
>      95     finally: cb_handler.on_train_end(exception)
>      96 
> 
> ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
>      82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
>      83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
> ---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
>      85                 if cb_handler.on_batch_end(loss): break
>      86 
> 
> ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
>      20 
>      21     if not loss_func: return to_detach(out), yb[0].detach()
> ---> 22     loss = loss_func(out, *yb)
>      23 
>      24     if opt is not None:
> 
> ~/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
>    1742     if size_average is not None or reduce is not None:
>    1743         reduction = _Reduction.legacy_get_string(size_average, reduce)
> -> 1744     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
>    1745 
>    1746 
> 
> ~/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in log_softmax(input, dim, _stacklevel, dtype)
>    1135         dim = torch.jit._unwrap_optional(dim)
>    1136     if dtype is None:
> -> 1137         ret = input.log_softmax(dim)
>    1138     else:
>    1139         _dtype = torch.jit._unwrap_optional(dtype)
> 
> RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

And this related one:

learn.fit_one_cycle(5, 5e-3)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-85-c1ed9d459fcc> in <module>()
    ----> 1 learn.fit_one_cycle(5, 5e-3)

    ~/anaconda3/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
         18     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
         19                                         pct_start=pct_start, **kwargs))
    ---> 20     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
         21 
         22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

    ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
        160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
        161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
    --> 162             callbacks=self.callbacks+callbacks)
        163 
        164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

    ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         92     except Exception as e:
         93         exception = e
    ---> 94         raise e
         95     finally: cb_handler.on_train_end(exception)
         96 

    ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
         82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
         83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
    ---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
         85                 if cb_handler.on_batch_end(loss): break
         86 

    ~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
         20 
         21     if not loss_func: return to_detach(out), yb[0].detach()
    ---> 22     loss = loss_func(out, *yb)
         23 
         24     if opt is not None:

    ~/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
       1742     if size_average is not None or reduce is not None:
       1743         reduction = _Reduction.legacy_get_string(size_average, reduce)
    -> 1744     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
       1745 
       1746 

    ~/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in log_softmax(input, dim, _stacklevel, dtype)
       1135         dim = torch.jit._unwrap_optional(dim)
       1136     if dtype is None:
    -> 1137         ret = input.log_softmax(dim)
       1138     else:
       1139         _dtype = torch.jit._unwrap_optional(dtype)

    RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

I have a very basic (dumb?) question: how do I connect vim (or another editor) to the fastai (or any git) repo? It must be simple, but I don't see any info on that…

I am trying to fit a regression model and I'm getting the error below.

Does the model see it as a multi-class classification problem, or is the problem something else?

TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'NoneType'>

data.train_ds.y
CategoryList (9224 items)
[23.249999761581396 23.249999761581396 1196993.5 1196993.5 … 90679.5 90679.5 42950.0 42950.0]
Path: .

data
DataBunch;
Train: LabelList
y: CategoryList (9224 items)
[23.249999761581396 23.249999761581396 1196993.5 1196993.5 … 90679.5 90679.5 42950.0 42950.0]
Path: .
x: TabularList (9224 items)
[0 1 2 3 … 9220 9221 9222 9223]
Path: .;
Valid: LabelList
y: CategoryList (200 items)
[1128409.5 567376.5 567376.5 1128409.5 … 496921.5 496921.5 338804.0 338804.0]
Path: .
x: TabularList (200 items)
[0 1 2 3 … 196 197 198 199]
Path: .;
Test: LabelList
y: CategoryList (200 items)
[0 0 0 0 … 0 0 0 0]
Path: .
x: TabularList (1423 items)
[0 1 2 3 … 1419 1420 1421 1422]
Path: .
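The y shown above is a CategoryList, which suggests the targets were read as classes rather than continuous values. A hedged sketch of how a regression label might be declared instead (df, cat_names, cont_names, procs, valid_idx and dep_var are placeholders, not from this post):

from fastai.tabular import *

# label_cls=FloatList makes the dependent variable a FloatList (regression)
# instead of a CategoryList (multi-class classification).
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList)
        .databunch(bs=64))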

You can create a new Terminal from the Jupyter UI.
It's in the top right corner.

Basic question, but how do you feed a simple text file to the pre-trained language model for training? It looks like in the example texts.csv file all the text is in one column. But what if you just have a bunch of text and there is no way to split it? Is there another method, rather than from_csv, for this use case?

data_lm = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .databunch(bs=2))
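If the text lives in plain .txt files rather than a CSV, one possible alternative is TextList.from_folder; a hedged sketch (assuming the files sit under path, with an illustrative split percentage and batch size):

data_lm = (TextList.from_folder(path)      # reads the raw .txt files under path
           .random_split_by_pct(0.1)       # hold out 10% for validation
           .label_for_lm()                 # for a language model the text is its own label
           .databunch(bs=48))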

Ignore the below, it was my mistake. If anyone else is getting a similar error, there is a chance you too have loaded the data for fine-tuning your language model with TextClasDataBunch instead of TextLMDataBunch (a sketch of the corrected loading is below).
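A minimal sketch of that correction (the CSV name and batch size are placeholders; the pretrained_fnames are the ones from the post below):

# Build the data for fine-tuning the language model with TextLMDataBunch,
# not TextClasDataBunch. 'texts.csv' and bs=48 are illustrative.
data_lm = TextLMDataBunch.from_csv(path_class, 'texts.csv', text_cols=1, bs=48)
learn = language_model_learner(data_lm, pretrained_fnames=[path_lang_model/'models/bestmodel_428511', path_lang_model/'models/dict'], drop_mult=0.3)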

My previous post is below.
I am getting these errors as well; are you trying to use your own pretrained weights?

I have trained a language model, and I want to load it using language_model_learner() to fine-tune it on a classification data set. Everything appears to load, but when I run the learning rate finder I get an error similar to yours.

This may just be because I am not sure how to save the dictionary. I had a look in the docs but couldn't find anything; should I just be using pickle directly, as below?

import pickle

with open(path_lang_model/'models/dict.pkl', "wb") as pickle_out:
    pickle.dump(data_lm.vocab.itos, pickle_out)   # save the vocab's itos list
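For completeness, a hedged sketch of reading that file back to rebuild the vocabulary (same hypothetical file name; Vocab is fastai's vocabulary class):

with open(path_lang_model/'models/dict.pkl', "rb") as pickle_in:
    itos = pickle.load(pickle_in)   # the saved itos list
vocab = Vocab(itos)                 # rebuild the fastai Vocab from it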

To create a new language_model_learner for classification using my own weights, first I load my newly processed classification data

data_class_lm = TextClasDataBunch.load(path_class, 'tmp_data_'+ source_txt_class, bs=48)

and then create a language_model_learner, passing the classification data and the previously trained language model's weights and dictionary.

learn = language_model_learner(data_class_lm, pretrained_fnames=[path_lang_model/'models/bestmodel_428511', path_lang_model/'models/dict'], drop_mult=0.3)

By this point I don’t receive any errors, but as soon as I run the learning rate finder

learn.lr_find()

I get an error similar to mayank4's:

ValueError: Expected input batch_size (1416) to match target batch_size (24).

where the target batch_size is always bs/2 (48/2 = 24).

Does anyone know what I am doing wrong? The full error output is below

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-118-d81c6bd29d71> in <module>
----> 1 learn.lr_find()

d:\dev\repos\fastai_v1\fastai\train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
     28     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     29     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 30     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     31 
     32 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

d:\dev\repos\fastai_v1\fastai\basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

d:\dev\repos\fastai_v1\fastai\basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

d:\dev\repos\fastai_v1\fastai\basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

d:\dev\repos\fastai_v1\fastai\basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     20 
     21     if not loss_func: return to_detach(out), yb[0].detach()
---> 22     loss = loss_func(out, *yb)
     23 
     24     if opt is not None:

D:\c_progs\Anaconda3\envs\test_fastai\lib\site-packages\torch\nn\functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   1669     if size_average is not None or reduce is not None:
   1670         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 1671     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   1672 
   1673 

D:\c_progs\Anaconda3\envs\test_fastai\lib\site-packages\torch\nn\functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1524     if input.size(0) != target.size(0):
   1525         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
-> 1526                          .format(input.size(0), target.size(0)))
   1527     if dim == 2:
   1528         return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

ValueError: Expected input batch_size (1416) to match target batch_size (24).

You have to save the archive in path=Path('/home/jupyter/.fastai/data/'), not in Path('data/'), which is where your notebook actually is.

What does the third value refer to?
[screenshot]


I wonder when we should use the collaborative filtering version and when the neural network version. If we only have user_id and movie_review, is collaborative filtering better than building a multi-layer neural network? What if we have user_id, movie_review, user_age, and user_country? Is the neural network better then, or can we make the collaborative filtering model more than two-dimensional? Which one works better in that case?

Why would you want to do that?

Collaborative filtering refers to the ability to fill in missing data from existing association patterns between two different entities, e.g. between users and products (the answer to the question "Who likes what?", as mentioned by Jeremy in one of the lectures). Tabular refers to the layout of these entities' features in the form of a table. The data columns of the entities may or may not be interdependent, either within an entity or across entities. I hope this helps.
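As a rough illustration of that "who likes what" setup, here is a minimal sketch in plain PyTorch (not the fastai implementation; all names and sizes are made up): each user and each item gets an embedding, and the predicted score is their dot product plus per-user and per-item biases.

import torch
from torch import nn

class DotProductCF(nn.Module):
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.u_emb = nn.Embedding(n_users, n_factors)   # user factors
        self.i_emb = nn.Embedding(n_items, n_factors)   # item factors
        self.u_bias = nn.Embedding(n_users, 1)           # per-user bias
        self.i_bias = nn.Embedding(n_items, 1)           # per-item bias

    def forward(self, users, items):
        dot = (self.u_emb(users) * self.i_emb(items)).sum(dim=1)
        return dot + self.u_bias(users).squeeze(1) + self.i_bias(items).squeeze(1)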

When we say deep learning needs a lot of data, we are talking about the number of observations, not the number of features. There are many examples of deep learning working successfully with only a few features but lots of data.