Regression using Fine-tuned Language Model

Hello,

As part of a project, I’m working on the Mercari Price Suggestion Challenge, where the objective is to predict the price of a product given information about it. While the given details also include categorical variables, I’m only concentrating on the name and item_description columns, which contain free text (I will be using the other variables in the next stage). As the objective is to predict the price of the product, this is a regression problem.

What I have done:

  • Trained an LM on this data using FastAI’s API by fine-tuning the pre-trained WT103 LM
  • Used only the name and item_description columns as part of the data
  • Created a dummy “classification” task by adding a dummy label column with random binary labels, and tested it by following the docs’ imdb_sample example to make sure the API works on this data (and it does!).

The next step is to actually set up and solve the regression problem. This entails the following:

  1. Create a DataBunch, similar to TextClasDataBunch, that formats the data such that each batch contains the price values in the y variable corresponding to each x
  2. Create an RNNLearner for regression, similar to RNNLearner.Classifier
  3. Set up a new loss function for the learner, namely the root mean squared log error specified in the evaluation
  4. Replace the “head” of the model with a layer such that it outputs a single value. In particular, the current model architecture is:
RNNLearner(data=<fastai.text.data.TextClasDataBunch object at 0x7fd1a390f9e8>, model=SequentialRNN(
  (0): MultiBatchRNNCore(
    (encoder): Embedding(60093, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(60093, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=2, bias=True)
    )
  )
), opt_func=functools.partial(<class 'torch.optim.adam.Adam'>, betas=(0.9, 0.99)), loss_func=<function cross_entropy at 0x7fd34695ce18>, metrics=[<function accuracy at 0x7fd340669488>], true_wd=True, bn_wd=True, wd=0.01, train_bn=True, path=PosixPath('data/price-pred'), model_dir='models', callback_fns=[<class 'fastai.basic_train.Recorder'>], callbacks=[RNNTrainer(learn=RNNLearner(data=<fastai.text.data.TextClasDataBunch object at 0x7fd1a390f9e8>, model=SequentialRNN(
  (0): MultiBatchRNNCore(
    (encoder): Embedding(60093, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(60093, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=2, bias=True)
    )
  )
), opt_func=functools.partial(<class 'torch.optim.adam.Adam'>, betas=(0.9, 0.99)), loss_func=<function cross_entropy at 0x7fd34695ce18>, metrics=[<function accuracy at 0x7fd340669488>], true_wd=True, bn_wd=True, wd=0.01, train_bn=True, path=PosixPath('data/price-pred'), model_dir='models', callback_fns=[<class 'fastai.basic_train.Recorder'>], callbacks=[...], layer_groups=[Sequential(
  (0): Embedding(60093, 400, padding_idx=1)
  (1): EmbeddingDropout(
    (emb): Embedding(60093, 400, padding_idx=1)
  )
), Sequential(
  (0): WeightDropout(
    (module): LSTM(400, 1150)
  )
  (1): RNNDropout()
), Sequential(
  (0): WeightDropout(
    (module): LSTM(1150, 1150)
  )
  (1): RNNDropout()
), Sequential(
  (0): WeightDropout(
    (module): LSTM(1150, 400)
  )
  (1): RNNDropout()
), Sequential(
  (0): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=2, bias=True)
    )
  )
)]), bptt=70, alpha=2.0, beta=1.0, adjust=False)], layer_groups=[Sequential(
  (0): Embedding(60093, 400, padding_idx=1)
  (1): EmbeddingDropout(
    (emb): Embedding(60093, 400, padding_idx=1)
  )
), Sequential(
  (0): WeightDropout(
    (module): LSTM(400, 1150)
  )
  (1): RNNDropout()
), Sequential(
  (0): WeightDropout(
    (module): LSTM(1150, 1150)
  )
  (1): RNNDropout()
), Sequential(
  (0): WeightDropout(
    (module): LSTM(1150, 400)
  )
  (1): RNNDropout()
), Sequential(
  (0): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=2, bias=True)
    )
  )
)])

I’m thinking that I need to replace Linear(in_features=50, out_features=2, bias=True) with Linear(in_features=50, out_features=1, bias=True) or something similar.
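Concretely, something along these lines is what I have in mind for steps 3 and 4 (a rough sketch only: the layer indices come from the printout above, rmsle is my own hedged definition of the competition metric rather than a fastai function, and whether the layer can simply be swapped in place like this is an assumption):

import torch
import torch.nn as nn
import torch.nn.functional as F

def rmsle(preds, targets):
    "Root mean squared log error (the competition metric); assumes non-negative values."
    return torch.sqrt(F.mse_loss(torch.log1p(preds.squeeze(-1)), torch.log1p(targets.float())))

def to_regression_head(model):
    "Swap the 2-class output layer for a single-output layer (indices from the architecture above)."
    model[1].layers[6] = nn.Linear(in_features=50, out_features=1, bias=True)
    return model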

There is a thread from the part2 course this year that talks about a similar problem. However, I didn’t really understand how to use that knowledge in the v1 library.

I’m looking for the easiest way to implement this. I would be grateful for any pointers on how to use/tweak the library in achieving my objective.

Thanks.


I’ve been working on the 1st step of creating a suitable DataBunch for the LM regression problem by looking at the source code in fastai/text/data.py. I initially thought that I could write a separate class similar to TextClasDataBunch for the regression problem, but as I started looking deeper into the code, it seems that it might be more complicated than that. From my understanding, a TextDataset is eventually created regardless of which method we use to create a DataBunch. I noticed that along with creating the dataset for the DataBunch, the loss function of the model is also set in this class:

self.loss_func = F.cross_entropy if len(self.label_cols) <= 1 else F.binary_cross_entropy_with_logits

So it seems that eventually that needs to be modified for the regression problem as well. My question is: should I go ahead and create an entirely new class similar to TextDataset, inherited from BaseTextDataset, and then create a custom DataBunch that utilizes the newly created class?

There’s no need to create a custom DataBunch; you can use that one with any kind of Dataset. The same goes for ImageDataBunch, which only subclasses DataBunch to add nice visualization methods.

Just wanted to give an update:

For setting up the data for a regression problem all I had to do was these steps:

  1. Pass n_labels=0. This has to be done so that label_cols is initialized to an empty list and tokenize does not throw an error.
  2. Set self.loss_func = F.mse_loss (a rough sketch of the idea is below).
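One way to express those two changes (a minimal, hypothetical sketch; the actual copied-and-renamed classes live in the gist mentioned below, and the constructor/import details here are illustrative only):

import torch.nn.functional as F
from fastai.text import TextDataset    # import path assumed

class RegDataset(TextDataset):          # hypothetical name, echoing the renaming in the gist
    def __init__(self, *args, **kwargs):
        kwargs['n_labels'] = 0          # empty label_cols so tokenize doesn't throw an error
        super().__init__(*args, **kwargs)
        self.loss_func = F.mse_loss     # regression loss instead of cross entropy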

These are the changes required in the TextDataset class. The TextDataBunch class did not require any changes. Here is a gist with both classes (renamed). I tried to iterate over train_dl; however, this throws an error:

itr = iter(dl.train_dl)
next(itr)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-6693cc261707> in <module>
----> 1 next(itr)

~/fastai/fastai/basic_data.py in __iter__(self)
     81     def __iter__(self):
     82         "Process and returns items from `DataLoader`."
---> 83         for b in self.dl: yield self.proc_batch(b)
     84 
     85     def one_batch(self)->Collection[Tensor]:

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638 
    639     next = __next__  # Python 2 compatibility

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660 

TypeError: Traceback (most recent call last):
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/su0/fastai/fastai/torch_core.py", line 91, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 234, in default_collate
    raise TypeError((error_msg.format(type(batch[0]))))
TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'reg_data.RegDataset'>

My next step is to figure out this error.

The problem was that I directly instantiated the TextDataBunch and did not transform it in an appropriate way. So I copied over the TextClasDataBunch, used that, and it worked. However, I believe the way that class is written, the y variables are all set to integers representing labels. This is done by the collate_fn function, which is pad_collate. The head of my dataset is:

sample_train.head()

   train_id  name                                       price    item_description
0  73356     New Turquoise Patina Boho Bracelet         3.13549  Stunning piece is embellished with faux turquo...
1  346137    Set of 3 Victorian themed letter openers   3.29584  These Victorian themed letter openers are beau...
2  383913    Boucheron Jaipur eau de parfum             3.46574  Warm spicy man scent sprayed twice
3  110810    KOURT K KYLIE LIP KIT FREE SHIP            2.70805  The #KylieCosmetics LipKit is your secret weap...
4  692856    NWT Victoria's Secret cooler bag           3.29584  New with tags Victoria's Secret Large insulate...

So, I would expect the first 5 values of the y variable to be the values in the price column (apologies for the unorganized output; I just copied and pasted the pandas dataframe output). However, the output I get when I iterate over the training dataloader is:

itr = iter(db.train_dl)
next(itr)

[tensor([[   2,    1,    1,  ...,    1,    1,    1],
         [   6,    2,    1,  ...,    1,    1,    1],
         [  59,    6,    2,  ...,    1,    1,    1],
         ...,
         [  81,   41,    3,  ...,   21,  425,  632],
         [  18, 4143, 1268,  ..., 1429,  109,    9],
         [ 122,    9,    4,  ...,  152,   59,    9]], device='cuda:0'),
 tensor([2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 1,
         3, 2, 3, 2, 3, 2, 3, 3], device='cuda:0')]

I’m not sure where the values of the 2nd tensor are coming from, but I believe the answer lies in the collate_fn function, pad_collate, which is where I’m headed next.
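If pad_collate does turn out to be the culprit, a float-preserving variant might look something like this (a sketch only, loosely modeled on what a padding collate does; this is not the library’s actual pad_collate code):

import torch

def pad_collate_float(samples, pad_idx=1, pad_first=True):
    "Pad the token-id sequences to equal length and keep the targets as floats."
    max_len = max(len(s[0]) for s in samples)
    res = torch.full((len(samples), max_len), pad_idx, dtype=torch.long)
    for i, (x, _) in enumerate(samples):
        x = torch.as_tensor(x, dtype=torch.long)
        if pad_first: res[i, -len(x):] = x
        else:         res[i, :len(x)]  = x
    ys = torch.tensor([float(s[1]) for s in samples])   # keep floats, no cast to LongTensor
    return res, ys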

Note to the admins: I’m using this thread as a sort of self-documentation of my debugging process (and maybe it might help others). I hope that’s OK.


It’s completely okay :wink:
The second tensor is your targets, but it’s true that pad_collate casts them to ints for now, so what you see are probably the rounded values of your targets. I’m in the process of changing a lot of things in text and will try to fix that.

Thank you for your reply.

That’s what I thought as well, although I’m curious how the values ended up as they have. The first 5 values of the targets from sample_train.head() are [3.13549, 3.29584, 3.46574, 2.70805, 3.29584]. However, the first 5 values of the first iteration of the target tensors are [2, 2, 2, 2, 2]. Even if a floor function were applied, I would expect those values to be [3, 3, 3, 2, 3]. Two possible explanations:

  1. Either I don’t understand the rounding process
  2. The first iteration does not actually correspond to the values of the head that I’m seeing.

Perhaps you can enlighten me on that.

Are these breaking changes where I might have to run and create the LM again?

Thanks.

It’s only changes in the data structure; you won’t have to retrain any model (though the functions to create them will change a bit). If you’re only using the high-level factory methods of TextDataBunch, you won’t notice much of a change.

For your other question, the batches are loaded according to the sortish sampler, which may explain your differences.

Very weird indeed! The rounding should rightfully give you [3,3,3,3,3], rather than [2,2,2,2,2]

[round(i) for i in [3.13549, 3.29584, 3.46574, 2.70805, 3.29584]]

[3, 3, 3, 3, 3]

Given that you have a series of 3s, it seems your smallest label, from what you provided, is 1, and

tensor([2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 1,
         3, 2, 3, 2, 3, 2, 3, 3], device='cuda:0')]

so some fastai code, such as the snippet below, could have taken the minimum label, which is 1, and used it to do trn_labels -= min_lbl:

min_lbl = trn_labels.min() # smallest label is 1
trn_labels -= min_lbl # shift all labels so that smallest label is 0 for training
val_labels -= min_lbl # shift all labels so that smallest label is 0 for validation
c = int(trn_labels.max()) + 1  # number of classes

But it’s just a crude guess, based on what you’ve provided.

Interesting analysis! I could certainly see how the rounding could’ve worked as you mentioned.

An extra bit of information: during data prep, I removed prices that were < $3 (this is before np.log1p) from the dataset. That means after np.log1p the minimum value of the price column would be np.log1p(3) = 1.38629. I’m curious whether this has any bearing on the values of the targets I’m seeing.

Given that round(np.log1p(3)) = 1, the minimum value does indeed round down to 1.
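A quick check of that arithmetic (numpy sketch):

import numpy as np

np.log1p(3)          # 1.3862943611198906
round(np.log1p(3))   # rounds down to 1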

Since there have been some sweeping changes recently in the FastAI library with the introduction of the data_block API, I decided to start over so that I keep up with the latest version of the library in my work. While it’s possible that more changes are coming, I have started working on learning a new LM based on the lesson3-imdb notebook.

The major differences between the IMDB data and my data are:

  1. My problem is a regression problem
  2. My data have multiple fields (namely name and item_description)
  3. I am using the from_df method instead of the from_folder method

Following lesson 3, I tried to create my DataBunch ready for learning using:

data = (TextFileList.from_df(sample, col=['name', 'item_description'], path=PATH)
       .label_const(0)
       .random_split_by_pct()
       .datasets()
       .tokenize()
       .numericalize()
       .databunch(TextLMDataBunch))

which led to the following error:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/process.py", line 191, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/process.py", line 191, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/su0/fastai/fastai/text/transform.py", line 97, in _process_all_1
    return [self.process_text(t, tok) for t in texts]
  File "/home/su0/fastai/fastai/text/transform.py", line 97, in <listcomp>
    return [self.process_text(t, tok) for t in texts]
  File "/home/su0/fastai/fastai/text/transform.py", line 90, in process_text
    for rule in self.rules: t = rule(t)
  File "/home/su0/fastai/fastai/text/transform.py", line 65, in fix_html
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
"""

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-38-b0d72d7b760e> in <module>
      1 data = (TextFileList.from_df(sample, col=['name', 'item_description'], path=PATH)
----> 2        .label_const(0)
      3        .random_split_by_pct()
      4        .datasets()
      5        .tokenize()

~/fastai/fastai/text/data.py in tokenize(self, tokenizer, chunksize)
     44     def tokenize(self, tokenizer:Tokenizer=None, chunksize:int=10000):
     45         "Tokenize `self.datasets` with `tokenizer` by bits of `chunksize`."
---> 46         self.datasets = [ds.tokenize(tokenizer, chunksize) for ds in self.datasets]
     47         return self
     48 

~/fastai/fastai/text/data.py in <listcomp>(.0)
     44     def tokenize(self, tokenizer:Tokenizer=None, chunksize:int=10000):
     45         "Tokenize `self.datasets` with `tokenizer` by bits of `chunksize`."
---> 46         self.datasets = [ds.tokenize(tokenizer, chunksize) for ds in self.datasets]
     47         return self
     48 

~/fastai/fastai/text/data.py in tokenize(self, tokenizer, chunksize)
    189         tokens = []
    190         for i in progress_bar(range(0,len(self.x),chunksize), leave=False):
--> 191             tokens += tokenizer.process_all(self.x[i:i+chunksize])
    192         return TokenizedDataset(tokens, self.y, self.classes, encode_classes=False)
    193 

~/fastai/fastai/text/transform.py in process_all(self, texts)
    101         if self.n_cpus <= 1: return self._process_all_1(texts)
    102         with ProcessPoolExecutor(self.n_cpus) as e:
--> 103             return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
    104 
    105 class Vocab():

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
    474     careful not to keep references to yielded objects.
    475     """
--> 476     for element in iterable:
    477         element.reverse()
    478         while element:

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

AttributeError: 'numpy.ndarray' object has no attribute 'replace'

The culprit here seems to be this line:

  File "/home/su0/fastai/fastai/text/transform.py", line 65, in fix_html
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(

which is located in the fix_html function in transform.py, which takes a string as an argument. However, from the error it seems it’s getting a numpy array. I’m not sure where the error might be.

Seems like it’s trying to do a clean-up of the text (like fixup in the DL2 imdb notebook),

re1 = re.compile(r'  +')

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

but at that point the text has already been tokenized and numericalized.

data = (TextFileList.from_df(sample, col=['name', 'item_description'], path=PATH)
       .label_const(0)
       .random_split_by_pct()
       .datasets()
       .tokenize()   
       .numericalize()
       .databunch(TextLMDataBunch)) <<== text already tokenized and numericalized
                                         so fixup() can't do its job

Sorry, not the most helpful answer

Thanks for the reply. Actually, when I ran this step by step (as opposed to one line of code that does all of the data block operations), I noticed that the error actually kicked in at the tokenize function, and the execution never got past that. So maybe I need to dig into that more.

Seems like there are two tokenize() calls: one made by you and the other made by TextDataBunch. TextLMDataBunch inherits from TextDataBunch,

class TextDataBunch(DataBunch):
    @classmethod
    def from_df(cls, path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:Optional[DataFrame]=None,
                tokenizer:Tokenizer=None, vocab:Vocab=None, classes:Collection[str]=None, text_cols:IntsOrStrs=1,
                label_cols:IntsOrStrs=0, label_delim:str=None, **kwargs) -> DataBunch:
        "Create a `TextDataBunch` from DataFrames."
        dfs = [train_df, valid_df] if test_df is None else [train_df, valid_df, test_df]
        src = TextSplitData(path, *[TextLabelList.from_df(path, df, text_cols, label_cols, label_delim) for df in dfs])
        return cls.create_from_split_ds(src.datasets(classes=classes), vocab, **kwargs)

which returns cls.create_from_split_ds

class TextDataBunch(DataBunch):
    @classmethod
    def create_from_split_ds(cls, dss:TextSplitDatasets, vocab:Vocab=None, tokenizer:Tokenizer=None,
                             chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, **kwargs)->'TextDataBunch':
        return (dss.tokenize(tokenizer, chunksize=chunksize)
                   .numericalize(vocab, max_vocab=max_vocab, min_freq=min_freq)
                   .databunch(cls, **kwargs))

which has its own tokenize and numericalize in dss.tokenize(tokenizer, chunksize=chunksize).numericalize(vocab, max_vocab=max_vocab, min_freq=min_freq).

Do post when you resolve it. I’m interested to know :smile:

I’m confused. Are you saying that the tokenize function that I call explicitly results in calls to two tokenize functions, but maybe the 2nd throws an error because we already have tokenized output?

As mentioned earlier, I follow these individual lines:

data = TextFileList.from_df(sample, col=['name', 'item_description'], path=PATH)
data = data.label_const(0)
data = data.random_split_by_pct()
data = data.datasets()
data = data.tokenize()

And the error gets thrown at the last step when I call data.tokenize, even before I call data.numericalize. I am going to instantiate my own tokenizer with n_cpus=1, because having multiple processes is throwing off my debugging process. I will report back on how this goes. It might be helpful if one of the devs shed some light on this problem.
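Roughly what I plan to try (a sketch; the tokenize signature is taken from the traceback above, and the Tokenizer defaults are assumed to be fine otherwise):

from fastai.text import Tokenizer

tok = Tokenizer(n_cpus=1)    # single process, so worker tracebacks stay readable
data = data.tokenize(tok)    # same tokenize step as above, just easier to debug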

Got a feeling v1 would do these things for you automatically. After I install v1 I’ll play around with your scenario too.

I don’t normally tag the devs, as I’m sure they are really busy working on the library and getting bombarded with questions. However, I’m really stuck here with the continuous changes in the data_block API. I’d be grateful if @sgugger could shed some light on some of the questions I have:

  1. Moving forward (assuming I am starting from scratch), what is your suggestion: should I use the factory methods for my data (pre)processing, or should I use the data_block API? The docs for the text application use the factory methods, so I’m not sure which way I should proceed.
  2. What is the difference between TextList and TextFilesList?
  3. I’m following the lesson3-imdb notebook with my own dataset:
data_lm = TextList.from_csv(PATH, 'lm-texts.csv', col=['name', 'item_description'])
data_lm = data_lm.random_split_by_pct(valid_pct=0.1) # stuck here for now
data_lm = data_lm.label_for_lm()
data_lm = data_lm.databunch()

Currently, the code is stuck on random_split_by_pct. I’m not sure why splitting is taking this long, as it did not previously.

I did the same steps on a smaller sample of 1,000 data points and it ran quickly. The original size of the dataset is 5,635,745 rows. So is it just a matter of size, and it’s just processing rather than actually stuck?

Thank you for your help.

  1. You should learn how to use the data block API as it’s what will give you the most flexibility in the end. The factory methods are there for a beginner who quickly wants to get started, but you’ll never be able to handle your custom datasets with them.
  2. TextFilesList is for when your texts are in files (like the full imdb) and not in a csv or a dataframe. I’ll probably remove it today as it’s not really needed with our latest development, but we had to have something that worked for the last course :wink:
  3. It’s very weird to be stuck there, as this only does a split. Note that the real preprocessing that begins after label_for_lm (tokenization and numericalization) will take forever with 5,635,745 articles and will probably run you out of RAM, so you should probably process it in smaller chunks.

Thank you for your very quick reply; the quick help I can get from the devs is what makes using FastAI so great!

I agree; however, I have actually tokenized this before with the earlier APIs without any problems (and I also have access to a machine that has 376G of RAM :wink:).

Having said that, is there an example available of how to do processing in chunks?
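For context, something along these lines is what I have in mind (a rough pandas sketch of the idea, nothing fastai-specific; the function name and chunk size are made up for illustration):

import numpy as np
import pandas as pd

def process_in_chunks(df: pd.DataFrame, chunk_rows: int = 100_000):
    "Run per-chunk preprocessing over a large dataframe instead of all rows at once."
    out = []
    n_chunks = max(1, len(df) // chunk_rows)
    for chunk in np.array_split(df, n_chunks):
        out.append(chunk)   # placeholder: tokenization/numericalization would go here
    return pd.concat(out)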

I ran a debug magic and the execution gets stuck on this line:

train_idx = [i for i in range_of(self.items) if i not in valid_idx]

in the split_by_idx function in the data_block.py file. Again, not really sure why this would happen, since as the dev pointed out this only does a split.

I’ll keep trying!