SentencePiece

My guess is that this might be because you are using the pretrained WikiText model for the forwards pass, while the backwards model has to be learned more or less from scratch on the IMDb sample, but that is just a thought.

I loved the NLP class! You all do a truly amazing job! I love your teaching style and all the tips you share. I am sure you get praise all the time, but you have changed my life. I recently started my first data science job (unpaid, but still) with a startup, and we used your notebooks to create a Dutch LM and classifier that is showing some great results (LM and blog post to follow). Cheers to you and @rachel for all your hard work and effort for the data science community! Thank you!

Jeremy, my apologies for missing that! I’m slowly getting through them all as there is so much to learn! Thank you very much for the amazing resources :slight_smile: When I ran the full IMDb set I did see the ~30% accuracy we want from language models.

Daniel, that would make sense.

Thank you both very much for the help!!! :slight_smile:

Oh there’s absolutely no apology required - there are a lot of materials around and I’m more than happy to help navigate them as needed.

Thanks Jeremy!

[ EDIT 2 ] The problem was not the one explained in EDIT 1. It came from the number of tokens in the articles saved to create my 100-million-token corpus (about 650k tokens per article). I do not know why, but SentencePiece did not like such a large number of tokens per article. I then created another 100-million-token corpus with a smaller article length (and therefore more articles), and SentencePiece worked.
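
For reference, here is a minimal sketch of what I mean by capping the article length; the helper, the 20k cap and the plain whitespace token counting are only an illustration, not the exact script I used (only the 650k and 100-million figures above are real):

from pathlib import Path

def write_corpus(articles, dest, max_tokens_per_doc=20_000, target=100_000_000):
    "Save articles capped at max_tokens_per_doc whitespace tokens until ~target tokens are written."
    dest = Path(dest); dest.mkdir(parents=True, exist_ok=True)
    total = n = 0
    for art in articles:                          # articles: iterable of raw article strings
        words = art.split()[:max_tokens_per_doc]  # cap each article's length
        (dest/f'doc_{n:06d}.txt').write_text(' '.join(words), encoding='utf8')
        total += len(words); n += 1
        if total >= target: break
    return total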


[ EDIT 1 ] I tested with a corpus in English and I did not get an error. I guess the problem comes from line 427 in the file text/data.py:

with open(raw_text_path, 'w') as f: f.write("\n".join(texts))

As the raw_text_path file in my case contains French words (i.e., words with accents), I think the open() call should have the argument encoding="utf8". cc @sgugger


Hello.
I'm testing SentencePiece on a small French dataset (20 text files of 1,000,000 characters each, 6.4 MB in total). I'm using fastai 1.0.57 on GCP.
When I try to create the databunch, I get the following error (the call I use is sketched after the notes below). How can I solve it?

Note 1: I took the code processor=[OpenFileProcessor(), SPProcessor()] from the nn-turkish.ipynb notebook.

Note 2: the train labelling (through label_for_lm()) seems to be created, as I can see the progress bar. The problem seems to appear with the valid set.

Note 3: I can see in the corpus folder (dest) that a tmp folder was created with one file inside: all_text.out (63.8 MB).
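
For completeness, the call I use looks roughly like this (dest and bs below are placeholders for my corpus folder and batch size):

from fastai.text import *   # fastai 1.0.57

bs = 64                                            # illustrative batch size
dest = Path('/path/to/my/french/corpus')           # folder containing the 20 text files (placeholder)
processor = [OpenFileProcessor(), SPProcessor()]   # taken from nn-turkish.ipynb

data = (TextList.from_folder(dest, processor=processor)
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=bs, num_workers=1))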

How is your data structured? Similar to the others? (A test, train, and valid folder)

Edit: Just saw your edit! I believe in the NLP course Jeremy mentions that different languages use different encodings (utf8 for example), so it would not surprise me that a change is needed here! Great catch :slight_smile:

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].


I think the problem comes from line 427 in the file text/data.py: it should open the raw file with encoding="utf8", shouldn't it? (see my EDIT)

I agree :slight_smile: As a temporary workaround you can just make that change locally, but perhaps @sgugger could add an optional argument so an encoding can be passed in?

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].


I made the change in /opt/anaconda3/lib/python3.7/site-packages/fastai/text/data.py as follows, but I get the same error:

with open(raw_text_path, 'w', encoding='utf8') as f: f.write("\n".join(texts))

If my memory serves me correctly, I ran into a similar error in the past. I don't know what the error was exactly, but I couldn't use processor=processor when I defined it like you did; it worked when I used:

data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])
       .split_by_rand_pct(0.1, seed=42)
       .label_for_lm()
       .databunch(bs=bs, num_workers=1))
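
Once that runs, it is probably worth saving the databunch so the SentencePiece processing does not have to be redone every time (standard fastai v1 calls, from memory; the file name is just an example):

data.save('data_lm_fr_sp.pkl')
# later, reload without re-tokenising:
# data = load_data(dest, 'data_lm_fr_sp.pkl', bs=bs)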

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].


Hello Daniel.

Thank you, but I get the same error on my side even with your code:
TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])

I'm quite sure my problem comes from using French text instead of English. I hope someone can solve this issue so that SentencePiece can be used within fastai.

Did you try to use get_wiki for French? I can try it tonight and give you the model if it works.

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].


Yes, I followed all the steps of nn-vietnamese.ipynb (and in particular get_wiki() in the file nlputils.py). I had no problem with that.
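For reference, that step was roughly the following (written from memory from the notebook, so the variable names may differ slightly on my side):

from fastai.text import *
from nlputils import split_wiki, get_wiki   # nlputils.py from the course repo

lang = 'fr'
name = f'{lang}wiki'
path = Config.data_path()/name              # ends up under ~/.fastai/data/frwiki
path.mkdir(exist_ok=True, parents=True)

get_wiki(path, lang)                        # download and extract the French Wikipedia dump
dest = split_wiki(path, lang)               # one text file per article in frwiki/docs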
It would be great if you could try on your side. Thanks Daniel.

No problem, I hope I can help. When I was doing it for Dutch I spent a ton of time trying to understand where the files were kept and how to use them. I ended up finding them and moved them to my working directory on a different instance so they were easier to call. When you say that you see that the files were created, is that in the frwiki folder or in your other dataset folder?

The frwiki file is created in the frwiki folder. It is correctly extracted to the docs folder, which is also inside the frwiki folder:

frwiki
– frwiki (file)
– docs (folder)
---- text files

I will double-check the specifics in a couple of hours, but inside one of the frwiki folders there should be a tmp folder that has the spm model and vocabulary in it. I moved that tmp folder to my working directory, and then in my code I assigned dest to my working directory rather than the fastai defaults, because I wanted to see what was going on. I am thinking it is in the folder with all the text files.

I will try again tomorrow; I am not able to sign into my instance because GCP states there aren't any V100s available at this time. I will let you know as soon as I can get into my instance.

I finally got into my instance and found the location of spm.vocab and spm.model. For me it was in /home/jupyter/.fastai/data/frwiki/docs/tmp.
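
If it helps, this is roughly what I did to get at them and keep them somewhere convenient (paths are from my own GCP setup, so adjust as needed):

from pathlib import Path
import shutil

spm_dir = Path('/home/jupyter/.fastai/data/frwiki/docs/tmp')   # where fastai left spm.model / spm.vocab
work_dir = Path('/home/jupyter/work/frwiki_spm')               # my working directory (example name)
work_dir.mkdir(parents=True, exist_ok=True)

for f in ['spm.model', 'spm.vocab']:
    shutil.copy(spm_dir/f, work_dir/f)

# If I remember correctly, SPProcessor can be pointed at existing files via its
# sp_model / sp_vocab arguments so the model is not retrained, e.g.:
# SPProcessor(sp_model=work_dir/'spm.model', sp_vocab=work_dir/'spm.vocab')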

I tried twice and I am getting this error:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
'''
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/concurrent/futures/process.py", line 360, in _queue_management_worker
    result_item = result_reader.recv()
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 386, in _recv
    buf.write(chunk)
MemoryError
'''

The above exception was the direct cause of the following exception:

BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-43-58259e67b5f0> in <module>
      1 data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])
----> 2         .split_by_rand_pct(0.1, seed=42)
      3         .label_for_lm()
      4         .databunch(bs=bs, num_workers=-1))
      5 

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    475             self.valid = fv(*args, from_item_lists=True, **kwargs)
    476             self.__class__ = LabelLists
--> 477             self.process()
    478             return self
    479         return _inner

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in process(self)
    529         "Process the inner datasets."
    530         xp,yp = self.get_processors()
--> 531         for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
    532         #progress_bar clear the outputs so in some case warnings issued during processing disappear.
    533         for ds in self.lists:

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in process(self, xp, yp, name)
    709                     p.warns = []
    710                 self.x,self.y = self.x[~filt],self.y[~filt]
--> 711         self.x.process(xp)
    712         return self
    713 

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in process(self, processor)
     81         if processor is not None: self.processor = processor
     82         self.processor = listify(self.processor)
---> 83         for p in self.processor: p.process(self)
     84         return self
     85 

/opt/anaconda3/lib/python3.7/site-packages/fastai/text/data.py in process(self, ds)
    468         else:
    469             with ProcessPoolExecutor(self.n_cpus) as e:
--> 470                 ds.items = np.array(sum(e.map(self._encode_batch, partition_by_cores(ds.items, self.n_cpus)), []))
    471         ds.vocab = self.vocab
    472 

/opt/anaconda3/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
    474     careful not to keep references to yielded objects.
    475     """
--> 476     for element in iterable:
    477         element.reverse()
    478         while element:

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I am thinking the French wiki is too big for my instance, but I am not sure. At this point in setting up the data, my CPU usage goes up to 100% and then the process dies. Do you know how I would fix this? I can give you spm.model and spm.vocab, but I don't know if that will help you.

Jeremy stated that the corpus size should be limited to 100 million tokens [Language Model Zoo 🦍], but I need to figure out how to do that.

Yes: with an architecture like AWD-LSTM (not very deep), it does not help to train your learner with a corpus bigger than 100 million tokens.

After downloading/unzipping/extracting the Wikipedia articles in your language (scripts in nlputils.py), you can create your own script to keep only a 100-million-token corpus, or you can use the create_wikitext.py script (a minimal sketch is below).
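
As an illustration, a minimal "keep only ~100 million tokens" script could look like the following; the paths, output file name and whitespace token counting are just an example (create_wikitext.py does this more carefully):

from pathlib import Path

docs = Path.home()/'.fastai/data/frwiki/docs'   # output folder of split_wiki (example path)
out = docs.parent/'corpus_100M.txt'             # example output file
target, total = 100_000_000, 0

with open(out, 'w', encoding='utf8') as f:
    for txt in sorted(docs.glob('*.txt')):
        text = txt.read_text(encoding='utf8')
        n = len(text.split())                   # crude whitespace token count
        if total + n > target: break
        f.write(text + '\n')
        total += n

print(f'kept {total:,} tokens in {out}')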
