SentencePiece

Daniel.R.Armstrong · September 12, 2019, 2:30am

If memory serves, I ran into a similar error in the past. I don’t remember exactly what the error was, but I couldn’t use processor=processor when I defined it the way you did. It worked when I used:

data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])
       .split_by_rand_pct(0.1, seed=42)
       .label_for_lm()
       .databunch(bs=bs, num_workers=1))

pierreguillou · September 12, 2019, 6:32pm

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].

Hello Daniel.

Thank you, but I get the same error on my side even with your code:
TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])

I’m quite sure my problem comes from using French text instead of English. I hope someone can solve this issue so that SentencePiece can be used within fastai.

Daniel.R.Armstrong · September 12, 2019, 8:19pm

Did you try to use get_wiki for French? I can try it tonight and give you the model if it works.

pierreguillou · September 12, 2019, 8:25pm

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].

Yes, I followed all the steps of the nn-vietnamese.ipynb (and in particular get_wiki() in the file nlputils.py). I had no problem with that.
It would be great if you could try on your side. Thanks Daniel.

Daniel.R.Armstrong · September 12, 2019, 8:34pm

No problem, I hope I can help. When I was doing it for Dutch I spent a ton of time trying to understand where the files were kept and how to use them. I ended up finding them and moved them to my working directory on a different instance so they were easier to call. When you say that you see that the files were created, is that in the frwiki folder or in your other dataset folder?

pierreguillou · September 12, 2019, 10:13pm

The file frwiki is created in the folder frwiki. It is correctly extracted to the folder docs, which is also in the folder frwiki:

frwiki
– frwiki (file)
– docs (folder)
---- text files

Daniel.R.Armstrong · September 12, 2019, 10:37pm

I will double-check the specifics in a couple of hours, but inside one of the frwiki folders there should be a tmp folder that contains the spm models and vocabulary. I moved the tmp folder to my working directory, then in my code I assigned dest to my working directory rather than the fastai defaults, because I wanted to see what was going on. I think the tmp folder is in the folder with all the text files.

Daniel.R.Armstrong · September 13, 2019, 2:43am

I will try again tomorrow. I am not able to sign into my instance because GCP states there aren’t any V100s available at this time. I will let you know as soon as I can get into my instance.

Daniel.R.Armstrong · September 15, 2019, 4:43pm

I finally got into my instance and found the location of spm.vocab and spm.model. For me it was in /home/jupyter/.fastai/data/frwiki/docs/tmp
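For anyone else hunting for these files, here is a minimal sketch of how one might search a data folder for them. The function name find_spm_files is hypothetical, and the only assumption is that the files are called spm.model and spm.vocab as in this thread; the folder they end up in may differ on your machine.

```python
from pathlib import Path

def find_spm_files(root):
    """Return the sorted paths under `root` that are named spm.model or spm.vocab."""
    root = Path(root)
    wanted = {"spm.model", "spm.vocab"}  # file names as reported in this thread
    # rglob searches `root` and all of its subfolders
    return sorted(p for p in root.rglob("spm.*") if p.name in wanted)
```

Calling it as find_spm_files(Path.home() / ".fastai" / "data" / "frwiki") should list the files if they sit somewhere under the default fastai data folder.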

I tried twice and I am getting this error:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
'''
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/concurrent/futures/process.py", line 360, in _queue_management_worker
    result_item = result_reader.recv()
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 386, in _recv
    buf.write(chunk)
MemoryError
'''

The above exception was the direct cause of the following exception:

BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-43-58259e67b5f0> in <module>
      1 data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])
----> 2         .split_by_rand_pct(0.1, seed=42)
      3         .label_for_lm()
      4         .databunch(bs=bs, num_workers=-1))
      5 

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    475             self.valid = fv(*args, from_item_lists=True, **kwargs)
    476             self.__class__ = LabelLists
--> 477             self.process()
    478             return self
    479         return _inner

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in process(self)
    529         "Process the inner datasets."
    530         xp,yp = self.get_processors()
--> 531         for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
    532         #progress_bar clear the outputs so in some case warnings issued during processing disappear.
    533         for ds in self.lists:

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in process(self, xp, yp, name)
    709                     p.warns = []
    710                 self.x,self.y = self.x[~filt],self.y[~filt]
--> 711         self.x.process(xp)
    712         return self
    713 

/opt/anaconda3/lib/python3.7/site-packages/fastai/data_block.py in process(self, processor)
     81         if processor is not None: self.processor = processor
     82         self.processor = listify(self.processor)
---> 83         for p in self.processor: p.process(self)
     84         return self
     85 

/opt/anaconda3/lib/python3.7/site-packages/fastai/text/data.py in process(self, ds)
    468         else:
    469             with ProcessPoolExecutor(self.n_cpus) as e:
--> 470                 ds.items = np.array(sum(e.map(self._encode_batch, partition_by_cores(ds.items, self.n_cpus)), []))
    471         ds.vocab = self.vocab
    472 

/opt/anaconda3/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
    474     careful not to keep references to yielded objects.
    475     """
--> 476     for element in iterable:
    477         element.reverse()
    478         while element:

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I think the French wiki is too big for my instance, but I am not sure. At this point in setting up the data, my CPU goes up to 100% and then the process dies. Do you know how I would fix this? I can give you spm.model and spm.vocab, but I don’t know if that will help you.

Jeremy recommended limiting the corpus size to 100 million tokens [Language Model Zoo 🦍], but I need to figure out how to do that.

pierreguillou · September 16, 2019, 10:08pm

Yes: with an architecture like AWD-LSTM (not very deep), it does not help to train your learner with a corpus bigger than 100 million tokens.

After downloading/unzipping/extracting the Wikipedia articles in your language (scripts in nlputils.py), you can create your own script to keep only a 100-million-token corpus, or you can use the create_wikitext.py script.
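As a rough illustration of the "create your own script" option, here is a minimal sketch. It is not what create_wikitext.py does. It assumes one article per .txt file in a single folder, and it counts tokens by splitting on whitespace, which is only a proxy for the real tokenizer. The function name copy_until_budget is hypothetical.

```python
from pathlib import Path
import shutil

def copy_until_budget(src, dst, max_tokens=100_000_000):
    """Copy .txt files from `src` to `dst` until the token budget would be exceeded.

    Tokens are counted by whitespace splitting (an approximation).
    Returns the total number of tokens copied.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    total = 0
    for path in sorted(src.glob("*.txt")):
        n = len(path.read_text(encoding="utf8").split())
        if total + n > max_tokens:
            break  # stop at the first file that would go over the budget
        shutil.copy(path, dst / path.name)
        total += n
    return total
```

The smaller folder can then be passed as dest when building the databunch, so the tokenizer never sees the full dump.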

Daniel.R.Armstrong · September 17, 2019, 12:01pm

I got it running with articles of length over 2500 instead of 1500, but I lost my ssh connection so I can’t see the results, boo. I guess I should have followed your instructions about using tmux. I tried training for two additional layers, but the results led me to think that the LM didn’t train properly yesterday, so I will have to run it again.

What architecture are you trying to use?

Were you successful in creating a pre-trained wikitext model using SentencePiece, or are you trying to use your data in place of wikitext?

Daniel.R.Armstrong · September 18, 2019, 12:05pm

I completed the training, and I made it available for everyone to use here.

pierreguillou · September 18, 2019, 6:04pm

Thank you Daniel. In fact I found the origin of my problem with SentencePiece (see EDIT 2).

Then, with a 100-million-token corpus created from the French Wikipedia dump, but using the shortest articles (longer than 100 tokens) rather than the longest ones, the code from the nn-turkish.ipynb notebook works well.

data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=bs, num_workers=1))
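The selection described above (shortest articles first, each longer than 100 tokens, up to a total budget) could be sketched as follows. This is an illustration, not the script actually used: the function name pick_shortest_articles and the input format (a dict mapping article name to token count) are assumptions.

```python
def pick_shortest_articles(lengths, min_tokens=100, max_tokens=100_000_000):
    """Choose the shortest articles longer than `min_tokens` until `max_tokens` is reached.

    `lengths` maps an article name to its token count.
    Returns (chosen names, shortest first; total tokens chosen).
    """
    chosen, total = [], 0
    # keep only articles strictly longer than min_tokens, sorted by length
    eligible = sorted((n, name) for name, n in lengths.items() if n > min_tokens)
    for n, name in eligible:
        if total + n > max_tokens:
            break  # the next article would exceed the budget
        chosen.append(name)
        total += n
    return chosen, total
```

Picking short articles keeps each individual file small, which is one plausible reason this corpus was easier to process than one made of the longest articles.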

Daniel.R.Armstrong · September 18, 2019, 6:56pm

I am glad you found it. If you want to save the training time, feel free to use the one I trained.

rahuluppari · July 22, 2020, 10:37pm

Hello Daniel,

Could you please help me solve an error I’m facing while downloading with get_wiki?

1: I’m not able to use nlputils; it triggers an error.
from nlputils import get_wiki,split_wiki
no module named ‘nlputils’

While using get_wiki(lang,path), I get the error below:
[Errno 2] No such file or directory: ‘C:\Users\NANI.HOMIES\.fastai\data\mrwiki\text\AA\wiki_00’

Could you please help me solve this error?

Thank you in advance.

rahuluppari · July 22, 2020, 10:39pm

Hello Jeremy,

Could you please help me solve an error I’m facing while downloading with get_wiki?

1: I’m not able to use nlputils; it triggers an error.
from nlputils import get_wiki,split_wiki
no module named ‘nlputils’

While using get_wiki(lang,path), I get the error below:
[Errno 2] No such file or directory: ‘C:\Users\NANI.HOMIES\.fastai\data\mrwiki\text\AA\wiki_00’

Could you please help me solve this error?

Thank you in advance.

mrfabulous1 · July 23, 2020, 10:30am

Hi rahuluppari, hope you are having a wonderful day!

from nlputils import get_wiki,split_wiki
no module named ‘nlputils’

Normally the error above means that you haven’t installed the library.

Run pip list to check whether the library is installed.

Use pip install to install the library.

[quote="rahuluppari, post:35, topic:53010"]
While using get_wiki(lang,path) it is giving the below error:
[Errno 2] No such file or directory: ‘C:\Users\NANI.HOMIES\.fastai\data\mrwiki\text\AA\wiki_00’
[/quote]

I don’t think this call can work, since get_wiki comes from the import statement before it, and that import is failing.

Hope this helps.

Cheers mrfabulous1 :smiley:

P.S. We don’t normally ping Jeremy for this kind of issue.

rahuluppari · July 23, 2020, 10:55am

Hello mrfabulous1,

Many thanks for your response.
Apologies for that!!

I checked the list of installed libraries and nlputils was installed, so I uninstalled it and reinstalled it with “pip install nlputils”, but it still gives the same error:
no module named “nlputils”.

I don’t know the reason why

I have uploaded a screenshot for your reference.

It would be a great help if you could help me resolve this issue.

Daniel.R.Armstrong · July 23, 2020, 5:22pm

@rahuluppari Is the nlputils.py file in the same directory as the notebook you are running? If you haven’t done so already I would watch the Turkish lesson. Then watch it again. Then maybe one more time.
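A quick way to check this from the notebook itself is sketched below. It relies on nlputils.py being a plain file that sits next to the course notebooks, as described earlier in this thread. The helper name ensure_importable is hypothetical.

```python
import sys
from pathlib import Path

def ensure_importable(module_file, directory="."):
    """Return True if `module_file` exists in `directory`.

    When the file is found, `directory` is also added to sys.path
    so that a later `import` of the module can succeed.
    """
    directory = Path(directory).resolve()
    if not (directory / module_file).is_file():
        return False  # the file is not where the notebook expects it
    if str(directory) not in sys.path:
        sys.path.insert(0, str(directory))
    return True
```

If ensure_importable("nlputils.py") returns False in the notebook's working directory, the file still needs to be copied there before from nlputils import get_wiki,split_wiki can work.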

rahuluppari · July 24, 2020, 12:02am

Many thanks @Daniel.R.Armstrong,

It worked. I hadn’t saved the nlputils.py file in my working directory.

@Daniel.R.Armstrong, I still have a problem with get_wiki(path,lang): it gives an error.
A screenshot is below.

It would be a great help if you could help me solve this error as well.

Thanking you in advance!!!