SentencePiece

Hi all, I am having issues getting SentencePiece working with the fastai library. Could anyone please point me to a notebook where you got it up and running?

Thank you very much :slight_smile:
Zach

Here you go :slight_smile: https://drive.google.com/open?id=1n3QQcagr9QjZogae6u5G41RxI4l5Mk9g

3 Likes

I think I see what I missed :slight_smile: Thank you! We need to train the language model for much longer apparently. Thank you very much @darek.kleczek!!!

1 Like

I know I said thank you once but I’m thanking you again @darek.kleczek. I was hoping to be able to show how to apply SentencePiece at the meetup I host, and this will absolutely help me do that. By the way, have you tried doing SentencePiece forwards and backwards along with spaCy forwards and backwards? I believe Jeremy and Rachel had an idea about that in the NLP course :wink:

Happy I can be helpful - and thanks to @jeremy and @rachel for teaching Part 1 in a way that lets me be helpful after just a few weeks of practice! :slight_smile: I’m still deep into Part 1; I’ve been looking into the NLP course and wondering whether to do that next, or go for Part 2 and then NLP (recommendations?). I haven’t seen that forward/backward thing you mention yet…

Forwards plus backwards is what Jeremy and the team recently showed to push ULMFiT even further past state of the art on IMDB. I’d do NLP then Part 2 personally. I’ll share a notebook with this ensemble shortly, but a backwards model is just what it sounds like: the word order is reversed. Ensembling the two models seems to help.
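
The ensembling step itself is tiny. A rough sketch, assuming you already have two trained classifiers (`learn_c_fwd` and `learn_c_bwd` are names I made up, not from the course notebooks):

```python
from fastai.text import *

# Average the class probabilities of the forwards and backwards classifiers.
preds_fwd, targs = learn_c_fwd.get_preds(ordered=True)
preds_bwd, _     = learn_c_bwd.get_preds(ordered=True)
preds_avg = (preds_fwd + preds_bwd) / 2
print(accuracy(preds_avg, targs))
```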

1 Like

If you are interested in an ensemble, take a look at the vietnamese notebook.

As for the SentencePiece model, I would follow the Turkish notebook. When I did it I needed to use a V100 GPU on GCP, because it would have taken way too long on a smaller GPU, something like three hours per epoch. I think the key to SentencePiece is knowing where the model and vocabulary are stored, so you can use them in other projects. If you are following the Turkish notebook, the get_wiki function creates a dest folder that contains your text files, and when you use SentencePiece it creates a tmp folder inside that dest folder; that is where spm.model and spm.vocab are stored. When you want to load them you use processor=SPProcessor.load(dest).
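
To make that concrete, here is a minimal sketch of reloading the trained SentencePiece model in a new project. The path and batch size are made up; adjust `dest` to wherever get_wiki put your text files:

```python
from fastai.text import *

dest = Path('data/trwiki/docs')        # hypothetical: get_wiki's output folder with your text files
# After the first run, dest/tmp/ contains spm.model and spm.vocab.
processor = SPProcessor.load(dest)     # reload the trained SentencePiece model and vocab

data_lm = (TextList.from_folder(dest, processor=processor)
           .split_by_rand_pct(0.1, seed=42)
           .label_for_lm()
           .databunch(bs=64, num_workers=1))
```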

I haven’t done SentencePiece backwards yet, but all you need to do is make sure you create your databunch with backwards=True; an example of this is in the vietnamese_bwd notebook.
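
A sketch of the backwards version, again with hypothetical paths; the only real change is the databunch call (keyword as used in the vietnamese_bwd notebook, worth double-checking against your fastai version):

```python
from fastai.text import *

dest = Path('data/viwiki/docs')        # hypothetical path, same layout as above
processor = SPProcessor.load(dest)     # reuse the SentencePiece model trained earlier

data_bwd = (TextList.from_folder(dest, processor=processor)
            .split_by_rand_pct(0.1, seed=42)
            .label_for_lm()
            .databunch(bs=64, num_workers=1, backwards=True))   # reverse the word order
```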

3 Likes

Got it! Thanks @Daniel.R.Armstrong! I hadn’t made it there yet in the course, but I will study them thoroughly :slight_smile:

I’m mostly doing them on IMDB as an example for the class I lecture, to give the students experience. (The IMDB sample, so it can run fully on Colab quickly, even if it doesn’t get above 85%.)

Thank you very much!

1 Like

One thing to note on backwards: when I was testing briefly on the sample, I could not get above chance, whereas forwards I was able to get ~78%. I need to play around some more to see why; I’m moving on to the entirety of IMDB now.

Yeah, in the NLP course I provided notebooks for doing SentencePiece and forwards and backwards models, and also showed how to download and prepare wikitext corpora in any language. So lots of code for you to borrow! :slight_smile:

4 Likes

My guess is that this might be because you are probably using the pretrained WikiText model for the forwards direction, while the backwards model has to be learned more or less from scratch on the IMDB sample, but that is just a thought.

I loved the NLP class! You all do a truly amazing job! I love your teaching style and all the tips you share. I am sure you get praise all the time, but you have changed my life. I recently started my first data science job (unpaid, but still) with a startup, and we used your notebooks to create a Dutch LM and classifier that is showing some great results (LM and blog post to follow). Cheers to you and @rachel for all your hard work and contributions to the data science community! Thank you!

6 Likes

Jeremy, my apologies for missing that! I’m slowly getting through them all, as there is so much to learn! Thank you very much for the amazing resources :slight_smile: When I ran the full IMDB set I did see the ~30% accuracy we want from language models.

Daniel, that would make sense.

Thank you both very much for the help!!! :slight_smile:

2 Likes

Oh, there’s absolutely no apology required - there are a lot of materials around and I’m more than happy to help navigate them as needed.

2 Likes

Thanks Jeremy!

[ EDIT 2 ] The problem was not the one explained in EDIT 1. The problem came from the number of tokens in the articles I saved to create my 100-million-token corpus (about 650k tokens per article). I do not know why, but SentencePiece did not like such a large number of tokens per article. I then created another 100-million-token corpus with shorter articles (and therefore more of them) and SentencePiece worked.
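
For reference, the kind of change I made when writing the corpus. This is illustrative only (the helper name and the cap value are mine, not code from the notebooks):

```python
# Cap each article's length so no single document carries hundreds of thousands
# of tokens, while still reaching ~100M tokens overall.
MAX_TOKENS_PER_ARTICLE = 10_000        # arbitrary cap, far below the ~650k I had before

def write_corpus(articles, out_path, target_tokens=100_000_000):
    total = 0
    with open(out_path, 'w', encoding='utf8') as f:
        for text in articles:
            tokens = text.split()[:MAX_TOKENS_PER_ARTICLE]
            f.write(' '.join(tokens) + '\n')
            total += len(tokens)
            if total >= target_tokens:
                break
    return total
```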


[ EDIT 1 ] I tested with a corpus in English and I did not get an error. I guess the problem comes from line 427 in the file fastai/text/data.py:

with open(raw_text_path, 'w') as f: f.write("\n".join(texts))

As the raw_text_path file in my case contains French words (i.e., words with accents), I think the open() call should have the argument encoding="utf8". I cc @sgugger


Hello.
I’m testing SentencePiece on a small French dataset (20 text files of 1 000 000 characters; total size 6.4 MB). I’m using fastai 1.0.57 on GCP.
When I try to create the databunch, I get the following error. How can I solve it?

Note 1: I took the code processor=[OpenFileProcessor(), SPProcessor()] from the nn-turkish.ipynb notebook (a sketch of my databunch call is below these notes).

Note 2: the training-set labelling (through label_for_lm()) seems to be created, as I can see the progress bar. The problem seems to appear with the validation set.

Note 3: I can see in the corpus folder (dest) that a tmp folder was created with one file inside: all_text.out (63.8 MB).
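
For reference, my databunch call looks roughly like this (adapted from nn-turkish.ipynb; the path and batch size below are placeholders, `dest` points at the folder with my 20 French text files):

```python
from fastai.text import *

dest = Path('data/fr_corpus')   # hypothetical path to the folder of French text files

data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor()])
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=64, num_workers=1))
```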

2 Likes

How is your data structured? Similar to the others? (A test, train, and valid folder)

Edit: Just saw your edit! I believe in the NLP course Jeremy mentions that different languages use different encodings (utf8, for example), so it would not surprise me if a change were needed here. Great catch :slight_smile:

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].


I think the problem comes from line 427 in the file fastai/text/data.py. It should open the raw file with encoding="utf8", I think, no? (see my EDIT)

1 Like

I agree :slight_smile: For temporary purposes you can just copy those over, but perhaps, @sgugger, an optional argument could be added so an encoding can be passed in?

[ EDIT ] : the problem was not the one described here. See my [ EDIT 2 ].


I made the change in ./opt/anaconda3/lib/python3.7/site-packages/fastai/text/data.py as follows, but I get the same error:

with open(raw_text_path, 'w', encoding='utf8') as f: f.write("\n".join(texts))