In SP, vocab_size is the number of tokens. I am a bit confused about the maximum tokens. Please clarify.
As you say, the 30,000 is the vocab_size (= number of unique tokens). The 100 million tokens are the total corpus size, i.e. the "number of words in Wikipedia" if you use Wikipedia and word segmentation. So for very large Wikipedias, you can just take the nicest articles.
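A toy illustration of the distinction (the corpus here is invented): vocab_size counts *unique* tokens, while the corpus size counts every token occurrence.

```python
# Invented toy corpus to illustrate vocab_size vs. corpus size.
corpus = "the cat sat on the mat the cat".split()

corpus_size = len(corpus)       # total token occurrences -> 8
vocab_size = len(set(corpus))   # unique tokens -> 5 ('the', 'cat', 'sat', 'on', 'mat')
```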
Sorry, I misunderstood your question.
After downloading the wiki dataset, I only did the following “minimal preprocessing”:
1. Put the dataset into a dataframe and remove all columns except for ‘text’;
2. Get rid of sub-headings inside the text column;
3. Save the dataframe (without header and index) into a csv file.
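The three steps above could be sketched in pandas roughly like this (the column names and the sub-heading regex are my assumptions about the wiki dump format, not the poster's actual code):

```python
import pandas as pd

# Toy stand-in for the downloaded wiki dataset (column names assumed).
df = pd.DataFrame({
    "title": ["Article A", "Article B"],
    "text": ["Intro. == History == More text.", "Plain text only."],
})

# 1. Keep only the 'text' column.
df = df[["text"]]

# 2. Strip MediaWiki-style sub-headings such as "== History ==" (pattern assumed).
df["text"] = df["text"].str.replace(r"={2,}[^=]+={2,}", "", regex=True)

# 3. Save without header and index.
df.to_csv("wiki_text.csv", header=False, index=False)
```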
I tried to join all sentences into one long .txt file. However, SP does not like it.
Say, what is the vision for the Language Model Zoo here, do you have a preference between SentencePiece-Tokenized and Spacy-tokenized models?
We should use whatever works best for each language. I expect that means sentencepiece for non-segmented languages like Chinese and Korean, and for agglutinative languages like Finnish, and spacy for everything else.
You definitely want the smaller vocab size. There’s no way to learn a generalizable embedding of such a long and rare sequence. Embeddings handle multiple meanings just fine, as long as they’re part of a non-linear function (like a neural net).
Got it. Effectively this gives us semantic paragraphs as sentence inputs to sentencepiece.
My interpretation of that requirement is that otherwise, sentencepiece might be inclined to merge characters from different sentences, which would make no sense (e.g. if many English sentences start with “A ” and end with a full stop and space, you would not want sentencepiece to make “. A” a token in the vocab).
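One way to honor that requirement is to feed sentencepiece one sentence per line, so the trainer never sees a cross-sentence character pair like “. A”. A minimal splitting sketch (the regex is a naive assumption, not a proper sentence segmenter):

```python
import re

text = "This is one sentence. A second one follows. A third ends here."

# Naive split on sentence-final punctuation followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Written one sentence per line, ". A" can never be merged into a single token.
one_per_line = "\n".join(sentences)
```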
Batching probably also relies on this implicitly; if you feed too much data at once, sentencepiece quite reliably blows up the memory.
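If memory is the constraint, one workaround (a sketch of my own, not sentencepiece's internal batching) is to stream the corpus in bounded batches of lines instead of loading the whole file:

```python
from itertools import islice

def iter_line_batches(path, batch_size=100_000):
    """Yield lists of at most batch_size lines, so the file never sits in memory whole."""
    with open(path, encoding="utf-8") as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                return
            yield batch
```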
I would be interested in your intuition on this, too.
While training my first round of German language model:
Is there a suggested base training procedure from which to start and depart? I have now taken train_lm from imdb_scripts/train_tri_wt.py based on a comment in another thread. Edit: I just saw that binga’s notebook has a comment about learning from scratch; I was confused by the PRE_PATH referring to the English wt103, which I would not expect to use.
(I changed sampled_sm because the LanguageModelData liked md.n_tok better than …)
Is that approximately OK?
@rother Do you have a particular German tweet dataset in mind? There seem to be several published last year. Could you add a link, please? I’ll add links to the ones I found.
I’m having an issue with the following line in Paperspace GPU:
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)
After completing the first epoch (there should be 15), the process halts with almost 100% CPU time:
(fastai) paperspace@psdtzkq1p:~/fastai/courses/dl2$ ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head
PID PPID CMD %MEM %CPU
1897 1 python PT_Language_Model_co 17.3 99.8
Does anyone have a similar problem, or any idea about what may be happening? After more than 4 hours of training, I’ll have to interrupt the process because it doesn’t seem it will come back.
fast.ai seems to only accept the English tokenizer. I tried to use the following code:
But the following error occurs:
OSError: [E050] Can’t find model ‘en’. It doesn’t seem to be a shortcut link, a Python package or a valid path to a data directory.
You need to install the Portuguese language model for spacy as described in: https://spacy.io/models/
I installed it successfully. Even so fast.ai doesn’t recognize ‘pt’.
It sounds like a spacy error. Does

import spacy
nlp = spacy.load("pt")

work as expected?
Spacy tries to play a clever game with symlinks (which seems to have the potential to go wrong). In the directory where spacy is installed (seen with print(spacy)) there is a data directory. There spacy expects symlinks to the actual model data.
For example, for me en is a symlink to …
You would have to have something there for pt.
Alternatively, you can pass the full name as lang to Tokenizer.
I added some resources and will add new ones as soon as I find them.
Has anyone come across the Stanford Sentiment Treebank and knows how to work with it? (https://nlp.stanford.edu/sentiment/treebank.html, from 2013.)
The latest (2017) sentiment benchmark for Polish follows that idea and requires models to estimate not only the sentiment of an entire sentence but also the sentiments of all subsentences. I’m trying to figure out the best way to compare the Polish SOTA with a universal language model. Are sentiment treebanks still used nowadays?
If my hunch is correct and the treebank is an old idea, does anyone know of a paper that explains why, so that I can talk with the organizers of PolEval about maybe getting different scoring?
Or, if I’m wrong and treebanks are still useful, does anyone know how to best change our universal language model to output sentiment for a treebank?
There are no symlinks. I had tested previously in Google Colab, but now in Paperspace the following output is generated:
Warning: no model found for 'pt'
Only loading the 'pt' tokenizer.
Warning: no model found for 'en'
Only loading the 'en' tokenizer.
(the 'en' warning is repeated three more times)
BrokenProcessPool                         Traceback (most recent call last)
----> 1 tok_trn, trn_labels = get_all(df_trn, 1)
      2 tok_val, val_labels = get_all(df_val, 1)

in get_all(df, n_lbls)
---> 16     tok_, labels_ = get_texts(r, n_lbls)
     17     tok += tok_;
     18     labels += labels_

in get_texts(df, n_lbls)
      5     #texts = texts.apply(fixup).values.astype(str)
----> 7     tok = Tokenizer('pt').proc_all_mp(partition_by_cores(texts))  # splits the list into sublists for processing by each core
      8     # Lower and upper case is inside the tokenizer
      9     return tok, list(labels)

~/fastai/courses/dl2/fastai/text.py in proc_all_mp(ss, lang)
     99     ncpus = num_cpus()//2
    100     with ProcessPoolExecutor(ncpus) as e:
--> 101         return sum(e.map(Tokenizer.proc_all, ss, [lang]*len(ss)), [])

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
    364     careful not to keep references to yielded objects.
--> 366     for element in iterable:
    368         while element:

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in result_iterator()
    584     # Careful not to keep a reference to the popped future
    585     if timeout is None:
--> 586         yield fs.pop().result()
    588         yield fs.pop().result(end_time - time.time())

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430         raise CancelledError()
    431     elif self._state == FINISHED:
--> 432         return self.__get_result()
    434         raise TimeoutError()

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383     if self._exception:
--> 384         raise self._exception
    386     return self._result

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
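For reference, the multiprocessing pattern that breaks here can be reproduced with the stdlib alone. This is a sketch under my own assumptions (the helper names mimic the fastai ones but this is not the fastai source): work is partitioned into per-core sublists and mapped over a process pool, which is exactly where a dying worker surfaces as BrokenProcessPool.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

def partition_by_cores(lst, n_cores=None):
    """Split lst into roughly equal sublists, one batch of work per core (sketch)."""
    n = n_cores or max(1, cpu_count() // 2)
    k = max(1, len(lst) // n)
    return [lst[i:i + k] for i in range(0, len(lst), k)]

def tokenize_chunk(chunk):
    # Stand-in for the per-process tokenizer call: lowercase and whitespace-split.
    return [s.lower().split() for s in chunk]

if __name__ == "__main__":
    texts = ["Ola mundo", "Bom dia", "Boa noite", "Ate logo"]
    chunks = partition_by_cores(texts, 2)
    with ProcessPoolExecutor(len(chunks)) as ex:
        # If a worker process dies (e.g. OOM or a crash in a C extension),
        # this map is where BrokenProcessPool is raised.
        tok = sum(ex.map(tokenize_chunk, chunks), [])
```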
The long names for the language models correspond to modules on the usual Python search path. Can you see your language model there / with …?
Or you could just use the long name of the module you installed (i.e. …
Using your parameters for training a German LM, I got to 3.84264 3.76195 0.315587 after 5 epochs, which is better than what I had with the “default policy” at that time (the default was 4.242864 4.056991 0.320959 after 7 epochs).
Unfortunately my GPU then seemed to develop a defect, and it’ll be a while until I have a replacement.
But the high rates are awesome, thank you for sharing!
Strangely, the package doesn’t appear when typing pip list. Did you manage to use the fast.ai Tokenizer class with any language other than English (German?)? Luckily, the EN tokenizer does a good job for Portuguese.