ULMFiT for Polish

(Piotr Czapla) #1

Results

ULMFiT with SentencePiece set a new SOTA on the PolEval 2018 language modeling task, with a perplexity of 95 (the previous best was 146).
Source code and weights

Available datasets and results
  • PolEval 2018

    • Task 2 NER
    • Task 3 Language model
      • a modified version of ULMFiT achieved a perplexity of 95, 35% better than the competition
  • PolEval 2017

    • Task 2 Sentiment analysis
      • best model (Tree-LSTM-NR) with accuracy 0.795 - the dataset is most likely a bit broken (paper from the contest)
  • New sentiment dataset, similar to the IMDb dataset

    • We have approached a few companies about publishing their datasets of comments with ratings and are waiting for their response. The dataset will be published as part of PolEval 2019.
6 Likes

(Anna) #2

Hi Piotr, congratulations on setting SOTA for LM in Polish!

I would love to experiment with the downstream tasks like sentiment analysis using your model.

I wanted to download your pre-trained LM weights but I don’t see them (nor the itos file for mapping dictionaries) under the link you provided. Am I missing something?

0 Likes

(Piotr Czapla) #3

Hi AnnaK, it seems I forgot to add the link to the pretrained weights. Here is the model that won PolEval: https://go.n-waves.com/poleval2018-modelv1.
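In case it helps, a minimal sketch for inspecting the downloaded checkpoint with plain PyTorch (assuming it is a state dict saved with torch.save; the file name below is illustrative):

```python
import torch

# Load the checkpoint on CPU and list layer names and shapes, e.g. to check
# that the embedding size matches the SentencePiece vocabulary.
state = torch.load("poleval2018-modelv1.h5", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```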
Do you have a good open dataset that can be used for classification in Polish?

0 Likes

(Anna) #4

Thanks for the link! Correct me if I’m wrong, but in order to use it wouldn’t I also need your word-to-token mapping (in the IMDb example that was the itos_wt103.pkl file)?

Unfortunately I couldn’t find any good datasets, so I think I’ll collect the data myself from some websites with reviews.

0 Likes

(Marcin Kardas) #5

Hi Anna,
our language model uses subword tokenization to deal with the rich morphology of the Polish language and the large number of unique tokens in PolEval’s dataset (the competition required distinguishing over 1M tokens). For tokenization we used SentencePiece (here’s our implementation: https://github.com/n-waves/fastai/tree/poleval/task3), so instead of an itos*.pkl file we provided the sp.vocab and sp.model files, which contain the trained SentencePiece model.
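For illustration, here is a minimal sketch of how the sp.model file replaces the itos*.pkl mapping (it assumes sp.model has been downloaded locally; the example sentence is arbitrary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sp.model")  # trained SentencePiece model shipped with the weights

text = "Kot siedzi na macie."
pieces = sp.encode_as_pieces(text)  # subword pieces, e.g. ['▁Kot', '▁siedzi', ...]
ids = sp.encode_as_ids(text)        # ids the language model actually sees
print(pieces, ids)
print(sp.decode_ids(ids))           # round-trips back to the original text
```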

1 Like

(Kerem Turgutlu) #6

Congrats! Have you experimented with and without SentencePiece - how much does the perplexity improve with it in your case?

0 Likes

(Piotr Czapla) #7

It wasn’t possible to train the model without SentencePiece, as the vocab would have been too large.

1 Like

(Kerem Turgutlu) #8

How did you decide on vocab_size with SentencePiece? I am using the BPE model type. And did you use any pre or post rules from the fastai spaCy pipeline? Thanks!

From NMT forums:

100000 and 300000 parameters are far too big - the usual values are between 16-30K. With such parameter, you simply disable BPE effect.

0 Likes

(Marcin Kardas) #9

We selected the hyperparameters based on the results of a grid search on a smaller corpus.

You can find more details in this paper. Bear in mind that lower perplexity does not imply better performance on downstream tasks.

The PolEval 2018 corpus was already pretokenized. From our experiments with hate speech detection in Polish tweets, we noticed that the default fast.ai transforms are sufficient.
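A rough illustration of such a grid search with SentencePiece’s Python trainer is below (the corpus path and candidate values are placeholders, not the actual PolEval settings):

```python
import sentencepiece as spm

# Sweep over tokenization model type and vocabulary size; each run writes
# sp_<type>_<size>.model and .vocab, which the language model is then
# trained and evaluated with.
for model_type in ("unigram", "bpe"):
    for vocab_size in (25000, 50000):
        spm.SentencePieceTrainer.train(
            input="polish_corpus.txt",  # placeholder corpus file
            model_prefix=f"sp_{model_type}_{vocab_size}",
            vocab_size=vocab_size,
            model_type=model_type,
            character_coverage=0.9999,  # keep rare Polish characters
        )
```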

2 Likes

(Kerem Turgutlu) #10

Very nice and easy to understand paper, thanks!

0 Likes

(Kerem Turgutlu) #11

Maybe I am missing an important point here, but I still wanted to check whether this makes sense :slight_smile:

As we change our LM’s vocab/tokens, either via the vocab size or with different tokenization methods (BPE, unigram, etc.), we also change the label representation of the test sets (e.g. the next token to predict might be different with different vocabs for the exact same test sentence).

So, is it fair to make such a comparison between methods when what we test is no longer identical? Is this a valid point, or does it not matter overall?

Or is this what you are referring to when you say we should compare these LMs on downstream tasks?

Then none of the LM papers out there can be trusted based solely on perplexity, unless the exact same preprocessing is done and the same tokens are used…

0 Likes

(Antti Karlsson) #12

Hello, my first post here o/ ! I think you are exactly right, as perplexity is a function of the vocabulary (your LM is basically a probabilistic model on top of the vocabulary you have chosen). Thus it’s not “fair” to compare the numbers if the vocabularies are different. Of course one can probably still get a ballpark intuition about the models (is the perplexity more like 20 or 200?), if it is reasonable to assume that the vocabularies are not very different.
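A small numeric illustration of that dependence: the same total negative log-likelihood yields different perplexities depending on whether it is normalized per subword token or per word (the numbers below are made up):

```python
import math

# Made-up totals from evaluating one model on one test text.
total_nll = 3500.0         # sum of -log p(token) over the whole text
num_subword_tokens = 1200  # number of SentencePiece pieces in the text
num_words = 800            # number of whitespace-separated words

ppl_per_token = math.exp(total_nll / num_subword_tokens)  # ~18.5
ppl_per_word = math.exp(total_nll / num_words)            # ~79.4

# Comparing models with different vocabularies is only meaningful if both
# are normalized the same way (e.g. per word).
print(ppl_per_token, ppl_per_word)
```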

Testing classification performance or some other downstream task is a better test in a sense, but it could also be that a language model is “good” without being good for a specific task?

0 Likes

#13

Any progress with collecting a movie review corpus?

0 Likes

(Darek Kleczek) #14

@piotr.czapla @mkardas Thanks for leading this, it’s really helpful to have the code available that can produce SOTA results in Polish. I’m just starting out with fastai/ULMFiT and prepared a Google Colab version of your approach, with a toy dataset, just to test if the pipeline works. I’d appreciate any feedback on the code: https://colab.research.google.com/drive/1n3QQcagr9QjZogae6u5G41RxI4l5Mk9g

Thanks! Darek

1 Like