ULMFiT for Polish

(Piotr Czapla) #1


ULMFiT with SentencePiece set a new SOTA on the PolEval 2018 language modeling task: perplexity 95, where the previous best was 146.
Source code and weights
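For readers comparing the two perplexity numbers, here is a minimal sketch of how perplexity is usually computed from per-token log-probabilities, and how 95 vs. 146 works out as a relative reduction. It assumes natural-log probabilities and that the improvement is measured relative to the previous best; the function names are made up for illustration and are not taken from the competition code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.

    Assumes `token_log_probs` holds natural-log probabilities that the
    model assigned to each token of the evaluation text.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_reduction(previous, new):
    """Fraction by which `new` is lower than `previous`."""
    return (previous - new) / previous


# Sanity check: a model that spreads probability uniformly over
# 4 choices at every step has perplexity 4.
uniform_log_probs = [math.log(1 / 4)] * 10
uniform_ppl = perplexity(uniform_log_probs)

# The numbers quoted in the post: 146 (previous best) vs. 95.
improvement = relative_reduction(146, 95)
print(f"uniform perplexity: {uniform_ppl:.2f}")
print(f"relative reduction: {improvement:.1%}")
```

Lower perplexity is better, so a drop from 146 to 95 is roughly a one-third reduction.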

Available datasets and results
  • PolEval 2018

    • Task 2 NER
    • Task 3 Language model
      • a modified version of ULMFiT achieved a perplexity of 95, 35% better than the competition
  • PolEval 2017

    • Task 2 Sentiment analysis
      • best model (Tree-LSTM-NR): accuracy 0.795; the dataset is most likely slightly broken (paper from the contest)
  • New sentiment dataset, similar to the IMDb dataset

    • We have approached a few companies about publishing their datasets of comments with ratings, and we are waiting for their response. The dataset will be published as part of PolEval 2019.

(Anna) #2

Hi Piotr, congratulations on setting SOTA for LM in Polish!

I would love to experiment with the downstream tasks like sentiment analysis using your model.

I wanted to download your pre-trained LM weights but I don’t see them (nor the itos file for mapping dictionaries) under the link you provided. Am I missing something?

(Piotr Czapla) #3

Hi AnnaK, it seems I forgot to add the link to the pretrained weights. Here is the model that won PolEval: https://go.n-waves.com/poleval2018-modelv1.
Do you know of a good open dataset that can be used for classification in Polish?

(Anna) #4

Thanks for the link! Correct me if I’m wrong, but in order to use it, wouldn’t I also need your word-to-token mapping (in the IMDb example that was the itos_wt103.pkl file)?

Unfortunately I couldn’t find any good datasets, so I think I’ll collect the data myself from some websites with reviews.

(Marcin Kardas) #5

Hi Anna,
our language model uses subword tokenization to deal with the rich morphology of Polish and the large number of unique tokens in PolEval’s dataset (the competition required distinguishing over 1M tokens). For tokenization we used sentencepiece (here’s our implementation: https://github.com/n-waves/fastai/tree/poleval/task3), so instead of an itos*.pkl file we provided sp.vocab and sp.model files, which contain the trained sentencepiece model.
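To illustrate how sp.vocab can stand in for the itos file, here is a minimal sketch that rebuilds an itos-style list from it. It assumes the usual sentencepiece vocab layout (one piece per line, tab-separated from its score, with the line order matching the token ids); the sample pieces and the function name are invented for the example and do not come from our model.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sp_vocab(lines: Iterable[str]) -> Tuple[List[str], Dict[str, int]]:
    """Build itos (id -> piece) and stoi (piece -> id) from sp.vocab lines.

    Assumes each line looks like "<piece>\t<score>" and that the line
    number (starting at 0) is the token id.
    """
    itos: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        piece = line.split("\t")[0]  # drop the score column
        itos.append(piece)
    stoi = {piece: idx for idx, piece in enumerate(itos)}
    return itos, stoi


# Made-up sample in the sp.vocab layout; a real file would be read with
# open("sp.vocab", encoding="utf-8") instead.
sample_vocab = [
    "<unk>\t0",
    "<s>\t0",
    "</s>\t0",
    "▁języ\t-3.1",
    "k\t-3.5",
]
itos, stoi = parse_sp_vocab(sample_vocab)
print(itos[3], stoi["</s>"])
```

Turning raw text into token ids still needs sp.model loaded through the sentencepiece library; the vocab file alone only gives the id-to-piece mapping.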