I’m doing a literature survey to collect SOTA LMs’ perplexity scores. My initial perplexity with the spaCy tokenizer is in the 80+ range. SentencePiece is likely to be more suitable for Turkish.
Hi Ertan, have you found a benchmark for document classification in Turkish? I think LM perplexity alone is a bit too far removed from the downstream tasks to serve as a benchmark (I have a similar issue with Polish).
You’re right, I’m having that problem. I’ll look for a canonical classification task to benchmark against.
I had the same trouble last year. There doesn’t seem to be a good benchmark. IMHO you may want to benchmark on specific NLP tasks rather than on the LM itself to see the effect of transfer learning, since the LM is just the backbone here.
Please share if you find any reasonable benchmarks.
@piotr.czapla To my knowledge there isn’t any good benchmark dataset. But there is an open source corpus of 1.3B tokens.
I got in touch with some of the researchers in this space 2 months ago. They suggested looking into the CoNLL 2017 dependency parsing shared task, picking a model and replacing its embeddings with ours. To make it more interesting I was thinking of plugging in a pretrained network instead of word embeddings, but I haven’t done enough research on the existing models yet.
has some nice sources. It’s a free project that aims to build Turkish corpora, NLP tools and linguistic datasets.
I looked into TS Corpus and signed up to download the Wikipedia corpus, but it didn’t work, so I ended up using the dataset on Kaggle instead.
Also, I still haven’t come across any classification tasks to benchmark against on TS Corpus. The citations don’t mention anything either. Please share if you find any.
If I come up with something useful I’ll share it here. But, as far as I can see there is no such benchmark…
Some extra data sources that might be useful to others…
TUD Sürüm 3.0 (Version 3.0): a corpus of 50 million words spanning 24 years …
This might be useful.
Syntax is usually learned in the first layer of ULMFiT, so you will probably have quite a few layers that go unused in dependency parsing.
@s.s.o @ertan, if you speak Turkish, how about translating IMDb and checking whether the translation keeps the sentiment by reading 100 examples? Once that looks OK, you could train ULMFiT on it. If you manage to get good accuracy (~80%) by March 4th as a contribution to ulmfit-multilingual, we can include it in a short paper we are writing.
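For the "read 100 examples" step, a minimal sketch of how the sample could be drawn and scored. The function names, the `(text, label)` pair format and the placeholder data are assumptions made for illustration, not part of ulmfit-multilingual:

```python
import random

def sample_for_review(examples, n=100, seed=0):
    """Pick about n (text, label) pairs, balanced across labels, for manual reading."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-read later
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    per_label = n // len(by_label)
    picked = []
    for label in sorted(by_label):
        items = by_label[label]
        picked.extend(rng.sample(items, min(per_label, len(items))))
    rng.shuffle(picked)  # mix labels so the reader cannot guess from position
    return picked

def agreement(judgements):
    """Fraction of (original_label, reader_label) pairs that match."""
    matches = sum(1 for original, reader in judgements if original == reader)
    return matches / len(judgements)

# Placeholder data standing in for machine-translated reviews (hypothetical).
examples = [(f"translated review {i}", "pos" if i % 2 == 0 else "neg")
            for i in range(1000)]
picked = sample_for_review(examples, n=100)
```

A reader would then label each picked text without looking at the original label, and `agreement` gives the share of translations that kept their sentiment.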
The timeline is a bit short and I don’t have that much time at hand to do the translation myself, but I’ll see if I can utilize some other resources to get that done.
By the way, as I mentioned in my earlier post, I wasn’t able to reduce the perplexity after a certain point. I can still leverage the LM, but I’m not sure whether it captures the language well. Do we have a general idea of what perplexity is low enough for an LM to be usable? Are you only reporting the accuracy on the classification task with and without the pretrained network? Are you planning to use any other metrics?
Only the above. Perplexity is too fragile to be used to judge models. First, it depends on the number of tokens or words per sentence, and it changes with the number of out-of-vocabulary words. So you can only compare it within the same dataset, the same language and the same vocabulary.
Besides, even an LM with good perplexity will give poor classification results if it was trained on broken text. I had that issue with a Wikipedia dump that had been split by sentence and then shuffled.
What do you mean by this? I think the key point for Turkish, and this probably applies to many agglutinative languages, is to have a good tokenizer. I couldn’t find one in spaCy.
I was thinking the same, actually. Have you done any work on training on the Turkish Wiki? I would be very happy to have pretrained weights if you have them. Otherwise I can start training from scratch.
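To illustrate the dependence on tokenization, here is a small sketch with made-up numbers. It assumes, as a simplification, that the same text gets the same total negative log-likelihood under a word-level and a subword-level model; only the unit count used for normalization differs:

```python
import math

def perplexity(total_nll: float, n_units: int) -> float:
    """Perplexity = exp(average negative log-likelihood per unit), NLL in nats."""
    return math.exp(total_nll / n_units)

# Hypothetical values, chosen only for illustration.
total_nll = 4000.0   # total NLL of the text in nats
n_words = 1000       # units when splitting on words
n_subwords = 1600    # units when splitting the same text into subwords

ppl_per_word = perplexity(total_nll, n_words)
ppl_per_subword = perplexity(total_nll, n_subwords)

# The two numbers describe the same fit to the same text, but differ a lot.
# They are related by: ppl_per_word == ppl_per_subword ** (n_subwords / n_words)
print(f"per word: {ppl_per_word:.1f}, per subword: {ppl_per_subword:.1f}")
```

So a lower per-token perplexity from a subword tokenizer does not by itself mean a better model. Normalizing both by the same word count is one way to put them on a comparable scale.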
I trained a few times using standard tokenizers, but I have not tried SentencePiece yet. Shoot me a message and we can collaborate on that if you like. I have some cloud credits on Azure that we can utilize.