@rother, GermEval is about offensive language. Are you sure you have the right words in the vocabulary? 50k sounds like a small number given the number of words in German compared to English.
Do you know how many out-of-vocabulary (OOV) tokens you have? Such a low perplexity may suggest that you have plenty of unknowns. (The more OOV tokens, the more reward the model gets for predicting unk.)
Can you share the 300k training set you collected? It would be quite useful to train a model on that.
Great idea and thanks for organizing, count me in. Can you share the sentencepiece implementation? I have access to relatively powerful infrastructure, so I can help with the experiments. Without a time constraint we can probably gather a relatively large German Twitter corpus for the LM.
I very briefly looked at the SB10k corpus when @MatthiasBachfischer pointed it out. My impression is that the language used on Twitter is quite different from that in Wikipedia. This was also my impression for other Twitter corpora.
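To make the OOV question concrete, here is a minimal sketch of how the share of unknown tokens could be measured for a fixed vocabulary size. The helper names (`build_vocab`, `oov_rate`) and the toy sentences are made up for illustration; whitespace splitting stands in for whatever tokenizer was actually used.

```python
from collections import Counter

def build_vocab(train_tokens, max_size):
    """Keep the max_size most frequent training tokens (hypothetical helper)."""
    counts = Counter(train_tokens)
    return {tok for tok, _ in counts.most_common(max_size)}

def oov_rate(tokens, vocab):
    """Fraction of tokens that would be mapped to the unknown token."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)

# Toy data, invented for illustration only.
train = "der hund und der hund und der vogel".split()
valid = "der hund und das pferd".split()

vocab = build_vocab(train, max_size=3)   # keeps der, hund, und
rate = oov_rate(valid, vocab)            # das and pferd are unknown
print(f"OOV rate: {rate:.0%}")
```

Reporting this number next to the perplexity would make results with different vocabulary sizes easier to compare.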
I haven’t seriously tried the million post corpus.
I’ve also looked a bit into German QA, but it didn’t work very well. I suspect that sentencepiece would be quite beneficial for German here as well, but I never implemented it.
I do have a working LRP (relevance propagation) for ULMFiT which I think makes a great add-on for analysing and showing results.
I’d be happy to collaborate. That’s why I put my training scripts and model out in public, and I’m happy that a few people took them for a spin.
On the other hand, and I’m not sure how to say this politely, I was tremendously discouraged when I read “Note: research on state of the art is WIP, I’ll post resources/links/referenced papers once it is done” in the state of the art section, and I submitted a bunch of patches to my favourite software projects instead.
I must admit that while Twitter corpora are all the rage, I’m not sure what to think about them.
@t-v, I’ve just copied what is in the German section of the first post in the Language Model Zoo. The “I” in that sentence is not me saying something; it is just copy and paste. Sorry for the confusion and for the way it made you feel.
I’m all for the collaboration and I don’t really care about hiding things. What’s the point?
@t-v I see what you mean, I’ve noticed that it had been posted a long time ago so you must know that it is an outdated message.
You are more for open collaborative work, which I totally support!
Although when there is a competition people fear piggybacking on their work, and I guess this is what Kristian meant.
The competition is over. Now is collaboration time.
For this work to have any meaning it has to be good and it has to be done in many languages. Then we can make a large paper with Jeremy, Sebastian and everyone involved and show how ULMFiT can help push forward NLP around the world :).
It would be super cool if you joined the effort.
One side note: if you ever get discouraged by anything I say, just state that directly and don’t worry about politeness. 99 times out of 100 it is a communication error on my side and I didn’t mean what other people read.
I would be glad to collaborate, count me in as well.
My work is mostly in transfer learning in text and in a wide variety of languages – I have Jeremy (and everyone else involved in the project) to thank for their contributions in this area. I’m currently writing a fastai style implementation of OpenAI’s Transformer Decoder which may or may not be useful for this task.
What I worked on: A proprietary dataset that had severe class imbalance. The language model training went pretty smoothly and I was quite happy with the perplexity score on the German Wikipedia corpus.
What worked: The alternative cyclical learning rate (use_clr_beta).
The requirement was a smaller model owing to resource limitation, so I brought the embedding dimension down to 300 and the number of hidden connections to 1000. The perplexity (that I compare to @t-v’s 32) was 38.
What did not work: Off the top of my head, varying dropout values had negligible effect.
What (sort of) worked: PyTorch’s WeightedRandomSampler to balance the dataset. The technique worked reasonably well for me for a moderately skewed distribution, but not so much for heavier skews.
What did not work: (Again, off the top of my head) addressing overfitting by varying dropout or the wd (weight decay) hyper-parameters.
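For the class-balancing point above, a rough sketch of the per-sample weights one might hand to PyTorch’s `WeightedRandomSampler` (as its `weights` argument). This is only an assumption about the setup: the labels are invented, the dataset in question is proprietary, and the sampling is simulated with the standard library so the snippet runs without torch.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """One weight per sample, inversely proportional to its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Hypothetical 90/10 imbalance; not the real data.
labels = ["neutral"] * 90 + ["offensive"] * 10
weights = inverse_frequency_weights(labels)

# Stand-in for the sampler: draw with replacement according to the weights.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
share = Counter(drawn)["offensive"] / len(drawn)
print(f"share of minority class after resampling: {share:.2f}")
```

With these weights every class carries the same total weight, so the draws come out roughly balanced. For heavy skews the same few minority samples get repeated very often, which may be one reason the technique helped less there.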
@aayushy OpenAI’s Transformer is an awesome project to work on; count me in if you need a hand. It is second on my list, after I manage to make use of ULMFiT.
The only issue with Transformers is that they train for a month or something like that (I heard that somewhere on Hacker News; I haven’t seen it in the paper).
Good to know that clr_beta worked well, and thank you for sharing the details of what worked. For Polish, the things that mattered most were the SentencePiece vocab size and the number of layers: 4 was better than 3, and 5 was worse.
@MicPie Cool. If you want some directions, let me know how comfortable you are with fastai, ULMFiT, Python, etc., so I can point you to things where you could best help. Alternatively, pick some experiments yourself and bring back the results and trained models.
@MicPie @aayushy I’ve added you both to the repo. There is not much there yet, as I’m trying to correct the scripts to use the BTW17 set and am fighting with sentencepiece at the moment, as it does not accept BOS/EOS tokens. Once I have a first LM trained, I will publish the changes so that we can start collaborating.
How about we agree on a plan for how to progress? Here is a proposal, feel free to change it:
a common validation & training set for normal text like Wikipedia
a common validation & training set for comments, since, as @t-v noticed, the language in tweets/comments differs from Wikipedia
I’m working on BTW17, 170 MB of comments from Twitter (should we add SB10k?)
a script to train a working model for sentiment analysis using sentencepiece on GermEval 2017
The above should give us a baseline. Then we plan a set of experiments to improve it and work on each experiment separately, sharing intermediate results in GitHub issues and the improved values here.
The perplexity (that I compare to @t-v’s 32) was 38.
@aayushy For the perplexity to make sense we need to know the OOV number and the text you were working on. (If you have a lot of unknowns the perplexity goes down very quickly)
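A small sketch of why unknowns pull perplexity down. The per-token probabilities are invented for illustration (not taken from any model in this thread): known words are assumed to be hard to predict, while the unk token is assumed to be easy once many rare words collapse into it.

```python
import math

def perplexity(probs):
    """exp of the average negative log-probability the model gave each target token."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical probabilities assigned to the correct next token.
known = [0.02] * 80   # 80 targets that are real words
unk = [0.5] * 20      # 20 targets that were mapped to unk

ppl_all = perplexity(known + unk)   # what usually gets reported
ppl_known = perplexity(known)       # same model, unk targets left out

print(f"with unk targets:    {ppl_all:.1f}")
print(f"without unk targets: {ppl_known:.1f}")
```

The second number is higher than the first even though nothing about the model changed, which is why the OOV count and the text matter when comparing 32 against 38.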
As already mentioned by @t-v, I also briefly looked at the SB10k dataset but could not get any decent results (probably because the vocabulary of Wikipedia articles is fundamentally different from the vocabulary used in tweets).
I would love to collaborate on your work to bring ULMFiT to the German language, but I’m afraid that I won’t find the time anytime soon…
Btw. @rother: I just realized that you have also submitted a paper to the GermEval shared task. Will you be in Vienna tomorrow for the poster session?
Maybe we should create a new metric like perplexity per 10k tokens or something.
Iirc I did some experiments with 80k tokens before (don’t know why I picked that number) and the perplexity was a good bit higher (which makes perfect sense). I think it might be a good idea to start with the tokenization of the second step (the unlabeled Twitter data, for example), see how many tokens that produces, and work backwards. Maybe do it for different media (forum posts, email, Twitter) to get an empirical estimate of how many tokens are a good overall baseline for the wiki-model.
Also, the better the token match between the wiki-model and the twitter/newspaper/whathaveu-model, the better.
Edit: I’ll upload everything I did to GitHub when I have some time (for Twitter you are only allowed to upload the ids, not the text, but I’ll just share the collection script… which is not pretty ;P).
I like this initiative and will post my thoughts when I’m back from Vienna. My quick summary of my GermEval entry is that it was done a bit hastily and there’s tons of room for improvement. I have a piece of paper in my office with all the notes. I’m quite happy that I did everything end to end once to go through the entire process. Learned a lot, now we can optimize.
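One way the "work backwards" idea and the token match could be checked, as a sketch only: rank the wiki tokens by frequency and measure how much of the target text the top-k of them cover for several k. The function name and the two toy corpora are invented; whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def coverage_by_vocab_size(source_tokens, target_tokens, sizes):
    """For each vocab size, the fraction of target tokens found in the top-k source tokens."""
    ranked = [tok for tok, _ in Counter(source_tokens).most_common()]
    target_counts = Counter(target_tokens)
    total = sum(target_counts.values())
    result = {}
    for size in sizes:
        vocab = set(ranked[:size])
        covered = sum(c for tok, c in target_counts.items() if tok in vocab)
        result[size] = covered / total
    return result

# Toy stand-ins for a Wikipedia corpus and a tweet corpus.
wiki = "die stadt liegt in der region und die stadt hat einen hafen".split()
tweets = "die stadt ist mega lol und der hafen auch".split()

cov = coverage_by_vocab_size(wiki, tweets, sizes=[2, 5, 50])
for size, share in cov.items():
    print(f"top {size:>2} wiki tokens cover {share:.0%} of the tweet tokens")
```

Running this per medium (forum posts, email, Twitter) would show where the coverage curve flattens, which could serve as an empirical hint for the vocabulary size of the wiki-model.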
Ooops just checked and that sentence about posting when I’m done is from me. It wasn’t meant to imply that I want to hold anything back. Just that at that point I had done some literature research and supplied the findings as is and I guess I never updated the post because I didn’t research more literature
I’ve also not updated my GitHub repo with the GermEval code yet because it’s pretty ugly and I want to fix it first. No bad intention, mostly little time. I very much prefer open collaboration on these things.
Upon rereading I now realize it might sound like secrecy, but that was not the intention at all. It’s more incompetence/laziness on my part.
If someone is willing to host the language model, I’ll gladly upload it somewhere (27.39 perplexity @ 50k tokens, probably some room to improve if run for some extra epochs). I’m a little too embarrassed to have it officially hosted by Jeremy.
It’s very time consuming to do this step and it’s probably better to focus on the later steps and revisit the LM later. In retrospect I should have used the one Thomas linked somewhere, would probably have saved a lot of time but I wanted to do the entire process end to end once (it was a great learning experience)
The Twitter-CSV file is about 42MB but I think you’re not technically allowed to upload it anywhere legally (at least that’s what I was told). Maybe I could share it with a temporary link or something, but I’ll upload the collection script tomorrow if I find the time to clean it up a bit (you’ll need a Twitter account to run it). Iirc one can get the 300k I collected in a couple of hours. Still it would be great if we can build a large collection and not have to download everything individually. Maybe someone who knows the legal situation can chime in.
I’ll be in my office with access to the data and code on Monday.
Edit: moved the post here, accidentally posted it in the other thread
Wouldn’t a 10k validation set be too small? Calculating perplexity on 350k tokens runs in just under 3 minutes. But in general, assembling a data set or multiple data sets (Wiki, Twitter, News) of, let’s say, 100k words each could be a good start.
I think 3 different data sets would work better, as the Twitter language is very different (more about that later).
I didn’t quite get what you mean here. Do you want to tokenize into words or into subword tokens? Do you want to know how many unique words can be found in each dataset? And how do you define good?
That makes a lot of sense, have you checked for your words?
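A sketch of how one fixed-size validation set per source could be carved out, so everyone evaluates on the same held-out text. Everything here is a placeholder: the function name, the seed, and the tiny corpora; the budget of 20 words stands in for something like 100k.

```python
import random

def split_by_word_budget(documents, valid_words, seed=0):
    """Shuffle, then move whole documents into validation until it holds at least valid_words words."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)   # fixed seed so the split is reproducible
    valid, n_words = [], 0
    while docs and n_words < valid_words:
        doc = docs.pop()
        valid.append(doc)
        n_words += len(doc.split())
    return docs, valid

# Placeholder corpora, invented for illustration; real ones would be loaded from disk.
corpora = {
    "wiki": ["ein satz aus wikipedia"] * 50,
    "twitter": ["ein tweet mit hashtag"] * 50,
    "news": ["eine meldung der zeitung"] * 50,
}

splits = {name: split_by_word_budget(docs, valid_words=20) for name, docs in corpora.items()}
for name, (train, valid) in splits.items():
    print(name, len(train), "train docs,", len(valid), "validation docs")
```

Keeping the three sets separate means perplexity can be reported per medium instead of being averaged over very different kinds of language.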
Can you add your repo to the wiki on the top?
How about we use a shared Google Drive for that while we are experimenting?
Then we select the best models and push them to GitHub (it has releases where you can upload larger files).
I think it is not a big deal if you do that in the name of science; at least GDPR is quite relaxed about this. Maybe we simply store it in a private Google Drive for the time being while we run experiments?