This thread’s objective is to discuss the ULMFiT implementations in different languages and share our roadblocks and approaches.
Languages and people:
Chinese (Simplified): @shoof
German: Kristian, Matthias (source code)
Benchmark-1: Twitter Sentiment from April 2017*: F1 (macro-average of pos and neg, ignoring neutral): 65.09
Data: New sb10k Corpus
Benchmark-2: GermEval-2017* best results (micro-average F): synchronic: .749; diachronic: .750
Paper: Germeval-2017 Proceedings
Data: GermEval-2017 Data
*Note: research on the state of the art is a work in progress; I'll post resources, links, and referenced papers once it's done
Music: mcleavey (generating music in the style of Mozart & Brahms)
Polish: ULMFiT for Polish
Thai: cstorm125 - source code
This is a wiki; please add your name (via hyperlink, not @user_name) and the language you are working on, in alphabetical order. Feel free to form a group to discuss your language-specific problems as well.
- Extract data from Wikipedia - thanks to @Moody
- Download Wikimedia data using a shell script
- limit the corpus to 100 million tokens, per Jeremy's advice
- use SentencePiece for tokenization
- code for quick files loading into a dataframe - thanks to @shoof
- notebook with data preparation instruction - thanks to @binga
- use QRNN in language models - thanks to @sgugger
- try out clr_beta to speed up the learning rate finder and training, and Super convergence(ish) - thanks to @sgugger
- when training a model, always use the best_save_name parameter to save the best model
- if you get an out-of-memory error, run torch.cuda.empty_cache() to clear the cache, then restart the kernel
- NEVER GIVE UP!
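A minimal sketch of the "limit the corpus to 100 million tokens" tip above. It streams lines and stops once a token budget is reached, so a huge Wikipedia dump never has to fit in memory. Whitespace splitting here is only a stand-in for your real tokenizer (e.g. SentencePiece), and the file names in the usage comment are hypothetical:

```python
def truncate_corpus(lines, max_tokens=100_000_000):
    """Yield lines until the cumulative token count would exceed max_tokens."""
    total = 0
    for line in lines:
        n = len(line.split())  # crude whitespace token count; swap in your tokenizer
        if total + n > max_tokens:
            break
        total += n
        yield line

# Hypothetical usage: stream a large dump file to a capped copy.
# with open("wiki.txt") as src, open("wiki_100m.txt", "w") as dst:
#     for line in truncate_corpus(src):
#         dst.write(line)
```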
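On the "quick file loading into a dataframe" tip: this is not @shoof's actual code, just a sketch of the common pattern, assuming one plain-text document per file. Building all records first and constructing the DataFrame once is much faster than appending rows in a loop; the column names are my own choice:

```python
from pathlib import Path

import pandas as pd


def load_texts(folder):
    """Return a DataFrame with one row per *.txt file in `folder`."""
    records = [
        {"file": p.name, "text": p.read_text(encoding="utf-8")}
        for p in sorted(Path(folder).glob("*.txt"))
    ]
    return pd.DataFrame(records, columns=["file", "text"])
```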