This thread’s objective is to discuss the ULMFiT implementations in different languages and share our roadblocks and approaches.
Languages and people:
Chinese (Simplified): @shoof
German: t-v, Kristian
Benchmark-Twitter Sentiment from April 2017*: F1 (macro-average of pos and neg, ignoring neutral): 65.09
GermEval-2017* best results (micro-average F): synchronic: .749; diachronic: .750
*Note: research on state of the art is WIP, I’ll post resources/links/referenced papers once it is done
Thai: cstorm125 - source code
This is a Wiki: please add your name (via hyperlink, not @user_name) and the language you are working on, in alphabetical order. Feel free to form a group to discuss your language-specific problems as well.
- Extract data from Wikipedia - thanks to @Moody
- Download Wikimedia data using a shell script
- limit the corpus to 100 million tokens, as Jeremy advises
- use SentencePiece for tokenization
- code for quickly loading files into a dataframe - thanks to @shoof
- notebook with data preparation instructions - thanks to @binga
- try out clr_beta to speed up the learning rate finder and training
- when training a model, always use the `best_save_name` parameter to save the best model.
- If you get a CUDA memory error, run `torch.cuda.empty_cache()` to clear the cache and restart the kernel.
- NEVER GIVE UP!