MultiFit or pre-trained German LM for ULMFIT?

Hi all,

First of all, thank you all for this forum, it has been extremely helpful so far!

I am working on a small project to classify German text into roughly 20 categories. The texts are mostly company descriptions that I have scraped and categorized into industries, each running between about 200 and 500 characters.

In my previous job, we had tremendous success doing the same with English text, so I decided to go with ULMFiT instead of the usual Transformer models.

I guess my question is: what is the best way to go here? Should I set up MultiFiT (which currently does not work with the latest fastai version, so it needs a bit of a workaround) from here


or use the already pre-trained German language model for ULMFiT (

What is your general experience with these two for German? I understand that MultiFiT has a slightly different architecture, but I am not in a position to set up and try both right now; I have to pick one.

Thank you all for your help!
Best regards,

Hi Axel,

MultiFiT uses a QRNN architecture instead of AWD-LSTM. I think in fastai v2 it is possible to use QRNN instead of AWD-LSTM in ULMFiT. However, since the pre-trained German ULMFiT model is based on AWD-LSTM, you will need to use the same architecture for fine-tuning.
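To make the architecture point concrete, here is a minimal sketch of the config switch in fastai v1. The `awd_lstm_lm_config` dict below mirrors fastai's defaults (treat the exact values as assumptions if your version differs); flipping the single `qrnn` flag is what swaps the cell type, which is also why AWD-LSTM pre-trained weights cannot be loaded into a QRNN model.

```python
# fastai-v1-style language model config (values assumed from fastai defaults)
awd_lstm_lm_config = dict(emb_sz=400, n_hid=1152, n_layers=3,
                          pad_token=1, qrnn=False, bidir=False)

# MultiFiT-style variant: same hyperparameters, QRNN cells instead of LSTM
qrnn_config = {**awd_lstm_lm_config, 'qrnn': True}

# The config would then be passed to the learner, e.g.:
#   learn = language_model_learner(data_lm, AWD_LSTM, config=qrnn_config,
#                                  pretrained=False)  # German weights are LSTM-only
```

Note the `pretrained=False` in the commented call: with `qrnn=True` you would have to pre-train your own language model, since the published German weights match the LSTM layout.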

I’m not sure whether MultiFiT has other advantages over ULMFiT when it is not used for multi-lingual fine-tuning. I would also be interested in that.

If you are working with multiple languages, machine translation might be another approach. I had a text classification project at work with documents in German, Russian and English. Since I didn’t want to split my small training dataset into even smaller per-language datasets, I first translated all documents to English using the Google Translate API and then fine-tuned the standard English ULMFiT model on the entire dataset. Surprisingly, this actually worked really well. For comparison, I fine-tuned the German ULMFiT model on only the German texts, and the accuracy was about the same.
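The translate-then-fine-tune idea above can be sketched in a few lines. The `translate` stub below stands in for any MT API call (Google Translate, DeepL); its tiny lexicon is purely illustrative, and the point is only the pipeline shape: map every document into English first, then train one classifier on the combined data.

```python
def translate(text, target='en'):
    # Placeholder for a real MT API call; toy lexicon for illustration only.
    toy_lexicon = {'Softwareunternehmen': 'software company',
                   'Maschinenbau': 'mechanical engineering'}
    return toy_lexicon.get(text, text)

def build_english_dataset(docs):
    # docs: list of (text, label) pairs in mixed languages.
    # Everything is mapped into English so one model covers all languages.
    return [(translate(text), label) for text, label in docs]

mixed = [('Softwareunternehmen', 'IT'),
         ('mechanical engineering', 'Industrial')]
english = build_english_dataset(mixed)
# english == [('software company', 'IT'), ('mechanical engineering', 'Industrial')]
```

In practice you would cache the translations, since MT API calls are the slow and costly part of this pipeline.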

Hi Stefan, thank you for your reply. My understanding of MultiFiT was that the architecture (QRNN) and the subword tokenization should offer a solid advantage over traditional ULMFiT for morphologically rich languages like German. My dataset is also limited, so I was hoping to benefit from MultiFiT’s ability to work with small training sets.

I do like the machine translation approach. I had a similar case at work, where we initially built a classifier for EN-US with ULMFiT and then did the same for German using machine translation. However, the difference in accuracy was in the double digits. Maybe it was the domain: we classified jobs into categories, so the vocabulary is very specific. Would you be open to sharing which domain your experience is from, and which machine translation API you used (Google, DeepL)?

Best regards,

Right, subword tokenization could indeed be an advantage for German. By the way, subword tokenization can also be used with ULMFiT, but the pre-trained model then needs to have used it as well.
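To illustrate why subwords help for German specifically: a word-level vocabulary sees a long compound as one rare (often out-of-vocabulary) token, while a subword vocabulary can assemble it from frequent pieces. The toy vocabulary and greedy longest-match segmenter below are simplified stand-ins for what SentencePiece actually learns; the real algorithm optimizes the segmentation over a corpus rather than matching greedily.

```python
# Toy subword vocabulary (illustrative; SentencePiece would learn this from data)
SUBWORDS = {'Versicherung', 'unternehmen', 'Unternehmen', 's', 'Bau'}

def segment(word):
    pieces, i = [], 0
    while i < len(word):
        # Greedy longest match against the subword vocabulary
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(segment('Versicherungsunternehmen'))
# ['Versicherung', 's', 'unternehmen']
```

The compound never needs its own vocabulary entry, which is exactly the property that matters for morphologically rich languages with small training sets.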

How do you plan to apply MultiFiT? Leveraging multi-lingual fine-tuning from EN to DE? I haven’t used it myself yet, but it sounds promising.

I was working with short, user-generated content, mainly product reviews, ranging from conversational-style messages to highly technical descriptions, also with a very specific vocabulary. But you’re absolutely right, it depends on the specific use case. I used the Google Translate API for MT, but I plan to look into the MT models that Hugging Face released a while ago.

I don’t know what problems you may encounter using MultiFiT. I also tried to adapt it, for Persian. Anyway, I had some problems, probably due to the newer fastai version, and I replaced the problematic section with the code below, which let me run it. You put this code in your own program, so it overrides the function without changing the fastai source code; you may still need to adapt it.

import os
from pathlib import Path
from typing import Collection

import fastai
from fastai.core import PathOrStr, ifnone
from fastai.text import ListRules, defaults, get_default_size, full_char_coverage_langs

def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules:ListRules=None, post_rules:ListRules=None,
    vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
    char_coverage=None, tmp_dir='tmp', enc='utf8'):
    "Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
    from sentencepiece import SentencePieceTrainer
    cache_dir = Path(path)/tmp_dir
    cache_dir = Path("data/wiki/fa-2/models/fsp15k")  # hard-coded for my Persian setup; adjust or drop
    quotemark = ''  # quoting the input path broke for me, so disable it
    print("Using cache dir:", cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    if vocab_sz is None: vocab_sz = get_default_size(texts, max_vocab_sz)
    raw_text_path = cache_dir / 'all_text.out'
    with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
    spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
    SentencePieceTrainer.Train(" ".join([
        f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
        f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
        f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
        f"--model_prefix={cache_dir/'spm'} --hard_vocab_limit=false --vocab_size={vocab_sz} --model_type={model_type}"]))
    return cache_dir

# override fastai's version without touching its source code
fastai.text.data.train_sentencepiece = train_sentencepiece

It’s usually easy to get ULMFiT to work, and it’s fast to train, so I would definitely use it as a baseline. That makes it easier to see whether other, more complex approaches actually add any value.

I wonder: is the MultiFiT project old, or maybe outdated? Why does it have problems and inconsistencies with the new version of fastai that no one fixes? Does that mean MultiFiT has no use and we should stick to ULMFiT? However, I still can’t say which performs better; I’m just testing…

I think most research focus is on transformers now; see e.g. the Linformer. They use it to get 94.2% on IMDb, which you can match in a few GPU-minutes with ULMFiT… Engineering needs and research interests don’t always overlap. (But of course, research into transformers might well lead to the next big breakthroughs in performance too!)

@hallvagi Thank you for sharing this paper; to be honest, I was not aware of it. I was under the impression that BERT is a few percentage points worse than MultiFiT, at least on the MLDoc benchmark from the original paper. But it is a good idea; I can share some benchmarks with my current dataset.