NLP text classification for German company reviews

I would like to implement a multi-class classification model that classifies reviews (e.g. from trustpilot.com) into different categories (e.g. customer experience, product quality, online shop, delivery, on-site store experience).

Is there any model on Hugging Face you would suggest for German texts? I think it won’t be a good idea to use a model that was pretrained on English text and fine-tune it on German texts?!


There is a German BERT model, for example.
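
For instance, a minimal sketch of loading it for this classification task (assuming the bert-base-german-cased checkpoint on the Hub; the five labels are just the categories from the question):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased",
    num_labels=5,  # e.g. customer experience, product quality, online shop, delivery, on-site store
)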

Otherwise, I would say that fine-tuning an English model on German texts could also be a good idea. Drawing a parallel between NLP and CV (computer vision), one can start with ImageNet pretrained weights and fine-tune to get good results in quite different domains, like medical image classification, because learning basic things like shapes and edge detection stays relevant even in a new domain. So maybe it would work for languages with similar grammar as well.

Also, one can try to use a seq2seq model to translate German into English and run a classifier on the translated sentences. A rough sketch below.
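
A sketch of that approach (assuming the Helsinki-NLP/opus-mt-de-en checkpoint; a full worked pipeline appears in a later reply):

from transformers import pipeline

# Translate a German review to English, then classify the translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
english = translator("Die Lieferung war sehr schnell.")[0]["translation_text"]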


I would suggest looking at these models (which are from the same authors but newer than the original German BERT model):

Here is the accompanying paper for these models.

Here are German (and multilingual) models that have already been trained/fine-tuned on some text classification tasks, but I am not sure whether they will be more helpful than the general LMs for your use case: Models - Hugging Face

If you want to consider models (regardless of language) that were already fine-tuned on a dataset more similar to yours, you could check models trained on Amazon reviews: Models - Hugging Face

I think it could be tricky, though, to fine-tune an English model on German texts (this is called cross-lingual transfer learning) without adjusting the model's vocabulary.
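
To make the vocabulary issue concrete, here is a small sketch comparing how an English and a German tokenizer split the same German sentence (an English subword vocabulary fragments German words into many pieces):

from transformers import AutoTokenizer

english_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
german_tok = AutoTokenizer.from_pretrained("bert-base-german-cased")

sentence = "Die Lieferung war schnell und der Kundenservice freundlich."
print(english_tok.tokenize(sentence))  # many short subword fragments
print(german_tok.tokenize(sentence))   # mostly whole German words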

This works quite well in my experience. You could do this, e.g., with this translation model from Hugging Face.


Yeah, good point. Correct, there are other models by the same provider as well, and it’s nice that they’re uploaded to the HF Hub.

Good to know! I also plan to experiment with non-English texts, so I would like to try various approaches.


I have managed to solve the task by first translating the review texts to English and then analyzing the English texts.

Translate Texts:

from functools import partial

from transformers import MarianMTModel, MarianTokenizer

# Load the German-to-English translation model
model_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_de_en(model, tokenizer, source_text: str) -> str:
    try:
        # Tokenize the German text and generate the English translation
        translated = model.generate(**tokenizer(source_text, return_tensors="pt", padding=True))
        # Decode the generated token IDs back into plain text
        return " ".join(tokenizer.decode(t, skip_special_tokens=True) for t in translated)
    except Exception as e:
        print(e)
        return None

quick_translate = partial(translate_de_en, model, tokenizer)

dd_sample['review_title_en'] = dd_sample['title'].map_partitions(lambda s: s.apply(quick_translate))

I use Dask DataFrames to process the data within the dataframe in parallel.
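
For completeness, a minimal sketch of how such a Dask DataFrame could be created in the first place (assuming a pandas DataFrame df holding the reviews; the partition count is a placeholder):

import dask.dataframe as dd

# Split the pandas DataFrame into partitions that Dask processes in parallel
dd_sample = dd.from_pandas(df, npartitions=8)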

Do Zero-Shot Classification:

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
candidate_labels = ['customersupport', 'shipment', 'webshop', 'order', 'product', 'invoice', 'quality']

def get_main_label(classifier, candidates, text: str) -> str:
    # The pipeline returns candidate labels sorted by score; take the top one
    return classifier(text, candidates)['labels'][0]

classify = partial(get_main_label, classifier, candidate_labels)

dd_sample['review_class'] = dd_sample['review_text_en'].map_partitions(lambda s: s.apply(classify))

Calculate Sentiment Values:

classifier_sentiment = pipeline(task="text-classification", 
                                model="juliensimon/reviews-sentiment-analysis")

df_sample['review_sentiment'] = df_sample['review_text_en'].apply(lambda x: classifier_sentiment(x)[0])
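
The pipeline returns a list with one dict per input, so each cell of review_sentiment holds something like {'label': ..., 'score': ...}. If separate columns are more convenient, a small follow-up step (my own addition, not part of the original solution) could unpack them:

# Unpack the {'label': ..., 'score': ...} dicts into two columns
df_sample['sentiment_label'] = df_sample['review_sentiment'].apply(lambda d: d['label'])
df_sample['sentiment_score'] = df_sample['review_sentiment'].apply(lambda d: d['score'])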

That’s great! I think you could even use the zero-shot pipeline on the original German texts with XLM-R: https://huggingface.co/joeddav/xlm-roberta-large-xnli
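
For example, a quick sketch (the German example sentence and labels are made up; the candidate labels could even be given in German):

from transformers import pipeline

# Multilingual zero-shot classification: no translation step needed
classifier_de = pipeline("zero-shot-classification",
                         model="joeddav/xlm-roberta-large-xnli")
classifier_de("Die Lieferung hat viel zu lange gedauert.",
              candidate_labels=["Lieferung", "Produktqualität", "Kundenservice"])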


How would the tokenisation work there? Aren’t the words or tokens completely different?

@ulat could you please help me understand zero-shot classification?

Zero-shot classification means you want the model to classify into categories it wasn’t trained on.
With “default” classifiers you need labeled training data for each class, e.g. training data with texts you label as A and texts you label as B. The model is then able to classify into the categories A and B, but it cannot classify C.
With zero-shot classification you can grab a model and use it to classify into any given set of labels.
There are plenty of tutorials on zero-shot classification. You could start here: NLP Town
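
A minimal example of the idea (label names chosen on the spot, with no task-specific training involved):

from transformers import pipeline

# The model scores arbitrary labels via NLI entailment; swap in any label set
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier("My package arrived two weeks late.",
                    candidate_labels=["delivery", "billing", "website"])
best_label = result['labels'][0]  # labels come back sorted by score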
