Metadata in ULMFiT - Improving accuracy by up to 20%

Tchotchke · June 22, 2019, 6:47pm

My team and I have been working on a more in-depth analysis of how you can introduce metadata (such as author or publication) into a model along with the text. We have been calling this Metadata Enhanced ULMFiT (ME ULMFiT).

Our analysis has shown some interesting results:

A single piece of metadata improved the model by as much as 20% (relative improvement)
Adding more metadata to the model did not always improve results
Our method of adding metadata, which has unique tags per column and concatenates the values in the metadata columns, seems to fairly consistently outperform the method suggested by fastai (just separating fields by xxfld)

I’d love to get people’s thoughts on it, as I feel like there is still a lot to learn around this approach.

pietebr · August 14, 2019, 10:56am

Hi Matthew! Could you please share the code that adds the metadata to the text? I’m a beginner programmer and unsure when and how to include it.

I have a dataset of 40k complaints. Classifying them with ULMFiT gives an accuracy of 74% on 11 classes and 59% on 66 classes, so I am eager to test how much the results can improve with metadata, which I have plenty of.

I am writing a master’s thesis about my work, so I will be able to cite your work in it.

Tchotchke · August 15, 2019, 3:37pm

Those seem like quite good results - I’d be very interested to hear how metadata works for you. There are two components we used to add the metadata. The first cleans up a field and combines its words into a single string:

def modify_metadata(input_string, tag_name):
    data = tag_name + "_".join(input_string.split(" "))
    if "," in data:
        data = data.replace(",", "")
    return data

Then you prepend the metadata to the start of the quote:

def add_metadata_joined(input_row, metadata_list):
    separator = ""  # defined before the loop so an empty metadata_list still works
    tagged_text_list = []
    # Adapt metadata_list to the columns you want to use as tags
    for data in metadata_list:
        # Produces "<column> <column>_<Value_With_Underscores> "
        # (the original data.lower().join("_") always evaluated to "_")
        tagged = data + " " + data + modify_metadata(str(input_row[data]), "_") + " "
        tagged_text_list.append(tagged)
    q_text = 'quote_text ' + str(input_row['text'])
    tagged_text_list.append(q_text)
    joined = separator.join(tagged_text_list)
    return joined

(Note: I tried to make sure indentation was correct, but it may have gotten messed up when I copied it over)

So the input_row is from your csv and contains the quote and associated metadata (where you list the names of your metadata in metadata_list). If you have your data in a pandas data frame, you can apply it with something like:

df['text'] = df.apply(lambda x: add_metadata_joined(x, metadata_list), axis=1)
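To check what the combined text looks like, here is a minimal sketch that restates the two helpers compactly and runs them on one row. The row contents and the column names `author` and `publication` are made up for illustration, and a plain dict stands in for a pandas row.

```python
def modify_metadata(input_string, tag_name):
    # Join the words of a metadata value with underscores, prefix the tag, drop commas.
    return (tag_name + "_".join(input_string.split(" "))).replace(",", "")

def add_metadata_joined(input_row, metadata_list):
    parts = []
    for column in metadata_list:
        # Column name as a tag, then a single token of the form column_Value_Words.
        parts.append(column + " " + column + modify_metadata(str(input_row[column]), "_") + " ")
    parts.append("quote_text " + str(input_row["text"]))
    return "".join(parts)

# Made-up example row; a dict is indexed by column name just like a pandas row.
row = {
    "author": "Jane Doe",
    "publication": "The Daily News, Inc",
    "text": "he thought it might be improved upon",
}
result = add_metadata_joined(row, ["author", "publication"])
print(result)
# author author_Jane_Doe publication publication_The_Daily_News_Inc quote_text he thought it might be improved upon
```

Each metadata value becomes a single token, so the language model can learn one embedding per distinct author or publication.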

I think that should be all you need, but feel free to let me know if you have trouble with it.

pietebr · September 5, 2019, 12:09pm

Hi Matthew! Thank you, I think I executed it correctly. Unfortunately, the results did not improve. The complaints are quite long, often several dozen words. How long are the texts you classify?

Tchotchke · September 5, 2019, 3:13pm

They are quotes from newspapers, so often relatively short, on the order of 6-20 words I think. Sometimes a quote is missing relevant context from the rest of the article, such as “he thought it might be improved upon”. Without that context, it would be tough for our model to label the quote. So my intuition has been that, especially in those cases, providing additional metadata (such as who the speaker of the quote was) will improve predictions on average.

As another example, imagine you were trying to predict sentiment towards a particular political party. Just providing the speaker of the quote will really help the model: if the speaker is from the same party, sentiment is likely to be positive, and if the speaker is from an opposing party, the sentiment is more likely to be negative.

If you already have a lot of information in the text I would expect there to be less room for improvement. @pietebr Based on your intuition, do you think there are correlations between the metadata you have and the label you are trying to predict that would not be captured by your text? i.e., is the metadata providing new information that is not captured by the text?
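One rough way to probe that question is to see how well a metadata column predicts the label on its own, compared with always guessing the most common label. The sketch below is only a first check under made-up toy data (the speaker and label values are invented). It cannot tell you whether the text already captures the same signal, and because it is scored on the same rows it is fitted on, it is optimistic.

```python
from collections import Counter, defaultdict

def majority_baseline(labels):
    """Accuracy of always predicting the single most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def metadata_only_accuracy(values, labels):
    """Accuracy of predicting, for each metadata value, its most common label."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(labels)

# Toy, made-up data: one metadata column (speaker) and the target label.
speakers = ["A", "A", "A", "B", "B", "B", "C", "C"]
labels = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "neg"]

baseline = majority_baseline(labels)
meta_acc = metadata_only_accuracy(speakers, labels)
print(baseline, meta_acc)  # 0.5 0.875
```

If the metadata-only number is close to the baseline, the column probably has little to offer. If it is well above the baseline, the remaining question is whether the text already carries that information.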

nancyC · November 20, 2019, 3:38pm

Hi Matthew, I’m wondering: if we use metadata to train the model, how should we pass the metadata when we are predicting? For example, if we use the fast.ai method: “xxfld Pressse Agence xxfld Hurriyet xxfld Turkey…”, different columns are parsed into different fields automatically. But when we are predicting with learn_clas.predict(…), what should the input be?

Tchotchke · November 20, 2019, 9:57pm

The input would be in the same format that you use for training. So if in training you processed your text to get “art_title this article title quote_text here is the text of the quote you are classifying”, then you would apply the same preprocessing (which would involve combining a few fields) to the new example before passing it to predict.
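As a minimal sketch of that idea, the same formatting function can be used for training rows and for new examples. It assumes the tag-plus-joined-value layout from the helpers earlier in the thread. The function name `to_model_input`, the `METADATA` list, and the row contents are hypothetical.

```python
def to_model_input(row, metadata_list):
    # Same layout as the training text: tag, tag_Value, ..., then "quote_text" and the text.
    parts = []
    for column in metadata_list:
        value = "_".join(str(row[column]).replace(",", "").split(" "))
        parts.append(f"{column} {column}_{value}")
    parts.append("quote_text " + str(row["text"]))
    return " ".join(parts)

METADATA = ["art_title"]  # hypothetical; must match the columns used in training

# Made-up rows: one as seen in training, one arriving at prediction time.
train_row = {"art_title": "Budget vote delayed", "text": "he thought it might be improved upon"}
new_row = {"art_title": "Council approves plan", "text": "we are pleased with the outcome"}

train_text = to_model_input(train_row, METADATA)
predict_text = to_model_input(new_row, METADATA)
print(predict_text)
# art_title art_title_Council_approves_plan quote_text we are pleased with the outcome

# With a trained classifier (learn_clas in the question above), the call would be:
# learn_clas.predict(predict_text)
```

Routing both paths through one function keeps the tags and their order identical, which is what the model relies on.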

vahuja4 · June 13, 2020, 8:00am

Hi Matthew,

I am trying to use ULMFiT to predict a movie’s genre from its description, using the MovieLens dataset. There are other text fields too: tagline, cast, etc. ME ULMFiT looks like a good method to try. My question is: did you fine-tune a pretrained language model with the metadata included?

Tchotchke · June 15, 2020, 12:34pm

Yes - we started with the typical pretrained language model, then fine-tuned it after including the metadata (because the LM needs to learn what all of the special tags and words mean).