We have a recent blog post that discusses how we included article metadata (such as publication name and country) in our text classification models. In our case, we are classifying quotes from news articles. While there is more exploration to be done, I thought it would be of interest to the fast.ai community, as we found that with this method we were able to improve upon our previous results by 3-9%. It’d be great to get people’s feedback.
The way we do this is by prepending the metadata to the text field, like this:
pub_name pub_name_fastai aut_name aut_name_Jeremy_Howard quote_text Today we’ll see how to read a much larger dataset – one which may not even fit in the RAM on your machine!
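A minimal sketch of this prepending step in pandas might look like the following (the column names and tag scheme here are assumptions based on the example above, not our exact pipeline):

```python
import pandas as pd

# Toy data; column names are hypothetical
df = pd.DataFrame({
    "pub_name": ["fastai"],
    "aut_name": ["Jeremy Howard"],
    "quote_text": ["Today we'll see how to read a much larger dataset!"],
})

def prepend_metadata(row):
    # Prefix each metadata value with its field tag, so the same string
    # tokenizes differently when it appears in metadata vs. the body text
    pub = "pub_name pub_name_" + row["pub_name"].replace(" ", "_")
    aut = "aut_name aut_name_" + row["aut_name"].replace(" ", "_")
    return f"{pub} {aut} quote_text {row['quote_text']}"

df["model_text"] = df.apply(prepend_metadata, axis=1)
print(df["model_text"].iloc[0])
```

Because the metadata value is fused into a single tagged token (e.g. `aut_name_Jeremy_Howard`), the word "fastai" in a quote and the publication "fastai" end up as different vocabulary items.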
I was also wondering if other people had insights based on their experiences with similar efforts. I know you can now use the
xxfld marker to indicate separate fields of your data - I think that would have a similar effect, though I would hypothesize that explicitly indicating that the fields are different improves the results. We also wanted to ensure that if the same string appeared in both a metadata field and in the text, it would be treated differently (which is why we concatenate the metadata and prepend it with the tag itself).
This is awesome, thanks for sharing!
This is really cool, thanks for sharing! If I understand this correctly - does this require re-tokenization and retraining on a periodic basis as you encounter new authors or publications?
I came across a very similar idea reading this:
On page 4 they talk about adding in context data, which per table 5.2 did seem to help quite a bit as well.
Yes, you would have to re-tokenize every once in a while. We actually expect to have to retrain these models on a somewhat frequent basis (e.g., biweekly or monthly) due to topic drift, i.e., when a new topic that the model has not seen much before enters the media.
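To make the re-tokenization point concrete: a new author's tag token is out-of-vocabulary until the vocab is rebuilt, so it maps to the unknown token. Here is a toy sketch (whitespace tokenization and the `xxunk` token are stand-ins for whatever the real pipeline uses):

```python
from collections import Counter

def build_vocab(texts, max_size=60000, min_freq=1):
    # Count whitespace tokens and keep the most frequent ones,
    # reserving index 0 for the unknown token
    counts = Counter(tok for t in texts for tok in t.split())
    toks = [tok for tok, c in counts.most_common(max_size) if c >= min_freq]
    return {tok: i for i, tok in enumerate(["xxunk"] + toks)}

old_texts = [
    "aut_name aut_name_Jeremy_Howard quote_text deep learning",
    "aut_name aut_name_Jeremy_Howard quote_text deep nets",
]
vocab = build_vocab(old_texts)

# A quote from an author unseen at training time: the tag token
# falls back to xxunk until we re-tokenize and rebuild the vocab
new_text = "aut_name aut_name_Rachel_Thomas quote_text new topic"
ids = [vocab.get(tok, vocab["xxunk"]) for tok in new_text.split()]
```

Rebuilding the vocab on the expanded corpus (`build_vocab(old_texts + [new_text])`) gives the new tag its own index, which is why periodic re-tokenization is needed.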
Thanks for sharing that paper - I had not seen it before and it does seem that it is very similar. I think there’s a lot of promise in these approaches and it’s something that we’ll continue to explore over the coming year. We’ll be sure to share any interesting results in the forums!
Very interesting, have you tried the opposite: using text inside of tabular models? I am not sure how we would do it.
I’m not sure exactly what you mean, but if you want to combine data from two different types of models I’d look at this other post on the forums, which I think might answer your question.
I’ll likely be working more on combining models of different types over the coming months, so if I make progress on that front I’ll be sure to post here.
Thanks for the link, looks really interesting (ConcatModel).
What I originally meant is that I see you used an NLP model with some tabular fields; I was wondering if you had tried the opposite: text (the content of an article) inside a tabular model: https://docs.fast.ai/tabular.html
@Tchotchke Do you train both the lang model and the classifier on the text with the prepended metadata, or just the classifier?
Yes, both the language model and the classifier.
I was actually doing the same thing before reading your post and was wondering if there was a better approach. The only difference between your approach and mine was that I appended the metadata to the end of the text and delimited it with “,”. But I like your approach better. Let me try your approach and report the result later. The funny thing is I was only able to get my language model accuracy to about .58, but my text classifier was able to achieve .80+ accuracy. Did you experience the same pattern?
I would be very interested in hearing your results, hopefully it works out well for you.
The results you’re getting make sense. Remember, for the language model, the target you are trying to predict is the next word in a sequence, which is really hard (I’m always impressed that we’re able to get above 30% on that task). When adding metadata like this, it will raise the accuracy of the language model because what it’s trying to predict is a bit more constrained.
Your text classifier, by contrast, will almost certainly show higher accuracy, because it has a much more limited target set (e.g., you may only have 10-20 unique values that you are trying to predict).
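Even the chance baselines make this gap unsurprising. A quick illustration (the vocabulary size and class count below are illustrative assumptions, not actual figures from either dataset):

```python
# Guessing the next word over a large vocabulary vs. guessing one of a
# handful of class labels: the baselines differ by orders of magnitude.
vocab_size = 30_000   # assumed order of magnitude for an LM vocabulary
num_classes = 15      # assumed, somewhere in the 10-20 label range

lm_chance = 1 / vocab_size    # chance accuracy for next-word prediction
clf_chance = 1 / num_classes  # chance accuracy for the classifier

print(f"LM chance: {lm_chance:.5%}, classifier chance: {clf_chance:.1%}")
```

So a language model at .30 is already four orders of magnitude above chance, while a classifier at .80 starts from a far easier baseline.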