Data field tag in ULMFiT

(Nick) #1

In the ULMFiT IMDB notebook we have:

BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag
texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)

This produces lines like:

xbos xfld 1 this is the text that I'm fitting on

What is the {FLD} 1 for? I understand the utility of the BOS tag, but I don’t understand xfld 1 at all, and I don’t recall it in lecture 10.

(Nick) #2

@sebastianruder since you asked to be tagged in my other ULMFiT question…

(Sebastian Ruder) #3

We used xfld so that the model can learn to differentiate between different fields and columns if they are present in the data.
For instance, the DBpedia dataset on which we evaluated contains a header and a text column.
An example of the data would thus look like the following:
xbos xfld 1 wave accounting xfld 2 wave is the brand name for a suite of online small business software products [...]
If there’s only one text field, as in most applications, then xbos xfld 1 is essentially just a longer special sequence to mark the beginning of the document.
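
A minimal sketch of how such multi-field lines could be assembled, in plain Python. The helper name `join_fields` and the sample row are made up for illustration and are not taken from the notebook; only the `xbos`/`xfld` tags and the 1-based field numbering come from the thread.

```python
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag


def join_fields(fields):
    """Prefix every field with `xfld <n>` (1-based) and the document with `xbos`.

    Hypothetical helper for illustration; not part of the fastai notebook.
    """
    marked = [f'{FLD} {i} {text}' for i, text in enumerate(fields, start=1)]
    return f'\n{BOS} ' + ' '.join(marked)


# Illustrative row with a header column and a text column
row = ['wave accounting', 'wave is the brand name for a suite of online small business software products']
print(join_fields(row).strip())
```

Called with a single-element list, the same helper yields the plain xbos xfld 1 prefix from the IMDB notebook.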

(Nick) #4

Oh wow I didn’t realize this.

So we could use this to annotate news headlines and stories or something like that? Has the lift gained from this been written up anywhere?

(Sebastian Ruder) #5

Yep, you could use this to process news data where other fields could also be the name of the author, the name of the publisher, etc. In general, people haven’t really come up with good ways to incorporate external information into LSTMs, so simply concatenating the different data is often a good choice. For instance, in QA, some approaches would concatenate the question with the answer. I’m not aware of a study that explores the effect of this more generally.

(Nick) #6

For structured data I’ve fed it in as a second input and concatenated it into a dense layer, and for title/body-type data I’ve concatenated the dense layers from two LSTMs. End-to-end would be nice though.

(Sebastian Ruder) #7

Yep, I’ve also done that before. Would be nice to have an apples-to-apples comparison of different unstructured/structured data encodings in an LSTM.


So say you are using spacy’s named entity recognizer, would labeling the text with the type of entity be helpful at all? So instead of John Smith went to China you could concatenate the label before the entities, giving you pers John Smith went to loc China or something similar?

(Sebastian Ruder) #9

Depends what you want to do. For a task where knowing what type of entity something is matters, this might be useful.
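
A rough sketch of the entity-label idea from the question above, in plain Python. The function `tag_entities` and the `label_map` argument are hypothetical names; the entity spans are hard-coded here, but with spaCy they could presumably be built from `doc.ents` using `ent.start_char`, `ent.end_char` and `ent.label_`. The mapping of `PERSON` to pers and `GPE` to loc is an assumption chosen to match the example in the question.

```python
def tag_entities(text, entities, label_map=None):
    """Insert a label token in front of each entity span.

    entities: list of (start_char, end_char, label) tuples, assumed
    non-overlapping. Labels missing from label_map are lowercased as-is.
    """
    label_map = label_map or {}
    pieces, cursor = [], 0
    for start, end, label in sorted(entities):
        tag = label_map.get(label, label.lower())
        pieces.append(text[cursor:start])            # untouched text before the entity
        pieces.append(f'{tag} {text[start:end]}')    # label token, then the entity itself
        cursor = end
    pieces.append(text[cursor:])                     # remaining text after the last entity
    return ''.join(pieces)


text = 'John Smith went to China'
ents = [(0, 10, 'PERSON'), (19, 24, 'GPE')]  # hand-written spans for illustration
print(tag_entities(text, ents, {'PERSON': 'pers', 'GPE': 'loc'}))
```

Whether the extra tokens help would still depend on the task, as noted above.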