Data field tag in ULMFiT

nickl · July 5, 2018, 4:47am

BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag
...
...
texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)

This produces lines like:

xbos xfld 1 this is the text that I'm fitting on

What is the {FLD} 1 for? I understand the utility of the BOS tag, but I don’t understand xfld 1 at all, and I don’t recall it in lecture 10.

nickl · July 5, 2018, 12:06pm

@sebastianruder since you asked to be tagged in my other ULMFiT question…

sebastianruder · July 6, 2018, 10:05am

We used xfld so that the model can learn to differentiate between different fields and columns if they are present in the data.
For instance, the DBpedia dataset on which we evaluated contains a header and a text column.
An example of the data would thus look like the following:
xbos xfld 1 wave accounting xfld 2 wave is the brand name for a suite of online small business software products [...]
If there’s only one text field, as in most applications, then xbos xfld 1 is essentially just a longer special sequence to mark the beginning of the document.
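
To make the multi-field format concrete, here is a minimal sketch of how such a line could be assembled. It reuses the BOS and FLD tags quoted at the top of the thread; the helper name tag_fields is hypothetical, it takes a plain list of strings rather than DataFrame columns, and it leaves out the leading newline that the original line adds.

```python
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag

def tag_fields(fields):
    """Join text fields into one string, prefixing field n with 'xfld n' (1-based).

    Hypothetical helper for illustration; not part of the fastai code.
    """
    parts = [f'{FLD} {i} {text}' for i, text in enumerate(fields, start=1)]
    return f'{BOS} ' + ' '.join(parts)

# Two-field record (title, body), modelled on the DBpedia example above.
row = ['wave accounting',
       'wave is the brand name for a suite of online small business software products']
print(tag_fields(row))
```

With a single-element list the result starts with xbos xfld 1, which is the single-field case described above.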

nickl · July 6, 2018, 10:17am

Oh wow I didn’t realize this.

So we could use this to annotate news headlines and stories or something like that? Has the lift gained from this been written up anywhere?

sebastianruder · July 6, 2018, 10:51am

Yep, you could use this to process news data where other fields could also be the name of the author, the name of the publisher, etc. In general, people haven’t really come up with good ways to incorporate external information into LSTMs, so simply concatenating the different data is often a good choice. For instance, in QA, some approaches would concatenate the question with the answer. I’m not aware of a study that explores the effect of this more generally.

nickl · July 6, 2018, 3:42pm

I’ve done a second input and concatenated to a dense layer for structured data, or concatenated dense layers from two LSTMs for title/body type data before. End-to-end would be nice though.

sebastianruder · July 8, 2018, 1:33pm

Yep, I’ve also done that before. Would be nice to have an apples-to-apples comparison of different unstructured/structured data encodings in an LSTM.

msmedes · July 10, 2018, 4:15pm

So say you are using spacy’s named entity recognizer, would labeling the text with the type of entity be helpful at all? So instead of John Smith went to China you could concatenate the label before the entities, giving you pers John Smith went to loc China or something similar?
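
A minimal sketch of the label-prefixing idea, assuming the entities are already available as (start, end, label) character offsets. spaCy exposes these on each entity in doc.ents as start_char, end_char and label_; the sketch does not call spaCy itself, the helper name prefix_entities is hypothetical, and the labels pers and loc are the ones from the post rather than spaCy's own label names.

```python
def prefix_entities(text, entities):
    """Insert each entity's label before its span.

    entities: iterable of (start, end, label) character offsets,
    assumed not to overlap. Hypothetical helper for illustration.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(entities):
        out.append(text[cursor:start])            # text before the entity
        out.append(f'{label} {text[start:end]}')  # label, then the entity itself
        cursor = end
    out.append(text[cursor:])                     # remaining text after the last entity
    return ''.join(out)

text = 'John Smith went to China'
ents = [(0, 10, 'pers'), (19, 24, 'loc')]  # offsets written by hand for this sentence
print(prefix_entities(text, ents))  # pers John Smith went to loc China
```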

sebastianruder · July 13, 2018, 11:31am

Depends what you want to do. For a task where knowing the type of an entity matters, this might be useful.

shaun1 · August 2, 2018, 1:12pm

Do we need to manually annotate the different fields? I’m trying to figure out how we set up the fields in a piece of text. For example, let’s say we have a product name and its description:

Razer BlackWidow Chroma Keyboard
This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.

There are 2 fields in this piece of text: the name and the description. Would running the code automatically tag these as xfld 1 and xfld 2? (I don’t see that it would.) Or would I need to have them as separate columns? I’m not sure how we would get two fields out of this piece of text. I would love to figure this out.

Thanks.
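
The line quoted at the top of the thread builds the tagged text from a DataFrame column, which suggests the fields have to be separated before tagging rather than being detected automatically. A minimal sketch under that assumption, and under the further assumption that the product name always sits on the first line of the raw text; the helper names split_name_description and tag_fields are hypothetical.

```python
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag

def split_name_description(raw):
    """Split raw text into (name, description) at the first line break.

    Assumes the name is the first line; everything after it is the description.
    """
    name, _, description = raw.partition('\n')
    return name.strip(), description.strip()

def tag_fields(fields):
    """Prefix field n with 'xfld n' (1-based) and join into one tagged string."""
    parts = [f'{FLD} {i} {text}' for i, text in enumerate(fields, start=1)]
    return f'{BOS} ' + ' '.join(parts)

raw = ('Razer BlackWidow Chroma Keyboard\n'
       'This keyboard is in great condition and works like it came out of the box.')
name, description = split_name_description(raw)
print(tag_fields([name, description]))
```

If the data already lives in separate columns, the splitting step is unnecessary and only the tagging step applies.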