Beginning of NLP

@sgugger Re: 007b_imdb_classifier “the pretrained model and the corresponding itos dictionary here and put them in the MODEL_PATH folder.” Where can we download the itos4.pkl and lstm4.pth files from?

Are these from wt103? e.g., is the itos4.pkl what was previously known as models/wt103/itos_wt103.pkl? Or are these coming from another pre-trained model?

Thanks!

Will put them on files.fast.ai sometime today, thanks for reminding me I have to do this!

All done, you can find the model and the vocabulary here. Will add a note in the notebook.


Thanks! Is itos4.pkl == models/wt103/itos_wt103.pkl, etc.? Both names are used in 007b_imdb_classifier.

Yes, I haven’t been very consistent with names, but itos4 and itos_wt103 are the same thing, lstm4 and lstm_wt103 as well.
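For anyone loading these files by hand: the itos pickle is just a list mapping integer ids to token strings, so something along these lines should work (the path is wherever you put the downloaded file):

    import pickle

    # itos = "index to string": a list mapping integer ids to tokens.
    with open('models/wt103/itos_wt103.pkl', 'rb') as f:
        itos = pickle.load(f)

    # Reverse mapping ("string to index") for numericalizing new text.
    stoi = {s: i for i, s in enumerate(itos)}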


There might be a potential bug in the tokenizer that may have carried over from the fastai v0 code, in particular in the inner for loop of the tokenizer method in TextDataset:

        for i in range(self.n_labels+1, len(df.columns)):
            texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)

Wouldn’t the code need to be texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)? Otherwise we will end up with 2 fields that have xfld=1. This was referenced here as well.
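Concretely, the corrected loop would look like this (keeping the snippet’s variable names):

        for i in range(self.n_labels+1, len(df.columns)):
            # Offset by 1 so the extra text columns get xfld 2, 3, ...,
            # leaving xfld 1 for the first text column only.
            texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)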


Ah yes, it was corrected in the imdb scripts, but not the notebook, so I didn’t change it. Thanks for catching this!

This might be another bug. I get an error saying n_lbls is not defined at the same spot in the code. Sure enough, it doesn’t show up anywhere in text/data.py. Interestingly, it didn’t throw that error earlier, even though the only change was the increment by 1. Am I missing something?

Oops, must be a bad copy-paste. Fixed it in this commit.

I don’t know if this is the right place or whether it is too soon, but I was wondering if there could be an option to give a custom path for the tmp directory in the text dataset. My understanding from reading the docs and perusing the code is that if I point to train.csv in a PATH, the code will create a tmp directory within that path, copy over the csv, and create all the intermediate files for tokenization. I have two questions regarding this:

  1. If my valid.csv is in the same path as train.csv, and that path now already has a tmp directory, what will happen? Will the tmp directory get overwritten, or will the code just use the stuff in it?
  2. If I rename the tmp directory to something else, will the code still be able to load it in?

Thanks.

The tmp directory is for all the internal files that the TextDataset uses to remember its computations. If you change its name, the code will recreate it and redo all the computation for tokenization and numericalization.
If you add a valid.csv file, it will just add more files to the tmp directory (named valid*).
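Conceptually, the caching behaves like this sketch (the function and file names here are illustrative only, not the library’s actual ones):

    from pathlib import Path
    import numpy as np

    def cached_tokens(path: Path, name: str, tokenize) -> np.ndarray:
        # Illustrative sketch of the caching behaviour described above;
        # the real TextDataset uses its own internal file names.
        cache = path / 'tmp' / f'{name}_tok.npy'
        if cache.exists():                 # tmp intact -> reuse the result
            return np.load(cache, allow_pickle=True)
        cache.parent.mkdir(exist_ok=True)  # tmp renamed/missing -> redo it all
        toks = np.array(tokenize((path / f'{name}.csv').read_text()), dtype=object)
        np.save(cache, toks)
        return toks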

It seems that the new codebase presumes the underlying data format is a .csv file, whereas it would be infinitely more reusable if, instead, it assumed a DataFrame (which could be created via a .csv file, .tsv file, SQL query, an HTTP GET, etc.).

I’d be curious to hear your thoughts on either adding from_df() type helper methods and/or just assuming DataFrames instead of CSV files.
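Something along these lines, where the csv helper is just a thin wrapper over a DataFrame-based constructor (a rough sketch of the API shape, not a final implementation):

    import pandas as pd

    class TextDataset:
        def __init__(self, df: pd.DataFrame, n_labels: int = 1):
            self.df, self.n_labels = df, n_labels

        @classmethod
        def from_df(cls, df: pd.DataFrame, n_labels: int = 1):
            # The DataFrame can come from a .csv, .tsv, SQL query, HTTP GET, etc.
            return cls(df, n_labels=n_labels)

        @classmethod
        def from_csv(cls, path, n_labels: int = 1):
            # The csv version simply delegates to the df version.
            return cls.from_df(pd.read_csv(path, header=None), n_labels=n_labels)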

That would most probably be better. In other parts of the library that’s what we’ve done, and had the csv version of functions simply call the df version. Happy to take a PR that does this (please test that the documentation code and examples still work if you do this).

Will work up something today for you all to consider.

In terms of testing, I assume that by testing the documentation code and the examples you mean the /examples/text.ipynb file, correct? If there is any other code I should be testing, just lmk.

@wgpubs this is the full set of notebooks: https://github.com/fastai/fastai_docs/tree/master/docs_src . The ones starting with “text.” are the ones that you might be interested in. Especially “text.data”.

Ok thanks.

Is there a particular process for submitting PRs I should follow? I know you guys were doing things a bit differently with using notebooks for initial development, but I was wondering if I can work in the more formal way of:

  1. Forking the repo
  2. Symlinking to the /fastai/fastai folder
  3. Submitting PRs from my forked repo.

The notebook dev stuff was just for the initial work - we’re not using it any more. It’s just a regular library now. And you don’t need to symlink - see the fastai readme for how to do an ‘editable install’. Also, you may want to look at hub for creating PRs - it’s really convenient.

The only bit that’s different to usual processes is our docs.

http://docs.fast.ai/gen_doc.html

Oh wow … cool: pip install -e .[dev]

You mentioned hub before so I’ll check it out (I’m so old school when it comes to submitting PRs).


Done.

I have another recommendation I’m hoping you all might be amenable to as well. It has to do with being able to specify the TEXT columns and the LABEL columns in the .csv or DataFrame, instead of using the n_labels parameter to delineate between text and label columns.

The reason for this is to make running multiple experiments and/or ablations easier by not requiring the user to construct a DataFrame or .csv file for every possible configuration they want to test. For example, I’m working with a dataset with about 5 potential TEXT columns and multiple LABEL columns (some I’d use for multilabel problems and others individually for multiclass problems). Being able to specify the TEXT and LABEL columns to use would allow me to create a single DataFrame that could be used to create multiple TextDatasets and/or learners.

In addition to the configuration flexibility, it seems more intuitive, as the user specifically declares which columns they want to use for both TEXT and LABELs, without having to worry about whether their columns are specified in the right order.

I’m recommending this as an option for TextDataset or as a replacement for the n_labels approach.
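As a rough sketch of what I have in mind (the select_cols helper and the text_cols/label_cols parameter names are purely illustrative):

    import pandas as pd

    def select_cols(df: pd.DataFrame, text_cols, label_cols):
        # Pull the requested TEXT and LABEL columns out of one shared DataFrame.
        return df[text_cols].astype(str), df[label_cols]

    # One DataFrame, several experiments (column names are hypothetical):
    df = pd.read_csv('data.csv')
    texts, labels = select_cols(df, ['title', 'body'], ['sentiment'])
    texts2, multi = select_cols(df, ['body'], ['tag_a', 'tag_b'])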

I’m working through the IMDB doc and will eventually be adapting it to a custom dataset in another project. I have a question about the pretrained models. The pretrained “model” from fastai 0.7 includes these files:

bwd_wt103_enc.h5  bwd_wt103.h5  fwd_wt103_enc.h5  fwd_wt103.h5  itos_wt103.pkl

The fastai 1.0 model has these files:

itos_wt103.pkl  lstm_wt103.pth

  1. There is a separate backward and forward model in the 0.7 version of the library, but only one lstm_wt103.pth in 1.0. Is this forward only (or backward)?
  2. Does the vocabulary between the two match? In other words, are both the itos_wt103.pkl the same?

Finally, I have an off-topic question. It is my understanding that this section of the forums is for discussing the development of the new fastai library. Can I also post questions about usage of the library (for example, how to do something or whether something is possible) in this section? If not, where is the best place to ask those questions? Sorry for the off-topic question; I didn’t want to create a new thread for it, to avoid noise.

Thank you.
