In our experiments, we used a separate script to create the training data for the backwards language model. I haven’t uploaded that script yet, as I thought it was quite ugly to create separate files. It’d be nicer to simply transform the data once if the backwards parameter is set. Thoughts?
I’m fine with the backwards parameter, but I’m unclear on whether I should also reverse the BOS and FLD annotations in the files. I mean this part:
def get_texts(df, n_lbls=1):
    labels = df.iloc[:, range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls + 1, len(df.columns)):
        texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)
    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)
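To make the question concrete, the transformation I had in mind is just reversing each row of ids wholesale, which would also flip the BOS and FLD markers — hence my uncertainty. (naive_reverse is a hypothetical helper I made up for illustration, not fastai code:)

    import numpy as np

    def naive_reverse(ids_rows):
        # Reverse every row of token ids wholesale, markers included.
        # If a row starts with BOS/FLD markers, they end up at the tail.
        return [np.array(row)[::-1] for row in ids_rows]

So for a row like [bos, fld, 1, w1, w2] this would produce [w2, w1, 1, fld, bos], with the annotations reversed along with the text.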
I also think I’m missing something - my classifier based on the backwards language model is about 8% worse in accuracy than my forward one. That seems wrong, but I can’t find the problem (I’ve already added the backwards=True parameter to the TextDataset parts).
Just to confirm, the backwards classifier should be roughly as strong as the forward one, right?
So far, we have basically kept the FLD annotations fixed so that the model still knows which field it is in, and reversed only the text. For context, here’s the script we’ve been using to transform forward ids into backward ids:
Yep, the backwards and the forward model should have roughly similar performance.
import numpy as np
import fire
import pickle

from create_toks import FLD


def _partition_cols(a, idxs):
    i = 0
    for idx in idxs:
        yield a[i:i+idx]
        i += idx
    yield a[i:]


def partition_cols(a, idxs): return list(_partition_cols(a, idxs))


def reverse_flds(t, fld_id):
    # reverse the tokens within each field, keeping the FLD marker and
    # field number at the start of each field
    t = np.array(t)
    idxs = np.nonzero(t == fld_id)[0]
    parts = partition_cols(t, idxs)[1:]
    reversed = np.concatenate([np.concatenate([o[:2], o[:1:-1]]) for o in parts[::-1]])
    return reversed


def create_bw_data(prefix, joined=False):
    print(f'prefix {prefix}; joined {joined}')
    PATH = f'data/nlp_clas/{prefix}/'
    joined_id = 'lm_' if joined else ''
    fwd_trn_path = f'{PATH}tmp/trn_{joined_id}ids.npy'
    fwd_val_path = f'{PATH}tmp/val_{joined_id}ids.npy'
    bwd_trn_path = f'{PATH}tmp/trn_{joined_id}ids_bwd.npy'
    bwd_val_path = f'{PATH}tmp/val_{joined_id}ids_bwd.npy'
    fwd_trn = np.load(fwd_trn_path)
    fwd_val = np.load(fwd_val_path)
    itos = pickle.load(open(f'{PATH}tmp/itos.pkl', 'rb'))
    stoi = {s: i for i, s in enumerate(itos)}
    fld_id = stoi[FLD]
    bwd_trn = np.array([reverse_flds(o, fld_id) for o in fwd_trn])
    bwd_val = np.array([reverse_flds(o, fld_id) for o in fwd_val])
    np.save(bwd_trn_path, bwd_trn)
    np.save(bwd_val_path, bwd_val)


if __name__ == '__main__': fire.Fire(create_bw_data)
Hi @sebastianruder
Thanks for your paper. I’m working on applying ULMFiT to our Chinese customer service corpus. I’ve trained the forward model successfully and am now working on the backwards model. When I tried out your code above, I found a problem. Is _partition_cols wrong? I’m using this modified version instead:
def _partition_cols(a, idxs):
    i = 0
    for idx in idxs:
        yield a[i:idx]
        i = idx
    yield a[i:]
It works well in my test, where 2 is the fld_id and 5 and 6 are the field sequence ids.
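Here’s the toy check I ran with the modified _partition_cols together with reverse_flds from the script above; all the ids (fld_id=2, field numbers 5 and 6, word tokens 10/11/20/21) are made up for illustration:

    import numpy as np

    def _partition_cols(a, idxs):
        # idxs are the positions of the FLD marker; split the array there
        i = 0
        for idx in idxs:
            yield a[i:idx]
            i = idx
        yield a[i:]

    def partition_cols(a, idxs): return list(_partition_cols(a, idxs))

    def reverse_flds(t, fld_id):
        t = np.array(t)
        idxs = np.nonzero(t == fld_id)[0]
        parts = partition_cols(t, idxs)[1:]
        # keep [fld_id, field number] in place, reverse the rest of each
        # field, and reverse the order of the fields themselves
        return np.concatenate([np.concatenate([o[:2], o[:1:-1]]) for o in parts[::-1]])

    t = [2, 5, 10, 11, 2, 6, 20, 21]
    print(reverse_flds(t, fld_id=2).tolist())  # [2, 6, 21, 20, 2, 5, 11, 10]

Note that the [1:] slice drops anything before the first FLD marker, so a leading BOS token would not survive the transformation as written.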
I have another question about “xbos”: do we need to add “” and xbos at the head of each backwards row as we did for the forward model?
Thx, I am training the backwards LM now and get similar results to the forward model in the first epoch:
Forward:
epoch || trn_loss || val_loss || accuracy
0     || 3.919522 || 3.161851 || 0.43043

Backward:
epoch || trn_loss || val_loss || accuracy
0     || 3.934117 || 3.171467 || 0.428741