Fastai Transformer and TransformerXL

Are there any example notebooks showing how to use the new fastai Transformer and TransformerXL architectures? I have been playing around with them, but haven’t had much luck training with the default config. Accuracy goes way down after I call learn.unfreeze(). Are these implementations intended for text classification task? Should they outperform the AWD-LSTM and QRNN implementation?


We don’t know yet, and we haven’t had much time to experiment with those architectures either. Defaults come from the OpenAI pretrained model (the old one) for Transformer (that model is available if you pass pretrained=True, by the way) and from the TransformerXL paper.
One thing to try is passing alpha=0 and beta=0 when you create your Learner, to deactivate the AR and TAR regularization, which don’t seem to help (especially AR). I’ll change those defaults when I have the time.
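To make the alpha/beta suggestion concrete: AR (activation regularization) penalizes large activations, and TAR (temporal activation regularization) penalizes large changes between consecutive timesteps. Here is a minimal pure-Python sketch of what those penalties compute (the function name and use of plain floats are illustrative, not fastai’s actual implementation, which operates on tensors inside the RNN trainer callback):

```python
def ar_tar_penalty(activations, alpha, beta):
    """Compute AR and TAR penalties on a sequence of activations.

    activations: per-timestep activation values (floats here for
    illustration; tensors in practice).
    AR penalizes large activations; TAR penalizes large changes
    between consecutive timesteps.
    """
    n = len(activations)
    # AR: mean squared activation, scaled by alpha
    ar = alpha * sum(a * a for a in activations) / n
    # TAR: mean squared difference between consecutive timesteps, scaled by beta
    diffs = [activations[t + 1] - activations[t] for t in range(n - 1)]
    tar = beta * sum(d * d for d in diffs) / max(len(diffs), 1)
    return ar + tar

acts = [0.5, 1.0, -0.5]
ar_tar_penalty(acts, alpha=2.0, beta=1.0)  # nonzero penalty
ar_tar_penalty(acts, alpha=0.0, beta=0.0)  # 0.0 — both penalties disabled
```

Setting alpha=0 and beta=0 simply zeroes out both terms, so no extra loss is added on top of the cross-entropy.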


Thanks, I will give the alpha and beta settings a try.

Would be awesome to see an example on how to use it for character generation like char-rnn from Karpathy!

I ran into the same problem. Have you solved it yet?

No, I haven’t. I will probably start looking into it again in a few months.


Nice openai post on sparse transformers:

Is discriminative LR suitable for all kinds of LM? When I try to use the pretrained Transformer to build a classifier and pass slice((1e-2)/(2.6**12),1e-2) to fit_one_cycle() as a discriminative LR, the accuracy becomes lower and it seems to be overfitting (valid_loss is a little higher than train_loss), but without discriminative LR it works much better. I’m so confused. :cry:
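For reference, a slice(lo, hi) passed to fit_one_cycle spreads the learning rates geometrically across the layer groups, from lo for the earliest layers to hi for the final layers. Here is a simplified pure-Python sketch of that spacing (the function name is ours, and this ignores details of how fastai assigns layers to groups):

```python
def discriminative_lrs(lo, hi, n_groups):
    """Spread learning rates geometrically from lo (earliest layer
    group) to hi (final layer group), mimicking what fastai does
    with slice(lo, hi) -- a simplified sketch."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

# With the values from the post and 3 layer groups, the earliest
# group trains with a drastically smaller LR than the head:
lrs = discriminative_lrs(1e-2 / 2.6 ** 12, 1e-2, 3)
```

With the 2.6**12 divisor the earliest layers get a learning rate roughly five orders of magnitude below the head, so one thing to experiment with is a much smaller divisor.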

We didn’t have time to investigate that, so you should try different values.

Noob coder and first time poster here, so apologies if this is a stupid question. I’m trying to parse through the Transformer and XL models, and I can’t find the skip connection on the feed-forward layers of either the encoder or the decoder. Am I just not seeing it or is there actually no residual connection?

It’s the MergeLayer() in the sequential that does the skip connection.
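In other words, the residual connection is not written explicitly inside the feed-forward module; it is applied by MergeLayer, which adds a block’s input back to its output. A minimal pure-Python sketch of the idea (illustrative only, not fastai code):

```python
def feed_forward(x):
    # stand-in for the position-wise feed-forward sublayer
    return [2 * v for v in x]

def with_skip(sublayer, x):
    """Residual (skip) connection: add the sublayer's output back to
    its input, which is what MergeLayer does around each sublayer."""
    out = sublayer(x)
    return [a + b for a, b in zip(x, out)]

with_skip(feed_forward, [1.0, -2.0])  # [3.0, -6.0]
```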

Thanks - got it!

Any updates here? I have the exact same problem. I have to use a much lower batch size due to memory constraints, which may be causing the problem, but I can’t be sure.

If we pass pretrained=True while creating a learner with TransformerXL, what pretrained weights does it load? I don’t see any URL in the TransformerXL source code.

As far as I know, there are no pretrained TransformerXL weights yet, you would have to pretrain a model by yourself.

Okay, thanks. I was just confused because setting pretrained=True for TransformerXL didn’t throw an error, as it does for QRNNs.

Hi, I have a question: isn’t applying transformations to the data supposed to increase the number of samples? After applying the transforms my data is transformed, but the number of training samples is the same as before. Is there something I’m missing?

The Transformer is a neural network architecture, not a data augmentation method.

Same for me as of 08.2019. Has anyone figured out how to use the Transformer for classification tasks?

I think you just need to replace AWD_LSTM with Transformer; everything else is the same. I have a toy notebook using TransformerXL without pretraining at