Fastai Transformer and TransformerXL

Are there any example notebooks showing how to use the new fastai Transformer and TransformerXL architectures? I have been playing around with them but haven’t had much luck training with the default config: accuracy goes way down after I call learn.unfreeze(). Are these implementations intended for text classification tasks? Should they outperform the AWD-LSTM and QRNN implementations?


We don’t know yet, and we haven’t had much time to experiment with those architectures either. The defaults come from the OpenAI pretrained model (the old one) for Transformer (that model is available if you pass pretrained=True, by the way) and from the TransformerXL paper.
One thing to try is passing alpha=0 and beta=0 when you create your Learner, to deactivate the AR and TAR regularization, which don’t seem to help (especially AR). I’ll change those defaults when I have the time.
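For anyone landing here, a minimal sketch of that suggestion, assuming fastai v1’s text API and an existing TextLMDataBunch named data_lm (the data object is an assumption, not from the post above):

```python
from fastai.text import *  # fastai v1 text API

# data_lm is assumed to be an existing TextLMDataBunch.
# pretrained=True loads the old OpenAI weights for the Transformer LM;
# alpha=0 and beta=0 turn off the AR/TAR regularization mentioned above.
learn = language_model_learner(data_lm, Transformer, pretrained=True,
                               alpha=0., beta=0.)
learn.fit_one_cycle(1, 1e-3)
```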


Thanks, I will give the alpha and beta settings a try.

Would be awesome to see an example of how to use it for character generation, like char-rnn from Karpathy!

I ran into the same problem. Have you solved it yet?

No, I haven’t. I will probably start looking into it again in a few months.


Nice OpenAI post on sparse transformers:

Is discriminative LR suitable for all kinds of LM? When I try to use a pretrained Transformer to make a classifier and pass slice((1e-2)/(2.6**12), 1e-2) to fit_one_cycle() as the discriminative LR, the accuracy gets lower and the model seems to be overfitting (valid_loss is a little higher than train_loss), but without discriminative LR it works much better. I’m so confused. :cry:

We didn’t have time to investigate that, so you should try different values.
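For concreteness, a rough sketch of the two setups being compared, assuming fastai v1 and an existing classification DataBunch called data_clas (a hypothetical name):

```python
from fastai.text import *  # fastai v1 text API

# data_clas is assumed to be an existing classification DataBunch.
learn = text_classifier_learner(data_clas, Transformer, drop_mult=0.5)

# Discriminative LRs as in the question: earlier layer groups get smaller LRs.
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 12), 1e-2))

# A single flat learning rate for all layer groups, to compare against:
# learn.fit_one_cycle(1, 1e-2)
```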

Noob coder and first-time poster here, so apologies if this is a stupid question. I’m trying to parse through the Transformer and TransformerXL models, and I can’t find the skip connection in the feed-forward layers of either the encoder or the decoder. Am I just not seeing it, or is there actually no residual connection?

It’s the MergeLayer() in the sequential that does the skip connection.
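For anyone else looking for it, here is a rough plain-PyTorch sketch (hypothetical class name) of the pattern: the output of the feed-forward block is added back to its input, which is what MergeLayer() expresses at the end of fastai’s sequential block:

```python
import torch
import torch.nn as nn

class ResidualFeedForward(nn.Module):
    "Position-wise feed-forward block with a skip (residual) connection."
    def __init__(self, d_model, d_inner, dropout=0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_inner), nn.ReLU(),
            nn.Linear(d_inner, d_model), nn.Dropout(dropout))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # `x + self.ff(x)` is the residual connection; in fastai it is the
        # MergeLayer() at the end of the block that performs this addition.
        return self.norm(x + self.ff(x))

out = ResidualFeedForward(512, 2048)(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```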

Thanks - got it!

Any updates here? I have the exact same problem. I have to use a much lower batch size due to memory constraints, which may be causing the problem, but I can’t be sure.

If we pass pretrained=True while creating a learner with TransformerXL, what pretrained weights is it loading? I don’t see any URL in the TransformerXL source code.

As far as I know, there are no pretrained TransformerXL weights yet; you would have to pretrain a model yourself.

Okay, thanks. I was just confused because setting pretrained=True for TransformerXL didn’t throw an error, as it does for QRNNs.
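In that case the model simply trains from random initialization. A minimal sketch, assuming fastai v1 and an existing TextLMDataBunch called data_lm (a hypothetical name):

```python
from fastai.text import *  # fastai v1 text API

# No pretrained TransformerXL weights ship with fastai, so pass
# pretrained=False and train the language model from scratch.
learn = language_model_learner(data_lm, TransformerXL, pretrained=False)
learn.fit_one_cycle(1, 1e-3)
```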

Hi, I have a question: isn’t applying transformations to the data supposed to increase the number of samples? After applying the transforms my data is transformed, but the number of training samples is the same as before. Is there something I’m missing?

The Transformer is a neural network architecture, not a data augmentation method.

Same for me as of August 2019. Has anyone figured out how to use the Transformer for classification tasks?

I think you just need to replace AWD_LSTM with Transformer; everything else stays the same. I have a toy notebook using TransformerXL without pretraining at https://github.com/tinhb92/sentiment_fastai/blob/master/sentiment.ipynb
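For reference, the swap looks roughly like this, assuming fastai v1 and an existing classification DataBunch called data_clas (a hypothetical name), with TransformerXL trained without pretrained weights as in the notebook above:

```python
from fastai.text import *  # fastai v1 text API

# Usual ULMFiT-style classifier, just with a different architecture argument:
# learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn = text_classifier_learner(data_clas, TransformerXL, drop_mult=0.5,
                                pretrained=False)
learn.fit_one_cycle(1, 1e-3)
```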