HuggingFace + TPU = confusion

Hi all,

tl;dr: Is there an easy way to use the Trainer as in the lesson, but train on the Kaggle TPU? TensorFlow solutions seem straightforward, but PyTorch gets really complicated really fast.

After watching lesson 4 and absorbing the content, I wanted to solidify my knowledge by practicing. To do that, I found an NLP-related Kaggle competition and had a go at it. I am trying to improve my score on the “Contradictory, My Dear Watson” practice competition, but I am having trouble using the TPU.

In attempting to learn from the top scorers, I saw that they all leverage xlm-roberta-large-xnli, a model already fine-tuned on the same type of problem (natural language inference). The issue is that it doesn’t fit in normal memory, so they all use the Kaggle TPU to fine-tune it. Unfortunately, I can’t figure out how to do that in PyTorch. I got it to start training, but it is ridiculously slow. The Kaggle notebook stats lead me to believe that it is in fact training on the CPU (an estimated 3 hours to finish one epoch).
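One way to confirm a suspicion like this, rather than inferring it from the notebook stats, is to ask the environment directly. The snippet below is only a rough sketch: `describe_accelerators` is a made-up helper name, and it reports whether `torch` sees a CUDA GPU and whether the `torch_xla` package (which PyTorch needs for TPUs) is installed at all. It does not prove that a TPU is attached or being used.

```python
import importlib.util


def describe_accelerators() -> dict:
    """Report what the current Python environment can see (hypothetical helper)."""
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "torch_xla_installed": False,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_installed"] = True
        # True only if PyTorch can actually reach a CUDA GPU.
        report["cuda_available"] = bool(torch.cuda.is_available())
    # Only checks that the package exists, not that a TPU is attached.
    report["torch_xla_installed"] = importlib.util.find_spec("torch_xla") is not None
    return report


print(describe_accelerators())
```

Once a model has been created, `next(model.parameters()).device` shows where its weights actually live. If that prints `cpu`, the training loop is running on the CPU no matter which accelerator the notebook has switched on.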

Is there an easy way to just use the TPU with the Trainer class, or did I accidentally jump into the deep end too soon? Here is my notebook if you want to have a look at my quick and dirty coding :slight_smile: Watson entry | Kaggle

P.S. If you have any other advice for how to approach this competition, feel free to add it here. Ultimately, I just want to improve my understanding by building a better model for this competition.


I would suggest using smaller models and a GPU.
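A back-of-the-envelope calculation gives a feel for why model size matters here. This is a rough sketch under stated assumptions: the parameter counts are approximate published figures for XLM-R (about 270M for base, 550M for large), and 16 bytes per parameter is a common rule of thumb for full fp32 fine-tuning with an Adam-style optimizer (weights, gradients, and two optimizer buffers). Activations are left out, and they often dominate at long sequence lengths or large batch sizes.

```python
# Rule of thumb (an assumption): fp32 weights + fp32 gradients + two Adam buffers.
BYTES_PER_PARAM = 4 + 4 + 8

# Approximate parameter counts; treat these as ballpark figures only.
APPROX_PARAMS = {
    "xlm-roberta-large": 550e6,
    "xlm-roberta-base": 270e6,
}


def training_memory_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Estimate memory for weights, gradients, and optimizer state, in GB."""
    return n_params * bytes_per_param / 1e9


for name, n_params in APPROX_PARAMS.items():
    print(f"{name}: ~{training_memory_gb(n_params):.1f} GB before activations")
```

Under these assumptions, the large model needs roughly 9 GB before a single batch is processed, and the base model about half that. That leaves much more headroom on a single GPU for activations and a reasonable batch size.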

Thanks @jeremy, I will leave the TPU for when I’m more experienced then :slight_smile:

I tried using DeBERTa with my own special tokens (like in your lesson) and with the traditional [SEP] token, but my accuracy keeps dropping with every epoch. I wasn’t sure what to try, so I wanted to copy the best-performing notebooks.
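For anyone following along, the two input formats mentioned above might look roughly like this. It is only an illustrative sketch: the function names and the `TEXT1:` / `TEXT2:` markers are hypothetical, not taken from the actual notebook.

```python
def with_markers(premise: str, hypothesis: str) -> str:
    """Join the pair with plain-text markers (marker names are hypothetical)."""
    return f"TEXT1: {premise}; TEXT2: {hypothesis}"


def with_sep(premise: str, hypothesis: str, sep: str = "[SEP]") -> str:
    """Join the pair with an explicit separator string."""
    return f"{premise} {sep} {hypothesis}"


premise = "The cat sat on the mat."
hypothesis = "An animal is on the mat."

print(with_markers(premise, hypothesis))
print(with_sep(premise, hypothesis))
```

Hugging Face tokenizers also accept the pair directly, as in `tokenizer(premise, hypothesis)`, and then insert the model's own separator tokens, which avoids hard-coding `[SEP]` for a model that may use a different separator.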

Will try a different competition for now!