I’m a complete noob when it comes to huggingface’s transformers and have a quite basic question. I don’t want to fine-tune a transformer; I only want the architecture, since my project revolves around SMILES strings rather than a general language problem. I.e., I need to implement my own tokenizer, vocab etc. Does anyone know if there’s an easy way to use any transformers architecture directly in the Learner, similar to how we can use AWD_LSTM with pretrained=False?
My guess is no, since I can’t find anything about it… In fastai v1 there was a transformer called Transformer-XL, separate from huggingface, that could be used like this, but I haven’t found anything similar in fastai v2.
Some GitHub repositories do what you want. One is a simple wrapper around huggingface transformers that I have written.
But there are also more sophisticated and mature solutions like blurr.
Thanks! I’ve seen Blurr before but didn’t think it answered my question. My issue is that neither of the two seems to allow me to use my own tokenizer. Both seem to import the model and the tokenizer together with the Auto-commands… E.g., when I look at the imported model structure, the embedding layer is already sized for a preexisting vocabulary, which won’t work in my case.
Check out this tutorial: Tutorial - Transformers.
I think you can find what you need in the section "Bridging the gap with fastai", as you should create your own tokenizer class and model.
Also, there is fasthugs, which was written by @morgan.
Hope these will help
Thanks! I’ll check it out
I’ve read those before and the issue remains. The closest I’ve come to what I want is this blog post, also found in @morgan’s fasthugs tutorial: How to train a new language model from scratch using Transformers and Tokenizers
There they train a new tokenizer that they later use for training an LM from scratch. That sounds like what I want to do, but the issue is that by default the new tokenizer they build seems to split texts into words before building the vocab. I get that this is what most people wanna do, but it will not work in my case, since a SMILES string has no spaces to split on. I need to split at least at the character level.
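To make "split by character" concrete, here is a minimal pure-Python sketch of a character-level tokenizer for SMILES. The function names, the special tokens, and the tiny corpus are all made up for illustration; none of them come from fastai or huggingface.

```python
from typing import Dict, List

# Hypothetical special tokens; pick whatever your model expects.
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]

def build_char_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map every character seen in the corpus to an integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}

def encode(smiles: str, vocab: Dict[str, int]) -> List[int]:
    """Split a SMILES string into characters and look up their ids."""
    unk = vocab["[UNK]"]
    return [vocab.get(ch, unk) for ch in smiles]

def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert encode() for ids that are in the vocabulary."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

# Toy corpus, invented for this example.
corpus = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_char_vocab(corpus)
ids = encode("CC(=O)O", vocab)
print(ids, decode(ids, vocab))
```

Note that a pure character split breaks two-letter atoms such as Cl or Br into two tokens, so this is only a starting point.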
If you want to use your own tokenizer, trained on custom text with a custom vocab, it is probably easiest to use huggingface directly rather than fastai. With their tokenizers package, the team made this really easy.
You can then continue training a transformer model with huggingface or use the above-mentioned fastai wrappers.
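As a rough sketch of what training a custom tokenizer with the tokenizers package can look like, the snippet below trains a small BPE model on a handful of SMILES strings. The corpus, the vocabulary size, and the special tokens are assumptions chosen for illustration. No pre-tokenizer is set, so each string is treated as one unit instead of being split into words first.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Toy corpus, invented for this example; use your own SMILES data.
smiles_corpus = ["CCO", "CCN", "CCOC", "c1ccccc1", "CC(=O)O", "CC(=O)OC"]

# BPE starts from single characters and learns merges of frequent pairs.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# vocab_size and special_tokens are assumptions; tune them for your data.
trainer = BpeTrainer(
    vocab_size=50,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(smiles_corpus, trainer=trainer)

encoding = tokenizer.encode("CC(=O)O")
print(encoding.tokens)
print(tokenizer.get_vocab_size())
```

The trained tokenizer can be saved with `tokenizer.save("some_path.json")` (the path is a placeholder) and loaded again later.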
Thanks! This looks promising. You are probably right about having to do some more huggingface coding before moving back to fastai, I just hoped it could be avoided. Looks super intuitive anyway at a quick glance!
Update a week later. I’ve built a new custom tokenizer using Huggingface’s Tokenizers library, following the Quicktour in the tokenizers documentation. More specifically, I used the BPE tokenizer, which starts from individual characters but can also merge the most frequent adjacent characters into longer substrings, quite similar to the SentencePiece tokenizer in fastai. Since that was fairly close to what I wanted, I went on to train a RoBERTa model following How to train a new language model from scratch using Transformers and Tokenizers, and I also found this great tutorial: https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403.
I also followed the transformers tutorial on fastai, which went great as well.
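For anyone landing here with the same question, this is a minimal sketch of getting only the architecture, with randomly initialised weights and an embedding layer sized for a custom vocab. Building the model from a config instead of `from_pretrained` is what avoids the preexisting vocabulary. All the sizes below are small placeholder values picked for illustration, not the settings used in this thread.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Assumption: the custom tokenizer has 50 tokens and [PAD] has id 0.
vocab_size = 50

config = RobertaConfig(
    vocab_size=vocab_size,        # sizes the embedding matrix
    hidden_size=64,               # tiny placeholder values, not tuned
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    max_position_embeddings=130,  # a bit above the max sequence length
    pad_token_id=0,
)

# Built from the config only: no pretrained weights are downloaded.
model = RobertaForMaskedLM(config)

embedding_rows = model.get_input_embeddings().num_embeddings
print(embedding_rows)
```

The resulting `model` is a regular PyTorch module, so it can be trained with the huggingface tools or wrapped for a fastai Learner as in the fastai transformers tutorial.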