I’m a complete noob when it comes to huggingface’s transformers and have a fairly basic question. I don’t want to fine-tune a transformer but only want the architecture, since my project revolves around SMILES strings and not a general language problem. I.e. I need to implement my own tokenizer, vocab, etc. Does anyone know if there’s an easy way to use any transformers architecture directly in the Learner, similar to how we can use AWD_LSTM with pretrained=False?
My guess is no, since I can’t find anything about it… In fastai v1 there was a Transformer-XL implementation, separate from huggingface, that could be used like this, but I haven’t found anything similar in fastai v2.
Thanks! I’ve seen Blurr before but didn’t think it answered my question. My issue is that neither of the two seems to allow me to use my own tokenizer? Both seem to import both the model and the tokenizer with the Auto commands… E.g. when I look at the imported model structure, the embedder is already defined by a preexisting vocabulary, which won’t work in my case.
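From what I can tell so far, instantiating a model from a config (rather than with from_pretrained) gives a fresh architecture whose embedding size I can set myself. A rough sketch, assuming a BERT-style model; all the sizes here are just illustrative, not tuned for SMILES:

```python
from transformers import BertConfig, BertForMaskedLM

# Build the architecture only: no pretrained weights or vocabulary
# are downloaded, and vocab_size can match a custom tokenizer.
config = BertConfig(
    vocab_size=64,          # size of my own SMILES vocab (illustrative)
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=256,
)
model = BertForMaskedLM(config)

# The embedding matrix now matches my vocab, not a pretrained one.
print(model.bert.embeddings.word_embeddings.num_embeddings)  # 64
```

So the architecture part seems doable; it’s the tokenizer side I’m still unsure about.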
Check out this tutorial: Tutorial - Transformers
I think you can find what you need in the section Bridging the gap with fastai, since it shows how to create your own tokenizer class and model.
Here they train a new tokenizer that they later use for training an LM from scratch. That sounds like what I want to do, but the issue is that by default the new tokenizer they build seems to split texts into words to build the vocab. I get that’s what most people want to do, but it won’t work in my case. I need to at least split by character.
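For the record, the character-level splitting itself is simple enough in plain Python. A rough sketch of what I’m after (the class and special-token names are my own, not from any library):

```python
# Minimal character-level tokenizer sketch for SMILES strings.
# Builds a vocab from the characters seen in a corpus; unseen
# characters map to <unk>.
class CharTokenizer:
    SPECIALS = ("<pad>", "<unk>", "<bos>", "<eos>")

    def __init__(self, texts):
        chars = sorted({c for t in texts for c in t})
        self.itos = list(self.SPECIALS) + chars
        self.stoi = {s: i for i, s in enumerate(self.itos)}

    def encode(self, text):
        unk = self.stoi["<unk>"]
        ids = [self.stoi.get(c, unk) for c in text]
        return [self.stoi["<bos>"]] + ids + [self.stoi["<eos>"]]

    def decode(self, ids):
        return "".join(
            self.itos[i] for i in ids if self.itos[i] not in self.SPECIALS
        )

tok = CharTokenizer(["CCO", "c1ccccc1", "CC(=O)O"])
print(tok.encode("CCO"))
print(tok.decode(tok.encode("CCO")))  # CCO
```

The open question for me is how to plug something like this into the fastai/huggingface machinery instead of their word-level default.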
If you want to use your own tokenizer, trained on custom text with a custom vocab, it is probably easiest to use huggingface directly rather than fastai. With their tokenizers package, the team has made this really easy.
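A minimal sketch with the tokenizers package, training a small BPE vocab from raw SMILES strings. The vocab size and special tokens are just examples; since BPE builds up from individual characters, no word-level pre-tokenizer is needed, which suits SMILES well:

```python
from tokenizers import Tokenizer, models, trainers

# Tiny illustrative corpus of SMILES strings.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]

# BPE starts from single characters and learns merges from there,
# so there is no word-splitting step to work around.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
trainer = trainers.BpeTrainer(
    vocab_size=100,                    # illustrative
    special_tokens=["<pad>", "<unk>"],
)
tokenizer.train_from_iterator(smiles, trainer=trainer)

enc = tokenizer.encode("CCO")
print(enc.tokens, enc.ids)
```

From there you can wrap the trained tokenizer for fastai, as shown in the tutorial linked above.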
Thanks! This looks promising. You’re probably right about having to do some more huggingface coding before moving back to fastai; I’d just hoped it could be avoided. It looks super intuitive at a quick glance anyway!