Integrating custom tokenizer

Hi!
I’m currently trying to build a language model for SMILES strings (molecular structures in text form), and I have a custom tokenizer for such strings. An example string would be e.g. c1ccccc1, corresponding to benzene. My tokenizer basically splits the string into its atoms and special characters (such as double bonds "=" etc.). My problem is fitting this into the DataBlock structure and getting the nice .show_batch() function to work. Also, Learner.predict() raises an error after training, so I'm not sure whether my training has even done what I want it to do…

The .show_batch() error goes away when I wrap my tokenizer in the fastai Tokenizer() class, but that won't work as a solution: my SMILES problem is case-sensitive (a "c" is different from a "C"), and combined with my custom tokenizer the wrapper's special tokens, e.g. "xxbos", get split into "x x b o s" etc.… I already have my tokenizer add a bos token and an eos token ("^" and "&"), so I really shouldn't need the Tokenizer wrapper…

Here’s an example of the tokenizer in action:

tokens = tokenizer.tokenize(data.SMILES)
first(tokens)
['^', 'C', '=', 'C', '&']
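
For anyone curious, a tokenizer along these lines can be sketched with a single regex. This is a hypothetical minimal version (the exact pattern is an assumption, not my real implementation), but it also shows why wrapping it in Tokenizer() shreds the wrapper's special tokens:

import re

class SMILESAtomTokenizer:
    # match two-letter elements first so e.g. 'Cl' and 'Br' are not split into two tokens
    pattern = re.compile(r"Cl|Br|[A-Za-z]|\d|[=#()\[\]@+\-/\\%]")

    def tokenize(self, texts):
        # one token list per input string, wrapped in my custom bos/eos markers
        return (['^'] + self.pattern.findall(t) + ['&'] for t in texts)

tokenizer = SMILESAtomTokenizer()
next(tokenizer.tokenize(['C=C']))    # ['^', 'C', '=', 'C', '&']
next(tokenizer.tokenize(['xxbos']))  # ['^', 'x', 'x', 'b', 'o', 's', '&']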

I’ve tried to follow the IMDB tutorial in lesson 10. Here’s my DataBlock now and its summary after feeding it a dataframe of SMILES

DataBlock

dblock = DataBlock(blocks=TextBlock(tok_tfm=tokenizer.tokenize, vocab=vocab, is_lm=True),
                   get_x=ColReader(0),            # extract SMILES from the dataframe
                   splitter=RandomSplitter(0.1))  # split the dataset randomly
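
For reference, vocab above was built beforehand from the tokenized corpus, roughly along these lines (a sketch; the placeholder special tokens and their order are assumptions, not my exact ones):

from collections import Counter

# hypothetical vocab construction: reserved special tokens first, then every token seen in the corpus
counts = Counter(tok for toks in tokenizer.tokenize(data.SMILES) for tok in toks)
vocab = ['<unk>', '<pad>'] + [tok for tok, _ in counts.most_common()]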

.summary():

dblock.summary(data, bs=4, show_batch=False)
Setting-up type transforms pipelines
Collecting items from SMILES Molar_mass
0 C 16.031300
1 N 17.026549
2 O 18.010565
3 F 20.006228
4 CC 30.046950
… … …
995 OC(CF)C=C 90.048093
996 OC(CF)C=O 92.027358
997 OC(CF)C#N 89.027692
998 CCC(F)C=C 88.068829
999 CCC(F)C=O 90.048093

[1000 rows x 2 columns]
Found 1000 items
2 datasets of sizes 900,100
Setting up Pipeline: ColReader -- {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} -> SMILESAtomTokenizer.tokenize -> Numericalize

Building one sample
Pipeline: ColReader -- {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} -> SMILESAtomTokenizer.tokenize -> Numericalize
starting from
SMILES CC(F)(F)CN
Molar_mass 95.054656
Name: 482, dtype: object
applying ColReader -- {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} gives
CC(F)(F)CN
applying SMILESAtomTokenizer.tokenize gives
CC(F)(F)CN
applying Numericalize gives
TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5])

Final sample: (TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5]),)

Collecting items from SMILES Molar_mass
0 C 16.031300
1 N 17.026549
2 O 18.010565
3 F 20.006228
4 CC 30.046950
… … …
995 OC(CF)C=C 90.048093
996 OC(CF)C=O 92.027358
997 OC(CF)C#N 89.027692
998 CCC(F)C=C 88.068829
999 CCC(F)C=O 90.048093

[1000 rows x 2 columns]
Found 1000 items
2 datasets of sizes 900,100
Setting up Pipeline: ColReader -- {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} -> SMILESAtomTokenizer.tokenize -> Numericalize
Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline:
Setting up after_batch: Pipeline:

Building one batch
Applying item_tfms to the first sample:
Pipeline: ToTensor
starting from
(TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5]))
applying ToTensor gives
(TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5]))

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch
Error! It’s not possible to collate your items in a batch
Could not collate the 0-th members of your tuples because got the following shapes
torch.Size([10]),torch.Size([10]),torch.Size([5]),torch.Size([9])

I believe the last error doesn't matter, since it's also produced in the IMDB example. To me everything looks to be working: the numericalized tensors have one integer per character in the string, which is exactly one token each with my tokenizer.
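
That collate error is just PyTorch refusing to stack variable-length sequences into one batch tensor, which the LM dataloader later avoids by concatenating everything and slicing it into fixed-length chunks. A minimal illustration:

import torch

try:
    # sequences of different lengths cannot be stacked into a single batch tensor
    torch.stack([torch.zeros(10), torch.zeros(5)])
except RuntimeError as e:
    print(e)  # "stack expects each tensor to be equal size, ..."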

Interestingly, .one_batch() does work, as seen below, and it looks to be working as intended, with the targets offset by 1 token…

.one_batch():

dls.one_batch()
(LMTensorText([[ 5, 4, 11, …, 5, 4, 11],
[ 4, 6, 4, …, 4, 12, 11],
[12, 5, 4, …, 8, 4, 4],
…,
[14, 15, 16, …, 11, 7, 12],
[ 4, 11, 4, …, 4, 12, 8],
[12, 7, 4, …, 4, 4, 11]], device='cuda:0'),
TensorText([[ 4, 11, 8, …, 4, 11, 5],
[ 6, 4, 8, …, 12, 11, 7],
[ 5, 4, 8, …, 4, 4, 4],
…,
[15, 16, 17, …, 7, 12, 7],
[11, 4, 12, …, 12, 8, 5],
[ 7, 4, 6, …, 4, 11, 4]], device='cuda:0'))
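
One quick way to sanity-check that offset (a sketch; it assumes the usual LM convention that each target token is the next input token):

x, y = dls.one_batch()
# the target sequence should be the input sequence shifted one token to the left
assert (x[:, 1:] == y[:, :-1]).all()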

.predict():

Learner.predict('c1ccccc1', 8, temperature=0.75)
TypeError: Subscripted generics cannot be used with class and instance checks
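
For context, that TypeError is what Python itself raises whenever isinstance() is used with a subscripted typing generic, so something in the predict pipeline is presumably doing a type check against a parameterized type. A minimal reproduction of the error itself:

from typing import List

try:
    isinstance([1, 2], List[int])
except TypeError as e:
    print(e)  # "Subscripted generics cannot be used with class and instance checks"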

To summarize, I guess my question is first whether anyone knows of an example or another thread where somebody has integrated a custom tokenizer with fastai for something that isn't your typical LM problem? And further, whether anyone has an idea how to get this tokenizer to work with show_batch, and why .predict() doesn't work…

Thank you!

"Learner" is a fastai class.
Make sure you call predict() on the Learner instance you actually trained.

e.g.
learn.fit_one_cycle(1, 2e-2)
learn.get_preds()

Thanks for answering. I'm aware of this; it was a typo on my part… In the actual code it says lm_learner.predict('c1ccccc1', 8, temperature=0.75). Here's the short pipeline for clarification:

data: a dataframe with SMILES and mass (mass not used in this LM)

dblock = DataBlock(blocks=TextBlock(tok_tfm=tokenizer.tokenize, vocab=vocab, is_lm=True),
                   get_x=ColReader(0),            # extract SMILES from the dataframe
                   splitter=RandomSplitter(0.1))  # split the dataset randomly

dblock.summary(data, bs=4, show_batch=False)

dls = dblock.dataloaders(data, bs=128, seq_len=15)

dls.one_batch()
(LMTensorText([[ 1, 5, 4, …, 1, 5, 4],
[ 4, 12, 4, …, 8, 4, 2],
[ 8, 6, 12, …, 4, 2, 1],
…,
[ 4, 10, 4, …, 4, 11, 6],
[ 4, 10, 6, …, 4, 10, 7],
[11, 6, 12, …, 6, 2, 1]], device='cuda:0'),
TensorText([[ 5, 4, 11, …, 5, 4, 11],
[12, 4, 5, …, 4, 2, 1],
[ 6, 12, 5, …, 2, 1, 5],
…,
[10, 4, 4, …, 11, 6, 12],
[10, 6, 2, …, 10, 7, 2],
[ 6, 12, 4, …, 2, 1, 4]], device='cuda:0'))

dls.show_batch() # Does not work for some reason

lm_learner = language_model_learner(dls, AWD_LSTM, drop_mult=1, pretrained=False, metrics = [accuracy])

Training…

lm_learner.get_preds()

(tensor([[[1.1500e-04, 1.8900e-03, 2.2046e-03, …, 8.6352e-04,
7.5570e-04, 1.1233e-03],
[2.7695e-05, 6.8809e-04, 1.2487e-02, …, 5.8920e-04,
4.8748e-04, 8.8781e-04], …
[1.0859e-04, 3.2776e-03, 1.3802e-01, …, 2.9777e-03,
2.5926e-03, 3.3626e-03]]]),
TensorText([[ 4, 10, 4, …, 4, 10, 2],
[ 1, 7, 4, …, 7, 12, 4],
[ 7, 2, 1, …, 5, 4, 11],
…,
[ 4, 11, 7, …, 10, 2, 1],
[ 5, 8, 4, …, 8, 5, 2],
[ 1, 4, 4, …, 4, 12, 4]]))

lm_learner.predict('c1ccccc1', 8, temperature=0.75)
TypeError: Subscripted generics cannot be used with class and instance checks

So as you can see, some of the built-in fastai functions seem to work as they should, specifically .one_batch() and .get_preds(). Since those work, I guess there must be some error in the decoding step needed for e.g. .show_batch(), but why .predict() won't work I don't know…

Turns out this was quite easy to fix. I replaced my custom tokens for unk, pad, bos, eos etc. with the fastai tokens (xxunk, xxpad, xxbos, xxeos). Then I wrapped my custom tokenizer in the Tokenizer class and passed it rules=[]. This allowed show_batch, get_preds and predict to work properly! To get the padding token onto the correct index I also didn't pass a vocab but let fastai build it itself.

dblock = DataBlock(blocks=TextBlock(tok_tfm=Tokenizer(tokenizer.tokenize, rules=[]), is_lm=True),
                   get_x=ColReader(0),            # extract SMILES from the dataframe
                   splitter=RandomSplitter(0.1))  # split the dataset randomly
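
For completeness, the working end-to-end pipeline then looks roughly like this (a sketch reusing the objects from above; the training schedule is just an example):

dls = dblock.dataloaders(data, bs=128, seq_len=15)
dls.show_batch(max_n=4)   # now decodes back to readable SMILES tokens

lm_learner = language_model_learner(dls, AWD_LSTM, drop_mult=1, pretrained=False, metrics=[accuracy])
lm_learner.fit_one_cycle(5, 2e-2)   # example schedule
lm_learner.predict('c1ccccc1', 8, temperature=0.75)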