Hi!
I'm currently trying to build a language model for SMILES strings (molecular structures in text form) and I have a custom tokenizer for such strings. An example string would be e.g. c1ccccc1, corresponding to benzene. My tokenizer basically splits the string into its atoms and special characters (such as double bonds "=" etc.). My problem is fitting this into the DataBlock structure and getting the nice .show_batch() function to work. There also seems to be an error in Learner.predict() after training, so I'm unsure whether my training has even done what I want it to do…
The .show_batch() error resolves itself and at least works when wrapping my tokenizer in the fastai Tokenizer() class, but that won't work as a solution since my SMILES problem is case-sensitive (a "c" is different from a "C"), and together with my custom tokenizer it splits e.g. "xxbos" from the Tokenizer wrapper into "x x b o s" etc. I already have my tokenizer add a bos token and an eos token ("^" and "&"), so I really shouldn't need the Tokenizer wrapper…
Here’s an example of the tokenizer in action:
tokens = tokenizer.tokenize(data.SMILES)
first(tokens)
['^', 'C', '=', 'C', '&']
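For reference, a minimal regex-based atom tokenizer along these lines (the function name, the exact atom pattern, and the keyword arguments are my assumptions, not your implementation) could look like:

```python
import re

# Multi-character tokens first (bracket atoms, two-letter elements, '%NN'
# ring closures), then single characters. This is only a sketch; a complete
# SMILES tokenizer needs the full two-letter element list.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|\d|[A-Za-z@#=+\-\(\)\\/&^.])"
)

def tokenize_smiles(smiles, bos="^", eos="&"):
    "Split a SMILES string into atom/bond tokens, adding bos/eos markers."
    return [bos] + SMILES_PATTERN.findall(smiles) + [eos]

print(tokenize_smiles("C=C"))       # ['^', 'C', '=', 'C', '&']
print(tokenize_smiles("c1ccccc1"))  # ['^', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', '&']
```

Listing the multi-character alternatives before the single-character class is what keeps "Cl" from being split into "C" and "l", which matters precisely because of the case sensitivity you mention.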
I've tried to follow the IMDB tutorial in lesson 10. Here's my DataBlock and its summary after feeding it a dataframe of SMILES:
DataBlock
dblock = DataBlock(blocks=TextBlock(tok_tfm=tokenizer.tokenize, vocab=vocab, is_lm=True),
                   get_x=ColReader(0),           # extract SMILES from the df
                   splitter=RandomSplitter(0.1)) # splits the dataset randomly
.summary():
DataBlock.summary(data, bs=4, show_batch=False)
Setting-up type transforms pipelines
Collecting items from SMILES Molar_mass
0 C 16.031300
1 N 17.026549
2 O 18.010565
3 F 20.006228
4 CC 30.046950
… … …
995 OC(CF)C=C 90.048093
996 OC(CF)C=O 92.027358
997 OC(CF)C#N 89.027692
998 CCC(F)C=C 88.068829
999 CCC(F)C=O 90.048093
[1000 rows x 2 columns]
Found 1000 items
2 datasets of sizes 900,100
Setting up Pipeline: ColReader – {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} → SMILESAtomTokenizer.tokenize → Numericalize
Building one sample
Pipeline: ColReader – {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} → SMILESAtomTokenizer.tokenize → Numericalize
starting from
SMILES CC(F)(F)CN
Molar_mass 95.054656
Name: 482, dtype: object
applying ColReader – {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} gives
CC(F)(F)CN
applying SMILESAtomTokenizer.tokenize gives
CC(F)(F)CN
applying Numericalize gives
TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5])
Final sample: (TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5]),)
Collecting items from SMILES Molar_mass
0 C 16.031300
1 N 17.026549
2 O 18.010565
3 F 20.006228
4 CC 30.046950
… … …
995 OC(CF)C=C 90.048093
996 OC(CF)C=O 92.027358
997 OC(CF)C#N 89.027692
998 CCC(F)C=C 88.068829
999 CCC(F)C=O 90.048093
[1000 rows x 2 columns]
Found 1000 items
2 datasets of sizes 900,100
Setting up Pipeline: ColReader – {'cols': 0, 'pref': '', 'suff': '', 'label_delim': None} → SMILESAtomTokenizer.tokenize → Numericalize
Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline:
Setting up after_batch: Pipeline:
Building one batch
Applying item_tfms to the first sample:
Pipeline: ToTensor
starting from
(TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5]))
applying ToTensor gives
(TensorText([ 4, 4, 11, 7, 12, 11, 7, 12, 4, 5]))
Adding the next 3 samples
No before_batch transform to apply
Collating items in a batch
Error! It's not possible to collate your items in a batch
Could not collate the 0-th members of your tuples because got the following shapes
torch.Size([10]),torch.Size([10]),torch.Size([5]),torch.Size([9])
I believe this last error doesn't matter, since it's also produced in the IMDB example. To me everything looks to be working, since the numericalized tensors have one integer per token from my tokenizer (here, one per character in the string).
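That's also my understanding of why the collate error is harmless for a language model: as far as I can tell, the LM dataloader never collates the ragged samples directly; it concatenates the whole corpus into one stream and slices fixed-length windows, with targets offset by one. A rough pure-Python sketch of that idea (not fastai's actual implementation; `lm_batches` is a made-up name):

```python
def lm_batches(token_streams, seq_len):
    "Concatenate tokenized samples and cut (input, target) windows offset by 1."
    stream = [t for toks in token_streams for t in toks]  # one long token stream
    pairs = []
    # Each window needs seq_len inputs plus one extra token for the shifted target.
    for i in range(0, len(stream) - seq_len, seq_len):
        x = stream[i : i + seq_len]
        y = stream[i + 1 : i + seq_len + 1]
        pairs.append((x, y))
    return pairs

samples = [[5, 4, 11], [4, 6, 4, 8], [12, 5]]
for x, y in lm_batches(samples, seq_len=4):
    print(x, y)
# [5, 4, 11, 4] [4, 11, 4, 6]
# [6, 4, 8, 12] [4, 8, 12, 5]
```

Since every window has the same length, collation is trivial, which would explain why .one_batch() works even though .summary() complains.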
Interestingly, .one_batch() does work, as seen below, and it looks to be working as intended, with the targets offset from the inputs by 1 token…
.one_batch():
dls.one_batch()
(LMTensorText([[ 5, 4, 11, …, 5, 4, 11],
[ 4, 6, 4, …, 4, 12, 11],
[12, 5, 4, …, 8, 4, 4],
…,
[14, 15, 16, …, 11, 7, 12],
[ 4, 11, 4, …, 4, 12, 8],
[12, 7, 4, …, 4, 4, 11]], device='cuda:0'),
TensorText([[ 4, 11, 8, …, 4, 11, 5],
[ 6, 4, 8, …, 12, 11, 7],
[ 5, 4, 8, …, 4, 4, 4],
…,
[15, 16, 17, …, 7, 12, 7],
[11, 4, 12, …, 12, 8, 5],
[ 7, 4, 6, …, 4, 11, 4]], device='cuda:0'))
.predict():
Learner.predict('c1ccccc1', 8, temperature=0.75)
TypeError: Subscripted generics cannot be used with class and instance checks
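Until I get predict() working, one workaround would be to bypass it and sample from the model manually. Below is a minimal temperature-sampling step in plain Python; `sample_next` is a hypothetical helper of my own, not fastai API, and the model forward pass itself is left out:

```python
import math, random

def sample_next(logits, temperature=0.75, rng=None):
    "Sample a token index from raw logits after temperature scaling."
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total                      # walk the cumulative distribution
        if r <= acc:
            return i
    return len(exps) - 1                      # guard against float rounding
```

A generation loop would then repeatedly feed the running token sequence through the model, call sample_next on the final step's logits, append the sampled token, and stop once the eos token "&" is produced.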
To summarize, I guess my question is first whether anyone knows of an example or other thread where somebody has integrated a custom tokenizer with fastai for something that is not your typical LM problem. And further, does anyone have an idea how to get this tokenizer to work with .show_batch(), and why .predict() doesn't work?
Thank you!