I’m sorry this is such an elementary question, but I’ve been staring at this code long enough that I think it’s time to phone a friend. I’m having issues with the tokenizer on a different dataset. What I’ve done so far is to concatenate a few dataframes into one, select the columns I want (“target” and “text”), then pass that to the TextLMDataBunch.from_df method. Right now the output of that step is a bunch of xxbos tokens, which tells me the tokenizer isn’t doing what I expect it to do.
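For reference, my setup is roughly the sketch below. The fastai call itself is commented out (the keyword names there are from my reading of the v1 docs, so treat them as an assumption); the live part is just the pandas prep plus a sanity check I probably should have run first:

```python
import pandas as pd

# Roughly what I did: concatenate several dataframes, then keep
# only the two columns I care about.
df1 = pd.DataFrame({"target": [0, 1], "text": ["first example", "second example"], "extra": [1, 2]})
df2 = pd.DataFrame({"target": [1, 0], "text": ["third example", "fourth example"], "extra": [3, 4]})

df = pd.concat([df1, df2], ignore_index=True)[["target", "text"]]

# Sanity check before tokenizing: if the text column is empty or wrong,
# the tokenized output collapses to little more than the xxbos markers.
assert df["text"].str.len().gt(0).all()

# The fastai v1 call (argument names from my reading of the docs; adjust as needed):
# data_lm = TextLMDataBunch.from_df(path, train_df=df, valid_df=valid_df,
#                                   text_cols="text", label_cols="target")
```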
I’m trying to trace where things go wrong in the source and I can’t pinpoint the call to the tokenizer.
- TextDataBunch contains the from_df method
- from_df calls _get_processor
- _get_processor has a call to TokenizeProcessor()
- TokenizeProcessor subclasses PreProcessor, which doesn’t do much, and then (if I’m reading the ifnone correctly, line 286 of data.py) calls Tokenizer(), which to me looks like a dead end
In all of this I don’t see a call to the actual tokenizer. I’m hoping that someone can point out where I’m going wrong here.
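To sanity-check my reading of that ifnone, here’s a minimal stand-in (not fastai’s exact code, just how I understand the helper to behave):

```python
def ifnone(a, b):
    # fastai-style helper: use `a` unless it is None, otherwise fall back to `b`
    return b if a is None else a

# So on that line 286, self.tokenizer would become the Tokenizer() default
# whenever no explicit tokenizer argument was passed in.
class FakeTokenizer:
    pass

chosen = ifnone(None, FakeTokenizer())            # falls back to the default
assert isinstance(chosen, FakeTokenizer)

explicit = FakeTokenizer()
assert ifnone(explicit, FakeTokenizer()) is explicit  # an explicit argument wins
```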
I can’t tell you exactly where the tokenizer is called - maybe someone with a deeper understanding of fastai will be able to. But are you actually tracing the code with a debugger? I found it impossible to understand the source code by reading it directly. It’s dense with terse, efficient, advanced techniques - subclasses, function pointers, delegations, functions of functions, etc. - without any roadmap. Tracing with a debugger, you can see exactly what happens where, so you can both start to understand the library and debug your use of it.
I use PyCharm; VSCode is also popular.
HTH and welcome,
One option: I believe you can run Tokenizer?? in a Jupyter notebook and it should pull up the source code. I haven’t explored where that functionality lives myself, but if we can’t solve it before tonight I’ll look into it.
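If you’re not in a notebook, the standard library’s inspect module does much the same thing as ??. Shown here on a stdlib function rather than fastai’s Tokenizer, just to keep the example self-contained:

```python
import inspect
import json

# In Jupyter, `Tokenizer??` shows the source; in plain Python,
# inspect.getsource does the same job.
src = inspect.getsource(json.dumps)
print(src.splitlines()[0])  # the `def dumps(...)` line

# For fastai you would do the equivalent, e.g.:
# from fastai.text import Tokenizer
# print(inspect.getsource(Tokenizer))
```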
@Pomo, yes, I have just been opening the source files and following the trail there, but you’re right - this is almost impossible with the code as written. I’ve always been able to track back through code this way, but I think I’ll need to give one of those debuggers a try. I saw Jeremy Howard using Vim for this in one of the videos, but I’m an Emacs user.
@muellerzr, my next step in the Tokenizer troubleshooting was going to be calling Tokenizer() directly, but I was trying to figure out the call syntax first. At this point I think I’ll just take your suggestion and try different arguments until it gives me the expected output.
I still think it would be instructive to see where that call is actually executed, since I think there’s something I’m fundamentally missing in the Python code. If anyone has ideas, I’d be happy to hear them. Otherwise I’ll post again once I’ve tried Tokenizer() as a standalone call.
@muellerzr, I was being inarticulate. First, I forgot that Tokenizer is a class to be instantiated, not a function to be called, as I previously implied. Second, I’m really looking for the path by which the instantiated Tokenizer() actually gets used, starting from TextLMDataBunch.from_df. The only thing I see is line 286 in fastai/text/data.py:
self.tokenizer,self.chunksize,self.mark_fields = ifnone(tokenizer, Tokenizer()),chunksize,mark_fields
In this case I don’t see self.tokenizer being used anywhere after this point, so to me it looks like a dead end. I’ll try a few more things, including instantiating the Tokenizer class directly, and post back shortly.
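Meanwhile, here’s a toy sketch (emphatically not fastai’s actual code) of why I suspect I’m seeing nothing but xxbos: if the default rules prepend a beginning-of-stream marker to every text, then empty or missing texts would come back as just that marker:

```python
BOS = "xxbos"

def toy_tokenize(texts):
    # Mimics one detail of the pipeline as I understand it: each text gets
    # a BOS marker prepended before its per-word tokens.
    return [[BOS] + t.split() for t in texts]

print(toy_tokenize(["hello world"]))   # a BOS marker plus the word tokens
print(toy_tokenize(["", ""]))          # empty texts leave only the markers
```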