I’m sorry this is such an elementary question, but I’ve been staring at this code long enough that I think it’s time to phone a friend. I’m having issues with the tokenizer on a different dataset. What I’ve done so far is to concatenate a few dataframes into one, select the columns I want (“target” and “text”), then pass that to the TextLMDataBunch.from_df method. Right now the output of that step is a bunch of xxbos tokens, which tells me the tokenizer isn’t doing what I expect it to do.
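For reference, my setup is roughly the sketch below. The fastai call itself is commented out (the keyword names there are from my reading of the v1 docs, so treat them as an assumption); the live part is just the pandas prep plus a sanity check I probably should have run first:

```python
import pandas as pd

# Roughly what I did: concatenate several dataframes, then keep
# only the two columns I care about.
df1 = pd.DataFrame({"target": [0, 1], "text": ["first example", "second example"], "extra": [1, 2]})
df2 = pd.DataFrame({"target": [1, 0], "text": ["third example", "fourth example"], "extra": [3, 4]})

df = pd.concat([df1, df2], ignore_index=True)[["target", "text"]]

# Sanity check before tokenizing: if the text column is empty or wrong,
# the tokenized output collapses to little more than the xxbos markers.
assert df["text"].str.len().gt(0).all()

# The fastai v1 call (argument names from my reading of the docs; adjust as needed):
# data_lm = TextLMDataBunch.from_df(path, train_df=df, valid_df=valid_df,
#                                   text_cols="text", label_cols="target")
```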
I’m trying to trace where things go wrong in the source and I can’t pinpoint the call to the tokenizer.
- TextDataBunch contains the from_df method
- from_df calls _get_processor
- _get_processor has a call to TokenizeProcessor()
- TokenizeProcessor subclasses PreProcessor, which doesn’t do much, and then (if I’m reading the ifnone correctly, line 286 of data.py) calls Tokenizer(), which to me looks like a dead end
In all of this I don’t see a call to the actual tokenizer. I’m hoping that someone can point out where I’m going wrong here.
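To sanity-check my reading of that ifnone, here’s a minimal stand-in (not fastai’s exact code, just how I understand the helper to behave):

```python
def ifnone(a, b):
    # fastai-style helper: use `a` unless it is None, otherwise fall back to `b`
    return b if a is None else a

# So on that line 286, self.tokenizer would become the Tokenizer() default
# whenever no explicit tokenizer argument was passed in.
class FakeTokenizer:
    pass

chosen = ifnone(None, FakeTokenizer())            # falls back to the default
assert isinstance(chosen, FakeTokenizer)

explicit = FakeTokenizer()
assert ifnone(explicit, FakeTokenizer()) is explicit  # an explicit argument wins
```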
I can’t tell you exactly where the tokenizer is called - maybe someone with a deeper understanding of fastai will be able to. But are you actually tracing the code with a debugger? I found it impossible to understand the source code by reading it directly. It’s dense with terse, efficient, advanced techniques - subclasses, function pointers, delegations, functions of functions, etc. - without any roadmap. Tracing with a debugger, you can see exactly what happens where, so you can both start to understand the library and debug your use of it.
I use PyCharm; VSCode is also popular.
HTH and welcome,
One option: I believe you can run Tokenizer?? in a Jupyter notebook and it should pull up the source code. I haven’t explored where that functionality lives myself, but if we can’t solve it before tonight I’ll look into it.
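If you’re not in a notebook, the standard library’s inspect module does much the same thing as ??. Shown here on a stdlib function rather than fastai’s Tokenizer, just to keep the example self-contained:

```python
import inspect
import json

# In Jupyter, `Tokenizer??` shows the source; in plain Python,
# inspect.getsource does the same job.
src = inspect.getsource(json.dumps)
print(src.splitlines()[0])  # the `def dumps(...)` line

# For fastai you would do the equivalent, e.g.:
# from fastai.text import Tokenizer
# print(inspect.getsource(Tokenizer))
```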
@Pomo, yes, I have just been opening the source files and following the trail there, but you’re right - this is almost impossible with the code as written. I’ve always been able to track back through code this way, but I think I’ll need to give one of those debuggers a try. I saw Jeremy Howard using Vim for this in one of the videos, but I’m an Emacs user.
@muellerzr, my next step in the Tokenizer troubleshooting was going to be calling Tokenizer() directly, but I was trying to figure out the call syntax first. At this point I think I’ll just take your suggestion and try different arguments until it gives me the expected output.
I still think it would be instructive to see where that call is actually executed, since I think there’s something I’m fundamentally missing in the Python code. If anyone has ideas, I’d be happy to hear them. Otherwise I’ll post again once I’ve tried Tokenizer() as a standalone call.
@muellerzr, I was being inarticulate. First, I forgot that Tokenizer is a class to be instantiated, not a function to be called, as I previously implied. Second, I’m really looking for the path by which the instantiated Tokenizer() actually gets used, starting from TextLMDataBunch.from_df. The only thing I see is line 286 in fastai/text/data.py:
self.tokenizer,self.chunksize,self.mark_fields = ifnone(tokenizer, Tokenizer()),chunksize,mark_fields
In this case I don’t see self.tokenizer being used anywhere after this point, so to me it looks like a dead end. I’ll try a few more things, including instantiating the Tokenizer class directly, and post back shortly.
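Meanwhile, here’s a toy sketch (emphatically not fastai’s actual code) of why I suspect I’m seeing nothing but xxbos: if the default rules prepend a beginning-of-stream marker to every text, then empty or missing texts would come back as just that marker:

```python
BOS = "xxbos"

def toy_tokenize(texts):
    # Mimics one detail of the pipeline as I understand it: each text gets
    # a BOS marker prepended before its per-word tokens.
    return [[BOS] + t.split() for t in texts]

print(toy_tokenize(["hello world"]))   # a BOS marker plus the word tokens
print(toy_tokenize(["", ""]))          # empty texts leave only the markers
```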