Tracking function calls in fastai libraries

I’m sorry that this is such an elementary question, but I’ve been staring at this code long enough that I think it’s time to phone a friend. I’m having issues with the tokenizer on a different dataset. What I’ve done so far is to concat a few dataframes into one, select the columns I want (“target” and “text”), and then send that to the TextLMDataBunch.from_df method. Right now the output of that step is a bunch of xxbos tokens, which tells me the tokenizer isn’t doing what I expect it to do.
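Boiled down to a minimal sketch, this is roughly what I’m doing (the real dataframes and path are different; the toy frames below are just stand-ins):

```python
import pandas as pd
from fastai.text import TextLMDataBunch

# Toy stand-ins for the dataframes I'm actually concatenating.
df_a = pd.DataFrame({"target": [0, 1], "text": ["first example", "second example"]})
df_b = pd.DataFrame({"target": [1, 0], "text": ["third example", "fourth example"]})

df = pd.concat([df_a, df_b], ignore_index=True)[["target", "text"]]

# This is the step whose output comes back as mostly `xxbos` for me.
data_lm = TextLMDataBunch.from_df(path=".", train_df=df, valid_df=df)
```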

I’m trying to trace where things go wrong in the source and I can’t pinpoint the call to the tokenizer.

In order:

  • TextLMDataBunch subclasses TextDataBunch
  • TextDataBunch contains the from_df method
  • The from_df method calls _get_processor
  • _get_processor has a call to TokenizeProcessor()
  • TokenizeProcessor() subclasses PreProcessor, which doesn’t do much, and then, if I’m reading the ifnone correctly (line 286 of data.py), calls Tokenizer() (see the toy sketch below)
  • Tokenizer() to me looks like a dead end

In all of this I don’t see a call to the actual tokenizer. I’m hoping that someone can point out where I’m going wrong here.
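In case it helps show what I mean, here’s a toy sketch of the pattern I think I’m looking at (not the actual fastai code, just my mental model of it): the processor stores a tokenizer when it’s constructed and only uses it later, when something calls its process method.

```python
class ToyTokenizer:
    def process_all(self, texts):
        return [t.split() for t in texts]

class ToyTokenizeProcessor:
    def __init__(self, tokenizer=None):
        # Nothing gets tokenized here; a default is just stashed away.
        self.tokenizer = tokenizer if tokenizer is not None else ToyTokenizer()

    def process(self, items):
        # The stored tokenizer is only exercised here, at processing time.
        return self.tokenizer.process_all(items)

print(ToyTokenizeProcessor().process(["hello world", "xxbos everywhere"]))
```

If that mental model is right, then the real question is: what eventually calls process on the stored processor?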

Hi Rajan,

I can’t tell you exactly where the tokenizer is called; maybe someone with a deeper understanding of the fastai library will be able to. But are you actually tracing the code with a debugger? I found it impossible to understand the source code just by reading it. It’s dense with terse, efficient, advanced techniques (subclasses, function pointers, delegation, functions of functions, etc.) without any roadmap. Tracing with a debugger, you can see exactly what happens where, so you both start to understand the library and debug your own use of it.

I use PyCharm; VSCode is also popular.
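Even without a full IDE, Python’s built-in pdb is enough to get started. A minimal sketch, with your own dataframes in place of the placeholders:

```python
from fastai.text import TextLMDataBunch

breakpoint()  # Python 3.7+; equivalent to `import pdb; pdb.set_trace()`
data_lm = TextLMDataBunch.from_df(path=".", train_df=train_df, valid_df=valid_df)
# At the (Pdb) prompt: `s` steps *into* from_df, `n` runs the next line,
# and `w` prints the call stack so you can see where you are inside fastai.
```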

HTH and welcome,
Malcolm

One option: I believe you can run Tokenizer?? in a Jupyter notebook and it should pull up the source code. I haven’t explored where that functionality lives myself, but if we can’t solve it before tonight I’ll look into it.
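Something like this in a notebook cell, if memory serves:

```python
from fastai.text import Tokenizer

Tokenizer?   # signature and docstring
Tokenizer??  # full source, when IPython can locate it
```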

@Pomo, yes, I have just been opening the source files and following the trail there, but you’re right – this is almost impossible with the code as written. I’ve always been able to track back through code this way, but I think I’ll need to give one of those debuggers a try. I saw Jeremy Howard using Vim for this in one of the videos, but I’m an Emacs user.

@muellerzr, my next step in the Tokenizer troubleshooting was going to be calling Tokenizer() directly, but I was still trying to figure out the call syntax. At this point I think I’ll just take your suggestion and try different arguments until it gives me the expected output.

I still think it would be instructive to see where that call is actually executed, since I think there’s something I’m fundamentally missing in the Python code. If anyone has ideas, I’ll be happy to hear them. Otherwise I’ll post back once I’ve tried Tokenizer() as a standalone call.
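For the record, the standalone experiment I have in mind looks roughly like this; the process_all method name is my guess from skimming the class, so treat it as an assumption rather than verified:

```python
from fastai.text import Tokenizer

tok = Tokenizer()  # default tok_func and rules, if I'm reading the signature right
# My assumption: process_all takes a list of strings and returns lists of tokens.
print(tok.process_all(["This is a quick sanity check of the tokenizer."]))
```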

@rajan found it! Line 87 https://github.com/fastai/fastai/blob/2517974dd64f298b0dfb9ca1edb0e4ad54849238/fastai/text/transform.py#L87

It was tucked away in transform :slight_smile:

@muellerzr I was being inarticulate. First, I forgot that Tokenizer is a class that gets instantiated, not a function to be called, as I previously implied. Second, I’m really looking for the path from the TextLMDataBunch.from_df starting point to wherever that Tokenizer() instance actually gets used. The only thing I see is line 286 in fastai/text/data.py:

self.tokenizer,self.chunksize,self.mark_fields = ifnone(tokenizer, Tokenizer()),chunksize,mark_fields

In this case I don’t see self.tokenizer being used anywhere after this point, so to me this looks like a dead end. I’ll try a few more things, including instantiating the Tokenizer class, and post back shortly.
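For what it’s worth, my reading of ifnone (I believe it lives in fastai.core) is that it’s just a null-default helper, so the quoted line only stores a default Tokenizer on the processor without tokenizing anything:

```python
from fastai.core import ifnone  # my assumption about where ifnone is defined

# ifnone(a, b) returns b only when a is None.
print(ifnone(None, "default"))        # -> default
print(ifnone("explicit", "default"))  # -> explicit
```

Which is exactly why it looks like a dead end to me: storing a default isn’t the same as using it.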

A few things here:

  • The issue appears to be that TextDataBunch.from_df has a default of text_cols=1, which controls which column the fastai framework thinks the text is in. In my case this meant the wrong column was used and a useless result was produced.
  • Once I switched the order of my columns (the dumb way to fix it; a sketch of the explicit-argument version is below), it looks like I’m getting something closer to a trained classifier.
  • The call/instantiation I was looking for was actually a few lines below where I had been looking, in TextDataBunch.from_df on line 199 of data.py, where the dataframe work starts (the TextList portion).
  • After the pain of tracking down this issue, the IDE suggestion is clearly a good one, and I’ve started setting up Emacs as an IDE, since it’s already my text editor.
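For completeness, the less dumb fix is presumably to pass the columns explicitly instead of reordering the dataframe. This is just how I read the from_df signature, so take the keyword usage below as my assumption:

```python
from fastai.text import TextLMDataBunch

# train_df / valid_df are whatever dataframes you already have prepared.
data_lm = TextLMDataBunch.from_df(
    path=".",
    train_df=train_df,
    valid_df=valid_df,
    text_cols="text",     # which column holds the text
    label_cols="target",  # which column holds the label
)
```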

Thanks for the help!