How to pass DF text column name in TextLMDataBunch->Create() function

vladgets · April 24, 2019, 7:24am

Is there a way to tell the TextLMDataBunch->Create() function which column of my data frame contains the text?
I saw that TextDataBunch->from_df() has a “text_cols” parameter, but I do not see such a parameter in the Create() function.

sgugger · April 24, 2019, 12:31pm

TextLMDataBunch is a subclass of TextDataBunch so it gets all its factory methods.
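
To illustrate the mechanism in plain Python (a minimal sketch only; `BaseBunch`, `LMBunch` and `from_rows` are hypothetical stand-ins, not fastai's actual code): a classmethod defined on a base class can be called on a subclass, and it builds an instance of that subclass.

```python
# Hypothetical stand-ins for illustration; these are NOT the real fastai classes.
class BaseBunch:
    def __init__(self, texts):
        self.texts = texts

    @classmethod
    def from_rows(cls, rows, text_cols="text"):
        # cls is whichever class the method was called on,
        # so a subclass gets an instance of itself back.
        return cls([row[text_cols] for row in rows])


class LMBunch(BaseBunch):
    # Defines no factory method of its own; it inherits from_rows.
    pass


rows = [{"comment_text": "first"}, {"comment_text": "second"}]
bunch = LMBunch.from_rows(rows, text_cols="comment_text")
print(type(bunch).__name__, bunch.texts)
```

In the same way, from_df() and its text_cols parameter can be reached through TextLMDataBunch.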

vladgets · April 24, 2019, 6:14pm

@sgugger Thanks, it helped!

Now a follow-up question about the subsequent text classification: I need to classify between 2 classes, and currently one of my DF columns contains a probability (a floating-point number between 0 and 1).
How do I initialize TextDataBunch for classification?
Specifically, what should I put in the label_cols, label_delim and classes params of the TextDataBunch.from_df() function? What type of data does it expect in the label column?

sgugger · April 24, 2019, 6:17pm

You would need to use a TextClasDataBunch, then the documentation is here to help.

vladgets · April 24, 2019, 8:52pm

Thanks, but I do not see in the documentation what the format of the label field should be.
For 2-class classification, should I convert the field holding the probability of belonging to the positive class into a field that contains the class name, positive or negative, as in the IMDB lesson 3 example (‘pos’, ‘neg’)?
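
If converting is the way to go, I imagine it would look roughly like the sketch below. The 0.5 cutoff and the helper name `prob_to_label` are my own assumptions, not something taken from the documentation.

```python
def prob_to_label(prob, threshold=0.5):
    """Map a probability in [0, 1] to a class name.

    The 0.5 threshold is an assumption; pick whatever cutoff fits the task.
    """
    return "pos" if prob >= threshold else "neg"


# Toy probabilities standing in for the values of the DF column.
probs = [0.0, 0.2, 0.5, 0.93]
labels = [prob_to_label(p) for p in probs]
print(labels)

# On a pandas DataFrame this could be applied column-wise, e.g.
# df["label"] = df["prob"].apply(prob_to_label)   # "prob" is a hypothetical column name
```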

vladgets · April 29, 2019, 9:27pm

@sgugger Thanks a lot for the help!
It seems that I am making progress in my attempt to get fast.ai transfer learning working on this Kaggle competition.
But the function:

data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small, text_cols='comment_text')

seems to be very slow: on the full training set it alone takes about 30 minutes.
I have not profiled it, but my assumption is that it happens because of the slow spaCy tokenizer.
With the standard PyTorch tokenizer in other kernels it takes no more than 2-3 minutes.
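
To check that assumption rather than guess, the call could be wrapped in Python's built-in cProfile. Below is a minimal sketch; `profile_call` and `slow_tokenize` are hypothetical helpers I made up for illustration, with `slow_tokenize` standing in for the real expensive step.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = fn(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the 10 entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def slow_tokenize(texts):
    # Hypothetical stand-in for the step being measured.
    return [t.lower().split() for t in texts]


tokens, report = profile_call(slow_tokenize, ["Hello world", "Some Comment Text"])
print(report)
```

For the real case, the from_df() call above would go in place of `slow_tokenize`, for example via a lambda. If the tokenization runs in separate worker processes, the report will mostly show the main process waiting, so the numbers should be read with that in mind.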