TextDataBunch from_csv vs from_df

mlehew · September 3, 2019, 6:06pm

I’m relatively new to the fastai library, but overall it seems very intuitive, with a lot of great tools. One thing I can’t figure out is the difference between creating a TextDataBunch from a CSV and creating it from a data frame.

What I don’t understand is why the from_df method requires explicitly passing the train_df and valid_df arguments, while the from_csv method does not. Does the CSV require special formatting within the file that distinguishes between the training and validation sets? I’m also unsure how this affects the modeling process down the road.

For example, the documentation for TextLMDataBunch indicates that all labels are ignored and the target would be the next word in the sentence. For a process like this, is a differentiation between training and validation sets necessary? Is it possible to simply feed it a bunch of text in one dataframe to establish a more domain-specific language model? Or would I have to use a CSV for that?
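
To make the single-dataframe case concrete, here is a minimal sketch of splitting one data frame into training and validation frames by hand, which is the shape of input from_df asks for. The helper name split_df, the valid_pct default of 0.2, the seed, and the toy column names are all assumptions for illustration, not fastai API; the fastai call itself appears only as a comment, with a hypothetical path.

```python
import pandas as pd


def split_df(df, valid_pct=0.2, seed=42):
    """Shuffle df and split it into (train_df, valid_df).

    Hypothetical helper, not part of fastai.
    """
    # Shuffle the rows reproducibly, then cut off the first valid_pct as validation.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_valid = int(len(shuffled) * valid_pct)
    valid_df = shuffled.iloc[:n_valid].reset_index(drop=True)
    train_df = shuffled.iloc[n_valid:].reset_index(drop=True)
    return train_df, valid_df


# Toy corpus: the label column is a placeholder, since a language model
# is described as ignoring labels.
df = pd.DataFrame({
    "label": [0] * 10,
    "text": [f"sentence number {i}" for i in range(10)],
})
train_df, valid_df = split_df(df, valid_pct=0.2)
print(len(train_df), len(valid_df))

# With fastai v1 installed, the two frames could then be passed along
# these lines (path is a hypothetical working directory):
# data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols="text")
```

Holding out even a small validation frame like this gives a way to check whether the language model is overfitting the domain text, so it is probably worth keeping the split even though the labels are unused.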

J.J · December 31, 2019, 12:50am

Did you ever find an answer to this? I’m running into the same confusion.

mlehew · January 1, 2020, 9:56pm

Unfortunately, I have not. I ended up changing my workflow to import the TextLMDataBunch from a CSV.

J.J · January 6, 2020, 7:55am

Thanks anyway for replying.