I’m relatively new to the fastai library, but overall it seems very intuitive, with a lot of great tools. One thing I can’t figure out is the difference between creating a TextDataBunch from a CSV and creating it from a data frame.
What I don’t understand is why the from_df method requires explicitly declaring the train_df and valid_df arguments, while the from_csv method does not. Does the CSV require special formatting within the file that distinguishes between the training and validation set? I’m also unsure of how this affects the modeling process down the road.
For example, the documentation for TextLMDataBunch indicates that all labels are ignored and the target is the next word in the sentence. For a process like this, is a distinction between training and validation sets necessary? Is it possible to simply feed it a bunch of text in one data frame to build a more domain-specific language model? Or would I have to use a CSV for that?
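To make the single-data-frame case concrete, here is a minimal sketch of splitting one frame into the two pieces that from_df asks for. It uses only pandas; the helper name split_train_valid, the "text" column, the 20% validation share, and the sample corpus are all made up for illustration. The commented-out fastai call at the end is an assumption about how the result would be passed on in fastai v1, not something verified here.

```python
import pandas as pd


def split_train_valid(df, valid_pct=0.2, seed=42):
    """Shuffle df and split it into (train_df, valid_df).

    Hypothetical helper: valid_pct is the fraction of rows held out
    for validation, seed makes the shuffle reproducible.
    """
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_valid = int(len(shuffled) * valid_pct)
    valid_df = shuffled.iloc[:n_valid].reset_index(drop=True)
    train_df = shuffled.iloc[n_valid:].reset_index(drop=True)
    return train_df, valid_df


# Illustrative corpus: one column of raw text, no labels.
corpus = pd.DataFrame(
    {"text": [f"domain sentence number {i}" for i in range(10)]}
)
train_df, valid_df = split_train_valid(corpus, valid_pct=0.2)

# Assumption: with fastai v1 installed, the two frames could then be
# handed to the language-model data bunch along these lines:
# from fastai.text import TextLMDataBunch
# data_lm = TextLMDataBunch.from_df(
#     path=".", train_df=train_df, valid_df=valid_df, text_cols="text"
# )
```

Even though a language model ignores labels, a held-out validation frame still gives a loss on text the model has not trained on, which is why the split is sketched here rather than skipped.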