2 Questions about TextDataLoaders.from_csv

ASEES15 · August 5, 2021, 9:16pm

As I understand it, I am supposed to use TextDataLoaders for NLP. My files are in Google Drive, so I’ve run the following:

path = Path('/content/gdrive/My Drive/AI_Folder/')

I believe the above is the correct way to pull my dataset(s) from Google Drive (a Google Sheets file).

Then I am writing the following to make the dataloader (that I will then put into the learner):

dls = TextDataLoaders.from_csv(path=path, csv_fname='Dataset.csv', text_col='Text', label_col=' ', valid_col=' ')

I have left label_col and valid_col blank. I have 2 questions about them:

How do I specify multiple label columns?
What is valid_col asking for? I thought I could specify a percent of my dataset to be used for validation?

meanpenguin · August 5, 2021, 11:31pm

If you are doing multi-category classification, then the label column should contain space-separated labels.

valid_col should name a column that contains 0 for training rows and 1 for validation rows.
You can instead pass valid_pct=0.2 to get a 20% validation set.

See this for other arguments
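
To make the difference between the two options concrete, here is a minimal plain-Python sketch. It is not fastai's own code; the function names random_split and column_split are made up for illustration, and the flags are assumed to be integers already (values read straight from a CSV would be strings and need converting first).

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=None):
    """Shuffle the row indices and hold out the first valid_pct of them."""
    rng = random.Random(seed)
    idxs = list(range(n_rows))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_rows)
    # (training indices, validation indices)
    return idxs[cut:], idxs[:cut]

def column_split(valid_flags):
    """Split on a per-row flag: 0 means training, 1 means validation."""
    train = [i for i, flag in enumerate(valid_flags) if not flag]
    valid = [i for i, flag in enumerate(valid_flags) if flag]
    return train, valid

train_idxs, valid_idxs = random_split(10, valid_pct=0.2, seed=0)
fixed_train, fixed_valid = column_split([0, 0, 1, 0, 1])
```

With valid_pct the rows are picked at random for you; with a flag column you decide row by row which ones are held out.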

ASEES15 · August 6, 2021, 12:59am

Thank you for such a swift answer! I hope you don’t mind my asking for clarification:

Will space separating the labels train the model to identify each label independently?
For example: let’s say I have the space-separated labels “happy friendly optimistic” for a text. Would the model interpret them as 3 independent labels (“happy”, “friendly”, and “optimistic”) or as 1 label (“happy friendly optimistic”)? I would like it to treat the labels as independent, because otherwise I would need orders of magnitude more data.

Does this mean that I can specify valid_pct=0.2 instead of valid_col, without having to go through my data and manually label 0 for training data and 1 for validation data? I would love to use valid_pct instead of valid_col.

Thanks again for being such a help. I really, really appreciate it.

meanpenguin · August 6, 2021, 1:58am

If you use the MultiCategoryBlock, then yes, they will be evaluated independently.

You can just use valid_pct=0.2 and it will randomly pick 20% of the rows for the validation set. Just don’t specify the valid_col argument.
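
As a rough illustration of what "evaluated independently" means for space-separated labels, here is a small plain-Python sketch. It only mimics the idea of one yes/no target per label and is not fastai's implementation; build_vocab and multi_hot are hypothetical helper names, and the example rows are invented.

```python
def build_vocab(label_strings, delim=" "):
    """Collect every distinct label that appears in any row."""
    labels = set()
    for s in label_strings:
        labels.update(s.split(delim))
    return sorted(labels)

def multi_hot(label_string, vocab, delim=" "):
    """One 0/1 slot per vocab entry: 1 if the row carries that label."""
    present = set(label_string.split(delim))
    return [1 if label in present else 0 for label in vocab]

rows = ["happy friendly optimistic", "friendly", "optimistic happy"]
vocab = build_vocab(rows)
encoded = [multi_hot(r, vocab) for r in rows]
```

Each label gets its own slot, so "happy friendly optimistic" is three separate targets rather than one combined class, and the order of the labels within a row does not change the encoding.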

daveramseymusic · August 6, 2021, 3:41pm

Thank you both @meanpenguin and @ASEES15 . This was incredibly helpful for me today.

ASEES15 · August 6, 2021, 4:18pm

So if I understand you correctly, the following is correct?

dls = TextDataLoaders.from_csv(path=path, csv_fname='Dataset.csv', text_col='Text', label_col='Labels', label_delim=' ', y_block=MultiCategoryBlock, valid_pct=0.2)

Here’s what I believe is occurring:

label_delim=' '

I believe this denotes that the labels are space-separated?

y_block=MultiCategoryBlock

I believe this denotes that there are multiple, independent labels? I also believe the independent labels can thus be in any order, as long as they appear correctly in the data?

valid_pct=0.2

I’m using this instead of valid_col, as you told me I could.

ASEES15 · August 6, 2021, 4:19pm

I’m glad someone else could benefit from my questions

meanpenguin · August 6, 2021, 6:03pm

There is a lot of documentation, and there are many tutorials, that you can leverage