'TextList' object has no attribute 'label_from_csv'

phren0logy · April 11, 2019, 12:03am

Hi everyone, I was trying to model some code used in Lesson 3 to load some text data to a classifier, instead of the images in the original example. I guess there are some differences in available functions in the ImageList / ImageFileList objects and the TextList object (there’s apparently no TextFileList object).

My code:

    hate_data_clas = (TextList.from_folder(path, vocab=hate_data_lm.vocab)
             .split_by_folder(valid='test')
             .label_from_csv('./nlp/data/hate-speech-dataset/annotations_metadata.csv')
             .databunch(bs=bs))

So, I’m trying to load in a bunch of text files which are in a weird format: one sentence per file with the same first part of the file name, ending with a _1 to _x for the sentences in the paragraph. Then the per sentence labels are in the annotations_metadata.csv file. I also tried .label_from_df after loading the csv into a dataframe, but that didn’t work either.

If anyone has some suggestions I’d appreciate it. I searched the docs for the Data Block API, but I’m having trouble making much sense of them for this purpose. Any examples using text data in files and labels in a CSV would be greatly appreciated.

Thanks,
Andy

thousfeet · April 11, 2019, 6:23am

There may be no such method, at least I can’t find one in https://docs.fast.ai.
If annotations_metadata.csv contains both the texts and the labels, you can use from_csv to create a TextDataBunch directly. It can be used like this:

    data_lm = TextLMDataBunch.from_csv(path, 'annotations_metadata.csv', bs=bs)
    data_lm.show_batch()

phren0logy · April 11, 2019, 7:39pm

Thanks, the problem is that only the labels are in .csv and the actual text is in files. It looks like it’s possible to mix and match for images but not text?

thousfeet · April 12, 2019, 1:14am

They can be treated the same way because both are ItemLists.
There are many ways to label the inputs, see https://docs.fast.ai/data_block.html#Step-3:-Label-the-inputs. Hope this helps!

phren0logy · April 12, 2019, 1:42am

Thanks thousfeet, I took a look at this but ran into the problem documented with label_from_df:

**Warning:**  This method only works with data objects created with either `from_csv` or `from_df` methods.

Interestingly, this seems to work anyway for images but not for text.

I created my “bunch” of texts from folders rather than a dataframe or csv, so I can’t label the data using the _from_df function. The _empty and _constant options aren’t useful for obvious reasons.

I was trying to figure out how to use label_from_func to pull the labels from the dataframe, working around the warning about not being able to use label_from_df with a folder full of texts. If anybody has any ideas, I’d be grateful!

phren0logy · April 12, 2019, 2:40am

For posterity, here’s how I got it to work:

    import os
    import pandas as pd
    from fastai.text import *  # fastai v1; provides TextList

    df = pd.read_csv('my_labels_by_filename')
    # index the labels by file_id, which is the filename minus .txt
    df2 = df.set_index('file_id')
    hate_data_clas = (TextList.from_folder(path, vocab=hate_data_lm.vocab)
             # grab all the text files in path
             .split_by_folder(valid='test')
             # split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_func(lambda x: df2.loc[os.path.basename(x)[:-4], 'label'])
             .databunch(bs=bs))

The lambda’s argument x is the full PosixPath of the file. os.path.basename(x)[:-4] takes the filename and chops off its last 4 characters, which are “.txt”. Since the filename without “.txt” is the index value in df2, the lookup returns the value in the ‘label’ column of the df2 dataframe for that index.
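
For anyone who wants to check the lookup logic without fastai or pandas installed, here is a minimal sketch of the same idea using only the standard library. It assumes the CSV has file_id and label columns as above; the sample rows, the label values, and the helper names load_labels and label_for are made up for illustration. It uses Path.stem rather than slicing off 4 characters, so it does not depend on the extension length.

```python
import csv
import io
from pathlib import Path

# Hypothetical stand-in for the labels CSV; these rows are invented.
SAMPLE_CSV = """file_id,label
12834217_1,noHate
12834217_2,hate
"""

def load_labels(csv_file):
    """Map each file_id (filename without extension) to its label."""
    return {row["file_id"]: row["label"] for row in csv.DictReader(csv_file)}

labels = load_labels(io.StringIO(SAMPLE_CSV))

def label_for(path):
    """Return the label for a text file, looked up by the filename stem."""
    return labels[Path(path).stem]

print(label_for("train/12834217_2.txt"))  # prints "hate"
```

Since label_from_func hands the function each file’s path, a function like label_for could presumably be passed in place of the lambda, though I have only tried the lambda version above. A file with no row in the CSV raises a KeyError here, which at least makes missing labels obvious.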