LanguageModelData with a CSV file

KevinB · November 22, 2017, 2:02am

Is there a way to use LanguageModelData using csv instead of files?

@hiromi and I are using lesson4-imdb as a template to build a sentiment analysis predictor and we are at the point where we are trying to convert

md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

to fit our needs, and this uses actual files for the FILES variable. So I was wondering if there is a different class we should be using, or if there is a way to turn our data into the type of argument that it is looking for. Here are the available arguments:

LanguageModelData(path, field, train, validation, test=None, bs=64, bptt=70, **kwargs)

wluo · November 22, 2017, 2:09am

@KevinB No problem. You should be able to use LanguageModelData to read in .csv files, e.g.:

PATH='data/spooky/'
FILES = dict(train='train.csv', validation='test.csv', test='test.csv')
md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=5)

jeremy · November 22, 2017, 2:11am

I doubt that’s going to do what @KevinB wants, although it’s a little hard to know, since we really need to find out the structure of those CSV files. Can you provide an example of a couple of rows?

KevinB · November 22, 2017, 2:13am

It is the Predict the Happiness Challenge. Here is my head of the train:

 	User_ID 	Description 	Browser_Used 	Device_Used 	Is_Response
0 	id10326 	The room was kind of clean but had a VERY stro... 	Edge 	Mobile 	not happy
1 	id10327 	I stayed at the Crown Plaza April -- - April -... 	Internet Explorer 	Mobile 	not happy
2 	id10328 	I booked this hotel through Hotwire at the low... 	Mozilla 	Tablet 	not happy
3 	id10329 	Stayed here with husband and sons on the way t... 	InternetExplorer 	Desktop 	happy
4 	id10330 	My girlfriends and I stayed here to celebrate ... 	Edge 	Tablet 	not happy

hiromi · November 22, 2017, 2:14am

jeremy · November 22, 2017, 2:14am

I’d suggest just grabbing the ‘Description’ column and saving the lines into a file, then use it just like in class.
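
A minimal sketch of that suggestion, assuming pandas and the column names from the sample above. The helper name save_text_column and the output file name are hypothetical, and the small DataFrame stands in for the real CSV. Whitespace inside a review is collapsed so that each description stays on one line.

```python
import os
import tempfile

import pandas as pd


def save_text_column(df, column, out_path):
    """Write one row of `column` per line of a plain-text file (hypothetical helper)."""
    # Collapse runs of whitespace, including newlines, so each row stays on one line.
    lines = df[column].astype(str).str.split().str.join(" ")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)


# Stand-in for the real data; in practice something like df = pd.read_csv(...)
df = pd.DataFrame({
    "User_ID": ["id10326", "id10329"],
    "Description": [
        "The room was kind of clean\nbut had a strong smell",
        "Stayed here with husband and sons",
    ],
    "Is_Response": ["not happy", "happy"],
})

# A temporary directory is used here; point this at your own PATH instead.
out_path = os.path.join(tempfile.mkdtemp(), "train.txt")
n_written = save_text_column(df, "Description", out_path)
```

The resulting text file can then be referenced from the FILES dict in the same way as the files used in class.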

jamesrequa · November 22, 2017, 2:16am

I was just about to ask the same question. For the IMDB notebook we have all of the text reviews as separate text files, so we are using the function texts_from_files. Do you think it would make sense to have another function, texts_from_csv, to handle data like this more easily? That would be similar to how we handle image files with from_csv vs. from_paths.

KevinB · November 22, 2017, 2:23am

Would you recommend this method:

np.savetxt(r'c:\data\np.txt', df.values, fmt='%s')

jeremy · November 22, 2017, 2:23am

Yeah I think it would be a reasonable thing to add. Or maybe even texts_from_df ?
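
As a rough illustration of what such a helper might look like: texts_from_df does not exist in the library as far as this thread shows, so the name, the signature, and the (texts, labels) return shape below are all assumptions.

```python
import numpy as np
import pandas as pd


def texts_from_df(df, text_col, label_col, classes):
    """Hypothetical helper: return (texts, labels) read from DataFrame columns.

    `classes` lists the label names; each label is mapped to its index in that list.
    """
    label_to_idx = {name: i for i, name in enumerate(classes)}
    texts = df[text_col].astype(str).tolist()
    labels = np.array([label_to_idx[v] for v in df[label_col]], dtype=np.int64)
    return texts, labels


# Stand-in rows shaped like the sample posted earlier in the thread.
df = pd.DataFrame({
    "Description": ["The room was kind of clean", "Stayed here with husband and sons"],
    "Is_Response": ["not happy", "happy"],
})
texts, labels = texts_from_df(df, "Description", "Is_Response", ["not happy", "happy"])
```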

jeremy · November 22, 2017, 2:24am

Seems fine - although just grab the column you need, of course.

jamesrequa · November 22, 2017, 5:02am

@KevinB just FYI, you can actually use the pandas to_csv function to write the text out as .txt files instead of saving a CSV. I found this a little easier to work with than np.savetxt. The code below worked well for me, assuming the id is in the first column and the text string is in the second column.

for _, row in train.iterrows():
    # row.iloc[0] is the id, row.iloc[1] is the text; write one .txt file per row
    pd.DataFrame([row.iloc[1]]).to_csv(TRN + str(row.iloc[0]) + ".txt", header=False, index=False)

KevinB · November 22, 2017, 5:09am

I actually didn’t end up using np.savetxt. Just followed @hiromi’s lead and used straight up python like this:

for i in range(trn.values.shape[0]):
    # Column 0 holds the id, column 1 holds the description.
    with open(PATH+"train/"+trn.values[i,0]+".txt", 'w') as f:
        f.write(trn.values[i,1])

So all this does is save each description into a file called id#####.txt. It worked pretty well and it seems to be what LanguageModelData expects to get.

Tchotchke · February 2, 2018, 6:40pm

I found that reading in the csv with pandas and then using LanguageModelData.from_dataframes worked well. It saved a lot of time because I didn’t have to write each individual file to disk, which was taking a while.
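
The data preparation for that approach can be sketched with pandas alone. The helper name split_train_val, the 80/20 split, and the seed are assumptions for illustration. The final call is left as a comment because the exact signature of from_dataframes is not shown in this thread; the argument order there is an assumption to check against your installed fastai version.

```python
import pandas as pd


def split_train_val(df, val_frac=0.2, seed=42):
    """Hypothetical helper: randomly hold out `val_frac` of the rows for validation."""
    val_df = df.sample(frac=val_frac, random_state=seed)
    trn_df = df.drop(val_df.index)
    return trn_df.reset_index(drop=True), val_df.reset_index(drop=True)


# Stand-in for the real data; in practice something like df = pd.read_csv(...)
df = pd.DataFrame({"Description": [f"review number {i}" for i in range(10)]})
trn_df, val_df = split_train_val(df)

# Assumed call shape (verify before use): path, field, text column name,
# training frame, validation frame, then the usual keyword arguments.
# md = LanguageModelData.from_dataframes(PATH, TEXT, 'Description', trn_df, val_df,
#                                        bs=bs, bptt=bptt, min_freq=10)
```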