Convert validation dataset to dataframe

ohiumliu · June 20, 2019, 4:42am

Hello,
I am wondering if there is a utility to convert a validation or training dataset to a dataframe. For example, my training dataset looks something like this:

LabelList
y: FloatList (29283 items)
[FloatItem -0.6931472, FloatItem -0.51082563, FloatItem 6.1092477, FloatItem 6.9077554, FloatItem 7.090077]...
Path: .
x: CollabList (29283 items)
[CollabLine compound_id CHEMBL135581; target_id P00374; , CollabLine compound_id CHEMBL135581; target_id P00374; , CollabLine compound_id CHEMBL135581; target_id P00374; , CollabLine compound_id CHEMBL135581; target_id P00374; , CollabLine compound_id CHEMBL135581; target_id P00374; ]...
Path: .

How do I convert train_ds.y to a dataframe or NumPy array? And how do I do the same for train_ds.x?

Any suggestions would be really appreciated.

muellerzr · June 20, 2019, 11:03am

Here is some code I used for plot_top_losses on tabular models. You should be able to refactor it for your validation dataframe. Why are you trying to do this, though? To me that seems like a lot of extra work.

msrdinesh · June 20, 2019, 11:35am

list_x = list(data_train_ds.x)
list_y = list(data_train_ds.y)
This directly converts the LabelList and CollabList into normal Python lists. Later you can easily convert them into NumPy arrays or dataframes.

ohiumliu · June 21, 2019, 4:39am

@muellerzr @msrdinesh thank you very much for the suggestions.

I ended up doing the following to generate a list for later processing in a dataframe:
[str(i).split(';')[1] for i in list(Kd_learn.data.valid_ds.x)]
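For reference, a minimal sketch of parsing that string form into one record per row. It assumes each item prints as "CollabLine name value; name value; " as in the first post; the helper name parse_collab_line is hypothetical and not part of fastai, and the strings below are stand-ins rather than real dataset items.

```python
def parse_collab_line(text):
    """Parse 'CollabLine compound_id X; target_id Y; ' into a dict (hypothetical helper)."""
    # Drop the leading class name, if present.
    prefix = "CollabLine "
    if text.startswith(prefix):
        text = text[len(prefix):]
    record = {}
    for field in text.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the trailing ';'
        # Each field is '<column name> <value>'; split on the first space only.
        name, _, value = field.partition(" ")
        record[name] = value
    return record

# Stand-in strings; with fastai these would come from str(item) for each item in valid_ds.x.
lines = [
    "CollabLine compound_id CHEMBL135581; target_id P00374; ",
    "CollabLine compound_id CHEMBL135581; target_id #na#; ",
]
rows = [parse_collab_line(s) for s in lines]
```

Note that split(';')[1] on its own keeps the column name and a leading space (' target_id P00374'), so some further cleanup is needed either way. A list of dicts like rows can be passed straight to pd.DataFrame(rows) to get one column per field.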

I was playing with collaborative filtering. It turns out that if some of the items in the validation set were not seen in the training set, they are labeled as #na#. As a first step, I need to ignore these predictions. I remember this was mentioned in the class, although I don’t remember how to deal with it properly. I also need to get more familiar with the dataloader in order to understand @muellerzr’s code better.
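One way to ignore those predictions is to drop every validation row that contains the #na# placeholder before comparing predictions with targets. This is only a sketch on stand-in data: the rows, the predictions, and the has_unknown helper are made up for illustration and are not fastai API.

```python
UNKNOWN = "#na#"  # placeholder shown for items not seen in the training set

def has_unknown(row):
    """True if any field of the row is the unknown placeholder (hypothetical helper)."""
    return any(value == UNKNOWN for value in row.values())

# Stand-in validation rows and predictions, aligned by position.
rows = [
    {"compound_id": "CHEMBL135581", "target_id": "P00374"},
    {"compound_id": "#na#", "target_id": "P00374"},
    {"compound_id": "CHEMBL135581", "target_id": "#na#"},
]
preds = [0.5, 1.2, -0.3]

# Keep only the (row, prediction) pairs where every id was seen in training.
kept = [(row, pred) for row, pred in zip(rows, preds) if not has_unknown(row)]
```

Filtering rows and predictions together keeps them aligned, so any metric computed afterwards only covers compound/target pairs the model actually saw during training.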

mayankpj · August 11, 2019, 6:31pm

@muellerzr @msrdinesh and @ohiumliu , this might be handy to convert datasets to dataframes directly:

pd.concat(list(learn.data.valid_ds.x[:]),axis=1).T

Just replace valid_ds with valid_dl, train_ds, or test_ds (if you have added a test dataset as well).

This helps with understanding what went wrong (like #na# etc.) with the test data once it has been loaded.
Hope this helps others.
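To see what the pd.concat one-liner does, here is a small sketch with plain pandas Series standing in for the dataset items. It assumes each item in valid_ds.x[:] behaves like a pd.Series holding one row, which may not hold for every fastai item type; the values are made up.

```python
import pandas as pd

# Stand-ins for dataset items: one Series per row (an assumption, see above).
items = [
    pd.Series({"compound_id": "CHEMBL135581", "target_id": "P00374"}),
    pd.Series({"compound_id": "CHEMBL135581", "target_id": "#na#"}),
]

# axis=1 places each Series in its own column; .T flips them back into rows.
df = pd.concat(items, axis=1).T

# Rows containing the '#na#' placeholder can then be located directly.
na_rows = df[(df == "#na#").any(axis=1)]
```

If the items are not Series-like, converting them with list(...) and parsing their string form, as earlier in the thread, is the fallback.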