Lesson 5 - Generating a databunch from 25M labeled .pkl files using conditions

Forgive the question if this is obvious to the people reading. I am reading the data_block docs and struggling to plan an approach for my use case.

I have a series of folders containing pickled dataframes with the ‘1’ column being my input data

         0                        1
0        (1582638057887, 4.8)    0.208
...
1000     (1582632526525, 3.5)    0.286
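To make the setup concrete, here is a minimal sketch (with made-up values and a temp path) of writing and reading back one of these pickled dataframes, then pulling out the ‘1’ column as the model input:

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for one of the pickled dataframes described above:
# column 0 holds (timestamp, value) tuples, column 1 is the input data.
df = pd.DataFrame({
    0: [(1582638057887, 4.8), (1582632526525, 3.5)],
    1: [0.208, 0.286],
})

path = os.path.join(tempfile.mkdtemp(), 'sample.pkl')
df.to_pickle(path)

loaded = pd.read_pickle(path)
inputs = loaded[1]  # the '1' column: a pandas Series of input values
```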

I wish to create a number of classification models (like the one in the lesson 5 SGD notebook) based on various combinations of variables. These combinations may depend on the data in the 0th column meeting a certain condition, or on other data I have encoded in the filename of the pickle.

'309151_2424__100142113_1000_0_0.pkl'

'_100142113_' (continuous variable)
I may wish to create a model that, say, only uses data where this value falls within some range (x < value < y), etc.

'_1000_' (category)
This is related to the way the data was collected; I wish to separate these out.

The last two digits are my label: 0_0, 2_5, etc.
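The filename layout above can be picked apart with plain string splitting. A sketch, assuming the '__' and '_' layout shown (variable names here are my own, not from any library):

```python
# Pull the pieces out of a filename like the example above.
fname = '309151_2424__100142113_1000_0_0.pkl'

stem = fname.rsplit('.pkl', 1)[0]   # drop the extension
head, tail = stem.split('__')       # '309151_2424', '100142113_1000_0_0'
parts = tail.split('_')             # ['100142113', '1000', '0', '0']

continuous = float(parts[0])        # the '_100142113_' continuous variable
category = parts[1]                 # the '_1000_' collection category
label = '_'.join(parts[2:4])        # the final two digits, e.g. '0_0'
```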

My question is basically: what might be my best route to creating a data_block?

My thinking is that I should be able to loop through my folders, loading the DataFrames and appending them to a ‘master’ df based on pre-set conditions, then pickle that and load it in the same way as lesson5_SGD. However, I am worried that the master df will become too big (quickly, or perhaps slowly), as this might not be the most efficient way of approaching this.
I am also not entirely sure what structure that ‘master’ df should have.
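One way to keep the master df from ballooning is to filter each file's rows before concatenating, so only the rows a given model needs are ever held together. A sketch under that assumption; `build_master`, `keep_file`, and `keep_rows` are hypothetical names, and the folder layout (numbered subfolders of pickles) matches the description above:

```python
import os
import pandas as pd

def build_master(basepath, keep_file, keep_rows):
    """Concatenate the pickled dataframes under basepath, keeping only
    files for which keep_file(filename) is True and rows for which
    keep_rows(df) is True."""
    frames = []
    for folder in sorted(os.listdir(basepath)):
        folderpath = os.path.join(basepath, folder)
        for fname in os.listdir(folderpath):
            if not keep_file(fname):            # e.g. test the '_1000_' category
                continue
            df = pd.read_pickle(os.path.join(folderpath, fname))
            df = df[keep_rows(df)].copy()       # drop unwanted rows up front
            # carry the label (last two filename digits) along as a column
            df['label'] = '_'.join(fname.rsplit('.pkl', 1)[0].split('_')[-2:])
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Filtering before `pd.concat` means memory usage tracks the selected subset, not all 25M files at once.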

Any thoughts are much appreciated.

ok, so far…

I have decided that I can restructure my directory so that the first layer separates out the four ‘1000’ categories.
I am currently going through all the filenames, creating a list of all the 100142113 values, with the plan of using pandas qcut to generate boundaries; I will use these as the next layer in my folder structure.
This will let me be selective about which path I use when creating the DataBunch.
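For the qcut-boundaries-as-folders idea, one small sketch (toy values, four bins instead of many): `qcut` with `labels=False` hands back a bin index per value, which can serve directly as the subfolder name for that file.

```python
import pandas as pd

# Toy stand-in for the list of continuous values collected from filenames.
values = pd.Series([12.0, 55.0, 97.0, 140.0, 203.0, 260.0, 301.0, 344.0])

# Equal-frequency bins; binned[i] is the folder index (0..3) for the
# i-th file's value, and `edges` records the boundaries for reuse.
binned, edges = pd.qcut(values, q=4, labels=False, retbins=True)
```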

Lastly, I am going to ignore (for now) the data in the 0th column of my dataframe .pkl's and create a .pkl as a pandas Series? (I am not sure about this; as a list? an array / 1-D tensor?) Just to keep things moving and see what my results are like without accounting for the conditions contained within that data. I am not sure that data is necessary, and I can always come back to it in the future if I need better results.
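For what it's worth, a pandas Series pickles cleanly and converts easily to the other containers mentioned, so it seems a reasonable default. A minimal sketch with a toy dataframe and temp path:

```python
import os
import tempfile
import pandas as pd

# Toy stand-in: keep only the '1' column and pickle it as a Series.
df = pd.DataFrame({0: [(1582638057887, 4.8)], 1: [0.208]})
inputs = df[1]                          # pandas Series of just the input data

path = os.path.join(tempfile.mkdtemp(), 'inputs.pkl')
inputs.to_pickle(path)

restored = pd.read_pickle(path)
as_list = restored.tolist()             # a plain list, if you need one
as_array = restored.to_numpy()          # 1-D numpy array, easy to make a tensor
```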

My code for using qcut to create categorised labels from continuous data:

import os
import pandas as pd

basepath = r'C:\pickles'          # raw string so backslashes aren't escaped
foldernum = len(os.listdir(basepath))
listforbucketing = []

# folders are named 0, 1, 2, ...; collect the continuous value
# (the part after '__', before the next '_') from every filename
for f in range(foldernum):
    filelist = os.listdir(os.path.join(basepath, str(f)))

    for l in filelist:
        bits = l.split('__')[1]
        bit = bits.split('_')
        listforbucketing.append(bit[0])

bin_labels = list(range(100))

df = pd.DataFrame(listforbucketing, columns=['yup'])
df['yup'] = df['yup'].astype(float)
newpd = pd.qcut(df['yup'], q=100, labels=bin_labels)

you can then index newpd[i] to return the bin label for the i-th value (note it is indexed by row position, not by the continuous value itself)
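If you need to label a value that was not in the original list (say, a new file arriving later), one approach is to keep the bin edges from `qcut` via `retbins=True` and reuse them with `pd.cut`. A sketch with toy values and five bins:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# retbins=True also returns the boundaries, so they can be reused later.
labels, edges = pd.qcut(values, q=5, labels=list(range(5)), retbins=True)

# A brand-new continuous value gets a label from the same boundaries.
new_label = pd.cut([4.2], bins=edges, labels=list(range(5)),
                   include_lowest=True)[0]
```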