Forgive the question if this is obvious to the people reading. I am reading the data_block docs and struggling to plan what to do in my use-case.
I have a series of folders containing pickled DataFrames, with the '1' column being my input data:

```
                         0      1
0     (1582638057887, 4.8)  0.208
...
1000  (1582632526525, 3.5)  0.286
```
I wish to create a number of classification models (like those in the lesson 5 SGD notebook) based on various combinations of variables. A combination may depend on the data in the 0th column meeting a certain condition, or on other data I have encoded in the filename of each pickle:
'309151_2424__100142113_1000_0_0.pkl'

'_100142113_' is a continuous variable; I may wish to create a model that, say, uses only data where x < value < y, etc.

'_1000_' is a category related to the way the data was collected; I wish to separate these out.

The last two digits are my label: 0_0, 2_5, etc.
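To make the filename layout concrete, here is a minimal sketch of how the fields could be pulled apart. The field names (`id_a`, `id_b`, etc.) are my own guesses at what the segments mean, based only on the description above:

```python
from pathlib import Path

def parse_filename(path):
    """Split a filename like '309151_2424__100142113_1000_0_0.pkl' into
    its fields. Field meanings are assumptions, not confirmed."""
    parts = Path(path).stem.split('_')
    # e.g. ['309151', '2424', '', '100142113', '1000', '0', '0']
    # (the double underscore produces an empty element at index 2)
    return {
        'id_a': parts[0],                  # assumed identifier
        'id_b': parts[1],                  # assumed identifier
        'continuous': int(parts[3]),       # the continuous variable
        'category': parts[4],              # the collection-method category
        'label': f'{parts[5]}_{parts[6]}'  # e.g. '0_0', '2_5'
    }
```

With the fields in a dict like this, the filtering conditions can be expressed as ordinary comparisons before a file is ever unpickled.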
My question is basically: what might be the best route to creating a data_block?
My thinking is that I should be able to loop through my folders, loading and appending the DataFrames to a 'master' df based on pre-set conditions, pickle that, and load it the same way as in lesson5_SGD. However, I am worried that the master df will quickly (or perhaps slowly) become too big, as this doesn't seem like the most efficient approach.
I am also not entirely sure what structure the 'master' df should have.
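For reference, this is roughly the loop I have in mind. It filters on the filename fields *before* unpickling, so only matching files are ever loaded, which should keep the master df smaller. The parameter names and the filename layout are assumptions from my description above:

```python
import pandas as pd
from pathlib import Path

def build_master(folder, min_val, max_val, wanted_category):
    """Collect pickled DataFrames whose filename fields satisfy the
    given conditions into one master df, attaching the label column."""
    frames = []
    for p in Path(folder).glob('**/*.pkl'):
        parts = p.stem.split('_')
        cont, cat = int(parts[3]), parts[4]
        label = f'{parts[5]}_{parts[6]}'
        # skip files that fail the filename conditions, without loading them
        if cat != wanted_category or not (min_val < cont < max_val):
            continue
        df = pd.read_pickle(p)
        df['label'] = label  # keep the label alongside the data
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

The resulting master df would then have the original '0' and '1' columns plus a 'label' column, which seems like the simplest shape to hand to a data_block.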
Any thoughts are much appreciated.