Please explain how Feather, the fast on-disk format, works


(abhi) #1

So the moment I use

df_raw.to_feather(f'{PATH_TMP}train_raw')

do we start using the data frame from /tmp/? Is it even a data frame? I read that this is an optimisation built on Apache Arrow, and that they are rewriting pandas.

So if there are any further operations, do they get applied to that /tmp/ data frame? I'm a bit confused here. Thanks


#2

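At this point you have done a good bit of preprocessing to prepare your dataframe df_raw. That code saves the df to Feather format so you can come back later and read the Feather file back into pandas with df_raw = pd.read_feather(f'{PATH_TMP}train_raw'). Saving doesn't change what you are working with: df_raw is still the ordinary in-memory pandas dataframe, and any further operations apply to it, not to the file on disk.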

You can skip the feather conversion steps, but it is handy to save to feather at this point so if you screw up later you can just go back and reload the dataframe without repeating the preprocessing steps.


(abhi) #3

Thanks