Speed Up Python Code - how to do it faster?

benediktschifferer · May 1, 2017, 11:23am

Hello,

I am working on a dataset at my company that is similar to the Rossmann dataset problem. I want to build features based on the history, but in contrast to the Rossmann dataset, my data has no fixed timestamps. I wrote some Python code, but it takes around 40 seconds per 1000 rows, which is too slow.

Dataset:

One column with user Id
One column with date
Multiple other columns with information (e.g. number of pages)

Problem:

A user does not send a request every day. That means the timeline is not on a fixed grid, unlike the case where a user has one entry per day, so that each row is one day. In the Rossmann problem, we used

bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()

to calculate the sums over the previous/next 7 days. Here, I instead need to check for each previous row whether its timestamp falls within the previous 7 days.
I could not reuse the elapsed class, because my problem is more dynamic.

Maybe someone has an idea how to speed it up? (Parallelize it?)

My code:

import time

#timeframe = 14
def addHistory_specific(userId, timeframe):
    # traindata is a global DataFrame with a datetime 'Date' column
    col = '_PreviousRequests_' + str(timeframe)
    traindata[col] = 0

    start_time = time.time()
    for i, row in traindata.iterrows():
        # progress report (assumes an integer index)
        if i % 1000 == 0:
            print(str(i))
            print("--- %s seconds ---" % (time.time() - start_time))
            start_time = time.time()

        user = traindata[userId][i]
        timestamp = traindata['Date'][i]
        # age of every row relative to the current row, in days
        age_days = (timestamp - traindata['Date']).dt.total_seconds() / (24*60*60)
        # earlier requests of the same user within the last `timeframe` days
        dfpuffer = traindata[(traindata[userId] == user) & (age_days > 0) & (age_days <= timeframe)]
        if len(dfpuffer) > 0:
            # .loc replaces .ix, which was removed in later pandas versions
            traindata.loc[i, col] = dfpuffer['Status'].count()
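
One possible way to avoid the row-by-row loop, as a rough sketch rather than a tested drop-in replacement: the loop above scans the whole frame once per row, so the cost grows quadratically. If the dates of each user are sorted once, the number of earlier requests inside the window can be looked up with a binary search. The function name count_previous_requests and the column name 'user' below are made up for illustration. Note one difference from the loop: this counts rows, not non-null 'Status' values, so the two only agree if 'Status' has no missing values.

```python
import numpy as np
import pandas as pd

def count_previous_requests(df, user_col, date_col, timeframe_days):
    """For each row, count earlier rows of the same user whose date lies
    in [date - timeframe_days, date), i.e. 0 < age <= timeframe_days."""
    window = np.timedelta64(timeframe_days, "D")
    dates = df[date_col].to_numpy()
    out = np.zeros(len(df), dtype=np.int64)

    # groupby(...).indices maps each user to the integer positions of its rows
    for positions in df.groupby(user_col).indices.values():
        d = dates[positions]      # this user's dates, in original row order
        s = np.sort(d)            # the same dates, sorted once
        # number of this user's dates strictly before each date
        hi = np.searchsorted(s, d, side="left")
        # number of this user's dates strictly before the window start
        lo = np.searchsorted(s, d - window, side="left")
        out[positions] = hi - lo
    return out

# Small made-up example
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "a", "b"],
    "Date": pd.to_datetime([
        "2017-01-01", "2017-01-05", "2017-01-20",
        "2017-01-02", "2017-01-05", "2017-01-10",
    ]),
})
result = count_previous_requests(df, "user", "Date", 14)
```

Under these assumptions, the feature from the loop would be filled with something like traindata['_PreviousRequests_14'] = count_previous_requests(traindata, userId, 'Date', 14). The remaining Python loop runs once per user rather than once per row, so it should help most when users have many requests each.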

benediktschifferer · May 1, 2017, 6:31pm

I tested different versions, and I can reduce the calculation time by a factor of two with futures (similar to the tiramisu notebook):

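import numpy as np
from concurrent.futures import ProcessPoolExecutor

def conv_all_historic_user_14():
    with ProcessPoolExecutor(4) as ex:
        return np.stack(list(ex.map(one_historic_user_14, range(n))))  # np.stack needs a sequence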

Does anyone have another idea?
If anyone is interested, I can share my code.