Hello,
I work on a dataset at my company that is similar to the Rossmann dataset problem. I want to develop features based on the history, but in contrast to the Rossmann dataset, I have no fixed timestamps. I wrote some Python code, but it takes around 40 seconds per 1000 rows, which is too slow.
Dataset:
- One column with user Id
- One column with date
- Multiple other columns with information (e.g. number of pages)
Problem:
- A user does not send a request every day, so the timeline is not fixed (fixed would mean one entry per user per day, so that each row is one day). In the Rossmann problem, we used
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()
to calculate the sums over the previous/next 7 days. That window counts rows, not days. In my case I need to check, for each row, which of the same user's earlier rows have a timestamp within the previous 7 days.
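To illustrate the difference: pandas rolling windows also accept a time offset such as '7D' instead of a row count. Below is a minimal sketch on made-up data, assuming the columns are called UserId, Date and Status and that Date is a datetime column. I have not checked whether this is fast enough on the full dataset.

```python
import pandas as pd

# Made-up example data; 'UserId', 'Date' and 'Status' are assumed column names.
df = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 2, 2],
    "Date": pd.to_datetime([
        "2020-01-01", "2020-01-05", "2020-01-08", "2020-01-20",
        "2020-01-02", "2020-01-03",
    ]),
    "Status": [1, 1, 1, 1, 1, 1],
})

# Sort so that dates are increasing within each user; time-based windows need that.
df = df.sort_values(["UserId", "Date"]).reset_index(drop=True)

# '7D' is a time offset, not a row count. closed='left' gives the window
# [t - 7 days, t), so the current row itself is not counted.
rolled = (
    df.set_index("Date")
      .groupby("UserId")["Status"]
      .rolling("7D", closed="left")
      .count()
)

# The result is ordered by user and then by date, the same order as df,
# so the values can be assigned back by position.
# Empty windows may come back as NaN, hence the fillna.
df["_PreviousRequests_7"] = rolled.fillna(0).astype(int).to_numpy()
print(df)
```

With a row-count window like rolling(7), the row from 2020-01-20 would still see the three older rows of user 1, although none of them falls within the previous 7 days.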
I could not reuse the elapsed class, because my problem is more dynamic.
Maybe someone has an idea to speed it up? (Parallelize it?)
My code:
import time

# traindata is a global pandas DataFrame with an integer index
# timeframe = 14
def addHistory_specific(userId, timeframe):
    col = '_PreviousRequests_' + str(timeframe)
    traindata[col] = 0
    start_time = time.time()
    for i, row in traindata.iterrows():
        # progress output every 1000 rows
        if i % 1000 == 0:
            print(str(i))
            print("--- %s seconds ---" % (time.time() - start_time))
            start_time = time.time()
        user = row[userId]
        timestamp = row['Date']
        # age of every row relative to the current row, in days
        age_days = (timestamp - traindata['Date']).dt.total_seconds() / (24*60*60)
        # rows of the same user that are strictly older, but at most `timeframe` days old
        dfpuffer = traindata[(traindata[userId] == user) & (age_days > 0) & (age_days <= timeframe)]
        if len(dfpuffer) > 0:
            traindata.loc[i, col] = dfpuffer['Status'].count()
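One direction that might avoid the row-by-row filtering is to sort each user's dates once and find the window borders with a binary search. This is only a sketch that I have not benchmarked on the real data. It assumes Date is a timezone-naive datetime column and timeframe is a whole number of days. The function name add_history_fast and its parameter names are made up for this illustration.

```python
import numpy as np
import pandas as pd

def add_history_fast(df, user_col, timeframe, date_col="Date", status_col="Status"):
    """Per row, count the same user's non-null Status values in [t - timeframe days, t)."""
    dates_all = df[date_col].to_numpy()            # datetime64 values
    valid_all = df[status_col].notna().to_numpy()  # .count() ignores nulls, so do we
    window = np.timedelta64(timeframe, "D")
    out = np.zeros(len(df), dtype=np.int64)

    # groupby(...).indices maps each user to the integer positions of its rows
    for pos in df.groupby(user_col).indices.values():
        order = np.argsort(dates_all[pos], kind="stable")
        pos = pos[order]
        dates = dates_all[pos]
        # cum[k] = number of non-null Status values among the first k sorted rows
        cum = np.concatenate(([0], np.cumsum(valid_all[pos])))
        lo = np.searchsorted(dates, dates - window, side="left")  # first row with date >= t - window
        hi = np.searchsorted(dates, dates, side="left")           # first row with date >= t
        out[pos] = cum[hi] - cum[lo]

    result = df.copy()
    result["_PreviousRequests_" + str(timeframe)] = out
    return result

# Made-up example data, deliberately not sorted.
example = pd.DataFrame({
    "UserId": [1, 2, 1, 1, 2, 1],
    "Date": pd.to_datetime([
        "2020-01-05", "2020-01-03", "2020-01-01",
        "2020-01-20", "2020-01-02", "2020-01-08",
    ]),
    "Status": [1, 1, 1, 1, 1, 1],
})
result = add_history_fast(example, "UserId", 7)
print(result)
```

The idea is that each user needs one sort and two binary searches instead of a full scan of the DataFrame per row. The borders match the conditions in the loop above: strictly older than the current row, and at most timeframe days old.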