Bug in get_sample function

jaroxe · December 4, 2017, 11:14pm

I think I have spotted a bug in the get_sample function. It always returns the same sample, while it should be returning a random sample. This is the source code for the function:

def get_sample(df, n):
    idxs = sorted(np.random.permutation(len(df)))
    return df.iloc[idxs[:n]].copy()

np.random.permutation(len(df)) permutes the row indices of the dataframe, but applying sorted to the full permutation restores the original ordering, so idxs[:n] always selects the first n rows. The first line should be:

idxs = np.random.permutation(len(df))
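
The behaviour described above can be checked in isolation. This is a minimal sketch; the length of 10 and the names n_rows and perm are made up for illustration and are not part of the library:

```python
import numpy as np

n_rows = 10  # hypothetical dataframe length, for illustration only
perm = np.random.permutation(n_rows)

# Sorting the full permutation always gives back 0..n_rows-1,
# so slicing [:n] afterwards always picks the first n positions.
idxs = sorted(perm)
first_three = [int(i) for i in idxs[:3]]
print(first_three)
```

However many times this runs, the printed positions are the first three, which is why the sample never changes.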

jeremy · December 4, 2017, 11:51pm

Well spotted! Your fix isn’t quite right I think, since we do want it sorted before returning - I’ve just pushed a fix that does that too.

jaroxe · December 5, 2017, 1:08am

I see, we want to keep the ordering in case our data is ordered by date or something else.
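
For reference, one way to get both properties (a random sample that still keeps the original row order) is to slice the permutation before sorting it. This is only a sketch of that idea, not necessarily the fix that was pushed; the example dataframe and its "day" column are hypothetical:

```python
import numpy as np
import pandas as pd

def get_sample(df, n):
    # Pick n random row positions first, then sort them so the
    # sample keeps the dataframe's original row order.
    idxs = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idxs].copy()

# Hypothetical data ordered by "day", for illustration only.
df = pd.DataFrame({"day": range(100)})
sample = get_sample(df, 5)
print(sample)
```

The rows differ from call to call, but within each sample the "day" values stay in increasing order.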