How can I turn off shuffle in the data_block?

simoneva · January 2, 2019, 5:49pm

I have some data that trains fine with a dataloader/dataset. I have converted it to the data_block format, creating the databunch from an ItemList instead. I want to check that it produces the same data, but currently the training data is shuffled. I have tried turning off shuffle on the train_ds and train_dl but it makes no difference. Is there a way of turning it off?

sgugger · January 3, 2019, 7:33am

The training dl is always shuffled and you can’t change that without writing your own DataBunch.create function. If you just want to have a look, though, use fix_dl, which is the training set without shuffling (and with the validation transforms if you’re in vision).
Another way is to run data.train_dl = data.train_dl.new(shuffle=False) after your data object has been created.

simoneva · January 3, 2019, 2:05pm

Thanks. That showed where my problem is. How do I do the train/valid split for both X and y so that they stay aligned? I assumed (wrongly) it was doing this automatically. This is my code. It works fine with no_split(). However, if I use random_split_by_pct() then the labels no longer match the data. I tried putting the random split after the labels but that throws an error.

class ArrayItemList(ItemList):
    def get(self, i):
        """ load images from array rather than file """
        X = self.items[i]
        X = X.transpose(2,0,1).astype(np.float32)/255
        return X

X = bcolz.open(join(path, "xnn.bcl"))[:100]
y = pd.read_pickle(join(path, "ynn.pkl"))
n = os.cpu_count()
db = (ArrayItemList(X)
        .random_split_by_pct(.05)
        .label_from_list(y.tolist())
        .databunch(bs=1, num_workers=n))
db.train_dl = db.train_dl.new(shuffle=False)
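
To make the symptom concrete, here is a plain-Python sketch (no fastai involved; `random_split` is a hypothetical helper, not the library's `random_split_by_pct`). It splits the items at random and then attaches labels in their original order, which is one way a mismatch like the one described above can arise. Whether fastai does exactly this internally is an assumption, not something confirmed in this thread.

```python
import random

def random_split(items, valid_pct, seed=None):
    """Hypothetical helper: shuffle the indices, then send the first
    valid_pct share of them to the validation set."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in idx[:cut]]
    train = [items[i] for i in idx[cut:]]
    return train, valid

# Toy data in which the correct label of item i is simply i
X = [f"img{i}" for i in range(20)]
y = list(range(20))

x_train, x_valid = random_split(X, 0.25, seed=0)

# Labels taken in their original order, ignoring the permutation applied to X
naive_labels = y[:len(x_train)]
mismatches = sum(x != f"img{lab}" for x, lab in zip(x_train, naive_labels))
print(f"{mismatches} of {len(x_train)} training items have the wrong label")
```

With no split at all the items and labels keep their original order, which would be consistent with no_split() working.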

simoneva · January 3, 2019, 4:12pm

This is one solution but I hoped there might be something neater:

X = bcolz.open(join(path, "xnn.bcl"))[:100]
y = pd.read_pickle(join(path, "ynn.pkl"))[:100]
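# Fix the validation fraction and a seed once, so both splits below pick the same items
split = (.05, np.random.randint(1e6))
# Split the labels with the same (valid_pct, seed) that is used for the images
yl = ItemList(y).random_split_by_pct(*split)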
n = os.cpu_count()
db = (ArrayItemList(X)
        .random_split_by_pct(*split)
        .label_from_lists(yl.train, yl.valid)
        .databunch(bs=1, num_workers=n))
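
A possibly neater way to think about it, sketched in plain Python rather than fastai: draw the train and validation indices once and use them to index both X and y, so alignment does not depend on two separate splits sharing a seed. `split_indices` and `take` are hypothetical helper names. In fastai v1 the validation indices could presumably be handed to an index-based splitter such as `split_by_idx`, but that has not been checked against the code above.

```python
import random

def split_indices(n, valid_pct, seed=None):
    """Hypothetical helper: return (train_idx, valid_idx) for n items."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(valid_pct * n)
    # Sort each side so items keep their original relative order
    return sorted(idx[cut:]), sorted(idx[:cut])

def take(seq, indices):
    """Pick the elements of seq at the given positions."""
    return [seq[i] for i in indices]

# Toy data in which the correct label of item i is simply i
X = [f"img{i}" for i in range(20)]
y = list(range(20))

train_idx, valid_idx = split_indices(len(X), 0.25, seed=42)

# The same index lists are applied to both X and y
x_train, y_train = take(X, train_idx), take(y, train_idx)
x_valid, y_valid = take(X, valid_idx), take(y, valid_idx)

# Every item still carries its own label on both sides of the split
assert all(x == f"img{lab}" for x, lab in zip(x_train, y_train))
assert all(x == f"img{lab}" for x, lab in zip(x_valid, y_valid))
print(len(x_train), len(x_valid))  # prints: 15 5
```

Because the indices are computed once, there is no second random draw that could disagree with the first.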