I have some data that trains fine with a dataloader/dataset. I have converted it to the data_block format, creating the databunch from an ItemList instead. I want to check that it produces the same data, but currently it shuffles it. I have tried turning off shuffle on the train_ds and train_dl, but it makes no difference. Is there a way of turning it off?
The training dl is always shuffled, and you can’t change that without writing your own DataBunch.create function. If you just want to have a look though, use fix_dl, which is the training set not shuffled (and with the validation transforms if you’re in vision).
Another way is to type data.train_dl = data.train_dl.new(shuffle=False) after your data object has been created.
Thanks. That showed where my problem is. How do I do the train/valid split for both X and y so that they stay aligned? I assumed (wrongly) it was doing this automatically. This is my code. It works fine with no_split(). However, if I use random_split_by_pct() then the labels no longer match the data. I tried putting the random split after the labels, but that throws an error.
class ArrayItemList(ItemList):
    def get(self, i):
        """ load images from array rather than file """
        X = self.items[i]
        X = X.transpose(2,0,1).astype(np.float32)/255
        return X
X = bcolz.open(join(path, "xnn.bcl"))[:100]
y = pd.read_pickle(join(path, "ynn.pkl"))
n = os.cpu_count()
db = (ArrayItemList(X)
      .random_split_by_pct(.05)
      .label_from_list(y.tolist())
      .databunch(bs=1, num_workers=n))
db.train_dl = db.train_dl.new(shuffle=False)
This is one solution but I hoped there might be something neater:
X = bcolz.open(join(path, "xnn.bcl"))[:100]
y = pd.read_pickle(join(path, "ynn.pkl"))[:100]
split = (.05, np.random.randint(1e6))
yl = (ItemList(y).random_split_by_pct(*split))
n = os.cpu_count()
db = (ArrayItemList(X)
      .random_split_by_pct(*split)
      .label_from_lists(yl.train, yl.valid)
      .databunch(bs=1, num_workers=n))
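The idea behind the workaround above (same percentage and same seed for both splits) can also be sketched independently of fastai: draw the random train/valid indices once, then index both X and y with them. This is only a plain-Python illustration under stated assumptions, not the library's own mechanism; aligned_split and the toy X and y lists are hypothetical stand-ins for the bcolz array and the pickled labels.

```python
import random

def aligned_split(n_items, valid_pct, seed):
    """Return (train_idx, valid_idx) from a single random split of range(n_items)."""
    rng = random.Random(seed)        # seeded so the split is reproducible
    idx = list(range(n_items))
    rng.shuffle(idx)
    n_valid = int(n_items * valid_pct)
    # The first n_valid shuffled indices go to validation, the rest to training.
    return sorted(idx[n_valid:]), sorted(idx[:n_valid])

# Hypothetical stand-ins for the real data: item i has label i % 10.
X = [f"image_{i}" for i in range(100)]
y = [i % 10 for i in range(100)]

train_idx, valid_idx = aligned_split(len(X), valid_pct=0.05, seed=42)

# Applying the same indices to X and y keeps every item paired with its label.
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_valid, y_valid = [X[i] for i in valid_idx], [y[i] for i in valid_idx]
```

Because the indices are computed once, the pairing cannot drift between data and labels, whichever way the index lists are then handed to the data pipeline.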