Error in split by list

Hi Experts,
I want to split the data into train/val sets based on the filenames contained in the lists tr_n and val_n, but I get an error saying the list type has no attribute label_from_df.

(ImageItemList.from_csv(path, 'train.csv', folder='train', suffix='.png')
 .split_by_list(tr_n, val_n)
 .label_from_df(sep=' ', classes=[str(i) for i in range(28)]))

Can someone please help with the correct way of doing this?

The above works fine if I use random_split_by_pct.

FWIW, I'm struggling with the same thing right now.

If I understand correctly, in the case of random_split_by_pct it takes the train.csv rows as input to label_from_df and builds the one-hot encoded labels, but when we do split_by_list it doesn't find any such labels.
One more thing I want to try is label_from_list; I'll let you know.
In case anyone knows how, please post here.

I checked the object types returned by random_split_by_pct and split_by_list: both are the same ItemLists, but still one fails and the other works.

Not sure if there is a bug in the other split methods. This is the trail for random_split_by_pct:

def split_by_idx(self, valid_idx:Collection[int])->'ItemLists':
    "Split the data according to the indexes in `valid_idx`."
    #train_idx = [i for i in range_of(self.items) if i not in valid_idx]
    train_idx = np.setdiff1d(arange_of(self.items), valid_idx)
    return self.split_by_idxs(train_idx, valid_idx)

def _get_by_folder(self, name):
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]

def split_by_folder(self, train:str='train', valid:str='valid')->'ItemLists':
    "Split the data depending on the folder (`train` or `valid`) in which the filenames are."
    return self.split_by_idxs(self._get_by_folder(train), self._get_by_folder(valid))

def random_split_by_pct(self, valid_pct:float=0.2, seed:int=None)->'ItemLists':
    "Split the items randomly by putting `valid_pct` in the validation set, optional `seed` can be passed."
    if valid_pct==0.: return self.no_split()
    if seed is not None: np.random.seed(seed)
    rand_idx = np.random.permutation(range_of(self))
    cut = int(valid_pct * len(self))
    return self.split_by_idx(rand_idx[:cut])
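To make the control flow above concrete, here is a minimal NumPy-only sketch of the same random-split logic, with no fastai involved (the values of `n`, `valid_pct`, and `seed` are made up for illustration):

```python
import numpy as np

# Replicate random_split_by_pct -> split_by_idx with plain NumPy:
# permute all positions, cut off the first valid_pct fraction as the
# validation indices, and take the complement as the training indices
# (the same np.setdiff1d call as in split_by_idx above).
n, valid_pct, seed = 10, 0.2, 42
np.random.seed(seed)
rand_idx = np.random.permutation(n)
cut = int(valid_pct * n)
valid_idx = rand_idx[:cut]
train_idx = np.setdiff1d(np.arange(n), valid_idx)
```

The two index arrays partition range(n): together they cover every item, and they never overlap.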

This is for split_by_list:

def split_by_list(self, train, valid):
    "Split the data between `train` and `valid`."
    return self._split(self.path, train, valid)

The only difference between the two is that one is directly given the lists of image ids, while the other (random_split_by_pct) gets those via the indices of the image ids.
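One way to bridge that difference is to convert the filename lists into index lists first and then use split_by_idx instead of split_by_list. A minimal sketch with made-up filenames (pure Python, no fastai):

```python
# Hypothetical filenames, in the order the ItemList would hold them.
fnames = ['img0', 'img1', 'img2', 'img3', 'img4']
val_n = ['img1', 'img4']  # desired validation filenames

# Map each filename to its position, then look up the validation positions;
# the resulting index list is what split_by_idx expects.
pos = {name: i for i, name in enumerate(fnames)}
val_idx = [pos[name] for name in val_n]
```

The same mapping works for tr_n if you need the training indices explicitly.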

There was a bug when combining from_csv or from_df with split_by_list. It should be fixed in master.

Hi sgugger,
Thanks for the reply. In which version is it going to be fixed?

The current version I use is 1.0.36.post1. As a workaround I tried this:

src = (ItemList.from_df(df, path, folder='train', suffix='.png')
       .split_by_idx(val_idx)
       .label_from_df(sep=' ', classes=[str(i) for i in range(28)]))

type(src) is LabelLists, the same as with random_split_by_pct, but I don't know why I don't get stuck below with random_split_by_pct while I do with split_by_idx, which is the method random_split_by_pct calls internally. The above works, but then I get stuck at:

data = (src.transform((trn_tfms, _), size=224)
        .databunch().normalize(protein_stats))

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in __getattr__(self, k)
    486 def __getattr__(self,k:str)->Any:
    487     res = getattr(self.x, k, None)
--> 488     return res if res is not None else getattr(self.y, k)
    489
    490 def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':

AttributeError: 'MultiCategoryList' object has no attribute 'normalize'


The bug is fixed in 1.0.37.

OK, thanks for confirming.
I am able to use this:

ImageItemList.from_csv(path, 'train.csv', folder='train', suffix='.png')
 .split_by_idx(val_idx)
 .label_from_df(sep=' ', col=1)

However, I notice the following:

  1. If we do random_split_by_pct, the return type is LabelLists.
  2. If we do split_by_idx/split_by_list, it returns MultiCategoryList lists.

Why is there a difference when both call the same function in the end?

Also, the following still fails:

ImageItemList.from_csv(path, 'train.csv', folder='train', suffix='.png')
 .split_by_list(tr_n, val_n)
 .label_from_df(sep=' ', col=1)

I just tried your code above:

ImageItemList.from_csv(path, 'train.csv', folder='train', suffix='.png')
 .split_by_idx(val_idx)
 .label_from_df(sep=' ', col=1)

and it works without any problem. random_split_by_pct and split_by_idx both return LabelLists in the end.

As for:

ImageItemList.from_csv(path, 'train.csv', folder='train', suffix='.png')
 .split_by_list(tr_n, val_n)
 .label_from_df(sep=' ', col=1)

It can't work unless tr_n and val_n are ItemLists. As I said on the GitHub issue, this is an internal method.

OK, it works with the syntax below:

src = (ImageItemList.from_csv(path, 'train.csv', folder='train', suffix='.png')
       .split_by_idx(val_idx)
       #.split_by_list(tr_n, val_n)
       .label_from_df(sep=' '))

I use this, but unfortunately I get very weird predictions on the unseen test set, not at all in line with what I get when I use random_split_by_pct. I suspect that the labels are not getting associated with the corresponding training inputs.

Here is how I generate val_idx:

tr_n = list(train_df.Trn_fnames.values)  # list of training fnames (ids of images)
val_n = list(val_df.val_fnames.values)
df = pd.read_csv(path/'train.csv')  # complete list of training files with associated labels; columns: Id (fnames), Target. Target is multi-label, e.g. '1 4 5'

val_idx = df.loc[df.Id.isin(val_n)].index  # get the 0, 1, 2, ... indexes of the corresponding validation file names

Please correct me if I'm not generating val_idx in the way the split expects.
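One thing worth checking in that derivation: df.loc[...].index returns index labels, which only coincide with row positions while df has its default 0..n-1 RangeIndex. If df was filtered or reordered earlier without reset_index, the labels no longer match positions and the split would mislabel rows. A toy pandas sketch of the sanity check (the data is invented):

```python
import pandas as pd

# Toy stand-in for train.csv: Id (filename) and Target (space-separated labels).
df = pd.DataFrame({'Id': ['a', 'b', 'c', 'd'],
                   'Target': ['1 4', '0', '2 5', '1']})
val_n = ['b', 'd']

# Same derivation as in the post.
val_idx = df.loc[df.Id.isin(val_n)].index

# Sanity check: positional lookup with val_idx should recover exactly val_n.
# This holds here only because df.index is the default RangeIndex.
recovered = list(df.iloc[val_idx].Id)
```

If `recovered` differs from val_n on your real df, a df.reset_index(drop=True) before computing val_idx should fix the misalignment.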

Is there a way I can clear up this doubt? Other than this, I can't think of any other reason why I get completely different results.

Inside split_by_idxs there is:

return self.split_by_list(self[train_idx], self[valid_idx])

Shouldn't it be self.iloc[train_idx]?

Hello @sgugger ,
I was trying to create a custom ItemList using the tutorial on the fastai website for Siamese networks.

However, after splitting the data (.split_from_df), I get the wrong items for itemsB in the validation set. Somehow, the splitter gets the correct indices for items but the wrong indices for itemsB. I have also created an issue on GitHub for it. Can you please look into it?