Custom ItemList, getting ForkingPickler broken pipe

sgugger · February 25, 2019, 4:42pm

OK, I pushed MixedItemList to master. Here is an example of its use:

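# fastai v1 imports for the three applications used below
from fastai.vision import *
from fastai.text import *
from fastai.tabular import *

path = untar_data(URLs.MNIST_SAMPLE)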
df = pd.read_csv(path/'labels.csv')
image_il = ImageItemList.from_df(df.iloc[:1000], path=path, cols='name')

path1 = untar_data(URLs.IMDB_SAMPLE)
text_il = TextList.from_csv(path1, 'texts.csv', cols='text')

path2 = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path2/'adult.csv')

dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

tab_il = TabularList.from_df(df.iloc[:1000], path=path2, cat_names=cat_names, cont_names=cont_names, procs=procs)

tst_il = MixedItemList([image_il, text_il, tab_il], path=path)

Note that this one won’t be able to get to the databunch stage: the texts aren’t all the same length, so they can’t be batched together directly without a custom collate function. Each kind of data is still processed properly after the split and labeling.
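To illustrate what such a custom collate function would have to do, here is a minimal sketch in plain Python. It assumes each sample has the form ((image, text, tab), label) and that the text is a list of token ids; the names pad_texts, mixed_collate and PAD_IDX are hypothetical and not part of fastai. In real use the fields would be tensors, and the function would be handed to a PyTorch DataLoader through its collate_fn argument.

```python
from typing import List, Sequence

PAD_IDX = 1  # assumed padding token id; use whatever your vocab reserves for padding


def pad_texts(texts: Sequence[Sequence[int]], pad_idx: int = PAD_IDX) -> List[List[int]]:
    """Right-pad every token-id sequence to the length of the longest one."""
    max_len = max(len(t) for t in texts)
    return [list(t) + [pad_idx] * (max_len - len(t)) for t in texts]


def mixed_collate(samples):
    """Collate samples shaped like ((image, text, tab), label).

    Only the text field needs special handling; the other fields are
    simply grouped per field (a real version would stack tensors here).
    """
    xs, ys = zip(*samples)
    images, texts, tabs = zip(*xs)
    return [list(images), pad_texts(texts), list(tabs)], list(ys)


# Two toy samples whose texts have different lengths.
samples = [(("img0", [5, 6, 7], "tab0"), 0), (("img1", [8], "tab1"), 1)]
xb, yb = mixed_collate(samples)
```

After padding, every text in the batch has the same length, which is the property the default batching relies on.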

If you want to take the labels from one of the CSV files, let’s say the first one, you have to tell the MixedItemList which inner dataframe to use (here the one from image_il):

tst_il = MixedItemList([image_il, text_il, tab_il], path=path, inner_df=image_il.inner_df)

If you want to apply data augmentation, then when it’s time to call transform, you need to pass two lists (train/valid), each containing three lists of transforms (one each for image_il, text_il, tab_il). For instance:

src = tst_il.random_split_by_pct()
src = src.label_from_df(cols='label')
src = src.transform([[[pad(padding=4)], [], []], [[],[],[]]])
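The nesting in that last call is easy to get wrong, so here is a small sketch of a helper that builds it. The name mixed_tfms is hypothetical (it is not a fastai function), and strings stand in for real transform objects such as pad(padding=4).

```python
def mixed_tfms(train_per_list, valid_per_list=None):
    """Build the [train, valid] structure, with one list of transforms per inner ItemList.

    If valid_per_list is omitted, no transforms are applied on the validation set.
    """
    n = len(train_per_list)
    if valid_per_list is None:
        valid_per_list = [[] for _ in range(n)]
    if len(valid_per_list) != n:
        raise ValueError("train and valid need one entry per inner ItemList")
    return [[list(t) for t in train_per_list], [list(t) for t in valid_per_list]]


# Augment only the images (first inner list) at training time.
tfms = mixed_tfms([["pad"], [], []])
```

The outer list always has two entries (train, valid), and each of those has as many entries as there are inner ItemLists, in the same order they were given to MixedItemList.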

As I said, src.databunch() doesn’t work here because of the text. If you remove the text ItemList and adapt the rest accordingly, it works perfectly, and the batches will look like:
[[batch of images, [batch of cats, batch of conts]], batch of labels]
(or more generally [[batch of first il, batch of second il, …], batch of labels])
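As a quick sketch of how that nesting unpacks for the image + tabular case, here is a plain Python helper. The name unpack_mixed_batch is hypothetical, and strings stand in for the actual tensors.

```python
def unpack_mixed_batch(batch):
    """Split [[images, [cats, conts]], labels] into its four parts."""
    xs, labels = batch
    images, (cats, conts) = xs
    return images, cats, conts, labels


# Placeholder strings instead of real tensors.
batch = [["images", ["cats", "conts"]], "labels"]
images, cats, conts, labels = unpack_mixed_batch(batch)
```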

As I noted before, this will require a custom model to work, but you get all the processing done together.
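To give an idea of the shape such a custom model could take, here is a rough sketch for the image + tabular case. It is plain Python with callables standing in for real network branches (in practice this would be an nn.Module), the class name TwoBranchSketch is hypothetical, and it assumes the model receives the inputs as separate arguments: the images, then the [cats, conts] pair.

```python
class TwoBranchSketch:
    """Stand-in for a custom model: one branch per input type, then a shared head."""

    def __init__(self, image_branch, tab_branch, head):
        self.image_branch = image_branch
        self.tab_branch = tab_branch
        self.head = head

    def forward(self, images, tab):
        cats, conts = tab
        img_feats = self.image_branch(images)     # per-sample image features
        tab_feats = self.tab_branch(cats, conts)  # per-sample tabular features
        # Concatenate the two feature lists sample by sample before the head.
        fused = [list(i) + list(t) for i, t in zip(img_feats, tab_feats)]
        return self.head(fused)

    def __call__(self, *xb):
        return self.forward(*xb)


# Toy branches operating on nested lists instead of tensors.
model = TwoBranchSketch(
    image_branch=lambda images: [[sum(img)] for img in images],
    tab_branch=lambda cats, conts: [list(c) + list(x) for c, x in zip(cats, conts)],
    head=lambda fused: [sum(row) for row in fused],
)
xb = [[[1, 2], [3, 4]], [[[0], [1]], [[0.5], [1.5]]]]
out = model(*xb)
```

The point is only the wiring: each input type goes through its own branch, and the features are combined before the final layers.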