I’ve been following the fastai tutorials as well as this one https://github.com/anhquan0412/fastai-tabular-text-demo/blob/master/fastai_tab_text.py to understand how the custom ItemList API works with tabular data consisting of numerical (continuous and discrete) and text fields for demographics. The data used by the author of that repo is very similar to mine.
I seem to be hitting an issue with the default databunch create method. I am able to create a custom ItemList, split it, label it, and create a databunch, but whenever I call .one_batch() it throws the error below. I am aware this is because I am not passing the correct PyTorch argument, but where am I going wrong?
```
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0.
Got 36 and 47 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711
```
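For context, this mismatch comes from the text field: the default collate tries to torch.stack token tensors of different lengths (36 vs. 47 here), which only works when every tensor has the same shape. A minimal PyTorch-only reproduction, plus the padding that makes the tensors stackable:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two "text" samples of different lengths -- this is what the default
# collate tries (and fails) to stack into a single batch tensor.
a = torch.ones(36, dtype=torch.long)
b = torch.ones(47, dtype=torch.long)

try:
    torch.stack([a, b])  # raises the same kind of RuntimeError as above
except RuntimeError as e:
    print(e)

# Padding to a common length makes them stackable:
batch = pad_sequence([a, b], batch_first=True, padding_value=1)
print(batch.shape)  # torch.Size([2, 47])
```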
This led me to dive further into collating the batch manually, which I tried to do with the following, where mixed_tabular_pad_collate is a custom collate function:
```python
class MixedTabularDataBunch(DataBunch):
    @classmethod
    def create(cls, train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs=64,
               pad_idx=1, pad_first=True, no_check:bool=False, **kwargs) -> DataBunch:
        # builds the collate padding method defined above
        collate_fn = partial(mixed_tabular_pad_collate, pad_idx=pad_idx, pad_first=pad_first)
        # note: collate_fn is not forwarded to super().create() here
        return super().create(train_ds, valid_ds, test_ds, path=path, bs=bs, **kwargs)
```
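Since mixed_tabular_pad_collate itself isn't shown above, here is a minimal sketch of what such a function might look like. The sample layout ((cats, conts, text_ids), target) and all the names are my assumptions, not the linked repo's actual code:

```python
import torch

def mixed_tabular_pad_collate_sketch(samples, pad_idx=1, pad_first=True):
    """Collate a list of ((cats, conts, text_ids), target) samples:
    stack the fixed-size tabular tensors, pad the variable-length text
    field to the longest sequence in the batch.
    NOTE: a hypothetical sketch, not the repo's actual function."""
    max_len = max(len(s[0][2]) for s in samples)
    cats  = torch.stack([s[0][0] for s in samples])
    conts = torch.stack([s[0][1] for s in samples])
    texts = torch.full((len(samples), max_len), pad_idx, dtype=torch.long)
    for i, ((_, _, t), _) in enumerate(samples):
        if pad_first:
            texts[i, max_len - len(t):] = t   # left-pad, like fastai's text loader
        else:
            texts[i, :len(t)] = t             # right-pad
    ys = torch.tensor([s[1] for s in samples])
    return (cats, conts, texts), ys
```

The key point is that only the text tensor needs padding; the categorical and continuous tensors are already fixed-size and can be stacked directly.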
Then, when I try to reconstruct the databunch, I sometimes get this error, which I read is a threading issue:

```
AssertionError: can only join a child process
```

and this warning:

```
/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py:269: UserWarning:
It's not possible to collate samples of your dataset together in a batch. warn(message)
```
It will not let me pass collate_fn=collate_fn as an argument to the super().create() call in my classmethod's return, and without specifying collate_fn the samples don't seem to get collated at all.
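As far as I can tell, fastai 1.0's DataBunch.create accepts a collate_fn argument and simply forwards it to the underlying PyTorch DataLoaders, so passing the partial through should be the right idea. A PyTorch-only sketch of that forwarding mechanism (the dataset and collate function here are toy stand-ins, not fastai's):

```python
from functools import partial
import torch
from torch.utils.data import DataLoader, Dataset

class VarLenTextDS(Dataset):
    """Toy dataset yielding variable-length 'text' samples."""
    def __init__(self, lengths): self.lengths = lengths
    def __len__(self): return len(self.lengths)
    def __getitem__(self, i):
        return torch.arange(self.lengths[i]), 0  # (token ids, dummy target)

def pad_collate(samples, pad_idx=1, pad_first=True):
    # Pad every sequence in the batch to the longest one.
    max_len = max(len(x) for x, _ in samples)
    out = torch.full((len(samples), max_len), pad_idx, dtype=torch.long)
    for i, (x, _) in enumerate(samples):
        if pad_first: out[i, max_len - len(x):] = x
        else:         out[i, :len(x)] = x
    ys = torch.tensor([y for _, y in samples])
    return out, ys

ds = VarLenTextDS([36, 47, 12])
dl = DataLoader(ds, batch_size=3,
                collate_fn=partial(pad_collate, pad_idx=1, pad_first=True))
xb, yb = next(iter(dl))
print(xb.shape)  # torch.Size([3, 47])
```

With the default collate in place of pad_collate, the same next(iter(dl)) call raises the size-mismatch RuntimeError from the top of this post.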
I have more extensive code written for the TabularList creation, splitting, labeling, and conversion to a databunch, but I've condensed it for this forum post:
```python
il = (MixedTabularList.from_df(joined_df, cat_cols, cont_cols, txt_cols,
                               vocab=None, procs=procs, path=PATH)
      .split_by_rand_pct(valid_pct=0.1, seed=42)
      .label_from_df(dep_var)
      .transform(tfm_y=True)  # data augmentation?
      .databunch(bs=8))
```
```
=== Software ===
python        : 3.6.7
fastai        : 1.0.52
fastprogress  : 0.1.21
torch         : 1.1.0
nvidia driver : 410.79
torch cuda    : 10.0.130 / is available
torch cudnn   : 7501 / is enabled

=== Hardware ===
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment ===
platform      : Linux-4.14.79+-x86_64-with-Ubuntu-18.04-bionic
distro        : #1 SMP Wed Dec 19 21:19:13 PST 2018
conda env     : Unknown
python        : /usr/bin/python3
sys.path      : /env/python
  /usr/lib/python36.zip
  /usr/lib/python3.6
  /usr/lib/python3.6/lib-dynload
  /usr/local/lib/python3.6/dist-packages
  /usr/lib/python3/dist-packages
  /usr/local/lib/python3.6/dist-packages/IPython/extensions
  /root/.ipython
```
I expected the default collate function to collate the samples so that the databunch would have the correct dimensions for both the text and tabular data, and could then be fed into a learner along with a custom model.