Fastai v2 chat

Hi everybody,
I’m working with a dataset of 4000 rows by 1000 columns and I’m trying to use TabularDataLoaders:

procs = [Categorify]

dls = TabularDataLoaders.from_df(train, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="y", bs=64)

The initialization is taking ages. Is there something I’m doing wrong? Should I use a different data loader for this data? Or a different library altogether?

Thank you in advance to any takers.

No. With that many columns I wouldn’t be surprised that Categorify is taking forever. You could preprocess the data yourself, which could potentially speed things up. Another option is to make sure all your categorical columns have the category dtype in pandas; this should make it significantly faster.
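
The dtype suggestion above can be sketched in plain pandas as follows. The toy frame and its column names are made up for illustration; only `cat_names` mirrors the variable from the question, and how much time this saves inside fastai is the claim of the reply, not something this snippet measures.

```python
import pandas as pd

# Toy frame standing in for the real training data (hypothetical columns).
train = pd.DataFrame({
    "occupation": ["clerk", "farmer", "clerk", "nurse"],
    "region": ["north", "south", "south", "north"],
    "age": [31.0, 45.0, 28.0, 52.0],
    "y": [0, 1, 0, 2],
})
cat_names = ["occupation", "region"]

# Cast every categorical input column to the pandas category dtype up front.
train = train.astype({name: "category" for name in cat_names})

print(train.dtypes)
```

Continuous columns and the target are left untouched; only the columns listed in `cat_names` change dtype.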

Thank you very much for your response. I’m following your very helpful series of tutorials on v2, but I don’t quite understand what the Categorify step is for. I only need it if I have categorical data like “occupation” in the example in the first video on tabular, right? So if I only have continuous real-valued columns, I don’t need it?

No, you don’t in that case. Your data is all continuous so you need FillMissing and Normalize (or just Normalize if you have no missing data). This section talks about the procs well:
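
As a rough illustration of what those two procs do to continuous columns, here is a plain pandas sketch. It is not fastai's code: the column names are invented, and details such as which rows the statistics are computed on are assumptions.

```python
import numpy as np
import pandas as pd

# Toy continuous data with one missing value (hypothetical column names).
df = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 4.0], "f2": [10.0, 20.0, 30.0, 40.0]})
cont_names = ["f1", "f2"]

# FillMissing-like step: remember the median, flag missing rows, then fill.
medians = df[cont_names].median()
for name in cont_names:
    if df[name].isna().any():
        df[f"{name}_na"] = df[name].isna()
    df[name] = df[name].fillna(medians[name])

# Normalize-like step: subtract the mean and divide by the standard deviation.
means, stds = df[cont_names].mean(), df[cont_names].std()
df[cont_names] = (df[cont_names] - means) / stds

print(df)
```

If no column has missing values, the first step is a no-op, which matches the advice that Normalize alone is enough in that case.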

I’ve updated the dataloader, but I still get an error when creating the learner:

procs = [Normalize]

dls = TabularDataLoaders.from_df(train, procs=procs, cat_names=["y"], cont_names=cont_names, 
                                 y_names="y", y_block = CategoryBlock(), bs=32)
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(2)
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-33-1f839d2f868d> in <module>()
----> 1 learn = tabular_learner(dls, metrics=accuracy)
      2 learn.fit_one_cycle(2)

6 frames

/usr/local/lib/python3.6/dist-packages/fastcore/transform.py in gather_attrs(o, k, nm)
    163     att = getattr(o,nm)
    164     res = [t for t in att.attrgot(k) if t is not None]
--> 165     if not res: raise AttributeError(k)
    166     return res[0] if len(res)==1 else L(res)
    167 

AttributeError: classes

I’ve tried to look online for the error but could not find much. It is related to the fact that I don’t put the vocab in the CategoryBlock, but I’ve tried a couple of ways and I’m not sure how it should work. y takes the values 0, 1, or 2 and is an int column in the train dataframe that I feed to the dataloader. Should this vocab work: vocab={0: "0", 1: "1", 2: "2"}?

Can I see the full stack trace?

Here it is

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-33-1f839d2f868d> in <module>()
----> 1 learn = tabular_learner(dls, metrics=accuracy)
      2 learn.fit_one_cycle(2)

6 frames

/usr/local/lib/python3.6/dist-packages/fastcore/logargs.py in _f(*args, **kwargs)
     50         log_dict = {**func_args.arguments, **{f'{k} (not in signature)':v for k,v in xtra_kwargs.items()}}
     51         log = {f'{f.__qualname__}.{k}':v for k,v in log_dict.items() if k not in but}
---> 52         inst = f(*args, **kwargs) if to_return else args[0]
     53         init_args = getattr(inst, 'init_args', {})
     54         init_args.update(log)

/usr/local/lib/python3.6/dist-packages/fastai/tabular/learner.py in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, **kwargs)
     29     if layers is None: layers = [200,100]
     30     to = dls.train_ds
---> 31     emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
     32     if n_out is None: n_out = get_c(dls)
     33     assert n_out, "`n_out` is not defined, and could not be inferred from data, set `dls.c` or pass `n_out`"

/usr/local/lib/python3.6/dist-packages/fastai/tabular/model.py in get_emb_sz(to, sz_dict)
     23 def get_emb_sz(to, sz_dict=None):
     24     "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
---> 25     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
     26 
     27 # Cell

/usr/local/lib/python3.6/dist-packages/fastai/tabular/model.py in <listcomp>(.0)
     23 def get_emb_sz(to, sz_dict=None):
     24     "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
---> 25     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
     26 
     27 # Cell

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in __getattr__(self, k)
    158         if self._component_attr_filter(k):
    159             attr = getattr(self,self._default,None)
--> 160             if attr is not None: return getattr(attr,k)
    161         raise AttributeError(k)
    162     def __dir__(self): return custom_dir(self,self._dir())

/usr/local/lib/python3.6/dist-packages/fastcore/transform.py in __getattr__(self, k)
    200     def __getitem__(self,i): return self.fs[i]
    201     def __setstate__(self,data): self.__dict__.update(data)
--> 202     def __getattr__(self,k): return gather_attrs(self, k, 'fs')
    203     def __dir__(self): return super().__dir__() + gather_attr_names(self, 'fs')
    204 

/usr/local/lib/python3.6/dist-packages/fastcore/transform.py in gather_attrs(o, k, nm)
    163     att = getattr(o,nm)
    164     res = [t for t in att.attrgot(k) if t is not None]
--> 165     if not res: raise AttributeError(k)
    166     return res[0] if len(res)==1 else L(res)
    167 

AttributeError: classes

I see the issue. You’re putting your y name in cat_names. You should never do this, as cat_names and cont_names are for our inputs, while y_names is for our dependent variable, or output.

What you’re telling it here is to also train on your y’s as an input, which we don’t want :slight_smile:
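
The rule above can be sketched as a small guard that builds the input column lists and refuses to let the target leak into them. `split_features` is a hypothetical helper written for this illustration, not part of fastai, and the column names are made up.

```python
def split_features(columns, y_names, cat_names=()):
    """Return (cat_names, cont_names) for the inputs, rejecting overlap with the targets."""
    y_set = set([y_names] if isinstance(y_names, str) else y_names)
    leaked = [c for c in cat_names if c in y_set]
    if leaked:
        raise ValueError(f"target column(s) {leaked} must not appear in cat_names")
    cats = list(cat_names)
    # Everything that is neither a target nor categorical is treated as continuous.
    conts = [c for c in columns if c not in y_set and c not in cats]
    return cats, conts

columns = ["f1", "f2", "f3", "y"]
cats, conts = split_features(columns, y_names="y")
print(cats, conts)
```

With an all-continuous dataset like the one in this thread, `cats` comes back empty, and the two lists would be passed as cat_names and cont_names while "y" appears only in y_names.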

Yep, that was the issue, thank you very much for the help and for the great tutorials you have made! :+1:

Not really. In the end I dealt with the imbalanced classes through a custom loss, not with this weighted DL.

Hi!

It seems that tabular.add_cyclic_datepart is missing from fastai 2. Is this on purpose? Or should I create a feature request on GitHub?

Thanks
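
Whether the omission is intentional is not answered in this thread. As a stopgap, the usual sin/cos encoding of date parts can be sketched with plain pandas and NumPy. `add_cyclic_features`, the `saledate` column, and the output column names are all made up here, and this is not the fastai v1 implementation.

```python
import numpy as np
import pandas as pd

def add_cyclic_features(df, field):
    """Append sin/cos encodings of weekday and month for the datetime column `field`."""
    dates = pd.to_datetime(df[field])
    out = df.copy()
    for name, values, period in [
        ("weekday", dates.dt.dayofweek, 7),   # 0 = Monday ... 6 = Sunday
        ("month", dates.dt.month - 1, 12),    # 0 = January ... 11 = December
    ]:
        angle = 2 * np.pi * values / period
        out[f"{field}_{name}_sin"] = np.sin(angle)
        out[f"{field}_{name}_cos"] = np.cos(angle)
    return out

df = pd.DataFrame({"saledate": ["2020-01-06", "2020-07-09"]})
res = add_cyclic_features(df, "saledate")
print(res)
```

Encoding each part as a point on a circle keeps neighbouring values (Sunday and Monday, December and January) close together, which a plain integer column does not.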

FOMO: will I miss out on certain topics if I take the 2020 course instead of https://course18.fast.ai/ml?

I see the tables of contents are different: one teaches ML (I guess), while the 2020 version teaches deep learning.

I want to learn everything that’s available in fast.ai, but I’m not strong in math & Python. Is it recommended that I finish the 2020 version and then go to https://course18.fast.ai/ml?

Thanks

Not sure where to post this, but I’m seeing some weird behavior. (I’m also not sure how to open a new thread around here.)

I get an AttributeError (see below) whenever I run the nbdev tests on 04_data.external.ipynb.

To pass the test successfully I am using the following workaround:

  • replace __file__ = str(NbdevConfig().path("lib_path")/'data'/'external.py')
  • with __file__ = str(NbdevConfig().lib_path/'data'/'external.py')

Am I missing something in my setup in order to call path("lib_path") as originally intended?

Below is the error I get:
---------------------------------------------------------------------------

~/opt/anaconda3/lib/python3.8/site-packages/nbdev/imports.py in __getattr__(self, k)
     46 
     47     def __getattr__(self,k):
---> 48         if k=='d' or k not in self.d: raise AttributeError(k)
     49         return self.config_file.parent/self.d[k] if k.endswith('_path') else self.d[k]
     50 

AttributeError: path
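
The traceback suggests why the workaround helps: Python only calls `__getattr__` when normal lookup fails, so reaching it with k equal to 'path' means the installed config class has no `path` method, and 'path' is not a key in the settings either. The sketch below reproduces just that behaviour. `MiniConfig`, the file path, and the settings dict are hypothetical; only the body of `__getattr__` is copied from the traceback, and a version mismatch between the notebook and the installed nbdev is a guess, not something confirmed here.

```python
from pathlib import Path

class MiniConfig:
    "Hypothetical stand-in that copies only the `__getattr__` shown in the traceback."
    def __init__(self, config_file, d):
        self.config_file, self.d = Path(config_file), d

    def __getattr__(self, k):
        # Only called when normal lookup fails, e.g. for a missing `path` method.
        if k == 'd' or k not in self.d: raise AttributeError(k)
        return self.config_file.parent/self.d[k] if k.endswith('_path') else self.d[k]

cfg = MiniConfig('/repo/settings.ini', {'lib_path': 'fastai'})
print(cfg.lib_path)            # attribute access resolves through __getattr__
try:
    cfg.path('lib_path')       # no `path` method and no `path` key
except AttributeError as e:
    print('AttributeError:', e)
```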

This is a duplicate question and semi-long, so I will not post here again if it’s too long for this chat. I have already asked Zach on a forum, but I know he is in school and can get busy.

I am currently in the middle of an attempt to get my colleagues to fall in love with fastai and nbdev, but it’s a battle.

The question on the Dataloaders:

  1. Can we save the DataLoaders’ preprocessing without the data? With the data the object is very large, and I am looking to be able to compete with the sklearn pipeline.

One thing that can be nice but is also a pain is that I can only save the DataLoaders together with the data. Until I have a model, I have no way to take just the preprocessing steps from a recently saved TabularDataLoaders. (I think this is why I am asking :slight_smile:)

This might be a thing in the DataBlock API, but I am currently in tabular project mode for work.

dl_test = dl_train.test_dl(X_test, with_label=False) # could be true doesn't matter

This is fine when you are going to train and do inference in the same place and have enough RAM to hold both datasets. However, when using a tabular learner, I don’t believe the training data is available (as I write this, maybe it is, but I don’t think so).

learn_inf = load_learner(os.path.join(model_path, yaml.get('process_name') + yaml.get('dl_model_suffix')),
                             cpu=True)
test_dl = learn_inf.dls.test_dl(df_test, with_label=False)

Even though the fastai model is a little bigger than a typical model such as an XGB model, that is completely okay for the functionality it gives me.

Do you know of a way to do that when the DataLoaders is built and saved like this?

dl_train = TabularDataLoaders.from_df(df_transform, procs=procs,
                                      cat_names=cat_vars, cont_names=cont_vars,
                                      y_names=0, y_block=y_block,
                                      valid_idx=splits[1], bs=bs)
os.makedirs(p, exist_ok=True)  # create the output directory if it does not exist
logging.info(f'{fn} getting saved to {p}')
file_path = os.path.join(p, f"{process_name}_{fn}.pkl")
torch.save(dl_train, file_path)  # pickles the whole DataLoaders object
logging.info(f'file saved to {file_path}')

Rather than saving the entire dataset in the DataLoaders, is there a way to pop out the data, so that this works like an sklearn pipeline that I can then use as above without the memory overhead and without moving a large object around?

What’s the issue with the code below? It doesn’t have the data but it does store all the preprocessing needed to transform your data based on what was trained on.

Otherwise maybe this is what you want?

@muellerzr

The code works perfectly.

The problem (or not even a problem, more like a feature request) is that I don’t want the data to be saved, because there are times when I want to use the fastai data processing on a test set for an XGB model, for example.

Currently, if I have a test set without a fastai Learner model, I have to use dl_train to preprocess the test set. Meaning that I have to bring in the

which is a lot larger than a sklearn pipeline

This is strictly due to the fact that poc_dl_train.pkl contains the training data, but I would like to save just the preprocessing steps it used and apply them to a new training set.

ie:

You should use TabularPandas for inference then (see the article I posted), as it works at just the TabularPandas level, not at the DataLoaders level.

IE:

from wwf.tab.export import *
learn.dls.train.dataset.export('export.pkl')
###
to = load_pandas('export.pkl')
preprocess = to.new(mydf)

@muellerzr

This assumes we have all the data in one place. Going from the DataLoaders to pandas is simple with:

dl_train.train.xs
dl_train.valid.xs
etc etc

The actual issue is that we have a new test set. Let’s say it’s the Bulldozers project and we got new bulldozers to evaluate.

We have a dataset that has never been seen, and we ended up going with XGBoost over the fastai NN because it was better for this dataset.

We have used fastai to preprocess, which means we have the training set (dl_train), a DataLoaders that has everything needed to prepare the new test set. The issue isn’t that it doesn’t work. I would just like to make it lighter by saving only the rules it needs to preprocess the data, and then get the result back out as a pd.DataFrame.

I could be wrong, but it looks like you are exporting a dataset. I don’t want the dataset; I want the rules to create the dataset, without the training data making the object huge (how huge varies with dataset size).

The reason this is an issue is that if this were sitting in a small kube cluster, the training data and test data might not fit, for example. There are many ways around this, but this would be a simple solution. I do have a workaround, which is to use an sklearn pipeline or a Spark pipeline for the preprocessing, but I like fastai’s defaults a lot better than the defaults of those pipelines.
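
To make the request concrete, here is a minimal sketch of "rules without data" in plain pandas. It is not fastai's API: `fit_rules` and `apply_rules` are hypothetical helpers, the columns are invented, and the choices of median fill, mean/std scaling, and code 0 for unseen categories are assumptions made for this illustration.

```python
import pickle
import numpy as np
import pandas as pd

def fit_rules(df, cat_names, cont_names):
    "Hypothetical helper: collect per-column statistics only, never the rows."
    medians = df[cont_names].median().to_dict()
    filled = df[cont_names].fillna(medians)
    return {
        "vocab": {c: sorted(df[c].dropna().unique()) for c in cat_names},
        "medians": medians,
        "means": filled.mean().to_dict(),
        "stds": (filled.std(ddof=0) + 1e-7).to_dict(),
    }

def apply_rules(rules, df):
    "Hypothetical helper: encode a new frame with previously fitted rules."
    out = pd.DataFrame(index=df.index)
    for c, vocab in rules["vocab"].items():
        # Code 0 is reserved for missing or unseen categories.
        out[c] = pd.Categorical(df[c], categories=vocab).codes.astype(int) + 1
    for c, mean in rules["means"].items():
        filled = df[c].fillna(rules["medians"][c])
        out[c] = (filled - mean) / rules["stds"][c]
    return out

train = pd.DataFrame({"maker": ["cat", "deere", "cat"], "hours": [100.0, np.nan, 300.0]})
rules = fit_rules(train, ["maker"], ["hours"])
blob = pickle.dumps(rules)   # holds vocabularies and statistics, no training rows

new = pd.DataFrame({"maker": ["deere", "volvo"], "hours": [200.0, np.nan]})
encoded = apply_rules(pickle.loads(blob), new)
print(encoded)
```

The pickled object grows with the number of columns and distinct categories rather than with the number of training rows, which is the property being asked for.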

btw https://walkwithfastai.com/tab.stats is awesome