Fastai v2 chat

FOMO: will I miss out on certain topics if I take the 2020 course instead of https://course18.fast.ai/ml?

I see the tables of contents are different: one teaches ML (I guess), while the 2020 version teaches Deep Learning.

I want to learn everything that’s available in fast.ai, but I am not strong in math & Python. Is it recommended that I finish the 2020 version first, then go to https://course18.fast.ai/ml?

Thanks

Not sure where to post this (or how to open a new thread for it), but I’m seeing some weird behavior:

I get an AttributeError (see below) whenever I run the nbdev tests on 04_data.external.ipynb.

To pass the test successfully I am using the following workaround:

  • replace __file__ = str(NbdevConfig().path("lib_path")/'data'/'external.py')
  • with __file__ = str(NbdevConfig().lib_path/'data'/'external.py')

Am I missing something in my setup in order to call path("lib_path") as originally intended?

Below is the error I get:
---------------------------------------------------------------------------

~/opt/anaconda3/lib/python3.8/site-packages/nbdev/imports.py in __getattr__(self, k)
     46 
     47     def __getattr__(self,k):
---> 48         if k=='d' or k not in self.d: raise AttributeError(k)
     49         return self.config_file.parent/self.d[k] if k.endswith('_path') else self.d[k]
     50 

AttributeError: path
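
A guess at what is going on: the traceback suggests the installed config object has no `path` attribute at all, so `path("lib_path")` falls through to `__getattr__` and raises. The sketch below reproduces that with a toy class (`ToyConfig` is a hypothetical stand-in that copies the `__getattr__` from the traceback, not the real nbdev class) and shows a helper, `get_lib_path` (also hypothetical), that works with either style of access.

```python
from pathlib import Path

class ToyConfig:
    "Hypothetical stand-in that mimics the `__getattr__` shown in the traceback."
    def __init__(self, config_file, d):
        self.config_file = Path(config_file)
        self.d = d

    def __getattr__(self, k):
        # Only called when normal attribute lookup fails, e.g. for `path`
        if k == 'd' or k not in self.d: raise AttributeError(k)
        return self.config_file.parent/self.d[k] if k.endswith('_path') else self.d[k]

def get_lib_path(cfg):
    "Use `cfg.path('lib_path')` when a `path` method exists, else fall back to `cfg.lib_path`."
    path_method = getattr(cfg, 'path', None)  # the default swallows the AttributeError
    if callable(path_method): return path_method('lib_path')
    return cfg.lib_path

# Assumed example values, for illustration only
cfg = ToyConfig('/repo/settings.ini', {'lib_path': 'fastai'})
external_py = get_lib_path(cfg)/'data'/'external.py'
```

If that is indeed the cause, the workaround above is harmless, and matching the nbdev version to the one the notebooks were written against would probably remove the need for it.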

This is a duplicate question and semi-long, so I won’t post here again if it’s too long for this chat. I have also asked Zach about this on the forum, but I know he is in school and can get busy.

I am currently in the middle of an attempt to get my colleagues to fall in love with fastai and nbdev, but it’s an ongoing battle.

The question on the Dataloaders:

  1. Can we save the dataloader’s preprocessing without the data? With the data the object is very large, and I’d like to be able to compete with the sklearn pipeline.

One thing that is nice but also a pain: I can only save the DataLoaders together with the data. Until I have a model, I can’t extract the preprocessing steps from a recently saved TabularDataLoaders. (I think this is why I am asking :slight_smile:)

This might be a thing in the DataBlock API, but I am currently in tabular project mode for work.

dl_test = dl_train.test_dl(X_test, with_label=False) # could be true doesn't matter

This is fine when you train and do inference in the same place and have enough RAM to hold both datasets. However, when using a tabular learner I don’t believe the training data is available (as I write this I wonder if maybe it is, but I don’t think so).

learn_inf = load_learner(os.path.join(model_path, yaml.get('process_name') + yaml.get('dl_model_suffix')),
                             cpu=True)
test_dl = learn_inf.dls.test_dl(df_test, with_label=False)

Even though the fastai model is a little bigger than a typical model such as an XGB model, that is completely okay given the functionality it gives me.

Do you know of a way to avoid this when saving like so:

dl_train = (TabularDataLoaders.from_df(df_transform, procs=procs,
                                       cat_names=cat_vars, cont_names=cont_vars,
                                       y_names=0, y_block=y_block,
                                       valid_idx=splits[1], bs=bs))
# Create the output directory if it does not exist yet
os.makedirs(p, exist_ok=True)
logging.info(f'{fn} getting saved to {p}')
file_path = os.path.join(p, f"{process_name}_{fn}.pkl")
# Save the DataLoaders object (assumed to be dl_train; this includes the training data)
torch.save(dl_train, file_path)
logging.info(f'file saved to {file_path}')

Rather than saving the entire dataset inside the DataLoaders, is there a way to pop out the data so that what gets saved behaves like a sklearn pipeline: something that can apply the steps above without the memory overhead and without moving a large object around?

What’s the issue with the code below? It doesn’t have the data but it does store all the preprocessing needed to transform your data based on what was trained on.

Otherwise maybe this is what you want?

@muellerzr

The code works perfectly.

The problem (or not even a problem, more like a feature request) is that I don’t want the data to be saved, because there are times when I want to use the fastai data processing on a test set for an XGB model, for example.

Currently, if I have a test set without a fastai Learner, I have to use dl_train to preprocess the test set, meaning that I have to bring in the

which is a lot larger than a sklearn pipeline

This is strictly because poc_dl_train.pkl contains the training data, but I would like to save just the preprocessing steps it used and apply them to a new training set.

ie:

You should use TabularPandas for inference then (see the article I posted), as it works at the TabularPandas level, not at the DataLoaders level.

IE:

from wwf.tab.export import *
learn.dls.train.dataset.export('export.pkl')
###
to = load_pandas('export.pkl')
preprocess = to.new(mydf)
preprocess.process()  # apply the stored procs to the new DataFrame
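
To illustrate the general idea of "keep the rules, drop the data", here is a toy sketch in plain Python. It is not fastai's implementation: `fit_rules` and `apply_rules` are hypothetical helpers, and the `'#na#'` placeholder for unseen categories is just a convention chosen for this example. Only the fitted vocabularies and statistics get pickled, so the saved object stays small no matter how large the training set was.

```python
import pickle
from statistics import mean, pstdev

def fit_rules(rows, cat_names, cont_names):
    "Learn category vocabularies and normalization statistics from training rows."
    rules = {'classes': {}, 'stats': {}}
    for c in cat_names:
        # Index 0 is reserved for unseen or missing categories
        rules['classes'][c] = ['#na#'] + sorted({r[c] for r in rows})
    for c in cont_names:
        vals = [r[c] for r in rows]
        rules['stats'][c] = (mean(vals), pstdev(vals) or 1.0)
    return rules

def apply_rules(rows, rules):
    "Encode categories and normalize continuous columns using previously fitted rules."
    out = []
    for r in rows:
        new = dict(r)
        for c, vocab in rules['classes'].items():
            new[c] = vocab.index(r[c]) if r[c] in vocab else 0
        for c, (m, s) in rules['stats'].items():
            new[c] = (r[c] - m) / s
        out.append(new)
    return out

# Made-up rows, for illustration only
train = [{'brand': 'cat', 'hours': 10.0}, {'brand': 'deere', 'hours': 30.0}]
rules = fit_rules(train, ['brand'], ['hours'])

# Only the rules are serialized, never the training rows
restored = pickle.loads(pickle.dumps(rules))
new_rows = apply_rules([{'brand': 'volvo', 'hours': 20.0}], restored)
```

The processed rows can then go to any model, for example XGBoost, without the training data ever being loaded.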

@muellerzr

This assumes we have all the data in one place. Moving from the DataLoaders to pandas is simple with

dl_train.train.xs
dl_train.valid.xs
etc etc

The actual issue is that we have a new test set. Let’s say it’s the Bulldozer project and we got new bulldozers to evaluate.

We have a dataset that has never been seen, and we ended up going with XGBoost over the fastai NN because it was better for this dataset.

We have used fastai to preprocess, which means we have the training set (dl_train), a DataLoaders object that has everything needed to prepare the new test set. The issue isn’t that it doesn’t work. I would just like to make it lighter by saving only the rules it needs to preprocess the data, and then get the result back out as a pd.DataFrame.

I could be wrong, but it looks like you are exporting a dataset. I don’t want the dataset; I want the rules to create the dataset, without the training data making the object huge (how huge varies with dataset size).

This is an issue because if this were sitting in a small kube cluster, the training data and test data might not fit together, for example. There are many ways around this, but this would be a simple solution. The solution I have now is to use a sklearn pipeline or Spark pipeline for the preprocessing, but I like fastai’s defaults a lot better than the defaults of those pipelines.

btw https://walkwithfastai.com/tab.stats is awesome

That’s exactly what that’s doing… that’s also what learn.export does… none of the original data is saved and used on the DataLoader…

If it’s not, what version of fastai are you using? (as this was an issue that got fixed)

Ahh… This is awesome! Love that patch. It’s freaking amazing how flexible fastai can be. This worked like a charm, and I will continue developing with it. Thank you @muellerzr

I don’t know if I should ask this here.

But I’m getting the error below even though I’m not passing my y variable in the cat_names parameter.

    from fastai.tabular.all import *
    input_df = input_df.astype('category')
    columns = input_df.columns
    features = columns.drop('cohort_flag')
    dls = TabularDataLoaders.from_df(input_df, y_names="cohort_flag",
        cat_names = list(features), procs = [Normalize], bs=32)

    learn = tabular_learner(dls, metrics=accuracy)

And the full stack trace is below:

   AttributeError                            Traceback (most recent call last)

<ipython-input-60-4b53f3a1cac0> in <module>()
      6     cat_names = list(features), procs = [Normalize], bs=32)
      7 
----> 8 learn = tabular_learner(dls, metrics=accuracy)

/usr/local/lib/python3.7/dist-packages/fastai/tabular/learner.py in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, **kwargs)
     27     if layers is None: layers = [200,100]
     28     to = dls.train_ds
---> 29     emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
     30     if n_out is None: n_out = get_c(dls)
     31     assert n_out, "`n_out` is not defined, and could not be inferred from data, set `dls.c` or pass `n_out`"

/usr/local/lib/python3.7/dist-packages/fastai/tabular/model.py in get_emb_sz(to, sz_dict)
     23 def get_emb_sz(to, sz_dict=None):
     24     "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
---> 25     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
     26 
     27 # Cell

/usr/local/lib/python3.7/dist-packages/fastai/tabular/model.py in <listcomp>(.0)
     23 def get_emb_sz(to, sz_dict=None):
     24     "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
---> 25     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
     26 
     27 # Cell

/usr/local/lib/python3.7/dist-packages/fastcore/basics.py in __getattr__(self, k)
    386         if self._component_attr_filter(k):
    387             attr = getattr(self,self._default,None)
--> 388             if attr is not None: return getattr(attr,k)
    389         raise AttributeError(k)
    390     def __dir__(self): return custom_dir(self,self._dir())

/usr/local/lib/python3.7/dist-packages/fastcore/transform.py in __getattr__(self, k)
    202     def __getitem__(self,i): return self.fs[i]
    203     def __setstate__(self,data): self.__dict__.update(data)
--> 204     def __getattr__(self,k): return gather_attrs(self, k, 'fs')
    205     def __dir__(self): return super().__dir__() + gather_attr_names(self, 'fs')
    206 

/usr/local/lib/python3.7/dist-packages/fastcore/transform.py in gather_attrs(o, k, nm)
    163     att = getattr(o,nm)
    164     res = [t for t in att.attrgot(k) if t is not None]
--> 165     if not res: raise AttributeError(k)
    166     return res[0] if len(res)==1 else L(res)
    167 

AttributeError: classes

You need to pass Categorify in your procs; Normalize is only for continuous variables. You may need FillMissing as well. IE:

procs = [Categorify, FillMissing, Normalize]

@muellerzr Thanks for your help ! It worked !

Hi! A year later I’ve come across the same issue :sweat_smile:
Did you ever find time for this project? Or are you aware of anyone else doing it? Thanks anyway!

I never wound up getting to it sadly, but if you make a post on it we can work through your issues :smiley:

For the moment I’ve found this:

I’m still reading through the details, but at least it contains relevant stuff like SentencePiece tokenization, and the reported metrics are really good, so I’ll definitely give this a try.

Thanks for sharing that, @florianl !

I am trying to train a segmentation algorithm with fastai. I have training and validation data in separate folders, so I was planning on using GrandparentSplitter(), but for some reason the validation set is empty.

My files are organised as below:

Path ---> train ---> images
                ---> masks
     ---> valid ---> images
                ---> masks

And this is how I set up my datablock and dataloader:

codes = np.array(['background', 'prostate'])

def label_func(x): return path/'train/masks'/f'{x.stem}_mask.png'

db = DataBlock(blocks=(ImageBlock(), MaskBlock(codes)),
              splitter=GrandparentSplitter(train_name='train', valid_name='valid'),
              get_items=get_image_files,
              get_y=label_func)

dls = db.dataloaders(path/'train/images', bs=1)
dls.show_batch()

I am assuming there is something wrong with how I organised the files.

I guess you didn’t pass the ‘valid’ data into the dataloaders… I don’t know the full answer either, but from the code I can see that dls only has access to images from the train folder.
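
To make that concrete, here is a toy sketch of grandparent-based splitting using only `pathlib`. It is not fastai's code: `grandparent_split` is a hypothetical helper and the file names are made up. When every item comes from `train/images`, every grandparent folder is `train`, so the validation list ends up empty. The adjusted `label_func` is an assumption about the intended layout: it looks for the mask under the image's own split folder instead of always under `train/masks`.

```python
from pathlib import PurePosixPath

def grandparent_split(files, train_name='train', valid_name='valid'):
    "Toy grandparent-based split: group item indices by the folder name two levels up."
    train = [i for i, f in enumerate(files) if f.parent.parent.name == train_name]
    valid = [i for i, f in enumerate(files) if f.parent.parent.name == valid_name]
    return train, valid

def label_func(x):
    "Find the mask under the image's own split folder (train or valid)."
    return x.parent.parent/'masks'/f'{x.stem}_mask.png'

root = PurePosixPath('data')
only_train = [root/'train/images/a.png', root/'train/images/b.png']
both = only_train + [root/'valid/images/c.png']

split_train_only = grandparent_split(only_train)  # validation indices come back empty
split_both = grandparent_split(both)              # validation indices now include the valid image
mask_for_valid = label_func(both[2])
```

In the DataBlock, this would presumably mean passing `path` itself to `db.dataloaders` and using a `get_items` that keeps only the files under the `images` folders, so that the masks are not picked up as inputs.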

Greetings to all code warriors! I’d like to change the metric when training a binary image classification model from accuracy to false negative rate. I wonder if you could help me work out how to report the number of false negatives after each epoch as a metric. I read through the documentation on Metrics/Callbacks, but it wasn’t much help.
Cheers :slight_smile:
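
For reference, the false negative rate is FN / (FN + TP), which is 1 minus recall. A minimal sketch in plain Python follows. `false_negative_rate` is a hypothetical helper that assumes hard 0/1 predictions with 1 as the positive class, and the sample values are made up.

```python
def false_negative_rate(preds, targs, positive=1):
    "FN / (FN + TP): the share of actual positives that were predicted as negative."
    fn = sum(1 for p, t in zip(preds, targs) if t == positive and p != positive)
    tp = sum(1 for p, t in zip(preds, targs) if t == positive and p == positive)
    # With no actual positives the rate is undefined; return 0.0 by convention here
    return fn / (fn + tp) if (fn + tp) else 0.0

preds = [1, 0, 0, 1, 0]   # predicted classes
targs = [1, 1, 0, 1, 1]   # true classes: four positives, two of them missed
fnr = false_negative_rate(preds, targs)
```

As far as I know, fastai averages plain function metrics over batches, which is not quite right for a rate like this. Computing it over the whole validation set (for example by wrapping the function in `AccumMetric` with an argmax over the class dimension, or by reporting `Recall()` and subtracting it from 1) is probably the safer route, but please check the metrics docs for the exact arguments.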