Developer chat

stas · November 11, 2018, 8:01pm

This post has been moved to Memory, stability & performance of fastai v1

stas · November 11, 2018, 9:31pm

This post has been moved to Memory, stability & performance of fastai v1

Patrick · November 11, 2018, 10:54pm

You’re right! So long as the Dataset returns a list, the fit() and loss_batch() function will work just fine. However, I would like to propose one line of code to change the validate() function.

I’ve created a notebook for this and bundled it alongside a PR.

aman5319 · November 12, 2018, 6:49am

I was reading through the fastai code and I came across the Stepper class

My question is why are we using a class when all we need to do is iterations can’t we use generators in here and it will be beneficial also less code, less use of memory, lazy execution.

wrote a generator which does the same thing

linear_anneal = lambda start,end,pct : start+(end-start)*pct

def stepper(start,end,n_iter):
    n=1
    while n<=n_iter:
          yield linear(start,end,n/n_iter)
          n+=1
step = stepper(1,15,100)  #intialize it like this
step.__next__() #use it like this

fredguth · November 12, 2018, 11:29am

This post has been moved to Memory, stability & performance of fastai v1.

piotr.czapla · November 12, 2018, 3:16pm

moved the post to a separated thread as @stas suggested

stas · November 12, 2018, 6:34pm

Thanks for taking the lead on starting a focused thread based on my earlier posts, @piotr.czapla. I felt that your title was much broader than the very specific-intention my posts had - avoid restarting the kernel all the time. So I renamed it to a more specific: Getting the most out of your GPU RAM in jupyter notebook.

But please don’t let it prevent you from starting a much more important topic on stability and performance of fastai v1.

Thank you.

sgugger · November 13, 2018, 8:57pm

Just merged: huge refactor of the data block API. If you were only using the databunch factory methods, this shouldn’t impact you.
If you were using the data block API note that the calls to dataset, numericalize and tokenize don’t exist anymore and that you now have to split your data before labeling it.
If you were using the internal datasets of fastai… learn how to use the datablock API very quickly because those don’t exist anymore.

The basic idea is that to allow more flexibility, there is no dataset anymore: you explain what are your xs and your ys with the datablock API and that’s it. That way regression (or single classification or multi classification) for computer vision has the same underlying class than for text or tabular.

Update to the docs will follow shortly. Lessons should run smoothly.

shaun1 · November 13, 2018, 11:07pm

Does this mean we now have a way to solve all types of ML problems (classification, multi-classification, regression) for all types of data (vision, text, tabular)?

stas · November 14, 2018, 1:29am

I’m observing that the suggested use of partial functions for metrics leads to misleading results, e.g. in lesson3-planet nb:

acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
learn = create_cnn(data, arch, metrics=[acc_02, f_score])

epoch  train_loss  valid_loss  accuracy_thresh  fbeta

the metrics column names are misleading, because these are not the metrics functions that were used (the defaults are different).

There must be a better way to have the used metrics match the names displayed in the header of the results.

The relevant code is:

def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:
    "About to start learning."
    self.state_dict = _get_init_state()
    self.state_dict['n_epochs'],self.state_dict['pbar'],self.state_dict['metrics'] = epochs,pbar,metrics
    names = [(met.name if hasattr(met, 'name') else camel2snake(met.__class__.__name__)) for met in self.metrics]
    self('train_begin', metrics_names=names)

I see we already have AverageMetric class, so this could be now fixed with a hack:

acc_02 = AverageMetric(partial(accuracy_thresh, thresh=0.2))
acc_02.name = "acc_02"
learn = create_cnn(data, arch, metrics=[acc_02])

now, the metric header is displayed correctly.

epoch  train_loss  valid_loss  acc_02

But perhaps we can add a new wrapper class?

acc_02 = MakeMetric(partial(accuracy_thresh, thresh=0.2), "acc_02")
learn = create_cnn(data, arch, metrics=[acc_02])

I also researched partial() and it’s possible to write a wrapper around partial to inject a name, say under partial_func.__name__ but it won’t be the same as normal functions which also have __class__.__name__ set and this can’t be set in the partial function. So probably, this is not a good approach.

marcmuc · November 14, 2018, 4:30am

Just realized this major change while watching todays lesson.

I think the possibility to easily inject our own Dataset classes via the datablock api was kind of an important feature!? And it was kind of compatible with regular pytorch, so you could reuse dataset classes others had written for pytorch with slight modifications.

So how do I do that now? And what do I do with my modified own Dataset classes?

Kaspar · November 14, 2018, 9:53am

Hi @sgugger i belive that the line. “self.create_func = open_image” overrides whatever you set af argument for createfunc ?

class ImageItemList(ItemList):
_bunch = ImageDataBunch

def __post_init__(self):
    super().__post_init__()
    self.sizes={}
    self.create_func = open_image

to make it use my own i have to set:
vision.data.open_image = my_own_open_image

sgugger · November 14, 2018, 2:02pm

You can still use your own datasets and pass them to DataBunch.create, that hasn’t changed.

The data block API separates in two blocks the inputs and the outputs now, because it’s more flexible this way. One block of output (like classification) can be directly used for multiple blocks of inputs (images, texts, tabular lines etc…).

sgugger · November 14, 2018, 2:02pm

Looks like there is a mistake there, will dig in to this at some point today.

marcmuc · November 14, 2018, 2:09pm

Okay thanks, that means DataBunch.create will not be deprecated at some point? I had understood that all the old methods would go away at some point?!

sgugger · November 14, 2018, 2:26pm

No, the current factory methods will stay (as they are useful for beginners) and DataBunch.create is what we use all the time behind the scene whenever we build a databunch, so that one will stay too.

fredguth · November 14, 2018, 5:53pm

Spacy is by far the biggest lib depencency in fastai… around 1Gb. For comparisson, torch is about 250Mb.
It seems that we use it basically for training, is it possible to somehow prevent loading it when we only want/need to predict?

In our study group we wanted to deploy our language model in AWS Lambda but there is a limit on code size and we had to not use fastai, used torch directly.

copied from: https://forums.fast.ai/t/lesson-4-advanced-discussion/30319/19?u=fredguth

Rvbens · November 15, 2018, 2:49am

There is two misspelling on the doc https://docs.fast.ai/data_block.html#Invisible-step:-preprocessing : “vlaidation” and “isntance”

sgugger · November 15, 2018, 2:56am

Feel free to open a PR to fix them

sgugger · November 15, 2018, 2:57am

Regression is here. Whatever your application, you can now easily get your data ready for regression by

doing nothing if your target is just an array of floats of dimension 1 since the API should detect it automatically
by forcing it with label_cls = FloatList when your call your label_from_*** method