Developer chat

Hi @sgugger, I believe that the line `self.create_func = open_image` overrides whatever you set as the argument for `create_func`?

class ImageItemList(ItemList):
    _bunch = ImageDataBunch

    def __post_init__(self):
        super().__post_init__()
        self.sizes = {}
        self.create_func = open_image

To make it use my own, I have to set:
vision.data.open_image = my_own_open_image
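A minimal sketch of why this happens (toy Python, not fastai's actual classes): `__post_init__` runs after construction and rebinds `create_func` to the module-level `open_image`, so rebinding that module attribute is the only way to change the behavior.

```python
def open_image(path):
    return "default:" + path

class ImageItemList:
    def __init__(self, create_func=None):
        self.create_func = create_func
        self.__post_init__()

    def __post_init__(self):
        # mirrors the line in question: clobbers whatever was passed in
        self.create_func = open_image

def my_own_open_image(path):
    return "custom:" + path

# Passing create_func has no effect: __post_init__ overwrites it.
il = ImageItemList(create_func=my_own_open_image)
print(il.create_func("x.png"))  # default:x.png

# Rebinding the module-level name (the vision.data.open_image trick) does work,
# because __post_init__ looks the name up at call time.
open_image = my_own_open_image
il2 = ImageItemList()
print(il2.create_func("x.png"))  # custom:x.png
```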

You can still use your own datasets and pass them to DataBunch.create, that hasn’t changed.

The data block API now separates the inputs and the outputs into two blocks, because it’s more flexible this way. One block of output (like classification) can be used directly with multiple blocks of inputs (images, texts, tabular lines, etc.).


Looks like there is a mistake there, will dig into this at some point today.

Okay, thanks. Does that mean DataBunch.create will not be deprecated at some point? I had understood that all the old methods would go away eventually.


No, the current factory methods will stay (as they are useful for beginners), and DataBunch.create is what we use behind the scenes whenever we build a databunch, so that one will stay too.


spaCy is by far the biggest library dependency in fastai: around 1 GB. For comparison, torch is about 250 MB.
It seems that we use it mostly for training; is it possible to somehow avoid loading it when we only want/need to predict?

In our study group we wanted to deploy our language model on AWS Lambda, but there is a limit on code size, so we could not use fastai and had to use torch directly.

copied from: https://forums.fast.ai/t/lesson-4-advanced-discussion/30319/19?u=fredguth

There are two misspellings in the doc https://docs.fast.ai/data_block.html#Invisible-step:-preprocessing : “vlaidation” and “isntance”.

Feel free to open a PR to fix them :wink:

Regression is here. Whatever your application, you can now easily get your data ready for regression by

  • doing nothing, if your target is just a 1-dimensional array of floats, since the API should detect it automatically
  • forcing it with label_cls = FloatList when you call your label_from_*** method
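The detection rule can be sketched in a few lines of plain Python (a toy illustration; `infer_label_cls` and the `task` attributes are made up here, not fastai's internals): an explicit label_cls always wins, otherwise all-float targets are treated as regression.

```python
class CategoryList:
    task = "classification"

class FloatList:
    task = "regression"

def infer_label_cls(labels, label_cls=None):
    # an explicit label_cls always wins (the "forcing it" case above)
    if label_cls is not None:
        return label_cls
    # a 1-D array of floats is taken to mean regression
    if all(isinstance(lbl, float) for lbl in labels):
        return FloatList
    return CategoryList

print(infer_label_cls([0.2, 1.5]).task)                    # regression
print(infer_label_cls(["cat", "dog"]).task)                # classification
print(infer_label_cls(["0.2"], label_cls=FloatList).task)  # regression
```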

I implemented the Jupyter notebook experiments module to maximize memory utilization, as discussed here.

Please have a look: https://github.com/stas00/ipyexperiments

Your feedback is welcome; if you have some, please post it in this thread.

Thank you.


Okay, I only have to change the file in docs_src and you take care of the conversion, right?

This is great!

Yes indeed.

Small breaking changes:

  • removed TextFilesList, replacing it with TextList like everywhere else.
  • put col everywhere there was a col or cols argument in the data block API.

While you’re changing the API, perhaps these could be normalized?

def language_model_learner(data:DataBunch, bptt:int=70, emb_sz:int=400, nh:int=1150, nl:int=3, pad_token:int=1,
def text_classifier_learner(data:DataBunch, bptt:int=70, max_len:int=70*20, emb_sz:int=400, nh:int=1150, nl:int=3,
def get_tabular_learner(data:DataBunch, layers:Collection[int], emb_szs:Dict[str,int]=None, metrics=None,
def get_collab_learner(ratings:DataFrame, n_factors:int, pct_val:float=0.2, user_name:Optional[str]=None,

have get_ everywhere, or nowhere?

Also, the first two could have their argument positions synced: text_classifier_learner injects max_len before other arguments; it could probably go after them, to keep the signatures similar.

and then we have:

def create_cnn(data:DataBunch, arch:Callable, cut:Union[int,Callable]=None, pretrained:bool=True,

It also returns a learner object, but the name follows a completely different pattern. get_cnn_learner?

And this one has no action verb (get/create) in its name:

def simple_cnn(actns:Collection[int], kernel_szs:Collection[int]=None,

and we use ‘get_’ in:

def get_embedding(ni:int,nf:int) -> nn.Module:

Nice, that was my first PR ever :smile:


Here is another questionable API:

def series2cat(df:DataFrame, *col_names):

it edits the DataFrame in place and returns nothing. Should it be named series2cat_ instead?

That one will very likely disappear once I’ve refactored collab as it’s only used there.

Awesome. And when you do, could you also replace *col_names with a normal list argument, so one could pass a list without needing to expand it with *cols? Thank you.
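The difference between the two signatures, sketched with plain dicts instead of a DataFrame (toy code; the `_star`/`_list` names are invented just for the comparison):

```python
# current style: caller must unpack a list with *
def series2cat_star(df, *col_names):
    for name in col_names:
        df[name] = ("category", df[name])

# proposed style: take the list directly
def series2cat_list(df, col_names):
    for name in col_names:
        df[name] = ("category", df[name])

cols = ["a", "b"]
df1 = {"a": 1, "b": 2}
df2 = {"a": 1, "b": 2}
series2cat_star(df1, *cols)   # caller has to expand with *cols
series2cat_list(df2, cols)    # caller passes the list as-is
print(df1 == df2)  # True
```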

After discussing with Jeremy, I changed the names of the arguments in the data block API again, from col to cols when you can pass one or more columns. If you can only pass one, the name is col; if you can pass several, the name is cols. Example:

data_clas = (TextList.from_csv(imdb, 'texts.csv', cols='text')
                     .split_from_df(col='is_valid')
                     .label_from_df(cols='label'))

In the first and last functions, you can pass multiple columns (if you have multiple text fields or multiple labels), but in the second one, you can only pass one column.
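A toy sketch of that convention (plain Python, not fastai's implementation): col rejects anything but a single name, while cols normalizes a single name into a list.

```python
def split_from_df(col):
    # col: exactly one column name
    if not isinstance(col, str):
        raise TypeError("col takes a single column name")
    return [col]

def label_from_df(cols):
    # cols: one name or several; a single string is normalized to a list
    return [cols] if isinstance(cols, str) else list(cols)

print(split_from_df("is_valid"))        # ['is_valid']
print(label_from_df("label"))           # ['label']
print(label_from_df(["lbl1", "lbl2"]))  # ['lbl1', 'lbl2']
```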
