Abbreviations for train/validation/test variables

You’re correct about how they were passed as positional arguments in the notebooks:

"learn.fit(lr, 3, cycle_len=1, cycle_mult=2)"
"learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)"

but internally the named argument is lrs - perhaps the new API can be consistent?

v0:

def get_layer_opt(self, lrs, wds):
def fit(self, lrs, n_cycle, wds=None, **kwargs):

yet in v1 we currently have:

def fit(self, epochs, lr, opt_fn=optim.SGD):

I’m talking about s/lr/lrs/ in v1. I hope we are on the same page now.
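For illustration, here’s a rough sketch of what a consistent v1-style signature could look like - purely hypothetical, not the actual fastai code; the scalar-to-list normalization step is my assumption:

    from typing import Sequence, Union
    import torch.optim as optim

    # hypothetical sketch only -- not the actual fastai v1 signature
    def fit(self, epochs: int, lrs: Union[float, Sequence[float]],
            opt_fn=optim.SGD):
        # also accept a scalar lr, normalizing it to a one-element list so
        # the rest of the code can assume per-layer-group rates
        if not isinstance(lrs, (list, tuple)): lrs = [lrs]
        ...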

Hmm, this forum software is not super-friendly for this kind of parallel multi-issue discussion. Is there a better workflow to follow? It’s so much easier to do that over email with automatic quoting.

You can actually just prefix lines with > just like with email, and it turns them into a quote - that’s what I’ve done above. Or while composing you can select something from a previous message, and a ‘quote’ button pops up, like so:

Yes, that I figured out :slight_smile: but when quoting it doesn’t respect the hierarchy of quotes, flattening them all to the same level, making it impossible to distinguish who said what, so I have to go and re-add >>. Not user-friendly at all.

Perhaps it’s ok if we continue using x in very short lambdas. Or perhaps l for lambda?

OK I’m back to using o then :slight_smile:

Did you mean back to using x, and not o?

Another alternative: adding some short prefix, e.g. img_class? It also makes the name more specific.

Much better! Actually maybe we should stop calling them ‘classes’ and start calling them ‘categories’ - which is quite naturally then cat and cats. I think in v0 I might have used these terms interchangeably…

That’s even better. I wasn’t sure whether ‘categories’ was already taken by cat_vars. So can you be more specific, Jeremy? Do you suggest:

s/cat_vars/category_vars/
s/cats/categories/
s/cat/category/
s#folder/cls#folder/category#

Yes? So for example in nb_002.py it’d appear as:

class FilesDataset(Dataset):
    def __init__(self, folder, categories):
        self.fns, self.y = [], []
        self.categories = categories
        for i, category in enumerate(categories):
            # gather the files under each category's subfolder, labelling
            # them with that category's index
            fnames = get_image_files(folder/category)
            self.fns += fnames
            self.y += [i] * len(fnames)
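And hypothetical usage might look like this (the images/ path and category names are made up for the example):

    from pathlib import Path

    # assumes a layout like images/train/<category>/<file>.jpg on disk
    train_ds = FilesDataset(Path('images/train'), ['airplane', 'bird'])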

Continuing this train of thought, cat_vars should really be cat_cols (or, in the new naming, category_cols), as they are columns in the dataframe and not really variables. Thoughts?

And if so, expanding further:

dep_col
category_cols
contin_cols

Perhaps a more rounded word for cont/contin?
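To make the proposal concrete, here’s a made-up toy illustration of those names (the columns and dataframe are invented for the example):

    import pandas as pd

    # made-up example purely to illustrate the proposed names
    dep_col = 'price'                   # dependent column
    category_cols = ['color', 'size']   # categorical columns
    contin_cols = ['weight', 'height']  # continuous columns

    df = pd.DataFrame({'price': [1.0, 2.0], 'color': ['red', 'blue'],
                       'size': ['S', 'M'], 'weight': [2.3, 3.1],
                       'height': [4.5, 5.0]})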

In v1 we don’t have differential lrs at the moment, which is why it’s written lr for now. I don’t know yet how we will deal with differential learning rates, so we will see if that lr becomes lrs or not :wink:


@stas I mean cat not category. But thinking about it more, that’s a problem because “cat” could mean “category” or it could mean “categorical”, and they’re really different things that are likely to appear in the same method, so that’ll be confusing! So I think we should say for cl in classes after all…

I really did mean o for lambdas, since we could well have situations where we have a tensor in the outer scope called x - and I try to never have anything in the outer scope called o.

BTW, I’m not sure I’ve seen anyone else on the forum using multi-level quoting. Personally I don’t find it that necessary, because when you quote with the UI (i.e not just with >) then you get a hyperlink back to the original post, so it’s easy to see the whole context that way. I use that a lot to navigate threads that I haven’t been previously involved in.

> I mean cat not category. But thinking about it more, that’s a problem because “cat” could mean “category” or it could mean “categorical”, and they’re really different things that are likely to appear in the same method, so that’ll be confusing! So I think we should say for cl in classes after all…

OK, so cl/classes for classes/categories,
and cat reserved exclusively for categorical.

> BTW, I’m not sure I’ve seen anyone else on the forum using multi-level quoting. Personally I don’t find it that necessary, because when you quote with the UI (i.e not just with >) then you get a hyperlink back to the original post, so it’s easy to see the whole context that way. I use that a lot to navigate threads that I haven’t been previously involved in.

I’m old school: keeping relevant context the way it was done 20 years ago is way more efficient, and it requires users to think a little to keep what’s important and trim what’s not. Skipping back and forth between a mix of flattened messages on various topics is so inefficient. But oh well, the lazy-manager reply-on-top email style won, the geeks lost. I’m fine with the new thing.

At the very least the quoting feature could keep the quoted text’s markdown intact, yet it removes all markdown, and you have to put it back or accept less clear communication.

So here is the summary of what has been discussed (agreed on?) so far:

1) data 

prefixes:

train
valid
test

suffixes:

(no suffix)  DataBunch object
df           DataFrame
ds           Dataset
dl           DataLoader

2) tensors

x     generic parameter name for tensors (forward(x) in nn.Module)
indep independent variable tensor
dep   dependent variable tensor 

3) loops (see the sketch after this summary)

b     batch (from a dataloader)
xb    x parts of the batch
yb    y parts of the batch

4) lambdas

o     lambda arg

5) pandas

dep_col   name of the dependent column passed to proc_df
cat_col   single categorical column
cat_cols  multiple categorical columns   

6) classes

cl       single class/category (cls and class are reserved)
classes  list of classes/categories

7) categorical vars
   
cat   single categorical var
cats  list of categorical vars
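To illustrate several of these conventions together, here’s a minimal runnable sketch - not fastai code; the model and data are made up:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # train_* prefix + ds/dl suffixes; xb/yb as batch parts; o as lambda arg
    train_ds = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
    train_dl = DataLoader(train_ds, batch_size=8)
    model = nn.Linear(4, 2)

    for xb, yb in train_dl:
        loss = nn.functional.cross_entropy(model(xb), yb)

    names = sorted(['b.jpg', 'a.jpg'], key=lambda o: o.lower())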

If I missed anything, please let me know.

Once it’s confirmed/agreed on, I can merge it into abbr.md.


FYI me pressing “like” on that means “confirmed” :wink:


Thank you for clarifying that, Jeremy.

I assume by ‘var’ here you mean a Pandas Series. In which case I imagine we’ll be using cat_col and cat_cols. Although it’ll be a while before we get to Pandas stuff so this might change.

Right, I adjusted the summary @ Abbreviations for train/validation/test variables. Thank you, @jeremy

Instead of def normalize(mean, std, x) and def denorm(), either normalize/denormalize or norm/denorm would be nice.


@stas one problem I’ve noticed in the new notebooks is that I’m sometimes using ‘x’ for a tensor in a transform, and sometimes ‘img’. Really these should all be ‘img’ if they’re specifically transforms for images - I think it’s helpful to know what a tensor represents, where that’s possible.
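For example, under that convention a transform sketch might look like this (hypothetical, not actual fastai code):

    import torch

    # hypothetical example of the 'img' naming convention in a transform
    def flip_lr(img: torch.Tensor) -> torch.Tensor:
        "Flip a (channels, height, width) image tensor left to right."
        return img.flip([-1])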

denormalize is better. norm has a specific linear algebra meaning so we shouldn’t use that for normalization.
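A minimal sketch of such a pair, assuming broadcastable mean/std tensors (not the fastai implementation):

    import torch

    def normalize(x: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
        return (x - mean) / std

    def denormalize(x: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
        return x * std + mean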


Will do.

And perhaps a silly question - why are we calling tensors x in the first place? Why not t or tens or tensor?

I think they are the math symbols used in the equations from which DL is derived. Usually in equations a bold lowercase x is a vector, an uppercase X is a matrix, and a plain x is a scalar, used as independent variables in math formulas.

As Jeremy mentioned, I like the idea of being explicit about tensors, naming them image, etc.

I like this image from link

Yes, I’m aware of the math. And that’s why I asked the question: x as it’s used in math, f(x), could mean any of the 4 types you presented in the image, not just a tensor (which in math usually stands for 3+ dimensions), or even a scalar (distinguished only by very small variations, such as an arrow above the x, uppercase, etc.). That’s why to me x doesn’t say anything about the inner structure of the variable.

Moreover, when we say tensor in the context of fastai, we mean a pytorch tensor variable. You could just as well have a multi-dimensional numpy structure, and mathematically it’s the same. So how in the code could one tell whether x is a pytorch var or a numpy var?

> So how in the code could one tell whether x is a pytorch var or a numpy var?

I believe this is where type annotations should come in (and I think this was agreed upon as something that should be done… right?):

import numpy as np
import torch

def something(self, x: torch.Tensor, y: np.ndarray): ...

I mean, that’s one way at least. I think it’s probably better than trying to do comments (which quickly go out of sync with the code) or a Hungarian notation thing (which gets messy/overloaded/confusing quickly).

This is definitely where Python gets on my nerves (coming from Java/C#).

Now of course there’s also the dimensionality of the tensors/ndarrays and how to interpret them. That’s not solved by type annotations. Maybe ample comments are the best solution in this case… I certainly don’t like having to run code just to figure that out, though. But that sort of thing is really hard to get a lot of people to do in practice (just like saying “hey, run unit tests!”). In the case of unit tests you can at least make the build fail and engineer away neglect. But comments… that’s another beast.

EDIT Personally… this is where my Java mind comes in and says “make wrapper objects” when you’re passing around things like tensors and ndarrays, so that they have a type associated with them and perhaps provide basic accessors that show how to interpret the tensors in code (rather than relying on comments, for example). But I know that can be considered “over-engineering” because, gasp, there are more classes to deal with. IMHO, though, code is -way- better than convention and comments at communicating what’s -actually- going on.

EDIT 2 To expand on the above idea a bit: tensors/ndarrays are really low level, and it seems like they shouldn’t be the primary thing you have to think about when dealing with functions that talk to each other about high-level things such as “predict”, “display image”, etc. It’s a lot of unnecessary mental clutter, imho.
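A hypothetical sketch of that wrapper idea (all names here are made up):

    from dataclasses import dataclass
    import torch

    # hypothetical wrapper-object sketch: the type documents what the
    # tensor represents and how to read its dimensions
    @dataclass
    class ImageBatch:
        data: torch.Tensor  # expected shape: (batch, channels, height, width)

        @property
        def batch_size(self) -> int:
            return self.data.shape[0]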


heh - I just suggested t in your PR :slight_smile: I kinda hate using x


We’ll be adding type annotations to all params.


Ah! Fantastic!