Abbreviations for train/validation/test variables


(Stas Bekman) #1

As mentioned here in the item on vertical alignment of conceptually similar code:

Aim to align statement parts that are conceptually similar. It allows the reader to quickly see how they’re different. E.g. in this code it’s immediately clear that the two parts call the same code with different parameter orders.

if self.store.stretch_dir==0: x = stretch_cv(x, self.store.stretch, 0)
else:                         x = stretch_cv(x, 0, self.store.stretch)

I’m appreciating the abbreviations doc.

I’d like to propose a few changes/additions:

I. that we have vertically aligned abbreviations for anything related to train/validation/test.

Currently the doc has:

train      trn  trn_ds
validation val  val_ds

can we add one more?

test       tst  tst_ds

Also, can we start using similarly structured variable names in the notebooks:

df_val
df_trn
df_tst

(perhaps these could go into the abbr doc too?)

The currently used convention of:

df = ...
df_test = ...

doesn’t align well, and it’s very inconsistent in general.

II. As you can see, we have df_trn and trn_ds - should we be consistent there as well? Probably purpose first and type second?

Should we switch to df_trn and ds_trn, i.e. rename *_ds to ds_* (and change abbr.md)?

But then I see there are many other vars that currently use the type_purpose convention, as in: opt_fn, init_fn, reg_fn. If that’s more readable, fine - then can we change df_* to *_df? So trn_df, tst_df, val_df?

I think one important consideration to take into account here is the use of TAB completion. In the long run, is it better to have the variable start with the purpose: val_[TAB], or the type: df_[TAB], to type less?

p.s. markup rendering messes up code formatting if using 1) 2) items, so I used Roman numerals to separate the issues

Thank you!


(urmas pitsi) #2

By the way, smarter IDEs tab-complete on substrings, making the order of words in variable names irrelevant for this purpose. Most probably this feature will be available for Python in VS Code quite soon, if it isn’t already there.


(WG) #3

I’m personally a fan of *_df nomenclature as it lines up with how we name other data related variables (e.g., trn_ds, trn_dl etc.)


(Stas Bekman) #4

Also I think it’s important to develop consistent naming for data frames in the notebooks, in particular when things like proc_df() are run. proc_df() changes the dfs only slightly on the surface, but the change is significant in practice; it’s easy to mistakenly use the wrong df version of the same data, so that things somewhat work but break somewhere down the road and can be difficult to track down. I have also seen quite a few notebooks where the df gets overwritten by proc_df by assigning back to the same df var.

Currently there is no convention (grepping through the current fastai notebooks):

df, y, nas              = proc_df(df_raw
df, y, nas, mapper      = proc_df(joined_samp
df, y, nas, mapper      = proc_df(df_raw
df_test, _, nas, mapper = proc_df(joined_test
df_trn, y_trn, nas      = proc_df(df_raw
df_trn, y_trn, nas      = proc_df(df_raw
df_trn2, y_trn, nas     = proc_df(df_raw

IMHO, there should be 2 sets of 3 variables with a unique yet familiar name structure:

  • Set 1 for Data Cleanup and making conclusions/predictions (unaffected by processing)
  • Set 2 for ML/DL processing

The naming of the two sets should be clearly distinct from each other to avoid wrong variable picking. So perhaps:

Set 1 (Data Cleanup/Prep + Predictions)

trn_df
val_df (not really needed during cleanup, but good to have for consistency)
tst_df

Set 2 (ML/DL processing)

trn_proc_df
val_proc_df
tst_proc_df

So the workflow would be something like:

trn_df = pd.read_csv(f'{PATH}/train.csv')
tst_df = pd.read_csv(f'{PATH}/test.csv')

... massage *_df

trn_proc_df, y, nas, mapper = proc_df(trn_df, ...
tst_proc_df, _, nas, mapper = proc_df(tst_df, ...
...
md = ColumnarModelData.from_data_frame(PATH, val_idxs, trn_proc_df,..., test_df=tst_proc_df)
or
(val_proc_df, trn_proc_df), (val_y, trn_y) = split_by_idx(val_idx, trn_proc_df.values, y)
md = ColumnarModelData.from_data_frames(PATH, trn_proc_df, val_proc_df, trn_y, val_y,..., test_df=tst_proc_df)

I’m aware that trn_proc_df is somewhat long, but it has a very short locality in the current code.
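To make the raw-vs-processed separation concrete, here is a toy stand-in for proc_df (the name proc_df_sketch, its signature, and its behavior are all hypothetical simplifications, not fastai’s real implementation) showing why keeping trn_df and trn_proc_df as distinct variables helps:

```python
import pandas as pd

def proc_df_sketch(df, y_fld):
    # hypothetical stand-in for fastai's proc_df: split off the target
    # column and fill missing numeric values with the column median.
    df = df.copy()                        # never mutate the raw frame
    y = df.pop(y_fld).values
    nas = {c: df[c].median() for c in df.columns if df[c].isna().any()}
    df = df.fillna(nas)
    return df, y, nas

trn_df = pd.DataFrame({'a': [1.0, None, 3.0], 'price': [10, 20, 30]})
trn_proc_df, y, nas = proc_df_sketch(trn_df, 'price')

assert trn_df.isna().any().any()          # raw frame still has its NA
assert not trn_proc_df.isna().any().any() # processed frame is clean
```

Because the raw and processed frames live in clearly distinct variables, a later cell that accidentally uses trn_df where trn_proc_df was meant fails loudly (the NA and the target column are still there) instead of silently working on the wrong data.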

Of course, I’m new to this, so I trust that the experienced among you already have strong error-proof conventions and perhaps we can make those conventions accessible to all.

Thank you.


#5

It’s a minor thing, but the fact that fastai code base uses three letter abbreviation for train and validation, and then uses all four letters for test causes my OCD to flare up. So, yes, I like trn, val, and tst. Not test.


(William Horton) #6

I feel the opposite way—I think that abbreviating to tst instead of test in order to get better vertical alignment is taking it a little too far. Just write one more character so people understand what it means. When I see tst I read it as an acronym, like T-S-T.


(Jeremy Howard) #7

These are all good points. I think *_df, *_ds, etc. is fine.

Also, in fastai_v1 I’ve moved towards 5 letters instead of 3. So train and valid. If you want test to line up you can just add a trailing space.

If someone wants to be helpful, perhaps go through the 001a and 001b notebooks (and nb*.py) and send a PR that tries to update the naming to be consistent with this approach? I’m still working on 002 and 003 notebooks so probably best not to edit them yet, or we’ll have tricky conflicts.


(Stas Bekman) #8

I can do it, @jeremy.

Though since this thread has floated a lot of possible maybes, could you please clarify which standard you’d like to follow, so that I know what needs to be changed. You clarified on:

train_df
valid_df
test_df

I see that in the 001a and 001b notebooks we could add _df:

s#((x|y)_(train|valid))#$1_df#g (perl regex, don’t know how to express it concisely in python)

, correct?
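For what it’s worth, the perl substitution above translates directly to Python’s re.sub (back-references are \1 instead of $1); a minimal sketch:

```python
import re

# Python equivalent of the perl substitution s#((x|y)_(train|valid))#$1_df#g
src = "x_train = ...; y_valid = ...; x_test = ..."
out = re.sub(r'((x|y)_(train|valid))', r'\1_df', src)
# out == "x_train_df = ...; y_valid_df = ...; x_test = ..."
```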

and in nb*.py, I’m not sure, there are some xb,yb, etc. what would you like the internal vars to be?

edit: also is pytorch 0.4.0 required for _v1? is there a document specifying the version requirements for this project? I will need to test my changes before submitting PR, and currently the notebooks bail on torch 0.3.1


(Stas Bekman) #9

Also, in fastai_v1 I’ve moved towards 5 letters instead of 3. So train and valid. If you want test to line up you can just add a trailing space.

Too bad ML has been deeply conditioned to test, otherwise check or trial are 5-letter words :wink:


(Jeremy Howard) #10

Thanks @stas!

s#((x|y)_(train|valid))#$1_df#g (perl regex, don’t know how to express it concisely in python)

That’s OK, I knew a little perl once :wink:

df is short for ‘data frame’; we don’t have any data frames in this notebook yet (they’re part of pandas, used for tabular data). We currently have ds (dataset) and dl (dataloader), and let’s also standardize on x (independent variable tensor) and y (dependent variable tensor). Without any suffix, I think just train refers to a DataBunch object. Any more we should add?

For loop vals, I was thinking that b in general is a batch (from a dataloader), and then there’s xb and yb for the x and y parts of that batch. I think loop vars should be kept short.
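In practice the loop-variable convention described above might look like this (the batch data here is a made-up stand-in, not a real DataLoader):

```python
# b is a batch from a dataloader; xb and yb are its x and y parts.
# Tuples of tuples stand in for (x, y) minibatches here.
batches = [((0.1, 0.2), (1, 0)), ((0.3, 0.4), (0, 1))]

n_items = 0
for b in batches:
    xb, yb = b            # unpack the batch into its x and y parts
    n_items += len(xb)    # count independent-variable items seen
```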

Yup you need at least pytorch 0.4. I’d suggest creating a new conda env for this project. I’m using pytorch master installed from source, and although 0.4 works for now, there are some new features coming soon to pytorch master that we’ll be using. Our goal is to stay current with master so we can release fastai v1 around the same time as pytorch v1, and have them work well together.


(Stas Bekman) #11

If you want test to line up you can just add a trailing space.

adding whitespace to compensate for len(“test”)==4 doesn’t work too well, since it needs to be done in many places, and it still doesn’t align well if you have nested expressions:

train_dur = train[columns], train.something, train.somemore[~train...]
test_dur  = test[columns],  test.something,  test.somemore[~test...]

come on folks, if this is a big rewrite, perhaps someone has a brilliant idea?


(Stas Bekman) #12

Yup you need at least pytorch 0.4. I’d suggest creating a new conda env for this project. I’m using pytorch master installed from source, and although 0.4 works for now, there are some new features coming soon to pytorch master that we’ll be using. Our goal is to stay current with master so we can release fastai v1 around the same time as pytorch v1, and have them work well together.

OK, I will then need to set up a new environment. I’m waiting for ubuntu 18.04.01 LTS to be released in a few days, at which point I will be rebuilding all the environments anyway (to switch from 16.04), so I will set this up then and start working on it.


(Jeremy Howard) #13

(You may not want to upgrade ubuntu just yet, since I don’t think CUDA officially supports newer versions. They tend to be a little slow!)


(Stas Bekman) #14

Thank you for the warning, Jeremy.

There are quite a few posts out there on how to install CUDA on 18.04, so I will give it a try. I have to build CUDA from source anyway, since I use my GTX card as a non-primary card - so that I can have every MB possible available for ML work. Hopefully this will work.


(Jeremy Howard) #15

Another naming issue and request - I’ve been using ‘x’ as a generic parameter name for tensors, such as the argument to forward() in an nn.Module. But I also often use it as the argument name in a lambda. And of course we use it for the independent variable. That’s too many uses! :open_mouth:

Perhaps the usage of ‘x’ we should stick with is as a generic parameter name for tensors, in situations where we don’t know anything about what the tensor represents. And then for lambdas let’s use ‘o’ instead. And how about for dependent and independent vars we just use ‘dep’ and ‘indep’?

Which means I have another request (sorry!) - maybe @stas when you go through and do the renaming, you could handle these changes too? BTW the 002 notebook is ready to be looked at and go through renaming too now.


(Stas Bekman) #16

First, a quick summary of the suggestions made by you so far, @jeremy:

*** Variable Naming Convention ***

1) data 

prefixes:

train
valid
test

suffixes:

w/o   DataBunch object. 
df    DataFrame
ds    DataSet
dl    DataLoader

2) tensors

x     generic parameter name for tensors (forward(x) in nn.Module)
indep independent variable tensor
dep   dependent variable tensor 


3) loops

b     batch (from a dataloader)
xb    x parts of the batch
yb    y parts of the batch


4) lambdas

o     lambda arg

wrt dep/indep - are those going to be used in parallel, and do we want them to align as with train/valid/test?

Are there acceptable synonyms, so that we don’t have the prefix ‘in’ sticking out? Also, the words dep/indep are too generic to intuitively indicate what they represent.

We also use dep in the notebooks for the targ column variable, which is not a tensor - so it’d be confusing, no? Perhaps in the notebooks it can become targ_col?

Otherwise, I built a separate environment with conda-installed pytorch 0.4.0 and ran the first v1 notebook successfully, so I can start with the renaming once you agree on the convention. (I also installed 0.5.0 dev, but something is wrong with the CUDA setup, so switching to 0.5.0 will have to wait until after the 18.04.1 upgrade.)


(Stas Bekman) #17

A different kind of convention:

Are we going to establish a standard preamble that the notebooks start with, to simplify dev:

%matplotlib inline
%reload_ext autoreload
%autoreload 2

should I add those to 001* 002* notebooks? anything else while we are there?


(Stas Bekman) #18

looking at the code:

nb_001b.py:

    val_loss = np.sum(np.multiply(losses,nums)) / np.sum(nums)

wrt val_loss - what do we do with the rest of the val_ vars: keep them, or s/val_/valid_/g? i.e., to what scope do the train/valid/test prefixes apply?
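For context, the numpy line above computes a weighted mean of the per-batch losses, weighted by batch size; the same thing in pure Python (the numbers are made up for illustration):

```python
# weighted mean of per-batch validation losses, weighted by batch size
losses = [0.5, 0.25]   # mean loss of each validation batch
nums   = [10, 30]      # number of items in each batch

val_loss = sum(l * n for l, n in zip(losses, nums)) / sum(nums)
# val_loss == (0.5*10 + 0.25*30) / 40 == 0.3125
```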

return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda x: x)

wrt o in lambdas instead of x: here, f(g(x)) reads quite intuitively, whereas f(g(o)) would be odd.

Besides, in such nested lambdas wouldn’t it be better to use different variables for each lambda? e.g.:

return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda o: o)

Is it important to have lrs for sets of lrs, or should we just leave it as lr? Currently it’s lrs in the main code base; v1 uses lr:

def fit(self, epochs, lr, opt_fn=optim.SGD):

nb_002.py:

def find_classes(folder):
    [...]
    return sorted(classes, key=lambda d: d.name)

another lambda. d seems most fitting here, rather than o. I suppose you suggested using o as a fallback when a more suitable letter can’t be found (but not x, as it now has a special meaning in fastai v1).

also, we are primarily on unix here and you use d's for dirs anyway, so perhaps s/folder/dir/?

besides abbr.md says: directory dir
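For reference, the elided body of that helper could plausibly look like this (a sketch using pathlib; this is my reconstruction for illustration, not the actual nb_002 code):

```python
from pathlib import Path
import tempfile

def find_classes(folder):
    # sketch: each class is a subdirectory of `folder`,
    # sorted by name for a stable label ordering
    classes = [d for d in Path(folder).iterdir() if d.is_dir()]
    return sorted(classes, key=lambda d: d.name)

# tiny usage demo with a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    for name in ('dog', 'cat'):
        (Path(tmp) / name).mkdir()
    names = [d.name for d in find_classes(tmp)]
    assert names == ['cat', 'dog']   # sorted by directory name
```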

in the same function:

    for i, cls in enumerate(classes):
        fnames = get_image_files(folder/cls)

wrt using cls - is there no conflict with cls conventionally naming the class in Python classmethods? Perhaps:

    for i, class in enumerate(classes):
        fnames = get_image_files(folder/class)

(though class is a reserved word in Python, so it can’t actually be used as a variable name)

but you do have that one covered in abbr.md: class cls

def normalize(mean,std,x)

should function definitions consistently have spaces after the commas between their args?

Before I continue, is this kind of feedback useful or a way too much nitpicking and thus counterproductive?


(Jason Antic) #19

These points Stas brings up I think are interesting and legit. I’d like to take the opportunity while this is being talked about to chime in with some feedback on the current notebooks we have for the courses. This is the -one- and only thing I felt like I ever disagreed with Jeremy on, which is how to name things.

While the short abbreviations work fine as long as the conventions are strictly adhered to and we’re dealing with abbreviations that are documented or have been seen before, it often breaks down when you run into new stuff. The chief problem I run into is trying to figure out “what does this actually stand for?” Unless the abbreviation standards are continuously updated for new variables (hard to enforce 100%), there will definitely be edge cases that make it harder to understand what’s going on in the code than is necessary. What Stas is running into above seems like a bit of a preview of that.

My personal inclination is to just spell things out exactly what they are, even if you’re dealing with a small variable scope (like a for loop). Looking at my translate.ipynb notebook right now:

def seq2seq_loss(input, target):
    sl,bs = target.size()
    sl_in,bs_in,nc = input.size()
..... (truncated)

By looking at the larger context of the notebook’s contents combined with past experience, you can eventually figure it out. But it requires that context to an unnecessary degree I think.

Just for illustrative purposes (it doesn’t -have- to be this verbose/enterprisey and obviously Java inspired, but it drives a point):

def sequence2sequence_loss(input, target):
    target_sequence_length, target_batch_size = target.size()
    input_sequence_length, input_batch_size, num_unique_tokens = input.size()
....(truncated)

BTW- This might make me sound dumb but I’ll just be honest- I had to do print statements to make sure I knew what nc was. That seems quite unnecessary and counterproductive if the alternative is to simply spell it out. EDIT: I originally wrote “num_unique_characters” when in fact it’s tokens…“nc” just naturally plugged into my brain as “number of characters” without thinking…

So the point is that not only is the second version of the function more readily understood without any further context in this forum, but that I’d also have been able to save some time in understanding the code and avoided running print statements to make sure I knew that nc intended to convey “number of characters” (I think?!? [Edit- even though it’s tokens, not characters]. It’s just my best guess, honestly).

I really think this makes all the difference, not only in terms of making the code easier to understand but also making it less likely to birth bugs through misunderstanding as the code is reworked (both for others and the original author!)

Now I know a lot of people are going to take issue with the verbosity, so all I’m suggesting here is this: I think what could be aimed for is context-free understanding. It doesn’t have to be whole words, but honestly I’d say verbosity is not a huge price to pay compared to the problems ambiguous two-letter names cause.

The reason I’m pointing this out is because I know abbr.md is definitely trying to address this and something like that needs to be done, but I think a much simpler and enforceable way of doing it that’ll handle the inevitable edge cases better is to just go by the convention of “spell it out”. Simple to follow- and you know what to do for all those edge cases- write words for people to read!

Now one more thing to address: Jeremy pointed out in a lesson that he likes this abbreviated convention because you can fit more on a screen and comprehend more. True. But you can also achieve the same thing with carefully selected abstractions (words in code). Ideally, “lossless abstractions” that describe exactly what is going on without misleading are what you’d want. That might mean you have to break functions down more, but that’s generally a good idea anyway.


(Jeremy Howard) #20
  1. Yes that’s what I was thinking
  2. I don’t know of any, but I think that’s OK
  3. In Pytorch 0.4 there are no “Variable” objects any more. Do you mean a Series from a DataFrame, perhaps? If so, I think we can wait until we get to the Pandas stuff to figure out naming there - but ‘_col’ seems like a reasonable approach for Series objects.