Can't get a simple model (tabular) to learn - probably missing something

ronlut · December 13, 2018, 9:19am

Hi.
I know this seems like a random lazy call for help, but I tried to solve this literally for days (even fixed a bug during the process).
I’m a long-time user of fastai (pre-1.0 versions), but I’ve hit a dead end in this case.

I have a simple dataset containing 586 features and 1607 samples, single-labeled with a category label which is one of 6 classes.
I can’t get the learner to learn, no matter what I tried.
Using a simple RandomForest, without any optimizations, I get ~87% accuracy.
Using a neural net, I get random results (~20% validation accuracy with 6 classes).

I tried the examples from the fastai tutorials, and they do work. I also tried with some sample data I created, and that works as well.
I don’t know what I’m missing with this dataset. I would be super grateful for anyone taking a few minutes to help me, I’m out of ideas.

Gist: https://gist.github.com/ronlut/af598fd23bdd69dae8b702aa88e9121e#file-simple-train-ipynb
x,y files to reproduce: x, y

Thank you very very much!

ronlut · January 29, 2019, 11:04am

More than a month after I posted this, I’m still having the same problem, which is, unfortunately, preventing me from using fastai in our project as all our data looks like the data I posted here.
Any help will be appreciated, I really want to solve it.

Some more details:
After talking to @sgugger he suggested checking whether the batch looks ok.
I tried show_batch() with previous versions, but it didn’t work. I tried again with v42, where show_batch() now works, and I think the data is being fed into the network incorrectly.
If I understand correctly, the batch contains wrong labels for the samples, meaning that either something goes wrong somewhere in the split process or my DataFrame is incorrect.
RandomForest works well, so let’s assume the DataFrame is correct but maybe formatted in a way that fastai doesn’t know how to handle: the index, x or y, the labels list, or something like that.

@sgugger - We talked over private messages but didn’t reach a conclusion. It would be super helpful if you could try to help me with this again; hopefully this time we will find the root cause.
@jeremy Your help will be very appreciated as well

AbuFadl · January 29, 2019, 1:04pm

Check to make sure your cat/cont vars are selected correctly. Most likely, Pandas reads all your data as float. Run df.info() on your data (df is pd.read_csv(…)) to see how many columns are read as float/object.
I have made a quick demo of a working kernel using your data (https://www.kaggle.com/abedkhooli/fastai-586) and validation accuracy is at .94 (may not mean a lot if classes are unbalanced - did not check, but proves fastai tabular works).
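
The dtype check suggested above can be sketched roughly as follows. This is only an illustration on a toy DataFrame: the column names (f0, f1, colour, label) are made up, and splitting columns by dtype is just one plausible way to pick cat/cont names, not necessarily what the linked kernel does.

```python
import pandas as pd

# Toy stand-in for the real data; all column names here are hypothetical.
df = pd.DataFrame({
    "f0": [0.1, 0.5, 0.9, 0.3],
    "f1": [1.0, 2.0, 3.0, 4.0],
    "colour": ["red", "blue", "red", "green"],
    "label": ["a", "b", "a", "c"],
})

dep_var = "label"
features = df.drop(columns=[dep_var])

# Numeric columns are candidates for continuous variables;
# everything else (object/category) is a candidate for categorical ones.
cont_names = features.select_dtypes(include="number").columns.tolist()
cat_names = features.select_dtypes(exclude="number").columns.tolist()

df.info()  # prints one line per column with its dtype and non-null count
print("cont:", cont_names, "cat:", cat_names)
```

Note that a numeric dtype does not guarantee a column is really continuous: integer-coded categories would also land in cont_names here and would need to be moved by hand.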

sgugger · January 29, 2019, 2:42pm

I hadn’t looked properly: label_from_list used like this can’t work (and yes, it will label your data randomly). Since you have taken a random split for the validation set, your items and labels aren’t aligned anymore. You should put your xs and ys in the same dataframe, then use label_from_df.
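
To illustrate the alignment problem in isolation, here is a small pandas/numpy sketch. It does not reproduce fastai’s internals; it only assumes that the rows get reordered by the split while the label list stays in its original order, and the toy data (a single feature f0 whose value determines the label) is made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: the label is fully determined by the single feature.
x = pd.DataFrame({"f0": np.arange(1000)})
y = pd.Series(x["f0"] % 6, name="label")

# A random split reorders the rows of x ...
perm = rng.permutation(len(x))
x_shuffled = x.iloc[perm]

# ... so pairing them by position with the original y breaks the link.
misaligned = (x_shuffled["f0"].to_numpy() % 6 == y.to_numpy()).mean()

# Keeping inputs and target in one dataframe means every row carries its label,
# whatever order the rows end up in.
df = x.copy()
df["label"] = y
df_shuffled = df.iloc[perm]
aligned = (df_shuffled["f0"] % 6 == df_shuffled["label"]).mean()

print(f"positional labels correct: {misaligned:.2f}")
print(f"same-dataframe labels correct: {aligned:.2f}")
```

With 6 classes, the positionally paired labels should be right only about as often as chance, which matches the near-random validation accuracy reported in the first post.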

ronlut · January 29, 2019, 5:26pm

Ok… that’s surprising. But good we may have a direction, thanks!

I thought random_split saves the original indices to be able to match the labels to them later?
Anyway, I tried with split_by_idx(list(range(100))) and I get the same results.
What is the use of the split methods if I can’t use the label methods afterwards? Am I missing something?

Regarding label_from_df, does that mean the df I’m passing to TabularList.from_df should also contain the labels?

Or, more generally, if I have x and y as in my example, what is the correct chain of methods I should use to get a proper DataBunch? That is really not clear to me.

sgugger · January 29, 2019, 5:49pm

You shouldn’t use label_from_list in any case, since it’s intended for one ItemList, not an ItemLists (which is the result of your split). This is an internal method; it’s not meant to be used from outside.
Yes, the dataframe you pass should contain both your inputs and your targets.
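
A minimal sketch of the resulting chain, assuming fastai v1 as discussed in this thread. The toy x/y data, the column names and the build_databunch helper are hypothetical, and method names vary between v1 releases (random_split_by_pct was, I believe, later renamed split_by_rand_pct), so check it against the installed version.

```python
import pandas as pd

# Hypothetical stand-ins for the x and y files from the first post.
x = pd.DataFrame({"f0": [0.1, 0.5, 0.9, 0.3], "f1": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series(["a", "b", "a", "c"], name="label")

# Put inputs and target in one dataframe so each row carries its own label.
# This assumes x and y are still in the same row order at this point.
df = x.copy()
df[y.name] = y.to_numpy()

cont_names = list(x.columns)


def build_databunch(df, cont_names, dep_var, bs=64):
    """Hypothetical helper: the fastai v1 data block chain (needs fastai v1)."""
    # Imported lazily so the dataframe preparation above runs without fastai.
    from fastai.tabular import TabularList, Normalize

    return (TabularList.from_df(df, cont_names=cont_names, procs=[Normalize])
            .random_split_by_pct(valid_pct=0.2)  # split first ...
            .label_from_df(cols=dep_var)         # ... then label from the same df
            .databunch(bs=bs))
```

With real data this would be called as build_databunch(df, cont_names, "label"); the point is that labelling reads the target column from the same rows that were split, so the order of the split no longer matters.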

AbuFadl · January 29, 2019, 5:51pm

Did you check the kernel I posted above? It has the data in question and ran fine.

ronlut · January 29, 2019, 6:25pm

Thanks a lot, I didn’t see you edited the post so missed that.
I understood the problem with your and @sgugger’s kind help, so now it’s clear to me.

Thanks a lot. Happy to know that was the mistake and not something more serious.
I will check that tomorrow and will update whether that solves the problem or not, I am full of hope now

Anyway, I must say, the fact that I misunderstood the whole flow and usage means something is confusing, as I spent days and days trying to figure out the problem, reading all available documentation.
I think what confused me the most:

1. The first line of the TabularList documentation points to the ItemList documentation, where you can see all the available methods and the 3 steps of “providing input, splitting the data, labelling the input”, from which I conveniently chose from_df, then random_split_by_pct, then label_from_list. It really looks like that is the flow to go with.
2. ItemList vs ItemLists.
3. The fact that label_from_list is internal isn’t stated anywhere (it could be called _label_from_list, for example).

Thanks again and thanks for the great job you guys do!