Tabular data - Kaggle competitions

offirinbar · April 10, 2019, 10:36am

Hi all

I want to train on tabular data with DL. I already tried a few Kaggle competitions like Rossmann, Titanic, and Santander.

I am looking for recommendations for other competitions that use tabular data (and fast.ai kernels for them as reference, if you have any).

Thank you!
Offir

mcclomitz · April 17, 2019, 12:28am

Hey Offir - how did you go on the Santander data set?
There are a heap of tabular data sets around; have a look at the data sets available on Kaggle. I don’t think there are many competitions, though.

offirinbar · April 18, 2019, 6:38am

Hi Kieran

This is the link for Santander: https://www.kaggle.com/c/santander-customer-transaction-prediction

If you can recommend more datasets, that would be great.

Stephen_F · May 4, 2019, 11:52pm

Hey, I’m working on a similar problem, maybe you all could help me figure this out: I’m trying to use a tabular learner on a wide dataset, and all the values are continuous. I keep getting a “divide by zero” error, and I think it’s due to the way that I’m creating my databunch, but not sure what exactly I’m doing wrong.

```
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from fastai.tabular import *

import os
print(os.listdir("../input"))
path = Path('../input')
df = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

procs = [Normalize, FillMissing]
dep_var = 'target'
# valid_idx = range(len(df)-50, len(df))
df.drop(columns='id', inplace=True)

columns = df.columns

ids_test = test['id']
test.drop(columns='id', inplace=True)

test_list = TabularList.from_df(test, procs=procs)
data = (TabularList.from_df(df, procs=procs)
        # .split_by_idx(valid_idx=valid_idx)
        .split_by_rand_pct(.2)
        .label_from_df(cols=dep_var)
        .add_test(test_list)
        .databunch(path='.'))

learn = tabular_learner(data, layers=[200,100], metrics=AUROC())
```

This gives the error:

```
/opt/conda/lib/python3.6/site-packages/torch/nn/init.py in kaiming_uniform_(tensor, a, mode, nonlinearity)
    288     fan = _calculate_correct_fan(tensor, mode)
    289     gain = calculate_gain(nonlinearity, a)
--> 290     std = gain / math.sqrt(fan)
    291     bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
    292     with torch.no_grad():

ZeroDivisionError: float division by zero
```

dipam7 · May 5, 2019, 5:53am

Hey @Stephen_F, the error might be because you are not telling the model which columns contain continuous values and which ones contain categorical values. You might want to add a line like cont_names = ['age', 'gender' and so on], or, since all your columns are continuous, you can write cont_names = list(df.columns) (minus the target column). Then pass that as an argument when you create your TabularList, like this: cont_names=cont_names
Regards
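
For anyone hitting the same traceback, here is a rough sketch of the likely mechanism. This is an assumption and is not confirmed in the thread: with no cat_names or cont_names, the model ends up with zero input features, so the first linear layer has a fan-in of 0 and the init divides by sqrt(0). The helper `kaiming_std` below is hypothetical and only mirrors the single line shown in the traceback; the gain value and the feature count of 200 are placeholders.

```python
import math


def kaiming_std(fan, gain=math.sqrt(2.0)):
    """Hypothetical mirror of the traceback line `std = gain / math.sqrt(fan)`.

    The default gain is a placeholder, not the value PyTorch computes.
    """
    return gain / math.sqrt(fan)


def try_init(n_features):
    """Return the std for a layer with n_features inputs, or the error text."""
    try:
        return kaiming_std(n_features)
    except ZeroDivisionError as err:
        return f"ZeroDivisionError: {err}"


# No cat_names or cont_names declared: the layer sees zero input features.
no_columns = try_init(0)

# Some continuous columns declared (200 is a made-up count): init succeeds.
with_columns = try_init(200)

print(no_columns)
print(with_columns)
```

If that is what is going on, declaring the continuous columns gives the first layer a non-zero number of inputs and the init goes through.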

Stephen_F · May 6, 2019, 3:56pm

@dipam7 I’ll be darned, I think that worked! I think when I tried it before I accidentally left the target variable in the list of columns and that threw me off the trail. Thanks!
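
A minimal sketch of building the cont_names list without the target, in case it helps someone else avoid the same slip. The helper name `continuous_names` and the column names are made up for illustration; in the original notebook the columns would come from df.columns. The fastai call is left as a comment because it needs fastai v1 and the competition data to run.

```python
def continuous_names(columns, dep_var, exclude=()):
    """Return every column name except the target and any excluded columns.

    Hypothetical helper: assumes all remaining columns are continuous.
    """
    skip = {dep_var, *exclude}
    return [name for name in columns if name not in skip]


# Made-up column names standing in for list(df.columns).
columns = ["id", "target", "var_0", "var_1", "var_2"]
dep_var = "target"

cont_names = continuous_names(columns, dep_var, exclude=("id",))
print(cont_names)

# With fastai v1 available, the list would then be passed along these lines:
# data = (TabularList.from_df(df, cont_names=cont_names, procs=procs)
#         .split_by_rand_pct(.2)
#         .label_from_df(cols=dep_var)
#         .databunch(path='.'))
```

Keeping the exclusion in one small function means the target cannot sneak back into the feature list if the columns change later.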