A Beginner's Journey to Playground Series Season 3, Episode 1

How and why should I get started

  • Introduced to me by Radek’s tweet
  • Radek’s comp intro video got me very interested in it
  • Reasons not to do it: the dataset is not real, so the problem is not serious
  • Reasons to do it: time and energy will be focused on techniques and models, and learning will be more efficient (whereas in the OTTO comp I spent a month just implementing scripts before touching real, complex models)

My plan for this comp

  • Implement everything Radek is sharing in this comp
  • Record my journey on the fastai forum

:chart_with_upwards_trend: EDA + training a first model + submission :rocket:

Milestone notebooks

  • train an LGBMRegressor model with the official dataset alone: Radek (metric score: 0.56099, public score: 0.56237), Daniel (float64 score: 0.56497, float32 score: 0.56506, public score: 0.56824)
  • train and validate an LGBMRegressor model on the official and additional datasets combined: Radek (metric score: 0.52590225), Daniel (pandas score: 0.52590225, public score: 0.56097; polars float64: 0.525977, public score: 0.56064; polars float32: 0.525936, public: 0.56014); polars float32 outperforms all
  • train LGBMRegressor with the given parameters from this notebook: Radek (with random_state 0, metric score: 0.519450), Daniel (with random_state 19; f64 pandas: metric score 0.52017, public score; f64 polars: metric 0.52017, public; f32 polars: metric 0.52003, public: 0.55858)
  • feature interactions notebook
  • hyperparameter search notebook
  • notebook to understand LGBMRegressor model

Data checking version 5

  • more info on the dataset from sklearn
  • what to predict? (the median house value for California districts) cell
  • what are the independent variables? (8 or 9?) cell
  • what is the evaluation metric? (root mean squared error and watch out for what) cell
  • is the column id a feature to be studied or just an artifact of preprocessing to be ignored? (use maximum occurrence of id to confirm) cell
  • eyeball for numeric and categorical columns/features? cell
  • how many NAs or nulls in each feature? cell1, cell2
  • check the shape of the train and test, cell
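The basic checks above (shape, whether id is a pure row identifier, null counts) can be sketched on a toy frame; the columns below are illustrative stand-ins, not the real competition schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition train set (columns are made up here,
# only the checking pattern matters).
train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "MedInc": [8.3, 7.2, np.nan, 5.6],
    "HouseAge": [41.0, 21.0, 52.0, 52.0],
    "MedHouseVal": [4.5, 3.6, 3.5, 3.4],
})

# shape of the train set
print(train.shape)                       # (4, 4)

# does every id occur exactly once, i.e. is it just a row identifier?
print(train["id"].value_counts().max())  # 1

# NA/null counts per feature
print(train.isna().sum())
```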

Modeling version 5

  • why do modeling, instead of doing statistical analysis to find interesting things? cell
  • what are all the libraries and functions needed for the modeling? cell
  • how to learn more of the classes and functions imported for modeling? cell
  • what are the features and target? cell
  • why adding additional dataset and what is the additional dataset? cell
  • how to download the additional dataset? cell
  • check the dataset (as a dict) provided by fetch_california_housing? cell
  • how to concat the features (numpy.array) and target (numpy.array) from the dict? cell
  • how to split the train into 5 folds and control the randomness of each fold with KFold(n_splits=5, random_state=0, shuffle=True)? cell
  • what exactly can for i, (train_index, val_index) in enumerate(kf.split(train)): give us? cell
  • how to access each fold of X_train, X_val, y_train, y_val with a list of features and a list of idx? cell
  • reading docs of LGBMRegressor, cell
  • how to create an LGBMRegressor model with a specific learning_rate, n_estimators, and metric? cell
  • how to add additional dataset to the model for training? pandas-numpy vs polars
  • make sure all the data inputs are the same shape between pandas/numpy version and polars version, cell
  • when LGBMRegressor fits, the data inputs should generally be numpy arrays (sometimes pandas, but not polars), see Radek’s pandas version, my polars2numpy version
  • reading docs of LGBMRegressor.fit, cell
  • how to make predictions on a X_val with model.predict? cell
  • how to calculate metric score with mean_squared_error(truth, pred, squared=False)? cell
  • how to take the mean of a list with pl.Series(list).mean? cell
  • how to rank the importance of different features/columns from the trained model with clf.feature_importances_? with pandas vs polars
  • what insight can the feature importance offer us? cell
  • what are the shortcomings of tree-based models? (feature interactions, and what to do about them) cell
  • how to make 5 predictions from 5 models and put them into a list? cell
  • how to take the mean from the 5 predictions (ensemble) with transpose, explode, mean(axis=1)? cell
  • how to build a dataframe from id and ensembled prediction with pl.DataFrame and save it to csv write_csv? cell
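The cross-validation loop the bullets above walk through can be sketched as follows. To keep the sketch runnable without lightgbm installed, I use scikit-learn's DummyRegressor as a stand-in; in the notebook the model would be LGBMRegressor(learning_rate=..., n_estimators=..., metric="rmse"), and the toy X/y here are synthetic:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.dummy import DummyRegressor  # stand-in for lightgbm.LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # toy features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, random_state=0, shuffle=True)
scores, fold_preds = [], []

# kf.split(X) yields one (train_index, val_index) pair per fold: two integer
# arrays we can use to slice out X_train/X_val and y_train/y_val.
for i, (train_index, val_index) in enumerate(kf.split(X)):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    clf = DummyRegressor(strategy="mean").fit(X_train, y_train)

    pred = clf.predict(X_val)
    # RMSE by hand; equivalent to mean_squared_error(y_val, pred, squared=False)
    scores.append(float(np.sqrt(np.mean((y_val - pred) ** 2))))
    fold_preds.append(clf.predict(X))   # per-fold predictions for the ensemble

print(np.mean(scores))                  # mean CV metric score
ensemble = np.mean(fold_preds, axis=0)  # average the 5 fold models (ensemble)
print(ensemble.shape)                   # (100,)
```

Averaging the five fold models' predictions at the end is the same ensembling step the notebook does with transpose/explode/mean(axis=1) in polars.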

Q&A version 5

  • How are the learning_rate and n_estimators of LGBMRegressor chosen? cell, asked here, direction provided
  • Question: when the pandas or polars version is trained twice, the scores are very close but not the same; why? cell, asked and answered here, figured out here, solved
  • why are the scores produced by pandas and polars different? cell, asked and answered here, explored but not yet solved
  • Will Radek dig into the interactions between features? cell, asked and answered here, direction provided

Modeling version 6

  • changes in Radek’s version 6
    • merge the additional dataset with the official dataset cell
    • changed the random_state from 0 to 19,
    • metric score changed from 0.5609 to 0.5259 (I suspect it is due to the change of dataset) cell
  • implement the above in polars in my notebook
    • pandas: join the additional dataset with the competition dataset (see cell) and feed each of the 5 folds to training as a numpy array via to_numpy, see cell
    • polars: join the additional dataset with the competition dataset (see cell) and feed each of the 5 folds to training as a numpy array via to_numpy, see cell
    • I compared and confirmed that all data inputs for pandas and polars are the same as df or series (see cell), and all arrays are the same too once the same dtype is enforced (see cell)
    • :joy: :rocket: :tada: I figured out why pandas and polars training produce different results under the same randomness: the same dtypes must be enforced, see cell
      • :scream: :star: but the results only match when cast to pl.Float32; there is a slight difference with pl.Float64, see notebook
      • :joy: :tada: using the official dataset alone, both float64 and float32 results are identical, see notebook
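Why enforcing a common dtype matters can be shown with numpy alone (pandas/polars omitted so the sketch runs anywhere; the values are illustrative). A float64 column cast to float32 loses precision, so two pipelines feeding float64 and float32 inputs to the same model can split on slightly different values and end up with slightly different scores:

```python
import numpy as np

col64 = np.array([0.1, 0.2, 0.7])      # float64 by default
col32 = col64.astype(np.float32)       # what a float32 pipeline would feed in

# Widening float32 back to float64 does NOT restore the original values,
# so float64 and float32 inputs are never bit-for-bit identical:
print(np.array_equal(col64, col32.astype(np.float64)))  # False

# But if BOTH pipelines cast to float32 first, the arrays match exactly:
print(np.array_equal(col64.astype(np.float32), col32))  # True
```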

Q&A Radek version 6 vs my version 22

  • why are the scores produced by pandas and polars different? cell, asked and answered here, and solved in this notebook cell, solved
  • why did the scores fall from Radek’s version 5 to version 6? my guess is that after joining the additional dataset, the problem gets harder due to more data. asked here, hypothesis proposed

Modeling Radek’s version 7

  • The major changes in this version compared to version 6, see cell
  • :scream: :scream: :scream: but I don’t know where those parameters came from, nor how the LGBMRegressor model works
  • how to set up all parameters for a function beforehand in a dict and pass them with func(**params), cell, cell2
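The func(**params) pattern above can be sketched like this; the parameter names and values are illustrative (not SoupMonster's actual settings), and make_model is a hypothetical stand-in for LGBMRegressor:

```python
# Collect hyperparameters in a dict beforehand, then unpack them with **
# at the call site so the call reads like keyword arguments.
params = {"learning_rate": 0.1, "n_estimators": 500, "random_state": 19}

def make_model(learning_rate=0.3, n_estimators=100, random_state=0):
    """Stand-in for LGBMRegressor(**params): just echoes what it received."""
    return {"learning_rate": learning_rate,
            "n_estimators": n_estimators,
            "random_state": random_state}

# Same as make_model(learning_rate=0.1, n_estimators=500, random_state=19)
model = make_model(**params)
print(model["n_estimators"])   # 500
```

Keeping all hyperparameters in one dict makes it easy to log them, reuse them across folds, and later hand them to a tuner such as Optuna.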

Q&A

  • How did SoupMonster come up with the specified parameters for the model? asked and answered here; must give Optuna a try. answered

Todos

  • What if I enforce the dataset’s dtype as float64 or int32 to see the differences in scores?
  • what if the additional dataset is not added? why is adding it more interesting?
  • update with Radek’s new versions
  • watch Radek’s new videos

If you are interested in reading my daily updates, please check out my repo here

I will leave the forum in peace and save space for discussions :grin: