Radek’s comp intro video got me very interested in it

Reasons not to do it: the dataset is not real, so the problem is not serious

Reasons to do it: time and energy will be focused on techniques and models, and learning will be more efficient (whereas in the otto comp I spent a month just implementing scripts before touching real, complex models)

My plan for this comp

Implement everything Radek is sharing in this comp

train an LGBMRegressor on the official dataset alone: Radek (metric score: 0.56099, public score: 0.56237), Daniel (float64 score: 0.56497, float32 score: 0.56506, public score: 0.56824)

train and validate an LGBMRegressor on the combined official + additional dataset: Radek (metric score: 0.52590225), Daniel (pandas score: 0.52590225, public score: 0.56097; polars float64: 0.525977, public score: 0.56064; polars float32: 0.525936, public: 0.56014); polars float32 outperforms all the others

train an LGBMRegressor with the given parameters from this notebook: Radek (random_state 0, metric score: 0.519450), Daniel (random_state 19, f64 pandas: metric score: 0.52017, public score; f64 polars: metric: 0.52017, public; f32 polars: metric: 0.52003, public: 0.55858)

how to create an LGBMRegressor model with a specific learning_rate, n_estimators, and metric? cell

how to add an additional dataset to the model for training? pandas/numpy vs polars

make sure all the data inputs have the same shape between the pandas/numpy version and the polars version, cell

when LGBMRegressor does fit, the data inputs should mostly be numpy arrays (sometimes pandas, but not polars); see Radek’s pandas version and my polars-to-numpy version

how to make predictions on X_val with model.predict? cell

how to calculate the metric score with mean_squared_error(truth, pred, squared=False)? cell
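For reference, this is the quantity that the squared=False flag computes, done by hand in pure Python (note that recent scikit-learn versions deprecate squared=False in favor of root_mean_squared_error):

```python
import math

# toy truth/prediction pair; only the last element differs by 2
truth = [1.0, 2.0, 3.0]
pred = [1.0, 2.0, 5.0]

# RMSE = sqrt(mean of squared errors)
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))
```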

how to take the mean of a list with pl.Series(list).mean()? cell

how to rank the importance of different features/columns from the trained model with clf.feature_importances_? pandas vs polars

what insight can the feature importance offer us? cell

what are the shortcomings of tree-based models? (feature interactions, and what to do about them) cell

how to make 5 predictions from 5 models and put them into a list? cell

how to take the mean of the 5 predictions (ensemble) with transpose, explode, mean(axis=1)? cell
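A simple-averaging sketch; numpy’s stack + mean(axis=0) is an alternative to the polars transpose/explode idiom the notebook uses, and the prediction values here are invented:

```python
import numpy as np

# five hypothetical model predictions for the same 3 rows
preds = [np.array([0.1, 0.2, 0.3]) + 0.01 * i for i in range(5)]

# per-row mean across the 5 models
ensemble = np.stack(preds).mean(axis=0)
```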

how to build a dataframe from the id and the ensembled predictions with pl.DataFrame, and save it to csv with write_csv? cell

Q&A version 5

How are the learning_rate and n_estimators of LGBMRegressor chosen? cell; asked here, direction provided

Question: the pandas version or the polars version trained twice gives scores that are very close but not the same; why? cell; asked and answered here, figured out here, solved

why are the scores produced by pandas and polars different? cell; asked and answered here, explored but not solved

Will Radek dig into the interactions between features? cell; asked and answered here, direction provided

merge the additional dataset with the official dataset, cell

changed the random_state from 0 to 19,

the metric score changed from 0.5609 to 0.5259 (I suspect it is due to the change of dataset) cell

implement the above in polars in my notebook

pandas: join the additional dataset with the competition dataset (see cell) and feed the 5-fold split to training as numpy arrays via to_numpy, see cell

polars: join the additional dataset with the competition dataset (see cell) and feed the 5-fold split to training as numpy arrays via to_numpy, see cell
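A hedged sketch of the 5-fold split fed to training as numpy arrays; the fold count and random_state mirror the notes, but the data is a toy stand-in:

```python
import numpy as np
from sklearn.model_selection import KFold

# toy stand-in for the combined dataset, already converted via to_numpy
X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.float32)

kf = KFold(n_splits=5, shuffle=True, random_state=19)
# each fold yields numpy train/validation arrays ready for model.fit
folds = [(X[tr], y[tr], X[va], y[va]) for tr, va in kf.split(X)]
```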

I compared and verified that all data inputs for pandas and polars are the same as dataframes or series (see cell), and all arrays are the same too when the same dtype is enforced (see cell)

I figured out why pandas and polars training have different results under the same randomness: the same dtypes must be enforced, see cell

but the arrays are only identical when enforced into pl.Float32; there are slight differences with pl.Float64, see notebook

using the official dataset alone, both float64 and float32 give the same results, see notebook

Q&A Radek version 6 vs my version 22

why are the scores produced by pandas and polars different? cell; asked and answered here, solved in this notebook cell

why did the score fall from Radek’s version 5 to version 6? my guess is that after joining the additional dataset, the problem gets harder due to more data. asked here, hypothesis proposed