Radek’s comp intro video got me very interested in it

Reasons not to do it: the dataset is not real, so the problem is not serious

Reasons to do it: time and energy will be focused on techniques and models, and learning will be more efficient (whereas in the otto comp I spent a month just implementing scripts before touching real, complex models)

My plan for this comp

Implement everything Radek is sharing in this comp

train an LGBMRegressor on the official dataset alone: Radek (metric score: 0.56099, public score: 0.56237), Daniel (float64 score: 0.56497, float32 score: 0.56506, public score: 0.56824)

train and validate an LGBMRegressor on the combined official + additional dataset: Radek (metric score: 0.52590225), Daniel (pandas score: 0.52590225, public score: 0.56097; polars float64: 0.525977, public score: 0.56064; polars float32: 0.525936, public: 0.56014); polars float32 outperforms all the others

train an LGBMRegressor with the given parameters from this notebook: Radek (random_state 0, metric score: 0.519450), Daniel (random_state 19, f64 pandas: metric score: 0.52017, public score; f64 polars: metric: 0.52017, public; f32 polars: metric: 0.52003, public: 0.55858)

how to create an LGBMRegressor model with a specific learning_rate, n_estimators, and metric? cell

how to add an additional dataset to the model for training? pandas/numpy vs polars

make sure all the data inputs have the same shape between the pandas/numpy version and the polars version, cell

when LGBMRegressor does fit, the data inputs should mostly be numpy arrays (sometimes pandas, but not polars); see Radek’s pandas version and my polars-to-numpy version

how to make predictions on X_val with model.predict? cell

how to calculate the metric score with mean_squared_error(truth, pred, squared=False)? cell
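For reference, this is the quantity that the squared=False flag computes, done by hand in pure Python (note that recent scikit-learn versions deprecate squared=False in favor of root_mean_squared_error):

```python
import math

# toy truth/prediction pair; only the last element differs by 2
truth = [1.0, 2.0, 3.0]
pred = [1.0, 2.0, 5.0]

# RMSE = sqrt(mean of squared errors)
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))
```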

how to take the mean of a list with pl.Series(list).mean()? cell

how to rank the importance of different features/columns from the trained model with clf.feature_importances_? pandas vs polars

what insight can the feature importance offer us? cell

what are the shortcomings of tree-based models? (feature interactions, and what to do about them) cell

how to make 5 predictions from 5 models and put them into a list? cell

how to take the mean of the 5 predictions (ensemble) with transpose, explode, mean(axis=1)? cell
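A simple-averaging sketch; numpy’s stack + mean(axis=0) is an alternative to the polars transpose/explode idiom the notebook uses, and the prediction values here are invented:

```python
import numpy as np

# five hypothetical model predictions for the same 3 rows
preds = [np.array([0.1, 0.2, 0.3]) + 0.01 * i for i in range(5)]

# per-row mean across the 5 models
ensemble = np.stack(preds).mean(axis=0)
```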

how to build a dataframe from the id and the ensembled predictions with pl.DataFrame, and save it to csv with write_csv? cell

Q&A version 5

How are the learning_rate and n_estimators of LGBMRegressor chosen? cell; asked here, direction provided

Question: the pandas version or the polars version trained twice gives scores that are very close but not the same; why? cell; asked and answered here, figured out here, solved

why are the scores produced by pandas and polars different? cell; asked and answered here, explored but not solved

Will Radek dig into the interactions between features? cell; asked and answered here, direction provided

merge the additional dataset with the official dataset, cell

changed the random_state from 0 to 19,

the metric score changed from 0.5609 to 0.5259 (I suspect it is due to the change of dataset) cell

implement the above in polars in my notebook

pandas: join the additional dataset with the competition dataset (see cell) and feed the 5-fold split to training as numpy arrays via to_numpy, see cell

polars: join the additional dataset with the competition dataset (see cell) and feed the 5-fold split to training as numpy arrays via to_numpy, see cell
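A hedged sketch of the 5-fold split fed to training as numpy arrays; the fold count and random_state mirror the notes, but the data is a toy stand-in:

```python
import numpy as np
from sklearn.model_selection import KFold

# toy stand-in for the combined dataset, already converted via to_numpy
X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.float32)

kf = KFold(n_splits=5, shuffle=True, random_state=19)
# each fold yields numpy train/validation arrays ready for model.fit
folds = [(X[tr], y[tr], X[va], y[va]) for tr, va in kf.split(X)]
```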

I compared and verified that all data inputs for pandas and polars are the same as dataframes or series (see cell), and all arrays are the same too when the same dtype is enforced (see cell)

I figured out why pandas and polars training have different results under the same randomness: the same dtypes must be enforced, see cell

but the arrays are only identical when enforced into pl.Float32; there are slight differences with pl.Float64, see notebook

using the official dataset alone, both float64 and float32 give the same results, see notebook

Q&A Radek version 6 vs my version 22

why are the scores produced by pandas and polars different? cell; asked and answered here, solved in this notebook cell

why did the score fall from Radek’s version 5 to version 6? my guess is that after joining the additional dataset, the problem gets harder due to more data. asked here, hypothesis proposed