A Beginner's Journey to Playground Series Season 3, Episode 1

How and why should I get started

  • Introduced to me by Radek’s tweet
  • Radek’s comp intro video got me very interested in it
  • Reasons not to do it: the dataset is not real, so the problem is not serious
  • Reasons to do it: time and energy will be focused on techniques and models, and learning will be more efficient (whereas in the OTTO comp I spent a month just implementing scripts before touching real, complex models)

My plan for this comp

  • Implement everything Radek is sharing in this comp
  • Record my journey on the fastai forum

:chart_with_upwards_trend: EDA + training a first model + submission :rocket:

Milestone notebooks

  • train an LGBMRegressor model with the official dataset alone: Radek (metric score: 0.56099, public score: 0.56237), Daniel (float64 score: 0.56497, float32 score: 0.56506, public score: 0.56824)
  • train and validate an LGBMRegressor model on the official and additional datasets combined: Radek (metric score: 0.52590225), Daniel (pandas score: 0.52590225, public score: 0.56097; polars float64: 0.525977, public score: 0.56064; polars float32: 0.525936, public: 0.56014); polars float32 outperforms all
  • train LGBMRegressor with the given parameters from this notebook: Radek (with random_state 0, metric score: 0.519450), Daniel (with random_state 19; f64 pandas: metric score 0.52017, public score; f64 polars: metric 0.52017, public; f32 polars: metric 0.52003, public: 0.55858)
  • feature interactions notebook
  • hyperparameter search notebook
  • notebook to understand LGBMRegressor model

Data checking version 5

  • more info on the dataset from sklearn
  • what to predict? (the median house value for California districts) cell
  • what are the independent variables? (8 or 9?) cell
  • what is the evaluation metric? (root mean squared error and watch out for what) cell
  • is the column id a feature to be studied or just an artifact of preprocessing to be ignored? (use maximum occurrence of id to confirm) cell
  • eyeball for numeric and categorical columns/features? cell
  • how many NAs or nulls in each feature? cell1, cell2
  • check the shape of the train and test, cell
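The basic checks above (shape, whether id is a pure row identifier, null counts) can be sketched on a toy frame; the columns below are illustrative stand-ins, not the real competition schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition train set (columns are made up here,
# only the checking pattern matters).
train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "MedInc": [8.3, 7.2, np.nan, 5.6],
    "HouseAge": [41.0, 21.0, 52.0, 52.0],
    "MedHouseVal": [4.5, 3.6, 3.5, 3.4],
})

# shape of the train set
print(train.shape)                       # (4, 4)

# does every id occur exactly once, i.e. is it just a row identifier?
print(train["id"].value_counts().max())  # 1

# NA/null counts per feature
print(train.isna().sum())
```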

Modeling version 5

  • why do modeling, instead of doing statistical analysis to find interesting things? cell
  • what are all the libraries and functions needed for the modeling? cell
  • how to learn more of the classes and functions imported for modeling? cell
  • what are the features and target? cell
  • why adding additional dataset and what is the additional dataset? cell
  • how to download the additional dataset? cell
  • check the dataset (as a dict) provided by fetch_california_housing? cell
  • how to concat the features (numpy.array) and target (numpy.array) from the dict? cell
  • how to split the train into 5 folds and control the randomness of each fold with KFold(n_splits=5, random_state=0, shuffle=True)? cell
  • what exactly can for i, (train_index, val_index) in enumerate(kf.split(train)): give us? cell
  • how to access each fold of X_train, X_val, y_train, y_val with a list of features and a list of idx? cell
  • reading docs of LGBMRegressor, cell
  • how to create an LGBMRegressor model with a specific learning_rate, n_estimators, and metric? cell
  • how to add additional dataset to the model for training? pandas-numpy vs polars
  • make sure all the data inputs are the same shape between pandas/numpy version and polars version, cell
  • when LGBMRegressor fits, the data inputs should generally be numpy arrays (sometimes pandas, but not polars), see Radek’s pandas version, my polars2numpy version
  • reading docs of LGBMRegressor.fit, cell
  • how to make predictions on a X_val with model.predict? cell
  • how to calculate metric score with mean_squared_error(truth, pred, squared=False)? cell
  • how to take the mean of a list with pl.Series(list).mean? cell
  • how to rank the importance of different features/columns from the trained model with clf.feature_importances_? with pandas vs polars
  • what insight can the feature importance offer us? cell
  • what are the shortcomings of tree-based models? (feature interactions, and what to do about them) cell
  • how to make 5 predictions from 5 models and put them into a list? cell
  • how to take the mean from the 5 predictions (ensemble) with transpose, explode, mean(axis=1)? cell
  • how to build a dataframe from id and ensembled prediction with pl.DataFrame and save it to csv write_csv? cell
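The cross-validation loop the bullets above walk through can be sketched as follows. To keep the sketch runnable without lightgbm installed, I use scikit-learn's DummyRegressor as a stand-in; in the notebook the model would be LGBMRegressor(learning_rate=..., n_estimators=..., metric="rmse"), and the toy X/y here are synthetic:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.dummy import DummyRegressor  # stand-in for lightgbm.LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # toy features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, random_state=0, shuffle=True)
scores, fold_preds = [], []

# kf.split(X) yields one (train_index, val_index) pair per fold: two integer
# arrays we can use to slice out X_train/X_val and y_train/y_val.
for i, (train_index, val_index) in enumerate(kf.split(X)):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    clf = DummyRegressor(strategy="mean").fit(X_train, y_train)

    pred = clf.predict(X_val)
    # RMSE by hand; equivalent to mean_squared_error(y_val, pred, squared=False)
    scores.append(float(np.sqrt(np.mean((y_val - pred) ** 2))))
    fold_preds.append(clf.predict(X))   # per-fold predictions for the ensemble

print(np.mean(scores))                  # mean CV metric score
ensemble = np.mean(fold_preds, axis=0)  # average the 5 fold models (ensemble)
print(ensemble.shape)                   # (100,)
```

Averaging the five fold models' predictions at the end is the same ensembling step the notebook does with transpose/explode/mean(axis=1) in polars.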

Q&A version 5

  • How are the learning_rate and n_estimators of LGBMRegressor chosen? cell, asked here, direction provided
  • Question: when the pandas or polars version is trained twice, the scores are very close but not the same; why? cell, asked and answered here, figured out here, solved
  • why are the scores produced by pandas and polars different? cell, asked and answered here, explored but not yet solved
  • Will Radek dig into the interactions between features? cell, asked and answered here, direction provided

Modeling version 6

  • changes in Radek’s version 6
    • merge the additional dataset with the official dataset cell
    • changed the random_state from 0 to 19,
    • metric score changed from 0.5609 to 0.5259 (I suspect it is due to the change of dataset) cell
  • implement the above in polars in my notebook
    • pandas: join the additional dataset with the competition dataset (see cell) and feed each of the 5 folds to training as a numpy array via to_numpy, see cell
    • polars: join the additional dataset with the competition dataset (see cell) and feed each of the 5 folds to training as a numpy array via to_numpy, see cell
    • I compared and confirmed that all data inputs for pandas and polars are the same as df or series (see cell), and all arrays are the same too once the same dtype is enforced (see cell)
    • :joy: :rocket: :tada: I figured out why pandas and polars training produce different results under the same randomness: the same dtypes must be enforced, see cell
      • :scream: :star: but the results only match when cast to pl.Float32; there is a slight difference with pl.Float64, see notebook
      • :joy: :tada: using the official dataset alone, both float64 and float32 results are identical, see notebook
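Why enforcing a common dtype matters can be shown with numpy alone (pandas/polars omitted so the sketch runs anywhere; the values are illustrative). A float64 column cast to float32 loses precision, so two pipelines feeding float64 and float32 inputs to the same model can split on slightly different values and end up with slightly different scores:

```python
import numpy as np

col64 = np.array([0.1, 0.2, 0.7])      # float64 by default
col32 = col64.astype(np.float32)       # what a float32 pipeline would feed in

# Widening float32 back to float64 does NOT restore the original values,
# so float64 and float32 inputs are never bit-for-bit identical:
print(np.array_equal(col64, col32.astype(np.float64)))  # False

# But if BOTH pipelines cast to float32 first, the arrays match exactly:
print(np.array_equal(col64.astype(np.float32), col32))  # True
```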

Q&A Radek version 6 vs my version 22

  • why are the scores produced by pandas and polars different? cell, asked and answered here, and solved in this notebook cell, solved
  • why did the scores fall from Radek’s version 5 to version 6? my guess is that after joining the additional dataset, the problem gets harder due to more data. asked here, hypothesis proposed

Modeling Radek’s version 7

  • The major changes in this version compared to version 6, see cell
  • :scream: :scream: :scream: but I don’t know where those parameters came from, nor how the LGBMRegressor model works
  • how to set up all parameters for a function beforehand in a dict and pass them with func(**params), cell, cell2
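The func(**params) pattern above can be sketched like this; the parameter names and values are illustrative (not SoupMonster's actual settings), and make_model is a hypothetical stand-in for LGBMRegressor:

```python
# Collect hyperparameters in a dict beforehand, then unpack them with **
# at the call site so the call reads like keyword arguments.
params = {"learning_rate": 0.1, "n_estimators": 500, "random_state": 19}

def make_model(learning_rate=0.3, n_estimators=100, random_state=0):
    """Stand-in for LGBMRegressor(**params): just echoes what it received."""
    return {"learning_rate": learning_rate,
            "n_estimators": n_estimators,
            "random_state": random_state}

# Same as make_model(learning_rate=0.1, n_estimators=500, random_state=19)
model = make_model(**params)
print(model["n_estimators"])   # 500
```

Keeping all hyperparameters in one dict makes it easy to log them, reuse them across folds, and later hand them to a tuner such as Optuna.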

Q&A

  • How did SoupMonster come up with the specified parameters for the model? asked and answered here; must give Optuna a try. answered

Todos

  • What if I enforce the dataset’s dtype as float64 or int32 to see the differences in scores?
  • what if the additional dataset is not added? why is adding it more interesting?
  • update with Radek’s new versions
  • watch Radek’s new videos

If you are interested in reading my daily updates, please check out my repo here

I will leave the forum in peace and save space for discussions :grin: