Building a tabular regression model to predict water level fluctuation

Hi there, I'm a relatively new fastai user. I have been trying to build a model that predicts water level fluctuation from tabular meteorological data. I have run into several issues that I hope someone can help with. First, my model's RMSE is ~20, down from 60, after 20 epochs (I also tried using lr_find). Here are snippets of my data and code:

from fastai.tabular.all import *

# GWL is the pandas DataFrame; samplist and valist are the train/validation row indices (defined elsewhere)
cont_names = ['Evaporation', 'Rain', 'Temperature', 'Humidity']
cat_names = ['Name', 'TIME', 'x', 'y']
procs = [Normalize, Categorify]
y_names = 'Readings'
splits = (samplist, valist)
to = TabularPandas(GWL, procs=procs, cat_names=cat_names, cont_names=cont_names, splits=splits, y_names=y_names, y_block=RegressionBlock())
y = to.train.y
# Widen the observed target range a little so the output sigmoid is not saturated at the extremes
y_range = (y.min()*0.8, y.max()*1.2)
dls = to.dataloaders(32)
learn = tabular_learner(dls, layers=[200,100], opt_func=Adam, metrics=[rmse], y_range=y_range)
learn.fit_one_cycle(10)
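
One detail in the snippet above: scaling the bounds by 0.8 and 1.2 only widens the range when the targets are positive. If `Readings` contains negative values, `y.min()*0.8` moves the lower bound inward instead of outward. A minimal sketch of a margin-based alternative follows; `padded_range` is a hypothetical helper, not part of fastai, and the readings are made-up illustration values.

```python
def padded_range(values, margin=0.2):
    """Return (low, high) widened by `margin` times the data span on each side."""
    lo, hi = min(values), max(values)
    pad = (hi - lo) * margin
    return lo - pad, hi + pad

# Illustrative readings only, not taken from the original dataset.
readings = [-5.0, 0.0, 15.0]
y_range = padded_range(readings)
print(y_range)
```

The resulting tuple can be passed as `y_range=` to `tabular_learner` in the same way as before.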

Now to my questions:

  1. How can I get the RMSE down to something remotely reasonable? What am I doing wrong?
  2. I have tried different opt_func and loss_func settings, as well as different numbers of layers, nodes, and other hyperparameters (batch size, etc.), but the results didn't change much.
  3. With the settings kept the same, each run gives me a vastly different result at the start, but it soon improves and stalls at ~20 RMSE. Should I keep rerunning the code until I get a satisfactory result?
  4. I more or less understand why we need y_range, but I don't understand why we need RegressionBlock.
  5. Since my problem is a time series, I wasn't sure whether to treat my date column as continuous or categorical. I chose the latter, but your opinion would be much appreciated.
  6. Is there any need to feed the station name to the neural network, especially since I'm already giving it the x, y coordinates?
  7. My data has gaps, but I didn't like fastai's approach to filling them, so I used pandas interpolate, which fills them by linear interpolation. Could this somehow cause the high RMSE?
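
On question 5, one common option (a sketch, not the only valid approach) is to derive numeric features from the date instead of treating every distinct timestamp as its own category. With `Categorify`, a timestamp that appears only in the validation set is mapped to the `#na#` placeholder, so the model learns nothing specific about it. fastai's `add_datepart` can expand a date column into parts such as year, month, and day of week. The hypothetical helper below, written with only the standard library, shows a related idea: encoding the day of the year as a sine/cosine pair so that the seasonal cycle stays continuous across the year boundary.

```python
import calendar
import math
from datetime import date

def cyclical_day_of_year(d):
    """Map a date to (sin, cos) of its position in the year, so that
    31 December and 1 January end up close together."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_index = d.timetuple().tm_yday - 1  # 0 for 1 January
    angle = 2 * math.pi * day_index / days_in_year
    return math.sin(angle), math.cos(angle)

# Illustrative dates only.
print(cyclical_day_of_year(date(2021, 1, 1)))
print(cyclical_day_of_year(date(2021, 12, 31)))
```

The two resulting values would then be listed in `cont_names` rather than `cat_names`.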

Finally, thanks in advance for any help. I have had these questions for almost two weeks with no answer.

Since your problem is a time series, have you considered trying an LSTM rather than a tabular model? I've created a notebook for a quite similar problem from a Kaggle competition: Ventillator / Fastai [LB 0.168 no kfolds-no blend] | Kaggle
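
For context, a sequence model such as an LSTM is usually trained on fixed-length windows of past values rather than on independent rows. A minimal sketch of that windowing step is below; `make_windows` is a hypothetical helper, it is not taken from the linked notebook, and the sample values are made up.

```python
def make_windows(series, window):
    """Split a sequence into (inputs, target) pairs: each input holds `window`
    consecutive values and the target is the value that follows them."""
    pairs = []
    for start in range(len(series) - window):
        pairs.append((series[start:start + window], series[start + window]))
    return pairs

# Illustrative water-level values only.
levels = [1, 2, 3, 4, 5]
for inputs, target in make_windows(levels, window=3):
    print(inputs, "->", target)
```

With several stations, the windows would presumably be built per station so that no window mixes readings from different locations.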

In my personal experience, I haven't managed to tweak a fastai tabular model to do much better than the default one without adding more feature engineering.
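
As one example of what such feature engineering might look like for this kind of data, lagged weather values per station let a tabular model see recent history. The sketch below assumes pandas and reuses the `Name`, `TIME`, and `Rain` column names from the question; `add_lag_features` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def add_lag_features(df, group_col, time_col, value_col, lags=(1, 2)):
    """Add columns holding `value_col` from earlier time steps of the same group."""
    out = df.sort_values([group_col, time_col]).copy()
    for lag in lags:
        out[f"{value_col}_lag{lag}"] = out.groupby(group_col)[value_col].shift(lag)
    return out

# Illustrative rows only, not from the original dataset.
demo = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "TIME": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                            "2020-01-01", "2020-02-01"]),
    "Rain": [10.0, 0.0, 5.0, 3.0, 7.0],
})
demo = add_lag_features(demo, "Name", "TIME", "Rain", lags=(1,))
print(demo)
```

The first row of each station has no earlier value, so its lag column is NaN; those rows need to be dropped or filled before training.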

Hope it helps

Thank you very much. I'm pressed for time, so I can't explore other time-series methods right now, but I will definitely check it out once I finish this ongoing project.