I have to use an LSTM for a time-series problem because my internship manager demands it! Need some thoughts on the approach, please! (long post)

Hey folks, I have a discrete time-series dataset with 5 independent variables, and I have to predict the dependent variable one day ahead, i.e. predict tomorrow's value from everything observed today. I'm at the point where I don't know what other features to add before diving into the LSTM, and I also have some questions that may sound nonsensical, so please bear with me.

So far I did:

cleaned the dataset by imputing averages for missing values (less than 1% of the data is missing), created a bunch of date-related categorical features (start-of-month, weekday vs. weekend, stuff like that), and built moving-average, expanding-mean, and expanding-median features. I also added lags (1, 2, 3, …, n) of the dependent variable as features. The dataset looks good :stuck_out_tongue: but the tutorials online have me totally confused about how to proceed.
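For concreteness, here's roughly how I built those features. This is a minimal pandas sketch; the file name, the date column, and the target column `y` are placeholders for my actual data:

```python
import pandas as pd

# Assumed layout: a DataFrame indexed by date, target in column "y"
# (both names are placeholders for the real dataset).
df = pd.read_csv("series.csv", parse_dates=["date"], index_col="date")

# Lagged copies of the dependent variable (lags 1..7 here)
for k in range(1, 8):
    df[f"y_lag_{k}"] = df["y"].shift(k)

# Rolling and expanding statistics, shifted by one day so each row
# only sees information available *before* the day being predicted
df["y_ma_7"] = df["y"].shift(1).rolling(7).mean()
df["y_exp_mean"] = df["y"].shift(1).expanding().mean()
df["y_exp_median"] = df["y"].shift(1).expanding().median()

# Simple categorical date flags
df["is_month_start"] = df.index.is_month_start.astype(int)
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)

df = df.dropna()  # drop rows lost to the shift/rolling windows
```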

Questions:
1. Am I supposed to check correlations across all columns of the dataset and drop features that are highly correlated? [I believe so, since those features might be redundant, but if an LSTM is capable of taking care of that on its own, I'd rather not bother ¯\_(ツ)_/¯]
2. And am I supposed to check the stationarity of every feature (dependent and independent)? If no, well and good :), but if yes, am I supposed to transform every non-stationary feature to stationary, using first or second differences or a log transformation? Can't an LSTM take care of this stuff? :confused: (I've put a sketch of the check I had in mind just below this list.)
3. I added lags and moving averages to take care of the autoregressive and MA components, and the date features should mostly take care of seasonality. Am I missing anything else here? Should I use any autoML tooling (like featuretools) for feature extraction?
4. My next step would be to apply a MinMax scaler and shove everything into an LSTM, but I know I can't apply a MinMax scaler directly to the categorical variables I extracted from the dates. From some preliminary research I found that an embedding layer can turn one-hot vectors into dense vectors that eventually get normalized, but I have no clue what that actually means. Care to throw some light on it? I know what one-hot encoding is; I just couldn't work out what an embedding layer means in this context. (My best guess at the wiring is the second sketch after this list.)
5. Is ReLU my best bet for an activation function? tanh could cause vanishing-gradient issues, and sin could cause the opposite (exploding ones); are there better activation functions I could experiment with?
6. Finally, any good resources that talk about optimizing LSTM hyperparameters? Any thoughts on the number of layers, or architectures that have produced consistent results for you?
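Regarding question 2, this is the kind of stationarity check I had in mind: a sketch using the Augmented Dickey-Fuller test from `statsmodels` on the placeholder DataFrame `df` from the earlier sketch (the column list and the 0.05 cutoff are just assumptions on my part):

```python
from statsmodels.tsa.stattools import adfuller

# ADF test per column; H0 = the series has a unit root (non-stationary)
for col in ["y", "y_ma_7"]:  # placeholder columns from the sketch above
    pvalue = adfuller(df[col].dropna())[1]  # second return value is the p-value
    if pvalue > 0.05:
        # Common fixes: first differences, or a log transform followed by differencing
        df[f"{col}_diff"] = df[col].diff()
        print(f"{col}: p={pvalue:.3f} -> non-stationary, added first difference")
    else:
        print(f"{col}: p={pvalue:.3f} -> looks stationary")
```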

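And for question 4, this is my best guess at what the embedding-layer approach would look like: a Keras sketch, not something I've validated, where the window length, feature counts, and embedding size are all assumptions. Continuous features get MinMax-scaled outside the model as usual; each categorical goes in as integer codes through its own `Embedding` layer, and everything is concatenated before the LSTM:

```python
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS = 30  # length of each input window (assumption)
N_CONT = 12     # number of continuous, already-scaled features (assumption)
N_DOW = 7       # cardinality of a day-of-week categorical

# Continuous features: MinMax-scaled outside the model
cont_in = keras.Input(shape=(TIMESTEPS, N_CONT), name="continuous")

# Categorical feature: integer codes 0..6, one per timestep. The Embedding
# layer maps each code to a learned 3-d dense vector, replacing one-hot
# encoding; its weights are trained jointly with the rest of the network.
dow_in = keras.Input(shape=(TIMESTEPS,), dtype="int32", name="day_of_week")
dow_emb = layers.Embedding(input_dim=N_DOW, output_dim=3)(dow_in)

# Concatenate along the feature axis -> (batch, TIMESTEPS, N_CONT + 3)
x = layers.Concatenate()([cont_in, dow_emb])
x = layers.LSTM(64)(x)
out = layers.Dense(1)(x)  # one-day-ahead prediction

model = keras.Model([cont_in, dow_in], out)
model.compile(optimizer="adam", loss="mse")
model.summary()
```

If I've understood right, the embedding weights are learned along with the LSTM, so there's no separate scaling step for the categoricals; correct me if that's wrong.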
Appreciate any thoughts on these questions, or on LSTMs in general. Thank you for reading :slight_smile: