I am training a random forest classifier to predict whether I will catch a fish on a fishing trip.
Over the past 5 years I recorded hourly data for every fishing trip. Each trip lasts 2-18 hours and thus contains 2-18 observations. Every observation consists of pressure, water temperature, water salinity, precipitation (and more), as well as an indicator of whether any fish were caught in that particular hour.
The challenge is that the observations within each trip are correlated with one another. This significantly affects the performance of my classifier. How do I handle such correlated data? Is it enough to make sure that the train and test data do not contain observations from the same fishing trip?
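
To make the last question concrete, here is a minimal sketch of the trip-wise split I have in mind. The helper `split_by_trip` and the made-up `trip_ids` list are purely illustrative (not my real data or code); as far as I know, scikit-learn offers the same idea through `GroupShuffleSplit` and `GroupKFold`, which take a `groups` argument.

```python
import random

def split_by_trip(trip_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no trip appears in both sets.

    trip_ids: one trip identifier per hourly observation (hypothetical layout).
    """
    unique_trips = sorted(set(trip_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_trips)
    # Hold out whole trips, not individual hours
    n_test = max(1, round(len(unique_trips) * test_fraction))
    test_trips = set(unique_trips[:n_test])
    train_idx = [i for i, t in enumerate(trip_ids) if t not in test_trips]
    test_idx = [i for i, t in enumerate(trip_ids) if t in test_trips]
    return train_idx, test_idx

# Made-up example: 10 trips, each with 2-18 hourly observations
rng = random.Random(42)
trip_ids = []
for trip in range(10):
    trip_ids.extend([trip] * rng.randint(2, 18))

train_idx, test_idx = split_by_trip(trip_ids)
train_trips = {trip_ids[i] for i in train_idx}
test_trips = {trip_ids[i] for i in test_idx}
```

With this split, every hour of a given trip lands entirely in the training set or entirely in the test set, so the test score is not inflated by near-duplicate hours from the same trip.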