Handling of correlated data in fishing trips

hanshermansen · July 27, 2020, 6:59pm

I am training a random forest classifier to predict if I am going to catch a fish on a fishing trip.

Over the past 5 years I have recorded hourly data for every fishing trip. Each trip lasts 2-18 hours and thus contains 2-18 observations. Every observation consists of pressure, water temperature, water salinity, precipitation (and more), as well as an indicator of whether any fish were caught in that particular hour.

The challenge is that the observations within each trip are correlated with one another. This impacts the performance of my classifier significantly. How do I handle such correlated data? Is it enough to make sure that the train and test data do not contain observations from the same fishing trip?

Best regards,

Hans

marii · July 27, 2020, 8:49pm

I haven’t taken this course, but usually when handling data that is correlated over time like this, we just make sure that the test split is based on time, so the test set only contains data from later than the training set. Probably good to look here: https://www.fast.ai/2017/11/13/validation-sets/
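
To make the time-based split concrete, here is a minimal sketch in pandas. It assumes the hourly rows live in a DataFrame with a trip identifier and a timestamp; the column names `trip_id`, `timestamp`, `pressure` and `caught`, the helper `time_based_trip_split`, and the toy data are all made up for illustration and are not from the original post. The most recent trips are held out whole, so no trip ends up on both sides of the split.

```python
import pandas as pd

def time_based_trip_split(df, trip_col="trip_id", time_col="timestamp", valid_frac=0.2):
    """Hold out the most recent trips for validation, keeping each trip whole."""
    # One entry per trip: the time of its first observation, oldest first.
    trip_start = df.groupby(trip_col)[time_col].min().sort_values()
    # Number of trips to hold out (at least one).
    n_valid = max(1, int(round(len(trip_start) * valid_frac)))
    valid_trips = trip_start.index[-n_valid:]
    is_valid = df[trip_col].isin(valid_trips)
    return df[~is_valid], df[is_valid]

# Toy data (hypothetical): 5 trips, 3 hourly observations each.
rows = []
for trip_id in range(5):
    start = pd.Timestamp("2020-01-01") + pd.Timedelta(days=30 * trip_id)
    for hour in range(3):
        rows.append({
            "trip_id": trip_id,
            "timestamp": start + pd.Timedelta(hours=hour),
            "pressure": 1010.0 + hour,
            "caught": (trip_id + hour) % 2,
        })
df = pd.DataFrame(rows)

train_df, valid_df = time_based_trip_split(df, valid_frac=0.2)
```

Splitting on trips rather than on individual rows matters here: a plain row-level cutoff could cut a trip in half and leak its correlated hours into both sets.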

It would also be good to look at the different models and see which ones generalize outside of the training data (that is, which ones can be used to predict the future).
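
One way to check how well a model generalizes to trips it has never seen is grouped cross-validation, which also covers the question of keeping a trip's observations out of both train and test. The sketch below uses scikit-learn's `GroupKFold` with a random forest; this is an assumption about tooling, not something either post specifies, and the data are synthetic (a per-trip offset plus hourly noise, to mimic correlated rows within a trip).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the fishing data: 30 trips, 6 hourly rows each.
n_trips, hours_per_trip = 30, 6
groups = np.repeat(np.arange(n_trips), hours_per_trip)  # trip id for every row

# A shared per-trip offset makes rows from the same trip correlated.
trip_effect = rng.normal(size=(n_trips, 3))
X = trip_effect[groups] + 0.1 * rng.normal(size=(len(groups), 3))
y = (X[:, 0] + 0.5 * rng.normal(size=len(groups)) > 0).astype(int)

# GroupKFold never places rows from one trip in both train and test folds.
cv = GroupKFold(n_splits=5)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)

# Trips shared between train and test in each fold (should all be empty).
shared = [set(groups[tr]) & set(groups[te]) for tr, te in cv.split(X, y, groups)]

print("accuracy per fold:", scores)
```

Swapping `model` for other classifiers and comparing the grouped scores gives a rough picture of which ones hold up on unseen trips. Note that `GroupKFold` ignores time order, so it complements rather than replaces a time-based holdout.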