How to select training and validation sets

kodzaks · February 7, 2019, 7:46pm

Hello,

I am using files that have been rated by observers to create training and validation sets. Each observer assigns a file to one specific category (let’s say we have 4 categories).

The problem is that observers often disagree. I have files that are rated 100% category A, or 100% category B, but many files are split, e.g. 75% category A, 15% category B, and 10% category C.

My question is: when composing the training and validation sets, which files should I select? Is it better to use only the clear-cut files, i.e. those rated 100% for one category, or should I add more not-so-clear-cut files, such as 50-50 ones? Or should I just select randomly, so that the files in the training and validation sets are representative of the entire test data?
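To make the options concrete, here is a minimal Python sketch of what I mean. Everything in it is made up for illustration: the file names, the vote fractions, the `ratings` dictionary, and the helper names `agreement`, `clear_cut_only`, and `random_split` are all hypothetical, and I am assuming the ratings are stored as the fraction of observers who chose each category.

```python
import random

# Hypothetical ratings: file name -> fraction of observers choosing each category.
ratings = {
    "file_01": {"A": 1.00},
    "file_02": {"B": 1.00},
    "file_03": {"A": 0.75, "B": 0.15, "C": 0.10},
    "file_04": {"A": 0.50, "D": 0.50},
    "file_05": {"C": 1.00},
    "file_06": {"D": 0.60, "B": 0.40},
}

def agreement(votes):
    """Share of observers behind the most popular category."""
    return max(votes.values())

def clear_cut_only(ratings, threshold=1.0):
    """Keep only files whose top category reaches the agreement threshold."""
    return sorted(
        name for name, votes in ratings.items() if agreement(votes) >= threshold
    )

def random_split(ratings, val_fraction=0.33, seed=0):
    """Shuffle all files and split them, ignoring observer agreement."""
    names = sorted(ratings)
    random.Random(seed).shuffle(names)
    n_val = max(1, round(len(names) * val_fraction))
    return names[n_val:], names[:n_val]  # (training files, validation files)

clear = clear_cut_only(ratings)                  # only unanimous files
mostly_clear = clear_cut_only(ratings, 0.7)      # also admits the 75% file
train, val = random_split(ratings)               # random, representative selection
```

Lowering `threshold` corresponds to letting more of the not-so-clear-cut files in, while `random_split` corresponds to the purely random selection.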

Thank you in advance.