High variation between random seeds -- how to train on all data / choose a seed?

Hello. I’m trying to create a classifier that categorizes different Yelp reviews about career coaching businesses.

I’ve manually classified about 100 reviews into 5 relatively even classes: career change, current job, interview, lost job, and resume. The goal is to train the model on these 100 classified reviews and then extrapolate the results to ~2k unclassified reviews.

Because there isn’t much data, the learner’s max accuracy (within 13 cycles) varies wildly with the random seed used. I’ve seen numbers as low as 32% and as high as 67%.

My idea was to run a for loop through a bunch of random seeds to get an idea of the true accuracy of the learner, and then train the learner on all 100 reviews (without a valid set) and extrapolate the results to the unclassified reviews (the test set).
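Something like this is what I had in mind for the seed loop (just a sketch; `train_and_score` is a stand-in for rebuilding the split and learner with that seed, training for 13 cycles, and returning the best validation accuracy):

```python
import numpy as np

def train_and_score(seed):
    """Stand-in: rebuild the train/valid split with this seed, train the
    learner for 13 cycles, and return its best validation accuracy."""
    ...

accs = [train_and_score(seed) for seed in range(20)]
print(f"mean={np.mean(accs):.3f}  std={np.std(accs):.3f}  "
      f"min={min(accs):.3f}  max={max(accs):.3f}")
```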

Is there a way to do this? I know that training a NN is different from training a normal ML model, so you can’t just call model.fit(X, y). Are the weight updates in the NN based only on the train_loss within the cv (with the valid loss just being something displayed for us to personally evaluate the model), or does the NN take the validation set into account when updating its weights?
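In other words, is what the learner does effectively this (a rough PyTorch-style sketch of my understanding, where nothing from the validation set is ever backpropagated)?

```python
import torch

def fit(model, train_dl, valid_dl, loss_func, opt, n_epochs):
    for epoch in range(n_epochs):
        model.train()
        for xb, yb in train_dl:
            loss = loss_func(model(xb), yb)  # train_loss
            loss.backward()                  # gradients come from train_loss only
            opt.step()
            opt.zero_grad()

        model.eval()
        with torch.no_grad():                # no gradients computed here
            valid_losses = [loss_func(model(xb), yb).item() for xb, yb in valid_dl]
        print(epoch, sum(valid_losses) / len(valid_losses))  # valid loss only reported?
```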

If it’s not possible, what’s the best practice for training the neural network for extrapolation to an unlabeled test set? Selecting the “most accurate” random seed feels wrong, and yet choosing one at random (that could potentially have a very uneven split and low accuracy) doesn’t seem optimal either.

I’d recommend reading through Rachel’s How (and why) to pick a good validation set article. What that tells me is that your validation set is not very representative of your test set. Which is okay! A random split isn’t always the best. Also: 100 reviews is not a lot of data to go on when extrapolating to another 2,000 reviews. This is a technique called semi-supervised learning, and the general rule of thumb is that you extrapolate labels for roughly 1/3 of the size of your training dataset, further train on those (mostly the ones it got wrong), and then go from there.

So in summary: your training data may not be representative of what you are testing on, and you may want to slowly build your training data instead if you can.
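One round of that looks roughly like this (just a sketch of the most common variant, where you add the predictions the model is most confident about first; `predict_proba` and `classes` are stand-ins for whatever your model exposes):

```python
import numpy as np

def pseudo_label_round(predict_proba, classes, labeled_texts, labeled_labels, unlabeled_texts):
    """Label ~1/3 of the size of the current training set, taking the
    unlabeled reviews the model is most confident about first."""
    probs = np.asarray(predict_proba(unlabeled_texts))   # shape (n_unlabeled, n_classes)
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)

    n_new = max(1, len(labeled_texts) // 3)              # the ~1/3 rule of thumb
    pick = set(np.argsort(-conf)[:n_new].tolist())       # most confident predictions

    new_texts  = [t for i, t in enumerate(unlabeled_texts) if i in pick]
    new_labels = [classes[pred[i]] for i in sorted(pick)]
    remaining  = [t for i, t in enumerate(unlabeled_texts) if i not in pick]
    return labeled_texts + new_texts, labeled_labels + new_labels, remaining
```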


I read through the article and it was helpful, thank you. It seems that a non-random train-valid split would be better; is it okay for me to manually create the split?
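For example, I could hand-pick a handful of reviews per class for validation; in code that would just be something like this (a sketch, assuming the 100 labeled reviews sit in a DataFrame with a `label` column and that `valid_ids` is my hand-picked list of row indices):

```python
def manual_split(df, valid_ids):
    """Split the labeled reviews using a hand-picked list of validation row indices."""
    in_valid = df.index.isin(valid_ids)
    train_df, valid_df = df.loc[~in_valid], df.loc[in_valid]
    # Sanity check: the class balance should look similar on both sides.
    print(train_df['label'].value_counts())
    print(valid_df['label'].value_counts())
    return train_df, valid_df
```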

One thing I’ll add is that I’m mostly interested in the distribution of the classes in the test set (the 2,000 unlabeled reviews); my idea was to give the labeled train set even classes so as not to introduce bias, then make predictions on the test set with the trained model and look at the resulting distribution (which would probably not be even and would therefore give insights).
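At the end, all I’d really do with the test-set predictions is count them, something like this (a sketch, where `preds` stands for the predicted class names for the ~2k unlabeled reviews):

```python
from collections import Counter

def class_distribution(preds):
    """Share of the unlabeled reviews assigned to each class."""
    counts = Counter(preds)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}
```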

I haven’t involved the test set at all yet; I’ve been working exclusively with the subset of 100 labeled reviews. Wouldn’t it rather indicate that the validation set isn’t representative of the train set?

If I’m understanding this correctly, going forward I should do something like the following (I’ve tried to write it out as a rough code skeleton after the list):

  1. Within the 100 labeled reviews, create a non-random train-valid split where the validation set is a good representation of the train set.

  2. Train the learner using this train-valid split

  3. Predict the labels of ~33 unlabeled reviews

  4. Create a representative train-valid split of the now 133 reviews (what are the guidelines on how to incorporate the 33 pseudo-labeled reviews? proportionally between train and valid? can I keep the same split I had previously with the 100 actually-labeled reviews and just add in the pseudo-labeled ones?)

  5. Train the learner again (a new learner or the same learner?) with the now 133 labeled reviews

  6. Predict the labels of ~44 unlabeled reviews

  7. Repeat until all reviews are classified
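Put as code, I think what I’m proposing is roughly this skeleton (all the helpers are placeholders for steps 1-3 above; `pseudo_label` would be something like the pseudo-labelling round you described):

```python
import pandas as pd

def semi_supervised_loop(labeled_df, unlabeled_df, representative_split, fit_learner, pseudo_label):
    """Skeleton of steps 1-7: grow the labeled set by ~1/3 per round until
    every review has a label. All the real work happens in the passed-in helpers."""
    while len(unlabeled_df) > 0:
        train_df, valid_df = representative_split(labeled_df)              # steps 1 / 4
        learn = fit_learner(train_df, valid_df)                            # steps 2 / 5
        n_new = max(1, len(labeled_df) // 3)
        new_rows, unlabeled_df = pseudo_label(learn, unlabeled_df, n_new)  # steps 3 / 6
        labeled_df = pd.concat([labeled_df, new_rows], ignore_index=True)  # step 7: repeat
    return labeled_df
```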

What does the “mostly the ones it got wrong” part mean (how do you train more on incorrect predictions)?

If you have any recommended resources/guides for semi-supervised learning, that would be very helpful; I’ve read this.

One thing I’m worried about is that not all of the reviews are super clear-cut in terms of class; the manually labeled reviews were deliberately chosen because they were strong examples of their classes. That’s why the idea was to train the learner on just the strong examples and then extrapolate to the rest of the data, which may contain weaker examples, all at once.

I’m worried that adding unclear (or just unevenly distributed) pseudo-labeled reviews in the first few cycles of semi-supervised learning will bias the model towards those classes, even if those classes wouldn’t have been as prevalent overall according to the predictions of the original model (which is trained only on strong class examples).

Given that I’m interested in the final distribution of the classes and that the currently labeled reviews are strong examples, is semi-supervised learning still the way to go?

Thanks in advance and for your first reply. It was clear, concise, and helpful.