This dataset contained 13k+ images of over 2,000 people, with some people having 1, 2, 3, 4 or more images.
I trimmed my dataset down to only the 610 people with 4 or more images each, for a total of 5,770 images.
Out of these I moved 10% of each person's images to the validation set. For some classes that meant a 3:1 split, leaving only one image in the validation set.
My Architecture:
At first I tried to create the model myself, as follows, but it was very slow and only got to 70% and 35% accuracy after a lot of training, so I changed my approach.
Ah yes, I see the problem. It's a mistake we've all made before! Your validation batches are being shuffled, and those shuffled batches are being used to create your convolutional features. You're then making predictions with those shuffled batches and comparing them to the unshuffled validation labels. Of course the two sets are totally unrelated!
You should add shuffle=False to both of your gen_batches calls to fix this problem. The fit() call will handle the shuffling for you automatically.
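To see why this matters, here is a toy NumPy sketch (not the actual Keras pipeline): even a perfect model scores at chance level if its predictions are compared against labels in a different order.

```python
import numpy as np

rng = np.random.RandomState(0)
labels = np.repeat(np.arange(10), 20)   # 200 validation samples, 10 classes, in folder order
perm = rng.permutation(len(labels))     # the order a shuffled generator would yield them in

# A "perfect" model predicts the true class of every image it is shown,
# but the images arrive shuffled:
preds = labels[perm]

acc_vs_unshuffled = (preds == labels).mean()       # compared to labels in folder order
acc_vs_shuffled = (preds == labels[perm]).mean()   # compared to the matching labels
print(acc_vs_unshuffled)   # roughly 0.1 -> chance level for 10 classes
print(acc_vs_shuffled)     # 1.0
```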
There’s something odd with your validation set. It can’t be a random sample, based on the epoch results you’ve shown here. Did you intend it to be a random sample?
@jeremy I am not sure I completely understand that question.
I am sharing the train and valid sets for Adrien Brody's images with you on Slack (the forum doesn't allow uploading zip files).
I basically moved 10% of images from train folder to the valid folder for each person.
This is what I did:
Copied the whole train folder to valid.
Deleted 90% of the files in the valid folder with the code below.
import os, fnmatch

matches = []
for root, dirnames, filenames in os.walk('valid'):
    filecount = len(filenames)
    for filename in fnmatch.filter(filenames, '*.jpg'):
        matches.append(os.path.join(root, filename))
    # keep only the last 10% in each class folder: delete the first 90%
    for i in range(int(filecount * 0.9)):
        os.remove(matches[i])
    matches = []
If a file is in valid, delete it from training with this code
import os, fnmatch, shutil

matches = []
for root, dirnames, filenames in os.walk('valid'):
    for filename in fnmatch.filter(filenames, '*.jpg'):
        matches.append(os.path.join(root, filename))
        check = root.replace("valid", "train")
        trainFileName = os.path.join(check, filename)
        if os.path.exists(trainFileName):
            print('deleting duplicate from train:', trainFileName)
            os.remove(trainFileName)
    matches = []
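After this step it's worth sanity-checking that no image remains in both splits. A minimal sketch (the helper name and the throwaway demo tree are mine, not from the original code):

```python
import os, tempfile

def class_files(root):
    """Map each class folder under `root` to the set of filenames it holds."""
    return {d: set(os.listdir(os.path.join(root, d)))
            for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))}

# Demo on a throwaway tree; point `base` at your own data directory instead.
base = tempfile.mkdtemp()
for split, files in [('train', ['a.jpg', 'b.jpg']), ('valid', ['c.jpg'])]:
    d = os.path.join(base, split, 'person_1')
    os.makedirs(d)
    for f in files:
        open(os.path.join(d, f), 'w').close()

train = class_files(os.path.join(base, 'train'))
valid = class_files(os.path.join(base, 'valid'))
leaks = {c: train[c] & valid.get(c, set()) for c in train}
print({c: f for c, f in leaks.items() if f})   # {} -> the splits are disjoint
```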
I know it's a bit of a roundabout way and could be made more efficient.
I changed my dataset to have at least 5 images per person, and I changed the way I was creating the validation set, using this code:
import os, fnmatch, math, shutil
import numpy as np

matches = []
for root, dirnames, filenames in os.walk('train'):
    for filename in fnmatch.filter(filenames, '*.jpg'):
        matches.append(os.path.join(root, filename))
    # use the .jpg count so we never index past the end of matches
    files2move = int(math.ceil(len(matches) * 0.2))
    dircheck = root.replace("train", "valid")
    if not os.path.exists(dircheck):
        os.makedirs(dircheck)
    # pick the files to move at random instead of taking the first few
    shuf = np.random.permutation(matches)
    for i in range(files2move):
        shutil.move(shuf[i], dircheck)
        print('moving file ' + shuf[i] + ' to ' + dircheck)
    matches = []
Now I move 20% of the files from train to valid at random, instead of taking the first few.
I see a minor improvement in my validation accuracy. It's gone up to 32% but is still stuck there.
For train I have 4,640 images in 423 classes (mean 10.9 images per class).
For valid I have 1,345 images in 423 classes (mean 3.17).
The median is 6 for train and 2 for valid.
One odd thing I noticed is that the largest class folder in train has 424 files; the next largest has 188. This might be skewing the numbers.
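Those per-class statistics can be recomputed straight from the folder tree. A small sketch (the helper name is mine; the throwaway demo tree just exercises the code):

```python
import os, tempfile
from statistics import mean, median

def class_counts(root):
    """Sorted list of image counts per class folder (hypothetical helper)."""
    return sorted(len(os.listdir(os.path.join(root, d)))
                  for d in os.listdir(root)
                  if os.path.isdir(os.path.join(root, d)))

# Demo on a throwaway tree; point `root` at your real train/ folder instead.
root = tempfile.mkdtemp()
for cls, n in [('a', 2), ('b', 6), ('c', 424)]:
    d = os.path.join(root, cls)
    os.makedirs(d)
    for i in range(n):
        open(os.path.join(d, '%d.jpg' % i), 'w').close()

counts = class_counts(root)
print(counts[-1], round(mean(counts), 1), median(counts))   # 424 144.0 6
```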
Okay, that makes sense. You have a very large number of classes, with few images per class. I would suggest removing classes with less than 20 validation images, if that’s possible.
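A sketch of that pruning step, assuming the train/valid folder layout used above (the function name is hypothetical, and the demo tree is mine):

```python
import os, shutil, tempfile

def prune_small_classes(train_dir, valid_dir, min_valid=20):
    """Remove any class whose valid folder holds fewer than `min_valid` images."""
    for cls in os.listdir(valid_dir):
        vdir = os.path.join(valid_dir, cls)
        if os.path.isdir(vdir) and len(os.listdir(vdir)) < min_valid:
            shutil.rmtree(vdir)
            tdir = os.path.join(train_dir, cls)
            if os.path.isdir(tdir):
                shutil.rmtree(tdir)

# Demo: 'small' has 2 valid images, 'big' has 25, so only 'big' survives.
base = tempfile.mkdtemp()
for cls, n in [('small', 2), ('big', 25)]:
    for split in ('train', 'valid'):
        d = os.path.join(base, split, cls)
        os.makedirs(d)
        for i in range(n):
            open(os.path.join(d, '%d.jpg' % i), 'w').close()

prune_small_classes(os.path.join(base, 'train'), os.path.join(base, 'valid'))
print(sorted(os.listdir(os.path.join(base, 'valid'))))   # ['big']
```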
All I'm doing is splitting the data into 60% train, 20% test and 20% validation.
The way I'm doing that is choosing 60%/20%/20% of each user's data randomly. This gives me a share of each user's data in training, validation and test (it will make more sense if you look at the pictures above).
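The per-user split can be sketched like this (the function name and the fixed seed are mine, for illustration):

```python
import numpy as np

def split_indices(n, seed=0):
    """Randomly split n sample indices into 60/20/20 train/test/valid."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n)
    n_train = int(n * 0.6)
    n_test = int(n * 0.2)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

# Applied independently to each user's samples, so every split sees every user:
train_idx, test_idx, valid_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(valid_idx))   # 60 20 20
```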
I made a 90x3 window of data to feed the conv network. I reshape this window to 30x3x3 for the conv-over-time stage and feed it to the LSTM network.
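The reshape described above, as a NumPy sketch:

```python
import numpy as np

window = np.arange(90 * 3).reshape(90, 3)   # one 90-timestep window with 3 channels
# Split it into 30 chunks of 3 timesteps x 3 channels for the conv-over-time stage:
chunks = window.reshape(30, 3, 3)
print(chunks.shape)                             # (30, 3, 3)
print(np.array_equal(chunks[0], window[:3]))    # True: first chunk = first 3 timesteps
```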