Which competition are you referring to? Since you are posting to this thread, I assume it’s for Dog Breed? The code can vary quite a bit depending on which one it is.
Generally speaking, I always start by using the sample_submission.csv that Kaggle provides for each competition as my df, with a simple line like submission = pd.read_csv('sample_submission.csv'). From there you just replace the ids in the sub with your test ids (so they are sorted correctly and aligned with your predictions), which you can grab from data.test_dl.dataset.fnames. Then you fill in the rest of the columns in the df with your predictions.
Check out this link. This really helped me get a submission generated. Don’t worry so much about the fact that it is for vgg19. Focus on the exporting to CSV steps near the bottom.
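As a rough sketch of the steps described above (load the sample submission, replace the ids with the test ids in prediction order, fill in the class columns): the tiny in-memory CSV, the three breed columns, the ids and the probabilities below are all made up for illustration. In the real workflow the CSV would come from pd.read_csv('sample_submission.csv'), the ids from data.test_dl.dataset.fnames, and probs from the model.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for Kaggle's sample_submission.csv (hypothetical contents).
sample_csv = io.StringIO(
    "id,beagle,boxer,pug\n"
    "aaa,0.33,0.33,0.33\n"
    "bbb,0.33,0.33,0.33\n"
)
submission = pd.read_csv(sample_csv)

# Stand-ins for the test ids and the model's predictions.
# Row i of probs is assumed to belong to test_ids[i].
test_ids = ["bbb", "aaa"]
probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1]])

# Overwrite the ids first so they line up with the prediction rows,
# then fill in every class column.
class_cols = list(submission.columns[1:])
submission["id"] = test_ids
submission[class_cols] = probs

# Render as CSV text; pass a file name instead to write to disk.
out = submission.to_csv(index=False)
print(out)
```

Passing a path such as 'submission.csv' as the first argument of to_csv writes the file to disk instead of returning a string.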
Still getting the error!
Here is my entire script (just the actual setup parts):
from fastai.conv_learner import *
PATH = "data/dogbreed/"
arch=resnext101_64
sz=224
bs=64
label_csv = f'{PATH}labels.csv'
n = len(list(open(label_csv)))-1
#val_idxs = get_cv_idxs(n, val_pct=1e-4)
#val_idxs = get_cv_idxs(n, val_pct=0.01)
val_idxs = [0]
n, len(val_idxs), val_idxs
# (10222, 1, [0])
label_df = pd.read_csv(label_csv)
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', bs=bs, tfms=tfms,
val_idxs=val_idxs, suffix = '.jpg', test_name = 'test',
num_workers=4)
fn = PATH+data.val_ds.fnames[0]; fn
# 'data/dogbreed/train/000bec180eb18c7604dcecc8fe0dba07.jpg'
## why is it going to my train directory maybe I need to redistribute images
# to suit reconfiguration of data sets??
# to be sure check just one image in val set, an error is generated if I attempt to
# reference another image
data.val_ds.fnames[1]
# ---------------------------------------------------------------------------
# IndexError Traceback (most recent call last)
# <ipython-input-12-be174aefb09c> in <module>()
# ----> 1 data.val_ds.fnames[1]
# IndexError: index 1 is out of bounds for axis 0 with size 1
fn = PATH+data.test_ds.fnames[0]; fn
# 'data/dogbreed/test/fd1a7be32f10493735555e62913c0841.jpg'
fn = PATH+data.trn_ds.fnames[0]; fn
# 'data/dogbreed/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg'
def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test',
                                        val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
    return data if sz>300 else data.resize(340, 'tmp')
sz, bs
# (224, 64)
data = get_data(sz, bs)
learn = ConvLearner.pretrained(arch, data, precompute=True, ps=0.5)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-47-0708a7145fb8> in <module>()
----> 1 learn = ConvLearner.pretrained(arch, data, precompute=True, ps=0.5)
~/fastai/courses/dl1/fastai/conv_learner.py in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, **kwargs)
92 def pretrained(cls, f, data, ps=None, xtra_fc=None, xtra_cut=0, **kwargs):
93 models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg, ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut)
---> 94 return cls(data, models, **kwargs)
95
96 @property
~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
85 elif self.metrics is None:
86 self.metrics = [accuracy_multi] if self.data.is_multi else [accuracy]
---> 87 if precompute: self.save_fc1()
88 self.freeze()
89 self.precompute = precompute
~/fastai/courses/dl1/fastai/conv_learner.py in save_fc1(self)
132 self.fc_data = ImageClassifierData.from_arrays(self.data.path,
133 (act, self.data.trn_y), (val_act, self.data.val_y), self.data.bs, classes=self.data.classes,
--> 134 test = test_act if self.data.test_dl else None, num_workers=8)
135
136 def freeze(self):
~/fastai/courses/dl1/fastai/dataset.py in from_arrays(cls, path, trn, val, bs, tfms, classes, num_workers, test)
296 ImageClassifierData
297 """
--> 298 datasets = cls.get_ds(ArraysIndexDataset, trn, val, tfms, test=test)
299 return cls(path, datasets, bs, num_workers, classes=classes)
300
~/fastai/courses/dl1/fastai/dataset.py in get_ds(fn, trn, val, tfms, test, **kwargs)
264 def get_ds(fn, trn, val, tfms, test=None, **kwargs):
265 res = [
--> 266 fn(trn[0], trn[1], tfms[0], **kwargs), # train
267 fn(val[0], val[1], tfms[1], **kwargs), # val
268 fn(trn[0], trn[1], tfms[1], **kwargs), # fix
~/fastai/courses/dl1/fastai/dataset.py in __init__(self, x, y, transform)
160 def __init__(self, x, y, transform):
161 self.x,self.y=x,y
--> 162 assert(len(x)==len(y))
163 super().__init__(transform)
164 def get_x(self, i): return self.x[i]
AssertionError:
Yes I do personally use it in multiple stages just to make sure my learning rate is still looking good based on the weight updates. Other “transitions” like you mention with resizing the images can also be a good time to use it.
BTW, I highly recommend going through the notebook that Jeremy recently created called cifar10, as it should give you a good intuition about how best to use lr_finder, cycle_len, cycle_mult and resizing.
Try deleting your tmp directory. If that doesn’t work then it’s possible you have found a bug and maybe you could dig into it to see what is happening in get_ds. I would start with just deleting your tmp directory though.
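If it helps, clearing the cache can be done from the notebook itself. This is a minimal sketch, assuming the cached files live in a tmp folder directly under PATH. The helper name clear_tmp is made up here, and the demo runs against a throwaway directory rather than the real data folder.

```python
import os
import shutil
import tempfile

def clear_tmp(path):
    """Delete the 'tmp' folder under path, if present. Returns True if removed."""
    tmp_dir = os.path.join(path, "tmp")
    if os.path.isdir(tmp_dir):
        shutil.rmtree(tmp_dir)
        return True
    return False

# Demo on a throwaway directory standing in for the data folder.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "tmp", "340"))
removed_first = clear_tmp(demo_root)   # tmp existed, so it is removed
removed_again = clear_tmp(demo_root)   # nothing left to remove
shutil.rmtree(demo_root)               # clean up the demo directory
print(removed_first, removed_again)
```

In the setup posted above this would presumably be clear_tmp(PATH), run before calling get_data and ConvLearner.pretrained again so that anything cached for an older train/validation split gets rebuilt.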
Also if you wanted a line by line example, Jeremy did post the code that could be used to create a submission file specifically for dog breed at the top of the Lesson 3 Wiki.
The tmp & models directories are inside your data/dogbreed/ folder. Could you check there?
Hi @KevinB
I am new to Jupyter Notebook and the Ubuntu environment. How can I debug into a called module using these tools?
That makes sense, thanks! So data.test_dl.dataset.fnames should return a list of file IDs that the output values line up with, or does it just contain a file index? If the former, then I think I just need to match up the csv ids to the fnames result, save the csv, and submit it.
18.81s/it per 64 images of size 350. Really slow indeed.
Thanks
There are 4 models in that directory, but they are all a week old and dated similarly, so I guess they come from the work I did last week!
@binga let me know that I should look for tmp under my data/dogbreed folder, and yes there is a tmp there. I removed all of the files, and bingo - it works now!
Thanks
I am trying the trick of training with all of the data, but as you say it’s a bit odd to have no validation, and it feels like cheating because I am training with what might be tested against.
But anyway, when I come to validation time under this scenario there is something I need to do - about labels - is this something you are aware of?
log_preds, y = learn.TTA()
probs = np.exp(log_preds)
accuracy(log_preds, y), metrics.log_loss(y, probs)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-47692796ae62> in <module>()
1 log_preds, y = learn.TTA()
2 probs = np.exp(log_preds)
----> 3 accuracy(log_preds, y), metrics.log_loss(y, probs)
~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/metrics/classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
1652 raise ValueError('y_true contains only one label ({0}). Please '
1653 'provide the true labels explicitly through the '
-> 1654 'labels argument.'.format(lb.classes_[0]))
1655 else:
1656 raise ValueError('The labels array needs to contain at least two '
ValueError: y_true contains only one label (19). Please provide the true labels explicitly through the labels argument.
The syntax for sklearn.metrics.accuracy_score is accuracy_score(y_true, y_pred) and not the other way around. Could you try flipping log_preds and y?
Edit: Also, for accuracy, you’ll have to feed the labels not the probabilities!
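For what it is worth, the traceback above points at metrics.log_loss rather than accuracy: with val_idxs = [0] the validation set holds a single image, so y contains only one label, and the error message itself suggests passing labels. A small sketch of that, using made-up probabilities and assuming 120 classes (that count is an assumption here) with the true label 19 as in the message:

```python
import numpy as np
from sklearn import metrics

n_classes = 120  # assumed number of breeds, not taken from the traceback

# One validation image whose true class index is 19.
y = np.array([19])

# Made-up prediction: half of the mass on class 19, the rest spread evenly.
probs = np.full((1, n_classes), 0.5 / (n_classes - 1))
probs[0, 19] = 0.5

# Passing labels tells log_loss about every class, even those absent from y.
loss = metrics.log_loss(y, probs, labels=np.arange(n_classes))
print(loss)
```

With a single validation image the number is not very informative either way, so this mostly serves to get past the ValueError.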
Just wanted to thank everyone for your contributions to this “dog breed identification” thread!
I have today made my first-ever Kaggle submissions, and immediately got to #21 on the leaderboard. Would have been impossible without your help.
What I did with my three submissions:
(1) follow lesson instructions with model_1
(2) follow lesson instructions with model_2
(3) averaged the predictions from those two
Thanks again!
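For anyone curious about step (3), averaging the predictions of two models can be as simple as the sketch below. The arrays are made-up stand-ins, and it assumes both models predict probabilities for the same test images, in the same row order and with the same class order. Whether the averaging above was done on probabilities or on log-probabilities is not stated, so treat this as one possible reading.

```python
import numpy as np

# Made-up class probabilities from two models for the same 2 test images.
probs_model_1 = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3]])
probs_model_2 = np.array([[0.4, 0.4, 0.2],
                          [0.1, 0.7, 0.2]])

# Simple unweighted average; each row still sums to 1.
probs_avg = (probs_model_1 + probs_model_2) / 2
print(probs_avg)
```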
If you are training with all of the data (including the validation set) there isn’t much point in doing any prediction on the validation set, which is what learn.TTA() is doing, because you effectively no longer have a validation set. In this case you should just go directly to learn.TTA(test=True) to predict on the test set.
I assume you have already trained on your training set only while setting aside a validation set (i.e. 20%) and tuned your parameters based on the training/validation losses you were seeing? So now you are just going back and repeating the same steps but now without a validation set?
Yes it returns a list of paths to each file which includes the fnames id. So you can just extract the fnames from those and use them directly for your id column of the submission file since your predictions are already following the same ordering as those fnames and luckily the classes are already in the correct order / alphabetized.
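A minimal sketch of that extraction, plus a quick sanity check on the class order. The paths, class names and header below are hypothetical stand-ins, and the two helper names are made up for this example; in practice the paths would come from data.test_dl.dataset.fnames, the classes from data.classes, and the header from the sample submission.

```python
import os

def ids_from_fnames(fnames):
    """Turn paths like 'test/abc123.jpg' into ids like 'abc123'."""
    return [os.path.splitext(os.path.basename(f))[0] for f in fnames]

def check_class_order(classes, header):
    """Return True if the class order matches the CSV's class columns."""
    return list(classes) == list(header[1:])

# Hypothetical values for illustration only.
fnames = ["test/fd1a7be3.jpg", "test/00a3edd2.jpg"]
classes = ["affenpinscher", "afghan_hound", "african_hunting_dog"]
header = ["id", "affenpinscher", "afghan_hound", "african_hunting_dog"]

ids = ids_from_fnames(fnames)
print(ids)
print(check_class_order(classes, header))
```

The ids keep the order of fnames, which is what keeps them aligned with the prediction rows.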
Yes, that’s what I am doing. But the way you ask the question makes me wonder if I am going about this the right way. I have for sure got the best I can with a validation set, so I used the same approach to train a brand new model with no validation set. Is this correct, or should I take the existing model I have and retrain it in the same manner with the entire set?