In notebook 6. Little trick: typing a class or function name in a cell gives you where it comes from.
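Outside a notebook you can get roughly the same information from plain Python. A minimal sketch using only the standard library; the helper name `where_from` is made up for illustration:

```python
import inspect
from json import JSONDecoder

def where_from(obj):
    """Return (module name, source file) for a class or function."""
    try:
        path = inspect.getsourcefile(obj)
    except TypeError:
        # Raised for built-ins implemented in C, which have no Python source file.
        path = None
    return obj.__module__, path

module, path = where_from(JSONDecoder)
print(module, path)
```

In a notebook, appending `??` to the name shows the source as well.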
Regarding processors that map categories to numerical labels: How do you handle the case of online streaming data, which may contain new categories not seen yet in previous data used for training?
FWIW, the latest 08.ipynb says "We use the ListContainer class from 08…"
Sorry if I missed this, but does the split function check to make sure every label has corresponding images in both training and validation sets?
In the LabeledData class, what is the @classmethod decorator doing?
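As a rough illustration of what `@classmethod` does in general: the method receives the class itself (`cls`) instead of an instance, which makes it handy as an alternative constructor. The classes below are toy stand-ins, not the notebook's actual `LabeledData`, and `from_pairs` is a hypothetical name:

```python
class ToyLabeledData:
    def __init__(self, x, y):
        self.x, self.y = x, y

    @classmethod
    def from_pairs(cls, pairs):
        # cls is whichever class the method was called on, so subclasses
        # get instances of themselves back without overriding this method.
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        return cls(xs, ys)

class ToySubclass(ToyLabeledData):
    pass

ld = ToySubclass.from_pairs([("a.jpg", "cat"), ("b.jpg", "dog")])
print(type(ld).__name__, ld.x, ld.y)
```

A `@staticmethod`, by contrast, receives neither the instance nor the class.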
Does PyTorch have any such built-in functionality for splitting by ids/funcs?
static method
Is anyone else a bit worried that the creation of the vocab is implicit? If you reorder the labeling lines (training with valid) you get different label values.
This isn't a problem if you train but if you think about inference you might just get that wrong easily.
Creating an "Other" class kind of sounds like the "None" that we talked about last week that didn't work.
Another way to not do much better than random is normalizing the validation set by its own standard deviation and mean.
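The usual alternative is to compute the statistics once on the training set and reuse them for validation. A minimal sketch with made-up numbers and a hypothetical `normalize` helper:

```python
from statistics import mean, pstdev

train = [2.0, 4.0, 6.0, 8.0]   # toy training values (made up)
valid = [10.0, 12.0]           # toy validation values (made up)

# Statistics come from the training set only.
train_mean, train_std = mean(train), pstdev(train)

def normalize(xs, m, s):
    return [(x - m) / s for x in xs]

train_norm = normalize(train, train_mean, train_std)
# The validation set is normalized with the *training* mean and std,
# so both sets go through exactly the same transformation.
valid_norm = normalize(valid, train_mean, train_std)
print(valid_norm)
```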
Do you look at the distribution of classes to see whether it's balanced?
A PR to fix that typo would be welcome
Not if you don't make it do so.
That's why the vocab is sorted by alphabetical order.
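To see why sorting helps, here is a minimal sketch, assuming the vocab is simply the set of unique labels. `build_vocab` is a hypothetical helper, not the notebook's actual processor:

```python
def build_vocab(labels):
    # Sorting makes the label -> index mapping independent of the
    # order in which the items happened to be labeled.
    vocab = sorted(set(labels))
    return vocab, {label: i for i, label in enumerate(vocab)}

train_then_valid = ["dog", "cat", "bird", "cat"]
valid_then_train = ["cat", "bird", "dog", "cat"]

vocab_a, otoi_a = build_vocab(train_then_valid)
vocab_b, otoi_b = build_vocab(valid_then_train)
print(vocab_a, otoi_a)
```

Both orderings give the same mapping. At inference time you would still want to save the vocab alongside the model instead of rebuilding it.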
I thought we were talking last week about how "other" categories were tricky because we're effectively asking the classifier to detect things that are positively aspects of negatively being a thing… I haven't reviewed so I might be misremembering, though. I wonder when it is & isn't a good idea to have an "other" category vs. eg. a loss function which would give low weights to low confidence predictions + confidence cutoff when displaying a prediction output. (This is pretty off topic)
That is data leakage, also called "data snooping". It can lead you to overestimate the generalizability of your model.
At the end of the day, it's a design decision for your model. You can decide that the target for other is everything at 0.
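A minimal sketch of the "everything at 0" idea. It assumes the model has one independent output per class (e.g. sigmoid outputs) because a softmax cannot produce all zeros. The helper names and the 0.5 threshold are illustrative assumptions:

```python
def encode_target(label, vocab):
    # One slot per known class; an unknown label leaves every slot at 0.
    return [1.0 if label == c else 0.0 for c in vocab]

def decode_prediction(probs, vocab, threshold=0.5):
    # Report a class only if its probability clears the threshold;
    # otherwise treat the input as "other" (returned as None here).
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best] if probs[best] >= threshold else None

vocab = ["bird", "cat", "dog"]
print(encode_target("cat", vocab))
print(encode_target("zebra", vocab))
print(decode_prediction([0.1, 0.2, 0.15], vocab))
```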
Do we always need to convert them to the same size? Is that a requirement?
You can't batch them if they aren't all of the same size.
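The reason is that a batch is a single rectangular array, so every item needs the same shape. A toy sketch with plain nested lists standing in for tensors; `shape` and `collate` are hypothetical helpers, not a library API:

```python
def shape(img):
    # img is a nested list of rows, all of equal length.
    return (len(img), len(img[0]))

def collate(images):
    shapes = {shape(img) for img in images}
    if len(shapes) != 1:
        raise ValueError(f"cannot batch images of different sizes: {sorted(shapes)}")
    h, w = shapes.pop()
    # The batch has one well-defined shape: (batch, height, width).
    return list(images), (len(images), h, w)

small = [[0, 0], [0, 0]]                     # 2x2 "image"
large = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]    # 3x3 "image"

batch, batch_shape = collate([small, small])
print(batch_shape)

try:
    collate([small, large])
    failed = False
except ValueError as err:
    failed = True
    print(err)
```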
Given a distribution of image sizes, how do you choose the dimensions to resize all images to?