So basically the idea is that whatever percentage increase you apply to the minority classes during training, you’d want to decrease their predicted probabilities by the same percentage at the end? And if it’s binary classification, does the same rule apply?
I’m not sure it will be exactly decreasing by the same percentage at the end - you may have to experiment with a few amounts to see which works best. It’s not something I’ve studied closely, and I don’t know of anyone else who has either, but it’s clearly an important issue!
Yeah, I’m currently working on a Kaggle comp and the difficulty is that it has an imbalanced dataset: only approx. 10% of the images in the training set contain a “threat”. To make it more difficult, the images have a lot of noise, the threats are only visible from certain angles, and classifications need to be segmented by body zones.
I tried increasing the number of threat images in the training set, but like you mentioned that just ended up inflating the probabilities, so I got a lot of false positives. It’s almost like the inherent rarity of the threat is a good thing, because it naturally makes the model more selective in what it classifies as a threat.
Try the over-sampling I suggested, and then try rescaling the probabilities by a few different values to find the best amount. Hopefully it’ll give a little improvement.
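For anyone wanting to try this, here’s a rough numpy sketch of the rescaling idea. All the numbers (the probabilities and the 5x over-sampling factor) are made up for illustration - the right scaling amount is exactly what you’d search over on a validation set:

```python
import numpy as np

# Hypothetical predicted probabilities for 4 images over 2 classes
# (column 0 = "no threat", column 1 = "threat"), from a model trained
# on a set where the threat class was over-sampled.
probs = np.array([[0.60, 0.40],
                  [0.20, 0.80],
                  [0.90, 0.10],
                  [0.50, 0.50]])

oversample_factor = 5.0  # assumed over-sampling ratio; tune by validation

# Scale down the over-sampled class, then renormalise each row so the
# two probabilities still sum to 1.
adjusted = probs.copy()
adjusted[:, 1] /= oversample_factor
adjusted /= adjusted.sum(axis=1, keepdims=True)
```

Dividing the threat column by the over-sampling factor and renormalising pushes its probabilities back down, which should cut the false positives the over-sampling introduced.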
After updating fastai, I am getting an error - nothing changed except pulling the new code…
ImportError                               Traceback (most recent call last)
----> 1 from fastai.structured import *
      2 from fastai.column_data import *
      3 np.set_printoptions(threshold=50, edgeitems=20)

~/workspace/fastai/courses/dl1/fastai/structured.py in ()
----> 1 from .imports import *
      3 from sklearn_pandas import DataFrameMapper
      4 from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler
      5 from pandas.api.types import is_string_dtype, is_numeric_dtype

~/workspace/fastai/courses/dl1/fastai/imports.py in ()
      2 import PIL, os, numpy as np, math, collections, threading, json, bcolz, random, scipy, cv2
      3 import random, pandas as pd, pickle, sys, itertools, string, sys, re, datetime, time
----> 4 import seaborn as sns, matplotlib
      5 import IPython, graphviz, sklearn_pandas, sklearn, warnings
      6 from abc import abstractmethod

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/seaborn/__init__.py in ()
      8 from .palettes import *
      9 from .regression import *
---> 10 from .categorical import *
     11 from .distributions import *
     12 from .timeseries import *

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/seaborn/categorical.py in ()
     16 from . import utils
---> 17 from .utils import iqr, categorical_order, remove_na
     18 from .algorithms import bootstrap
     19 from .palettes import color_palette, husl_palette, light_palette, dark_palette

ImportError: cannot import name 'remove_na'
It’s saying you’ve got a problem in the seaborn package. Try updating it.
Thanks Jeremy, it works now…
Hey… Had a question.
How do you decide how many fully connected layers to add? For example, in dog breeds we could have just ONE fully connected layer that converts the input to the number of classes, i.e. 120, instead of two… A related question: when we add more than one additional layer, we can specify how big we want the layer to be. Is there any method for choosing how big the layer should be? How do we arrive at this number?
Adding two layers to CNNs for images pretty much always works for me. I haven’t had much if any need to change the number or size of layers when fine tuning a network from the fastai defaults. If anyone finds situations that need to be much different, let me know!
For structured data, I have much less confidence in knowing the right answer to this question. I still have to experiment quite a bit! But the amounts shown in Rossmann are a good start generally.
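Just to make the shapes concrete, here’s a plain numpy sketch of a two-FC-layer head versus mapping straight to the classes. All sizes here (1024 backbone features, a 512-unit hidden layer) are assumptions for illustration, not fastai’s actual defaults:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1024 features out of the CNN backbone, one
# 512-unit hidden FC layer, then 120 dog-breed classes.
n_features, n_hidden, n_classes = 1024, 512, 120

x = rng.standard_normal((8, n_features))              # a batch of 8 images' features
W1 = rng.standard_normal((n_features, n_hidden)) * 0.01
W2 = rng.standard_normal((n_hidden, n_classes)) * 0.01

h = np.maximum(x @ W1, 0)   # first FC layer + ReLU
logits = h @ W2             # second FC layer: the extra hidden layer is
                            # what distinguishes this from a direct
                            # 1024 -> 120 mapping
```

The single-layer alternative would just be `x @ W` with `W` of shape `(1024, 120)`; the hidden layer adds capacity for the head to combine backbone features non-linearly.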
Can you explain why, for binary variables, we don’t want an embedding and instead use the variable as continuous?
Got an answer: embeddings don’t seem to improve anything for binary variables (as Jeremy answered).
Maybe this article can help. On page 144, in section 4.1 “Prior Correction”, you can find how to change the intercept in logistic regression to restore the original probability after sampling. It is possible to mathematically transform this formula to use with any model (not only logit).
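For anyone who wants to try it, here’s a rough numpy version of that prior-correction idea applied directly to predicted probabilities rather than the intercept (the sample/true positive rates below are made-up values; see the article for the exact intercept formula):

```python
import numpy as np

def prior_correct(p, sample_pos_rate, true_pos_rate):
    """Rescale P(positive) from a model trained on a re-sampled set
    back toward the original class prior.

    Works on the odds scale: multiply the predicted odds by the ratio
    of the true-prior odds to the sampled-prior odds."""
    odds = p / (1 - p)
    correction = (true_pos_rate / (1 - true_pos_rate)) * \
                 ((1 - sample_pos_rate) / sample_pos_rate)
    corrected_odds = odds * correction
    return corrected_odds / (1 + corrected_odds)

# E.g. a model trained on 50/50 over-sampled data predicts 0.5; if the
# true positive rate is only 10%, the corrected probability is 0.1.
p_corrected = prior_correct(0.5, sample_pos_rate=0.5, true_pos_rate=0.1)
```

For a logit model this is exactly equivalent to the intercept shift in the article, since subtracting a constant from the intercept multiplies the odds by a constant.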
Does anyone know how the data is supposed to be set up for the courses/dl1/lesson3-rossman notebook?
nbuser@jupyter:~/fastai/courses/dl1/data/rossman$ ls
googletrend      rossmann.tgz           state_names.csv  store_states.csv  train.csv  weather.csv
googletrend.csv  sample_submission.csv  store.csv        test.csv          weather

nbuser@jupyter:~/fastai/courses/dl1/data/rossman$ ls -al
total 48076
drwxr-xr-x 4 nbuser nbuser     6144 Nov 24 15:30 .
drwxr-xr-x 3 nbuser nbuser     6144 Nov 24 15:13 ..
drwxr-xr-x 2 nbuser nbuser     6144 Nov 24 15:29 googletrend
-rw-r--r-- 1 nbuser nbuser    86605 Jan 11  2017 googletrend.csv
-rw-r--r-- 1 nbuser nbuser  7730448 Nov 24 15:14 rossmann.tgz
-rw-r--r-- 1 nbuser nbuser   317611 Sep 29  2015 sample_submission.csv
-rw-r--r-- 1 nbuser nbuser      265 Jan 11  2017 state_names.csv
-rw-r--r-- 1 nbuser nbuser    45010 Sep 29  2015 store.csv
-rw-r--r-- 1 nbuser nbuser     9051 Jan  6  2017 store_states.csv
-rw-r--r-- 1 nbuser nbuser  1427425 Sep 29  2015 test.csv
-rw-r--r-- 1 nbuser nbuser 38057952 Sep 29  2015 train.csv
drwxr-xr-x 2 nbuser nbuser     6144 Nov 24 15:30 weather
-rw-r--r-- 1 nbuser nbuser  1518814 Jan 11  2017 weather.csv
Reading the code for the concat_csvs function, am I supposed to create a directory called googletrend and then copy the googletrend.csv file into it, or do I copy all of the csv files to the googletrend directory?
You don’t need to run the commented-out concat_csvs lines - they’ve already been run for you; googletrend.csv was created from them and is part of what you downloaded.
It would also depend on how much data you have to learn from.
Experiencing very frequent freezes and slow response when trying to work with Crestle.
I always git pull to the latest version, but if I understand correctly, in Crestle there is nothing more to do - is that right? (We can’t conda env update nor source activate fastai; all of that is taken care of by default, isn’t it?) I connect from Spain (in case location matters) and have a quite good symmetric internet connection.
No clue on how to make it run smoother / with fewer freezes, @anurag - if I am skipping some important step, please let me know. (The problem happens with different notebooks, with or without GPU.)
I still see a possibility where an embedding may be relevant for a binary flag, especially if one wants to use the flag for some sort of similarity measure in additional analysis. This may not impact classification results. If a variable doesn’t have predictive power, the binary flag indicates the two values are separate classes, but the embedding may indicate they are very similar.
I’m pretty sure that mathematically it won’t make any difference for binary flags. The values of the embeddings will be multiplied by a weight in the fully connected layer, so I can’t see how there would be any difference between training the embedding value and training the weight. Although I’d be very happy to be proven wrong!
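A quick numpy check of that argument: for a binary variable, an embedding followed by a linear weight can always be reproduced by a plain weight and bias applied to the raw 0/1 flag (the numbers here are arbitrary):

```python
import numpy as np

# A binary variable takes values {0, 1}. An embedding gives each value
# a learned vector; the next linear layer multiplies it by a weight.
emb = np.array([[0.3], [0.9]])   # 1-d embedding rows for values 0 and 1
w = np.array([2.0])              # weight in the following linear layer

# Contribution to the activation for each value of the flag:
contrib = emb @ w                # one number per flag value

# The same two outputs are reachable from the raw 0/1 flag with a
# single weight and a bias:
bias = contrib[0]
weight = contrib[1] - contrib[0]
flags = np.array([0.0, 1.0])
same = flags * weight + bias
```

Since the embedding only ever produces two distinct vectors, the composition embedding-then-weight collapses to an affine function of the flag, so training the embedding buys nothing extra for prediction.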
I wasn’t referring to the impact on classification results, which I agree with. I am working on something that uses features built from a combination of embeddings for clustering purposes after the prediction is done. If I directly use the binary flags, they indicate the classes are different; however, represented as embeddings, they may not show as much differentiation if the feature doesn’t have predictive power.