So basically the idea is that whatever % increase you make to the minority classes during training, you would want to decrease their predicted probabilities by the same % at the end? What if it's binary classification; does the same rule apply in that case?
I'm not sure it will be exactly decreasing by the same % at the end; you may have to experiment with a few amounts to see which works best. It's not something I've studied closely, and I don't know of anyone else who has either, but it's clearly an important issue!
Yeah, I'm currently working on this Kaggle comp, and the difficulty is that it has an imbalanced dataset where only approx 10% of the images in the training set have a "threat". To make it more difficult, the images have a lot of noise, the threats are only visible from certain angles, and classifications need to be segmented by body zones.
I tried increasing threat images in the training set, but like you mentioned, that just ended up inflating the probabilities, so a lot of false positives. It's almost like the inherent rarity of the threat is a good thing, because it naturally makes the model more selective in what it classifies as a threat.
Try using the over-sampling I suggested and then try rescaling the probabilities by a few different values to find the best amount. Hopefully it'll give a little improvement.
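A minimal sketch of the kind of post-hoc rescaling described above (my own illustration, not code from the course): shrink the over-sampled class's predicted probability by a trial factor, renormalise, and sweep a few factors on held-out data to pick the best one.

```python
import numpy as np

def rescale_probs(p, scale):
    """Downweight positive-class probabilities `p` by `scale`,
    then renormalise so each value is still a valid probability."""
    p = np.asarray(p, dtype=float)
    pos = p * scale                # shrink the over-sampled class
    return pos / (pos + (1 - p))   # renormalise against the other class

# Sweep a few scales on a validation set and keep the best-scoring one.
preds = np.array([0.9, 0.5, 0.2])
for s in (0.25, 0.5, 1.0):
    print(s, rescale_probs(preds, s))
```

With scale=1.0 the probabilities are unchanged, so the original model is always one of the candidates in the sweep.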
Hello
After updating fastai, I am getting an error; nothing changed except pulling the new code…
ImportError                               Traceback (most recent call last)
in ()
----> 1 from fastai.structured import *
      2 from fastai.column_data import *
      3 np.set_printoptions(threshold=50, edgeitems=20)
      4
      5 PATH='data/rossman/'

~/workspace/fastai/courses/dl1/fastai/structured.py in ()
----> 1 from .imports import *
      2
      3 from sklearn_pandas import DataFrameMapper
      4 from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler
      5 from pandas.api.types import is_string_dtype, is_numeric_dtype

~/workspace/fastai/courses/dl1/fastai/imports.py in ()
      2 import PIL, os, numpy as np, math, collections, threading, json, bcolz, random, scipy, cv2
      3 import random, pandas as pd, pickle, sys, itertools, string, sys, re, datetime, time
----> 4 import seaborn as sns, matplotlib
      5 import IPython, graphviz, sklearn_pandas, sklearn, warnings
      6 from abc import abstractmethod

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/seaborn/__init__.py in ()
      8 from .palettes import *
      9 from .regression import *
---> 10 from .categorical import *
     11 from .distributions import *
     12 from .timeseries import *

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/seaborn/categorical.py in ()
     15
     16 from . import utils
---> 17 from .utils import iqr, categorical_order, remove_na
     18 from .algorithms import bootstrap
     19 from .palettes import color_palette, husl_palette, light_palette, dark_palette

ImportError: cannot import name 'remove_na'
It's saying you've got a problem in the seaborn package. Try updating it.
Thanks Jeremy, it works now…
Hey… Had a question.
How do you decide how many fully connected layers to add? E.g., in dog breeds we could have just ONE fully connected layer that maps the input to the number of classes, i.e. 120, instead of two… A related question: when we add more than one additional layer, we can specify how big we want each layer to be. Is there any method for choosing how big a layer should be? How do we arrive at this number?
Adding two layers to CNNs for images pretty much always works for me. I haven't had much, if any, need to change the number or size of layers when fine-tuning a network from the fastai defaults. If anyone finds situations that need something much different, let me know!
For structured data, I have much less confidence in knowing the right answer to this question. I still have to experiment quite a bit! But the amounts shown in Rossmann are a good start generally.
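Since there's no hard rule for picking hidden-layer widths, one common rule of thumb worth experimenting with (my own illustration, not a fastai API) is to step down geometrically from the backbone's output size to the number of classes:

```python
def fc_sizes(n_in, n_classes, n_hidden=2):
    """Geometrically interpolate layer widths from n_in down to n_classes,
    e.g. from a CNN feature size of 1024 down to 120 dog breeds."""
    ratio = (n_classes / n_in) ** (1 / (n_hidden + 1))
    sizes = [round(n_in * ratio ** i) for i in range(n_hidden + 2)]
    sizes[-1] = n_classes  # make sure the head ends exactly at n_classes
    return sizes

print(fc_sizes(1024, 120))  # widths shrink by a constant factor per layer
```

Treat the output purely as a starting point for the kind of experimentation mentioned above; the best sizes still depend on the dataset.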
Can you explain why, for binary variables, we don't want an embedding and instead use the variable as continuous?
[edit] Got the answer: embeddings don't seem to improve anything for binary variables (as Jeremy answered).
Maybe this article can help. On page 144, in section 4.1 (Prior Correction), you can find how to change the intercept in logistic regression to restore the original probability after sampling. It is possible to transform this formula mathematically to use it with any model (not only logit).
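Assuming the article is the King and Zeng rare-events paper that section 4.1 "Prior Correction" comes from, the intercept shift is equivalent to rescaling the predicted odds by the ratio of population odds to sample odds, which can be applied to any classifier's output. A sketch, where tau is the true population positive rate and ybar the positive rate in the (over-)sampled training set:

```python
def prior_correct(p, tau, ybar):
    """Map a probability `p` from a model trained at positive rate `ybar`
    back to the true population rate `tau` (prior correction applied
    post hoc, so it works for any model, not only logistic regression)."""
    odds = p / (1 - p)
    # multiply by population odds / sample odds
    odds *= (tau / (1 - tau)) / (ybar / (1 - ybar))
    return odds / (1 + odds)

# Model trained on 50/50 over-sampled data, true positive rate 10%:
print(prior_correct(0.5, 0.10, 0.5))  # → 0.1
```

When tau equals ybar the correction is the identity, which is a handy sanity check.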
Hi,
Does anyone know how the data is supposed to be set up for the courses/dl1/lesson3-rossman notebook?
nbuser@jupyter:~/fastai/courses/dl1/data/rossman$
nbuser@jupyter:~/fastai/courses/dl1/data/rossman$ ls
googletrend rossmann.tgz state_names.csv store_states.csv train.csv weather.csv
googletrend.csv sample_submission.csv store.csv test.csv weather
nbuser@jupyter:~/fastai/courses/dl1/data/rossman$ ls -al
total 48076
drwxr-xr-x 4 nbuser nbuser 6144 Nov 24 15:30 .
drwxr-xr-x 3 nbuser nbuser 6144 Nov 24 15:13 ..
drwxr-xr-x 2 nbuser nbuser 6144 Nov 24 15:29 googletrend
-rw-r--r-- 1 nbuser nbuser 86605 Jan 11 2017 googletrend.csv
-rw-r--r-- 1 nbuser nbuser 7730448 Nov 24 15:14 rossmann.tgz
-rw-r--r-- 1 nbuser nbuser 317611 Sep 29 2015 sample_submission.csv
-rw-r--r-- 1 nbuser nbuser 265 Jan 11 2017 state_names.csv
-rw-r--r-- 1 nbuser nbuser 45010 Sep 29 2015 store.csv
-rw-r--r-- 1 nbuser nbuser 9051 Jan 6 2017 store_states.csv
-rw-r--r-- 1 nbuser nbuser 1427425 Sep 29 2015 test.csv
-rw-r--r-- 1 nbuser nbuser 38057952 Sep 29 2015 train.csv
drwxr-xr-x 2 nbuser nbuser 6144 Nov 24 15:30 weather
-rw-r--r-- 1 nbuser nbuser 1518814 Jan 11 2017 weather.csv
nbuser@jupyter:~/fastai/courses/dl1/data/rossman$
Reading the code for the concat_csvs function, am I supposed to create a directory called googletrend and then copy the googletrend.csv file to it, or do I copy all of the csv files to the googletrend directory?
You don't need to run the commented-out concat_csvs lines - they've already been run for you, and googletrend.csv was created from them and is part of what you downloaded.
It would also depend on how much data you have to learn from.
Experiencing very frequent freezes and slow response when trying to work with Crestle.
I always git pull to the latest version, but if I understand correctly, in Crestle there's nothing more to do - is that right? (We can't conda env update or source activate fastai; it's all taken care of by default, isn't it?) I connect from Spain (in case location matters) and have quite a good symmetric internet connection.
No clue how to make it run smoother / with fewer freezes, @anurag; if I am skipping some important step, please let me know. (The problem happens with different notebooks, with or without GPU.)
I still see a possibility where an embedding may be relevant for a binary flag, especially if one wants to use the flag in some sort of similarity measure for additional analysis. This may not impact classification results. If a variable doesn't have predictive power, the binary flag indicates the two values are separate classes, but the embedding may indicate they are very similar.
I'm pretty sure that mathematically it won't make any difference for binary flags. The values of the embeddings will be multiplied by a weight in the fully connected layer, so I can't see how there would be any difference between training the embedding value and training the weight. Although I'd be very happy to be proven wrong!
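A tiny numpy illustration of that equivalence, under the simplifying assumption of a 1-d embedding feeding a single linear unit (my own toy numbers):

```python
import numpy as np

# 2-row embedding table for a binary flag, feeding a linear layer.
emb = np.array([0.3, -1.2])   # learned embedding values for flag=0 / flag=1
w = 0.7                       # learned FC weight on the embedding output

flags = np.array([0, 1, 1, 0])
via_embedding = emb[flags] * w

# The same function collapses to one effective weight and bias on the raw
# flag, so training (emb, w) adds no capacity over training (w_eff, b_eff).
w_eff = (emb[1] - emb[0]) * w
b_eff = emb[0] * w
direct = flags * w_eff + b_eff

print(np.allclose(via_embedding, direct))  # True
```

The collapse holds per embedding dimension, so it extends to wider embeddings too; the difference only shows up if you reuse the embedding vectors outside the classifier, as in the clustering idea below.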
I wasn't referring to the impact on classification results, which I agree with. I am working on using features created from combinations of embeddings for clustering purposes after the prediction is done. In one case, if I directly use the binary flags, it indicates the values are different. However, if represented as embeddings, they may not show so much differentiation when the feature doesn't have predictive power.