In playing with the data block API, I’ve found it to be flexible, yes, but also much slower on big datasets, since it apparently always loads the entire dataset into memory before doing anything else. Or maybe I’m missing something; does anybody know if there’s a way to speed up databunch creation on larger datasets?
Recently read an article on this - it is basically factory scale at this point. You get entire floors of people who work on nothing but labeling pixels.
It is still likely not on the order of millions of segmented images, but tens or hundreds of thousands.
In general, one can get very nice results with segmentation on much smaller datasets though!
We generally see people getting a CUDA out of memory error. When we restart the kernel, it runs fine. What could be possible reasons for that? It looks like the memory is not getting properly managed?
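Not a definitive answer, but a common cause is Python objects (a failed Learner, a stored batch) still holding references to GPU tensors; PyTorch also caches freed blocks rather than returning them to the driver, so nvidia-smi keeps reporting them as used. A minimal sketch in plain PyTorch of checking and releasing memory without restarting the kernel (the reporting functions vary a bit by version):

import gc
import torch

x = torch.randn(4096, 4096, device='cuda')  # allocate ~64 MB on the GPU
del x                      # drop the last reference to the tensor...
gc.collect()               # ...make sure Python has actually collected it...
torch.cuda.empty_cache()   # ...and return PyTorch's cached blocks to CUDA
# Both numbers should now be (near) zero; if not, something still holds a tensor.
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())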
This promises to be an interesting dataset for NLP & AI in law: https://case.law/
I normally pick the steepest bit of section (1) in your list. But you should try a few and tell us what works best for you!
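For reference, a minimal sketch of how I read that off in fastai v1 (assuming you already have a Learner called learn; the 1e-3 is just an example value):

learn.lr_find()        # run the LR range test
learn.recorder.plot()  # loss vs. learning rate, log scale on x
# Pick a value on the steep downward slope, a bit before the minimum:
learn.fit_one_cycle(4, max_lr=1e-3)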
Yeah it’s less of an issue now, with optimized PIL and our faster augmentations - although you might still want to resize if your images are huge.
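If they are huge, a one-off preprocessing pass can help; a hypothetical sketch (the path and target size are placeholders):

from pathlib import Path
from PIL import Image

# Shrink oversized images on disk so the training pipeline never has to
# decode the full-size originals.
for p in Path('data/train').glob('*.jpg'):
    im = Image.open(p)
    im.thumbnail((1024, 1024))  # in-place; keeps aspect ratio, only ever downsizes
    im.save(p, quality=90)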
The best way to install PIL is using the comment at the bottom here:
In high dimensions there are basically no local minima - at least for the kinds of functions that neural net losses create.
In lesson 3, there was an explanation of U-Net and how to use it now in v1.
But I do remember that in course-v2 and fastai 0.7, @kcturgutlu implemented a way to create Dynamic Unets, U-Net-ish
models using any pretrained model as encoder: resnet, resnext, vgg…
Are these Dynamic Unets deprecated in v1?
Quite the opposite - that’s what we’re using all the time now! That’s why we were able to automatically create a unet with a given backbone architecture.
You may check https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-camvid.ipynb for how to use create_unet
in v1. It’s much faster and much lighter in terms of GPU memory.
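For the curious, a minimal sketch (assuming a segmentation DataBunch named data, built as in the camvid notebook; the constructor was Learner.create_unet in early v1 releases and was later renamed unet_learner):

from fastai.vision import *

# Any supported torchvision-style backbone (resnet34 here) becomes the
# U-Net encoder; the matching decoder is generated automatically.
learn = unet_learner(data, models.resnet34)
learn.fit_one_cycle(10, slice(1e-4, 1e-3))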
Yes, ULMFiT is the technique: Universal Language Model Fine-tuning.
Thanks, but where does Jeremy specify the ULMFiT model?
language_model_learner() implements an AWD-LSTM RNN behind the scenes, which is what Jeremy and Sebastian used in ULMFiT.
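A minimal sketch of what that looks like (assuming a TextLMDataBunch named data_lm; the exact signature shifted across v1 releases, in later ones you name the architecture explicitly):

from fastai.text import *

# AWD_LSTM is the pretrained architecture from the ULMFiT paper;
# drop_mult scales all of its dropout probabilities at once.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)  # fine-tune the language model, ULMFiT-style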
Hello all,
Since I did not get any response in lesson-3-discussion, I thought I would ask you gurus.
I am trying to get a handle on the data_block API, and I don’t know what I am doing wrong.
I am working with an established dataset, Kaggle’s whale-categorization-playground:
The train and test folders contain jpg images (no sub-folders by class, etc.).
train.csv maps each ImageId in the train folder to its ClassName, like so:
| ImageId | LabelName |
|---|---|
| 00022e1a.jpg | w_e15442c |
| 000466c4.jpg | w_1287fbc |
| 00087b01.jpg | w_da2efe0 |
| 001296d5.jpg | w_19e5482 |
I got as far as:
data = (ImageFileList.from_folder(path)                    # works
        .label_from_csv(path/'train.csv', folder='train')  # works
        .random_split_by_pct(0.2)                          # works
        .datasets()   # errors: KeyError on the first Id in train.csv
        .transform(tfms, size=128)
        .databunch()
        .normalize(imagenet_stats))
Can any of you gurus help?
Here is the full trace:
KeyError                                  Traceback (most recent call last)
<ipython-input-…> in <module>()
----> 1 d=c.datasets()

~/Documents/fastai/courses/v3/nbs/dl1/fastai/data_block.py in datasets(self, dataset_cls, **kwargs)
    234         train = dataset_cls(*self.train.items.T, **kwargs)
    235         dss = [train]
--> 236         dss += [train.new(*o.items.T, **kwargs) for o in self.lists[1:]]
    237         cls = getattr(train, '__splits_class__', self._pipe)
    238         return cls(self.path, *dss)

~/Documents/fastai/courses/v3/nbs/dl1/fastai/data_block.py in <listcomp>(.0)
    234         train = dataset_cls(*self.train.items.T, **kwargs)
    235         dss = [train]
--> 236         dss += [train.new(*o.items.T, **kwargs) for o in self.lists[1:]]
    237         cls = getattr(train, '__splits_class__', self._pipe)
    238         return cls(self.path, *dss)

~/Documents/fastai/courses/v3/nbs/dl1/fastai/vision/data.py in new(self, classes, *args, **kwargs)
     80     def new(self, *args, classes:Optional[Collection[Any]]=None, **kwargs):
     81         if classes is None: classes = self.classes
---> 82         return self.__class__(*args, classes=classes, **kwargs)
     83
     84 class ImageClassificationDataset(ImageClassificationBase):

~/Documents/fastai/courses/v3/nbs/dl1/fastai/vision/data.py in __init__(self, x, y, classes, **kwargs)
     75 class ImageClassificationBase(ImageDatasetBase):
     76     def __init__(self, x:Collection, y:Collection, classes:Collection=None, **kwargs):
---> 77         super().__init__(x=x, y=y, classes=classes, **kwargs)
     78         self.learner_type = ClassificationLearner
     79

~/Documents/fastai/courses/v3/nbs/dl1/fastai/vision/data.py in __init__(self, **kwargs)
     67 class ImageDatasetBase(DatasetBase):
     68     def __init__(self, **kwargs):
---> 69         super().__init__(**kwargs)
     70         self.image_opener = open_image
     71         self.learner_type = ImageLearner

~/Documents/fastai/courses/v3/nbs/dl1/fastai/basic_data.py in __init__(self, x, y, classes, c, task_type, class2idx, as_array, do_encode_y)
     23         else: self.c = len(self.classes)
     24         if class2idx is None: self.class2idx = {v:k for k,v in enumerate(self.classes)}
---> 25         if y is not None and do_encode_y: self.encode_y()
     26         if self.task_type==TaskType.Regression: self.loss_func = MSELossFlat()
     27         elif self.task_type==TaskType.Single: self.loss_func = F.cross_entropy

~/Documents/fastai/courses/v3/nbs/dl1/fastai/basic_data.py in encode_y(self)
     30     def encode_y(self):
     31         if self.task_type==TaskType.Single:
---> 32             self.y = np.array([self.class2idx[o] for o in self.y], dtype=np.int64)
     33         elif self.task_type==TaskType.Multi:
     34             self.y = [np.array([self.class2idx[o] for o in l], dtype=np.int64) for l in self.y]

~/Documents/fastai/courses/v3/nbs/dl1/fastai/basic_data.py in <listcomp>(.0)
     30     def encode_y(self):
     31         if self.task_type==TaskType.Single:
---> 32             self.y = np.array([self.class2idx[o] for o in self.y], dtype=np.int64)
     33         elif self.task_type==TaskType.Multi:
     34             self.y = [np.array([self.class2idx[o] for o in l], dtype=np.int64) for l in self.y]

KeyError: 'w_e15442c'
I had a similar issue with the data block API on a different dataset. Try using the standard API; it worked for me:
data = ImageDataBunch.from_csv(path, folder='train', sep=None, csv_labels='train.csv', valid_pct=0.2, ds_tfms=get_transforms(), size=128)
@miwojc, thank you!!!
I thought it was just me.
I wonder if I should post this as a bug, or just not use v1 since it is “WIP”.
Interestingly, vision.data.ImageClassificationDataset says:
warnings.warn("ImageClassificationDataset is deprecated and will soon be removed. Use the data block API.")
I didn’t have time to dig into it, so I moved on, but you are right, we should report it as a bug.
What release of v1 are you on?
I am at 1.0.22 (the latest).