Hey WG,
Will a video of the most recent session be shared? I’ve found the other two very useful.
Paul.
Yah it will be up later today.
@Albertotono try shrinking your model by training on smaller image sizes (I used 64px and resnet18), which got my .pkl file under 50MB. I'm not sure how to get Binder to work with git lfs; Binder seems to pick up just the pointer to the file, not the actual file. But shrinking the model so that git lfs isn't needed worked.
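For reference, a rough sketch of the kind of training setup I mean (the path and epoch count are placeholders, not my exact code):

```python
from fastai2.vision.all import *

# Training at 64px with resnet18 keeps the exported .pkl small
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(64))
learn = cnn_learner(dls, resnet18, metrics=accuracy)
learn.fine_tune(1)
learn.export()  # writes export.pkl, small enough to commit without git lfs
```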
@wgpubs unfortunately I joined a few mins in, as beginner Q&A went over a bit. The DataBlock API coverage was super helpful. Did I miss a solution shared for #1, getting deployment with Binder to work, or will that be covered next time?
Haven’t covered Binder (or any deployment) … no time. Are you still having issues?
FYI: Updated Wiki with link to recorded week 4 session (https://youtu.be/KfIMSozuMfM)
Huzzah! Got it working.
Gathered up my binder debugging tips in a blog post here: https://medium.com/@leggomymego/lessons-learned-pushing-a-deep-learning-model-to-production-d6c6d198a7d8.
I’m not yet super familiar with the V2 library so I may get some details wrong, but here’s my understanding.
First, let's start with how plain PyTorch does things. PyTorch has two main classes for dealing with data: `Dataset` and `DataLoader`. A `Dataset` is designed to get individual data items, while a `DataLoader` is designed to grab batches from a particular `Dataset`.
`Dataset`s are pretty simple - they just need a `__len__` method and a `__getitem__` method.
```python
class MyData(Dataset):
    def __len__(self):
        pass

    def __getitem__(self, index):
        pass
```
`__len__` is used to determine how many items are in a `Dataset`. `__getitem__` gives you one item in the dataset based on the index value passed.
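For example, a minimal `Dataset` wrapping a plain Python list (a toy example, not from the thread) could look like:

```python
import torch
from torch.utils.data import Dataset

class ListDataset(Dataset):
    "Toy dataset wrapping a list of (x, y) pairs."
    def __init__(self, items):
        self.items = items

    def __len__(self):
        # how many items are in the dataset
        return len(self.items)

    def __getitem__(self, index):
        # return the single item at this index
        return self.items[index]

ds = ListDataset([(torch.randn(3), i) for i in range(10)])
len(ds), ds[0]
```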
`DataLoader`s grab a batch of index values, then use those index values to return a list of items from the `Dataset`. The `DataLoader` has a `collate_fn` that processes the list of items. The simplest `collate_fn` just returns the batch as a list.
```python
def my_collate(batch):
    return batch
```
But typically we want to pack the batch items into a single tensor. If the items returned from the dataset are already tensors, we could simply stack them. We can also do fancier things in the collate function.
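For instance, a sketch of a stacking `collate_fn` (assuming each item is an `(image_tensor, label)` pair with matching shapes) might be:

```python
import torch

def stack_collate(batch):
    # batch is a list of (image_tensor, label) pairs
    xs = torch.stack([x for x, y in batch])   # (bs, C, H, W)
    ys = torch.tensor([y for x, y in batch])  # (bs,)
    return xs, ys
```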
So the question now is: how do we decide which actions happen in the `__getitem__` method of the `Dataset`, and which happen in the `collate_fn` of the `DataLoader`? Consider an image dataset. To create a batch of images, we need to execute steps roughly like the following:

1. Get the filename for an item
2. Load the image from disk
3. Resize the image
4. Apply any augmentations
5. Convert the image to a tensor
6. Stack the individual tensors into a single batch tensor
What you typically see in PyTorch is that operations on a single item happen in the `__getitem__` method of the `Dataset`, while operations related to batching happen in the `collate_fn`. This means you would do steps 1-5 in the `__getitem__` method, and step 6 in the `collate_fn`.
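Putting that together, here's a sketch of an image `Dataset` along those lines (the file list and transforms are assumptions, not anyone's actual code):

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageDataset(Dataset):
    def __init__(self, filenames, labels):
        self.filenames, self.labels = filenames, labels
        # per-item work: resize, augment, convert to tensor (steps 3-5)
        self.tfms = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        fn = self.filenames[index]           # step 1: get the filename
        img = Image.open(fn).convert("RGB")  # step 2: load from disk
        return self.tfms(img), self.labels[index]

# step 6 (stacking into a batch) is left to the DataLoader's default collate_fn
```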
To your question, you could totally do all the steps in the `collate_fn` if you wanted. You could have your `__getitem__` method just return the filename and handle all the rest in the `collate_fn`. However, this would be slower out of the box. The reason is that your `DataLoader` automatically parallelizes data loading across worker processes based on its `num_workers` parameter, which means your batch of `__getitem__` calls runs in parallel. Sure, you could write your own `collate_fn` to be parallel as well, but that feels like a headache.
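For example (reusing the `ListDataset` and `stack_collate` sketches above), the parallelism comes for free just by setting `num_workers`:

```python
from torch.utils.data import DataLoader

# Four worker processes each run __getitem__ calls in parallel;
# the collate_fn then assembles the results into a batch
dl = DataLoader(ds, batch_size=4, num_workers=4, collate_fn=stack_collate)
xb, yb = next(iter(dl))
```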
Now moving to fastai, `item_tfms` is sort of a general extension of the processing you might do in the `__getitem__` method of a standard `Dataset`. It denotes transforms applied to one item at a time. This is the case for things like image augmentation, where we are applying different transforms to each image in a batch; we want this stuff happening in the parallel workers before the `collate_fn`. If you wanted to apply a single transform to an entire batch (say, rotating a batch of images by the same degree), that might be more efficient to do in the `collate_fn` at the batch level. But for transforms that are going to be different for each image (slightly different augmentations, different resizing, etc.), it makes more sense to have them outside the `collate_fn`.
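In fastai2 terms, that split usually looks something like this (a sketch with assumed paths and labeling, not a definitive recipe):

```python
from fastai2.vision.all import *

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(seed=42),
    item_tfms=Resize(224),        # applied one image at a time, pre-collate
    batch_tfms=aug_transforms(),  # applied to the collated batch (on the GPU)
)
dls = dblock.dataloaders(path)  # `path` assumed to point at an image folder
```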
Nice!
And thanks for blogging about it … I know a lot of folks were/are having issues with Binder so I’m sure this will be helpful to the community at large.
Thank you @wgpubs for sharing your video. I'm a bit familiar with fastai2, and I really like your approach. I also liked the fact that you started by presenting the PyTorch Dataset and DataLoader as covered in WHAT IS TORCH.NN REALLY?.
I encourage both beginners and those who are already familiar with fastai2 to watch that video.
I created the Fastai v2 Recipes (Tips and Tricks) - Wiki thread in order to encourage people to share their fastai2 knowledge. Please feel free to share your own tips and/or recipes there.
The full article we were reviewing last week (week 4 session) on the DataBlock API has been published and is good to go for running on colab! You can find it here.
We’ll be finishing off the rest of this article in addition to answering any of your burning questions after next week’s lesson.
Hit me up with any questions or anything DataBlock wise y’all want to discuss next Thursday!
-wg
Thank you @wgpubs. Just caught up on the latest video and reviewed your post too. A really awesome breakdown that was incredibly helpful; I'd highly recommend it to anyone who is a little cloudy on anything DataBlock related.
We’ll be starting this up in about 15 mins for anyone interested (zoom link up top).
See ya in a bit - Wayde
Here are a few of my Qs for Week 5:
On SGD vs GD, I posted an answer in two parts on the lesson 4 threads.
Hello guys, I have a question regarding a character-level language model I'm implementing for practice.
After reading some documentation I got a DataBlock working:
```python
# One txt file is the source
def get_file(path):
    with open(path/"100as_cleaned.txt", "r") as infile:
        text = infile.read()
    n = 1000
    return [text[i:i+n] for i in range(0, len(text), n)]

as100 = DataBlock(blocks=TextBlock(tok_tfm=lambda txt: list(txt), is_lm=True),
                  get_items=get_file,
                  splitter=RandomSplitter(valid_pct=0.3, seed=42))
dls = as100.dataloaders(path)  # dataloaders created from the block

xb, yb = next(iter(dls.valid))
xb.size(), yb.size()
```

```
(torch.Size([64, 72]), torch.Size([64, 72]))
```
However, I get the following error if I try `dls.show_batch()`:
```
AttributeError                            Traceback (most recent call last)
<ipython-input-19-90634fcc3c9e> in <module>()
----> 1 dls.show_batch()

6 frames
/usr/local/lib/python3.6/dist-packages/fastai2/data/core.py in show_batch(self, b, max_n, ctxs, show, unique, **kwargs)
     93         if b is None: b = self.one_batch()
     94         if not show: return self._pre_show_batch(b, max_n=max_n)
---> 95         show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)
     96         if unique: self.get_idxs = old_get_idxs
     97

/usr/local/lib/python3.6/dist-packages/fastcore/dispatch.py in __call__(self, *args, **kwargs)
     96         if not f: return args[0]
     97         if self.inst is not None: f = MethodType(f, self.inst)
---> 98         return f(*args, **kwargs)
     99
    100     def __get__(self, inst, owner):

/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in show_batch(x, y, samples, ctxs, max_n, trunc_at, **kwargs)
    111 @typedispatch
    112 def show_batch(x: LMTensorText, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
--> 113     samples = L((s[0].truncate(trunc_at), s[1].truncate(trunc_at)) for s in samples)
    114     return show_batch[TensorText](x, None, samples, ctxs=ctxs, max_n=max_n, trunc_at=None, **kwargs)
    115

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     39             return x
     40
---> 41         res = super().__call__(*((x,) + args), **kwargs)
     42         res._newchk = 0
     43         return res

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in __init__(self, items, use_list, match, *rest)
    312         if items is None: items = []
    313         if (use_list is not None) or not _is_array(items):
--> 314             items = list(items) if use_list else _listify(items)
    315         if match is not None:
    316             if is_coll(match): match = len(match)

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in _listify(o)
    248     if isinstance(o, list): return o
    249     if isinstance(o, str) or _is_array(o): return [o]
--> 250     if is_iter(o): return list(o)
    251     return [o]
    252

/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in <genexpr>(.0)
    111 @typedispatch
    112 def show_batch(x: LMTensorText, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
--> 113     samples = L((s[0].truncate(trunc_at), s[1].truncate(trunc_at)) for s in samples)
    114     return show_batch[TensorText](x, None, samples, ctxs=ctxs, max_n=max_n, trunc_at=None, **kwargs)
    115

AttributeError: 'L' object has no attribute 'truncate'
```
Looks like the widgets notebook has some merge conflicts rendered in it, this one:
Just figured I'd share, @sgugger.
Oh thanks for flagging! Should be fixed now.
So `truncate` is a method that operates on a pandas DataFrame or Series (see here).

My guess is the reason `s[0]` and/or `s[1]` are of type `L` is because of this line in your DataBlock above:

```python
tok_tfm = lambda txt: list(txt)
```

Take a look at `Tokenizer` and functions like `Tokenizer.from_df` … you may have to create something a bit more elaborate than just `list(txt)` to return a type `show_batch` can work with out of the box.
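For example, something along these lines might get you closer (an untested sketch; `CharTokenizer` is a made-up name, and I'm assuming `Tokenizer` can wrap a tokenizer instance like this in your version, so double-check the signature):

```python
class CharTokenizer():
    "Hypothetical character-level tokenizer: one token per character."
    def __call__(self, items):
        for txt in items:
            yield list(txt)

# Wrapping it in fastai2's Tokenizer transform (rather than passing a bare
# lambda) should produce the decoded types that show_batch expects
tok_tfm = Tokenizer(CharTokenizer(), rules=[])
```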