Part 1 (2020) - Weekly Beginner Only Review and Q&A

The full article we were reviewing last week (week 4 session) on the DataBlock API has been published and is ready to run on Colab! You can find it here.

We’ll be finishing off the rest of this article in addition to answering any of your burning questions after next week’s lesson.

Hit me up with any questions or anything DataBlock wise y’all want to discuss next Thursday!

-wg


Thank you @wgpubs. Just caught up on the latest video and reviewed your post too. A really awesome breakdown that was incredibly helpful - I’d highly recommend it to anyone who is a little cloudy on anything DataBlock related.


We’ll be starting this up in about 15 mins for anyone interested (zoom link up top).

See ya in a bit - Wayde


Here are a few of my Qs for Week 5:

  1. What is the difference between GD and SGD? It was asked during the lecture, but I don’t recall Jeremy addressing it.
  2. In the SGD image in the book, Jeremy describes the 7 steps. For the last one, “stop,” it says we’ve already discussed how to choose how many epochs to train a model for. I don’t recall that discussion. Can we discuss?
  3. I’d like to go through adding a non-linearity one more time.
  4. Week 2 Qs:
  • my .pkl returned #0; does this mean my file doesn’t exist?
  • I’d like to discuss untar_data vs download_data. I don’t get the difference.
  1. GD means you run through all of your data at once. Although it gives you the most accurate gradient, it’s usually not feasible memory-wise for the computer to handle such a huge matrix. So we use SGD: cut your data into batches and process only one batch at a time.

On SGD vs GD, I posted an answer in two parts on the lesson 4 threads.

  • part 1 contains some preliminary remarks which I think could be useful
  • part 2 spells things out a bit on the distinction between SGD, GD, and mini-batch GD (there’s also a quick code sketch of that distinction just below).
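To make that concrete, here’s a minimal sketch in plain PyTorch (toy linear-regression data; all the names here are illustrative, not from the lesson notebooks) contrasting full-batch GD with the mini-batch loop that deep learning libraries usually label “SGD”:

import torch

# toy data: 10,000 samples of y = 3x + noise
X = torch.randn(10_000, 1)
y = 3 * X + 0.1 * torch.randn(10_000, 1)
w = torch.zeros(1, requires_grad=True)

def step(xb, yb, lr=0.1):
    # compute the loss on whatever slice of data we're given, then update w
    loss = ((xb * w - yb) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()

# full-batch GD: one parameter update per epoch, using the gradient over ALL samples
step(X, y)

# mini-batch SGD: many updates per epoch, each from the gradient of a small batch
bs = 64
for i in range(0, len(X), bs):
    step(X[i:i+bs], y[i:i+bs])

Each mini-batch gradient is only an estimate of the full-batch one, but you get far more updates per pass through the data and never need the whole dataset in memory at once.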

Hello guys, I have some questions regarding a character-level language model I am implementing for practice.

After reading some documentation, I got a DataBlock working:

# One txt file is the source; slice it into 1,000-character chunks
def get_file(path):
    with open(path/"100as_cleaned.txt", "r") as infile:
        text = infile.read()
    n = 1000
    return [text[i:i+n] for i in range(0, len(text), n)]

as100 = DataBlock(blocks=TextBlock(tok_tfm=lambda txt: list(txt), is_lm=True),
                  get_items=get_file,
                  splitter=RandomSplitter(valid_pct=0.3, seed=42))

dls = as100.dataloaders(path)  # build the DataLoaders (step implied by the usage below)

xb, yb = next(iter(dls.valid))
xb.size(), yb.size()

(torch.Size([64, 72]), torch.Size([64, 72]))

However, I get the following error if I try

dls.show_batch()

AttributeError                            Traceback (most recent call last)
<ipython-input-19-90634fcc3c9e> in <module>()
----> 1 dls.show_batch()

6 frames
/usr/local/lib/python3.6/dist-packages/fastai2/data/core.py in show_batch(self, b, max_n, ctxs, show, unique, **kwargs)
     93         if b is None: b = self.one_batch()
     94         if not show: return self._pre_show_batch(b, max_n=max_n)
---> 95         show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)
     96         if unique: self.get_idxs = old_get_idxs
     97 

/usr/local/lib/python3.6/dist-packages/fastcore/dispatch.py in __call__(self, *args, **kwargs)
     96         if not f: return args[0]
     97         if self.inst is not None: f = MethodType(f, self.inst)
---> 98         return f(*args, **kwargs)
     99 
    100     def __get__(self, inst, owner):

/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in show_batch(x, y, samples, ctxs, max_n, trunc_at, **kwargs)
    111 @typedispatch
    112 def show_batch(x: LMTensorText, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
--> 113     samples = L((s[0].truncate(trunc_at), s[1].truncate(trunc_at)) for s in samples)
    114     return show_batch[TensorText](x, None, samples, ctxs=ctxs, max_n=max_n, trunc_at=None, **kwargs)
    115 

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     39             return x
     40 
---> 41         res = super().__call__(*((x,) + args), **kwargs)
     42         res._newchk = 0
     43         return res

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in __init__(self, items, use_list, match, *rest)
    312         if items is None: items = []
    313         if (use_list is not None) or not _is_array(items):
--> 314             items = list(items) if use_list else _listify(items)
    315         if match is not None:
    316             if is_coll(match): match = len(match)

/usr/local/lib/python3.6/dist-packages/fastcore/foundation.py in _listify(o)
    248     if isinstance(o, list): return o
    249     if isinstance(o, str) or _is_array(o): return [o]
--> 250     if is_iter(o): return list(o)
    251     return [o]
    252 

/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in <genexpr>(.0)
    111 @typedispatch
    112 def show_batch(x: LMTensorText, y, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
--> 113     samples = L((s[0].truncate(trunc_at), s[1].truncate(trunc_at)) for s in samples)
    114     return show_batch[TensorText](x, None, samples, ctxs=ctxs, max_n=max_n, trunc_at=None, **kwargs)
    115 

AttributeError: 'L' object has no attribute 'truncate'

Looks like the widgets notebook has some merge conflicts rendered in it, this one:

Just figured I’d share @sgugger

Oh thanks for flagging! Should be fixed now.

So truncate is a method that operates on a pandas DataFrame or Series (see here)

My guess is that the reason s[0] and/or s[1] are of type L is this line in your DataBlock above:

tok_tfm = lambda txt: list(txt)

Take a look at Tokenizer and functions like Tokenizer.from_df … you may have to create something a bit more elaborate than just list(txt) to return a type show_batch can work with out of the box.
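For example, something along these lines might work (an untested sketch; CharTokenizer is a hypothetical name, and I’m assuming you wrap it in fastai’s Tokenizer transform with the default preprocessing rules turned off so they don’t mangle the characters):

class CharTokenizer:
    "Hypothetical character-level tokenizer following fastai's tokenizer protocol"
    def __call__(self, items):
        # receive a collection of texts, yield one list of tokens (here: characters) per text
        for txt in items: yield list(txt)

as100 = DataBlock(blocks=TextBlock(tok_tfm=Tokenizer(CharTokenizer(), rules=[]), is_lm=True),
                  get_items=get_file,
                  splitter=RandomSplitter(valid_pct=0.3, seed=42))

Going through Tokenizer (rather than a bare lambda) should give decoded samples of the text types that show_batch knows how to truncate and display.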

Thanks. I will check that out.


For the older version, we could use Counter(data.train_ds.y) to get the count of labels per train or valid set. Does anybody know the equivalent with the DataBlock API in fastai2?

Thanks,

Yup … len(dls.train.vocab[1])


Regarding the difference between untar_data and download_data, this is what I gathered by looking at the documentation:

  • download_data: downloads fastai files from a fastai URL to the appropriate local path.
  • untar_data: downloads files from a fastai URL or a custom URL and extracts them into the folder given by dest (an argument of the function).
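To make the difference concrete, a quick sketch (assuming the fastai2 imports below are available in your install; URLs.PETS is just a standard example dataset):

from fastai2.data.external import untar_data, download_data, URLs

archive = download_data(URLs.PETS)  # downloads the .tgz archive and returns its local path
path = untar_data(URLs.PETS)        # downloads if needed AND extracts, returning the extracted folder's Path
path.ls()                           # browse the extracted files

So in most notebooks you only ever call untar_data; download_data is useful when you want the raw archive itself.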

That doesn’t get the number of samples, though.

Any thoughts?

Assuming you are doing a classification problem, if you want the number of classes it is: dls.c

If you want the number of samples in your training dataset: len(dls.train_ds)

If you are looking to get a count for each of the unique classes in your training dataset, then there are probably multiple ways. For example, I’m using a DataFrame and can do this:

dls.items['label'].value_counts()

… or even something like this would work:

Counter([y.item() for x,y in dls.train_ds])

Not sure if there is an easier one-liner or if it depends on your data source.

Hi @wgpubs, are we doing today’s session? If so, I would like to propose a couple of topics:

  1. Understanding the intuition behind embeddings in the vision, collaborative filtering, and NLP domains (word embeddings).
  2. One-hot encoding and its applications in different domains.
  3. A better understanding of lr_finder, freezing and unfreezing models, and how to choose a learning rate in different applications. The example Jeremy showed in class had a nice steep descent that ended at a minimum value. However, there could be other scenarios (see the attached lr_find plot) where the steepest point and the minimum loss are very different: in that example, lr_steep was 2.51e-07 while lr_min was 3.02e-04. How do we choose the correct lr in such scenarios or others? (A small snippet for reading off both suggestions follows this list.)
  4. Cross Entropy and other popular losses used in different domains.
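For topic 3, a small sketch (assuming a fastai2 Learner named learn has already been built, and the attribute names returned by lr_find in the version of the library used in the course):

lrs = learn.lr_find()                   # runs the LR range test and plots loss vs. learning rate
print(lrs.lr_min, lrs.lr_steep)         # the two suggested values mentioned above
learn.fine_tune(1, base_lr=lrs.lr_min)  # e.g. start from the lr_min suggestion and adjust from there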

Yup. 25 mins 🙂


Ok folks … we’ll be starting in 15 mins.

Zoom link is at top and also here: https://ucsd.zoom.us/j/98630190281


Anyone else having issues with this zoom link? I’m redirected to UC San Diego site. Sorry if this is a noob question…

Edit: it’s working for me on my personal laptop. My corporate laptop was blocking the Zoom desktop client.