Lesson 3 In-Class Discussion ✅

lesscomfortable · November 30, 2018, 8:01pm

Hey! Can you try df['Unnamed: 0']=df['Unnamed: 0'].astype(int) + 1 on your original table? This basically casts every item in this column to an integer.
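
To show what that line does, here is a minimal sketch on a toy DataFrame (the values are made up; only the column name comes from this thread). Note that the `+ 1` also shifts every value up by one after the cast.

```python
import pandas as pd

# Toy stand-in for the real table; the values here are made up.
df = pd.DataFrame({"Unnamed: 0": ["0", "1", "2"], "text": ["a", "b", "c"]})

# Cast the column from strings to integers, then add 1 to every entry.
df["Unnamed: 0"] = df["Unnamed: 0"].astype(int) + 1

print(df["Unnamed: 0"].tolist())
```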

ricknta · December 1, 2018, 3:13am

Thanks! I don’t think that Unnamed col was a problem - at that point the problem with creating the databunch was fixed; that Unnamed col was just a leftover empty col from when I was manipulating the csv in Excel.

Now I’m working with the df and everything runs OK, but the TextLMDataBunch seems to be reading labels instead of the text:

df = pd.read_csv(path/'fake_or_real_news.csv', usecols=["label", "text"])[["label", "text"]]
df.head(10)

data_lm = (TextList.from_df(df, col='text')
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

Why would it do that??

lesscomfortable · December 1, 2018, 4:17am

Can you try running this instead?

data_lm = (TextList.from_df(df, cols=1)
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

ricknta · December 1, 2018, 4:56am

Thanks, I tried that (very hopeful!) but got almost the same result - slight differences but same basic pattern:

shyampagadi · December 1, 2018, 7:42am

Hi,
I am trying to create an ImageDataBunch from a CSV file and display sample images, but I get the error below. Can someone please help?

data = ImageDataBunch.from_csv(path, csv_labels='train.csv', folder='train', tfms=tfms)
data.show_batch(rows=2, figsize=(9,7))

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 465 and 498 in dimension 2 at /pytorch/aten/src/TH/generic/THTensorMoreMath.cpp:1325
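
For context, a message like this generally means that items of different sizes were stacked into one batch (here 465 vs 498 along one dimension). The sketch below reproduces the idea with NumPy arrays rather than fastai; `crop_to_length` is a hypothetical helper and the array shapes are made up. In fastai the usual remedy is to pass a `size` when creating the data bunch so that every image ends up with the same shape.

```python
import numpy as np

def crop_to_length(arr, length, axis):
    """Keep the first `length` entries of `arr` along `axis`."""
    index = [slice(None)] * arr.ndim
    index[axis] = slice(0, length)
    return arr[tuple(index)]

# Two stand-in "images" whose sizes differ along one axis (465 vs 498).
a = np.zeros((3, 465, 4))
b = np.zeros((3, 498, 4))

try:
    np.stack([a, b])
    stacked_ok = True
except ValueError:
    # Stacking requires every array to have exactly the same shape.
    stacked_ok = False

# After bringing both arrays to a common size, stacking works.
batch = np.stack([crop_to_length(a, 465, axis=1), crop_to_length(b, 465, axis=1)])
print(stacked_ok, batch.shape)
```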

lesscomfortable · December 3, 2018, 5:40pm

What about:

data_lm = (TextList.from_df(df, cols='text')
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

prratek · December 3, 2018, 8:07pm

I’m having some trouble labeling my images using the data block API. Since it’s a multi-class classification problem where most classes have just one example, I want to duplicate images from the underrepresented classes. I now have a DataFrame with image names and labels, including duplicates. My images are in path/train. Here’s my code:

src = (ImageItemList.from_df(df=train_df, path=path, cols='Name', folder='train')
       .split_by_valid_func(lambda o: o in val_n)
       .label_from_df(cols='Id'))

However, label_from_df throws the following error:
IndexError: index 0 is out of bounds for axis 0 with size 0

Any thoughts? Here’s the full traceback:

https://gist.github.com/prratek/deb8c0627afb1df590d57aa38c50e60d

datablock_label.py

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-65-a21b1f7faa1d> in <module>
      2 src = (ImageItemList.from_df(df=train_df, path=path, cols='Name', folder='train')
      3        .split_by_valid_func(lambda o: o in val_n)
----> 4        .label_from_df(cols='Id'))

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    346             self.train = ft(*args, **kwargs)
    347             assert isinstance(self.train, LabelList)

ricknta · December 3, 2018, 11:39pm

Thanks, but that doesn’t work, either! Same basic pattern:

lesscomfortable · December 4, 2018, 1:05am

What happens if you run:

data_lm = (TextList.from_df(df, cols=[0,1])
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

and

data_lm = (TextList.from_df(df, cols=['text','label'])
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

ricknta · December 4, 2018, 1:40am

Yes! Both worked. So it looks like, for some reason, I have to specify both cols, maybe because the df has a few more cols I’m not using? Does this make sense? Want to make sure I learn from it! thanks

lesscomfortable · December 4, 2018, 1:52am

I don’t really know what’s going on. This might give us more info. Try:

data_lm = (TextList.from_df(df, cols=[0])
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

and

data_lm = (TextList.from_df(df, cols=['label'])
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())
data_lm.save('tmp_lm')

data_lm = TextLMDataBunch.load(path, 'tmp_lm')

wyquek · December 4, 2018, 2:26am

I suspect you have an index column that took up column 0, hence pushing ['label'] to column 1 and ['text'] to column 2.

Maybe try removing the index?
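
To check for this, here is a minimal sketch, assuming a CSV whose first column is an unnamed index. The sample rows are made up; only the `label` and `text` column names come from this thread.

```python
import io
import pandas as pd

# Made-up CSV with an unnamed index column, like one saved with df.to_csv().
csv_text = ",label,text\n0,FAKE,first article\n1,REAL,second article\n"

# Default read: the unnamed index becomes a regular column at position 0.
with_index = pd.read_csv(io.StringIO(csv_text))
print(list(with_index.columns))

# Treating the first column as the index moves 'label' back to position 0.
without_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_index.columns))
```

If the first print shows an extra leading column, positional arguments such as `cols=0` would point at the index rather than at the data.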

ricknta · December 4, 2018, 2:39am

OK, I’ll try that later. Meanwhile, I tried to move to a faster GPU (I was on a P5000), but Gradient is screwing up (again!): at first I couldn’t get any GPUs, and then I could only get a slower one (GPU+). So now I’m running on that, and

cols=['text','label']

doesn’t work anymore!

cols=[0,1]

does work so I’m running with that right now.

So now I wonder if these problems are due to some problems in the Gradient stack/infrastructure? Seems very weird that code would work with one GPU and not another.

ricknta · December 4, 2018, 2:47am

Thanks, there is an index col in the csv, but since I’m creating the df like this:

df = pd.read_csv(path/'fake_or_real_news.csv', usecols=["label", "text"])[["label", "text"]]

I don’t think the df can even see the index col. Also, I’m addressing the df cols explicitly by col name like this:

data_lm = (TextList.from_df(df, cols=['text','label'])

so col order shouldn’t be a problem.
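
A quick way to confirm this, sketched on a made-up CSV (the `title` column and the sample rows are assumptions; the `read_csv` call mirrors the one above):

```python
import io
import pandas as pd

# Made-up CSV with an unnamed index column and an extra, unused column.
csv_text = (
    ",title,text,label\n"
    "0,t1,first article,FAKE\n"
    "1,t2,second article,REAL\n"
)

# usecols keeps only the named columns; the trailing selection fixes their order.
df = pd.read_csv(io.StringIO(csv_text), usecols=["label", "text"])[["label", "text"]]

print(list(df.columns))             # column names, in order
print(df.columns.get_loc("text"))   # integer position of 'text'
```

If that prints only `label` and `text`, the index column never reaches the DataFrame, and `text` sits at position 1.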

wyquek · December 4, 2018, 1:52pm

Not sure if this is helpful, but I cobbled your code together and ran it, and show_batch() seems to be working fine.

gshashank84 · December 4, 2018, 9:58pm

How can we build a TextDataBunch to classify text that has more than one label per item, i.e. tags?

ricknta · December 5, 2018, 12:14am

That’s interesting, because I get the original error I had (TypeError: must be str, not int) if I do exactly the same thing:

This may be a result of the platform and/or versions. Which platform are you running on? I’m on Gradient and got weird behavior yesterday when I switched servers.

Also, if I specify the cols, I get around the TypeError, but then the text isn’t read correctly:

wyquek · December 5, 2018, 12:20am

I’m not on the cloud. I’ve got a humble little desktop with a 1070 GPU on Ubuntu.

ricknta · December 5, 2018, 12:48am

Both of those result in the same scrambled text:

If I go back to

data_lm = (TextList.from_df(df, cols=['text','label'])
                .random_split_by_pct(0.2)
                .label_for_lm()
                .databunch())

or

data_lm = (TextList.from_df(df, cols=[0,1])

it works properly again.

I’m running on a Gradient P6000 today, so it does appear that part of this problem is the platform! Last night on the GPU+

cols=['text','label']

didn’t work. Also see what wyquek and I compared above:

…so again it appears that the platform (meaning Gradient) is at least part of the problem.

lesscomfortable · December 5, 2018, 12:58am

I don’t really know why the problem arises. TextList calls .from_df from the parent class ItemList which in turn calls .iloc with the column numbers. I don’t know where it is failing.
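
To make the name-versus-position distinction concrete, here is a rough sketch in plain pandas. The helper `names_to_positions` is hypothetical and written for illustration only; it is not fastai's code, and fastai's actual conversion may differ.

```python
import pandas as pd

def names_to_positions(df, cols):
    """Map column names (or integer positions) to integer positions for .iloc."""
    if not isinstance(cols, (list, tuple)):
        cols = [cols]
    # Integers are taken as positions already; strings are looked up by name.
    return [c if isinstance(c, int) else df.columns.get_loc(c) for c in cols]

# Made-up rows; only the column names come from this thread.
df = pd.DataFrame({"label": ["FAKE", "REAL"], "text": ["first", "second"]})

by_name = df.iloc[:, names_to_positions(df, "text")]
by_position = df.iloc[:, names_to_positions(df, 1)]
default_col = df.iloc[:, names_to_positions(df, 0)]  # position 0 is 'label' here
```

With this frame, position 0 selects `label`, so any call that ends up using position 0 would read the labels rather than the text.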