Lesson 3 In-Class Discussion ✅

hey, Sam, could you elaborate a bit more on how to solve this?

thank you!

hi Karan,

I hope your issue has been resolved. I'm hitting a problem just before the one you had, in the data_lm part. :frowning: Could you share your notebook showing how you got past this? I think it has to do with my untar_data call for IMDB: I have the following 3 files instead of what Jeremy has in his output, so I don't have a train or test folder for running "data_lm = (TextList.from_folder…"

Thank you so much!

[PosixPath('/home/nbuser/courses/fast-ai/course-v3/nbs/data/imdb/classes.txt'),
 PosixPath('/home/nbuser/courses/fast-ai/course-v3/nbs/data/imdb/train.csv'),
 PosixPath('/home/nbuser/courses/fast-ai/course-v3/nbs/data/imdb/test.csv')]

data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
           .filter_by_folder(include=['train', 'test'])
           #We may have other temp folders that contain text files so we only keep what's in train and test
           .random_split_by_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
           .label_for_lm()
           #We want to do a language model so we label accordingly
           .databunch(bs=bs))
data_lm.save('tmp_lm')

@angelinayy, it was not for you or me to solve. The URL for IMDB used to point to a tgz file which contained only 3 text files. The current URL (https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz) has the right tgz file: if you download it and open it, you will find that it contains directories like train, test, etc.

So the issue is resolved now.
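
For anyone who still sees the old three-file layout, here is a minimal sketch of forcing a fresh download (assuming a recent fastai v1 install, where URLs.IMDB already points at the new archive):

from fastai.text import *

# Delete the stale imdb archive/folder under your data path first, so
# untar_data actually re-downloads instead of reusing the cached copy.
path = untar_data(URLs.IMDB)
path.ls()   # should now include train, test, unsup, imdb.vocab, ...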

thanks!

nevermind. it seems to work now. thank you all the same!

While running this code chunk:

data = (src.transform(tfms, size=256)
        .databunch().normalize(imagenet_stats))

learn.data = data
data.train_ds[0][0].shape

learn.freeze()

learn.lr_find()
learn.recorder.plot()

This is where we resize the images to 256 to improve the F-score, but I'm running into CUDA out-of-memory errors despite restarting the kernel. I am also just loading the saved weights from the previous run, so I'm not re-running any model. I'm using a GCP Compute instance for training my models. Does anyone have a workaround for this?


Try reducing the batch size inside the databunch method. That worked for me (e.g. bs=32).
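
Applied to the snippet above, that would look something like this (just a sketch; bs=32 is a starting point, not a magic number):

data = (src.transform(tfms, size=256)
        .databunch(bs=32)   # bs goes to .databunch(), not .transform()
        .normalize(imagenet_stats))
learn.data = data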

I must be doing something wrong: even after decreasing the batch size to 4, I'm still getting the out-of-memory error. There must be something that I'm missing.

Edit: My bad, I was adding that parameter to .transform() instead of .databunch(). It's working now. Thanks, @jm0077


Trying to conduct some experiments using unet from the fastai library, but I couldn't get it to work with some high-resolution images. I'm using batch size = 1, as each image is 8k*5k. For some reason, batch size = 1 isn't working for any dataset, even camvid. I get errors when trying to access one batch of data, with lr_find, and with fit_one_cycle (error screenshots not shown here).
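
One thing worth trying, a sketch only, following the lesson-3 camvid pattern (so `src`, `get_transforms`, and fastai.vision's namespace are assumed, and the 1/8 scale factor is arbitrary): train at a reduced resolution first instead of relying on bs=1 alone, since an 8k*5k image is far too large for most GPUs even at batch size 1.

src_size = np.array([5000, 8000])   # original image size (h, w)
size = src_size // 8                # start at 1/8 resolution, increase later
data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=1)
        .normalize(imagenet_stats))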

I am building a language model for Hindi but am not able to load the data; the kernel restarts every time. I have 32 GB of RAM and my data is about 10 GB.
I'm getting the problem here:

data_lm = (TextFileList.from_folder(path)         
           #grab all the text files in path
           .label_const(0)           
           #label them all with 0s (the targets aren't positive vs negative review here)
           .split_by_folder(valid='test')         
           #split by folder between train and validation set
           .datasets() 
           #use `TextDataset`, the flag `is_fnames=True` indicates to read the content of the files passed
           .tokenize()
           #tokenize with defaults from fastai
           .numericalize()
           #numericalize with defaults from fastai
           .databunch(TextLMDataBunch))
           #use a TextLMDataBunch
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-be9b2c40e9af> in <module>
      3            .label_const(0)
      4            #label them all wiht 0s (the targets aren't positive vs negative review here)
----> 5            .split_by_folder(valid='test')
      6            #split by folder between train and validation set
      7            .datasets()

~/anaconda2/envs/hindinlu/lib/python3.6/site-packages/fastai/data_block.py in datasets(self, dataset_cls, **kwargs)
    232         "Create datasets from the underlying data using `dataset_cls` and passing along the `kwargs`."
    233         if dataset_cls is None: dataset_cls = self.dataset_cls()
--> 234         train = dataset_cls(*self.train.items.T, **kwargs)
    235         dss = [train]
    236         dss += [train.new(*o.items.T, **kwargs) for o in self.lists[1:]]

~/anaconda2/envs/hindinlu/lib/python3.6/site-packages/fastai/text/data.py in __init__(self, fns, labels, classes, mark_fields, encode_classes)
    199         for f in fns:
    200             with open(f,'r') as f: texts.append(''.join(f.readlines()))
--> 201         super().__init__(texts, labels, classes, mark_fields, encode_classes)
    202 
    203 class LanguageModelLoader():

~/anaconda2/envs/hindinlu/lib/python3.6/site-packages/fastai/text/data.py in __init__(self, texts, labels, classes, mark_fields, encode_classes)
    129     def __init__(self, texts:Collection[str], labels:Collection[Any]=None, classes:Collection[Any]=None,
    130                  mark_fields:bool=True, encode_classes:bool=True):
--> 131         texts = _join_texts(np.array(texts), mark_fields)
    132         super().__init__(texts, labels, classes, encode_classes)
    133 

MemoryError: 


I have divided my text file into small subsets but still hit the same problem. I have monitored my system's memory usage too: it climbs to about 95% and then the kernel crashes.
How do you all load huge datasets?
@sgugger @radek
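
Not a fix from this thread, just a generic sketch of one idea: stream the corpus in shards so the full 10 GB never has to sit in a single Python list (the file layout and shard size below are assumptions):

from pathlib import Path

def iter_text_shards(path, files_per_shard=1000):
    # Yield the corpus a shard at a time instead of reading every
    # file into one giant list up front.
    fns = sorted(Path(path).rglob('*.txt'))
    for i in range(0, len(fns), files_per_shard):
        yield [f.read_text(encoding='utf-8') for f in fns[i:i+files_per_shard]]

# Tokenize/numericalize each shard and save it to disk before loading
# the next one, then stitch the saved ids together at the end.
for shard in iter_text_shards('data/hindi'):
    pass  # process and persist this shard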

How do I set the batch size while using the data block API?

I am trying to use it in .databunch(bs =32), but the batch size is still 64 (the default, I guess) and it's throwing a memory error.

data = (ImageItemList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')
        #Where to find the data? -> in planet 'train' folder
        .random_split_by_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_df(sep=' ')
        #How to label? -> use the csv file
        .transform(planet_tfms, size=128)
        #Data augmentation? -> use tfms with a size of 128
        .databunch())

AFAIK bs should go in databunch. Have you tried removing the empty space? I mean, use .databunch(bs=32) instead of .databunch(bs =32). I'm not sure what the impact of that is.

Whitespace in Python function argument lists (and in the language in general) does not matter, apart from the indentation in loops, if clauses, etc., of course.
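
A trivial way to see that for yourself (throwaway function, just for illustration):

def f(bs=64):
    return bs

# Extra spaces around '=' in a keyword argument are ignored by the parser:
assert f(bs=32) == f(bs =32) == f( bs = 32 )

So the extra space is not what's causing the memory error.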


Is there a y_range argument or anything similar for the text_classifier_learner to limit the output of a regression?


Is there any way I can track errors from the validation set and further fine-tune my model? Alternatively, can I split my training data into training, validation, and test sets, test the model on this new test set created from the training data, collect all the misclassified examples, and run the model only on the wrong ones?

I guess this is what was also discussed during lesson 3.

Thanks
Amit

I am also not sure what action we should take based on the output of the following top-x misclassification plot:

interp = ClassificationInterpretation.from_learner(cn_152)
interp.plot_top_losses(20, figsize=(12,10))
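
One common next step with the same fastai v1 interpretation object (not from this thread, just a pointer) is to see which class pairs get confused most, then inspect or relabel those examples:

# (actual, predicted, count) triples for pairs confused at least twice:
interp.most_confused(min_val=2)
interp.plot_confusion_matrix(figsize=(10,10))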

Can someone tell me what’s wrong with this? I get

TypeError: must be str, not int

from

.label_from_df(cols='label')

I get the same error if I use (cols=3), which works in the IMDB nb.

data = (TextList.from_csv(path, 'fake_or_real_news.csv', col='text')
                .random_split_by_pct(0.2)
                .label_from_df(cols='label')
                .databunch())

TypeError                                 Traceback (most recent call last)
/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1011         try:
-> 1012             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1013         except TypeError:

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    204     if use_numexpr:
--> 205         return _evaluate(op, op_str, a, b, **eval_kwargs)
    206     return _evaluate_standard(op, op_str, a, b)

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    119     if result is None:
--> 120         result = _evaluate_standard(op, op_str, a, b)
    121 

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     64     with np.errstate(all='ignore'):
---> 65         return op(a, b)
     66 

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in radd(left, right)
    112 def radd(left, right):
--> 113     return right + left
    114 

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
   1032             with np.errstate(all='ignore'):
-> 1033                 return na_op(lvalues, rvalues)
   1034         except Exception:

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1022                 mask = notna(x)
-> 1023                 result[mask] = op(x[mask], y)
   1024 

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in radd(left, right)
    112 def radd(left, right):
--> 113     return right + left
    114 

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-62-09c1b499b3ab> in <module>
      1 data = (TextList.from_csv(path, 'fake_or_real_news.csv', col='text')
      2                 .random_split_by_pct(0.2)
----> 3                 .label_from_df(cols='label')
      4                 .databunch())

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    316             self.valid = fv(*args, **kwargs)
    317             self.__class__ = LabelLists
--> 318             self.process()
    319             return self
    320         return _inner

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self)
    360     def process(self):
    361         xp,yp = self.get_processors()
--> 362         for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
    363         return self
    364 

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
    428             filt = array([o is None for o in self.y])
    429             if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 430         self.x.process(xp)
    431         return self
    432 

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in process(self, processor)
     58         if processor is not None: self.processor = processor
     59         self.processor = listify(self.processor)
---> 60         for p in self.processor: p.process(self)
     61         return self
     62 

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/text/data.py in process(self, ds)
    275     def process_one(self, item):  return self.tokenizer._process_all_1([item])[0]
    276     def process(self, ds):
--> 277         ds.items = _join_texts(ds.items, self.mark_fields)
    278         tokens = []
    279         for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/text/data.py in _join_texts(texts, mark_fields)
    334     if is1d(texts): texts = texts[:,None]
    335     df = pd.DataFrame({i:texts[:,i] for i in range(texts.shape[1])})
--> 336     text_col = f'{BOS} {FLD} {1} ' + df[0] if mark_fields else  f'{BOS} ' + df[0]
    337     for i in range(1,len(df.columns)):
    338         text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i]

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right)
   1067             rvalues = rvalues.values
   1068 
-> 1069         result = safe_na_op(lvalues, rvalues)
   1070         return construct_result(left, result,
   1071                                 index=left.index, name=res_name, dtype=None)

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
   1035             if is_object_dtype(lvalues):
   1036                 return libalgos.arrmap_object(lvalues,
-> 1037                                               lambda x: op(x, rvalues))
   1038             raise
   1039 

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object()

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x)
   1035             if is_object_dtype(lvalues):
   1036                 return libalgos.arrmap_object(lvalues,
-> 1037                                               lambda x: op(x, rvalues))
   1038             raise
   1039 

/opt/conda/envs/fastai/lib/python3.6/site-packages/pandas/core/ops.py in radd(left, right)
    111 
    112 def radd(left, right):
--> 113     return right + left
    114 
    115 

TypeError: must be str, not int

@ricknta

Does your csv file have a first row that labels the columns as 'text' and 'label'? Check once.

I am guessing because of the "if result is None:" line in the traceback.
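
A quick sanity check worth running on the csv before building the TextList (a sketch; the path and filename come from the snippet above):

import pandas as pd

df = pd.read_csv(path/'fake_or_real_news.csv')
print(df.columns)                            # confirm 'text' and 'label' exist
print(df['text'].map(type).value_counts())   # any non-str rows (ints, NaN)
                                             # will trigger "must be str, not int"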


Yes, first row has those labels.

My data looks like this (screenshot not shown here):

I don’t see why that would be a problem but am not sure.

I thought this could be a data problem of some sort, since the exact same syntax works fine on the IMDB data, so I cleaned up a subset of the data (it looked like the csv file had some problems), but I still get similar errors.