Fastai v2 text

The issue caused:

Here it is commented out, and working:

One more thing, I tried using the DataBlock method with TextBlock to load the data that way, and encountered a totally different error that prevented me from proceeding. I’ve reached the reply limit for this topic so I can’t post a screenshot. I’ll try pasting the code in though:

clas_db = DataBlock( blocks=(TextBlock.from_df('title', vocab=lm_dls.vocab, seq_len=72, is_lm=False), CategoryBlock),
                  get_x=ColReader('title'),
                  get_y=ColReader('major_class'),
                  splitter=ColSplitter()
                  )

clas_dls  = clas_db.dataloaders(df_training, bs=64, seq_len=72)

This results in the following error:

AttributeError: 'Series' object has no attribute 'title'

@reyf - This is because of TextBlock.from_df. Once it performs the tokenization, it stores the resulting column name as ‘text’ (see res_col_name).

TextBlock.from_df(text_cols, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, tok=None, rules=None, sep=' ', n_workers=64, mark_fields=None, res_col_name='text', **kwargs)

Hence it is not finding the column name ‘title’. Try changing get_x to ColReader('text'). You can also use https://dev.fast.ai/data.block#DataBlock.summary to validate this and to debug the DataBlock.
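
To illustrate the mechanism, here is a minimal sketch in plain Python. It is not fastai's actual code: `tokenize_row` and `col_reader` are hypothetical stand-ins for the tokenization step and for ColReader. The assumption is that tokenization writes its output under `res_col_name` and drops the original text column, so a reader that still asks for `title` fails with an AttributeError, while asking for `text` works.

```python
from types import SimpleNamespace

def tokenize_row(row, text_col, res_col_name="text"):
    """Hypothetical stand-in for the tokenization step (not fastai code)."""
    # Keep every field except the raw text column ...
    fields = {k: v for k, v in vars(row).items() if k != text_col}
    # ... and store the tokens under res_col_name instead.
    fields[res_col_name] = getattr(row, text_col).lower().split()
    return SimpleNamespace(**fields)

def col_reader(col):
    """Hypothetical stand-in for ColReader: fetch one attribute from a row."""
    return lambda row: getattr(row, col)

row = SimpleNamespace(title="A Study of Text Models", major_class="cs")
tok = tokenize_row(row, text_col="title")

try:
    col_reader("title")(tok)      # the original column name is gone
    title_lookup_failed = False
except AttributeError:
    title_lookup_failed = True

tokens = col_reader("text")(tok)  # the tokenized column is found
print(title_lookup_failed, tokens)
```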

Thanks! I’ll give that a try. I didn’t think the documentation was very clear but perhaps my attention span was to blame :slight_smile:

I am guessing the problem is with the label_col that you passed. It has to be major_class & not major_class_int. Try changing it and let us know if it works

I made the screenshot while I was experimenting with both of the columns - I was mostly using major_class_int and it didn’t work.

I was able to train on data tokenized by tokenize_df by creating my DataBlock like this:

df, count = tokenize_df(df_orig, text_cols=['data'], n_workers = 1)

imdb_clas = DataBlock(blocks=(TextBlock(tok_tfm=noop, vocab=vocab), MultiCategoryBlock),
                      get_x=attrgetter('text'),
                      splitter=TrainTestSplitter(test_size = 0.1, stratify=df['label'], random_state = 24),
                      get_y=ColReader(0, label_delim='|'))

Hope that helps!
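
For readers unfamiliar with label_delim: as far as I understand it, ColReader(0, label_delim='|') reads column 0 and splits the cell on `|`, which gives MultiCategoryBlock a list of labels per row. A plain-Python sketch of that splitting follows; `split_labels` is a hypothetical helper for illustration and not part of fastai.

```python
def split_labels(cell, label_delim="|"):
    """Hypothetical helper: turn 'a|b' into ['a', 'b']; an empty cell means no labels."""
    return cell.split(label_delim) if len(cell) > 0 else []

print(split_labels("sports|politics"))
print(split_labels("sports"))
print(split_labels(""))
```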

I’m experiencing some issues with learn.get_preds() for inference. I want to get predictions from a trained text_classifier_learner on thousands of documents. So in the past with fastai v1 I created a test set with all my docs and then ran learn.get_preds(). Now I’m getting some strange results.

I was testing this with my validation set and cannot make sense of it. When I run learn.get_preds() I get correct results on my validation set as expected.

Then I create a test dataloader (also with my validation set for comparison)

test_dl = learn.dls.test_dl(df_val)
out = learn.get_preds(dl=test_dl)

However, the predictions from learn.get_preds(dl=test_dl) are entirely different from learn.get_preds() even though the underlying dataset is the same.

Can it be the order? Is there an equivalent of setting ordered=True in fastai v2? Or am I getting something else wrong?

I did some more digging around and it looks like the test dataloader gets sorted by length no matter if shuffle is set to True or False.

test_dl = learn.dls.test_dl(df_val, shuffle=False, drop_last=False)
for batch in test_dl:
    print(batch[0].shape)

This results in the following shapes (note that my dataset is not sorted by length):

torch.Size([128, 1502])
torch.Size([128, 69])
torch.Size([22, 25])

Any advice on how to get predictions in the original order? @muellerzr I also had a look at fastinference (which is great btw) but since the problem arises when I create the dataloader it probably would lead to the same behavior.

@stefan-ai I had a similar issue towards the end of last year. I think Sylvain added a reorder argument to learner.get_preds to change the order in which predictions are returned. reorder=True by default, so maybe try switching it to False?

Or else you can grab the indexes with something like test_dl.get_idxs, and then re-order your predictions with sorted from fastcore (or just python’s sort for lists)

Thanks for your reply @morgan!

It seems like reorder doesn’t exist but getting the test indices and then reordering does the trick :slight_smile:
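
For anyone landing here later, a minimal sketch of that reordering in plain Python. It assumes `idxs` is the list returned by test_dl.get_idxs() (the order in which the dataloader served the items) and that `preds` holds one prediction per item in that served order. `restore_order` is a hypothetical helper, not a fastai function.

```python
def restore_order(preds, idxs):
    """Return preds rearranged so that position j holds the prediction for item j."""
    restored = [None] * len(idxs)
    for served_pos, original_idx in enumerate(idxs):
        # The prediction served at position served_pos belongs to item original_idx.
        restored[original_idx] = preds[served_pos]
    return restored

# Toy example: the dataloader served item 2 first, then item 0, then item 1.
idxs = [2, 0, 1]
preds = ["pred_for_item_2", "pred_for_item_0", "pred_for_item_1"]
print(restore_order(preds, idxs))
```

With tensors, indexing the predictions with torch.tensor(idxs).argsort() gives the same rearrangement.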

I am facing an error while loading my language model vocab into DataBlock for my classifier. The error is

TypeError: unhashable type: 'list'

Here is my DataBlock

dls_clas = DataBlock(
    blocks=(TextBlock.from_df('text', vocab=dls.vocab), MultiCategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('label'),
    splitter=RandomSplitter()
).dataloaders(data, bs=128, seq_len=72)

I am not sure what is wrong as I have followed all the steps mentioned in the lecture.

Thanks !

Hey @sgugger, I have a language model pretrained on some text. My data has a text column with around 2.57 lakh (257,000) records, and I would like to get sentence encodings from the encoder of the pretrained model. I tried the approach from https://medium.com/@alden_6876/getting-document-encodings-from-fastais-ulmfit-language-model-a3f9271f9ecd, but it takes about 3 minutes for 500 records, and that code was written for fastai v1. Is there a faster way to get the encodings? Can you help me out?

@Anish_sri best not to tag Sylvain or Jeremy for ad-hoc help requests (Sylvain has also moved to Huggingface, so he is less active here these days). Try opening a new topic so the community can take a look and try to help.

@morgan Hello Morgan,
I am facing a problem with multi-label text classification and, as I am new to fast.ai, I can't see where I went wrong. I posted the whole problem here: Multi label text classification. Could you please help me out?

Thanks,
Anish

Hi mgloria

did you find a fix for this? I have run into the same error! thanks

Not able to open this link: http://dev.fast.ai/tutorial.wikitext.html
:roll_eyes: :disappointed_relieved:

Fastai 2 is in production now, you can find it here: https://docs.fast.ai/tutorial.wikitext.html

(dev. -> docs.)

I’m having some trouble using multiple GPUs to train a language model on v2. Following the documentation:

and the sample:

I came up with this:

from fastai import *
from fastai.text import *
from fastai.text.all import *
from fastai.distributed import *

path = Path('/mnt/harddrive/text_files')
imdb = DataBlock(blocks=(TextBlock.from_folder(path, is_lm=True), CategoryBlock),
                 get_items=get_text_files,
                 get_y=parent_label,
                 splitter=RandomSplitter(valid_pct=0.2, seed=1))
dbunch = imdb.dataloaders(path, bs=64, seq_len=80, num_workers=0)

learn_lm = language_model_learner(dbunch, AWD_LSTM).to_fp16()
learn_lm.freeze()

with learn_lm.distrib_ctx():
    learn_lm.fit_one_cycle(1, 0.05, moms=(0.8,0.7,0.8))

When training with a smaller dataset, it trains properly most of the time. With a larger dataset, it always fails. The error is almost identical in both cases, the only part that ever changes is the “and [77] at entry 32” part:

RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [77] at entry 32

The difference between it working and not working can be as subtle as changing the seed on the RandomSplitter. I’ve added some debug lines and provided the output:

First attempt, with seed: 1

python3 -m fastai.launch dist_simple.py 
World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 1
bs: 64
seq_len: 80
Learning
dbunch loaded
_
rank_distrib(): 1
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
epoch     train_loss  valid_loss  time    
0         4.363470    3.997428    00:00

Second attempt, with seed: 3

python3 -m fastai.launch dist_simple.py
World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 3
bs: 64
seq_len: 80
Learning
dbunch loaded
_
rank_distrib(): 1
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
epoch     train_loss  valid_loss  time    
███████████████████████████------------------------------------------------| 50.00% [2/4 00:00<00:00 4.7127]
Traceback (most recent call last):
  File "/home/chess/project/training/dist_simple.py", line 51, in <module>
    learn_lm.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
    self._do_epoch_train()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 101, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 110, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 133, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 132, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in fa_collate
    else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in <listcomp>
    else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 47, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/torch_core.py", line 325, in __torch_function__
    res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [77] at entry 32

Third attempt, seed: 0:

Everything is the same besides the last line:

RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [67] at entry 32

Fourth attempt, an example of a failure with batch size changed from 64 to 32. Note that the “entry” in the stack error at the end changed from 32 to 16:

World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 1
bs: 32
seq_len: 80
Learning
dbunch loaded
_
rank_distrib(): 1
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
epoch     train_loss  valid_loss  time    
███████████████████████████████████████████████████------------------------| 75.00% [6/8 00:00<00:00 4.2810]
Traceback (most recent call last):
  File "/home/chess/project/training/dist_simple.py", line 51, in <module>
    learn_lm.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
    self._do_epoch_train()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 101, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 110, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 133, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 132, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in fa_collate
    else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in <listcomp>
    else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 47, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/torch_core.py", line 325, in __torch_function__
    res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [14] at entry 16

When I run it with a single GPU, it works 100% of the time:

python3 -m fastai.launch --gpus=0 dist_simple.py

Things I’ve tried:

  1. Different versions of the dataset, one with many smaller files, one with a few larger files, etc.
  2. Using this code instead of the datablock:
dbunch = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
  3. Different combinations of bs, seq_len, valid_pct, and num_workers.
  4. GrandparentSplitter instead of RandomSplitter.
  5. Adding drop_last=True to the dataloader.

My analysis:

Based on this post by sgugger, if there happens to be a remainder in the last batch (not enough data to fill up the batch fully), it’s dropped:

I think this issue affects language models specifically, as classifiers use padding when the data doesn’t line up. I believe it’s failing to drop the remainder sometimes, resulting in the “stack expects each tensor to be equal size” error when using a distributed language learner with multiple GPUs. This would explain why the “entry” number in the error is always the last item in the batch that gets sent to the GPU, and can be calculated by looking at the batch_size and world_size (number of GPUs).
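
To make the suspected failure mode concrete, here is a plain-Python sketch. It only illustrates the idea and is not fastai's LMDataLoader logic: a token stream is cut into pieces of seq_len, and unless the short remainder is dropped, one piece ends up shorter than the rest and cannot be stacked with them. The stream length of 237 is made up for the example, and `chunk_lengths` is a hypothetical helper.

```python
def chunk_lengths(n_tokens, seq_len, drop_remainder):
    """Lengths of the pieces obtained by cutting n_tokens into seq_len-sized chunks."""
    full, rest = divmod(n_tokens, seq_len)
    lengths = [seq_len] * full
    if rest and not drop_remainder:
        lengths.append(rest)  # the short tail that would break stacking
    return lengths

kept = chunk_lengths(237, 80, drop_remainder=False)
dropped = chunk_lengths(237, 80, drop_remainder=True)
print(kept)     # [80, 80, 77]
print(dropped)  # [80, 80]

# Stacking needs every piece to have the same length.
print(len(set(kept)) == 1, len(set(dropped)) == 1)
```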

In the examples above:

world_size: 2
bs: 64
produces:

RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [77] at entry 32

world_size: 2
bs: 32
produces:

RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [14] at entry 16
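
The pattern in these two cases can be written down as a small calculation. This is only a sketch of the observation, assuming each batch of bs items is divided evenly across world_size processes. `first_entry_past_share` is a hypothetical name, not something from fastai.

```python
def first_entry_past_share(bs, world_size):
    """Index of the first item beyond one process's even share of a batch."""
    return bs // world_size

# (bs, world_size, entry number reported in the error message)
for bs, world_size, reported in [(64, 2, 32), (32, 2, 16)]:
    predicted = first_entry_past_share(bs, world_size)
    print(f"bs={bs} world_size={world_size} predicted={predicted} reported={reported}")
```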

This matches with the math I see in the class DistributedDL(TfmdDL): section in the documentation:

Am I on the right track? If my analysis is correct and drop_last isn’t being properly applied, any ideas on how I can address that?

Thanks!

Update: I’ve gone into python3.8/site-packages/fastai/data/load.py and hard-coded drop_last to always be True, but the error persists.

Update2: While I saw this post before, I didn’t think it was related. After a deeper dive I see @pierreguillou had the same problem last July with no solution:

So I tried with DataParallel instead of DistributedDataParallel:

with learn_lm.parallel_ctx():
    learn_lm.fit_one_cycle(1, 0.05)

but now I’m getting this:

python3 -m fastai.launch dist_simple.py 
World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 3
bs: 64
seq_len: 32
dbunch loaded
_
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:30: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
epoch     train_loss  valid_loss  time    
--------------------------------------------------------------------------------------------| 0.00% [0/9 00:00<00:00]
rank_distrib(): 1
Learning
/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:30: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
epoch     train_loss  valid_loss  time    
---------------------------------------------------------------------------| 0.00% [0/10 00:00<00:00]
Traceback (most recent call last):
  File "/home/chess/project/training/dist_simple.py", line 57, in <module>
    learn_lm.fit_one_cycle(1, lr)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
    self._do_epoch_train()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 184, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 169, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Traceback (most recent call last):
  File "/home/chess/project/training/dist_simple.py", line 57, in <module>
    learn_lm.fit_one_cycle(1, lr)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
    self._do_epoch_train()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 184, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 169, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/text/models/awdlstm.py", line 106, in forward
    output, new_h = rnn(output, self.hidden[l])
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/text/models/awdlstm.py", line 53, in forward
    return self.module(*args)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 581, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0

I don’t think there’s a reliable way to train a language model on v2 with multiple GPUs at the moment. Moving forward with single GPU training!

Has anyone here successfully used the partial_dataloader from @boris on Fastai v2 text lately?

Here’s my code:

from fastai import *
from fastai.text import *
from fastai.text.all import *

path = Path('/mnt/harddrive/text_files')
imdb = DataBlock(blocks=(TextBlock.from_folder(path), CategoryBlock),
                 get_items=get_text_files,
                 get_y=parent_label,
                 splitter=RandomSplitter(valid_pct=0.2, seed=1))

dbunch = imdb.datasets(path)

dls = dbunch.partial_dataloaders(partial_n=32, bs=16)

At this point, if I try to train, or show the batch, or do much of anything with dls, I get this:

dls.show_batch(max_n=5)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-0d9e63801e38> in <module>
----> 1 dls.show_batch(max_n=5)

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/core.py in show_batch(self, b, max_n, ctxs, show, unique, **kwargs)
     98             old_get_idxs = self.get_idxs
     99             self.get_idxs = lambda: Inf.zeros
--> 100         if b is None: b = self.one_batch()
    101         if not show: return self._pre_show_batch(b, max_n=max_n)
    102         show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in one_batch(self)
    146     def one_batch(self):
    147         if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
--> 148         with self.fake_l.no_multiproc(): res = first(self)
    149         if hasattr(self, 'it'): delattr(self, 'it')
    150         return res

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastcore/basics.py in first(x, f, negate, **kwargs)
    545     x = iter(x)
    546     if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
--> 547     return next(x, None)
    548 
    549 # Cell

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in __iter__(self)
    107         self.before_iter()
    108         self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 109         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
    110             if self.device is not None: b = to_device(b, self.device)
    111             yield self.after_batch(b)

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     32                 raise StopIteration
     33         else:
---> 34             data = next(self.dataset_iter)
     35         return self.collate_fn(data)
     36 

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in create_batches(self, samps)
    116         if self.dataset is not None: self.it = iter(self.dataset)
    117         res = filter(lambda o:o is not None, map(self.do_item, samps))
--> 118         yield from map(self.do_batch, self.chunkify(res))
    119 
    120     def new(self, dataset=None, cls=None, **kwargs):

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in do_batch(self, b)
    142         else: raise IndexError("Cannot index an iterable dataset numerically - must use `None`.")
    143     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
--> 144     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    145     def to(self, device): self.device = device
    146     def one_batch(self):

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in create_batch(self, b)
    141         elif s is None:  return next(self.it)
    142         else: raise IndexError("Cannot index an iterable dataset numerically - must use `None`.")
--> 143     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
    144     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    145     def to(self, device): self.device = device

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in fa_collate(t)
     48     b = t[0]
     49     return (default_collate(t) if isinstance(b, _collate_types)
---> 50             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     51             else default_collate(t))
     52 

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in <listcomp>(.0)
     48     b = t[0]
     49     return (default_collate(t) if isinstance(b, _collate_types)
---> 50             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     51             else default_collate(t))
     52 

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in fa_collate(t)
     47     "A replacement for PyTorch `default_collate` which maintains types and handles `Sequence`s"
     48     b = t[0]
---> 49     return (default_collate(t) if isinstance(b, _collate_types)
     50             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     51             else default_collate(t))

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     53             storage = elem.storage()._new_shared(numel)
     54             out = elem.new(storage)
---> 55         return torch.stack(batch, 0, out=out)
     56     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
     57             and elem_type.__name__ != 'string_':

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/torch_core.py in __torch_function__(self, func, types, args, kwargs)
    327         convert=False
    328         if _torch_handled(args, self._opt, func): convert,types = type(self),(torch.Tensor,)
--> 329         res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
    330         if convert: res = convert(res)
    331         if isinstance(res, TensorBase): res.set_meta(self, as_copy=True)

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/tensor.py in __torch_function__(cls, func, types, args, kwargs)
    993 
    994         with _C.DisableTorchFunction():
--> 995             ret = func(*args, **kwargs)
    996             return _convert(ret, cls)
    997 

RuntimeError: stack expects each tensor to be equal size, but got [337] at entry 0 and [235] at entry 1

Running the code directly from the documentation gives a similar error:

assert len(dls[0])==2
for batch in dls[0]:
    assert len(batch[0])==16

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-16-315c76472cbb> in <module>
      1 assert len(dls[0])==2
----> 2 for batch in dls[0]:
      3     assert len(batch[0])==16

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in __iter__(self)
    107         self.before_iter()
    108         self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 109         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
    110             if self.device is not None: b = to_device(b, self.device)
    111             yield self.after_batch(b)

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
   1083             else:
   1084                 del self._task_info[idx]
-> 1085                 return self._process_data(data)
   1086 
   1087     def _try_put_index(self):

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1109         self._try_put_index()
   1110         if isinstance(data, ExceptionWrapper):
-> 1111             data.reraise()
   1112         return data
   1113 

~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/_utils.py in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)
    429 
    430 

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 118, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 144, in do_batch
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 143, in create_batch
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 50, in fa_collate
    else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 50, in <listcomp>
    else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 49, in fa_collate
    return (default_collate(t) if isinstance(b, _collate_types)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/torch_core.py", line 329, in __torch_function__
    res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
  File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: stack expects each tensor to be equal size, but got [135] at entry 0 and [39] at entry 1


My research tells me I need to add some padding transforms. I’ve tried adding each of these to the DataBlock, to no avail:

item_tfms=pad_input
batch_tfms=pad_input

Beyond this, I’ve tried too many things to list here, including calling partial_dataloaders in a myriad of ways. It’s not clear to me why the standard dataloaders don’t need any extra transform code for padding, but the partial_dataloader does. I also tried bringing some of the tfms code from the standard dataloaders function over to partial_dataloader, again to no avail.

Has anyone been successful with this lately?
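
For anyone reading along, here is a rough sketch of why the ordering of the padding step seems to matter. Both tracebacks fail inside torch.stack, called from default_collate, and the do_batch line in load.py shows before_batch running before create_batch, while after_batch is only applied once the batch has already been built. So padding probably has to happen in a before_batch-style step: a per-item transform cannot know the longest sequence in the batch, and a batch transform runs too late. The code below is plain Python, not fastai code. The names pad_samples, stack and make_batch are made up for illustration, and pad_idx=1 is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[List[int], int]  # (token ids, label)

def pad_samples(samples: List[Sample], pad_idx: int = 1) -> List[Sample]:
    """Right-pad every token list to the longest length in this batch.

    Hypothetical helper; pad_idx=1 is an assumed padding token id.
    """
    max_len = max(len(tokens) for tokens, _ in samples)
    return [(tokens + [pad_idx] * (max_len - len(tokens)), label)
            for tokens, label in samples]

def stack(samples: List[Sample]) -> Tuple[List[List[int]], List[int]]:
    """Stand-in for the collate step: rejects ragged inputs, as torch.stack does."""
    lengths = {len(tokens) for tokens, _ in samples}
    if len(lengths) > 1:
        raise RuntimeError(
            f"stack expects each tensor to be equal size, got lengths {sorted(lengths)}")
    xs = [tokens for tokens, _ in samples]
    ys = [label for _, label in samples]
    return xs, ys

def make_batch(samples: List[Sample],
               before_batch: Optional[Callable[[List[Sample]], List[Sample]]] = None):
    """Mimics the order seen in the traceback: before_batch first, then collate."""
    if before_batch is not None:
        samples = before_batch(samples)
    return stack(samples)

raw = [([5, 8, 13], 0), ([7, 2], 1)]

# With padding applied before collation, the batch can be built.
xs, ys = make_batch(raw, before_batch=pad_samples)

# Without it, the stand-in collate raises the same kind of error as above.
try:
    make_batch(raw)
except RuntimeError as e:
    print("collate failed:", e)
```

If that reading is right, the thing to check is whether partial_dataloaders accepts and forwards a before_batch argument, rather than adding the padding function as item_tfms or batch_tfms. I have not verified this against the library.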