One more thing: I tried loading the data with the DataBlock API and TextBlock instead, and ran into a completely different error that prevented me from proceeding. I’ve reached the reply limit for this topic so I can’t post a screenshot, but I’ll try pasting the code in:
Hence it is not finding the column named ‘title’. Try changing get_x to ColReader(‘text’). Please use https://dev.fast.ai/data.block#DataBlock.summary to validate the above and to debug the DataBlock.
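Something along these lines, for example — just a sketch, and I’m assuming the documents end up in a ‘text’ column and the label lives in ‘major_class’, so adjust the names to your dataframe:
dblock = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72), CategoryBlock),
                   get_x=ColReader('text'),   # TextBlock.from_df puts the tokenized text in a 'text' column
                   get_y=ColReader('major_class'),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42))
dblock.summary(df)                  # runs one sample through the pipeline and reports where it breaks
dls = dblock.dataloaders(df, bs=64)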
I am guessing the problem is with the label_col that you passed. It has to be major_class, not major_class_int. Try changing it and let us know if it works.
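With the factory method that would look roughly like this (a sketch; I’m assuming the text column is called ‘text’):
dls = TextDataLoaders.from_df(df, text_col='text', label_col='major_class',
                              valid_pct=0.2, seed=42, bs=64)
dls.show_batch(max_n=3)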
I’m experiencing some issues with learn.get_preds() for inference. I want to get predictions from a trained text_classifier_learner on thousands of documents. So in the past with fastai v1 I created a test set with all my docs and then ran learn.get_preds(). Now I’m getting some strange results.
I was testing this with my validation set and cannot make sense of it. When I run learn.get_preds() I get correct results on my validation set as expected.
Then I create a test dataloader (also with my validation set for comparison)
test_dl = learn.dls.test_dl(df_val)
out = learn.get_preds(dl=test_dl)
However, the predictions from learn.get_preds(dl=test_dl) are entirely different from learn.get_preds() even though the underlying dataset is the same.
Can it be the order? Is there an equivalent of setting ordered=True in fastai v2? Or am I getting something else wrong?
Any advice on how to get predictions in the original order? @muellerzr I also had a look at fastinference (which is great btw) but since the problem arises when I create the dataloader it probably would lead to the same behavior.
@stefan-ai I had a similar issue towards the end of last year. I think Sylvain added a reorder argument to learner.get_preds to change the order the predictions are returned in. reorder=True by default, so maybe try switching it to False?
Or else you can grab the indexes with something like test_dl.get_idxs, and then re-order your predictions with sorted from fastcore (or just Python’s sort for lists).
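For example, something like this sketch (assuming the test dataloader isn’t shuffled, so calling get_idxs() again returns the same length-sorted order that get_preds used):
import torch

test_dl = learn.dls.test_dl(df_val)
preds, _ = learn.get_preds(dl=test_dl, reorder=False)  # predictions in the dataloader's (length-sorted) order
idxs = torch.tensor(test_dl.get_idxs())                # idxs[i] = original row behind prediction i
preds_in_df_order = preds[idxs.argsort()]              # undo the sorting, back to df_val row order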
Hey @sgugger, I have a language model pretrained on some text. My data has a text column with around 2.57 lakh (~257,000) records, and I would like to get the sentence encodings from the encoder of the pretrained model. I have tried the approach in https://medium.com/@alden_6876/getting-document-encodings-from-fastais-ulmfit-language-model-a3f9271f9ecd, but it takes about 3 minutes for 500 records, and that code was written with the fastai v1 library. Is there a way to get the encodings faster?
@Anish_sri best not to tag Sylvain or Jeremy for ad-hoc help requests (Sylvain has also moved to Hugging Face, so he’s less active here these days). Try opening a new topic and the community can take a look and try to help.
@morgan Hello Morgan,
I am facing a problem with multi-label text classification. Can you help me figure out where I went wrong? I am new to fastai. Here is the link where I posted my whole problem: Multi label text classification. Could you please help me out?
I’m having some trouble using multiple GPUs to train a language model on v2. Following the documentation:
and the sample:
I came up with this:
from fastai import *
from fastai.text import *
from fastai.text.all import *
from fastai.distributed import *
path = Path('/mnt/harddrive/text_files')
imdb = DataBlock(blocks=(TextBlock.from_folder(path, is_lm=True), CategoryBlock),
                 get_items=get_text_files,
                 get_y=parent_label,
                 splitter=RandomSplitter(valid_pct=0.2, seed=1))
dbunch = imdb.dataloaders(path, bs=64, seq_len=80, num_workers=0)
learn_lm = language_model_learner(dbunch, AWD_LSTM).to_fp16()
learn_lm.freeze()
with learn_lm.distrib_ctx():
    learn_lm.fit_one_cycle(1, 0.05, moms=(0.8,0.7,0.8))
When training with a smaller dataset, it trains properly most of the time. With a larger dataset, it always fails. The error is almost identical in both cases; the only part that ever changes is the “and [77] at entry 32” part:
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [77] at entry 32
The difference between it working and not working can be as subtle as changing the seed on the RandomSplitter. I’ve added some debug lines and provided the output:
python3 -m fastai.launch dist_simple.py
World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 3
bs: 64
seq_len: 80
Learning
dbunch loaded
_
rank_distrib(): 1
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
epoch train_loss valid_loss time
Traceback (most recent call last):███████████████████████████------------------------------------------------| 50.00% [2/4 00:00<00:00 4.7127]
File "/home/chess/project/training/dist_simple.py", line 51, in <module>
learn_lm.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
self._do_epoch_train()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
self._with_events(self.all_batches, 'train', CancelTrainException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 101, in __iter__
for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data = next(self.dataset_iter)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 110, in create_batches
yield from map(self.do_batch, self.chunkify(res))
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 133, in do_batch
def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 132, in create_batch
def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in fa_collate
else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in <listcomp>
else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 47, in fa_collate
return (default_collate(t) if isinstance(b, _collate_types)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/torch_core.py", line 325, in __torch_function__
res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
ret = func(*args, **kwargs)
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [77] at entry 32
Third attempt, seed: 0:
Everything is the same besides the last line:
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [67] at entry 32
Fourth attempt, an example of a failure with the batch size changed from 64 to 32; note the “entry” in the stack error at the end changed from 32 to 16:
World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 1
bs: 32
seq_len: 80
Learning
dbunch loaded
_
rank_distrib(): 1
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
epoch train_loss valid_loss time
Traceback (most recent call last):███████████████████████████████████████████████████------------------------| 75.00% [6/8 00:00<00:00 4.2810]
File "/home/chess/project/training/dist_simple.py", line 51, in <module>
learn_lm.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
self._do_epoch_train()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
self._with_events(self.all_batches, 'train', CancelTrainException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 101, in __iter__
for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data = next(self.dataset_iter)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 110, in create_batches
yield from map(self.do_batch, self.chunkify(res))
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 133, in do_batch
def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 132, in create_batch
def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in fa_collate
else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 48, in <listcomp>
else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/data/load.py", line 47, in fa_collate
return (default_collate(t) if isinstance(b, _collate_types)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/torch_core.py", line 325, in __torch_function__
res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
ret = func(*args, **kwargs)
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [14] at entry 16
When I run it with a single GPU, it works 100% of the time:
python3 -m fastai.launch --gpus=0 dist_simple.py
Things I’ve tried:
Different versions of the dataset: one with many smaller files, one with a few larger files, etc.
Different combinations of bs, seq_len, valid_pct, and num_workers.
GrandparentSplitter instead of RandomSplitter.
Adding drop_last=True to the dataloader.
My analysis:
Based on this post by sgugger, if there happens to be a remainder in the last batch (not enough data to fill up the batch fully), it’s dropped:
I think this issue affects language models specifically, as classifiers use padding when the data doesn’t line up. I believe it sometimes fails to drop the remainder, resulting in the “stack expects each tensor to be equal size” error when using a distributed language learner with multiple GPUs. This would explain why the “entry” number in the error is always at the end of the batch that gets sent to each GPU, and can be calculated as batch_size divided by world_size (the number of GPUs).
In the examples above:
world_size: 2
bs: 64
produces:
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [77] at entry 32
world_size: 2
bs: 32
produces:
RuntimeError: stack expects each tensor to be equal size, but got [80] at entry 0 and [14] at entry 16
This matches with the math I see in the class DistributedDL(TfmdDL): section in the documentation:
Am I on the right track? If my analysis is correct and drop_last isn’t being properly applied, any ideas on how I can address that?
Thanks!
Update: I’ve gone into python3.8/site-packages/fastai/data/load.py and hard-coded drop_last to always be True and active, but the error persists.
Update 2: I had seen this post before but didn’t think it was related. After a deeper dive, I see that @pierreguillou had the same problem last July, with no solution:
So I tried with DataParallel instead of DistributedDataParallel:
with learn_lm.parallel_ctx():
    learn_lm.fit_one_cycle(1, 0.05)
but now I’m getting this:
python3 -m fastai.launch dist_simple.py
World Size: 2
_
Loading dbunch
_
valid_pct: 0.2
seed: 3
bs: 64
seq_len: 32
dbunch loaded
_
rank_distrib(): 0
num_distrib(): 2
torch.cuda.device_count(): 2
Learning
/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:30: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
epoch train_loss valid_loss time
rank_distrib(): 1--------------------------------------------------------------------------------------------| 0.00% [0/9 00:00<00:00]
Learning
/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:30: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
epoch train_loss valid_loss time
Traceback (most recent call last):---------------------------------------------------------------------------| 0.00% [0/10 00:00<00:00]
File "/home/chess/project/training/dist_simple.py", line 57, in <module>
learn_lm.fit_one_cycle(1, lr)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
self._do_epoch_train()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
self._with_events(self.all_batches, 'train', CancelTrainException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 184, in one_batch
self._with_events(self._do_one_batch, 'batch', CancelBatchException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 169, in _do_one_batch
self.pred = self.model(*self.xb)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Traceback (most recent call last):
File "/home/chess/project/training/dist_simple.py", line 57, in <module>
learn_lm.fit_one_cycle(1, lr)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/callback/schedule.py", line 112, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 202, in _do_fit
self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 196, in _do_epoch
self._do_epoch_train()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 188, in _do_epoch_train
self._with_events(self.all_batches, 'train', CancelTrainException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 166, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 184, in one_batch
self._with_events(self._do_one_batch, 'batch', CancelBatchException)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 160, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/learner.py", line 169, in _do_one_batch
self.pred = self.model(*self.xb)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/text/models/awdlstm.py", line 106, in forward
output, new_h = rnn(output, self.hidden[l])
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/fastai/text/models/awdlstm.py", line 53, in forward
return self.module(*args)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/chess/project/environments2/fastai_latest_dist/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 581, in forward
result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0
I don’t think there’s a reliable way to train a language model on v2 with multiple GPUs at the moment. Moving forward with single GPU training!
Has anyone here successfully used partial_dataloaders from @boris on fastai v2 text lately?
Here’s my code:
from fastai import *
from fastai.text import *
from fastai.text.all import *
path = Path('/mnt/harddrive/text_files')
imdb = DataBlock(blocks=(TextBlock.from_folder(path), CategoryBlock),
                 get_items=get_text_files,
                 get_y=parent_label,
                 splitter=RandomSplitter(valid_pct=0.2, seed=1))
dbunch = imdb.datasets(path)
dls = dbunch.partial_dataloaders(partial_n=32, bs=16)
At this point, if I try to train, or show the batch, or do much of anything with dls, I get this:
dls.show_batch(max_n=5)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-0d9e63801e38> in <module>
----> 1 dls.show_batch(max_n=5)
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/core.py in show_batch(self, b, max_n, ctxs, show, unique, **kwargs)
98 old_get_idxs = self.get_idxs
99 self.get_idxs = lambda: Inf.zeros
--> 100 if b is None: b = self.one_batch()
101 if not show: return self._pre_show_batch(b, max_n=max_n)
102 show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in one_batch(self)
146 def one_batch(self):
147 if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
--> 148 with self.fake_l.no_multiproc(): res = first(self)
149 if hasattr(self, 'it'): delattr(self, 'it')
150 return res
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastcore/basics.py in first(x, f, negate, **kwargs)
545 x = iter(x)
546 if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
--> 547 return next(x, None)
548
549 # Cell
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in __iter__(self)
107 self.before_iter()
108 self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 109 for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
110 if self.device is not None: b = to_device(b, self.device)
111 yield self.after_batch(b)
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
473 def _next_data(self):
474 index = self._next_index() # may raise StopIteration
--> 475 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
476 if self._pin_memory:
477 data = _utils.pin_memory.pin_memory(data)
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
32 raise StopIteration
33 else:
---> 34 data = next(self.dataset_iter)
35 return self.collate_fn(data)
36
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in create_batches(self, samps)
116 if self.dataset is not None: self.it = iter(self.dataset)
117 res = filter(lambda o:o is not None, map(self.do_item, samps))
--> 118 yield from map(self.do_batch, self.chunkify(res))
119
120 def new(self, dataset=None, cls=None, **kwargs):
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in do_batch(self, b)
142 else: raise IndexError("Cannot index an iterable dataset numerically - must use `None`.")
143 def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
--> 144 def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
145 def to(self, device): self.device = device
146 def one_batch(self):
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in create_batch(self, b)
141 elif s is None: return next(self.it)
142 else: raise IndexError("Cannot index an iterable dataset numerically - must use `None`.")
--> 143 def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
144 def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
145 def to(self, device): self.device = device
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in fa_collate(t)
48 b = t[0]
49 return (default_collate(t) if isinstance(b, _collate_types)
---> 50 else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
51 else default_collate(t))
52
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in <listcomp>(.0)
48 b = t[0]
49 return (default_collate(t) if isinstance(b, _collate_types)
---> 50 else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
51 else default_collate(t))
52
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in fa_collate(t)
47 "A replacement for PyTorch `default_collate` which maintains types and handles `Sequence`s"
48 b = t[0]
---> 49 return (default_collate(t) if isinstance(b, _collate_types)
50 else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
51 else default_collate(t))
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
53 storage = elem.storage()._new_shared(numel)
54 out = elem.new(storage)
---> 55 return torch.stack(batch, 0, out=out)
56 elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
57 and elem_type.__name__ != 'string_':
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/torch_core.py in __torch_function__(self, func, types, args, kwargs)
327 convert=False
328 if _torch_handled(args, self._opt, func): convert,types = type(self),(torch.Tensor,)
--> 329 res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
330 if convert: res = convert(res)
331 if isinstance(res, TensorBase): res.set_meta(self, as_copy=True)
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/tensor.py in __torch_function__(cls, func, types, args, kwargs)
993
994 with _C.DisableTorchFunction():
--> 995 ret = func(*args, **kwargs)
996 return _convert(ret, cls)
997
RuntimeError: stack expects each tensor to be equal size, but got [337] at entry 0 and [235] at entry 1
Running the code directly from the documentation gives a similar error:
assert len(dls[0])==2
for batch in dls[0]:
    assert len(batch[0])==16
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-16-315c76472cbb> in <module>
1 assert len(dls[0])==2
----> 2 for batch in dls[0]:
3 assert len(batch[0])==16
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py in __iter__(self)
107 self.before_iter()
108 self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 109 for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
110 if self.device is not None: b = to_device(b, self.device)
111 yield self.after_batch(b)
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
1083 else:
1084 del self._task_info[idx]
-> 1085 return self._process_data(data)
1086
1087 def _try_put_index(self):
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
1109 self._try_put_index()
1110 if isinstance(data, ExceptionWrapper):
-> 1111 data.reraise()
1112 return data
1113
~/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/_utils.py in reraise(self)
426 # have message field
427 raise self.exc_type(message=msg)
--> 428 raise self.exc_type(msg)
429
430
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data = next(self.dataset_iter)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 118, in create_batches
yield from map(self.do_batch, self.chunkify(res))
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 144, in do_batch
def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 143, in create_batch
def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 50, in fa_collate
else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 50, in <listcomp>
else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/data/load.py", line 49, in fa_collate
return (default_collate(t) if isinstance(b, _collate_types)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/fastai/torch_core.py", line 329, in __torch_function__
res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
File "/home/chess/TheProject/environments2/fastai_latest_2021_02_04/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
ret = func(*args, **kwargs)
RuntimeError: stack expects each tensor to be equal size, but got [135] at entry 0 and [39] at entry 1
My research suggests I need to add some padding transforms. I’ve tried adding these to the DataBlock, to no avail:
item_tfms=pad_input
batch_tfms=pad_input
Outside of this I’ve tried too many things to list here, including calling partial_dataloaders in a myriad of ways. It’s not clear to me why the standard dataloaders don’t need any extra transform code for padding while the partial dataloaders do. I tried bringing some of the tfms code from the standard dataloaders method over to partial_dataloaders, but also to no avail.
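To make it concrete, the kind of thing I’ve been attempting looks roughly like this — just a sketch, since I’m not sure partial_dataloaders even forwards before_batch down to the underlying DataLoader (that’s part of my question):
# Hypothetical: pass pad_input as a before_batch transform, which is how the regular
# classification dataloaders pad variable-length texts before collating them.
dbunch = imdb.datasets(path)
dls = dbunch.partial_dataloaders(partial_n=32, bs=16, before_batch=pad_input)
dls.show_batch(max_n=5)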
Did you ever work this out?
I was thinking of migrating my codebase from fastai1 to fastai2, but it seems that this functionality still doesn’t exist in v2.