For those who run their own AI box, or want to

fmussari · July 7, 2022, 2:39am

How it started:

Thanks to the help in this forum, I was able to setup an old GTX 1070

How it is going:

I managed to fine tuned a ‘vit_large_patch16_224’ using this GPU with only 8GB of memory

As you can see in the image I had to use a batch size of 4! and GradientAccumulation (thanks to Jeremy’s lesson 7 and live coding sessions).

Some additional observations

Overclocking your GPU is potentially risky

Power, Temperature and Time

vit_large_patch16_224

This large model took about 16:30 min per epoch.

Epoch 10 got 33:10. That happened because some models, at some times, wont train at full GPU capacity. But then it re-started using the full GPU again.

I would like to make some more (and more consistent) experiments comparing temperature, times and power consumption. by different limiting values to the power of the GPU. Limiting the power was (or is, I don’t know) a must for crypto mining. An effect can be observed in this model, but for what I have seen, for small models you can get even better drops in temperature (and consumption). There large models are eager in power.

The temperature on the GPU for this model is relatively high: about 62° / 64° C.

I limited the power to 90% at epoch 12/13 and you can see that after epoch 13, each epoch took about 30 seconds more to train. But it also limited the heat a little, which is good also for reducing fan noise.

In this plot it is clear that somehow a memory usage peak is what caused the GPU not being able to use all its power. At that point I stopped the measurement (what I regret), then, when the GPU started at full power again, I also started measuring again. It is clear the drops in temperature and power consumption.

You could count the epochs in those plots by looking at the patterns.

swinv2_base_window12_192_22k

Only to compare, here is a cooler model with power limited to 80%. I made another plot measure for this same model without limiting the power and it trained at about 2° C above the 55.74° observed in this.

resnet34

A resnet34 trains with little power consumption and very cool. I’m not even sure if the 80% power limit made a difference or not.

GPU Clock and Memory Clock

For crypto mining these values (in addition to limiting the power) are crucial to getting the most out of the GPU. Back in 2017 I was hesitant to overclock but it turned out that the GPU worked more efficient, cooler and with less noise by finding the appropriate tweak for each coin. Each coin has its own algorithm, and each one of them has its own GPU optimizations. Some algorithms were more power intensive or more unstable.

For what I have seen so far, with deep learning models is different, I haven’t found difference between different GPU or memory clocks. Like the effect I have seen with limiting the power.

Command for quering the data

The loop parameter set to 1 implies that the query is made at a 1 second frequency…

nvidia-smi --query-gpu=timestamp,power.draw,memory.total,utilization.memory,temperature.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.memory --loop=1 --format=csv --filename=filename.csv

mike.moloch · July 7, 2022, 2:49am

A 1070ti is the only GPU I have, and will have for some time, it seems reasonable. And when epoch timings get annoyingly long, I try to run stuff on Jarvislabs etc, but for most quick explorations it’s still pretty decent and I don’t know enough about overclocking/undervolting etc enough to mess with it, the stock setup seems to work for me. They’re still usable and I get unlimited time with my home machine compared to the free GPUs on Kaggle (they seem comparable performance wise.)

fmussari · August 6, 2022, 11:28pm

In May I installed fastai 2.6.3 this way.

I did search but I’d prefer to ask here what’s the best way to update that fastai installation? THX

bencoman · August 7, 2022, 2:49am

$ mamba --help

shows…

update - Updates conda packages to the latest compatible version.

so I presume the following would do it…

mamba update -c fastchan fastai nbdev

I’m not sure if the “-c fastchan” is required or the package manager remembers where it was originally installed from and defaults to update from there. I imagine the latter is true, but someone else will need to confirm that.

jeremy · August 7, 2022, 3:06am

It doesn’t, unfortunately.

jeremy · August 7, 2022, 3:07am

You can use the same commands to update as you used to install.

michael-pru · August 18, 2022, 10:01am

I’m on ubuntu 22.04.1 LTS
calling search_images_ddg() gives URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:1129)>

I read it’s something to do with SSL on ubuntu, but have no idea how to fix this

mamba list contains openssl 1.1.1q and pyopenssl 22.0.0

EDIT: hmmm, after disabling VPN, now I get to the loop inside search_images_ddg(), but get 403:Forbidden every time on data = urljson(requestUrl,data=params)

bencoman · August 20, 2022, 8:17am

You don’t mention having searched this forum and which parts of these recent similar posts were effective or not for you, so you should first try them and report. That will make it easier to answer any remaining gaps. Also please read HTAQTSW.

slowtalk · August 27, 2022, 5:11pm

For those that setup on their own Linux box I recently upgraded mine to Ubuntu 22.04.1 and ran into an issue using FileUpload where the error message was FileUpload(): object has no attribute 'data'. I believe this might be due to an updated version of ipywidgets, but am/was unsure.

A workaround I used, tested with lessons 01_intro.ipynb and 02_production.ipynb, replaces any line similar to the following:

img = PILImage.create(uploader.data[0])

with:

img = PILImage.create(uploader.value[0].content.tobytes())

It seems that FileUpload switched from returning data type to a new type called value, which is also reflected in the ipywidgets FileUpload documentation.

I searched the forums and it didn’t seem that anybody else has run into this problem (yet), which may mean it’s a specific problem with my own environment or setup. I used the vanilla pip install fastai command listed on the welcome page of the fast.ai docs page as my setup.

Pomo · September 14, 2022, 6:19am

Hello all,

I have gone back to an old project and need to install opencv with conda to continue. It would not install with
conda install -c conda-forge opencv
giving the errors shown below.

Rather than spending time to diagnose, I decided to completely uninstall conda and start fresh.

conda install anaconda-clean   # install the package anaconda clean
anaconda-clean --yes           # clean all anaconda related files and directories 
rm -rf ~/anaconda3 
rm -rf ~/.anaconda_backup 
bash '/home/malcolm/Downloads/Anaconda3-2022.05-Linux-x86_64.sh'
#Run conda init? yes

conda install -c conda-forge opencv

Installing opencv gives the same errors as originally, after an hour of grinding:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: - 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.

There followed a list of 400 packages checked one by one, for more than an hour. Then the command hung, and I interrupted it.

Things I have tried…

The various alternative installation commands for opencv listed on anaconda.org
Reading and trying stackoverflow opinion-suggestions for hours
Going far down the mamba/mambaforge rabbit hole

Things that seem strange…

The conda installation script does not download anything, even after conda clean -a. From where is it getting the packages for the base environment? Maybe some of them are stored locally and corrupted.
PyTorch installs perfectly without error.
That so straightforward a task can waste hours of one’s precious human life.
That hundreds of people have reported this problem and dozens of opinion-suggestions are given on the internets, with none of them working.

Does anyone have a solution or suggestion for moving forward?

Thank you!

bencoman · September 14, 2022, 2:24pm

Its been a while since I ran conda in my local shell,
so I first updated it (so that when you do the same, our experiences are hopefully similar. I used conda-forge channel since that was what you referenced.)

 mamba update -c conda-forge --all

Note I use mamba rather than conda since its been reported as being faster, but both should operate the same.

I then did…

mamba install -c conda-forge opencv

which completed without incident.

Possibly all bets are off if you interupt an operation part way through (though I’m hardly a conda expert)

not sure.

PyTorch installs perfectly without error.

Why is this strange?

That so straightforward a task can waste hours of one’s precious human life.
That hundreds of people have reported this problem and dozens of opinion-suggestions are given on the internets, with none of them working.

Sorry to hear. Are you blaming anyone in particular? Computers are just like that sometimes. The history of any particular person’s environment has so many possible paths that will always be always corner cases

Similarly, not many people will every get hit by lightning, but the person who is hit doesn’t have a very good day. So sorry, all I can report is that it worked for me - first time. Since I’m not experiencing your problem, I have nothing to troubleshoot. Just wanted to give you some hope.

Pomo · September 15, 2022, 1:02am

@bencoman, thanks for attempting to reproduce the problem. I appreciate your efforts.

Yes, conda listed the packages it was testing for conflicts one by one (all 400+), erased the listing, and then hung. So I don’t really know what it found, if anything.

Perhaps that is actually the underlying issue - there is no one responsible for defective error handling in the open source ecosystem. But I am reluctant to say anything critical of open source software on these forums.

Can anyone recommend software that might reveal the problem?

My next step is to try installing with mamba. However, attempting to install mamba via conda leads to exactly the same error! I will look for a way to install mamba directly.

Thanks again for trying to troubleshoot and give me some hope!

Pomo · September 16, 2022, 4:40am

I have made progress by switching to mamba which revealed the problem. Shown here in case it helps anyone else. I do not claim this is the best way, just that it worked for me.

First, uninstall conda:

conda install --yes anaconda-clean   # install the package anaconda clean
anaconda-clean --yes           # clean all anaconda related files and directories 
rm -rf ~/anaconda3 
rm -rf ~/.anaconda_backup

Next, install micromamba and initialize:

wget -qO- https://micromamba.snakepit.net/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
./bin/micromamba shell init -s bash -p ~/micromamba
source ~/.bashrc

Installing opencv with micromamba competently yields a useful error message:

micromamba install -c conda-forge opencv #opencv    #4.6.0
Encountered problems while solving:
  - nothing provides harfbuzz >=3.3.1,<4.0a0 needed by libopencv-4.5.5-py37h4958a1a_1

The problem is that the package harfbuzz with the specified version is not available from conda-forge. Indeed, a direct search of anaconda.org does not find harfbuzz with the proper version and platform. Thanks mamba, and boo for conda!

Rather than look around for a proper harfbuzz, I decided to try installing an earlier version of opencv (4.5.5). After all, the notebook was working fine two years ago.

You need to make a micromamba environment and then do the install into it:

micromamba create -n fastai  python=3.9 -y -c conda-forge
micromamba activate fastai
micromamba install -c anaconda opencv  #opencv 4.5.5, according to anaconda.org

This time the install of opencv (4.5.5) works.

Install PyTorch, fastai, and Jupyter:

micromamba install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge
micromamba install -c fastchan fastai
micromamba install -c conda-forge jupyter_core
micromamba install -c conda-forge notebook

But there is yet another problem. In the Jupyter notebook, import cv2 fails, saying something like “Cannot find shared object”. Stackoverflow gives a bunch of arcane and contradictory solutions.

At this point I am ready to scream again, but remind myself that I have made some progress at least. I decided to try removing opencv and, on a whim, reinstalling the original opencv 4.6.0.

micromamba remove opencv
micromamba install -c conda-forge opencv  #4.6.0

This time the newer opencv installs successfully! An acceptable version of harfbuzz must have been installed by some of the other packages from a different channel. Note to self and others - the order of installation matters, and missing packages may exist in other channels.

And a minor miracle - import cv2 now works within a Jupyter notebook. Don’t ask me why. I am just glad to be able to move forward, even at the cost of a whole day and massive frustration.

I sincerely hope that my misadventure helps someone else.

Malcolm

jeremy · September 16, 2022, 4:52am

You’ve made your feelings abundantly clear.

Pomo · September 16, 2022, 5:37pm

Sorry for not accepting what is and for my own part in that. I do get excessively frustrated when an infrastructure problem stands in the way of moving forward, and I can’t see how I will ever get past it. Thanks for your feedback.

jeremy · September 16, 2022, 11:26pm

Apology accepted @Pomo - understood.

AllenK · September 24, 2022, 8:25am

and even more cooling needed in Summer. I can’t run anything on GPU for long without aircon in summer. Fans and liquid cooling only go so far.

nikem · October 13, 2022, 10:47am

I have a question about GPUs. There are some specific model GPUs that have LHR (low-hash-rate) limitation for preventing crypto coin mining. Does that affect the training process?
thanks.

hannesloots · October 31, 2022, 7:14am

Hi all, II need your help to get going.

I am trying to run the examples in the first day of the course and I am getting the same error the whole time. I was hoping that you can help me work out what I am doing wrong.

I setup the environment with the following:
Ubuntu 22.04.1 LTS
Python 3.10.6 # Default in Ubuntu version
CUDA:1.7 to align with PyTorch compatibility (See below)

I used mamba and the command below to install fastai modules in

mamba create -n fast310 python=3.10.6
mamba activate fast310
mamba install -c fastchan fastai

I got stuck on “00-is-it-a-bird-creating-a-model-from-your-own-data”

learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)

Here is the full output:
/home/hannes/mambaforge/envs/fast310/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter ‘pretrained’ is deprecated since 0.13 and may be removed in the future, please use ‘weights’ instead.
warnings.warn(
/home/hannes/mambaforge/envs/fast310/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for ‘weights’ are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet18_Weights.IMAGENET1K_V1. You can also use weights=ResNet18_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
Downloading: “https://download.pytorch.org/models/resnet18-f37072fd.pth” to /home/hannes/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████████████████████████████████| 44.7M/44.7M [00:03<00:00, 12.6MB/s]

0.00% [0/1 00:00<?]
epoch train_loss valid_loss error_rate time

7.14% [1/14 00:01<00:22]

TypeError Traceback (most recent call last)
Cell In [12], line 2
1 learn = vision_learner(dls, resnet18, metrics=error_rate)
----> 2 learn.fine_tune(3)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/callback/schedule.py:165, in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
163 “Fine tune with Learner.freeze for freeze_epochs, then with Learner.unfreeze for epochs, using discriminative LR.”
164 self.freeze()
→ 165 self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
166 base_lr /= 2
167 self.unfreeze()

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/callback/schedule.py:119, in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt, start_epoch)
116 lr_max = np.array([h[‘lr’] for h in self.opt.hypers])
117 scheds = {‘lr’: combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
118 ‘mom’: combined_cos(pct_start, *(self.moms if moms is None else moms))}
→ 119 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:256, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
254 self.opt.set_hypers(lr=self.lr if lr is None else lr)
255 self.n_epoch = n_epoch
→ 256 self._with_events(self._do_fit, ‘fit’, CancelFitException, self._end_cleanup)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:193, in Learner.with_events(self, f, event_type, ex, final)
192 def with_events(self, f, event_type, ex, final=noop):
→ 193 try: self(f’before{event_type}'); f()
194 except ex: self(f’after_cancel{event_type}‘)
195 self(f’after_{event_type}’); final()

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:245, in Learner._do_fit(self)
243 for epoch in range(self.n_epoch):
244 self.epoch=epoch
→ 245 self._with_events(self._do_epoch, ‘epoch’, CancelEpochException)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:193, in Learner.with_events(self, f, event_type, ex, final)
192 def with_events(self, f, event_type, ex, final=noop):
→ 193 try: self(f’before{event_type}'); f()
194 except ex: self(f’after_cancel{event_type}‘)
195 self(f’after_{event_type}’); final()

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:239, in Learner._do_epoch(self)
238 def _do_epoch(self):
→ 239 self._do_epoch_train()
240 self._do_epoch_validate()

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:231, in Learner._do_epoch_train(self)
229 def _do_epoch_train(self):
230 self.dl = self.dls.train
→ 231 self._with_events(self.all_batches, ‘train’, CancelTrainException)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:193, in Learner.with_events(self, f, event_type, ex, final)
192 def with_events(self, f, event_type, ex, final=noop):
→ 193 try: self(f’before{event_type}'); f()
194 except ex: self(f’after_cancel{event_type}‘)
195 self(f’after_{event_type}’); final()

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:199, in Learner.all_batches(self)
197 def all_batches(self):
198 self.n_iter = len(self.dl)
→ 199 for o in enumerate(self.dl): self.one_batch(*o)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:227, in Learner.one_batch(self, i, b)
225 b = self._set_device(b)
226 self._split(b)
→ 227 self._with_events(self._do_one_batch, ‘batch’, CancelBatchException)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:195, in Learner.with_events(self, f, event_type, ex, final)
193 try: self(f’before{event_type}‘); f()
194 except ex: self(f’after_cancel_{event_type}’)
→ 195 self(f’after_{event_type}'); final()

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:171, in Learner.call(self, event_name)
→ 171 def call(self, event_name): L(event_name).map(self._call_one)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastcore/foundation.py:156, in L.map(self, f, gen, *args, **kwargs)
→ 156 def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastcore/basics.py:840, in map_ex(iterable, f, gen, *args, **kwargs)
838 res = map(g, iterable)
839 if gen: return res
→ 840 return list(res)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastcore/basics.py:825, in bind.call(self, *args, **kwargs)
823 if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
824 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
→ 825 return self.func(*fargs, **kwargs)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/learner.py:175, in Learner._call_one(self, event_name)
173 def _call_one(self, event_name):
174 if not hasattr(event, event_name): raise Exception(f’missing {event_name}')
→ 175 for cb in self.cbs.sorted(‘order’): cb(event_name)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/callback/core.py:62, in Callback.call(self, event_name)
60 try: res = getcallable(self, event_name)()
61 except (CancelBatchException, CancelBackwardException, CancelEpochException, CancelFitException, CancelStepException, CancelTrainException, CancelValidException): raise
—> 62 except Exception as e: raise modify_exception(e, f’Exception occured in {self.__class__.__name__} when calling event {event_name}:\n\t{e.args[0]}', replace=True)
63 if event_name==‘after_fit’: self.run=True #Reset self.run to True at each end of fit
64 return res

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/callback/core.py:60, in Callback.call(self, event_name)
58 res = None
59 if self.run and _run:
—> 60 try: res = getcallable(self, event_name)()
61 except (CancelBatchException, CancelBackwardException, CancelEpochException, CancelFitException, CancelStepException, CancelTrainException, CancelValidException): raise
62 except Exception as e: raise modify_exception(e, f’Exception occured in {self.__class__.__name__} when calling event {event_name}:\n\t{e.args[0]}', replace=True)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/callback/progress.py:33, in ProgressCallback.after_batch(self)
31 def after_batch(self):
32 self.pbar.update(self.iter+1)
—> 33 if hasattr(self, ‘smooth_loss’): self.pbar.comment = f’{self.smooth_loss:.4f}’

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/torch/_tensor.py:855, in Tensor.format(self, format_spec)
853 def format(self, format_spec):
854 if has_torch_function_unary(self):
→ 855 return handle_torch_function(Tensor.format, (self,), self, format_spec)
856 if self.dim() == 0 and not self.is_meta and type(self) is Tensor:
857 return self.item().format(format_spec)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/torch/overrides.py:1534, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
1528 warnings.warn("Defining your __torch_function__ as a plain method is deprecated and " 1529 "will be an error in future, please define it as a classmethod.", 1530 DeprecationWarning) 1532 # Use public_apiinstead ofimplementation` so torch_function
1533 # implementations can do equality/identity comparisons.
→ 1534 result = torch_func_method(public_api, types, args, kwargs)
1536 if result is not NotImplemented:
1537 return result

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/fastai/torch_core.py:376, in TensorBase.torch_function(cls, func, types, args, kwargs)
374 if cls.debug and func.name not in (‘str’,‘repr’): print(func, types, args, kwargs)
375 if _torch_handled(args, cls._opt, func): types = (torch.Tensor,)
→ 376 res = super().torch_function(func, types, args, ifnone(kwargs, {}))
377 dict_objs = _find_args(args) if args else _find_args(list(kwargs.values()))
378 if issubclass(type(res),TensorBase) and dict_objs: res.set_meta(dict_objs[0],as_copy=True)

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/torch/_tensor.py:1278, in Tensor.torch_function(cls, func, types, args, kwargs)
1275 return NotImplemented
1277 with _C.DisableTorchFunction():
→ 1278 ret = func(*args, **kwargs)
1279 if func in get_default_nowrap_functions():
1280 return ret

File ~/mambaforge/envs/fast310/lib/python3.10/site-packages/torch/_tensor.py:858, in Tensor.format(self, format_spec)
856 if self.dim() == 0 and not self.is_meta and type(self) is Tensor:
857 return self.item().format(format_spec)
→ 858 return object.format(self, format_spec)

TypeError: Exception occured in ProgressCallback when calling event after_batch:
unsupported format string passed to TensorBase.format

bencoman · October 31, 2022, 3:21pm

Jeremy previously advised that unless you are experienced with mamba environments and have a very specific need for them, you should avoid createing sub-environments and just operate in the base environment.

The Default System Python version is irrelevant as it is separate from the version used in the mamba-base-environment. I recently reused my Local Jupyter From Scratch post on a new Ubuntu.22.04 install. So if you can, wipe your mamba install and try this procedure. One of the key things is using the fastsetup/master/setup-conda.sh script.

btw, I’d like to append the following to that post, except the post seems locked for some reason.

$ mamba install -c conda-forge jupyterlab
$ jupyter lab