FYI I have a 3090 and have no such problems. You shouldn't need to install CUDA separately at all - what you need is packaged automatically with PyTorch via conda.
(I'm not mentioning this to suggest you should change anything, but to discourage other folks reading this from doing anything other than just installing the conda package.)
No, I don't recommend using an environment. If you just mamba install -c fastchan fastai and then pip install fastbook, it should install everything you need.
distilroberta-base uses a different way of tokenizing called BPE (Byte Pair Encoding), so the special characters you see are not an error. A short explanation from a GitHub user:
TL;DR: This is how byte-level BPE works. The main advantages are:
Smaller vocabularies
No unknown token
This is totally expected behavior. The byte-level BPE converts all the Unicode code points into multiple byte-level characters:
Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for UTF-8 Unicode code points)
Each byte value gets a 'visible' character assigned to it from the beginning of the Unicode table. This is especially important because there are a lot of control characters, so we can't just have a simple one-to-one mapping between ASCII table characters and byte values. Some characters therefore get other representations; for example, the white space U+0020 becomes Ġ.
By doing this, you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that won't ever need an 'unknown' token.
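That byte-to-visible-character mapping can be sketched in a few lines of plain Python. This is a simplified reimplementation of the bytes_to_unicode helper from the GPT-2 reference code (which RoBERTa-style tokenizers reuse), shown here only to illustrate why a space shows up as Ġ - it is not how you would tokenize in practice:

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a visible Unicode character.

    Printable bytes keep their own character; control characters,
    the space, etc. are shifted into unused code points above 255.
    """
    keep = (list(range(ord("!"), ord("~") + 1))          # printable ASCII
            + list(range(ord("\xa1"), ord("\xac") + 1))  # printable Latin-1
            + list(range(ord("\xae"), ord("\xff") + 1)))
    chars = keep[:]
    n = 0
    for b in range(256):
        if b not in keep:
            keep.append(b)
            chars.append(256 + n)  # remap to the next unused code point
            n += 1
    return dict(zip(keep, map(chr, chars)))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # the space byte becomes 'Ġ'

# Each Unicode code point is first decomposed into UTF-8 bytes
# (1 for ASCII, up to 4 for other characters), then each byte is
# made visible via the mapping:
print("".join(mapping[b] for b in "hello world".encode("utf-8")))  # helloĠworld
```

Since the initial alphabet covers all 256 byte values, any input string at all can be represented, which is why there is never an unknown token.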
Since NLP has evolved so fast in recent years, it's tough to try to include it within the fastai lectures, especially in the first part of the course. There are just so many unknowns - transformer model architectures, tokenization, transformers library basics - that are tough to fit into a single lesson. The HF course also comes with prepared Colab notebooks, gives some more insight into tokenization and building simple pipelines, and is good for just playing around with the code. There's also a great book called "Natural Language Processing with Transformers", though I wouldn't recommend it at the beginning.
Q: We talk about how we don't want to overfit to our data + that we want to build a model that can generalise beyond our dataset, but are there cases where we might actually WANT our model to overfit to our data? Let's imagine we have a static set of data that we'd gathered (some documents, let's say). If we were finetuning a language model that we could use to make natural language queries on the documents, would it be ok in that scenario to just train and train and train and not be bothered about overfitting? i.e. perhaps even no need to split data? In other words: I could see some cases where just memorising the data in a useful way might be beneficial.
I think the only issue might arise from using an NVIDIA driver that doesn't support the needed CUDA version. Then conda might install a PyTorch + CUDA combination that's not really supported on 3090s. I have one as well, and have had no problems with the standard installation procedure via conda.
@jeremy I think that I'm missing something here - please have a look at the error below. I think some other people had the same issue, which hasn't been answered on Stack Overflow here.
Yeah sorry I got a bit ahead of myself - it still needs to be updated for the new stuff from yesterday's lecture. I'll post a reply here when it's ready.
This is unrelated to the Stack Overflow issue - the reply there is correct.
(Frankly getting that set up hasn't been a priority, since using your own machine is an advanced topic and this is a beginner course, so making sure that the recommended platforms (Kaggle, Colab, and Gradient) work well has been my focus. If you're not totally comfortable setting up all the needed libs yourself by studying the package dependencies, then I strongly recommend not setting up your own machine. It's a huge distraction and has minimal benefit compared to using Gradient or Kaggle.)
Ok, thank you for the confirmation. I've been pulling my hair out wondering what I did wrong.
Anyway, I think that for transparency (in case other people run into the same issue), it is good to keep the topic open.
I couldn't do the standard install via mamba (as described above), so I did a manual installation and concluded that we need the env file until this is fixed. If anyone is interested in my environment.yml, please PM me.
A bulletproof way that I always rely on when I need to reinstall a fresh fastai env is:
Create a new conda env with Python 3.7 or 3.8
Use conda to install the proper PyTorch version (see Start Locally | PyTorch)
→ in my case: conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Then just pip install fastai, or pip install -r requirements.txt in the case of fastbook
PS: I haven't tried mamba yet; the conda method might be slower at times.
Ok, but there are a few missing steps for me to draw this owl. Maybe let's start with just the datasets package installation in Jeremy's notebook.
The output after running this cell has no links to .whl files, which is the next step in the "how to download pip installers" post.
How do I download the datasets package so I can upload it? Am I downloading it to my local machine and then loading it into Kaggle? I'm accustomed to installing a package in a Python environment and having it still be there when I come back to that environment. This doesn't appear to happen when you install a package in a Kaggle notebook in a way that makes the kernel work offline. Do I follow a process as suggested here?
I see this in the notebook you posted the link to (thanks for that! It helped me investigate further):
UPDATE Sept 14, 2020: install no longer shows the .whl file. Instead use !pip download [package_name], then you will get the .whl file.
Also, I tried the above suggestion (!pip download datasets) in a Kaggle notebook, and I can see the files downloaded to my current directory on Kaggle (when I run !ls *.whl in a cell). If you have a local install of pip on your laptop, I would say download there instead of downloading in Kaggle, copying down to your laptop, and then uploading to Kaggle again. This way you avoid having to download from Kaggle to your machine and can just save the .whl files directly on your machine.
Thanks for letting me know @nikem. When running on my machine, do I leave it empty as the default, or should I have something like Localhost in there?
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
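For reference, you shouldn't need to set anything locally: Kaggle defines KAGGLE_KERNEL_RUN_TYPE inside its notebooks (e.g. to Interactive or Batch, as far as I've seen), while on your own machine the variable simply doesn't exist, so os.environ.get falls back to the empty string and iskaggle is falsy. A minimal sketch of how that line behaves:

```python
import os

# On Kaggle this variable is set by the platform; locally it is absent,
# so the default "" is returned and the branch below reports "local".
iskaggle = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", "")

if iskaggle:
    print(f"Running on Kaggle ({iskaggle})")
else:
    print("Running locally")
```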
On my local machine I executed pip download datasets -d <folder_path>
(<folder_path> is the folder you want the datasets package downloaded to).
Then I uploaded the files from that folder to Kaggle Datasets and added that dataset to the notebook. When installing with pip in the notebook, you need to give the path to these .whl files: pip install --no-index --find-links ../input/huggingface-datasets datasets -q
In my case the path to the .whl files is ../input/huggingface-datasets