Lesson 4 official topic

FYI I have a 3090 and have no such problems. You shouldn't need to install CUDA separately at all - what you need is packaged automatically with PyTorch via conda.

(I'm not mentioning this to suggest you should change anything, but to discourage other folks reading this from doing anything other than just installing the conda package.)

2 Likes

No, I don't recommend using an environment. If you just mamba install -c fastchan fastai then pip install fastbook, it should install everything you need.

1 Like

distilroberta-base uses a different way of tokenizing called BPE (Byte Pair Encoding), so the special characters you see are not an error. A short explanation from a GitHub user:

TL;DR: this is how byte-level BPE works. The main advantages are:

Smaller vocabularies
No unknown token

This is totally expected behavior. The byte-level BPE converts all the Unicode code points into multiple byte-level characters:

Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for other UTF-8 code points)
Each byte value gets a "visible" character assigned to it from the beginning of the Unicode table. This is especially important because there are a lot of control characters, so we can't just have a simple mapping ASCII table character <-> byte value. So some characters get other representations; for example, the white space U+0020 becomes Ġ.

The purpose is that, by doing so, you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that won't ever need an "unknown" token.
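
You can see this directly with the transformers library (a minimal sketch; the example text is arbitrary):

from transformers import AutoTokenizer

# distilroberta-base uses a byte-level BPE vocabulary
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

# 'Ġ' is the visible character standing in for the space byte (U+0020)
print(tokenizer.tokenize('Hello world'))
# something like: ['Hello', 'Ġworld']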

5 Likes

Since NLP has evolved so fast in recent years, it's tough to include it within the fastai lectures, especially in the first part of the course. There are just so many unknowns - transformer model architecture, tokenization, transformers library basics - that are tough to fit into a single lesson. The HF course also comes with prepared Colab notebooks, gives some more insight into tokenization and building simple pipelines, and is good for just playing around with the code. There's also a great book called "Natural Language Processing with Transformers", though I wouldn't recommend it at the very beginning.

5 Likes

Q: We talk about how we don't want to overfit to our data and how we want to build a model that can generalise beyond our dataset, but are there cases where we might actually WANT our model to overfit? Let's imagine we have a static set of data that we'd gathered (some documents, say). If we were fine-tuning a language model to make natural language queries against those documents, would it be OK in that scenario to just train and train and train and not be bothered about overfitting - perhaps with no need to split the data at all? In other words: I could see some cases where just memorising the data in a useful way might be beneficial.

I think the only issue might arise from using an NVIDIA driver that doesn't support the needed CUDA version. Then conda might install a PyTorch + CUDA combination that isn't really supported on 3090s. I have them as well, and have had no problems with the standard installation procedure via conda.

@jeremy I think that I'm missing something here - please have a look at the error below. I think some other people have had the same issue, which hasn't been answered on StackOverflow here.

Do you have any ideas about what's wrong here?

mamba install -c fastchan fastbook

        mamba (0.22.1) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack


Looking for: ['fastbook']

conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
pkgs/r/linux-64                                               No change
pkgs/main/linux-64                                            No change
pkgs/main/noarch                                              No change
pkgs/r/noarch                                                 No change
fastchan/linux-64                                             No change
fastchan/noarch                                               No change

Pinned packages:
  - python 3.10.*


Encountered problems while solving:
  - nothing provides requested fastbook

Yeah sorry I got a bit ahead of myself - it needs to be updated for the new stuff from yesterday's lecture still. I'll post a reply here when it's ready.

This is unrelated to the StackOverflow issue - the reply there is correct.

(Frankly, getting that set up hasn't been a priority, since using your own machine is an advanced topic and this is a beginner course, so making sure that the recommended platforms (Kaggle, Colab, and Gradient) work well has been my focus. If you're not totally comfortable setting up all the needed libs yourself by studying the package dependencies, then I strongly recommend not setting up your own machine. It's a huge distraction and has minimal benefit compared to using Gradient or Kaggle.)

4 Likes

Ok, thank you for the confirmation. I've been pulling my hair out, questioning what I did wrong :wink:

Anyway, I think that for transparency (in case other people run into the same issue), it is good to keep the topic open.

I couldn't do the standard install via mamba (as described above), so I did a manual installation and concluded that we needed the env file until this was fixed. If anyone is interested in my environment.yml, please PM me.

A bulletproof way that I always rely on when I need to reinstall a fresh fastai env (full commands sketched after the list):

  1. Create a new conda env with Python 3.7 or 3.8
  2. Use conda to install the proper PyTorch version (Start Locally | PyTorch)
    → in my case: conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
  3. Then just pip install fastai, or pip install -r requirements.txt in the case of fastbook
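
For example, the full sequence might look like this (a sketch - the env name and the Python/CUDA versions are illustrative, so match them to your driver):

conda create -n fastai python=3.8
conda activate fastai
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install fastai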

ps: I haven't tried mamba yet; the conda method might be slower at times :confused:

1 Like

I think 'Localhost' is not a good idea, because it makes the cell below run differently.


If you use 'Localhost', it makes the if statement True, but it should be False on a local computer. The same logic applies to the cell below:

It is supposed to work on a local computer, but yours does not, for the same reason.
I hope it helps.
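
For reference, the check in the notebook is just this (a minimal sketch; on Kaggle the KAGGLE_KERNEL_RUN_TYPE environment variable is set, e.g. to 'Interactive' or 'Batch', while locally it is unset, so the '' default makes the condition False):

import os

# set on Kaggle kernels, unset on a local machine
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle:
    # Kaggle-only setup goes here
    ...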

2 Likes

Ok, but there are a few missing steps for me to draw this owl. Maybe let's start with just the datasets package installation in Jeremy's notebook.

When I run the cell:

the output after running this cell has no links to whl files, which is the next step in the "how to download pip installers" post.

How do I download the datasets package so I can upload it? Am I downloading to my local machine and then loading it into Kaggle? I'm accustomed to installing a package in a Python environment and having it still be there when I come back to that environment. That doesn't appear to happen when you install a package in a Kaggle notebook in a way that makes the kernel work offline. Do I follow the process suggested here?

I see this in the notebook you posted the link to (thanks for that! It helped me investigate further):

UPDATE Sept 14, 2020: install no longer shows the whl file. Instead use !pip download [package_name], then you will get the whl file.

Also, I tried the above suggestion (!pip download datasets) in a Kaggle notebook, and I can see the files downloaded to my current directory on Kaggle (when I do !ls *.whl in a cell). If you have a local install of pip on your laptop, I would say download there instead of downloading in Kaggle, copying down to your laptop, and then uploading to Kaggle again - this way you avoid the extra round trip and save the .whl files directly on your machine.
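
In a Kaggle notebook cell, the download step would look something like this (a sketch):

# download the datasets wheels (plus dependencies) into the working directory
!pip download datasets
# confirm the .whl files landed here
!ls *.whl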

HTH

1 Like

Thanks for letting me know @nikem. When running on my machine, do I leave it empty as default, or should I have something similar to Localhost in there?
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

1 Like

Thanks Tanishq! It worked like magic!

1 Like

Just leave it empty. We'll let you know in the notebooks if you need to change things to make them work.

1 Like

It's ready now, although the directions are different to what I first said. This should work (I just tested it in a fresh env):

mamba install -c fastchan fastai
pip install fastbook
3 Likes

I had the same problem, and this is what fixed it for me:

Click on Add Data in the upper right corner.
Then click on "search datasets".
Type "deberta-v3-small model".
Click "add".
The model will show under the Input folder on the right.

Then run the cell:
model_nm = '…/input/debertav3small'
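
Once the dataset is attached, the transformers loaders can read the model straight from that folder (a sketch - the exact path must match the dataset's folder name under Kaggle's ../input directory):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# illustrative path; check the Input panel for the real folder name
model_nm = '../input/debertav3small'
tokenizer = AutoTokenizer.from_pretrained(model_nm)
model = AutoModelForSequenceClassification.from_pretrained(model_nm)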

Hopefully that answers it, but sorry if I have misunderstood your question. Thank you to Tanishq for helping me initially.

4 Likes

Will leave it blank :+1:. Thank you Jeremy.

On my local machine I executed
pip download datasets -d <folder_path>
where <folder_path> is the folder you want the datasets package downloaded to.
Then from that folder I uploaded the files to Kaggle Datasets and added that dataset to the notebook. Then, when installing with pip in the notebook, you need to give the path to these whl files:
pip install --no-index --find-links ../input/huggingface-datasets datasets -q
In my case the path to the whl files is ../input/huggingface-datasets.

2 Likes