Lesson 4 official topic

Thanks for posting this. I attempted to submit from Jeremy’s notebook 10 days ago and couldn’t figure it out.

Did you also follow the steps in the link Jeremy recommended? Severstal: Steel Defect Detection | Kaggle.

I tried copying Jeremy’s notebook and making the changes I could see from yours, but then received an error. After copying your notebook instead, it ran offline and I was able to submit and get on the leaderboard - finally! Is there something I’m missing? Jeremy’s notebook has nothing in the input folder, but yours has debertav3small and housing. Did you follow the post above to get those files into your input folder?

Can you provide the exact steps you followed for a kaggle newbie please? :slight_smile:


There are three places where Jeremy’s notebook uses the internet:

  • installing the datasets package
  • downloading the deberta-v3-small model
  • downloading the housing dataset

So to make it work offline, I downloaded the datasets package and its dependencies using pip, the deberta-v3-small model from the Hugging Face Hub, and the housing dataset from the internet (pre-processing it as per Jeremy’s notebook). Then I uploaded all of the above as Kaggle datasets and included the datasets in the notebook.
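The steps above can be sketched roughly as follows. This is an illustrative outline, not the poster’s exact commands: the directory names are made up, and the `huggingface-cli download` step assumes a recent `huggingface_hub` install.

```shell
# 1. Download the datasets package plus its dependencies as local files
#    (directory name is illustrative):
mkdir -p offline-pkgs
pip download datasets -d offline-pkgs

# 2. Download the deberta-v3-small model from the Hugging Face Hub
#    (assumes the huggingface_hub CLI is installed):
huggingface-cli download microsoft/deberta-v3-small --local-dir deberta-v3-small

# 3. After uploading offline-pkgs/ and the model folder as Kaggle datasets
#    and attaching them to the notebook, install with internet disabled:
pip install --no-index --find-links=/kaggle/input/offline-pkgs datasets
```

The key idea is that `--no-index` stops pip from contacting PyPI, so everything must resolve from the attached Kaggle dataset.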

Changing it from preds to preds.flatten() should fix it…
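For anyone wondering why the flatten is needed: prediction helpers often return a column vector of shape (n_samples, 1), while a submission column wants a flat 1-D array. A small illustration (the values here are made up):

```python
import numpy as np

# Predictions as typically returned: one column per sample -> shape (3, 1)
preds = np.array([[0.1], [0.7], [0.3]])

# flatten() collapses it to a 1-D array of length 3, which is what a
# submission DataFrame column expects
flat = preds.flatten()
print(flat.shape)  # (3,)
```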


You should be able to use mamba install -c fastchan transformers or pip install transformers if you’re on Linux or WSL.

FYI I added a comment to your kaggle notebook last week asking if you can provide more info about how you got this set up – I’m sure people would find it helpful to understand!


Initial outline here:


Thanks - yeah, it looks like sentencepiece 0.1.86 was already installed: running pip install sentencepiece 0.1.96, restarting the kernel, then rerunning the import worked. I tried proceeding without restarting per Jeremy’s suggestion, but it seemed to need a restart.

Finally caught up with the lesson today. As Jeremy mentioned, it’s all new content, not quite 1-to-1 mappable to the book. Some thoughts on the lecture:

  • Get introduced to a different library (e.g. Hugging Face) and play around with it
  • Notice how the API might feel a bit different, but the core concepts stay the same
  • The concept of pre-training a language model on unlabelled training data (via next word prediction, masked word prediction, etc.)
  • Taking that language model and then fine-tuning it on specific labelled tasks (e.g. classification)
  • Thinking through the process of reshaping a non-classification problem into a classification problem: unfamiliar problem category → familiar problem category
  • Revisit training, validation & test sets, especially validation set design & test set separation
  • Importance of visual exploration of data, metrics, etc.
  • The “Text Preprocessing” part of Chapter 10 explains more on Tokenisation & Numericalization and should still be conceptually relevant for today’s lecture
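To make the tokenisation/numericalisation point above concrete, here is a deliberately toy sketch: real libraries (Hugging Face, fastai) use subword tokenizers and special tokens, but the text → tokens → integer ids pipeline is the same idea.

```python
def tokenize(text):
    # Naive whitespace tokenizer - purely illustrative; real tokenizers
    # use subword units (BPE, SentencePiece, etc.)
    return text.lower().split()

def build_vocab(texts):
    # Map each unique token to an integer id; reserve 0 for unknown tokens
    vocab = {"<unk>": 0}
    for t in texts:
        for tok in tokenize(t):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    # Numericalisation: replace each token with its vocabulary id
    return [vocab.get(tok, 0) for tok in tokenize(text)]

texts = ["the cat sat", "the dog sat"]
vocab = build_vocab(texts)
print(numericalize("the cat ran", vocab))  # unseen word "ran" maps to 0
```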

Also, the transformer arch. itself is fun to study, but maybe avoid it at this stage. There’ll be plenty of time later.


I’d suggest reading ch10 of the book (and the chapters before that), and running the “clean” version of the notebook as discussed in the previous lesson (and in “lesson 0”).

(NB: Info about what book chapters are covered is in the first post of each lesson thread.)


That did not work for my setup. I tried it last week and it complained about not having the correct SSL 1.0 library. I tried many things to solve the problem and unfortunately didn’t save the exact error message, but it was one of the libssl 1.0.0.so files that could not be found. When I rebuilt the transformers library (using the latest version), it worked. I’m using CUDA 11 and an RTX 3090 GPU.

Are you using conda? Stuff like SSL libs should all be handled automatically by package dependencies in conda. (Although once you start building your own libs or using pip installers this can break – so best to just use conda/mamba as much as possible.)

True, and yes. PyTorch complains a lot about the RTX 3090 unless you install it according to their CUDA 11 instructions. When you do that, there are some incompatibilities that must be dealt with, so some of the libraries in my conda install may be different. The SSL version it installs is 1.0.0.x, which is a different version than the one the conda transformers package uses. fast.ai, though, seems to work fine. Probably best not to rebuild anything unless you’re in a similar situation.

I wonder if anybody else is finding the official topic a bit unwieldy. When I’m listening to the lesson and paying attention, it becomes hard to find questions that others have asked, because they get buried in discussion and commentary. I think the discussions are worthwhile, but perhaps there’s a better way to organise things so people can easily find the questions being asked without losing track of the lesson, and upvote them so Jeremy can address those questions during the session.


It might be a start-of-word indicator issue, possibly a keyboard language mismatch. You can always put %debug at the top of the cell; this opens the debugger after the error, and you can then inspect the variables (when it opens, there is a help command listing the instructions). It could also be that one transformer handles encoded/unencoded text (UTF-8 etc.) better than the other.


A dedicated tool for live Q&A like Slido would indeed be nice, but on the other hand, the advantage of the forum is that other participants can also answer questions, even after the lesson has ended. So not sure how one would best tackle that problem.

@jeremy wouldn’t it be beneficial to include an env file for an easy local setup (I don’t know much about conda/mamba setup, but requirements.txt works well for Python in general), with instructions to install, e.g.

conda env create -f environment.yml
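For reference, such a file might look something like this. This is purely a hypothetical sketch of the suggestion, not an official course file, and the channel/package choices are my guesses:

```yaml
# environment.yml (hypothetical example)
name: fastai-course
channels:
  - fastchan
dependencies:
  - python=3.10
  - fastai
  - pip
  - pip:
      - fastbook
```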

Was about to say the same thing.

FYI I have a 3090 and have no such problems. You shouldn’t need to install CUDA separately at all - what you need is packaged automatically with PyTorch via conda.

(I’m not mentioning this to suggest you should change anything, but to discourage other folks reading this from doing anything other than just installing the conda package.)


No I don’t recommend using an environment. If you just mamba install -c fastchan fastai then pip install fastbook it should install everything you need.


distilroberta-base uses another way of tokenizing, called BPE (Byte Pair Encoding), so the special characters you see are not an error. A short explanation from a GitHub user:

TLDR; This is how the byte-level BPE works. Main advantages are:

  • Smaller vocabularies
  • No unknown token

This is totally expected behavior. The byte-level BPE converts all the Unicode code points into multiple byte-level characters:

  • Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for UTF-8 Unicode code points)
  • Each byte value gets a “visible” character assigned to it from the beginning of the Unicode table. This is especially important because there are a lot of control characters, so we can’t just have a simple mapping of ASCII-table character <-> byte value. So some characters get other representations; for example, the white space U+0020 becomes Ġ.

The purpose is that, by doing so, you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that won’t ever need an “unknown” token.