I tried copying Jeremy’s notebook and making the changes I could see from yours, but then received an error. Copying your notebook, however, ran offline, and I was able to submit it and get on the leaderboard - finally! Is there something I’m missing? Jeremy’s notebook has nothing in the input folder, but yours has debertav3small and housing. Did you follow the post above to get those files into your input folder?
Can you provide the exact steps you followed for a kaggle newbie please?
There are three places where Jeremy’s notebook uses the internet:
install datasets package
download deberta-v3-small model
download housing dataset
So to make it work offline, I downloaded the datasets package and its dependencies using pip, the deberta-v3-small model from the Hugging Face hub, and the housing dataset from the internet, pre-processing it as per Jeremy’s notebook. Then I uploaded all of the above as Kaggle datasets and included those datasets in the notebook.
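For reference, the offline workflow described above can be sketched roughly like this (the directory names and the exact Kaggle dataset layout are my own placeholders, not necessarily what was used):

```shell
# On a machine with internet: fetch the datasets package plus its
# dependencies as wheel files into a local directory
pip download datasets -d ./datasets-wheels

# Fetch the model files from the Hugging Face hub, e.g. via git-lfs
git lfs install
git clone https://huggingface.co/microsoft/deberta-v3-small

# Upload both directories as Kaggle datasets and attach them to the
# notebook; then, inside the offline notebook, install from the
# attached dataset instead of PyPI:
pip install --no-index --find-links=/kaggle/input/datasets-wheels datasets
```

In the notebook you would then point the model-loading code at the attached deberta-v3-small folder (e.g. `/kaggle/input/...`) rather than the hub model name, so nothing is downloaded at run time.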
Thanks, yeah, it looks like sentencepiece 0.1.86 was already installed: running pip install for sentencepiece 0.1.96, restarting the kernel, then rerunning the import worked. I tried proceeding without restarting, per Jeremy’s suggestion, but it seemed to need a restart.
For my setup that did not work. I did this last week and it complained about not having the correct SSL 1.0 library. I tried many things to solve the problem and unfortunately didn’t save the exact error message, but it was one of the libssl.so.1.0.0 files that could not be found. When I rebuilt the transformers library (using the latest version), it worked. I’m using CUDA 11 and an RTX 3090 GPU.
Are you using conda? Stuff like SSL libs should all be handled automatically by package dependencies in conda. (Although once you start building your own libs or using pip installers this can break – so best to just use conda/mamba as much as possible.)
True, and yes. PyTorch complains a lot about the RTX 3090 unless you install it according to their CUDA 11 instructions. When you do that, there are some incompatibilities that must be dealt with, so some of the libraries in my conda install may be different. The SSL version it installs is 1.0.0.x, which is a different version than the one the conda transformers package uses. fastai, though, seems to work fine. It’s probably best not to rebuild anything unless you’re in a similar situation.
I wonder if anybody else is finding the official topic a bit unwieldy. When I’m listening to the lesson and paying attention, it becomes hard for me to find questions that others have asked, because they get buried in discussion and commentary. I think the discussions are worthwhile, but perhaps there’s a better way to organise things so people can easily find the questions being asked without losing track of the lesson, and upvote them so Jeremy can address those questions during the session.
It might be a start-of-word-indicator issue, possibly a keyboard language mismatch. You can always do %debug at the top of the cell: this will open the debugger after the error, and you can then inspect the variables in debug; when it opens, there is help on the available instructions. It could also be that one transformer is handling encoded/unencoded text better than the other (UTF-8 etc.).
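Outside a notebook, the same post-mortem inspection that %debug gives you can be done by walking the traceback object; the failing `tokenize` function below is just a made-up example to show the idea:

```python
import sys

def tokenize(text):
    # hypothetical failing step: chokes on non-ASCII input
    return text.encode("ascii")

try:
    tokenize("café")
except UnicodeEncodeError:
    # %debug in a notebook drops you into the debugger at the innermost
    # frame; programmatically, the same frame locals are reachable via
    # the traceback object from the exception:
    tb = sys.exc_info()[2]
    while tb.tb_next:          # walk down to the frame that raised
        tb = tb.tb_next
    print(tb.tb_frame.f_locals["text"])  # prints the offending value: café
```

Inspecting the raw bytes of the offending string (`text.encode("utf-8")`) from inside the debugger is often the quickest way to spot an encoding mismatch like the one suspected above.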
A dedicated tool for live Q&A like Slido would indeed be nice, but on the other hand, the advantage of the forum is that other participants can also answer questions, even after the lesson has ended. So I’m not sure how one would best tackle that problem.
@jeremy wouldn’t it be beneficial to include an env file for an easy local setup (I don’t know much about conda/mamba setup, but requirements.txt works well for Python in general) with instructions to install, e.g.
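Something like a minimal conda/mamba environment file; the package list here is only my guess at what the course needs, not an official spec:

```shell
# environment.yml (illustrative only) - create the env with:
#   mamba env create -f environment.yml
cat > environment.yml <<'EOF'
name: fastai-course
channels:
  - pytorch
  - fastai
  - conda-forge
dependencies:
  - python=3.10
  - pytorch
  - fastai
  - transformers
  - datasets
EOF
mamba env create -f environment.yml
mamba activate fastai-course
```

The conda equivalent of requirements.txt is this environment.yml; `mamba env export` can regenerate it from a working setup so others get the exact pinned versions.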
distilroberta-base uses a different way of tokenizing called BPE (Byte Pair Encoding), so the special characters you see are not an error. A short explanation from a GitHub user:
TL;DR: this is how byte-level BPE works. The main advantages are:
No unknown token
This is totally expected behavior. The byte-level BPE converts all the Unicode code points into multiple byte-level characters:
Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for UTF-8 Unicode code points)
Each byte value gets a “visible” character assigned to it from the beginning of the Unicode table. This is especially important because there are a lot of control characters, so we can’t just have a simple one-to-one mapping of ASCII character <-> byte value. So some characters get other representations; for example, the white space U+0020 becomes Ġ.
The purpose is, by doing so, you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies, that won’t ever need an “unknown” token.
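The byte-to-“visible character” mapping described above can be reproduced in a few lines, following the same scheme GPT-2’s byte-level BPE uses (this is a re-implementation for illustration, not the library’s own code):

```python
def bytes_to_unicode():
    # Printable byte values keep their own character...
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_to_char = {b: chr(b) for b in printable}
    # ...while control/whitespace bytes get shifted past U+0100, so that
    # every one of the 256 byte values has a distinct visible stand-in.
    n = 0
    for b in range(256):
        if b not in byte_to_char:
            byte_to_char[b] = chr(256 + n)
            n += 1
    return byte_to_char

mapping = bytes_to_unicode()
print(mapping[ord(" ")])           # Ġ (U+0120): visible stand-in for a space
print(mapping[ord("\n")])          # Ċ (U+010A): likewise for a newline
print(len(set(mapping.values()))) # 256 distinct initial-alphabet tokens
```

This is why a leading Ġ in a distilroberta-base token simply marks “this token was preceded by a space”, and why the vocabulary never needs an unknown token: any byte sequence can be spelled out from these 256 symbols.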