Fastai v2 text

I am having issues getting a text model up and running in fastai v2. I am working with protein sequences, so I need to define a custom tokenizer.

from fastai2.basics import *
from fastai2.text.all import *

BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

defaults.text_spec_tok = [PAD]

class MolTokenizer(BaseTokenizer):
    def __init__(self, split_char=' '):
        self.split_char = split_char
    def __call__(self, items):
        return (['GO'] + list(t.upper()) + ['END'] for t in items)
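
Calling the tokenizer on its own gives the expected output:

tok = MolTokenizer()
print(next(tok(['manyt'])))  # ['GO', 'M', 'A', 'N', 'Y', 'T', 'END']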

I then begin with the following

bs = 64
corpus_train = pd.read_csv('./processed/train.csv.gzip', index_col=None, compression='gzip')
corpus_valid = pd.read_csv('./processed/valid.csv.gzip', index_col=None, compression='gzip')
corpus_train['is_valid'] = False
corpus_valid['is_valid'] = True
corpus = corpus_train.append(corpus_valid, ignore_index=True)

path = './test/'
df_tok, count = tokenize_df(corpus, 'sequence', rules=[], tok_func=partial(MolTokenizer))
dls_lm = TextDataLoaders.from_df(df_tok, path=path, text_vocab=make_vocab(count,min_freq=1), text_col='text', is_lm=True, valid_col='is_valid')

Everything checks out for my vocab:

dls_lm.train_ds.vocab
['xxpad', 'L', 'A', 'G', 'V', 'E', 'S', 'I', 'K', 'R', 'D', 'T', 'P', 'N', 'Q', 'F', 'Y', 'M', 'H', 'C', 'W', 'GO', 'END', 'xxfake']

and df_tok looks good:

df_tok.text.head(2)
0    [GO, M, A, N, Y, T, A, A, D, I, K, A, L, R, E, R, T, G, A, G, M, M, D, V, K, K, A, L, D, E, A, N, G, D, A, E, K, A, I, E, I, I, R, I, K, G, L, K, G, A, T, K, R, E, G, R, S, T, A, E, G, L, V, A, A, K, V, N, G, G, V, G, V, M, I, E, V, N, C, E, T, D, F, V, A, K, A, D, K, F, I, Q, L, A, D, K, V, L, N, V, ...]
1    [GO, M, P, K, S, R, R, A, V, S, L, S, V, L, I, G, A, V, I, A, A, L, A, G, A, L, I, A, V, T, V, P, A, R, P, N, R, P, E, A, D, R, E, A, L, W, K, I, V, H, D, R, C, E, F, G, Y, R, R, T, G, A, Y, A, P, C, T, F, V, D, E, Q, S, G, T, A, L, Y, K, A, D, F, D, P, Y, Q, F, L, L, I, P, L, A, R, I, T, G, I, E, D, ...]

However, something goes wrong either during numericalization or while generating the batches:

xx, yy = dls_lm.one_batch()
xx[:5]
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       device='cuda:1')

All my data gets numericalized to a zero token. Does anyone see what I am doing wrong?
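
For reference, the numericalization step can be checked on its own (a minimal sketch; I'm assuming Numericalize can be used standalone with an explicit vocab). Note that, at least in the fastai2 source I'm reading, Numericalize maps out-of-vocabulary tokens to index 0 via a defaultdict, so an all-zero batch is consistent with a vocab/token mismatch somewhere downstream:

vocab = make_vocab(count, min_freq=1)
num = Numericalize(vocab=vocab)   # unknown tokens fall back to index 0 (defaultdict)
toks = df_tok.text.iloc[0]        # the first tokenized sequence
print(num(toks)[:10])             # non-zero ids here would rule out Numericalize itself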


It seems the fastai/fastbook NLP tutorials and lectures focus on English, especially through their use of spaCy. If more effort were given to non-Western languages, especially those from the global south, or at the very least if a language-neutral tokenizer (such as SentencePiece) were the default rather than a Western-opinionated one (spaCy), it would go a long way toward greater inclusion and diversity.


How would one use fastai's SentencePiece support in this fastbook tutorial? https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb

I’ve had to manually build the SentencePiece model and vocab, but I'm unsure how to plug these into the provided notebook. In particular, how would I plug SentencePiece into fastai2's Tokenizer class?

I’m curious, what languages do you have in mind? Is tokenization so different in these cases? I know for some languages (Chinese being an obvious example) tokenization is very different.

I’m focusing on Nguni languages (Southern Africa), and in previous fastai versions (part 2, 2018) I was able to use SentencePiece. The fastai2 version is not so clear on how to go from SentencePieceTokenizer to dls_lm:


dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

and finally to

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
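
The part I can't see is where the custom tokenizer plugs in. Here is a sketch of what I would try (untested; I'm assuming TextBlock.from_folder forwards tokenizer keyword arguments on to Tokenizer.from_folder, and depending on the fastai2 version the keyword may be tok_func or tok):

sp_tok = partial(SentencePieceTokenizer, lang='zul', vocab_sz=30000)

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok_func=sp_tok),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)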

I did a little bit of work on trying out SHA-RNN with fastai: Training SHA-RNN with fastai
It might be useful :slight_smile:


Has anyone faced a dead-kernel error while using the Tokenizer class?
I have a corpus of ~200,000 lines.

Steps to reproduce:

  1. Create a SentencePieceTokenizer instance
  2. Pass the corpus into SentencePieceTokenizer.setup (currently I'm passing the entire corpus because I'm worried a subset might not capture all the possible words; the language I'm working with has very rich morphology).
  3. Pass the SentencePieceTokenizer into Tokenizer in order to get fastai's additional tokenizer helper methods.

Here is my stack trace (Jupyter-Notebook log):

sentencepiece_trainer.cc(116) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=30000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj
sentencepiece_trainer.cc(49) LOG(INFO) Starts training with :
TrainerSpec {
  input: tmp/texts.out
  input_format:
  model_prefix: tmp/spm
  model_type: UNIGRAM
  vocab_size: 30000
  self_test_sample_size: 0
  character_coverage: 0.99999
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  treat_whitespace_as_suffix: 0
  user_defined_symbols: ▁xxunk
  user_defined_symbols: ▁xxpad
  user_defined_symbols: ▁xxbos
  user_defined_symbols: ▁xxeos
  user_defined_symbols: ▁xxfld
  user_defined_symbols: ▁xxrep
  user_defined_symbols: ▁xxwrep
  user_defined_symbols: ▁xxup
  user_defined_symbols: ▁xxmaj
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 9
  bos_id: -1
  eos_id: -1
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇
}
NormalizerSpec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}

trainer_interface.cc(267) LOG(INFO) Loading corpus: tmp/texts.out
trainer_interface.cc(315) LOG(INFO) Loaded all 196848 sentences
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxunk
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxpad
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxbos
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxeos
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxfld
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxrep
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxwrep
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxup
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxmaj
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
trainer_interface.cc(384) LOG(INFO) all chars count=19916373
trainer_interface.cc(392) LOG(INFO) Done: 99.9991% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=94
trainer_interface.cc(403) LOG(INFO) Final character coverage=0.999991
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 196846 sentences.
unigram_model_trainer.cc(129) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(133) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(184) LOG(INFO) Initialized 793115 seed sentencepieces
trainer_interface.cc(441) LOG(INFO) Tokenizing input sentences with whitespace: 196846
trainer_interface.cc(451) LOG(INFO) Done! 577249
unigram_model_trainer.cc(470) LOG(INFO) Using 577249 sentences for EM training
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=280412 obj=16.2455 num_tokens=1181656 num_tokens/piece=4.214
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=236822 obj=13.6384 num_tokens=1185910 num_tokens/piece=5.0076
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=177327 obj=13.5918 num_tokens=1207058 num_tokens/piece=6.80696
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=175180 obj=13.5007 num_tokens=1208000 num_tokens/piece=6.89576
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=131345 obj=13.6065 num_tokens=1254268 num_tokens/piece=9.54942
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=131046 obj=13.5553 num_tokens=1254916 num_tokens/piece=9.57615
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=98280 obj=13.7175 num_tokens=1305316 num_tokens/piece=13.2816
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=98253 obj=13.6504 num_tokens=1305836 num_tokens/piece=13.2905
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=73688 obj=13.8502 num_tokens=1359889 num_tokens/piece=18.4547
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=73683 obj=13.7946 num_tokens=1360351 num_tokens/piece=18.4622
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=55262 obj=14.0404 num_tokens=1417469 num_tokens/piece=25.65
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=55260 obj=13.978 num_tokens=1418091 num_tokens/piece=25.6622
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=41445 obj=14.281 num_tokens=1481025 num_tokens/piece=35.7347
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=41445 obj=14.2049 num_tokens=1481276 num_tokens/piece=35.7408
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=33000 obj=14.4826 num_tokens=1543754 num_tokens/piece=46.7804
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=33000 obj=14.401 num_tokens=1543804 num_tokens/piece=46.7819
trainer_interface.cc(507) LOG(INFO) Saving model: tmp/spm.model
trainer_interface.cc(531) LOG(INFO) Saving vocabs: tmp/spm.vocab
[I 19:20:37.324 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel f3fcecac-d891-4a93-bebb-368c3a1f809a restarted

Code:

# Tokenize using fastai's SentencePiece
txts = L(o.open().read() for o in files[:1])
sp = SentencePieceTokenizer(lang='zul', vocab_sz=30000)
sp.setup(txts[:30000])
tkn = Tokenizer(sp)

type(tkn)
fastai2.text.core.Tokenizer

print(coll_repr(tkn(txt), 31))

Any and all help would be greatly appreciated.
cc @jeremy

FYI, fastai simply wraps SentencePiece. To debug this, first try doing it in plain SentencePiece to see if the issue is still there. You can see the wrapping in SentencePieceTokenizer's setup:

def setup(self, items, rules=None):
    from sentencepiece import SentencePieceProcessor
    if rules is None: rules = []
    if self.tok is not None: return {'sp_model': self.sp_model}
    raw_text_path = self.cache_dir/'texts.out'
    with open(raw_text_path, 'w') as f:
        for t in progress_bar(maps(*rules, items), total=len(items), leave=False):
            f.write(f'{t}\n')
    sp_model = self.train(raw_text_path)
    self.tok = SentencePieceProcessor()  # <-
    self.tok.Load(str(sp_model))

def __call__(self, items):
    for t in items: yield self.tok.EncodeAsPieces(t)  # <-

And all it does is use the SentencePiece library :slight_smile:

Tokenizer itself is just a wrapper to get the tokenized words going
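
For instance, a bare-bones check with the sentencepiece package on its own, using roughly the options fastai passes (visible in your log), might look like this sketch (adjust the paths to your setup):

import sentencepiece as spm

# train directly on the same text dump fastai writes out
spm.SentencePieceTrainer.Train(
    '--input=tmp/texts.out --model_prefix=tmp/spm_debug '
    '--vocab_size=30000 --model_type=unigram --character_coverage=0.99999')

sp = spm.SentencePieceProcessor()
sp.Load('tmp/spm_debug.model')
print(sp.EncodeAsPieces('a held-out sentence in your language'))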


I used the SentencePiece library directly and had no issues; in fact, it was the first thing I tried (I've spent 2.5 days stuck on just this first part). If possible, I would like to skip the fastai wrapper for SentencePiece and somehow pass the output of SentencePiece straight to fastai.

Can you suggest how I can move forward, since I've already attempted the first solution you provided?

P.S. I've spent hours looking at the code, and I've written Python code for years and debugged many Python apps as a seasoned engineer; something here is implicit that should be explicit, and that's what I'm missing.
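
In the meantime, two things I'm considering (untested sketches, based only on the quoted source above): since setup returns early when a model is already loaded, I could hand my pre-trained model to SentencePieceTokenizer via its sp_model argument; failing that, Tokenizer appears to only need a callable that yields token lists, so a thin wrapper of my own might do:

# option 1: reuse the trained model, assuming the sp_model argument is honored
tkn = Tokenizer(SentencePieceTokenizer(sp_model='tmp/spm.model'))

# option 2: a hypothetical minimal wrapper, mirroring fastai's __call__ above
import sentencepiece as spm

class PretrainedSPTokenizer:
    def __init__(self, model_path):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(str(model_path))
    def __call__(self, items):
        for t in items: yield self.sp.EncodeAsPieces(t)

tkn = Tokenizer(PretrainedSPTokenizer('tmp/spm.model'))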


Hi,
I'm trying to understand how the text dataloader works and how batches are divided, as explained in chapter 12 of the amazing book.
I tried to create a super simple dataset and inspect one batch from it:

    test_L = L(((i,i+100) for i in range(100)))
    bs = 5
    test_chunks = group_chunks(test_L,bs)
    print (coll_repr(test_L,10))
    print (coll_repr(test_chunks,10))

    cut = int(len(test_chunks) * 0.8)

    dls_chunks = DataLoaders.from_dsets(
        group_chunks(test_L[:cut], bs), 
        group_chunks(test_L[cut:], bs), 
        bs=bs, drop_last=True, shuffle=False)

    x, y = dls_chunks.train.one_batch()
    x[:10],y[:10],x.shape,y.shape,cut

and I don't understand the results:

(#100) [(0, 100),(1, 101),(2, 102),(3, 103),(4, 104),(5, 105),(6, 106),(7, 107),(8, 108),(9, 109)...]
(#100) [(0, 100),(20, 120),(40, 140),(60, 160),(80, 180),(1, 101),(21, 121),(41, 141),(61, 161),(81, 181)...]

(tensor([ 0, 16, 32, 48, 64,  1, 17, 33, 49, 65]),
 tensor([100, 116, 132, 148, 164, 101, 117, 133, 149, 165]),
 torch.Size([64]),
 torch.Size([64]),
 80)
  1. Why is the length of the first batch 64 (for both x and y)? Can I define the sequence length somehow?
  2. The ordering is not clear to me; wouldn't it be better for training if each batch held continuous numbers? (See the group_chunks definition below.)
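
For context, here is group_chunks as defined in the chapter. It interleaves the dataset with stride len(ds)//bs, which matches the ordering above: 100 items with bs=5 gives a stride of 20 for the full set, and the 80-item training split gives a stride of 16 (hence 0, 16, 32, 48, 64, 1, 17, ...):

    def group_chunks(ds, bs):
        # row i of successive batches forms one continuous stream,
        # which is what a stateful RNN needs from batch to batch
        m = len(ds) // bs
        new_ds = L()
        for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
        return new_ds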

I can run the ULMFIT and IMDB notebooks (fastai2 v0.0.11), but with my own data (Dutch Wikipedia, with the same preparation I use in fastai v1) I get the same error.

Do you have any thoughts on where I could start to debug?

cc @fmobrj75

@zerotosingularity what versions of PyTorch and torchvision? (Are they the most recent ones?)


I tried 1.3.0 (with Torchvision 0.4) and 1.4.0 (with Torchvision 0.5)

Try to update both fastai2 and fastcore:

git clone https://github.com/fastai/fastai2
cd fastai2

conda env create -f environment.yml
source activate fastai2

python3 -m pip install packaging
python3 -m pip install -e .[dev]

Then be sure fastcore is also up to date:

python3 -m pip uninstall fastcore

python3 -m pip install packaging
git clone https://github.com/fastai/fastcore
cd fastcore
python3 -m pip install -e .[dev]

Hope it also works for you.


Thank you for the help!

I just tried these steps and am now hitting an issue with fastprogress (even when just trying to find the learning rate):

'NBMasterBar' object has no attribute 'out'

I have fastprogress 0.2.3 installed, and I also tried the install from the repo.

Any chance you could list your environment so I can compare?

pip list

or

conda list -e

Here’s mine:

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
absl-py=0.9.0=pypi_0
art=4.5=pypi_0
asn1crypto=1.3.0=py37_0
astor=0.8.1=pypi_0
attrs=19.3.0=py_0
backcall=0.1.0=py37_0
blas=1.0=mkl
bleach=3.1.0=py37_0
ca-certificates=2020.1.1=0
cachetools=4.0.0=pypi_0
certifi=2019.11.28=py37_0
cffi=1.14.0=py37h2e261b9_0
chardet=3.0.4=py37_1003
cmdstanpy=0.4.0=pypi_0
convertdate=2.2.0=pypi_0
coverage=5.0.3=pypi_0
cryptography=2.8=py37h1ba5d50_0
cudatoolkit=10.1.243=h6bb024c_0
cycler=0.10.0=py37_0
cymem=2.0.2=py37he1b5a44_0
cython=0.29.15=pypi_0
cython-blis=0.2.4=py37h516909a_1
dbus=1.13.12=h746ee38_0
decorator=4.4.2=py_0
defusedxml=0.6.0=py_0
entrypoints=0.3=py37_0
ephem=3.7.7.1=pypi_0
expat=2.2.6=he6710b0_0
fastai2=0.0.12=dev_0
fastcore=0.1.15=dev_0
fastprogress=0.2.3=dev_0
fastscript=0.1.4=pypi_0
fontconfig=2.13.0=h9420a91_0
freetype=2.9.1=h8a8886c_1
gast=0.2.2=pypi_0
glib=2.63.1=h5a9c865_0
gmp=6.1.2=h6c8ec71_1
google-auth=1.11.3=pypi_0
google-auth-oauthlib=0.4.1=pypi_0
google-pasta=0.2.0=pypi_0
grpcio=1.27.2=pypi_0
gst-plugins-base=1.14.0=hbbd80ab_1
gstreamer=1.14.0=hb453b48_1
h5py=2.10.0=pypi_0
holidays=0.10.1=pypi_0
icu=58.2=h9c2bf20_1
idna=2.8=py37_0
importlib_metadata=1.5.0=py37_0
intel-openmp=2020.0=166
ipykernel=5.1.4=py37h39e3cac_0
ipython=7.13.0=py37h5ca1d4c_0
ipython_genutils=0.2.0=py37_0
ipywidgets=7.5.1=py_0
jedi=0.16.0=py37_0
jinja2=2.11.1=py_0
joblib=0.14.1=py_0
jpeg=9b=h024ee3a_2
jsonschema=3.2.0=py37_0
jupyter=1.0.0=py37_7
jupyter_client=5.3.4=py37_0
jupyter_console=6.1.0=py_0
jupyter_core=4.6.1=py37_0
keras-applications=1.0.8=pypi_0
keras-preprocessing=1.1.0=pypi_0
kiwisolver=1.1.0=py37he6710b0_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libpng=1.6.37=hbc83047_0
libsodium=1.0.16=h1bed415_0
libstdcxx-ng=9.1.0=hdf63c60_0
libtiff=4.1.0=h2733197_0
libuuid=1.0.3=h1bed415_2
libxcb=1.13=h1bed415_1
libxml2=2.9.9=hea5a465_1
lunarcalendar=0.0.9=pypi_0
markdown=3.2.1=pypi_0
markupsafe=1.1.1=py37h7b6447c_0
matplotlib=3.1.3=py37_0
matplotlib-base=3.1.3=py37hef1b27d_0
mistune=0.8.4=py37h7b6447c_0
mkl=2020.0=166
mkl-service=2.3.0=py37he904b0f_0
mkl_fft=1.0.15=py37ha843d7b_0
mkl_random=1.1.0=py37hd6b4f25_0
murmurhash=1.0.2=py37he6710b0_0
nbconvert=5.6.1=py37_0
nbdev=0.2.13=pypi_0
nbformat=5.0.4=py_0
ncurses=6.2=he6710b0_0
ninja=1.9.0=py37hfd86e86_0
notebook=6.0.3=py37_0
numpy=1.18.1=py37h4f9e942_0
numpy-base=1.18.1=py37hde5b4d6_1
oauthlib=3.1.0=pypi_0
olefile=0.46=py37_0
openssl=1.1.1d=h7b6447c_4
opt-einsum=3.2.0=pypi_0
packaging=20.3=pypi_0
pandas=0.25.3=pypi_0
pandoc=2.2.3.2=0
pandocfilters=1.4.2=py37_1
parso=0.6.1=py_0
pcre=8.43=he6710b0_0
pexpect=4.8.0=py37_0
pickleshare=0.7.5=py37_0
pillow=7.0.0=py37hb39fc2d_0
pip=20.0.2=py37_1
plac=0.9.6=py37_0
plotly=4.5.4=pypi_0
preshed=2.0.1=py37he6710b0_0
prometheus_client=0.7.1=py_0
prompt_toolkit=3.0.3=py_0
protobuf=3.11.3=pypi_0
ptyprocess=0.6.0=py37_0
pyasn1=0.4.8=pypi_0
pyasn1-modules=0.2.8=pypi_0
pycparser=2.19=py37_0
pygments=2.5.2=py_0
pymeeus=0.3.7=pypi_0
pyopenssl=19.1.0=py37_0
pyparsing=2.4.6=py_0
pyqt=5.9.2=py37h05f1152_2
pyrsistent=0.15.7=py37h7b6447c_0
pysocks=1.7.1=py37_0
pystan=2.19.1.1=pypi_0
python=3.7.6=h0371630_2
python-dateutil=2.8.1=py_0
pytorch=1.4.0=py3.7_cuda10.1.243_cudnn7.6.3_0
pytz=2019.3=py_0
pyyaml=5.3=py37h7b6447c_0
pyzmq=18.1.1=py37he6710b0_0
qt=5.9.7=h5867ecd_1
qtconsole=4.7.1=py_0
qtpy=1.9.0=py_0
readline=7.0=h7b6447c_5
requests=2.23.0=py37_0
requests-oauthlib=1.3.0=pypi_0
retrying=1.3.3=pypi_0
rsa=4.0=pypi_0
scikit-learn=0.22.1=py37hd81dba3_0
scipy=1.4.1=py37h0b6359f_0
seaborn=0.10.0=pypi_0
send2trash=1.5.0=py37_0
setuptools=46.0.0=py37_0
setuptools-git=1.2=pypi_0
sip=4.19.8=py37hf484d3e_0
six=1.14.0=py37_0
spacy=2.1.8=py37hc9558a2_0
sqlite=3.31.1=h7b6447c_0
srsly=0.1.0=py37he1b5a44_0
tensorboard=2.1.1=pypi_0
tensorflow=2.1.0=pypi_0
tensorflow-estimator=2.1.0=pypi_0
termcolor=1.1.0=pypi_0
terminado=0.8.3=py37_0
testpath=0.4.4=py_0
thinc=7.0.8=py37hc9558a2_0
tk=8.6.8=hbc83047_0
torchvision=0.5.0=py37_cu101
tornado=6.0.4=py37h7b6447c_1
tqdm=4.42.1=py_0
traitlets=4.3.3=py37_0
tscv=0.0.4=pypi_0
urllib3=1.25.8=py37_0
wasabi=0.2.2=py_0
wcwidth=0.1.8=py_0
webencodings=0.5.1=py37_1
werkzeug=1.0.0=pypi_0
wheel=0.34.2=py37_0
widgetsnbextension=3.5.1=py37_0
wrapt=1.12.1=pypi_0
xz=5.2.4=h14c3975_4
yaml=0.1.7=had09818_2
zeromq=4.3.1=he6710b0_3
zipp=2.2.0=py_0
zlib=1.2.11=h7b6447c_3
zstd=1.3.7=h0b5b093_0
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
alpha-vantage=2.1.3=pypi_0
asn1crypto=1.3.0=py37_0
attrs=19.3.0=py_0
backcall=0.1.0=py37_0
beautifulsoup4=4.8.2=pypi_0
blas=1.0=mkl
bleach=3.1.0=py_0
bs4=0.0.1=pypi_0
ca-certificates=2020.1.1=0
certifi=2019.11.28=py37_0
cffi=1.14.0=py37h2e261b9_0
chardet=3.0.4=py37_1003
cryptography=2.8=py37h1ba5d50_0
cudatoolkit=10.1.243=h6bb024c_0
cycler=0.10.0=py37_0
cymem=2.0.2=py37he1b5a44_0
cython-blis=0.2.4=py37h516909a_1
dbus=1.13.12=h746ee38_0
decorator=4.4.1=py_0
defusedxml=0.6.0=py_0
entrypoints=0.3=py37_0
expat=2.2.6=he6710b0_0
fast-tabnet=0.0.5=pypi_0
fastai2=0.0.8=dev_0
fastcore=0.1.11=dev_0
fastprogress=0.2.2=py_0
fastscript=0.1.4=pypi_0
fontconfig=2.13.0=h9420a91_0
freetype=2.9.1=h8a8886c_1
glib=2.63.1=h5a9c865_0
gmp=6.1.2=h6c8ec71_1
gst-plugins-base=1.14.0=hbbd80ab_1
gstreamer=1.14.0=hb453b48_1
icu=58.2=h9c2bf20_1
idna=2.8=py37_0
importlib_metadata=1.5.0=py37_0
intel-openmp=2020.0=166
ipykernel=5.1.4=py37h39e3cac_0
ipython=7.12.0=py37h5ca1d4c_0
ipython_genutils=0.2.0=py37_0
ipywidgets=7.5.1=py_0
jedi=0.16.0=py37_0
jinja2=2.11.1=py_0
joblib=0.14.1=py_0
jpeg=9b=h024ee3a_2
jsonschema=3.2.0=py37_0
jupyter=1.0.0=py37_7
jupyter_client=5.3.4=py37_0
jupyter_console=6.1.0=py_0
jupyter_core=4.6.1=py37_0
kiwisolver=1.1.0=py37he6710b0_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libpng=1.6.37=hbc83047_0
libsodium=1.0.16=h1bed415_0
libstdcxx-ng=9.1.0=hdf63c60_0
libtiff=4.1.0=h2733197_0
libuuid=1.0.3=h1bed415_2
libxcb=1.13=h1bed415_1
libxml2=2.9.9=hea5a465_1
lxml=4.5.0=pypi_0
markupsafe=1.1.1=py37h7b6447c_0
matplotlib=3.1.3=py37_0
matplotlib-base=3.1.3=py37hef1b27d_0
mdlp-discretization=0.3.3=pypi_0
mistune=0.8.4=py37h7b6447c_0
mkl=2020.0=166
mkl-service=2.3.0=py37he904b0f_0
mkl_fft=1.0.15=py37ha843d7b_0
mkl_random=1.1.0=py37hd6b4f25_0
more-itertools=8.2.0=pypi_0
murmurhash=1.0.2=py37he6710b0_0
nbconvert=5.6.1=py37_0
nbdev=0.2.12=pypi_0
nbformat=5.0.4=py_0
ncurses=6.1=he6710b0_1
ninja=1.9.0=py37hfd86e86_0
notebook=6.0.3=py37_0
numpy=1.18.1=py37h4f9e942_0
numpy-base=1.18.1=py37hde5b4d6_1
olefile=0.46=py_0
openssl=1.1.1d=h7b6447c_4
packaging=20.1=pypi_0
pandas=1.0.1=py37h0573a6f_0
pandoc=2.2.3.2=0
pandocfilters=1.4.2=py37_1
parso=0.6.1=py_0
pcre=8.43=he6710b0_0
pexpect=4.8.0=py37_0
pickleshare=0.7.5=py37_0
pillow=7.0.0=py37hb39fc2d_0
pip=20.0.2=py37_1
plac=0.9.6=py37_0
pluggy=0.13.1=pypi_0
preshed=2.0.1=py37he6710b0_0
prometheus_client=0.7.1=py_0
prompt_toolkit=3.0.3=py_0
psycopg2=2.8.4=pypi_0
ptyprocess=0.6.0=py37_0
py=1.8.1=pypi_0
pycparser=2.19=py_0
pygments=2.5.2=py_0
pyopenssl=19.1.0=py37_0
pyparsing=2.4.6=py_0
pyqt=5.9.2=py37h05f1152_2
pyrsistent=0.15.7=py37h7b6447c_0
pysocks=1.7.1=py37_0
pytest=5.3.5=pypi_0
python=3.7.6=h0371630_2
python-dateutil=2.8.1=py_0
pytorch=1.3.1=py3.7_cuda10.1.243_cudnn7.6.3_0
pytz=2019.3=py_0
pyyaml=5.3=py37h7b6447c_0
pyzmq=18.1.1=py37he6710b0_0
qt=5.9.7=h5867ecd_1
qtconsole=4.6.0=py_1
readline=7.0=h7b6447c_5
requests=2.22.0=py37_1
scikit-learn=0.22.1=py37hd81dba3_0
scipy=1.4.1=py37h0b6359f_0
send2trash=1.5.0=py37_0
sentencepiece=0.1.85=pypi_0
setuptools=45.2.0=py37_0
sip=4.19.8=py37hf484d3e_0
six=1.14.0=py37_0
soupsieve=1.9.5=pypi_0
spacy=2.1.8=py37hc9558a2_0
sqlalchemy=1.3.14=pypi_0
sqlite=3.31.1=h7b6447c_0
srsly=0.1.0=py37he1b5a44_0
terminado=0.8.3=py37_0
testpath=0.4.4=py_0
thinc=7.0.8=py37hc9558a2_0
tk=8.6.8=hbc83047_0
torchvision=0.4.2=py37_cu101
tornado=6.0.3=py37h7b6447c_3
tqdm=4.42.1=py_0
traitlets=4.3.3=py37_0
urllib3=1.25.8=py37_0
wasabi=0.2.2=py_0
wcwidth=0.1.8=py_0
webencodings=0.5.1=py37_1
wheel=0.34.2=py37_0
widgetsnbextension=3.5.1=py37_0
xlrd=1.2.0=pypi_0
xz=5.2.4=h14c3975_4
yaml=0.1.7=had09818_2
zeromq=4.3.1=he6710b0_3
zipp=2.2.0=py_0
zlib=1.2.11=h7b6447c_3
zstd=1.3.7=h0b5b093_0

Another issue I ran into at some point (I don't remember exactly when) that you may also face: I had CUDA 9 installed and had to migrate to CUDA 10 to make newer versions of PyTorch work.

And yours?

I might need to stick with fastai2 v0.0.11 for now. That one runs the IMDB and ULMFIT notebooks, but it does not seem to work with my data (which used to work in v1).

I already cut out texts that were shorter than 5-10 words, but to no avail…

Thank you for the help! I’ll continue my search…

Have you checked your CUDA version? I suspect PyTorch 1.4 requires CUDA 10.
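
A quick way to check which CUDA build your PyTorch is using:

import torch
print(torch.__version__)          # e.g. 1.4.0
print(torch.version.cuda)         # CUDA version this build was compiled against
print(torch.cuda.is_available())  # whether the GPU is actually usable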

I'm running CUDA 10.2 and the latest NVIDIA drivers.

From the current course info, I see we will likely start with 0.0.11 instead of the development version. I will start with that one and try to find a solution. On 0.0.11 the IMDB and ULMFIT notebooks seem to work…