Part 2 Lesson 10 wiki

What I mean by more processing is just that: the CUDA memory used triples when all layers are unfrozen, and the GPU temperature climbs to its limit (at least on my box). But the question remains: what is the benefit, if any, of unfreezing all of the layers after an initial round of training and then fine-tuning, as @jeremy has done?
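
For reference, this is roughly the freeze/unfreeze pattern being discussed (learner, lrs and wd as defined in the lesson notebook; the exact fit arguments here are illustrative, apart from the long fit quoted later in the thread):

learner.freeze_to(-1)                                        # only the last layer group is trainable at first
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)   # illustrative first pass

learner.unfreeze()                                           # every layer group now gets gradients: more memory and compute
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)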

I understand that we have to flag unknown tokens with -1, but we can do that without going through the first itos(). I understand that we want the most frequent tokens, but we can toss infrequent tokens without having to convert all of our text to imdb code numbers.

import pickle, numpy as np
from collections import defaultdict

wiki103_itos = pickle.load((WIKIPEDIA103_PATH/'itos_wt103.pkl').open('rb'))
word2code = defaultdict(lambda: -1, {v: k for k, v in enumerate(wiki103_itos) if word_frequency[v] > 2})
encoded_trn_ln = np.array([word2code[token] for token in tokenized_training_text[i]])

I think this gives the same results but saves a step.

When you unfreeze and train fully you’re training for the imdb classification task…not word predictions anymore…

As for the temperature, it seems then that the PyTorch dynamic graph does handle the frozen layers differently.

Thanks, that makes total sense. I’m in the process of comparing the results of BS=16 with unfreeze to BS=48 with only the last layer frozen to see how the results differ.


Getting a late start on imdb.py

Would someone kindly confirm how much CPU RAM you'd need for the lesson? I am crapping out on a Colab notebook (13GB) just loading up the imdb dataset, fairly early on.

Thanks in advance
Asif

My stand-alone box (Core i7-7800 with 64GB) uses 36GB just for that Python task.

I’m not sure what you mean. The word “laptop” exists in both wikitext103 and imdb. But the index values will be totally different.

We need to cross check to map the imdb tokens to wiki indexes. Only then can we benefit from all the wiki backbone weights!
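
For anyone following along, a minimal sketch of the kind of cross-mapping being described, assuming itos is the imdb index-to-word list, stoi2 is the wiki word-to-index lookup (defaulting to -1), and enc_wgts is the pretrained wiki embedding matrix:

import numpy as np

emb_size = enc_wgts.shape[1]
row_mean = enc_wgts.mean(0)                       # fallback row for words wiki never saw
new_w = np.zeros((len(itos), emb_size), dtype=np.float32)
for i, w in enumerate(itos):
    r = stoi2[w]                                  # wiki index for this imdb word, or -1
    new_w[i] = enc_wgts[r] if r >= 0 else row_mean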


Thanks. Yeah, I got stuck on just the loading-data-into-memory part… surprised, since it's barely 1GB of data. I guess it's time to spin up EC2.

A

I always thought it does not compute derivatives on frozen layers.

In core.py:

def set_trainable_attr(m, b):
    m.trainable = b
    for p in m.parameters(): p.requires_grad = b

From:
http://pytorch.org/docs/master/notes/autograd.html#excluding-subgraphs-from-backward

Backward computation is never performed in the subgraphs, where all Variables didn’t require gradients.

So there should be much less computation when more layers are frozen.
Or am I missing something…
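
A quick toy check of that behaviour (the layer names here are just for illustration):

import torch, torch.nn as nn

# tiny two-layer model: freeze the first layer the same way set_trainable_attr does
frozen = nn.Linear(10, 10)
head = nn.Linear(10, 2)
for p in frozen.parameters(): p.requires_grad = False

loss = head(frozen(torch.randn(4, 10))).sum()
loss.backward()

print(all(p.grad is None for p in frozen.parameters()))    # True  -> no backward pass through the frozen layer
print(all(p.grad is not None for p in head.parameters()))  # True  -> gradients only for the unfrozen head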


I’m asking why you even need to assign the imdb codes at all.

Right now, we

  1. figure out a code for laptop (say 2345)
  2. convert all the instances of laptop into 2345 in a string (let’s call it foo_txt).
  3. we download the codes for wikipedia103, where laptop has the code of 981.
  4. we figure out a mapping of imdb codes to wikipedia103 codes (2345 => 981)
  5. shift around all the weights in the encoder so the weights associated with 981 are now associated with 2345

I’m asking why we do steps 1, 4, and 5. Couldn’t we instead

  1. download the codes for wikipedia103, where laptop has the code of 981
  2. convert all the instances of laptop into 981 in a string (let's call it foo_txt)?

I recognize that we do have to do some processing on the imdb corpus to cut it down to the max vocabulary size, but that doesn't require converting all the instances of laptop into 2345.

I recognize that we have to handle cases where an imdb token isn't in the wikipedia103 vocabulary, but we have to do that anyway; using a defaultdict makes that easy (toy example below).
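
A toy illustration of that fallback (the words and codes here are made up):

from collections import defaultdict

word2code = defaultdict(lambda: -1, {'laptop': 981, 'movie': 17})
print(word2code['laptop'])       # 981
print(word2code['never_seen'])   # -1, flagged as unknown with no KeyError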

No you’re right…I’m coming from a static graph mindset…


We want 981 in the end, not 2345. IMDB is the target, not wiki. We're talking about "transfer" learning of the LM.

See where we do:
r = stoi2[w]

But you don’t want to throw away the 2345 because you’ll need it when you train the classifier.


I found using a 1/50th subset of the data (i.e. trn_texts[:1500], val_texts[:500]) was a reasonable compromise between fair results and being fast enough to tweak to understand what's going on.

When running the following on a GTX 1080 Ti:

learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)

# full dataset, after 5 hrs:
epoch  trn_loss  val_loss  accuracy
14     4.045284  4.072163  0.299391

# 1/50th dataset, after 213 seconds:
14     4.899093  4.919841  0.187322

# 1/100th dataset, after 96 seconds:
14     5.937064  5.737786  0.097723

NB: these results are for a baseline test without the wikitext103 model weights (accuracy with the pretrained weights is significantly better).


After sorting the chunks by size, I then:

sort_idx = np.concatenate(np.random.permutation(ck_idx))

So the chunks are then randomized again. But within each chunk everything is a similar size. Try looking at just 128 or so rows in your graph and you should see it.

Ah, the output makes sense now. That means the solution I was thinking of won't work in this case. The first batch should be ck_idx[0], but I'll have to think about how to force that and what the implications are.

@even maybe just change the SortishSampler so it always puts the largest batch first?
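
Something like the following could do that (a rough illustrative sketch, not the fastai source; lengths is assumed to be an array of sequence lengths and bs the batch size):

import numpy as np

def sortish_order(lengths, bs):
    sorted_idx = np.argsort(lengths)[::-1]                    # longest sequences first
    ck_idx = [sorted_idx[i:i + bs] for i in range(0, len(sorted_idx), bs)]
    np.random.shuffle(ck_idx)                                 # re-randomize the chunks
    # move the chunk containing the longest sequence to the front, so the very
    # first batch hits peak GPU memory instead of blowing up mid-epoch
    max_ck = int(np.argmax([lengths[ck[0]] for ck in ck_idx]))
    ck_idx[0], ck_idx[max_ck] = ck_idx[max_ck], ck_idx[0]
    return np.concatenate(ck_idx)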

Is the idea that sorting batches in descending order of sequence lengths could address the GPU memory clog problem?


similar performance… will try using 1/50 of the data! thx

Just to be clear, the real measure of the LM performance is not the val_loss but e^val_loss…so it’s a massive difference in perplexity score if your val_loss is 4.1 vs 4.9
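
To put numbers on that (a quick check, taking perplexity = e^val_loss):

import numpy as np

for val_loss in (4.1, 4.9):
    print(f'val_loss {val_loss:.1f} -> perplexity {np.exp(val_loss):.1f}')
# val_loss 4.1 -> perplexity 60.3
# val_loss 4.9 -> perplexity 134.3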

But then again, if you're just trying to understand what's going on, it's better to start small. In this case you can always load your original saved weights back up and retrain if things get messed up.
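
For example (the checkpoint name here is just illustrative):

learner.save('lm_before_experiments')    # snapshot the weights you want to come back to
# ... experiment with subsets, learning rates, etc. ...
learner.load('lm_before_experiments')    # restore and retrain if things get messed up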

Another point to note:
You don't want to be throwing away imdb reviews during the classification phase. Jeremy's slide does show that the LM backbone greatly helps with small-volume datasets, but to truly measure its success, we need all 50k reviews.
If the LM was trained on only a small subset, the classifier will have to bear the burden of the remaining training.

Yeah, sampling is rarely a good idea. Instead, just do fewer epochs.

3 Likes