Part 2 Lesson 10 wiki

(Bart Fish) #382

What I mean by more processing is just that the CUDA memory used triples when all layers are unfrozen, and the GPU temperature increases to the limit (at least on my box). But the question remains: what is the benefit, if any, of unfreezing all of the layers after an initial round of training and then fine-tuning, as @jeremy has done?

(Kaitlin Duck Sherwood) #383

I understand that we have to flag with -1, but we can do that without going through the first itos(). I understand that we want the most frequent, but we can toss infrequent tokens without having to convert all of our text to imdb code numbers.

import pickle
import numpy as np
from collections import defaultdict
wiki103_itos = pickle.load((WIKIPEDIA103_PATH/'itos_wt103.pkl').open('rb'))
word2code = defaultdict(lambda:-1, {v:k for k,v in enumerate(wiki103_itos) if word_frequency[v] > 2})
encoded_trn_ln = np.array([word2code[token] for token in tokenized_training_text[i]])

I think this gives the same results but saves a step.

(Arvind Nagaraj) #384

When you unfreeze and train fully you’re training for the imdb classification task…not word predictions anymore…

As for the temperature, it seems, then, that the PyTorch dynamic graph does handle the frozen layers differently.

(Bart Fish) #385

Thanks, that makes total sense. I’m in the process of comparing the results of BS=16 with unfreeze to BS=48 with only the last layer frozen to see how the results differ.

(Asif Imran) #386

Getting a late start on

Would someone kindly confirm how much CPU RAM you need for the lesson? I am running out of memory on a Colab notebook (13 GB) just loading the IMDB dataset, fairly early on.

Thanks in advance

(Bart Fish) #387

My stand-alone box (Core I7-7800 with 64GB), uses 36GB just for that python task.

(Arvind Nagaraj) #388

I’m not sure what you mean. The word “laptop” exists in both wikitext103 and imdb. But the index values will be totally different.

We need to cross check to map the imdb tokens to wiki indexes. Only then can we benefit from all the wiki backbone weights!
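
The cross-check can be sketched roughly as follows. This is a toy illustration rather than the lesson notebook's code: the two vocabularies, the 3-dimensional weights, and every variable name here are made up for the example, and filling unmatched rows with the mean wiki embedding is an assumption about how missing tokens are handled.

```python
from collections import defaultdict
import numpy as np

# Toy vocabularies (hypothetical): same words, different index order.
wiki_itos = ["the", "movie", "laptop", "encyclopedia"]
imdb_itos = ["movie", "the", "laptop", "spoilers"]

# Stand-in for the pretrained wiki embedding matrix (4 tokens x 3 dims).
wiki_weights = np.arange(len(wiki_itos) * 3, dtype=np.float32).reshape(len(wiki_itos), 3)
mean_row = wiki_weights.mean(axis=0)

# Reverse lookup for the wiki vocabulary; -1 flags "not in wiki".
wiki_stoi = defaultdict(lambda: -1, {w: i for i, w in enumerate(wiki_itos)})

# Build an embedding matrix laid out in imdb index order.
imdb_weights = np.zeros((len(imdb_itos), 3), dtype=np.float32)
for i, w in enumerate(imdb_itos):
    r = wiki_stoi[w]
    # Copy the wiki row when the token exists there, else fall back to the mean row.
    imdb_weights[i] = wiki_weights[r] if r >= 0 else mean_row
```

After this, row i of imdb_weights holds the pretrained vector for imdb token i, regardless of where that token sat in the wiki vocabulary.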

(Asif Imran) #389

Thanks. Yeah, I got stuck on just the loading data in memory part … surprised since it’s barely 1GB of data. I guess it’s time to spin up EC2.


(Alex) #390

I always thought it does not compute derivatives on frozen layers.


def set_trainable_attr(m,b):
    for p in m.parameters(): p.requires_grad=b


Backward computation is never performed in the subgraphs, where all Variables didn’t require gradients.

So there should be far fewer computations when more layers are frozen.
Or am I missing something…
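
A small way to check this, assuming PyTorch is installed: the two-layer model below is a made-up stand-in (not the lesson's language model), and the helper is the same set_trainable_attr quoted above. After a backward pass, parameters with requires_grad set to False should have no gradient at all.

```python
import torch
import torch.nn as nn

def set_trainable_attr(m, b):
    # Same idea as the helper quoted above: toggle requires_grad on every parameter.
    for p in m.parameters():
        p.requires_grad = b

# Hypothetical toy model: two linear layers with a ReLU in between.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
set_trainable_attr(model[0], False)  # freeze the first layer only

x = torch.randn(3, 4)
loss = model(x).sum()
loss.backward()

# Frozen parameters never receive a gradient; trainable ones do.
frozen_grads = [p.grad for p in model[0].parameters()]
trainable_grads = [p.grad for p in model[2].parameters()]
print("frozen layer grads:", frozen_grads)
print("trainable layer has grads:", all(g is not None for g in trainable_grads))
```

This only shows that no gradients are stored for the frozen layer; it does not measure how much time or memory is saved.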

(Kaitlin Duck Sherwood) #391

I’m asking why you even need to assign the imdb codes at all.

Right now, we

  1. figure out a code for laptop (say 2345)
  2. convert all the instances of laptop into 2345 in a string (let’s call it foo_txt).
  3. we download the codes for wikipedia103, where laptop has the code of 981.
  4. we figure out a mapping of imdb codes to wikipedia103 codes (2345 => 981)
  5. shift around all the weights in the encoder so the weights associated with 981 are now associated with 2345

I’m asking why we do steps 1, 4, and 5. Couldn’t we instead

  1. download the codes for wikipedia103, where laptop has the code of 981
  2. convert all the instances of laptop into 2345 in a string (let’s call it foo_txt)

I recognize that we do have to do some processing on the imdb corpus to cut down to the max vocabulary size, but that doesn’t require converting all the instances of laptop into 2345.

I recognize that we have to handle cases where an imdb string isn’t in the wikipedia103 corpus, but we have to do that anyway; using a defaultdict makes that easy.
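
A minimal sketch of the shortcut being proposed here, on toy data. It is not what the lesson notebook does; the vocabulary, the reviews, and the min_count threshold are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical wikitext-103 vocabulary: list position is the wiki code.
wiki_itos = ["_unk_", "the", "laptop", "movie", "was", "great"]

# Hypothetical tokenized imdb reviews.
reviews = [
    ["the", "laptop", "was", "great"],
    ["the", "laptop", "was", "zzyzx"],
]

# Count imdb token frequencies so rare tokens can be dropped without
# first assigning imdb-specific codes.
freq = Counter(tok for review in reviews for tok in review)
min_count = 2  # assumed cutoff, purely for the example

# Map tokens straight to wiki codes; anything rare or unknown becomes -1.
word2code = defaultdict(
    lambda: -1,
    {w: i for i, w in enumerate(wiki_itos) if freq[w] >= min_count},
)

encoded = [[word2code[tok] for tok in review] for review in reviews]
print(encoded)
```

In this toy run, "great" is dropped for being too rare in imdb and "zzyzx" for being absent from the wiki vocabulary, so both are flagged with -1.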

(Arvind Nagaraj) #392

No you’re right…I’m coming from a static graph mindset…

(Arvind Nagaraj) #393

We want 981 in the end not 2345. IMDB is the target, not wiki. We’re talking about “transfer” learning of the LM.

See where we do:
r = stoi2[w]

But you don’t want to throw away the 2345 because you’ll need it when you train the classifier.

(adrian) #394

I found that using a 1/50th subset of the data (i.e. trn_texts[:1500], val_texts[:500]) was a reasonable compromise between getting fair results and being fast enough to tweak to understand what’s going on.

when running the following on a GTX1080Ti, 1, wds=wd, use_clr=(20,10), cycle_len=15)

dataset   time    epoch  trn_loss  val_loss  accuracy
full      5 hrs   14     4.045284  4.072163  0.299391
1/50th    213 s   14     4.899093  4.919841  0.187322
1/100th   96 s    14     5.937064  5.737786  0.097723

NB: these results are for a baseline test without the wiki103 model weights (accuracy with the pretrained weights is significantly better).

(Jeremy Howard) #395

After sorting the chunks by size, I then:

sort_idx = np.concatenate(np.random.permutation(ck_idx))

So the chunks are then randomized again. But within each chunk everything is a similar size. Try looking at just 128 or so rows in your graph and you should see it.
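
The effect can be sketched in a simplified form as below. This is not the library's SortishSampler: the function name, the toy lengths, and the choice to shuffle chunk positions with a NumPy Generator are assumptions made for the example.

```python
import numpy as np

def sortish_order(lengths, bs, rng):
    """Return indices grouped into similar-length chunks, with chunk order shuffled."""
    lengths = np.asarray(lengths)
    # Sort all indices by length, longest first.
    sorted_idx = np.argsort(-lengths, kind="stable")
    # Cut the sorted indices into chunks of one batch each.
    ck_idx = [sorted_idx[i:i + bs] for i in range(0, len(sorted_idx), bs)]
    # Shuffle the order of the chunks, leaving each chunk's contents alone.
    chunk_order = rng.permutation(len(ck_idx))
    return np.concatenate([ck_idx[i] for i in chunk_order])

# Hypothetical sequence lengths for 12 documents.
lengths = [5, 120, 7, 118, 6, 119, 50, 52, 51, 8, 121, 49]
bs = 4
rng = np.random.default_rng(0)
order = sortish_order(lengths, bs, rng)

for i in range(0, len(order), bs):
    print([lengths[j] for j in order[i:i + bs]])
```

Each printed row is one batch: the lengths inside a row are close to each other, while the rows themselves come out in random order.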

(Even Oldridge) #396

Ah, the output makes sense now. That means the solution I was thinking of won’t work in this case. The first batch should be ck_idx[0], but I’ll have to think about how to force that and what the implications are.

(Jeremy Howard) #397

@even maybe just change the SortishSampler so it always puts the largest batch first?
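
One way that suggestion might look, as a standalone sketch rather than a patch to SortishSampler. The function name and toy data are hypothetical, and batches are assumed to be non-empty lists of indices.

```python
def largest_batch_first(batches, key):
    """Swap the batch containing the longest item into position 0."""
    batches = list(batches)
    # Find the batch whose longest member is the longest overall.
    max_i = max(range(len(batches)), key=lambda i: max(key(j) for j in batches[i]))
    batches[0], batches[max_i] = batches[max_i], batches[0]
    return batches

# Hypothetical lengths per document index, and batches already shuffled.
lengths = {0: 5, 1: 120, 2: 7, 3: 118, 4: 50, 5: 52}
shuffled = [[0, 2], [4, 5], [1, 3]]
result = largest_batch_first(shuffled, key=lambda j: lengths[j])
print(result)
```

Only the first position changes; the remaining batches keep whatever random order they already had.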

(Arvind Nagaraj) #398

Is the idea that sorting batches in descending order of sequence lengths could address the GPU memory clog problem?

(nok) #399

similar performance… will try using 1/50 of the data! thx

(Arvind Nagaraj) #400

Just to be clear, the real measure of the LM performance is not the val_loss but e^val_loss…so it’s a massive difference in perplexity score if your val_loss is 4.1 vs 4.9
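
To put numbers on that, here is a quick calculation using the val_loss values reported earlier in the thread for the full dataset and the 1/50th subset. It assumes val_loss is a mean cross-entropy measured with the natural log.

```python
import math

def perplexity(val_loss):
    # Perplexity is e raised to the mean cross-entropy loss.
    return math.exp(val_loss)

full = perplexity(4.072163)     # full dataset, epoch 14
subset = perplexity(4.919841)   # 1/50th dataset, epoch 14
ratio = subset / full

print(f"full: {full:.1f}  1/50th: {subset:.1f}  ratio: {ratio:.2f}")
```

A gap of under one unit in val_loss turns into a perplexity more than twice as large.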

But then again, if you’re just trying to understand what’s going on, it’s better to start small. In this case you can always load your original saved weights back up and retrain if things get messed up.

Another point to note:
You don’t want to be throwing away imdb reviews during the classification phase. Jeremy’s slide does show that the LM backbone greatly helps with small volume datasets, but to truly measure the success, we need all of the 50k reviews.
If the LM was trained with a small subset, the classifier will have to bear the burden of the remaining training.

(Jeremy Howard) #401

Yeah, sampling is rarely a good idea. Instead, just do fewer epochs.