Lesson 12 (2019) discussion and wiki

With CNNs, they showed it in the lecture.
I have never tried it with RNNs, so I cannot tell. I would expect it to be trickier with RNNs because you multiply by the same matrices again and again (more or less, depending on the RNN cell).
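
To make the "same matrices again and again" point concrete, here is a minimal sketch in plain Python. It is an illustration only: it ignores inputs and nonlinearities and uses a single scalar weight as a stand-in for the recurrent matrix, and the function name is made up for this example.

```python
def repeated_scale(w, steps, h0=1.0):
    """Apply h <- w * h `steps` times; a scalar stand-in for a recurrent weight matrix."""
    h = h0
    for _ in range(steps):
        h = w * h
    return h

# A weight slightly above 1 blows up and one slightly below 1 dies out
# when the same multiplication is repeated over 70 time steps.
grow = repeated_scale(1.1, 70)
shrink = repeated_scale(0.9, 70)
print(grow, shrink)
```

With a real matrix the same effect depends on its largest singular values rather than on one number, which is why the scale of the recurrent weights matters so much.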

1 Like

For those interested, please take a look at the Google Machine Learning Glossary.

6 Likes

I always run the first few epochs with a higher learning rate and use CLR (cyclical learning rates). Then I drastically reduce the learning rate and start playing with the batch size. I have seen that a smaller image size with a larger batch size responds well to a higher learning rate. Then I increase the image resolution, reduce the batch size along with the learning rate, and use the CLR approach again.
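
For readers unfamiliar with CLR, here is a minimal sketch of one common variant, the triangular policy. The post does not say which variant or which values were used, so the function name, step size, and learning rates below are placeholders.

```python
def triangular_lr(step, step_size, base_lr, max_lr):
    """Triangular cyclical learning rate.

    Ramps linearly base_lr -> max_lr over `step_size` steps, then back down,
    repeating every 2 * step_size steps.
    """
    cycle_pos = step % (2 * step_size)
    # Distance from the peak of the cycle, scaled to [0, 1].
    x = abs(cycle_pos / step_size - 1.0)
    return base_lr + (max_lr - base_lr) * (1.0 - x)

# One full cycle plus the first step of the next one (placeholder values).
lrs = [triangular_lr(s, step_size=4, base_lr=1e-4, max_lr=1e-2) for s in range(9)]
print(lrs)
```

Lowering `base_lr` and `max_lr` for the later, higher-resolution phase would correspond to the "reduce the learning rate and use CLR again" step described above.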

2 Likes

Has anyone used Google's "AutoAugment"? I am trying to understand it, and it looks quite promising. Image augmentation is quite a pain and very manual... Getting the right set of augmentations is largely trial and error. So I wanted to explore AutoAugment and see whether we can do something with it in fast.ai.

In general, isn't regularization about keeping the gradients within a certain range? If I'm using gradient clipping, do I still need regularization?
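
One way to see the difference between the two mechanisms is that clipping bounds the size of a single gradient, while a regularizer such as weight decay penalizes the weights themselves. The sketch below is a plain-Python illustration with made-up helper names, not any library's implementation.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_step(weights, grads, lr, weight_decay=0.0):
    """Plain SGD step with optional L2 weight decay, which pulls weights toward zero."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# Clipping changes only the gradient: norm 5 is rescaled to norm 1.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)

# Weight decay shrinks the weights even when the gradient is zero.
decayed = sgd_step([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5)
print(clipped, decayed)
```

Because clipping never touches the weights directly, it does not by itself discourage large weights, so the two are usually treated as complementary.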

Should RNN parameters then be initialized with some specific function, or can I just use a random distribution? Is there a difference?

1 Like

We're just using a regular linear layer.

1 Like

Nice one! Made it into a pdf:

google - Machine Learning Glossary.pdf (579.6 KB)

4 Likes

I think using very low learning rate on the first batch might complement initialization. Something in the neighborhood of 1e-7 or even less for the first epoch might initialize the model to the data at hand. I have used this approach a few times with good results, but I did not do any rigorous study on that.

1 Like

My intuition is that it might or might not work depending on how you mix it up. If you mix up at the level of words or sentences, it might reduce modeling quality, because NLP often relies on preserving the order of words and sentences. Hence inserting alien text inside a sentence or between sentences can be detrimental.

As far as I understand, though, we do not yet take sequence information at the paragraph level into account. Mixing in alien paragraphs from other documents in the corpus might give the model information about the thematic relatedness of paragraphs from different documents, and that can be useful. For example, mixup on a corpus of deep learning papers might give the model a means of detecting borrowed ideas. It might let us learn citations that are not directly mentioned in the papers.

1 Like

Sequence Length, Batch Size and BPTT

9 Likes

To use transfer learning with Xresnet, can Jeremy upload the Xresnet ImageNet pretrained model?

Thanks for making this chart. It's timely because I am now on my third try to understand RNNs!

Now I am stuck with some questions about notebook 12_text, in section Batching, where the IMDB review language model is fine-tuned. Would someone kindly confirm or clarify so I can get out of the fog?

  • Batching, batch size, etc. of course refer to mini-batches, so let's just say "batch" in this post.

  • At the end of a batch, gradients are updated. There's a hidden state. The hidden state gets carried over, per row, to the next batch, the text continues on the same row of the next batch, and processing continues. Right?

  • At the first (mini)batch, the hidden state is initialized. And the start of each training row after the first may not line up with the start of a review. So the hidden state will lack the context of a complete first review in that row. Right?

  • But we don't seem to care about this imperfection in training. Is that because the speed gained outweighs any training inaccuracy? If we did care, we could make sure each row starts with a review and contains only complete reviews after it.

  • bptt is said to be the "depth" of your RNN, the number of iterations of your "for" loop in the forward pass. But why should the depth of the RNN be related to the length of the input sequences that make up a batch? bptt seems more closely related to the frequency of gradient updates.

  • Is there any insight on how many tokens it takes to fully "educate" a hidden state? Roughly, I mean the point beyond which different tokens far enough back no longer affect its current state.

  • I am still unclear on exactly when and how the loss is calculated. Is it word by word or per bptt-sized array? Happy to save this question for later, unless it bears on any of the above.

Thanks for your help and clarifications!
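
The row layout these questions describe can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not the notebook's code: the function names are made up, the corpus is a list of integers, and targets are assumed to be the inputs shifted by one token, as is usual for language models.

```python
def batchify(tokens, bs):
    """Split one long token stream into `bs` contiguous rows, dropping the remainder."""
    row_len = len(tokens) // bs
    return [tokens[i * row_len:(i + 1) * row_len] for i in range(bs)]

def bptt_batches(rows, bptt):
    """Yield (inputs, targets) chunks of width `bptt`; targets are inputs shifted by one."""
    row_len = len(rows[0])
    for start in range(0, row_len - 1, bptt):
        end = min(start + bptt, row_len - 1)
        x = [r[start:end] for r in rows]
        y = [r[start + 1:end + 1] for r in rows]
        yield x, y

stream = list(range(20))        # stand-in for a concatenated, numericalized corpus
rows = batchify(stream, bs=2)   # row 0 holds tokens 0..9, row 1 holds tokens 10..19
batches = list(bptt_batches(rows, bptt=3))
for x, y in batches:
    print(x, y)
```

Each row of one batch picks up exactly where the same row of the previous batch stopped, which is what makes carrying the hidden state over per row meaningful. Row 1 starts at token 10, in the middle of the stream, which is the "start of a row may not line up with the start of a review" situation.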

2 Likes

Hi Brad, I also got 0 speedup on AWD_LSTM with mixed precision. I'm still trying to figure out how to parse this statement: 'you also need to have everything be a multiple of 8'

2 Likes

I found a reference on "everything to be a multiple of 8" on the NVIDIA web site:
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

In practice, for mixed precision training, our recommendations are:

  1. Choose mini-batch to be a multiple of 8
  2. Choose linear layer dimensions to be a multiple of 8
  3. Choose convolution layer channel counts to be a multiple of 8
  4. For classification problems, pad vocabulary to be a multiple of 8
  5. For sequence problems, pad the sequence length to be a multiple of 8
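
A small helper makes it easy to check or adjust sizes against these recommendations. This is a sketch with made-up helper names and example sizes, not part of NVIDIA's guide.

```python
def round_up(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

def round_down(n, multiple=8):
    """Largest multiple of `multiple` that is <= n."""
    return (n // multiple) * multiple

# Example sizes (illustrative): a hidden size and a sequence length.
sizes = {"n_hid": 1150, "seq_len": 70}
for name, n in sizes.items():
    print(name, n, "is multiple of 8:", n % 8 == 0,
          "| round down:", round_down(n), "| round up:", round_up(n))
```

Whether to round up or down is a judgment call: rounding up preserves capacity and padding, while rounding down saves a little memory.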
9 Likes

Brad, that "multiples of 8" thing made my training about 40% faster! What I did to the code was ugly; any cleaner solutions would be appreciated!

First, you probably already know this, but it took me a while to find it:

import apex.fp16_utils as fp16

then the 'to_fp16' thing:

learn = to_fp16(language_model_learner(data_lm, AWD_LSTM))

Then I did something I really didn't want to do. In order to get my first 30% training speed improvement, I had to throw away the pre-trained fastai model for AWD_LSTM, because the n_hid was 1150 and that isn't a multiple of 8. I imagine there is nothing that can be done about that other than re-training the pre-trained model with n_hid of, say, 1144. I'm going to end up doing that eventually because my corpus (clinical notes) is quirky and much bigger than Wiki and it will take about a week to train, even with these improvements.

learn = to_fp16(language_model_learner(data_lm, AWD_LSTM, pretrained=False))

Then I had to make the bptt parameter 72, rather than default of 70. That was easy enough because I could pass the new parameter in the data block api:

     .databunch(bs=1000, num_workers=4, bptt=72)

Then it continued to get ugly. I hope there is a cleaner way of doing this, but this is what I did so I could change n_hid from 1150 to 1144:

  • I replaced 'awd_lstm_lm_config' with 'my_awd_lstm_lm_config' to set n_hid to 1144 (divisible by 8)
  • I replaced '_model_meta' with '_my_model_meta' so it would use 'my_awd_lstm_lm_config'
  • I replaced two function definitions so they would use _my_model_meta

my_awd_lstm_lm_config = dict(emb_sz=400, n_hid=1144, n_layers=3, pad_token=1, qrnn=False, bidir=False, output_p=0.1,
                          hidden_p=0.15, input_p=0.25, embed_p=0.02, weight_p=0.2, tie_weights=True, out_bias=True)

_my_model_meta = {AWD_LSTM: {'hid_name':'emb_sz', 'url':URLs.WT103_1,
                          'config_lm':my_awd_lstm_lm_config, 'split_lm': awd_lstm_lm_split,
                          'config_clas':awd_lstm_clas_config, 'split_clas': awd_lstm_clas_split},
               Transformer: {'hid_name':'d_model', 'url':URLs.OPENAI_TRANSFORMER,
                             'config_lm':tfmer_lm_config, 'split_lm': tfmer_lm_split,
                             'config_clas':tfmer_clas_config, 'split_clas': tfmer_clas_split},
               TransformerXL: {'hid_name':'d_model', 
                              'config_lm':tfmerXL_lm_config, 'split_lm': tfmerXL_lm_split,
                              'config_clas':tfmerXL_clas_config, 'split_clas': tfmerXL_clas_split}}

def get_language_model(arch:Callable, vocab_sz:int, config:dict=None, drop_mult:float=1.):
    "Create a language model from `arch` and its `config`, maybe `pretrained`."
    meta = _my_model_meta[arch]
    config = ifnone(config, meta['config_lm'].copy())
    for k in config.keys(): 
        if k.endswith('_p'): config[k] *= drop_mult
    tie_weights,output_p,out_bias = map(config.pop, ['tie_weights', 'output_p', 'out_bias'])
    init = config.pop('init') if 'init' in config else None
    encoder = arch(vocab_sz, **config)
    enc = encoder.encoder if tie_weights else None
    decoder = LinearDecoder(vocab_sz, config[meta['hid_name']], output_p, tie_encoder=enc, bias=out_bias)
    model = SequentialRNN(encoder, decoder)
    return model if init is None else model.apply(init)

def language_model_learner(data:DataBunch, arch, config:dict=None, drop_mult:float=1., pretrained:bool=True,
                           pretrained_fnames:OptStrTuple=None, **learn_kwargs) -> 'LanguageLearner':
    "Create a `Learner` with a language model from `data` and `arch`."
    model = get_language_model(arch, len(data.vocab.itos), config=config, drop_mult=drop_mult)
    meta = _my_model_meta[arch]
    learn = LanguageLearner(data, model, split_func=meta['split_lm'], **learn_kwargs)
    if pretrained:
        if 'url' not in meta: 
            warn("There are no pretrained weights for that architecture yet!")
            return learn
        model_path = untar_data(meta['url'], data=False)
        fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
        learn.load_pretrained(*fnames)
        learn.freeze()
    if pretrained_fnames is not None:
        fnames = [learn.path/learn.model_dir/f'{fn}.{ext}' for fn,ext in zip(pretrained_fnames, ['pth', 'pkl'])]
        learn.load_pretrained(*fnames)
        learn.freeze()
    return learn

This gave me a 30% speed improvement!

Then I got an additional 10% improvement by hacking my vocab to pad it to a multiple of 8. I did this after creating my DataBunch but before creating my learner:

i = 0
while len(data_lm.vocab.itos)%8 != 0:
    data_lm.vocab.itos.insert(0,'xx'+str(i))
    i +=1
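
Note that the loop above inserts at index 0, which shifts the position of every existing entry in the list. Appending the placeholder tokens instead would leave the existing token ids unchanged. Here is a minimal sketch of that variant on a plain Python list (not a fastai Vocab object; the helper name and placeholder prefix are hypothetical):

```python
def pad_vocab(itos, multiple=8, prefix="xxfake"):
    """Return a copy of `itos` with placeholder tokens appended until
    its length is a multiple of `multiple`. Existing indices are preserved."""
    padded = list(itos)
    i = 0
    while len(padded) % multiple != 0:
        padded.append(f"{prefix}{i}")
        i += 1
    return padded

itos = ["xxunk", "xxpad", "xxbos", "the", "patient"]  # toy vocabulary
padded = pad_vocab(itos)
print(len(padded), padded)
```

Any reverse mapping from token to index would need to be rebuilt after padding; how that is done depends on the library version, so it is left out here.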

For the language model, nothing else in the mixed precision made any detectable difference (on my RTX 2080TI), but this was worth the pain for building a large pre-trained language model.

I was able to verify that the sizes were multiples of 8 with this:

print(len(data_lm.vocab.itos))
print(learn.summary())

This gave:

4856
======================================================================
Layer (type)         Output Shape         Param #    Trainable 
======================================================================
RNNDropout           [72, 400]            0          False     
______________________________________________________________________
RNNDropout           [72, 1144]           0          False     
______________________________________________________________________
RNNDropout           [72, 1144]           0          False     
______________________________________________________________________
Linear               [72, 4856]           1,947,256  True      
______________________________________________________________________
RNNDropout           [72, 400]            0          False     
______________________________________________________________________
8 Likes

Nice! I'll be interested to hear how it affects your results.

So far, I couldn't be happier with the results I'm seeing on my language model training:

image

This is running about 5x faster on a single RTX 2080 TI than it was running on a 1080 (no TI). When I first plugged in the RTX 2080 TI, I was only getting a 30% improvement in speed with no changes to the training setup. Then when I added FP16 and raised the batch size as much as I could, I got the 5x speedup. Since I have a 7GB corpus, this reduced my (10 epoch) training time from 16 days to 3.5 days!

This 72% accuracy is blowing my mind. Before, when I trained on a 4GB subset for 10 epochs, I only got up to 34% accuracy. The biggest difference is that I trained the 4GB corpus starting from the pretrained wiki103 model, and this 7GB corpus I trained without any pretraining. This makes sense to me because the 7GB is clinical notes, and using pretraining would be like the tail wagging the dog, since wiki is smaller. Plus clinical notes, while written in English, tend to follow different narrative patterns than wiki. My goal is for my model to be an expert at understanding clinical notes, and I don't much care whether it loses some encyclopedic knowledge from Wiki.

This corpus has about 4 million History/Physicals. My fantasy is that this model will know a ton about the signs, symptoms, diagnosis and treatment, at a UCSF expert level, of more diseases and maladies than I ever learned in medical school, let alone remembered. My wilder fantasy is that it is learning a few patterns of diseases/syndromes that have never been identified in the literature. Working out how to extract that learning will be an adventure as I see it.

What I'm realizing is that those of us who are interested in clinical NLP can share this adventure by signing up for the MIMIC III dataset.

https://mimic.physionet.org/gettingstarted/access/

I believe that they don't really turn people down; at least I've never met anyone who has been turned down. I'm talking to an MD professor here to formulate "challenges" using the MIMIC III data, and this standard dataset would give us a metric of progress. I am sure Jeremy would agree, as well as Ian Goodfellow (in an interview), that rapid progress in a given domain happens when people have a standard dataset in that domain. MIMIC III may not be perfect, but I think it can take us very far beyond the current state of clinical NLP.

11 Likes

Actually, you need some kind of affiliation with a university or research institute, which is hard for private individuals, especially outside the US.

I worked with clinical data for a while. MIMIC-III is getting old, but it is a good start. Moreover, clinical NLP is more demanding than traditional NLP: patient phenotyping needs a robust named entity recognizer, and predictive diagnosis requires more than simple statistical models because of interpretability issues. I hope this helps you design the challenges!

Does that mean you know someone who was turned down for those reasons? I just don't know. Different institutions will have different policies. The fact that MIMIC has been around for a few years may make them less nervous about sharing more widely. The incentive for them to share is that when they go to get their grant renewed, they will want to show a large number of people using their data. From the perspective of the HIPAA regulations, you could put the data on an open web page because it is de-identified (though nobody is doing that).

1 Like