Lesson 12 (2019) discussion and wiki

Should RNN parameters then be initialized with some particular function, or can I just use some random distribution? Is there a difference?

1 Like

We’re just using a regular linear layer.
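
For concreteness, a toy RNN cell built from ordinary linear layers might look like this (my own illustrative sketch, not the notebook's code; unless you override it, you just get PyTorch's default nn.Linear initialization):

import torch
from torch import nn

class MiniRNNCell(nn.Module):
    "Toy RNN cell built from ordinary linear layers (illustrative sketch only)."
    def __init__(self, n_inp, n_hid):
        super().__init__()
        self.i2h = nn.Linear(n_inp, n_hid)  # gets PyTorch's default init
        self.h2h = nn.Linear(n_hid, n_hid)
        # If you wanted something other than the default, you could, e.g.:
        # nn.init.orthogonal_(self.h2h.weight)

    def forward(self, x, h):
        return torch.tanh(self.i2h(x) + self.h2h(h))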

1 Like

Nice one! Made it into a pdf:

google - Machine Learning Glossary.pdf (579.6 KB)

4 Likes

I think using a very low learning rate on the first batches might complement initialization. Something in the neighborhood of 1e-7, or even less, for the first epoch might help adapt the model to the data at hand. I have used this approach a few times with good results, but I have not done any rigorous study of it.
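
In fastai v1 terms that would look something like this (assuming learn is an existing Learner; the learning rates are just the ballpark numbers above, not tuned values):

# Gentle first epoch to let the freshly initialized parameters settle,
# then continue at a normal rate. Rates here are illustrative only.
learn.fit_one_cycle(1, 1e-7)
learn.fit_one_cycle(5, 1e-3)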

1 Like

My intuition is that it might or might not work, depending on how you mix it up. If you mix at the level of words or sentences, it might reduce the quality of the modeling, because NLP often tries to preserve the sequence of words or sentences. Hence inserting alien text inside a sentence or between sentences can be detrimental.

As far as I understand, though, we do not yet take sequence information into account at the paragraph level. Mixing in alien paragraphs from other documents in the corpus might provide the model with information about the thematic relatedness of paragraphs from different documents, and that could be useful. For example, mixup over a corpus of deep learning papers might give the model a means of detecting idea borrowing. It might give us the ability to learn citations that are not directly mentioned in the papers.
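
If you wanted to prototype that, one option would be to mix pooled paragraph representations rather than raw token sequences. This is only a rough sketch of the idea; the function name and the Beta parameter are my own choices:

import torch

def mixup_paragraphs(emb_a, emb_b, alpha=0.4):
    """Interpolate two pooled paragraph embeddings (illustrative sketch only).

    emb_a, emb_b: tensors of shape (batch, emb_dim), e.g. mean-pooled encoder
    outputs for paragraphs drawn from different documents in the corpus.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1 - lam) * emb_b
    return mixed, lam  # lam would also be used to mix the targets, as in standard mixup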

1 Like

Sequence Length, Batch Size and BPTT

9 Likes

In order to use transfer learning with Xresnet, could Jeremy upload the Xresnet ImageNet-pretrained model?

Thanks for making this chart. It’s timely because I am now on my third try to understand RNNs!

Now I am stuck with some questions about notebook 12_text, in the Batching section, where the IMDB review language model is fine-tuned. Would someone kindly confirm or clarify so I can get out of the fog?

  • Batching, batch size, etc. of course refer to mini-batches, so let’s just say “batch” in this post.

  • At the end of a batch, gradients are updated. There’s a hidden state. The hidden state gets carried over, per row, to the next batch, the text continues on the same row of the next batch, and processing continues. Right?

  • At the first (mini)batch, the hidden state is initialized. And the start of each training row after the first may not line up with the start of a review. So the hidden state will lack the context of a complete first review in that row. Right?

  • But we don’t seem to care about this imperfection in training. Is that because the speed gained outweighs any training inaccuracy? If we did care, we could make sure each row starts at the beginning of a review and contains only complete reviews after that.

  • bptt is said to be the “depth” of your RNN, i.e. the number of iterations of the “for” loop in the forward pass. But why should the depth of the RNN be related to the length of the input sequences that make up a batch? bptt seems better related to the frequency of gradient updates. (See the sketch after this list for my current mental model.)

  • Is there any insight into how many tokens it takes to fully “educate” a hidden state? Vaguely, I mean the point at which tokens far enough back no longer affect its current state.

  • I am still unclear on exactly when and how the loss is calculated. Is it word by word, or per bptt-sized array? Happy to save this question for later, unless it bears on any of the above.
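
To make my mental model concrete, here is a rough sketch of how I currently picture the batching (plain PyTorch written by me, not the notebook’s code): the corpus is laid out as bs parallel rows, each mini-batch is a bptt-long slice across all rows, and the hidden state from one slice is kept (detached) as the starting state for the next slice of the same rows:

import torch

def lm_batches(token_ids, bs=64, bptt=70):
    "Yield (x, y) slices the way I understand the LM DataLoader to work (sketch only)."
    n = len(token_ids) // bs
    # Lay the corpus out as `bs` parallel rows; each row is one long stream of text.
    data = torch.tensor(token_ids[:n * bs]).view(bs, n)
    for i in range(0, n - 1, bptt):
        seq_len = min(bptt, n - 1 - i)
        x = data[:, i:i + seq_len]          # inputs:  bs x seq_len
        y = data[:, i + 1:i + 1 + seq_len]  # targets: the same tokens shifted by one
        yield x, y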

Thanks for your help and clarifications!

2 Likes

Hi Brad, I also got 0 speedup on AWD_LSTM with mixed precision. I’m still trying to figure out how to parse this statement: ‘you also need to have everything be a multiple of 8’

2 Likes

I found a reference to “everything to be a multiple of 8” on the NVIDIA website:
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

In practice, for mixed precision training, our recommendations are:

  1. Choose mini-batch to be a multiple of 8
  2. Choose linear layer dimensions to be a multiple of 8
  3. Choose convolution layer channel counts to be a multiple of 8
  4. For classification problems, pad vocabulary to be a multiple of 8
  5. For sequence problems, pad the sequence length to be a multiple of 8
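
If it helps to sanity-check your own dimensions, a tiny helper like this (my own sketch, not something from the NVIDIA docs) is enough:

def round_up_to_multiple(n, k=8):
    "Smallest multiple of k that is >= n, e.g. for padding a vocab size or sequence length."
    return ((n + k - 1) // k) * k

# For example, a vocab of 4850 would pad to 4856, and bptt=70 would become 72.
assert round_up_to_multiple(4850) == 4856
assert round_up_to_multiple(70) == 72
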
9 Likes

Brad, that “multiples of 8” thing made my training about 40% faster! What I did to the code was ugly, though; any cleaner solutions would be appreciated!

First, you probably already know this, but it took me a while to find it:

import apex.fp16_utils as fp16

then the ‘to_fp16’ thing:

learn = to_fp16(language_model_learner(data_lm, AWD_LSTM))

Then I did something I really didn’t want to do. In order to get my first 30% training speed improvement, I had to throw away the pretrained fastai model for AWD_LSTM, because its n_hid was 1150 and that isn’t a multiple of 8. I imagine there is nothing that can be done about that other than re-training the pretrained model with an n_hid of, say, 1144. I’m going to end up doing that eventually, because my corpus (clinical notes) is quirky and much bigger than Wiki, and it will take about a week to train, even with these improvements.

learn = to_fp16(language_model_learner(data_lm, AWD_LSTM, pretrained=False))

Then I had to make the bptt parameter 72 rather than the default of 70. That was easy enough, because I could pass the new parameter through the data block API:

     .databunch(bs=1000, num_workers=4, bptt=72)

Then it continued to get ugly. I hope there is a cleaner way of doing this, but this is what I did so I could change n_hid from 1150 to 1144:

  • I replaced ‘awd_lstm_lm_config’ with ‘my_awd_lstm_lm_config’ to set n_hid to 1144 (divisible by 8)
  • I replaced ‘_model_meta’ with ‘_my_model_meta’ so it would use ‘my_awd_lstm_lm_config’
  • I replaced two function definitions (get_language_model and language_model_learner) so they would use _my_model_meta
my_awd_lstm_lm_config = dict(emb_sz=400, n_hid=1144, n_layers=3, pad_token=1, qrnn=False, bidir=False, output_p=0.1,
                          hidden_p=0.15, input_p=0.25, embed_p=0.02, weight_p=0.2, tie_weights=True, out_bias=True)

_my_model_meta = {AWD_LSTM: {'hid_name':'emb_sz', 'url':URLs.WT103_1,
                          'config_lm':my_awd_lstm_lm_config, 'split_lm': awd_lstm_lm_split,
                          'config_clas':awd_lstm_clas_config, 'split_clas': awd_lstm_clas_split},
               Transformer: {'hid_name':'d_model', 'url':URLs.OPENAI_TRANSFORMER,
                             'config_lm':tfmer_lm_config, 'split_lm': tfmer_lm_split,
                             'config_clas':tfmer_clas_config, 'split_clas': tfmer_clas_split},
               TransformerXL: {'hid_name':'d_model', 
                              'config_lm':tfmerXL_lm_config, 'split_lm': tfmerXL_lm_split,
                              'config_clas':tfmerXL_clas_config, 'split_clas': tfmerXL_clas_split}}

def get_language_model(arch:Callable, vocab_sz:int, config:dict=None, drop_mult:float=1.):
    "Create a language model from `arch` and its `config`, maybe `pretrained`."
    meta = _my_model_meta[arch]
    config = ifnone(config, meta['config_lm'].copy())
    for k in config.keys(): 
        if k.endswith('_p'): config[k] *= drop_mult
    tie_weights,output_p,out_bias = map(config.pop, ['tie_weights', 'output_p', 'out_bias'])
    init = config.pop('init') if 'init' in config else None
    encoder = arch(vocab_sz, **config)
    enc = encoder.encoder if tie_weights else None
    decoder = LinearDecoder(vocab_sz, config[meta['hid_name']], output_p, tie_encoder=enc, bias=out_bias)
    model = SequentialRNN(encoder, decoder)
    return model if init is None else model.apply(init)

def language_model_learner(data:DataBunch, arch, config:dict=None, drop_mult:float=1., pretrained:bool=True,
                           pretrained_fnames:OptStrTuple=None, **learn_kwargs) -> 'LanguageLearner':
    "Create a `Learner` with a language model from `data` and `arch`."
    model = get_language_model(arch, len(data.vocab.itos), config=config, drop_mult=drop_mult)
    meta = _my_model_meta[arch]
    learn = LanguageLearner(data, model, split_func=meta['split_lm'], **learn_kwargs)
    if pretrained:
        if 'url' not in meta: 
            warn("There are no pretrained weights for that architecture yet!")
            return learn
        model_path = untar_data(meta['url'], data=False)
        fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
        learn.load_pretrained(*fnames)
        learn.freeze()
    if pretrained_fnames is not None:
        fnames = [learn.path/learn.model_dir/f'{fn}.{ext}' for fn,ext in zip(pretrained_fnames, ['pth', 'pkl'])]
        learn.load_pretrained(*fnames)
        learn.freeze()
    return learn

This gave me a 30% speed improvement!

Then I got an additional 10% improvement by hacking my vocab to pad it to a multiple of 8. I did this after creating my DataBunch but before creating my learner:

i = 0
while len(data_lm.vocab.itos)%8 != 0:
    data_lm.vocab.itos.insert(0,'xx'+str(i))
    i +=1

For the language model, nothing else in the mixed-precision recommendations made any detectable difference (on my RTX 2080 Ti), but this was worth the pain for building a large pretrained language model.

I was able to verify that the dimensions were multiples of 8 with this:

print(len(data_lm.vocab.itos))
print(learn.summary())

This gave:

4856
======================================================================
Layer (type)         Output Shape         Param #    Trainable 
======================================================================
RNNDropout           [72, 400]            0          False     
______________________________________________________________________
RNNDropout           [72, 1144]           0          False     
______________________________________________________________________
RNNDropout           [72, 1144]           0          False     
______________________________________________________________________
Linear               [72, 4856]           1,947,256  True      
______________________________________________________________________
RNNDropout           [72, 400]            0          False     
______________________________________________________________________
8 Likes

Nice! I’ll be interested to hear how it affects your results.

So far, I couldn’t be happier with the results I’m seeing on my language model training:

[image: language model training results]

This is running about 5x faster on a single RTX 2080 Ti than it was on a 1080 (non-Ti). When I first plugged in the RTX 2080 Ti, I was only getting a 30% improvement in speed with no changes to the training. Then, when I added FP16 and raised the batch size as much as I could, I got the 5x speedup. Since I have a 7 GB corpus, this reduced my (10-epoch) training time from 16 days to 3.5 days!

This 72% accuracy is blowing my mind. Before, when I trained a 4 GB subset for 10 epochs, I only got up to 34% accuracy. The biggest difference is that I trained the 4 GB corpus starting from the pretrained wiki103 model, while this 7 GB corpus I trained without any pretraining. This makes sense to me, because the 7 GB corpus is clinical notes, and using pretraining would be like the tail wagging the dog, since Wiki is smaller. Plus clinical notes, while in English, tend to follow different narrative patterns than Wiki. My goal is for my model to be an expert at understanding clinical notes, and I don’t much care whether it loses some encyclopedic knowledge from Wiki.

This corpus has about 4 million History/Physicals. My fantasy is that this model will know a ton about the signs, symptoms, diagnosis and treatment, at a UCSF expert level, of more diseases and maladies than I ever learned in medical school, let alone remembered. My wilder fantasy is that it is learning a few patterns of diseases/syndromes that have never been identified in the literature. Working out how to extract that learning will be an adventure, as I see it.

What I’m realizing is that those of us who are interested in clinical NLP can share this adventure by signing up for the MIMIC III dataset.

https://mimic.physionet.org/gettingstarted/access/

I believe that they don’t really turn people down; at least I’ve never met anyone who has been turned down. I’m talking to an MD professor here about formulating “challenges” using the MIMIC III data, and this standard dataset would give us a metric of progress. I am sure Jeremy would agree, as would Ian Goodfellow (in an interview), that rapid progress in a given domain happens when people have a standard dataset in that domain. MIMIC III may not be perfect, but I think it can take us very far beyond the current state of clinical NLP.

11 Likes

Actually, you need some kind of affiliation with a university or research institute, which is hard for private individuals, especially outside the US.

I worked with clinical data for a while: MIMIC-III is getting old, but it is a good start. Moreover, clinical NLP is more demanding than traditional NLP. Patient phenotyping needs a robust named entity recognizer, and predictive diagnosis requires more than simple statistical models because of interpretability issues. Hope this helps you design the challenges!

Does that mean you know someone who was turned down for those reasons? I just don’t know. Different institutions will have different policies. The fact that MIMIC has been around for a few years may make them less nervous about sharing more widely. The incentive for them to share is that when they go to get their grant renewed, they will want to show a large number of people using their data. From the perspective of the HIPAA regulations, you could put the data on an open web page because it is de-identified (though nobody is doing that).

1 Like

I just sent the contact (for MIMIC III) an email asking what restrictions on access they might have in their policies. I’ll let you all know as soon as I get a response.

2 Likes

Yep, it was me. A few years ago I tried to request access to MIMIC III, but I couldn’t provide the name of an academic reference (I had no official affiliation at that time). BTW, the people from the PhysioNet team were very kind.

Here is a post that could be interesting for someone:

Getting access to MIMIC III hospital database for data science projects

1 Like

I got a response back from MIT (the MIMIC III project):

Ken Pierce <kpierce@mit.edu>
Mon 5/6/2019 11:19 AM
Dear Dr. Ludwig,

Thanks for your message.  Neither of these conditions

>    *   They live outside the United States
>    *   They are working for a commercial company and not a university or research institute.

would make any difference in deciding whether to grant a student access to the data.  We are happy to consider providing access to anyone who submits a request according to our instructions.

Cordially,
Ken

So I think there is a great chance that most of us on the fast.ai forum could get access for use in clinical NLP experiments. Keep in mind that this is a hot topic in medicine now, and this is why they built the MIMIC III dataset; in a way, they want you as much as you want them!

I also found that the MIT group will post our challenge on their web site, so we can recruit even more participants! They have detailed instructions on how to set up a challenge. I think they would welcome this!

If you want to apply now, it took me about 2 weeks to get certified, including the free online training on research subject protection rules.

I will post more as soon as I can formulate a challenge goal, but please create a challenge yourself if one of you comes up with one first, especially if you can get some direction from a subject matter expert you know. - Dana

3 Likes

Hi @danaludwig , I still believe that anyone applying needs to indicate a reference from academia or a research institute. BTW, how did you get access to the data? Did you provide such a reference? I ask just for the sake of clarity. Thanks in advance for your reply.

It’s an old question, but for those interested: we can indeed use a stride-2 convolution instead of a stride-1 convolution followed by a 2x2 max pool: https://stackoverflow.com/questions/44666390/max-pool-layer-vs-convolution-with-stride-performance

It is also preferable because it is simpler and faster. Here’s a quote from the paper https://arxiv.org/pdf/1412.6806.pdf:

‘We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks’
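
In PyTorch terms the swap just means moving the downsampling stride into the convolution itself. A minimal sketch of the two options (my own example, with arbitrary channel counts):

import torch
from torch import nn

# Option 1: stride-1 conv followed by a 2x2 max pool.
conv_then_pool = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(2),
)

# Option 2: a single stride-2 conv that downsamples in the same step.
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 32, 32)
print(conv_then_pool(x).shape, strided_conv(x).shape)  # both: 1 x 128 x 16 x 16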

2 Likes