Lesson 12 (2019) discussion and wiki

I just sent the contact (for MIMIC III) an email asking them what restrictions on access they might have in their policies. I’ll let you all know as soon as I get a response.

2 Likes

Yep, it was me. A few years ago I tried to request access to MIMIC III, but I couldn’t provide a reference name from academia (no official contract at that time). BTW, the people from the PhysioNet team were very kind.

Here is a post that could be interesting for someone:

Getting access to MIMIC III hospital database for data science projects

1 Like

I got a response back from MIT (the MIMIC III project):

Ken Pierce <kpierce@mit.edu>
Mon 5/6/2019 11:19 AM
Dear Dr. Ludwig,

Thanks for your message.  Neither of these conditions

>    *   They live outside the United States
>    *   They are working for a commercial company and not a university or research institute.

would make any difference in deciding whether to grant a student access to the data.  We are happy to consider providing access to anyone who submits a request according to our instructions.

Cordially,
Ken

So I think there is a great chance that most of us in the fast.ai forum could get access for use in clinical NLP experiments. Keep in mind that this is a hot topic in medicine now, and this is why they built the MIMIC III dataset; in a way they want you as much as you want them!

I also found that the MIT group will post our challenge on their web site, so we can recruit even more participants! They have detailed instructions on how to set up a challenge. I think they would welcome this!

If you want to apply now, it took me about 2 weeks to get certified, including the free online training on research subject protection rules.

I will post more as soon as I can formulate a challenge goal, but please create a challenge yourself if one of you can come up with it first, especially if you can get some direction from a subject matter expert that you know. - Dana

3 Likes

Hi @danaludwig, I still believe that anyone needs to indicate a reference from academia or a research institute. BTW, how did you get access to the data? Did you provide such a reference? I ask just for the sake of clarity. Thanks in advance for your reply.

It’s an old question, but for those interested, we can indeed use a stride-2 convolution instead of a stride-1 convolution followed by a 2x2 max pool: https://stackoverflow.com/questions/44666390/max-pool-layer-vs-convolution-with-stride-performance

It is preferable since it is simpler and faster. Here’s a quote from the paper https://arxiv.org/pdf/1412.6806.pdf:

‘We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks’
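A minimal sketch in PyTorch (channel counts and input size are arbitrary, chosen just for illustration): both routes produce the same output shape, but the stride-2 convolution downsamples in a single learned step.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Conventional block: stride-1 conv followed by a 2x2 max pool ...
conv_pool = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2))

# ... versus a single stride-2 conv that downsamples while it convolves.
conv_stride2 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(conv_pool(x).shape)     # torch.Size([1, 128, 28, 28])
print(conv_stride2(x).shape)  # torch.Size([1, 128, 28, 28])
```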

2 Likes

I did have a UCSF email address, and I used my boss (client) as a manager reference; I was a contractor and self-employed at the time.

I don’t know if this question was asked or not, but how does one decide on the total number of epochs? Jeremy mentioned that with MixUp they have to train for double the number of epochs, if I recall correctly 160… how was this number decided?

Also, let’s say for simplicity I train for 20 epochs (one-cycle) and my loss is still going down; I add 10 more epochs with a lower lr and my loss stabilizes. Does this mean that in theory I could have trained my model with 30 epochs?

Obviously I am oversimplifying, but what are, in general, good practices when it comes to choosing the number of epochs?

How do you use label smoothing in multi-label classification problems (where each picture can have multiple labels)? :thinking: In that case we use binary cross-entropy instead of cross-entropy.
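One common approach (a sketch only, not a fastai built-in; fastai v1’s LabelSmoothingCrossEntropy covers the single-label case) is to smooth the 0/1 multi-label targets directly and feed them to BCE with logits:

```python
import torch
import torch.nn.functional as F

def bce_with_label_smoothing(logits, targets, eps=0.1):
    # Pull the hard 0/1 targets towards 0.5: 1 -> 1 - eps/2, 0 -> eps/2
    smoothed = targets * (1 - eps) + 0.5 * eps
    return F.binary_cross_entropy_with_logits(logits, smoothed)

logits  = torch.randn(4, 10)                        # batch of 4, 10 possible labels
targets = torch.randint(0, 2, (4, 10)).float()      # multi-hot labels
loss = bce_with_label_smoothing(logits, targets)
```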

Yeah, SentencePiece trains a model to tokenise text based on what it sees during training. If you retrain the model, the subwords are going to change; for example the word
“Wikipedia”, after training on Wikipedia, could be tokenized as “_Wikipedia”, while after you train a new SP model on, let’s say, the fastai forum it could get tokenized as “_Wiki p e d i a” due to the difference in word frequencies in this new corpus. Once you have such a tokenization I don’t know how to transfer the weights from the word Wikipedia to the new subwords _Wiki and the letters p e d i a.

But the good news is that I haven’t noticed a large difference in tokenization, so adding the original Wikipedia corpus might not be necessary. Moreover, you don’t have to start with Wikipedia. Start with your domain corpus; it will train faster and it might give you comparable results.
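A minimal sketch of the effect described above, using the sentencepiece Python API (the input file names are hypothetical, and the exact splits depend on the corpora):

```python
import sentencepiece as spm

# Train two SentencePiece models on different corpora.
spm.SentencePieceTrainer.train(input='wiki_dump.txt',  model_prefix='sp_wiki',  vocab_size=8000)
spm.SentencePieceTrainer.train(input='forum_dump.txt', model_prefix='sp_forum', vocab_size=8000)

sp_wiki  = spm.SentencePieceProcessor(model_file='sp_wiki.model')
sp_forum = spm.SentencePieceProcessor(model_file='sp_forum.model')

# The same word can split into very different subwords depending on the training corpus.
print(sp_wiki.encode('Wikipedia',  out_type=str))   # e.g. ['▁Wikipedia']
print(sp_forum.encode('Wikipedia', out_type=str))   # e.g. ['▁Wiki', 'p', 'e', 'd', 'i', 'a']
```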

Only for SentencePiece models, as there you can convert back and forth between different tokenization models. Have a look here: https://arxiv.org/abs/1810.10222

But perplexity seems to be a bad measurement of the quality of the model, at least that is what we noticed competing in the recent PolEval. We have some ideas on how to estimate the quality of the language model before you actually do the whole training, but that is still work in progress. Basically we fix random seeds, train a few LMs for one epoch on Wikipedia, then fine-tune and train 10 classifiers. This lets you judge the quality of your model before you go through the expensive 10-epoch training on Wikipedia. But that is early work and it might not pan out. The point is that perplexity isn’t a good estimate for the downstream task.

4 Likes

That’s an interesting observation.

So if I’m understanding correctly, the idea of using a pre-trained LM isn’t that important (or even really possible) when using SentencePiece, provided that you have a sufficiently sized corpus of your own?

It still makes sense. We got a few % improvement going from Wikipedia to MLDoc over training directly on MLDoc. I haven’t tried merging MLDoc with Wikipedia. Starting from a pretrained model gives you a quick and very high baseline. Sometimes that is all you really need.

1 Like

That’s a pretty good question actually, I am looking for an answer too. For now I just set a number of epochs that I think is reasonable, not being afraid of going a little too high, and then I use the SaveModelCallback to save the model whenever there is an improvement in, say, the val loss.
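A minimal sketch of that setup with fastai v1 (as used in the 2019 course); `data` is a hypothetical DataBunch and the 20 epochs are just an example budget:

```python
from fastai.vision import *                    # fastai v1
from fastai.callbacks import SaveModelCallback

learn = cnn_learner(data, models.resnet34, metrics=accuracy)

# Train generously; keep only the checkpoint with the best validation loss.
learn.fit_one_cycle(
    20,
    callbacks=[SaveModelCallback(learn, every='improvement',
                                 monitor='valid_loss', name='best')])

learn.load('best')  # reload the best checkpoint afterwards
```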

There isn’t really anything other than experimentation, unfortunately. Run as many as you have time for! :slight_smile: Generally more augmentation and more epochs gives better results.

1 Like

So basically, you only need to do the wiki-103+ your corpus trick if there are things in your corpus that SentencePiece wouldn’t run across in wiki-103, correct?

For example, if my custom corpus has the word “fubar” in it, and that word didn’t exist in my SentencePiece model/vocab trained on wiki-103, I could still use that SP model/vocab to tokenize the word “fubar” and it would work?

Correct. In the extreme case each letter would become a token. You would only get UNK (the unknown token) in the very rare case where a letter does not exist in the vocab.
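A small sketch of that fallback behaviour, assuming a SentencePiece model trained on wiki-103 (the `sp_wiki.model` file is hypothetical; the exact pieces depend on the trained vocab):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='sp_wiki.model')

# An out-of-vocabulary word is simply broken into smaller known pieces,
# down to single characters if necessary.
print(sp.encode('fubar', out_type=str))   # e.g. ['▁fu', 'b', 'ar']

# Only a character that never appeared in training maps to the <unk> piece.
print(sp.encode('☃', out_type=str))
```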

Thanks for the clarification!

A bit off-topic, but I was looking at your fastai-sentencepiece.py code and I had a few questions:

  1. When you build your SPM vocab, I noticed that you build it against the entirety of your .txt files. Is there a reason why you didn’t create a training and validation set and build your vocab from just the training set?

  2. I’m a little confused by your note here on handling tokens like TK_MAJ, etc… Shouldn’t those tokens be included before you train your SP model?

  3. I see Jeremy’s note to limit your LM to 100 million tokens, but I didn’t notice anything in this file at least where you were doing the same. Just wondering why?

Thanks again.

  1. You are right that the SentencePiece vocab should be based on a training set.
  2. TK_MAJ and many other symbols are inserted into the SP vocab. The text is transformed using TK_MAJ and the other symbols before it is fed to SP (see the sketch below).
  3. Jeremy’s limitation to 100 million tokens is due to fastai’s memory limitations in handling a big corpus. Google’s BERT and OpenAI’s GPT-2 prove beyond doubt that an NLP library should be written to handle huge corpora. I believe the next version of fastai will do that.
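For point 2, a minimal sketch of that idea (file name and vocab size are hypothetical): register fastai v1’s special tokens (xxmaj for TK_MAJ, xxup for TK_UP, etc.) as user-defined symbols so SentencePiece always keeps them as single pieces, and train on text that has already been run through the fastai rules.

```python
import sentencepiece as spm

# fastai v1 special tokens we want preserved as whole pieces
special_tokens = 'xxmaj,xxup,xxrep,xxwrep'

spm.SentencePieceTrainer.train(
    input='train_transformed.txt',      # text already lowercased with markers inserted
    model_prefix='sp_fastai',
    vocab_size=30000,
    user_defined_symbols=special_tokens)
```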

Thanks again Kaspar!

I notice that you do a lot more pre-processing than other related projects (e.g., ulmfit-multilingual) via your custom spm_rules here … and so I’m wondering why? It seems like the ulmfit-multilingual folks were satisfied with just using the MosesTokenizer for preparing the files for SP to be trained on.

I’m also wondering why this block of code?

import hashlib

# Hash only the alphabetic content of the line and skip lines already seen
# (hashlib.md5 is assumed here; the original snippet omitted the hash constructor).
tl_hash = hashlib.md5(only_alphas(tl).encode()).hexdigest()
if tl_hash not in unique_lines:
    txt_selected.append(tl)
    unique_lines[tl_hash] = ""

I processed the entire English Wikipedia and noticed that there are auto-generated phrases, and users that copy & paste standard phrases, so that code removes duplicate lines.

Good to know.

Any reason you’re not including the TK_MAJ and TK_UP tokens in the .txt files for SP to train on?

I ask because one of your “spm_rules” is to lowercase everything, which would prevent you from using these tokens later on.

1 Like