Google BERT Language Models

(Monique Monteiro) #21

Interesting discussion: https://github.com/google-research/bert/issues/66

2 Likes

(Kaushal Trivedi) #22

Use a max sequence length of 512. That got me to 95% accuracy with the BERT-Base uncased model.
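
With the original Google repo, that’s the --max_seq_length flag on run_classifier.py. Here’s a minimal sketch of what the limit means (the vocab path is a placeholder, and I’m paraphrasing the truncation logic rather than copying it verbatim):

```python
# Minimal sketch of what --max_seq_length=512 does. Assumptions: the
# google-research/bert repo is on PYTHONPATH, and vocab.txt comes from the
# BERT-Base uncased checkpoint (the path below is a placeholder).
import tokenization  # tokenization.py from google-research/bert

MAX_SEQ_LENGTH = 512  # the value passed as --max_seq_length to run_classifier.py

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True)

tokens = tokenizer.tokenize("some long document text ...")

# Paraphrasing the truncation in run_classifier.py: two positions are
# reserved for the [CLS] and [SEP] special tokens, and everything past
# the limit is simply dropped.
if len(tokens) > MAX_SEQ_LENGTH - 2:
    tokens = tokens[:MAX_SEQ_LENGTH - 2]
tokens = ["[CLS]"] + tokens + ["[SEP]"]
print(len(tokens))  # never more than 512
```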

2 Likes

(Monique Monteiro) #23

I’ve reduced my dataset to include only documents up to 512 tokens long, but the accuracy got worse: 51%. I suspect the data reduction led to overfitting.

BTW, I haven’t read BERT’s source code yet, so I’d like to know whether BERT automatically “divides” documents longer than the max sequence length into chunks with the same label, or whether I should do that manually.

0 Likes

(Monique Monteiro) #24

An update: I added a preprocessing step to break the documents into chunks of at most 512 words (even though BERT actually uses WordPiece tokenization, so 512 words don’t map exactly to 512 tokens), and my model’s accuracy jumped to 76%!
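
In case it helps anyone, the preprocessing is roughly this (just a sketch: it splits on whitespace words rather than WordPiece tokens, so a margin below 512 is safer, and labeled_documents stands in for your own (text, label) pairs):

```python
# Illustrative chunking sketch: split each long document into word chunks
# and repeat the document's label on every chunk. Word counts are only a
# proxy for WordPiece token counts, so real code should leave headroom.
def chunk_document(text, label, max_words=512):
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words]), label

# Hypothetical (text, label) pairs; each chunk becomes its own example.
labeled_documents = [("word " * 1200, "sports")]
examples = []
for text, label in labeled_documents:
    examples.extend(chunk_document(text, label))
print(len(examples))  # 3 chunks, all labeled "sports"
```

One thing to watch with this approach: at evaluation time each chunk gets its own prediction, so accuracy over chunks isn’t quite the same thing as accuracy over documents.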

2 Likes

(Andrew Nanton) #25

Have you been using BERT with fastai via the Hugging Face PyTorch port? If so, can you offer any insights into how you got it working?

1 Like

(Monique Monteiro) #26

No, I’ve been using BERT’s “original” Google version. No PyTorch yet.

1 Like