Google BERT Language Models

monilouise · March 20, 2019, 5:41pm

Interesting discussion: https://github.com/google-research/bert/issues/66

ktrivedi · March 21, 2019, 6:03am

Use the maximum sequence length, 512. That got me to 95% accuracy with the BERT base uncased model.

monilouise · March 22, 2019, 4:33pm

I’ve reduced my dataset to include only the documents with length up to 512, but the accuracy got worse: 51%. I suspect the smaller dataset led to overfitting.

BTW, I haven’t read BERT’s source code yet, so I’d like to know whether BERT automatically “divides” documents longer than the max sequence length, giving each piece the same label. Or should I do that manually?

monilouise · April 10, 2019, 5:10pm

An update: I added a preprocessing step to break the documents into chunks of at most 512 words (despite knowing that BERT counts length in WordPiece tokens, not words) and my model’s accuracy jumped to 76%!
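
For readers who want to try the same idea, the chunking step described above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual code: the function name `chunk_document`, the whitespace-based word splitting, and the rule that every chunk simply inherits its document's label are all hypothetical choices.

```python
from typing import List, Tuple


def chunk_document(text: str, label: str, max_words: int = 512) -> List[Tuple[str, str]]:
    """Split `text` into chunks of at most `max_words` whitespace-separated words.

    Every chunk is paired with the label of the original document
    (an assumption; other labeling schemes are possible).
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        (" ".join(words[start:start + max_words]), label)
        for start in range(0, len(words), max_words)
    ]


# Toy example with a tiny chunk size so the result is easy to inspect.
chunks = chunk_document("a b c d e", label="sports", max_words=2)
print(chunks)  # [('a b', 'sports'), ('c d', 'sports'), ('e', 'sports')]
```

Note that a chunk of 512 words generally produces more than 512 subword tokens once it is tokenized and the special tokens are added, so the tail of each chunk may still be cut off at the max sequence length. A smaller `max_words` value might be worth trying for that reason.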

phren0logy · April 11, 2019, 12:59am

Have you been using BERT with fastai via the Hugging Face port to PyTorch? If so, can you offer any insights into how you got it working?

monilouise · April 12, 2019, 4:44pm

No, I’ve been using BERT’s “original” Google version. No PyTorch yet.

Waleed · May 15, 2019, 12:28pm

Unsupervised Data Augmentation (UDA) has the best reported performance using BERT for text classification.