Interesting discussion: https://github.com/google-research/bert/issues/66
Use a max sequence length of 512. That got me to 95% accuracy with the BERT-Base Uncased model.
I reduced my dataset to include only the documents with a length of up to 512, but the accuracy got worse: 51%. I suspect that shrinking the dataset led to overfitting.
BTW, I haven’t read BERT’s source code yet, so I’d like to know whether BERT automatically “divides” documents longer than the max sequence length, reusing the same label for each piece. Or should I do that manually?
An update: I added a preprocessing step to break the documents into chunks of at most 512 words (even though I know BERT counts WordPiece tokens, not words) and my model’s accuracy jumped to 76%!
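For anyone who wants to try the same thing, here is a minimal sketch of that kind of chunking step. It is not the poster's actual code: the function names `chunk_document` and `chunk_dataset` are made up for illustration, and it assumes the dataset is a list of `(text, label)` pairs and that splitting on whitespace is an acceptable definition of a "word".

```python
from typing import Iterable, Iterator, List, Tuple


def chunk_document(text: str, label: str, max_words: int = 512) -> Iterator[Tuple[str, str]]:
    """Yield (chunk, label) pairs, each chunk holding at most max_words words.

    Words are whitespace-separated tokens, NOT WordPiece tokens, so a chunk
    can still be longer than 512 tokens once BERT tokenizes it.
    """
    words = text.split()
    # Step through the word list in windows of max_words.
    # A document with no words yields nothing.
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words]), label


def chunk_dataset(examples: Iterable[Tuple[str, str]], max_words: int = 512) -> List[Tuple[str, str]]:
    """Chunk every document; each chunk inherits its document's label."""
    chunked: List[Tuple[str, str]] = []
    for text, label in examples:
        chunked.extend(chunk_document(text, label, max_words))
    return chunked


if __name__ == "__main__":
    data = [("one two three four five", "pos"), ("six", "neg")]
    for chunk, label in chunk_dataset(data, max_words=2):
        print(label, "|", chunk)
```

Since a word usually maps to one or more WordPiece tokens, a smaller `max_words` leaves more headroom under the 512-token limit. The right value depends on the data, so treat it as something to tune.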
Have you been using BERT with fastai via the Hugging Face port to PyTorch? If so, can you offer any insights into how you got it working?
No, I’ve been using BERT’s “original” Google version. No PyTorch yet.
Unsupervised Data Augmentation (UDA) has the best reported performance using BERT for text classification.