Great to see all the progress here!!
@arampacha re HuggingFace Datasets, here they suggest: i) doing all the preprocessing up front while loading everything into memory, which mightn’t be feasible for large datasets, and ii) an improvement to the tokenizers that increased speed by 10x (back in August). I think in a chat somewhere they mentioned they were working on parallelising preprocessing in Datasets, but no idea when that will be released.
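In case it’s useful, here’s a rough sketch of what batched tokenization with Datasets could look like for us (the dataset name, text column and tokenizer checkpoint below are just placeholders, not necessarily what we’d use):

# Sketch: batched tokenization with HuggingFace Datasets + a fast tokenizer.
# Dataset name, text column and checkpoint are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='train')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

# batched=True is where the fast-tokenizer speedup shows up; map() writes its
# results to the on-disk cache, so the whole corpus doesn't have to sit in memory.
tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])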
Also, that Wiki103 Colab example is great!
Discord link broken?
@arampacha Also, the Discord invite link doesn’t work for me, maybe it’s expired?
Tokenization
One question around which tokenizer to use, expanding on the tokenization bullet above:
The default HuggingFace “ReformerTokenizer” uses SentencePiece; do you think this was after discussion with the authors? The Reformer colabs use different tokenizers depending on the task:
SubwordTextEncoder
- Machine Translation
In the Machine Translation Trax colab they use a pretrained EN-DE SubwordTextEncoder:
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder
...
# Set up our sub-word tokenizer
tokenizer = SubwordTextEncoder(
    'gs://trax-ml/reformer/mt/vocab.translate_ende_wmt32k.32768.subwords')
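For reference, using it is just encode/decode (the sentence below is only an example):

# Quick sanity check of the pretrained sub-word tokenizer (example sentence only)
ids = tokenizer.encode('The quick brown fox jumps over the lazy dog.')
print(tokenizer.decode(ids))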
SentencePiece Processor
- Text Generation
In the Text Generation Trax colab they use SentencePiece:
from sentencepiece import SentencePieceProcessor
...
# Load a BPE vocabulary with 320 types. This mostly consists of single letters
# and pairs of letters, but it has some common words and word pieces, too.
!gsutil cp gs://trax-ml/reformer/cp.320.* .
TOKENIZER = SentencePieceProcessor()
TOKENIZER.load('cp.320.model')
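For comparison, loading the HuggingFace version looks something like the below (I’m using their Crime and Punishment checkpoint since I think it’s the only pretrained Reformer tokenizer on the hub, but worth double-checking):

# HuggingFace's ReformerTokenizer wraps a SentencePiece model under the hood
from transformers import ReformerTokenizer

hf_tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
ids = hf_tokenizer.encode('It was a dark and stormy night.')
print(hf_tokenizer.decode(ids))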
I see the Stanford SQuAD paper doesn’t mention what tokenizer they used either… it doesn’t seem like they did any pre-training, so maybe it’s not so surprising they didn’t get great results. The time needed to pre-train would also be an issue, I guess.
GitHub
Questions for the Authors
I think a question around which tokenizers were used should be added to our list for the authors. I’ve created a wiki page on the GitHub repo that we can add questions to: Questions for the Authors · morganmcg1/reformer-fastai-old Wiki · GitHub
Move Google Doc info to GitHub?
I also updated the GitHub readme with the resource links from the Google doc. Do you think I should move the info from the Google doc into Wikis/Issues in the repo?
Kanban Project
I also created a Kanban project on the GitHub repo (Reformer Reproducibility · GitHub), which might be useful for assigning and tracking tasks. I haven’t used GitHub’s version before, and we don’t have to use it, but it’s nice to have the option.
Weights and Biases
I just applied for an academic plan for this reproducibility challenge so that we can create a team and share experiment results (the free tier doesn’t let you create teams). I’ll let you know their response.
enwik8-64k data
I’ll look into finding a source for this now
Next Meeting
Does the same time this Thursday suit everyone?