Hi @stefan-ai! I'm eager to join the meeting today, but I'm also fine with moving it to another date. Or doing both.
And I also have more questions than results at the moment.
I'm starting a call at https://us04web.zoom.us/j/72420542182?pwd=UnZmODdZRVFVZkpiRHd4VkRYVmwxdz09
Please feel free to join!
Related to what we discussed today:
Link to lucidrains reformer repo: https://github.com/lucidrains/reformer-pytorch
Trying it on 64k tokens seq_len causal lm: https://colab.research.google.com/gist/arampacha/9cc2fd7b5818c91ce64013b83bcfa567/reformer_wikitext_clm.ipynb
Hi, I was also unable to attend yesterday. Any new ideas from the meeting?
I have started on an LSH exploration. This is perhaps a bit tangential to the main goal of the project, but I have an interest in clustering. I'll add the notebook to the repo in a /exploration subfolder - have a look if you're interested!
Also, my thoughts for the next steps are:
- test if the authors' repo code works out of the box
- consider lucidrains reformer implementation
- getting the datasets (from hf/datasets I guess?)
- setting up a wandb project (wandb is easy to use with the fastai callback)
- einops transformer implementation (have to look into einops a bit...)
- implement revnet, lsh-attention etc. separately
- run ablations and experiments
Also, should we continue discussion in the forum or set up slack/discord etc.?
D'oh! I managed to delete the notebook before pushing it to git, so the file is lost... Anyway, I was surprised how easy it was to get a basic version of LSH working, basically just following the steps described in the paper with a few lines of code:
To get b hashes, we first fix a random matrix R of size [d_k, b/2]. We then define h(x) = arg max([xR; -xR]) where [u; v] denotes the concatenation of two vectors.
In the trax library this method is called hash_vecs(). It has a few tweaks compared to the original LSH-algo, but works out of the box.
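Since the notebook is gone, here is a minimal pure-Python sketch of the quoted step, just to illustrate it. It is not the trax hash_vecs() code and not the lost notebook: the function names are made up for this post, and drawing R from a standard Gaussian is an assumption (the quote only says "random matrix").

```python
import random

def make_projection(d_k, n_buckets, seed=0):
    """Random matrix R of shape [d_k, n_buckets // 2] (Gaussian entries are an assumption)."""
    if n_buckets % 2 != 0:
        raise ValueError("n_buckets must be even")
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)]
            for _ in range(d_k)]

def lsh_hash(x, R):
    """h(x) = argmax([xR; -xR]) for a single vector x of length d_k."""
    n_cols = len(R[0])
    # Project x onto each column of R.
    xR = [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(n_cols)]
    # Concatenate [xR; -xR] and return the index of the largest entry.
    scores = xR + [-v for v in xR]
    return max(range(len(scores)), key=scores.__getitem__)

R = make_projection(d_k=4, n_buckets=8)
a = [1.0, 0.5, -0.3, 2.0]
bucket_a = lsh_hash(a, R)
bucket_scaled = lsh_hash([3.0 * v for v in a], R)   # same direction as a
bucket_flipped = lsh_hash([-v for v in a], R)       # opposite direction
```

Vectors pointing in the same direction land in the same bucket regardless of their length, and negating a vector moves it to the "opposite" bucket (index shifted by half the number of buckets), which is the angular behaviour the paper relies on.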
Hi everybody,
We had a great meeting yesterday with @arampacha, @Dean-DAGs and @PriyanK7n.
@arampacha made a lot of progress, getting a Reformer language model to train successfully on a subset of Wikitext 103. See his post and notebook above.
A couple of other points we discussed:
- Training speed could become an issue when training Reformer (could you please share the training stats that you mentioned yesterday, @arampacha?)
- Relatively soon - maybe in the next meeting - we should create separate tasks so that we don't end up all working on the same issues.
- Among the first tasks should be to re-create and share the enwiki dataset, to make sure everyone is working with the same data and to save pre-processing time.
- @arampacha reported an issue when trying to load huggingface's google/reformer-enwik8, so he decided to train from scratch.
- The other pre-trained model on the huggingface model hub, google/reformer-crime-and-punishment, uses a different tokenization approach than the enwiki model. Due to conflicting sequence lengths, I didn't manage to successfully fine-tune the model on downstream tasks.
- Since the Reformer paper is very brief and leaves out some important details, we might have to reach out to the authors for clarification. However, let's first collect our issues before doing so.
- @Dean-DAGs potentially has a contact at huggingface and kindly offered to reach out if needed.
- Additionally, we could try to replicate these results from training Reformer on SQuAD 2.0
If I missed anything, please add it guys!
@hallvagi: Thanks for sharing your ideas. We agreed on some of these points in our meetings already. This list is a great starting point for formulating specific tasks that team members or smaller groups can start working on. Nice to hear that you had a good experience implementing basic LSH.
I think we're on a good path. Let's keep the momentum going and meet again soon to define concrete next steps. Have a nice weekend everybody!
PS: I agree that a separate slack/discord channel would be helpful. Could someone set it up?
Getting a chat for effective communication is a good idea. I've set up a discord server, here is an invite: https://discord.gg/mG5GVq3n. I have no experience with this stuff though, so if anyone is willing to take over, that's cool. But I think a simple server will do for a start.
Great to see all the progress here!!
@arampacha re HuggingFace datasets, here they suggest: i) doing all the preprocessing while loading everything into memory, which might not be feasible for large datasets; ii) they mentioned an improvement with tokenizers that increased speed by 10x (back in August). I think in a chat somewhere they mentioned that they were working on parallelising preprocessing with Datasets, but I have no idea when it will be released.
Also, that Wiki103 Colab example is great!
Discord link broken?
@arampacha Also, the discord invite link doesn't work for me, maybe it's expired?
Tokenization
One question about which tokenizer to use, expanding on the tokenization bullet above:
The default HuggingFace "ReformerTokenizer" uses sentencepiece; do you think this was decided after discussion with the authors? The Reformer colabs use different tokenizers depending on the task:
SubwordTextEncoder - Machine Translation
In the Machine Translation Trax colab they use a pretrained EN-DE SubwordTextEncoder:
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder
...
# Set up our sub-word tokenizer
tokenizer = SubwordTextEncoder(
    'gs://trax-ml/reformer/mt/vocab.translate_ende_wmt32k.32768.subwords')
SentencePiece Processor - Text Generation
In the Text Generation Trax colab they use SentencePiece:
from sentencepiece import SentencePieceProcessor
...
# Load a BPE vocabulary with 320 types. This mostly consists of single letters
# and pairs of letters, but it has some common words and word pieces, too.
!gsutil cp gs://trax-ml/reformer/cp.320.* .
TOKENIZER = SentencePieceProcessor()
TOKENIZER.load('cp.320.model')
I see the Stanford SQuAD paper doesn't mention what tokenizer they used either... It doesn't seem like they did any pre-training either, so maybe it's not so surprising they didn't get great results. The time to pre-train would also be an issue, I guess.
Github
Authors Questions
I think a question about the tokenizers used should be added to our list for the authors. I've created a wiki page on the github that we can add questions to: https://github.com/morganmcg1/reformer-fastai/wiki/Questions-for-the-Authors
Move Google Doc info to Github?
I also updated the github's readme with the resource links from the Google doc. Do you think I should move the info from the Google doc into Wikis/Issues in
Kanban Project
I also created a Kanban project on the github (https://github.com/morganmcg1/reformer-fastai/projects/1), which might be useful to assign and track tasks. I haven't used github's version before, and we don't have to use it, but it's nice to have the option.
Weights and Biases
I just applied for an academic plan for this reproducibility challenge so that we can create a team and share experiment results (the free tier doesn't let you create teams). I will let you know their response.
enwik8-64k data
I'll look into finding a source for this now
Next Meeting
Does the same time this Thursday suit everyone?
enwik8 data ( + Tokenizer?)
I've tracked down the enwik8 data and also how it was "encoded" (== Tokenization), and added a gist notebook here in the repo which you can run on colab: https://github.com/morganmcg1/reformer-fastai/blob/main/enwiki8_Tensor2Tensor_download.ipynb
TL;DR
"t2t_enwik8_l65k" is listed as one of the parameters in the Reformer enwik8 trax config.
This turns out to be a Tensor2Tensor Problem dataset. These are datasets that already have some pre-processing done to them. In this case, the data is encoded with a ByteTextEncoder defined in the Tensor2Tensor library.
Have a look at the notebook above for a full trace of the lineage.
One thing I'm not sure about is whether other pre-processing functions are applied to the dataset besides the encoding...
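For anyone who hasn't looked at byte-level encoding before, here is a rough sketch of the idea in plain Python. It is not the Tensor2Tensor code: the function names are made up, and the offset of 2 reserved ids (for padding and end-of-sequence) is my assumption about how ByteTextEncoder shifts the byte values, so please check it against the library before relying on it.

```python
# Assumption: ids 0 and 1 are reserved (padding / end-of-sequence),
# so every raw byte value is shifted up by 2.
NUM_RESERVED_IDS = 2

def encode_bytes(text, num_reserved_ids=NUM_RESERVED_IDS):
    """Turn a string into one integer id per UTF-8 byte."""
    return [b + num_reserved_ids for b in text.encode("utf-8")]

def decode_bytes(ids, num_reserved_ids=NUM_RESERVED_IDS):
    """Invert encode_bytes, silently dropping any reserved ids."""
    raw = bytes(i - num_reserved_ids for i in ids if i >= num_reserved_ids)
    return raw.decode("utf-8", errors="replace")

sample = "Reformer é"
ids = encode_bytes(sample)          # "é" takes two bytes, so len(ids) > len(sample)
roundtrip = decode_bytes(ids)
```

With this scheme the "vocabulary" is just the 256 possible byte values plus the reserved ids, so there is no vocab file to download or train.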
Raw data
I have added a Data subheader to our github README with links to the above notebook as well as a direct link to the raw 100 MB enwik8 zip that the Tensor2Tensor library downloads for you
@morgan new invite https://discord.gg/TMTWQ4kG3V
@marii @tyoc213 @wgpubs
sorry if I missed someone, feel free to join
Almost missed the next meeting bit. Yes, I would be okay with the same time next Thursday.
Hi there, I have lots of interest... but I hadn't found a good paper to reproduce with my current knowledge and resources (they require lots of TPUs or thousands of hours!!!)... and I think I've now found one that is also a bit like the "first task" you should do when doing ASR (recognize 1 word): https://openreview.net/forum?id=ijvwzg_7I71
Wish me luck implementing my second paper ever (and hopefully the first one I finish). I started the first one but didn't finish it :'( (hopefully I will get back to it, or implement the ideas I had for it).
Hey all, we got approval for academic membership for Weights and Biases, meaning we can add team projects. Could you send me your Weights and Biases usernames and I'll add you to the team?
Interesting Paper, Reformer struggled compared to most:
LONG RANGE ARENA: A BENCHMARK FOR EFFICIENT TRANSFORMERS
" Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at this https URL."
Some Reformer mentions:
Speed Results:
"Based on our implementation, the slowest model is the Reformer model (Kitaev et al., 2020) that is about 80% the speed of vanilla Transformer at 4K sequence lengths and half the speed at 1K sequence length."
Implementation:
"Reformer's Implementation: Having optimized ops to support many of Reformer's functionality is crucial. Hence, Reformer is implemented slightly differently from other Transformer models. Instead of computing tensors with batch size dimensions B and head dimensions H, (i.e., B × H × N × d), we compute the attention function for tensors of N × d dimensions. After which, we parallelize this function via VMAP over the batch and head dimensions"
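To make that last point concrete, here is a toy pure-Python sketch of the pattern: write attention for a single N × d input, then lift it over the head and batch axes. The `lift` helper is a made-up stand-in for what `jax.vmap` does, and plain softmax attention is used instead of LSH attention, so this is not the LRA authors' code, only an illustration of the structure they describe.

```python
import math

def attention(q, k, v):
    """Scaled dot-product attention for one head: q, k, v are N x d nested lists."""
    d = len(q[0])
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        m = max(scores)                              # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def lift(fn):
    """Map fn over a new leading axis of every argument (toy stand-in for vmap)."""
    return lambda *batched: [fn(*args) for args in zip(*batched)]

# Lift once for heads, once more for the batch: inputs become B x H x N x d.
batched_attention = lift(lift(attention))

B, H, N, d = 2, 3, 4, 5
x = [[[[0.1 * (b + h + n + c) for c in range(d)] for n in range(N)]
      for h in range(H)] for b in range(B)]
y = batched_attention(x, x, x)
```

The per-example function never sees the batch or head dimensions, which is what makes it easy to reuse specialised single-sequence ops.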
Another paper which helps to relate Reformer to other transformer architecture modifications https://arxiv.org/abs/2009.06732
Since you all are using WandB already, you should consider checking this out:
You can keep me in the loop in case of any inquiries/doubts about WandB or feature requests for the same.
Feedback Requested
Hey all, we (see below) have been working on the Papers With Code Reproducibility Challenge for almost 3 months now, where we have reproduced the experiments done in the Reformer paper (an efficient transformer model), and would love any and all feedback you have on our final report before submission (tomorrow night):
You can ping me, comment here or comment on the forums post: Reproducibility Challenge 2020 - fastai folks interested. There is still some formatting to do, but we'd love to hear how it reads: do we make our points clearly, and what needs to be fixed?
The team: @arampacha @hallvagi @imrandude @Dean-DAGs @tyoc213 @PriyanK7n & I
Our docs are here if you'd like to see the code: https://arampacha.github.io/reformer_fastai/
BONUS: if you have an hour or two to go into depth with it we'd be happy to thank you as a contributor too!
Project Summary: Our Reproducibility Challenge Experience
Way back in October 2020 the Papers With Code ML Reproducibility Challenge 2020 was launched and shared in the fast.ai forums. A few of us jumped at the chance to test our ML knowledge and push our skills. Fast forward 110 days since that initial post and we delivered our Reformer Reproducibility submission via OpenReview! Here are a few reflections on our experience; what we enjoyed, tools we used and what we would have done differently:
TLDR;
- Working as a team pushes your motivation, your skills and your throughput
- nbdev for development, Weights & Biases for tracking and Discord for communication
- We could have made better use of task/project management tools; maybe we needed a different tool
- Next time we'll start experiments sooner and maybe pick a more practical paper
- It was a massive learning experience and a lot of fun
Why participate
Implementing code from scratch is much more enjoyable and meaningful when there is a direct application, e.g. working towards this reproducibility challenge. Spending weeks and months focussed on a single paper forces you to understand the paper down to the last full stop. It also gives you a great appreciation of how difficult writing a good paper is: you see that almost every word and sentence is chosen carefully to communicate a particular concept, problem or model setting.
N heads are better than one a.k.a. Multihead Attention
Our team was distributed across 6 countries and everyone had a somewhat different background, set of skills and personality. This mix was definitely beneficial for getting things done much more smoothly. Having 2 x N eyes researching implementation information or reviewing code really improved coverage and sped up the entire process. It also makes debugging much faster!
Writing code that the entire team will use also meant writing cleaner code with more tests so that it was as clear as possible for your teammates. And finally, during a long project like this it's easy to get distracted or lazy; however, seeing everyone else delivering great work quickly pulls you back into line!
Good tools are key
nbdev
The nbdev literate programming environment from fast.ai was super convenient to minimise the project's development friction. Writing tests as we developed meant that we caught multiple bugs early and auto-generation of docs lends itself immensely to the reproducibility of your code. Most of us will be using this again for our next projects.
Weights & Biases
Weights & Biases generously gave us a team account which enabled us all to log our experiments to a single project. Being directly able to link your runs and results to the final report was really nice. Also it's pretty exciting monitoring 10+ experiments live!
Discord
A Discord server worked really well for all our chat and voice communication. Frequent calls to catch up and agree on next steps were super useful. Todo lists and core pieces of code often ended up as pinned messages for quick reference, and linking Github activity to a channel was useful for keeping an eye on new commits to the repo.
Overleaf
When it came to writing the final report in LaTeX, Overleaf was a wonderful tool for collaborative editing.
ReviewNB
The ReviewNB app on GitHub was very useful for visualizing diffs in notebooks.
Learn from the best
The Reformer architecture had several complex parts, and having Phil Wang's and HuggingFace's Github code was very helpful to understand design decisions and fix issues.
Things we can improve for the next time
Start experiments early
We started our experiments quite late in the project. As we aimed to reimplement Reformer in Pytorch (with reference to existing implementations), about 90% of our time was spent on ensuring our implementation was faithful to the paper and working correctly. In retrospect, starting experiments earlier would have allowed more in-depth exploration of what we observed while testing. Full-scale experiments have a way of inducing problems you didn't foresee during the implementation phase...
Task distribution and coordination
When working in a distributed and decentralized team, efficient task allocation and tracking is important. Early in the project todo lists lived in people's heads, or were quickly buried under 50 chat messages. This was suboptimal for a number of reasons, including that it made involving new people in the project more challenging as they could not easily identify where they could best contribute.
We made a switch to Trello to better track open tasks. It worked reasonably well; however, its effectiveness was probably proportional to how much time a couple of team members had to review the kanban board, advocate for its use and focus the team's attention there. The extra friction of needing to use another tool unconnected to Github or Discord was probably why we didn't use it as much as we could have. Integrating Trello into our workflow or giving Github Projects a trial could have been useful.
More feedback
We had originally intended to get feedback from the fastai community during the project. In the end we were too late in sharing our material, so there wasn't time for much feedback. Early feedback would have been very useful and the project might have benefited from some periodic summary of accomplishments and current problems. We could have solicited additional feedback from the authors too.
Distributed training
This was our first exposure to distributed training and unfortunately we had a lot of issues with it. We were also unable to log the results from distributed runs properly to Weights & Biases. This slowed down our experiment iteration speed and is why we could not train our models for as long as we would have preferred.
Choice of paper to reproduce
It would have been useful to calculate a rough estimate of the compute budget the paper's experiments required before jumping into it. In the latter stages of the project we realised that we would be unable to fully replicate some of the paper's experiments, but instead had to run scaled down versions. In addition, where your interest sits between theoretical and practical papers should be considered when selecting a paper for the challenge.
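As an illustration of the kind of back-of-the-envelope estimate we mean, here is a small sketch. The function name and all the numbers are hypothetical placeholders, not measurements from our runs or from the paper; in practice you would time a few steps on your own hardware and plug those in.

```python
def estimate_gpu_hours(steps_per_run, seconds_per_step, num_runs, gpus_per_run=1):
    """Total GPU-hours for a set of identical training runs."""
    total_seconds = steps_per_run * seconds_per_step * num_runs * gpus_per_run
    return total_seconds / 3600

# Hypothetical example: 5 ablation runs of 100k steps each,
# at 0.9 seconds per step on a single GPU.
budget = estimate_gpu_hours(steps_per_run=100_000, seconds_per_step=0.9, num_runs=5)
```

Comparing a number like this against the hardware and time you actually have, before committing to a paper, makes it obvious early on whether scaled down experiments will be needed.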
More tools
We could have tried even more handy tools such as knockknock to alert us when models are finished training and Github Projects for task management.
Some final thoughts
We came out of this project even more motivated than when we started; a great indication that it was both enjoyable and useful for us! Our advice would be to not hesitate to join events like this one and challenge yourself, and to try and find one or more other folks in the forums or Discord to work with. After successfully delivering our submission to the challenge we are all eager to work together again on our next project. Stay tuned for more!