Reproducibility Challenge 2020 - fastai folks interested

morgan · November 8, 2020, 4:23pm

Great to see all the progress here!!

@arampacha re HuggingFace datasets, here they suggest: i) doing all the preprocessing while loading everything into memory, which mightn’t be feasible for large datasets ii) they mentioned an improvement with tokenizers that increased speed by 10x (back in August). I think in a chat somewhere they mentioned that they were working on parallelising preprocessing with Datasets, but no idea when it will be released.

Also, that Wiki103 Colab example is great!

Discord link broken?

@arampacha Also, the discord invite link doesn’t work for me, maybe it’s expired?

Tokenization

One question about which tokenizer to use, expanding on the tokenization bullet above:

The default HuggingFace “ReformerTokenizer” uses sentencepiece. Do you think this was chosen after discussion with the authors? The reformer colabs use different tokenizers depending on the task:

`SubwordTextEncoder` - Machine Translation

In the Machine Translation Trax colab they use a pretrained EN-DE SubwordTextEncoder:

from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder

...

# Set up our sub-word tokenizer
tokenizer = SubwordTextEncoder(
    'gs://trax-ml/reformer/mt/vocab.translate_ende_wmt32k.32768.subwords')

`SentencePiece Processor` - Text Generation

In the Text Generation Trax colab they use SentencePiece:

from sentencepiece import SentencePieceProcessor

...

# Load a BPE vocabulary with 320 types. This mostly consists of single letters
# and pairs of letters, but it has some common words and word pieces, too.
!gsutil cp gs://trax-ml/reformer/cp.320.* .

TOKENIZER = SentencePieceProcessor()
TOKENIZER.load('cp.320.model')

I see the Stanford SQuAD paper doesn’t mention which tokenizer they used either… It doesn’t seem like they did any pre-training, so maybe it’s not so surprising they didn’t get great results. The time needed to pre-train would also be an issue, I guess.

Github

Authors Questions

I think a question about the tokenizers used should be added to our list for the authors. I’ve created a wiki page on the github that we can add questions to: Questions for the Authors · morganmcg1/reformer-fastai-old Wiki · GitHub

Move Google Doc info to Github?

I also updated the github’s readme with the resources links from the google doc. Do you think I should move the info from the Google doc into Wikis/Issues in the github repo?

Kanban Project

I also created a Kanban project on the github (Reformer Reproducibility · GitHub), which might be useful to assign and track tasks. I haven’t used github’s version before, and we don’t have to use it, but it’s nice to have the option.

Weights and Biases

I just applied for an academic plan for this reproducibility challenge so that we can create a team and share experiment results (the free tier doesn’t let you create teams). I will let you know their response.

enwik8-64k data

I’ll look into finding a source for this now

Next Meeting

Does the same time this Thursday suit everyone?

morgan · November 8, 2020, 6:10pm

enwik8 data ( + Tokenizer?)

I’ve tracked down the enwik8 data and also how it was “encoded” (== Tokenization). I’ve added a gist notebook here in the repo, which you can run on colab: https://github.com/morganmcg1/reformer-fastai/blob/main/enwiki8_Tensor2Tensor_download.ipynb

TL;DR

“t2t_enwik8_l65k” is listed as one of the parameters in the Reformer enwik8 trax config

This turns out to be a Tensor2Tensor Problem dataset. These are datasets that already have some pre-processing applied to them. In this case, the data is encoded with a ByteTextEncoder defined in the Tensor2Tensor library.

Have a look at the notebook above for a full trace of the lineage.

One thing I’m not sure about is whether other pre-processing functions are applied to the dataset besides the encoding…
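For anyone who wants a feel for what byte-level encoding means here, a minimal sketch in plain Python follows. It is a stand-in, not the Tensor2Tensor class: the function names are made up, and it assumes two reserved ids (padding and end-of-sequence) sit below the byte values, so every UTF-8 byte is shifted up by 2. Check the notebook and the library source for the exact behaviour.

```python
from typing import List

# Hypothetical stand-in for a byte-level text encoder; not the Tensor2Tensor code.
NUM_RESERVED_IDS = 2  # assumption: ids 0 and 1 are reserved (padding, end-of-sequence)


def encode_bytes(text: str) -> List[int]:
    """Map each UTF-8 byte of `text` to an integer id above the reserved range."""
    return [b + NUM_RESERVED_IDS for b in text.encode("utf-8")]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping any reserved ids."""
    raw = bytes(i - NUM_RESERVED_IDS for i in ids if i >= NUM_RESERVED_IDS)
    return raw.decode("utf-8", errors="replace")


ids = encode_bytes("Hi é")
print(ids)                # [74, 107, 34, 197, 171]
print(decode_bytes(ids))  # Hi é
```

Under this assumption the vocabulary has 256 + 2 entries, and a non-ASCII character such as "é" takes two ids because it is two bytes in UTF-8.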

Raw data

I have added a Data sub header to our github README with links to the above notebook as well as a direct link to the raw 100MB enwik8 zip that the Tensor2Tensor library downloads for you.

arampacha · November 8, 2020, 6:13pm

@morgan new invite https://discord.gg/TMTWQ4kG3V
@marii @tyoc213 @wgpubs
sorry if I missed someone, feel free to join

marii · November 9, 2020, 8:03am

Almost missed the next meeting bit. Yes, I would be okay with the same time next Thursday.

tyoc213 · November 9, 2020, 8:05am

Hi there, I have lots of interest… but I don’t think I’ve found a good paper to reproduce with my current knowledge and resources (they require lots of TPUs or thousands of hours!!!)… I think I found one that is also a little like the “first task” you should do when doing ASR (recognize 1 word): https://openreview.net/forum?id=ijvwzg_7I71

Wish me luck implementing my second paper ever (and hopefully the first one I finish). I started the first one but didn’t finish it :’( (hopefully I will get back to it, or implement the ideas I had for that one).

morgan · November 9, 2020, 2:09pm

Hey all, we got approval for academic membership for Weights and Biases, meaning we can add team projects. Could you send me your Weights and Biases usernames and I’ll add you to the team?

hallvagi · November 10, 2020, 8:45am

Great work @morgan setting up all the administrative stuff

morgan · November 10, 2020, 10:56pm

Interesting Paper, Reformer struggled compared to most:

LONG RANGE ARENA: A BENCHMARK FOR EFFICIENT TRANSFORMERS

" Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at this https URL."

Some Reformer mentions:

Speed Results:
“Based on our implementation, the slowest model is the Reformer model (Kitaev et al., 2020) that is about 80% the speed of vanilla Transformer at 4K sequence lengths and half the speed at 1K sequence length.”

Implementation:
“Reformer’s Implementation Having optimized ops to support many of Reformer’s functionality is crucial. Hence, Reformer is implemented slightly differently from other Transformer models. Instead of computing tensors with batch size dimensions B and head dimensions H, (i.e., B × H × N × d), we compute the attention function for tensors of N × d dimensions. After which, we parallelize this function via VMAP over the batch and head dimensions”
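The shape handling described in that quote can be sketched roughly as below. This is an illustration in NumPy, not the LRA or Trax code. It uses plain dot-product attention rather than Reformer’s LSH attention, and explicit Python loops stand in for the VMAP step. `attention_single` and `map_over_batch_and_heads` are hypothetical names.

```python
import numpy as np


def attention_single(q, k, v):
    """Dot-product attention for one head of one example; q, k, v have shape (N, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                # (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                      # (N, d)


def map_over_batch_and_heads(fn, q, k, v):
    """Apply fn to every (batch, head) slice of (B, H, N, d) inputs and restack."""
    B, H = q.shape[:2]
    return np.stack([
        np.stack([fn(q[b, h], k[b, h], v[b, h]) for h in range(H)])
        for b in range(B)
    ])


rng = np.random.default_rng(0)
B, H, N, d = 2, 3, 5, 4
q, k, v = (rng.normal(size=(B, H, N, d)) for _ in range(3))
out = map_over_batch_and_heads(attention_single, q, k, v)
print(out.shape)  # (2, 3, 5, 4)
```

In JAX the two loops would typically be replaced by nesting `jax.vmap` twice, once over heads and once over the batch, so the per-example function never has to know about the B and H dimensions.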

morgan · November 11, 2020, 9:26pm

Next Meeting: Thursday November 12th at 7pm GMT, on the Reproducibility discord

Lots to discuss!

arampacha · November 16, 2020, 11:15am

Another paper which helps to relate Reformer to other transformer architecture modifications https://arxiv.org/abs/2009.06732

Diganta · December 12, 2020, 7:45pm

Since you all are using WandB already, you should consider checking this out:

You can keep me in the loop in case of any inquiries/doubts about WandB or feature requests for the same.

morgan · January 28, 2021, 6:22pm

Feedback Requested

Hey all, we (see below) have been working on the Papers With Code Reproducibility Challenge for almost 3 months now, where we have reproduced the experiments done in the Reformer paper (an efficient transformer model). We would love any and all feedback you have on our final report before submission (tomorrow night):

You can ping me, comment here or comment on the forums post: Reproducibility Challenge 2020 - fastai folks interested. There is still some formatting to do, but we’d love to hear how it reads: do we make our points clearly, and what needs to be fixed?

The team: @arampacha @hallvagi @imrandude @Dean-DAGs @tyoc213 @PriyanK7n & I

Our docs are here if you’d like to see the code: https://arampacha.github.io/reformer_fastai/

BONUS: if you have an hour or two to go into depth with it we’d be happy to thank you as a contributor too!

morgan · February 11, 2021, 10:27am

Project Summary: Our Reproducibility Challenge Experience

Way back in October 2020 the Papers With Code ML Reproducibility Challenge 2020 was launched and shared in the fast.ai forums. A few of us jumped at the chance to test our ML knowledge and push our skills. Fast forward 110 days since that initial post and we delivered our Reformer Reproducibility submission via OpenReview! Here are a few reflections on our experience: what we enjoyed, the tools we used and what we would have done differently:

TLDR;

Working as a team pushes your motivation, your skills and your throughput
nbdev for development, Weights & Biases for tracking and Discord for communication
We could have made better use of task/project management tools; maybe we needed a different tool
Next time we’ll start experiments sooner and maybe pick a more practical paper
It was a massive learning experience and a lot of fun

Why participate

Implementing code from scratch is much more enjoyable and meaningful when there is a direct application, e.g. working towards this reproducibility challenge. Spending weeks and months focussed on a single paper forces you to understand the paper down to the last full stop. It also gives you a great appreciation of how difficult writing a good paper is: you see that almost every word and sentence is chosen carefully to communicate a particular concept, problem or model setting.

N heads are better than one a.k.a. Multihead Attention

Our team was distributed across 6 countries and everyone had a somewhat different background, set of skills and personality. This mix was definitely beneficial for getting things done much more smoothly. Having 2 x N eyes researching implementation information or reviewing code really improved coverage and sped up the entire process. It also makes debugging much faster!

Writing code that the entire team will use also meant writing cleaner code with more tests so that it was as clear as possible for your teammates. And finally, during a long project like this it’s easy to get distracted or lazy; however, seeing everyone else delivering great work quickly pulls you back into line!

Good tools are key

nbdev

The nbdev literate programming environment from fast.ai was super convenient for minimising the project’s development friction. Writing tests as we developed meant that we caught multiple bugs early, and the auto-generation of docs helps a lot with the reproducibility of your code. Most of us will be using this again for our next projects.

Weights & Biases

Weights & Biases generously gave us a team account which enabled us all to log our experiments to a single project. Being directly able to link your runs and results to the final report was really nice. Also it’s pretty exciting monitoring 10+ experiments live!

Discord

A Discord server worked really well for all our chat and voice communication. Frequent calls to catch up and agree on next steps were super useful. Todo lists and core pieces of code often ended up as pinned messages for quick reference, and linking Github activity to a channel was useful for keeping an eye on new commits to the repo.

Overleaf

When it came to writing the final report in LaTeX, Overleaf was a wonderful tool for collaborative editing.

ReviewNB

The ReviewNB app on GitHub was very useful for visualizing diffs in notebooks.

Learn from the best

The Reformer architecture had several complex parts, and having Phil Wang’s and HuggingFace’s Github code was very helpful to understand design decisions and fix issues.

Things we can improve for the next time

Start experiments early

We started our experiments quite late in the project. As we aimed to reimplement Reformer in PyTorch (with reference to existing implementations), about 90% of our time was spent on ensuring our implementation was faithful to the paper and that it was working correctly. In retrospect, starting experiments earlier would have allowed more in-depth exploration of what we observed while testing. Full scale experiments have a way of inducing problems you didn’t foresee during the implementation phase…

Task distribution and coordination

When working in a distributed and decentralized team, efficient task allocation and tracking is important. Early in the project todo lists lived in people’s heads, or were quickly buried under 50 chat messages. This was suboptimal for a number of reasons, including that it made involving new people in the project more challenging as they could not easily identify where they could best contribute.

We made a switch to Trello to better track open tasks. It worked reasonably well; however, its effectiveness was probably proportional to how much time a couple of team members had to review the kanban board, advocate for its use and focus the team’s attention there. The extra friction of needing to use another tool unconnected to Github or Discord was probably the reason why we didn’t use it as much as we could have. Integrating Trello into our workflow or giving Github Projects a trial could have been useful.

More feedback

We had originally intended to get feedback from the fastai community during the project. In the end we were too late in sharing our material, so there wasn’t time for much feedback. Early feedback would have been very useful and the project might have benefited from some periodic summary of accomplishments and current problems. We could have solicited additional feedback from the authors too.

Distributed training

This was our first exposure to distributed training and unfortunately we had a lot of issues with it. We were also unable to log the results from distributed runs properly to Weights & Biases. This slowed down our experiment iteration speed and is why we could not train our models for as long as we would have preferred.

Choice of paper to reproduce

It would have been useful to calculate a rough estimate of the compute budget the paper’s experiments required before jumping into it. In the latter stages of the project we realised that we would be unable to fully replicate some of the paper’s experiments, but instead had to run scaled down versions. In addition, where your interest sits between theoretical and practical papers should be considered when selecting a paper for the challenge.

More tools

We could have tried even more handy tools such as knockknock to alert us when models are finished training and Github Projects for task management.

Some final thoughts

We came out of this project even more motivated than when we started; a great indication that it was both enjoyable and useful for us! Our advice would be to not hesitate to join events like this one and challenge yourself, and to try to find one or more other folks in the forums or Discord to work with. After successfully delivering our submission to the challenge we are all eager to work together again on our next project. Stay tuned for more!