I was looking at the code in fastai2/fastai2/text/models/core.py and found a piece of code that seems to be unreachable. Is this intentional (since development is ongoing), or is it some leftover code?
Note the `return` right below `def _concat`:
# Cell
class SentenceEncoder(Module):
    "Create an encoder over `module` that can process a full sentence."
    def __init__(self, bptt, module, pad_idx=1, max_len=None): store_attr(self, 'bptt,module,pad_idx,max_len')

    def _concat(self, ts, bs):
        return torch.cat()
        bs,sl = ts[0].shape[0],sum([t.shape[1] for t in ts])
        res = ts[0].new_zeros(bs, sl, *ts[0].shape[2:])
        ts,xtra = (ts[:sz],ts[sz]) if len(ts) > sz else (ts,None)
        for i,j in enumerate(idxs):
            c = torch.cat([t[i] for t in ts[j:] if t.shape[0] > i] + [t[i] for t in ts[:j] if t.shape[0] > i] +
                          ([] if xtra is None or xtra.shape[0] <= i else [xtra[i]]))
            res[i,:c.shape[0]] = c
        return res
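For reference, this is the classic dead-code pattern: Python happily parses statements after a `return`, but they can never execute (and note that `torch.cat()` with no arguments would itself raise a TypeError if it were ever reached). A minimal illustration, unrelated to the fastai code:

```python
# Everything below the first return is dead code: it parses fine,
# but control flow can never reach it.
def _concat_demo(ts):
    return "early exit"               # execution always stops here
    total = sum(len(t) for t in ts)   # dead code: never runs
    return total

result = _concat_demo([[1, 2], [3]])  # always "early exit"
```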
I'm slowly trying to understand and get the TransformerXL code working. I have added a script at fastai2/fastai2/text/models/core.py and I am trying to make it run. The model loads fine.
First I got a complaint related to the MixedPrecision callback (Error: 'LMLearner' object has no attribute 'master_pgs'), which I have deactivated until things get that far.
Then I got a complaint that None had no size, which happened because MultiHeadRelativeAttention was using forward from MultiHeadAttention, so _apply_attention was missing parameters.
This is apparently solved by giving MultiHeadRelativeAttention its own forward, where we now pass kwargs.
Edit: another problem is that LinearDecoder has changed a bit (probably because the encoder output is different), so I have taken the old LinearDecoder, and the right one for each case is selected via the _model_meta dict.
And that's where I am right now. I think I will have to deal with more errors, if such a naive port of the transformer.py script from v1 is even possible (at least I am learning a bit about this part of the code).
Note that fastai 2 won't reimplement transformer models like v1 did. We plan to use the Hugging Face implementations, since they have done a great job of creating a model garden.
That's good news! I agree they are doing great work keeping things up to date and in one place, so I think that's a good decision on your side. We are also trying to make Hugging Face work for our problem. We have already found a blog post using (some of) Hugging Face on v2 for a given application, so I am confident this line of work should be easier (our unusual dataloader requirements complicate things a bit, though). I was exploring the other approach in case it proved more or less trivial to port. It's… not trivial, but it looks feasible. I have gotten as far as the optimizer.
Hey @Pablo, I wrote that post; let me know if you have any questions or if I can help, although by the looks of it you probably know more than me at this stage!
I have been studying the fastai v2 code a bit more, and I noticed something weird in one method of the _BaseOptimizer class:
def set_freeze(n, rg, ignore_force_train=False):
    for p in self.param_groups[n]: p.requires_grad_(rg or (state.get('force_train', False) and not ignore_force_train))
Note that this does not seem to be intended as a static method, because it is not declared as one and because it uses self in its body. But it is missing self as the first argument. Is this a bug?
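To frame the question, here is a standalone sketch of what the method presumably intends once `self` is the first argument. `FakeParam` and `FakeOptimizer` are mock stand-ins for illustration only, not the real fastai classes (which keep richer per-parameter state):

```python
# Mock of the intended behaviour: set_freeze(n, rg) toggles requires_grad
# on every parameter of group `n`, except that parameters can be forced
# to train via state['force_train'] unless ignore_force_train is set.
class FakeParam:
    def __init__(self): self.requires_grad = False
    def requires_grad_(self, rg): self.requires_grad = rg

class FakeOptimizer:
    def __init__(self, groups, state=None):
        self.param_groups = groups
        self.state = state or {}
    def set_freeze(self, n, rg, ignore_force_train=False):
        for p in self.param_groups[n]:
            force = self.state.get('force_train', False) and not ignore_force_train
            p.requires_grad_(rg or force)

opt = FakeOptimizer([[FakeParam(), FakeParam()]])
opt.set_freeze(0, True)   # unfreeze group 0
```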
Thanks for the post, @morgan: it's been very helpful, and it's clear that you put a lot of effort into it.
We managed to make it work for our case (one file per document, all in one folder, with multi-label data associated to these docs in a csv file).
Models like BERT work great, but they have a very relevant problem, which is that they only work with the first x tokens (512 tokens in BERT's case). Our documents are longer, so other models like XLNet require around 70GB of GPU memory even for batch size 4… so this is hard to address.
We are also working on MultiFiT, which I believe uses a much smaller model, so we can probably work with longer documents. But MultiFiT is not ported to v2 yet… We are going to try MultiFiT on a real-but-smaller dataset, which should be fine on v1, to see how promising this is (if it is groundbreaking compared to BERT, we will have to fight to make it work on v2 or somehow shorten our many docs).
If any of you have other ideas for classifying very long and very numerous documents… I'd be glad to hear them!
Ah, good to hear you're making progress! I don't have any experience with it, and you've probably already considered it, but is there the possibility of breaking your documents into chunks and then ensembling the predictions on the chunks to get a single prediction per document? Or extracting the last-layer embeddings from BERT for each of the chunks, combining them, and sending them through a linear classifier?
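For concreteness, the chunk-and-ensemble idea can be sketched roughly like this. Pure Python, with `predict_fn` standing in for whatever classifier you actually use; `512` mirrors BERT's input limit, and all names here are hypothetical:

```python
# Split a long token sequence into windows of at most `max_len` tokens,
# score each window with a classifier, and average the per-chunk scores
# into a single document-level prediction.
def chunk_tokens(tokens, max_len=512):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def ensemble_predict(tokens, predict_fn, max_len=512):
    chunks = chunk_tokens(tokens, max_len)
    probs = [predict_fn(c) for c in chunks]  # one probability per chunk
    return sum(probs) / len(probs)           # mean over chunks

# toy usage: a fake classifier that just scores by chunk length
toks = list(range(1200))                     # 1200-token "document"
score = ensemble_predict(toks, lambda c: len(c) / 512)
```

Overlapping (strided) windows instead of disjoint ones, or a max instead of a mean, are obvious variations of the same idea.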
For new transformers, I'd keep an eye on any ELECTRA PyTorch ports that get released over the next week or two, as Google Research just posted their code yesterday: https://github.com/google-research/electra (paper: https://openreview.net/pdf?id=r1xMH1BtvB). But you'll have the same problem here, as it looks like they reduced the input to 128 tokens (although one of the models does use 512 too).
Yes, we have tried chunks at inference time (with BERT). Recall rose significantly, at the cost of precision. It looks like we need to do this at training time as well, at least as a fine-tuning step. It feels a bit "hacky", so I was looking for something that can handle long texts by construction.
My still-superficial understanding of TransformerXL suggested it was the way to go, and I don't yet get why it has such crazy memory requirements.
I will post here if there are any interesting developments.
True, ensembling predictions might be a bit too hacky for the real world, but aggregating embedding layers might help on the precision side… In the recent Google QUEST Kaggle competition, a few of the gold medallists (1st and 2nd, I think) also combined the last-layer embeddings from 2 BERT models (one trained on questions, one on answers): https://www.kaggle.com/c/google-quest-challenge/discussion. Some of them have shared their code in case you need a head start.
I had not considered combining embeddings instead of predictions. It seems like it would be a bit harder to code, but it's an interesting alternative. Thanks for sharing!
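In case it helps anyone following along, here is a toy sketch of that embedding-combination idea: mean-pool per-chunk embedding vectors and feed the pooled vector to a single linear head. Plain Python, no real model; every name and number is made up for illustration:

```python
# Combine per-chunk embeddings into one document vector by mean-pooling,
# then score the pooled vector with a linear classifier head.
def mean_pool(embs):
    "Element-wise mean over a list of equal-length embedding vectors."
    n = len(embs)
    return [sum(vals) / n for vals in zip(*embs)]

def linear_head(vec, weights, bias=0.0):
    "Dot product plus bias: a single linear classification unit."
    return sum(v * w for v, w in zip(vec, weights)) + bias

chunk_embs = [[1.0, 2.0], [3.0, 4.0]]        # fake per-chunk embeddings
doc_vec = mean_pool(chunk_embs)              # pooled document vector
logit = linear_head(doc_vec, [0.5, -0.25])   # document-level score
```

In practice the pooled vector would come from BERT's last hidden layer and the head would be trained, but the aggregation step is this simple.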
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-c5acc50824c2> in <module>()
----> 1 from fastai2.basics import *
/usr/local/lib/python3.6/dist-packages/fastai2/basics.py in <module>()
----> 1 from .data.all import *
2 from .optimizer import *
3 from .callback.core import *
4 from .learner import *
5 from .metrics import *
/usr/local/lib/python3.6/dist-packages/fastai2/data/all.py in <module>()
1 from ..torch_basics import *
----> 2 from .core import *
3 from .load import *
4 from .external import *
5 from .transforms import *
/usr/local/lib/python3.6/dist-packages/fastai2/data/core.py in <module>()
114 # Cell
115 @docs
--> 116 class DataLoaders(GetAttr):
117 "Basic wrapper around several `DataLoader`s."
118 _default='train'
/usr/local/lib/python3.6/dist-packages/fastai2/data/core.py in DataLoaders()
127
128 def _set(i, self, v): self.loaders[i] = v
--> 129 train ,valid = add_props(lambda i,x: x[i], _set)
130 train_ds,valid_ds = add_props(lambda i,x: x[i].dataset)
131
/usr/local/lib/python3.6/dist-packages/fastcore/utils.py in add_props(f, n)
530 def add_props(f, n=2):
531 "Create properties passing each of `range(n)` to f"
--> 532 return (property(partial(f,i)) for i in range(n))
533
534 # Cell
TypeError: 'function' object cannot be interpreted as an integer
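For anyone curious about the mechanics of this error: the traceback shows the older fastcore signature `add_props(f, n=2)`, while the newer fastai2 code calls it with a second function (a setter) that lands in `n`, so `range(n)` receives a function instead of an integer. A minimal reproduction, with stand-in names:

```python
# Reproduce the version mismatch: old fastcore expects an integer count,
# new fastai2 passes a setter function as the second argument.
from functools import partial

def add_props_old(f, n=2):
    "Old fastcore: create `n` properties, one per index in range(n)."
    return (property(partial(f, i)) for i in range(n))

def _set(i, self, v): pass  # stand-in for the DataLoaders setter

try:
    # new-style call site: second argument is a function, not a count
    train, valid = add_props_old(lambda i, x: x[i], _set)
except TypeError as e:
    msg = str(e)  # "'function' object cannot be interpreted as an integer"
```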
So I figured out that the latest fastcore version wasn't pushed to pypi yet (which seems normal, as I guess it isn't pushed after every single change). To eliminate this error, I installed the latest fastcore version (0.1.15) by doing this:
I was wondering if there is any mechanism to automatically sync the installation of the latest version (from repos) of both fastai2 and fastcore when we directly install fastai2 from master.
For this same reason, I always installed the editable version of fastai2 like this:
pip install -e .
instead of this:
pip install -e ".[dev]"
And then, I install the editable version of nbdev.
@farid that's a good point - if you use fastai2 from master, you need to do the same for fastcore. And you need to git pull both whenever you update.
Thank you Jeremy. I was wondering if, after git pulling both, we have to pip install them each time. By the way, this is what I'm doing now, but I was wondering if it's the proper way to do it.
I thought that somehow they would auto-magically pip themselves up, but I guess this is what we call laziness in the real world!
That's what I did, but then, several times, I realized that my local fastai2 (and fastcore) were lagging behind. For instance, this morning my fastai2 version stayed at 0.0.11 and fastcore at 0.1.13. So I used pip install -e . for both of them. I will see what happens when a new version is pushed.
So I guess that, by using -e, pip install creates a watcher that observes whether any pull takes place. Is that a fair assumption?