Fastai v2 text

Yah that works … thanks.

What if, instead of all the labels space-delimited in a single column, we have one column per label, each being 0 or 1? For example, a DataFrame that looks something like this with respect to the labels:
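(an illustrative sketch; the label columns are the ones listed further down in the thread)

   text                              is_very_positive  is_positive  is_very_negative  is_negative  is_suggestion
0  "The pay stations for ..."        0                 0            1                 1            0
1  "Love the new gym hours ..."      0                 1            0                 0            0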

Normally, if you pass a list of labels to ColReader, it assumes they are one-hot encoded.

I would expect the behavior to be the same whether we put all the labels in a single column or each label in a separate column, but it isn't.

Approach #1:

  1. I get a separate vocab for the targets
  2. The target datatype is TensorMultiCategory
  3. I see the input and targets when I call show
x_tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok), 
    Numericalize(vocab=vocab)
]

y_tfms = [ 
    ColReader('labels', label_delim=' '),
    MultiCategorize(vocab=SENT_LABELS[1:], add_na=False) 
]

dsets = Datasets(items=df,
                 tfms=[x_tfms, y_tfms], 
                 splits=ColSplitter(col='is_valid')(df), 
                 dl_type=SortedDL)

print(dsets.train[21])
#(TensorText([    2,     8,    10,   146,   518,    19,  1392,    34,    21,    53,
#          221,    11,   111,    29,    15,   961,    68,  3102,   533,   200,
#          14,    23,    11, 10426,    22,   487,    48,     0,   243,    20,
#          393,     9,  1434,   159,    16, 12065,  3410,     9,    14,   437,
#          15,   563,    97,    30,   210,  4183,   161, 11518,    13,   231,
#          754,  1852,     9,     8,    35,    16,    67,  2698,  1354,    22,
#            9,     8,    12,    28,   460,   170,  1253,    11,  4013,     9,
#           14,   260,   205,    95,    16,    18,  4929,    13,    46,    36,
#           39,   355,    22,  1877,    59,     9,     8,    14,   189,   382,
#         1042,   205,   210,     9]), TensorMultiCategory([6, 2]))

dsets.show(dsets.train[21])
#xxbos xxmaj the pay stations for visitor parking are so difficult to use with a credit / debit card … i have
#to insert that thing like xxunk before it works . enforcement ▁ is legitimately insane . i got a ticket
#because my permit fell over sideways , still completely visible . xxmaj there is no rule against that .
#xxmaj and they almost never listen to appeals . i understand your department is in debt , but do n't pass
#that onto me . xxmaj i 'm already buying your permit .
#is_very_negative;is_negative

Approach #2:

  1. There is no vocab for the targets
  2. The target shows as a tuple (e.g., (0, 0, 1, 1, 0, 0, 0, 0))
  3. I see the input but not the targets when I call show
x_tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules, mark_fields=include_fld_tok), 
    Numericalize(vocab=vocab)
]

y_tfms = [ 
    ColReader(SENT_LABELS[1:])
]

dsets = Datasets(items=df,
                 tfms=[x_tfms, y_tfms], 
                 splits=ColSplitter(col='is_valid')(df), 
                 dl_type=SortedDL)

print(dsets.train[21])
#(TensorText([    2,     8,    10,   146,   518,    19,  1392,    34,    21,    53,
#          221,    11,   111,    29,    15,   961,    68,  3102,   533,   200,
#           14,    23,    11, 10426,    22,   487,    48,     0,   243,    20,
#          393,     9,  1434,   159,    16, 12065,  3410,     9,    14,   437,
#           15,   563,    97,    30,   210,  4183,   161, 11518,    13,   231,
#          754,  1852,     9,     8,    35,    16,    67,  2698,  1354,    22,
#            9,     8,    12,    28,   460,   170,  1253,    11,  4013,     9,
#           14,   260,   205,    95,    16,    18,  4929,    13,    46,    36,
#           39,   355,    22,  1877,    59,     9,     8,    14,   189,   382,
#         1042,   205,   210,     9]), (0, 0, 1, 1, 0, 0, 0, 0))

dsets.show(dsets.train[21])
# xxbos xxmaj the pay stations for visitor parking are so difficult to use with a credit / debit card … i have
# to insert that thing like xxunk before it works . enforcement ▁ is legitimately insane . i got a ticket
# because my permit fell over sideways , still completely visible . xxmaj there is no rule against that . 
# xxmaj and they almost never listen to appeals . i understand your department is in debt , but do n't pass 
# that onto me . xxmaj i 'm already buying your permit .

Also when I call dls.show_batch using either approach, I get an exception (but that is for a different post).

Note that OneHotEncode is a separate transform from MultiCategorize (which you will need to add in approach 1, or your loss function won't be happy). In the second approach, if you add it with the proper vocab, show will work.
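For approach 1 that would look something like this (a sketch; my understanding is that OneHotEncode can infer c during setup from the vocab MultiCategorize builds, otherwise pass c explicitly):

y_tfms = [
    ColReader('labels', label_delim=' '),
    MultiCategorize(vocab=SENT_LABELS[1:], add_na=False),
    OneHotEncode()  # turns the category indices into a float 0/1 vector for the loss
]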

At the higher level, the data block API does all of that for you (as long as you pass the right block).

How do you add it with the proper vocab? OneHotEncode doesn’t take a vocab (just a c).

Ah yes, sorry, you need EncodedMultiCategorize for encoded labels, and the OneHotEncode transform in both. Look at the data.transforms notebook for examples.

Also note that the low-level API is meant to be used when you have to write your own transforms for specific tasks. In your case, everything is handled by the data block API at the upper level.

Understood. I just find that it helps me understand what is going on once I'm able to implement things with the low-level APIs.

Btw, this still doesn’t seem to work (assuming I’ve added the transforms together properly):

y_tfms = [ 
    ColReader(SENT_LABELS[1:]),
    EncodedMultiCategorize(vocab=SENT_LABELS[1:]),
    OneHotEncode()
]

Exception:

IndexError                                Traceback (most recent call last)
<timed exec> in <module>

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
    259     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    260         super().__init__(dl_type=dl_type)
--> 261         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    262         self.n_inp = (1 if len(self.tls)==1 else len(self.tls)-1) if n_inp is None else n_inp
    263 

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in <listcomp>(.0)
    259     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    260         super().__init__(dl_type=dl_type)
--> 261         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    262         self.n_inp = (1 if len(self.tls)==1 else len(self.tls)-1) if n_inp is None else n_inp
    263 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     39             return x
     40 
---> 41         res = super().__call__(*((x,) + args), **kwargs)
     42         res._newchk = 0
     43         return res

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in __init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose)
    200         if do_setup:
    201             pv(f"Setting up {self.tfms}", verbose)
--> 202             self.setup(train_setup=train_setup)
    203 
    204     def _new(self, items, **kwargs): return super()._new(items, tfms=self.tfms, do_setup=False, types=self.types, **kwargs)

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in setup(self, train_setup)
    219             for f in self.tfms.fs:
    220                 self.types.append(getattr(f, 'input_types', type(x)))
--> 221                 x = f(x)
    222             self.types.append(type(x))
    223         types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in __call__(self, x, **kwargs)
     70     @property
     71     def name(self): return getattr(self, '_name', _get_name(self))
---> 72     def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
     73     def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
     74     def __repr__(self): return f'{self.name}: {self.encodes} {self.decodes}'

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in _call(self, fn, x, split_idx, **kwargs)
     82         f = getattr(self, fn)
     83         if not _is_tuple(x): return self._do_call(f, x, **kwargs)
---> 84         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
     85         return retain_type(res, x)
     86 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in <genexpr>(.0)
     82         f = getattr(self, fn)
     83         if not _is_tuple(x): return self._do_call(f, x, **kwargs)
---> 84         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
     85         return retain_type(res, x)
     86 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in _do_call(self, f, x, **kwargs)
     86 
     87     def _do_call(self, f, x, **kwargs):
---> 88         return x if f is None else retain_type(f(x, **kwargs), x, f.returns_none(x))
     89 
     90 add_docs(Transform, decode="Delegate to `decodes` to undo transform", setup="Delegate to `setups` to set up transform")

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/dispatch.py in __call__(self, *args, **kwargs)
     96         if not f: return args[0]
     97         if self.inst is not None: f = MethodType(f, self.inst)
---> 98         return f(*args, **kwargs)
     99 
    100     def __get__(self, inst, owner):

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/transforms.py in encodes(self, o)
    237         if not self.c: warn("Couldn't infer the number of classes, please pass a value for `c` at init")
    238 
--> 239     def encodes(self, o): return TensorMultiCategory(one_hot(o, self.c).float())
    240     def decodes(self, o): return one_hot_decode(o, None)
    241 

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/torch_core.py in one_hot(x, c)
    490     "One-hot encode `x` with `c` classes."
    491     res = torch.zeros(c, dtype=torch.uint8)
--> 492     if isinstance(x, Tensor) and x.numel()>0: res[x] = 1.
    493     else: res[list(L(x, use_list=None))] = 1.
    494     return res

IndexError: tensors used as indices must be long, byte or bool tensors

Mmmm, are your columns considered as floats by any chance? (That would still need a fix in fastai2 if it's the case; just trying to find the problem.)
If you %debug, what is x? What is its dtype?
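For reference, in a notebook that looks like:

# run in the cell right after the one that raised the IndexError;
# %debug drops you into the post-mortem debugger at the failing frame
%debug
# then at the ipdb prompt, inspect the argument one_hot received:
#   x
#   x.dtype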

No, they are int64

is_very_positive             int64
is_positive                  int64
is_very_negative             int64
is_negative                  int64
is_suggestion                int64

So, trying the DataBlock API …

cls_blocks = (
    TextBlock.from_df(corpus_cols, vocab=vocab, seq_len=bptt, rules=custom_tok_rules, mark_fields=include_fld_tok),
    MultiCategoryBlock(encoded=True, vocab=SENT_LABELS[1:])
)

cls_dblock = DataBlock(blocks=cls_blocks, 
                       get_x=ColReader('text'),
                       get_y=ColReader(SENT_LABELS[1:]),
                       splitter=ColSplitter(col='is_valid'))

… it does not work as intended (perhaps I'm missing a block?) … dls.show_batch() throws the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-283-90634fcc3c9e> in <module>
----> 1 dls.show_batch()

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in show_batch(self, b, max_n, ctxs, show, **kwargs)
     90         if b is None: b = self.one_batch()
     91         if not show: return self._pre_show_batch(b, max_n=max_n)
---> 92         show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)
     93 
     94     def show_results(self, b, out, max_n=9, ctxs=None, show=True, **kwargs):

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in _pre_show_batch(self, b, max_n)
     83         b = self.decode(b)
     84         if hasattr(b, 'show'): return b,None,None
---> 85         its = self._decode_batch(b, max_n, full=False)
     86         if not is_listy(b): b,its = [b],L((o,) for o in its)
     87         return detuplify(b[:self.n_inp]),detuplify(b[self.n_inp:]),its

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in _decode_batch(self, b, max_n, full)
     77         f = self.after_item.decode
     78         f = compose(f, partial(getattr(self.dataset,'decode',noop), full = full))
---> 79         return L(batch_to_samples(b, max_n=max_n)).map(f)
     80 
     81     def _pre_show_batch(self, b, max_n=9):

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in map(self, f, *args, **kwargs)
    360              else f.format if isinstance(f,str)
    361              else f.__getitem__)
--> 362         return self._new(map(g, self))
    363 
    364     def filter(self, f, negate=False, **kwargs):

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in _new(self, items, *args, **kwargs)
    313     @property
    314     def _xtra(self): return None
--> 315     def _new(self, items, *args, **kwargs): return type(self)(items, *args, use_list=None, **kwargs)
    316     def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
    317     def copy(self): return self._new(self.items.copy())

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     39             return x
     40 
---> 41         res = super().__call__(*((x,) + args), **kwargs)
     42         res._newchk = 0
     43         return res

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in __init__(self, items, use_list, match, *rest)
    304         if items is None: items = []
    305         if (use_list is not None) or not _is_array(items):
--> 306             items = list(items) if use_list else _listify(items)
    307         if match is not None:
    308             if is_coll(match): match = len(match)

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in _listify(o)
    240     if isinstance(o, list): return o
    241     if isinstance(o, str) or _is_array(o): return [o]
--> 242     if is_iter(o): return list(o)
    243     return [o]
    244 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/foundation.py in __call__(self, *args, **kwargs)
    206             if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    207         fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 208         return self.fn(*fargs, **kwargs)
    209 
    210 # Cell

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/utils.py in _inner(x, *args, **kwargs)
    339     if order is not None: funcs = funcs.sorted(order)
    340     def _inner(x, *args, **kwargs):
--> 341         for f in L(funcs): x = f(x, *args, **kwargs)
    342         return x
    343     return _inner

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in decode(self, o, full)
    271     def __iter__(self): return (self[i] for i in range(len(self)))
    272     def __repr__(self): return coll_repr(self)
--> 273     def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
    274     def subset(self, i): return type(self)(tls=L(tl.subset(i) for tl in self.tls), n_inp=self.n_inp)
    275     def _new(self, items, *args, **kwargs): return super()._new(items, tfms=self.tfms, do_setup=False, **kwargs)

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in <genexpr>(.0)
    271     def __iter__(self): return (self[i] for i in range(len(self)))
    272     def __repr__(self): return coll_repr(self)
--> 273     def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
    274     def subset(self, i): return type(self)(tls=L(tl.subset(i) for tl in self.tls), n_inp=self.n_inp)
    275     def _new(self, items, *args, **kwargs): return super()._new(items, tfms=self.tfms, do_setup=False, **kwargs)

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/core.py in decode(self, o, **kwargs)
    208     def __iter__(self): return (self[i] for i in range(len(self)))
    209     def show(self, o, **kwargs): return self.tfms.show(o, **kwargs)
--> 210     def decode(self, o, **kwargs): return self.tfms.decode(o, **kwargs)
    211     def __call__(self, o, **kwargs): return self.tfms.__call__(o, **kwargs)
    212     def overlapping_splits(self): return L(Counter(self.splits.concat()).values()).filter(gt(1))

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in decode(self, o, full)
    195         for f in reversed(self.fs):
    196             if self._is_showable(o): return o
--> 197             o = f.decode(o, split_idx=self.split_idx)
    198         return o
    199 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in decode(self, x, **kwargs)
     71     def name(self): return getattr(self, '_name', _get_name(self))
     72     def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
---> 73     def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
     74     def __repr__(self): return f'{self.name}: {self.encodes} {self.decodes}'
     75 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in _call(self, fn, x, split_idx, **kwargs)
     82         f = getattr(self, fn)
     83         if not _is_tuple(x): return self._do_call(f, x, **kwargs)
---> 84         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
     85         return retain_type(res, x)
     86 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in <genexpr>(.0)
     82         f = getattr(self, fn)
     83         if not _is_tuple(x): return self._do_call(f, x, **kwargs)
---> 84         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
     85         return retain_type(res, x)
     86 

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/transform.py in _do_call(self, f, x, **kwargs)
     86 
     87     def _do_call(self, f, x, **kwargs):
---> 88         return x if f is None else retain_type(f(x, **kwargs), x, f.returns_none(x))
     89 
     90 add_docs(Transform, decode="Delegate to `decodes` to undo transform", setup="Delegate to `setups` to set up transform")

~/development/_training/ml/nlp-playground/_libs/fastcore/fastcore/dispatch.py in __call__(self, *args, **kwargs)
     96         if not f: return args[0]
     97         if self.inst is not None: f = MethodType(f, self.inst)
---> 98         return f(*args, **kwargs)
     99 
    100     def __get__(self, inst, owner):

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/data/transforms.py in decodes(self, o)
    246     def __init__(self, vocab): self.vocab,self.c = vocab,len(vocab)
    247     def encodes(self, o): return TensorCategory(tensor(o).float())
--> 248     def decodes(self, o): return MultiCategory (one_hot_decode(o, self.vocab))
    249 
    250 # Cell

~/development/_training/ml/nlp-playground/_libs/fastai2/fastai2/torch_core.py in one_hot_decode(x, vocab)
    496 # Cell
    497 def one_hot_decode(x, vocab=None):
--> 498     return L(vocab[i] if vocab else i for i,x_ in enumerate(x) if x_==1)
    499 
    500 # Cell

~/anaconda3/envs/playground-nlp/lib/python3.7/site-packages/torch/tensor.py in __iter__(self)
    454         # map will interleave them.)
    455         if self.dim() == 0:
--> 456             raise TypeError('iteration over a 0-d tensor')
    457         if torch._C._get_tracing_state():
    458             warnings.warn('Iterating over a tensor might cause the trace to be incorrect. '

TypeError: iteration over a 0-d tensor

I did notice that each target is a tuple of TensorCategory objects (e.g., (TensorCategory(0.), TensorCategory(0.), TensorCategory(1.), TensorCategory(1.), TensorCategory(0.), TensorCategory(0.), TensorCategory(0.), TensorCategory(0.)))
rather than a single TensorMultiCategory object, like the one I get when I use a single column with the labels space-delimited.

I think the problem is in EncodedMultiCategorize.

The encodes should return a TensorCategory that is a long:

def encodes(self, o): return TensorCategory(tensor(o).long())

Right now it's a float, and if you add OneHotEncode you get an exception … IndexError: tensors used as indices must be long, byte or bool tensors.

Sorry, I looked again this morning and EncodedMultiCategorize works on its own without OneHotEncode (don't trust me, trust the notebook and the tests that are there). For your data block, you should tell your block that your categories are one-hot encoded by using MultiCategoryBlock(encoded=True) (a block does not have access to the data, so it can't guess anything).

… unfortunately your tests don’t account for tuples (which is what ColReader returns).

_tfm = EncodedMultiCategorize(vocab=['no', 'yes'])
_tfm([0,1])
# returns TensorCategory([0., 1.])
_tfm((0,1))
# returns (TensorCategory(0.), TensorCategory(1.))

I also don’t understand why it returns TensorCategory instead of TensorMultiCategory (the latter seems more appropriate given that the transform is for multi-labelled datasets, and it is what both OneHotEncode and MultiCategorize return via encodes).
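Something like this, I'd imagine (a sketch of just the return-type change; handling the tuple that ColReader returns is a separate issue):

# hypothetical fixed encodes for EncodedMultiCategorize
def encodes(self, o): return TensorMultiCategory(tensor(o).float())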

Fixed those two things.

Checks out. Thanks!

I am having issues getting a text model up and running in fastai v2. I am working with protein sequences, hence I need to define a custom tokenizer.

from fastai2.basics import *
from fastai2.text.all import *

BOS,EOS,FLD,UNK,PAD = 'xxbos','xxeos','xxfld','xxunk','xxpad'
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

# keep only the padding token as a special token
defaults.text_spec_tok = [PAD]

class MolTokenizer(BaseTokenizer):
    def __init__(self, split_char=' '):
        self.split_char = split_char
    def __call__(self, items):
        # one token per character, wrapped in GO/END markers
        return (['GO'] + list(t.upper()) + ['END'] for t in items)
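A quick sanity check of the tokenizer on its own gives what I expect:

tok = MolTokenizer()
print(list(tok(['mkvl'])))
# [['GO', 'M', 'K', 'V', 'L', 'END']]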

I then begin with the following

bs = 64
corpus_train = pd.read_csv('./processed/train.csv.gzip', index_col=None, compression='gzip')
corpus_valid = pd.read_csv('./processed/valid.csv.gzip', index_col=None, compression='gzip')
corpus_train['is_valid'] = False
corpus_valid['is_valid'] = True
corpus = corpus_train.append(corpus_valid, ignore_index=True)

path = './test/'
df_tok, count = tokenize_df(corpus, 'sequence', rules=[], tok_func=partial(MolTokenizer))
dls_lm = TextDataLoaders.from_df(df_tok, path=path, text_vocab=make_vocab(count, min_freq=1),
                                 text_col='text', is_lm=True, valid_col='is_valid')

everything checks out for my vocab

dls_lm.train_ds.vocab
['xxpad', 'L', 'A', 'G', 'V', 'E', 'S', 'I', 'K', 'R', 'D', 'T', 'P', 'N', 'Q', 'F', 'Y', 'M', 'H', 'C', 'W', 'GO', 'END', 'xxfake']

and df_tok looks good

df_tok.text.head(2)
0    [GO, M, A, N, Y, T, A, A, D, I, K, A, L, R, E, R, T, G, A, G, M, M, D, V, K, K, A, L, D, E, A, N, G, D, A, E, K, A, I, E, I, I, R, I, K, G, L, K, G, A, T, K, R, E, G, R, S, T, A, E, G, L, V, A, A, K, V, N, G, G, V, G, V, M, I, E, V, N, C, E, T, D, F, V, A, K, A, D, K, F, I, Q, L, A, D, K, V, L, N, V, ...]
1    [GO, M, P, K, S, R, R, A, V, S, L, S, V, L, I, G, A, V, I, A, A, L, A, G, A, L, I, A, V, T, V, P, A, R, P, N, R, P, E, A, D, R, E, A, L, W, K, I, V, H, D, R, C, E, F, G, Y, R, R, T, G, A, Y, A, P, C, T, F, V, D, E, Q, S, G, T, A, L, Y, K, A, D, F, D, P, Y, Q, F, L, L, I, P, L, A, R, I, T, G, I, E, D, ...]

However, something is going wrong either during numericalization or when generating the batches:

xx, yy = dls_lm.one_batch()
xx[:5]
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       device='cuda:1')

All my data gets numericalized to a zero token. Does anyone see what I am doing wrong?

It seems the fastai/fastbook tutorials and lectures for NLP focus on English, especially with the use of spaCy. If more effort were given to non-Western languages, especially those from the global south, or at the very least a neutral tokenizer (such as SentencePiece) were used as the default rather than a Western, opinionated tokenizer (spaCy), it would go a long way toward greater inclusion and diversity.

How would one plug fastai's SentencePiece support into this fastbook tutorial? https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb

I've had to build the SentencePiece model and vocab manually, but I'm unsure how to plug them into the provided notebook. In particular, how would I plug SentencePiece into fastai2's Tokenizer class?

I’m curious, what languages do you have in mind? Is tokenization so different in these cases? I know for some languages (Chinese being an obvious example) tokenization is very different.

I'm focusing on Nguni languages (Southern Africa), and in previous fastai versions (the part 2 2018 course) I was able to use SentencePiece. In fastai2 it's not so clear how to go from SentencePieceTokenizer to dls_lm:


dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

and finally to

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
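My best guess is that the tokenizer gets swapped in at the TextBlock, something like this (a sketch; I'm assuming TextBlock.from_folder forwards a tok_func argument the way tokenize_df does, which I haven't verified):

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok_func=SentencePieceTokenizer),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)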