TextDataBunch from file/string?

I am trying to make my own little next-word-guesser. I have all my data cleaned and in a file called chatlog.txt

After being unable to decipher how to use TextDataBunch, I tried to use the from_folder option with only one file in the folder.

I then get the following error:

NameError                                 Traceback (most recent call last)
<ipython-input-8-df7f6b5b8bc8> in <module>()
----> 1 data_lm = TextDataBunch.from_folder('chats')

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in from_folder(cls, path, train, valid, test, classes, tokenizer, vocab, **kwargs)
    180         path = Path(path)
    181         processor = _get_processor(tokenizer=tokenizer, vocab=vocab, **kwargs)
--> 182         src = (TextFilesList.from_folder(path)
    183                             .split_by_folder(train=train, valid=valid)
    184                             .label_from_folder(classes=classes))

NameError: name 'TextFilesList' is not defined

This is running on Google Colab:

=== Software === 
python version  : 3.6.6
fastai version  : 1.0.27
torch version   : 1.0.0.dev20181116
nvidia driver   : 396.44
torch cuda ver  : 9.2.148
torch cuda is   : available
torch cudnn ver : 7104
torch cudnn is  : enabled

=== Hardware === 
nvidia gpus     : 1
torch available : 1
  - gpu0        : 11441MB | Tesla K80

=== Environment === 
platform        : Linux-4.14.65+-x86_64-with-Ubuntu-18.04-bionic
distro          : #1 SMP Sun Sep 9 02:18:33 PDT 2018
conda env       : Unknown
python          : /usr/bin/python3
sys.path        : 
/env/python
/usr/lib/python36.zip
/usr/lib/python3.6
/usr/lib/python3.6/lib-dynload
/usr/local/lib/python3.6/dist-packages
/usr/lib/python3/dist-packages
/usr/local/lib/python3.6/dist-packages/IPython/extensions

Sun Nov 18 16:22:58 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    27W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


I get the same error with what I believe is the correct default folder structure.

Later attempts have their own frustrations:

data_lm = (TextList.from_folder('chatdata')                           
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'valid']) 
            .random_split_by_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()         )  
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in na_op(x, y)
    675         try:
--> 676             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    677         except TypeError:

/usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    203     if use_numexpr:
--> 204         return _evaluate(op, op_str, a, b, **eval_kwargs)
    205     return _evaluate_standard(op, op_str, a, b)

/usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    118     if result is None:
--> 119         result = _evaluate_standard(op, op_str, a, b)
    120 

/usr/local/lib/python3.6/dist-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

/usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in <lambda>(x, y)
     77                          default_axis=default_axis),
---> 78         radd=arith_method(lambda x, y: y + x, names('radd'), op('+'),
     79                           default_axis=default_axis),

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-78-9e0aa21f5d42> in <module>()
      2            #Inputs: all the text files in path
      3             .filter_by_folder(include=['train', 'valid'])
----> 4             .random_split_by_pct(0.1)
      5            #We randomly split and keep 10% (10,000 reviews) for validation
      6             .label_for_lm()         )  

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in _inner(*args, **kwargs)
    294             self.valid = fv(*args, **kwargs)
    295             self.__class__ = LabelLists
--> 296             self.process()
    297             return self
    298         return _inner

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self)
    330     def process(self):
    331         xp,yp = self.get_processors()
--> 332         for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
    333         return self
    334 

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
    396             filt = array([o is None for o in self.y])
    397             if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 398         self.x.process(xp)
    399         return self
    400 

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, processor)
     46         if processor is not None: self.processor = processor
     47         self.processor = listify(self.processor)
---> 48         for p in self.processor: p.process(self)
     49         return self
     50 

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in process(self, ds)
    265     def process_one(self, item):  return self.tokenizer._process_all_1([item])[0]
    266     def process(self, ds):
--> 267         ds.items = _join_texts(ds.items, self.mark_fields)
    268         tokens = []
    269         for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):

/usr/local/lib/python3.6/dist-packages/fastai/text/data.py in _join_texts(texts, mark_fields)
    321     if is1d(texts): texts = texts[:,None]
    322     df = pd.DataFrame({i:texts[:,i] for i in range(texts.shape[1])})
--> 323     text_col = f'{BOS} {FLD} {1} ' + df[0] if mark_fields else  f'{BOS} ' + df[0]
    324     for i in range(1,len(df.columns)):
    325         text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i]

/usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
    737                 lvalues = lvalues.values
    738 
--> 739         result = wrap_results(safe_na_op(lvalues, rvalues))
    740         return construct_result(
    741             left,

/usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
    698         try:
    699             with np.errstate(all='ignore'):
--> 700                 return na_op(lvalues, rvalues)
    701         except Exception:
    702             if isinstance(rvalues, ABCSeries):

/usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in na_op(x, y)
    684                 result = np.empty(len(x), dtype=x.dtype)
    685                 mask = notna(x)
--> 686                 result[mask] = op(x[mask], y)
    687             else:
    688                 raise TypeError("{typ} cannot perform the operation "

/usr/local/lib/python3.6/dist-packages/pandas/core/ops.py in <lambda>(x, y)
     76         add=arith_method(operator.add, names('add'), op('+'),
     77                          default_axis=default_axis),
---> 78         radd=arith_method(lambda x, y: y + x, names('radd'), op('+'),
     79                           default_axis=default_axis),
     80         sub=arith_method(operator.sub, names('sub'), op('-'),

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

where

!find chatdata

chatdata
chatdata/train
chatdata/train/log.txt
chatdata/valid
chatdata/valid/log.txt
chatdata/tmp
chatdata/tmp/valid_ids.npy
chatdata/tmp/valid_lbl.npy
chatdata/tmp/itos.pkl
chatdata/tmp/train_ids.npy
chatdata/tmp/classes.txt
chatdata/tmp/train_lbl.npy

Yes, it was a remnant from the old API. Just pushed a fix, this method should work now.

I have fastai version : 1.0.30

I get the TypeError: ufunc ‘add’ did not contain a loop with signature matching types
and it might be the same bug.

How would I check whether my version includes your fix yet?

Thanks
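
A rough way to check, sketched under assumptions: compare the installed version with the one from the original report. The helper `version_tuple` is a hypothetical name, not part of fastai, and the thread does not say which release contains the fix, so this only tells you whether you are on something newer than the version that showed the bug.

```python
def version_tuple(version):
    """Turn a dotted version string such as '1.0.30' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

reported = "1.0.27"   # version from the original traceback in this thread
installed = "1.0.30"  # version quoted above; with fastai installed, use fastai.__version__

# Compare tuples, not raw strings: as strings, "1.0.4" would sort after "1.0.30".
is_newer = version_tuple(installed) > version_tuple(reported)
print(is_newer)
```

This assumes purely numeric version parts; a development build with a suffix such as `.dev0` would need extra handling.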

It should. Can you share more of your code?

Sure, this is the line of code.

lmdata = TextLMDataBunch.from_folder(path=PATH, train=TRN, valid=VAL)

The three variables were defined earlier in the code as strings containing paths that are known to exist. VAL contains 20,000 small text files and TRN contains 1.5 million.

PATH='/mnt/fastssd/bot_subreddit_recom/'
TRN='/mnt/fastssd/bot_subreddit_recom/train/'
VAL='/mnt/fastssd/bot_subreddit_recom/valid/'

The error output started like this:

TypeError Traceback (most recent call last)
~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
1011 try:
-> 1012 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
1013 except TypeError:

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
204 if use_numexpr:
–> 205 return _evaluate(op, op_str, a, b, **eval_kwargs)
206 return _evaluate_standard(op, op_str, a, b)

Note that train and valid should be just ‘train’ and ‘valid’ in your case (which is the default). Maybe that’s why you have this bug?

lmdata = TextLMDataBunch.from_folder(path=PATH, train='train', valid='valid')
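
If the subfolder paths already exist as absolute strings, the bare folder names can be derived instead of retyped. This is a minimal sketch using only the standard library; `folder_name` is a hypothetical helper, not a fastai function.

```python
from pathlib import PurePosixPath

def folder_name(root, subfolder):
    """Return `subfolder` expressed relative to `root`, e.g. 'train'."""
    return str(PurePosixPath(subfolder).relative_to(PurePosixPath(root)))

PATH = '/mnt/fastssd/bot_subreddit_recom/'
TRN = '/mnt/fastssd/bot_subreddit_recom/train/'
VAL = '/mnt/fastssd/bot_subreddit_recom/valid/'

train_name = folder_name(PATH, TRN)
valid_name = folder_name(PATH, VAL)
print(train_name, valid_name)
```

The resulting `train_name` and `valid_name` are the values that would be passed as `train=` and `valid=`.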

Sadly the suggestion, although it sounded reasonable to me as well, made no difference.

Result:

TypeError Traceback (most recent call last)
~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
1011 try:
-> 1012 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
1013 except TypeError:

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
204 if use_numexpr:
–> 205 return _evaluate(op, op_str, a, b, **eval_kwargs)
206 return _evaluate_standard(op, op_str, a, b)

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
119 if result is None:
–> 120 result = _evaluate_standard(op, op_str, a, b)
121

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
64 with np.errstate(all='ignore'):
—> 65 return op(a, b)
66

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in radd(left, right)
112 def radd(left, right):
–> 113 return right + left
114

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
in
1 #lmdata = TextLMDataBunch.from_folder(path='.', train='train', valid='valid') # todo
2 #lmdata = TextLMDataBunch.from_folder(path=PATH, train=TRN, valid=VAL) # todo TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')
----> 3 lmdata = TextLMDataBunch.from_folder(path=PATH, train='train', valid='valid')

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/text/data.py in from_folder(cls, path, train, valid, test, classes, tokenizer, vocab, **kwargs)
191 src = (TextList.from_folder(path, processor=processor)
192 .split_by_folder(train=train, valid=valid))
–> 193 src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_folder(classes=classes)
194 if test is not None: src.add_test_folder(path/test)
195 return src.databunch(**kwargs)

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
348 self.valid = fv(*args, **kwargs)
349 self.__class__ = LabelLists
–> 350 self.process()
351 return self
352 return _inner

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/data_block.py in process(self)
392 def process(self):
393 xp,yp = self.get_processors()
–> 394 for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
395 return self
396

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
483 filt = array([o is None for o in self.y])
484 if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
–> 485 self.x.process(xp)
486 return self
487

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/data_block.py in process(self, processor)
58 if processor is not None: self.processor = processor
59 self.processor = listify(self.processor)
—> 60 for p in self.processor: p.process(self)
61 return self
62

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/text/data.py in process(self, ds)
241 def process_one(self, item): return self.tokenizer._process_all_1([item])[0]
242 def process(self, ds):
–> 243 ds.items = _join_texts(ds.items, self.mark_fields)
244 tokens = []
245 for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):

~/anaconda3/envs/py36/lib/python3.6/site-packages/fastai/text/data.py in _join_texts(texts, mark_fields)
326 if is1d(texts): texts = texts[:,None]
327 df = pd.DataFrame({i:texts[:,i] for i in range(texts.shape[1])})
–> 328 text_col = f'{BOS} {FLD} {1} ' + df[0] if mark_fields else f'{BOS} ' + df[0]
329 for i in range(1,len(df.columns)):
330 text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i]

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right)
1067 rvalues = rvalues.values
1068
-> 1069 result = safe_na_op(lvalues, rvalues)
1070 return construct_result(left, result,
1071 index=left.index, name=res_name, dtype=None)

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
1031 try:
1032 with np.errstate(all='ignore'):
-> 1033 return na_op(lvalues, rvalues)
1034 except Exception:
1035 if is_object_dtype(lvalues):

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
1021 result = np.empty(len(x), dtype=x.dtype)
1022 mask = notna(x)
-> 1023 result[mask] = op(x[mask], y)
1024
1025 result, changed = maybe_upcast_putmask(result, ~mask, np.nan)

~/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/ops.py in radd(left, right)
111
112 def radd(left, right):
–> 113 return right + left
114
115

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

(Update: I pasted the entire error message.)

Could you enclose your error message within a code block (you can find it in the top bar of your message)? It's easier to read that way.

OK dude. I’ll get back in a moment for it.

I also maybe fixed this particular bug! :smile: Now it works right locally for me in my notebook.

It’s a casting problem. Pandas can get confused when concatenating columns if a number appears somewhere in the data as a “word” or “token”. This file does strictly text work, so “adding” should always mean string concatenation, never integer addition. Here’s the code fix (data.py):

def _join_texts(texts:Collection[str], mark_fields:bool=False):
    if not isinstance(texts, np.ndarray): texts = np.array(texts)
    if is1d(texts): texts = texts[:,None]
    df = pd.DataFrame({i:texts[:,i] for i in range(texts.shape[1])})
    #text_col = f'{BOS} {FLD} {1} ' + df[0] if mark_fields else  f'{BOS} ' + df[0]
    text_col = f'{BOS} {FLD} {1} ' + df[0].astype(str) if mark_fields else  f'{BOS} ' + df[0].astype(str)
    for i in range(1,len(df.columns)):
        #text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i]
        text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i].astype(str)
    return text_col.values

The file is fastai/text/data.py.
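
To see the idea without pandas, here is a pure-Python analog of the join above. It is a sketch, not the library code: `join_fields` is a hypothetical name, and the marker values `xxbos` and `xxfld` are assumed for illustration.

```python
BOS, FLD = 'xxbos', 'xxfld'  # marker values assumed for this sketch

def join_fields(rows, mark_fields=False):
    """Join the fields of each row into one string, casting every field to str."""
    joined = []
    for row in rows:
        parts = [BOS]
        for i, field in enumerate(row, start=1):
            if mark_fields:
                parts += [FLD, str(i)]  # field marker followed by its 1-based index
            parts.append(str(field))    # the cast is the whole point of the fix
        joined.append(' '.join(parts))
    return joined

# Without the cast, a numeric "token" breaks string concatenation.
try:
    'xxbos ' + 42
except TypeError as err:
    print('TypeError:', err)

print(join_fields([['hello there'], [42]]))
```

The second row holds the integer `42`, which is the kind of value that would trip the uncast version.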

If someone else could try it too (@sgugger) and, if you like it, check it into the codebase, that would be nice. Or maybe I can open a pull request myself. I have never done one before, but I’m game.

HTH,

gob


Please do a PR with this fix :slight_smile:

I think I found a similar issue. I’m working on a Chechen language model. I got the same error ricknta got in this part1v3 thread.

Short: use multiple text files when reading from folder for an LM. It appears fastai.text treats text files the way vision treats individual images. Issue: building a data_lm object from text files.

What I noticed is that I kept getting a warning that my validation set was empty, which is odd since I have 8.12M lines of text in a text file and am randomly splitting by percent. This didn’t change whether I set the percentage to 10% or 50%.

My hunch was that fastai is splitting file-wise, treating each text file the way it treats individual image files. I created two small files of 10k and 5k lines out of my giant text file, and was successfully able to create an LM data object on them.
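
The hunch is consistent with simple arithmetic, assuming the split counts files and rounds the validation share down. That rounding rule is an assumption for this sketch, not something verified against the fastai source, and `validation_count` is a hypothetical helper.

```python
def validation_count(n_items, valid_pct):
    """Number of items a percentage split sends to validation (rounded down)."""
    return int(valid_pct * n_items)

# One giant file counts as a single item, however many lines it holds.
for n_files in (1, 20):
    for pct in (0.1, 0.5):
        print(n_files, pct, validation_count(n_files, pct))
```

With one file, both 10% and 50% leave nothing for validation, which would match the warning described above.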

I checked that the text of my data.train_ds[0][0].text matched up to one of the text files, and data.valid_ds[0][0].text to the other.

This is after changing the folder name to ‘train’; after the above, it looks like any folder name can be used.
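
For anyone in the same spot, a minimal sketch of cutting one large text file into many smaller ones, using only the standard library. The function names, the `part_00000.txt` naming scheme, and the paths in the usage comment are all hypothetical.

```python
from pathlib import Path

def split_into_files(src, out_dir, lines_per_file):
    """Write consecutive chunks of `src` into numbered files under `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    chunk = []
    with open(src, encoding='utf-8') as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == lines_per_file:
                written.append(_flush(chunk, out_dir, len(written)))
                chunk = []
    if chunk:  # leftover lines that did not fill a whole chunk
        written.append(_flush(chunk, out_dir, len(written)))
    return written

def _flush(chunk, out_dir, index):
    """Write one chunk to `part_<index>.txt` and return its path."""
    target = out_dir / f'part_{index:05d}.txt'
    target.write_text(''.join(chunk), encoding='utf-8')
    return target

# Example (hypothetical paths):
# split_into_files('corpus.txt', 'data/train', 10_000)
```

The file is read line by line, so only one chunk is held in memory at a time.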