Issue with TextClasDataBunch

Hi,

I’m trying to use fastai’s text data module to load my text data for an NLP project. The data is organized into a train folder and a validation folder, each of which contains one subfolder per class (twenty classes in total), and each class subfolder holds close to 1k text files. When I try to use the from_folder function to create a TextClasDataBunch object, I get the following error:

ValueError: Invalid file path or buffer object type: <class 'NoneType'>

For reference, here is the line of code where the problem occurs:

databunch = TextClasDataBunch.from_folder(path=path, valid='valid', train='train', tokenizer=data_tokenizer, shuffle=True)

data_tokenizer is a Tokenizer object that uses SpacyTokenizer() as its tokenization function.
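
To rule out a problem with the folder layout itself, the structure described above can be checked with a short script. This is only a minimal sketch: summarize_layout is a hypothetical helper (not part of fastai), and it assumes the split folders are named train and valid, as in the call above, and that the text files end in .txt.

```python
from pathlib import Path


def summarize_layout(root, splits=("train", "valid")):
    """Return {split: {class_name: number_of_txt_files}} for root/<split>/<class>/*.txt.

    A split maps to None when its folder does not exist.
    The split names and the .txt extension are assumptions about the data.
    """
    root = Path(root)
    summary = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            summary[split] = None  # split folder is missing
            continue
        # One entry per class subfolder, counting only its direct .txt files
        summary[split] = {
            class_dir.name: sum(1 for _ in class_dir.glob("*.txt"))
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary
```

Calling summarize_layout(path) should show twenty class names under each split, each with a count close to 1k, if the layout is as described.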

And here is the complete stack trace:

  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/fastai/text/data.py", line 345, in from_folder
    classes=txt_kwargs.pop('classes', None), **txt_kwargs)
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/fastai/text/data.py", line 199, in from_folder
    return cls(folder, tokenizer, name=name, classes=classes, **kwargs)
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/fastai/text/data.py", line 37, in __init__
    if not self.check_toks(): self.tokenize()
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/fastai/text/data.py", line 82, in tokenize
    curr_len = get_chunk_length(self.df) if (self.create_mtd == TextMtd.DF) else get_chunk_length(self.csv_file, self.chunksize)
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/fastai/core.py", line 125, in get_chunk_length
    else:  dfs = pd.read_csv(data, header=None, chunksize=chunksize)
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 424, in _read
    filepath_or_buffer, encoding, compression)
  File "/Users/anprahlad/.pyenv/versions/venv/lib/python3.6/site-packages/pandas/io/common.py", line 218, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <class 'NoneType'>

Has anyone else using the fastai text data library run into this issue? If so, how did you get past it? Any advice would be greatly appreciated!

The error is raised inside the tokenize function, so I’m guessing there is something wrong with your tokenizer. Maybe try the default one first and see if that solves your issue?

I’ve tried that and it gives me the same error. The new TextClasDataBunch initialization looks like this:

databunch = TextClasDataBunch.from_folder(path=path, shuffle=True)

Any ideas?

I also ran into this issue (with the default tokenizer). I found that once I converted my data to CSV format and used the from_csv method instead, I no longer got this error. Obviously it’s not ideal to have to do this!
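
For anyone who wants to try the same workaround, here is a rough sketch of the conversion using only the standard library. folder_to_csv is a hypothetical helper, not a fastai function. It assumes one subfolder per class containing UTF-8 .txt files, and it writes header-less rows with the label first and the text second. That column order is an assumption (the traceback above only shows that the file is read with header=None), so check what from_csv expects in your installed fastai version.

```python
import csv
from pathlib import Path


def folder_to_csv(split_dir, csv_path):
    """Write one header-less (label, text) row per .txt file under split_dir/<label>/.

    Returns the number of rows written. The label-first column order and the
    .txt extension are assumptions, not something fastai guarantees.
    """
    split_dir = Path(split_dir)
    n_rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
        for class_dir in class_dirs:
            for txt_file in sorted(class_dir.glob("*.txt")):
                text = txt_file.read_text(encoding="utf-8")
                # The folder name serves as the class label
                writer.writerow([class_dir.name, text])
                n_rows += 1
    return n_rows
```

You would run it once per split, for example on the train folder and again on the valid folder, and then point from_csv at the resulting files.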