Understanding what is behind the Data Block API

luffylucky · November 24, 2018, 12:13am

I really like the idea of using the Data Block API to create a DataBunch, and I understand the pipeline and the order of the operations involved.
But when I dig into the source code to understand what really happens behind the scenes, I get lost at some points. Maybe Jeremy will make this clearer in the next lectures, but for now I would like to clarify a few things:

Normally the pipeline in the Data Block API is processed in this order:
Example for a text problem: TextList -> ItemList -> ItemLists -> LabelLists -> DataBunch.
So where is the text in the dataset preprocessed and transformed into numbers? I guess that is done by calling the process method, but I don't know exactly where it is called in the pipeline above.
This may be related to the first question: what are the private variables such as _bunch and _processor in the TextList class used for? They are defined at the top of the class, but I don't see where they are used in the pipeline (maybe I missed it).
In lesson 3, I got an error while trying to print:

print(TextList.from_csv(path, 'texts.csv', cols='text'))

AttributeError: 'NoneType' object has no attribute 'textify'

The same thing happens for TabularList. Is this normal? In this case, it should print out the type fastai.text.data.TextList, right?

Thanks!

sebderhy · December 24, 2018, 10:13am

I think I have a similar problem when trying to create a text list from a dataframe (or from a csv created from my dataframe).

First I run the following line, which seems to work:
textlist = TextList.from_df(df_lm, path, cols='description')

However, I think the object created is not valid, because when I run:
textlist.__getitem__(6)

I also get the error:
AttributeError: 'NoneType' object has no attribute 'textify'

The weird thing is that I can see the text data without issue when doing:
textlist.items[6]

Any idea what could be the issue, and how to solve it?
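
For what it's worth, here is a minimal toy sketch of one way this exact error can come about. It is not fastai's actual source: ToyVocab and ToyTextList are hypothetical stand-ins, and the assumption is that the list's vocab is only built when a processing step runs, so indexing an unprocessed list calls textify on None while the raw items stay readable.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocabulary: maps token ids back to text."""

    def __init__(self, itos):
        self.itos = itos  # list of tokens, indexed by id

    def textify(self, ids):
        return " ".join(self.itos[i] for i in ids)


class ToyTextList:
    """Hypothetical item list whose vocab exists only after process()."""

    def __init__(self, items):
        self.items = list(items)
        self.vocab = None  # assumption: nothing is built at construction time

    def process(self):
        # Build the vocabulary, then replace each raw text with token ids.
        tokens = sorted({tok for text in self.items for tok in text.split()})
        stoi = {tok: i for i, tok in enumerate(tokens)}
        self.items = [[stoi[tok] for tok in text.split()] for text in self.items]
        self.vocab = ToyVocab(tokens)
        return self

    def __getitem__(self, i):
        # Fails with AttributeError while self.vocab is still None.
        return self.vocab.textify(self.items[i])


raw = ToyTextList(["a good movie", "a bad movie"])
raw_item = raw.items[0]  # the raw text is readable before processing

try:
    raw[0]
except AttributeError as err:
    error_message = str(err)

processed = raw.process()
first_text = processed[0]  # indexing works once the vocab exists
```

If the real classes behave similarly, the error would only mean that the list has not gone through the processing step yet, which would also explain why textlist.items[6] still shows the text.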