.csv file is not correctly loaded/read (Lesson 3)

marc_galitski · June 22, 2020, 4:33pm

Hey everyone! I’m currently attempting some NLP and created a .csv file with 3 columns: ‘label’, ‘text’, and ‘is_validation’. I load the file with pd.read_csv and can access the rows and their corresponding text strings when specifying ‘text’ as the column. However, when I tokenize the .csv file (TextClasDataBunch.from_csv) and then call data.show_batch(), I get a ‘StopIteration’ error with no further explanation. I tried to interpret the traceback but am unable to narrow down a solution, as my .csv file looks almost identical to the one used in Lesson 3 (the ‘texts.csv’ from the imdb_sample) and no issues occur there.
Furthermore, if the data.show_batch() is omitted and data.train_ds[0][0] is called, there is no text output but only numbers (as opposed to text as in Lesson 3). Calling data.train_ds[0][0].data[:10] returns those same numbers but as a list. Seemingly, the text within the .csv does not get read-in correctly.

A few notes on the data set: it only has 31 labelled samples (‘none’, ‘valid’), with 10 being validation samples, and 21 training. I could imagine that this small sample size may have something to do with the error (i.e. sample too small to keep iterating over it, thus ‘stopping iteration’). However, collecting samples for this type of data is a pain, hence why I turned to forums before continuing to search for more samples.

Does anyone have an answer? A tutorial on how to create good .csv files for training would be much appreciated as well! Thanxx

abcde13 · June 22, 2020, 10:42pm

You may very well be right. From what I can tell, from_csv by itself never specifies a batch size, and therefore, databunch defaults to 64. Can you try doing from_csv(path, 'sample.csv', bs=2)?
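
To make the suspicion concrete, here is a minimal pure-Python sketch of what probably happens. It is not fastai code: the `batches` and `first_batch` helpers are made up for illustration, and it assumes the training loader drops an incomplete final batch. With only 21 training rows and a batch size of 64, no full batch can be formed, so asking for the first batch raises StopIteration.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[int], bs: int, drop_last: bool = True) -> Iterator[List[int]]:
    """Yield consecutive batches of size bs (hypothetical helper, not fastai).

    With drop_last=True, a final batch shorter than bs is discarded.
    """
    for start in range(0, len(items), bs):
        batch = list(items[start:start + bs])
        if drop_last and len(batch) < bs:
            return  # incomplete batch: stop without yielding it
        yield batch


# 21 training rows, as described in the question.
train_samples = list(range(21))


def first_batch(bs: int) -> List[int]:
    """Fetch only the first batch, roughly what a show_batch-style call needs."""
    return next(iter(batches(train_samples, bs)))


try:
    first_batch(64)  # 21 < 64, so the iterator is empty
    outcome = "got a batch"
except StopIteration:
    outcome = "StopIteration"

print(outcome)
print(len(first_batch(2)))  # a batch size of 2 fits comfortably into 21 rows
```

If that assumption holds, any batch size up to the number of training rows should avoid the error, so bs=2 is just a safe first guess.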

Also, though I’m quite new as well, welcome to the forums!

marc_galitski · June 24, 2020, 12:48pm

No way!! It worked!

I tried changing the batch_size hyperparameter I declared at the beginning to 2, but it didn’t work. Seems like you have to directly pass in the bs as an argument for from_csv.

Thanks so much!

abcde13 · June 24, 2020, 5:00pm

No problem. In the spirit of teaching a person how to fish, let me show you how I figured that out.

Jeremy always tells us to look at the source code. And Jeremy’s always right. I knew that databunch is usually where you pass in the batch size for the regular data_block api, but this is special, in that it creates the databunch in one fell swoop. So I looked at the from_csv source: https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L210. It doesn’t say anything about a batch size; it just reads the file into a dataframe and hands it to from_df: https://github.com/fastai/fastai/blob/54a9e3cf4fd0fa11fc2453a5389cc9263f6f0d77/fastai/tabular/data.py#L86. It’s there that you notice a default bs=64, which means the only way to override it is to pass bs to from_csv as a kwarg.
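
Here is a toy sketch of that forwarding pattern, in case it helps to see it in isolation. `ToyDataBunch` and its methods are hypothetical stand-ins that only mimic the call chain described above; they are not the real fastai classes or signatures. It also shows why a `batch_size` variable declared at the top of a notebook has no effect unless it is actually passed in.

```python
class ToyDataBunch:
    """Hypothetical stand-in that only mimics the call chain, not fastai itself."""

    def __init__(self, rows, bs):
        self.rows = rows
        self.bs = bs

    @classmethod
    def create(cls, rows, bs=64, **kwargs):
        # The default batch size lives here, at the end of the chain.
        return cls(rows, bs)

    @classmethod
    def from_df(cls, rows, **kwargs):
        # No bs parameter of its own: whatever arrives in kwargs is passed on.
        return cls.create(rows, **kwargs)

    @classmethod
    def from_csv(cls, rows, **kwargs):
        # A real version would read the CSV file first; here rows are given directly.
        return cls.from_df(rows, **kwargs)


batch_size = 2         # declared, but only a local name until it is passed in
rows = list(range(31))  # 31 samples, as in the question

default_bunch = ToyDataBunch.from_csv(rows)                  # falls back to bs=64
explicit_bunch = ToyDataBunch.from_csv(rows, bs=batch_size)  # kwarg reaches create

print(default_bunch.bs, explicit_bunch.bs)
```

The design point is that nothing between from_csv and the final constructor names bs explicitly, so the keyword argument is the only channel through which the value can travel.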

Hope that helps.

marc_galitski · June 24, 2020, 7:16pm

Yes, a lot actually! I literally started studying the data_block docs after you posted the solution, as I figured that’s where I’ll get the most knowledge from. There are a lot of intricacies to learn, that’s for sure!