I’m running through the quick start code with a different dataset and data bunches built with from_df. I’m having an issue where I run the exact same code at different times, but the text classifier at the end outputs differently formatted results.
One of them is as it appears in the quick start page:
(Category positive, tensor(1), tensor([0.0049, 0.9951]))
While the other one that I’m getting often is like this:
(FloatItem [0.177895], tensor([0.1779]), tensor([0.1779]))
Can anyone help me figure out why the model is outputting predictions in the second form and not the first?
Edit: OK, I figured out that it was because I had my labels as a list of 0’s and 1’s, which seemed to make the classifier treat it as a regression problem. Changing the 0’s to ‘negative’ and the 1’s to ‘positive’ makes it work consistently as expected now.
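For anyone who hits the same thing, a minimal sketch of that relabelling, assuming the labels live in a pandas DataFrame column called `label` (the column name and the toy data are just placeholders):

```python
import pandas as pd

# toy frame standing in for the real data
df = pd.DataFrame({'text': ['loved it', 'hated it'], 'label': [1, 0]})

# map integer labels to explicit class names so fastai treats the target
# as a classification (Category) problem rather than a regression one
df['label'] = df['label'].map({0: 'negative', 1: 'positive'})
```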
Pushed a few changes under the hood for the text classifier to limit memory usage (I was unable to fit ULMFiT with less than 10GB of GPU RAM). The main thing to know is that you should use pad_input_chunk instead of pad_input now, otherwise your data won’t arrive at the model in the format it expects.
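If you build the classifier DataLoaders by hand rather than through TextBlock, a minimal sketch of the switch, assuming `dsets` is an already-built Datasets of (numericalized text, label) pairs (the variable name and batch size are placeholders):

```python
from fastai2.text.all import *

# pad per batch with pad_input_chunk instead of pad_input so the batches
# arrive in the chunked format the updated text classifier expects
dls = dsets.dataloaders(bs=64, dl_type=SortedDL, before_batch=pad_input_chunk)
```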
I am interested in building a language model at work as the primary objective using fastai2:
However, I just have a couple questions:
The data has personal contact details and numbers; is there any way in fastai2 to remove such information? One way I imagined was to keep only the top 60k vocab, which should remove phone numbers and personal information, and replace everything else with <UNK> (see the sketch after these questions).
How can I be sure that ULMFiT won’t predict special chars like XXMAJXXTOP when I run inference to complete the sentence?
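On the first question, one possible direction (my own sketch, not something confirmed in this thread) is the max_vocab / min_freq arguments that fastai2’s numericalization and TextBlock accept; everything outside the kept vocab gets mapped to xxunk. A rough sketch, assuming a DataFrame `df` with a 'text' column:

```python
from fastai2.text.all import *

# keep only the 60k most frequent tokens; rarer strings such as phone
# numbers fall outside the vocab and are replaced by xxunk
dblock = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, max_vocab=60000, min_freq=3),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1))
dls_lm = dblock.dataloaders(df, bs=64)
```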
@much_learner sorry! That method is outdated (I haven’t looked at it in a while). From what I can see, you should try TextBlock.from_df. There you should be able to do what we were trying above and replace CategoryBlock with MultiCategoryBlock.
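A rough sketch of what that could look like, assuming a DataFrame `df` with a 'text' column and a 'labels' column of space-delimited tags (the column names and delimiter are placeholders):

```python
from fastai2.text.all import *

# multi-label text classification: MultiCategoryBlock one-hot encodes the targets
dblock = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=72), MultiCategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('labels', label_delim=' '),
    splitter=RandomSplitter(0.2))
dls = dblock.dataloaders(df, bs=64)
```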
Ah! That’s because torchvision did an update last night and broke everything. See the install directions for Colab here, as I’m constantly updating them: Fastai-v2 - read this before posting please! 😊 You want torchvision 0.4.2. Apologies for glossing over that, @much_learner.
I see they are one-hot encoded (tensor([1., 0., 0., 0., 1., 0., ...])); how do I change them to regression? The y vocab also contains categories, but these numbers should be floats.
It appears that databunch saving and loading is missing from the Fastai v2 notebook:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)
Indeed, when I try to run this in my Fastai v2 notebook, I get:
NameError: name 'load_data' is not defined
I’m assuming this means this functionality hasn’t made it yet to Fastai v2. I’ve checked the documentation and don’t see a way to load/save a databunch. Can someone confirm if this is indeed the case so I know I’m not missing anything?
Yes. This is mostly due (I think) to how the new language model data works: there’s no need to, since the data is grabbed and made on the fly as it’s needed instead of all at once.
But only about 10 seconds to load the old databunch using Fastai v1 code:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)
I hope that’s helpful feedback!
Edit from below: Update: It doesn’t actually take the full hour to load the databunch. After it runs for a few minutes, the estimate goes down significantly, then it finishes well ahead of schedule. I didn’t time it, but it probably took 5-10 minutes or so.
It is, thank you! Most likely it’ll be implemented a bit later (I know they restructured text a little while ago), as I know they just got to load_learner.
Actually no. This is mostly because all our objects pickle now, so you can directly save with torch.save and load with torch.load any DataBunch that takes time to create.
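A minimal sketch of that, where `dls` stands in for whatever DataBunch took time to build and path/'data_lm.pkl' is just a placeholder location:

```python
import torch

# pickle the whole DataBunch to disk so it doesn't have to be rebuilt next time
torch.save(dls, path/'data_lm.pkl')

# later, restore it directly
dls = torch.load(path/'data_lm.pkl')
```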
Also note that for a problem in a folder like IMDB, fastai2 caches all the tokenized texts so you don’t need to tokenize twice. It still takes time to load the second time because it needs to read the lengths of all the files; caching those is on my TODO list.
Hi @sgugger. Does this replicate the behaviour we had in fastai v1 with collate? For example, if I have a document with 1000 tokens, will the databunch and the model use bptt to break the input into bptt chunks and feed them to the model?