Significant changes just pushed

jeremy · January 27, 2018, 2:21am

I’ve just pushed changes based on my recent work in NLP. I haven’t had time to thoroughly test whether they’ve broken any of the lessons, so apologies in advance if you find anything not working. Please either let me know here about any problems, or if you know how to fix them feel free to send a PR (and let us know here that you’ve fixed it).

One key change is that metrics now take pytorch tensors/variables, not numpy arrays. The reason is that (to my surprise!) calculating accuracy was taking a significant percentage of training time. Moving it from numpy to pytorch has fixed that problem.
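As a rough illustration of what a tensor-based metric can look like, here is a minimal sketch. It is not the fastai implementation; the function name and the assumed shapes (a batch of per-class scores plus a vector of class indices) are illustrative only.

```python
import torch

def accuracy(preds, targs):
    """Fraction of rows whose highest-scoring class matches the target.

    Assumed shapes (illustrative): preds is (batch, n_classes) scores,
    targs is (batch,) integer class indices. Everything stays a tensor,
    so nothing is converted to numpy during training.
    """
    predicted = preds.argmax(dim=1)              # index of the best class per row
    return (predicted == targs).float().mean()   # 0-dim tensor

preds = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
targs = torch.tensor([1, 0, 0])
acc = accuracy(preds, targs)  # two of the three rows are correct
```

Calling `acc.item()` converts the result to a Python float when it needs to be printed or logged.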

The other key changes are around the NLP modules. They are designed to continue to work much the same way as before, but there have been internal changes. (FYI, you’ll see significant overlap between fastai.nlp and fastai.text. fastai.nlp is the old module that’s designed to work with torchtext. fastai.text is a new module designed to replace torchtext. You should stick with fastai.nlp since the new module isn’t documented, unless you’re interested in getting involved in development of this module.)

wgpubs · January 27, 2018, 7:24pm

Hi Jeremy

I’m wanting to work with the new code in the fastai.text namespace as it seems more friendly for both multi-class and multi-label problems.

In the .nlp namespace there exists a helper method to build the DataLoaders that was heavily dependent on torchtext. In the .text namespace, there exists just a constructor that accepts the dataloaders as arguments. My question is:

What should these datasets/dataloaders look like and is there a recommended way to build them?

(attached is what I did for the toxic comp. using the nlp namespace; any feedback on what is good, bad, or could be improved would be appreciated, in addition to how to translate these to something that will work with the text namespace).

https://gist.github.com/ohmeow/5b3543a5115040001fce59a105ac4269

toxic.py

class TextMultiLabelDataset(torchtext.data.Dataset):
    def __init__(self, df, tt_text_field, tt_label_field, txt_col, lbl_cols, **kwargs):
        # torchtext Field objects
        fields = [('text', tt_text_field)]
        for l in lbl_cols: fields.append((l, tt_label_field))
            
        is_test = lbl_cols[0] not in df.columns
        n_labels = len(lbl_cols)
        
        examples = []

This file has been truncated.

I’m also going to submit a PR for a def text_labels_from_dataframes() once I get things working as is (I still find working with dataframes so much more pleasant and flexible than the reading files approach).

Thanks
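For readers with the same question, here is a minimal sketch of a multi-label dataset built directly from a dataframe with plain PyTorch. It is not the fastai.text API and not the proposed `text_labels_from_dataframes()`; the class name, the whitespace tokenizer, the vocabulary scheme, and the column names in the usage lines are all assumptions made for illustration.

```python
import pandas as pd
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

class MultiLabelTextDataset(Dataset):  # hypothetical name, not a fastai class
    def __init__(self, df, txt_col, lbl_cols, tokenize=str.split, vocab=None):
        self.tokens = [tokenize(t) for t in df[txt_col]]
        if vocab is None:
            # Index 0 is reserved for padding, 1 for unknown words.
            words = sorted({w for toks in self.tokens for w in toks})
            vocab = {w: i + 2 for i, w in enumerate(words)}
        self.vocab = vocab
        # A test frame has no label columns, so labels stay None.
        has_labels = lbl_cols[0] in df.columns
        self.labels = df[lbl_cols].values.astype("float32") if has_labels else None

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, i):
        ids = torch.tensor([self.vocab.get(w, 1) for w in self.tokens[i]],
                           dtype=torch.long)
        if self.labels is None:
            return ids, torch.zeros(0)
        # One float per label column (a multi-hot vector).
        return ids, torch.tensor(self.labels[i])

def pad_collate(batch):
    """Pad token ids to the longest sequence in the batch and stack labels."""
    xs, ys = zip(*batch)
    return pad_sequence(xs, batch_first=True, padding_value=0), torch.stack(ys)

# Usage with made-up rows and assumed column names.
df = pd.DataFrame({"comment_text": ["you are great", "bad bad comment here"],
                   "toxic": [0, 1], "insult": [0, 1]})
ds = MultiLabelTextDataset(df, "comment_text", ["toxic", "insult"])
dl = DataLoader(ds, batch_size=2, collate_fn=pad_collate)
x, y = next(iter(dl))  # x: padded token ids, y: float multi-hot labels
```

Keeping the labels as float32 suits a binary cross-entropy style loss, which is the usual choice for multi-label targets.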

jeremy · January 27, 2018, 10:12pm

@wgpubs I plan to release fastai.text in the next week or so, and I’ll be providing some documentation and sample scripts then. If you haven’t seen something by this time next week, feel free to ping me.

hafidz · February 2, 2018, 8:20pm

Looking forward to an update on this. Currently working on the Toxic challenge. Having issues with loading the data into torchtext.

wgpubs · February 2, 2018, 9:02pm

check out my gist @hafidz

I would love to get feedback on this, but this is my attempt to create a multi-label friendly dataset using torchtext and fastai for that competition.

hafidz · February 3, 2018, 5:20am

@wgpubs Are you referring to this gist -> https://gist.github.com/ohmeow/5b3543a5115040001fce59a105ac4269

Sorry, had to clarify since I think I’ve seen other gists from you before in other threads.

wgpubs · February 3, 2018, 10:15pm

Yah that is it

layla.tadjpour · February 5, 2018, 5:25am

Great! Thanks for this. I have also been working on the toxic comment challenge, like you and @hafidz, and struggling to create a dataset for multi-label data frames.

hafidz · February 6, 2018, 7:25pm

@wgpubs Currently getting this error when trying to fit the model: "Tried to construct a tensor from a int sequence, but found an item of type float at index (0)"

Any ideas?
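One plausible cause, not confirmed anywhere in this thread, is that the label values reach an integer tensor constructor as floats (for example `1.0`, or `NaN` from missing rows). The sketch below is a hypothetical helper, not part of fastai or torchtext, that casts labels to int and fails loudly on values that are not whole numbers, which makes the offending row easier to find.

```python
def to_int_labels(values):
    """Cast label values to int; raise on NaN or fractional values.

    Hypothetical helper for debugging, assuming labels are meant to be
    whole numbers such as 0/1 flags.
    """
    out = []
    for i, v in enumerate(values):
        # float('nan').is_integer() is False, so NaN is rejected here too.
        if isinstance(v, float) and not v.is_integer():
            raise ValueError(f"label at index {i} is not a whole number: {v!r}")
        out.append(int(v))
    return out

clean = to_int_labels([1.0, 0.0, 1])  # floats that are whole numbers are cast
```

If the labels are instead meant to be floats (as is typical for multi-label targets with a binary cross-entropy loss), the fix would go the other way: build a float tensor rather than an integer one.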

wgpubs · February 6, 2018, 11:17pm

@jeremy … ping!

Finding some free time today to work on the new fastai.text package so if you have anything w/r/t building the datasets/dataloaders great … if not, no problem.

thanks

wgpubs · February 6, 2018, 11:18pm

If you can post a gist of your actual code that would help. lmk.

hafidz · February 7, 2018, 1:01am

Sure. Here’s the gist https://gist.github.com/ikanez/ea4224f46102270aaf40e77396179a72

More details are available in the link I sent earlier. Didn’t want to hijack this thread, that’s why I created another one.

Thanks in advance!

digitalspecialists · February 19, 2018, 11:55pm

Documentation and sample scripts would be very welcome! I’ve not quite been able to assemble what is needed.