After watching lesson 4 I want to do my own sentiment analysis on Twitter data with transfer learning, reusing the model learned from the movie reviews.
So far so good, but I am already failing when trying to load the Twitter sentiment data. It is a CSV file with two columns:
sentiment (0/1), text of the posting
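Made-up rows, just to show the shape of the file (not my actual data):

```csv
0,"so tired of this rain"
1,"great game last night!"
```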
I have tried to write my own Dataset class:
class TwitterDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        datafields = [('Sentiment', TWITTER_LABELS), ('SentimentText', TEXT)]
        examples = []
        examples.append(data.Example.fromCSV(data=path, fields=datafields))
        super().__init__(examples, datafields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)
But I am not sure why I have to use a path at all — I want to use just a single CSV file.
Setting the data field for the label:
TWITTER_LABELS = data.Field(sequential=False, use_vocab=False)
Running the splits:
splits = TwitterDataset.splits(TEXT, TWITTER_LABELS, cleaned_twitter_sentiment_file, train='trn', test='val')
Creating the model:
md2 = TextData.from_splits(cleaned_twitter_sentiment_file, splits, bs)
This returns this error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in <module>()
----> 1 md2 = TextData.from_splits(cleaned_twitter_sentiment_file, splits, bs)

~/Dokumente/Projekte/fastai/courses/dl1/fastai/nlp.py in from_splits(cls, path, splits, bs, text_name, label_name)
    338     @classmethod
    339     def from_splits(cls, path, splits, bs, text_name='text', label_name='label'):
--> 340         text_fld = splits[0].fields[text_name]
    341         label_fld = splits[0].fields[label_name]
    342         if hasattr(label_fld, 'build_vocab'): label_fld.build_vocab(splits[0])

KeyError: 'text'
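One thing I notice in the traceback: `from_splits` indexes `splits[0].fields['text']` and `splits[0].fields['label']`, while my datafields are called `'Sentiment'` and `'SentimentText'`. Could that mismatch be the cause? This is the renaming I would try (just a guess, with placeholders standing in for my real `Field` objects):

```python
# Placeholders standing in for my torchtext Field objects:
TEXT = object()            # would be data.Field(...)
TWITTER_LABELS = object()  # would be data.Field(sequential=False, use_vocab=False)

# Guess: the field names have to match exactly what from_splits looks up
datafields = [('label', TWITTER_LABELS), ('text', TEXT)]

# torchtext stores the (name, field) pairs as a dict, which is what
# from_splits then indexes into with 'text' and 'label'
fields = dict(datafields)
assert 'text' in fields and 'label' in fields
```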
Any help is appreciated.
By the way: why does everybody seem to be using PyTorch these days? It probably supports better processing, but I find the handling far more complicated and unintuitive than Keras…