After watching lesson 4 I want to do my own sentiment analysis on Twitter data with transfer learning, reusing the model learned from the movie reviews.
So far so good, but I am already failing when trying to load the Twitter sentiment data. It is a CSV file with two columns:
sentiment (0/1), text of the posting
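Made-up rows, just to show the shape of the file (not my actual data):

```csv
0,"so tired of this rain"
1,"great game last night!"
```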
I have tried to write my own Dataset class:
class TwitterDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        datafields = [('Sentiment', TWITTER_LABELS), ('SentimentText', TEXT)]
        examples = []
        examples.append(data.Example.fromCSV(data=path, fields=datafields))
        super().__init__(examples, datafields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)
But I am not sure why I have to use a path at all — I want to use just a single CSV file.
Setting the data field for the label:
TWITTER_LABELS = data.Field(sequential=False, use_vocab=False)
Running the splits:
splits = TwitterDataset.splits(TEXT, TWITTER_LABELS, cleaned_twitter_sentiment_file, train='trn', test='val')
Creating the model:
md2 = TextData.from_splits(cleaned_twitter_sentiment_file, splits, bs)
This returns this error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in <module>()
----> 1 md2 = TextData.from_splits(cleaned_twitter_sentiment_file, splits, bs)

~/Dokumente/Projekte/fastai/courses/dl1/fastai/nlp.py in from_splits(cls, path, splits, bs, text_name, label_name)
    338     @classmethod
    339     def from_splits(cls, path, splits, bs, text_name='text', label_name='label'):
--> 340         text_fld = splits[0].fields[text_name]
    341         label_fld = splits[0].fields[label_name]
    342         if hasattr(label_fld, 'build_vocab'): label_fld.build_vocab(splits[0])

KeyError: 'text'
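One thing I notice in the traceback: `from_splits` indexes `splits[0].fields['text']` and `splits[0].fields['label']`, while my datafields are called `'Sentiment'` and `'SentimentText'`. Could that mismatch be the cause? This is the renaming I would try (just a guess, with placeholders standing in for my real `Field` objects):

```python
# Placeholders standing in for my torchtext Field objects:
TEXT = object()            # would be data.Field(...)
TWITTER_LABELS = object()  # would be data.Field(sequential=False, use_vocab=False)

# Guess: the field names have to match exactly what from_splits looks up
datafields = [('label', TWITTER_LABELS), ('text', TEXT)]

# torchtext stores the (name, field) pairs as a dict, which is what
# from_splits then indexes into with 'text' and 'label'
fields = dict(datafields)
assert 'text' in fields and 'label' in fields
```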
Any help is appreciated.
By the way: why does everybody seem to be using PyTorch these days? It probably supports better processing, but I find the handling far more complicated and unintuitive than Keras…