How to conserve memory when creating a dataset from a dataframe?

I have a dataset of tweets that I’d like to perform sentiment analysis on, and have loaded a CSV file with a text column and a sentiment column into a dataframe. I loaded it into memory using the LanguageModelData.from_dataframes function, trained a model, and now want to use the model’s encoder for sentiment analysis. With the model_data object in memory, I have this method for creating a sentiment-analysis dataset, grabbed from this PR:

from torchtext import data

class MyDataset(data.Dataset):
    def __init__(self, df, text_field, label_field, is_test=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []
        for i, row in df.iterrows():
            label = 'pos'
            if not is_test and row['toxic'] == 0:
                label = 'neg'
            text = row['comment_text']
            examples.append(data.Example.fromlist([text, label], fields))

        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    @classmethod
    def splits(cls, text_field, label_field, train_df, val_df=None, test_df=None, **kwargs):
        train_data, val_data, test_data = (None, None, None)

        if train_df is not None:
            train_data = cls(train_df.copy(), text_field, label_field, **kwargs)
        if val_df is not None:
            val_data = cls(val_df.copy(), text_field, label_field, **kwargs)
        if test_df is not None:
            test_data = cls(test_df.copy(), text_field, label_field, True, **kwargs)

        return tuple(d for d in (train_data, val_data, test_data) if d is not None)
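One thing I noticed while reading this (my own suggestion, not from the PR): each `splits` branch calls `.copy()` on its dataframe before passing it in, but the constructor only reads rows, so that defensive copy can probably be dropped to avoid one extra full-frame allocation. A minimal pandas-only sketch of the read-only pass, with a hypothetical `build` helper standing in for the constructor:

```python
import pandas as pd

def build(df, is_test=False):
    # Read-only pass over the frame: nothing is mutated,
    # so no defensive df.copy() is needed before calling this.
    return [(text, 'neg' if (not is_test and toxic == 0) else 'pos')
            for text, toxic in zip(df['comment_text'], df['toxic'])]

train_df = pd.DataFrame({'comment_text': ['ok', 'bad'], 'toxic': [0, 1]})
# Passing train_df directly (no train_df.copy()) avoids a transient duplicate.
train_examples = build(train_df)
```

Whether this is safe depends on the constructor never mutating the frame, which seems to be the case here.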

Is there a way to avoid doubling the memory used when creating this dataset from an in-memory dataframe?

splits = MyDataset.splits(TEXT, LABEL, train_df=train_df, val_df=val_df, test_df=test_df)

With both the model_data object and the dataframe in memory, the whole process eats up 16 GB of RAM and spills into 10 GB of swap space. df.iterrows() doesn’t look like the most memory-efficient way to do this.
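On that last point, `itertuples()` is usually a lighter alternative: `iterrows()` constructs a full pandas Series for every row, while `itertuples()` yields plain namedtuples. A sketch of the same labelling loop using it (column names taken from the code above; `make_examples` is a hypothetical stand-in for the dataset constructor):

```python
import pandas as pd

def make_examples(df, is_test=False):
    examples = []
    # Select only the two needed columns, then iterate over namedtuples;
    # this skips the per-row Series construction that iterrows() does.
    for row in df[['comment_text', 'toxic']].itertuples(index=False):
        label = 'neg' if (not is_test and row.toxic == 0) else 'pos'
        examples.append((row.comment_text, label))
    return examples

df = pd.DataFrame({'comment_text': ['nice', 'awful'], 'toxic': [0, 1]})
examples = make_examples(df)  # [('nice', 'neg'), ('awful', 'pos')]
```

This only trims per-row overhead; the example list itself still duplicates the text strings' references, so the dataframe should be deleted once the dataset is built if memory is tight.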