I have been using Colab Pro, which has about 25 GB of RAM. However, I have a massive text dataset of around 3 million rows, and it eats up all my RAM in the lines of code below. I am trying to build a language-model learner from this data.
import csv
import pandas as pd
from fastai.text.all import *

df_train = pd.read_csv('gdrive/MyDrive/train.csv', escapechar='\\',
                       quoting=csv.QUOTE_NONE,
                       usecols=['DESCRIPTION', 'BROWSE_NODE_ID'])
df_test = pd.read_csv('gdrive/MyDrive/test.csv', escapechar='\\',
                      quoting=csv.QUOTE_NONE,
                      usecols=['DESCRIPTION'])

df_train = df_train[df_train.DESCRIPTION.notnull()]

amazon_lm = DataBlock(
    blocks=TextBlock.from_df('DESCRIPTION', is_lm=True),
    splitter=RandomSplitter(0.05)
).dataloaders(df_train, bs=32, seq_len=72)
The DataBlock(...).dataloaders(...) call is what exhausts the RAM. I have tried using Vaex, but the fastai library doesn't seem to support it when I later convert the model into a classifier for my purpose. I have also removed the null values, which reduced the dataset to about 2.1 million rows. Is there a better way to reduce the RAM usage of the dataloaders call?
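For reference, one workaround I have considered is down-sampling the CSV in chunks at read time, so the full frame never sits in memory alongside the tokenized dataloaders. This is only an illustrative sketch (the in-memory CSV, chunk size, and sampling fraction are placeholders, not my real data):

```python
import io
import pandas as pd

# Toy stand-in for train.csv; in practice this would be the real file path.
csv_data = "DESCRIPTION,BROWSE_NODE_ID\n" + "\n".join(
    f"item {i} description,{i % 5}" for i in range(1000)
)

# Read in fixed-size chunks and keep a random 10% of each chunk,
# so the whole dataset is never materialized at once.
chunks = pd.read_csv(io.StringIO(csv_data), chunksize=200)
sampled = pd.concat(
    chunk.sample(frac=0.1, random_state=0) for chunk in chunks
)

print(len(sampled))  # 100 rows kept out of 1000
```

The resulting smaller frame could then be passed to the same DataBlock pipeline, though I would rather not throw away training data if the memory use can be reduced another way.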