I have been using Colab Pro, which has about 25 GB of RAM. However, I have a massive text dataset of around 3 million rows, and it eats up all my RAM in the lines of code below. I am trying to build a language-model learner from this data.
import csv
import pandas as pd
from fastai.text.all import *

df_train = pd.read_csv('gdrive/MyDrive/train.csv', escapechar='\\',
                       quoting=csv.QUOTE_NONE,
                       usecols=['DESCRIPTION', 'BROWSE_NODE_ID'])
df_test = pd.read_csv('gdrive/MyDrive/test.csv', escapechar='\\',
                      quoting=csv.QUOTE_NONE,
                      usecols=['DESCRIPTION'])

df_train = df_train[df_train.DESCRIPTION.notnull()]

amazon_lm = DataBlock(
    blocks=TextBlock.from_df('DESCRIPTION', is_lm=True),
    splitter=RandomSplitter(0.05)
).dataloaders(df_train, bs=32, seq_len=72)
The DataBlock(...).dataloaders(...) call is what exhausts the RAM. I have tried using Vaex, but the fastai library doesn't seem to support it when I later convert the model into a classifier for my purpose. I have also removed the null values, which reduced the dataset to about 2.1 million rows. Is there a better way to reduce the RAM usage of the dataloaders call?
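For reference, one workaround I have considered is down-sampling the CSV in chunks at read time, so the full frame never sits in memory alongside the tokenized dataloaders. This is only an illustrative sketch (the in-memory CSV, chunk size, and sampling fraction are placeholders, not my real data):

```python
import io
import pandas as pd

# Toy stand-in for train.csv; in practice this would be the real file path.
csv_data = "DESCRIPTION,BROWSE_NODE_ID\n" + "\n".join(
    f"item {i} description,{i % 5}" for i in range(1000)
)

# Read in fixed-size chunks and keep a random 10% of each chunk,
# so the whole dataset is never materialized at once.
chunks = pd.read_csv(io.StringIO(csv_data), chunksize=200)
sampled = pd.concat(
    chunk.sample(frac=0.1, random_state=0) for chunk in chunks
)

print(len(sampled))  # 100 rows kept out of 1000
```

The resulting smaller frame could then be passed to the same DataBlock pipeline, though I would rather not throw away training data if the memory use can be reduced another way.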