I am trying to do something like natural language generation (on a different dataset), based on our lesson on language modelling with the arxiv dataset. What I am trying to achieve is to compress the original input text into a shorter form that keeps the same meaning and information as the original text.
1. The dataset I have is an Excel sheet. I used pandas to read the data into a dataframe.
2. Created another dataframe with only the columns I think I need for training.
3. Split the data into train and validation sets doing this:
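Roughly, steps 1 to 3 look like the sketch below. It is only a minimal illustration under assumptions: the column names `original` and `compressed`, the file name mentioned in the comment, and the 80/20 ratio are placeholders, and a small in-memory dataframe stands in for the Excel sheet.

```python
import pandas as pd

# Step 1: in practice the data would come from the sheet, e.g.
# pd.read_excel("data.xlsx") (hypothetical file name).
# A tiny in-memory frame stands in for it here so the sketch runs anywhere.
df = pd.DataFrame({
    "id": range(10),
    "original": [f"long input text number {i}" for i in range(10)],
    "compressed": [f"short text {i}" for i in range(10)],
})

# Step 2: keep only the columns needed for training (hypothetical names).
data = df[["original", "compressed"]]

# Step 3: random 80/20 split; the fixed seed makes it reproducible.
train_df = data.sample(frac=0.8, random_state=42)
val_df = data.drop(train_df.index)  # everything not sampled into train
```

Dropping the sampled index from the full frame guarantees that no row ends up in both sets.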
At this point, I am a bit confused as to what to do next, and whether what I have done so far is the correct approach.
Q: Should I be doing something like this (from the arxiv notebook) to get the data into txt format:
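To make the question concrete, here is the idea as I understand it. This is not the notebook's code, just a sketch that assumes the goal is one plain-text file per row, in separate `train` and `val` folders. The helper `write_txt_files`, the folder names, and the sample texts are all made up for illustration.

```python
from pathlib import Path
import tempfile

def write_txt_files(texts, out_dir):
    """Write each text to its own numbered .txt file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = out_dir / f"{i}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths

# Hypothetical sample rows standing in for the train and validation dataframes.
train_texts = ["first long input text", "second long input text"]
val_texts = ["a held-out input text"]

root = Path(tempfile.mkdtemp())  # temporary folder so the sketch is safe to run
train_paths = write_txt_files(train_texts, root / "train")
val_paths = write_txt_files(val_texts, root / "val")
```

With a real dataframe, the lists could come from something like `train_df["original"].tolist()`, where the column name is again a placeholder.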
Q: How do I tokenise the data and then create the LanguageModelData?
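By tokenising I mean something along these lines: split each text into tokens, build a vocabulary, and map tokens to integer ids. The sketch below uses only the standard library and a simple regex; it is not the tokenizer the course library uses, and the helper names (`tokenize`, `build_vocab`, `numericalize`) are hypothetical. The LanguageModelData step is left out because I am not sure of its exact arguments.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(token_lists, min_freq=1):
    """Map each token seen at least min_freq times to an integer id.

    Id 0 is reserved for unknown tokens.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = ["<unk>"] + [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    """Replace each token by its id, falling back to 0 for unknown tokens."""
    return [stoi.get(tok, 0) for tok in tokens]

# Hypothetical sample texts standing in for the real training data.
texts = ["The cat sat.", "The dog sat!"]
tokens = [tokenize(t) for t in texts]
stoi = build_vocab(tokens)
ids = [numericalize(toks, stoi) for toks in tokens]
```

The vocabulary should be built from the training texts only, so that validation tokens never seen in training map to the unknown id.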