Abstractive extraction: based on the lang_model_arxiv lesson

I am trying to do something like natural language generation (on a different dataset), based on our lesson on language modelling with the arxiv dataset. What I am trying to achieve is this: compress the original input text into a shorter form that conveys the same meaning and information as the original text.

1. The dataset I have is an Excel sheet. I used pandas to read the data into a DataFrame.

2. I created another DataFrame containing only the columns I think I need for training.

3. I split the data into training and validation sets.
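
To make the three steps concrete, here is a minimal sketch of what they amount to. The column names (`id`, `text`, `summary`), the 80/20 ratio and the in-memory DataFrame are placeholders I made up for illustration; with the real data, the first step would be a `pd.read_excel(...)` call on the actual sheet.

```python
import pandas as pd

# Step 1: stand-in for reading the Excel sheet (hypothetical columns).
# With a real file this would be something like pd.read_excel("my_data.xlsx").
df = pd.DataFrame({
    "id": range(10),
    "text": [f"original text number {i}" for i in range(10)],
    "summary": [f"short {i}" for i in range(10)],
})

# Step 2: keep only the columns needed for training.
data = df[["text", "summary"]].copy()

# Step 3: random 80/20 split; the validation set is whatever was not sampled.
train_df = data.sample(frac=0.8, random_state=42)
val_df = data.drop(train_df.index)

print(len(train_df), len(val_df))
```

Fixing `random_state` keeps the split reproducible between runs, which helps when comparing experiments.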

At this point I am a bit confused about what to do next, and about whether what I have done so far is the correct approach.
Q: Should I be doing something like what the arxiv notebook does to get the data into txt format?
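
What I have in mind is roughly the sketch below: turn each row into one string and write it to its own .txt file under a train and a validation folder. The `<TEXT>` and `<SUMM>` markers, the folder names `trn` and `val`, and the helper names `to_txt` and `write_split` are my own placeholders, not something taken from the notebook, and the rows here are toy data.

```python
import tempfile
from pathlib import Path

# Toy rows standing in for the train / validation DataFrames (hypothetical fields).
train_rows = [
    {"text": "a long original passage", "summary": "short version"},
    {"text": "another long passage", "summary": "another short one"},
]
val_rows = [{"text": "held-out passage", "summary": "held-out summary"}]

def to_txt(row):
    """Join the fields of one row into a single marked-up string."""
    return f"<TEXT> {row['text']} <SUMM> {row['summary']}"

def write_split(rows, folder):
    """Write one .txt file per row into folder and return the file count."""
    folder.mkdir(parents=True, exist_ok=True)
    for i, row in enumerate(rows):
        (folder / f"{i}.txt").write_text(to_txt(row), encoding="utf-8")
    return len(rows)

root = Path(tempfile.mkdtemp())  # temporary directory, just for the sketch
n_trn = write_split(train_rows, root / "trn")
n_val = write_split(val_rows, root / "val")
print(n_trn, n_val)
```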

Q: How do I tokenise the data and then build the LanguageModelData object?
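
For reference, this is my current understanding of what "tokenise, then numericalise" means, written as a plain-Python sketch. It is not the fastai or torchtext API: the regex tokenizer, the `<unk>` entry and the names `itos` / `stoi` are assumptions I am using only to illustrate the idea of mapping text to integer ids.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, then split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

texts = ["The cat sat on the mat.", "The dog sat."]

# Flatten all documents into one token stream, as a language model would see it.
tokens = [tok for t in texts for tok in tokenize(t)]

# Build the vocabulary: index 0 is reserved for unknown tokens,
# the rest are ordered from most to least frequent.
counts = Counter(tokens)
itos = ["<unk>"] + [w for w, _ in counts.most_common()]
stoi = {w: i for i, w in enumerate(itos)}

# Numericalise: replace every token by its integer id.
ids = [stoi.get(tok, 0) for tok in tokens]
print(ids)
```

What I do not yet know is how to hand data like this (or the raw txt files) to LanguageModelData.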