Question about datasets and long-text NLP


This is my first post on the forum. I am very new to machine learning.

My question is about text length in NLP.

Basically, in France there is a famous exam where you have to write an abstractive summary of a text that can be quite long (around 4,000 words) and very difficult to understand (an essay about philosophy or science). The method is: you work out the idea developed in each paragraph, then you identify the argumentative structure of the text, and you write the summary. Sometimes the summary of one paragraph is just one or two words in a sentence. And you always have to produce a 10% summary, so generally between 200 and 400 words, depending on the text.
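For what it's worth, the two mechanical parts of that method (the 10% length target and the paragraph-by-paragraph pass) are easy to express in a few lines of Python. This is just an illustrative sketch; the function names are my own:

```python
def summary_target_length(text: str, ratio: float = 0.10) -> int:
    """Target word count for a summary at `ratio` of the source length."""
    return round(len(text.split()) * ratio)


def split_paragraphs(text: str) -> list[str]:
    """Split a text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# A 4,000-word text gets a 400-word target summary.
print(summary_target_length("word " * 4000))
```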

I gathered a lot of previous exams (with their answers) and some examples found on the web, although that only comes to 169 observations (sometimes the same text appears with a different version of the summary). I know it's very little, but it was more out of curiosity that I wanted to see what could be done with it. Here's the data: Dropbox - generated.json

I wanted to try a pretrained model (mT5), but it seems that some texts are too long to process on Paperspace. So, is it possible to summarize very long and complex texts?
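One workaround I've seen mentioned for long inputs is to summarize chunk by chunk: split the text into pieces that fit the model's input limit, summarize each piece, then summarize the concatenated results. Here is a rough sketch of just the chunking step, using word count as a crude stand-in for the model's real subword-token limit (which depends on the tokenizer, so treat `max_words` as an assumption to tune):

```python
def chunk_paragraphs(paragraphs: list[str], max_words: int = 450) -> list[str]:
    """Greedily pack consecutive paragraphs into chunks of at most
    `max_words` words, so each chunk can be fed to the model separately.
    A single paragraph longer than the budget becomes its own chunk."""
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Keeping the chunk boundaries on paragraph breaks matches the exam method above: each chunk's partial summary corresponds to a group of paragraphs, which you can then merge into the final 10% summary.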

Moreover, it wouldn't be so hard to build a bigger dataset (around 10K observations, which is not that big, but still) from all the previous undergraduate ("prépa") maths exams and their solutions. However, I suspect that if I built it, each problem and solution would be quite long LaTeX code, and I'd face an even bigger length problem.
Here's an example of an exam solution: Dropbox - Lyon13-2.pdf

I am asking because in long maths exams a lot of questions are connected to each other: you use the earlier questions to answer the later ones, and so on. The papers I've seen on automated maths problem solving seem to tackle questions that stand alone.

Sorry for how long the post is. If anyone has a hint, that would be great. Thank you.