I am starting a project to summarise text in Spanish and would like to leverage Transformers. I am not sure which model I should start with or how to approach the problem. I imagine I will need to fine-tune a model on my own dataset, but I do not fully understand how to pick one: some models support multiple languages, and some are better suited to particular tasks.
If you want to do extractive summarization (selecting the most important sentences from the source text rather than generating new ones), you could start with a pre-trained Spanish BERT model. A good place to look is the Hugging Face model hub: https://huggingface.co/models?search=spanish
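To make the extractive route a bit more concrete, here is a minimal sketch of an unsupervised baseline you could try before any fine-tuning: embed each sentence, then keep the sentences closest to the document's average embedding. This centroid scoring is just one simple heuristic, not an established recipe for this model. The helper names (`extractive_summary`, `make_bert_embedder`) are made up for illustration, the checkpoint is the Geotrend/bert-base-es-cased one linked in this thread, and I am assuming `torch` and `transformers` are installed.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extractive_summary(
    sentences: List[str],
    embed: Callable[[List[str]], List[Vector]],
    k: int = 3,
) -> List[str]:
    """Return the k sentences closest to the document centroid, in original order."""
    if not sentences:
        return []
    vectors = embed(sentences)
    dim = len(vectors[0])
    # The centroid is the element-wise mean of all sentence embeddings.
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scores = [cosine(v, centroid) for v in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def make_bert_embedder(model_name: str = "Geotrend/bert-base-es-cased"):
    """Build an embed() function backed by a pre-trained BERT checkpoint."""
    # Imported lazily so the scoring code above works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def embed(sentences: List[str]) -> List[Vector]:
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state  # (n_sentences, n_tokens, dim)
        # Mean-pool over real tokens only, ignoring padding positions.
        mask = batch["attention_mask"].unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return pooled.tolist()

    return embed
```

You would call it as `extractive_summary(sentences, make_bert_embedder(), k=3)` after splitting your document into sentences yourself. Keeping the embedder as a plain function argument means you can swap in a different Spanish model from the hub without touching the scoring code.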
For abstractive summarization, I don’t know if there is any pre-trained model available in Spanish. If you have a large training corpus and the resources, you could try training a model from scratch. To my knowledge, BART and T5 have shown some promising results for summarization.
If you aren’t set on implementing the model yourself, you could use a library like OpenNMT or Fairseq to train a summarization model; OpenNMT-py has a summarization example here: https://opennmt.net/OpenNMT-py/examples/Summarization.html. A model trained this way could also serve as a baseline if you later decide to build your own.
I would start with some pre-trained Spanish language model, e.g. https://huggingface.co/Geotrend/bert-base-es-cased, and then fine-tune it on an extractive summarization dataset. I’m not sure which datasets exist in Spanish for this task, but this one could be interesting.
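One practical point on the fine-tuning data: if what you end up with is documents paired with human-written summaries rather than per-sentence labels, you can derive approximate extractive labels yourself. Below is a rough sketch that greedily marks the sentences covering the most not-yet-covered words of the reference summary. The function names are hypothetical, the whitespace tokenisation is deliberately naive, and the word-overlap gain is only a crude stand-in for a proper ROUGE-based selection.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; swap in a real Spanish tokenizer if you have one."""
    return set(text.lower().split())


def greedy_oracle_labels(
    sentences: List[str],
    reference: str,
    max_sentences: int = 3,
) -> List[int]:
    """Label each sentence 1 (keep) or 0 (drop) by greedy overlap with the reference."""
    ref = tokens(reference)
    selected: List[int] = []
    covered: Set[str] = set()
    while len(selected) < max_sentences:
        best, best_gain = None, 0
        for i, sentence in enumerate(sentences):
            if i in selected:
                continue
            # Gain = reference words this sentence adds beyond what is already covered.
            gain = len((tokens(sentence) & ref) - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining sentence adds any new reference word
            break
        selected.append(best)
        covered |= tokens(sentences[best]) & ref
    return [1 if i in selected else 0 for i in range(len(sentences))]
```

The resulting 0/1 labels can then be used as targets for a sentence-level classifier on top of the BERT encoder. Treat them as noisy supervision, since several different sentence subsets can cover the reference about equally well.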