Transformers Spanish Summarizer

nn.Charles · November 5, 2020, 12:19pm

Hi everyone,

I am starting a project that aims to summarise text (in Spanish) and would like to leverage Transformers. I am not sure which model I should start with or how to approach this problem. I imagine I would need to fine-tune a model on my dataset, but I do not fully understand how to pick one. Some models support multiple languages, while others are better suited to particular tasks.

Thanks for your help!

Charles

msivanes · November 5, 2020, 2:47pm

This is how I would start.

Read about the related work in summarization, with a focus on multilingual aspects, from here
Look at how others have approached this, with code examples. See https://ohmeow.github.io/blurr/modeling-summarization/
Guidance on model selection for the summarization task

Good luck!! Happy learning!!

stefan-ai · November 5, 2020, 6:13pm

Hi Charles,

If you want to do extractive summarization, you could start with a pre-trained Spanish BERT model. A good place to look is the huggingface model hub: https://huggingface.co/models?search=spanish
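To make the extractive idea concrete, here is a minimal sketch of one common baseline: embed each sentence, compare it to the document as a whole, and keep the highest-scoring sentences in their original order. Everything in it is an assumption for illustration and not something from this thread: the function names are hypothetical, the sentence splitter is naive, and `bow_embed` is a bag-of-words stand-in so the sketch runs without downloading anything. With a pre-trained Spanish BERT model you would replace `bow_embed` with sentence vectors taken from the encoder, which would also mean adapting the vector type.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def bow_embed(sentence: str) -> Vector:
    """Stand-in embedding (assumption): lowercase bag-of-words counts."""
    return {w: float(c) for w, c in Counter(re.findall(r"\w+", sentence.lower())).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extractive_summary(
    text: str, k: int = 2, embed: Callable[[str], Vector] = bow_embed
) -> List[str]:
    """Return the k sentences most similar to the document vector, in document order."""
    sentences = split_sentences(text)
    vectors = [embed(s) for s in sentences]

    # Document vector: sum of the sentence vectors (cosine ignores the scale).
    doc_vector: Vector = {}
    for vec in vectors:
        for key, val in vec.items():
            doc_vector[key] = doc_vector.get(key, 0.0) + val

    scores = [cosine(vec, doc_vector) for vec in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    texto = (
        "El gato duerme en el sofá. El perro ladra. "
        "El gato y el perro duermen en el sofá de la casa."
    )
    for sentence in extractive_summary(texto, k=2):
        print(sentence)
```

This is an unsupervised baseline and involves no fine-tuning. It can be useful as a sanity check before training a sentence classifier on top of BERT.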

For abstractive summarization, I don’t know if there is any pre-trained model available in Spanish. If you have a large training corpus and the resources, you could try training a model from scratch. To my knowledge, BART and T5 have shown some promising results for summarization.

birosjh · November 6, 2020, 2:41am

If you aren’t worried about actually building the model yourself, you could try using a library like OpenNMT or Fairseq to train a summarization model. In addition, it could serve as a baseline if you do decide to build your own. (https://opennmt.net/OpenNMT-py/examples/Summarization.html)

msivanes · November 6, 2020, 12:44pm

@nn.Charles
I just discovered a live project. It seems like a fun one to work on if you are new to summarization (if not, please ignore).

Wilfredo · March 31, 2021, 3:05pm

Hi Stefan,
Do you have a suggestion on how to get started with a pre-trained Spanish BERT model for extractive summarization?
Thank you so much in advance.

stefan-ai · April 13, 2021, 7:05am

Hi Wilfredo,

I would start with some pre-trained Spanish language model, e.g. https://huggingface.co/Geotrend/bert-base-es-cased, and then fine-tune it on an extractive summarization dataset. I’m not sure which datasets exist in Spanish for this task, but this one could be interesting.
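Fine-tuning for extractive summarization usually means training a classifier that decides, sentence by sentence, whether to keep each one. If the dataset at hand only provides human-written summaries and no sentence-level labels, one common workaround is to derive "oracle" labels greedily. The sketch below shows that idea under stated assumptions: the function names are hypothetical, and plain unigram recall stands in for the ROUGE-style scores that are typically used. It only prepares labels; the actual fine-tuning step is not shown.

```python
import re
from typing import List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def recall(selected: Set[str], reference: Set[str]) -> float:
    """Fraction of reference tokens covered by the selected tokens."""
    if not reference:
        return 0.0
    return len(selected & reference) / len(reference)


def greedy_oracle_labels(
    sentences: List[str], reference_summary: str, max_sentences: int = 3
) -> List[int]:
    """Return one 0/1 label per sentence (1 = keep in the extractive summary)."""
    reference = tokens(reference_summary)
    sentence_tokens = [tokens(s) for s in sentences]

    chosen: List[int] = []
    covered: Set[str] = set()
    best = 0.0

    while len(chosen) < max_sentences:
        best_idx: Optional[int] = None
        round_best = best
        for i, toks in enumerate(sentence_tokens):
            if i in chosen:
                continue
            score = recall(covered | toks, reference)
            # Only accept a sentence if it strictly improves coverage.
            if score > round_best:
                round_best, best_idx = score, i
        if best_idx is None:
            break  # no remaining sentence adds anything
        chosen.append(best_idx)
        covered |= sentence_tokens[best_idx]
        best = round_best

    return [1 if i in chosen else 0 for i in range(len(sentences))]


if __name__ == "__main__":
    documento = [
        "El gobierno aprobó la nueva ley de educación.",
        "Hoy hace sol en Madrid.",
        "La ley entra en vigor en enero.",
    ]
    resumen = "La nueva ley de educación entra en vigor en enero."
    print(greedy_oracle_labels(documento, resumen))
```

The resulting labels can then serve as targets for a sentence classifier built on the pre-trained model. Recall alone tends to favour long sentences, so a precision-aware score may work better in practice.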

Wilfredo · April 26, 2021, 3:15pm

Thank you so much, stefan-ai.