Training a GPT-2 language model on Hinglish (Hindi + English) Twitter data

I am working on building a few NLP apps for Twitter data. First up is a tweet generator trained on a specific set of users. I am using the repo ru_transformers as a reference. Big thanks to Mikhail Grankin for sharing his work and providing such a detailed write-up. So far I have trained it on a very small dataset (~20 MB). The idea is to deliberately overfit a model so that I can confirm all the pieces are working before I move on to training on the full dataset.

In the training dataset, I have one tweet on each line with an empty line between tweets. I am using the YTTM tokenizer as instructed in the write-up. So far I have done only light pre-processing, removing extremely short tweets; other than that I want to keep all other information intact. Most of the tweets are written in Hinglish (Hindi words spelled out in the English/Latin script), with a good percentage of native Hindi (Devanagari) words also present, along with lots of emojis. A rough sketch of my data preparation and tokenizer setup is below.
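This is only a minimal sketch of the preparation step; the file names, the length cutoff, and the vocab size are placeholders, not my exact values:

import youtokentome as yttm

# Keep one tweet per line with an empty line between tweets,
# dropping extremely short tweets. The 20-character cutoff is illustrative.
MIN_CHARS = 20

with open("raw_tweets.txt", encoding="utf-8") as src, \
        open("train.txt", "w", encoding="utf-8") as dst:
    for line in src:
        tweet = line.strip()
        if len(tweet) >= MIN_CHARS:
            dst.write(tweet + "\n\n")

# Train a YTTM BPE vocabulary on the prepared corpus.
yttm.BPE.train(data="train.txt", vocab_size=24000, model="hinglish.yttm")

# Quick sanity check that Hinglish, Devanagari and emoji all tokenize.
bpe = yttm.BPE(model="hinglish.yttm")
print(bpe.encode(["we need to sochna about this"], output_type=yttm.OutputType.SUBWORD))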

Here are a couple of examples of what the model generates on this extremely tiny dataset. The prompt is what I specify, and the model returns 3 samples.

Prompt: "we need to"
{
    "replies": [
        "ting to the law of the state and state leaders. Our judiciary needs water listing our government also as our cooperation.\" <|n|<|n| @ pradip103 these guys will be closed and still such subjects who are alive & amp; good in state forever.",
        " started trending # terrorism <|n| <|n| Next year we are begging congress # Hindus <|n| <|n| only indians are telling and respect for others and what we are working <|n|n| Many happy returns of the day @ sard",
        " Woman ... At least approximately Indians have been almost 25% Muslim population percentage in south India and is all Indians including 30%. Only game is now."
    ]
}

Prompt: "we need to"
{
    "replies": [
        " mouga kabvan? Kisse Owaisi Ko sikhate hein?",
        " sir Mr.",
        "ఏ turned out to create mayhem against Islamism and population of India. Else how will it be chief of that India chief left?\" <|n|<|n| Khan is punching towards Suit. Including his Congi IT cell workout."
    ]
}

Prompt: "we need to"
{
    "replies": [
        " ने सोशल मीडिया पर कब्ज़ा किया था| # HinduRashtra # HDL\" <|n|||||n| @ upma23 जन्मदिन की हार्दिक शुभकामनाएं । भगवान श्री कृष्ण👆 <|n||||n| @ ashish_prataps धन्यवाद! Taged Champion!",
        " this might be so apt about this. My part is right in Mumbai. Jai Hind Jai Bharat🇮🇳 <|n|||<|n| And yet to cry # Pigs the inhumanity.",
        "-a journo of sexual slavery.# India # ExitPoll # OlaHuUber 💣💥 <|n| <|n|| # OlaHuUber 💣💥 Israel-e- Medina 💣💥 <|n| <|n| OlaHuUber 💣💥 Media is a loser Bollywood Funny person."
    ]
}

One obvious thing that I need to fix is these “<|n|” fragments. Any idea on how to address them? Is there anything else that I am not doing right, which I should be aware of before I go ahead and train the full model? Is YTTM a good choice of tokenizer? It seems to me it is doing its job properly, but I would like to be sure. Any comments/suggestions are welcome.
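For now I can strip the fragments in post-processing with something like the snippet below, but I would prefer to fix this properly at the data/tokenizer level. It assumes “<|n|>” was the newline marker in the training corpus and that the samples contain mangled variants of it; the regex is a rough guess at those variants:

import re

# Collapse well-formed and mangled newline markers, e.g. "<|n|>",
# "<|n|", "<|n||||n|", "<|n|<|n|", into real newlines. Rough heuristic.
NEWLINE_MARKER = re.compile(r"(?:<\|[n|<>]*\s*)+")

def clean_sample(text: str) -> str:
    text = NEWLINE_MARKER.sub("\n", text)
    # Collapse runs of blank lines left behind by adjacent markers.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

print(clean_sample("only indians are telling <|n| <|n| Many happy returns of the day"))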