Approaching the creation of an Endangered Language Chatbot

Hello all!

I recently wrote a blog post where I explain my understanding of the Drivetrain approach, based on what I have learned so far from Chapter 2 of the fastai course, and apply it to creating an Endangered Language Chatbot.

You can have a read through the post here:
https://forbo7.github.io/ForBlog/model%20deployment/drivetrain%20approach/2022/05/27/How-to-Approach-Creating-AI-Models.html

I would appreciate any comments, suggestions, or corrections!

Interesting idea. Maybe not so much data would be required to create a chatbot for an endangered language. Chatbots may be able to train themselves. However, some sort of feedback from native speakers would be required to prevent the chatbots from developing their own accent. Just a thought.

Ah, I see you’re looking at it from the spoken perspective. Yes, feedback from native speakers will definitely be required, though it would be difficult to obtain since the languages themselves would be endangered or extinct. Nonnative speakers learning to speak the language would help, but their spoken accuracy would not be as good.

Perhaps not much data would be required to produce a natural-sounding voice or accent, but a lot of data will definitely be required for the model to “learn and speak” the language. As an example, the best English language model out there at the moment is GPT-3, and that model has a whopping 175 billion parameters :smile:. And even then, the model still isn’t on par with a native speaker, in either understanding or writing.

Yes, GPT-3 seems to need that enormous amount of training data to avoid overfitting. I have the impression that NLP has a long way to go before it is useful with smaller training datasets.