Multi-Fit - Train and use my own "language" as a model

elamey · December 1, 2019, 11:53pm

First post, Hello! I’m a career .NET middleware application developer who is trying to come up to speed in machine learning. I’ve read a lot of posts here and am very happy to have found this resource!

Question: Is there a tutorial/notebook on how to train a “new” or “unknown” language into a pretrained model?

Deeper Dive: Could I use the new MultiFiT method to treat my own (tagged/classified) text data as an “unknown language”? This would be a standalone “language” model, meaning I’m not looking to cross-reference this “unknown language” to English, for example. I’d just like to train it with MultiFiT on my “unknown language” corpus and make next-word predictions. I’ve got the model loading and predicting part down (thanks to the great resources here), but I’m not able to find a tutorial on actually training a model on a “new” language.

Example: The words in my “data corpus” are made up of only 2 “letters.” The words always alternate between the two letters, but have different, seemingly random lengths (they do have patterns; I’m just looking to predict them).

Example: (the sample input is marked by >> << tags; the tags are NOT a part of the vocabulary)
Sample of Corpus: yy zzz yyyyyy z yyy zz yy zz y z y z yyyy zz >>yy zz yy zz y<< zzzz yy zz y z y zz
Sample Input: yy zz yy zz y
Sample “next word” prediction: zzzz
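
To make the data format concrete, here is a minimal sketch in plain Python of how a corpus like this could be split into words and turned into (context, next word) pairs. This is not MultiFiT or fastai code; the function names and the context length of 5 are only illustrative assumptions, chosen to match the sample input above.

```python
def tokenize(corpus):
    """Split on whitespace; each run of y's or z's counts as one word."""
    return corpus.split()


def next_word_pairs(words, context_len=5):
    """Build (context, target) pairs with a sliding window over the words.

    context_len=5 is an arbitrary choice that matches the sample input.
    """
    pairs = []
    for i in range(len(words) - context_len):
        context = words[i:i + context_len]
        target = words[i + context_len]
        pairs.append((context, target))
    return pairs


# The sample corpus from above, without the >> << marker tags.
corpus = (
    "yy zzz yyyyyy z yyy zz yy zz y z y z yyyy zz "
    "yy zz yy zz y zzzz yy zz y z y zz"
)

words = tokenize(corpus)
vocab = sorted(set(words))      # every distinct word is one vocabulary entry
pairs = next_word_pairs(words)

# The sample input shows up as one of the contexts, with "zzzz" as its target.
sample_input = ["yy", "zz", "yy", "zz", "y"]
targets = [target for context, target in pairs if context == sample_input]
print(vocab)
print(targets)
```

With whole words as tokens the vocabulary stays tiny (one entry per distinct word length and letter), so the real question is how to get pairs like these into the MultiFiT training pipeline.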

To summarize: I’m asking for resources (discussion, assistance, links, articles) on how to train a “new” and “unknown” language into a pre-trained model with the new MultiFiT method. I greatly appreciate any assistance and will gladly share the results!