Hi there,
Thanks in advance for taking the time to read this. Here’s what I’m looking to do – maybe someone here can provide some guidance.
I have a database of ~200K movies, which are of the form:
{
name: "The Godfather",
length: 100,
description: "Greatest movie ever",
rating: "r",
//etc...
}
I am ingesting new movies from different sources that do not conform to this schema. For example, a new record may refer to “name” as “title” or as any other arbitrary string. I would like my model to map “title” (or any arbitrary string) onto “name”.
I’m thinking I will use NLP for this. Here’s what I have in mind:
I’ll create a language model from all the movies in my database. In this model, my tokens will be the keys and values from the movie records. So, for example, a sentence in this language, using the record above, would be each key followed by its value: “name The Godfather length 100 description Greatest movie ever rating r”.
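To make the serialization concrete, here is a minimal Python sketch of how I picture turning a record into a sentence. The helper name `record_to_sentence` is made up for illustration, and it simply keeps each value's original casing.

```python
def record_to_sentence(record):
    """Flatten a record into a 'sentence' of alternating keys and values."""
    tokens = []
    for key, value in record.items():
        tokens.append(str(key))    # the key is one "word"
        tokens.append(str(value))  # the whole value follows it
    return " ".join(tokens)


# The example record from above (the "etc." fields are left out).
movie = {
    "name": "The Godfather",
    "length": 100,
    "description": "Greatest movie ever",
    "rating": "r",
}

sentence = record_to_sentence(movie)
print(sentence)
```

Note that joining on spaces loses the boundary between a key and a multi-word value, which is exactly the issue I ask about below.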
That way, if I ingest a record that does not conform to our schema, e.g.:
{title: "The Godfather"}
my model will know that “The Godfather” is the “word” that follows the word “name” in my existing language model, and that therefore I should create a (duplicate) record containing the entry:
{name: "The Godfather"}
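To illustrate the behavior I'm after (not the language model itself), here is a rough Python sketch that swaps in a plain exact-match lookup from known values to the key they most often appear under. The names `build_value_index` and `guess_key` are hypothetical, and a lookup like this can only handle values that already exist in my database.

```python
from collections import Counter, defaultdict


def build_value_index(records):
    """Map each (lowercased) value to a count of the keys it appears under."""
    index = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            index[str(value).lower()][key] += 1
    return index


def guess_key(index, value):
    """Return the most common known key for this value, or None if unseen."""
    counts = index.get(str(value).lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Existing records that follow my schema.
known_records = [
    {"name": "The Godfather", "length": 100, "rating": "r"},
]
index = build_value_index(known_records)

# An incoming record that uses "title" instead of "name".
incoming = {"title": "The Godfather"}
remapped = {
    (guess_key(index, value) or key): value  # keep the old key if no match
    for key, value in incoming.items()
}
print(remapped)
```

What I'm hoping a learned model adds on top of this is the ability to generalize to values it has never seen.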
A couple of questions:
- Is this a reasonable approach?
- How do I account for “words” that are not single word strings? E.g. “The Godfather” contains two words but should be treated as a single unit in this language.
- Is my dataset large enough?
- Is there another deep learning domain (e.g. tabular DL) better suited to solve this kind of problem?
Thanks very much for reading.