Creating a Language Model with Movie Objects

Hi there,

Thanks in advance for taking the time to read this. Here’s what I’m looking to do – maybe someone here can provide some guidance.

I have a database of ~200K movie records, each of the form:

{
  name: "The Godfather",
  length: 100,
  description: "Greatest movie ever",
  rating: "r"
}

I am ingesting new movies from different sources that do not conform to this schema. For example, a new record may use the key title, or any other arbitrary string, where my schema uses name. I would like my model to map title, or whatever arbitrary key shows up, onto name.

I’m thinking I will use NLP for this. Here’s what I have in mind:

I’ll create a language model from all the movies in my database. The tokens in this model will be the keys and values from the movie records. So, for example, the record above would become the following sentence, with keys and values alternating: “name The Godfather length 100 description Greatest movie ever rating r”.
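To make the serialization step concrete, here is a minimal sketch of how I imagine turning one record into a sentence. The function name record_to_sentence is just something I made up for illustration, and I'm assuming each record is available as a plain Python dict whose key order matches the schema.

```python
def record_to_sentence(record):
    """Flatten one movie record into a 'sentence' of alternating keys and values."""
    tokens = []
    for key, value in record.items():
        tokens.append(key)          # the key is one token
        tokens.append(str(value))   # the value follows it (numbers become strings)
    return " ".join(tokens)


# Hypothetical in-memory copy of the example record from my database.
movie = {
    "name": "The Godfather",
    "length": 100,
    "description": "Greatest movie ever",
    "rating": "r",
}

sentence = record_to_sentence(movie)
print(sentence)
```

Note that joining on spaces like this loses the boundary between a key and a multi-word value, which is exactly the issue behind my second question below.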

That way, if I ingest a record that does not conform to our schema, e.g.:

{title: "The Godfather"}

my model will know that “The Godfather” is the “word” that follows the word name in my existing language model, and that therefore I should create a (duplicate) record containing the entry:

{name: "The Godfather"}
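To show the input and output I'm after, here is a rough sketch that fakes the behavior with a plain lookup table instead of a trained model. It is not the language model itself, only a stand-in: it counts which key each known value appears under, then renames an incoming key to the most common match. The names build_value_index and map_to_schema are hypothetical, and the one-record database is just the example above.

```python
from collections import Counter, defaultdict


def build_value_index(records):
    """For every value seen in the database, count the keys it appears under."""
    index = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            index[str(value).lower()][key] += 1
    return index


def map_to_schema(record, index):
    """Rename each incoming key to the schema key its value is most often seen with."""
    mapped = {}
    for key, value in record.items():
        counts = index.get(str(value).lower())
        if counts:
            # Value already exists in the database: use its most common key.
            mapped[counts.most_common(1)[0][0]] = value
        else:
            # Value never seen before: leave the key unchanged.
            mapped[key] = value
    return mapped


# Hypothetical one-record stand-in for the ~200K movie database.
database = [
    {"name": "The Godfather", "length": 100,
     "description": "Greatest movie ever", "rating": "r"},
]

index = build_value_index(database)
result = map_to_schema({"title": "The Godfather"}, index)
print(result)
```

This only works when the incoming value already exists in my database, which is why I'm hoping a learned model could generalize to values it has never seen.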

A couple of questions:

  1. Is this a reasonable approach?
  2. How do I account for “words” that are not single-word strings? E.g. “The Godfather” contains two words but should be treated as a single unit in this language.
  3. Is my dataset large enough?
  4. Is there another deep learning domain (e.g. tabular DL) better suited to solve this kind of problem?

Thanks very much for reading.