How to use your own document instead of the IMDb dataset in NLP lesson 8 part 1

Here is the code from the lecture.

How can I tokenise my own document instead of the IMDb dataset?
My objective is to create a summary of a legal case instead of a movie review, as shown in the lecture.

You need to save your text(s) as .txt file(s) somewhere under path, then pass the relevant folder name(s) in that path to get_text_files. Something like:

files = get_text_files('projects/data', folders=['legal_texts'])

If you don’t have .txt files, you can use get_files() in a similar manner.
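
In case the folder layout is unclear, here is a minimal sketch, assuming your legal case is already in a Python string. It writes the text into a `legal_texts` folder and lists the `.txt` files there. The file name `case_001.txt`, the sample text, and the temporary directory are made up for illustration, and `Path.glob` is only a standard-library stand-in for `get_text_files` so the snippet runs without fastai installed.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for 'projects/data'; a temp dir keeps the sketch self-contained.
path = Path(tempfile.mkdtemp())

# Folder name taken from the example above.
folder = path / "legal_texts"
folder.mkdir(parents=True, exist_ok=True)

# Hypothetical placeholder text; replace it with your own legal case.
case_text = "The appellant contends that the contract was void.\n\nThe court disagrees."
(folder / "case_001.txt").write_text(case_text, encoding="utf-8")

# Standard-library listing of the saved .txt files (stand-in for get_text_files).
files = sorted(folder.glob("*.txt"))
print([f.name for f in files])
```

With fastai installed, pointing `get_text_files` at the same path and folder name should pick up the same file.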


Do I have to convert the text into some particular structure before using it or can I use the legal case text as it is?

So you just have a single text? IMDb is broken up into many reviews. Test what happens when you pass your single text in, then try breaking it up into smaller sections and see what happens :slight_smile:
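
One possible way to break a single text into smaller sections, as a rough sketch rather than anything from the lecture: group paragraphs (separated by blank lines) until a word limit is reached. The function name `split_into_sections`, the `max_words` parameter, and the sample text are all hypothetical, and splitting on blank lines is an assumption about how your document is laid out.

```python
def split_into_sections(text, max_words=200):
    """Group blank-line-separated paragraphs into sections of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own section.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sections, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current section if adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            sections.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        sections.append("\n\n".join(current))
    return sections


# Hypothetical sample text standing in for a legal case.
sample = (
    "First paragraph of the case.\n\n"
    "Second paragraph with the facts.\n\n"
    "Third paragraph with the ruling."
)
sections = split_into_sections(sample, max_words=10)
print(len(sections))
```

Each section could then be saved as its own .txt file, so the folder looks more like IMDb's many separate reviews.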
