How to use your own document instead of IMDb dataset in NLP lesson 8 part 1

Darsh · September 7, 2020, 5:22am

Here is the code from the lecture.

How can I use my own document instead of the IMDb dataset and tokenise it?
My objective is to create a summary of a legal case instead of a movie review, as shown in the lecture.

morgan · September 7, 2020, 1:19pm

You need to save your text(s) as .txt files somewhere under path, then pass the relevant folder name(s) in that path to get_text_files. Something like:

files = get_text_files('projects/data', folders=['legal_texts'])

If you don’t have .txt files, you can use get_files() in a similar manner.

Darsh · September 7, 2020, 6:44pm

Do I have to convert the text into some particular structure before using it or can I use the legal case text as it is?

morgan · September 8, 2020, 7:21am

So you just have a single text? IMDb is broken up into many reviews. Test what happens when you pass your single text in, then try breaking it up into smaller sections and see what happens.
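
One way to try the "break it up into smaller sections" idea is sketched below. It is only a rough illustration using the Python standard library, not code from the lecture: the helper names split_into_sections and save_sections, the max_words limit, and the section_0000.txt file naming are all made up for this example, and it assumes paragraphs in the legal text are separated by blank lines.

```python
import re
from pathlib import Path


def split_into_sections(text, max_words=200):
    """Group blank-line-separated paragraphs into sections.

    Hypothetical helper: a new section starts whenever adding the next
    paragraph would push the current section past max_words words.
    A single paragraph longer than max_words is kept whole.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    sections, current, count = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and count + n_words > max_words:
            sections.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        sections.append("\n\n".join(current))
    return sections


def save_sections(sections, out_dir):
    """Write each section to its own .txt file and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, section in enumerate(sections):
        path = out_dir / f"section_{i:04d}.txt"
        path.write_text(section, encoding="utf-8")
        paths.append(path)
    return paths
```

With the folder layout from the earlier reply, you could read your case text, call save_sections(split_into_sections(case_text), 'projects/data/legal_texts'), and then point get_text_files at that folder as shown above. Splitting on word count is just one choice; splitting on the headings of the judgment may suit a legal case better.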