I need help working with a dataset that I have as a CSV file. I am new to NLP (I have only played around with some toy datasets) and I am trying to tackle tag recommendation based on an article (given as the HTML code of the article page) and its title. I don't know how to get started, since the training file is 1.3 GB and I find it difficult even to view it in full. I want to remove the tags from the HTML code. Please help me, and suggest any other good ways to get started.
The dataset looks like this.
I would load the CSV file using the csv module from Python's standard library.
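Since the file is 1.3 GB, it may help to read it row by row instead of opening it all at once. Here is a minimal sketch using only the standard library. The column names (id, title, article, tags) and the file name train.csv are assumptions for illustration, since I can't see your actual header.

```python
import csv
import io
import itertools

# HTML fields can be longer than csv's default per-field limit (131072 chars),
# so raise the limit before reading. 2**31 - 1 fits in a C long on all platforms.
csv.field_size_limit(2**31 - 1)

def peek_rows(handle, n=5):
    """Return the header and the first n data rows without reading the whole file."""
    reader = csv.reader(handle)
    header = next(reader)                     # first line holds the column names
    rows = list(itertools.islice(reader, n))  # stop after n rows
    return header, rows

# Small in-memory stand-in for the real file (hypothetical columns and values).
sample = io.StringIO(
    'id,title,article,tags\n'
    '1,"Hello","<p>Hi, <b>world</b></p>","greeting|demo"\n'
    '2,"Bye","<p>See you</p>","farewell"\n'
)
header, rows = peek_rows(sample, n=1)
print(header)
print(rows)

# With the real file (the name is an assumption) it would look like:
# with open("train.csv", newline="", encoding="utf-8") as f:
#     header, rows = peek_rows(f, n=5)
```

Because the reader is an iterator, memory use stays small no matter how large the file is. The same loop can later be used to process every row one at a time.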
You will then need to select the article field in each row and clean it. I would recommend checking out
https://docs.fast.ai/text.transform.html#text.transform, where you will probably find the function you are looking for.
If you would like to parse the HTML manually, check out the Beautiful Soup library (bs4).
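To give an idea of what removing the tags involves, here is a rough sketch that uses only the standard library's html.parser. The class and function names (TextExtractor, strip_html) are made up for this example. With Beautiful Soup installed, BeautifulSoup(html, "html.parser").get_text() should do roughly the same job with less code.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)

def strip_html(html):
    """Return the text of an HTML string with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(" ".join(parser.parts).split())

print(strip_html("<p>Hi, <b>world</b></p><script>var x = 1;</script>"))
```

Note that this joins text fragments with a space, so a word that is split by an inline tag in the middle would end up as two tokens. For real article pages that is usually acceptable, but it is worth checking on a few rows.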
Data preprocessing usually involves the following steps:
1. Import the required libraries
2. Import the dataset
3. Handle missing data
4. Encode categorical data that does not have a binary result
5. Encode categorical data that has a binary result
6. Scale the features
7. Split the data into a training set and a test set
You can read about all the steps in detail here: Data Preprocessing with Python
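As a small illustration of steps 3 and 7 from the list above, here is a sketch with plain Python. The helper names (drop_missing, train_test_split) and the field names are assumptions for the example. In practice you would probably use a library for this, and the encoding and scaling steps are left out here.

```python
import random

def drop_missing(rows, required):
    """Keep only rows where every required field is present and non-blank."""
    return [r for r in rows if all((r.get(k) or "").strip() for k in required)]

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy rows standing in for parsed CSV records (hypothetical fields).
data = [{"id": str(i), "article": f"text {i}", "tags": "demo"} for i in range(9)]
data.append({"id": "9", "article": "   ", "tags": "demo"})  # blank article

clean = drop_missing(data, required=["article", "tags"])
train, test = train_test_split(clean, test_ratio=0.2, seed=0)
print(len(clean), len(train), len(test))
```

Dropping incomplete rows is the simplest way to handle missing data. Whether that is appropriate depends on how many rows are affected in the real dataset.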