ULMFiT - French

pierreguillou · September 13, 2019, 7:47pm

Guide to download the French Amazon Customer Reviews

Read the information page and the license of the Amazon Customer Reviews Dataset.

Create an AWS Free Tier account.
Log in to the IAM console of your AWS account with the login and password of the account you just created.
Create an IAM Admin User and Group by following these instructions.
Create your IAM user access keys (access key ID and secret access key) by following these instructions. DO NOT FORGET to save your 2 keys.
Install the AWS Command Line Interface (aws cli) in an Ubuntu terminal on your computer by following these instructions.
Configure your aws cli by following these instructions.
With your aws cli, you can list the reviews datasets available in the bucket with the ls command by typing the following in your Ubuntu terminal:
aws s3 ls s3://amazon-reviews-pds/tsv/

List (2017-11-24):
amazon_reviews_multilingual_DE_v1_00.tsv.gz
amazon_reviews_multilingual_FR_v1_00.tsv.gz
amazon_reviews_multilingual_JP_v1_00.tsv.gz
amazon_reviews_multilingual_UK_v1_00.tsv.gz
amazon_reviews_multilingual_US_v1_00.tsv.gz
To download data using the aws cli, you can use the cp command. For instance, the following commands copy the file named amazon_reviews_multilingual_FR_v1_00.tsv.gz to your local data folder:
cd path_to_your_data_folder
aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz .
Unzip your file:
gzip -d amazon_reviews_multilingual_FR_v1_00.tsv.gz
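
If you prefer to stay in Python (for example inside a notebook), the same decompression step can be sketched with the standard library. This is only a minimal sketch: the helper name gunzip and the tiny sample file are mine, not part of the original guide. Unlike gzip -d, it keeps the .gz file unless you ask otherwise.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, keep_original=True):
    """Decompress a .gz file next to itself and return the path of the result."""
    src = Path(src)
    dst = src.with_suffix("")  # "x.tsv.gz" -> "x.tsv"
    # Stream the bytes so that a large file never has to fit in memory.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if not keep_original:
        src.unlink()  # mimic `gzip -d`, which removes the compressed file
    return dst


# Round trip on a tiny made-up file, just to show the call.
demo_dir = Path(tempfile.mkdtemp())
demo_gz = demo_dir / "sample.tsv.gz"
with gzip.open(demo_gz, "wb") as f:
    f.write(b"review_id\tstar_rating\nR1\t5\n")

demo_tsv = gunzip(demo_gz)
print(demo_tsv.name)
```

For the real file, the call would be gunzip(path_data/'amazon_reviews_multilingual_FR_v1_00.tsv.gz'), assuming path_data is a pathlib.Path pointing to your data folder.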

In your Jupyter notebook, load the tsv file with pandas, for example with the following code (see the list of column names):

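from pathlib import Path
import pandas as pd

# path_data: folder holding the decompressed file (placeholder path, adapt it)
path_data = Path('path_to_your_data_folder')
fields = ['review_id', 'review_body', 'star_rating']
df = pd.read_csv(path_data/'amazon_reviews_multilingual_FR_v1_00.tsv', delimiter='\t', encoding='utf-8', usecols=fields)
df = df[fields]  # reorder the columns as listed in fields
df.loc[pd.isna(df.review_body), 'review_body'] = 'NA'  # replace missing review bodies with the string 'NA'
df.head()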
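
To try the loading step before downloading the full file, here is a small self-contained sketch of the same logic. The function name load_reviews and the three sample rows are made up for illustration; only the column names and the read_csv options come from the code above.

```python
import io

import pandas as pd

FIELDS = ['review_id', 'review_body', 'star_rating']


def load_reviews(source, fields=FIELDS):
    """Read a reviews TSV (path or file-like object) and keep only `fields`."""
    df = pd.read_csv(source, delimiter='\t', encoding='utf-8', usecols=fields)
    df = df[fields]  # same column order as `fields`
    # Empty review bodies are read as NaN; replace them with the string 'NA'.
    df['review_body'] = df['review_body'].fillna('NA')
    return df


# Made-up rows standing in for the real file (the second review body is empty).
rows = [
    ['review_id', 'product_title', 'star_rating', 'review_body'],
    ['R1', 'Livre', '5', 'Très bon livre'],
    ['R2', 'Casque', '2', ''],
    ['R3', 'Lampe', '4', 'Fonctionne bien'],
]
sample_tsv = '\n'.join('\t'.join(r) for r in rows) + '\n'

sample_df = load_reviews(io.StringIO(sample_tsv))
print(sample_df)
# Quick look at the class balance before training a classifier.
print(sample_df['star_rating'].value_counts().sort_index())
```

With the real data, you would pass the path of the decompressed tsv file instead of the in-memory sample.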

That’s it. You can now fine-tune your language model and then your classifier on the French Amazon Customer Reviews, using the ULMFiT method implemented in the nn-vietnamese.ipynb notebook. Have fun, and please publish your results. Thanks.