NLP - Best approach(es) for anonymization / pseudonymization of personal data?

Alexandre_DIEUL · February 24, 2020, 8:03pm

Hi everyone,

I’ve been modifying IMDB lessons 3 & 4 to classify chunks of text according to certain labels. It worked perfectly: nice work, fastai!
This time, I may use text data containing personal information (names, phone numbers, logins, emails, etc.).
I need a rock-solid approach to make sure I’m compliant with the GDPR (General Data Protection Regulation).

What would be your best approach(es) to avoid sending any sensitive data to a cloud VM for training, whether by using a method from fastai itself (that I don’t know about) or by using Python (or anything more specific)?

Have a great evening.
Looking forward to reading your enlightening answers.

Sincerely yours,
Alexandre.

Alexandre_DIEUL · February 24, 2020, 8:28pm

My first naive hypothesis would be: why not strip the vocab from the TextClasDataBunch after tokenization and numericalization? But then, would I be able to fine-tune the language model without the vocab?

What’s your opinion on that?

darek.kleczek · February 25, 2020, 10:20am

I would start by using the spaCy NER pipeline to identify named entities and mask them (assuming you’re doing this in English or another language covered by spaCy). See if this is sufficient for your needs; if not, then more research on named entity recognition is probably needed, or some playing around with regex. I assume this is a common problem, so maybe there are already some proven methods.
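
A minimal sketch of the masking step, assuming you already have character offsets for the entities to hide. The helper names (mask_spans, email_spans) and the email pattern are illustrative assumptions, not part of spaCy or fastai; the regex is deliberately simple and will miss unusual addresses.

```python
import re

# Hypothetical helper: replace character spans with <LABEL> placeholders.
def mask_spans(text, spans):
    """spans: iterable of non-overlapping (start, end, label) character offsets."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"<{label}>" + out[end:]
    return out

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def email_spans(text):
    """Return (start, end, label) spans for every email-like match."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]

# With spaCy, NER spans could be collected in the same shape from a processed doc:
#   [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

sample = "Contact jane.doe@example.com or bob@example.org for access."
masked = mask_spans(sample, email_spans(sample))
print(masked)
```

Keeping the masking separate from the detection means NER spans and regex spans can be merged into one list before masking, as long as they do not overlap.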

Alexandre_DIEUL · February 25, 2020, 11:16am

Thank you so much. I’ll try spaCy!
Have a nice day.

Johnpal · February 25, 2020, 4:56pm

You can also use this open source library

Alexandre_DIEUL · February 26, 2020, 10:38am

Thanks, I’ll have a look!

Alexandre_DIEUL · February 26, 2020, 3:16pm

I tried Presidio.
Unfortunately, it makes errors when labelling the entities.

Example:

“3: telephone number:
Personal: 06 28 73 09 41. Not to be communicated.”

is “anonymized” as:

“3: telephone number:
Personal: 06 28 73 <DATE_TIME>. Not to be <US_DRIVER_LICENSE>.”

I think my European/French data is too far from the data Presidio used to train its solution.
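
One possible workaround is to mask French-style numbers with a dedicated pattern before any general-purpose recognizer sees the text. The sketch below is an assumption on my part, not a Presidio feature: it only covers ten-digit national numbers starting with 0 and written in pairs, and the names FR_PHONE_RE and mask_fr_phones are made up for illustration.

```python
import re

# Assumed format: French national numbers, ten digits starting with 0,
# written in pairs with an optional space, dot, or hyphen between pairs.
# The lookarounds stop the pattern from matching inside longer digit runs.
FR_PHONE_RE = re.compile(r"(?<!\d)0[1-9](?:[ .-]?\d{2}){4}(?!\d)")

def mask_fr_phones(text, placeholder="<PHONE_NUMBER>"):
    """Replace every French-style phone number with a placeholder."""
    return FR_PHONE_RE.sub(placeholder, text)

sample = "3: telephone number:\nPersonal: 06 28 73 09 41. Not to be communicated."
masked = mask_fr_phones(sample)
print(masked)
```

International forms such as +33 6 28 73 09 41 are not handled here and would need an extra alternative in the pattern.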

Regards,