NLP - Best approach(es) for anonymization / pseudonymization of personal data?

Hi everyone,

I’ve been modifying IMDB lessons 3 & 4 to classify chunks of text according to certain labels. It worked perfectly: nice work, fastai!
This time, I may use text data containing personal information (names, phone numbers, logins, emails, etc.).
I need a rock-solid approach to make sure I’m compliant with the GDPR (General Data Protection Regulation).

What would be your best approach(es) to avoid sending any sensitive data to a cloud VM for training, whether by using a method from fastai itself (that I don’t know about :slight_smile: ) or using Python (or anything more specific)?

Have a great evening.
Looking forward to reading your enlightening answers.

Sincerely yours,
Alexandre.

My first naive hypothesis would be: why not strip the vocab from the TextClasDataBunch after tokenization and numericalization? But then, would I still be able to fine-tune the language model without the vocab?

What’s your opinion on that?

I would start by using the spaCy NER pipeline to identify named entities and mask them (assuming you’re working in English or another language covered by spaCy). See if this is sufficient for your needs; if not, you will probably need to look further into named entity recognition or experiment with regexes. I assume this is a common problem, so there may already be some proven methods.
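
For the regex part, here is a minimal sketch of what I mean, using only Python’s standard `re` module. The placeholder tags, the `mask_structured` name and the `login:` prefix convention are all assumptions made up for illustration; none of this comes from fastai or spaCy. The idea is to catch structured identifiers such as emails with patterns and leave free-form names to NER.

```python
import re

# Assumption: a deliberately simple email pattern, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumption: logins appear after a literal "login:" prefix in the text.
LOGIN_RE = re.compile(r"(?i)\b(login\s*:\s*)[\w.-]+")


def mask_structured(text: str) -> str:
    """Replace emails and prefixed logins with placeholder tags."""
    # Emails go first, so a login that is itself an email is masked whole.
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Keep the "login:" prefix (group 1) and mask only the value after it.
    text = LOGIN_RE.sub(r"\1<LOGIN>", text)
    return text


sample = "login: jdupont42, mail jean.dupont@example.fr"
print(mask_structured(sample))
```

Running this locally before uploading anything means the raw values never reach the cloud VM, although patterns like these will only catch the formats they were written for.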

Thank you so much. I’ll try spaCy!
Have a nice day.

You can also use this open source library

Thanks, I’ll have a look!

I tried Presidio.
Unfortunately, it assigns the wrong entity labels to parts of my text.

Example:

“3: telephone number:
Personal: 06 28 73 09 41. Not to be communicated.”

is “anonymized” as:

“3: telephone number:
Personal: 06 28 73 <DATE_TIME>. Not to be <US_DRIVER_LICENSE>.”

I think my European/French data is too far from the data Presidio’s models were trained on.

Regards,
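
For phone numbers specifically, a dedicated pattern may be more reliable than a general-purpose detector. Below is a rough sketch for French-style numbers, run on the example above. The `mask_fr_phones` name, the `<PHONE_NUMBER>` tag and the exact set of accepted formats (ten digits starting with 0, or a +33 / 0033 prefix, with optional space, dot or hyphen separators) are my own assumptions and would need checking against real data.

```python
import re

# Assumption: French numbers written as 0X XX XX XX XX, or with a +33 / 0033
# prefix replacing the leading 0; separators may be space, dot, hyphen or absent.
FR_PHONE_RE = re.compile(
    r"(?<!\d)(?:(?:\+33|0033)[ .-]?|0)[1-9](?:[ .-]?\d{2}){4}(?!\d)"
)


def mask_fr_phones(text: str) -> str:
    """Replace anything that looks like a French phone number with a tag."""
    return FR_PHONE_RE.sub("<PHONE_NUMBER>", text)


sample = "3: telephone number:\nPersonal: 06 28 73 09 41. Not to be communicated."
print(mask_fr_phones(sample))
```

The lookarounds on both ends keep the pattern from firing inside longer digit runs, and the fixed "four pairs of digits" tail keeps it from matching short date-like fragments.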