I’ve been modifying the IMDB notebooks from lessons 3 & 4 to classify chunks of text according to certain labels. It worked perfectly: nice work, fastai!
This time, I may use text data containing personal information (names, phone numbers, logins, emails, etc.).
I need a rock-solid approach to make sure I’m compliant with the GDPR (General Data Protection Regulation).
What would be your best approach(es) to avoid sending any sensitive data to a cloud VM for training, whether by using a method from fastai itself (that I don’t know about) or by using Python (or anything more specific)?
Have a great evening.
Looking forward to reading your enlightening answers.
My first naive hypothesis would be: why not strip the vocab from the TextClasDataBunch after tokenization and numericalization? But then, would I still be able to fine-tune the language model without the vocab?
I would start by using spaCy’s NER pipeline to identify named entities and mask them (assuming you’re doing this in English or another language covered by spaCy). See if this is sufficient for your needs. If not, you probably need to do more research on named entity recognition, or play around with regex for structured items like emails and phone numbers. I assume this is a common problem, so maybe there are already some proven methods.
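To make the masking idea concrete, here is a minimal sketch of what could run locally before anything is uploaded to the cloud VM. The regex patterns, the placeholder tokens (`<EMAIL>`, `<PHONE>`, `<PERSON>`) and the helper names `mask_with_regex` and `mask_entities` are all made up for illustration. The patterns are deliberately simple and will miss many real-world formats, so treat this as a starting point rather than something that guarantees compliance.

```python
import re

# Illustrative patterns only (assumption): real data needs broader, locale-aware rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_with_regex(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


def mask_entities(text, nlp, labels=("PERSON",)):
    """Replace spans that a loaded spaCy pipeline `nlp` tags with one of `labels`.

    `nlp` is passed in by the caller, so this function itself does not import spaCy.
    """
    doc = nlp(text)
    out = text
    # Work from the last entity to the first so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            out = out[:ent.start_char] + f"<{ent.label_}>" + out[ent.end_char:]
    return out


sample = "Contact Jane Doe at jane.doe@example.com or +33 6 12 34 56 78."
masked = mask_with_regex(sample)
print(masked)
```

To also mask names, you could load a pipeline with `nlp = spacy.load("en_core_web_sm")` (assuming that model is installed) and call `mask_entities(masked, nlp)`. The `PERSON` label is what the English models use; models for other languages may use a different label such as `PER`. NER will miss some names, so it is worth spot-checking the masked output by hand before it leaves your machine.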