NLP - Best approach(es) for anonymization / pseudonymization of personal data?

My first naive hypothesis would be: why not strip the vocab from the TextClasDataBunch after tokenization and numericalization? But then, would I still be able to fine-tune the language model without the vocab?

What’s your opinion on that?
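
To make the idea concrete, here is a minimal sketch in plain Python of what I mean by "stripping the vocab". It does not use the fastai API. The helper names (`build_vocab`, `numericalize`), the `xxunk` placeholder at index 0, and the toy sentences are all assumptions for illustration only.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Build index<->token mappings (hypothetical helper, not the fastai API)."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Index 0 is reserved for unknown tokens; the rest are sorted alphabetically.
    itos = ["xxunk"] + sorted(counts)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(doc, stoi):
    """Replace each token with its integer id (0 if the token is unknown)."""
    return [stoi.get(tok, 0) for tok in doc]

# Toy documents containing personal data (made up for this example).
docs = ["alice lives in paris", "bob lives in berlin"]
tokenized = [d.split() for d in docs]  # naive whitespace tokenization

itos, stoi = build_vocab(tokenized)
ids = [numericalize(d, stoi) for d in tokenized]

# "Stripping the vocab" would mean sharing only `ids` and keeping
# `itos` / `stoi` private.
print(ids)  # [[1, 5, 4, 6], [3, 5, 4, 2]]
```

Only the integer ids would be shared. Note that the same token always maps to the same id, so word frequencies and co-occurrences remain visible in `ids`; whether that is acceptable, and whether a pretrained language model can still be fine-tuned from ids alone, is what I am asking about.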