All:
I wanted to follow up on this, since I’ve been very busy and hadn’t looped back on it. I’m going to provide a high-level overview of how I approached this project, and if anyone wants more details I’d be happy to provide them.
I was working on unsupervised classification of over 250K parole complaint revocation narratives, written in natural language for the Colorado Dept. of Corrections (DOC) over a span of several years. I had previously worked on this task using Doc2Vec, and I wanted to see how Fast AI’s language model would perform.
I started by examining and pre-processing the data (e.g. removing null records). Next I cleaned the text. This was a bit different from the IMDb sentiment workflow, as I didn’t need a language model that “understood” the connections between words and meanings. Since this was more of a classification task, I found that the model actually suffered from too much filler text, stop words, and non-categorical verbiage.
Ex: There were three records for offender X, and two of those records mentioned controlled substance charges. The third record made only brief mention of a deadly weapon, but included X’s name. The algorithm wanted to cluster the third record on offender X’s name instead of grouping it with the other weapons-related documents. Removing all names from the data set allowed the model to group that record correctly.
The cleaning process I found to be effective was as follows:
- Removed line breaks and other formatting items
- Removed slashes, as there were many references to items such as "drug/alcohol"
- Removed non-alpha characters
- Removed English stop words except for the word "not", as this was found in meaningful descriptions such as "… controlled substance possession charges not weapons related …"
- Fixed words incorrectly joined by lack of spaces. (Ex: "… while\nincustody" should be "while in custody")
- Fixed spelling mistakes
- I also ran everything through a tailored list of custom stop words specific to the DOC, in order to help the model focus on the important text only and de-noise the data
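For anyone who wants something concrete, here is a minimal sketch of a few of the cleaning steps above. It is not the code from the project: the function name `clean_narrative`, the tiny word lists, and the choice to replace slashes with a space are all assumptions made for illustration. Spelling correction and splitting of joined words are left out, since they depend on tooling not described here.

```python
import re

# Hypothetical, heavily abbreviated word lists for illustration only.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "for", "and", "was", "were", "to", "with", "not"}
DOC_STOP_WORDS = {"parolee", "offender", "complaint"}  # stand-ins for DOC-specific terms
KEPT_WORDS = {"not"}  # "not" carries meaning, so it is never dropped


def clean_narrative(text, names=()):
    """Apply the cleaning steps in order and return a space-joined string."""
    text = text.lower()
    text = re.sub(r"[\r\n\t]+", " ", text)  # line breaks and other formatting
    text = text.replace("/", " ")           # assumption: "drug/alcohol" -> "drug alcohol"
    text = re.sub(r"[^a-z\s]", " ", text)   # non-alpha characters
    # Everything to drop: generic stop words, domain stop words, and known names.
    drop = (ENGLISH_STOP_WORDS | DOC_STOP_WORDS | {n.lower() for n in names}) - KEPT_WORDS
    return " ".join(word for word in text.split() if word not in drop)


example = clean_narrative(
    "Offender Smith: drug/alcohol charges,\nnot weapons related (2 counts).",
    names=["Smith"],
)
```

Passing the list of names separately mirrors the offender X example: once names are treated like stop words, they can no longer pull records together.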
I took the outputs of the steps above and performed tokenization/numericalization, created an 80/20 train/validation split, and then built and trained the language model.
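To make the tokenization/numericalization and split steps concrete, here is a library-free sketch. It is not the Fast AI API and not the project code. The helper names (`build_vocab`, `numericalize`, `split_80_20`), the whitespace tokenizer, and the `<unk>` token at ID 0 are assumptions for illustration.

```python
import random
from collections import Counter


def build_vocab(docs, min_freq=1):
    """Map each token to an integer ID; ID 0 is reserved for unknown tokens."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    itos = ["<unk>"] + [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(itos)}


def numericalize(doc, stoi):
    """Replace each token with its ID, falling back to 0 for unseen tokens."""
    return [stoi.get(tok, 0) for tok in doc.split()]


def split_80_20(items, seed=42):
    """Shuffle with a fixed seed, then cut into 80% train and 20% validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]


docs = ["drug alcohol charges", "weapons charges", "drug possession"]
stoi = build_vocab(docs)
ids = [numericalize(d, stoi) for d in docs]
train, valid = split_80_20(range(10))
```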
This gave me a collection of tokenized vocabulary words mapped to numerical IDs, which were in turn mapped to 200-dimension matrices created by the language model. I took the collection of documents I wanted to cluster and replaced each vocabulary word found within with its matching word matrix.
At this point I knew I needed to aggregate all the word matrices that made up a particular document, so that the resulting structure could be fed into the Scikit-Learn K-Means clustering algorithm. I tried both stacking the matrices and averaging the values in order to achieve this. I found that stacking the matrices resulted in much more accurate and sensible results, and so I settled on this method.
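Here is a rough sketch of the two aggregation options and the clustering call, using NumPy and Scikit-Learn. Several things are assumptions: the embedding matrix is random stand-in data rather than trained weights, and the names `average_doc`, `stack_doc`, and `MAX_LEN` are made up. K-Means needs every row to have the same length, so this sketch zero-pads or truncates each stacked document to a fixed number of tokens; that is only one possible way to do it.

```python
import numpy as np
from sklearn.cluster import KMeans

EMB_DIM = 200  # embedding width mentioned above
MAX_LEN = 50   # hypothetical cap on tokens per document


def average_doc(ids, emb):
    """Mean of the document's word vectors, shape (EMB_DIM,)."""
    return emb[ids].mean(axis=0)


def stack_doc(ids, emb, max_len=MAX_LEN):
    """Word vectors laid end to end, zero-padded or truncated to max_len tokens."""
    out = np.zeros((max_len, emb.shape[1]), dtype=emb.dtype)
    vecs = emb[ids[:max_len]]
    out[: len(vecs)] = vecs
    return out.ravel()


rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, EMB_DIM))  # stand-in for the trained embeddings
docs = [rng.integers(1, 1000, size=int(rng.integers(5, 80))).tolist() for _ in range(30)]

X = np.stack([stack_doc(d, emb) for d in docs])  # shape (30, MAX_LEN * EMB_DIM)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

Swapping `stack_doc` for `average_doc` in the `np.stack` line gives the averaged variant. Averaging discards word order, while stacking keeps it at the cost of much wider rows.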
After clustering the documents I created word clouds and graphs showing the most commonly used words in the clustered documents, and I was able to perform meaningful analysis on the results as well as provide recommendations for improvement. I’m happy to say the results were well received within the DOC. This was the first time machine learning was being used in a visible way, and so I wanted to work overtime to ensure the first foray was a positive experience for the organization. Also, a number of changes to the parole revocation system are being planned as a result of this project, which is very satisfying professionally.
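The per-cluster word counts behind the word clouds and graphs can be sketched in a few lines. The function name `top_words_per_cluster` and the toy documents are made up for illustration. The input is assumed to be the cleaned text of each document plus its cluster label.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(docs, labels, n=3):
    """Return the n most frequent words for each cluster label."""
    counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        counts[label].update(doc.split())
    return {label: [w for w, _ in c.most_common(n)] for label, c in counts.items()}


toy_docs = ["weapon weapon knife", "weapon firearm", "drug drug alcohol", "drug"]
toy_labels = [0, 0, 1, 1]
top = top_words_per_cluster(toy_docs, toy_labels, n=2)
```

The same counts can be handed to a plotting or word-cloud library to produce the visuals.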
As I said before, if anyone has any questions or needs more details, please let me know.
Thank you.