SKLearn/Tabular Data: Is there a way to handle DataFrame elements that contain lists, other than one-hot encoding?

kyle.h · February 25, 2019, 3:54pm

Short version: I have a column in a DataFrame that holds the list of attorneys on each case. One-hot encoding it for the 4000-row data set would result in more than 500 columns, because there are so many distinct attorney names.

Is there a way to use this data in machine learning using sklearn? Should I try a neural net?

More Info:
I am trying to predict how long a civil court case will take, what the outcomes will be, etc. I know that the plaintiffs' attorneys are an important factor in this. In this type of case it is common to have 2 or 3 plaintiffs' attorneys from one law firm and 2 or 3 from another. Some law firms are quite big, with several hundred attorneys, and the law firm itself is not as important as the individual attorney. I could one-hot encode each attorney name, but that would result in tons of columns. What should I do?
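
For concreteness, here is a minimal sketch of the encoding described above (one indicator column per attorney), assuming each cell holds a Python list of names. The toy DataFrame, the column names, and the attorney names are made up for illustration. `MultiLabelBinarizer` with `sparse_output=True` is one way to build this in sklearn; it keeps the result as a sparse matrix rather than hundreds of dense columns, though it does not reduce the number of features.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each row holds the list of attorneys on one case.
df = pd.DataFrame({
    "case_id": [1, 2, 3],
    "attorneys": [
        ["A. Smith", "B. Jones"],
        ["B. Jones"],
        ["C. Lee", "A. Smith", "D. Kim"],
    ],
})

# One indicator column per distinct attorney, stored as a sparse matrix.
mlb = MultiLabelBinarizer(sparse_output=True)
encoded = mlb.fit_transform(df["attorneys"])

print(mlb.classes_)       # distinct attorney names, in column order
print(encoded.shape)      # (number of cases, number of distinct attorneys)
print(encoded.toarray())  # dense view, only sensible for small examples
```

Many sklearn estimators accept a sparse matrix like `encoded` directly, so the column count by itself may be less of a problem than whether 4000 rows are enough to learn from that many rare names.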