Given two models that score the same data set and produce different scores, we'd like to compare the two scored outputs.
For example,
- What % of the top 100 examples overlap?
- What % of the bottom 100 overlap?
- Which errors from model A did model B correct?
The input data could simply be a list of IDs, scores, and labels.
This wouldn't be hard to build, but it would be better to contribute to an existing tool. Is anyone aware of such a thing?
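For reference, here's a minimal sketch of the three comparisons above. It assumes the input is a list of `(id, score, label)` tuples per model and that predictions are thresholded at 0.5; both the tuple layout and the threshold are assumptions for illustration, not a known tool's API.

```python
def _top_ids(scored, k, reverse=True):
    """IDs of the k highest- (or lowest-) scored examples."""
    ranked = sorted(scored, key=lambda r: r[1], reverse=reverse)
    return {id_ for id_, _, _ in ranked[:k]}

def top_k_overlap(scored_a, scored_b, k):
    """% of the top-k IDs (by score) shared between the two models."""
    return 100.0 * len(_top_ids(scored_a, k) & _top_ids(scored_b, k)) / k

def bottom_k_overlap(scored_a, scored_b, k):
    """% of the bottom-k IDs shared between the two models."""
    return 100.0 * len(_top_ids(scored_a, k, reverse=False)
                       & _top_ids(scored_b, k, reverse=False)) / k

def errors_corrected(scored_a, scored_b, threshold=0.5):
    """IDs that model A misclassified but model B got right."""
    def wrong(scored):
        return {id_ for id_, score, label in scored
                if (score >= threshold) != bool(label)}
    return wrong(scored_a) - wrong(scored_b)
```

With real data you'd likely join the two score lists on ID (e.g. with a pandas merge) rather than keep parallel tuple lists, but the set arithmetic stays the same.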