Using public benchmarks for testing?

Sorry if this is a silly question, but let me first give some context:

Recently, I started using a library that implements a technique from a relatively new whitepaper. I was curious how this particular implementation would perform on the academic benchmarks used in the whitepaper, not to compare the library's numbers to the paper's, but to see whether the technique actually delivers the benefits it claims with various models, and to what degree. I asked the author whether he had done any of this testing himself; he said no, but that he'd love to have the results as a reference for future regression testing. Perhaps these results could even be used to compare implementations of techniques across different libraries.

So I found the same benchmark data and splits online, but I've run into a bunch of issues, and now I'm pausing to ask whether this is even a good idea in the first place. If it is, is there a service or tool that lets you test your model outputs against a suite of public benchmarks or datasets? Perhaps even compare your results with other submissions? The site where I found the benchmark does have a "leaderboard," but I'm looking for something that can automate this across multiple benchmarks and make it easy.
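
For context, what I've been hand-rolling per benchmark is roughly the sketch below, using Hugging Face's `datasets` and `evaluate` packages. The dataset name, metric, and `predict` function are just illustrative placeholders, not the actual library or benchmark I'm working with:

```python
# Rough per-benchmark loop: load a public benchmark split, run the model, score it.
# "glue"/"sst2", accuracy, and predict() are placeholders for illustration only.
from datasets import load_dataset
import evaluate

def predict(text: str) -> int:
    """Stand-in for the library/model under test; swap in real inference here."""
    return 0

dataset = load_dataset("glue", "sst2", split="validation")  # public data + split
metric = evaluate.load("accuracy")

preds = [predict(example["sentence"]) for example in dataset]
print(metric.compute(predictions=preds, references=dataset["label"]))
```

Doing this once for one benchmark is fine; repeating it across several benchmarks and keeping the results comparable is where it starts to fall apart, hence the question.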

It also got me thinking about how fine-tuned models built on foundation models like GPT-3 probably never get tested against benchmarks before being put into production, which worries me a bit. These foundation models have made it much easier for people to build their own models, but I get the sense that most of them make it into production without any real evaluation, simply because sourcing and setting up that testing is so difficult.