[ML Infrastructure] Scaling ML Models

I have a background in building large-scale systems (think >1 exabyte of data; >1M transactions per second) and recently pivoted into ML infra. Here are some takeaways on running ML models at scale, drawn from a few recent outages:

  1. Predictions and other ML data often don't change at all, or change very little between runs
  2. If you don't have enough training jobs to keep your infrastructure busy, it will sit idle - until you're big enough, look for cloud providers that offer pay-per-request pricing
  3. Storing results in fast cloud data stores (DynamoDB, Spanner) gets expensive as the system scales. It's worth investing in a good object store + cache early on.
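Points 1 and 3 combine naturally: if predictions rarely change, a read-through cache in front of a cheap object store can absorb most lookups. Here's a minimal sketch of that pattern in Python. The `ObjectStore` class is a hypothetical stand-in (a dict instead of S3/GCS), and the key scheme is just one illustrative choice - content-addressing the feature vector so identical inputs hit the same entry.

```python
import hashlib
import json
from functools import lru_cache

class ObjectStore:
    """Hypothetical stand-in for S3/GCS: same get/put shape, dict-backed."""
    def __init__(self):
        self._blobs = {}
        self.reads = 0  # track backend reads so we can see the cache working

    def put(self, key, value):
        self._blobs[key] = value

    def get(self, key):
        self.reads += 1
        return self._blobs.get(key)

def feature_key(features):
    # Content-addressed key: identical feature dicts map to the same entry.
    payload = json.dumps(features, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

class PredictionCache:
    """Read-through cache: serve repeated lookups from memory,
    fall back to the object store only on a miss."""
    def __init__(self, store, maxsize=1024):
        self.store = store
        self._fetch = lru_cache(maxsize=maxsize)(store.get)

    def lookup(self, features):
        return self._fetch(feature_key(features))

store = ObjectStore()
cache = PredictionCache(store)

features = {"user_id": 42, "country": "DE"}
store.put(feature_key(features), {"score": 0.93})

first = cache.lookup(features)   # hits the object store once
second = cache.lookup(features)  # served from memory, no backend read
```

In production you'd add TTLs or explicit invalidation for the fraction of predictions that do change, but the hot path stays the same: memory first, object store second, no per-request charge from a fast cloud data store.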

What design patterns are you all using to scale ML systems in prod?