[ML Infrastructure] Scaling ML Models

I have a background in building large-scale systems (think > 1 Exabyte; > 1M tps) and recently pivoted into ML infra. Here are some takeaways for running ML models at scale from a few recent outages:

  1. Predictions/ML data often don't change, or change by very little
  2. If you don’t have enough training jobs, your infrastructure will sit idle. Look for cloud providers that offer pay-per-request pricing until you are big enough
  3. Storing results in fast cloud data stores (DynamoDB, Spanner) is expensive as the system scales. Worth investing in a good object store + cache early on.
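
To make points 1 and 3 concrete, here is a rough sketch of a read-through cache in front of an object store that also skips writes when a prediction hasn't changed. Everything in it is an assumption for illustration: `InMemoryObjectStore` is a stand-in for a real object store client (S3, GCS, etc.), and `PredictionStore`, `cache_size`, and the JSON encoding are hypothetical names and choices, not a specific production setup.

```python
import json
from collections import OrderedDict


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store client (e.g. S3 or GCS)."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0   # counts backend reads, to show what the cache saves
        self.writes = 0  # counts backend writes

    def get(self, key):
        self.reads += 1
        return self._blobs.get(key)

    def put(self, key, blob):
        self.writes += 1
        self._blobs[key] = blob


class PredictionStore:
    """Read-through LRU cache in front of an object store (illustrative only)."""

    def __init__(self, object_store, cache_size=1024):
        self._store = object_store
        self._cache = OrderedDict()  # key -> serialized prediction bytes
        self._cache_size = cache_size

    def _remember(self, key, blob):
        # Insert or refresh the entry, then evict least recently used entries.
        self._cache[key] = blob
        self._cache.move_to_end(key)
        while len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)

    def put(self, key, prediction):
        """Store a prediction; return False if the write was skipped."""
        blob = json.dumps(prediction, sort_keys=True).encode()
        # Only the cached copy is compared, so an unchanged value that has
        # been evicted from the cache is still rewritten to the backend.
        if self._cache.get(key) == blob:
            return False
        self._store.put(key, blob)
        self._remember(key, blob)
        return True

    def get(self, key):
        """Return the prediction for key, or None if it does not exist."""
        if key in self._cache:
            self._cache.move_to_end(key)
            return json.loads(self._cache[key])
        blob = self._store.get(key)  # cache miss: fall through to the backend
        if blob is None:
            return None
        self._remember(key, blob)
        return json.loads(blob)
```

In a real system the cache would more likely be something shared like Redis or Memcached rather than a per-process dict, and you would need a TTL or invalidation story for predictions that do change. The sketch only shows the shape: repeated `put` calls with an identical prediction cost one backend write, and hot keys are served without touching the object store.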

What design patterns are you all using to scale ML systems in prod?