I have a background in building large-scale systems (think > 1 exabyte of data; > 1M TPS) and recently pivoted into ML infra. Here are some takeaways from a few recent outages about running ML models at scale:
- Predictions/ML data often don't change at all, or change only slightly between runs
- If you don't have enough training jobs to keep your infrastructure busy, it will sit idle - look for cloud providers that offer pay-per-request pricing until you're big enough to justify reserved capacity
- Storing results in fast cloud data stores (DynamoDB, Spanner) is expensive as the system scales. Worth investing in a good object store + cache early on.
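The first and third points combine into a read-through cache in front of an object store: since predictions rarely change, you can hash the input features into a stable key, serve repeats from a fast cache, and fall back to cheap blob storage before ever invoking the model. A minimal sketch (the `ObjectStore` class is an in-memory stand-in for S3/GCS, and the dict cache stands in for Redis/Memcached - names and structure are illustrative, not from any particular system):

```python
import hashlib
import json

class ObjectStore:
    """Stand-in for a blob store like S3: durable, cheap, slower than a cache."""
    def __init__(self):
        self._blobs = {}
    def get(self, key):
        return self._blobs.get(key)
    def put(self, key, value):
        self._blobs[key] = value

class PredictionService:
    def __init__(self, model_fn, store):
        self.model_fn = model_fn
        self.store = store
        self.cache = {}          # stand-in for Redis/Memcached
        self.model_calls = 0     # counts how often we actually run the model

    def _key(self, features):
        # Stable key: hash of the canonicalized feature payload.
        payload = json.dumps(features, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def predict(self, features):
        key = self._key(features)
        if key in self.cache:            # hot path: in-memory cache
            return self.cache[key]
        blob = self.store.get(key)       # warm path: object store
        if blob is not None:
            self.cache[key] = blob
            return blob
        result = self.model_fn(features) # cold path: run the model
        self.model_calls += 1
        self.store.put(key, result)
        self.cache[key] = result
        return result

svc = PredictionService(lambda f: {"score": len(f["text"]) % 10}, ObjectStore())
a = svc.predict({"text": "hello world"})
b = svc.predict({"text": "hello world"})  # repeat request: served from cache
print(a == b, svc.model_calls)            # → True 1
```

If predictions can go stale, a TTL on the cache entries (and a version prefix in the key when the model is retrained) keeps this pattern correct without giving up the cost savings.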
What design patterns are you all using to scale ML systems in prod?