tl;dr: $1.17 per hour for a V100 GPU, & easier to use than Paperspace, which charges double
Hi, I was inspired by your (the community's) achievements in the DAWNBench competition, particularly the use of spot instances to lower costs. But even with the collection of scripts made available, running them still seemed really tough from a DevOps perspective. To make deep learning more accessible to aspiring researchers, I quit my job and started building a product that makes getting started easier and cheaper. Salamander adds just 26% to AWS spot prices (& reduces costs further with other optimisations). So with joy, I'd love to introduce Salamander to you all!
Here’s a 1-minute demo:
And link to the website:
When designing a server, just select “fast.ai” under “Software” and all course materials + the library will automatically be installed into a new conda environment.
I don’t have enough money to give away free compute credits, but have done everything else within my power to lower costs. Your science and daily work are wonderful things, it’s through action that our love for humanity becomes apparent & I hope Salamander can help you all love humanity more.
& I’ll ofc be watching this thread to help, listen & talk.
Hi Ashton, do you still experience the slow initialization of the GPU on spot instances for the first run? Do you have the option of specifying a custom AMI?
@penguinshin I'm not familiar with the slow initialization issue. Sometimes importing PyTorch and calling .cuda() for the first time can be slow, perhaps related? Certainly less than 5 minutes at worst.
Regarding AMIs, you cannot select a custom AMI (not built yet, but soon you'll be able to initialise from custom Docker images). Right now, you can select your desired software when designing a server; there are 9 valid combinations and I created an AMI for each one:
There are 3 reasons I chose not to support custom AMIs:
makes it harder/impossible to build advanced features like one-click launch for Jupyter Lab & notifications when the GPU stays inactive for over an hour (a rough sketch of such an idle check follows this list)
Salamander only supports a subset of blockDeviceMappings, and therefore some AMIs will never work
Docker containers don't have these limitations, but fulfil the same needs
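To give a flavour of the idle-GPU check mentioned in the first point, here's a minimal sketch (an illustration only, not the production implementation; it assumes `nvidia-smi` is on the PATH and polls utilisation once a minute):

```python
import subprocess
import time

POLL_INTERVAL_SECS = 60
IDLE_THRESHOLD_SECS = 60 * 60  # notify after an hour of inactivity

def gpu_utilisation() -> int:
    """Current GPU utilisation (%) as reported by nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().split()[0])

idle_since = None
while True:
    if gpu_utilisation() == 0:
        idle_since = idle_since or time.time()
        if time.time() - idle_since > IDLE_THRESHOLD_SECS:
            print("GPU idle for over an hour -- send a notification here")
            idle_since = time.time()  # reset so we only notify once per hour
    else:
        idle_since = None
    time.sleep(POLL_INTERVAL_SECS)
```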
Makes sense. I guess the only limitation of docker is the installation time of the packages, but this is really only an issue for big packages like torch and cudnn, which you’ve already installed. Looks great!
It's worth noting snapshots also have latency problems, but not nearly as severe. When starting new servers, your data gets gradually copied across in blocks. If you access a block that hasn't been copied yet, it'll load across the network just like EFS, but subsequent requests for that data will be much faster. I don't think it's a problem for most workloads.
Salamander only keeps snapshots when instances are powered off to lower costs; also because volumes can't be attached to servers in different availability zones and we always want to pick the cheapest zone at startup.
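For anyone curious, a rough sketch of what snapshot-on-power-off looks like with boto3 (this is just an illustration with placeholder IDs, not our actual code):

```python
import boto3

# Volume ID and region are placeholders for illustration only.
REGION = "us-east-1"
VOLUME_ID = "vol-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name=REGION)

# Snapshot the volume at power-off, wait for it to finish, then delete the
# volume -- we then only pay for snapshot storage, and the snapshot can be
# restored into whichever availability zone is cheapest next time.
snapshot = ec2.create_snapshot(
    VolumeId=VOLUME_ID,
    Description="snapshot taken when the instance powers off",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
ec2.delete_volume(VolumeId=VOLUME_ID)
```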
Hi there Ashton, thanks for this new website!
That seems awesome; unfortunately I might not understand how awesome it is: how does adding "just 26% to AWS spot prices" make it a better option than directly using AWS?
Sorry, never used AWS, did all the course on Paperspace…
@ouflepapi Great question! AWS's on-demand price for a V100 is $3.06, and their spot price in the best availability zone is currently $0.93; Salamander adds 26% to that, so $1.17. You can use spot instances directly, but it's much harder than the normal on-demand stuff, and if you make any mistakes there's a risk of losing your work.
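To make the arithmetic concrete (spot prices change, so treat these numbers as a snapshot of the prices quoted above):

```python
spot_price = 0.93   # cheapest-zone V100 spot price quoted above, $/hour
markup = 0.26       # Salamander's 26% on top

print(round(spot_price * (1 + markup), 2))  # 1.17
```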
If you'd like to try using spot instances directly, adapting the scripts fast.ai used to compete in the DAWNBench competition would be the best place to start (https://github.com/fastai/imagenet-fast/tree/master/aws); however, I encountered the following issues when trying to do so myself:
You need to run this stuff manually from the command line, without a nice web interface
Mounting the primary volume after startup and using pivot_root has a small chance of failing when /sbin/init runs, which prevents the instance from starting properly and can be hard to diagnose
Starting instances takes about 2 minutes longer because setting up the volume requires a reboot
Uses one particular availability zone instead of picking the cheapest one (a sketch of querying per-zone spot prices follows this list)
Keeps redundant volumes around when instances are turned off, leading to increased storage costs (I think… but this might've been another solution I tried)
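For reference, here's roughly how picking the cheapest availability zone yourself looks with boto3 (a sketch; the region and the Linux p3.2xlarge instance type are just example parameters, not what the scripts above use):

```python
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Fetch the most recent spot price per availability zone and keep the cheapest.
history = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc),
)["SpotPriceHistory"]

latest_per_az = {}
for record in history:
    az = record["AvailabilityZone"]
    if az not in latest_per_az or record["Timestamp"] > latest_per_az[az]["Timestamp"]:
        latest_per_az[az] = record

cheapest = min(latest_per_az.values(), key=lambda r: float(r["SpotPrice"]))
print(cheapest["AvailabilityZone"], cheapest["SpotPrice"])
```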
Hi Ashton, I have given it a spin. I think this is by far the most intuitive way of getting started (putting aside Google Colab and its restrictions, which are such a pain if you need to use it for more serious work). Great work!
Ditto. I used Terraform to automate and provision an AWS EC2 spot instance and followed Slav's post here on how to swap the root volume. Yeah, the solution works, but it isn't perfect, as you mentioned there. But with Salamander, all these pain points are taken away.
I recommend your list all the time. I would add a pricing estimate if possible. Currently, Salamander at the bottom of the list doesn't seem attractive, but clearly there are strong benefits for people to use it.
Glad to know that you find it useful.
I was thinking about adding more metrics like pricing etc., but I refrained from it since I was not exactly sure which two metrics are the most important.
When I started this project, I found no resource that aggregated all the providers out there, and now we have one. I am currently thinking about useful metrics to add, like price per hour.
It might also be slightly tricky, since the quoted price should be for one particular flagship GPU instance so that the comparison makes sense (even though not all users are interested in that instance, they can use it as a benchmark). I wanted to get community input here: what metrics do you think would be most useful to users? Thanks for your time.
@binga Could we measure the time to reach 93% accuracy on CIFAR-10 (or equivalent) for each GPU, figure out which GPU is most cost-effective for each provider, and use that value?
The only annoying thing that could throw off results is per-provider variation due to e.g. available RAM & driver configuration.
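As a rough illustration of that metric (the providers and numbers below are hypothetical, just to show the calculation):

```python
# Hypothetical benchmark results: ($/hour, hours to reach 93% on CIFAR-10)
providers = {
    "provider_a": (1.17, 0.50),
    "provider_b": (2.30, 0.40),
    "provider_c": (0.90, 0.90),
}

# "Cost-effectiveness" here is simply dollars spent to reach the target accuracy.
for name, (price_per_hour, hours_to_93) in providers.items():
    print(f"{name}: ${price_per_hour * hours_to_93:.2f} to reach 93%")
```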
Not sure if anyone has posted on this, but you can fully “initialize” your EBS volumes which have been restored from AMIs to avoid the slow initial performance (replace /dev/xvdf with your volume name):
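For example, reading every block of the device once forces EBS to pull the whole volume down (the usual approach is `sudo dd if=/dev/xvdf of=/dev/null bs=1M`); here's an equivalent sketch in Python, assuming the volume is attached as /dev/xvdf and you run it as root:

```python
# Sketch: sequentially read every block so EBS fetches the whole volume up
# front, equivalent to `sudo dd if=/dev/xvdf of=/dev/null bs=1M`.
# Run as root; replace /dev/xvdf with whatever device name your volume uses.
CHUNK = 1024 * 1024  # 1 MiB reads

with open("/dev/xvdf", "rb", buffering=0) as dev:
    while dev.read(CHUNK):
        pass
print("volume fully initialised")
```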
Time to 93% on CIFAR could be useful, but the variance in infrastructure between providers makes it slightly harder to draw an apples-to-apples comparison, as you rightly mentioned in your second point. Moving this conversation to the repo.