Lately, I have been paying around $100 per month for my AWS account. I think there are two reasons:
- I code, test, and experiment in my AWS Jupyter notebook (online), which increases my usage time.
- In addition, I train and tune my hyperparameters on the actual training data rather than the sample data, and each epoch takes around 10 minutes to execute.
Recently I was advised to test my network code with the sample data, and I hope this will reduce my billed time. I am now planning to develop my notebook locally and, only once I am satisfied with my code, upload it to my AWS account and run it against the actual data. Is my new approach the same as yours? Can you please share how you train, validate, and test on data online?
This is indeed the correct way. The only reason we use a paid cloud service is for the GPU.
~80% of your time will be spent understanding concepts, writing code, checking correctness, etc., for which a laptop or PC is enough, and in fact quicker because there is no lag. Using a small sample dataset will give you a faster turnaround time.
Burning a cloud VM all this time is a waste of money.
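If you do not already have a sample dataset, here is a minimal sketch of one way to carve a small, reproducible subset out of the full data for local runs. The function name sample_indices and the 5% default are my own illustrative choices, not something from the course material, and it assumes your data can be addressed by row index.

```python
import random


def sample_indices(n_total, fraction=0.05, seed=0):
    """Return a sorted, reproducible subset of row indices.

    Hypothetical helper: n_total is the number of rows in the full
    dataset, fraction is the share of rows to keep for local testing.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one row, but never more rows than exist.
    k = min(n_total, max(1, round(n_total * fraction)))
    rng = random.Random(seed)  # fixed seed so every run sees the same subset
    return sorted(rng.sample(range(n_total), k))


if __name__ == "__main__":
    idx = sample_indices(10_000, fraction=0.05)
    print(len(idx), "rows selected for local testing")
```

You would then index your arrays or file list with these indices on your laptop, and switch back to the full data only for the final cloud run.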
Thank you for your reply. Actually, I have observed that most of the time is spent training on the cloud VM. As I stated, each epoch takes around 10 minutes to train, and if I choose 5 epochs, it usually takes 30 minutes with a 0.01 learning rate and around 12 minutes for much slower learning rates. Overall, in my case it takes close to an hour to train on the data. Is it the same for you? I am asking because in the videos and the sample notebooks, the epochs take less than a second. Am I missing something here? I suspect my VM is running on a CPU rather than a GPU. Is there a way to check whether my VM is running on a GPU?
Thank you again.
If you are replicating the class datasets/lectures, then yes, 10 minutes is excessive and your training is likely running on the CPU. But if your model is complex, you are training over large data, and you are doing a full network training (i.e. not using a pre-trained network as a base), then it certainly could take 10 minutes or more.
- Check whether the GPU is present in your AWS machine: run the nvidia-smi command.
- Is the version of PyTorch or Theano or whatever you are using compiled with CUDA, and is it enabled?
It may be best to troubleshoot your environment first with one of the class models/lecture sheets.
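The two checks above can be sketched as a small script you run inside the notebook or on the VM. This is only a rough sketch assuming PyTorch; the helper name gpu_report is made up for illustration. It looks for the nvidia-smi executable, tries to run it, and then asks PyTorch (if installed) whether it can see a CUDA device via torch.cuda.is_available().

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic facts about GPU visibility on this machine.

    Hypothetical helper; returns a dict of booleans.
    """
    report = {
        "nvidia_smi_found": False,
        "nvidia_smi_ok": False,
        "torch_installed": False,
        "torch_cuda_available": False,
    }

    # Check 1: is the NVIDIA driver tool on the PATH, and does it run cleanly?
    exe = shutil.which("nvidia-smi")
    if exe:
        report["nvidia_smi_found"] = True
        try:
            result = subprocess.run(
                [exe], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.SubprocessError):
            pass

    # Check 2: can the framework itself use CUDA? (PyTorch assumed here.)
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If nvidia-smi is missing or fails, the instance probably has no GPU or no driver. If nvidia-smi works but torch_cuda_available is False, the installed PyTorch build is likely CPU-only or cannot find the CUDA libraries. Even when both are fine, the model and tensors still have to be moved to the GPU in your code, so it is worth confirming that too.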