Platform: Salamander ✅

princejain1101 · October 6, 2019, 7:36am

I am processing a dataframe for training, which takes more than 4-5 hours. But my Salamander machine gets automatically disconnected and stopped after ~3-4 hours. That results in wasted processing and thus wasted money. Please help me resolve it.
I have the Amalie Dietrich server machine:
Accelerated Computing
$0.49 per hour
[K80 GPU]
4x vCPU
61GB RAM

jeremy · October 6, 2019, 5:35pm

Unfortunately we can’t control when AWS stops a machine - but generally on weekends capacity is higher so this shouldn’t happen so much. You could also try changing to a different instance type - especially if your preprocessing doesn’t need GPU, you can switch to a cheaper CPU type just for the preprocessing step.

ashtonsix · October 6, 2019, 6:31pm

Or you could split the preprocessing into multiple steps and save after every step. If your work is interrupted, you can then load the intermediate result and resume.
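
A minimal sketch of that save-and-resume idea, assuming each preprocessing step is a plain function that takes the previous result and returns a new one. The names `run_step`, `run_pipeline`, and the `checkpoints` directory are hypothetical, chosen only for illustration:

```python
import pickle
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # hypothetical location; use a disk that survives restarts


def run_step(name, func, data, checkpoint_dir=CHECKPOINT_DIR):
    """Run func(data) once; on later runs, load the saved result instead."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f"{name}.pkl"
    if path.exists():
        # A previous run already finished this step, so skip the work.
        with path.open("rb") as f:
            return pickle.load(f)
    result = func(data)
    # Write to a temporary file first, then rename, so an interruption
    # in the middle of the write does not leave a corrupt checkpoint.
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("wb") as f:
        pickle.dump(result, f)
    tmp.replace(path)
    return result


def run_pipeline(data, steps, checkpoint_dir=CHECKPOINT_DIR):
    """Apply (name, function) pairs in order, checkpointing after each one."""
    for name, func in steps:
        data = run_step(name, func, data, checkpoint_dir)
    return data
```

If the machine stops partway through, rerunning `run_pipeline` with the same step names loads the finished steps from disk and only recomputes the rest. Deleting a checkpoint file forces that step to run again.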

chris · October 6, 2019, 7:38pm

I’ve not tried all the other options, but Salamander seemed incredibly easy to get started with. Good work! Ease of getting started might be a double-edged sword, though, because I almost quit and went elsewhere before noticing this on the help page:

Starts Jupyter Notebook (inside a tmux session called “jupyter”)

Oh! So that’s where it’s running from! You might want to make that more obvious because surely you need to know that if you’re going to install other packages with pip.

jeremy · October 7, 2019, 3:39am

I don’t think you really need to know that, do you? You can just use ! commands inside your notebook, or use the terminal that’s inside jupyter notebook, etc. There’s no need to connect to the tmux session that I can see.
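
As a rough sketch of the same idea from plain Python in a notebook cell: the helper names `pip_command` and `pip_install` are made up for illustration, and going through `sys.executable` is an assumption about how to make sure the package lands in the environment the running kernel uses.

```python
import subprocess
import sys


def pip_command(*packages):
    """Build a pip install command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", *packages]


def pip_install(*packages):
    """Install packages into the current kernel's environment."""
    # check=True raises CalledProcessError if pip exits with an error.
    subprocess.run(pip_command(*packages), check=True)
```

Calling `pip_install("some-package")` from a cell (the package name is a placeholder) should do much the same as a `!pip install` line, without connecting to the tmux session.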

chris · October 7, 2019, 12:34pm

I had not thought of that! Thank you.

Maybe that’s the information you need to make more obvious?

banksiaboy · October 7, 2019, 11:25pm

I just created my account, purchased $20 credit, and tried to start a server.
I get: We tried starting your server but failed because: "no available servers"
I don’t see any other posts in this thread matching. I’m assuming this is unusual. Can somebody check please…
Cheers,
–Peter G

banksiaboy · October 7, 2019, 11:27pm

Must have been a rusty disk. Instance has started - no problem. Propagation delay of some sort?

nchukaobah · October 10, 2019, 10:45am

Thanks Jeremy. Love Jupyter

dvinubius · October 11, 2019, 9:59pm

I’ve started the DL1 course and I’m using a basic Salamander server:
[K80 GPU] 4x vCPU 61GB RAM

I reckon that this server should run pretty well - at least as fast as the computer used during Lesson 1.

However, every time I train the models from the original lesson1 notebook, it takes about 4 times longer than it should. For instance, on resnet34 it takes a minute or so on each epoch.

Have I missed something with regards to server configuration?

jeremy · October 12, 2019, 12:33am

K80 is an old GPU - not as good as what I used in lesson 1. Use a g3s instance instead.

dvinubius · October 12, 2019, 6:16pm

I’m sorry, but now I’m confused… Trying to understand the gear specs and what actually matters.

In lesson 1 you’re saying that the card you use most of the time has 11G GPU memory. I assumed the machine there was also a 11G GPU machine.

The g3s specs say 8G GPU memory.

The Salamander Server (K80) says 12GB integrated RAM. At first I thought this must refer to normal RAM, in order for K80 to be so slow, but further digging into GDDR5 taught me that this is actually graphics card memory. I would conclude, this is the GPU RAM. The memory you taught us should not be less than 8G, in order to avoid frustration while experimenting with our model training.

Is my conclusion correct? If so, I would also conclude that we’re not interested just in the GPU RAM, but also the TFLOPS numbers.

Thank you.

jeremy · October 12, 2019, 6:33pm

There’s a lot of topics on the forum about that - so have a search around. Amount of RAM is memory, not speed.
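
To see which GPU a server has and how much memory it reports, something like the following sketch may help. It assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH; the helper names `parse_gpu_line` and `list_gpus` are hypothetical. Memory size tells you how large a model or batch fits, not how fast an epoch runs.

```python
import shutil
import subprocess


def parse_gpu_line(line):
    """Parse one 'name, memory' CSV line from nvidia-smi into (name, MiB)."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    # The memory field looks like '11441 MiB'; keep only the number.
    return name, int(memory.split()[0])


def list_gpus():
    """Return [(name, total_memory_mib), ...], or [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

For example, `parse_gpu_line("Tesla K80, 11441 MiB")` gives `("Tesla K80", 11441)`; that input line is an illustrative sample, not a measurement from a Salamander server.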

radoshi · October 14, 2019, 6:04pm

I had the same problem. I created a server, played around with a notebook copy, upgraded fastai and pytorch libs using conda, and shut the server down. Upon restarting the server, I encountered the same problem as Rajesh. I eventually just deleted the server and created a new one and that seems fine.

jeremy · October 15, 2019, 3:29am

Yes, you have to be careful when updating things with conda, since it can update your jupyter notebook dependencies in an incompatible way. When you update with conda, make sure that it’s only fastai and pytorch that are being modified (it’ll warn you before it does anything).

clio23 · October 16, 2019, 8:41am

Since I cancelled my account today, will I automatically get back the credit I paid for today? There is 20 dollars of credit left.

robertm · October 17, 2019, 1:21pm

@jeremy One topic that is perhaps obvious to others but not me is data privacy when uploading to Salamander. Is it safe/secure to upload non-anonymized medical images for training? Thanks

jeremy · October 18, 2019, 5:55am

@robertm it’s a standard AWS instance, but with one exception, which is that I have the ability to log into it if needed to debug a problem. Also, note that most of the templates run a jupyter server by default (although it requires login).

The Salamander code hasn’t been thru any official security audits.

So it really depends on the details of the data, and your jurisdiction.

robertm · October 18, 2019, 11:29am

Thanks, it seems best to at least anonymize the images. However, in cases where patient meta-data is itself used for training, I will need to learn to train locally on my encrypted machine. Thanks again.

JeroenH · October 24, 2019, 11:15am

Hello,

This is my first post on the forum, and I want to start by expressing my gratitude for the free courses and this awesome forum!

I’m running into problems with Salamander. It’s basically the same problem that banksiaboy describes above.

After starting the server, I get an error message:

We tried starting your server but failed because: “no available servers”

I tried restarting the server, and I tried logging out and back in. Over the last few weeks it was running without any problems. Any advice on this topic would be highly appreciated.

Best regards,
Jeroen