Keep SSH connection Alive

xiaokunxu · May 4, 2017, 6:17pm

Using the default aws-ssh alias, which translates to ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instanceIp, my connection to the EC2 instance is consistently dropped after a short idle period, say 5 minutes of no activity in the terminal while I’m working with the notebook in the browser. The error message reads: packet_write_wait: Connection to xx.xx.xx.xx port 22: Broken pipe

I’ve tried adding the -o ServerAliveInterval=3600 argument to the aws-ssh alias and setting it in my local ssh config file (see below), but the symptom persists. Has anyone else experienced the same issue? Maybe we could fix this on the EC2 instance by configuring the sshd service? Thanks!

I’m using a Mac, with zsh 5.3.1 in iTerm2, with ssh version: OpenSSH_7.4p1, LibreSSL 2.5.0

> cat ~/.ssh/config
Host *
    ServerAliveInterval 3600
    TCPKeepAlive yes
    ServerAliveCountMax 2

jamesflynn · July 25, 2017, 3:06pm

I am having the same problem. I am in Vietnam and have my P2 instance on US-west… strange thing is if I terminate my instance and run the setup script again it works the first time, but then hangs and I can not get it to work again. Is it a latency issue?

anshbansal · July 25, 2017, 3:32pm

There is a command line utility called mosh. If you use mosh it changes the protocol used (don’t remember exactly what) and makes sure that if your internet breaks your connection gets restored. I have used it earlier but wasn’t able to get it to work with the ami. If some linux guru can help this can be used to solve the broken pipe problem.

rachel · July 25, 2017, 5:14pm

Update your config to use a much shorter ServerAliveInterval (try 180 or 240). This is how often, in seconds, your computer sends a keepalive message to the remote host when the connection is idle, and I think 3600 is not frequent enough. Also, try making ServerAliveCountMax higher, because ssh disconnects from the server once that many consecutive keepalive messages have gone unanswered.
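
As a rough sketch of the suggestion above, the client config might look like this. The interval of 180 is one of the values suggested; the ServerAliveCountMax of 5 is only an assumed illustrative value, and neither is a tested fix for this particular instance.

```text
# ~/.ssh/config (client side)
Host *
    # Send a keepalive through the encrypted channel after 180 idle seconds
    ServerAliveInterval 180
    # Assumed value: give up after 5 consecutive unanswered keepalives
    ServerAliveCountMax 5
```

With these values the client would drop an unresponsive connection after roughly 180 × 5 = 900 seconds, while an idle but healthy connection sees traffic every 3 minutes.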

anshbansal · July 25, 2017, 6:03pm

Is the connection closed from the server side due to inactivity? Is that why making the server alive interval shorter helps?

xiaokunxu · July 26, 2017, 5:28am

@anshbansal I wasn’t sure whether the root cause was on the server or the client side until I saw this: https://unix.stackexchange.com/questions/3026/what-options-serveraliveinterval-and-clientaliveinterval-in-sshd-config-exac, particularly the answer from @kenorb.

Basically, on the server, check /etc/ssh/sshd_config. In mine the lines ClientAliveInterval=0 and ClientAliveCountMax=3 are commented out. According to man sshd_config, those are also the defaults: with ClientAliveInterval at 0 the server sends no keepalive messages to the client at all. The man page’s example of 15 seconds and 3 counts would disconnect an unresponsive client after about 45 seconds.

On the client side, check /etc/ssh/ssh_config. In mine the ServerAliveInterval and ServerAliveCountMax variables are not set, hence our ssh client won’t send a heartbeat signal at all. The two variables can be set either in /etc/ssh/ssh_config system-wide or in ~/.ssh/config at user level.
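
For the server side, a minimal sketch of what enabling keepalives in sshd_config could look like. The values 60 and 3 are assumptions chosen for illustration, not settings taken from the AMI, and sshd has to be restarted or reloaded before they take effect.

```text
# /etc/ssh/sshd_config (server side)
# Assumed value: ask the client for a response after 60 seconds without data
ClientAliveInterval 60
# Disconnect after 3 consecutive unanswered requests (about 180 seconds)
ClientAliveCountMax 3
```

Either side sending keepalives is usually enough to keep an idle session from looking dead to intermediate routers, so this is an alternative to the client settings rather than a requirement.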

jamesflynn · July 26, 2017, 6:35am

Thanks everyone, I’ve tried all those - mosh, screen, ssh config settings, tried it from the coffee shop, the gym, in a house, with a mouse, on a train, in the rain… I can’t SSH in any more, I get a timeout every time. I’ve gone through the AWS troubleshooting for ssh timeouts. I will see if there’s another way to access and change the server sshd_config…

xiaokunxu · July 26, 2017, 7:04am

@james Is there a possibility that the error is triggered when you run aws-alias.sh with an earlier version of the code?

There used to be an unnecessary “export instanceId=i-9aa9c282” at the bottom of the code (see issue https://github.com/fastai/courses/commit/b9173ec24a6fc37cda3417abb9d6524e739a8437#diff-44b1a77bfa3ba801bd867fdc2c6771d6), which could yield the wrong IP.

This bug was fixed more than a month ago. Try git pull to get the fresh code and test again.

jamesflynn · July 26, 2017, 3:36pm

I started again from scratch, deleted and cloned the repo, went through the aws reset instructions… managed to ssh into the instance but as I was typing nvidia-smi it froze up then after about a minute my pipe totally broke. Damn pipe!

packet_write_wait: Connection to 54.186.243.43 port 22: Broken pipe

see!

jamesflynn · July 26, 2017, 3:38pm

Weirdly, after this happens once, any future ssh attempt times out.

jamesflynn · July 28, 2017, 9:28am

I know this isn’t really a deep learning problem (I’m looking forward to getting into those) but does anyone have any ideas how to work through the course from Vietnam?

I’m trying to introduce some exciting new technologies to developers at a bank here to see how we can serve the 60M unbanked in Vietnam, so it’s a worthy cause! (Depending on your opinions about market capitalism) I promise I’m not using it to recognize people for selective targeting of gym membership offers (that’s next).

It would be a real bummer if this were just another example of compounded exclusivity…

Dubious Solutions I Have Thought Of:
Set up a machine in California that I can Teamviewer into…
Run the course on lesser AWS hardware in Singapore (does this work?)
Build a box myself?

dorab · July 28, 2017, 4:32pm

I have a similar (but not the same) problem.
I can ssh into the instance. But then, after a very short (2mins?) inactivity, the connection hangs.
If I log out and then ssh back in, the connection stays up pretty much all the time. I now do this as a matter-of-course. But it’s a pain. Wonder whether anyone else has experienced the same behavior.

When I first set up the instance, I had to reboot the instance to get it to be able to ssh in consistently.
Have you tried a reboot of the instance?

Good luck with the gym membership offers

anshbansal · July 30, 2017, 8:45am

Maybe ask at https://unix.stackexchange.com/. I guess there would be people there with more expertise on this kind of stuff.

Or maybe consider https://aws.amazon.com/premiumsupport/compare-plans/. I have not used their support for anything, but maybe the free one would work? Otherwise, spending $29 on one month of technical support from AWS should be enough to get this fixed.

jamesflynn · August 1, 2017, 7:03am

Thanks, that worked. I upgraded to the paid support plan and after posing the question I got the following message back from AWS support:

I have reached Customer Service team and can confirm that your instances in us-west-2 region was not reachable due to the fact your account was closed from the billing console on 09/26/2014 and it was then reinstated on 05/01/2017. Since then the region was isolated due to which the instance in that region was unreachable.

My apologies on behalf of AWS for any inconvenience caused due to this issue, I worked with the team and had your account restrictions removed on priority.

It is now working! Hooray! spent hours on that… thanks for the tip anshbansal!!