Re: Hour long timeout to ssh/telnet/ftp to down host?

Rob Landley (landley@webofficenow.com)
Tue, 12 Jun 2001 17:47:35 -0400


>You can tune things by setting the tcp-timeout probably..I don't
>know exactly where to set this..

Aha, found it. /proc/sys/net/ipv4/tcp_syn_retries

I am a victim of the exponential retry falloff, it would seem. syn_retries
of 1 takes a few seconds, 3 takes less than half a minute, and 5 takes
several minutes. The default value of 10 is what's giving me the problem
(something like 20 minutes to time out, according to my earlier tests.)

Then the fact that ssh then re-attempts the connection four times before
actually failing is where I got my hour and change timeout. ("ssh -v -v -v"
comes in handy...)

Fun.

Can we change the default value for this to something more sane, like 5?
Exponential falloff is not good when your order of magnitude hits double
digits.

> You probably don't want all tcp to time out at 15 seconds anyway, so

Just connection initiation. (If their ip stack hasn't replied to me by then,
I doubt it's going to.)

> I'd suggest either using non-blocking connect (if you have the code
> that does the connect), or just set a timer (or use sigalarm) when you
> start the attempt, and fail the attempt if the timer or alarm signal
> goes off.

Except I'm using off-the-shelf ssh. (I asked them about this problem a month
ago, and there was some discussion of a workaround on their mailing list, but
2.9 came out and still had the same behavior. Apparently, if it's not a
problem in OpenBSD, it's not a problem in OpenSSH...)

> > If it's glibc I'm probably better off writing a wrapper to ping the
> > destination before trying to connect, or killing the connection after a
> > timeout with no traffic. But both of those are really ugly solutions.
>
> Ugly is relative, and don't use ping because there is still a race
> condition (ping worked, but by the time you try tcp, the box is down.)

Yeah, but it would eventually time out and recover, I've got ten threads out
querying boxes, that's a really rare race condition. And I already
acknowledged it was ugly. :)

So the problem is just that tcp_syn_retries' default value of 10 is way too
high due to the exponentially increasing gap between each retry.

Rob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/