HAProxy 1.4dev6 broken on Solaris
Update: I feel like a jackass now; I thought I was running this against the stable HAProxy build, but in reality this was against haproxy-1.4dev6. DOH! Well, on the bright side, I am helping the author fix a potentially critical bug. Here is the truss and tcpdump output if anyone cares.
Well, yet another Solaris-specific bug/issue to report: HAProxy resets long-running connections, which means users on slow bandwidth connections are affected. I have sent tcpdumps and logs to the author of HAProxy; hopefully this will be resolved soon. I am writing this as a precautionary warning to other Solaris admins out there.
Here is how to trigger it, so you can see whether your service is affected:
wget --limit-rate=2k http://somesite.com/onebigfile.txt
Result:
syris:~ victori$ wget --limit-rate=20k http://somesite.com/onebigfile.txt
--2010-01-20 11:19:29-- http://somesite.com/onebigfile.txt
Resolving somesite.com (somesite.com)... 72.11.142.91
Connecting to somesite.com (somesite.com)|72.11.142.91|:84... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3806025 (3.6M)
Saving to: “onebigfile.txt”
7% [====> ] 269,008 20.1K/s in 13s
2010-01-20 11:19:42 (20.1 KB/s) - Read error at byte 269008/3806025 (Connection reset by peer). Retrying.
--2010-01-20 11:19:43-- (try: 2) http://somesite.com/onebigfile.txt
Connecting to somesite.com (somesite.com)|72.11.142.91|:84... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3806025 (3.6M)
Saving to: “onebigfile.txt”
4% [==> ] 186,016 20.0K/s eta
/Raging, why are there so many Solaris TCP issues? First Varnish, now HAProxy? ARGHHHHH!@#!@
Victor,
I think the analysis is of interest to more people than just you and me, so I’m continuing here.
As I explained to you, this is not even an haproxy issue; it is normal behaviour caused by large kernel socket buffers, triggered by your low client-side timeout (10s), which does not give the client enough time to drain the system buffers so that the kernel will let haproxy write there again.
I could reliably reproduce the same behaviour with apache, except that apache uses considerably larger timeouts.
From the traces you provided to me, I suspect the system buffers are around 128 kB. Thus, when using wget to run the test, you have one receive buffer assigned to wget’s socket, and a transmit buffer assigned to haproxy’s socket, which means 256 kB between haproxy and wget, over which neither has control. For optimisation purposes, the kernel decides not to signal free space every time a few bytes are out, otherwise it could cause a very high context switch rate. Most systems use hi-water/lo-water mechanisms with figures like 100%/40%. That means that your client has to read 60% of the buffers before the system accepts new writes there. When you limit your client to 2kB per second, it can take up to 0.60*256/2 = 76.8 seconds. In practice, the receive buffers should be released more often to reduce packet losses on the network, but you still get the 38 seconds just for the transmit buffer.
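To spell out the arithmetic with the figures above (the 128 kB per-socket buffer is only suspected from the traces, and the 40% low-water mark is a typical figure, not a measured one):

0.60 * 256 kB / 2 kB/s = 76.8 s   (receive + transmit buffers together, worst case)
0.60 * 128 kB / 2 kB/s = 38.4 s   (transmit buffer alone)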
Clearly, your 10 second timeout is too low to cover that.
I could reliably reproduce your issue on Linux too, so this behaviour is not even a Solaris bug. On Linux, with wget at 2 kB/s, I see pauses of 21 to 27 seconds when reading directly from the server, and from 27 to 33 seconds when passing through haproxy (because of the added buffer). Setting a timeout of 30 seconds sometimes works; setting it to 34 seconds makes it work every time.
You can avoid the issue by reducing the default buffer size under Linux in /proc/sys/net/ipv4/tcp_wmem. I remember there is an equivalent on Solaris using “ndd /dev/tcp”, but right now I don’t remember the variable name.
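For illustration only, a rough sketch of what that tuning might look like; the sizes are arbitrary example values, and the Solaris parameter name (tcp_xmit_hiwat) is my best guess rather than something confirmed here, so verify it on your own release:

# Linux: min / default / max TCP send buffer sizes, in bytes (run as root)
echo "4096 16384 65536" > /proc/sys/net/ipv4/tcp_wmem

# Solaris: default TCP transmit buffer high-water mark (name assumed, check first)
ndd -set /dev/tcp tcp_xmit_hiwat 16384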
Also, it is sometimes annoying to reduce the socket buffer sizes for the whole system. For this reason I have added global settings in haproxy which allow it to change the defaults just for its own sockets. By setting the “tune.sndbuf.client” variable in the “global” section, you can now force smaller buffers. I tried with 8 kB buffers and 10 second timeouts with success, though 10s is still very low.
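As a minimal sketch, assuming haproxy 1.4dev6 or later and the figures mentioned above (8 kB client send buffer, 10 s timeouts; the surrounding directives are illustrative, not a complete configuration):

global
    # force a smaller kernel send buffer towards clients (bytes)
    tune.sndbuf.client 8192

defaults
    mode http
    # 10s is still very low for slow clients; raise it if you can
    timeout client 10s
    timeout server 10s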
Last, keep in mind that if you want to support very slow clients, you also have to accept that they are the ones who experience the highest drop rates. For this reason, you must set your client timeouts large enough to cover a few dropped packets in a row. 10 seconds covers exactly two drops, with a margin of 1 second, which may be a bit low for a round-trip ACK. The best thing to do is normally to use the same timeout on the client and server sides and to set it high enough to cover both sides’ maximum wait time.
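To spell out the two-drop figure, assuming the classic 3-second initial retransmit timer that doubles after each consecutive loss (an assumption about the stack defaults, not something measured here):

3 s (first retransmit) + 6 s (second retransmit) = 9 s
10 s timeout - 9 s = 1 s of margin left for the round-trip ACK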
Thanks for your traces and dumps; they helped reproduce the issue, and at least it is documented now 🙂
Victor,
Before making stupid claims with nothing more than FUD all over it, document yourself; it doesn’t hurt at all.
Best regards.