Letsgetdugg

Random tech jargon

Browsing the tag solaris

Update: I feel like a jackass now. I thought I was running this against the stable HAProxy build, but in reality it was against haproxy-1.4dev6. DOH! On the bright side, I am helping the author fix a potentially critical bug. Here are the truss output and tcpdump if anyone cares.

Well, yet another Solaris-specific bug/issue to report. HAProxy resets long-running connections, meaning users on slow connections are affected. I have sent tcpdumps and logs to the author of HAProxy; hopefully this bug/issue will be resolved. I am writing this as a precautionary warning to other Solaris admins out there.

Here is the way to trigger it and see whether your service is affected.

wget --limit-rate=2k http://somesite.com/onebigfile.txt

Result:

syris:~ victori$ wget --limit-rate=20k http://somesite.com/onebigfile.txt
--2010-01-20 11:19:29-- http://somesite.com/onebigfile.txt
Resolving somesite.com (somesite.com)... 72.11.142.91
Connecting to somesite.com (somesite.com)|72.11.142.91|:84... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3806025 (3.6M)
Saving to: “onebigfile.txt”

7% [====> ] 269,008 20.1K/s in 13s

2010-01-20 11:19:42 (20.1 KB/s) - Read error at byte 269008/3806025 (Connection reset by peer). Retrying.

--2010-01-20 11:19:43-- (try: 2) http://somesite.com/onebigfile.txt
Connecting to somesite.com (somesite.com)|72.11.142.91|:84... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3806025 (3.6M)
Saving to: “onebigfile.txt”

4% [==> ] 186,016 20.0K/s eta

/Raging, why are there so many Solaris TCP issues? First Varnish? now HAProxy? ARGHHHHH!@#!@


Clearing stale cache by domain

You can clear a site’s cache by domain, which is really nifty if you have Varnish in front of multiple sites. Log into Varnish’s administration console via telnet and execute the following purge command to wipe out the undesired cache.

purge req.http.host ~ letsgetdugg.com

Monitor Response codes

Worried that some of your clients might be receiving 503 Varnish response pages? Find out with varnishtop.

varnishtop -i TxStatus

Here is what the output looks like.

list length 7                                        web

4018.65 TxStatus 200
132.35 TxStatus 304
44.17 TxStatus 404
34.63 TxStatus 302
30.87 TxStatus 301
9.36 TxStatus 403
1.39 TxStatus 503
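To turn a listing like that into a single number, the rates can be summed and the share of one status code computed. This is a minimal sketch in shell, assuming the listing has been captured as plain text in the "rate TxStatus code" layout shown above. The function name share_of_status is my own invention, not part of Varnish.

```shell
#!/bin/sh
# share_of_status CODE (hypothetical helper, not a Varnish tool):
# reads "rate TxStatus code" lines on stdin and prints the percentage
# of the summed rate that belongs to CODE.
share_of_status() {
    awk -v code="$1" '
        $2 == "TxStatus" { total += $1; if ($3 == code) hits += $1 }
        END {
            if (total > 0) printf "%.2f\n", 100 * hits / total
            else print "0.00"
        }
    '
}

# Sample data copied from the listing above; the header line is
# ignored because its second field is not "TxStatus".
share_of_status 503 <<'EOF'
list length 7 web
4018.65 TxStatus 200
132.35 TxStatus 304
44.17 TxStatus 404
34.63 TxStatus 302
30.87 TxStatus 301
9.36 TxStatus 403
1.39 TxStatus 503
EOF
```

If your varnishtop build has a one-shot mode, its output could be piped into the function instead of the sample data. I have not verified that against every Varnish release, so treat it as an assumption.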

Update 2010-02-19: Seems other people are also affected by the Varnish LINGER crash on OpenSolaris. This does not address the core problem but removes the “fail fast” behavior with no negative side effects.

r4576 has been running reliably with the fix below.

In bin/varnishd/cache_acceptor.c

if (need_linger)
                setsockopt(sp->fd, SOL_SOCKET, SO_LINGER,
                    &linger, sizeof linger);

Remove the TCP_assert wrapper around the setsockopt() call, so that the code reads as shown above.

Update 2010-02-17: This might be a random fluke, but Varnish has connection issues when compiled under SunCC, so stick to GCC. I have compiled Varnish with GCC 4.3.2 and the build seems to work well. Give r4572 a try; phk committed some Solaris-aware errno code.

Update 2010-02-16: r4567 seems stable. Unlike on other platforms, errno isn’t thread-safe by default on Solaris. You need to pass -pthreads to GCC or -mt to SunCC in both the compile and the link flags.

GCC example:

VCC_CC="cc -Kpic -G -m64 -o %o %s" CC=/opt/sfw/bin/gcc CFLAGS="-O3 -L/opt/extra/lib/amd64 -pthreads -m64 -fomit-frame-pointer" LDFLAGS="-lumem -pthreads" ./configure --prefix=/opt/extra

SunCC Example:

VCC_CC="cc -Kpic -G -m64 -o %o %s" CC=/opt/SSX0903/bin/cc CFLAGS="-xO3 -fast -xipo -L/opt/extra/lib/amd64 -mt -m64" LDFLAGS="-lumem -mt" ./configure --prefix=/opt/extra

Here are the sources on how I pieced it all together: sun docs, stack overflow answer

See what -pthreads defines on GCC:

# gcc -x c /dev/null -E -dM -pthreads | grep REENT
#define _REENTRANT 1

Here is a snippet from Solaris’s /usr/include/errno.h confirming that errno isn’t thread-safe by default.

#if defined(_REENTRANT) || defined(_TS_ERRNO) || _POSIX_C_SOURCE - 0 >= 199506L
extern int *___errno();
#define errno (*(___errno()))
#else
extern int errno;
/* ANSI C++ requires that errno be a macro */
#if __cplusplus >= 199711L
#define errno errno
#endif
#endif /* defined(_REENTRANT) || defined(_TS_ERRNO) */

Update 2010-01-28: r4508 seems stable. No patches needed aside from removing an assert(AZ) in cache_acceptor.c on line 163.

Update 2010-01-21: If you’re using Varnish from trunk past r4445, apply this session cache_waiter_poll patch to avoid stalled connections.

Update 2009-12-21: Still using Varnish in production; the site is working beautifully with the settings below.

Update (new): I think I figured out the last remaining piece of the puzzle. Switching Varnish’s default listener to poll fixed the long connection accept wait times.

Update: Monitor charts looked good, but persistent connections kept flaking under production traffic. I was forced to revert back to Squid 2.7. *Sigh* I think Squid might be the only viable option on Solaris when it comes to reverse proxy caching. The information below is useful if you still want to try out Varnish on Solaris.

I have finally wrangled Varnish to work reliably on Solaris without any apparent issues. The recent commit to trunk by phk (Varnish’s creator) fixed the last remaining Solaris issue that I am aware of.

There are four requirements to get this working reliably on Solaris.

1. Run from trunk – r4508 is a known stable revision that works well. Remove the AZ() assert in cache_acceptor.c on line 163.

2. Set connect_timeout to 0, this is needed to work around a Varnish/Solaris TCP incompatibility that resides in lib/libvarnish/tcp.c#TCP_connect timeout code.

3. Switch the default waiter to poll. Event ports seem buggy on OpenSolaris builds.

4. If you have issues starting Varnish, start it in the foreground via the -F argument.

Here is a Pingdom graph of our monitored service. Can you tell when Varnish was swapped in for Squid? Varnish does a better job of keeping content cached due to header normalization and larger cache size.

[Figure: Varnish latency improvement]

There are a few “gotchas” to look out for to get it all running reliably. Here is the configuration that I used in production. I have annotated each setting with a brief description.

newtask -p highfile /opt/extra/sbin/varnishd
-f /opt/extra/etc/varnish/default.vcl
-a 0.0.0.0:82                              # IP/Port to listen on
-p listen_depth=8192                       # Connections kernel buffers before rejecting.
-p waiter=poll                             # Listener implementation to use.
-p thread_pool_max=2000                    # Max threads per pool
-p thread_pool_min=50                      # Min Threads per pool, crank this high
-p thread_pools=4                          # Thread Pool per CPU
-p thread_pool_add_delay=2ms               # Thread init delay, not to bomb OS
-p cc_command='cc -Kpic -G -m64 -o %o %s'  # 64-Bit if needed
-s file,/sessions/varnish_cache.bin,512M   # Define cache size
-p sess_timeout=10s                        # Keep-Alive timeout
-p max_restarts=12                         # Amount of restart attempts
-p session_linger=120ms                    # Milliseconds to keep thread around
-p connect_timeout=0s                      # Important bug work around for Solaris
-p lru_interval=20s                        # LRU interval checks
-p sess_workspace=65536                    # Space for headers
-T 0.0.0.0:8086                            # Admin console
-u webservd                                # User to run varnish as

System configuration Optimizations

Solaris lacks the SO_SNDTIMEO and SO_RCVTIMEO BSD socket flags, which are used to define TCP timeout values per socket. Every other OS has them (Mac OS X, Linux, FreeBSD, AIX), but not Solaris, meaning Varnish is unable to make use of custom timeout values there. You can do the next best thing with Solaris: optimize the TCP timeouts globally.

# Turn off Nagle. Nagle adds latency.
/usr/sbin/ndd -set /dev/tcp tcp_naglim_def 1

# 30 second TIME_WAIT timeout. (4 minutes default)
/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 30000

# 15 min keep-alive (2 hour default)
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 900000

# 120 sec connect time out, 3 min default
ndd -set /dev/tcp tcp_ip_abort_cinterval 120000

# Send ACKs right away - less latency on bursty connections.
ndd -set /dev/tcp tcp_deferred_acks_max 0

# RFC says 1 segment, BSD/Win stack requires 2 segments.
/usr/sbin/ndd -set /dev/tcp tcp_slow_start_initial 2

Varnish Settings Dissected

Here are the most important settings to look out for when deploying Varnish in production.

File Descriptors

Run Varnish under a Solaris project that gives the proxy enough file descriptors to handle the concurrency. If Varnish can not allocate enough file descriptors, it can’t serve the requests.

# Paste into /etc/project
# Run the application with: newtask -p highfile
highfile:101::*:*:process.max-file-descriptor=(basic,32192,deny)

Threads

Give enough idle threads to Varnish so it does not stall on requests. Thread creation is slow and expensive, idle threads are not. Don’t go cheap with threads, allocate a minimum of 200. Modern browsers use 8 concurrent connections by default, meaning Varnish will need 8 threads to handle a single page view.

thread_pool_max=2000   # 2000 max threads per pool
thread_pool_min=50     # 50 min threads per pool
                       # 50 threads x 4 pools = 200 threads
thread_pools=4         # 4 pools, one pool per CPU core
session_linger=120ms   # How long to keep a thread around
                       # to handle further requests
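The arithmetic behind those numbers can be written down directly. This is a small sketch in shell using the values from the settings above and the figure of 8 connections per browser quoted earlier. The variable names conns_per_browser, idle_threads and page_views are illustrative, not Varnish parameters.

```shell
#!/bin/sh
# Values taken from the settings above.
thread_pool_min=50
thread_pools=4
conns_per_browser=8   # figure quoted in the text; adjust for your clients

# Idle threads kept ready across all pools.
idle_threads=$((thread_pool_min * thread_pools))

# Simultaneous page views those idle threads can absorb before
# Varnish has to create new threads.
page_views=$((idle_threads / conns_per_browser))

echo "idle threads: $idle_threads"
echo "page views served without spawning threads: $page_views"
```

With the values above this gives 200 idle threads, enough for roughly 25 browsers loading a page at the same moment before thread creation kicks in.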

Squid is a fundamental part of our infrastructure at Fabulously40. It helps us lower response times quite considerably. The problem with Squid is that it is quite “dense” when it comes to configuration: unless you’re willing to do a bit of C hacking on it, it does not offer much flexibility. This can be overcome by using supporting software to help Squid out.

Note: Our configuration would be quite simplified if we used Varnish but it lacks some key features that make Squid a better candidate.

1. Varnish can’t stream cache-misses, it can only buffer. This adds latency to cache-miss requests.
2. Varnish is unable to avoid caching objects based on content-length size.
3. Varnish has an issue with connect_timeout and Solaris socket handling.

Until Varnish can handle the three things listed, Squid remains the best choice at the cost of configuration complexity.

Optimize Cache Hits by Normalizing Headers

“Accept-Encoding: gzip” and “Accept-Encoding: gzip, deflate” will be cached separately unless you normalize client headers. Unlike Varnish, Squid has no configuration option to normalize headers. However, you can use Nginx to normalize headers before passing the request off to Squid.

Here is the setup: Client -> Nginx -> Squid -> Backend Services

The NGINX Configuration

location / {
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_hide_header Pragma;

    # Remove client cache-control header
    # to avoid fetching from backend if page is in cache
    proxy_set_header Cache-Control "";

    # Normalize static assets for squid
    if ($request_uri ~* "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|ico|swf|mp4|flv|mov|dmg|mkv)") {
        set $normal_encoding "";
        proxy_pass http://squids;
        break;
    }

    # Normalize gzip encoding
    if ($http_accept_encoding ~* gzip) {
        set $normal_encoding "gzip";
        proxy_pass http://squids;
        break;
    }

    # Normalize deflate encoding
    if ($http_accept_encoding ~* deflate) {
        set $normal_encoding "deflate";
        proxy_pass http://squids;
        break;
    }

    # Define the normalize header
    proxy_set_header Accept-Encoding $normal_encoding;

    # default...
    proxy_pass http://squids;
    break;
}

So by the time Squid receives the request, the accept-encoding header is normalized, for efficient cache storage.
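The effect of that normalization can be sketched outside of Nginx. The shell function below only illustrates the decision order of the three if branches (static asset, then gzip, then deflate). It is not something Nginx or Squid runs, and the name normalize_encoding and its two arguments are assumptions made for the example.

```shell
#!/bin/sh
# normalize_encoding URI ACCEPT_ENCODING (hypothetical helper)
# Prints the Accept-Encoding value that would be passed on to Squid.
normalize_encoding() {
    # Same extension list as the Nginx config, matched case-insensitively.
    static='\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|ico|swf|mp4|flv|mov|dmg|mkv)'
    if printf '%s\n' "$1" | grep -Eqi "$static"; then
        echo ""          # already-compressed asset: no encoding requested
    elif printf '%s\n' "$2" | grep -qi 'gzip'; then
        echo "gzip"      # gzip wins whenever the client offers it
    elif printf '%s\n' "$2" | grep -qi 'deflate'; then
        echo "deflate"
    else
        echo ""          # nothing usable offered
    fi
}

normalize_encoding /index.html "gzip"            # prints: gzip
normalize_encoding /index.html "gzip, deflate"   # prints: gzip
normalize_encoding /index.html "deflate"         # prints: deflate
normalize_encoding /logo.png   "gzip, deflate"   # prints an empty line
```

The first two calls produce the same value, which is the whole point: two different client headers collapse into one cached variant.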

Avoid Caching Error Pages with Squid 2.7

Squid 2.7 has better support for reverse caching and HTTP 1.1 than the Squid 3.x branch. However, it is missing one important ACL that Squid 3.x has: http_status. Squid 2.7 cannot be configured to avoid caching pages based on the response status code from the origin. I have written a Perlbal plugin, CacheKill, specifically to address this issue. Perlbal::Plugin::CacheKill sits between the backend services and Squid and rewrites cache-control headers based on the response code.

Here is the setup: Client -> Squid -> Perlbal -> Backend Services

If the backend service responds with one of the configured HTTP status codes (for example 501, 502 or 503), Perlbal will append a Cache-Control: no-cache header before giving the response back to Squid.

Here is the configuration file for Perlbal::Plugin::CacheKill to prevent Squid from caching error pages.

CREATE SERVICE fab40
  SET listen = 0.0.0.0:8003
  SET role = reverse_proxy
  SET pool = backends
  cache_kill codes = 404,500,501,503,502
  SET plugins = CacheKill
ENABLE fab40

Stitching software together can make Squid just as flexible as Varnish with its VCL configuration.


*Update* Patches got accepted into MogileFS Trunk ;-)

Just go check out trunk, it has all my patches already included.

http://code.sixapart.com/svn/mogilefs/trunk/

The only thing you still need is my mogstored disk patch, which is still pending. All the fixes for the PostgreSQL and Solaris issues have already been included in trunk.


I fixed a few issues with MogileFS and Solaris. MogileFS should run wonderfully on Solaris with my patches applied.

Directory for all my patches: http://victori.uploadbooth.com/patches

http://victori.uploadbooth.com/patches/solaris-disk-du.patch

This patch fixes mogstored to work with Solaris’s df utility.

http://victori.uploadbooth.com/patches/store-max-requests.patch

This patch adds a new feature to the MogileFS Tracker – max_requests.

The default is 0, but I suggest setting max_requests to 1000 to avoid memory leaks.

The tracker will hand out the same database handle until the max_requests limit is reached, then expire the connection and open a new one. This avoids memory leaks with long-running persistent connections. PostgreSQL has issues with long persistent connections: the backend accumulates a lot of RAM and does not let go until the process/connection is killed off. This patch makes sure that the connection is expired after that many dbh requests.

http://victori.uploadbooth.com/patches/mogilefs-sunos-pg.patch

This patch applies the InactiveDestroy argument to avoid the MogileFS Tracker locking up with the PostgreSQL store on Solaris.

http://victori.uploadbooth.com/patches/solaris-mogilefs-full.patch

This is the full patch for all my fixes.

I am slowly migrating our fab40 static asset data to MogileFS. I have imported >300,000 images, no issues with my patches so far.

/ PLUG go make an account on uploadbooth!

Enjoy ;-)

I just received my “Guide to Open-Source Operating Systems” comparing Solaris with Linux from Sun’s marketing department. Here are some of the facts that made me cringe due to blatant lying and half truths. Hey Sun, don’t let the facts get in your way.

Believe it or not but this is actually verbatim from the guide.

• Solaris runs on more hardware platforms.

• Solaris is supported by more applications.

• Solaris holds performance and price/performance world records that demonstrate its speed and scalability on a variety of systems.

• Solaris is supported by Sun, the company dedicated to UNIX for more than two decades.

1. Let’s see, the first fact is just blatant lying. Last I checked, Linux supported IA-32, MIPS, x86-64, SPARC, DEC Alpha, Itanium, PowerPC, ARM, m68k, PA-RISC, s390, SuperH, M32R and many more platforms, while Solaris only supports SPARC, IA-32 and x86-64. Does anyone at Sun’s marketing department care to fact check?

2. Depends on your definition of “supported.” Marketing is most likely referring to commercial support. I don’t have the facts to back this up, but I doubt this holds true against Linux in 2009; maybe they had a case back in 1999. The majority of open source applications are developed against Linux, and Solaris compatibility is just an afterthought.

3. You win http://www.tpc.org/tpcc/results/tpcc_perf_results.asp

Sun develops some of the best hardware and software on the market, but their marketing department is a disaster. There can only be one Steve Jobs and his reality distortion field.

Once again I have been blindsided by yet another conservative out-of-the-box setting. IPFilter is tuned way too conservatively when it comes to its state table size.

Here is how you can tell if you’re hitting any issues: run ipfstat and check for lost packets.

victori@opensolaris:~# ipfstat | grep lost
fragment state(in): kept 0 lost 0 not fragmented 0
fragment state(out): kept 0 lost 0 not fragmented 0
packet state(in): kept 798 lost 100
packet state(out): kept 612 lost 234

Notice that the in and out lost state lines have a non-zero value. This means IPFilter has been dropping client connections, bummer.
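That check is easy to script. This is a minimal sketch in shell, assuming ipfstat prints the "packet state" lines in the layout shown above. The function name lost_state_packets is my own, not part of IPFilter.

```shell
#!/bin/sh
# lost_state_packets (hypothetical helper): reads ipfstat output on stdin
# and prints the sum of the "lost" counters on the "packet state" lines.
lost_state_packets() {
    awk '
        /packet state/ {
            # The number following the word "lost" is the counter.
            for (i = 1; i < NF; i++)
                if ($i == "lost") sum += $(i + 1)
        }
        END { print sum + 0 }
    '
}

# Sample data copied from the output above; prints 334 (100 + 234).
lost_state_packets <<'EOF'
fragment state(in): kept 0 lost 0 not fragmented 0
fragment state(out): kept 0 lost 0 not fragmented 0
packet state(in): kept 798 lost 100
packet state(out): kept 612 lost 234
EOF
```

On a live box the same function could be fed with ipfstat | lost_state_packets. A total that keeps growing between runs suggests the state table is still too small.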

The default settings are quite conservative.

victori@opensolaris:~# ipf -T list | grep fr_state
fr_statemax min 0x1 max 0x7fffffff current 4096
fr_statesize min 0x1 max 0x7fffffff current 5002

You need to shut down IPFilter and apply larger table size limits.

victori@opensolaris:~# svcadm disable ipfilter
victori@opensolaris:~# /usr/sbin/ipf -T fr_statemax=18963,fr_statesize=27091

Let’s confirm that it works.

victori@opensolaris:~# ipf -T list | grep fr_state
fr_statemax min 0x1 max 0x7fffffff current 18963
fr_statesize min 0x1 max 0x7fffffff current 27091

Awesome, now all we need to do is re-enable IPFilter, and no more lost packets.

victori@opensolaris:~# svcadm enable ipfilter

To make this persistent across reboots, edit ipf.conf and add the following line.

victori@opensolaris:~# vi /usr/kernel/drv/ipf.conf
name="ipf" parent="pseudo" instance=0 fr_statemax=18963 fr_statesize=27091;

Then have the driver configuration reloaded:

victori@opensolaris:~# devfsadm -i ipf

This can be applied to any OS that uses IPFilter.

Update: The following information could be beneficial to some, however my issues actually were with Caviar black drives shipping with TLER disabled. You need to pay Western Digital a premium for their “RAID” drives with TLER enabled. So for anyone reading this, avoid consumer Western Digital drives if you plan on using them for RAID.

zfs_vdev_max_pending

I can’t believe how long I have been tolerating horrible concurrent IO performance on OpenSolaris running ZFS. Whenever IO-intensive writes are happening, the whole system slows to a crawl for any further IO. Running “ls” on an uncached directory is just painful.

victori@opensolaris:/opt# iostat -xnz 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   87.0    0.0 2878.1  0.0  0.0    0.0    0.4   0 100 c4t0d0
    0.0   83.0    0.0 2878.1  0.0  0.1    0.2    0.7   1  50 c4t1d0
    1.0    0.0   28.0    0.0  0.0  0.0    0.0    5.4   0   1 c4t2d0

Notice c4t0d0 is blocking at 100% (the %b column). If a disk is blocking at 100%, good luck getting it to do any other operations such as reads.

SATA disks do Native Command Queuing while SAS disks do Tagged Command Queuing; this is an important distinction. OpenSolaris/Solaris seems to be optimized for the latter, with a 32-wide command queue set by default. This completely saturates SATA disks with IO commands, in turn making the system unusable for short periods of time.

Dynamically set the ZFS command queue to 1 to optimize for NCQ.

echo zfs_vdev_max_pending/W0t1 | mdb -kw

And add to /etc/system

set zfs:zfs_vdev_max_pending=1

Enjoy your OpenSolaris server on cheap SATA disks!


Recently a primary boot disk went bad on our server and I got blindsided by a non-bootable secondary mirror disk. All the data was intact but I could not boot from it. This required a slow re-installation and migration process that took a very long time. Here is what I learned:

• EFI partitioned drives are not ZFS bootable.
• ZFS attach automatically partitions the drive as EFI.
• ZFS send/recv transfers of gzip-compressed datasets are slow.

Here is the correct way of getting both the disks in the ZFS mirror to boot.

Plug the drive that you want to add to the ZFS mirror into the server. If you’re hot swapping or adding a new drive while the server is still on, you need to use cfgadm to configure it.

victori@solaris:~# cfgadm -c configure sata1/1

Now that the drive is configured and seen by the server, you need to repartition it with format so it can be used as a bootable device.

victori@solaris:~# format

AVAILABLE DISK SELECTIONS:
0. c4t0d0
/pci@0,0/pci8086,346c@1f,2/disk@0,0
1. c4t1d0
/pci@0,0/pci8086,346c@1f,2/disk@1,0
2. c4t2d0
/pci@0,0/pci8086,346c@1f,2/disk@2,0

* select your new drive *

# fdisk

* use fdisk to remove the EFI partition and add a solaris2 partition. *

Select the partition type to create:
1=SOLARIS2 2=UNIX 3=PCIXOS 4=Other
5=DOS12 6=DOS16 7=DOSEXT 8=DOSBIG
9=DOS16LBA A=x86 Boot B=Diagnostic C=FAT32
D=FAT32LBA E=DOSEXTLBA F=EFI 0=Exit?

This step is very important. If you do not repartition your drive, zpool attach will default the drive back to an EFI partition table, which is not bootable. Next, copy the partition label from the primary drive to the new drive.

c4t0d0s2 — primary drive.
c4t1d0s2 — new drive that we are setting up.

victori@solaris:~# prtvtoc /dev/rdsk/c4t0d0s2 | fmthard -s - /dev/rdsk/c4t1d0s2

You should now be able to attach the secondary drive to your mirror using the identical slice.

zpool attach rpool c4t0d0s0 c4t1d0s0

Once the mirror is done synchronizing, you need to install the bootloader on the new drive.

victori@solaris:~# installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c4t1d0s0
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)?y
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 267 sectors starting at 50 (abs 16115)
stage1 written to master boot sector

Troubleshooting

victori@solaris:~# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c4t1d0s0

raw device must be a root slice (not s2)

You did not re-partition the drive to a solaris2 partition. EFI partitions can’t be made bootable. Use the format tool to reconfigure the drive with a solaris2 partition.

zpool attach rpool c4t0d0s0 c4t1d0s0

cannot open/stat device /dev/rdsk/c1t0d0s0

You did not copy your label information from your primary to your secondary disk with prtvtoc and fmthard.


I have finally nailed down all our issues surrounding Varnish on Solaris, thanks to the help of sky from #varnish. Apparently Varnish uses a wrapper around connect() to drop stale connections and avoid thread pileups if the back-end ever dies. Setting connect_timeout to 0 forces Varnish to use connect() directly. This should eliminate all the 503 back-end issues under Solaris that I mentioned in an earlier blog post.

Here is our startup script for Varnish that works for our needs. Varnish is a 64-bit binary, hence the "-m64" in the cc_command passed.

#!/bin/sh

rm /sessions/varnish_cache.bin

newtask -p highfile /opt/extra/sbin/varnishd \
    -f /opt/extra/etc/varnish/default.vcl \
    -a 72.11.142.91:80 \
    -p listen_depth=8192 \
    -p thread_pool_max=2000 \
    -p thread_pool_min=12 \
    -p thread_pools=4 \
    -p cc_command='cc -Kpic -G -m64 -o %o %s' \
    -s file,/sessions/varnish_cache.bin,4G \
    -p sess_timeout=10s \
    -p max_restarts=12 \
    -p session_linger=50s \
    -p connect_timeout=0s \
    -p obj_workspace=16384 \
    -p sess_workspace=32768 \
    -T 0.0.0.0:8086 \
    -u webservd \
    -F

I noticed Varnish had a particular problem of keeping connections around in the CLOSE_WAIT state for a long time, long enough to cause issues. I did some tuning on Solaris’s TCP stack so it is more aggressive in closing sockets after the work has been done.

Here are my aggressive TCP settings to force Solaris to close off connections quickly and avoid file descriptor leaks. You can merge the following TCP tweaks with the settings I posted earlier to handle more clients.

# 67 seconds (default 675 seconds)
/usr/sbin/ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500

# 30 seconds, aggressively close connections - default 4 minutes on solaris < 8
/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 30000

# 1 minute, poll for dead connection - default 2 hours
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 60000

Last but not least, I have finally swapped out ActiveMQ for the FUSE message broker, an “enterprise” ActiveMQ distribution. Hopefully it won’t crash once a week like ActiveMQ does for us. The FUSE message broker is based off of ActiveMQ 5.3 sources that fix various memory leaks found in the current stable release of ActiveMQ 5.2 as of this writing.

If the FUSE message broker does not work out, I might have to give Kestrel a try. Hey, if it worked for Twitter, it should work for us... right?