Update 2010-02-19: Seems other people are also affected by the Varnish LINGER crash on OpenSolaris. This does not address the core problem but removes the “fail fast” behavior with no negative side effects.
r4576 has been running reliably with the fix below.
In bin/varnishd/cache_acceptor.c, remove the TCP_assert wrapper around the setsockopt() call so that only the bare call remains:
setsockopt(sp->fd, SOL_SOCKET, SO_LINGER,
    &linger, sizeof linger);
Update 2010-02-17: This might be a random fluke, but Varnish has connection issues when compiled with SunCC, so stick to GCC. I have compiled Varnish with GCC 4.3.2 and the build seems to work well. Give r4572 a try; phk committed some Solaris-aware errno code.
Update 2010-02-16: r4567 seems stable. Unlike on other platforms, errno isn’t thread-safe by default on Solaris. You need to pass -pthreads for GCC or -mt for SunCC in both the compile and link flags.
GCC example:
SunCC Example:
Here are the sources I used to piece it all together: Sun docs, Stack Overflow answer
Here is what -pthreads defines on GCC:
#define _REENTRANT 1
Snippet from Solaris’s /usr/include/errno.h confirming that errno isn’t thread-safe by default:
#if defined(_REENTRANT) || defined(_TS_ERRNO)
extern int *___errno();
#define errno (*(___errno()))
#else
extern int errno;
/* ANSI C++ requires that errno be a macro */
#if __cplusplus >= 199711L
#define errno errno
#endif
#endif /* defined(_REENTRANT) || defined(_TS_ERRNO) */
Update 2010-01-28: r4508 seems stable. No patches are needed aside from removing the AZ() assert in cache_acceptor.c on line 163.
Update 2010-01-21: If you’re using Varnish from trunk past r4445, apply this session cache_waiter_poll patch to avoid stalled connections.
Update 2009-12-21: Still using Varnish in production; the site is working beautifully with the settings below.
Update(new): I think I figured out the last remaining piece of the puzzle. Switching Varnish’s default listener to poll fixed the long connection accept wait times.
Update: Monitor charts looked good, but persistent connections kept flaking under production traffic. I was forced to revert to Squid 2.7. *Sigh* I think Squid might be the only viable option on Solaris when it comes to reverse proxy caching. The information below is useful if you still want to try out Varnish on Solaris.
I have finally wrangled Varnish to work reliably on Solaris without any apparent issues. The recent commit to trunk by phk (Varnish’s creator) fixed the last remaining Solaris issue that I am aware of.
There are four requirements to get this working reliably on Solaris.
1. Run from trunk – r4508 is a known stable revision that works well. Remove the AZ() assert in cache_acceptor.c on line 163.
2. Set connect_timeout to 0. This works around a Varnish/Solaris TCP incompatibility in the timeout code of lib/libvarnish/tcp.c#TCP_connect.
3. Switch the default waiter to poll. Event ports seem buggy on OpenSolaris builds.
4. If you have issues starting Varnish, start Varnish in the foreground via -F argument.
Here is a Pingdom graph of our monitored service. Can you tell when Varnish was swapped in for Squid? Varnish does a better job of keeping content cached due to header normalization and larger cache size.

There are a few “gotchas” to look out for to get it all running reliably. Here is the configuration that I used in production. I have annotated each setting with a brief description.
newtask -p highfile /opt/extra/sbin/varnishd
-f /opt/extra/etc/varnish/default.vcl
-a 0.0.0.0:82 # IP/Port to listen on
-p listen_depth=8192 # Connections kernel buffers before rejecting.
-p waiter=poll # Listener implementation to use.
-p thread_pool_max=2000 # Max threads per pool
-p thread_pool_min=50 # Min Threads per pool, crank this high
-p thread_pools=4 # Thread Pool per CPU
-p thread_pool_add_delay=2ms # Thread init delay, not to bomb OS
-p cc_command='cc -Kpic -G -m64 -o %o %s' # 64-Bit if needed
-s file,/sessions/varnish_cache.bin,512M # Define cache size
-p sess_timeout=10s # Keep-Alive timeout
-p max_restarts=12 # Amount of restart attempts
-p session_linger=120ms # Milliseconds to keep thread around
-p connect_timeout=0s # Important bug work around for Solaris
-p lru_interval=20s # LRU interval checks
-p sess_workspace=65536 # Space for headers
-T 0.0.0.0:8086 # Admin console
-u webservd # User to run varnish as
System configuration Optimizations
Solaris lacks the SO_{SND|RCV}TIMEO BSD socket flags, which define TCP timeout values per socket. Every other OS has them (Mac OS X, Linux, FreeBSD, AIX), but not Solaris, so Varnish cannot use custom timeout values there. The next best thing on Solaris is to tune the TCP timeouts globally.
# Turn off Nagle. Nagle adds latency.
/usr/sbin/ndd -set /dev/tcp tcp_naglim_def 1
# 30 second TIME_WAIT timeout. (4 minutes default)
/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 30000
# 15 min keep-alive (2 hour default)
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 900000
# 120 sec connect time out, 3 min default
ndd -set /dev/tcp tcp_ip_abort_cinterval 120000
# Send ACKs right away - less latency on bursty connections.
ndd -set /dev/tcp tcp_deferred_acks_max 0
# RFC says 1 segment, BSD/Win stack requires 2 segments.
/usr/sbin/ndd -set /dev/tcp tcp_slow_start_initial 2
Varnish Settings Dissected
Here are the most important settings to look out for when deploying Varnish in production.
File Descriptors
Run Varnish under a Solaris project that gives the proxy enough file descriptors to handle the concurrency. If Varnish can not allocate enough file descriptors, it can’t serve the requests.
# Paste into /etc/project
highfile:101::*:*:process.max-file-descriptor=(basic,32192,deny)
# Run the Application
newtask -p highfile
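To confirm that a process started under the project really received the higher limit, a quick check from Ruby can help. This is only a sketch: enough_descriptors? is a hypothetical helper, and 32192 is simply the value from the project entry above. Process.getrlimit reports the limits of the Ruby process itself, so it has to be launched the same way as Varnish (for example under newtask -p highfile) for the number to be meaningful.

```ruby
# Hypothetical helper: true when the soft file descriptor limit
# covers the number of descriptors we expect to need.
def enough_descriptors?(needed, soft_limit = Process.getrlimit(Process::RLIMIT_NOFILE).first)
  soft_limit >= needed
end

# Process.getrlimit returns [soft_limit, hard_limit] for this process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "soft=#{soft} hard=#{hard} enough=#{enough_descriptors?(32_192, soft)}"
```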
Threads
Give Varnish enough idle threads so it does not stall on requests. Thread creation is slow and expensive; idle threads are not. Don’t go cheap with threads: allocate a minimum of 200. Modern browsers use 8 concurrent connections by default, meaning Varnish needs 8 threads to handle a single page view.
thread_pool_max=2000 # 2000 max threads per pool
thread_pool_min=50 # 50 min threads per pool
# 50 threads x 4 Pools = 200 threads
thread_pools=4 # 4 Pools, Pool per CPU Core.
session_linger=120ms # How long to keep a thread around
# To handle further requests.
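The thread arithmetic behind these settings can be spelled out in a few lines. This is a back-of-the-envelope sketch: the constant and method names are made up for illustration, and the figure of 8 connections per page view is the assumption stated above.

```ruby
# Values taken from the settings above.
THREAD_POOLS    = 4
THREAD_POOL_MIN = 50
THREAD_POOL_MAX = 2000
# Assumption from the text: a browser opens about 8 connections per page view.
CONNECTIONS_PER_PAGE_VIEW = 8

# Threads across all pools for a given per-pool count.
def total_threads(pools, per_pool)
  pools * per_pool
end

# Page views that can be served at once, one thread per connection.
def page_views_served(threads, connections_per_view = CONNECTIONS_PER_PAGE_VIEW)
  threads / connections_per_view
end

idle = total_threads(THREAD_POOLS, THREAD_POOL_MIN)
peak = total_threads(THREAD_POOLS, THREAD_POOL_MAX)
puts "idle threads: #{idle} (#{page_views_served(idle)} page views without spawning a thread)"
puts "max threads:  #{peak}"
```

With these numbers, the 200 idle threads cover 25 simultaneous page views before Varnish has to create new threads.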
Are you running JRuby in production? Do you want distributed file storage for your “enterprise” application? Look no further, MogileFS is here.
MogileFS-Client has compatibility issues with JRuby due to its use of the low-level Socket class. JRuby 1.5-dev does not yet support all the Socket methods, so here is a monkey patch to get the Ruby MogileFS client working on JRuby. Yes, it blocks, but who cares? JRuby has native threads.
This is exactly why I love Ruby; monkey patching.
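require 'socket'

# Enclosing class assumed to be Socket, which the client uses to connect.
class Socket
  def self.mogilefs_new(host, port, timeout = 5.0)
    TCPSocket.open(host, port, timeout)
  end
end

class TCPSocket
  attr_accessor :mogilefs_addr, :mogilefs_connected, :mogilefs_size, :mogilefs_tcp_cork

  # The timeout is accepted for compatibility but ignored; the connect blocks.
  def self.open(host, port, timeout = 5.0)
    super(host, port.to_i)
  end

  # Always report readable; the blocking calls below wait for data themselves.
  def readable?
    true
  end

  # Blocking write and recv stand in for the non-blocking variants.
  def write_nonblock(data)
    write(data)
  end

  def recv_nonblock(size, arg)
    recv(size, arg)
  end

  def mogilefs_init(host = nil, port = nil)
    true
  end
end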
Here is an example test case on how to get it all to work.
require 'mogilefs'
# jmogilefs.rb is the monkey patch above
# load it after loading mogilefs client.
require 'jmogilefs.rb'
mg = MogileFS::MogileFS.new(:domain => 'testserv', :hosts => ['xxx.xxx.xxx.xxx:6001'])
p mg.get_file_data 'video:100:default.jpg'
p mg.get_paths 'video:100:default.jpg', true
mg.list_keys('video:100')[0].each do |f|
  p f
end
Squid is a fundamental part of our infrastructure at Fabulously40. It helps us lower response times quite considerably. The problem with Squid is that it is quite “dense” when it comes to configuration: unless you’re willing to do a bit of C hacking on it, it does not offer much flexibility. This can be overcome by using supporting software to help Squid out.
Note: Our configuration would be quite simplified if we used Varnish but it lacks some key features that make Squid a better candidate.
1. Varnish can’t stream cache-misses, it can only buffer. This adds latency to cache-miss requests.
2. Varnish is unable to avoid caching objects based on content-length size.
3. Varnish has an issue with connect_timeout and Solaris socket handling.
Until Varnish can handle the three things listed, Squid remains the best choice at the cost of configuration complexity.
Optimize Cache Hits by Normalizing Headers
“Accept-Encoding: gzip” and “Accept-Encoding: gzip,deflate” will be cached separately unless you normalize client headers. Unlike Varnish, Squid has no configuration option to normalize headers. However, you can use Nginx to normalize headers before passing the request off to Squid.
Here is the setup: Client -> Nginx -> Squid -> Backend Services
The NGINX Configuration
location / {
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_hide_header Pragma;

    # Remove client cache-control header
    # to avoid fetching from backend if page is in cache
    proxy_set_header Cache-Control "";

    # Normalize static assets for squid
    if ($request_uri ~* "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|ico|swf|mp4|flv|mov|dmg|mkv)") {
        set $normal_encoding "";
        proxy_pass http://squids;
        break;
    }

    # Normalize gzip encoding
    if ($http_accept_encoding ~* gzip) {
        set $normal_encoding "gzip";
        proxy_pass http://squids;
        break;
    }

    # Normalize deflate encoding
    if ($http_accept_encoding ~* deflate) {
        set $normal_encoding "deflate";
        proxy_pass http://squids;
        break;
    }

    # Define the normalize header
    proxy_set_header Accept-Encoding $normal_encoding;

    # default...
    proxy_pass http://squids;
    break;
}
So by the time Squid receives the request, the accept-encoding header is normalized, for efficient cache storage.
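The decision the Nginx rules make can be restated as a small Ruby function, which is handy for checking what a given request ends up with. This is only an illustration of the logic, not something that runs in the proxy chain; normalize_accept_encoding and STATIC_ASSET are names made up for the sketch.

```ruby
# Same extension list as the Nginx rule; like the original, the pattern is not anchored.
STATIC_ASSET = /\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|ico|swf|mp4|flv|mov|dmg|mkv)/i

# Returns the Accept-Encoding value that would be forwarded to Squid.
def normalize_accept_encoding(request_uri, accept_encoding)
  # Already-compressed static assets are requested without any encoding.
  return "" if request_uri =~ STATIC_ASSET
  # gzip wins over deflate when the client offers both.
  return "gzip" if accept_encoding.to_s =~ /gzip/i
  return "deflate" if accept_encoding.to_s =~ /deflate/i
  ""
end

["gzip", "gzip,deflate", "deflate", nil].each do |header|
  puts "#{header.inspect} -> #{normalize_accept_encoding('/index.html', header).inspect}"
end
```

Every client that accepts gzip collapses into a single cache variant, so Squid stores one compressed copy per page instead of one per header spelling.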
Avoid Caching Error Pages with Squid 2.7
Squid 2.7 has better support for reverse caching and HTTP 1.1 than the Squid 3.x branch. However, it is missing one important ACL that Squid 3.x has: http_status. Squid 2.7 cannot be configured to avoid caching pages based on the status code returned by the origin. I have written a Perlbal plugin, CacheKill, specifically to address this issue. Perlbal::Plugin::CacheKill sits between the backend services and Squid and rewrites Cache-Control headers based on the response code.
Here is the setup: Client -> Squid -> Perlbal -> Backend Services
If the backend service responds with one of the configured HTTP status codes, such as 501, 502 or 503, Perlbal adds a Cache-Control: no-cache header before handing the response back to Squid.
Here is the configuration file for Perlbal::Plugin::CacheKill to deny Squid from caching error pages.
CREATE SERVICE fab40
SET listen = 0.0.0.0:8003
SET role = reverse_proxy
SET pool = backends
cache_kill codes = 404,500,501,503,502
SET plugins = CacheKill
ENABLE fab40
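The rule the plugin applies is simple enough to sketch in Ruby. This is not the plugin's code (the plugin is written in Perl); cache_kill and KILL_CODES are hypothetical names, and the sketch assumes the header is replaced outright when the status matches.

```ruby
# Status codes from the cache_kill line in the configuration above.
KILL_CODES = [404, 500, 501, 503, 502].freeze

# Returns the response headers as Squid would see them.
# Matching statuses get Cache-Control: no-cache; everything else passes through untouched.
def cache_kill(status, headers, codes = KILL_CODES)
  return headers unless codes.include?(status)
  headers.merge("Cache-Control" => "no-cache")
end

p cache_kill(503, "Content-Type" => "text/html")
p cache_kill(200, "Cache-Control" => "max-age=60")
```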
Stitching software together can make Squid just as flexible as Varnish with its VCL configuration.
http://github.com/victori/perlbal-plugin-mogilefs
Key features
- Asynchronous, does not stall the Perlbal event loop.
- Converts URL paths to MogileFS fetch keys.
- Failover to filesystem if key fetch failed.
- Pretty statistics in Perlbal’s Management console.
It’s freaking awesome
On a side note, I have also updated my other two Perlbal plugins.
http://github.com/victori/perlbal-plugin-stickysessions
- Session affinity via Cookie.
http://github.com/victori/perlbal-plugin-backendheaders
- Appending Backend information on the served response.
*Update* Patches got accepted into MogileFS Trunk
Just go check out trunk, it has all my patches already included.
http://code.sixapart.com/svn/mogilefs/trunk/
The only thing you still need is my mogstored disk patch, which is still pending. All the fixes for the PostgreSQL and Solaris issues are already included in trunk.
I fixed a few issues with MogileFS and Solaris. MogileFS should run wonderfully on Solaris with my patches applied.
Directory for all my patches: http://victori.uploadbooth.com/patches
http://victori.uploadbooth.com/patches/solaris-disk-du.patch
This patch fixes mogstored to work with Solaris’s df utility.
http://victori.uploadbooth.com/patches/store-max-requests.patch
This patch adds a new feature to the MogileFS Tracker – max_requests.
The default is 0, but setting max_requests to 1000 is suggested to avoid memory leaks.
The tracker will give out the database handle up to the max_requests limit before expiring the connection and opening a new one. This avoids memory leaks with long-running persistent connections. PostgreSQL has issues with long persistent connections: the backend accumulates a lot of RAM and does not let go until the process/connection is killed off. This patch makes sure that the connection is expired after that many dbh handle requests.
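The idea behind max_requests can be sketched in Ruby as a handle that reconnects after a fixed number of uses. This is an illustration of the technique only, not the patch itself (the tracker is Perl): RecyclingHandle is a made-up class, the block stands in for a real database connect call, and treating 0 as "never expire" is an assumption.

```ruby
# Hands out one shared handle and replaces it after max_requests uses.
class RecyclingHandle
  attr_reader :connects

  def initialize(max_requests, &connector)
    @max_requests = max_requests
    @connector = connector   # block that opens a new connection
    @served = 0
    @connects = 0
    @handle = nil
  end

  # Returns the current handle, reconnecting first if the old one is used up.
  def get
    if @handle.nil? || expired?
      @handle.close if @handle.respond_to?(:close)
      @handle = @connector.call
      @connects += 1
      @served = 0
    end
    @served += 1
    @handle
  end

  private

  # Assumption: a limit of 0 disables expiry.
  def expired?
    @max_requests > 0 && @served >= @max_requests
  end
end

dbh = RecyclingHandle.new(3) { Object.new }  # Object.new stands in for a connection
7.times { dbh.get }
puts "connections opened: #{dbh.connects}"
```

Each fresh connection gets a fresh PostgreSQL backend process, so whatever memory the old backend accumulated is released when it is closed.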
http://victori.uploadbooth.com/patches/mogilefs-sunos-pg.patch
This patch applies the InactiveDestroy argument to avoid the MogileFS Tracker locking up with the PostgreSQL store on Solaris.
http://victori.uploadbooth.com/patches/solaris-mogilefs-full.patch
This is the full patch for all my fixes.
I am slowly migrating our fab40 static asset data to MogileFS. I have imported >300,000 images, no issues with my patches so far.
/ PLUG go make an account on uploadbooth!
Enjoy
I just received my “Guide to Open-Source Operating Systems” comparing Solaris with Linux from Sun’s marketing department. Here are some of the facts that made me cringe due to blatant lying and half truths. Hey Sun, don’t let the facts get in your way.
Believe it or not but this is actually verbatim from the guide.
• Solaris is supported by more applications.
• Solaris holds performance and price/performance world records that demonstrate its speed and scalability on a variety of systems.
• Solaris is supported by Sun, the company dedicated to UNIX for more than two decades.
1. Let’s see, the first fact is just blatant lying. Last I checked, Linux supported IA-32, MIPS, x86-64, SPARC, DEC Alpha, Itanium, PowerPC, ARM, m68k, PA-RISC, s390, SuperH, M32R and many more platforms, while Solaris only supports SPARC, IA-32 and x86-64. Does anyone at Sun’s marketing department care to fact check?
2. Depends on your definition of “supported.” Marketing is most likely referring to commercial support. I don’t have the facts to back this up, but I doubt this holds true against Linux in 2009; maybe they had a case back in 1999. The majority of open source applications are developed against Linux, and Solaris compatibility is just an afterthought.
3. You win http://www.tpc.org/tpcc/results/tpcc_perf_results.asp
Sun develops some of the best hardware and software on the market, but their marketing department is a disaster. There can only be one Steve Jobs and his reality distortion field.
Once again I have been blindsided by yet another conservative out-of-the-box setting. IPFilter’s state table size is tuned far too conservatively.
Here is how you can tell if you’re hitting any issues: run ipfstat and check for lost packets.
victori@opensolaris:~# ipfstat | grep lost
fragment state(in): kept 0 lost 0 not fragmented 0
fragment state(out): kept 0 lost 0 not fragmented 0
packet state(in): kept 798 lost 100
packet state(out): kept 612 lost 234
Notice that the packet state(in) and packet state(out) lines show non-zero lost values. This means IPFilter has been dropping client connections. Bummer.
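If you want to watch for this from a monitoring script, the check is easy to automate. The sketch below parses text in the same shape as the ipfstat output above; lost_counters and dropping? are hypothetical helper names, and the sample is the output from this post rather than a live call to ipfstat.

```ruby
SAMPLE_IPFSTAT = <<~OUT
  fragment state(in): kept 0 lost 0 not fragmented 0
  fragment state(out): kept 0 lost 0 not fragmented 0
  packet state(in): kept 798 lost 100
  packet state(out): kept 612 lost 234
OUT

# Maps each "... state(...)" label to its lost counter.
def lost_counters(ipfstat_output)
  ipfstat_output.each_line.each_with_object({}) do |line, lost|
    match = line.match(/^(.+?):\s+kept\s+\d+\s+lost\s+(\d+)/)
    lost[match[1].strip] = match[2].to_i if match
  end
end

# True when any state table has lost entries.
def dropping?(ipfstat_output)
  lost_counters(ipfstat_output).values.any? { |count| count > 0 }
end

lost_counters(SAMPLE_IPFSTAT).each { |label, count| puts "#{label}: #{count} lost" }
puts "IPFilter is dropping state entries" if dropping?(SAMPLE_IPFSTAT)
```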
The default settings are quite conservative.
fr_statemax min 0x1 max 0x7fffffff current 4096
fr_statesize min 0x1 max 0x7fffffff current 5002
You need to shut down IPFilter and apply larger table size limits.
victori@opensolaris:~# /usr/sbin/ipf -T fr_statemax=18963,fr_statesize=27091
Let’s confirm that it works.
fr_statemax min 0x1 max 0x7fffffff current 18963
fr_statesize min 0x1 max 0x7fffffff current 27091
Awesome, now all we need to do is re-enable IPFilter and there are no more lost packets.
To make this persistent across reboots edit ipf.conf
name="ipf" parent="pseudo" instance=0 fr_statemax=18963 fr_statesize=27091;
Then update the contents
This can be applied to any OS that uses IPFilter.
Update: The following information could be beneficial to some; however, my issues were actually caused by Caviar Black drives shipping with TLER disabled. You need to pay Western Digital a premium for their “RAID” drives with TLER enabled. So for anyone reading this, avoid consumer Western Digital drives if you plan on using them for RAID.
zfs_vdev_max_pending
I can’t believe how long I have been tolerating horrible concurrent IO performance on OpenSolaris running ZFS. When any IO-intensive writes are happening, the whole system slows to a crawl for any further IO. Running “ls” on an uncached directory is just painful.
victori@opensolaris:/opt# iostat -xnz 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   87.0    0.0 2878.1  0.0  0.0    0.0    0.4   0 100 c4t0d0
    0.0   83.0    0.0 2878.1  0.0  0.1    0.2    0.7   1  50 c4t1d0
    1.0    0.0   28.0    0.0  0.0  0.0    0.0    5.4   0   1 c4t2d0
Notice c4t0d0 is 100% busy (the %b column). With a disk pinned at 100%, good luck getting it to do any other operations, such as reads.
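Spotting saturated disks by eye gets tedious, so here is a small Ruby sketch that pulls them out of iostat -xnz style output. It assumes the 11-column layout shown above with %b in the tenth column; saturated_disks is a made-up helper and the 90% threshold is an arbitrary choice.

```ruby
SAMPLE_IOSTAT = <<~OUT
  r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
  0.0 87.0 0.0 2878.1 0.0 0.0 0.0 0.4 0 100 c4t0d0
  0.0 83.0 0.0 2878.1 0.0 0.1 0.2 0.7 1 50 c4t1d0
  1.0 0.0 28.0 0.0 0.0 0.0 0.0 5.4 0 1 c4t2d0
OUT

# Returns the device names whose %b (percent busy) is at or above the threshold.
def saturated_disks(iostat_output, threshold = 90)
  rows = iostat_output.each_line.map(&:split)
  # Keep only data rows: 11 fields with a numeric %b column (skips headers).
  busy = rows.select do |fields|
    fields.size == 11 && fields[9] =~ /\A\d+\z/ && fields[9].to_i >= threshold
  end
  busy.map(&:last)
end

puts "saturated: #{saturated_disks(SAMPLE_IOSTAT).join(', ')}"
```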
SATA disks do Native Command Queuing while SAS disks do Tagged Command Queuing; this is an important distinction. It seems OpenSolaris/Solaris is optimized for the latter, with a 32-wide command queue set by default. This completely saturates SATA disks with IO commands, in turn making the system unusable for short periods of time.
Dynamically set the ZFS command queue depth (zfs_vdev_max_pending) to 1 to optimize for NCQ.
And add to /etc/system
Enjoy your OpenSolaris server on cheap SATA disks!
Recently the primary boot disk went bad on our server and I got blindsided by a non-bootable secondary mirror disk. All the data was intact, but I could not boot from it. This required a slow re-installation and migration process that took a very long time.
• ZFS attach automatically partitions the drive as EFI.
• ZFS send/recv transfers on gzip compressed data-slices is slow.
Here is the correct way of getting both the disks in the ZFS mirror to boot.
Plug the new drive that you want to add to the ZFS mirror into the server. If you’re hot swapping or adding a new drive while the server is still on, you need to use cfgadm to configure it.
Now that the drive is configured and seen by the server you need to repartition it with format so it can be used as a bootable device.
AVAILABLE DISK SELECTIONS:
0. c4t0d0
/pci@0,0/pci8086,346c@1f,2/disk@0,0
1. c4t1d0
/pci@0,0/pci8086,346c@1f,2/disk@1,0
2. c4t2d0
/pci@0,0/pci8086,346c@1f,2/disk@2,0
* select your new drive *
# fdisk
* use fdisk to remove the EFI partition and add a solaris2 partition. *
Select the partition type to create:
1=SOLARIS2 2=UNIX 3=PCIXOS 4=Other
5=DOS12 6=DOS16 7=DOSEXT 8=DOSBIG
9=DOS16LBA A=x86 Boot B=Diagnostic C=FAT32
D=FAT32LBA E=DOSEXTLBA F=EFI 0=Exit?
This step is very important. If you do not repartition your drive, zfs attach will default the drive back to an EFI partition table, which is not bootable.
c4t0d0s2 — primary drive.
c4t1d0s2 — new drive that we are setting up.
You should now be able to attach the secondary drive to your mirror using the identical slice.
Once the mirror is done synchronizing you need to install the bootloader on the drive.
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)?y
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 267 sectors starting at 50 (abs 16115)
stage1 written to master boot sector
Troubleshooting
raw device must be a root slice (not s2)
You did not re-partition the drive to a solaris2 partition. EFI partitions can’t be made bootable. Use the format tool to reconfigure the drive with a solaris2 partition.
cannot open/stat device /dev/rdsk/c1t0d0s0
You did not copy your label information from your primary to your secondary disk with prtvtoc and fmthard.
I am working on a little Twitter project that uses twitter4r as the client API. Twitter recently changed their API and broke compatibility.
/opt/local/lib/ruby/gems/1.8/gems/mbbx6spp-twitter4r-0.4.0/lib/twitter/client/base.rb:43:in `raise_rest_error': Not Found (Twitter::RESTError)
from /opt/local/lib/ruby/gems/1.8/gems/mbbx6spp-twitter4r-0.4.0/lib/twitter/client/base.rb:48:in `handle_rest_response'
from /opt/local/lib/ruby/gems/1.8/gems/mbbx6spp-twitter4r-0.4.0/lib/twitter/client/base.rb:20:in `http_connect'
from /opt/local/lib/ruby/1.8/net/http.rb:543:in `start'
from /opt/local/lib/ruby/gems/1.8/gems/mbbx6spp-twitter4r-0.4.0/lib/twitter/client/base.rb:16:in `http_connect'
from /opt/local/lib/ruby/gems/1.8/gems/mbbx6spp-twitter4r-0.4.0/lib/twitter/client/user.rb:37:in `user'
from somebot.rb:5
Curse you twitter!
Luckily Ruby has the concept of monkey patching; here is the fix to get it all working correctly.
# Reopen the client class (assumed to be Twitter::Client) and point it at the new URIs.
class Twitter::Client
  @@USER_URIS = {
    :info => '/users/show.json',
    :friends => '/statuses/friends.json',
    :followers => '/statuses/followers.json',
  }
end
Shazzam… it works…


