Browsing the topic administration
Ted Dziuba beautifully articulated why deadlines go to crap and seemingly straight forward tasks go out the window. You sir have done a public service for us all, thank you.
What I hate is fording endless rivers of horseshit that are in the way of seemingly simple tasks. And I hate it even more when I have to explain to a non-programmer what I am doing, "building LXML against a different version of libiconv because I think it might be the source of a crash". "But all I asked you to do was parse some documents." Good times.
Since Varnish did not work out on Solaris yet again. I have decided to bite the bullet and write a headers normalization patch for Squid 2.7. This patch should produce much better cache hit rates with Squid. Efficiency++
What the patch does:
1. Removes Cache-Control request headers, don’t let clients by-pass cache if it is primed.
2. Normalize Accept-Encoding Headers for a higher cache hit rate.
3. Clear Accept-Encoding Headers for content that should not be compressed such as image,video and audio.
and the patch: squid-headers-normalization.patch
Update: Fixed a minor memory leak, all good now.
Update 2: Added audio exception to strip accept-encoding.
--- src/client_side.c.og 2010-01-20 12:00:56.000000000 -0800 +++ src/client_side.c 2010-01-19 20:35:31.000000000 -0800 @@ -3983,6 +3983,7 @@ errorAppendEntry(http->entry, err); return -1; } + /* compile headers */ /* we should skip request line! */ if ((http->http_ver.major >= 1) && !httpMsgParseRequestHeader(request, &msg)) { @@ -3992,10 +3993,59 @@ err->url = xstrdup(http->uri); http->al.http.code = err->http_status; http->log_type = LOG_TCP_DENIED; + http->entry = clientCreateStoreEntry(http, method, null_request_flags); errorAppendEntry(http->entry, err); return -1; } + + /* + * Normalize Request Cache-Control / If-Modified-Since Headers + * Don't let client by-pass the cache if there is cached content. + */ + if(httpHeaderHas(&request->header,HDR_CACHE_CONTROL)) { + httpHeaderDelByName(&request->header,"cache-control"); + } + + /* + * Un-comment this if you want Squid to always respond with the request + * instead of returning back with a 304 if the cache has not changed. + */ + /* + if(httpHeaderHas(&request->header,HDR_IF_MODIFIED_SINCE)) { + httpHeaderDelByName(&request->header,"if-modified-since"); + }*/ + + /* + * Normalize Accept-Encoding Headers sent from client + */ + if(httpHeaderHas(&request->header,HDR_ACCEPT_ENCODING)) { + String val = httpHeaderGetByName(&request->header,"accept-encoding"); + if(val.buf) { + if(strstr(val.buf,"gzip") != NULL) { + httpHeaderDelByName(&request->header,"accept-encoding"); + httpHeaderPutStr(&request->header,HDR_ACCEPT_ENCODING,"gzip"); + } else if(strstr(val.buf,"deflate") != NULL) { + httpHeaderDelByName(&request->header,"accept-encoding"); + httpHeaderPutStr(&request->header,HDR_ACCEPT_ENCODING,"deflate"); + } else { + httpHeaderDelByName(&request->header,"accept-encoding"); + } + } + stringClean(&val); + } + + /* + * Normalize Accept-Encoding Headers for video/image content + */ + char *mime_type = mimeGetContentType(http->uri); + if(mime_type) { + if(strstr(mime_type,"image") != NULL || strstr(mime_type,"video") != NULL || strstr(mime_type,"audio") != NULL) { + httpHeaderDelByName(&request->header,"accept-encoding"); + } + } + + /* * If we read past the end of this request, move the remaining * data to the beginning
Clearing stale cache by domain
You can clear a site’s cache by domain, this is really nifty if you have Varnish in front of multiple sites. You can log into Varnish’s administration console via telnet and execute the following purge command to wipe out the undesired cache.
purge req.http.host ~ letsgetdugg.com
Monitor Response codes
Worried that some of your clients might be receiving 503 Varnish response pages? Find out with varnishtop.
varnishtop -i TxStatus
Here is how the output looks like.
list length 7 web 4018.65 TxStatus 200 132.35 TxStatus 304 44.17 TxStatus 404 34.63 TxStatus 302 30.87 TxStatus 301 9.36 TxStatus 403 1.39 TxStatus 503
Update 2010-02-19: Seems other people are also affected by the Varnish LINGER crash on OpenSolaris. This does not address the core problem but removes the “fail fast” behavior with no negative side effects.
r4576 has been running reliably with the fix below.
In varnishd/bin/cache_acceptor.c
setsockopt(sp->fd, SOL_SOCKET, SO_LINGER,
&linger, sizeof linger);
Remove TCP_assert line encapsulating setsockopt().
Update 2010-02-17: This might be a random fluke but Varnish has connection issues when compiled under SunCC, stick to GCC. I have compiled Varnish with GCC 4.3.2 and the build seems to work well. Give r4572 a try, phk commited some solaris aware errno code.
Update 2010-02-16: r4567 seems stable. Errno isn’t thread-safe by default on Solaris like other platforms, you need to define -pthreads for GCC and -mt for SunCC in both the compile and linking flags.
GCC example:
SunCC Example:
Here are the sources on how I pieced it all together: sun docs, stack overflow answer
See what -pthreads define on GCC
#define _REENTRANT 1
snippet from solaris’s /usr/include/errno.h to confirm that errno isn’t thread safe by default.
extern int *___errno();
#define errno (*(___errno()))
#else
extern int errno;
/* ANSI C++ requires that errno be a macro */
#if __cplusplus >= 199711L
#define errno errno
#endif
#endif /* defined(_REENTRANT) || defined(_TS_ERRNO) */
Update 2010-01-28: r4508 seems stable. No patches needed aside from removing an assert(AZ) in cache_acceptor.c on line 163.
Update 2010-01-21: If your using Varnish from trunk past r4445 apply this session cache_waiter_poll patch to avoid stalled connections.
Update 2009-21-12: Still using Varnish in production, the site is working beautifully with the settings below.
Update(new): I think I figured the last remaining piece of the puzzle. Switching Varnish’s default listener to poll fixed the long connection accept wait times.
Update: Monitor charts looked good, but persistent connections kept flaking under production traffic. I was forced to revert back to Squid 2.7. *Sigh* I think Squid might be the only viable option on Solaris when it comes to reverse proxy caching. The information below is useful if you still want to try out Varnish on Solaris.
I have finally wrangled Varnish to work reliably on Solaris without any apparent issues. The recent commit to trunk by phk(creator) fixed the last remaining Solaris issue that I am aware of.
There are three four requirements to get this working reliably on Solaris.
1. Run from trunk – r4508 is a known stable revision that works well. Remove the AZ() assert in cache_acceptor.c on line 163.
2. Set connect_timeout to 0, this is needed to work around a Varnish/Solaris TCP incompatibility that resides in lib/libvarnish/tcp.c#TCP_connect timeout code.
3. Switch the default waiter to poll. EventPorts seems bugged on OpenSolaris builds.
4. If you have issues starting Varnish, start Varnish in the foreground via -F argument.
Here is a Pingdom graph of our monitored service. Can you tell when Varnish was swapped in for Squid? Varnish does a better job of keeping content cached due to header normalization and larger cache size.

There are a few “gotchas” to look out for to get it all running reliably. Here is the configuration that I used in production. I have annotated each setting with a brief description.
newtask -p highfile /opt/extra/sbin/varnishd -f /opt/extra/etc/varnish/default.vcl -a 0.0.0.0:82 # IP/Port to listen on -p listen_depth=8192 # Connections kernel buffers before rejecting. -p waiter=poll # Listener implementation to use. -p thread_pool_max=2000 # Max threads per pool -p thread_pool_min=50 # Min Threads per pool, crank this high -p thread_pools=4 # Thread Pool per CPU -p thread_pool_add_delay=2ms # Thread init delay, not to bomb OS -p cc_command='cc -Kpic -G -m64 -o %o %s' # 64-Bit if needed -s file,/sessions/varnish_cache.bin,512M # Define cache size -p sess_timeout=10s # Keep-Alive timeout -p max_restarts=12 # Amount of restart attempts -p session_linger=120ms # Milliseconds to keep thread around -p connect_timeout=0s # Important bug work around for Solaris -p lru_interval=20s # LRU interval checks -p sess_workspace=65536 # Space for headers -T 0.0.0.0:8086 # Admin console -u webservd # User to run varnish as
System configuration Optimizations
Solaris lacks SO_{SND|RCV}TIMEO BSD socket flags. These flags are used to define TCP timeout values per socket. Every other OS has it Mac OS X, Linux, FreeBSD, AIX but not Solaris. Meaning Varnish is unable to make use of custom defined timeout values on Solaris. You can do the next best thing with Solaris; optimize the TCP timeouts globally.
# Turn off Nagle. Nagle Adds latency. /usr/sbin/ndd -set /dev/tcp tcp_naglim_def 1 # 30 second TIME_WAIT timeout. (4 minutes default) /usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 30000 # 15 min keep-alive (2 hour default) /usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 900000 # 120 sec connect time out , 3 min default ndd -set /dev/tcp tcp_ip_abort_cinterval 120000 # Send ACKs right away - less latency on bursty connections. ndd -set /dev/tcp tcp_deferred_acks_max 0 # RFC says 1 segment, BSD/Win stack requires 2 segments. /usr/sbin/ndd -set /dev/tcp tcp_slow_start_initial 2
Varnish Settings Dissected
Here are the most important settings to look out for when deploying Varnish in production.
File Descriptors
Run Varnish under a Solaris project that gives the proxy enough file descriptors to handle the concurrency. If Varnish can not allocate enough file descriptors, it can’t serve the requests.
# Paste into /etc/project # Run the Application newtask -p highfilehighfile:101::*:*:process.max-file-descriptor=(basic,32192,deny)
Threads
Give enough idle threads to Varnish so it does not stall on requests. Thread creation is slow and expensive, idle threads are not. Don’t go cheap with threads, allocate a minimum of 200. Modern browsers use 8 concurrent connections by default, meaning Varnish will need 8 threads to handle a single page view.
thread_pool_max=2000 # 2000 max threads per pool thread_pool_min=50 # 50 min threads per pool # 50 threads x 4 Pools = 200 threads thread_pools=4 # 4 Pools, Pool per CPU Core. session_linger=120ms # How long to keep a thread around # To handle further requests.
Squid is a fundamental part of our infrastructure at Fabulously40. It helps us lower response times quite considerably. The problem with Squid is that it is quite “dense” when it comes to configuration flexibility. Unless your willing to do a bit of C hacking on it, it does not have much configuration flexibility. This can be overcome by using supporting software to help squid out.
Note: Our configuration would be quite simplified if we used Varnish but it lacks some key features that make Squid a better candidate.
1. Varnish can’t stream cache-misses, it can only buffer. This adds latency to cache-miss requests.
2. Varnish is unable to avoid caching objects based on content-length size.
3. Varnish has an issue with connect_timeout and Solaris socket handling.
Until Varnish can handle the three things listed, Squid remains the best choice at the cost of configuration complexity.
Optimize Cache Hits by Normalizing Headers
“Accept-Encoding: gzip” and “Accept-Encoding: gzip/deflate” will be cached separately unless you normalize client headers. Squid has no configuration option to normalize headers like Varnish. However, you can use Nginx to normalize headers before passing off the request to Squid.
Here is the setup: Client -> Nginx -> Squid -> Backend Services
The NGINX Configuration
location / { proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header Host $http_host; proxy_hide_header Pragma; # Remove client cache-control header # to avoid fetching from backend if page is in cache proxy_set_header Cache-Control ""; # Normalize static assets for squid if ($request_uri ~* "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|ico|swf|mp4|flv|mov|dmg|mkv)") { set $normal_encoding ""; proxy_pass http://squids; break; } # Normalize gzip encoding if ($http_accept_encoding ~* gzip) { set $normal_encoding "gzip"; proxy_pass http://squids; break; } # Normalize deflate encoding if ($http_accept_encoding ~* deflate) { set $normal_encoding "deflate"; proxy_pass http://squids; break; } # Define the normalize header proxy_set_header Accept-Encoding $normal_encoding; # default... proxy_pass http://squids; break; }
So by the time Squid receives the request, the accept-encoding header is normalized, for efficient cache storage.
Avoid Caching Error Pages with Squid 2.7
Squid 2.7 has better support for reverse caching and HTTP 1.1 than the Squid 3.x branch. However, it missing one important ACL that Squid 3.x has but 2.7 does not; http_status. Squid 2.7 is unable to be configured to avoid caching pages based on status response codes from origin. I have written a Perlbal plugin CacheKill specifically to address this issue. Perlbal::Plugin::CacheKill sits between backend services and Squid and rewrites cache-control headers based on response code.
Here is the setup: Client -> Squid -> Perlbal -> Backend Services
If the backend service responds with a 501,502 or 503 http status code, Perlbal will append Cache-Control: no-cache header before giving back the response to Squid.
Here is the configuration file for Perlbal::Plugin::CacheKill to deny Squid from caching error pages.
CREATE SERVICE fab40 SET listen = 0.0.0.0:8003 SET role = reverse_proxy SET pool = backends cache_kill codes = 404,500,501,503,502 SET plugins = CacheKill ENABLE fab40
Stitching software together can make Squid just as flexible as Varnish with its VCL configuration.
Once again I have been blind sided by yet another conservative out-of-the-box setting. IPFilter is tuned way too conservative with it’s state table size.
Here is how you can tell if your hitting any issues, run ipfstat and check for lost packets.
victori@opensolaris:~# ipfstat | grep lost fragment state(in): kept 0 lost 0 not fragmented 0 fragment state(out): kept 0 lost 0 not fragmented 0 packet state(in): kept 798 lost 100 packet state(out): kept 612 lost 234
Notice that the in and out lost state lines have a non-zero value. This means IPFilter has been dropping client connections, bummer.
The default settings are quite conservative.
fr_statemax min 0×1 max 0x7fffffff current 4096
fr_statesize min 0×1 max 0x7fffffff current 5002
You need to shutdown IPFilter and apply larger table size limits.
victori@opensolaris:~# /usr/sbin/ipf -T fr_statemax=18963,fr_statesize=27091
Lets confirm that it works.
fr_statemax min 0×1 max 0x7fffffff current 18963
fr_statesize min 0×1 max 0x7fffffff current 27091
Awesome, now all we need to do is enable IPfilter and no more lost packets.
To make this persistent across reboots edit ipf.conf
name=”ipf” parent=”pseudo” instance=0 fr_statemax=18963 fr_statesize=27091;
Then update the contents
This can be applied to any OS that uses IPFilter.
Update: The following information could be beneficial to some, however my issues actually were with Caviar black drives shipping with TLER disabled. You need to pay Western Digital a premium for their “RAID” drives with TLER enabled. So for anyone reading this, avoid consumer Western Digital drives if you plan on using them for RAID.
zfs_vdev_max_pending
I can’t believe how long I have been tolerating horrible concurrent IO performance on OpenSolaris running ZFS. When I have any IO intensive writes happening the whole system slows down to a crawl for any further IO. Running “ls” on a uncached directory is just painful.
victori@opensolaris:/opt# iostat -xnz 1 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 87.0 0.0 2878.1 0.0 0.0 0.0 0.4 0 100 c4t0d0 0.0 83.0 0.0 2878.1 0.0 0.1 0.2 0.7 1 50 c4t1d0 1.0 0.0 28.0 0.0 0.0 0.0 0.0 5.4 0 1 c4t2d0
Notice c4t0d0 is blocking at 100%. If a disk is blocking at 100% good luck getting the disk to do any other operations such as reads.
SATA disks do Native Command Queuing while SAS disks do Tagged Command Queuing, this is an important distinction. Seems like OpenSolaris/Solaris is optimized for the latter with a 32 wide command queue set by default. This completely saturates the SATA disks with IO commands in turn making the system unusable for short periods of time.
Dynamically set the ZFS command queue to 1 to optimize for NCQ.
And add to /etc/system
Enjoy your OpenSolaris server on cheap SATA disks!
Recently a primary boot disk went bad on our server and I got blind sided by a non-bootable secondary mirror disk. All the data was intact but I could not boot it. This required a slow re-installation and migration process that took a very long time.
• ZFS attach automatically partitions the drive as EFI.
• ZFS send/recv transfers on gzip compressed data-slices is slow.
Here is the correct way of getting both the disks in the ZFS mirror to boot.
Plug the new drive into the server that you want to add to the ZFS mirror. If your hot swapping or adding a new drive while the server is still on, you need to use cfgadm to configure it.
Now that the drive is configured and seen by the server you need to repartition it with format so it can be used as a bootable device.
AVAILABLE DISK SELECTIONS:
0. c4t0d0
/pci@0,0/pci8086,346c@1f,2/disk@0,0
1. c4t1d0
/pci@0,0/pci8086,346c@1f,2/disk@1,0
2. c4t2d0
/pci@0,0/pci8086,346c@1f,2/disk@2,0
* select your new drive *
# fdisk
* use fdisk to remove the EFI partition and add a solaris2 partition. *
Select the partition type to create:
1=SOLARIS2 2=UNIX 3=PCIXOS 4=Other
5=DOS12 6=DOS16 7=DOSEXT 8=DOSBIG
9=DOS16LBA A=x86 Boot B=Diagnostic C=FAT32
D=FAT32LBA E=DOSEXTLBA F=EFI 0=Exit?
This step is very important, if you did not repartition your drive, zfs attach will default the drive back to an EFI partition table that is not bootable.
c4t0d0s2 — primary drive.
c4t1d0s2 — new drive that we are setting up.
You should now be able to attach the secondary drive to your mirror using the identical slice.
Once the mirror is done synchronizing you need to install the bootloader on the drive.
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)?y
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 267 sectors starting at 50 (abs 16115)
stage1 written to master boot sector
Trouble Shooting
raw device must be a root slice (not s2)
You did not re-partition the drive to a solaris2 partition. EFI partitions can’t be made bootable. Use the format tool to reconfigure the drive with a solaris2 partition.
cannot open/stat device /dev/rdsk/c1t0d0s0
You did not copy your label information from your primary to your secondary disk with prtvtoc and fmthard.
I am in the process of evaluating which option to choose for a new production deployment of a Sinatra application.
Pros and Cons of the implementations:
JRuby Stack:
Pros:
• Multi-threaded, easy to scale with spiked traffic / shared resources.
Cons:
• Single process is a single point of failure.
MRI Ruby Stack:
Pros:
• Scaled via processes, no single point of failure.
Cons:
• Single Process, no shared resources (Possibly using more memory over time).
These tests are run against a real-world application that is soon to be released, not some dummy “hello world” application.
Application Background:
Sinatra / HAML templates (not compiled, rendered per request) / CouchDB / R18N Translation
Server Specifications:
Hardware: 8Gig / Quad Core Xeon x5355
MRI Stack:
Ruby 1.8.7 (2008-08-11 patchlevel 72)
Nginx Passenger 2.2.4
Passenger Config: passenger_max_pool_size 8, passenger_use_global_queue on
Java Stack:
JRuby 1.3.1 (ruby 1.8.6p287) (2009-06-15 6586)
Jetty-6.1.15
JDK Flags: -server -Xverify:none -XX:MaxPermSize=96m -XX:+AggressiveOpts -Xss128k -Xms256m -Xmx384m -XX:+UseParallelGC -XX:+UseParallelOldGC
JDK 1.7.0 b67
Here are the results. I have taken the best time out of 10 runs, giving enough time for the JDK to warmup and passenger to load all the children. The results are clipped for brevity.
Benchmark command:
JRuby Results:
Time per request: 116.316 [ms] (mean)
Time taken for tests: 11.632 seconds
Memory Use After Test: 437M (RSS)
MRI Results:
Time per request: 84.142 [ms] (mean)
Time taken for tests: 8.414 seconds
Memory Use After Test: 264M (RSS)
Conclusions and final thoughts:
Seems like MRI Ruby has a 39% performance advantage on JRuby executing my application. I am still a bit skeptical if MRI Ruby would still win out in production when it turns into a long running process marathon with varied traffic patterns. At the end of the day the JVM currently has the edge in garbage collection on MRI Ruby, so in “theory” JRuby should be the better choice. This is all a hypothetical guesstimate[sic] on my behalf. I will most likely end up trying both variants in production and see which works best.
One of our articles on fabulously40 went viral on the tagged.com which is one of the largest social networks in the alexa top 100. The viral aspect was quite apparent when the bandwidth sky rocketed to 30Mb/sec of sustained traffic. We were pushing over 60gigs of image data per day! We broke our bandwidth quota in just 3 days. Granted, this was toward the end of the billing cycle for that month.
Facing overdraft charges for bandwidth I decided to re-encode the images with ImageMagick. I converted the images to a “lower quality” compression for a file savings of 71%. Swapping out the images with the newer smaller images dropped the bandwidth by 50% Awesome.
Spot the difference where I swapped out the images?


(5 votes, average: 4.80 out of 5)
(6 votes, average: 4.00 out of 5)