I just pushed a new version of Satan to GitHub. For the uninformed, Satan is my process reaper for runaway Unix processes. Satan was designed to work with Solaris’ SMF self-healing properties: Satan kills while SMF revives. The new version adds HTTP health checks, so Satan can now kill processes that do not respond with an HTTP 200 response code.
The motivation for HTTP health checks: once a month or so at Fabulously40, our ActiveMQ would break down while still accepting connections, and the only way to tell whether it was zombified was to check the HTTP administrator interface. If the ActiveMQ instance had actually keeled over, the administrator interface would come back with an HTTP 500 response code. Hence the birth of HTTP health checks.
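As a rough illustration of the idea (this is not Satan's actual implementation), an HTTP health check boils down to counting consecutive non-200 responses and giving up on the process once a threshold is reached. The class name `HttpCheck`, the timeouts and the injectable `fetch` callable are all assumptions made for this sketch; by default it probes with Ruby's standard `Net::HTTP`.

```ruby
require "net/http"
require "uri"

# Hypothetical sketch of an HTTP health check; not Satan's real code.
class HttpCheck
  attr_reader :failures

  # uri:   URL to probe
  # times: consecutive bad responses tolerated before we say "kill"
  # fetch: callable taking a URI and returning the status code as an Integer
  def initialize(uri, times: 5, fetch: nil)
    @uri = URI(uri)
    @times = times
    @fetch = fetch || method(:default_fetch)
    @failures = 0
  end

  # Probe once. Returns true when the process should be killed.
  def kill?
    code = begin
      @fetch.call(@uri)
    rescue StandardError
      nil # connection refused, timeouts, etc. count as a failed probe
    end
    # A healthy response resets the counter; anything else increments it.
    @failures = code == 200 ? 0 : @failures + 1
    @failures >= @times
  end

  private

  # Default probe: plain GET with short timeouts (values are assumptions).
  def default_fetch(uri)
    Net::HTTP.start(uri.host, uri.port, open_timeout: 5, read_timeout: 5) do |http|
      http.get(uri.request_uri).code.to_i
    end
  end
end
```

Resetting the counter on every healthy response means a single slow request does not add up over days into a kill; only an unbroken run of failures does.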
Here is our Satan configuration file that we use at Fabulously40.
The “args” property might be a bit confusing. It is a snippet of text that Satan looks for in the arguments passed to your application to identify the running process. For example, if you start your ActiveMQ instance with the arguments “java -jar activemq.jar -Dactivemq=8161 -XXXXX”, placing “8161” in the args property gives Satan a good unique identifier to pick up on.
    Satan.watch do |s|
      s.name = "jvm instances"          # name of job
      s.user = "webservd"               # under what user
      s.group = "webservd"              # under what group
      s.deamon = "java"                 # daemon binary name to grep for
      s.args = nil                      # globally look for specific arguments, optional
      s.debug = true                    # if to write out debug information
      s.safe_mode = false               # If in safe mode, satan will not kill ;-(
      s.interval = 10.seconds           # interval to run at to collect statistics
      s.sleep_after_kill = 1.minute     # sleep after killing, satan is tired!
      s.contact = "email@example.com"   # admin contact, optional if you want email alerts

      s.kill_if do |process|
        process.condition(:cpu) do |cpu|        # on cpu condition
          cpu.name = "50% CPU limit"            # name for job
          cpu.args = "jetty"                    # make sure this is a jetty process, optional
          cpu.above = 48.percent                # if above certain percentage
          cpu.times = 5                         # how many times we can hit this condition before killing
        end

        process.condition(:memory) do |memory|  # on memory condition
          memory.name = "850MB limit"           # name for job
          memory.args = "jetty"                 # make sure this is a jetty process, optional
          memory.above = 850.megabytes          # limit for memory use
          memory.times = 5                      # how many times we can hit this condition before killing
        end

        # ActiveMQ tends to die on us under heavy load so we need the power of satan!
        process.condition(:http) do |http|      # on http condition
          http.name = "HTTP ActiveMQ Check"     # name for job
          http.args = "8161"                    # look for specific app arguments
                                                # to associate app to URI
          http.uri = "http://localhost:8161/admin/queues.jsp" # the URI
          http.times = 5                        # how many times before kill
        end
      end
    end
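The “args” matching described above can be pictured as a plain substring test against the process command line. The helper below is a hypothetical sketch, not Satan's code; treating `nil` as "no filter" is an assumption based on `s.args = nil` being marked optional in the configuration.

```ruby
# Hypothetical helper: does a process command line contain the snippet?
def args_match?(command_line, snippet)
  # Assumption: nil means "no argument filter", mirroring s.args = nil.
  return true if snippet.nil?

  command_line.include?(snippet)
end

cmd = "java -jar activemq.jar -Dactivemq=8161 -XXXXX"
args_match?(cmd, "8161")  # => true
args_match?(cmd, "jetty") # => false
args_match?(cmd, nil)     # => true
```

Because this is only a substring test, the snippet needs to be something that no other process on the box carries in its arguments, which is why a port number like “8161” works well.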
I have finally ironed out all our issues surrounding Varnish on Solaris, thanks to the help of sky from #varnish. Apparently Varnish uses a wrapper around connect() to drop stale connections and avoid thread pileups if the back-end ever dies. Setting connect_timeout to 0 forces Varnish to use connect() directly. This should eliminate all the 503 back-end issues under Solaris that I mentioned in an earlier blog post.
Here is the startup script for Varnish that works for our needs. Varnish is a 64-bit binary, hence the “-m64” flag in the cc_command we pass.
    newtask -p highfile /opt/extra/sbin/varnishd \
      -f /opt/extra/etc/varnish/default.vcl \
      -a 220.127.116.11:80 \
      -p listen_depth=8192 \
      -p thread_pool_max=2000 \
      -p thread_pool_min=12 \
      -p thread_pools=4 \
      -p cc_command='cc -Kpic -G -m64 -o %o %s' \
      -s file,/sessions/varnish_cache.bin,4G \
      -p sess_timeout=10s \
      -p max_restarts=12 \
      -p session_linger=50s \
      -p connect_timeout=0s \
      -p obj_workspace=16384 \
      -p sess_workspace=32768 \
      -T 0.0.0.0:8086 \
      -u webservd \
      -F
I noticed Varnish had a particular problem of keeping connections around in the CLOSE_WAIT state for a long time, long enough to cause issues. I tuned Solaris’s TCP stack to be more aggressive about closing sockets once the work is done.
Here are my aggressive TCP settings that force Solaris to close off connections quickly and avoid file descriptor leaks. You can merge these TCP tweaks with the settings I posted earlier to handle more clients.
/usr/sbin/ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500
# 30 seconds, aggressively close connections – default 4 minutes on solaris < 8
/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 30000
# 1 minute, poll for dead connection - default 2 hours
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 60000
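The values above are in milliseconds, which is easy to get wrong by a factor of ten. As a small sketch, the same settings can be kept in seconds and turned into ndd commands; the constant and helper names here are made up for the example, and the helper only builds the command strings without running anything.

```ruby
# Durations in seconds for the three tunables used above.
TUNABLES_SECONDS = {
  "tcp_fin_wait_2_flush_interval" => 67.5,
  "tcp_time_wait_interval"        => 30,
  "tcp_keepalive_interval"        => 60
}.freeze

# Hypothetical helper: build (not execute) one ndd command per tunable,
# converting seconds to the milliseconds these tunables expect.
def ndd_commands(tunables)
  tunables.map do |name, seconds|
    "/usr/sbin/ndd -set /dev/tcp #{name} #{(seconds * 1000).round}"
  end
end
```

Calling `puts ndd_commands(TUNABLES_SECONDS)` prints the same three commands as listed above.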
Last but not least, I have finally swapped out ActiveMQ for the FUSE message broker, an “enterprise” ActiveMQ distribution. Hopefully it won’t crash once a week like ActiveMQ does for us. The FUSE message broker is based on ActiveMQ 5.3 sources, which fix various memory leaks found in ActiveMQ 5.2, the current stable release as of this writing.
If the FUSE message broker does not work out, I might have to give Kestrel a try. Hey, if it worked for Twitter, it should work for us… right?
Since Wicket sessions can vary greatly in size, using the standard memcached server implementation became impractical due to its slab allocator.
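To make one cost of slab allocation concrete: memory is carved into classes of fixed chunk sizes that grow geometrically, and every item occupies the smallest chunk that fits, so the leftover bytes in each chunk are wasted. The sketch below is a simplified model, not memcached's actual allocator. The growth factor of 1.25 matches memcached's default `-f` setting, while the 96-byte base chunk and the 1 MB cap are illustrative assumptions.

```ruby
# Simplified model of slab classes: chunk sizes grow geometrically.
# base and max are illustrative assumptions; real allocators also align sizes.
def slab_classes(base: 96, factor: 1.25, max: 1024 * 1024)
  sizes = []
  size = base
  while size < max
    sizes << size
    size = (size * factor).ceil
  end
  sizes << max # the largest class is capped at max
end

# Bytes left unused when an item is stored in the smallest chunk that fits.
def wasted_bytes(item_size, classes)
  chunk = classes.find { |c| c >= item_size }
  raise ArgumentError, "item larger than the biggest chunk" unless chunk

  chunk - item_size
end
```

In this model, an item that is one byte larger than a chunk size spills into the next class, and with a 1.25 factor that can leave close to a fifth of the chunk unused. Sessions of wildly different sizes also spread across many classes instead of filling a few of them evenly.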
The current code on GitHub lacks the ehcache store and an Actor IoHandler adapter. The internal SMemcached application at Fabulously40 uses a private caching API so we can hook up various caching backend storage implementations such as MySQL, PostgreSQL, ehcache or even another memcached server. You can grab the TCache project on GitHub that SMemcached uses to unify caching under a single API. This gives SMemcached a lot of flexibility when it comes to caching your data.
FYI: TCache stands for “Tanek” Cache; Tanek means cache in Russian.
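The idea of one caching API in front of interchangeable backends can be sketched as follows. This is a hypothetical Ruby illustration only: none of the class or method names come from TCache, whose real API is not shown in this post. Any backend (a database table, ehcache, another memcached server) would just need to provide the same three methods as the in-memory store here.

```ruby
# Hypothetical in-memory backend; any object with get/set/delete would do.
class MemoryStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
  end

  def delete(key)
    @data.delete(key)
  end
end

# Hypothetical front API: callers never touch the backend directly.
class Cache
  # store: any object responding to get, set and delete
  def initialize(store)
    @store = store
  end

  # Return the cached value, or compute, store and return it on a miss.
  def fetch(key)
    value = @store.get(key)
    return value unless value.nil?

    value = yield
    @store.set(key, value)
    value
  end

  def delete(key)
    @store.delete(key)
  end
end
```

Swapping storage then means passing a different store object to `Cache.new`; the calling code stays the same.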
The project works quite well. I originally wrote not to use it in production just yet, since there is no data expiration for cached data in the HashMap storage implementation and this was only a technical preview. Correction: do use it in production; this is what we use at Fabulously40.
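For readers wondering what data expiration in a map-backed store involves, here is a minimal sketch. It is not SMemcached's implementation; the class name, the per-entry `ttl:` argument and the injectable clock are assumptions made for the example. Each entry remembers its deadline and is evicted lazily the next time it is read.

```ruby
# Hypothetical map-backed store with per-entry expiration (TTL in seconds).
class ExpiringStore
  # clock: callable returning the current time in seconds (injectable for tests)
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @data = {}
  end

  # Store the value together with the moment it stops being valid.
  def set(key, value, ttl:)
    @data[key] = [value, @clock.call + ttl]
  end

  # Return the value, or nil if it is missing or has expired.
  def get(key)
    value, expires_at = @data[key]
    return nil if expires_at.nil?

    if @clock.call >= expires_at
      @data.delete(key) # lazily evict on read
      return nil
    end
    value
  end
end
```

Lazy eviction keeps the code short, but entries that are never read again stay in memory, so a real store would also need a periodic sweep or a size bound.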