Varnish on Solaris
Update 2010-02-19: It seems other people are also affected by the Varnish LINGER crash on OpenSolaris. The fix below does not address the core problem, but it removes the “fail fast” behavior with no negative side effects.
r4576 has been running reliably with the fix below.
In bin/varnishd/cache_acceptor.c, remove the TCP_Assert() wrapper around the SO_LINGER setsockopt() call, leaving the bare call:

if (need_linger)
	setsockopt(sp->fd, SOL_SOCKET, SO_LINGER, &linger, sizeof linger);

With the assert gone, a failed setsockopt() is simply ignored instead of killing the child.
Update 2010-02-17: This might be a random fluke, but Varnish has connection issues when compiled under SunCC, so stick to GCC. I have compiled Varnish with GCC 4.3.2 and the build seems to work well. Give r4572 a try; phk committed some Solaris-aware errno code.
Update 2010-02-16: r4567 seems stable. Unlike on other platforms, errno isn’t thread-safe by default on Solaris: you need to pass -pthreads for GCC and -mt for SunCC, in both the compile and the link flags.
GCC example:
VCC_CC="cc -Kpic -G -m64 -o %o %s" CC=/opt/sfw/bin/gcc CFLAGS="-O3 -L/opt/extra/lib/amd64 -pthreads -m64 -fomit-frame-pointer" LDFLAGS="-lumem -pthreads" ./configure --prefix=/opt/extra
SunCC example:
VCC_CC="cc -Kpic -G -m64 -o %o %s" CC=/opt/SSX0903/bin/cc CFLAGS="-xO3 -fast -xipo -L/opt/extra/lib/amd64 -mt -m64" LDFLAGS="-lumem -mt" ./configure --prefix=/opt/extra
Here are the sources from which I pieced it all together: Sun docs and a Stack Overflow answer.
See what -pthreads defines on GCC:
# gcc -x c /dev/null -E -dM -pthreads | grep REENT
#define _REENTRANT 1
A snippet from Solaris’s /usr/include/errno.h confirms that errno isn’t thread-safe by default:
#if defined(_REENTRANT) || defined(_TS_ERRNO) || _POSIX_C_SOURCE - 0 >= 199506L
extern int *___errno();
#define errno (*(___errno()))
#else
extern int errno;
/* ANSI C++ requires that errno be a macro */
#if __cplusplus >= 199711L
#define errno errno
#endif
#endif /* defined(_REENTRANT) || defined(_TS_ERRNO) */
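To see the effect at runtime, here is a tiny test of my own (not from the Varnish tree); build it with and without -pthreads:

/* errno_mode.c - report which errno the compile flags give you.
 * gcc errno_mode.c && ./a.out            -> shared global errno
 * gcc -pthreads errno_mode.c && ./a.out  -> per-thread errno */
#include <errno.h>
#include <stdio.h>

int
main(void)
{
#if defined(_REENTRANT) || defined(_TS_ERRNO) || _POSIX_C_SOURCE - 0 >= 199506L
	printf("errno expands to (*___errno()): per-thread, thread-safe\n");
#else
	printf("errno is a plain global int: not thread-safe\n");
#endif
	return (0);
}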
Update 2010-01-28: r4508 seems stable. No patches needed aside from removing the AZ() assert in cache_acceptor.c on line 163.
Update 2010-01-21: If you’re using Varnish from trunk past r4445, apply this session cache_waiter_poll patch to avoid stalled connections.
Update 2009-12-21: Still using Varnish in production; the site is working beautifully with the settings below.
Update (new): I think I have figured out the last remaining piece of the puzzle. Switching Varnish’s default waiter to poll fixed the long connection-accept wait times.
Update: The monitoring charts looked good, but persistent connections kept flaking under production traffic, and I was forced to revert to Squid 2.7. *Sigh* I think Squid might be the only viable option for reverse proxy caching on Solaris. The information below is still useful if you want to try out Varnish on Solaris.
I have finally wrangled Varnish into working reliably on Solaris without any apparent issues. The recent commit to trunk by phk (Varnish’s creator) fixed the last remaining Solaris issue that I am aware of.
There are four requirements to get this working reliably on Solaris.
1. Run from trunk. r4508 is a known stable revision that works well; remove the AZ() assert in cache_acceptor.c on line 163.
2. Set connect_timeout to 0. This works around a Varnish/Solaris TCP incompatibility in the timeout code in lib/libvarnish/tcp.c#TCP_connect (see the sketch after this list).
3. Switch the default waiter to poll. Event ports appear to be buggy on OpenSolaris builds.
4. If you have issues starting Varnish, start Varnish in the foreground via the -F argument.
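For context on requirement 2: the timeout path is the usual non-blocking connect dance. Here is my own sketch of that standard pattern, not the actual Varnish source; with connect_timeout=0 a plain blocking connect() can be used instead, sidestepping the Solaris quirk.

/* Sketch of a non-blocking connect with timeout, roughly the
 * pattern TCP_connect-style code uses when the timeout is > 0. */
#include <sys/socket.h>
#include <poll.h>
#include <fcntl.h>
#include <errno.h>

static int
connect_tmo(int s, const struct sockaddr *name, socklen_t namelen, int msec)
{
	int flags, k;
	socklen_t l;
	struct pollfd fds[1];

	if (msec <= 0)
		return (connect(s, name, namelen));	/* blocking path */

	flags = fcntl(s, F_GETFL, 0);
	(void)fcntl(s, F_SETFL, flags | O_NONBLOCK);
	if (connect(s, name, namelen) != 0) {
		if (errno != EINPROGRESS)
			return (-1);
		fds[0].fd = s;
		fds[0].events = POLLOUT;	/* writable => connected */
		fds[0].revents = 0;
		if (poll(fds, 1, msec) != 1) {
			errno = ETIMEDOUT;
			return (-1);
		}
		l = sizeof k;
		(void)getsockopt(s, SOL_SOCKET, SO_ERROR, &k, &l);
		if (k != 0) {
			errno = k;
			return (-1);
		}
	}
	(void)fcntl(s, F_SETFL, flags);		/* restore blocking mode */
	return (0);
}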
Here is a Pingdom graph of our monitored service. Can you tell when Varnish was swapped in for Squid? Varnish does a better job of keeping content cached due to header normalization and larger cache size.
There are a few “gotchas” to look out for to get it all running reliably. Here is the configuration I used in production, with each setting annotated with a brief description.
newtask -p highfile /opt/extra/sbin/varnishd
-f /opt/extra/etc/varnish/default.vcl
-a 0.0.0.0:82 # IP/port to listen on
-p listen_depth=8192 # Connections the kernel queues before rejecting
-p waiter=poll # Listener implementation to use
-p thread_pool_max=2000 # Max threads per pool
-p thread_pool_min=50 # Min threads per pool; crank this high
-p thread_pools=4 # One thread pool per CPU core
-p thread_pool_add_delay=2ms # Delay between thread creations, so as not to bomb the OS
-p cc_command='cc -Kpic -G -m64 -o %o %s' # 64-bit if needed
-s file,/sessions/varnish_cache.bin,512M # Cache file and size
-p sess_timeout=10s # Keep-alive timeout
-p max_restarts=12 # Number of restart attempts
-p session_linger=120ms # Milliseconds to keep a thread around for further requests
-p connect_timeout=0s # Important workaround for a Solaris bug
-p lru_interval=20s # Interval between LRU checks
-p sess_workspace=65536 # Workspace for headers
-T 0.0.0.0:8086 # Admin console
-u webservd # User to run Varnish as
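Once Varnish is up, you can read the effective parameters back through the admin console bound with -T above (port 8086 here):

varnishadm -T localhost:8086 param.show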
System Configuration Optimizations
Solaris lacks the SO_{SND|RCV}TIMEO BSD socket options, which set TCP timeout values per socket. Every other major OS has them (Mac OS X, Linux, FreeBSD, AIX) but not Solaris, meaning Varnish is unable to apply its custom timeout values per socket there. The next best thing on Solaris is to optimize the TCP timeouts globally.
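For illustration, here is the kind of per-socket timeout code that works on those platforms. This is my own sketch, not Varnish source; on the Solaris releases of this era the setsockopt() fails with ENOPROTOOPT, so the global tunables below are the substitute.

/* Sketch: per-socket receive timeout via SO_RCVTIMEO, as supported
 * on Linux/BSD/etc. but not on this-era Solaris. */
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

static int
set_rcv_timeout(int s, long seconds)
{
	struct timeval tv;

	tv.tv_sec = seconds;
	tv.tv_usec = 0;
	if (setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) != 0) {
		fprintf(stderr, "SO_RCVTIMEO: %s\n", strerror(errno));
		return (-1);	/* ENOPROTOOPT on Solaris */
	}
	return (0);
}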
# Turn off Nagle. Nagle adds latency.
/usr/sbin/ndd -set /dev/tcp tcp_naglim_def 1
# 30-second TIME_WAIT timeout (default: 4 minutes).
/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 30000
# 15-minute keep-alive (default: 2 hours).
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 900000
# 120-second connect timeout (default: 3 minutes).
/usr/sbin/ndd -set /dev/tcp tcp_ip_abort_cinterval 120000
# Send ACKs right away - less latency on bursty connections.
/usr/sbin/ndd -set /dev/tcp tcp_deferred_acks_max 0
# RFC says 1 segment; BSD/Windows stacks require 2 segments.
/usr/sbin/ndd -set /dev/tcp tcp_slow_start_initial 2
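These take effect immediately but do not survive a reboot, so add them to a boot script. Any value can be read back with -get:

/usr/sbin/ndd -get /dev/tcp tcp_naglim_def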
Varnish Settings Dissected
Here are the most important settings to look out for when deploying Varnish in production.
File Descriptors
Run Varnish under a Solaris project that gives the proxy enough file descriptors to handle the concurrency. If Varnish cannot allocate enough file descriptors, it cannot serve requests.
# Paste into /etc/project
# Run the application with: newtask -p highfile
highfile:101::*:*:process.max-file-descriptor=(basic,32192,deny)
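To confirm the cap actually applies inside the task, here is a small check of my own (hypothetical fdlimit.c, not part of Varnish):

/* fdlimit.c - print the file descriptor limit; run it as
 * "newtask -p highfile ./fdlimit" to confirm the project cap. */
#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
		perror("getrlimit");
		return (1);
	}
	printf("soft: %lu hard: %lu\n",
	    (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
	return (0);
}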
Threads
Give Varnish enough idle threads so it does not stall on requests. Thread creation is slow and expensive; idle threads are not. Don’t go cheap with threads: allocate a minimum of 200. Modern browsers open 8 concurrent connections by default, meaning Varnish will need 8 threads to handle a single page view.
thread_pool_max=2000 # Max 2000 threads per pool
thread_pool_min=50 # Min 50 threads per pool (50 threads x 4 pools = 200 threads)
thread_pools=4 # 4 pools, one per CPU core
session_linger=120ms # How long to keep a thread around to handle further requests
Comments

FYI, the projects file is /etc/project, not /etc/projects. Thanks for writing this. Following your notes I was able to get Varnish running under Solaris 10.
Corrected, thank you. You might also want to check out the patches recently submitted by slink that improve time-handling performance on (Open)Solaris. TIM_mono() is called 4 times per request; with slink’s patch Varnish avoids a syscall every time it needs access to temporal information.
http://varnish.projects.linpro.no/ticket/609
Hi Varnish on Solaris friends,
here’s another patch to exploit one of the best security features on Solaris, least privileges. Please let me know if it works for you:
http://varnish.projects.linpro.no/ticket/628
And with this one, event ports should work. That’s all for tonight.
http://varnish.projects.linpro.no/ticket/629
Awesome! Coming from the Ubuntu world, I had to do some very basic setup to get a good build environment going on OpenSolaris. Documented that here:
http://www.varnish-cache.org/wiki/Installation#OpenSolaris
It would be good if you could link to ticket #632 (http://varnish-cache.org/ticket/632) after “4. If you have issues starting Varnish, start Varnish in the foreground via the -F argument.” It’s quite a serious bug; it should *always* be run in the foreground.
So what’s your latest status? 🙂
I’ve been running with Varnish in production for a few days now, and while I no longer see 503s, the process crashes 1-3 times a day.
I’m on unpatched Varnish 2.1.2, compiled with the Sun Studio compiler and configured with the settings you mentioned in your blog post.
When the process does crash, there is no info in the syslog about what happened.
We needed to install Varnish on a T2000 (SPARC, Solaris 10).
Varnish 3.0.1 gives a bus fault at:
VSC_C_main->shm_cycles++;
in:
bin/varnishd/cache_shmlog.c
This behavior is seen when running varnishd in debug mode and “start”ing a child; it dies immediately.
GCC also seemed incapable of dealing with Varnish, so SUNWspro was necessary. We read that Varnish NEEDS 64-bit to get enough address space, but the default libpcre from sunfreeware is 32-bit.
So the answer was to compile PCRE off to the side (without blowing away the existing one; the GNU utilities use it) and then compile Varnish 2.1.5, forcing 64-bit along the way:
tar xvf pcre-8.13.tar
cd pcre-8.13
./configure --disable-cpp CFLAGS="-g -O3" CC="/usr/bin/cc -m64" --enable-utf8 --enable-unicode-properties --prefix=/usr/local/pcre/
make
su -
make install
tar xvf varnish-2.1.5.tar
cd varnish-2.1.5
VCC_CC="cc -Kpic -G -m64 -o %o %s" CC=/usr/bin/cc CXX='CC' CXXFLAGS='-g0 -m64' CFLAGS="-g -fast -L/usr/local/lib/sparcv9 -mt -m64" LDFLAGS="-lumem -mt -m64" LD_LIBRARY_PATH=/usr/local/pcre/lib/ CPPFLAGS=" -I/usr/sfw/include" PCRE_LIBS=" -L/usr/local/pcre/lib -lpcre " PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./configure --prefix=/opt/varnish/
make
su -
make install