Letsgetdugg

Random tech jargon

Browsing the 2009 April archive

There seems to be some interest in a Catalyst vs Rails vs Django benchmark. The previous benchmark is quite old; it was done in 2007, and a lot has changed since then. I am re-running the numbers to see what has changed. This time around the hardware is faster and the benchmark is slightly simpler: I am just stress testing the controller response performance of the two frameworks, Catalyst and Rails.

Benchmark System:
Quad Core Xeon x5355 @ 2.66GHz, 8 Gigs RAM, OpenSolaris SNV98

Quick Summary:
Catalyst 5.8/Perl 5.10: 611.78req/sec (Single Process,bsdmalloc)
Catalyst 5.8/Perl 5.10: 1485.53req/sec (Multi Process,bsdmalloc)
Rails 2.3.2/MRI Ruby 1.8.7: 259.93req/sec (Single Process,bsdmalloc)
Rails 2.3.2/JRuby 1.3-dev: 311.71req/sec (Single-Threaded,bsdmalloc)
Rails 2.3.2/JRuby 1.3-dev: 992.32req/sec (Multi-Threaded,libumem)
Rails 2.3.2/MRI Ruby 1.9.1: 603.92req/sec (Single Process,bsdmalloc)


Catalyst 5.8 / Perl 5.10
Compiled: SUNCC -xO5 -xipo -fast -xtarget=native

# ab -n1000 -c100 http://somedomain:3000/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests

Server Software:
Server Hostname:        somedomain
Server Port:            3000

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      100
Time taken for tests:   0.673159 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      159000 bytes
HTML transferred:       11000 bytes
Requests per second:    1485.53 [#/sec] (mean)
Time per request:       67.316 [ms] (mean)
Time per request:       0.673 [ms] (mean, across all concurrent requests)
Transfer rate:          230.26 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   2.3      0      35
Processing:    15   62  10.8     62     103
Waiting:       15   59  11.6     61      98
Total:         15   62  10.6     63     103

Percentage of the requests served within a certain time (ms)
  50%     63
  66%     66
  75%     69
  80%     71
  90%     74
  95%     76
  98%     86
  99%     91
 100%    103 (longest request)

Rails 2.3.2 / Ruby 1.8.7
Compiled: SUNCC -xO5 -xipo -fast -xtarget=native

# ab -n1000 -c100 http://somedomain:3000/main/index
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking somedomain (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Mongrel
Server Hostname:        somedomain
Server Port:            3000

Document Path:          /main/index
Document Length:        11 bytes

Concurrency Level:      100
Time taken for tests:   3.847 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      290003 bytes
HTML transferred:       11000 bytes
Requests per second:    259.93 [#/sec] (mean)
Time per request:       384.718 [ms] (mean)
Time per request:       3.847 [ms] (mean, across all concurrent requests)
Transfer rate:          73.61 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.1      0      16
Processing:    10  369  65.2    389     428
Waiting:        9  368  65.3    388     427
Total:         10  369  65.2    390     428

Percentage of the requests served within a certain time (ms)
  50%    390
  66%    396
  75%    398
  80%    400
  90%    404
  95%    407
  98%    413
  99%    417
 100%    428 (longest request)

Rails 2.3.2 / JRuby 1.3-dev build 6586 (Multi-Threaded), libumem
Platform: JDK7 B56

# ab -n1000 -c100 http://somedomain.com:3000/main/index
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking somedomain.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:
Server Hostname:        somedomain.com
Server Port:            3000

Document Path:          /main/index
Document Length:        11 bytes

Concurrency Level:      100
Time taken for tests:   1.008 seconds
Complete requests:      1000
Failed requests:        1
   (Connect: 0, Receive: 0, Length: 1, Exceptions: 0)
Write errors:           0
Non-2xx responses:      1
Total transferred:      253875 bytes
HTML transferred:       11936 bytes
Requests per second:    992.32 [#/sec] (mean)
Time per request:       100.773 [ms] (mean)
Time per request:       1.008 [ms] (mean, across all concurrent requests)
Transfer rate:          246.02 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   1.7      0      13
Processing:    10   79  18.8     80     122
Waiting:        9   79  18.8     79     122
Total:         10   80  18.9     80     122

Percentage of the requests served within a certain time (ms)
  50%     80
  66%     88
  75%     94
  80%     98
  90%    102
  95%    108
  98%    113
  99%    114
 100%    122 (longest request)

Rails 2.3.2 / JRuby 1.3-dev build 6586 (Single-Threaded)
Platform: JDK7 B56

# ab -n1000 -c100 http://somedomain:3000/main/index
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking somedomain (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:
Server Hostname:        somedomain
Server Port:            3000

Document Path:          /main/index
Document Length:        11 bytes

Concurrency Level:      100
Time taken for tests:   3.208 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      253000 bytes
HTML transferred:       11000 bytes
Requests per second:    311.71 [#/sec] (mean)
Time per request:       320.810 [ms] (mean)
Time per request:       3.208 [ms] (mean, across all concurrent requests)
Transfer rate:          77.01 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   2.5      0      31
Processing:    37  304  56.4    318     350
Waiting:       36  304  56.5    318     349
Total:         37  305  56.5    318     352

Percentage of the requests served within a certain time (ms)
  50%    318
  66%    326
  75%    330
  80%    332
  90%    336
  95%    341
  98%    345
  99%    348
 100%    352 (longest request)

Rails 2.3.2 / Ruby 1.9.1
Compiled: GCC -O3 -fomit-frame-pointer (SunCC failed to compile)

# ab -n1000 -c100 http://somedomain:3000/main/index
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking somedomain (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        thin
Server Hostname:        somedomain
Server Port:            3000

Document Path:          /main/index
Document Length:        11 bytes

Concurrency Level:      100
Time taken for tests:   1.656 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      267001 bytes
HTML transferred:       11000 bytes
Requests per second:    603.92 [#/sec] (mean)
Time per request:       165.585 [ms] (mean)
Time per request:       1.656 [ms] (mean, across all concurrent requests)
Transfer rate:          157.47 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       8
Processing:    28  160  39.0    187     222
Waiting:       13  145  38.5    143     209
Total:         28  160  39.0    187     222

Percentage of the requests served within a certain time (ms)
  50%    187
  66%    190
  75%    191
  80%    191
  90%    196
  95%    201
  98%    221
  99%    221
 100%    222 (longest request)

Conclusion

Catalyst seems to have the edge in controller performance compared to Rails on MRI Ruby 1.8.7. Catalyst’s controller processing is 135% faster than Rails in single-process performance and 471% faster as a forking multi-process server. It is nice to see that the Catalyst team addressed the controller performance shortcomings of earlier versions of Catalyst. Like any benchmark, take it with a grain of salt. In a real application your data access layer will most likely be the bottleneck.

Rails 2.3.2 under JRuby with threading enabled ran about 282% faster than under MRI Ruby 1.8.7. I am anxiously waiting for JDK7 B57 with invokedynamic support, which should help push JRuby’s performance even further. I guess I know which deployment option I will choose when deploying Rails.
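
The percentages quoted in this conclusion can be reproduced from the requests-per-second figures in the quick summary. Here is a minimal sketch of the arithmetic, assuming "X% faster" means (rate / baseline - 1) * 100. The helper name percent_faster is made up for illustration.

```ruby
# Hypothetical helper: how much faster `rate` is than `baseline`, in percent.
def percent_faster(rate, baseline)
  (rate.to_f / baseline - 1.0) * 100.0
end

# Requests per second, copied from the quick summary.
RAILS_MRI_187   = 259.93
CATALYST_SINGLE = 611.78
CATALYST_MULTI  = 1485.53
RAILS_JRUBY_MT  = 992.32

printf("Catalyst single process vs Rails/MRI 1.8.7: %.1f%% faster\n",
       percent_faster(CATALYST_SINGLE, RAILS_MRI_187))
printf("Catalyst multi process vs Rails/MRI 1.8.7:  %.1f%% faster\n",
       percent_faster(CATALYST_MULTI, RAILS_MRI_187))
printf("Rails/JRuby threaded vs Rails/MRI 1.8.7:    %.1f%% faster\n",
       percent_faster(RAILS_JRUBY_MT, RAILS_MRI_187))
```

Swapping in your own ab results for the constants gives the same comparison for a different machine.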

Pick your poison; both frameworks provide excellent controller response performance. Keep in mind that scaling is all about architecture, not how fast your controller’s responses are. That said, having an efficient framework does help ;-)


I recently blogged about swapping malloc implementations for the JVM to help boost multi-threaded performance. Well, there is yet another malloc implementation that Solaris ships with, one optimized for single-threaded performance: bsdmalloc. I just switched our Perl interpreter to use bsdmalloc and got 33% faster performance with our Perlbal proxy.

You can try out multiple malloc implementations by setting the LD_PRELOAD environment variable.

LD_PRELOAD="/usr/lib/libbsdmalloc.so" perl somecode.pl

So here is the rule of thumb for which malloc implementation to use for your application.

libumem = For multithreaded applications. umem avoids thread heap contention and is highly optimized for multi-threaded applications.

bsdmalloc = For single threaded applications. PHP/Perl/Python and Ruby will fall into this category.

Applying the right malloc implementation to your resource-intensive application can yield a nice performance benefit.
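
The rule of thumb could be wrapped in a small launcher script. This is only a sketch: the helper names are invented here, the library paths are the ones used in this blog's examples and are assumed to exist on your system, and it relies on Ruby 1.9 or later accepting an environment hash as the first argument to system.

```ruby
# Library paths as used in this blog's examples; adjust for your system.
MALLOC_LIBS = {
  :multi_threaded  => "/usr/lib/libumem.so",
  :single_threaded => "/usr/lib/libbsdmalloc.so"
}

# Hypothetical helper: environment hash that preloads the chosen malloc.
def malloc_env(kind)
  lib = MALLOC_LIBS.fetch(kind) do
    raise ArgumentError, "unknown application kind: #{kind.inspect}"
  end
  { "LD_PRELOAD" => lib }
end

# Hypothetical helper: run a command with the chosen malloc preloaded.
# Returns whatever Kernel#system returns (true on a zero exit status).
def run_with_malloc(kind, *command)
  system(malloc_env(kind), *command)
end

# Example (not executed here):
#   run_with_malloc(:single_threaded, "perl", "somecode.pl")
```

Keeping the choice in one table makes it easy to benchmark the same command under each allocator and compare.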


I wrote a quick micro benchmark to test out Ruby threads. Apparently Ruby can’t make use of multiple CPUs with its threading implementation. I guess you have to resort to forking to scale up to multiple CPU cores while using MRI Ruby. However, there is an alternative solution: just use JRuby. JRuby utilizes all cores when running the benchmark.

Ruby 1.8.7, compiled with SunCC SSX0903 (-xO5 -fast -xipo)
Total number of insane floating point divisions in 10 seconds is 5969107

Ruby 1.9.1, compiled with GCC 4.3.2 (-O3 -fomit-frame-pointer)
Total number of insane floating point divisions in 10 seconds is 8596894

JRuby 1.3-dev / JDK7 b56
Ran as: jruby --fast cpuMax.rb
Total number of insane floating point divisions in 10 seconds is 15915896
167% increase in performance over Ruby 1.8.7

JRuby 1.3-dev / JDK7 b56
Ran as: jruby --fast -J-Djruby.compile.mode=JIT -J-Djruby.jit.threshold=0 -J-server cpuMax.rb
Total number of insane floating point divisions in 10 seconds is 28334441
374% increase in performance over Ruby 1.8.7

Looking at mpstat, I can see the MRI Ruby implementation is not utilizing all 4 cores.

Ruby 1.8.7 MRI

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  828   0   13   428  204  487   26   93   16    0  1903    4   5   0  91
  1 2682   0    3    39    2  280   32   81   12    2  1189   13   2   0  85
  2 1902   0    0    34   11  259   16   57   13    0  1094   11   3   0  86
  3 1017   0    3   192  150  111   34   38    8    0   676   92   2   0   6
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0    5   422  205  388   11   75    5    0   627    4   1   0  95
  1  161   0   13    20    2  196   11   61    8    0  1405    2   2   0  96
  2  292   0    6    32   15  272   10   57    7    0   700    2   1   0  97
  3    0   0    0   108   65   74   35   28    3    0   346   99   0   0   1

Now here is the JRuby implementation.

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  229   0  202   755  193 1977  365  329   94    1  2869   90   2   0   8
  1  328   0   86   371    1 1817  226  303  125    0  2809   86   2   0  12
  2  294   0  128   326    0 1771  248  287  109    0  2290   88   2   0  10
  3  320   0  172   402   62 1848  246  241  116    0  2238   86   3   0  11
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  317   0  297   700  192 2047  323  341  136    1  2819   89   2   0   9
  1  320   0   61   279    2 1611  134  195  130    0  1960   85   2   0  13
  2  288   0  235   379    0 1941  291  299  115    0  2462   87   2   0  11
  3  308   0   78   316   43 1688  142  159  104    0  1706   85   2   0  13

I think I will stick to JRuby for production use.

#!/usr/bin/ruby

require 'thread'

threads = []
counter = 0
mutex = Mutex.new

# Start four threads; each counts how many divisions it completes in 10 seconds.
4.times do
  threads << Thread.new {
    x = 0
    y = 0
    time = Time.new

    while true
      break if Time.new - time >= 10
      x = 1.00 / 24000000000.001
      y += 1
    end

    # Add this thread's count to the shared total under the mutex.
    mutex.synchronize { counter += y }
  }
end
threads.each { |t| t.join }

puts "Total number of insane floating point divisions in 10 seconds is " + counter.to_s
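
For comparison, here is a rough sketch of the forking approach mentioned at the top of this post, for anyone staying on MRI. It is not the code behind the numbers in this post: the helper names are invented, it assumes a Unix-like system where fork is available (so it will not run on JRuby), and each child reports its count back to the parent over a pipe.

```ruby
# Count how many floating point divisions complete within `seconds`.
def count_divisions(seconds)
  count = 0
  start = Time.now
  while Time.now - start < seconds
    1.00 / 24000000000.001
    count += 1
  end
  count
end

# Hypothetical helper: run the loop in `workers` child processes and
# return the sum of their counts.
def forked_benchmark(workers, seconds)
  children = Array.new(workers) do
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.puts count_divisions(seconds)
      writer.close   # flushes the count to the parent
      exit!(0)       # leave the child without running at_exit handlers
    end
    writer.close
    [pid, reader]
  end

  children.inject(0) do |total, (pid, reader)|
    count = reader.read.to_i
    reader.close
    Process.wait(pid)
    total + count
  end
end

# Example (not executed here): four workers for ten seconds, as in the
# threaded benchmark.
#   puts forked_benchmark(4, 10)
```

Because each worker is a separate process, the operating system is free to schedule them on different cores, which is what the threaded MRI version cannot do.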

For those interested in how we run Fabulously40.

1. Single server, OpenSolaris / 8Gigs RAM / Quad Xeon x5355 / 100Mbit line.
2. Static and dynamic data cached up front by Varnish
3. Even though Nginx can handle L7 load balancing, Perlbal offers better flexibility with its plugin system
4. Jetty application servers easily scale out by using memcached as the session store
5. Write intensive operations are done asynchronously via the ActiveMQ message store system
6. One PostgreSQL database on RAID1 with a hot standby database on a third disk.

The application can do 6,000+ req/sec with the Varnish cache and 80-120 req/sec without it. The platform uses Wicket, Hibernate and Spring for its internals.

There you have it.

[Image: fab40 architecture]

You might find this plugin nifty if you have multiple application servers processing requests. The Perlbal BackendHeaders plugin adds an X-Backend header identifying which backend served the request.

Update 06/26/09: Now on GitHub as perlbal-plugin-backendheaders.

syris:~ victori$ curl -I http://fabulously40.com/questions
HTTP/1.1 200 OK
Server: nginx/0.7.52
Content-Type: text/html; charset=utf-8
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Language: en
X-Backend: 72.11.142.91:8880
X-Dilbert: If you have any trouble sounding condescending, find a Unix user to show you how it's done
Content-Length: 48046

package Perlbal::Plugin::BackendHeaders;

use Perlbal;
use strict;
use warnings;

#
# Add $self->{service}->run_hook('modify_response_headers', $self);
# To sub handle_response in BackendHTTP after Content-Length is set.
#
# LOAD BackendHeaders
# SET plugins        = backendheaders

sub load {
    my $class = shift;
    return 1;
}

sub unload {
    my $class = shift;
    return 1;
}

# called when we’re being added to a service
sub register {
    my ( $class, $svc ) = @_;

    my $modify_response_headers_hook = sub {
        my Perlbal::BackendHTTP $be  = shift;
        my Perlbal::HTTPHeaders $hds = $be->{res_headers};
        my Perlbal::Service $svc     = $be->{service};
        return 0 unless defined $hds && defined $svc;

        $hds->header( 'X-Backend', $be->{ipport} );

        return 0;
    };

    $svc->register_hook( 'BackendHeaders', 'modify_response_headers',
        $modify_response_headers_hook );
    return 1;
}

# called when we’re no longer active on a service
sub unregister {
    my ( $class, $svc ) = @_;
    $svc->unregister_hooks('BackendHeaders');
    $svc->unregister_setters('BackendHeaders');
    return 1;
}

1;
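
If you want to check the header from a script rather than by eye, something like the following could work. It is a sketch, not part of the plugin: the helper name is hypothetical, and it only parses a raw header dump such as the output of curl -I.

```ruby
# Hypothetical helper: pull the X-Backend value out of a raw header dump.
# Returns nil when the header is absent.
def backend_from_headers(raw)
  raw.each_line do |line|
    name, value = line.split(":", 2)
    next if value.nil?   # status line or blank line, no "name: value" pair
    return value.strip if name.strip.downcase == "x-backend"
  end
  nil
end

sample = <<HEADERS
HTTP/1.1 200 OK
Server: nginx/0.7.52
X-Backend: 72.11.142.91:8880
Content-Length: 48046
HEADERS

puts backend_from_headers(sample)
```

Splitting on the first colon only keeps the port in the value, since the backend is reported as ip:port.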


Apparently Solaris comes with some crummy settings for web hosting. Here are the settings I have used to improve web performance for our service.

victori@fab40:/etc/rc2.d# netstat -sP tcp | grep -i drop
tcpTimRetransDrop = 6029 tcpTimKeepalive = 2467
tcpListenDrop = 27327 tcpListenDropQ0 = 0
tcpHalfOpenDrop = 0 tcpOutSackRetrans = 99988

If tcpListenDrop is above 0, you have more incoming connections than the system can handle with the default settings. Increasing tcp_conn_req_max_q accordingly should fix the issue. Raise the number incrementally until tcpListenDrop stops climbing.

On Solaris 10 the documented defaults are 128 for tcp_conn_req_max_q and 1024 for tcp_conn_req_max_q0.

/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 8192
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0 8192
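
To tell whether tcpListenDrop is still climbing after a change, you can compare two samples of the netstat output taken some time apart. A minimal sketch, assuming the "name = value" layout shown in the netstat output in this post; both helper names are made up for illustration.

```ruby
# Hypothetical helper: turn "name = value" pairs from `netstat -sP tcp`
# output into a hash of integers.
def parse_tcp_counters(text)
  counters = {}
  text.scan(/(\w+)\s*=\s*(\d+)/) { |name, value| counters[name] = value.to_i }
  counters
end

# Hypothetical helper: number of listen drops between two samples.
def listen_drops_between(before, after)
  parse_tcp_counters(after).fetch("tcpListenDrop") -
    parse_tcp_counters(before).fetch("tcpListenDrop")
end

# Example (not executed here): sample, wait a minute, sample again.
#   before = `netstat -sP tcp`
#   sleep 60
#   after = `netstat -sP tcp`
#   puts listen_drops_between(before, after)
```

A result of zero over a busy period suggests the queue is now large enough; a positive number means connections are still being dropped.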

Lower the start of the anonymous port range to support the larger connection queue that was defined.

/usr/sbin/ndd -set /dev/tcp tcp_smallest_anon_port 2048

Up the buffer size for transmissions, to you know….. actually make use of that 100mbit connection?

/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1048576
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 1048576
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152

To persist these settings across a reboot, write the commands out to a shell script at /etc/rc2.d/S99netoptimize.
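
One way to produce that file is to generate it from a list of the settings, so the values live in one place. This is a sketch only: the helper name is invented, and the #!/bin/sh first line is an assumption about how you want the script run.

```ruby
# ndd settings from this post, in the order they were applied.
TCP_SETTINGS = [
  ["tcp_conn_req_max_q",     8192],
  ["tcp_conn_req_max_q0",    8192],
  ["tcp_smallest_anon_port", 2048],
  ["tcp_xmit_hiwat",         1048576],
  ["tcp_recv_hiwat",         1048576],
  ["tcp_max_buf",            2097152]
]

# Hypothetical helper: render the settings as the text of a shell script.
def netoptimize_script(settings)
  lines = ["#!/bin/sh"]
  settings.each do |name, value|
    lines << "/usr/sbin/ndd -set /dev/tcp #{name} #{value}"
  end
  lines.join("\n") + "\n"
end

# Print the script; redirect the output to /etc/rc2.d/S99netoptimize as root.
puts netoptimize_script(TCP_SETTINGS)
```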

I was told Solaris was configured out of the box for today’s hardware? wtf?


OpenSolaris uses a single-threaded malloc by default for all applications. The JDK compiled for Solaris is not linked against mtmalloc or the newer umem malloc implementation, both of which are optimized for multithreading. In a multithreaded application, using a single-threaded malloc can degrade performance: as memory is allocated concurrently in multiple threads, all the threads must wait in a queue while malloc() handles one request at a time. This is called heap contention. To get around this contention point you can force the JDK to use the umem malloc.

LD_PRELOAD=/usr/lib/libumem.so /opt/jdk1.7.0/bin/java -jar start.jar

or

LD_PRELOAD=/usr/lib/libmtmalloc.so /opt/jdk1.7.0/bin/java -jar start.jar

This simple fix has really improved performance on our web service, Fabulously40. The application went from serving 120 req/sec uncached to 170 req/sec. Not bad, no?

This also works wonders for MySQL and Varnish, two applications that really put those threads to use. We have dropped 100ms in response time with Varnish just by using umem for the malloc implementation.

I am not exactly sure why this isn’t documented, but nginx as of 0.7.x supports event ports.

This is a huge performance win for Solaris. With event ports support, nginx can avoid the O(n) file descriptor problem.

To enable event ports add this to your nginx.conf

events { use eventport; }

Here is the performance-proven configuration that we use on Fabulously40.

The following configuration will help you survive massive traffic with nginx. We have served 4.4 million requests in a 4 hour time frame with no issues. That is 305 req/sec.

worker_processes 8;
worker_rlimit_nofile 10240;

events {
    worker_connections 8024;
    use eventport;
}

http {
    keepalive_timeout 20;
    server_names_hash_bucket_size 64;
    sendfile on;
    tcp_nopush on;
    client_max_body_size 150m;

    gzip on;
    gzip_comp_level 5;
    gzip_vary on;
    gzip_proxied any;
    gzip_types text text/plain text/css text/xml application/xml text/javascript text/html application/x-javascript;
}

Well, this has been a long time coming, but I can declare the Typeface blogging platform a dead project. I have migrated this blog to WordPress.


Try this fun Perl benchmark to test your dual core, SMP or hyperthreaded system.

Before running, make sure you have Perl 5.8 or later with threading support compiled in.

Perl has native ithreads as of Perl 5.8.

#!/usr/bin/perl -w
use threads;
use strict;

my $y1 = Bench->new();
print "Benchmarking multi-threaded\n";
$y1->benchmark();

print "Benchmarking single-threaded\n";
$y1->ncpu(1);
$y1->benchmark();

package Bench;

sub new
{
        my $class = shift;
        my $self = {result => 0, ncpu => 0};

        # Ask the kernel for the CPU count (sysctl hw.ncpu is BSD / Mac OS X specific).
        my $cpus = `sysctl hw.ncpu`;
        $cpus =~ /: (.*)/;
        $self->{ncpu} = $1;

        bless ($self, $class);
        return $self;
}

# Getter/setter for the number of threads to start.
sub ncpu {
        my ($self, $num) = @_;
        if (defined $num) { $self->{ncpu} = $num; } else { return $self->{ncpu}; }
}

sub benchmark
{
        my ($self) = @_;
        my @thr;
        for (my $i = 0; $i < $self->{ncpu}; $i++)
        {
                print "Starting thread $i\n";
                push @thr, threads->create(\&benchmark_thread);
        }

        # Each thread returns its own count; add them up.
        my $total = 0;
        foreach my $t (@thr)
        {
                $total = $total + $t->join();
        }
        print "Total number of insane floating point divisions in 10 seconds is " . $total . "\n";
}

# Count floating point divisions for 10 seconds.
sub benchmark_thread
{
        my ($y, $x) = (0, 0);
        my $time1 = time();
        while (1) {
          if ((time() - $time1) >= 10) { last; }
          else {
            $x = 1.00 / 24000000000.001;
            $y++;
          }
        }
        return $y;
}
