Saturday, December 8, 2007

JGroups Cluster on EC2 large 64bit instances

Introduction:

This series of articles covers getting the proposed Pentaho Cluster running on EC2:
http://blog.vmdatamine.com/2007/09/pentaho-business-suite-cluster-research.html
http://blog.vmdatamine.com/2007/09/pentaho-cluster-installing-jgroups.html
http://blog.vmdatamine.com/2007/11/pentaho-cluster-installing-jgroups-on.html

In the last article, I ran a JGroups cluster test and found the results disappointing compared to the test results published by JBoss.

Now that larger instances with better performance are available, I decided to see how JGroups would perform on them.

The specification of the large instances can be found in Amazon's announcement.

The only changes from the last test are an upgrade to Java (JDK 6 Release 3) and running on Amazon's public Fedora 64-bit image.

I tried both the large and extra large instances in a 4 node cluster setup running TCP.

Comments:

  1. The network bandwidth is still the limiting factor.
  2. I had to modify the tcp.xml settings to enable queues to stop the test hanging sporadically.
  3. The larger 64-bit instances deliver more throughput than the small nodes. This could be due to settings, or CPU may be an underlying factor after all.
  4. You are not going to reach the JBoss performance results without a faster network.
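
To put comment 1 in perspective, the per-receiver throughput figures from the results can be converted to megabits per second. This is a minimal sketch, assuming the test's "MB" means 10^6 bytes (the logs report 20000 messages as 20MB, which suggests 1000-byte messages). The function name is made up for illustration.

```python
def mb_per_sec_to_mbit(mb_per_sec: float) -> float:
    """Convert MB/s to Mbit/s, assuming 1 MB = 10^6 bytes (an assumption)."""
    return mb_per_sec * 8


# Best combined figure from the results section (extra large, 4 senders).
best = mb_per_sec_to_mbit(14.99)
print(f"{best:.1f} Mbit/s per receiver")
```

The conversion only restates the measured numbers in network units. It says nothing by itself about what the EC2 network could sustain.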

Results:

Two nodes: 2 senders:

-- results:

10.252.93.220:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=3211ms, msgs/sec=6228.59, throughput=6.23MB

10.252.99.47:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=3236ms, msgs/sec=6180.47, throughput=6.18MB

combined: 6204.53 msgs/sec averaged over all receivers (throughput=6.2MB/sec)
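
The msgs/sec and combined figures appear to be simple ratios of the other fields in each result line. The sketch below recomputes them from the two-node output above. It is inferred from the log text only, not taken from the JGroups test source, and the helper name is hypothetical.

```python
import re

# First receiver's line, copied from the two-node, two-sender run.
LINE = ("num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), "
        "received=20MB, time=3211ms, msgs/sec=6228.59, throughput=6.23MB")


def recompute_rate(line: str) -> float:
    """Recompute msgs/sec as received messages divided by elapsed seconds."""
    received = int(re.search(r"num_msgs_received=(\d+)", line).group(1))
    millis = int(re.search(r"time=(\d+)ms", line).group(1))
    return received / (millis / 1000.0)


# Second receiver: 20000 messages in 3236 ms.
rates = [recompute_rate(LINE), 20000 / 3.236]
combined = sum(rates) / len(rates)  # plain average over receivers
print(round(rates[0], 2), round(combined, 2))
```

Both recomputed values match the logged ones, which supports reading "combined" as the average over receivers rather than a sum.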

Two nodes: 1 sender, 1 receiver

-- results:

10.252.93.220:7800 (myself):
num_msgs_expected=10000, num_msgs_received=10000 (loss rate=0.0%), received=10MB, time=2607ms, msgs/sec=3835.83, throughput=3.84MB

10.252.99.47:7800:
num_msgs_expected=10000, num_msgs_received=10000 (loss rate=0.0%), received=10MB, time=2515ms, msgs/sec=3976.14, throughput=3.98MB

combined: 3905.98 msgs/sec averaged over all receivers (throughput=3.9MB/sec)

4 nodes: 2 senders, 2 receivers

-- results:

10.252.23.15:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4640ms, msgs/sec=4310.34, throughput=4.31MB

10.252.98.208:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4687ms, msgs/sec=4267.12, throughput=4.27MB

10.252.79.0:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4642ms, msgs/sec=4308.49, throughput=4.31MB

10.252.93.203:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4654ms, msgs/sec=4297.38, throughput=4.3MB

combined: 4295.83 msgs/sec averaged over all receivers (throughput=4.3MB/sec)


4 nodes: 2 senders, 2 receivers, 100k messages test

-- results:

10.252.23.15:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19828ms, msgs/sec=10086.75, throughput=10.09MB

10.252.98.208:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19893ms, msgs/sec=10053.79, throughput=10.05MB

10.252.79.0:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19828ms, msgs/sec=10086.75, throughput=10.09MB

10.252.93.203:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19917ms, msgs/sec=10041.67, throughput=10.04MB

combined: 10067.24 msgs/sec averaged over all receivers (throughput=10.07MB/sec)

2nd run:

-- results:

10.252.23.15:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20605ms, msgs/sec=9706.38, throughput=9.71MB

10.252.98.208:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20629ms, msgs/sec=9695.09, throughput=9.7MB

10.252.79.0:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20590ms, msgs/sec=9713.45, throughput=9.71MB

10.252.93.203:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20636ms, msgs/sec=9691.8, throughput=9.69MB

combined: 9701.68 msgs/sec averaged over all receivers (throughput=9.7MB/sec)

Extra large 4 nodes : 2 senders, 2 receivers

-- results:

10.252.106.3:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17857ms, msgs/sec=11200.09, throughput=11.2MB

10.252.15.79:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17831ms, msgs/sec=11216.42, throughput=11.22MB

10.252.10.223:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17837ms, msgs/sec=11212.65, throughput=11.21MB

10.252.6.223:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17837ms, msgs/sec=11212.65, throughput=11.21MB

combined: 11210.45 msgs/sec averaged over all receivers (throughput=11.21MB/sec)

2nd run:


-- results:

10.252.106.3:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15585ms, msgs/sec=12832.85, throughput=12.83MB

10.252.15.79:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15564ms, msgs/sec=12850.17, throughput=12.85MB

10.252.10.223:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15563ms, msgs/sec=12850.99, throughput=12.85MB

10.252.6.223:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15563ms, msgs/sec=12850.99, throughput=12.85MB

combined: 12846.25 msgs/sec averaged over all receivers (throughput=12.85MB/sec)

Extra large 4 nodes : 4 senders and TCP queues

-- results:

10.252.106.3:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27784ms, msgs/sec=14396.78, throughput=14.4MB

10.252.15.79:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27744ms, msgs/sec=14417.53, throughput=14.42MB

10.252.10.223:7800 (myself):
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27761ms, msgs/sec=14408.7, throughput=14.41MB

10.252.6.223:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27755ms, msgs/sec=14411.82, throughput=14.41MB

combined: 14408.71 msgs/sec averaged over all receivers (throughput=14.41MB/sec)

With TCP queue_max_size set to 1000:

10.252.106.3:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26667ms, msgs/sec=14999.81, throughput=15MB

10.252.15.79:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26709ms, msgs/sec=14976.23, throughput=14.98MB

10.252.10.223:7800 (myself):
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26676ms, msgs/sec=14994.75, throughput=14.99MB

10.252.6.223:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26710ms, msgs/sec=14975.66, throughput=14.98MB

combined: 14986.61 msgs/sec averaged over all receivers (throughput=14.99MB/sec)
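
To compare the runs, the combined msgs/sec figures can be expressed as relative gains. A minimal sketch using only the combined numbers reported above. The labels and the function name are mine.

```python
def pct_gain(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Combined msgs/sec copied from the results above.
combined = {
    "large, 2 senders, 1st run": 10067.24,
    "extra large, 2 senders, 1st run": 11210.45,
    "extra large, 4 senders": 14408.71,
    "extra large, 4 senders, queue_max_size=1000": 14986.61,
}

queue_gain = pct_gain(combined["extra large, 4 senders"],
                      combined["extra large, 4 senders, queue_max_size=1000"])
size_gain = pct_gain(combined["large, 2 senders, 1st run"],
                     combined["extra large, 2 senders, 1st run"])
print(f"queue change: {queue_gain:.2f}%  large -> extra large: {size_gain:.2f}%")
```

These are single runs. The two large-instance runs with identical settings already differ by a few percent, so a gain of that size from the queue change may be within run-to-run noise.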

Example tcp.xml (TCP and TCPPING elements only):

<config>
    <TCP start_port="7800"
         loopback="false"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         use_incoming_packet_handler="true"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         skip_suspected_members="true"

         use_concurrent_stack="true"

         thread_pool.enabled="true"
         thread_pool.min_threads="8"
         thread_pool.max_threads="40"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="true"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="run"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="8"
         oob_thread_pool.max_threads="20"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="true"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="run"/>

    <TCPPING timeout="3000"
             initial_hosts="${jgroups.tcpping.initial_hosts:10.252.106.3[7800],10.252.15.79[7800],10.252.10.223[7800],10.252.6.223[7800]}"
             port_range="1"
             num_initial_members="2"/>

    <!-- any remaining protocols are not shown here -->
</config>
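
Since EC2 hands out new internal addresses on each launch, the TCPPING initial_hosts list has to be rebuilt for every cluster. A small sketch that formats a list of addresses in the host[port] style used above. The function name is hypothetical, and the default port of 7800 is taken from start_port in this config.

```python
def initial_hosts(addresses, port=7800):
    """Join addresses as comma-separated host[port] entries for TCPPING."""
    return ",".join(f"{addr}[{port}]" for addr in addresses)


# Addresses of the four extra large nodes used in this test.
nodes = ["10.252.106.3", "10.252.15.79", "10.252.10.223", "10.252.6.223"]
hosts = initial_hosts(nodes)
print(hosts)
```

The ${jgroups.tcpping.initial_hosts:...} form in the config suggests the value can be supplied through a system property of that name, with the inline list as the fallback, so the generated string could presumably be passed with -Djgroups.tcpping.initial_hosts instead of editing the file.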

