Saturday, December 8, 2007

JGroups Cluster on EC2 large 64bit instances

Introduction:

In this series of articles I have been covering getting the proposed Pentaho Cluster running on EC2.
http://blog.vmdatamine.com/2007/09/pentaho-business-suite-cluster-research.html
http://blog.vmdatamine.com/2007/09/pentaho-cluster-installing-jgroups.html
http://blog.vmdatamine.com/2007/11/pentaho-cluster-installing-jgroups-on.html

In the last article, I ran a JGroups cluster test and found the results disappointing compared to the test results published by JBoss.

Given that new, larger instance types with better rated performance are now available, I decided to see how JGroups would perform on them.

The specification of the large instances can be found in Amazon's announcement.

The only changes from the last test were an upgrade to Java (JDK 6 Release 3) and running on an Amazon public Fedora 64-bit image.

I tried both the large and extra-large instances in a 4-node cluster setup running over TCP.

Comments:

  1. The network bandwidth is still the limiting factor.
  2. I had to modify the tcp.xml settings to enable the thread-pool queues, which stopped the test from hanging sporadically.
  3. The larger 64-bit instances have more throughput than the small instances. This could be due to settings, or CPU may be an underlying factor after all.
  4. You are not going to reach the published JBoss performance results without a faster network.
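To put the last point in context, here is a quick back-of-envelope conversion of the throughput figures into approximate wire rates, using the best extra-large result from the runs below (14.99 MB/sec) and the figure from the JBoss performance report (60.78 MB/sec):

```shell
# Convert MB/sec throughput figures into approximate Mbit/sec wire rates:
# best extra-large run (14.99 MB/sec) vs the JBoss lab report (60.78 MB/sec).
awk 'BEGIN {
  printf "EC2 xlarge  : %.0f Mbit/sec\n", 14.99 * 8
  printf "JBoss report: %.0f Mbit/sec\n", 60.78 * 8
}'
```

Roughly 120 Mbit/sec versus 486 Mbit/sec, which is why the network, not CPU, is the first thing standing between these instances and the published numbers.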

Results:

Two nodes: 2 senders:

-- results:

10.252.93.220:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=3211ms, msgs/sec=6228.59, throughput=6.23MB

10.252.99.47:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=3236ms, msgs/sec=6180.47, throughput=6.18MB

combined: 6204.53 msgs/sec averaged over all receivers (throughput=6.2MB/sec)

Two nodes: 1 sender, 1 receiver

-- results:

10.252.93.220:7800 (myself):
num_msgs_expected=10000, num_msgs_received=10000 (loss rate=0.0%), received=10MB, time=2607ms, msgs/sec=3835.83, throughput=3.84MB

10.252.99.47:7800:
num_msgs_expected=10000, num_msgs_received=10000 (loss rate=0.0%), received=10MB, time=2515ms, msgs/sec=3976.14, throughput=3.98MB

combined: 3905.98 msgs/sec averaged over all receivers
(throughput=3.9MB/sec)

4 nodes: 2 senders, 2 receivers

-- results:

10.252.23.15:7800:

num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4640ms, msgs/sec=4310.34, throughput=4.31MB
10.252.98.208:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4687ms, msgs/sec=4267.12, throughput=4.27MB
10.252.79.0:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4642ms, msgs/sec=4308.49, throughput=4.31MB
10.252.93.203:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4654ms, msgs/sec=4297.38, throughput=4.3MB

combined: 4295.83 msgs/sec averaged over all receivers (throughput=4.3MB/sec)


4 nodes: 2 senders, 2 receivers 100k messages test

-- results:

10.252.23.15:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19828ms, msgs/sec=10086.75, throughput=10.09MB

10.252.98.208:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19893ms, msgs/sec=10053.79, throughput=10.05MB

10.252.79.0:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19828ms, msgs/sec=10086.75, throughput=10.09MB

10.252.93.203:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=19917ms, msgs/sec=10041.67, throughput=10.04MB

combined: 10067.24 msgs/sec averaged over all receivers (throughput=10.07MB/sec)

2nd run:

-- results:

10.252.23.15:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20605ms, msgs/sec=9706.38, throughput=9.71MB

10.252.98.208:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20629ms, msgs/sec=9695.09, throughput=9.7MB

10.252.79.0:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20590ms, msgs/sec=9713.45, throughput=9.71MB

10.252.93.203:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=20636ms, msgs/sec=9691.8, throughput=9.69MB

combined: 9701.68 msgs/sec averaged over all receivers (throughput=9.7MB/sec)

Extra large 4 nodes : 2 senders, 2 receivers

-- results:

10.252.106.3:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17857ms, msgs/sec=11200.09, throughput=11.2MB

10.252.15.79:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17831ms, msgs/sec=11216.42, throughput=11.22MB

10.252.10.223:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17837ms, msgs/sec=11212.65, throughput=11.21MB

10.252.6.223:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=17837ms, msgs/sec=11212.65, throughput=11.21MB

combined: 11210.45 msgs/sec averaged over all receivers (throughput=11.21MB/sec)


2nd run:

-- results:

10.252.106.3:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15585ms, msgs/sec=12832.85, throughput=12.83MB

10.252.15.79:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15564ms, msgs/sec=12850.17, throughput=12.85MB

10.252.10.223:7800 (myself):
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15563ms, msgs/sec=12850.99, throughput=12.85MB

10.252.6.223:7800:
num_msgs_expected=200000, num_msgs_received=200000 (loss rate=0.0%), received=200MB, time=15563ms, msgs/sec=12850.99, throughput=12.85MB

combined: 12846.25 msgs/sec averaged over all receivers (throughput=12.85MB/sec)

Extra large 4 nodes : 4 senders and TCP queues

-- results:

10.252.106.3:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27784ms, msgs/sec=14396.78, throughput=14.4MB

10.252.15.79:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27744ms, msgs/sec=14417.53, throughput=14.42MB

10.252.10.223:7800 (myself):
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27761ms, msgs/sec=14408.7, throughput=14.41MB

10.252.6.223:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=27755ms, msgs/sec=14411.82, throughput=14.41MB

combined: 14408.71 msgs/sec averaged over all receivers (throughput=14.41MB/sec)

With TCP queue_max_size set to 1000:

10.252.106.3:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26667ms, msgs/sec=14999.81, throughput=15MB

10.252.15.79:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26709ms, msgs/sec=14976.23, throughput=14.98MB

10.252.10.223:7800 (myself):
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26676ms, msgs/sec=14994.75, throughput=14.99MB

10.252.6.223:7800:
num_msgs_expected=400000, num_msgs_received=400000 (loss rate=0.0%), received=400MB, time=26710ms, msgs/sec=14975.66, throughput=14.98MB

combined: 14986.61 msgs/sec averaged over all receivers (throughput=14.99MB/sec)
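The queue_max_size change above is a one-attribute edit in tcp.xml. A sketch with sed, shown on a stdin snippet first so the substitution is visible before touching the real file (note the pattern would also match the oob_thread_pool line if applied file-wide; tighten it if you only want one pool):

```shell
# Raise the thread-pool queue limit from 100 to 1000. Demonstrated on a
# one-line snippet; the in-place variant for the real file is commented.
echo 'thread_pool.queue_max_size="100"' \
  | sed 's/queue_max_size="100"/queue_max_size="1000"/'
# In place (creates tcp.xml.bak):
#   sed -i.bak 's/queue_max_size="100"/queue_max_size="1000"/' tcp.xml
```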

Example of the TCP protocol section from tcp.xml:


<TCP start_port="7800"
     loopback="false"
     discard_incompatible_packets="true"
     max_bundle_size="64000"
     max_bundle_timeout="30"
     use_incoming_packet_handler="true"
     enable_bundling="true"
     use_send_queues="true"
     sock_conn_timeout="300"
     skip_suspected_members="true"

     use_concurrent_stack="true"

     thread_pool.enabled="true"
     thread_pool.min_threads="8"
     thread_pool.max_threads="40"
     thread_pool.keep_alive_time="5000"
     thread_pool.queue_enabled="true"
     thread_pool.queue_max_size="100"
     thread_pool.rejection_policy="run"

     oob_thread_pool.enabled="true"
     oob_thread_pool.min_threads="8"
     oob_thread_pool.max_threads="20"
     oob_thread_pool.keep_alive_time="5000"
     oob_thread_pool.queue_enabled="true"
     oob_thread_pool.queue_max_size="100"
     oob_thread_pool.rejection_policy="run"/>

<TCPPING timeout="3000"
         initial_hosts="${jgroups.tcpping.initial_hosts:10.252.106.3[7800],10.252.15.79[7800],10.252.10.223[7800],10.252.6.223[7800]}"
         port_range="1"
         num_initial_members="2"/>


Monday, November 5, 2007

Pentaho Cluster : Installing JGroups on EC2

Overview:

Wondering why I hadn't updated my progress with installing JGroups on EC2?
It was because I had three false starts and got nowhere.

Finally however I found some more documentation and was able to get it running.

I found this report about a JGroups Performance test and the associated JBoss wiki Perftests.

That was enough information to understand how to get it working. It also helped that the more recent JGroups 2.5.1 release came with some sample configuration files.

Comments:

  1. The network bandwidth between EC2 nodes is the limiting factor.
  2. For 2 nodes: 4183.65 msgs/sec averaged over all receivers (throughput=4.18MB/sec) vs the published 60783.12 msgs/sec (throughput=60.78MB/sec).
  3. For 4 nodes: 3852.11 msgs/sec averaged over all receivers (throughput=3.85MB/sec) vs the same published 60783.12 msgs/sec (throughput=60.78MB/sec).

The JGroups-on-EC2 performance was very poor compared to the JGroups performance report: without changing any settings it was roughly 15 times slower. That is the difference a 1 Gigabit LAN makes over a 100 Megabit-class network.

The next step is to test on the larger instances. If their network performance is, as rated, better than the default instances', it should show up in the results.


Install:
  1. wget http://easynews.dl.sourceforge.net/sourceforge/javagroups/JGroups-2.5.1.bin.zip
  2. unzip JGroups-2.5.1.bin.zip -d YourJavaLibDirectory
  3. cd YourJavaLibDirectory.
  4. Run nslookup `hostname` to get your server's IP address.
  5. edit the JGroups-2.5.1.bin/config.txt and JGroups-2.5.1.bin/tcp.xml to add the hosts. See the sample files at the bottom of this post.
  6. java -cp JGroups-2.5.1.bin/concurrent.jar:JGroups-2.5.1.bin/jgroups-all.jar:JGroups-2.5.1.bin/commons-logging.jar org.jgroups.tests.perf.Test -receiver -config JGroups-2.5.1.bin/config.txt -props JGroups-2.5.1.bin/tcp.xml
  7. java -cp JGroups-2.5.1.bin/concurrent.jar:JGroups-2.5.1.bin/jgroups-all.jar:JGroups-2.5.1.bin/commons-logging.jar org.jgroups.tests.perf.Test -sender -config JGroups-2.5.1.bin/config.txt -props JGroups-2.5.1.bin/tcp.xml
  8. If the hosts are correct, the test should run.
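For convenience, steps 6 and 7 can be wrapped so the classpath is built once and shared by the sender and receiver invocations (paths as unpacked in step 2; this is just a sketch of the same commands):

```shell
# Build the JGroups perf-test classpath once; the commands from steps 6-7
# then differ only in the -sender/-receiver flag.
JG=JGroups-2.5.1.bin
CP="$JG/concurrent.jar:$JG/jgroups-all.jar:$JG/commons-logging.jar"
echo "$CP"   # sanity-check the jar list

# Receiver (start on each receiving node first):
#   java -cp "$CP" org.jgroups.tests.perf.Test -receiver \
#        -config "$JG/config.txt" -props "$JG/tcp.xml"
# Sender:
#   java -cp "$CP" org.jgroups.tests.perf.Test -sender \
#        -config "$JG/config.txt" -props "$JG/tcp.xml"
```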

Results:

2 nodes
-- results:

10.255.23.160:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4664ms, msgs/sec=4288.16, throughput=4.29MB

10.255.26.143:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4903ms, msgs/sec=4079.14, throughput=4.08MB

combined: 4183.65 msgs/sec averaged over all receivers (throughput=4.18MB/sec)

4 nodes (2 senders, 2 receivers):

-- results:

10.253.15.95:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5212ms, msgs/sec=3837.3, throughput=3.84MB

10.255.23.160:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5174ms, msgs/sec=3865.48, throughput=3.87MB

10.255.26.143:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5192ms, msgs/sec=3852.08, throughput=3.85MB

10.253.83.143:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5190ms, msgs/sec=3853.56, throughput=3.85MB

combined: 3852.11 msgs/sec averaged over all receivers (throughput=3.85MB/sec)


Sample Output:

2 Nodes:

----------------------- TEST -----------------------
Date: Mon Nov 05 04:30:54 EST 2007
Run by: root

mcast_port: 7500
log_interval: 1000
sender: true
props: JGroups-2.5.1.bin/tcp.xml
jmx: false
bind_addr: localhost
num_members: 2
msg_size: 1000
dump_transport_stats: false
start_port: 7800
topic: topic/testTopic
num_senders: 2
cluster: 10.255.23.160:7800,10.255.26.143:7801
num_msgs: 10000
transport: org.jgroups.tests.perf.transports.JGroupsTransport
config: JGroups-2.5.1.bin/config.txt
processing_delay: 0
mcast_addr: 228.1.2.3
JGroups version: 2.5.1

Nov 5, 2007 4:30:54 AM org.jgroups.JChannel init
INFO: JGroups version: 2.5.1

-------------------------------------------------------
GMS: address is 10.255.26.143:7800
-------------------------------------------------------
-- 10.255.26.143:7800 joined
-- waiting for 2 members to join
-- 10.255.23.160:7800 joined
-- READY (2 acks)

-- sending 10000 1KB messages
-- received 1000 messages
-- received 2000 messages
++ sent 1000
-- received 3000 messages
++ sent 2000
-- received 4000 messages
-- received 5000 messages
++ sent 3000
-- received 6000 messages
++ sent 4000
-- received 7000 messages
-- received 8000 messages
++ sent 5000
-- received 9000 messages
-- received 10000 messages
-- received 11000 messages
++ sent 6000
-- received 12000 messages
++ sent 7000
-- received 13000 messages
-- received 14000 messages
-- received 15000 messages
++ sent 8000
-- received 16000 messages
-- received 17000 messages
++ sent 9000
-- received 18000 messages
-- received 19000 messages
++ sent 10000
-- received 20000 messages

-- results:

10.255.23.160:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4664ms, msgs/sec=4288.16, throughput=4.29MB

10.255.26.143:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=4903ms, msgs/sec=4079.14, throughput=4.08MB

combined: 4183.65 msgs/sec averaged over all receivers (throughput=4.18MB/sec)


4 Nodes:


Receiver Node Output

----------------------- TEST -----------------------
Date: Mon Nov 05 04:46:12 EST 2007
Run by: root

mcast_port: 7500
log_interval: 1000
sender: false
props: JGroups-2.5.1.bin/tcp.xml
jmx: false
bind_addr: localhost
num_members: 4
msg_size: 1000
dump_transport_stats: false
start_port: 7800
topic: topic/testTopic
num_senders: 2
cluster: 10.255.23.160:7800,10.255.26.143:7801,10.253.83.143:7802,10.253.15.95:7803
num_msgs: 10000
transport: org.jgroups.tests.perf.transports.JGroupsTransport
config: JGroups-2.5.1.bin/config.txt
processing_delay: 0
mcast_addr: 228.1.2.3
JGroups version: 2.5.1

Nov 5, 2007 4:46:12 AM org.jgroups.JChannel init
INFO: JGroups version: 2.5.1

-------------------------------------------------------
GMS: address is 10.253.83.143:7800
-------------------------------------------------------
-- 10.253.15.95:7800 joined
-- 10.253.83.143:7800 joined
-- waiting for 4 members to join
-- 10.255.23.160:7800 joined
-- 10.255.26.143:7800 joined
-- READY (4 acks)

-- received 1000 messages
-- received 2000 messages
-- received 3000 messages
-- received 4000 messages
-- received 5000 messages
-- received 6000 messages
-- received 7000 messages
-- received 8000 messages
-- received 9000 messages
-- received 10000 messages
-- received 11000 messages
-- received 12000 messages
-- received 13000 messages
-- received 14000 messages
-- received 15000 messages
-- received 16000 messages
-- received 17000 messages
-- received 18000 messages
-- received 19000 messages
-- received 20000 messages

-- local results:
sender: 10.255.23.160:7800: num_msgs_expected=10000, num_msgs_received=10000 (loss rate=0.0%), received=10MB, time=5180ms,

msgs/sec=1930.5, throughput=1.93MB
sender: 10.253.15.95:7800: num_msgs_expected=10000, num_msgs_received=10000 (loss rate=0.0%), received=10MB, time=4832ms,

msgs/sec=2069.54, throughput=2.07MB


-- results:

10.253.15.95:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5212ms, msgs/sec=3837.3, throughput=3.84MB

10.255.23.160:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5174ms, msgs/sec=3865.48, throughput=3.87MB

10.255.26.143:7800:
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5192ms, msgs/sec=3852.08, throughput=3.85MB

10.253.83.143:7800 (myself):
num_msgs_expected=20000, num_msgs_received=20000 (loss rate=0.0%), received=20MB, time=5190ms, msgs/sec=3853.56, throughput=3.85MB

combined: 3852.11 msgs/sec averaged over all receivers (throughput=3.85MB/sec)



Sample config.txt file

############################
# only used by TCP Transport
############################

# List of hosts in the cluster. Since we don't specify ports, you cannot run multiple TcpTransports
# on the same machine: each member has to be run on a separate machine (this may change in a future version)
#cluster=127.0.0.1:7800,127.0.0.1:7801
# 2nodes # cluster=10.255.23.160:7800,10.255.26.143:7801
cluster=10.255.23.160:7800,10.255.26.143:7801,10.253.83.143:7802,10.253.15.95:7803


Sample hosts line in tcp.xml



initial_hosts="${jgroups.tcpping.initial_hosts:10.255.23.160[7800],10.255.26.143[7801],10.253.83.143[7802],10.253.15.95[7803]}"





Thursday, October 18, 2007

IOzone benchmark on EC2

I have run another IO benchmark on EC2, this time using IOzone.

The OS is CentOS 4.4, running as an Amazon Machine Image (AMI) on a Xen-based virtual machine (VM).

There seem to be a couple of sweet spots identified by the benchmark.

  1. To stay in CPU cache, keep your file size under 256 KB and read and write in 64-byte chunks.
  2. If you must read from larger files, the CPU cache size is the maximum file size that will remain in memory.
  3. Reading is best done in 64-byte to 512-byte chunks.
  4. At least on EC2, stay away from reading 16 KB files in 128-byte, 1 KB and 8 KB chunks.
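For concreteness, point 1's numbers translate into simple block-size arithmetic: a 256 KB file written in 64-byte records is 4096 operations. A quick dd illustration (dd here is only a stand-in for the iozone record size; the file name is a throwaway):

```shell
# Write a 256 KB file in 64-byte records: bs = record size,
# count = file size / record size = 262144 / 64 = 4096.
dd if=/dev/zero of=/tmp/iozone_demo bs=64 count=4096 2>/dev/null
wc -c < /tmp/iozone_demo    # 262144 bytes = 256 KB
rm -f /tmp/iozone_demo
```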
The results and larger graphs can be found here:

http://s3.amazonaws.com/dbadojo_benchmark/iozone_ec2_write.GIF
http://s3.amazonaws.com/dbadojo_benchmark/iozone_ec2_read.GIF
http://s3.amazonaws.com/dbadojo_benchmark/iozone_ec2_random_read.GIF
http://s3.amazonaws.com/dbadojo_benchmark/iozone_ec2_random_write.GIF
http://s3.amazonaws.com/dbadojo_benchmark/iozone_benchmark_ec2.zip (Right Click SAVE AS)

Have Fun

Paul

Thursday, October 4, 2007

Bonnie IO benchmark on EC2

I thought it might be useful to cross-link to an article I posted about running the bonnie and bonnie++ IO benchmark tools against EC2.

http://blog.dbadojo.com/2007/10/bonnie-io-benchmark-vs-ec2.html

Now back to finalizing testing JGroups on 2 separate nodes.

Have Fun

Paul

Thursday, September 20, 2007

Pentaho Cluster: Installing JGroups


As I mentioned in the previous post, I am going to attempt to make a Pentaho Cluster, based on this presentation (PDF) and outlined in this preparation post.

The fundamental building block for this cluster is the JGroups Java library, so first cab off the rank is to download and install the jar files and get the demo working.

After some mishaps with my CLASSPATH, I got that sorted and tested the demo using two JGroups instances running on the same node. See the picture: drawing in one window was immediately reflected in the second window on the right.

The install.html which is part of the zipped download file was good and explained the procedure well.

So the next thing is to try JGroups clustering on two separate EC2 nodes. That is next...

Have Fun

Paul

Installing JGroups onto an EC2 node with Java already installed.

Check Java is installed

java -version


java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)


Download JGroups

http://labs.jboss.com/jgroups/downloads
wget http://easynews.dl.sourceforge.net/sourceforge/javagroups/JGroups-2.3.bin.zip

Unzip and copy files to /usr/local

unzip JGroups-2.3.bin.zip -d /usr/local

Run the checker to make sure you have installed correctly

java -jar JGroups-2.3.bin/jgroups-all.jar


Version: 2.3
CVS: $Id: Version.java,v 1.35 2006/06/11 19:15:23 belaban Exp $
History: (see doc/history.txt for details)


Start X-window to EC2

ssh -i id_rsa-gsg-keypair -X root@yourEC2host

Check the display is set correctly

echo $DISPLAY

localhost:10.0

Run the demo twice, using & to background each command from the command line.

cd /usr/local
java -cp JGroups-2.3.bin/concurrent.jar:JGroups-2.3.bin/jgroups-all.jar:JGroups-2.3.bin/commons-logging.jar org.jgroups.demos.Draw


Sep 20, 2007 7:34:29 AM org.jgroups.protocols.UDP createSockets
INFO: sockets will use interface 10.253.22.176
Sep 20, 2007 7:34:29 AM org.jgroups.protocols.UDP createSockets
INFO: socket information:
local_addr=10.253.22.176:32772, mcast_addr=228.8.8.8:45566, bind_addr=/10.253.22.176, ttl=32
sock: bound to 10.253.22.176:32772, receive buffer size=64000, send buffer size=32000
mcast_recv_sock: bound to 10.253.22.176:45566, send buffer size=64000, receive buffer size=64000
mcast_send_sock: bound to 10.253.22.176:32773, send buffer size=64000, receive buffer size=64000

-------------------------------------------------------
GMS: address is 10.253.22.176:32772
-------------------------------------------------------
** View=[10.253.22.176:32769|1] [10.253.22.176:32769, 10.253.22.176:32772]
** View=[10.253.22.176:32769|1] [10.253.22.176:32769, 10.253.22.176:32772]






Monday, September 10, 2007

Pentaho Business Suite Cluster : Research and preparation

As I mentioned in this article on running the Pentaho Demo on EC2, according to a recent presentation, Pentaho has the ability to be clustered using JBoss Clustering (JBoss JGroups), JBoss AS and Apache.

This post is to outline the background documentation I am using to give this cluster a go on EC2.
There are some posts indicating that the lack of multicast support on EC2 is an issue; however, comments and the documentation suggest that JGroups will fall back to TCP, although that is slower.

Given I already have Pentaho installed and saved as an AMI, I am going to build on that to make the single node. The documentation and wikis are confident that it should auto-discover and be a piece of cake; however, we shall see.

Have Fun

Paul

Here is the specification of the cluster node (from that presentation)

Single Node

Single CPU
2 GB RAM
JBoss AS 4.0.3
JBoss JGroups?
Pentaho 1.1.5

Cluster Master:

Apache HTTP server 2.0.58 with mod_jk module version 1.2.15
JGroups cluster master
JMS / Web Services for Operational BI

Doco (Documentation)

http://docs.jboss.org/jbossas/jboss4guide/r4/html/jbosscache.chapt.html
http://docs.jboss.org/jbossas/jboss4guide/r4/html/cluster.chapt.html
http://www.onjava.com/pub/a/onjava/2002/07/10/jboss.html
http://www.jboss.org/wiki/Wiki.jsp?page=JBossHA
http://www.jboss.org/developers/projects/jboss/clustering
http://clusterstore.demo.jboss.com/

Forums and Blogs:

http://www.jgroups.org/javagroupsnew/docs/Perftest.html
http://blog.decaresystems.ie/index.php/2007/01/29/amazon-web-services-the-future-of-datacenter-computing-part-1
http://blog.decaresystems.ie/index.php/2007/02/12/amazon-web-services-the-future-of-datacenter-computing-part-2

Apache
http://jakarta.apache.org/tomcat/tomcat-3.3-doc/mod_jk-howto.html

Downloads

http://labs.jboss.com/jbossas/downloads
http://labs.jboss.com/jgroups/downloads
http://labs.jboss.com/jbosscache/download/index.html

Monday, September 3, 2007

Mondrian OLAP on MySQL EC2 Part 1


If you are wondering why there have been no recent postings, it is mainly due to struggling to get Mondrian installed. It had nothing to do with the environment and mostly came down to the documentation missing vital examples or explanations.

It is with relief I can say I have passed the test and can pass over to the fair lands of using Mondrian OLAP and running the demo.

You can download Mondrian from the Pentaho website or directly from SourceForge.
Use this installation document as a guide; however, it glosses over some of the nice little details which will make the demo work or not.

Comments:

  1. Installing the demo was way too hard. Just bundle Tomcat, or build an include-everything binary for MySQL, PostgreSQL, Oracle or whatever. You want people to try the program; not many people will persist like I did to get it to work.
  2. Having to hand-edit every file which connects to the database or is used as a Tomcat configuration file sucks; ever heard of storing that stuff in a single file or in the database?
  3. Provide a simple SQL script to create the schema objects and load the data. Running the FoodMart loader did not prove you could run the demo! Otherwise, use this SQL script from Gizzar.
  4. Some kind of verification tool for Tomcat or the rest of the stack would be good, rather than having to lather, rinse and repeat over Java, JDBC, Tomcat and red-herring errors.
Many thanks to some other blogs and sites which helped solve the various issues that cropped up:
Gizzar article on Mondrian Open Source OLAP with MySQL
University of Vienna (Wien) for Tomcat FAQ solving tomcat shutdown on missing X
Mondrian Forum

Have Fun

Paul

Install:
Note: This really requires a HOWTO document; I will work on one based on this post. Hopefully you can use this in conjunction with the installation guide and the various other blogs that had fun with this.

  1. Install some linux packages: yum install gcc autoconf
  2. Download the last release of Mondrian non-embedded files from SourceForge.
  3. Download and install Java 1.5 or better from Sun Java Downloads.
  4. Download and install MySQL 5.0 from MySQL Downloads.
  5. Download and unzip the MySQL JDBC driver
  6. Download and install Apache Tomcat 5.0.28.
  7. Verify that Java, MySQL and Tomcat are working by doing the following:
  8. java -version
  9. mysql -V
  10. /usr/local/tomcat/bin/startup.sh
  11. Point your browser at http://yourhostname:8080, if it works, tomcat is working.
  12. unzip Mondrian.zip -d /usr/local/mondrian
  13. Create the MySQL database: mysqladmin create foodmart -u root -p
  14. Create the Foodmart User: create user 'foodmart'@'yourhostname' identified by 'foodmart';
  15. Grant permissions: grant all privileges on *.* to 'foodmart'@'yourhostname' identified by 'foodmart';
  16. Explode the mondrian.war into /usr/local/tomcat/webapps/mondrian: jar -xvf mondrian.war
  17. Locate the 4 jar files: eigenbase-properties.jar, eigenbase-resgen.jar, eigenbase-xom.jar and log4j-1.2.9.jar
  18. Run the MondrianFoodMartLoader to create the tables and load the data, passing in the full paths to all the required JAR files; otherwise it will fail with a ClassNotFound error. This is an example:


  19. java -cp "lib/mondrian.jar:/usr/local/mysql-5.0.45-linux-i686/mysql-connector-java-5.0.7/src/lib/log4j-1.2.9.jar:lib/eigenbase-xom.jar:lib/eigenbase-resgen.jar:lib/eigenbase-properties.jar:/usr/local/mysql/mysql-connector-java-5.0.7/mysql-connector-java-5.0.7-bin.jar" \
    mondrian.test.loader.MondrianFoodMartLoader \
    -verbose -tables -data -indexes \
    -jdbcDrivers=com.mysql.jdbc.Driver \
    -inputFile=demo/FoodMartCreateData.sql \
    -outputJdbcURL="jdbc:mysql://localhost/foodmart?user=foodmart&password=foodmart"


  20. cp /usr/local/jakarta-tomcat-5.0.28/webapps/mondrian/WEB-INF/lib/xalan.jar /usr/local/jakarta-tomcat-5.0.28/common/endorsed/
  21. Modify the query files, web.xml, datasource.xml and mondrian.properties file to replace localhost with your hostname


  22. sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" fourhier.jsp
    sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" mondrian.jsp
    sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" colors.jsp
    sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" arrows.jsp
    sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" web.xml
    sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" datasource.xml
    sed -i`date +%y%m%d` -e "s/localhost/yourhostname/" mondrian.properties

  23. Modify the connection string in each file as per the install guide, including the &#38; entities; it is not a browser character issue.
  24. Test connectivity for the MySQL user foodmart@yourhostname. Connection errors can cause this error:


  25. Mondrian Error:Internal error: Error while creating SQL dialect

  26. export CATALINA_OPTS='-Djava.awt.headless=true' (or add it to your shell profile); this stops Tomcat dying when X-windows is not found!! Thanks to this link for solving that.
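Step 22's repeated sed commands can be collapsed into a loop. Demonstrated here in a scratch directory with a stand-in file; run the same loop from the exploded mondrian webapp directory against the real file list (fourhier.jsp, mondrian.jsp, colors.jsp, arrows.jsp, web.xml, datasource.xml, mondrian.properties):

```shell
# Replace localhost with the real hostname in every file carrying a
# connection string. Stand-in file and scratch dir so this is safe to try.
cd "$(mktemp -d)"
echo 'jdbc:mysql://localhost/foodmart' > datasource.xml

HOST=yourhostname                       # placeholder for your real hostname
for f in datasource.xml; do             # swap in the full file list for real
  sed -i"$(date +%y%m%d)" -e "s/localhost/$HOST/" "$f"   # dated backup, as in the post
done
cat datasource.xml
```

The dated `-i` suffix keeps a backup of each file named after the day it was edited, matching the original commands (GNU sed syntax).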

Saturday, August 25, 2007

Pentaho Business Intelligence suite on EC2


As I mentioned in the road map, the plan is to run through the popular data mining and business intelligence software installing and running any demos available.

This is partially to gain experience with the software but also to demonstrate the ability to use the on demand nature of Amazon's EC2 (Elastic Cloud beta) to provide the ability to use the tools when required and ramp up or down the amount of computing resources used.

Pentaho is a popular open source Business Intelligence (BI) software suite; it has an active support group and good support forums with active Pentaho employee participation. You can download the software suite or subsections of it from the downloads area, or alternatively go to the SourceForge site and get them from there.

Like any good software vendor, open source or not, they provide a Pentaho 1.2.1 GA demo of their software so potential clients can get a good look and feel for the product.

I used my old faithful CentOS 4.4 Linux distro, which is essentially Red Hat Enterprise Linux 4 (RHEL4), running MySQL 5.1 as a base to install the Pentaho demo.
The Pentaho BI Suite is built on Java and the demo uses JBoss, providing access to the various parts of the BI Suite (Reporting, Kettle ETL, Weka and Shark Workflow Engine).
So naturally it required a Java JDK. Given I had JDK 1.5.0_12 for Linux handy, I installed that.

Comments on the install and demo:

  • I tried the Pentaho 1.6.0 Release Candidate 1 demo (pentaho_demo_mysql5-1.6.0-RC1.782.tar.gz) and the demo install failed with a bunch of java class errors. I found this Pentaho forum post indicating similar issues. I haven't tried the 1.6.0 zip file to check whether it is indeed an issue with missing jar files.
  • Once I reverted to the Pentaho 1.2.1 GA demo everything was sweet.
  • To run the Pentaho Server on EC2 and use your browser you will need to update the /install_dir/pentaho-demo/jboss/server/default/deploy/pentaho.war/WEB-INF/web.xml and modify the base-url to be the hostname of your server.
  • The log produced by start-pentaho.sh was very verbose and actually very interesting for seeing the calls made to service the web requests.
  • Set the EC2-security group to allow access to port 8080 unless you are using the default security group.
  • Point your browser at http://yourEC2-DNS-hostname:8080
I haven't delved sufficiently into the documentation to determine what is required to separate the components of Pentaho onto separate server/compute resources. However it seems Pentaho has already delivered a presentation on that ability, so that is something to test in the future.

I have included a screenshot of the home page once the demo was up and running. As per normal, I have dumped the most relevant pieces of my work at the end of this post.

Have Fun

Paul



Get the Java 1.5 JDK, follow the instructions at Java 1.5, and install:

cd /usr/local
sh /mnt/jdk-1_5_0_12-linux-i586.bin

Do you agree to the above license terms? [yes or no]
yes
Unpacking...
Checksumming...
0
0
Extracting...
UnZipSFX 5.42 of 14 January 2001, by Info-ZIP (Zip-Bugs@lists.wku.edu).
creating: jdk1.5.0_12/
creating: jdk1.5.0_12/jre/
creating: jdk1.5.0_12/jre/bin/
inflating: jdk1.5.0_12/jre/bin/java
inflating: jdk1.5.0_12/jre/bin/keytool
inflating: jdk1.5.0_12/jre/bin/policytool
...

Set up a bunch of symbolic links. Note: this allows flexibility to change the versions in the future.

ln -s /usr/local/jdk1.5.0_12/ java
cd bin
ln -s /usr/local/jdk1.5.0_12/bin/java java
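
The payoff of the symlink layer shows up at upgrade time: a later Java upgrade is just re-pointing /usr/local/java at the new JDK directory, and nothing that references /usr/local/java needs to change. A sketch using stand-in directories under /tmp (the JDK paths here are placeholders):

```shell
# Demonstrate swapping a JDK behind a stable symlink, using scratch dirs.
base=/tmp/jdk_symlink_demo
rm -rf "$base" && mkdir -p "$base/jdk1.5.0_12/bin" "$base/jdk1.6.0/bin"
ln -s "$base/jdk1.5.0_12" "$base/java"      # initial install
ln -sfn "$base/jdk1.6.0" "$base/java"       # upgrade: -f replace, -n treat the old link as a file
readlink "$base/java"                       # prints: /tmp/jdk_symlink_demo/jdk1.6.0
```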

Check java is working

java -version

java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)


Edit .bash_profile to add JAVA_HOME and put the java and mysql binaries on the PATH, then reload it:

[pentaho@domU-12-31-35-00-53-92 ~]$ source .bash_profile

Example of bash_profile:

cat .bash_profile

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi

# User specific environment and startup programs

# JAVA_HOME must be set before it is used in PATH
JAVA_HOME=/usr/local/java/
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:/usr/local/mysql/bin

export PATH JAVA_HOME
unset USERNAME

Check the JAVA and MySQL versions and path:

java -version

java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)

mysql -V

mysql Ver 14.13 Distrib 5.1.20-beta, for pc-linux-gnu (i686) using readline 5.0

Get the Pentaho demo 1.2.1 GA zipfile (to be safe)

wget http://umn.dl.sourceforge.net/sourceforge/pentaho/pentaho_demo-1.2.1.625-GA.zip
unzip pentaho_demo-1.2.1.625-GA.zip -d /usr/local/pentaho

cd /usr/local/pentaho
chown -R root:pentaho .
ls -la

total 12
drwxr-xr-x 3 root pentaho 4096 Aug 25 03:01 .
drwxr-xr-x 15 root root 4096 Aug 25 03:00 ..
drwxr-xr-x 5 root pentaho 4096 Aug 25 03:01 pentaho-demo

Loading the sample data and checking what was created in MySQL database:

cd /usr/local/pentaho/pentaho-demo/data
mysql -u root -p < SampleDataDump_MySql.sql
mysql -u root -p
Enter password:

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.20-beta-log MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hibernate |
| mysql |
| quartz |
| sampledata |
| test |
+--------------------+
6 rows in set (0.00 sec)


mysql> use sampledata
Database changed
mysql> show tables;
+----------------------+
| Tables_in_sampledata |
+----------------------+
| CUSTOMERS |
| CUSTOMER_W_TER |
| DEPARTMENT_MANAGERS |
| EMPLOYEES |
| OFFICES |
| ORDERDETAILS |
| ORDERFACT |
| ORDERS |
| PAYMENTS |
| PRODUCTS |
| QUADRANT_ACTUALS |
| TIME |
| TRIAL_BALANCE |
+----------------------+
13 rows in set (0.00 sec)

mysql> show table status;
+---------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-------------------+----------+----------------+----------------------------------------------------------------------------------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+---------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-------------------+----------+----------------+----------------------------------------------------------------------------------+
| CUSTOMERS | InnoDB | 10 | Compact | 117 | 420 | 49152 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| CUSTOMER_W_TER | InnoDB | 10 | Compact | 103 | 477 | 49152 | 0 | 16384 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| DEPARTMENT_MANAGERS | InnoDB | 10 | Compact | 4 | 4096 | 16384 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| EMPLOYEES | InnoDB | 10 | Compact | 23 | 712 | 16384 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| OFFICES | InnoDB | 10 | Compact | 7 | 2340 | 16384 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| ORDERDETAILS | InnoDB | 10 | Compact | 2913 | 61 | 180224 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| ORDERFACT | InnoDB | 10 | Compact | 3027 | 173 | 524288 | 0 | 131072 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB; (`PRODUCTCODE`) REFER `sampledata`.`PRODUCTS`(`PRODUCTCOD |
| ORDERS | InnoDB | 10 | Compact | 227 | 216 | 49152 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:59 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| PAYMENTS | InnoDB | 10 | Compact | 272 | 60 | 16384 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:59 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| PRODUCTS | InnoDB | 10 | Compact | 91 | 720 | 65536 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:58 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| QUADRANT_ACTUALS | InnoDB | 10 | Compact | 148 | 110 | 16384 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:59 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| TIME | InnoDB | 10 | Compact | 207 | 237 | 49152 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:59 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
| TRIAL_BALANCE | InnoDB | 10 | Compact | 22 | 744 | 16384 | 0 | 0 | 0 | NULL | 2007-08-25 03:04:59 | NULL | NULL | latin1_general_cs | NULL | | InnoDB free: 11264 kB |
+---------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-------------------+----------+----------------+----------------------------------------------------------------------------------+
13 rows in set (0.01 sec)

The Pentaho 1.2.1 GA demo does not set the permissions correctly for Linux;
the files lack execute permission.


chmod -R +x /usr/local/pentaho/pentaho-demo/
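
A narrower variant (my own sketch, not from the Pentaho docs) is to grant execute permission only to the shell scripts rather than to every file, which avoids marking data files executable; demonstrated here on a stand-in directory:

```shell
# Narrower alternative to chmod -R +x: mark only *.sh scripts executable.
# Stand-in directory and file names for demonstration.
demo=/tmp/pentaho_chmod_demo
rm -rf "$demo" && mkdir -p "$demo/pentaho-demo"
touch "$demo/pentaho-demo/start-pentaho.sh" "$demo/pentaho-demo/readme.txt"
find "$demo/pentaho-demo" -name "*.sh" -exec chmod +x {} \;
```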

Start the server and redirect STDOUT and STDERR to the one file

mkdir -p /usr/local/pentaho/pentaho-demo/logs
cd /usr/local/pentaho/pentaho-demo
./start-pentaho.sh > logs/pentaho_`date +%Y%m%d`.log 2>&1 &
[1] 3375
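
A handy trick instead of watching tail -f by hand is to poll the log until the ready line turns up. A sketch against a stand-in log file (in real use, point it at logs/pentaho_YYYYMMDD.log):

```shell
# Block until the "server is ready" line appears in the log.
# The log here is a stand-in seeded with two lines from the real startup output.
log=/tmp/pentaho_ready_demo.log
{
  echo "04:26:13,863 INFO  [Server] Starting JBoss (MX MicroKernel)..."
  echo "04:29:36,794 INFO  [STDOUT] Pentaho BI Platform server is ready. (1.2.1-625 GA)"
} > "$log"
until grep -q "server is ready" "$log"; do sleep 1; done
echo "server is up"    # prints: server is up
```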

Check the output of the server log

tail -f /usr/local/pentaho/pentaho-demo/logs/pentaho_20070825.log

JAVA_HOME set to /usr/local/java/
JAVA is /usr/local/java//bin/java
=========================================================================

JBoss Bootstrap Environment

JBOSS_HOME: /usr/local/pentaho/pentaho-demo/jboss

JAVA: /usr/local/java//bin/java

JAVA_OPTS: -server -Xms128m -Xmx512m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterva
l=3600000 -Djava.awt.headless=true -Djava.io.tmpdir=/tmp/ -Dprogram.name=run.sh

CLASSPATH: /usr/local/pentaho/pentaho-demo/jboss/bin/run.jar:/usr/local/java//lib/tools.jar

=========================================================================

[Server@1a758cb]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@1a758cb]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@1a758cb]: Startup sequence initiated from main() method
[Server@1a758cb]: Loaded properties from [/usr/local/pentaho/pentaho-demo/data/server.properties]
[Server@1a758cb]: Initiating startup sequence...
[Server@1a758cb]: Server socket opened successfully in 6 ms.
04:26:13,863 INFO [Server] Starting JBoss (MX MicroKernel)...
04:26:13,865 INFO [Server] Release ID: JBoss [Zion] 4.0.4.GA (build: CVSTag=JBoss_4_0_4_GA date=200605151000)

04:26:13,866 INFO [Server] Home Dir: /usr/local/pentaho/pentaho-demo/jboss
04:26:13,947 INFO [Server] Home URL: file:/usr/local/pentaho/pentaho-demo/jboss/
04:26:13,949 INFO [Server] Patch URL: null
04:26:13,949 INFO [Server] Server Name: default
04:26:13,949 INFO [Server] Server Home Dir: /usr/local/pentaho/pentaho-demo/jboss/server/default
04:26:13,949 INFO [Server] Server Home URL: file:/usr/local/pentaho/pentaho-demo/jboss/server/default/
04:26:13,949 INFO [Server] Server Log Dir: /usr/local/pentaho/pentaho-demo/jboss/server/default/log
04:26:13,950 INFO [Server] Server Temp Dir: /usr/local/pentaho/pentaho-demo/jboss/server/default/tmp
04:26:13,950 INFO [Server] Root Deployment Filename: jboss-service.xml
04:26:14,735 INFO [ServerInfo] Java version: 1.5.0_12,Sun Microsystems Inc.
04:26:14,735 INFO [ServerInfo] Java VM: Java HotSpot(TM) Server VM 1.5.0_12-b04,Sun Microsystems Inc.
04:26:14,735 INFO [ServerInfo] OS-System: Linux 2.6.16-xenU,i386
04:26:16,761 INFO [Server] Core system initialized
[Server@1a758cb]: Database [index=0, id=0, db=file:sampledata/sampledata, alias=sampledata] opened sucessfully in 6835 ms.
[Server@1a758cb]: Database [index=1, id=1, db=file:shark/shark, alias=shark] opened sucessfully in 46 ms.
[Server@1a758cb]: Database [index=2, id=2, db=file:hibernate/hibernate, alias=hibernate] opened sucessfully in 13 ms.
[Server@1a758cb]: Database [index=3, id=3, db=file:quartz/quartz, alias=quartz] opened sucessfully in 79 ms.
[Server@1a758cb]: Startup sequence completed in 6984 ms.
...
04:29:36,794 INFO [STDOUT] Pentaho BI Platform server is ready. (1.2.1-625 GA)
04:29:44,906 INFO [TomcatDeployer] deploy, ctxPath=/sw-style, warUrl=.../deploy/sw-style.war/
04:29:45,985 INFO [Http11BaseProtocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080
04:29:46,132 INFO [ChannelSocket] JK: ajp13 listening on /0.0.0.0:8009
04:29:46,276 INFO [JkMain] Jk running ID=0 time=0/163 config=null
04:29:46,294 INFO [Server] JBoss (MX MicroKernel) [4.0.4.GA (build: CVSTag=JBoss_4_0_4_GA date=200605151000)] Started in 3m:32s:342ms

Edit the web.xml file to set the base-url

vi /usr/local/pentaho/pentaho-demo/jboss/server/default/deploy/pentaho.war/WEB-INF/web.xml

Replace localhost:8080 with your external EC2 DNS name, e.g. ec2-67-202-2-78.z-2.compute-1.amazonaws.com:8080
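
That replace can be scripted too. A sketch against a stand-in web.xml line (the real file holds the base-url inside a context-param); the hostname is the example hostname from this post, so substitute your own instance's public DNS name:

```shell
# Script the base-url edit on a stand-in web.xml fragment.
webxml=/tmp/baseurl_demo_web.xml
echo "<param-value>http://localhost:8080/pentaho/</param-value>" > "$webxml"
sed -i -e "s|localhost:8080|ec2-67-202-2-78.z-2.compute-1.amazonaws.com:8080|" "$webxml"
cat "$webxml"
```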

Friday, August 17, 2007

Weka Web service on EC2

Time is flying. We are working on getting a WEKA EC2 node presented as a web service, so essentially you can point your WSDL (Web Services Description Language) aware software at it and use the WEKA data mining tool as required.

Other stuff I have been working on and reviewing was the next build of the Pentaho BI Suite of products. I have the software downloaded, just requires time to run through the install and run the demos if available.

I am also looking at taking part in the Amazon Paid AMI program, so once these AMIs (Amazon Machine Images) are ready, anyone who signs up for Amazon EC2 (and S3) can run specific nodes as required.

Other stuff: I ran through an install of the Oracle SOA Suite as well, so I have a couple of articles in the pipeline.

I am interested in your thoughts on using paid AMIs versus a web service.

Have Fun

Paul

Tuesday, August 7, 2007

Weka Data mining on EC2 - testing

Once the VM was set up and running, it was time to see how WEKA performed in a virtual environment.

The performance on the EC2 node was good. These are not large datasets, though I have a couple of larger ones to play with in the near future, given you can join the Netflix Prize competition and download a dataset with 100 million data points (more than 2 GB).

If you are surprised at the length of this post: I have found in the past that when I am the one using a search engine, I want to see as much information as possible. There might be someone out there who in the future wants a quick solution to running WEKA without having to read the documentation.

Most of this work was guided by the README; once I got the hang of it, I got some datasets on Leukemia (ALL/AML) and ran WEKA on those.

Have Fun

Paul



List options for Weka Classifying

java weka.classifiers.trees.J48

Weka exception: No training file and no object input file given.

General options:

-t
Sets training file.
-T
Sets test file. If missing, a cross-validation will be performed on the training data.
-c
Sets index of class attribute (default: last).
-x
Sets number of folds for cross-validation (default: 10).
-s
Sets random number seed for cross-validation (default: 1).
-m
Sets file with cost matrix.
-l
Sets model input file.
-d
Sets model output file.
-v
Outputs no statistics for training data.
-o
Outputs statistics only, not the classifier.
-i
Outputs detailed information-retrieval statistics for each class.
-k
Outputs information-theoretic statistics.
-p
Only outputs predictions for test instances, along with attributes (0 for none).
-r
Only outputs cumulative margin distribution.
-z
Only outputs the source representation of the classifier, giving it the supplied name.
-g
Only outputs the graph representation of the classifier.

Options specific to weka.classifiers.trees.J48:

-U
Use unpruned tree.
-C
Set confidence threshold for pruning.
(default 0.25)
-M
Set minimum number of instances per leaf.
(default 2)
-R
Use reduced error pruning.
-N
Set number of folds for reduced error
pruning. One fold is used as pruning set.
(default 3)
-B
Use binary splits only.
-S
Don't perform subtree raising.
-L
Do not clean up after the tree has been built.
-A
Laplace smoothing for predicted probabilities.
-Q
Seed for random data shuffling (default 1).

Running the NaiveBayes Classifier on the labor dataset

java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/labor.arff

Naive Bayes Classifier

Class bad: Prior probability = 0.36

duration: Normal Distribution. Mean = 2 StandardDev = 0.7071 WeightSum = 20 Precision = 1.0
wage-increase-first-year: Normal Distribution. Mean = 2.6563 StandardDev = 0.8643 WeightSum = 20 Precision = 0.3125
wage-increase-second-year: Normal Distribution. Mean = 2.9524 StandardDev = 0.8193 WeightSum = 15 Precision = 0.35714285714285715
wage-increase-third-year: Normal Distribution. Mean = 2.0344 StandardDev = 0.1678 WeightSum = 4 Precision = 0.38749999999999996
cost-of-living-adjustment: Discrete Estimator. Counts = 10 2 6 (Total = 18)
working-hours: Normal Distribution. Mean = 39.4887 StandardDev = 1.8903 WeightSum = 19 Precision = 1.8571428571428572
pension: Discrete Estimator. Counts = 12 3 6 (Total = 21)
standby-pay: Normal Distribution. Mean = 2.5 StandardDev = 0.866 WeightSum = 4 Precision = 2.0
shift-differential: Normal Distribution. Mean = 2.4691 StandardDev = 1.5738 WeightSum = 9 Precision = 2.7777777777777777
education-allowance: Discrete Estimator. Counts = 4 10 (Total = 14)
statutory-holidays: Normal Distribution. Mean = 10.2 StandardDev = 0.805 WeightSum = 20 Precision = 1.2
vacation: Discrete Estimator. Counts = 12 8 3 (Total = 23)
longterm-disability-assistance: Discrete Estimator. Counts = 6 9 (Total = 15)
contribution-to-dental-plan: Discrete Estimator. Counts = 8 8 1 (Total = 17)
bereavement-assistance: Discrete Estimator. Counts = 10 4 (Total = 14)
contribution-to-health-plan: Discrete Estimator. Counts = 9 3 7 (Total = 19)


Class good: Prior probability = 0.64

duration: Normal Distribution. Mean = 2.25 StandardDev = 0.6821 WeightSum = 36 Precision = 1.0
wage-increase-first-year: Normal Distribution. Mean = 4.3837 StandardDev = 1.1773 WeightSum = 36 Precision = 0.3125
wage-increase-second-year: Normal Distribution. Mean = 4.447 StandardDev = 0.9805 WeightSum = 31 Precision = 0.35714285714285715
wage-increase-third-year: Normal Distribution. Mean = 4.5795 StandardDev = 0.7893 WeightSum = 11 Precision = 0.38749999999999996
cost-of-living-adjustment: Discrete Estimator. Counts = 14 8 3 (Total = 25)
working-hours: Normal Distribution. Mean = 37.5491 StandardDev = 2.9266 WeightSum = 32 Precision = 1.8571428571428572
pension: Discrete Estimator. Counts = 1 3 8 (Total = 12)
standby-pay: Normal Distribution. Mean = 11.2 StandardDev = 2.0396 WeightSum = 5 Precision = 2.0
shift-differential: Normal Distribution. Mean = 5.6818 StandardDev = 5.0584 WeightSum = 22 Precision = 2.7777777777777777
education-allowance: Discrete Estimator. Counts = 8 4 (Total = 12)
statutory-holidays: Normal Distribution. Mean = 11.4182 StandardDev = 1.2224 WeightSum = 33 Precision = 1.2
vacation: Discrete Estimator. Counts = 8 11 15 (Total = 34)
longterm-disability-assistance: Discrete Estimator. Counts = 16 1 (Total = 17)
contribution-to-dental-plan: Discrete Estimator. Counts = 3 9 14 (Total = 26)
bereavement-assistance: Discrete Estimator. Counts = 19 1 (Total = 20)
contribution-to-health-plan: Discrete Estimator. Counts = 1 8 15 (Total = 24)
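
As a side note, the priors appear to be Laplace-smoothed class frequencies, (count + 1) / (total + number of classes), given the 20 bad and 37 good contracts among the 57 instances. That is my reading of the output, not something from the Weka docs, but the arithmetic lines up:

```shell
# Hedged check: do the reported priors match Laplace-smoothed class
# frequencies for 20 bad / 37 good of 57 instances?
awk 'BEGIN { printf "bad=%.2f good=%.2f\n", (20 + 1) / (57 + 2), (37 + 1) / (57 + 2) }'
# prints: bad=0.36 good=0.64
```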


Time taken to build model: 0.01 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances 56 98.2456 %
Incorrectly Classified Instances 1 1.7544 %
Kappa statistic 0.961
Mean absolute error 0.0481
Root mean squared error 0.1532
Relative absolute error 10.5249 %
Root relative squared error 32.1057 %
Total Number of Instances 57
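
The percentage columns are nothing mysterious, just the ratio of correctly classified instances to the total:

```shell
# 56 of the 57 training instances were classified correctly.
awk 'BEGIN { printf "%.4f %%\n", 56 / 57 * 100 }'
# prints: 98.2456 %
```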


=== Confusion Matrix ===

a b <-- classified as
19 1 | a = bad
0 37 | b = good

=== Stratified cross-validation ===

Correctly Classified Instances 51 89.4737 %
Incorrectly Classified Instances 6 10.5263 %
Kappa statistic 0.7741
Mean absolute error 0.1042
Root mean squared error 0.2637
Relative absolute error 22.7763 %
Root relative squared error 55.2266 %
Total Number of Instances 57

=== Confusion Matrix ===

a b <-- classified as
18 2 | a = bad
4 33 | b = good

Trying a different classifier from the list on the same dataset

java weka.classifiers.lazy.IBk -t $WEKAHOME/data/labor.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification


Time taken to build model: 0 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correctly Classified Instances 57 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.0169
Root mean squared error 0.0169
Relative absolute error 3.7085 %
Root relative squared error 3.5513 %
Total Number of Instances 57


=== Confusion Matrix ===

a b <-- classified as
20 0 | a = bad
0 37 | b = good

=== Stratified cross-validation ===

Correctly Classified Instances 47 82.4561 %
Incorrectly Classified Instances 10 17.5439 %
Kappa statistic 0.6235
Mean absolute error 0.1876
Root mean squared error 0.4113
Relative absolute error 41.0144 %
Root relative squared error 86.1487 %
Total Number of Instances 57

=== Confusion Matrix ===

a b <-- classified as
16 4 | a = bad
6 31 | b = good

What the dataset looks like in ARFF format

cat $WEKAHOME/data/labor.arff

% Date: Tue, 15 Nov 88 15:44:08 EST
% From: stan
% To: aha@ICS.UCI.EDU
%
% 1. Title: Final settlements in labor negotitions in Canadian industry
%
% 2. Source Information
% -- Creators: Collective Barganing Review, montly publication,
% Labour Canada, Industrial Relations Information Service,
% Ottawa, Ontario, K1A 0J2, Canada, (819) 997-3117
% The data includes all collective agreements reached
% in the business and personal services sector for locals
% with at least 500 members (teachers, nurses, university
% staff, police, etc) in Canada in 87 and first quarter of 88.
% -- Donor: Stan Matwin, Computer Science Dept, University of Ottawa,
% 34 Somerset East, K1N 9B4, (stan@uotcsi2.bitnet)
% -- Date: November 1988
%
% 3. Past Usage:
% -- testing concept learning software, in particular
% an experimental method to learn two-tiered concept descriptions.
% The data was used to learn the description of an acceptable
% and unacceptable contract.
% The unacceptable contracts were either obtained by interviewing
% experts, or by inventing near misses.
% Examples of use are described in:
% Bergadano, F., Matwin, S., Michalski, R.,
% Zhang, J., Measuring Quality of Concept Descriptions,
% Procs. of the 3rd European Working Sessions on Learning,
% Glasgow, October 1988.
% Bergadano, F., Matwin, S., Michalski, R., Zhang, J.,
% Representing and Acquiring Imprecise and Context-dependent
% Concepts in Knowledge-based Systems, Procs. of ISMIS'88,
% North Holland, 1988.
% 4. Relevant Information:
% -- data was used to test 2tier approach with learning
% from positive and negative examples
%
% 5. Number of Instances: 57
%
% 6. Number of Attributes: 16
%
% 7. Attribute Information:
% 1. dur: duration of agreement
% [1..7]
% 2 wage1.wage : wage increase in first year of contract
% [2.0 .. 7.0]
% 3 wage2.wage : wage increase in second year of contract
% [2.0 .. 7.0]
% 4 wage3.wage : wage increase in third year of contract
% [2.0 .. 7.0]
% 5 cola : cost of living allowance
% [none, tcf, tc]
% 6 hours.hrs : number of working hours during week
% [35 .. 40]
% 7 pension : employer contributions to pension plan
% [none, ret_allw, empl_contr]
% 8 stby_pay : standby pay
% [2 .. 25]
% 9 shift_diff : shift differencial : supplement for work on II and III shift
% [1 .. 25]
% 10 educ_allw.boolean : education allowance
% [true false]
% 11 holidays : number of statutory holidays
% [9 .. 15]
% 12 vacation : number of paid vacation days
% [ba, avg, gnr]
% 13 lngtrm_disabil.boolean :
% employer's help during employee longterm disabil
% ity [true , false]
% 14 dntl_ins : employers contribution towards the dental plan
% [none, half, full]
% 15 bereavement.boolean : employer's financial contribution towards the
% covering the costs of bereavement
% [true , false]
% 16 empl_hplan : employer's contribution towards the health plan
% [none, half, full]
%
% 8. Missing Attribute Values: None
%
% 9. Class Distribution:
%
% 10. Exceptions from format instructions: no commas between attribute values.
%
%
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
3,3.5,4,5.1,'tcf',37,?,?,4,?,13,'generous',?,'full','yes','full','good'
1,3,?,?,'none',36,?,?,10,'no',11,'generous',?,?,?,?,'good'
2,4.5,4,?,'none',37,'empl_contr',?,?,?,11,'average',?,'full','yes',?,'good'
1,2.8,?,?,?,35,?,?,2,?,12,'below_average',?,?,?,?,'good'
1,2.1,?,?,'tc',40,'ret_allw',2,3,'no',9,'below_average','yes','half',?,'none','bad'
1,2,?,?,'none',38,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,4,5,?,'tcf',35,?,13,5,?,15,'generous',?,?,?,?,'good'
2,4.3,4.4,?,?,38,?,?,4,?,12,'generous',?,'full',?,'full','good'
2,2.5,3,?,?,40,'none',?,?,?,11,'below_average',?,?,?,?,'bad'
3,3.5,4,4.6,'tcf',27,?,?,?,?,?,?,?,?,?,?,'good'
2,4.5,4,?,?,40,?,?,4,?,10,'generous',?,'half',?,'full','good'
1,6,?,?,?,38,?,8,3,?,9,'generous',?,?,?,?,'good'
3,2,2,2,'none',40,'none',?,?,?,10,'below_average',?,'half','yes','full','bad'
2,4.5,4.5,?,'tcf',?,?,?,?,'yes',10,'below_average','yes','none',?,'half','good'
2,3,3,?,'none',33,?,?,?,'yes',12,'generous',?,?,'yes','full','good'
2,5,4,?,'none',37,?,?,5,'no',11,'below_average','yes','full','yes','full','good'
3,2,2.5,?,?,35,'none',?,?,?,10,'average',?,?,'yes','full','bad'
3,4.5,4.5,5,'none',40,?,?,?,'no',11,'average',?,'half',?,?,'good'
3,3,2,2.5,'tc',40,'none',?,5,'no',10,'below_average','yes','half','yes','full','bad'
2,2.5,2.5,?,?,38,'empl_contr',?,?,?,10,'average',?,?,?,?,'bad'
2,4,5,?,'none',40,'none',?,3,'no',10,'below_average','no','none',?,'none','bad'
3,2,2.5,2.1,'tc',40,'none',2,1,'no',10,'below_average','no','half','yes','full','bad'
2,2,2,?,'none',40,'none',?,?,'no',11,'average','yes','none','yes','full','bad'
1,2,?,?,'tc',40,'ret_allw',4,0,'no',11,'generous','no','none','no','none','bad'
1,2.8,?,?,'none',38,'empl_contr',2,3,'no',9,'below_average','yes','half',?,'none','bad'
3,2,2.5,2,?,37,'empl_contr',?,?,?,10,'average',?,?,'yes','none','bad'
2,4.5,4,?,'none',40,?,?,4,?,12,'average','yes','full','yes','half','good'
1,4,?,?,'none',?,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,2,3,?,'none',38,'empl_contr',?,?,'yes',12,'generous','yes','none','yes','full','bad'
2,2.5,2.5,?,'tc',39,'empl_contr',?,?,?,12,'average',?,?,'yes',?,'bad'
2,2.5,3,?,'tcf',40,'none',?,?,?,11,'below_average',?,?,'yes',?,'bad'
2,4,4,?,'none',40,'none',?,3,?,10,'below_average','no','none',?,'none','bad'
2,4.5,4,?,?,40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
2,4.5,4,?,'none',40,?,?,5,?,11,'average',?,'full','yes','full','good'
2,4.6,4.6,?,'tcf',38,?,?,?,?,?,?,'yes','half',?,'half','good'
2,5,4.5,?,'none',38,?,14,5,?,11,'below_average','yes',?,?,'full','good'
2,5.7,4.5,?,'none',40,'ret_allw',?,?,?,11,'average','yes','full','yes','full','good'
2,7,5.3,?,?,?,?,?,?,?,11,?,'yes','full',?,?,'good'
3,2,3,?,'tcf',?,'empl_contr',?,?,'yes',?,?,'yes','half','yes',?,'good'
3,3.5,4,4.5,'tcf',35,?,?,?,?,13,'generous',?,?,'yes','full','good'
3,4,3.5,?,'none',40,'empl_contr',?,6,?,11,'average','yes','full',?,'full','good'
3,5,4.4,?,'none',38,'empl_contr',10,6,?,11,'generous','yes',?,?,'full','good'
3,5,5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
3,6,6,4,?,35,?,?,14,?,9,'generous','yes','full','yes','full','good'
%
%
%
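
The '?' markers in the @data rows are missing values. You can reproduce Weka's missing-value percentages by counting '?' per column; a sketch for the fourth comma-separated field (wage-increase-third-year), shown on three sample rows (on the full labor.arff this comes to 42 of 57, the 74% in the Basic Statistics output):

```shell
# Count missing values ('?') in field 4 of the @data rows, skipping
# % comment lines and @ header lines. Three sample rows for demonstration.
awk -F, '!/^[%@]/ && NF > 3 && $4 == "?" { n++ } END { print n + 0 }' <<'EOF'
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
EOF
# prints: 2
```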

Basic Statistics and Validation of dataset

java weka.core.Instances $WEKAHOME/data/labor.arff

Relation Name: labor-neg-data
Num Instances: 57
Num Attributes: 17

Name Type Nom Int Real Missing Unique Dist
1 duration Num 0% 98% 0% 1 / 2% 0 / 0% 3
2 wage-increase-first-year Num 0% 49% 49% 1 / 2% 7 / 12% 17
3 wage-increase-second-year Num 0% 47% 33% 11 / 19% 8 / 14% 15
4 wage-increase-third-year Num 0% 14% 12% 42 / 74% 6 / 11% 9
5 cost-of-living-adjustment Nom 65% 0% 0% 20 / 35% 0 / 0% 3
6 working-hours Num 0% 89% 0% 6 / 11% 3 / 5% 8
7 pension Nom 47% 0% 0% 30 / 53% 0 / 0% 3
8 standby-pay Num 0% 16% 0% 48 / 84% 6 / 11% 7
9 shift-differential Num 0% 54% 0% 26 / 46% 5 / 9% 10
10 education-allowance Nom 39% 0% 0% 35 / 61% 0 / 0% 2
11 statutory-holidays Num 0% 93% 0% 4 / 7% 0 / 0% 6
12 vacation Nom 89% 0% 0% 6 / 11% 0 / 0% 3
13 longterm-disability-assis Nom 49% 0% 0% 29 / 51% 0 / 0% 2
14 contribution-to-dental-pl Nom 65% 0% 0% 20 / 35% 0 / 0% 3
15 bereavement-assistance Nom 53% 0% 0% 27 / 47% 0 / 0% 2
16 contribution-to-health-pl Nom 65% 0% 0% 20 / 35% 0 / 0% 3
17 class Nom 100% 0% 0% 0 / 0% 0 / 0% 2

Trying Associations

java weka.associations.Apriori -t $WEKAHOME/data/weather.nominal.arff

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Size of set of large itemsets L(2): 47

Size of set of large itemsets L(3): 39

Size of set of large itemsets L(4): 6

Best rules found:

1. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. outlook=overcast 4 ==> play=yes 4 conf:(1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
8. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 conf:(1)
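
The conf:(1) figures are just rule confidence: the support of antecedent plus consequent divided by the support of the antecedent. For rule 1, humidity=normal windy=FALSE covers 4 instances and all 4 also have play=yes:

```shell
# Confidence of rule 1: support(antecedent and consequent) / support(antecedent).
awk 'BEGIN { printf "conf=%g\n", 4 / 4 }'
# prints: conf=1
```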

Trying FILTER

java weka.filters.supervised.attribute.Discretize \
-i $WEKAHOME/data/iris.arff -c last

@relation iris-weka.filters.supervised.attribute.Discretize-Rfirst-last

@attribute sepallength {'\'(-inf-5.55]\'','\'(5.55-6.15]\'','\'(6.15-inf)\''}
@attribute sepalwidth {'\'(-inf-2.95]\'','\'(2.95-3.35]\'','\'(3.35-inf)\''}
@attribute petallength {'\'(-inf-2.45]\'','\'(2.45-4.75]\'','\'(4.75-inf)\''}
@attribute petalwidth {'\'(-inf-0.8]\'','\'(0.8-1.75]\'','\'(1.75-inf)\''}
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data

'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(-inf-2.95]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
...
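
Once the supervised Discretize filter has chosen its class-informed cut points, mapping a value onto an interval label is a simple binary search. A sketch using the sepallength cut points shown above (the plain labels below drop Weka's extra quoting):

```python
import bisect

# Cut points for sepallength as chosen by the supervised Discretize filter above.
cuts = [5.55, 6.15]
labels = ["(-inf-5.55]", "(5.55-6.15]", "(6.15-inf)"]

def discretize(value, cuts, labels):
    # bisect_left: a value equal to a cut point falls into the lower interval,
    # matching the half-open "(a-b]" convention in the attribute labels.
    return labels[bisect.bisect_left(cuts, value)]

print(discretize(5.0, cuts, labels))   # prints (-inf-5.55]
```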


Running an experiment

java weka.experiment.Experiment -r -T $WEKAHOME/data/iris.arff \
-D weka.experiment.InstancesResultListener \
-P weka.experiment.RandomSplitResultProducer -- \
-W weka.experiment.ClassifierSplitEvaluator -- \
-W weka.classifiers.rules.OneR

Experiment:
Runs from: 1 to: 10
Datasets: /usr/local/weka/data/iris.arff
Custom property iterator: off
ResultProducer: RandomSplitResultProducer: -P 66.0 -W weka.experiment.ClassifierSplitEvaluator --:
ResultListener: weka.experiment.InstancesResultListener@1270b73

Initializing...
RandomSplitResultProducer: setting additional measures for split evaluator
Iterating...
Postprocessing...

Running the Lazy Classifier on a larger dataset:

java weka.classifiers.lazy.IBk -t $WEKAHOME/data/soybean.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification


Time taken to build model: 0.01 seconds
Time taken to test model on training data: 4.38 seconds

=== Error on training data ===

Correctly Classified Instances 682 99.8536 %
Incorrectly Classified Instances 1 0.1464 %
Kappa statistic 0.9984
Mean absolute error 0.0029
Root mean squared error 0.0152
Relative absolute error 2.9949 %
Root relative squared error 6.9346 %
Total Number of Instances 683


=== Confusion Matrix ===

 a b c d e f g h i j k l m n o p q r s <-- classified as
 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
 0 0 0 0 0 0 0 92 0 0 0 0 0 0 0 0 0 0 0 | h = brown-spot
 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
 0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 | m = phyllosticta-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 90 0 0 0 0 | o = frog-eye-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 | r = 2-4-d-injury
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

=== Stratified cross-validation ===

Correctly Classified Instances 623 91.2152 %
Incorrectly Classified Instances 60 8.7848 %
Kappa statistic 0.9036
Mean absolute error 0.0122
Root mean squared error 0.0879
Relative absolute error 12.71 %
Root relative squared error 40.1285 %
Total Number of Instances 683

=== Confusion Matrix ===

 a b c d e f g h i j k l m n o p q r s <-- classified as
 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
 0 0 0 0 0 0 0 81 0 0 0 0 5 4 2 0 0 0 0 | h = brown-spot
 0 0 0 0 0 0 0 0 19 1 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
 0 0 0 0 0 0 0 0 2 17 0 0 1 0 0 0 0 0 0 | j = bacterial-pustule
 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
 0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
 0 0 0 0 0 0 0 6 0 0 0 0 13 0 1 0 0 0 0 | m = phyllosticta-leaf-spot
 0 0 0 0 0 0 0 4 0 0 0 0 0 81 6 0 0 0 0 | n = alternarialeaf-spot
 0 0 0 0 0 0 0 3 0 0 0 0 0 17 71 0 0 0 0 | o = frog-eye-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
 2 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 8 3 | r = 2-4-d-injury
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury
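
The kappa statistic reported above corrects raw accuracy for chance agreement: kappa = (po - pe) / (1 - pe), where po is the observed accuracy and pe is the accuracy expected from the row and column marginals alone. A minimal sketch (the small 2x2 matrix is illustrative, not taken from the output):

```python
def kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative 2x2 matrix: 85% observed accuracy, 60% chance agreement.
print(round(kappa([[20, 5], [10, 65]]), 3))   # prints 0.625
```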


Testing the Instances call

java weka.core.Instances $WEKAHOME/data/soybean.arff

Relation Name: soybean
Num Instances: 683
Num Attributes: 36

Name Type Nom Int Real Missing Unique Dist
1 date Nom 100% 0% 0% 1 / 0% 0 / 0% 7
2 plant-stand Nom 95% 0% 0% 36 / 5% 0 / 0% 2
3 precip Nom 94% 0% 0% 38 / 6% 0 / 0% 3
4 temp Nom 96% 0% 0% 30 / 4% 0 / 0% 3
5 hail Nom 82% 0% 0% 121 / 18% 0 / 0% 2
6 crop-hist Nom 98% 0% 0% 16 / 2% 0 / 0% 4
7 area-damaged Nom 100% 0% 0% 1 / 0% 0 / 0% 4
8 severity Nom 82% 0% 0% 121 / 18% 0 / 0% 3
9 seed-tmt Nom 82% 0% 0% 121 / 18% 0 / 0% 3
10 germination Nom 84% 0% 0% 112 / 16% 0 / 0% 3
11 plant-growth Nom 98% 0% 0% 16 / 2% 0 / 0% 2
12 leaves Nom 100% 0% 0% 0 / 0% 0 / 0% 2
13 leafspots-halo Nom 88% 0% 0% 84 / 12% 0 / 0% 3
14 leafspots-marg Nom 88% 0% 0% 84 / 12% 0 / 0% 3
15 leafspot-size Nom 88% 0% 0% 84 / 12% 0 / 0% 3
16 leaf-shread Nom 85% 0% 0% 100 / 15% 0 / 0% 2
17 leaf-malf Nom 88% 0% 0% 84 / 12% 0 / 0% 2
18 leaf-mild Nom 84% 0% 0% 108 / 16% 0 / 0% 3
19 stem Nom 98% 0% 0% 16 / 2% 0 / 0% 2
20 lodging Nom 82% 0% 0% 121 / 18% 0 / 0% 2
21 stem-cankers Nom 94% 0% 0% 38 / 6% 0 / 0% 4
22 canker-lesion Nom 94% 0% 0% 38 / 6% 0 / 0% 4
23 fruiting-bodies Nom 84% 0% 0% 106 / 16% 0 / 0% 2
24 external-decay Nom 94% 0% 0% 38 / 6% 0 / 0% 3
25 mycelium Nom 94% 0% 0% 38 / 6% 0 / 0% 2
26 int-discolor Nom 94% 0% 0% 38 / 6% 0 / 0% 3
27 sclerotia Nom 94% 0% 0% 38 / 6% 0 / 0% 2
28 fruit-pods Nom 88% 0% 0% 84 / 12% 0 / 0% 4
29 fruit-spots Nom 84% 0% 0% 106 / 16% 0 / 0% 4
30 seed Nom 87% 0% 0% 92 / 13% 0 / 0% 2
31 mold-growth Nom 87% 0% 0% 92 / 13% 0 / 0% 2
32 seed-discolor Nom 84% 0% 0% 106 / 16% 0 / 0% 2
33 seed-size Nom 87% 0% 0% 92 / 13% 0 / 0% 2
34 shriveling Nom 84% 0% 0% 106 / 16% 0 / 0% 2
35 roots Nom 95% 0% 0% 31 / 5% 0 / 0% 3
36 class Nom 100% 0% 0% 0 / 0% 0 / 0% 19

Using NaiveBayes Classifier on soybean data

java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/soybean.arff

Naive Bayes Classifier

Class diaporthe-stem-canker: Prior probability = 0.03

date: Discrete Estimator. Counts = 1 1 1 6 6 6 6 (Total = 27)
plant-stand: Discrete Estimator. Counts = 21 1 (Total = 22)
precip: Discrete Estimator. Counts = 1 1 21 (Total = 23)
temp: Discrete Estimator. Counts = 1 21 1 (Total = 23)
hail: Discrete Estimator. Counts = 20 2 (Total = 22)
crop-hist: Discrete Estimator. Counts = 1 7 8 8 (Total = 24)
area-damaged: Discrete Estimator. Counts = 18 4 1 1 (Total = 24)
severity: Discrete Estimator. Counts = 1 15 7 (Total = 23)
seed-tmt: Discrete Estimator. Counts = 12 10 1 (Total = 23)
...
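
Each "Discrete Estimator. Counts = ..." line is a per-class frequency table with Laplace smoothing: every count starts at 1, which is why the 20 diaporthe-stem-canker instances show up as plant-stand counts of 21 and 1 (Total = 22). The conditional probability is then just count divided by total. A small sketch of that idea (my own reconstruction, not Weka's actual class):

```python
def discrete_estimator(raw_counts):
    """Laplace-smoothed frequency estimates: start every bin at 1, then normalize."""
    counts = [c + 1 for c in raw_counts]
    total = sum(counts)
    return [c / total for c in counts]

# All 20 instances of the class have plant-stand = normal:
probs = discrete_estimator([20, 0])   # smoothed counts 21 and 1, total 22
print(round(probs[0], 4))             # prints 0.9545, i.e. 21/22
```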

Time taken to build model: 0.01 seconds
Time taken to test model on training data: 0.11 seconds

=== Error on training data ===

Correctly Classified Instances 640 93.7042 %
Incorrectly Classified Instances 43 6.2958 %
Kappa statistic 0.931
Mean absolute error 0.0081
Root mean squared error 0.0765
Relative absolute error 8.4277 %
Root relative squared error 34.8958 %
Total Number of Instances 683


=== Confusion Matrix ===

 a b c d e f g h i j k l m n o p q r s <-- classified as
 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
 0 0 0 0 0 0 0 79 0 0 0 0 5 4 4 0 0 0 0 | h = brown-spot
 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
 0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
 0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
 0 0 0 0 0 0 0 2 0 0 0 0 18 0 0 0 0 0 0 | m = phyllosticta-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
 0 0 0 0 0 0 0 3 0 0 0 0 0 21 66 1 0 0 0 | o = frog-eye-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 | r = 2-4-d-injury
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

=== Stratified cross-validation ===

Correctly Classified Instances 635 92.9722 %
Incorrectly Classified Instances 48 7.0278 %
Kappa statistic 0.923
Mean absolute error 0.0096
Root mean squared error 0.0817
Relative absolute error 9.9344 %
Root relative squared error 37.2742 %
Total Number of Instances 683

=== Confusion Matrix ===

 a b c d e f g h i j k l m n o p q r s <-- classified as
 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
 0 0 0 0 0 0 0 77 0 0 0 0 5 6 4 0 0 0 0 | h = brown-spot
 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
 0 0 0 0 0 0 0 0 2 18 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
 0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
 0 0 0 0 0 0 0 2 0 0 0 0 17 1 0 0 0 0 0 | m = phyllosticta-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
 0 0 0 0 0 0 0 3 0 0 0 0 0 22 65 1 0 0 0 | o = frog-eye-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 | r = 2-4-d-injury
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

The same dataset as a Pruned Decision Tree

java weka.classifiers.trees.J48 -t $WEKAHOME/data/soybean.arff

J48 pruned tree
------------------

leafspot-size = lt-1/8
| canker-lesion = dna
| | leafspots-marg = w-s-marg
| | | seed-size = norm: bacterial-blight (21.0/1.0)
| | | seed-size = lt-norm: bacterial-pustule (3.23/1.23)
| | leafspots-marg = no-w-s-marg: bacterial-pustule (17.91/0.91)
| | leafspots-marg = dna: bacterial-blight (0.0)
| canker-lesion = brown: bacterial-blight (0.0)
| canker-lesion = dk-brown-blk: phytophthora-rot (4.78/0.1)
| canker-lesion = tan: purple-seed-stain (11.23/0.23)
leafspot-size = gt-1/8
| roots = norm
| | mold-growth = absent
| | | fruit-spots = absent
| | | | leaf-malf = absent
| | | | | fruiting-bodies = absent
| | | | | | date = april: brown-spot (5.0)
| | | | | | date = may: brown-spot (24.0/1.0)
| | | | | | date = june
| | | | | | | precip = lt-norm: phyllosticta-leaf-spot (4.0)
| | | | | | | precip = norm: brown-spot (5.0/2.0)
| | | | | | | precip = gt-norm: brown-spot (21.0)
| | | | | | date = july
| | | | | | | precip = lt-norm: phyllosticta-leaf-spot (1.0)
| | | | | | | precip = norm: phyllosticta-leaf-spot (2.0)
| | | | | | | precip = gt-norm: frog-eye-leaf-spot (11.0/5.0)
| | | | | | date = august
| | | | | | | leaf-shread = absent
| | | | | | | | seed-tmt = none: alternarialeaf-spot (16.0/4.0)
| | | | | | | | seed-tmt = fungicide
| | | | | | | | | plant-stand = normal: frog-eye-leaf-spot (6.0)
| | | | | | | | | plant-stand = lt-normal: alternarialeaf-spot (5.0/1.0)
| | | | | | | | seed-tmt = other: frog-eye-leaf-spot (3.0)
| | | | | | | leaf-shread = present: alternarialeaf-spot (2.0)
| | | | | | date = september
| | | | | | | stem = norm: alternarialeaf-spot (44.0/4.0)
| | | | | | | stem = abnorm: frog-eye-leaf-spot (2.0)
| | | | | | date = october: alternarialeaf-spot (31.0/1.0)
| | | | | fruiting-bodies = present: brown-spot (34.0)
| | | | leaf-malf = present: phyllosticta-leaf-spot (10.0)
| | | fruit-spots = colored
| | | | fruit-pods = norm: brown-spot (2.0)
| | | | fruit-pods = diseased: frog-eye-leaf-spot (62.0)
| | | | fruit-pods = few-present: frog-eye-leaf-spot (0.0)
| | | | fruit-pods = dna: frog-eye-leaf-spot (0.0)
| | | fruit-spots = brown-w/blk-specks
| | | | crop-hist = diff-lst-year: brown-spot (0.0)
| | | | crop-hist = same-lst-yr: brown-spot (2.0)
| | | | crop-hist = same-lst-two-yrs: brown-spot (0.0)
| | | | crop-hist = same-lst-sev-yrs: frog-eye-leaf-spot (2.0)
| | | fruit-spots = distort: brown-spot (0.0)
| | | fruit-spots = dna: brown-stem-rot (9.0)
| | mold-growth = present
| | | leaves = norm: diaporthe-pod-&-stem-blight (7.25)
| | | leaves = abnorm: downy-mildew (20.0)
| roots = rotted
| | area-damaged = scattered: herbicide-injury (1.1/0.1)
| | area-damaged = low-areas: phytophthora-rot (30.03)
| | area-damaged = upper-areas: phytophthora-rot (0.0)
| | area-damaged = whole-field: herbicide-injury (3.66/0.66)
| roots = galls-cysts: cyst-nematode (7.81/0.17)
leafspot-size = dna
| int-discolor = none
| | leaves = norm
| | | stem-cankers = absent
| | | | canker-lesion = dna: diaporthe-pod-&-stem-blight (5.53)
| | | | canker-lesion = brown: purple-seed-stain (0.0)
| | | | canker-lesion = dk-brown-blk: purple-seed-stain (0.0)
| | | | canker-lesion = tan: purple-seed-stain (9.0)
| | | stem-cankers = below-soil: rhizoctonia-root-rot (19.0)
| | | stem-cankers = above-soil: anthracnose (0.0)
| | | stem-cankers = above-sec-nde: anthracnose (24.0)
| | leaves = abnorm
| | | stem = norm
| | | | plant-growth = norm: powdery-mildew (22.0/2.0)
| | | | plant-growth = abnorm: cyst-nematode (4.3/0.39)
| | | stem = abnorm
| | | | plant-stand = normal
| | | | | leaf-malf = absent
| | | | | | seed = norm: diaporthe-stem-canker (21.0/1.0)
| | | | | | seed = abnorm: anthracnose (9.0)
| | | | | leaf-malf = present: 2-4-d-injury (3.0)
| | | | plant-stand = lt-normal
| | | | | fruiting-bodies = absent: phytophthora-rot (50.16/7.61)
| | | | | fruiting-bodies = present
| | | | | | roots = norm: anthracnose (11.0/1.0)
| | | | | | roots = rotted: phytophthora-rot (12.89/2.15)
| | | | | | roots = galls-cysts: phytophthora-rot (0.0)
| int-discolor = brown
| | leaf-malf = absent: brown-stem-rot (35.73/0.73)
| | leaf-malf = present: 2-4-d-injury (3.15/0.68)
| int-discolor = black: charcoal-rot (22.22/2.22)

Number of Leaves : 61

Size of the tree : 93


Time taken to build model: 0.23 seconds
Time taken to test model on training data: 0.09 seconds

=== Error on training data ===

Correctly Classified Instances 658 96.3397 %
Incorrectly Classified Instances 25 3.6603 %
Kappa statistic 0.9598
Mean absolute error 0.0104
Root mean squared error 0.0625
Relative absolute error 10.7981 %
Root relative squared error 28.5358 %
Total Number of Instances 683


=== Confusion Matrix ===

 a b c d e f g h i j k l m n o p q r s <-- classified as
 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
 1 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
 0 0 0 0 0 0 0 90 0 0 0 0 0 0 2 0 0 0 0 | h = brown-spot
 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
 0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
 0 0 0 1 0 0 0 0 0 0 0 43 0 0 0 0 0 0 0 | l = anthracnose
 0 0 0 0 0 0 0 3 0 0 0 0 17 0 0 0 0 0 0 | m = phyllosticta-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 88 3 0 0 0 0 | n = alternarialeaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 10 81 0 0 0 0 | o = frog-eye-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 | r = 2-4-d-injury
 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 | s = herbicide-injury

=== Stratified cross-validation ===

Correctly Classified Instances 625 91.5081 %
Incorrectly Classified Instances 58 8.4919 %
Kappa statistic 0.9068
Mean absolute error 0.0135
Root mean squared error 0.0842
Relative absolute error 14.0484 %
Root relative squared error 38.4134 %
Total Number of Instances 683

=== Confusion Matrix ===

 a b c d e f g h i j k l m n o p q r s <-- classified as
 19 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
 1 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
 0 0 0 87 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | d = phytophthora-rot
 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
 0 0 0 0 0 0 0 85 0 0 0 0 2 1 4 0 0 0 0 | h = brown-spot
 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
 0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
 0 0 0 4 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 | l = anthracnose
 0 0 0 0 0 0 0 3 0 0 0 0 14 0 3 0 0 0 0 | m = phyllosticta-leaf-spot
 0 0 0 0 0 0 0 1 0 0 0 0 0 85 5 0 0 0 0 | n = alternarialeaf-spot
 0 0 0 0 0 0 0 3 0 0 0 0 1 20 67 0 0 0 0 | o = frog-eye-leaf-spot
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 14 0 | r = 2-4-d-injury
 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 2 3 | s = herbicide-injury

Building a data model

java weka.classifiers.trees.J48 -t $WEKAHOME/data/soybean.arff \
-i -k -d J48-data.model > J48-data.out &

On the segment data provided, build on one set and check against another

java weka.classifiers.trees.J48 -t $WEKAHOME/data/segment-test.arff \
-i -k -d J48-segment-data.model >J48-segment-data.out

The results:

[weka@domU-12-31-36-00-26-23 tutorial]$ ls -l
total 108
-rw-rw-r-- 1 weka weka 60556 Aug 1 09:13 J48-data.model
-rw-rw-r-- 1 weka weka 12906 Aug 1 09:13 J48-data.out
-rw-rw-r-- 1 weka weka 18784 Aug 1 09:17 J48-segment-data.model
-rw-rw-r-- 1 weka weka 6146 Aug 1 09:17 J48-segment-data.out

more J48-segment-data.out

J48 pruned tree
------------------

region-centroid-row <= 155
| intensity-mean <= 31.6296
| | hue-mean <= -1.84512
| | | hue-mean <= -2.22949
| | | | saturation-mean <= 0.48999: window (3.0)
| | | | saturation-mean > 0.48999: foliage (77.0)
| | | hue-mean > -2.22949
| | | | saturation-mean <= 0.864482
| | | | | rawgreen-mean <= 14.6667
| | | | | | region-centroid-col <= 100
| | | | | | | hue-mean <= -2.03349
| | | | | | | | hue-mean <= -2.14532: foliage (2.0)
| | | | | | | | hue-mean > -2.14532: window (13.0/3.0)
| | | | | | | hue-mean > -2.03349
| | | | | | | | region-centroid-row <= 150: brickface (2.0)
| | | | | | | | region-centroid-row > 150: window (2.0)
| | | | | | region-centroid-col > 100: window (56.0)
| | | | | rawgreen-mean > 14.6667
| | | | | | region-centroid-row <= 122: window (26.0/1.0)
| | | | | | region-centroid-row > 122
| | | | | | | region-centroid-col <= 165: cement (10.0)
| | | | | | | region-centroid-col > 165: window (4.0/1.0)
| | | | saturation-mean > 0.864482
| | | | | hue-mean <= -2.101: foliage (22.0)
| | | | | hue-mean > -2.101
| | | | | | region-centroid-row <= 132
| | | | | | | hue-mean <= -2.08047: foliage (9.0)
| | | | | | | hue-mean > -2.08047: window (3.0/1.0)
| | | | | | region-centroid-row > 132
| | | | | | | region-centroid-row <= 143: window (10.0)
| | | | | | | region-centroid-row > 143: foliage (2.0)
| | hue-mean > -1.84512
| | | exgreen-mean <= -5.77778
| | | | exred-mean <= -5.88889
| | | | | region-centroid-row <= 104: brickface (6.0)
| | | | | region-centroid-row > 104: foliage (3.0)
| | | | exred-mean > -5.88889: brickface (118.0/1.0)
| | | exgreen-mean > -5.77778
| | | | exred-mean <= -0.777778: grass (5.0/1.0)
| | | | exred-mean > -0.777778
| | | | | region-centroid-col <= 34: foliage (2.0)
| | | | | region-centroid-col > 34: window (14.0)
| intensity-mean > 31.6296
| | rawblue-mean <= 88.4444: cement (94.0/1.0)
| | rawblue-mean > 88.4444: sky (110.0)
region-centroid-row > 155
| rawred-mean <= 23.3333
| | exgreen-mean <= -3.77778: cement (5.0/1.0)
| | exgreen-mean > -3.77778: grass (118.0)
| rawred-mean > 23.3333: path (94.0)

Number of Leaves : 26

Size of the tree : 51


Time taken to build model: 0.45 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correctly Classified Instances 800 98.7654 %
Incorrectly Classified Instances 10 1.2346 %
Kappa statistic 0.9856
K&B Relative Info Score 79692.1947 %
K&B Information Score 2232.1312 bits 2.7557 bits/instance
Class complexity | order 0 2268.6706 bits 2.8008 bits/instance
Class complexity | scheme 45.7746 bits 0.0565 bits/instance
Complexity improvement (Sf) 2222.896 bits 2.7443 bits/instance
Mean absolute error 0.0058
Root mean squared error 0.054
Relative absolute error 2.3848 %
Root relative squared error 15.443 %
Total Number of Instances 810


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0.001 0.992 1 0.996 brickface
1 0 1 1 1 sky
0.959 0 1 0.959 0.979 foliage
0.973 0.003 0.982 0.973 0.977 cement
0.992 0.009 0.954 0.992 0.973 window
1 0 1 1 1 path
0.992 0.001 0.992 0.992 0.992 grass


=== Confusion Matrix ===

 a b c d e f g <-- classified as
 125 0 0 0 0 0 0 | a = brickface
 0 110 0 0 0 0 0 | b = sky
 0 0 117 1 4 0 0 | c = foliage
 0 0 0 107 2 0 1 | d = cement
 1 0 0 0 125 0 0 | e = window
 0 0 0 0 0 94 0 | f = path
 0 0 0 1 0 0 122 | g = grass

=== Stratified cross-validation ===

Correctly Classified Instances 757 93.4568 %
Incorrectly Classified Instances 53 6.5432 %
Kappa statistic 0.9235
K&B Relative Info Score 75326.8356 %
K&B Information Score 2110.05 bits 2.605 bits/instance
Class complexity | order 0 2268.8296 bits 2.801 bits/instance
Class complexity | scheme 37665.7637 bits 46.5009 bits/instance
Complexity improvement (Sf) -35396.9341 bits -43.6999 bits/instance
Mean absolute error 0.02
Root mean squared error 0.1312
Relative absolute error 8.1735 %
Root relative squared error 37.5168 %
Total Number of Instances 810

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.96 0.009 0.952 0.96 0.956 brickface
1 0.001 0.991 1 0.995 sky
0.844 0.022 0.873 0.844 0.858 foliage
0.9 0.01 0.934 0.9 0.917 cement
0.881 0.031 0.841 0.881 0.86 window
0.989 0.001 0.989 0.989 0.989 path
0.984 0.003 0.984 0.984 0.984 grass

=== Confusion Matrix ===

 a b c d e f g <-- classified as
 120 0 3 0 2 0 0 | a = brickface
 0 110 0 0 0 0 0 | b = sky
 4 0 103 1 14 0 0 | c = foliage
 0 1 2 99 5 1 2 | d = cement
 2 0 10 3 111 0 0 | e = window
 0 0 0 1 0 93 0 | f = path
 0 0 0 2 0 0 121 | g = grass

Checking meta classifier:

java weka.classifiers.meta.ClassificationViaRegression \
-W weka.classifiers.functions.LinearRegression \
-t $WEKAHOME/data/iris.arff -x 2 -- -S 1

Options: -W weka.classifiers.functions.LinearRegression -- -S 1

Classification via Regression

Classifier for class with index 0:


Linear Regression Model

class =

0.0656 * sepallength +
0.2425 * sepalwidth +
-0.2228 * petallength +
-0.0634 * petalwidth +
0.1225

Classifier for class with index 1:


Linear Regression Model

class =

-0.0215 * sepallength +
-0.4407 * sepalwidth +
0.2185 * petallength +
-0.4832 * petalwidth +
1.563

Classifier for class with index 2:


Linear Regression Model

class =

-0.0441 * sepallength +
0.1982 * sepalwidth +
0.0042 * petallength +
0.5465 * petalwidth +
-0.6854
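
ClassificationViaRegression fits one linear model per class against a 0/1 class-membership indicator and predicts the class whose model scores highest. A sketch evaluating the three models above on a typical setosa-like measurement (the instance values are illustrative, not taken from the run):

```python
# Coefficients from the three per-class linear models above:
# (sepallength, sepalwidth, petallength, petalwidth, intercept)
models = {
    "Iris-setosa":     ( 0.0656,  0.2425, -0.2228, -0.0634,  0.1225),
    "Iris-versicolor": (-0.0215, -0.4407,  0.2185, -0.4832,  1.563),
    "Iris-virginica":  (-0.0441,  0.1982,  0.0042,  0.5465, -0.6854),
}

def predict(x):
    """Score each class's regression model on instance x and return the argmax."""
    score = lambda w: sum(wi * xi for wi, xi in zip(w, x)) + w[4]
    return max(models, key=lambda c: score(models[c]))

pred = predict((5.1, 3.5, 1.4, 0.2))   # a typical Iris-setosa measurement
print(pred)                            # prints Iris-setosa
```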



Time taken to build model: 0.14 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances 127 84.6667 %
Incorrectly Classified Instances 23 15.3333 %
Kappa statistic 0.77
Mean absolute error 0.2164
Root mean squared error 0.2943
Relative absolute error 48.6997 %
Root relative squared error 62.4309 %
Total Number of Instances 150


=== Confusion Matrix ===

 a b c <-- classified as
 50 0 0 | a = Iris-setosa
 0 34 16 | b = Iris-versicolor
 0 7 43 | c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances 123 82 %
Incorrectly Classified Instances 27 18 %
Kappa statistic 0.73
Mean absolute error 0.2349
Root mean squared error 0.3157
Relative absolute error 52.8443 %
Root relative squared error 66.9658 %
Total Number of Instances 150

=== Confusion Matrix ===

 a b c <-- classified as
 49 1 0 | a = Iris-setosa
 0 33 17 | b = Iris-versicolor
 0 9 41 | c = Iris-virginica

Testing some real datasets now: Leukemia-ALLAML

The data can be found here http://research.i2r.a-star.edu.sg/rp/Leukemia/ALLAML.html

java weka.classifiers.trees.J48 -t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.tree.J48.model > Leukemia-ALLAML.tree.J48.out

The results:

more Leukemia-ALLAML.tree.J48.out

J48 pruned tree
------------------

attribute4847 <= 938: ALL (27.0)
attribute4847 > 938: AML (11.0)

Number of Leaves : 2

Size of the tree : 3
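
The whole model reduces to a single threshold on one gene's expression level, which makes it trivial to apply by hand. A sketch of the tree as a function:

```python
def classify(attribute4847):
    """The entire J48 tree above: one split on attribute4847."""
    return "ALL" if attribute4847 <= 938 else "AML"

print(classify(500))    # prints ALL
print(classify(1200))   # prints AML
```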


Time taken to build model: 1.13 seconds
Time taken to test model on training data: 0.07 seconds

=== Error on training data ===

Correctly Classified Instances 38 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
K&B Relative Info Score 3744.5181 %
K&B Information Score 33.0001 bits 0.8684 bits/instance
Class complexity | order 0 33.0001 bits 0.8684 bits/instance
Class complexity | scheme 0 bits 0 bits/instance
Complexity improvement (Sf) 33.0001 bits 0.8684 bits/instance
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 38


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0 1 1 1 ALL
1 0 1 1 1 AML


=== Confusion Matrix ===

 a b <-- classified as
 27 0 | a = ALL
 0 11 | b = AML

=== Error on test data ===

Correctly Classified Instances 31 91.1765 %
Incorrectly Classified Instances 3 8.8235 %
Kappa statistic 0.8198
K&B Relative Info Score 3160.6324 %
K&B Information Score 27.8544 bits 0.8192 bits/instance
Class complexity | order 0 34.609 bits 1.0179 bits/instance
Class complexity | scheme 3222 bits 94.7647 bits/instance
Complexity improvement (Sf) -3187.391 bits -93.7468 bits/instance
Mean absolute error 0.0882
Root mean squared error 0.297
Relative absolute error 18.9873 %
Root relative squared error 58.8575 %
Total Number of Instances 34

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.9 0.071 0.947 0.9 0.923 ALL
0.929 0.1 0.867 0.929 0.897 AML

=== Confusion Matrix ===

 a b <-- classified as
 18 2 | a = ALL
 1 13 | b = AML

Same data with NaiveBayes:

java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.NaiveBayes.J48.model > Leukemia-ALLAML.NaiveBayes.J48.out

Checking the results (trimmed):

tail -100 Leukemia-ALLAML.NaiveBayes.J48.out

attribute7096: Normal Distribution. Mean = 17632.9975 StandardDev = 5491.6378 WeightSum = 11 Precision = 885.6756756756756
attribute7097: Normal Distribution. Mean = 16260.5405 StandardDev = 1742.1779 WeightSum = 11 Precision = 687.9459459459459
attribute7098: Normal Distribution. Mean = 918.855 StandardDev = 410.0149 WeightSum = 11 Precision = 64.37837837837837
attribute7099: Normal Distribution. Mean = 280.1548 StandardDev = 173.7343 WeightSum = 11 Precision = 17.216216216216218
attribute7100: Normal Distribution. Mean = 59.3889 StandardDev = 189.0803 WeightSum = 11 Precision = 29.694444444444443
attribute7101: Normal Distribution. Mean = 11265.5725 StandardDev = 2448.7777 WeightSum = 11 Precision = 390.9189189189189
attribute7102: Normal Distribution. Mean = 10453.7396 StandardDev = 3122.437 WeightSum = 11 Precision = 419.6756756756757
attribute7103: Normal Distribution. Mean = 318.7273 StandardDev = 376.3747 WeightSum = 11 Precision = 47.37837837837838
attribute7104: Normal Distribution. Mean = 2731.2801 StandardDev = 1380.7546 WeightSum = 11 Precision = 236.56756756756758
attribute7105: Normal Distribution. Mean = -288.0413 StandardDev = 90.8241 WeightSum = 11 Precision = 11.606060606060606
attribute7106: Normal Distribution. Mean = 0 StandardDev = 63.0836 WeightSum = 11 Precision = 7.324324324324325
attribute7107: Normal Distribution. Mean = 300.6417 StandardDev = 114.8094 WeightSum = 11 Precision = 27.558823529411764
attribute7108: Normal Distribution. Mean = -6.5039 StandardDev = 40.9087 WeightSum = 11 Precision = 8.942857142857143
attribute7109: Normal Distribution. Mean = 249.1057 StandardDev = 80.4043 WeightSum = 11 Precision = 16.81081081081081
attribute7110: Normal Distribution. Mean = 56.7107 StandardDev = 49.8522 WeightSum = 11 Precision = 6.636363636363637
attribute7111: Normal Distribution. Mean = 63.7126 StandardDev = 31.0336 WeightSum = 11 Precision = 9.870967741935484
attribute7112: Normal Distribution. Mean = -16.5111 StandardDev = 217.4379 WeightSum = 11 Precision = 25.945945945945947
attribute7113: Normal Distribution. Mean = 267.1091 StandardDev = 128.0862 WeightSum = 11 Precision = 16.6
attribute7114: Normal Distribution. Mean = 122.4791 StandardDev = 87.51 WeightSum = 11 Precision = 17.054054054054053
attribute7115: Normal Distribution. Mean = 233.8717 StandardDev = 111.0206 WeightSum = 11 Precision = 11.588235294117647
attribute7116: Normal Distribution. Mean = 307.9662 StandardDev = 139.9155 WeightSum = 11 Precision = 24.37142857142857
attribute7117: Normal Distribution. Mean = -319.0614 StandardDev = 110.253 WeightSum = 11 Precision = 25.43243243243243
attribute7118: Normal Distribution. Mean = -2319.9951 StandardDev = 878.3917 WeightSum = 11 Precision = 105.89189189189189
attribute7119: Normal Distribution. Mean = 378.2703 StandardDev = 120.9712 WeightSum = 11 Precision = 94.56756756756756
attribute7120: Normal Distribution. Mean = 182.4489 StandardDev = 82.9293 WeightSum = 11 Precision = 10.1875
attribute7121: Normal Distribution. Mean = 797.0098 StandardDev = 352.9267 WeightSum = 11 Precision = 38.62162162162162
attribute7122: Normal Distribution. Mean = 11.3143 StandardDev = 56.262 WeightSum = 11 Precision = 11.314285714285715
attribute7123: Normal Distribution. Mean = 348.8624 StandardDev = 134.0911 WeightSum = 11 Precision = 67.32432432432432
attribute7124: Normal Distribution. Mean = -17.8909 StandardDev = 48.2762 WeightSum = 11 Precision = 4.685714285714286
attribute7125: Normal Distribution. Mean = 1109.484 StandardDev = 549.1813 WeightSum = 11 Precision = 57.2972972972973
attribute7126: Normal Distribution. Mean = 326.3333 StandardDev = 147.522 WeightSum = 11 Precision = 29.666666666666668
attribute7127: Normal Distribution. Mean = 8.5 StandardDev = 20.0873 WeightSum = 11 Precision = 5.5
attribute7128: Normal Distribution. Mean = 1145.2208 StandardDev = 1057.6857 WeightSum = 11 Precision = 91.28571428571429
attribute7129: Normal Distribution. Mean = -24.6494 StandardDev = 26.9834 WeightSum = 11 Precision = 3.7142857142857144
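
For numeric attributes, NaiveBayes models each class's values with a normal distribution; every "Mean = ... StandardDev = ..." line above parameterizes one such density. A sketch of the Gaussian likelihood it evaluates at prediction time (my own reconstruction, not Weka's NormalEstimator class):

```python
import math

def normal_density(x, mean, stddev):
    """Gaussian likelihood used for numeric attributes in naive Bayes."""
    z = (x - mean) / stddev
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * stddev)

# attribute7129 for this class: Mean = -24.6494, StandardDev = 26.9834
d = normal_density(-24.6494, -24.6494, 26.9834)   # density at the mean
```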


Time taken to build model: 0.42 seconds
Time taken to test model on training data: 1.28 seconds

=== Error on training data ===

Correctly Classified Instances 38 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
K&B Relative Info Score 3744.5181 %
K&B Information Score 33.0001 bits 0.8684 bits/instance
Class complexity | order 0 33.0001 bits 0.8684 bits/instance
Class complexity | scheme 0 bits 0 bits/instance
Complexity improvement (Sf) 33.0001 bits 0.8684 bits/instance
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 38


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0 1 1 1 ALL
1 0 1 1 1 AML


=== Confusion Matrix ===

a b <-- classified as
27 0 | a = ALL
0 11 | b = AML


=== Error on test data ===

Correctly Classified Instances 30 88.2353 %
Incorrectly Classified Instances 4 11.7647 %
Kappa statistic 0.7518
K&B Relative Info Score 2905.1505 %
K&B Information Score 25.6028 bits 0.753 bits/instance
Class complexity | order 0 34.609 bits 1.0179 bits/instance
Class complexity | scheme 4296 bits 126.3529 bits/instance
Complexity improvement (Sf) -4261.391 bits -125.335 bits/instance
Mean absolute error 0.1176
Root mean squared error 0.343
Relative absolute error 25.3165 %
Root relative squared error 67.9628 %
Total Number of Instances 34


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.95 0.214 0.864 0.95 0.905 ALL
0.786 0.05 0.917 0.786 0.846 AML


=== Confusion Matrix ===

a b <-- classified as
19 1 | a = ALL
3 11 | b = AML

Running off predictions:


java weka.classifiers.trees.J48 -t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.tree.J48.model -p 0 > Leukemia-ALLAML.tree.J48.out

more Leukemia-ALLAML.tree.J48.out

0 ALL 1.0 ALL
1 ALL 1.0 ALL
2 ALL 1.0 ALL
3 ALL 1.0 ALL
4 ALL 1.0 ALL
5 ALL 1.0 ALL
6 ALL 1.0 ALL
7 ALL 1.0 ALL
8 ALL 1.0 ALL
9 ALL 1.0 ALL
10 ALL 1.0 ALL
11 ALL 1.0 ALL
12 ALL 1.0 ALL
13 ALL 1.0 ALL
14 AML 1.0 ALL
15 ALL 1.0 ALL
16 AML 1.0 ALL
17 ALL 1.0 ALL
18 ALL 1.0 ALL
19 ALL 1.0 ALL
20 AML 1.0 AML
21 AML 1.0 AML
22 AML 1.0 AML
23 AML 1.0 AML
24 AML 1.0 AML
25 AML 1.0 AML
26 AML 1.0 AML
27 AML 1.0 AML
28 AML 1.0 AML
29 AML 1.0 AML
30 ALL 1.0 AML
31 AML 1.0 AML
32 AML 1.0 AML
33 AML 1.0 AML
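
The mismatches in a dump like this can be tallied directly. A quick awk sketch, assuming the columns of the `-p 0` output are instance number, predicted class, confidence, and actual class (which matches the rows above), counts the lines where the predicted and actual labels differ:

```shell
# Count misclassified instances in a Weka -p 0 prediction dump.
# Assumed columns: inst#, predicted, confidence, actual.
awk '$2 != $4 { errors++ } END { print errors+0 " misclassified of " NR }' \
    Leukemia-ALLAML.tree.J48.out
```

On the J48 dump above this flags instances 14, 16 and 30, i.e. 3 misclassified of 34.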

java -mx1024m weka.classifiers.bayes.NaiveBayes \
-t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.NaiveBayes.J48.model -p 0 > Leukemia-ALLAML.NaiveBayes.J48.pred

The results:

[weka@domU-12-31-36-00-26-23 tutorial]$ ls -l
total 3920
-rw-rw-r-- 1 weka weka 60556 Aug 1 09:13 J48-data.model
-rw-rw-r-- 1 weka weka 12906 Aug 1 09:13 J48-data.out
-rw-rw-r-- 1 weka weka 18784 Aug 1 09:17 J48-segment-data.model
-rw-rw-r-- 1 weka weka 6146 Aug 1 09:17 J48-segment-data.out
-rw-rw-r-- 1 weka weka 1506090 Aug 1 09:41 Leukemia-ALLAML.NaiveBayes.J48.model
-rw-rw-r-- 1 weka weka 1703981 Aug 1 09:36 Leukemia-ALLAML.NaiveBayes.J48.out
-rw-rw-r-- 1 weka weka 535 Aug 1 09:41 Leukemia-ALLAML.NaiveBayes.J48.pred
-rw-rw-r-- 1 weka weka 666093 Aug 1 09:40 Leukemia-ALLAML.tree.J48.model
-rw-rw-r-- 1 weka weka 535 Aug 1 09:40 Leukemia-ALLAML.tree.J48.out

more Leukemia-ALLAML.NaiveBayes.J48.pred

0 ALL 1.0 ALL
1 ALL 1.0 ALL
2 AML 1.0 ALL
3 ALL 1.0 ALL
4 ALL 1.0 ALL
5 ALL 1.0 ALL
6 ALL 1.0 ALL
7 ALL 1.0 ALL
8 ALL 1.0 ALL
9 ALL 1.0 ALL
10 ALL 1.0 ALL
11 ALL 1.0 ALL
12 ALL 1.0 ALL
13 ALL 1.0 ALL
14 ALL 1.0 ALL
15 ALL 1.0 ALL
16 ALL 1.0 ALL
17 ALL 1.0 ALL
18 ALL 1.0 ALL
19 ALL 1.0 ALL
20 AML 1.0 AML
21 AML 1.0 AML
22 AML 1.0 AML
23 AML 1.0 AML
24 ALL 1.0 AML
25 AML 1.0 AML
26 AML 1.0 AML
27 AML 1.0 AML
28 AML 1.0 AML
29 AML 1.0 AML
30 ALL 1.0 AML
31 AML 1.0 AML
32 ALL 1.0 AML
33 AML 1.0 AML
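
The same kind of one-liner can turn a prediction dump into an accuracy figure; a sketch, again assuming column 2 is the predicted class and column 4 the actual one:

```shell
# Accuracy from a -p 0 dump: fraction of rows where the predicted
# label (column 2) matches the actual label (column 4).
awk '$2 == $4 { ok++ } END { printf "%.4f %%\n", 100 * ok / NR }' \
    Leukemia-ALLAML.NaiveBayes.J48.pred
```

Run on the NaiveBayes dump above (4 mismatches out of 34, at instances 2, 24, 30 and 32), this gives 88.2353 %, agreeing with the "Error on test data" figures.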