Weka is a data mining and data discovery tool. We have already installed Weka as standalone software on an Amazon EC2 node. Refer to these links for the past articles on Weka on EC2.
Weka Data Mining on EC2 - install
Weka Data Mining on EC2 - testing
In the posting roadmap I mentioned we would be looking at some Grid or Web aware versions of Weka. Originally Weka was developed for use within a closed group. Individual researchers could run their data mining either on their own workstation or on a server.
Outline:
Grid Weka was developed at University College Dublin, Ireland. It adds Java code that allows Weka to offload various processing steps to both local and remote servers.
Install:
The Grid Weka HOWTO guide was good. It does assume, though, that you know how to install Java and set up the Java environment correctly.
- Install Java. Get the latest JDK 1.5 or higher from java.sun.com
- Download Grid Weka
- Place the weka.jar file in an appropriate location.
- Make sure your JAVA_HOME environment variable is set.
- Create and edit a file called .weka-parallel and place it in the user's home directory.
- Run the GridWeka Servers using command: java -classpath /yourpath/weka.jar weka.core.DistributedServer yourport &
- To access the remote servers you will need to open the port in your firewall.
Grid Weka works as expected. I haven't yet tested the real potential benefit of using remote servers to run multiple classifications in parallel; I will do so and post the results in a future article.
In these runs, the classifications took longer using the remote servers than running as a single dedicated process on a single server. Most of that time, though, was spent in network traffic.
Once again, the poor network performance between EC2 nodes is a killer for network-intensive applications.
I found a Java executable/library, JFS, which uses the idea of parallel streams. With it I was able to reduce the network time by around 20%.
This suggests that Grid Weka would benefit from being compression-aware, allowing the stream of data to be compressed on the fly, which could effectively double the network bandwidth at the expense of CPU. With CPU performance increasing, follow Google's lead and use the spare CPU time to compress everything.
Java's java.util.zip package provides stream classes (GZIPInputStream and GZIPOutputStream) that use gzip:
http://java.sun.com/developer/technicalArticles/Streams/ProgIOStreams/
I reviewed the code, and adding a gzip wrapper around the I/O stream looks easy; there are plenty of code examples out on the lazy net. More on that later...
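As a rough illustration of the gzip-wrapper idea, here is a minimal Java sketch. It is not Grid Weka's code: the class name GzipStreamSketch and its two helper methods are hypothetical, and an in-memory byte array stands in for the network socket. It only shows how the standard java.util.zip stream classes can be layered under Java object streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: layers gzip under Java object serialization.
// A byte array stands in for the socket stream between client and server.
public class GzipStreamSketch {

    // Serialize the payload through a gzip-compressing stream.
    public static byte[] writeCompressed(Serializable payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the outermost stream also finishes the gzip trailer.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new GZIPOutputStream(sink))) {
            out.writeObject(payload);
        }
        return sink.toByteArray();
    }

    // Decompress and deserialize what writeCompressed produced.
    public static Object readCompressed(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(bytes)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Repetitive text, loosely resembling rows of a data file.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("region-centroid-row,value-mean,rawred-mean\n");
        }
        String data = sb.toString();

        byte[] packed = writeCompressed(data);
        Object restored = readCompressed(packed);

        System.out.println("Original characters: " + data.length());
        System.out.println("Compressed bytes:    " + packed.length);
        System.out.println("Round trip intact:   " + data.equals(restored));
    }
}
```

On a live socket the compressed stream would need to be flushed after each message; GZIPOutputStream has a constructor taking a syncFlush flag (Java 7 and later) for that purpose. How much bandwidth is saved depends entirely on how compressible the data is.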
Full install and results:
[root@ip-10-251-71-99 ~]# id weka
uid=502(weka) gid=503(weka) groups=503(weka)
[root@ip-10-251-71-99 ~]# su - weka
[weka@ip-10-251-71-99 tutorial]$ env|grep JAVA
JAVA_HOME=/usr/local/java
[weka@ip-10-251-71-99 tutorial]$ java -version
java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)
[weka@ip-10-251-71-99 tutorial]$ ls
J48-data.model J48-segment-data.out Leukemia-ALLAML.NaiveBayes.J48.pred
J48-data.out Leukemia-ALLAML.NaiveBayes.J48.model Leukemia-ALLAML.tree.J48.model
J48-segment-data.model Leukemia-ALLAML.NaiveBayes.J48.out Leukemia-ALLAML.tree.J48.out
[weka@ip-10-251-71-99 tutorial]$ cd ..
[weka@ip-10-251-71-99 ~]$ ls
tutorial weka-3-4-11.zip
[weka@ip-10-251-71-99 ~]$ mkdir gridweka
[weka@ip-10-251-71-99 ~]$ cd gridweka/
[weka@ip-10-251-71-99 gridweka]$ wget http://cssa.ucd.ie/xin/weka/weka.jar
--19:04:04-- http://cssa.ucd.ie/xin/weka/weka.jar
=> `weka.jar'
Resolving cssa.ucd.ie... 193.1.132.54
Connecting to cssa.ucd.ie|193.1.132.54|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,926,948 (1.8M) [application/octet-stream]
100%[============================================================================>] 1,926,948 91.54K/s ETA 00:00
19:04:25 (90.80 KB/s) - `weka.jar' saved [1926948/1926948]
Starting two weka servers
[weka@ip-10-251-71-99 gridweka]$ pwd
/home/weka/gridweka
[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8001 &
[1] 2735
[weka@ip-10-251-71-99 gridweka]$ Thu Feb 21 19:11:37 EST 2008: Server started on port 8001
Thu Feb 21 19:11:37 EST 2008: Waiting for connections...
[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8002 &
[2] 2745
[weka@ip-10-251-71-99 gridweka]$ Thu Feb 21 19:15:37 EST 2008: Server started on port 8002
Thu Feb 21 19:15:37 EST 2008: Waiting for connections...
On the client machine, which happens to be the same box:
$ cat .weka-parallel
PORT=8001
ec2-72-44-33-131.compute-1.amazonaws.com
2
1024
Results
[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 \
-t $WEKAHOME/data/segment-challenge.arff -d segment.model -a
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 19:38:54 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 19:38:55 EST 2008: Processed job 0 request from localhost
Thu Feb 21 19:38:55 EST 2008: Connection job 0 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 19:38:57 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 19:38:57 EST 2008: Processed job 1 request from localhost
server 1: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 1
server 1: index 2
server 1: index 3
server 1: index 4
server 1: index 5
server 1: index 6
server 1: index 7
server 1: index 8
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 9
server 1: index 10
server 1: index 11
server 1: index 12
server 1: index 13
server 1: index 14
server 1: index 15
server 1: index 16
server 1: index 17
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 18
server 1: index 19
server 1: index 20
server 1: index 21
server 1: index 22
server 1: index 23
server 1: index 24
server 1: index 25
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 26
server 1: index 27
server 1: index 28
server 1: index 29
server 1: index 30
server 1: index 31
server 1: index 32
server 1: index 33
server 1: index 34
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 35
server 1: index 36
server 1: index 37
server 1: index 38
server 1: index 39
server 1: index 40
server 1: index 41
server 1: index 42
server 1: index 43
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 44
server 1: index 45
server 1: index 46
server 1: index 47
server 1: index 48
server 1: index 49
J48 pruned tree
------------------
region-centroid-row <= 155
| value-mean <= 91.4444
| | rawred-mean <= 24.6667
| | | hue-mean <= -1.89048
| | | | hue-mean <= -2.22266
| | | | | region-centroid-row <= 146: foliage (102.0/1.0)
| | | | | region-centroid-row > 146: cement (3.0)
| | | | hue-mean > -2.22266
| | | | | rawred-mean <= 2.55556
| | | | | | hue-mean <= -2.09121
| | | | | | | region-centroid-row <= 129: foliage (50.0)
| | | | | | | region-centroid-row > 129
| | | | | | | | region-centroid-col <= 128
| | | | | | | | | rawred-mean <= 0.666667: foliage (30.0/4.0)
| | | | | | | | | rawred-mean > 0.666667: window (5.0)
| | | | | | | | region-centroid-col > 128
| | | | | | | | | vedge-mean <= 0.333334: window (11.0)
| | | | | | | | | vedge-mean > 0.333334
| | | | | | | | | | region-centroid-col <= 216: window (3.0)
| | | | | | | | | | region-centroid-col > 216: foliage (2.0)
| | | | | | hue-mean > -2.09121: window (38.0/1.0)
| | | | | rawred-mean > 2.55556
| | | | | | region-centroid-row <= 121
| | | | | | | exgreen-mean <= -15.4444: brickface (2.0/1.0)
| | | | | | | exgreen-mean > -15.4444
| | | | | | | | vedge-mean <= 2.94444: window (75.0)
| | | | | | | | vedge-mean > 2.94444
| | | | | | | | | region-centroid-col <= 134: cement (2.0)
| | | | | | | | | region-centroid-col > 134: window (8.0)
| | | | | | region-centroid-row > 121
| | | | | | | rawred-mean <= 7.88889
| | | | | | | | region-centroid-col <= 43: brickface (2.0)
| | | | | | | | region-centroid-col > 43: window (13.0/2.0)
| | | | | | | rawred-mean > 7.88889
| | | | | | | | saturation-mean <= 0.492526: cement (15.0)
| | | | | | | | saturation-mean > 0.492526
| | | | | | | | | region-centroid-col <= 82: foliage (2.0)
| | | | | | | | | region-centroid-col > 82: cement (4.0/1.0)
| | | hue-mean > -1.89048
| | | | exgreen-mean <= -4.77778
| | | | | vedge-mean <= 2.77778: brickface (198.0/2.0)
| | | | | vedge-mean > 2.77778
| | | | | | region-centroid-row <= 115: brickface (4.0)
| | | | | | region-centroid-row > 115: foliage (3.0/1.0)
| | | | exgreen-mean > -4.77778
| | | | | hedge-mean <= 0.833335
| | | | | | region-centroid-col <= 115: foliage (4.0)
| | | | | | region-centroid-col > 115: window (42.0)
| | | | | hedge-mean > 0.833335: grass (2.0)
| | rawred-mean > 24.6667
| | | hue-mean <= -2.17742
| | | | vedge-mean <= 5: window (4.0/1.0)
| | | | vedge-mean > 5: foliage (18.0)
| | | hue-mean > -2.17742
| | | | rawgreen-mean <= 24.4444: brickface (3.0/1.0)
| | | | rawgreen-mean > 24.4444: cement (180.0)
| value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
| exgreen-mean <= -2
| | saturation-mean <= 0.385555
| | | region-centroid-row <= 159
| | | | region-centroid-col <= 208: cement (3.0)
| | | | region-centroid-col > 208: path (2.0)
| | | region-centroid-row > 159: path (234.0)
| | saturation-mean > 0.385555: cement (11.0)
| exgreen-mean > -2: grass (205.0)
Number of Leaves : 34
Size of the tree : 67
Time taken to build model: 4.19 seconds
Time taken to test model on training data: 0.03 seconds
=== Error on training data ===
Correctly Classified Instances 1485 99 %
Incorrectly Classified Instances 15 1 %
Kappa statistic 0.9883
Mean absolute error 0.0029
Root mean squared error 0.0535
Relative absolute error 1.1672 %
Root relative squared error 15.2785 %
Total Number of Instances 1500
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 205   0   0   0   0   0   0 |   a = brickface
   0 220   0   0   0   0   0 |   b = sky
   1   0 205   0   2   0   0 |   c = foliage
   1   0   0 217   2   0   0 |   d = cement
   2   0   6   1 195   0   0 |   e = window
   0   0   0   0   0 236   0 |   f = path
   0   0   0   0   0   0 207 |   g = grass
=== Stratified cross-validation ===
Correctly Classified Instances 1450 96.6667 %
Incorrectly Classified Instances 50 3.3333 %
Kappa statistic 0.9611
Mean absolute error 0.0095
Root mean squared error 0.0976
Relative absolute error 3.8904 %
Root relative squared error 27.8938 %
Total Number of Instances 1500
Cross-validation ran in parallel using this computer and the following machines:
localhost/127.0.0.1
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 199   0   1   2   3   0   0 |   a = brickface
   0 220   0   0   0   0   0 |   b = sky
   0   1 196   3   8   0   0 |   c = foliage
   0   0   6 209   5   0   0 |   d = cement
   2   0   9   6 187   0   0 |   e = window
   0   0   0   2   0 234   0 |   f = path
   0   0   0   0   2   0 205 |   g = grass
Thu Feb 21 19:39:10 EST 2008: Connection job 1 with localhost closed.
Test remote client. Use telnet first to check the port
telnet ec2-72-44-33-131.compute-1.amazonaws.com 8001
Escape Character is 'CTRL+]'
Telnet> quit
On the server
[weka@ip-10-251-71-99 gridweka]$ Thu Feb 21 19:48:57 EST 2008: Connection job 0 with 203-214-155-114.dyn.iinet.net.au closed.
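The telnet check above can also be done from Java, which is handy when telnet is not installed on the client. The following is a minimal sketch using only the standard java.net classes; the class name PortCheckSketch, the method isReachable, and the 3-second timeout are illustrative choices, not part of Grid Weka.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: tests whether a TCP port accepts connections,
// much like the telnet check, before pointing a Grid Weka client at it.
public class PortCheckSketch {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    public static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8001;
        boolean open = isReachable(host, port, 3000);
        System.out.println(host + ":" + port
                + (open ? " is reachable" : " is not reachable"));
    }
}
```

A successful connect only shows that something is listening on the port and that the firewall lets the connection through; it does not confirm that the listener is a Grid Weka server.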
Issues with configuration file location
Continuing to run the client on Linux, specifying the -C option to use 2 parallel servers.
java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 \
-T $WEKAHOME/data/segment-test.arff -l segment.model -a -C 2
Thu Feb 21 20:31:32 EST 2008: Processed job 0 request from localhost
Thu Feb 21 20:31:32 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 20:31:32 EST 2008: Processed job 5 request from localhost
Thu Feb 21 20:31:32 EST 2008: Connection job 5 with localhost closed.
J48 pruned tree
------------------
region-centroid-row <= 155
| value-mean <= 91.4444
| | rawred-mean <= 24.6667
| | | hue-mean <= -1.89048
| | | | hue-mean <= -2.22266
| | | | | region-centroid-row <= 146: foliage (102.0/1.0)
| | | | | region-centroid-row > 146: cement (3.0)
| | | | hue-mean > -2.22266
| | | | | rawred-mean <= 2.55556
| | | | | | hue-mean <= -2.09121
| | | | | | | region-centroid-row <= 129: foliage (50.0)
| | | | | | | region-centroid-row > 129
| | | | | | | | region-centroid-col <= 128
| | | | | | | | | rawred-mean <= 0.666667: foliage (30.0/4.0)
| | | | | | | | | rawred-mean > 0.666667: window (5.0)
| | | | | | | | region-centroid-col > 128
| | | | | | | | | vedge-mean <= 0.333334: window (11.0)
| | | | | | | | | vedge-mean > 0.333334
| | | | | | | | | | region-centroid-col <= 216: window (3.0)
| | | | | | | | | | region-centroid-col > 216: foliage (2.0)
| | | | | | hue-mean > -2.09121: window (38.0/1.0)
| | | | | rawred-mean > 2.55556
| | | | | | region-centroid-row <= 121
| | | | | | | exgreen-mean <= -15.4444: brickface (2.0/1.0)
| | | | | | | exgreen-mean > -15.4444
| | | | | | | | vedge-mean <= 2.94444: window (75.0)
| | | | | | | | vedge-mean > 2.94444
| | | | | | | | | region-centroid-col <= 134: cement (2.0)
| | | | | | | | | region-centroid-col > 134: window (8.0)
| | | | | | region-centroid-row > 121
| | | | | | | rawred-mean <= 7.88889
| | | | | | | | region-centroid-col <= 43: brickface (2.0)
| | | | | | | | region-centroid-col > 43: window (13.0/2.0)
| | | | | | | rawred-mean > 7.88889
| | | | | | | | saturation-mean <= 0.492526: cement (15.0)
| | | | | | | | saturation-mean > 0.492526
| | | | | | | | | region-centroid-col <= 82: foliage (2.0)
| | | | | | | | | region-centroid-col > 82: cement (4.0/1.0)
| | | hue-mean > -1.89048
| | | | exgreen-mean <= -4.77778
| | | | | vedge-mean <= 2.77778: brickface (198.0/2.0)
| | | | | vedge-mean > 2.77778
| | | | | | region-centroid-row <= 115: brickface (4.0)
| | | | | | region-centroid-row > 115: foliage (3.0/1.0)
| | | | exgreen-mean > -4.77778
| | | | | hedge-mean <= 0.833335
| | | | | | region-centroid-col <= 115: foliage (4.0)
| | | | | | region-centroid-col > 115: window (42.0)
| | | | | hedge-mean > 0.833335: grass (2.0)
| | rawred-mean > 24.6667
| | | hue-mean <= -2.17742
| | | | vedge-mean <= 5: window (4.0/1.0)
| | | | vedge-mean > 5: foliage (18.0)
| | | hue-mean > -2.17742
| | | | rawgreen-mean <= 24.4444: brickface (3.0/1.0)
| | | | rawgreen-mean > 24.4444: cement (180.0)
| value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
| exgreen-mean <= -2
| | saturation-mean <= 0.385555
| | | region-centroid-row <= 159
| | | | region-centroid-col <= 208: cement (3.0)
| | | | region-centroid-col > 208: path (2.0)
| | | region-centroid-row > 159: path (234.0)
| | saturation-mean > 0.385555: cement (11.0)
| exgreen-mean > -2: grass (205.0)
Number of Leaves : 34
Size of the tree : 67
=== Error on test data ===
Correctly Classified Instances 779 96.1728 %
Incorrectly Classified Instances 31 3.8272 %
Kappa statistic 0.9553
Mean absolute error 0.0109
Root mean squared error 0.1046
Relative absolute error 4.4715 %
Root relative squared error 29.905 %
Total Number of Instances 810
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 124   0   0   0   1   0   0 |   a = brickface
   0 110   0   0   0   0   0 |   b = sky
   1   0 119   0   2   0   0 |   c = foliage
   1   0   0 107   2   0   0 |   d = cement
   1   0  12   7 105   0   1 |   e = window
   0   0   0   0   0  94   0 |   f = path
   0   0   1   0   0   2 120 |   g = grass
Running parallel cross-validation on 2 servers
java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 -t \
$WEKAHOME/data/segment-challenge.arff -d segment.model -x 10 -a -C 2
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 20:34:09 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 20:34:09 EST 2008: Processed job 6 request from localhost
Thu Feb 21 20:34:10 EST 2008: Connection job 6 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Using 2st Server(2) to do crossValidate.
Thu Feb 21 20:34:13 EST 2008: Processed job 1 request from localhost
server 2: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 1 time 571
------------------------------------
server 2: index 1
server 2: index 2
server 2: index 3
server 2: index 4
server 2: index 5
server 2: index 6
server 2: index 7
server 2: index 8
server 2: index 9
J48 pruned tree
------------------
region-centroid-row <= 155
| value-mean <= 91.4444
| | rawred-mean <= 24.6667
| | | hue-mean <= -1.89048
| | | | hue-mean <= -2.22266
| | | | | region-centroid-row <= 146: foliage (102.0/1.0)
| | | | | region-centroid-row > 146: cement (3.0)
| | | | hue-mean > -2.22266
| | | | | rawred-mean <= 2.55556
| | | | | | hue-mean <= -2.09121
| | | | | | | region-centroid-row <= 129: foliage (50.0)
| | | | | | | region-centroid-row > 129
| | | | | | | | region-centroid-col <= 128
| | | | | | | | | rawred-mean <= 0.666667: foliage (30.0/4.0)
| | | | | | | | | rawred-mean > 0.666667: window (5.0)
| | | | | | | | region-centroid-col > 128
| | | | | | | | | vedge-mean <= 0.333334: window (11.0)
| | | | | | | | | vedge-mean > 0.333334
| | | | | | | | | | region-centroid-col <= 216: window (3.0)
| | | | | | | | | | region-centroid-col > 216: foliage (2.0)
| | | | | | hue-mean > -2.09121: window (38.0/1.0)
| | | | | rawred-mean > 2.55556
| | | | | | region-centroid-row <= 121
| | | | | | | exgreen-mean <= -15.4444: brickface (2.0/1.0)
| | | | | | | exgreen-mean > -15.4444
| | | | | | | | vedge-mean <= 2.94444: window (75.0)
| | | | | | | | vedge-mean > 2.94444
| | | | | | | | | region-centroid-col <= 134: cement (2.0)
| | | | | | | | | region-centroid-col > 134: window (8.0)
| | | | | | region-centroid-row > 121
| | | | | | | rawred-mean <= 7.88889
| | | | | | | | region-centroid-col <= 43: brickface (2.0)
| | | | | | | | region-centroid-col > 43: window (13.0/2.0)
| | | | | | | rawred-mean > 7.88889
| | | | | | | | saturation-mean <= 0.492526: cement (15.0)
| | | | | | | | saturation-mean > 0.492526
| | | | | | | | | region-centroid-col <= 82: foliage (2.0)
| | | | | | | | | region-centroid-col > 82: cement (4.0/1.0)
| | | hue-mean > -1.89048
| | | | exgreen-mean <= -4.77778
| | | | | vedge-mean <= 2.77778: brickface (198.0/2.0)
| | | | | vedge-mean > 2.77778
| | | | | | region-centroid-row <= 115: brickface (4.0)
| | | | | | region-centroid-row > 115: foliage (3.0/1.0)
| | | | exgreen-mean > -4.77778
| | | | | hedge-mean <= 0.833335
| | | | | | region-centroid-col <= 115: foliage (4.0)
| | | | | | region-centroid-col > 115: window (42.0)
| | | | | hedge-mean > 0.833335: grass (2.0)
| | rawred-mean > 24.6667
| | | hue-mean <= -2.17742
| | | | vedge-mean <= 5: window (4.0/1.0)
| | | | vedge-mean > 5: foliage (18.0)
| | | hue-mean > -2.17742
| | | | rawgreen-mean <= 24.4444: brickface (3.0/1.0)
| | | | rawgreen-mean > 24.4444: cement (180.0)
| value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
| exgreen-mean <= -2
| | saturation-mean <= 0.385555
| | | region-centroid-row <= 159
| | | | region-centroid-col <= 208: cement (3.0)
| | | | region-centroid-col > 208: path (2.0)
| | | region-centroid-row > 159: path (234.0)
| | saturation-mean > 0.385555: cement (11.0)
| exgreen-mean > -2: grass (205.0)
Number of Leaves : 34
Size of the tree : 67
Time taken to build model: 0.96 seconds
Time taken to test model on training data: 0.09 seconds
=== Error on training data ===
Correctly Classified Instances 1485 99 %
Incorrectly Classified Instances 15 1 %
Kappa statistic 0.9883
Mean absolute error 0.0029
Root mean squared error 0.0535
Relative absolute error 1.1672 %
Root relative squared error 15.2785 %
Total Number of Instances 1500
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 205   0   0   0   0   0   0 |   a = brickface
   0 220   0   0   0   0   0 |   b = sky
   1   0 205   0   2   0   0 |   c = foliage
   1   0   0 217   2   0   0 |   d = cement
   2   0   6   1 195   0   0 |   e = window
   0   0   0   0   0 236   0 |   f = path
   0   0   0   0   0   0 207 |   g = grass
=== Stratified cross-validation ===
Correctly Classified Instances 1436 95.7333 %
Incorrectly Classified Instances 64 4.2667 %
Kappa statistic 0.9502
Mean absolute error 0.0122
Root mean squared error 0.1104
Relative absolute error 4.9799 %
Root relative squared error 31.5589 %
Total Number of Instances 1500
Cross-validation ran in parallel using this computer and the following machines:
localhost/127.0.0.1
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 196   0   3   1   5   0   0 |   a = brickface
   0 220   0   0   0   0   0 |   b = sky
   0   1 196   2   9   0   0 |   c = foliage
   2   0   4 207   6   1   0 |   d = cement
   3   0  16   6 179   0   0 |   e = window
   0   0   0   3   0 233   0 |   f = path
   0   0   0   0   2   0 205 |   g = grass
Thu Feb 21 20:34:18 EST 2008: Connection job 1 with localhost closed.
Thu Feb 21 20:34:18 EST 2008: Connection job 0 with localhost closed.
Ok starting another server with two more GridWeka servers
[weka@ip-10-251-69-175 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8001 &
[1] 2694
[weka@ip-10-251-69-175 gridweka]$ Thu Feb 21 20:56:32 EST 2008: Server started on port 8001
Thu Feb 21 20:56:32 EST 2008: Waiting for connections...
[weka@ip-10-251-69-175 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8002 &
[2] 2704
[weka@ip-10-251-69-175 gridweka]$ Thu Feb 21 20:56:40 EST 2008: Server started on port 8002
Thu Feb 21 20:56:40 EST 2008: Waiting for connections...
[weka@ip-10-251-69-175 gridweka]$ hostname
ip-10-251-69-175
Updating the .weka-parallel file
[weka@ip-10-251-71-99 gridweka]$ vi /home/weka/.weka-parallel
[weka@ip-10-251-71-99 gridweka]$ cat /home/weka/.weka-parallel
PORT=8001
localhost
2
1024
ip-10-251-69-175
2
1024
OK, running the previous classification with cross-validation, this time on 2 servers each running 2 GridWeka servers.
[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar \
weka.classifiers.trees.J48 -t $WEKAHOME/data/segment-challenge.arff \
-d segment.model -x 10 -a -C 4
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 21:02:51 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 21:02:51 EST 2008: Processed job 9 request from localhost
Thu Feb 21 21:02:51 EST 2008: Connection job 9 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Using 2st Server(2) to do crossValidate.
---Judgement--- server 3: Memory free : 524288000 --> Passed!
Using 3st Server(3) to do crossValidate.
---Judgement--- server 4: Memory free : 524288000 --> Passed!
Using 4st Server(4) to do crossValidate.
server 3: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
server 3: index 1
**************** Checking servers' status ****************
Thu Feb 21 21:03:04 EST 2008: Processed job 3 request from localhost
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
server 3: index 3
server 3: index 4
server 2: index 2
--------- Refresh the Rank ---------
--> connectInfo 0: rank 3 time 0
--> connectInfo 1: rank 4 time 0
--> connectInfo 2: rank 2 time 608
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
Thu Feb 21 21:03:04 EST 2008: Processed job 10 request from localhost
server 2: index 7
server 3: index 8
server 2: index 9
server 3: index 5
server 1: index 6
--------- Refresh the Rank ---------
--> connectInfo 0: rank 4 time 0
--> connectInfo 1: rank 3 time 1063
--> connectInfo 2: rank 2 time 608
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
server 2: index 5
server 4: index 5
--------- Refresh the Rank ---------
--> connectInfo 0: rank 5 time 0
--> connectInfo 1: rank 4 time 1063
--> connectInfo 2: rank 2 time 608
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 3 time 973
------------------------------------
Thu Feb 21 21:03:06 EST 2008: Connection job 10 with localhost closed.
J48 pruned tree
------------------
region-centroid-row <= 155
| value-mean <= 91.4444
| | rawred-mean <= 24.6667
| | | hue-mean <= -1.89048
| | | | hue-mean <= -2.22266
| | | | | region-centroid-row <= 146: foliage (102.0/1.0)
| | | | | region-centroid-row > 146: cement (3.0)
| | | | hue-mean > -2.22266
| | | | | rawred-mean <= 2.55556
| | | | | | hue-mean <= -2.09121
| | | | | | | region-centroid-row <= 129: foliage (50.0)
| | | | | | | region-centroid-row > 129
| | | | | | | | region-centroid-col <= 128
| | | | | | | | | rawred-mean <= 0.666667: foliage (30.0/4.0)
| | | | | | | | | rawred-mean > 0.666667: window (5.0)
| | | | | | | | region-centroid-col > 128
| | | | | | | | | vedge-mean <= 0.333334: window (11.0)
| | | | | | | | | vedge-mean > 0.333334
| | | | | | | | | | region-centroid-col <= 216: window (3.0)
| | | | | | | | | | region-centroid-col > 216: foliage (2.0)
| | | | | | hue-mean > -2.09121: window (38.0/1.0)
| | | | | rawred-mean > 2.55556
| | | | | | region-centroid-row <= 121
| | | | | | | exgreen-mean <= -15.4444: brickface (2.0/1.0)
| | | | | | | exgreen-mean > -15.4444
| | | | | | | | vedge-mean <= 2.94444: window (75.0)
| | | | | | | | vedge-mean > 2.94444
| | | | | | | | | region-centroid-col <= 134: cement (2.0)
| | | | | | | | | region-centroid-col > 134: window (8.0)
| | | | | | region-centroid-row > 121
| | | | | | | rawred-mean <= 7.88889
| | | | | | | | region-centroid-col <= 43: brickface (2.0)
| | | | | | | | region-centroid-col > 43: window (13.0/2.0)
| | | | | | | rawred-mean > 7.88889
| | | | | | | | saturation-mean <= 0.492526: cement (15.0)
| | | | | | | | saturation-mean > 0.492526
| | | | | | | | | region-centroid-col <= 82: foliage (2.0)
| | | | | | | | | region-centroid-col > 82: cement (4.0/1.0)
| | | hue-mean > -1.89048
| | | | exgreen-mean <= -4.77778
| | | | | vedge-mean <= 2.77778: brickface (198.0/2.0)
| | | | | vedge-mean > 2.77778
| | | | | | region-centroid-row <= 115: brickface (4.0)
| | | | | | region-centroid-row > 115: foliage (3.0/1.0)
| | | | exgreen-mean > -4.77778
| | | | | hedge-mean <= 0.833335
| | | | | | region-centroid-col <= 115: foliage (4.0)
| | | | | | region-centroid-col > 115: window (42.0)
| | | | | hedge-mean > 0.833335: grass (2.0)
| | rawred-mean > 24.6667
| | | hue-mean <= -2.17742
| | | | vedge-mean <= 5: window (4.0/1.0)
| | | | vedge-mean > 5: foliage (18.0)
| | | hue-mean > -2.17742
| | | | rawgreen-mean <= 24.4444: brickface (3.0/1.0)
| | | | rawgreen-mean > 24.4444: cement (180.0)
| value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
| exgreen-mean <= -2
| | saturation-mean <= 0.385555
| | | region-centroid-row <= 159
| | | | region-centroid-col <= 208: cement (3.0)
| | | | region-centroid-col > 208: path (2.0)
| | | region-centroid-row > 159: path (234.0)
| | saturation-mean > 0.385555: cement (11.0)
| exgreen-mean > -2: grass (205.0)
Number of Leaves : 34
Size of the tree : 67
Time taken to build model: 5.06 seconds
Time taken to test model on training data: 0.09 seconds
=== Error on training data ===
Correctly Classified Instances 1485 99 %
Incorrectly Classified Instances 15 1 %
Kappa statistic 0.9883
Mean absolute error 0.0029
Root mean squared error 0.0535
Relative absolute error 1.1672 %
Root relative squared error 15.2785 %
Total Number of Instances 1500
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 205   0   0   0   0   0   0 |   a = brickface
   0 220   0   0   0   0   0 |   b = sky
   1   0 205   0   2   0   0 |   c = foliage
   1   0   0 217   2   0   0 |   d = cement
   2   0   6   1 195   0   0 |   e = window
   0   0   0   0   0 236   0 |   f = path
   0   0   0   0   0   0 207 |   g = grass
=== Stratified cross-validation ===
Correctly Classified Instances 1436 95.7333 %
Incorrectly Classified Instances 64 4.2667 %
Kappa statistic 0.9502
Mean absolute error 0.0122
Root mean squared error 0.1104
Relative absolute error 4.9799 %
Root relative squared error 31.5589 %
Total Number of Instances 1500
Cross-validation ran in parallel using this computer and the following machines:
ip-10-251-69-175/10.251.69.175
localhost/127.0.0.1
localhost/127.0.0.1
ip-10-251-69-175/10.251.69.175
=== Confusion Matrix ===
   a   b   c   d   e   f   g   <-- classified as
 196   0   3   1   5   0   0 |   a = brickface
   0 220   0   0   0   0   0 |   b = sky
   0   1 196   2   9   0   0 |   c = foliage
   2   0   4 207   6   1   0 |   d = cement
   3   0  16   6 179   0   0 |   e = window
   0   0   0   3   0 233   0 |   f = path
   0   0   0   0   2   0 205 |   g = grass
Thu Feb 21 21:03:07 EST 2008: Connection job 3 with localhost closed.
Final demo, using the Leukemia-ALLAML data
time java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 \
-t $WEKAHOME/data/ALL-AML_train.arff -d Leukemia-ALLAML.tree.J48.model \
-i -x 10 -a -C 4
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 21:10:03 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 21:10:03 EST 2008: Processed job 13 request from localhost
Thu Feb 21 21:10:05 EST 2008: Connection job 13 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Using 2st Server(2) to do crossValidate.
---Judgement--- server 3: Memory free : 524288000 --> Passed!
Using 3st Server(3) to do crossValidate.
---Judgement--- server 4: Memory free : 524288000 --> Passed!
Using 4st Server(4) to do crossValidate.
Thu Feb 21 21:10:57 EST 2008: Processed job 5 request from localhost
Thu Feb 21 21:11:03 EST 2008: Processed job 14 request from localhost
server 4: index 2
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 1 time 676
------------------------------------
server 4: index 4
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 1 time 676
------------------------------------
server 4: index 5
server 1: index 3
--------- Refresh the Rank ---------
--> connectInfo 0: rank 3 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 2 time 676
------------------------------------
server 1: index 6
server 4: index 7
server 1: index 8
server 4: index 9
server 1: index 0
server 4: index 0
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 3 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 2 time 676
------------------------------------
server 1: index 1
J48 pruned tree
------------------
attribute4847 <= 938: ALL (27.0)
attribute4847 > 938: AML (11.0)
Number of Leaves : 2
Size of the tree : 3
Time taken to build model: 5.07 seconds
Time taken to test model on training data: 0.01 seconds
=== Error on training data ===
Correctly Classified Instances 38 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 38
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
1 0 1 1 1 ALL
1 0 1 1 1 AML
=== Confusion Matrix ===
a b <-- classified as
27 0 | a = ALL
0 11 | b = AML
=== Stratified cross-validation ===
Correctly Classified Instances 32 84.2105 %
Incorrectly Classified Instances 6 15.7895 %
Kappa statistic 0.6358
Mean absolute error 0.1579
Root mean squared error 0.3974
Relative absolute error 37.8015 %
Root relative squared error 87.2867 %
Total Number of Instances 38
Cross-validation ran in parallel using this computer and the following machines:
ip-10-251-69-175/10.251.69.175
localhost/127.0.0.1
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.852 0.182 0.92 0.852 0.885 ALL
0.818 0.148 0.692 0.818 0.75 AML
=== Confusion Matrix ===
a b <-- classified as
23 4 | a = ALL
2 9 | b = AML
Thu Feb 21 21:11:10 EST 2008: Connection job 14 with localhost closed.
server 3: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 4 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 5 time 0
--> connectInfo 3: rank 3 time 699
--> connectInfo 4: rank 2 time 676
------------------------------------
server 2: index 1
--------- Refresh the Rank ---------
--> connectInfo 0: rank 5 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 2 time 637
--> connectInfo 3: rank 4 time 699
--> connectInfo 4: rank 3 time 676
------------------------------------
Thu Feb 21 21:11:21 EST 2008: Connection job 5 with localhost closed.
real 1m19.614s
user 0m13.140s
sys 0m12.660s
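The accuracy and Kappa statistic reported for the Leukemia-ALLAML stratified cross-validation can be checked by hand from its confusion matrix (23 4 / 2 9). Here is a minimal Python sketch, assuming the standard Cohen's kappa definition (observed agreement compared with the chance agreement implied by the row and column totals). The function name kappa_from_confusion is made up for illustration and is not part of Weka or Grid Weka.

```python
def kappa_from_confusion(matrix):
    """Return (accuracy, kappa) for a square confusion matrix (rows = actual class)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1 - expected)


# Cross-validation confusion matrix copied from the Leukemia-ALLAML run.
leukemia_cv = [[23, 4],
               [2, 9]]
accuracy, kappa = kappa_from_confusion(leukemia_cv)
print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
```

With these four counts the sketch gives 84.2105% and 0.6358, which agrees with the figures Weka printed for the cross-validation.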
Rerunning with the Java option -Xprof to profile CPU usage:
Flat profile of 39.70 secs (113 total ticks): Thread-5
Interpreted + native Method
49.5% 0 + 53 java.net.SocketInputStream.socketRead0
8.4% 0 + 9 java.net.SocketOutputStream.socketWrite0
57.9% 0 + 62 Total interpreted
Compiled + native Method
0.9% 0 + 1 java.lang.String.
0.9% 0 + 1 Total compiled
Stub + native Method
32.7% 0 + 35 java.io.FileInputStream.read
6.5% 0 + 7 java.net.SocketOutputStream.socketWrite0
1.9% 0 + 2 java.lang.System.identityHashCode
41.1% 0 + 44 Total stub
Thread-local ticks:
5.3% 6 Blocked (of total)
Flat profile of 34.95 secs (111 total ticks): Thread-7
Compiled + native Method
1.8% 1 + 1 weka.classifiers.EvaluationClient.determineIndex
1.8% 1 + 1 Total compiled
Thread-local ticks:
98.2% 109 Compilation
Flat profile of 54.81 secs (473 total ticks): main
Interpreted + native Method
0.6% 0 + 2 java.net.Inet4AddressImpl.lookupAllHostAddr
0.3% 0 + 1 java.lang.System.currentTimeMillis
0.3% 0 + 1 java.io.FileInputStream.read
0.3% 0 + 1 java.lang.Thread.start0
0.3% 0 + 1 java.util.zip.ZipFile.getEntry
0.3% 0 + 1 weka.classifiers.trees.J48.main
0.3% 1 + 0 java.net.InetAddress.getCachedAddress
0.3% 0 + 1 weka.classifiers.Evaluation.evaluateModel
2.5% 1 + 8 Total interpreted
Compiled + native Method
22.5% 81 + 0 weka.classifiers.BuildModelClient.start
1.4% 5 + 0 java.io.DataInputStream.readLine
1.1% 4 + 0 weka.classifiers.EvaluationClient.start
1.1% 4 + 0 sun.misc.FloatingDecimal.readJavaFormatString
0.3% 1 + 0 java.lang.StringBuffer.toString
0.3% 1 + 0 java.lang.AbstractStringBuilder.append
0.3% 1 + 0 java.io.ObjectOutputStream.defaultWriteFields
0.3% 1 + 0 java.lang.String.regionMatches
0.3% 1 + 0 sun.nio.cs.US_ASCII$Decoder.decodeArrayLoop
0.3% 1 + 0 weka.core.Instances.getInstanceFull
0.3% 1 + 0 java.io.ObjectOutputStream.writeObject0
28.1% 101 + 0 Total compiled
Stub + native Method
60.8% 0 + 219 java.io.FileInputStream.read
6.1% 0 + 22 java.io.FileOutputStream.writeBytes
0.3% 0 + 1 java.lang.Class.isArray
0.3% 0 + 1 java.lang.Float.floatToIntBits
67.5% 0 + 243 Total stub
Thread-local ticks:
23.9% 113 Blocked (of total)
1.9% 7 Compilation
Flat profile of 35.26 secs (126 total ticks): Thread-6
Interpreted + native Method
66.1% 0 + 82 java.net.SocketInputStream.socketRead0
7.3% 0 + 9 java.net.SocketOutputStream.socketWrite0
0.8% 0 + 1 java.net.PlainSocketImpl.socketConnect
74.2% 0 + 92 Total interpreted
Compiled + native Method
2.4% 0 + 3 java.io.ObjectStreamClass.lookup
2.4% 0 + 3 Total compiled
Stub + native Method
20.2% 0 + 25 java.io.FileInputStream.read
20.2% 0 + 25 Total stub
Thread-local ticks:
1.6% 2 Blocked (of total)
3.2% 4 Compilation
Global summary of 55.12 seconds:
100.0% 504 Received ticks
1.8% 9 Received GC ticks
1.4% 7 Compilation
0.2% 1 Unknown code
real 0m55.151s
user 0m8.470s
sys 0m12.520s
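To put a number on how much of the two client threads' time went to the network, the tick counts in the profile can be added up. This is a rough Python sketch using only counts copied from the -Xprof output for Thread-5 and Thread-6. It assumes the percentages are taken relative to each thread's total ticks minus its "Blocked" ticks. That is an inference from the numbers (53 / (113 - 6) = 49.5%), not something the profile states.

```python
# Tick counts copied from the -Xprof output: total and blocked ticks per thread,
# plus the ticks attributed to java.net socket calls
# (socketRead0, socketWrite0, socketConnect).
profiles = {
    "Thread-5": {"total": 113, "blocked": 6, "socket": 53 + 9 + 7},
    "Thread-6": {"total": 126, "blocked": 2, "socket": 82 + 9 + 1},
}


def socket_share(profile):
    """Percentage of non-blocked ticks spent in java.net socket calls."""
    return 100.0 * profile["socket"] / (profile["total"] - profile["blocked"])


shares = {name: socket_share(p) for name, p in profiles.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of non-blocked ticks in socket calls")
```

On these counts roughly two thirds to three quarters of each thread's sampled ticks fall in socket reads and writes.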
Rerunning with JFS, which provides parallel access to the network, was about 20% faster:
http://jfs.des.udc.es/docs/jfs.html
time jfsrun weka.classifiers.trees.J48 -t $WEKAHOME/data/ALL-AML_train.arff \
-d Leukemia-ALLAML.tree.J48.model -x 10 -a -C 4
real 0m41.736s
user 0m6.990s
sys 0m9.180s
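The three sets of timings can be compared directly. The Python sketch below parses the bash time values and reports, for each run, how much of the wall-clock time was CPU time (user + sys) and how much shorter the jfsrun run was. The run labels and helper names are made up for illustration. The post does not say which run the "about 20%" is measured against, so both baselines are shown. Note also that the jfsrun command omits the -i flag, so the runs are not strictly identical.

```python
import re


def parse_time(text):
    """Convert a bash `time` value such as '1m19.614s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", text).groups()
    return int(minutes) * 60 + float(seconds)


# (real, user, sys) copied from the three runs in this post.
runs = {
    "grid weka": ("1m19.614s", "0m13.140s", "0m12.660s"),
    "grid weka -Xprof": ("0m55.151s", "0m8.470s", "0m12.520s"),
    "jfsrun": ("0m41.736s", "0m6.990s", "0m9.180s"),
}
seconds = {name: tuple(parse_time(t) for t in times) for name, times in runs.items()}


def cpu_share(name):
    """Percentage of wall-clock time accounted for by user + sys CPU time."""
    real, user, sys_time = seconds[name]
    return 100.0 * (user + sys_time) / real


def reduction(baseline, candidate):
    """Percentage reduction in wall-clock time of candidate relative to baseline."""
    return 100.0 * (1 - seconds[candidate][0] / seconds[baseline][0])


for name in runs:
    print(f"{name}: real {seconds[name][0]:.1f}s, CPU share {cpu_share(name):.1f}%")
for baseline in ("grid weka", "grid weka -Xprof"):
    print(f"jfsrun vs {baseline}: {reduction(baseline, 'jfsrun'):.1f}% less wall-clock time")
```

In every run the CPU time is well under half of the elapsed time, which is consistent with the process mostly waiting on I/O.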