Saturday, October 18, 2008

Datamining in the cloud

Amazon EC2 and other on-demand or cloud computing vendors provide a way to scale your computing and storage requirements up and down.

Over the last 18 months or so I have reviewed and demoed a bunch of data mining and business intelligence (BI) software and suites. This was partly for my own benefit, so I could check out what is out there in this software space, and partly to see how effective each is at scaling your BI computing needs up and down.

Many people want to use these compute clouds like dedicated hosting providers, which is the wrong use, at least until the prices drop to be comparable with dedicated hosting.
What they are for is peak-demand resourcing and short- to mid-term computing requirements. This is especially true for tasks which are CPU-bound and which scale well.

Let's look at the short-term computing example:

You have a task which represents 16 hours of work on a single core, or 4 hours per core on one machine with 4 cores (2x dual core). However, the task scales, so if you throw 16 cores at it, it will be completed in 1 hour.
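The arithmetic above can be sketched as a simple model. The core-hour budget and hourly rate below are illustrative numbers, not EC2 prices:

```python
# Illustrative model of a perfectly parallel (CPU-bound) task.
# The task needs a fixed budget of core-hours; wall-clock time
# shrinks linearly as cores are added, while total cost does not.

CORE_HOURS = 16  # 4 cores x 4 hours, as in the example above


def wall_clock_hours(cores: int) -> float:
    """Hours to finish if the task scales perfectly across cores."""
    return CORE_HOURS / cores


def total_cost(cores: int, rate_per_core_hour: float) -> float:
    """Cost is the same regardless of core count -- you pay for core-hours."""
    return CORE_HOURS * rate_per_core_hour


print(wall_clock_hours(4))   # 4.0 hours on the 4-core machine
print(wall_clock_hours(16))  # 1.0 hour on 16 cores
```

The point the model makes: if the per-core-hour rate is flat, finishing 4x faster costs no more, which is exactly the time saving an on-demand service sells.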

I see this as the core of providing any kind of on-demand service. The main selling point is that you are saving people time. They can reinvest that time, if you like, and do more analysis or more in-depth analysis. They can broaden the techniques they use to analyze their data, or apply the same techniques in more detail.

One example of this type of broadening of analysis is using Weka to run many different learning algorithms on the same training and testing sets of data to find the best one. This is a perfect use of on-demand computing.
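In Weka itself this comparison would be done with the Experimenter or the weka.classifiers.Evaluation API; as a language-neutral sketch, here is the shape of the idea. The "classifiers" below are trivial stand-ins, not real learning algorithms:

```python
# Sketch: evaluate several candidate learners on the same train/test
# split and keep the best. The stand-in classifiers just predict a
# constant or the majority class; a real run would plug in Weka jobs,
# each of which could go to its own EC2 instance since they are independent.

def majority_class(train_y, test_X):
    label = max(set(train_y), key=train_y.count)
    return [label] * len(test_X)

def always_zero(train_y, test_X):
    return [0] * len(test_X)

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def best_learner(learners, train_y, test_X, test_y):
    scores = {name: accuracy(fn(train_y, test_X), test_y)
              for name, fn in learners.items()}
    return max(scores, key=scores.get), scores

learners = {"majority": majority_class, "zero": always_zero}
name, scores = best_learner(learners, [1, 1, 0], [None, None], [1, 1])
print(name, scores)
```

Because each learner runs independently on the same data, the whole sweep parallelizes trivially, which is why it suits on-demand compute so well.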

So how do you provide this type of service under a cloud computing model?

You have to introduce a queue.
Your base load is handled by a dedicated hosting provider, and your peak load by starting and stopping (ramping up and down) the compute resources as required.

Why stop at one queue? Many service providers have multiple channels for accessing their services. Work which must be done immediately, where time is at a premium, would pay a premium for that immediacy. Work which isn't required until the next business day can be completed when spare resources are freed up.
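The two-tier idea can be sketched as a single priority queue where premium jobs always come off ahead of next-business-day jobs. The tier names and the heapq approach are my own illustration:

```python
import heapq
import itertools

# Lower number = higher priority. Premium jobs pay for immediacy;
# overnight jobs wait for spare capacity.
PREMIUM, OVERNIGHT = 0, 1

class JobQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tiebreak within a tier

    def submit(self, job, tier):
        heapq.heappush(self._heap, (tier, next(self._seq), job))

    def next_job(self):
        tier, _, job = heapq.heappop(self._heap)
        return job

q = JobQueue()
q.submit("nightly-report", OVERNIGHT)
q.submit("urgent-scoring", PREMIUM)
print(q.next_job())  # urgent-scoring comes off first
```

The queue depth per tier is also the natural signal for when to ramp EC2 instances up or down.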

Another way to approach this is to let a market form for the compute resources. The best mechanism for controlling the use of resources is a price signal.
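A price signal can be sketched as a simple bid rule: run deferrable work only while the going rate is below what the job is worth to you. The prices below are invented for illustration; EC2 had no spot market when this was written:

```python
def should_run(bid_per_hour: float, current_price: float) -> bool:
    """Run deferrable work only when compute is cheap enough."""
    return current_price <= bid_per_hour

# Simulated hourly compute prices over part of a day (made-up numbers).
prices = [0.12, 0.09, 0.05, 0.04, 0.08, 0.15]

# With a bid of $0.08/hour, work proceeds only in the cheap hours.
hours_worked = sum(should_run(0.08, p) for p in prices)
print(hours_worked)
```

Urgent work is simply a high bid; overnight work is a low one, so the market itself sorts jobs into the tiers described above.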

Have Fun

Wednesday, June 25, 2008

Weka Web Service Prototype: Reader choice

We are back on track now to delivering Weka as a web service. I finally got a freelance Java developer to help me implement the required front-end. Everyone wanted to volunteer; no one had any time.

Rather than use a scattergun approach, I am interested in getting some reader feedback (yes, you, the reader) on which specific part of Weka people would like to use first: classifiers, clustering or attribute selection. First come, first served.

The basic service will also include the standard file handling in Weka, so you can pass an ARFF-formatted file, a plain CSV, or just a URL to a file. You will know it is ready when I post an example of using the web service.
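Weka's own converters (weka.core.converters.CSVLoader) handle the CSV case. Purely to illustrate the ARFF format the service would accept, here is a minimal CSV-to-ARFF sketch that treats every column as numeric (real ARFF also supports nominal and string attributes):

```python
import csv
import io

def csv_to_arff(csv_text: str, relation: str = "dataset") -> str:
    """Minimal sketch: first row is the header, all columns numeric."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in header]
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

arff = csv_to_arff("x,y\n1,2\n3,4")
print(arff)
```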

Once we are happy with a simple web service we can start to ramp up the functionality. I envisage that this service will be blended into other web services or used as part of a mashup.
Once we have demonstrated what the raw service is capable of, we can open the compute floodgates and use EC2 to ramp up and down as required.

Have Fun

Paul

Wednesday, April 16, 2008

More benchmarking of EC2 instances

I have been running a series of benchmarks using the sysbench fileio option to further test all three types of EC2 instances, over on my database blog.

Sysbench fileio vs EC2 Part 1
Sysbench fileio vs Large EC2 Part 2
Sysbench fileio vs EC2 Part 3
seeker io benchmark vs EC2

Basically, it looks like the filesystem mounted at / (/dev/sda) has much better raw I/O performance than the larger /mnt filesystem (/dev/sdb).

I am completing the Kettle Slaves work and should have a post up soon.

Tuesday, March 25, 2008

Globus on EC2

As I mentioned in the roadmap, I am evaluating alternatives for providing the Weka data mining software as a web service.
Weka as a Web Service (Weka4WS) requires that you use the Globus Toolkit.

Globus tries to provide the framework for deploying any web service or Software as a service (SaaS).

Install:

Rather than re-invent the wheel, I followed the Globus quick start guide which, whilst not quick, is very thorough in going through the whole process of setting up both a master server (and Certificate Authority) and a slave, or client/node, server.

I am happy to report that I managed to get through the guide and get a master server set up and running correctly. This is one of those installs where you are constantly wondering what is next and what other requirement lies on the next page down.
If you are wondering, part of my day job (as a DBA) is to install software occasionally and get it working, so I am no noob at this.

Summary:

I would have to say I have been seriously disappointed at the skill level, or rather the assumed knowledge, required to do these software installs. What small percentage of the potential users of the product are going to spend hours getting a server set up?
However, on the plus side, if I am finding it difficult, providing the software as a service is sure to stimulate demand for these applications. I also got to play again with PostgreSQL. It has been a while.

Have Fun

Paul



http://grid.deis.unical.it/weka4ws/

Get the Globus Toolkit. I am using CentOS 4.4, which is equivalent to RHEL 4.

http://www.globus.org/toolkit/docs/4.0/admin/docbook/quickstart.html - Quickstart Install Guide

http://www-unix.globus.org/ftppub/gt4/4.0/4.0.4/installers/bin/gt4.0.4-x86_rhas_4-installer.tar.gz

yum install zlib-devel.i386

Setting up Install Process
Setting up repositories
Reading repository metadata in from local files
Parsing package install arguments
Resolving Dependencies
--> Populating transaction set with selected packages. Please wait.
---> Downloading header for zlib-devel to pack into transaction set.
zlib-devel-1.2.1.2-1.2.i3 100% |=========================| 6.2 kB 00:00
---> Package zlib-devel.i386 0:1.2.1.2-1.2 set to be updated
--> Running transaction check

Dependencies Resolved

=============================================================================
Package Arch Version Repository Size
=============================================================================
Installing:
zlib-devel i386 1.2.1.2-1.2 base 89 k

Transaction Summary
=============================================================================
Install 1 Package(s)
Update 0 Package(s)
Remove 0 Package(s)
Total download size: 89 k
Is this ok [y/N]: y
Downloading Packages:
(1/1): zlib-devel-1.2.1.2 100% |=========================| 89 kB 00:00
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing: zlib-devel ######################### [1/1]

Installed: zlib-devel.i386 0:1.2.1.2-1.2
Complete!

Apache Ant

http://ant.apache.org/bindownload.cgi
tar -xvzf apache-ant-1.7.0-bin.tar.gz -C /usr/local


GCC

yum install gcc

[root@ip-10-251-73-176 mnt]# perl --version

This is perl, v5.8.5 built for i386-linux-thread-multi

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

yum install postgresql postgresql-server.i386

Create Globus user

[root@ip-10-251-73-176 mnt]# adduser globus
[root@ip-10-251-73-176 mnt]# passwd globus
Changing password for user globus.
passwd: all authentication tokens updated successfully.
[root@ip-10-251-73-176 mnt]# mkdir /usr/local/globus-4.0.4/
[root@ip-10-251-73-176 mnt]# chown globus:globus /usr/local/globus-4.0.4/

Untar globus into /usr/local

tar -xzvf gt4.0.4-x86_rhas_4-installer.tar.gz -C /usr/local
chown -R globus:globus /usr/local/gt4.0.4-x86_rhas_4-installer/

# User specific environment and startup programs

LD_LIBRARY_PATH=/usr/local/java/lib
JAVA_HOME=/usr/local/java
ANT_HOME=/usr/local/apache-ant-1.7.0
WEKAHOME=/usr/local/weka

PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$PATH:$HOME/bin

export PATH LD_LIBRARY_PATH JAVA_HOME WEKAHOME ANT_HOME
unset USERNAME
[globus@ip-10-251-73-176 ~]$ source .bash_profile

./configure --prefix=/usr/local/globus-4.0.4/
checking for javac... /usr/local/java/bin/javac
checking for ant... /usr/local/apache-ant-1.7.0/bin/ant
configure: creating ./config.status
config.status: creating Makefile

make
make install


[globus@ip-10-251-73-176 gt4.0.4-x86_rhas_4-installer]$ export GLOBUS_LOCATION=/usr/local/globus-4.0.4
[globus@ip-10-251-73-176 gt4.0.4-x86_rhas_4-installer]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh
[globus@ip-10-251-73-176 gt4.0.4-x86_rhas_4-installer]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca

[globus@ip-10-251-73-176 gt4.0.4-x86_rhas_4-installer]$ ls ~/.globus/simpleCA/
cacert.pem crl grid-ca-ssl.conf newcerts serial
certs globus_simple_ca_b4fe7be7_setup-0.19.tar.gz index.txt private

[root@ip-10-251-73-176 mnt]# export GLOBUS_LOCATION=/usr/local/globus-4.0.4/
[root@ip-10-251-73-176 mnt]# $GLOBUS_LOCATION/setup/globus_simple_ca_ebb88ce5_setup/setup-gsi -default

You will need to change the hash key in the path. Find yours by checking for the setup-gsi file:

[root@ip-10-251-73-176 mnt]# ls -l /usr/local/globus-4.0.4/setup/globus_simple_ca_*_setup/setup-gsi
-rwxr-xr-x 1 globus globus 194 Feb 22 02:16 /usr/local/globus-4.0.4/setup/globus_simple_ca_b4fe7be7_setup/setup-gsi

/usr/local/globus-4.0.4/setup/globus_simple_ca_b4fe7be7_setup/setup-gsi -default

setup-gsi: Configuring GSI security
Making /etc/grid-security...
mkdir /etc/grid-security
Making trusted certs directory: /etc/grid-security/certificates/
mkdir /etc/grid-security/certificates/
Installing /etc/grid-security/certificates//grid-security.conf.b4fe7be7...
Running grid-security-config...
Installing Globus CA certificate into trusted CA certificate directory...
Installing Globus CA signing policy into trusted CA certificate directory...
setup-gsi: Complete

[root@ip-10-251-73-176 mnt]# ls /etc/grid-security/certificates/
b4fe7be7.0 globus-host-ssl.conf.b4fe7be7 grid-security.conf.b4fe7be7
b4fe7be7.signing_policy globus-user-ssl.conf.b4fe7be7

[root@ip-10-251-73-176 mnt]# grid-cert-request -host `hostname`
The hostname ip-10-251-73-176 does not appear to be fully qualified.
Do you wish to continue? [n] y
Generating a 1024 bit RSA private key
..........++++++
.....................................................................++++++
writing new private key to '/etc/grid-security/hostkey.pem'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Level 0 Organization [Grid]:
Level 0 Organizational Unit [GlobusTest]:
Level 1 Organizational Unit [simpleCA-ip-10-251-73-176.ec2.internal]:
Name (e.g., John M. Smith) []:

A private host key and a certificate request has been generated
with the subject:

/O=Grid/OU=GlobusTest/OU=simpleCA-ip-10-251-73-176.ec2.internal/CN=host/ip-10-251-73-176

----------------------------------------------------------

The private key is stored in /etc/grid-security/hostkey.pem
The request is stored in /etc/grid-security/hostcert_request.pem

Please e-mail the request to the Globus Simple CA roobaron@gmail.com
You may use a command similar to the following:

cat /etc/grid-security/hostcert_request.pem | mail roobaron@gmail.com

Only use the above if this machine can send AND receive e-mail. if not, please
mail using some other method.

Your certificate will be mailed to you within two working days.
If you receive no response, contact Globus Simple CA at roobaron@gmail.com

Add GLOBUS_LOCATION to your profile

[globus@ip-10-251-73-176 ~]$ vi .bash_profile
[globus@ip-10-251-73-176 ~]$ source .bash_profile
[globus@ip-10-251-73-176 ~]$ grid-ca-sign -in /etc/grid-security/hostcert_request.pem -out hostsigned.pem

To sign the request
please enter the password for the CA key:

The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/01.pem

Copy the Certificates into the correct location as root

[root@ip-10-251-73-176 mnt]# cp /etc/grid-security/hostcert.pem /etc/grid-security/hostcert.pem.old
[root@ip-10-251-73-176 mnt]# cp ~globus/hostsigned.pem /etc/grid-security/hostcert.pem
[root@ip-10-251-73-176 mnt]# cp /etc/grid-security/hostcert.pem /etc/grid-security/containcert.pem
[root@ip-10-251-73-176 mnt]# cp /etc/grid-security/hostkey.pem /etc/grid-security/containkey.pem

[root@ip-10-251-73-176 mnt]# ls -l /etc/grid-security/*.pem
-rw-r--r-- 1 root root 2759 Feb 22 02:25 /etc/grid-security/containcert.pem
-r-------- 1 root root 887 Feb 22 02:25 /etc/grid-security/containkey.pem
-rw-r--r-- 1 root root 2759 Feb 22 02:24 /etc/grid-security/hostcert.pem
-rw-r--r-- 1 root root 1437 Feb 22 02:20 /etc/grid-security/hostcert_request.pem
-r-------- 1 root root 887 Feb 22 02:20 /etc/grid-security/hostkey.pem

Creating demo user

[root@ip-10-251-73-176 /mnt]$ adduser globus1
[root@ip-10-251-73-176 /mnt]$ passwd globus1
Changing password for user globus1.
passwd: all authentication tokens updated successfully.

[root@ip-10-251-73-176 /mnt]$ cp /home/globus/.bash_profile /home/globus1/
cp: overwrite `/home/globus1/.bash_profile'? y
[root@ip-10-251-73-176 /mnt]$ su - globus1

[globus1@ip-10-251-73-176 ~]$ source /usr/local/globus-4.0.4/etc/globus-user-env.sh
[globus1@ip-10-251-73-176 ~]$ grid-cert-request
Enter your name, e.g., John Smith: Globus
A certificate request and private key is being created.
You will be asked to enter a PEM pass phrase.
This pass phrase is akin to your account password,
and is used to protect your key file.
If you forget your pass phrase, you will need to
obtain a new certificate.

Generating a 1024 bit RSA private key
............++++++
..++++++
writing new private key to '/home/globus1/.globus/userkey.pem'
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Level 0 Organization [Grid]:
Level 0 Organizational Unit [GlobusTest]:
Level 1 Organizational Unit [simpleCA-ip-10-251-73-176.ec2.internal]:
Level 2 Organizational Unit [ec2.internal]:
Name (e.g., John M. Smith) []:

A private key and a certificate request has been generated with the subject:

/O=Grid/OU=GlobusTest/OU=simpleCA-ip-10-251-73-176.ec2.internal/OU=ec2.internal/CN=Globus

If the CN=Globus is not appropriate, rerun this
script with the -force -cn "Common Name" options.

Your private key is stored in /home/globus1/.globus/userkey.pem
Your request is stored in /home/globus1/.globus/usercert_request.pem

Please e-mail the request to the Globus Simple CA roobaron@gmail.com
You may use a command similar to the following:

cat /home/globus1/.globus/usercert_request.pem | mail roobaron@gmail.com

Only use the above if this machine can send AND receive e-mail. if not, please
mail using some other method.

Your certificate will be mailed to you within two working days.
If you receive no response, contact Globus Simple CA at roobaron@gmail.com

I copied the request to /tmp so the globus user can just sign it.

[root@ip-10-251-73-176 /mnt]$ su - globus
[globus@ip-10-251-73-176 ~]$ grid-ca-sign -in /tmp/usercert_request.pem -out /tmp/signed.pem

To sign the request
please enter the password for the CA key:

The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/02.pem

[root@ip-10-251-73-176 /mnt]$ su - globus1
[globus1@ip-10-251-73-176 ~]$ cp /tmp/signed.pem ~/.globus/usercert.pem
[globus1@ip-10-251-73-176 ~]$ ls -l ~/.globus/
total 12
-rw-r--r-- 1 globus1 globus1 2769 Feb 22 02:41 usercert.pem
-rw-r--r-- 1 globus1 globus1 1453 Feb 22 02:38 usercert_request.pem
-r-------- 1 globus1 globus1 963 Feb 22 02:38 userkey.pem

The map file. The DN string comes from the output of the certificate request:

[root@ip-10-251-73-176 /mnt]$ vi /etc/grid-security/grid-mapfile
[root@ip-10-251-73-176 /mnt]$ cat /etc/grid-security/grid-mapfile
"/O=Grid/OU=GlobusTest/OU=simpleCA-ip-10-251-73-176.ec2.internal/CN=host/ip-10-251-73-176/CN=Globus" globus1




Setting up GridFTP

[root@ip-10-251-73-176 /mnt]$ vi /etc/xinetd.d/gridftp
[root@ip-10-251-73-176 /mnt]$ cat /etc/xinetd.d/gridftp
service gsiftp
{
instances = 100
socket_type = stream
wait = no
user = root
env += GLOBUS_LOCATION=/usr/local/globus-4.0.4
env += LD_LIBRARY_PATH=/usr/local/globus-4.0.4/lib
server = /usr/local/globus-4.0.4/sbin/globus-gridftp-server
server_args = -i
log_on_success += DURATION
nice = 10
disable = no
}

Adding GridFTP as a service

[root@ip-10-251-73-176 /mnt]$ vi /etc/services
[root@ip-10-251-73-176 /mnt]$ /etc/init.d/xinetd reload
Reloading configuration: [ OK ]
[root@ip-10-251-73-176 /mnt]$ netstat -an | grep 2811
tcp 0 0 0.0.0.0:2811 0.0.0.0:* LISTEN

Testing

[globus1@ip-10-251-73-176 ~]$ grid-proxy-init -verify -debug
grid-proxy-init: error while loading shared libraries: libglobus_gsi_proxy_core_gcc32.so.0: cannot open shared object file: No such file or directory

This error means you need to source the /usr/local/globus-4.0.4/etc/globus-user-env.sh

[root@ip-10-251-73-176 /mnt]$ su - globus1
[globus1@ip-10-251-73-176 ~]$ source /usr/local/globus-4.0.4/etc/globus-user-env.sh
[globus1@ip-10-251-73-176 ~]$ grid-proxy-init -verify -debug

User Cert File: /home/globus1/.globus/usercert.pem
User Key File: /home/globus1/.globus/userkey.pem

Trusted CA Cert Dir: /etc/grid-security/certificates

Output File: /tmp/x509up_u504
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-ip-10-251-73-176.ec2.internal/OU=ec2.internal/CN=Globus
Enter GRID pass phrase for this identity:
Creating proxy .......++++++++++++
........++++++++++++
Done
Proxy Verify OK
Your proxy is valid until: Fri Feb 22 14:49:47 2008


Set up the start-stop script... why is this not done by the developers!?!

[globus@ip-10-251-73-176 ~]$ vi $GLOBUS_LOCATION/start-stop
[globus@ip-10-251-73-176 ~]$ chmod +x $GLOBUS_LOCATION/start-stop
[globus@ip-10-251-73-176 ~]$ ls -l $GLOBUS_LOCATION/start-stop
-rwxrwxr-x 1 globus globus 528 Feb 22 04:21 /usr/local/globus-4.0.4/start-stop
[globus@ip-10-251-73-176 ~]$ exit
logout
[root@ip-10-251-73-176 /mnt]$ vi /etc/init.d/globus-4.0.4
[root@ip-10-251-73-176 /mnt]$ cat /usr/local/globus-4.0.4/start-stop
#! /bin/sh
set -e

export JAVA_HOME=/usr/local/java
export ANT_HOME=/usr/local/apache-ant-1.7.0
export GLOBUS_LOCATION=/usr/local/globus-4.0.4
export GLOBUS_OPTIONS="-Xms256M -Xmx512M"

. $GLOBUS_LOCATION/etc/globus-user-env.sh

cd $GLOBUS_LOCATION
case "$1" in
start)
$GLOBUS_LOCATION/sbin/globus-start-container-detached -p 8443
;;
stop)
$GLOBUS_LOCATION/sbin/globus-stop-container-detached
;;
*)
echo "Usage: globus {start|stop}" >&2
exit 1
;;
esac
exit 0
[root@ip-10-251-73-176 /mnt]$ chmod +x /etc/init.d/globus-4.0.4

Starting up Globus... finally we are getting somewhere

[root@ip-10-251-73-176 /mnt]$ /etc/init.d/globus-4.0.4 start
Starting Globus container. PID: 14945
WARNING: It seems like the container died directly
Please see $GLOBUS_LOCATION/var/container.log for more information
[root@ip-10-251-73-176 /mnt]$ tail $GLOBUS_LOCATION/var/container.log
Failed to start container: Failed to initialize 'ManagedJobFactoryService' service [Caused by: [SEC] Service credentials not configured and was not able to obtain container credentials.; nested exception is:
org.globus.wsrf.security.SecurityException: [SEC] Error obtaining container credentials; nested exception is:
org.globus.wsrf.config.ConfigException: Failed to initialize container security config [Caused by: [Caused by: Failed to load credentials. [Caused by: /etc/grid-security/containercert.pem (No such file or directory)]]]]
[root@ip-10-251-73-176 /mnt]$ ls -l /etc/grid-security/containercert.pem
ls: /etc/grid-security/containercert.pem: No such file or directory

Issue with missing container certificate

[root@ip-10-251-73-176 /mnt]$ mv /etc/grid-security/containcert.pem /etc/grid-security/containercert.pem
[root@ip-10-251-73-176 /mnt]$ mv /etc/grid-security/containkey.pem /etc/grid-security/containerkey.pem

Woot! We have a Globus container running.

[root@ip-10-251-73-176 /mnt]$ /etc/init.d/globus-4.0.4 start
Starting Globus container. PID: 15140
[root@ip-10-251-73-176 /mnt]$ tail $GLOBUS_LOCATION/var/container.log
[42]: https://10.251.73.176:8443/wsrf/services/TriggerService
[43]: https://10.251.73.176:8443/wsrf/services/TriggerServiceEntry
[44]: https://10.251.73.176:8443/wsrf/services/Version
[45]: https://10.251.73.176:8443/wsrf/services/WidgetNotificationService
[46]: https://10.251.73.176:8443/wsrf/services/WidgetService
[47]: https://10.251.73.176:8443/wsrf/services/gsi/AuthenticationService
[48]: https://10.251.73.176:8443/wsrf/services/mds/test/execsource/IndexService
[49]: https://10.251.73.176:8443/wsrf/services/mds/test/execsource/IndexServiceEntry
[50]: https://10.251.73.176:8443/wsrf/services/mds/test/subsource/IndexService
[51]: https://10.251.73.176:8443/wsrf/services/mds/test/subsource/IndexServiceEntry

Testing the container using demo service CounterService

[root@ip-10-251-73-176 /mnt]$ su - globus1
[globus1@ip-10-251-73-176 ~]$ counter-client -s https://10.251.73.176:8443/wsrf/services/CounterService

Got notification with value: 3
Counter has value: 3

Starting Postgres database

[root@ip-10-251-73-176 /mnt]$ mkdir -p /var/lib/postgres/data
[root@ip-10-251-73-176 /mnt]$ vi /var/lib/postgres/data/pg_hba.conf
[root@ip-10-251-73-176 /mnt]$ cp /usr/share/pgsql/pg_hba.conf.sample /var/lib/postgres/data/pg_hba.conf
[root@ip-10-251-73-176 /mnt]$ id postgres
uid=26(postgres) gid=26(postgres) groups=26(postgres)
[root@ip-10-251-73-176 /mnt]$ vi /var/lib/postgres/data/pg_hba.conf
[root@ip-10-251-73-176 /mnt]$ /etc/init.d/postgresql start
Initializing database: [ OK ]
Starting postgresql service: [ OK ]
[root@ip-10-251-73-176 /mnt]$ ps -ef|grep post
postgres 15628 1 0 04:41 ttyp0 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 15629 15628 0 04:41 ttyp0 00:00:00 postgres: stats buffer process
postgres 15630 15629 0 04:41 ttyp0 00:00:00 postgres: stats collector process

[root@ip-10-251-73-176 /mnt]$ su postgres -c "createuser -P globus"
Enter password for new user:
Enter it again:
Shall the new user be allowed to create databases? (y/n) y
Shall the new user be allowed to create more new users? (y/n) n
CREATE USER

[root@ip-10-251-73-176 /mnt]$ su - globus
[globus@ip-10-251-73-176 ~]$ createdb rftDatabase
CREATE DATABASE
[globus@ip-10-251-73-176 ~]$ psql -d rftDatabase -f $GLOBUS_LOCATION/share/globus_wsrf_rft/rft_schema.sql
psql:/usr/local/globus-4.0.4/share/globus_wsrf_rft/rft_schema.sql:6: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "requestid_pkey" for table "requestid"
CREATE TABLE
psql:/usr/local/globus-4.0.4/share/globus_wsrf_rft/rft_schema.sql:11: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "transferid_pkey" for table "transferid"
CREATE TABLE
psql:/usr/local/globus-4.0.4/share/globus_wsrf_rft/rft_schema.sql:30: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "request_pkey" for table "request"
CREATE TABLE
psql:/usr/local/globus-4.0.4/share/globus_wsrf_rft/rft_schema.sql:65: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "transfer_pkey" for table "transfer"
CREATE TABLE
psql:/usr/local/globus-4.0.4/share/globus_wsrf_rft/rft_schema.sql:71: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "restart_pkey" for table "restart"
CREATE TABLE
CREATE TABLE
CREATE INDEX


Restarted ok. Testing RFT

[root@ip-10-251-73-176 ~]# su - globus1
[globus1@ip-10-251-73-176 ~]$ cp /usr/local/globus-4.0.4/share/globus_wsrf_rft_test/transfer.xfr /tmp/rft.xfr
[globus1@ip-10-251-73-176 ~]$ cat /tmp/rft.xfr
true
16000
16000
false
1
true
1
null
null
false
10
gsiftp://10.251.73.176:2811/etc/group
gsiftp://10.251.73.176:2811/tmp/paultest.tmp
[globus1@ip-10-251-73-176 ~]$ dd if=/dev/zero of=/tmp/paultest.tmp bs=100M count=1
1+0 records in
1+0 records out
[globus1@ip-10-251-73-176 ~]$ ls -l /tmp/paultest.tmp
-rw-rw-r-- 1 globus1 globus1 104857600 Feb 22 06:01 /tmp/paultest.tmp
[globus1@ip-10-251-73-176 ~]$ rft -h 10.251.73.176 -f /tmp/rft.xfr
Number of transfers in this request: 1
Subscribed for overall status
Termination time to set: 60 minutes

Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
0/1/0/0/0

Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
1/0/0/0/0
All Transfers are completed

Outline of the globusrun-ws options

[globus1@ip-10-251-73-176 ~]$ globusrun-ws -help
globusrun-ws -help | -version[s]
globusrun-ws -usage [-validate | -submit | -monitor | -status | -kill]
Use the above to see which of the following are valid with each mode.

Modes of operation: (one should appear first on command line)
-validate:
checks the job description for syntax errors and a subset
of semantic errors without making any service requests.
-submit:
submits (or resubmits) a job to a job host in one of
three output modes: batch, interactive, or interactive-streaming.
-monitor:
attaches to an existing job in interactive or
interactive-streaming output modes.
-status:
reports the current state of the job and exits.
-kill:
requests the immediate cancellation of the job and exits.

Options:
-F, -factory :
If supplied, this option causes an EPR to be constructed using
ad-hoc methods that depend on GT implementation details.
For interoperability to other implementations of WS_GRAM,
the -factory-epr-file option should be used instead.
*format: [protocol://]{hostname|hostaddr}[:port][/service]
*default: https://localhost:8443/wsrf/services/ManagedJobFactoryService
-Ft, -factory-type :
in the absence of -factory-epr-file, this specifies a
type of scheduler.
*default: 'Fork' for single jobs and 'Multi' for multijobs
-Ff, -factory-epr-file :
causes the EPR for the ManagedJobFactory to be read from the given
file. This EPR is used as the service endpoint for submission of the
job.
-f, -job-description-file :
causes the job description to be read from the given file.
This description is modified according to the other options and
passed in the WS_GRAM submission messages.
-c, -job-command [--] [arg ...]:
takes all remaining globusrun-ws arguments as its arguments;
therefore it must appear last among globusrun-ws options. This
option causes globusrun-ws to generate a simple job description with
the named program and arguments.
-o, -job-epr-output-file :
the created ManagedJob EPR will be written to the given
file following successful submission. The file will not be written
if the submission fails.
-j, -job-epr-file :
causes the EPR for the ManagedJob to be read from the given file.
This EPR is used as the endpoint for service requests.
-s, -streaming:
The standard output and standard error files of the job are
monitored and data is written to the corresponding output of
globusrun-ws. The standard output will contain ONLY job output data,
while the standard error may be a mixture of job error output as
well as globusrun-ws messages, unless -quiet is specified.
*implies: -staging-delegate if job description does not already
contain stdout and stderr endpoints
*note: use of -streaming (with or without -batch) places a hold state
on the job before cleanup. you must let the job complete
or come back to it with -monitor or -kill
-so, -stdout-file :
append stdout out stream to the specified file instead of to stdout.
-se, -stderr-file :
append stderr out stream to the specified file instead of to stderr.
-I, -submission-id :
causes the job to be (re)submitted using the given uuid in the
reliability protocol.
-If, -submission-id-file :
causes the uuid to be read from the given file. It is an error
to use with -submission-id
-Io, -submission-id-output-file :
the uuid in use is written to the given file, whether this uuid
was generated for the user or given by one of the above
input options.
-J, -job-delegate:
If supplied AND the job description does not already provide a
jobCredential element, globusrun-ws will delegate the
client credential to WS_GRAM and introduce the corresponding
element to the submission input.
-S, -staging-delegate:
If supplied AND the job description does include staging or
cleanup directives AND the job description does not already provide
the necessary stagingCredential or transferCredential element(s),
globusrun-ws will delegate the client credential to WS_GRAM and RFT,
and introduce the corresponding elements to the submission input.
-Jf, -job-credential-file :
If supplied AND the job description does not already provide a
jobCredential element, globusrun-ws will copy the supplied epr into
the job description. This should be an epr returned from the
DelegationFactoryService intended for use by the job (or, in the
case of a multijob, for authenticating to the subjobs).
*note: for multijob descriptions, only the top level jobCredential
will be copied into.
-Sf, -staging-credential-file :
If supplied AND the job description does not already provide a
stagingCredential element, globusrun-ws will copy the supplied epr
into the job description. This should be an epr returned from the
DelegationFactoryService intended for use with the RFT service
associated with the ManagedJobService.
*note: this option is ignored for multijobs.
-Tf, -transfer-credential-file :
If supplied, globusrun-ws will copy the epr into each of the
stage in, stage out, and cleanup elements that do not already
contain a transferCredential element. This should be an epr
returned from the DelegationFactoryService intended for use by
RFT to authenticate with the target gridftp server.
*note: this option is ignored for multijobs.
-b, -batch:
enables batch mode. the tool prints the resulting ManagedJob EPR as
the sole standard output (unless in quiet mode) and exits.
*note: without this, -submit is equivalent to
-submit -batch immediately followed by -monitor.
-q, -quiet:
all non-fatal status and protocol-related messages are suppressed.
-n, -no-cleanup:
the default behavior of trapping interrupts (SIGINT) and canceling
the job is disabled. Instead, the interrupt simply causes the tool
to exit without affecting the ManagedJob resource.
-host, -host-authz:
The GSI 'host authorization' rule is used to verify that the
service is using a host credential appropriate for the underlying
service address information.
*default
-self, -self-authz:
The GSI 'self authorization' rule is used to verify that the
service is using a (proxy) credential derived from the same
identity as the client's.
-subject, -subject-authz :
The service must be using a credential with the exact subject
name provided by this option.
-p, -private:
If supplied, privacy-protection is enabled between globusrun-ws
and WS_GRAM or GridFTP services. It is a fatal error to select
privacy protection if it is not available due to build options or
other security settings.
*note: currently only works with https endpoints
-T, -http-timeout :
Set timeout for HTTP socket, in milliseconds, for all Web services
interactions.
*default: 120000 (2 minutes).
-term, -termination +|:
Set a termination time (+relative to now).
*default: +24:00
-dbg, -debug:
Display message and GridFTP debug output on stderr
-pft, -print-fault-type:
When a fault occurs, display a line containing
Fault Type: fault-type
on stderr
-ipv6, -allow-ipv6:
Allow streaming transfers to use IPV6.
-passive:
Force streaming transfers to use MODE S to allow for passive mode
transfers. (Useful if you're behind a firewall, but expensive
because there is no connection caching).
-nodcau:
Disable data channel authentication on streaming transfers.

Errors of note

[globus1@ip-10-251-73-176 ~]$ globusrun-ws -submit -c /bin/true
Submitting job...Failed.
globusrun-ws: Error submitting job
globus_xio_gsi: gss_init_sec_context failed.
GSS Major Status: Unexpected Gatekeeper or Service Name
globus_gsi_gssapi: Authorization denied: The name of the remote host (ip-10-251-73-176), and the expected name for the remote host (ip-10-251-73-176.ec2.internal) do not match. This happens when the name in the host certificate does not match the information obtained from DNS and is often a DNS configuration problem.

Fix: make sure to add a line to /etc/hosts, e.g.:

10.251.73.176 ip-10-251-73-176 ip-10-251-73-176.ec2.internal

[globus1@ip-10-251-73-176 ~]$ globusrun-ws -submit -c /bin/true
Submitting job...Done.
Job ID: uuid:8e3adcd2-e136-11dc-b2fd-12313a004646
Termination time: 02/23/2008 11:08 GMT
Current job state: Failed
Destroying job...Done.
globusrun-ws: Job failed: Error code: 201
Script stderr:
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these two things:
    #1) Respect the privacy of others.
    #2) Think before you type.
Password:


Fix: Make sure the sudoers file is correct
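For reference, WS-GRAM runs the job-manager scripts through sudo, so /etc/sudoers needs entries along these lines. This is a sketch based on the GT4 admin guide rather than my exact file: the install path assumes a default $GLOBUS_LOCATION of /usr/local/globus, and "globus" / "weka" are placeholder usernames for the container user and the target job user.

```
# Hypothetical GT4 sudoers entries - adjust paths and users for your install
globus ALL=(weka) NOPASSWD: /usr/local/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus/libexec/globus-job-manager-script.pl *
globus ALL=(weka) NOPASSWD: /usr/local/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus/libexec/globus-gram-local-proxy-tool *
```

If these lines are missing or mangled, sudo prompts for a password, and that prompt is what shows up in the job's stderr above.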

Saturday, February 23, 2008

Grid Weka on EC2

Weka is a data mining and data discovery tool. We have already installed Weka as standalone software on an Amazon EC2 node. Refer to these links for the past articles on Weka on EC2.

Weka Data Mining on EC2 - install
Weka Data Mining on EC2 - testing

In the posting roadmap I mentioned we would be looking at some Grid- or Web-aware versions of Weka. Originally Weka was developed for consumption within a closed group: individual researchers could run their data mining either on their own workstation or on a server.

Outline:

Grid Weka was developed at University College Dublin, Ireland. It adds Java code that allows Weka to offload various processing steps to both local and remote servers.

Install:

The Grid Weka HOWTO guide was good. It does assume, though, that you know how to install Java and set up the Java environment correctly.

  1. Install Java. Get the latest JDK 1.5 or higher from java.sun.com
  2. Download Grid Weka
  3. Place the weka.jar file in an appropriate location.
  4. Make sure your JAVA_HOME environment variable is set.
  5. Create and edit a file called .weka-parallel and place it in the user's home directory.
  6. Run the Grid Weka servers using: java -classpath /yourpath/weka.jar weka.core.DistributedServer yourport &
  7. To access the remote servers you will need to open the port in your firewall.
Results:

Grid Weka works as expected. I haven't yet tested the real potential benefit of using remote servers to run multiple classifications in parallel. I will, and I'll post the results in a future article.

The classifications took longer using the remote servers than running in a single dedicated process on one server; most of the time was spent in network traffic.
Once again the poor network performance between EC2 nodes is a killer for network-intensive applications.

I found a Java library, JFS, which uses the idea of parallel streams. With it I was able to reduce the network time by around 20%.

This suggests that Grid Weka would benefit from being compression aware, allowing the stream of data to be compressed on the fly, effectively doubling the network bandwidth at the expense of CPU. With CPU performance always increasing, follow Google's lead and use the spare CPU time to compress everything.

Java ships I/O stream classes which utilize gzip; see this article on programming I/O streams:
http://java.sun.com/developer/technicalArticles/Streams/ProgIOStreams/

I reviewed the code, and adding the gzip wrapper around the I/O streams is easy; there are plenty of code examples out on the net. More on that later...
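As a sketch of what the gzip wrapping could look like (my own illustration, not Grid Weka code: the class and method names here are made up), java.util.zip's GZIPOutputStream and GZIPInputStream can wrap any stream, including a Socket's streams, provided both ends agree to do it:

```java
import java.io.*;
import java.util.zip.*;

// Sketch: gzip-wrapping the byte streams Grid Weka pushes over the network.
// In Grid Weka you would wrap the Socket's output stream on the sender and
// its input stream on the receiver in exactly the same way.
public class GzipRoundTrip {

    // Compress a buffer as it would be written to a socket output stream.
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(sink);
        gz.write(data);
        gz.close(); // close() flushes the gzip trailer
        return sink.toByteArray();
    }

    // Decompress on the receiving side.
    public static byte[] decompress(byte[] gzipped) throws IOException {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        gz.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // ARFF-style data is highly repetitive, so it compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(i % 10).append(",0.5,foliage\n");
        }
        byte[] raw = sb.toString().getBytes("UTF-8");
        byte[] packed = compress(raw);
        System.out.println("raw=" + raw.length + " gzipped=" + packed.length);
    }
}
```

The trade-off is exactly as described above: CPU cycles spent in Deflate in exchange for fewer bytes on the wire between EC2 nodes.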

Full install and results:




[root@ip-10-251-71-99 ~]# id weka
uid=502(weka) gid=503(weka) groups=503(weka)
[root@ip-10-251-71-99 ~]# su - weka
[weka@ip-10-251-71-99 tutorial]$ env|grep JAVA
JAVA_HOME=/usr/local/java

[weka@ip-10-251-71-99 tutorial]$ java -version
java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)
[weka@ip-10-251-71-99 tutorial]$ ls
J48-data.model J48-segment-data.out Leukemia-ALLAML.NaiveBayes.J48.pred
J48-data.out Leukemia-ALLAML.NaiveBayes.J48.model Leukemia-ALLAML.tree.J48.model
J48-segment-data.model Leukemia-ALLAML.NaiveBayes.J48.out Leukemia-ALLAML.tree.J48.out
[weka@ip-10-251-71-99 tutorial]$ cd ..
[weka@ip-10-251-71-99 ~]$ ls
tutorial weka-3-4-11.zip
[weka@ip-10-251-71-99 ~]$ mkdir gridweka
[weka@ip-10-251-71-99 ~]$ cd gridweka/
[weka@ip-10-251-71-99 gridweka]$ wget http://cssa.ucd.ie/xin/weka/weka.jar
--19:04:04-- http://cssa.ucd.ie/xin/weka/weka.jar
=> `weka.jar'
Resolving cssa.ucd.ie... 193.1.132.54
Connecting to cssa.ucd.ie|193.1.132.54|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,926,948 (1.8M) [application/octet-stream]

100%[============================================================================>] 1,926,948 91.54K/s ETA 00:00

19:04:25 (90.80 KB/s) - `weka.jar' saved [1926948/1926948]

Starting two Grid Weka servers

[weka@ip-10-251-71-99 gridweka]$ pwd
/home/weka/gridweka
[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8001 &
[1] 2735
[weka@ip-10-251-71-99 gridweka]$ Thu Feb 21 19:11:37 EST 2008: Server started on port 8001
Thu Feb 21 19:11:37 EST 2008: Waiting for connections...

[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8002 &
[2] 2745
[weka@ip-10-251-71-99 gridweka]$ Thu Feb 21 19:15:37 EST 2008: Server started on port 8002
Thu Feb 21 19:15:37 EST 2008: Waiting for connections...

On the client machine, which happens to be the same box:

$ cat .weka-parallel
PORT=8001
ec2-72-44-33-131.compute-1.amazonaws.com
2
1024
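My reading of this file's format, pieced together from the HOWTO rather than a documented spec, is annotated below. The annotations are for illustration only; I wouldn't expect the real file to accept comments.

```
PORT=8001                                   # base port the servers listen on
ec2-72-44-33-131.compute-1.amazonaws.com    # machine running Grid Weka servers
2                                           # number of server processes on that machine
1024                                        # memory (MB) available to each, I believe
```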

Results

[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 \
-t $WEKAHOME/data/segment-challenge.arff -d segment.model -a
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 19:38:54 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 19:38:55 EST 2008: Processed job 0 request from localhost
Thu Feb 21 19:38:55 EST 2008: Connection job 0 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 19:38:57 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 19:38:57 EST 2008: Processed job 1 request from localhost
server 1: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 1
server 1: index 2
server 1: index 3
server 1: index 4
server 1: index 5
server 1: index 6
server 1: index 7
server 1: index 8
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 9
server 1: index 10
server 1: index 11
server 1: index 12
server 1: index 13
server 1: index 14
server 1: index 15
server 1: index 16
server 1: index 17
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 18
server 1: index 19
server 1: index 20
server 1: index 21
server 1: index 22
server 1: index 23
server 1: index 24
server 1: index 25
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 26
server 1: index 27
server 1: index 28
server 1: index 29
server 1: index 30
server 1: index 31
server 1: index 32
server 1: index 33
server 1: index 34
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 35
server 1: index 36
server 1: index 37
server 1: index 38
server 1: index 39
server 1: index 40
server 1: index 41
server 1: index 42
server 1: index 43
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 1 time 339
--> connectInfo 2: rank 3 time 0
------------------------------------
server 1: index 44
server 1: index 45
server 1: index 46
server 1: index 47
server 1: index 48
server 1: index 49
J48 pruned tree
------------------

region-centroid-row <= 155
|   value-mean <= 91.4444
|   |   rawred-mean <= 24.6667
|   |   |   hue-mean <= -1.89048
|   |   |   |   hue-mean <= -2.22266
|   |   |   |   |   region-centroid-row <= 146: foliage (102.0/1.0)
|   |   |   |   |   region-centroid-row > 146: cement (3.0)
|   |   |   |   hue-mean > -2.22266
|   |   |   |   |   rawred-mean <= 2.55556
|   |   |   |   |   |   hue-mean <= -2.09121
|   |   |   |   |   |   |   region-centroid-row <= 129: foliage (50.0)
|   |   |   |   |   |   |   region-centroid-row > 129
|   |   |   |   |   |   |   |   region-centroid-col <= 128
|   |   |   |   |   |   |   |   |   rawred-mean <= 0.666667: foliage (30.0/4.0)
|   |   |   |   |   |   |   |   |   rawred-mean > 0.666667: window (5.0)
|   |   |   |   |   |   |   |   region-centroid-col > 128
|   |   |   |   |   |   |   |   |   vedge-mean <= 0.333334: window (11.0)
|   |   |   |   |   |   |   |   |   vedge-mean > 0.333334
|   |   |   |   |   |   |   |   |   |   region-centroid-col <= 216: window (3.0)
|   |   |   |   |   |   |   |   |   |   region-centroid-col > 216: foliage (2.0)
|   |   |   |   |   |   hue-mean > -2.09121: window (38.0/1.0)
|   |   |   |   |   rawred-mean > 2.55556
|   |   |   |   |   |   region-centroid-row <= 121
|   |   |   |   |   |   |   exgreen-mean <= -15.4444: brickface (2.0/1.0)
|   |   |   |   |   |   |   exgreen-mean > -15.4444
|   |   |   |   |   |   |   |   vedge-mean <= 2.94444: window (75.0)
|   |   |   |   |   |   |   |   vedge-mean > 2.94444
|   |   |   |   |   |   |   |   |   region-centroid-col <= 134: cement (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 134: window (8.0)
|   |   |   |   |   |   region-centroid-row > 121
|   |   |   |   |   |   |   rawred-mean <= 7.88889
|   |   |   |   |   |   |   |   region-centroid-col <= 43: brickface (2.0)
|   |   |   |   |   |   |   |   region-centroid-col > 43: window (13.0/2.0)
|   |   |   |   |   |   |   rawred-mean > 7.88889
|   |   |   |   |   |   |   |   saturation-mean <= 0.492526: cement (15.0)
|   |   |   |   |   |   |   |   saturation-mean > 0.492526
|   |   |   |   |   |   |   |   |   region-centroid-col <= 82: foliage (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 82: cement (4.0/1.0)
|   |   |   hue-mean > -1.89048
|   |   |   |   exgreen-mean <= -4.77778
|   |   |   |   |   vedge-mean <= 2.77778: brickface (198.0/2.0)
|   |   |   |   |   vedge-mean > 2.77778
|   |   |   |   |   |   region-centroid-row <= 115: brickface (4.0)
|   |   |   |   |   |   region-centroid-row > 115: foliage (3.0/1.0)
|   |   |   |   exgreen-mean > -4.77778
|   |   |   |   |   hedge-mean <= 0.833335
|   |   |   |   |   |   region-centroid-col <= 115: foliage (4.0)
|   |   |   |   |   |   region-centroid-col > 115: window (42.0)
|   |   |   |   |   hedge-mean > 0.833335: grass (2.0)
|   |   rawred-mean > 24.6667
|   |   |   hue-mean <= -2.17742
|   |   |   |   vedge-mean <= 5: window (4.0/1.0)
|   |   |   |   vedge-mean > 5: foliage (18.0)
|   |   |   hue-mean > -2.17742
|   |   |   |   rawgreen-mean <= 24.4444: brickface (3.0/1.0)
|   |   |   |   rawgreen-mean > 24.4444: cement (180.0)
|   value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
|   exgreen-mean <= -2
|   |   saturation-mean <= 0.385555
|   |   |   region-centroid-row <= 159
|   |   |   |   region-centroid-col <= 208: cement (3.0)
|   |   |   |   region-centroid-col > 208: path (2.0)
|   |   |   region-centroid-row > 159: path (234.0)
|   |   saturation-mean > 0.385555: cement (11.0)
|   exgreen-mean > -2: grass (205.0)

Number of Leaves : 34

Size of the tree : 67


Time taken to build model: 4.19 seconds
Time taken to test model on training data: 0.03 seconds

=== Error on training data ===

Correctly Classified Instances 1485 99 %
Incorrectly Classified Instances 15 1 %
Kappa statistic 0.9883
Mean absolute error 0.0029
Root mean squared error 0.0535
Relative absolute error 1.1672 %
Root relative squared error 15.2785 %
Total Number of Instances 1500


=== Confusion Matrix ===

a b c d e f g <-- classified as
205 0 0 0 0 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
1 0 205 0 2 0 0 | c = foliage
1 0 0 217 2 0 0 | d = cement
2 0 6 1 195 0 0 | e = window
0 0 0 0 0 236 0 | f = path
0 0 0 0 0 0 207 | g = grass

=== Stratified cross-validation ===

Correctly Classified Instances 1450 96.6667 %
Incorrectly Classified Instances 50 3.3333 %
Kappa statistic 0.9611
Mean absolute error 0.0095
Root mean squared error 0.0976
Relative absolute error 3.8904 %
Root relative squared error 27.8938 %
Total Number of Instances 1500

Cross-validation ran in parallel using this computer and the following machines:
localhost/127.0.0.1

=== Confusion Matrix ===

a b c d e f g <-- classified as
199 0 1 2 3 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
0 1 196 3 8 0 0 | c = foliage
0 0 6 209 5 0 0 | d = cement
2 0 9 6 187 0 0 | e = window
0 0 0 2 0 234 0 | f = path
0 0 0 0 2 0 205 | g = grass

Thu Feb 21 19:39:10 EST 2008: Connection job 1 with localhost closed.

Test remote client. Use telnet first to check the port

telnet ec2-72-44-33-131.compute-1.amazonaws.com 8001

Escape Character is 'CTRL+]'

Telnet> quit

On the server

[weka@ip-10-251-71-99 gridweka]$ Thu Feb 21 19:48:57 EST 2008: Connection job 0 with 203-214-155-114.dyn.iinet.net.au closed.

Issues with configuration file location


Continuing: running the client on Linux, specifying the -C option to use 2 parallel servers.

java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 \
-T $WEKAHOME/data/segment-test.arff -l segment.model -a -C 2

Thu Feb 21 20:31:32 EST 2008: Processed job 0 request from localhost
Thu Feb 21 20:31:32 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 20:31:32 EST 2008: Processed job 5 request from localhost
Thu Feb 21 20:31:32 EST 2008: Connection job 5 with localhost closed.

J48 pruned tree
------------------

region-centroid-row <= 155
|   value-mean <= 91.4444
|   |   rawred-mean <= 24.6667
|   |   |   hue-mean <= -1.89048
|   |   |   |   hue-mean <= -2.22266
|   |   |   |   |   region-centroid-row <= 146: foliage (102.0/1.0)
|   |   |   |   |   region-centroid-row > 146: cement (3.0)
|   |   |   |   hue-mean > -2.22266
|   |   |   |   |   rawred-mean <= 2.55556
|   |   |   |   |   |   hue-mean <= -2.09121
|   |   |   |   |   |   |   region-centroid-row <= 129: foliage (50.0)
|   |   |   |   |   |   |   region-centroid-row > 129
|   |   |   |   |   |   |   |   region-centroid-col <= 128
|   |   |   |   |   |   |   |   |   rawred-mean <= 0.666667: foliage (30.0/4.0)
|   |   |   |   |   |   |   |   |   rawred-mean > 0.666667: window (5.0)
|   |   |   |   |   |   |   |   region-centroid-col > 128
|   |   |   |   |   |   |   |   |   vedge-mean <= 0.333334: window (11.0)
|   |   |   |   |   |   |   |   |   vedge-mean > 0.333334
|   |   |   |   |   |   |   |   |   |   region-centroid-col <= 216: window (3.0)
|   |   |   |   |   |   |   |   |   |   region-centroid-col > 216: foliage (2.0)
|   |   |   |   |   |   hue-mean > -2.09121: window (38.0/1.0)
|   |   |   |   |   rawred-mean > 2.55556
|   |   |   |   |   |   region-centroid-row <= 121
|   |   |   |   |   |   |   exgreen-mean <= -15.4444: brickface (2.0/1.0)
|   |   |   |   |   |   |   exgreen-mean > -15.4444
|   |   |   |   |   |   |   |   vedge-mean <= 2.94444: window (75.0)
|   |   |   |   |   |   |   |   vedge-mean > 2.94444
|   |   |   |   |   |   |   |   |   region-centroid-col <= 134: cement (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 134: window (8.0)
|   |   |   |   |   |   region-centroid-row > 121
|   |   |   |   |   |   |   rawred-mean <= 7.88889
|   |   |   |   |   |   |   |   region-centroid-col <= 43: brickface (2.0)
|   |   |   |   |   |   |   |   region-centroid-col > 43: window (13.0/2.0)
|   |   |   |   |   |   |   rawred-mean > 7.88889
|   |   |   |   |   |   |   |   saturation-mean <= 0.492526: cement (15.0)
|   |   |   |   |   |   |   |   saturation-mean > 0.492526
|   |   |   |   |   |   |   |   |   region-centroid-col <= 82: foliage (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 82: cement (4.0/1.0)
|   |   |   hue-mean > -1.89048
|   |   |   |   exgreen-mean <= -4.77778
|   |   |   |   |   vedge-mean <= 2.77778: brickface (198.0/2.0)
|   |   |   |   |   vedge-mean > 2.77778
|   |   |   |   |   |   region-centroid-row <= 115: brickface (4.0)
|   |   |   |   |   |   region-centroid-row > 115: foliage (3.0/1.0)
|   |   |   |   exgreen-mean > -4.77778
|   |   |   |   |   hedge-mean <= 0.833335
|   |   |   |   |   |   region-centroid-col <= 115: foliage (4.0)
|   |   |   |   |   |   region-centroid-col > 115: window (42.0)
|   |   |   |   |   hedge-mean > 0.833335: grass (2.0)
|   |   rawred-mean > 24.6667
|   |   |   hue-mean <= -2.17742
|   |   |   |   vedge-mean <= 5: window (4.0/1.0)
|   |   |   |   vedge-mean > 5: foliage (18.0)
|   |   |   hue-mean > -2.17742
|   |   |   |   rawgreen-mean <= 24.4444: brickface (3.0/1.0)
|   |   |   |   rawgreen-mean > 24.4444: cement (180.0)
|   value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
|   exgreen-mean <= -2
|   |   saturation-mean <= 0.385555
|   |   |   region-centroid-row <= 159
|   |   |   |   region-centroid-col <= 208: cement (3.0)
|   |   |   |   region-centroid-col > 208: path (2.0)
|   |   |   region-centroid-row > 159: path (234.0)
|   |   saturation-mean > 0.385555: cement (11.0)
|   exgreen-mean > -2: grass (205.0)

Number of Leaves : 34

Size of the tree : 67



=== Error on test data ===

Correctly Classified Instances 779 96.1728 %
Incorrectly Classified Instances 31 3.8272 %
Kappa statistic 0.9553
Mean absolute error 0.0109
Root mean squared error 0.1046
Relative absolute error 4.4715 %
Root relative squared error 29.905 %
Total Number of Instances 810


=== Confusion Matrix ===

a b c d e f g <-- classified as
124 0 0 0 1 0 0 | a = brickface
0 110 0 0 0 0 0 | b = sky
1 0 119 0 2 0 0 | c = foliage
1 0 0 107 2 0 0 | d = cement
1 0 12 7 105 0 1 | e = window
0 0 0 0 0 94 0 | f = path
0 0 1 0 0 2 120 | g = grass

Running parallel cross-validation on 2 servers

java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 -t \
$WEKAHOME/data/segment-challenge.arff -d segment.model -x 10 -a -C 2
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 20:34:09 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 20:34:09 EST 2008: Processed job 6 request from localhost
Thu Feb 21 20:34:10 EST 2008: Connection job 6 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Using 2st Server(2) to do crossValidate.
Thu Feb 21 20:34:13 EST 2008: Processed job 1 request from localhost
server 2: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 1 time 571
------------------------------------
server 2: index 1
server 2: index 2
server 2: index 3
server 2: index 4
server 2: index 5
server 2: index 6
server 2: index 7
server 2: index 8
server 2: index 9

J48 pruned tree
------------------

region-centroid-row <= 155
|   value-mean <= 91.4444
|   |   rawred-mean <= 24.6667
|   |   |   hue-mean <= -1.89048
|   |   |   |   hue-mean <= -2.22266
|   |   |   |   |   region-centroid-row <= 146: foliage (102.0/1.0)
|   |   |   |   |   region-centroid-row > 146: cement (3.0)
|   |   |   |   hue-mean > -2.22266
|   |   |   |   |   rawred-mean <= 2.55556
|   |   |   |   |   |   hue-mean <= -2.09121
|   |   |   |   |   |   |   region-centroid-row <= 129: foliage (50.0)
|   |   |   |   |   |   |   region-centroid-row > 129
|   |   |   |   |   |   |   |   region-centroid-col <= 128
|   |   |   |   |   |   |   |   |   rawred-mean <= 0.666667: foliage (30.0/4.0)
|   |   |   |   |   |   |   |   |   rawred-mean > 0.666667: window (5.0)
|   |   |   |   |   |   |   |   region-centroid-col > 128
|   |   |   |   |   |   |   |   |   vedge-mean <= 0.333334: window (11.0)
|   |   |   |   |   |   |   |   |   vedge-mean > 0.333334
|   |   |   |   |   |   |   |   |   |   region-centroid-col <= 216: window (3.0)
|   |   |   |   |   |   |   |   |   |   region-centroid-col > 216: foliage (2.0)
|   |   |   |   |   |   hue-mean > -2.09121: window (38.0/1.0)
|   |   |   |   |   rawred-mean > 2.55556
|   |   |   |   |   |   region-centroid-row <= 121
|   |   |   |   |   |   |   exgreen-mean <= -15.4444: brickface (2.0/1.0)
|   |   |   |   |   |   |   exgreen-mean > -15.4444
|   |   |   |   |   |   |   |   vedge-mean <= 2.94444: window (75.0)
|   |   |   |   |   |   |   |   vedge-mean > 2.94444
|   |   |   |   |   |   |   |   |   region-centroid-col <= 134: cement (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 134: window (8.0)
|   |   |   |   |   |   region-centroid-row > 121
|   |   |   |   |   |   |   rawred-mean <= 7.88889
|   |   |   |   |   |   |   |   region-centroid-col <= 43: brickface (2.0)
|   |   |   |   |   |   |   |   region-centroid-col > 43: window (13.0/2.0)
|   |   |   |   |   |   |   rawred-mean > 7.88889
|   |   |   |   |   |   |   |   saturation-mean <= 0.492526: cement (15.0)
|   |   |   |   |   |   |   |   saturation-mean > 0.492526
|   |   |   |   |   |   |   |   |   region-centroid-col <= 82: foliage (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 82: cement (4.0/1.0)
|   |   |   hue-mean > -1.89048
|   |   |   |   exgreen-mean <= -4.77778
|   |   |   |   |   vedge-mean <= 2.77778: brickface (198.0/2.0)
|   |   |   |   |   vedge-mean > 2.77778
|   |   |   |   |   |   region-centroid-row <= 115: brickface (4.0)
|   |   |   |   |   |   region-centroid-row > 115: foliage (3.0/1.0)
|   |   |   |   exgreen-mean > -4.77778
|   |   |   |   |   hedge-mean <= 0.833335
|   |   |   |   |   |   region-centroid-col <= 115: foliage (4.0)
|   |   |   |   |   |   region-centroid-col > 115: window (42.0)
|   |   |   |   |   hedge-mean > 0.833335: grass (2.0)
|   |   rawred-mean > 24.6667
|   |   |   hue-mean <= -2.17742
|   |   |   |   vedge-mean <= 5: window (4.0/1.0)
|   |   |   |   vedge-mean > 5: foliage (18.0)
|   |   |   hue-mean > -2.17742
|   |   |   |   rawgreen-mean <= 24.4444: brickface (3.0/1.0)
|   |   |   |   rawgreen-mean > 24.4444: cement (180.0)
|   value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
|   exgreen-mean <= -2
|   |   saturation-mean <= 0.385555
|   |   |   region-centroid-row <= 159
|   |   |   |   region-centroid-col <= 208: cement (3.0)
|   |   |   |   region-centroid-col > 208: path (2.0)
|   |   |   region-centroid-row > 159: path (234.0)
|   |   saturation-mean > 0.385555: cement (11.0)
|   exgreen-mean > -2: grass (205.0)

Number of Leaves : 34

Size of the tree : 67


Time taken to build model: 0.96 seconds
Time taken to test model on training data: 0.09 seconds

=== Error on training data ===

Correctly Classified Instances 1485 99 %
Incorrectly Classified Instances 15 1 %
Kappa statistic 0.9883
Mean absolute error 0.0029
Root mean squared error 0.0535
Relative absolute error 1.1672 %
Root relative squared error 15.2785 %
Total Number of Instances 1500


=== Confusion Matrix ===

a b c d e f g <-- classified as
205 0 0 0 0 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
1 0 205 0 2 0 0 | c = foliage
1 0 0 217 2 0 0 | d = cement
2 0 6 1 195 0 0 | e = window
0 0 0 0 0 236 0 | f = path
0 0 0 0 0 0 207 | g = grass

=== Stratified cross-validation ===

Correctly Classified Instances 1436 95.7333 %
Incorrectly Classified Instances 64 4.2667 %
Kappa statistic 0.9502
Mean absolute error 0.0122
Root mean squared error 0.1104
Relative absolute error 4.9799 %
Root relative squared error 31.5589 %
Total Number of Instances 1500

Cross-validation ran in parallel using this computer and the following machines:
localhost/127.0.0.1

=== Confusion Matrix ===

a b c d e f g <-- classified as
196 0 3 1 5 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
0 1 196 2 9 0 0 | c = foliage
2 0 4 207 6 1 0 | d = cement
3 0 16 6 179 0 0 | e = window
0 0 0 3 0 233 0 | f = path
0 0 0 0 2 0 205 | g = grass

Thu Feb 21 20:34:18 EST 2008: Connection job 1 with localhost closed.
Thu Feb 21 20:34:18 EST 2008: Connection job 0 with localhost closed.

Ok starting another server with two more GridWeka servers

[weka@ip-10-251-69-175 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8001 &
[1] 2694
[weka@ip-10-251-69-175 gridweka]$ Thu Feb 21 20:56:32 EST 2008: Server started on port 8001
Thu Feb 21 20:56:32 EST 2008: Waiting for connections...

[weka@ip-10-251-69-175 gridweka]$ java -classpath /home/weka/gridweka/weka.jar weka.core.DistributedServer 8002 &
[2] 2704
[weka@ip-10-251-69-175 gridweka]$ Thu Feb 21 20:56:40 EST 2008: Server started on port 8002
Thu Feb 21 20:56:40 EST 2008: Waiting for connections...

[weka@ip-10-251-69-175 gridweka]$ hostname
ip-10-251-69-175

Updating the .weka-parallel file

[weka@ip-10-251-71-99 gridweka]$ vi /home/weka/.weka-parallel
[weka@ip-10-251-71-99 gridweka]$ cat /home/weka/.weka-parallel
PORT=8001
localhost
2
1024
ip-10-251-69-175
2
1024

OK, running the previous classification with cross-validation, this time on 2 machines each running 2 Grid Weka servers.


[weka@ip-10-251-71-99 gridweka]$ java -classpath /home/weka/gridweka/weka.jar \
weka.classifiers.trees.J48 -t $WEKAHOME/data/segment-challenge.arff \
-d segment.model -x 10 -a -C 4
---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 21:02:51 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 21:02:51 EST 2008: Processed job 9 request from localhost
Thu Feb 21 21:02:51 EST 2008: Connection job 9 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Using 2st Server(2) to do crossValidate.
---Judgement--- server 3: Memory free : 524288000 --> Passed!
Using 3st Server(3) to do crossValidate.
---Judgement--- server 4: Memory free : 524288000 --> Passed!
Using 4st Server(4) to do crossValidate.
server 3: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
server 3: index 1
**************** Checking servers' status ****************Thu Feb 21 21:03:04 EST 2008: Processed job 3 request from localhost

--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
server 3: index 3
server 3: index 4
server 2: index 2
--------- Refresh the Rank ---------
--> connectInfo 0: rank 3 time 0
--> connectInfo 1: rank 4 time 0
--> connectInfo 2: rank 2 time 608
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
Thu Feb 21 21:03:04 EST 2008: Processed job 10 request from localhost
server 2: index 7
server 3: index 8
server 2: index 9
server 3: index 5
server 1: index 6
--------- Refresh the Rank ---------
--> connectInfo 0: rank 4 time 0
--> connectInfo 1: rank 3 time 1063
--> connectInfo 2: rank 2 time 608
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 5 time 0
------------------------------------
server 2: index 5
server 4: index 5
--------- Refresh the Rank ---------
--> connectInfo 0: rank 5 time 0
--> connectInfo 1: rank 4 time 1063
--> connectInfo 2: rank 2 time 608
--> connectInfo 3: rank 1 time 585
--> connectInfo 4: rank 3 time 973
------------------------------------
Thu Feb 21 21:03:06 EST 2008: Connection job 10 with localhost closed.

J48 pruned tree
------------------

region-centroid-row <= 155
|   value-mean <= 91.4444
|   |   rawred-mean <= 24.6667
|   |   |   hue-mean <= -1.89048
|   |   |   |   hue-mean <= -2.22266
|   |   |   |   |   region-centroid-row <= 146: foliage (102.0/1.0)
|   |   |   |   |   region-centroid-row > 146: cement (3.0)
|   |   |   |   hue-mean > -2.22266
|   |   |   |   |   rawred-mean <= 2.55556
|   |   |   |   |   |   hue-mean <= -2.09121
|   |   |   |   |   |   |   region-centroid-row <= 129: foliage (50.0)
|   |   |   |   |   |   |   region-centroid-row > 129
|   |   |   |   |   |   |   |   region-centroid-col <= 128
|   |   |   |   |   |   |   |   |   rawred-mean <= 0.666667: foliage (30.0/4.0)
|   |   |   |   |   |   |   |   |   rawred-mean > 0.666667: window (5.0)
|   |   |   |   |   |   |   |   region-centroid-col > 128
|   |   |   |   |   |   |   |   |   vedge-mean <= 0.333334: window (11.0)
|   |   |   |   |   |   |   |   |   vedge-mean > 0.333334
|   |   |   |   |   |   |   |   |   |   region-centroid-col <= 216: window (3.0)
|   |   |   |   |   |   |   |   |   |   region-centroid-col > 216: foliage (2.0)
|   |   |   |   |   |   hue-mean > -2.09121: window (38.0/1.0)
|   |   |   |   |   rawred-mean > 2.55556
|   |   |   |   |   |   region-centroid-row <= 121
|   |   |   |   |   |   |   exgreen-mean <= -15.4444: brickface (2.0/1.0)
|   |   |   |   |   |   |   exgreen-mean > -15.4444
|   |   |   |   |   |   |   |   vedge-mean <= 2.94444: window (75.0)
|   |   |   |   |   |   |   |   vedge-mean > 2.94444
|   |   |   |   |   |   |   |   |   region-centroid-col <= 134: cement (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 134: window (8.0)
|   |   |   |   |   |   region-centroid-row > 121
|   |   |   |   |   |   |   rawred-mean <= 7.88889
|   |   |   |   |   |   |   |   region-centroid-col <= 43: brickface (2.0)
|   |   |   |   |   |   |   |   region-centroid-col > 43: window (13.0/2.0)
|   |   |   |   |   |   |   rawred-mean > 7.88889
|   |   |   |   |   |   |   |   saturation-mean <= 0.492526: cement (15.0)
|   |   |   |   |   |   |   |   saturation-mean > 0.492526
|   |   |   |   |   |   |   |   |   region-centroid-col <= 82: foliage (2.0)
|   |   |   |   |   |   |   |   |   region-centroid-col > 82: cement (4.0/1.0)
|   |   |   hue-mean > -1.89048
|   |   |   |   exgreen-mean <= -4.77778
|   |   |   |   |   vedge-mean <= 2.77778: brickface (198.0/2.0)
|   |   |   |   |   vedge-mean > 2.77778
|   |   |   |   |   |   region-centroid-row <= 115: brickface (4.0)
|   |   |   |   |   |   region-centroid-row > 115: foliage (3.0/1.0)
|   |   |   |   exgreen-mean > -4.77778
|   |   |   |   |   hedge-mean <= 0.833335
|   |   |   |   |   |   region-centroid-col <= 115: foliage (4.0)
|   |   |   |   |   |   region-centroid-col > 115: window (42.0)
|   |   |   |   |   hedge-mean > 0.833335: grass (2.0)
|   |   rawred-mean > 24.6667
|   |   |   hue-mean <= -2.17742
|   |   |   |   vedge-mean <= 5: window (4.0/1.0)
|   |   |   |   vedge-mean > 5: foliage (18.0)
|   |   |   hue-mean > -2.17742
|   |   |   |   rawgreen-mean <= 24.4444: brickface (3.0/1.0)
|   |   |   |   rawgreen-mean > 24.4444: cement (180.0)
|   value-mean > 91.4444: sky (220.0)
region-centroid-row > 155
|   exgreen-mean <= -2
|   |   saturation-mean <= 0.385555
|   |   |   region-centroid-row <= 159
|   |   |   |   region-centroid-col <= 208: cement (3.0)
|   |   |   |   region-centroid-col > 208: path (2.0)
|   |   |   region-centroid-row > 159: path (234.0)
|   |   saturation-mean > 0.385555: cement (11.0)
|   exgreen-mean > -2: grass (205.0)

Number of Leaves : 34

Size of the tree : 67


Time taken to build model: 5.06 seconds
Time taken to test model on training data: 0.09 seconds

=== Error on training data ===

Correctly Classified Instances 1485 99 %
Incorrectly Classified Instances 15 1 %
Kappa statistic 0.9883
Mean absolute error 0.0029
Root mean squared error 0.0535
Relative absolute error 1.1672 %
Root relative squared error 15.2785 %
Total Number of Instances 1500


=== Confusion Matrix ===

a b c d e f g <-- classified as
205 0 0 0 0 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
1 0 205 0 2 0 0 | c = foliage
1 0 0 217 2 0 0 | d = cement
2 0 6 1 195 0 0 | e = window
0 0 0 0 0 236 0 | f = path
0 0 0 0 0 0 207 | g = grass

=== Stratified cross-validation ===

Correctly Classified Instances 1436 95.7333 %
Incorrectly Classified Instances 64 4.2667 %
Kappa statistic 0.9502
Mean absolute error 0.0122
Root mean squared error 0.1104
Relative absolute error 4.9799 %
Root relative squared error 31.5589 %
Total Number of Instances 1500

Cross-validation ran in parallel using this computer and the following machines:
ip-10-251-69-175/10.251.69.175
localhost/127.0.0.1
localhost/127.0.0.1
ip-10-251-69-175/10.251.69.175

=== Confusion Matrix ===

a b c d e f g <-- classified as
196 0 3 1 5 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
0 1 196 2 9 0 0 | c = foliage
2 0 4 207 6 1 0 | d = cement
3 0 16 6 179 0 0 | e = window
0 0 0 3 0 233 0 | f = path
0 0 0 0 2 0 205 | g = grass

Thu Feb 21 21:03:07 EST 2008: Connection job 3 with localhost closed.

Final demo, using the Leukemia-ALLAML data

time java -classpath /home/weka/gridweka/weka.jar weka.classifiers.trees.J48 \
-t $WEKAHOME/data/ALL-AML_train.arff -d Leukemia-ALLAML.tree.J48.model \
-i -x 10 -a -C 4

---Judgement--- server 1: Memory free : 524288000 --> Passed!
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Thu Feb 21 21:10:03 EST 2008: Connection job 0 with localhost closed.
Thu Feb 21 21:10:03 EST 2008: Processed job 13 request from localhost
Thu Feb 21 21:10:05 EST 2008: Connection job 13 with localhost closed.
---Judgement--- server 1: Memory free : 524288000 --> Passed!
Using 1st Server(1) to do crossValidate.
---Judgement--- server 2: Memory free : 524288000 --> Passed!
Using 2st Server(2) to do crossValidate.
---Judgement--- server 3: Memory free : 524288000 --> Passed!
Using 3st Server(3) to do crossValidate.
---Judgement--- server 4: Memory free : 524288000 --> Passed!
Using 4st Server(4) to do crossValidate.
Thu Feb 21 21:10:57 EST 2008: Processed job 5 request from localhost
Thu Feb 21 21:11:03 EST 2008: Processed job 14 request from localhost
server 4: index 2
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 1 time 676
------------------------------------
server 4: index 4
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 2 time 0
--> connectInfo 1: rank 3 time 0
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 1 time 676
------------------------------------
server 4: index 5
server 1: index 3
--------- Refresh the Rank ---------
--> connectInfo 0: rank 3 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 2 time 676
------------------------------------
server 1: index 6
server 4: index 7
server 1: index 8
server 4: index 9
server 1: index 0
server 4: index 0
**************** Checking servers' status ****************
--------- Refresh the Rank ---------
--> connectInfo 0: rank 3 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 4 time 0
--> connectInfo 3: rank 5 time 0
--> connectInfo 4: rank 2 time 676
------------------------------------
server 1: index 1

J48 pruned tree
------------------

attribute4847 <= 938: ALL (27.0)
attribute4847 > 938: AML (11.0)

Number of Leaves : 2

Size of the tree : 3


Time taken to build model: 5.07 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances 38 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 38


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0 1 1 1 ALL
1 0 1 1 1 AML


=== Confusion Matrix ===

a b <-- classified as
27 0 | a = ALL
0 11 | b = AML

=== Stratified cross-validation ===

Correctly Classified Instances 32 84.2105 %
Incorrectly Classified Instances 6 15.7895 %
Kappa statistic 0.6358
Mean absolute error 0.1579
Root mean squared error 0.3974
Relative absolute error 37.8015 %
Root relative squared error 87.2867 %
Total Number of Instances 38

Cross-validation ran in parallel using this computer and the following machines:
ip-10-251-69-175/10.251.69.175
localhost/127.0.0.1

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.852 0.182 0.92 0.852 0.885 ALL
0.818 0.148 0.692 0.818 0.75 AML

=== Confusion Matrix ===

a b <-- classified as
23 4 | a = ALL
2 9 | b = AML

Thu Feb 21 21:11:10 EST 2008: Connection job 14 with localhost closed.
server 3: index 0
--------- Refresh the Rank ---------
--> connectInfo 0: rank 4 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 5 time 0
--> connectInfo 3: rank 3 time 699
--> connectInfo 4: rank 2 time 676
------------------------------------
server 2: index 1
--------- Refresh the Rank ---------
--> connectInfo 0: rank 5 time 0
--> connectInfo 1: rank 1 time 606
--> connectInfo 2: rank 2 time 637
--> connectInfo 3: rank 4 time 699
--> connectInfo 4: rank 3 time 676
------------------------------------
Thu Feb 21 21:11:21 EST 2008: Connection job 5 with localhost closed.

real 1m19.614s
user 0m13.140s
sys 0m12.660s

Rerunning with the Java option -Xprof to profile CPU usage:

Flat profile of 39.70 secs (113 total ticks): Thread-5

Interpreted + native Method
49.5% 0 + 53 java.net.SocketInputStream.socketRead0
8.4% 0 + 9 java.net.SocketOutputStream.socketWrite0
57.9% 0 + 62 Total interpreted

Compiled + native Method
0.9% 0 + 1 java.lang.String.<init>
0.9% 0 + 1 Total compiled

Stub + native Method
32.7% 0 + 35 java.io.FileInputStream.read
6.5% 0 + 7 java.net.SocketOutputStream.socketWrite0
1.9% 0 + 2 java.lang.System.identityHashCode
41.1% 0 + 44 Total stub

Thread-local ticks:
5.3% 6 Blocked (of total)


Flat profile of 34.95 secs (111 total ticks): Thread-7

Compiled + native Method
1.8% 1 + 1 weka.classifiers.EvaluationClient.determineIndex
1.8% 1 + 1 Total compiled

Thread-local ticks:
98.2% 109 Compilation


Flat profile of 54.81 secs (473 total ticks): main

Interpreted + native Method
0.6% 0 + 2 java.net.Inet4AddressImpl.lookupAllHostAddr
0.3% 0 + 1 java.lang.System.currentTimeMillis
0.3% 0 + 1 java.io.FileInputStream.read
0.3% 0 + 1 java.lang.Thread.start0
0.3% 0 + 1 java.util.zip.ZipFile.getEntry
0.3% 0 + 1 weka.classifiers.trees.J48.main
0.3% 1 + 0 java.net.InetAddress.getCachedAddress
0.3% 0 + 1 weka.classifiers.Evaluation.evaluateModel
2.5% 1 + 8 Total interpreted

Compiled + native Method
22.5% 81 + 0 weka.classifiers.BuildModelClient.start
1.4% 5 + 0 java.io.DataInputStream.readLine
1.1% 4 + 0 weka.classifiers.EvaluationClient.start
1.1% 4 + 0 sun.misc.FloatingDecimal.readJavaFormatString
0.3% 1 + 0 java.lang.StringBuffer.toString
0.3% 1 + 0 java.lang.AbstractStringBuilder.append
0.3% 1 + 0 java.io.ObjectOutputStream.defaultWriteFields
0.3% 1 + 0 java.lang.String.regionMatches
0.3% 1 + 0 sun.nio.cs.US_ASCII$Decoder.decodeArrayLoop
0.3% 1 + 0 weka.core.Instances.getInstanceFull
0.3% 1 + 0 java.io.ObjectOutputStream.writeObject0
28.1% 101 + 0 Total compiled

Stub + native Method
60.8% 0 + 219 java.io.FileInputStream.read
6.1% 0 + 22 java.io.FileOutputStream.writeBytes
0.3% 0 + 1 java.lang.Class.isArray
0.3% 0 + 1 java.lang.Float.floatToIntBits
67.5% 0 + 243 Total stub

Thread-local ticks:
23.9% 113 Blocked (of total)
1.9% 7 Compilation


Flat profile of 35.26 secs (126 total ticks): Thread-6

Interpreted + native Method
66.1% 0 + 82 java.net.SocketInputStream.socketRead0
7.3% 0 + 9 java.net.SocketOutputStream.socketWrite0
0.8% 0 + 1 java.net.PlainSocketImpl.socketConnect
74.2% 0 + 92 Total interpreted

Compiled + native Method
2.4% 0 + 3 java.io.ObjectStreamClass.lookup
2.4% 0 + 3 Total compiled

Stub + native Method
20.2% 0 + 25 java.io.FileInputStream.read
20.2% 0 + 25 Total stub

Thread-local ticks:
1.6% 2 Blocked (of total)
3.2% 4 Compilation


Global summary of 55.12 seconds:
100.0% 504 Received ticks
1.8% 9 Received GC ticks
1.4% 7 Compilation
0.2% 1 Unknown code

real 0m55.151s
user 0m8.470s
sys 0m12.520s

Rerunning with JFS (parallel access to the network), the time was about 20% faster:

http://jfs.des.udc.es/docs/jfs.html

time jfsrun weka.classifiers.trees.J48 -t $WEKAHOME/data/ALL-AML_train.arff \
-d Leukemia-ALLAML.tree.J48.model -x 10 -a -C 4


real 0m41.736s
user 0m6.990s
sys 0m9.180s




Monday, January 28, 2008

Installing Kettle on EC2

As I mentioned in the roadmap, I am going to run through installing Kettle or Pentaho Data Integration (PDI) on EC2.

For starters I am just using the small instances on EC2; we can start pushing and benchmarking later. Given the disappointing lack of network bandwidth on the larger instances, at least as applications currently use it, running a Kettle Master/Slave cluster is still going to be limited by the amount of traffic needed to maintain and manage the cluster.

On with the show. I had a full-blown Pentaho demo Amazon Machine Image (AMI) ready from a previous post on the Pentaho BI Suite, but this time I wanted to install only the Kettle portion, concentrating on the ETL side of Pentaho.

Install:

  1. Install Java (JDK 1.5 or better)
  2. Install MySQL 5.0 or better.
  3. Download Kettle
  4. mkdir /usr/local/kettle
  5. unzip Kettle-3.0.1.zip -d /usr/local/kettle
  6. chmod +x /usr/local/kettle/*.sh
  7. export PATH=$PATH:/usr/local/kettle/
Tests:

As a simple test that everything is running OK and there are no Java classpath issues, just run runSamples.sh:


cd /usr/local/kettle
./runSamples.sh

EXECUTING TRANSFORMATION [samples/transformations/Add sequence - specify a common counter.ktr]
INFO 27-01 20:34:13,704 (LogWriter.java:println:403) -Pan - Logging is at level : Minimal logging
INFO 27-01 20:34:13,707 (LogWriter.java:println:403) -Pan - Start of run.
2008/01/27 20:34:16:700 EST [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store.
INFO 27-01 20:34:17,111 (LogWriter.java:println:403) -Trans - Dispatching started for filename [samples/transformations/Add sequence - specify a common counter.ktr]
INFO 27-01 20:34:17,477 (LogWriter.java:println:403) -Trans - Transformation ended.
INFO 27-01 20:34:17,483 (LogWriter.java:println:403) -Pan - Finished!
INFO 27-01 20:34:17,484 (LogWriter.java:println:403) -Pan - Start=2008/01/27 20:34:16.954, Stop=2008/01/27 20:34:17.483
INFO 27-01 20:34:17,484 (LogWriter.java:println:403) -Pan - Processing ended after 0 seconds.
EXECUTING TRANSFORMATION [samples/transformations/Aggregate - basics.ktr]
INFO 27-01 20:34:18,221 (LogWriter.java:println:403) -Pan - Logging is at level : Minimal logging
INFO 27-01 20:34:18,223 (LogWriter.java:println:403) -Pan - Start of run.
2008/01/27 20:34:21:225 EST [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store.
INFO 27-01 20:34:21,896 (LogWriter.java:println:403) -Trans - Dispatching started for filename [samples/transformations/Aggregate - basics.ktr]
INFO 27-01 20:34:23,525 (LogWriter.java:println:403) -Trans - Transformation ended.
INFO 27-01 20:34:23,527 (LogWriter.java:println:403) -Pan - Finished!
INFO 27-01 20:34:23,528 (LogWriter.java:println:403) -Pan - Start=2008/01/27 20:34:21.410, Stop=2008/01/27 20:34:23.528
INFO 27-01 20:34:23,528 (LogWriter.java:println:403) -Pan - Processing ended after 2 seconds.
EXECUTING TRANSFORMATION [samples/transformations/Calculator - Substract constant value one from a number.ktr]
INFO 27-01 20:34:24,273 (LogWriter.java:println:403) -Pan - Logging is at level : Minimal logging
INFO 27-01 20:34:24,276 (LogWriter.java:println:403) -Pan - Start of run.
2008/01/27 20:34:27:296 EST [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store.
INFO 27-01 20:34:27,720 (LogWriter.java:println:403) -Trans - Dispatching started for filename [samples/transformations/Calculator - Substract constant value one from a number.ktr]
INFO 27-01 20:34:27,875 (LogWriter.java:println:403) -Trans - Transformation ended.
INFO 27-01 20:34:27,878 (LogWriter.java:println:403) -Pan - Finished!
INFO 27-01 20:34:27,879 (LogWriter.java:println:403) -Pan - Start=2008/01/27 20:34:27.522, Stop=2008/01/27 20:34:27.878
...


Set up the repository database (using MySQL 5.1 on EC2):


export PASSWD=yourpasswordhere
mysql -u root -p$PASSWD

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.1.20-beta-log MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create database kettle_repos;
Query OK, 1 row affected (0.00 sec)

mysql> grant all on kettle_repos.* to 'paulm'@'myhost' identified by 'xxxx';
Query OK, 0 rows affected (0.00 sec)



Test connectivity first using the mysql client; you may need to allow your machine to connect through any firewall and/or grant permission in your EC2 security group. Add only your own IP address and port 3306.
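For the security-group side, something like this should do it, assuming the EC2 API command-line tools are installed; the group name and IP address below are placeholders, not values from this post:

```shell
# Placeholder group name and address; substitute your own. This opens
# TCP port 3306 (MySQL) to a single source IP only.
GROUP=default
MYIP=203.0.113.7
ec2-authorize $GROUP -P tcp -p 3306 -s $MYIP/32
```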


mysql -u paulm -p$PASSWD --host=ec2host -D kettle_repos --protocol=tcp

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5 to server version: 5.1.20-beta-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show tables;
--------------
show tables
--------------

Empty set (0.24 sec)

mysql> exit


Now create a new database connection in Pentaho Spoon, the Kettle GUI.



Results:

Connection to database [MySQL51_repos] is OK.
Hostname : ec2-67-202-33-238.compute-1.amazonaws.com
Port : 3306
Database name : kettle_repos



Now create a new repository, choosing the create/upgrade option in the connection dialog.

Start a new transformation.

I understand this is a simple step; if this were all I needed, I would just use LOAD DATA INFILE to load the data. The idea is to build on simple tasks, learning new things along the way, and then combine those simple steps into complex analyses that truly push the Kettle ETL engine.
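For comparison, a sketch of the one-statement load I would otherwise use; the table, file, and connection names are the ones used elsewhere in this post, and IGNORE 1 LINES skips the header row:

```shell
# Build the one-off load statement; LOCAL reads the file from the
# client machine, IGNORE 1 LINES skips the CSV header row.
LOAD_SQL="LOAD DATA LOCAL INFILE 'kddcup.data.1000.csv' INTO TABLE KDD99.KDD99
FIELDS TERMINATED BY ',' IGNORE 1 LINES"
mysql -u paulm -p$PASSWD --host=ec2host -e "$LOAD_SQL"
```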

I am using the KDD Cup 1999 dataset, just 1000 rows of CSV to start:

zcat kddcup.data_10_percent.gz|head -1000 > kddcup.data.1000.csv

The headers are in a separate kddcup.names file, however, so I mucked around with awk and tr to turn it into a single CSV header row, which I added to the top of the CSV file to make the file input step and field names easier to edit.


cat kddcup.names |awk -F ":" ' { print $1"," }' > kddcup.name.header
tr -d '\n' < kddcup.name.header > kddcup.name.header.clean
cat kddcup.name.header.clean

type,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
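A self-contained sketch of the same pipeline (the sample kddcup.names lines below are illustrative, not the real file): awk can strip the types and join the names with commas in a single pass, avoiding the trailing comma and missing final newline the tr step can leave behind.

```shell
# Illustrative sample of kddcup.names entries, one "name: type." per
# line; the real file has one entry per feature.
cat > sample.names <<'EOF'
duration: continuous.
protocol_type: symbolic.
service: symbolic.
EOF

# Strip the ": type." part and comma-join the names in one pass.
awk -F ":" '{ printf "%s%s", sep, $1; sep = "," } END { print "" }' sample.names
```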



Create a database to hold the data and grant access.


mysql -u root -p$PASSWD
mysql> create database KDD99;
Query OK, 1 row affected (0.03 sec)

mysql> grant all on KDD99.* to 'paulm'@'myhost' identified by 'xxxx';
Query OK, 0 rows affected (0.00 sec)



Sample CREATE TABLE, generated from table output step


CREATE TABLE KDD99.KDD99
(
type INT
, duration VARCHAR(3)
, protocol_type VARCHAR(8)
, service VARCHAR(8)
, flag INT
, src_bytes INT
, dst_bytes INT
, land INT
, wrong_fragment INT
, urgent INT
, hot INT
, num_failed_logins INT
, logged_in INT
, num_compromised INT
, root_shell INT
, su_attempted INT
, num_root INT
, num_file_creations INT
, num_shells INT
, num_access_files INT
, num_outbound_cmds INT
, is_host_login INT
, is_guest_login INT
, count INT
, srv_count INT
, serror_rate INT
, srv_serror_rate INT
, rerror_rate INT
, srv_rerror_rate INT
, same_srv_rate INT
, diff_srv_rate FLOAT
, srv_diff_host_rate INT
, dst_host_count INT
, dst_host_srv_count INT
, dst_host_same_srv_rate INT
, dst_host_diff_srv_rate FLOAT
, dst_host_same_src_port_rate FLOAT
, dst_host_srv_diff_host_rate INT
, dst_host_serror_rate INT
, dst_host_srv_serror_rate INT
, dst_host_rerror_rate INT
, dst_host_srv_rerror_rate VARCHAR(25)
)
;



Run the transformation.
The most fun part of any data load is discovering that your data types and lengths are not always big enough.


2008/01/28 14:14:00 - Spoon - Transformation opened.
2008/01/28 14:14:00 - Spoon - Launching transformation [Transformation 1]...
2008/01/28 14:14:00 - Spoon - Started the transformation execution.
2008/01/28 14:14:00 - Transformation 1 - Dispatching started for transformation [Transformation 1]
2008/01/28 14:14:00 - Transformation 1 - Nr of arguments detected:0
2008/01/28 14:14:00 - Transformation 1 - This is not a replay transformation
2008/01/28 14:14:00 - Transformation 1 - This transformation can be replayed with replay date: 2008/01/28 14:14:00
2008/01/28 14:14:00 - Transformation 1 - Initialising 2 steps...
2008/01/28 14:14:02 - Table output.0 - Connected to database [MySQL51_data] (commit=1000)
2008/01/28 14:14:03 - CSV file input.0 - Starting to run...
2008/01/28 14:14:03 - Table output.0 - Starting to run...
2008/01/28 14:14:03 - CSV file input.0 - Finished processing (I=1001, O=0, R=0, W=1000, U=0, E=0)
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Because of an error, this step can't continue:
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Error batch inserting rows into table [KDD99].
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Errors encountered (first 10):
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Data truncation: Data too long for column 'duration' at row 1
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) :
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) :
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Error updating batch
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Data truncation: Data too long for column 'duration' at row 1
2008/01/28 14:18:07 - Spoon - The transformation has finished!!
2008/01/28 14:18:08 - Table output.0 - Finished processing (I=0, O=999, R=1000, W=0, U=0, E=1)
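One hedged way out of that truncation error (my guess at a fix, not what the post settled on): the generated VARCHAR(3) for duration is too narrow, and since duration in KDD99 is a numeric count of seconds, widening it to an INT and re-running the transformation in Spoon should clear it.

```shell
# Guessed fix: widen the too-narrow 'duration' column, then re-run
# the transformation in Spoon.
FIX_SQL="ALTER TABLE KDD99.KDD99 MODIFY duration INT"
mysql -u root -p$PASSWD -e "$FIX_SQL"
```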