VM Datamining: July 2007

Sunday, July 29, 2007

Roadmap for testing data mining and business intelligence software

Thought I would update what I am planning:

Basically, if I can get a demo of any of the majors, expect me to have an install and some demo testing done.

There are plenty out there so I am going to start with majors and minors.

Roadmap for testing datamining software.

Weka
Pentaho
NumPy
Picalo
R Statistical Language Clustering techniques.
many more...

If people want specific tools or software tested please comment on this post.

Have Fun

Paul

Friday, July 20, 2007

Bizgres Clickstream Demo

As part of this series I have been reviewing and creating Amazon Machine Images (AMIs) of some of the open source data mining and business intelligence software out there.

Apart from some earlier issues with getting Java installed, it has been straightforward.

Please check these articles for the background:

http://blog.vmdatamine.com/2007/07/bizgres-greenplum-on-ec2.html
https://wiki.dbadojo.com/howto-install-bizgres-on-centos
http://blog.vmdatamine.com/2007/07/running-bizgres-demo.html

To run the Bizgres Clickstream demo you will need to download this tarball and use the Clickstream User Guide.

My only issue was with the environment variables: I had a PostgreSQL database specified, and that caused the demo to fail during its test of the database. Under the hood it uses a test suite similar to the IVP outlined in the article on the demo.
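One way to avoid this kind of clash is to clear the PostgreSQL client variables before running make. The following is a minimal sketch, not the fix actually used here: the post does not record which variable was at fault, so it assumes the four standard libpq ones (PGDATABASE, PGHOST, PGPORT, PGUSER), and the function name clear_pg_env is made up for illustration.

```shell
# Clear the standard libpq client variables so that psql and the demo's
# database test fall back to their defaults. Which variable actually
# broke the demo is not recorded, so this clears the usual four.
clear_pg_env() {
    unset PGDATABASE PGHOST PGPORT PGUSER
}

clear_pg_env
# Show any PG* variables that are still set (PGPATH is expected to remain).
env | grep '^PG' || echo "no PG* variables set"
```

Running this in the same shell session before make leaves PGPATH and BIZHOME alone, which the profile further down still needs.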

After that everything was good. I had to open port 8080 on EC2 for the security group the instance was running in, to be able to view the Jasper reports.

I will write another article about my thoughts after running through the demo and clickstream demo.

I have included some screen shots as well (edited to remove my Firefox bar)

Have Fun

Paul

Here are the .bash_profile and bizgres_path.sh I set up


[bgadmin@domU-12-31-35-00-1C-C1 bizgresClickStream]$ cat ~/.bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

unset USERNAME


# For Bizgres

# This is the installed location of the Bizgres binaries. For example:

BIZHOME=/usr/local/bizgres
export BIZHOME

# Bizgres is installed with a Java Development Kit (JDK) installation that is compatible with Bizgres.

JAVA_HOME=/usr/local/jdk1.5.0_12/
export JAVA_HOME

# The LD_LIBRARY_PATH environment variable should point to the location of the PostgreSQL library files.
# For Solaris, this also points to the GNU compiler and readline library files as well.

LD_LIBRARY_PATH=$BIZHOME/pgsql/lib:$BIZHOME/lib
export LD_LIBRARY_PATH

# The default port number of the Bizgres/PostgreSQL database server.

#PGPORT=5432
#export PGPORT

# The location of the PostgreSQL manual pages.

MANPATH=$BIZHOME/doc:$MANPATH
export MANPATH

# The name of the default Bizgres/PostgreSQL database to use.

#PGDATABASE=bizdb
#export PGDATABASE

PGPATH=$BIZHOME
export PGPATH

# The host name of the Bizgres/PostgreSQL database server that clients use to connect to the database.

#PGHOST=`hostname`
#export PGHOST

#PGUSER=bgadmin
#export PGUSER

# Your PATH environment variable should point to the location of your JDK bin directory (listed first),
# the location of the Bizgres Loader bin directory,
# and the location of the Bizgres database engine (PostgreSQL) bin directory.

PATH=$JAVA_HOME/bin:$BIZHOME/pgsql/bin:$BIZHOME/client/loader/bin:$PATH
#PATH=$PATH:$HOME/bin

export PATH
unset USERNAME

 The bizgres_path.sh file 

[bgadmin@domU-12-31-35-00-1C-C1 bizgresClickStream]$ cat /usr/local/bizgres/bizgres_path.sh
BIZHOME=/usr/local/bizgres
PATH=$BIZHOME/pgsql/bin:$BIZHOME/client/loader/bin:$PATH
LD_LIBRARY_PATH=$BIZHOME/lib
MANPATH=$BIZHOME/doc:$MANPATH
PGPATH=$BIZHOME

export BIZHOME
export PATH
export LD_LIBRARY_PATH
export MANPATH
export PGPATH

 The screen dump of running make under the bizgresClickstream directory 

[bgadmin@domU-12-31-35-00-1C-C1 bizgresClickStream]$ make
Tomcat package is complete, good.
bizgresClickStream package is complete, good.
Found sed, good.
Found gtar, good.
Found JDK version 1.5, good.
Port 8080 appears to be free, good.
Tomcat appears not to be running, good.
Port 5432 appears to be free, good.
Port 10000 appears to be free, good.
Bizgres is installed at /usr/local/bizgres, good.
Testing Bizgres installation...
Bizgres test passes, good
Installing Tomcat...
Using CATALINA_BASE:   /usr/local/bizgres/demo/solutions/bizgresClickStream/tomcat
Using CATALINA_HOME:   /usr/local/bizgres/demo/solutions/bizgresClickStream/tomcat
Using CATALINA_TMPDIR: /usr/local/bizgres/demo/solutions/bizgresClickStream/tomcat/temp
Using JRE_HOME:       /usr/local/jdk1.5.0_12/
Waiting 5 secs for WAR deployment
Uncompressing database dump files...
done.
CREATE DATABASE
Starting...
nohup: appending output to `nohup.out'
Started.
KETL startup succeeded.
Bizgres clickstream installation complete.  Please read the documentation for next steps.

 Starting KETL and running the job to generate the reports 

[bgadmin@domU-12-31-35-00-1C-C1 bizgresClickStream]$ cd bin
[bgadmin@domU-12-31-35-00-1C-C1 bin]$ source clicksenv.sh

[bgadmin@domU-12-31-35-00-1C-C1 bin]$ ketl_ctl
KETL Console - Version 0.9 beta release

->connect localhost
Connected to domU-12-31-35-00-1C-C1.z-2.compute-1.internal
->job RUN_REPORTS execute 1 multi ignoredependencies
RUN_REPORTS
Job submitted to server for direct execution.

->status jobs
Executing
---------

Failed
------

Just Failed
-----------

Ready To Run
------------


->quit
[bgadmin@domU-12-31-35-00-1C-C1 bin]$ ls $WEBAPP/jasper
CumulativeEntryPages.jasper          DailyGeographicActivity.jrxml         ReferrersByWeek.jrxml
CumulativeEntryPages.jrxml           DailySiteActivity.jasper              SearchEngineReferrersByWeek.jasper
CumulativeExitPages.jasper           DailySiteActivity.jrxml               SearchEngineReferrersByWeek.jrxml
CumulativeExitPages.jrxml            DailyTopReferrers.jasper              WeeklyEntryPages.jasper
CumulativeGeographicActivity.jasper  DailyTopReferrers.jrxml               WeeklyEntryPages.jrxml
CumulativeGeographicActivity.jrxml   DailyTopSearchEngineReferrers.jasper  WeeklyExitPages.jasper
CumulativeSiteActivity.jasper        DailyTopSearchEngineReferrers.jrxml   WeeklyExitPages.jrxml
CumulativeSiteActivity.jrxml         DailyTrafficActivity.jasper           WeeklyGeographicActivity.jasper
CumulativeTopReferrers.jasper        DailyTrafficActivity.jrxml            WeeklyGeographicActivity.jrxml
CumulativeTopReferrers.jrxml         Daily_10_2004_11_18.html              WeeklySiteActivity.jasper
CumulativeTrafficActivity.jasper     Daily_13_2004_11_18.html              WeeklySiteActivity.jrxml
CumulativeTrafficActivity.jrxml      Daily_16_2004_11_18.html              WeeklyTopReferrers.jasper
Cumulative_12.pdf                    Daily_19_2004_11_18.html              WeeklyTopReferrers.jrxml
Cumulative_15.pdf                    Daily_1_2004_11_18.html               WeeklyTopSearchEngineReferrers.jasper
Cumulative_18.pdf                    Daily_4_2004_11_18.html               WeeklyTopSearchEngineReferrers.jrxml
Cumulative_21.pdf                    Daily_7_2004_11_18.html               WeeklyTrafficActivity.jasper
Cumulative_3.pdf                     EntryPagesForWeek.jasper              WeeklyTrafficActivity.jrxml
Cumulative_6.pdf                     EntryPagesForWeek.jrxml               Weekly_11_2004_11_15.html
Cumulative_9.pdf                     ExitPagesForWeek.jasper               Weekly_14_2004_11_15.html
DailyEntryPages.jasper               ExitPagesForWeek.jrxml                Weekly_17_2004_11_15.html
DailyEntryPages.jrxml                GeographicByWeek.jasper               Weekly_20_2004_11_15.html
DailyExitPages.jasper                GeographicByWeek.jrxml                Weekly_2_2004_11_15.html
DailyExitPages.jrxml                 README.txt                            Weekly_5_2004_11_15.html
DailyGeographicActivity.jasper       ReferrersByWeek.jasper                Weekly_8_2004_11_15.html

Make sure port 8080 is open
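A quick way to confirm the port is reachable from your own machine is sketched below. It assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command; port_open and EC2_HOST are names invented for this example, and EC2_HOST should be replaced with the instance's public DNS name.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash and coreutils' timeout.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# EC2_HOST is a placeholder; substitute the instance's public DNS name.
EC2_HOST=${EC2_HOST:-localhost}
if port_open "$EC2_HOST" 8080; then
    echo "port 8080 on $EC2_HOST is reachable"
else
    echo "port 8080 on $EC2_HOST is not reachable; check the security group"
fi
```

If Tomcat is running on the instance but the check fails from outside, the security group is the likely culprit.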

 After I was finished, I shut down KETL and checked the status

[bgadmin@domU-12-31-35-00-1C-C1 bin]$ ketl_ctl
KETL Console - Version 0.9 beta release

->connect localhost
Connected to domU-12-31-35-00-1C-C1.z-2.compute-1.internal
->shutdown
...

->status
KETL Cluster Status
Registered Servers: 2
Alive Servers     : 0
Pending Jobs

Server    : domU-12-31-35-00-1C-C1.z-2.compute-1.internal
Status    : Shutdown
Start Time: 2007-07-20 08:01:06.234
Last Ping : 2007-07-20 08:16:59.446735
Executors (Stats)
    SQL: (Total: 2)
    KETL: (Total: 2)
    XMLSESSIONIZER: (Total: 1)
    OSJOB: (Total: 2)

Friday, July 13, 2007

Note about the site

If you have visited the site in the past, you will have noticed that I have changed the template and added a bunch of RSS and email subscription information on the sidebar.

I hope this makes it easier to consume the content as it appears if you use an RSS feed reader.

As to the jobs link, I am always interested in new and interesting widgets, but I don't expect anything much to happen with it.

The main reason for changing templates was so the code and screen dumps I put into posts are not truncated for being too wide.

Have Fun

Paul

Streambase on EC2 - demo

After installing Streambase, I went through the getting started guide. It was good, though a version for the Developer Edition might have been useful.

Again there were some hardcoded path issues with the /etc/init.d scripts, which look for the binaries in /usr/bin rather than a passed location. I will add this to the wiki HOW TO document as well.

I had to add some symbolic links to get Streambase to attempt to start, and then it failed because the Developer Edition is not allowed to be run as a separate server.

Not fazed, I moved on using the Creating a Clustered Application documentation and got the cluster management node and processing node configured and running, just to make sure I understood that process.

At that point, I installed the Streambase Studio onto my Windows XP box and had a look at the various demos available.

One main difference is the way data is treated by Streambase and StreamSQL. The data is manipulated as it arrives rather than being stored and then analyzed. So various intermediate steps where data is being transformed or sliced and diced can be discarded.
Of course various stages can also be stored and used as a feed into another process.
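The process-as-it-arrives idea can be illustrated without Streambase at all. The sketch below is plain shell and awk, not StreamSQL: it keeps only a running count and sum while the rows stream past and never stores the rows themselves. The function name running_avg and the one-number-per-line input are assumptions made for the example.

```shell
# Emit a running average after every input line, holding only two
# numbers of state (count and sum) rather than the rows themselves.
running_avg() {
    awk '{ n += 1; sum += $1; printf "%d %.2f\n", n, sum / n }'
}

# Feed three readings through the "stream".
printf '%s\n' 10 20 60 | running_avg
```

Each input line produces one output line with the count so far and the mean so far, so a downstream process could consume the result while the input is still arriving.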

One of the nice things is that Streambase Studio is built on Eclipse, making the flow of the data visually apparent.

I might drop the Streambase team an email and ask if I can get access to the Enterprise Edition to at least test running the server on EC2 properly as a standalone server.

So next steps are to go back to Bizgres and Greenplum and run through the clickstream demo and review the clustering/master slave node options with that product.

stay tuned...

Paul

Streambase on EC2 - Install

As I mentioned recently I have a couple of tasks on the schedule to get various datamining software on EC2 as AMI (Amazon Machine Images). This enables me to have a platform from which to launch into the various demos and tutorials available and also put the software through its paces.

The next step was to get the Streambase server installed onto a CentOS 4.4 base install running on EC2. CentOS 4.4 is close enough to Red Hat Enterprise Linux that it fulfills the requirements for installing Streambase.

I used the Streambase Installation Guide for Linux as a guide.

Given my fun the other day with installing Java, that piece of the install was very quick. In investment terms the time spent going through that pain had a big ROI (Return on Investment) today. I just need to keep reminding myself of that fact when I get stuck in the future.

The other thing to note is that the install guide keeps mentioning a tarball, which is produced by running the .bin file you download. Running that file with sh StreamBase-3.7.3Trial.bin extracts the tarball for the development kit as well, so everything is dumped at once.

Pre-requisites for installing Streambase onto CentOS or any other base install of Linux:

Being registered, so you can download the Streambase binary install file.
compat-gcc-32 or compat-gcc-33 if available
compat-libstdc++-33
Java JDK 1.5

You can install those compat dependencies using yum:

yum install compat-gcc-32
yum install compat-libstdc++-33.i386
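To check whether those packages are already present before (or after) running yum, something like the following sketch could be used. It relies on rpm -q exiting non-zero for a missing package; missing_packages and the PKG_QUERY override are hypothetical names added so the check can be dry-run on a machine without rpm.

```shell
# Print the names of any listed packages that are not installed.
# PKG_QUERY is a hypothetical hook for dry runs; it defaults to "rpm -q",
# which exits non-zero when a package is missing.
missing_packages() {
    for pkg in "$@"; do
        ${PKG_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

missing=$(missing_packages compat-gcc-32 compat-libstdc++-33)
if [ -n "$missing" ]; then
    echo "still to install:" $missing
else
    echo "compat packages present"
fi
```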

I will create a wiki HOWTO on this rather than dump the output to the screen here and update this post with the link.

I will have some more time in a couple of hours to run through the demo/tutorial with the Streambase Studio on my local PC and the server running off EC2.
The next task is to test the master and slave node install and make some images of the master and also the slave node.

Have Fun

Paul

Thursday, July 12, 2007

Running Bizgres demo

So I got Bizgres installed on CentOS 4 running on EC2 (Xen Virtualization).
Here is the HOWTO recipe for doing the build.
Next I went through the Bizgres User Guide and ran the Instance Verification Program (IVP).

The first attempt yesterday was dogged by some environment variable problems and a missing file.

The IVP Makefile expects to find a file in $BIZHOME called bizgres_path.sh. I checked and it does not exist, nor was it in the bizgres-0_9_GA.tar.bz2 tarball.

So I had to create that file. Save a file called bizgres_path.sh where you installed the software.



BIZHOME=/usr/local/bizgres
PATH=$BIZHOME/pgsql/bin:$BIZHOME/client/loader/bin:$PATH
LD_LIBRARY_PATH=$BIZHOME/lib
MANPATH=$BIZHOME/doc:$MANPATH
PGPATH=$BIZHOME

export BIZHOME
export PATH
export LD_LIBRARY_PATH
export MANPATH
export PGPATH

The next issue was the path.sh file which is part of the IVP.tar file.
If you installed the software somewhere other than /usr/local/bizgres or, like me, are using a symbolic link /usr/local/bizgres which points at /usr/local/bizgres-0.9_GA, it will fail.

I needed to add the -follow option to the find command to make it follow any symbolic links.

Here is my path.sh



#
# If BIZHOME is already set, use its value to source bizgres_path.sh;
# otherwise, start looking at /usr/local
#
if [ x"$BIZHOME" == x ]; then
    BIZHOME='/usr/local/bizgres'
fi

PATHSH=`find "$BIZHOME" -follow -name bizgres_path.sh | sort | tail -1`

if [ x"$PATHSH" != x ]; then
    source "$PATHSH"
else
    echo "Can not find the path variables file, exiting."
    #exit 1
fi
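Why -follow matters can be seen in a scratch directory. The sketch below recreates the symlinked layout under a temporary directory (the directory names mirror the ones above, but nothing under /usr/local is touched) and runs find both ways. Note that current GNU find documents -follow as deprecated in favour of find -L, although it is still accepted.

```shell
# Recreate the layout in a scratch directory: a versioned install
# directory plus a symbolic link that points at it.
tmp=$(mktemp -d)
mkdir "$tmp/bizgres-0.9_GA"
touch "$tmp/bizgres-0.9_GA/bizgres_path.sh"
ln -s "$tmp/bizgres-0.9_GA" "$tmp/bizgres"

# Without -follow, find looks at the link itself and does not descend.
plain=$(find "$tmp/bizgres" -name bizgres_path.sh)
# With -follow, find descends through the link and locates the file.
followed=$(find "$tmp/bizgres" -follow -name bizgres_path.sh)

echo "without -follow: ${plain:-<nothing found>}"
echo "with -follow:    $followed"
rm -rf "$tmp"
```

The first find returns nothing because the starting point is itself a symbolic link, which is exactly the situation the original path.sh ran into.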

After those two problems, I had some issues getting the PostgreSQL instance running, which I think was related to running initdb early.

After that the demo ran perfectly.

Next task is to get StreamSQL running, as these products have a time-limited evaluation license.

I will run through the master and slave node process probably early next week.

Have Fun

Paul

Here is my screen dump of the demo run


[bgadmin@domU-12-31-36-00-3D-83 IVP]$ source path.sh
[bgadmin@domU-12-31-36-00-3D-83 IVP]$ make
./create_db.sh


This is an Installation Validation Program for Bizgres.
It will install a sample Bizgres Database on one machine.

After this script runs successfully,
you should have a running Bizgres Database named
dgtestdb, that you can connect to using the command:
psql dgtestdb

You can also load sample data into dgtestdb using the command:
./load-data.sh

If this script fails to execute, look in the file "IVP.log" for the reason why.


Creating the dgtestdb data directory...
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.
done.
Starting the instance located in dgtestdb ... done.
Creating the database "dgtestdb" ... done.
Creating table "bigtable1" in "dgtestdb"... done.
---------------------------------------------------------------
---------------------------------------------------------------
---------------------------------------------------------------

Congratulations: A Bizgres database has been installed successfully!

You can now connect to the database using the command:
psql -p 5432 dgtestdb

You can now load sample data into dgtestdb using the command:
./load-data.sh

./load-data.sh
make[1]: Entering directory `/home/bgadmin/demo/IVP/data-generator'
gcc -O3 -c main.c
gcc -O3 -c utils.c
gcc -o generator -O3 main.o utils.o
./generator seedfile.txt 1000000 > dbdata.txt
make[1]: Leaving directory `/home/bgadmin/demo/IVP/data-generator'

Loader Ver 2.0.13
======================
Time: 07-12-2007 20:01:49
Host: LOCALHOST
Port: 5432
Database: dgtestdb
Username: bgadmin
Password: 
Control File: file:/home/bgadmin/demo/IVP/copy.ctl
Destination: database
Batching: OFF
======================

Contacting entry DB... jdbc:postgresql://LOCALHOST:5432/dgtestdb?user=bgadmin

validating input streams...
Control file found | Data file found : '/home/bgadmin/demo/IVP/data-generator/dbdata.txt'

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Found control command...
LOAD ivp.bigtable1( a ,  b ,  c ,  d ,  e ,  f ,  g ,  h ,  i ,  j ,  k ,  l ,  m ,  n , o)
FROM '/home/bgadmin/demo/IVP/data-generator/dbdata.txt' WITH DELIMITER AS '|' NULL AS '' ESCAPE AS '\'

Checking if control command is valid...
Control command verified
DBWriter trying to connect to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter successfully connected to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter testing an empty COPY command
Test passed. COPY accepted by the backend
DBWriter trying to connect to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter successfully connected to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter testing an empty COPY command
Test passed. COPY accepted by the backend
DBWriter trying to connect to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter successfully connected to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter testing an empty COPY command
Test passed. COPY accepted by the backend
DBWriter trying to connect to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter successfully connected to database host=LOCALHOST,port=5432,dbname=dgtestdb
DBWriter testing an empty COPY command
Test passed. COPY accepted by the backend
Starting to load data.

..................
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Database dgtestdb - on LOCALHOST:5432
- Sent 250000 rows to this segment
- Sent 35667923 bytes to this segment

Database dgtestdb - on LOCALHOST:5432
- Sent 250000 rows to this segment
- Sent 35643409 bytes to this segment

Database dgtestdb - on LOCALHOST:5432
- Sent 250000 rows to this segment
- Sent 35662826 bytes to this segment

Database dgtestdb - on LOCALHOST:5432
- Sent 250000 rows to this segment
- Sent 35647055 bytes to this segment

TOTALS:

- Time          : 56.522 secs
- Rows Read     : 1000000 (non blank)
- Rows Loaded   : 1000000
- Bytes Loaded  : 142621213
- Load rate     : 2.4063938 Mbytes/sec
- Loaded Batches: 18
- Failed Batches: 0
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


No more control commands.
Application terminated with return code 0

The result from this next query should be 1000000

time psql -p 5432 dgtestdb -c "select count(*) from IVP.bigtable1"
count
---------
1000000
(1 row)


real    0m2.045s
user    0m0.000s
sys     0m0.000s

Sunday, July 8, 2007

HOWTO Install bizgres on CentOS

As I mentioned before, I have created a HOWTO recipe for installing Bizgres onto CentOS.

This was a build rather than an install using the binary installer provided on Greenplum's website, mainly because I am on CentOS rather than Red Hat and the whole thing runs on top of a Xen virtual OS on EC2.

The next step is to follow through the documentation to build the master and data nodes.

Also on the agenda for this week is getting a version of the StreamSQL database running.

More later...

Thursday, July 5, 2007

Bizgres (Greenplum) on EC2

First cab off the rank to build was a Greenplum or Bizgres EC2 image.

I was following the install from source documentation from Greenplum's site, using a base CentOS 4 public EC2 image.

Java 1.5 SDK/JDK wasn't installed, so that was the first thing. I unfortunately found a bad lead in the form of this wiki on JavaOnCentOS.

In the end I ditched the rpmbuild method, ran the install file jdk-1_5_0_12-linux-i586.bin and created a symbolic link to /usr/local/bin.

The next thing missing from the doco was that you need to set the JAVA_HOME environment variable, and that PostgreSQL needs the devel packages of both readline and zlib to build OK.
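A small pre-flight check along these lines would catch the JAVA_HOME problem before the build starts. It is only a sketch: check_java_home is a made-up helper, the JDK path is an example default, and the package names in the comment (readline-devel, zlib-devel) are the usual CentOS names rather than something quoted from the Greenplum documentation.

```shell
# Pre-flight check before building: JAVA_HOME must be set and must
# contain a java binary. Returns non-zero with a message otherwise.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no java binary under $JAVA_HOME/bin" >&2
        return 1
    fi
}

# Example default only; point this at wherever the JDK was unpacked.
JAVA_HOME=${JAVA_HOME:-/usr/local/jdk1.5.0_12}
export JAVA_HOME
check_java_home || echo "fix JAVA_HOME before running the build"

# The PostgreSQL build dependencies would be installed separately, e.g.
#   yum install readline-devel zlib-devel   # assumed CentOS package names
```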

So 2 hours later I have bizgres installed on EC2. The actual ant build took 10 minutes.

I will write up a recipe/HOW-TO based on my screen dumps from putty over the weekend.

The next thing is to actually run the demo scripts.

Have Fun

Paul

Tuesday, July 3, 2007

Streams real time data manipulation links

As I mentioned I would put up a more detailed article on the background to streams.

It is a reasonably hot topic, as the broker boys and quants are driving research due to the potential to make megabucks by tying together stream technology with algorithmic trading.
So, like old mate Wayne Gretzky ("Skate to where the puck is going to be"), I am thinking about where the puck (or buck, if you like) is going to be; current research is where the puck is now.

Note, this is not the same as Oracle streams which is a different technology. Oracle streams mines Oracle redo transaction logs to enable real-time application to other databases either as a form of replication (for redundancy) or reporting or both.

Here are the URLs I used to start to get a feel for streams.

http://en.wikipedia.org/wiki/StreamSQL

Stanford University Links:

http://infolab.stanford.edu/stream/

Brown University Links:

http://www.cs.brown.edu/research/aurora/
http://www.cs.brown.edu/research/aurora/publications.html
http://www.cs.brown.edu/research/aurora/aurora_1_2.tar.gz
http://www.cs.brown.edu/research/borealis/public/

Offshoots from this research:

http://www.Streambase.com
http://www.streamsql.org

If you have others, feel free to post them in the comments and I will add them to the article.

What is VM datamine

I have been playing with Amazon EC2 for a couple of months now.
I have a blog dedicated to creating various databases (commercial and otherwise). At the moment the aim is to get Oracle RAC running on EC2.

So I have been creating various Amazon Machine Images (AMIs). Basically, EC2 runs off the back of Xen virtualization software.

Rather than write a generalised blog, I am trying to be more specific with content. So this blog will cover various EC2 builds testing out the many and varied datamining/Business Intelligence (BI) software out there.

I am currently reviewing the basics of Bizgres (an extension of PostgreSQL) and its commercial arm, Greenplum.

In my websurfing over the last 2 or so days since the seed of this idea formed, I have also looked at StreamSQL, which is a real-time streaming extension to standard SQL. It also has a commercial arm in Streambase. More about streams shortly.

So what is the idea?

Why not use Amazon as a testing ground for providing a web service based on the above technology and more?

For example:
Instead of requiring a dedicated server for checking clickstream, why not store or upload your web access logs (even real-time clicks) to Amazon S3 or an equivalent web storage service, and then combine that with specific virtual instances (processing nodes) which can fire up and shut down as required?
The processing nodes morph from a streams collection tool to a dataminer based on Bizgres, then morph again to produce an output viewable in something like Yale.

Amazon also has a message queue service, so there is nothing to stop someone writing a director node which spawns and reaps the above service nodes as required, passing along the dataset being processed like a car on a production line.

Interested in helping out?

Either post a comment or email me at paulmoen at gmail dot com

Have Fun

Paul
