As I mentioned in the roadmap, I am going to walk through installing Kettle, also known as Pentaho Data Integration (PDI), on EC2.
For starters I am just using the small instances on EC2; we can start pushing and benchmarking later. Given the disappointment over the lack of network bandwidth on the larger instances, at least as applications currently use it, running a Kettle master/slave cluster is still going to be limited by the amount of traffic needed to maintain and manage the cluster.
On with the show. I already had a full-blown Pentaho demo Amazon Machine Image (AMI) from a previous post on the Pentaho BI Suite, but this time I wanted to install just the Kettle portion. Here I am concentrating on the ETL piece of Pentaho.
Install:
- Install Java (JDK 1.5 or better)
- Install MySQL 5.0 or better.
- Download Kettle
- mkdir /usr/local/kettle
- unzip Kettle-3.0.1.zip -d /usr/local/kettle
- chmod +x /usr/local/kettle/*.sh
- export PATH=$PATH:/usr/local/kettle/
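The steps above can be collected into one small script. A sketch, using a throwaway prefix so it can be dry-run anywhere; `KETTLE_HOME` is a convenience variable I introduce here, and on the real box you would point it at /usr/local/kettle and uncomment the unzip:

```shell
KETTLE_HOME="${KETTLE_HOME:-/tmp/kettle-sketch}"   # /usr/local/kettle on the real box
mkdir -p "$KETTLE_HOME"
# unzip Kettle-3.0.1.zip -d "$KETTLE_HOME"         # needs the downloaded zip
touch "$KETTLE_HOME/spoon.sh"                      # stand-in for the unzipped launch scripts
chmod +x "$KETTLE_HOME"/*.sh
export PATH="$PATH:$KETTLE_HOME"
command -v spoon.sh
```

Remember the `export PATH` line only lasts for the current shell; add it to your login profile to make it stick.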
As a simple test to make sure everything runs OK and there are no Java classpath issues, just run runSamples.sh:
cd /usr/local/kettle
./runSamples.sh
EXECUTING TRANSFORMATION [samples/transformations/Add sequence - specify a common counter.ktr]
INFO 27-01 20:34:13,704 (LogWriter.java:println:403) -Pan - Logging is at level : Minimal logging
INFO 27-01 20:34:13,707 (LogWriter.java:println:403) -Pan - Start of run.
2008/01/27 20:34:16:700 EST [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store.
INFO 27-01 20:34:17,111 (LogWriter.java:println:403) -Trans - Dispatching started for filename [samples/transformations/Add sequence - specify a common counter.ktr]
INFO 27-01 20:34:17,477 (LogWriter.java:println:403) -Trans - Transformation ended.
INFO 27-01 20:34:17,483 (LogWriter.java:println:403) -Pan - Finished!
INFO 27-01 20:34:17,484 (LogWriter.java:println:403) -Pan - Start=2008/01/27 20:34:16.954, Stop=2008/01/27 20:34:17.483
INFO 27-01 20:34:17,484 (LogWriter.java:println:403) -Pan - Processing ended after 0 seconds.
EXECUTING TRANSFORMATION [samples/transformations/Aggregate - basics.ktr]
INFO 27-01 20:34:18,221 (LogWriter.java:println:403) -Pan - Logging is at level : Minimal logging
INFO 27-01 20:34:18,223 (LogWriter.java:println:403) -Pan - Start of run.
2008/01/27 20:34:21:225 EST [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store.
INFO 27-01 20:34:21,896 (LogWriter.java:println:403) -Trans - Dispatching started for filename [samples/transformations/Aggregate - basics.ktr]
INFO 27-01 20:34:23,525 (LogWriter.java:println:403) -Trans - Transformation ended.
INFO 27-01 20:34:23,527 (LogWriter.java:println:403) -Pan - Finished!
INFO 27-01 20:34:23,528 (LogWriter.java:println:403) -Pan - Start=2008/01/27 20:34:21.410, Stop=2008/01/27 20:34:23.528
INFO 27-01 20:34:23,528 (LogWriter.java:println:403) -Pan - Processing ended after 2 seconds.
EXECUTING TRANSFORMATION [samples/transformations/Calculator - Substract constant value one from a number.ktr]
INFO 27-01 20:34:24,273 (LogWriter.java:println:403) -Pan - Logging is at level : Minimal logging
INFO 27-01 20:34:24,276 (LogWriter.java:println:403) -Pan - Start of run.
2008/01/27 20:34:27:296 EST [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store.
INFO 27-01 20:34:27,720 (LogWriter.java:println:403) -Trans - Dispatching started for filename [samples/transformations/Calculator - Substract constant value one from a number.ktr]
INFO 27-01 20:34:27,875 (LogWriter.java:println:403) -Trans - Transformation ended.
INFO 27-01 20:34:27,878 (LogWriter.java:println:403) -Pan - Finished!
INFO 27-01 20:34:27,879 (LogWriter.java:println:403) -Pan - Start=2008/01/27 20:34:27.522, Stop=2008/01/27 20:34:27.878
...
Set up the repository database (using a MySQL 5.1 database on EC2):
export PASSWD=yourpasswordhere
mysql -u root -p$PASSWD
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.1.20-beta-log MySQL Community Server (GPL)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> create database kettle_repos;
Query OK, 1 row affected (0.00 sec)
mysql> grant all on kettle_repos.* to 'paulm'@'myhost' identified by 'xxxx';
Query OK, 0 rows affected (0.00 sec)
Test connectivity first using the mysql client; you may need to allow your machine through any firewall and/or grant permission in your EC2 security group.
Add just your own IP address, and port 3306 only.
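With the old EC2 command-line API tools that would look roughly like the commented line below; the `default` group name and the IP are placeholders, so double-check the syntax against your version of the tools:

```shell
MYIP=203.0.113.5                 # placeholder: your workstation's public IP
CIDR="${MYIP}/32"
# ec2-authorize default -P tcp -p 3306 -s "$CIDR"
echo "$CIDR"
```

Using a /32 keeps the hole in the security group as small as possible: one host, one port.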
mysql -u paulm -p$PASSWD --host=ec2host -D kettle_repos --protocol=tcp
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5 to server version: 5.1.20-beta-log
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show tables;
--------------
show tables
--------------
Empty set (0.24 sec)
mysql> exit
Now create a new database connection in Spoon, the Kettle GUI.
Results:
Connection to database [MySQL51_repos] is OK.
Hostname : ec2-67-202-33-238.compute-1.amazonaws.com
Port : 3306
Database name : kettle_repos
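Under the hood Spoon drives this through the MySQL JDBC driver; a sketch of the JDBC URL those settings boil down to, with the host and database taken from the connection test above:

```shell
HOST=ec2-67-202-33-238.compute-1.amazonaws.com
URL="jdbc:mysql://${HOST}:3306/kettle_repos"
echo "$URL"
```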
Now create a new repository; choose the create/upgrade option in the repository connection dialog.
Start a new transformation.
I understand this is a simple step; if this was all I needed, I would just use LOAD DATA INFILE to load the data. The idea is to build on simple tasks, learning new things, and then combine the simple steps into complex analysis and truly push the Kettle ETL engine.
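For comparison, the one-off load without Kettle would look something like this; the table and file names match the ones used in this post, the connection details are placeholders, and the IGNORE 1 LINES assumes the header row described below has been added:

```shell
SQL="LOAD DATA LOCAL INFILE 'kddcup.data.1000.csv'
INTO TABLE KDD99.KDD99
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES"
# mysql -u paulm -p"$PASSWD" --host=ec2host -e "$SQL"
echo "$SQL"
```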
I am using the KDD Cup 1999 dataset, just 1,000 rows of CSV to start:
zcat kddcup.data_10_percent.gz|head -1000 > kddcup.data.1000.csv
The headers are in a separate kddcup.names file, so I mucked around with awk and tr to flatten the names into a single CSV header row, which I then added to the top of the CSV file to make the field names in the file input step easier to edit.
cat kddcup.names |awk -F ":" ' { print $1"," }' > kddcup.name.header
tr -d '\n' < kddcup.name.header > kddcup.name.header.clean
cat kddcup.name.header.clean
type,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
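One wart in the pipeline above: `awk '{print $1","}'` leaves a trailing comma on the joined line. A variant that avoids it, shown against a made-up three-line stand-in for the real kddcup.names:

```shell
# sample.names is an illustrative stand-in, not the real KDD file
cat > sample.names <<'EOF'
duration: continuous.
protocol_type: symbolic.
service: symbolic.
EOF
# print a comma before every field except the first
awk -F: '{printf "%s%s", sep, $1; sep=","} END{print ""}' sample.names
```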
Create a database to hold the data and grant access.
mysql -u root -p$PASSWD
mysql> create database KDD99;
Query OK, 1 row affected (0.03 sec)
mysql> grant all on KDD99.* to 'paulm'@'myhost' identified by 'xxxx';
Query OK, 0 rows affected (0.00 sec)
Sample CREATE TABLE, generated from the table output step:
CREATE TABLE KDD99.KDD99
(
type INT
, duration VARCHAR(3)
, protocol_type VARCHAR(8)
, service VARCHAR(8)
, flag INT
, src_bytes INT
, dst_bytes INT
, land INT
, wrong_fragment INT
, urgent INT
, hot INT
, num_failed_logins INT
, logged_in INT
, num_compromised INT
, root_shell INT
, su_attempted INT
, num_root INT
, num_file_creations INT
, num_shells INT
, num_access_files INT
, num_outbound_cmds INT
, is_host_login INT
, is_guest_login INT
, count INT
, srv_count INT
, serror_rate INT
, srv_serror_rate INT
, rerror_rate INT
, srv_rerror_rate INT
, same_srv_rate INT
, diff_srv_rate FLOAT
, srv_diff_host_rate INT
, dst_host_count INT
, dst_host_srv_count INT
, dst_host_same_srv_rate INT
, dst_host_diff_srv_rate FLOAT
, dst_host_same_src_port_rate FLOAT
, dst_host_srv_diff_host_rate INT
, dst_host_serror_rate INT
, dst_host_srv_serror_rate INT
, dst_host_rerror_rate INT
, dst_host_srv_rerror_rate VARCHAR(25)
)
;
Run the transformation.
The most fun part of any data load is discovering that your data types and lengths are not always big enough.
2008/01/28 14:14:00 - Spoon - Transformation opened.
2008/01/28 14:14:00 - Spoon - Launching transformation [Transformation 1]...
2008/01/28 14:14:00 - Spoon - Started the transformation execution.
2008/01/28 14:14:00 - Transformation 1 - Dispatching started for transformation [Transformation 1]
2008/01/28 14:14:00 - Transformation 1 - Nr of arguments detected:0
2008/01/28 14:14:00 - Transformation 1 - This is not a replay transformation
2008/01/28 14:14:00 - Transformation 1 - This transformation can be replayed with replay date: 2008/01/28 14:14:00
2008/01/28 14:14:00 - Transformation 1 - Initialising 2 steps...
2008/01/28 14:14:02 - Table output.0 - Connected to database [MySQL51_data] (commit=1000)
2008/01/28 14:14:03 - CSV file input.0 - Starting to run...
2008/01/28 14:14:03 - Table output.0 - Starting to run...
2008/01/28 14:14:03 - CSV file input.0 - Finished processing (I=1001, O=0, R=0, W=1000, U=0, E=0)
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Because of an error, this step can't continue:
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Error batch inserting rows into table [KDD99].
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Errors encountered (first 10):
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Data truncation: Data too long for column 'duration' at row 1
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) :
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) :
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Error updating batch
2008/01/28 14:18:07 - Table output.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Data truncation: Data too long for column 'duration' at row 1
2008/01/28 14:18:07 - Spoon - The transformation has finished!!
2008/01/28 14:18:08 - Table output.0 - Finished processing (I=0, O=999, R=1000, W=0, U=0, E=1)
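One way out of the truncation error above is to widen or retype the offending column and re-run the transformation. A sketch, again with placeholder connection details; duration in the KDD99 data is a count of seconds, so INT seems a reasonable choice, though you could also just enlarge the VARCHAR:

```shell
SQL="ALTER TABLE KDD99.KDD99 MODIFY duration INT"
# mysql -u paulm -p"$PASSWD" --host=ec2host -e "$SQL"
echo "$SQL"
```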