Tuesday, August 7, 2007

Weka Data mining on EC2 - testing

Once the VM was set up and running, it was time to see how WEKA performed in a virtual environment.

The performance on the EC2 node was good. These are not large datasets, but I have a couple of large ones to play with in the near future, given that you can join the Netflix Prize competition and download a dataset with 100 million data points (more than 2 GB).

If you are surprised at the length of this post: I have found that when I am the one using a search engine, I want to see as much information as possible. There might be someone out there who, in the future, wants a quick solution to running WEKA without having to read the documentation.

Some of this work was guided by the README. Once I got the hang of it, I got some datasets on Leukemia-ALLAML and ran WEKA on those.

Have Fun

Paul



List options for Weka Classifying

java weka.classifiers.trees.J48

Weka exception: No training file and no object input file given.

General options:

-t
Sets training file.
-T
Sets test file. If missing, a cross-validation will be performed on the training data.
-c
Sets index of class attribute (default: last).
-x
Sets number of folds for cross-validation (default: 10).
-s
Sets random number seed for cross-validation (default: 1).
-m
Sets file with cost matrix.
-l
Sets model input file.
-d
Sets model output file.
-v
Outputs no statistics for training data.
-o
Outputs statistics only, not the classifier.
-i
Outputs detailed information-retrieval statistics for each class.
-k
Outputs information-theoretic statistics.
-p
Only outputs predictions for test instances, along with attributes (0 for none).
-r
Only outputs cumulative margin distribution.
-z
Only outputs the source representation of the classifier, giving it the supplied name.
-g
Only outputs the graph representation of the classifier.
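
The -x and -s options control the stratified cross-validation that WEKA runs when no test file is given. As a rough illustration of what "stratified" means, the sketch below deals instances into folds class by class so that every fold keeps a similar class mix. It is not WEKA's own procedure: the function name stratified_folds and the use of Python's random module are assumptions made for this example, and the class counts (20 bad, 37 good) are those of the labor dataset used later in this post.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds=10, seed=1):
    """Return a list of folds, each a list of instance indices.

    Illustrative only: indices are shuffled within each class and then
    dealt round-robin, so every fold gets a similar share of each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(num_folds)]
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:
            folds[next_fold % num_folds].append(index)
            next_fold += 1
    return folds


# Class counts of the labor dataset: 20 "bad" and 37 "good" instances.
labels = ["bad"] * 20 + ["good"] * 37
folds = stratified_folds(labels, num_folds=10, seed=1)
for number, fold in enumerate(folds, start=1):
    bad = sum(1 for i in fold if labels[i] == "bad")
    print(f"fold {number}: {len(fold)} instances, {bad} bad, {len(fold) - bad} good")
```

Each fold is then held out once for testing while the other nine are used for training.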

Options specific to weka.classifiers.trees.J48:

-U
Use unpruned tree.
-C
Set confidence threshold for pruning.
(default 0.25)
-M
Set minimum number of instances per leaf.
(default 2)
-R
Use reduced error pruning.
-N
Set number of folds for reduced error
pruning. One fold is used as pruning set.
(default 3)
-B
Use binary splits only.
-S
Don't perform subtree raising.
-L
Do not clean up after the tree has been built.
-A
Laplace smoothing for predicted probabilities.
-Q
Seed for random data shuffling (default 1).
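
If you want to script several runs with different settings, one option is to assemble the command line from the flags listed above. The helper below is a hypothetical sketch (the name j48_command and its parameters are mine, not part of WEKA), and it only builds and prints the command rather than running it. The path assumes WEKA lives under /usr/local/weka, as the experiment output further down suggests. I have also assumed that -C should be left out when -U is given, since a pruning confidence makes little sense for an unpruned tree.

```python
import shlex


def j48_command(training_file, folds=10, confidence=0.25, min_per_leaf=2, unpruned=False):
    """Build the argument list for a J48 run from flags in the option listing."""
    command = [
        "java", "weka.classifiers.trees.J48",
        "-t", training_file,      # training file
        "-x", str(folds),         # number of cross-validation folds
        "-M", str(min_per_leaf),  # minimum number of instances per leaf
    ]
    if unpruned:
        command.append("-U")      # unpruned tree, so no confidence threshold
    else:
        command += ["-C", str(confidence)]
    return command


command = j48_command("/usr/local/weka/data/labor.arff", folds=5)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run on a machine where java and the WEKA classes are on the classpath.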

Running the NaiveBayes Classifier on the labor dataset

java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/labor.arff

Naive Bayes Classifier

Class bad: Prior probability = 0.36

duration: Normal Distribution. Mean = 2 StandardDev = 0.7071 WeightSum = 20 Precision = 1.0
wage-increase-first-year: Normal Distribution. Mean = 2.6563 StandardDev = 0.8643 WeightSum = 20 Precision = 0.3125
wage-increase-second-year: Normal Distribution. Mean = 2.9524 StandardDev = 0.8193 WeightSum = 15 Precision = 0.35714285714285715
wage-increase-third-year: Normal Distribution. Mean = 2.0344 StandardDev = 0.1678 WeightSum = 4 Precision = 0.38749999999999996
cost-of-living-adjustment: Discrete Estimator. Counts = 10 2 6 (Total = 18)
working-hours: Normal Distribution. Mean = 39.4887 StandardDev = 1.8903 WeightSum = 19 Precision = 1.8571428571428572
pension: Discrete Estimator. Counts = 12 3 6 (Total = 21)
standby-pay: Normal Distribution. Mean = 2.5 StandardDev = 0.866 WeightSum = 4 Precision = 2.0
shift-differential: Normal Distribution. Mean = 2.4691 StandardDev = 1.5738 WeightSum = 9 Precision = 2.7777777777777777
education-allowance: Discrete Estimator. Counts = 4 10 (Total = 14)
statutory-holidays: Normal Distribution. Mean = 10.2 StandardDev = 0.805 WeightSum = 20 Precision = 1.2
vacation: Discrete Estimator. Counts = 12 8 3 (Total = 23)
longterm-disability-assistance: Discrete Estimator. Counts = 6 9 (Total = 15)
contribution-to-dental-plan: Discrete Estimator. Counts = 8 8 1 (Total = 17)
bereavement-assistance: Discrete Estimator. Counts = 10 4 (Total = 14)
contribution-to-health-plan: Discrete Estimator. Counts = 9 3 7 (Total = 19)


Class good: Prior probability = 0.64

duration: Normal Distribution. Mean = 2.25 StandardDev = 0.6821 WeightSum = 36 Precision = 1.0
wage-increase-first-year: Normal Distribution. Mean = 4.3837 StandardDev = 1.1773 WeightSum = 36 Precision = 0.3125
wage-increase-second-year: Normal Distribution. Mean = 4.447 StandardDev = 0.9805 WeightSum = 31 Precision = 0.35714285714285715
wage-increase-third-year: Normal Distribution. Mean = 4.5795 StandardDev = 0.7893 WeightSum = 11 Precision = 0.38749999999999996
cost-of-living-adjustment: Discrete Estimator. Counts = 14 8 3 (Total = 25)
working-hours: Normal Distribution. Mean = 37.5491 StandardDev = 2.9266 WeightSum = 32 Precision = 1.8571428571428572
pension: Discrete Estimator. Counts = 1 3 8 (Total = 12)
standby-pay: Normal Distribution. Mean = 11.2 StandardDev = 2.0396 WeightSum = 5 Precision = 2.0
shift-differential: Normal Distribution. Mean = 5.6818 StandardDev = 5.0584 WeightSum = 22 Precision = 2.7777777777777777
education-allowance: Discrete Estimator. Counts = 8 4 (Total = 12)
statutory-holidays: Normal Distribution. Mean = 11.4182 StandardDev = 1.2224 WeightSum = 33 Precision = 1.2
vacation: Discrete Estimator. Counts = 8 11 15 (Total = 34)
longterm-disability-assistance: Discrete Estimator. Counts = 16 1 (Total = 17)
contribution-to-dental-plan: Discrete Estimator. Counts = 3 9 14 (Total = 26)
bereavement-assistance: Discrete Estimator. Counts = 19 1 (Total = 20)
contribution-to-health-plan: Discrete Estimator. Counts = 1 8 15 (Total = 24)
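
As an aside on how to read the model above: for a numeric attribute, Naive Bayes keeps one normal distribution per class and combines its density with the class prior. The sketch below does this for a single attribute, wage-increase-first-year, using the priors, means and standard deviations printed above. It is a simplification under my own assumptions: the function names are hypothetical, only one attribute is used, and WEKA's own estimator also takes the printed Precision value into account, which this sketch ignores.

```python
import math


def normal_density(x, mean, std_dev):
    """Density of a normal distribution at x."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))


# Values copied from the model output above (wage-increase-first-year).
MODEL = {
    "bad": {"prior": 0.36, "mean": 2.6563, "std_dev": 0.8643},
    "good": {"prior": 0.64, "mean": 4.3837, "std_dev": 1.1773},
}


def class_scores(wage_increase_first_year):
    """Unnormalised score per class: prior times density of the one attribute."""
    return {
        label: p["prior"] * normal_density(wage_increase_first_year, p["mean"], p["std_dev"])
        for label, p in MODEL.items()
    }


def normalised(scores):
    """Scale the scores so they sum to one."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


for wage in (2.0, 4.5):
    probabilities = normalised(class_scores(wage))
    best = max(probabilities, key=probabilities.get)
    print(f"first-year wage increase {wage}: {best} ({probabilities[best]:.2f})")
```

A low first-year increase leans towards "bad" and a high one towards "good", which is what the two class means suggest.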


Time taken to build model: 0.01 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances 56 98.2456 %
Incorrectly Classified Instances 1 1.7544 %
Kappa statistic 0.961
Mean absolute error 0.0481
Root mean squared error 0.1532
Relative absolute error 10.5249 %
Root relative squared error 32.1057 %
Total Number of Instances 57


=== Confusion Matrix ===

a b <-- classified as
19 1 | a = bad
0 37 | b = good

=== Stratified cross-validation ===

Correctly Classified Instances 51 89.4737 %
Incorrectly Classified Instances 6 10.5263 %
Kappa statistic 0.7741
Mean absolute error 0.1042
Root mean squared error 0.2637
Relative absolute error 22.7763 %
Root relative squared error 55.2266 %
Total Number of Instances 57

=== Confusion Matrix ===

a b <-- classified as
18 2 | a = bad
4 33 | b = good

Trying a different classifier from the list on the same dataset

java weka.classifiers.lazy.IBk -t $WEKAHOME/data/labor.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
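
A quick note on what "1 nearest neighbour" means, and why the training error that follows is so low: each instance is classified with the label of the closest training instance, and on the training data every instance is its own closest neighbour. The sketch below uses two numeric attributes (duration and wage-increase-first-year) from six rows of labor.arff, which is listed further down. It is a simplification under my own assumptions: IBk itself also handles nominal attributes and missing values, and the function name nearest_label is hypothetical.

```python
import math

# (duration, wage-increase-first-year) and class, taken from rows of labor.arff.
TRAINING = [
    ((1.0, 5.0), "good"),
    ((2.0, 4.5), "good"),
    ((3.0, 6.9), "good"),
    ((2.0, 3.5), "bad"),
    ((1.0, 2.0), "bad"),
    ((2.0, 2.5), "bad"),
]


def nearest_label(point, training):
    """Label of the training example with the smallest Euclidean distance."""
    features, label = min(training, key=lambda example: math.dist(point, example[0]))
    return label


# On the training data each point finds itself at distance zero.
training_errors = sum(
    1 for features, label in TRAINING if nearest_label(features, TRAINING) != label
)
print("errors on training data:", training_errors)
print("new contract (2 years, 3.0 %):", nearest_label((2.0, 3.0), TRAINING))
```

This is why the cross-validation figures are the ones to look at for this classifier, not the error on the training data.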


Time taken to build model: 0 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correctly Classified Instances 57 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.0169
Root mean squared error 0.0169
Relative absolute error 3.7085 %
Root relative squared error 3.5513 %
Total Number of Instances 57


=== Confusion Matrix ===

a b <-- classified as
20 0 | a = bad
0 37 | b = good

=== Stratified cross-validation ===

Correctly Classified Instances 47 82.4561 %
Incorrectly Classified Instances 10 17.5439 %
Kappa statistic 0.6235
Mean absolute error 0.1876
Root mean squared error 0.4113
Relative absolute error 41.0144 %
Root relative squared error 86.1487 %
Total Number of Instances 57

=== Confusion Matrix ===

a b <-- classified as
16 4 | a = bad
6 31 | b = good
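
The accuracy and Kappa statistic in these summaries can be recomputed from the confusion matrices alone, which is a handy sanity check when comparing classifiers. The sketch below uses the usual definition of Cohen's kappa (observed agreement against the agreement expected by chance from the row and column totals). The function names are mine, and the two matrices are the cross-validation results for NaiveBayes and IBk on the labor data.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's kappa: agreement beyond what the row/column totals give by chance."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    column_totals = [sum(column) for column in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, column_totals)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Rows are the actual class (bad, good), columns the predicted class.
naive_bayes_cv = [[18, 2], [4, 33]]
ibk_cv = [[16, 4], [6, 31]]

for name, matrix in (("NaiveBayes", naive_bayes_cv), ("IBk", ibk_cv)):
    print(f"{name}: accuracy {100 * accuracy(matrix):.4f} %, kappa {kappa(matrix):.4f}")
```

Both results agree with the figures WEKA printed: 89.4737 % and 0.7741 for NaiveBayes, 82.4561 % and 0.6235 for IBk.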

What ARFF files look like

cat $WEKAHOME/data/labor.arff

% Date: Tue, 15 Nov 88 15:44:08 EST
% From: stan
% To: aha@ICS.UCI.EDU
%
% 1. Title: Final settlements in labor negotitions in Canadian industry
%
% 2. Source Information
% -- Creators: Collective Barganing Review, montly publication,
% Labour Canada, Industrial Relations Information Service,
% Ottawa, Ontario, K1A 0J2, Canada, (819) 997-3117
% The data includes all collective agreements reached
% in the business and personal services sector for locals
% with at least 500 members (teachers, nurses, university
% staff, police, etc) in Canada in 87 and first quarter of 88.
% -- Donor: Stan Matwin, Computer Science Dept, University of Ottawa,
% 34 Somerset East, K1N 9B4, (stan@uotcsi2.bitnet)
% -- Date: November 1988
%
% 3. Past Usage:
% -- testing concept learning software, in particular
% an experimental method to learn two-tiered concept descriptions.
% The data was used to learn the description of an acceptable
% and unacceptable contract.
% The unacceptable contracts were either obtained by interviewing
% experts, or by inventing near misses.
% Examples of use are described in:
% Bergadano, F., Matwin, S., Michalski, R.,
% Zhang, J., Measuring Quality of Concept Descriptions,
% Procs. of the 3rd European Working Sessions on Learning,
% Glasgow, October 1988.
% Bergadano, F., Matwin, S., Michalski, R., Zhang, J.,
% Representing and Acquiring Imprecise and Context-dependent
% Concepts in Knowledge-based Systems, Procs. of ISMIS'88,
% North Holland, 1988.
% 4. Relevant Information:
% -- data was used to test 2tier approach with learning
% from positive and negative examples
%
% 5. Number of Instances: 57
%
% 6. Number of Attributes: 16
%
% 7. Attribute Information:
% 1. dur: duration of agreement
% [1..7]
% 2 wage1.wage : wage increase in first year of contract
% [2.0 .. 7.0]
% 3 wage2.wage : wage increase in second year of contract
% [2.0 .. 7.0]
% 4 wage3.wage : wage increase in third year of contract
% [2.0 .. 7.0]
% 5 cola : cost of living allowance
% [none, tcf, tc]
% 6 hours.hrs : number of working hours during week
% [35 .. 40]
% 7 pension : employer contributions to pension plan
% [none, ret_allw, empl_contr]
% 8 stby_pay : standby pay
% [2 .. 25]
% 9 shift_diff : shift differencial : supplement for work on II and III shift
% [1 .. 25]
% 10 educ_allw.boolean : education allowance
% [true false]
% 11 holidays : number of statutory holidays
% [9 .. 15]
% 12 vacation : number of paid vacation days
% [ba, avg, gnr]
% 13 lngtrm_disabil.boolean :
% employer's help during employee longterm disabil
% ity [true , false]
% 14 dntl_ins : employers contribution towards the dental plan
% [none, half, full]
% 15 bereavement.boolean : employer's financial contribution towards the
% covering the costs of bereavement
% [true , false]
% 16 empl_hplan : employer's contribution towards the health plan
% [none, half, full]
%
% 8. Missing Attribute Values: None
%
% 9. Class Distribution:
%
% 10. Exceptions from format instructions: no commas between attribute values.
%
%
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
3,3.5,4,5.1,'tcf',37,?,?,4,?,13,'generous',?,'full','yes','full','good'
1,3,?,?,'none',36,?,?,10,'no',11,'generous',?,?,?,?,'good'
2,4.5,4,?,'none',37,'empl_contr',?,?,?,11,'average',?,'full','yes',?,'good'
1,2.8,?,?,?,35,?,?,2,?,12,'below_average',?,?,?,?,'good'
1,2.1,?,?,'tc',40,'ret_allw',2,3,'no',9,'below_average','yes','half',?,'none','bad'
1,2,?,?,'none',38,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,4,5,?,'tcf',35,?,13,5,?,15,'generous',?,?,?,?,'good'
2,4.3,4.4,?,?,38,?,?,4,?,12,'generous',?,'full',?,'full','good'
2,2.5,3,?,?,40,'none',?,?,?,11,'below_average',?,?,?,?,'bad'
3,3.5,4,4.6,'tcf',27,?,?,?,?,?,?,?,?,?,?,'good'
2,4.5,4,?,?,40,?,?,4,?,10,'generous',?,'half',?,'full','good'
1,6,?,?,?,38,?,8,3,?,9,'generous',?,?,?,?,'good'
3,2,2,2,'none',40,'none',?,?,?,10,'below_average',?,'half','yes','full','bad'
2,4.5,4.5,?,'tcf',?,?,?,?,'yes',10,'below_average','yes','none',?,'half','good'
2,3,3,?,'none',33,?,?,?,'yes',12,'generous',?,?,'yes','full','good'
2,5,4,?,'none',37,?,?,5,'no',11,'below_average','yes','full','yes','full','good'
3,2,2.5,?,?,35,'none',?,?,?,10,'average',?,?,'yes','full','bad'
3,4.5,4.5,5,'none',40,?,?,?,'no',11,'average',?,'half',?,?,'good'
3,3,2,2.5,'tc',40,'none',?,5,'no',10,'below_average','yes','half','yes','full','bad'
2,2.5,2.5,?,?,38,'empl_contr',?,?,?,10,'average',?,?,?,?,'bad'
2,4,5,?,'none',40,'none',?,3,'no',10,'below_average','no','none',?,'none','bad'
3,2,2.5,2.1,'tc',40,'none',2,1,'no',10,'below_average','no','half','yes','full','bad'
2,2,2,?,'none',40,'none',?,?,'no',11,'average','yes','none','yes','full','bad'
1,2,?,?,'tc',40,'ret_allw',4,0,'no',11,'generous','no','none','no','none','bad'
1,2.8,?,?,'none',38,'empl_contr',2,3,'no',9,'below_average','yes','half',?,'none','bad'
3,2,2.5,2,?,37,'empl_contr',?,?,?,10,'average',?,?,'yes','none','bad'
2,4.5,4,?,'none',40,?,?,4,?,12,'average','yes','full','yes','half','good'
1,4,?,?,'none',?,'none',?,?,'yes',11,'average','no','none','no','none','bad'
2,2,3,?,'none',38,'empl_contr',?,?,'yes',12,'generous','yes','none','yes','full','bad'
2,2.5,2.5,?,'tc',39,'empl_contr',?,?,?,12,'average',?,?,'yes',?,'bad'
2,2.5,3,?,'tcf',40,'none',?,?,?,11,'below_average',?,?,'yes',?,'bad'
2,4,4,?,'none',40,'none',?,3,?,10,'below_average','no','none',?,'none','bad'
2,4.5,4,?,?,40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
2,4.5,4,?,'none',40,?,?,5,?,11,'average',?,'full','yes','full','good'
2,4.6,4.6,?,'tcf',38,?,?,?,?,?,?,'yes','half',?,'half','good'
2,5,4.5,?,'none',38,?,14,5,?,11,'below_average','yes',?,?,'full','good'
2,5.7,4.5,?,'none',40,'ret_allw',?,?,?,11,'average','yes','full','yes','full','good'
2,7,5.3,?,?,?,?,?,?,?,11,?,'yes','full',?,?,'good'
3,2,3,?,'tcf',?,'empl_contr',?,?,'yes',?,?,'yes','half','yes',?,'good'
3,3.5,4,4.5,'tcf',35,?,?,?,?,13,'generous',?,?,'yes','full','good'
3,4,3.5,?,'none',40,'empl_contr',?,6,?,11,'average','yes','full',?,'full','good'
3,5,4.4,?,'none',38,'empl_contr',10,6,?,11,'generous','yes',?,?,'full','good'
3,5,5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
3,6,6,4,?,35,?,?,14,?,9,'generous','yes','full','yes','full','good'
%
%
%
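
To make the structure of the file concrete, here is a minimal sketch of reading it outside of WEKA. It only handles what appears in the listing above: % comment lines, @attribute lines with quoted names that contain no spaces, and comma-separated @data rows where ? marks a missing value. The function parse_arff is a hypothetical helper, not a full ARFF parser, and the sample is the attribute header plus the first three data rows from the listing.

```python
import csv

SAMPLE = """\
% comment lines start with a percent sign
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
"""


def parse_arff(text):
    """Return (attribute names, data rows) for a simple ARFF file."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # Values are comma-separated; nominal values are single-quoted.
            rows.append(next(csv.reader([line], quotechar="'")))
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split(None, 2)[1].strip("'"))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


attributes, rows = parse_arff(SAMPLE)
missing = {name: sum(1 for row in rows if row[i] == "?") for i, name in enumerate(attributes)}
for name, count in missing.items():
    print(f"{name}: {count} missing of {len(rows)}")
```

Run over the whole file, the same count gives the kind of per-attribute "Missing" figure that WEKA reports in its dataset summary.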

Basic Statistics and Validation of dataset

java weka.core.Instances $WEKAHOME/data/labor.arff

Relation Name: labor-neg-data
Num Instances: 57
Num Attributes: 17

Name Type Nom Int Real Missing Unique Dist
1 duration Num 0% 98% 0% 1 / 2% 0 / 0% 3
2 wage-increase-first-year Num 0% 49% 49% 1 / 2% 7 / 12% 17
3 wage-increase-second-year Num 0% 47% 33% 11 / 19% 8 / 14% 15
4 wage-increase-third-year Num 0% 14% 12% 42 / 74% 6 / 11% 9
5 cost-of-living-adjustment Nom 65% 0% 0% 20 / 35% 0 / 0% 3
6 working-hours Num 0% 89% 0% 6 / 11% 3 / 5% 8
7 pension Nom 47% 0% 0% 30 / 53% 0 / 0% 3
8 standby-pay Num 0% 16% 0% 48 / 84% 6 / 11% 7
9 shift-differential Num 0% 54% 0% 26 / 46% 5 / 9% 10
10 education-allowance Nom 39% 0% 0% 35 / 61% 0 / 0% 2
11 statutory-holidays Num 0% 93% 0% 4 / 7% 0 / 0% 6
12 vacation Nom 89% 0% 0% 6 / 11% 0 / 0% 3
13 longterm-disability-assis Nom 49% 0% 0% 29 / 51% 0 / 0% 2
14 contribution-to-dental-pl Nom 65% 0% 0% 20 / 35% 0 / 0% 3
15 bereavement-assistance Nom 53% 0% 0% 27 / 47% 0 / 0% 2
16 contribution-to-health-pl Nom 65% 0% 0% 20 / 35% 0 / 0% 3
17 class Nom 100% 0% 0% 0 / 0% 0 / 0% 2

Trying Associations

java weka.associations.Apriori -t $WEKAHOME/data/weather.nominal.arff

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Size of set of large itemsets L(2): 47

Size of set of large itemsets L(3): 39

Size of set of large itemsets L(4): 6

Best rules found:

1. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. outlook=overcast 4 ==> play=yes 4 conf:(1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
8. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 conf:(1)
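
The two numbers in each rule are counts: how many instances match the left-hand side, and how many of those also match the right-hand side. The confidence is the second divided by the first. The sketch below recomputes them for a few rules. The 14 rows are the standard weather.nominal data as I know it, typed in by hand, so treat them as an assumption and compare against your own copy of the file. The helper names are hypothetical.

```python
ATTRIBUTES = ("outlook", "temperature", "humidity", "windy", "play")

# Assumed contents of weather.nominal.arff (14 instances).
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def count_matching(rows, conditions):
    """Number of rows in which every attribute=value condition holds."""
    return sum(
        1 for row in rows
        if all(row[ATTRIBUTES.index(name)] == value for name, value in conditions.items())
    )


def rule_stats(rows, premise, consequence):
    """Return (premise count, premise-and-consequence count, confidence)."""
    premise_count = count_matching(rows, premise)
    both_count = count_matching(rows, {**premise, **consequence})
    return premise_count, both_count, both_count / premise_count


rules = [
    ({"humidity": "normal", "windy": "FALSE"}, {"play": "yes"}),
    ({"outlook": "overcast"}, {"play": "yes"}),
    ({"humidity": "normal"}, {"play": "yes"}),
]
for premise, consequence in rules:
    premise_count, both_count, confidence = rule_stats(WEATHER, premise, consequence)
    print(premise, premise_count, "==>", consequence, both_count, f"conf:({confidence:.2f})")
```

The third rule is one that does not make the list: its confidence falls below the minimum metric of 0.9 shown in the output.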

Trying FILTER

java weka.filters.supervised.attribute.Discretize \
-i $WEKAHOME/data/iris.arff -c last

@relation iris-weka.filters.supervised.attribute.Discretize-Rfirst-last

@attribute sepallength {'\'(-inf-5.55]\'','\'(5.55-6.15]\'','\'(6.15-inf)\''}
@attribute sepalwidth {'\'(-inf-2.95]\'','\'(2.95-3.35]\'','\'(3.35-inf)\''}
@attribute petallength {'\'(-inf-2.45]\'','\'(2.45-4.75]\'','\'(4.75-inf)\''}
@attribute petalwidth {'\'(-inf-0.8]\'','\'(0.8-1.75]\'','\'(1.75-inf)\''}
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data

'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(-inf-2.95]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(2.95-3.35]\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
'\'(-inf-5.55]\'','\'(3.35-inf)\'','\'(-inf-2.45]\'','\'(-inf-0.8]\'',Iris-setosa
...
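The filter replaces every numeric value with the label of the interval it falls into. The sketch below applies the cut points that appear in the attribute declarations above. It does not reproduce how the filter chooses those cut points, only how a value maps to a label once they are known. The function name bin_label is hypothetical, and the example measurements (5.1, 3.5, 1.4, 0.2) are the first row of the iris data as I remember it, so treat them as an assumption.

```python
import bisect

# Cut points read off the @attribute lines in the filter output.
CUT_POINTS = {
    "sepallength": [5.55, 6.15],
    "sepalwidth": [2.95, 3.35],
    "petallength": [2.45, 4.75],
    "petalwidth": [0.8, 1.75],
}


def bin_label(value, cuts):
    """Label of the interval containing value; intervals are closed on the right."""
    index = bisect.bisect_left(cuts, value)
    lower = "-inf" if index == 0 else f"{cuts[index - 1]:g}"
    if index == len(cuts):
        return f"({lower}-inf)"
    return f"({lower}-{cuts[index]:g}]"


# Assumed first row of iris.arff.
instance = {"sepallength": 5.1, "sepalwidth": 3.5, "petallength": 1.4, "petalwidth": 0.2}
labels = [bin_label(instance[name], CUT_POINTS[name]) for name in CUT_POINTS]
print(",".join(labels) + ",Iris-setosa")
```

For those measurements the labels are the same as in the first data row of the filter output.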


Running an experiment

java weka.experiment.Experiment -r -T $WEKAHOME/data/iris.arff \
-D weka.experiment.InstancesResultListener \
-P weka.experiment.RandomSplitResultProducer -- \
-W weka.experiment.ClassifierSplitEvaluator -- \
-W weka.classifiers.rules.OneR

Experiment:
Runs from: 1 to: 10
Datasets: /usr/local/weka/data/iris.arff
Custom property iterator: off
ResultProducer: RandomSplitResultProducer: -P 66.0 -W weka.experiment.ClassifierSplitEvaluator --:
ResultListener: weka.experiment.InstancesResultListener@1270b73

Initializing...
RandomSplitResultProducer: setting additional measures for split evaluator
Iterating...
Postprocessing...
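
The output shows ten runs of a RandomSplitResultProducer with -P 66.0, that is, ten random splits with 66 % of the data used for training. As a rough picture of what one such run does with the data, here is a sketch of a seeded random split. It is my own illustration, not WEKA's code: the function name random_split, using the run number as the seed, and the figure of 150 instances for iris are all assumptions.

```python
import random


def random_split(num_instances, train_percent=66.0, seed=1):
    """Shuffle instance indices and cut them into a training and a test part."""
    indices = list(range(num_instances))
    random.Random(seed).shuffle(indices)
    train_size = round(num_instances * train_percent / 100.0)
    return indices[:train_size], indices[train_size:]


# Ten runs, each with its own seed, as in "Runs from: 1 to: 10".
for run in range(1, 11):
    train, test = random_split(150, train_percent=66.0, seed=run)
    print(f"run {run}: {len(train)} training instances, {len(test)} test instances")
```

Averaging a classifier's results over several such splits is less sensitive to one lucky or unlucky split than a single train/test division.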

Running the Lazy Classifier on a larger dataset:

java weka.classifiers.lazy.IBk -t $WEKAHOME/data/soybean.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification


Time taken to build model: 0.01 seconds
Time taken to test model on training data: 4.38 seconds

=== Error on training data ===

Correctly Classified Instances 682 99.8536 %
Incorrectly Classified Instances 1 0.1464 %
Kappa statistic 0.9984
Mean absolute error 0.0029
Root mean squared error 0.0152
Relative absolute error 2.9949 %
Root relative squared error 6.9346 %
Total Number of Instances 683


=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 92 0 0 0 0 0 0 0 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 1 90 0 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 | r = 2-4-d-injury
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

=== Stratified cross-validation ===

Correctly Classified Instances 623 91.2152 %
Incorrectly Classified Instances 60 8.7848 %
Kappa statistic 0.9036
Mean absolute error 0.0122
Root mean squared error 0.0879
Relative absolute error 12.71 %
Root relative squared error 40.1285 %
Total Number of Instances 683

=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 81 0 0 0 0 5 4 2 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 19 1 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 2 17 0 0 1 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 6 0 0 0 0 13 0 1 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 4 0 0 0 0 0 81 6 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 3 0 0 0 0 0 17 71 0 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
2 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 8 3 | r = 2-4-d-injury
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury


Testing the Instances call

java weka.core.Instances $WEKAHOME/data/soybean.arff

Relation Name: soybean
Num Instances: 683
Num Attributes: 36

Name Type Nom Int Real Missing Unique Dist
1 date Nom 100% 0% 0% 1 / 0% 0 / 0% 7
2 plant-stand Nom 95% 0% 0% 36 / 5% 0 / 0% 2
3 precip Nom 94% 0% 0% 38 / 6% 0 / 0% 3
4 temp Nom 96% 0% 0% 30 / 4% 0 / 0% 3
5 hail Nom 82% 0% 0% 121 / 18% 0 / 0% 2
6 crop-hist Nom 98% 0% 0% 16 / 2% 0 / 0% 4
7 area-damaged Nom 100% 0% 0% 1 / 0% 0 / 0% 4
8 severity Nom 82% 0% 0% 121 / 18% 0 / 0% 3
9 seed-tmt Nom 82% 0% 0% 121 / 18% 0 / 0% 3
10 germination Nom 84% 0% 0% 112 / 16% 0 / 0% 3
11 plant-growth Nom 98% 0% 0% 16 / 2% 0 / 0% 2
12 leaves Nom 100% 0% 0% 0 / 0% 0 / 0% 2
13 leafspots-halo Nom 88% 0% 0% 84 / 12% 0 / 0% 3
14 leafspots-marg Nom 88% 0% 0% 84 / 12% 0 / 0% 3
15 leafspot-size Nom 88% 0% 0% 84 / 12% 0 / 0% 3
16 leaf-shread Nom 85% 0% 0% 100 / 15% 0 / 0% 2
17 leaf-malf Nom 88% 0% 0% 84 / 12% 0 / 0% 2
18 leaf-mild Nom 84% 0% 0% 108 / 16% 0 / 0% 3
19 stem Nom 98% 0% 0% 16 / 2% 0 / 0% 2
20 lodging Nom 82% 0% 0% 121 / 18% 0 / 0% 2
21 stem-cankers Nom 94% 0% 0% 38 / 6% 0 / 0% 4
22 canker-lesion Nom 94% 0% 0% 38 / 6% 0 / 0% 4
23 fruiting-bodies Nom 84% 0% 0% 106 / 16% 0 / 0% 2
24 external-decay Nom 94% 0% 0% 38 / 6% 0 / 0% 3
25 mycelium Nom 94% 0% 0% 38 / 6% 0 / 0% 2
26 int-discolor Nom 94% 0% 0% 38 / 6% 0 / 0% 3
27 sclerotia Nom 94% 0% 0% 38 / 6% 0 / 0% 2
28 fruit-pods Nom 88% 0% 0% 84 / 12% 0 / 0% 4
29 fruit-spots Nom 84% 0% 0% 106 / 16% 0 / 0% 4
30 seed Nom 87% 0% 0% 92 / 13% 0 / 0% 2
31 mold-growth Nom 87% 0% 0% 92 / 13% 0 / 0% 2
32 seed-discolor Nom 84% 0% 0% 106 / 16% 0 / 0% 2
33 seed-size Nom 87% 0% 0% 92 / 13% 0 / 0% 2
34 shriveling Nom 84% 0% 0% 106 / 16% 0 / 0% 2
35 roots Nom 95% 0% 0% 31 / 5% 0 / 0% 3
36 class Nom 100% 0% 0% 0 / 0% 0 / 0% 19

Using NaiveBayes Classifier on soybean data

java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/soybean.arff

Naive Bayes Classifier

Class diaporthe-stem-canker: Prior probability = 0.03

date: Discrete Estimator. Counts = 1 1 1 6 6 6 6 (Total = 27)
plant-stand: Discrete Estimator. Counts = 21 1 (Total = 22)
precip: Discrete Estimator. Counts = 1 1 21 (Total = 23)
temp: Discrete Estimator. Counts = 1 21 1 (Total = 23)
hail: Discrete Estimator. Counts = 20 2 (Total = 22)
crop-hist: Discrete Estimator. Counts = 1 7 8 8 (Total = 24)
area-damaged: Discrete Estimator. Counts = 18 4 1 1 (Total = 24)
severity: Discrete Estimator. Counts = 1 15 7 (Total = 23)
seed-tmt: Discrete Estimator. Counts = 12 10 1 (Total = 23)
...

Time taken to build model: 0.01 seconds
Time taken to test model on training data: 0.11 seconds

=== Error on training data ===

Correctly Classified Instances 640 93.7042 %
Incorrectly Classified Instances 43 6.2958 %
Kappa statistic 0.931
Mean absolute error 0.0081
Root mean squared error 0.0765
Relative absolute error 8.4277 %
Root relative squared error 34.8958 %
Total Number of Instances 683


=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 79 0 0 0 0 5 4 4 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 2 0 0 0 0 18 0 0 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 3 0 0 0 0 0 21 66 1 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 | r = 2-4-d-injury
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

=== Stratified cross-validation ===

Correctly Classified Instances 635 92.9722 %
Incorrectly Classified Instances 48 7.0278 %
Kappa statistic 0.923
Mean absolute error 0.0096
Root mean squared error 0.0817
Relative absolute error 9.9344 %
Root relative squared error 37.2742 %
Total Number of Instances 683

=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 77 0 0 0 0 5 6 4 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 2 18 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 2 0 0 0 0 17 1 0 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 3 0 0 0 0 0 22 65 1 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 | r = 2-4-d-injury
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury

The same dataset as a pruned decision tree

java weka.classifiers.trees.J48 -t $WEKAHOME/data/soybean.arff

J48 pruned tree
------------------

leafspot-size = lt-1/8
| canker-lesion = dna
| | leafspots-marg = w-s-marg
| | | seed-size = norm: bacterial-blight (21.0/1.0)
| | | seed-size = lt-norm: bacterial-pustule (3.23/1.23)
| | leafspots-marg = no-w-s-marg: bacterial-pustule (17.91/0.91)
| | leafspots-marg = dna: bacterial-blight (0.0)
| canker-lesion = brown: bacterial-blight (0.0)
| canker-lesion = dk-brown-blk: phytophthora-rot (4.78/0.1)
| canker-lesion = tan: purple-seed-stain (11.23/0.23)
leafspot-size = gt-1/8
| roots = norm
| | mold-growth = absent
| | | fruit-spots = absent
| | | | leaf-malf = absent
| | | | | fruiting-bodies = absent
| | | | | | date = april: brown-spot (5.0)
| | | | | | date = may: brown-spot (24.0/1.0)
| | | | | | date = june
| | | | | | | precip = lt-norm: phyllosticta-leaf-spot (4.0)
| | | | | | | precip = norm: brown-spot (5.0/2.0)
| | | | | | | precip = gt-norm: brown-spot (21.0)
| | | | | | date = july
| | | | | | | precip = lt-norm: phyllosticta-leaf-spot (1.0)
| | | | | | | precip = norm: phyllosticta-leaf-spot (2.0)
| | | | | | | precip = gt-norm: frog-eye-leaf-spot (11.0/5.0)
| | | | | | date = august
| | | | | | | leaf-shread = absent
| | | | | | | | seed-tmt = none: alternarialeaf-spot (16.0/4.0)
| | | | | | | | seed-tmt = fungicide
| | | | | | | | | plant-stand = normal: frog-eye-leaf-spot (6.0)
| | | | | | | | | plant-stand = lt-normal: alternarialeaf-spot (5.0/1.0)
| | | | | | | | seed-tmt = other: frog-eye-leaf-spot (3.0)
| | | | | | | leaf-shread = present: alternarialeaf-spot (2.0)
| | | | | | date = september
| | | | | | | stem = norm: alternarialeaf-spot (44.0/4.0)
| | | | | | | stem = abnorm: frog-eye-leaf-spot (2.0)
| | | | | | date = october: alternarialeaf-spot (31.0/1.0)
| | | | | fruiting-bodies = present: brown-spot (34.0)
| | | | leaf-malf = present: phyllosticta-leaf-spot (10.0)
| | | fruit-spots = colored
| | | | fruit-pods = norm: brown-spot (2.0)
| | | | fruit-pods = diseased: frog-eye-leaf-spot (62.0)
| | | | fruit-pods = few-present: frog-eye-leaf-spot (0.0)
| | | | fruit-pods = dna: frog-eye-leaf-spot (0.0)
| | | fruit-spots = brown-w/blk-specks
| | | | crop-hist = diff-lst-year: brown-spot (0.0)
| | | | crop-hist = same-lst-yr: brown-spot (2.0)
| | | | crop-hist = same-lst-two-yrs: brown-spot (0.0)
| | | | crop-hist = same-lst-sev-yrs: frog-eye-leaf-spot (2.0)
| | | fruit-spots = distort: brown-spot (0.0)
| | | fruit-spots = dna: brown-stem-rot (9.0)
| | mold-growth = present
| | | leaves = norm: diaporthe-pod-&-stem-blight (7.25)
| | | leaves = abnorm: downy-mildew (20.0)
| roots = rotted
| | area-damaged = scattered: herbicide-injury (1.1/0.1)
| | area-damaged = low-areas: phytophthora-rot (30.03)
| | area-damaged = upper-areas: phytophthora-rot (0.0)
| | area-damaged = whole-field: herbicide-injury (3.66/0.66)
| roots = galls-cysts: cyst-nematode (7.81/0.17)
leafspot-size = dna
| int-discolor = none
| | leaves = norm
| | | stem-cankers = absent
| | | | canker-lesion = dna: diaporthe-pod-&-stem-blight (5.53)
| | | | canker-lesion = brown: purple-seed-stain (0.0)
| | | | canker-lesion = dk-brown-blk: purple-seed-stain (0.0)
| | | | canker-lesion = tan: purple-seed-stain (9.0)
| | | stem-cankers = below-soil: rhizoctonia-root-rot (19.0)
| | | stem-cankers = above-soil: anthracnose (0.0)
| | | stem-cankers = above-sec-nde: anthracnose (24.0)
| | leaves = abnorm
| | | stem = norm
| | | | plant-growth = norm: powdery-mildew (22.0/2.0)
| | | | plant-growth = abnorm: cyst-nematode (4.3/0.39)
| | | stem = abnorm
| | | | plant-stand = normal
| | | | | leaf-malf = absent
| | | | | | seed = norm: diaporthe-stem-canker (21.0/1.0)
| | | | | | seed = abnorm: anthracnose (9.0)
| | | | | leaf-malf = present: 2-4-d-injury (3.0)
| | | | plant-stand = lt-normal
| | | | | fruiting-bodies = absent: phytophthora-rot (50.16/7.61)
| | | | | fruiting-bodies = present
| | | | | | roots = norm: anthracnose (11.0/1.0)
| | | | | | roots = rotted: phytophthora-rot (12.89/2.15)
| | | | | | roots = galls-cysts: phytophthora-rot (0.0)
| int-discolor = brown
| | leaf-malf = absent: brown-stem-rot (35.73/0.73)
| | leaf-malf = present: 2-4-d-injury (3.15/0.68)
| int-discolor = black: charcoal-rot (22.22/2.22)

Number of Leaves : 61

Size of the tree : 93


Time taken to build model: 0.23 seconds
Time taken to test model on training data: 0.09 seconds

=== Error on training data ===

Correctly Classified Instances 658 96.3397 %
Incorrectly Classified Instances 25 3.6603 %
Kappa statistic 0.9598
Mean absolute error 0.0104
Root mean squared error 0.0625
Relative absolute error 10.7981 %
Root relative squared error 28.5358 %
Total Number of Instances 683


=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
1 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 90 0 0 0 0 0 0 2 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 1 0 0 0 0 0 0 0 43 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 3 0 0 0 0 17 0 0 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 88 3 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 10 81 0 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 | r = 2-4-d-injury
0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 | s = herbicide-injury


=== Stratified cross-validation ===

Correctly Classified Instances 625 91.5081 %
Incorrectly Classified Instances 58 8.4919 %
Kappa statistic 0.9068
Mean absolute error 0.0135
Root mean squared error 0.0842
Relative absolute error 14.0484 %
Root relative squared error 38.4134 %
Total Number of Instances 683


=== Confusion Matrix ===

a b c d e f g h i j k l m n o p q r s <-- classified as
19 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
1 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 87 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 85 0 0 0 0 2 1 4 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 4 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 3 0 0 0 0 14 0 3 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 1 0 0 0 0 0 85 5 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 3 0 0 0 0 1 20 67 0 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 14 0 | r = 2-4-d-injury
0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 2 3 | s = herbicide-injury


Building a data model

java weka.classifiers.trees.J48 -t $WEKAHOME/data/soybean.arff \
-i -k -d J48-data.model > J48-data.out &

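On the segment data, the plan was to build on one set and check against another. Note that the command below passes only -t, so WEKA cross-validates on segment-test.arff instead of testing against a second file: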

java weka.classifiers.trees.J48 -t $WEKAHOME/data/segment-test.arff \
-i -k -d J48-segment-data.model >J48-segment-data.out

The results:

[weka@domU-12-31-36-00-26-23 tutorial]$ ls -l
total 108
-rw-rw-r-- 1 weka weka 60556 Aug 1 09:13 J48-data.model
-rw-rw-r-- 1 weka weka 12906 Aug 1 09:13 J48-data.out
-rw-rw-r-- 1 weka weka 18784 Aug 1 09:17 J48-segment-data.model
-rw-rw-r-- 1 weka weka 6146 Aug 1 09:17 J48-segment-data.out

more J48-segment-data.out

J48 pruned tree
------------------

region-centroid-row <= 155
| intensity-mean <= 31.6296
| | hue-mean <= -1.84512
| | | hue-mean <= -2.22949
| | | | saturation-mean <= 0.48999: window (3.0)
| | | | saturation-mean > 0.48999: foliage (77.0)
| | | hue-mean > -2.22949
| | | | saturation-mean <= 0.864482
| | | | | rawgreen-mean <= 14.6667
| | | | | | region-centroid-col <= 100
| | | | | | | hue-mean <= -2.03349
| | | | | | | | hue-mean <= -2.14532: foliage (2.0)
| | | | | | | | hue-mean > -2.14532: window (13.0/3.0)
| | | | | | | hue-mean > -2.03349
| | | | | | | | region-centroid-row <= 150: brickface (2.0)
| | | | | | | | region-centroid-row > 150: window (2.0)
| | | | | | region-centroid-col > 100: window (56.0)
| | | | | rawgreen-mean > 14.6667
| | | | | | region-centroid-row <= 122: window (26.0/1.0)
| | | | | | region-centroid-row > 122
| | | | | | | region-centroid-col <= 165: cement (10.0)
| | | | | | | region-centroid-col > 165: window (4.0/1.0)
| | | | saturation-mean > 0.864482
| | | | | hue-mean <= -2.101: foliage (22.0)
| | | | | hue-mean > -2.101
| | | | | | region-centroid-row <= 132
| | | | | | | hue-mean <= -2.08047: foliage (9.0)
| | | | | | | hue-mean > -2.08047: window (3.0/1.0)
| | | | | | region-centroid-row > 132
| | | | | | | region-centroid-row <= 143: window (10.0)
| | | | | | | region-centroid-row > 143: foliage (2.0)
| | hue-mean > -1.84512
| | | exgreen-mean <= -5.77778
| | | | exred-mean <= -5.88889
| | | | | region-centroid-row <= 104: brickface (6.0)
| | | | | region-centroid-row > 104: foliage (3.0)
| | | | exred-mean > -5.88889: brickface (118.0/1.0)
| | | exgreen-mean > -5.77778
| | | | exred-mean <= -0.777778: grass (5.0/1.0)
| | | | exred-mean > -0.777778
| | | | | region-centroid-col <= 34: foliage (2.0)
| | | | | region-centroid-col > 34: window (14.0)
| intensity-mean > 31.6296
| | rawblue-mean <= 88.4444: cement (94.0/1.0)
| | rawblue-mean > 88.4444: sky (110.0)
region-centroid-row > 155
| rawred-mean <= 23.3333
| | exgreen-mean <= -3.77778: cement (5.0/1.0)
| | exgreen-mean > -3.77778: grass (118.0)
| rawred-mean > 23.3333: path (94.0)

Number of Leaves : 26

Size of the tree : 51


Time taken to build model: 0.45 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correctly Classified Instances 800 98.7654 %
Incorrectly Classified Instances 10 1.2346 %
Kappa statistic 0.9856
K&B Relative Info Score 79692.1947 %
K&B Information Score 2232.1312 bits 2.7557 bits/instance
Class complexity | order 0 2268.6706 bits 2.8008 bits/instance
Class complexity | scheme 45.7746 bits 0.0565 bits/instance
Complexity improvement (Sf) 2222.896 bits 2.7443 bits/instance
Mean absolute error 0.0058
Root mean squared error 0.054
Relative absolute error 2.3848 %
Root relative squared error 15.443 %
Total Number of Instances 810


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0.001 0.992 1 0.996 brickface
1 0 1 1 1 sky
0.959 0 1 0.959 0.979 foliage
0.973 0.003 0.982 0.973 0.977 cement
0.992 0.009 0.954 0.992 0.973 window
1 0 1 1 1 path
0.992 0.001 0.992 0.992 0.992 grass


=== Confusion Matrix ===

a b c d e f g <-- classified as
125 0 0 0 0 0 0 | a = brickface
0 110 0 0 0 0 0 | b = sky
0 0 117 1 4 0 0 | c = foliage
0 0 0 107 2 0 1 | d = cement
1 0 0 0 125 0 0 | e = window
0 0 0 0 0 94 0 | f = path
0 0 0 1 0 0 122 | g = grass


=== Stratified cross-validation ===

Correctly Classified Instances 757 93.4568 %
Incorrectly Classified Instances 53 6.5432 %
Kappa statistic 0.9235
K&B Relative Info Score 75326.8356 %
K&B Information Score 2110.05 bits 2.605 bits/instance
Class complexity | order 0 2268.8296 bits 2.801 bits/instance
Class complexity | scheme 37665.7637 bits 46.5009 bits/instance
Complexity improvement (Sf) -35396.9341 bits -43.6999 bits/instance
Mean absolute error 0.02
Root mean squared error 0.1312
Relative absolute error 8.1735 %
Root relative squared error 37.5168 %
Total Number of Instances 810


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.96 0.009 0.952 0.96 0.956 brickface
1 0.001 0.991 1 0.995 sky
0.844 0.022 0.873 0.844 0.858 foliage
0.9 0.01 0.934 0.9 0.917 cement
0.881 0.031 0.841 0.881 0.86 window
0.989 0.001 0.989 0.989 0.989 path
0.984 0.003 0.984 0.984 0.984 grass


=== Confusion Matrix ===

a b c d e f g <-- classified as
120 0 3 0 2 0 0 | a = brickface
0 110 0 0 0 0 0 | b = sky
4 0 103 1 14 0 0 | c = foliage
0 1 2 99 5 1 2 | d = cement
2 0 10 3 111 0 0 | e = window
0 0 0 1 0 93 0 | f = path
0 0 0 2 0 0 121 | g = grass


Checking meta classifier:

java weka.classifiers.meta.ClassificationViaRegression \
-W weka.classifiers.functions.LinearRegression \
-t $WEKAHOME/data/iris.arff -x 2 -- -S 1

Options: -W weka.classifiers.functions.LinearRegression -- -S 1

Classification via Regression

Classifier for class with index 0:


Linear Regression Model

class =

0.0656 * sepallength +
0.2425 * sepalwidth +
-0.2228 * petallength +
-0.0634 * petalwidth +
0.1225

Classifier for class with index 1:


Linear Regression Model

class =

-0.0215 * sepallength +
-0.4407 * sepalwidth +
0.2185 * petallength +
-0.4832 * petalwidth +
1.563

Classifier for class with index 2:


Linear Regression Model

class =

-0.0441 * sepallength +
0.1982 * sepalwidth +
0.0042 * petallength +
0.5465 * petalwidth +
-0.6854



Time taken to build model: 0.14 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances 127 84.6667 %
Incorrectly Classified Instances 23 15.3333 %
Kappa statistic 0.77
Mean absolute error 0.2164
Root mean squared error 0.2943
Relative absolute error 48.6997 %
Root relative squared error 62.4309 %
Total Number of Instances 150


=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 34 16 | b = Iris-versicolor
0 7 43 | c = Iris-virginica


=== Stratified cross-validation ===

Correctly Classified Instances 123 82 %
Incorrectly Classified Instances 27 18 %
Kappa statistic 0.73
Mean absolute error 0.2349
Root mean squared error 0.3157
Relative absolute error 52.8443 %
Root relative squared error 66.9658 %
Total Number of Instances 150


=== Confusion Matrix ===

a b c <-- classified as
49 1 0 | a = Iris-setosa
0 33 17 | b = Iris-versicolor
0 9 41 | c = Iris-virginica


Testing some real datasets now: Leukemia-ALLAML

The data can be found here: http://research.i2r.a-star.edu.sg/rp/Leukemia/ALLAML.html

java weka.classifiers.trees.J48 -t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.tree.J48.model > Leukemia-ALLAML.tree.J48.out

The results:

more Leukemia-ALLAML.tree.J48.out

J48 pruned tree
------------------

attribute4847 <= 938: ALL (27.0)
attribute4847 > 938: AML (11.0)

Number of Leaves : 2

Size of the tree : 3


Time taken to build model: 1.13 seconds
Time taken to test model on training data: 0.07 seconds

=== Error on training data ===

Correctly Classified Instances 38 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
K&B Relative Info Score 3744.5181 %
K&B Information Score 33.0001 bits 0.8684 bits/instance
Class complexity | order 0 33.0001 bits 0.8684 bits/instance
Class complexity | scheme 0 bits 0 bits/instance
Complexity improvement (Sf) 33.0001 bits 0.8684 bits/instance
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 38


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0 1 1 1 ALL
1 0 1 1 1 AML


=== Confusion Matrix ===

a b <-- classified as
27 0 | a = ALL
0 11 | b = AML


=== Error on test data ===

Correctly Classified Instances 31 91.1765 %
Incorrectly Classified Instances 3 8.8235 %
Kappa statistic 0.8198
K&B Relative Info Score 3160.6324 %
K&B Information Score 27.8544 bits 0.8192 bits/instance
Class complexity | order 0 34.609 bits 1.0179 bits/instance
Class complexity | scheme 3222 bits 94.7647 bits/instance
Complexity improvement (Sf) -3187.391 bits -93.7468 bits/instance
Mean absolute error 0.0882
Root mean squared error 0.297
Relative absolute error 18.9873 %
Root relative squared error 58.8575 %
Total Number of Instances 34


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.9 0.071 0.947 0.9 0.923 ALL
0.929 0.1 0.867 0.929 0.897 AML


=== Confusion Matrix ===

a b <-- classified as
18 2 | a = ALL
1 13 | b = AML


Same data with NaiveBayes:

java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.NaiveBayes.J48.model > Leukemia-ALLAML.NaiveBayes.J48.out

Checking the results (trimmed):

tail -100 Leukemia-ALLAML.NaiveBayes.J48.out

attribute7096: Normal Distribution. Mean = 17632.9975 StandardDev = 5491.6378 WeightSum = 11 Precision = 885.6756756756756
attribute7097: Normal Distribution. Mean = 16260.5405 StandardDev = 1742.1779 WeightSum = 11 Precision = 687.9459459459459
attribute7098: Normal Distribution. Mean = 918.855 StandardDev = 410.0149 WeightSum = 11 Precision = 64.37837837837837
attribute7099: Normal Distribution. Mean = 280.1548 StandardDev = 173.7343 WeightSum = 11 Precision = 17.216216216216218
attribute7100: Normal Distribution. Mean = 59.3889 StandardDev = 189.0803 WeightSum = 11 Precision = 29.694444444444443
attribute7101: Normal Distribution. Mean = 11265.5725 StandardDev = 2448.7777 WeightSum = 11 Precision = 390.9189189189189
attribute7102: Normal Distribution. Mean = 10453.7396 StandardDev = 3122.437 WeightSum = 11 Precision = 419.6756756756757
attribute7103: Normal Distribution. Mean = 318.7273 StandardDev = 376.3747 WeightSum = 11 Precision = 47.37837837837838
attribute7104: Normal Distribution. Mean = 2731.2801 StandardDev = 1380.7546 WeightSum = 11 Precision = 236.56756756756758
attribute7105: Normal Distribution. Mean = -288.0413 StandardDev = 90.8241 WeightSum = 11 Precision = 11.606060606060606
attribute7106: Normal Distribution. Mean = 0 StandardDev = 63.0836 WeightSum = 11 Precision = 7.324324324324325
attribute7107: Normal Distribution. Mean = 300.6417 StandardDev = 114.8094 WeightSum = 11 Precision = 27.558823529411764
attribute7108: Normal Distribution. Mean = -6.5039 StandardDev = 40.9087 WeightSum = 11 Precision = 8.942857142857143
attribute7109: Normal Distribution. Mean = 249.1057 StandardDev = 80.4043 WeightSum = 11 Precision = 16.81081081081081
attribute7110: Normal Distribution. Mean = 56.7107 StandardDev = 49.8522 WeightSum = 11 Precision = 6.636363636363637
attribute7111: Normal Distribution. Mean = 63.7126 StandardDev = 31.0336 WeightSum = 11 Precision = 9.870967741935484
attribute7112: Normal Distribution. Mean = -16.5111 StandardDev = 217.4379 WeightSum = 11 Precision = 25.945945945945947
attribute7113: Normal Distribution. Mean = 267.1091 StandardDev = 128.0862 WeightSum = 11 Precision = 16.6
attribute7114: Normal Distribution. Mean = 122.4791 StandardDev = 87.51 WeightSum = 11 Precision = 17.054054054054053
attribute7115: Normal Distribution. Mean = 233.8717 StandardDev = 111.0206 WeightSum = 11 Precision = 11.588235294117647
attribute7116: Normal Distribution. Mean = 307.9662 StandardDev = 139.9155 WeightSum = 11 Precision = 24.37142857142857
attribute7117: Normal Distribution. Mean = -319.0614 StandardDev = 110.253 WeightSum = 11 Precision = 25.43243243243243
attribute7118: Normal Distribution. Mean = -2319.9951 StandardDev = 878.3917 WeightSum = 11 Precision = 105.89189189189189
attribute7119: Normal Distribution. Mean = 378.2703 StandardDev = 120.9712 WeightSum = 11 Precision = 94.56756756756756
attribute7120: Normal Distribution. Mean = 182.4489 StandardDev = 82.9293 WeightSum = 11 Precision = 10.1875
attribute7121: Normal Distribution. Mean = 797.0098 StandardDev = 352.9267 WeightSum = 11 Precision = 38.62162162162162
attribute7122: Normal Distribution. Mean = 11.3143 StandardDev = 56.262 WeightSum = 11 Precision = 11.314285714285715
attribute7123: Normal Distribution. Mean = 348.8624 StandardDev = 134.0911 WeightSum = 11 Precision = 67.32432432432432
attribute7124: Normal Distribution. Mean = -17.8909 StandardDev = 48.2762 WeightSum = 11 Precision = 4.685714285714286
attribute7125: Normal Distribution. Mean = 1109.484 StandardDev = 549.1813 WeightSum = 11 Precision = 57.2972972972973
attribute7126: Normal Distribution. Mean = 326.3333 StandardDev = 147.522 WeightSum = 11 Precision = 29.666666666666668
attribute7127: Normal Distribution. Mean = 8.5 StandardDev = 20.0873 WeightSum = 11 Precision = 5.5
attribute7128: Normal Distribution. Mean = 1145.2208 StandardDev = 1057.6857 WeightSum = 11 Precision = 91.28571428571429
attribute7129: Normal Distribution. Mean = -24.6494 StandardDev = 26.9834 WeightSum = 11 Precision = 3.7142857142857144


Time taken to build model: 0.42 seconds
Time taken to test model on training data: 1.28 seconds

=== Error on training data ===

Correctly Classified Instances 38 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
K&B Relative Info Score 3744.5181 %
K&B Information Score 33.0001 bits 0.8684 bits/instance
Class complexity | order 0 33.0001 bits 0.8684 bits/instance
Class complexity | scheme 0 bits 0 bits/instance
Complexity improvement (Sf) 33.0001 bits 0.8684 bits/instance
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 38


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
1 0 1 1 1 ALL
1 0 1 1 1 AML


=== Confusion Matrix ===

a b <-- classified as
27 0 | a = ALL
0 11 | b = AML


=== Error on test data ===

Correctly Classified Instances 30 88.2353 %
Incorrectly Classified Instances 4 11.7647 %
Kappa statistic 0.7518
K&B Relative Info Score 2905.1505 %
K&B Information Score 25.6028 bits 0.753 bits/instance
Class complexity | order 0 34.609 bits 1.0179 bits/instance
Class complexity | scheme 4296 bits 126.3529 bits/instance
Complexity improvement (Sf) -4261.391 bits -125.335 bits/instance
Mean absolute error 0.1176
Root mean squared error 0.343
Relative absolute error 25.3165 %
Root relative squared error 67.9628 %
Total Number of Instances 34


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.95 0.214 0.864 0.95 0.905 ALL
0.786 0.05 0.917 0.786 0.846 AML


=== Confusion Matrix ===

a b <-- classified as
19 1 | a = ALL
3 11 | b = AML


Running off predictions


java weka.classifiers.trees.J48 -t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.tree.J48.model -p 0 > Leukemia-ALLAML.tree.J48.out

more Leukemia-ALLAML.tree.J48.out

0 ALL 1.0 ALL
1 ALL 1.0 ALL
2 ALL 1.0 ALL
3 ALL 1.0 ALL
4 ALL 1.0 ALL
5 ALL 1.0 ALL
6 ALL 1.0 ALL
7 ALL 1.0 ALL
8 ALL 1.0 ALL
9 ALL 1.0 ALL
10 ALL 1.0 ALL
11 ALL 1.0 ALL
12 ALL 1.0 ALL
13 ALL 1.0 ALL
14 AML 1.0 ALL
15 ALL 1.0 ALL
16 AML 1.0 ALL
17 ALL 1.0 ALL
18 ALL 1.0 ALL
19 ALL 1.0 ALL
20 AML 1.0 AML
21 AML 1.0 AML
22 AML 1.0 AML
23 AML 1.0 AML
24 AML 1.0 AML
25 AML 1.0 AML
26 AML 1.0 AML
27 AML 1.0 AML
28 AML 1.0 AML
29 AML 1.0 AML
30 ALL 1.0 AML
31 AML 1.0 AML
32 AML 1.0 AML
33 AML 1.0 AML
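
If you want to double check a prediction listing like the one above, a few lines of Python will tally it. This is only a sketch: it assumes the four columns are instance index, predicted class, confidence and actual class (my reading of the listing, based on the counts lining up with the test confusion matrix reported for J48), and the names PREDICTIONS and tally are made up for the example.

```python
from collections import Counter

# J48 predictions copied from the listing above.
# Assumed column order: index, predicted class, confidence, actual class.
PREDICTIONS = """\
0 ALL 1.0 ALL
1 ALL 1.0 ALL
2 ALL 1.0 ALL
3 ALL 1.0 ALL
4 ALL 1.0 ALL
5 ALL 1.0 ALL
6 ALL 1.0 ALL
7 ALL 1.0 ALL
8 ALL 1.0 ALL
9 ALL 1.0 ALL
10 ALL 1.0 ALL
11 ALL 1.0 ALL
12 ALL 1.0 ALL
13 ALL 1.0 ALL
14 AML 1.0 ALL
15 ALL 1.0 ALL
16 AML 1.0 ALL
17 ALL 1.0 ALL
18 ALL 1.0 ALL
19 ALL 1.0 ALL
20 AML 1.0 AML
21 AML 1.0 AML
22 AML 1.0 AML
23 AML 1.0 AML
24 AML 1.0 AML
25 AML 1.0 AML
26 AML 1.0 AML
27 AML 1.0 AML
28 AML 1.0 AML
29 AML 1.0 AML
30 ALL 1.0 AML
31 AML 1.0 AML
32 AML 1.0 AML
33 AML 1.0 AML
"""


def tally(text):
    """Count (actual, predicted) pairs from whitespace-separated prediction lines."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _index, predicted, _confidence, actual = fields
        counts[(actual, predicted)] += 1
    return counts


counts = tally(PREDICTIONS)
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
total = sum(counts.values())
print(f"correct: {correct}/{total} ({100.0 * correct / total:.4f} %)")

# Rows are the actual class, columns the predicted class (ALL, AML).
for actual in ("ALL", "AML"):
    print(actual, [counts[(actual, predicted)] for predicted in ("ALL", "AML")])
```

Under that column assumption the tally gives 31 of 34 correct, with rows 18 2 and 1 13, which is the same as the J48 test confusion matrix.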

java -mx1024m weka.classifiers.bayes.NaiveBayes \
-t $WEKAHOME/data/ALL-AML_train.arff \
-T $WEKAHOME/data/ALL-AML_test.arff -i -k \
-d Leukemia-ALLAML.NaiveBayes.J48.model -p 0 > Leukemia-ALLAML.NaiveBayes.J48.pred

The results:

[weka@domU-12-31-36-00-26-23 tutorial]$ ls -l
total 3920
-rw-rw-r-- 1 weka weka 60556 Aug 1 09:13 J48-data.model
-rw-rw-r-- 1 weka weka 12906 Aug 1 09:13 J48-data.out
-rw-rw-r-- 1 weka weka 18784 Aug 1 09:17 J48-segment-data.model
-rw-rw-r-- 1 weka weka 6146 Aug 1 09:17 J48-segment-data.out
-rw-rw-r-- 1 weka weka 1506090 Aug 1 09:41 Leukemia-ALLAML.NaiveBayes.J48.model
-rw-rw-r-- 1 weka weka 1703981 Aug 1 09:36 Leukemia-ALLAML.NaiveBayes.J48.out
-rw-rw-r-- 1 weka weka 535 Aug 1 09:41 Leukemia-ALLAML.NaiveBayes.J48.pred
-rw-rw-r-- 1 weka weka 666093 Aug 1 09:40 Leukemia-ALLAML.tree.J48.model
-rw-rw-r-- 1 weka weka 535 Aug 1 09:40 Leukemia-ALLAML.tree.J48.out

more Leukemia-ALLAML.NaiveBayes.J48.pred

0 ALL 1.0 ALL
1 ALL 1.0 ALL
2 AML 1.0 ALL
3 ALL 1.0 ALL
4 ALL 1.0 ALL
5 ALL 1.0 ALL
6 ALL 1.0 ALL
7 ALL 1.0 ALL
8 ALL 1.0 ALL
9 ALL 1.0 ALL
10 ALL 1.0 ALL
11 ALL 1.0 ALL
12 ALL 1.0 ALL
13 ALL 1.0 ALL
14 ALL 1.0 ALL
15 ALL 1.0 ALL
16 ALL 1.0 ALL
17 ALL 1.0 ALL
18 ALL 1.0 ALL
19 ALL 1.0 ALL
20 AML 1.0 AML
21 AML 1.0 AML
22 AML 1.0 AML
23 AML 1.0 AML
24 ALL 1.0 AML
25 AML 1.0 AML
26 AML 1.0 AML
27 AML 1.0 AML
28 AML 1.0 AML
29 AML 1.0 AML
30 ALL 1.0 AML
31 AML 1.0 AML
32 ALL 1.0 AML
33 AML 1.0 AML
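
To compare the two classifiers on the test set, the Kappa statistic can be recomputed from the confusion matrices. The sketch below uses the standard Cohen's kappa formula (observed agreement against the agreement expected by chance); I have not checked it against WEKA's source, and the function and variable names are just for illustration.

```python
def kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of matching row and column totals.
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Test-set confusion matrices from the J48 and NaiveBayes runs on ALL-AML.
j48_test = [[18, 2], [1, 13]]
naive_bayes_test = [[19, 1], [3, 11]]

print(f"J48 kappa: {kappa(j48_test):.4f}")
print(f"NaiveBayes kappa: {kappa(naive_bayes_test):.4f}")
```

This prints 0.8198 for J48 and 0.7518 for NaiveBayes, the same values WEKA reported in its test statistics, so on this small test set J48 comes out slightly ahead.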
