Ever since I happened to stumble upon this book on Data Mining, I’ve been hooked. So much so that I’ve been brushing up on statistics and probability distributions just to follow along the book.

I’m currently reading the chapter on classification using decision trees, and what appeals to me is the how the technique reduces huge sets of data (records) into simple models that can both describe the data and can predict the class of previously unseen records. If all that goes right above your head, here’s a simple example.

P1P2Class (Result)True True True False False False True False False False True False Given the above tabular values, can the computer figure out that it is the truth table for P1 ^ P2 (with ^ representing logical conjunction, not XOR) ? If it can, it has

1. A model that describes the data, i.e., the data is the conjunction of the two boolean variables.

2. An evaluator that can calculate the result, given P1 and P2, i.e., predict the class. Since all combination of values are already available, there are no unseen records in this case.

A decision tree for the above table looks like this visually :-

with green edges and red edges representing true and false values for the attribute represented by the node, respectively. As you can see, the non-leaf nodes each represent an attribute, with one edge for each possible value of that attribute. The leaf nodes represent the result (class). Since we are dealing with binary attributes, the decision tree turns out to be a binary tree. Note that you can find out the result of { P1 = True, P2 = True } by simply navigating the tree and reading the value of the leaf node ( green edge from P1 to P2, green edge from P2 to T, and then read the value, T = true).

This was so interesting that I actually wrote a generic decision tree builder in C# (download). It uses Hunt’s Algorithm to build the decision tree and uses Entropy as the measure to select attributes. At the moment, it can work only on records with nominal attributes (attributes like enums whose values are limited to a finite set). After training, it returns a decision tree, which can then be exported to XML or can be used to predict classes for new records.

Enough of the theory, let’s see it in action. Here’s how code that uses my tree builder looks.

class TruthTableEntry : Record<bool> { public bool P1 { get; set; } public bool P2 { get; set; } public override bool Equals(object obj) { var other = obj as TruthTableEntry; if (other == null) return false; return other.P1 == this.P1 && other.P2 == this.P2; } public override int GetHashCode() { return P1.GetHashCode() ^ P2.GetHashCode(); } }

static void Main(string[] args) { TruthTableEntry[] table = { new TruthTableEntry() { P1 = true, P2 = true, Class = true }, new TruthTableEntry() { P1 = true, P2 = false, Class = false}, new TruthTableEntry() { P1 = false, P2 = true, Class = false}, new TruthTableEntry() { P1 = false, P2 = false, Class = false } }; var tree = new TreeBuilder<TruthTableEntry, bool>().Train(table); string predicate = FormPredicate(tree.RootNode, ""); Console.WriteLine(predicate); }

The TruthTableEntry class represents a row in the truth table above, and an array of those records are given to the TreeBuilder instance’s Train method. The decision tree returned by it is then translated into a predicate by the FormPredicate method. Here’s how the tree looks in XML if you call Save on the tree.

<?xml version="1.0" encoding="utf-8"?> <Node ParentConditionValue="" SplitAttribute="P1"> <Node ParentConditionValue="True" SplitAttribute="P2"> <Node ParentConditionValue="True" Class="True" /> <Node ParentConditionValue="False" Class="False" /> </Node> <Node ParentConditionValue="False" Class="False" /> </Node>

which exactly matches what we expected.

The FormPredicate method attempts to transform the tree into a predicate clause, as expressed mathematically in predicate logic. For the above tree, it returns

(P1 ^ P2) v (~P1 ^ False)

which when reduced through the laws of predicate logic

(P1 ^ P2) v (~P1 ^ False)

= (P1 ^ P2) v False (since False ^ P == False)

= (P1 ^ P2) (since False v P == P)

And there you have it, the computer figured it out correctly. Yowza