What Is a Decision Tree in Weka?

In this section, you’ll use the Weka data-mining tool to work through some training data on the optimum sales of Lady Gaga’s CDs depending on specific factors within the store. I explain the factors in question as you walk through that data.

Requirement: The requirement is to create a model that can predict a customer sale of Lady Gaga CDs depending on the CDs’ placement within the store. The record store has given you some data about where the product was placed, whether it was at eye level or not, and whether the customer actually purchased the CD or put it back on the shelf.

The client wants to be able to run other sets of data through the model to determine how sales of a product will fare. Working through this methodically, you need to do the following:

1. Run through the training data supplied and turn it into a definition file for Weka.
2. Use the Weka workbench to build the decision tree for you and plot an output graph.
3. Export some generated Java code with the new decision tree classifier.
4. Test the code against some test data.
5. Think about future iterations of the classifier.

It feels like there’s a lot to do, but after you get into the routine, it’s quite simple with the tools at hand. First, look at the training data.
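As a preview of steps 2 and 3, the workbench can also be driven programmatically. The following is a minimal sketch using Weka’s Java API; the filename `ladygaga.arff` and class `BuildTree` are assumptions for illustration, and the Weka library must be on the classpath:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        // Load the ARFF definition file created from the training data.
        Instances data = DataSource.read("ladygaga.arff");

        // The class attribute (Customer Purchase) is the last column.
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's implementation of the C4.5 decision tree learner.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Print the tree in text form; the Explorer can plot it as a graph.
        System.out.println(tree);
    }
}
```

The same model can be built interactively in the Explorer; the API route becomes useful once you start testing the classifier against new data in code.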

Training Data:
Before anything else happens, you need some training data. The client has given you some in a .csv file, but it would be nice to formalize it into a definition file that Weka understands.

The @relation tag is the name of the dataset you are using. In this instance it’s Lady Gaga’s CDs, so I’ve called it lady gaga.

Next, you have the attributes that are used within your data model. The five attributes in this set correspond to the top line of the raw CSV data that you received from the client.

1. Placement: What type of stand is the CD displayed on: an end rack, a special-offer bucket, or a standard rack?
2. Prominence: What percentage of the CDs on display are Lady Gaga CDs?
3. Pricing: What percentage of the full price was the CD at the time of purchase? Very rarely is a CD sold at full price, unless it is an old, back-catalogue title.
4. Eye Level: Was the product displayed at eye level? The majority of sales happen when a product is displayed at eye level.
5. Customer Purchase: What was the outcome? Did the customer purchase the CD?

The Prominence and Pricing attributes are both numeric values. The other three take fixed nominal values, which must be declared before the algorithm is run. Placement has three: end_rack, cd_spec, or std_rack. The Eye Level attribute is either true or false, and the Customer Purchase attribute has two nominal values, yes or no, to show whether the customer bought the product.
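Putting the tags together, the top of the definition file might look like the following sketch; the exact attribute spellings are assumptions based on the descriptions above:

```
@relation ladygaga

@attribute placement {end_rack, cd_spec, std_rack}
@attribute prominence numeric
@attribute pricing numeric
@attribute eyelevel {true, false}
@attribute purchased {yes, no}
```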

Data: Finally, you have the data. It’s comma-separated in the order of the attributes (Placement, Prominence, Pricing, Eye Level, and Customer Purchase).
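The @data section lists one instance per line in that same attribute order. The rows below are invented purely for illustration; your client’s file will contain the real observations:

```
@data
end_rack,85,85,true,yes
std_rack,10,95,false,no
cd_spec,40,75,true,yes
std_rack,25,100,false,no
```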

In this sample, you already know the outcomes (whether or not a customer purchased); this model is about using classification to get your predictions in tune for new data coming in.
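To give a flavor of step 3, the following is a hypothetical sketch of the kind of Java source a trained decision tree boils down to: a chain of if/else tests on the attribute values. The class name, method signature, and every threshold here are invented for illustration; the real rules come from the tree that Weka builds from your training data.

```java
public class GagaClassifier {
    // Predicts "yes" or "no" for a customer purchase, mirroring the
    // shape of a generated decision-tree classifier. All rule values
    // below are illustrative assumptions, not a trained model.
    public static String classify(String placement, double prominence,
                                  double pricing, boolean eyeLevel) {
        if ("end_rack".equals(placement)) {
            return "yes";                     // end racks convert well
        }
        if (eyeLevel && prominence > 50.0) {
            return "yes";                     // visible and well stocked
        }
        if (pricing < 75.0) {
            return "yes";                     // deep discount tempts buyers
        }
        return "no";
    }

    public static void main(String[] args) {
        // A standard rack, 80% prominence, 90% of full price, at eye level.
        System.out.println(classify("std_rack", 80.0, 90.0, true));
    }
}
```

Code in this shape is easy to drop into the client’s systems and run against fresh data, which is exactly what steps 4 and 5 are about.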