Due Wednesday, 4 April 2012
The goal of this week's lab is to add the capability to cluster data to your Data class.
Read through all of the tasks and plan your design before you start writing code.
Create a Cluster class. The Cluster class should have the following fields.
- name - a unique identifier for the clustering
- headers - a list of the headers used to create the clustering
- k - the number of clusters
- means - the set of cluster means
The Cluster class should have the following methods.
- __init__ -- creates the fields and sets them to default values.
- kmeans -- takes in a name, a list of headers, the number of clusters, and a numpy matrix with each data point as a row, and executes k-means clustering. The name, headers, and k should be copied into the Cluster object, and the means (code book) should be generated by the clustering algorithm.
For this task you can either write your own clustering algorithm or make use of the scipy clustering routines. If you use the scipy routines, then you will want to generate the k cluster means using the kmeans function and then use the vq function to classify each data point into one of the clusters.
- classify -- takes in a numpy matrix with each data point as a row and returns a 1-column matrix with the cluster ID of the closest cluster mean.
Create any other necessary methods.
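As a rough illustration of the scipy route described above, here is a minimal sketch of a Cluster class. It assumes scipy.cluster.vq is installed and converts the incoming matrix to a plain float array; the layout is only one possible design, not a required one. It does no normalization of its own, so it expects the caller to pass in data that is already normalized.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq


class Cluster:
    def __init__(self):
        # Default values; kmeans() fills these in.
        self.name = None
        self.headers = []
        self.k = 0
        self.means = None

    def kmeans(self, name, headers, k, data):
        """Run k-means on data (one data point per row) and store the means."""
        self.name = name
        self.headers = list(headers)
        self.k = k
        obs = np.asarray(data, dtype=float)
        # scipy's kmeans returns (code book, distortion); keep the code book.
        self.means, _ = kmeans(obs, k)
        return self.means

    def classify(self, data):
        """Return an N x 1 array holding the index of the closest mean."""
        obs = np.asarray(data, dtype=float)
        # vq returns (index of nearest code, distance to that code).
        ids, _ = vq(obs, self.means)
        return np.asarray(ids).reshape(-1, 1)
```

One thing to watch for: scipy's kmeans may return fewer than k means if a cluster ends up with no members, so it is worth checking the number of rows in the code book after clustering.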
- Add a method to your Data class that takes in a header string, a type string, and a column matrix of data (with the same number of rows as the data already in the Data class) and adds the column to the data set.
- Add a field to your Data class that can hold a list of Cluster objects.
- Add a method cluster that creates a Cluster object, uses it to cluster the numeric data (default action), then uses the classify method and adds the resulting new column of cluster IDs to the data.
You can make the cluster IDs either an enum type or a numeric type.
You will need to do some type of normalization prior to clustering. Make sure you use the same data for clustering and classification.
The cluster method should add the Cluster object it creates to the Data object's list of Cluster objects.
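The Data class changes above could be sketched as follows. Since your own Data class will look different, the class here is a hypothetical stand-in with made-up field names (headers, types, matrix, clusters), and normalize_columns is just one possible normalization (scaling each column to the range 0 to 1). Treat it as an outline under those assumptions.

```python
import numpy as np


def normalize_columns(data):
    """Scale each column to [0, 1]; constant columns map to 0."""
    d = np.asarray(data, dtype=float)
    mins = d.min(axis=0)
    ranges = d.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # avoid dividing by zero on constant columns
    return (d - mins) / ranges


class Data:  # hypothetical stand-in for your own Data class
    def __init__(self, headers, types, matrix):
        self.headers = list(headers)
        self.types = list(types)
        self.matrix = np.asarray(matrix, dtype=float)
        self.clusters = []  # list of Cluster objects, one per clustering

    def add_column(self, header, type_string, column):
        """Append a column that has one value per existing row."""
        column = np.asarray(column, dtype=float).reshape(-1, 1)
        if column.shape[0] != self.matrix.shape[0]:
            raise ValueError("column must have one value per data row")
        self.headers.append(header)
        self.types.append(type_string)
        self.matrix = np.hstack((self.matrix, column))
```

A cluster method built on this would compute the normalized matrix once, pass that same matrix to both the clustering and the classification step, call add_column with the returned IDs, and append the Cluster object to the clusters list.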
Add a menu item to your program that enables the user to execute a
clustering on the currently open data set. The default action should
be to use all of the numeric variables in the data set for clustering.
Selecting the menu item should cause your program to call the cluster method of your data set. You may want to provide some kind of feedback when the process is complete.
- After selecting the cluster menu option, your program should be able to plot up to three spatial variables and then use the cluster column to color the visualization. How you select the colors is a design decision.
- Use this simple data set to debug your code. It should have two natural clusters, with four also being stable.
- Use the Australia Coast data set to cluster with k = 10 and then visualize the result using three of the variables. Try the variables premax, maxairtemp, and maxsoilmoist and take a screen shot.
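Since the choice of colors is left open, here is one possible approach, sketched with only the standard library: spread k hues evenly around the color wheel and index the resulting list by cluster ID. The function name cluster_colors and the saturation and brightness values are arbitrary choices for illustration.

```python
import colorsys


def cluster_colors(k):
    """Return k hex color strings with evenly spaced hues."""
    colors = []
    for i in range(k):
        # Hue runs from 0 up to (but not including) 1.
        r, g, b = colorsys.hsv_to_rgb(i / float(k), 0.8, 0.9)
        colors.append("#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255)))
    return colors
```

Strings in this "#rrggbb" form can be used directly as a fill color in Tkinter, for example colors[int(cluster_id)] for each data point.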
Come up with an acronym or name for your program. Be creative.
For example, Deluxe Integrated System for Clustering Operations [DISCO] was a web system we designed for data analysis a few years ago. We used a big rotating disco ball for the background of the main window when you logged in to the program.
The success of your program may, in the end, be completely determined by how cool your acronym is. Then again, its success may have something to do with the quality of your work. But it never hurts to have a cool name.
- Enable the user to select which columns to use in a clustering. The simplest approach is to create a dialog with a bunch of checkboxes, one for each numeric header.
- Add other features, like the ability to name a clustering.
- Enable the user to click on a data point in the visualization and have it generate a dialog box showing the complete feature vector for that data point. This feature is guaranteed to be used if you provide it. It's almost guaranteed to be requested if you don't.
- Demonstrate your system on your data from project 1.
- Write your own k-means algorithm instead of using the one from scipy.
- Write a different clustering algorithm, such as online clustering. You could also use the other scipy clustering functions for hierarchical clustering.
- Try out clustering on other data sets.
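If you take on the extension of writing your own k-means, the basic loop looks roughly like the sketch below. It assumes numpy only, starts from k randomly chosen data points, and uses squared Euclidean distance; the names simple_kmeans and nearest_mean are made up for this example, and the initialization and stopping test are simple choices rather than the only reasonable ones.

```python
import numpy as np


def nearest_mean(d, means):
    """Return, for each row of d, the index of the closest mean."""
    diffs = d[:, np.newaxis, :] - means[np.newaxis, :, :]  # N x k x features
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def simple_kmeans(data, k, max_iter=100, seed=None):
    """Alternate assignment and mean updates until the means stop moving."""
    d = np.asarray(data, dtype=float)
    rng = np.random.RandomState(seed)
    # Start from k distinct data points chosen at random.
    means = d[rng.choice(d.shape[0], k, replace=False)].copy()

    for _ in range(max_iter):
        ids = nearest_mean(d, means)
        new_means = means.copy()
        for j in range(k):
            members = d[ids == j]
            if len(members) > 0:  # leave an empty cluster's mean where it is
                new_means[j] = members.mean(axis=0)
        converged = np.allclose(new_means, means)
        means = new_means
        if converged:
            break

    return means, nearest_mean(d, means)
```

Because the result depends on the starting points, a single run can settle on a poor clustering; running it several times and keeping the result with the smallest total distance is a common remedy.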
Write a brief description of how you implemented the clustering algorithm and modified your Data and Application classes. Incorporate screen shots showing visualization of the clusters for both of the provided data sets and any others you analyze.
Once you have written up your assignment, give the page the label:
Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project. If you have any problems uploading the code, send the prof a zip file.