CS 251: Assignment #5

Due Wednesday, 4 April 2012

The goal of this week's lab is to add the capability to cluster data to your Data class.


Read through all of the tasks and plan your design before you start writing code.

  1. Create a Cluster class. The Cluster class should have the following fields.
    • name - a unique identifier for the clustering
    • headers - a list of the headers used to create the clustering
    • k - the number of clusters
    • means - the set of cluster means

    The Cluster class should have the following methods.

    • __init__ -- creates the fields and sets them to default values.
    • kmeans -- takes in a name, a list of headers, the number of clusters, and a numpy matrix with each data point as a row, then executes k-means clustering. The name, headers, and k should be copied into the Cluster object, and the means (the codebook) should be generated by the clustering algorithm.

      For this task you can either write your own clustering algorithm or make use of the scipy clustering routines. If you use the scipy routines, then you will want to generate the k cluster means using the kmeans function and then use the vq function to classify each data point into one of the clusters.

    • classify -- takes in a numpy matrix with each data point as a row and returns a 1-column matrix giving, for each data point, the ID of the closest cluster mean.

    Create any other methods you need.

  2. Add a method to your Data class that takes in a header string, a type string, and a column matrix of data (with the same number of rows as the data already in the Data class) and adds the column to the data set.
  3. Add a field to your Data class that can hold a list of Cluster objects.
  4. Add a method named cluster that creates a Cluster object, uses it to cluster the numeric data (the default action), then calls the Cluster object's classify method and adds the resulting column of cluster IDs to the data. You can make the cluster IDs either an enum type or a numeric type.

    You will need to do some type of normalization prior to clustering. Make sure you use the same data for clustering and classification.

    The cluster method should append the Cluster object it creates to the Data object's list of Cluster objects (the field added in task 3).

  5. Add a menu item to your program that enables the user to execute a clustering on the currently open data set. The default action should be to use all of the numeric variables in the data set for clustering.

    Selecting the menu item should cause your program to call the cluster method of your data set. You may want to provide some kind of feedback when the process is complete.

  6. After selecting the cluster menu option, your program should be able to plot up to three spatial variables and then use the cluster column to color the visualization. How you select the colors is a design decision.
  7. Use this simple data set to debug your code. It should have two natural clusters, with four also being stable.
  8. Use the Australia Coast data set to cluster with k = 10 and then visualize the result using three of the variables. Try the variables premax, maxairtemp, and maxsoilmoist and take a screen shot.
  9. Come up with an acronym or name for your program. Be creative.

    For example, Deluxe Integrated System for Clustering Operations [DISCO] was a web system we designed for data analysis a few years ago. We used a big rotating disco ball for the background of the main window when you logged in to the program.

    The success of your program may, in the end, be completely determined by how cool your acronym is. Then again, its success may have something to do with the quality of your work. But it never hurts to have a cool name.
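
As a starting point for task 1, here is a minimal sketch of a Cluster class that takes the scipy route, using kmeans and vq from scipy.cluster.vq. It is one possible layout, not the required design. The normalize_columns helper and its min-max scaling are assumptions made for illustration; the assignment leaves the choice of normalization up to you.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq


def normalize_columns(data):
    """Hypothetical helper: scale each column to the range [0, 1].

    Min-max scaling is only one possible normalization (an assumption here).
    """
    data = np.asarray(data, dtype=float)
    mins = data.min(axis=0)
    ranges = data.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # constant columns: avoid dividing by zero
    return (data - mins) / ranges


class Cluster:
    def __init__(self):
        # Create the fields and give them default values.
        self.name = None
        self.headers = []
        self.k = 0
        self.means = None

    def kmeans(self, name, headers, k, data):
        """Store name, headers, and k, then compute the cluster means."""
        self.name = name
        self.headers = list(headers)
        self.k = k
        # scipy's kmeans returns (codebook, distortion). The codebook can
        # have fewer than k rows if a cluster ends up with no members.
        self.means, _ = kmeans(np.asarray(data, dtype=float), k)
        return self.means

    def classify(self, data):
        """Return an N x 1 array holding the nearest-mean ID for each row."""
        ids, _ = vq(np.asarray(data, dtype=float), self.means)
        return np.asarray(ids).reshape(-1, 1)
```

To use it, normalize the data once and pass the same normalized matrix to both kmeans and classify, for example c = Cluster(); c.kmeans("demo", ["x", "y"], 2, norm); ids = c.classify(norm). The sketch returns a plain two-dimensional array; wrap it in numpy.matrix if the rest of your Data class expects matrices.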



Write a brief description of how you implemented the clustering algorithm and modified your Data and Application classes. Incorporate screen shots showing visualization of the clusters for both of the provided data sets and any others you analyze.


Once you have written up your assignment, give the page the label:


Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project. If you have any problems uploading the code, send the prof a zip file.