CS 251: Assignment #6


Due Wednesday 10 April 2013

The goal of this week's lab is to add basic clustering analysis to your program.


Overall, you will need to add a clustering function to your Analysis class, create a new ClusterData child class of Data, and add elements to the GUI to enable the user to execute and view clusters.

  1. Create a new child class of the Data class called ClusterData. The ClusterData class will hold a column of cluster IDs in its data matrix. It will also need to hold the set of cluster means, the list of DataColIDs from which the clustering was created, the number of clusters, the clustering method, and the distance metric. To start, the clustering method should be K-means, and the distance metric should be Euclidean.

    Consider what information should be passed into the init method to create a ClusterData object and what information might be set later by mutator methods.

    As noted below, possible extensions include alternative clustering methods (e.g. on-line clustering, hierarchical clustering, or fuzzy C-means) and alternative distance metrics (e.g. normalized Euclidean distance, Mahalanobis distance, absolute Euclidean).

  2. Analysis: create a function in your Analysis file/class called cluster. The arguments to the cluster function should be a list of DataColIDs and the number of clusters. You may also want to add other options, depending upon how complex you make your clustering function. To start, assume that all of the data columns are simple numeric columns and that you will use simple Euclidean distance as the metric.

    You may use the scipy clustering routines to implement this function.

    The cluster function should return a ClusterData object, fully populated with the necessary information.

  3. Add the capability to cluster a data set to your GUI. Ideally, the user should be able to pick a set of columns from a data set and then tell the program to cluster them. By default, you can limit clustering to a set of columns from the same data set.
  4. Add the capability to plot using cluster ID to select the color of points. Ideally, the user should be able to select the x/y/z/size axes from a data set and then select which clustering to use for color. By default, you can limit cluster analysis and visualization to a single data set. When a data set is made active, only clusterings on that data set should be visible.
  5. Test your clustering on the file clusterdata.csv. The data should contain two well-separated clusters; a four-cluster solution should also be stable. Include a screenshot of this in your writeup.
  6. Test your clustering on the file AustraliaCoast.csv. Use 10 clusters and visualize the result using three variables, such as premax, maxairtemp, and maxsoilmoist. If you want to visualize clusters based on long/lat, then you will need to modify the type of those columns to be numeric. Include a screen shot of this in your writeup.
  7. Execute a clustering on a data file of your choice, trying to use the analysis to demonstrate something about the data set.
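
As a rough illustration of steps 1 and 2, here is a minimal sketch of one way a ClusterData class and a cluster function might fit together using the scipy clustering routines. The Data stand-in, its get_columns method, the headers attribute, and all attribute names on ClusterData are assumptions made for this example only; your own Data class will differ, and this is not the required design.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq


class Data:
    """Hypothetical, minimal stand-in for the course's Data class."""

    def __init__(self, matrix, headers):
        self.matrix = np.asarray(matrix, dtype=float)
        self.headers = list(headers)

    def get_columns(self, col_ids):
        # Assumed helper: return the selected columns as an N x C array.
        idx = [self.headers.index(c) for c in col_ids]
        return self.matrix[:, idx]


class ClusterData(Data):
    """Holds one column of cluster IDs plus the clustering's metadata."""

    def __init__(self, ids, means, col_ids,
                 method="kmeans", metric="euclidean"):
        # The data matrix is a single column of cluster IDs.
        id_column = np.asarray(ids, dtype=float).reshape(-1, 1)
        Data.__init__(self, id_column, ["cluster_id"])
        self.means = means            # one row per cluster mean
        self.col_ids = list(col_ids)  # columns the clustering was built from
        # scipy's kmeans can drop empty clusters, so count the means
        # actually returned rather than trusting the requested k.
        self.k = means.shape[0]
        self.method = method
        self.metric = metric


def cluster(data, col_ids, k):
    """K-means with plain Euclidean distance on the chosen columns."""
    points = data.get_columns(col_ids)
    means, _distortion = kmeans(points, k)  # estimate the cluster means
    ids, _dists = vq(points, means)         # assign each point to a mean
    return ClusterData(ids, means, col_ids)
```

In this sketch the IDs, means, and source columns are passed to the init method because they are all known as soon as the clustering finishes, while method and metric are given defaults that could later be changed by mutator methods.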



For this week's writeup, create a wiki page that shows your clustering visualizations, describes how your GUI works, and explains any extensions.


Once you have written up your assignment, give the page the label:


Put your code on the handin server in your private subdirectory.