Due Wednesday 10 April 2013
The goal of this week's lab is to add basic clustering analysis to your program.
Overall, you will need to add a clustering function to your Analysis class, create a new ClusterData child class of Data, and add elements to the GUI to enable the user to execute and view clusters.
Create a new child class of the Data class called ClusterData. The
ClusterData class will hold a column with the cluster IDs in its data
matrix. It also needs to hold the set of cluster means, the list of
DataColIDs from which the clustering was created, the number of
clusters, the clustering method, and the distance metric.
To start, the clustering method should be K-means, and the distance
metric should be Euclidean.
Consider what information should be passed into the init method to create a ClusterData object and what information might be set later by mutator methods.
As noted below, possible extensions include alternative clustering methods (e.g. on-line clustering, hierarchical clustering, or fuzzy C-means) and alternative distance metrics (e.g. normalized Euclidean distance, Mahalanobis distance, or absolute Euclidean).
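One possible shape for the class is sketched below. Your own Data class is not shown in this handout, so the `Data` stand-in, its constructor arguments, and every attribute and method name here (`means`, `col_ids`, `set_metric`, `cluster_ids`) are assumptions for illustration only; adapt them to your existing code.

```python
import numpy as np


class Data:
    """Hypothetical stand-in for your own Data class (an assumption)."""

    def __init__(self, matrix, headers):
        self.matrix = np.asarray(matrix, dtype=float)
        self.headers = list(headers)


class ClusterData(Data):
    """Holds one clustering result: a cluster ID column plus its metadata."""

    def __init__(self, ids, means, col_ids, method="kmeans", metric="euclidean"):
        # Store the cluster IDs as a single-column data matrix.
        ids = np.asarray(ids, dtype=float).reshape(-1, 1)
        Data.__init__(self, ids, ["ClusterID"])
        self.means = np.asarray(means, dtype=float)  # one row per cluster
        self.col_ids = list(col_ids)                 # columns that were clustered
        self.num_clusters = self.means.shape[0]
        self.method = method
        self.metric = metric

    def set_metric(self, metric):
        # Example mutator: record a different distance metric.
        self.metric = metric

    def cluster_ids(self):
        # Return the ID column as a flat integer array.
        return self.matrix[:, 0].astype(int)
```

In this sketch the quantities that define the clustering are passed to the init method, while the method and metric have defaults that a mutator can change later.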
Analysis: create a function in your Analysis file/class called
cluster. The arguments to cluster should be a list of DataColIDs
and the number of clusters. You may also want to add other options,
depending upon how complex you make your clustering function. To
start, assume that all of the data columns are simple numeric columns
and that you will use simple Euclidean distance as the metric.
You may use the scipy clustering routines to implement this function.
The cluster function should return a ClusterData object, fully populated with the necessary information.
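As a rough sketch of the scipy route, the function below runs K-means with `scipy.cluster.vq`. It assumes the selected columns are already available as a 2-D numeric array and are addressed by integer column index; how DataColIDs map to columns depends on your own Data class. The `whiten_first` option is a hypothetical extra, and the sketch returns plain arrays where your version would wrap them in a ClusterData object.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten


def cluster(matrix, col_indices, k, whiten_first=False):
    """Run K-means on selected columns of a 2-D numeric array.

    Returns (means, ids, distances). Wrapping these in a ClusterData
    object is left to your own code.
    """
    obs = np.asarray(matrix, dtype=float)[:, col_indices]
    if whiten_first:
        # Scale each column to unit variance; the means are then
        # expressed in the scaled units, not the original ones.
        obs = whiten(obs)
    # kmeans returns the cluster means (the "codebook") and the distortion.
    means, _ = kmeans(obs, k)
    # vq assigns every point to its nearest mean by Euclidean distance.
    ids, distances = vq(obs, means)
    return means, ids, distances
```

Because `kmeans` starts from a random initial guess, repeated runs can number the clusters differently, so the GUI should not rely on a particular ID referring to a particular cluster across runs.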
- Add the capability to cluster a data set to your GUI. Ideally, the user should be able to pick a set of columns from a data set and then tell it to cluster. By default, you can limit clustering to a set of columns from the same data set.
- Add the capability to plot using cluster ID to select the color of points. Ideally, the user should be able to select the x/y/z/size axes from a data set and then select which clustering to use for color. By default, you can limit cluster analysis and visualization to a single data set. When a data set is made active, only clusterings on that data set should be visible.
- Test your clustering on the file clusterdata.csv. It should contain two well-separated clusters, with four clusters also being stable. Include a screenshot of this in your writeup.
- Test your clustering on the file AustraliaCoast.csv. Use 10 clusters and visualize the result using three variables, such as premax, maxairtemp, and maxsoilmoist. If you want to visualize clusters based on long/lat, then you will need to modify the type of those columns to be numeric. Include a screenshot of this in your writeup.
- Execute a clustering on a data file of your choice, trying to use the analysis to demonstrate something about the data set.
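For the color-by-cluster plotting task above, one simple approach is to give each cluster ID its own hue. The helper below is only a sketch: the name `cluster_colors` is hypothetical, and it assumes your drawing toolkit accepts `#rrggbb` color strings (Tkinter canvas items do, for example).

```python
import colorsys


def cluster_colors(ids, num_clusters):
    """Map each cluster ID to a hex color with evenly spaced hues."""
    palette = []
    for i in range(num_clusters):
        # Spread hues evenly around the color wheel at fixed saturation/value.
        r, g, b = colorsys.hsv_to_rgb(i / float(num_clusters), 0.8, 0.9)
        palette.append("#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255)))
    # One color string per data point, in the same order as ids.
    return [palette[int(i)] for i in ids]
```

Evenly spaced hues stay distinguishable for a handful of clusters; with 10 or more, neighboring colors become similar, so a hand-picked palette may read better.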
Extensions:
- Add features to the clustering, such as a choice of distance metric.
- Run PCA on the AustraliaCoast data set and then cluster using the first three PCA dimensions. Visualize the result in PCA space.
- Implement other clustering methods.
- Enable the user to view the cluster means. Give the cluster means names.
- Write your own clustering algorithm instead of using the one in scipy.
- Add a function classify to your Analysis file/class. The function should be able to take in a ClusterData object and a new set of DataColIDs with the same variables (but representing different data points) and classify the new data with the existing cluster means. The function should return a new ClusterData object.
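The classify extension above amounts to a nearest-mean assignment. A minimal sketch follows, assuming the existing cluster means and the new points are available as numeric arrays with the same columns in the same order; it returns raw arrays where your version would build a new ClusterData object, and it uses Euclidean distance only.

```python
import numpy as np
from scipy.cluster.vq import vq


def classify(means, new_points):
    """Assign each new point to its nearest existing cluster mean."""
    obs = np.asarray(new_points, dtype=float)
    code_book = np.asarray(means, dtype=float)
    # vq returns, for each row of obs, the index of the closest mean
    # and the Euclidean distance to that mean.
    ids, distances = vq(obs, code_book)
    return ids, distances
```

If the original clustering was run on scaled data, the new points need the same scaling before they are classified, otherwise the distances are not comparable.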
For this week's writeup, create a wiki page that shows your clustering visualizations, describes how your GUI works, and explains any extensions.
Once you have written up your assignment, give the page the label:
Put your code on the handin server in your private subdirectory.