CS 251: Assignment #4

Integration and Analysis

Due before spring break: 17 March 2012

The goal of this lab is to bring together all of the elements--data, GUI, and viewing--into a single application and then extend it to start doing real data analysis.

You have two weeks to complete this project. You should start the integration process in the first week, enabling selection and viewing of the data, then focus on the analysis in the second week.


Tasks

The result of this project should be an application that can read in a data set, enable the user to interactively view it in up to 5 user-selected dimensions (3 spatial, color, and size), and execute a PCA analysis and then visualize the data in selected dimensions of the eigenspace.

  1. Give yourself a new working directory and copy your display, viewing, and data python files into it. It's best to start with copies and modify them from there. Integration often involves re-writing parts of your code to make the process cleaner.
  2. Enable the user to use the GUI to load a CSV data file in the standard format. Use your data class to read and store the data. Your GUI should not immediately plot the data.
  3. Enable the user to select between 2 and 5 columns for plotting. It's up to you how strictly you handle this. You can require the user to select three spatial dimensions before allowing them to use color or size. Alternatively, you can be flexible and let the user pick 2 or 3 spatial dimensions and then either color or size as additional dimensions.

    One possible approach to handling this is to have a menu option 'Plot' that brings up a dialog which gives the user a set of popup menus for selecting each axis/color/size. You could also place popup menus on the right side of the main screen with a button "plot" that uses the current menu selections to generate a plot.

    Here is a reasonable dialog window tutorial.
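    Whichever interface you choose, the selection rules described above can be checked before any plotting happens. The following is a minimal sketch, not a required design: the function name validate_selection, its arguments, and the strict flag are all hypothetical names introduced for illustration.

```python
def validate_selection(spatial, color=None, size=None, strict=False):
    """Return (ok, message) for a proposed plot selection.

    spatial: list of column headers chosen for the x, y, and optional z axes.
    color, size: optional column headers (None means "not used").
    strict: if True, require three spatial columns before color or size.
    All names here are illustrative assumptions, not part of the assignment.
    """
    if not 2 <= len(spatial) <= 3:
        return False, "pick 2 or 3 spatial columns"
    uses_extra = color is not None or size is not None
    if strict and len(spatial) < 3 and uses_extra:
        return False, "pick 3 spatial columns before using color or size"
    # 2-3 spatial columns plus optional color and size gives 2 to 5 dimensions.
    return True, "ok"
```

    A "plot" button callback could call this helper with the current menu selections and show the message in a dialog when the first returned value is False.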

    To implement plotting, create a buildPoints function and an updatePoints function in your display class that are analogous to your buildAxes and updateAxes functions. The buildPoints function should delete any existing canvas objects representing data and create a new set for the current plot. You probably want to follow these steps in your buildPoints method, which may take in a list of the selected headers. Either before calling it or within buildPoints you will want to reset the view.

    • Delete any existing canvas objects for plotting data.
    • Use the select method from your Data object to get the spatial columns to plot. If you are selecting only 2 columns to plot, add a column of 0's (z-value) and a column of 1's (homogeneous coordinate) to the data. If you are selecting 3 columns to plot, add a column of 1's (homogeneous coordinate). Each data point is now represented as a 4-column row in the spatial data matrix.
    • At this point you have to make a decision about how to normalize the data. A simple approach is to use the max and min value of each data column to normalize it to the range [0, 1].
    • Use the select method from your Data object to get the color and size columns to plot, if any. You may want to store these in separate fields. You may also want to normalize both of these so the values are in the range [0, 1].
    • Calculate the VTM from the current view object.
    • Transform the data using the VTM.

      pts = (vtm * data.T).T

    • Create the canvas graphics objects, ovals/squares/crosses/points, for each data point. Use the color and size data, if any, to adjust the visual attributes.
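    The numerical part of the steps above (padding, normalization, and the VTM transform) can be sketched as follows. This is one possible arrangement, assuming the selected columns arrive as an N x 2 or N x 3 array and that plain NumPy arrays with the @ operator are used in place of the matrix product shown above. The function names build_point_matrix and transform_points are hypothetical. The sketch normalizes before adding the constant columns so that the 0's and 1's are not rescaled.

```python
import numpy as np

def build_point_matrix(cols):
    """Turn N x 2 or N x 3 spatial data into N x 4 homogeneous points."""
    cols = np.asarray(cols, dtype=float)
    n, d = cols.shape
    if d not in (2, 3):
        raise ValueError("expected 2 or 3 spatial columns")
    # Normalize each column to [0, 1] using its own min and max.
    lo = cols.min(axis=0)
    hi = cols.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    norm = (cols - lo) / span
    if d == 2:
        norm = np.hstack([norm, np.zeros((n, 1))])  # z-value of 0
    return np.hstack([norm, np.ones((n, 1))])       # homogeneous coordinate of 1

def transform_points(vtm, pts):
    """Apply a 4 x 4 view transformation matrix to N x 4 row points."""
    return (vtm @ pts.T).T

# Example with a placeholder VTM (identity plus a translation in x).
pts = build_point_matrix([[0, 10], [5, 20], [10, 30]])
vtm = np.eye(4)
vtm[0, 3] = 2.0
screen = transform_points(vtm, pts)
```

    In a real display class the VTM would come from the current view object rather than being built by hand as in this example.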

    The updatePoints method should modify the coordinates of the existing canvas objects using the current view. You should include a call to updatePoints wherever you have a call to updateAxes.
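    Both buildPoints and updatePoints need to turn a transformed point, plus optional size and color values, into canvas arguments. The helpers below are a small sketch of that mapping. The pixel radius range and the blue-to-yellow color ramp are arbitrary assumptions, and the attribute names self.canvas and self.objects in the comments are hypothetical.

```python
def oval_bbox(x, y, radius):
    """Bounding box (x0, y0, x1, y1) of a circle centered at (x, y)."""
    return (x - radius, y - radius, x + radius, y + radius)

def size_to_radius(s, rmin=2.0, rmax=8.0):
    """Map a normalized size value in [0, 1] to a pixel radius."""
    return rmin + s * (rmax - rmin)

def value_to_color(c):
    """Map a normalized value in [0, 1] to a Tk color string.

    0 maps to blue and 1 maps to yellow; values outside [0, 1] are clamped.
    """
    c = min(max(float(c), 0.0), 1.0)
    level = int(round(255 * c))
    return "#%02x%02x%02x" % (level, level, 255 - level)

# Inside buildPoints, each point might be created with something like:
#   item = self.canvas.create_oval(*oval_bbox(x, y, r), fill=value_to_color(c), outline="")
# Inside updatePoints, the same item can be moved with:
#   self.canvas.coords(item, *oval_bbox(x, y, r))
```

    Keeping the mapping in small functions means buildPoints and updatePoints compute identical coordinates, so points do not jump when the view changes.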

    Make sure your program can read in test data case 1 and display 2 or 3 spatial dimensions. Then try test data case 2, which has 5 dimensions, the first two of which are correlated. Capture some screen shots for your writeup.

  4. The second part of this assignment is to enable the user to execute a PCA analysis on the data and visualize a result. To support this, create an Analysis class. The Analysis class needs the following three methods. You can make others as necessary.

    __init__: the init method needs to create at least the following fields, initially setting them all to None.

    Field name   Meaning
    name         A user-supplied name for the analysis
    data         A numpy matrix to hold a copy of the data used for the analysis
    pcadata      A numpy matrix to hold the data projected onto the eigenvectors
    min          The minimum value for each data column
    max          The maximum value for each data column
    mean         The mean value for each data column
    eval         The eigenvalues
    evec         The eigenvectors

    setup: The setup method needs to normalize the data by using x' = (x - min) / (max - min). Then it needs to calculate the mean value for each column. Store the min, max, and mean values in the appropriate fields. To calculate the eigenvalues and eigenvectors, first compute the covariance matrix of the normalized data. If your data is stored in rows, then you want to call the numpy covariance function with the following arguments.

    mcov = np.cov( self.data, rowvar = False )

    Then calculate the eigenvalues and eigenvectors of the covariance matrix, sort the eigenvectors from largest to smallest eigenvalue, and store the eigenvectors as columns in a numpy matrix. Store the eigenvalues and eigenvectors in their appropriate fields in the Analysis object.

    Finally, project the data onto the eigenvectors by first subtracting the mean from each row of the data and then transforming the data by the eigenvectors.
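    The computation described for setup can be sketched as a standalone function. This is an illustration under stated assumptions rather than the required implementation: the name pca_setup and its normalize flag are hypothetical, the mean is taken over the normalized data, and np.linalg.eigh is used because a covariance matrix is symmetric (np.linalg.eig would also work). In your Analysis class the returned values would be stored in the corresponding fields.

```python
import numpy as np

def pca_setup(data, normalize=True):
    """Return (min, max, mean, evals, evecs, pcadata) for row-oriented data."""
    data = np.asarray(data, dtype=float)
    dmin = data.min(axis=0)
    dmax = data.max(axis=0)
    if normalize:
        # x' = (x - min) / (max - min), guarding against constant columns.
        span = np.where(dmax > dmin, dmax - dmin, 1.0)
        x = (data - dmin) / span
    else:
        x = data.copy()
    mean = x.mean(axis=0)
    # Each row is a data point, so the columns are the variables.
    mcov = np.cov(x, rowvar=False)
    # eigh returns eigenvalues in ascending order with eigenvectors as columns.
    evals, evecs = np.linalg.eigh(mcov)
    order = np.argsort(evals)[::-1]  # largest eigenvalue first
    evals = evals[order]
    evecs = evecs[:, order]
    # Project the mean-subtracted data onto the eigenvectors.
    pcadata = (x - mean) @ evecs
    return dmin, dmax, mean, evals, evecs, pcadata

# Two perfectly correlated columns: all variance lies along one eigenvector.
dmin, dmax, mean, evals, evecs, pcadata = pca_setup([[0, 0], [1, 2], [2, 4], [3, 6]])
```

    One way to check a result is that the covariance matrix of the projected data is diagonal, with the eigenvalues on the diagonal.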

    select: Similar to the select function in your data class, this function should take in a list of column indices and return the selected data columns.

    To test your Analysis class you can use this test program which should produce this output. Note that my results have normalization turned off, since the test variables all have the same numerical significance/meaning.

  5. Enable the user to execute a PCA analysis on the currently open data set. Your GUI does not have to support letting the user pick a subset of the data for PCA, but it is an extension.
  6. Enable the user to pick 2 or 3 columns from the projected data and plot those spatially.
  7. Test your system on the Australia Coast data. Execute the PCA analysis and then use the first three coordinates in eigenspace to plot it. Capture some screen shots.

Extensions


Writeup

Write a brief user manual, with screen shots, for your application. Include any extensions or enhancements you implemented.

Include the screen shots for the provided data sets.

Handin

Once you have written up your assignment, give the page the label:

cs251s12project4

Put your code in the COMP/CS251 folder on fileserver1/Academics. Please make sure you are organizing your code by project.