CS 251: Project #2

Data Management

The purpose of this week's lab is to expand your Data class and start building the API needed by your application for analysis and visualization. In addition, you'll start working with numpy, scipy, and matrices.


Data Format

In order to make reading the data straightforward, we're going to use a general format for the data that simplifies the task. In general, the data should have the following properties.

Tasks

  1. Modify your data file from project 1 into the format above so you have a data set to use for testing.
  2. Create a python file Data.py (my names are suggestions only). It will contain a class for managing and handling data. The constructor for the class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above. You may also want your constructor to be able to take in a list of lists that represents a data set, but this is optional.
  3. Create a method for reading the data from a file. The method should put the original data in string format into a list of lists, with one sublist for each data point. In addition, the method should store the headers and types read from the data file.

    Internally, you will also want to extract the numeric data from the raw data and create a numpy matrix to hold it. Because the numeric, enumerated, and string data may be mixed, you will want to create a dictionary or list that links the raw data indices to the numeric data indices. Column 3 in the original data, for example, may be in column 1 of the numeric data. You may also want to create a dictionary that links the headers to their corresponding raw data and numeric data columns. All I/O with the Data class should use the raw data column numbers or the headers themselves to access the data.
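
The index mapping described above can be sketched as follows. This is a minimal illustration, not a required design: the sample headers, the type labels ("numeric", "string"), and the names raw2numeric and header2raw are all assumptions made for the example, and a plain 2-D numpy array stands in for the numeric matrix.

```python
import numpy as np

# Hypothetical raw data as it might look after reading a file:
# every value is still a string.
headers = ["id", "name", "height", "weight"]
types = ["numeric", "string", "numeric", "numeric"]
raw_data = [
    ["1", "ann", "1.6", "55"],
    ["2", "bob", "1.8", "80"],
]

# Link each numeric raw column index to its column in the numeric matrix.
raw2numeric = {}
for raw_col, col_type in enumerate(types):
    if col_type == "numeric":
        raw2numeric[raw_col] = len(raw2numeric)

# Link each header to its raw column index.
header2raw = {header: i for i, header in enumerate(headers)}

# Build the numeric matrix from the numeric columns only, in raw order.
numeric_cols = sorted(raw2numeric)
matrix = np.array([[float(row[c]) for c in numeric_cols] for row in raw_data])

# Look up a value by header: go header -> raw column -> numeric column.
weight_col = raw2numeric[header2raw["weight"]]
print(matrix[1, weight_col])
```

Callers only ever pass a header or a raw column number; the translation to a numeric column stays inside the class.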

  4. Create a method that nicely prints out the data to the command line. Test your methods with the simple examples below and with your data from project 1, once you have converted it to a CSV file and inserted the necessary metadata.
  5. Create at least the following useful methods.
    • header - takes in an optional column id and returns the header as a string. With no argument, returns a list of all of the headers.
    • header_num - returns a list of the numeric headers.
    • type - takes in an optional column id and returns the type as a string. With no argument, returns a list of all of the types.
    • value - takes in a row and column and returns the data value. If the value is numeric, it should return the numeric version of it.
    • point - takes in a row index and returns the data vector in its raw (string) form.
    • point_num - takes in a row index and returns the numeric values only as floats.
    • dim - returns the number of variables (columns) in each data point.
    • dim_num - returns the number of numeric variables in each data point.
    • size - returns the number of data points (rows).
  6. Create a method select that returns a numpy matrix with just the selected columns. The function should take in a list of headers or a list of column indices and return the matrix with those columns. You probably want to limit the number of possible indices to between one and five. It's up to you how the function will handle returning data of mixed types. You can fairly easily convert enumerated types to integers, but strings are more challenging. Numpy does not permit math on mixed format matrices.
  7. Separate from your Data class file, create a main python program that takes three arguments from the command line: a filename, the x-axis header, and the y-axis header. When executed, the program should generate a plot using matplotlib (pylab) using the two specified variables as the X and Y axes. Run this on your custom data set.
  8. Within your Data.py file, create a second class called DataColID. A DataColID object should contain a reference to a Data object and then a header and/or column index. In other words, if you have multiple data sets open, a DataColID will give you enough information to access a particular column of data. Create appropriate accessors/mutators for the class.
  9. In addition to the Data class, you will also be creating and extending an Analysis class throughout the semester. Create an Analysis.py file. All Analysis functions will take lists of DataColID objects to specify what data to analyze. Inside your Analysis file, create the following three functions.
    • range - Takes in a list of DataColID objects and returns a list of 2-element lists with the minimum and maximum values for each column. The function is required to work on all data types, but on non-numeric types you can compare the raw strings to obtain the first and last strings in the set.
    • mean - Takes in a list of DataColID objects and returns a list of the mean values for each column. Any non-numeric column should have an empty string in its corresponding location in the return list. Use the built-in numpy functions to execute this calculation.
    • stdev - Takes in a list of DataColID objects and returns a list of the standard deviation for each numeric column. Any non-numeric column should have an empty string in its corresponding location in the return list. Use the built-in numpy functions to execute this calculation.

    In your Analysis.py file you have a design choice to make. You can make all of the functions methods of an Analysis class, or you can make each function a standalone function. Make the choice that makes the most sense to you.
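
As one possible shape for the standalone-function option, here is a minimal sketch of the three functions. It makes several assumptions: each column is represented by a hypothetical (type, values) pair standing in for whatever a DataColID lookup would return, the type label "numeric" is invented for the example, and the range function is named data_range here only to avoid shadowing Python's built-in range.

```python
import numpy as np

# Hypothetical stand-in for the data behind two DataColID objects:
# one (type, values) pair per column.
columns = [
    ("numeric", [1.0, 2.0, 3.0, 4.0]),
    ("string", ["b", "a", "d", "c"]),
]

def data_range(cols):
    """Return [min, max] per column; strings compare lexicographically."""
    return [[min(values), max(values)] for _, values in cols]

def mean(cols):
    """Return the mean of each numeric column, '' for the others."""
    return [float(np.mean(v)) if t == "numeric" else "" for t, v in cols]

def stdev(cols):
    """Return the standard deviation of each numeric column, '' for the others."""
    # np.std defaults to the population standard deviation (ddof=0);
    # pass ddof=1 if the sample standard deviation is wanted instead.
    return [float(np.std(v)) if t == "numeric" else "" for t, v in cols]

print(data_range(columns))
print(mean(columns))
print(stdev(columns))
```

The same function bodies would work as methods of an Analysis class; only the way they are called changes.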

  10. In your visualization GUI, enable the user to open multiple data sets. The user should be able to load a data set, select it as the active data set, and remove the data set from the program's memory. A suggested method of data set management is to use a list box to show all of the currently loaded data sets and then have buttons below it to load and remove data sets. The currently selected entry in the list box could be used to indicate the active data set.
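
Returning to the select method from task 6, the column selection itself can be sketched with numpy indexing. This is an illustration under simplifying assumptions: every column in the hypothetical sample data is numeric, so a header's position in the list doubles as its matrix column, and the function is written standalone rather than as a Data method.

```python
import numpy as np

# Hypothetical all-numeric data set.
headers = ["id", "height", "weight"]
matrix = np.array([[1.0, 1.6, 55.0],
                   [2.0, 1.8, 80.0]])

def select(matrix, headers, wanted):
    """Return the columns named or indexed in `wanted`, in that order."""
    # Accept either header strings or integer column indices.
    cols = [headers.index(w) if isinstance(w, str) else w for w in wanted]
    # Indexing with a list keeps the result two-dimensional.
    return matrix[:, cols]

subset = select(matrix, headers, ["weight", "id"])
print(subset)
```

In a full Data class the header lookup would go through the raw-to-numeric mapping instead of headers.index, and the limit of one to five columns could be enforced before indexing.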

Extensions


Writeup


Handin

Once you have written up your assignment, give the page the label:

cs251s13project2

Put your code in your private subdirectory in the COMP/CS251 folder on the Courses server.