CS 251: Project #2

Data Management

The purpose of this week's lab is to expand your Data class and start building the API needed by your application for analysis and visualization. In addition, you'll start working with numpy, scipy, and matrices.


Data Format

To make reading the data straightforward, we're going to use a single general format for all data files. The data should have the following properties.

Tasks

  1. Modify your data file from project 1 into the format above so you have a data set to use for testing.
  2. Create a python file Data.py (my names are suggestions only). It will contain a class for managing and handling data. The constructor for the class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above.
  3. Create a method for reading the data from a file. The method should put the original data in string format into a list of lists, with one sublist for each data point. In addition, the method should store the headers and types read from the data file.

    Internally, you will also want to extract the numeric data from the raw data and create a numpy matrix to hold it. As the numeric, enumerated, and string data may be mixed, you will want to create a dictionary or list that links the raw data indices to the numeric data indices. Column 3 in the original data, for example, may be in column 1 of the numeric data. You may also want to create a dictionary that links the headers to their corresponding raw data and numeric data columns. All I/O with the Data class should use the raw data column numbers or the headers themselves to access the data.

    Enumerated types can be converted into numeric data. Using a dictionary, you can parse through the raw data, using the raw strings as dictionary keys. Give the first key the value 0, give the second unique key the value 1, and so on, incrementing the counter with each novel key. Create the numeric version of the enumerated type by going through the column and replacing the enumerated value key with its index. Keep the conversion dictionary in your Data class, because you may want to let the user choose from the set of keys.

    Date types can also be converted into numeric data. You can use functions in the time module to convert dates into useful representations.

  4. Create a method that nicely prints out the data to the command line. Test your methods with the simple examples below and with your data from project 1, once you have converted it to a CSV file and inserted the necessary meta-data.
  5. Create at least the following useful methods.
    • header - takes in an optional column id and returns the header as a string. With no argument, returns a list of all of the headers.
    • header_num - returns a list of the numeric headers.
    • type - takes in an optional column id and returns the type as a string. With no argument, returns a list of all of the types.
    • value - takes in a row and column and returns the data value. If the value is numeric, it should return the numeric version of it.
    • point - takes in a row index and returns the data vector in its raw (string) form.
    • point_num - takes in a row index and returns the numeric values only as floats.
    • dim - returns the number of variables (columns) in each data point.
    • dim_num - returns the number of numeric variables in each data point.
    • size - returns the number of data points (rows).
    • range_num - returns a list of 2-element lists with the minimum and maximum values for each numeric column. With an optional index, returns the range for a single column. This function needs to work only for numeric data. You are free to provide ranges for other data as well using an appropriate comparison metric.
    • mean_num - returns a list of the mean values for each numeric column. With an optional index, it returns the mean for just the specified column. Use the built-in numpy functions to compute this.
    • stdev_num - returns a list of the standard deviations for each numeric column. With an optional index, it returns the standard deviation for just the specified column. Use the built-in numpy functions to compute this.
  6. Create a method select that returns a numpy matrix with just the selected columns. The function should take in a list of headers or a list of column indices and return the matrix with those columns. You probably want to limit the number of possible indices to between one and five. It's up to you how the function will handle returning data of mixed types. You can fairly easily convert enumerated types to integers, but strings are more challenging. Numpy does not permit math on mixed format matrices.
  7. Separate from your Data class file, create a main python program that takes three arguments from the command line: a filename, the x-axis header, and the y-axis header. When executed, the program should generate a matplotlib (pylab) plot with the two specified variables as the X and Y axes. Run this on your custom data set.
  8. Download the BirdArrivals.csv data set from the Academics server. Make sure your program can read and store the data properly, as it includes all four types of data. Write a main program that takes in a bird name as the command line argument and generates a histogram (using matplotlib) of all of the arrival dates for all years for the selected bird. [You may want to pre-extract all of the bird names from the data, store them in your main program, and print out the list if the user runs the program with no argument.]
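
As a rough illustration of tasks 3, 5, and 6, here is a minimal sketch of one way to build the numeric matrix, the raw-to-numeric column map, and the enumerated-type conversion dictionaries. The type labels "numeric", "enum", and "string", the sample rows, and the free-standing helpers build_numeric and select are assumptions made up for this example; in your Data class these would be methods and fields, and the type names should match whatever your data file actually uses. The sketch stores the numbers in a plain 2-D numpy array rather than a numpy.matrix object.

```python
import numpy as np

# Hypothetical sample data; in the real class these come from the data file.
headers = ["name", "height", "color", "weight"]
types = ["string", "numeric", "enum", "numeric"]  # assumed type labels
raw_data = [
    ["ann", "1.5", "red", "10"],
    ["bob", "2.5", "blue", "20"],
    ["cat", "3.5", "red", "30"],
]


def build_numeric(headers, types, raw_data):
    """Return (matrix, raw2num, header2raw, enum_maps) for numeric and enum columns."""
    raw2num = {}    # raw column index -> numeric column index
    enum_maps = {}  # raw column index -> {raw string: integer code}
    for raw_col, col_type in enumerate(types):
        if col_type in ("numeric", "enum"):
            raw2num[raw_col] = len(raw2num)
        if col_type == "enum":
            enum_maps[raw_col] = {}

    matrix = np.zeros((len(raw_data), len(raw2num)))
    for r, row in enumerate(raw_data):
        for raw_col, num_col in raw2num.items():
            if raw_col in enum_maps:
                codes = enum_maps[raw_col]
                # A novel key gets the next unused integer; a known key keeps its code.
                matrix[r, num_col] = codes.setdefault(row[raw_col], len(codes))
            else:
                matrix[r, num_col] = float(row[raw_col])

    header2raw = {h: i for i, h in enumerate(headers)}
    return matrix, raw2num, header2raw, enum_maps


def select(columns, matrix, raw2num, header2raw):
    """Return the columns named by headers or by raw column indices."""
    raw_cols = [header2raw[c] if isinstance(c, str) else c for c in columns]
    return matrix[:, [raw2num[c] for c in raw_cols]]


matrix, raw2num, header2raw, enum_maps = build_numeric(headers, types, raw_data)
picked = select(["height", "color"], matrix, raw2num, header2raw)

# Per-column statistics using the built-in numpy reductions.
ranges = [[lo, hi] for lo, hi in zip(matrix.min(axis=0), matrix.max(axis=0))]
means = matrix.mean(axis=0)
stdevs = matrix.std(axis=0)  # population stdev; pass ddof=1 for the sample version
```

Note that callers only ever pass headers or raw column numbers; the translation to numeric column numbers stays inside the helpers, which is the point of keeping the raw2num map. Asking select for a string column raises a KeyError in this sketch, which is one of several reasonable ways to handle mixed types.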

Extensions


Writeup


Handin

Once you have written up your assignment, give the page the label:

cs251s12project2

Put your code in your private subdirectory in the COMP/CS251 folder on fileserver1/Academics.