The purpose of this week's lab is to expand your Data class and start building the API needed by your application for analysis and visualization. In addition, you'll start working with numpy, scipy, and matrices.
To make reading the data straightforward, we're going to use one general format for all data files. The data should have the following properties.
- The data should be in CSV format with commas separating different entries.
- The first row of the data should be the variable names. There must be a name for each column.
- The second row of the data should be the variable types: numeric, string, enum, or date. Numeric types can be either integers or floating point values; strings are arbitrary strings; enum implies there are a finite number of values, which can be strings or numbers; a date should be interpreted as a calendar date.
- Missing numeric data should be specified by the number -9999 written as an integer. The same number written with a decimal point (-9999.0) would imply an actual value.
- Any row that begins with a hash symbol should be ignored by the reader.
- Modify your data file from project 1 into the format above so you have a data set to use for testing.
- Create a python file Data.py (my names are suggestions only). It will contain a class for managing and handling data. The constructor for the class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above.
Create a method for reading the data from a file. The method should put the original data in string format into a list of lists, with one sublist for each data point. In addition, the method should store the headers and types read from the data file.
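As one possible starting point, here is a minimal sketch of such a reader. The function name `read_data` and the sample rows are made up for illustration, and the sketch assumes that fields contain no embedded newlines. In your Data class this logic would live in a method and store its results on the object.

```python
import csv
import io


def read_data(lines):
    """Parse lines in the lab's CSV format.

    Returns (headers, types, raw_data), where raw_data is a list of
    lists of strings with one sublist per data point.
    """
    # Drop blank rows and rows that begin with a hash symbol.
    kept = [ln for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
    # Let the csv module split the fields, then trim stray whitespace.
    rows = [[field.strip() for field in row] for row in csv.reader(kept)]
    headers = rows[0]                        # first row: variable names
    types = [t.lower() for t in rows[1]]     # second row: variable types
    raw_data = rows[2:]                      # everything else: data points
    return headers, types, raw_data


# Hypothetical sample file contents, not from the course materials.
sample = io.StringIO(
    "# made-up example file\n"
    "name,height,color,seen\n"
    "string,numeric,enum,date\n"
    "robin,12.5,red,2010-04-01\n"
    "wren,-9999,brown,2010-04-03\n"
)
headers, types, raw_data = read_data(sample)
print(headers)
print(types)
print(raw_data)
```

Because `read_data` accepts any iterable of lines, you can pass it an open file object as well as the in-memory sample used here.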
Internally, you will also want to extract the numeric data from the raw data and create a numpy matrix to hold it. As the numeric, enumerated, and string data may be mixed, you will want to create a dictionary or list that links the raw data indices to the numeric data indices. For example, column 3 in the original data may be column 1 of the numeric data. You may also want to create a dictionary that links the headers to their corresponding raw data and numeric data columns. All I/O with the Data class should use the raw data column numbers or the headers themselves to access the data.
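The index bookkeeping might look something like the sketch below. The names `raw2numeric` and `header2raw` and the sample values are illustrative only. The sketch handles just the columns typed numeric and stores them in a 2-D numpy array, which is one reasonable reading of "numpy matrix"; enum and date columns would join the matrix once they have been converted.

```python
import numpy as np

# Made-up headers, types, and raw string data for illustration.
headers = ["name", "height", "color", "weight"]
types = ["string", "numeric", "enum", "numeric"]
raw_data = [
    ["robin", "12.5", "red", "77"],
    ["wren", "9.0", "brown", "-9999"],
]

# Link each raw column index to its column in the numeric matrix.
raw2numeric = {}
for raw_col, col_type in enumerate(types):
    if col_type == "numeric":
        raw2numeric[raw_col] = len(raw2numeric)

# Link each header to its raw column so callers can use names.
header2raw = {name: i for i, name in enumerate(headers)}

# Build the numeric matrix, with columns in numeric-index order.
numeric_cols = sorted(raw2numeric, key=raw2numeric.get)
matrix = np.array([[float(row[c]) for c in numeric_cols]
                   for row in raw_data])

print(raw2numeric)  # {1: 0, 3: 1}
print(matrix.shape)  # (2, 2)
# Look up a value by header: header -> raw column -> numeric column.
print(matrix[0, raw2numeric[header2raw["weight"]]])  # 77.0
```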
Enumerated types can be converted into numeric data. Using a dictionary, you can parse through the raw data, using the raw strings as dictionary keys. Give the first key the value 0, give the second unique key the value 1, and so on, incrementing the counter with each novel key. Create the numeric version of the enumerated type by going through the column and replacing the enumerated value key with its index. Keep the conversion dictionary in your Data class, because you may want to let the user choose from the set of keys.
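A minimal sketch of the conversion described above follows. The function name `enum_to_numeric` and the sample column are made up for illustration.

```python
def enum_to_numeric(column):
    """Replace each enumerated value with the index of its first appearance.

    Returns (numeric, codes), where codes is the conversion dictionary.
    """
    codes = {}
    numeric = []
    for value in column:
        if value not in codes:
            # A novel key gets the next unused counter value.
            codes[value] = len(codes)
        numeric.append(codes[value])
    return numeric, codes


colors = ["red", "brown", "red", "blue", "brown"]
numeric, codes = enum_to_numeric(colors)
print(numeric)  # [0, 1, 0, 2, 1]
print(codes)    # {'red': 0, 'brown': 1, 'blue': 2}
```

Keeping `codes` around lets you list the available keys for the user and translate a chosen key back to its numeric value.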
Date types can also be converted into numeric data. You can use functions in the time module to convert dates into useful representations.
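For instance, `time.strptime` parses a date string into a structure whose fields include the year and the day of the year. The sketch below assumes dates are written as YYYY-MM-DD, which this handout does not specify; adjust the format string to match your data. The function name `date_to_numeric` is made up for illustration.

```python
import time


def date_to_numeric(text, fmt="%Y-%m-%d"):
    """Convert a date string into (year, day of year).

    The default format is an assumption; pass a different fmt
    if your file writes dates another way.
    """
    parsed = time.strptime(text, fmt)
    return parsed.tm_year, parsed.tm_yday


print(date_to_numeric("2010-04-01"))  # (2010, 91)
```

The day of the year is a convenient representation when you want to compare dates across different years.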
- Create a method that nicely prints out the data to the command line. Test your methods with the simple examples below, including your data from project 1, once you have converted it to a CSV file and inserted the necessary meta-data.
Create at least the following useful methods.
- header - takes in an optional column id and returns the header as a string. With no argument, returns a list of all of the headers.
- header_num - returns a list of the numeric headers.
- type - takes in an optional column id and returns the type as a string. With no argument, returns a list of all of the types.
- value - takes in a row and column and returns the data value. If the value is numeric, it should return the numeric version of it.
- point - takes in a row index and returns the data vector in its raw (string) form.
- point_num - takes in a row index and returns the numeric values only as floats.
- dim - returns the number of variables (columns) in each data point.
- dim_num - returns the number of numeric variables in each data point.
- size - returns the number of data points (rows).
- range_num - returns a list of 2-element lists with the minimum and maximum values for each numeric column. With an optional index, returns the range for a single column. This function needs to work only for numeric data. You are free to provide ranges for other data as well using an appropriate comparison metric.
- mean_num - returns a list of the mean values for each numeric column. With an optional index, it returns the mean for just the specified column. Use the built-in numpy functions to execute this.
- stdev_num - returns a list of the standard deviation for each numeric column. With an optional index, it returns the stdev for just the specified column. Use the built-in numpy functions to execute this.
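The three statistics methods can lean on numpy's column-wise reductions. The sketch below uses standalone functions and a made-up matrix for illustration; in your class these would be methods, and the optional index would be translated from a raw column id or header to a numeric column first. It does not filter out the -9999 missing-value marker, and the choice of `ddof=1` (sample standard deviation) is an assumption, since numpy's default is the population value.

```python
import numpy as np

# Made-up numeric matrix: three data points, two numeric columns.
matrix = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])


def range_num(matrix, col=None):
    """Return [min, max] per column, or for one column if col is given."""
    mins = matrix.min(axis=0)
    maxs = matrix.max(axis=0)
    ranges = [[float(lo), float(hi)] for lo, hi in zip(mins, maxs)]
    return ranges if col is None else ranges[col]


def mean_num(matrix, col=None):
    """Return the mean per column, or for one column if col is given."""
    means = matrix.mean(axis=0).tolist()
    return means if col is None else means[col]


def stdev_num(matrix, col=None):
    """Return the sample standard deviation per column, or for one column."""
    stdevs = matrix.std(axis=0, ddof=1).tolist()
    return stdevs if col is None else stdevs[col]


print(range_num(matrix))     # [[1.0, 3.0], [10.0, 60.0]]
print(mean_num(matrix))      # [2.0, 30.0]
print(stdev_num(matrix, 0))  # 1.0
```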
- Create a method select that returns a numpy matrix with just the selected columns. The function should take in a list of headers or a list of column indices and return the matrix with those columns. You probably want to limit the number of possible indices to between one and five. It's up to you how the function will handle returning data of mixed types. You can fairly easily convert enumerated types to integers, but strings are more challenging. Numpy does not permit math on mixed format matrices.
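One way select might work is sketched below, under the assumption that you keep a raw-to-numeric column dictionary alongside the numeric matrix. The names `raw2numeric` and `matrix` and the sample values are illustrative. This version simply refuses columns that have no numeric form, which is only one of the possible policies for mixed types.

```python
import numpy as np

# Made-up internal state for illustration.
headers = ["name", "height", "color", "weight"]
raw2numeric = {1: 0, 3: 1}   # raw column index -> numeric matrix column
matrix = np.array([[12.5, 77.0],
                   [9.0, 81.0],
                   [10.0, 79.0]])


def select(columns):
    """Return a matrix holding just the requested columns.

    columns is a list of headers (strings) or raw column indices.
    """
    numeric_cols = []
    for col in columns:
        # Accept either a header or a raw column index.
        raw_col = headers.index(col) if isinstance(col, str) else col
        if raw_col not in raw2numeric:
            raise ValueError(f"column {col!r} has no numeric form")
        numeric_cols.append(raw2numeric[raw_col])
    # Indexing with a list keeps the result two-dimensional.
    return matrix[:, numeric_cols]


print(select(["weight", "height"]))
print(select([1]).shape)  # (3, 1)
```

Columns come back in the order requested, so the caller decides which variable ends up in which position.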
- Separate from your Data class file, create a main python program that takes three arguments from the command line: a filename, the x-axis header, and the y-axis header. When executed, the program should use matplotlib (pylab) to generate a plot with the two specified variables as the X and Y axes. Run this on your custom data set.
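The overall shape of such a program might look like the sketch below. The file name `plot_xy.py` and the helper `load_xy` are made up for illustration; `load_xy` is a stand-in for your Data class and assumes both requested columns are numeric. In a real program you would call `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard.

```python
import csv
import sys


def load_xy(lines, x_header, y_header):
    """Stand-in for the Data class: pull two numeric columns by header."""
    kept = [ln for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
    rows = list(csv.reader(kept))
    headers = [h.strip() for h in rows[0]]
    xi, yi = headers.index(x_header), headers.index(y_header)
    # Rows 0 and 1 hold the names and types; data starts at row 2.
    xs = [float(row[xi]) for row in rows[2:]]
    ys = [float(row[yi]) for row in rows[2:]]
    return xs, ys


def main(argv):
    """Plot one column against another; returns a process exit code."""
    if len(argv) != 4:
        print("usage: python plot_xy.py <filename> <x header> <y header>")
        return 1
    filename, x_header, y_header = argv[1:4]
    with open(filename, newline="") as f:
        xs, ys = load_xy(f, x_header, y_header)
    # Imported here so the usage message works without matplotlib.
    import matplotlib.pyplot as plt
    plt.plot(xs, ys, "o")
    plt.xlabel(x_header)
    plt.ylabel(y_header)
    plt.show()
    return 0
```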
- Download the BirdArrivals.csv data set from the Academics server. Make sure your program can read and store the data properly, as it includes all four types of data. Write a main program that takes in a bird name as the command line argument and generates a histogram (using matplotlib) of all of the arrival dates for all years for the selected bird. [You may want to pre-extract all of the bird names from the data, store them in your main program, and print out the list if the user runs the program with no argument.]
- Create more visualizations of your data set.
- Extend your main program so that the first index is the X axis, and the remaining indices are all plotted on the Y axis. For example, use this to create a bar graph with multiple dependent variables.
- Add features to the visualization of the bird arrivals data.
- Build additional I/O capability into the Data class (e.g. expand it to include xls files, XML or other formats).
- Extend the select and range methods to intelligently handle non-numeric data or add other potentially useful functions.
- Add other visualizations (using matplotlib) for your custom data set.
- Make a wiki page for the project writeup. On it, describe your Data class API, with brief descriptions of all the functions, their inputs, outputs, and purpose.
- Describe in your writeup how you store the data internally in your Data class, noting how you deal with each different type of data.
- Include in your writeup a screen capture of the figures from the last two tasks.
- Include pictures and descriptions of any extensions or other visualizations you created.
Once you have written up your assignment, give the page the label:
Put your code in your private subdirectory in the COMP/CS251 folder on fileserver1/Academics.