The purpose of this week's lab is to expand your Data class and start building the API needed by your application for analysis and visualization. In addition, you'll start working with numpy, scipy, and matrices.
To make reading the data straightforward, we're going to use a single general format that simplifies the task. The data should have the following properties.
- The data should be in CSV format with commas separating different entries.
- The first row of the CSV data file should be the variable names. There must be a name for each column.
- The second row of the data should be the variable types, each one of: numeric, string, enum, or date. Numeric types can be either integers or floating point values; strings are arbitrary strings; enum implies there are a finite number of values, but they can be strings or numbers; a date should be interpreted as a calendar date.
- Missing numeric data should be specified by the number -9999 in integer format. A decimal would imply an actual value.
- Any row that begins with a hash symbol should be ignored by the reader.
- You are free to use the csv module, which is standard in Python 2.7; see the Python 2.7 documentation for details.
- Modify your data file from project 1 into the format above so you have a data set to use for testing.
- Create a python file Data.py (my names are suggestions only). It will contain a class for managing and handling data. The constructor for the class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above. You may also want your constructor to be able to take in a list of lists that represents a data set, but this is optional.
Create a method for reading the data from a file. The method should
put the original data in string format into a list of lists, with one
sublist for each data point. In addition, the method should store the
headers and types read from the data file.
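The reading method described above can be sketched as follows. This is only one possible approach: the function name read_data, the sample rows, the date format in the sample, and the choice to skip blank rows are all assumptions made for illustration, not requirements of the lab.

```python
import csv


def read_data(lines):
    """Parse lines in the lab's CSV format.

    `lines` is any iterable of strings, such as an open file object.
    Returns (headers, types, raw_data), with every field kept as a
    whitespace-stripped string.
    """
    headers, types, raw_data = None, None, []
    for row in csv.reader(lines):
        # Skip rows that begin with a hash symbol. Skipping blank rows
        # as well is an assumption, not part of the format description.
        if not row or row[0].strip().startswith('#'):
            continue
        row = [field.strip() for field in row]
        if headers is None:
            headers = row          # first kept row: variable names
        elif types is None:
            types = row            # second kept row: variable types
        else:
            raw_data.append(row)   # every later row: one data point
    return headers, types, raw_data


# Hypothetical sample data, given as a list of lines instead of a file.
sample = [
    "# hypothetical test data",
    "name, height, color, visited",
    "string, numeric, enum, date",
    "alice, 1.7, red, 2012-02-01",
    "bob, -9999, blue, 2012-02-03",
]
headers, types, raw_data = read_data(sample)
print(headers)
print(types)
print(raw_data[1])
```

Because csv.reader accepts any iterable of strings, the same function can be handed an open file object in place of the sample list.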
Internally, you will also want to extract the numeric data from the raw data and create a numpy matrix to hold it. As the numeric, enumerated, and string data may be mixed, you will want to create a dictionary or list that links the raw data indices to the numeric data indices. Column 3 in the original data, for example, may be in column 1 of the numeric data. You may also want to create a dictionary that links the headers to their corresponding raw data and numeric data columns. All I/O with the Data class should use the raw data column numbers or the headers themselves to access the data.
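A minimal sketch of the index bookkeeping described above, assuming the headers, types, and raw data are already available as lists of strings. The names build_numeric, raw2numeric, and header2raw are hypothetical, and the sketch stores the numbers in a 2-D numpy array rather than a numpy.matrix object. The missing-data marker -9999 is copied through unchanged.

```python
import numpy as np


def build_numeric(types, raw_data):
    """Return (matrix, raw2numeric) for the columns typed 'numeric'."""
    # Map each raw column index to its column index in the numeric
    # matrix, in the order the numeric columns appear in the file.
    raw2numeric = {}
    for raw_col, col_type in enumerate(types):
        if col_type == 'numeric':
            raw2numeric[raw_col] = len(raw2numeric)

    raw_cols = sorted(raw2numeric, key=raw2numeric.get)
    rows = [[float(point[c]) for c in raw_cols] for point in raw_data]
    # reshape keeps the result 2-D even when there are no rows or columns.
    matrix = np.array(rows, dtype=float).reshape(len(raw_data), len(raw_cols))
    return matrix, raw2numeric


# Hypothetical sample data.
headers = ['name', 'height', 'color', 'weight']
types = ['string', 'numeric', 'enum', 'numeric']
raw_data = [['alice', '1.7', 'red', '60'],
            ['bob', '-9999', 'blue', '82.5']]

matrix, raw2numeric = build_numeric(types, raw_data)
# Headers map to raw column numbers, which in turn map to numeric columns.
header2raw = dict((h, i) for i, h in enumerate(headers))

print(raw2numeric)
print(matrix[:, raw2numeric[header2raw['weight']]])
```

In this sample, raw column 3 (weight) ends up in numeric column 1, which is the situation the paragraph above describes.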
- Create a method that nicely prints out the data to the command line. Test your methods with the simple examples below, including your data from project 1, once you have converted it to a CSV file and inserted the necessary meta-data.
Create at least the following useful methods.
- header - takes in an optional column id and returns the header as a string. With no argument, returns a list of all of the headers.
- header_num - returns a list of the numeric headers.
- type - takes in an optional column id and returns the type as a string. With no argument, returns a list of all of the types.
- value - takes in a row and column and returns the data value. If the value is numeric, it should return the numeric version of it.
- point - takes in a row index and returns the data vector in its raw (string) form.
- point_num - takes in a row index and returns the numeric values only as floats.
- dim - returns the number of variables (columns) in each data point.
- dim_num - returns the number of numeric variables in each data point.
- size - returns the number of data points (rows).
- Create a method select that returns a numpy matrix with just the selected columns. The function should take in a list of headers or a list of column indices and return the matrix with those columns. You probably want to limit the number of possible indices to between one and five. It's up to you how the function will handle returning data of mixed types. You can fairly easily convert enumerated types to integers, but strings are more challenging. Numpy does not permit math on mixed-format matrices.
- Separate from your Data class file, create a main python program that takes three arguments from the command line: a filename, the x-axis header, and the y-axis header. When executed, the program should generate a plot using matplotlib (pylab) using the two specified variables as the X and Y axes. Run this on your custom data set.
- Within your Data.py file, create a second class called DataColID. A DataColID object should contain a reference to a Data object and then a header and/or column index. In other words, if you have multiple data sets open, a DataColID will give you enough information to access a particular column of data. Create appropriate accessors/mutators for the class.
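To illustrate how select and DataColID might fit together, here is a minimal sketch. The Data class below is a tiny stand-in that assumes every column is numeric, so raw and numeric column indices coincide; a real Data class would need the raw-to-numeric mapping. The method names column_index, get_data, get_column, set_column, and values are hypothetical.

```python
import numpy as np


class Data(object):
    """Tiny stand-in for the lab's Data class: numeric columns only."""

    def __init__(self, headers, rows):
        self.headers = list(headers)
        self.matrix = np.array(rows, dtype=float)

    def column_index(self, col):
        """Accept either a header string or a column index."""
        if isinstance(col, str):
            return self.headers.index(col)
        return col

    def select(self, cols):
        """Return a 2-D array with just the requested columns, in order."""
        return self.matrix[:, [self.column_index(c) for c in cols]]


class DataColID(object):
    """Pairs a reference to a Data object with one of its columns."""

    def __init__(self, data, column):
        self.data = data
        self.column = column    # header string or column index

    def get_data(self):
        return self.data

    def get_column(self):
        return self.column

    def set_column(self, column):
        self.column = column

    def values(self):
        """Return this column's values as a 1-D array."""
        return self.data.select([self.column])[:, 0]


d = Data(['x', 'y', 'z'], [[1, 2, 3], [4, 5, 6]])
print(d.select(['z', 'x']))
col = DataColID(d, 'y')
print(col.values())
```

Holding a reference to the Data object, rather than a copy of the column, keeps each DataColID cheap to create when several data sets are open.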
In addition to the Data class, you will also be creating and extending
an Analysis class throughout the semester. Create an Analysis.py
file. All Analysis functions will take lists of DataColID objects to
specify what data to analyze. Inside your Analysis file, create the
following three functions.
- range - Takes in a list of DataColID objects and returns a list of 2-element lists with the minimum and maximum values for each column. The function is required to work on all data types, but on non-numeric types you can compare the raw strings to obtain the first and last strings in the set.
- mean - Takes in a list of DataColID objects and returns a list of the mean values for each column. Any non-numeric column should have an empty string in its corresponding location in the return list. Use the built-in numpy functions to execute this calculation.
- stdev - Takes in a list of DataColID objects and returns a list of the standard deviation for each numeric column. Any non-numeric column should have an empty string in its corresponding location in the return list. Use the built-in numpy functions to execute this calculation.
In your Analysis.py file you have a design choice to make. You can make all of the functions methods of an Analysis class, or you can make each function a standalone function. Make the choice that makes the most sense to you.
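As a rough illustration of the standalone-function option, the three analysis functions could look like the sketch below. The DataColID class here is a simplified stand-in that holds one column's type and raw strings directly, which is an assumption made to keep the example self-contained. The range function is named data_range so it does not shadow Python's built-in range, and numpy.std is called with its default settings, which gives the population standard deviation. The missing-data marker -9999 is not filtered out.

```python
import numpy as np


class DataColID(object):
    """Simplified stand-in: holds one column's type and raw strings."""

    def __init__(self, col_type, raw_values):
        self.col_type = col_type
        self.raw_values = raw_values


def _numeric(col):
    """Convert a column's raw strings to a 1-D float array."""
    return np.array([float(v) for v in col.raw_values])


def data_range(cols):
    """Return [min, max] per column; strings are compared as raw text."""
    result = []
    for col in cols:
        if col.col_type == 'numeric':
            values = _numeric(col)
            result.append([float(values.min()), float(values.max())])
        else:
            result.append([min(col.raw_values), max(col.raw_values)])
    return result


def mean(cols):
    """Return the mean per numeric column, '' for any other column."""
    return [float(np.mean(_numeric(c))) if c.col_type == 'numeric' else ''
            for c in cols]


def stdev(cols):
    """Return the standard deviation per numeric column, '' otherwise."""
    return [float(np.std(_numeric(c))) if c.col_type == 'numeric' else ''
            for c in cols]


cols = [DataColID('numeric', ['1', '2', '3', '6']),
        DataColID('string', ['pear', 'apple', 'fig'])]
print(data_range(cols))
print(mean(cols))
print(stdev(cols))
```

Passing ddof=1 to numpy.std would give the sample standard deviation instead; which one is wanted is a decision the lab text leaves open.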
- In your visualization GUI, enable the user to open multiple data sets. The user should be able to load a data set, select it as the active data set, and remove the data set from the program's memory. A suggested method of data set management is to use a list box to show all of the currently loaded data sets and then have buttons below it to load and remove data sets. The currently selected entry in the list box could be used to indicate the active data set.
- Enumerated types can be converted into numeric data. Using a dictionary, you can parse through the raw data, using the raw strings as dictionary keys. Give the first key the value 0, give the second unique key the value 1, and so on, incrementing the counter with each novel key. Create the numeric version of the enumerated type by going through the column and replacing the enumerated value key with its index. Keep the conversion dictionary in your Data class, because you may want to let the user choose from the set of keys.
- Dates can also be converted into numeric data. You can use functions in the time module to convert dates into useful internal numeric representations.
- Create more visualizations of your data set.
- Extend your main program so the user can plot from the active data set.
- Build additional I/O capability into the Data class (e.g. expand it to include xls files, XML or other formats).
- Extend the select and range methods to intelligently handle non-numeric data or add other potentially useful functions.
- Add other visualizations (using matplotlib) for your custom data set.
- Make a wiki page for the project writeup. On it, describe your Data class API, with brief descriptions of all the functions, their inputs, outputs, and purpose.
- Describe in your writeup how you store the data internally in your Data class, noting how you deal with each different type of data.
- Include pictures and descriptions of any extensions or other visualizations you created.
Once you have written up your assignment, give the page the label:
Put your code in your private subdirectory in the COMP/CS251 folder on the Courses server.