The purpose of this week's lab is to create a Data class that allows you to read and write CSV files. Then you should be able to query the Data object for information.
Update your Data class read method so it can handle files with
string, date, and enum types. How you handle these columns is a
design decision; you may choose any of the following.
- Ignore and discard a non-numeric column.
- Store a non-numeric column separately, possibly as the string that was read.
- Convert a non-numeric column (e.g. enum or date) to a number and add it to the numeric data.
When you read in data with one or more non-numeric columns, you will need to adjust the loop where you create the list of lists and convert the strings into floats. In particular, you will need to keep track of the type of each column and skip any non-numeric columns when converting to floats.
If you choose to ignore non-numeric columns, then you will also need to make a new list of headers and types that discards the non-numeric headers and types.
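As a rough illustration of the "ignore and discard" option, the sketch below reads CSV text and keeps only the numeric columns. It assumes the first row holds headers, the second row holds types, and numeric columns are marked with the type string "numeric"; the function name read_numeric and the sample data are hypothetical, so adapt them to your own file format and class layout.

```python
import csv
import io


def read_numeric(lines):
    """Return (headers, types, rows) for the numeric columns only.

    Assumes row 1 is headers and row 2 is types (an assumption about
    the file layout, not a requirement of the csv module).
    """
    reader = csv.reader(lines)
    headers = [h.strip() for h in next(reader)]
    types = [t.strip().lower() for t in next(reader)]

    # Remember which column indices are numeric so every row is
    # filtered the same way.
    keep = [i for i, t in enumerate(types) if t == "numeric"]
    numeric_headers = [headers[i] for i in keep]
    numeric_types = [types[i] for i in keep]

    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        # float() tolerates surrounding whitespace such as " 1.5"
        rows.append([float(row[i]) for i in keep])
    return numeric_headers, numeric_types, rows


# Hypothetical sample file with one string column and two numeric columns.
sample = io.StringIO(
    "name, height, weight\n"
    "string, numeric , numeric\n"
    "ann, 1.5, 50\n"
    "bob, 1.8, 80\n"
)
headers, types, rows = read_numeric(sample)
print(headers)  # ['height', 'weight']
print(rows)     # [[1.5, 50.0], [1.8, 80.0]]
```

Stripping the headers and types before comparing them is what lets this handle leading or trailing spaces in those rows.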
Create a new method for your Data class that takes in a list of
column headers and returns a NumPy matrix with the data for all
rows but just the specified columns. Optionally, you may also
allow the caller to specify a particular set of rows.
You should look at the NumPy hstack function when writing this method. It lets you combine columns side by side to create a new matrix.
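One possible way to build the column-selection method with hstack is sketched below. It is written as a standalone function for brevity; the names select_columns and header2col (a dictionary mapping each header to its column index) are assumptions for illustration, not part of the assignment.

```python
import numpy as np


def select_columns(matrix, header2col, headers, rows=None):
    """Return the requested columns (and optionally rows) of matrix."""
    # Slice with idx:idx+1 so each piece stays a 2-D column of shape (n, 1).
    cols = [matrix[:, header2col[h]:header2col[h] + 1] for h in headers]
    result = np.hstack(cols)  # glue the columns together side by side
    if rows is not None:
        result = result[rows, :]
    return result


# Hypothetical data: three columns named a, b, c.
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
h2c = {"a": 0, "b": 1, "c": 2}

picked = select_columns(m, h2c, ["c", "a"])
one_row = select_columns(m, h2c, ["b"], rows=[1])
print(picked)   # columns c then a, for all rows
print(one_row)  # column b, second row only
```

The columns come back in the order the caller lists them, which need not match their order in the file.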
- Test your new methods (you get to write this code). Some things to test include (1) headers with leading spaces (e.g. "thing1, thing2, thing3"), (2) types with leading or trailing spaces, and (3) columns that switch between numeric and non-numeric types. Try, for example, reading test data 3 and test data 4.
- Create an analysis.py file. All analysis
functions will take lists of strings (column headers) to specify
what (numeric) data to analyze. Inside your analysis file, create
the following five functions. Make these plain functions;
creating an Analysis class will make your life more difficult.
- data_range - Takes in a list of column headers and the Data object and returns a list of 2-element lists with the minimum and maximum values for each column. The function is required to work only on numeric data types.
- mean - Takes in a list of column headers and the Data object and returns a list of the mean values for each column. Use the built-in numpy functions to execute this calculation.
- stdev - Takes in a list of column headers and the Data object and returns a list of the standard deviation for each specified column. Use the built-in numpy functions to execute this calculation.
- normalize_columns_separately - Takes in a list of column headers and the Data object and returns a matrix with each column normalized so its minimum value is mapped to zero and its maximum value is mapped to 1.
- normalize_columns_together - Takes in a list of column headers and the Data object and returns a matrix with each entry normalized so that the minimum value (of all the data in this set of columns) is mapped to zero and the maximum value is mapped to 1.
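The functions listed above might look roughly like the following sketch. TinyData is a hypothetical stand-in for your Data class, and its get_data method is an assumed name for the column-selection method; substitute whatever your class actually provides. Note that numpy's std defaults to the population formula (ddof=0), while Excel's STDEV uses the sample formula, so this sketch passes ddof=1; choose whichever your comparison requires.

```python
import numpy as np


class TinyData:
    """Hypothetical stand-in for the lab's Data class."""

    def __init__(self, headers, rows):
        self.header2col = {h: i for i, h in enumerate(headers)}
        self.matrix = np.array(rows, dtype=float)

    def get_data(self, headers):
        # Assumed method name: returns all rows for the given headers.
        return self.matrix[:, [self.header2col[h] for h in headers]]


def data_range(headers, data):
    d = data.get_data(headers)
    lows, highs = d.min(axis=0).tolist(), d.max(axis=0).tolist()
    return [[lo, hi] for lo, hi in zip(lows, highs)]


def mean(headers, data):
    return data.get_data(headers).mean(axis=0).tolist()


def stdev(headers, data):
    # ddof=1 is the sample standard deviation (a design choice here).
    return data.get_data(headers).std(axis=0, ddof=1).tolist()


def normalize_columns_separately(headers, data):
    d = data.get_data(headers)
    lo = d.min(axis=0)
    span = d.max(axis=0) - lo
    span[span == 0] = 1.0  # constant column: avoid dividing by zero
    return (d - lo) / span


def normalize_columns_together(headers, data):
    d = data.get_data(headers)
    lo = d.min()
    span = d.max() - lo
    return (d - lo) / (span if span != 0 else 1.0)


demo = TinyData(["a", "b"], [[1, 10], [2, 20], [3, 30]])
print(data_range(["a", "b"], demo))  # [[1.0, 3.0], [10.0, 30.0]]
print(mean(["a", "b"], demo))        # [2.0, 20.0]
print(normalize_columns_separately(["a", "b"], demo))
print(normalize_columns_together(["a", "b"], demo))
```

This sketch returns plain NumPy arrays from the normalize functions; wrap them in a matrix type if your Data class uses one.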
Test your new methods. Describe your testing in your report and include your test file in the code you hand in. Note that this could be test code you build into the Data.py file, but the test code should not run unless you execute the Data.py file directly.
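A common way to keep test code from running on import is Python's `__name__` guard, sketched below. The helper clean_headers and the checks inside run_tests are hypothetical placeholders; your own tests would exercise your Data class instead.

```python
def clean_headers(raw_headers):
    """Hypothetical helper: strip whitespace around each header."""
    return [h.strip() for h in raw_headers]


def run_tests():
    """Small self-checks; returns True if every assertion holds."""
    assert clean_headers(["thing1", " thing2", " thing3 "]) == [
        "thing1", "thing2", "thing3"
    ]
    assert clean_headers([]) == []
    return True


# This block runs only when the file is executed directly
# (e.g. "python Data.py"), not when it is imported by another module.
if __name__ == "__main__":
    run_tests()
    print("all tests passed")
```

Importing the file from analysis.py would define the functions without printing anything, while running it directly executes the tests.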
- Find your own data set. Put it into a .csv file and convince me that your Data class can read it in properly. One thing you could do is open the .csv file with Excel, compute the mean and standard deviation using Excel, and then verify that the mean and standard deviation that you calculate with your analysis functions are the same.
- Enumerated types can be converted into numeric data. Using a dictionary, you can parse through the raw data, using the raw strings as dictionary keys. Give the first key the value 0, give the second unique key the value 1, and so on, incrementing the counter with each novel key. Create the numeric version of the enumerated type by going through the column and replacing the enumerated value key with its index. Keep the conversion dictionary in your Data class, because you may want to let the user choose from the set of keys.
- Dates can also be converted into numeric data. You can use functions in the time module to convert dates into useful internal numeric representations.
- Make the above extension work for multiple date formats.
- Build additional I/O capability into the Data class (e.g. expand it to include xls files, XML or other formats).
- Add a method addColumn to add a column of data to the Data object. It will require a header, a type, and the correct number of points. Add it to both the raw_data and, if appropriate, the numeric data. Be sure to update the header and type lists.
- Do some other types of simple data analysis, demonstrated on your selected data set.
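For the enum and date conversions described in the list above, a minimal sketch might look like this. The function names and the two date formats are assumptions chosen for illustration. The date helper parses with time.strptime and then uses calendar.timegm (from the calendar module rather than time) so the result does not depend on the local time zone; time.mktime would be the local-time alternative.

```python
import calendar
import time


def enum_to_numeric(column):
    """Map each distinct string to 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    numeric = []
    for raw in column:
        key = raw.strip()
        if key not in mapping:
            mapping[key] = len(mapping)  # next unused index
        numeric.append(float(mapping[key]))
    return numeric, mapping


def date_to_numeric(text, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Return seconds since 1970-01-01 UTC, trying each format in turn."""
    for fmt in formats:
        try:
            parsed = time.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
        return float(calendar.timegm(parsed))
    raise ValueError("unrecognized date format: " + repr(text))


colors, color_map = enum_to_numeric(["red", "blue", "red", "green"])
print(colors)     # [0.0, 1.0, 0.0, 2.0]
print(color_map)  # {'red': 0, 'blue': 1, 'green': 2}

print(date_to_numeric("1970-01-02"))  # 86400.0
print(date_to_numeric("01/02/1970"))  # 86400.0
```

Returning the mapping alongside the numeric column makes it easy to store the conversion dictionary in the Data class, as the enum item suggests.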
Make a wiki page for the project writeup.
Once you have written up your assignment, give the page the label:
Put your code in your private subdirectory in the COMP/CS251 folder on the Courses server.