Spring 2017

Data Management

The purpose of this week's lab is to create a Data class that allows you to read and write CSV files. Then you should be able to query the Data object for information.


Tasks

  1. Update your Data class to support the numeric form of the data. This will involve adding new code to your read method, new fields to your class, and new accessors.
    • Inside your read method, add new code that goes through each column of numeric data and converts the string data to numeric form (floats). Store all of the numeric columns from the raw data in a single numpy matrix.

      If some of the types in your original data are not numeric, then your numpy matrix will have fewer columns than the raw data, and the column indexes will not match wherever numeric and non-numeric columns are interleaved. Therefore, you will need a second dictionary that maps the headers to their corresponding column indexes in the numeric matrix.

    • Here are the suggested new fields for the Data class.
      • self.matrix_data = np.matrix([]) # matrix of numeric data
      • self.header2matrix = {} # dictionary mapping header string to index of column in matrix data
    • Here are the new accessors:
      • get_headers: returns a list of the headers of the columns with numeric data
      • get_num_columns: returns the number of columns of numeric data
      • get_row: takes a row index and returns a row of numeric data
      • get_value: takes a row index (int) and column header (string) and returns the corresponding value in the numeric matrix
      • get_data: at a minimum, this should take a list of column headers and return a matrix with the data for all rows but just the specified columns. It is optional to also allow the caller to specify a specific set of rows.
    • Test your new methods (you get to write this code). Some things to test include (1) headers with leading spaces (e.g. thing1, thing2, thing3), (2) types with leading or trailing spaces, and (3) columns that switch between numeric and non-numeric types. Try, for example, reading this data file.
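The read step and accessors described in task 1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the required design: it assumes the file's first row holds headers, its second row holds type names, and the type string "numeric" marks a numeric column. The `lines` parameter and the in-memory sample data are hypothetical stand-ins for a real file.

```python
import csv
import io

import numpy as np


class Data:
    """Sketch of the numeric part of a Data class (file layout is an assumption)."""

    def __init__(self, lines=None):
        self.matrix_data = np.matrix([])  # matrix of numeric data
        self.header2matrix = {}           # header string -> column index in matrix_data
        if lines is not None:
            self.read(lines)

    def read(self, lines):
        # Assumed layout: row 0 holds headers, row 1 holds types, the rest is data.
        # Stripping every cell handles leading/trailing spaces in headers and types.
        rows = [[cell.strip() for cell in row] for row in csv.reader(lines) if row]
        headers, types, raw = rows[0], rows[1], rows[2:]

        # Keep only the columns whose declared type is "numeric" (assumed type name).
        numeric_cols = [i for i, t in enumerate(types) if t.lower() == "numeric"]
        self.header2matrix = {headers[c]: j for j, c in enumerate(numeric_cols)}
        self.matrix_data = np.matrix(
            [[float(row[c]) for c in numeric_cols] for row in raw]
        )

    def get_headers(self):
        # Headers ordered by their column index in the numeric matrix.
        return sorted(self.header2matrix, key=self.header2matrix.get)

    def get_num_columns(self):
        return self.matrix_data.shape[1]

    def get_row(self, row_index):
        return self.matrix_data[row_index, :]

    def get_value(self, row_index, header):
        return self.matrix_data[row_index, self.header2matrix[header]]

    def get_data(self, headers, rows=None):
        cols = [self.header2matrix[h] for h in headers]
        selected = self.matrix_data[:, cols]
        return selected if rows is None else selected[rows, :]


# Hypothetical sample with leading spaces and mixed numeric/non-numeric columns.
sample = io.StringIO(
    " thing1, thing2, thing3\n"
    "numeric , string, numeric\n"
    "1, a, 4.5\n"
    "2, b, 5.5\n"
    "3, c, 6.5\n"
)
d = Data(sample)
print(d.get_headers(), d.get_num_columns())
```

Note that thing3 is column 2 in the raw data but column 1 in the numeric matrix, which is why the separate header2matrix dictionary is needed.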
  2. Create an analysis.py file. All analysis functions will take lists of strings (column headers) to specify what (numeric) data to analyze. Inside your analysis file, create the following five functions.
    • data_range - Takes in a list of column headers and the Data object and returns a list of 2-element lists with the minimum and maximum values for each column. The function is required to work only on numeric data types.
    • mean - Takes in a list of column headers and the Data object and returns a list of the mean values for each column. Use the built-in numpy functions to execute this calculation.
    • stdev - Takes in a list of column headers and the Data object and returns a list of the standard deviation for each specified column. Use the built-in numpy functions to execute this calculation.
    • normalize_columns_separately - Takes in a list of column headers and the Data object and returns a matrix with each column normalized so its minimum value is mapped to zero and its maximum value is mapped to 1.
    • normalize_columns_together - Takes in a list of column headers and the Data object and returns a matrix with each entry normalized so that the minimum value (of all the data in this set of columns) is mapped to zero and the maximum value (again over all of these columns) is mapped to 1.

    Test your new methods. Describe your testing in your report and include your test file in the code you hand in. Note that this could be test code you build into the Data.py file, but the test code should run only when you execute the Data.py file directly.
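One possible shape for the five analysis functions is sketched below. It is a sketch under assumptions rather than a reference solution: `StubData` is a hypothetical stand-in that provides only `get_data`, the choice of `ddof=1` in `stdev` is an assumption (numpy's default is `ddof=0`), and mapping constant columns to zero instead of dividing by zero is a design choice the assignment does not specify.

```python
import numpy as np


class StubData:
    """Hypothetical stand-in for the Data class; only get_data is provided."""

    def __init__(self, headers, values):
        self.header2matrix = {h: i for i, h in enumerate(headers)}
        self.matrix_data = np.matrix(values, dtype=float)

    def get_data(self, headers):
        return self.matrix_data[:, [self.header2matrix[h] for h in headers]]


def data_range(headers, data):
    m = data.get_data(headers)
    mins = m.min(axis=0).tolist()[0]
    maxs = m.max(axis=0).tolist()[0]
    return [[lo, hi] for lo, hi in zip(mins, maxs)]


def mean(headers, data):
    return data.get_data(headers).mean(axis=0).tolist()[0]


def stdev(headers, data):
    # Assumption: ddof=1 (sample standard deviation); numpy's default is ddof=0.
    return data.get_data(headers).std(axis=0, ddof=1).tolist()[0]


def normalize_columns_separately(headers, data):
    m = data.get_data(headers)
    mins = m.min(axis=0)
    spans = m.max(axis=0) - mins
    spans[spans == 0] = 1.0  # constant column: avoid dividing by zero
    return (m - mins) / spans


def normalize_columns_together(headers, data):
    m = data.get_data(headers)
    lo, hi = m.min(), m.max()
    span = (hi - lo) if hi != lo else 1.0  # all values equal: avoid dividing by zero
    return (m - lo) / span


if __name__ == "__main__":
    # Runs only when this file is executed directly, not when it is imported.
    demo = StubData(["a", "b"], [[1, 10], [2, 20], [3, 30]])
    print(data_range(["a", "b"], demo))
    print(normalize_columns_together(["a", "b"], demo))
```

The `if __name__ == "__main__":` guard is the usual Python idiom for test code that should run only when the file is executed directly.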

  3. Find your own data set. Put it into a .csv file and convince me that your Data class can read it in properly. One thing you could do is open the .csv file with Excel, compute the mean and standard deviation of each column using Excel, and then verify that the means and standard deviations you calculate with your analysis functions are the same.
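When comparing against a spreadsheet, one detail is worth checking: Excel's STDEV.S divides by n - 1 (sample standard deviation) and STDEV.P divides by n (population standard deviation), while numpy's `std` defaults to `ddof=0`, the population form. The short sketch below, using made-up values rather than real data, shows how the two can differ.

```python
import numpy as np

# Made-up sample values, used only to illustrate the difference.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_value = np.mean(values)
population_std = np.std(values)      # ddof=0, comparable to Excel's STDEV.P
sample_std = np.std(values, ddof=1)  # ddof=1, comparable to Excel's STDEV.S

print(mean_value, population_std, sample_std)
```

If your numbers disagree with the spreadsheet only in the standard deviation, a mismatched `ddof` is a likely cause.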

Extensions


Writeup

Make a wiki page for the project writeup.

  • Write a brief summary of your project that describes the purpose, the task, and your solution to it. The summary should be 200 words or fewer.
  • Describe your Data class API, with brief descriptions of all the functions, their inputs, outputs, and purpose.
  • Describe in your writeup how you store the data internally in your Data class, noting how you deal with each different type of data.
  • Include pictures and descriptions of any extensions, making clear how they are extensions to the project.

Handin

Once you have written up your assignment, give the page the label:

cs251s17project2

Put your code in your private subdirectory in the COMP/CS251 folder on the Courses server.