Spring 2018

Data Management

The purpose of this week's lab is to create a Data class that allows you to read CSV files and store the information in the file. Then you should be able to query the Data object for information.

Data Format

In order to make reading the data straightforward, we're going to use a general format for the data that simplifies the task. In general, the data should have the following properties.

Data Class

The Data class should have methods that read in a CSV data file in the format described above and that access the data (e.g. retrieve all the data for a particular column).

All of the required exercises for this course will make use of numeric data. Therefore, the default Data class read method will discard any columns not specified as numeric and store the numeric columns in a single NumPy floating point matrix.

Properly ignoring strings, dates, or enumerated types will be task one of the project. For the lab, all of the test data files have only numeric data.
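As a rough illustration of what discarding non-numeric columns could look like, the sketch below filters one row by its type labels. It assumes the type line uses the literal label "numeric"; the sample headers, types, and values are made up for the example and are not taken from the test files.

```python
# Hypothetical header, type, and data lines as lists of strings
raw_headers = ["id", "name", "height"]
raw_types = ["numeric", "string", "numeric"]
raw_row = ["1", "Ann", "1.7"]

# Indices of the columns whose type label is "numeric" (assumed label)
keep = [i for i, t in enumerate(raw_types) if t.strip().lower() == "numeric"]

# Keep only the numeric headers and convert the matching values to floats
headers = [raw_headers[i] for i in keep]
values = [float(raw_row[i]) for i in keep]

print(keep, headers, values)
```

Computing the list of kept indices once, from the type line, lets every later row be filtered the same way.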

If you want to store, convert, or otherwise deal with dates, enumerated types, or strings, these are good extensions and may be useful later in the semester when you are analyzing data you select yourself.


  1. Create a Python file data.py and start writing the code for your Data class. Our test code all assumes your file is called data.py and your class is called Data. The constructor for the Data class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above. You may also want your constructor to have the option to take in (1) a list of headers and (2) a list of lists or a NumPy matrix that holds a data set. The following is a possible Data constructor definition.
        def __init__(self, filename = None):

    You will need to initialize a number of different fields for the Data class, but you can add them as you need them. For now, think of your constructor as having the following sections.

        # create and initialize fields for the class
        # if filename is not None
            # call self.read(filename)
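    The outline above might be filled in along these lines. This is only a sketch: the field names follow the ones suggested in step 2, and the read method here is a placeholder stub so that the example runs on its own.

```python
class Data:
    def __init__(self, filename=None):
        # create and initialize fields for the class
        self.headers = []      # list of all headers
        self.types = []        # list of all types
        self.data = None       # will hold the NumPy matrix after reading
        self.header2col = {}   # maps a header to its column index

        # if a filename was supplied, read the file right away
        if filename is not None:
            self.read(filename)

    def read(self, filename):
        # placeholder stub for this sketch; the real method comes in step 2
        raise NotImplementedError("read() has not been written yet")


d = Data()  # no filename given, so read() is not called
print(d.headers, d.header2col, d.data)
```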
  2. Create a read method for reading the data from a file.

    When you open a data file for reading using the open function in Python 3, pass newline='' as the csv module documentation recommends. Python 3 recognizes line endings from any of the standard formats (Windows, MacOS, Linux) by default, so the older 'rU' flag for universal [U] carriage returns is unnecessary; it was deprecated in Python 3 and removed in Python 3.11.

    Once you have opened the file, create a new reader object using the csv module.

        csv_reader = csv.reader( fp )

    You can then use csv_reader to get a single line using

        line = next(csv_reader)
    or you can loop over the lines with a for loop.
        for line in csv_reader:
            # do something with the line

    In addition to the data, the read method should separate and store the headers and data types read from the data file. These will be the first two lines of the file. (Note: As you code, you might want to write just part of the read method and then test it by writing the accessor methods that return what you have read so far.)
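    Putting the pieces so far together, a minimal sketch of opening a file and pulling off the header and type lines might look like this. It assumes Python 3 and opens the file with newline='' as the csv module documentation recommends. The sample file contents and the variable names header_line, type_line, and rows are invented for the example.

```python
import csv
import os
import tempfile

# Hypothetical sample file: a header line, a type line, then numeric rows
sample = "x, y\nnumeric, numeric\n1, 2\n3, 4\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, newline="") as fp:
    csv_reader = csv.reader(fp)
    header_line = next(csv_reader)         # first line of the file
    type_line = next(csv_reader)           # second line of the file
    rows = [line for line in csv_reader]   # every remaining line

os.remove(path)  # clean up the temporary file
print(header_line, type_line, rows)
```

    Note that the values come back as strings and still carry the spaces that followed each comma, so they need stripping and conversion before use.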

    You will likely want to use this set of fields to hold the necessary data:

    • headers (list of all headers)
    • types (list of all types)
    • data (NumPy matrix)
    • header2col (dictionary mapping a header to its corresponding column)

    To create the NumPy matrix, build a list of lists, where each sublist corresponds to a row of the CSV data file. You will need to go through each value and convert it from a string to a float as you build the sublists. When you have finished reading all of the rows, convert the list of lists to a NumPy matrix using the numpy.matrix() function.

    To build the header2col dictionary, loop over the headers and types. Add an entry to the dictionary with the header as the key and the column index as the value.

    Some CSV data files have spaces before or after the types or before or after the headers. After reading the header and type lines from the file, it can be useful to loop over the headers and types and use the string strip() method to remove white space from before and after the string.
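    The stripping, dictionary-building, and matrix-building steps described above can be sketched as follows. The raw_headers, raw_types, and raw_rows lists stand in for what csv.reader would return and are made up for the example.

```python
import numpy as np

# Hypothetical values as they might come out of csv.reader
raw_headers = ["x", " y ", "z"]
raw_types = ["numeric", " numeric", "numeric "]
raw_rows = [["1", " 2", "3.5"], ["4", "5 ", " 6"]]

# Remove white space from before and after each header and type
headers = [h.strip() for h in raw_headers]
types = [t.strip() for t in raw_types]

# Map each header to its column index
header2col = {}
for col, header in enumerate(headers):
    header2col[header] = col

# Build a list of lists of floats, one sublist per row
rows = []
for line in raw_rows:
    rows.append([float(value) for value in line])

# Convert the list of lists to a NumPy matrix
data = np.matrix(rows)
print(headers, types, header2col, data.shape)
```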

  3. Write at least these accessors. Note that to identify specific columns from a Data object, you will use the column's header (as opposed to an index). Use your header2col dictionary to obtain the column index from the header string.
    • get_headers(): returns a list of all of the headers.
    • get_types(): returns a list of all of the types.
    • get_num_dimensions(): returns the number of columns.
    • get_num_points(): returns the number of points/rows in the data set.
    • get_row( rowIndex ): returns the specified row as a NumPy matrix.
    • get_value( header, rowIndex ): returns the specified value in the given column.

    Hint: use the shape field of a NumPy matrix to get the number of rows and columns.
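    One possible shape for these accessors is sketched below. To keep the example self-contained, it uses a hypothetical constructor that takes headers, types, and a list of lists directly instead of reading a file; your own class will fill these fields in its read method.

```python
import numpy as np


class Data:
    def __init__(self, headers, types, rows):
        # Hypothetical constructor for this sketch only (no file reading)
        self.headers = list(headers)
        self.types = list(types)
        self.data = np.matrix(rows, dtype=float)
        self.header2col = {h: i for i, h in enumerate(self.headers)}

    def get_headers(self):
        return self.headers

    def get_types(self):
        return self.types

    def get_num_dimensions(self):
        return self.data.shape[1]   # number of columns

    def get_num_points(self):
        return self.data.shape[0]   # number of rows

    def get_row(self, rowIndex):
        return self.data[rowIndex, :]   # a 1 x N NumPy matrix

    def get_value(self, header, rowIndex):
        # look up the column index from the header string
        return self.data[rowIndex, self.header2col[header]]


d = Data(["x", "y"], ["numeric", "numeric"], [[1, 2], [3, 4], [5, 6]])
print(d.get_num_points(), d.get_num_dimensions())
print(d.get_row(1))
print(d.get_value("y", 2))
```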

  4. You may test your methods with lab2_test1.py if you would like to. Read through the test file and make sure the printed results make sense.
  5. Write a __str__ method for your Data class that nicely prints out the data to the command line. You may want to make it sensitive to the number of rows/columns and print only a subset if there are too many dimensions or data points.
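    As one way to approach step 5, the sketch below prints a header line and at most a fixed number of rows, then a summary of how many rows were left out. The MAX_ROWS cutoff, the column width, and the simplified constructor are all assumptions made for the example; limiting the number of columns could be handled in a similar way.

```python
import numpy as np


class Data:
    MAX_ROWS = 5  # hypothetical cutoff for how many rows to print

    def __init__(self, headers, rows):
        # Hypothetical constructor for this sketch only (no file reading)
        self.headers = list(headers)
        self.data = np.matrix(rows, dtype=float)

    def __str__(self):
        num_rows = self.data.shape[0]
        # header line, each header right-aligned in a 10-character column
        lines = ["  ".join(f"{h:>10}" for h in self.headers)]
        for i in range(min(num_rows, self.MAX_ROWS)):
            values = self.data[i, :].tolist()[0]  # row i as a flat list
            lines.append("  ".join(f"{v:>10.3f}" for v in values))
        if num_rows > self.MAX_ROWS:
            lines.append(f"... ({num_rows - self.MAX_ROWS} more rows)")
        return "\n".join(lines)


d = Data(["x", "y"], [[i, i * 2] for i in range(8)])
text = str(d)
print(text)
```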

Note: You may test your Data class using testdata1.csv and testdata2.csv.

When you are done with the lab exercises, you may start on the rest of the project.