The purpose of this week's lab is to create a Data class that allows you to read CSV files and store the information in the file. Then you should be able to query the Data object for information.
In order to make reading the data straightforward, we're going to use a general format for the data that simplifies the task. In general, the data should have the following properties.
- The data should be in CSV format with commas separating different entries.
- The first row of the CSV data file should be the variable names. There must be a non-empty name for each column.
- The second row of the data should be the variable types: numeric, string, enum, or date. Numeric types can be either integers or floating point values; strings are arbitrary strings; enum implies there are a finite number of values, but they can be strings or numbers; a date should be interpreted as a calendar date.
- Missing numeric data should be specified by the number -9999 in integer format. The same number written with a decimal point (for example, -9999.0) would be treated as an actual value.
- Any line that begins with a hash symbol should be ignored by the reader.
- You probably want to use the csv module, which is part of the Python 3 standard library. See the Python 3 csv documentation for details.
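As an illustration of the format described above, here is a minimal sketch that parses a small data set and skips lines beginning with a hash symbol. The sample contents and the helper name non_comment_lines are made up for this example; they are not part of the lab's test files.

```python
import csv
import io

# Hypothetical sample contents in the lab's format (not a real test file).
SAMPLE = """\
# measurements from a made-up experiment
height, weight, age
numeric, numeric, numeric
1.5, 60, 30
1.8, -9999, 41
"""


def non_comment_lines(lines):
    """Yield only the lines that do not begin with a hash symbol."""
    for line in lines:
        if not line.startswith("#"):
            yield line


# csv.reader accepts any iterable of strings, so the filter can sit in front.
rows = list(csv.reader(non_comment_lines(io.StringIO(SAMPLE))))
print(rows[0])  # header row; the entries still carry their leading spaces
```

Note that the -9999 entry in the last row marks a missing weight, and that the values come back as strings with surrounding spaces, which the reader has to clean up.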
The Data class should have methods that tell it to read in a csv data file as described above and to access data (e.g. retrieve all the data for a particular column).
All of the required exercises for this course will make use of numeric data. Therefore, the default Data class read method will discard any columns not specified as numeric and store the numeric columns in a single NumPy floating point matrix.
Properly ignoring strings, dates, or enumerated types will be task one of the project. For the lab, all of the test data files have only numeric data.
If you want to store, convert, or otherwise deal with dates, enumerated types, or strings, these are good extensions and may be useful later in the semester when you are analyzing data you select yourself.
Create a Python file data.py and start writing the code for your Data class. Our test code all assumes your file is called data.py and your class is called Data. The constructor for the Data class should have the option of taking in a filename and then reading the data from the file. The data file should be in the format described above. You may also want your constructor to have the option to take in (1) a list of headers and (2) a list of lists or a NumPy matrix that holds a data set. The following is a possible Data constructor definition.
def __init__(self, filename = None):
You will need to initialize a number of different fields for the Data class, but you can add them as you need them. For now, think of your constructor as having the following sections.
    # create and initialize fields for the class
    # if filename is not None
        # call self.read(filename)
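The outline above could be fleshed out along these lines. This is only a sketch: the field names follow the suggestions given later in this lab, and the read method is left as a placeholder that you would replace with your own.

```python
class Data:
    def __init__(self, filename=None):
        # create and initialize fields for the class
        self.headers = []      # list of all headers
        self.types = []        # list of all types
        self.data = None       # will hold the NumPy matrix once data is read
        self.header2col = {}   # maps a header to its column index

        # if a filename was given, read the data from it
        if filename is not None:
            self.read(filename)

    def read(self, filename):
        # placeholder for this sketch; the real method is written in the next step
        raise NotImplementedError("read() has not been written yet")


empty = Data()  # no filename, so read() is never called
```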
Create a read method for reading the data from a CSV file.
When you open a data file for reading using the open function, pass newline='' rather than the older 'rU' flag. The 'U' (universal newlines) mode was deprecated in Python 3 and removed in Python 3.11, and the csv documentation recommends newline='' for files handed to csv.reader. The reader will then handle line endings from any of the standard formats (Windows, MacOS, Linux).
Once you have opened the file, create a new reader object using the csv module.
csv_reader = csv.reader( fp )
You can then use csv_reader to get a single line using
line = next(csv_reader)
or you can loop over the lines with a for loop.
for line in csv_reader:
    # do something with the line
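Putting the pieces together, a minimal sketch of opening a file and walking through it with the csv module might look like the following. The temporary file and its contents are invented so the example runs on its own, and the file is opened with newline="" as the csv documentation recommends (recent Python versions no longer accept the 'U' flag).

```python
import csv
import os
import tempfile

# Write a tiny, made-up data file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\nnumeric,numeric\n1,2\n3,4\n")
    path = tmp.name

with open(path, newline="") as fp:
    csv_reader = csv.reader(fp)
    headers = next(csv_reader)            # first line: variable names
    types = next(csv_reader)              # second line: variable types
    rows = [line for line in csv_reader]  # remaining lines: the data

os.remove(path)  # clean up the temporary file
print(headers, types, rows)
```

Each line comes back as a list of strings, so the numeric values still need to be converted before they go into the matrix.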
In addition to the data, the read method should separate and store the headers and data types read from the data file. These will be the first two lines of the file. (Note: As you code, you might want to write just part of the read method and then test it by writing the accessor methods that return what you have read so far.)
You will likely want to use this set of fields to hold the necessary data:
- headers (list of all headers)
- types (list of all types)
- data (NumPy matrix)
- header2col (dictionary mapping a header to its corresponding column)
To create the NumPy matrix, build a list of lists, where each sublist corresponds to a row of the data CSV file. You will need to go through each value and convert it from a string to a float as you build the sublists. When you have finished reading all of the rows, convert the list of lists to a NumPy matrix using the
To build the header2col dictionary, loop over the headers and types. Add an entry to the dictionary with the header as the key and the column index as the value.
Some CSV data files have spaces before or after the types or before or after the headers. After reading the header and type lines from the file, it can be useful to loop over the headers and types and use the string strip() method to remove white space from before and after the string.
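The steps described above (stripping white space, building header2col, and converting the rows to a matrix) can be sketched as follows. The raw_headers, raw_types, and raw_rows values are made up to stand in for what the csv reader would return, and np.matrix is used here on the assumption that you want a NumPy matrix as described; np.array would work in much the same way.

```python
import numpy as np

# Hypothetical values, as they might come back from csv.reader.
raw_headers = [" height", "weight ", " age"]
raw_types = ["numeric", " numeric", "numeric "]
raw_rows = [["1.5", " 60", "30"], ["1.8", "-9999", " 41"]]

# Remove white space from before and after each header and type.
headers = [h.strip() for h in raw_headers]
types = [t.strip() for t in raw_types]

# Map each header to its column index.
header2col = {header: col for col, header in enumerate(headers)}

# Build a list of lists of floats, one sublist per row of the file.
# float() tolerates surrounding white space in the string.
rows = [[float(value) for value in row] for row in raw_rows]

# Convert the list of lists to a NumPy matrix.
data = np.matrix(rows)
print(header2col, data.shape)
```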
Write at least these accessors. Note that to identify specific columns from a Data object, you will use the column's header (as opposed to an index). Use your header2col dictionary to obtain the column index from the header.
get_headers(): returns a list of all of the headers.
get_types(): returns a list of all of the types.
get_num_dimensions(): returns the number of columns.
get_num_points(): returns the number of points/rows in the data set.
get_row( rowIndex ): returns the specified row as a NumPy matrix.
get_value( header, rowIndex ): returns the specified value in the given column.
Hint: use the shape field of a NumPy matrix to get the number of rows and columns.
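One possible shape for these accessors is sketched below. To keep the example self-contained, this stripped-down Data class is built directly from a list of headers, a list of types, and a list of lists (the optional constructor form mentioned earlier) instead of reading a file; your own class will differ.

```python
import numpy as np


class Data:
    def __init__(self, headers, types, data):
        # Simplified constructor for this sketch only.
        self.headers = list(headers)
        self.types = list(types)
        self.data = np.matrix(data, dtype=float)
        self.header2col = {h: i for i, h in enumerate(self.headers)}

    def get_headers(self):
        return self.headers

    def get_types(self):
        return self.types

    def get_num_dimensions(self):
        return self.data.shape[1]  # number of columns

    def get_num_points(self):
        return self.data.shape[0]  # number of rows

    def get_row(self, rowIndex):
        return self.data[rowIndex, :]  # a 1 x N matrix

    def get_value(self, header, rowIndex):
        # Look up the column index from the header, then index the matrix.
        return self.data[rowIndex, self.header2col[header]]


d = Data(["a", "b"], ["numeric", "numeric"], [[1, 2], [3, 4], [5, 6]])
print(d.get_num_points(), d.get_num_dimensions(), d.get_value("b", 1))
```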
- You may test your methods with lab2_test1.py if you would like to. Read through the test file and make sure the printed results make sense.
- Write a __str__ method for your Data class that nicely prints out the data to the command line. You may want to make it sensitive to the number of rows/columns and print only a subset if there are too many dimensions or data points.
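A __str__ method that limits its output could look something like this sketch. The class here stores its rows as a plain list of lists for brevity, and the cutoff of five rows is an arbitrary choice made for the example.

```python
class Data:
    MAX_ROWS = 5  # arbitrary cutoff chosen for this sketch

    def __init__(self, headers, rows):
        # Simplified storage for this sketch: headers plus a list of lists.
        self.headers = headers
        self.rows = rows

    def __str__(self):
        lines = [", ".join(self.headers)]
        # Print at most MAX_ROWS rows of data.
        for row in self.rows[:self.MAX_ROWS]:
            lines.append(", ".join(f"{value:g}" for value in row))
        # Say how many rows were left out, if any.
        hidden = len(self.rows) - self.MAX_ROWS
        if hidden > 0:
            lines.append(f"... ({hidden} more rows)")
        return "\n".join(lines)


d = Data(["a", "b"], [[i, i * 2.0] for i in range(7)])
print(d)
```

The same idea extends to columns: slice the headers and each row before joining them if there are too many dimensions to fit on a line.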
When you are done with the lab exercises, you may start on the rest of the project.