The purpose of this week's lab is to create a Data class that allows you to read and write CSV files. Then you should be able to query the Data object for information.
To make reading the data straightforward, we're going to use a common format that simplifies the task. The data should have the following properties.
- The data should be in CSV format with commas separating different entries.
- The first row of the CSV data file should be the variable names. There must be a non-empty name for each column.
- The second row of the data should be the variable types: numeric, string, enum, or date. Numeric types can be either integers or floating point values; strings are arbitrary strings; enum implies there is a finite number of values, which can be strings or numbers; a date should be interpreted as a calendar date.
- Missing numeric data should be specified by the number -9999 in integer format. A decimal would imply an actual value.
- Any row that begins with a hash symbol should be ignored by the reader.
- You are free to use the csv module, which is part of the standard library in Python 2.7; see the Python 2.7 documentation for the csv module.
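As an illustration of the format above, the following sketch parses a small made-up data set with the csv module. The sample values, the variable names, the date layout, and the helper is_missing are all hypothetical, and the text is parsed from an in-memory string rather than a file to keep the example short.

```python
import csv

# Hypothetical sample data in the format described above.
SAMPLE = """name,age,height,color,birthday
string,numeric,numeric,enum,date
# this row is a comment and should be skipped
alice,31,1.68,red,1990-04-12
bob,-9999,1.80,blue,1985-11-02
"""


def is_missing(text):
    # Only the integer form -9999 marks missing data; -9999.0 is a real value.
    return text.strip() == "-9999"


rows = []
for row in csv.reader(SAMPLE.splitlines()):
    # Skip blank rows and rows that begin with a hash symbol.
    if not row or row[0].lstrip().startswith("#"):
        continue
    rows.append(row)

# First row: variable names. Second row: variable types. The rest: data.
headers, types, data = rows[0], rows[1], rows[2:]
print(headers)
print(types)
print(data)
```

Here bob's age is the only value that is_missing would flag.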
The Data class should have methods that tell it to read in a csv data file as described above and to access data (e.g. retrieve all the data for a particular column). In your Data class, the data should be stored in two forms.
- Raw form: The original form of the data in the CSV file is a set of comma-separated strings. You want to keep a version of the data in this format so that you have an accurate view of the file and so that you can retrieve the values of non-numeric columns. By default, the CSV reader returns each column value as a string, so this is the raw form of the data.
- Matrix form: Most of the analysis and display will use only the numeric columns of the data. Rather than repeatedly converting data from strings to numeric values, your Data class will identify all numeric columns, convert them to floats, and then store the numeric data in a numpy matrix. As a numpy matrix can store data of only one type, this should be a matrix of floating point numbers. Only the numeric columns of your original data should be in the numpy matrix.
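One way to build the matrix form from the raw form is sketched below. The names matrix_data and header2matrix and the sample values are hypothetical, and the sketch uses a two-dimensional numpy array rather than the numpy.matrix type, which is an assumption about what is acceptable here.

```python
import numpy as np

# Hypothetical raw data: every value is a string, as the csv reader returns it.
raw_headers = ["name", "age", "height"]
raw_types = ["string", "numeric", "numeric"]
raw_data = [["alice", "31", "1.68"],
            ["bob", "45", "1.80"]]

# Indices of the columns whose declared type is numeric.
numeric_cols = [i for i, t in enumerate(raw_types)
                if t.strip().lower() == "numeric"]

# Convert only those columns to floats, one matrix row per data point.
matrix_data = np.array([[float(row[i]) for i in numeric_cols]
                        for row in raw_data])

# Map each numeric header to its column in the matrix (not in the raw data).
header2matrix = dict((raw_headers[i], j) for j, i in enumerate(numeric_cols))

print(matrix_data.shape)
print(matrix_data[:, header2matrix["height"]])
```

Keeping a separate header-to-column mapping for the matrix matters because dropping the non-numeric columns shifts the indices relative to the raw data.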
Create a python file data.py and start writing the code for your
Data class. The constructor for the Data class should have the
option of taking in a filename and then reading the data from the
file. The data file should be in the format described above. You
may also want your constructor to have the option to take in a
list of lists that represents a data set. The following is a
possible Data constructor def line.
def __init__(self, filename = None):
You will need to initialize a number of different fields for the Data class, but you can add them as you need them. For now, think of your constructor as having the following sections.
    # create and initialize fields for the class
    # if filename is not None
    #     call self.read(filename)
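A minimal sketch of that constructor outline follows. The four field names are only a starting set, and read is left as a stub that raises an error, so this shows the control flow rather than a finished class.

```python
class Data(object):
    def __init__(self, filename=None):
        # Create and initialize fields for the class (a starting set only).
        self.raw_headers = []
        self.raw_types = []
        self.raw_data = []
        self.header2raw = {}

        # Read the file only if a filename was given.
        if filename is not None:
            self.read(filename)

    def read(self, filename):
        # Stub: the real method parses the CSV file into the fields above.
        raise NotImplementedError("read has not been written yet")


d = Data()  # no filename, so read is never called
print(d.raw_data)
```

Initializing every field before the call to read means a Data object built without a filename is still safe to query.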
Create a read method for reading the data from a
file. The method should put the original data in string format
into a list of lists, with one sublist for each data point. This
is the raw form.
When you open a data file for reading using the file class, use the flag 'rU' to indicate that you want to read the file in universal newline mode (the U). This lets you work with line endings from any of the standard formats (Windows, MacOS, Linux).
In addition to the data, the method should store the headers and data types read from the data file. (Note: You might want to write just part of the read method and then test it by writing the accessor methods that return what you have read so far.)
You will likely want to use this set of fields to manage the raw data:
- raw_headers (list of all headers)
- raw_types (list of all types)
- raw_data (list of lists of all data. Each row is a list of strings)
- header2raw (dictionary mapping header string to index of column in raw data)
Note: Once you get to the project, you will need to add additional fields and code to the read method in order to create and store the numeric data.
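The read method described above might look roughly like the following sketch. It makes a few assumptions that are not part of the lab text: whitespace around each value is stripped, blank rows are skipped, and the file is opened differently under Python 3, where the 'rU' flag was removed in version 3.11. The sample file contents are made up.

```python
import csv
import os
import sys
import tempfile


class Data(object):
    def __init__(self, filename=None):
        self.raw_headers = []
        self.raw_types = []
        self.raw_data = []
        self.header2raw = {}
        if filename is not None:
            self.read(filename)

    def read(self, filename):
        # Python 2 uses 'rU' as described above. In Python 3, newline=''
        # lets the csv module handle the line endings itself.
        if sys.version_info[0] < 3:
            fp = open(filename, 'rU')
        else:
            fp = open(filename, 'r', newline='')
        try:
            rows = []
            for row in csv.reader(fp):
                # Skip blank rows and rows that begin with a hash symbol.
                if not row or row[0].lstrip().startswith('#'):
                    continue
                # Assumption: surrounding whitespace is not part of a value.
                rows.append([value.strip() for value in row])
        finally:
            fp.close()

        if len(rows) < 2:
            raise ValueError("expected a header row and a type row")
        self.raw_headers = rows[0]
        self.raw_types = rows[1]
        self.raw_data = rows[2:]
        # Map each header string to its column index in the raw data.
        self.header2raw = dict((h, i) for i, h in enumerate(self.raw_headers))


# Usage with a small hypothetical file written to a temporary location.
handle, path = tempfile.mkstemp(suffix='.csv')
os.close(handle)
with open(path, 'w') as out:
    out.write("thing,count\nstring,numeric\n# a comment\napple, 3\npear,-9999\n")
d = Data(path)
os.remove(path)
print(d.raw_headers)
print(d.raw_data)
```

Note that the missing-data marker -9999 stays a string in the raw form; interpreting it is a job for the numeric conversion.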
Write at least these helpful accessor methods. Note that to
extract specific columns from a Data object, you will use the
column's header (as opposed to an index).
- get_raw_headers: returns a list of all of the headers.
- get_raw_types: returns a list of all of the types.
- get_raw_num_columns: returns the number of columns in the raw data set
- get_raw_num_rows: returns the number of rows in the data set. This should be identical to the number of rows in the numeric data, so you can get away with writing just one function for this purpose.
- get_raw_row: returns a row of data (the type is list) given a row index (int).
- get_raw_value: takes a row index (an int) and column header (a string) and returns the raw data at that location. (The return type will be a string)
- You may test your methods with lab2_test1.py if you would like to. Read through the test file and make sure the printed results make sense.
- Create a method that nicely prints out the data to the command line.
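The accessors listed above can be sketched as thin wrappers around the raw fields. To keep the example independent of file reading, this version uses a hypothetical constructor that takes the headers, types, and rows directly, and it uses __str__ as one possible way to print the data; both choices are assumptions, and the sketch assumes every row has one value per header.

```python
class Data(object):
    def __init__(self, headers, types, rows):
        # Hypothetical constructor: takes the data directly, not a filename.
        self.raw_headers = list(headers)
        self.raw_types = list(types)
        self.raw_data = [list(r) for r in rows]
        self.header2raw = dict((h, i) for i, h in enumerate(self.raw_headers))

    def get_raw_headers(self):
        return self.raw_headers

    def get_raw_types(self):
        return self.raw_types

    def get_raw_num_columns(self):
        return len(self.raw_headers)

    def get_raw_num_rows(self):
        return len(self.raw_data)

    def get_raw_row(self, row_index):
        return self.raw_data[row_index]

    def get_raw_value(self, row_index, header):
        # Look the column up by header, then index into the row.
        return self.raw_data[row_index][self.header2raw[header]]

    def __str__(self):
        # One possible layout: pad every column to its widest entry.
        table = [self.raw_headers, self.raw_types] + self.raw_data
        widths = [max(len(row[i]) for row in table)
                  for i in range(len(self.raw_headers))]
        lines = ["  ".join(v.ljust(w) for v, w in zip(row, widths)).rstrip()
                 for row in table]
        return "\n".join(lines)


d = Data(["thing", "count"], ["string", "numeric"],
         [["apple", "3"], ["pear", "12"]])
print(d.get_raw_value(1, "count"))
print(d)
```

Because get_raw_value goes through header2raw, callers never need to know the position of a column in the file.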
When you are done with the lab exercises, you may start on the rest of the project.