Principal Components Analysis
The goal of this week's lab is to add the capability to execute PCA on a data set and then create plots based on the analysis.
The first part of the project involves extending your Data and Analysis classes/files. The second part of the project involves integrating the analysis into the GUI.
Read through all of the tasks (lab and project) and plan your design before you start writing code.
Executing a PCA analysis creates three things. First, it generates the eigenvectors, which specify a new basis, or set of axes, for the data. Second, it generates the eigenvalues, which indicate how important each eigenvector is to representing the data. Third, projecting the data from its original data space into the PCA space generates a set of transformed data.
Because the transformed data is, for all intents and purposes, a new data set, it makes sense to use a Data object to hold it. However, we need to extend that Data object to include fields for the mean values, eigenvectors and eigenvalues, as well as information such as the set of columns used for the analysis. Therefore, it makes sense to extend the Data class and create a child class, PCAData class, that will hold the results of a PCA analysis.
Implement a new class PCAData that inherits from the Data class. You can put this new class into your Data.py file or create a new file for it. The class should have new fields to hold the eigenvalues (numpy matrix), the eigenvectors (numpy matrix), and the mean data values (numpy matrix). It should also have a new field to hold the headers of the original data columns used to create the projected data. This may be a subset of the original columns. Use the existing numeric data field to hold the projected data (numpy matrix).
The constructor for the PCAData class should be able to take in the projected data, the eigenvectors, the eigenvalues, the original data means, and the original data headers (in that order). You will use this constructor in the analysis.pca method for step 2.
In the constructor you will need to make sure that all of the fields of the Data class get populated when you create the PCAData class. This includes the dictionary mapping each PCA header to its corresponding column in the data matrix. It is important to understand that the transformed PCA data will not, in general, correspond to the original data headers. You will need to make new headers for the transformed data. Using names like PCA00, PCA01, and so on, is just fine.
In order for the test file to work, your PCAData class has to support the following methods. You can make whatever other methods you feel will be useful.
- get_eigenvalues() - returns a copy of the eigenvalues as a single-row numpy matrix.
- get_eigenvectors() - returns a copy of the eigenvectors as a numpy matrix with the eigenvectors as rows.
- get_original_means() - returns a copy of the means for each column in the original data as a single-row numpy matrix.
- get_original_headers() - returns a copy of the list of the headers from the original data used to generate the projected data.
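The class described above can be sketched roughly as follows. This is a minimal illustration, not the required implementation: the `Data` class here is a hypothetical stand-in for your own, and the field names (`headers`, `header2col`, `data`, `mean_data_values`, and so on) are assumptions that you should replace with whatever your Data class actually uses.

```python
import numpy as np


class Data:
    """Hypothetical stand-in for your own Data class (field names are assumptions)."""

    def __init__(self):
        self.headers = []          # names of the numeric columns
        self.header2col = {}       # maps each header to its column index
        self.data = np.matrix([])  # numeric data, one row per data point


class PCAData(Data):
    def __init__(self, projected_data, eigenvectors, eigenvalues,
                 original_means, original_headers):
        Data.__init__(self)

        # PCA-specific fields (np.matrix turns a 1-D array into a single row)
        self.eigenvectors = np.matrix(eigenvectors)
        self.eigenvalues = np.matrix(eigenvalues)
        self.mean_data_values = np.matrix(original_means)
        self.original_headers = list(original_headers)

        # Inherited fields: the projected data needs its own new headers
        self.data = np.matrix(projected_data)
        self.headers = ["PCA%02d" % i for i in range(self.data.shape[1])]
        self.header2col = {h: i for i, h in enumerate(self.headers)}

    def get_eigenvalues(self):
        return self.eigenvalues.copy()

    def get_eigenvectors(self):
        return self.eigenvectors.copy()

    def get_original_means(self):
        return self.mean_data_values.copy()

    def get_original_headers(self):
        return list(self.original_headers)


# Small usage example with made-up values
projected = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
pcad = PCAData(projected, np.eye(2), np.array([2.0, 0.5]),
               np.array([0.1, 0.2]), ["a", "b"])
print(pcad.headers)
print(pcad.get_eigenvalues())
```

If your Data constructor reads from a file by default, you will likely need to let it accept no filename so that PCAData can call it and then fill in the fields directly.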
Implement a function pca in your Analysis class/file. The function should take in a Data object and a list of column headers and return a PCAData object with the projected data, eigenvectors, eigenvalues, source data means, and source column headers stored in it.
Your pca function should also have an optional argument that lets the user choose whether to pre-normalize the data before executing the PCA analysis. By default, that argument should be True.
When the data being used is homogeneous--it all exists in the same units with the same semantic meaning--then normalization is not the correct action. However, when the data is heterogeneous--each column uses different units with different semantic meanings--then normalization avoids letting the arbitrary unit designations dominate the PCA analysis.
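To see why units matter, here is a small sketch using synthetic data (heights in meters, weights in grams, both made up for illustration). The helper `min_max_columns` is a hypothetical stand-in that scales each column to the range [0, 1]; it is an assumption about what a per-column normalization might do, not a copy of your normalize_columns_separately function.

```python
import numpy as np

# Synthetic, correlated data in very different units (illustration only)
rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, 200)
weight_g = 70000 + 50000 * (height_m - 1.7) + rng.normal(0, 5000, 200)
A = np.column_stack([height_m, weight_g])


def first_eigenvector(X):
    """Return the first principal direction of X (rows are data points)."""
    D = X - X.mean(axis=0)
    U, S, V = np.linalg.svd(D, full_matrices=False)
    return V[0]


def min_max_columns(X):
    """Hypothetical per-column normalization: scale each column to [0, 1]."""
    mins = X.min(axis=0)
    return (X - mins) / (X.max(axis=0) - mins)


raw_vec = first_eigenvector(A)
norm_vec = first_eigenvector(min_max_columns(A))

# Without normalization the grams column dominates the first eigenvector;
# with normalization both columns contribute.
print("raw:       ", raw_vec)
print("normalized:", norm_vec)
```

In the raw case the first eigenvector points almost entirely along the weight axis simply because grams produce large numbers, which says nothing about the structure of the data.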
If the normalization argument is True, then use the normalize_columns_separately function to access the source data. Otherwise, use get_data.
There are two methods of calculating the eigenvectors and eigenvalues: singular value decomposition (SVD) on the difference matrix, or direct eigenvalue and eigenvector calculation using the covariance matrix of the data. The two methods usually produce the same results. The SVD version is more numerically stable, however, so use that method in your code. You will run into numerical issues with some data sets if you compute the eigenvalues of the covariance matrix using the eig function (second version).
# This version uses SVD
def pca(d, headers, normalize=True):
    # assign to A the desired data. Use either normalize_columns_separately
    # or get_data, depending on the value of the normalize argument.

    # assign to m the mean values of the columns of A

    # assign to D the difference matrix A - m

    # assign to U, S, V the result of running np.linalg.svd on D, with
    # full_matrices=False

    # the eigenvalues of cov(A) are the squares of the singular values (S matrix)
    # divided by the degrees of freedom (N-1). The values are sorted.

    # project the data onto the eigenvectors. Treat V as a transformation
    # matrix and right-multiply it by D transpose, then transpose the result
    # so that each row is a projected data point. The eigenvectors of cov(A)
    # are the rows of V. The eigenvectors match the order of the eigenvalues.

    # create and return a PCA data object with the headers, projected data,
    # eigenvectors, eigenvalues, and mean vector.
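The SVD steps can be sketched on a plain numpy array as shown below. This is only an illustration of the math: the function name `pca_svd` is made up, it skips the Data and PCAData classes and the normalization option entirely, and the small input matrix is invented for the example.

```python
import numpy as np


def pca_svd(A):
    """PCA of A (rows are data points) using SVD on the difference matrix."""
    A = np.asarray(A, dtype=float)
    m = A.mean(axis=0)                    # column means
    D = A - m                             # difference matrix

    # D = U * diag(S) * V, where the rows of V are the eigenvectors of cov(A)
    U, S, V = np.linalg.svd(D, full_matrices=False)

    # cov(A) = D.T * D / (N-1), so its eigenvalues are S^2 / (N-1)
    eigenvalues = (S * S) / (A.shape[0] - 1)

    # (V * D.T).T puts each projected data point on its own row
    projected = (V @ D.T).T
    return projected, V, eigenvalues, m


A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [4.0, 2.0, 0.0],
              [1.0, 5.0, 2.0],
              [3.0, 3.0, 3.0]])
projected, eigenvectors, eigenvalues, means = pca_svd(A)
print(eigenvalues)
print(projected.var(axis=0, ddof=1))
```

A useful sanity check is that the sample variance of each projected column equals the corresponding eigenvalue, which is what the two print statements compare.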
# This version calculates the eigenvectors of the covariance matrix
def pca(d, headers, normalize=True):
    # assign to A the desired data. Use either normalize_columns_separately
    # or get_data, depending on the value of the normalize argument.

    # assign to C the covariance matrix of A, using np.cov with rowvar=False

    # assign to W, V the result of calling np.linalg.eig on C

    # sort the eigenvectors V and eigenvalues W to be in descending order. At
    # the end of this process, the eigenvectors should be a matrix V with
    # each eigenvector as a row of the matrix.

    # assign to m the mean values of the columns of A

    # assign to D the difference matrix A - m

    # project the data onto the eigenvectors. Treat V as a transformation
    # matrix and right-multiply it by D transpose, then transpose the result
    # so that each row is a projected data point.

    # create and return a PCA data object with the headers, projected data,
    # eigenvectors, eigenvalues, and mean vector.
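For comparison, the covariance-matrix steps can be sketched the same way on a plain numpy array. As before, this is an illustration under assumptions: the function name `pca_eig` and the input matrix are invented, and the Data/PCAData plumbing and normalization are left out. Note that np.linalg.eig returns eigenvectors as columns and in no particular order, so the sketch sorts them and transposes.

```python
import numpy as np


def pca_eig(A):
    """PCA of A (rows are data points) using eig on the covariance matrix."""
    A = np.asarray(A, dtype=float)
    C = np.cov(A, rowvar=False)           # columns are the variables
    W, V = np.linalg.eig(C)               # eigenvectors are the COLUMNS of V

    # Sort into descending eigenvalue order, then make eigenvectors the rows
    order = np.argsort(W)[::-1]
    W = W[order]
    V = V[:, order].T

    m = A.mean(axis=0)
    D = A - m
    projected = (V @ D.T).T               # one projected data point per row
    return projected, V, W, m


A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [4.0, 2.0, 0.0],
              [1.0, 5.0, 2.0],
              [3.0, 3.0, 3.0]])
projected, eigenvectors, eigenvalues, means = pca_eig(A)

# Compare against the SVD route: same eigenvalues, and the same
# eigenvectors up to a possible sign flip on each row.
D = A - A.mean(axis=0)
U, S, V_svd = np.linalg.svd(D, full_matrices=False)
print(np.allclose(eigenvalues, S * S / (A.shape[0] - 1)))
print(np.allclose(np.abs(eigenvectors), np.abs(V_svd)))
```

The sign of an eigenvector is arbitrary, so the two methods may return rows that differ by a factor of -1; the projected data columns flip sign accordingly.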
- Test your PCA code. If you run this test file on this data file, then you should get this result.
When you are done with the lab tasks, get started on the rest of the project.