Due Monday 17 April 2017
The goal of this week's project is to build two simple classifiers that can be trained from data. In particular, you will implement a Naive Bayes classifier and a K-nearest-neighbor (KNN) classifier. Once they are working, build some tools for evaluating their outputs and use your visualization app to look at the results.
- Write the two functions in the Classifier parent class for creating and printing a confusion matrix. The confusion_matrix method should build a numpy matrix showing the number of data points in each true category that were classified as each output category. The confusion_matrix_str method should convert that matrix into a string that prints out cleanly, with aligned columns.
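As a rough illustration of the counting involved, here is a minimal sketch written as standalone functions. The signatures, the (matrix, labels) return value, and the rows-are-true, columns-are-predicted layout are assumptions made for this example; the methods in your Classifier class may be organized differently.

```python
import numpy as np

def confusion_matrix(true_cats, class_cats):
    """Count how often each true category (row) was assigned each output category (column).

    Hypothetical signature: both arguments are sequences of category labels.
    """
    true_cats = np.asarray(true_cats).ravel()
    class_cats = np.asarray(class_cats).ravel()
    # Collect every label that appears in either list, in sorted order.
    labels = np.unique(np.concatenate([true_cats, class_cats]))
    index = {label: i for i, label in enumerate(labels.tolist())}
    cmtx = np.zeros((len(labels), len(labels)), dtype=int)
    for t, c in zip(true_cats.tolist(), class_cats.tolist()):
        cmtx[index[t], index[c]] += 1
    return cmtx, labels

def confusion_matrix_str(cmtx, labels):
    """Format the matrix as right-aligned columns with a header row of labels."""
    names = [str(label) for label in labels.tolist()]
    width = max(len(s) for s in names + [str(cmtx.max())]) + 2
    lines = [" " * width + "".join(n.rjust(width) for n in names)]
    for name, row in zip(names, cmtx.tolist()):
        lines.append(name.rjust(width) + "".join(str(v).rjust(width) for v in row))
    return "\n".join(lines)

cmtx, labels = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
print(confusion_matrix_str(cmtx, labels))
```

A perfect classifier puts every count on the diagonal, so off-diagonal entries show which categories are being confused with which.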
Write a Python function, probably in a new file, that does the following:
- Reads in a training set and its category labels, possibly as a separate file.
- Reads a test set and its category labels, possibly as a separate file.
- Builds a classifier using the training set.
- Classifies the training set and prints out a confusion matrix.
- Classifies the test set and prints out a confusion matrix.
- Writes out a new CSV data file with the test set data and the categories as an extra column. Your visualization application should be able to read this file and plot it with the categories as colors.
You will want to be able to use either the Naive Bayes or the KNN classifier for this task. You can create two files, or you can let the user select one or both classifiers from the command line.
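Two pieces of the task above, selecting a classifier from the command line and writing the categories out as an extra column, might be sketched as follows. The option names, the default output filename, and the "category" column header are all hypothetical, and the CSV layout here is a plain single header row, which may not match the header format your visualization application expects.

```python
import argparse
import csv

def parse_args(argv=None):
    """Parse the command line; the option names are illustrative only."""
    parser = argparse.ArgumentParser(description="Train and evaluate a classifier.")
    parser.add_argument("train_file", help="CSV file with the training data")
    parser.add_argument("test_file", help="CSV file with the test data")
    parser.add_argument("--classifier", choices=["nb", "knn", "both"], default="both",
                        help="which classifier(s) to run")
    parser.add_argument("--output", default="classified.csv",
                        help="where to write the classified test set")
    return parser.parse_args(argv)

def write_classified_csv(fp, headers, rows, categories, cat_header="category"):
    """Write the data rows to an open file object, adding the category as a final column."""
    writer = csv.writer(fp)
    writer.writerow(list(headers) + [cat_header])
    for row, cat in zip(rows, categories):
        writer.writerow(list(row) + [cat])
```

In a script you would open the output with `open(args.output, "w", newline="")` and pass the file object to `write_classified_csv`, along with the test data and the labels your classifier produced.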
- Run the above code on the original Activity Recognition data set. Then run it again on the PCA-transformed version of the data set. Include the confusion matrices in your writeup and note any significant differences.
- Plot the activity recognition data set using the first three PCA axes and use color to show the output labels of the classifier. Include this image in your writeup.
- Repeat the above two exercises on a data set of your choice other than the Iris and Activity Recognition data sets.
Try variations on the training data or the classifiers and compare performance on the Activity Recognition data set. For example:
- Use more or fewer PCA dimensions.
- Compare using clustering versus the entire data set for the KNN classifier.
- Compare using different numbers of exemplars per class for the KNN classifier.
- Compare using different numbers of neighbors in the distance sum for the KNN classifier.
- Compare using different distance metrics.
- Use a method other than K-means clustering to select a subset of exemplar points for KNN classification.
- Implement a different type of classifier.
- Explore more data sets.
- Integrate machine learning analysis into your GUI. Be very careful and intentional if you do this extension. Think for a while about your design before writing a single line of code to implement it.
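For the extensions that vary the number of neighbors or the distance metric, one way to make both adjustable is sketched below. It assumes a KNN variant that sums the distances from a test point to its k nearest exemplars in each class and picks the class with the smallest sum; the function names, the parameters, and the list-of-arrays layout for the exemplars are hypothetical and may not match your classifier.

```python
import numpy as np

def pairwise_distances(A, B, metric="euclidean"):
    """Return an (n_A x n_B) array of distances between the rows of A and the rows of B."""
    diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError("unknown metric: " + metric)

def knn_classify(test, exemplars, k=3, metric="euclidean"):
    """Assign each test row the index of the class with the smallest k-nearest distance sum.

    exemplars is assumed to be a list of (n_i x d) arrays, one per class.
    A class with fewer than k exemplars contributes all of its distances,
    which biases its sum downward, so keep the class sizes at k or more.
    """
    sums = []
    for ex in exemplars:
        d = np.sort(pairwise_distances(test, ex, metric), axis=1)
        sums.append(d[:, :min(k, ex.shape[0])].sum(axis=1))
    return np.argmin(np.column_stack(sums), axis=1)

class0 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
class1 = np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
test = np.array([[0.2, 0.1], [5.5, 5.2]])
print(knn_classify(test, [class0, class1], k=2, metric="manhattan"))
```

Running the same training and test sets through several values of k and each metric, then comparing the confusion matrices, gives a direct basis for the comparisons suggested above.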
Make a wiki page for the project report.
- Write a brief summary of your project that describes the purpose, the task, and your solution to it. The summary should be 200 words or fewer.
- Write a brief description of how you implemented the two classifiers and the results on the test data sets.
- Incorporate screenshots showing a visualization of the test data sets using your visualization application. Focus on integrating the text and figures.
- Be sure to document and describe any extensions.
- Summarize what you learned and identify any collaborators/assistance.
Once you have written up your assignment, give the page the label:
Put your code on the handin server in a project8 directory in your private subdirectory.