The final project is to apply concepts you have learned in the course to a new data set, or to implement new or more complex versions of algorithms or concepts from the course. The first option is to develop novel visualizations and/or undertake an analysis of a selected data set, possibly customizing your GUI to the data set of interest. The second option is to implement analysis algorithms or machine learning techniques you have not implemented as part of the projects.
You may work with a partner for this project. Use the lab time to identify a partner, if any, and to explore potential projects. By the end of the lab period, you should have a plan for your final project.
If you are pursuing a project based on data, you should have selected a data set, defined a set of questions about the data set, and defined a set of analyses and/or visualizations designed to help you answer those questions.
If you are pursuing a project based on implementation of a new algorithm, then you should have an initial design of the algorithm and supporting classes/code/libraries and have selected a data set on which to evaluate the algorithm.
Data sets you can consider for this project include the following.
- Anything on the UCI Machine Learning site. Find and read some of the source papers to identify interesting questions.
- A data set of your choice, in consultation with the professor.
- Gulf of Maine data set. This data set is provided by Bigelow Labs. It represents measurements from a boat following a spiral pattern around a kelp farm in the Gulf of Maine. The measurements are designed to measure the impact of the kelp on various aspects of sea water, including salinity, oxygen, and carbon dioxide. The data set is exploratory, and there are a number of questions and visualizations that it can support. It would be an appropriate data set for PCA and clustering as part of the initial exploration of relationships within the data.
- Word Priming data set. This data set looks at the relationship between a word recalled by a subject and a primer word given to them. Specifically, participants were given a one-word cue--a monosyllabic word--and asked to respond as quickly as possible with the first word that came to mind that 'looked or sounded' like the cue. The goal of generating this data set is to obtain free association norms for a large set of items with measures of forward and backward association strength (i.e., what is the probability that a given cue will elicit a given response and vice versa). We are also interested in what characteristics of cues and responses predict association strength or response frequency. Many characteristics of the primer and recalled words are included in the data set. This is an excellent choice for doing prediction with a learned classifier.
- Study Habits data set. This data set was built to examine whether stable cognitive or personality factors and the use of different study strategies and techniques predict achievement as measured by GPA, and what factors predict benefits of retrieval practice (i.e., the extent to which taking a test improves retention relative to repeated study). This data set is amenable to both visualization and analysis, particularly predictive analysis that tries to relate study habits (independent variables) to performance (dependent variable).
- Effects of time zones on baseball win percentages. A study a few years ago showed that winning percentages of teams that had to travel from the Pacific Time Zone to the Eastern Time Zone were lower than for teams traveling the other way. One possible project would be to extend that study to compare winning percentages between all possible pairs of time zones and to examine how long it takes teams to recover (second game after arriving in a new time zone, third game, etc.). This project requires a good bit of data manipulation to get the data in a form that can be analyzed. A useful output of this project might be a GUI that allows these relationships to be visualized and analyzed.
- Effects of batter-handedness and pitcher-handedness. This project incorporates two data sets: one for left-handed pitchers and how they pitch against right-handed and left-handed batters, and a similar one for right-handed pitchers. Although batters typically do better against an opposite-handed pitcher, it is not always the case. This is an interesting data set for mining and modeling, including with clustering, linear regression, and logistic regression. It is also full of possible visualization opportunities.
- Systems Bio data set. This data set tracks 123 cells in the circadian clock over time. It is a good set for examining and visualizing data that change over space and time. One question you might ask is: are cells that are tightly synchronized to each other also close to each other in the tissue?
- MCMI Matched Soccer Data Set. This data set contains around 4000 matched pairs--one boy and one girl--of high school soccer players. The data set includes demographic information, medical history, neurocognitive scores on the ImPACT test, and a 22-item checklist of concussion symptom scores. The primary questions of interest concern similarities and differences between boy and girl soccer players, making use of the matched pairs to reduce the number of confounding variables.
- Bird Arrivals Data Set. Arrival dates of Maine migratory breeding birds. The data set has 21 years of arrival dates of over 100 migratory breeding birds. The biophysical region (there are 15 in the state) could be used to look for patterns of arrival across the state for each species. Comparisons of birds that eat similar food could be interesting: warblers feeding on caterpillars; aerial insectivores like swallows, swifts, and flycatchers; nectar-feeders (hummingbirds and orioles); and aquatic fish-eaters (loons, grebes, Belted Kingfisher, Osprey). For most years, it is also possible to get data on temperature departure from normal (indicating a cold or warm spring) and the NAO (North Atlantic Oscillation), a hemispheric phenomenon that affects weather on a broad scale.
- Red-breasted Nuthatch Christmas Count Data. Winter population dynamics of Red-breasted Nuthatches at the level of states and provinces. The data represent state- or province-wide mean abundances (measured as Number of Birds per Party-hour) from 1962 to 2014. Red-breasted Nuthatches stage southern migrations (called irruptions) from their northerly breeding grounds in some years. I am interested in the synchrony and extent of these southern migrations among states and provinces.
If you are using any of the specific data sets listed above, you can get them from the Course_Materials folder on the Courses server.
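Several of the data sets above are suggested as candidates for PCA during initial exploration. The following is a minimal sketch of that step using NumPy, not a prescribed implementation. The `pca` function name and the synthetic `measurements` array are assumptions made for illustration; they stand in for whatever numeric columns you select from your own data set.

```python
import numpy as np

def pca(data, n_components=2):
    """Project the rows of `data` onto its top principal components.

    Returns the projected data and the fraction of total variance
    explained by each retained component.
    """
    # Center each column so the components describe variation about the mean
    centered = data - data.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes,
    # ordered from largest to smallest singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Convert singular values to eigenvalues of the sample covariance matrix
    eigenvalues = singular_values ** 2 / (data.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    projected = centered @ vt[:n_components].T
    return projected, explained[:n_components]

# Synthetic stand-in data (hypothetical): two strongly related columns
# plus one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
measurements = np.hstack([
    base + 0.05 * rng.normal(size=(200, 1)),
    -2.0 * base + 0.05 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

projected, explained = pca(measurements, n_components=2)
print(projected.shape)
print(explained)
```

If your columns are measured in different units, you would typically also divide each centered column by its standard deviation before the SVD so that no single variable dominates the components.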
Once you have selected a data set, pick 1-3 questions that you want to answer using the data set. These questions can be observational, such as identifying relationships between variables, or they can be predictive, such as predicting a dependent variable from a set of independent variables. Consider what kinds of useful visualizations will be helpful in explaining the answers to the questions.
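To make the distinction between observational and predictive questions concrete, here is a small sketch in Python with NumPy. The variables `hours`, `sleep`, and `score` are synthetic and hypothetical; they are not drawn from any of the course data sets. The first part computes a correlation matrix (an observational summary of relationships between variables), and the second fits a linear regression that predicts a dependent variable from two independent variables.

```python
import numpy as np

# Synthetic, hypothetical data: two independent variables and one
# dependent variable that is a noisy linear function of them
rng = np.random.default_rng(1)
n = 100
hours = rng.uniform(0, 10, size=n)
sleep = rng.uniform(4, 9, size=n)
score = 2.0 * hours + 1.0 * sleep + 50 + rng.normal(scale=0.5, size=n)

# Observational question: how strongly are the variables related?
# Each column is a variable, so rowvar=False
corr = np.corrcoef(np.column_stack([hours, sleep, score]), rowvar=False)
print(corr)

# Predictive question: can score be predicted from hours and sleep?
# The column of ones lets the fit include an intercept term
A = np.column_stack([hours, sleep, np.ones(n)])
coeffs, *_ = np.linalg.lstsq(A, score, rcond=None)
predicted = A @ coeffs

# R^2: fraction of the variance in score explained by the fit
residual = np.sum((score - predicted) ** 2)
total = np.sum((score - score.mean()) ** 2)
r2 = 1 - residual / total
print(coeffs, r2)
```

A scatter plot of one independent variable against the dependent variable, with the fitted line overlaid, is one kind of visualization that could help explain the answer to a question like this.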
Selecting your questions should be done in consultation with the professor or the developer of the data set. For some data sets, the questions will have specific answers you can write programs to generate. In other cases, the questions will be answered by a program that lets the user interact with the data.
After selecting your questions, develop a plan of analysis that outlines the process and methods you will use to generate an answer. In some cases, this will involve simply using your analysis and visualization code. In other cases, it may involve designing code to create custom visualizations or analyses.
If you choose, instead, to implement a new algorithm, such as neural networks or an alternative machine learning algorithm, you still need to select a target data set on which to evaluate your implementation. It's probably a good idea to pick both a very simple data set for testing and a more interesting data set to demonstrate that the algorithm generalizes. If you pick a project like this, keep it simple.
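One way to set up the "very simple data set for testing" is to generate two well-separated clusters and confirm that your implementation classifies held-out points correctly before moving to real data. The sketch below assumes a classifier with `fit` and `predict` methods. The `NearestCentroid` class is only a hypothetical placeholder for the algorithm you are actually implementing, and `make_blobs` is an illustrative helper, not course-provided code.

```python
import numpy as np

class NearestCentroid:
    """Placeholder for the algorithm under test (hypothetical stand-in)."""

    def fit(self, X, y):
        # Store one mean vector per class label
        self.labels_ = np.unique(y)
        self.centroids_ = np.array(
            [X[y == label].mean(axis=0) for label in self.labels_]
        )
        return self

    def predict(self, X):
        # Distance from every point to every centroid, shape (n_points, n_classes)
        distances = np.linalg.norm(
            X[:, None, :] - self.centroids_[None, :, :], axis=2
        )
        # Assign each point the label of its closest centroid
        return self.labels_[distances.argmin(axis=1)]

def make_blobs(rng, n_per_class=50):
    """Generate two well-separated 2D Gaussian clusters with labels 0 and 1."""
    a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n_per_class, 2))
    b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(n_per_class, 2))
    X = np.vstack([a, b])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

rng = np.random.default_rng(2)
X_train, y_train = make_blobs(rng)   # data used to fit the model
X_test, y_test = make_blobs(rng)     # separate draw used only for evaluation

model = NearestCentroid().fit(X_train, y_train)
accuracy = np.mean(model.predict(X_test) == y_test)
print(accuracy)
```

Because the clusters barely overlap, a correct implementation should score at or very near perfect accuracy here; a low score on data this easy usually points to a bug rather than a limitation of the algorithm.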
Write up your project design as a wiki page with the label cs251s18project9. Do this during the lab time. When you are done with your design, then start executing your plan to complete the final project.