Spring 2018

### Clustering

The goal of this week's project is to add basic K-means clustering analysis to your program.

For this project, you will implement two versions of K-means clustering: one that uses the built-in Numpy functions and one that you write yourself. You can use the Numpy version as a reference while debugging (and as a potentially faster option for very large data sets).

1. Make a project 7 directory and copy over your code from project 6.
2. In your analysis module, add a function that uses Numpy's built-in k-means function. Your function should take as arguments a data set, the headers to use for clustering, and the number of clusters to create. It will return the cluster means, the cluster ID of each point, and the representation error.
1. import scipy.cluster.vq as vq
2. Implement the following template for the kmeans_numpy function.

```
def kmeans_numpy(d, headers, K, whiten=True):
    '''Takes in a Data object, a set of headers, and the number of clusters to create
    Computes and returns the codebook, codes, and representation error.
    '''

    # assign to A the result of getting the data from your Data object
    # assign to W the result of calling vq.whiten on A

    # assign to codebook, bookerror the result of calling vq.kmeans with W and K
    # assign to codes, error the result of calling vq.vq with W and the codebook

    # return codebook, codes, and error
```

3. Test your code with this test file using this data. You should get these results. Note the clusters/ids may be reversed.
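For reference, here is one way the template might look once filled in. This sketch takes a numpy matrix A directly instead of a Data object (data-access details vary between implementations), so the headers argument is omitted; substitute your own call for getting the data from your Data object.

```python
import numpy as np
import scipy.cluster.vq as vq

def kmeans_numpy(A, K, whiten=True):
    '''Sketch of the template above, taking a numpy matrix A directly
    rather than a Data object and headers. Returns the codebook, the
    cluster id of each point, and each point's distance to its mean.'''
    W = vq.whiten(A) if whiten else A      # scale each column by its std dev
    codebook, bookerror = vq.kmeans(W, K)  # compute K cluster means
    codes, error = vq.vq(W, codebook)      # assign each point to nearest mean
    return codebook, codes, error
```

Note that vq.kmeans restarts from several random initializations and keeps the lowest-distortion result, which is why the cluster ordering can differ from run to run.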
3. Write your own K-means function. The basic K-means algorithm is as follows.

```
initialize a set of K mean vectors
loop
    identify the closest cluster mean for each data point
    compute a new set of cluster means
    calculate the error between the old and new cluster means
    if the error is small enough, break
identify the closest cluster mean for each data point
return the new cluster means and the cluster IDs and distances for each data point
```

To make the above function simpler, write two helper functions: kmeans_init and kmeans_classify.

• The kmeans_init should take in the data and the number of clusters K, and return a numpy matrix with K rows, each one representing a cluster mean. A simple way to select the means is to randomly choose K data points (rows of the data matrix) to be the first K cluster means. You want to avoid having two of the initial means be the same.

```
# Selects K random rows from the data matrix A and returns them as a matrix
def kmeans_init(A, K):
    # Hint: generate a list of indices then shuffle it and pick K
    # Hint: probably want to check for error cases (e.g. # data points < K)
    pass
```
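One possible way to fill in this stub, assuming A is a numpy matrix or array with one data point per row. Shuffling the row indices guarantees distinct indices, so no initial mean is picked twice (duplicate data points could still produce equal means).

```python
import numpy as np

# Selects K random rows from the data matrix A and returns them as a matrix.
def kmeans_init(A, K):
    N = A.shape[0]
    if K > N:
        raise ValueError("Cannot choose %d initial means from %d data points" % (K, N))
    idx = np.random.permutation(N)[:K]  # distinct shuffled indices, keep the first K
    return np.asarray(A)[idx, :].copy()
```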

• The kmeans_classify should take in the data and cluster means and return a list or matrix (your choice) of ID values and distances. The IDs should be the index of the closest cluster mean to the data point. The default distance metric should be sum-squared distance to the nearest cluster mean.

```
# Given a data matrix A and a set of means in the codebook
# Returns a matrix of the id of the closest mean to each point
# Returns a matrix of the sum-squared distance between the closest mean and each point
def kmeans_classify(A, codebook):
    # Hint: you can compute the difference between all of the means and data point i using: codebook - A[i,:]
    # Hint: look at the numpy functions square and sqrt
    # Hint: look at the numpy functions argmin and min
    pass
```

Test your kmeans_classify with this file. You should get this result.
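A sketch of one possible kmeans_classify, following the hints above. It assumes A and codebook are numpy matrices/arrays with one point (or mean) per row, loops over the data points, and reports Euclidean distance (the square root of the sum-squared difference), as the sqrt hint suggests.

```python
import numpy as np

def kmeans_classify(A, codebook):
    A = np.asarray(A)
    codebook = np.asarray(codebook)
    N = A.shape[0]
    ids = np.zeros((N, 1), dtype=int)
    dists = np.zeros((N, 1))
    for i in range(N):
        diff = codebook - A[i, :]             # K x D differences to every mean
        sq = np.sum(np.square(diff), axis=1)  # squared distance to each mean
        ids[i, 0] = np.argmin(sq)             # index of the closest mean
        dists[i, 0] = np.sqrt(np.min(sq))     # distance to that mean
    return ids, dists
```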

• Implement the core K-means algorithm. A skeleton and some of the code are below.
```
# Given a data matrix A and a set of K initial means, compute the optimal
# cluster means for the data and an ID and an error for each data point
def kmeans_algorithm(A, means):
    # set up some useful constants
    MIN_CHANGE = 1e-7     # might want to make this an optional argument
    MAX_ITERATIONS = 100  # might want to make this an optional argument
    D = means.shape[1]    # number of dimensions
    K = means.shape[0]    # number of clusters
    N = A.shape[0]        # number of data points

    # iterate no more than MAX_ITERATIONS
    for i in range(MAX_ITERATIONS):

        # calculate the codes by calling kmeans_classify
        # codes[j,0] is the id of the closest mean to point j

        # initialize newmeans to a zero matrix identical in size to means
        # Hint: look up the numpy function zeros_like
        # Meaning: the new means given the cluster ids for each point

        # initialize a K x 1 matrix counts to zeros
        # Hint: use the numpy zeros function
        # Meaning: counts will store how many points get assigned to each mean

        # for the number of data points
        #   add to the closest mean (row codes[j,0] of newmeans) the jth row of A
        #   add one to the corresponding count for the closest mean

        # finish calculating the means, taking into account possible zero counts
        # for the number of clusters K
        #   if counts is not zero, divide the mean by its count
        #   else pick a random data point to be the new cluster mean

        # test if the change is small enough and exit if it is
        diff = np.sum(np.square(means - newmeans))
        means = newmeans
        if diff < MIN_CHANGE:
            break

    # call kmeans_classify one more time with the final means
    codes, errors = kmeans_classify(A, means)

    # return the means, codes, and errors
    return (means, codes, errors)
```
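One way to fill in the skeleton, sketched here over plain numpy arrays. To keep it self-contained it inlines a minimal classifier that returns 1-D arrays (so codes[j] rather than codes[j,0]); if you use the matrix-based kmeans_classify from earlier, adjust the indexing accordingly.

```python
import numpy as np

def kmeans_algorithm(A, means, min_change=1e-7, max_iterations=100):
    def classify(A, codebook):
        # distance from every point to every mean via broadcasting
        d = np.sqrt(((A[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2))
        return d.argmin(axis=1), d.min(axis=1)

    A = np.asarray(A, dtype=float)
    means = np.asarray(means, dtype=float).copy()
    K = means.shape[0]
    N = A.shape[0]

    for _ in range(max_iterations):
        codes, _ = classify(A, means)

        newmeans = np.zeros_like(means)  # new means given the cluster ids
        counts = np.zeros(K)             # how many points each mean receives
        for j in range(N):
            newmeans[codes[j]] += A[j, :]
            counts[codes[j]] += 1

        for k in range(K):
            if counts[k] != 0:
                newmeans[k] /= counts[k]
            else:
                # empty cluster: reseed it with a random data point
                newmeans[k] = A[np.random.randint(N), :]

        # stop when the means have effectively stopped moving
        diff = np.sum(np.square(means - newmeans))
        means = newmeans
        if diff < min_change:
            break

    codes, errors = classify(A, means)
    return means, codes, errors
```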

Once you have the above functions written, then write the top-level kmeans function, which should be similar to the numpy version from step 2.

```
def kmeans(d, headers, K, whiten=True):
    '''Takes in a Data object, a set of headers, and the number of clusters to create
    Computes and returns the codebook, codes, and representation errors.
    '''

    # assign to A the result of getting the data given the headers
    # if whiten is True
    #   assign to W the result of calling vq.whiten on the data
    # else
    #   assign to W the matrix A

    # assign to codebook the result of calling kmeans_init with W and K

    # assign to codebook, codes, errors, the result of calling kmeans_algorithm with W and codebook

    # return the codebook, codes, and representation error
```

Test your kmeans function using this file. Your results should be pretty much identical to the numpy version.

Note that if you whiten your data before clustering, then the cluster means will be in the whitened coordinate system. The scipy whiten function divides each column by its standard deviation (according to the documentation, it does not subtract off the mean first). Therefore, if you want to plot those cluster means in the original data space, you need to do the following for each dimension, where c_i is the cluster mean in the original data space for dimension i, cw_i is the whitened cluster mean, and s_i is the standard deviation of dimension i.

c_i = cw_i * s_i

Note that, before whitening, you may want to subtract the original data mean and store it. In that case, you need to add the data mean to the right side of the expression above to get the clusters in the original data space.
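As a quick sanity check of the c_i = cw_i * s_i expression with plain numpy, where s_i is computed the same way vq.whiten does (the per-column standard deviation, with no mean subtraction), using the whitened data mean as a stand-in for a cluster mean:

```python
import numpy as np

A = np.array([[0., 100.], [2., 300.], [4., 500.]])  # toy data matrix
std = A.std(axis=0)      # s_i: per-column std dev, what vq.whiten divides by
W = A / std              # whitened data (no mean subtraction)
wmeans = W.mean(axis=0)  # stand-in for a whitened cluster mean cw_i
means = wmeans * std     # c_i = cw_i * s_i recovers the original-space mean
```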

When you are done with the lab, go ahead and start the project.