Title image Spring 2017


Due Monday 10 April 2017

The goal of this week's project is to add basic K-means clustering analysis to your program. During the lab you will implement your own version of K-means clustering.


You will implement two versions of K-means clustering. One will use the built-in Numpy functions, one you will write yourself. You can use the Numpy version as a reference while debugging (and as a potentially faster version for very large data sets.

  1. Add a function to your analysis module that uses Numpy's built-in k-means capabilities. It should take as arguments a data set, the headers to use for clustering, and the number of clusters to create. It will return the cluster means, the cluster ID of each point, and the representation error.
    1. import scipy.cluster.vq as vq
    2. Write the following kmeans_numpy function.
      def kmeans_numpy( d, headers, K, whiten = True):
          '''Takes in a Data object, a set of headers, and the number of clusters to create
          Computes and returns the codebook, codes, and representation error.
          # assign to A the result of getting the data from your Data object
          # assign to W the result of calling vq.whiten on A
          # assign to codebook, bookerror the result of calling vq.kmeans with W and K
          # assign to codes, error the result of calling vq.vq with W and the codebook
          # return codebook, codes, and error
    3. Test your code with this test file using this data. You should get these results. Note the clusters/ids may be reversed.
  2. Now write your own K-means function. The basic K-means algorithm is as follows.
      initialize a set of K mean vectors
        identify the closest cluster mean for each data point
        compute a new set of cluster means 
        calculate the error between the old and new cluster means
        if the error is small enough, break
      identify the closest cluster mean for each data point
      return the new cluster means and the cluster IDs and distances for each data point

    To make the above function easier, write two helper functions: kmeans_init and kmeans_classify.

    • The kmeans_init should take in the data, the number of clusters K, and an optional set of categories (cluster labels for each data point) and return a numpy matrix with K rows, each one repesenting a cluster mean. If no categories are given, a simple way to select the means is to randomly choose K data points (rows of the data matrix) to be the first K cluster means. You want to avoid have two of the initial means being the same.

      If you are given an Nx1 matrix of categories/labels, then compute the mean values of each category and return those as the initial set of means. You can assume the categories given in the data are zero-indexed and range from 0 to K-1.

    • The kmeans_classify should take in the data and cluster means and return a list or matrix (your choice) of ID values and distances. The IDs should be the index of the closest cluster mean to the data point. The default distance metric should be sum-squared distance to the nearest cluster mean.

      You can test your kmeans_classify with this file. You should get this result.

    • Copy the following algorithm, then go through it and make sure you understand what it does. This is the core K-means algorithm.
      def kmeans_algorithm(A, means):
          # set up some useful constants
          MIN_CHANGE = 1e-7
          MAX_ITERATIONS = 100
          D = means.shape[1]
          K = means.shape[0]
          N = A.shape[0]
          # iterate no more than MAX_ITERATIONS
          for i in range(MAX_ITERATIONS):
              # calculate the codes
              codes, errors = kmeans_classify( A, means )
              # calculate the new means
              newmeans = np.zeros_like( means )
              counts = np.zeros( (K, 1) )
              for j in range(N):
                  newmeans[codes[j,0],:] += A[j,:]
                  counts[codes[j,0],0] += 1.0
              # finish calculating the means, taking into account possible zero counts
              for j in range(K):
                  if counts[j,0] > 0.0:
                      newmeans[j,:] /= counts[j, 0]
                      newmeans[j,:] = A[random.randint(0,A.shape[0]-1),:] #randint is inclusive
              # test if the change is small enough
              diff = np.sum(np.square(means - newmeans))
              means = newmeans
              if diff < MIN_CHANGE:
          # call classify with the final means
          codes, errors = kmeans_classify( A, means )
          # return the means, codes, and errors
          return (means, codes, errors)

    Once you have the above functions written, then write the top-level kmeans function, which should be similar to the numpy version from step 1.

    def kmeans(d, headers, K, whiten=True, categories = ''):
        '''Takes in a Data object, a set of headers, and the number of clusters to create
        Computes and returns the codebook, codes and representation errors. 
        If given an Nx1 matrix of categories, it uses the category labels 
        to calculate the initial cluster means.
        # assign to A the result getting the data given the headers
        # if whiten is True
          # assign to W the result of calling vq.whiten on the data
        # else
          # assign to W the matrix A
        # assign to codebook the result of calling kmeans_init with W, K, and categories
        # assign to codebook, codes, errors, the result of calling kmeans_algorithm with W and codebook        
        # return the codebook, codes, and representation error

    Test your kmeans function using this file. Your results should be pretty much identical to the numpy version.

    Note that if you whiten your data before clustering, then the cluster means will be in the whitened coordinate system. The scipy whiten function divides each column by its standard deviation (according to the documentation it does not subtract off the mean first. Therefore, if you want to plot those cluster means in the original data space you need to do the following for each dimension, where c_i is the cluster mean in the original data space for dimension i, cw_i is the whitened cluster mean, and s_i is the standard deviation.

    c_i = c_w*s_i

    Note that, before whitening, you may want to subtract the original data mean and store it. In that case, you need to add the data mean to the right side of the expression above to get the clusters in the original data space.

When you are done with the lab, go ahead and start the project.