CS 336: Project #7

Project 7: CUDA on the GPU

In one Terminal, log onto bombur so that you can edit your files with emacs, e.g.

ssh -X bombur.cs.colby.edu

In another Terminal, log onto one of the computers hosting a GPU (first log onto bombur, then one of the GPU-host nodes n1, n2, n3, or n4), e.g.

ssh -X bombur.cs.colby.edu
ssh -X n1

In this Terminal, you should build and run your programs. When it is time to profile (time) your code, you will run computeprof, which has a GUI (hence the -X flags). If you aren't running computeprof, then there is no need for X-forwarding.

Read about using the GPU at Colby in the Guide to GPGPU Programming at Colby.

In this project, you will be writing a CUDA C program to add two vectors and then to find the dot product of two vectors. The goal of the project is to become familiar with the basic CUDA C components and with memory-handling. You should also become aware of particularly inefficient strategies.

  1. Vector addition
    1. Write a single CUDA C program that adds two float vectors and places the result into a third vector. You should define two macros for NUM_THREADS_PER_BLOCK and NUM_BLOCKS. N (the problem size) should be computed directly from these macros. The code should create the two vectors and fill them on the host, then copy them to the device, then call the kernel, then copy the result back to the host, then print part of the result (enough to convince you it worked), and finally free the memory allocated on both the host and device.
    2. Analyze the performance of your code using computeprof, the CUDA code profiler; it will tell you how much time is spent in each part of the program. To use it, type computeprof from the command line. When the application is up, you can start a new session. When prompted, navigate to the executable file you want it to analyze, type in any command-line arguments, press Next, press Finish, and wait for the results to appear. Which parts take the longest? Is that independent of problem size? Is any of this surprising?
    3. If you need N threads, is it better to use lots of blocks or as few blocks as possible?
    4. What happens when you use too many threads or blocks?
  2. Dot product
    1. Read section 5.3.1 in Cuda By Example.
    2. Write a single CUDA C program that performs a dot product between two vectors. For the overall program, use the strategy outlined in their example. However, use the "naive" strategy for summing the products. Also, write CPU code to verify the answer is correct.
    3. Now write a version that uses the tree structure to sum the products (i.e. use their code snippet). It is OK to assume NUM_THREADS_PER_BLOCK is a power of 2.
    4. Time the two versions. Is the naive version slower? How large does NUM_THREADS_PER_BLOCK need to be before the tree version is faster than the naive version?
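As a starting point for part 1, here is a minimal sketch of the vector-addition program described above. The macro names NUM_THREADS_PER_BLOCK and NUM_BLOCKS come from the assignment; the particular values, initialization pattern, and amount printed are illustrative choices, not requirements:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NUM_THREADS_PER_BLOCK 256
#define NUM_BLOCKS 64
#define N (NUM_THREADS_PER_BLOCK * NUM_BLOCKS)

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

int main(void) {
    size_t bytes = N * sizeof(float);

    // Create and fill the vectors on the host.
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel, then copy the result back.
    vecAdd<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Print part of the result, enough to convince yourself it worked.
    for (int i = 0; i < 5; i++)
        printf("c[%d] = %f\n", i, c[i]);

    // Free memory on both the device and the host.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}
```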
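For the tree-structured sum in part 2.3, the kernel below sketches the shared-memory reduction in the style of Cuda By Example. It assumes NUM_THREADS_PER_BLOCK is a power of 2 (as the assignment allows); the host is left to sum the per-block partial results and verify against a CPU dot product:

```cuda
#include <cuda_runtime.h>

#define NUM_THREADS_PER_BLOCK 256   // assumed to be a power of 2
#define NUM_BLOCKS 64
#define N (NUM_THREADS_PER_BLOCK * NUM_BLOCKS)

// Each block computes one partial dot product: every thread writes one
// product into shared memory, then a tree reduction sums the block.
__global__ void dotKernel(const float *a, const float *b, float *partial) {
    __shared__ float cache[NUM_THREADS_PER_BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < N) ? a[i] * b[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step,
    // so the block sums in log2(blockDim.x) steps instead of blockDim.x.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; the host adds up the
    // NUM_BLOCKS partial results to finish the dot product.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}
```

The naive version for part 2.2 replaces the reduction loop with a single thread (e.g. threadIdx.x == 0) walking the whole cache array serially, which is what makes the timing comparison in part 2.4 interesting.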

Extensions

Writeup and Handin

To hand in your project, you will gather all of the necessary files into a Proj07 directory.

  1. Create a file named README.docx for your project write-up. Include the analysis outlined earlier. The more thorough the analysis, the higher your grade will be.
  2. You should hand in all code necessary to run your solutions. Place all necessary .h, .cu, and Makefile files in the directory. Stephanie will probably want to compile and run the code. It should be possible to do so without looking for any more files.

Tar/Zip up the directory and email the tarball to Stephanie.