Drawing a Julia Set with Pthreads (due Sat Oct. 18)
To draw a Julia set, we compute the color for each pixel according to the function z(n+1) = z(n)^2 + c. In this assignment you will develop multiple parallel versions of the Julia set drawing problem and run them on the Natural Sciences Computer Cluster (NSCC). An important part of the project is analysis of the effectiveness of the different data distribution algorithms.
There is a helpful webpage about Julia sets that contains an analysis of the Julia set we are creating at http://aleph0.clarku.edu/~djoyce/julia/. I highly recommend the introduction to Julia and Mandelbrot sets.
Logging onto NSCC
You all have accounts on the Natural Sciences Computer Cluster (NSCC) in order to run code for CS 336. The node we will be directed to has 8 processors, with each pair of processors sharing a 6MB L2 cache. There is no queueing software installed, so it is up to you to be careful not to hog the machine or clobber other people's jobs.
My advice is to develop and test your code on your own machine first. You can even use 8 threads to simulate the use of 8 processors - there just won't be a speedup. Once you are reasonably happy with your code, use an SFTP or SCP client (I use Fugu) to move it (and its makefile) to the network server (colby0.colby.edu). That puts the code into a directory that NSCC can access.
To run your code on NSCC, you must log onto it using SSH. In a Terminal window, type ssh nscc.colby.edu. The system is set up to automatically place you in the home directory associated with the colby0 file server. To see who else is logged in to the machine, type who on the command line. To see who is running jobs on the machine, type top (then, to quit it, type q). If none of your classmates are actively using the machine, top should list FahCore_a2.exe as the process taking up most of the CPU time on each processor. This is code Randy is running in the background. It has a low priority and will yield to any processes you start.
If you have any problems using this computer, let me know AS SOON AS POSSIBLE!
What to Do
- Create a directory structure for your project with a root directory named proj1 and daughter directories src and bin.
- Do a sanity check - compile (make julia) and run (../bin/julia) it on NSCC. It should generate julia_seq.ppm using the sequential code. To view the result, convert it to a JPEG using the command line convert julia_seq.ppm julia_seq.jpg, download it to your machine, and open it. It should look like the Julia set we saw in class.
- Familiarize yourself with the sequential version of the code. In particular, note
- The image is designed to contain the complex plane with the real values between -1.5 and 1.5 and the imaginary values between -1 and 1.
- The code computes one grayscale value for each pixel. To do so, it iterates the function z(n+1) = z(n)^2 - 0.75. Here z is a complex number, but the code treats the real and imaginary parts separately, using x for the real part and y for the imaginary part.
- The main tuning knobs for this problem are the global variables that determine the resolution of the image (num_cols and its counterpart for rows). I suggest you increase them to 2000 each once you have your parallel code working.
- Add support for a parallel version that uses static allocation of work. I suggest you copy julia.c into a file julia_static.c. Implement three forms - one that assigns different rows to different processors, one that uses cyclic allocation, and one that uses block cyclic allocation (you can determine the block size).
- Add support for a parallel version that uses dynamic allocation of work. I suggest you copy julia.c into a new file julia_dynamic.c for the new implementation. Implement a work queue like the one shown in class for the Collatz conjecture. Recall that WORK_CHUNK is the macro that determines how many pixels at a time are assigned to a thread. This is an important tuning knob.
- Thoroughly test your code.
- Compile and run the working code on NSCC, then analyze the speed-up for all configurations for at least three different problem sizes (and block sizes and WORK_CHUNK values). There is code in the sequential version to report the time it takes to run, and you should add similar code to the parallel versions. For each data decomposition and block or WORK_CHUNK size, record the times and graph them against the number of threads. Also record the runtime for the sequential code. Choose your problem sizes carefully - be sure each problem size is large enough that the 8-threaded version runs for at least 4 seconds. Does changing the problem size affect the speed-up (i.e., does the ratio between the running time of the 1-threaded version and the 8-threaded version change from problem size to problem size)? If so, can you make a conjecture as to why?
- Write up your results. The report should include your graphs and text describing the data in the graphs and answers to the questions above. In addition, discuss whether or not the speed-up is what you would expect theoretically. Is the parallel version with 1 thread faster, slower, or the same as the sequential version? Why? Which data decomposition provides the greatest speed-up? Which data decomposition provides the smallest speed-up? Why? Spend a few paragraphs discussing the merits and drawbacks of the data decompositions you used.
What to Turn In
Turn in your code and a short report by emailing me. Put "CS336" in your subject line so that my mail filters will catch it and I will notice it right away. For an example report, see the HW5 Example Write-up.
- Code should be well commented. If you turn in non-working but well-commented code, I will be able to help you figure out what your problem is and your grade will suffer less both in the short term and the long term (because you will be better able to learn from your mistakes). It is better to think through a problem and share with me your process than to give up and turn in something that doesn't reflect the amount of time and effort you have put into the homework.
- The main comment associated with any functions you add (e.g. the main thread function) should have in it a short statement about what the function does, what parameters it takes, what it returns, and which global variables it accesses and changes.
- Start coding early so that you can ask me if you have any problems. I don't want you to get stuck on the code and be unable to do the speed-up analysis!
- Bruce has written helpful tutorials on the Terminal and Makefiles. Check them out.
- Be careful if you mount a network directory on your Mac. If you use your Mac Terminal in a mounted directory, then you are still "on" your Mac. Specifically, if you compile code, the binaries will execute on your Mac. If you want to create executables for a different machine (e.g. one of the dwarves or a node on the cluster), then you must log into that machine using ssh and then navigate to the appropriate directory.
- Separating code from executables makes it simpler to compile and run code on multiple machines. Everything in the src directory is portable from machine to machine, but the executables in the bin directory are specific to that machine. For machines on the network that share a file system but not an architecture, this may mean you must keep track of which machine you last compiled on. If your code doesn't run, then recompile it.