Sponsors
OpenMP 3.0 has added an interesting construct called omp task, while Cilk++ is a compiler developed by Cilk Arts to easily parallelize a C/C++ program. It is interesting to see how we can parallelize the following simple fibonacci program.
Serial:
long fib_serial(long n)
{
if (n < 2) return n;
return fib_serial(n-1) + fib_serial(n-2);
}
OpenMP:
long fib_parallel(long n)
{
long x, y;
if (n < 2) return n;
#pragma omp task default(none) shared(x,n)
{
x = fib_parallel(n-1);
}
y = fib_parallel(n-2);
#pragma omp taskwait
return (x+y);
}
When an omp task is encountered, the following block of codes can run on another thread in parallel. Hence, both fib(n-1) and fib(n-2) can be computed in parallel. When the main thread encounters the omp taskwait, it makes sure the omp task completes, i.e. fib(n-1) and fib(n-2) are available in x and y respectively, before it continues execution.
On top of these, we need to create a parallel region with a team of threads.
#pragma omp parallel
#pragma omp single
{
r = fib_parallel(n);
}
As we specify omp single, the underlying code block will be executed only once by a single thread. The parallelism comes from the omp task inside the fib_parallel(n), which offloads computations to idling threads in the parallel region.
Cilk++:
long fib_parallel(long n)
{
long x, y;
if (n < 2) return n;
x = cilk_spawn fib_parallel(n-1);
y = fib_parallel(n-2);
cilk_sync;
return (x+y);
}
The Cilk++ routine above is similar to the OpenMP version. But this looked simpler. The cilk_spawn corresponds to omp task while cilk_sync corresponds to omp taskwait. A set of threads is created automatically when we execute a Cilk++ program, hence, we do not need to specify a parallel region like OpenMP.
See the following running time. The cutoff(30) versions simply use serial fib to compute when n <= 30.
# AMD64 X2 4200+ (2.2GHz) # Fedora 10 x86_64 # GCC 4.4 svn revision 144773 # Cilk++ 1.0.2 Serial fib(45) = 1134903170 Time: 10.630000 OpenMP Parallel fib(45) = 1134903170 Time: 884.240000 OpenMP Parallel fib(45) cutoff(30) = 1134903170 Time: 6.490000 Cilk++ Parallel fib(45) = 1134903170 Time: 58.630000 Cilk++ Parallel fib(45) cutoff(30) = 1134903170 Time: 5.240000
Interestingly, the parallel versions without cutoff are running much slower than the serial version. As the amount of work in a single fib function call is extremely small, overhead in “spawning” threads become excessively high. We have to control the number of spawns to a smaller amount, while still large enough to provide enough parallelism for the number of CPU cores in the system. By using cutoff, we obtain reasonable speed up. Note that Cilk++ fibs are more efficient than the corresponding OpenMP versions. I'm impressed with the relatively low overhead incurred by Cilk++.
Note:
Functions in the Cilk++ program use Cilk++ linkage by default. The following run uses Cilk++ linkage for the serial fib.
Serial fib(45) (Cilk++) = 1134903170 Time: 42.450000
We can see that the overhead can be significant. When a function does not require the spawn/sync capability, we shall explicitly specify extern “C++” to use C++ linkage to obtain maximum performance.
Get the source code here, fibonacci.tgz.

xman, thanks for your kind
xman,
thanks for your kind words about our Cilk++ product. It's
too bad you only have two cores---things become even more
interesting on 4 cores and above. E.g., here is what I
get with your program on a Q6600 Intel box:
$ /usr/lib/gcc-snapshot/bin/g++ --version
g++ (Debian 20090224-1) 4.4.0 20090224 (experimental)
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ /usr/lib/gcc-snapshot/bin/g++ -fopenmp -static -O3 fib.cpp
$ OMP_NUM_THREADS=1 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 3.070000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.070000
$ OMP_NUM_THREADS=2 ./a.out 35
Serial fib(35) = 9227465 Time: 0.110000
Parallel fib(35) = 9227465 Time: 11.000000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.060000
$ OMP_NUM_THREADS=4 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 16.830000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.040000
Of course, if you use Cilk++ the derivative of runtime w.r.t. number processors has the correct sign:
$ cilk++ -O3 fib.cilk
$ CILK_NPROC=1 ./a.out 35
Serial fib(35) = 9227465 Time: 0.090000
Parallel fib(35) = 9227465 Time: 0.730000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.070000
$ CILK_NPROC=2 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 0.490000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.040000
$ CILK_NPROC=4 ./a.out 35
Serial fib(35) = 9227465 Time: 0.130000
Parallel fib(35) = 9227465 Time: 0.260000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.020000
Hi Matteo, Thanks for your
Hi Matteo,
Thanks for your updates. Interestingly, Cilk++ version is achieving 2 times faster than GCC version for the 4-thread run. GCC 4.4 is finally released. I wonder if they have improved the OMP runtime.
Unfortunately, the final
Unfortunately, the final gcc-4.4 release seems to behave in the same way as the february snapshot that I was using yesterday :-(
Post new comment