Skip to Content

Parallelizing Recursive Fibonacci using OpenMP and Cilk++

OpenMP 3.0 has added an interesting construct called omp task, while Cilk++ is a compiler developed by Cilk Arts to easily parallelize a C/C++ program. It is interesting to see how we can parallelize the following simple fibonacci program.

Serial:

long fib_serial(long n)
{
    if (n < 2) return n;
    return fib_serial(n-1) + fib_serial(n-2);
}

OpenMP:

long fib_parallel(long n)
{
    long x, y;
    if (n < 2) return n;
    #pragma omp task default(none) shared(x,n)
    {
        x = fib_parallel(n-1);
    }
    y = fib_parallel(n-2);
    #pragma omp taskwait
    return (x+y);
}

When an omp task is encountered, the following block of codes can run on another thread in parallel. Hence, both fib(n-1) and fib(n-2) can be computed in parallel. When the main thread encounters the omp taskwait, it makes sure the omp task completes, i.e. fib(n-1) and fib(n-2) are available in x and y respectively, before it continues execution.

On top of these, we need to create a parallel region with a team of threads.

#pragma omp parallel
#pragma omp single
{
    r = fib_parallel(n);
}

As we specify omp single, the underlying code block will be executed only once by a single thread. The parallelism comes from the omp task inside the fib_parallel(n), which offloads computations to idling threads in the parallel region.

Cilk++:

long fib_parallel(long n)
{
    long x, y;
    if (n < 2) return n;
    x = cilk_spawn fib_parallel(n-1);
    y = fib_parallel(n-2);
    cilk_sync;
    return (x+y);
}

The Cilk++ routine above is similar to the OpenMP version. But this looked simpler. The cilk_spawn corresponds to omp task while cilk_sync corresponds to omp taskwait. A set of threads is created automatically when we execute a Cilk++ program, hence, we do not need to specify a parallel region like OpenMP.


See the following running time. The cutoff(30) versions simply use serial fib to compute when n <= 30.

# AMD64 X2 4200+ (2.2GHz)
# Fedora 10 x86_64
# GCC 4.4 svn revision 144773
# Cilk++ 1.0.2

Serial fib(45) = 1134903170                     Time: 10.630000
OpenMP Parallel fib(45) = 1134903170            Time: 884.240000
OpenMP Parallel fib(45) cutoff(30) = 1134903170 Time: 6.490000
Cilk++ Parallel fib(45) = 1134903170            Time: 58.630000
Cilk++ Parallel fib(45) cutoff(30) = 1134903170 Time: 5.240000

Interestingly, the parallel versions without cutoff are running much slower than the serial version. As the amount of work in a single fib function call is extremely small, overhead in “spawning” threads become excessively high. We have to control the number of spawns to a smaller amount, while still large enough to provide enough parallelism for the number of CPU cores in the system. By using cutoff, we obtain reasonable speed up. Note that Cilk++ fibs are more efficient than the corresponding OpenMP versions. I'm impressed with the relatively low overhead incurred by Cilk++.

Note:

Functions in the Cilk++ program use Cilk++ linkage by default. The following run uses Cilk++ linkage for the serial fib.

Serial fib(45) (Cilk++) = 1134903170        Time: 42.450000

We can see that the overhead can be significant. When a function does not require the spawn/sync capability, we shall explicitly specify extern “C++” to use C++ linkage to obtain maximum performance.

Get the source code here, fibonacci.tgz.

xman, thanks for your kind

xman,

thanks for your kind words about our Cilk++ product. It's
too bad you only have two cores---things become even more
interesting on 4 cores and above. E.g., here is what I
get with your program on a Q6600 Intel box:

$ /usr/lib/gcc-snapshot/bin/g++ --version
g++ (Debian 20090224-1) 4.4.0 20090224 (experimental)
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ /usr/lib/gcc-snapshot/bin/g++ -fopenmp -static -O3 fib.cpp

$ OMP_NUM_THREADS=1 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 3.070000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.070000

$ OMP_NUM_THREADS=2 ./a.out 35
Serial fib(35) = 9227465 Time: 0.110000
Parallel fib(35) = 9227465 Time: 11.000000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.060000

$ OMP_NUM_THREADS=4 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 16.830000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.040000

Of course, if you use Cilk++ the derivative of runtime w.r.t. number processors has the correct sign:

$ cilk++ -O3 fib.cilk

$ CILK_NPROC=1 ./a.out 35
Serial fib(35) = 9227465 Time: 0.090000
Parallel fib(35) = 9227465 Time: 0.730000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.070000

$ CILK_NPROC=2 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 0.490000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.040000

$ CILK_NPROC=4 ./a.out 35
Serial fib(35) = 9227465 Time: 0.130000
Parallel fib(35) = 9227465 Time: 0.260000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.020000

Hi Matteo, Thanks for your

Hi Matteo,

Thanks for your updates. Interestingly, Cilk++ version is achieving 2 times faster than GCC version for the 4-thread run. GCC 4.4 is finally released. I wonder if they have improved the OMP runtime.

Unfortunately, the final

Unfortunately, the final gcc-4.4 release seems to behave in the same way as the february snapshot that I was using yesterday :-(

Post new comment

The content of this field is kept private and will not be shown publicly.
CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
Image CAPTCHA
Enter the characters shown in the image. Ignore spaces and be careful about upper and lower case.