Sponsors
The introduction of omp task in OpenMP 3.0 allows easy parallelization of the famous quick sort algorithm. As OpenMP 3.0 support in GCC is relatively new, I like to compare its performance particularly the omp task with a commercial compiler Cilk++ which is designed to efficiently support spawn/sync mechanism. Conceptually, the omp task and the Cilk++ spawn/sync are very similar.
Quick sort chooses a pivot, partitions elements into [< pivot] & [>= pivot], and sorts the two partitions recursively. The implementation in my experiment performs partitioning sequentially, and sorts the two resulting partitions in parallel.
Experiment setup:
- AMD64 Phenom X4 9950 (2.6GHz)
- Fedora 10 x86_64
- GCC 4.4 svn revision 144773
- Cilk++ 1.0.2

Parallel versions recurse until a partition is as small as single element. The Parallel (OpenMP) is even slower than the serial version, although we have four cores running. The Parallel (Cilk++) shows a much more promising result.
Adaptive versions determine a threshold value before sorting, based on the total number of elements N. When a partition has at most K = sqrt(N)/2 elements, it sorts the partition using serial quick sort. This significantly reduces the total number of "spawn/sync" required, while still maintaining modest amount of parallelism. The resulting OpenMP version looked much better.


Parallel speed up shows how much faster compared to serial run, while parallel efficiency indicates fraction of theoretical maximum performance achieved. We can see there is a significant performance gap between OpenMP and Cilk++ versions. Perhaps, more intelligent techniques are required to further improve the performance of the OpenMP versions. Alternatively, we may use OpenMP nested parallel regions for limited level of recursions. This is more rigid, but may incur less overhead, and probably good enough for small number of CPU cores. Of course the GCC OpenMP can be improved over time, as the current implementation is still at an early stage. May be I'm just little impatient, as a fans of OpenMP.
The source code: qsort.tgz

I assume you are using the
I assume you are using the task structure for the recursive call in OpenMP.
Can we use section instead?
It's using task structure for
It's using task structure for comparing with Cilk++. You can use section instead, but you do not have much control on how many total threads running in parallel. Your memory usage may blow up.
Task structure may also help reducing total number of spawns as the number of cores is relatively small.
You wouldn't still happen to
You wouldn't still happen to have the source code and paper for this would you?
http://dspace.mit.edu/handle/1721.1/3848
Post new comment