Skip to main content

Sponsors

Benchmark

Parallel QuickSort with OpenMP and Cilk++

The introduction of omp task in OpenMP 3.0 allows easy parallelization of the famous quick sort algorithm. As OpenMP 3.0 support in GCC is relatively new, I like to compare its performance particularly the omp task with a commercial compiler Cilk++ which is designed to efficiently support spawn/sync mechanism. Conceptually, the omp task and the Cilk++ spawn/sync are very similar.

Quick sort chooses a pivot, partitions elements into [< pivot] & [>= pivot], and sorts the two partitions recursively. The implementation in my experiment performs partitioning sequentially, and sorts the two resulting partitions in parallel.

Experiment setup:

  • AMD64 Phenom X4 9950 (2.6GHz)
  • Fedora 10 x86_64
  • GCC 4.4 svn revision 144773
  • Cilk++ 1.0.2

Parallelizing Recursive Fibonacci using OpenMP and Cilk++

OpenMP 3.0 has added an interesting construct called omp task, while Cilk++ is a compiler developed by Cilk Arts to easily parallelize a C/C++ program. It is interesting to see how we can parallelize the following simple fibonacci program.

Serial:

long fib_serial(long n)
{
    if (n < 2) return n;
    return fib_serial(n-1) + fib_serial(n-2);
}

Parallelizing Reversi in C++ using OpenMP - OpenMP Parallel Reversi

Upgrading from single core processor to dual-core processor does not give you double performance for free most of the time. Software written to run single threaded has no idea how to fully utilize a dual-core processor. Fortunately, it is not too difficult to parallelize certain software such as Reversi, because:

  1. Large amount of time is spent in small evaluation function.
  2. Different evaluations can be computed independently.
  3. Many evaluations are to be computed.

Let's see how do I quickly parallelize the Reversi in C++ under GPL license using merely a few lines of OpenMP and minor changes to existing codes.

Following is the code snippet that we want to parallelize, to get multiple iterations computed in parallel.

    for(iter = vmoves.begin() ; iter != vmoves.end() ; iter++) {
      m = *iter ;     
      buffer_b = b ; 
      buffer_b.move_at(m, value) ;      
      s = e.eval(buffer_b, value, (value+1)%2) ; 
      cout << "(" << m.x << "," << m.y << "): " << s << endl ;
      if(s > best_s) {
        best_s = s ;
	best_move = m ;
      }          
    }             

Ooops, we have to be able to compute iter statically (iter points to different possible moves), so that OpenMP can schedule and compute different moves in parallel. So we need to change iter a bit. We have:

Quicksort and C++ Sorting Benchmark on AMD64 X2 4200+

Sorting is an old topic in Computer Science. With the advent of new processors, how's the sorting performance looked like? Following graphs plot the total running time of inserting elements into the containers and sorting the elements. Different number of elements are experimented.


C++ list is implemented as a linked-list. Without random access capability, we expect the sorting time to be very bad. From the experiment, multiset is even worse. Perhaps, if we must use multiset, we probably can sort the elements in vector, and use range insert to insert into multiset? I observed that list and multiset use much much more amount of memory (> 10x)!!! Hence, running 100M elements with 4GB RAM will make swap space into operations.


Vector is clearly the winner for simple sorting applications. It's even better than standard textbook Quick Sort. :) I shall implement other algorithms or variants to compare with this C++ vector implementation.

Benchmarking Yafray on AMD64 X2 4200+ and Pentium 4 HT 3GHz

Posted in

Did a benchmark on Yafray, a rendering software, recently on several cases. See the plot below,

Running in 64bit helps quite a bit in the performance. However, it's not clear if it is due to the use of different compilers.

For AMD64 running 64bit, I use GCC 4.1.1 64bit.
For AMD64 running 32bit, the compiler is unknown as I use the rpm provided by yafray.org.
For Pentium IV, I use GCC 4.1.1 32bit.

Overall, it should be fare enough to compare AMD64 64bit and Pentium IV 32bit since they are using the same compiler.

Running slow using the Parallel version?

Posted in

I recall that the Distributed Parallel Reversi, xversi-java-parallel-v0.8b, was able to achieve reasonable speedup when the level of evaluation is large enough. However, when I run it on my AMD Dual-Core machine, I realize the serial version, xversi-java-v0.7a, is much faster. :S (Jaw drops). I then went to check changes between the serial and parallel versions, what would cause this to happen? ...... after much struggling, I find that I'm using gcj package. Phew~~~~~ The serial version running on GCJ is reasonably fast, but the evaluation is extremely slow in the parallel version. I suppose, the RMI implementation in GCJ is causing delays. Finally, I simply install latest Java package from SUN, and it just runs fast. :) So, now I can increase the level of evaluation to improve the computer's reversi skill.

From my timing of the runs of computer vs computer, it took merely 30 seconds to 1 minute for the SUN Java to complete. However, it took more than 10 minutes for the GCJ Java to run. I dont know how long it will take, I just give up running. :p

Reversi Benchmark - Round 2 with Woodcrest

Posted in

Finally we got a Woodcrest machine for benchmark. Thanks to CopperBlue again for contributing the machine.

The Intel Xeon Dual-Core 2.66GHz (Woodcrest) achieves much better 32bit performance than AMD64X2 4200+ (2.2GHz). Running Woodcrest in 64bit achieves even higher performance.

Can you believe Woodcrest is even slower than the aging AMD64 X2? I suggest CopperBlue upgrade his Java JVM to 1.5.x. There is significant performance boost in Java 1.5 over Java 1.4. Otherwise, his latest processor is just as slow as my 1+ year-old processor. Do you care?

Reversi Benchmark: AMD64 X2 vs INTEL CORE DUO vs OTHERS

Posted in

This round we manage to get an Intel Core Duo 2.16GHz with Shared 2MB L2, Mac Book Pro, to benchmark. Thanks CopperBlue for contributing his laptop. How does it compare to AMD64 X2 4200+ (2.2GHz) with 2 x 512KB Private L2 which has almost the same clock speed? However, AMD64 X2 is able to run in both 32bit mode and 64bit mode, whereas Core Duo runs only 32bit mode. Does it matter?

Let's see the first benchmark xversi-c-benchmark-v0.1 which utilizes only one core.

Intel Core Duo 2.16GHz has slight advantage over AMD64 X2 4200+ in 32bit mode. Interestingly, AMD64 X2 running in 64bit mode boost up the performance almost by 2x. Itanium II with much lower clock speed is clearly out of the game.

What about the Parallel Reversi? Let's see the next benchmark xversi-java-parallel-benchmark-v0.1.

AMD64 X2 4200+ clearly has an advantage running in 64bit mode. The 64bit JVM definitely helps in boosting the performance. Hence, if you own a 64bit processor, use a 64bit OS, and 64bit components such as Sun Java for x86_64.

Benchmark

Posted in

I'm particularly interested in performance of a computer e.g. how long does it take your system to run my reversi program? how many positions can the system evaluate?

This book page intend to consolidate all the benchmarks results that I performed or consolidated.

Running Reversi on Multiple Platforms

Posted in

I have made little changes to let the xversi-c to play by itself. Get a copy of the benchmark version under GPL now xversi-c-benchmark-v0.1.tgz with binaries for Linux x86 or x86_64 included. It basically sets evaluation level to 6, so that it will run for a while. I first run it on my PC, surprisingly, the 64bit version is running much faster.

Host: AMD64 X2 4200+ (2.2GHz)
Compiler: GCC 4.1.1 64bit
Time 3m0.532s

Compiler: GCC 4.1.1 32bit
Time 5m7.210s

Before running this, I thought applications that do not use 64bit data types extensively probably would not benefit from running in 64bit mode. I have yet to investigate the exact causes of this performance difference. I do hope you can shed the light for me if you are experience in this. :)

See performance of the same program on other platforms, you probably will find another surprises.

Host: Itanium II 1.5GHz
Compiler: Intel(R) C Itanium(R) Compiler for Itanium(R)-based applications Version 9.0 Build 20050430 Package ID: l_cc_p_9.0.021
Time 6m35.306s

Host: Itanium II 1.4GHz
Compiler: Intel(R) C Itanium(R) Compiler for Itanium(R)-based applications Version 9.0 Build 20051201 Package ID: l_cc_c_9.0.030
Time 7m16.698s

Compiler: GCC 4.1.0
Time 6m8.471s

Host: Intel Core Duo 1.83GHz
Compiler: GCC
Time 5m5.423s

Host: Intel Pentium 4 HT 3GHz
Compiler: Intel(R) C Compiler for 32-bit applications, Version 9.0 Build 20051201Z Package ID: l_cc_c_9.0.030
Time 5m7.080s

Compiler: GCC 4.1.0
Time 4m37.404s

It's quite impressive to see that Intel Core Duo is on par with the Pentium 4 HT 3GHz, considering the clockspeed difference. And the GCC 4.1.0 is actually outperforms the Intel compiler v9 quite significantly. However, I didnt try out all the compiler switches in Intel compiler. On all the platforms, I merely use -O3. Regarding Itanium II, they are famous with their floating-point performance. In this case, that does not apply, as evaluations merely use Integer operations. In addition, Itanium II has lower clockspeed than other platforms. With higher clockspeed, it will probably work faster than other platforms.

Syndicate content