Upgrading from single core processor to dual-core processor does not give you double performance for free most of the time. Software written to run single threaded has no idea how to fully utilize a dual-core processor. Fortunately, it is not too difficult to parallelize certain software such as Reversi, because:
- Large amount of time is spent in small evaluation function.
- Different evaluations can be computed independently.
- Many evaluations are to be computed.
Let's see how do I quickly parallelize the Reversi in C++ under GPL license using merely a few lines of OpenMP and minor changes to existing codes.
Following is the code snippet that we want to parallelize, to get multiple iterations computed in parallel.
for(iter = vmoves.begin() ; iter != vmoves.end() ; iter++) {
m = *iter ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
>Ooops, we have to be able to compute iter statically (iter points to different possible moves), so that OpenMP can schedule and compute different moves in parallel. So we need to change iter a bit. We have:
int nmove = vmoves.get_num_loc();
for(int i = 0 ; i < nmove ; i++) {
m = vmoves.loc(i) ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
>Now multiple moves/iterations can be computed in parallel. But we need to update vmoves to provide loc(int n) method return the nth move in the list. On top of adding method loc(int n), we also need to change from list to vector.
// location_list.h
inline location loc(int n) { return loc_list[n]; }
Next, we parallelize the for-loop using OpenMP. We call the for-loop parallel region.
int nmove = vmoves.get_num_loc();
#pragma omp parallel for default(none)
for(int i = 0 ; i < nmove ; i++) {
m = vmoves.loc(i) ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
>The #pragma omp parallel for default(none) lets the immediate following for-loop (parallel region) run in parallel on multiple threads. The default(none) ensures that the code will not compile if we do not explicitly define scopes of the variables in the parallel region.
Next, we have to define the scopes of variables inside the parallel region:
int nmove = vmoves.get_num_loc();
#pragma omp parallel for default(none) \
private(buffer_b,m,s) \
firstprivate(e) \
shared(vmoves,b,value,best_s,best_move,cout,nmove)
for(int i = 0 ; i < nmove ; i++) {
m = vmoves.loc(i) ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
>Beginners of OpenMP often have difficulty to decide whether a variable should be private or shared. For private it's easier identify. You just have to find out if a variable is merely a temporary storage/variable for an iteration. The buffer_b, m, and s are just intermediate results for computing best move (best_s, best_move). We do not use their values in another iteration. Hence they should be private.
Next, the variable e is also a temporary variable. But, we need the internal values of e to be initialized with values the same as before encountering the parallel region. Hence, we let it be firstprivate.
For shared, it's a bit tricky. Typically, all variables that you will not modify in the parallel region can be set as shared. This includes vmoves, b, value and nmove. For those variables that you need to read/write at different iterations such as best_s and best_move, we can set them as shared, however IMPORTANT, we must protect the variables whenever we read/write. Let's see how do we do that now.
int nmove = vmoves.get_num_loc();
#pragma omp parallel for default(none) \
private(buffer_b,m,s) \
firstprivate(e) \
shared(vmoves,b,value,best_s,best_move,cout,nmove)
for(int i = 0 ; i < nmove ; i++) {
m = vmoves.loc(i) ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
#pragma omp critical
{
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
}
>With critical, only one thread will be executing the enclosing code at a time. This protects the variables inside from being corrupted. Note that we also include cout inside critical, this is because in some platforms, multiple threads writing to stdout will make the output screen becomes a mess. So, we shall treat cout as a variable that may be written by multiple threads at the same time. Hence, protect it!
Considering different evaluations might have very different computation time, we want a thread to move on computing the next move whenever the current move is done. So we set the schedule to dynamic.
int nmove = vmoves.get_num_loc();
#pragma omp parallel for default(none) \
private(buffer_b,m,s) \
firstprivate(e) \
shared(vmoves,b,value,best_s,best_move,cout,nmove) \
schedule(dynamic)
for(int i = 0 ; i < nmove ; i++) {
m = vmoves.loc(i) ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
#pragma omp critical
{
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
}
>Finally, we want to compare the results with our sequential version, to make sure we get the parallel version right. As the parallel version may have best_s and best_move being updated in different orders, we now try to make it being updated sequentially, the same order as the sequential version. For normal run, we do not need these.
int nmove = vmoves.get_num_loc();
#pragma omp parallel for default(none) \
private(buffer_b,m,s) \
firstprivate(e) \
shared(vmoves,b,value,best_s,best_move,cout,nmove) \
schedule(dynamic) \
ordered
for(int i = 0 ; i < nmove ; i++) {
m = vmoves.loc(i) ;
buffer_b = b ;
buffer_b.move_at(m, value) ;
s = e.eval(buffer_b, value, (value+1)%2) ;
#pragma omp ordered
{
#pragma omp critical
{
cout << "(" << m.x << "," << m.y << "): " << s << endl ;
if(s > best_s) {
best_s = s ;
best_move = m ;
}
}
}
}
>The outputs of our parallel version is exactly the same as our original sequential version. See the amount of improvements we get:
# Sequential run xman@sai xversi-c-benchmark-v0.1 $ time ./reversi White: 44 Black: 20 Recursive evaluate: 53671516 Recursive end eval: 135214583 Evaluation: 44228633 real 3m13.662s user 3m13.364s sys 0m0.096s # Parallel run with ordered xman@sai xversi-c $ export OMP_NUM_THREADS=2 xman@sai xversi-c $ time ./reversi White: 44 Black: 20 Recursive evaluate: 53671516 Recursive end eval: 135214583 Evaluation: 44228633 real 2m28.802s user 3m47.450s sys 0m0.860s # Parallel run without ordered xman@sai xversi-c $ export OMP_NUM_THREADS=2 xman@sai xversi-c $ time ./reversi White: 18 Black: 46 Recursive evaluate: 41262916 Recursive end eval: 170646014 Evaluation: 34046530 real 2m10.502s user 3m34.237s sys 0m0.432s
That's about 1.3x speed up for the parallel version with ordered on a dual-core processor. :) It's a bit low, there seems to be load unbalance problem. There are certainly ways to further improve the performance if we spend more time on it. :) But imagine, this parallelization only takes a few lines, perhaps an hour to parallelize it correctly.
The parallel run without ordered is slightly faster. However it's difficult to quantify the improvement, because without ordered, it will result in different execution path, hence different amount of computations compared to the sequential version.
The package with OpenMP included can be obtained at OpenMP Parallel Reversi in C++ under GPL license. But you have to compile the OpenMP Reversi using OpenMP supported compiler such as GNU C++ that comes with FC6 or Intel C/C++ compiler.
# GNU C++ Compiler comes with FC6 $ g++ -fopenmp -DNDEBUG -O3 -Wall main.cpp board.cpp evaluate.cpp gen_move.cpp -o reversi # Intel C/C++ Compiler $ icc -openmp -DNDEBUG -O3 -Wall main.cpp board.cpp evaluate.cpp gen_move.cpp -o reversi
Post new comment