Skip to main content

Sponsors

Programming

Parallel Programming with SpeedGo Computing

Parallel programming is a new popular curriculum for graduate schools. As processors as in CPU, GPU, etc. are hitting their clock speed limit, higher core count is expected in modern processors instead of higher clock as before. There is no reason why one should hesitate to take up parallel programming course and workshop.

While you may think that there are no obvious applications that require the multiple cores settings, why should I bother? However, reality is cruel. Those standing still simply get obsolete. It's up to you to innovate! Innovating is the key to success.

Check out SpeedGo Computing blog and their website for more information about parallel programming.

Compiling Ruby 1.9.1-p376 on Fedora 12

Posted in

Get the following errors while compiling Ruby 1.9.1-p376?

ossl_ssl.c: In function ‘ossl_sslctx_get_ciphers’:
ossl_ssl.c:626: error: ‘STACK’ undeclared (first use in this function)
ossl_ssl.c:626: error: (Each undeclared identifier is reported only once
ossl_ssl.c:626: error: for each function it appears in.)
ossl_ssl.c:626: error: expected expression before ‘)’ token
ossl_ssl.c:629: error: expected expression before ‘)’ token
ossl_ssl.c:629: error: too few arguments to function ‘sk_value’
ossl_ssl.c: In function ‘ossl_ssl_get_peer_cert_chain’:
ossl_ssl.c:1198: warning: passing argument 1 of ‘sk_num’ from incompatible pointer type
/usr/include/openssl/stack.h:79: note: expected ‘const struct _STACK *’ but argument is of type ‘struct stack_st_X509 *’
ossl_ssl.c:1201: warning: passing argument 1 of ‘sk_value’ from incompatible pointer type
/usr/include/openssl/stack.h:80: note: expected ‘const struct _STACK *’ but argument is of type ‘struct stack_st_X509 *’
ossl_ssl.c: In function ‘ossl_ssl_get_cipher’:
ossl_ssl.c:1223: warning: assignment discards qualifiers from pointer target type
make[1]: *** [ossl_ssl.o] Error 1

Try the patch in this post.

ruby-1.9.1-p376 $ tar zxf fix_stack_not_found.tgz
ruby-1.9.1-p376 $ cat fix_stack_not_found.patch | patch -p1
patching file ext/openssl/ossl.c
patching file ext/openssl/ossl_pkcs7.c
patching file ext/openssl/ossl_ssl.c

ruby-1.9.1-p376 $ make
...

The compilation is now able to complete. However, I didn't verify whether the SSL is working properly.

Parallel QuickSort with OpenMP and Cilk++

The introduction of omp task in OpenMP 3.0 allows easy parallelization of the famous quick sort algorithm. As OpenMP 3.0 support in GCC is relatively new, I like to compare its performance particularly the omp task with a commercial compiler Cilk++ which is designed to efficiently support spawn/sync mechanism. Conceptually, the omp task and the Cilk++ spawn/sync are very similar.

Quick sort chooses a pivot, partitions elements into [< pivot] & [>= pivot], and sorts the two partitions recursively. The implementation in my experiment performs partitioning sequentially, and sorts the two resulting partitions in parallel.

Experiment setup:

  • AMD64 Phenom X4 9950 (2.6GHz)
  • Fedora 10 x86_64
  • GCC 4.4 svn revision 144773
  • Cilk++ 1.0.2

Parallelizing Recursive Fibonacci using OpenMP and Cilk++

OpenMP 3.0 has added an interesting construct called omp task, while Cilk++ is a compiler developed by Cilk Arts to easily parallelize a C/C++ program. It is interesting to see how we can parallelize the following simple fibonacci program.

Serial:

long fib_serial(long n)
{
    if (n < 2) return n;
    return fib_serial(n-1) + fib_serial(n-2);
}

Generating Bit Mask in C

Posted in

Given a number n, I want to generate n number of 1s on the less significant bits. For example, when n is 3, bit mask = 0000 0000 0000 01112. Observe that the bit mask is equivalent to 2n-1. The following codes try to generate the bit mask.

// Note that *long* is 64bit in this case, running on a 64bit Linux OS.
int i;
for (i = 1; i <= 64; i++) {
        printf("method 1: mask %d: %lx\n", i, (1UL << i)-1); // 2^n-1
        printf("method 2: mask %d: %lx\n", i, ~(~0UL << i));
        printf("method 3: mask %d: %lx\n", i, ~0UL >> (-i)); // obscure.
        printf("method 4: mask %d: %lx\n", i, ~0UL >> (64-i));
        printf("method 5: mask %d: %lx\n", i, ~((~0UL-1UL) << (i-1)));
        printf("\n");
}

However, method 1 and method 2 do not work when n is 64. Both of these methods suffer from shift overflow. When it shifts to the left by 64, the shift operator simply refuse to work, and return the original value.

Method 3 is surprisingly working correctly in my environment. Further study indicates that -i is converted to unsigned (large number), and applied mod 64 before the shift. It becomes equivalent to method 4. However, method 3 is not guaranteed to work with all standard compilers. Note that you can't shift by a negative constant e.g. –2.

Method 5 is just a longer way, start with 1111…10 bit pattern and shift left by n-1, then complement.

BASH Problems: Files with Spaces

Previous script xman_dos2unix in xman utility (latest) under GPL has problems on files or directories with space.

Simple work around with double quotes:

for file in "$@" ; do
...
done

Note that all references to $file requires double quotes too.

Another tedious work around with IFS and double quotes:

I use IFS=$'\n', and double quote references to files such as "${file}" to solve the space problems.

OLDIFS=$IFS
IFS=$'\n'
for file in $@ ; do
        # FIXME: I should have put these into a subroutine to reduce code duplications in explore().
        # However, I'm not familiar with subroutines in BASH yet.
        # echo "$file"
        IFS=$OLDIFS
        if [ -d "$file" ] ; then
                echo "Entering $file"
                pushd "$file" > /dev/null
                explore
                echo "Leaving $file"
                popd
        elif [ "`${FILE} \"${file}\" | ${GREP} text`" != "" ] ; then
                $DOS2UNIX "$file"
        fi
        IFS=$'\n'
done
IFS=$OLDIFS

Apply dos2unix on Text Files Recursively using BASH

I found that current dos2unix will produce an intermediate file and stop when it encounters a directory in the arguments. And it cant process folders recursively. Hence, I wrote these scripts to help me to convert all text files in a list of files and folders into DOS or UNIX format.

Explore directories and files recursively.

explore()
{
        local file=""
        ls -1 | while read -r file
        do
                # Recursively explore if we found a directory.
                if [ -d "$file" ] ; then
                        echo "Entering $file"
                        pushd "$file" > /dev/null
                        explore
                        echo "Leaving $file"
                        popd > /dev/null
                # Process it if we found a text file.
                elif [ "`${FILE} ${file} | ${GREP} text`" != "" ] ; then
                        # echo "Processing: $file"
                        $DOS2UNIX "${file}"
                fi
        done
} # end of explore()

Process each of the files and folders given in the command line.

for file in $@ ; do
        # FIXME: I should have put these into a subroutine to reduce code duplications in explore().
        # However, I'm not familiar with subroutines in BASH yet.
        if [ -d "$file" ] ; then
                echo "Entering $file"
                pushd "$file" > /dev/null
                explore
                echo "Leaving $file"
                popd
        elif [ "`${FILE} ${file} | ${GREP} text`" != "" ] ; then
                $DOS2UNIX "$file"
        fi
done

For the entire file, see xman_dos2unix in xman utility (latest) under GPL.

Parallelizing Reversi in C++ using OpenMP - OpenMP Parallel Reversi

Upgrading from single core processor to dual-core processor does not give you double performance for free most of the time. Software written to run single threaded has no idea how to fully utilize a dual-core processor. Fortunately, it is not too difficult to parallelize certain software such as Reversi, because:

  1. Large amount of time is spent in small evaluation function.
  2. Different evaluations can be computed independently.
  3. Many evaluations are to be computed.

Let's see how do I quickly parallelize the Reversi in C++ under GPL license using merely a few lines of OpenMP and minor changes to existing codes.

Following is the code snippet that we want to parallelize, to get multiple iterations computed in parallel.

    for(iter = vmoves.begin() ; iter != vmoves.end() ; iter++) {
      m = *iter ;     
      buffer_b = b ; 
      buffer_b.move_at(m, value) ;      
      s = e.eval(buffer_b, value, (value+1)%2) ; 
      cout << "(" << m.x << "," << m.y << "): " << s << endl ;
      if(s > best_s) {
        best_s = s ;
	best_move = m ;
      }          
    }             

Ooops, we have to be able to compute iter statically (iter points to different possible moves), so that OpenMP can schedule and compute different moves in parallel. So we need to change iter a bit. We have:

Impulse C API in Brief

Posted in

DMA:

Prototype: co_memory co_memory_create(const char *name, const char *loc, size_t size);  
// "Name" is used by external application to identify this memory.
// Create 64 bytes memory from the heap. loc is architecture dependent.
co_memory_create("name", "heap0", 64);

Prototype: void co_memory_readblock(co_memory mem, unsigned int offset, void *buf, size_t buffersize);
Prototype: void co_memory_writeblock(co_memory mem, unsigned int offset, void *buf, size_t buffersize);
Prototype: void *co_memory_ptr(co_memory mem); // Return a C pointer to the buffer of the shared memory.

Fixed-Point Arithmetic:
Example of format specification - 1s8.23 => 1 sign bit, 8 bit integer, 23 bit fraction.
We can choose several modes: saturation, floor, ceil.

co_int16 a = (co_int16) FXCONST16(96,7); // a <- 96 in 1s8.7 format.
c = FXADD16(a,b,8); // Add a and b in 1s7.8 format.
c = FXMUL32(a,b,8); // Multiply a and b in 1s7.8 format.
c = FXDIV16(a,b,10); // c <- a/b in 1s6.9 format.

Macros: FXADD8(), FXADD16(), FXADD32(), FXMUL8(), FXMUL16(), FXMUL32(), ...

Quicksort and C++ Sorting Benchmark on AMD64 X2 4200+

Sorting is an old topic in Computer Science. With the advent of new processors, how's the sorting performance looked like? Following graphs plot the total running time of inserting elements into the containers and sorting the elements. Different number of elements are experimented.


C++ list is implemented as a linked-list. Without random access capability, we expect the sorting time to be very bad. From the experiment, multiset is even worse. Perhaps, if we must use multiset, we probably can sort the elements in vector, and use range insert to insert into multiset? I observed that list and multiset use much much more amount of memory (> 10x)!!! Hence, running 100M elements with 4GB RAM will make swap space into operations.


Vector is clearly the winner for simple sorting applications. It's even better than standard textbook Quick Sort. :) I shall implement other algorithms or variants to compare with this C++ vector implementation.

Syndicate content