5 Ways of Parallel Programming

Modern computing hardware is all about parallelism. This is because we essentially hit the wall several years ago on increasing core clock frequency to speed up serial code execution. The transistor count has continued to follow Moore’s Law (doubling every 1.5-2 years), but those transistors have mostly gone into multiple cores, vector units, memory controllers, etc. on a single die. To utilize this hardware, software needs to be written to take advantage of it, i.e., you have to go parallel. There is some level of “parallelism” that is realized for serial code by CPU designs that execute multiple instructions per clock cycle, plus the bit of magic that modern optimizing compilers can do. However, to really take advantage of modern computing hardware you have to write code that is targeted toward some method (or methods) of parallel programming. There are 5 basic ways to go parallel:

  • Compiler assistance
  • Library calls
  • Directives
  • Low level hardware targeting
  • Message Passing

The first 4 of these are mostly focused on single node parallelism with vectorization, threading and offload accelerator targeting. This is very important since it is now possible to have a single workstation with dozens of CPU cores and hundreds of accelerator cores capable of handling possibly thousands of execution threads. A high end workstation can rival the compute capability of million dollar “Supercomputers” from only a few years ago!

The 5th method, message passing, is the traditional method of parallel computing and is still very important. Current MPI implementations can take advantage of multiple many-core CPUs on a single node with performance that is as good as (and sometimes better than) methods targeted more directly at multi-threading. And for parallel computing across multiple nodes, message passing is very well understood and widely used.

Compiler assistance (Vectorization and language intrinsics)

This is the blissful utopia that all parallel programmers dream of! Wouldn’t it be nice if all you had to do was pick your favorite programming language, write up the logic and algorithms describing your problem, and then have the compiler magically optimize the parallel performance for whatever hardware the code runs on? Joy! Well, we have a little of that today, but only a very little. Modern compilers can do a reasonable job of recognizing when your code can be auto-vectorized to take advantage of SIMD units like Intel AVX. This can give a significant performance improvement for very little effort, as long as you write code that is vectorization friendly. There are also great programming tools, like “Advisor XE” and “VTune Amplifier XE” included in Intel Parallel Studio XE, that can give you good feedback on where you might be able to restructure your code and add directives or “compiler hints” to get better parallelization.

//  ignore any potential data dependencies and vectorize the loop 
#pragma ivdep
void copy(char *cp_a, char *cp_b, int n) {
  for (int i = 0; i < n; i++) {
    cp_a[i] = cp_b[i];
  }
}
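
As a small illustration of what "vectorization friendly" means (my own sketch, not from any particular codebase), the routine below uses the C99 restrict qualifier so the compiler can prove the arrays don't overlap and auto-vectorize the loop on its own, with no ivdep hint required:

// 'restrict' promises the compiler that a and b never alias, so a simple
// unit-stride loop like this typically auto-vectorizes at higher optimization levels
void scale_add(float * restrict a, const float * restrict b, float s, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = a[i] + s * b[i];
  }
}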

There are efforts to get more parallel constructs into programming languages, along with compilers that can do good things with those constructs. For example, Fortran has had "forall" and "do concurrent" for years, but compiler support for them is often not that great. There is also PGAS (Partitioned Global Address Space); the PGAS model is used in Unified Parallel C, Coarray Fortran, Fortress, Chapel, X10, and Global Arrays. Intel has the "Cilk Plus" extension for C/C++, which offers features for both threading and vectorization (there's a small sketch of it after the snippets below). C++ template libraries can also simplify parallelization. There are actually many efforts in the works!

// UPC: each iteration runs on the thread that has affinity with x_cyc[i]
mydot = 0.0;
upc_forall( i=0; i< SIZE; i++; &x_cyc[i] )
    mydot += x_cyc[i] * y_cyc[i];

! Fortran: "do concurrent" asserts the iterations are independent and may run in any order
do concurrent (i=1:m)
    a(k+i) = a(k+i) + factor*a(l+i)
end do
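
And here's a quick taste of Cilk Plus, as a hedged sketch (it assumes the Intel compiler or a gcc build with Cilk Plus enabled; the saxpy routine names are just illustrative): cilk_for spreads the loop iterations over worker threads, and the array-notation version expresses the same update as an element-wise operation the compiler can vectorize.

#include <cilk/cilk.h>

// threading: iterations are distributed across the Cilk worker threads
void saxpy_threaded(float *a, const float *b, float s, int n) {
  cilk_for (int i = 0; i < n; i++) {
    a[i] = a[i] + s * b[i];
  }
}

// vectorization: Cilk Plus array notation, an element-wise array expression
void saxpy_vectorized(float *a, const float *b, float s, int n) {
  a[0:n] = a[0:n] + s * b[0:n];
}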

In any case, what we really would like to have is a new parallel language and compiler that can actually do the optimal thing with modern parallel hardware without a lot of fuss about how we code up our algorithms. Don't hold your breath waiting for that!

Parallelized library calls

Optimized libraries are great! Someone has done a lot of work tuning those routines to take advantage of their target hardware. BLAS, LAPACK, FFT, and more specialized libraries are available to take advantage of most hardware. Multi-core Intel CPUs and the Xeon Phi can make good use of Intel's MKL, and NVIDIA GPUs have some very nice accelerated libraries built on top of CUDA. When you can use parallel library calls, it's the easiest way for otherwise serial code to take advantage of parallel hardware.

// C = alpha*A*B + beta*C, computed on the GPU by cuBLAS
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
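
The CPU-side story is similar. As a rough sketch (assuming dense matrices that are already allocated and filled, scalar alpha and beta, and linking against MKL or any other threaded BLAS), a single CBLAS call gets you a matrix multiply that runs across all of your cores with no explicit threading code:

#include <mkl.h>   // or <cblas.h> for another CBLAS implementation

// C = alpha*A*B + beta*C; the library handles the threading internally
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);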

Directive-based Programming (OpenMP and OpenACC)

I feel strongly that directive/pragma based programming will be critical to the future success of parallel computing. Modern parallel computing hardware has moved to W-I-D-E parallelism. We now have single nodes with dozens of CPU cores with SIMD vector units and, often, added compute accelerators like GPUs or the Xeon Phi. We now have to look at code execution on possibly hundreds (thousands?) of threads before we even think about going over a cluster interconnect to other nodes. How are you going to do that and have code that performs well and is maintainable? OpenMP is the biggest effort. It has been around since the late 1990's, is now at version 4, and its features are becoming rich and robust. It is the preferred way to do threading in C/C++ and Fortran. In C/C++ the directives are written as #pragma lines, and in Fortran they appear as comments (!$omp ...). OpenMP is implemented in most C and Fortran compilers, and gcc 5 has version 4 support.

OpenMP is a big and important spec, so development seems a little slow at times (that's not necessarily a bad thing!). OpenMP will eventually support offloading to arbitrary compute accelerators like the Xeon Phi and NVIDIA GPUs (there's a small sketch of the 4.x target directives after the examples below). However, not surprisingly, that is difficult for a spec since offload hardware is a moving target. This is where OpenACC comes in. OpenACC runs parallel to the OpenMP effort but is more "bleeding edge" about taking advantage of accelerator hardware. The most work has been done for offloading to NVIDIA GPUs, and PGI and Cray have OpenACC compilers that will do this. This is a GREAT way to work with NVIDIA GPUs (in my opinion)! For most people using a typical workstation and a modern NVIDIA GPU, using OpenACC means buying PGI Accelerator with OpenACC. The PGI compilers are great and I have used them on and off for many years. I think it's worth the investment! However, I have been playing with the gcc 5.1 release and guess what? There is OpenACC support in there! Yea! It is not ready for prime time yet, but soon there will be a freely available and accessible implementation of OpenACC. I think this is hugely important!

// spread the triangular (i,j) update across the OpenMP thread team
#pragma omp parallel shared(a,b,n)
  {
   #pragma omp for schedule(dynamic,1) private (i,j) nowait
    for (i = 1; i < n; i++)
       for (j = 0; j < i; j++)
         b[j + n*i] = (a[j + n*i] + a[j + n*(i-1)]) / 2.0;
  }
! ask the compiler to generate GPU kernels for this matrix-multiply loop nest
!$acc kernels
      do k = 1,n1
       do i = 1,n3
        c(i,k) = 0.0
        do j = 1,n2
         c(i,k) = c(i,k) + a(i,j) * b(j,k)
        enddo
       enddo
      enddo
!$acc end kernels
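
For the offload side, the OpenMP 4.x target directives look roughly like the sketch below (compiler support for these is still uneven as of this writing, so treat it as an illustration of where the spec is headed rather than something you can count on everywhere):

// offload a vector update to an attached accelerator; a is copied back
// to the host when the target region finishes
#pragma omp target map(tofrom: a[0:n]) map(to: b[0:n])
#pragma omp teams distribute parallel for
for (int i = 0; i < n; i++)
  a[i] = a[i] + s * b[i];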

Low-level hardware targeting

I'm thinking mostly about NVIDIA CUDA (Compute Unified Device Architecture) here; coding with a hardware description language (HDL) like VHDL or Verilog for FPGAs (Field Programmable Gate Arrays) is in this realm too, but outside of my experience. CUDA programming is mostly about using its extensions to handle CPU and GPU memory allocation, data transfer, and compute "kernels" that are mapped across blocks of threads on the GPU. It's basically the SIMD (Single Instruction Multiple Data) paradigm implemented on the GPU; you can look at it as mapping kernels across threads rather than loop indices across processes. CUDA is very usable and useful! It is low-level programming, and you are advised to have some idea of the GPU architecture to best envision how to map your problem to the hardware. With CUDA your efforts can be rewarded with tremendous performance out of the GPU. CUDA has been developing for nearly 10 years now and has a vibrant community of developers. I believe that in the future GPU programming will move away from direct use of CUDA and more toward tuned libraries and directives like OpenACC, with CUDA being the underlying engine for accessing the hardware.

...
 // copy the input arrays from host to device memory
 cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
 cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
 // launch the kernel: one thread block per element (simplest possible mapping)
 add<<<N,1>>>(dev_a, dev_b, dev_c);
 // copy the result back to the host
 cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);
...
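
The fragment above leaves out the kernel itself and the device allocations. A minimal version of those pieces, assuming the classic one-element-per-thread-block vector add and that N is a compile-time constant (e.g. a #define), looks something like this:

// kernel: each thread block handles one element, matching the <<<N,1>>> launch
__global__ void add(int *a, int *b, int *c) {
  int i = blockIdx.x;
  if (i < N) c[i] = a[i] + b[i];
}
...
 // device-side buffers that dev_a, dev_b, and dev_c point to
 cudaMalloc((void**)&dev_a, N*sizeof(int));
 cudaMalloc((void**)&dev_b, N*sizeof(int));
 cudaMalloc((void**)&dev_c, N*sizeof(int));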

Message Passing (MPI)

Now for multi-node parallel programming -- message passing communication across a network fabric. MPI (Message Passing Interface) has been around since the early 1990's, and there was also PVM (Parallel Virtual Machine) even before that, but PVM is rarely used now. MPI is "the" standard for distributed memory parallelism, i.e., parallel use of clusters of networked nodes. My first MPI program in the early 90's was a real eye opener. I had just put together a 4 node "Beowulf" cluster, and when I got my code working with MPI and ran it on the cluster I got better than 4x speedup! Superlinear! The reason it went superlinear is that my problem was memory bound, and running on the cluster gave me essentially 4 times the memory space. I was hooked, and so was everyone else. MPI and COTS (Commodity Off The Shelf components) clusters spurred the modern Supercomputer era. MPI is a standard and it is still vibrant and evolving. It has adapted to large SMP nodes by using efficient threading and intra-node communication methods. Its performance is often as good as, or better than, more direct threading methods like OpenMP. Hybrid parallelism using a combination of OpenMP on the nodes and MPI between nodes has become standard accepted practice (there's a bare-bones sketch of that pattern after the Fortran fragment below). MPI together with accelerators like GPUs is also common; MPI becomes the glue between CPU threads that are managing distribution to multiple GPU accelerators.

  ! set up MPI
  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )  
  ...
    ! round-robin the outermost loop over the MPI ranks
    DO k=rank+1,nsym,nproc
     DO j=1,nb
        DO i=j,nb
  ...
  ! sum each rank's partial H into HH on rank 0
  call  MPI_REDUCE(H,HH,nb*nb,MPI_DOUBLE_PRECISION,MPI_SUM,0, &
       MPI_COMM_WORLD,ierr)
  ...
  ! broadcast the result (eng) from rank 0 to all ranks
  call MPI_BCAST(eng,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
  ...
  call MPI_FINALIZE( ierr )
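
Since hybrid MPI + OpenMP came up above, here is the bare-bones shape of that pattern in C (a sketch only: typically one MPI rank per node or per socket, with OpenMP threads doing the on-node work):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int provided, rank;
  // ask for an MPI library that tolerates threaded callers
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // OpenMP threads handle the on-node parallelism,
  // MPI handles communication between the ranks/nodes
  #pragma omp parallel
  printf("rank %d, thread %d of %d\n",
         rank, omp_get_thread_num(), omp_get_num_threads());

  MPI_Finalize();
  return 0;
}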

All of these programming methods are well established and useful. A programmer writing numerically intensive applications for modern high performance computing hardware really has to understand all of them to some extent. It's much more complicated than the "old days" when you just needed to know how to lay out your loops and use MPI. It's worth the effort! The compute capability in even a single workstation with multi-core processors and accelerators is staggering!

Happy computing! --dbk