OpenFOAM performance on Quad socket Xeon and Opteron

OpenFOAM

OpenFOAM (Open source Field Operation And Manipulation) is a collection of libraries and programs for computational fluid dynamics (CFD) and dynamical modeling in general. There is a large collection of “solvers” for different fields (pun intended), from standard Navier-Stokes problems to magnetics and combustion. It is freely available under the GPL license and has a large user base. The main development branch is on Linux/UNIX, but there are “forks” for Windows and commercial variants too.

The parallelization model for OpenFOAM is domain decomposition (via mesh partitioning), implemented with standard MPI (Message Passing Interface) and generally linked against OpenMPI. It can scale well on many-core systems and clusters, and it can even show “super-linear” scaling when the decomposed problem maps better onto the memory systems of multiple nodes or multiple memory controllers. The biggest hindrance to parallel performance comes from the domain decomposition boundaries. That is, the mesh can be partitioned in different ways, and the boundaries of the partitions are inherently non-local to the process assigned to a given partition. This forces communication across those boundaries and is part of the reason the performance can be hard to understand: it becomes dependent on system process and memory binding.

OpenFOAM is not a great program for system benchmarking, but let's do it anyway!

Looking on-line you often find requests for OpenFOAM benchmarks or help with unexpected performance. After spending a few days with OpenFOAM I realized that I could probably show any level of good or bad performance on any computer system by carefully selecting the job, problem size, mesh partitioning, and system process binding. The problem “cases” require understanding of the problem at hand and the use of multiple applications to solve a given case. The tutorials included with the program are very good but not large enough to discriminate performance on modern computer systems. It’s not great as a general benchmark because there really isn’t a good standard problem to run that gives meaningful comparative results across a wide range of system configurations. However, I’m going to do it anyway! I’ll use an incompressible laminar flow on a 2D mesh, with the CFD staple “icoFoam”, a Navier-Stokes solver. I’ll tweak the “cavity” tutorial up to over a million cells (1024 x 1024 x 1 ≈ 1.05 million) and adjust other parameters to make it useful for testing (without regard to how meaningful the actual job case is) and adapt it to parallel execution.

Test Systems

  • Puget Systems Peak Quad Xeon:

    • 4 x Intel Xeon E5-4624L v2 @1.9GHz 10-core
    • 64GB DDR3 1600 Reg ECC
  • Puget Systems Peak Quad Opteron:

    • 4 x AMD Opteron 6344 @2.6GHz 12-core
    • 64GB DDR3 1600 Reg ECC

Software

  • OpenFOAM version 2.3.0-1 (installed following the RHEL rpm install guide)
  • Linux CentOS 6.5
  • OpenMPI version 1.5.4-2

Test Problem

I looked through the OpenFOAM Lid-driven cavity flow tutorial, the Breaking of a dam tutorial, and this nice OpenFOAM Cavity tutorial from the notur — Norwegian Metacenter for Computational Science (their tutorials are specifically designed and intended for their HPC systems and users but they are very informative and nicely done!). In the end I decided on a very fine mesh parallel adaptation of the Lid driven cavity flow, “Cavity”, tutorial. This is based on the icoFoam solver for the Navier-Stokes equations for an incompressible laminar flow.

Here is what I did to get a benchmark job set up, running in parallel, and large enough to be a challenge on a many-core quad socket system (a shell sketch of these steps follows the list).

  • copy the cavity directory to cavityParallel
  • create the file system/decomposeParDict for the parallel domain decomposition.
  • increase the Reynolds number by changing the kinematic viscosity from 0.01 m^2/s to 0.001 m^2/s in file constant/transportProperties
  • change the mesh granularity from 20x20x1 to 1024x1024x1 in the file constant/polyMesh/blockMeshDict
  • In file system/controlDict, decrease the time step to help with solver convergence, since we have a much finer mesh. Change:

    • deltaT from 0.0005 to 0.00001
    • endTime to 0.001
    • writeControl from timeStep to runTime
    • writeInterval to 0.1 second
    • runTimeModifiable to false to reduce I/O
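
Putting those steps together, the case setup looks roughly like the following sketch. It assumes the standard OpenFOAM environment has been sourced so that $FOAM_TUTORIALS and $FOAM_RUN are set; the actual dictionary edits are the files listed below.

# copy the stock cavity tutorial to a new case directory for the parallel runs
mkdir -p $FOAM_RUN
cp -r $FOAM_TUTORIALS/incompressible/icoFoam/cavity $FOAM_RUN/cavityParallel
cd $FOAM_RUN/cavityParallel
# then create/edit the dictionaries described above:
#   system/decomposeParDict          (new file for the domain decomposition)
#   constant/transportProperties     (nu 0.01 -> 0.001)
#   constant/polyMesh/blockMeshDict  (20x20x1 -> 1024x1024x1 mesh)
#   system/controlDict               (deltaT, endTime, write settings)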

Here are the created/modified files:

tutorials/incompressible/icoFoam/cavityParallel/system/decomposeParDict

/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.1.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      decomposeParDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
 
numberOfSubdomains 1;
 
method          simple;
 
simpleCoeffs
{
    n               ( 1 1 1 );
    delta           0.0001;
}
 
distributed     no;
 
roots           ( );
 
 
// ************************************************************************* //

tutorials/incompressible/icoFoam/cavityParallel/constant/transportProperties

/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.3.0                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "constant";
    object      transportProperties;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

nu              nu [ 0 2 -1 0 0 0 0 ] 0.001;


// ************************************************************************* //

tutorials/incompressible/icoFoam/cavityParallel/constant/polyMesh/blockMeshDict

/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.3.0                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      blockMeshDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

convertToMeters 0.1;

vertices
(
    (0 0 0)
    (1 0 0)
    (1 1 0)
    (0 1 0)
    (0 0 0.1)
    (1 0 0.1)
    (1 1 0.1)
    (0 1 0.1)
);

blocks
(
    hex (0 1 2 3 4 5 6 7) (1024 1024 1) simpleGrading (1 1 1)
);

edges
(
);

boundary
(
    movingWall
    {
        type wall;
        faces
        (
            (3 7 6 2)
        );
    }
    fixedWalls
    {
        type wall;
        faces
        (
            (0 4 7 3)
            (2 6 5 1)
            (1 5 4 0)
        );
    }
    frontAndBack
    {
        type empty;
        faces
        (
            (0 3 2 1)
            (4 5 6 7)
        );
    }
);

mergePatchPairs
(
);

// ************************************************************************* //

tutorials/incompressible/icoFoam/cavityParallel/system/controlDict

/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.3.0                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      controlDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

application     icoFoam;

startFrom       startTime;

startTime       0;

stopAt          endTime;

endTime         0.001;

deltaT          0.00001;

writeControl    runTime;

writeInterval   0.1;

purgeWrite      0;

writeFormat     ascii;

writePrecision  6;

writeCompression off;

timeFormat      general;

timePrecision   6;

runTimeModifiable false;


// ************************************************************************* //

Job run procedure

Running the test case with various numbers of MPI processes consisted of the following steps (a scripted version of the sweep is sketched after this list):

  • edit system/decomposeParDict so that “numberOfSubdomains” is equal to the number of MPI processes and set “n” to the mesh partition scheme (the product of the three entries in “n” must equal “numberOfSubdomains”). For example,

    numberOfSubdomains 40;
    
    
    simpleCoeffs
    {
        n               ( 4 10 1 );
        delta           0.0001;
    }
    
  • Then run “blockMesh” to generate the mesh followed by “decomposePar -force” to create a new domain decomposition.
  • Then start icoFoam with mpiexec. Like so …

     
    emacs -nw system/decomposeParDict
    blockMesh 
    decomposePar -force
    mpiexec  -np 40 icoFoam -parallel
    

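To make the sweep over core counts less error-prone, those three steps can be wrapped in a small script. The following is only a sketch, assuming the current directory is the cavityParallel case; the sed edits are crude, and the core-count/partition pairs are the ones used for the Xeon runs below.

#!/bin/bash
# sweep the cavity case over several "<cores>:<nx>:<ny>" decompositions
for RUN in 4:2:2 10:2:5 20:4:5 40:4:10 ; do
    NP=${RUN%%:*}; REST=${RUN#*:}; NX=${REST%%:*}; NY=${REST##*:}
    # point decomposeParDict at this decomposition (crude in-place edits)
    sed -i "s/^numberOfSubdomains.*/numberOfSubdomains ${NP};/" system/decomposeParDict
    sed -i "s/^ *n  *(.*/    n               ( ${NX} ${NY} 1 );/" system/decomposeParDict
    blockMesh           > log.blockMesh.${NP}
    decomposePar -force > log.decomposePar.${NP}
    mpiexec -np ${NP} icoFoam -parallel > log.icoFoam.${NP} 2>&1
    # the solver log reports ExecutionTime and ClockTime for every time step
    grep ClockTime log.icoFoam.${NP} | tail -1
done
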
Note that for jobs other than 1-core and all-cores I experimented with MPI process binding to improve job run time (sometimes dramatically!). "--bind-to-core" binds each MPI process to a core and "--bysocket" assigns processes round-robin across sockets.
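
For example, a 20-process run with these binding flags looks like the line below; "--report-bindings", if your OpenMPI build supports it, prints the resulting core map so you can verify the placement:

mpiexec --bind-to-core --bysocket --report-bindings -np 20 icoFoam -parallel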

Quad Xeon OpenFOAM parallel performance, icoFoam solver, cavity case, 1024x1024x1 mesh

 
# of cores   Mesh (2D)×1   Run Time (sec)   mpiexec flags
40           4×10           202
20           4×5            542
20           4×5            324             --bind-to-core --bysocket
10           2×5           1638
10           2×5            628             --bind-to-core --bysocket   (big improvement!)
4            2×2           2003
4            2×2           1470             --bind-to-core --bysocket
1            1×1           7851

Notes: I didn't account for "turbo-boost" on the low-core-count runs. That could skew the scaling somewhat, but it is still essentially linear. Also, hopefully seeing the effect that core binding can have on this code will bring some comfort and understanding to those who have seen strange performance with OpenFOAM!

Quad Opteron OpenFOAM parallel performance, icoFoam solver, cavity case, 1024x1024x1 mesh

 
# of cores   Mesh (2D)×1   Run Time (sec)   mpiexec flags
48           4×12           290
36           6×6            383             **
36           3×12           337
24           4×6            413
24           2×12           463
12           2×6           1161
4            2×2           5404
1            1×1          13404

Notes: ** The mpiexec flag --bind-to-core caused job failures on most runs and little or no improvement on others. --bind-to-socket didn't cause failures but made little difference. On the job runs for 36 and 24 cores you can see the effect that changing the mesh partition can have.

Discussion

From the results in the tables above you can see that this job scales linearly with core count out to the 40/48 SMP cores that were used in the testing. This is yet another confirmation for me that the quad socket many-core systems are a good alternative to small clusters even for parallel applications that are MPI message passing driven rather than multi-threaded.


OpenFOAM in parallel on the 40- and 48-core quad socket systems achieved near-linear speedup: roughly 39x on 40 cores and 46x on 48 cores!

Another thing that really stands out in the results is the effect that process binding can have! I used switches to mpiexec to manage process binding but there are alternatives with environment variables and system tools. The effect can be dramatic!
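
If you want to see where processes actually landed, generic Linux tools are enough; for example (nothing OpenFOAM-specific here, and numactl may need to be installed separately):

numactl --hardware                  # show the NUMA nodes (sockets) and memory the OS sees
ps -eo pid,psr,comm | grep icoFoam  # which logical CPU each icoFoam rank is on right now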


Advice to OpenFOAM users seeing abnormally poor performance or execution time and clock time discrepancies — look at process core binding!

Poor process and memory binding was a serious problem on early quad socket systems, and there are still lingering memories that make some people leery about using quads. Modern systems and software have taken care of these issues! However, users are advised to take some time to understand the implementation of the parallel codes they are running so they can look out for easily fixable performance anomalies.

Recommended systems

If you play with the numbers in the results tables, normalizing for core count and clock frequency, you will see that the Xeon CPUs are around 2.4 times faster than the Opteron CPUs per core per clock cycle. AND, this is without recompiling OpenFOAM using the Intel compilers and libraries, which could give even better performance on the Xeons (if I find the time I may try that). I would personally prefer a Xeon based system for this code.
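
For example, using the all-core runs from the tables above (which avoids the turbo-boost caveat on the low-core-count runs), the per-core, per-clock comparison works out to roughly:

(290 s × 48 cores × 2.6 GHz) / (202 s × 40 cores × 1.9 GHz) ≈ 2.4 in favor of the Xeons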

I expect approximately equivalent performance from a 64-core quad Opteron system and a quad 6-core Xeon system for around the same system price. Anything over this core-count and CPU clock with Xeon processors should be higher performance (of course).


4 x 16-core Opteron @ 2.4GHz ~ 4 x 6-core Xeon @ 2.6GHz ~ 2 x 12-core Xeon @ 2.7GHz

Since the OpenFOAM scaling was linear to 40/48 cores you can probably expect performance to mostly be a function of core count, clock speed and your budget!

My recommendations for systems would be based around Xeon v2 processors, with my (current) standard favorites being the quad 8-core and quad 12-core Xeon v2 parts.

Happy computing! –dbk