We now have our quad Opteron test-bench system up and running, so I decided to continue testing with Zemax OpticStudio and run a parallel scaling and performance analysis on both this new system and our quad Xeon test system. I fitted the measured parallel performance on the two systems to the Amdahl’s Law equation. With this analysis it’s possible to make speculative performance estimates for processors at different clock speeds and core counts. Be sure to read the caveats at the end of the post!
- OS: Windows Server 2008 R2
- Test software: Zemax OpticStudio14 SP2
- Test job: “Double Gauss 28 degree field.zmx” file running the "performance" option
- Reported values: Millions of Ray Surfaces Per Second
Puget Systems Peak Quad Opteron:
- 4 x AMD Opteron 6344 @2.6GHz 12-core
- 64GB DDR3 1600 Reg ECC
Puget Systems Peak Quad Xeon:
- 4 x Intel Xeon E5-4624L v2 @1.9GHz 10-core
- 64GB DDR3 1600 Reg ECC
Amdahl’s Law says that the speedup of a parallel code is limited by its sequential fraction. For example, if a program runs in one time unit and spends ½ of its time in a sequential code segment and ½ of its time in code that can be run in parallel, its speedup can never be greater than a factor of 2, no matter how many parallel processes are used. A code that spends 99% of its runtime in parallel sections will never exceed a speedup of 100. The following equation shows this relationship in terms of the parallel fraction:
S(n) = T(1)/T(n) = 1/( ( 1-P ) + P/n )
S(n) is the speedup for n parallel processes, P is the “parallel fraction”, and T(n) is the run time with n processes.
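As a quick check on the two examples above, here is a minimal Python sketch of the speedup formula and its large-n limit, 1/(1-P). The function names are my own, chosen for illustration; they are not part of OpticStudio or any library.

```python
def amdahl_speedup(n, p):
    """Speedup S(n) for n parallel processes and parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


def max_speedup(p):
    """Limit of S(n) as n grows without bound: 1 / (1 - p), valid for p < 1."""
    return 1.0 / (1.0 - p)


# The two cases from the text: half parallel, and 99% parallel.
for p in (0.5, 0.99):
    print(f"P = {p}: S(8) = {amdahl_speedup(8, p):.2f}, "
          f"S(1024) = {amdahl_speedup(1024, p):.2f}, "
          f"limit = {max_speedup(p):.0f}")
```

Even with 1024 processes, the speedup stays below the limit set by the sequential fraction.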
We can use measured performance of OpticStudio’s standard benchmark using n processes and then do a non-linear least squares curve fit to the equation above to determine the effective parallel fraction (P) and then use this curve to predict performance at different processor clock speeds.
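To make the fitting step concrete, here is a minimal sketch in plain Python. It is not the procedure used for the fits in this post: it swaps a full non-linear least-squares routine for a simple golden-section search over P, which is enough here because there is only one parameter to fit. The speedup data is synthetic, generated from an assumed P of 0.995 purely to show that the fit recovers it. Substitute your own measured (n, speedup) pairs.

```python
import math


def amdahl_speedup(n, p):
    """Speedup S(n) for n parallel processes and parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


def sum_squared_error(p, data):
    """Sum of squared residuals between the Amdahl curve and (n, speedup) pairs."""
    return sum((amdahl_speedup(n, p) - s) ** 2 for n, s in data)


def fit_parallel_fraction(data, lo=0.0, hi=1.0, tol=1e-10):
    """Golden-section search for the p in [lo, hi] that minimizes the squared error."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if sum_squared_error(c, data) < sum_squared_error(d, data):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


# Synthetic, noise-free data from an ASSUMED parallel fraction (not measured values).
ASSUMED_P = 0.995
synthetic = [(n, amdahl_speedup(n, ASSUMED_P)) for n in (1, 2, 4, 8, 16, 32, 40)]

fitted_p = fit_parallel_fraction(synthetic)
print(f"fitted P = {fitted_p:.6f}")
```

With real, noisy measurements the recovered P is only an "effective" parallel fraction, and a library least-squares routine that also reports uncertainty would be the better tool.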
Quad Xeon and Opteron performance with Zemax OpticStudio
The following two plots show the measured speedup of OpticStudio using 1 to 40 processes on the Xeon system and 1 to 48 processes on the Opteron system. The brown line shows what the performance would be with perfect, i.e. linear, scaling. The green (AMD) and blue (Intel) lines are the Amdahl’s Law curve fits.
The effective parallel fractions P determined from the curve fitting are as follows:
|Xeon:||P = 0.994984|
|Opteron:||P = 0.996589|
Now, to show the data in terms of the actual performance metric, namely “millions of ray surfaces per second”, we observe that the rate at which work is done (W) is inversely proportional to the run time in Amdahl’s Law. Thus,
W(n) = W(1)/( (1-P) + P/n )
The following plots show the work done with n threads and the fit of the equation above.
Finally, to predict the performance of various Xeon and Opteron processors, we can add a clock scaling factor to the W(n) equation using the ratio of CPU clock speeds, C_new/C_old. (C_new is the clock speed of the processor we are interested in; C_old is the clock speed of the processors we used for the performance measurements, namely 1.9GHz for the Xeon and 2.6GHz for the Opteron.) For example, with W(1) = 14.9 for the Xeon test system, the predicted performance of a quad 8-core Xeon E5-4627v2 @3.3GHz is:
W(32) = 3.3/1.9 * 14.9 / ( (1-0.995) + 0.995/32 ) = 717 (million ray surfaces per second)
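The same calculation as a small Python sketch. The function and parameter names are illustrative, W(1) = 14.9 is taken from the worked example above, and P is the fitted Xeon value. The second call assumes all 40 cores of the 4 x 10-core test system at its native clock.

```python
def predicted_work(w1, p, n, c_new, c_old):
    """Clock-scaled Amdahl estimate of W(n), in millions of ray surfaces per second."""
    return (c_new / c_old) * w1 / ((1.0 - p) + p / n)


XEON_W1 = 14.9       # W(1) as used in the worked example
XEON_P = 0.994984    # fitted parallel fraction for the Xeon system
XEON_CLOCK = 1.9     # GHz, clock of the measured Xeon test system

# Quad 8-core E5-4627v2 @3.3GHz: 32 cores total
e5_4627 = predicted_work(XEON_W1, XEON_P, 32, 3.3, XEON_CLOCK)
# The Xeon test system itself: 40 cores at its own clock, so the ratio is 1
test_xeon = predicted_work(XEON_W1, XEON_P, 40, 1.9, XEON_CLOCK)

print(round(e5_4627))    # 717
print(round(test_xeon))  # 498
```

To estimate an Opteron part, swap in the Opteron's fitted P, its 2.6GHz reference clock, and your own single-process measurement for W(1).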
The table below lists most of the currently available quad socket Xeon and Opteron CPUs and their predicted performance on this benchmark. Enjoy!
Predicted Performance for Quad Socket CPUs
|Processor||CPU Base Clock Speed||Total "real" cores (4 CPUs)||Million Ray Surf/sec (±10%)||Price for 1 CPU||Notes|
|Xeon E5-4627v2||3.3GHz||32||717||$2108||This is my personal recommended system CPU|
|Opteron 6344||2.6GHz||48||548||$415||This is the test Opteron system, measured performance was 552|
|Xeon E5-4624Lv2||1.9GHz||40||498||$2405||This is the test Xeon system, measured performance was 508|
The prices for these processors vary quite a bit… We’ll leave performance per dollar as an exercise for the reader 🙂
- It was fun doing this post, but it is just a benchmark job for one particular program. Don't read too much into it! But I'm sure you will anyway 🙂
- This program had very good parallel thread scaling, so performance was mostly dependent on CPU clock and number of cores. This will not always be the case!
- There are a lot of factors that go into finding the "best" hardware for a particular program and job type. Don't oversimplify your decisions.
- If you are a Zemax OpticStudio user, you are welcome to try my performance formula and run the benchmark yourself to compare. I have already had feedback, and the formula seems to be reasonably good even for older hardware and dual socket systems.
Happy computing! –dbk