Quad Xeon 4600v2 Performance - Zemax OpticStudio14

Dr Donald Kinghorn (HPC and Scientific Computing)

Posted on April 23, 2014 by Dr Donald Kinghorn



Need the most compute capability you can get in a single box for a well-written, multithreaded application? We’ll take a look at one such application, Zemax OpticStudio14, running on a quad socket Ivy Bridge Xeon system. Performance was excellent!

Quad socket, high core count systems can provide optimal performance for software that has good SMP thread scaling but is not designed for, or is just not practical to use with, traditional HPC distributed memory, multi-node cluster systems. With current generation many-core quad socket systems it’s possible to have the compute capability of a small cluster in a single system. This has the advantage of only needing to maintain a single system, and it may have an advantage for software requiring commercial licensing too. As long as that software can take advantage of threaded parallel execution on many compute cores, a quad socket system is going to be as good as it gets. This is ideal for many compute intensive Windows applications that have been developed using efficient multithreading libraries that typically allow for as many as 64 threads. Another use case for many-core quad socket systems would be workloads that require many independent applications or instances, such as virtualization. However, we are considering only the case of a well-written, thread parallel, compute intensive application.

The software we are looking at for this test case is Zemax OpticStudio. This is high-end engineering software for optical, illumination, and laser systems design running on Windows. Testing was done as a courtesy for a prospective customer with cooperation from the software vendor, Zemax. The program test version was Zemax OpticStudio14 Sp1.

The performance metric is “million ray-surfaces per second” using the Zemax sample system “Double Gauss 28 degree field”, which is their standard benchmark job. The job has a relatively short run time of less than a minute, but it did show the thread scaling that we were most interested in. The software is designed to take advantage of up to 64 threads.

The test system was a Puget Peak Quad Xeon Tower

  • 4 x Intel Xeon E5-4624Lv2 (1.9GHz) TEN CORE
  • 16 x 4GB DDR3-1600 REG ECC memory
  • ...
  • Windows Server 2008R2

Note: this is a testbench system using quad socket Ivy Bridge engineering sample CPUs. We normally configure our Quad Xeon systems with higher clocked CPUs.

Our biggest open question when we started the testing was: how well will the code scale on the 40 cores of the test system? It turns out that going from 10 cores to 20 cores gave nearly perfect linear scaling, essentially doubling the performance, and going from 10 to 40 cores gave a speedup of 3.7 over the 10-core performance, which is still very good scaling. We expect the “sweet spot” for thread scaling to be at 32 cores. Couple that with a higher CPU clock and you have the basis for our recommended optimal system for this application.
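To put the scaling figures above in perspective, here is a minimal sketch that converts a speedup into parallel efficiency (observed speedup divided by the ideal linear speedup). The function name is hypothetical, and the only measured input is the 3.7x speedup from 10 to 40 cores quoted above.

```python
def parallel_efficiency(speedup: float, base_cores: int, cores: int) -> float:
    """Fraction of ideal linear speedup achieved going from base_cores to cores."""
    ideal = cores / base_cores  # perfect linear scaling would give this speedup
    return speedup / ideal


# Speedup of 3.7 reported when going from 10 cores to 40 cores.
eff_40 = parallel_efficiency(3.7, 10, 40)
print(f"10 -> 40 cores: {eff_40:.1%} of ideal")  # 10 -> 40 cores: 92.5% of ideal
```

An efficiency above 90% at four times the core count is consistent with calling the scaling "very good" rather than perfect.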

Performance results, “Double Gauss 28 degree field”

 

* The baseline reference was the customer's dual Xeon system, E5640 @2.66GHz (8 total cores)

** Based on the scaling and performance we can make a good prediction about our recommended system for this application: 4 x Intel Xeon E5-4627v2 (3.3GHz) EIGHT CORE. Typical job times for the customer we were testing for were over an hour, so we are confident that a higher clocked processor for optimal individual thread performance, combined with a lower core count for optimal thread scaling, would be the best overall system recommendation.

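Performance is predicted by scaling the results observed at 40 cores and at 20 cores on the test system: [observed performance * thread scale factor * clock scale factor], where the thread scale factor is 32 divided by the observed core count and the clock scale factor is 3.3GHz / 1.9GHz.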

508* (32/40.0)*(3.3/1.9)
ans =  705.85

272* (32/20.0)*(3.3/1.9)
ans =  755.87

Thus the recommended quad 8-core 3.3GHz system is predicted to achieve performance between 706 and 756 million ray-surfaces per second.
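The estimate above can be reproduced with a short Python sketch. The function and parameter names are hypothetical; the inputs (508 and 272 million ray-surfaces per second at 40 and 20 cores, 1.9GHz test clock, 3.3GHz and 32 cores for the target) are the figures used in the calculation above. Note that this assumes performance scales linearly with both core count and clock speed, which is a simplification.

```python
def predict_rate(observed: float, observed_cores: int, target_cores: int,
                 observed_ghz: float, target_ghz: float) -> float:
    """Scale an observed rate linearly by core count and by clock speed."""
    thread_scale = target_cores / observed_cores
    clock_scale = target_ghz / observed_ghz
    return observed * thread_scale * clock_scale


# Scale down from the 40-core result and up from the 20-core result.
low = predict_rate(508, 40, 32, 1.9, 3.3)
high = predict_rate(272, 20, 32, 1.9, 3.3)
print(f"{low:.2f} {high:.2f}")  # 705.85 755.87
```

Scaling from the 40-core result gives the lower bound because the 40-core run already includes some scaling loss, while the 20-core result was measured where scaling was nearly linear.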

Our test numbers were enthusiastically received by the Zemax team and we were told that these were the best performance numbers they have seen reported for their software!

A quad socket, high core count, high CPU clock, Xeon Ivy Bridge based box is going to be hard to beat for a compute intensive, multithreaded SMP application with good thread scaling. The improvements Intel has made in the Ivy Bridge version of their 4600v2 series Xeon processors make it a formidable single box compute platform. A few years ago I would not have recommended a quad socket system except in unusual circumstances. I have done some other testing with our testbench quad Xeon system and have been pleasantly surprised that memory contention, processor affinity, and poor thread scheduling problems have mostly disappeared. In general a quad socket system will cost more than multiple dual socket systems but if you are restricted to running your jobs on a single box the performance is impressive!

Happy computing --dbk


Tags: Quad Xeon, Zemax


Simon

Thank you for this very interesting entry.
I wonder whether hyperthreading has to be disabled for these tests, since the 80 logical cores resulting from HT cannot all be used by ZEMAX (only 64). I could imagine that ZEMAX would go down to 40 logical (20 physical) cores in that case, which would not push the system to its limits.
If the CPU limit were raised to >80 by the ZEMAX team, and if ZEMAX profited from HT, the performance might be even higher.

Simon

Posted on 2014-05-07 14:53:08
Andrew

I'm a heavy user of Zemax and benefit from its multi-threaded computation, but I also need to use legacy software that can only handle single threaded computation. These single threaded calculations are often the bottleneck. The question is, what processor should I select to get the best of both worlds?

Posted on 2014-05-08 18:39:59
Chris

Hi Dr. Kinghorn,

Have you used any other multithreaded optical design packages such as LightTools, ASAP or TracePro? How does Zemax compare? Do you have the same hardware recommendations for these programs?

Thanks,
Chris

Posted on 2014-07-09 17:17:14
Donald Kinghorn

Hi Chris, I have not tried any other packages like that. This was a favour to a customer with cooperation from Zemax (they appreciated the testing too!). Somewhat related, I did some testing with POV-Ray and got interesting results about scaling and hyperthreading... My last blog post was about that. Zemax scaled very well and I had hyperthreading on during the testing. I did go back and run a couple of tests with hyperthreading off and got about 10% better performance with OpticStudio.

I expect that these other packages will scale well too. I doubt that I could test all of the packages you mention, but if you have one in particular in mind I can check with the vendor and see if they are interested (they might be!)
Best regards --Don

Posted on 2014-07-10 01:10:03
Rick Yarussi

I built a system with an Asus Z9PED8 and two Xeon 2687W engineering samples at 3GHz, and got 420 million RSS. I would have expected the quad CPU system to be more than 2x faster than my system...

Posted on 2014-07-11 19:03:51
Donald Kinghorn

Hi Rick, a few things: I was conservative in my predicted numbers. Also, I had Hyper-Threading on during the testing, and in later tests I saw that it lowered performance by around 10%. So you put all that together, along with the excellent (but not perfect) scaling of Zemax, and you are about right. Add that 10% into my estimate and you are just about doubled from what you see. Best wishes -Don

Posted on 2014-07-13 18:12:58
Rick Yarussi

But each of the E5-4627v2 (3.3GHz, 8 core) should be MUCH faster than each of the (3 year old) 2687W (3GHz, 8 core) that I have... Same number of cores, similar clock speed, but much newer processors... The recommended system would have twice as many CPUs (4 instead of 2), so it should be much MORE than 2x faster than mine...

Posted on 2014-07-14 01:27:36
Rick Yarussi

FYI, my 2687W are engineering samples so the turbo speed is very limited, 3.1GHz, I think.

Posted on 2014-07-14 01:39:20
Donald Kinghorn

... Sandy Bridge to Ivy Bridge is mostly a small clock bump and an increased core count on some processors... and better cache and power management. Performance is mostly the same considering the above. I think some memory bound problems benefit ... Now, for Haswell it will be a different story ... you already see it in the consumer 4 core processors. See my post on Ivy vs Haswell ... the new Xeon EPs will be sampling soon :-)

Posted on 2014-07-17 02:47:53