Performance of Xeon Phi on PCIe X8

The PCIe slot count and performance characteristics are among the differentiating features of motherboards. They are often a point of confusion, or simply overlooked, when choosing a motherboard for a specific use. Part of the confusion comes from the fact that the slot size designation and the mode of operation use the same nomenclature but are not required to match. For example, an X16 size slot may only operate in X8 mode. It is also common for the mode to change depending on slot usage, e.g. a board with two X16 slots may provide X16 mode when only one slot is in use but drop to X8/X8 mode when both slots are in use. To add to the confusion, many cards are built for an X16 sized slot but only operate in X8 mode (or less). Still more confusion arises from cards that can operate in X16 mode but show little perceived difference in application performance when used in X8 mode, as is the case with some video cards. What about a high performance co-processing device like the Xeon Phi?

The Xeon Phi is a PCIe v2 card that uses an X16 slot and is capable of operating in X16 mode. Will the Xeon Phi work in an X8 mode slot? If so, will the performance degrade?
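As a rough guide to what to expect, the theoretical peak bandwidth of a PCIe v2 link can be worked out from the lane count. The sketch below assumes the standard PCIe v2 figures of 5 GT/s per lane and 8b/10b line encoding; the function name is just illustrative, and real transfers will come in below these numbers because of packet and protocol overhead.

```python
# Theoretical peak PCIe v2 bandwidth per direction (illustrative sketch).
GT_PER_S_PCIE_V2 = 5.0        # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b encoding: 8 data bits per 10 bits on the wire

def pcie_v2_bandwidth_gb_s(lanes):
    """Return the peak one-direction bandwidth in GB/s for a PCIe v2 link."""
    gbit_per_lane = GT_PER_S_PCIE_V2 * ENCODING_EFFICIENCY  # usable Gbit/s per lane
    return lanes * gbit_per_lane / 8                         # bits -> bytes

for lanes in (8, 16):
    print(f"x{lanes}: {pcie_v2_bandwidth_gb_s(lanes):.1f} GB/s")
```

This gives 4.0 GB/s for X8 and 8.0 GB/s for X16, so any workload limited by host-to-card transfers has at most half the bus bandwidth available in X8 mode.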

I've been working with a microATX Haswell Z87 chipset motherboard with an experimental custom BIOS for use with the Xeon Phi. This board's PCIe configuration provides either one slot in X16 mode or two slots in X8 mode. So, for example, if I use a Phi 3120A and the on-board Haswell HD4600 graphics, the Phi will be running in X16 mode. If I add a graphics card, then both cards will be operating in X8 mode. This is my specific testing scenario, but you may encounter similar situations on other, more common, dual Xeon E5 motherboards, especially if you are considering using multiple Phi cards.

Test System Information:

[Peak mini with Xeon Phi 3120A and added ASUS GTX 670 DC Mini video card]

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
MemTotal:       32628288 kB

BIOS:
Above 4G decoding   Enabled

BIOS Information
        Vendor: American Megatrends Inc.
        Version: Puget Systems custom 9922

Base Board Information
        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: GRYPHON Z87
...
[kinghorn@i7 ]$ lspci -s 01:00 
01:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series

Phi (a.k.a. MIC) card info:

MicInfo Utility Log

Created Wed Oct 23 15:54:44 2013

	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-358.23.2.el6.x86_64
		Driver Version		: 6720-19
		MPSS Version		: 2.1.6720-19
		Host Physical Memory	: 32819 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : 2.1.03.0386
		SMC Firmware Version	 : 1.15.4830
		SMC Boot Loader Version	 : 1.8.4326
		uOS Version 		 : 2.6.38.8-gefd324e
		Device Serial Number 	 : ADKC32400086

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x225d
		Subsystem ID 		 : 0x3608
		Coprocessor Stepping ID	 : 2
	   PCIe Width 		         : x8
	   PCIe Speed 		         : 5 GT/s
	   PCIe Max payload size	 : 128 bytes
	   PCIe Max read req size	 : 512 bytes
           ...

System difference without added video card

           ...
	   PCIe Width 		         : x16
	   PCIe Speed 		         : 5 GT/s
	   PCIe Max payload size	 : 256 bytes
	   PCIe Max read req size	 : 512 bytes
           ...

From the info above you can see that (not surprisingly) running the Phi at X8 also cuts the PCIe max payload size in half, from 256 bytes to 128 bytes.
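If you want to check these values on your own system without reading through the whole log, the PCIe lines can be pulled out of the micinfo text with a few lines of Python. This is only a sketch: it assumes the "name : value" layout shown in the log above, and the function name and sample string are made up for illustration.

```python
import re

def pcie_fields(micinfo_text):
    """Collect the 'PCIe ...' name/value pairs from micinfo-style text."""
    fields = {}
    for line in micinfo_text.splitlines():
        # Match lines such as "   PCIe Width      : x8"
        match = re.match(r"\s*(PCIe [^:]+?)\s*:\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

# Sample text copied from the X8 log above (whitespace simplified).
sample = """
    PCIe Width               : x8
    PCIe Speed               : 5 GT/s
    PCIe Max payload size    : 128 bytes
    PCIe Max read req size   : 512 bytes
"""

info = pcie_fields(sample)
print(info["PCIe Width"], "/", info["PCIe Max payload size"])
```

In practice you would pass in the saved micinfo output instead of the sample string.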

Xeon Phi performance at PCIe X8 and X16

To test the effect of running the Phi in PCIe X8 and X16 mode we ran selected performance benchmarks using the micprun utility included with the MPSS driver install for Phi. The following tests were run:

DGEM (native): This is the standard double precision matrix multiply (DGEMM) benchmark, using the BLAS routine included in Intel MKL. The native version executes on the Phi card only and does not pass computational data between host and card. (Matrix size is 7680×7680.)
DGEM (pragma): The pragma offload version is started on the host and the computation is offloaded to the card using the compiler's offload pragma.
SHOC (download): This is the BusSpeedDownload (host to device) benchmark from the SHOC suite, using the pragma offload method for MIC from Intel Composer XE to transfer the data.

	DGEM (native) (GFLOPS)	DGEM (pragma) (GFLOPS)	SHOC (download) (GB/s)
Peak mini X8	811	589	3.308
Peak mini X16	812	645	7.203
Intel reference*	812	641	6.927

*The "Intel reference" system is a dual Xeon E5-2670 @ 2.60GHz with 64GB and a Xeon Phi 3120A card.

The co-processor / accelerator PCIe data transfer bottleneck is pretty clear in these results. The obvious effect of running at X8 vs X16 is seen in the host-to-device SHOC numbers, with data transfer performance dropping by more than 50% at X8. It is also apparent in the DGEM (pragma) result, where we lose performance on the order of 10% for this particular job run when going from X16 to X8. There is also a glaring reminder that data transfer for offload computation can have a significant impact on performance, which we see here as a greater than 20% drop compared to native execution even at X16 bus width. "Real" programs will, of course, vary wildly in this regard, since the performance impact depends on how much, and how often, data needs to be moved across the bus and on how well this is accommodated by buffering and local data reuse.
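For anyone who wants to check the percentages quoted above, here is the arithmetic as a short Python sketch. The numbers are the Peak mini rows from the results table; the dictionary layout and function name are just for illustration.

```python
# Peak mini results from the table above.
results = {
    "X8":  {"native": 811, "pragma": 589, "download": 3.308},
    "X16": {"native": 812, "pragma": 645, "download": 7.203},
}

def percent_drop(baseline, value):
    """Return how far value falls below baseline, as a percentage."""
    return 100.0 * (baseline - value) / baseline

x8, x16 = results["X8"], results["X16"]

shoc_drop = percent_drop(x16["download"], x8["download"])
pragma_drop = percent_drop(x16["pragma"], x8["pragma"])
offload_cost = percent_drop(x16["native"], x16["pragma"])

print(f"SHOC download, X16 -> X8:        {shoc_drop:.1f}%")
print(f"DGEM (pragma), X16 -> X8:        {pragma_drop:.1f}%")
print(f"DGEM native -> pragma (at X16):  {offload_cost:.1f}%")
```

This prints 54.1%, 8.7% and 20.6% respectively.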

Fortunately, the Haswell HD4600 graphics on the Peak mini are quite good and well suited for a small developer workstation. This leaves the full X16 PCIe bandwidth available for an added Intel Xeon Phi or Nvidia Tesla card.

Happy Computing! –dbk

Tags: HPC, Intel, Xeon Phi