Hardware Recommendations for Data Science
Our hardware recommendations for data science and analysis workstations below are provided by Dr. Don Kinghorn. These follow some standard patterns, but keep in mind that your specific workflow may have unique requirements.
Data Science System Requirements
Data science and data analysis are closely coupled with methods from machine learning, so there are some similarities here with our Hardware Recommendations for ML/AI. However, data analysis, preparation, munging, cleaning, visualization, etc. do present unique challenges for system configuration. Extract, Transform, and Load (ETL) and Exploratory Data Analysis (EDA) are critical components of machine learning projects, as well as being indispensable parts of business processes and forecasting.
The “best” hardware will follow some standard patterns, but your specific application may have unique optimal requirements. The Q&A discussion below, with answers provided by Dr. Donald Kinghorn, will be mostly generalities based on typical workflows. We also recommend that you visit his HPC blog for more info.
Processor (CPU)
In data science, a significant amount of effort goes into moving and transforming large data sets. The CPU, with its ability to access large amounts of memory, may dominate these workflows, in contrast to ML/DL where GPU compute dominates. Multi-core parallelism will depend on the task, but data processing often parallelizes very well.
What CPU is best for data science?
The two recommended CPU platforms are Intel’s Xeon W and AMD’s Threadripper PRO. Both of these offer high core counts, excellent memory performance and capacity, and large numbers of PCIe lanes. Specifically, the 32-core versions of either platform are recommended because they combine good core utilization with balanced memory performance.
Do more CPU cores make data science workflows faster?
The number of cores to choose will depend on the expected load and parallelism of tasks in your workflow. Larger numbers of cores may also allow for multiple simultaneous processes. An easy recommendation is 32 cores on either the Intel or AMD platform mentioned above. The 96- or 64-core TR PRO may be ideal if you have highly data-parallel tasks with a significant amount of time spent in computation, but scaling may not be as efficient as with the 32-core if memory access is a limiting factor. In any case, a 16-core processor would probably be considered the minimum.
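As a rough illustration of the data-parallel pattern described above, the following minimal sketch splits a data set into one chunk per core and processes the chunks in separate worker processes using Python's standard library. The function names (clean_chunk, split, parallel_sum) and the toy transform are hypothetical, chosen only for illustration; real workloads may scale less well if memory access is the bottleneck.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def clean_chunk(chunk):
    """Hypothetical per-chunk work: drop negative values and sum the rest."""
    return sum(x for x in chunk if x >= 0)


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=None):
    """Process one chunk per worker process and combine the partial results."""
    workers = workers or os.cpu_count() or 1
    chunks = split(data, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(clean_chunk, chunks))


if __name__ == "__main__":
    data = list(range(-1000, 100_000))
    print(parallel_sum(data))
```

The chunk count here simply follows the core count. In practice the best chunk size depends on how much work each chunk represents relative to the cost of moving it between processes.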
Does data science work better with Intel or AMD CPUs?
It is mostly a matter of preference. However, the Intel platform would be recommended if your workflow could benefit from AVX-512 extensions or from some of the tools in the Intel oneAPI AI Analytics Toolkit, such as Modin, a pandas alternative that is optimized for Intel hardware.
Video Card (GPU)
Since the mid-2010s, GPU acceleration has been the driving force enabling rapid advancements in machine learning and AI research. NVIDIA has had a massive impact in this field. For data science, the GPU may offer significant performance gains over the CPU for some tasks. However, GPUs may be limited by memory capacity and by the availability of suitable applications for data tasks outside of model training.
What type of GPU (video card) is best for data science?
NVIDIA dominates GPU compute acceleration and is unquestionably the standard. Their GPUs will be the most widely supported and the easiest to work with. NVIDIA also provides an excellent data-handling application suite called RAPIDS, which may provide significant gains in workflow throughput.
How much VRAM (video memory) does data science need?
This is dependent on the “feature space” of your data. Memory capacity on GPUs is limited compared to the main system memory utilized by CPUs, and applications may be constrained by this. This is why it’s common for a data scientist to be tasked with “data and feature reduction” prior to model training. That is often 80+% of the hard work for ML/AI projects. For some jobs, GPU memory may be a limiting factor even when there is a GPU-accelerated tool available for the data work. For larger data problems, the 48GB available on the NVIDIA RTX A6000 may be necessary, and even that may not be enough for jobs that require all data to be resident on the device. Data movement can also be a bottleneck: GPUs have such high compute performance that they may sit idle for a large percentage of the time while waiting for data to be moved.
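To make the memory constraint concrete, here is a minimal back-of-the-envelope sketch that estimates whether a dense numeric data set would fit in GPU memory. The function names, the 4-byte (float32) value size, and the 50% headroom reserved for intermediate results are all assumptions made for illustration, not figures from any particular tool; the 48GB default matches the card mentioned above.

```python
def dataset_bytes(n_rows, n_features, bytes_per_value=4):
    """Raw size of a dense table, assuming fixed-width values (float32 = 4 bytes)."""
    return n_rows * n_features * bytes_per_value


def fits_on_gpu(n_rows, n_features, vram_gb=48, bytes_per_value=4, headroom=0.5):
    """Return True if the raw data fits in the usable share of VRAM.

    headroom is the assumed fraction of VRAM available for the raw data;
    the remainder is left for intermediates and framework overhead.
    """
    usable = vram_gb * 1024**3 * headroom
    return dataset_bytes(n_rows, n_features, bytes_per_value) <= usable


if __name__ == "__main__":
    # 1 million rows x 100 features of float32 is about 0.4 GB
    print(fits_on_gpu(1_000_000, 100))
    # 1 billion rows x 100 features of float32 is about 400 GB
    print(fits_on_gpu(1_000_000_000, 100))
```

An estimate like this only covers the raw data. Joins, sorts, and model training can need several times more working memory, which is one reason feature reduction is done before the data reaches the GPU.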
Will multiple GPUs improve performance in data science workflows?
For data analysis jobs that can take advantage of GPUs, having more than one may increase workflow throughput. If you will be doing ML/AI jobs, then multi-GPU can be beneficial since many frameworks support it. For data-oriented tasks, multi-GPU may have an advantage simply by providing more available memory to facilitate task parallelism. Not all workflows utilize the GPU well, though, as discussed previously.
Do I need NVLink when using multiple GPUs for data science?
NVIDIA’s NVLink provides a direct, high-performance communication bridge between a pair of GPUs. Whether this is beneficial or not depends on the type of problem. For training many types of models it is not needed. However, for models that have a “history” component, such as RNNs, LSTMs, time-series models, and especially Transformer models, NVLink can offer a significant speedup and is therefore recommended. Please note that not all NVIDIA GPUs support NVLink, and it can only bridge two cards.
Memory (RAM)
CPU memory capacity may be the limiting factor for some data analysis tasks. This is because an entire large data set may need to be resident in memory (in-core). There are methods and tools for “out-of-core” data analysis, but this can slow performance.
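The out-of-core idea mentioned above can be sketched with nothing more than Python's standard library: instead of loading a whole file, the data is streamed one row at a time and only a running total is kept in memory. This is a minimal illustration rather than a recommendation of a specific tool; the function name streaming_mean and the sample column name are hypothetical.

```python
import csv
import io


def streaming_mean(rows, column):
    """Compute the mean of one column while holding only one row in memory."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


if __name__ == "__main__":
    # A small in-memory stand-in for a file too large to load at once.
    sample = io.StringIO("value\n1.0\n2.0\n3.0\n6.0\n")
    print(streaming_mean(csv.DictReader(sample), "value"))
```

The trade-off is the one noted above: every pass over the data goes back to storage, so operations that need several passes, or random access, can be much slower than working in-core.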
How much RAM does data science need?
It is often necessary, or at least desirable, to be able to pull a full data set into memory for processing and statistical work. That could mean BIG memory requirements, as much as 1-2 TB of system memory for the CPU to access.
Storage (Hard Drives)
As with CPU memory, storage requirements will be dictated by your data and projects.
What storage configuration works best for data science?
It’s recommended to use fast NVMe storage whenever possible, since data streaming can become a bottleneck when data is too large to fit in system memory. Staging job runs from NVMe can reduce slowdowns. NVMe drives are commonly available up to 4TB capacity. Alongside the fast NVMe storage for staging jobs, larger-capacity SSDs can be used for data that exceeds the capacity of typical NVMe drives. SSDs are available in capacities up to 8TB. Platter drives can be used for archival storage and for very large data sets, with 18TB+ capacities now available.
Additionally, all of the above drive types can be configured in RAID arrays. This does add complexity to the system configuration and may use up slots on the motherboard which would otherwise support additional GPUs, but it can allow for storage space in the tens to hundreds of terabytes.
Should I use network attached storage for data science?
Network-attached storage is another consideration. It’s become more common for workstation motherboards to have 10Gb Ethernet ports, allowing for network storage connections with reasonably good performance without the need for more specialized networking add-ons.
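For a sense of scale, the following minimal sketch estimates how long it takes to move a data set over a network link. The function name and the efficiency parameter (the assumed fraction of the nominal line rate actually achieved) are illustrative assumptions; real throughput depends on the storage at both ends, the protocol, and the network itself.

```python
def transfer_seconds(size_tb, link_gbps=10.0, efficiency=1.0):
    """Estimated time to move size_tb terabytes over a link_gbps link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits/s).
    efficiency is an assumed fraction of the nominal rate actually achieved.
    """
    bits = size_tb * 1e12 * 8
    return bits / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    # 1 TB over 10Gb Ethernet at the full nominal rate: 800 seconds
    print(transfer_seconds(1))
    # The same transfer over 1Gb Ethernet
    print(transfer_seconds(1, link_gbps=1.0))
```

At the nominal rate, 10Gb Ethernet moves about 1.25 GB per second, so a 1 TB data set takes a little over 13 minutes at best.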