Table of Contents

- The Problem – You have lots of small (independent) matrix operation in your code. How can you do those all at once without using loops over library calls?
- Batched Cholesky Decomposition and Matrix Inverse in PyTorch
- CPU and GPU Performance for Batched Cholesky decomposition in PyTorch

An amazing result in this testing is that “batched” code ran in constant time on the GPU. That means that doing the Cholesky decomposition on 1 million matrices took the same amount of time as it did with 10 matrices!

In this post we start looking at performance optimization for the Quantum Mechanics problem/code presented in the first 2 posts. This is the start of the promise to make the code over 15,000 times faster! I still find the speedup hard to believe but it turns out little things can make a big difference.

Part 1) was about describing the problem of minimizing the energy of a small atomic system using a bases set of correlated Gaussian functions. It has most of the math to describe the problem. “Doing Quantum Mechanics with a Machine Learning Framework: PyTorch and Correlated Gaussian Wavefunctions: Part 1) Introduction“

Part 2) presented the formulas that needed to be coded up along with a first working implementation. That code is a straight forward implementation of the math. It is not optimal for performance. It is correct working code which is a really important first step! “PyTorch for Scientific Computing – Quantum Mechanics Example Part 2) Program Before Code Optimizations”

An important take-away in this post is about creating “batched” tensor operations in PyTorch. Some of the important matrix library routines in PyTorch do not support batched operation. Fortunately it is reasonably easy to create them from your own algorithms. Batched operations can give a huge speedup to your code and automatically (automagically!) give you parallel execution on CPU and GPU!

## The Problem – You have lots of small (independent) matrix operation in your code. How can you do those all at once without using loops over library calls?

PyTorch is linked against Intel MKL, NVIDIA cuBLAS and MAGMA. Those libraries are highly optimized and will give some of the best performance you can get out of your hardware. However, they are only efficient for problems with larger matrix dimensions. Matrix sizes of 5,000 x 5,000 elements or larger are usually very efficient. Small matrix operations suffer from library call overhead and just don’t give good hardware utilization. But, if you have lots and lots of those small matrices you may be able to do operations on all of them at the same time!

The Overlap and Kinetic energy integral formulas from my QM problem give a good example of this,

$L_k$ and $L_l$ are small (3×3 for Li atom) lower triangular matrices of (independent variable) parameters that will be optimized. $A_{kl} = A_k + \tau_P^{\prime }A_l\tau_P$ with $\tau$ a symmetry projection matrix, $|A|$ is a determinate and I need the inverse $A_{kl}^{-1}$. The subscripts $k,l$ refer to the elements of the larger matrices that are being constructed from those formulas. That means there are lots of those small matrix operations.

To compute the inverse and determinate we can do a Cholesky decomposition. This is a matrix factorization for symmetric matrices that yields $A = L L^T$ (note: I usually use $\prime$ to indicate transpose but used a $T$ here for clarity.) $L$ is a lower triangular matrix. Once you have that the determinate is just the square of the product of the diagonal elements of $L$ and $A^{-1}$ can be computed by forward substitution. Check out the Wikipedia page on the Cholesky decomposition.

## Batched Cholesky Decomposition and Matrix Inverse in PyTorch

This is the most important concept in this post. Lets say you have many, $n \times n$ matrices, $m$ of them. You want to do an algorithm that acts on the individual elements of those matrices i.e. like a Cholesky decomposition and maybe find determinants and inverses. For a data structure to hold these matrices you can use an $m \times n \times n$ tensor. You can visualize that like this,

Think of that as a stack of matrices. Each i,j matrix element can be considered a 1 x 1 x m “column” vector going through the whole stack. If you build an algorithm that acts on on the i,j elements of a matrix it is trivial to extend that to acting on the whole “column” vector going through that stack at once.

The Cholesky decomposition algorithm is not that difficult. Look at the Wikipedia page. You can work it out with pencil and paper in a few minutes.

### Matrix version of Cholesky decomposition (in PyTorch)

Here’s a Python implementation acting on matrices, (we’ll extend this to a batched tensor version). I’m using PyTorch and will present full working test code further down in the post. [Note: I am not doing any input “sanity” checking in this code.]

```
# Matrix Cholesky decomp
def matrix_cholesky(A):
L = th.zeros_like(A)
for i in range(A.shape[-1]):
for j in range(i+1):
s = 0.0
for k in range(j):
s = s + L[i,k] * L[j,k]
L[i,j] = th.sqrt(A[i,i] - s) if (i == j) else \
(1.0 / L[j,j] * (A[i,j] - s))
return L
```

### Batched tensor version of Cholesky decomposition (in PyTorch)

Now for the batched version. This is a trivial modification!

```
# Batched Cholesky decomp
def batch_cholesky(A):
L = th.zeros_like(A)
for i in range(A.shape[-1]):
for j in range(i+1):
s = 0.0
for k in range(j):
s = s + L[...,i,k] * L[...,j,k]
L[...,i,j] = th.sqrt(A[...,i,i] - s) if (i == j) else \
(1.0 / L[...,j,j] * (A[...,i,j] - s))
return L
```

**All that was need to make a batched version was to add […, ] as first index into the tensor!** The “…” is a “wild-card” notation for the “fist” set of indices of the tensor. I described this as being a “stack” of matrices which means that […, ] is representing 1 index i.e. the “depth” of the stack. However, you could use a “tiled” version where the matrices are arranged as 2-d tiles. In this case […, ] would represent the first 2 indices of the tensor. This same code works for both of these cases. That can come in handy for indexing in some cases. I use both “stacked” and “tiled” in my QM code.

### Modification to allow “Autograd” Gradients (in PyTorch)

There is one “gotcha” with the above code. **If you want to include these operation in autograd for gradients in the computation graph you can not have any “in-place” operations.** The code above modifies L in-place (This is like doing something similar to a[0] += 1 you can’t do that if you want to use autograd!)

I found a way to allow the use of my code with autograd by “cloning”. There might be a better way to do this but what I did was simple and it seems to work fine. Here’s is the modified code,

```
# Batched Cholesky decomp (Autograd safe)
def cholesky(A):
L = th.zeros_like(A)
for i in range(A.shape[-1]):
for j in range(i+1):
s = 0.0
for k in range(j):
s = s + L[...,i,k].clone() * L[...,j,k].clone()
L[...,i,j] = th.sqrt(A[...,i,i] - s) if (i == j) else \
(1.0 / L[...,j,j].clone() * (A[...,i,j] - s))
return L
```

### Batched Matrix Inverse (in PyTorch)

The main reason I need the Cholesky decomposition is to compute matrix inverses. If you have positive definite matrices you can use a Cholesky decomposition and then “trivially” invert the lower triangular matrix from that. Then the inverse is just $A^{-1} = L^{-1} L^{-T}$. Here’s code to do the lower triangular matrix inverse.

```
# Batched inverse of lower triangular matrices
def inverseL(L):
n = L.shape[-1]
invL = th.zeros_like(L)
for j in range(0,n):
invL[...,j,j] = 1.0/L[...,j,j]
for i in range(j+1,n):
S = 0.0
for k in range(i+1):
S = S - L[...,i,k]*invL[...,k,j].clone()
invL[...,i,j] = S/L[...,i,i]
return invL
```

This is safe for use with autograd. [Note: I am not checking for singular matrices in this code. i.e. any diagonal elements of L that are 0.] I wont test with the inverse code but it has the same characteristics as the Cholesky code.

Let’s test the performance!

## CPU and GPU Performance for Batched Cholesky decomposition in PyTorch

This is where the fun starts!

### Hardware being used for testing

I’m running on my personal Puget Systems “Peak Single Xeon Tower” which is configured with;

- Intel Xeon-W 2175 14-core
- 128 GB DDR4 2600MHz memory
- NVIDIA 1080Ti
- NVIDIA Titan V (used for compute)

This is a a system that could be configured here…

I want to note that I love the Titan V! I am borrowing the one I’m using in this testing but I think I will get my own. It is wonderful to be able to use double precision for compute and still have outstanding performance. It is a little expensive but given the great performance and versatility it is a bargain! I highly recommend it. The 32GB “CEO” version would probably be the best scientific workstation compute device you could get given that it is pretty simple to utilized with something like PyTorch.

I’m going to do general testing of the code definitions above. In the next post I’ll show how I use these definitions in my quantum mechanics program. These functions are a big contributor to performance speedup in that program.

For this testing code I’ll do something similar to what I do in my QM program, but a little simpler. I’ll generate a set of positive definite matrices from products of lower triangular matrices and then do the decomposition. This way you could check that the code is giving the right decomposition since you are starting with what are really Cholesky factors! (My QM code uses sums of these kinds of matrices so I still need the decomposition.) I’ll compare performance using the batched code definitions I presented above against a loop over over the built-in Cholesky routine in PyTorch. That built-in is “potrf()”, which is a standard MKL and MAGMA/cuBLAS function acting on matrices. (That’s the Lapack naming scheme for that routine.)

### Batched Cholesky testing code

```
import torch as th
import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
```

You can switch the execution device in the code block below.

```
dtype = th.float64
gpuid = 0
device = th.device("cuda:"+ str(gpuid))
#device = th.device("cpu")
print("Execution device: ",device)
print("PyTorch version: ", th.__version__ )
print("CUDA version: ", th.version.cuda)
print("CUDA device:", th.cuda.get_device_name(gpuid))
```

```
Execution device: cuda:0
PyTorch version: 0.4.1
CUDA available: True
CUDA version: 9.0.176
CUDA device: TITAN V
```

```
# Utility functions
# Batched vech2L input V is nb x n(n+1)/2
def bvech2L(V,nb,n):
count = 0
L = th.zeros((nb,n,n), device=device, dtype=dtype)
for j in range(n):
for i in range(j,n):
L[...,i,j]=V[...,count]
count = count + 1
return L
# Batched Cholesky decomp
def cholesky(A):
L = th.zeros_like(A)
for i in range(A.shape[-1]):
for j in range(i+1):
s = 0.0
for k in range(j):
s = s + L[...,i,k].clone() * L[...,j,k].clone()
L[...,i,j] = th.sqrt(A[...,i,i] - s) if (i == j) else \
(1.0 / L[...,j,j].clone() * (A[...,i,j] - s))
return L
# Batched inverse of lower triangular matrices
def inverseL(L):
n = L.shape[-1]
invL = th.zeros_like(L)
for j in range(0,n):
invL[...,j,j] = 1.0/L[...,j,j]
for i in range(j+1,n):
S = 0.0
for k in range(i+1):
S = S - L[...,i,k]*invL[...,k,j].clone()
invL[...,i,j] = S/L[...,i,i]
return invL
```

Here are a couple of functions to do the testing,

```
def test_loop(n=4,m=1000):
nn = int(n*(n+1)/2)
th.manual_seed(42)
X = th.rand((m,nn), device=device, dtype=dtype)
L = th.add(bvech2L(X,m,n),th.tensor(th.eye(n), device=device, dtype=dtype))
A = th.matmul(L,th.transpose(L, 1, 2))
#print("Shape of A {}".format(A.shape))
start_time = time.time()
cholA = th.zeros_like(A)
for i in range(m):
cholA[i,:,:] = th.potrf(A[i], upper=False)
runtime = time.time() - start_time
#print("loop version took {} seconds ".format(runtime))
return runtime
```

```
def test_batch(n=4,m=1000):
nn = int(n*(n+1)/2)
th.manual_seed(42)
X = th.rand((m,nn), device=device, dtype=dtype)
L = th.add(bvech2L(X,m,n), th.tensor(th.eye(n), device=device, dtype=dtype))
A = th.matmul(L,th.transpose(L, 1, 2))
#print("Shape of A {}".format(A.shape))
start_time = time.time()
cholA = th.zeros_like(A)
cholA = cholesky(A)
runtime = time.time() - start_time
#print("batched version took {} seconds ".format(runtime))
return runtime
```

### Batched vs Looped Cholesky test runs (small matrices 10 x 10)

Lets look at numbers from a few test runs. I’ll make some plots in the next section.

#### GPU Titan V fp64 (double precision)

I have the device set to GPU (Titan V)

##### 10,000 10 x 10 matrices (batched is 1000 times faster – 0.0176 sec vs 17.07 sec)

```
tval = test_loop(n=10,m=10000)
```

```
Shape of A torch.Size([10000, 10, 10])
loop version took 17.073381 seconds
```

```
tval = test_batch(n=10,m=10000)
```

```
Shape of A torch.Size([10000, 10, 10])
batched version took 0.017659 seconds
```

##### 1,000,000 10 x 10 matrices (batched has no slowdown! – 0.0122 sec)

```
tval = test_batch(n=10,m=1000000)
```

```
Shape of A torch.Size([1000000, 10, 10])
batched version took 0.012228 seconds
```

This GPU result is one of the most amazing things I’ve ever seen on a compute device! The calculation actually takes less time to run on 1 million matrices than it does on 10 The compute time is essentially constant no matter how many matrices are in the batch!

**For larger matrices the batched version becomes much slower compared to the looped version.** For example with 100 x 100 matrices doing 10,000 matrices takes 18.4 sec looped and 7.9 sec batched. For much large 1000 x 1000 matrices the looped version is much faster. (potrif is a great algorithm for larger matrices). Doing 10, 1000 x 1000 matrices looped took 0.057 sec and the batched version took so painfully long I gave up waiting for it after about 40 min. Even 10, 200 x 200 matrices took 0.027 sec looped and 58 sec batched. **The larger the matrix the more the loops in the code start to dominate the calculation and the loops are really slow on the GPU.**

#### CPU Intel Xeon-W 2175 14-core (double precision)

The characteristics of execution on the CPU are much different than the GPU. Loops work considerably better, batched is still fast for small matrix sizes.

##### 10,000 10 x 10 matrices (batched is 2 times faster – 0.041 sec vs 0.087 sec)

```
tval = test_loop(n=10,m=10000)
```

```
Shape of A torch.Size([10000, 10, 10])
loop version took 0.087211 seconds
```

```
tval = test_batch(n=10,m=10000)
```

```
Shape of A torch.Size([10000, 10, 10])
batched version took 0.041103 seconds
```

The difference is not so dramatic on the CPU. For even smaller 5 x 5 matrices the batched version is 10 times faster. **For larger 100 x 100 matrices the looped version is much faster on the CPU. 10,000, 100 x 100 matrices took 0.89 sec looped and 45 sec batched.**

The plots in the next section will illustrate the performance differences more clearly.

### Performance plots looped vs batched Cholesky on CPU and GPU

#### Looped vs Batched CPU and GPU

The first plot is 1 to 10,000 (10 x 10) matrices running on CPU. Looped and batched are both reasonably fast on CPU but there is better performance with the batched code as the number of matrices increases.

Running this on the GPU shows the dramatic slowdown with the looped version. It is scaling linearly looped but it is running in constant time for the batched version.

Increasing the matrix size to 100 x 100 causes a flip in looped vs batched runtime on the CPU. The looped version is increasing in time but it’s hared to see in the plot because the batched version is running very slow.

**This is one of the most interesting plots.** Running on GPU with the 100 x100 matrices shows clearly the crossover point around 4000 matrices. The batched version is running in constant time and the looped version is scaling linearly. That trend line continues. Even at 1 million matrices the batched version on GPU takes about 7.5 seconds. It’s independent of the number of matrices in the batch! The looped version would be running for a very long time.

In the next plot I show the matrix size where the two versions running on CPU are scaling about the same i.e. 15 x 15 matrices.

On the GPU the 15 x 15 matrices again show the constant run-time for the batched code.

#### Performance of CPU vs GPU

It is interesting to look at the performance of CPU vs GPU when they are both running the same version of the algorithm. For the looped version the CPU has a definite advantage. The porti() routine works well on CPU. The times are increasing on the CPU even though they look constant in the plot. That’s because the **loops running on the GPU are very slow.**

In this next plot you can see how the CPU scales linearly and again the GPU is running in constant time (it actually gets faster when it has more matrices to do at once!).

#### Overall Performance looped CPU vs batched GPU

This last plot answers the question of when should you used GPU and when should you use CPU. (…at least partially answers.) If you have losts of small matrices the advantage of running on the GPU can be huge because of the constant run-time character. The GPU overtakes the CPU at about 1000 matrixes in the batch. In the case of my quantum mechanics code I have on the order of a million matrices for each optimization “epoch”.

I hope you enjoyed seeing these results and have been as surprised by them as I have been. I was “blown away” when I saw the constant run-times on the GPU! It is clearly advantageous to remove loops from code targeted for the GPU. In the next post I’ll get back top my quantum mechanics code and bring this batched Cholesky and inverse code into play there. I’ll have lots of other optimizations for that original “naive” implementation I did in part 2). There will be a performance increase of several orders of magnitude. It will make the QM code usable for research calculations. I even know a few people that might want to use it!

**Happy computing! –dbk**