**Read this article at https://www.pugetsystems.com/guides/1222**

# PyTorch for Scientific Computing - Quantum Mechanics Example Part 2) Program Before Code Optimizations

Written on August 16, 2018 by Dr Donald Kinghorn- The formulas that need to be coded up
- First "Naive" Implementation of the Problem

* Import PyTorch

* Setup compute device and data-type

* Utility functions

* Compute formulas for matrix elements

* Test the matrix element formulas.

* Define the energy "loss" function

* Test the energy

* Optimize energy using PyTorch Autograd gradients

In this post we'll put together working code to return the ground state energy and gradient for small atoms using the correlated Gaussian basis functions and Hamiltonian energy operator discussed in part 1. I'm not going to repeat much of the theory so have a look at Part 1) if you have no idea of what I'm talking about (that may or may not help!).

The title of Part 1) is "Doing Quantum Mechanics with a Machine Learning Framework: PyTorch and Correlated Gaussian Wavefunctions: Part 1) Introduction". I changed the name of the series in this post to better reflect what I'm trying to convey.

The important take-away in this post is to see how straight forward it is to write code for a complicated scientific application using PyTorch. We will optimize the code for high performance in later posts.

In Part 1) we established the feasibility of using PyTorch for this problem and confirmed that "Automatic Gradients" using PyTorch autograd works as expected.

The essence of the problem is to minimize the smallest eigenvalue, $E$, of the generalized eigenvalue problem,

Where $H = T + V$ is a matrix of evaluated integrals using our basis functions. There are terms for kinetic $T$ and potential $V$ energy operators, $S$ an "overlap" matrix of integral of the basis functions with themselves. $c$ is a vector (eigenvector) of the linear coefficients of the basis expansion. [... and there is a symmetry projection operator mixed in there to make sure we are computing the right wavefunction for the ground state energy.] The basis functions and energy operator were discussed in Part 1).

The energy is the equivalent of the "loss" function in a typical machine learning problem. We want to minimize the energy.

## The formulas that need to be coded up

The formulas we need to compute for the matrix elements look like, (for each $k,l$ basis function)

### Overlap

$L_k$ and $L_l$ are lower triangular matrices of of (independent variable) parameters that will be optimized. $A_{kl} = A_k + \tau_P^{\prime }A_l\tau_P$ with $\tau$ a symmetry projection matrix. $|A|$ means determinate.

### Kinetic energy $T$

This formula is implemented efficiently by utilizing $\mathrm{tr}\left[AB\right]=\sum_{i,j}\left(A_{i,j}B_{i,j}^{\prime}\right)$. That is, form the term by term product of $A$ and $B$ transpose, and then add up all of the components of the product. $M$ is a matrix of reduced mass constants.

### Potential energy $V$

$Q$ is a lower triangular matrix of "charge products" and $R_{ij}$ is a lower triangular matrix constructed from elements in $A_{kl}^{-1}$

**Those are very compact matrix equations for what are really quite elaborate integrals derived with matrix calculus.** If those formulas were written out with summation signs instead of matrix formulas there would be several pages of ugly formulas to consider. Having formulas in matrix form is a big plus for code implementation.

### Energy function

For the energy (loss) function we will not do complete eigenvalue decomposition, we'll use the Rayleigh quotient,

The minimum over $c$ of the Rayleigh quotient is the smallest eigenvalue and that minimal $c$ is the corresponding eigenvector. By doing this we can optimize the linear expansion coefficients in $c$ simultaneously with the non-linear parameters that are contained in the matrices $L_k$.

The independent parameter variables for our energy (loss) function $E$ is the collection of all of the lower triangular elements of the $L_k$ matrices for all of the basis functions along with the vector of expansion coefficients $c$. Those optimization parameters will be placed in a vector x. That will give $E$ as a function of x that will return the energy and gradient at the value of x. We will minimize that.

**Got it? Let's code it up!**

## First "Naive" Implementation of the Problem

The first thing to do is get working code. We will speed it up by a factor of over 15,000 in later posts! (see near the end of the post for a teaser)

**Don't worry about performance yet**, we just want to have something working that can be checked for correctness. In the "old-days" I would have used MATLAB for this step. In fact I had old MATLAB code for for this problem that I ran in Octave (open source 'clone' of MATLAB) as a check to be sure I was getting correct results. It turns out that it is nearly as easy to do this with PyTorch as it was with MATLAB! [ note: I really like MATLAB and Octave! ... but, I am becoming a huge fan of PyTorch!]

You should be able to cut and paste the code blocks below into Python Jupyter notebook cells. To get PyTorch setup to work in a notebook you can follow instructions in my post Why You Should Consider PyTorch (includes Install and a few examples).

#### Import PyTorch

```
import torch as th # PyTorch is imported as torch and will be referenced as "th"
import time
```

#### Setup compute device and data-type

For these calculation float32 is not enough precision! You will need double precision i.e. float64 in order to do the optimization runs. **The Titan V has great float64 performance, GeForce cards do not!** We will use the CPU until we do the code optimizations. (The code as-is will run on the GPU but performance is bad.)

```
dtype = th.float64
gpuid = 0
#device = th.device("cuda:"+ str(gpuid))
device = th.device("cpu")
print("Execution device: ",device)
print("PyTorch version: ", th.__version__ )
print("CUDA available: ", th.cuda.is_available())
print("CUDA version: ", th.version.cuda)
print("CUDA device:", th.cuda.get_device_name(gpuid))
```

```
Execution device: cpu
PyTorch version: 0.4.1
CUDA available: True
CUDA version: 9.0.176
CUDA device: TITAN V
```

#### Utility functions

```
# Utility functions for working with lower triangular matrices
# return the lower triangle of A in column order i.e. vech(A)
def vech(A):
count = 0
c = A.shape[0]
v = th.zeros(c * (c + 1) // 2, device=device, dtype=dtype)
for j in range(c):
for i in range(j,c):
v[count] = A[i,j]
count += 1
return v
# vech2L create lower triangular matrix L from vechA
def vech2L(v,n):
count = 0
L = th.zeros((n,n), device=device, dtype=dtype)
for j in range(n):
for i in range(j,n):
L[i,j]=v[count]
count += 1
return L
```

#### Compute formulas for matrix elements

The following function matrix_elements() implements the formulas presented in the fist part of this post. The code is mostly a direct translation from the math formulas.

```
def matrix_elements(n, vechLk, vechLl, Sym, Mass, vecQ):
'''
Returns: a dictionary with the skl, tkl, vkl matrix elements
n : the size of the Lk matrices
vechL(k,l): vector of parameters use for Lk and Ll matrices
Sym: a single symmetry projection matrix
Mass: a matrix of "reduced mass" values for the particles needed for tkl
vecQ: an array of products of particle charges needed for vkl
'''
# build Lk and Ll
Lk = vech2L(vechLk,n);
Ll = vech2L(vechLl,n);
# apply symmetry projection on Ll
# th.t() is shorthand for th.transpose(X, 0,1)
PLl = th.t(Sym) @ Ll;
# build Ak, Al, Akl, invAkl
Ak = Lk@th.t(Lk);
Al = PLl@th.t(PLl);
Akl = Ak+Al
invAkl = th.inverse(Akl);
# Overlap (normalized)
skl = 2**(3*n/2) * th.sqrt( th.pow(th.abs(th.det(Lk))*th.abs(th.det(Ll))/th.det(Akl) ,3) );
# Kinetic energy
tkl = skl*(6*th.trace(Mass@Ak@invAkl@Al));
# potential energy
TWOoSqrtPI = 1.1283791670955126 # 2/sqrt(pi)
RIJ = th.zeros((n,n), device=device, dtype=dtype);
# 1/rij i~=j
for j in range(0,n-1):
for i in range(j+1,n):
tmp2 = invAkl[i,i] + invAkl[j,j] - 2*invAkl[i,j];
#RIJ[i,j] = TWOoSqrtPI * skl/th.sqrt(tmp2);
RIJ[i,j] = 1/th.sqrt(tmp2)
# 1/rij i=j
for i in range(0,n):
#RIJ[i,i] = TWOoSqrtPI * skl/th.sqrt(invAkl[i,i]);
RIJ[i,i] = 1/th.sqrt(invAkl[i,i])
RIJ = TWOoSqrtPI*skl*RIJ
Q = vech2L(vecQ,n);
vkl = th.sum(RIJ*Q)
return {'skl':skl, 'tkl':tkl, 'vkl':vkl}
```

You can see how close the PyTorch/Python code is to the mathematical matrix formulas. This makes it very quick to get working code running.

#### Test the matrix element formulas.

```
def test_matrix_elements():
# test input with a known result
n = 3;
vechLk = th.tensor([ 1.00000039208682, 0.02548044275764261, 0.3525161612610669,
1.6669144815242515, 0.9630555318946559, 1.8382882034659822
], device=device, dtype=dtype, requires_grad=True);
vechLl = th.tensor([ 1.3353550436464964, 0.9153272033682132, 0.7958636766525028,
1.8326931436447955, 0.3450426931160630, 1.8711839323167831
], device=device, dtype=dtype, requires_grad=True);
Sym = th.tensor([[0,0,1],
[0,1,0],
[1,0,0]], device=device, dtype=dtype);
Mass = th.tensor([[5.446170e-4, 2.723085077e-4, 2.723085077e-4],
[2.723085077e-4, .5002723085, 2.723085077e-4],
[2.723085077e-4, 2.723085077e-4, .5002723085 ]], device=device, dtype=dtype);
vecQ = th.tensor([1, -1, -1, -1, 1, -1], device=device, dtype=dtype);
matels = matrix_elements(n, vechLk, vechLl, Sym, Mass, vecQ)
print('skl: ',matels['skl']) # should be 0.5334
print('tkl: ',matels['tkl']) # should be 4.3509
print('vkl: ',matels['vkl']) # should be -2.3840
```

```
start_time = time.time()
test_matrix_elements()
print(" took {:.6f} seconds ".format(time.time() - start_time))
```

`skl: tensor(0.5334, dtype=torch.float64, grad_fn=`)
tkl: tensor(4.3509, dtype=torch.float64, grad_fn=)
vkl: tensor(-2.3840, dtype=torch.float64, grad_fn=)
took 0.047023 seconds

Those numbers are the expected output for those parameters so we can move on to computing the energy.

#### Define the energy "loss" function

"energy()" loops over the lower triangle of S T and V making calls to the matrix element code. These matrices are then used to create the Hamiltonian matrix H and Overlap S that are then used to compute the energy.

```
def energy(x,n,nb,Mass,Charge,Sym,symc):
'''
Returns: the energy at the point (parameters) x
n: number of particles defined in input
nb: number of basis functions
Mass: matrix of "reduced" mass constants for system
Charge: vector of "charge products" for particle pairs
Sym: a tensor containing the symmetry projection matrices
symc: a vector of coefficients for the symmetry projection terms
'''
nx = len(x);
nn = int(n*(n+1)/2);
nsym = len(symc);
# extract linear coefs "eigen vector" from x
c = x[-nb:];
# reshape non-linear variables for easier indexing
X = th.reshape(x[:nb*nn], (nb,nn))
# Init H S T V
H = th.zeros((nb,nb), device=device, dtype=dtype);
S = th.zeros((nb,nb), device=device, dtype=dtype);
T = th.zeros((nb,nb), device=device, dtype=dtype);
V = th.zeros((nb,nb), device=device, dtype=dtype);
# outer loop is over symmetry terms
for k in range(0,nsym):
for j in range(0,nb):
for i in range(j,nb):
vechLi = X[i,:]
vechLj = X[j,:]
matels = matrix_elements(n,vechLi,vechLj,Sym[:,:,k],Mass,Charge);
S[i,j] += symc[k]*matels['skl'];
T[i,j] += symc[k]*matels['tkl'];
V[i,j] += symc[k]*matels['vkl'];
H = T + V
# complete upper triangle of H and S
for i in range(0,nb):
for j in range(i+1,nb):
H[i,j] = H[j,i];
S[i,j] = S[j,i];
# The energy from the Rayleigh quotient
cHc = c@H@c;
cSc = c@S@c;
eng = cHc/cSc;
return eng
```

#### Test the energy

The following function defines the constants for Li atom (infinite nuclear mass). There is sample input for an 8 term wavefunction with the non-linear parameters in x1.

```
def test_energy():
Mass = th.tensor([[0.5, 0.0, 0.0],
[0.0, 0.5, 0.0],
[0.0, 0.0, 0.5]], device=device, dtype=dtype);
Charge = th.tensor([-3, 1, 1, -3, 1, -3], device=device, dtype=dtype);
# symmetry projection terms
Sym = th.zeros((3,3,6), device=device, dtype=dtype)
# (1)(2)(3)
Sym[:,:,0] = th.tensor([[1,0,0],[0,1,0],[0,0,1]], device=device, dtype=dtype);
# (12)
Sym[:,:,1] = th.tensor([[0,1,0],[1,0,0],[0,0,1]], device=device, dtype=dtype);
# (13)
Sym[:,:,2] = th.tensor([[0,0,1],[0,1,0],[1,0,0]], device=device, dtype=dtype);
# (23)
Sym[:,:,3] = th.tensor([[1,0,0],[0,0,1],[0,1,0]], device=device, dtype=dtype);
# (123)
Sym[:,:,4] = th.tensor([[0,1,0],[0,0,1],[1,0,0]], device=device, dtype=dtype);
# (132)
Sym[:,:,5] = th.tensor([[0,0,1],[1,0,0],[0,1,0]], device=device, dtype=dtype);
# coeff's
symc = th.tensor([4.0,4.0,-2.0,-2.0,-2.0,-2.0], device=device, dtype=dtype);
n=3;
nb=8;
xvechL=th.tensor([
1.6210e+00, -2.1504e-01, 9.0755e-01, 9.7866e-01, -2.8418e-01,
-3.5286e+00, -3.3045e+00, -4.5036e+00, -3.2116e-01, -7.1901e-02,
1.5167e+00, -8.4489e-01, -2.1377e-01, -3.6127e-03, -5.3774e-03,
-2.1263e+00, -2.5191e-01, 2.1235e+00, -2.1396e-01, -1.4084e-03,
-1.0092e-02, 4.5349e+00, 9.4837e-03, 1.1225e+00, -2.1315e-01,
5.8451e-02, -4.9410e-03, 5.0853e+00, 7.3332e-01, 5.0672e+00,
-2.1589e-01, -6.8986e-03, -1.4310e-02, 1.5979e+00, 3.3946e-02,
-8.7965e-01, -1.1121e+00, -2.1903e-03, -4.6925e-02, 2.1457e-01,
3.3045e-03, 4.5120e+00, -2.1423e-01, -1.6493e-02, -2.3429e-03,
-8.6715e-01, -6.7070e-02, 1.5998e+00
], device=device, dtype=dtype, requires_grad=False)
evec = th.tensor([
-6.0460e-02, 7.7708e-05, 1.6152e+00, 9.5443e-01,
1.1771e-01, 3.2196e+00, 9.6344e-01, 3.1398e+00
], device=device, dtype=dtype, requires_grad=False)
# join the parameters into x1
x1 = th.tensor(th.cat((xvechL,evec)), device=device, dtype=dtype, requires_grad=True)
eng = energy(x1,n,nb,Mass,Charge,Sym,symc)
print(eng) # Should return -7.3615 at this point x1
return x1
```

```
start_time = time.time()
xrestart = test_energy()
print(" took {} seconds ".format(time.time() - start_time))
```

`tensor(-7.3615, dtype=torch.float64, grad_fn=`)
took 0.11813998222351074 seconds

With those values of the parameters for x1 the the energy is as expected. Note the "requires_grad=True" attribute for x1. That tells PyTorch to keep track of those parameters for computing gradients.

#### Optimize energy using PyTorch Autograd gradients

```
def opt_energy(steps=1, num_basis=8):
############################################################################
# Input constants for Li atom (infinite nuclear mass)
############################################################################
Mass = th.tensor([[0.5, 0.0, 0.0],
[0.0, 0.5, 0.0],
[0.0, 0.0, 0.5]], device=device, dtype=dtype);
Charge = th.tensor([-3, 1, 1, -3, 1, -3], device=device, dtype=dtype);
# symmetry projection terms
Sym = th.zeros((3,3,6), device=device, dtype=dtype)
# (1)(2)(3)
Sym[:,:,0] = th.tensor([[1,0,0],[0,1,0],[0,0,1]], device=device, dtype=dtype);
# (12)
Sym[:,:,1] = th.tensor([[0,1,0],[1,0,0],[0,0,1]], device=device, dtype=dtype);
# (13)
Sym[:,:,2] = th.tensor([[0,0,1],[0,1,0],[1,0,0]], device=device, dtype=dtype);
# (23)
Sym[:,:,3] = th.tensor([[1,0,0],[0,0,1],[0,1,0]], device=device, dtype=dtype);
# (123)
Sym[:,:,4] = th.tensor([[0,1,0],[0,0,1],[1,0,0]], device=device, dtype=dtype);
# (132)
Sym[:,:,5] = th.tensor([[0,0,1],[1,0,0],[0,1,0]], device=device, dtype=dtype);
# coeff's
symc = th.tensor([4.0,4.0,-2.0,-2.0,-2.0,-2.0], device=device, dtype=dtype);
############################################################################
n=3
nb=num_basis
th.manual_seed(3)
x1 = th.empty(int(nb*n*(n+1)/2 + nb), device=device, dtype=dtype, requires_grad=True)
th.nn.init.uniform_(x1, a=-.8, b=.8)
#x1 = xrestart
optimizer = th.optim.Rprop([x1], lr=0.001, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
for i in range(steps):
optimizer.zero_grad()
loss = energy(x1,n,nb,Mass,Charge,Sym,symc)
loss.backward()
optimizer.step()
if (i<10 or not i%10):print('step: {:5} f: {:4.12f} gradNorm: {:.9f}'.format(i, loss, th.norm(x1.grad)))
print('step: {:5} f: {:4.12f} gradNorm: {:.9f}'.format(i, loss, th.norm(x1.grad)))
return x1
```

"opt_energy()" is the main program. I have the problem input constants setup here as well as the optimization code. This runs optimization steps on our energy function. The gradient for the optimization is evaluated by "back propagation" in the statement loss.backward(). I used the name loss in honor of standard Machine Learning name for the function being minimized.

I experimented with some of the machine learning optimization routines in PyTorch and found that "Rprop" worked really well for this problem. I used the a utility from the neural network module "nn" for the random parameter initialization. It was great having these kinds of tools in the framework for experimenting with!

You can run this code and it will optimize a wavefunction for the Li atom ground state energy. It will be converging toward a value of -7.478060 Hartrees' (an energy unit typically used in quantum chemistry). Running this for 100 "epochs" i.e. optimization steps gives the following output.

```
start_time = time.time()
xrestart = opt_energy(steps=100, num_basis=8)
print(" took {} seconds ".format(time.time() - start_time))
```

```
step: 0 f: -0.782443089544 gradNorm: 6.657278345
step: 1 f: -0.813405818204 gradNorm: 6.663333179
step: 2 f: -0.850565814089 gradNorm: 6.664066434
step: 3 f: -0.895095668330 gradNorm: 6.655362804
step: 4 f: -0.948326026794 gradNorm: 6.631036613
step: 5 f: -1.011714048225 gradNorm: 6.582179112
step: 6 f: -1.086766887327 gradNorm: 6.496718392
step: 7 f: -1.174915424078 gradNorm: 6.359865888
step: 8 f: -1.277512892162 gradNorm: 6.162854834
step: 9 f: -1.395322269810 gradNorm: 5.891109978
step: 10 f: -1.528494104459 gradNorm: 5.568453572
step: 20 f: -4.012585110663 gradNorm: 2.752546989
step: 30 f: -6.571789033344 gradNorm: 1.543067756
step: 40 f: -7.223244601602 gradNorm: 0.284565245
step: 50 f: -7.289228643587 gradNorm: 0.149059481
step: 60 f: -7.353758373450 gradNorm: 0.142130627
step: 70 f: -7.368628179269 gradNorm: 0.069057239
step: 80 f: -7.378787154917 gradNorm: 0.062784281
step: 90 f: -7.389791432408 gradNorm: 0.047265743
step: 99 f: -7.398805061984 gradNorm: 0.095278731
took 28.317583560943604 seconds
```

This is a very small 8 term wavefunction so it wont give a great result and the current code is too slow for large numbers of basis functions.

The next run is for a wavefunction with 512 basis terms. This is one step of the optimization which will give us a time for one energy and gradient evaluation. (this is from a random start point)

```
start_time = time.time()
xrestart = opt_energy(steps=1, num_basis=512)
print(" took {} seconds ".format(time.time() - start_time))
```

```
step: 0 f: -0.720436686340 gradNorm: 4.060212842
took 2768.9312148094177 seconds
```

2768 seconds for one step is too long!

**I'll tease you and do the same thing with my fast version of the code**. After all of the code optimizations in the next few posts and running on a Titan V GPU we'll be able to bring that execution time down to the following,

```
start_time = time.time()
xrestart = opt_energyrc(steps=1,num_basis=512)
print(" took {:.4f} seconds ".format(time.time() - start_time))
```

```
step: 0 f: 82.210051671516 gradNorm: 495.193096479
took 0.1683 seconds
```

**Yes!, from 2768 seconds down to 0.1683 seconds! That's 16452 Times Faster!!** Yes, really, that much. Some of the simple code optimizations will be surprising by how much difference they make.

I'll be taking a vacation next week so it may be two weeks before I start writing up the code optimizations. I will be worth the wait. I promise!

**Happy computing --dbk**

**Tags:**PyTorch, Python, Scientific Computing, CUDA, Quantum Mechanics

"Where H=T+V is a matrix of evaluated integrals using our basis functions."

Well, let me get some scratch paper, a pencil, a large eraser, coffee and my Linear Algebra book.

Yes, should probably be a big cup of coffee :-) This is stuff I did years ago for my thesis work on application of matrix differential calculus in quantum chemistry. A couple of papers are referenced in the "Part 1" but they will be hard to find. I did this because I've wanted to implement that stuff again for a long time. I wanted a good example of using PyTorch for something other than ML/DL. It was really fun to code this up!

A big part of my work before was on the analytic gradients. Here I used autograd and I feel it is highly effective. That could be a game changer for some problems.

I'll present some code optimization that will bring this up to "production" level performance. It was pretty easy to get this to where it is running with 100% utilization on GPU (or all-core parallel on CPU) ... and this is probably another game changer. A lot of scientific code is ancient Fortran and has never been re-implemented because it is just too big of a task. The incredible work that has been done with the machine learning frameworks could significantly impact scientific computing in general. Exciting times :-) Best wishes --Don

Fixed a problem in the utility functions so this should work with PyTorch 1.0.1 now