
# Estimating CPU Performance using Amdahl's Law

Written on May 4, 2015 by Matt Bach

### Introduction

If you are in the market for a new computer (or thinking of upgrading your current system), choosing the right CPU can be a daunting - yet incredibly important - task. Just in our current product line here at Puget Systems, we are selling about 60 different Intel CPU models each with their own unique specifications. While the actual list of CPUs you will need to choose between will be a bit smaller than that based on your other system requirements (ECC RAM, Mobile, PCI-E lanes, etc.) it is still incredibly difficult to determine which CPU will give you the best possible performance while staying within your budget.

While Tom's Hardware, Anandtech, and a multitude of other hardware review sites do a great job reviewing and comparing different CPUs, unless they specifically test the application(s) you personally use, their results may not accurately reflect the performance that you would see. After all, as good as those sites are, if they tried to test every possible application they simply would never finish before the CPU became obsolete!

When you are choosing a CPU, there are two main specifications you need to pay attention to that define the relative performance of CPUs:

1. The frequency is essentially how many operations a single CPU core can complete in a second (how fast it is).
2. The number of cores is how many physical cores there are within a CPU (how many operations it can run simultaneously).

This doesn't take into account any differences in architecture (AMD versus Intel, Haswell versus Ivy Bridge, etc.), but when comparing two CPUs from the same family these are the two main specifications that determine the relative performance capability of a CPU. If your software only uses a single core, the frequency is a decent indicator of how well a CPU will perform. However, if your software is able to utilize multiple CPU cores, it becomes very difficult to estimate the performance of different CPU models since almost no program is 100% efficient at using those cores. The trick is to determine exactly how efficient your program is at using multiple CPU cores (its parallelization efficiency) and use that number to estimate the performance of different CPU models.

To calculate the parallelization efficiency, you need to use a mathematical equation called Amdahl's Law. We were first introduced to this equation about a year and a half ago when we hired Dr. Donald Kinghorn to help us get established in the scientific computing market. He has been invaluable as a resource in that segment (just check out our HPC blog section for a sample of what we have learned from him so far), but the knowledge he has brought to Puget Systems has been useful in many ways we never anticipated - including the practical application of Amdahl's Law.

### What is Amdahl's Law?

At the most basic level, Amdahl's Law is a way of showing that unless a program (or part of a program) is 100% efficient at using multiple CPU cores, you will receive less and less of a benefit by adding more cores. At a certain point - which can be mathematically calculated once you know the parallelization efficiency - you will receive better performance by using fewer cores that run at a higher frequency than using more cores that run at a lower frequency.

### Amdahl's Law:

$S(n)=\frac{T(1)}{T(n)}=\frac{T(1)}{T(1)(B+\frac{1}{n}(1-B))}$

• $S(n)$ is the theoretical speedup
• $T(n)$ is the time an algorithm takes to finish when running n threads
• $B$ is the fraction of the algorithm that is strictly serial (so 1-B is how much of the program can be run in parallel)

Unless you deal with complex equations regularly, this may be a bit of a daunting equation. However, since we are primarily concerned with the maximum speedup that can be achieved by increasing the number of CPU cores, it can be simplified a bit into the following:

### Parallelization Formula:

$S(n)=\frac{1}{(1-P)+\frac{P}{n}}$

• $S(n)$ is the theoretical speedup
• $P$ is the fraction of the algorithm that can be made parallel
• $n$ is the number of CPU threads

What this is basically saying is that the amount of speedup a program will see by using $n$ cores is based on how much of the program is serial (can only be run on a single CPU core) and how much of it is parallel (can be split up among multiple CPU cores).
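
To make the formula concrete, here is a minimal Python sketch of it (the function name is our own choice):

```python
def amdahl_speedup(n, p):
    """Theoretical speedup on n cores for a program whose parallel
    fraction is p, per Amdahl's Law."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a program that is 95% parallel falls well short of 8x on 8 cores:
print(amdahl_speedup(8, 0.95))  # ~5.93x
```

Note how quickly the serial fraction dominates: at p = 0.95, the speedup can never exceed 1/(1-p) = 20x no matter how many cores you add.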

In order to use this equation, you first need to determine the parallelization efficiency of your program. With that number, you can then use Amdahl's Law and a CPU's frequency to fairly accurately estimate the performance of almost any CPU that uses a similar architecture to the CPU you used for testing. While you are certainly invited to follow this guide in its entirety, if you are more concerned about actually estimating a CPU's performance than the math behind it, feel free to skip ahead to the Easy Mode - Using a Google Doc spreadsheet section.

### Amdahl's Law Limitations

While the method we described above is great for determining how much of a program can be run in parallel, it (and Amdahl's Law in general) has some limitations:

• Not every action done in a program will have the same amount of parallelization. If you look at the results of our recent article, Adobe Photoshop CC CPU Multi-threading Performance, you will notice that how many CPU cores Photoshop can use varies greatly depending on what you are actually doing. You can mitigate this limitation somewhat by testing various tasks and calculating the parallelization efficiency for each task individually (which is what we did), but depending on the program it may not be feasible to test every single possible action.
• Amdahl's Law only applies if the CPU is the bottleneck. If what you are doing is not being limited by the CPU, you will find that after a certain number of cores you stop seeing any performance gain. If your video card, RAM, or hard drive performance is preventing the program from running any faster, adding more CPU cores will never help even if the program is 100% parallel. Also, keep in mind that if you end up purchasing a faster CPU than the one you tested with, it is entirely possible that the new CPU will be fast enough that something else in the system (RAM, HD, GPU, etc.) may then become the bottleneck and limit the performance of your new, faster CPU.
• Many programs are hard-coded to use a certain number of cores. Even if it may be possible for a program to use more cores, many programs have a hard-set number of CPU cores that can be utilized. In fact, a large majority of software available today still only uses a single CPU core! This is done for a variety of reasons, ranging from the nature of what the program is doing making it poorly suited to multiple CPU cores, to it simply being easier to program for a fixed number of cores.
• Estimating the performance of a CPU will only be accurate for CPUs based on a similar architecture. If the CPU you used to determine the parallel efficiency of a program is vastly different from the CPU you are considering purchasing, you may not be able to accurately estimate its performance. Even setting aside AMD versus Intel, if the CPU you tested with is more than a generation or two old, the actual performance of a newer CPU may be vastly different (and usually faster) than what Amdahl's Law will estimate. You can still accurately calculate the parallel efficiency and use it to compare the relative performance of two or more CPUs that share an architecture, but you won't be able to determine more than a general idea of the actual performance.

### Step 1: Test your program with various numbers of CPU cores

Unfortunately, determining the parallelization efficiency of a program is not something you can find just by looking in a ReadMe.txt file. The easiest way we have found is to simply run your program and time how long it takes to complete a task while artificially limiting the number of CPU cores it can use. Luckily, you don't need to change out your CPU a bunch of times to do this. Instead, you can simply set the program's affinity through Task Manager in Windows (or by using the "taskset" command in Linux). This is not as good as completely disabling CPU cores through the BIOS - which is possible on some motherboards - but we have found it to be much more accurate than you would expect.

To set the affinity, launch the program you want to test, open Task Manager, right-click on the program's listing under Details, select "Set Affinity", and choose the threads you want to allow the program to use. Note that if your CPU supports Hyperthreading, twice as many threads will be listed as your CPU has physical cores. You can either disable Hyperthreading in the BIOS before doing your testing, or simply select two threads for every CPU core you want the program to use - in Windows, the Hyperthreading thread is always listed immediately after its physical core. In other words, selecting threads 1&2 will allow the program to use just a single CPU core, selecting threads 1-4 will allow it to use two CPU cores, and so on.

Note that setting the affinity only lasts until the program is closed; the next time you run the program, you will have to set the affinity again. If you want to quickly test a single action using various numbers of CPU cores, you don't have to close the program before changing the affinity - just click "Set Affinity" and change it on the fly. You will get more accurate results, however, by closing the program between runs since that cleans out the RAM already allocated to it.
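
On Linux, the same limiting can also be done programmatically from inside a script. Here is a minimal sketch using Python's standard library (the helper name `limit_to_cores` is our own; `os.sched_setaffinity` only exists on Linux, which is why the call is guarded):

```python
import os

def limit_to_cores(num_cores, pid=0):
    """Restrict a process (pid 0 = the current process) to the first
    num_cores logical CPUs - the scriptable equivalent of Task Manager's
    "Set Affinity" or `taskset`. Remember that on a Hyperthreaded CPU,
    two logical CPUs map to one physical core."""
    mask = set(range(num_cores))
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(pid, mask)
    return mask

# Allow the current process to use only logical CPUs 0 and 1:
limit_to_cores(2)
```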

With the ability to set how many CPU cores a program can use, all you need to do is perform a repeatable action with varying numbers of CPU cores. For example, you might time how long it takes to complete a render in AutoCAD or export images in Lightroom. The more core counts you test the better, but at minimum you need to test with a single CPU core and with all possible CPU cores. If possible, we recommend testing every combination (so if you have an eight-core CPU, test with 1, 2, 3, 4, 5, 6, 7, and 8 cores).
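
If the task you are timing can be launched from the command line on Linux, the whole sweep can be scripted around `taskset`. A sketch - the `./my_benchmark` command is a placeholder for whatever action you are actually measuring:

```python
import subprocess
import time

def taskset_cmd(num_cores, command):
    """Build a command line that pins `command` to the first num_cores logical CPUs."""
    cpu_list = ",".join(str(i) for i in range(num_cores))
    return ["taskset", "-c", cpu_list] + command

def time_with_cores(num_cores, command):
    """Run the command limited to num_cores CPUs and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(taskset_cmd(num_cores, command), check=True)
    return time.perf_counter() - start

# Example sweep over 1-8 cores (replace with your own benchmark command):
# for n in range(1, 9):
#     print(n, "cores:", time_with_cores(n, ["./my_benchmark"]))
```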

### Step 2: Determining the parallelization fraction

At this point, you should have a list that shows how long it took your program to complete an action using various numbers of CPU cores. Just to have an example, let's say your results look like those in the "Action Time (seconds)" column in the table below:

The easiest way we have found to use these results to determine the parallelization efficiency of a program is to first determine how much faster the program completed the task with N cores versus a single core. To find this, simply divide how long the action took with a single core by how long it took with N cores. In our example, for two cores the speedup is 645.4/328.3, which equals 1.97. Fill this in for each row and we can use these numbers to determine the parallelization fraction of the program.

There is a rigorous mathematical way to find the parallelization fraction directly from the speedup numbers using non-linear least squares curve fitting, but the easiest way we have found is to simply guess at the fraction, see how close the results are, then tweak it until the speedup calculated using Amdahl's Law is close to the actual speedup. A program like Excel or Google Sheets makes this much easier, but you can do it with just a calculator and a pad of paper if you want to do it manually and have hours to kill.

| # of Cores | Action Time (seconds) | Actual Speedup | Amdahl's Law Speedup (97% efficient) |
|---|---|---|---|
| 1 | 645.4 | 1 | 1 |
| 2 | 328.3 | 1.97 | 1.95 |
| 3 | 230 | 2.8 | 2.8 |
| 4 | 172 | 3.75 | 3.67 |
| 5 | 140.3 | 4.6 | 4.5 |
| 6 | 117.5 | 5.5 | 5.2 |
| 7 | 108 | 6 | 5.9 |
| 8 | 97.8 | 6.6 | 6.6 |

To find the parallelization fraction, you need to use the parallelization equation we listed earlier and plug in different values for P:

$S(n)=\frac{1}{(1-P)+\frac{P}{n}}$

A good place to start might be to try P=.8 (or 80% parallel efficient) and perform this calculation for each # of cores. For example, for 4 cores the equation would be

$S(n)=\frac{1}{(1-.8)+\frac{.8}{4}}$

which equals 2.5. Compare this to the actual speedup in our example (3.75) and you will see that our example program is more than 80% efficient, so we need to increase the parallelization fraction. In our case, the actual fraction was .97 (97%), which is pretty decent. You will notice that the results don't line up perfectly every single time since there is a certain margin of error that always exists when you run benchmarks - you simply have to average it out and get it as close as you can. Having this in a spreadsheet where you can graph both data series makes it much easier (see the Easy Mode - Using a Google Doc spreadsheet section for a link to a Google Doc with all the calculations already performed and a graph set up).
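
That guess-and-tweak loop is also easy to automate. The sketch below does a coarse grid search over P (a simple stand-in for proper non-linear least squares fitting), using the example timings from the table above; `fit_parallel_fraction` is our own helper name:

```python
def amdahl_speedup(n, p):
    """Theoretical speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def fit_parallel_fraction(times):
    """times[i] is the measured time using i+1 cores. Returns the P whose
    Amdahl's Law curve best matches the measured speedups (least squares)."""
    measured = [times[0] / t for t in times]  # actual speedup per core count

    def error(p):
        return sum((s - amdahl_speedup(n + 1, p)) ** 2
                   for n, s in enumerate(measured))

    # Coarse grid search over P in [0, 1] with step 0.001
    return min((p / 1000.0 for p in range(1001)), key=error)

# The timings from the article's example table (1 through 8 cores):
times = [645.4, 328.3, 230, 172, 140.3, 117.5, 108, 97.8]
print(fit_parallel_fraction(times))  # ~0.97
```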

### Step 3: Estimate CPU performance using the parallelization fraction

Once you have the parallelization fraction, you can use it to estimate the performance of any other CPU that uses the same or a similar architecture as the CPU you tested with. If you are interested in a CPU that uses an entirely different architecture, you can still use this method to determine the relative difference in performance between a number of different CPU models from the same family, but it will likely not be an accurate representation of the actual performance you would see with that CPU.

To estimate a CPU's performance, you need to know the operating frequency and core count of both the CPU you used to benchmark with and the CPU you are interested in. With those specs in hand, you first calculate how many effective cores each CPU has using the equation:

$\text{Effective Cores}=\frac{1}{(1-P)+\frac{P}{\text{CPU Cores}}}$

Basically, this is the same parallelization equation we used earlier, only plugging in the actual number of cores the CPU has. It gives us the number of 100% utilized cores the CPU is effectively equivalent to when running your program. From this, we can multiply the number of effective cores by each CPU's operating frequency to get what is essentially how many operations per second the CPU is able to complete (or GFLOPs):

$GFLOPs = CPUFrequency*EffectiveCores$

Finally, we can estimate how long the CPU you are interested in would take to complete the same action you benchmarked by dividing the GFLOPs of the two CPUs and multiplying by the time it took your test CPU to complete the action with all of its cores enabled:

$Performance=\frac{CPU1GFLOPs}{CPU2GFLOPs}*CPU1BenchTime$

With this, you should end up with an estimation of how long it would take a CPU to complete the action you benchmarked.
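
Putting Step 3 together in code, the sketch below reproduces the estimates used in the Conclusion's example (frequencies and core counts are the Intel specifications quoted there, with P = 0.97 from the Step 2 example):

```python
def effective_cores(cores, p):
    """Number of 100%-utilized cores a CPU is equivalent to for a
    program whose parallel fraction is p."""
    return 1.0 / ((1.0 - p) + p / cores)

def estimate_time(test_cpu, new_cpu, p, test_time):
    """Estimate how long new_cpu takes to run an action that took
    test_time seconds on test_cpu. Each CPU is (frequency GHz, cores)."""
    def throughput(cpu):
        freq, cores = cpu
        return freq * effective_cores(cores, p)  # the article's "GFLOPs"
    return throughput(test_cpu) / throughput(new_cpu) * test_time

e5_2660v3 = (2.6, 10)  # test CPU: 85.3 s on the benchmarked action
e5_2667v3 = (3.2, 8)
e5_2690v3 = (2.6, 12)

print(estimate_time(e5_2660v3, e5_2667v3, 0.97, 85.3))  # ~82.5 s
print(estimate_time(e5_2660v3, e5_2690v3, 0.97, 85.3))  # ~74.4 s
```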

### Easy Mode - Using a Google Doc spreadsheet

If you want to estimate the performance of a CPU using Amdahl's Law and don't love math, you will probably have a headache by the time you complete this guide. Lucky for you, we took the time to put together a Google Doc that has all the equations already done and ready: Estimating CPU Performance. You will need to make a copy of the Doc (go to File -> Make a Copy), but once you have done that you will be able to use it as much as you like.

To use this doc, do the following:

• Complete Step 1: Test the program with various numbers of CPU cores. Unfortunately, you simply have to do this step yourself.
• Once you have tested your application with various numbers of CPU cores active, input your results into the orange cells in the Google Doc (replacing the example results)
• Adjust the parallel efficiency fraction (the yellow cell) until the two lines on the graph are similar. If you cannot get the two lines to line up, it may be that your program is not CPU limited (see the Amdahl's Law Limitations section)
• Change the light blue cells to reflect the cores and frequency of the CPU you used for testing (row 28) and the CPU(s) you are interested in estimating the performance of (rows 29-30)
• You should see an estimation of how long it should take each CPU to perform the action you benchmarked in the green cells

This is much easier than trying to keep track of all the different equations, although we understand that there are some people who strangely love doing math.

### Conclusion

Whether you followed the step-by-step instructions or simply used the Google Doc we linked, you should now have the resources and information needed to estimate the performance of a CPU for your exact program and application. While this is not the easiest process in the world, it can be invaluable when trying to decide what CPU to use in your new computer.

Say you are purchasing a new system but are torn between two CPU models that are similar in cost, but very different in terms of frequency and core count. As an example, let's use a Xeon E5-2667 V3 and a Xeon E5-2690 V3. Using the data from the example in Step 2, and assuming that our test CPU was a Xeon E5-2660 V3 2.6GHz Ten Core, we can estimate the performance of these two CPUs to be:

| CPU Model | ~MSRP | Estimated Action Time |
|---|---|---|
| Intel Xeon E5-2660 V3 2.6GHz Ten Core (Test CPU) | $1450 | 85.3 seconds |
| Intel Xeon E5-2667 V3 3.2GHz Eight Core | $2057 | 82.5 seconds |
| Intel Xeon E5-2690 V3 2.6GHz Twelve Core | $2090 | 74.4 seconds |

In this example, an E5-2667 V3 should take about 82.5 seconds to complete the action we benchmarked, while an E5-2690 V3 should only take about 74.4 seconds. Since the two CPUs are only $33 apart in price, it is almost a no-brainer that the E5-2690 V3 is the best choice in this instance.

Remember that this only applies to CPUs that are of a similar architecture to the one you used for testing and only for the action that you benchmarked. Anything different (even within the same program) may have drastically different results. However, if you keep finding yourself waiting on a render to finish, an export to complete, or any other single task you can limit your testing to just those tasks. Each may have a different parallelization efficiency, but if you determine the efficiency for each task and give them a certain weight (likely based on how often you are waiting on each to finish) you can make a much more educated decision on which CPU is right for you.

In fact, this is exactly what we use to determine what CPU we should offer in our growing list of Recommended Systems. While we have only completed testing for AutoCAD, Photoshop, and Imaris so far, we expect to grow this list to include a large number of different software that many of our customers use.

If you followed this guide, we'd love to hear what you tested, what problems (if any) you ran into, and what parallelization fraction you found to be the closest match. Just let us know in the comments below!

Tags: Amdahl's Law, multithreading, multi core
LORD ODIN

"When you are choosing a CPU, there are two main specifications you need to pay attention to that define the performance of a CPU:

The frequency is how many operations a single CPU core can complete in a second (how fast it is).
The number of cores is how many physical cores there are within a CPU (how many operations it can run simultaneously)."

So what you are saying is a FX 8350 is faster than a 4770k because it has a higher clock speed? What about IPC. What about different instruction sets. Floating point operation on AMD CPUs is so poor almost every single Intel CPU that exists can outperform it per core. Obviously if you have a program that can multithread very efficiently the FX-8350 will win but if you have an app that only uses 4 cores the past few generations of i3 can beat it even though the FX-8350 is at 4 - 4.2 Ghz and the i3 is around 3.3

Posted on 2015-05-05 01:20:02

Odin, I think architecture is out of scope for what this article is tackling. I'm sure Matt can comment in more detail, but my understanding is this estimation approach is only intended to provide comparison across CPU's of the same architecture. It also doesn't look at cache size, memory frequency support, etc. The point is that EVEN if you hold all the other variables constant, and only "turn the knobs" of CPU core count and frequency, you still have a complex estimation process when it comes to knowing how your application will scale. Understanding how your application scales will help you make decisions about what processor is best for you within a given architecture.

Posted on 2015-05-05 04:07:00

Jon is right, different architectures is completely outside the scope of this guide. I thought I put in more warnings about that then it looks like I actually did though, so I went back and added a bit to the Introduction, Limitations, and Conclusion sections about that. This is a really complex topic and there are plenty of more technical things I left out (like how memory allocation across multiple CPUs can affect the performance scaling of dual or quad CPU systems) but that is definitely something I meant to touch on. Thanks for pointing it out!

Posted on 2015-05-05 05:38:35
Alex Kain

This is not at all true. Not even close. You are so far out of the ball park with this statement, you can't even see that there is a park anymore. Here's the obvious problem with your statement: Intel is using a form of SMT (simultaneous multi-threading) that was initially designed almost 25 years ago. They have only made very slight pipeline changes and have tweaked the way that individual instructions are threaded per-core (which sounds like it's more than it actually is) to that the actual threaded processes waste less core space (this has been done incrementally since the i7 920). HOWEVER, the AMD "Bulldozer"/"Piledriver" architecture uses a completely different approach; what they have done is use a CMT (clustered multi-threading) approach (just so we're clear, the IPC's on each 'core' for the FX 8350 are just as 'strong' - meaning they support just as many instruction sets (proprietary and otherwise), individually, as any Ivy Bridge core). The idea behind CMT is to use a more traditional 'brute force' computing tactic by parsing instructions per module, then threading multiple parsed sets to each core within the module. From a purely engineering standpoint (theoretical) this is the better approach, as there is less wasted instruction/processing space on a per-core basis. THE PROBLEM is that no one - and I mean no one - in the programming world (not Microsoft, not IBM; no one) saw any financial benefit to overhauling kernal level instruction threading specifically for an architecture they viewed as being a 'one-off'. It is more than possible to revise a Linux kernal to benefit AMD cores and, having assisted others with doing this, I can tell you beyond a shadow of a doubt: The "Piledriver" architecture is without peer, period. End of Story. Hands down. Class Dismissed. 
But let's step back to that earlier statement of "no one in the programming world saw any financial benefit to overhauling kernal level instruction threading" for what amounts to a 'one-off' architecture. Does the arch have the ability to be the best thing since sliced bread? Sure... does it really matter if no one's going to support it because the need for that much thread-level parallelism isn't used in 99.9% of daily applications? Nope. So is the Intel really better than the AMD? NOPE. Not even close... but, again, only from a purely architectural standpoint. AMD's mistake was making the 'if you build it they will come' assumption. And that's just not how it works. It has never worked that way, not even for Apple (but boy, does Apple try) and likely never will. Necessity is the mother of invention and desire the father of innovation; there was neither the necessity nor desire for that much parallelism (and yes, this type of architecture would, by design, stink out loud for single threaded processes, since the vast majority of the thread space is wasted). Oh, and one final thing: No i3 ever made - in this reality or any other - has ever beaten or will ever beat an FX 8350; I don't know where you are arbitrarily pulling that BS statement from, but you should send it back, poste haste; it's a bald-faced lie. And lies have no business being told in a public forum. (as a note: I own an FX 8350 based PC, an i7 4930K, i7 3770K, AMD A10 7300 laptop and i7 4700 laptop and have built several AMD Opteron 6300 based servers; I'm pretty hardware agnostic; I like things that work and work well)

Posted on 2015-05-06 20:53:32
LORD ODIN

I also own an FX-8350 its my main machine.... I wasnt saying the i3 beats an FX-8350 i was saying per thread the i3 beats the FX-8350 so if the 8350 had 4 cores an i3 would kill it. I was also talking about floating point operations. Integer math on AMD is just as strong and with hashing it still keeps up with most i7s but in x32 floating point operations (Probably the most used) AMD is severely lacking in power. You can pretend AMD is just as good as Intel as long as you want but ill try to stick to the facts xD.

You can post benchmarks too.. I am always ready to be proven wrong, just arguing about the situation wont make anything better

The point im making is, strong cores Intel wins... Lots of cores for an amazing price AMD wins. I would pick an FX-8350 over an i3-i5-i7 for a server build any day..... Unless it was free ill take the 1000 dollar 5960x lol

--Gaming Benchmarks--

Posted on 2015-05-06 23:17:43
Alex Kain

Actually, the factual AVX processing is just as powerful as any i7, but the problem here lies in 3 places for the decreased performance:

1. The instruction pipeline is necessarily longer to manage the CMT appropriately. Thus, the L1 cache is slower to compensate.
2. No unified North Bridge circuit onboard, which lengthens the I/O pipeline, increasing the time between input, instruction cycle and output.
3. Microsoft and other software developers refused to properly support kernal level instruction issuing.

Also: Posting from places like LinusTechTips, Tom's Hardware and CPU Boss reduces your credibility rather than add to it. Tom's has been publicly outed as shilling to the highest bidder, Linus and CPU boss copy/paste whatever they see their respective subscribers claiming, usually with zero proof. Not even one of them has mentioned the ridiculous amount of cache thrashing Intel microprocessors suffer from (hilariously, the new Zen from AMD using a similar SMT method as Intel's 'Hyperthreading', will most likely suffer from the same thing, since this is an architectural drawback) nor that, when HT is completely turned off, the processors lose ~30% performance, putting them on-par or below AMD's FX line - no, can't mention that, can we? - and I've seen the pumped up marks for the i5 that I know to be blatant lies, having owned a 2500, 3570 and 4200 series from Intel. I don't believe anything any of those sites say anymore; I've caught them in too many lies. It's hard to lie to a guy who owns the hardware. A 3% - 5% difference? Yeah, probably just differences in system configs, etc. Some of the crap I've seen them post is skewed towards Intel or Nvidia as much as 40%. It's insane.

Posted on 2015-05-06 23:47:16
LORD ODIN

Kden... No more arguing from me since you cant back up your claims :3

Posted on 2015-05-06 23:52:46
Alex Kain

I didn't see it as an argument, but rather a discussion. And if I have to post some drivel, corporate shill link from 'CPU Boss' (or their GPU site) that I could easily prove wrong with a single screenshot to 'support' my argument - you know, rather than using engineering facts a 5 year old could find with a 10 minute Google search - then it was nice talking to you while it lasted.

The math in the post is great, by the way. It does come pretty close to actual numbers I can reproduce using my 3770T and 3770K builds. Nicely done.

Posted on 2015-05-06 23:56:41

That law may not be the common to some people who are not studying on the field of electronics. Thus, it's a very helpful thing to be learned by someone who wanted to measure the performance of their central processing unit to conclude on the capabilities that it can only give to its user.

Posted on 2016-02-28 14:09:50
hossam soltan

1. The performance of a single-processor machine is evaluated using a two-program benchmark suite. Suppose

only 50% of the first program and 87.5% of the second program can be executed in parallel.

(a) What is the maximum factor of improvement that can be achieved in the benchmark score (i.e., geometric

mean) by adding more processors to that machine?

(b) What is the minimum number of processors that need to be added to that machine in order to improve

its benchmark score by a factor of three.

Posted on 2017-10-18 20:29:12
