Introduction
Generative AI has come a long way in the last few years – and even that is probably a vast understatement. Many people now benefit from GenAI on a daily basis, even if they don’t realize it. A few examples of this include Google’s first answer to most search queries being AI-based, Amazon using AI to summarize customer reviews, and AI-based image and video editing tools becoming more widespread – though none of these are without controversy.
The largest and most powerful generative models are cloud-based, trained and running across massive server clusters with thousands of graphics processing units (GPUs – colloquially called video cards, but now capable of so much more than just generating computer graphics and video output). That level of hardware allows quick responses to the huge number of queries coming in at the same time, but has astronomical purchasing and energy costs.
However, it is entirely possible to run many forms of generative AI, especially large language models (LLMs), on more modest hardware. There are three phases to using GenAI:
- Training a model is the most processing-intensive phase and thus needs extremely powerful hardware. It is almost exclusively done on large server clusters.
- Fine-tuning an existing model is still demanding but can be done without needing a full-blown supercomputer.
- Inference, the term for normal use of a model to generate content, is the least taxing on hardware and the step that most users are looking to perform.
Inference with some small variants of LLMs – yes, I know that sounds contradictory – can be run on a PC with a single GPU. Most reasonably strong models will need a bit more processing power, though, and a number of guides and online tools have been created to help users understand how much hardware they will need for either inference or fine-tuning of various LLMs. This article will not attempt to reinvent that wheel, but rather provide links to some of those resources and then guidance on how to take that information and use it to purchase or build the right type of computer.
LLM Resources
VRAM Guide
First up, here is a guide from Wei-Ming Thor explaining the math behind calculating the amount of VRAM (the memory on GPUs) that LLMs need. If you want to understand what factors go into determining VRAM requirements, this is a great place to start!
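To make that math concrete, here is a minimal sketch in Python of the kind of estimate these guides walk through: model weights plus KV cache, with a rough allowance for overhead. All of the architecture numbers in the example call are hypothetical placeholders, and real frameworks add their own overhead on top, so treat the result as a ballpark figure rather than a guarantee.

```python
# Rough VRAM estimate for LLM inference: model weights + KV cache.
# Every architecture number passed in below is an example placeholder --
# substitute the real values for the model you intend to run.

def estimate_vram_gb(
    params_billion: float,           # total parameter count, in billions
    weight_bits: int,                # quantization level, e.g. 8 for Q8, 4 for Q4
    num_layers: int,                 # transformer layers
    num_kv_heads: int,               # KV heads (fewer than query heads with GQA)
    head_dim: int,                   # dimension per attention head
    kv_cache_bits: int,              # KV cache precision, e.g. 16 for FP16
    max_seq_len: int,                # maximum context length in tokens
    concurrent_users: int = 1,       # simultaneous sequences being served
    overhead_fraction: float = 0.2,  # assumed allowance for activations/framework
) -> float:
    weights_bytes = params_billion * 1e9 * (weight_bits / 8)
    # K and V tensors are stored for every layer, KV head, and token.
    kv_bytes = (
        2 * num_layers * num_kv_heads * head_dim
        * max_seq_len * (kv_cache_bits / 8) * concurrent_users
    )
    total_bytes = (weights_bytes + kv_bytes) * (1 + overhead_fraction)
    return total_bytes / 1e9  # gigabytes

# Hypothetical 8B model: Q4 weights, FP16 KV cache, 8K context, one user.
print(round(estimate_vram_gb(8, 4, 32, 8, 128, 16, 8192), 1), "GB")
```

With those assumed values this works out to roughly 6 GB, which is why small, heavily quantized models can fit on a single consumer GPU.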
VRAM Calculator
Next, we have a web-based LLM calculator from ApX. It will help make sure you get enough VRAM, using the methods described above, and can also estimate the number of tokens per second that a particular set of GPUs can provide with a given LLM configuration. The default setting is for inference, but there is a tab where you can switch to fine-tuning if you want – which significantly increases the hardware required!
I find that using a calculator like this is most intuitive in one of two ways:
- If you know exactly what sort of model you want to run, plug in all of those parameters and see how much VRAM it will need. Then, adjust the GPU selection and quantity until you reach that total – with some headroom, to be safe.
- If you are unsure of the model, but know what GPUs you have available (or have the budget for) then start by selecting the version and quantity of cards and then play around with the various model options until you find a configuration that will work – again, with some headroom.
One small aside about this tool: it doesn’t always have the latest GPUs listed, but there is a “Custom” option which lets you manually plug in the VRAM per card.
Hosting Software
Experienced users may want to manually configure their LLM software setup, but if you are just getting started or want a simple interface, check out LM Studio. It provides a graphical user interface (GUI) for running a range of supported models on Windows, macOS, and Linux… and it’s free!
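A nice bonus is that LM Studio can also expose the loaded model through an OpenAI-compatible local server, so you can script against it in addition to chatting in the GUI. Here is a minimal sketch assuming the server feature is enabled on its default port (1234) and a model is already loaded; the model name passed in is just a placeholder, since LM Studio serves whichever model is currently loaded.

```python
# Query a model hosted by LM Studio's local server, which speaks the
# OpenAI-compatible API. Assumes the server is running on its default
# port (1234) and that a model is already loaded in LM Studio.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio routes to the loaded model
    messages=[
        {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
    ],
)
print(response.choices[0].message.content)
```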
While not nearly as polished as LM Studio, Stability Matrix provides a GUI for installing and running image generation models. It also supports all three major OSes, and is free to use, but it is worth noting that there is a lot of controversy over how image-generating models are trained.
There are many other great tools and articles out there – these are just a sample! If there is a resource you’ve found particularly helpful, please feel free to share a link in the comments.
Hosting Your Own LLM
Once you’ve used a calculator, like the one linked above, to determine how many GPUs you need and which model of GPU, it is time to build your system. Here is a general breakdown of how many GPUs you can fit in different classes of systems:
| Type of Computer | AI Computation Support | Puget Systems Example |
| --- | --- | --- |
| Laptop | NPU or single, mobile-class GPU | Puget Mobile C160-G |
| Small Form Factor Desktop | Single GPU, potentially limited by size, power, and cooling | Puget Workstation C112-S |
| Mid Tower PC | Single or dual GPUs, potentially limited by size, power, and cooling | Puget Workstation R121-L |
| Full Tower PC | Up to three or four GPUs, limited by size, power, and cooling | Puget Workstation T140-XL |
| Smaller Rackmounts (2U) | Up to four GPUs | Puget Server E140-2U |
| Larger Rackmounts (4U+) | Up to eight GPUs | Puget Server E281-4U |
For example, let’s say a business wanted to stand up a DeepSeek-R1 70B model for an internal team that needs to ask questions without those queries being recorded by a cloud service, so an on-premises system is required. They’ve got a server closet with some space available, and would like the following parameters: Q8 inference precision, FP16 KV cache precision, 8K max sequence length, and up to 8 concurrent users. Plugging those values into the calculator, we get a result of about 125GB of VRAM being needed.
A popular choice for running LLMs in a rackmount system is NVIDIA’s L40S, which has 48GB of VRAM per card. Selecting that in the calculator shows that a minimum of three cards should be able to run this model, but it would be cutting things pretty close. To be on the safe side, going with four of them will ensure plenty of extra overhead in case there end up being more users or more tokens per input than initially expected. Four L40S GPUs fit perfectly into a 2U chassis like the Puget Systems Quad GPU LLM Server.
If this fictional company didn’t have a dedicated space for rackmount servers, a desktop tower could be a viable alternative. NVIDIA has recently launched a new top-end professional GPU, the RTX PRO™ 6000 Blackwell, which has 96GB of VRAM and comes in a few variants – including the Max-Q Workstation Edition, which is designed for multi-GPU systems. The calculator shows that two of those (using the Custom option in the GPU drop-down) would work nicely, and those could fit in a full tower like the Puget Systems Multi GPU AI workstation.
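The card-count arithmetic in both cases is simple enough to sanity-check yourself. Here is a small sketch using the calculator’s ~125GB estimate; the 25% headroom factor is just an assumption for illustration, not a hard rule.

```python
import math

def gpus_needed(required_vram_gb: float, vram_per_gpu_gb: float,
                headroom: float = 0.25) -> int:
    """Number of cards needed to hold the model, padded by a headroom factor."""
    return math.ceil(required_vram_gb * (1 + headroom) / vram_per_gpu_gb)

# ~125 GB estimate from the calculator for the DeepSeek-R1 70B example:
print(gpus_needed(125, 48))   # NVIDIA L40S, 48 GB each          -> 4
print(gpus_needed(125, 96))   # RTX PRO 6000 Blackwell, 96 GB each -> 2
```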
Of course, many models can be run with even more modest hardware; this is just an example to help you understand the way to approach configuring a system to run your own LLM. Here at Puget Systems, we have options from compact workstations with a single GPU all the way up to servers with eight NVIDIA H200s – and you can even combine those into clusters for even more total VRAM! If you’d like personal guidance on selecting the right system(s) for your needs, our expert consultants are available to help with no cost or obligation.
Overflowing VRAM
Some frameworks have options to utilize system memory (RAM) if a model needs more space than the available VRAM provides. LM Studio, for example, can configure this for you with a simple toggle, making it very easy for even novice users. However, doing this comes with a very steep performance penalty!
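For reference, frameworks built on llama.cpp expose the same idea as a “GPU layers” setting: however many layers fit are kept in VRAM, and the rest run from system RAM on the CPU. Here is a minimal sketch using the llama-cpp-python bindings; the model filename and layer count are placeholders you would adjust for your own hardware.

```python
# Partial GPU offload with llama-cpp-python: only n_gpu_layers layers are
# placed in VRAM, and the remaining layers run from system RAM on the CPU.
# The model path and layer count below are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-12b-Q3_K_L.gguf",  # local GGUF file (placeholder)
    n_gpu_layers=30,  # raise until VRAM is nearly full; lower if it overflows
    n_ctx=8192,       # context window; larger values need more KV cache memory
)

output = llm("What is the airspeed velocity of an unladen swallow?", max_tokens=256)
print(output["choices"][0]["text"])
```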
I did some testing on my home PC, sporting an NVIDIA GeForce RTX™ 4070 SUPER 12GB, with three sizes of the Gemma 3 LLM. LM Studio configured quantization defaults for each one, which were pretty steep in order to let these run on my relatively modest GPU. I measured how much memory each version consumed via Task Manager, and LM Studio reported how many tokens per second they ran at when I asked them each the same query.
| Model | Quantization | Size on Disk | Tokens / Second | VRAM Usage | System RAM Usage |
| --- | --- | --- | --- | --- | --- |
| Gemma-3-4b | Q4_K_M | 3.34 GB | 100.35 tok/s | ~6 GB | None |
| Gemma-3-12b | Q3_K_L | 7.33 GB | 46.45 tok/s | ~10 GB | None |
| Gemma-3-27b | Q4_0 | 16.43 GB | 1.57 tok/s | ~11 GB | ~6 GB |
Even with very aggressive quantization settings, the 27 billion parameter version of Gemma 3 was too large for my GPU’s VRAM. LM Studio allowed it to overflow into system memory, so it still ran – but it took almost 30 times longer to answer the question than the 12 billion parameter variant. Was the bigger model able to answer questions more accurately, though? Let’s dig into that in the next section!
Accuracy and Hallucination
It is important to remember when working with LLMs that they are not actually conscious or intelligent, at least not in the way we think of those concepts for humans. Instead, they are extremely advanced prediction engines. Like a human, they can’t “know” (reply with) information they were not trained on. Unlike most humans, though, LLMs frequently make stuff up without giving any indication that they are guessing! This is commonly referred to as hallucination, and it makes taking their replies at face value a risk. Generally, models with a lower number of parameters are more likely to behave this way and provide incorrect responses.
I wanted to demonstrate this phenomenon, so during the performance testing I described above, I asked each of the models “What is the airspeed velocity of an unladen swallow?” I was expecting both a real-life answer as well as discussion of this question’s famous inclusion in Monty Python and the Holy Grail. Here is my subjective analysis of what each model gave me:
Gemma-3-4b: This model’s reply was fast, but the part of the answer relating to the film reference was very wrong, and its conversion from meters per second to miles per hour was off by a little more than 10%. (The model’s full answer follows.)
This is a classic question from Monty Python and the Holy Grail! The answer, as presented in the film, isn’t straightforward.
Here’s the breakdown:
- Richard III (played by John Cleese) asks: “What… is the airspeed velocity of an unladen swallow?”
- Archimedes (played by Graham Chapman) responds: “That depends if it is an African or European swallow.”
- Richard III then asks: “African!”
- Archimedes replies: “About 11 meters per second, which is 22 miles per hour. Actually, that was a bit of a guess.”
Therefore, the “correct” answer is approximately 11 meters per second (22 mph), but it’s heavily dependent on whether you’re talking about an African or European swallow.
It’s important to remember this is a comedic device in the film and not a scientifically accurate measurement. 😊
You can find more detailed explanations of the debate surrounding this question here: https://en.wikipedia.org/wiki/Airspeed_velocity_of_an_unladen_swallow
Gemma-3-12b: A more comprehensive answer. The film reference was shorter, but it was more generalized and didn’t contain explicit hallucinations – though I would say the characterization of its use in the movie is still off. The unit conversions were also much more accurate. (The full response follows.)
Ah, a classic question! This comes from Monty Python and the Holy Grail. The answer isn’t straightforward, as it depends on the type of swallow. Here’s a breakdown:
The Problem & Why It’s Complicated:
- Different Swallows: There are many species of swallows (Barn Swallow, European Swallow, etc.). Their size, wing shape, and flight characteristics vary.
- Unladen vs. Laden: “Unladen” means without a load. A laden swallow would be carrying something, which significantly affects airspeed.
- Airspeed vs. Ground Speed: Airspeed is the speed relative to the air; ground speed is the speed relative to the ground and affected by wind.
The Research & Estimates:
Researchers have actually tried to answer this question! Here’s a summary of findings:
- European Swallow (most likely what they were referencing): Studies suggest an average airspeed of 11 meters per second (roughly 24 mph or 39 km/h). However, speeds can range from around 7-16 m/s (15-36 mph or 24-58 km/h) depending on factors like wind and flight style.
- Barn Swallow: Similar ranges to the European swallow, but with some variation.
The Monty Python Context:
In the movie, after a lengthy discussion about whether an African or European swallow is being referenced, the question remains unanswered in a humorous way, highlighting the absurdity of trying to apply precise scientific measurements to a whimsical situation.
So, the “best” answer is around 11 m/s (24 mph), but with a significant range depending on the specific swallow and conditions.
Gemma-3-27b: This model’s answer was similar to the 12b’s, but included mention of African and European swallows as brought up in the film. It still had accurate unit conversions, but it included two website citations at the end which both turned out to be nonexistent – though one (Wikipedia) did redirect to a related page. (The full response follows.)
Ah, a classic question! As famously debated in Monty Python and the Holy Grail, determining the airspeed velocity of an unladen swallow is… complicated.
Here’s the breakdown:
- African or European Swallow? This is the crucial first question posed by King Arthur’s guard. The answer drastically changes things.
- European Swallow: Roughly 11 meters per second, or about 24 miles per hour. This is generally accepted as a reasonable estimate for a typical European swallow in flight.
- African Swallow: This is where it gets tricky! There’s no definitive answer given in the film, and it’s implied they might be able to carry coconuts (which is physically improbable). Estimates vary wildly, but are generally considered faster than the European variety due to potentially different wing structures and flight styles. Some suggest around 25 mph, others higher.
However, there are many factors that affect airspeed velocity:
- Weight: An unladen swallow is key! Adding any weight (like a coconut) significantly reduces speed.
- Wind conditions: Headwinds slow it down, tailwinds speed it up.
- Altitude: Air density affects lift and drag.
- Individual bird variation: Just like people, some swallows are stronger fliers than others.
In conclusion: The airspeed velocity of an unladen European swallow is approximately 24 mph. The African swallow’s speed remains a matter of spirited debate!
You can find more information (and the original scene) here:
Granted, this experiment was extremely subjective – but based on my experience with these three models I would say that the smallest one was very unreliable with its responses. It identified the correct movie that this quote was from, but then made up dialogue about it that never occurs in the film. The middle model, which was about half as fast, did a much better job and was more accurate with its description of how the quote was used in the movie. The largest model, which was roughly 30 times slower on my system, did not provide a substantial improvement – certainly nothing worth waiting so much longer to get a reply!
Conclusion
Generative AI is still a fast-growing field, and there are numerous advantages to hosting your own models, including privacy, flexibility, and cost savings over the long term compared to paying for cloud services. I didn’t even touch on one of the most significant: retrieval-augmented generation (RAG), which lets you connect an LLM to your own data sources so that it can reference that info when responding to queries. Hopefully the tools and information presented here are helpful! I would love to hear about your experiences running local LLMs in the comments below.

