
Sizing VRAM to Generative AI & LLM Workloads

Posted on July 18, 2025 by William George | Last updated: July 18, 2025

Table of Contents

  • Introduction
  • LLM Resources
    • VRAM Guide
    • VRAM Calculator
    • Hosting Software
  • Hosting your own LLM
  • Overflowing VRAM
  • Accuracy and Hallucination
  • Conclusion

Introduction

Generative AI has come a long way in the last few years – and even that is probably a vast understatement. Many people now benefit from GenAI on a daily basis, even if they don’t realize it. A few examples of this include Google’s first answer to most search queries being AI-based, Amazon using AI to summarize customer reviews, and AI-based image and video editing tools becoming more widespread – though none of these are without controversy.

The largest and most powerful generative models are cloud-based, trained and running across massive server clusters with thousands of graphics processing units (GPUs – colloquially called video cards, but now capable of so much more than just generating computer graphics and video output). That level of hardware allows quick responses to the huge number of queries coming in at the same time, but has astronomical purchasing and energy costs.

However, it is entirely possible to run many forms of generative AI, especially large language models (LLMs), on more modest hardware. There are three phases to using GenAI:

  1. Training a model is the most processing-intensive phase and thus needs extremely powerful hardware. This is almost exclusively done on large server clusters.
  2. Fine-tuning an existing model is still demanding but can be done without needing a full-blown supercomputer.
  3. Inference, the term for normal use of a model to generate content, is the least taxing on hardware and the step that most users are looking to perform.

Inference with some small variants of LLMs – yes, I know that sounds contradictory – can be run on a PC with a single GPU. Most reasonably strong models will need a bit more processing power, though, and a number of guides and online tools have been created to help users understand how much hardware they will need for either inference or fine-tuning of various LLMs. This article will not attempt to reinvent that wheel, but rather provide links to some of those resources and then guidance on how to take that information and use it to purchase or build the right type of computer.

LLM Resources

VRAM Guide

First up, here is a guide from Wei-Ming Thor explaining the math behind calculating the amount of VRAM (the memory on GPUs) that LLMs need. If you want to understand what factors go into determining VRAM requirements, this is a great place to start!
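
As a quick illustration of the kind of math that guide covers, here is a minimal back-of-the-envelope sketch in Python. The bytes-per-parameter values are standard for these precisions, but the 20% overhead factor is my own rough assumption to cover activations and framework buffers, not a figure from the guide itself:

```python
# Rough VRAM estimate for holding an LLM's weights during inference.
# The overhead factor is an assumed margin for activations, CUDA context,
# and framework buffers -- real requirements vary by runtime and model.

BYTES_PER_PARAM = {
    "fp16": 2.0,  # 16-bit floating point
    "q8": 1.0,    # 8-bit quantization
    "q4": 0.5,    # 4-bit quantization
}

def estimate_weight_vram_gb(params_billions: float, precision: str,
                            overhead: float = 1.2) -> float:
    """Approximate VRAM (GB) needed just for the model weights."""
    weight_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return weight_bytes * overhead / 1e9

# Example: a 70B-parameter model at 8-bit quantization
print(f"{estimate_weight_vram_gb(70, 'q8'):.0f} GB")  # ~84 GB, before KV cache
```

Note that this only covers the weights themselves; the KV cache grows with context length and concurrent users, which is where a calculator like the one below comes in handy.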

VRAM Calculator

Next, we have a web-based LLM calculator from ApX. It will help make sure you get enough VRAM, using the methods described above, and can also estimate the number of tokens per second that a particular set of GPUs can provide with a given LLM configuration. The default setting is for inference, but there is a tab where you can switch to fine-tuning if you want – which significantly increases the hardware required!

I find that using a calculator like this is most intuitive in one of two ways:

  • If you know exactly what sort of model you want to run, plug in all of those parameters and see how much VRAM it will need. Then, adjust the GPU selection and quantity until you reach that total – with some headroom, to be safe.
  • If you are unsure of the model, but know what GPUs you have available (or have the budget for) then start by selecting the version and quantity of cards and then play around with the various model options until you find a configuration that will work – again, with some headroom.

One small aside about this tool: it doesn’t always have the latest GPUs listed, but there is a “Custom” option which lets you manually plug in the VRAM per card.

Hosting Software

Experienced users may want to manually configure their LLM software setup, but if you are just getting started or want a simple interface, check out LM Studio. It provides a graphical user interface (GUI) for running a range of supported models on Windows, macOS, and Linux… and it’s free!
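
Beyond the GUI, LM Studio can also run a local server that speaks the OpenAI-compatible API (by default on port 1234), which makes it easy to script against whatever model you have loaded. Here is a minimal sketch using the openai Python client; the model name is a placeholder for whichever model you are hosting:

```python
# Minimal sketch: querying a model hosted by LM Studio's local server.
# Assumes the server is running on its default port and a model is
# already loaded; "gemma-3-12b" is a placeholder identifier.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="gemma-3-12b",
    messages=[{"role": "user",
               "content": "What is the airspeed velocity of an unladen swallow?"}],
)
print(response.choices[0].message.content)
```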

While not nearly as polished as LM Studio, Stability Matrix provides a GUI for installing and running image generation models. It also supports all three major OSes, and is free to use, but it is worth noting that there is a lot of controversy over how image-generating models are trained.

There are many other great tools and articles out there – these are just a sample! If there is a resource you’ve found particularly helpful, please feel free to share a link in the comments.

Hosting your own LLM

Once you’ve used a calculator, like the one linked above, to determine how many GPUs (and which model) you need, it is time to build your system. Here is a general breakdown of how many GPUs can fit in different classes of systems:

Type of Computer | AI Computation Support | Puget Systems Example
Laptop | NPU or single mobile-class GPU | Puget Mobile C160-G
Small Form Factor Desktop | Single GPU, potentially limited by size, power, and cooling | Puget Workstation C112-S
Mid Tower PC | Single or dual GPUs, potentially limited by size, power, and cooling | Puget Workstation R121-L
Full Tower PC | Up to three or four GPUs, limited by size, power, and cooling | Puget Workstation T140-XL
Smaller Rackmounts (2U) | Up to four GPUs | Puget Server E140-2U
Larger Rackmounts (4U+) | Up to eight GPUs | Puget Server E281-4U

For example, let’s say a business wanted to stand up a DeepSeek-R1 70B model for an internal team that needs to ask questions without them being recorded on a cloud service, so an on-prem system is required. They’ve got a server closet with some space available and would like the following parameters: Q8 inference precision, FP16 KV cache precision, 8K max sequence length, and up to 8 concurrent users. Plugging those values into the calculator, we get a result of about 125GB of VRAM being needed.
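
To sanity-check that figure by hand, here is a rough sketch of the same arithmetic. The architectural values (80 layers, 8 KV heads, head dimension of 128) are my assumptions for a Llama-style 70B model, and the calculator adds its own runtime overhead margins, so this lands in the same ballpark rather than matching exactly:

```python
# Back-of-the-envelope check on the ~125 GB estimate from the calculator.
# Layer/head counts below are assumed values for a Llama-style 70B model.
params = 70e9
weight_bytes = params * 1.0  # Q8 = 1 byte per parameter -> 70 GB

layers, kv_heads, head_dim = 80, 8, 128  # assumed GQA architecture
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V at FP16

seq_len, users = 8192, 8
kv_cache_bytes = kv_bytes_per_token * seq_len * users  # ~21 GB

total_gb = (weight_bytes + kv_cache_bytes) / 1e9
print(f"Weights + KV cache: ~{total_gb:.0f} GB")  # ~91 GB before overhead
```

Add the runtime’s working memory and a safety margin on top of that, and you approach the calculator’s ~125GB figure.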

A popular choice for running LLMs in a rackmount system is NVIDIA’s L40S, which has 48GB of VRAM per card. Selecting that in the calculator shows that a minimum of three cards should be able to run this model, but it would be cutting things pretty close. To be on the safe side, going with four of them will ensure plenty of extra overhead in case there end up being more users or more tokens per input than initially expected. Four L40S GPUs fit perfectly into a 2U chassis like the Puget Systems Quad GPU LLM Server.

If this fictional company didn’t have a dedicated space for rackmount servers, a desktop tower could be a viable alternative. NVIDIA has recently launched a new top-end professional GPU, the RTX PRO™ 6000 Blackwell, which has 96GB of VRAM and comes in a few variants – including the Max-Q Workstation Edition, which is designed for multi-GPU systems. The calculator shows that two of those (using the Custom option in the GPU drop-down) would work nicely, and those could fit in a full tower like the Puget Systems Multi GPU AI workstation.

Of course, many models can be run with even more modest hardware; this is just an example to help you understand the way to approach configuring a system to run your own LLM. Here at Puget Systems, we have options from compact workstations with a single GPU all the way up to servers with eight NVIDIA H200s – and you can even combine those into clusters for even more total VRAM! If you’d like personal guidance on selecting the right system(s) for your needs, our expert consultants are available to help with no cost or obligation. 

Overflowing VRAM

With some frameworks, there are options to utilize system memory (RAM) if a model needs more memory space than the available VRAM. LM Studio, for example, can configure this for you with a simple toggle option, making it very easy for even novice users. However, doing this comes with a very steep performance penalty!
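
Under the hood, runtimes such as llama.cpp (which LM Studio builds on) implement this by splitting the model’s layers between the GPU and CPU. As a sketch of what that looks like outside of LM Studio, here is how the llama-cpp-python bindings expose the split; the model path is a placeholder:

```python
# Sketch: controlling the GPU/CPU layer split with llama-cpp-python.
# n_gpu_layers sets how many transformer layers live in VRAM; anything
# beyond that spills into system RAM and runs on the (much slower) CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-27b-Q4_0.gguf",  # placeholder path to a GGUF model
    n_gpu_layers=30,  # offload 30 layers to the GPU; -1 offloads them all
    n_ctx=8192,       # context window size
)

output = llm("Explain why overflowing VRAM slows inference.", max_tokens=128)
print(output["choices"][0]["text"])
```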

I did some testing on my home PC, sporting an NVIDIA GeForce RTX™ 4070 SUPER 12GB, with three sizes of the Gemma 3 LLM. LM Studio configured quantization defaults for each one, which were pretty steep in order to let these run on my relatively modest GPU. I measured how much memory each version consumed via Task Manager, and LM Studio reported how many tokens per second they ran at when I asked them each the same query.

Model | Quantization | Size on Disk | Tokens / Second | VRAM Usage | System RAM Usage
Gemma-3-4b | Q4_K_M | 3.34 GB | 100.35 tok/s | ~6 GB | None
Gemma-3-12b | Q3_K_L | 7.33 GB | 46.45 tok/s | ~10 GB | None
Gemma-3-27b | Q4_0 | 16.43 GB | 1.57 tok/s | ~11 GB | ~6 GB

Even with very aggressive quantization settings, the 27 billion parameter version of Gemma 3 was too large for my GPU’s VRAM. LM Studio allowed it to overflow into system memory, so it still ran – but it took almost 30 times longer to answer the question than the 12 billion parameter variant. Was the bigger model able to answer questions more accurately, though? Let’s dig into that in the next section!
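
If you would like to reproduce a rough tokens-per-second measurement yourself, a simple approach is to time a request against the same local endpoint as in the earlier LM Studio sketch and divide tokens generated by elapsed time. Note that this includes prompt-processing time, so it will read a bit lower than the generation-only figure LM Studio reports:

```python
# Rough tokens/sec measurement against a local OpenAI-compatible server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
response = client.chat.completions.create(
    model="gemma-3-12b",  # placeholder: whichever model is loaded
    messages=[{"role": "user",
               "content": "What is the airspeed velocity of an unladen swallow?"}],
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```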

Accuracy and Hallucination

It is important to remember when working with LLMs that they are not actually conscious or intelligent, at least not in the way we think of those concepts for humans. Instead, they are extremely advanced prediction engines. However, similar to a human, they can’t “know” (reply with) information they were not trained on. Unlike most humans, though, LLMs are frequently observed making stuff up without giving any indication that they are ‘guessing’! This is commonly referred to as hallucination, and since LLMs generally don’t warn users that they are uncertain about a response, taking their replies at face value is risky. Generally, models with fewer parameters are more likely to behave this way and provide incorrect responses.

I wanted to demonstrate this phenomenon, so during the performance testing I described above, I asked each of the models “What is the airspeed velocity of an unladen swallow?” I was expecting both a real-life answer and a discussion of this question’s famous inclusion in Monty Python and the Holy Grail. Here is my subjective analysis of what each model gave me:

Gemma-3-4b: This model’s reply was fast, but the part of the answer relating to the film reference was very wrong, and the conversion from meters per second to miles per hour is also off by a little more than 10%.

This is a classic question from Monty Python and the Holy Grail! The answer, as presented in the film, isn’t straightforward. 

Here’s the breakdown:

  • Richard III (played by John Cleese) asks: “What… is the airspeed velocity of an unladen swallow?”
  • Archimedes (played by Graham Chapman) responds: “That depends if it is an African or European swallow.”
  • Richard III then asks: “African!”
  • Archimedes replies: “About 11 meters per second, which is 22 miles per hour. Actually, that was a bit of a guess.”

Therefore, the “correct” answer is approximately 11 meters per second (22 mph), but it’s heavily dependent on whether you’re talking about an African or European swallow.

It’s important to remember this is a comedic device in the film and not a scientifically accurate measurement. 😊 

You can find more detailed explanations of the debate surrounding this question here: https://en.wikipedia.org/wiki/Airspeed_velocity_of_an_unladen_swallow

Gemma-3-12b: A more comprehensive answer. The film reference was shorter, but it was more generalized and didn’t contain explicit hallucinations – though I would say the characterization of its use in the movie is still off. The unit conversions were also much more accurate.

Ah, a classic question! This comes from Monty Python and the Holy Grail. The answer isn’t straightforward, as it depends on the type of swallow. Here’s a breakdown:

The Problem & Why It’s Complicated:

  • Different Swallows: There are many species of swallows (Barn Swallow, European Swallow, etc.). Their size, wing shape, and flight characteristics vary.
  • Unladen vs. Laden: “Unladen” means without a load. A laden swallow would be carrying something, which significantly affects airspeed.
  • Airspeed vs. Ground Speed: Airspeed is the speed relative to the air; ground speed is the speed relative to the ground and affected by wind.

The Research & Estimates:

Researchers have actually tried to answer this question! Here’s a summary of findings:

  • European Swallow (most likely what they were referencing): Studies suggest an average airspeed of 11 meters per second (roughly 24 mph or 39 km/h). However, speeds can range from around 7-16 m/s (15-36 mph or 24-58 km/h) depending on factors like wind and flight style.
  • Barn Swallow: Similar ranges to the European swallow, but with some variation.

The Monty Python Context:

In the movie, after a lengthy discussion about whether an African or European swallow is being referenced, the question remains unanswered in a humorous way, highlighting the absurdity of trying to apply precise scientific measurements to a whimsical situation.

So, the “best” answer is around 11 m/s (24 mph), but with a significant range depending on the specific swallow and conditions.

Gemma-3-27b: This model’s answer was similar to the 12b, but included mention of African and European swallows as brought up in the film. It still had accurate unit conversions, but it included two website citations at the end, both of which turned out to be nonexistent – though one (Wikipedia) did redirect to a related page.

Ah, a classic question! As famously debated in Monty Python and the Holy Grail, determining the airspeed velocity of an unladen swallow is… complicated.

Here’s the breakdown:

  • African or European Swallow? This is the crucial first question posed by King Arthur’s guard. The answer drastically changes things.
  • European Swallow: Roughly 11 meters per second, or about 24 miles per hour. This is generally accepted as a reasonable estimate for a typical European swallow in flight.
  • African Swallow: This is where it gets tricky! There’s no definitive answer given in the film, and it’s implied they might be able to carry coconuts (which is physically improbable). Estimates vary wildly, but are generally considered faster than the European variety due to potentially different wing structures and flight styles. Some suggest around 25 mph, others higher.

However, there are many factors that affect airspeed velocity:

  • Weight: An unladen swallow is key! Adding any weight (like a coconut) significantly reduces speed.
  • Wind conditions: Headwinds slow it down, tailwinds speed it up.
  • Altitude: Air density affects lift and drag.
  • Individual bird variation: Just like people, some swallows are stronger fliers than others.

In conclusion: The airspeed velocity of an unladen European swallow is approximately 24 mph. The African swallow’s speed remains a matter of spirited debate!

You can find more information (and the original scene) here:

  • https://en.wikipedia.org/wiki/Airspeed_Velocity_of_an_Unladen_Swallow
  • https://www.montypython.co.uk/swallow

Granted, this experiment was extremely subjective – but based on my experience with these three models I would say that the smallest one was very unreliable with its responses. It identified the correct movie that this quote was from, but then made up dialogue about it that never occurs in the film. The middle model, which was about half as fast, did a much better job and was more accurate with its description of how the quote was used in the movie. The largest model, which was roughly 30 times slower on my system, did not provide a substantial improvement – certainly nothing worth waiting so much longer to get a reply!

Conclusion

Generative AI is still a fast-growing field, and there are numerous advantages to hosting your own models, including privacy, flexibility, and long-term cost savings compared to paying for cloud services. I didn’t even touch on one of the most significant: retrieval-augmented generation (RAG), which lets you connect an LLM to your own data sources so that it can reference that information when responding to queries. Hopefully the tools and information presented here are helpful! I would love to hear about your experiences running local LLMs in the comments below.

Tags: AI, Generative AI, GPU, LLM, LM Studio, NVIDIA, Performance, Video Cards, VRAM
