Local alternatives to Cloud AI services

Table of Contents

Introduction

There are many advantages to using cloud-based AI services, such as the low barrier to entry and the (generally) high quality of the outputs, and some offer tools that simply aren’t publicly available. However, there is a growing ecosphere of local AI tools, both open and closed source, that are worth consideration. Local AI tools offer increased data security, allow for more customization, and can save money that would otherwise be spent on monthly software subscriptions. Today, I’d like to present a selection of tools that can be run on your own hardware, emphasizing options with a relatively low entry barrier due to low or no cost and ease of use. I hope you’ll discover something that has a niche in your workflow!

Frame Interpolation

RIFE

Real-Time Intermediate Flow Estimation for Video Frame Interpolation or “RIFE” is a command-line utility that can create new frames between existing frames in a video or still photos. Command-line utilities are not particularly easy to use for most people, and several projects that integrate RIFE into a GUI are listed on the GitHub page, including the two options listed below.

Flowframes

Primarily a GUI for frame interpolation using RIFE, although other interpolation models are also available. One benefit is that older versions of the application are available for free, so you can try out the software before making a purchase.

Fluidframes

A similar project to Flowframes, implementing a GUI for RIFE frame interpolation. Does not have a free version, but a license only costs $5. Potentially more active development than Flowframes.

Text to Speech & Speech Conversion

There are many models and libraries for converting text to speech and speech conversion, such as Bark, Coqui, Tortoise, VoiceCraft, RVC, and many others, but by and large, these are command line utilities that aren’t particularly easy to dive into. That’s why I recommend checking out TTS Generation WebUI below.

TTS Generation WebUI

TTS Generation WebUI combines a multitude of AI audio tools into one (relatively) easy-to-use interface, making it much easier to jump into the world of AI audio. Not only are TTS models included, but speech-to-speech conversion and music generation are included as well, allowing for experimentation with a wide variety of models and methods for generating AI audio.

Speech to Text transcription / Automatic Speech Recognition

Whisper

Whisper is an ASR model developed by OpenAI. Due to its high performance and permissive license, it is one of the most widely used tools for transcribing audio. It comes in several model sizes, from tiny (39M parameters) to large (1.5B). The project only supports command line and Python usage, but there are projects with GUIs, such as the two options listed below.

Subtitle Edit

Subtitle Edit is not only an excellent way of creating and modifying subtitles for videos, but it’s also one of the easiest ways to use whisper models for transcriptions, allowing for automatic transcription of audio into a multitude of text formats.

Audacity AI Plugins

Intel has released several AI tools for Audacity as plugins powered by OpenVINO. In addition to the whisper transcription tool, there is a noise suppression plugin and several music-oriented plugins offering track separation and even music generation.

Image Generation

Stable Diffusion WebUI (Automatic1111)

SD-WebUI (or Automatic1111 as it’s commonly known) is probably the most popular front-end for image generation with Stable Diffusion. It’s easy to set up and get started, and it offers many options for customizing the parameters of the outputs. In addition, it supports a huge number of extensions to expand its functionality. A number of forks exist as well, including SD.Next and SD-WebUI-directml, both of which have improved support for AMD GPUs. My personal recommendation is SD-WebUI-Forge, which offers improved performance compared to the original and comes pre-packaged with several “must-have” extensions such as ControlNet and ADetailer.

ComfyUI

ComfyUI is likely to be the second-most popular option for Stable Diffusion. However, its node-based approach to image generation differs substantially from other solutions and can present a tough learning curve for those unfamiliar with this kind of workflow. That said, it does offer an essentially unbeatable level of control over the generation process, and even if you aren’t ready to create your own custom workflows, community members freely share workflows that you can easily import (such as the absolutely massive AP Workflow). It also offers support for extensions and custom nodes, and the aptly named ComfyUI Manager extension is a “must-have” extension that massively helps with managing these add-ons.

Fooocus

Fooocus is a great option for those looking for a straightforward solution for producing high-quality outputs without much tweaking. The project is positioned as a simple, local alternative to Midjourney, and has this to say about its simplified installation: “Between pressing “download” and generating the first image, the number of needed mouse clicks is strictly limited to less than 3.” While it may not have the same level of fine control as a solution such as SD-WebUI or ComfyUI, for those who aren’t interested in tweaking generation parameters, Fooocus is a great option.

EasyDiffusion

Like the name implies, EasyDiffusion is advertised as the “The easiest way to install and use Stable Diffusion on your computer.” Despite it’s focus on being easy to use, it does still offer more advanced features such as ControlNet. In terms of ease of use and available options, it sits somewhere between SD-WebUI and Fooocus. One of the most interesting features is that it supports multiple GPUs, which is rare in SD applications. However, having more than one GPU work on a single image simultaneously with Stable Diffusion is not technically feasible. Instead, jobs are split and sent to both GPUs, so a job with a batch count of ten across two GPUs results in each GPU producing five images apiece.

Krita AI Diffusion

Unlike the standalone image generation options listed above, Krita AI Diffusion is a plugin for the sketching and painting program Krita. Per the GitHub page, “This plugin seeks to provide what “Generative Fill/Expand” do in Photoshop – and go beyond.” While you can still generate images from scratch using a text prompt, image-based features such as inpainting and outpainting are the real draw. Unlike Adobe’s Generative Fill, Krita AI Diffusion allows you to choose between your preferred model, and you don’t have to worry about “Generative Credits” since it’s powered by your own hardware. This video shows a great example of the kind of control that Krita AI diffusion affords.

Upscaling

All of the above options offer some form of upscaling, either natively, through extensions, or, in the case of ComfyUI, custom workflows. That said, there are free, standalone options for upscaling, such as Upscayl, which offers a simple but effective GUI for a handful of upscaling models. There are also more specialized options like Waifu2x, developed specifically for upscaling anime-style art. One rather expansive project based on Waifu2x is Waifu2x-Extension-GUI, which supports not only the upscaling of anime-style images but also other types of media, such as photos or even video. It even supports frame interpolation as well, making it somewhat difficult to categorize in this post!

Text Generation

LM Studio

LM Studio is a great way to get started with generating text using LLMs. It provides an excellent model browser that not only filters models based on their compatibility with LM Studio’s backend (llama.cpp) but also based on your system’s hardware. This prevents wasted time downloading models in the wrong format or are too large to load on your hardware. With llama.cpp as the only available backend, LM Studio only supports CPU inference, but GPU-offloading is available. This allows some or all of the model’s layers to be loaded onto the GPU, greatly improving performance and reducing RAM usage.

Chat with RTX

Although it’s still early in development and only offers two small models (Llama 2 13B and Mistral 7B), NVIDIA’s Chat with RTX makes the list as it’s one of the easiest ways to utilize retrieval augmented generation, or RAG. RAG allows the LLM to reference documents for more accurate outputs based on the contents of those documents. RAG techniques can also help prevent “hallucinations”, where an LLM makes up information in an attempt to fulfill the user’s request. Instead of generating potentially false information, if the LLM doesn’t find the relevant information, it can advise the user that the requested information is not within its provided context.

Text Generation WebUI (oobabooga)

The stated goal of text-generation-webui is “to become the AUTOMATIC1111/stable-diffusion-webui of text generation,” and just like SD-WebUI, TG-WebUI is often referred to by the name of the repository’s owner, oobabooga. Although not quite as simple to set up as the other text generation options listed above, the included batch files make installation straightforward. One of the best features of TG-WebUI is the array of model backends included, which allows for either CPU inference or GPU inference across a variety of quantization methods. It can also be run as an OpenAPI-compatible server, allowing you to connect to it via LAN or WAN using an OpenAPI-compatible front-end.

Conclusion

I hope this post has introduced you to some new tools that encourage you to step into the world of locally run AI. This list is by no means comprehensive, especially due to the focus on easy-to-use and low-cost tools. One obvious omission from this list is the tools included in paid, close-source applications such as those found within Adobe and BlackMagic Design applications like Premiere Pro & DaVinci Resolve. However, I’m eager to dive into those options in a future post. If you have suggestions for tools that should be cataloged here, please let me know in the comments!