Slo llama

Many of the most common support questions I see related to Ollama are variations of the same question... Why is Ollama slow?

In this article, we will look at the most common causes of slow llamas, how to troubleshoot them, and some big brain moves to optimize your usage.

RAM bandwidth

Because of the way dense LLMs work, the inference engine has to read every weight in the network for each token it generates. That means RAM bandwidth will always be a limiting factor. Let's look at some examples, but naturally, expect a lot of variation based on your hardware, OS, open programs, etc.

A high-end gaming GPU currently has around 1 TB/s of VRAM bandwidth. Dual-channel DDR4 tops out at roughly 25-50 GB/s, while dual-channel DDR5 typically lands in the 64-140 GB/s range. Macs with Apple silicon span a wide range of bandwidths, most of them considerably higher than other PCs, but not as fast as high-end GPUs. You can get a decent estimate of the maximum generation speed by dividing the bandwidth in GB/s by the model size in GB, though the actual speed will fall below that.

This means that an 8B model at q4 quantization, around 4.5 GB in size, would have a theoretical maximum of:

~222 t/s on a GPU with 1 TB/s VRAM bandwidth
~14-31 t/s on dual-channel DDR5 (64-140 GB/s)
~5-11 t/s on dual-channel DDR4 (25-50 GB/s)

Models at higher quantizations, such as q8, will be much slower, as will models with higher parameter counts. For example, Phi4 14B at q4km (9.1 GB) would have a maximum output of 28 t/s on a GPU with 256 GB/s of VRAM bandwidth, or 16 t/s at q8 (16 GB). Of course, lower quants or parameter counts will be faster.
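
If you want to run this estimate for your own setup, it is just one division: bandwidth over model size. Here is a quick shell sketch (the tps helper is purely hypothetical and assumes bc is installed; plug in your measured bandwidth and the model size shown by ollama ps):

$ tps() { echo "scale=1; $1 / $2" | bc; }
$ tps 256 9.1    # Phi4 14B q4km on a GPU with 256 GB/s bandwidth
28.1
$ tps 50 4.5     # 8B q4 model on dual-channel DDR4 (50 GB/s)
11.1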

GPUs are generally fastest, though other options may be good enough. A high-end desktop, like a Threadripper Pro with 8-channel RAM, may reach ~320 GB/s of bandwidth, and new products like AMD's Strix Halo and Nvidia's Project DIGITS promise to compete with current discrete GPUs, especially because they make large portions of system RAM available to the GPU.

Note that current iGPUs share memory with the CPU, and do not offer significant improvements compared to running on the CPU.

MoE

Sparse Mixture-of-Experts (MoE) models do not read the entire model for each token, only the currently active experts. As such, a 3B MoE model will run very fast even on a CPU with DDR4 memory.

Partial CPU offloading vs multiple GPUs

In cases where your model does not fit fully in your GPU's memory (VRAM), you can split it between the GPU and the CPU. However, note that the CPU's slower RAM will generally be the bottleneck.

If you run a model with 70B parameters (~48 GB) on a GPU with 24 GB of VRAM at 1000 GB/s, with the extra layers offloaded to the CPU with 100 GB/s RAM, then each token will take at least:

24 GB / 1000 GB/s + 24 GB / 100 GB/s = 0.024 s + 0.240 s = 0.264 s

At 0.264 s per token, the theoretical maximum is about 3.8 tokens/second, and in practice you will get considerably less than that on this configuration. However, if you use two GPUs with the same specs, each token will take a little more than:

48 GB / 1000 GB/s = 0.048 s

At 0.048 s per token, your theoretical maximum is now over 20 tokens/second.
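
To sanity-check your own split, add the time spent reading the part of the model in VRAM to the time spent reading whatever spilled into system RAM. A minimal shell sketch using the same assumed bandwidth figures as above (the split_time helper is hypothetical and needs bc installed):

$ split_time() { echo "scale=3; $1/$2 + $3/$4" | bc; }   # args: GB on GPU, GPU GB/s, GB on CPU, RAM GB/s
$ split_time 24 1000 24 100    # 24 GB in VRAM, 24 GB in system RAM
.264
$ split_time 24 1000 24 1000   # the same model split across two 1000 GB/s GPUs
.048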

Make the llama fit and fast

Beyond the model weights, you also need room on your GPU for the KV cache if you want it to be fast. A good rule of thumb is the size of the model plus ~2 GB for a context of 2048 tokens, so for an 8B model at q4km (~4.5 GB), you would need about 6.5 GB of VRAM.

Setting the right context size

2048 is the default context size for Ollama. It can be extended, and needs to be if you hope to use it with larger documents, but this also significantly increases the memory requirements.

You can gradually increase the context size while keeping an eye on memory usage. Your goal should be to always keep the model and cache fully loaded on the GPU. To do so, simply run a model in the terminal, set the num_ctx parameter, and check the output of ollama ps.

$ ollama run llama3.2:3b-instruct-q6_K
>>> Say "hi" and nothing else.
Hi.
>>> /bye
$ ollama ps
NAME                         ID              SIZE      PROCESSOR    UNTIL
llama3.2:3b-instruct-q6_K    355f7bc7ff61    3.7 GB    100% GPU     Forever

As you can see, the default configuration takes 3.7 GB. Let's increase the context size, and see what happens.

$ ollama run llama3.2:3b-instruct-q6_K
>>> /set parameter num_ctx 32768
>>> Say "hi" and nothing else.
Hi.
>>> /bye
$ ollama ps
NAME                         ID              SIZE      PROCESSOR    UNTIL
llama3.2:3b-instruct-q6_K    355f7bc7ff61    8.8 GB    100% GPU     Forever

Our memory requirements more than doubled, but we're still good, as we still run 100% on GPU.
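
If you don't want to set this by hand in every session, one option is to bake the parameter into a derived model with a Modelfile (the new model name below is just an example I picked):

$ cat > Modelfile <<'EOF'
FROM llama3.2:3b-instruct-q6_K
PARAMETER num_ctx 32768
EOF
$ ollama create llama3.2-3b-32k -f Modelfile
$ ollama run llama3.2-3b-32k

Running the derived model should show the same larger footprint in ollama ps, without any /set step.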

Flash attention and KV cache quantization

You can try setting the env var OLLAMA_FLASH_ATTENTION to a true value, like 1. Flash attention is a more efficient implementation of the attention computation: it produces the same results while cutting down on memory traffic. In my tests, prompt eval time sometimes speeds up by over 30%.

You can also enable KV Cache quantization by setting the variable OLLAMA_KV_CACHE_TYPE to q8_0 or q4_0. Note: this requires flash attention to be enabled.
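
How you set these depends on how you launch Ollama. If you start the server yourself from a terminal, it can be as simple as the first line below; on a Linux install managed by systemd, an environment override is the usual route (a sketch, adjust for your own setup):

$ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

$ sudo systemctl edit ollama
# in the editor, add under [Service]:
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
$ sudo systemctl restart ollama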

Here are the effects of changing the cache type from the default (F16) with the same model and context at 32k.

$ ollama ps # q8_0
NAME                         ID              SIZE      PROCESSOR    UNTIL
llama3.2:3b-instruct-q6_K    355f7bc7ff61    6.8 GB    100% GPU     Forever
$ ollama ps # q4_0
NAME                         ID              SIZE      PROCESSOR    UNTIL
llama3.2:3b-instruct-q6_K    355f7bc7ff61    5.9 GB    100% GPU     Forever

Notice how our memory requirements decreased by 23% and 33% respectively. This can be a great way to fit a larger context, but it may affect the LLM's response quality, so definitely experiment with it.

Other apps and your VRAM

Note that other applications can also use GPU memory for graphical acceleration. Web browsers, Discord, other electron-based apps, or media players may all eat up some VRAM. If a model used to fit, but no longer does, try closing as many apps as possible, or rebooting your computer and trying again; maybe more VRAM is now available.

Be sure to check the Ollama logs; they will let you know how much VRAM is free. If you want to feed even more VRAM to the llamas, consider running Linux with no desktop environment, and using your machine as an AI server, especially if you have another that can be used as a client.
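
On a Linux install managed by systemd (an assumption; log locations differ on macOS and Windows), the server log is one journalctl away, and on NVIDIA hardware, nvidia-smi will show you which processes are currently holding VRAM:

$ journalctl -u ollama -e    # look for the available VRAM reported when a model loads
$ nvidia-smi                 # lists processes along with their GPU memory usage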

Big-🧠 power moves for fast 🦙s

Here are the key takeaways to ensure Ollama runs as fast as possible.

- Memory bandwidth is king: tokens per second is roughly bandwidth divided by the gigabytes read per token.
- Keep the model and its KV cache fully in VRAM; ollama ps should say 100% GPU.
- Pick a quantization, parameter count, and context size that actually fit your GPU.
- Sparse MoE models read far less per token, so they are much faster when bandwidth is limited.
- Avoid partial CPU offloading if you can; a second GPU beats spilling into system RAM.
- Enable flash attention, and consider KV cache quantization to fit a larger context.
- Close other apps that eat VRAM, and check the Ollama logs to see how much is actually free.

We will look at ensuring your GPU is supported in a future article.
