oLLM Library Enables Execution of 8B-20B Models on NVIDIA GPUs

oLLM brings large language models to consumer GPUs. It's not perfect, but it's a significant step forward for offline and batch workloads.

oLLM, a new library built on Hugging Face Transformers and PyTorch, has been released. It makes running 8B-20B models, and even an 80B MoE, practical on single NVIDIA GPUs, with some limitations.

oLLM's key features include KV cache read/writes that bypass host RAM, DiskCache support for Qwen3-Next-80B, Llama-3 FlashAttention-2 for stability, and GPT-OSS memory reductions via 'flash-attention-like' kernels and chunked MLP. It targets offline, single-GPU workloads, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching.
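To make the disk-backed KV cache idea concrete, here is a minimal conceptual sketch. It is not oLLM's actual API: the class and method names (DiskKVCache, append, load) are hypothetical, and this simplified version stages tensors through host RAM, whereas the library reportedly bypasses it with direct GPU-to-disk reads and writes.

```python
# Hypothetical sketch of a disk-backed KV cache: one file per layer chunk,
# so VRAM only holds the chunk currently needed for attention.
import os
import torch


class DiskKVCache:
    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.chunk_counts = {}  # layer index -> number of chunks written

    def append(self, layer: int, keys: torch.Tensor, values: torch.Tensor) -> None:
        # Persist the newest KV chunk to SSD instead of keeping it in VRAM.
        # (Simplification: this copy goes through CPU memory via torch.save.)
        idx = self.chunk_counts.get(layer, 0)
        path = os.path.join(self.cache_dir, f"layer{layer}_chunk{idx}.pt")
        torch.save({"k": keys.cpu(), "v": values.cpu()}, path)
        self.chunk_counts[layer] = idx + 1

    def load(self, layer: int, device: str = "cuda"):
        # Stream all chunks for one layer back to the GPU and concatenate
        # along the sequence dimension (assuming [batch, heads, seq, head_dim]).
        ks, vs = [], []
        for idx in range(self.chunk_counts.get(layer, 0)):
            path = os.path.join(self.cache_dir, f"layer{layer}_chunk{idx}.pt")
            chunk = torch.load(path, map_location=device)
            ks.append(chunk["k"])
            vs.append(chunk["v"])
        return torch.cat(ks, dim=-2), torch.cat(vs, dim=-2)
```

The trade-off is the same one the library makes: every attention step pays SSD latency instead of VRAM capacity, which is why throughput lands in the batch/offline range rather than interactive speeds.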

The library reports a throughput of ~0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti, suitable for batch/offline analytics but not interactive chat. It streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and optionally offloads layers to CPU. Out of the box, oLLM supports Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B models, targeting NVIDIA Ampere, Ada, and Hopper GPUs.
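The layer-streaming loop can be pictured roughly as follows. This is a sketch of the control flow only, under the assumption that each decoder layer's weights were pre-saved to a separate file; the helper names (build_layer, layer_file) are hypothetical, and oLLM's real loader presumably adds async prefetch and pinned buffers.

```python
# Conceptual sketch: run a forward pass while holding only one decoder
# layer's weights in VRAM at a time.
import torch


@torch.inference_mode()
def forward_streaming(hidden_states, num_layers, build_layer, layer_file, device="cuda"):
    for i in range(num_layers):
        layer = build_layer(i)                                   # construct the layer module (CPU)
        state = torch.load(layer_file(i), map_location="cpu")    # read this layer's weights from SSD
        layer.load_state_dict(state)
        layer = layer.to(device)                                 # only this layer occupies VRAM
        hidden_states = layer(hidden_states)                     # compute, then discard the weights
        del layer, state
        torch.cuda.empty_cache()
    return hidden_states
```

Because every layer is re-read from storage on every pass, end-to-end speed is bounded by SSD bandwidth rather than GPU compute, which matches the reported ~0.5 tok/s figure.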

oLLM can handle up to ~100K tokens of context while keeping VRAM within 8-10 GB. Running Qwen3-Next-80B on consumer hardware with oLLM is feasible but still storage-bound and workload-specific. The library is lightweight and built for pragmatic use, offering a practical way to run large models on single NVIDIA GPUs.
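A quick back-of-envelope calculation shows why offloading is unavoidable at that context length. Using Llama-3-8B's published attention configuration (32 layers, 8 grouped-query KV heads, head dimension 128) as the example:

```python
# KV cache size at long context, FP16, Llama-3-8B attention config.
num_layers   = 32       # decoder layers
num_kv_heads = 8        # grouped-query attention KV heads
head_dim     = 128
bytes_fp16   = 2
context      = 100_000  # tokens

# Per token: one K and one V vector for every layer and KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
kv_cache_gib = bytes_per_token * context / 1024**3

print(f"{bytes_per_token / 1024:.0f} KiB per token, {kv_cache_gib:.1f} GiB total")
# -> 128 KiB per token, ~12.2 GiB of KV cache alone, before counting the
#    ~16 GB of FP16 weights for an 8B model.
```

The KV cache by itself already exceeds an 8 GB card, so keeping it on SSD (and streaming weights layer by layer) is what makes the 8-10 GB VRAM budget possible.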
