Running LLMs on Radeon GPUs with ROCm
November 20, 2025 · 1 min read
One of my main homelab projects is getting large language models running efficiently on AMD Radeon GPUs. While NVIDIA dominates the ML space, AMD's ROCm platform has come a long way and offers a compelling alternative for local inference.
Why AMD?
- Price/Performance - The RX 7900 XTX offers great value for inference workloads
- 24GB VRAM - Enough to run 7B-13B models comfortably
- Open Source - ROCm is fully open source
The Stack
My current setup splits workloads across two nodes to maximize throughput:
- Text Inference: llamacpp with Speculative Decoding on cblevins-7900xtx
- Image/Video Gen: comfyui with WanVideo on cblevins-5930k
Real-World Configuration: Text Inference
I've moved from vLLM to llamacpp-server for better support of GGUF quantization and lower VRAM usage. Here is the actual production configuration running on the primary node:
# Qwen2.5-7B with speculative decoding (Excerpt from llamacpp-qwen2p5-7b-spec.yaml)
args:
  - |
    exec /opt/src/llama.cpp/build/bin/llama-server \
      --model /models/qwen2.5-7b-abliterated/Qwen2.5-7B-Instruct-abliterated-v2.Q4_K_M.gguf \
      --model-draft /models/qwen2.5-0.5b/qwen2.5-0.5b-instruct-q8_0.gguf \
      --ctx-size 16384 \
      --n-gpu-layers 9999 \
      --n-gpu-layers-draft 9999 \
      --draft-max 16 \
      --draft-min 4 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --parallel 4
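Once the pod is up, the quickest sanity check is to hit llama-server's OpenAI-compatible chat endpoint. The sketch below is a minimal Python example, assuming the service is reachable at localhost:8080 (llama-server's default port) and that the prompt is just a placeholder:

import requests

# Minimal smoke test against llama-server's OpenAI-compatible endpoint.
# The host/port are assumptions; point them at wherever the service is exposed.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarize ROCm in one sentence."}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

Because the endpoint speaks the OpenAI API, any existing OpenAI client library can also be pointed at it by overriding the base URL.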
Speculative Decoding Workflow
To squeeze every bit of performance out of the 7900 XTX, I use speculative decoding. This involves running a smaller "draft" model alongside the main model to predict upcoming tokens.
- Draft Model: Qwen2.5-0.5B-Instruct
- Target Model: Qwen2.5-7B-Instruct (Abliterated)
The draft model rapidly predicts the next few tokens, and the target model verifies them in parallel. With --draft-max 16, up to 16 drafted tokens can be validated in a single target pass. Because each target pass has to read the full 7B weights regardless of how many tokens it verifies, batching the verification amortizes the memory-bandwidth cost that usually constrains generation speed.
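For intuition, here is a stripped-down sketch of that draft-and-verify loop with greedy acceptance. draft_next_token and target_greedy_picks are hypothetical stand-ins for the two models; llama-server does all of this internally, so this is purely illustrative:

# Conceptual sketch of speculative decoding with greedy acceptance.
# draft_next_token and target_greedy_picks are hypothetical stand-ins for the
# 0.5B draft and the 7B target; llama-server implements this loop internally.
def speculative_step(context, draft_next_token, target_greedy_picks, draft_max=16):
    # 1. The cheap draft model proposes up to draft_max tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(draft_max):
        tok = draft_next_token(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target scores the whole proposal in one forward pass, returning
    #    its own greedy pick at each proposed position.
    target_picks = target_greedy_picks(context, proposed)

    # 3. Keep the longest prefix where draft and target agree; at the first
    #    disagreement, take the target's token and stop.
    accepted = []
    for drafted, target_tok in zip(proposed, target_picks):
        accepted.append(target_tok)
        if drafted != target_tok:
            break
    return accepted

The win comes from step 2: a single target pass reads the weights once but can commit several tokens, whereas plain autoregressive decoding pays that full weight read for every single token.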
Next Steps
I'm working on a comprehensive benchmarking suite to compare different models and configurations. Stay tuned for more detailed performance data.
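In the meantime, a rough throughput number is easy to collect by timing one request and reading the completion token count from the usage field of the response. Again a minimal sketch, with the same assumed localhost:8080 endpoint; a real benchmark should average many runs and vary prompt length:

import time
import requests

# Time a single generation and derive a rough tokens-per-second figure.
# Endpoint and prompt are assumptions; this includes prompt processing time.
start = time.time()
data = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a short paragraph about GPUs."}],
        "max_tokens": 256,
    },
    timeout=300,
).json()
elapsed = time.time() - start
generated = data["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")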
Check out the ROCm Inference Stack project for the full setup.