Running LLMs on Radeon GPUs with ROCm
November 20, 2025 · 1 min read
One of my main homelab projects is getting large language models running efficiently on AMD Radeon GPUs. While NVIDIA dominates the ML space, AMD's ROCm platform has come a long way and offers a compelling alternative for local inference.
Why AMD?
- Price/Performance - The RX 7900 XTX offers great value for inference workloads
- 24GB VRAM - Enough to run 7B-13B models comfortably
- Open Source - ROCm is fully open source
The Stack
My current setup splits workloads across two nodes to maximize throughput:
- Text Inference: llamacpp with Speculative Decoding on cblevins-7900xtx
- Image/Video Gen: comfyui with WanVideo on cblevins-5930k
Real-World Configuration: Text Inference
I've moved from vLLM to llamacpp-server for better support of GGUF quantization and lower VRAM usage. Here is the actual production configuration running on the primary node:
# Qwen2.5-7B with speculative decoding (Excerpt from llamacpp-qwen2p5-7b-spec.yaml)
args:
  - |
    exec /opt/src/llama.cpp/build/bin/llama-server \
      --model /models/qwen2.5-7b-abliterated/Qwen2.5-7B-Instruct-abliterated-v2.Q4_K_M.gguf \
      --model-draft /models/qwen2.5-0.5b/qwen2.5-0.5b-instruct-q8_0.gguf \
      --ctx-size 16384 \
      --n-gpu-layers 9999 \
      --n-gpu-layers-draft 9999 \
      --draft-max 16 \
      --draft-min 4 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --parallel 4
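Once the pod is up, the quickest sanity check is to hit llama-server's OpenAI-compatible chat endpoint. The sketch below is a minimal Python example, assuming the service is reachable at localhost:8080 (llama-server's default port) and that the prompt is just a placeholder:

import requests

# Minimal smoke test against llama-server's OpenAI-compatible endpoint.
# The host/port are assumptions; point them at wherever the service is exposed.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarize ROCm in one sentence."}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

Because the endpoint speaks the OpenAI API, any existing OpenAI client library can also be pointed at it by overriding the base URL.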
Speculative Decoding Workflow
To squeeze every bit of performance out of the 7900 XTX, I use speculative decoding. This involves running a smaller "draft" model alongside the main model to predict upcoming tokens.
- Draft Model: Qwen2.5-0.5B-Instruct
- Target Model: Qwen2.5-7B-Instruct (Abliterated)
The draft model rapidly predicts the next few tokens, and the target model verifies them in parallel. With --draft-max 16, up to 16 drafted tokens can be validated in a single target pass. Because each target pass has to read the full 7B weights regardless of how many tokens it verifies, batching the verification amortizes the memory-bandwidth cost that usually constrains generation speed.
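For intuition, here is a stripped-down sketch of that draft-and-verify loop with greedy acceptance. draft_next_token and target_greedy_picks are hypothetical stand-ins for the two models; llama-server does all of this internally, so this is purely illustrative:

# Conceptual sketch of speculative decoding with greedy acceptance.
# draft_next_token and target_greedy_picks are hypothetical stand-ins for the
# 0.5B draft and the 7B target; llama-server implements this loop internally.
def speculative_step(context, draft_next_token, target_greedy_picks, draft_max=16):
    # 1. The cheap draft model proposes up to draft_max tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(draft_max):
        tok = draft_next_token(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target scores the whole proposal in one forward pass, returning
    #    its own greedy pick at each proposed position.
    target_picks = target_greedy_picks(context, proposed)

    # 3. Keep the longest prefix where draft and target agree; at the first
    #    disagreement, take the target's token and stop.
    accepted = []
    for drafted, target_tok in zip(proposed, target_picks):
        accepted.append(target_tok)
        if drafted != target_tok:
            break
    return accepted

The win comes from step 2: a single target pass reads the weights once but can commit several tokens, whereas plain autoregressive decoding pays that full weight read for every single token.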
Next Steps
I'm working on a comprehensive benchmarking suite to compare different models and configurations. Stay tuned for more detailed performance data.
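In the meantime, a rough throughput number is easy to collect by timing one request and reading the completion token count from the usage field of the response. Again a minimal sketch, with the same assumed localhost:8080 endpoint; a real benchmark should average many runs and vary prompt length:

import time
import requests

# Time a single generation and derive a rough tokens-per-second figure.
# Endpoint and prompt are assumptions; this includes prompt processing time.
start = time.time()
data = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a short paragraph about GPUs."}],
        "max_tokens": 256,
    },
    timeout=300,
).json()
elapsed = time.time() - start
generated = data["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")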
Check out the ROCm Inference Stack project for the full setup.