Running large language models on AMD hardware is becoming much easier with official ROCm and SGLang support. If you want to deploy Qwen 3.5 efficiently on AMD Instinct GPUs, SGLang provides a streamlined inference server that works out of the box with AMD’s optimized software stack.

In this guide, you’ll learn how to run Qwen 3.5 with SGLang on AMD Instinct GPUs, covering prerequisites, Docker setup, and server launch steps.
What You Need Before You Start
Before running the model, make sure your environment meets these requirements:
- Access to AMD Instinct GPUs (MI300X, MI325X, or MI35X)
- ROCm drivers installed and working
- Docker installed on the host system
- Basic familiarity with running containers and CLI commands
SGLang and Qwen 3.5 already include AMD-specific optimizations, so no manual kernel tuning is required.
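Before pulling anything, it helps to confirm the host can actually see the GPUs. A quick sanity check using the standard ROCm tools (both ship with the ROCm driver stack; exact output varies by ROCm version):
# Confirm the driver stack is healthy and all GPUs are visible
rocm-smi
# List detected GPU architectures (e.g. gfx942 for MI300X)
rocminfo | grep -i gfx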
Step 1: Pull the ROCm-Optimized SGLang Docker Image
AMD provides a prebuilt SGLang container with ROCm support. Pull the image first:
docker pull rocm/sgl-dev:v0.5.8.post1-rocm720-mi30x-20260215
This image includes:
- ROCm runtime
- Triton support
- SGLang server dependencies
Step 2: Launch the Docker Container
Start the container with GPU and device access enabled:
docker run -it \
--device /dev/dri --device /dev/kfd \
--network host --ipc host \
--group-add video \
--security-opt seccomp=unconfined \
-v "$(pwd)":/workspace \
rocm/sgl-dev:v0.5.8.post1-rocm720-mi30x-20260215 /bin/bash
This command allows SGLang to access AMD GPUs directly through ROCm: /dev/kfd is the ROCm compute interface and /dev/dri exposes the GPU render nodes, while --ipc host and the video group grant the shared memory and device permissions the runtime needs.
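Before launching the server, it is worth verifying that the container actually sees the GPUs. A minimal check from inside the container (the second line assumes the image's bundled ROCm build of PyTorch, which exposes AMD GPUs through the torch.cuda API):
# GPUs passed through via /dev/kfd and /dev/dri should appear here
rocm-smi
# Should print one entry per GPU, e.g. 8 on a full MI300X node
python3 -c "import torch; print(torch.cuda.device_count())"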
Step 3: Start the SGLang Inference Server
Inside the container, launch the SGLang server with Qwen 3.5:
python3 -m sglang.launch_server \
--port 8000 \
--model-path Qwen/Qwen3.5-397B-A17B \
--tp-size 8 \
--attention-backend triton \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
What This Command Does
- --model-path loads the Qwen/Qwen3.5-397B-A17B checkpoint (downloaded from Hugging Face on first run)
- --attention-backend triton enables Triton-based linear attention kernels
- --tp-size 8 shards the model across eight GPUs via tensor parallelism
- --reasoning-parser and --tool-call-parser activate Qwen-specific reasoning and tool-call parsing
SGLang automatically detects hybrid attention layers and applies optimized Gated Delta Network kernels.
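If you need to pin the server to a subset of GPUs on a shared node, the standard ROCm device mask applies. A minimal sketch, assuming four exposed devices; --tp-size must match the number of visible GPUs, and whether a given tp-size fits depends on the model's precision and per-GPU HBM capacity:
# Hypothetical variant: restrict to GPUs 0-3 and shrink tensor parallelism to match
HIP_VISIBLE_DEVICES=0,1,2,3 python3 -m sglang.launch_server \
--port 8000 \
--model-path Qwen/Qwen3.5-397B-A17B \
--tp-size 4 \
--attention-backend triton \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder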
Step 4: Test the Model Using the OpenAI-Compatible API
Once the server starts, it exposes an OpenAI-compatible API on port 8000, so you can send text, image, or video inputs from any OpenAI-compatible client.
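A quick way to smoke-test the endpoint is a curl request against the chat completions route (the model name matches the one passed to --model-path in Step 3):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Qwen/Qwen3.5-397B-A17B",
  "messages": [{"role": "user", "content": "Summarize tensor parallelism in one sentence."}],
  "max_tokens": 128
}'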
Example use cases include:
- Long-context reasoning
- Multimodal prompts
- Tool-based AI agents
The model supports extremely large context windows, so you can safely test prompts well beyond 32K tokens.
Step 5: (Optional) Run Accuracy Evaluation
If you want to validate reasoning performance, install the evaluation tools:
pip install "lm-eval[api]"
Run a benchmark such as GSM8K:
lm_eval --model local-completions \
--model_args '{"base_url": "http://localhost:8000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
--tasks gsm8k \
--batch_size auto \
--num_fewshot 5 \
--trust_remote_code
This step helps confirm that the model runs correctly at scale.
How SGLang Optimizes Performance on AMD Instinct GPUs
SGLang integrates tightly with the ROCm ecosystem:
- Triton kernels accelerate linear attention
- hipBLASLt optimizes dense compute paths
- Large HBM capacity allows full-scale models on fewer GPUs
This combination reduces hardware overhead while maintaining strong inference throughput.
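To put numbers behind this on your own hardware, SGLang ships a serving benchmark that can target the running server. A sketch, assuming the flag names used by recent SGLang releases (tune prompt counts and lengths to your workload):
# Synthetic load test against the server started in Step 3
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 8000 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 256 \
--num-prompts 200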
FAQs
Can I run Qwen 3.5 on AMD GPUs?
Yes, you can run Qwen 3.5 on AMD GPUs using ROCm and SGLang with official optimized Docker images.
Which AMD GPUs support Qwen 3.5?
AMD Instinct MI300X, MI325X, and MI35X GPUs support Qwen 3.5 with ROCm optimization.
Do I need ROCm to run Qwen 3.5 on AMD GPUs?
Yes, ROCm is required to enable GPU acceleration and run Qwen 3.5 on AMD GPUs.
Does SGLang support Qwen 3.5 hybrid attention?
Yes, SGLang automatically detects Qwen 3.5 hybrid attention layers and uses optimized Triton kernels.
Can I deploy Qwen 3.5 on a single AMD GPU?
Yes, you can deploy Qwen 3.5 on a single AMD Instinct GPU, depending on model size and memory capacity.
Is Qwen 3.5 optimized for long-context inference on AMD GPUs?
Yes, Qwen 3.5 uses linear attention mechanisms that improve long-context performance on AMD GPUs.
Can I use vLLM instead of SGLang?
Yes, you can run Qwen 3.5 on AMD GPUs using either SGLang or vLLM with ROCm support.
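For reference, a vLLM launch has a similar shape. A minimal sketch, assuming a ROCm build of vLLM is installed (for example via AMD's rocm/vllm Docker images):
# Serve the same checkpoint across 8 GPUs with vLLM's OpenAI-compatible server
vllm serve Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--port 8000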
