How to Run Qwen 3.5 on AMD GPUs Using SGLang (Step-by-Step Guide)

Running large language models on AMD hardware is becoming much easier with official ROCm and SGLang support. If you want to deploy Qwen 3.5 efficiently on AMD Instinct GPUs, SGLang provides a streamlined inference server that works out of the box with AMD’s optimized software stack.

In this guide, you’ll learn how to run Qwen 3.5 with SGLang on AMD Instinct GPUs, covering prerequisites, Docker setup, and server launch steps.

What You Need Before You Start

Before running the model, make sure your environment meets these requirements:

  • Access to AMD Instinct GPUs (MI300X, MI325X, or MI35X)
  • ROCm drivers installed and working
  • Docker installed on the host system
  • Basic familiarity with running containers and CLI commands

SGLang and Qwen 3.5 already include AMD-specific optimizations, so no manual kernel tuning is required.
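
A quick sanity check before pulling anything: the ROCm kernel driver should already expose its device nodes on the host, and rocm-smi (if the ROCm user-space tools are installed) should list each Instinct GPU.

# Confirm the ROCm device nodes the container will need exist on the host
ls /dev/kfd /dev/dri

# List the detected GPUs (requires the ROCm tools on the host)
rocm-smi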

Step 1: Pull the ROCm-Optimized SGLang Docker Image

AMD provides a prebuilt SGLang container with ROCm support. Pull the image first:

docker pull rocm/sgl-dev:v0.5.8.post1-rocm720-mi30x-20260215

This image includes:

  • ROCm runtime
  • Triton support
  • SGLang server dependencies
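
The tag is date-stamped, so newer builds appear over time; after pulling, you can confirm which tag you have locally:

docker images rocm/sgl-dev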

Step 2: Launch the Docker Container

Start the container with GPU and device access enabled:

docker run -it \
  --device /dev/dri --device /dev/kfd \
  --network host --ipc host \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v $(pwd):/workspace \
  rocm/sgl-dev:v0.5.8.post1-rocm720-mi30x-20260215 /bin/bash

This command gives SGLang direct access to the AMD GPUs through ROCm: /dev/kfd and /dev/dri expose the compute and render device nodes, --ipc host provides the shared memory that multi-GPU inference relies on, and the -v mount makes your current directory available at /workspace inside the container.
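
Once inside the container, it is worth confirming the GPUs are visible before launching anything. Assuming the image ships rocm-smi and a ROCm build of PyTorch (as SGLang images typically do):

# List the GPUs visible inside the container
rocm-smi

# ROCm builds of PyTorch reuse the torch.cuda API, so this should
# print the number of Instinct GPUs mapped into the container
python3 -c "import torch; print(torch.cuda.device_count())"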

Step 3: Start the SGLang Inference Server

Inside the container, launch the SGLang server with Qwen 3.5:

python3 -m sglang.launch_server \
  --port 8000 \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp-size 8 \
  --attention-backend triton \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder

What This Command Does

  • --model-path loads Qwen/Qwen3.5-397B-A17B, downloading the weights from Hugging Face if they are not cached locally
  • --attention-backend triton enables the Triton-based linear attention kernels
  • --tp-size 8 shards the model across eight GPUs with tensor parallelism
  • --reasoning-parser qwen3 and --tool-call-parser qwen3_coder activate Qwen-specific reasoning and tool-call parsing

SGLang automatically detects hybrid attention layers and applies optimized Gated Delta Network kernels.
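
Model loading can take a while at this scale. Before sending traffic, you can poll the server's health endpoint (this assumes SGLang's standard /health route on the same port):

curl http://localhost:8000/health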

Step 4: Test the Model Using the OpenAI-Compatible API

Once the server starts, it exposes an OpenAI-compatible API on port 8000. You can send text, image, or video inputs with any OpenAI-style client; a minimal curl request is shown after the list below.

Example use cases include:

  • Long-context reasoning
  • Multimodal prompts
  • Tool-based AI agents

The model supports extremely large context windows, so you can safely test prompts well beyond 32K tokens.
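
Here is a minimal request against the OpenAI-compatible chat endpoint. The payload follows the standard Chat Completions format; adjust the prompt and generation limits to taste:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    "max_tokens": 128
  }'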

Step 5: (Optional) Run Accuracy Evaluation

If you want to validate reasoning performance, install the evaluation tools:

pip install "lm-eval[api]"

Run a benchmark such as GSM8K:

lm_eval --model local-completions \
  --model_args '{"base_url": "http://localhost:8000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
  --tasks gsm8k \
  --batch_size auto \
  --num_fewshot 5 \
  --trust_remote_code

This step helps confirm that the model runs correctly at scale.

How SGLang Optimizes Performance on AMD Instinct GPUs

SGLang integrates tightly with the ROCm ecosystem:

  • Triton kernels accelerate linear attention
  • hipBLASLt optimizes dense compute paths
  • Large HBM capacity allows full-scale models on fewer GPUs

This combination reduces hardware overhead while maintaining strong inference throughput.

FAQs

Can I run Qwen 3.5 on AMD GPUs?

Yes, you can run Qwen 3.5 on AMD GPUs using ROCm and SGLang with official optimized Docker images.

Which AMD GPUs support Qwen 3.5?

AMD Instinct MI300X, MI325X, and MI35X GPUs support Qwen 3.5 with ROCm optimization.

Do I need ROCm to run Qwen 3.5 on AMD GPUs?

Yes, ROCm is required to enable GPU acceleration and run Qwen 3.5 on AMD GPUs.

Does SGLang support Qwen 3.5 hybrid attention?

Yes, SGLang automatically detects Qwen 3.5 hybrid attention layers and uses optimized Triton kernels.

Can I deploy Qwen 3.5 on a single AMD GPU?

Yes, provided the checkpoint fits in the GPU’s HBM. The 397B-parameter model used in this guide is too large for a single device, but smaller checkpoints can be served the same way with --tp-size 1.
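
As a sketch, a single-GPU launch only changes the parallelism flag; the model path below is a placeholder, not a real checkpoint name:

# Single-GPU launch (sketch; <model-that-fits-one-gpu> is a placeholder --
# the 397B model above will not fit on one device)
python3 -m sglang.launch_server \
  --port 8000 \
  --model-path <model-that-fits-one-gpu> \
  --tp-size 1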

Is Qwen 3.5 optimized for long-context inference on AMD GPUs?

Yes, Qwen 3.5 uses linear attention mechanisms that improve long-context performance on AMD GPUs.

Can I use vLLM instead of SGLang?

Yes, you can run Qwen 3.5 on AMD GPUs using either SGLang or vLLM with ROCm support.
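
For comparison, a vLLM launch of the same model would look roughly like this (a sketch using vLLM's standard serve CLI; flags not verified against this specific model):

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --port 8000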
