How to Run Qwen 3.5 on AMD GPUs Using SGLang (Step-by-Step Guide)

Running large language models on AMD hardware is becoming much easier with official ROCm and SGLang support. If you want to deploy Qwen 3.5 efficiently on AMD Instinct GPUs, SGLang provides a streamlined inference server that works out of the box with AMD’s optimized software stack.

In this guide, you’ll learn how to run Qwen 3.5 with SGLang on AMD Instinct GPUs, covering prerequisites, Docker setup, and server launch steps.

What You Need Before You Start

Before running the model, make sure your environment meets these requirements:

  • Access to AMD Instinct GPUs (MI300X, MI325X, or MI35X)
  • ROCm drivers installed and working
  • Docker installed on the host system
  • Basic familiarity with running containers and CLI commands

SGLang and Qwen 3.5 already include AMD-specific optimizations, so no manual kernel tuning is required.
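
A quick sanity check before pulling anything: the ROCm kernel driver should already expose its device nodes on the host, and rocm-smi (if the ROCm user-space tools are installed) should list each Instinct GPU.

# Confirm the ROCm device nodes the container will need exist on the host
ls /dev/kfd /dev/dri

# List the detected GPUs (requires the ROCm tools on the host)
rocm-smi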

Step 1: Pull the ROCm-Optimized SGLang Docker Image

AMD provides a prebuilt SGLang container with ROCm support. Pull the image first:

docker pull rocm/sgl-dev:v0.5.8.post1-rocm720-mi30x-20260215

This image includes:

  • ROCm runtime
  • Triton support
  • SGLang server dependencies
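
The tag is date-stamped, so newer builds appear over time; after pulling, you can confirm which tag you have locally:

docker images rocm/sgl-dev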

Step 2: Launch the Docker Container

Start the container with GPU and device access enabled:

docker run -it \
  --device /dev/dri --device /dev/kfd \
  --network host --ipc host \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v $(pwd):/workspace \
  rocm/sgl-dev:v0.5.8.post1-rocm720-mi30x-20260215 /bin/bash

This command gives SGLang direct access to the AMD GPUs through ROCm: /dev/kfd and /dev/dri expose the compute and render device nodes, --ipc host provides the shared memory that multi-GPU inference relies on, and the -v mount makes your current directory available at /workspace inside the container.
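
Once inside the container, it is worth confirming the GPUs are visible before launching anything. Assuming the image ships rocm-smi and a ROCm build of PyTorch (as SGLang images typically do):

# List the GPUs visible inside the container
rocm-smi

# ROCm builds of PyTorch reuse the torch.cuda API, so this should
# print the number of Instinct GPUs mapped into the container
python3 -c "import torch; print(torch.cuda.device_count())"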

Step 3: Start the SGLang Inference Server

Inside the container, launch the SGLang server with Qwen 3.5:

python3 -m sglang.launch_server \
  --port 8000 \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp-size 8 \
  --attention-backend triton \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder

What This Command Does

  • --model-path loads Qwen/Qwen3.5-397B-A17B, downloading the weights from Hugging Face if they are not cached locally
  • --attention-backend triton enables the Triton-based linear attention kernels
  • --tp-size 8 shards the model across eight GPUs with tensor parallelism
  • --reasoning-parser qwen3 and --tool-call-parser qwen3_coder activate Qwen-specific reasoning and tool-call parsing

SGLang automatically detects hybrid attention layers and applies optimized Gated Delta Network kernels.
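
Model loading can take a while at this scale. Before sending traffic, you can poll the server's health endpoint (this assumes SGLang's standard /health route on the same port):

curl http://localhost:8000/health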

Step 4: Test the Model Using the OpenAI-Compatible API

Once the server starts, it exposes an OpenAI-compatible API on port 8000. You can send text, image, or video inputs with any OpenAI-style client; a minimal curl request is shown after the list below.

Example use cases include:

  • Long-context reasoning
  • Multimodal prompts
  • Tool-based AI agents

The model supports extremely large context windows, so you can safely test prompts well beyond 32K tokens.
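
Here is a minimal request against the OpenAI-compatible chat endpoint. The payload follows the standard Chat Completions format; adjust the prompt and generation limits to taste:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    "max_tokens": 128
  }'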

Step 5: (Optional) Run Accuracy Evaluation

If you want to validate reasoning performance, install the evaluation tools:

pip install "lm-eval[api]"

Run a benchmark such as GSM8K:

lm_eval --model local-completions \
  --model_args '{"base_url": "http://localhost:8000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
  --tasks gsm8k \
  --batch_size auto \
  --num_fewshot 5 \
  --trust_remote_code

This step helps confirm that the model runs correctly at scale.

How SGLang Optimizes Performance on AMD Instinct GPUs

SGLang integrates tightly with the ROCm ecosystem:

  • Triton kernels accelerate linear attention
  • hipBLASLt optimizes dense compute paths
  • Large HBM capacity allows full-scale models on fewer GPUs

This combination reduces hardware overhead while maintaining strong inference throughput.

FAQs

Can I run Qwen 3.5 on AMD GPUs?

Yes, you can run Qwen 3.5 on AMD GPUs using ROCm and SGLang with official optimized Docker images.

Which AMD GPUs support Qwen 3.5?

AMD Instinct MI300X, MI325X, and MI35X GPUs support Qwen 3.5 with ROCm optimization.

Do I need ROCm to run Qwen 3.5 on AMD GPUs?

Yes, ROCm is required to enable GPU acceleration and run Qwen 3.5 on AMD GPUs.

Does SGLang support Qwen 3.5 hybrid attention?

Yes, SGLang automatically detects Qwen 3.5 hybrid attention layers and uses optimized Triton kernels.

Can I deploy Qwen 3.5 on a single AMD GPU?

Yes, provided the checkpoint fits in the GPU’s HBM. The 397B-parameter model used in this guide is too large for a single device, but smaller checkpoints can be served the same way with --tp-size 1.
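
As a sketch, a single-GPU launch only changes the parallelism flag; the model path below is a placeholder, not a real checkpoint name:

# Single-GPU launch (sketch; <model-that-fits-one-gpu> is a placeholder --
# the 397B model above will not fit on one device)
python3 -m sglang.launch_server \
  --port 8000 \
  --model-path <model-that-fits-one-gpu> \
  --tp-size 1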

Is Qwen 3.5 optimized for long-context inference on AMD GPUs?

Yes, Qwen 3.5 uses linear attention mechanisms that improve long-context performance on AMD GPUs.

Can I use vLLM instead of SGLang?

Yes, you can run Qwen 3.5 on AMD GPUs using either SGLang or vLLM with ROCm support.
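
For comparison, a vLLM launch of the same model would look roughly like this (a sketch using vLLM's standard serve CLI; flags not verified against this specific model):

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --port 8000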
