Qwen3: How to Run and Fine-tune

openoker 2025-05-14 10:41:36 AI Basics Qwen3 fine-tuning unsloth


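Learn to run and fine-tune Qwen3 locally with our Dynamic 2.0 quants

Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction following, agent capabilities, and multilingual support.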

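All Qwen3 uploads use our new Unsloth Dynamic 2.0 methodology, which delivers the best performance on 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized Qwen3 LLMs with minimal accuracy loss!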

We also uploaded Qwen3 with a native 128K context length. Qwen achieves this by using YaRN to extend the original 40K window to 128K.

Unsloth now also supports fine-tuning and GRPO for Qwen3 and Qwen3 MOE models: 2x faster, with 70% less VRAM and 8x longer context lengths. Fine-tune Qwen3 (14B) for free using our Colab notebook.

Running Qwen3 Tutorial

Fine-tuning Qwen3 Tutorial

🖥️ Running Qwen3

⚙️ Official Recommended Settings

According to Qwen, these are the recommended settings for inference:

| Non-Thinking Mode Settings | Thinking Mode Settings |
| --- | --- |
| Temperature = 0.7 | Temperature = 0.6 |
| Min_P = 0.0 (optional, but 0.01 works well, llama.cpp default is 0.1) | Min_P = 0.0 |
| Top_P = 0.8 | Top_P = 0.95 |
| TopK = 20 | TopK = 20 |
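
The two columns above can be captured as a small lookup table. The following is a minimal sketch, not an official API: `SAMPLING` and `llama_cli_flags` are hypothetical names introduced here for illustration. The flags it emits (`--temp`, `--min-p`, `--top-p`, `--top-k`) are the same llama.cpp options used in the `llama-cli` command later in this guide.

```python
# Recommended sampling settings from the table above, keyed by mode.
SAMPLING = {
    "thinking": {"temp": 0.6, "min_p": 0.0, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temp": 0.7, "min_p": 0.0, "top_p": 0.8, "top_k": 20},
}


def llama_cli_flags(mode: str) -> list[str]:
    """Return llama.cpp-style sampling flags for a mode (hypothetical helper)."""
    s = SAMPLING[mode]
    return [
        "--temp", str(s["temp"]),
        "--min-p", str(s["min_p"]),
        "--top-p", str(s["top_p"]),
        "--top-k", str(s["top_k"]),
    ]


print(" ".join(llama_cli_flags("thinking")))
# prints: --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20
```

Keeping the values in one place makes it harder to mix the two modes by accident, for example running thinking mode with the non-thinking temperature.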

**Chat template/prompt format:**

<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n

For non-thinking mode, we purposely add `<think>` and `</think>` with nothing between them:

<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n
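
The two prompt strings above differ only in the empty think block appended after the assistant header. As a minimal sketch (the `build_prompt` helper is a hypothetical name and handles a single user turn only), the format can be assembled like this:

```python
def build_prompt(user_message: str, thinking: bool = True) -> str:
    """Format a single-turn ChatML prompt for Qwen3 (illustrative helper)."""
    prompt = (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if not thinking:
        # Pre-fill an empty think block so the model goes straight to the answer.
        prompt += "<think>\n\n</think>\n\n"
    return prompt


print(repr(build_prompt("What is 2+2?", thinking=False)))
```

In practice a chat template usually does this for you; building the string by hand is mainly useful when passing a raw `--prompt` to an inference engine.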

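For thinking mode, do NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.

Switching Between Thinking and Non-Thinking Mode

Qwen3 models come with a built-in "thinking mode" that boosts reasoning and improves response quality, similar to how QwQ-32B worked. The instructions for switching differ depending on the inference engine you use, so make sure you follow the right ones.

Instructions for llama.cpp and Ollama:

You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

Here is an example of a multi-turn conversation: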

> Who are you /no_think

<think>

</think>

I am Qwen, a large-scale language model developed by Alibaba Cloud. [...]

> How many 'r's are in 'strawberries'? /think

<think>
Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberries". [...]
</think>

The word strawberries contains 3 instances of the letter r. [...]
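
The "most recent instruction wins" rule can be mirrored on the client side, for example to decide which sampling settings to apply for the next turn. This is only a bookkeeping sketch under assumptions: `effective_mode` is a hypothetical helper, the message format (a list of `role`/`content` dictionaries) is assumed, and the model itself decides how it reacts to the switches.

```python
import re

# Matches /think or /no_think when they appear as standalone tokens.
_SWITCH = re.compile(r"(?<!\S)/(no_think|think)(?!\S)")


def effective_mode(messages, default="think"):
    """Return 'think' or 'no_think' based on the latest switch in the chat."""
    mode = default
    for msg in messages:
        # Only system and user messages carry switches; skip assistant turns.
        if msg["role"] not in ("system", "user"):
            continue
        for match in _SWITCH.finditer(msg["content"]):
            mode = match.group(1)
    return mode


chat = [
    {"role": "user", "content": "Who are you /no_think"},
    {"role": "assistant", "content": "I am Qwen."},
    {"role": "user", "content": "How many 'r's are in 'strawberries'? /think"},
]
print(effective_mode(chat[:1]))  # after the first turn
print(effective_mode(chat))      # after the third turn
```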

Instructions for transformers and vLLM:

Thinking mode:

enable_thinking=True

By default, Qwen3 has thinking enabled. When you call `tokenizer.apply_chat_template`, you don't need to set anything manually.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Default is True
)

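In thinking mode, the model generates an extra `<think>...</think>` block before the final answer, which lets it "plan" and sharpen its response.

Non-thinking mode:

enable_thinking=False

Enabling non-thinking mode makes Qwen3 skip all thinking steps and behave like a normal LLM.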

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Disables thinking mode
)

This mode provides the final response directly, with no `<think>` block and no chain-of-thought.
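
When post-processing output from either mode, it is often convenient to separate the reasoning from the final answer. The sketch below is one possible way to do that, assuming the decoded text contains both the opening and closing tags; `split_thinking` is a hypothetical helper, not part of transformers or vLLM.

```python
import re

# Non-greedy match so only the first think block is captured.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block exists."""
    match = _THINK_BLOCK.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>\n2 plus 2 is 4.\n</think>\n\nThe answer is 4.")
print(reasoning)
print(answer)
```

An empty think block, as produced in non-thinking mode, yields an empty reasoning string and leaves the answer untouched.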

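🦙 Ollama: Run Qwen3 Tutorial

Install `ollama` if you haven't already! With Ollama you can only run models up to 32B in size. To run the full 235B-A22B model, see the "Running Qwen3-235B-A22B" section below.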

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh

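Run the model! Note that if this fails, you can run `ollama serve` in another terminal! We include all our fixes and suggested parameters (temperature, etc.) in `params` in our Hugging Face upload!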

ollama run hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_XL

To disable thinking, use the following (or set it in the system prompt):

>>> Write your prompt here /nothink

If you experience any looping, Ollama might have set your context length window to around 2,048. If so, increase it to 32,000 and see if the issue persists.
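
One way to raise the context window per request is the `num_ctx` option of Ollama's HTTP API. The sketch below only builds the JSON body. The `ollama_chat_payload` helper and the example prompt are hypothetical, and it assumes the model from the `ollama run` command above is already pulled.

```python
import json


def ollama_chat_payload(prompt: str, num_ctx: int = 32000) -> dict:
    """Build a request body for Ollama's /api/chat endpoint (illustrative helper)."""
    return {
        "model": "hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_XL",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # num_ctx sets the context window size for this request.
        "options": {"num_ctx": num_ctx},
    }


payload = ollama_chat_payload("Why is the sky blue?")
print(json.dumps(payload, indent=2))
```

Assuming a default local install, this body would be sent as a POST request to `http://localhost:11434/api/chat`. Larger `num_ctx` values use more memory, so reduce the value if the model no longer fits.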

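📖 Llama.cpp: Run Qwen3 Tutorial

Obtain the latest `llama.cpp` from GitHub (https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference.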

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Download the model with the script below (after running `pip install huggingface_hub hf_transfer`). You can choose Q4_K_M or other quantized versions.

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-32B-GGUF",
    local_dir = "unsloth/Qwen3-32B-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)

Run the model and try any prompt. To disable thinking, use the following (or set it in the system prompt):

>>> Write your prompt here /nothink

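Running Qwen3-235B-A22B

For Qwen3-235B-A22B, we will specifically use Llama.cpp for optimized inference and its wide range of options.

We follow similar steps to the above, but this time we also need a few extra steps because the model is so large.
Download the model with the script below (after running `pip install huggingface_hub hf_transfer`). You can choose UD-IQ2_XXS or other quantized versions.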

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-GGUF",
    allow_patterns = ["*UD-IQ2_XXS*"],
)

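Run the model and try any prompt.
Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for the context length, and `--n-gpu-layers 99` for the number of layers to offload to the GPU. Try lowering `--n-gpu-layers` if your GPU runs out of memory, and remove it entirely for CPU-only inference.

Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, which improves generation speed. If you have more GPU capacity, you can customize the regex so that more layers stay on the GPU.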
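
To see which tensors a pattern like this would catch, the regex part can be tried against some tensor names. This is an illustrative sketch only: the names below are hypothetical examples in the style commonly seen in GGUF MoE models, the "GPU" label simply means "not overridden to CPU", and Python's `re` module stands in for the regex engine llama.cpp uses, which may differ for more complex patterns.

```python
import re

# The regex part of -ot ".ffn_.*_exps.=CPU" (everything before "=CPU").
pattern = re.compile(r".ffn_.*_exps.")

# Hypothetical tensor names, assumed for illustration.
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_inp.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def placement(name: str) -> str:
    """Return 'CPU' if the override pattern matches the tensor name, else 'GPU'."""
    return "CPU" if pattern.search(name) else "GPU"


for name in tensor_names:
    print(f"{name:32s} -> {placement(name)}")
```

Only names containing `ffn_..._exps` (the expert weights) match, so attention and router tensors are left for the GPU. A narrower regex would keep some expert tensors on the GPU as well.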

./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-IQ2_XXS.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --seed 3407 \
    --prio 3 \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should