Fine-tuning Qwen3-ASR
This guide shows how to fine-tune Qwen3-ASR on JSONL audio-text pairs using the qwen3_asr_sft.py training script. Multi-GPU training is supported via torchrun.
1) Setup
First, install the qwen-asr and datasets Python packages:
pip install -U qwen-asr datasets
Then, to reduce GPU memory usage and speed up training, we recommend installing FlashAttention 2:
pip install -U flash-attn --no-build-isolation
If your machine has less than 96GB of RAM and many CPU cores, cap the number of parallel compilation jobs so the build does not run out of memory:
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
Your hardware must also be compatible with FlashAttention 2; see the official documentation in the FlashAttention repository for the list of supported GPUs. Note that FlashAttention 2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.
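Before training, you can sanity-check the install with a short snippet. This is a minimal sketch; it assumes a CUDA GPU is visible and that flash-attn built successfully:

import torch

# FlashAttention 2 requires an Ampere-or-newer GPU (compute capability >= 8.0).
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")

try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")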
2) Input JSONL format
Prepare your training file as JSONL (one JSON per line). Each line must contain:
- audio: path to a WAV file
- text: transcript text (you can include a language prefix)
Example:
{"audio":"/data/wavs/utt0001.wav","text":"language English<asr_text>This is a test sentence."}
{"audio":"/data/wavs/utt0002.wav","text":"language English<asr_text>Another example."}
{"audio":"/data/wavs/utt0003.wav","text":"language English<asr_text>Fine-tuning data line."}
Language prefix recommendation:
- If you have language info, use:
language English<asr_text>...
language Chinese<asr_text>...
- If you do not have language info, use:
language None<asr_text>...
Note:
- If you set language None, the model will not learn language detection from that prefix.
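If your transcripts live elsewhere (e.g., in a manifest or a database), a few lines of Python are enough to emit this format. Below is a minimal sketch; the (audio, text, language) triples are placeholders for your own data:

import json

# Placeholder (audio_path, transcript, language) triples; replace with your data.
examples = [
    ("/data/wavs/utt0001.wav", "This is a test sentence.", "English"),
    ("/data/wavs/utt0002.wav", "Another example.", None),  # language unknown
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for audio, text, lang in examples:
        # Fall back to the literal "language None" prefix when the language is unknown.
        prefix = f"language {lang or 'None'}<asr_text>"
        f.write(json.dumps({"audio": audio, "text": prefix + text}, ensure_ascii=False) + "\n")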
3) Fine-tune (single GPU)
python qwen3_asr_sft.py \
--model_path Qwen/Qwen3-ASR-1.7B \
--train_file ./train.jsonl \
--output_dir ./qwen3-asr-finetuning-out \
--batch_size 32 \
--grad_acc 4 \
--lr 2e-5 \
--epochs 1 \
--save_steps 200 \
--save_total_limit 5
Checkpoints will be written to:
./qwen3-asr-finetuning-out/checkpoint-<global_step>
4) Fine-tune (multi GPU with torchrun)
export CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node=2 qwen3_asr_sft.py \
--model_path Qwen/Qwen3-ASR-1.7B \
--train_file ./train.jsonl \
--output_dir ./qwen3-asr-finetuning-out \
--batch_size 32 \
--grad_acc 4 \
--lr 2e-5 \
--epochs 1 \
--save_steps 200
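With these settings, and assuming --batch_size is the per-device batch size (the usual convention), each optimizer step sees 32 × 4 × 2 GPUs = 256 examples.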
5) Resume training
Option A: explicitly set a checkpoint path:
python qwen3_asr_sft.py \
--train_file ./train.jsonl \
--output_dir ./qwen3-asr-finetuning-out \
--resume_from ./qwen3-asr-finetuning-out/checkpoint-200
Option B: automatically resume from the latest checkpoint under output_dir:
python qwen3_asr_sft.py \
--train_file ./train.jsonl \
--output_dir ./qwen3-asr-finetuning-out \
--resume 1
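For reference, resolving the latest checkpoint is just a matter of picking the highest global step under output_dir. The sketch below illustrates what --resume 1 is expected to do; it is not the script's actual implementation:

import os
import re

output_dir = "./qwen3-asr-finetuning-out"
# Find directories named checkpoint-<global_step> and keep the newest one.
ckpts = [d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)]
latest = max(ckpts, key=lambda d: int(d.rsplit("-", 1)[1]), default=None)
print(latest)  # e.g. checkpoint-200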
6) Quick inference test
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint.
model = Qwen3ASRModel.from_pretrained(
    "qwen3-asr-finetuning-out/checkpoint-200",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# Transcribe a sample audio file.
results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
)
print(results[0].language)
print(results[0].text)
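To spot-check the fine-tuned model on more than one file, you can loop over your eval set. A minimal sketch, reusing the model loaded above and assuming transcribe also accepts local file paths like those in the JSONL:

import json

with open("eval.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        results = model.transcribe(audio=item["audio"])
        print(item["audio"], "->", results[0].text)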
7) One-click shell script example
#!/usr/bin/env bash
set -e
export CUDA_VISIBLE_DEVICES=0,1
MODEL_PATH="Qwen/Qwen3-ASR-1.7B"
TRAIN_FILE="./train.jsonl"
EVAL_FILE="./eval.jsonl"
OUTPUT_DIR="./qwen3-asr-finetuning-out"
torchrun --nproc_per_node=2 qwen3_asr_sft.py \
--model_path "${MODEL_PATH}" \
--train_file "${TRAIN_FILE}" \
--eval_file "${EVAL_FILE}" \
--output_dir "${OUTPUT_DIR}" \
--batch_size 32 \
--grad_acc 4 \
--lr 2e-5 \
--epochs 1 \
--log_steps 10 \
--save_strategy steps \
--save_steps 200 \
--save_total_limit 5 \
--num_workers 2 \
--pin_memory 1 \
--persistent_workers 1 \
--prefetch_factor 2
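Save this as, e.g., run_sft.sh and launch it with bash run_sft.sh. Note that --eval_file does not appear in the earlier examples, so it is optional; the eval set should use the same JSONL format as the training file.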