
Fine-tuning Qwen3-ASR

This script fine-tunes Qwen3-ASR using JSONL audio-text pairs. It supports multi-GPU training via torchrun.

1) Setup

First, install the qwen-asr and datasets Python packages:

pip install -U qwen-asr datasets

Then, to reduce GPU memory usage and speed up training, installing FlashAttention 2 is recommended:

pip install -U flash-attn --no-build-isolation

If your machine has less than 96 GB of RAM and many CPU cores, limit the number of parallel build jobs to avoid running out of memory during compilation:

MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

Your hardware must also be compatible with FlashAttention 2; see the official documentation in the FlashAttention repository for details. Note that FlashAttention 2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.
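As a quick sanity check before installing, you can inspect your GPU's compute capability. This sketch assumes (per the FlashAttention documentation) that FA2 requires an NVIDIA GPU of compute capability 8.0 (Ampere) or newer; the helper name is illustrative:

```python
# Check whether the local GPU generation can run FlashAttention 2.
# Assumption: FA2 requires compute capability >= 8.0 (Ampere/Ada/Hopper).

def supports_flash_attn_2(major: int, minor: int) -> bool:
    """Return True for compute capability >= 8.0."""
    return (major, minor) >= (8, 0)

if __name__ == "__main__":
    try:
        import torch
        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability(0)
            print(f"sm_{major}{minor}: FA2 supported = {supports_flash_attn_2(major, minor)}")
        else:
            print("No CUDA device visible.")
    except ImportError:
        print("PyTorch not installed.")
```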

2) Input JSONL format

Prepare your training file as JSONL (one JSON object per line). Each line must contain:

  • audio: path to a WAV file
  • text: transcript text (you can include a language prefix)

Example:

{"audio":"/data/wavs/utt0001.wav","text":"language English<asr_text>This is a test sentence."}
{"audio":"/data/wavs/utt0002.wav","text":"language English<asr_text>Another example."}
{"audio":"/data/wavs/utt0003.wav","text":"language English<asr_text>Fine-tuning data line."}

Language prefix recommendation:

  • If you have language info, use:
    • language English<asr_text>...
    • language Chinese<asr_text>...
  • If you do not have language info, use:
    • language None<asr_text>...

Note:

  • If you set language None, the model will not learn language detection from that prefix.
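The format above is easy to generate and check programmatically. A minimal sketch (the helper names and paths are illustrative, not part of this repo) that builds one line and validates an existing manifest:

```python
# Build and validate training lines in the JSONL format described above.
import json

def make_line(audio_path: str, transcript: str, language: str = "None") -> str:
    """One JSONL line: {"audio": ..., "text": "language <lang><asr_text><transcript>"}."""
    return json.dumps(
        {"audio": audio_path, "text": f"language {language}<asr_text>{transcript}"},
        ensure_ascii=False,
    )

def validate_manifest(path: str) -> int:
    """Check every line parses and has the required keys; return the line count."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            rec = json.loads(line)
            assert {"audio", "text"} <= rec.keys(), f"line {i}: missing keys"
            n += 1
    return n
```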

3) Fine-tune (single GPU)

python qwen3_asr_sft.py \
  --model_path Qwen/Qwen3-ASR-1.7B \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --batch_size 32 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 1 \
  --save_steps 200 \
  --save_total_limit 5

Checkpoints will be written to:

  • ./qwen3-asr-finetuning-out/checkpoint-<global_step>
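Note that with gradient accumulation the effective batch size per optimizer step is batch_size × grad_acc × number of GPUs, assuming --batch_size is per device (the usual torchrun convention; verify against the script if unsure):

```python
def effective_batch_size(per_device: int, grad_acc: int, world_size: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_acc * world_size

# The single-GPU command above: 32 * 4 = 128 samples per optimizer step.
print(effective_batch_size(32, 4))     # 128
# The same flags on 2 GPUs: 32 * 4 * 2 = 256.
print(effective_batch_size(32, 4, 2))  # 256
```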

4) Fine-tune (multi GPU with torchrun)

export CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node=2 qwen3_asr_sft.py \
  --model_path Qwen/Qwen3-ASR-1.7B \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --batch_size 32 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 1 \
  --save_steps 200

5) Resume training

Option A: explicitly set a checkpoint path:

python qwen3_asr_sft.py \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --resume_from ./qwen3-asr-finetuning-out/checkpoint-200

Option B: automatically resume from the latest checkpoint under output_dir:

python qwen3_asr_sft.py \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --resume 1
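Auto-resume typically works by scanning output_dir for checkpoint-&lt;step&gt; directories and picking the highest step. A sketch of that lookup, assuming the Hugging Face Trainer-style layout shown above (the actual logic in qwen3_asr_sft.py may differ):

```python
# Find the newest checkpoint-<step> directory under an output dir.
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the largest step, or None."""
    best_step, best_path = -1, None
    for p in Path(output_dir).glob("checkpoint-*"):
        m = re.fullmatch(r"checkpoint-(\d+)", p.name)
        if p.is_dir() and m and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), str(p)
    return best_path
```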

6) Quick inference test

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "qwen3-asr-finetuning-out/checkpoint-200",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
)

print(results[0].language)
print(results[0].text)
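To go beyond eyeballing a single transcript, you can compare the fine-tuned model's outputs against reference transcripts with word error rate. A minimal pure-Python WER (word-level edit distance; no external dependencies, not part of the qwen-asr API):

```python
# Word error rate: edit distance over words, divided by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            )
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Feed it results[0].text from the snippet above together with your reference transcript (strip any "language ...&lt;asr_text&gt;" prefix first).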

7) One-click shell script example

#!/usr/bin/env bash
set -e

export CUDA_VISIBLE_DEVICES=0,1

MODEL_PATH="Qwen/Qwen3-ASR-1.7B"
TRAIN_FILE="./train.jsonl"
EVAL_FILE="./eval.jsonl"
OUTPUT_DIR="./qwen3-asr-finetuning-out"

torchrun --nproc_per_node=2 qwen3_asr_sft.py \
  --model_path "${MODEL_PATH}" \
  --train_file "${TRAIN_FILE}" \
  --eval_file "${EVAL_FILE}" \
  --output_dir "${OUTPUT_DIR}" \
  --batch_size 32 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 1 \
  --log_steps 10 \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 5 \
  --num_workers 2 \
  --pin_memory 1 \
  --persistent_workers 1 \
  --prefetch_factor 2