update finetune docs

2026-01-07 17:16:01 +08:00
parent 4506e00e9c
commit e98de51696
2 changed files with 131 additions and 31 deletions
--- a/docs/finetune.md
+++ b/docs/finetune.md
@ -11,25 +11,46 @@ pip install git+https://github.com/modelscope/FunASR
 Data examples

 ```
-{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "语音转写：<|startofspeech|>!https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav<|endofspeech|>"}, {"role": "assistant", "content": "甚至出现交易几乎停滞的情况"}], "speech_length": 418, "text_length": 6}
-{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "语音转写：<|startofspeech|>!https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav<|endofspeech|>"}, {"role": "assistant", "content": "湖北一公司以员工名义贷款数十员工负债千万"}], "speech_length": 572, "text_length": 11}
+head -n1 data/train_example.jsonl | jq
+
+{
+  "messages": [
+    {
+      "role": "system",
+      "content": "You are a helpful assistant."
+    },
+    {
+      "role": "user",
+      "content": "语音转写：<|startofspeech|>!https://modelscope.cn/datasets/FunAudioLLM/funasr-demo/resolve/master/audios/IT0011W0002.wav<|endofspeech|>"
+    },
+    {
+      "role": "assistant",
+      "content": "几点了？"
+    }
+  ],
+  "speech_length": 145,
+  "text_length": 3
+}
 ```

 For the full file, see `data/train_example.jsonl`

 Description：

- `messages[1]["content"]`: audio file with speech recognition prompt
+- The `system` content is fixed as `You are a helpful assistant.`
+- The `user` content includes the prompt and the path to the audio file (enclosed between `<|startofspeech|>!` and `<|endofspeech|>`).
+  - The default prompts are `语音转写：` and `Speech transcription: `.
+  - To target a specific language, combine the language with the prompt, such as `语音转写成英文：` and `Transcribe speech into Chinese: `.
+  - When the text annotation corresponding to the audio file contains no Arabic numerals or punctuation marks, you can use `语音转写，不进行文本规整：` and `Speech transcription without text normalization: `.
+- The `assistant` content is the text annotation of the audio file.
+- `speech_length`: the number of fbank frames of the audio file (10 ms per frame).
+- `text_length`: the number of tokens in the annotation text of the audio file (encoded using `Qwen/Qwen3-0.6B`).
+
 - `messages[2]["content"]`: transcription
 - `speech_length`: number of fbank frames of the audio file
 - `text_length`: number of tokens of the transcription (tokenized by `Qwen3-0.6B`)
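
As a rough illustration of the `speech_length` field described above, the sketch below estimates a frame count from a sample count. The docs state only the 10 ms frame shift; the 16 kHz sample rate, the 25 ms window, and the no-padding framing are assumptions, and `estimate_fbank_frames` is a hypothetical helper, not part of FunASR.

```python
def estimate_fbank_frames(
    num_samples: int,
    sample_rate: int = 16000,       # assumed sample rate
    frame_length_ms: float = 25.0,  # assumed analysis window
    frame_shift_ms: float = 10.0,   # 10 ms per frame, as stated in the docs
) -> int:
    """Estimate the fbank frame count for a waveform without edge padding."""
    win = int(sample_rate * frame_length_ms / 1000)
    hop = int(sample_rate * frame_shift_ms / 1000)
    if num_samples < win:
        # Too short to fill a single analysis window.
        return 0
    return 1 + (num_samples - win) // hop


# One second of 16 kHz audio under the assumptions above.
frames = estimate_fbank_frames(16000)
```

Under these assumptions, the example value `"speech_length": 145` corresponds to roughly 1.5 seconds of audio. The frontend actually used for training may frame the signal differently, so treat the result as an estimate.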

-`train_text.txt`
-
-```
-BAC009S0764W0121 甚至出现交易几乎停滞的情况
-BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
-```
+We provide a data format conversion tool `scp2jsonl.py`, which can convert common speech recognition training data formats such as wav scp and transcription into the ChatML format.

 `train_wav.scp`

@ -38,13 +59,18 @@ BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test
 BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
 ```

-`Command`
+`train_text.txt`
+
+```
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+```

 ```
 python tools/scp2jsonl.py \
-  --scp-file /path/to/train_wav.scp \
-  --transcript-file /path/to/train_text.txt \
-  --jsonl-file data/train_example.jsonl
+  ++scp_file=data/train_wav.scp \
+  ++transcript_file=data/train_text.txt \
+  ++jsonl_file=data/train_example.jsonl
 ```
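
To make the conversion concrete, here is a minimal sketch of how one `train_wav.scp` entry and one `train_text.txt` entry could be joined into a ChatML record. It is not the implementation of `tools/scp2jsonl.py`: `read_kv` and `build_record` are hypothetical helpers, and the two length fields are passed in by hand (using the values from the earlier example) rather than computed from the audio and tokenizer.

```python
import json


def read_kv(lines):
    """Parse 'ID value' lines into a dict, splitting on the first whitespace run."""
    table = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table


def build_record(wav_path, text, speech_length, text_length, prompt="语音转写："):
    """Assemble one ChatML-style training record."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"{prompt}<|startofspeech|>!{wav_path}<|endofspeech|>",
            },
            {"role": "assistant", "content": text},
        ],
        "speech_length": speech_length,
        "text_length": text_length,
    }


wavs = read_kv([
    "BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav",
])
texts = read_kv(["BAC009S0764W0121 甚至出现交易几乎停滞的情况"])

utt_id = "BAC009S0764W0121"
# Lengths are supplied manually here; the real tool derives them.
record = build_record(wavs[utt_id], texts[utt_id], speech_length=418, text_length=6)
line = json.dumps(record, ensure_ascii=False)
```

Each record is written as one JSON object per line, which is why the example above inspects the file with `head -n1 ... | jq`.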

 ## Finetune
@ -58,3 +84,29 @@ For more detailed parameters, refer to: [SenseVoice Model Training and Testing](
 ```
 bash finetune.sh
 ```
+
+### Recommended Configuration
+
+- For training data less than 1000 hours, it is recommended to fine-tune the audio_adaptor.
+- For training data less than 5000 hours, it is recommended to fine-tune the audio_encoder and audio_adaptor.
+- For training data greater than 10000 hours, it is recommended to perform full-parameter fine-tuning.
+
+## Model Evaluation
+
+After model fine-tuning is completed, you can decode the model using the decode.py script:
+
+```
+python decode.py \
+  ++model_dir=/path/to/finetuned \
+  ++scp_file=data/val_wav.scp \
+  ++output_file=output.txt
+```
+
+After decoding is completed, text inverse normalization needs to be applied to the annotations and recognition results, and then the WER should be calculated:
+
+```
+python tools/whisper_mix_normalize.py data/val_text.txt data/val_norm.txt
+python tools/whisper_mix_normalize.py output.txt output_norm.txt
+compute-wer data/val_norm.txt output_norm.txt cer.txt
+tail -n8 cer.txt
+```
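
For readers unfamiliar with the metric, the sketch below shows how a character error rate can be computed from a reference and a hypothesis using Levenshtein distance. It illustrates the idea only and is not how `compute-wer` is implemented; the function names are hypothetical, and stripping spaces before scoring is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted character in a 13-character reference.
score = cer("甚至出现交易几乎停滞的情况", "甚至出现交易几乎停止的情况")
```

Normalizing both the annotations and the recognition results first, as in the commands above, keeps formatting differences such as digits and punctuation from being counted as errors.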
--- a/docs/fintune_zh.md
+++ b/docs/fintune_zh.md
@ -11,27 +11,42 @@ pip install git+https://github.com/modelscope/FunASR
 The data must include the following fields:

 ```
-{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "语音转写：<|startofspeech|>!https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav<|endofspeech|>"}, {"role": "assistant", "content": "甚至出现交易几乎停滞的情况"}], "speech_length": 418, "text_length": 6}
-{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "语音转写：<|startofspeech|>!https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav<|endofspeech|>"}, {"role": "assistant", "content": "湖北一公司以员工名义贷款数十员工负债千万"}], "speech_length": 572, "text_length": 11}
+head -n1 data/train_example.jsonl | jq
+
+{
+  "messages": [
+    {
+      "role": "system",
+      "content": "You are a helpful assistant."
+    },
+    {
+      "role": "user",
+      "content": "语音转写：<|startofspeech|>!https://modelscope.cn/datasets/FunAudioLLM/funasr-demo/resolve/master/audios/IT0011W0002.wav<|endofspeech|>"
+    },
+    {
+      "role": "assistant",
+      "content": "几点了？"
+    }
+  ],
+  "speech_length": 145,
+  "text_length": 3
+}
 ```

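 For details, see `data/train_example.jsonl`

 Data preparation details:

- `messages[1]["content"]`: path to the audio file plus the speech recognition prompt
- `messages[2]["content"]`: annotation text of the audio file
- `speech_length`: number of fbank frames of the audio file
- `text_length`: number of tokens in the annotation text of the audio file (encoded with `Qwen3-0.6B`)
+- The `system` content is fixed as `You are a helpful assistant.`
+- The `user` content includes the prompt and the path to the audio file (enclosed between `<|startofspeech|>!` and `<|endofspeech|>`)
+  - The default prompts are `语音转写：` and `Speech transcription: `
+  - To target a specific language, combine the language with the prompt, such as `语音转写成英文：` and `Transcribe speech into Chinese: `
+  - When the text annotation corresponding to the audio file contains no Arabic numerals or punctuation marks, you can use `语音转写，不进行文本规整：` and `Speech transcription without text normalization: `
+- The `assistant` content is the text annotation of the audio file
+- `speech_length`: the number of fbank frames of the audio file (10 ms per frame)
+- `text_length`: the number of tokens in the annotation text of the audio file (encoded using `Qwen/Qwen3-0.6B`)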

-`train_text.txt`
-
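-The left column is the unique data ID, which must match the IDs in `train_wav.scp` one to one. The right column is the annotation text of the audio file. The format is as follows: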
-
-```
-BAC009S0764W0121 甚至出现交易几乎停滞的情况
-BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
-```
+We provide a data format conversion tool `scp2jsonl.py`, which can convert common speech recognition training data formats such as wav scp and transcription into the ChatML format.

 `train_wav.scp`

@ -42,13 +57,20 @@ BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test
 BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
 ```

-`Generation command`
+`train_text.txt`
+
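+The left column is the unique data ID, which must match the IDs in `train_wav.scp` one to one. The right column is the annotation text of the audio file. The format is as follows: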
+
+```
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+```

 ```
 python tools/scp2jsonl.py \
-  --scp-file /path/to/train_wav.scp \
-  --transcript-file /path/to/train_text.txt \
-  --jsonl-file data/train_example.jsonl
+  ++scp_file=data/train_wav.scp \
+  ++transcript_file=data/train_text.txt \
+  ++jsonl_file=data/train_example.jsonl
 ```

 ## Start Training
@ -62,3 +84,29 @@ python tools/scp2jsonl.py \
 ```
 bash finetune.sh
 ```
+
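+### Recommended Configuration
+
+- For training data less than 1000 hours, it is recommended to fine-tune the audio_adaptor
+- For training data less than 5000 hours, it is recommended to fine-tune the audio_encoder and audio_adaptor
+- For training data greater than 10000 hours, it is recommended to perform full-parameter fine-tuning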
+
+## Model Evaluation
+
+After fine-tuning is complete, you can run decoding with the decode.py script:
+
+```
+python decode.py \
+  ++model_dir=/path/to/finetuned \
+  ++scp_file=data/val_wav.scp \
+  ++output_file=output.txt
+```
+
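+After decoding is complete, apply inverse text normalization to both the annotations and the recognition results, then compute the WER: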
+
+```
+python tools/whisper_mix_normalize.py data/val_text.txt data/val_norm.txt
+python tools/whisper_mix_normalize.py output.txt output_norm.txt
+compute-wer data/val_norm.txt output_norm.txt cer.txt
+tail -n8 cer.txt
+```