update finetune docs
@@ -11,25 +11,46 @@ pip install git+https://github.com/modelscope/FunASR
Data examples

```
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "语音转写:<|startofspeech|>!https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav<|endofspeech|>"}, {"role": "assistant", "content": "甚至出现交易几乎停滞的情况"}], "speech_length": 418, "text_length": 6}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "语音转写:<|startofspeech|>!https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav<|endofspeech|>"}, {"role": "assistant", "content": "湖北一公司以员工名义贷款数十员工负债千万"}], "speech_length": 572, "text_length": 11}
head -n1 data/train_example.jsonl | jq

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "语音转写:<|startofspeech|>!https://modelscope.cn/datasets/FunAudioLLM/funasr-demo/resolve/master/audios/IT0011W0002.wav<|endofspeech|>"
    },
    {
      "role": "assistant",
      "content": "几点了?"
    }
  ],
  "speech_length": 145,
  "text_length": 3
}
```

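To sanity-check a record like the one printed above, the user message can be split back into its prompt and audio location. The following is a minimal sketch, not part of the repository: `parse_record` and `USER_CONTENT` are hypothetical names, and the check on the role order is an assumption based on the example record.

```python
import json
import re

# Hypothetical pattern for the user message: a prompt followed by the audio
# location wrapped in <|startofspeech|>! ... <|endofspeech|>.
USER_CONTENT = re.compile(
    r"^(?P<prompt>.*?)<\|startofspeech\|>!(?P<audio>.+?)<\|endofspeech\|>$",
    re.DOTALL,
)


def parse_record(line: str) -> dict:
    """Return prompt, audio location, transcript, and lengths from one JSONL line."""
    record = json.loads(line)
    roles = [message["role"] for message in record["messages"]]
    # Assumption: every record uses the system/user/assistant order of the example.
    if roles != ["system", "user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    match = USER_CONTENT.match(record["messages"][1]["content"])
    if match is None:
        raise ValueError("user content lacks the speech markers")
    return {
        "prompt": match.group("prompt"),
        "audio": match.group("audio"),
        "transcript": record["messages"][2]["content"],
        "speech_length": record["speech_length"],
        "text_length": record["text_length"],
    }
```

Running `parse_record` over every line of `data/train_example.jsonl` is one way to catch malformed records before training starts.
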
For the full reference, see `data/train_example.jsonl`

Description:

- `messages[1]["content"]`: audio file with speech recognition prompt
- The content of `system` is fixed as `You are a helpful assistant.`
- The content of `user` includes the prompt and the path to the audio file (enclosed between `<|startofspeech|>!` and `<|endofspeech|>`).
  - The default prompts are `语音转写:` and `Speech transcription: `.
  - For a specific language, the language can be combined into the prompt, such as `语音转写成英文:` and `Transcribe speech into Chinese: `.
  - When the text annotation corresponding to the audio file contains no Arabic numerals or punctuation marks, you can use `语音转写,不进行文本规整:` and `Speech transcription without text normalization: `.
- The content of `assistant` corresponds to the text annotation of the audio file.
- `speech_length`: The number of fbank frames of the audio file (10 ms per frame).
- `text_length`: The number of tokens in the annotation text of the audio file (encoded using `Qwen/Qwen3-0.6B`).

- `messages[2]["content"]`: transcription
- `speech_length`: number of fbank frames of the audio file
- `text_length`: number of tokens of the transcription (tokenized by `Qwen3-0.6B`)

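The fields described above can be assembled into one record as follows. This is a sketch under stated assumptions, not the repository's implementation: `build_record` is a hypothetical helper, `speech_length` is approximated as the audio duration divided by 10 ms (the real fbank extractor may differ at the edges), and the token counter is passed in as a callable so the example does not depend on a particular tokenizer.

```python
from typing import Callable

FRAME_SHIFT_MS = 10  # one fbank frame per 10 ms, as stated in the description


def build_record(
    audio_path: str,
    transcript: str,
    num_samples: int,
    sample_rate: int,
    count_tokens: Callable[[str], int],
    prompt: str = "语音转写:",
) -> dict:
    """Build one training record in the format shown above (hypothetical helper)."""
    # Assumption: the frame count is approximated as duration / 10 ms.
    duration_ms = num_samples * 1000 // sample_rate
    speech_length = duration_ms // FRAME_SHIFT_MS
    user_content = f"{prompt}<|startofspeech|>!{audio_path}<|endofspeech|>"
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": transcript},
        ],
        "speech_length": speech_length,
        # count_tokens stands in for the Qwen/Qwen3-0.6B tokenizer.
        "text_length": count_tokens(transcript),
    }
```

In practice `count_tokens` would wrap the `Qwen/Qwen3-0.6B` tokenizer; whether special tokens are included in `text_length` is not stated here, so treat the count as approximate until checked against `scp2jsonl.py` output.
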
`train_text.txt`

```
BAC009S0764W0121 甚至出现交易几乎停滞的情况
BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
```
We provide a data format conversion tool, `scp2jsonl.py`, which converts common speech recognition training data formats, such as a wav scp file and a transcription file, into the ChatML format.

`train_wav.scp`

@@ -38,13 +59,18 @@ BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test
BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
```

`Command`
`train_text.txt`

```
BAC009S0764W0121 甚至出现交易几乎停滞的情况
BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
```

```
python tools/scp2jsonl.py \
  --scp-file /path/to/train_wav.scp \
  --transcript-file /path/to/train_text.txt \
  --jsonl-file data/train_example.jsonl
  ++scp_file=data/train_wav.scp \
  ++transcript_file=data/train_text.txt \
  ++jsonl_file=data/train_example.jsonl
```

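Conceptually, the conversion pairs each line of `train_wav.scp` with the line of `train_text.txt` that carries the same utterance ID. The sketch below illustrates only that join; it is not the code of `tools/scp2jsonl.py`, and `read_table` and `join_scp_and_text` are hypothetical names. It assumes each line is an ID followed by whitespace and a value.

```python
from pathlib import Path


def read_table(path: str) -> dict:
    """Parse 'utterance_id<whitespace>value' lines into a dict."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def join_scp_and_text(scp_file: str, transcript_file: str):
    """Pair audio locations with transcripts by utterance ID.

    Returns (pairs, missing): pairs holds (utt_id, audio, text) tuples in scp
    order, and missing lists the IDs that appear in only one of the two files.
    """
    wavs = read_table(scp_file)
    texts = read_table(transcript_file)
    pairs = [(utt_id, wavs[utt_id], texts[utt_id]) for utt_id in wavs if utt_id in texts]
    missing = sorted(set(wavs) ^ set(texts))
    return pairs, missing
```

Checking that `missing` is empty before running the real tool is a cheap way to spot utterances that lack either audio or a transcript.
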
## Finetune
@@ -58,3 +84,29 @@ For more detailed parameters, refer to: [SenseVoice Model Training and Testing](
```
bash finetune.sh
```

### Recommended Configuration

- For less than 1000 hours of training data, it is recommended to fine-tune the `audio_adaptor`.
- For less than 5000 hours of training data, it is recommended to fine-tune the `audio_encoder` and `audio_adaptor`.
- For more than 10000 hours of training data, it is recommended to perform full-parameter fine-tuning.

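The thresholds above can be written down as a small lookup. This is only an illustrative sketch: `recommended_trainable_modules` is a hypothetical helper, the `"all"` marker for full-parameter fine-tuning is a placeholder, and the function returns `None` for 5000 to 10000 hours because the list above gives no recommendation for that range.

```python
from typing import Optional, Tuple


def recommended_trainable_modules(hours: float) -> Optional[Tuple[str, ...]]:
    """Map training-data size in hours to the modules recommended above."""
    if hours < 1000:
        return ("audio_adaptor",)
    if hours < 5000:
        return ("audio_encoder", "audio_adaptor")
    if hours > 10000:
        return ("all",)  # placeholder for full-parameter fine-tuning
    # 5000 to 10000 hours: no recommendation is stated in the list above.
    return None
```

How these choices map onto options in `finetune.sh` is not covered here; consult the script itself for the actual parameter names.
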
## Model Evaluation

After fine-tuning is complete, you can decode with the fine-tuned model using the `decode.py` script:

```
python decode.py \
  ++model_dir=/path/to/finetuned \
  ++scp_file=data/val_wav.scp \
  ++output_file=output.txt
```

After decoding is complete, apply inverse text normalization to both the annotations and the recognition results, then compute the WER:

```
python tools/whisper_mix_normalize.py data/val_text.txt data/val_norm.txt
python tools/whisper_mix_normalize.py output.txt output_norm.txt
compute-wer data/val_norm.txt output_norm.txt cer.txt
tail -n8 cer.txt
```

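For intuition about the number reported in `cer.txt`, an error rate of this kind is the edit distance between reference and hypothesis divided by the reference length. The sketch below computes it at character level with a plain Levenshtein distance. It is not the `compute-wer` tool, whose tokenization and reporting may differ; `edit_distance` and `character_error_rate` are hypothetical names, and ignoring whitespace is an assumption.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total character edits divided by total reference characters."""
    errors = 0
    total = 0
    for ref, hyp in pairs:
        # Assumption: whitespace is ignored when comparing characters.
        ref_chars = list("".join(ref.split()))
        hyp_chars = list("".join(hyp.split()))
        errors += edit_distance(ref_chars, hyp_chars)
        total += len(ref_chars)
    if total == 0:
        raise ValueError("reference text is empty")
    return errors / total
```

Each pair is `(reference, hypothesis)` for one utterance, taken from the normalized files after matching them by utterance ID.
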