Commit a4418dc: "sensevoice"

1 parent d491773
File tree: 6 files changed, +50 / -30 lines

README.md (+31, -21)

@@ -95,43 +95,49 @@ pip install -r requirements.txt
 
 ## Inference
 
-
-### Method 2
+Supports input of audio in any format and of any duration.
 
 ```python
 from funasr import AutoModel
 from funasr.utils.postprocess_utils import rich_transcription_postprocess
 
 model_dir = "iic/SenseVoiceSmall"
-input_file = (
-    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
-)
 
-model = AutoModel(model=model_dir,
-                  vad_model="fsmn-vad",
-                  vad_kwargs={"max_single_segment_time": 30000},
-                  trust_remote_code=True, device="cuda:0")
 
+model = AutoModel(
+    model=model_dir,
+    vad_model="fsmn-vad",
+    vad_kwargs={"max_single_segment_time": 30000},
+    device="cpu",
+)
+
+# en
 res = model.generate(
-    input=input_file,
+    input=f"{model.model_path}/example/en.mp3",
     cache={},
-    language="zh",  # "zh", "en", "yue", "ja", "ko", "nospeech"
-    use_itn=False,
-    batch_size_s=0,
+    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size_s=60,
+    merge_vad=True,
+    merge_length_s=15,
 )
-
 text = rich_transcription_postprocess(res[0]["text"])
-
 print(text)
 ```
 
-The funasr version has integrated the VAD (Voice Activity Detection) model and supports audio input of any duration, with `batch_size_s` in seconds.
-If all inputs are short audios, and batch inference is needed to speed up inference efficiency, the VAD model can be removed, and `batch_size` can be set accordingly.
+Parameter descriptions:
+- `model_dir`: the name of the model, or the model's path on the local disk.
+- `max_single_segment_time`: the maximum length of an audio segment that the `vad_model` will cut, in milliseconds (ms).
+- `use_itn`: whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: dynamic batch size, measured as the total duration of audio in the batch, in seconds (s).
+- `merge_vad`: whether to merge the short audio fragments cut by the VAD model, with the merged length given by `merge_length_s`, in seconds (s).
+
+If all inputs are short audios (<30s) and batch inference is needed to speed up inference, the VAD model can be removed and `batch_size` set accordingly.
 ```python
 model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
 
 res = model.generate(
-    input=input_file,
+    input=f"{model.model_path}/example/en.mp3",
     cache={},
     language="zh",  # "zh", "en", "yue", "ja", "ko", "nospeech"
     use_itn=False,

@@ -141,23 +147,27 @@ res = model.generate(
 
 For more usage, please refer to [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md)
 
-### Method 1
+### Inference directly
+
+Supports input of audio in any format, with an input duration limit of 30 seconds or less.
 
 ```python
 from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
 
 model_dir = "iic/SenseVoiceSmall"
 m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
 
 
 res = m.inference(
     data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
-    language="zh",  # "zh", "en", "yue", "ja", "ko", "nospeech"
+    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
     use_itn=False,
     **kwargs,
 )
 
-print(res)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
 ```
 
 ### Export and Test
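Assembled from the added lines above, the updated long-audio example runs end to end as below. Everything here comes straight from this diff; only the inline comments are editorial glosses:

```python
# Post-commit README example, assembled from the "+" lines of the diff above.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",                           # VAD segments long inputs first
    vad_kwargs={"max_single_segment_time": 30000},  # cap VAD segments at 30 s (value in ms)
    device="cpu",
)

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",  # sample clip bundled with the model
    cache={},
    language="auto",    # or "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,       # punctuation + inverse text normalization
    batch_size_s=60,    # dynamic batch: total seconds of audio per batch
    merge_vad=True,     # merge short VAD fragments ...
    merge_length_s=15,  # ... up to 15 s per merged piece
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```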

README_zh.md (+11, -4)

@@ -128,18 +128,23 @@ res = model.generate(
 text = rich_transcription_postprocess(res[0]["text"])
 print(text)
 ```
+Parameter descriptions:
+- `model_dir`: the model name, or a path to the model on local disk.
+- `max_single_segment_time`: the maximum audio duration that the `vad_model` will cut into a single segment, in milliseconds (ms).
+- `use_itn`: whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: dynamic batching; the total duration of audio in a batch, in seconds (s).
+- `merge_vad`: whether to merge the short audio fragments cut by the VAD model, up to a merged length of `merge_length_s`, in seconds (s).
 
-The funasr version has already integrated the VAD model and supports audio input of any duration, with `batch_size_s` in seconds.
 If all inputs are short audios (under 30 s) and batch inference is needed to speed things up, the VAD model can be removed and `batch_size` set accordingly:
 
 ```python
 model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
 
 res = model.generate(
-    input=input_file,
+    input=f"{model.model_path}/example/en.mp3",
     cache={},
     language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
-    use_itn=False,
+    use_itn=True,
     batch_size=64,
 )
 ```

@@ -152,6 +157,7 @@
 
 ```python
 from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
 
 model_dir = "iic/SenseVoiceSmall"
 m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)

@@ -164,7 +170,8 @@ res = m.inference(
     **kwargs,
 )
 
-print(res)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
 ```
 
 ## Service Deployment

demo_funasr.py renamed to demo1.py (+1, -1)

@@ -14,7 +14,7 @@
     model=model_dir,
     vad_model="fsmn-vad",
     vad_kwargs={"max_single_segment_time": 30000},
-    device="cpu",
+    device="cuda:0",
 )
 
 # en

demo.py renamed to demo2.py (+5, -2)

@@ -4,16 +4,19 @@
 # MIT License (https://opensource.org/licenses/MIT)
 
 from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
 
 model_dir = "iic/SenseVoiceSmall"
 m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
 
 
 res = m.inference(
-    data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
+    data_in=f"{m.model_path}/example/en.mp3",
     language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
     use_itn=False,
     **kwargs,
 )
 
-print(res)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
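Both demos switch from `print(res)` to the postprocess helper because SenseVoiceSmall's raw hypothesis carries rich-transcription tags (language, emotion, event, ITN markers). A minimal sketch of the effect, using a hypothetical raw string; the exact tag set depends on the model output:

```python
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Hypothetical raw hypothesis: SenseVoice prefixes text with special tokens
# such as <|en|> (language), <|NEUTRAL|> (emotion), <|Speech|> (event), and
# <|withitn|> (inverse-text-normalized output).
raw = "<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello world."

# The helper strips or renders these tags and returns display-ready text.
print(rich_transcription_postprocess(raw))
```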

requirements.txt (+1, -1)

@@ -3,5 +3,5 @@ torchaudio
 modelscope
 huggingface
 huggingface_hub
-funasr>=1.1.1
+funasr>=1.1.2
 numpy<=1.26.4

webui.py (+1, -1)

@@ -168,7 +168,7 @@ def model_inference(input_wav, language, fs=16000):
         cache={},
         language=language,
         use_itn=True,
-        batch_size_s=0, merge_vad=merge_vad)
+        batch_size_s=60, merge_vad=merge_vad)
 
     print(text)
     text = text[0]["text"]
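The webui change mirrors the README: `batch_size_s` batches by total audio duration rather than by clip count. A hedged sketch of what that enables for multi-file input; the file names are hypothetical, and the `AutoModel` setup is the same as in the README example above:

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Same AutoModel setup as the README example in this commit.
model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cpu",
)

# funasr's generate accepts a list of inputs; with batch_size_s=60 it groups
# segments by their combined audio duration (~60 s) rather than a fixed count.
files = ["meeting_part1.wav", "meeting_part2.wav"]  # hypothetical inputs
res = model.generate(
    input=files,
    cache={},
    language="auto",
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
)
for r in res:
    print(rich_transcription_postprocess(r["text"]))
```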
