youjian committed
Commit bb41dfa · 0 Parent(s)
Files changed (6)
  1. .gitattributes +35 -0
  2. .gitignore +6 -0
  3. README.md +105 -0
  4. app.py +594 -0
  5. packages.txt +4 -0
  6. requirements.txt +13 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,6 @@
+ local_run_sensevoice.py
+ SenseVoiceSmall/
+ *.pt
+ *.onnx
+ *.bin
+ .venv/
README.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ title: SenseVoice Audio Transcription
+ emoji: 🎙️
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 4.36.0
+ app_file: app.py
+ pinned: false
+ ---
+
+ # Multilingual Audio Transcription with SenseVoice
+
+ This application transcribes audio with the SenseVoice Small model, providing accurate multilingual transcription for Chinese, English, Japanese, Korean, and Cantonese.
+
+ ## Features
+
+ - **Multilingual Support**: Chinese (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)
+ - **Multiple Audio Sources**:
+   - Uploaded audio files
+   - Direct URLs to audio files (no YouTube support due to cookie requirements)
+ - **Model Options**:
+   - Local SenseVoice model
+   - Hugging Face model: `FunAudioLLM/SenseVoiceSmall`
+ - **Advanced Features**:
+   - Audio trimming with start/end time
+   - Proxy support for downloads
+   - Verbose logging output
+   - Automatic inverse text normalization (ITN)
+
+ ## Model Setup
+
+ ### For Hugging Face Spaces Deployment
+ The app is configured to work with:
+ 1. **Local Model**: `"SenseVoiceSmall"` - model files in the same directory
+ 2. **HF Model**: `"FunAudioLLM/SenseVoiceSmall"` - auto-downloaded from Hugging Face
+
+ ### For Local Development
+ - Update `MODEL_PATH_LIST` in `app.py` to use your custom models
+ - Supports local paths and Hugging Face repository names
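
The download-if-missing logic that backs this setup (see the top of `app.py`) can be sketched as a small helper. The `downloader` callable is injected here purely for illustration; in the app itself it is `huggingface_hub.snapshot_download`:

```python
import os

def ensure_model(local_dir, repo_id, downloader):
    """Call downloader(repo_id=..., local_dir=...) only when local_dir is absent."""
    if os.path.exists(local_dir):
        return False  # model already present locally; skip the download
    downloader(repo_id=repo_id, local_dir=local_dir)
    return True
```

This keeps repeated Space restarts cheap: the snapshot is fetched once and reused from disk afterwards.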
+
+ ## How to Use
+
+ 1. **Upload Audio**: Click "Upload or Record Audio" to select your audio file
+ 2. **Select Model**: Choose from available models in the dropdown
+ 3. **Configure Options**:
+    - Set start/end time for audio trimming
+    - Enable verbose output for debugging
+ 4. **Transcribe**: Click "Transcribe" to start the process
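
The start/end trimming in step 3 clamps the requested window to the clip and rejects empty ranges. A minimal pure-Python sketch of that boundary logic (mirroring `trim_audio` in `app.py`, which slices with pydub in milliseconds):

```python
def clamp_trim_window(duration_s, start_s=None, end_s=None):
    """Clamp a requested trim window to [0, duration_s] and return ms slice indices."""
    start = max(0.0, start_s) if start_s is not None else 0.0
    end = min(duration_s, end_s) if end_s is not None else duration_s
    if start >= end:
        raise ValueError("End time must be greater than start time.")
    # pydub's AudioSegment is indexed in milliseconds
    return int(start * 1000), int(end * 1000)
```

For example, requesting an end time past the clip length on a 10-second file simply trims to the end of the clip.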
+
+ ## Git LFS Setup for Large Models
+
+ Since this project uses large model files, Git LFS is recommended:
+
+ ```bash
+ # Initialize Git LFS
+ git lfs install
+
+ # Track large model files
+ git lfs track "*.bin"
+ git lfs track "*.safetensors"
+
+ # Add and commit
+ git add .gitattributes
+ git add .
+ git commit -m "Add SenseVoice model with LFS tracking"
+ ```
+
+ ## Deployment Notes
+
+ ### Hugging Face Spaces
+ - Use `git push huggingface main` to deploy
+ - Models are automatically cached during runtime
+ - First load may be slower due to model download
+
+ ### Model Repository Structure
+ ```
+ your-repo/
+ ├── app.py
+ ├── README.md
+ ├── requirements.txt
+ └── SenseVoiceSmall/          # Model directory
+     ├── config.json
+     ├── model.bin
+     └── other model files...
+ ```
+
+ ## Output
+
+ The application provides:
+ - **Transcription Text**: Full processed transcription with ITN
+ - **Metrics**: Processing time and file information
+ - **Download**: Text file with transcription results
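
The download file is produced by writing the transcription to a temporary UTF-8 text file, as `save_transcription` in `app.py` does:

```python
import tempfile

def save_transcription(transcription):
    """Write the transcription to a temp .txt file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix='.txt',
                                     mode='w', encoding='utf-8') as temp_file:
        temp_file.write(transcription)
        # delete=False keeps the file alive so Gradio can serve it for download
        return temp_file.name
```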
+
+ ## Supported Languages
+
+ - 🇨🇳 Chinese (Mandarin)
+ - 🇺🇸 English
+ - 🇯🇵 Japanese
+ - 🇰🇷 Korean
+ - 🇭🇰 Cantonese
+
+ ## Feedback and Contributions
+
+ Feedback and contributions to improve this multilingual transcription tool are welcome.
app.py ADDED
@@ -0,0 +1,594 @@
+ import os
+ from huggingface_hub import snapshot_download
+
+
+ # 1. Define the local path and the remote repository ID
+ LOCAL_MODEL_DIR = "SenseVoiceSmall"
+ REPO_ID = "FunAudioLLM/SenseVoiceSmall"
+ REPO_TYPE = "hf"  # "hf" for Hugging Face, "modelscope" for ModelScope
+ # 2. Download the model only if it is not already present locally
+ if not os.path.exists(LOCAL_MODEL_DIR):
+     print(f"Downloading model {REPO_ID} to {LOCAL_MODEL_DIR} ...")
+     snapshot_download(
+         repo_id=REPO_ID,
+         local_dir=LOCAL_MODEL_DIR,
+         ignore_patterns=["*.onnx"],  # skip ONNX files to save time and space
+     )
+     print("Model download complete!")
+ else:
+     print("Local model files detected; skipping download.")
+
+
+ import gradio as gr
+ import time
+ import sys
+ import io
+ import tempfile
+ import subprocess
+ import requests
+ from urllib.parse import urlparse
+ from pydub import AudioSegment
+ import logging
+ import torch
+ import importlib
+ from funasr import AutoModel
+ from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+ # Model configurations for Hugging Face deployment
+ MODEL_PATH_LIST = [
+     "SenseVoiceSmall",              # local path bundled with this HF Space
+     "FunAudioLLM/SenseVoiceSmall",  # Hugging Face model repo
+     "iic/SenseVoiceSmall",          # ModelScope model repo
+ ]
+
+ class LogCapture(io.StringIO):
+     def __init__(self, callback):
+         super().__init__()
+         self.callback = callback
+
+     def write(self, s):
+         super().write(s)
+         self.callback(s)
+
+ # Set up logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+
+ # Check for CUDA availability
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ logging.info(f"Using device: {device}")
+
+ def download_audio(url, method_choice, proxy_url, proxy_username, proxy_password):
+     """
+     Downloads audio from a given URL using the specified method and proxy settings.
+
+     Args:
+         url (str): The URL of the audio.
+         method_choice (str): The method to use for downloading audio.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+
+     Returns:
+         tuple: (path to the downloaded audio file, is_temp_file), or (None, False) if failed.
+     """
+     parsed_url = urlparse(url)
+     logging.info(f"Downloading audio from URL: {url} using method: {method_choice}")
+     try:
+         if 'youtube.com' in parsed_url.netloc or 'youtu.be' in parsed_url.netloc:
+             error_msg = "YouTube download is not supported. Please use direct audio URLs instead."
+             logging.error(error_msg)
+             return None, False
+         elif parsed_url.scheme == 'rtsp':
+             audio_file = download_rtsp_audio(url, proxy_url)
+             if not audio_file:
+                 error_msg = f"Failed to download RTSP audio from {url}"
+                 logging.error(error_msg)
+                 return None, False
+         else:
+             audio_file = download_direct_audio(url, method_choice, proxy_url, proxy_username, proxy_password)
+             if not audio_file:
+                 error_msg = f"Failed to download audio from {url} using method {method_choice}"
+                 logging.error(error_msg)
+                 return None, False
+         return audio_file, True
+     except Exception as e:
+         error_msg = f"Error downloading audio from {url} using method {method_choice}: {str(e)}"
+         logging.error(error_msg)
+         return None, False
+
+ def download_rtsp_audio(url, proxy_url):
+     """
+     Downloads audio from an RTSP URL using FFmpeg.
+
+     Args:
+         url (str): The RTSP URL.
+         proxy_url (str): Proxy URL if needed.
+
+     Returns:
+         str: Path to the downloaded audio file, or None if failed.
+     """
+     logging.info("Using FFmpeg to download RTSP stream")
+     output_file = tempfile.mktemp(suffix='.mp3')
+     command = ['ffmpeg', '-i', url, '-acodec', 'libmp3lame', '-ab', '192k', '-y', output_file]
+     env = os.environ.copy()
+     if proxy_url and len(proxy_url.strip()) > 0:
+         env['http_proxy'] = proxy_url
+         env['https_proxy'] = proxy_url
+     try:
+         subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
+         logging.info(f"Downloaded RTSP audio to: {output_file}")
+         return output_file
+     except subprocess.CalledProcessError as e:
+         logging.error(f"FFmpeg error: {e.stderr.decode()}")
+         return None
+     except Exception as e:
+         logging.error(f"Error downloading RTSP audio: {str(e)}")
+         return None
+
+ def download_direct_audio(url, method_choice, proxy_url, proxy_username, proxy_password):
+     """
+     Downloads audio from a direct URL using the specified method.
+
+     Args:
+         url (str): The direct URL of the audio file.
+         method_choice (str): The method to use for downloading.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+
+     Returns:
+         str: Path to the downloaded audio file, or None if failed.
+     """
+     logging.info(f"Downloading direct audio from: {url} using method: {method_choice}")
+     methods = {
+         'wget': wget_method,
+         'requests': requests_method,
+         'ffmpeg': ffmpeg_method,
+         'aria2': aria2_method,
+     }
+     method = methods.get(method_choice, requests_method)
+     try:
+         audio_file = method(url, proxy_url, proxy_username, proxy_password)
+         if not audio_file or not os.path.exists(audio_file):
+             error_msg = f"Failed to download direct audio from {url} using method {method_choice}"
+             logging.error(error_msg)
+             return None
+         return audio_file
+     except Exception as e:
+         logging.error(f"Error downloading direct audio with {method_choice}: {str(e)}")
+         return None
+
+ def requests_method(url, proxy_url, proxy_username, proxy_password):
+     """
+     Downloads audio using the requests library.
+
+     Args:
+         url (str): The URL of the audio file.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+
+     Returns:
+         str: Path to the downloaded audio file, or None if failed.
+     """
+     try:
+         proxies = None
+         auth = None
+         if proxy_url and len(proxy_url.strip()) > 0:
+             proxies = {
+                 "http": proxy_url,
+                 "https": proxy_url
+             }
+         if proxy_username and proxy_password:
+             auth = (proxy_username, proxy_password)
+         # timeout added so an unresponsive host cannot hang the app indefinitely
+         response = requests.get(url, stream=True, proxies=proxies, auth=auth, timeout=60)
+         if response.status_code == 200:
+             with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as temp_file:
+                 for chunk in response.iter_content(chunk_size=8192):
+                     if chunk:
+                         temp_file.write(chunk)
+             logging.info(f"Downloaded direct audio to: {temp_file.name}")
+             return temp_file.name
+         else:
+             logging.error(f"Failed to download audio from {url} with status code {response.status_code}")
+             return None
+     except Exception as e:
+         logging.error(f"Error in requests_method: {str(e)}")
+         return None
+
+ def wget_method(url, proxy_url, proxy_username, proxy_password):
+     """
+     Downloads audio using the wget command-line tool.
+
+     Args:
+         url (str): The URL of the audio file.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+
+     Returns:
+         str: Path to the downloaded audio file, or None if failed.
+     """
+     logging.info("Using wget method")
+     output_file = tempfile.mktemp(suffix='.mp3')
+     command = ['wget', '-O', output_file, url]
+     env = os.environ.copy()
+     if proxy_url and len(proxy_url.strip()) > 0:
+         env['http_proxy'] = proxy_url
+         env['https_proxy'] = proxy_url
+     try:
+         subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
+         logging.info(f"Downloaded audio to: {output_file}")
+         return output_file
+     except subprocess.CalledProcessError as e:
+         logging.error(f"Wget error: {e.stderr.decode()}")
+         return None
+     except Exception as e:
+         logging.error(f"Error in wget_method: {str(e)}")
+         return None
+
+ def ffmpeg_method(url, proxy_url, proxy_username, proxy_password):
+     """
+     Downloads audio using FFmpeg.
+
+     Args:
+         url (str): The URL of the audio file.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+
+     Returns:
+         str: Path to the downloaded audio file, or None if failed.
+     """
+     logging.info("Using ffmpeg method")
+     output_file = tempfile.mktemp(suffix='.mp3')
+     command = ['ffmpeg', '-i', url, '-vn', '-acodec', 'libmp3lame', '-q:a', '2', output_file]
+     env = os.environ.copy()
+     if proxy_url and len(proxy_url.strip()) > 0:
+         env['http_proxy'] = proxy_url
+         env['https_proxy'] = proxy_url
+     try:
+         subprocess.run(command, check=True, capture_output=True, text=True, env=env)
+         logging.info(f"Downloaded and converted audio to: {output_file}")
+         return output_file
+     except subprocess.CalledProcessError as e:
+         logging.error(f"FFmpeg error: {e.stderr}")
+         return None
+     except Exception as e:
+         logging.error(f"Error in ffmpeg_method: {str(e)}")
+         return None
+
+ def aria2_method(url, proxy_url, proxy_username, proxy_password):
+     """
+     Downloads audio using aria2.
+
+     Args:
+         url (str): The URL of the audio file.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+
+     Returns:
+         str: Path to the downloaded audio file, or None if failed.
+     """
+     logging.info("Using aria2 method")
+     output_file = tempfile.mktemp(suffix='.mp3')
+     # aria2c's --out is interpreted relative to --dir, so split the temp path
+     command = ['aria2c', '--split=4', '--max-connection-per-server=4',
+                '--dir', os.path.dirname(output_file),
+                '--out', os.path.basename(output_file), url]
+     if proxy_url and len(proxy_url.strip()) > 0:
+         command.extend(['--all-proxy', proxy_url])
+     try:
+         subprocess.run(command, check=True, capture_output=True, text=True)
+         logging.info(f"Downloaded audio to: {output_file}")
+         return output_file
+     except subprocess.CalledProcessError as e:
+         logging.error(f"Aria2 error: {e.stderr}")
+         return None
+     except Exception as e:
+         logging.error(f"Error in aria2_method: {str(e)}")
+         return None
+
+ def trim_audio(audio_path, start_time, end_time):
+     """
+     Trims an audio file to the specified start and end times.
+
+     Args:
+         audio_path (str): Path to the audio file.
+         start_time (float): Start time in seconds.
+         end_time (float): End time in seconds.
+
+     Returns:
+         str: Path to the trimmed audio file.
+
+     Raises:
+         gr.Error: If invalid start or end times are provided.
+     """
+     try:
+         logging.info(f"Trimming audio from {start_time} to {end_time}")
+         audio = AudioSegment.from_file(audio_path)
+         audio_duration = len(audio) / 1000  # Duration in seconds
+
+         # Default start and end times if None
+         start_time = max(0, start_time) if start_time is not None else 0
+         end_time = min(audio_duration, end_time) if end_time is not None else audio_duration
+
+         # Validate times
+         if start_time >= end_time:
+             raise gr.Error("End time must be greater than start time.")
+
+         trimmed_audio = audio[int(start_time * 1000):int(end_time * 1000)]
+         with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio_file:
+             trimmed_audio.export(temp_audio_file.name, format="wav")
+         logging.info(f"Trimmed audio saved to: {temp_audio_file.name}")
+         return temp_audio_file.name
+     except Exception as e:
+         logging.error(f"Error trimming audio: {str(e)}")
+         raise gr.Error(f"Error trimming audio: {str(e)}")
+
+ def save_transcription(transcription):
+     """
+     Saves the transcription text to a temporary file.
+
+     Args:
+         transcription (str): The transcription text.
+
+     Returns:
+         str: The path to the transcription file.
+     """
+     with tempfile.NamedTemporaryFile(delete=False, suffix='.txt', mode='w', encoding='utf-8') as temp_file:
+         temp_file.write(transcription)
+     logging.info(f"Transcription saved to: {temp_file.name}")
+     return temp_file.name
+
+ def get_model_options(pipeline_type):
+     """
+     Returns a list of model IDs based on the selected pipeline type.
+
+     Args:
+         pipeline_type (str): The type of pipeline.
+
+     Returns:
+         list: A list of model IDs.
+     """
+     if pipeline_type == "sensevoice":
+         return MODEL_PATH_LIST
+     else:
+         return []
+
+ # Dictionary to store loaded models
+ loaded_models = {}
+
+ def transcribe_audio(audio_input, audio_url, proxy_url, proxy_username, proxy_password, pipeline_type, model_id, download_method, start_time=None, end_time=None, verbose=False):
+     """
+     Transcribes audio from a given source using SenseVoice.
+
+     Args:
+         audio_input (str): Path to uploaded audio file or recorded audio.
+         audio_url (str): URL of audio.
+         proxy_url (str): Proxy URL if needed.
+         proxy_username (str): Proxy username.
+         proxy_password (str): Proxy password.
+         pipeline_type (str): Type of pipeline to use ('sensevoice').
+         model_id (str): The ID of the model to use.
+         download_method (str): Method to use for downloading audio.
+         start_time (float, optional): Start time in seconds for trimming audio.
+         end_time (float, optional): End time in seconds for trimming audio.
+         verbose (bool, optional): Whether to output verbose logging.
+
+     Yields:
+         Tuple[str, str, str or None]: Metrics and messages, transcription text, path to transcription file.
+     """
+     # Initialize up front so the error paths and the finally block are always safe
+     audio_path = None
+     is_temp_file = False
+     verbose_messages = ""
+     try:
+         if verbose:
+             logging.getLogger().setLevel(logging.INFO)
+         else:
+             logging.getLogger().setLevel(logging.WARNING)
+
+         logging.info(f"Transcription parameters: pipeline_type={pipeline_type}, model_id={model_id}, download_method={download_method}")
+         verbose_messages = f"Starting transcription with parameters:\nPipeline Type: {pipeline_type}\nModel ID: {model_id}\nDownload Method: {download_method}\n"
+
+         if verbose:
+             yield verbose_messages, "", None
+
+         # Determine the audio source
+         if audio_input is not None and len(audio_input) > 0:
+             # audio_input is a filepath to uploaded or recorded audio
+             audio_path = audio_input
+             is_temp_file = False
+         elif audio_url is not None and len(audio_url.strip()) > 0:
+             # audio_url is provided
+             audio_path, is_temp_file = download_audio(audio_url, download_method, proxy_url, proxy_username, proxy_password)
+             if not audio_path:
+                 error_msg = f"Error downloading audio from {audio_url} using method {download_method}. Check logs for details."
+                 logging.error(error_msg)
+                 yield verbose_messages + error_msg, "", None
+                 return
+         else:
+             error_msg = "No audio source provided. Please upload an audio file, record audio, or enter a URL."
+             logging.error(error_msg)
+             yield verbose_messages + error_msg, "", None
+             return
+
+         # Convert start_time and end_time to float or None
+         start_time = float(start_time) if start_time else None
+         end_time = float(end_time) if end_time else None
+
+         if start_time is not None or end_time is not None:
+             audio_path = trim_audio(audio_path, start_time, end_time)
+             is_temp_file = True  # The trimmed audio is a temporary file
+             verbose_messages += f"Audio trimmed from {start_time} to {end_time}\n"
+             if verbose:
+                 yield verbose_messages, "", None
+
+         # Model caching
+         model_key = (pipeline_type, model_id)
+         if model_key in loaded_models:
+             model = loaded_models[model_key]
+             logging.info("Loaded model from cache")
+         else:
+             if pipeline_type == "sensevoice":
+                 model = AutoModel(
+                     model=model_id,
+                     trust_remote_code=False,
+                     vad_model="fsmn-vad",
+                     vad_kwargs={"max_single_segment_time": 30000},
+                     device=device,
+                     disable_update=True,
+                     hub=REPO_TYPE,
+                 )
+             else:
+                 error_msg = "Invalid pipeline type. Only 'sensevoice' is supported."
+                 logging.error(error_msg)
+                 yield verbose_messages + error_msg, "", None
+                 return
+             loaded_models[model_key] = model
+
+         # Perform the transcription
+         start_time_perf = time.time()
+
+         res = model.generate(
+             input=audio_path,
+             cache={},
+             language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
+             use_itn=True,
+             batch_size_s=60,
+             merge_vad=True,
+             merge_length_s=15,
+         )
+
+         transcription = rich_transcription_postprocess(res[0]["text"])
+         end_time_perf = time.time()
+
+         # Calculate metrics
+         transcription_time = end_time_perf - start_time_perf
+         audio_file_size = os.path.getsize(audio_path) / (1024 * 1024)
+
+         metrics_output = (
+             f"Transcription time: {transcription_time:.2f} seconds\n"
+             f"Audio file size: {audio_file_size:.2f} MB\n"
+         )
+
+         if verbose:
+             yield verbose_messages + metrics_output, transcription, None
+
+         # Save the transcription to a file
+         transcription_file = save_transcription(transcription)
+         yield verbose_messages + metrics_output, transcription, transcription_file
+
+     except Exception as e:
+         error_msg = f"An error occurred during transcription: {str(e)}"
+         logging.error(error_msg)
+         yield verbose_messages + error_msg, "", None
+
+     finally:
+         # Clean up temporary audio files
+         if audio_path and is_temp_file and os.path.exists(audio_path):
+             os.remove(audio_path)
+
+
+ with gr.Blocks() as iface:
+     gr.Markdown("# Audio Transcription")
+     gr.Markdown("Transcribe audio using the SenseVoice model with multilingual support.")
+
+     with gr.Row():
+         audio_input = gr.Audio(label="Upload or Record Audio", sources=["upload", "microphone"], type="filepath")
+         audio_url = gr.Textbox(label="Or Enter URL of audio file (direct link only, no YouTube)")
+
+     transcribe_button = gr.Button("Transcribe")
+
+     with gr.Accordion("Advanced Options", open=False):
+         with gr.Row():
+             proxy_url = gr.Textbox(label="Proxy URL", placeholder="Enter proxy URL if needed", value="", lines=1)
+             proxy_username = gr.Textbox(label="Proxy Username", placeholder="Proxy username (optional)", value="", lines=1)
+             proxy_password = gr.Textbox(label="Proxy Password", placeholder="Proxy password (optional)", value="", lines=1, type="password")
+
+     with gr.Row():
+         pipeline_type = gr.Dropdown(
+             choices=["sensevoice"],
+             label="Pipeline Type",
+             value="sensevoice"
+         )
+         model_id = gr.Dropdown(
+             label="Model",
+             choices=get_model_options("sensevoice"),
+             value=MODEL_PATH_LIST[0]  # Default to the bundled local model
+         )
+     with gr.Row():
+         download_method = gr.Dropdown(
+             choices=["requests", "ffmpeg", "aria2", "wget"],
+             label="Download Method",
+             value="requests"
+         )
+
+     with gr.Row():
+         start_time = gr.Number(label="Start Time (seconds)", value=None, minimum=0)
+         end_time = gr.Number(label="End Time (seconds)", value=None, minimum=0)
+         verbose = gr.Checkbox(label="Verbose Output", value=False)
+
+     with gr.Row():
+         metrics_output = gr.Textbox(label="Transcription Metrics and Verbose Messages", lines=10)
+         transcription_output = gr.Textbox(label="Transcription", lines=10)
+         transcription_file = gr.File(label="Download Transcription")
+
+     def update_model_dropdown(pipeline_type):
+         """
+         Updates the model dropdown choices based on the selected pipeline type.
+
+         Args:
+             pipeline_type (str): The selected pipeline type.
+
+         Returns:
+             gr.update: Updated model dropdown component.
+         """
+         try:
+             model_choices = get_model_options(pipeline_type)
+             logging.info(f"Model choices for {pipeline_type}: {model_choices}")
+             if model_choices:
+                 return gr.update(choices=model_choices, value=model_choices[0], visible=True)
+             else:
+                 return gr.update(choices=["No models available"], value=None, visible=False)
+         except Exception as e:
+             logging.error(f"Error in update_model_dropdown: {str(e)}")
+             return gr.update(choices=["Error"], value="Error", visible=True)
+
+     # Event handler for pipeline_type change
+     pipeline_type.change(update_model_dropdown, inputs=[pipeline_type], outputs=[model_id])
+
+     def transcribe_with_progress(*args):
+         # audio_input is the first argument
+         for result in transcribe_audio(*args):
+             yield result
+
+     transcribe_button.click(
+         transcribe_with_progress,
+         inputs=[audio_input, audio_url, proxy_url, proxy_username, proxy_password, pipeline_type, model_id, download_method, start_time, end_time, verbose],
+         outputs=[metrics_output, transcription_output, transcription_file]
+     )
+
+     # Note: for examples, users should use local audio files or upload their own;
+     # examples with specific paths may not work for all users.
+
+     gr.Markdown(f"""
+ ### Usage Examples:
+ 1. **Upload Audio**: Click the "Upload or Record Audio" button to select your audio file
+ 2. **Use Model**: Select from the models in `MODEL_PATH_LIST` (default: `{MODEL_PATH_LIST[0]}`)
+ 3. **Local Testing**: For development, you can also use local paths like `/path/to/your/SenseVoiceSmall`
+
+ Supported languages: Chinese (zh), English (en), Cantonese (yue), Japanese (ja), Korean (ko)
+ """)
+
+ iface.launch(share=False, debug=True)
packages.txt ADDED
@@ -0,0 +1,4 @@
+ ffmpeg
+ aria2
+ wget
+
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ numpy==1.23.5
+ gradio>=4.0.0
+ yt-dlp
+ requests
+ pytube
+ ffmpeg-python
+ pydub
+ torch
+ transformers
+ funasr>=1.1.3
+ torchaudio
+ modelscope
+ huggingface_hub