Updating the README.md file
Signed-off-by: taejinp <tango4j@gmail.com>
README.md
CHANGED

@@ -257,18 +257,79 @@ The model is available for use in the NeMo Framework[7], and can be used as a pr

**Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.

```
# Running streaming multitalker Parakeet with streaming Sortformer
python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
-audio_file=example.wav \
-
-masked_asr=false \
-parallel_speaker_strategy=true \
-att_context_size=[70,13] \
-output_path=./output.json \ # where to save the output seglst file
-print_path=./print_script.sh
```

Alternatively, `audio_file` can be replaced with `manifest_file`.

@@ -294,24 +355,6 @@ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_m
}
```

-### Single Speaker ASR
-
-The model can also be used for single speaker ASR:
-
-```
-python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
-asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
-diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
-audio_file=example.wav \
-max_num_of_spks=1 \
-single_speaker_mode=true \
-masked_asr=false \
-parallel_speaker_strategy=true \
-att_context_size=[70,13] \
-output_path=./output.json \ # where to save the output seglst file
-print_path=./print_script.sh
-```
-
### Setting up Streaming Configuration

Latency is defined by the `att_context_size`, all measured in **80ms frames**:

@@ -320,36 +363,6 @@ Latency is defined by the `att_context_size`, all measured in **80ms frames**:
* [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
* [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)

-
-<!-- ### Getting Transcription Results -->
-
-<!-- The model requires speaker diarization information to perform speaker-wise ASR. You need to:
-
-1. **Obtain speaker diarization** (e.g., using Streaming Sortformer or similar diarization system)
-2. **Deploy one model instance per speaker** identified in the diarization output
-3. **Feed each instance** with:
-   - The same audio input
-   - Speaker-specific activity information for that speaker
-
-```python3
-# Example: For a 2-speaker conversation
-# Assuming you have diarization outputs for Speaker 0 and Speaker 1
-
-# Create two model instances
-model_spk0 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
-model_spk1 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
-
-# Get transcription for each speaker
-# Each model instance uses speaker-specific kernels to adapt to its target speaker
-results_spk0 = model_spk0.transcribe(audio=audio_input, speaker_id=0, diarization_info=diar_info)
-results_spk1 = model_spk1.transcribe(audio=audio_input, speaker_id=1, diarization_info=diar_info)
-
-# Combine results to get complete multitalker transcription
-``` -->
-
-<!-- **Note**: The specific API for providing diarization information may vary. Please refer to the [NeMo documentation](https://github.com/NVIDIA/NeMo) for detailed usage instructions. -->
-
-
### Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.

@@ -397,15 +410,11 @@ Data collection methods vary across individual datasets. The training datasets i

| **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
|-------------|-------------|-------------|-----------|-------------|
-| 0.16s | ----- | ----- | ----- | ----- |
-| 0.16s | ----- | ----- | ----- | ----- |
-| 0.56s | ----- | ----- | ----- | ----- |
| 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |

## References

[1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
-W. Wang, T. Park, I. Medennikov, J. Wang, K. Dhawan, H. Huang, N. R. Koluguri, J. Balam, B. Ginsburg. *Proc. INTERSPEECH 2025*

[2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

@@ -257,18 +257,79 @@ The model is available for use in the NeMo Framework[7], and can be used as a pr

**Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.

+### Method 1. Code snippet
+
+```python
+from nemo.collections.asr.models import ASRModel, SortformerEncLabelModel
+import torch
+
+# Step 1: Load streaming diarization model (provides speaker activity predictions)
+diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2.1")
+diar_model.eval().to(torch.device("cuda"))
+
+# Step 2: Load streaming multitalker ASR model (transcribes each speaker separately)
+asr_model = ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
+asr_model.eval().to(torch.device("cuda"))
+
+from multitalker_transcript_config import MultitalkerTranscriptionConfig
+from omegaconf import OmegaConf
+# Step 3: Configure models with streaming parameters (latency, chunk sizes, etc.)
+cfg = OmegaConf.structured(MultitalkerTranscriptionConfig())
+cfg.audio_file = "/path/to/your/audio.wav"
+cfg.output_path = "/path/to/output_transcription.json"
+
+# Initialize diarization model with streaming config (sets chunk_len, context, etc.)
+diar_model = MultitalkerTranscriptionConfig.init_diar_model(cfg, diar_model)
+
+from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
+
+# Step 4: Set up streaming buffer (simulates a real-time audio stream)
+samples = [{'audio_filepath': cfg.audio_file}]
+streaming_buffer = CacheAwareStreamingAudioBuffer(
+    model=asr_model,
+    online_normalization=cfg.online_normalization,
+    pad_and_drop_preencoded=cfg.pad_and_drop_preencoded,
+)
+streaming_buffer.append_audio_file(audio_filepath=cfg.audio_file, stream_id=-1)
+streaming_buffer_iter = iter(streaming_buffer)
+
+from nemo.collections.asr.parts.utils.multispk_transcribe_utils import SpeakerTaggedASR
+
+# Step 5: Initialize multi-instance ASR streamer (manages per-speaker ASR instances)
+multispk_asr_streamer = SpeakerTaggedASR(cfg, asr_model, diar_model)
+
+# Step 6: Process audio chunks iteratively (streaming inference loop)
+for step_num, (chunk_audio, chunk_lengths) in enumerate(streaming_buffer_iter):
+    drop_extra_pre_encoded = (
+        0
+        if step_num == 0 and not cfg.pad_and_drop_preencoded
+        else asr_model.encoder.streaming_cfg.drop_extra_pre_encoded
+    )
+    with torch.inference_mode():
+        with torch.amp.autocast(diar_model.device.type, enabled=True):
+            with torch.no_grad():
+                multispk_asr_streamer.perform_parallel_streaming_stt_spk(
+                    step_num=step_num,
+                    chunk_audio=chunk_audio,
+                    chunk_lengths=chunk_lengths,
+                    is_buffer_empty=streaming_buffer.is_buffer_empty(),
+                    drop_extra_pre_encoded=drop_extra_pre_encoded,
+                )
+
+# Step 7: Generate final transcriptions in SegLST format (speaker-tagged with timestamps)
+seglst_dict_list = multispk_asr_streamer.generate_seglst_dicts_from_parallel_streaming(samples=samples)
+
+# Display speaker-tagged transcriptions with timestamps
+print(seglst_dict_list)
+```
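For reference, the printed `seglst_dict_list` holds segment-level entries in the SegLST style mentioned above (one dictionary per speaker turn, also written to `output_path`). The exact keys are produced by the NeMo utilities in the snippet; the example below is only a hypothetical illustration of the general shape, with made-up values:

```python
# Hypothetical SegLST-style entries (illustrative only; actual keys/values come
# from generate_seglst_dicts_from_parallel_streaming)
example_seglst = [
    {"session_id": "example", "speaker": "speaker_0", "start_time": 0.48, "end_time": 2.96, "words": "hello how are you"},
    {"session_id": "example", "speaker": "speaker_1", "start_time": 2.40, "end_time": 4.16, "words": "i am doing well thanks"},
]
```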
+
+### Method 2. Use the NeMo example script in NVIDIA/NeMo
```
# Running streaming multitalker Parakeet with streaming Sortformer
python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
+audio_file="/path/to/example.wav" \
+output_path="/path/to/example_output.json" \ # where to save the output seglst file
```

Alternatively, `audio_file` can be replaced with `manifest_file`.

@@ -294,24 +355,6 @@ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_m
}
```

### Setting up Streaming Configuration
Latency is defined by the `att_context_size`, all measured in **80ms frames**:
* [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
* [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
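As a quick check of the arithmetic above: the chunk covers the right attention context plus one frame, and each frame is 80ms. A minimal sketch (the helper function is ours for illustration, not part of NeMo):

```python
def chunk_latency_seconds(att_context_size, frame_ms=80):
    """Latency implied by an [left, right] attention context, in seconds."""
    left, right = att_context_size
    chunk_frames = right + 1          # e.g. right=13 -> 14 frames per chunk
    return chunk_frames * frame_ms / 1000.0

print(chunk_latency_seconds([70, 6]))   # 0.56
print(chunk_latency_seconds([70, 13]))  # 1.12
```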
### Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.
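Recordings with a different sampling rate or more than one channel need to be converted first. One possible way to do this (assuming `librosa` and `soundfile` are installed; any equivalent resampling tool works):

```python
import librosa
import soundfile as sf

# Load any audio file as mono and resample it to 16 kHz
audio, sr = librosa.load("input_audio.flac", sr=16000, mono=True)
sf.write("example.wav", audio, sr)  # 16 kHz mono WAV, ready for the models above
```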

@@ -397,15 +410,11 @@ Data collection methods vary across individual datasets. The training datasets i

| **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
|-------------|-------------|-------------|-----------|-------------|
| 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |
## References
[1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)

[2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)