Taejin committed
Commit a7109eb · 1 Parent(s): 9438a9f

Updating the README.md file

Signed-off-by: taejinp <tango4j@gmail.com>

Files changed (1)
  1. README.md +68 -59
README.md CHANGED
@@ -257,18 +257,79 @@ The model is available for use in the NeMo Framework[7], and can be used as a pr
257
 
258
  **Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.
259

260
  ```
261
  # Running streaming multitalker Parakeet with streaming Sortformer
262
  python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
263
  asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
264
  diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
265
- audio_file=example.wav \
266
- max_num_of_spks=4 \
267
- masked_asr=false \
268
- parallel_speaker_strategy=true \
269
- att_context_size=[70,13] \
270
- output_path=./output.json \ # where to save the output seglst file
271
- print_path=./print_script.sh
272
  ```
273
 
274
  Alternatively, `audio_file` can be replaced with `manifest_file`:
@@ -294,24 +355,6 @@ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_m
294
  }
295
  ```
296
 
297
- ### Single Speaker ASR
298
-
299
- The model can also be used for single speaker ASR:
300
-
301
- ```
302
- python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
303
- asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
304
- diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
305
- audio_file=example.wav \
306
- max_num_of_spks=1 \
307
- single_speaker_mode=true \
308
- masked_asr=false \
309
- parallel_speaker_strategy=true \
310
- att_context_size=[70,13] \
311
- output_path=./output.json \ # where to save the output seglst file
312
- print_path=./print_script.sh
313
- ```
314
-
315
  ### Setting up Streaming Configuration
316
 
317
  Latency is determined by `att_context_size`; all values are measured in **80ms frames**:
@@ -320,36 +363,6 @@ Latency is defined by the `att_context_size`, all measured in **80ms frames**:
320
  * [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
321
  * [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
322
 
323
-
324
- <!-- ### Getting Transcription Results -->
325
-
326
- <!-- The model requires speaker diarization information to perform speaker-wise ASR. You need to:
327
-
328
- 1. **Obtain speaker diarization** (e.g., using Streaming Sortformer or similar diarization system)
329
- 2. **Deploy one model instance per speaker** identified in the diarization output
330
- 3. **Feed each instance** with:
331
- - The same audio input
332
- - Speaker-specific activity information for that speaker
333
-
334
- ```python3
335
- # Example: For a 2-speaker conversation
336
- # Assuming you have diarization outputs for Speaker 0 and Speaker 1
337
-
338
- # Create two model instances
339
- model_spk0 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
340
- model_spk1 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
341
-
342
- # Get transcription for each speaker
343
- # Each model instance uses speaker-specific kernels to adapt to its target speaker
344
- results_spk0 = model_spk0.transcribe(audio=audio_input, speaker_id=0, diarization_info=diar_info)
345
- results_spk1 = model_spk1.transcribe(audio=audio_input, speaker_id=1, diarization_info=diar_info)
346
-
347
- # Combine results to get complete multitalker transcription
348
- ``` -->
349
-
350
- <!-- **Note**: The specific API for providing diarization information may vary. Please refer to the [NeMo documentation](https://github.com/NVIDIA/NeMo) for detailed usage instructions. -->
351
-
352
-
353
  ### Input
354
 
355
  This model accepts single-channel (mono) audio sampled at 16,000 Hz.
@@ -397,15 +410,11 @@ Data collection methods vary across individual datasets. The training datasets i
397
 
398
  | **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
399
  |-------------|-------------|-------------|-----------|-------------|
400
- | 0.16s | ----- | ----- | ----- | ----- |
401
- | 0.16s | ----- | ----- | ----- | ----- |
402
- | 0.56s | ----- | ----- | ----- | ----- |
403
  | 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |
404
 
405
  ## References
406
 
407
  [1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
408
- W. Wang, T. Park, I. Medennikov, J. Wang, K. Dhawan, H. Huang, N. R. Koluguri, J. Balam, B. Ginsburg. *Proc. INTERSPEECH 2025*
409
 
410
  [2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
411
 
 
257
 
258
  **Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.
259
 
260
+ ### Method 1. Code snippet
261
+
262
+ ```python
263
+ from nemo.collections.asr.models import ASRModel, SortformerEncLabelModel
264
+ import torch
265
+
266
+ # Step 1: Load streaming diarization model (provides speaker activity predictions)
267
+ diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2.1")
268
+ diar_model.eval().to(torch.device("cuda"))
269
+
270
+ # Step 2: Load streaming multitalker ASR model (transcribes each speaker separately)
271
+ asr_model = ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
272
+ asr_model.eval().to(torch.device("cuda"))
273
+
274
+ from multitalker_transcript_config import MultitalkerTranscriptionConfig
275
+ from omegaconf import OmegaConf
276
+ # Step 3: Configure models with streaming parameters (latency, chunk sizes, etc.)
277
+ cfg = OmegaConf.structured(MultitalkerTranscriptionConfig())
278
+ cfg.audio_file = "/path/to/your/audio.wav"
279
+ cfg.output_path = "/path/to/output_transcription.json"
280
+
281
+ # Initialize diarization model with streaming config (sets chunk_len, context, etc.)
282
+ diar_model = MultitalkerTranscriptionConfig.init_diar_model(cfg, diar_model)
283
+
284
+ from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
285
+
286
+ # Step 4: Setup streaming buffer (simulates real-time audio stream)
287
+ samples = [{'audio_filepath': cfg.audio_file}]
288
+ streaming_buffer = CacheAwareStreamingAudioBuffer(
289
+ model=asr_model,
290
+ online_normalization=cfg.online_normalization,
291
+ pad_and_drop_preencoded=cfg.pad_and_drop_preencoded,
292
+ )
293
+ streaming_buffer.append_audio_file(audio_filepath=cfg.audio_file, stream_id=-1)
294
+ streaming_buffer_iter = iter(streaming_buffer)
295
+
296
+ from nemo.collections.asr.parts.utils.multispk_transcribe_utils import SpeakerTaggedASR
297
+
298
+ # Step 5: Initialize multi-instance ASR streamer (manages per-speaker ASR instances)
299
+ multispk_asr_streamer = SpeakerTaggedASR(cfg, asr_model, diar_model)
300
+
301
+ # Step 6: Process audio chunks iteratively (streaming inference loop)
302
+ for step_num, (chunk_audio, chunk_lengths) in enumerate(streaming_buffer_iter):
303
+ drop_extra_pre_encoded = (
304
+ 0
305
+ if step_num == 0 and not cfg.pad_and_drop_preencoded
306
+ else asr_model.encoder.streaming_cfg.drop_extra_pre_encoded
307
+ )
308
+ with torch.inference_mode():
309
+ with torch.amp.autocast(diar_model.device.type, enabled=True):
310
+ with torch.no_grad():
311
+ multispk_asr_streamer.perform_parallel_streaming_stt_spk(
312
+ step_num=step_num,
313
+ chunk_audio=chunk_audio,
314
+ chunk_lengths=chunk_lengths,
315
+ is_buffer_empty=streaming_buffer.is_buffer_empty(),
316
+ drop_extra_pre_encoded=drop_extra_pre_encoded,
317
+ )
318
+ # Step 7: Generate final transcriptions in SegLST format (speaker-tagged with timestamps)
319
+ seglst_dict_list = multispk_asr_streamer.generate_seglst_dicts_from_parallel_streaming(samples=samples)
320
+
321
+ # Display speaker-tagged transcriptions with timestamps
322
+ print(seglst_dict_list)
323
+ ```
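The resulting `seglst_dict_list` holds one dictionary per speaker-tagged segment; in the SegLST convention these entries typically carry the speaker label, start/end times, and the recognized words, and can be written to the path given in `cfg.output_path`.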
324
+
325
+ ### Method 2. Use the NeMo example script in NVIDIA/NeMo
326
  ```
327
  # Running streaming multitalker Parakeet with streaming Sortformer
328
  python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
329
  asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
330
  diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
331
+ audio_file="/path/to/example.wav" \
332
+ output_path="/path/to/example_output.json" \ # where to save the output seglst file
 
 
 
 
 
333
  ```
334
 
335
  Alternatively, `audio_file` can be replaced with `manifest_file`:
 
355
  }
356
  ```
357

358
  ### Setting up Streaming Configuration
359
 
360
  Latency is determined by `att_context_size`; all values are measured in **80ms frames** (see the sketch after this list):
 
363
  * [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
364
  * [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
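
To make the arithmetic concrete, here is a minimal, self-contained sketch (not part of the NeMo API; the function name is made up for illustration) that maps the right-hand value of `att_context_size` to chunk latency, assuming the 80ms frame duration stated above:

```python
# Minimal sketch: chunk latency implied by att_context_size,
# assuming 80ms encoder frames as described above.
FRAME_MS = 80

def chunk_latency_seconds(att_context_size):
    _, right_context = att_context_size
    chunk_frames = right_context + 1        # e.g. [70, 13] -> 14 frames
    return chunk_frames * FRAME_MS / 1000   # 14 * 80ms = 1.12s

print(chunk_latency_seconds([70, 6]))    # 0.56
print(chunk_latency_seconds([70, 13]))   # 1.12
```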
365

366
  ### Input
367
 
368
  This model accepts single-channel (mono) audio sampled at 16,000 Hz.
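
If a recording is not already in this format, it can be converted beforehand. The snippet below is a minimal sketch that assumes `torchaudio` is installed and uses placeholder file paths; any resampling tool works equally well:

```python
# Minimal sketch (assumes torchaudio; file paths are placeholders):
# convert a recording to the single-channel, 16 kHz format this model expects.
import torchaudio
import torchaudio.functional as F

TARGET_SR = 16000

waveform, sr = torchaudio.load("input_recording.wav")  # shape: [channels, samples]
if waveform.shape[0] > 1:                               # downmix to mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != TARGET_SR:                                     # resample to 16 kHz
    waveform = F.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
torchaudio.save("example.wav", waveform, TARGET_SR)
```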
 
410
 
411
  | **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
412
  |-------------|-------------|-------------|-----------|-------------|
413
  | 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |
414
 
415
  ## References
416
 
417
  [1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
 
418
 
419
  [2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
420