Updating the README.md file
Signed-off-by: taejinp <tango4j@gmail.com>
README.md
CHANGED

@@ -257,18 +257,79 @@ The model is available for use in the NeMo Framework[7], and can be used as a pr

**Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.

```
# Running streaming multitalker Parakeet with streaming Sortformer
python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
-audio_file=example.wav \
-
-masked_asr=false \
-parallel_speaker_strategy=true \
-att_context_size=[70,13] \
-output_path=./output.json \ # where to save the output seglst file
-print_path=./print_script.sh
```

Alternatively, `audio_file` can be replaced with `manifest_file`.

@@ -294,24 +355,6 @@ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_m
}
```

-### Single Speaker ASR
-
-The model can also be used for single speaker ASR:
-
-```
-python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
-asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
-diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
-audio_file=example.wav \
-max_num_of_spks=1 \
-single_speaker_mode=true \
-masked_asr=false \
-parallel_speaker_strategy=true \
-att_context_size=[70,13] \
-output_path=./output.json \ # where to save the output seglst file
-print_path=./print_script.sh
-```
-
### Setting up Streaming Configuration

Latency is defined by the `att_context_size`, all measured in **80ms frames**:

@@ -320,36 +363,6 @@ Latency is defined by the `att_context_size`, all measured in **80ms frames**:
* [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
* [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)

-
-<!-- ### Getting Transcription Results -->
-
-<!-- The model requires speaker diarization information to perform speaker-wise ASR. You need to:
-
-1. **Obtain speaker diarization** (e.g., using Streaming Sortformer or similar diarization system)
-2. **Deploy one model instance per speaker** identified in the diarization output
-3. **Feed each instance** with:
-   - The same audio input
-   - Speaker-specific activity information for that speaker
-
-```python3
-# Example: For a 2-speaker conversation
-# Assuming you have diarization outputs for Speaker 0 and Speaker 1
-
-# Create two model instances
-model_spk0 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
-model_spk1 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
-
-# Get transcription for each speaker
-# Each model instance uses speaker-specific kernels to adapt to its target speaker
-results_spk0 = model_spk0.transcribe(audio=audio_input, speaker_id=0, diarization_info=diar_info)
-results_spk1 = model_spk1.transcribe(audio=audio_input, speaker_id=1, diarization_info=diar_info)
-
-# Combine results to get complete multitalker transcription
-``` -->
-
-<!-- **Note**: The specific API for providing diarization information may vary. Please refer to the [NeMo documentation](https://github.com/NVIDIA/NeMo) for detailed usage instructions. -->
-
-
### Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.

@@ -397,15 +410,11 @@ Data collection methods vary across individual datasets. The training datasets i

| **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
|-------------|-------------|-------------|-----------|-------------|
-| 0.16s | ----- | ----- | ----- | ----- |
-| 0.16s | ----- | ----- | ----- | ----- |
-| 0.56s | ----- | ----- | ----- | ----- |
| 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |

## References

[1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
-W. Wang, T. Park, I. Medennikov, J. Wang, K. Dhawan, H. Huang, N. R. Koluguri, J. Balam, B. Ginsburg. *Proc. INTERSPEECH 2025*

[2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

@@ -257,18 +257,79 @@ The model is available for use in the NeMo Framework[7], and can be used as a pr

**Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.

+### Method 1. Code snippet
+
+```python
+from nemo.collections.asr.models import ASRModel, SortformerEncLabelModel
+import torch
+
+# Step 1: Load streaming diarization model (provides speaker activity predictions)
+diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2.1")
+diar_model.eval().to(torch.device("cuda"))
+
+# Step 2: Load streaming multitalker ASR model (transcribes each speaker separately)
+asr_model = ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
+asr_model.eval().to(torch.device("cuda"))
+
+from multitalker_transcript_config import MultitalkerTranscriptionConfig
+from omegaconf import OmegaConf
+# Step 3: Configure models with streaming parameters (latency, chunk sizes, etc.)
+cfg = OmegaConf.structured(MultitalkerTranscriptionConfig())
+cfg.audio_file = "/path/to/your/audio.wav"
+cfg.output_path = "/path/to/output_transcription.json"
+
+# Initialize diarization model with streaming config (sets chunk_len, context, etc.)
+diar_model = MultitalkerTranscriptionConfig.init_diar_model(cfg, diar_model)
+
+from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
+
+# Step 4: Set up streaming buffer (simulates a real-time audio stream)
+samples = [{'audio_filepath': cfg.audio_file}]
+streaming_buffer = CacheAwareStreamingAudioBuffer(
+    model=asr_model,
+    online_normalization=cfg.online_normalization,
+    pad_and_drop_preencoded=cfg.pad_and_drop_preencoded,
+)
+streaming_buffer.append_audio_file(audio_filepath=cfg.audio_file, stream_id=-1)
+streaming_buffer_iter = iter(streaming_buffer)
+
+from nemo.collections.asr.parts.utils.multispk_transcribe_utils import SpeakerTaggedASR
+
+# Step 5: Initialize multi-instance ASR streamer (manages per-speaker ASR instances)
+multispk_asr_streamer = SpeakerTaggedASR(cfg, asr_model, diar_model)
+
+# Step 6: Process audio chunks iteratively (streaming inference loop)
+for step_num, (chunk_audio, chunk_lengths) in enumerate(streaming_buffer_iter):
+    drop_extra_pre_encoded = (
+        0
+        if step_num == 0 and not cfg.pad_and_drop_preencoded
+        else asr_model.encoder.streaming_cfg.drop_extra_pre_encoded
+    )
+    with torch.inference_mode():
+        with torch.amp.autocast(diar_model.device.type, enabled=True):
+            with torch.no_grad():
+                multispk_asr_streamer.perform_parallel_streaming_stt_spk(
+                    step_num=step_num,
+                    chunk_audio=chunk_audio,
+                    chunk_lengths=chunk_lengths,
+                    is_buffer_empty=streaming_buffer.is_buffer_empty(),
+                    drop_extra_pre_encoded=drop_extra_pre_encoded,
+                )
+
+# Step 7: Generate final transcriptions in SegLST format (speaker-tagged with timestamps)
+seglst_dict_list = multispk_asr_streamer.generate_seglst_dicts_from_parallel_streaming(samples=samples)
+
+# Display speaker-tagged transcriptions with timestamps
+print(seglst_dict_list)
+```
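For reference, the printed `seglst_dict_list` holds segment-level entries in the SegLST style mentioned above (one dictionary per speaker turn, also written to `output_path`). The exact keys are produced by the NeMo utilities in the snippet; the example below is only a hypothetical illustration of the general shape, with made-up values:

```python
# Hypothetical SegLST-style entries (illustrative only; actual keys/values come
# from generate_seglst_dicts_from_parallel_streaming)
example_seglst = [
    {"session_id": "example", "speaker": "speaker_0", "start_time": 0.48, "end_time": 2.96, "words": "hello how are you"},
    {"session_id": "example", "speaker": "speaker_1", "start_time": 2.40, "end_time": 4.16, "words": "i am doing well thanks"},
]
```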
+
+### Method 2. Use the NeMo example script in NVIDIA/NeMo
```
# Running streaming multitalker Parakeet with streaming Sortformer
python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
+audio_file="/path/to/example.wav" \
+output_path="/path/to/example_output.json" \ # where to save the output seglst file
```

Alternatively, `audio_file` can be replaced with `manifest_file`.

@@ -294,24 +355,6 @@ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_m
}
```

### Setting up Streaming Configuration
Latency is defined by the `att_context_size`, all measured in **80ms frames**:
* [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
* [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
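As a quick check of the arithmetic above: the chunk covers the right attention context plus one frame, and each frame is 80ms. A minimal sketch (the helper function is ours for illustration, not part of NeMo):

```python
def chunk_latency_seconds(att_context_size, frame_ms=80):
    """Latency implied by an [left, right] attention context, in seconds."""
    left, right = att_context_size
    chunk_frames = right + 1          # e.g. right=13 -> 14 frames per chunk
    return chunk_frames * frame_ms / 1000.0

print(chunk_latency_seconds([70, 6]))   # 0.56
print(chunk_latency_seconds([70, 13]))  # 1.12
```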
### Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.
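Recordings with a different sampling rate or more than one channel need to be converted first. One possible way to do this (assuming `librosa` and `soundfile` are installed; any equivalent resampling tool works):

```python
import librosa
import soundfile as sf

# Load any audio file as mono and resample it to 16 kHz
audio, sr = librosa.load("input_audio.flac", sr=16000, mono=True)
sf.write("example.wav", audio, sr)  # 16 kHz mono WAV, ready for the models above
```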

@@ -397,15 +410,11 @@ Data collection methods vary across individual datasets. The training datasets i

| **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
|-------------|-------------|-------------|-----------|-------------|
| 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |
## References
[1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)

[2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)