taejinp and weiqingw4ng committed
Commit 37f2657 · verified · 1 Parent(s): f8e3b93

Update README.md (#1)


- Update README.md (8e83774ce9f5496cb96a5af3649af2a92f9b6708)


Co-authored-by: Weiqing Wang <weiqingw4ng@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +130 -167
README.md CHANGED
@@ -181,7 +181,7 @@ pipeline_tag: audio-classification
181
  ---
182
 
183
 
184
- # Streaming Sortformer Diarizer 4spk v2
185
 
186
  <style>
187
  img {
@@ -190,48 +190,57 @@ img {
190
  </style>
191
 
192
  [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
193
- | [![Model size](https://img.shields.io/badge/Params-117M-lightgrey#model-badge)](#model-architecture)
194
  <!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->
195
 
196
- This model is a streaming version of Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.
197
 
198
- <div align="center">
199
- <img src="figures/sortformer_intro.png" width="750" />
200
- </div>
201
 
202
- [Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
203
- <div align="center">
204
- <img src="figures/aosc_3spk_example.gif" width="1400" />
205
- </div>
206
- <div align="center">
207
- <img src="figures/aosc_4spk_example.gif" width="1400" />
208
- </div>
209
 
210
- Sortformer resolves the permutation problem in diarization by following the arrival-time order of the speech segments from each speaker.
211
 
212
  ## Model Architecture
213
 
214
- Streaming Sortformer employs the pre-encode layer of the Fast-Conformer to generate the speaker cache. At each step, the speaker cache is filtered to retain only high-quality speaker-cache vectors.
 
 
215
 
216
  <div align="center">
217
- <img src="figures/streaming_steps.png" width="1400" />
218
  </div>
219
 
 
220
 
221
- Aside from the speaker-cache management, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17-layer) [NeMo Encoder for
222
- Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] encoder, which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. This is followed by an 18-layer Transformer[5] encoder with a hidden size of 192,
223
- and two feedforward layers with 4 sigmoid outputs for each frame at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2].
224
 
225
  <div align="center">
226
- <img src="figures/sortformer-v1-model.png" width="450" />
227
  </div>
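- 
- As a rough illustration of the prediction head described above (hidden size 192, two feedforward layers, 4 sigmoid outputs per frame), here is a minimal PyTorch sketch; the layer widths and activations are illustrative assumptions, not the NeMo implementation:
- 
- ```python3
- import torch
- import torch.nn as nn
- 
- class SigmoidDiarHead(nn.Module):
-     """Toy stand-in for the two feedforward layers with 4 per-frame sigmoid outputs."""
-     def __init__(self, hidden_size: int = 192, num_speakers: int = 4):
-         super().__init__()
-         self.ff1 = nn.Linear(hidden_size, hidden_size)
-         self.ff2 = nn.Linear(hidden_size, num_speakers)
- 
-     def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
-         # encoder_states: (batch, T, hidden_size) -> (batch, T, 4) speaker activity probabilities
-         return torch.sigmoid(self.ff2(torch.relu(self.ff1(encoder_states))))
- 
- probs = SigmoidDiarHead()(torch.randn(1, 100, 192))  # shape (1, 100, 4), values in [0, 1]
- ```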
228
 
229
 
230
 
231
 
232
  ## NVIDIA NeMo
233
 
234
- To train, fine-tune or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[6]. We recommend you install it after you've installed Cython and latest PyTorch version.
235
 
236
  ```
237
  apt-get update && apt-get install -y libsndfile1 ffmpeg
@@ -241,39 +250,35 @@ pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
241
 
242
  ## How to Use this Model
243
 
244
- The model is available for use in the NeMo Framework[6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
245
 
246
- ### Loading the Model
247
 
248
- ```python3
249
- from nemo.collections.asr.models import SortformerEncLabelModel
250
-
251
- # load model from Hugging Face model card directly (You need a Hugging Face token)
252
- diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
253
-
254
- # If you have a downloaded model in "/path/to/diar_streaming_sortformer_4spk-v2.nemo", load model from a downloaded file
255
- diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_streaming_sortformer_4spk-v2.nemo", map_location='cuda', strict=False)
256
-
257
- # switch to inference mode
258
- diar_model.eval()
 
259
  ```
260
 
261
- ### Input Format
262
- Input to Sortformer can be an individual audio file:
263
- ```python3
264
- audio_input="/path/to/multispeaker_audio1.wav"
265
  ```
266
- or a list of paths to audio files:
267
- ```python3
268
- audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
 
 
269
  ```
270
- or a jsonl manifest file:
271
- ```python3
272
- audio_input="/path/to/multispeaker_manifest.json"
273
  ```
274
- where each line is a dictionary containing the following fields:
275
- ```yaml
276
- # Example of a line in `multispeaker_manifest.json`
277
  {
278
  "audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
279
  "offset": 0, # offset (start) time of the input audio
@@ -286,101 +291,86 @@ where each line is a dictionary containing the following fields:
286
  }
287
  ```
288
 
289
- ### Setting up Streaming Configuration
290
 
291
- Streaming configuration is defined by the following parameters, all measured in **80ms frames**:
292
- * **CHUNK_SIZE**: The number of frames in a processing chunk.
293
- * **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
294
- * **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
295
- * **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
296
- * **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.
297
-
298
- Here are recommended configurations for different scenarios:
299
- | **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
300
- | :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
301
- | very high latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
302
- | high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
303
- | low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
304
- | ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
305
-
306
- For clarity on the metrics used in the table:
307
- * **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT**. This value does not include computational processing time.
308
- * **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
309
-
310
- To set streaming configuration, use:
311
- ```python3
312
- diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
313
- diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
314
- diar_model.sortformer_modules.fifo_len = FIFO_SIZE
315
- diar_model.sortformer_modules.spkcache_update_period = UPDATE_PERIOD
316
- diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
317
- diar_model.sortformer_modules._check_streaming_parameters()
318
- ```
319
 
320
- ### Getting Diarization Results
321
- To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
322
- ```python3
323
- predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
324
  ```
325
- To obtain tensors of speaker activity probabilities, use:
326
- ```python3
327
- predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
328
  ```
329
 
 
330
 
331
- ### Input
332
 
333
- This model accepts single-channel (mono) audio sampled at 16,000 Hz.
334
- - The actual input tensor is an Ns x 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal.
335
- - For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) will form a 160,000 x 1 matrix.
336
 
337
- ### Output
338
 
339
- The output of the model is a T x S matrix, where:
340
- - S is the maximum number of speakers (in this model, S = 4).
341
- - T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
342
- Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
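- 
- As a minimal illustration (not the model card's official post-processing), the T x S probabilities can be thresholded into speaker-marked segments using the 0.08-second frame duration described above:
- 
- ```python3
- import numpy as np
- 
- FRAME_SEC = 0.08  # each output frame covers 80 ms
- 
- def probs_to_segments(probs: np.ndarray, threshold: float = 0.5):
-     """probs: (T, S) speaker activity probabilities -> list of (start_s, end_s, speaker_idx)."""
-     segments = []
-     for spk in range(probs.shape[1]):
-         active = probs[:, spk] > threshold
-         start = None
-         for t, is_active in enumerate(active):
-             if is_active and start is None:
-                 start = t
-             elif not is_active and start is not None:
-                 segments.append((start * FRAME_SEC, t * FRAME_SEC, spk))
-                 start = None
-         if start is not None:
-             segments.append((start * FRAME_SEC, len(active) * FRAME_SEC, spk))
-     return segments
- 
- # e.g., frame index 150 covers [150 * 0.08, 151 * 0.08] = [12.00, 12.08] seconds
- ```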
343
 
 
 
344
 
345
- ## Train and evaluate Sortformer diarizer using NeMo
346
- ### Training
347
 
348
- Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second-long training samples and a batch size of 4.
349
- The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).
350
 
351
- ### Inference
352
 
353
- Inference with Sortformer diarizer models, including post-processing, can be performed using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). Provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset.
354
 
355
- ### Technical Limitations
356
 
357
- - The model operates in a streaming mode (online mode).
358
- - It can detect a maximum of 4 speakers; performance degrades on recordings with 5 or more speakers.
359
- - While the model is designed for long-form audio and can handle recordings that are several hours long, performance may degrade on very long recordings.
360
- - The model was trained on publicly available speech datasets, primarily in English. As a result:
361
- * Performance may degrade on non-English speech.
362
- * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.
363
 
364
  ## Datasets
365
 
366
- Sortformer was trained on a combination of 2445 hours of real conversations and 5150 hours of simulated audio mixtures generated by [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
367
- All the datasets listed above use the same labeling method, the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of the RTTM files was processed specifically for speaker diarization model training.
368
- Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
369
 
370
 
371
  ### Training Datasets (Real conversations)
 
372
  - Fisher English (LDC)
373
- - AMI Meeting Corpus
374
- - VoxConverse-v0.3
 
375
  - ICSI
376
- - AISHELL-4
377
- - Third DIHARD Challenge Development (LDC)
378
- - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
379
- - DiPCo
380
- - AliMeeting
381
 
382
  ### Training Datasets (Used to simulate audio mixtures)
383
- - 2004-2010 NIST Speaker Recognition Evaluation (LDC)
384
  - Librispeech
385
 
386
  ## Performance
@@ -388,70 +378,43 @@ Data collection methods vary across individual datasets. For example, the above
388
 
389
  ### Evaluation data specifications
390
 
391
- | **Dataset** | **Number of speakers** | **Number of Sessions** |
392
- |----------------------------|------------------------|------------------------|
393
- | **DIHARD III Eval <=4spk** | 1-4 | 219 |
394
- | **DIHARD III Eval >=5spk** | 5-9 | 40 |
395
- | **DIHARD III Eval full** | 1-9 | 259 |
396
- | **CALLHOME-part2 2spk** | 2 | 148 |
397
- | **CALLHOME-part2 3spk** | 3 | 74 |
398
- | **CALLHOME-part2 4spk** | 4 | 20 |
399
- | **CALLHOME-part2 5spk** | 5 | 5 |
400
- | **CALLHOME-part2 6spk** | 6 | 3 |
401
- | **CALLHOME-part2 full** | 2-6 | 250 |
402
- | **CH109** | 2 | 109 |
403
 
404
 
405
- ### Diarization Error Rate (DER)
406
 
407
  * All evaluations include overlapping speech.
408
  * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
409
- * Post-Processing (PP) is optimized on two different held-out dataset splits.
410
- - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
411
- - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109
412
-
413
- | **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
414
- |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
415
- | 30.4s | no | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
416
- | 30.4s | yes | 13.45 | 41.40 | 18.85 | 5.34 | 9.22 | 11.29 | 18.84 | 27.29 | 9.54 | 4.61 |
417
- | 10.0s | no | 14.90 | 41.06 | 19.96 | 6.96 | 11.05 | 12.93 | 20.47 | 28.10 | 11.21 | 5.28 |
418
- | 10.0s | yes | 13.75 | 41.41 | 19.10 | 6.05 | 9.88 | 11.72 | 19.66 | 27.37 | 10.15 | 4.80 |
419
- | 1.04s | no | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
420
- | 1.04s | yes | 13.24 | 42.56 | 18.91 | 6.57 | 10.05 | 12.44 | 21.68 | 28.74 | 10.70 | 4.88 |
421
- | 0.32s | no | 14.64 | 43.47 | 20.19 | 8.63 | 12.91 | 16.19 | 29.40 | 30.60 | 13.57 | 6.46 |
422
- | 0.32s | yes | 13.44 | 43.73 | 19.28 | 6.91 | 10.45 | 13.70 | 27.04 | 28.58 | 11.38 | 5.27 |
423
-
424
-
425
- ## NVIDIA Riva: Deployment
426
-
427
- Streaming Sortformer is deployed via NVIDIA RIVA ASR - [Speech Recognition with Speaker Diarization](https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#speech-recognition-with-speaker-diarization)
428
-
429
- [NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
430
- Additionally, Riva provides:
431
-
432
- * World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
433
- * Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
434
- * Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support
435
-
436
- For more information on NVIDIA RIVA, see the [list of supported models](https://huggingface.co/models?other=Riva).
437
- Also check out the [Riva live demo](https://developer.nvidia.com/riva#demos).
438
 
439
 
440
  ## References
441
- [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
442
 
443
- [2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
444
 
445
- [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
446
 
447
- [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
448
 
449
- [5] [Attention is all you need](https://arxiv.org/abs/1706.03762)
450
 
451
- [6] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
452
 
453
- [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
454
 
455
- ## License
456
 
457
- License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
 
181
  ---
182
 
183
 
184
+ # Multitalker Parakeet Streaming 0.6B v1
185
 
186
  <style>
187
  img {
 
190
  </style>
191
 
192
  [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
193
+ | [![Model size](https://img.shields.io/badge/Params-600M-lightgrey#model-badge)](#model-architecture)
194
  <!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->
195
 
196
+ This model is a streaming multitalker ASR model based on the Parakeet architecture. The model takes only speaker diarization outputs as external information and eliminates the need for explicit speaker queries or enrollment audio [[Wang et al., 2025]](https://arxiv.org/abs/2506.22646). Unlike conventional target-speaker ASR approaches that require speaker embeddings, this model dynamically adapts to individual speakers through speaker-wise speech activity prediction.
197
 
198
+ The key innovation involves injecting learnable **speaker kernels** into the pre-encode layer of the Fast-Conformer encoder. These speaker kernels are generated via speaker supervision activations, enabling instantaneous adaptation to target speakers. This approach leverages the inherent tendency of streaming ASR systems to prioritize specific speakers, repurposing this mechanism to achieve robust speaker-focused recognition.
 
 
199
 
200
+ The model architecture requires deploying **one model instance per speaker**, meaning the number of model instances matches the number of speakers in the conversation. While this necessitates additional computational resources, it achieves state-of-the-art performance in handling fully overlapped speech in both offline and streaming scenarios.
201
 
202
+ ## Key Advantages
203
+
204
+ This self-speaker adaptation approach offers several advantages over traditional multitalker ASR methods:
205
+
206
+ 1. **No Speaker Enrollment**: Unlike target-speaker ASR systems that require pre-enrollment audio or speaker embeddings, this model only needs speaker activity information from diarization
207
+ 2. **Handles Severe Overlap**: Each instance focuses on a single speaker, enabling accurate transcription even during fully overlapped speech
208
+ 3. **Streaming Capable**: Designed for real-time streaming scenarios with configurable latency-accuracy tradeoffs
209
+ 4. **Leverages Single-Speaker Models**: Can be fine-tuned from strong pre-trained single-speaker ASR models while preserving single-speaker ASR performance
210
 
211
  ## Model Architecture
212
 
213
+ ### Speaker Kernel Injection
214
+
215
+ The streaming multitalker Parakeet model employs a **speaker kernel injection** mechanism in the Fast-Conformer encoder. As shown in the figure below, learnable speaker kernels are injected into selected encoder layers, enabling the model to dynamically adapt to specific speakers.
216
 
217
  <div align="center">
218
+ <img src="figures/speaker_injection.png" width="750" />
219
  </div>
220
 
221
+ The speaker kernels are generated through speaker supervision activations that detect speech activity for each target speaker. This enables the encoder states to become more responsive to the targeted speaker's speech characteristics, even during periods of fully overlapped speech.
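+
+ As a conceptual sketch of this idea (illustrative names, shapes, and additive gating form; not the actual NeMo implementation), speaker-kernel injection can be thought of as biasing the pre-encode features on frames where the target speaker is active:
+
+ ```python3
+ import torch
+ import torch.nn as nn
+
+ class SpeakerKernelInjection(nn.Module):
+     """Conceptual sketch: bias pre-encode features toward one target speaker."""
+     def __init__(self, d_model: int = 512):
+         super().__init__()
+         # Learnable speaker kernel, broadcast over time
+         self.speaker_kernel = nn.Parameter(torch.zeros(d_model))
+
+     def forward(self, pre_encode_feats: torch.Tensor, speaker_activity: torch.Tensor) -> torch.Tensor:
+         # pre_encode_feats: (batch, T, d_model); speaker_activity: (batch, T) in [0, 1]
+         # Inject the kernel only on frames where the target speaker is active.
+         return pre_encode_feats + speaker_activity.unsqueeze(-1) * self.speaker_kernel
+
+ feats = torch.randn(1, 200, 512)
+ activity = torch.randint(0, 2, (1, 200)).float()  # e.g., speaker activity from a diarizer
+ adapted = SpeakerKernelInjection()(feats, activity)  # same shape, conditioned on the target speaker
+ ```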
222
 
223
+ ### Multi-Instance Architecture
224
+
225
+ The model is based on the Parakeet architecture and consists of a [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[4] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[5] encoder. The key architectural innovation is the **multi-instance approach**, where one model instance is deployed per speaker as illustrated below:
226
 
227
  <div align="center">
228
+ <img src="figures/multi_instance.png" width="1400" />
229
  </div>
230
 
231
+ Each model instance:
232
+ - Receives the same mixed audio input
233
+ - Injects speaker-specific kernels at the pre-encode layer
234
+ - Produces transcription output specific to its target speaker
235
+ - Operates independently and can run in parallel with other instances
236
+
237
+ This architecture enables the model to handle severe speech overlap by having each instance focus exclusively on one speaker, eliminating the permutation problem that affects other multitalker ASR approaches.
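+
+ The sketch below illustrates the multi-instance strategy in plain Python; `transcribe_for_speaker` is a hypothetical placeholder for a speaker-conditioned ASR call, not a NeMo API, and the inference script shown later in this card handles this orchestration in practice:
+
+ ```python3
+ from typing import Callable, Dict, List
+
+ def multitalker_transcribe(
+     audio_path: str,
+     speaker_activities: Dict[int, List[float]],
+     transcribe_for_speaker: Callable[[str, List[float]], str],
+ ) -> Dict[int, str]:
+     """Run one ASR pass per diarized speaker over the same mixed audio.
+
+     Each call receives the identical audio plus that speaker's activity track,
+     so every instance adapts to exactly one target speaker.
+     """
+     return {
+         spk: transcribe_for_speaker(audio_path, activity)
+         for spk, activity in speaker_activities.items()
+     }
+ ```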
238
 
239
 
240
 
241
  ## NVIDIA NeMo
242
 
243
+ To train, fine-tune, or perform multitalker ASR with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[7]. We recommend you install it after you've installed Cython and the latest PyTorch version.
244
 
245
  ```
246
  apt-get update && apt-get install -y libsndfile1 ffmpeg
 
250
 
251
  ## How to Use this Model
252
 
253
+ The model is available for use in the NeMo Framework[7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
254
 
255
+ **Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.
256
 
257
+ ```
258
+ # Running streaming multitalker Parakeet with streaming Sortformer
259
+ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
260
+ asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
261
+ diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
262
+ audio_file=example.wav \
263
+ max_num_of_spks=4 \
264
+ masked_asr=false \
265
+ parallel_speaker_strategy=true \
266
+ att_context_size=[70,13] \
267
+ output_path=./output.json \
268
+ print_path=./print_script.sh  # output_path: where to save the output seglst file
269
  ```
270
 
271
+ Alternatively, `audio_file` can be replaced with `manifest_file`:
272
  ```
273
+ # Running streaming multitalker Parakeet with streaming Sortformer
274
+ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
275
+ ... \
276
+ manifest_file=example.json \
277
+ ... \
278
  ```
279
+
280
+ where each line of the manifest file is a dictionary containing the following fields:
 
281
  ```
282
  {
283
  "audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
284
  "offset": 0, # offset (start) time of the input audio
 
291
  }
292
  ```
293
 
294
+ ### Single Speaker ASR
295
 
296
+ The model can also be used for single speaker ASR:
297
 
298
  ```
299
+ python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
300
+ asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
301
+ diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
302
+ audio_file=example.wav \
303
+ max_num_of_spks=1 \
304
+ single_speaker_mode=true \
305
+ masked_asr=false \
306
+ parallel_speaker_strategy=true \
307
+ att_context_size=[70,13] \
308
+ output_path=./output.json \
309
+ print_path=./print_script.sh  # output_path: where to save the output seglst file
310
  ```
311
 
312
+ ### Setting up Streaming Configuration
313
 
314
+ Latency is determined by `att_context_size`, measured in **80ms frames** (see the short calculation sketch after this list):
315
+ * [70, 0]: Chunk size = 1 (1 * 80ms = 0.08s)
316
+ * [70, 1]: Chunk size = 2 (2 * 80ms = 0.16s)
317
+ * [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
318
+ * [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
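+
+ As a quick check of the values above, the chunk size is the right context plus one frame, and the latency is the chunk size times 80ms; the helper below is only an illustrative calculation, not a NeMo utility:
+
+ ```python3
+ FRAME_SEC = 0.08  # one encoder frame = 80 ms
+
+ def chunk_and_latency(att_context_size):
+     """att_context_size = [left, right]; chunk size = right + 1 frames."""
+     left, right = att_context_size
+     chunk_frames = right + 1
+     return chunk_frames, round(chunk_frames * FRAME_SEC, 2)
+
+ print(chunk_and_latency([70, 13]))  # (14, 1.12) -> 14 frames, 1.12 s latency
+ ```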
319
 
320
 
321
+ <!-- ### Getting Transcription Results -->
322
+
323
+ <!-- The model requires speaker diarization information to perform speaker-wise ASR. You need to:
324
+
325
+ 1. **Obtain speaker diarization** (e.g., using Streaming Sortformer or similar diarization system)
326
+ 2. **Deploy one model instance per speaker** identified in the diarization output
327
+ 3. **Feed each instance** with:
328
+ - The same audio input
329
+ - Speaker-specific activity information for that speaker
330
+
331
+ ```python3
332
+ # Example: For a 2-speaker conversation
333
+ # Assuming you have diarization outputs for Speaker 0 and Speaker 1
334
+
335
+ # Create two model instances
336
+ model_spk0 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
337
+ model_spk1 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
338
 
339
+ # Get transcription for each speaker
340
+ # Each model instance uses speaker-specific kernels to adapt to its target speaker
341
+ results_spk0 = model_spk0.transcribe(audio=audio_input, speaker_id=0, diarization_info=diar_info)
342
+ results_spk1 = model_spk1.transcribe(audio=audio_input, speaker_id=1, diarization_info=diar_info)
343
 
344
+ # Combine results to get complete multitalker transcription
345
+ ``` -->
346
 
347
+ <!-- **Note**: The specific API for providing diarization information may vary. Please refer to the [NeMo documentation](https://github.com/NVIDIA/NeMo) for detailed usage instructions. -->
 
348
 
 
 
349
 
350
+ ### Input
351
 
352
+ This model accepts single-channel (mono) audio sampled at 16,000 Hz.
353
 
354
+ ### Output
355
 
356
+ The results will be found in `output_path`, which is in the SegLST format. For more information, please refer to the [SegLST](https://github.com/fgnt/meeteval?tab=readme-ov-file#segment-wise-long-form-speech-transcription-annotation-seglst) format description.
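+
+ As a minimal sketch of consuming this output (field names follow the SegLST convention linked above; the exact keys emitted by the script are assumed here):
+
+ ```python3
+ import json
+ from collections import defaultdict
+
+ # Group the transcribed words by speaker from the SegLST file written to `output_path`.
+ with open("output.json", "r", encoding="utf-8") as f:
+     segments = json.load(f)  # SegLST: a list of segment dicts
+
+ by_speaker = defaultdict(list)
+ for seg in segments:
+     # Typical SegLST fields: "session_id", "speaker", "words", "start_time", "end_time"
+     by_speaker[seg["speaker"]].append((seg.get("start_time", 0.0), seg["words"]))
+
+ for speaker, turns in sorted(by_speaker.items()):
+     print(speaker, " ".join(words for _, words in sorted(turns)))
+ ```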
357
 
358
  ## Datasets
359
 
360
+ This multitalker ASR model was trained on a large combination of real conversations and simulated audio mixtures.
361
+ The training data includes both single-speaker and multi-speaker recordings with corresponding transcriptions and speaker labels in [SegLST](https://github.com/fgnt/meeteval?tab=readme-ov-file#segment-wise-long-form-speech-transcription-annotation-seglst) format.
362
+ Data collection methods vary across individual datasets. The training datasets include phone calls, interviews, web videos, meeting recordings, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or individual dataset webpages for detailed data collection methods.
363
 
364
 
365
  ### Training Datasets (Real conversations)
366
+ - Granary (single speaker)
367
  - Fisher English (LDC)
368
+ - LibriSpeech
369
+ - AMI Corpus
370
+ - NOTSOFAR
371
  - ICSI
372
 
373
  ### Training Datasets (Used to simulate audio mixtures)
 
374
  - Librispeech
375
 
376
  ## Performance
 
378
 
379
  ### Evaluation data specifications
380
 
381
+ | **Dataset** | **Number of speakers** | **Number of Sessions** |
382
+ |-------------|------------------------|------------------------|
383
+ | **AMI IHM** | 3-4 | 219 |
384
+ | **AMI SDM** | 3-4 | 40 |
385
+ | **CH109** | 2 | 259 |
386
+ | **Mixer 6** | 2 | 148 |
387
 
388
 
389
+ ### Concatenated minimum-permutation Word Error Rate (cpWER)
390
 
391
  * All evaluations include overlapping speech.
392
  * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
393
+ * Post-Processing (PP) can be optimized on different held-out dataset splits to improve diarization performance.
394
 
395
+ | **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
396
+ |-------------|-------------|-------------|-----------|-------------|
397
+ | 0.08s | ----- | ----- | ----- | ----- |
398
+ | 0.16s | ----- | ----- | ----- | ----- |
399
+ | 0.56s | ----- | ----- | ----- | ----- |
400
+ | 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |
401
 
402
  ## References
 
403
 
404
+ [1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
405
+ W. Wang, T. Park, I. Medennikov, J. Wang, K. Dhawan, H. Huang, N. R. Koluguri, J. Balam, B. Ginsburg. *Proc. INTERSPEECH 2025*
406
+
407
+ [2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
408
 
409
+ [3] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
410
 
411
+ [4] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
412
 
413
+ [5] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
414
 
415
+ [6] [Attention is all you need](https://arxiv.org/abs/1706.03762)
416
 
417
+ [7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
418
 
419
+ [8] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
420