Update README.md (#1)
Update README.md (8e83774ce9f5496cb96a5af3649af2a92f9b6708)
Co-authored-by: Weiqing Wang <weiqingw4ng@users.noreply.huggingface.co>
README.md CHANGED
@@ -181,7 +181,7 @@ pipeline_tag: audio-classification
---


-#

<style>
img {
@@ -190,48 +190,57 @@ img {
</style>

[](#model-architecture)
-| [](#datasets) -->

-This model is a streaming

-<img src="figures/sortformer_intro.png" width="750" />
-</div>

-<div align="center">
-<img src="figures/aosc_3spk_example.gif" width="1400" />
-</div>
-<div align="center">
-<img src="figures/aosc_4spk_example.gif" width="1400" />
-</div>

## Model Architecture

<div align="center">
-<img src="figures/
</div>

<div align="center">
-<img src="figures/
</div>

## NVIDIA NeMo

-To train, fine-tune or perform

```
apt-get update && apt-get install -y libsndfile1 ffmpeg
@@ -241,39 +250,35 @@ pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]

## How to Use this Model

-The model is available for use in the NeMo Framework[

-```
-diar_model
```

-Input to Sortformer can be an individual audio file:
-```python3
-audio_input="/path/to/multispeaker_audio1.wav"
```

```
-audio_input="/path/to/multispeaker_manifest.json"
```
-where each line is a dictionary containing the following fields:
-```yaml
-# Example of a line in `multispeaker_manifest.json`
{
"audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
"offset": 0, # offset (start) time of the input audio
@@ -286,101 +291,86 @@ where each line is a dictionary containing the following fields:
}
```

-###

-* **CHUNK_SIZE**: The number of frames in a processing chunk.
-* **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
-* **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
-* **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
-* **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.

-Here are recommended configurations for different scenarios:
-| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
-| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
-| very high latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
-| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
-| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
-| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

-For clarity on the metrics used in the table:
-* **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT**. This value does not include computational processing time.
-* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.

-To set streaming configuration, use:
-```python3
-diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
-diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
-diar_model.sortformer_modules.fifo_len = FIFO_SIZE
-diar_model.sortformer_modules.spkcache_update_period = UPDATE_PERIOD
-diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
-diar_model.sortformer_modules._check_streaming_parameters()
-```

-### Getting Diarization Results
-To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
-```python3
-predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```

```

-This model accepts single-channel (mono) audio sampled at 16,000 Hz.
-- The actual input tensor is a Ns x 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal.
-- For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) will form a 160,000 x 1 matrix.

-###

-### Training

-Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90 second long training samples and batch size of 4.
-The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

-###

-###

-- It can detect a maximum of 4 speakers; performance degrades on recordings with 5 and more speakers.
-- While the model is designed for long-form audio and can handle recordings that are several hours long, performance may degrade on very long recordings.
-- The model was trained on publicly available speech datasets, primarily in English. As a result:
-* Performance may degrade on non-English speech.
-* Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.

## Datasets

-Data collection methods vary across individual datasets.


### Training Datasets (Real conversations)
- Fisher English (LDC)
--
--
- ICSI
-- AISHELL-4
-- Third DIHARD Challenge Development (LDC)
-- 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
-- DiPCo
-- AliMeeting

### Training Datasets (Used to simulate audio mixtures)
-- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech

## Performance
@@ -388,70 +378,43 @@ Data collection methods vary across individual datasets. For example, the above

### Evaluation data specifications

-| **Dataset**
-| **
-| **
-| **
-| **
-| **CALLHOME-part2 3spk** | 3 | 74 |
-| **CALLHOME-part2 4spk** | 4 | 20 |
-| **CALLHOME-part2 5spk** | 5 | 5 |
-| **CALLHOME-part2 6spk** | 6 | 3 |
-| **CALLHOME-part2 full** | 2-6 | 250 |
-| **CH109** | 2 | 109 |


-###

* All evaluations include overlapping speech.
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
-* Post-Processing (PP)
-- [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
-- [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109

-| **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
-|-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
-| 30.4s | no | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
-| 30.4s | yes | 13.45 | 41.40 | 18.85 | 5.34 | 9.22 | 11.29 | 18.84 | 27.29 | 9.54 | 4.61 |
-| 10.0s | no | 14.90 | 41.06 | 19.96 | 6.96 | 11.05 | 12.93 | 20.47 | 28.10 | 11.21 | 5.28 |
-| 10.0s | yes | 13.75 | 41.41 | 19.10 | 6.05 | 9.88 | 11.72 | 19.66 | 27.37 | 10.15 | 4.80 |
-| 1.04s | no | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
-| 1.04s | yes | 13.24 | 42.56 | 18.91 | 6.57 | 10.05 | 12.44 | 21.68 | 28.74 | 10.70 | 4.88 |
-| 0.32s | no | 14.64 | 43.47 | 20.19 | 8.63 | 12.91 | 16.19 | 29.40 | 30.60 | 13.57 | 6.46 |
-| 0.32s | yes | 13.44 | 43.73 | 19.28 | 6.91 | 10.45 | 13.70 | 27.04 | 28.58 | 11.38 | 5.27 |

-## NVIDIA Riva: Deployment

-Streaming Sortformer is deployed via NVIDIA RIVA ASR - [Speech Recognition with Speaker Diarization](https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#speech-recognition-with-speaker-diarization)

-[NVIDIA Riva](https://developer.nvidia.com/riva), is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
-Additionally, Riva provides:

-* World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
-* Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
-* Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support

-For more information on NVIDIA RIVA, see the [list of supported models](https://huggingface.co/models?other=Riva) is here.
-Also check out the [Riva live demo](https://developer.nvidia.com/riva#demos).


## References
-[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

-[

-[3] [

-[4] [Fast Conformer

-[5] [Attention

-[6] [

-[7] [NeMo

-License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
---


+# Multitalker Parakeet Streaming 0.6B v1

<style>
img {
</style>

[](#model-architecture)
+| [](#model-architecture)
<!-- | [](#datasets) -->

+This model is a streaming multitalker ASR model based on the Parakeet architecture. It takes only speaker diarization outputs as external information, eliminating the need for explicit speaker queries or enrollment audio [[Wang et al., 2025]](https://arxiv.org/abs/2506.22646). Unlike conventional target-speaker ASR approaches that require speaker embeddings, this model dynamically adapts to individual speakers through speaker-wise speech activity prediction.

+The key innovation is the injection of learnable **speaker kernels** into the pre-encode layer of the Fast-Conformer encoder. These speaker kernels are generated from speaker supervision activations, enabling instantaneous adaptation to target speakers. This approach leverages the inherent tendency of streaming ASR systems to prioritize specific speakers, repurposing this mechanism to achieve robust speaker-focused recognition.

+The model architecture requires deploying **one model instance per speaker**, so the number of model instances matches the number of speakers in the conversation. While this requires additional computational resources, it achieves state-of-the-art performance on fully overlapped speech in both offline and streaming scenarios.

+## Key Advantages

+This self-speaker adaptation approach offers several advantages over traditional multitalker ASR methods:

+1. **No Speaker Enrollment**: Unlike target-speaker ASR systems that require pre-enrollment audio or speaker embeddings, this model only needs speaker activity information from diarization.
+2. **Handles Severe Overlap**: Each instance focuses on a single speaker, enabling accurate transcription even during fully overlapped speech.
+3. **Streaming Capable**: Designed for real-time streaming scenarios with configurable latency-accuracy tradeoffs.
+4. **Leverages Single-Speaker Models**: Can be fine-tuned from strong pre-trained single-speaker ASR models, and single-speaker ASR performance is preserved.

## Model Architecture

+### Speaker Kernel Injection

+The streaming multitalker Parakeet model employs a **speaker kernel injection** mechanism in selected layers of the Fast-Conformer encoder. As shown in the figure below, learnable speaker kernels are injected into selected encoder layers, enabling the model to dynamically adapt to specific speakers.

<div align="center">
+<img src="figures/speaker_injection.png" width="750" />
</div>

+The speaker kernels are generated from speaker supervision activations that detect speech activity for each target speaker. This makes the encoder states more responsive to the target speaker's speech characteristics, even during periods of fully overlapped speech.
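The exact mechanism lives in the NeMo implementation; the snippet below is only a conceptual sketch of the idea described above, with every name (`inject_speaker_kernel`, `kernel_proj`) invented for illustration: the target speaker's frame-level activity gates the pre-encode features, the gated features are pooled and projected into a speaker kernel, and the kernel biases the encoder input toward that speaker.

```python3
import torch

# Conceptual sketch only -- not the actual NeMo module API.
def inject_speaker_kernel(pre_encode_feats: torch.Tensor,   # [B, T, D] pre-encode layer output
                          speaker_activity: torch.Tensor,   # [B, T] target-speaker activity from diarization
                          kernel_proj: torch.nn.Linear):    # learnable projection producing the kernel
    # Keep only the frames where the target speaker is active
    gated = pre_encode_feats * speaker_activity.unsqueeze(-1)
    # Pool the gated frames and project them into a speaker kernel
    kernel = kernel_proj(gated.mean(dim=1))                  # [B, D]
    # Bias every frame of the encoder input toward the target speaker
    return pre_encode_feats + kernel.unsqueeze(1)            # [B, T, D]
```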

+### Multi-Instance Architecture

+The model is based on the Parakeet architecture and consists of a [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[4] encoder, which in turn is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[5] architecture. The key architectural innovation is the **multi-instance approach**, where one model instance is deployed per speaker, as illustrated below:

<div align="center">
+<img src="figures/multi_instance.png" width="1400" />
</div>

+Each model instance:
+- Receives the same mixed audio input
+- Injects speaker-specific kernels at the pre-encode layer
+- Produces transcription output specific to its target speaker
+- Operates independently and can run in parallel with other instances

+This architecture enables the model to handle severe speech overlap by having each instance focus exclusively on one speaker, eliminating the permutation problem that affects other multitalker ASR approaches.
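Because the instances are independent, the per-speaker fan-out can be expressed as ordinary parallel execution. The sketch below is purely illustrative (`transcribe_for_speaker` is a placeholder, not a NeMo API); in practice the bundled inference script handles this for you.

```python3
from concurrent.futures import ThreadPoolExecutor

def transcribe_for_speaker(mixed_audio, speaker_activity):
    """Placeholder for 'run one model instance adapted to this speaker'."""
    ...

def multitalker_transcribe(mixed_audio, per_speaker_activity):
    # One instance per speaker; every instance sees the same mixed audio
    # plus that speaker's activity track from the diarizer.
    with ThreadPoolExecutor(max_workers=len(per_speaker_activity)) as pool:
        futures = [pool.submit(transcribe_for_speaker, mixed_audio, activity)
                   for activity in per_speaker_activity]
        return [f.result() for f in futures]
```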



## NVIDIA NeMo

+To train, fine-tune or perform multitalker ASR with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[7]. We recommend you install it after you have installed Cython and the latest version of PyTorch.

```
apt-get update && apt-get install -y libsndfile1 ffmpeg

## How to Use this Model

+The model is available for use in the NeMo Framework[7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

+**Important**: This model uses a multi-instance architecture, so you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.

+```
+# Running streaming multitalker Parakeet with streaming Sortformer
+# output_path: where to save the output SegLST file
+python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
+    asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
+    diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
+    audio_file=example.wav \
+    max_num_of_spks=4 \
+    masked_asr=false \
+    parallel_speaker_strategy=true \
+    att_context_size=[70,13] \
+    output_path=./output.json \
+    print_path=./print_script.sh
```

+Alternatively, `audio_file` can be replaced with `manifest_file`:

```
+# Running streaming multitalker Parakeet with streaming Sortformer
+python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
+    ... \
+    manifest_file=example.json \
+    ... \
```

+Each line of the manifest file is a dictionary containing the following fields:

```
{
"audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
"offset": 0, # offset (start) time of the input audio
...
}
```

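If you need to build such a manifest programmatically, a minimal sketch is shown below; the paths are placeholders, and only the fields visible in the excerpt above are filled in.

```python3
import json

# Write one JSON object per line (JSON-lines manifest); the file name is illustrative.
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 0},
]
with open("example.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```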
+### Single Speaker ASR

+The model can also be used for single-speaker ASR:

```
+# output_path: where to save the output SegLST file
+python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
+    asr_model=nvidia/multitalker-parakeet-streaming-0.6b-v1 \
+    diar_model=nvidia/diar_streaming_sortformer_4spk-v2 \
+    audio_file=example.wav \
+    max_num_of_spks=1 \
+    single_speaker_mode=true \
+    masked_asr=false \
+    parallel_speaker_strategy=true \
+    att_context_size=[70,13] \
+    output_path=./output.json \
+    print_path=./print_script.sh
```

+### Setting up Streaming Configuration

+Latency is determined by `att_context_size`, measured in **80 ms frames** (see the short example after this list):
+* [70, 0]: Chunk size = 1 (1 * 80ms = 0.08s)
+* [70, 1]: Chunk size = 2 (2 * 80ms = 0.16s)
+* [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
+* [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)

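Equivalently, the chunk (input-buffer) latency follows directly from the right-context value; a minimal sketch of the relationship implied by the list above:

```python3
# att_context_size = [left_context, right_context]; chunk size = right_context + 1 frames,
# with 80 ms per frame.
def chunk_latency_seconds(att_context_size, frame_ms=80):
    left_context, right_context = att_context_size
    return (right_context + 1) * frame_ms / 1000.0

print(chunk_latency_seconds([70, 13]))  # 1.12
```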

+<!-- ### Getting Transcription Results -->

+<!-- The model requires speaker diarization information to perform speaker-wise ASR. You need to:

+1. **Obtain speaker diarization** (e.g., using Streaming Sortformer or a similar diarization system)
+2. **Deploy one model instance per speaker** identified in the diarization output
+3. **Feed each instance** with:
+   - The same audio input
+   - Speaker-specific activity information for that speaker

+```python3
+# Example: For a 2-speaker conversation
+# Assuming you have diarization outputs for Speaker 0 and Speaker 1

+# Create two model instances
+model_spk0 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
+model_spk1 = EncDecMultiTaskModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")

+# Get transcription for each speaker
+# Each model instance uses speaker-specific kernels to adapt to its target speaker
+results_spk0 = model_spk0.transcribe(audio=audio_input, speaker_id=0, diarization_info=diar_info)
+results_spk1 = model_spk1.transcribe(audio=audio_input, speaker_id=1, diarization_info=diar_info)

+# Combine results to get complete multitalker transcription
+``` -->

+<!-- **Note**: The specific API for providing diarization information may vary. Please refer to the [NeMo documentation](https://github.com/NVIDIA/NeMo) for detailed usage instructions. -->


+### Input

+This model accepts single-channel (mono) audio sampled at 16,000 Hz.

+### Output

+The results will be found in `output_path`, which is in the SegLST format. For more information, please refer to the [SegLST](https://github.com/fgnt/meeteval?tab=readme-ov-file#segment-wise-long-form-speech-transcription-annotation-seglst) documentation.

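As an illustration, the per-speaker segments can be read back with standard Python. The file name below matches the `output_path` used in the commands above; SegLST segments typically carry `session_id`, `speaker`, `start_time`, `end_time`, and `words`, though the exact set of fields written by the script may differ.

```python3
import json

# Read the SegLST output (a JSON list of segment dictionaries).
with open("output.json") as f:
    segments = json.load(f)

for seg in sorted(segments, key=lambda s: s["start_time"]):
    print(f'{seg["start_time"]:.2f}-{seg["end_time"]:.2f} {seg["speaker"]}: {seg["words"]}')
```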
## Datasets

+This multitalker ASR model was trained on a large collection of real conversations and simulated audio mixtures.
+The training data includes both single-speaker and multi-speaker recordings with corresponding transcriptions and speaker labels in the [SegLST](https://github.com/fgnt/meeteval?tab=readme-ov-file#segment-wise-long-form-speech-transcription-annotation-seglst) format.
+Data collection methods vary across individual datasets. The training datasets include phone calls, interviews, web videos, meeting recordings, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or the individual dataset webpages for detailed data collection methods.


### Training Datasets (Real conversations)
+- Granary (single speaker)
- Fisher English (LDC)
+- LibriSpeech
+- AMI Corpus
+- NOTSOFAR
- ICSI

### Training Datasets (Used to simulate audio mixtures)
- Librispeech

## Performance

### Evaluation data specifications

+| **Dataset** | **Number of speakers** | **Number of sessions** |
+|-------------|------------------------|------------------------|
+| **AMI IHM** | 3-4 | 219 |
+| **AMI SDM** | 3-4 | 40 |
+| **CH109** | 2 | 259 |
+| **Mixer 6** | 2 | 148 |


+### Concatenated minimum-permutation Word Error Rate (cpWER)

* All evaluations include overlapping speech.
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
+* Post-Processing (PP) can be optimized on different held-out dataset splits to improve diarization performance.

+| **Latency** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
+|-------------|-------------|-------------|-----------|-------------|
+| 0.08s | ----- | ----- | ----- | ----- |
+| 0.16s | ----- | ----- | ----- | ----- |
+| 0.56s | ----- | ----- | ----- | ----- |
+| 1.12s | 21.26 | 37.44 | 15.81 | 23.81 |

## References

+[1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
+W. Wang, T. Park, I. Medennikov, J. Wang, K. Dhawan, H. Huang, N. R. Koluguri, J. Balam, B. Ginsburg. *Proc. INTERSPEECH 2025*

+[2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

+[3] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)

+[4] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)

+[5] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

+[6] [Attention is all you need](https://arxiv.org/abs/1706.03762)

+[7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

+[8] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)