# CPCConformerTransfomerSpeakerDiarizationModel-en-2spk
This is a speaker diarization model for two-speaker English audio.
## Model Details

### Model Description
This is a speaker diarization model capable of online (streaming) processing for two-speaker English audio. The architecture is based on BW-EDA-EEND, incorporating a Conformer and CPC (Contrastive Predictive Coding).
- Developed by: mocomoco inc.
- Language(s) (NLP): en (English)
- License: apache-2.0
- Finetuned from model: This model is trained from scratch, but uses a pre-trained CPC model as the feature extractor.
### Model Sources
- Repository: CPCConformerTransfomerSpeakerDiarizationModel
### Note on the Conference Presentation
Please note that some improvements have been made for this public release, so the implementation and performance may not exactly match what was presented at the conference.
## Uses

### Direct Use
Please also refer to the examples in CPCConformerTransfomerSpeakerDiarizationModel/examples.
```python
import torch
import torchaudio
from cpc_streaming_diarization.model import CPCStreamingDiarizationModel
from cpc_streaming_diarization.utils import get_default_device

device = get_default_device()

# Load the model
model = CPCStreamingDiarizationModel.from_pretrained(
    "mocomoco-inc/SpeakerDiarizationModel-en-2spk"
)
model = model.to(device)
model.eval()

# Load a wav file
wav, sr = torchaudio.load("example.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(dim=0, keepdim=True)  # Convert to mono
wav = wav.to(device)

# Inference
with torch.no_grad():
    preds = model(wav, block_size_sec=10.0, hop_size_sec=1.0)

# Post-processing
preds = model.postprocess(preds, min_duration_on=0.3, min_duration_off=0.2)
threshold = 0.5
preds = (preds > threshold).int()
print(preds)

# Output example
#
# 0 represents non-speech and 1 represents speech for the corresponding speaker.
# In the example below, with two speakers, neither is speaking at the beginning,
# and at the end, one speaker is talking.
#
# tensor([[[0, 0],
#          [0, 0],
#          [0, 0],
#          ...,
#          [0, 1],
#          [0, 1],
#          [0, 0]]], dtype=torch.int32)
```
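The frame-level output can be converted into speaker turns. The sketch below continues from the snippet above (it reuses `preds`); `FRAME_HOP_SEC` is a hypothetical placeholder, not a documented value, so check the repository for the model's actual output frame rate, which depends on the CPC downsampling.

```python
# Hypothetical frame duration in seconds; NOT a documented value.
FRAME_HOP_SEC = 0.1

def frames_to_segments(binary, hop_sec=FRAME_HOP_SEC):
    """Convert (frames, speakers) 0/1 predictions into (start, end, speaker) tuples."""
    segments = []
    frames, n_spk = binary.shape
    for spk in range(n_spk):
        active = binary[:, spk].tolist()
        start = None
        for i, a in enumerate(active + [0]):  # trailing 0 closes an open segment
            if a and start is None:
                start = i
            elif not a and start is not None:
                segments.append((start * hop_sec, i * hop_sec, spk))
                start = None
    return sorted(segments)

# preds has shape (1, frames, 2) as in the output example above
for start, end, spk in frames_to_segments(preds[0]):
    print(f"speaker {spk}: {start:.1f}s - {end:.1f}s")
```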
### Environment
This model has been tested and confirmed to work in the following environment:
- OS: Ubuntu
- GPU: NVIDIA GPU with CUDA support
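As a quick sanity check, the snippet below selects CUDA when available; note that the model card only confirms the CUDA configuration above, so CPU fallback is an untested assumption.

```python
import torch

# Prefer CUDA; falling back to CPU is an assumption not covered by the tested environment.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
```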
### Out-of-Scope Use
Performance is not guaranteed for the following uses:
- Audio in languages other than English
- Audio with three or more speakers
## Bias, Risks, and Limitations
The model may not perform well on audio with characteristics different from the training data.
For speakers of the same gender with similar voice qualities, speaker label confusion may occur more frequently.
### Recommendations
When using this model, it is recommended to understand the above risks and limitations and to conduct a thorough performance evaluation for your specific use case. If you intend to use the model for languages other than English, fine-tuning on a dataset of the target language is recommended.
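EEND-style diarization models are commonly fine-tuned with a permutation-invariant BCE objective, since the speaker order in the labels is arbitrary. The function below is a minimal sketch of such a loss under that assumption; it is not the repository's actual training code, and the tensor shapes are illustrative.

```python
import itertools

import torch
import torch.nn.functional as F

def pit_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant BCE; logits and labels are (batch, frames, speakers)."""
    n_spk = logits.shape[-1]
    per_perm = []
    for perm in itertools.permutations(range(n_spk)):
        loss = F.binary_cross_entropy_with_logits(
            logits[..., list(perm)], labels, reduction="none"
        ).mean(dim=(1, 2))  # per-sample loss for this speaker ordering
        per_perm.append(loss)
    # Take the best ordering per sample, then average over the batch
    return torch.stack(per_perm, dim=0).min(dim=0).values.mean()
```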
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics
The evaluation metric used is DER (Diarization Error Rate).
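For reference, DER is the total duration of false alarm, missed detection, and speaker confusion errors, normalized by the total reference speech time:

$$\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{Miss}} + T_{\mathrm{Conf}}}{T_{\mathrm{speech}}}$$

Accordingly, each DER in the table below is the sum of its three component rates (up to rounding).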
### Results
This model achieves a lower (better) DER on the test dataset than pyannote, a widely used baseline.
| Metric | pyannote | this model |
|---|---|---|
| False Alarm | 10.230 | 4.351 |
| Missed Detection | 2.998 | 4.547 |
| Confusion | 2.783 | 3.467 |
| DER | 16.011 | 12.364 |
## Technical Specifications

### Model Architecture and Objective
The model employs an end-to-end architecture based on BW-EDA-EEND and consists of the following components (a structural sketch follows the list):
- Feature Extraction: Downsamples features from a pre-trained CPC model.
- Encoder: Consists of a 4-layer Conformer.
- Decoder: Consists of a 1-layer Transformer.
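The sketch below illustrates how these three components could be wired together in PyTorch. It is a structural illustration only: the layer sizes, downsampling stride, and decoder wiring are assumptions, not the released model's exact implementation, and the pre-trained CPC feature extractor is replaced by a plain pooling stand-in.

```python
import torch
import torch.nn as nn
import torchaudio

class DiarizationSketch(nn.Module):
    """Structural sketch of the described architecture; hyperparameters are assumptions."""

    def __init__(self, feat_dim: int = 256, n_spk: int = 2):
        super().__init__()
        # Feature extraction: stands in for downsampling pre-trained CPC features
        # (hypothetical stride of 4).
        self.downsample = nn.AvgPool1d(kernel_size=4, stride=4)
        # Encoder: 4-layer Conformer, per the description above.
        self.encoder = torchaudio.models.Conformer(
            input_dim=feat_dim,
            num_heads=4,
            ffn_dim=1024,
            num_layers=4,
            depthwise_conv_kernel_size=31,
        )
        # Decoder: stand-in for the 1-layer Transformer producing per-speaker logits.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(feat_dim, n_spk)

    def forward(self, cpc_feats: torch.Tensor) -> torch.Tensor:
        # cpc_feats: (batch, frames, feat_dim) features from a pre-trained CPC model
        x = self.downsample(cpc_feats.transpose(1, 2)).transpose(1, 2)
        lengths = torch.full((x.size(0),), x.size(1), device=x.device)
        x, _ = self.encoder(x, lengths)
        x = self.decoder(x)
        return self.head(x)  # (batch, frames', n_spk) speech-activity logits
```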
## Contact
For any inquiries, please contact us at:
mocomoco inc. Inada Bldg. 302, 7-20-19 Roppongi,
Minato-ku, Tokyo 106-0032, Japan
contact@mocomoco.ai