# CPCConformerTransfomerSpeakerDiarizationModel-en-2spk
This is a speaker diarization model for two-speaker English audio.
## Model Details

### Model Description
This is a speaker diarization model capable of online (streaming) processing for two-speaker English audio. The architecture is based on BW-EDA-EEND, incorporating a Conformer and CPC (Contrastive Predictive Coding).
- Developed by: mocomoco inc.
- Language(s) (NLP): en (English)
- License: apache-2.0
- Finetuned from model: This model is trained from scratch, but uses a pre-trained CPC model as the feature extractor.
### Model Sources
- Repository: CPCConformerTransfomerSpeakerDiarizationModel
### Note on the Conference Presentation
Please note that some improvements have been made for this public release, so the implementation and performance may not exactly match what was presented at the conference.
## Uses

### Direct Use
Please also refer to the examples in CPCConformerTransfomerSpeakerDiarizationModel/examples.
```python
import torch
import torchaudio
from cpc_streaming_diarization.model import CPCStreamingDiarizationModel
from cpc_streaming_diarization.utils import get_default_device

device = get_default_device()

# Load the model
model = CPCStreamingDiarizationModel.from_pretrained(
    "mocomoco-inc/SpeakerDiarizationModel-en-2spk"
)
model = model.to(device)
model.eval()

# Load a wav file
wav, sr = torchaudio.load("example.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(dim=0, keepdim=True)  # Convert to mono
wav = wav.to(device)

# Inference
with torch.no_grad():
    preds = model(wav, block_size_sec=10.0, hop_size_sec=1.0)

# Post-processing
preds = model.postprocess(preds, min_duration_on=0.3, min_duration_off=0.2)
threshold = 0.5
preds = (preds > threshold).int()
print(preds)

# Output example
#
# 0 represents non-speech and 1 represents speech for the corresponding speaker.
# In the example below, with two speakers, neither is speaking at the beginning,
# and at the end, one speaker is talking.
#
# tensor([[[0, 0],
#          [0, 0],
#          [0, 0],
#          ...,
#          [0, 1],
#          [0, 1],
#          [0, 0]]], dtype=torch.int32)
```
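The frame-level output can be converted into speaker turns. The sketch below continues from the snippet above (it reuses `preds`); `FRAME_HOP_SEC` is a hypothetical placeholder, not a documented value, so check the repository for the model's actual output frame rate, which depends on the CPC downsampling.

```python
# Hypothetical frame duration in seconds; NOT a documented value.
FRAME_HOP_SEC = 0.1

def frames_to_segments(binary, hop_sec=FRAME_HOP_SEC):
    """Convert (frames, speakers) 0/1 predictions into (start, end, speaker) tuples."""
    segments = []
    frames, n_spk = binary.shape
    for spk in range(n_spk):
        active = binary[:, spk].tolist()
        start = None
        for i, a in enumerate(active + [0]):  # trailing 0 closes an open segment
            if a and start is None:
                start = i
            elif not a and start is not None:
                segments.append((start * hop_sec, i * hop_sec, spk))
                start = None
    return sorted(segments)

# preds has shape (1, frames, 2) as in the output example above
for start, end, spk in frames_to_segments(preds[0]):
    print(f"speaker {spk}: {start:.1f}s - {end:.1f}s")
```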
### Environment
This model has been tested and confirmed to work in the following environment:
- OS: Ubuntu
- GPU: NVIDIA GPU with CUDA support
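As a quick sanity check, the snippet below selects CUDA when available; note that the model card only confirms the CUDA configuration above, so CPU fallback is an untested assumption.

```python
import torch

# Prefer CUDA; falling back to CPU is an assumption not covered by the tested environment.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
```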
### Out-of-Scope Use
Performance is not guaranteed for the following uses:
- Audio in languages other than English
- Audio with three or more speakers
## Bias, Risks, and Limitations
The model may not perform well on audio with characteristics different from the training data.
For speakers of the same gender with similar voice qualities, speaker label confusion may occur more frequently.
### Recommendations
When using this model, it is recommended to understand the above risks and limitations and to conduct a thorough performance evaluation for your specific use case. If you intend to use the model for languages other than English, fine-tuning on a dataset of the target language is recommended.
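EEND-style diarization models are commonly fine-tuned with a permutation-invariant BCE objective, since the speaker order in the labels is arbitrary. The function below is a minimal sketch of such a loss under that assumption; it is not the repository's actual training code, and the tensor shapes are illustrative.

```python
import itertools

import torch
import torch.nn.functional as F

def pit_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant BCE; logits and labels are (batch, frames, speakers)."""
    n_spk = logits.shape[-1]
    per_perm = []
    for perm in itertools.permutations(range(n_spk)):
        loss = F.binary_cross_entropy_with_logits(
            logits[..., list(perm)], labels, reduction="none"
        ).mean(dim=(1, 2))  # per-sample loss for this speaker ordering
        per_perm.append(loss)
    # Take the best ordering per sample, then average over the batch
    return torch.stack(per_perm, dim=0).min(dim=0).values.mean()
```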
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics
The evaluation metric used is DER (Diarization Error Rate).
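For reference, DER is the total duration of false alarm, missed detection, and speaker confusion errors, normalized by the total reference speech time:

$$\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{Miss}} + T_{\mathrm{Conf}}}{T_{\mathrm{speech}}}$$

Accordingly, each DER in the table below is the sum of its three component rates (up to rounding).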
### Results
This model achieves a lower (better) DER on the test dataset than pyannote, a widely used baseline.
| Metric | pyannote | this model |
|---|---|---|
| False Alarm | 10.230 | 4.351 |
| Missed Detection | 2.998 | 4.547 |
| Confusion | 2.783 | 3.467 |
| DER | 16.011 | 12.364 |
## Technical Specifications

### Model Architecture and Objective
The model employs an end-to-end architecture based on BW-EDA-EEND and consists of the following components (a structural sketch follows the list):
- Feature Extraction: Downsamples features from a pre-trained CPC model.
- Encoder: Consists of a 4-layer Conformer.
- Decoder: Consists of a 1-layer Transformer.
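The sketch below illustrates how these three components could be wired together in PyTorch. It is a structural illustration only: the layer sizes, downsampling stride, and decoder wiring are assumptions, not the released model's exact implementation, and the pre-trained CPC feature extractor is replaced by a plain pooling stand-in.

```python
import torch
import torch.nn as nn
import torchaudio

class DiarizationSketch(nn.Module):
    """Structural sketch of the described architecture; hyperparameters are assumptions."""

    def __init__(self, feat_dim: int = 256, n_spk: int = 2):
        super().__init__()
        # Feature extraction: stands in for downsampling pre-trained CPC features
        # (hypothetical stride of 4).
        self.downsample = nn.AvgPool1d(kernel_size=4, stride=4)
        # Encoder: 4-layer Conformer, per the description above.
        self.encoder = torchaudio.models.Conformer(
            input_dim=feat_dim,
            num_heads=4,
            ffn_dim=1024,
            num_layers=4,
            depthwise_conv_kernel_size=31,
        )
        # Decoder: stand-in for the 1-layer Transformer producing per-speaker logits.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(feat_dim, n_spk)

    def forward(self, cpc_feats: torch.Tensor) -> torch.Tensor:
        # cpc_feats: (batch, frames, feat_dim) features from a pre-trained CPC model
        x = self.downsample(cpc_feats.transpose(1, 2)).transpose(1, 2)
        lengths = torch.full((x.size(0),), x.size(1), device=x.device)
        x, _ = self.encoder(x, lengths)
        x = self.decoder(x)
        return self.head(x)  # (batch, frames', n_spk) speech-activity logits
```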
## Contact
For any inquiries, please contact us at:
mocomoco inc. Inada Bldg. 302, 7-20-19 Roppongi,
Minato-ku, Tokyo 106-0032, Japan
contact@mocomoco.ai