
Developers test Silero VAD and Pyannote SCD for CPU-based audio segmentation of a 14-second conversation

AI-Generated Summary

1 sources

3 hours ago


Key Points

Both tests decode a shared 14.171-second two-speaker MP3 into 16 kHz mono audio and process it with ONNX Runtime using CPUExecutionProvider.
Silero VAD, using a 0.5 start threshold, a 0.35 end threshold, and minimum-duration rules, outputs 12 separate speech segments totaling about 11.917 seconds of audio (84.1% of the input).
Speaker change detection based on Pyannote Segmentation 3.0 outputs 6 utterance segments that cover the entire 14.171 seconds with no gaps or overlaps.
Neither write-up is a complete speaker diarization pipeline: VAD does not identify speakers, and Pyannote's speaker indexes are not guaranteed to be persistent identities across window boundaries.
Each implementation extracts the detected intervals into separate 16-bit PCM, 16 kHz, mono WAV files and removes any existing output before running.

Two developer write-ups describe using ONNX Runtime on a CPU to split a short two-speaker, 14.171-second conversation into smaller audio segments. One approach uses Silero VAD (voice activity detection) to keep only periods classified as speech. The MP3 is decoded to a 16 kHz mono waveform, processed in 32 ms chunks, and run with CPUExecutionProvider. A segment starts when the speech probability reaches 0.5 and ends once the probability stays below 0.35 for at least 100 ms; segments shorter than 250 ms are discarded, and 30 ms of padding is added at both ends. With these settings the detector produces 12 speech intervals and saves them as separate WAV files. The combined speech segments total about 11.917 seconds of audio, or 84.1% of the input.

The second approach uses Pyannote Segmentation 3.0 in ONNX form to detect speaker changes rather than general speech presence. The model runs on 10-second windows of 16 kHz input. Each frame is labeled from per-speaker activity probabilities: a speaker is active at 0.5 or above, and a frame is labeled overlap when at least two speakers are active and the top two probabilities differ by no more than 0.1. Post-processing then assigns silent frames to adjacent speakers and merges speaker states shorter than 100 ms into neighboring segments. This results in 6 utterance-level WAV segments that cover the full input without gaps or overlaps. Both tests report real-time factors well below 1x (0.002x for Silero VAD and 0.001x for Pyannote) on a Mac Studio, measured for inference and segmentation only in a single run without warm-up. The Pyannote write-up also notes that its speaker indexes are not guaranteed to be consistent identities across longer recordings; the VAD write-up does not label speakers at all.

How Outlets Covered This Story


Dev.to

Detecting Speaker Changes with Pyannote Segmentation 3.0 and ONNX Runtime

Hello, everyone. When listening to a conversation, we naturally keep track of who is speaking. A program has a harder job: beyond finding speech, it must also determine where one speaker gives way to another. Today, I will use an ONNX version of Pyannote Segmentation 3.0 to detect speaker changes in a two-person conversation and split the recording into one WAV file per utterance.

What I Tested

This lab uses FFmpeg to decode a roughly 14-second conversation into a 16 kHz mono waveform. It then combines the Pyannote segmentation model with simple post-processing to produce contiguous speaker segments. I wanted to verify:

Whether six alternating utterances can be separated into six segments
Whether the detected speaker indexes remain consistent throughout the recording
Whether ONNX Runtime can process the audio faster than real time using only its CPU execution provider
Whether every segment can be saved as a separate WAV file

The complete code and reproducible environment are available in the pyannote-scd lab in kiarina/labs.

This test performs segmentation using the model's speaker indexes. It does not compare speaker embeddings or run clustering, so it is not a complete speaker diarization pipeline that identifies the same person throughout a long recording.

Reproducing the Lab

You will need:

mise
uv
FFmpeg
curl

The following commands fetch only this lab, download the shared test audio, and run it:

    git clone --depth 1 --filter=blob:none --sparse \
      https://github.com/kiarina/labs.git
    cd labs
    git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
      2026/07/04/pyannote-scd
    make download-test-assets
    mise -C 2026/07/04/pyannote-scd run

On the first run, the task downloads the full-precision onnx/model.onnx file from onnx-community/pyannote-segmentation-3.0 on Hugging Face. uv then prepares the Python dependencies and runs the detector.

How Speaker Segments Are Detected

The input is this shared test asset:

    assets/mp3/conversation_2speaker_14s_16k.mp3

The recording follows this scenario:

Speaker 1: Hello? Are you at the station already?
Speaker 2: Yeah. I just came through the ticket gate. How about you?
Speaker 1: I am still on the train. I think I will be there in about five minutes.
Speaker 2: Got it. I will wait in front of the café, then.
Speaker 1: Thanks. It is pretty cold today.
Speaker 2: Definitely. I am glad I brought my scarf.

FFmpeg decodes the MP3 into a 16 kHz mono waveform, which is passed to the model in 10-second windows. The final chunk is zero-padded to 10 seconds, and frames beyond the original audio duration are discarded after inference.

The detector uses these settings:

    Setting                          Value
    sample rate                      16 kHz
    inference window                 10 seconds
    model speakers                   3
    maximum simultaneous speakers    2
    active speaker threshold         0.5
    overlap margin                   0.1
    minimum speaker change           100 ms
    minimum speech segment           100 ms
    execution provider               CPUExecutionProvider

For every frame, the model outputs log probabilities for seven classes covering silence, individual speakers, and pairs of speakers. This powerset representation is converted into probabilities for three speaker indexes.

A speaker is considered active when its probability reaches 0.5. When at least two speakers are active and the difference between the top two probabilities is no more than 0.1, the frame is labeled overlap. This rule only indicates that the probabilities are close; it does not prove that the source contains overlapping speech.

Converting these probabilities directly into segments would create fragments around silence and brief fluctuations. The post-processing assigns silent frames to adjacent speakers and merges speaker states shorter than 100 ms into neighboring segments. One model frame is approximately 16.978 ms in this run, so 100 ms corresponds to about six frames.
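
The frame-labeling rule described above can be sketched as follows. This is a minimal illustration and not the lab's code: the class ordering in POWERSET, the function names, and the 1-based speaker_N naming are assumptions, and the real model's class order should be checked before reusing it.

```python
import math

# ASSUMPTION: the 7 powerset classes are ordered as silence, each single
# speaker, then each pair of speakers. Verify against the actual model.
POWERSET = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]


def speaker_probs(log_probs):
    """Marginalize 7 powerset log-probabilities into 3 per-speaker probabilities."""
    probs = [math.exp(lp) for lp in log_probs]
    per_speaker = [0.0, 0.0, 0.0]
    for p, members in zip(probs, POWERSET):
        for speaker in members:
            per_speaker[speaker] += p
    return per_speaker


def label_frame(log_probs, active=0.5, margin=0.1):
    """Label one frame as silence, overlap, or the most probable speaker."""
    probs = speaker_probs(log_probs)
    ranked = sorted(range(3), key=lambda s: probs[s], reverse=True)
    active_count = sum(1 for p in probs if p >= active)
    if active_count == 0:
        return "silence"
    # Overlap only means the top two probabilities are close, as noted above.
    if active_count >= 2 and probs[ranked[0]] - probs[ranked[1]] <= margin:
        return "overlap"
    return f"speaker_{ranked[0] + 1}"  # hypothetical 1-based naming
```

Each frame label depends only on that frame's probabilities, which is why a separate smoothing stage is still needed afterwards.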

Finally, each segment is saved under output/ as a 16-bit PCM, 16 kHz, mono WAV file. Existing output is removed before each run.

Results

On a Mac Studio, the detector found six speaker segments in the 14.171-second input:

    audio duration: 14.171s
    SCD elapsed: 0.019s
    SCD real-time factor: 0.001x
    frame duration: 16.978ms
    segments: 6
    001:  0.000s -  1.851s ( 1.851s) speaker_2
    002:  1.851s -  4.737s ( 2.886s) speaker_1
    003:  4.737s -  7.317s ( 2.581s) speaker_2
    004:  7.317s -  9.677s ( 2.360s) speaker_1
    005:  9.677s - 11.834s ( 2.156s) speaker_2
    006: 11.834s - 14.171s ( 2.337s) speaker_1

The six segments cover the complete 14.171-second input without gaps or overlaps. Together, the generated files contain 226,736 samples, and every file is 16-bit PCM, 16 kHz, and mono.

Listening to the files and comparing them with the script produced the following mapping:

    file                 model output   speech
    speaker_2_001.wav    speaker_2      もしもし、もう駅に着いた？ (Hello? Are you at the station already?)
    speaker_1_002.wav    speaker_1      うん。今、改札出たところ。そっちは？ (Yeah. I just came through the ticket gate. How about you?)
    speaker_2_003.wav    speaker_2      こっちはまだ電車。あと5分くらいかな。 (I am still on the train. I think I will be there in about five minutes.)
    speaker_1_004.wav    speaker_1      了解。じゃあ、カフェの前で待ってるね。 (Got it. I will wait in front of the café, then.)
    speaker_2_005.wav    speaker_2      助かる。今日は結構寒いね。 (Thanks. It is pretty cold today.)
    speaker_1_006.wav    speaker_1      ほんと、マフラー持ってきて正解だった。 (Definitely. I am glad I brought my scarf.)

Each file corresponds to one utterance in the script. None was split in the middle, and none contained speech from the adjacent speaker. The model's speaker_2 index corresponded to Speaker 1, while speaker_1 corresponded to Speaker 2, with the indexes alternating consistently across all six segments.
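
The coverage claim and the 226,736-sample total can be checked from the reported timestamps alone. The sketch below is an illustration written for this summary, not part of the lab; the helper names are hypothetical, and it works from the millisecond-rounded values printed above rather than from the WAV files.

```python
SAMPLE_RATE = 16_000
DURATION_S = 14.171

# (start_s, end_s, model speaker index) copied from the reported output.
SEGMENTS = [
    (0.000, 1.851, "speaker_2"),
    (1.851, 4.737, "speaker_1"),
    (4.737, 7.317, "speaker_2"),
    (7.317, 9.677, "speaker_1"),
    (9.677, 11.834, "speaker_2"),
    (11.834, 14.171, "speaker_1"),
]


def covers_without_gaps(segments, duration, tol=1e-6):
    """True when segments start at 0, abut each other, and end at duration."""
    cursor = 0.0
    for start, end, _ in segments:
        if abs(start - cursor) > tol or end <= start:
            return False
        cursor = end
    return abs(cursor - duration) <= tol


def total_samples(segments, rate):
    """Sum per-segment sample counts using rounded sample boundaries."""
    return sum(round(end * rate) - round(start * rate) for start, end, _ in segments)


def alternates(segments):
    """True when no two consecutive segments share a speaker index."""
    return all(a[2] != b[2] for a, b in zip(segments, segments[1:]))
```

Because the segments abut, the per-segment sample counts telescope to the sample count of the whole input, 14.171 s × 16,000 Hz.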

The verification environment was:

    machine: Mac Studio
    chip: Apple M4 Max
    memory: 128 GB
    OS: macOS 26.5.1 (25F80), arm64
    Python: 3.12.11
    ONNX Runtime: 1.27.0
    execution provider: CPUExecutionProvider

SCD elapsed measures only ONNX inference and speaker segment detection. It excludes model initialization, FFmpeg decoding, and WAV generation. This was a single run without a warm-up rather than a rigorous benchmark, but its real-time factor was 0.001x.

Interpreting the Results

For this recording, the six scripted utterances matched the six detected segments. Because silent frames were distributed between adjacent speakers instead of becoming separate files, the detector preserved the entire input while splitting it at speaker changes. That output should be convenient as preprocessing for transcription because each segment contains only one person's utterance.

Processing 14.171 seconds of audio in 0.019 seconds on a CPU is also encouraging. However, the measurement covers only inference and segment detection, and it is a single reference value. It does not represent end-to-end performance including file I/O, nor does it predict performance on another machine.

The speaker_1 and speaker_2 labels are not persistent identities. This implementation runs inference independently on 10-second windows and does not use speaker embeddings to match identities across them. The indexes happened to remain consistent across the 10-second boundary in this input, but that behavior is not guaranteed for other recordings.

The evaluation is also limited to one short, clean recording of two alternating speakers. It does not cover overlapping speech, conversations with three or more people, noise, or long recordings, and there are no timestamp-level ground-truth labels for a quantitative evaluation. The result establishes that this recording was split correctly under these settings, not that the approach will generalize unchanged to every conversation.
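
The smoothing stage that distributes silent frames and merges short speaker states can be sketched as follows. The write-up does not spell out the exact rules, so the details here are assumptions: each silent run is split at its midpoint between the neighboring speakers, a short run is merged into the preceding run (or the following one when it comes first), and all function names are hypothetical.

```python
import math
from itertools import groupby

FRAME_MS = 16.978
MIN_FRAMES = math.ceil(100 / FRAME_MS)  # 100 ms is about six frames


def fill_silence(labels, silence="silence"):
    """Assign every silent frame to the speaker before or after it."""
    out = list(labels)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != silence:
            i += 1
            continue
        j = i
        while j < n and out[j] == silence:
            j += 1
        prev_label = out[i - 1] if i > 0 else None
        next_label = out[j] if j < n else None
        if prev_label is None and next_label is None:
            return out  # the whole input is silent; nothing to assign
        if prev_label is None:
            split = i  # leading silence goes to the next speaker
        elif next_label is None:
            split = j  # trailing silence goes to the previous speaker
        else:
            split = i + (j - i + 1) // 2  # ASSUMPTION: split at the midpoint
        out[i:split] = [prev_label] * (split - i)
        out[split:j] = [next_label] * (j - split)
        i = j
    return out


def merge_short(labels, min_frames=MIN_FRAMES):
    """Collapse labels into [label, frame_count] runs, absorbing short runs."""
    runs = [[label, len(list(group))] for label, group in groupby(labels)]
    changed = True
    while changed and len(runs) > 1:
        changed = False
        for idx, (_, length) in enumerate(runs):
            if length < min_frames:
                target = idx - 1 if idx > 0 else idx + 1
                runs[target][1] += length
                del runs[idx]
                changed = True
                break
        joined = []  # re-join neighbors that now share a label
        for label, length in runs:
            if joined and joined[-1][0] == label:
                joined[-1][1] += length
            else:
                joined.append([label, length])
        runs = joined
    return runs
```

Multiplying each run's frame count by the frame duration gives segment boundaries, and because no frame is dropped, the resulting segments cover the whole input.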

My Takeaway

What I find interesting about the Pyannote segmentation model is that it goes one step beyond VAD. Instead of only answering whether somebody is speaking, it provides enough information to locate speaker changes. In this short conversation, a simple threshold and smoothing stage was enough to produce clean utterance-level files. Running comfortably on a CPU through ONNX Runtime also makes it appealing for local processing.

At the same time, six clean output files can make the system look like a finished diarization pipeline. It is not: cross-window speaker identity matching is still missing. That distinction will matter much more with longer recordings.

Next, I would like to evaluate the boundaries with overlapping speech and three or more speakers, then add speaker embeddings and clustering so that the same person can keep a consistent identity throughout a long recording.

2 hours ago


Dev.to

Extracting Speech Segments with Silero VAD and ONNX Runtime

Hello, everyone. Have you ever wanted to keep only the parts of a recording where someone is speaking? Finding silence before transcription can reduce downstream work and divide a long recording into more manageable pieces. Today, I will use the ONNX model from Silero VAD to detect speech in a roughly 14-second conversation and extract each segment as a WAV file.

What I Tested

This lab uses FFmpeg to decode an MP3 conversation between two speakers into a 16 kHz mono waveform. It then feeds the audio to Silero VAD in 32 ms chunks. I wanted to verify:

Whether ONNX Runtime can detect speech using only its CPU execution provider
How many segments are found in 14.171 seconds of audio
How long detection takes
Whether every detected segment can be saved as a separate WAV file

The complete code and reproducible environment are available in the silero-vad lab in kiarina/labs.

VAD stands for Voice Activity Detection. It determines whether speech is present, but it does not identify which of the two people is speaking. Speaker diarization is outside the scope of this test.

Reproducing the Lab

You will need:

mise
uv
FFmpeg
curl

The following commands fetch only this lab, download the shared test audio, and run it:

    git clone --depth 1 --filter=blob:none --sparse \
      https://github.com/kiarina/labs.git
    cd labs
    git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
      2026/07/03/silero-vad
    make download-test-assets
    mise -C 2026/07/03/silero-vad run

On the first run, the task downloads silero_vad.onnx from the official Silero VAD repository. uv then prepares the Python dependencies and runs the detector.

How Speech Segments Are Detected

The input is this shared test asset:

    assets/mp3/conversation_2speaker_14s_16k.mp3

The recording follows this scenario:

Speaker 1: Hello? Are you at the station already?
Speaker 2: Yeah. I just came through the ticket gate. How about you?
Speaker 1: I am still on the train. I think I will be there in about five minutes.
Speaker 2: Got it. I will wait in front of the café, then.
Speaker 1: Thanks. It is pretty cold today.
Speaker 2: Definitely. I am glad I brought my scarf.

After FFmpeg decodes the file, the waveform is divided into chunks of 512 samples, or 32 ms. Silero VAD returns a speech probability for each chunk.

The detector uses these settings:

    Setting               Value
    sample rate           16 kHz
    chunk size            512 samples (32 ms)
    speech threshold      0.5
    negative threshold    0.35
    minimum silence       100 ms
    minimum speech        250 ms
    speech padding        30 ms

The detector uses a threshold of 0.5 to start speech and a lower threshold of 0.35 to mark a possible end. This hysteresis prevents a probability fluctuating near one boundary from repeatedly opening and closing a segment.

A segment begins when the speech probability reaches 0.5. Once the probability falls below 0.35 for at least 100 ms, that position becomes the end. Segments shorter than 250 ms are discarded, and 30 ms of padding is added at both ends.

The chunks are not processed independently. The implementation carries the ONNX Runtime state and the previous 64 samples of context into the next chunk. This preserves temporal context while processing the input incrementally, as required for streaming.

Finally, FFmpeg extracts each detected interval from the original MP3 and saves it under output/ as a 16-bit PCM, 16 kHz, mono WAV file. Existing output is removed before each run.
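
The thresholding rules above can be sketched as a small state machine over the per-chunk probabilities. This is a simplified illustration under stated assumptions, not the lab's code or Silero's own post-processing: the function name is hypothetical, silence is timed from the start of the first low chunk to the end of the current chunk, and the model call that produces the probabilities is left out.

```python
CHUNK_S = 512 / 16_000  # one 512-sample chunk at 16 kHz is 32 ms


def detect_segments(probs, duration, threshold=0.5, neg_threshold=0.35,
                    min_silence=0.1, min_speech=0.25, pad=0.03):
    """Turn per-chunk speech probabilities into padded (start_s, end_s) pairs."""
    raw = []
    start = None          # start time of the currently open segment
    silence_start = None  # time at which the probability first dropped low
    for i, p in enumerate(probs):
        t = i * CHUNK_S
        if p >= threshold:
            silence_start = None  # speech resumed; cancel the pending end
            if start is None:
                start = t
        elif p < neg_threshold and start is not None:
            if silence_start is None:
                silence_start = t
            # ASSUMPTION: silence is measured up to the end of this chunk.
            if t + CHUNK_S - silence_start >= min_silence:
                raw.append((start, silence_start))
                start = None
                silence_start = None
        # Probabilities between the two thresholds leave the state unchanged.
    if start is not None:
        raw.append((start, duration))  # speech ran to the end of the audio
    kept = [(s, e) for s, e in raw if e - s >= min_speech]
    return [(max(0.0, s - pad), min(duration, e + pad)) for s, e in kept]
```

Because probabilities between 0.35 and 0.5 neither open nor close a segment, a brief dip inside an utterance does not split it, which is the hysteresis effect the write-up describes.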

Results

On a Mac Studio, the detector found 12 speech segments in the 14.171-second input:

    audio duration: 14.171s
    VAD elapsed: 0.028s
    VAD real-time factor: 0.002x
    speech segments: 12
    001:  0.162s -  1.726s ( 1.564s)
    002:  2.050s -  2.462s ( 0.412s)
    003:  2.626s -  3.934s ( 1.308s)
    004:  4.130s -  4.638s ( 0.508s)
    005:  4.930s -  6.014s ( 1.084s)
    006:  6.178s -  7.262s ( 1.084s)
    007:  7.458s -  7.998s ( 0.540s)
    008:  8.194s -  9.598s ( 1.404s)
    009:  9.794s - 10.430s ( 0.636s)
    010: 10.530s - 11.806s ( 1.276s)
    011: 11.970s - 12.510s ( 0.540s)
    012: 12.610s - 14.171s ( 1.561s)

The run created 12 files, from speech_001.wav through speech_012.wav. I also verified that every file is 16-bit PCM, 16 kHz, and mono. Together, the extracted segments contain about 11.917 seconds of audio, or 84.1% of the input.

Listening to the files and comparing them with the Japanese script produced the following mapping:

    file              speech
    speech_001.wav    もしもし、もう駅に着いた？ (Hello? Are you at the station already?)
    speech_002.wav    うん。 (Yeah.)
    speech_003.wav    今、改札出たところ。 (I just came through the ticket gate.)
    speech_004.wav    そっちは？ (How about you?)
    speech_005.wav    こっちはまだ電車。 (I am still on the train.)
    speech_006.wav    あと5分くらいかな。 (I think I will be there in about five minutes.)
    speech_007.wav    了解。 (Got it.)
    speech_008.wav    じゃあ、カフェの前で待ってるね。 (I will wait in front of the café, then.)
    speech_009.wav    助かる。 (Thanks.)
    speech_010.wav    今日は結構寒いね。 (It is pretty cold today.)
    speech_011.wav    ほんと、 (Definitely.)
    speech_012.wav    マフラー持ってきて正解だった。 (I am glad I brought my scarf.)

The recording was cleanly divided at natural pauses corresponding to periods and question marks. Only the final ほんと、 became a separate segment because of the short pause that followed it. Short responses such as “yeah,” “got it,” “thanks,” and “definitely” were preserved.
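
The summary figures follow directly from the reported timestamps. The short sketch below, written for this summary rather than taken from the lab, recomputes the total speech duration, its share of the input, and the real-time factor (elapsed time divided by audio duration); the variable names are illustrative.

```python
DURATION_S = 14.171
VAD_ELAPSED_S = 0.028

# (start_s, end_s) copied from the reported output.
SPEECH = [
    (0.162, 1.726), (2.050, 2.462), (2.626, 3.934), (4.130, 4.638),
    (4.930, 6.014), (6.178, 7.262), (7.458, 7.998), (8.194, 9.598),
    (9.794, 10.430), (10.530, 11.806), (11.970, 12.510), (12.610, 14.171),
]

total_speech = sum(end - start for start, end in SPEECH)
speech_share = total_speech / DURATION_S
real_time_factor = VAD_ELAPSED_S / DURATION_S

print(f"speech: {total_speech:.3f}s ({speech_share:.1%} of input)")
print(f"real-time factor: {real_time_factor:.3f}x")
```

A real-time factor below 1x means detection finished faster than the audio would take to play.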

The verification environment was:

    machine: Mac Studio (Mac16,9)
    chip: Apple M4 Max, 16 cores (12 performance + 4 efficiency)
    memory: 128 GB
    OS: macOS 26.5.1 (25F80), arm64
    Python: 3.12.11
    ONNX Runtime: 1.27.0
    execution provider: CPUExecutionProvider

VAD elapsed measures only Silero VAD inference and segment detection. It excludes model initialization, FFmpeg decoding, and WAV extraction. This was a single run without a warm-up rather than a rigorous benchmark, but its real-time factor was 0.002x.

Interpreting the Results

Under these conditions, the CPU processed 14.171 seconds of audio in 0.028 seconds. That leaves substantial headroom for both offline processing and applications that consume microphone input incrementally. Processing time will vary by machine, so this number should not be treated as a universal benchmark.

The implementation does not simply treat every 32 ms prediction as an independent segment. Separate start and end thresholds, a 100 ms silence requirement, minimum segment duration, and padding turn the probability sequence into intervals that are more useful downstream. In a real application, these segmentation rules can affect the result as much as model inference itself.

For this input, the detector followed the script's punctuation and natural conversational pauses without dropping any content. It is especially useful that short acknowledgments survived even with segments shorter than 250 ms being discarded.

The recording does not include timestamp-level ground-truth annotations, so I did not calculate precision or recall. The listening comparison against the script was successful, but that does not make these settings optimal for every recording. Data containing noise, music, whispers, long pauses, or overlapping speakers would require threshold tuning and comparison against labeled examples.

My Takeaway

Silero VAD feels like a practical small component to place before speech recognition. The ONNX model is about 2.3 MB, runs comfortably on a CPU, and exposes a straightforward core operation: pass in a chunk and receive a probability.

At the same time, VAD output is raw material for segmentation rather than a finished audio split. Whether an application should preserve short acknowledgments or combine speech into longer sentence-like pieces changes what values such as 100 ms and 250 ms should mean.

Next, I would like to compare how the intervals change across thresholds, recording environments, and artificially added noise. I am also interested in measuring how this lightweight preprocessing affects end-to-end speed and accuracy when combined with transcription and speaker diarization.

4 hours ago
